JP2005516557A

JP2005516557A - Video conferencing system and operation method

Info

Publication number: JP2005516557A
Application number: JP2003565169A
Authority: JP
Inventors: ラレット，アーサー
Original assignee: Motorola Inc
Current assignee: Motorola Solutions Inc
Priority date: 2002-01-30
Filing date: 2002-12-16
Publication date: 2005-06-02
Also published as: GB0202101D0; FI20041039A; GB2384932A; KR20040079973A; HK1058450A1; WO2003065720A1; GB2384932B; CN1618233A

Abstract

A method of relaying video images in a multimedia video conference between a plurality of multimedia user devices (550, 560, 570, 580) includes the step of a number of the multimedia user devices transmitting layered video images, each layered video image comprising a base layer (552, 562, 572, 582) and one or more enhancement layers (555, 565, 575, 585). The transmitted layered video images are received at a multipoint control unit (520), where a number of base layer video images of a number of active speakers (535) and one or more enhancement layers (540) of the most active speaker are selected. The multipoint control unit (520) transmits the selected base layer video images and the one or more enhancement layers (540) of the most active speaker to one or more of the plurality of multimedia user devices (550, 560, 570, 580). Because the available bandwidth can be shared to send one enhancement layer and several base layers instead of a single full-quality video stream, speaker identification is greatly improved compared with conventional video conferencing systems.

Description

[Field of the Invention]
The present invention relates to video conferencing. It is applicable to, but not limited to, a video switching mechanism in H.323 and/or SIP-based centralized video conferences that use layered video coding.

[Background of the Invention]
As the pace of business accelerates and relationships spread around the world, the need to bridge communication distances quickly and economically has become a major challenge. Bringing customers and staff together efficiently is essential to success in an increasingly competitive market. Businesses are looking for flexible solutions that support the sharing of real-time information across countries and continents using a variety of communication methods such as voice, video, image data, and any combination thereof.

In particular, multinational organizations increasingly wish to eliminate costly travel and to link multiple sites so that groups within the organization can communicate more efficiently and effectively. Multipoint conferencing systems operating over Internet Protocol (IP) networks seek to address this need. In the field of this invention, it is known for terminals to exchange audio and video streams in real time in a multipoint video conference. The conventional way to set up a multipoint conference over an IP network is to use a multipoint control unit (MCU). An MCU is an endpoint on the network that provides the capability for three or more terminals and/or communication gateways to participate in a multipoint conference. An MCU may also connect two terminals in a point-to-point conference, so that the point-to-point conference can later develop into a multipoint conference.

Referring first to FIG. 1, a known centralized conferencing model 100 is shown. Centralized conferencing uses an MCU-based conference bridge. All terminals (endpoints) 120, 122, 125 send media information, in the form of audio, video and/or data signals, together with control information streams 140, to the MCU 110 and receive the same from it. These transmissions are made in a point-to-point fashion, as shown in FIG. 1.

The MCU 110 consists of a multipoint controller (MC) and zero or more multipoint processors (MP). The MC handles call set-up and call signaling negotiations between all terminals to determine common capabilities for audio and video processing. The MC does not deal directly with any of the media streams. That is left to the MP, which mixes, switches and processes the audio, video and/or data bits.

In this way, MCUs provide the ability to host multi-location seminars, sales meetings, group meetings and other "face-to-face" communications. Multipoint conferencing is also known to be usable in a variety of applications:

(i) Executives and managers at multiple locations can meet "face-to-face", share real-time information, and make decisions more quickly without the loss of time, the expense and the demands of travel.

(ii) Project teams and knowledge workers can coordinate individual tasks, and view and revise shared documents, presentations, designs and files in real time.

(iii) Students, trainees and employees at remote locations can access shared educational/training resources across any distance or time zone.

As a result, MCU-based systems are expected to play an important role in future multimedia communications over IP-based networks.

Such multimedia communications often employ video transmission, in which a sequence of images (often called frames) is transmitted between a transmitting unit and a receiving unit. Multipoint multimedia conferencing systems can be set up in various ways, for example as specified by the H.323 and SIP session layer protocol standards. References for SIP can be found at:
http://www.ietf.org/rfc/rfc2543.txt, and
http://www.cs.columbia.edu/~hgs/sip

Furthermore, in systems using, for example, ITU H.263 video compression [ITU-T Recommendation H.263, "Video Coding for Low Bit Rate Communication"], the first frame of a video sequence contains a large amount of image data, generally referred to as intra-coded information. Because it is the first frame, the intra-coded frame provides a substantial portion of the image to be displayed. The intra-coded frame is followed by inter-coded (predicted) information, which generally contains data relating to changes in the image being transmitted. Predicted, inter-coded information therefore contains much less information than intra-coded information.

In conventional multimedia conferencing systems, users need to identify themselves when they speak so that the receiving terminals know who is speaking. Clearly, if the transmitting terminal fails to identify itself, the listening users have to guess who is speaking.

A known technique solves this problem by analyzing the audio streams and then sending the name and video stream of the active speaker to all participants. In a centralized conferencing system, the MCU often performs this function. The MCU can then send the speaker's name and the corresponding video and audio streams to all participants by switching the appropriate input multimedia stream to the output ports/paths.

Video switching is a well-known technique that aims to deliver a single video stream to each endpoint, and is equivalent to arranging multiple point-to-point sessions. Video switching can be:
(i) voice-activated switching, where the MCU transmits the video of the active speaker;
(ii) timed-activated switching, where the video of each participant is transmitted in turn at predetermined time intervals; or
(iii) individual video selection switching, where each endpoint can request the video stream of the participant it wishes to receive.
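The three switching policies listed above can be sketched as follows. This is an illustrative sketch only: the function names, the participant identifiers and the use of a per-participant audio level are assumptions and are not taken from the specification.

```python
from typing import Dict, List


def voice_activated(audio_level: Dict[str, float]) -> str:
    """Return the participant with the highest audio level (the active speaker)."""
    return max(audio_level, key=audio_level.get)


def timed_activated(participants: List[str], elapsed_s: float, interval_s: float) -> str:
    """Cycle through the participants, one per fixed time interval."""
    slot = int(elapsed_s // interval_s)
    return participants[slot % len(participants)]


def individual_selection(requests: Dict[str, str], endpoint: str, default: str) -> str:
    """Return the participant whose video this endpoint requested, else a default."""
    return requests.get(endpoint, default)
```

In each case the return value names the single participant whose video stream would be switched to the output for a given endpoint.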

Referring now to FIG. 2, a block diagram of a conventional video switching mechanism 200 is shown. In a conventional centralized conferencing system, video switching is performed as follows. An MCU 220, located for example within an Internet Protocol (IP) based network 210, contains a switch 230. The MCU 220 receives the video streams 255, 265, 275, 285 of all participants (user equipment) 250, 260, 270, 280. The MCU may also separately receive a combined (multiplexed) audio stream 290 from the speaking participants. The MCU 220 then selects one of the video streams and sends this video stream 240 to all participants 250, 260, 270, 280.

Such conventional systems have the disadvantage that they transmit only the video stream of the active speaker. Users still have difficulty identifying the speaker in the video stream when several speakers are talking at the same time or when the active speaker changes constantly. This is particularly so in large video conferences.

Alternatively, the video of every participant can be sent to all participants. However, this approach suffers in wireless-based conferences because of bandwidth limitations.

In the field of video technology, it is known that video is transmitted as a series of still images/pictures. Because the quality of a video signal can be degraded during coding or compression, it is known to include additional information "layers" based on the difference between the video signal and the encoded video bit stream. Including additional layers allows the quality of the received signal, after decoding and/or decompression, to be enhanced. A hierarchy of pictures and enhancement pictures, partitioned into one or more layers, is thus used to produce a layered video bit stream.

In a layered (scalable) video bit stream, enhancements to the video signal may be added to a base layer by:
(i) increasing the resolution of the picture (spatial scalability); or
(ii) including error information to improve the signal-to-noise ratio of the picture (SNR scalability); or
(iii) including extra pictures to increase the frame rate (temporal scalability).

Such enhancements may be applied to the whole picture or to an arbitrarily shaped object within the picture, which is termed object-based scalability. To preserve the disposable nature of the temporal enhancement layer, the H.263 standard requires that pictures contained in the temporal scalability mode be bi-directionally predicted (B) pictures, as shown in the video stream of FIG. 3.

FIG. 3 shows a schematic illustration of a scalable video arrangement 300 illustrating B picture prediction dependencies, as known in the field of video coding techniques. An initial intra-coded frame (I₁) 310 is followed by a bi-directionally predicted frame (B₂) 320. This, in turn, is followed by a (uni-directional) predicted frame (P₃) 330, and again by a second bi-directionally predicted frame (B₄) 340. This, again in turn, is followed by a (uni-directional) predicted frame (P₅) 350, and so on.
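The disposable nature of B pictures can be illustrated with a small dependency check. The sketch below is illustrative only: the reference relationships (each P picture predicted from the preceding I or P picture, each B picture predicted from the reference pictures on either side of it) follow the usual coding convention and are an assumption about FIG. 3, which is not reproduced here. The table and function names are hypothetical.

```python
# Hypothetical dependency table for the sequence I1 B2 P3 B4 P5.
DEPENDS_ON = {
    "I1": [],
    "B2": ["I1", "P3"],
    "P3": ["I1"],
    "B4": ["P3", "P5"],
    "P5": ["P3"],
}


def decodable(received):
    """Return the received pictures whose references are all decodable too."""
    ok = set()
    changed = True
    while changed:  # iterate until no further picture becomes decodable
        changed = False
        for pic in received:
            if pic not in ok and all(ref in ok for ref in DEPENDS_ON[pic]):
                ok.add(pic)
                changed = True
    return [pic for pic in received if pic in ok]
```

Under these assumed dependencies, discarding both B pictures leaves I1, P3 and P5 fully decodable at a lower frame rate, whereas discarding P3 makes every later picture undecodable.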

FIG. 4 is a schematic illustration of a layered video arrangement, as known in the field of video coding techniques. A layered video bit stream includes a base layer 405 and one or more enhancement layers 435.

The base layer (layer 1) includes one or more intra-coded pictures (I pictures) sampled, coded and/or compressed from the original video signal pictures. In addition, the base layer will include a plurality of predicted inter-coded pictures (P pictures) predicted from the intra-coded picture(s) 410.

In the enhancement layers (layer 2, 3, 4 or above) 435, three types of picture may be used:
(i) bi-directionally predicted (B) pictures (not shown);
(ii) enhanced intra (EI) pictures 440, based on the intra-coded picture(s) of the base layer 405; and
(iii) enhanced predicted (EP) pictures 450, 460, based on the inter-coded predicted pictures 420, 430 of the base layer 405.

The vertical arrows from the lower layer indicate that a picture in the enhancement layer is predicted from a reconstructed approximation of that picture in the reference (lower) layer.

In summary, scalable video coding has been used in multicast multimedia conferencing, and only in the context of point-to-point or multicast video communication. However, wireless networks do not currently support multicasting. Furthermore, with multicasting, each layer is sent in a separate multicast session, with the receiver deciding for itself whether to register for one or more sessions.

There is therefore a need for an improved video conferencing arrangement and method of operation in which the aforementioned disadvantages may be alleviated.

[Statement of Invention]
In accordance with the present invention, there are provided a method of relaying video images in a multimedia video conference as claimed in claim 1, a video conferencing arrangement for relaying video images as claimed in claim 7, a wireless device for participating in a video conference as claimed in claim 11, a multipoint processor as claimed in claim 12, a video communication system as claimed in claim 16, a media resource function as claimed in claim 18, a video communication unit as claimed in claim 19 or 20, and a storage medium as claimed in claim 23. Further aspects of the invention are as claimed in the dependent claims.

In summary, the inventive concepts of the present invention address the disadvantages of prior art arrangements by providing a video switching method that improves the identification of participants and speakers in a video conference. The invention makes use of layered video coding to make better use of the bandwidth available to each user.

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings.

[Description of Preferred Embodiments]
In summary, the preferred embodiment of the present invention proposes a new video switching mechanism for multimedia conferencing that makes use of layered video coding. Previously, layered video coding has only been used to partition a video bit stream into two or more layers, namely a base layer and one or several enhancement layers, as described above with reference to FIG. 4. These known techniques for scalable video communication are described in detail in standards such as H.263 and MPEG-4.

However, the inventor of the present invention has recognized that benefits can be gained by adapting the concept of layered video coding and applying the adapted concept to multimedia video conferencing applications. In this way, the present invention defines different types of scalable video coding focused on use in multimedia conferencing, as opposed to point-to-point or multicast video communication.

Referring now to FIG. 5, a functional block diagram 500 of a video switching mechanism in accordance with the preferred embodiment of the present invention is shown. In contrast to conventional centralized conferencing systems, video switching is performed as follows. An MCU 520, located for example within an Internet Protocol (IP) based network 510, contains a switch 530.

The MCU 520 receives "layered" video streams, comprising a base layer 552, 562, 572, 582 and one or more enhancement layer streams 555, 565, 575, 585, from all participants (user equipment) 550, 560, 570, 580. For clarity only, a single enhancement layer video stream is shown per participant.

The MCU 520 may also separately receive a combined (multiplexed) audio stream 590 from the participants. The MCU 520 then uses the switch 530 to select the base layer video streams of a number of active speakers 535 and the enhancement layer 540 of the most active speaker. The MCU 520 then sends these video streams 535, 540 to all participants 550, 560, 570, 580.
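The selection step performed with the switch 530 can be sketched as below. This is a minimal sketch under stated assumptions: the `LayeredStream` type, the `select_for_forwarding` function and the idea that the active speakers arrive as a list ranked by activity are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LayeredStream:
    base: str                # e.g. base layer stream 552
    enhancement: List[str]   # e.g. enhancement layer stream(s) 555


def select_for_forwarding(streams: Dict[str, LayeredStream],
                          active_ranked: List[str]) -> List[str]:
    """Pick the layers forwarded to the participants.

    active_ranked lists the active speakers, most active first.
    Base layers of all active speakers are forwarded; enhancement
    layers are forwarded for the most active speaker only.
    """
    if not active_ranked:
        return []
    selected = [streams[p].base for p in active_ranked]
    selected += streams[active_ranked[0]].enhancement
    return selected
```

For example, with participants 560 and 570 active and 560 the most active, the sketch selects base layers 562 and 572 plus enhancement layer 565, and nothing from the inactive participants.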

The selection process for determining the most active speaker is preferably performed by the MCU 520 first analyzing the audio streams 590 to determine who all the active speakers are. The most active speaker is then preferably determined in the multipoint processor unit, as described with reference to FIG. 6. One or more base layers and one enhancement layer are preferably sent to the participants according to a priority level based on the activity of each participant.

To operate effectively the improved, but more complex, video switching mechanism of FIG. 5, a multipoint processor (MP) 600 has been adapted, in accordance with the preferred embodiment of the present invention and as shown in FIG. 6, to facilitate the new video switching mechanism.

As before, the MP 600 receives the audio streams 590 from the participants' video/multimedia communication units via a packet filtering module 610 and routes them to a packet routing module 630. However, the audio streams are now also routed to a speaker identification module 620, which analyzes the audio streams 590 to determine who the active speakers are. The speaker identification module 620 assigns a priority level based on the activity of each participant and determines:
(i) the most active speaker 622;
(ii) any other active speakers 625; and, by default,
(iii) any remaining inactive speakers.
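The three-way classification made by the speaker identification module can be sketched as follows. The text does not say how activity is measured, so the per-participant activity score (for instance a short-term audio energy) and the threshold used here are assumptions, as is the function name.

```python
from typing import Dict, List, Optional, Tuple


def classify_speakers(activity: Dict[str, float],
                      threshold: float) -> Tuple[Optional[str], List[str], List[str]]:
    """Split participants into (most active, other active, inactive).

    activity maps a participant id to a hypothetical activity score;
    scores at or below threshold count as inactive.
    """
    active = sorted((p for p, a in activity.items() if a > threshold),
                    key=lambda p: activity[p], reverse=True)
    inactive = [p for p in activity if activity[p] <= threshold]
    most_active = active[0] if active else None
    return most_active, active[1:], inactive
```

The returned ordering of the other active speakers, from more to less active, plays the role of the priority level passed on to the switching stage.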

The speaker identification module 620 then sends the priority level information to a switching module 640, which has been adapted to handle the speakers' priority levels in accordance with the preferred embodiment of the present invention. The switching module 640 has also been adapted to receive, via the packet filtering module 610, the layered video streams from the participants' video communication units, comprising the video base layer streams 552, 562, 572 and 582 and the video enhancement layer streams 555, 565, 575 and 585. The switching module 640 uses the speaker information to send to all participants, via the packet routing module 630, the video base layers of the secondary (lower-level) active speakers and of the most active speaker, together with the video enhancement layer of the most active speaker only.

Hence, one or more receiving ports of the multipoint processor have been adapted to receive layered video streams, comprising the base layer video streams 552, 562, 572 and 582 and the enhancement layer streams 555, 565, 575 and 585, from the plurality of user devices 550, 560, 570 and 580. It is within the contemplation of the invention that the switching module 640 may select just one base layer video image and the corresponding one or more enhancement layers if it is determined that there is only one active speaker. That speaker is then automatically designated the most active speaker for transmission to the one or more user devices 550, 560, 570 and 580.

When the most active speaker is changing constantly, as can happen in a video conference, the enhancement layer is constantly being switched. The inventor of the present invention has recognized a potential problem with such continuous and rapid switching. Under such circumstances, the first frame of the newly selected enhancement layer may in fact be a predicted (EP) frame from a speaker who was previously only a secondary active speaker, in which case that first frame needs to be converted into an intra (EI) frame.

To address this potential problem, the video base layer streams 552, 562, 572 and 582 and the video enhancement layer streams 555, 565, 575 and 585 from the packet filtering module 610 are preferably input to a de-packetization function 680. The de-packetization function 680 demultiplexes the video streams and provides the demultiplexed video streams to a video decoder and buffer function 670.

To synchronize and coordinate the video decoding, the video decoder and buffer function 670 receives an indication of the most active speaker 622. After extracting the video stream information relating to the most active speaker, the video decoder and buffer function 670 provides the bi-directionally predicted (BP) and/or predicted (EP) video stream data of the most active speaker 622 to an "EP frame to EI frame conversion module" 660. The "EP frame to EI frame conversion module" 660 processes the input video stream to provide the primary speaker's enhancement layer video stream as intra-coded (EI) frames.
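The effect of the conversion on the forwarded enhancement layer can be sketched as follows. This is a simplified illustration, not the module's actual processing: frames are reduced to hypothetical (speaker, type, number) tuples, and the decode-and-re-encode step performed by module 660 is represented merely by relabeling the frame type.

```python
from typing import Iterable, List, Optional, Tuple

Frame = Tuple[str, str, int]  # (speaker id, frame type "EI" or "EP", frame number)


def forward_enhancement(frames: Iterable[Frame]) -> List[Frame]:
    """Forward the enhancement frames of the most active speaker.

    Each input frame belongs to whoever is the most active speaker at
    that instant. The first frame after a change of speaker is emitted
    as an EI frame so that receivers can start decoding from it.
    """
    out: List[Frame] = []
    previous: Optional[str] = None
    for speaker, ftype, number in frames:
        if speaker != previous and ftype != "EI":
            ftype = "EI"  # stands in for decoding and intra re-encoding
        out.append((speaker, ftype, number))
        previous = speaker
    return out
```

Frames within an uninterrupted run from the same speaker pass through unchanged, so the conversion cost is incurred only at a switch.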

The primary speaker's enhancement layer video stream is then input to a packetization function 650, where it is packetized and input to the switching module 640. The switching module 640 then combines the primary speaker's enhancement layer video stream with the video base layer streams 552, 562, 572 and 582 of the secondary active speakers, and routes the combined multimedia stream to the packet routing module 630. The packet routing module 630 then routes the information to the participants in accordance with the method of FIG. 5.

In the preferred embodiment of the invention, the video switching module 640 uses the output of the "EP frame to EI frame conversion module" 660 when that module determines that the primary speaker has changed.

It is within the contemplation of the invention that one or more modules similar to the EP frame to EI frame conversion module 660 could also be included in the MP 600 to perform the same function for secondary speakers when a secondary speaker is deemed to have changed. Otherwise, in an embodiment that uses a single "EP frame to EI frame conversion module" 660 to convert the video stream of the primary speaker only, the speaker identification module 620 (or the switching module 640) may request a new intra frame when, for example, an inactive speaker becomes a secondary active speaker. Alternatively, the switching module 640 may wait for a new intra frame from the new secondary active speaker before sending that speaker's video base layer stream to all participants.
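The "wait for a new intra frame" alternative can be sketched as a simple gate on the new secondary speaker's base layer. The function name and the (type, number) frame representation are hypothetical and serve only to illustrate the idea.

```python
from typing import Iterable, List, Tuple


def gate_until_intra(frames: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Drop base layer frames of a newly active speaker until an I frame arrives."""
    out = []
    started = False
    for ftype, number in frames:
        if ftype == "I":
            started = True  # receivers can begin decoding from here
        if started:
            out.append((ftype, number))
    return out
```

The trade-off in this sketch is latency: nothing is forwarded for the new speaker until the sender's next intra frame, which is why requesting an intra frame, or converting predicted frames, may be preferred.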

In addition to the preferred embodiment of the invention, it is within the contemplation of the invention that more classes of speaker can be used if more than one enhancement layer is available. Using more classes of speaker achieves finer scalability of the multimedia messages, since speaker identification is improved, particularly for large video conferences.

It is also within the contemplation of the invention that conversion of predicted frames into intra frames could be added for one or more of the base layer streams. In this way, the switching module 640 can switch quickly between base layers without having to wait for a new intra frame.

FIG. 7 shows a video display 710 of a wireless device 700 taking part in a video conference that uses the preferred embodiment of the present invention. Implementing the inventive concepts described above yields improved video communication. In particular, for a given bandwidth, a participant can now receive better video quality for the most active speaker, because the video quality of the lower-level (secondary) active speakers 730 is reduced and no video of inactive speakers is provided. To provide such an improved video conference, the video communication device receives the enhancement layer and base layer of the most active speaker 720 and the base layer of the secondary active speakers 730, and receives no video from inactive speakers.

In this way, the video communication device can present a continuously updated video image of the most active speaker on a larger, higher-resolution display, while smaller displays show the secondary (lower-level) active speakers.

The wireless device 700 preferably has a primary video display that shows the higher-quality video image of the most active speaker, and one or more second, separate displays that show the respective lower-level active speakers. Displaying each video image on its display is preferably carried out by a processor (not shown) operably coupled to the video display. The processor receives indications of the most active speaker 720 and of the lower-level active speakers, and determines which received video image is to be shown on the first display and which video image or images received from the lower-level active speakers 730 are to be shown on the second display. The second display is advantageously configured to save cost by providing only a lower-quality video image of the lower-level active speakers.
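The processor's decision described above can be sketched as a simple mapping from speaker indications to displays. This is an illustrative sketch only; the function name, the display labels and the `secondary_displays` parameter are hypothetical and not taken from the text.

```python
from typing import Dict, List


def assign_displays(received: List[str],
                    most_active: str,
                    secondary: List[str],
                    secondary_displays: int) -> Dict[str, str]:
    """Decide which received video image goes on which display.

    received lists the speakers whose video actually arrived; most_active and
    secondary are the indications delivered alongside the video streams.
    """
    layout: Dict[str, str] = {}
    if most_active in received:
        layout["primary"] = most_active  # larger, higher-quality display
    # Only secondary speakers whose base layer arrived can be shown.
    shown = [s for s in secondary if s in received and s != most_active]
    for index, speaker in enumerate(shown[:secondary_displays]):
        layout[f"secondary-{index}"] = speaker  # smaller, lower-quality display
    return layout
```

A speaker listed as secondary but with no received video is skipped, which matches the case where inactive speakers send no video to the device.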

MCU-based systems are expected to facilitate multimedia communication over IP-based networks in the future. The inventor therefore envisages that the techniques described herein could be incorporated into any H.323/SIP-based multipoint multimedia conference or system that uses an MCU.

A preferred application of the invention lies within the Third Generation Partnership Project (3GPP) specifications for the Wideband Code Division Multiple Access (WCDMA) standard. In particular, the invention can be applied to the IP multimedia domain (described in the 3G TS 25.xxx series of specifications), which plans to incorporate an H.323/SIP MCU into the 3GPP network. The MCU will be hosted by the media resource function 890A (see FIG. 8).

FIG. 8 shows, in hierarchical form, a 3GPP (UMTS) communication system/network 800 that can be adapted in accordance with the preferred embodiment of the present invention. The communication system 800 is compliant with, and contains network elements capable of operating over, a UMTS and/or GPRS air interface.

The network is conveniently considered as comprising:
(i) a user equipment domain 810, made up of:
(a) a user SIM (USIM) domain 820, and
(b) a mobile equipment domain 830;
(ii) an infrastructure domain 840, made up of:
(c) an access network domain 850, and
(d) a core network domain 860, which is in turn made up of:
(di) a serving network domain 870,
(dii) a transit network domain 880, and
(diii) an IP multimedia domain 890 (where multimedia is provided by SIP (IETF RFC 2543)).

In the mobile equipment domain 830, the UE 830A receives data from the user SIM 820A in the USIM domain 820 via the wired Cu interface. The UE 830A communicates data with a Node B 850A in the access network domain 850 via the wireless Uu interface. Within the access network domain 850, the Node B 850A contains one or more transceiver units and communicates with the rest of the cell-based system infrastructure, for example the RNC 850B, via an Iub interface, as defined in the UMTS specification.

The RNC 850B communicates with other RNCs (not shown) via the Iur interface, and with the SGSN 870A in the serving network domain 870 via the Iu interface. Within the serving network domain 870, the SGSN 870A communicates with the GGSN 870B via the Gn interface and with the VLR server 870C via the Gs interface. In accordance with the preferred embodiment of the present invention, the SGSN 870A communicates with an MCU (not shown) residing in the media resource function (890A) in the IP multimedia domain 890. That communication takes place via the Gi interface.
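The interfaces named in the two paragraphs above can be collected into a small lookup table, which makes it easy to trace the interfaces crossed between a UE and the MCU. This is only an illustrative sketch: the node labels (for example "MCU in MRF 890A") and the function name are made up for the example, and the Iur link is omitted because the other RNCs are not shown.

```python
from typing import Dict, List, Tuple

# Interface between each pair of elements, as described in the text.
LINKS: Dict[Tuple[str, str], str] = {
    ("USIM 820A", "UE 830A"): "Cu",
    ("UE 830A", "Node B 850A"): "Uu",
    ("Node B 850A", "RNC 850B"): "Iub",
    ("RNC 850B", "SGSN 870A"): "Iu",
    ("SGSN 870A", "GGSN 870B"): "Gn",
    ("SGSN 870A", "VLR 870C"): "Gs",
    ("SGSN 870A", "MCU in MRF 890A"): "Gi",
}


def interfaces_along(path: List[str]) -> List[str]:
    """Return the interface names crossed by consecutive hops of path."""
    names: List[str] = []
    for a, b in zip(path, path[1:]):
        # Links are bidirectional, so look the pair up in either order.
        name = LINKS.get((a, b)) or LINKS.get((b, a))
        if name is None:
            raise KeyError(f"no interface described between {a} and {b}")
        names.append(name)
    return names
```

For instance, tracing the path from "UE 830A" through "Node B 850A", "RNC 850B" and "SGSN 870A" to "MCU in MRF 890A" yields the Uu, Iub, Iu and Gi interfaces in that order.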

The GGSN 870B (and/or SGSN) is responsible for interfacing UMTS (or GPRS) with a public switched data network (PSDN) 880A, such as the Internet or the public switched telephone network (PSTN). The SGSN 870A performs, for example, routing and tunneling functions for traffic within the UMTS core network, while the GGSN 870B links to external packet networks, in this case those accessing the UMTS mode of the system.

The RNC 850B is the UTRAN element responsible for controlling and allocating resources for numerous Node Bs 850A. Typically, 50 to 100 Node Bs may be controlled by one RNC 850B. The RNC 850B also provides reliable delivery of user traffic over the air interface. RNCs communicate with each other (via the Iur interface) to support handover and macro diversity.

The SGSN 870A is the UMTS core network element responsible for session control and for the interface to the location registers (HLR and VLR). The SGSN is a large centralized controller for many RNCs.

The GGSN 870B is the UMTS core network element responsible for concentrating and tunneling user data within the core packet network to its final destination (for example, an Internet service provider (ISP)). Such user data includes multimedia and related signaling data to and/or from the IP multimedia domain 890. Within the IP multimedia domain 890, the MRF is split into a multimedia resource function controller (MRFC) 892A and a multimedia resource function processor (MRFP) 891A. As described earlier, the MRFC 892A provides the functions of a multipoint controller (MC), while the MRFP 891A provides the functions of a multipoint processor (MP).

The protocol used across the Mr reference point/interface 893A is SIP (as defined by RFC 2543). The call state control function (CSCF) 895A acts as a call server and handles multimedia call signaling.

Thus, in accordance with the preferred embodiment of the present invention, the SGSN 870A, the GGSN 870B and all elements within the MRF 890A are adapted to facilitate multimedia messages as described above. In addition, the UE 830A, the Node B 850A and the RNC 850B are also adapted to facilitate the improved multimedia messages as described above.

More generally, the adaptation may be implemented in the respective communication units in any suitable manner. For example, new apparatus may be added to a conventional communication unit, or alternatively existing parts of a conventional communication unit may be adapted, for example by reprogramming one or more processors therein. As such, the required adaptation may be implemented in the form of processor-executable instructions stored on a storage medium such as a floppy disk, hard disk, PROM, RAM, any combination of these, or other storage media.

Such adaptation of multimedia messages may alternatively be controlled, or implemented in full or in part, by adapting any other suitable part of the communication system 800.

Although the above elements are typically provided as discrete and separate units (on their own respective software and/or hardware platforms) divided across the mobile equipment domain 830, the access network domain 850 and the serving network domain 870, it is envisaged that other configurations can be applied.

Furthermore, in other network infrastructures such as a GSM network, the processing operations may be implemented at any appropriate node, such as any other suitable type of base station, base station controller, mobile switching center, or operations and management controller. Alternatively, the steps described above may be carried out by various components distributed at different locations or entities within any suitable network or system.

The video conferencing method using layered video coding, preferably when applied to the centralized video conference described above, provides the following advantages:
(i) Speaker identification is greatly improved compared with conventional systems, because the bandwidth is shared to send one or more enhancement layers and several base layers instead of only one full-quality video stream.
(ii) Video switching when the active speaker changes is much smoother with the inventive concepts described herein, because they define several speaker states: the active speaker, the second most active speakers, and the inactive speakers.
(iii) The video quality of the most active speaker is improved.
(iv) The improved video communication device displays a variety of speakers, with each displayed image depending on the priority level associated with the transmission of the respective video communication device.
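The bandwidth sharing behind advantage (i) can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and all bit rates are hypothetical figures chosen for the example, not values from the text.

```python
def enhancement_budget(total_kbps: int, base_kbps: int, active_speakers: int) -> int:
    """Bit rate left for the most active speaker's enhancement layer(s).

    Every active speaker (including the most active one) is sent at base
    quality; whatever remains of the link budget carries the enhancement layers.
    """
    base_total = base_kbps * active_speakers
    if base_total > total_kbps:
        raise ValueError("base layers alone exceed the available bandwidth")
    return total_kbps - base_total
```

Under these assumed figures, a 128 kbit/s link that would conventionally carry one full-quality stream could instead carry three 16 kbit/s base layers, leaving 80 kbit/s for the enhancement layers of the most active speaker.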

A method of relaying video images in a multimedia video conference between a plurality of multimedia user devices has been described. The method includes the steps of transmitting, by a number of the plurality of user devices, layered video images that include a base layer and one or more enhancement layers, and receiving the transmitted layered video images at a multipoint control unit. A number of base layer video images of a number of active speakers are selected, together with one or more enhancement layers of the most active speaker. The multipoint control unit transmits the number of base layer video images of the number of active speakers, and the one or more enhancement layers of the most active speaker, to one or more of the plurality of multimedia user devices.
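The selection step summarized above can be sketched as follows, assuming speaker activity is estimated from audio levels (the claims mention analyzing audio data streams, but give no formula). The function name, the `threshold` parameter and the level scale are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def select_layers(audio_level: Dict[str, float],
                  active_count: int,
                  threshold: float = 0.1) -> Tuple[List[str], Optional[str]]:
    """Pick whose base layers, and whose enhancement layers, are relayed.

    Returns (active_speakers, most_active). Base layers are relayed for every
    active speaker; enhancement layers only for the most active one.
    """
    # Speakers at or below the threshold are treated as inactive.
    ranked = sorted(
        (s for s, level in audio_level.items() if level > threshold),
        key=lambda s: audio_level[s],
        reverse=True,
    )
    active = ranked[:active_count]
    most_active = active[0] if active else None
    return active, most_active
```

When nobody exceeds the threshold the function returns an empty list and `None`, leaving it to the caller to decide what, if anything, to relay.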

In addition, a video conferencing apparatus that relays video images between a plurality of user devices has been described, as has a wireless device for participating in a video conference in which a number of participants transmit video images.

FIG. 1 shows a known centralized conference model.
FIG. 2 shows a functional diagram of a conventional video switching mechanism.
FIG. 3 is a schematic diagram of a video structure showing picture prediction dependencies, as known in the field of video coding.
FIG. 4 is a schematic diagram of a layered video structure, as known in the field of video coding.
FIG. 5 shows a functional diagram of a video switching mechanism in accordance with a preferred embodiment of the present invention.
FIG. 6 shows a functional block diagram/flow chart of a multipoint processing unit in accordance with a preferred embodiment of the present invention.
FIG. 7 shows a video display of a wireless device participating in a video conference using the preferred embodiment of the present invention.
FIG. 8 shows a UMTS (3GPP) communication system adapted in accordance with a preferred embodiment of the present invention.

Claims

A method of relaying video images in a multimedia video conference between a plurality of multimedia user devices (550, 560, 570, 580), the method comprising the steps of:
transmitting layered video images by a number of the plurality of multimedia user devices, wherein each layered video image includes a base layer (552, 562, 572, 582) and one or more enhancement layers (555, 565, 575, 585);
receiving the transmitted layered video images at a multipoint control unit (520);
selecting a number of base layer video images of a number of active speakers (535) and one or more enhancement layers (540) of the most active speaker; and
transmitting, by the multipoint control unit (520), the number of base layer video images of the number of active speakers (535) and the one or more enhancement layers (540) of the most active speaker to one or more of the plurality of multimedia user devices (550, 560, 570, 580).

The method of relaying video images in a multimedia video conference according to claim 1, wherein the step of selecting further comprises the step of:
analyzing a number of audio data streams (590) transmitted by the plurality of multimedia user devices (550, 560, 570, 580) to determine the number of active speakers and/or the most active speaker.

The method of relaying video images in a multimedia video conference according to claim 1, further comprising the steps of:
assigning a priority level to each layered video image and/or audio data stream transmitted by a respective user device; and
selecting, based on the assigned priority levels, the number of base layer video images (535) and the one or more enhancement layers (540) for transmission to the one or more of the plurality of multimedia user devices (550, 560, 570, 580).

The method of relaying video images in a multimedia video conference according to any one of claims 1 to 3, further comprising the step of converting (660) a first predicted frame of the video image of the most active speaker into an intra-coded frame, to enhance the video quality of the most active speaker.

The method of relaying video images in a multimedia video conference according to any one of claims 1 to 4, further comprising the step of receiving, at the multipoint control unit (520), when more than one enhancement layer is available, a multi-class indication of the one or more speakers together with each layered video image transmission, to provide finer scalability of the video images.

The method of relaying video images in a multimedia video conference according to any one of claims 1 to 5, further comprising the step of converting predicted frames into intra-coded frames for one or more base layer video streams.

A video conferencing apparatus for relaying video images between a plurality of user devices (550, 560, 570, 580), comprising:
a multipoint control unit (520) adapted to receive a number of layered video images transmitted by a number of the plurality of multimedia user devices, wherein each layered video image includes a base layer (552, 562, 572, 582) and one or more enhancement layers (555, 565, 575, 585); and
a video switching module (530), operably coupled to the multipoint control unit (520), adapted to select a number of base layer video images of a number of active speakers (535) and one or more enhancement layers (540) of the most active speaker;
wherein the multipoint control unit (520) is further adapted to transmit the number of base layer video images of the number of active speakers and the one or more enhancement layers (540) of the most active speaker to one or more of the plurality of multimedia user devices (550, 560, 570, 580).

The video conferencing apparatus according to claim 7, further comprising a predicted frame / intra-coded frame conversion module (660), operably coupled to the video switching module (530), for converting a predicted frame into an intra-coded frame, wherein the conversion module (660) provides a first frame of the enhancement layer video stream of the most active speaker as an intra-coded frame when the multipoint control unit (520) receives that first frame as a predicted frame.

The video conferencing apparatus according to claim 7 or 8, further comprising a speaker identification module (620) that analyzes a number of audio streams (590) to determine a number of active speakers and/or the most active speaker.

The video conferencing apparatus according to claim 9, wherein the speaker identification module (620) assigns priority levels based on each participant's determined activity, to determine one or more of: the most active speaker (622), any other active speakers (625), and any inactive speakers.

A wireless device (700) for participating in a video conference in which a plurality of participants transmit video images, comprising:
a video display (710) having a first display and one or more second, separate displays for displaying respective participants (720, 730) of the plurality of participants; and
a processor, operably coupled to the video display, for receiving indications of the most active speaker (720) and of less active speakers (730), and for determining that the video image received from the most active speaker (720) is to be displayed on the first display, which provides a higher-quality video image, and that the video images received from a number of the less active speakers (730) are to be displayed on the one or more second displays, which provide a lower-quality video image.

A multipoint processor comprising:
one or more receiving ports adapted to receive, from a plurality of user devices (550, 560, 570, 580), layered video streams including base layer video streams (552, 562, 572, 582) and enhancement layer video streams (555, 565, 575, 585); and
a switching module (640), operably coupled to the one or more receiving ports, for selecting a number of base layer video images of a number of active speakers (535) and one or more enhancement layers (540) of the most active speaker for transmission to one or more user devices (550, 560, 570, 580).

The multipoint processor according to claim 12, further comprising a speaker identification module (620), operably coupled to the one or more receiving ports, for analyzing a number of audio streams (590) received from a number of the plurality of user devices to determine a number of active speakers and/or the most active speaker.

The multipoint processor according to claim 12 or 13, wherein the speaker identification module (620) assigns priority levels based on the determined activity of a number of participants, to determine one or more of: the most active speaker (622), any other active speakers (625), and any inactive speakers.

The multipoint processor according to any one of claims 12 to 14, further comprising a predicted frame / intra-coded frame conversion module (660), operably coupled to the switching module (640), for converting the enhancement layer video stream of the most active speaker into an intra-coded frame when that stream is received at the respective port as a predicted frame.

A video communication system adapted to perform the steps of the method according to any one of claims 1 to 6, or adapted to incorporate the video conferencing apparatus according to any one of claims 7 to 10, or adapted to incorporate the multipoint processor according to any one of claims 12 to 15.

The video communication system of claim 16, wherein the video communication system is compatible with a UMTS communication standard (800) having an internet protocol multimedia domain (890) to facilitate video conferencing communications.

A media resource function device (890A) adapted to perform the steps of the method according to any one of claims 1 to 6, or adapted to incorporate the video conferencing apparatus according to any one of claims 7 to 10, or adapted to incorporate the multipoint processor according to any one of claims 12 to 15.

A video communications apparatus (700) adapted to receive a layered video conference image generated according to the method of any one of claims 1-6.

A video communication device adapted to generate a layered video conference image for use in the method according to any one of claims 1 to 6, or adapted to transmit a layered video conference image generated in accordance with the method according to any one of claims 1 to 6.

The video communication device according to claim 19, wherein the video communication device is one of Node B (850A), RNC (850B), SGSN (870A), GGSN (870B) and MRF (890A).

The method of relaying video images in a multimedia conference according to any one of claims 1 to 6, or the video conferencing apparatus according to any one of claims 7 to 10, or the multipoint processor according to any one of claims 12 to 15, or the video communication system according to claim 16 or 17, or the media resource function device (890A) according to claim 18, or the video communication device according to any one of claims 19 to 21, adapted to facilitate video conferencing images based on the H.323 standard or the SIP standard.

A storage medium storing processor-executable instructions for controlling a processor to perform the method according to any one of claims 1-6.