JP2018521593A

JP2018521593A - Composition and scaling of angle-separated subscenes

Info

Publication number: JP2018521593A
Application number: JP2018502621A
Authority: JP
Inventors: Mark Steven Schnittman; Maksim Makeev
Original assignee: Owl Labs Inc
Current assignee: Owl Labs Inc
Priority date: 2015-04-01
Filing date: 2016-04-01
Publication date: 2018-08-02
Anticipated expiration: 2036-04-01
Also published as: IL254812A0; US10991108B2; US20160292884A1; EP3278180A1; US10636154B2; IL282492B2; CN107980221A; IL254812B; WO2016161288A1; AU2022202258B2; CN114422738A; EP3995892A1; CA3239163A1; CA2981522A1; AU2022202258A1; EP3278180B1; AU2016242980B2; IL302194A; AU2019261804B2; ES2906619T3

Abstract

A densely composited single camera signal may be formed from a panoramic video signal that is captured from a wide camera and has an aspect ratio of substantially 2.4:1 or greater. Two or more sub-scene video signals may be subsampled at respective bearings of interest and composited side by side to form a stage scene video signal having an aspect ratio of substantially 2:1 or less. 80% or more of the area of the stage scene video signal may be subsampled from the panoramic video signal.

Description

Cross-Reference to Related Applications
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application Serial No. 62/141,822, filed April 1, 2015, the entire disclosure of which is incorporated herein by reference.

Field
Aspects relate to apparatus and methods for image capture and emphasis.

Background
Multi-party remote meetings, video chats, and teleconferences often take place with multiple participants together in one meeting room, connected to at least one remote party.

In the person-to-person mode of video conferencing software, only one local camera is available, often with a limited horizontal field of view (e.g., 70 degrees). Whether this single camera is positioned in front of one participant or at the head of the table directed at all participants, it is difficult for the remote party to follow the audio, body language, and non-verbal cues given by those participants in the meeting room who are far from the single camera or at sharp angles to it (e.g., the remote party sees a person's profile rather than the face).

In the multi-person mode of video conferencing software, the cameras of two or more mobile devices (laptops, tablets, or mobile phones) in the same meeting room are available, which adds several different problems. The more meeting room participants that log in to the conference, the greater the audio feedback and crosstalk may become. The camera perspectives may be as far from the participants, or as skewed, as in the single camera case. Local participants may tend to engage the other participants through their mobile devices despite being in the same room (thereby inheriting the same weaknesses in body language and non-verbal cues as the remote party).

There is no known commercial or experimental technique for compositing, tracking, and/or displaying angularly separated sub-scenes and/or sub-scenes of interest within a wide scene (e.g., a wide scene of two or more meeting participants) in a way that makes setup very easy for the same-room participants, or makes the experience automatic and seamless from the viewpoint of the remote participants.

Summary
In one aspect of the present embodiments, a process to output a densely composited single camera signal may record a panoramic video signal having an aspect ratio of substantially 2.4:1 or greater, captured from a wide camera having a horizontal angular field of view of substantially 90 degrees or greater. At least two sub-scene video signals may be subsampled at respective bearings of interest from the wide camera. The two or more sub-scene video signals may be composited side by side to form a stage scene video signal having an aspect ratio of substantially 2:1 or less. Optionally, more than 80% of the area of the stage scene video signal is subsampled from the panoramic video signal. The stage scene video signal may be formatted as a single camera video signal. Optionally, the panoramic video signal has an aspect ratio of substantially 8:1 or greater and is captured from a wide camera having a horizontal angular field of view of substantially 360 degrees.
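The subsample-and-composite step described above can be sketched roughly as follows. This is only an illustrative sketch, not the claimed implementation: it assumes a 360-degree panorama whose pixel columns map linearly to bearing, represents a frame as a list of pixel rows, and uses hypothetical helper names (`subsample`, `compose_stage`, `aspect_ratio`) that do not come from the source.

```python
def subsample(pano, bearing_deg, width_px):
    """Crop a width_px-wide sub-scene centred on bearing_deg (0-360).

    Assumption: columns map linearly to bearing, and the panorama wraps
    around, so crops near 0 degrees take columns from both frame edges.
    """
    pano_w = len(pano[0])
    centre = int(round(bearing_deg / 360.0 * pano_w))
    left = centre - width_px // 2
    cols = [(left + i) % pano_w for i in range(width_px)]
    return [[row[c] for c in cols] for row in pano]


def compose_stage(pano, bearings_deg, width_px):
    """Place one sub-scene per bearing side by side, in the order given."""
    crops = [subsample(pano, b, width_px) for b in bearings_deg]
    return [sum((crop[y] for crop in crops), []) for y in range(len(pano))]


def aspect_ratio(frame):
    """Width divided by height of a frame stored as a list of rows."""
    return len(frame[0]) / len(frame)


# Toy 32x4 panorama (8:1) whose pixel values are their column indices.
pano = [list(range(32)) for _ in range(4)]
# Two 4-pixel-wide sub-scenes give an 8x4 stage (2:1), all of it
# subsampled from the panorama.
stage = compose_stage(pano, [90, 270], 4)
```

In this toy case every stage pixel comes from the panorama, so the "more than 80% of the area" condition holds trivially; a real compositor would also scale the crops and might reserve some area for borders or overlays.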

In a related aspect of the present embodiments, a meeting camera is configured to output a densely composited single camera signal. An imaging element or wide camera of the meeting camera may be configured to capture and/or record a panoramic video signal having an aspect ratio of substantially 2.4:1 or greater, the wide camera having a horizontal angular field of view of substantially 90 degrees or greater. A processor operatively connected to the imaging element or wide camera may be configured to subsample two or more sub-scene video signals at respective bearings of interest from the wide camera. The processor may be configured to composite the two or more sub-scene video signals to memory (e.g., buffer and/or video memory) as side-by-side video signals to form a stage scene video signal having an aspect ratio of substantially 2:1 or less. The processor may be configured to composite the sub-scene video signals to memory (e.g., buffer and/or video memory) so that more than 80% of the area of the stage scene video signal is subsampled from the panoramic video signal. The processor may further be configured to format the stage scene video signal as a single camera video signal, for example one transported over USB.

In either of the above aspects, the processor may be configured to subsample an additional sub-scene video signal at a respective bearing of interest from the panoramic video signal, and to composite the two or more sub-scene video signals together with the one or more additional sub-scene video signals to form a stage scene video signal that has an aspect ratio of substantially 2:1 or less and includes a plurality of side-by-side sub-scene video signals. Optionally, compositing the two or more sub-scene video signals together with the one or more additional sub-scene video signals to form the stage scene video signal includes transitioning the one or more additional sub-scene video signals into the stage scene video signal by replacing at least one of the two or more sub-scene video signals, so as to form a stage scene video signal having an aspect ratio of substantially 2:1 or less.

Further optionally, each sub-scene video signal may be assigned a minimum width, and upon completing its respective transition into the stage scene video signal, each sub-scene video signal may be composited side by side at substantially no less than its minimum width to form the stage scene video signal. Alternatively or in addition, the composited width of each sub-scene video signal being transitioned may increase throughout the transition until the composited width is substantially equal to or greater than the corresponding minimum width. Further alternatively or in addition, the sub-scene video signals may be composited side by side at substantially no less than their minimum widths, each at a respective width such that the widths of all composited sub-scene video signals sum to substantially the width of the stage scene video signal.

In some cases, the widths of the sub-scene video signals within the stage scene video signal may be composited to change according to an activity criterion detected at one or more bearings of interest corresponding to the sub-scene video signals, while the width of the stage scene video signal is kept constant. In other cases, compositing the two or more sub-scene video signals together with the one or more additional sub-scene video signals to form the stage scene video signal includes transitioning the one or more additional sub-scene video signals into the stage scene video signal by reducing the width of at least one of the two or more sub-scene video signals by an amount corresponding to the width of the one or more additional sub-scene video signals.

Further optionally, each sub-scene video signal may be assigned a respective minimum width, and each sub-scene video signal may be composited side by side at substantially no less than its corresponding minimum width to form the stage scene video signal. When the sum of the respective minimum widths of the two or more sub-scene video signals, together with the one or more additional sub-scene video signals, exceeds the width of the stage scene video signal, at least one of the two or more sub-scene video signals may be transitioned to be removed from the stage scene video signal. Optionally, the sub-scene video signal transitioned to be removed from the stage scene video signal corresponds to the respective bearing of interest at which an activity criterion was least recently satisfied.
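One way to picture the minimum-width rule and the removal of the sub-scene whose activity criterion was satisfied longest ago is the sketch below. It is an assumption-laden illustration: the dictionary keys (`bearing`, `min_width`, `last_active`), the even split of spare pixels, and the sort by bearing are all hypothetical choices, not details given in the source.

```python
def fit_to_stage(subscenes, stage_width):
    """Drop least recently active sub-scenes until minimum widths fit.

    subscenes: list of dicts with hypothetical keys 'bearing' (degrees),
    'min_width' (pixels) and 'last_active' (timestamp in seconds).
    """
    kept = list(subscenes)
    while kept and sum(s["min_width"] for s in kept) > stage_width:
        oldest = min(kept, key=lambda s: s["last_active"])
        kept.remove(oldest)
    return kept


def allocate_widths(subscenes, stage_width):
    """Return (bearing, width) pairs whose widths sum to stage_width.

    Each surviving sub-scene gets at least its minimum width; spare
    pixels are spread evenly (an assumed policy). Sorting by bearing is
    a simple stand-in for keeping a left-to-right order.
    """
    kept = sorted(fit_to_stage(subscenes, stage_width),
                  key=lambda s: s["bearing"])
    if not kept:
        return []
    spare = stage_width - sum(s["min_width"] for s in kept)
    share, extra = divmod(spare, len(kept))
    widths = []
    for i, s in enumerate(kept):
        # The first `extra` sub-scenes absorb the one-pixel remainders.
        bonus = share + (1 if i < extra else 0)
        widths.append((s["bearing"], s["min_width"] + bonus))
    return widths
```

For example, three sub-scenes with minimum widths of 500, 400, and 500 pixels do not fit a 1280-pixel stage, so the one that was active longest ago is dropped and the remaining two share the full width.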

In either of the above aspects, the left-to-right order, with respect to the wide camera, among the respective bearings of interest of the two or more sub-scene video signals and the one or more additional sub-scene video signals may be preserved when the two or more sub-scene video signals are composited together with the one or more additional sub-scene video signals to form the stage scene video signal.

Further in either of the above aspects, each respective bearing of interest from the panoramic video signal may be selected depending on a selection criterion detected at that bearing of interest with respect to the wide camera. After a selection criterion is no longer true, the corresponding sub-scene video signal may be transitioned to be removed from the stage scene video signal. Alternatively or in addition, the selection criteria may include the presence of an activity criterion satisfied at the respective bearing of interest. In this case, the processor may count the time since the activity criterion was satisfied at the respective bearing of interest. A predetermined period of time after the activity criterion was satisfied at the respective bearing of interest, the respective sub-scene signal may be transitioned to be removed from the stage scene video signal.
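The time-counting behaviour described above might be sketched as a simple timeout check. The function name, the mapping of bearings to timestamps, and the `hold_seconds` parameter are hypothetical; the source only says that a predetermined period is counted from when the activity criterion was satisfied.

```python
def bearings_to_remove(last_met, now, hold_seconds):
    """Return bearings whose activity criterion has not been met recently.

    last_met maps a bearing of interest (degrees) to the time, in
    seconds, at which its activity criterion was last satisfied. A
    bearing is returned once more than hold_seconds have elapsed.
    """
    return sorted(b for b, t in last_met.items() if now - t > hold_seconds)
```

A compositor could call this once per frame and begin the removal transition for each returned bearing, leaving recently active sub-scenes on the stage.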

In a further variation of the above aspects, the processor may subsample a reduced panoramic video signal having an aspect ratio of substantially 8:1 or greater from the panoramic video signal, and composite the two or more sub-scene video signals together with the reduced panoramic video signal to form a stage scene video signal that has an aspect ratio of substantially 2:1 or less and includes a plurality of side-by-side sub-scene video signals and the panoramic video signal. Optionally, the two or more sub-scene video signals may be composited together with the reduced panoramic video signal to form a stage scene video signal that has an aspect ratio of substantially 2:1 or less and includes a plurality of side-by-side sub-scene video signals with the panoramic video signal above them, the panoramic video signal occupying no more than 1/5 of the area of the stage scene video signal and extending substantially across the width of the stage scene video signal.
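As a rough worked example of the layout constraint above, the following sketch computes a full-width panorama strip placed above the sub-scene row. The function name, the rectangle convention `(x, y, width, height)`, and the policy of capping the strip height at one fifth of the stage height are assumptions made for illustration.

```python
def stage_layout(stage_w, stage_h, pano_aspect=8, max_area_divisor=5):
    """Split a stage into a top panorama strip and a sub-scene row.

    The strip spans the full stage width. Its height is what an
    8:1 (pano_aspect) strip would need, capped so the strip covers at
    most 1/max_area_divisor of the stage area. Capping the height only
    makes the strip relatively wider, so it stays at 8:1 or greater.
    """
    strip_h = min(stage_w // pano_aspect, stage_h // max_area_divisor)
    return {
        "panorama": (0, 0, stage_w, strip_h),
        "subscenes": (0, strip_h, stage_w, stage_h - strip_h),
    }
```

For a 1280x720 stage, an uncapped 8:1 strip would be 160 pixels tall (about 22% of the area), so the cap reduces it to 144 pixels, which is exactly one fifth of the stage area.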

In a further variation of the above aspects, the processor or a related processor may subsample a text video signal from a text document, and transition the text video signal into the stage scene video signal by replacing at least one of the two or more sub-scene video signals with the text video signal.

Optionally, the processor may, based on a retention criterion, set at least one of the two or more sub-scene video signals as a protected sub-scene video signal that is protected from transition. In this case, the processor may transition the one or more additional sub-scene video signals into the stage scene video signal by replacing at least one of the two or more sub-scene video signals, and/or by transitioning a sub-scene video signal other than the protected sub-scene.

In some cases, the processor may alternatively or in addition set a sub-scene emphasis operation based on an emphasis criterion, with at least one of the two or more sub-scene video signals being emphasized according to the sub-scene emphasis operation based on a corresponding emphasis criterion. Optionally, the processor may set a sub-scene participant notification operation based on a criterion sensed by a sensor, with a local reminder indicator (e.g., a light, blinking, or a sound) being activated according to the notification operation based on the corresponding sensed criterion.

In one aspect of the present embodiments, a process for tracking sub-scenes at bearings of interest within a wide video signal may include monitoring an angular range with an acoustic sensor array and a wide camera observing a field of view of substantially 90 degrees or greater. A first bearing of interest may be identified along a localization of at least one of an acoustic recognition and a visual recognition detected within the angular range. A first sub-scene video signal may be subsampled from the wide camera along the first bearing of interest. The width of the first sub-scene video signal may be set according to a signal characteristic of at least one of the acoustic recognition and the visual recognition.

In a related aspect of the present embodiments, a meeting camera may be configured to output a video signal including sub-scenes subsampled and scaled from a wide angle scene, and to track sub-scenes and/or bearings of interest within the wide video signal. The meeting camera and/or its processor may be configured to monitor an angular range with an acoustic sensor array and a wide camera observing a field of view of substantially 90 degrees or greater. The processor may be configured to identify a first bearing of interest along a localization of at least one of an acoustic recognition and a visual recognition detected within the angular range. The processor may be further configured to subsample a first sub-scene video signal from the wide camera along the first bearing of interest to memory (buffer or video memory). The processor may be further configured to set the width of the first sub-scene video signal according to a signal characteristic of at least one of the acoustic recognition and the visual recognition.

In any of the above aspects, the signal characteristic may represent a confidence level of either or both of the acoustic recognition and the visual recognition. Optionally, the signal characteristic may represent the width of a feature recognized within either or both of the acoustic recognition and the visual recognition. Further optionally, the signal characteristic may correspond to an approximate width of a human face recognized along the first bearing of interest.

Alternatively or in addition, when a width is not set according to a signal characteristic of the visual recognition, a predetermined width may be set along a localization of an acoustic recognition detected within the angular range. Further optionally, the first bearing of interest may be determined by a visual recognition, with the width of the first sub-scene video signal then set according to a signal characteristic of the visual recognition. Still further optionally, the first bearing of interest may be identified as directed toward an acoustic recognition detected within the angular range. In this case, the processor may identify a visual recognition proximate to the acoustic recognition, and the width of the first sub-scene video signal may then be set according to a signal characteristic of the visual recognition proximate to the acoustic recognition.
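The width-setting fallback described above could look something like the following sketch. The names and the `scale` factor of 3.0 (face width to sub-scene width) are hypothetical values chosen for illustration; the source does not specify a ratio.

```python
def subscene_width(face_width_px, default_width_px, scale=3.0,
                   max_width_px=None):
    """Choose a sub-scene width from a visual recognition, if any.

    face_width_px: approximate width of a recognized face, or None when
    only an acoustic recognition localized the bearing of interest.
    default_width_px: the predetermined width used in that fallback case.
    scale: assumed ratio of sub-scene width to face width.
    """
    if face_width_px is None:
        width = default_width_px
    else:
        width = int(round(face_width_px * scale))
    if max_width_px is not None:
        width = min(width, max_width_px)
    return width
```

With these assumed values, an 80-pixel face yields a 240-pixel sub-scene, while a bearing found only by sound falls back to the predetermined width.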

In another aspect of the present embodiments, a processor may be configured to perform a process of tracking sub-scenes at bearings of interest within a wide video signal, the process including scanning a subsampling window through a motion video signal corresponding to a wide camera field of view of substantially 90 degrees or greater. The processor may be configured to identify candidate bearings within the subsampling window, each bearing of interest corresponding to a localization of a visual recognition detected within the subsampling window. The processor may then record the candidate bearings in a spatial map, and may monitor an angular range corresponding to the wide camera field of view using an acoustic sensor array for acoustic recognition.

Optionally, when an acoustic recognition is detected proximate to one candidate bearing recorded in the spatial map, the processor may further snap a first bearing of interest to correspond substantially to that candidate bearing, and may subsample a first sub-scene video signal from the wide camera along the first bearing of interest. Optionally, the processor may be further configured to set the width of the first sub-scene video signal according to a signal characteristic of the acoustic recognition. Further optionally, the signal characteristic may represent a confidence level of the acoustic recognition, or may represent the width of a feature recognized within either or both of the acoustic recognition and the visual recognition. The signal characteristic may alternatively or in addition correspond to an approximate width of a human face recognized along the first bearing of interest. Optionally, when a width is not set according to a signal characteristic of the visual recognition, a predetermined width may be set along a localization of an acoustic recognition detected within the angular range.
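The snapping of an acoustic bearing to a nearby candidate bearing in the spatial map might be sketched as below. The 15-degree tolerance and the function names are assumptions; the source says only that the acoustic recognition is detected "proximate" to a recorded candidate.

```python
def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def snap_bearing(acoustic_deg, candidates_deg, tolerance_deg=15.0):
    """Snap an acoustic bearing to the nearest recorded candidate bearing.

    Returns the acoustic bearing unchanged when no candidate in the
    spatial map lies within tolerance_deg (an assumed threshold).
    """
    if not candidates_deg:
        return acoustic_deg
    nearest = min(candidates_deg, key=lambda c: angular_gap(acoustic_deg, c))
    if angular_gap(acoustic_deg, nearest) <= tolerance_deg:
        return nearest
    return acoustic_deg
```

Because the comparison wraps around 360 degrees, a sound localized at 352 degrees snaps to a face recorded at 5 degrees, while a sound far from every recorded face keeps its own bearing.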

In another aspect of the present embodiments, a processor may be configured to track sub-scenes at bearings of interest, including recording a motion video signal corresponding to a wide camera field of view of substantially 90 degrees or greater. The processor may be configured to monitor an angular range corresponding to the wide camera field of view using an acoustic sensor array for acoustic recognition, and to identify a first bearing of interest directed toward an acoustic recognition detected within the angular range. A subsampling window may be located in the motion video signal according to the first bearing of interest, and a visual recognition may be detected within the subsampling window. Optionally, the processor may be configured to subsample a first sub-scene video signal captured from the wide camera and substantially centered on the visual recognition, and to set the width of the first sub-scene video signal according to a signal characteristic of the visual recognition.

In a further aspect of the present embodiments, a processor may be configured to track sub-scenes at bearings of interest within a wide video signal, including monitoring an angular range with an acoustic sensor array and a wide camera observing a field of view of substantially 90 degrees or greater. A plurality of bearings of interest may be identified, each directed toward a localization within the angular range. The processor may be configured to maintain a spatial map of recorded characteristics corresponding to the bearings of interest, and to subsample a sub-scene video signal from the wide camera substantially along one or more bearings of interest. The width of the sub-scene video signal may be set according to a recorded characteristic corresponding to at least one bearing of interest.

In a further aspect of the present embodiments, a processor may be configured to perform a process of tracking sub-scenes at bearings of interest within a wide video signal, the process including monitoring an angular range with an acoustic sensor array and a wide camera observing a field of view of substantially 90 degrees or greater, and identifying a plurality of bearings of interest, each directed toward a localization within the angular range. A sub-scene video signal may be sampled from the wide camera substantially along at least one bearing of interest, and the width of the sub-scene video signal may be set by widening the sub-scene video signal until a threshold based on at least one recognition criterion is satisfied. Optionally, a change vector for each bearing of interest may be predicted based on a change in one of velocity and direction of a recorded characteristic corresponding to a localization, and the position of the bearing of interest may be updated based on the prediction. Optionally, a search area for a localization may be predicted based on the most recent position of a recorded characteristic corresponding to the localization, and the position of the localization may be updated based on the prediction.
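The idea of widening a sub-scene until a recognition-based threshold is met can be sketched as a simple loop. Everything here is hypothetical scaffolding: `score_fn` stands in for whatever recognition criterion is used, and the step size and width limits are arbitrary illustrative values.

```python
def widen_until(score_fn, bearing_deg, start_width, max_width,
                threshold, step=20):
    """Grow a sub-scene width until score_fn reaches the threshold.

    score_fn(bearing_deg, width) is an assumed callback returning a
    recognition score for a crop of the given width centred on the
    bearing. Widening stops at max_width even if the threshold is
    never reached.
    """
    width = start_width
    while width < max_width and score_fn(bearing_deg, width) < threshold:
        width = min(width + step, max_width)
    return width
```

In practice the score might be a face or shoulder detection confidence, and the step could be adapted so that wide features such as a whiteboard are covered in few iterations.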

デバイス１００によって収集されたワイドシーン内の角度分離されたサブシーンおよび／または対象サブシーンを合成する、追跡する、および／または表示するのに好適なデバイスの実施形態の概略ブロック図である。2 is a schematic block diagram of an embodiment of a device suitable for compositing, tracking, and / or displaying angle separated sub-scenes and / or subject sub-scenes in a wide scene collected by device 100. FIG. デバイス１００によって収集されたワイドシーン内の角度分離されたサブシーンおよび／または対象サブシーンを合成する、追跡する、および／または表示するのに好適なデバイスの実施形態の概略ブロック図である。2 is a schematic block diagram of an embodiment of a device suitable for compositing, tracking, and / or displaying angle separated sub-scenes and / or subject sub-scenes in a wide scene collected by device 100. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 
図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 図１Ａおよび図１Ｂのデバイス１００のための、ワイドシーンおよび／またはパノラマシーンを収集するのに好適な会議カメラ１４またはカメラタワー１４配列の実施形態の概略図である。1B is a schematic diagram of an embodiment of a conference camera 14 or camera tower 14 arrangement suitable for collecting wide and / or panoramic scenes for the device 100 of FIGS. 1A and 1B. FIG. 
A top-down view of a conference camera use case, showing three participants.
A top-down view of a conference camera panoramic image signal, showing three participants.
A top-down view of a conference camera use case showing a conference table, showing three participants and including a depiction of face width setting or sub-scene identification.
A top-down view of a conference camera panoramic image signal, showing three participants and including a depiction of face width setting or sub-scene identification.
A top-down view of a conference camera use case showing a conference table, showing three participants and including a depiction of shoulder width setting or sub-scene identification.
A top-down view of a conference camera panoramic image signal, showing three participants and including a depiction of shoulder width setting or sub-scene identification.
A top-down view of a conference camera use case showing a conference table, showing three participants and a whiteboard and including a depiction of the identification of a wider sub-scene.
A top-down view of a conference camera panoramic image signal, showing three participants and a whiteboard and including a depiction of the identification of a wider sub-scene.
A top-down view of a conference camera use case showing a ten-seat conference table, showing five participants and including a depiction of the identification of a visual minimum width and bearing and an acoustic minimum width and bearing.
A top-down view of a conference camera panoramic image signal, showing five participants and including a depiction of the identification of a visual minimum width and bearing and an acoustic minimum width and bearing.
A schematic view of a conference camera video signal, minimum widths, and the extraction of sub-scene video signals and a panoramic video signal to be composited into a stage scene video signal.
A schematic view of sub-scene video signals and a panoramic video signal to be composited into a stage scene video signal.
A view showing a possible composited output or stage scene video signal.
A view showing a possible composited output or stage scene video signal.
A view showing a possible composited output or stage scene video signal.
A schematic view of a conference camera video signal, minimum widths, and the extraction of alternative sub-scene video signals and an alternative panoramic video signal to be composited into a stage scene video signal.
A schematic view of alternative sub-scene video signals and an alternative panoramic video signal to be composited into a stage scene video signal.
A view showing a possible alternative composited output or stage scene video signal.
A view showing a possible alternative composited output or stage scene video signal.
A view showing a possible alternative composited output or stage scene video signal.
A schematic view of a panoramic video signal adjusted so that the conference table image is arranged in a more natural and comfortable view.
A schematic view of a possible composited output or stage scene video signal.
A schematic view of a possible composited output or stage scene video signal.
A schematic view of an alternative way in which videoconferencing software may display the composited output or stage scene video signal.
A schematic view of an alternative way in which videoconferencing software may display the composited output or stage scene video signal.
A flowchart including steps for compositing a stage scene (video signal).
A detailed flowchart including steps for creating sub-scenes (sub-scene video signals) based on bearings of interest.
A detailed flowchart including steps for compositing sub-scenes into a stage scene video signal.
A detailed flowchart including steps for outputting the composited stage scene video signal as a single camera signal.
A detailed flowchart including a first mode of performing steps for localizing and/or setting bearings of interest and/or sub-scene widths.
A detailed flowchart including a second mode of performing steps for localizing and/or setting bearings of interest and/or sub-scene widths.
A detailed flowchart including a third mode of performing steps for localizing and/or setting bearings of interest and/or sub-scene widths.
A view showing the operation of an embodiment, substantially corresponding to FIGS. 3A-5B, that includes a conference camera attached to a local PC having a videoconferencing client that receives the single camera signal; the PC is in turn connected to the Internet, and two remote PCs or the like also receive the single camera signal within the videoconferencing display.
A view showing the operation of an embodiment, substantially corresponding to FIGS. 3A-5B, that includes a conference camera attached to a local PC having a videoconferencing client that receives the single camera signal; the PC is in turn connected to the Internet, and two remote PCs or the like also receive the single camera signal within the videoconferencing display.
A view showing the operation of an embodiment, substantially corresponding to FIGS. 3A-5B, that includes a conference camera attached to a local PC having a videoconferencing client that receives the single camera signal; the PC is in turn connected to the Internet, and two remote PCs or the like also receive the single camera signal within the videoconferencing display.
A view showing a variation of the system of FIGS. 19-21 in which the videoconferencing client uses overlapping video views instead of discrete, neighboring views.
A view showing a variation of the system of FIGS. 19-21, substantially corresponding to FIGS. 6A-6B, that includes a high-resolution camera view for a whiteboard.
A view showing a variation of the system of FIGS. 19-21 that includes a high-resolution text document view (e.g., text editor, word processing, presentation, or spreadsheet).
A schematic view of an arrangement in which a videoconferencing client is instantiated for each sub-scene, using a configuration similar to that of FIG. 1B.
A schematic view of some exemplary iconography and symbols used throughout FIGS. 1-26.

Detailed Description
Conference Camera
FIGS. 1A and 1B are schematic block diagrams of embodiments of devices suitable for compositing, tracking, and/or displaying angularly separated sub-scenes and/or sub-scenes of interest within wide scenes collected by the devices, namely conference cameras 100.

FIG. 1A shows a device constructed to communicate as a conference camera 100 or conference "webcam", e.g., as a USB peripheral connected to a USB host or hub of a connected laptop, tablet, or mobile device 40, and to provide a single video image of an aspect ratio, pixel count, and proportion commonly used by off-the-shelf video chat or videoconferencing software such as "Google (registered trademark) Hangout", "Skype", or "Facetime". The device 100 includes a "wide camera" 2, 3, or 5, e.g., a camera capable of capturing two or more attendees and directed to survey a meeting of attendees or participants M1, M2 ... Mn. The camera 2, 3, or 5 may include one digital imager or lens, or two or more digital imagers or lenses (e.g., stitched in software or otherwise). Note that, depending on the location of the device 100 within the meeting, the field of view of the wide camera 2, 3, or 5 may be 70 degrees or less. In one or more embodiments, however, the wide camera 2, 3, 5 is useful in the center of the meeting, in which case the wide camera may have a horizontal field of view of substantially 90 degrees, or more than 140 degrees (not necessarily contiguous), or up to 360 degrees.

In large conference rooms (e.g., rooms designed to seat eight or more people), it may be useful to have multiple wide-angle camera devices that each record a wide field of view (e.g., substantially 90 degrees or more) and cooperatively stitch together a very wide scene to capture the most pleasing angle. For example, a wide-angle camera at the far end of a long (10'-20') table may yield an unsatisfying, distant view of the speaker SPKR, whereas multiple cameras spread across the table (e.g., one for every five seats) may yield at least one satisfactory or pleasing view. The camera 2, 3, 5 may image or record a panoramic scene (e.g., with an H:V horizontal-to-vertical proportion, or aspect ratio, of 2.4:1 to 10:1) and/or make this signal available via the USB connection.

As discussed with respect to FIGS. 2A-2L, the height of the wide camera 2, 3, 5 from the base of the conference camera 100 is preferably more than 8 inches, so that the camera 2, 3, 5 may be higher than typical laptop screens at a meeting and thereby have an unobstructed and/or approximately eye-level view of the attendees M1, M2 ... Mn. A microphone array 4 includes at least two microphones and may obtain bearings of interest to nearby sounds or speech by beamforming, relative time of flight, localization, or received signal strength differential, as is known in the art. The microphone array 4 may include a plurality of microphone pairs directed to cover at least substantially the same angular range as the field of view of the wide camera 2.
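As a hedged illustration of the relative time-of-flight approach mentioned above, the following Python sketch estimates a bearing from the difference in arrival time at a single microphone pair. The far-field (plane wave) model, the function name bearing_from_tdoa, and the speed-of-sound constant are assumptions made for illustration only; they are not taken from the specification, which leaves the localization technique to what is known in the art.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def bearing_from_tdoa(delay_s, mic_spacing_m):
    """Estimate a far-field bearing, in degrees from the pair's broadside.

    delay_s is the arrival-time difference between the two microphones
    (hypothetical input; positive means the sound reached one microphone
    later than the other). The model is sin(theta) = c * delay / spacing.
    """
    if mic_spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A single pair resolves a bearing only within a half-plane, which is consistent with the array 4 using several pairs to cover the camera's angular range.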

The microphone array 4 is optionally arranged together with the wide camera 2, 3, 5 at a height of more than 8 inches, again so that a direct "line of sight" exists between the array 4 and the attendees M1, M2 ... Mn while they are speaking, unobstructed by typical laptop screens. A CPU and/or GPU (and associated circuits such as a camera circuit) 6, for processing computing and graphical events, is connected to each of the wide camera 2, 3, 5 and the microphone array 4. ROM and RAM 8 are connected to the CPU and GPU 6 for retaining and receiving executable code. Network interfaces and stacks 10 are provided for USB, Ethernet (registered trademark), and/or WiFi, connected to the CPU 6. One or more serial buses interconnect these electronic components, which are powered by DC, AC, or battery power.

The camera circuit of the camera 2, 3, 5 may output a processed or rendered image or video stream as a single camera image signal, video signal, or stream in landscape orientation with an "H:V" horizontal-to-vertical proportion or aspect ratio from 1.25:1 to 2.4:1 or 2.5:1 (e.g., inclusive of 4:3, 16:10, and 16:9 proportions), and/or, as noted above, with a suitable lens and/or stitching circuit, may output a panoramic image or video stream as a single camera image signal of substantially 2.4:1 or greater. The conference camera 100 of FIG. 1A may normally be connected as a USB peripheral to a laptop, tablet, or mobile device 40 (having a display, network interface, computing processor, memory, camera, and microphone sections interconnected by at least one bus), upon which multi-party teleconferencing, videoconferencing, or video chat software is hosted and connectable for teleconferencing with remote clients 50 via the Internet 60.
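The two output ranges described above can be told apart with a simple ratio check. The sketch below is illustrative only: the function name classify_aspect is hypothetical, and because the specification's ranges meet at about 2.4:1 (or 2.5:1), treating exactly 2.4:1 as panoramic is an assumption.

```python
def classify_aspect(width_px, height_px):
    """Classify a frame by its H:V proportion.

    Thresholds follow the ranges quoted in the text (1.25:1 to 2.4:1 for
    a landscape single camera signal, 2.4:1 or greater for a panoramic
    signal); the handling of the shared boundary is an assumption.
    """
    if width_px <= 0 or height_px <= 0:
        raise ValueError("dimensions must be positive")
    ratio = width_px / height_px
    if ratio >= 2.4:
        return "panoramic"
    if ratio >= 1.25:
        return "landscape"
    return "other"
```

For instance, a 1920 x 1080 (16:9) frame falls in the landscape range, while a 3840 x 1080 strip falls in the panoramic range.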

FIG. 1B is a variation of FIG. 1A in which the device 100 of FIG. 1A and the teleconferencing device 40 are integrated. The camera circuit output, as a single camera image signal, video signal, or video stream, is directly available to the CPU, GPU, associated circuits, and memory 5, 6, and the teleconferencing software is instead hosted by that CPU, GPU, and associated circuits and memory 5, 6. The device 100 is directly connectable (e.g., via WiFi or Ethernet) for teleconferencing with remote clients 50 via the Internet 60 or INET. A display 12 provides a user interface for operating the teleconferencing software and for showing the teleconferencing views and graphics described herein to the conference attendees M1, M2 ... M3. The device or conference camera 100 of FIG. 1A may alternatively be connected directly to the Internet 60, thereby allowing remote clients 50 to record video directly to a remote server or to access video live from such a server.

FIGS. 2A through 2L are schematic views of embodiments of conference camera 14 or camera tower 14 arrangements, suitable for collecting wide and/or panoramic scenes, for the device or conference camera 100 of FIGS. 1A and 1B. "Camera tower" 14 and "conference camera" 14 may be used substantially interchangeably herein, although a conference camera need not be a camera tower. The height of the wide camera 2, 3, 5 from the base of the device 100 in FIGS. 2A-2L is preferably more than 8 inches and less than 15 inches.

In the camera tower 14 arrangement of FIG. 2A, multiple cameras are arranged peripherally at the camera level (8 to 15 inches) of the camera tower 14, equiangularly spaced. The number of cameras is determined by the field of view of the cameras and the angle to be spanned; when forming a panoramic stitched view, the cumulative angle spanned should include overlap between the individual cameras. In the case of FIG. 2A, for example, four cameras 2a, 2b, 2c, 2d (labeled 2a-2d), each with a 100-110 degree field of view (shown in dashed lines), are arranged at 90 degrees to one another to provide a cumulative view, or a stitchable or stitched view, of 360 degrees about the camera tower 14.
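The relationship between camera count, field of view, and stitching overlap for an equiangular ring can be checked with a few lines of arithmetic. The Python sketch below is a minimal illustration; the function names and the notion of a required minimum overlap per seam are assumptions introduced here, not parameters named in the specification.

```python
import math


def seam_overlap_deg(num_cameras, fov_deg):
    """Overlap at each seam of a full 360 degree ring of identical cameras.

    With equiangular spacing, adjacent optical axes are 360 / n degrees
    apart, so each seam overlaps by fov - 360 / n degrees. A negative
    result means there is a gap rather than an overlap.
    """
    return fov_deg - 360.0 / num_cameras


def min_cameras_for_ring(fov_deg, min_overlap_deg):
    """Fewest cameras that cover 360 degrees with at least the given
    overlap at every seam (min_overlap_deg is a hypothetical design input)."""
    usable = fov_deg - min_overlap_deg
    if usable <= 0:
        raise ValueError("field of view must exceed the required overlap")
    return math.ceil(360.0 / usable)
```

Under these assumptions, four 100 degree cameras at 90 degree spacing leave 10 degrees of overlap per seam, and 110 degree cameras leave 20 degrees, matching the 100-110 degree example of FIG. 2A.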

In the case of FIG. 2B, for example, three cameras 2a, 2b, 2c (labeled 2a-2c), each with a field of view of 130 degrees or more (shown in dashed lines), are arranged at 120 degrees to one another, again to provide a 360 degree cumulative or stitchable view about the tower 14. The vertical field of view of the cameras 2a-2d is less than the horizontal field of view, e.g., less than 80 degrees. Images, video, or sub-scenes from each camera 2a-2d may be processed to identify bearings of interest or sub-scenes before or after known optical correction such as stitching, dewarping, or distortion compensation, but would typically be so corrected before output.

In the camera tower 14 arrangement of FIG. 2C, a single fisheye or near-fisheye camera 3a, directed upward, is arranged atop the camera tower 14 at camera level (8 to 15 inches). In this case, the fisheye camera lens is arranged with a 360 degree continuous horizontal view and a vertical field of view of approximately 215 degrees (e.g., 190-230 degrees) (shown in dashed lines). Alternatively, a single catadioptric "cylindrical image" camera or lens 3b, e.g., having a cylindrical transparent shell, top parabolic mirror, black central post, and telecentric lens configuration as shown in FIG. 2D, is arranged with a 360 degree continuous horizontal view and a vertical field of view of approximately 40-80 degrees, centered approximately on the horizon. For each of the fisheye and cylindrical image cameras, the vertical field of view, positioned 8-15 inches above the conference table, extends below the horizon, permitting the attendees M1, M2 ... Mn around the conference table to be imaged down to waist level or below. Images, video, or sub-scenes from each camera 3a or 3b may be processed to identify bearings of interest or sub-scenes before or after known optical correction for fisheye or catadioptric lenses, such as dewarping or distortion compensation, but would typically be so corrected before output.

In the camera tower 14 arrangement of FIG. 2L, multiple cameras are arranged peripherally at the camera level (8 to 15 inches) of the camera tower 14, equiangularly spaced. The number of cameras in this case is not intended to form a completely contiguous panoramic stitched view, and the cumulative angle spanned does not have overlap among the individual cameras. In the case of FIG. 2L, for example, two cameras 2a, 2b, each with a field of view of 130 degrees or more (shown in dashed lines), are arranged at 90 degrees to one another to provide separated views covering approximately 260 degrees or more on both sides of the camera tower 14. This arrangement is useful for long conference tables CT. In the case of FIG. 2E, for example, the two cameras 2a-2b pan and/or are rotatable about a vertical axis to cover the bearings of interest B1, B2 ... Bn described herein. Images, video, or sub-scenes from each camera 2a-2b may be scanned or analyzed as described herein before or after optical correction.

FIGS. 2F and 2G show table head or end arrangements, i.e., each of the camera towers 14 shown in FIGS. 2F and 2G is intended to be placed advantageously at the head of a conference table CT. As shown in FIGS. 3A-6A, a large flat panel display FP for presentations and videoconferencing is often placed at the head or end of a conference table CT, and the arrangements of FIGS. 2F and 2G are alternatively placed directly in front of and close to the flat panel FP. In the camera tower 14 arrangement of FIG. 2F, two cameras, each with a field of view of approximately 130 degrees, are placed at 120 degrees to one another, covering the two sides of a long conference table CT. A display and touch interface 12 is directed down the table (particularly useful when there is no flat panel FP on the wall) and displays a client for the videoconferencing software. This display 12 may be a connected, connectable, or removable tablet or mobile device. In the camera tower arrangement of FIG. 2G, one high-resolution, optionally tilting camera 7 (optionally connected to its own independent teleconferencing client software or instance) is directable at an object of interest (such as a whiteboard WB or a page or paper on the surface of the table CT), and two independently panning and/or tilting cameras 5a, 5b, each with a field of view of, e.g., 100-110 degrees, are directed or directable to cover the bearings of interest.

Images, video, or sub-scenes from each camera 2a, 2b, 5a, 5b, 7 may be scanned or analyzed as described herein before or after optical correction. FIG. 2H shows a variation in which two identical units, each having two cameras 2a-2b or 2c-2d with 100-130 degree fields of view arranged 90 degrees apart, may be used independently as >180 degree view units at the head or end of a table CT, but may also optionally be combined back-to-back to form a unit substantially identical to that of FIG. 2A, having four cameras 2a-2d that span an entire room and suitably placed in the middle of a conference table CT. Each of the tower units 14, 14 of FIG. 2H would be provided with a network interface and/or a physical interface for forming the combined unit. The two units may alternatively or additionally be freely arranged, or arranged in concert as discussed below with respect to FIGS. 2K, 6A, 6B, and 14.

In FIG. 2J, a fisheye camera or lens 3a similar to the camera of FIG. 2C (physically and/or conceptually interchangeable with a catadioptric lens 3b) is arranged atop the camera tower 14 at camera level (8 to 15 inches). One rotatable, high-resolution, optionally tilting camera 7 (optionally connected to its own independent teleconferencing client software or instance) is directable at an object of interest (such as a whiteboard WB or a page or paper on the surface of the table CT). As shown in FIGS. 6A, 6B, and 14, this arrangement works advantageously when a first teleconferencing client (in FIG. 14, on or connected to the "Meeting Room (Local) Display") receives the composited sub-scenes from the scene SC camera 3a, 3b as a single camera image or composited output CO, e.g., via a first physical or virtual network interface or channel 10a, and a second teleconferencing client (in FIG. 14, resident within the device 100 and connected to the Internet via a second physical or virtual network interface or channel 10b) receives the independent high-resolution image from the camera 7.

FIG. 2K shows a similar arrangement, in which separate videoconferencing channels for the images from cameras 3a, 3b and 7 may likewise be advantageous; in the arrangement of FIG. 2K, however, each camera 3a, 3b versus 7 has its own tower 14 and is optionally connected to the remaining tower 14 via an interface 15 (which may be wired or wireless). In the arrangement of FIG. 2K, the panoramic tower 14 with the scene SC camera 3a, 3b may be placed in the center of the conference table CT, and the directed, high-resolution tower 14 may be placed at the head of the table CT, or anywhere a directed, high-resolution, separate client image or video stream would be of interest. Images, video, or sub-scenes from each camera 3a, 7 may be scanned or analyzed as described herein before or after optical correction.

Use of the Conference Camera
With reference to FIGS. 3A, 3B, and 12, according to an embodiment of the present method of compositing and outputting photographic scenes, a device or conference camera 100 (or 200) is placed atop, for example, a circular or rectangular conference table CT. The device 100 may be located according to the convenience or intent of the conference participants M1, M2, M3 ... Mn.

In any typical meeting, the participants M1, M2 ... Mn will be angularly distributed with respect to the device 100. If the device 100 is placed in the center of the participants M1, M2 ... Mn, the participants can be captured with a panoramic camera, as described herein. Conversely, if the device 100 is placed to one side of the participants (e.g., at one end of the table, or mounted to a flat panel FP), a wide camera (e.g., 90 degrees or more) may be sufficient to span the participants M1, M2 ... Mn.

As shown in FIG. 3A, each of the participants M1, M2 ... Mn will have a respective bearing B1, B2 ... Bn from the device 100, measured, e.g., for illustration purposes, from an origin OR. Each bearing B1, B2 ... Bn may be a range of angles or a nominal angle. As shown in FIG. 3B, an "unrolled", projected, or dewarped fisheye, panoramic, or wide scene SC includes imagery of each participant M1, M2 ... Mn arranged at the expected respective bearing B1, B2 ... Bn. Particularly in the case of a rectangular table CT and/or an arrangement of the device 100 to one side of the table CT, the imagery of each participant M1, M2 ... Mn may be foreshortened or include perspective distortion according to the facing angle of the participant (depicted schematically in FIG. 3B and throughout the drawings with an expected foreshortening direction). Perspective and/or visual geometry correction, as is well known to one of skill in the art, may be applied to foreshortened or perspective-distorted imagery, sub-scenes, or the scene SC, but may not be necessary.
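To make the relationship between a bearing and the unrolled scene SC concrete, the following Python sketch maps a bearing to a pixel column and computes the column range of a sub-scene. It assumes a full 360 degree panorama in which angle maps linearly to column and the origin OR corresponds to column 0; these conventions and the function names are illustrative assumptions, not details given in the specification.

```python
def bearing_to_column(bearing_deg, panorama_width_px):
    """Map a bearing (degrees from the assumed origin OR) to a pixel
    column of a 360 degree panorama with a linear angle-to-column mapping."""
    fraction = (bearing_deg % 360.0) / 360.0
    return round(fraction * panorama_width_px) % panorama_width_px


def subscene_columns(bearing_deg, width_deg, panorama_width_px):
    """Return (left_column, right_column, wraps) for a sub-scene centered
    on a bearing. wraps is True when the range crosses the panorama seam,
    in which case the crop must be assembled from two pieces."""
    half = width_deg / 2.0
    left = bearing_to_column(bearing_deg - half, panorama_width_px)
    right = bearing_to_column(bearing_deg + half, panorama_width_px)
    return left, right, left > right
```

Handling the seam explicitly matters because a participant sitting at the origin bearing would otherwise be split across the two ends of the panoramic image.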

Face Detection and Widening
As one example, modern face detection libraries and APIs using common algorithms (among more than 50 available APIs and SDKs, e.g., Android's FaceDetector.Face class, Objective C's CIDetector class and CIFaceFeature object, and OpenCV's CascadeClassifier class using Haar cascades) usually return the interpupillary distance, as well as the spatial positions of facial features and the facial pose. If the ears of participant Mn are to be included in the range, a rough floor for face width estimation may be about two times the interpupillary distance/angle, with a rough ceiling of three times the interpupillary distance/angle. A rough floor for portrait width estimation (i.e., the head plus some shoulder width) may be twice the face width/angle, with a rough ceiling of four times the face width/angle. Alternatively, a fixed angle or other more direct setting of sub-scene width may be used.
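The rough floors and ceilings above reduce to simple multipliers. The Python sketch below applies them; the function names are hypothetical, and the pixel-to-angle conversion assumes a panorama whose columns map linearly to angle, which the specification does not state.

```python
def face_width_bounds(interpupillary):
    """Rough (floor, ceiling) for face width: 2x to 3x the interpupillary
    distance or angle, in the same unit as the input."""
    return 2.0 * interpupillary, 3.0 * interpupillary


def portrait_width_bounds(face_width):
    """Rough (floor, ceiling) for portrait width (head plus some
    shoulder): 2x to 4x the face width or angle."""
    return 2.0 * face_width, 4.0 * face_width


def pixels_to_degrees(pixels, panorama_width_px, panorama_span_deg=360.0):
    """Convert a pixel extent to an angle, assuming a linear mapping
    across the panorama (an assumption made for this sketch)."""
    return pixels * panorama_span_deg / panorama_width_px
```

Under these assumptions, an interpupillary distance of 32 pixels in a 3840 pixel wide, 360 degree panorama corresponds to 3 degrees, giving a face width between 6 and 9 degrees and a portrait width somewhere between 12 degrees (twice the face floor) and 36 degrees (four times the face ceiling).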

FIGS. 4A-4B and 5A-5B show one exemplary two-step and/or separate identification of both face width and shoulder width (either of which may be a minimum width, as described herein, for setting an initial sub-scene width). As shown in FIGS. 4A and 4B, face widths FW1, FW2 ... FWn, set according to the interpupillary distance or other dimensional analysis of facial features (features, classes, colors, segments, patches, textures, trained classifiers, or other features), are obtained from the panoramic scene SC. In contrast, in FIGS. 5A, 5B, 6A, and 6B, shoulder widths SW1, SW2 ... SWn are set according to the same analysis, scaled by approximately three or four times, or according to a default acoustic resolution or width.

Compositing of Angle-Separated Sub-Scenes
FIGS. 7A and 7B are, respectively, a top-down view of a use case of the meeting camera 100 and the meeting camera panoramic image signal SC. They show a conference table CT seating about ten and five participants M1, M2, M3, M4 and M5, and include a depiction of the identification of a visual minimum width Min.2 with its corresponding vector bearing of interest B2, and of an acoustic minimum width Min.5 with its corresponding angular-range bearing of interest B5.

In FIG. 7A, the meeting camera 100 is placed at the center of a long conference table CT seating ten. Participants M1, M2 and M3 toward the middle of the table CT are therefore the least foreshortened and occupy the most image area and angular view of the camera 100, whereas participants M5 and M4 toward the ends of the table CT are the most foreshortened and occupy the least image area.

In FIG. 7B, the overall scene video signal SC is, for example, a 360-degree video signal that includes all of the participants M1 ... M5. The conference table CT appears in the scene SC with the distorted "W" shape characteristic of panoramic views, while the participants M1 ... M5 appear at different sizes and with different foreshortened aspects depending on their position and distance from the meeting camera 100 (shown schematically, simply as rectangular bodies and oval heads). As shown in FIGS. 7A and 7B, each participant M1 ... M5 may be represented in memory 8 by a respective bearing B1 ... B5, determined by acoustic, visual, or sensor localization of sound, motion, or features. As depicted in FIGS. 7A and 7B, participant M2 may have been localized by face detection (with a corresponding vector-like bearing B2 and minimum width Min.2 recorded in memory, determined in proportion to the face width obtained from the face detection heuristic), and participant M5 may have been localized by beamforming, relative signal strength, and/or time of flight of speech-like audio signals (with a corresponding sector-like bearing B5 and minimum width Min.5 recorded in memory, determined in proportion to the approximate resolution of the acoustic array 4).

FIG. 8A shows a schematic view of the meeting camera 100 video signal, the minimum widths Min.n, and the extraction of the sub-scene video signals SS2, SS5 and the panoramic video signal SC.R to be composited into the stage scene video signal STG, CO. The upper portion of FIG. 8A essentially reproduces FIG. 7B. As shown in FIG. 8A, the overall scene video signal SC from FIG. 7B may be sub-sampled according to the bearings of interest (limited in this example to bearings B2 and B5) and widths (limited in this example to widths Min.2 and Min.5). The sub-scene video signal SS2 is at least as wide as the (visually determined) face-width limit Min.2, but may be wider, or scaled wider, relative to the width, height, and/or available area of the stage STG or the aspect ratio and available area of the composite output CO. The sub-scene video signal SS5 is at least as wide as the (acoustically determined) acoustic approximation Min.5, but may likewise be wider or scaled wider, and may likewise be limited. The reduced panoramic scene SC.R in this capture is a version of the overall scene SC cropped at the top and bottom, in this case to a 10:1 aspect ratio. Alternatively, the reduced panoramic scene SC.R may be derived from the overall panoramic scene video signal SC by proportional or anamorphic scaling (e.g., the top and bottom portions remain but are compressed more than the middle). In either case, in the example of FIGS. 8A and 8B, three different video signal sources SS2, SS5 and SC.R are available to be composited into the stage STG or composite output CO.
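
Sub-sampling by bearing and width, and cropping the panorama to a strip, can be illustrated with plain index arithmetic. The sketch below assumes a full 360-degree panorama whose columns map linearly to bearing, with bearing 0 at column 0, and a strip cropped symmetrically about the vertical center; none of these conventions is stated in the description, and the function names are hypothetical.

```python
def subscene_columns(pano_width_px, bearing_deg, width_deg):
    """Column indices of a sub-scene centered on a bearing, wrapping at 360 degrees."""
    px_per_deg = pano_width_px / 360.0
    # Left edge of the sub-scene, wrapped into [0, 360) and converted to a column.
    first = round(((bearing_deg - width_deg / 2.0) % 360.0) * px_per_deg) % pano_width_px
    count = max(1, round(width_deg * px_per_deg))
    return [(first + i) % pano_width_px for i in range(count)]


def panorama_strip_rows(pano_width_px, pano_height_px, aspect=10.0):
    """Top and bottom rows (half-open) of a centered strip with the given aspect ratio."""
    strip_height = min(pano_height_px, round(pano_width_px / aspect))
    top = (pano_height_px - strip_height) // 2
    return top, top + strip_height
```

Returning explicit column indices keeps the wrap-around case simple: a sub-scene straddling the panorama seam is just a list that runs off the right edge and continues from column 0.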

FIG. 8B essentially reproduces the lower portion of FIG. 8A and shows a schematic view of the sub-scene video signals SS2, SS5 and the panoramic video signal SC.R to be composited into the stage scene video signal STG or CO. FIGS. 8C through 8E show three possible composite outputs or stage scene video signals STG or CO.

In the composite output CO or stage scene video signal STG shown in FIG. 8C, the reduced panoramic video signal SC.R is composited across the entire top of the stage STG, in this case occupying less than 1/5, or 20%, of the stage area. The sub-scene SS5 is composited to occupy at least its minimum area; it is not scaled overall, but is widened to fill about 1/2 of the stage width. The sub-scene SS2 is likewise composited to occupy at least its (considerably smaller) minimum area, is not scaled overall, and is also widened to fill about 1/2 of the stage width. In this composite output CO, the two sub-scenes are given approximately the same area, but the participants are of different apparent sizes corresponding to their distances from the camera 100. Note also that the left-to-right, or clockwise, order of the two composited sub-scenes is the same as the order of the participants in the room, or of the bearings of interest from the camera 100 (and as they appear in the reduced panoramic view SC.R). Further, any of the transitions discussed herein may be used in compositing the sub-scene video signals SS2, SS5 into the stage video signal STG. For example, both sub-scenes may simply fill the stage STG instantly; or one may slide in from its corresponding left or right stage direction to fill the entire stage, and then be progressively narrowed as the other slides in from its corresponding left or right stage direction; and so on. In each case the sub-scene window, frame, outline, or the like displays its video stream throughout the transition.

In the composite output CO or stage scene video signal STG shown in FIG. 8D, the reduced panoramic video signal SC.R is similarly composited into the scene STG, but each of the signals SS5 and SS2 has been proportionally scaled or zoomed so that the participants M5, M2 occupy more of the stage STG. The minimum width of each signal SS5 and SS2 is also depicted zoomed; the signals SS5 and SS2 still occupy no less than their respective minimum widths, but each is widened to fill about 1/2 of the stage (in the case of SS5, the minimum width occupies 1/2 of the stage). The participants M5, M2 are of substantially equivalent size on the stage STG or within the composite output signal CO.

In the composite output CO or stage scene video signal STG shown in FIG. 8E, the reduced panoramic video signal SC.R is similarly composited into the scene STG, but each of the signals SS5 and SS2 has been scaled or zoomed according to circumstances. The sub-scene signals SS5 and SS2 still occupy no less than their respective minimum widths, but each is widened to fill a different amount of the stage. In this case, the sub-scene signal SS5 has not been scaled up or zoomed, but has a wider minimum width and occupies more than 2/3 of the stage STG. The minimum width of the signal SS2, on the other hand, is depicted zoomed, and SS2 occupies about 3 times its minimum width. One situation in which the relative proportions and state of FIG. 8E may arise is where no visual localization is available for participant M5, giving a wide and uncertain (low confidence level) bearing of interest and a wide minimum width, and where participant M5 additionally continues to speak for a long period, optionally increasing the share of the stage STG given to sub-scene SS5. At the same time, participant M2 may have a reliable face width detection, permitting the sub-scene SS2 to be scaled and/or widened to consume more than its minimum width.

FIG. 9A likewise shows a schematic view of the meeting camera 100 video signal, the minimum widths Min.n, and the extraction of alternative sub-scene video signals SSn and an alternative panoramic video signal SC.R to be composited into the stage scene video signal. The upper portion of FIG. 9A essentially reproduces FIG. 7B, except that participant M1 has become the latest speaker, with a corresponding sub-scene SS1 having a corresponding minimum width Min.1. As shown in FIG. 9A, the overall scene video signal SC from FIG. 7B may be sub-sampled according to the bearings of interest (here bearings B1, B2 and B5) and widths (here widths Min.1, Min.2 and Min.5). Each of the sub-scene video signals SS1, SS2 and SS5 is at least as wide as its respective minimum width Min.1, Min.2 or Min.5 (determined visually, acoustically, or by sensor), but may be wider, or scaled wider, relative to the width, height, and/or available area of the stage STG or the aspect ratio and available area of the composite output CO. The reduced panoramic scene SC.R in this capture is a version of the overall scene SC cropped at the top, bottom, and sides, in this case to span only the most relevant or most recent speakers M1, M2 and M5 at an aspect ratio of about 7.5:1. In the example of FIGS. 9A and 9B, four different video signal sources SS1, SS2, SS5 and SC.R are available to be composited into the stage STG or composite output CO.

FIG. 9B essentially reproduces the lower portion of FIG. 9A and shows a schematic view of the sub-scene video signals and panoramic video signal to be composited into the stage scene video signal. FIGS. 9C through 9E show three possible composite outputs or stage scene video signals.

In the composite output CO or stage scene video signal STG shown in FIG. 9C, the reduced panoramic video signal SC.R is composited almost entirely across the top of the stage STG, in this case occupying less than 1/4 of the stage area. The sub-scene SS5 is again composited to occupy at least its minimum area; it is not scaled overall, but is widened to fill about 1/3 of the stage width. The sub-scenes SS2 and SS1 are likewise composited to occupy at least their smaller minimum areas, are not scaled overall, and are each widened to fill about 1/3 of the stage width. In this composite output CO, the three sub-scenes are given approximately the same area, but the participants are of different apparent sizes corresponding to their distances from the camera 100. The left-to-right, or clockwise, order of the sub-scenes as composited or transitioned remains the same as the order of the participants in the room, or of the bearings of interest from the camera 100 (and as they appear in the reduced panoramic view SC.R). Further, any of the transitions discussed herein may be used in compositing the sub-scene video signals SS1, SS2, SS5 into the stage video signal STG. In particular, transitions are more comfortable as sliding transitions that approach in, or from, the same left-to-right order as the reduced panoramic view SC.R (e.g., if M1 and M2 are already on the stage, M5 should slide in from stage right; if M1 and M5 are already on the stage, M2 should slide in between them from the top or bottom; and if M2 and M5 are already on the stage, M1 should slide in from stage left, so as to preserve the M1, M2, M5 order of the panoramic view SC.R).
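
The order-preserving slide-in rule can be expressed as a sorted insertion. The following is a hedged sketch: it assumes each sub-scene is keyed by a bearing measured left to right in the panoramic view SC.R, and the function name and the direction labels are hypothetical.

```python
from bisect import bisect_left


def slide_in(on_stage_bearings, new_bearing):
    """Return (stage slot index, slide direction) for a sub-scene joining the stage.

    Bearings are assumed to increase left to right in the panoramic view.
    """
    ordered = sorted(on_stage_bearings)
    index = bisect_left(ordered, new_bearing)
    if not ordered:
        direction = "fill"           # empty stage: nothing to slide past
    elif index == 0:
        direction = "left"           # new sub-scene is leftmost
    elif index == len(ordered):
        direction = "right"          # new sub-scene is rightmost
    else:
        direction = "top_or_bottom"  # new sub-scene lands between two others
    return index, direction
```

With illustrative bearings of 10, 120 and 250 degrees standing in for M1, M2 and M5, the three cases in the paragraph above map to "right", "top_or_bottom" and "left" respectively.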

In the composite output CO or stage scene video signal STG shown in FIG. 9D, the reduced panoramic video signal SC.R is similarly composited into the scene STG, but each of the signals SS1, SS2 and SS5 has been proportionally scaled or zoomed so that the participants M1, M2, M5 occupy more of the stage STG. The minimum width of each signal SS1, SS2, SS5 is also depicted zoomed, and the signals SS1, SS2, SS5 still occupy no less than their respective zoomed minimum widths. The sub-scene SS5, however, is widened to fill slightly more than its zoomed minimum width on the stage: SS5 occupies 60 percent of the stage width, SS2 occupies only 15 percent, and SS1 occupies the remaining 25 percent. The participants M1, M2, M5 are of substantially equivalent height or face size on the stage STG or within the composite output signal CO, although participant M2 and sub-scene SS2 may be substantially cropped to show only a little more than the head and/or body width.

In the composite output CO or stage scene video signal STG shown in FIG. 9E, the reduced panoramic video signal SC.R is similarly composited into the scene STG, but each of the signals SS1, SS2, SS5 has been scaled or zoomed according to circumstances. The sub-scene signals SS1, SS2, SS5 still occupy no less than their respective minimum widths, but each is widened to fill a different amount of the stage. In this case, none of the sub-scene signals SS1, SS2, SS5 has been scaled up or zoomed, but the sub-scene SS1, with the most recent or most relevant speaker M1, occupies more than 1/2 of the stage STG. Each of the sub-scenes SS2 and SS5, on the other hand, occupies a smaller or reduced share of the stage STG; because of the minimum width of sub-scene SS5, any further reduction in share of the stage STG is taken from sub-scene SS2 or SS1. One situation in which the relative proportions and state of FIG. 9E may arise is where visual localization is available for participant M1, but participant M1 continues to speak for a long period, optionally increasing the share of the stage STG given to sub-scene SS1 relative to the other two sub-scenes.
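
One way to reason about stage shares that respect minimum widths is a weighted split with clamping. This is an illustrative sketch, not the method of the description: the weights (for example, a measure of how long or how recently a participant has spoken) and the function name are assumptions. Sub-scenes whose weighted share would fall below their minimum width are pinned at that minimum, and the shortfall is taken from the remaining sub-scenes.

```python
def allocate_stage_widths(min_widths, weights, stage_width):
    """Split stage_width by weight while keeping each sub-scene >= its minimum.

    Weights must be positive. Raises ValueError if the minimums cannot fit.
    """
    if sum(min_widths) > stage_width:
        raise ValueError("minimum widths exceed the stage width")
    n = len(min_widths)
    widths = [0.0] * n
    pinned = set()  # indices held at their minimum width
    while True:
        free = [i for i in range(n) if i not in pinned]
        remaining = stage_width - sum(min_widths[i] for i in pinned)
        total_weight = sum(weights[i] for i in free)
        newly_pinned = []
        for i in free:
            widths[i] = remaining * weights[i] / total_weight
            if widths[i] < min_widths[i]:
                newly_pinned.append(i)
        if not newly_pinned:
            break
        pinned.update(newly_pinned)
    for i in pinned:
        widths[i] = float(min_widths[i])
    return widths
```

Equal weights with small minimums reproduce the even split of FIG. 9C; a large weight on one sub-scene combined with a wide minimum on another gives proportions in the spirit of FIG. 9E.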

In the panoramic scene SC or reduced panoramic scene SC.R depicted in FIG. 9F, the meeting camera 100 is placed not at the center of the table CT but toward one end of it (e.g., as shown by the dashed-line position at the right of FIG. 7A), with a flat panel FP showing remote meeting participants. In this case, the conference table CT again appears as a highly distorted "W" shape. As shown at the top of FIG. 9F, if the index direction or origin OR of the meeting camera 100 or panoramic scene SC is oriented such that the limits of the high-aspect-ratio panoramic scene SC "split" the conference table CT, it is quite difficult to reference the positions of persons around the table CT. If, however, the index direction or origin OR of the meeting camera 100 or panoramic scene is arranged so that the table CT is contiguous and/or all persons are positioned toward one side, the scene is more natural. According to the present embodiment, the processor 6 may conduct an image analysis to change the index position or origin position of the panoramic image. In one example, the index position or origin position of the panoramic image may be "rotated" so that the area of a single contiguous segmentation of image patches corresponding to the table area is maximized (e.g., the table is not split). In another example, the index position or origin position of the panoramic image may be "rotated" so that the two closest or largest face recognitions are most distant from one another (e.g., the table is not split). In a third example, the index position or origin position of the panoramic image may be "rotated" so that the lowest-height segmentation of image patches corresponding to the table area is located at the panoramic edge (e.g., the "W" shape is rotated to put the table edge closest to the meeting camera 100 at the panoramic edge).
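
The first rotation example (keeping the table in one contiguous piece) can be sketched on a one-dimensional, circular column mask. The sketch assumes a prior segmentation step has already marked each panorama column as table or non-table; placing the seam at the middle of the longest non-table run is one assumed heuristic among several that would leave the table unsplit. Both function names are hypothetical.

```python
def seam_column(is_table):
    """Pick a new origin column so the panorama edge does not cut the table.

    is_table is a circular list of booleans, one per panorama column.
    Returns the column that should become column 0 after rotation.
    """
    n = len(is_table)
    if all(is_table) or not any(is_table):
        return 0  # nothing to decide
    # Begin scanning just after a table column so no non-table run wraps the scan.
    start = next(i for i in range(n) if is_table[i])
    best_len, best_start = 0, 0
    run_len, run_start = 0, 0
    for k in range(1, n + 1):
        i = (start + k) % n
        if not is_table[i]:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0
    # The seam goes in the middle of the longest run of non-table columns.
    return (best_start + best_len // 2) % n


def rotate(columns, seam):
    """Rotate a list of columns so that `seam` becomes index 0."""
    return columns[seam:] + columns[:seam]
```

The same rotation would be applied to the image columns themselves and to every stored bearing, so that bearings stay consistent with the new origin.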

FIG. 10A shows a schematic view of a possible composite output CO or stage scene video signal STG, and substantially reproduces the composite output signal CO or stage video signal STG of FIG. 9D, with the reduced panoramic signal composited to occupy less than 1/4 of the top of the stage STG and three different sub-scene video signals composited to occupy different amounts of the remainder of the stage STG. FIG. 10B shows an alternative schematic view of a possible composite output or stage scene video signal, with three different sub-scene video signals adjacent to one another composited to occupy different amounts of the stage STG or composite output signal CO.

FIGS. 11A and 11B show schematic views of two alternative ways in which videoconferencing software may display the composite output or stage scene video signal. In FIGS. 11A and 11B, the composite output signal CO is received (e.g., via a USB port) as a single camera signal with accompanying audio (optionally mixed and/or beamformed to emphasize the current speaker's voice) and is integrated into the videoconferencing application as a single camera signal. As shown in FIG. 11A, each single camera signal is given a separate window, and a selected, active, or foreground signal, such as the composite output signal CO, is reproduced as a thumbnail. In contrast, in the example shown in FIG. 11B, a selected single camera signal is given as much area on the display as is practical, and the selected, active, or foreground signal, such as the composite output signal CO, is presented as a shaded or grayed-out thumbnail.

Sub-Scene Identification and Compositing
As shown in FIG. 12, in step S10, new sub-scenes SS1, SS2 ... SSn may be created and tracked depending on the scene, e.g., upon recognition within the panoramic video signal SC. Subsequently, in step S30, the sub-scenes SS1, SS2 ... SSn may be composited according to the bearings of interest, conditions, and recognitions discussed herein. The composite output or stage scene STG, CO may then be output in step S50.

In additional detail as shown in FIG. 13, and as shown in FIGS. 3A through 7B inclusive, in step S12 the device 100 captures a wide-angle scene SC with a view angle of at least 90 degrees (e.g., an angle between 90 and 360 degrees) from one or more at least partially panoramic cameras 2 or 2a ... 2n.

Subsequent processing for tracking and sub-scene identification may be carried out on a native, distorted, or unstitched scene SC, or on an unrolled, distortion-corrected, or stitched scene SC.

In step S14, new bearings of interest B1, B2 ... Bn are obtained from the wide-angle view SC using one or more of beamforming, recognition, identification, vectoring, or homing techniques.

In step S16, one or more new bearings are widened from an initial angular range (e.g., 0-5 degrees) to an angular range sufficient to span a typical human head, and/or typical human shoulders, or another default width (e.g., measured in pixels or angular range). Note that the order of analysis may be reversed; e.g., a face may first be detected and the bearing to that face then determined. Widening may take place in one, two, or more steps, the two steps described herein being only an example. Moreover, "widening" does not require a progressive widening process; e.g., "widening" may mean directly setting an angular range based on a detection, recognition, threshold, or value. Different methods may be used to set the angular range of a sub-scene. In some cases, such as when two or more faces are close to one another, "widening" may be chosen so as to include all of those faces, even though only one face is at the exact bearing of interest B1.
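
Direct widening, including the case where nearby faces end up in a single sub-scene, can be sketched as an interval merge. This is a minimal illustration under stated assumptions: bearings are in degrees, a single default width is applied to every bearing, and wrap-around at 360 degrees is ignored for brevity. The function name is hypothetical.

```python
def widen_bearings(bearings_deg, width_deg):
    """Widen each bearing to (low, high) degrees and merge ranges that overlap."""
    half = width_deg / 2.0
    ranges = sorted((b - half, b + half) for b in bearings_deg)
    merged = []
    for low, high in ranges:
        if merged and low <= merged[-1][1]:
            # Overlaps the previous range: extend it so both faces share one sub-scene.
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged
```

Here two faces at 100 and 110 degrees, each widened to 20 degrees, collapse into one 90-to-120-degree sub-scene, while a face at 200 degrees keeps its own range.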

In step S16 (and as shown in FIGS. 5A and 5B), shoulder-width sub-scenes SS1, SS2 ... SSn may be set or adjusted, as in step S18, according to the interpupillary distance or to measurements taken from other facial, head, torso, or other visible features (features, classes, colors, segments, patches, textures, trained classifiers, or other features), and may be obtained from the scene SC. The width of the sub-scenes SS1, SS2 ... SSn may be set according to the shoulder width (alternatively according to the face width FW), or alternatively as a predetermined width related to the angular resolution of the acoustic microphone array 4.

Alternatively, in step S16, upper and/or lower limits on the sub-scene width for each or all bearings of interest may be set or adjusted in step S18 as, e.g., a peak, average, or representative shoulder width SW and face width FW, respectively. Note that the notations FW and SW are used interchangeably herein for a "face width" FW or "shoulder width" SW (i.e., the span of a face or shoulders to be angularly captured as a sub-scene) and for the resulting face-width or shoulder-width sub-scene SS representing the face width FW or shoulder width SW (i.e., a block of pixels, or a sub-scene of corresponding width, identified, obtained, adjusted, selected, or captured from the wide scene SC).

In step S16, or alternatively or additionally in steps S16-S18, a first discrete sub-scene with a view angle of at least 20 degrees (e.g., FW1 and/or SW1) is obtained from the wide-angle scene SC at a first bearing of interest B1, B2 ... Bn. Instead of or in addition to the setting of a view angle of at least 20 degrees, the first discrete sub-scene FW1 and/or SW1 may be obtained from the wide-angle scene SC as a view angle spanning at least 2 to 12 times an interpupillary distance (e.g., specific to M1 or representative of M1, M2 ... Mn), or alternatively or additionally as a view angle scaled to capture a width between an interpupillary distance (e.g., specific to M1 or representative of M1, M2 ... Mn) and a shoulder width (e.g., specific to M1 or representative of M1, M2 ... Mn). A sub-scene capture of the wider, shoulder width SWn may record the narrower face width FWn for later reference.

If a second bearing of interest B1, B2 ... Bn is available, then in step S16, or alternatively or additionally in steps S16-S18, a second discrete sub-scene (e.g., FW2 and/or SS2) is obtained in the same manner from the wide-angle view SC at the second bearing of interest, e.g., B2. If successive bearings of interest B3 ... Bn are available, successive discrete sub-scenes (e.g., FW3 ... n and/or SS3 ... n) are obtained in the same manner from the wide-angle view SC at the successive bearings of interest B3 ... Bn.

The first and second bearings of interest B1, B2 (and subsequent bearings of interest B3 ... Bn), whether obtained by stitching of different camera images or from a single panoramic camera, are obtained from the same device 100 and may therefore have a substantially common angular origin with the first bearing of interest. Optionally, one or more additional bearings of interest Bn from a different angular origin may be obtained from a separate camera 5 or 7 of the device 100, or from a camera on a connected device (e.g., a connected laptop, tablet, or mobile device 40 of FIG. 1A, or a connected satellite camera 7 on the satellite tower 14b of FIG. 2K).

As noted above, the set, obtained, or widened sub-scenes SS representing the widths FW or SW may be adjusted in step S18, e.g., (i) to be of equivalent or matching size to other sub-scenes; (ii) to divide, or be divisible, evenly with respect to the aspect ratio of the output image or stream signal (e.g., divided into 2, 3, or 4 segments), optionally without falling below the width lower limit or exceeding the upper limit noted above; (iii) to avoid overlap with other sub-scenes at nearby bearings of interest; and/or (iv) to match brightness, contrast, or other video properties with other sub-scenes.
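
Adjustment (iii), avoiding overlap between sub-scenes at nearby bearings, might be handled by trimming neighboring ranges at their shared midpoint. This is a sketch under assumptions, not the description's method: ranges are (low, high) pairs in degrees, no range is fully contained in another, and wrap-around is ignored. The function name is hypothetical.

```python
def trim_overlaps(ranges):
    """Trim neighboring (low, high) ranges so adjacent sub-scenes do not overlap."""
    out = [list(r) for r in sorted(ranges)]
    for left, right in zip(out, out[1:]):
        if left[1] > right[0]:
            # Split the overlapping span evenly between the two neighbors.
            mid = (left[1] + right[0]) / 2.0
            left[1] = mid
            right[0] = mid
    return [tuple(r) for r in out]
```

A real implementation would also need to check that trimming does not push a sub-scene below its minimum width, in which case merging the two sub-scenes may be the better choice.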

In step S20 (which may include steps of modes 1, 2, or 3 from FIGS. 16 to 18 in any reasonable and operable combination), data and/or metadata regarding the identified target orientations B1, B2 ... Bn and the sub-scenes FW1, FW2 ... FWn and/or SS1, SS2 ... SSn may be recorded for tracking purposes. For example, the relative position from the origin OR (e.g., determined by a sensor or by calculation), the width, the height, and/or any of the adjusted parameters noted above may be recorded.

Alternatively, in step S20, characteristic, predictive, or tracking data associated with the sub-scenes may be recorded, e.g., added to a sub-scene, orientation, or other feature tracking database. For example, the sub-scenes FW1, FW2 ... FWn and/or SS1, SS2 ... SSn may be momentary images, image blocks, or video blocks identified within the image or video scene SC. In the case of video, depending on the video compression/decompression approach, predictive data may be associated with a scene or sub-scene and recorded as data or metadata associated with the sub-scene, but will tend to be part of additional new sub-scenes to be tracked.

Following the recording of tracking data or other data of interest, processing returns to the main routine.
Compositing Sub-Scenes by Situation
In step S30 of FIG. 12, the processor 6 may composite the sub-scenes SSn according to the situation (e.g., according to the data, flags, indicia, settings, or other action parameters recorded as tracking data or scene data in step S20). That is, it combines the first, optionally the second, and optionally subsequent discrete sub-scenes SSn, corresponding to different widths FW1, FW2 ... FWn and/or SW1, SW2 ... SWn, into a composited scene or single camera image or video signal STG or CO. Herein, the single camera image or video signal STG, CO may refer to a single video frame or single composited video frame representing a USB (or other peripheral bus or network) peripheral image or video signal or stream corresponding to a single USB (or other peripheral bus or network) camera.

In step S32, the device 100, its circuits, and/or its executable code may identify the relevant sub-scenes SSn to be arranged as the composited, combined image or video stream STG or CO. "Relevant" may be determined according to the criteria discussed with respect to the identification in step S14 and/or the updating and tracking in step S20. For example, one relevant sub-scene may be that of the most recent speaker, and a second relevant sub-scene that of the second most recent speaker. These two most recent speakers may remain the most relevant until a third speaker becomes more relevant by speaking. The embodiments herein accommodate three speakers within sub-scenes of the composited scene, each in a segment of equal width or in a segment wide enough to hold that speaker's head and/or shoulders. However, two speakers, or four or more speakers, may readily be accommodated, each occupying a correspondingly wider or narrower share of the composited screen width.

By selecting sub-scenes SSn that encapsulate only the face in height and width, up to eight speakers may reasonably be accommodated (e.g., four in the top row and four in the bottom row of the composited scene), and arrangements of four to eight speakers may be accommodated by appropriate buffering and compositing of screens and/or windows (sub-scenes corresponding to windows), e.g., presenting the sub-scenes as an overlapping deck of cards, or as a foreshortened ring of views in which more relevant speakers are larger and in front and less relevant speakers are smaller and behind. With reference to FIGS. 6A and 6B, the scene SSn may further include whiteboard content WB whenever the system determines that WB is the most relevant scene to display (e.g., when imaged by the second camera 7 as depicted in FIG. 6A). The whiteboard or whiteboard scene WB may be presented prominently, occupying most or the greater part of the scene, while the speakers M1, M2 ... Mn or SPKR may optionally be presented picture-in-picture with the whiteboard WB content.

In step S34, the set of relevant sub-scenes SS1, SS2 ... SSn is compared with the previously relevant sub-scenes SSn. Steps S34 and S32 may be performed in reverse order. The comparison determines whether a previously relevant sub-scene SSn is available, should remain on the stage STG or CO, should be removed from the stage STG or CO, should be recomposed at a smaller or larger size or perspective, or otherwise needs to be changed from the previously composited scene or stage STG or CO. When new sub-scenes SSn are to be displayed, there may be too many candidate sub-scenes SSn for the scene change. For example, in step S36, a scene-change threshold may be checked (this step may be performed before or between steps S32 and S34). For example, when the number of discrete sub-scenes SSn exceeds a threshold number (e.g., 3), it may be preferable to output the entire wide-angle scene SC or a reduced panoramic scene SC.R (e.g., as is, or segmented and stacked to fit within the aspect ratio of a USB peripheral camera). Alternatively, it may be best to present a single camera scene instead of the composited scene of multiple sub-scenes SSn, or as the composite output CO.
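The threshold check of step S36 can be sketched as follows. This is a minimal illustration only: the function name `choose_output_mode`, the mode labels, and the default threshold of 3 (taken from the example in the text) are assumptions, not part of the disclosure.

```python
def choose_output_mode(num_subscenes, threshold=3):
    """Return which kind of output to present for a given number of
    candidate sub-scenes (hypothetical labels, for illustration)."""
    if num_subscenes > threshold:
        # Too many candidates: fall back to the whole wide-angle scene SC
        # or the reduced panorama SC.R instead of a composited stage.
        return "panorama"
    return "composite"
```

For example, with the default threshold, three candidate sub-scenes still yield a composited stage, while four yield the panorama fallback.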

In step S38, the device 100, its circuits, and/or its executable code may set the sub-scene members SS1, SS2 ... SSn and the order in which they transition to and/or are composited into the composite output CO. In other words, once the candidate members of the sub-scene complement SS1, SS2 ... SSn to be output as the stage STG or CO have been determined, along with whether any rules or thresholds for scene change are met or exceeded, the order of the scenes SSn and the transitions by which they are added, removed, switched, or rearranged may be determined in step S38. Note that step S38 is more or less significant depending on the preceding steps and on the history of the speakers SPKR or M1, M2 ... Mn. If two or three speakers M1, M2 ... Mn or SPKR are identified and are to be displayed as soon as the device 100 begins operating, step S38 starts from a clean slate and follows default relevance rules (e.g., present the speakers SPKR clockwise; start with no more than three speakers in the composite output CO). If the same three speakers M1, M2 ... Mn remain relevant, the sub-scene members, order, and composition need not change in step S38.

As noted above, the identification discussed with respect to step S18 and the prediction/updating discussed with respect to step S20 may change the composite output CO in steps S32 to S40. In step S40, the transitions and compositions to be performed are determined.

For example, the device 100 may obtain a subsequent (e.g., third, fourth, or later) discrete sub-scene SSn from the wide-angle or panoramic scene SC at a subsequent target orientation. In steps S32 to S38, the subsequent sub-scene SSn may be set to be composited or combined into the composited scene or composite output CO. Further, in steps S32 to S38, a sub-scene SSn other than the subsequent sub-scene (e.g., a prior or less relevant sub-scene) may be set to be removed from the composited scene (by a composited transition). The result is then composited and output in step S50 as the composited scene or composite output CO formatted as a single camera scene.

As an additional or alternative example, in steps S32 to S38 the device 100 may set which sub-scenes SSn are to be composited or combined into, or removed from, the composited scene or composite output CO according to the setting of addition criteria as discussed with reference to steps S18 and/or S20 (e.g., time of speaking, frequency of speaking, audio-frequency cough/sneeze/doorbell, sound amplitude, coincidence of speech angle and face recognition). In steps S32 to S38, only those subsequent sub-scenes SSn that satisfy the addition criteria may be set to be combined into the composited scene CO. In step S40, the transition and compositing steps to be performed are determined. The stage scene is then composited and output in step S50 as the composite output CO formatted as a single camera scene.

As an additional or alternative example, in steps S32 to S38 the device 100 may set a sub-scene SSn as a protected sub-scene, protected from removal, based on retention criteria as discussed with reference to steps S18 and/or S20 (e.g., time of audio/speaking, frequency of audio/speaking, time since last speaking, tagged for retention). In steps S32 to S38, removing a sub-scene SSn other than the subsequent sub-scene does not set a protected sub-scene to be removed from the composited scene. In step S40, the transitions and compositions to be performed are determined. The composited scene is then composited and output in step S50 as the composite output CO formatted as a single camera scene.

As an additional or alternative example, in steps S32 to S38 the device 100 may set a sub-scene SSn emphasis operation as discussed with reference to steps S18 and/or S20 (e.g., scaling, blinking, genie, bouncing, card sorting, ordering, cornering) based on emphasis criteria (e.g., repeated speaker, designated presenter, most recent speaker, loudest speaker, an object rotating in the hands or in a scene change, high-frequency scene activity in the frequency domain, raised hand). In steps S32 to S38, at least one of the discrete sub-scenes SSn may be set to be emphasized according to the sub-scene emphasis operation based on its respective or corresponding emphasis criterion. In step S40, the transitions and compositions to be performed are determined. The composited scene is then composited and output in step S50 as the composite output CO formatted as a single camera scene.

As an additional or alternative example, in steps S32 to S38 the device 100 may set a sub-scene participant notification or reminder operation as discussed with reference to steps S18 and/or S20 (e.g., blinking a light at the person on the side of the sub-scene) based on a sensor or sensed criterion (e.g., too quiet, remote poke). In steps S32 to S38, a local reminder indicator may be set to be activated according to the notification or reminder operation based on its respective or corresponding sensed criterion. In step S40, the transitions and compositions to be performed are determined. The composited scene is then composited and output in step S50 as the composite output CO formatted as a single camera scene.

In step S40, the device 100, its circuits, and/or its executable code generates the transitions and composition so that the sub-scene complement of the composited image changes smoothly. Following the compositing of the composite output CO of the tracking data or other data of interest, processing returns to the main routine.

Composite Output
In steps S52 to S56 of FIG. 15 (optionally in reverse order), the composited scene STG or CO is formatted, i.e., composited, to be sent or received as a single camera scene, and/or the transitions are rendered or composited to a buffer, screen, or frame (here, "buffer," "screen," or "frame" corresponds to the single camera view output). The device 100, its circuits, and/or its executable code may use a compositing window or screen manager, optionally with GPU acceleration, to provide an off-screen buffer for each sub-scene, composite the buffers, together with peripheral graphics and transition graphics, into a single camera image representing a single camera view, and write the result to the output or display memory. The compositing window or sub-screen manager circuit may perform blending, fading, scaling, rotation, duplication, bending, contortion, shuffling, blurring, or other processing on buffered windows, or may render drop shadows and animations such as flip switching, stack switching, cover switching, ring switching, grouping, and tiling. The compositing window manager may provide visual transitions in which sub-scenes entering the composited scene are composited so as to be added, removed, or switched with a transition effect. Sub-scenes may fade in or out, visibly shrink in or out, or smoothly radiate inward or outward. All scenes being composited or transitioned may be video scenes, e.g., each including an ongoing video stream subsampled from the panoramic scene SC.

In step S52, the transitions or composition are rendered (repeatedly, progressively, or continuously, as necessary) to a frame, buffer, or video memory. Note that the transitions and compositing may apply to individual frames or to video streams, and may be an ongoing process through many frames of video of the entire scene STG, CO and of the individual component sub-scenes SS1, SS2 ... SSn.

In step S54, the device 100, its circuits, and/or its executable code may select and transition the audio stream(s). Similar to the window, scene, video, or sub-scene composition manager, audio streams may be emphasized or de-emphasized, particularly in the case of a beamforming array 4, to emphasize the sub-scenes being composited. Likewise, the audio may be synchronized to the composited video scene.

In step S56, the device 100, its circuits, and/or its executable code outputs the simulation of single camera video and audio as the composite output CO. As noted above, this output has an aspect ratio and pixel count simulating a single view, e.g., a webcam view, of a peripheral USB device (e.g., an aspect ratio of less than 2:1, and typically less than 1.78:1), and may be used by group teleconferencing software as an external webcam input. When rendering the webcam input as a displayed view, the teleconferencing software treats the composite output CO as any other USB camera, and all clients interacting with the host device 40 (or the directly connected version of the device 100 of FIG. 1B) present the composite output CO in all main and thumbnail views corresponding to the host device (or the directly connected version of the device 100 of FIG. 1B).

Examples of Sub-Scene Compositing
As discussed with reference to FIGS. 12 to 16, the meeting camera 100 and the processor 6 may composite (in step S30) and output (in step S50) the single camera video signal STG, CO. The processor 6, operatively connected to the ROM/RAM 8, may record (in step S12) a panoramic video signal SC having an aspect ratio of substantially 2.4:1 or greater, captured from a wide camera 2, 3, 5 having a horizontal angle of view of substantially 90 degrees or greater. In one optional version, the panoramic video signal has an aspect ratio of substantially 8:1 or greater and is captured from a wide camera having a horizontal angle of view of substantially 360 degrees.

The processor 6 may subsample (e.g., in steps S32 to S40) at least two sub-scene video signals SS1, SS2 ... SSn (e.g., SS2 and SS5 in FIGS. 8C to 8E and 9C to 9E) at respective target orientations B1, B2 ... Bn (e.g., identified in step S14) from the wide camera 100. The processor 6 may composite the two or more sub-scene video signals SS1, SS2 ... SSn (e.g., SS2 and SS5 in FIGS. 8C to 8E and 9C to 9E) side by side (into a buffer, frame, or video memory in steps S32 to S40) to form (in steps S52 to S56) a stage scene video signal CO, STG having an aspect ratio of substantially 2:1 or less. Optionally, to densely fill as much of the single camera video signal as possible (leading to larger views of the participants), substantially 80% or more of the area of the stage scene video signal CO, STG may be subsampled from the panoramic video signal SC. The processor 6, operatively connected to the USB/LAN interface 10, may output the stage scene video signal CO, STG formatted as a single camera video signal (as in steps S52 to S56).
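As a rough illustration of subsampling at a target orientation, the sketch below maps an orientation and angular width to pixel columns of a 360-degree panorama, wrapping around the seam. It assumes an equirectangular-style panorama in which columns are linear in angle; the function name `column_range` and its parameters are hypothetical.

```python
def column_range(orientation_deg, width_deg, pano_width_px):
    """Return the list of panorama column indices covering a sub-scene
    centered at orientation_deg and spanning width_deg.

    Assumption for illustration: the panorama spans exactly 360 degrees
    and pixel columns are evenly spaced in angle.
    """
    px_per_deg = pano_width_px / 360.0
    left = int(round((orientation_deg - width_deg / 2.0) * px_per_deg))
    count = int(round(width_deg * px_per_deg))
    # The modulo wraps columns that cross the 0/360 degree seam.
    return [(left + i) % pano_width_px for i in range(count)]
```

A sub-scene centered on the seam (orientation 0) therefore draws its left half from the last columns of the panorama and its right half from the first columns.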

Optimally, the processor 6 may subsample an additional (e.g., third, fourth, or subsequent) sub-scene video signal SS1, SS2 ... SS3 (e.g., SS1 in FIGS. 9C to 9E) at a respective target orientation B1, B2 ... Bn from the panoramic video signal SC (and/or optionally from a buffer, frame, or video memory, e.g., in the GPU 6 and/or ROM/RAM 8, and/or directly from the wide camera 2, 3, 5). The processor may then composite the two or more sub-scene video signals SS1, SS2 ... SS3 (e.g., SS2 and SS5 in FIGS. 9C to 9E) initially composited onto the stage STG, CO together with the one or more additional sub-scene video signals SS1, SS2 ... SSn (e.g., SS1 in FIGS. 9C to 9E) to form a stage scene video signal STG, CO having an aspect ratio of substantially 2:1 or less and including a plurality of side-by-side sub-scene video signals (e.g., two, three, four, or more sub-scene video signals SS1, SS2 ... SSn composited in a row or in a grid). The processor 6 may set or store in memory one or more addition criteria for one or more target orientations or sub-scene video signals SS1, SS2 ... SSn. In this case, for example, only those additional sub-scene video signals SS1, SS2 ... SSn that satisfy the addition criteria (e.g., sufficient quality, sufficient illumination, etc.) may be transitioned into the stage scene video signal STG, CO.

Alternatively or in addition, the additional sub-scene video signal SS1, SS2 ... SSn may be composited into the stage scene video signal STG, CO by the processor 6 by replacing one or more of the sub-scene video signals SS1, SS2 ... SSn that may already be composited onto the stage STG, CO, forming a stage scene video signal STG, CO that still has an aspect ratio of substantially 2:1 or less. Each sub-scene video signal SS1, SS2 ... SSn to be composited may be assigned a minimum width Min.1, Min.2 ... Min.n, and upon completing its respective transition into the stage scene video signal STG, CO, each sub-scene video signal SS1, SS2 ... SSn may be composited side by side at substantially no less than its minimum width Min.1, Min.2 ... Min.n to form the stage scene video signal STG, CO.

In some cases, e.g., in steps S16 to S18, the processor 6 may increase the composited width of each respective sub-scene video signal SS1, SS2 ... SSn being transitioned, so that it grows throughout the transition until the composited width is substantially equal to or greater than the corresponding respective minimum width Min.1, Min.2 ... Min.n. Alternatively or in addition, each sub-scene video signal SS1, SS2 ... SSn may be composited side by side by the processor 6 at substantially no less than its minimum width Min.1, Min.2 ... Min.n, and each at a respective width such that the sum of the widths of all composited sub-scene video signals SS1, SS2 ... SSn is substantially equal to the width of the stage scene video signal or composite output STG, CO.
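The two width rules above can be sketched in a few lines. This is an illustration under stated assumptions: widths are in pixels, the transition grows linearly with a progress value in [0, 1], and spare stage width is shared evenly; the names `transition_width` and `fill_stage` are hypothetical and the disclosure does not prescribe these particular policies.

```python
def transition_width(progress, min_width):
    """Composited width of an entering sub-scene at a given transition
    progress (0.0 = just started, 1.0 = complete). Linear ramp assumed."""
    p = max(0.0, min(1.0, progress))
    return round(p * min_width)


def fill_stage(min_widths, stage_width):
    """Give every sub-scene at least its minimum width and distribute the
    remaining pixels evenly so that the widths sum to stage_width."""
    total = sum(min_widths)
    if total > stage_width:
        raise ValueError("minimum widths exceed the stage width")
    share, remainder = divmod(stage_width - total, len(min_widths))
    # The first `remainder` sub-scenes absorb one leftover pixel each.
    return [w + share + (1 if i < remainder else 0)
            for i, w in enumerate(min_widths)]
```

With minimum widths of 300, 400, and 200 pixels on a 1280-pixel stage, each sub-scene receives its minimum plus an even share of the 380 spare pixels.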

Alternatively or in addition, the widths of the sub-scene video signals SS1, SS2 ... SSn within the stage scene video signal STG, CO are composited by the processor 6 to change (e.g., as in steps S16 to S18) according to one or more activity criteria (e.g., visual motion, sensed motion, acoustic detection of speech, etc.) detected at one or more target orientations B1, B2 ... Bn corresponding to the sub-scene video signals SS1, SS2 ... SSn, while the width of the stage scene video signal or composite output STG, CO is kept constant.

Optionally, the processor 6 may form the stage scene video signal by compositing one or more sub-scene video signals SS1, SS2 ... SSn (e.g., SS2 and SS5 in FIGS. 9C to 9E) together with one or more additional sub-scene video signals SS1, SS2 ... SSn (e.g., SS1 in FIGS. 9C to 9E), transitioning the one or more additional sub-scene video signals (e.g., SS1 in FIGS. 9C to 9E) into the stage scene video signal STG, CO by reducing the width of one, two, or more of the sub-scene video signals SS1, SS2 ... SSn (e.g., SS2 and SS5 in FIGS. 9C to 9E) by an amount corresponding to the width of the one or more added or subsequent sub-scene video signals SS1, SS2 ... SSn (e.g., SS1 in FIGS. 9C to 9E).

In some cases, the processor 6 may assign each sub-scene video signal SS1, SS2 ... SSn a respective minimum width Min.1, Min.2 ... Min.n, and may composite each sub-scene video signal SS1, SS2 ... SSn side by side at substantially no less than its corresponding minimum width Min.1, Min.2 ... Min.n to form the stage scene video signal or composite output STG, CO. When the sum of the respective minimum widths Min.1, Min.2 ... Min.n of the two or more sub-scene video signals SS1, SS2 ... SSn, together with the one or more additional sub-scene video signals SS1, SS2 ... SSn, exceeds the width of the stage scene video signal STG, CO, one or more of the two sub-scene video signals SS1, SS2 ... SSn may be transitioned by the processor 6 to be removed from the stage scene video signal or composite output STG, CO.

In another alternative, the processor 9 may select at least one of the two or more sub-scene video signals SS1, SS2 ... SSn to be transitioned for removal from the stage scene video signal STG, CO so that it corresponds to the respective target orientation B1, B2 ... Bn at which one or more activity criteria (e.g., visual motion, sensed motion, acoustic detection of speech, time since last speech, etc.) were least recently satisfied.
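The removal rule described in the preceding paragraphs (remove sub-scenes when the minimum widths no longer fit, choosing the one whose activity criterion was least recently satisfied) might be sketched as below. The record fields `id`, `min_width`, and `last_active`, and the function name `evict_until_fit`, are illustrative assumptions.

```python
def evict_until_fit(subscenes, stage_width):
    """Remove least-recently-active sub-scenes until the sum of minimum
    widths fits within stage_width.

    subscenes: list of dicts with keys 'id', 'min_width', 'last_active'
    (timestamp at which an activity criterion was last satisfied).
    Returns (kept, removed_ids).
    """
    kept = list(subscenes)
    removed_ids = []
    while kept and sum(s["min_width"] for s in kept) > stage_width:
        # The smallest timestamp marks the least recently active sub-scene.
        oldest = min(kept, key=lambda s: s["last_active"])
        kept.remove(oldest)
        removed_ids.append(oldest["id"])
    return kept, removed_ids
```

In the test scenario used here, adding SS1 to a stage already holding SS2 and SS5 pushes the minimum widths past 1280 pixels, so the sub-scene that spoke longest ago (SS2) is selected for removal.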

In many cases, as shown in FIGS. 8B to 8E and 9B to 9E, the processor 6 may preserve the left-to-right (clockwise, when viewed from above) order, with respect to the wide camera 2, 3, 5, among the respective target orientations B1, B2 ... Bn of the two or more sub-scene video signals SS1, SS2 ... SSn (e.g., SS2 and SS5 in FIGS. 9C to 9E) and the one or more additional sub-scene video signals SS1, SS2 ... SSn (e.g., SS1 in FIGS. 9C to 9E) as the two or more sub-scene video signals SS1, SS2 ... SSn are composited together with at least one subsequent sub-scene video signal SS1, SS2 ... SSn to form the stage scene video signal or composite output STG, CO.
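Preserving the left-to-right order amounts to sorting the staged sub-scenes by orientation. The sketch below assumes orientations are measured in degrees clockwise from a common origin, so increasing angle corresponds to left-to-right position in the panorama; the function name `stage_order` and the example angles are hypothetical.

```python
def stage_order(orientations):
    """Return sub-scene ids in left-to-right (clockwise) order.

    orientations: dict mapping sub-scene id -> orientation in degrees,
    measured clockwise from a shared origin (an assumption here).
    """
    return sorted(orientations, key=lambda sid: orientations[sid] % 360)
```

When a new sub-scene such as SS1 joins SS2 and SS5, re-sorting places it according to its orientation rather than appending it at the end of the stage.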

Alternatively or in addition, the processor 6 may select each respective target orientation B1, B2 ... Bn from the panoramic video signal SC depending on one or more selection criteria (e.g., visual motion, sensed motion, acoustic detection of speech, time since last speech, etc.) detected at that respective target orientation B1, B2 ... Bn with respect to the wide camera 2, 3, 5. After the one or more selection criteria are no longer true, the processor 6 may transition the corresponding sub-scene video signal SS1, SS2 ... SSn for removal from the stage scene video signal or composite output STG, CO. The selection criteria may include the presence of an activity criterion satisfied at the respective target orientation B1, B2 ... Bn. The processor 9 may count the time since one or more activity criteria were satisfied at the respective target orientation B1, B2 ... Bn. A predetermined period after the one or more activity criteria were satisfied at the respective target orientation B1, B2 ... Bn, the processor 6 may transition the respective sub-scene signal SS1, SS2 ... SSn for removal from the stage scene video signal STG.
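The time-based removal described above can be sketched as a simple timeout check. The function name `subscenes_to_remove`, the use of seconds as the time unit, and the optional `protected` set (standing in for sub-scenes held by the retention criteria mentioned earlier in the description) are illustrative assumptions.

```python
def subscenes_to_remove(last_satisfied, now, hold_seconds,
                        protected=frozenset()):
    """Return ids of sub-scenes whose activity criterion was last satisfied
    at least hold_seconds ago, skipping protected sub-scenes.

    last_satisfied: dict mapping sub-scene id -> timestamp in seconds.
    """
    return sorted(
        sid for sid, t in last_satisfied.items()
        if now - t >= hold_seconds and sid not in protected
    )
```

A caller would run this check periodically and start a removal transition for each returned id.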

With respect to the reduced panorama video signal SC.R shown in FIGS. 8A-8C, 9A-9C, 10A, 1B, 11A, 11B, and 22, the processor 6 may subsample, from the panoramic video signal SC, a reduced panorama video signal SC.R having an aspect ratio of substantially 8:1 or greater. The processor 6 may then composite the two or more sub-scene video signals (e.g., SS2 and SS5 in FIGS. 8C-8E and 9C-9E) together with the reduced panorama video signal SC.R to form a stage scene video signal STG, CO having an aspect ratio of substantially 2:1 or less, including a plurality of side-by-side sub-scene video signals (e.g., SS2 and SS5 in FIGS. 8C-8E; SS1, SS2, and SS5 in FIGS. 9C-9E) and the panorama video signal SC.R.

In this case, the processor 6 may composite the two or more sub-scene video signals (e.g., SS2 and SS5 in FIGS. 8C-8E; SS1, SS2, and SS5 in FIGS. 9C-9E) together with the reduced panorama video signal SC.R to form a stage scene video signal having an aspect ratio of substantially 2:1 or less, including a plurality of side-by-side sub-scene video signals (e.g., SS2 and SS5 in FIGS. 8C-8E; SS1, SS2, and SS5 in FIGS. 9C-9E) and the panorama video signal SC.R positioned above the plurality of side-by-side sub-scene video signals, the panorama video signal occupying no more than 1/5 of the area of the stage scene video signal or composited output STG or CO and extending substantially across the width of the stage scene video signal or composited output STG or CO.
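By way of non-limiting illustration, one possible arrangement of such a stage may be sketched in Python as follows. The function name `layout_stage`, the rectangle convention (x, y, width, height), the equal-width tiles, and the assumption that a bearing of 0 degrees corresponds to the left edge of the panorama are all hypothetical choices made for this sketch only.

```python
def layout_stage(stage_w, stage_h, subscenes, ribbon_divisor=5):
    """Place a full-width panorama ribbon above side-by-side sub-scene tiles.

    subscenes: list of (name, bearing_deg) pairs.
    Returns (ribbon_rect, tiles); rectangles are (x, y, width, height).
    """
    if not subscenes:
        raise ValueError("at least one sub-scene is required")
    # The ribbon takes 1/ribbon_divisor of the stage height, hence of its area.
    ribbon_h = stage_h // ribbon_divisor
    ribbon = (0, 0, stage_w, ribbon_h)
    # Preserve left-to-right order of the bearings of interest.
    ordered = sorted(subscenes, key=lambda s: s[1] % 360)
    n = len(ordered)
    edges = [round(i * stage_w / n) for i in range(n + 1)]
    tiles = [
        (name, (edges[i], ribbon_h, edges[i + 1] - edges[i], stage_h - ribbon_h))
        for i, (name, _bearing) in enumerate(ordered)
    ]
    return ribbon, tiles


ribbon, tiles = layout_stage(1280, 720, [("SS5", 300.0), ("SS2", 95.0)])
print(ribbon)
print(tiles)
```

For a 1280x720 stage (16:9, which is less than 2:1), the ribbon in this sketch is 1280x144 pixels, which is 1/5 of the stage area and wider than 8:1.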

In an alternative, as shown in FIG. 24, the processor 6 may subsample, or be provided with a subsample of, a text video signal TD1 from a text document (e.g., from a text editor, word processor, spreadsheet, presentation, or any other document that renders text). The processor 6 may then transition the text video signal TD1, or a rendered or reduced version thereof TD1.R, into the stage scene video signal STG, CO by replacing at least one of the two or more sub-scene video signals with the text video signal TD1 or its equivalent TD1.R.

Optionally, the processor 6 may set one or more of the two sub-scene video signals as a protected sub-scene video signal SS1, SS2 ... SSn, protected from transition, based on one or more retention criteria (e.g., visual motion, sensed motion, acoustic detection of speech, time since last speech, and the like). In this case, the processor 6 may transition one or more additional sub-scene video signals SS1, SS2 ... SSn into the stage scene video signal by replacing at least one of the two or more sub-scene video signals SS1, SS2 ... SSn, but specifically by transitioning out a sub-scene video signal SS1, SS2 ... SSn other than the protected sub-scene.

Alternatively, the processor 6 may set a sub-scene emphasis operation (e.g., blinking, highlighting, outlining, icon overlay, etc.) based on one or more emphasis criteria (e.g., visual motion, sensed motion, acoustic detection of speech, time since last speech, and the like). In this case, one or more sub-scene video signals are emphasized according to the sub-scene emphasis operation, based on the corresponding emphasis criterion.

In an additional variation, the processor 6 may set a sub-scene participant notification operation based on a sensed criterion from a sensor (e.g., detection of sound waves, vibration, electromagnetic radiation, heat, UV radiation, radio, microwaves, electrical properties, or depth/range by a sensor such as an RF element, a passive infrared element, or a rangefinding element). The processor 6 may activate one or more local reminder indicia according to the notification operation, based on the corresponding sensed criterion.

Examples of Bearings of Interest
For example, bearings of interest may be those bearings corresponding to one or more audio signals or detections, e.g., a participant M1, M2 ... Mn who is speaking, angularly recognized, vectored, or identified by the microphone array 4 by, e.g., beamforming, localization, comparative received signal strength, or comparative time of flight using at least two microphones. Thresholding or frequency-domain analysis may be used to decide whether an audio signal is sufficiently strong or sufficiently distinct, and filtering may be performed using at least three microphones to discard inconsistent pairs, multipath, and/or redundancy. Three microphones have the benefit of forming three pairs for comparison.
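By way of non-limiting illustration, the pairing of microphones and a comparative time-of-flight estimate may be sketched in Python as follows. The far-field relation sin(theta) = c * delta_t / d for a single pair is a standard textbook approximation and is an assumption of this sketch, not a method stated in the specification; the function names and the 343 m/s speed of sound are likewise hypothetical.

```python
import math
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air (assumed)


def mic_pairs(mic_ids):
    """All unordered microphone pairs; three microphones yield three pairs."""
    return list(combinations(mic_ids, 2))


def tdoa_to_angle_deg(delta_t, spacing_m, c=SPEED_OF_SOUND):
    """Far-field arrival angle from broadside for one pair, from the arrival-time difference."""
    ratio = c * delta_t / spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))


print(mic_pairs(["m1", "m2", "m3"]))
print(tdoa_to_angle_deg(0.0, 0.1))
```

Comparing the angles implied by each of the three pairs is one way an inconsistent pair could be detected and discarded.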

As another example, alternatively or in addition, bearings of interest may be those bearings at which motion is detected in the scene, angularly recognized, vectored, or identified by feature, image, pattern, class, and/or motion detection circuits or executable code capable of scanning image, motion video, or RGBD data from the camera 2.

As another example, alternatively or in addition, bearings of interest may be those bearings at which facial structures are detected in the scene, angularly recognized, vectored, or identified by facial detection circuits or executable code capable of scanning image, motion video, or RGBD signals from the camera 2. Skeletal structures may also be detected in this manner.

As another example, alternatively or in addition, bearings of interest may be those bearings at which structures substantially contiguous in color, texture, and/or pattern are detected in the scene, angularly recognized, vectored, or identified by edge detection, corner detection, blob detection or segmentation, extrema detection, and/or feature detection circuits or executable code capable of scanning image, motion video, or RGBD signals from the camera 2. Recognition may refer to previously recorded, learned, or trained image patches, colors, textures, or patterns.

As another example, alternatively or in addition, bearings of interest may be those bearings at which a difference from a known environment is detected in the scene, angularly recognized, vectored, or identified by differencing and/or change detection circuits or executable code capable of scanning image, motion video, or RGBD signals from the camera 2. For example, the device 100 may keep one or more visual maps of the empty meeting room in which it is located, and may detect when a sufficiently obstructive entity, such as a person, obscures known features or areas in the map.

As another example, alternatively or in addition, bearings of interest may be those bearings at which regular shapes such as rectangles, including "whiteboard" shapes, door shapes, or chair-back shapes, are identified, angularly recognized, or vectored by feature, image, pattern, class, and/or motion detection circuits or executable code capable of scanning image, motion video, or RGBD data from the camera 2.

As another example, alternatively or in addition, bearings of interest may be those bearings at which fiducial objects or features recognizable as artificial landmarks are placed by persons using the device 100, including active or passive acoustic emitters or transducers, and/or passive or active optical or visual fiducial markers, and/or RFID or other electromagnetically detectable items, these being angularly recognized, vectored, or identified by one or more of the techniques noted above.

If no initial or new bearing of interest is obtained in this manner (e.g., because no participant M1, M2 ... Mn is yet speaking), a default view may be set to be output as the single-camera scene in place of the composited scene. For example, as one default view, the entire panoramic scene (e.g., of 2:1 to 10:1 H:V horizontal-to-vertical proportion) may be fragmented and arranged into the output single-camera proportion (e.g., generally an H:V aspect ratio or horizontal-to-vertical proportion of 1.25:1 to 2.4:1 or 2.5:1 in landscape orientation, although the corresponding "turned" portrait-orientation proportions are also possible). As another example of a default view used before a bearing of interest is first obtained, a "window" corresponding to the output scene proportion may be tracked across the scene SC at, e.g., a fixed rate, e.g., as a simulation of a slowly panning camera. As another example, the default view may be composed of a "headshot" (plus 5-20% additional width in margin) of each meeting attendee M1, M2 ... Mn, with the margin adjusted so as to optimize the available display area.

Aspect Ratio Examples
Although embodiments and aspects of the invention may be useful with any angular range or aspect ratio, the benefits are optionally greater when sub-scenes are formed from a camera providing a panoramic video signal having an aspect ratio of substantially 2.4:1 or greater (the aspect ratio expressing either frame or pixel dimensions), and are composited into a multi-participant stage video signal having an overall aspect ratio of substantially 2:1 or less (e.g., 16:9, 16:10, or 4:3), as is found in most laptop or television displays (usually 1.78:1 or less), and further, optionally, when the stage video signal sub-scenes fill more than 80% of the composited overall frame, and/or when the stage video signal sub-scenes together with any additionally composited thumbnail formed from the panoramic video signal fill more than 90% of the composited overall frame. In this way, each participant shown fills the screen nearly as fully as is practicable.

The corresponding ratio between the vertical and horizontal angles of view may be determined from α = 2 arctan(d/(2f)), where d is the vertical or horizontal dimension of the sensor and f is the effective focal length of the lens. Different wide-angle cameras for meetings may have a 90-degree, 120-degree, or 180-degree field of view from a single lens, yet each may output a 1080p image (e.g., a 1920x1080 image) of aspect ratio 1.78:1, or a much wider image of aspect ratio 3.5:1 or another aspect ratio. When observing meeting scenes, smaller aspect ratios (e.g., 2:1 or lower) combined with wide cameras of 120 degrees or 180 degrees may show more ceiling, wall, or table than may be desired. Consequently, while the aspect ratio of the scene or panoramic video signal SC and the angle of view FOV of a camera 100 may be independent, it is optionally advantageous for the present embodiments to match a wider camera 100 (90 degrees or more) with a video signal of wider aspect ratio (e.g., 2.4:1 or more), and further optionally, to match the widest camera (e.g., a 360-degree panoramic view) with the widest aspect ratios (e.g., 8:1 or more).
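By way of non-limiting illustration, the angle-of-view relation α = 2 arctan(d/(2f)) may be evaluated in Python as follows. The function name `angle_of_view_deg` and the sample sensor dimensions are hypothetical; d and f need only be expressed in the same unit.

```python
import math


def angle_of_view_deg(d, f):
    """Angle of view in degrees for sensor dimension d and effective focal length f."""
    if d <= 0 or f <= 0:
        raise ValueError("d and f must be positive")
    return math.degrees(2.0 * math.atan(d / (2.0 * f)))


# Horizontal and vertical angles for an assumed 36 x 24 sensor at f = 18 (same unit).
horizontal = angle_of_view_deg(36.0, 18.0)
vertical = angle_of_view_deg(24.0, 18.0)
print(horizontal, vertical, horizontal / vertical)
```

When d equals 2f the argument of the arctangent is 1, so the angle of view is 90 degrees; shortening the focal length widens the angle.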

Examples of Tracking Sub-Scenes or Bearings
The process carried out by the devices of FIGS. 1A and 1B, as shown in FIGS. 12-18 and particularly FIGS. 16-18, may include tracking sub-scenes FW, SS at bearings of interest B1, B2 ... Bn within a wide video signal SC. As shown in FIG. 16, the processor 6, operatively connected to the acoustic sensor or microphone array 4 (with optional beamforming circuits) and to the wide camera 2, 3, 5, monitors in step S202 a substantially common angular range, which is optionally or preferably substantially 90 degrees or greater.

The processor 6 may execute code, or may include or be operatively connected to circuits, identifying in steps S204 and S206 a first bearing of interest B1, B2 ... Bn along a localization (e.g., a measurement representing a position in Cartesian or polar coordinates, or a direction, or the like) of one or both of an acoustic recognition (e.g., frequency, pattern, or other voice recognition) or a visual recognition (e.g., motion detection, face detection, skeletal detection, color blob segmentation or detection) within the angular range of the wide camera 2, 3, 5. As in step S10, and as in steps S12 and S14, a sub-scene video signal SS is subsampled from the wide camera 2, 3, 5 along the bearing of interest B1, B2 ... Bn identified in step S14 (e.g., either newly sampled from the imaging element of the wide camera 2, 3, 5 or subsampled from the panoramic scene SC captured in step S12). A width of the sub-scene video signal SS (e.g., a minimum width Min.1, Min.2 ... Min.n, or a sub-scene display width DWid.1, DWid.2 ... DWid.n) may be set by the processor 6 in step S210 according to a signal characteristic of one or both of the acoustic recognition and the visual recognition. The signal characteristic may represent the quality or confidence level of any of the various acoustic or visual recognitions. As used herein, "acoustic recognition" may include any recognition (e.g., meeting a threshold for a measurement, matching a descriptor, or the like) based on sound waves or vibrations, including frequency analysis of waveforms such as Doppler analysis, whereas "visual recognition" may include any recognition (e.g., meeting a threshold for a measurement, matching a descriptor, or the like) corresponding to electromagnetic radiation, such as heat or UV radiation, radio or microwaves, electrical property recognition, or depth/range detected by a sensor such as an RF element, a passive infrared element, or a rangefinding element.

For example, the bearings of interest B1, B2 ... Bn identified in step S14 may be determined by combinations of such acoustic and visual recognitions in different orders, some of which are shown as Mode 1, 2, or 3 in FIGS. 16-18 (and which may be reasonably and logically combined with one another). In one order, e.g., as in step S220 of FIG. 18, the bearings of acoustic recognitions are recorded first (although this order may be repeated and/or changed). Optionally, such bearings B1, B2 ... Bn may be an angle, an angle with a tolerance, or a bearing of approximate or angular range (such as bearing B5 in FIG. 7A). As shown in steps S228-S232 of FIG. 18, a recorded acoustic recognition bearing may be refined (narrowed or reassessed) based on a visual recognition (e.g., a face recognition) if a sufficiently reliable visual recognition lies substantially within a threshold angular range of the recorded acoustic recognition. In the same mode, or in combination with another mode, e.g., as in step S218 of FIG. 17, any acoustic recognition that is not associated with a visual recognition may remain as a candidate bearing of interest B1, B2 ... Bn.

Optionally, as in step S210 of FIG. 16, the signal characteristic is representative of a confidence level of one or both of the acoustic recognition and the visual recognition. A "confidence level" need not meet a formal probabilistic definition, but may mean any comparative measurement that establishes a degree of reliability (e.g., exceeding a threshold amplitude, signal quality, signal-to-noise ratio or equivalent, or a success criterion). Alternatively or in addition, as in step S210 of FIG. 16, the signal characteristic may be representative of the width of a feature recognized within one or both of the acoustic recognition (e.g., an angular range within which a sound may originate) or the visual recognition (e.g., interpupillary distance, face width, body width). For example, the signal characteristic may correspond to the approximate width of a human face recognized along a bearing of interest B1, B2 ... Bn (e.g., determined by visual recognition). The width of a first sub-scene video signal SS1, SS2 ... SSn may be set according to a signal characteristic of the visual recognition.

In some cases, e.g., as in step S228 of FIG. 18, if a width is not set according to a signal characteristic of the visual recognition (e.g., it cannot be set reliably because no width-defining feature can be recognized), a predetermined width may be set along the localization of an acoustic recognition detected within the angular range, as in step S230 of FIG. 18. For example, as in steps S228 and S232 of FIG. 18, if no face can be recognized by image analysis along a bearing of interest B1, B2 ... Bn evaluated to have an acoustic signal indicative of human speech, a default width (e.g., a sub-scene having a width equivalent to 1/10 to 1/4 of the width of the entire scene SC) may be kept or set, e.g., as in step S230, along the acoustic bearing for defining a sub-scene SS. For example, FIG. 7A shows an attendee and speaker scenario in which the face of attendee M5 is turned toward attendee M4, and M5 is speaking. In this case, the acoustic microphone array 4 of the meeting camera 100 may be able to localize the speaker M5 along the bearing of interest B5 (here, the bearing of interest B5 is depicted as a bearing range rather than a vector), yet image analysis of the panoramic scene SC of the wide camera 2, 3, 5 video signal may not be able to resolve a face or other visual recognition. In such a case, the default width Min.5 may be set as a minimum width for initially defining, limiting, or rendering the sub-scene SS5 along the bearing of interest B5.
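By way of non-limiting illustration, the choice between a width derived from a visual recognition and a default acoustic width may be sketched in Python as follows. The function name `subscene_width_px`, the pixel units, and the `face_scale` multiplier applied to a recognized face width are hypothetical assumptions of this sketch; only the 1/10 to 1/4 default range comes from the passage above.

```python
def subscene_width_px(panorama_w, face_w=None, face_scale=2.0, default_fraction=0.1):
    """Sub-scene width in pixels.

    Uses the recognized face width when available; otherwise falls back to a
    default fraction (1/10 to 1/4) of the panoramic scene width.
    """
    if not 0.1 <= default_fraction <= 0.25:
        raise ValueError("default_fraction must lie between 1/10 and 1/4")
    if face_w is not None and face_w > 0:
        # Width set from the visual recognition (face_scale is an assumed margin factor).
        return min(panorama_w, round(face_w * face_scale))
    # No width-defining feature recognized: keep the default (acoustic) width.
    return round(panorama_w * default_fraction)


print(subscene_width_px(3840))              # no face recognized
print(subscene_width_px(3840, face_w=200))  # face recognized
```

The fallback branch corresponds to the situation of attendee M5 in FIG. 7A, where speech is localized but no face can be resolved.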

In another embodiment, a bearing of interest B1, B2 ... Bn may be identified directed toward an acoustic recognition detected within the angular range of the meeting camera 100. In this case, the processor 6 may identify a visual recognition proximate to the acoustic recognition, optionally as in step S209 of FIG. 16 (e.g., within, overlapping, or next to the bearing of interest B1, B2 ... Bn, e.g., within 5-20 degrees of arc of the bearing of interest B1, B2 ... Bn). In this case, the width of the first sub-scene video signal SS1, SS2 ... SSn may be set according to a signal characteristic of the visual recognition that was (or is) proximate to, or otherwise matched with, the acoustic recognition. This may occur when, e.g., a bearing of interest B1, B2 ... Bn is first identified with the acoustic microphone array 4 and is later validated or verified by a sufficiently close or otherwise matching facial recognition using the video image from the wide camera 100.

In a variation, as described with reference to FIGS. 17 and 16, the system including the meeting or wide camera 100 may build a spatial map, as in step S218 of FIG. 17, using potential visual recognitions or acoustic recognitions, and may then, as in step S209 of FIG. 16, rely upon this spatial map to validate later, associated, matched, proximate, or "snapped-to" recognitions made by the same, a different, or another recognition approach. For example, in some cases the overall panoramic scene SC may be too large to scan effectively frame by frame for facial recognition or the like. In this case, because people do not move significantly from place to place in a meeting situation in which the camera 100 is used, especially after taking their seats for the meeting, only a part of the overall panoramic scene SC may be scanned, e.g., per video frame.

For example, as in step S212 of FIG. 17, in order to track sub-scenes SS1, SS2 ... SSn at bearings of interest B1, B2 ... Bn within a wide video signal, the processor 6 may scan a subsampling window through a motion video signal SC corresponding to a wide camera 100 field of view of substantially 90 degrees or greater. The processor 6 or circuits associated therewith may identify candidate bearings of interest B1, B2 ... Bn within the subsampling window by substantially satisfying a threshold defining a suitable signal quality for a candidate bearing of interest B1, B2 ... Bn, e.g., as in step S214 of FIG. 17. Each bearing of interest B1, B2 ... Bn may correspond to a localization of a visual recognition detected within the subsampling window, e.g., as in step S216 of FIG. 17. As in step S218 of FIG. 17, the candidate bearings B1, B2 ... Bn may be recorded in a spatial map (e.g., a memory or database structure that keeps track of the position, location, and/or direction of the candidate bearings). In this manner, for example, facial recognitions or other visual recognitions (e.g., motion) may be stored in the spatial map even if no acoustic detection has yet occurred at that bearing. Subsequently, the angular range of the wide camera 100 may be monitored by the processor 6 using the acoustic sensor or microphone array 4 for an acoustic recognition (which may be used to validate the candidate bearings of interest B1, B2 ... Bn).

With reference to FIG. 7A, for example, the processor 6 of the meeting camera 100 may scan different subsampled windows of the entire panoramic scene SC for visual recognitions (e.g., face, color, motion, or the like). Depending on lighting, motion, orientation of faces, and the like, in FIG. 7, potential bearings of interest corresponding to a face, motion, or similar detection of attendees M1 ... M5 may be stored in the spatial map. However, in the scenario shown in FIG. 7A, a potential bearing of interest toward attendee Map.1 may not be later validated by an acoustic signal if it corresponds to a non-speaking attendee (and this attendee may never be captured in a sub-scene, but only within the panoramic scene). Once attendees M1 ... M5 have spoken or begin speaking, the potential bearings of interest including or toward these attendees may be validated and recorded as bearings of interest B1, B2 ... B5.

Optionally, as in step S209 of FIG. 16, when an acoustic recognition is detected proximate to (substantially adjacent to, next to, or within +/- 5-20 degrees of arc of) one candidate bearing recorded in the spatial map, the processor 6 may snap the bearing of interest B1, B2 ... Bn so that it substantially corresponds to that one candidate bearing. Step S209 of FIG. 16 indicates that a bearing of interest is matched to its counterpart in the spatial map, where "matching" may include associating, replacing, or changing a bearing of interest value. For example, a facial recognition or motion recognition within a window and/or the panoramic scene SC may have better resolution than the acoustic array or microphone array 4, yet be detected less frequently or less reliably; the detected bearing of interest B1, B2 ... Bn resulting from an acoustic recognition may therefore be changed, recorded, or otherwise corrected or adjusted according to the visual recognition. In this case, instead of subsampling the sub-scene video signal SS1, SS2 ... SSn along the apparent bearing of interest B1, B2 ... Bn derived from the acoustic recognition, the processor 6 may subsample the sub-scene video signal from the wide camera 100 and/or the panoramic scene SC along the bearing of interest B1, B2 ... Bn following the snap operation, e.g., after the acoustic bearing of interest B1, B2 ... Bn has been corrected using the previously mapped visual recognitions. In this case, as in step S210 of FIG. 16, the width of the sub-scene video signal SS may be set according to a detected face width or motion width, or alternatively according to a signal characteristic of the acoustic recognition (e.g., default width, resolution of the array 4, confidence level, width of a feature recognized within one or both of the acoustic recognition or the visual recognition, approximate width of a human face recognized along the bearing of interest). As in step S210 of FIG. 16 or step S230 of FIG. 18, if the sub-scene SS width is not set according to a signal characteristic of the visual recognition, such as a face width or a range of motion, a predetermined width (e.g., the default width Min.5 as in FIG. 7A) may be set according to the acoustic recognition.
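By way of non-limiting illustration, the snap operation of step S209 may be sketched in Python as follows. The function names, the representation of the spatial map as a plain list of candidate bearings in degrees, and the 15-degree tolerance (chosen from within the 5-20 degree range mentioned above) are hypothetical assumptions of this sketch.

```python
def angular_diff_deg(a, b):
    """Smallest absolute difference between two bearings, in degrees (wraps at 360)."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def snap_bearing(acoustic_deg, candidates_deg, tolerance_deg=15.0):
    """Snap an acoustic bearing to the nearest mapped candidate within tolerance.

    Returns the acoustic bearing unchanged when no candidate is close enough.
    """
    best = None
    for cand in candidates_deg:
        diff = angular_diff_deg(acoustic_deg, cand)
        if diff <= tolerance_deg and (best is None or diff < best[0]):
            best = (diff, cand)
    return best[1] if best is not None else acoustic_deg


spatial_map = [5.0, 120.0]               # previously recorded visual candidates
print(snap_bearing(358.0, spatial_map))  # close to the 5-degree candidate across the wrap
print(snap_bearing(200.0, spatial_map))  # nothing nearby, so the acoustic bearing is kept
```

The wrap-around handling matters for a 360-degree panorama, where bearings of 358 degrees and 5 degrees are only 7 degrees apart.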

In the example of FIG. 18, the conference camera 100 and the processor 6 may track sub-scenes at bearings of interest B1, B2 ... Bn by recording a motion video signal corresponding to the field of view FOV of the wide camera 100, which is substantially 90 degrees or greater. In step S220, the processor may monitor an angular range corresponding to the field of view FOV of the wide camera 100 using the acoustic sensor array 4 for acoustic recognition. When an acoustic recognition is detected within that range in step S222, the processor may, in step S224, identify a bearing of interest B1, B2 ... Bn directed toward the acoustic recognition detected within the angular range. In step S226, the processor 6 and associated circuitry may then position a subsampling window within the motion video signal of the panoramic scene SC according to the corresponding range of the bearing of interest B1, B2 ... Bn (e.g., similar to the range of the bearing of interest B5 in FIG. 7A). When a visual recognition is detected within that range, as in step S228, the processor may then localize the visual recognition detected within the subsampling window. The processor 6 may subsequently subsample a sub-scene video signal SS captured from the wide camera 100 (directly from the camera 100 or from the panoramic scene recording SC), optionally substantially centered on the visual recognition. As in step S232, the processor 6 may then set the width of the sub-scene video signal SS according to a signal characteristic of the visual recognition. When visual recognition is not possible, not suitable, not detected, or not selected, as in step S228 of FIG. 18, the processor 6 may maintain or select the acoustic minimum width, as in step S230 of FIG. 18.
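
The branch between steps S230 and S232 can be sketched as follows. This is an illustration under stated assumptions rather than the claimed method: the `VisualHit` type, the 20-degree acoustic minimum width, and the 1.5x margin applied to the visually recognized width are all hypothetical values introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VisualHit:
    """A visual recognition localized inside the subsampling window."""
    center_deg: float  # bearing of the recognized face or motion
    width_deg: float   # angular width of the recognized feature


def place_subscene(
    acoustic_deg: float,
    visual: Optional[VisualHit],
    acoustic_min_width_deg: float = 20.0,  # assumed default, not from the source
    margin: float = 1.5,                   # assumed padding around the feature
) -> Tuple[float, float]:
    """Return (center_deg, width_deg) for the sub-scene to subsample."""
    if visual is None:
        # Step S230: no usable visual recognition, keep the acoustic minimum width.
        return acoustic_deg % 360.0, acoustic_min_width_deg
    # Step S232: center on the visual recognition and size the window from it.
    return visual.center_deg % 360.0, visual.width_deg * margin
```

For instance, an acoustic bearing of 100 degrees with a face found at 97 degrees and 8 degrees wide yields a window centered at 97 degrees and 12 degrees wide in this sketch.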

Alternatively, as in FIGS. 16 to 18, for example in step S212 of FIG. 17, the conference camera 100 and the processor 6 may track sub-scenes at bearings of interest B1, B2 ... Bn within a wide video signal such as the panoramic scene SC by monitoring an angular range with the acoustic sensor array 4 and a wide camera 2, 3, 5 observing a field of view of substantially 90 degrees or greater. The processor 6 may identify a plurality of bearings of interest B1, B2 ... Bn, each directed toward a localization (acoustic, visual, or sensor-based, as in step S216) within the angular range, and may maintain a spatial map of recorded characteristics corresponding to the bearings of interest B1, B2 ... Bn as the bearings of interest, the corresponding recognitions, the corresponding localizations, or data representing them are successively stored, as in step S218 of FIG. 17. Thereafter, for example as in step S210 of FIG. 16, the processor 6 may subsample a sub-scene video signal SS1, SS2 ... SSn from the wide camera 100 substantially along at least one bearing of interest B1, B2 ... Bn, and may set the width of the sub-scene video signal SS1, SS2 ... SSn according to the recorded characteristic corresponding to that bearing of interest.

Predictive tracking examples
The above description of structures, apparatuses, methods, and techniques for identifying new bearings of interest covers various detections, recognitions, triggers, or other causes for identifying such new bearings of interest. The following description discusses updating, tracking, or predicting changes in the bearing, direction, location, pose, width, or other characteristics of bearings of interest and sub-scenes; this updating, tracking, and predicting may also apply to the above description. The descriptions of methods for identifying new bearings of interest and for updating or predicting changes in bearings or sub-scenes are related in that reacquisition of a bearing of interest or sub-scene is facilitated by tracking or prediction. The methods and techniques described herein may be used to scan, identify, update, track, record, or reacquire bearings and/or sub-scenes in steps S20, S32, S54, or S56, and vice versa.

For example, predictive video data may be recorded per sub-scene, such as data encoded according to, or related to, any of the following: predictive HEVC, H.264, MPEG-4, or other MPEG I-slices, P-slices, and B-slices (or frames, or macroblocks); other intra and inter frames, pictures, macroblocks, or slices; H.264 or other SI frames/slices, SP frames/slices (Switching P), and/or multi-frame motion prediction; and VP9 or VP10 superblocks, blocks, macroblocks or superframes, intra-frame and inter-frame prediction, component prediction, motion compensation, motion vector prediction, and/or segmentation.

For example, other prediction or tracking data as described above, independent of a video standard or motion compensation SPI, may be recorded, such as motion vectors derived from audio motion relative to the microphone array, or motion vectors derived from direct or pixel-based methods (e.g., block matching, phase correlation, frequency-domain correlation, pixel recursion, optical flow) and/or indirect or feature-based methods (feature detection such as corner detection, with a statistical function such as RANSAC applied over a sub-scene or scene region).

Additionally or alternatively, updating or tracking per sub-scene may record, identify, or score indicia of relevance, or data or information representing them, such as derived audio parameters: for example, amplitude, frequency of utterances, length of utterances, related attendees M1, M2 ... Mn (two sub-scenes with back-and-forth traffic), a lead or moderating attendee M.Lead (a sub-scene that regularly and briefly interjects audio), or a recognized signal phrase (e.g., clapping, "keep the camera on me", and other phrase and speech recognition). These parameters or indicia may be recorded independently of the tracking step or at a different time during the tracking step. Tracking per sub-scene may also record, identify, or score indicia of error or irrelevance, for example: audio representative of coughs or sneezes; regular or periodic motion or video representative of machinery, wind, or flickering; or transient motion, or motion at a frequency high enough to be transient.

Additionally or alternatively, updating or tracking per sub-scene may record, identify, or score indicia, or data or information representing them, for setting a sub-scene and/or protecting a sub-scene from removal, based on retention criteria (e.g., time of audio/speaking, frequency of audio/speaking, time since last speaking, tagged for retention). In subsequent processing for compositing, removing sub-scenes other than a new or subsequent sub-scene does not remove a protected sub-scene from the composite scene. In other words, a protected sub-scene has a lower priority for removal from the composite scene.

Additionally or alternatively, updating or tracking per sub-scene may record, identify, or score indicia, or data or information representing them, for setting an addition criterion (e.g., time of speaking, frequency of speaking, audio-frequency cough/sneeze/doorbell, amplitude of sound, coincidence of speech angle and face recognition). In the processing for compiling, only subsequent sub-scenes that satisfy the addition criterion are combined into the composite scene.

Additionally or alternatively, updating or tracking per sub-scene may record, identify, or score indicia, or data or information representing them, for setting a sub-scene emphasis operation, such as an audio, CGI, image, video, or compositing effect, based on an emphasis criterion (e.g., a repeat speaker, a designated presenter, the most recent speaker, the loudest speaker, motion detection of an object being rotated in the hands or at a scene change, high-frequency scene activity in the frequency domain, or motion or skeleton recognition of a raised hand). Examples of emphasis operations include scaling one sub-scene larger, blinking or pulsing the border of one sub-scene, inserting a new sub-scene with a genie effect (growing from small to large), emphasizing or inserting a sub-scene with a bounce effect, arranging one or more sub-scenes with a card-sorting or shuffle effect, ordering sub-scenes with an overlapping effect, and cornering a sub-scene with the appearance of a "folded-over" graphic corner. In the compiling process, at least one of the discrete sub-scenes is emphasized according to the sub-scene emphasis operation, based on the respective or corresponding emphasis criterion.

Additionally or alternatively, updating or tracking per sub-scene may record, identify, or score indicia, or data or information representing them, for setting a sub-scene participant notification or reminder operation (e.g., blinking a light on the device 100 toward the attendees M1, M2 ... Mn, optionally a light on the same side as the sub-scene) based on a sensor or sensed criterion (e.g., too quiet, a remote poke from social media). In the compiling process or in other processing, a local reminder indicium is activated according to the notification or reminder operation, based on the respective or corresponding sensed criterion.

Additionally or alternatively, updating or tracking per sub-scene may record, identify, or score indicia, or data or information representing them, for predicting or setting a change vector for each respective angular sector FW1, FW2 ... FWn or SW1, SW2 ... SWn, for example based on a change in velocity or direction of a recorded characteristic (e.g., color blob, face, audio, as described herein with respect to steps S14 or S20) of each recognition or localization, and/or for updating the direction of each respective angular sector FW1, FW2 ... FWn or SW1, SW2 ... SWn based on that prediction or setting.
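
One way to picture a change vector is as an angular rate estimated from two successive localizations and extrapolated forward. The sketch below assumes constant angular velocity and bearings in degrees on a 360-degree panorama; the specification does not prescribe this particular predictor, and the function names are hypothetical.

```python
def angular_diff(a_deg, b_deg):
    """Smallest signed difference a - b in degrees, in the range (-180, 180]."""
    d = (a_deg - b_deg) % 360.0
    return d - 360.0 if d > 180.0 else d


def predict_bearing(prev_deg, prev_t, curr_deg, curr_t, future_t):
    """Linearly extrapolate a sector bearing to time future_t.

    The change vector is the angular rate (degrees per second) between the
    two most recent localizations of a recorded characteristic.
    """
    dt = curr_t - prev_t
    if dt <= 0:
        # No usable time base: fall back to the current bearing.
        return curr_deg % 360.0
    rate_deg_per_s = angular_diff(curr_deg, prev_deg) / dt
    return (curr_deg + rate_deg_per_s * (future_t - curr_t)) % 360.0
```

A face that moved from 350 to 358 degrees in one second would be predicted at 6 degrees one second later, so the sector follows it across the wrap-around point of the panorama.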

Additionally or alternatively, updating or tracking per sub-scene may record, identify, or score indicia, or data or information representing them, for predicting or setting a search area for recapture or reacquisition of a lost recognition or localization, for example based on the most recent position of a recorded characteristic (e.g., color blob, face, audio) of each recognition or localization, and/or for updating the direction of each respective angular sector based on that prediction or setting. The recorded characteristic may be at least one color blob, segmentation, or blob object representative of skin and/or clothing.

Additionally or alternatively, updating or tracking per sub-scene may maintain a Cartesian map or, in particular or optionally, a polar map of recorded characteristics (e.g., based on bearings B1, B2 ... Bn or angles from an origin OR within the scene SC, and on angular ranges such as the sub-scenes SS1, SS2 ... SSn corresponding to angular sectors FW, SW within the scene SC), each recorded characteristic having at least one parameter representing the bearing B1, B2 ... Bn of that recorded characteristic.
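
A polar map of this kind could be as simple as a list of records, each carrying a bearing parameter, that can be queried by angular range. The following is a minimal sketch under that assumption; the `Characteristic` and `PolarMap` names and their fields are hypothetical and are not defined in the specification.

```python
from dataclasses import dataclass


def angular_diff(a_deg, b_deg):
    """Smallest signed difference a - b in degrees, in the range (-180, 180]."""
    d = (a_deg - b_deg) % 360.0
    return d - 360.0 if d > 180.0 else d


@dataclass
class Characteristic:
    kind: str           # e.g. "face", "audio", "color_blob"
    bearing_deg: float  # bearing from the origin OR of the scene


class PolarMap:
    """Stores recorded characteristics keyed only by their bearing."""

    def __init__(self):
        self._items = []

    def record(self, item):
        self._items.append(item)

    def within(self, center_deg, half_width_deg):
        """Return characteristics whose bearing lies inside an angular sector."""
        return [
            c for c in self._items
            if abs(angular_diff(c.bearing_deg, center_deg)) <= half_width_deg
        ]
```

Querying a 10-degree-wide sector centered at 0 degrees returns characteristics recorded at both 358 and 3 degrees, which is the main reason to compare bearings with a wrap-aware difference rather than plain subtraction.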

Accordingly, alternatively or additionally, embodiments of the device 100, its circuitry, and/or executable code stored and executed within the ROM/RAM 8 and/or CPU/GPU 6 may track sub-scenes of interest SS1, SS2 ... SSn corresponding to widths FW and/or SW within the wide-angle scene SC by monitoring a target angular range (e.g., the horizontal range of the cameras 2n, 3n, 5, or 7 forming the scene SC, or a subset of it) with the acoustic sensor array 4 and the optical sensor array 2, 3, 5, and/or 7. The device 100, its circuitry, and/or its executable code may scan the target angular range SC for recognition criteria (e.g., sounds, faces), for example as described herein with respect to step S14 (identifying new bearings of interest) and/or step S20 (tracking and characteristic information for bearings/sub-scenes) of FIG. 8. The device 100, its circuitry, and/or its executable code may identify a first bearing of interest B1 based on a first recognition (e.g., a detection, identification, trigger, or other cause) and localization (e.g., an angle, vector, pose, or location) by at least one of the acoustic sensor array 4 and the optical sensor array 2, 3, 5, and/or 7. The device 100, its circuitry, and/or its executable code may identify a second bearing of interest B2 (and optionally third and subsequent bearings of interest B3 ... Bn) based on a second recognition and localization (and optionally third and subsequent recognitions and localizations) by at least one of the acoustic sensor array 4 and the optical sensor array 2, 3, 5, and/or 7.

The device 100, its circuitry, and/or its executable code may set a respective angular sector (e.g., FW, SW, or other) for each bearing of interest B1, B2 ... Bn by expanding, widening, setting, or resetting an angular sub-scene that includes the respective bearing of interest (e.g., an initial small angular range or a face-based sub-scene FW) until a threshold based on at least one recognition criterion is satisfied (e.g., a width threshold as discussed with reference to steps S16 to S18 of FIG. 13). Examples of the recognition criterion are that the set or reset angular span is wider than the interpupillary distance, twice that distance, or more, or that the set or reset angular span is wider than a head-to-wall contrast, distance, edge, difference, or motion transition.
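
The widening loop can be sketched for the interpupillary criterion alone. The sketch assumes that the interpupillary distance has already been converted to an angular span in degrees, and it introduces a fixed step size and an upper bound that do not appear in the specification; all parameter names are hypothetical.

```python
def widen_sector(
    initial_width_deg,
    interpupillary_deg,
    multiple=2.0,        # threshold as a multiple of the interpupillary span
    step_deg=1.0,        # assumed widening increment per iteration
    max_width_deg=90.0,  # assumed safety bound so the loop always terminates
):
    """Widen an angular sector until it spans at least multiple x interpupillary_deg."""
    width = initial_width_deg
    threshold = multiple * interpupillary_deg
    while width < threshold and width < max_width_deg:
        width = min(width + step_deg, max_width_deg)
    return width
```

Starting from a 4-degree face-based sector with a 3-degree interpupillary span and a multiple of 2, this sketch stops at 6 degrees; a sector that already exceeds the threshold is returned unchanged.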

The device 100, its circuitry, and/or its executable code may update or track (these terms are used interchangeably herein) the direction or bearing B1, B2 ... Bn of each respective angular sector FW1, FW2 ... FWn and/or SW1, SW2 ... SWn based on a change in the direction or bearing B1, B2 ... Bn of a recorded characteristic (e.g., color blob, face, audio) within, or representing, each recognition and/or localization. Optionally, as described herein, the device 100, its circuitry, and/or its executable code may update or track each respective angular sector FW1, FW2 ... FWn and/or SW1, SW2 ... SWn so as to follow angular changes in the first, second, and/or third and/or subsequent bearings of interest B1, B2 ... Bn.

Composite output examples (for video conferencing)
In FIGS. 8A to 8D, 10A to 10B, and 19 to 24, the "composite output CO", i.e., the combined or composited sub-scenes as a composited and rendered camera view, is shown with lead lines to both the main view of the remote display RD1 (representing the scene received from the conference room local display LD) and the network interface 10 or 10a. This represents that the teleconferencing client of the conference room (local) display LD "transparently" treats the video signal received from the USB peripheral device 100 as a single camera view and passes the composite output CO on to the remote clients or remote displays RD1 and RD2. Note that all thumbnail views may also show the composite output CO. In general, FIGS. 19, 20, and 22 correspond to the arrangement of attendees shown in FIGS. 3A to 5B, with one additional attendee joining in FIG. 21, seated in the empty seat shown in FIGS. 3A to 5B.

Among exemplary transitions, the reduced panoramic video signal SC.R (occupying about 25% of the vertical screen) may show a "zoomed-in" portion of the panoramic scene video signal SC (e.g., as shown in FIGS. 9A to 9E). The zoom level may be determined by the number of pixels contained in that roughly 25%. When a person or object M1, M2 ... Mn becomes relevant, the corresponding sub-scene SS1, SS2 ... SSn transitions onto the stage scene STG or composite output CO (e.g., by compositing a sliding video panel), maintaining its clockwise or left-to-right position among the participants M1, M2 ... Mn. At the same time, the processor, using GPU 6 memory or ROM/RAM 8, may slowly scroll the reduced panoramic video signal SC.R left or right so as to display the current bearing of interest B1, B2 ... Bn at the center of the screen. The current bearing of interest may be highlighted. When a new relevant sub-scene SS1, SS2 ... SSn is identified, the reduced panoramic video signal SC.R may rotate or pan so that the most recent sub-scene SS1, SS2 ... SSn is highlighted and centered within the reduced panoramic video signal SC.R. With this configuration, the reduced panoramic video signal SC.R is continuously re-rendered and effectively panned during the meeting to show the relevant part of the room.
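
The slow scroll toward the current bearing of interest can be sketched as a per-frame step that is clamped in size and always takes the shorter way around the panorama. The sketch assumes a full 360-degree panorama and a hypothetical limit of 2 degrees per frame; neither value is stated in the specification.

```python
def angular_diff(a_deg, b_deg):
    """Smallest signed difference a - b in degrees, in the range (-180, 180]."""
    d = (a_deg - b_deg) % 360.0
    return d - 360.0 if d > 180.0 else d


def scroll_step(view_center_deg, target_deg, max_step_deg=2.0):
    """Move the center of the reduced panorama one frame toward target_deg.

    The step is clamped to +/- max_step_deg so the strip scrolls slowly, and
    its sign follows the shorter direction (left or right) to the target.
    """
    remaining = angular_diff(target_deg, view_center_deg)
    step = max(-max_step_deg, min(max_step_deg, remaining))
    return (view_center_deg + step) % 360.0
```

Calling `scroll_step` once per rendered frame until the returned center equals the target gives a gradual pan; with the view centered at 350 degrees and a target at 10 degrees, the strip scrolls right through 352, 354, and so on, rather than left through 340 degrees of panorama.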

As shown in FIG. 19, in a typical videoconferencing display, each attendee's display shows a master view and a plurality of thumbnail views, each substantially determined by the output signal of a web camera. The master view is typically one of the remote attendees, and the thumbnail views represent the other attendees. Depending on the videoconferencing or chat system, the master view may be selected to show the active speaker among the attendees, or may be switched to another attendee, in some cases including the local scene, often by selecting a thumbnail. In some systems, the local scene thumbnail always remains within the overall display so that each attendee can position themselves relative to the camera to present a useful scene (this example is shown in FIG. 19).

As shown in FIG. 19, embodiments according to the present invention provide a composited stage view of multiple attendees instead of a single camera scene. For example, in FIG. 19, potential bearings of interest B1, B2, and B3 toward attendees M1, M2, and M3 (represented by icon figures M1, M2, and M3) are available to the conference camera 100. As described herein, because there are three possible attendees M1, M2, M3 who are localized or otherwise identified, and one of them, SPKR, is speaking, the stage STG (equivalent to the composite output CO) may initially be populated with a default number (in this case two) of relevant sub-scenes, including the sub-scene of the active speaker SPKR, who is attendee M2 in FIG. 19.

FIG. 19 shows the displays of three participants: a local display LD, e.g., of a personal computer connected to the conference camera 100 and to the Internet INET; a first personal computer ("PC") or tablet display, remote display RD1, of a first remote attendee A.hex; and a second PC or tablet display RD2 of a second remote attendee A.diamond. As would be expected in a videoconferencing context, the local display LD most prominently shows the remote speaker selected by the operator of the local display PC or by the videoconferencing software (A.hex in FIG. 19), while the two remote displays RD1, RD2 show the views selected by the remote operators or software (e.g., the view of the active speaker, the composite view CO of the conference camera 100).

The arrangement of attendees within the master and thumbnail views depends to some degree on user selections, and even automated selections, within the videoconferencing or video chat system. In the example of FIG. 19, the local display LD shows, as would be typical, a master view in which the last selected remote attendee is shown (e.g., A.hex, the attendee working at the PC or laptop having the remote display RD1) and a thumbnail row in which essentially all attendees are represented (including the composited stage view from the local conference camera 100). Each of the remote displays RD1 and RD2, in contrast, shows a master view including the composited stage view CO, STG (because the speaker SPKR is currently speaking), with the thumbnail row again including views of the remaining attendees.

FIG. 19 assumes that attendee M3 has already spoken, or was previously selected as a default occupant of the stage STG, and already occupies the most relevant sub-scene (e.g., was the most recently relevant sub-scene). As shown in FIG. 19, the sub-scene SS1 corresponding to speaker M2 (icon figure M2, and on remote display 2 the silhouette M2 with an open mouth) is composited into the single camera view with a slide transition (represented by the block arrow). A preferred slide transition starts with zero or negligible width, with the middle, i.e., the bearing of interest B1, B2 ... Bn of the corresponding sub-scene SS1, SS2 ... SSn, sliding onto the stage; it then grows the width of the composited corresponding sub-scene SS1, SS2 ... SSn until it reaches at least a minimum width, and may continue to grow that width until the entire stage is filled. Because the compositing (mid-transition) and the composited scene are provided as a camera view to the teleconferencing client of the conference room (local) display LD, they may be presented substantially simultaneously (i.e., as the current view) within the main and thumbnail views of the local client display LD and the two remote client displays RD1, RD2.
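
The width growth of a slide transition can be sketched as a sequence of crop windows in the panorama, each centered on the pixel column of the bearing of interest. Linear growth over a fixed number of frames and integer pixel arithmetic are assumptions made here for illustration; the function name and parameters are hypothetical.

```python
def slide_transition(center_px, final_width_px, frames):
    """Return (left_px, width_px) crop windows for frames 0..frames.

    The window starts at zero width on the bearing-of-interest column and
    grows linearly until it reaches final_width_px.
    """
    if frames < 1:
        raise ValueError("frames must be at least 1")
    windows = []
    for i in range(frames + 1):
        width = final_width_px * i // frames  # linear growth, integer pixels
        left = center_px - width // 2         # keep the bearing in the middle
        windows.append((left, width))
    return windows
```

A compositor using this sketch would paste each successive window onto the stage as its panel slides in, so that the subject at the bearing of interest is visible from the first non-empty frame.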

In FIG. 20, subsequent to FIG. 19, attendee M1 becomes the most recent and/or most relevant speaker (e.g., the prior situation was that of FIG. 19, in which attendee M2 was the most recent and/or most relevant speaker). Sub-scenes SS3 and SS2 for attendees M3 and M2 remain relevant according to the tracking and identification criteria, and may be recomposited to a smaller width as necessary (by scaling or by cropping, optionally limited by a width limit of 2 to 12 times the interpupillary distance and otherwise as described herein). Sub-scene SS2 is similarly composited to a compatible size and is then composited onto the stage STG with a slide transition (again represented by a block arrow). As discussed herein with respect to FIGS. 9, 10A to 10B, and 11A to 11B, because the new speaker SPKR is attendee M1, who is to the right (clockwise, in a top-down view) of the bearing of the already displayed attendee M2, sub-scene SS1 may optionally be transitioned onto the stage in a manner that preserves the handedness or left-to-right order (M3, M2, M1), in this case with a transition from the right.

In FIG. 21, subsequent to FIG. 20, a new attendee M4 arriving in the room becomes the most recent and most relevant speaker. Sub-scenes SS2 and SS1 for speakers M2 and M1 remain relevant according to the tracking and identification criteria and remain composited at the "3 to 1" width. The sub-scene corresponding to speaker M3 "ages out" and is no longer as relevant as the most recent speakers (although many other priorities and relevancies are described herein). Sub-scene SS4, corresponding to speaker M4, is composited to a compatible size and is then composited into the camera output with a flip transition (again represented by a block arrow), with sub-scene SS3 flipped out as a removal. This may instead be a slide transition or an alternative transition. Although not shown, as an alternative, because the new speaker SPKR is attendee M4, who is to the left (clockwise, in a top-down view) of the bearings of the already displayed attendees M2 and M1, sub-scene SS4 may optionally be transitioned onto the stage in a manner that preserves the handedness or left-to-right order (M4, M2, M1), in this case with a transition from the left. In that case, each of sub-scenes SS2, SS1 may shift one place to the right, and sub-scene M3 may exit to the right of the stage (sliding away).
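
The stage bookkeeping in FIGS. 19 to 21 (a new speaker joins, the least recently relevant sub-scene ages out, and left-to-right order is preserved) can be sketched as follows. This is an illustration under stated assumptions: a fixed stage capacity of three, ordering by ascending bearing without wrap-around, a single `last_active` timestamp as the relevance measure, and a `protected` flag standing in for the retention criteria described earlier. None of these names come from the specification.

```python
from dataclasses import dataclass


@dataclass
class SubScene:
    name: str
    bearing_deg: float       # bearing of interest, used for left-to-right order
    last_active: float       # time the attendee was last relevant
    protected: bool = False  # tagged for retention (lower removal priority)


def update_stage(stage, new, capacity=3):
    """Add a new sub-scene, aging out the least recently active one if full."""
    others = [s for s in stage if s.name != new.name]
    while len(others) + 1 > capacity:
        # Prefer removing unprotected sub-scenes; fall back to any if all are protected.
        removable = [s for s in others if not s.protected] or others
        oldest = min(removable, key=lambda s: s.last_active)
        others.remove(oldest)
    # Preserve the left-to-right order of the attendees by bearing.
    return sorted(others + [new], key=lambda s: s.bearing_deg)
```

With M3, M2, and M1 on stage and M4 arriving to their left, this sketch drops M3 (the least recently active) and returns the order M4, M2, M1, matching the alternative described for FIG. 21; marking M3 as protected would remove M2 instead.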

As described herein, FIGS. 19-21 show exemplary local and remote videoconferencing modes, on mobile devices as an example, in which the composited, tracked, and/or displayed composite scene is received and displayed as a single camera scene. These are referred to and described in context in the preceding paragraphs.

Although the overall information is similar, FIG. 22 presents a form of displaying a videoconference that is a variation on the form of FIG. 19. In particular, in FIG. 19 the thumbnail views do not overlap the master view, and the thumbnail view matching the master view is retained within the thumbnail row. In the form of FIG. 22, by contrast, the thumbnails overlap the master view (e.g., are composited so as to be superimposed on the master view), and the current master view is de-emphasized within the thumbnail row (e.g., by dimming).

FIG. 23 shows a variation of FIGS. 19-22 in which a fourth client, corresponding to a high-resolution, close-up, or similar separate camera 7, has its own client connected to the teleconferencing group via network interface 10b, while the composited output CO and its transitions are presented to the conference room (local) display LD via network interface 10a.

FIG. 24 shows a variation of FIGS. 19-22 in which a code or document review client having a text review window connects to the conference camera 100 via a local wireless connection (although in one variation the code or document review client may connect via the Internet from a remote station). In one example, a first device or client (PC or tablet) runs a videoconferencing or chat client that shows the attendees in a panoramic view, and a second client or device (PC or tablet) runs the code or document review client and provides it to the conference camera 100 as a video signal of the same form as a web camera. The conference camera 100 composites the document window/video signal of the code or document review client onto the stage STG or CO as a full-frame sub-scene SSn, and optionally also composites the local panoramic scene including the meeting attendees, e.g., above the stage STG or CO. In this manner, the text shown within the video signal is available to all attendees in place of the individual attendee sub-scenes, while the attendees may still be seen by referring to the panoramic view SC. Although not shown, the conference camera 100 device may alternatively create, instantiate, or execute a second videoconferencing client to host the document view. Alternatively, a high-resolution, close-up, or simply separate camera 7 has its own client connected to the teleconferencing group via network interface 10b, while the composited output CO and its transitions are presented to the conference room (local) display via network interface 10a.

In at least one embodiment, the meeting attendees M1, M2 ... Mn may be shown in the stage scene video signal or composited output STG, CO at all times. As shown in FIG. 25, for example, based on at least face-width detection, the processor 6 may crop faces as face-only sub-scenes SS1, SS2 ... SSn and line them up along the top or bottom of the stage scene video signal or composited output STG, CO. In this case, it may be desirable for a participant using a device such as remote device RD1 to be able to click on, or (in the case of a touchscreen) touch, a cropped face-only sub-scene SS1, SS2 ... SSn in order to communicate with the local display LD and create a stage scene video signal STG concentrated on that person. In one exemplary solution, using a configuration similar to FIG. 1B that is directly connected to the Internet INET, the conference camera 100 may create or instantiate an appropriate number of virtual videoconferencing clients and/or assign a virtual camera to each.
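One way to picture the row of face-only sub-scenes and the click or touch selection is the following sketch. The rectangle layout, the gap between thumbnails, and both function names are illustrative assumptions; the disclosure does not specify how the row is laid out or how a selection is resolved.

```python
def face_row_rects(stage_w, stage_h, thumb_w, thumb_h, count, edge="top", gap=8):
    """Return (x, y, w, h) rectangles for `count` face-only sub-scenes,
    centered horizontally along the top or bottom edge of the stage."""
    if count <= 0:
        return []
    row_w = count * thumb_w + (count - 1) * gap
    if row_w > stage_w:
        raise ValueError("thumbnails do not fit across the stage")
    x0 = (stage_w - row_w) // 2
    y = 0 if edge == "top" else stage_h - thumb_h
    return [(x0 + i * (thumb_w + gap), y, thumb_w, thumb_h) for i in range(count)]


def hit_test(rects, px, py):
    """Return the index of the sub-scene containing the point, or None."""
    for i, (x, y, w, h) in enumerate(rects):
        if x <= px < x + w and y <= py < y + h:
            return i
    return None
```

A remote client could send the selected index back to the conference camera, which would then recomposite the stage around the corresponding attendee; that signalling path is not shown here.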

FIG. 26 shows some of the iconography and symbols used throughout FIGS. 1-26. In particular, an arrow extending from the center of a camera lens may correspond to a bearing of interest B1, B2 ... Bn, whether or not the arrow is so labeled in the various views. Dashed lines extending at an open "V"-shaped angle from a camera lens may correspond to the field of view of the lens, whether or not the dashed lines are so labeled in the various views. A sketched "stick figure" depiction of a person, having an oval head and a rectangular or trapezoidal body, may correspond to a meeting participant, whether or not the sketched person is so labeled in the various views. A depiction of an open mouth on the sketched person may depict the current speaker SPKR, whether or not the sketched person with the open mouth is so labeled in the various views. A broad arrow extending from left to right, from right to left, from top to bottom, or in a spiral may indicate an ongoing transition or the compositing of a transition, whether or not the arrow is so labeled in the various views.

In the present disclosure, "wide-angle camera" and "wide scene" depend on the field of view and the distance from the subject, and include any camera having a field of view wide enough to capture, in a meeting, two different persons who are not shoulder to shoulder.

"Field of view" is the horizontal field of view of a camera, unless a vertical field of view is specified. As used herein, "scene" means an image of a scene (still or moving) captured by a camera. Generally, although not without exception, a panoramic "scene" SC is one of the largest images, video streams, or signals handled by the system, whether that signal is captured by a single camera or stitched from multiple cameras. The scene "SC" most commonly referred to herein includes a scene SC that is a panoramic scene SC captured by a camera coupled to a fisheye lens, a camera coupled to panoramic optics, or an equiangular distribution of overlapping cameras. Panoramic optics may provide a panoramic scene to the camera substantially directly. In the case of a fisheye lens, the panoramic scene SC may be a horizon band in which the perimeter or horizon band of the fisheye view has been isolated and dewarped into a long, high-aspect-ratio rectangular image. In the case of overlapping cameras, the panoramic scene may be stitched and cropped (and possibly dewarped) from the individual overlapping views. "Sub-scene" means a sub-portion of a scene, e.g., a contiguous and usually rectangular block of pixels smaller than the entire scene. A panoramic scene may be cropped to less than 360 degrees and still be referred to as the overall scene SC within which sub-scenes are handled.
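As a minimal sketch of a sub-scene as a rectangular block of pixels, the following code selects the panorama columns centered on a bearing of interest. It assumes a dewarped panorama whose columns map linearly to angle, with column 0 at bearing 0, and it represents frames as nested lists; these choices and the function names are illustrative assumptions rather than the disclosed implementation.

```python
def subscene_columns(pano_width_px, bearing_deg, width_deg, pano_fov_deg=360.0):
    """Column indices of a sub-scene centered on bearing_deg and spanning width_deg.

    For a full 360 degree panorama the columns wrap around the seam;
    for a narrower panorama, columns outside the image are dropped.
    """
    px_per_deg = pano_width_px / pano_fov_deg
    center = bearing_deg * px_per_deg
    half = width_deg * px_per_deg / 2.0
    start = int(round(center - half))
    stop = int(round(center + half))
    if pano_fov_deg >= 360.0:
        return [c % pano_width_px for c in range(start, stop)]
    return [c for c in range(start, stop) if 0 <= c < pano_width_px]


def crop_subscene(panorama, bearing_deg, width_deg, pano_fov_deg=360.0):
    """Crop a sub-scene (list of pixel rows) from a panorama (list of pixel rows)."""
    cols = subscene_columns(len(panorama[0]), bearing_deg, width_deg, pano_fov_deg)
    return [[row[c] for c in cols] for row in panorama]
```

The wraparound branch matters for an attendee seated at the seam of a 360 degree panorama, whose sub-scene would otherwise be cut in half.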

As used herein, "aspect ratio" is expressed as an H:V (horizontal:vertical) ratio, and a "greater" aspect ratio increases the horizontal proportion with respect to the vertical (wide and short). An aspect ratio greater than 1:1 (e.g., 1.1:1, 2:1, 10:1) is considered "landscape form", and for the purposes of this disclosure an aspect ratio equal to or less than 1:1 (e.g., 1:1.1, 1:2, 1:3) is considered "portrait form". A "single camera" video signal is formatted as a video signal corresponding to a single camera, e.g., UVC, also known as the "USB Device Class Definition for Video Devices" 1.1 or 1.5 by the USB Implementers Forum, each of which is incorporated herein by reference in its entirety (see http://www.usb.org/developers/docs/devclass_docs/USB_Video_Class_1_5.zip, or USB_Video_Class_1_1_090711.zip at the same URL). Any of the signals described within UVC may be a "single camera video signal", whether or not the signal is transported, carried, transmitted, or tunneled via USB.
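The aspect-ratio conventions can be checked with a few lines of code. This sketch uses the 2.4:1 panorama threshold and the 2:1 stage threshold stated in the abstract; the function names are hypothetical, and the qualifier "substantially" is not modeled.

```python
from fractions import Fraction

PANORAMA_MIN = Fraction(12, 5)  # 2.4:1, panoramic video signal (from the abstract)
STAGE_MAX = Fraction(2, 1)      # 2:1, stage scene video signal (from the abstract)


def aspect_ratio(width_px: int, height_px: int) -> Fraction:
    """Exact H:V ratio of a frame."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("frame dimensions must be positive")
    return Fraction(width_px, height_px)


def form(width_px: int, height_px: int) -> str:
    """'landscape' above 1:1; 'portrait' at or below 1:1, as defined in the text."""
    return "landscape" if aspect_ratio(width_px, height_px) > 1 else "portrait"


def is_panoramic(width_px: int, height_px: int) -> bool:
    return aspect_ratio(width_px, height_px) >= PANORAMA_MIN


def fits_stage(width_px: int, height_px: int) -> bool:
    return aspect_ratio(width_px, height_px) <= STAGE_MAX
```

Note that a square frame counts as portrait form under the definition above, and that a common 16:9 frame satisfies the stage limit while an 8:1 strip satisfies the panorama limit.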

"Display" means any direct display screen or projected display. "Camera" means a digital imager, which may be a CCD or CMOS camera, a thermal imaging camera, or an RGBD depth or time-of-flight camera. The camera may be a virtual camera formed by two or more stitched camera views, and/or of wide aspect, panoramic, wide-angle, fisheye, or catadioptric perspective.

A "participant" is a person, device, or location connected to the group videoconferencing session and displaying a view from a web camera, whereas in most cases an "attendee" is not only a participant but is also in the same room as the conference camera 100. A "speaker" is an attendee who is speaking, or who has spoken recently enough for the conference camera 100 or a related remote server to identify him or her; in some descriptions, however, a speaker may also be a participant who is speaking, or who has spoken recently enough for the videoconferencing client or a related remote server to identify him or her.

"Compositing" generally means digital compositing as is known in the art, i.e., digitally assembling multiple video signals (and/or images or other media objects) to make a final video signal, and includes techniques such as alpha compositing and blending, anti-aliasing, node-based compositing, keyframing, layer-based compositing, nesting compositions or comps, and deep image compositing (using color, opacity, and depth by means of deep data, whether function-based or sample-based). Compositing is an ongoing process that includes motion and/or animation of sub-scenes, each containing a video stream; for example, the various frames, windows, and sub-scenes within an overall stage scene may each display a different ongoing video stream as they are moved, transitioned, blended, or otherwise composited as the overall stage scene. Compositing as used herein may use a compositing window manager having one or more off-screen buffers for one or more windows, or a stacking window manager. Any off-screen buffer or display memory content may be double or triple buffered or otherwise buffered. Compositing may further include processing on either or both of buffered windows and display memory windows, such as applying 2D and 3D animation effects, blending, fading, scaling, zooming, rotation, duplication, bending, contortion, shuffling, blurring, and adding drop shadows, glows, previews, and animation. Compositing may further include applying these to vector-oriented graphical elements or to pixel- or voxel-oriented graphical elements. Compositing may include rendering pop-up previews upon touch, mouse-over, hover, or click; window switching by rearranging several windows against a background to permit selection by touch, mouse-over, hover, or click; and flip switching, cover switching, ring switching, exposure switching, and the like. As described herein, various visual transitions may be used on the stage, such as fading, sliding, growing or shrinking, and combinations of these. "Transition" as used herein includes the necessary compositing steps.
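Among the techniques listed, alpha blending and fading are simple enough to sketch. The code below is an illustration only: frames are modeled as nested lists of grayscale values, the blend is the usual linear mix, and both function names are hypothetical.

```python
def blend(foreground, background, alpha):
    """Per-pixel linear blend: out = alpha * foreground + (1 - alpha) * background."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return [
        [alpha * f + (1.0 - alpha) * b for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(foreground, background)
    ]


def fade_frames(old, new, steps):
    """Frames of a fade transition from `old` to `new`, ending exactly on `new`."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return [blend(new, old, i / steps) for i in range(1, steps + 1)]
```

In a live system each step would blend the current frames of two ongoing video streams rather than two still images, since the sub-scenes keep playing during the transition.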

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

All of the processes described above may be embodied in, and fully automated via, software code modules executed by one or more general-purpose or special-purpose computers or processors. The code modules may be stored on any type of computer-readable medium or other computer storage device or collection of storage devices. Alternatively, some or all of the methods may be embodied in specialized computer hardware.

All of the methods and tasks described herein may be performed and fully automated by a computer system. The computer system may, in some cases, include multiple distinct computers or computing devices (e.g., physical servers, workstations, storage arrays, etc.) that communicate and interoperate over a network to perform the described functions. Each such computing device typically includes a processor (or multiple processors, or circuitry or a collection of circuits, e.g., a module) that executes program instructions or modules stored in a memory or other non-transitory computer-readable storage medium. The various functions disclosed herein may be embodied in such program instructions, although some or all of the disclosed functions may alternatively be implemented in application-specific circuitry (e.g., ASICs or FPGAs) of the computer system. Where the computer system includes multiple computing devices, these devices may, but need not, be co-located. The results of the disclosed methods and tasks may be persistently stored by transforming physical storage devices, such as solid-state memory chips and/or magnetic disks, into a different state.

Claims

1. A method of compositing and outputting a video signal, comprising:
recording a panoramic video signal having an aspect ratio of substantially 2.4:1 or greater, captured from a wide camera having a horizontal field of view of substantially 90 degrees or greater;
subsampling at least two sub-scene video signals at respective bearings of interest from the wide camera;
compositing the at least two sub-scene video signals side by side to form a stage scene video signal having an aspect ratio of substantially 2:1 or less, wherein more than 80% of the area of the stage scene video signal is subsampled from the panoramic video signal; and
outputting the stage scene video signal formatted as a single camera video signal.

2. The method of claim 1, further comprising: subsampling an additional sub-scene video signal at a respective bearing of interest from the panoramic video signal; and
compositing the at least two sub-scene video signals together with at least one additional sub-scene video signal to form a stage scene video signal having an aspect ratio of substantially 2:1 or less and including a plurality of side-by-side sub-scene video signals.

3. The method of claim 2, wherein compositing the at least two sub-scene video signals together with the at least one additional sub-scene video signal to form a stage scene video signal includes:
transitioning the at least one additional sub-scene video signal into the stage scene video signal by replacing at least one of the at least two sub-scene video signals, to form a stage scene video signal having an aspect ratio of substantially 2:1 or less.

4. The method of claim 3, wherein each sub-scene video signal is assigned a minimum width, and upon completion of each respective transition into the stage scene video signal, each sub-scene video signal is composited side by side at substantially no less than its minimum width.

5. The method of claim 4, wherein the composite width of each sub-scene video signal in transition increases throughout the transition until the composite width is substantially greater than or equal to its corresponding minimum width.

6. The method of claim 4, wherein each sub-scene video signal is composited side by side at substantially no less than its minimum width, and each at a respective width such that the sum of the widths of all composited sub-scene video signals is substantially equal to the width of the stage scene video signal.

7. The method of claim 6, wherein the widths of the sub-scene video signals within the stage scene video signal are composited to change according to an activity criterion detected at at least one bearing of interest corresponding to a sub-scene video signal, while the width of the stage scene video signal is kept constant.

8. The method of claim 2, wherein compositing the at least two sub-scene video signals together with the at least one additional sub-scene video signal to form a stage scene video signal includes:
transitioning the at least one additional sub-scene video signal into the stage scene video signal by reducing a width of at least one of the at least two sub-scene video signals by an amount corresponding to a width of the at least one additional sub-scene video signal.

9. The method of claim 8, wherein each sub-scene video signal is assigned a respective minimum width, and each sub-scene video signal is composited side by side at substantially no less than its corresponding minimum width to form the stage scene video signal, and wherein, when the sum of the respective minimum widths of the at least two sub-scene video signals together with the at least one additional sub-scene video signal exceeds the width of the stage scene video signal, at least one of the at least two sub-scene video signals is transitioned to be removed from the stage scene video signal.

10. The method of claim 9, wherein the at least one of the two sub-scene video signals that is transitioned to be removed from the stage scene video signal corresponds to a respective bearing of interest at which an activity criterion was least recently satisfied.

11. The method of claim 9, wherein a left-to-right order, with respect to the wide camera, among the respective bearings of interest of the at least two sub-scene video signals and the at least one additional sub-scene video signal is preserved as the at least two sub-scene video signals are composited together with the at least one additional sub-scene video signal to form the stage scene video signal.

12. The method of claim 1, wherein each respective bearing of interest from the panoramic video signal is selected dependent upon a selection criterion detected at the respective bearing of interest with respect to the wide camera, the method further comprising:
after the selection criterion is no longer true, transitioning the corresponding sub-scene video signal to be removed from the stage scene video signal.

13. The method of claim 12, wherein the selection criterion includes the presence of an activity criterion satisfied at the respective bearing of interest, the method further comprising:
counting the time since the activity criterion was satisfied at the respective bearing of interest, wherein, a predetermined period after the activity criterion was satisfied at the respective bearing of interest, the respective sub-scene video signal is transitioned to be removed from the stage scene video signal.

14. The method of claim 1, further comprising: subsampling, from the panoramic video signal, a reduced panoramic video signal having an aspect ratio of substantially 8:1 or greater; and
compositing the at least two sub-scene video signals together with the reduced panoramic video signal to form a stage scene video signal having an aspect ratio of substantially 2:1 or less and including a plurality of side-by-side sub-scene video signals and the panoramic video signal.

15. The method of claim 14, further comprising compositing the at least two sub-scene video signals together with the reduced panoramic video signal to form a stage scene video signal having an aspect ratio of substantially 2:1 or less and including a plurality of side-by-side sub-scene video signals and the panoramic video signal above the plurality of side-by-side sub-scene video signals, wherein the panoramic video signal occupies 1/5 or less of the area of the stage scene video signal and extends substantially across the width of the stage scene video signal.

16. The method of claim 14, further comprising: subsampling a text video signal from a text document; and
transitioning the text video signal into the stage scene video signal by replacing at least one of the at least two sub-scene video signals with the text video signal.

17. The method of claim 3, further comprising setting at least one of the at least two sub-scene video signals as a protected sub-scene video signal that is protected from transition based on a retention criterion, wherein transitioning the at least one additional sub-scene video signal into the stage scene video signal by replacing at least one of the at least two sub-scene video signals transitions a sub-scene video signal other than the protected sub-scene.

18. The method of claim 1, further comprising setting a sub-scene enhancement operation based on an enhancement criterion, wherein at least one of the at least two sub-scene video signals is enhanced according to the sub-scene enhancement operation based on a corresponding enhancement criterion.

19. The method of claim 1, further comprising setting a sub-scene participant notification operation based on a criterion detected from a sensor, wherein a local reminder indicator is activated according to the notification operation based on a corresponding detected criterion.

20. The method of claim 1, wherein the panoramic video signal has an aspect ratio of substantially 8:1 or greater and is captured from a wide camera having a horizontal field of view of substantially 360 degrees.

21. A method of tracking sub-scenes at bearings of interest within a wide video signal, comprising:
monitoring an angular range with an acoustic sensor array and a wide camera observing a field of view of substantially 90 degrees or greater;
identifying a first bearing of interest along a localization of at least one of an acoustic recognition and a visual recognition detected within the angular range;
subsampling a first sub-scene video signal from the wide camera along the first bearing of interest; and
setting a width of the first sub-scene video signal according to a signal characteristic of at least one of the acoustic recognition and the visual recognition.

22. The method of claim 21, wherein the signal characteristic represents a confidence level of at least one of the acoustic recognition and the visual recognition.

23. The method of claim 21, wherein the signal characteristic represents a width of a feature recognized within at least one of the acoustic recognition and the visual recognition.

24. The method of claim 23, wherein the signal characteristic corresponds to an approximate width of a human face recognized along the first bearing of interest.

25. The method of claim 23, wherein, when a width is not set according to a signal characteristic of the visual recognition, a predetermined width is set along a localization of an acoustic recognition detected within the angular range.

26. The method of claim 21, wherein the first bearing of interest is determined by a visual recognition, and the width of the first sub-scene video signal is set according to a signal characteristic of the visual recognition.

27. The method of claim 21, wherein the first bearing of interest is identified directed toward an acoustic recognition detected within the angular range, the method further comprising:
identifying a visual recognition proximate to the acoustic recognition, wherein the width of the first sub-scene video signal is set according to a signal characteristic of the visual recognition proximate to the acoustic recognition.

28. A method of tracking sub-scenes at bearings of interest within a wide video signal, comprising:
scanning a subsampling window through a video signal corresponding to a wide camera field of view of substantially 90 degrees or greater;
identifying candidate bearings within the subsampling window, each bearing of interest corresponding to a localization of a visual recognition detected within the subsampling window;
recording the candidate bearings in a spatial map; and
monitoring an angular range corresponding to the wide camera field of view with an acoustic sensor array for an acoustic recognition.

29. The method of claim 28, further comprising, when an acoustic recognition is detected proximate to one candidate bearing recorded in the spatial map:
snapping a first bearing of interest to correspond substantially to the one candidate bearing; and
subsampling a first sub-scene video signal from the wide camera along the first bearing of interest.

30. The method of claim 29, further comprising setting a width of the first sub-scene video signal according to the signal characteristics of the acoustic recognition.

31. The method of claim 30, wherein the signal characteristic represents a confidence level of the acoustic recognition.

32. The method of claim 30, wherein the signal characteristic represents a width of a feature recognized within at least one of the acoustic recognition and the visual recognition.

33. The method of claim 30, wherein the signal characteristic corresponds to an approximate width of a human face recognized along the first bearing of interest.

34. The method of claim 30, wherein, when a width is not set according to a signal characteristic of the visual recognition, a predetermined width is set along a localization of an acoustic recognition detected within the angular range.

35. A method of tracking sub-scenes at bearings of interest, comprising:
recording a video signal corresponding to a wide camera field of view of substantially 90 degrees or greater;
monitoring an angular range corresponding to the wide camera field of view with an acoustic sensor array for an acoustic recognition;
identifying a first bearing of interest directed toward an acoustic recognition detected within the angular range;
locating a subsampling window in the video signal according to the first bearing of interest; and
localizing a visual recognition detected within the subsampling window.

36. The method of claim 35, further comprising: subsampling a first sub-scene video signal captured from the wide camera and substantially centered on the visual recognition; and
setting a width of the first sub-scene video signal according to a signal characteristic of the visual recognition.

37. A method of tracking sub-scenes at bearings of interest within a wide video signal, comprising:
monitoring an angular range with an acoustic sensor array and a wide camera observing a field of view of substantially 90 degrees or greater;
identifying a plurality of bearings of interest, each directed toward a localization within the angular range;
maintaining a spatial map of recorded characteristics corresponding to the bearings of interest;
subsampling a sub-scene video signal from the wide camera substantially along at least one bearing of interest; and
setting a width of the sub-scene video signal according to a recorded characteristic corresponding to the at least one bearing of interest.

38. A method of tracking sub-scenes at bearings of interest within a wide video signal, comprising:
monitoring an angular range with an acoustic sensor array and a wide camera observing a field of view of substantially 90 degrees or greater;
identifying a plurality of bearings of interest, each directed toward a localization within the angular range;
subsampling a sub-scene video signal from the wide camera substantially along at least one bearing of interest; and
setting a width of the sub-scene video signal by expanding the sub-scene video signal until a threshold based on at least one recognition criterion is satisfied.

39. The method of claim 38, further comprising: predicting a change vector for each bearing of interest based on a change in one of velocity and direction of a recorded characteristic corresponding to a localization; and
updating the position of each bearing of interest based on the prediction.

40. The method of claim 38, further comprising: predicting a search area for a localization based on the most recent position of a recorded characteristic corresponding to the localization; and
updating the position of the localization based on the prediction.