JP3616111B2

JP3616111B2 - Method and apparatus for obtaining a high resolution still image using multiple images of different focal lengths or different viewing zones

Info

Publication number: JP3616111B2
Application number: JP29449192A
Authority: JP
Inventors: エィ．テオドシオローラ; アール．ベンダーワルター
Original assignee: Massachusetts Institute of Technology
Current assignee: Massachusetts Institute of Technology
Priority date: 1991-11-01
Filing date: 1992-11-02
Publication date: 2005-02-02
Anticipated expiration: 2020-02-02
Also published as: JPH05304675A

Description

【０００１】
【産業上の利用分野】
本発明は基本的には異なる焦点距離の複数の像及び装置を使用して高解像度静止画像を創作する方法に関するものである。特には、本発明はズームビデオシークエンス（ｓｅｑｕｅｎｃｅ）等の複数の異なる焦点距離像（イメージ）を使用した固定焦点距離像である静止高解像画像を創出する方法に関するものである。本発明はまた静止パノラマ像のものよりも狭い視域の複数像から静止パノラマ像（イメージ）を創作する技術にも関するものである。
【０００２】
【従来の技術】
画像処理分野においては、あるシーン（ｓｃｅｎｅ）の静止画像を得ることがしばしば望まれる。たいていの場合においては、その静止画像は記録手段の性能及びその静止画像を撮像する装置の焦点距離により決定される解像度を有する。ビデオ装置は現在比較的安価であり、多くの人々が使用できるほどに単純構造である。ビデオ記録装置はスチール写真のような静止画像描出と較べてある意味の利点を備えている。始動したビデオカメラはその焦点領域内にある全出来事を撮影することができるが、一方普通のスチール写真の場合には写真家がシャッターを押すことで選択した被写体のみを撮影する。よって、スポーツイベントのごとき高速で移動する被写体を撮影する場合、又は結婚式や報道ドキュメンタリーのごとき予期しない事態が発生するような状況下ではビデオを常時撮影状態にセットしておき、事後に望むスチールを選択することが往々にして便利である。しかしながら、ビデオ信号の解像度は１ピクチャー高あたり４８０本程度のライン（走査線）と１ピクチャー幅あたり６４０程度のサンプル（ｓａｍｐｌｅ）に制限されている。（ビデオ信号自体はスキャンラインを通じて連続的である。しかしながら、ディスプレーするためにスキャンライン方向にサンプル化されている。）多くの場合においてこの程度の解像度では高画質を与えるには不充分である。特に、もしオリジナルの被写体像が比較的短い焦点距離で撮影されたものである場合において不足する。像が拡大されるとその像は相当に不鮮明になる。同様に、映画や８ミリ撮影のごとき他の撮影技術の場合にもそれらに特有の解像度に限定される。像の拡大は画像全体にわたり単位面積あたりの解像度を必然的に劣化させる。
【０００３】
例えば、ステージ上のピアノの前で聴衆に向かって演奏しているソロの演奏家をその聴衆と共に撮影することが望まれる場合がある。もし撮像装置がビデオ装置であれば、聴衆を映し出しているワイドアングル画像は前記の標準ビデオ解像度に見合った解像度となる。画面全体にわたる解像度も同じである。よって、ソロピアニストの像はそのシーンの残り部分と同様な粗さとなる。たとえば、もしそのソロリスト像が全画面の１／１６のスペースを取っているとすれば、垂直方向に１２０本のラインと、水平方向に１６０のサンプルを使用していることになる。この場合、たとえば会場の後方にある空席等のあまり重要でないシーンも同一の解像度となる。図１は異なる２焦点距離に関する１焦平面上に、あるシーンの焦点合わせを略図的に示すものである。もし焦点距離ｆｗが比較的短ければ、像２の全幅は焦平面４上に焦点が合う。
【０００４】
もちろん、ソロリストにズームインし、さらに長い焦点距離でソロリストの像を撮影することでそのソロリストをより鮮明（即ち、垂直方向にさらに多数のラインと水平方向にさらに多数のピクセル（ｐｉｘｅｌ））に映し出すことは可能である。図１に示すごとく、焦点距離ｆＴはｆｗよりも長い。しかしながら、像２の中央部６のみが焦平面４に焦点が合わせされている。焦平面の範囲外で焦点されているので他のほとんどのシーンは犠牲になっている。ソロリスト像はさらに拡大されて大きなスペースを占め、元の像の輪郭の一部は写らない。
【０００５】
２チャンネルのデータを結合させて画像データを増強する（ｅｎｈａｎｃｅ）技術は周知である。その第１チャンネルは空間的高解像度（即ち、単位長さあたりに比較的多くの画素）及び比較的時間的低解像度（即ち、単位時間あたりに比較的少ないフレーム数）を有するものであり、その第２チャンネルは空間的低解像度及び時間的高解像度を有する。その結合の結果、空間的及び時間的解像度はそれらの高いほうに近づき、普通の状態で時間的及び空間的高解像度を有するシングル像シークエンス（ｓｉｎｇｌｅｉｍａｇｅｓｅｑｕｅｎｃｅ）を伝達するのに要する情報よりも少ない情報伝達で済む。１９８８年５月に合衆国のマサチューセッツ工科大学（ＭａｓｓａｃｈｕｓｅｔｔｓＩｎｓｔｉｔｕｔｅｏｆＴｅｃｈｎｏｌｏｇｙ）電気工学及びコンピュータ科学部に提出されたＢ．Ｓ．論文であるクレーマンローレンスエヌ（Ｃｌａｍａｎ，ＬａｗｒｅｎｃｅＮ．）の「２チャンネル空間−時間エンコーダ（ＡＴｗｏ−ＣｈａｎｎｅｌＳｐａｔｉｏ−ＴｅｍｐｏｒａｌＥｎｃｏｄｅｒ）」を参照されたい。
【０００６】
静止画像の種々な空間的部分の解像増強のごとく、最短焦点距離で撮影したものを映写化するのに今日利用可能な技術は見当らない。クレーマンの論文は固定焦点距離像とベクトル量子化を利用しており、その結果オリジナルの空間的高解像度像のものを越えない解像度及び視域のスチールフレームを提供している。
【０００７】
【発明が解決しようとする課題】
パノラマビュー（ｖｉｅｗ）の一部から他のパノラマビューの一部にかけて実質的に共通な１焦点距離を維持しつつ、あるシーンのパノラマビューを提供できることもまた望ましいことである。これを行う従来の方法はビデオカメラをパノラマシーンの片側から別側に移動することであり、本質的には前後のフレームから各々ほんの少々異なる数多くのフレームを撮ることである。その隣接するフレームに関して、各フレームは右側と左側のエッジ部分が異なるのみである。フレームを形成している像の大部分は隣接するフレームの像と同一である。パノラマシーンを形成するこれらの種々な像を保存し、ナビゲートするには多量のデータ保存及びデータアクセスが必要とされる。データ保存及びアクセスには多額の費用を要するのでこの従来技術は望ましいものとはいえない。また、保存及びアクセスされたデータの大部分が実用性を有しない。パノラマ的空間を撮影するのに現在使用されている撮影装置にはグルブスコープ（ｇｌｕｂｕｓｃｏｐｅ）又はボルピ（ｖｏｌｐｉ）レンズの移動が含まれる。
１シーンの１位置から別位置へパン（ｐａｎ）し、同時にズームできることも望ましいことである。従来技術の欠点はそのような組み合わせにおいて望ましくない結果をもたらすことである。
【０００８】
よって、本発明の目的は、以下の利点を備えた比較的に高解像度を有する静止画像を創作する方法及び装置を提供することである：
１）像全体にわたり高解像度で情報を取得する必要がない。
２）あまり重要ではない像の大部分に関して情報を収集する必要がない。
３）種々な焦点距離又は視域の標準的ビデオ像のシークエンスを入力要素として取得できる。
４）標準的フィルム像のシークエンスを入力要素として取得できる。
５）望む像のいかなる部分でもその解像度を増強する。
６）適正にプログラムされた汎用デジタルコンピュータ及び標準型ビデオ又は映画装置が使用可能である。
【０００９】
本発明の別目的は過剰なデータ保存及びアクセス能力を要せず、あるシーンのパノラマビューを観察者に提供し、その観察者にそのシーンの１位置から他の位置までのナビゲーションを可能とする方法を提供することである。本発明のさらに別目的はデジタル化されたいかなる形態の像データであろうとも前記能力を発揮させることである。
【００１０】
【課題を解決するための手段】
上記課題を解決するための本発明を要約すれば、本発明は静止画像を発生させる方法であって、複数の像を創出するステップを有しており、各像は他と異なる焦点距離にて創出されており、さらに、その各像を共通の焦点距離にスケールするステップと、その各スケールされた像を１焦点距離の最終像に組み合わせるステップを有しており、その最終像の部分はそのオリジナルシークエンスと比較して相対的に高い解像度を有している。本発明はさらに、変化する視域の静止画像のシークエンスを全体的視域のパノラマ像に組み入れるステップをも有している。異なる視域にて発生された像を組み合わせることに加えて、本発明の方法はパノラマシーンのごとき全体的シーンの異なる視域に関して発生した像を組み合わされたパノラマ視域に組み合わせることにも使用可能である。本発明のこの特質は変化する焦点距離のものとも結合可能である。
【００１１】
本発明はまた静止画像を発生させる装置であって、複数の像を創出する手段を有しており、各像は他と異なる焦点距離にて創出されたものであり、さらに、各像を共通の焦点距離にスケールする手段とスケールされた像をそれぞれ１焦点距離の１像に組み入れる手段とを有している。本発明の装置はさらに全体的シーンの異なる視域に関して発生した像を組み合わされたパノラマ視域に組み入れる装置を含んでいる。
【００１２】
次に、最初の実施例を解説する。ここでは本発明は静止画像（イメージ）を発生させる方法であって、以下のステップから成り立っている：
１）それぞれ異なる焦点距離で創出された複数の像をそれぞれ代表する複数の信号を発生させるステップ
２）共通の焦点距離にスケール（ｓｃａｌｅ）された対応する像を代表するように各信号を変換するステップ
３）各変換した信号を組み合わせ、オリジナルシークエンスの像と比較して部分的には比較的に高解像度であるスケールされた像を１焦点距離の最終像とする組み合わせを代表した信号を得るステップ。
【００１３】
別の実施例を解説する。ここでは本発明は静止画像を発生させる装置であって、以下の手段から成り立っている：
１）それぞれ互いに異なる焦点距離にて創出された複数の像を創出する手段。
２）異なる焦点距離のそれぞれの像を代表する複数の信号を発生させる手段。
３）共通の焦点距離にスケールされた対応する像を代表するように複数の信号の各々を変換する手段
４）各変換した信号を組み合わせ、スケールされた像を１焦点距離の１像とする組み合わせを代表した信号を得る手段。
【００１４】
さらに別の実施例を解説する。ここでは本発明は静止画像を発生させる方法であって、以下のステップから成る：
１）それぞれ異なる視域で創出された複数の像をそれぞれ代表する複数の信号を発生させるステップ
２）共通のパノラマ視域内で１位置にトランスレート（ｔｒａｎｓｌａｔｅ）された対応する像を代表するように各信号を変換するステップ
３）各変換した信号を組み合わせ、オリジナルシークエンスの像と比較してさらに大きな視域をカバーする１パノラマ視域の最終像とするトランスレートされた像の組み合わせを代表した信号を得るステップ。
【００１５】
またさらに別の実施例を解説する。ここでは本発明は静止画像を発生させる装置であって、以下の手段から成り立っている：
１）それぞれ互いに異なる視域にて創出された複数の像を創出する手段
２）それぞれ異なる視域の複数の像の１つを代表する複数の信号を発生させる手段
３）共通のパノラマ視域内の１位置にトランスレートされた対応する像を代表するように複数の信号の各々を変換する手段
４）各変換した信号を組み合わせ、１パノラマ視域の１像とするトランスレートされた像の組み合わせを代表した信号を得る手段。
【００１６】
さらにまた別の実施例を解説する。ここでは本発明は静止画像を発生させる方法であって、以下のステップから成る：
１）それぞれ異なる視域にて創出された複数の像の１つを各々代表する複数の信号を発生させるステップ
２）共通のパノラマ視域内の１位置にトランスレートされ、共通の焦点距離にスケールされた対応する像を代表するように各信号を変換するステップ
３）各変換した信号を組み合わせ、部分的にはオリジナルシークエンスの像よりも高い解像度であり、オリジナルシークエンス及び１焦点距離の像と比較してさらに大きな視域をカバーする１パノラマ視域の最終像とするトランスレートされ、スケールされた像の組み合わせを代表した信号を得るステップ。
【００１７】
【作用】
上記構成により、以下の利点を備えた比較的に高解像度を有する静止画像を創作する方法及び装置が提供される。
１）像全体にわたり高解像度で情報を取得する必要がない。
２）あまり重要ではない像の大部分に関して情報を収集する必要がない。
３）種々な焦点距離又は視域の標準的ビデオ像のシークエンスを入力要素として取得できる。
４）標準的フィルム像のシークエンスを入力要素として取得できる。
５）望む像のいかなる部分でもその解像度を増強する。
６）適正にプログラムされた汎用デジタルコンピュータ及び標準型ビデオ又は映画装置が使用可能である。
【００１８】
さらに、過剰なデータ保存及びアクセス能力を要せず、あるシーンのパノラマビューが観察者に提供され、その観察者にそのシーンの１位置から他の位置までのナビゲーションを可能とする方法が提供される。
又、デジタル化されたいかなる形態の像データであろうとも前記能力を発揮させることができる。
【００１９】
【実施例】
以下、本発明の実施例につき詳細に説明する。
典型的なビデオ像はフィールドのシークエンスにより創出される。各フィールドは画像化されるシーンの静止画像を代表する。インターレース（ｉｎｔｅｒｌａｃｅ）によって連続的フィールド間に１／２のスキャンラインの垂直方向のずれが生じる。（表示システムによってはインターレースなくスキャンされており、その場合にはフィールド間の垂直方向のずれは生じない。）一般的に毎秒５０又は６０フィールドの割合でこのような静止フィールドのシークエンスを表示することにより、モーション又は変化は人の視覚システムの心理的精神肉体的特性によって表出される。各フィールドのペアは前述したごとくにラインで満ちたスクリーンで構成され、各ラインはそれぞれ画素（ピクセル）により構成されている。各ピクセルは、コンピュータメモリー又は他の適当なデジタル記録媒体内に特定のレンジである信号値により代表される。カラー像においては、このレンジは典型的には色の３要素（ｃｏｍｐｏｎｅｎｔ）の各々に対して０−２５５であり、グレースケール（ｇｒａｙｓｃａｌｅ）像に対してはこのレンジは１要素あたり特徴的には０−２５５である。サテライト映像又はＸ線のような像源は０−４０９６ほどの大きさのレンジを有することもある。ピクセル値はフレーム内のそれらの位置に何らかの手段にて対応する形状でメモリー内に保存される。ビデオ像に対してなされる全操作は典型的には個々のピクセル要素の値を代表する信号に対して行われる。
【００２０】
像のモノクロ記録の場合には、各ピクセル要素は単独の個別的（ｄｉｓｃｒｅｔｅ）な要素である。像のカラー記録の場合には、１セットのチャンネル又はピクセルのグループが各ピクチャー要素に対して使用される。例えば、ＲＧＢとして知られる表色スキーム（ｃｏｌｏｒｖａｌｕｅｓｃｈｅｍｅ）において、各色は赤（Ｒ）、緑（Ｇ）及び青（Ｂ）の色量の組み合わせにより代表される。これら３色の各々の色「チャンネル」が別々に提供される。ＲＧＢシステムにおいて、各チャンネルはスキャンラインごとに同数のピクセルとスクリーンごとに同数のスキャンラインを有する。下記他のカラー値システムは異なるチャンネルに対してスキャンラインごとに異なる数のサンプルを有する。ピクセルの要素は典型的には表示装置上に相互に隣接して位置しており、同時に表示されるときには（観察者にとってはそのように錯覚する）結合してオリジナルの色を形成する。ピクセルの時間シークエンス（ｔｉｍｅｓｅｑｕｅｎｔｉａｌ）ディスプレー等の他のスキームも利用可能である。
【００２１】
ＲＧＢカラー値スキームは特定の使用には役立つが、カラー値の数学的操作には必ずしも最適ではない。他のカラースキームの方がさらに有用であり、特には像の輝度を表しているチャンネルを含むものが有効である。一般的に輝度とは所定方向における単位あたりの知覚面積から出される、又は反射された光の強さとして記述される。一般的に、輝度及び２つの他のディメンション（ｄｉｍｅｎｓｉｏｎ）により定義される３チャンネルカラースペースはＲＧＢカラースペースと等価である。典型的な輝度カラースペースは、アメリカ合衆国のテレビ放送用テレビ基準委員会により使用されているＹ（輝度）、ｉ（位相）及びｑ（クワドラチャー（ｑｕａｄｒａｔｕｒｅ））カラースペースである。他の輝度カラースペースはＣＩＥ（ＣｏｍｍｉｓｓｉｏｎＩｎｔｅｒｎａｔｉｏｎａｌｄｅｌ′Ｅｃｌａｉｒａｇｅ）、Ｙ、ｘ、ｙ（輝度及び２クロミナンスチャンネル）及びそのバリアント（ｖａｒｉａｎｔｓ）、さらにＹ、ｕ、ｖ（輝度及び２クロミナンスチャンネル）並びに他にも多数存在している。
【００２２】
本発明においては、たいていの処理は１チャンネル又は１コンポーネントで充分である。全てのデータ計算及び操作はまずカラー像のＹチャンネルに対してのみ実施される。Ｙチャンネルが選択される理由は普通、Ｙチャンネルがビデオシステムにおいて最も高い信号対ノイズ比を有しているからであり、さらに、Ｙチャンネルは、たいていの場合にクロミナンスよりも高い空間（ｓｐａｔｉａｌ）周波数にてサンプルされるからである。Ｙチャンネルに関して必要な変換が決定された後、同一の変換が位相及びクワドラチャーのクロミナンスチャンネルのような残りのチャンネルに対して適用される。これらの変換の特徴を以下にて述べる。
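As a rough illustration of the luminance-first processing described above, the sketch below extracts the Y channel from RGB pixels with Python's standard `colorsys.rgb_to_yiq`. The helper name `luminance_plane` and the nested-list image layout are assumptions made for illustration; the text does not prescribe a particular conversion routine or set of coefficients.

```python
import colorsys

def luminance_plane(rgb_image):
    """Return the Y (luminance) plane of an RGB image.

    rgb_image is a list of rows; each pixel is an (r, g, b) tuple with
    components in the range 0-255 (an assumed layout).
    """
    plane = []
    for row in rgb_image:
        out_row = []
        for r, g, b in row:
            # colorsys works on floats in [0, 1]; only Y is kept here.
            y, _i, _q = colorsys.rgb_to_yiq(r / 255.0, g / 255.0, b / 255.0)
            out_row.append(y)
        plane.append(out_row)
    return plane
```

Motion parameters would be estimated on this plane only and then reused unchanged for the chrominance channels.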
【００２３】
ビデオ像は通常一連のフレームと見なされているが、実際にはそのような「フレーム」はいかなるときにも存在していない。フレームとして人たる観察者及び当該分野の技術者に考えられているのは、実際には「フィールド」の１ペアのことである。各フィールドは偶数列のスキャンライン、又は奇数列のスキャンラインにより成り立つ。偶数のフィールドは奇数列のスキャンラインから垂直方向に１本のスキャンラインの半分だけオフセット（ｏｆｆｓｅｔ）されている。１ペアのフィールドはインターレース（ｉｎｔｅｒｌａｃｅ）されて１フレームを形成している。
【００２４】
フィールドペア１０１と１０２は図２において略図的に示されている。フィールド１０１は像の奇数列のスキャンラインのみを含んでおり、フィールド１０２は像の偶数列のスキャンラインのみを含んでいる。ビデオ装置はこれらのフィールドを連続的に別々に記録する。よって、各フィールドは潜在的に多少とも異なる像を記録することとなり、それはフィールドの記録に要する時間に関連するシーン又はカメラのモーションによる。また、ビデオ装置はフィールドを連続的に素早くディスプレーし、その速度は典型的には毎秒５０から６０フィールドである。この速度でフィールドがディスプレーされるとき、観察者は組み合わされて１つになったフレーム１１０であるフィールドを「見る」こととなる。各フィールド（シークエンスの最初と最後を除いて）は各連続した２フレームのコンポーネントであることが理解されよう。図３に示すように、フィールド１０２はフレーム１１０の第２フィールドとフレーム１１２の第１フィールドを形成する。同様に、フィールド１０３はフレーム１１２の第２フィールドとフレーム１１４の第１フィールドを形成する。人たる観察者により組み合わされる以外は、フレームは個々の信号要素としては実際に存在しないことがこれで理解されよう。
【００２５】
本発明の方法はフレームのシークエンス、特にはビデオ像のシークエンスを使用する。本発明実施化には、フレームコンポーネントの脱インターレース（ｄｅ−ｉｎｔｅｒｌａｃｅ）が必要である。脱インターレースとは、奇数列又は偶数列のラインだけではなく、像の各ラインのピクセル値を含む、特にコンピュータメモリーオンリーにおけるピクセル要素からなる実際のフレームを表す信号を構成することを意味している。本発明はまたインターレース技術を利用することなく記録されたデータに対しても適用が可能である。しかし、インターレースされる材料は共通なので、それを脱インターレースできることが重要である。
【００２６】
本発明によれば、脱インターレースはデータ信号にメジアン（ｍｅｄｉａｎ）フィルターを適用することで達成される。例えば、時間ｔで脱インターレースされたフレームの７番目のスキャンラインを創出するには４個の値のメジアンが使用される。即ち、時間ｔ−１におけるフィールドの７番ラインの各ピクセル要素に対する値と、時間ｔ＋１におけるフィールドの７番ラインの対応ピクセル要素に対する値と、時間ｔにおけるフィールドの６番ラインの対応ピクセル要素に対する値と、時間ｔにおけるフィールドの８番ラインの対応ピクセル要素に対する値である。これらの４個のメジアンは脱インターレースされたシークエンスのフレームの７番ラインにおける対応ピクセル要素に対する値として割り当てられる。
【００２７】
同じプロセスがスキャンラインの各ピクセルと、そのフィールドの各奇数列のスキャンラインに対して繰り返される。偶数列のスキャンラインは単に時間ｔにおけるフィールドから採用される。この脱インターレースされたフレームはいかなるオリジナルシークエンスのフレームとも異なることを指摘する必要がある。なぜならば、奇数列スキャンラインを形成しているピクセル要素は前後のフィールドと時間ｔでのフィールドとの結合により創出されるからである。
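The median deinterlacing step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary layout of a field, the function name `deinterlace`, the handling of the top and bottom borders, and the use of `statistics.median` (which averages the two middle values when given four samples) are all assumptions.

```python
from statistics import median

def deinterlace(prev_field, cur_field, next_field, height):
    """Build one full frame at time t from three consecutive fields.

    Each field is a dict {line_number: [pixel values]}. cur_field (time t)
    is assumed to hold the even lines; prev_field (t-1) and next_field
    (t+1) hold the odd lines.
    """
    frame = []
    for line in range(height):
        if line in cur_field:
            # Lines present in the field at time t are copied unchanged.
            frame.append(list(cur_field[line]))
            continue
        row = []
        for x in range(len(prev_field[line])):
            # Same line in the previous and the next field.
            candidates = [prev_field[line][x], next_field[line][x]]
            # Lines directly above and below in the field at time t,
            # where they exist (the border rows have only one neighbour).
            for neighbour in (line - 1, line + 1):
                if neighbour in cur_field:
                    candidates.append(cur_field[neighbour][x])
            row.append(median(candidates))
        frame.append(row)
    return frame
```

For a static scene the previous and next fields agree, so the median reproduces the missing line; where there is motion, the vertical neighbours from time t limit the influence of the other two fields.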
【００２８】
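To create the second deinterlaced frame the process is repeated, but this time each even scan line is formed from the median of the corresponding even scan lines in fields t and t+2 and the odd scan lines immediately above and below it in field t+1. The odd scan lines are taken directly from the field at time t+1.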
【００２９】
フレームが脱インターレースされた後、一連のフレームが得られ、それらはそれぞれ偶数列と奇数列のスキャンラインの全必要数から構成されており、フィールドのシークエンスを観察している人たる観察者により感知される１フレーム内のものと同数のスキャンラインを有しているものとなる。
【００３０】
他の脱インターレース方法もまた利用でき、本発明の思想の範囲内である。しかし、これら他の方法は前述の脱インターレース技術程には優れた結果を提供するとは考えられていない。その１つの方法は１フィールド内の各ペアのライン間において、データの新ラインを合成するために各フィールド内のスキャンライン間のリニアインタポレーション（ｌｉｎｅａｒｉｎｔｅｒｐｏｌａｔｉｏｎ）を行うことである。この技術は動きのない像の部分において明らかに空間的解像度（ｓｐａｔｉａｌｒｅｓｏｌｕｔｉｏｎ）のロスを導く。別の方法としては、そのときのフィールドの前後のフィールド間でインタポレーションを行うことである。この技術は動きのある像の部分の時間的解像度を犠牲にする。別々のフィールドに対してデータ操作を施し、１フィールドを次のフィールドにワープさせるためのアファイン変換を利用することも可能である。しかしながら、フィールドのアファイン変換はインターレースによる空間的時間エイリアシング（ｔｅｍｐｏｒａｌａｌｉａｓｉｎｇ）により脱インターレースされたフレームにアファイン変換を施すほどには良好な結果を提供しない。
【００３１】
本発明の方法に使用されるフレームのシークエンスは、典型的にはズームシークエンスであり、これは長い焦点距離から短い焦点距離、又はその逆のズーミングを意味する。又、ズーミング中にパン（水平モーション）又はジブ（ｊｉｂ）（垂直モーション）することもできる。脱インターレースされた１００のフレームのシークエンス２００が図４において略図的に示されている。そのシークエンスは比較的短い焦点距離でワイドなアングルのフレーム２０１から比較的長い焦点距離でクローズアップのフレーム３００までのズームにより構成されている。フレーム２０１において一連の文字が略図的に表されている。フレーム３００においてはフレーム２０１の中央に位置しているアルファベット「Ｘ」の部分のみがフレームに含まれている。
【００３２】
シークエンス２００の各フレームは同数のピクセル要素及びスキャンラインから構成されている。従って、中央の文字Ｘのクロス部分がフレーム２０１のズームにおいて５０のピクセルを使用しているものと仮定すれば、そのクロス部分はフレーム３００でのズームにおいては３００のピクセルを使用していることになる。１の典型的なズームは、シークエンスの最初から最後のフレームにかけて４：１以上のスケーリングファクター（ｓｃａｌｉｎｇｆａｃｔｏｒ）をもたらす。フレーム３００のズームからのクロス部分の映像化は明らかにフレーム２０１のズームからのクロス部分の映像化よりも多くの情報提供が可能となる。しかし、例えば他の文字のごとき当初のシーンの他の被写体を映像化するのに必要な情報はフレーム３００からはまったく得られない。なぜなら、これらのものはフレーム３００には存在しないからである。よって、本発明の１つの目的はシークエンスの別々のフレームから得られる情報を１つの組み合わせ像に形成することであり、そのシーンの大部分についての充分な情報を提供することである。
【００３３】
個々のフレームとオリジナルシーンの間の関係は図５にて略図的に示されている。フレーム２０１は全オリジナルシーンを映し出している。フレーム２０２は領域２０２ｓ内部に見合うだけのオリジナルシーンのみを映像化しており、それはフレーム２０１よりも小さいものである。フレーム２０３は領域２０３ｓに見合うだけのオリジナルシーンのみを映像化しており、このように順番に領域３００ｓまで続く。従って、オリジナルシーンはズーム２０１によりマッピング（ｍａｐｐｉｎｇ）された全シーンのデータスペースと同一サイズであるデータスペース内に徐々により小さな部分へとマッピングされてゆく。（このデータスペースは「小」データスペースという。本方法に関する他のデータスペースと比べて最小だからである。）
【００３４】
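An object of the invention is to use as much of the information available from each zoom frame as possible. Each zoom frame is enlarged so that it has the same scale as the frame with the longest focal length, frame 299. As shown schematically in FIG. 6, the enlarged frames are stacked on one another. The original frame 201 is enlarged to many times its original size. Each successive frame 202, 203 and so on is enlarged by a progressively smaller amount, and the final frame 299 is not enlarged at all. The frames can be superimposed so that the image in each frame substantially coincides with the same portion of the scene in every other frame. If the image portion of each frame is opaque, the visible part of the scene (all of frame 299 and the borders of every other frame) is made up of the frame with the highest resolution at each location.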
【００３５】
各フレームのスケールアップは、各フレームを代表するデータ信号の変換操作と、その変換データを一連のデータスペースにマッピングする操作とを経て、究極的に最終データスペースにマッピングされたデータを代表する信号を発生させる操作により達成される。この最終データスペースは、前記小データスペースよりもずっと大きく、よって大データスペースと呼ばれる。（実際各フレームは全大スペース内に構成されており、最大に拡大されたフレーム２０１と同じデータスペースを占めている。しかし、フレーム２０１を除く全フレームに対しては、フレームの輪郭部はゼロの値を有するピクセルにより形成されている。）
【００３６】
各拡張フレームの解像度はもちろん互いに異なっており、ある部分、例えば、フレーム２０１からの中央文字Ｘにおいて得られる情報の詳細はフレーム２０３又は２９９から得られるシーンの同じ部分における情報の詳細よりもずっと少ないであろう。言い換えるならば、フレーム２０１の拡大像はピクセル間の情報ギャップの影響を受ける。フレーム３００からの像はこれらの情報ギャップの影響をまったく受けない。この情報ギャップは以下に説明するごとく、データを取得可能なピクセル間のインタポレーションにより満たされる。
【００３７】
上記したように、もしフレームが同じサイズにスケールされ、各々のデータが他のデータに重ねられると、それらは良く重なり合うが、実質的に重なり合っているのみであり、必ずしも厳密な意味で重なり合っているわけではない。これはカメラのモーションやシーンの要素のモーションのためである。比較的高い解像度の静止画像を創出するためにフレームのシークエンスを利用する過程でこれらのモーションを考慮に入れることも重要である。
【００３８】
本発明の方法を以下、詳細に説明する。前記スキームを特定の手段にて活用するには、本発明の方法は、まず各フレームを同じディメンション（大データスペース）のデータスペースにマッピング（ｍａｐ）又は「ワーピング」する必要がある。そのワーピング後にシークエンスの個々のフレームに対して、ウエイトを与えられた、すなわち、重み付けされた（ｗｅｉｇｈｔｅｄ）時間メジアン操作を施し、その操作を施されたフレームを結合して合成画を作成する。
【００３９】
各フレームを大データスペースにマッピング又はワーピングするには、ズームのフレームシリーズが実質的に一定な焦点距離のフレームのシリーズとしてモデル化（ｍｏｄｅｌｌｅｄ）され、１又は２の動いている被写体を記録する。その状況は相互的（ｒｅｃｉｐｒｏｃａｌ）なものである。記録装置の焦点距離を変化するものとして処理するかわりに、シーン全体が固定焦点距離レンズの記録装置に近づくか、又は遠ざかるものとして処理しても同じことである。
【００４０】
以下の説明において本発明方法の基本的ステップを紹介する。基本ステップを最初に紹介するのは説明を目的としたものであって、決して本発明方法のステップ順を示しているのではない。そのステップの順は基本的概念を紹介した後で説明する。
【００４１】
本発明の基本的ステップはカメラのズームモーションをリカバー（ｒｅｃｏｖｅｒ）するためのものである。本発明の方法は、水平方向、鉛直方向及びスケール方向（水平／鉛直平面に垂直）における像の速度成分としてのカメラズームに影響を受ける連続的フレーム間の変化をモデル化するものである。そのような像ポーションに対するフレーム間の速度はこれらの３方向各々について決定される。その結果どのフレームのいかなるピクセル値も、多重速度（連続的多重フレームのペアを表しているもの）をオリジナル像ポーション（部分）を表しているデータに適用することで異なる焦点距離の１フレーム内の対応位置にワーピング可能である。これに関連する技術は１９９０年４月に発表されたニュージャージー州プリンストンにあるデイビッドサーノフリサーチセンター（ＤａｖｉｄＳａｒｎｏｆｆＲｅｓｅａｒｃｈＣｅｎｔｅｒ）のバーゲンジェイ（Ｂｅｒｇｅｎ，Ｊ．）、バートピー（Ｂｕｒｔ，Ｐ．）、ヒンゴラニアール（Ｈｉｎｇｏｒａｎｉ，Ｒ．）及びペレグエス（Ｐｅｌｅｇ，Ｓ．）らによる表題『３フレームからの２モーション計算』にて開示されている。以下の基本的説明の多くは実質的には前記バーゲン他の論文から借用したものである。
【００４２】
ある像領域のモーションに対する単純で閉じた系での形態予想法は前記バーゲン他により導き出されたものである。まず、像の部分的な小移動は、像シークエンスであるフレームＩ（ｘ，ｙ，ｔ−１）及びＩ（ｘ，ｙ，ｔ）間で生じると考えることができる。このＩ（ｘ，ｙ，ｔ）は、時間ｔにおけるｘ（水平）及びｙ（鉛直）方向に延びている観察像であり、例えば、フレーム２９９にて示されている。Ｉ（ｘ，ｙ，ｔ−１）は時間ｔ−１における観察像であり、例えば、フレーム２９８である。いかなるときにもＰ（ｘ，ｙ）として表されるｘとｙのパターンはすべてのピクセルの速度フィールドである速度ｐ（ｘ，ｙ）で移動しており、ｘ方向とｙ方向双方のモーション成分を有している。モーションフィールドｐ（ｘ，ｙ）はｘとｙにおける変位による以下で表される。
ｐ（ｘ，ｙ）＝（ｐｘ（ｘ，ｙ），ｐｙ（ｘ，ｙ））（１）
このｐｘ（ｘ，ｙ）はｘ方向（ｘとｙの関数）の変位であり、ｐｙ（ｘ，ｙ）はｙ方向（ｘとｙの関数）の変位である。よって、以下の式が導かれる。
Ｉ（ｘ，ｙ，ｔ）＝Ｐ（ｘ−ｔｐｘ，ｙ−ｔｐｙ）（２）
Ｉ（ｘ，ｙ，ｔ）＝Ｉ（ｘ−ｐｘ，ｙ−ｐｙ，ｔ−１）（３）
【００４３】
前記フレームのインターバルを時間の１ユニットとすることで表記を単純化することができる。最小平方誤差（ｌｅａｓｔｓｑｕａｒｅｄｅｒｒｏｒ）技法に従い、測定値とフィールドｐを使用した計算値との間の平方誤差を最小とするモーションフィールドｐ＝（ｐｘ，ｐｙ）を求めるのが有益である。
【００４４】
【数１】

１フレームから次のフレームまでの偏差が小さいと仮定すれば、等式（４）はＩ（ｘ，ｙ，ｔ）の省略テーラーシリーズエクスパンション（ｔｒｕｎｃａｔｅｄＴａｙｌｏｒｓｅｒｉｅｓｅｘｐａｎｓｉｏｎ）により単純化することが可能である。
【００４５】
【数２】

ここにおいて、
【数３】

であり、従って、
【数４】

となる。
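The formula images for equations (4) to (6) did not survive in this text. A reconstruction consistent with equation (3) and with the least-squares formulation described in the surrounding paragraphs is sketched below. The exact notation and sign conventions of the original drawings are assumptions and may differ.

```latex
% (4) squared error between the observed frame and the motion-compensated previous frame
\mathrm{Err}(p) = \sum_{x,y} \bigl( I(x,y,t) - I(x-p_x,\, y-p_y,\, t-1) \bigr)^2 \tag{4}

% (5) truncated Taylor series expansion about (x, y, t)
I(x-p_x,\, y-p_y,\, t-1) \approx I(x,y,t) - p_x I_x - p_y I_y - I_t \tag{5}

% partial derivatives of the image
I_x = \frac{\partial I}{\partial x}, \qquad
I_y = \frac{\partial I}{\partial y}, \qquad
I_t = \frac{\partial I}{\partial t}

% (6) simplified error obtained by substituting (5) into (4)
\mathrm{Err}(p) = \sum_{x,y} \bigl( p_x I_x + p_y I_y + I_t \bigr)^2 \tag{6}
```

Substituting (5) into (4) cancels the I(x, y, t) terms, which is why only the three partial derivatives remain in (6).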
【００４６】
像モーションは速度成分の各パラメータ（ｐａｒａｍｅｔｅｒ）に関して等式（６）の導関数（ｄｅｒｉｖａｔｉｖｅｓ）をゼロにセットし（誤差は導関数がゼロのときに最小だからである）、得られる等式システムを解くことで得られる。
もし、像ポーションのモーションが単純移動（ｓｉｍｐｌｅｔｒａｎｓｌａｔｉｏｎ）によりモデル化されるなら、ｐ＝（ａｘ，ａｙ）となり、このａｘ及びａｙはピクセルのユニットにおいて定数となり、光学的フロー（ｆｌｏｗ）等式は以下のようになる。
【００４７】
【数５】
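For the pure-translation model, setting the derivatives of the error with respect to a_x and a_y to zero gives a 2×2 linear system. The sketch below solves that system for two small images. It assumes the convention cur(x, y) ≈ prev(x − a_x, y − a_y), takes the spatial derivatives as central differences of the two-frame average, and uses the hypothetical name `estimate_translation`; none of these details is fixed by the text.

```python
def estimate_translation(prev_img, cur_img):
    """Least-squares translation (a_x, a_y) with cur(x, y) ~ prev(x - a_x, y - a_y)."""
    h, w = len(prev_img), len(prev_img[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Spatial derivatives of the two-frame average (central differences).
            ix = (prev_img[y][x + 1] + cur_img[y][x + 1]
                  - prev_img[y][x - 1] - cur_img[y][x - 1]) / 4.0
            iy = (prev_img[y + 1][x] + cur_img[y + 1][x]
                  - prev_img[y - 1][x] - cur_img[y - 1][x]) / 4.0
            # Temporal derivative: current frame minus previous frame.
            it = cur_img[y][x] - prev_img[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Solve [[sxx, sxy], [sxy, syy]] [a_x, a_y]^T = -[sxt, syt]^T by Cramer's rule.
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradients do not constrain both directions")
    a_x = (-sxt * syy + syt * sxy) / det
    a_y = (-syt * sxx + sxt * sxy) / det
    return a_x, a_y
```

The determinant check reflects a practical limitation: an image whose gradients all point the same way cannot pin down motion in both directions.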

【００４８】
本発明方法の特徴的な適用ケースにおいては、モーションは単純移動によってはモデル化できないことが理解されよう。なぜなら、単純移動は焦点レンズのズームのようにスケール変化のリカバリー又はモデル化を行わないからである。その代わりに、像のモーションは、アファイン変換（即ち直線を直線に、平行線を平行線に変換するが、ポイント間の距離と、ライン間のアングルを変化させる可能性のある変換）としてさらに典型的に正確にモデル化される。この場合、モーションフィールドｐは６個のパラメータであるａｘ，ｂｘ，ｃｘ，ａｙ，ｂｙ及びｃｙを有しており、それらは次のように適用される。
【００４９】
$$p_x(x, y) = a_x + b_x x + c_x y \tag{9x}$$
$$p_y(x, y) = a_y + b_y x + c_y y \tag{9y}$$

【００５０】
ここで、ａｘ及びａｙは上記のごとくであり、ｂｘはｘ方向のｘのパーセンテージスケールファクターであり、ｃｘはｘのパーセンテージローテーションファクターであり、ｙ位置に関係する（ｄｅｐｅｎｄｉｎｇｏｎ）ものである。ｂｙはｙのパーセンテージローテーションファクターであり、ｘ位置に関係し、ｃｙはｙのパーセンテージスケーリングファクターである。ズームにおける１フレームから次のフレームへのａｘとａｙの通常のレンジは、数個のピクセル程度である。残余ファクターの普通のレンジは０．００１から０．１程度である。もし、等式（６）の誤差がこれらの６個のパラメータの各々に関して微分（ｄｉｆｆｅｒｅｎｔｉａｔｅｄ）されると、６個の未知数ａｘ，ｂｘ，ｃｘ，ａｙ，ｂｙ及びｃｙを持つ６個の等式システムが得られる。即ち、
【００５１】
【数７】

である。
【００５２】
このシステムは係数ａｘ，ｂｘ，ｃｘ，ａｙ，ｂｙ及びｃｙについて解かれなければならない。解を得ることは可能である。なぜなら、Ｉｘ，Ｉｙ，及びＩｔ、即ち、ｘ、ｙ及びｔに関する像の部分的導関数は時間ｔ及び時間ｔ＋１の像値から決定可能だからである。Ｉｔは時間ｔ＋１のピクセル値を時間ｔにおける対応ピクセル値から差し引くことで決定される。Ｉｘは時間ｔのピクセル値と時間ｔ＋１の対応ピクセル値を加えて、ｘにおける導関数フィルター（ｆｉｌｔｅｒ）を介してその合計をラン（ｒｕｎ）させることで決定される。Ｉｙはその得られた合計をｙの導関数フィルターを介してランさせることで決定される。これらの３個の値が像内のすべてのピクセルに対して決定されたならば、等式（１０）のシステムは、係数ａｘ，ｂｘ，ｃｘ，ａｙ，ｂｙ及びｃｙについて解かれる。これらの係数を知れば、１つのフレームから次のフレームまでの像の特殊なアスペクト（ａｓｐｅｃｔ）を代表する与えられたピクセル値の位置の変化を決定することが可能となる。
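The six-equation system can be sketched as an ordinary least-squares problem. In the example below the per-pixel derivatives are assumed to be already available as (x, y, I_x, I_y, I_t) tuples, the residual I_x·p_x + I_y·p_y + I_t is an assumed sign convention, and the names `solve_linear` and `estimate_affine` are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def estimate_affine(samples):
    """Estimate (a_x, b_x, c_x, a_y, b_y, c_y) from per-pixel derivatives.

    samples is an iterable of (x, y, ix, iy, it) tuples. The residual
    ix * p_x + iy * p_y + it is minimised in the least-squares sense.
    """
    normal = [[0.0] * 6 for _ in range(6)]
    rhs = [0.0] * 6
    for x, y, ix, iy, it in samples:
        # The residual is linear in the parameters: dot(basis, params) + it.
        basis = [ix, ix * x, ix * y, iy, iy * x, iy * y]
        for i in range(6):
            rhs[i] -= basis[i] * it
            for j in range(6):
                normal[i][j] += basis[i] * basis[j]
    return solve_linear(normal, rhs)
```

Accumulating the 6×6 normal matrix pixel by pixel keeps memory use independent of image size, which suits the pyramid scheme described later.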
【００５３】
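Thus, to determine where the pixel values of frame 201 fall after frame 201 has been warped by one step, so that its image portions match the corresponding portions of the unwarped frame 202, the transformations of equations (9x) and (9y) are applied to each pixel value of frame 201. Consider, as shown in FIG. 7, the image portion at point (x, y), pixel location (20, 30). (FIG. 7 is not drawn to scale.) In FIG. 7, frame 201 in its original position is designated 201. After being warped to the scale of frame 202 it is designated 201₂; after being warped to the scale of frame 203 it is designated 201₃, and so on up to 201₉₉. For a 10% increase in scale (a large rate from one frame to the next) combined with a pan of five pixels to the right, typical coefficients between frame 201 and frame 202 have the following values.
【００５４】
$$a_x = 5,\quad b_x = 0.1,\quad c_x = 0,\quad a_y = 0,\quad b_y = 0,\quad c_y = 0.1$$

The displacement in the x direction of the value at pixel (20, 30) of frame 201, from frame 201 to frame 201₂, is 5 + (0.1 × 20) + (0 × 30) = 7, so the value moves seven pixels in the positive x direction to x position 27. The displacement in the y direction is 0 + (0 × 20) + (0.1 × 30) = 3, so the value moves three pixels in the y direction to y position 33. This is shown schematically in FIG. 7 by the curved arrow A, which runs from pixel position (x, y) in frame 201 to a different position (the same part of the image) in the warped frame 201₂.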
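The worked example above (pixel (20, 30) moving to (27, 33)) can be reproduced with a few lines of code. The function names `affine_displacement` and `warp_point` and the parameter ordering are illustrative assumptions.

```python
def affine_displacement(params, x, y):
    """Displacement (p_x, p_y) at (x, y) for params (a_x, b_x, c_x, a_y, b_y, c_y)."""
    a_x, b_x, c_x, a_y, b_y, c_y = params
    dx = a_x + b_x * x + c_x * y
    dy = a_y + b_y * x + c_y * y
    return dx, dy


def warp_point(params, x, y):
    """New position of the value found at (x, y) after one warping step."""
    dx, dy = affine_displacement(params, x, y)
    return x + dx, y + dy


# 10% scale increase plus a pan of five pixels to the right.
example_params = (5.0, 0.1, 0.0, 0.0, 0.0, 0.1)
new_x, new_y = warp_point(example_params, 20, 30)
```

With these coefficients the displacement is (7, 3), matching the arithmetic in the text.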
【００５５】
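Similarly, warping the same pixel value to the position it occupies in frame 201₃ requires applying the transformation equations (9x) and (9y) to the pixel coordinates in frame 201₂, using the coefficients a_x, b_x, c_x, a_y, b_y and c_y obtained by solving the set of six equations (10) between frames 202 and 203. These coefficients may differ from those obtained between frame 201 and frame 202.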
【００５６】
その変換等式は１次式（ｌｉｎｅａｒ）であり、よってリバース可能（ｒｅｖｅｒｓｉｂｌｅ）である。フレーム２０１のスケールからフレーム２０２のスケールへの変換には係数ａｘ，ｂｘ，ｃｘ，ａｙ，ｂｙ及びｃｙが使用される。フレーム２０２のスケールからフレーム２０１のスケールに変換するには、これらの係数の１次逆元（ｌｉｎｅａｒｉｎｖｅｒｓｅ）が使用される。
【００５７】
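As explained above, the pixel value from point (x, y) of frame 201 is warped to a new position in the warped frame 201₂. The pixel value from point (x+1, y) is also warped to a new position in frame 201₂, but that position is typically not adjacent to the warped position of the value from pixel (x, y). Without further processing, the space between these two points in frame 201₂ would be blank, with no value. Some form of interpolation is needed to fill this space. Various techniques are possible, including linear and bilinear interpolation; bilinear interpolation has been used effectively.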
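A minimal sketch of bilinear interpolation follows. One common way to use it (an assumption here, not something the text specifies) is to visit each pixel of the enlarged frame, map it back into the source frame with the inverse transformation, and sample the source at the resulting fractional position. The function name `bilinear_sample` and the clamping at the borders are illustrative choices, and the image is assumed to be at least 2×2.

```python
def bilinear_sample(image, x, y):
    """Sample image (a list of rows) at a fractional position (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp so that the 2x2 neighbourhood stays inside the image.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0 = min(int(x), w - 2)
    y0 = min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two bracketing rows, then along y.
    top = image[y0][x0] * (1 - fx) + image[y0][x0 + 1] * fx
    bottom = image[y0 + 1][x0] * (1 - fx) + image[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bottom * fy
```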
【００５８】
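As explained by Bergen, Burt et al., the motion estimation technique above is accurate only when the image displacement from one frame of the zoom sequence to the next is small (one pixel or less), so that the truncated Taylor series approximation is appropriate. Better results are obtained with the multiresolution (pyramid) structure shown schematically in FIG. 8, which also extends the technique to the more common case of larger displacements.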
【００５９】
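In determining the affine transformation parameters a_x, b_x, c_x, a_y, b_y and c_y, a Gaussian pyramid G is constructed for each frame of an image frame pair, for example frames 201 and 202. For each member of the sequence, the pyramid is formed as a sequence of modified copies of the original image in which the resolution and the sample density are reduced by powers of 2. It should be noted that the members of the Gaussian pyramid sequence, for example G_{201,0}, G_{201,1}, G_{201,2}, ..., G_{201,l}, are entirely different from the members of sequence 200, except that one frame of sequence 200 forms the base level of the Gaussian pyramid sequence.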
【００６０】
レゾリューション（ｒｅｓｏｌｕｔｉｏｎ）を減少させるために、データはローパスフィルターを通過させられる。このローパスフィルターを通過させることで像の小さな又は素早い移動に関連したデータを排除する。従って、大きなモーションはレゾリューションが最も大きく減少したレベルにて検知される。ローパスフィルターがデータ内の偏差の大部分を排除しているので、存在する全ピクセルに対する計算をする必要性はなくなる。よって、操作対象のピクセル数を減少させるために２程度のサブサンプリング（ｓｕｂ−ｓａｍｐｌｉｎｇ）が適用される。このサブサンプリングは計算の能率を高め、操作のスピード向上に寄与する。サブサンプリングの特徴的なパターンは隔行及び隔列を無視することである。
【００６１】
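Each level l of the pyramid is obtained by convolving the data of the preceding level with a small kernel filter ω, which acts as a low-pass filter, and then subsampling: G_{t,l} = [G_{t,l−1} * ω]↓2, where G_{t,l} is the l-th pyramid level for image I(x, y, t). The symbol ↓2 indicates that the quantity in brackets is subsampled by 2 in both x and y. For example, G_{201,1} is obtained by convolving G_{201,0} with the filter ω and subsampling the result.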
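The filter-and-subsample recursion can be sketched as below. The 5-tap binomial kernel, the separable row/column filtering, and the edge replication at the borders are assumptions chosen for illustration; the text only requires a small low-pass kernel ω followed by subsampling by 2.

```python
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # assumed 5-tap low-pass kernel


def _blur_rows(image):
    """Filter every row with KERNEL, replicating edge pixels at the borders."""
    w = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(w):
            acc = 0.0
            for k, weight in enumerate(KERNEL):
                xi = min(max(x + k - 2, 0), w - 1)
                acc += weight * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def _transpose(image):
    return [list(col) for col in zip(*image)]


def reduce_level(image):
    """One pyramid step: low-pass filter in x and y, then keep every second sample."""
    blurred = _transpose(_blur_rows(_transpose(_blur_rows(image))))
    return [row[::2] for row in blurred[::2]]


def gaussian_pyramid(image, levels):
    """Return [G_0, G_1, ..., G_levels], with G_0 the original image."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid
```

Each level has a quarter of the samples of the one below it, so the coarse levels are cheap to analyse.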
【００６２】
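Analysis of the transformation begins at a low-resolution level of the image pyramid, for example level 3. For an original image defined by 480 scan lines and 640 pixels, analysis at level 3 typically gives good results. The sample distance at level l is 2^l times the sample distance of the original image, so image velocities larger by that factor can be estimated. At each successive iteration of the tracking procedure the analysis moves to the next higher-resolution pyramid level, approaching the original.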
【００６３】
従って、アファイン変換パラメータの決定は、例えばレベル２にて開始する。まず、ピラミッドＧ２０１とピラミッドＧ２０２の間でａｘ、ｂｘ、ｃｘ、ａｙ、ｂｙ及びｃｙに対する等式（１０）を解く必要がある。これは２ステップで行われる。まず、アファイン変換ｐ２のシード（ｓｅｅｄ）セットが選択される。このシードは全部ゼロであっても、ズームによるスケーリングファクター若しくは知られたパン又はジブによるトランスレーションのような変換の知られているアスペクトに近似して選択されたものであっても構わない。これらのアファイン変換はレベル２でのワーピングされた像を得るためにＷ２においてＧ２０１，２に適用される。これは図８において歪んだ方形Ｇ２０１，２ｗにより図示されている。たいていの場合には、このワーピングは次の時間インターバルｔ＋１、即ちＧ２０２，２でのガウスメンバーを正確には提供しないであろう。よって、第２のステップでは調整用アファインパラメータΔｐ２のセットが像の値Ｇ２０２，２とＧ２０１，２ｗ間で予想される。これらは前述で解説されたごとくに予想されるものである。
【００６４】
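First, I_x, I_y and I_t must be determined, as was done above for frames that had not been low-pass filtered and subsampled. I_x, I_y and I_t are computed in the same way, except that the smaller, low-pass-filtered and subsampled data set is used and, instead of subtracting the values of frame 201 from those of frame 202, the values of the warped pyramid frame G_{201,2w} are subtracted from the values of pyramid frame G_{202,2}. The partial derivatives at level 2 are determined in this way, after which the adjustment affine parameters a_x, b_x, c_x, a_y, b_y and c_y for this level can be determined. The adjustment affine parameters are shown collectively in FIG. 8 as Δp_2.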
【００６５】
調整用アファインパラメータのセットは、ガウスシークエンスピラミッド２０１における先行するレベルからのアファインパラメータｐ２と結合されてレベル１、即ちｐ１用のアファインパラメータを形成する。この結合は単純な加算ではない。例えば、新ａｘタームは時間ｔにおけるａｘターム（ピラミッドＧ２０２）や時間ｔ−１におけるａｘターム（ピラミッドＧ２０１）及びｘ方向における他の変化に基づくものである。以下の式はこの関係を説明している。
【００６６】
【数９】
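The formula image for this combination did not survive in the text. The sketch below shows one self-consistent way to combine a parameter set p with an adjustment Δp, under the assumption that p is applied first and Δp is then applied to the already warped coordinates. The patent's own formula may differ, for example in how translation terms are rescaled between pyramid levels, so the names `warp_point` and `compose` and the formulas inside them should be read as illustrative only.

```python
def warp_point(params, x, y):
    """Move (x, y) by the displacement field defined by the six parameters."""
    a_x, b_x, c_x, a_y, b_y, c_y = params
    return (x + a_x + b_x * x + c_x * y,
            y + a_y + b_y * x + c_y * y)


def compose(p, d):
    """Parameters equivalent to applying p first and then the adjustment d."""
    a_x, b_x, c_x, a_y, b_y, c_y = p
    da_x, db_x, dc_x, da_y, db_y, dc_y = d
    return (
        a_x + da_x + db_x * a_x + dc_x * a_y,
        b_x + db_x * (1 + b_x) + dc_x * b_y,
        c_x + db_x * c_x + dc_x * (1 + c_y),
        a_y + da_y + db_y * a_x + dc_y * a_y,
        b_y + db_y * (1 + b_x) + dc_y * b_y,
        c_y + db_y * c_x + dc_y * (1 + c_y),
    )
```

The cross terms such as `db_x * a_x` show why the combination is not a simple addition: the adjustment acts on coordinates that have already been moved by p.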

【００６７】
このプロセスは繰り返されるが、今回はレベル１であり、フレーム２０１及び２０２に対する操作過程でアファイン変換パラメータａｘ、ｂｘ、ｃｘ、ａｙ、ｂｙ及びｃｙがオリジナルレベルにて取得されるまで全レベルを通して実施される。アファインパラメータは最も正確なところで収束（ｃｏｎｖｅｒｇｅ）するので、ΔｐＩタームはゼロとなる傾向にある。
【００６８】
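Thus, the above procedure is carried out to determine the warping factors that transform a frame from the scale of any frame to the scale of the next (for example, from frame 226 to frame 227), so that a set of affine transformation parameters a_x, b_x, c_x, a_y, b_y and c_y is computed for each pair of frame scales. Then, to transform a frame such as frame 251 to the appropriate size, it is first transformed to the scale of frame 252 using the affine transformation parameters determined by the earlier analysis of frames 251 and 252. The transformed frame 251₂ is next transformed to the scale of frame 253 using the parameters determined by the earlier analysis of frames 252 and 253. The process is repeated until the frame has been transformed into the large data space at the scale of frame 300.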
【００６９】
前記の方法は、もしカメラ又は被写体（どちらでもよい）間の相対的モーションがほとんど存在しないか又はまったく存在しないならば良好に作用し、唯一の像変化はズーミングによるものとなる。しかしながら、実際上はそのようなモーションを排除できることが望ましい。いくつかの方法が考えられる。基本的ではあるが効果的な方法は全フレームを視覚的に検査することであり、視域を横切る人のように大きなモーションを特定することである。モーションが各フレームにおいて生じる領域をカバーするためにマスクが利用可能であり、この領域は変換時には無視することができる。最終的なピクチャーのマスク位置を設定するのに望ましいピクセル値をオペレータは手動にて選択する。
【００７０】
別の方法は図９にて略図的に示すように、２つのモーションを追跡するバーゲンとバート他により解説されている技法を利用することである。データはモーションのペアに照らし合わせて評価される。ここでの像Ｉ（ｘ，ｙ，ｔ）はそれぞれ独立したモーションｐとｑを有する異なる像パターンＰ及びＱの組み合わせとしてモデル化される。Ｉ、Ｐ及びＱ間の関係は以下のごとくである。
【００７１】
【数１０】

上記等式中の○と＋を重ねた記号は、以下において便宜上（＋）と表記する。
【００７２】
ここでは、オペレータ（＋）は、加算又は掛け算のごとき２モーションを結合させるためのオペレーションを表し、Ｐ^ｔｐは時間ｔを通じてモーションｐにより変換されたパターンＰを表しており、バーゲン及びバート他は、もしモーション成分の１つ、及び結合ルール（＋）が知られていれば、パターンＰ及びＱの性質について予想をたてることなく、前述の１成分モーション技法を活用して他のモーションを計算することが可能であることを示している。もし、モーションｐが知られていれば、モーションｑのみを決定すればよく、その逆のこともある。速度ｐで移動しているパターンＰの成分は各像フレームをｐによりシフトし、そのシフトしたフレーム値を次のフレームから差し引くことにより像シークエンスから排除することが可能である。得られる差（ｄｉｆｆｅｒｅｎｃｅ）シークエンスは速度ｑにて移動しているパターンのみを含んでいる。
【００７３】
特殊な場合には、結合オペレーション（＋）は加算である。シークエンス２００の３フレームＩ（１）、Ｉ（２）及びＩ（３）の場合について考えてみよう。変数Ｄ１及びＤ２をそれぞれそれらのフレーム間で発生した差フレームに当てはめてみよう。等式１１は以下のようになる。
【００７４】
【数１１】

【００７５】
これは１ステップにてパターンＰを変換するための３０２におけるＩ（１）のワープとして図９において略図的に示されている。この次の段階はパターンＰのモーションの影響を取り除くための３０４におけるＩ（２）の減算である。その結果得られるものはＤ１、即ち差（ｄｉｆｆｅｒｅｎｃｅ）シークエンスの１要素である。Ｄ２はパターンＰのモーションにより３０６でワーピングされたＩ（３）とＩ（２）間の３０８における差により同様に形成される。
【００７６】
変更シークエンスは１モーションｑで移動する新パターンＱ^ｑ−Ｑ^ｐから構成されることになる。
【数１２】

【００７７】
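Motion q can therefore be computed between the two difference images D1 and D2 using the single-motion estimation technique described above. This is shown schematically at 310 in FIG. 9. Similarly, motion p can be recovered when q is known: the observed images I(x, y, t) are shifted by q and a new difference sequence is formed.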
【００７８】
【数１３】

このシークエンスは速度ｐにて移動しているパターンＰ^ｐ−Ｐ^ｑである。
【数１４】

よって、ｐは１モーション予想技法を活用してリカバー可能となる。
【００７９】
このシフト及び減法手順はパターンに関わりなく、又はパターンを決定することなく像シークエンスから１つの移動パターンを取り去る。実際上はｐもｑも最初は知られていない。しかしながら、最初に非常におおまかな予想値を選択したとしても、それらの両方ともが前記の技法を反復することによりリカバー可能である。この反復手順は１モーション技法を反復的に適用する。モーションｐを定義するパラメータのおおまかな予想値で始めても、ｑの予想値は取り出され、３１２にてワーピングステップの３０２と３０６にリターンされる。予想値ｑから改善予想値ｐが取り出され、３１２にてワーピングステップ３０２と３０６にリターンされる。この手順を繰り返す。この手順にて正確な予想値に素早く収束する。本当の像シークエンスを使用して、３から１２サイクル後には要求を満たす変換が可能になる。
【００８０】
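The steps of this part of the invention can be summarized as follows.
1. Set an initial estimate p_0 for the motion of pattern P.
2. Form the difference images D1 and D2 of equation (12) using the latest estimate p_n.
3. Apply the single-motion estimator to D1 and D2 to obtain an estimate q_{n+1}.
4. Form new difference images D1 and D2 using the estimate q_{n+1}.
5. Apply the single-motion estimator to the new D1 and D2 to obtain an updated estimate p_{n+2}.
6. Repeat from step 2.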
【００８１】
この２モーション技法に従って取り出された２セットのアファインパラメータを観察することで、移動シーン又はカメラモーションを特定することが可能となる。一般的に、ズームワーピングのみに関係するパラメータは１フレームから次のフレームにスムーズに、またほんの少々変化するだけである。像モーション又はカメラモーションに関係するパラメータはズームによるものとは異なる変化を示す。これらの相違する変化は検査により観察可能である。
【００８２】
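It is theoretically possible to automate the identification of scene or camera motion by automatically comparing the affine parameters from one frame pair to the next and triggering a flag when the change exceeds a preset level. One possible technique is to compare the difference between the affine parameters of two frame pairs with the standard deviation over a set number of preceding frame pairs. For a sequence of 70 frames, for example, the standard deviation would typically be determined over at least 10 frame pairs.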
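One way such an automatic check might look is sketched below for a single affine parameter tracked over successive frame pairs. The window of 10 frame pairs follows the example in the text; the threshold multiplier `k`, the floor `min_spread`, the use of changes relative to the recent mean, and the name `flag_motion` are assumptions.

```python
from statistics import mean, stdev


def flag_motion(values, window=10, k=3.0, min_spread=1e-9):
    """Return indices of frame pairs whose parameter change looks abnormal.

    values holds one affine parameter (for example b_x) per frame pair.
    A change is flagged when it deviates from the mean change over the
    preceding `window` frame pairs by more than k standard deviations.
    """
    flags = []
    for i in range(window + 1, len(values)):
        # Changes between successive frame pairs inside the preceding window.
        history = [values[j] - values[j - 1] for j in range(i - window, i)]
        spread = max(stdev(history), min_spread)
        change = values[i] - values[i - 1]
        if abs(change - mean(history)) > k * spread:
            flags.append(i)
    return flags
```

Flagged frame pairs would then be candidates for masking or for the two-motion analysis rather than being treated as pure zoom.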
【００８３】
もし、カメラとシーン内の要素の両方ともが移動しているとき、２個以上のモーションが存在し、カメラモーションを排除するためのさらに一段上の方法が便利である。前述したアファインの２モーション予想法とマスキング技法を組み合わせると便利な結果をもたらすことが発見されている。像内のずれベクトルの確率デンシティ関数を決定することもまた便利であろう。一般的な理解には、１９９０年６月にアメリカ合衆国マサチューセッツ州ケープコッドにて開催されたアメリカ光学学会の『機械理解及び機械視覚総会』の議事録内にあるギロッドビー（Ｇｉｒｏｄ，Ｂ．）とクオデー（Ｋｕｏ，Ｄ．）による「ずれヒストグラムの直接的予想法（ＤｉｒｅｃｔＥｓｔｉｍａｔｉｏｎｏｆＤｉｓｐｌａｃｅｍｅｎｔＨｉｓｔｏｇｒａｍｓ）」を参照するのがよい。ここにはフレーム間で移動する異なる被写体の数と、それらに対応するずれベクトルがどのようなものであるのかについての情報が掲載されている。ローカルブロックマッチングエスチメータ（ｌｏｃａｌｂｌｏｃｋｍａｔｃｈｉｎｇｅｓｔｉｍａｔｏｒ）がそれらの移動被写体を空間的に位置取りさせるのに使用されている。移動被写体の領域は計算によりマスク処理が施され、その後にアファイン予想値が計算される。
【００８４】
フレームペア間のずれが小さいものであって、突き当たったり、焦点距離の急激な変動のごとき予期しないカメラ移動がまったく存在しないと仮定すれば、アファインパラメータはフレームペア間ではあまり異なるものではない。前述のごとくにパラメータが決定された後、その係数は見せ掛けの値を取り除くために簡略化される。
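Once the Y-channel data from each frame of sequence 200 has been warped, the affine parameters that were determined are applied to the other channels, for example the in-phase and quadrature channels, to provide a full-color transformed image.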
【００８５】
ワーピングされた短焦点距離フレーム２０１のフルラスタ（ｆｕｌｌｒａｓｔｅｒ）が満たされた後、１フレームから次のフレームまでのアパーチャ設定の変分のごときトーンに影響を及ぼす変化を補うためにトーンスケール補正を実施することができる。中央の像から始めて、２つの像が隣接する箇所周辺で光度のサンプルが採られる。データにスプライン（ｓｐｌｉｎｅ）がフィットされ、大きい方の像（低めの解像度）のピクセルが小さい方の像のピクセルに変更される。その後にこの補正像のトーンスケールは次の大きさのワーピングされた像との比較に用いられ、最大の像までこの手順が繰り返される。
【００８６】
シークエンス２００の各フレームからのフルカラーデータが同じデータスペースにワーピングされたら、各ピクセルに対する１フレームからのデータは他の全フレームのデータと組み合わせる必要がある。いくつかの技法が可能である。最も基本的な技法は最高の解像度を有するフレームから最終合成ピクチャー用のピクセル値を選択することである。図６において示されているように、フレーム２９９がワーピングされたものであるフレーム２９９ｗが一般的にその合成ピクチャーの中央部を占め、このフレームはその像の中央部に関して最も高い解像度となるであろう。フレーム２９８からの情報は中央部の環状方形部位を占め、この情報はこの部分に可能な最高の解像度となる。フレーム２９７からの情報はフレーム２９８ｗの環状領域周辺の多少大きな軸の環状方形部位を占め、このようにして第１フレーム２０１ｗの環状周囲がワーピングされた像（ｆｉｇｕｒｅ）の最も外部を占めるまで続けられる。
【００８７】
上記手順にて好ましい結果が得られるが、明瞭なエッジ部が現れ種々のフレームから発生した領域間の境界線を際立たせる。この理由によりあるピクセルに対してシークエンスのワーピングされた全フレームにウエイト関数（ｗｅｉｇｈｔｉｎｇｆｕｎｃｔｉｏｎ）が適用されそのピクセルの値としてウエイト値のメジアンがとられる。図１０に示すようにピクセルの位置はベクトルＶにより示され、像の同一位置にワーピングされた全フレーム、即ち２０１ｗから２９９ｗまでを突き抜ける。前記ウエイトファンクションはベクトルＶに沿ってその像値（ｉｍａｇｅｖａｌｕｅｓ）に適用される。典型的なウエイト関数は図１１のグラフにて示されている。図から分かるように、そのウエイト関数は上向きに凹形状であり、クロースインズームショット（ｃｌｏｓｅｉｎｚｏｏｍｓｈｏｔ）からのピクセル値は最大のウエイト、おそらく１００％が付与される。望む効果に応じて種々のウエイト関数が適用可能となる。一般的には低い方の解像度を有するフレームよりも高い解像度を有するフレームに対してさらにヘビーなウエイトが付与される。
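A weighted temporal median along the vector V might be computed as in the sketch below. The definition used (the smallest value at which the cumulative weight reaches half of the total) is one common convention and is an assumption here, as are the example weights and the names `weighted_median` and `composite_pixel`.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    if not pairs:
        raise ValueError("at least one sample is required")
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def composite_pixel(samples):
    """Combine the samples found along V at one pixel of the large data space.

    samples is a list of (pixel value, weight) pairs, one per warped frame
    that covers this pixel; higher-resolution frames carry larger weights.
    """
    values = [value for value, _ in samples]
    weights = [weight for _, weight in samples]
    return weighted_median(values, weights)
```

With equal weights this reduces to an ordinary median; giving the close-in frames a dominant weight makes the result follow them wherever they cover the pixel, while still smoothing the transitions between frames.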
【００８８】
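The building blocks of the method for obtaining one high-resolution still image from a sequence of lower-resolution frames have been described above in a somewhat artificial order. FIG. 13 shows the steps of the method in a substantially preferred order. A sequence of video fields is taken at 402. The fields are deinterlaced at 404 to produce a series of frames. At this point alternative paths are available. Object or camera motion can be separated from the zoom motion at 406, followed by optical flow analysis to generate the affine transformation coefficients a_x, b_x, c_x, a_y, b_y and c_y. Alternatively, the method can branch from 404 to step 408, which combines the optical flow analysis with the separation of object or camera motion from the zoom motion. This branch also yields the coefficients a_x, b_x, c_x, a_y, b_y and c_y. Next, at 412, the affine transformations are applied to each frame as many times as necessary to create, for each frame, a corresponding frame on the high-resolution raster. The temporal median filter illustrated in FIG. 11 is applied to all frames at 414, and the final composite is formed at 416 by adding, at each pixel location of the high-resolution raster, the values at that pixel of the warped frames 201w, 202w and so on, as filtered by the temporal median filter.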
【００８９】
本発明装置の好適実施例は図１４において略図的に示されている。シーンから反射した光、又はシーンにより伝達される光を採り入れてビデオカメラ等の入力装置５００はシーン５０２に適用される。その光は前記入力装置又は標準コンバーター５０４により電気信号に変換される。コンバーター５０４又は入力装置５００から、データはメモリー装置５０６又はデータプロセスユニット５０８を通過する。メモリー装置５０６はフィールド（ｆｉｅｌｄ）によって、またさらにデータが変換されるどのような他の形状（ｃｏｎｆｉｇｕｒａｔｉｏｎｓ）にも従ってデータを記録することができる。前記データプロセスユニットは、典型的には適正にプログラムされた汎用デジタルコンピュータである。オペレータはコンピュータキーボード等の入力装置５１０を介してデータプロセスユニット５０８にコマンドを発する。これらのコマンドは前述の本発明方法のステップを行使するようにコンピュータに対して指示する。指示の内容は、例えば、そのフィールドの脱インターレーシング、差シークエンス（ｄｉｆｆｅｒｅｎｃｅｓｅｑｕｅｎｃｅｓ）を創出することによる２又はそれ以上の移動被写体の特定、アファイン変換係数の計算、全フレームを望むデータスペースにワーピングすること、ウエイトされた時間メジアンフィルターに従いワーピングされたフレームからのデータを結合させ合成ピクチャー映像に導く、等々である。各ステップでの変換されたデータはメモリ装置５０６に記録可能であり、プリンター、ビデオディスプレー装置又は他の適当な出力装置等の一般的な出力装置５１２に出力可能である。さらに、データは追加的操作を施したり、蓄積又はディスプレーを行うために遠隔地（ｒｅｍｏｔｅｌｏｃａｔｉｏｎ）に伝達することもできる。
【００９０】
尚、比較的低解像度でワイドアングルを有するショットにおいては、多くの箇所は不鮮明である。一方、本発明の方法に従って作成された合成ピクチャーの中央部は鮮明で焦点がぴったりと合っており、詳細までくまなく示している。
【００９１】
以上の解説は本発明の説明を目的としたものであり、発明の限定を意図したものではない。ビデオ以外においても、静止画像のシークエンスを利用するいかなる記録技術でも使用可能である。もしその記録技術がピクセル値を発生しないならば、その記録装置により発生されたデータは周知である本分野技術の方法に従いピクセル又は等価なデータスペースにコンバートすることが可能であり、また有効である。ここに紹介した技法に加え、ズームモーションからのカメラモーション又はシーン内のモーションを分離するための種々な技法が適用可能である。さらに、アファイン変換係数を計算するのにガウスピラミッドのステップを利用する必要もない。その計算はそのフル（ｆｕｌｌ）高解像度フレームに対して施されるもののごとき他の方法によっても可能である。
【００９２】
本発明の方法はズームからの１静止画像の場合以外に一連のパンショット及びジブショットからの１パノラマ静止画像を創出する場合にも利用可能である。そのような場合には、全フレームは全パノラマシーンと同じスペースを有するデータスペースにワーピングされることになる。互いに重なり合った焦点距離の異なるピクチャーの山とはならないであろう。むしろ、エッジ部がオーバーラップした一連のピクチャーとなるであろう。ズームシークエンスに適用された実施例においては、ワーピングの主要な要素は各フレームからのデータを拡大してシーンの像を互いに整合させることである。シーンの全像が互いにアラインするようにデータをワーピングすることはまたズーム系適用の１重要面である。この特徴により、例えばカメラモーション又は被写体のモーションによるモーションを取り去ることができる。
【００９３】
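In the panoramic application this enlarging aspect is not important and in most cases is not used at all. The alignment aspect, however, is very important. If the overall viewing zone of the panoramic scene is represented as a continuous data space, each frame covers a limited part of that overall viewing zone. Unlike the zoom application, every frame in the panoramic application is created at the same focal length. The method of the invention must be used to align the data from the frames in the overall data space so that the image in each frame coincides with the same image in the other frames. The method is applied mainly at the seams between shots. If the pan is slow relative to the frame rate, the overlap between frames at the seams is very large.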
【００９４】
パノラマシーンの特定部分でさらに鮮明な画像を得るためにズーム処理をパノラマ処理と結合させることは本発明の思想内である。
本発明の技術を被写体とビデオの非連続的なセグメントからのフレームと結合させるために使用することも可能である。
【００９５】
本発明はビデオカメラにより得られたデータの範囲にて記述されてきたが、本分野の通常の技術者であれば、本発明の方法は、いかにして得られたものであろうともデジタル像を表すデータにも使用可能であることが理解されるであろう。例えば、異なる焦点距離で撮影された一連のスチール写真は前述した方法により組み合わせて特定の部分の像を補強した１像を形成することができる。同様に、パノラマスペース内で種々な位置を描写している一組みのスチール写真を本発明の技法に従って組み合わせ、１枚のパノラマ像を作成することができる。この中ではその種々な像部分はリカバーすることが可能であり、共通な焦点距離であるが異なる視域を有する１組みのばらばらとなった静止画像内にはパノラマ像の元の人工的要素はほとんど示されていない。
【００９６】
本発明は「特許請求の範囲」により特定された全実施例を含む明細書中の記載に照らし合わせて考慮されるべきものであり、さらに合理的範囲内でのそれらの等価形態をも併せて考慮されるべきである。
【００９７】
【発明の効果】
以上詳述したように、本発明は、以下の利点を備えた比較的に高解像度を有する静止画像を創作する方法及び装置を提供することができる。
１）像全体にわたり高解像度で情報を取得する必要がない。
２）あまり重要ではない像の大部分に関して情報を収集する必要がない。
３）種々な焦点距離又は視域の標準的ビデオ像のシークエンスを入力要素として取得できる。
４）標準的フィルム像のシークエンスを入力要素として取得できる。
５）望む像のいかなる部分でもその解像度を増強する。
６）適正にプログラムされた汎用デジタルコンピュータ及び標準型ビデオ又は映画装置が使用可能である。
【００９８】
さらに、本発明によれば、過剰なデータ保存及びアクセス能力を要せず、あるシーンのパノラマビューを観察者に提供し、その観察者にそのシーンの１位置から他の位置までのナビゲーションを可能とする方法を提供することができる。
又、デジタル化されたいかなる形態の像データであろうとも前記能力を発揮させることができるという優れた効果を奏する。
【図面の簡単な説明】
【図１】撮像装置の焦点距離と撮影されたシーン部分との関係を略図的に示す図である。
【図２】ビデオフィールドとビデオフレームのペアの概略を示す図である。
【図３】組み合わされてビデオフレームを構成する典型的なビデオフィールドのペアのインターレーシング（ｉｎｔｅｒｌａｃｉｎｇ）を示す図である。
【図４】実質的に同じシーンのビデオフレームのシークエンスを略図的に示したものであり、短い焦点距離から比較的長い焦点距離にズームインした状態を表す図である。
【図５】最も短い焦点距離（ワイドアングルな視域）のビデオフレーム内シーンの部分の概略を示す図であり、徐々に長くなる焦点距離のフレームのシークエンスの残りメンバー内に供給されている。
【図６】図４において示すシークエンスの各ビデオ像（図６の左側に示されている）の同一サイズデータスペースにマップ（ｍａｐ）又はワープ（ｗａｒｐ）した状態の概略を示す図であり、そのサイズは拡大されたサイズの最も低い解像度フレームである。
【図７】元は比較的短い焦点距離で記録された１フレームをそのシーンの連続的拡大関与するデータスペースにワープしている状態の概略を示す図である。
【図８】シークエンスにおける複数のフレーム間の荒いモーション及び繊細なモーション両方を特定する方法の概略を示す図である。
【図９】フレームのシークエンスにおける２つの移動する被写体のモーションを特定する方法の概略を示す図である。
【図１０】同一データスペース内にワープされた後のシークエンスの各フレームの概略を示す図であり、最終的映像化に再構築されるようにアラインされている状態を表し、各フレームの共通ポイントを通るベクトルが示されている。
【図１１】最終的な像を構築するのに使用されるウエイト因子（ｗｅｉｇｈｔｉｎｇｆａｃｔｏｒ）とそのウエイト因子が適用されているワープされたフレームの元の焦点距離との間の関係を示すグラフ図である。
【図１２】最終的な再構築像とその構築要素の概略を示す図である。
【図１３】本発明の方法の好適実施例を説明するフローチャート図である。
【図１４】本発明の装置の好適実施例の概略を示す図である。
【符号の説明】
2 Image
4 Focal plane
6 Central portion
[0001]
[Industrial application fields]
The present invention relates generally to a method and apparatus for creating a high-resolution still image from a plurality of images of different focal lengths. In particular, it relates to a method for creating a still, high-resolution, fixed-focal-length image from a plurality of images of different focal lengths, such as a zoom video sequence. The invention also relates to a technique for creating a still panoramic image from a plurality of images whose viewing zones are narrower than that of the panoramic image.
[0002]
[Prior art]
In the field of image processing it is often desirable to obtain a still image of a scene. In most cases the still image has a resolution determined by the performance of the recording medium and the focal length of the device that captures it. Video equipment is now relatively inexpensive and simple enough for many people to use. Video recording has certain advantages over still-image capture such as still photography. A running video camera records every event within its field of view, whereas ordinary still photography records only the subject the photographer selects by pressing the shutter. Therefore, when shooting fast-moving subjects such as sporting events, or in situations where unexpected things happen, such as weddings or news documentaries, it is often convenient to leave the video camera recording continuously and select the desired still afterwards. However, the resolution of a video signal is limited to about 480 lines (scan lines) per picture height and about 640 samples per picture width. (The video signal itself is continuous along a scan line, but it is sampled along the scan line for display.) In many cases this resolution is insufficient to provide high image quality, particularly if the original image was taken at a relatively short focal length. When the image is enlarged it becomes quite blurred. Other recording techniques, such as motion-picture film and 8 mm film, are likewise limited to their own characteristic resolution. Enlarging an image necessarily degrades the resolution per unit area over the whole image.
[0003]
For example, it may be desired to capture a solo performer playing a piano on stage before an audience, together with that audience. If the imaging device is a video device, the wide-angle image showing the audience will have a resolution commensurate with the standard video resolution described above. The resolution is the same across the entire picture. Thus, the image of the solo pianist is as coarse as the rest of the scene. For example, if the image of the soloist takes up 1/16 of the full picture, it uses 120 lines in the vertical direction and 160 samples in the horizontal direction. Less important parts of the scene, such as vacant seats at the back of the venue, are rendered at the same resolution. FIG. 1 schematically shows the focusing of a scene on one focal plane for two different focal lengths. If the focal length fw is relatively short, the full width of the image 2 is focused on the focal plane 4.
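The 1/16 figure corresponds to a subject spanning one quarter of the picture in each direction. A minimal sketch of that arithmetic, assuming the 480-line by 640-sample raster cited above and a subject with the same aspect ratio as the full picture (the function name is illustrative, not part of the described method):

```python
import math

def subject_resolution(lines, samples, area_fraction):
    """Return (lines, samples) covering a subject that fills `area_fraction`
    of the picture, assuming it has the same aspect ratio as the full picture."""
    linear_fraction = math.sqrt(area_fraction)  # 1/16 of the area is 1/4 of each side
    return round(lines * linear_fraction), round(samples * linear_fraction)

print(subject_resolution(480, 640, 1 / 16))  # (120, 160)
```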
[0004]
Of course, it is possible to render the soloist more clearly (i.e., with more lines in the vertical direction and more pixels in the horizontal direction) by zooming in on the soloist and capturing the soloist's image at a longer focal length. As shown in FIG. 1, the focal length fT is longer than fw. However, only the central part 6 of the image 2 is focused on the focal plane 4. Most of the rest of the scene is sacrificed because it is focused outside the extent of the focal plane. The image of the soloist is enlarged to take up more space, and the periphery of the original image is not captured.
[0005]
Techniques for enhancing image data by combining two channels of data are well known. The first channel has a high spatial resolution (i.e., a relatively large number of pixels per unit length) and a relatively low temporal resolution (i.e., a relatively small number of frames per unit time). The second channel has a low spatial resolution and a high temporal resolution. The combination yields spatial and temporal resolutions approaching the higher of each, while requiring less information to be transmitted than is normally needed to convey a single image sequence with both high temporal and high spatial resolution. See the B.S. thesis “A Two-Channel Spatial-Temporal Encoder” by Lawrence N. Claman, submitted to the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology in May 1988.
[0006]
No technique is currently available for increasing the resolution of various spatial portions of a still image, such as an image taken at the shortest focal length. Claman's thesis uses fixed focal length images and vector quantization, and results in a still frame whose resolution and viewing zone do not exceed those of the original spatially high-resolution image.
[0007]
[Problems to be solved by the invention]
It would also be desirable to be able to provide a panoramic view of a scene while maintaining a substantially common focal length from one part of the panoramic view to another. A conventional way to do this is to move the video camera from one side of the panoramic scene to the other, in effect taking a large number of frames, each differing slightly from the preceding and following frames. Each frame differs from its adjacent frames only in its right and left edge portions; most of the image forming the frame is identical to the image of the adjacent frames. A large amount of data storage and data access is required to store and navigate the many images forming the panoramic scene. This prior art is undesirable because data storage and access are expensive and because most of the stored and accessed data serves no practical purpose. Imaging techniques currently used to capture panoramic spaces include a moving Globuscope or a Volpi lens.
It is also desirable to be able to pan from one position of a scene to another and zoom at the same time. A disadvantage of the prior art is that it produces undesirable results in such a combination.
[0008]
Accordingly, it is an object of the present invention to provide a method and apparatus for creating a still image having a relatively high resolution with the following advantages:
1) There is no need to acquire information at high resolution over the entire image.
2) There is no need to collect information on the majority of images that are less important.
3) A sequence of standard video images of various focal lengths or viewing zones can be acquired as input elements.
4) A standard film image sequence can be acquired as an input element.
5) Increase the resolution of any part of the desired image.
6) A properly programmed general purpose digital computer and standard video or movie equipment can be used.
[0009]
Another object of the present invention is to provide a method that gives a viewer a panoramic view of a scene and allows the viewer to navigate from one position of the scene to another without requiring excessive data storage and access capability. Still another object of the present invention is to make the above capability available irrespective of the digitized image data.
[0010]
[Means for Solving the Problems]
To summarize, the present invention for solving the above-mentioned problems is a method for generating a still image, comprising the steps of creating a plurality of images, each image being created at a focal length different from the others, scaling each image to a common focal length, and combining the scaled images into a final image of one focal length, portions of which have a relatively high resolution compared with the images of the original sequence. The invention further includes the step of incorporating a sequence of still images of changing viewing zones into a panoramic image of the overall viewing zone. In addition to combining images generated at different focal lengths, the method of the present invention can also be used to combine images generated for different viewing zones of an overall scene, such as a panoramic scene, into a combined panoramic viewing zone. This aspect of the invention can also be combined with the varying focal length aspect.
[0011]
The present invention is also an apparatus for generating a still image, comprising means for creating a plurality of images, each image being created at a focal length different from the others, means for scaling each image to a common focal length, and means for incorporating the scaled images into one image at one focal length. The apparatus of the present invention further includes means for incorporating images generated for different viewing zones of an overall scene into a combined panoramic viewing zone.
[0012]
Next, the first embodiment will be described. Here, the present invention is a method for generating a still image, which comprises the following steps:
1) A step of generating a plurality of signals, each representing one of a plurality of images created at different focal lengths.
2) A step of transforming each signal so that it represents the corresponding image scaled to a common focal length.
3) A step of combining the transformed signals to obtain a signal representing a combination of the scaled images, being a final image of one focal length, portions of which have a relatively high resolution compared with the images of the original sequence.
[0013]
Another embodiment will be described. Here, the present invention is a device for generating a still image and comprises the following means:
1) A means for creating a plurality of images created at different focal lengths.
2) Means for generating a plurality of signals representative of respective images at different focal lengths.
3) Means for converting each of the plurality of signals so that it represents the corresponding image scaled to a common focal length.
4) Means for combining the converted signals to obtain a signal representing a combination of the scaled images as one image at one focal length.
[0014]
Still another embodiment will be described. Here, the present invention is a method for generating a still image comprising the following steps:
1) A step of generating a plurality of signals, each representing one of a plurality of images created in different viewing zones.
2) Transform each signal to represent a corresponding image translated to one position within a common panoramic viewing zone.
3) A step of combining each converted signal to obtain a signal representative of a combination of translated images as a final image in one panoramic viewing area that covers a larger viewing area compared to the image of the original sequence.
[0015]
Still another embodiment will be described. Here, the present invention is a device for generating a still image and comprises the following means:
1) Means for creating a plurality of images created in different viewing zones.
2) Means for generating a plurality of signals, each representing one of the plurality of images in different viewing zones.
3) Means for converting each of the plurality of signals to represent a corresponding image translated to a position in a common panoramic viewing area.
4) Means for combining the converted signals to obtain a signal representing a combination of the translated images as one image in one panoramic viewing area.
[0016]
Still another embodiment will be described. Here, the present invention is a method for generating a still image comprising the following steps:
1) A step of generating a plurality of signals each representing one of a plurality of images created in different viewing zones.
2) Transforming each signal to represent a corresponding image translated to a position in a common panoramic viewing area and scaled to a common focal length.
3) A step of combining the transformed signals to obtain a signal representing a combination of the translated and scaled images, being a final image of one focal length and one panoramic viewing zone, which is in part of higher resolution than the images of the original sequence and covers a larger viewing zone than the images of the original sequence.
[0017]
[Action]
The above arrangement provides a method and apparatus for creating a still image having a relatively high resolution with the following advantages.
1) There is no need to acquire information at high resolution over the entire image.
2) There is no need to collect information on the majority of images that are less important.
3) A sequence of standard video images of various focal lengths or viewing zones can be acquired as input elements.
4) A standard film image sequence can be acquired as an input element.
5) Increase the resolution of any part of the desired image.
6) A properly programmed general purpose digital computer and standard video or movie equipment can be used.
[0018]
In addition, a method is provided that gives an observer a panoramic view of a scene without requiring excessive data storage and access capabilities, and that allows the observer to navigate from one position of the scene to another.
In addition, the above-described ability can be exhibited regardless of the digitized image data.
[0019]
[Example]
Hereinafter, embodiments of the present invention will be described in detail.
A typical video image is created by a sequence of fields. Each field represents a still image of the scene being imaged. Interlacing results in a vertical shift of 1/2 scan line between successive fields. (Some display systems are scanned without interlacing, in which case there is no vertical shift between the fields.) A sequence of such still fields is generally displayed at a rate of 50 or 60 fields per second, so that motion or change is conveyed through the psychophysical characteristics of the human visual system. Each pair of fields makes up a screen filled with lines, as described above, and each line is composed of picture elements (pixels). Each pixel is represented by a signal value within a specific range, held in computer memory or another suitable digital recording medium. For color images, this range is typically 0-255 for each of the three color components, and for grayscale images it is typically 0-255 per element. Image sources such as satellite images or x-rays may have a range on the order of 0-4096. The pixel values are stored in memory in an arrangement that corresponds in some way to their position in the frame. All operations performed on the video image are typically performed on signals representing the values of individual pixel elements.
[0020]
In the case of monochrome recording of an image, each pixel element is a single discrete element. For color recording, a set of channels, or group of pixel elements, is used for each picture element. For example, in the color value scheme known as RGB, each color is represented by a combination of amounts of red (R), green (G), and blue (B). Each of these three color “channels” is provided separately. In an RGB system, each channel has the same number of pixels per scan line and the same number of scan lines per screen. Other color value systems, described below, have a different number of samples per scan line for different channels. The elements of a pixel are typically located next to each other on the display device and, when displayed at the same time, combine to form the original color (an illusion on the part of the viewer). Other schemes, such as time-sequential display of the pixel elements, may also be used.
[0021]
The RGB color value scheme is useful for specific purposes, but is not necessarily optimal for mathematical manipulation of color values. Other color schemes are more useful, especially those containing a channel representing the luminance of the image. In general, luminance is described as the intensity of light emitted or reflected in a given direction per unit of perceived area. In general, a three-channel color space defined by luminance and two other dimensions is equivalent to the RGB color space. A typical luminance color space is the Y (luminance), I (in-phase) and Q (quadrature) color space used by the National Television System Committee (NTSC) for television broadcasting in the United States. Other luminance color spaces include the CIE (Commission Internationale de l'Éclairage) Y, x, y space (luminance and two chrominance channels) and its variants, the Y, u, v space (luminance and two chrominance channels), and many others.
[0022]
In the present invention, one channel or component is sufficient for most of the processing. All data calculations and manipulations are first performed only on the Y channel of the color image. The Y channel is selected because it usually has the highest signal-to-noise ratio in video systems and because it is often sampled at a higher spatial frequency than the chrominance channels. After the necessary transformations have been determined for the Y channel, the same transformations are applied to the remaining channels, such as the in-phase and quadrature chrominance channels. The characteristics of these transformations are described below.
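Extraction of the Y channel can be sketched as follows. The text does not state conversion weights; the sketch assumes the NTSC luminance weights 0.299, 0.587 and 0.114, and the function names and list-based image layout are illustrative only.

```python
def luminance(r, g, b):
    """Y value for one pixel from its R, G, B values (assumed NTSC weights)."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def y_channel(rgb_image):
    """Map a 2-D list of (R, G, B) tuples to a 2-D list of Y values."""
    return [[luminance(*pixel) for pixel in row] for row in rgb_image]

image = [[(255, 255, 255), (0, 0, 0)],
         [(255, 0, 0), (0, 0, 255)]]
y = y_channel(image)  # motion estimation would operate on this single channel
```

Any transformation determined on `y` would then be applied unchanged to the chrominance channels.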
[0023]
Although a video image is usually considered to be a series of frames, in practice such a “frame” never exists at any one time. What human observers and engineers in the field regard as a frame is actually a pair of “fields”. Each field is composed of either the even-numbered scan lines or the odd-numbered scan lines. The even field is offset from the odd-numbered scan lines by half of one scan line in the vertical direction. A pair of fields is interlaced to form one frame.
[0024]
A pair of fields 101 and 102 is shown schematically in FIG. Field 101 includes only the odd-numbered scan lines of the image, and field 102 includes only the even-numbered scan lines. The video device records these fields separately, one after the other. Thus, each field potentially records a slightly different image, depending on the scene or camera motion occurring during the time required to record a field. Video devices also display the fields in rapid succession, typically at a rate of 50 to 60 fields per second. When the fields are displayed at this rate, the viewer “sees” them combined as frame 110. It will be appreciated that each field (except those at the beginning and end of the sequence) is a component of two consecutive frames. As shown in FIG. 3, field 102 forms the second field of frame 110 and the first field of frame 112. Similarly, field 103 forms the second field of frame 112 and the first field of frame 114. It will be appreciated that a frame does not actually exist as an individual signal element, other than as combined by a human observer.
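The sharing of fields between consecutive frames can be illustrated with a short sketch. The list-based field representation and the function names are assumptions made for illustration; the reference numerals are used only as labels.

```python
def weave(odd_field, even_field):
    """Interleave an odd-line field and an even-line field into one frame.
    Lines are numbered from 1, so the odd field supplies lines 1, 3, 5, ..."""
    frame = []
    for odd_line, even_line in zip(odd_field, even_field):
        frame.append(odd_line)
        frame.append(even_line)
    return frame

def frames_from_fields(fields):
    """Pair each field with its successor: fields k and k+1 form one frame,
    so every interior field belongs to two consecutive frames."""
    return [(fields[k], fields[k + 1]) for k in range(len(fields) - 1)]

fields = [101, 102, 103, 104]        # reference numerals used as labels
pairs = frames_from_fields(fields)   # (101, 102), (102, 103), (103, 104)
frame_110 = weave(["line1", "line3"], ["line2", "line4"])
```

Here field 102 appears in the first and second pairs, matching its role as the second field of frame 110 and the first field of frame 112.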
[0025]
The method of the invention uses a sequence of frames, in particular a sequence of video images. Implementation of the present invention requires deinterlacing of the frame components. Deinterlacing means constructing a signal, typically only in computer memory, that represents an actual frame of pixel elements, including the pixel values of every line of the image and not just the odd or even lines. The present invention can also be applied to data recorded without interlacing. However, since interlaced material is common, it is important to be able to deinterlace it.
[0026]
According to the invention, deinterlacing is achieved by applying a median filter to the data signal. For example, the median of four values is used to create the seventh scan line of the frame deinterlaced at time t: the value of each pixel element of the 7th line of the field at time t-1, the value of the corresponding pixel element of the 7th line of the field at time t+1, the value of the corresponding pixel element of the 6th line of the field at time t, and the value of the corresponding pixel element of the 8th line of the field at time t. The median of these four values is assigned as the value of the corresponding pixel element in line 7 of the frame of the deinterlaced sequence.
[0027]
The same process is repeated for each pixel in the scan line and for each odd-numbered scan line in the field. Even-numbered scan lines are simply taken from the field at time t. It should be pointed out that this deinterlaced frame differs from every frame of the original sequence, because the pixel elements forming the odd-numbered scan lines are created from a combination of the preceding and following fields and the field at time t.
[0028]
This process is repeated to create a second deinterlaced frame, except that the even-numbered scan lines are formed using the median of the even-numbered scan lines from the fields at times t and t+2 and the scan lines of the field at time t+1 that lie above and below the scan line in question. The odd-numbered scan lines are taken directly from the field at time t+1.
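The median deinterlacing described above can be sketched as follows. The sketch makes several assumptions not stated in the text: fields are stored as dictionaries keyed by line number, the median of four values is taken as the mean of the two middle values (the behaviour of `statistics.median`), and at the top or bottom edge only the neighbours that exist are used.

```python
from statistics import median

def deinterlace(prev_field, cur_field, next_field, num_lines):
    """Build a full frame at time t.

    Each field is a dict {line_number: [pixel values]}. `cur_field` (time t)
    holds one parity of lines; `prev_field` (t-1) and `next_field` (t+1)
    hold the other parity. Lines are numbered from 1.
    """
    frame = {}
    for n in range(1, num_lines + 1):
        if n in cur_field:
            frame[n] = list(cur_field[n])  # line present at time t: copy it
            continue
        line = []
        for i in range(len(prev_field[n])):
            # Temporal neighbours: the same line one field earlier and later.
            candidates = [prev_field[n][i], next_field[n][i]]
            # Spatial neighbours: lines above and below in the current field.
            for m in (n - 1, n + 1):
                if m in cur_field:
                    candidates.append(cur_field[m][i])
            line.append(median(candidates))
        frame[n] = line
    return frame

prev_f = {1: [10, 10], 3: [20, 20]}
next_f = {1: [12, 50], 3: [22, 90]}
cur_f = {2: [11, 11], 4: [30, 30]}
frame = deinterlace(prev_f, cur_f, next_f, 4)
```

In this example the outlier value 90 in the following field does not reach the output, which is the usual motivation for a median over a plain average.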
[0029]
After deinterlacing, a series of frames is obtained, each consisting of the full number of even and odd scan lines, that is, the same number of scan lines as in one frame as perceived by an observer watching the sequence of fields.
[0030]
Other deinterlacing methods can also be used and are within the spirit of the present invention. However, these other methods are not believed to give results as good as those of the deinterlacing technique described above. One method is to perform linear interpolation between the scan lines within each field, synthesizing a new line of data between each pair of lines in the field. This technique clearly leads to a loss of spatial resolution in the parts of the image that are not moving. Another method is to interpolate between the fields before and after the current field. This technique sacrifices temporal resolution in the moving parts of the image. It is also possible to operate on the different fields using an affine transformation, warping one field to the next. However, affine transformation of fields does not give results as good as affine transformation of deinterlaced frames, because of the spatio-temporal aliasing caused by interlacing.
[0031]
The sequence of frames used in the method of the present invention is typically a zoom sequence, that is, a zoom from a long focal length to a short focal length or vice versa. Panning (horizontal motion) or jibbing (vertical motion) during the zoom is also possible. A deinterlaced 100-frame sequence 200 is shown schematically in FIG. The sequence consists of a zoom from a wide-angle frame 201 with a relatively short focal length to a close-up frame 300 with a relatively long focal length. In frame 201, a series of letters is represented schematically. In frame 300, only part of the letter “X” located at the center of frame 201 is contained in the frame.
[0032]
Each frame of the sequence 200 is composed of the same number of pixel elements and scan lines. Thus, if the crossing portion of the central letter X occupies 50 pixels at the zoom of frame 201, the same crossing portion occupies 300 pixels at the zoom of frame 300. A typical zoom results in a scaling factor of 4:1 or more from the first to the last frame of the sequence. The image of the crossing portion at the zoom of frame 300 clearly provides more information than the image of the crossing portion at the zoom of frame 201. However, the information needed to render other subjects in the original scene, such as the other letters, cannot be obtained from frame 300 at all, because they are not present in frame 300. Thus, one object of the present invention is to form the information obtained from the separate frames of a sequence into one combined image that provides sufficient information about the majority of the scene.
[0033]
The relationship between individual frames and the original scene is shown schematically in FIG. Frame 201 shows the entire original scene. The frame 202 visualizes only the original scene that fits inside the area 202 s and is smaller than the frame 201. The frame 203 visualizes only the original scene that matches the area 203s, and thus continues in sequence to the area 300s. Accordingly, the original scene is gradually mapped to a smaller portion in the data space having the same size as the data space of all scenes mapped by the zoom 201. (This data space is called a “small” data space because it is the smallest compared to other data spaces for this method.)
[0034]
The object of the present invention is to utilize as much information as possible from each zoom frame. Each zoom frame is enlarged so as to have the same scale as the longest focal length frame, i.e. frame 299. As shown schematically in FIG. 6, the enlarged frames are stacked on top of one another. The original frame 201 is enlarged to many times its original size. Each successive frame 202, 203, etc. is enlarged progressively less, and the final frame 299 is not enlarged at all. The frames can be overlaid so that the image of each frame substantially coincides with the same part of the scene in all the other frames. If the image portion of each frame is opaque, the visible part of the scene (the whole of frame 299 and the borders of all the other frames) is made up, at every point, of the frame with the highest resolution.
[0035]
Each frame is scaled up by transforming the data signal representing the frame and mapping the transformed data through a series of data spaces, which generates a signal representing the data finally mapped into the final data space. This final data space is much larger than the small data space and is therefore referred to as the large data space. (In fact, each frame is constructed in the large space and occupies the same data space as the largest frame, frame 201. However, for all frames except frame 201, the border of the frame is formed by pixels having a value of zero.)
[0036]
The resolution of each enlarged frame of course differs from that of the others, and the detail of the information obtained for a given part of the scene, for example the central letter X, from frame 201 is much less than the detail obtained for the same part of the scene from frame 203 or 299. In other words, the enlarged image of frame 201 suffers from information gaps between pixels. The image from frame 300 is not affected at all by such gaps. These information gaps are filled by interpolation between the pixels for which data is available, as described below.
[0037]
As mentioned above, if the frames are scaled to the same size and overlaid on one another, they coincide well, but only substantially and not necessarily exactly. This is due to camera motion and to the motion of elements within the scene. It is important to take these motions into account when using a frame sequence to create a relatively high resolution still image.
[0038]
The method of the present invention will be described in detail below. In outline, the method first requires that each frame be mapped, or “warped”, into a data space of the same dimensions (the large data space). After warping, a weighted temporal median operation is performed on the frames of the sequence, and the frames so processed are combined to create a composite image.
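A weighted temporal median over warped frames can be sketched as follows. The weighting scheme is not specified at this point in the text, so the sketch simply assumes that a per-pixel weight is supplied for each frame, with a weight of zero marking pixels where a frame contributes no data; the lower weighted median is an assumed tie-breaking choice.

```python
def weighted_median(values, weights):
    """Lower weighted median: the smallest value whose cumulative weight
    reaches half of the total weight. Samples with zero weight are ignored."""
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    total = sum(w for _, w in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= total / 2:
            return value
    raise ValueError("no sample with positive weight")

def composite(warped_frames, weights):
    """Combine warped frames (equal-sized 2-D lists) pixel by pixel.
    weights[k][y][x] is the assumed weight of frame k at that pixel."""
    height, width = len(warped_frames[0]), len(warped_frames[0][0])
    return [[weighted_median([f[y][x] for f in warped_frames],
                             [w[y][x] for w in weights])
             for x in range(width)]
            for y in range(height)]

frames = [[[10, 40]], [[12, 0]], [[50, 0]]]
weights = [[[1, 1]], [[2, 0]], [[1, 0]]]
result = composite(frames, weights)
```

In this example the second pixel is covered by only one frame, so its value is taken from that frame alone.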
[0039]
To map or warp each frame into the large data space, the series of zoom frames is modeled as a series of frames of substantially constant focal length in which one or two moving subjects are recorded. The two situations are reciprocal: instead of treating the recording device as changing its focal length, it is equivalent to treat the entire scene as approaching or receding from a recording device with a lens of fixed focal length.
[0040]
In the following description, the basic steps of the method of the invention are introduced. The basic steps are first introduced for illustrative purposes and do not in any way indicate the order of steps of the method of the present invention. The order of the steps will be explained after introducing the basic concepts.
[0041]
A basic step of the present invention is the recovery of the zoom motion of the camera. The method of the present invention models the changes between successive frames caused by the camera zoom as velocity components of the image in the horizontal, vertical and scale directions (the scale direction being perpendicular to the horizontal/vertical plane). The velocity between frames of such image portions is determined for each of these three directions. As a result, by applying the velocities determined for multiple pairs of consecutive frames to the data representing the original image portion, any pixel value in any frame can be warped to the corresponding position in a frame at a different focal length. A related technique is described in Bergen, J., Burt, P., Hingorani, R. and Peleg, S., “Computing Two Motions from Three Frames”, David Sarnoff Research Center, Princeton, NJ, April 1990. Much of the basic explanation below is substantially borrowed from the above-mentioned article by Bergen et al.
[0042]
A simple closed-form method for estimating the motion of an image region was derived by Bergen et al. First, consider a small motion of part of the image occurring between frames I(x, y, t-1) and I(x, y, t) of an image sequence. Here I(x, y, t) is the observed image, extending in the x (horizontal) and y (vertical) directions, at time t, for example frame 299, and I(x, y, t-1) is the observed image at time t-1, for example frame 298. A pattern in x and y, expressed as P(x, y), moves with velocity p(x, y), which is the velocity field over all pixels and has motion components in both the x and y directions. The motion field p(x, y) is expressed below in terms of displacements in x and y.
$$\mathbf{p}(x, y) = \bigl(p_x(x, y),\ p_y(x, y)\bigr) \qquad (1)$$
Here $p_x(x, y)$ is the displacement in the x direction (a function of x and y), and $p_y(x, y)$ is the displacement in the y direction (a function of x and y). Therefore, the following formulas are derived.
$$I(x, y, t) = P(x - t\,p_x,\ y - t\,p_y) \qquad (2)$$
$$I(x, y, t) = I(x - p_x,\ y - p_y,\ t - 1) \qquad (3)$$
[0043]
The notation can be simplified by setting the frame interval to one unit of time. In accordance with a least squared error technique, it is beneficial to determine the motion field p = (px, py) that minimizes the square error between the measured value and the calculated value using field p.
[0044]
[Expression 1]

Assuming that the displacement from one frame to the next is small, equation (4) can be simplified by a truncated Taylor series expansion of I(x, y, t).
[0045]
[Expression 2]

where
[Equation 3]

and therefore
[Expression 4]

is obtained.
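Written out with the symbols defined above, equations (4) to (6) plausibly take the following form. This is a reconstruction from the surrounding text and from the least-squares formulation it describes, not a transcription of the original expressions; the name Err for the error is an assumption.

$$\mathrm{Err}(\mathbf{p}) = \sum_{x, y} \bigl( I(x, y, t) - I(x - p_x,\ y - p_y,\ t - 1) \bigr)^2 \qquad (4)$$
$$I(x - p_x,\ y - p_y,\ t - 1) \approx I(x, y, t) - p_x I_x - p_y I_y - I_t \qquad (5)$$
$$I_x = \frac{\partial I}{\partial x}, \qquad I_y = \frac{\partial I}{\partial y}, \qquad I_t = \frac{\partial I}{\partial t}$$
$$\mathrm{Err}(\mathbf{p}) = \sum_{x, y} \bigl( p_x I_x + p_y I_y + I_t \bigr)^2 \qquad (6)$$

Substituting (5) into (4) cancels the term I(x, y, t) and leaves the sum of squares in (6).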
[0046]
The image motion is obtained by setting the derivative of equation (6) with respect to each parameter of the velocity components to zero (because the error is at a minimum where the derivatives are zero) and solving the resulting system of equations.
If the motion of the image portion is modeled by simple translation, then p = (ax, ay), where ax and ay are constants in units of pixels, and the optical flow equations become as follows.
[0047]
[Equation 5]
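For the pure translation model, setting the two derivatives of the error to zero gives two linear equations in ax and ay. The sketch below assumes the error has the form Σ(ax·Ix + ay·Iy + It)², takes the spatial derivatives from the average of the two frames with central differences, and takes It as the later frame minus the earlier one, so that the later frame equals the earlier frame shifted by (ax, ay). These conventions and all names are assumptions made for illustration.

```python
def estimate_translation(frame_a, frame_b):
    """Least-squares estimate of a pure translation (ax, ay) in pixels,
    such that frame_b(x, y) is approximately frame_a(x - ax, y - ay).
    Frames are equal-sized 2-D lists indexed [row][column]."""
    h, w = len(frame_a), len(frame_a[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Spatial derivatives of the average of the two frames.
            ix = (frame_a[y][x + 1] + frame_b[y][x + 1]
                  - frame_a[y][x - 1] - frame_b[y][x - 1]) / 4.0
            iy = (frame_a[y + 1][x] + frame_b[y + 1][x]
                  - frame_a[y - 1][x] - frame_b[y - 1][x]) / 4.0
            it = frame_b[y][x] - frame_a[y][x]  # temporal derivative
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Normal equations: sxx*ax + sxy*ay = -sxt and sxy*ax + syy*ay = -syt.
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("image gradients do not constrain the motion")
    ax = (-sxt * syy + syt * sxy) / det
    ay = (-syt * sxx + sxt * sxy) / det
    return ax, ay

def pattern(x, y):
    return x * x + 2.0 * y * y  # smooth synthetic test pattern

first = [[pattern(x, y) for x in range(8)] for y in range(8)]
second = [[pattern(x - 0.5, y + 0.25) for x in range(8)] for y in range(8)]
ax, ay = estimate_translation(first, second)  # close to (0.5, -0.25)
```

A quadratic test pattern is used because the first-order approximation with averaged gradients is exact for it; real images satisfy the constraint only approximately.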

[0048]
It will be appreciated that in a typical application of the method of the invention, the motion cannot be modeled by simple translation, because simple translation does not recover or model a change of scale such as that produced by zooming a lens. Instead, the motion of the image is modeled more accurately as an affine transformation (i.e. a transformation that maps straight lines to straight lines and parallel lines to parallel lines, but that may change the distance between points and the angle between lines). In this case, the motion field p has six parameters ax, bx, cx, ay, by and cy, which are applied as follows.
[0049]
[Formula 6]
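Consistent with the coefficient descriptions and the worked arithmetic given later for the pixel at (20, 30), equations (9x) and (9y) can be written as follows. This is a reconstruction from context rather than a transcription of the original expression.

$$p_x(x, y) = a_x + b_x\,x + c_x\,y \qquad (9x)$$
$$p_y(x, y) = a_y + b_y\,x + c_y\,y \qquad (9y)$$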

[0050]
Here ax and ay are as described above, bx is the fractional scale factor of x in the x direction, and cx is the fractional rotation factor of x, which depends on the y position. by is the fractional rotation factor of y, which depends on the x position, and cy is the fractional scale factor of y. The normal range of ax and ay from one frame to the next in a zoom is on the order of a few pixels. The normal range of the remaining factors is about 0.001 to 0.1. If the error of equation (6) is differentiated with respect to each of these six parameters, a system of six equations in the six unknowns ax, bx, cx, ay, by and cy is obtained. That is,
[0051]
[Expression 7]
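Assuming the error has the sum-of-squares form Σ(px·Ix + py·Iy + It)² and the motion field is linear in the six parameters, the six equations of system (10) can be written compactly as below. The vector notation is introduced here for brevity and is not taken from the original expression.

$$\boldsymbol{\phi} = \bigl(I_x,\ x I_x,\ y I_x,\ I_y,\ x I_y,\ y I_y\bigr)^{\mathsf T}, \qquad \mathbf{a} = \bigl(a_x,\ b_x,\ c_x,\ a_y,\ b_y,\ c_y\bigr)^{\mathsf T}$$
$$\Bigl(\sum_{x, y} \boldsymbol{\phi}\,\boldsymbol{\phi}^{\mathsf T}\Bigr)\,\mathbf{a} = -\sum_{x, y} I_t\,\boldsymbol{\phi} \qquad (10)$$

Each row of this matrix equation is the derivative of the error with respect to one of the six parameters, set to zero.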

It is.
[0052]
This system must be solved for the coefficients ax, bx, cx, ay, by and cy. A solution can be obtained because Ix, Iy and It, i.e. the partial derivatives of the image with respect to x, y and t, can be determined from the image values at time t and time t + 1. The temporal derivative It is determined by subtracting the pixel value at time t + 1 from the corresponding pixel value at time t. Ix is determined by adding the pixel value at time t and the corresponding pixel value at time t + 1 and running the sum through a derivative filter in x. Iy is determined by running the same sum through a derivative filter in y. Once these three values have been determined for all pixels in the image, the system of equations (10) is solved for the coefficients ax, bx, cx, ay, by and cy. Knowing these coefficients, it is possible to determine the change in position, from one frame to the next, of a given pixel value representing a particular feature of the image.
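Solving for the six coefficients can be sketched as an ordinary least-squares problem. The sketch assumes the error Σ(Ix·px + Iy·py + It)² with px = ax + bx·x + cx·y and py = ay + by·x + cy·y, takes the derivative images as given, and uses plain Gaussian elimination; none of the function names come from the source.

```python
import random

def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def estimate_affine(ix, iy, it):
    """Return [ax, bx, cx, ay, by, cy] minimising the squared error.
    ix, iy, it are equal-sized 2-D lists of derivative values."""
    normal = [[0.0] * 6 for _ in range(6)]
    rhs = [0.0] * 6
    for y, row in enumerate(ix):
        for x, gx in enumerate(row):
            gy = iy[y][x]
            phi = [gx, x * gx, y * gx, gy, x * gy, y * gy]
            for i in range(6):
                rhs[i] -= it[y][x] * phi[i]
                for j in range(6):
                    normal[i][j] += phi[i] * phi[j]
    return solve_linear(normal, rhs)

# Synthetic check: derivative fields built to satisfy the constraint exactly.
rng = random.Random(0)
true = [5.0, 0.1, 0.0, 0.0, 0.0, 0.1]
ix = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(8)]
iy = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(8)]
it = [[-(ix[y][x] * (true[0] + true[1] * x + true[2] * y)
         + iy[y][x] * (true[3] + true[4] * x + true[5] * y))
       for x in range(8)] for y in range(8)]
estimate = estimate_affine(ix, iy, it)
```

With derivatives measured from real frames, the recovered coefficients are approximate, and the estimate is only as good as the small-displacement assumption.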
[0053]
Thus, to determine the position of a pixel value of frame 201 after frame 201 has been warped by one step, so that its image portions coincide with the corresponding portions of the unwarped frame 202, the transformations of equations (9x) and (9y) are applied to each pixel value of frame 201. Consider the image portion at point (x, y), pixel location (20, 30), as shown in FIG. 7. (FIG. 7 is not drawn to scale.) In FIG. 7, frame 201 at its original position is denoted by reference numeral 201. After warping to the scale of frame 202, frame 201 is denoted 2012. After warping to the scale of frame 203, it is denoted 2013, and so on up to frame 20119. For a scale increase of 10% (a large rate from one frame to the next) and a pan of 5 pixels to the right, typical coefficients between frames 201 and 202 are as follows.
[0054]
[Equation 8]
$$a_x = 5,\quad b_x = 0.1,\quad c_x = 0,\quad a_y = 0,\quad b_y = 0,\quad c_y = 0.1$$
The displacement of the value at pixel (20, 30) of frame 201, in going from frame 201 to frame 2012, is 5 + (0.1 × 20) + (0 × 30) = 7 in the x direction; that is, the value moves 7 pixels in the positive x direction to x position 27. The displacement in the y direction is 0 + (0 × 20) + (0.1 × 30) = 3, so the value moves 3 pixels in the positive y direction to y position 33. This is shown schematically in FIG. 7 by the curved arrow A from pixel position (x, y) of frame 201 to another position in frame 2012 (the same location in the image).
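The arithmetic of this worked example can be reproduced with a short sketch. The function names are illustrative, and the form of the displacement (a + b·x + c·y in each direction) is inferred from the arithmetic above rather than copied from equations (9x) and (9y).

```python
def affine_displacement(x, y, coeffs):
    """Displacement (dx, dy) of the value at (x, y) for one warping step.

    coeffs is (ax, bx, cx, ay, by, cy); the form dx = ax + bx*x + cx*y and
    dy = ay + by*x + cy*y is inferred from the worked example.
    """
    ax, bx, cx, ay, by, cy = coeffs
    return ax + bx * x + cx * y, ay + by * x + cy * y


def warp_point(x, y, coeffs):
    """New position of the value that was at (x, y) before the warp."""
    dx, dy = affine_displacement(x, y, coeffs)
    return x + dx, y + dy


# 10% scale increase and a pan of 5 pixels to the right, as in the example.
coeffs_201_202 = (5.0, 0.1, 0.0, 0.0, 0.0, 0.1)
new_x, new_y = warp_point(20, 30, coeffs_201_202)
```

With these coefficients the value at (20, 30) lands at x position 27 and y position 33, matching the figures in the text.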
[0055]
Similarly, to find the position that the same pixel value occupies in frame 2013, the transformation equations (9x) and (9y) are applied to the pixel coordinates in frame 2012, using the coefficients ax, bx, cx, ay, by and cy obtained by solving the set of six equations (10) between frames 202 and 203. Those coefficients may differ from those obtained between frame 201 and frame 202.
[0056]
The transformation equations are linear and therefore invertible. The coefficients ax, bx, cx, ay, by and cy are used to convert from the scale of frame 201 to the scale of frame 202. To convert from the scale of frame 202 back to the scale of frame 201, the linear inverse of the transformation defined by these coefficients is used.
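A minimal sketch of that inversion follows, assuming the position map x' = x + ax + bx·x + cx·y and y' = y + ay + by·x + cy·y (the form inferred from the worked example). The function names are illustrative only.

```python
def forward(x, y, coeffs):
    """Map a position at the scale of one frame to the scale of the next."""
    ax, bx, cx, ay, by, cy = coeffs
    return x + ax + bx * x + cx * y, y + ay + by * x + cy * y


def inverse(xw, yw, coeffs):
    """Undo forward(): recover the original position from the warped one."""
    ax, bx, cx, ay, by, cy = coeffs
    # The forward map is [xw, yw] = M [x, y] + [ax, ay] with the matrix M below.
    m00, m01, m10, m11 = 1 + bx, cx, by, 1 + cy
    det = m00 * m11 - m01 * m10
    if det == 0:
        raise ValueError("transformation is not invertible")
    u, v = xw - ax, yw - ay
    return (m11 * u - m01 * v) / det, (m00 * v - m10 * u) / det
```

Applying `forward` and then `inverse` with the same coefficients returns the starting position, up to floating-point rounding.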
[0057]
As described above, the pixel value from point (x, y) in frame 201 is warped to a new position in frame 2012. The pixel value from point (x + 1, y) is also warped to a new position in frame 2012, but that position is typically not adjacent to the warped position of the value from pixel (x, y). If nothing further is done, the space between these two points in frame 2012 will be blank. Some kind of interpolation is required to fill this space. Various techniques are possible, including linear and bilinear interpolation. Bilinear interpolation has been used to good effect.
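Bilinear interpolation itself can be sketched as below. The helper evaluates an image, stored as nested lists, at a non-integer position; how the positions to be filled are chosen is left open, and the function name and the clamping at the image border are assumptions made for this illustration.

```python
import math


def bilinear_sample(image, x, y):
    """Interpolate image (rows of numbers) at the real-valued position (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp to the valid range so that border positions remain defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally on the two bracketing rows, then vertically.
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy
```

One way to use it, for example, is to map each empty destination position back into the source frame with the inverse transformation and sample there.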
[0058]
As described by Bergen, Burt et al., the motion estimation method described above relies on a truncated Taylor series approximation and is accurate only when the image shift from one frame of the zoom sequence to the next is small (one pixel or less). Better results are obtained by using the multiresolution structure shown schematically in FIG. 8, which also makes the technique applicable to the more general case of large motions.
[0059]
In the process of determining the affine transformation parameters ax, bx, cx, ay, by and cy, a Gaussian pyramid G is constructed for each frame of an image frame pair, for example frames 201 and 202. The pyramid is a sequence of modified copies of the original image in which, for each successive member of the sequence, the resolution and the sample density are reduced by a power of 2. It must be noted that, except that one frame of the sequence 200 forms the base level of each Gaussian pyramid, the members of a Gaussian pyramid sequence, such as G201,0, G201,1, G201,2, ..., G201,l, are entirely distinct from the members of the sequence 200.
[0060]
The data are passed through a low-pass filter to reduce resolution. Low-pass filtering eliminates the data related to small or rapid movements in the image, so large motions are detected at the levels where the resolution is most reduced. Since the low-pass filter eliminates most of the variation in the data, there is no need to perform the calculation for every pixel. Subsampling by a factor of 2 is therefore applied to reduce the number of pixels to be processed. This subsampling increases computational efficiency and speeds up the operation. A typical subsampling pattern discards alternate rows and alternate columns.
[0061]
Each level l of the pyramid is obtained by convolving the data of the preceding level with a filter kernel ω, which acts as the low-pass filter, and then subsampling:
$$G_{t,l} = \left[ G_{t,l-1} * \omega \right]_{\downarrow 2}$$
where Gt,l is the l-th pyramid level for the image I(x, y, t), and ↓2 indicates that the quantity in brackets is subsampled by 2 in both x and y. For example, to obtain G201,1, G201,0 is convolved with the filter ω and the result is subsampled.
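The filter-then-subsample step can be sketched as follows. The kernel is an assumption: the text does not give ω, so a separable 5-tap binomial kernel (1, 4, 6, 4, 1)/16 is used here as a common low-pass choice, with border pixels replicated. The function names are illustrative.

```python
# Assumed kernel: 5-tap binomial low-pass filter (not specified in the text).
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)


def _blur_rows(image):
    """Convolve every row with KERNEL, replicating the border pixels."""
    w = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(w):
            acc = 0.0
            for k, weight in enumerate(KERNEL):
                xi = min(max(x + k - 2, 0), w - 1)
                acc += weight * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def _transpose(image):
    return [list(col) for col in zip(*image)]


def reduce_level(image):
    """One pyramid step: low-pass filter in x and y, then subsample by 2."""
    blurred = _transpose(_blur_rows(_transpose(_blur_rows(image))))
    # Keep every other row and every other column.
    return [row[::2] for row in blurred[::2]]


def gaussian_pyramid(image, levels):
    """Return [G0, G1, ..., G_levels], where G0 is the original image."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid
```

For an 8 × 8 input, `gaussian_pyramid(image, 2)` yields levels of 8 × 8, 4 × 4 and 2 × 2 samples.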
[0062]
The analysis of the transformation starts at a low-resolution level of the image pyramid, e.g. level 3. For an original image defined by 480 scan lines and 640 pixels, analysis starting at level 3 typically gives good results. The sample distance at level l is 2^l times the sample distance of the original image, so correspondingly larger image velocities can be estimated at that level. At each successive iteration of the tracking procedure, the analysis moves to the next higher-resolution pyramid level, approaching the original.
[0063]
The determination of the affine transformation parameters therefore starts at, for example, level 2. First, equation (10) must be solved for ax, bx, cx, ay, by and cy between the pyramid G201 and the pyramid G202. This is done in two steps. In the first step, a seed set of affine transformation parameters p2 is selected. This seed may be all zeros, or it may be chosen to approximate a known aspect of the transformation, such as the zoom scaling factor or a known pan or jib translation. This affine transformation is applied to G201,2 at W2 to obtain a warped image at level 2, illustrated by the distorted square G201,2w in FIG. 8. In most cases, this warping will not exactly reproduce the Gaussian pyramid member at the next time instant t + 1, i.e. G202,2. Therefore, in the second step, a set of adjustment affine parameters Δp2 is estimated between the image values G202,2 and G201,2w. These are estimated as described above.
[0064]
First, Ix, Iy and It must be computed, as described above for frames that have not been low-pass filtered and subsampled. They are computed in the same way, except that the smaller, subsampled set of low-pass-filtered data is used. Instead of subtracting the values of frame 201 from those of frame 202, the values of the warped pyramid frame G201,2w are subtracted from those of pyramid frame G202,2. In this way the partial derivatives at level 2 are determined, after which the adjustment affine parameters ax, bx, cx, ay, by and cy for this level can be determined. The adjustment affine parameters are shown collectively in FIG. 8 as Δp2.
[0065]
The set of adjustment affine parameters is combined with the affine parameters p2 from the previous level of the Gaussian pyramid for frame 201 to form the affine parameters for level 1, i.e. p1. This combination is not a simple addition. For example, the new ax term is based on the ax term at time t (pyramid G202), the ax term at time t-1 (pyramid G201), and other changes in the x direction. The following equation illustrates this relationship.
[0066]
[Equation 9]
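As an illustration of why the combination is not a simple addition, the sketch below composes two parameter sets. It assumes that "combining" means applying the first transformation and then the second, with each set interpreted as the displacement form inferred from the worked example; this is a hypothetical reading, not a reproduction of Equation 9. It also ignores any rescaling of the translation terms between pyramid levels. The function names are illustrative.

```python
def apply(coeffs, x, y):
    """Position of (x, y) after one affine warp in displacement form."""
    ax, bx, cx, ay, by, cy = coeffs
    return x + ax + bx * x + cx * y, y + ay + by * x + cy * y


def compose(first, second):
    """Parameters of the single warp equal to 'first, then second'.

    Assumption: each parameter set defines the map v -> M v + t with
    M = [[1 + bx, cx], [by, 1 + cy]] and t = (ax, ay).
    """
    ax1, bx1, cx1, ay1, by1, cy1 = first
    ax2, bx2, cx2, ay2, by2, cy2 = second
    m1 = ((1 + bx1, cx1), (by1, 1 + cy1))
    m2 = ((1 + bx2, cx2), (by2, 1 + cy2))
    # Matrix product M2 * M1.
    m = [[sum(m2[r][k] * m1[k][c] for k in range(2)) for c in range(2)]
         for r in range(2)]
    # Translation: M2 * t1 + t2.
    tx = m2[0][0] * ax1 + m2[0][1] * ay1 + ax2
    ty = m2[1][0] * ax1 + m2[1][1] * ay1 + ay2
    return (tx, m[0][0] - 1, m[0][1], ty, m[1][0], m[1][1] - 1)
```

Under this reading the combined ax term depends on both ax terms and on the scale and rotation terms of the second set, which is consistent with the description above.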

[0067]
This process is repeated, this time at level 1, and continues through all levels until the affine transformation parameters ax, bx, cx, ay, by and cy between frames 201 and 202 are obtained at the original level. As the affine parameters converge toward their most accurate values, the Δpl terms tend to zero.
[0068]
Thus, the above operations are performed to determine the warping factors for converting a frame from the scale of any frame to that of the next (for example, from the scale of frame 226 to that of frame 227). In this way a set of affine transformation parameters ax, bx, cx, ay, by and cy is calculated for each pair of adjacent frame scales. Thereafter, to convert a frame, e.g. frame 251, to the appropriate size, it is first converted to the scale of frame 252 using the affine transformation parameters determined by the earlier analysis of frames 251 and 252. The transformed frame 2512 is then converted to the scale of frame 253 using the parameters determined by the earlier analysis of frames 252 and 253. This process is repeated until the frame has been converted into the large data space at the scale of frame 300.
[0069]
The above method works well if there is little or no relative motion between the camera and the subject (either of which may move) and the only image change is due to zooming. In practice, however, it is desirable to be able to eliminate such motion. Several methods are conceivable. A basic but effective method is to inspect all the frames visually and identify large motions, such as a person crossing the viewing zone. A mask can be used to cover the area where motion occurs in each frame, and this area can be ignored during the transformation. The operator manually selects the pixel values that are to occupy the masked position in the final picture.
[0070]
Another method is to use the technique described by Bergen, Burt et al., which tracks two motions, as shown schematically in FIG. 9. The data are evaluated for a pair of motions. The image I(x, y, t) is modeled as a combination of two different image patterns P and Q, having independent motions p and q respectively. The relationship between I, P and Q is as follows.
[0071]
[Expression 10]
$$I(x, y, t) = P^{tp} \oplus Q^{tq}$$
In the above equation, the symbol obtained by overlapping ○ and + is expressed as (+) for convenience in the following.
[0072]
Here, the operator (+) represents an operation for combining the two patterns, such as addition or multiplication, and P^{tp} represents the pattern P transformed by the motion p over time t. Bergen, Burt et al. show that, if one of the motion components and the combination rule (+) are known, the other motion can be calculated using the single-motion technique described above, without making any assumptions about the nature of the patterns P and Q. If motion p is known, only motion q need be determined, and vice versa. The component of the pattern P moving at velocity p can be removed from the image sequence by shifting each image frame by p and subtracting the shifted frame values from the next frame. The resulting difference sequence contains only patterns moving at velocity q.
[0073]
Consider the special case in which the combining operation (+) is addition, and take three frames I(1), I(2) and I(3) of the sequence 200. Let D1 and D2 denote the difference frames generated between those frames. They are formed as follows.
[0074]
[Expression 11]
$$D_1 = I(2) - I(1)^{p}, \qquad D_2 = I(3) - I(2)^{p}$$
[0075]
This is shown schematically in FIG. 9 as a warp of I(1) at 302, which transforms the pattern P by one step. The next step is the subtraction at 304 of the warped I(1) from I(2), which removes the effect of the motion of pattern P. The result is D1, one element of the difference sequence. D2 is formed similarly, as the difference at 308 between I(3) and I(2) warped at 306 by the motion of pattern P.
[0076]
The difference sequence now consists of a new pattern, Q^q - Q^p, moving with the single motion q:
[Expression 12]
$$D_n = \left( Q^{q} - Q^{p} \right)^{nq}$$
[0077]
Therefore, the motion q can be calculated from the two difference images D1 and D2 using the single-motion estimation technique described above. This is shown schematically at 310 in FIG. 9. Similarly, motion p can be recovered when q is known. The observed images I(x, y, t) are shifted by q and subtracted to form a new difference sequence.
[0078]
[Formula 13]
$$D_1 = I(2) - I(1)^{q}, \qquad D_2 = I(3) - I(2)^{q}$$
This sequence is the pattern P^p - P^q moving with velocity p:
[Expression 14]
$$D_n = \left( P^{p} - P^{q} \right)^{np}$$
Therefore, p can be recovered using the single-motion estimation technique.
[0079]
This shift-and-subtract procedure removes one moving pattern from the image sequence without regard to, or determination of, what that pattern is. In practice, neither p nor q is known at first. However, both can be recovered by repeating the above technique, even if only a very rough initial estimate is chosen. The iterative procedure applies the single-motion technique repeatedly. Starting from even a rough estimate of the parameters that define motion p, an estimate of q is obtained and fed back at 312 to the warping steps 302 and 306. An improved estimate of p is then obtained using the estimate of q and fed back at 312 to the warping steps 302 and 306. The procedure is repeated and converges quickly to accurate estimates. With real image sequences, adequate convergence is obtained after 3 to 12 cycles.
[0080]
The steps in this part of the invention are summarized as follows:
1. An initial estimate p0 of the motion of the pattern P is chosen.
2. The latest estimate pn is used to form the difference images D1 and D2 of equation (12).
3. The single-motion estimator is applied to D1 and D2 to obtain the estimate qn+1.
4. New difference images D1 and D2 are formed using the estimate qn+1.
5. The single-motion estimator is applied to the new D1 and D2 to obtain a new estimate pn+2.
6. The procedure is repeated from step 2.
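The steps above can be sketched on a deliberately simplified toy problem. Everything specific here is an assumption made for illustration: the "images" are one-dimensional lists, motions are whole-sample circular shifts, the two patterns combine by addition, and a brute-force search over shifts stands in for the single-motion estimator. It is not the affine estimator described earlier.

```python
def shift(signal, s):
    """Circularly shift a 1-D signal by s samples to the right."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]


def make_frames(pattern_p, pattern_q, p, q):
    """Frames I(1), I(2), I(3) of two additively combined moving patterns."""
    return [[a + b for a, b in zip(shift(pattern_p, p * t), shift(pattern_q, q * t))]
            for t in (1, 2, 3)]


def difference_images(frames, motion):
    """Shift each frame by `motion` and subtract it from the next frame."""
    i1, i2, i3 = frames
    d1 = [u - v for u, v in zip(i2, shift(i1, motion))]
    d2 = [u - v for u, v in zip(i3, shift(i2, motion))]
    return d1, d2


def estimate_single_motion(d1, d2):
    """Stand-in single-motion estimator: the shift that best maps d1 onto d2."""
    def ssd(s):
        return sum((u - v) ** 2 for u, v in zip(d2, shift(d1, s)))
    return min(range(len(d1)), key=ssd)


def two_motions(frames, p0, cycles=3):
    """Alternate between estimating q from p and p from q (steps 1 to 6)."""
    p, q = p0, None
    for _ in range(cycles):
        d1, d2 = difference_images(frames, p)   # step 2
        q = estimate_single_motion(d1, d2)      # step 3
        d1, d2 = difference_images(frames, q)   # step 4
        p = estimate_single_motion(d1, d2)      # step 5
    return p, q


# Toy data: a weak pattern P moving by 1 sample per frame and a strong
# pattern Q moving by 3 samples per frame.
P = [0, 1, 3, 2] + [0] * 12
Q = [0] * 8 + [100] + [0] * 7
frames = make_frames(P, Q, p=1, q=3)
p_est, q_est = two_motions(frames, p0=0)
```

In this toy case the stronger pattern dominates the first estimate, and the alternation then recovers both shifts from a zero initial guess; behaviour on real images depends on the estimator and the data.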
[0081]
By observing the two sets of affine parameters extracted by this two-motion technique, it is possible to identify motion within the scene or camera motion. In general, parameters that relate only to zoom warping change smoothly and only slightly from one frame to the next. Parameters that relate to scene motion or camera motion change differently from those due to zoom. These differences can be observed by inspection.
[0082]
It is also possible in principle to identify scene or camera motion automatically, by comparing the affine parameters from one frame pair to the next and triggering a flag when the change exceeds a preset level. One possible technique is to compare the difference between the affine parameters of two successive frame pairs with the standard deviation over a set number of preceding frame pairs. For example, for a sequence of 70 frames, the standard deviation is typically determined over at least 10 frame pairs.
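A minimal sketch of such a flagging rule follows. The function name, the use of a single parameter series, and the multiplier `k` applied to the standard deviation are assumptions for illustration; the text specifies only that the change is compared with the standard deviation over a set number of preceding frame pairs.

```python
import statistics


def flag_motion(values, window=10, k=3.0):
    """Flag frame pairs whose parameter change is unusually large.

    values: one affine parameter (for example bx), one entry per frame pair.
    A change is flagged when its magnitude exceeds k times the standard
    deviation of the preceding `window` changes (k is an assumed threshold).
    Returns the indices into `values` at which the flagged changes arrive.
    """
    changes = [b - a for a, b in zip(values, values[1:])]
    flagged = []
    for i in range(window, len(changes)):
        spread = statistics.pstdev(changes[i - window:i])
        if abs(changes[i]) > k * spread:
            flagged.append(i + 1)
    return flagged
```

With `window=10`, no decision is made until ten earlier changes are available, in line with the example of using at least 10 frame pairs.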
[0083]
If both the camera and elements in the scene are moving, there are more than two motions, and a more advanced method of eliminating camera motion is useful. Combining the affine two-motion estimation and masking techniques described above has been found to give useful results. It may also be useful to determine the probability density function of the displacement vectors in the image. For a general treatment, see Girod, B. and Kuo, D., “Direct Estimation of Displacement Histograms”, in the proceedings of the Optical Society of America “Mechanical Understanding and Machine Vision Congress”, Cape Cod, Massachusetts, USA, June 1990. That method provides information on how many different subjects move between frames and what their displacement vectors are. A local block-matching estimator is used to locate these moving subjects spatially. The regions of the moving subjects are masked out by computation, and the affine estimate is then calculated.
[0084]
Assuming that the deviation between the frame pairs is small and that there is no unexpected camera movement, such as a bump or a sudden change in focal length, the affine parameters are not very different between the frame pairs. After the parameters are determined as described above, the coefficients are simplified to remove spurious values.
Once the Y channel data from each frame of sequence 200 have been warped, the affine parameters so determined are applied to the other channels, e.g. the in-phase and quadrature channels, to warp the frames in full color.
[0085]
After the full raster of the warped short-focal-length frame 201 is filled, tone scale correction can be performed to compensate for changes that affect tone, such as variations in aperture setting from one frame to the next. Starting from the center image, samples of light intensity are taken around the boundary where two images adjoin. A spline is fitted to the data, and the pixels of the larger (lower-resolution) image are adjusted to match those of the smaller image. The tone scale of this corrected image is then used for comparison with the next larger warped image, and the procedure is repeated out to the largest image.
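The idea of matching tone across a boundary can be sketched with a simpler model than the one named above. This illustration swaps the spline for a straight-line (gain and offset) least-squares fit, purely to keep the example short; the function names and the sample values are hypothetical.

```python
def fit_gain_offset(outer_samples, inner_samples):
    """Least-squares line mapping outer-image intensities to inner-image ones.

    The samples are intensities taken at matching positions along the boundary
    where the two warped images adjoin. A straight line replaces the spline of
    the text as a simplification.
    """
    n = len(outer_samples)
    mean_o = sum(outer_samples) / n
    mean_i = sum(inner_samples) / n
    sxx = sum((o - mean_o) ** 2 for o in outer_samples)
    if sxx == 0:
        raise ValueError("boundary samples have no intensity variation")
    sxy = sum((o - mean_o) * (i - mean_i)
              for o, i in zip(outer_samples, inner_samples))
    gain = sxy / sxx
    return gain, mean_i - gain * mean_o


def correct_tone(image, gain, offset):
    """Apply the fitted tone mapping to every pixel of the outer image."""
    return [[gain * v + offset for v in row] for row in image]
```

The corrected outer image would then serve as the reference for the next larger warped image, following the order given in the text.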
[0086]
Once the full-color data from each frame of sequence 200 have been warped to the same data space, the data for each pixel from one frame must be combined with the data from all the other frames. Several techniques are possible. The most basic technique is to select the pixel value for the final composite picture from the frame with the highest resolution. As shown in FIG. 6, frame 299w, which is the warped frame 299, generally occupies the center of the composite picture and provides the highest resolution available for the center of the image. The information from frame 298 occupies a square annulus around the center, and this information is the highest resolution available for that part. Information from frame 297 occupies a somewhat larger square annulus around the annular region of frame 298w, and so on, until the annular periphery of the first frame 201w occupies the outermost portion of the composite picture.
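This most basic selection rule can be sketched as below, assuming the warped frames are supplied in order of increasing focal length and that `None` marks positions a frame does not cover. Both conventions, and the function name, are assumptions of this illustration.

```python
def composite_highest_resolution(warped_frames):
    """Pick, for each pixel, the value from the highest-resolution frame.

    warped_frames: equal-sized nested lists ordered from shortest to longest
    focal length; None marks positions a frame does not cover.
    """
    h, w = len(warped_frames[0]), len(warped_frames[0][0])
    out = [[None] * w for _ in range(h)]
    # Later frames have longer focal lengths, so they overwrite earlier ones
    # wherever they have data.
    for frame in warped_frames:
        for y in range(h):
            for x in range(w):
                if frame[y][x] is not None:
                    out[y][x] = frame[y][x]
    return out
```

The result has the nested-annulus layout described above: the longest-focal-length frame fills the center and progressively wider frames fill the surrounding rings.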
[0087]
Although the above procedure gives good results, distinct edges appear that highlight the boundaries between regions taken from different frames. For this reason, a weighting function is applied, for each pixel, to all the warped frames of the sequence, and the weighted median is taken as the value of that pixel. As shown in FIG. 10, the position of the pixel is indicated by the vector V, which passes through the same image position in all the warped frames, i.e. 201w to 299w. The weight function is applied to the image values along the vector V. A typical weight function is shown in the graph of FIG. 11. As can be seen, the weight function is concave upward, and pixel values from the most zoomed-in shots are given the maximum weight, perhaps 100%. Various weight functions can be applied depending on the desired effect. In general, frames of higher resolution are weighted more heavily than frames of lower resolution.
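A weighted median along the vector V can be sketched as follows. The definition used (the smallest value at which the accumulated weight reaches half the total) is one common convention, and the quadratic weight curve is only an assumed example of a concave-upward function; neither is taken from the figures.

```python
def weighted_median(values, weights):
    """Smallest value at which the running weight reaches half the total."""
    if not values:
        raise ValueError("no values to combine")
    half = sum(weights) / 2.0
    acc = 0.0
    for value, weight in sorted(zip(values, weights)):
        acc += weight
        if acc >= half:
            return value
    return max(values)  # only reached through floating-point rounding


def zoom_weights(n):
    """Assumed concave-upward weights rising to 1.0 for the longest focal length."""
    return [((i + 1) / n) ** 2 for i in range(n)]


def combine_pixel(stack):
    """Combine one pixel position across warped frames (None = no data)."""
    weights = zoom_weights(len(stack))
    pairs = [(v, w) for v, w in zip(stack, weights) if v is not None]
    return weighted_median([v for v, _ in pairs], [w for _, w in pairs])
```

Here `stack` holds the values along V, ordered from the shortest to the longest focal length, so the later entries carry the larger weights.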
[0088]
The foregoing has described, somewhat piecemeal, the building blocks of a method for obtaining one high-resolution still image from a sequence of lower-resolution frames. FIG. 13 shows the steps of the method in a generally preferred order. A sequence of video fields is captured at 402. The fields are deinterlaced at 404 to create a series of frames. At this point, alternative paths are available. Subject or camera motion can be separated from zoom motion at 406, after which optical flow analysis is performed to generate the affine transformation coefficients ax, bx, cx, ay, by and cy. Alternatively, the method can branch from 404 to step 408, which combines the optical flow analysis with the separation of subject or camera motion from zoom motion. This branch also yields the coefficients ax, bx, cx, ay, by and cy. Next, at 412, the affine transformation is applied as many times as necessary to each frame, creating for each frame a corresponding frame in the high-resolution raster. The temporal median filter shown in FIG. 11 is applied to all frames at 414, and at 416 the final composite is formed by adding, at each pixel location of the high-resolution raster, the values at that pixel from the warped frames 201w, 202w, etc. as filtered by the temporal median filter.
[0089]
A preferred embodiment of the apparatus of the present invention is shown schematically in FIG. 14. An input device 500, such as a video camera, is directed at the scene 502 and takes in light reflected from or transmitted through the scene. The light is converted into an electrical signal by the input device itself or by a standard converter 504. From the converter 504 or the input device 500, the data pass to a memory device 506 or to a data processing unit 508. The memory device 506 can record the data by field and in any other form into which the data are converted. The data processing unit is typically a suitably programmed general-purpose digital computer. The operator issues commands to the data processing unit 508 through an input device 510, such as a computer keyboard. These commands instruct the computer to carry out the steps of the method described above: for example, deinterlacing the fields, identifying two or more moving subjects by creating difference sequences, calculating the affine transformation coefficients, warping all frames to the desired data space, and combining the data from the warped frames into a composite picture by means of the weighted temporal median filter. The data produced at each step can be recorded in the memory device 506 and sent to an output device 512, such as a printer, a video display or another suitable output device. The data can also be transmitted to a remote location for further manipulation, storage or display.
[0090]
In a shot having a relatively low resolution and a wide angle, many portions are unclear. On the other hand, the central part of the composite picture created according to the method of the present invention is clear and in-focus and shows all the details.
[0091]
The above description is intended to illustrate the present invention, not to limit it. In addition to video, any recording technique that produces a sequence of still images can be used. If the recording technique does not generate pixel values, the data generated by the recording device can be converted to pixels, or to an equivalent data space, by methods well known in the art. Besides the techniques presented here, various techniques for separating camera motion from zoom motion or from motion within the scene are applicable. Furthermore, it is not necessary to use the Gaussian pyramid steps to calculate the affine transformation coefficients; the calculation can also be done in other ways, for example directly on the full high-resolution frames.
[0092]
The method of the present invention can be used to create a single panoramic still image from a series of pan and jib shots, in addition to a single still image from a zoom. In that case, all frames are warped to a data space that covers the whole panoramic scene. The result is not a stack of pictures of different focal lengths lying on top of one another, but a series of pictures with overlapping edges. In the embodiment applied to a zoom sequence, the main element of warping is to enlarge the data from each frame so that the images of the scene are aligned with one another. Warping the data so that the images of the scene are aligned with one another is itself an important aspect of the zoom application. With this feature, for example, motion due to camera motion or subject motion can be removed.
[0093]
In panoramic applications, this enlargement feature is unimportant and in most cases is not used. Alignment, however, is very important: if the overall viewing area of the panoramic scene is represented as one continuous data space, each frame occupies a limited portion of that area. Unlike the zoom application, each frame in the panorama application is created at the same focal length. The method of the present invention is needed to align the data from the frames in the global data space, so that the image features in each frame coincide with the same features in other frames. The method is applied mainly at the seams between shots. If the pan speed is slow compared to the frame rate, the overlap between frames at the seams is very large.
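As a simplified illustration of placing frames in a global data space, the sketch below assumes a pure horizontal pan, so that only the translation term ax between successive frames matters. The function name and the example offsets are hypothetical; real sequences would use the full affine alignment.

```python
def panorama_layout(frame_width, pan_offsets):
    """Left edge of each frame in the global data space, and its total width.

    pan_offsets: horizontal translation, in pixels, between each frame and the
    next (assumed to be the ax term of the alignment; positive pans right).
    """
    positions = [0]
    for offset in pan_offsets:
        positions.append(positions[-1] + offset)
    # Re-base so that the leftmost frame starts at column 0.
    left = min(positions)
    positions = [p - left for p in positions]
    return positions, max(positions) + frame_width
```

For 640-pixel-wide frames, small offsets relative to the frame width mean that adjacent frames overlap over most of their width, which is the slow-pan case mentioned above.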
[0094]
It is within the spirit of the present invention to combine zoom processing with panorama processing to obtain a clearer image at a specific portion of the panoramic scene.
It is also possible to use the technique of the present invention to combine frames from discontinuous segments of the subject and video.
[0095]
Although the present invention has been described in terms of data obtained by a video camera, those of ordinary skill in the art will understand that the method can also be applied to data representing digital images however obtained. For example, a series of still pictures taken at different focal lengths can be combined in the manner described above to form an image that enhances a particular portion of the image. Similarly, a set of still photographs depicting various positions within a panoramic space can be combined according to the techniques of the present invention to create a single panoramic image. In that image the various image parts are recovered, including elements of the panoramic scene that are barely shown in any one of the discrete still images, which share a common focal length but have different viewing zones.
[0096]
The present invention should be considered in light of the description in the specification, including all embodiments defined by the claims, and should be considered to include their equivalents within a reasonable range.
[0097]
[Effects of the invention]
As described above in detail, the present invention can provide a method and apparatus for creating a still image having a relatively high resolution with the following advantages.
1) There is no need to acquire information at high resolution over the entire image.
2) There is no need to collect information on the majority of images that are less important.
3) A sequence of standard video images of various focal lengths or viewing zones can be acquired as input elements.
4) A standard film image sequence can be acquired as an input element.
5) The resolution of any desired part of the image can be increased.
6) A properly programmed general purpose digital computer and standard video or movie equipment can be used.
[0098]
In addition, the present invention can provide a method that gives a viewer a panoramic view of a scene without requiring excessive data storage and access capacity, and that allows the viewer to navigate from one position in the scene to another.
Further, the present invention has an excellent effect that the above-mentioned ability can be exhibited regardless of the digitized image data.
[Brief description of the drawings]
FIG. 1 is a diagram schematically showing a relationship between a focal length of an imaging apparatus and a captured scene portion.
FIG. 2 is a diagram showing an outline of a pair of a video field and a video frame.
FIG. 3 illustrates interlacing of exemplary video field pairs that are combined to form a video frame.
FIG. 4 schematically illustrates a sequence of video frames of substantially the same scene, representing a zoomed-in state from a short focal length to a relatively long focal length.
FIG. 5 is a diagram schematically illustrating a portion of a scene in a video frame having the shortest focal length (wide-angle viewing zone), which is supplied in the remaining members of the sequence of frames with gradually increasing focal lengths.
FIG. 6 is a diagram showing an outline of a state in which each video image of the sequence shown in FIG. 4 (shown on the left side of FIG. 6) is mapped, or warped, to a data space of the same size, that size being the size of the lowest-resolution frame after enlargement.
FIG. 7 is a diagram showing an outline of a state in which one frame originally recorded at a relatively short focal length is warped to a data space related to continuous expansion of the scene.
FIG. 8 is a diagram showing an outline of a method for identifying both rough motion and delicate motion between a plurality of frames in a sequence.
FIG. 9 is a diagram showing an outline of a method for specifying motions of two moving subjects in a sequence of frames.
FIG. 10 is a diagram showing an outline of each frame of a sequence after being warped in the same data space, and shows a state in which the frames are aligned so as to be reconstructed in a final visualization, and a common point of each frame A vector through is shown.
FIG. 11 is a graph showing the relationship between the weighting factor used to construct the final image and the original focal length of the warped frame to which the weighting factor is applied.
FIG. 12 is a diagram showing an outline of a final reconstructed image and its construction elements.
FIG. 13 is a flow chart illustrating a preferred embodiment of the method of the present invention.
FIG. 14 is a schematic diagram of a preferred embodiment of the apparatus of the present invention.
[Explanation of symbols]
2 Image
4 Focal plane
6 Central part

Claims

1. A method for generating a still image of a scene, comprising the following steps:
a. obtaining a plurality of images created at different focal lengths;
b. generating, for each of the images, a signal representing the image;
c. converting each of the signals so that it represents the corresponding image scaled to a common focal length; and
d. combining the converted signals such that the scaled images are combined into one image.

2. The method of claim 1, wherein the step of converting each signal comprises the following steps:
a. performing at least one affine transformation on the signal representing each image; and
b. generating a signal representing each of the plurality of transformed images.

3. The method of claim 2, wherein the step of performing at least one affine transformation comprises generating a sequence of modified frames of reduced resolution, sampling them, and performing at least one affine transformation on the modified frames.

4. The method of claim 2, wherein the step of performing at least one affine transformation comprises the following steps:
a. ordering the plurality of images into a sequence;
b. determining, for each pair of images in the sequence, a set of affine parameters that substantially defines the transformation from the first image to the second image of the pair;
c. for each of the plurality of images, combining a plurality of the sets of affine parameters into a combined set of affine parameters; and
d. performing an affine transformation on each of the images using the corresponding combined set of affine parameters.

5. The method of claim 1, wherein the combining step comprises applying a temporal median filter to the signal representing each scaled image.

6. The method of claim 5, wherein applying the temporal median filter comprises applying a weighted temporal median filter to a corresponding signal from each scaled image.

7. The method of claim 6, wherein the weighted temporal median filter assigns a larger weight to an image created at a relatively long focal length than to an image created at a relatively short focal length.

8. The method of claim 1, further comprising, prior to the converting step, identifying changes between pairs of signals, the changes being caused by motion of the means that created the images represented by the signals and by motion of elements in the scene, and not by a difference in focal length between the images.

9. The method of claim 1, wherein the step of obtaining a plurality of images comprises recording a plurality of video images.

10. The method of claim 1, wherein the converting step further comprises converting each signal so that the signals represent respective images aligned with one another.

11. The method of claim 1, further comprising, prior to the combining step, the step of identifying relative motion between pairs of images that is due to factors other than the two images having been created at different focal lengths.

12. The method of claim 11, wherein the step of identifying relative motion comprises the following steps:
a. estimating a first relative motion of a first pattern portion of both images of a pair;
b. using the estimated first relative motion to determine a second relative motion of a second pattern portion of both images; and
c. repeating the following steps until a satisfactory resolution of the relative motions is obtained:
i. using the second relative motion to identify the first relative motion of the first pattern portion more accurately;
ii. using the more accurate first relative motion to identify the second relative motion of the second pattern portion more accurately.

13. An apparatus for generating a still image of a scene, comprising the following means:
a. image creation means for creating a plurality of images, each of the plurality of images being created at a focal length different from the others;
b. means for generating a signal representative of each image;
c. means for converting each signal so that it represents the corresponding image scaled to a common focal length; and
d. means for combining the converted signals to produce a final signal representing the scaled images combined into a single image at the common focal length.

14. The apparatus of claim 13, wherein the means for creating a plurality of images comprises a video recording device.