JP2004220312A - Multi-viewpoint camera system

Info

Publication number: JP2004220312A
Application number: JP2003006773A
Authority: JP
Inventors: Hideo Saito (斎藤英雄); Daisuke Iso (磯大輔)
Original assignee: Japan Science and Technology Agency
Current assignee: Japan Science and Technology Agency
Priority date: 2003-01-15
Filing date: 2003-01-15
Publication date: 2004-08-05
Anticipated expiration: 2023-01-15
Also published as: JP4354708B2

Abstract

PROBLEM TO BE SOLVED: To provide a multi-viewpoint camera system in which the reconstructed three-dimensional shape is free of distortion.

SOLUTION: Separately prepared base cameras 431 and 432 are arranged in addition to the multi-viewpoint cameras 411 etc. The base cameras are used with their focal length set to nearly infinity, are positioned so that their lines of sight are nearly orthogonal, and are placed far away so as to capture as much of the object space as possible. A projective grid space is defined by the base cameras 431 and 432. The three-dimensional shape of an object is reconstructed (restored) in the projective grid space by using the epipolar geometry between the multi-viewpoint cameras 411 etc. and the base cameras 431 and 432. Distortion of the reconstructed three-dimensional shape is thereby prevented.

COPYRIGHT: (C)2004, JPO & NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、多視点カメラシステムにおいて、多視点カメラに撮影された対象物の３次元形状を再構成（復元）することに関する。
【０００２】
【従来の技術】
多視点カメラシステムとは、多視点カメラシステムを構成している各カメラ（以降「多視点カメラ」という）を用いて同一の対象物を多視点から撮影し、それらの撮影した画像から対象物の３次元形状を再構成（復元）するシステムである。例えば、仮想視点から見た画像（任意視点画像）を生成するために、このような多視点カメラシステムが利用されている。
従来、このような多視点カメラシステムにおいては、各カメラの内部パラメータ（焦点距離、投影中心座標など）および外部パラメータ（カメラの３次元位置、向き（回転３自由度）など）を推定する手続である、カメラキャリブレーション（校正）を行なう必要があった（例えば非特許文献１参照）。
このカメラキャリブレーションは、対象空間中に存在する点の３次元座標と、その点が画像に投影される２次元座標との関係（以降「３次元−２次元マップ」という）を６組以上あらかじめ測定しておくことにより行なわれる。実際には、例えば、各カメラについて同時に撮影されたマーカ点から、各カメラのキャリブレーションを行なう。
【０００３】
しかしながら、カメラキャリブレーションによる手法の場合、各カメラ独立にカメラパラメータの推定を行なっているために、各カメラ間で推定されたカメラパラメータ間に、推定誤差による矛盾が発生してしまう。具体的には、各カメラで独立に推定されたカメラパラメータから計算したカメラ間のエピポーラ幾何と、各カメラに共通に見えるマーカ点から検出したカメラ間の対応関係から推定したエピポーラ幾何との間に微妙な誤差を発生してしまう。この誤差は、特に離れたカメラ間で大きい（例えば非特許文献２参照）。
また、このような多視点カメラシステムにおいては、基本的には複数のカメラ間で共通に見える点を検出して、三角測量の原理により３次元形状を再構成するものであるため、カメラ間の相対的幾何学関係を出来るだけ正確に扱うことが重要となる。一方で、多数のカメラに対して同時に３次元位置の既知なマーカ点を撮影させてキャリブレーションするのは手間がかかる。しかも、屋外のように広い撮影空間を対象とする場合に、マーカ点を広範囲にわたって正確に設置することは非常に困難である。
【０００４】
上記の問題点を解決するために、発明者らは、カメラキャリブレーションを行なわずに対象物の３次元形状を再構成できる多視点カメラシステムを提案してきた（非特許文献２および非特許文献３参照）。
この手法では、まず、多視点カメラのうち、基準となるカメラ（以降「基底カメラ」という）を任意に２台選択する。そして、これらの２台の基底カメラから他のカメラに投影されるエピポーラ線を用いて、射影グリッド空間（Ｐｒｏｊｅｃｔｉｖｅ Ｇｒｉｄ Ｓｐａｃｅ）と呼ばれる３次元空間を構築する。そして、この射影グリッド空間内で対象物の３次元形状を再構成する。
【０００５】
以下に、発明者らが従来提案してきた射影グリッド空間による手法を詳しく説明する。
上述したように、多視点カメラシステムにおいて対象物の３次元形状を再構成するためには、３次元形状を再構成しようとする３次元空間の各点（３次元座標）と、多視点カメラの各々の画像上に投影される点（２次元座標）との関係（３次元−２次元マップ）が必要となる。
従来の手法における多視点カメラシステムを、図１（ａ）に示す。図１（ａ）において、１１１〜１１６は多視点カメラシステムを構成している各カメラ（多視点カメラ）である。なお、多視点カメラの数は図に示されている数に限られない。従来の手法では、カメラの配置とは無関係にユークリッド空間１２０を定義し、ユークリッド空間１２０上の座標と、各カメラ（１１１〜１１６）の画像上の座標とをカメラごとに関連づけるカメラキャリブレーションが必要であった。
【０００６】
一方、発明者らが従来提案してきた、射影グリッド空間を定義した多視点カメラシステムを図１（ｂ）に示す。図１（ｂ）において、１１１〜１１６は、図１（ａ）と同様に多視点カメラシステムを構成している各カメラである。
射影グリッド空間１３０は次のように定義される。まず、数台あるカメラのうち任意の２台（この例では、カメラ１１１およびカメラ１１２）をそれぞれ基底カメラ１、基底カメラ２とする。これらの基底カメラそれぞれの視点からの中心射影によって３次元空間（射影グリッド空間）１３０を定義する。つまり、空間を定義する３軸として、基底カメラ１の画像のＸ軸およびＹ軸、そして基底カメラ２の画像のＸ軸を用いる。そして、これらの３軸をそれぞれＰ軸、Ｑ軸、Ｒ軸として、射影グリッド空間１３０を定義する。
【０００７】
射影グリッド空間内の点Ａ（ｐ，ｑ，ｒ）と各カメラの画像との関係づけを図２および図３に示す。図２は、射影グリッド空間内の点Ａ（ｐ，ｑ，ｒ）の基底カメラ画像への投影を示した図である。図２において、点２１０は基底カメラ１の視点、点２２０は基底カメラ２の視点である。また、２１２は基底カメラ１から得られる画像（画像１）、２２２は基底カメラ２から得られる画像（画像２）を示している。点ａ_１（ｐ，ｑ）および点ａ_２（ｒ，ｓ）は、それぞれ画像１および画像２上における点Ａの投影点である。
ここで、画像ｈの画像ｋに対する基礎行列をＦ_ｈｋと表すものとする。このとき、点Ａ（ｐ，ｑ，ｒ）は射影グリッド空間の定義より、画像１では点ａ_１（ｐ，ｑ）に投影される。また、画像２の画像１に対する基礎行列Ｆ_２１を用いて点ａ_１（ｐ，ｑ）を画像２に直線ｌとして投影すると、直線ｌは下記の式（１）で表される。
【数１】
\[ l = F_{21} \begin{pmatrix} p \\ q \\ 1 \end{pmatrix} \tag{1} \]
射影グリッド空間の定義より、点Ａ（ｐ，ｑ，ｒ）の画像２における投影点ａ_２（ｒ，ｓ）のＸ座標はｒであるから、点ａ_２（ｒ，ｓ）は、直線ｌ上の、Ｘ座標がｒである点として定めることができる。
【０００８】
図３は、点Ａ（ｐ，ｑ，ｒ）の基底カメラ以外のカメラ（カメラｉ）の画像への投影を示した図である。図３において、基底カメラ１における点２１０、２１２で示されている画像１、および点ａ_１（ｐ，ｑ）、基底カメラ２における点２２０、２２２で示されている画像２、および点ａ_２（ｒ，ｓ）については、図２と同様である。点２３０は基底カメラ以外のカメラｉの視点であり、２３２で示されている画像はカメラｉから得られる画像ｉであり、点ａ_ｉは画像ｉ上における点Ａの投影点である。また、Ｆ_ｉ１は画像ｉの画像１に対する基礎行列、Ｆ_ｉ２は画像ｉの画像２に対する基礎行列である。
【０００９】
点ａ_ｉを決定するには、まずＦ_ｉ１を用いて点ａ_１を画像ｉに直線ｌ_１として投影する。さらにＦ_ｉ２を用いて点ａ_２を画像ｉに直線ｌ_２として投影する。そして、これらの２直線ｌ_１とｌ_２の交点が、画像ｉ上における点Ａの投影点ａ_ｉとなる。なお、直線ｌ_１およびｌ_２は下記の式（２）および（３）で表される。
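【数２】
\[ l_1 = F_{i1} \begin{pmatrix} p \\ q \\ 1 \end{pmatrix} \tag{2} \]
【数３】
\[ l_2 = F_{i2} \begin{pmatrix} r \\ s \\ 1 \end{pmatrix} \tag{3} \]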
【００１０】
このように、従来提案してきた射影グリッド空間による手法では、まず、多視点カメラの中から２つの基底カメラを選び、これらの基底カメラ間での基礎行列により、３次元のグリッド位置を定義する。そして、このグリッド位置と、基底カメラ以外の多視点カメラの画像位置との関係は、基底カメラと多視点カメラとの基礎行列により記述される。
また、３次元形状の再構成は、例えばシルエット法など（非特許文献４参照）を用いて行なう。射影グリッド空間による手法においてシルエット法を用いる場合、基底カメラと多視点カメラとの基礎行列から、射影グリッド空間の任意の点が各カメラに投影される２次元位置を求め、この位置がシルエットの内部か外部かの判定を行なうことによって、対象物の３次元形状を再構成する。
３次元形状の再構成には、全カメラ数をＮとすると、２つの基底カメラ間の基礎行列と、各基底カメラとそれ以外の多視点カメラ間の基礎行列との、合計１＋（Ｎ−２）×２組の基礎行列が必要となる。
【００１１】
上述したように、発明者らが提案してきた射影グリッド空間による手法では、射影グリッド空間と画像上の点との関係を、カメラ間のエピポーラ幾何を表す基礎行列のみを用いて記述することができる。このため、多視点カメラシステムでカメラキャリブレーション（校正）を行なわなくても、対象物の３次元形状を復元することが可能である。
【００１２】
【非特許文献１】
R. Tsai: "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses", IEEE Journal of Robotics and Automation, RA-3, 4, pp. 323-344, 1987
【非特許文献２】
斎藤英雄，金出武雄：「多数のカメラによるダイナミックイベントの仮想化」，情報処理学会研究報告「コンピュータビジョンとイメージメディア」Ｎｏ．１１９−０１６，１９９９年１１月
【非特許文献３】
斎藤英雄，木村誠，矢口悟志，稲本奈穂；「多視点映像による現実シーンの仮想化−カメラ間の射影的関係の利用による中間視点映像生成−」，情報処理学会研究報告「コンピュータビジョンとイメージメディア」Ｎｏ．１３１−００８，２００２年１月
【非特許文献４】
矢口悟志，木村誠，斎藤英雄，金出武雄：「未校正多視点カメラシステムを用いた任意視点画像生成」，情報処理学会コンピュータビジョンとイメージメディア研究会論文誌，Ｖｏｌ．４２，Ｎｏ．ＳＩＧ６（ＣＶＩＭ２），ｐｐ．９−２１，２００１年６月
【００１３】
【発明が解決しようとする課題】
しかしながら、発明者らが従来提案してきた上述の手法では、対象物を撮影するためのカメラから２台のカメラを基底カメラとして選択するため、基底カメラによって構成される３次元空間（射影グリッド空間）の各軸は透視投影の影響で直交せず、歪んだ空間となってしまう場合があった。この場合、射影グリッド空間において再構成された３次元形状にも歪みが生じてしまう。
本発明の目的は、再構成された３次元形状に歪みが生じない多視点カメラシステムを提供することである。
【００１４】
【課題を解決するための手段】
上記の課題を解決するために、本発明は、対象物に対する複数の多視点カメラと、直交するように設置した２台の基底カメラから画像を得る複数画像取得手段と、前記２台の基底カメラと前記複数の多視点カメラとを関連付ける関連付け手段と、前記複数の多視点カメラから得た画像から、前記関連付けを利用して前記２台の基底カメラで構成される３次元の射影グリッド空間で対象物の３次元形状を復元する復元処理手段とを備えることを特徴とする多視点カメラシステムである。
前記複数画像取得手段は、前記多視点カメラから得た画像から、対象物のシルエット画像を生成し、対象物の色彩画像とともに出力することを特徴としてもよい。
また、前記復元処理手段は、前記シルエット画像から、８分木モデル生成手法を用いて３次元モデルを作成することを特徴としてもよい。
また、前記復元処理手段は、８分木モデル生成手法を用いて生成された３次元モデルのボクセルのうち、内部ボクセルを除去し、対象物の色彩画像により表面ボクセルに着色することで対象物を任意視点で観察した画像を生成することを特徴としてもよい。
また、前記複数の多視点カメラは、視差画像を得ることができるステレオカメラであり、前記複数画像取得手段は、前記多視点カメラから得た視差画像および色彩画像から、対象物のシルエット画像を生成し、対象物の色彩画像とともに出力することを特徴としてもよい。
【００１５】
【発明の実施の形態】
以降、図を参照しながら、本発明の多視点カメラシステムの実施形態の一例を説明する。
まず、図４および図５を参照しながら、本発明において３次元空間（射影グリッド空間）を定義する手法について説明する。
次に、図６および図７を参照しながら、本発明の実施形態の構成および処理の流れを説明する。
次に、図８〜図１５を参照しながら、本実施形態において定義された射影グリッド空間で対象物の３次元形状を復元する手法について説明する。
最後に、図１６〜図２０を参照しながら、本実施形態を用いて行なった実験とその結果を示す。
【００１６】
＜射影グリッド空間の定義＞
まず、図４および図５を参照しながら、本発明において３次元空間（射影グリッド空間）を定義する手法について説明する。
上述したように、発明者らが従来提案してきた射影グリッド空間による手法では、多視点カメラのうち任意の２台を基底カメラとして選択し、これら２台の基底カメラを用いて３次元座標系を構成していた。
本発明では、従来手法の問題点であった歪みを解消するために、２台の基底カメラを多視点カメラとは別に用意して３次元座標系を構成する。図４は本発明におけるカメラ配置の例を示したものである。
【００１７】
図４において、ステレオカメラ４１１〜４１４は、多視点カメラである。本発明では、これらの多視点カメラ４１１等に対して、別途用意した基底カメラ４３１および４３２を配置する。
基底カメラ４３１および４３２は、焦点距離をほぼ無限遠にして使用する。また、図４に示しているように、これら２つの基底カメラの視線（ｒ軸とｐ軸）がほぼ直交するように配置し、対象空間をできるだけ大きく捕えることができるよう、遠くに配置する。これにより、これら２つの基底カメラの視線で構成される座標軸の歪みを軽減することができる。
なお、本発明において基底カメラ４３１および４３２は、単に３次元座標系を構成するためにのみ用いられる。このため、動きのある対象物を撮影するような場合であっても、基底カメラ４３１および４３２には、動画像の撮影には向かないが空間解像度に優れているカメラを用いることが可能である。その結果、対象空間の解像度を向上することもできる。
【００１８】
図５に、本発明において定義される射影グリッド空間を示す。この射影グリッド空間において、基底カメラと多視点カメラとを基礎行列によって関連付ける手法は、発明者らが従来提案してきた上述の手法とほぼ同様である。
２台の基底カメラをそれぞれ基底カメラ１、基底カメラ２とする。図５において、点５１０および点５２０はそれぞれ基底カメラ１および基底カメラ２の視点である。ここで、射影グリッド空間上の点Ｐ（ｐ，ｑ，ｒ）は、基底カメラ１の画像５１２上の点ｐ_１（ｐ，ｑ）と基底カメラ２の画像５２２上の点ｐ_２（ｒ，ｙ）の２点によって定義される。ここで、点ｐ_２（ｒ，ｙ）は、基底カメラ１の視点５１０から点ｐ_１を通る基底カメラ１の視線を、基底カメラ２の画像５２２上に投影したエピポーラ線上の点である。このエピポーラ線は、上述の式（１）で表される。点ｐ_２の座標ｙは、このエピポーラ線の一次方程式で求められる。
【００１９】
次に、射影グリッド空間と各カメラで撮影される画像とを関連付ける３次元−２次元マップを算出する。画像５３２上の直線ｌ_１は、基底カメラ１の画像５１２上の点ｐ_１を通る基底カメラ１の視線を、カメラｉの画像５３２上に投影したエピポーラ線である。エピポーラ線ｌ_１は、上述の式（２）で表される。同様に、直線ｌ_２は基底カメラ２の画像上の点ｐ_２を通る基底カメラ２の視線を、カメラｉの画像上に投影したエピポーラ線である。エピポーラ線ｌ_２は、上述の式（３）で表される。
このようにして求められたエピポーラ線ｌ_１とｌ_２の交点が、射影グリッド空間上の点Ｐ（ｐ，ｑ，ｒ）に対応するカメラｉの画像５３２上の点ｐ_ｉ（ｕ_ｉ，ｖ_ｉ）である。
同様にして、射影グリッド空間と全ての多視点カメラの画像との３次元−２次元マップを算出する。この３次元−２次元マップは、ルックアップ表として保存される。このルックアップ表は、後述の対象物の３次元形状を再構成する処理において、射影グリッド空間上の３次元座標を多視点カメラの画像上の２次元座標に変換する際に参照される。
【００２０】
従来は、この３次元−２次元マップを算出するために、少なくとも対象空間における６つの３次元座標を定義し、それらの座標を各カメラの画像へ投影して、カメラパラメータを推定する（カメラキャリブレーションを行なう）必要があった。しかしながら、特に対象空間が広大である場合には、空間内の３次元座標を測定してカメラキャリブレーションを行なうのは非常に困難である。本発明の手法によれば、エピポーラ線を算出するのみでよいため、各カメラの画像上で２次元の座標を測定すれば足りる。このため、対象空間が広大である場合であっても、測定に費やす労力は増大しない。さらに、本実施形態の射影グリッド空間は歪みがほとんどない。なぜなら、この空間の座標軸は２台の基底カメラの視線により構成されているが、この視線は直交していることが前提だからである。
【００２１】
＜本実施形態の構成および処理の流れ＞
次に、図６および図７を参照しながら、本発明の多視点カメラシステムの実施形態の構成および処理の流れを説明する。
図６は、本実施形態の多視点カメラシステムのシステム構成の例を示す図である。また、図７は本実施形態の多視点カメラシステムで任意視点画像を生成する処理の流れを示したフローチャートである。本実施形態では４台の多視点カメラからなる多視点カメラシステムを例として説明するが、カメラの台数はこれに限られない。また、各々の処理の内容は後で詳しく説明する。
【００２２】
図６において、４台のステレオカメラ（６１１〜６１４）、４台のＰＣ（６２１〜６２４）、１台のホストＰＣ６３０、モニタ６４０がＬＡＮで接続されている。本実施形態ではこのほかに、３次元座標系を構成するために２台のカメラを基底カメラとして使用する。
４台のステレオカメラ６１１等には、色彩画像とともに視差画像を撮ることのできるタイプのものを用いる。ステレオカメラ６１１等は、それぞれ画像キャプチャ用の４台のＰＣ６２１等に接続されている。画像キャプチャ用のＰＣ６２１等と接続しているホストＰＣ６３０は、任意視点画像を生成する処理などを行なうホストＰＣである。
まず、ステレオカメラ６１１等により、対象物が撮影される（図７のＳ７１０）。ＰＣ６２１等は、ステレオカメラ６１１等が撮影した画像からシルエット（輪郭）を抽出し、シルエット画像を生成する（Ｓ７２０）。この処理は画像キャプチャ用のＰＣ６２１等でローカルに行なわれる。
【００２３】
本実施形態では、シルエットを抽出するために色彩画像と視差画像の両方を用いる。その結果、画像の背景を効果的に除去することができる。一方、撮影した画像の視差値は３次元形状の再構成に使用するには粗い値であるため、３次元形状の再構成の際には視差画像は用いない。
色彩画像および上述のＳ７２０で生成されたシルエット画像は、画像キャプチャ用のＰＣ６２１等においてＪＰＥＧ圧縮され、ホストＰＣ６３０に送信される。これらの画像は３次元形状の再構成の処理で用いられる。
次に、ホストＰＣ６３０で３次元形状の再構成（復元）を行なう。３次元形状の再構成においては、まず、８分木モデルの生成を行なう（Ｓ７３０）。８分木モデルの生成には、シルエット法による形状生成手法とともに、８分木生成アルゴリズム（ｏｃｔｒｅｅ ｇｅｎｅｒａｔｉｏｎ ａｌｇｏｒｉｔｈｍ）の手法を用いる。この手法によれば、処理時間を短縮することができる。次に、８分木モデルからボクセルモデルを生成する（Ｓ７４０）。次に、内部ボクセルの除去を行なう（Ｓ７５０）。内部ボクセルの除去を行なうのは、最終的な任意視点画像の表示には表面ボクセルのみで足りるからである。最後に表面ボクセルに彩色（Ｓ７６０）すれば、任意視点画像が完成する。生成された任意視点画像は、例えばモニタ６４０に出力される。
【００２４】
＜３次元形状の復元＞
次に、図８〜図１５を参照しながら、多視点カメラに撮影された対象物の画像を用いて、基底カメラにより構成された射影グリッド空間で対象物の３次元形状を再構成し、任意視点画像を生成する手法について説明する。
【００２５】
（シルエット画像の生成）
対象物の３次元形状を再構成する前に、まず、本実施形態の各ステレオカメラで撮影された画像から背景を除去してシルエット画像を生成する（図７のＳ７２０に示す処理）。
上述したように、本実施形態では、シルエット画像を生成する際に、各ステレオカメラで撮影された色彩画像だけでなく、視差画像も使用する。色彩画像のみでは、鏡面反射や影が、生成されるシルエット画像に影響を及ぼすからである。一方、視差画像のみを用いた場合ではこれらの影響はほとんどないが、粗いシルエットしか得ることができない。このため、本実施形態では色彩画像と視差画像の両方の情報を用いて背景除去を行なう。
【００２６】
背景画像の画素（ｘ，ｙ）について色彩（ｃｏｌｏｒ）をｃ_ｂ（ｘ，ｙ），視差値（ｄｉｓｐａｒｉｔｙ ｖａｌｕｅ）をｄ_ｂ（ｘ，ｙ）で表し、対象物を含む画像の画素（ｘ，ｙ）の色彩をｃ_ｃ（ｘ，ｙ），視差値をｄ_ｃ（ｘ，ｙ）で表すとすると、本実施形態では下記の擬似コードに示す処理でシルエットを生成する。

ここで、ｐ（ｘ，ｙ）はシルエット画像の画素の状態を表す。ｔｈ_Ｄは視差（ｄｉｓｐａｒｉｔｙ）の閾値で、対象物を含む画像と背景画像の視差の差分値がｔｈ_Ｄ以下なら背景（ＮＯＴ＿ＳＩＬＨＯＵＥＴＴＥ）と判定し、それ以外なら色彩（ｃｏｌｏｒ）を使った判定を行う。色彩の判定に用いる閾値はｔｈ_Ｕ，ｔｈ_Ｌの２種類あり、対象物を含む画像と背景画像の色彩（ｃｏｌｏｒ）の差分値が、前者のｔｈ_Ｕより大きければシルエット内部（ＳＩＬＨＯＵＥＴＴＥ）と判定し、後者のｔｈ_Ｌより小さければ背景（ＮＯＴ＿ＳＩＬＨＯＵＥＴＴＥ）と判定する。それ以外の場合、つまり色彩（ｃｏｌｏｒ）の差分値が、ｔｈ_Ｕ以下ｔｈ_Ｌ以上になる場合は、色差θを使った判定を行う。この判定では、色差がｔｈ_Ａ以上の場合はシルエット内部（ＳＩＬＨＯＵＥＴＴＥ）と判定し、それ以外の場合は背景（ＮＯＴ＿ＳＩＬＨＯＵＥＴＴＥ）と判定する。
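上記の判定手順は、例えば次のPythonコードのように書き下すことができる。これは原文の擬似コードそのものではなく、本文の説明から起こした一つのスケッチである。色彩の差分値をRGBベクトルのユークリッド距離、色差θをRGBベクトル間の角度（ラジアン）とした点、および関数名 classify_pixel と引数の並びは、いずれも説明のための仮定である。

```python
import math

SILHOUETTE = "SILHOUETTE"
NOT_SILHOUETTE = "NOT_SILHOUETTE"


def color_distance(c1, c2):
    """色彩の差分値（仮定：RGBベクトルのユークリッド距離）。"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def color_angle(c1, c2):
    """色差θ（仮定：RGBベクトル間の角度、ラジアン）。零ベクトルの場合は0とする。"""
    n1 = math.sqrt(sum(a * a for a in c1))
    n2 = math.sqrt(sum(b * b for b in c2))
    if n1 == 0 or n2 == 0:
        return 0.0
    cos_theta = sum(a * b for a, b in zip(c1, c2)) / (n1 * n2)
    # 丸め誤差で定義域を外れないように [-1, 1] に収める
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def classify_pixel(c_c, c_b, d_c, d_b, th_D, th_U, th_L, th_A):
    """画素1点の状態 p(x, y) を返す。"""
    # 1. 視差の差分値が th_D 以下なら背景
    if abs(d_c - d_b) <= th_D:
        return NOT_SILHOUETTE
    # 2. 色彩の差分値による判定
    diff = color_distance(c_c, c_b)
    if diff > th_U:
        return SILHOUETTE
    if diff < th_L:
        return NOT_SILHOUETTE
    # 3. th_L 以上 th_U 以下の場合は色差θで判定
    if color_angle(c_c, c_b) >= th_A:
        return SILHOUETTE
    return NOT_SILHOUETTE
```

閾値の具体的な値は原文に示されていないため、実際の撮影条件に合わせて調整する必要がある。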
【００２７】
上述の手法を用いてシルエット画像を生成する様子を、図８〜図１０に示す。図８および図９は、ステレオカメラにより撮影された色彩画像および視差画像である。図８（ａ）はシルエット画像生成の対象物（人物）を含む色彩画像、図８（ｂ）はシルエット画像生成の対象物を含まない（背景のみの）色彩画像である。また、図９（ａ）はシルエット画像生成の対象物（人物）を含む視差画像、図９（ｂ）はシルエット画像生成の対象物を含まない（背景のみの）視差画像である。
図１０（ａ）は、図９の視差画像を用いず、図８の色彩画像のみを用いて生成されたシルエット画像である。一方、図１０（ｂ）は図８の色彩画像および図９の視差画像を用いて（すなわち本実施形態における手法を用いて）生成されたシルエット画像である。
図１０の（ａ）と（ｂ）を比較すると、図１０（ａ）では、図８（ａ）に写り込んでいる人物の影が、シルエット画像の人物の足周りに影響している（人物の影も、シルエットとして抽出されてしまっている）のがわかる。一方、視差画像は影に影響されないため、図１０（ｂ）のシルエット画像には影の影響はなく、人物のみがシルエットとして抽出されている。
【００２８】
（シルエット画像からの３次元形状の生成）
次に、３次元形状の再構成を行なうためにシルエット画像からの３次元形状の生成を行なう。本実施形態では、射影グリッド空間と各カメラで撮影された画像との関連付けにより、８分木データ構造を用いてシルエットから３次元形状を生成する（図７のＳ７３０に示す処理）。なお、８分木モデルの生成については後述する。
従来手法であるユークリッド空間を用いた場合、シルエット画像の透視投影により円錐形モデルが多数生成される。そして、結果として生成される３次元形状は全ての円錐形モデルの共通部分である。すなわち、下記の式（４）によりシルエット画像から３次元形状が生成される。
【数４】
\[ V = \bigcap_{i \in I} V_i \tag{4} \]
ここで、Ｉは全てのシルエット画像の組であり、ｉはその組の中にあるひとつのシルエット画像である。Ｖ_ｉはｉ個目のシルエット画像から生成される形状モデルである。
【００２９】
一般的に、ユークリッド空間においては、各ボクセルを全てのシルエット画像に投影させることにより、そのボクセルがシルエットの内部にあるか外部にあるかを判断する。そして、ひとつの画像においてボクセルがシルエットの外部にあれば、そのボクセルは対象物の一部ではない。一方、ボクセルが全ての画像においてシルエットの内部にあれば、そのボクセルは対象物の一部であると判断される。
このシルエット法を、本実施形態の射影グリッド空間を用いた場合にあてはめると、透視投影による変換の代わりに、あらかじめ用意しておいた上述のルックアップ表を用いて、射影グリッド空間における３次元座標を画像上の２次元座標に変換する。この変換は、上述のように、基礎行列によってあらかじめ射影グリッド空間と各カメラ画像との関連付けを行なっているために可能である。
【００３０】
（８分木モデルの生成）
上述した８分木データ構造は、対象とする３次元空間全体（ユニバーサル・スペースと呼ぶ）を再帰的に８分割（縦、横、奥行き方向にそれぞれ２分割）していくことにより生成される８分木モデルである。
８分割された空間の中のひとつの領域（オクタント）を構成するボクセルのタイプがすべて同じになった場合には、そのオクタントはそれ以上分割しない。それ以外の場合には、そのオクタントはさらに８つの立方体に分割され、場合によっては単一のボクセルにまで分割されることになる。
【００３１】
図１１は、対象空間から８分木モデルを生成する様子を示したものである。
図１１（ａ）は、図１１（ｂ）に示す対象空間１１５０から生成された８分木である。この８分木の各ノード（１１００，１１１０，１１２０，１１３０等）は、図１１（ｂ）で再帰的に分割された各空間に対応している。レベル０に示されているノード（１１００）は空間全体に対応している。レベル１に示されている８個のノード（１１１０等）はそれぞれ１回目の分割による８個の空間に対応している。同様に、レベル２は２回目の分割による空間、レベル３は３回目の分割による空間に対応している。
また、ノードの色は、その空間が対象物であるかを示している。空間に対象物を含まない（背景のみである）場合には対応するノードを黒色で表す。また、空間全体が対象物である場合には対応するノードを白色で表す。空間の一部に対象物を含む場合には対応するノードを灰色で表す。そして、空間の一部に対象物を含む場合は、その空間をさらに８分割する。なお、その空間が対象物であるかを判断する手法は後述で説明する。
【００３２】
上述したノードと空間との対応により図１１（ａ）および（ｂ）を参照すると、空間全体（１１５０）はその一部に対象物を含むため、ノード１１００は灰色で表される。この場合、空間１１５０は８分割（縦、横、奥行き方向にそれぞれ２分割）される。
８分割された空間のうち、図１１（ｂ）において対象物１１６２を含むオクタントおよび対象物１１６４を含むオクタントは、対象物のみのオクタントであるため、それ以上分割されない。図１１（ａ）においてこれらのオクタントに対応しているノードは白色で表される。
一方、対象物１１７２を含むオクタントおよび対象物１１７４，１１８２を含むオクタントは、一部に対象物を含むオクタントであるため、さらに８分割される。図１１（ａ）においてこれらのオクタントに対応しているノードは灰色で表される。
それ以外のオクタントは対象物を含まない（背景のみである）ため、それ以上分割されない。図１１（ａ）においてこれらのオクタントに対応しているノードは黒色で表される。
こうして、その一部に対象物を含む空間がなくなるまで（対応するノードが全て黒色か白色のいずれかになるまで）、空間の分割を再帰的に繰り返す。
【００３３】
空間をオクタントに分割し、全てのシルエット画像に変換するにあたって、本実施形態では下記のような手法を用いる。
まず、オクタントの８頂点を画像平面内座標へ変換し、その立方体の画像内領域を探索することで、対象としている立方体が対象物を表すかどうかを調べていく。このとき、画像内領域は長方形になるとは限らない。そこで本実施形態のシステムでは、計算量削減のため、立方体の画像内領域を囲む最小の長方形を探索領域として、インターセクション・チェックを行なってオクタントの属性（キューブ・タイプ）を調べる。
【００３４】
図１２はインターセクション・チェックを説明した図である。図１２（ａ）〜（ｃ）はシルエット画像であり、黒く示された部分（１２１０）は背景、灰色で示された部分（１２２０）は対象物のシルエットである。また、矩形１２３２，１２３４，および１２３６は、チェック対象の矩形領域である。
図１２（ｂ）に示すように、ある画像におけるチェック対象の矩形領域１２３４がシルエットと背景からなる場合には、その領域に対応する空間（オクタント）のキューブ・タイプは”ＧＲＡＹ”であると仮定される。なお、キューブ・タイプが”ＧＲＡＹ”であるとは、その空間の一部に対象物を含んでいることを意味する。
また、図１２（ａ）に示すように、チェック対象の矩形領域１２３２の全ての画素が背景である場合には、その領域に対応する空間のキューブ・タイプは”ＢＬＡＣＫ”であると仮定される。キューブ・タイプが”ＢＬＡＣＫ”であるとは、その空間全体が背景からなることを意味する。
一方、図１２（ｃ）に示すように、チェック対象の矩形領域１２３６の全ての画素がシルエットである場合には、その領域に対応する空間のキューブ・タイプは”ＷＨＩＴＥ”であると仮定される。キューブ・タイプが”ＷＨＩＴＥ”であるとは、その空間全体が対象物からなることを意味する。
【００３５】
ある画像における矩形領域のインターセクション・チェックの結果、キューブ・タイプが”ＢＬＡＣＫ”であると仮定されると、シルエット法の概念に基づいて、この空間のキューブ・タイプは”ＢＬＡＣＫ”であると確定される。
それ以外の場合には、その空間の立方体に対して、他の全ての画像が参照されるまでインターセクション・チェックを続行する。
全ての画像が参照されると、その空間のキューブ・タイプが確定する。全ての画像において、仮定されたキューブ・タイプが”ＷＨＩＴＥ”である場合には、その空間のキューブ・タイプは”ＷＨＩＴＥ”であると確定される。それ以外の場合には”ＧＲＡＹ”であると確定される。”ＧＲＡＹ”であると確定された場合、その空間に対して仮定されたキューブ・タイプは全て保存され、以降の処理でも参照する。これにより、計算時間を短縮することができる。
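以上のキューブ・タイプの確定手順を、最小限のPythonコードとして示すと次のようになる。関数名 rect_type、decide_cube_type やデータの持ち方（シルエット画像を真偽値の2次元リストで表し、矩形領域を空でない半開区間で与える）は説明用の仮定であり、原文の実装を示すものではない。

```python
def rect_type(silhouette, x0, y0, x1, y1):
    """矩形領域 [x0, x1) × [y0, y1) の仮定キューブ・タイプを返す。"""
    pixels = [silhouette[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if all(pixels):
        return "WHITE"   # 全画素がシルエット
    if not any(pixels):
        return "BLACK"   # 全画素が背景
    return "GRAY"        # シルエットと背景が混在


def decide_cube_type(silhouettes, rects):
    """各画像の仮定キューブ・タイプから、確定タイプと仮定タイプの一覧を返す。"""
    assumed = []
    for silhouette, rect in zip(silhouettes, rects):
        t = rect_type(silhouette, *rect)
        assumed.append(t)
        if t == "BLACK":
            # 1枚でも背景のみなら、残りの画像を調べずに BLACK で確定
            return "BLACK", assumed
    if all(t == "WHITE" for t in assumed):
        return "WHITE", assumed
    # GRAY の場合、assumed を保存しておけば子ノードの処理で再利用できる
    return "GRAY", assumed
```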
【００３６】
図１３（ａ）（ｂ）は、ある空間のキューブ・タイプが”ＧＲＡＹ”であると確定された場合に、上記により保存された仮定キューブ・タイプを後続の処理で参照する様子を示したものである。ここでは、対象モデルを４台のカメラ（カメラ１〜カメラ４）で捕らえた場合（４つのシルエット画像に対してインターセクション・チェックを行なう場合）で説明する。
図１３（ａ）は４つのシルエット画像に対してインターセクション・チェックを行なった結果を、カメラごとにスタックで示している。図中の”Ｗ”は”ＷＨＩＴＥ”を、”Ｇ”は”ＧＲＡＹ”を、”？”は不明（これからチェックされる）であることを示している。図１３（ｂ）は４つのシルエット画像に対してインターセクション・チェックを行なった結果を、空間（キューブ）ごとに８分木で示している。
【００３７】
図１３において、親ノード１３５２の仮定キューブ・タイプが、１３１２に示すように、カメラ１の画像で”ＷＨＩＴＥ”，カメラ２の画像で”ＷＨＩＴＥ”，カメラ３の画像で”ＧＲＡＹ”，カメラ４の画像で”ＧＲＡＹ”であった場合、その親ノード１３５２に対応する空間のキューブ・タイプは”ＧＲＡＹ”であると確定する。この場合、この空間の立方体は８分割され、それぞれが８分木の子ノード（１３５４等）となる。ここで、子ノード１３５４等においても、カメラ１およびカメラ２の画像における仮定キューブ・タイプは”ＷＨＩＴＥ”であることが確定し（１３１４を参照）、カメラ１およびカメラ２の画像についてはインターセクション・チェックを再び行なう必要はない。
このように、親ノードの仮定キューブ・タイプを保存しておき、子ノードの処理の際に参照することで、処理の無駄を省くことができる。
【００３８】
キューブ・タイプが確定すると、キューブ・タイプによって後工程が分かれる。キューブ・タイプが”ＢＬＡＣＫ”または”ＷＨＩＴＥ”で確定した場合は、対応する空間をそれ以上分割する必要はない。すなわち、その時点で８分木の各ノードはそのまま葉節点となる。一方、キューブ・タイプが”ＧＲＡＹ”で確定した場合には、対応する空間はさらに８分割される。すなわち、対応する８分木のノードは８つの子ノードを持つことになる。ある空間に対して上記の処理が終了したら、他の空間についても同様の処理を再帰的に繰り返す。
なお、８分木の手法については、例えば M. Potmesil, "Generating octree models of 3D objects from their silhouettes in a sequence of images", Computer Vision, Graphics, and Image Processing, vol. 40, pp. 1-29, 1987 等を参照されたい。
【００３９】
（内部ボクセルの除去）
８分木構造の生成が終了したら、画像表示のために８分木モデルをボクセルモデルに変換する。８分木のそれぞれのノードについて、２^３ｎ個（ｎは８分木のレベル数である）のボクセルがある。これらのボクセル全てを最終的な表示対象として処理すると、多大な処理時間を要する。この問題を避けるために、本実施形態では、対象物の内部に対応している内部ボクセルを処理対象から除去し、対象物の表面に対応している表面ボクセルのみを着色して表示する。なお、これらの処理は、図７のＳ７４０〜Ｓ７６０に示す処理である。
本実施形態において内部ボクセルを除去する様子を、図１４（ａ）（ｂ）に示す。図１４（ａ）において、１４００で示された対象空間内に１４１０等の立方体で示された対象物がある。この対象物を表現した８分木は、上述の図１１に示したように、最大レベル数は３であった。図１４（ｂ）では、対象物をレベル数３の分割に対応するボクセル（１４３０等）で示している。後続の着色処理においては、これらのボクセルのうち、対象物の表面を構成している表面ボクセルのみを着色の対象とすればよく、表面を構成していない内部ボクセルは除去してよい。
【００４０】
あるボクセル（ボクセルｖ）が内部ボクセルであるか表面ボクセルであるかは、次のようにして判断する。
ボクセルｖの６つの面と隣り合うボクセルのうち１つ以上のボクセルが、対象物の含まれていない空のボクセルである場合、ボクセルｖは表面ボクセルである。それ以外の場合、ボクセルｖは内部ボクセルである。
以降の処理（ボクセルモデルの着色）においては、表面ボクセルのみを処理の対象とする。
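この表面ボクセルの判定は、例えば次のように書ける。対象物のボクセルを整数座標の集合 occupied として持つこと、対象空間の外側を空のボクセルとみなすこと、および関数名 is_surface、surface_voxels は、いずれも説明のための仮定である。

```python
# ボクセルの６つの面に隣接する方向
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def is_surface(voxel, occupied):
    """隣接６ボクセルのうち１つ以上が空であれば表面ボクセルと判定する。"""
    x, y, z = voxel
    return any((x + dx, y + dy, z + dz) not in occupied for dx, dy, dz in NEIGHBORS)


def surface_voxels(occupied):
    """内部ボクセルを除去し、表面ボクセルのみを返す。"""
    return {v for v in occupied if is_surface(v, occupied)}


# 使用例：３×３×３の立方体では、中心の１個だけが内部ボクセルになる
cube = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
surface = surface_voxels(cube)
print(len(surface))  # 26
```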
【００４１】
（ボクセルモデルの着色）
次に、３次元形状の表面ボクセルに着色する。本実施形態では、実際に各カメラで撮影された画像のうち２つの画像を選択し、それらの画像上の画素をもとに、表面ボクセルの色を動的に決定する。どの画像を選択するかは、任意視点の位置により決定される。
表面ボクセルの色を決定する手法を、図１５および下記の式（５）を用いて説明する。
【数５】

【００４２】
図１５および上記の式（５）において、θはカメラｉと任意視点との水平角、φはカメラ（ｉ＋１）と任意視点との水平角である。色彩の混合の重みはこれらの角度によって決定される。対象物１５４０上の点Ｐが、前述で生成されたルックアップ表によりカメラｉの画像１５１０上の点ｐ_１およびカメラ（ｉ＋１）の画像１５２０上の点ｐ_２に変換される。点ｐ_１の色彩（ｃｏｌｏｒ）をｃ（ｐ_１）、点ｐ_２の色彩（ｃｏｌｏｒ）をｃ（ｐ_２）、任意視点の画像１５３０上において点ｐ_１および点ｐ_２と対応する点ｐの色彩（ｃｏｌｏｒ）をｃ（ｐ）とすると、上記の式（５）により、ｃ（ｐ）を算出することができる。
上述により表面ボクセルへの着色を行なえば、任意視点における３次元形状を表示することができる。
なお、ボクセルに対する処理については、例えば G. K. M. Cheung, T. Kanade, J. Y. Bouguet, and M. Holler, "A real time system for robust 3D voxel reconstruction of human motions", CVPR 2000, IEEE Comput. Soc, Los Alamitos, CA, USA, vol. 2, pp. 714-729, 2000 等を参照されたい。
【００４３】
＜本実施形態を用いた実験結果＞
最後に、本実施形態を用いて行なった実験とその結果を図１６〜図２０に示す。
この実験は、以下の条件で行なった。
・ボクセル数：２５６×２５６×２５６（個）
・８分木の最大レベル数：８
・画像の解像度：３２０×２４０ピクセル
・色彩の深度：２４ビット
・視差の深度：８ビット
【００４４】
なお、この実験は図６に示すような４台のステレオカメラ（カメラ１〜カメラ４とする）を用いた多視点カメラシステムを用いて行なった。
図１６はこの実験において各処理に要した時間を示した図である。
図１６において ”Ｃａｍｅｒａ ＰＣ” で示されている４本の時間軸は、多視点カメラと接続している画像キャプチャ用ＰＣの処理を時間軸で示している。また ”Ｈｏｓｔ ＰＣ” で示されている時間軸は、ホストＰＣにおける処理を時間軸で示している。
【００４５】
”Ｃａｐｔｕｒｅ” の処理は、ステレオカメラで対象モデルを撮影して画像キャプチャ用ＰＣに保存する処理であり、この処理は５０ミリ秒で行なわれている。
”Ｄｉｓｐａｒｉｔｙ ｉｍａｇｅ ｇｅｎｅｒａｔｉｏｎ” は視差画像を生成する処理であり、この処理は１００ミリ秒で行なわれている。
”Ｓｉｌｈｏｕｅｔｔｅ ｉｍａｇｅ ｇｅｎｅｒａｔｉｏｎ” はシルエット画像を生成する処理であり、この処理は７０ミリ秒で行なわれている。
”Ｉｍａｇｅ ｔｒａｎｓｆｅｒ” の処理は、色彩画像とシルエット画像をＪＰＥＧ圧縮してホストＰＣに送信する処理である。この処理は１００ミリ秒で行なわれている。
”３Ｄ ｓｈａｐｅ ｒｅｃｏｎｓｔｒｕｃｔｉｏｎ” の処理は、シルエット画像から８分木モデルを生成する処理である。この処理は１２０ミリ秒で行なわれている。なお、本実施形態の８分木生成アルゴリズムを使用しないでこの処理を実行した場合には５２０ミリ秒を要した。本実施形態の８分木生成アルゴリズムにより処理速度が向上していることがわかる。
”Ｄｉｓｐｌａｙ” の処理は、８分木モデルからボクセルモデルを生成し、内部ボクセルを除去し、表面ボクセルに色彩を施す処理である。この処理は１００ミリ秒で行なわれている。
【００４６】
図１７（ａ）〜（ｄ）は、４台のステレオカメラで捕らえた実際の画像である。（ａ）はカメラ１の画像、（ｂ）はカメラ２の画像、（ｃ）はカメラ３の画像、（ｄ）はカメラ４の画像である。
図１８（ａ）〜（ｌ）は、図１７に示した画像をもとに、本実施形態の手法を用いて生成された任意視点の画像である。（ａ）〜（ｄ）はカメラ１とカメラ２の中間視点における生成画像であり、画像の下に示された１０：０，７：３などの比率は、上述で示した色彩の混合の重みである。同様に、（ｅ）〜（ｈ）はカメラ２とカメラ３の中間視点における画像、（ｉ）〜（ｌ）はカメラ３とカメラ４の中間視点における画像である。
図１８の各画像に示すように、本実施形態の射影グリッド空間による手法を用いれば、カメラキャリブレーションが必要な従来のユークリッド空間による手法と同程度の画像を生成できる。
【００４７】
図１９および図２０は、発明者らが従来提案してきた射影グリッド空間による手法（多視点カメラから任意の２台を基底カメラとする手法）を用いた処理結果と、本実施形態の射影グリッド空間による手法を用いた処理結果とを比較する実験を示したものである。
図１９（ａ）〜（ｄ）は、それぞれカメラ１〜４により撮影された画像である。被写体となっている二人の人物Ａ，Ｂは、実際にはほぼ同じ身長である。
図２０（ａ）および（ｂ）は、処理の結果を示した図である。図２０（ａ）は、カメラ１およびカメラ４を基底カメラとして（すなわち、従来の射影グリッド空間の手法により）生成された画像である。一方、図２０（ｂ）は別途２台のカメラを基底カメラとして（すなわち、本実施形態の射影グリッド空間の手法により）生成された画像である。
【００４８】
カメラ１とカメラ４から定義された射影グリッド空間の歪みの影響で、図２０（ａ）の画像にも歪みが生じている。被写体の２人の人物Ａ，Ｂは実際にはほぼ同じ身長であるのに、３次元形状が再構成された図２０（ａ）においては、人物Ａの身長は１７８ボクセル、人物Ｂの身長は１５８ボクセルとなっており、差が生じている。
これに対し、図２０（ｂ）においては人物ＡおよびＢは実際と同様にほぼ同じ身長（Ａが２０９ボクセル、Ｂが２０１ボクセル）で再構成されている。なぜなら、本実施形態の手法によれば、２台の基底カメラのエピポーラ線はほぼ平行になるため、歪みがほとんど生じないからである。
また、図２０（ａ）上の白線は、射影グリッド空間におけるｐ−ｒ平面の断面を表している。ｐ−ｒ平面上の直線は、図１９の各画像上でエピポーラ線として表示されている。図１９（ａ）〜（ｃ）上では白線、図１９（ｄ）上では白い点として示されている。しかしながら、図２０（ａ）においては、人物Ｂはｐ−ｒ平面を表す白線上に立っているが、人物Ａは足が白線より下に出てしまっている。
一方、図２０（ｂ）上の白線も、図２０（ａ）と同様に射影グリッド空間におけるｐ−ｒ平面の断面を表しているが、人物Ａ，Ｂともｐ−ｒ平面を表す白線上に立っている。
【００４９】
【発明の効果】
本発明では、多視点カメラとは別に、透視投影の歪みを無視できるような２台のカメラを別途用意して、これら２台のカメラを基底カメラとし、多視点カメラと２台の基底カメラとのエピポーラ幾何を利用して、基底カメラにより構成された３次元座標系において対象物の３次元形状を再構成（復元）する。これにより、再構成された３次元形状の歪みを防ぐことができる。
【図面の簡単な説明】
【図１】（ａ）対象空間をユークリッド空間で定義した従来の多視点カメラシステムの例を示す図である。
（ｂ）対象空間を射影グリッド空間で定義した従来の多視点カメラシステムの例を示す図である。
【図２】射影グリッド空間内の点Ａの基底カメラ画像への投影を示す図である。
【図３】射影グリッド空間内の点Ａの基底カメラ以外のカメラ画像への投影を示す図である。
【図４】本発明におけるカメラ配置の例を示す図である。
【図５】本発明において定義される射影グリッド空間を示す図である。
【図６】本実施形態における多視点カメラシステムのシステム構成の例を示す図である。
【図７】本実施形態の処理の流れを示したフローチャートである。
【図８】（ａ）対象物（人物）を含む色彩画像である。
（ｂ）対象物を含まない（背景のみの）色彩画像である。
【図９】（ａ）対象物（人物）を含む視差画像である。
（ｂ）対象物を含まない（背景のみの）視差画像である。
【図１０】（ａ）視差画像を用いず、色彩画像のみを用いて生成されたシルエット画像である。
（ｂ）色彩画像および視差画像を用いて生成されたシルエット画像である。
【図１１】８分木モデルを生成する様子を示す図である。
【図１２】インターセクション・チェックの手法を説明する図である。
【図１３】（ａ）インターセクション・チェックの結果をカメラごとのスタックで示す図である。
（ｂ）インターセクション・チェックの結果を空間ごとの８分木で示す図である。
【図１４】（ａ）（ｂ）内部ボクセルを除去する手法を説明する図である。
【図１５】表面ボクセルの色を決定する手法を説明する図である。
【図１６】本実施形態を用いた実験において、各処理に要した時間を示す図である。
【図１７】（ａ）〜（ｄ）本実施形態を用いた実験において、それぞれ４台のステレオカメラで捕らえた画像である。
【図１８】（ａ）〜（ｌ）図１７の画像をもとに、本実施形態の手法を用いて生成された任意視点画像である。
【図１９】（ａ）〜（ｄ）本実施形態を用いた実験において、それぞれ４台のステレオカメラで捕らえた画像である。
【図２０】（ａ）カメラ１およびカメラ４を基底カメラとして生成された任意視点画像である。
（ｂ）別途２台のカメラを基底カメラとして生成された任意視点画像である。

[0001]
[Technical field of the invention]
The present invention relates to reconstructing (restoring) the three-dimensional shape of an object photographed by multi-view cameras in a multi-view camera system.
[0002]
[Prior art]
A multi-view camera system is a system that photographs the same object from multiple viewpoints using the cameras that make up the system (hereinafter referred to as “multi-view cameras”) and reconstructs (restores) the three-dimensional shape of the object from the captured images. For example, such a multi-view camera system is used to generate an image viewed from a virtual viewpoint (an arbitrary viewpoint image).
Conventionally, such a multi-view camera system has required camera calibration, a procedure for estimating each camera's intrinsic parameters (focal length, coordinates of the projection center, etc.) and extrinsic parameters (three-dimensional position, orientation (three rotational degrees of freedom), etc.) (see, for example, Non-Patent Document 1).
This camera calibration is performed by measuring in advance six or more pairs relating the three-dimensional coordinates of a point in the target space to the two-dimensional coordinates at which that point is projected onto the image (hereinafter referred to as a “3D-2D map”). In practice, for example, each camera is calibrated from marker points photographed simultaneously by all cameras.
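The requirement of six or more pairs can be understood from a simple count of unknowns. The sketch below assumes the standard pinhole model; the 3×4 projection matrix P, the scale factor λ, and the coordinate symbols are notation introduced here for illustration and do not appear in the original text.

```latex
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
  = P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix},
\qquad P \in \mathbb{R}^{3 \times 4}
\\[6pt]
\text{unknowns: } 12 - 1 = 11 \ (\text{$P$ is defined up to scale}),
\qquad
\text{equations per pair: } 2
\\[6pt]
2n \ge 11 \;\Longrightarrow\; n \ge 6
```

Under this model, n pairs in general position give 2n linear equations in the entries of P, so at least six pairs are needed to determine all eleven degrees of freedom.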
[0003]
However, because the calibration-based method estimates the camera parameters independently for each camera, estimation errors cause inconsistencies among the parameters estimated for different cameras. Specifically, a subtle discrepancy arises between the epipolar geometry computed from the independently estimated camera parameters and the epipolar geometry estimated from the inter-camera correspondences detected at marker points visible in common to the cameras. This discrepancy is particularly large between cameras that are far apart (see, for example, Non-Patent Document 2).
Moreover, such a multi-view camera system basically detects points that are visible in common from several cameras and reconstructs the three-dimensional shape by the principle of triangulation, so it is important to handle the relative geometric relationships between the cameras as accurately as possible. On the other hand, calibrating a large number of cameras by having them simultaneously photograph marker points whose three-dimensional positions are known is laborious. Furthermore, when the shooting space is large, as in outdoor scenes, it is very difficult to place marker points accurately over a wide area.
[0004]
In order to solve the above problems, the inventors have proposed a multi-view camera system capable of reconstructing the three-dimensional shape of an object without performing camera calibration (see Non-Patent Documents 2 and 3).
In this method, two reference cameras (hereinafter referred to as “base cameras”) are first selected arbitrarily from among the multi-view cameras. A three-dimensional space called the projective grid space is then constructed using the epipolar lines projected from these two base cameras onto the other cameras, and the three-dimensional shape of the object is reconstructed in this projective grid space.
[0005]
The projective-grid-space method previously proposed by the inventors is described in detail below.
As described above, reconstructing the three-dimensional shape of an object in a multi-view camera system requires the relationship (3D-2D map) between each point (three-dimensional coordinates) of the three-dimensional space in which the shape is to be reconstructed and the point (two-dimensional coordinates) at which it is projected onto the image of each multi-view camera.
FIG. 1A shows a multi-view camera system according to the conventional method. In FIG. 1A, reference numerals 111 to 116 denote the cameras (multi-view cameras) constituting the multi-view camera system. The number of multi-view cameras is not limited to the number shown in the figure. In the conventional method, a Euclidean space 120 is defined independently of the camera arrangement, and camera calibration was required to associate, camera by camera, the coordinates in the Euclidean space 120 with the coordinates on the images of the cameras (111 to 116).
[0006]
On the other hand, FIG. 1B shows a multi-view camera system in which a projective grid space is defined, as previously proposed by the inventors. In FIG. 1B, reference numerals 111 to 116 denote the cameras constituting the multi-view camera system, as in FIG. 1A.
The projective grid space 130 is defined as follows. First, any two of the cameras (camera 111 and camera 112 in this example) are designated base camera 1 and base camera 2, respectively. A three-dimensional space (the projective grid space) 130 is defined by the central projections from the viewpoints of these base cameras. That is, the X and Y axes of the image of base camera 1 and the X axis of the image of base camera 2 are used as the three axes that define the space. With these three axes taken as the P, Q, and R axes, respectively, the projective grid space 130 is defined.
[0007]
FIGS. 2 and 3 show how a point A(p, q, r) in the projective grid space is related to the image of each camera. FIG. 2 is a diagram illustrating the projection of the point A(p, q, r) in the projective grid space onto the base camera images. In FIG. 2, the point 210 is the viewpoint of base camera 1, and the point 220 is the viewpoint of base camera 2. Reference numeral 212 denotes the image obtained from base camera 1 (image 1), and reference numeral 222 denotes the image obtained from base camera 2 (image 2). The points a_1(p, q) and a_2(r, s) are the projections of the point A onto image 1 and image 2, respectively.
Here, let F_hk denote the fundamental matrix of image h with respect to image k. By the definition of the projective grid space, the point A(p, q, r) is projected onto the point a_1(p, q) in image 1. When the point a_1(p, q) is projected onto image 2 as a straight line l using the fundamental matrix F_21 of image 2 with respect to image 1, the line l is given by equation (1) below.
(Equation 1)
\[ l = F_{21} \begin{pmatrix} p \\ q \\ 1 \end{pmatrix} \tag{1} \]
By the definition of the projective grid space, the X coordinate of the projection a_2(r, s) of the point A(p, q, r) in image 2 is r, so a_2(r, s) can be determined as the point on the line l whose X coordinate is r.
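The construction of a_2 can be illustrated with a short Python sketch. The function names `epipolar_line` and `second_base_projection`, the representation of F_21 as a row-major 3×3 list, and the toy matrix in the usage example are assumptions made for illustration; they do not come from the original text.

```python
def epipolar_line(F, x, y):
    """Return the line (a, b, c), meaning a*u + b*v + c = 0, computed as F @ (x, y, 1)."""
    return tuple(row[0] * x + row[1] * y + row[2] for row in F)


def second_base_projection(F21, p, q, r):
    """Return a_2 = (r, s): the point with X coordinate r on the epipolar line of a_1 = (p, q)."""
    a, b, c = epipolar_line(F21, p, q)
    if b == 0:
        # A vertical epipolar line has no unique point for a given X coordinate.
        raise ValueError("epipolar line is vertical; s is not determined by r")
    s = -(a * r + c) / b
    return (r, s)


# Toy matrix (assumed): maps the point (p, q) to the horizontal line v = q.
F21 = [[0, 0, 0],
       [0, 0, -1],
       [0, 1, 0]]
print(second_base_projection(F21, p=3, q=5, r=7))  # (7, 5.0)
```

The vertical-line case is rejected explicitly because the definition of a_2 relies on the epipolar line crossing the column X = r at exactly one point.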
[0008]
FIG. 3 is a diagram illustrating the projection of the point A(p, q, r) onto the image of a camera other than the base cameras (camera i). In FIG. 3, the point 210, the image 1 indicated by 212, and the point a_1(p, q) for base camera 1, as well as the point 220, the image 2 indicated by 222, and the point a_2(r, s) for base camera 2, are the same as in FIG. 2. The point 230 is the viewpoint of a camera i other than the base cameras, the image indicated by 232 is the image i obtained from camera i, and the point a_i is the projection of the point A onto image i. F_i1 is the fundamental matrix of image i with respect to image 1, and F_i2 is the fundamental matrix of image i with respect to image 2.
[0009]
To determine the point a_i, the point a_1 is first projected onto image i as a straight line l_1 using F_i1. The point a_2 is likewise projected onto image i as a straight line l_2 using F_i2. The intersection of the two lines l_1 and l_2 is the projection a_i of the point A onto image i. The lines l_1 and l_2 are given by equations (2) and (3) below.
(Equation 2)
\[ l_1 = F_{i1} \begin{pmatrix} p \\ q \\ 1 \end{pmatrix} \tag{2} \]
(Equation 3)
\[ l_2 = F_{i2} \begin{pmatrix} r \\ s \\ 1 \end{pmatrix} \tag{3} \]
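As a minimal sketch of this step, the intersection of the two epipolar lines can be computed as the cross product of their homogeneous coefficient vectors. The function names, the list-based matrix representation, and the toy matrices in the test values are assumptions for illustration only.

```python
def epipolar_line(F, x, y):
    """Return the line (a, b, c), meaning a*u + b*v + c = 0, computed as F @ (x, y, 1)."""
    return tuple(row[0] * x + row[1] * y + row[2] for row in F)


def intersect(l1, l2):
    """Intersect two lines given in homogeneous form using their cross product."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    x = b1 * c2 - c1 * b2
    y = c1 * a2 - a1 * c2
    w = a1 * b2 - b1 * a2
    if w == 0:
        # The lines are parallel or identical, so no single intersection exists.
        raise ValueError("epipolar lines do not intersect in a single point")
    return (x / w, y / w)


def project_to_camera(F_i1, F_i2, a1, a2):
    """Return a_i, the intersection of l_1 = F_i1 a_1 and l_2 = F_i2 a_2 in image i."""
    l1 = epipolar_line(F_i1, *a1)
    l2 = epipolar_line(F_i2, *a2)
    return intersect(l1, l2)
```

Evaluating `project_to_camera` for every grid point (p, q, r) and storing the results would give one way to tabulate the 3D-2D map for camera i.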

[0010]
As described above, in the previously proposed projective-grid-space method, two base cameras are first selected from among the multi-view cameras, and three-dimensional grid positions are defined by the fundamental matrix between these base cameras. The relationship between a grid position and the image position in a multi-view camera other than the base cameras is then described by the fundamental matrices between the base cameras and that multi-view camera.
The three-dimensional shape is reconstructed using, for example, the silhouette method (see Non-Patent Document 4). When the silhouette method is used with the projective grid space, the two-dimensional position at which an arbitrary point of the projective grid space is projected in each camera is obtained from the fundamental matrices between the base cameras and the multi-view cameras, and the three-dimensional shape of the object is reconstructed by determining whether this position lies inside or outside the silhouette.
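The inside/outside decision can be sketched as follows. This is an illustrative reading of the silhouette method rather than the inventors' implementation: the function name, the representation of each silhouette as a 2D list of booleans, and the `projectors` callables (each mapping a grid point to image coordinates) are all hypothetical.

```python
def inside_all_silhouettes(point, projectors, silhouettes):
    """Return True if the grid point projects inside the silhouette in every camera."""
    for project, silhouette in zip(projectors, silhouettes):
        u, v = project(point)
        col, row = int(round(u)), int(round(v))
        in_image = 0 <= row < len(silhouette) and 0 <= col < len(silhouette[0])
        # A point that falls outside the image or on the background in any
        # single camera cannot belong to the object.
        if not in_image or not silhouette[row][col]:
            return False
    return True
```

Returning as soon as one camera rejects the point avoids projecting it into the remaining images.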
Where N is the total number of cameras, reconstructing the three-dimensional shape requires a total of 1 + (N−2) × 2 fundamental matrices: the one between the two base cameras, plus those between each base camera and each of the other multi-view cameras.
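The count can be checked by enumerating the required matrices explicitly. In this small sketch the function name and the convention that cameras 1 and 2 are the base cameras are assumptions; each pair (h, k) stands for the fundamental matrix F_hk.

```python
def required_pairs(n_cameras):
    """List the index pairs (h, k) of the fundamental matrices F_hk that are needed."""
    if n_cameras < 2:
        raise ValueError("at least the two base cameras are required")
    pairs = [(2, 1)]                      # between the two base cameras
    for i in range(3, n_cameras + 1):     # every non-base camera i
        pairs.append((i, 1))              # F_i1
        pairs.append((i, 2))              # F_i2
    return pairs


print(len(required_pairs(6)))  # 9, i.e. 1 + (6 - 2) * 2
```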
[0011]
As described above, in the projective-grid-space method proposed by the inventors, the relationship between the projective grid space and points on an image can be described using only the fundamental matrices that represent the epipolar geometry between cameras. It is therefore possible to restore the three-dimensional shape of an object without calibrating the cameras of the multi-view camera system.
[0012]
[Non-Patent Document 1]
R. Tsai: "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses", IEEE Journal of Robotics and Automation, RA-3, 4, pp. 323-344, 1987
[Non-Patent Document 2]
Hideo Saito, Takeo Kanade: "Virtualization of Dynamic Events with Many Cameras", Information Processing Society of Japan Research Report "Computer Vision and Image Media", No. 119-016, November 1999
[Non-Patent Document 3]
Hideo Saito, Makoto Kimura, Satoshi Yaguchi, Naho Inamoto: "Virtualization of Real Scenes Using Multi-view Images: Intermediate Viewpoint Image Generation Using Projective Relationships between Cameras", Information Processing Society of Japan Research Report "Computer Vision and Image Media", No. 131-008, January 2002
[Non-Patent Document 4]
Satoshi Yaguchi, Makoto Kimura, Hideo Saito, Takeo Kanade: "Arbitrary Viewpoint Image Generation Using Uncalibrated Multi-Viewpoint Camera System", Transactions of the IPSJ SIG Computer Vision and Image Media, Vol. 42, No. SIG6 (CVIM2), pp. 9-21, June 2001
[0013]
[Problems to be solved by the invention]
However, in the above method previously proposed by the inventors, the two base cameras are selected from among the cameras used to photograph the object. As a result, the axes of the three-dimensional space (projective grid space) defined by the base cameras are not orthogonal because of perspective projection, and the space may be distorted. In that case, the three-dimensional shape reconstructed in the projective grid space is also distorted.
An object of the present invention is to provide a multi-view camera system in which no distortion occurs in a reconstructed three-dimensional shape.
[0014]
[Means for Solving the Problems]
In order to solve the above problem, the present invention is a multi-view camera system comprising: multi-image acquisition means for obtaining images from a plurality of multi-view cameras directed at an object and from two base cameras installed so as to be orthogonal; association means for associating the two base cameras with the plurality of multi-view cameras; and restoration processing means for restoring, from the images obtained from the plurality of multi-view cameras and by using the association, the three-dimensional shape of the object in a three-dimensional projective grid space defined by the two base cameras.
The multi-image acquisition means may generate a silhouette image of the object from the images obtained from the multi-view cameras and output it together with a color image of the object.
Further, the restoration processing means may create a three-dimensional model from the silhouette image using an octree model generation method.
The restoration processing means may also generate an image of the object observed from an arbitrary viewpoint by removing the internal voxels from the voxels of the three-dimensional model generated by the octree model generation method and coloring the surface voxels using the color image of the object.
Further, the plurality of multi-view cameras may be stereo cameras capable of obtaining parallax images, and the multi-image acquisition means may generate a silhouette image of the object from the parallax images and color images obtained from the multi-view cameras and output it together with the color image of the object.
[0015]
[Best Mode for Carrying Out the Invention]
Hereinafter, an example of an embodiment of a multi-view camera system of the present invention will be described with reference to the drawings.
First, a method of defining a three-dimensional space (projection grid space) in the present invention will be described with reference to FIGS. 4 and 5.
Next, the configuration and processing flow of the embodiment of the present invention will be described with reference to FIGS. 6 and 7.
Next, a method of restoring the three-dimensional shape of the object in the projection grid space defined in the present embodiment will be described with reference to FIGS. 8 to 15.
Finally, an experiment performed using the present embodiment and its results will be described with reference to FIGS. 16 to 20.
[0016]
<Definition of projection grid space>
First, a method of defining a three-dimensional space (projection grid space) in the present invention will be described with reference to FIGS. 4 and 5.
As described above, in the method using the projection grid space that the inventors have conventionally proposed, any two of the multi-view cameras were selected as base cameras, and the three-dimensional coordinate system was formed using these two base cameras.
In the present invention, in order to eliminate the distortion that is the problem of the conventional method, two base cameras are prepared separately from the multi-view cameras to form the three-dimensional coordinate system. FIG. 4 shows an example of a camera arrangement according to the present invention.
[0017]
In FIG. 4, stereo cameras 411 to 414 are the multi-view cameras. In the present invention, separately prepared base cameras 431 and 432 are arranged in addition to these multi-view cameras 411 to 414.
The base cameras 431 and 432 are used with a focal length close to infinity. In addition, as shown in FIG. 4, the two base cameras are arranged so that their lines of sight (the r-axis and the p-axis) are substantially orthogonal to each other, and are placed as far away as possible so that they capture as much of the target space as possible. This reduces the distortion of the coordinate axes formed by the lines of sight of the two base cameras.
Note that, in the present invention, the base cameras 431 and 432 are used only to form the three-dimensional coordinate system. For this reason, even when a moving object is photographed, cameras that are unsuitable for capturing moving images but have excellent spatial resolution can be used as the base cameras 431 and 432. As a result, the resolution of the target space can be improved.
[0018]
FIG. 5 shows the projection grid space defined in the present invention. In this projection grid space, the method of associating the base cameras and the multi-view cameras by the fundamental matrix is almost the same as the above-mentioned method that the inventors have conventionally proposed.
The two base cameras are referred to as base camera 1 and base camera 2. In FIG. 5, points 510 and 520 are the viewpoints of base camera 1 and base camera 2, respectively. Here, a point P(p, q, r) in the projection grid space is defined by two points: the point p₁(p, q) on the image 512 of base camera 1 and the point p₂(r, y) on the image 522 of base camera 2. The point p₂(r, y) lies on the epipolar line obtained by projecting, onto the image 522 of base camera 2, the line of sight of base camera 1 that passes from the viewpoint 510 through the point p₁. This epipolar line is given by equation (1) above. The point p₂ is obtained from the linear equation of this epipolar line.
[0019]
Next, a three-dimensional-two-dimensional map that associates the projection grid space with the image captured by each camera is calculated. The straight line l₁ on the image 532 is the epipolar line obtained by projecting, onto the image 532 of camera i, the line of sight of base camera 1 that passes through the point p₁ on the image 512 of base camera 1. The epipolar line l₁ is given by equation (2) above. Similarly, the straight line l₂ is the epipolar line obtained by projecting, onto the image of camera i, the line of sight of base camera 2 that passes through the point p₂ on the image of base camera 2. The epipolar line l₂ is given by equation (3) above.
The intersection of the epipolar lines l₁ and l₂ thus obtained is the point p_i(u_i, v_i) on the image 532 of camera i that corresponds to the point P(p, q, r) in the projection grid space.
Similarly, a three-dimensional-two-dimensional map of the projection grid space and the images of all the multi-view cameras is calculated. This three-dimensional-two-dimensional map is stored as a look-up table. This look-up table is referred to when converting three-dimensional coordinates in the projection grid space into two-dimensional coordinates on the image of the multi-view camera in a process of reconstructing a three-dimensional shape of the object described later.
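The mapping from a grid point to a pixel of camera i can be sketched as follows. This is a minimal illustration, not the inventors' implementation: equations (1) to (3) are not reproduced in this section, so the orientation of the fundamental matrices and the names `F21`, `Fi1`, `Fi2`, `grid_to_image` and `build_lookup_table` are assumptions made for the example.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def cross(a, b):
    """Cross product; for two homogeneous 2D lines it gives their intersection."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def grid_to_image(F21, Fi1, Fi2, p, q, r):
    """Map the grid point P(p, q, r) to (u, v) on the image of camera i.

    Assumed conventions (not taken from the source): F21 maps a point of
    base camera 1 to its epipolar line on base camera 2; Fi1 and Fi2 map
    points of base cameras 1 and 2 to epipolar lines on camera i.
    """
    # Epipolar line a*x + b*y + c = 0 of p1 = (p, q) on base camera 2.
    a, b, c = mat_vec(F21, [p, q, 1.0])
    if abs(b) < 1e-12:
        raise ValueError("epipolar line is vertical; y cannot be solved from x = r")
    y = -(a * r + c) / b              # p2 = (r, y) lies on that line
    l1 = mat_vec(Fi1, [p, q, 1.0])    # epipolar line of p1 on camera i
    l2 = mat_vec(Fi2, [r, y, 1.0])    # epipolar line of p2 on camera i
    x, yy, w = cross(l1, l2)          # homogeneous intersection of l1 and l2
    if abs(w) < 1e-12:
        raise ValueError("epipolar lines are parallel; no finite intersection")
    return (x / w, yy / w)


def build_lookup_table(F21, Fi1, Fi2, size):
    """Tabulate grid_to_image for every integer grid point of a size^3 block."""
    return {(p, q, r): grid_to_image(F21, Fi1, Fi2, p, q, r)
            for p in range(size) for q in range(size) for r in range(size)}
```

Computing the table once per camera keeps the per-voxel work during reconstruction to a dictionary lookup, which matches the role the text gives to the look-up table.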
[0020]
Conventionally, in order to calculate this three-dimensional-two-dimensional map, it was necessary to define at least six three-dimensional coordinates in the target space, project those coordinates onto the image of each camera, and estimate the camera parameters (camera calibration). However, measuring three-dimensional coordinates in the space and performing camera calibration is very difficult, particularly when the target space is vast. With the method of the present invention, only the epipolar lines need to be calculated, so it is sufficient to measure two-dimensional coordinates on the image of each camera. Therefore, even when the target space is large, the labor required for the measurement does not increase. Furthermore, the projection grid space of the present embodiment has almost no distortion. This is because the coordinate axes of this space are formed by the lines of sight of the two base cameras, and these lines of sight can be regarded as orthogonal.
[0021]
<Configuration and Processing Flow of the Present Embodiment>
Next, the configuration and the flow of processing of the embodiment of the multi-view camera system of the present invention will be described with reference to FIGS. 6 and 7.
FIG. 6 is a diagram illustrating an example of a system configuration of the multi-view camera system according to the present embodiment. FIG. 7 is a flowchart showing a flow of processing for generating an arbitrary viewpoint image in the multi-view camera system of the present embodiment. In the present embodiment, a multi-view camera system including four multi-view cameras will be described as an example, but the number of cameras is not limited to this. The details of each process will be described later.
[0022]
In FIG. 6, four stereo cameras (611 to 614), four PCs (621 to 624), one host PC 630, and a monitor 640 are connected by a LAN. In addition, in this embodiment, two cameras are used as base cameras in order to form a three-dimensional coordinate system.
The four stereo cameras 611 to 614 are of a type that can capture a parallax image together with a color image. Each stereo camera is connected to one of the four image-capture PCs 621 to 624. The host PC 630, which is connected to the image-capture PCs, performs the processing for generating an arbitrary viewpoint image and the like.
First, the object is photographed by the stereo cameras 611 to 614 (S710 in FIG. 7). Each of the PCs 621 to 624 extracts a silhouette (outline) from the image captured by its stereo camera and generates a silhouette image (S720). This processing is performed locally on the image-capture PCs.
[0023]
In the present embodiment, both a color image and a parallax image are used to extract the silhouette. As a result, the background of the image can be removed effectively. On the other hand, the parallax values of the captured images are too coarse for reconstructing a three-dimensional shape, so the parallax images are not used in the reconstruction itself.
The color image and the silhouette image generated in S720 are JPEG-compressed by the image capture PC 621 or the like, and transmitted to the host PC 630. These images are used in a process of reconstructing a three-dimensional shape.
Next, the host PC 630 reconstructs (restores) the three-dimensional shape. In this reconstruction, an octree model is generated first (S730). The octree model is generated by combining an octree generation algorithm with shape generation by the silhouette method, which reduces the processing time. Next, a voxel model is generated from the octree model (S740). Next, internal voxels are removed (S750). They are removed because the surface voxels alone are sufficient for the final display of an arbitrary viewpoint image. Finally, the surface voxels are colored (S760), which completes the arbitrary viewpoint image. The generated arbitrary viewpoint image is output to, for example, the monitor 640.
[0024]
<Reconstruction of three-dimensional shape>
Next, referring to FIGS. 8 to 15, a method will be described for reconstructing the three-dimensional shape of the object in the projection grid space formed by the base cameras, using the images of the object captured by the multi-view cameras, and for generating an arbitrary viewpoint image.
[0025]
(Generation of silhouette images)
Before reconstructing the three-dimensional shape of the object, first, a silhouette image is generated by removing the background from the image captured by each stereo camera of the present embodiment (processing shown in S720 of FIG. 7).
As described above, in the present embodiment, a silhouette image is generated using not only the color image captured by each stereo camera but also the parallax image. This is because, when only the color image is used, specular reflections and shadows affect the generated silhouette image. On the other hand, when only the parallax image is used, these effects hardly arise, but only a coarse silhouette can be obtained. For this reason, in the present embodiment, background removal is performed using information from both the color image and the parallax image.
[0026]
Let c_b(x, y) and d_b(x, y) be the color and the disparity value of the pixel (x, y) of the background image, and let c_c(x, y) and d_c(x, y) be the color and the disparity value of the pixel (x, y) of the image including the object. In the present embodiment, the silhouette is then generated by the processing shown in the following pseudo code.

Here, p(x, y) represents the state of the pixel of the silhouette image. th_D is the disparity threshold: if the difference in disparity between the image including the object and the background image is th_D or less, the pixel is judged to be background (NOT_SILHOUETE); otherwise, the judgment is made using color. The thresholds used for the color judgment are th_U and th_L. If the color difference between the image including the object and the background image is larger than the former, th_U, the pixel is judged to be inside the silhouette (SILHOUETE); if it is smaller than the latter, th_L, the pixel is judged to be background (NOT_SILHOUETE). In the remaining case, that is, when the color difference is th_U or less and th_L or more, the judgment uses the color difference θ: if θ is th_A or more, the pixel is judged to be inside the silhouette (SILHOUETE); otherwise, it is judged to be background (NOT_SILHOUETE).
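The decision procedure described above can be sketched in Python as follows. The original pseudo code is not reproduced in this text, so this is a reconstruction from the prose only. Two points are assumptions made for the example: the color difference is taken as the Euclidean distance between RGB triples, and θ is taken as the angle between the two RGB vectors. The helper names are hypothetical; the labels keep the spelling used in the description.

```python
import math

SILHOUETE = 1        # spelling follows the labels used in the description
NOT_SILHOUETE = 0


def color_distance(c1, c2):
    """Euclidean distance between two RGB triples (assumed colour difference)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def color_angle(c1, c2):
    """Angle in radians between two RGB vectors (assumed meaning of theta)."""
    n1 = math.sqrt(sum(a * a for a in c1))
    n2 = math.sqrt(sum(b * b for b in c2))
    if n1 == 0 or n2 == 0:
        return 0.0
    cos_t = sum(a * b for a, b in zip(c1, c2)) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos_t)))


def classify_pixel(c_c, d_c, c_b, d_b, th_D, th_U, th_L, th_A):
    """Return the silhouette state p(x, y) for one pixel."""
    # Step 1: a small disparity change means the pixel is background.
    if abs(d_c - d_b) <= th_D:
        return NOT_SILHOUETE
    # Step 2: a clear colour change decides directly.
    diff = color_distance(c_c, c_b)
    if diff > th_U:
        return SILHOUETE
    if diff < th_L:
        return NOT_SILHOUETE
    # Step 3: ambiguous colour change, fall back on the angle theta.
    return SILHOUETE if color_angle(c_c, c_b) >= th_A else NOT_SILHOUETE
```

Under these assumptions, a pixel that only becomes darker or brighter without changing hue keeps a small angle and is rejected in step 3, which is consistent with the stated aim of suppressing shadows.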
[0027]
FIGS. 8 to 10 show how a silhouette image is generated using the above method. FIGS. 8 and 9 are color images and parallax images taken by a stereo camera. FIG. 8A is a color image including the target of silhouette image generation (a person), and FIG. 8B is a color image not including the target (background only). FIG. 9A is a parallax image including the target of silhouette image generation (a person), and FIG. 9B is a parallax image not including the target (background only).
FIG. 10A is a silhouette image generated using only the color images of FIG. 8, without using the parallax images of FIG. 9. On the other hand, FIG. 10B is a silhouette image generated using both the color images of FIG. 8 and the parallax images of FIG. 9 (that is, using the method according to the present embodiment).
Comparing FIG. 10A and FIG. 10B, in FIG. 10A the shadow of the person shown in FIG. 8A affects the area around the person's feet in the silhouette image (the person's shadow is also extracted as silhouette). In contrast, since the parallax image is not affected by the shadow, the silhouette image in FIG. 10B is free of the shadow, and only the person is extracted as the silhouette.
[0028]
(Generation of three-dimensional shapes from silhouette images)
Next, a three-dimensional shape is generated from the silhouette image in order to reconstruct the three-dimensional shape. In the present embodiment, a three-dimensional shape is generated from a silhouette using an octree data structure by associating a projection grid space with an image captured by each camera (the process illustrated in S730 of FIG. 7). The generation of the octree model will be described later.
When the Euclidean space, which is a conventional method, is used, a large number of conical models are generated by perspective projection of a silhouette image. Then, the resulting three-dimensional shape is a common part of all the conical models. That is, a three-dimensional shape is generated from the silhouette image by the following equation (4).
(Equation 4)
$$V_I = \bigcap_{i \in I} V_i$$
Here, V_I is the resulting three-dimensional shape, I is the set of all silhouette images, and i is one silhouette image in the set. V_i is the shape model generated from the i-th silhouette image.
[0029]
In general, in the Euclidean space, each voxel is projected on all silhouette images to determine whether the voxel is inside or outside the silhouette. If a voxel is outside the silhouette in one image, the voxel is not part of the object. On the other hand, if the voxel is inside the silhouette in all the images, it is determined that the voxel is a part of the object.
When this silhouette method is applied to the projection grid space of the present embodiment, three-dimensional coordinates in the projection grid space are converted into two-dimensional coordinates on each image by using the above-mentioned look-up table prepared in advance, instead of by a perspective projection transformation. This conversion is possible because, as described above, the projection grid space is associated with each camera image in advance by the fundamental matrix.
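The silhouette test with a look-up table in place of perspective projection can be sketched as below. This is an illustrative sketch only: the data layout (one dictionary per camera, silhouettes as nested lists) and the name `carve` are assumptions, and treating a voxel that projects outside an image as lying outside the silhouette is a choice made for the example.

```python
def carve(voxels, lookup_tables, silhouettes):
    """Return the voxels that fall inside the silhouette in every image.

    voxels: iterable of (p, q, r) grid coordinates.
    lookup_tables: one dict per camera mapping (p, q, r) -> (u, v) pixel.
    silhouettes: one 2D list per camera; truthy entries are silhouette pixels.
    """
    kept = []
    for voxel in voxels:
        inside_all = True
        for table, sil in zip(lookup_tables, silhouettes):
            u, v = table[voxel]                 # 3D -> 2D via the look-up table
            col, row = int(round(u)), int(round(v))
            in_image = 0 <= row < len(sil) and 0 <= col < len(sil[0])
            if not in_image or not sil[row][col]:
                inside_all = False              # outside one silhouette is enough
                break
        if inside_all:
            kept.append(voxel)
    return kept
```

The early `break` mirrors the rule stated above: a voxel that lies outside the silhouette in a single image is not part of the object, so the remaining images need not be consulted.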
[0030]
(Generation of octree model)
The octree data structure mentioned above is a tree model generated by recursively dividing the entire target three-dimensional space (referred to as the universal space) into eight parts (two each in the vertical, horizontal, and depth directions).
If the voxels that make up one region (octant) of the eight divided spaces are all of the same type, the octant is not divided further. Otherwise, the octant is divided further into eight cubes, possibly down to a single voxel.
[0031]
FIG. 11 shows how an octree model is generated from the target space.
FIG. 11A is an octree generated from the target space 1150 shown in FIG. 11B. Each node (1100, 1110, 1120, 1130, etc.) of this octree corresponds to one of the spaces recursively divided in FIG. 11B. The node (1100) shown at level 0 corresponds to the entire space. The eight nodes (1110, etc.) shown at level 1 correspond to the eight spaces obtained by the first division. Similarly, level 2 corresponds to the spaces obtained by the second division, and level 3 to the spaces obtained by the third division.
The color of the node indicates whether the space is an object. If the space does not include the target (only the background), the corresponding node is represented in black. If the entire space is an object, the corresponding node is represented in white. When an object is included in a part of the space, the corresponding node is shown in gray. When a part of the space includes the object, the space is further divided into eight. A method for determining whether the space is an object will be described later.
[0032]
Referring to FIGS. 11A and 11B based on the correspondence between the nodes and the space, the node 1100 is represented in gray because the entire space (1150) includes an object in a part thereof. In this case, the space 1150 is divided into eight (two each in the vertical, horizontal, and depth directions).
Of the eight divided spaces, the octant including the target object 1162 and the octant including the target object 1164 in FIG. 11B are octants including only the target object and are not further divided. In FIG. 11A, nodes corresponding to these octants are represented in black.
On the other hand, the octant including the object 1172 and the octants including the objects 1174 and 1182 are octants that partially include the object, and are therefore further divided into eight. In FIG. 11A, the nodes corresponding to these octants are represented in gray.
The other octants do not include the object (only the background) and are not further divided. In FIG. 11A, nodes corresponding to these octants are represented in white.
In this way, the division of the space is recursively repeated until there is no space including the object in a part thereof (until the corresponding nodes are all black or white).
[0033]
When the space is divided into octants and these are projected onto all the silhouette images, the present embodiment uses the following method.
First, the eight vertices of the octant are converted into coordinates on the image plane, and the image region of the cube is searched to determine whether the cube represents the object. This image region is not necessarily a rectangle. Therefore, in order to reduce the amount of calculation, the system of the present embodiment uses the smallest rectangle enclosing the cube's image region as the search region and performs an intersection check that examines the attribute of the octant (its cube type).
[0034]
FIG. 12 is a diagram for explaining the intersection check. FIGS. 12A to 12C are silhouette images, in which the black part (1210) is the background and the gray part (1220) is the silhouette of the object. The rectangles 1232, 1234, and 1236 are the rectangular areas to be checked.
As shown in FIG. 12B, when the rectangular area 1234 to be checked contains both silhouette and background, the cube type of the space (octant) corresponding to that area is assumed to be "GRAY". A cube type of "GRAY" means that part of the space contains the object.
As shown in FIG. 12A, when all the pixels of the rectangular area 1232 to be checked are background, the cube type of the space corresponding to that area is assumed to be "BLACK". A cube type of "BLACK" means that the entire space consists of background.
On the other hand, as shown in FIG. 12C, when all the pixels of the rectangular area 1236 to be checked are silhouette, the cube type of the space corresponding to that area is assumed to be "WHITE". A cube type of "WHITE" means that the entire space consists of the object.
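The intersection check for a single silhouette image can be sketched as follows. The function names and the image representation (a nested list with truthy silhouette pixels) are assumptions made for the example, as is counting pixels outside the image as background.

```python
def bounding_rectangle(points):
    """Smallest axis-aligned pixel rectangle enclosing the projected vertices."""
    us = [int(round(u)) for u, _ in points]
    vs = [int(round(v)) for _, v in points]
    return min(us), min(vs), max(us), max(vs)   # inclusive bounds


def assumed_cube_type(silhouette, rect):
    """Classify one octant against one silhouette image.

    silhouette: 2D list, truthy = silhouette pixel, falsy = background.
    rect: (u_min, v_min, u_max, v_max), inclusive.
    """
    u_min, v_min, u_max, v_max = rect
    seen_silhouette = seen_background = False
    for v in range(v_min, v_max + 1):
        for u in range(u_min, u_max + 1):
            inside = (0 <= v < len(silhouette)
                      and 0 <= u < len(silhouette[0])
                      and bool(silhouette[v][u]))
            if inside:
                seen_silhouette = True
            else:
                seen_background = True
            if seen_silhouette and seen_background:
                return "GRAY"        # mixed area: stop scanning early
    return "WHITE" if seen_silhouette else "BLACK"
```

Scanning an axis-aligned rectangle avoids a point-in-polygon test for the projected cube, at the cost of occasionally reporting "GRAY" for an octant whose true image region is uniform.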
[0035]
If the intersection check of the rectangular area in an image yields an assumed cube type of "BLACK", the cube type of this space is determined to be "BLACK", in accordance with the principle of the silhouette method.
Otherwise, the intersection check for the cube of that space is continued until all the other images have been referenced.
When all the images have been referenced, the cube type of the space is determined. If the assumed cube type is "WHITE" in all images, the cube type of the space is determined to be "WHITE". Otherwise, it is determined to be "GRAY". If it is determined to be "GRAY", all the cube types assumed for the space are saved and referred to in the subsequent processing. As a result, the calculation time can be reduced.
[0036]
FIGS. 13A and 13B show how the assumed cube types saved as described above are referred to in the subsequent processing when the cube type of a certain space is determined to be "GRAY". Here, a case where the target model is captured by four cameras (cameras 1 to 4), that is, a case where the intersection check is performed on four silhouette images, will be described.
FIG. 13A shows the results of the intersection check on the four silhouette images as a stack for each camera. In the figure, "W" indicates "WHITE", "G" indicates "GRAY", and "?" indicates unknown (to be checked). FIG. 13B shows the results of the intersection check on the four silhouette images as an octree for each space (cube).
[0037]
In FIG. 13, when the assumed cube type of the parent node 1352 is "WHITE" in the image of camera 1, "WHITE" in the image of camera 2, "GRAY" in the image of camera 3, and "GRAY" in the image of camera 4, the cube type of the space corresponding to the parent node 1352 is determined to be "GRAY". In this case, the cube of this space is divided into eight, and each part becomes a child node (1354, etc.) of the octree. For the child node 1354 and the other child nodes, the assumed cube type in the images of camera 1 and camera 2 is already known to be "WHITE" (see 1314), so the images of camera 1 and camera 2 do not need to be checked again.
As described above, by saving the assumed cube types of the parent node and referring to them when processing the child nodes, wasted processing can be reduced.
[0038]
Once the cube type has been determined, the subsequent processing branches according to the cube type. If the cube type is determined to be "BLACK" or "WHITE", the corresponding space does not need to be divided further. That is, the corresponding octree node becomes a leaf node as it is. On the other hand, when the cube type is determined to be "GRAY", the corresponding space is further divided into eight. That is, the corresponding octree node has eight child nodes. When the above processing is completed for a certain space, the same processing is recursively repeated for the other spaces.
The octree method is described in, for example, M. Potmesil, "Generating octree models of 3D objects from their silhouettes in a sequence of images," Computer Vision, Graphics, and Image Processing, Vol. 40, pp. 1-29, 1987.
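The recursive subdivision, including the reuse of the parent's per-image results, can be sketched as follows. This is a simplified illustration rather than the embodiment's code: the node layout, the `classify` callback (standing in for the per-image intersection check) and the parameter names are all assumptions.

```python
def build_octree(cube, cameras, classify, max_level, level=0, inherited=None):
    """Recursively build an octree node for `cube`.

    cube: (x, y, z, size) in grid units.
    cameras: list of camera identifiers.
    classify(cube, cam) -> "BLACK", "WHITE" or "GRAY" (assumed type per image).
    inherited: per-camera assumed types saved by the parent node.
    """
    inherited = inherited or {}
    assumed = {}
    for cam in cameras:
        # A parent that was WHITE in this image has WHITE children there too,
        # so this image does not need to be checked again.
        if inherited.get(cam) == "WHITE":
            assumed[cam] = "WHITE"
            continue
        assumed[cam] = classify(cube, cam)
        if assumed[cam] == "BLACK":
            # Background in one image settles the type immediately.
            return {"cube": cube, "type": "BLACK", "children": []}
    if all(t == "WHITE" for t in assumed.values()):
        return {"cube": cube, "type": "WHITE", "children": []}
    node = {"cube": cube, "type": "GRAY", "children": []}
    x, y, z, size = cube
    if level >= max_level or size <= 1:
        return node                      # finest level reached: GRAY leaf
    half = size // 2
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                child = (x + dx, y + dy, z + dz, half)
                node["children"].append(
                    build_octree(child, cameras, classify, max_level,
                                 level + 1, assumed))
    return node
```

Passing `assumed` down as `inherited` is what lets a child skip the images in which its parent was already "WHITE"; only the images that were "GRAY" for the parent are examined again.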
[0039]
(Removal of internal voxels)
After the octree structure has been generated, the octree model is converted into a voxel model for image display. The octree corresponds to 2³ⁿ voxels, where n is the number of levels in the octree. If all of these voxels were processed as final display targets, a great deal of processing time would be required. To avoid this problem, in the present embodiment, the internal voxels corresponding to the inside of the object are removed from the processing target, and only the surface voxels corresponding to the surface of the object are colored and displayed. These are the processes shown in S740 to S760 in FIG. 7.
FIGS. 14A and 14B show how the internal voxels are removed in this embodiment. In FIG. 14A, an object indicated by cubes such as 1410 is present in the target space indicated by 1400. In the octree expressing this object, the maximum number of levels is 3, as shown in the figure. In FIG. 14B, the object is represented by voxels (1430 and the like) corresponding to a three-level division. In the subsequent coloring process, only the surface voxels constituting the surface of the object need to be colored, and the internal voxels not constituting the surface may be removed.
[0040]
Whether a certain voxel (voxel v) is an internal voxel or a surface voxel is determined as follows.
If one or more voxels among the voxels adjacent to the six faces of voxel v are empty voxels that do not contain an object, voxel v is a surface voxel. Otherwise, voxel v is an internal voxel.
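The six-neighbor rule above can be sketched as follows. The representation of the voxel model as a set of occupied coordinates and the function name are assumptions made for the example; voxels outside the set, including positions beyond the grid boundary, are treated as empty.

```python
def surface_voxels(occupied):
    """Return the occupied voxels that have at least one empty face neighbour.

    occupied: set of (x, y, z) integer coordinates of object voxels.
    """
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    surface = set()
    for (x, y, z) in occupied:
        for dx, dy, dz in offsets:
            if (x + dx, y + dy, z + dz) not in occupied:
                surface.add((x, y, z))   # one empty neighbour is enough
                break
    return surface
```

For a solid 3 × 3 × 3 block, only the center voxel has all six neighbors occupied, so 26 of the 27 voxels are kept as surface voxels.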
In the subsequent processing (coloring of the voxel model), only the surface voxels are processed.
[0041]
(Coloring of voxel model)
Next, three-dimensional surface voxels are colored. In the present embodiment, two images are actually selected from the images captured by each camera, and the color of the surface voxel is dynamically determined based on the pixels on those images. Which image to select is determined by the position of the arbitrary viewpoint.
A method for determining the surface voxel color will be described with reference to FIG. 15 and the following equation (5).
(Equation 5)

[0042]
In FIG. 15 and equation (5) above, θ is the horizontal angle between camera i and the arbitrary viewpoint, and φ is the horizontal angle between camera (i + 1) and the arbitrary viewpoint. The weights of the color mixture are determined by these angles. Using the look-up table generated as described above, the point P on the object 1540 is converted into the point p₁ on the image 1510 of camera i and the point p₂ on the image 1520 of camera (i + 1). If the color of the point p₁ is denoted c(p₁), the color of the point p₂ is denoted c(p₂), and the color of the point p on the arbitrary-viewpoint image 1530 corresponding to p₁ and p₂ is denoted c(p), then c(p) can be calculated by equation (5) above.
If the surface voxels are colored as described above, a three-dimensional shape at an arbitrary viewpoint can be displayed.
Note that processing for voxels is described in, for example, K. M. Cheung, T. Kanade, J.-Y. Bouguet, and M. Holler, "A real time system for robust 3D voxel reconstruction of human motions", CVPR 2000, IEEE Comput. Soc., Los Alamitos, CA, USA, vol. 2, pp. 714-729, 2000.
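As an illustration of angle-based color mixing, the sketch below blends the two pixel colors with weights that are linear in the angles. Equation (5) itself is not reproduced in this text, so this specific weighting (weight φ/(θ + φ) for camera i and θ/(θ + φ) for camera (i + 1)) is an assumption chosen so that the nearer camera contributes more; the function name is hypothetical.

```python
def blend_color(c1, c2, theta, phi):
    """Mix the colours c(p1) and c(p2) with weights set by the angles.

    Assumed weighting (linear in angle, not taken from the source):
    the camera closer to the arbitrary viewpoint gets the larger weight.
    """
    total = theta + phi
    if total == 0:
        raise ValueError("theta + phi must be positive")
    w1 = phi / total     # weight of camera i grows as theta shrinks
    w2 = theta / total   # weight of camera (i + 1) grows as phi shrinks
    return tuple(w1 * a + w2 * b for a, b in zip(c1, c2))
```

Under this assumed form, an arbitrary viewpoint that coincides with camera i (θ = 0) takes its color entirely from camera i, and θ : φ = 3 : 7 gives a 7 : 3 mixture.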
[0043]
<Experimental results using this embodiment>
Finally, experiments performed using this embodiment and their results are shown in FIGS. 16 to 20.
This experiment was performed under the following conditions.
・ Number of voxels: 256 × 256 × 256
・ Maximum number of levels of octree: 8
・ Image resolution: 320 × 240 pixels
・ Color depth: 24 bits
・ Parallax depth: 8 bits
[0044]
This experiment was performed using a multi-view camera system using four stereo cameras (camera 1 to camera 4) as shown in FIG.
FIG. 16 is a diagram showing the time required for each processing in this experiment.
In FIG. 16, four time axes indicated by “Camera PC” indicate the processing of the image capturing PC connected to the multi-view camera on the time axis. The time axis indicated by “HostPC” indicates the processing in the host PC by the time axis.
[0045]
The process of “Capture” is a process of photographing the target model with a stereo camera and storing it in a PC for image capture, and this process is performed in 50 milliseconds.
“Disparity image generation” is processing for generating a parallax image, and this processing is performed in 100 milliseconds.
“Silhouette image generation” is a process for generating a silhouette image, and this process is performed in 70 milliseconds.
The process of "Image transfer" is a process of JPEG-compressing the color image and the silhouette image and transmitting them to the host PC. This process is performed in 100 milliseconds.
The process of "3D shape reconstruction" is a process of generating an octree model from the silhouette images. This process is performed in 120 milliseconds. When this processing was executed without the octree generation algorithm of the present embodiment, it took 520 milliseconds. The octree generation algorithm of the present embodiment thus improves the processing speed.
The “Display” process is a process of generating a voxel model from an octree model, removing internal voxels, and applying colors to surface voxels. This process is performed in 100 milliseconds.
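The reported figures can be tallied as in the short calculation below. It assumes, purely for illustration, that the stages run strictly one after another with no overlap between the image-capture PCs and the host PC; the text does not state whether FIG. 16 shows such overlap, so the summed latency is only an upper-bound style estimate. The variable names are hypothetical.

```python
# Per-stage times in milliseconds, as reported in the text.
camera_pc_ms = {
    "Capture": 50,
    "Disparity image generation": 100,
    "Silhouette image generation": 70,
    "Image transfer": 100,
}
host_pc_ms = {
    "3D shape reconstruction": 120,
    "Display": 100,
}

camera_total = sum(camera_pc_ms.values())        # 320 ms on each capture PC
host_total = sum(host_pc_ms.values())            # 220 ms on the host PC
sequential_latency = camera_total + host_total   # 540 ms if nothing overlaps

# Speed-up of reconstruction with the octree algorithm (520 ms without it).
speedup = 520 / host_pc_ms["3D shape reconstruction"]   # about 4.3 times
```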
[0046]
FIGS. 17A to 17D are actual images captured by four stereo cameras. (A) is an image of the camera 1, (b) is an image of the camera 2, (c) is an image of the camera 3, and (d) is an image of the camera 4.
FIGS. 18(a) to 18(l) are arbitrary-viewpoint images generated from the images shown in FIG. 17 using the method of the present embodiment. (a) to (d) are images generated at intermediate viewpoints between camera 1 and camera 2; the ratios shown below the images, such as 10:0 and 7:3, are the color-mixture weights described above. Similarly, (e) to (h) are images at intermediate viewpoints between cameras 2 and 3, and (i) to (l) are images at intermediate viewpoints between cameras 3 and 4.
As shown in the respective images in FIG. 18, if the method using the projection grid space of the present embodiment is used, it is possible to generate an image comparable to the conventional method using the Euclidean space that requires camera calibration.
[0047]
FIGS. 19 and 20 show an experiment comparing the processing results obtained with the projection grid space technique conventionally proposed by the inventors (which uses any two of the multi-view cameras as base cameras) with the processing results obtained with the projection grid space technique of the present embodiment.
FIGS. 19A to 19D are images taken by the cameras 1 to 4, respectively. Actually, the two persons A and B serving as subjects have substantially the same height.
FIGS. 20A and 20B are diagrams showing the results of the processing. FIG. 20A is an image generated by using cameras 1 and 4 as the base cameras (that is, by the conventional projection grid space technique). On the other hand, FIG. 20B is an image generated by using two separately prepared cameras as the base cameras (that is, by the projection grid space technique of the present embodiment).
[0048]
In FIG. 20A, the reconstructed three-dimensional shape is affected by the distortion of the projection grid space defined by cameras 1 and 4: the height of person A is 178 voxels and the height of person B is 158 voxels, so the two heights differ.
On the other hand, in FIG. 20B, persons A and B are reconstructed with almost the same height (A is 209 voxels and B is 201 voxels), as in reality. This is because, with the method of the present embodiment, the epipolar lines of the two base cameras are almost parallel, so that almost no distortion occurs.
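The size of the height mismatch in the two reconstructions can be compared with a short calculation. The measure used here, the height difference divided by the taller height, is chosen for this example and is not defined in the source.

```python
def relative_height_gap(h_a, h_b):
    """Height difference of two subjects relative to the taller one."""
    return abs(h_a - h_b) / max(h_a, h_b)


# Voxel heights of persons A and B as reported in the text.
conventional = relative_height_gap(178, 158)   # FIG. 20A: about 0.112
proposed = relative_height_gap(209, 201)       # FIG. 20B: about 0.038
```

By this measure the mismatch between two subjects of substantially the same real height falls from roughly 11% to roughly 4%.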
The white line in FIG. 20A represents a cross section of the pr plane in the projection grid space. The straight line on the pr plane is displayed as an epipolar line on each image in FIG. 19 (a) to 19 (c) are shown as white lines, and FIG. 19 (d) are shown as white dots. However, in FIG. 20A, the person B is standing on the white line representing the pr plane, while the person A has his feet below the white line.
On the other hand, the white line in FIG. 20B also represents a cross section of the pr plane in the projection grid space similarly to FIG. 20A, but both the persons A and B are on the white line representing the pr plane. Is standing.
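The height comparison above can be reproduced with a small sketch. It assumes that a reconstructed person is available as a set of occupied voxel coordinates (p, q, r) and that q is the vertical axis, an assumption suggested by the persons standing on the pr plane; the helper names and the synthetic voxel columns are hypothetical.

```python
def height_in_voxels(occupied, axis=1):
    """Extent of a set of occupied voxels along one grid axis (q assumed vertical)."""
    if not occupied:
        return 0
    coords = [voxel[axis] for voxel in occupied]
    return max(coords) - min(coords) + 1


def relative_difference(height_1, height_2):
    """Height gap as a fraction of the taller object."""
    return abs(height_1 - height_2) / max(height_1, height_2)


# Synthetic stand-ins for persons A and B: single voxel columns with the
# heights reported for FIG. 20B (209 and 201 voxels).
person_a = {(0, q, 0) for q in range(209)}
person_b = {(5, q, 0) for q in range(201)}

h_a = height_in_voxels(person_a)
h_b = height_in_voxels(person_b)
gap = relative_difference(h_a, h_b)  # 8 / 209, roughly 0.038
```

For the heights reported for FIG. 20B, the gap is under 4 percent of the taller figure, which is consistent with the statement that the two persons are reconstructed at almost the same height.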
[0049]
[Effect of the Invention]
In the present invention, two cameras for which the distortion of perspective projection can be ignored are prepared separately from the multi-view cameras and used as base cameras. Using the epipolar geometry between the multi-view cameras and the two base cameras, the three-dimensional shape of the object is reconstructed (restored) in the three-dimensional coordinate system defined by the base cameras. Distortion of the reconstructed three-dimensional shape can thereby be prevented.
[Brief description of the drawings]
FIG. 1A illustrates an example of a conventional multi-view camera system in which a target space is defined by a Euclidean space.
(B) is a diagram showing an example of a conventional multi-view camera system in which a target space is defined by a projection grid space.
FIG. 2 is a diagram illustrating projection of a point A in a projection grid space onto a base camera image.
FIG. 3 is a diagram illustrating projection of a point A in a projection grid space onto a camera image other than the base camera.
FIG. 4 is a diagram illustrating an example of a camera arrangement according to the present invention.
FIG. 5 is a diagram showing a projection grid space defined in the present invention.
FIG. 6 is a diagram illustrating an example of a system configuration of a multi-view camera system according to the present embodiment.
FIG. 7 is a flowchart illustrating a flow of a process according to the embodiment.
FIG. 8A is a color image including an object (person).
(B) A color image that does not include a target object (only a background).
FIG. 9A is a parallax image including an object (person).
(B) A parallax image that does not include a target object (only the background).
FIG. 10A is a silhouette image generated using only a color image without using a parallax image.
(B) A silhouette image generated using a color image and a parallax image.
FIG. 11 is a diagram showing how an octree model is generated.
FIG. 12 is a diagram for explaining an intersection check method.
FIG. 13A is a diagram illustrating a result of an intersection check in a stack for each camera.
FIG. 13B is a diagram illustrating the result of the intersection check in an octree of each space.
FIGS. 14A and 14B are diagrams illustrating a method of removing internal voxels.
FIG. 15 is a diagram illustrating a technique for determining a color of a surface voxel.
FIG. 16 is a diagram showing the time required for each processing in an experiment using the present embodiment.
FIGS. 17A to 17D are images captured by four stereo cameras, respectively, in an experiment using the present embodiment.
FIGS. 18(a) to 18(l) are arbitrary-viewpoint images generated from the images of FIG. 17 using the method of the present embodiment.
FIGS. 19A to 19D are images captured by four stereo cameras, respectively, in an experiment using the present embodiment.
FIG. 20A is an arbitrary-viewpoint image generated using camera 1 and camera 4 as base cameras.
(B) An arbitrary-viewpoint image generated using two separately prepared cameras as base cameras.

Claims

A multi-view camera system comprising:
a plurality of image acquisition means for obtaining images from a plurality of multi-view cameras directed at an object and from two base cameras installed orthogonally to each other;
associating means for associating the two base cameras with the plurality of multi-view cameras; and
restoration processing means for restoring, using the association, a three-dimensional shape of the object from the images obtained from the plurality of multi-view cameras, in a three-dimensional projection grid space defined by the two base cameras.

The multi-view camera system according to claim 1,
wherein the plurality of image acquisition means generate a silhouette image of the object from an image obtained from the multi-view camera and output the silhouette image together with a color image of the object.

The multi-view camera system according to claim 2,
wherein the restoration processing means creates a three-dimensional model from the silhouette image using an octree model generation technique.

The multi-view camera system according to claim 3,
wherein the restoration processing means removes internal voxels from among the voxels of the three-dimensional model generated by the octree model generation technique and colors the surface voxels using the color image of the object, thereby generating an image of the object as observed from an arbitrary viewpoint.

The multi-view camera system according to any one of claims 1 to 4,
wherein the plurality of multi-view cameras are stereo cameras capable of obtaining a parallax image, and
the plurality of image acquisition means generate a silhouette image of the object from the parallax image and the color image obtained from the multi-view camera and output the silhouette image together with the color image of the object.
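The removal of internal voxels recited in claim 4 can be illustrated with a minimal sketch. It assumes that a voxel counts as internal when all six face-adjacent neighbors are occupied; this neighborhood rule, the set-of-tuples voxel representation, and the function name `surface_voxels` are assumptions made for illustration and may differ from the method of the embodiment.

```python
# Offsets to the six face-adjacent neighbors of a voxel (assumed neighborhood).
NEIGHBOR_OFFSETS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def surface_voxels(occupied):
    """Return the voxels that have at least one empty face-adjacent neighbor."""
    surface = set()
    for (p, q, r) in occupied:
        for dp, dq, dr in NEIGHBOR_OFFSETS:
            if (p + dp, q + dq, r + dr) not in occupied:
                surface.add((p, q, r))
                break
    return surface


# A solid 3 x 3 x 3 block: only the center voxel is fully enclosed.
solid_cube = {(p, q, r) for p in range(3) for q in range(3) for r in range(3)}
shell = surface_voxels(solid_cube)
```

Only the voxels that remain in `shell` would need to be colored from the color images, which reduces the number of voxels handled when rendering an arbitrary viewpoint.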