JP2024055772A

JP2024055772A - Computer vision method and system

Info

Publication number: JP2024055772A
Application number: JP2023139292A
Authority: JP
Inventors: ロゴテティスフォティオス; メッカロベルト; ブドヴィティスイグナス; シポラロベルト
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2022-10-07
Filing date: 2023-08-29
Publication date: 2024-04-18
Also published as: GB2623119A; GB202214751D0

Abstract

Problem: To provide a computer vision method for generating a three-dimensional reconstruction of an object. Solution: The method includes receiving a first set of photometric stereo images of the object, the first set including at least one image taken by a camera at a first position using illumination from different directions, and a second set of photometric stereo images of the object, the second set including at least one image taken by a camera at a second position using illumination from different directions; generating a first normal map of the object from the first set and a second normal map of the object from the second set; and determining a stereo estimate of the shape of the object by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map. The normal maps are used together with the stereo estimate of the shape of the object to generate a reconstruction of the object. Selected drawing: FIG. 5

Description

Embodiments relate to computer vision systems and methods for performing 3D imaging of objects.

Many computer vision tasks require recovering an accurate 3D reconstruction of an object from the way the object reflects light. However, reconstructing 3D geometry is challenging because global illumination effects such as cast shadows, self-reflections, and ambient light come into play, especially for specular surfaces.

Photometric stereo is a long-standing problem in computer vision. Recent methods have achieved impressive normal estimation accuracy on both real and synthetic datasets. Progress in the quality and practicality of the estimated shape is less convincing, however, because it is severely limited by inaccuracies in recovering the global geometry, especially when dealing with general light reflections.

When a photometric stereo technique relies only on images acquired from a single camera position (monocular photometric stereo), the accuracy of the reconstruction usually depends on a rough estimate of the distance between the camera and the object of interest. For example, an initial estimate of the object geometry is obtained before or after image acquisition, and a depth map is initialized based on that initial object geometry.

FIG. 1 is a schematic diagram of a system according to an example useful for understanding the present invention.
FIG. 2 is a schematic diagram showing an arrangement of a camera and light sources for performing 3D imaging of an object.
FIG. 3 is a high-level schematic of a method for recovering surface normals from a set of photometric stereo images of an object or scene.
FIG. 4 is a schematic diagram showing an arrangement of cameras and light sources for performing 3D imaging of an object.
FIG. 5 is a flow diagram of a method for 3D reconstruction according to one embodiment.
FIG. 6 is a schematic diagram of a CNN that can be used to extract a normal map from an observation map.
FIG. 7 illustrates the operating principle of a reconstruction method according to one embodiment.
FIG. 8 illustrates steps of a method according to one embodiment.
FIG. 9 shows results of the method compared with monocular (single-view) photometric stereo.
FIG. 10 is a schematic diagram of a system according to one embodiment.

In a first aspect, there is provided a computer vision method for generating a three-dimensional reconstruction of an object, the method comprising:
receiving a first set of photometric stereo images of the object and a second set of photometric stereo images of the object, the first set including at least one image taken by a first camera at a first position using illumination from different directions and the second set including at least one image taken by a second camera at a second position using illumination from different directions;
generating a first normal map of the object using the first set of photometric stereo images;
generating a second normal map of the object using the second set of photometric stereo images;
determining a stereo estimate of the shape of the object by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map; and
using the first normal map and the second normal map together with the stereo estimate of the shape of the object to generate a reconstruction of the object.

The disclosed method is tied to computer technology and addresses a technical problem arising in the field of computing, namely generating a three-dimensional reconstruction of an object or scene. The disclosed method addresses this problem by improving the quality of the reconstruction. The improvement is provided by a binocular photometric stereo approach that merges the strength of photometric stereo in estimating dense local shape variations with the strength of structure from motion in providing sparse but accurate depth estimates.

The object may be placed within the depth range of the camera of the capture device, and the distance between the object and the camera (i.e., the depth z) is measured approximately with a ruler. The depth map is then initialized by setting the depth of every point to a constant value, namely the value estimated with the ruler or with another method of estimating the mean distance from the camera to the object. Other methods of initializing the depth may be used, for example a CAD model or a Kinect-type depth sensor. Any additional sophistication in estimating the initial object geometry may reduce the execution time of the method, for example by a single iteration.

In short, an inherent ambiguity in photometric stereo arises from having to determine the approximate depth of the object of interest from the camera. This is particularly important for real-world applications, where the scale factor must be determined so that the geometry can be measured.

To address this problem at least in part, a binocular photometric stereo approach is described herein that merges the strength of photometric stereo in estimating dense local shape variations with the strength of structure from motion in providing sparse but accurate depth estimates.

The embodiments described herein perform matching on normals rather than on the original two sets of images, which allows more reliable stereo matching that remains robust even for objects that have no texture or gloss. As a result, constraining the reconstruction of the object to the pairwise correspondences from the stereo matching step gives a more accurate representation of the object geometry, which in turn provides a better initial estimate for the iterative procedure that updates the object geometry until it converges to a single geometry.

In one embodiment, the first normal map and the second normal map are generated using an estimate of the shape of the object to recalculate the light distribution, accounting for near-field effects of the illumination on the object.

Once a first reconstruction of the object has been generated, the method includes recalculating the light distribution due to near-field effects using the reconstruction of the object, recalculating the first normal map and the second normal map from the recalculated light distribution, and generating a further reconstruction of the object.

The above may be used iteratively, wherein recalculating the light distribution due to near-field effects includes:
(a) using the reconstruction of the object to recalculate the first normal map and the second normal map from the recalculated light distribution;
(b) determining a further estimate of the shape of the object by performing stereo matching between patches of normals in the recalculated first normal map and patches of normals in the recalculated second normal map;
(c) generating a further reconstruction of the shape from at least one of the recalculated first normal map and the recalculated second normal map; and
repeating (a) through (c) until the further reconstruction of the object converges.
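
The iteration over steps (a) through (c) can be sketched as a simple fixed-point loop. This is only an illustrative skeleton: the callables `estimate_normals`, `stereo_match`, and `reconstruct` are hypothetical placeholders for the three steps, and the stopping rule (largest absolute depth change below `tol`) is an assumption, since the text above only requires that the reconstruction converges.

```python
import numpy as np


def iterate_reconstruction(initial_depth, estimate_normals, stereo_match,
                           reconstruct, tol=1e-3, max_iters=10):
    """Repeat steps (a)-(c) until the depth estimate stops changing.

    All three callables are hypothetical stand-ins supplied by the caller:
      estimate_normals(depth)        -> (normals_1, normals_2)   step (a)
      stereo_match(n1, n2, depth)    -> stereo estimate          step (b)
      reconstruct(n1, n2, stereo)    -> new depth array          step (c)
    """
    depth = np.asarray(initial_depth, dtype=float)
    iterations = 0
    for iterations in range(1, max_iters + 1):
        n1, n2 = estimate_normals(depth)             # (a) normals from updated lighting
        stereo = stereo_match(n1, n2, depth)         # (b) match normal patches
        new_depth = np.asarray(reconstruct(n1, n2, stereo), dtype=float)  # (c)
        change = float(np.max(np.abs(new_depth - depth)))
        depth = new_depth
        if change < tol:                             # assumed convergence test
            break
    return depth, iterations
```

Passing the steps in as functions keeps the loop independent of how the normal maps are actually estimated, for example by a neural network.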

In one embodiment, using the first normal map and the second normal map together with the stereo estimate of the shape of the object to generate a reconstruction of the object includes:
integrating the first normal map, subject to the constraints of the stereo estimate of the shape, to generate a first reconstruction;
integrating the second normal map, subject to the constraints of the stereo estimate of the shape, to generate a second reconstruction; and
combining the first reconstruction and the second reconstruction to generate a fused reconstruction, which is the reconstruction of the object.
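
As a rough illustration of integrating one normal map subject to sparse stereo depth constraints, the sketch below solves a small linear least-squares problem. It is not the method of the embodiment: it assumes orthographic projection, a unit pixel spacing, and a dense solver that is only practical for small maps. The function name and the `weight` parameter are hypothetical.

```python
import numpy as np


def integrate_with_constraints(normals, sparse_depth, weight=1.0):
    """Least-squares depth from normals plus sparse depth anchors.

    normals:      (H, W, 3) unit normals, z component non-zero.
    sparse_depth: (H, W) depths from stereo matching, NaN where unknown.
    Assumes orthographic projection, so dz/dx = -nx/nz and dz/dy = -ny/nz.
    """
    h, w, _ = normals.shape
    p = -normals[..., 0] / normals[..., 2]   # dz/dx
    q = -normals[..., 1] / normals[..., 2]   # dz/dy
    idx = np.arange(h * w).reshape(h, w)
    rows, rhs = [], []

    def add_equation(coeffs, value):
        row = np.zeros(h * w)
        for i, c in coeffs:
            row[i] = c
        rows.append(row)
        rhs.append(value)

    for y in range(h):
        for x in range(w):
            if x + 1 < w:   # forward difference along x
                add_equation([(idx[y, x + 1], 1.0), (idx[y, x], -1.0)],
                             0.5 * (p[y, x] + p[y, x + 1]))
            if y + 1 < h:   # forward difference along y
                add_equation([(idx[y + 1, x], 1.0), (idx[y, x], -1.0)],
                             0.5 * (q[y, x] + q[y + 1, x]))
            if not np.isnan(sparse_depth[y, x]):   # stereo depth anchor
                add_equation([(idx[y, x], weight)],
                             weight * sparse_depth[y, x])

    z, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return z.reshape(h, w)
```

The anchors fix the absolute scale that normals alone cannot provide. Running this once per view gives the two reconstructions that are subsequently fused.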

The fused reconstruction may be generated using a Poisson solver.

Performing stereo matching on the first normal map and the second normal map may include:
selecting at least one group of pixels on the first normal map; and
searching for a matched group of pixels in the second normal map by scanning across the second normal map for a match.

To reduce the amount of data processed while searching for a match, the scan across the second normal map may be performed along epipolar lines. The search for a matched pixel group may also be constrained by the current estimated reconstruction of the object, so that the search becomes more efficient as the shape of the object converges.
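
A minimal sketch of patch matching between two normal maps along an epipolar line follows. It assumes the two views have been rectified so that epipolar lines coincide with image rows, and it uses the mean of one minus the dot product of corresponding normals as the matching cost; both are illustrative assumptions rather than details given above. The optional `col_range` argument stands in for a search window derived from the current reconstruction.

```python
import numpy as np


def match_along_row(normals_a, normals_b, row, col, half=2, col_range=None):
    """Find the column in normals_b whose patch best matches (row, col) in normals_a.

    normals_a, normals_b: (H, W, 3) unit normal maps of rectified views.
    half:      patch half-width, giving (2*half+1) x (2*half+1) patches.
    col_range: optional (lo, hi) inclusive window restricting the search.
    Returns (best_col, best_cost); a cost of 0 means identical normals.
    """
    _, w, _ = normals_a.shape
    patch_a = normals_a[row - half:row + half + 1, col - half:col + half + 1]
    lo, hi = (half, w - half - 1) if col_range is None else col_range
    lo, hi = max(lo, half), min(hi, w - half - 1)
    best_col, best_cost = -1, np.inf
    for c in range(lo, hi + 1):                     # scan along the epipolar line
        patch_b = normals_b[row - half:row + half + 1, c - half:c + half + 1]
        # 1 - cos(angle) between corresponding normals, averaged over the patch
        cost = float(np.mean(1.0 - np.sum(patch_a * patch_b, axis=-1)))
        if cost < best_cost:
            best_col, best_cost = c, cost
    return best_col, best_cost
```

Narrowing `col_range` as the shape estimate improves reduces the number of candidate patches that must be compared.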

In a further embodiment, performing stereo matching on the first normal map and the second normal map includes:
performing patch warping on at least one group of pixels of the first normal map; and
determining a corresponding group of pixels in the second normal map using the patch-warped group of pixels of the first normal map.

In a further embodiment, searching for a matched group of pixels in the second normal map by scanning across the second normal map for a match is used to generate a first partial stereo estimate, and the method further comprises:
selecting at least one group of pixels on the second normal map;
searching for a matched group of pixels in the first normal map by scanning across the first normal map for a match, to generate a second partial stereo estimate; and
combining the first partial stereo estimate and the second partial stereo estimate to form the stereo estimate of the shape, wherein points on which the first and second partial stereo estimates do not agree are discarded.

This allows two stereo maps to be generated and then combined in a way that is robust to errors, because points for which the two partial stereo maps indicate different depths can be discarded.
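
The discarding of inconsistent points can be sketched as a cross-check between the two partial estimates. The sketch assumes rectified views and expresses each partial estimate as a disparity map, which is an assumption made for illustration; the function name and the tolerance `tol` are hypothetical.

```python
import numpy as np


def cross_check(disp_ab, disp_ba, tol=1.0):
    """Keep only matches on which both search directions agree.

    disp_ab: (H, W) disparity for view A; pixel (r, c) in A matches
             column c - disp_ab[r, c] in view B.
    disp_ba: (H, W) disparity for view B, found by the reverse search.
    Returns (filtered, valid): disp_ab with NaN at discarded points,
    and the boolean mask of retained points.
    """
    h, w = disp_ab.shape
    rows, cols = np.mgrid[0:h, 0:w]
    target = np.rint(cols - disp_ab).astype(int)       # matched column in B
    inside = (target >= 0) & (target < w)
    back = disp_ba[rows, np.clip(target, 0, w - 1)]    # what B says at that pixel
    valid = inside & (np.abs(disp_ab - back) <= tol)
    return np.where(valid, disp_ab, np.nan), valid
```

Points that fall outside the second view or disagree by more than `tol` are dropped, leaving a sparser but more reliable stereo estimate.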

The first set of photometric stereo images may include a first plurality of images taken by the first camera at the first position using illumination from different directions, and the second set may include a second plurality of images taken by the second camera at the second position using illumination from different directions.

In one embodiment, generating the first normal map includes inputting information representing the first set of photometric stereo images, the estimate of the shape of the object, and position information for the light sources into a neural network trained to output the first normal map.

For example, the information representing the first set of photometric stereo images, the estimate of the shape of the object, and the position information for the light sources may be provided in the form of observation maps. An observation map is generated for each pixel of the first camera, and each observation map includes a projection of the lighting directions onto a 2D plane, the lighting direction for each pixel being obtained from each photometric stereo image.

In a further embodiment, there is provided a system for generating a three-dimensional (3D) reconstruction of an object, the system comprising an interface and a processor, wherein:
the interface has an image input and is configured to receive a set of photometric stereo images of the object, the set including a plurality of images taken using illumination from different directions with one or more light sources; and
the processor is configured to:
receive a first set of photometric stereo images of the object and a second set of photometric stereo images of the object, the first set including at least one image taken by a first camera at a first position using illumination from different directions and the second set including at least one image taken by a second camera at a second position using illumination from different directions;
generate a first normal map of the object using the first set of photometric stereo images;
generate a second normal map of the object using the second set of photometric stereo images;
determine a stereo estimate of the shape of the object by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map; and
use the first normal map and the second normal map together with the stereo estimate of the shape of the object to generate a reconstruction of the object.

In the above system, the first camera is a different camera from the second camera. In a further embodiment, a single camera is used that is moved between the first position and the second position.

A carrier medium may be provided carrying computer-readable instructions adapted to cause a computer to perform the method described above.

FIG. 1 shows a schematic diagram of a system that may be used to capture three-dimensional (3D) image data of an object and to reconstruct the object. As used herein, the term "object" denotes whatever is being imaged. It should be understood, however, that the term may encompass multiple objects, a scene, or a combination of objects and a scene.

3D image data of the object 10 is captured using an apparatus 11. Further details of the apparatus 11 are given with reference to FIG. 2.

The 3D image data captured by the apparatus 11 is sent to a computer 12, where it is processed. In FIG. 1, the computer 12 is shown as a desktop computer, but it should be understood that it may be any processor, for example a distributed processor or the processor of a mobile phone. Details of an exemplary processor are described later in this description with reference to FIG. 10.

The system of FIG. 1 may be installed in an existing hardware setup and used for quality control of the processes of that setup. Such setups include, but are not limited to, 3D printing setups and industrial pipelines. For example, the system of FIG. 1 may be installed in a 3D printer setup, where it is used to perform quality control of the printing process. More specifically, the system may capture photometric stereo images of intermediate results of the printing process in order to verify that the process is being carried out correctly. The system may also be implemented as a handheld device used by a user to obtain 3D models of everyday objects.

FIG. 2 shows an exemplary arrangement of the apparatus 11 of FIG. 1 that can be used for photometric stereo. FIG. 2 shows a mount that holds a camera 20, which has a plurality of pixels, and a plurality of light sources 21 in a fixed relationship to one another and to the camera 20. This arrangement of the apparatus 11 allows the camera 20 and the light sources 21 to be moved together while maintaining a fixed separation from one another.

In one particular exemplary arrangement of the apparatus 11, the light sources 21 are arranged around the camera 20. It is understood, however, that the light sources 21 and the camera 20 may be arranged differently. The camera 20 is used together with the light sources 21 to obtain photometric stereo data of the object 10. The individual light sources 21 are activated one after another so that the camera 20 can capture the photometric stereo data.

In one particular arrangement of the apparatus 11, a FLEA 3.2 megapixel camera with an 8 mm lens is used as the camera 20. The camera 20 is rigidly mounted on a printed circuit board. The printed circuit board further carries 16 bright white LEDs used as the light sources 21, arranged coplanar with the image plane and around the camera 20 at a maximum distance of 6.5 centimeters.

The apparatus 11 may be used to capture photometric stereo images of the object 10. The object 10 is placed in front of the camera 20, within the depth range of the camera 20. The depth range, also called the depth of field (DOF), is the range of distances between the camera 20 and the object 10 over which the object 10 is in focus. If the object 10 is too close to or too far from the camera 20, and therefore outside the depth range, it is out of focus and its details cannot be resolved. For example, a camera 20 fitted with an 8 mm lens may have a depth range between 5 and 30 centimeters.

The light sources 21 are activated individually, one after another, so that the camera 20 can capture photometric stereo data of the object under different lighting conditions. This is achieved by switching on only one light source at a time. For example, the apparatus 11 may include 16 LEDs, in which case a set of 16 photometric stereo images may be captured.

In photometric stereo, each light source 21 is activated individually to illuminate the object 10, so that photometric stereo data of the object is captured under different lighting conditions. This is achieved by switching on only one light source at a time. The amount of light reflected from the surface or object onto each pixel of the camera 20 is then measured. For example, the apparatus 11 may include 16 LEDs, in which case a set of 16 photometric stereo images may be captured.

Photometric stereo allows surface normals to be estimated. In its most basic form, for any point on a Lambertian surface, the intensity I of the reflected light at the camera can be expressed as

$$I = k\,(L \cdot n)$$

where I is the intensity of the reflected light at the camera, L is the normalized vector corresponding to the direction of the illuminating light, n is the normal to the surface at that point, and k is the albedo (reflectivity) at that point. By solving the above for three or more different light directions, the surface normal n can be estimated. Once the surface normals have been determined for all points on the surface, the surface can be reconstructed by integrating the normals.
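
To make the basic Lambertian case concrete, the sketch below estimates a per-pixel normal and albedo by least squares from three or more images. It assumes distant lights with known unit directions and ignores shadows and specular highlights, which are precisely the simplifications the following paragraphs move away from. The function name is hypothetical.

```python
import numpy as np


def lambertian_normals(intensities, light_dirs):
    """Classical Lambertian photometric stereo by least squares.

    intensities: (m, H, W) images, one per light direction, m >= 3.
    light_dirs:  (m, 3) unit light direction vectors.
    Returns (normals, albedo) with shapes (H, W, 3) and (H, W).
    """
    m, h, w = intensities.shape
    stacked = intensities.reshape(m, -1)                 # (m, H*W)
    # Solve light_dirs @ b = I for b = k * n at every pixel at once
    b, *_ = np.linalg.lstsq(light_dirs, stacked, rcond=None)   # (3, H*W)
    albedo = np.linalg.norm(b, axis=0)                   # k = |b|
    normals = b / np.maximum(albedo, 1e-12)              # n = b / k
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```

With exactly three non-coplanar light directions the system is square; with more images the least-squares solution averages out noise.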

In practice the situation is more complicated, because surfaces are unlikely to have a constant albedo. Near-field light attenuation can also affect the shape of the object recovered using photometric stereo.

To model this, the following may be used. Each set of photometric stereo images contains m images, where the number of images corresponds to the number of illumination directions (which, in this example, corresponds to the number of light sources used). Each image $i_{j,p}$, for $j = 1, \dots, m$, can be viewed as a set of pixels p. For each of the m light sources, the direction $L_j$ and the brightness $\Phi_j$ are known.

As mentioned above, near-field light attenuation affects how light is reflected from the object. Near-field light attenuation is modeled using the following nonlinear radiation model of dissipation:

$$a_m(X) = \Phi_m \frac{\left(\hat{L}_m(X) \cdot \hat{S}_m\right)^{\mu_m}}{\lVert L_m(X) \rVert^{2}}$$

where $\Phi_m$ is the intrinsic brightness of the light source, $\hat{S}_m$ is the principal direction indicating the orientation of the LED point light source, $\mu_m$ is the angular dissipation factor, and the lighting direction is expressed as

$$\hat{L}_m(X) = \frac{L_m(X)}{\lVert L_m(X) \rVert}$$

A calibrated point light source is assumed at position $P_m$ relative to the camera center at the point 0, which results in a variable light vector $L_m = P_m - X$, where X is the 3D coordinate of the surface point, written $X = [x, y, z]^T$.
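
The sketch below evaluates a near-field attenuation factor for one point light source from the quantities named above: the light position P, the surface points X, the principal direction S, the brightness and the angular dissipation factor. The specific form used, brightness times the angular term raised to the dissipation factor and divided by the squared distance, is an assumption about the model, as is the sign convention that S is oriented so that the dot product with the lighting direction is positive for lit points. All names are hypothetical.

```python
import numpy as np


def near_field_attenuation(X, P, S, phi, mu):
    """Attenuation of one point light source at a set of surface points.

    X:   (N, 3) surface points in camera coordinates.
    P:   (3,) light source position relative to the camera center.
    S:   (3,) principal direction of the LED (normalized internally).
    phi: intrinsic brightness of the source.
    mu:  angular dissipation factor.
    Returns (a, L_hat): attenuation (N,) and unit lighting directions (N, 3).
    """
    L = P - X                                   # variable light vector L = P - X
    dist = np.linalg.norm(L, axis=1)
    L_hat = L / dist[:, None]                   # lighting direction
    S_hat = S / np.linalg.norm(S)
    # Angular fall-off, clipped so that a fractional mu never sees a negative base
    cos_angle = np.clip(L_hat @ S_hat, 0.0, None)
    a = phi * cos_angle ** mu / dist ** 2       # inverse-square distance fall-off
    return a, L_hat
```

Because the light vector depends on X, the attenuation must be recomputed whenever the shape estimate changes, which is what ties the lighting model to the reconstruction loop.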

The viewing vector $\hat{V}$ is defined as

$$\hat{V} = -\frac{X}{\lVert X \rVert}$$

The general image irradiance equation is expressed as

$$i_m = a_m\, B(N, L_m, V, \rho)$$

where $a_m$ is the attenuation factor of light source m, N is the surface normal, B is assumed to be a general bidirectional reflectance distribution function (BRDF), and ρ is the surface albedo. The albedo ρ and the images are RGB, and the reflectance differs per channel, which allows for the most general case. Global illumination effects such as shadows and self-reflections can also be incorporated into B.

To allow the above to be modeled, the photometric stereo images are converted into observation maps. Observation maps are generated to describe the illumination information on a per-pixel basis. In one embodiment, an observation map is generated for each pixel. Each observation map includes a projection of the lighting directions onto a 2D plane, the lighting direction for each pixel being obtained from each photometric stereo image.

The details of how an observation map may be constructed are explained later. In short, however, the observation map is generated using the light vector $L_m$. As mentioned above, this vector is defined relative to the 3D position of the point on the object surface that corresponds to the pixel. The shape of the object therefore affects $L_m$ and, in turn, the observation map. When the observation maps are generated for the first time, a simple estimate of the shape is used. For example, all pixels may be assumed to lie at a constant distance from the camera.

The observation map is intended to account for every illumination direction at each surface pixel and to allow the information from a variable number of images to be merged into a single d × d image map.

Given multiple images, for each pixel p of the camera 20, its value in image j is denoted $i_{j,p}$. All observations of the pixel across the images taken under varying illumination are combined into a single d × d × 4 map O, where d is the dimension of the map. How this is achieved is described later.
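
As a rough sketch of the idea, the code below builds a single-channel d × d observation map for one pixel by projecting each unit lighting direction onto the x-y plane and writing the pixel's intensity at the resulting grid cell. This simplifies the map described above: the source describes a d × d × 4 map whose channel contents are not specified in this section, so the single channel, the rounding scheme, and the normalization by the maximum intensity are all assumptions made for illustration.

```python
import numpy as np


def observation_map(pixel_values, light_dirs, d=32):
    """Single-channel observation map for one pixel (illustrative only).

    pixel_values: (m,) intensities of the pixel in the m images.
    light_dirs:   (m, 3) unit lighting directions at that pixel.
    d:            side length of the square map.
    """
    obs = np.zeros((d, d), dtype=float)
    # x and y components lie in [-1, 1]; map them to grid indices [0, d - 1]
    cols = np.rint((light_dirs[:, 0] + 1.0) * 0.5 * (d - 1)).astype(int)
    rows = np.rint((light_dirs[:, 1] + 1.0) * 0.5 * (d - 1)).astype(int)
    peak = float(np.max(pixel_values))
    values = pixel_values / peak if peak > 0 else pixel_values   # assumed normalization
    obs[rows, cols] = values        # later images overwrite on index collisions
    return obs
```

Because the map size is fixed by d rather than by the number of images, the same network input shape can be used for any number of lights.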

FIG. 3 shows a high-level diagram of the basic steps of a method for recovering surface normals from a set of photometric stereo images of an object or scene, which can take near-field lighting effects into account.

Images of the object are obtained using the apparatus 11 according to the procedure described above. Alternatively, the photometric stereo images may be obtained from a remote photometric stereo imaging apparatus, communicated to the computer 12, and thus provided as input.

For each pixel of the photometric images, an observation map is rendered by combining the observations of that pixel from all of the photometric images onto the observation map (as described above). The observation maps are then processed to generate a normal (orientation) map. In one embodiment, the observation maps may be processed by a convolutional neural network (CNN). A possible configuration of such a CNN is described with reference to FIG. 6.

Figure 4 shows an apparatus 40 according to one embodiment. It shows a mount that holds a first camera 41 and a second camera 42. The mount further holds a plurality of light sources 43 in a fixed relationship to each other and to the first camera 41 and the second camera 42. This arrangement allows the first camera 41, the second camera 42, and the light sources 43 to be moved together while maintaining a fixed separation from each other.

In one embodiment, 15 LEDs are used as the light sources, but any number and arrangement of light sources may be used as long as their positions relative to each other and to the first camera 41 and the second camera 42 are known.

Although Figure 4 shows an apparatus with a first camera 41 and a second camera 42, the described method may also be performed with a single camera. In that case, the single camera is configured to move between the position of the first camera 41 and the position of the second camera 42 in Figure 4, for a given position of the LEDs and for known distances between the first position and the second position, between the first position and the LEDs, and between the second position and the LEDs.

Figure 5 is a flow chart outlining a method according to one embodiment.

In step S501 (data capture with camera 1), a first set of photometric stereo images of the object is acquired. For example, the first set may be acquired from the first camera (camera 1) shown in Figure 4 at a first position. At the same time, in step S503 (data capture with camera 2), a second set of photometric stereo images is acquired from the second camera (camera 2) shown in Figure 4 at a second position. Both sets of images are acquired under known light (LED) positions and known camera positions, as described above.

In an alternative embodiment, the two sets of photometric stereo images of the object are retrieved from a memory device, having been acquired in advance, before the described method is executed. In a further variation, there is only one camera, and the same camera acquires the first set of photometric stereo images at the first position and the second set at the second position.

For ease of reference in the following description, the first set of photometric stereo images may be referred to as the left stereo images and the second set as the right stereo images. However, the first position may be to the right of the second position, above it, below it, or at any other position away from it. The terms first set, second set, left, and right are used only to distinguish an image or set of images obtained from a camera at the first position from an image or set of images obtained from a camera at the second position.

Each set of photometric stereo images contains m images, where m corresponds to the number of light sources used. Each image i_{j,p}, for j = 1, …, m, can be seen as a set of pixels p. For each of the m light sources, the direction L_j and the brightness Φ_j are known and are used to calculate the normal N_p.

In step S505, a normal map is generated from the data captured in step S501. This is called the left normal map N_L. In step S507, a normal map is generated from the data captured in step S503. This is called the right normal map N_R.

In this embodiment, the left and right normal maps are generated using the method described above, with the left observation map generated from the data from camera 1 and the right observation map generated from the data from camera 2. To generate these initial observation maps, an estimate z_est of the initial object geometry is used. This may be a very rough depth initialization (accurate perhaps to within a few centimeters). The estimate of the object geometry comprises a depth map with a constant value corresponding to z_est.

The two sets of photometric images provided by S501 and S503 are assumed to provide a number of calibrated stereo pairs, each consisting of a left view and a right view (the views from the first and second camera positions).

Assuming a camera focal length f and a stereo baseline b, the well-known relationship between depth z and disparity d applies:

$$z = \frac{f\,b}{d} \qquad (5)$$
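As a small illustration of the depth-disparity relation z = f b / d for a rectified stereo pair, the sketch below converts a disparity into a depth. The function name, the units (focal length and disparity in pixels, baseline in metres), and the numbers are assumptions chosen for the example, not values from the embodiment.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Return depth z = f * b / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to triangulate a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical numbers: f = 1200 px, b = 0.25 m, d = 60 px
z = depth_from_disparity(1200.0, 0.25, 60.0)
print(z)  # 5.0
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more depth accuracy for distant points than for near ones.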

Given the photometric stereo images for both the first and second positions and the depth initialization z_est, the left and right normal maps N_L and N_R can be calculated.

In step S509, stereo matching (or disparity estimation) is performed on the left and right normal maps to output a sparse set of pairwise, shape-preserving correspondences between them.

The normal maps N_L and N_R from S505 and S507 are expected to be accurate to within a few degrees of the ground truth. However, the depth maps Z_0L and Z_0R, which are calculated from N_L and N_R respectively by numerical integration, suffer from a scale ambiguity. This is an inevitable consequence of the single-view version of the problem, because numerical integration preserves the overall mean depth. Thus, for an almost perfectly integrable surface with true depth Z_T, the integrated depth is expected to agree with Z_T up to a scale factor (equation (6)).

In practice, equation (6) applies only to very smooth surfaces and is very sensitive to error propagation for any reasonably sized image. This in turn means that the overall shape of the object, Z_0, will exhibit some low-frequency deformation or bending. However, if attention is restricted to small image patches, the smoothness constraint is much easier to satisfy (for patches that do not contain depth discontinuities), and equation (6) can therefore be exploited to facilitate left-right matching.

In one embodiment, sparse stereo matching (or disparity estimation) is used to generate a sparse stereo representation of the object. Formally, the goal of this stereo matching step is to find, for each pixel at position (x, y) in the left normal map N_L, the best matching position (x − d, y) in the right normal map N_R, since this directly allows the absolute surface depth to be recovered using equation (5).

Matching single pixels may be too ambiguous and therefore too sensitive to noise. In one embodiment, the goal is instead to match a patch extending w pixels around (x, y), which may be written NP_L = N_L[x − w : x + w, y − w : y + w], where NP_L is a patch of the left normal map N_L.

In one embodiment, for a given pixel patch NP_L of the left normal map centered on pixel position (x, y), the search for NP_R[x − d, y] is performed on the right normal map N_R, starting with a first candidate pixel patch at (x, y) in N_R. Further candidate matches are obtained, for example, by shifting the first candidate pixel patch to the left or right along the x dimension. The search is constrained to the epipolar line (or scan line). The epipolar constraint states that, given a pixel in one of the images, its potential conjugate in the other image lies on a straight line called the epipolar line. This constraint means that stereo matching is essentially a one-dimensional problem.

A new pixel patch is then selected on the left normal map, and the search for a match in the right normal map is performed again.

The candidate pixel patches may overlap. Each candidate pixel patch is associated with a temporary disparity d_t, which is its shift in the x direction relative to the first candidate pixel patch.

For a given pixel patch NP_L on the left, each candidate pixel patch NP_R on the right is evaluated to determine which candidate is the best match. In one embodiment, each candidate pixel patch is evaluated using the angular distance (cosine similarity of the normals). However, other metrics, such as the least-squares difference, may be used for the evaluation.
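The scan-line search over candidate patches can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: normal maps are assumed to be nested lists of unit 3-vectors indexed as map[row][column], the patch is taken as a (2w+1) × (2w+1) window, only integer non-negative disparities are tried, and the function names and the toy data are hypothetical.

```python
import math


def angular_distance(n1, n2):
    """Angle in radians between two unit normals."""
    dot = sum(a * b for a, b in zip(n1, n2))
    return math.acos(max(-1.0, min(1.0, dot)))


def patch(normal_map, x, y, w):
    """Normals in the window centred on (x, y), listed row by row."""
    return [normal_map[r][c]
            for r in range(y - w, y + w + 1)
            for c in range(x - w, x + w + 1)]


def patch_cost(p_left, p_right):
    """Mean angular distance between corresponding normals of two patches."""
    return sum(angular_distance(a, b) for a, b in zip(p_left, p_right)) / len(p_left)


def best_disparity(n_left, n_right, x, y, w, max_disp):
    """Search along the scan line y for the shift d that best matches the left patch."""
    left = patch(n_left, x, y, w)
    best_d, best_cost = None, float("inf")
    for d in range(max_disp + 1):
        if x - d - w < 0:          # candidate window would leave the image
            break
        cost = patch_cost(left, patch(n_right, x - d, y, w))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


# Toy data: the left map is the right map shifted by 3 pixels along x.
def toy_normal(c):
    theta = 0.05 * c
    return (math.sin(theta), 0.0, math.cos(theta))


n_right = [[toy_normal(c) for c in range(20)] for _ in range(5)]
n_left = [[toy_normal(c - 3) for c in range(20)] for _ in range(5)]
d_best, cost_best = best_disparity(n_left, n_right, x=10, y=2, w=1, max_disp=6)
```

Restricting the loop to a single row is what the epipolar constraint buys: for rectified views the search is one-dimensional.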

This results in pairwise, shape-preserving correspondences between the left normal map N_L and the right normal map N_R.

In one embodiment, left-right consistency of the disparity is enforced. This strengthens the disparity estimate by consulting information from the opposite view, i.e., from right to left. In one embodiment, this is achieved by taking a given pixel patch NP_L on the left and identifying the matching pixel patch NP_R in the right normal map. Then, for NP_R centered at pixel position (x, y), the search for NP_L[x − d, y] is performed on the left normal map N_L, starting with a first candidate pixel patch at (x, y) in N_L. Further candidate matches are obtained by shifting the first candidate pixel patch to the left or right. The candidate pixel patches may overlap. Each candidate pixel patch is associated with a temporary disparity d_t.

Next, for the given pixel patch NP_R on the right, each candidate pixel patch NP_L on the left is evaluated to determine which candidate is the best match. In one embodiment, each candidate pixel patch is evaluated using the angular distance (cosine similarity of the normals). However, other metrics, such as the least-squares difference, may be used for the evaluation.

Left-right consistency of the disparity is enforced by requiring that the match from a patch in the left normal map N_L to a patch in the right normal map N_R and the match from right to left have a disparity difference below a given threshold. If the disparity difference exceeds the threshold, the match may be rejected. For example, in one embodiment, all matches with a disparity difference greater than 0.5 are rejected. However, other disparity-based metrics may be used.
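A left-right consistency test of this kind might look like the sketch below. It assumes that both disparity maps store positive disparity magnitudes, that unmatched pixels are marked with None, and that the right-hand pixel is looked up at the rounded position x − d; these conventions and all names are assumptions made for the illustration.

```python
def is_left_right_consistent(disp_left, disp_right, x, y, threshold=0.5):
    """Keep a left-to-right match only if the right-to-left match agrees with it."""
    d_left = disp_left[y][x]
    if d_left is None:
        return False
    x_right = x - int(round(d_left))      # where the left pixel lands in the right map
    if not 0 <= x_right < len(disp_right[y]):
        return False
    d_right = disp_right[y][x_right]
    if d_right is None:
        return False
    return abs(d_left - d_right) <= threshold


# One scan line of hypothetical disparities (None = no match found).
disp_left = [[None, None, None, 3.0, 3.2, 1.0]]
disp_right = [[3.0, 3.1, 5.0, None, 2.0, None]]
kept = [x for x in range(6) if is_left_right_consistent(disp_left, disp_right, x, 0)]
print(kept)  # [3, 4]
```

The pixel at x = 5 is rejected because the two directions disagree by a full pixel, which is above the 0.5 threshold quoted in the text.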

In one embodiment, left-right consistency of the single-pixel normal matches is also enforced by keeping only points whose normals differ by less than 5°.
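The 5° test on matched normals can be written as a short check. The sketch assumes that both normals are expressed in a common coordinate frame before being compared; the function names are hypothetical.

```python
import math


def normals_agree(n_a, n_b, max_angle_deg=5.0):
    """True if the angle between two normals is below max_angle_deg."""
    dot = sum(a * b for a, b in zip(n_a, n_b))
    norm = math.sqrt(sum(a * a for a in n_a)) * math.sqrt(sum(b * b for b in n_b))
    cos_angle = max(-1.0, min(1.0, dot / norm))   # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos_angle)) < max_angle_deg


def tilted(deg):
    """Unit normal tilted by `deg` degrees away from the z axis (toy helper)."""
    r = math.radians(deg)
    return (math.sin(r), 0.0, math.cos(r))


print(normals_agree((0.0, 0.0, 1.0), tilted(3.0)))   # True
print(normals_agree((0.0, 0.0, 1.0), tilted(10.0)))  # False
```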

Although a windowed method of searching for candidate pixel patches has been described above, alternative methods of identifying candidate patches may be used.

In one embodiment, to give more accurate stereo matching, patch warping may be used to correct for the curvature associated with the normal map. Note that, for any plane that is not fronto-parallel, pixels at different depths have different disparities, so the corresponding patch NP_R in the right normal map is not a square of size 2w centered at (x − d, y). However, the approximate relative depth of equation (6) can be used to compute a more accurate mapping from the left image to the right image (i.e., patch warping).

More precisely, in one embodiment, the warping procedure for each temporary disparity d_t is as follows:

1. Calculate the average temporary depth as f b / d_t, i.e., equation (5) applied to the temporary disparity d_t.

2. Calculate the relative depth scale factor s, which relates the average temporary depth to the relative depth Z_0L of the patch.

3. For each patch pixel (x_p, y_p), scale its relative depth to z_p = s Z_0L[x_p, y_p].

4. Calculate the patch pixel disparity as d_p = f b / z_p.

5. Sample an interpolated value from the right normal map at position N_R[x_p − d_p, y_p].

Thus, using the warping procedure above, a temporary correspondence in the right normal map N_R is computed for each patch of pixels in the left normal map N_L. As before, the match is evaluated using the angular distance (cosine similarity of the normals). In one embodiment, left-right consistency of the disparity is enforced as described above.
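The five warping steps can be sketched in a few lines. Several details here are assumptions rather than statements from the text: the scale factor s is taken as the temporary depth divided by the mean relative depth over the patch, the patch is a (2w+1) × (2w+1) window, interpolation is linear along the scan line, and maps are nested lists indexed as map[row][column]. All function and variable names are hypothetical.

```python
import math


def patch_disparities(z0_left, x, y, w, d_t, f, b):
    """Per-pixel disparities d_p for the patch centred on (x, y), for a temporary disparity d_t."""
    coords = [(xp, yp) for yp in range(y - w, y + w + 1) for xp in range(x - w, x + w + 1)]
    z_t = f * b / d_t                                     # step 1: temporary depth
    mean_rel = sum(z0_left[yp][xp] for xp, yp in coords) / len(coords)
    s = z_t / mean_rel                                    # step 2: scale factor (assumed form)
    result = []
    for xp, yp in coords:
        z_p = s * z0_left[yp][xp]                         # step 3: scaled depth
        result.append((xp, yp, f * b / z_p))              # step 4: pixel disparity
    return result


def sample_row(row, x):
    """Linearly interpolate a row of normals at fractional column x, then renormalise."""
    x = max(0.0, min(float(len(row) - 1), x))
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    t = x - x0
    v = [(1.0 - t) * a + t * c for a, c in zip(row[x0], row[x1])]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return tuple(c / norm for c in v)


def warp_patch(n_right, z0_left, x, y, w, d_t, f, b):
    """Step 5: sample the right normal map at (x_p - d_p, y_p) for every patch pixel."""
    return [sample_row(n_right[yp], xp - d_p)
            for xp, yp, d_p in patch_disparities(z0_left, x, y, w, d_t, f, b)]


# A fronto-parallel patch (constant relative depth) keeps the same disparity everywhere.
z0_flat = [[2.0] * 10 for _ in range(5)]
flat = patch_disparities(z0_flat, x=5, y=2, w=1, d_t=3.0, f=1000.0, b=0.1)
```

For a slanted patch the disparities differ across the window, which is why the matching window in the right normal map is no longer a square.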

This yields a sparse set of pairwise, shape-preserving correspondences between the left normal map N_L and the right normal map N_R. From these, two shapes are reconstructed: one from the patch matching that starts from the left normal map, and one from the patch matching that starts from the right normal map.

The two reconstructed shapes (partial stereo estimates) are then combined into one. If corresponding points are estimated to lie at different depths in the two partial stereo estimates, both points are discarded. The combined stereo estimate is again sparse and can be integrated or interpolated to increase its number of points. However, the sparse points from the stereo estimate can also be used directly in the next step.

From the pairwise correspondences, a set of stereo points with absolute depth is computed by triangulation, using the principle of equation (5) and the camera calibration matrices. The sparse set of pairwise, shape-preserving correspondences is then input to the 3D shape recovery steps S511 and S513.

In step S511, a first 3D shape (or reconstruction) is obtained using the normal map obtained from camera 1 in step S505, constrained by the pairwise correspondences obtained from the stereo matching detailed above.

In step S513, a second 3D shape (or reconstruction) is obtained using the normal map obtained from camera 2 in step S507, constrained by the pairwise correspondences obtained from the stereo matching detailed above.

The reconstructions are generated by numerically integrating the normal maps to produce depth maps, with the integration constrained by the depth information determined from the stereo matching of the two normal maps.

In one embodiment, this may be done by following the variational method of Quéau and Durou [Y. Quéau and J.-D. Durou. Edge-preserving integration of a normal field: Weighted least squares, TV and L1 approaches. In SSVM, 2015], where the surface derivatives are approximated using first-order finite differences.

The numerical integration of each normal map (N_L, N_R) to obtain the depth maps (Z_0L, Z_0R) can be constrained to follow these stereo points. This essentially adds another term, λ‖z − z_0‖, to the integration equation, where z is the depth, z_0 is the estimated depth, and λ is a constant weight that can also be scaled according to the confidence of the match. This removes the scale ambiguity and reduces the low-frequency bending, and thus greatly improves on the depth accuracy of single-view photometric stereo.
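To show why a few absolute stereo points are enough to pin down an integrated depth profile, the following toy sketch integrates a one-dimensional gradient field with anchor constraints. It is not the variational method cited above: it assumes a squared penalty λ(z − z_0)², a 1D profile instead of a 2D map, and a plain Gauss-Seidel solver, and all names are hypothetical.

```python
def integrate_with_anchors(grad, anchors, lam=1.0, iters=500):
    """Minimise sum_i (z[i+1] - z[i] - grad[i])^2 + lam * sum_k (z[k] - anchors[k])^2.

    grad    : finite differences of the depth profile (length n - 1)
    anchors : {index: absolute depth} for the sparse stereo points
    """
    n = len(grad) + 1
    start = sum(anchors.values()) / len(anchors) if anchors else 0.0
    z = [start] * n
    for _ in range(iters):                      # Gauss-Seidel sweeps
        for i in range(n):
            num, den = 0.0, 0.0
            if i > 0:                           # consistency with the left neighbour
                num += z[i - 1] + grad[i - 1]
                den += 1.0
            if i < n - 1:                       # consistency with the right neighbour
                num += z[i + 1] - grad[i]
                den += 1.0
            if i in anchors:                    # pull towards the stereo depth
                num += lam * anchors[i]
                den += lam
            z[i] = num / den
    return z


true_depth = [10.0, 10.5, 11.5, 11.0, 12.0]
grad = [b - a for a, b in zip(true_depth, true_depth[1:])]
recovered = integrate_with_anchors(grad, anchors={2: 11.5})
```

Without the anchor the gradients fix the profile only up to an unknown offset; a single absolute point removes that freedom in this toy case.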

In the term integrated with respect to dx dy, D is the divergence operator, z is the known depth, and z_d is the depth estimated from the normals; λ_st‖z − z_st‖ is the constraint from the stereo points, and λ_prev‖z − z_prev‖ is a constraint using a quadratic prior determined from the previous reconstruction. Because the stereo reconstruction is sparse, when a point is reconstructed for which the stereo reconstruction has no corresponding value, the term λ_st‖z − z_st‖ is set to 0.

This sparse stereo reconstruction is less affected by global distortion than a monocular photometric stereo reconstruction.

Note that at this stage the output is still a set of per-view 2.5D depth maps (in the case of multi-view photometric stereo, which can correspond to multiple rectified stereo pairs), so fusion is still required to obtain a complete surface. The stereo-constrained single-view reconstructions (Z_0L, Z_0R) are metrically consistent and can therefore be merged into a unified surface in step S515. To minimize the effect of inaccuracies in the different views on the fusion, each point is weighted using its view term, i.e., n · v (as detailed in Fotios Logothetis, Ignas Budvytis, Roberto Mecca, and Roberto Cipolla. A differential volumetric approach to multi-view photometric stereo. In ICCV, 2019). This allows points in the center of the image to be given a higher weight, since they are more likely to be correct.
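The effect of weighting by the view term n · v can be illustrated with a weighted average of several estimates of the same surface point. The fusion used in the cited work is not reproduced here; the sketch assumes unit vectors, a view direction pointing from the surface towards the camera, and hypothetical names and numbers.

```python
def view_weight(normal, view_dir):
    """View term n . v, clamped at zero for points facing away from the camera."""
    return max(0.0, sum(a * b for a, b in zip(normal, view_dir)))


def fuse_estimates(estimates):
    """Weighted mean of 3D points; estimates is a list of (point, normal, view_dir)."""
    weights = [view_weight(n, v) for _, n, v in estimates]
    total = sum(weights)
    if total == 0.0:
        return None                       # no view sees the point frontally enough
    return tuple(
        sum(w * p[k] for w, (p, _, _) in zip(weights, estimates)) / total
        for k in range(3)
    )


estimates = [
    ((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)),                  # seen head-on
    ((0.0, 0.0, 1.3), (0.0, 0.0, 1.0), (0.0, 0.8660254037844386, 0.5)),   # seen at 60 degrees
]
fused = fuse_estimates(estimates)
```

The head-on estimate receives twice the weight of the oblique one, so the fused depth lies closer to 1.0 than to 1.3.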

In addition, points lying outside the visual hull of the object are rejected entirely.

In one embodiment, Poisson reconstruction is performed. The stereo point cloud of sparse stereo points can be made denser using Poisson reconstruction. The output of the Poisson reconstruction is a dense surface that can be projected onto all views and used as a better initialization for the next stage of the iterative procedure.

In one embodiment, the process described above is iterated several times to improve the reconstruction. In subsequent iterations, the merged stereo-constrained reconstruction from the previous iteration is used as the rough depth estimate or initial 3D plane. Using this estimate, normals are calculated, stereo matching is performed on the normals, and the object is reconstructed subject to the stereo points obtained from the stereo matching.

In addition, having a dense surface estimate also makes it possible to estimate discontinuity boundaries, which can be taken into account in the single-view reconstruction stage (essentially removing the usual integrability constraint for pixels across a boundary).

The merged stereo-constrained single-view reconstruction from step S515 is then used in step S517 as the regenerated shape to recalculate the light distribution taking near-field effects into account, as described above in connection with equations (1) to (4), which describe how the shape of the object affects the near-field effects.

In step S519, new observation maps are generated for cameras 1 and 2. These observation maps are generated in the same manner as described above, except that the shape regenerated in S517 is now used as z_est. In step S521, normal maps are generated from the observation maps generated in S519.

The method then loops back to S509, where the new normal maps generated in step S521 are stereo matched. The process then continues as above until the shape reconstructed in S515 converges.
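The overall loop of Figure 5 can be outlined as a fixed-point iteration. The sketch below only shows the control flow: the four callables stand in for the normal estimation, stereo matching, constrained integration with merging, and a shape-difference measure, none of which are specified here, and the convergence tolerance is an assumption because the text does not state a criterion.

```python
def iterate_reconstruction(initial_shape, estimate_normals, stereo_match,
                           integrate_and_merge, shape_change, tol=1e-3, max_iters=50):
    """Repeat normals -> stereo matching -> constrained integration until the shape settles."""
    shape = initial_shape
    for iteration in range(1, max_iters + 1):
        n_left, n_right = estimate_normals(shape)              # S505/S507, later S519/S521
        stereo_points = stereo_match(n_left, n_right)          # S509
        new_shape = integrate_and_merge(n_left, n_right, stereo_points, shape)  # S511 to S515
        if shape_change(shape, new_shape) < tol:               # assumed convergence test
            return new_shape, iteration
        shape = new_shape                                      # S517: regenerated shape
    return shape, max_iters


# Toy stand-ins: the "shape" is a single depth that is pulled halfway to 2.0 each round.
final_shape, n_iters = iterate_reconstruction(
    initial_shape=1.0,
    estimate_normals=lambda shape: (shape, shape),
    stereo_match=lambda n_l, n_r: 2.0,
    integrate_and_merge=lambda n_l, n_r, pts, prev: 0.5 * (prev + pts),
    shape_change=lambda old, new: abs(new - old),
)
```

Passing the stages in as functions keeps the sketch independent of how each stage is actually implemented.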

The method above involves forming observation maps and generating normal maps from these observation maps.

In more detail, if the photometric stereo images are RGB images, pre-processing may be performed in which the RGB channels are averaged, so that the images are converted to grayscale images. In the pre-processing stage, the photometric stereo images are also compensated for the intrinsic light source brightness, where the intrinsic light source brightness Φ_m is a constant property of each LED. The values of the resulting images are called RAW gray image values and are included in the observation map, as described further below.
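The pre-processing described here amounts to two per-pixel operations. The sketch assumes that an image is a nested list of (R, G, B) tuples and that compensation means division by the brightness Φ_m of the LED that lit the image; the function name is hypothetical.

```python
def to_raw_gray(rgb_image, phi_m):
    """Average the RGB channels and divide by the intrinsic brightness of the light source."""
    if phi_m <= 0:
        raise ValueError("light source brightness must be positive")
    return [[(r + g + b) / 3.0 / phi_m for (r, g, b) in row] for row in rgb_image]


raw = to_raw_gray([[(30, 60, 90), (0, 0, 0)]], phi_m=2.0)
print(raw)  # [[30.0, 0.0]]
```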

The general image irradiance equation can therefore be rearranged into the inverse BRDF problem of equation (8), in which j_m denotes a BRDF sample. Note that, although the viewing vector is known, the lighting direction L_m and the near-field light attenuation α_m are unknown, because of their non-linear dependence on the distance between the object and the photometric stereo capture device, represented by the depth z. The objective of the convolutional neural network is therefore to solve the inverse BRDF problem in the general viewing direction, which is input to the network through the third and fourth channels of the map as explained further below, and to recover the surface normal N and subsequently the depth z.

Given an initial estimate of the depth z of each pixel (the local surface depth), the near-field attenuation α_m(X) can be calculated according to equation (1). The equivalent far-field reflectance samples j_m, which represent the observations in the photometric stereo images after compensation for the near-field attenuation, can then be obtained using the first part of equation (8). The second part of equation (8) is modeled by the convolutional neural network (CNN), which is used to approximate the surface normal N.

In step S407, the set of equivalent far-field reflectance samples j_m is used to generate the observation maps, which are in turn provided as input to the CNN. Each equivalent far-field reflectance sample j_m comprises a set of pixels x = {x_1, …, x_p}. The index m runs over the light sources, so the number of samples in the set equals the number of light sources. For each sample j_m, the light direction L_m and the brightness Φ_m are known and are used to estimate the surface normal N(x) for each pixel x.

The observation map is computed by combining the information from all light sources into a single map. In particular, for each pixel x, all observations in the set of equivalent far-field reflectance samples j_m are merged into a single d × d observation map.

First, the normalized observations are computed for each pixel x. The normalization compensates for variations in the light source brightness Φ_m and is performed by dividing by the maximum brightness over all light sources m.
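For a single pixel, normalization by the maximum over all light sources can be sketched as below. The sketch assumes that the inputs are the reflectance samples of one pixel under the m lights, and it returns zeros when the pixel is dark under every light to avoid dividing by zero; that guard and the names are assumptions.

```python
def normalize_observations(samples, eps=1e-12):
    """Divide the per-light samples of one pixel by their maximum."""
    peak = max(samples)
    if peak <= eps:                 # pixel is dark under every light
        return [0.0 for _ in samples]
    return [s / peak for s in samples]


normalized = normalize_observations([0.2, 0.4, 0.8])
```

After this step the brightest observation of every pixel is 1, so pixels with different albedo produce values in the same range.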

Compensating for the light source variations is aimed at compensating for the albedo variation between different pixels. As a result, it also reduces the range of the data associated with the observations of each pixel.

The normalized observations for each pixel x are then placed on the normalized observation map O_n. The normalized observation map is a square map of dimension d × d. In some embodiments, d is 32. The size of the observation map is independent of the number and size of the photometric stereo images used.

The normalized observations are mapped onto the normalized observation map by projecting the light source directions onto the d × d map.
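One common way to project a unit light direction onto a d × d grid is to map its x and y components from [−1, 1] to cell indices. The exact mapping used by the embodiment is not reproduced in this text, so the formula below is an assumption, as are the names, the clamping at the border, and the choice to let later samples overwrite earlier ones that fall in the same cell.

```python
def light_to_cell(light_dir, d=32):
    """Map a unit light direction (l_x, l_y, l_z) to a (row, col) cell of a d x d map."""
    l_x, l_y, _ = light_dir
    col = min(d - 1, max(0, int(d * (l_x + 1.0) / 2.0)))
    row = min(d - 1, max(0, int(d * (l_y + 1.0) / 2.0)))
    return row, col


def build_normalized_map(norm_obs, light_dirs, d=32):
    """Place each normalized observation at the cell given by its light direction."""
    o_n = [[0.0] * d for _ in range(d)]
    for value, light in zip(norm_obs, light_dirs):
        row, col = light_to_cell(light, d)
        o_n[row][col] = value
    return o_n


lights = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
o_n = build_normalized_map([1.0, 0.25], lights)
```

Because the cell depends only on the light direction, the map keeps its d × d size however many images are captured.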

In some cases, the normalized observation data can be corrupted by the division operation. For example, corruption can occur when the maximum value of the observations is saturated: dividing by the saturated value leads to an overestimate of the normalized observation values. In other cases, division by very dark points in the observations becomes numerically unstable and amplifies noise or any discretization inaccuracies.

The d × d observation map for each pixel x is therefore extended to a three-dimensional observation map of dimension d × d × 2 by adding a RAW channel map O_r. The RAW channel map may be a RAW grayscale channel map, populated with the RAW gray image values described above.

The d × d × 2 observation map, denoted O below, is generated by concatenating the normalized observation map O_n and the RAW channel map O_r along the third axis.

In some embodiments, the observation map O of dimension d × d × 2 can be extended to a d × d × 4 observation map by augmenting it with two additional channels, which are constant and equal to the first two components of the viewing vector, respectively. These two scalar components fully determine the viewing vector used in the inverse BRDF problem equation.
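Assembling the four channels for one pixel can be sketched as a simple stacking operation. The names v_x and v_y for the first two components of the viewing vector are assumptions introduced for the example, and the maps are plain nested lists.

```python
def assemble_observation_map(o_n, o_r, v_x, v_y):
    """Stack the normalized map, the RAW map, and two constant view channels into d x d x 4."""
    return [
        [[o_n[r][c], o_r[r][c], v_x, v_y] for c in range(len(o_n[r]))]
        for r in range(len(o_n))
    ]


d = 4
o_n_small = [[0.5] * d for _ in range(d)]
o_r_small = [[0.1] * d for _ in range(d)]
obs_map = assemble_observation_map(o_n_small, o_r_small, v_x=0.02, v_y=-0.01)
```

For a unit viewing vector, the third component follows from the first two up to sign, which is why two extra channels suffice.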

観測マップは、離散化された光方向の２Ｄグリッド上のＢＲＤＦサンプルからの相対画素強度を記録する。観測マップの表現は、使用されるライト、およびしたがって照度差ステレオ画像の数が潜在的に変化するにもかかわらず、それが画素長についての２Ｄ入力を与えるので、古典的なＣＮＮアーキテクチャと共に使用するのに非常に便利な表現である。 The observation map records the relative pixel intensities from the BRDF samples on a 2D grid of discretized light directions. The observation map representation is a very convenient representation to use with classical CNN architectures because it gives a 2D input in terms of pixel length, despite the potential variation in the light used, and therefore the number of photometric stereo images.

In steps S505, S507, and S521, the observation map for each pixel x is provided to the input of a CNN, which is used to solve the BRDF inverse problem and calculate the surface normal at each point from the relative pixel intensities in the observation map. Because the CNN is designed to be robust to real-world effects, the modeled BRDF inverse problem equation is an inexact representation of the BRDF inverse problem equation.

A high-level flow diagram of a convolutional neural network that can be used to generate surface normals from the per-pixel observation maps is presented in FIG. 6.

The network comprises seven convolutional layers that are used to learn robust features for dealing with real-world data. This is done by using an augmentation strategy during training of the network, as described in GB2598711. The network further comprises two fully connected layers, followed by a further fully connected layer combined with a logarithmic layer at the end, which are used to solve the inverse BRDF problem and hence compute the surface normal for each pixel (steps S505 and S507).

The network has about 4.5 million parameters in total. An overall view of the network architecture is shown in FIG. 6. Although FIG. 6 depicts a particular network architecture, it is understood that different neural network architectures may be used to estimate surface normals from the observation map.

The proposed network is a single-branch network. The seven convolutional layers 603, 605, 609, 613, 619, 623, and 627, as well as the first two fully connected layers 631 and 635, are each followed by a ReLU activation function. The size of the convolutional filters of each convolutional layer is shown in FIG. 6, so the size of the output volume of each layer can be inferred. In particular, convolutional layer 603 comprises 32 convolutional filters of dimension (3×3) and outputs a volume of dimension (32, 32, 32), and convolutional layer 605 comprises 32 convolutional filters of dimension (3×3) and outputs a volume of dimension (32, 32, 32). The output volume of the first concatenation layer 608 has dimension (32, 32, 64). Convolutional layer 609 comprises 32 convolutional filters of dimension (3×3) and outputs a volume of dimension (32, 32, 32). The output volume of the second concatenation layer 612 has dimension (32, 32, 96). Convolutional layer 613 comprises 64 convolutional filters of dimension (1×1) and outputs a volume of dimension (32, 32, 64). The output volume of the average pooling layer 617 has dimension (16, 16, 64). Convolutional layer 619 comprises 64 convolutional filters of dimension (3×3) and outputs a volume of dimension (16, 16, 64).

The output volume of the third concatenation layer 622 has dimension (16, 16, 128). Convolutional layer 623 comprises 64 convolutional filters of dimension (3×3) and outputs a volume of dimension (16, 16, 64). Convolutional layer 627 comprises 128 convolutional filters of dimension (3×3) and outputs a volume of dimension (16, 16, 128).

In addition, dropout layers 607, 611, 615, 621, and 625 are used after convolutional layers 605, 609, 613, 619, and 623, respectively. Dropout is a training technique that reduces interdependent learning among the neurons of the network. During training, a random set of nodes is dropped from the network so that a reduced version of the network is created. The reduced version of the network learns independently of the other sections of the neural network, which prevents neurons from developing co-dependency on one another.

Each dropout layer is associated with a dropout parameter that specifies the probability that an output of the layer is dropped out. In layers 607, 611, 615, 621, and 625, the dropout parameter is 0.2, so 20% of the layer outputs are dropped out.
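The behaviour of a dropout layer with parameter 0.2 can be illustrated with a short sketch. The function name `dropout` is hypothetical, and the rescaling of the surviving outputs by 1/(1 - p), often called inverted dropout, is an assumption borrowed from common deep learning practice; the text does not state whether the embodiment rescales.

```python
import random


def dropout(values, p=0.2, training=True, rng=None):
    """Zero each value with probability p while training.

    Surviving values are scaled by 1 / (1 - p) (assumed "inverted dropout")
    so the expected activation is unchanged. At inference time the input is
    returned untouched.
    """
    if not training:
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


activations = [1.0] * 10
print(dropout(activations, p=0.2, rng=random.Random(0)))
print(dropout(activations, training=False))
```

Each training pass therefore sees a different reduced version of the network, while inference uses all outputs.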

Skip connections are also used in the network architecture to accelerate convergence. A skip connection allows the output of a convolutional layer to bypass the following layer, so that the outputs of successive layers can be concatenated together before being provided as input to the next layer of the network. An average pooling layer 617 is also used.

A first part of the convolutional neural network, which comprises the seven convolutional layers 603, 605, 609, 613, 619, 623, and 627, is effectively separated by a flattening layer 629 from a second part of the network, which comprises the fully connected layers 631, 635, and 637 and the logarithmic layer 633. Fully connected layers are used because they provide good approximations of nonlinear mathematical functions.

The flattening layer 629 rearranges the feature maps output by convolutional layer 627 into a vector of input data for the first fully connected layer 631. The (16, 16, 128) output volume of convolutional layer 627 is flattened into a 32,768-element vector, which is provided to the first fully connected layer 631.
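The output volumes quoted for the convolutional part of the network can be checked with some simple shape bookkeeping. The sketch below is an illustration under assumptions: it takes d = 32 and stride-1 convolutions with "same" padding (which the quoted sizes imply), and the choice of which tensors feed each concatenation layer is a guess that reproduces the stated channel counts; FIG. 6 is the authority on the actual wiring. All function names are hypothetical.

```python
def conv_same(shape, filters):
    """Output volume of a stride-1 convolution with 'same' padding."""
    height, width, _ = shape
    return (height, width, filters)


def concat(*shapes):
    """Channel-wise concatenation of volumes sharing height and width."""
    height, width, _ = shapes[0]
    if any(s[:2] != (height, width) for s in shapes):
        raise ValueError("spatial sizes differ")
    return (height, width, sum(s[2] for s in shapes))


def avg_pool(shape, size=2):
    """Non-overlapping average pooling."""
    height, width, channels = shape
    return (height // size, width // size, channels)


def trace_shapes(d=32, in_channels=2):
    s = {"input": (d, d, in_channels)}
    s["conv_603"] = conv_same(s["input"], 32)
    s["conv_605"] = conv_same(s["conv_603"], 32)
    s["concat_608"] = concat(s["conv_603"], s["conv_605"])
    s["conv_609"] = conv_same(s["concat_608"], 32)
    s["concat_612"] = concat(s["conv_603"], s["conv_605"], s["conv_609"])
    s["conv_613"] = conv_same(s["concat_612"], 64)
    s["pool_617"] = avg_pool(s["conv_613"])
    s["conv_619"] = conv_same(s["pool_617"], 64)
    s["concat_622"] = concat(s["pool_617"], s["conv_619"])
    s["conv_623"] = conv_same(s["concat_622"], 64)
    # The input of layer 627 is assumed; its output size depends only on
    # the spatial size and the 128 filters.
    s["conv_627"] = conv_same(s["conv_623"], 128)
    h, w, c = s["conv_627"]
    s["flatten_629"] = h * w * c
    return s


for name, shape in trace_shapes().items():
    print(name, shape)
```

The flattened length is 16 × 16 × 128 = 32,768, which agrees with the vector size stated for the first fully connected layer.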

The BRDF model described above in this specification follows the Blinn-Phong reflectance model, which is considered a good approximation of many real BRDFs and in which the BRDF is modeled as the sum of a diffuse component and an exponential component. Therefore, to solve the inverse BRDF problem, the inverse of these operations is approximated as a combination of a linear sum and a logarithmic sum. This combination is implemented in the CNN using the fully connected layer 635 together with the logarithmic layer 633. The fully connected layer 635 represents the linear component of the sum, and the logarithmic layer 633 represents the logarithmic component of the sum.
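A small numerical sketch may help show why a logarithm is a natural tool for inverting the exponential component. It uses the textbook Blinn-Phong form, which may differ in detail from the BRDF equation given earlier in the specification; the coefficient names `k_d`, `k_s`, and `shininess` and the function names are assumptions made for illustration.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def blinn_phong_terms(normal, light, view, k_d, k_s, shininess):
    """Return (diffuse, specular, n_dot_h) for the textbook Blinn-Phong model.

    Assumes light and view are not exactly opposite, so the half vector
    is well defined.
    """
    n, l, v = normalize(normal), normalize(light), normalize(view)
    h = normalize(tuple(a + b for a, b in zip(l, v)))  # half vector
    n_dot_h = max(0.0, dot(n, h))
    diffuse = k_d * max(0.0, dot(n, l))        # linear in the normal
    specular = k_s * n_dot_h ** shininess      # exponential in shininess
    return diffuse, specular, n_dot_h


diffuse, specular, n_dot_h = blinn_phong_terms(
    (0.0, 0.0, 1.0), (0.3, 0.0, 1.0), (0.0, 0.0, 1.0), 0.7, 0.3, 20.0
)
# Taking the logarithm turns the exponential term into a linear expression:
# log(specular) = log(k_s) + shininess * log(n . h)
print(math.log(specular), math.log(0.3) + 20.0 * math.log(n_dot_h))
```

The diffuse term is already linear in the normal, while the specular term becomes linear only after a logarithm, which is consistent with pairing a fully connected layer with a logarithmic layer.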

Finally, a normalization layer 639 is used to convert the features extracted by the network into a unit vector (the surface normal of the pixel), so that the network outputs a surface normal from the per-pixel observation map input.
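The effect of the final normalization can be sketched in a few lines. The function name `to_unit_normal` and the handling of a near-zero vector are assumptions for illustration.

```python
import math


def to_unit_normal(raw, eps=1e-12):
    """Scale a 3-vector to unit length so it can be read as a surface normal."""
    length = math.sqrt(sum(c * c for c in raw))
    if length < eps:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / length for c in raw)


print(to_unit_normal((0.0, 3.0, 4.0)))
```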

FIG. 7 illustrates the steps of a method according to one embodiment. FIG. 7 shows a set of images arranged in rows and columns. In the following description, each image is referred to as image (row, column). Where an image spans two rows, it is referred to as image (row1:row2, column). The images are acquired for an object for which the ground truth is known at each stage of the method.

Image (1,1) is an image from the first set of photometric stereo images, and image (2,1) is an image from the second set of photometric stereo images. Image (1,1) may be the right image and image (2,1) the left image, or vice versa.

Image (1,2) shows the error of the normal map estimated for image (1,1) relative to the ground truth. Image (2,2) shows the error of the normal map estimated for image (2,1) relative to the ground truth.

Image (2,3) shows the structure from normals (SfN) obtained by integrating the normal map estimated for image (2,1). Image (2,4) shows the error of that SfN map relative to the ground truth.

Image (1,3) shows the structure from normals (SfN) obtained by integrating the normal map estimated for image (1,1). Image (1,4) shows the error of that SfN map relative to the ground truth.

Image (1,6) shows the disparity estimate map determined by matching the normal map of image (1,1) against the normal map of image (2,1), as described with reference to S509. Image (2,6) shows the disparity estimate map determined by matching the normal map of image (2,1) against the normal map of image (1,1). Black indicates low disparity and white indicates high disparity. Image (1,5) shows the error in the disparity map of image (1,6). Image (2,5) shows the error in the disparity map of image (2,6).

Image (1:2,7) shows the stereo points obtained from the stereo matching. Image (1:2,8) shows the error of the stereo points relative to the ground truth.

Image (1:2,9) shows the reconstructed object obtained by fusing the depth maps obtained for image (1,1) and image (2,1), constrained by the stereo points of image (1:2,7). Image (1:2,10) shows the error of the fused shape relative to the ground truth.

Image (1:2,11) shows the object reconstructed using Poisson reconstruction. Image (1:2,12) shows the error of the fused shape relative to the ground truth.

Rows 3 to 4 show the steps of the process for a second iteration of the method. Rows 5 to 6 show the steps of the process for a third iteration of the method.

FIG. 8 shows the working principle of the reconstruction method. S901 shows a sample of the input images from the two cameras, in which the object is illuminated by one LED light source (out of 15). Following the arrow to S902 is the part of the pipeline in which the normal maps are computed by monocular photometric stereo. These normal maps are then used to match the two views geometrically and give a sparse stereo representation of the object. Patch warping may be used in S903. S904 shows the sparse stereo point reconstruction. S905 shows the fusion of the two stereo-constrained single-view reconstructions. In S906, a Poisson reconstruction is performed to further refine the reconstruction.

In S907, the reconstruction from S906 is used as the initial estimate of the shape geometry for the subsequent iteration of the method. It is used together with the input images to compensate the input images for near-field light attenuation.
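How a shape estimate enables compensation for near-field attenuation can be sketched as follows. This is an illustration under a strong assumption: the light is treated as an isotropic point source with inverse-square falloff, whereas the light model defined elsewhere in the specification may include further terms such as angular fall-off. The function name and arguments are hypothetical.

```python
import math


def compensate_near_field(intensity, point, light_pos):
    """Undo inverse-square attenuation for one surface point and one light.

    point     : 3D position of the surface point from the current shape estimate
    light_pos : 3D position of the point light source
    Returns the compensated intensity and the unit lighting direction.
    """
    offset = [p - x for p, x in zip(light_pos, point)]
    dist = math.sqrt(sum(c * c for c in offset))
    direction = tuple(c / dist for c in offset)
    attenuation = 1.0 / (dist * dist)  # assumed isotropic point source
    return intensity / attenuation, direction


value, direction = compensate_near_field(0.25, (0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
print(value, direction)
```

Both the attenuation and the lighting direction depend on the surface position, which is why a better reconstruction from one iteration yields better compensated images for the next.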

FIG. 8 shows, from left to right, the steps of the fusion process: the point cloud from stereo; normal integration using the point cloud as a prior; fusion from all views and back-projection to initialize the next round; and integration using the surface as a prior while taking depth discontinuities into account.

FIG. 9 shows the results of the proposed system compared with monocular (single-view) photometric stereo. At 1001, pairs of binocular photometric stereo images are shown for three objects made of different materials (a rabbit, a queen, and a squirrel). The ground truth (1002) is followed by the monocular photometric stereo reconstruction (1003) and then by the reconstruction of the proposed method (1004).

The above methods and systems have been described with respect to binocular photometric stereo, that is, two views or two sets of images taken from cameras at two different positions. However, the described method can also be applied to more views. For example, in one embodiment there are three views, or three sets of images, taken from cameras at three different positions. The described stereo matching may then be performed between the first and second views, between the first and third views, and between the second and third views. The resulting stereo-constrained single-view reconstructions (a first, a second, and a third reconstruction) may be merged together.
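The pairing of views in the three-view example can be enumerated mechanically. The sketch below generalizes it to matching every pair of views, which is an assumption for more than three views; the text only spells out the three-view case. The function name `view_pairs` is hypothetical.

```python
from itertools import combinations


def view_pairs(num_views):
    """List every unordered pair of views, numbered from 1."""
    return list(combinations(range(1, num_views + 1), 2))


print(view_pairs(3))  # [(1, 2), (1, 3), (2, 3)]
```

With all-pairs matching the number of stereo matching runs grows as n(n - 1)/2, so for many views an implementation might restrict matching to neighbouring cameras.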

The above method can be applied to any number of views, or sets of photometric images.

FIG. 10 is a schematic diagram of hardware that can be used to implement a method according to an embodiment. Note that this is only an example and that other configurations may be used.

The hardware comprises a computing section 1100. In this particular example, the components of this section are described together. However, it will be understood that they are not necessarily co-located.

The components of the computing system 1100 may include, but are not limited to, a processing unit 1113 (such as a central processing unit, CPU), a system memory 1101, and a system bus 1111 that couples various system components, including the system memory 1101, to the processing unit 1113. The system bus 1111 may be any of several types of bus structure, including a memory bus or memory controller, a peripheral bus, and a local bus, using any of a variety of bus architectures. The computing system 1100 also includes an external memory 1115 connected to the bus 1111.

The system memory 1101 includes computer storage media in the form of volatile and/or nonvolatile memory, such as read-only memory. A basic input/output system (BIOS) 1103, containing the routines that help transfer information between elements within the computer, such as during start-up, is typically stored in the system memory 1101. In addition, the system memory contains the operating system 1105, application programs 1107, and program data 1109 that are in use by the CPU 1113.

An interface 1125 is also connected to the bus 1111. The interface may be a network interface through which the computer system receives information from further devices. The interface may also be a user interface that allows a user to respond to certain commands and the like.

The graphics processing unit (GPU) 1119 is particularly well suited to the above-described method because the method involves many parallel invocations. Thus, in one embodiment, the processing may be divided between the CPU 1113 and the GPU 1119.

The embodiments described above combine two steps: (i) estimating sparse, pairwise, shape-preserving correspondences, and (ii) using them to initialize a reconstruction guided by the estimated per-pixel normals, for example a Poisson reconstruction, although other reconstruction methods may be used.

Stereo matching requires matching view-invariant features, which is very challenging for textureless objects and especially for objects exhibiting specular reflections, which change the appearance and pixel intensities between views. In addition, local surface curvature distorts the local appearance between the left and right (or first and second) views, which hampers patch-match based methods that try to match rectangular patches of pixels in the two views.

Single-view near-field photometric stereo can compute dense surface normals (for every foreground pixel) that can be very accurate even with an approximate depth initialization. Surface normals are view-invariant features and also vary across any non-planar surface, which inherently enables patch matching.

In addition, integrating the normals gives a shape estimate that is locally accurate (suffering only from low-frequency deformation or bending), and this local shape can be used to perform patch warping, as described below, and thus to maximize stereo matching.
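The idea of integrating normals into a shape, and of pinning that shape with a stereo point, can be illustrated in one dimension. This is a deliberately simplified sketch, not the integration used in the embodiment: it assumes an orthographic view and a single scanline, and it uses a single anchor that only removes a constant depth offset, whereas stereo constraints in general can also correct low-frequency bending. All names are hypothetical.

```python
def integrate_profile(normals, spacing=1.0, anchor=None):
    """Integrate normals along one scanline into relative depths.

    normals : list of (nx, nz) pairs; for a surface z(x) the normal is
              proportional to (-dz/dx, 1), so dz/dx = -nx / nz
    anchor  : optional (index, depth) from a stereo point, used to fix
              the unknown integration constant
    """
    depths = [0.0]
    for (nx0, nz0), (nx1, nz1) in zip(normals, normals[1:]):
        slope0 = -nx0 / nz0
        slope1 = -nx1 / nz1
        # Trapezoidal step between neighbouring samples.
        depths.append(depths[-1] + 0.5 * (slope0 + slope1) * spacing)
    if anchor is not None:
        index, depth = anchor
        shift = depth - depths[index]
        depths = [z + shift for z in depths]
    return depths


plane = [(-0.5, 1.0)] * 5  # normals of a plane with slope dz/dx = 0.5
print(integrate_profile(plane))
print(integrate_profile(plane, anchor=(2, 10.0)))
```

Integration alone recovers the shape only up to an offset; the anchor supplies the absolute depth that the normals cannot provide.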

Therefore, when the matching is performed on the normals instead of on the initial two sets of images, a more reliable stereo matching can be performed that is robust even when dealing with objects without texture or gloss. As a result, constraining the reconstruction of the object to the pairwise correspondences from the stereo matching step can give a more accurate representation of the object geometry, which in turn provides a better initial estimate for the iterative procedure that updates the object geometry until it converges to a single object geometry.
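Matching patches of normals rather than patches of pixel intensities can be sketched for a single scanline. The sketch rests on several assumptions that are not taken from the text: the two normal maps are rectified so that epipolar lines coincide with rows, both maps express normals in a common reference frame, no patch warping is applied, and the similarity score is the mean dot product between corresponding normals. The function names are hypothetical.

```python
import math


def patch_score(left_row, right_row, lc, rc, half):
    """Mean dot product between two 1D patches of unit normals."""
    total = 0.0
    for k in range(-half, half + 1):
        a, b = left_row[lc + k], right_row[rc + k]
        total += a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return total / (2 * half + 1)


def match_along_row(left_row, right_row, lc, half=2, max_disp=8):
    """Scan the right row for the patch best matching the left patch at lc."""
    best_disp, best_score = None, -2.0
    for disp in range(max_disp + 1):
        rc = lc - disp
        if rc - half < 0:
            break  # patch would fall outside the right row
        score = patch_score(left_row, right_row, lc, rc, half)
        if score > best_score:
            best_disp, best_score = disp, score
    return best_disp, best_score


# Synthetic curved surface: the normal tilts steadily along the row.
right = [(math.sin(0.05 * i - 0.5), 0.0, math.cos(0.05 * i - 0.5)) for i in range(30)]
# The left view sees the same normals shifted by three columns.
left = [right[max(i - 3, 0)] for i in range(30)]
print(match_along_row(left, right, lc=15))
```

On a curved surface the normals change from column to column, so the score peaks sharply at the true shift even though no surface texture is involved.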

The architecture described above is also suitable for mobile phones that use a GPU. Although several embodiments have been described, these embodiments are presented by way of example only and are not intended to limit the scope of the invention. Indeed, the novel devices and methods described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the devices, methods, and products described herein may be made without departing from the spirit of the invention. The appended claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the invention.

Claims

1. A computer vision method for generating a three-dimensional reconstruction of an object, comprising:
receiving a first set of a plurality of photometric stereo images of the object and a second set of a plurality of photometric stereo images of the object, the first set including at least one image taken from a first camera at a first location using illumination from a plurality of different directions and the second set including at least one image taken from a second camera at a second location using illumination from a plurality of different directions;
generating a first normal map of the object using the first set of photometric stereo images;
generating a second normal map of the object using the second set of photometric stereo images;
determining a stereo estimate of the shape of the object by performing stereo matching between a plurality of patches of normals in the first normal map and a plurality of patches of normals in the second normal map;
using the first normal map and the second normal map together with the stereo estimate of the shape of the object to generate a reconstruction of the object.

The method of claim 1, wherein the first normal map and the second normal map are generated using an estimate of the shape of the object to recalculate a light distribution for multiple near-field effects of the illumination on the object.

The method of claim 2, further comprising: recalculating the light distribution due to multiple near-field effects using the reconstruction of the object; recalculating the first normal map and the second normal map from the recalculated light distribution; and generating a further reconstruction of the object.

Recalculating the light distribution due to a plurality of near field effects includes:
(a) using the reconstruction of the object to recalculate the first normal map and the second normal map from the recalculated light distribution;
(b) determining a further estimate of the shape of the object by performing stereo matching between a plurality of patches of normals in the recomputed first normal map and a plurality of patches of normals in the recomputed second normal map;
(c) generating a further reconstruction of the shape from at least one of the recalculated first normal map and the recalculated second normal map; and
repeating (a) through (c) until the further reconstructions of the object converge.

Using the first normal map and the second normal map together with the stereo estimate of the shape of the object to generate a reconstruction of the object comprises:
integrating the first normal map with constraints of the stereo estimate of the shape to generate a first reconstruction;
integrating the second normal map with constraints of the stereo estimate of the shape to generate a second reconstruction;
and combining the first reconstruction and the second reconstruction to generate a fused reconstruction, which is the reconstruction of the object.

The method of claim 5, wherein the fused reconstruction is generated using a Poisson solver.

Performing stereo matching on the first normal map and the second normal map includes:
selecting at least one group of pixels on the first normal map;
and searching for a matched group of pixels in the second normal map by scanning across the second normal map for matches.

8. The method of claim 7, wherein scanning across the second normal map for matches is performed across epipolar lines.

The method of claim 7, wherein the search for a matching pixel group is constrained by a current estimated reconstruction of the object.

Performing stereo matching on the first normal map and the second normal map includes:
performing patch warping on the at least one group of pixels of the first normal map;
and determining a corresponding group of pixels of the second normal map using the patch-warped at least one group of pixels of the first normal map.

The method of claim 7, wherein selecting at least one group of pixels on the first normal map and searching for matched pixel groups in the second normal map by scanning across the second normal map for matches is used to generate a first partial stereo estimate, the method further comprising:
selecting at least one group of pixels on the second normal map;
searching for matched pixel groups in the first normal map by scanning across the first normal map for matches to generate a second partial stereo estimate; and
combining the first partial stereo estimate and the second partial stereo estimate to form the stereo estimate of the shape, wherein points from the first partial stereo estimate and the second partial stereo estimate that do not match are discarded.

The method of claim 1, wherein the first set includes a first plurality of images taken from the first camera at the first location using illumination from a plurality of different directions, and the second set includes a second plurality of images taken from the second camera at the second location using illumination from a plurality of different directions.

The method of claim 1, wherein generating the first normal map includes inputting information representing the first set of photometric stereo images, an estimate of the shape of the object, and position information for a number of light sources into a neural network trained to output the first normal map.

The method of claim 13, wherein the information representing the first set of photometric stereo images, the estimate of the shape of the object, and the position information of the light sources is provided in the form of an observation map, where the observation map is generated for each pixel of the first camera, each observation map comprising a projection of multiple lighting directions onto a 2D plane, the multiple lighting directions for each pixel being obtained from each photometric stereo image.

A system for generating a three-dimensional (3D) reconstruction of an object, comprising an interface and a processor, wherein:
the interface has an image input and is configured to receive a set of photometric stereo images of an object, the set of photometric stereo images including a plurality of images using illumination from a plurality of different directions with one or more light sources; and
the processor is configured to:
receive a first set of a plurality of photometric stereo images of the object and a second set of a plurality of photometric stereo images of the object, the first set including at least one image taken from a first camera at a first location using illumination from a plurality of different directions and the second set including at least one image taken from a second camera at a second location using illumination from a plurality of different directions;
generate a first normal map of the object using the first set of photometric stereo images;
generate a second normal map of the object using the second set of photometric stereo images;
determine a stereo estimate of the shape of the object by performing stereo matching between a plurality of patches of normals in the first normal map and a plurality of patches of normals in the second normal map; and
use the first normal map and the second normal map together with the stereo estimate of the shape of the object to generate a reconstruction of the object.

The system of claim 15, wherein the first camera is configured differently from the second camera.

A carrier medium carrying computer readable instructions adapted to cause a computer to perform the method of claim 1.