JP2022519194A

JP2022519194A - Depth estimation

Info

Publication number: JP2022519194A
Application number: JP2021542489A
Authority: JP
Inventors: Tristan William Laidlow; Jan Czarnowski; Stefan Leutenegger
Original assignee: Imperial College Innovations Ltd
Current assignee: Ip2ipo Innovations Ltd
Priority date: 2019-01-24
Filing date: 2020-01-15
Publication date: 2022-03-22
Also published as: KR20210119417A; GB201901007D0; GB2580691A; US11941831B2; CN113330486A; GB2580691B; US20210350560A1; WO2020152437A1

Abstract

An image processing system for estimating the depth of a scene is provided. The image processing system comprises a fusion engine that receives a first depth estimate from a geometric reconstruction engine and a second depth estimate from a neural network architecture. The fusion engine is configured to probabilistically fuse the first depth estimate and the second depth estimate to output a fused depth estimate for the scene. The fusion engine is configured to receive an uncertainty measure for the first depth estimate from the geometric reconstruction engine and an uncertainty measure for the second depth estimate from the neural network architecture, and to use the uncertainty measures to probabilistically fuse the first depth estimate and the second depth estimate.

Description

The present invention relates to estimating the depth of a scene. The invention has particular, but not exclusive, relevance to depth estimation for use by a robotic device to navigate and/or interact with its environment.

In the fields of computer vision and robotics, there is often a need to construct a representation of a three-dimensional (3D) space. Constructing a representation of a 3D space allows a real-world environment to be mapped to a virtual or digital realm, where it can be used and manipulated by electronic devices. For example, in an augmented reality application, a user may use a handheld device to interact with virtual objects that correspond to entities in the surrounding environment, or a moveable robotic device may require a representation of a 3D space to allow simultaneous localization and mapping, and thus navigation of its environment. In many applications, intelligent systems may need a representation of an environment so that digital information sources can be coupled to physical objects. This enables advanced human-machine interfaces in which the physical environment surrounding a person becomes the interface. Similarly, such representations may also enable advanced machine-world interfaces, for example allowing robotic devices to interact with and manipulate physical objects in a real-world environment.

Several techniques are available for constructing a representation of a 3D space. For example, structure from motion and simultaneous localization and mapping (SLAM) are two such techniques. SLAM techniques typically involve estimating the depth of the 3D scene to be mapped. Depth estimation may be performed using a depth camera. However, depth cameras typically have limited range and relatively high power consumption, and may not function correctly in outdoor environments, such as in bright sunlight. In other cases, depth estimation may be performed without a depth camera, for example based on images of the space.

The paper "CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction" by K. Tateno et al., published in the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, describes fusing depth maps obtained by a convolutional neural network (CNN) with depth measurements obtained from direct monocular SLAM. To recover blurred depth borders, the CNN-predicted depth map is used as an initial guess for the reconstruction and is successively refined by a direct SLAM scheme that relies on per-pixel small-baseline stereo matching. However, this approach does not preserve global consistency.

Given existing techniques, there is a desire for effective and efficient methods of depth estimation, for example to improve the mapping of a 3D space.

According to a first aspect of the present invention, there is provided an image processing system for estimating the depth of a scene. The image processing system comprises a fusion engine that receives a first depth estimate from a geometric reconstruction engine and a second depth estimate from a neural network architecture, and probabilistically fuses the first depth estimate and the second depth estimate to output a fused depth estimate for the scene. The fusion engine is configured to receive an uncertainty measure for the first depth estimate from the geometric reconstruction engine and an uncertainty measure for the second depth estimate from the neural network architecture, and to use the uncertainty measures to probabilistically fuse the first depth estimate and the second depth estimate.

In some examples, the fusion engine is configured to receive a surface orientation estimate, and an uncertainty measure for the surface orientation estimate, from the neural network architecture, and to use the surface orientation estimate and its uncertainty measure to probabilistically fuse the first depth estimate and the second depth estimate.

In some examples, the surface orientation estimate comprises one or more of: a depth gradient estimate in a first direction; a depth gradient estimate in a direction orthogonal to the first direction; and a surface normal estimate.
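As an illustration of the two depth gradient quantities named above, the following is a minimal sketch that computes forward-difference gradients of a depth map in two orthogonal image directions. The function name `depth_gradients` and the choice of forward differences are assumptions made for illustration; the text does not specify how the neural network architecture represents or predicts these gradients.

```python
def depth_gradients(depth):
    """Return forward-difference depth gradients along x (columns) and y (rows).

    `depth` is a list of rows of depth values. The last column and last row
    have no forward neighbour, so their gradient is set to 0.0.
    """
    rows, cols = len(depth), len(depth[0])
    grad_x = [
        [depth[r][min(c + 1, cols - 1)] - depth[r][c] for c in range(cols)]
        for r in range(rows)
    ]
    grad_y = [
        [depth[min(r + 1, rows - 1)][c] - depth[r][c] for c in range(cols)]
        for r in range(rows)
    ]
    return grad_x, grad_y


# Hypothetical 2x3 depth map used only to exercise the function.
example_depth = [[1.0, 2.0, 4.0],
                 [2.0, 3.0, 5.0]]
example_gx, example_gy = depth_gradients(example_depth)
```

A surface normal can be derived from such gradients together with the camera intrinsics, which is one reason the text treats gradients and normals as alternative forms of surface orientation estimate.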

In some examples, the fusion engine is configured to determine a scale estimate when probabilistically fusing the first depth estimate and the second depth estimate.

In some examples, the scene is captured in a first frame of video data, and the second depth estimate is received for the first frame of video data. The first depth estimate comprises a plurality of first depth estimates for the first frame of video data, at least one of which is generated using a second frame of video data that differs from the first frame. The fusion engine is configured to iteratively output the fused depth estimate for the scene, processing the second depth estimate and one of the plurality of first depth estimates at each iteration.

In some examples, the first depth estimate, the second depth estimate and the fused depth estimate each comprise a depth map for a plurality of pixels.

In some examples, the first depth estimate is a semi-dense depth estimate, and the second depth estimate and the fused depth estimate each comprise a dense depth estimate.

In some examples, the system comprises a monocular camera to capture frames of video data, a tracking system to determine poses of the monocular camera during observation of the scene, and the geometric reconstruction engine. In such examples, the geometric reconstruction engine is configured to use the poses from the tracking system and the frames of video data to generate depth estimates for at least a subset of pixels from the frames of video data, and to minimize a photometric error to generate the depth estimates.

In some examples, the system comprises the neural network architecture, which comprises one or more neural networks and is configured to receive pixel values for frames of video data and to predict: a depth estimate for each of a first set of image portions, to generate the second depth estimate; at least one surface orientation estimate for each of a second set of image portions; one or more uncertainty measures associated with each depth estimate; and one or more uncertainty measures associated with each surface orientation estimate.

According to a second aspect of the present invention, there is provided a method of estimating the depth of a scene. The method comprises: generating a first depth estimate for the scene using a geometric reconstruction of the scene, the geometric reconstruction being configured to output an uncertainty measure for the first depth estimate; generating a second depth estimate for the scene using a neural network architecture, the neural network architecture being configured to output an uncertainty measure for the second depth estimate; and probabilistically fusing the first depth estimate and the second depth estimate using the uncertainty measures to generate a fused depth estimate for the scene.
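One simple way to picture uncertainty-weighted fusion of two depth estimates is inverse-variance weighting. The sketch below assumes each estimate is an independent Gaussian per pixel and that the uncertainty measure is a variance; neither assumption is stated in the text, and the function name `fuse_depths` and the use of `None` to mark pixels without a geometric estimate are hypothetical.

```python
def fuse_depths(d_geo, var_geo, d_net, var_net):
    """Fuse two per-pixel depth estimates, weighting each by 1 / variance.

    Assumes independent Gaussian estimates (an illustrative assumption).
    A pixel with no geometric estimate has d_geo = None; the network
    estimate and its variance are then passed through unchanged.
    """
    fused, fused_var = [], []
    for dg, vg, dn, vn in zip(d_geo, var_geo, d_net, var_net):
        if dg is None:
            fused.append(dn)
            fused_var.append(vn)
            continue
        w_geo, w_net = 1.0 / vg, 1.0 / vn
        fused.append((w_geo * dg + w_net * dn) / (w_geo + w_net))
        fused_var.append(1.0 / (w_geo + w_net))
    return fused, fused_var


# Two pixels: the first has both estimates, the second only the network one.
example_fused, example_var = fuse_depths(
    [2.0, None], [1.0, None], [4.0, 3.0], [1.0, 0.5]
)
```

Under these assumptions the more certain estimate dominates at each pixel, and the fused variance is never larger than either input variance.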

In some examples, the method comprises, prior to generating the first depth estimate, obtaining image data representative of two or more views of the scene from a camera. In such examples, generating the first depth estimate comprises obtaining a pose estimate for the camera, and generating the first depth estimate by minimizing a photometric error that is a function of at least the pose estimate and the image data.
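The idea of choosing a depth by minimizing a photometric error can be sketched in a deliberately simplified setting. The code below assumes a single image row, a camera that moves purely sideways by a known `baseline` between the two views (standing in for the pose estimate), a known `focal` length in pixels, and a brute-force search over candidate depths. All names and the absolute-difference error are assumptions for illustration, not the procedure used by the geometric reconstruction engine.

```python
def photometric_error(ref_row, other_row, x, depth, focal, baseline):
    """Absolute intensity difference between pixel x in the reference row and
    the pixel it maps to in the other view if the point lies at `depth`."""
    disparity = focal * baseline / depth
    x_other = int(round(x - disparity))
    if not 0 <= x_other < len(other_row):
        return float("inf")  # candidate depth projects outside the image
    return abs(ref_row[x] - other_row[x_other])


def best_depth(ref_row, other_row, x, candidates, focal, baseline):
    """Return the candidate depth with the lowest photometric error."""
    return min(
        candidates,
        key=lambda d: photometric_error(ref_row, other_row, x, d, focal, baseline),
    )


# Synthetic rows: the second view is the first shifted by two pixels.
ref = [10, 20, 30, 40, 50, 60]
other = [30, 40, 50, 60, 70, 80]
estimated = best_depth(ref, other, x=4, candidates=[1.0, 2.0, 4.0],
                       focal=4.0, baseline=1.0)
```

In poorly textured regions many candidate depths give similar errors, which is one intuition for why a geometric estimate benefits from an accompanying uncertainty measure.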

In some examples, the method comprises, prior to generating the first depth estimate, obtaining image data representative of one or more views of the scene from a camera. In such examples, generating the second depth estimate comprises: receiving the image data at the neural network architecture; predicting, using the neural network architecture, a depth estimate for each of a set of image portions, to generate the second depth estimate; predicting, using the neural network architecture, at least one surface orientation estimate for each of the set of image portions; and predicting, using the neural network architecture, a set of uncertainty measures for each depth estimate and each surface orientation estimate. The surface orientation estimate may comprise one or more of: a depth gradient estimate in a first direction; a depth gradient estimate in a direction orthogonal to the first direction; and a surface normal estimate.

In some examples, the method comprises, prior to generating the first depth estimate, obtaining image data representative of two or more views of the scene from a camera, the image data comprising a plurality of pixels. In such examples, generating the first depth estimate comprises obtaining a pose estimate for the camera, and generating a semi-dense depth estimate comprising depth estimates for a portion of the pixels in the image data. In these examples, generating the second depth estimate comprises generating a dense depth estimate for the pixels in the image data, and probabilistically fusing the first depth estimate and the second depth estimate comprises outputting a dense depth estimate for the pixels in the image data.

In some examples, the method is repeated iteratively and, for a subsequent iteration, the method comprises determining whether to generate the second depth estimate. Probabilistically fusing the first depth estimate and the second depth estimate then comprises, responsive to a determination not to generate the second depth estimate, using a previous set of values for the second depth estimate.

In some examples, the method is applied to frames of video data, and probabilistically fusing the first depth estimate and the second depth estimate comprises, for a given frame of video data, optimizing a cost function comprising a first cost term associated with the first depth estimate and a second cost term associated with the second depth estimate. In such examples, the first cost term comprises a function of fused depth estimate values, first depth estimate values and uncertainty values for the first depth estimate, and the second cost term comprises a function of fused depth estimate values, second depth estimate values and uncertainty values for the second depth estimate. The cost function is optimized to determine the fused depth estimate values. Optimizing the cost function may comprise determining a scale factor for the fused depth estimate, the scale factor indicating a scale of the fused depth estimate with respect to the scene. In some examples, the method comprises generating at least one surface orientation estimate for the scene using the neural network architecture, the neural network architecture being configured to output an uncertainty measure for each of the at least one surface orientation estimate. In that case, the cost function comprises a third cost term associated with the at least one surface orientation estimate, the third cost term comprising a function of fused depth estimate values, surface orientation estimate values and uncertainty values for each of the at least one surface orientation estimate.
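To make the structure of such a cost function concrete, the sketch below uses a hypothetical quadratic cost with two terms and a single scale factor `scale`. It assumes, purely for illustration, that the uncertainty values are variances and that the scale factor multiplies the geometric (first) depth values; the text does not specify the form of the cost terms or where the scale factor enters. With this particular form the optimum can be written in closed form, so no iterative solver is needed.

```python
def cost(fused, scale, d_geo, var_geo, d_net, var_net):
    """Hypothetical two-term cost: a geometric term plus a network term,
    each a squared residual divided by the corresponding variance."""
    return sum(
        (d - scale * g) ** 2 / vg + (d - n) ** 2 / vn
        for d, g, vg, n, vn in zip(fused, d_geo, var_geo, d_net, var_net)
    )


def fuse_with_scale(d_geo, var_geo, d_net, var_net):
    """Minimize the hypothetical cost over the fused depths and the scale.

    Eliminating the fused depths leaves a weighted least-squares problem in
    the scale alone, with per-pixel weight 1 / (var_geo + var_net).
    """
    items = list(zip(d_geo, var_geo, d_net, var_net))
    numerator = sum(g * n / (vg + vn) for g, vg, n, vn in items)
    denominator = sum(g * g / (vg + vn) for g, vg, n, vn in items)
    scale = numerator / denominator
    fused = [
        (scale * g / vg + n / vn) / (1.0 / vg + 1.0 / vn)
        for g, vg, n, vn in items
    ]
    return fused, scale


# Geometric depths are half the network depths, so the scale should be 2.
example_fused, example_scale = fuse_with_scale(
    [1.0, 2.0], [1.0, 1.0], [2.0, 4.0], [1.0, 1.0]
)
```

A third term penalizing disagreement between the gradients of the fused depth and a predicted surface orientation would couple neighbouring pixels, in which case a general least-squares solver would typically replace the closed form shown here.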

In a particular set of examples according to the second aspect, the geometric reconstruction of the scene is configured to generate a first depth probability volume for the scene. The first depth probability volume comprises a first plurality of depth estimates, including the first depth estimate, and a first plurality of uncertainty measures, each associated with a respective depth estimate of the first plurality of depth estimates. The uncertainty measure associated with a given depth estimate of the first plurality represents a probability that a given region of the scene is at the depth represented by that depth estimate. The neural network architecture is configured to output a second depth probability volume for the scene, which comprises a second plurality of depth estimates, including the second depth estimate, and a second plurality of uncertainty measures, each associated with a respective depth estimate of the second plurality of depth estimates. The uncertainty measure associated with a given depth estimate of the second plurality likewise represents a probability that a given region of the scene is at the depth represented by that depth estimate.

In some examples of the particular set of examples, generating the second depth estimate for the scene comprises processing image data representative of an image of the scene using the neural network architecture to generate the second depth probability volume. The second plurality of depth estimates comprises a plurality of sets of depth estimates, each associated with a different respective portion of the image of the scene.

In some examples of the particular set of examples, the second plurality of depth estimates comprises depth estimates having predefined values. The predefined values may be non-uniformly spaced. The predefined values may comprise a plurality of log-depth values within a predefined depth range.
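Predefined depth values that are uniformly spaced in log-depth are non-uniformly spaced in depth, which gives finer resolution close to the camera. The sketch below generates such values; the function name `log_depth_bins`, the range limits and the number of bins are illustrative assumptions rather than values taken from the text.

```python
import math


def log_depth_bins(d_min, d_max, n_bins):
    """Return n_bins depth values from d_min to d_max, evenly spaced in log-depth."""
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / (n_bins - 1)
    return [math.exp(log_min + i * step) for i in range(n_bins)]


# Five bins between 0.5 and 8.0 (arbitrary units): each is about twice the last.
example_bins = log_depth_bins(0.5, 8.0, 5)
```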

In some examples of the particular set of examples, generating the first depth probability volume for the scene comprises: processing a first frame of video data representing a first observation of the scene and a second frame of video data representing a second observation of the scene to generate, for each of a plurality of portions of the first frame, a set of photometric errors, each associated with a different respective depth estimate of the first plurality of depth estimates; and scaling the photometric errors to convert them to respective probability values.
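One possible way to turn a set of photometric errors into probability values is a softmax over the negated errors, so that the lowest error receives the highest probability and the values sum to one. This is an assumed choice for illustration; the text only states that the errors are scaled. The `temperature` parameter, which controls how sharply low errors are favoured, is likewise hypothetical.

```python
import math


def errors_to_probabilities(errors, temperature=1.0):
    """Map photometric errors (one per candidate depth) to probabilities.

    Uses a softmax over -error / temperature. The minimum error is
    subtracted first for numerical stability; this does not change the result.
    """
    lowest = min(errors)
    weights = [math.exp(-(e - lowest) / temperature) for e in errors]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidate depths for one image portion; the second has the lowest error.
example_probs = errors_to_probabilities([0.9, 0.1, 0.5])
```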

In some examples of the particular set of examples, probabilistically fusing the first depth estimate and the second depth estimate using the uncertainty measures comprises combining the first plurality of uncertainty measures with the second plurality of uncertainty measures to generate a fused probability volume. In these examples, generating the fused depth estimate for the scene may comprise obtaining the fused depth estimate from the fused probability volume. These examples may comprise obtaining a depth probability function using the fused probability volume, and using the depth probability function to obtain the fused depth estimate. Obtaining the fused depth estimate may comprise optimizing a cost function comprising a first cost term obtained using the fused probability volume and a second cost term comprising a local geometric constraint on depth values. In such cases, the method may further comprise receiving a surface orientation estimate and an occlusion boundary estimate from a further neural network architecture, and generating the second cost term using the surface orientation estimate and the occlusion boundary estimate. In these examples, the fused depth probability volume may be a first fused depth probability volume associated with a first frame of video data representing a first observation of the scene, and the method may comprise: converting the first fused depth probability volume to a first occupancy probability volume; warping the first occupancy probability volume, based on pose data representing poses of the camera during observation of the scene, to obtain a second occupancy probability volume associated with a second frame of video data representing a second observation of the scene; and converting the second occupancy probability volume to a second fused depth probability volume associated with the second frame.
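Combining two sets of per-depth probabilities can be pictured, for a single image portion, as an element-wise product followed by renormalization. The sketch below treats the two sources as independent, which is an assumption; the text says only that the uncertainty measures are combined. The function names, the small `eps` added to avoid an all-zero product, and the use of the most probable bin as the depth estimate are all illustrative choices.

```python
def fuse_depth_distributions(p_geo, p_net, eps=1e-12):
    """Fuse two probability distributions over the same depth bins by
    multiplying them bin by bin and renormalizing to sum to one."""
    product = [a * b + eps for a, b in zip(p_geo, p_net)]
    total = sum(product)
    return [v / total for v in product]


def most_likely_depth(probs, bin_depths):
    """Return the depth value of the bin with the highest probability."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return bin_depths[best_index]


# One image portion, three depth bins at 1.0, 2.0 and 4.0 (arbitrary units).
example_fused = fuse_depth_distributions([0.1, 0.6, 0.3], [0.2, 0.3, 0.5])
example_depth = most_likely_depth(example_fused, [1.0, 2.0, 4.0])
```

Taking the most probable bin limits the result to the predefined depth values; fitting a continuous depth probability function to the fused volume, as the text describes, avoids that quantization.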

According to a third aspect of the present invention, there is provided an image processing system for estimating the depth of a scene. The image processing system comprises: a fusion engine to receive a first depth probability volume from a geometric reconstruction engine and a second depth probability volume from a neural network architecture, and to fuse the first depth probability volume and the second depth probability volume to output a fused depth probability volume for the scene; and a depth estimation engine to estimate the depth of the scene using the fused depth probability volume.

According to a fourth aspect of the present invention, there is provided a method of estimating the depth of a scene. The method comprises: generating a first depth probability volume for the scene using a geometric reconstruction of the scene; generating a second depth probability volume for the scene using a neural network architecture; fusing the first depth probability volume and the second depth probability volume to generate a fused depth probability volume for the scene; and generating a fused depth estimate for the scene using the fused depth probability volume.

According to a fifth aspect of the present invention, there is provided a computing system comprising: a monocular capture device to provide frames of video; a simultaneous localization and mapping system to provide pose data for the monocular capture device; the system of the first or third aspect; a semi-dense multi-view stereo component to receive the pose data and the frames of video and to implement the geometric reconstruction engine; and electronic circuitry to implement the neural network architecture.

According to a sixth aspect of the present invention, there is provided a robotic device comprising: the computing system of the fifth aspect; one or more actuators to enable the robotic device to interact with a surrounding three-dimensional environment, at least a portion of the surrounding three-dimensional environment being shown in the scene; and an interaction engine comprising at least one processor to control the one or more actuators, the interaction engine using the fused depth estimate to interact with the surrounding three-dimensional environment.

According to a seventh aspect of the present invention, there is provided a non-transitory computer-readable storage medium comprising computer-executable instructions which, when executed by a processor, cause a computing device to perform any of the methods described above.

Further features will become apparent from the following description of embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.

A schematic diagram showing an example of a three-dimensional (3D) space.
A schematic diagram showing the available degrees of freedom of an example object in a 3D space.
A schematic diagram showing video data generated by an example capture device.
A schematic diagram of an image processing system according to an example.
A schematic diagram of an image processing system according to a further example.
A schematic diagram of an image processing system according to a yet further example.
A schematic diagram showing a surface orientation estimate and an uncertainty measure for the surface orientation estimate according to an example.
A schematic diagram of an image processing system according to a still further example.
A schematic diagram showing components of a computing system according to an example.
A schematic diagram showing components of a robotic device according to an example.
A schematic diagram showing an example of the various features described with reference to FIGS. 1 to 7.
A flow diagram showing an example method of estimating the depth of a scene.
A flow diagram showing a further example method of estimating the depth of a scene.
A schematic diagram showing an example of a processor and a non-transitory computer-readable storage medium comprising computer-executable instructions.
A schematic diagram showing the fusion of a first depth estimate and a second depth estimate for a scene according to a further example.
A schematic diagram of a system for obtaining a second depth probability volume according to an example.
A schematic diagram showing an example of the uncertainty measures associated with respective depth estimates obtained using the system of FIG. 13.
A schematic diagram of a system for obtaining a first depth probability volume according to an example.
A flow diagram showing an example method of obtaining a fused depth estimate for a scene.
A schematic diagram of a system for obtaining a fused depth estimate using the method of FIG. 16.
A flow diagram showing an example method of obtaining a second fused depth probability volume.
A flow diagram showing an example method of estimating the depth of a scene according to a further example.
A schematic diagram of an image processing system for estimating the depth of a scene according to a further example.

Some examples described herein enable the depth of a scene to be estimated. Such examples involve generating a first depth estimate for the scene using a geometric reconstruction of the scene. The first depth estimate may be generated, for example, by processing images of the scene. An image may be, for example, a two-dimensional (2D) color image, such as an RGB (red, green, blue) image. The first depth estimate may be generated based on geometric constraints. For example, it may be assumed that the color of a pixel in an image representing a given portion of the scene is independent of the position of the camera used to capture the image. This may be exploited to generate the first depth estimate, as explained further with reference to the figures. The geometric reconstruction is also configured to output an uncertainty measure for the first depth estimate, which indicates, for example, how precise the first depth estimate is. For example, where the first depth estimate is well constrained and can be estimated accurately, the uncertainty measure may be lower than in other cases in which the first depth estimate is less well constrained.

A second depth estimate of the scene is generated using a neural network architecture. The neural network architecture is also configured to output an uncertainty measure for the second depth estimate. For example, an image of the scene may be processed using a neural network architecture such as a convolutional neural network (CNN) that has been trained to predict both a depth estimate and an associated uncertainty from an input image. The uncertainty measure may indicate the confidence in the associated second depth estimate. For example, a second depth estimate for an image region containing an object that was not present in the training data used to train the neural network architecture may be relatively uncertain, and may therefore be associated with a relatively high uncertainty measure obtained from the neural network architecture. Conversely, a second depth estimate for an image region containing objects that were present in the training data may be associated with a lower uncertainty measure.

The first depth estimate and the second depth estimate are probabilistically fused using the uncertainty measures to generate a fused depth estimate of the scene. Combining the first and second depth estimates in this way may improve the accuracy of the fused depth estimate. For example, the first depth estimate (which is based on geometric constraints) may provide a reliable estimate of the relative depth of a portion of the scene compared with another portion of the scene. In this way, the first depth estimate may be able to place or otherwise locate that portion of the scene at a suitable position within the real-world environment, for example relative to other portions of the scene. However, the first depth estimate may be less accurate at capturing the depth gradient within that portion of the scene, such as a change in depth within that portion, for example due to uneven texture of a surface within that portion. In contrast, the second depth estimate (obtained from the neural network architecture) may accurately capture depth gradients within the scene, but may be less accurate at locating a given portion of the scene relative to other portions of the scene. By probabilistically fusing the first and second depth estimates using the uncertainty measures, however, the individual strengths of the first and second depth estimates may be synergistically enhanced, improving the accuracy of the fused depth estimate. For example, the uncertainty measures may constrain the fusion of the first and second depth estimates to ensure global consistency of the fused depth estimate. Furthermore, blurring artifacts in the estimated depth of the scene may be reduced compared with other methods.
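As an illustration only, the idea of weighting two depth estimates by their uncertainties can be sketched for a single pixel. The sketch assumes that each estimate is an independent Gaussian described by a depth and a variance, and fuses them by inverse-variance weighting. This model, the function name `fuse_gaussian` and the numeric values are assumptions made for the illustration; this passage does not specify the fusion rule actually used by the fusion engine.

```python
def fuse_gaussian(depth_1, var_1, depth_2, var_2):
    """Fuse two depth estimates of one pixel by inverse-variance weighting.

    Assumes (for illustration) that both estimates are independent Gaussians.
    The estimate with the lower variance receives the higher weight.
    """
    weight_1 = 1.0 / var_1
    weight_2 = 1.0 / var_2
    fused_var = 1.0 / (weight_1 + weight_2)
    fused_depth = fused_var * (weight_1 * depth_1 + weight_2 * depth_2)
    return fused_depth, fused_var


# Hypothetical values: a confident geometric estimate (2.0 m, variance 0.04)
# and a less confident network estimate (2.6 m, variance 0.36).
fused_depth, fused_var = fuse_gaussian(2.0, 0.04, 2.6, 0.36)
print(fused_depth, fused_var)
```

Under these assumptions the fused depth lies closer to the more certain estimate, and the fused variance is smaller than either input variance.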

FIGS. 1A and 1B schematically show an example of a 3D space and the capture of image data associated with that space. FIG. 1C then shows a capture device configured to generate image data when viewing the space. These examples are presented to better explain certain features described herein and should not be considered limiting; certain features have been omitted or simplified for ease of explanation.

FIG. 1A shows an example 100 of a 3D space 110. The 3D space 110 may be an internal and/or external physical space, for example at least a portion of a room or a geographical location. The 3D space 110 in this example 100 includes a number of physical objects 115 located within the 3D space. These objects 115 may include one or more of, among others, people, electronic devices, furniture, animals, building portions and equipment. Although the 3D space 110 in FIG. 1A is shown with a lower surface, this need not be the case in all implementations; for example, the environment may be aerial or within extraterrestrial space.

The example 100 also shows various example capture devices 120-A, 120-B, 120-C (collectively referred to with the reference numeral 120) that may be used to capture video data associated with the 3D space 110. A capture device, such as the capture device 120-A of FIG. 1A, may comprise a camera configured to record data resulting from observing the 3D space 110, in either digital or analog form. For example, the capture device 120-A may be a monocular capture device, such as a monocular camera. A monocular camera typically captures an image of a scene from one position at a time and may have a single lens or lens system. In contrast, a stereo camera generally includes at least two lenses, with a separate image sensor for each lens. A monocular capture device usable as the capture device 120-A may be a monocular multi-directional camera device arranged to capture images of the 3D space 110 from a plurality of angular positions. In use, multiple images may be captured one after another. In some cases, the plurality of angular positions cover a wide field of view. In a particular case, the capture device 120-A may comprise an omni-directional camera, for example a device configured to capture a field of view of substantially 360 degrees. In this case, the omni-directional camera may comprise a device with a panoramic-annular lens; for example, the lens may be mounted in relation to a charge-coupled array.

To capture a plurality of images of the 3D space from a plurality of different positions, the capture device 120-A may be movable. For example, the capture device 120-A may be configured to capture different frames corresponding to different observed portions of the 3D space 110. The capture device 120-A may be movable with reference to a static mounting, and may for example comprise actuators to change the position and/or orientation of the camera with regard to the 3D space 110. In another case, the capture device 120-A may be a handheld device operated and moved by a human user. In one case, the capture device 120-A may comprise a still image device, such as a camera, configured to capture a sequence of images; in another case, the capture device 120-A may comprise a video device that captures video data comprising a sequence of images in the form of video frames. For example, the capture device 120-A may be a monocular camera or monocular capture device that captures or otherwise obtains frames of video data.

FIG. 1A also shows a plurality of capture devices 120-B, 120-C coupled to a robotic device 130 that is configured to move within the 3D space 110. The robotic device 130 may include an autonomous aerial and/or terrestrial mobile device. In the present example 100, the robotic device 130 comprises actuators 135 that enable the device to navigate the 3D space 110. These actuators 135 comprise wheels in the illustration; in other cases, they may include tracks, burrowing mechanisms, rotors and the like. One or more capture devices 120-B, 120-C may be statically or movably mounted on such a device. In some cases, a robotic device may be statically mounted within the 3D space 110, while a portion of the device, such as an arm or other actuator, is configured to move within the space and interact with objects within the space. Each capture device 120-B, 120-C may capture a different type of image data or video data, and/or may include a stereo image source. In one case, at least one of the capture devices 120-B, 120-C is configured to capture photometric data, for example color or grayscale images. In one case, one or more of the capture devices 120-B, 120-C may be movable independently of the robotic device 130. In one case, one or more of the capture devices 120-B, 120-C may be mounted on a rotating mechanism, for example one that rotates through an angled arc and/or through 360 degrees, and/or may be arranged with adapted optics to capture a panorama of a scene (for example, up to a full 360 degree panorama). It will be appreciated that, in some cases, a capture device similar or identical to the capture device 120-A may be used as one or both of the capture devices 120-B, 120-C of FIG. 1A.

FIG. 1B shows an example 140 of the degrees of freedom available to a capture device 120 and/or a robotic device 130. In the case of a capture device such as 120-A, a direction 150 of the device may be collinear with the axis of a lens or other imaging apparatus. A normal axis 155 is shown in the figure as an example of rotation about one of three axes. Similarly, in the case of the robotic device 130, a direction of alignment 145 of the robotic device 130 may be defined. This may indicate a facing of the robotic device and/or a direction of travel. A normal axis 155 is also shown. Although only a single normal axis is shown with reference to the capture device 120 or the robotic device 130, these devices may rotate about any one or more of the axes shown schematically as 140, as described below.

More generally, the orientation and position of a capture device may be defined in three dimensions with reference to six degrees of freedom (6DOF): a position may be defined within each of the three dimensions, for example by an [x, y, z] coordinate, and an orientation may be defined by an angle vector representing a rotation about each of the three axes, for example [θ_x, θ_y, θ_z]. The position and orientation may be regarded as a transformation within three dimensions, for example with respect to an origin defined within a 3D coordinate system. For example, the [x, y, z] coordinate may represent a translation from the origin to a particular position within the 3D coordinate system, and the angle vector [θ_x, θ_y, θ_z] may define a rotation within the 3D coordinate system. A transformation having 6DOF may be defined as a matrix, such that multiplication by the matrix applies the transformation. In some implementations, a capture device may be defined with reference to a restricted set of these six degrees of freedom; for example, for a capture device on a ground vehicle the y-dimension may be constant. In some implementations, such as that of the robotic device 130, the orientation and position of a capture device coupled to another device may be defined with reference to the orientation and position of that other device, for example with reference to the orientation and position of the robotic device 130.
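A minimal sketch of how a 6DOF pose can be written as a matrix and applied by multiplication is given below. It assumes a 4x4 homogeneous matrix and a rotation order of x, then y, then z (R = Rz·Ry·Rx); the text above does not fix a rotation convention, so both choices, and the function names, are assumptions made for the illustration.

```python
import math


def rotation_xyz(theta_x, theta_y, theta_z):
    """3x3 rotation matrix R = Rz * Ry * Rx (assumed convention, radians)."""
    cx, sx = math.cos(theta_x), math.sin(theta_x)
    cy, sy = math.cos(theta_y), math.sin(theta_y)
    cz, sz = math.cos(theta_z), math.sin(theta_z)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def pose_matrix(position, angles):
    """4x4 homogeneous transform built from [x, y, z] and [θ_x, θ_y, θ_z]."""
    r = rotation_xyz(*angles)
    x, y, z = position
    return [r[0] + [x], r[1] + [y], r[2] + [z], [0.0, 0.0, 0.0, 1.0]]


def transform_point(pose, point):
    """Apply the pose to a 3D point by matrix multiplication."""
    h = [point[0], point[1], point[2], 1.0]
    return [sum(pose[i][j] * h[j] for j in range(4)) for i in range(3)]


# A pose translated to [1, 0, 2] and rotated 90 degrees about the z axis.
pose = pose_matrix([1.0, 0.0, 2.0], [0.0, 0.0, math.pi / 2])
moved = transform_point(pose, [1.0, 0.0, 0.0])
print(moved)
```

With this convention, the point [1, 0, 0] is first rotated onto the y axis and then translated, so it ends up at approximately [1, 1, 2].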

In the examples described herein, the orientation and position of a capture device, for example as set out in a 6DOF transformation matrix, may be defined as the pose of the capture device. Likewise, the orientation and position of an object representation, for example as set out in a 6DOF transformation matrix, may be defined as the pose of the object representation. The pose of a capture device may vary over time, for example as video data or a series of still images is recorded, such that the capture device may have a different pose at a time t+1 than at a time t. In the case of a handheld mobile computing device comprising a capture device, the pose may vary as the handheld device is moved by a user within the 3D space 110.

FIG. 1C schematically shows an example of a capture device configuration. In the example 160 of FIG. 1C, a capture device 165 is configured to generate image data 170. In FIG. 1C, the image data 170 comprises a plurality of frames 175. Each frame 175 may relate to a particular time t within the period over which images of a 3D space, such as the 3D space 110 of FIG. 1A, are captured (that is, F_t). A frame 175 generally consists of a 2D representation of measured data. For example, a frame 175 may comprise a 2D array or matrix of pixel values recorded at time t. In the example of FIG. 1C, all frames 175 within the image data are the same size, although this need not be the case in all examples. Pixel values within a frame 175 represent a measurement of a particular portion of the 3D space. In FIG. 1C, the image data represents a plurality of views of the scene from a monocular capture device, each of which is captured at a different respective time t. In other cases, however, image data captured by a capture device (that is, an image or video capture system) may represent a plurality of views of the scene captured at the same time as each other, or at times that at least partially overlap. This may be the case where the capture device is a stereo capture system.

In the example of FIG. 1C, each frame 175 comprises photometric data. Photometric data typically represents photometric characteristics of an image, such as brightness, intensity or color. In FIG. 1C, each frame 175 comprises an intensity value for each pixel of the frame 175, which may be stored, for example, as a grayscale or brightness level from 0 to 255 per color band or color channel. A grayscale level of 0, for example, corresponds to the darkest intensity (for example black), a grayscale level of 255 corresponds to the lightest intensity (for example white), and grayscale levels between 0 and 255 correspond to intermediate intensities between black and white. In FIG. 1C, the photometric data represents red, green and blue pixel intensity values for a given resolution. Hence, each frame 175 represents a color image, in which each [x, y] pixel value in the frame comprises an RGB vector [R, G, B]. As an example, the resolution of the color data may be 640 by 480 pixels. In other examples, other color spaces may be used and/or the photometric data may represent other photometric characteristics.
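The frame layout described above can be sketched as a 2D array of [R, G, B] vectors with one 0 to 255 value per color channel. The nested-list storage, the helper names and the luma weights used to reduce a color pixel to a single intensity (the Rec. 601 weights) are assumptions made for the illustration, not details given in this description.

```python
WIDTH, HEIGHT = 640, 480  # example resolution mentioned in the text


def make_frame(width, height, rgb=(0, 0, 0)):
    """Create a frame as HEIGHT rows of WIDTH (R, G, B) tuples."""
    return [[rgb for _ in range(width)] for _ in range(height)]


def intensity(frame, x, y):
    """Return a single intensity for pixel [x, y] (assumed Rec. 601 weights)."""
    r, g, b = frame[y][x]
    return 0.299 * r + 0.587 * g + 0.114 * b


frame = make_frame(WIDTH, HEIGHT)   # all pixels start black
frame[10][20] = (255, 255, 255)     # set the pixel at x=20, y=10 to white

print(len(frame), len(frame[0]))    # rows, columns
print(intensity(frame, 0, 0), intensity(frame, 20, 10))
```

An intensity function of this kind is the sort of per-pixel photometric measurement that the later photometric error computation relies on.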

The capture device 165 may be configured to store the image data 170 in a coupled data storage device. In another case, the capture device 165 may transmit the image data 170 to a coupled computing device, for example as a stream of data or on a frame-by-frame basis. The coupled computing device may be directly coupled, for example via a universal serial bus (USB) connection, or indirectly coupled, for example such that the image data 170 is transmitted over one or more computer networks. In yet another case, the capture device 165 may be configured to transmit the image data 170 across one or more computer networks for storage in a network-attached storage device. The image data 170 may be stored and/or transmitted on a frame-by-frame basis or on a batch basis, for example with a plurality of frames bundled together.

The image data 170 may also undergo one or more pre-processing operations before it is used in the examples described below. In one case, pre-processing may be applied such that two sets of frames have a common size and resolution.

In some cases, the capture device 165 may be configured to generate video data as a form of image data. Video data may similarly represent a plurality of frames captured at different respective times. In one case, video data captured by the capture device 165 may comprise a compressed video stream or file. In this case, frames of video data may be reconstructed from the stream or file, for example as the output of a video decoder. Video data may be retrieved from memory locations following pre-processing of the video stream or file.

FIG. 1C is provided as an example, and it will be appreciated that configurations different from those shown in the figure may be used to generate the image data 170 for use in the methods and systems described below. The image data 170 may further include any measured sensory input arranged in a two-dimensional form that represents a captured or recorded view of a 3D space. For example, this may include photometric data, depth data, electromagnetic imaging, ultrasonic imaging and radar output, among others. In these cases, only an imaging device associated with the particular form of data may be required, for example an RGB device without depth data.

FIG. 2 shows an example image processing system 200 for estimating the depth of a scene. In the image processing system 200 of FIG. 2, a first depth estimate 230 and an uncertainty measure 235 for the first depth estimate 230 are generated by a geometric reconstruction engine. The first depth estimate 230 and the uncertainty measure 235 for the first depth estimate 230 may be collectively referred to as first depth data 250. The geometric reconstruction engine is configured to obtain the first depth estimate 230, for example by processing at least two images of the scene. As described with reference to FIGS. 1A to 1C, the at least two images may be captured using any suitable capture device and may be represented as image data such as RGB data. The geometric reconstruction engine may use photometric techniques to generate the first depth estimate 230. For example, a given portion of the scene should have the same photometric characteristics (such as brightness, intensity and/or color) irrespective of the position of the capture device used to obtain an image of that portion of the scene. The geometric reconstruction engine may exploit this to generate the first depth estimate 230. As an example, the geometric reconstruction engine may process at least two images of the same scene, captured from different respective positions, to determine the depth of a given portion of the scene that minimizes a photometric error. For example, the photometric error may be minimized when the first depth estimate 230 is closest to the actual depth of the given portion of the scene. This is merely an example, however, and other geometric techniques may be used to generate the first depth estimate 230 in other examples. In particular, the geometric reconstruction techniques described herein, for example with reference to FIG. 2, use two images, which may be obtained using a monocular system, for example; in other examples, one or more images may be used, for example in the case of a single stereo image.

In some cases, image data representing two or more views of the scene is obtained from a capture device, such as a camera, before the first depth estimate 230 is generated. In such cases, generating the first depth estimate 230 comprises obtaining a pose estimate for the camera, and generating the first depth estimate 230 by minimizing a photometric error that is a function of at least the pose estimate and the image data.

A pose estimate for a camera typically indicates the position and orientation of the camera during capture of the image represented by the image data. Where the image data represents a series of views, corresponding for example to frames of a video, the pose estimate may indicate the position and orientation of the camera over time through the frames of the video. For example, the image data may be obtained by moving a camera (such as an RGB camera) around an environment (such as the interior of a room). At least a subset of the frames of the video (and hence a subset of the images represented by the image data) may therefore have corresponding pose estimates representing the position and orientation of the camera at the time the frame was recorded. Pose estimates may not exist for every frame of a video (or every image in a series of images), but may be determined for a subset of times within the recorded time range of the video, or for a subset of the images obtained by the camera.

A variety of different methods may be used to obtain the pose estimate for the camera. For example, the pose of the camera may be estimated using a known SLAM system that receives the image data and outputs a pose, using sensors of the camera that indicate position and orientation, and/or using a custom pose tracking method. In a SLAM system, for example, the pose of the camera may be estimated based on processing of images captured by the camera over time.

The first depth estimate 230 may be obtained by minimizing a photometric error that is a function of at least the pose estimate and the image data. In some cases, a mapping function may be applied to map pixels of a first image (corresponding to a first view of the scene) to corresponding positions in a second image (corresponding to a second view of the scene), to obtain a remapped version of the first image. Such a mapping function depends, for example, on the estimated pose of the camera during capture of the first image and on the depth of the pixels in the first image. A photometric characteristic may then be determined for each pixel of the remapped version of the first image (for example, using an intensity function that returns the intensity value of a given pixel). A corresponding photometric characteristic may then be determined for each pixel of the first image (as obtained by the camera) using the same intensity function. Since a photometric characteristic (such as a pixel intensity value) associated with a pixel of a given depth should be independent of the camera pose, the photometric characteristics of the remapped version of the first image and of the first image itself should be identical if the depth is estimated correctly. In this way, the depth of a pixel in the first image may be varied iteratively, and a photometric error (based, for example, on the difference between the photometric characteristics of the first image and those of the remapped version of the first image) may be calculated for each iteration. The first depth estimate 230 for a given pixel may be taken as the depth value that minimizes this photometric error. In examples, the depth estimates used iteratively during the photometric error minimization process may lie along an epipolar line in the image. If a depth estimate already exists for a given pixel (for example, one obtained from a previous frame or image with a pixel corresponding to the given pixel), the depth estimates input iteratively to the photometric error calculation may lie within a given range of the previous depth estimate, for example within plus or minus two times the uncertainty measure associated with the previous depth estimate. This may improve the efficiency of generating the first depth estimate 230 by concentrating the search for a suitable depth value on a more likely range of depth values. In some cases, an interpolation may be performed between two neighboring depth values that are close to, or that include, the depth value associated with the minimal photometric error. A suitable method for obtaining the first depth estimate 230 is described in the paper "Semi-Dense Visual Odometry for a Monocular Camera" by J. Engel et al., published in the Proceedings of the International Conference on Computer Vision (ICCV), 2013. However, other methods may be used instead.
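The depth search described above can be sketched in a heavily simplified setting. The sketch assumes a rectified pair of one-dimensional scanlines, a pinhole camera with focal length `focal` (in pixels) and a second view displaced by `baseline` along the scanline, so that a pixel at column u with depth d maps to column u - focal * baseline / d in the second view and the epipolar line is the scanline itself. It compares first-image intensities with second-image intensities sampled at the mapped positions over a small window. The geometry, the window, the synthetic data and all names are assumptions made for the illustration; this is not the method of the cited paper.

```python
def sample(line, x):
    """Linearly interpolate a scanline at a non-negative sub-pixel position x."""
    k = int(x)
    t = x - k
    if t == 0.0:
        return line[k]
    return (1.0 - t) * line[k] + t * line[k + 1]


def photometric_error(first, second, u, depth, focal, baseline, half_window=2):
    """Sum of squared intensity differences for one candidate depth."""
    disparity = focal * baseline / depth  # shift implied by the candidate depth
    error = 0.0
    for du in range(-half_window, half_window + 1):
        residual = first[u + du] - sample(second, u + du - disparity)
        error += residual * residual
    return error


def estimate_depth(first, second, u, candidates, focal, baseline):
    """Return the candidate depth with the smallest photometric error."""
    return min(
        candidates,
        key=lambda d: photometric_error(first, second, u, d, focal, baseline),
    )


# Synthetic scanlines: the second view sees the first one shifted by 5 pixels,
# which corresponds to a depth of 5.0 when focal * baseline = 25.
FOCAL, BASELINE = 100.0, 0.25
first = [0.025 * u * u for u in range(100)]
second = first[5:]
candidates = [2.0 + 0.5 * i for i in range(17)]  # 2.0, 2.5, ..., 10.0

best_depth = estimate_depth(first, second, 50, candidates, FOCAL, BASELINE)
print(best_depth)
```

Restricting `candidates` to a band around a previous depth estimate, as the text suggests, would simply mean passing a shorter list to `estimate_depth`.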

Typically, there is an uncertainty associated with the first depth estimate 230. The uncertainty represents, for example, the degree of confidence that the first depth estimate 230 correctly corresponds to the actual depth. For example, the uncertainty may depend on a photometric uncertainty (which may be limited by or depend on the photometric resolution of the capture device), and this may limit the accuracy with which the first depth estimate 230 can be determined. The uncertainty may additionally or alternatively depend on the method used to generate the first depth estimate 230 and on any inherent uncertainties associated with this method, such as the step size between neighboring interpolation points if generating the first depth estimate 230 involves an interpolation process. The uncertainty may be considered to correspond to an error associated with the first depth estimate 230. In the example of FIG. 2, the geometric reconstruction engine is configured to output both the first depth estimate 230 and the uncertainty measure 235 for the first depth estimate 230. The geometric reconstruction engine may be configured to generate an array or matrix containing a mean, μ, corresponding to the first depth estimate 230, and a variance, θ, corresponding to the uncertainty measure 235, although this is merely an example. The mean and variance may be provided for the entire image, for one or more image portions, or for one or more pixels of the image.

In embodiments in which generating the first depth estimate 230 involves minimizing a photometric error (or another optimization), the uncertainty measurement 235 associated with the first depth estimate 230 may be obtained by computing a Jacobian term $J$ based on the difference in photometric error between the two depth values used in the interpolation that yields the first depth estimate 230, and on the difference between those two depth values. In such cases, the uncertainty $\theta_{geo}$ of the first depth estimate 230 may be taken to be:

$$\theta_{geo} = (J^{T} J)^{-1}$$

However, this is merely an example, and other uncertainty measurements may be used in other embodiments.
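By way of illustration only, the scalar case of the above expression may be sketched as follows. The function name, the finite-difference approximation of J, and the sample values are assumptions made for this sketch and are not taken from the description.

```python
def geometric_uncertainty(depth_a, depth_b, error_a, error_b):
    """Return theta_geo = (J^T J)^-1 for a scalar Jacobian J.

    J is approximated (an assumption of this sketch) by a finite difference:
    the change in photometric error divided by the change in depth between
    the two depth values used for interpolation.
    """
    if depth_a == depth_b:
        raise ValueError("the two depth samples must differ")
    jacobian = (error_b - error_a) / (depth_b - depth_a)
    if jacobian == 0.0:
        return float("inf")  # flat error curve: the depth is unconstrained
    return 1.0 / (jacobian * jacobian)


# A steep photometric error curve gives a small variance (confident estimate).
theta_sharp = geometric_uncertainty(1.0, 1.1, 0.50, 0.10)
# A shallow error curve gives a large variance (uncertain estimate).
theta_flat = geometric_uncertainty(1.0, 1.1, 0.50, 0.48)
```

Under these assumptions, a photometric error that changes quickly with depth yields a lower uncertainty than one that changes slowly.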

In some cases, the first depth estimate 230 may be a first depth map for a plurality of pixels. For example, the first depth estimate 230 may include a per-pixel depth estimate for each pixel of an input image of the scene. Hence, the resolution of the input image and of the first depth map corresponding to the first depth estimate 230 may be the same. It is to be appreciated that, before the first depth estimate 230 is generated, the input image may undergo pre-processing, which may include changing the resolution of the input image. For example, the resolution of the input image may be reduced, e.g. by downsampling the image, to reduce the computational requirements for processing the input image. In other cases, the first depth estimate 230 may include a single depth value for a plurality of pixels, so that there is a one-to-many correspondence between depth values and pixels of the input image. For example, a plurality of pixels, e.g. pixels with similar photometric characteristics such as a similar color or intensity, may be combined, and a depth value may be obtained for that combination of pixels.

In some cases, the first depth estimate 230 may be a so-called "medium density" (semi-dense) depth estimate. In such cases, the first depth estimate 230 may include a depth estimate for a subset of the portions of the scene, e.g. as captured in an input image (or a plurality of images). For example, a medium density depth estimate may include a depth estimate for a portion of the pixels in image data representing two or more views of the scene, e.g. corresponding to a portion of the two or more views of the scene. The portions of the scene for which the first depth estimate 230 is obtained may correspond to the portions of pixels that satisfy certain image criteria, such as certain photometric criteria. For example, the first depth estimate 230 may be obtained for portions of an image that are identified as containing a sufficient amount of detail or information. This may be identified by calculating an image gradient, which for example indicates a change in photometric characteristics (such as brightness or color) over a given region. An image gradient may correspond to, or be used as a proxy for, a depth gradient indicating a change in depth over a given region of the scene. In image regions with a large amount of detail, e.g. corresponding to feature-rich parts of the scene in which the depth changes relatively sharply over a relatively small region, the image gradient is typically relatively large. In other cases, the first depth estimate 230 may be a so-called "low density" (sparse) depth estimate. In these cases, the first depth estimate 230 may be obtained for portions of an image that are identified as corresponding to particular image features. For example, keypoints of the image may be identified, which typically correspond to distinctive locations in an image that can be robustly localized from a range of viewpoints, rotations, scales and illuminations. In such cases, the first depth estimate 230 may be obtained for image patches containing keypoints, without obtaining depth estimates for other image portions. In yet further cases, the first depth estimate 230 may be a so-called "high density" (dense) depth estimate, in which a depth estimate is obtained for an entire image (or image portion), irrespective of the content of the image or image portion.
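As a minimal sketch of the gradient-based selection described above, the following code picks out the pixels whose intensity gradient exceeds a threshold. The central-difference gradient, the threshold value, and the helper names are assumptions chosen for illustration; the description does not prescribe a particular criterion.

```python
def gradient_magnitude(image, row, col):
    """Central-difference intensity gradient magnitude at an interior pixel."""
    gx = (image[row][col + 1] - image[row][col - 1]) / 2.0
    gy = (image[row + 1][col] - image[row - 1][col]) / 2.0
    return (gx * gx + gy * gy) ** 0.5


def select_textured_pixels(image, threshold):
    """Return (row, col) of interior pixels whose gradient exceeds threshold."""
    rows, cols = len(image), len(image[0])
    return [
        (r, c)
        for r in range(1, rows - 1)
        for c in range(1, cols - 1)
        if gradient_magnitude(image, r, c) > threshold
    ]


# A flat region (value 10) with a bright block (value 90) in one corner.
image = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 90, 90, 90],
    [10, 10, 90, 90, 90],
]
selected = select_textured_pixels(image, threshold=20.0)
```

In this sketch, only pixels near the edge of the bright block are selected, so a depth estimate restricted to `selected` would be medium density rather than high density.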

In some cases, the uncertainty measurement 235 for the first depth estimate 230 may be of the same type as, or have the same resolution as, the first depth estimate 230. For example, if the first depth estimate 230 includes a depth estimate per pixel of the input image, there may also be a corresponding uncertainty measurement per pixel. Conversely, if the first depth estimate 230 includes a depth estimate for a plurality of pixels of the input image, there may also be a corresponding uncertainty measurement for that plurality of pixels. Similarly, if the first depth estimate 230 is low density, medium density or high density, the uncertainty measurement 235 may also be low density, medium density or high density, respectively. In other cases, though, the type or resolution of the uncertainty measurement 235 may differ from that of the first depth estimate 230.

The image processing system 200 of FIG. 2 is also configured to generate a second depth estimate 240 and an uncertainty measurement 245 for the second depth estimate 240, which may be referred to collectively as second depth data 260. The second depth data 260 may be generated using a neural network architecture, which may be trained on unsupervised image data or on supervised (i.e. labelled) image data to predict depth estimates and associated uncertainty measurements. Various different neural network architectures may be used. For example, the neural network architecture may include at least one convolutional neural network (CNN), which may be a so-called "deep" neural network with a plurality of layers.

In some embodiments, before the first depth estimate 230 is generated, image data representing one or more views of the scene may be obtained from a capture device, such as a camera. In such cases, generating the second depth estimate 240 may include receiving the image data at the neural network architecture. The image data may be in any suitable format and may, for example, represent a plurality of 2D images of the scene captured from a plurality of different positions. The neural network architecture may then be used to predict a depth estimate for each of a set of image portions, thereby generating the second depth estimate 240. The set of image portions may correspond to the entirety of an image (or a plurality of images), or to a subset of an image or a plurality of images.

As with the first depth estimate 230, the second depth estimate 240 may be a second depth map for a plurality of pixels, e.g. with a one-to-one mapping between depth values and pixels of the input image of the scene. In other cases, though, the second depth estimate 240 may include a single depth value for a plurality of pixels, so that there is a one-to-many correspondence between depth values and pixels of the input image. Furthermore, the second depth estimate 240 may be a low density, medium density or high density depth estimate. In one case, the two depth estimates have different densities; for example, the first depth estimate 230 may be a medium density depth estimate and the second depth estimate 240 may be a high density depth estimate. In addition, as explained with reference to the first depth estimate 230, the type or resolution of the uncertainty measurement 245 for the second depth estimate 240 may be the same as, or different from, that of the second depth estimate 240.

The image processing system 200 of FIG. 2 includes a fusion engine 270, which is configured to receive the first depth estimate 230 and the uncertainty measurement 235 for the first depth estimate 230 from the geometric reconstruction engine, and the second depth estimate 240 and the uncertainty measurement 245 for the second depth estimate 240 from the neural network architecture. The fusion engine 270 is configured to use the uncertainty measurement 235 for the first depth estimate 230 and the uncertainty measurement 245 for the second depth estimate 240 to probabilistically fuse the first depth estimate 230 and the second depth estimate 240, and to output a fused depth estimate 280 for the scene. In this way, both the first depth estimate 230 and the second depth estimate 240 contribute to the fused depth estimate 280, which may improve the accuracy of the fused depth estimate 280 compared with using the first depth estimate 230 or the second depth estimate 240 alone.

For example, the uncertainty of the first depth estimate 230 (which is based on geometric constraints) may be higher in regions of the scene with low texture, e.g. regions such as a wall in which the depth is relatively unchanging or changes only gradually. Additionally or alternatively, the first depth estimate 230 may be relatively uncertain in regions in which part of the scene is partly occluded. In contrast, the second depth estimate 240 (obtained by the neural network architecture) may have a lower uncertainty than the first depth estimate 230 in ambiguous regions (e.g. low-texture regions), but may be less accurate in high-texture regions, even though those regions are accurately captured by the first depth estimate 230. Using the uncertainty measurements 235, 245 aids the probabilistic fusion of the first depth estimate 230 and the second depth estimate 240, for example by appropriately balancing the contribution of each of the first and second depth estimates 230, 240 to the fused depth estimate 280 based on their relative uncertainties. For example, in regions of the scene in which the uncertainty measurement 235 associated with the first depth estimate 230 is higher than the uncertainty measurement 245 associated with the second depth estimate 240, the second depth estimate 240 may contribute more to the fused depth estimate 280 than the first depth estimate 230. Moreover, global consistency can be maintained, so that the fused depth estimate 280 accurately captures the depth of the scene at a global level, rather than merely in selected local regions of the scene.

In some cases, the first depth estimate 230 is a medium density depth estimate, and the second depth estimate 240 and the fused depth estimate 280 each comprise a high density depth estimate. For example, the first depth estimate 230 may be obtained for portions of the scene with sufficient texture for the first depth estimate 230 to be adequately accurate. In such cases, the first depth estimate 230 may not be obtained for other portions of the scene that lack texture. However, the second depth estimate 240 may be obtained for the entirety of the scene captured in an image (or image portion). Hence, by fusing the first depth estimate 230 and the second depth estimate 240 in such cases, the fused depth estimate 280 may also be obtained for the entirety of the scene captured in the image. In such cases, one portion of the fused depth estimate 280 may be obtained by fusing both the first depth estimate 230 and the second depth estimate 240 (e.g. a portion of the fused depth estimate 280 corresponding to a more textured part of the scene). A different portion of the fused depth estimate 280, however, may be obtained solely from the second depth estimate 240 (e.g. a portion of the fused depth estimate 280 corresponding to a smooth part of the scene, for which the first depth estimate 230 may be less reliable).
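One simple way to picture this uncertainty-weighted behavior is per-pixel inverse-variance weighting, sketched below. This is an illustrative assumption rather than the fusion method of the embodiments: it treats the two estimates at each pixel as independent Gaussian measurements, and it uses `None` (a convention chosen here) to mark pixels with no geometric estimate.

```python
def fuse_pixel(geo_depth, geo_var, net_depth, net_var):
    """Fuse one pixel; geo_depth is None where no geometric estimate exists."""
    if geo_depth is None:
        return net_depth, net_var  # only the network estimate is available
    w_geo = 1.0 / geo_var  # lower variance means a larger weight
    w_net = 1.0 / net_var
    fused_depth = (w_geo * geo_depth + w_net * net_depth) / (w_geo + w_net)
    fused_var = 1.0 / (w_geo + w_net)
    return fused_depth, fused_var


# Three pixels as (depth, variance): a textured pixel where geometry is
# confident, an ambiguous pixel where the network is confident, and a
# textureless pixel with no geometric estimate at all.
geo = [(2.0, 0.01), (3.0, 1.0), (None, None)]
net = [(2.4, 0.09), (3.5, 0.04), (4.0, 0.25)]
fused = [fuse_pixel(g[0], g[1], n[0], n[1]) for g, n in zip(geo, net)]
```

In this sketch the fused value lies closer to whichever estimate has the lower variance, and the output is high density because every pixel falls back on the network estimate when the geometric estimate is missing.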

Various different methods may be used to probabilistically fuse the first depth estimate 230 and the second depth estimate 240. For example, a cost function based on the first and second depth estimates 230, 240 and the uncertainty measurements 235, 245 may be optimized to probabilistically fuse the first depth estimate 230 and the second depth estimate 240 and so obtain the fused depth estimate 280. Optimizing the cost function may involve iteratively calculating the value of the cost function for different input depth estimates, in order to find the fused depth estimate 280 for which the cost function takes its minimum value. A cost function may alternatively be referred to as a loss function or an error function.

In the embodiment of FIG. 2, the cost function includes a first cost term associated with the first depth estimate 230 and a second cost term associated with the second depth estimate 240. The first cost term comprises a function of fused depth estimate values, first depth estimate values (e.g. taken from the first depth estimate 230 obtained from the geometric reconstruction engine) and uncertainty values for the first depth estimate 230 (e.g. taken from the uncertainty measurement 235 for the first depth estimate 230). Similarly, the second cost term comprises a function of fused depth estimate values, second depth estimate values (e.g. taken from the second depth estimate 240 obtained from the neural network architecture) and uncertainty values for the second depth estimate 240 (e.g. taken from the uncertainty measurement 245 for the second depth estimate 240). The cost function is optimized to determine the fused depth estimate values that form the fused depth estimate 280 output by the fusion engine 270. This involves, for example, iteratively altering the fused depth estimate values to determine the values for which the cost function is optimized. For example, the cost function may be considered optimized when its value meets a predefined criterion, e.g. when its value is at or below a predefined minimum value. In other cases, optimizing the cost function may involve minimizing the cost function. In this way, the first and second depth estimates 230, 240 and the uncertainty measurements 235, 245 all act as constraints on the fused depth estimate 280 that is obtained, which improves the accuracy of the fused depth estimate 280.
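The iterative optimization described above may be sketched, for a single pixel, as follows. The quadratic form of the two cost terms (squared residuals divided by a variance), the damped gradient step, and the stopping tolerance are all assumptions made for this illustration; the description only states that each term is a function of the fused value, the input estimate and its uncertainty.

```python
def cost(d, d_geo, var_geo, d_net, var_net):
    """Assumed two-term cost: uncertainty-weighted squared residuals."""
    first_term = (d - d_geo) ** 2 / var_geo
    second_term = (d - d_net) ** 2 / var_net
    return first_term + second_term


def minimise_cost(d_geo, var_geo, d_net, var_net, tol=1e-9, max_iter=200):
    """Iteratively alter the fused depth value until the update is below tol."""
    d = d_net  # start from the network estimate
    curvature = 2.0 / var_geo + 2.0 / var_net  # second derivative of the cost
    for _ in range(max_iter):
        gradient = 2.0 * (d - d_geo) / var_geo + 2.0 * (d - d_net) / var_net
        update = 0.5 * gradient / curvature  # damped step, halves the error
        d -= update
        if abs(update) < tol:
            break
    return d


fused_value = minimise_cost(2.0, 0.01, 2.4, 0.09)
```

With these assumed terms, the iteration settles near the inverse-variance weighted mean of the two inputs, and the cost at `fused_value` is lower than at either input estimate.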

It is to be appreciated, however, that the use of a cost function is merely an example. In other embodiments, the first depth estimate and the second depth estimate may be probabilistically fused in a different way using the uncertainty measurements.

Accordingly, certain embodiments herein provide an accurate reconstruction of the depth of a scene, and thus facilitate interaction between a robotic device and a real-world environment. In particular, certain embodiments herein are designed to enable real-time or near real-time operation (in contrast to other depth estimation approaches) and to provide depth estimates for scenes in a variety of different environments, including outdoor and indoor locations.

FIG. 3 is a schematic diagram of an image processing system 300 according to a further embodiment. The image processing system 300 of FIG. 3 is similar to the image processing system 200 of FIG. 2 in various respects. Features of FIG. 3 that correspond to features of FIG. 2 are labelled with the same reference numerals, incremented by 100.

In FIG. 3, the fusion engine 370 is configured to receive a surface orientation estimate 320 and an uncertainty measurement 325 for the surface orientation estimate 320, in addition to the first depth estimate, the second depth estimate and their uncertainty measurements. The fusion engine 370 is configured to use the surface orientation estimate 320 and the uncertainty measurement 325 for the surface orientation estimate 320 to probabilistically fuse the first depth estimate and the second depth estimate.

The surface orientation estimate 320 indicates, for example, the direction or inclination of a surface corresponding to a pixel or other image region of an image of the scene captured by a capture device. For example, the orientation of a surface may be considered to capture the angle of orientation of the surface for a pixel or other image region of the image of the scene captured by the capture device. A surface orientation corresponds, for example, to a surface normal, which is an axis perpendicular to a given surface. In other cases, the surface orientation estimate 320 may correspond to a surface gradient, e.g. a measure of the rate of change of a surface. The surface orientations of a plurality of pixels may be used to obtain an indication of the nature of the surface corresponding to those pixels. For example, a surface that is relatively smooth and unchanging may have a relatively constant surface orientation. Conversely, a highly textured surface may be associated with a variety of different surface orientations.

The surface orientation estimate 320 and the uncertainty measurement 325 for the surface orientation estimate 320 may be obtained in various different ways. For example, an image of the scene may be processed to determine the surface orientation estimate 320 and its uncertainty measurement 325, e.g. based on changes in photometric characteristics such as the pixel intensity values of pixels of the image.

FIG. 4 is a schematic diagram of an image processing system 400 according to a still further embodiment. The image processing system 400 of FIG. 4 is similar to the image processing system 300 of FIG. 3 in various respects, but explicitly illustrates a geometric reconstruction engine 430 and a neural network architecture 420.

In FIG. 4, a frame 410 of video data is received. The frame 410 is captured, for example, by a capture device such as a camera, and includes a view of a scene. It is to be appreciated that, in other cases, the image processing system 400 of FIG. 4 may be used to process image data representing still images rather than video data representing a video.

The frame 410 is processed by the geometric reconstruction engine 430 and the neural network architecture 420. The geometric reconstruction engine 430 and the neural network architecture 420 may be configured as described with reference to FIG. 2 to generate first depth data 450 and second depth data 460.

In the embodiment of FIG. 4, the second depth data 460 includes a surface orientation estimate and an uncertainty measurement for the surface orientation estimate. In this embodiment, the surface orientation estimate and its uncertainty measurement are generated by the neural network architecture 420, in addition to the second depth estimate and the uncertainty measurement for the second depth estimate.

The neural network architecture 420 of FIG. 4 may include one or more neural networks, and is configured to receive pixel values for frames of video data, such as the frame 410 shown in FIG. 4. The neural network architecture 420 is configured to predict a depth estimate for each of a first set of image portions, to generate the second depth estimate, and to predict at least one surface orientation estimate for each of a second set of image portions. The first set of image portions may be the same as or different from the second set of image portions. For example, the first and second sets of image portions may overlap entirely, overlap partly, or not overlap at all. The depth estimate and the at least one surface orientation estimate may be obtained at different respective resolutions. In such cases, the resolution of one or both of the depth estimate and the at least one surface orientation estimate may subsequently be altered, e.g. by interpolation, to obtain a desired resolution. For example, the at least one surface orientation estimate may be obtained at a lower resolution than the depth estimate, and subsequently upscaled to the same resolution as the depth estimate. The neural network architecture 420 is also configured to predict one or more uncertainty measurements associated with each depth estimate and one or more uncertainty measurements associated with each surface orientation estimate.
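The upscaling by interpolation mentioned above may be illustrated with the following sketch, which bilinearly resamples a low-resolution grid of surface orientation values to a higher resolution. The choice of bilinear interpolation, the helper names and the sample grid are assumptions for illustration only.

```python
def resample_linear(values, new_length):
    """Linearly interpolate a 1-D list to new_length samples (ends preserved)."""
    if new_length == 1 or len(values) == 1:
        return [values[0]] * new_length
    scale = (len(values) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * scale
        left = min(int(pos), len(values) - 2)  # index of the left neighbour
        frac = pos - left
        out.append(values[left] * (1.0 - frac) + values[left + 1] * frac)
    return out


def upscale(grid, new_rows, new_cols):
    """Separable bilinear upscaling: along each row, then along each column."""
    wide = [resample_linear(row, new_cols) for row in grid]
    columns = [resample_linear(list(col), new_rows) for col in zip(*wide)]
    return [list(row) for row in zip(*columns)]


# A 2x2 grid of gradient values upscaled to match a 3x3 depth estimate.
low_res_gradient = [[0.0, 2.0], [4.0, 6.0]]
high_res_gradient = upscale(low_res_gradient, 3, 3)
```

After upscaling, each surface orientation value lines up with one depth value, so the two can be combined pixel by pixel.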

The first depth data 450 and the second depth data 460 are probabilistically fused using a fusion engine 470 to obtain a fused depth estimate 480. In FIG. 4, the fusion engine 470 also uses the at least one surface orientation estimate to obtain the fused depth estimate 480. In embodiments in which a cost function is optimized to obtain the fused depth estimate 480, the cost function may include a third cost term associated with the at least one surface orientation estimate. In such cases, the third cost term may comprise a function of fused depth estimate values, surface orientation estimate values (e.g. obtained from the neural network architecture 420) and uncertainty values for each of the at least one surface orientation estimate (e.g. taken from the uncertainty measurement for each surface orientation estimate). For example, the third cost term may comprise a sum of cost terms, one for each surface orientation estimate. Optimization of the cost function may be as described with reference to FIG. 2, but with the addition of the surface orientation information.

Using the surface orientation information to obtain the fused depth estimate 480 may further improve the accuracy of the fused depth estimate 480. For example, a surface orientation estimate (and its associated uncertainty measurement) may impose constraints between a given pixel and its neighboring pixels. In this way, the global consistency of the fused depth estimate 480 may be improved.

FIG. 5 is a schematic diagram showing the surface orientation estimate 320 and the uncertainty measurement 325 for the surface orientation estimate 320 according to an embodiment 500. In FIG. 5, the surface orientation estimate 320 includes a depth gradient estimate 510 in a first direction (in this case along the x-axis) and a depth gradient estimate 520 in a direction orthogonal to the first direction (in this case, in which a Cartesian coordinate system is used, along the y-axis).

A depth gradient estimate in a given direction represents, for example, an estimate of the change in depth of the scene (e.g. as captured in an image) in that direction. Depth gradient estimates may be used to identify rapid or distinctive changes in depth in an image of a scene. For example, the depth gradient may be relatively high in a region of the image corresponding to a part of the scene across which the depth varies. Conversely, the depth gradient may be relatively low in other regions of the image corresponding to a different part of the scene that is at a relatively constant depth with respect to the camera. By estimating the depth gradient in two different directions (such as two directions that are orthogonal, i.e. perpendicular, to each other), the depth characteristics of the scene captured in the image may be identified more accurately and/or more efficiently.
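For illustration, depth gradients in two orthogonal directions may be approximated from a depth map by forward differences between neighboring pixels, as in the sketch below. The forward-difference scheme and the sample depth map are assumptions; in the embodiments the gradient estimates may instead be predicted directly.

```python
def depth_gradients(depth):
    """Forward-difference depth gradients along x (columns) and y (rows)."""
    rows, cols = len(depth), len(depth[0])
    grad_x = [
        [depth[r][c + 1] - depth[r][c] for c in range(cols - 1)]
        for r in range(rows)
    ]
    grad_y = [
        [depth[r + 1][c] - depth[r][c] for c in range(cols)]
        for r in range(rows - 1)
    ]
    return grad_x, grad_y


# A near surface (depth 2.0) next to a far surface (depth 5.0).
depth = [
    [2.0, 2.0, 5.0],
    [2.0, 2.0, 5.0],
]
grad_x, grad_y = depth_gradients(depth)
```

Here the x-gradient is large only at the boundary between the two surfaces, while the y-gradient is zero everywhere because the depth does not change between rows.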

In other examples, the surface orientation estimate 320 may include other orientation estimates in addition to, or instead of, the depth gradient estimates 510, 520. For example, the surface orientation estimate 320 may include a surface normal estimate.

In some cases, as in FIG. 5, there is a corresponding uncertainty measurement for each surface orientation estimate. Hence, in FIG. 5, there is a first uncertainty measurement 530 associated with the depth gradient estimate 510 in the first direction, and a second uncertainty measurement 540 associated with the depth gradient estimate 520 in the direction orthogonal to the first direction.

The uncertainty measurement for each surface orientation estimate may be generated in various ways. For example, a neural network architecture (which may also be used to generate the second depth estimate that the fusion engine probabilistically fuses with the first depth estimate) may be trained to generate the surface orientation estimates and a corresponding uncertainty measurement associated with each surface orientation estimate.

In some cases, the second depth estimate and/or the surface orientation estimate(s) may be logarithmic estimates. This may make it easier for the neural network architecture to generate these estimates, because negative values are then numerically meaningful. Furthermore, the difference between two log-depths (which corresponds, for example, to a log-depth gradient) corresponds to the ratio of the two depths, which is scale invariant. In addition, if log-depth gradients are predicted in two orthogonal directions (as in the example of FIG. 5), the probabilistic fusion of the first depth estimate and the second depth estimate (which uses, for example, the log-depth gradients) is linear and may be performed without dot product and normalization operations. The fusion process may therefore be performed more efficiently than otherwise.
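The scale-invariance property mentioned above can be checked numerically. The following is a minimal sketch under the assumption that depth is held in a NumPy array; the helper name `log_depth_gradient_x`, the sample values and the scale of 3.7 are illustrative only.

```python
import numpy as np


def log_depth_gradient_x(depth):
    """Difference of log-depths between horizontally adjacent pixels."""
    log_depth = np.log(np.asarray(depth, dtype=float))
    return log_depth[:, 1:] - log_depth[:, :-1]


depth = np.array([[2.0, 2.5, 4.0],
                  [2.0, 3.0, 4.5]])

gradient = log_depth_gradient_x(depth)
# Rescaling every depth by the same factor leaves the log-depth gradient unchanged.
gradient_scaled = log_depth_gradient_x(3.7 * depth)
# exp(log d_b - log d_a) equals the depth ratio d_b / d_a.
ratio = depth[:, 1:] / depth[:, :-1]
```

Because `log(s * d_b) - log(s * d_a) = log(d_b / d_a)` for any positive scale `s`, the two gradient arrays agree up to floating-point error.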

FIG. 6 is a schematic diagram of an image processing system 600 according to a further example. The image processing system 600 includes a monocular capture device 605, which is an example of a capture device or camera for capturing images of a scene. The monocular capture device 605 is configured to capture video data representing a video of the scene. The scene is captured in a first frame of the video, which in the example of FIG. 6 may be referred to as a keyframe 610. A keyframe 610 corresponds, for example, to a frame of the video for which a more complete depth estimate is obtained, such as a frame that corresponds to or includes a new portion of the scene for which depth has not previously been estimated, or a portion of the scene identified as more feature-rich than other portions. For example, the first frame of a video for which no depth estimate has previously been obtained may be considered a keyframe. A keyframe may be a keyframe designated by an external system, such as an external SLAM system. In other cases, a frame obtained after the monocular capture device 605 has moved by more than a threshold distance may be a keyframe. Other frames captured by the monocular capture device 605 may be considered reference frames 615.
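One of the keyframe criteria above (movement of the capture device beyond a threshold distance) can be sketched as follows. This is an illustrative assumption of how such a test might look: the function name `is_keyframe`, the use of 3D camera positions and the Euclidean distance metric are not specified in the text.

```python
import numpy as np


def is_keyframe(camera_position, last_keyframe_position, distance_threshold):
    """Return True if the frame should be treated as a keyframe.

    The very first frame (no previous keyframe) is always a keyframe;
    afterwards a frame qualifies once the camera has moved further than
    distance_threshold from where the last keyframe was captured.
    """
    if last_keyframe_position is None:
        return True
    offset = (np.asarray(camera_position, dtype=float)
              - np.asarray(last_keyframe_position, dtype=float))
    return bool(np.linalg.norm(offset) > distance_threshold)


first_frame = is_keyframe([0.0, 0.0, 0.0], None, 0.1)
small_motion = is_keyframe([0.05, 0.0, 0.0], [0.0, 0.0, 0.0], 0.1)
large_motion = is_keyframe([0.3, 0.4, 0.0], [0.0, 0.0, 0.0], 0.1)
```

A content-based criterion (for example, whether the frame is feature-rich) or a designation from an external SLAM system could be combined with this test; those variants are not sketched here.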

In the example of FIG. 6, frames captured by the monocular capture device 605 (whether keyframes 610 or reference frames 615) are processed using a tracking system 625. The tracking system 625 is used to determine the pose of the monocular capture device 605 while it observes the scene (e.g. while it captures a frame). As described with reference to FIG. 2, the tracking system 625 may include motion sensors, which may be coupled to, or form part of, an actuator arranged to move a robotic device that is coupled to or supports the monocular capture device 605. In this way, the tracking system 625 may capture odometry data and process the odometry data to generate pose estimates for the monocular capture device 605.

In FIG. 6, the tracking system 625 generates an estimate 640 of the pose of the monocular capture device 605 during capture of the reference frame 615, and an estimate 635 of the pose of the monocular capture device 605 during capture of the keyframe. The estimated poses 640, 635 of the monocular capture device 605, together with the video data captured by the monocular capture device 605, are used by a geometric reconstruction engine 630 to generate depth estimates for at least a subset of the pixels of a frame of the video data. In some cases, the geometric reconstruction engine 630 is configured to minimize a photometric error to generate the depth estimates. This is described further with reference to FIG. 2. The geometric reconstruction engine 630 of FIG. 6 outputs first depth data 650, which includes the depth estimates and uncertainty measurements for the depth estimates.

In some cases, the first depth data 650 is regenerated for each frame obtained by the monocular capture device 605. For example, the first depth data 650 may relate to the keyframe and may be iteratively updated for each additional reference frame that is obtained and processed. The first depth data 650 may be generated in real time or near real time, and its generation may therefore be performed frequently, for example at a rate corresponding to the frame rate of the monocular capture device 605.

For frames identified as corresponding to keyframes 610, the image processing system 600 of FIG. 6 additionally processes the keyframe 610 using a neural network architecture 620 to generate second depth data 660. Processing of the reference frames 615 using the neural network architecture 620 may, however, be omitted in some cases. This is indicated schematically in FIG. 6 by dashed lines, which correspond to portions of the image processing pipeline that may be performed selectively by the image processing system 600. For example, the first depth data 650 may be generated for a frame irrespective of whether it is a reference frame 615 or a keyframe 610, whereas the second depth data may be generated selectively for keyframes 610. In the example of FIG. 6, the second depth data 660 includes a second depth estimate, an uncertainty measurement for the second depth estimate, at least one surface orientation estimate, and an uncertainty measurement for the surface orientation estimate, although this is merely an example. The neural network architecture 620 may be similar to or the same as the neural network architectures of other examples described herein.

The image processing system 600 of FIG. 6 also includes a fusion engine 670, which is configured to statistically fuse the first depth estimate and the second depth estimate (obtained by the geometric reconstruction engine 630 and the neural network architecture 620, respectively) using the associated uncertainty measurements. In FIG. 6, the fusion engine 670 also uses the at least one surface orientation estimate, and the respective uncertainty measurement associated with the at least one surface orientation estimate, to statistically fuse the first and second depth estimates. Use of the at least one surface orientation estimate and its associated uncertainty measurement may, however, be omitted in other cases.

The fusion engine 670 is configured to generate a fused depth estimate 680 by statistically fusing the first depth estimate and the second depth estimate. The fusion engine 670 of FIG. 6 is also configured to determine a scale estimate when probabilistically fusing the first and second depth estimates. For example, where the fusion engine 670 is configured to optimize a cost function to determine the fused depth estimate 680, optimizing the cost function may include determining a scale factor 685 for the fused depth estimate 680. In such cases, the scale factor 685 indicates the scale of the fused depth estimate 680 with respect to the scene. The scale factor may therefore compensate for inaccuracies in the scale of the scene as provided by the first and second depth estimates. The scale factor may be a scalar. For example, the first depth estimate has an arbitrary scale because it is generated on the basis of poses derived from the monocular capture device 605 of FIG. 6. Generating the scale factor, however, allows a depth estimate at a particular scale to be obtained.

In cases in which a cost function is optimized to determine the fused depth estimate 680, a first cost term of the cost function may include a function of fused depth estimate values, first depth estimate values, uncertainty values for the first depth estimate, and the scale factor. Optimizing the cost function may include iteratively altering the scale factor as well as the fused depth estimate 680 to determine the scale factor and fused depth estimate 680 that optimize (e.g. minimize) the cost function. In such cases, the cost function may also include a second cost term and/or a third cost term, as described with reference to FIGS. 2 and 4. The second cost term and/or the third cost term may then be independent of the scale factor.
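To make the role of the scale factor concrete, the following is a minimal sketch under strong assumptions that are not taken from the specification: both depth estimates are expressed as log-depths, each cost term is a per-pixel squared residual weighted by the inverse variance, and the scale factor therefore enters the first cost term as an additive log-scale. The third (surface orientation) cost term is left out. For this assumed quadratic cost the minimum has a closed form, which is used here in place of the iterative alteration described above. The function name `fuse_with_scale` is hypothetical.

```python
import numpy as np


def fuse_with_scale(log_d1, var1, log_d2, var2):
    """Jointly minimise, over fused log-depth x and log-scale t,

        sum((x - log_d1 - t)**2 / var1) + sum((x - log_d2)**2 / var2).

    Only the first term depends on the scale; the second does not.
    Returns the fused log-depth and the scale factor s = exp(t).
    """
    w1 = 1.0 / np.asarray(var1, dtype=float)
    w2 = 1.0 / np.asarray(var2, dtype=float)
    # Eliminating x leaves sum(h * (log_d2 - log_d1 - t)**2), with h below.
    h = w1 * w2 / (w1 + w2)
    t = np.sum(h * (log_d2 - log_d1)) / np.sum(h)
    # For a fixed t the best x is the inverse-variance weighted mean.
    x = (w1 * (log_d1 + t) + w2 * log_d2) / (w1 + w2)
    return x, float(np.exp(t))


true_depth = np.array([1.0, 2.0, 4.0, 8.0])
log_d1 = np.log(true_depth / 2.0)   # geometric estimate at an arbitrary scale
log_d2 = np.log(true_depth)         # network estimate at the intended scale
var1 = np.array([0.01, 0.02, 0.01, 0.05])
var2 = np.array([0.04, 0.04, 0.04, 0.04])
fused_log_depth, scale = fuse_with_scale(log_d1, var1, log_d2, var2)
```

In this noise-free toy case the recovered scale factor is 2, and the fused log-depth matches the log of the true depth. A cost that also includes surface orientation terms would in general need an iterative or sparse linear solver instead of this closed form.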

As explained, the second depth data 660 may be generated by the neural network architecture 620 less frequently than the first depth data 650 is generated by the geometric reconstruction engine 630. For example, both the first depth data 650 and the second depth data 660 may be generated for keyframes 610. There may be fewer keyframes 610 than reference frames 615, for which generation of the second depth data may be omitted.

As an example, a scene may be captured in a first frame of video data, and a second depth estimate may be received for the first frame of video data. The second depth estimate may be generated by the neural network architecture 620. Hence, the first frame of video data may be considered to be a keyframe 610. In this example, a plurality of first depth estimates is obtained. At least one of the plurality of first depth estimates is generated using a second frame of video data that differs from the first frame of video data. For example, the plurality of first depth estimates (generated by the geometric reconstruction engine 630) may include a first depth estimate for the first frame (which is a keyframe 610) and a first depth estimate for the second frame (which is a reference frame 615). In this case, the fusion engine 670 is configured to iteratively output the fused depth estimate 680 of the scene, processing the second depth estimate and one of the plurality of first depth estimates at each iteration. For example, on receipt of the first frame, the fusion engine 670 may fuse the first depth estimate generated using the first frame with the second depth estimate generated using the first frame. On receipt of the second frame, however, the fusion engine 670 may instead fuse the first depth estimate generated using the second frame with the second depth estimate previously generated using the first frame. In other words, the second depth estimate need not be regenerated for each frame and may instead be reused from a previous frame (such as a previous keyframe 610). The generation of the fused depth estimate 680 may thus be repeated iteratively. For a subsequent iteration, the method may include determining whether to generate the second depth estimate. As explained above, this determination may be based on the content of the scene as captured in an image, for example on whether the content has changed noticeably compared with previous images of the scene (e.g. due to movement of the monocular capture device 605) or on whether the content is feature-rich. In response to a determination not to generate the second depth estimate (e.g. for a reference frame 615), these examples include probabilistically fusing the first depth estimate and the second depth estimate using a previous set of values for the second depth estimate. This removes the need to process the image using the neural network architecture 620.

In examples such as this, first depth estimates may be generated more frequently than second depth estimates (which may be slower to generate because they use the neural network architecture 620). In some cases, the fused depth estimate 680 may be refined on the basis of an updated first depth estimate and a pre-existing second depth estimate. The depth of the scene may therefore be updated at a higher rate than in other cases in which the depth is updated only after both the first and second depth estimates have been updated. Indeed, by generating the first and second depth estimates separately and subsequently fusing them, the methods herein are more flexible and may be performed more efficiently than otherwise.
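The reuse of a cached second depth estimate across reference frames can be sketched as a small processing loop. This is an illustrative outline only: the class name `KeyframeFusionLoop`, its callables and the scalar stand-ins for frames and depth estimates are assumptions made for brevity, not components named in the specification.

```python
class KeyframeFusionLoop:
    """Runs the geometric estimate on every frame, the network on keyframes only."""

    def __init__(self, geometric_fn, network_fn, fuse_fn):
        self.geometric_fn = geometric_fn
        self.network_fn = network_fn
        self.fuse_fn = fuse_fn
        self.cached_second = None   # most recent network output
        self.network_calls = 0

    def process(self, frame, is_keyframe):
        first = self.geometric_fn(frame)   # regenerated for every frame
        if is_keyframe or self.cached_second is None:
            self.cached_second = self.network_fn(frame)
            self.network_calls += 1
        # Reference frames reuse the estimate cached from an earlier keyframe.
        return self.fuse_fn(first, self.cached_second)


# Stand-in callables, purely illustrative: a "frame" is just a number here.
loop = KeyframeFusionLoop(
    geometric_fn=lambda frame: float(frame),
    network_fn=lambda frame: float(frame) + 0.5,
    fuse_fn=lambda first, second: 0.5 * (first + second),
)
flags = [True, False, False, True, False]
fused = [loop.process(frame, flag) for frame, flag in zip([1, 2, 3, 4, 5], flags)]
```

With five frames of which two are keyframes, the network stand-in is called twice while a fused output is still produced for every frame, which is the source of the higher update rate described above.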

FIG. 7A is a schematic diagram showing components of a computing system 700 that may be used to implement any of the methods described herein. The computing system 700 may be a single computing device (e.g. a desktop, laptop, mobile and/or embedded computing device) or may be a distributed computing system spread across multiple discrete computing devices (e.g. certain components may be implemented by one or more server computing devices based on requests made over a network by one or more client computing devices).

The computing system 700 includes a video capture device 710 to provide frames of video, which for example include observations of a scene. The computing system 700 also includes a simultaneous localization and mapping (SLAM) system 720. A SLAM system, in the field of robotic mapping and navigation, acts to construct and update a map of an unknown environment while simultaneously locating a robotic device associated with the map within the environment. For example, the robotic device may be the device that constructs, updates and/or uses the map. The SLAM system 720 is configured to provide pose data for the video capture device 710. A semi-dense multi-view stereo component 730 of the computing system 700 is configured to receive the pose data and the frames of video and to implement the geometric reconstruction engine described in the other examples above. The component 730 may be described as "semi-dense" in the sense set out above. The term "multi-view stereo" indicates that the component 730 acts to simulate stereo image pairs to determine depth data, using successive frames of data from a monocular (e.g. non-stereo) camera instead. In this case, frames from a moving camera may provide different views of a common environment, which allows depth data to be generated as described above. The computing system 700 also includes neural network circuitry 740, which is, for example, electronic circuitry to implement the neural network architecture described with reference to the examples above. The computing system 700 also includes an image processing system 750 configured to implement the fusion engine of the examples herein. The image processing system 750, for example, probabilistically fuses first depth data from the semi-dense multi-view stereo component 730 and second depth data from the neural network circuitry 740 to obtain fused depth data.

FIG. 7B is a schematic diagram showing components of a robotic device 760 according to an example. The robotic device 760 includes the computing system 700 of FIG. 7A. The robotic device 760 also includes one or more actuators 770 that enable the robotic device 760 to interact with a surrounding three-dimensional environment. At least a portion of the surrounding three-dimensional environment may be shown in the scene captured by the video capture device 710 of the computing system 700. In the case of FIG. 7B, the robotic device 760 may be configured to capture video data as the robotic device navigates a particular environment (e.g. as per device 130 of FIG. 1A). In another case, however, the robotic device 760 may scan an environment, or may operate on video data received from a third party, such as a user with a mobile device or another robotic device. As the robotic device 760 processes the video data, it may be configured to generate a depth estimate of the scene, which is for example a fused depth estimate.

The robotic device 760 also includes an interaction engine 780 that includes at least one processor to control the one or more actuators 770. The interaction engine 780 of FIG. 7B is configured to use the fused depth estimate to interact with the surrounding three-dimensional environment. The interaction engine 780 may use the fused depth estimate to control the one or more actuators to interact with the environment. For example, the fused depth estimate may be used to grab objects within the environment and/or to avoid collisions with barriers such as walls.

Examples of the functional components described herein with reference to FIGS. 7A and 7B may include dedicated processing electronics and/or may be implemented by computer program code executed by a processor of at least one computing device. In some cases, one or more embedded computing devices may be used. Components as described herein may include at least one processor operating in association with memory to execute computer program code loaded onto a computer-readable medium. This medium may include solid-state storage such as an erasable programmable read-only memory, and the computer program code may include firmware. In other cases, the components may include a suitably configured system-on-chip, an application-specific integrated circuit and/or one or more suitably programmed field-programmable gate arrays. In one case, the components may be implemented by computer program code and/or dedicated processing electronics in a mobile computing device and/or a desktop computing device. In one case, as well as or instead of the previous cases, the components may be implemented by one or more graphics processing units executing computer program code. In some cases, the components may be implemented by one or more functions executed in parallel, for example on multiple processors and/or cores of a graphics processing unit.

FIG. 8 is a schematic diagram showing an example 800 of various features described with reference to FIGS. 1 to 7. FIG. 8 shows an example of a three-dimensional (3D) environment 805, which may be referred to as a scene. The 3D environment 805 includes a capture device 810, such as the capture device 120 of FIG. 1, as well as two objects 815, 820. The capture device 810 is configured to capture observations of the 3D environment 805 (e.g. in the form of still images or video). These observations include, for example, observations of the objects 815, 820, and may show the positions of the objects 815, 820 relative to each other and relative to other objects or features of the 3D environment (such as a surface supporting the objects 815, 820 or a wall behind the objects 815, 820). An example of a frame 825 of video showing an observation of the 3D environment 805 as captured by the capture device 810 is also shown in FIG. 8. As can be seen, the two objects 815, 820 are visible in the frame 825 of video.

FIG. 8 also shows schematically an example of a first depth estimate 830 as obtained by a geometric reconstruction engine. As can be seen, the presence of the objects 815, 820 in the scene is indicated in the first depth estimate 830 by the outlines 832, 834. Hence, the first depth estimate 830 allows, for example, boundaries or other edges in an image to be identified (e.g. corresponding to a sudden change in depth, as may occur at an edge of an object in the scene).

A second depth estimate 835 as obtained by a neural network architecture is also shown schematically in FIG. 8. As can be seen, the presence of the objects 815, 820 in the scene is indicated in the second depth estimate 835 by shading 836, 838. The grayscale value of the shading indicates, for example, the relative depth of a portion of an object.

In the example of FIG. 8, the two objects 815, 820 protrude toward the capture device 810. For example, the first object 815 in FIG. 8 is a cylinder with a longitudinal axis that extends vertically. Hence, as seen from the capture device 810, the first object 815 bulges toward the capture device 810 at its center and recedes from the capture device 810 at its sides. The shape of the first object 815 is captured in the second depth estimate 835, which shows the depth of the first object 815 relative to the capture device 810 decreasing toward the center of the first object 815 (as indicated by the shaded region 836 becoming darker toward the center of the first object in the second depth estimate 835). The edges of the first object are, however, sharper or more distinct in the first depth estimate 830 than in the second depth estimate 835. This illustrates that the first depth estimate 830 may capture depth more accurately in higher-texture regions of the scene, whereas the second depth estimate 835 may capture depth more accurately in lower-texture (i.e. smooth) regions of the scene. This can be seen further from the second depth estimate 835, which includes shading at the upper-left and lower-left corners, indicating that these regions differ in depth from other regions of the scene. This difference is not identified in the first depth estimate 830 because it is a relatively subtle or small change in depth.

FIG. 8 also shows schematically an example of a first depth gradient estimate 840 in a first direction (the horizontal direction in this example) and a second depth gradient estimate 845 in a direction orthogonal to the first direction (the vertical direction in this example). The presence of the objects 815, 820 is indicated in the first depth gradient estimate 840 by the arrows 842, 844 respectively, and in the second depth gradient estimate 845 by the arrows 846, 848 respectively. As explained with reference to the first and second depth estimates 830, 835, the object 815 bulges toward the capture device 810 along its longitudinal axis. This can be seen in the first depth gradient estimate 840: the arrows 842 are closer together in the central region (indicating a less rapid change in depth gradient) than toward the sides of the object 815, where the object 815 curves more rapidly away from the capture device 810 because of its cylindrical shape.

FIG. 9 is a flow diagram showing an example method 900 of estimating depth for a scene. The method 900 comprises a first operation 910 of generating a first depth estimate. The first depth estimate may be generated using a geometric reconstruction of the scene, which is configured to output an uncertainty measurement for the first depth estimate. In a second operation 920, a second depth estimate is generated using a neural network architecture, which is configured to output an uncertainty measurement for the second depth estimate. In a third operation 930, the first depth estimate and the second depth estimate are probabilistically fused using the uncertainty measurements to generate a fused depth estimate for the scene. The method 900 of FIG. 9 may be implemented using any of the systems described herein.
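By way of illustration only, the third operation 930 may be sketched for a single pixel as follows. The sketch assumes that each uncertainty measurement is a variance and that the errors of the two estimates are independent and Gaussian, in which case the fused estimate is an inverse-variance weighted mean. This is a simplification chosen for the illustration, not the fusion used in any particular example herein; the function name `fuse_depths` and the numerical values are hypothetical.

```python
def fuse_depths(d_geo, var_geo, d_net, var_net):
    """Fuse two depth estimates for one pixel.

    Assumption for this sketch: each uncertainty measurement is a variance
    and the two estimates have independent Gaussian errors.
    """
    w_geo = 1.0 / var_geo  # a more certain estimate receives a larger weight
    w_net = 1.0 / var_net
    fused_var = 1.0 / (w_geo + w_net)
    fused_depth = fused_var * (w_geo * d_geo + w_net * d_net)
    return fused_depth, fused_var


# Hypothetical values: the geometric estimate (2.0 m) is more certain than
# the network estimate (2.4 m), so the fused depth lies closer to 2.0 m.
fused_depth, fused_var = fuse_depths(2.0, 0.04, 2.4, 0.16)
print(fused_depth, fused_var)
```

The fused variance is smaller than either input variance, which reflects the expectation that combining two sources improves on each source used individually.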

FIG. 10 is a flow diagram showing a further example method 1000 of estimating depth for a scene. In a first operation 1010, image data is obtained. The image data may be obtained from a capture device arranged to capture images of the scene. In a second operation 1020, a pose estimate is generated for the capture device at the time the image data of the first operation 1010 was obtained. In a third operation 1030, a semi-dense estimate of the depth of the scene is obtained using a geometric reconstruction engine, for example as described with reference to other examples herein. In a fourth operation 1040, it is determined whether the image captured by the capture device is a keyframe, as described with reference to FIG. 6. If it is, a neural network output is generated in a fifth operation 1050 in order to generate the second depth estimate. If the image is not a keyframe, however, an existing neural network output (obtained for a previous image, for example) is used instead in a sixth operation 1060. The neural network output includes, for example, the second depth estimate, at least one surface orientation estimate, and an uncertainty measurement associated with each of the second depth estimate and the at least one surface orientation estimate. Finally, in a seventh operation 1070, the first depth estimate of the third operation 1030 and the second depth estimate of the fifth operation 1050 or the sixth operation 1060 are fused probabilistically, for example using the uncertainty measurements associated with the first and second depth estimates respectively. In this example, the at least one surface orientation estimate and the corresponding uncertainty measurement are also used during the fusion operation, to obtain a fused depth map and a scale factor. The method 1000 of FIG. 10 may be implemented, for example, using the system 600 of FIG. 6.
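The keyframe branch of operations 1040 to 1060 may be sketched as follows, purely as an illustration. The names `second_estimate_for_frame`, `run_network` and `cache` are hypothetical, and the fallback of running the network when no earlier output exists is an assumption of this sketch rather than something stated above.

```python
def second_estimate_for_frame(frame, is_keyframe, run_network, cache):
    """Return the neural network output to use for this frame.

    The network is run only for keyframes (operation 1050); for other frames
    the most recent stored output is reused (operation 1060).
    Assumption: if nothing has been stored yet, the network is run anyway.
    """
    if is_keyframe or "output" not in cache:
        cache["output"] = run_network(frame)
    return cache["output"]


# Stand-in for the neural network architecture, recording which frames it saw.
network_calls = []

def fake_network(frame):
    network_calls.append(frame)
    return {"depth": "depth-for-" + frame}


cache = {}
sequence = [("f0", True), ("f1", False), ("f2", False), ("f3", True)]
outputs = [second_estimate_for_frame(f, k, fake_network, cache) for f, k in sequence]
print(network_calls)
```

In this sketch the network is evaluated for two of the four frames, and the two intermediate frames reuse the output stored for the first keyframe.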

FIG. 11 is a schematic diagram showing an example 1100 of a processor 1110 and a non-transitory computer-readable storage medium 1120 comprising computer-executable instructions 1130. The computer-executable instructions 1130, when executed by the processor 1110, cause a computing device, such as a computing device comprising the processor 1110, to estimate depth for a scene. The instructions may result in a method being performed that is similar to the example methods described above. For example, the computer-readable storage medium 1120 may be arranged to store a plurality of first depth data 1140, which may be obtained for a plurality of reference frames as described with reference to FIG. 6. The computer-readable storage medium 1120 may also be arranged to store second depth data 1150 for a keyframe. The first depth data and the second depth data may be probabilistically fused to obtain a fused depth estimate. Although FIG. 11 shows the first depth data 1140 and the second depth data 1150 stored on the computer-readable storage medium 1120, in other examples at least one of the first depth data 1140 and the second depth data 1150 may be stored in storage that is external to (but accessible by) the computer-readable storage medium 1120.

FIG. 12 is a schematic diagram showing the fusion of a first depth estimate and a second depth estimate for a scene according to a further example. In FIG. 12, a first depth probability volume 1200 is generated for the scene using a geometric reconstruction. The first depth probability volume 1200 includes a first plurality of depth estimates (which in this case includes the first depth estimate discussed in other examples herein) and a first plurality of uncertainty measurements, each associated with a respective depth estimate of the first plurality of depth estimates. The first plurality of uncertainty measurements therefore includes the uncertainty measurement for the first depth estimate.

The first depth probability volume 1200 is shown schematically in FIG. 12, which is a simplified example for ease of illustration. In FIG. 12, a frame representing an observation of the scene includes nine pixels, labeled P₁ to P₉. Each of the pixels corresponds to an observation of a different respective portion of the scene. For each of the pixels in FIG. 12, there are three depth estimates, labeled D₁, D₂ and D₃ (although other examples may have more or fewer depth estimates per pixel). Each depth estimate in FIG. 12 is associated with a respective uncertainty measurement. The uncertainty measurement associated with the m-th depth estimate Dₘ for the n-th pixel Pₙ is labeled uₙₘ in FIG. 12. FIG. 12 shows the uncertainty measurements for the upper row of pixels (P₁, P₂ and P₃). It is to be appreciated, though, that the depth estimates for the other pixels of the frame (P₄ to P₉) also have corresponding uncertainty measurements (not shown in FIG. 12). In this example, the three-dimensional arrangement of uncertainty measurements associated with the respective depth estimates for the two-dimensional array of pixels forms a three-dimensional probability volume.

図１２では、第１の奥行き確率体積１２００の第１の複数の奥行き推定のうちの所与の奥行き推定に関連付けられた不確実性測定は、シーンの所与の領域（その観察が所与のピクセルでキャプチャされる）が、第１の複数の奥行き推定のうちの所与の奥行き推定により表される奥行きに存在する確率を表す。ゆえに、図１２では、ｕ_１１は、第１のピクセルＰ_１でキャプチャされたシーンの領域が、第１の奥行き推定Ｄ_１の奥行きに対応する奥行きに存在する確率を表す。 In FIG. 12, the uncertainty measurement associated with a given depth estimate of the first plurality of depth estimates of the first depth probability volume 1200 is given in a given area of the scene (its observation is given). Represents the probability that (captured in pixels) is present at the depth represented by a given depth estimate of the first plurality of depth estimates. Therefore, in FIG. 12, u ₁₁ represents the probability that the region of the scene captured by the first pixel P ₁ exists at a depth corresponding to the depth of the first depth estimation D ₁ .

FIG. 12 also includes a second depth probability volume 1202, which is generated for the scene using a neural network architecture. The second depth probability volume 1202 in this case is otherwise similar to the first depth probability volume 1200, and includes a second plurality of depth estimates, including the second depth estimate, and a second plurality of uncertainty measurements, each associated with a respective depth estimate of the second plurality of depth estimates. The second plurality of uncertainty measurements therefore includes the uncertainty measurement for the second depth estimate. As for the first depth probability volume 1200, the uncertainty measurement associated with a given depth estimate of the second plurality of depth estimates represents a probability that a given region of the scene is at the depth represented by that depth estimate.

図１２の実施例では、第１の奥行き確率体積１２００及び第２の奥行き確率体積１２０２は、所与のピクセルについての同一のそれぞれの奥行き推定（Ｄ_１、Ｄ_２、及びＤ_３）に関連付けられた不確実性測定を含む。しかし、他の実施例では、第１の奥行き確率体積１２００及び第２の奥行き確率体積１２０２の奥行き推定は、互いに異なり得ることを、理解されたい。例えば、第１の奥行き確率体積１２００及び第２の奥行き確率体積１２０２のうちの一方は、他方より多くの個数の奥行き推定を有し得、及び／または互いに異なる値の奥行き推定を有し得る。 In the embodiment of FIG. 12, the first depth probability volume 1200 and the second depth probability volume 1202 are associated with the same respective depth estimates (D ₁ , D ₂ , and D ₃ ) for a given pixel. Includes uncertainty measurements. However, it should be understood that in other embodiments, the depth estimates of the first depth probability volume 1200 and the second depth probability volume 1202 may differ from each other. For example, one of the first depth probability volume 1200 and the second depth probability volume 1202 may have a larger number of depth estimates and / or different depth estimates.

It is to be appreciated that the uncertainty measurement associated with a given depth estimate typically differs between the first depth probability volume 1200 and the second depth probability volume 1202, because the accuracy of the geometric reconstruction and that of the neural network architecture typically differ for a given portion of the scene. This can lead to different probability distributions for a given portion of the scene, depending on whether the geometric reconstruction or the neural network architecture is used. For example, if a given technique (either the geometric reconstruction or the use of the neural network architecture) cannot accurately characterize the depth of a given portion of the scene, the depth probability distribution associated with a pixel representing that portion may be relatively flat, making it difficult to determine the most likely depth of that portion of the scene. Conversely, if a given technique can accurately determine the depth of a given portion of the scene, the depth probability distribution may have a sharper peak at the depth estimate corresponding to the depth of that portion.

By fusing the uncertainty measurements associated with the first and second depth probability volumes 1200, 1202, the depth estimates associated with the first and second depth probability volumes 1200, 1202 may themselves be probabilistically fused, thereby generating a fused depth probability volume 1204. This is shown schematically in FIG. 12, in which the fused depth probability volume 1204 is similar to the first and second depth probability volumes 1200, 1202 but, owing to the probabilistic fusion of the depth estimates of the first and second depth probability volumes 1200, 1202, typically contains values of the uncertainty measurements uₙₘ that differ from those of the first and second depth probability volumes 1200, 1202. Probabilistically fusing the first and second depth probability volumes 1200, 1202 in this way allows information about the depth of the scene from two different sources (the geometric reconstruction and the neural network architecture) to be combined, which typically improves the accuracy of the depth estimate compared to using each source individually.

図１３は、図１２の第２の奥行き確率体積１２０２と同様または同一であり得る第２の奥行き確率体積１３０２を取得するための例示的なシステム１３００の概略図である。システム１３００は、シーンの観察を表すフレーム１３０４を受信するように構成される。この事例のフレーム１３０４は、シーンの画像を表す画像データにより表される。システム１３００のニューラルネットワークアーキテクチャ１３０６は、フレーム１３０４を処理して第２の奥行き確率体積１３０２を生成するように構成され、第２の奥行き確率体積１３０２は、この事例では、第２の複数の奥行き推定１３０８、及び第２の複数の不確実性測定１３１０（この事例では、シーンの所与の領域が、第２の複数の奥行き推定１３０８のうちの奥行き推定により表される奥行きに存在するそれぞれの確率を表す）を含む。 FIG. 13 is a schematic diagram of an exemplary system 1300 for acquiring a second depth probability volume 1302 that may be similar to or identical to the second depth probability volume 1202 of FIG. The system 1300 is configured to receive frames 1304 representing observations of the scene. The frame 1304 of this example is represented by image data representing an image of the scene. The neural network architecture 1306 of the system 1300 is configured to process frames 1304 to generate a second depth probability volume 1302, which in this case is a second plurality of depth estimates. 1308, and a second plurality of uncertainty measurements 1310 (in this case, the probabilities that a given area of the scene is at the depth represented by the depth estimation of the second plurality of depth estimates 1308, respectively. Represents).

The system 1300 of FIG. 13 is arranged to output a second plurality of depth estimates 1308 that includes a plurality of sets of depth estimates, each associated with a different respective portion of an image of the scene. Hence, if the system 1300 is used to output the second depth probability volume 1202 of FIG. 12, the depth estimates D₁, D₂ and D₃ for a given pixel may be considered to correspond to a set of depth estimates. An image comprising a plurality of pixels may therefore be processed using the system 1300 of FIG. 13 to generate a plurality of sets of depth estimates, each associated with a different respective pixel of the plurality of pixels.

The neural network architecture 1306 in the example of FIG. 13 is configured to output uncertainty measurements 1310 associated with respective depth estimates 1308 having predefined values. In other words, rather than outputting a single depth value for each pixel, the neural network architecture 1306 of FIG. 13 is arranged to output an uncertainty measurement 1310 for each of a plurality of predefined, discrete depth estimates 1308 for each pixel. In this way, the neural network architecture 1306 outputs, for a given pixel, a discrete depth probability distribution over a given range, which in this case is non-parametric. This allows the neural network architecture 1306 to express uncertainty about the predicted depth (as represented by the uncertainty measurement associated with a given depth estimate, which in this case is a probability value). It further allows the neural network architecture 1306 to make a multi-hypothesis depth prediction which, when fused with the depth estimates obtained by the geometric reconstruction, allows the depth of the scene to be estimated more accurately.

ニューラルネットワークアーキテクチャ１３０６により出力される事前に定義された値の間には、不均一な間隔があり得る。このような手法では、ニューラルネットワークアーキテクチャ１３０６は、所与のピクセルについて、奥行き推定により占められる奥行き範囲にわたる可変解像度を有する奥行き確率分布を出力するように構成される。例えば、事前に定義された値は、事前に定義された奥行き範囲（奥行き推定により占められる奥行き範囲の全てまたは一部であり得る範囲）内の複数の対数奥行き値を含み得る。対数奥行きパラメータ化を使用することにより、奥行き範囲は対数空間で均一に分割することが可能となる。これにより、シーンの観察をキャプチャするのに使用されるキャプチャデバイスにより近い領域では、より高い奥行き解像度が提供され、より遠い領域では、より低い解像度が提供される。 There can be non-uniform spacing between the predefined values output by the neural network architecture 1306. In such a technique, the neural network architecture 1306 is configured to output a depth probability distribution with variable resolution over the depth range occupied by the depth estimation for a given pixel. For example, a predefined value may include multiple logarithmic depth values within a predefined depth range (a range that can be all or part of the depth range occupied by depth estimation). By using log-depth parameterization, the depth range can be evenly divided in logarithmic space. This provides higher depth resolution in areas closer to the capture device used to capture the observation of the scene and lower resolution in areas farther away.
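The effect of the log-depth parameterization may be illustrated with the following sketch, which divides a depth range uniformly in log space. The function name `log_depth_bins`, the depth range of 0.5 m to 8 m and the use of four bins are assumptions made for the illustration only.

```python
import math


def log_depth_bins(d_min, d_max, num_bins):
    """Divide [d_min, d_max] into bins of equal width in log space.

    Returns the bin edges and the bin midpoints (midpoints taken in log
    space), both expressed as ordinary depth values.
    """
    log_min, log_max = math.log(d_min), math.log(d_max)
    step = (log_max - log_min) / num_bins
    edges = [math.exp(log_min + i * step) for i in range(num_bins + 1)]
    centers = [math.exp(log_min + (i + 0.5) * step) for i in range(num_bins)]
    return edges, centers


# Hypothetical depth range and bin count.
edges, centers = log_depth_bins(0.5, 8.0, 4)
widths = [far - near for near, far in zip(edges, edges[1:])]
print(edges)
print(widths)
```

Each bin is twice as wide as the one before it in this sketch, so depth resolution is finest close to the capture device and coarsest far from it.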

The uncertainty measurements 1310 associated with respective depth estimates of the second plurality of depth estimates 1308, for a given pixel of an image processed using the system 1300 of FIG. 13, are shown schematically in FIG. 14. In FIG. 14, the uncertainty measurements are probability density values, shown on the y-axis 1400, and the depth estimates are log-depth values in meters, shown on the x-axis 1402. The depth estimates in FIG. 14 have discrete values. FIG. 14 therefore shows a discrete probability distribution 1400 in the form of a bar chart 1406.

いくつかの事例では、離散化誤差を減らすため、及びシーンの奥行き推定の取得を促進するために、離散確率分布１４００から連続確率関数が取得され得る。離散確率分布１４００から取得された連続確率関数１４０８が、図１４に概略的に示される。連続確率関数１４０８は、滑らかな関数であり得、図１６を参照して下記でさらに論述される。 In some cases, a continuous probability function can be obtained from the discrete probability distribution 1400 in order to reduce the discretization error and facilitate the acquisition of the depth estimation of the scene. The continuous probability function 1408 obtained from the discrete probability distribution 1400 is schematically shown in FIG. The continuous probability function 1408 can be a smooth function and is further discussed below with reference to FIG.

Referring back to FIG. 13, various different neural network architectures may be used as the neural network architecture 1306. In one example, the neural network architecture 1306 includes a residual neural network (ResNet) encoder followed by three upsample blocks, each of which includes a bilinear upsampling layer, a concatenation with the input image, and then two convolutional layers, so that the output has the same resolution as the input image representing the observation of the scene.

An ordinal loss function may be used to train the neural network architecture 1306 of FIG. 13 to predict the probability values associated with the discrete depth estimates. An example of a suitable ordinal loss function L(θ) is as follows:

[equation not reproduced in this text]

where θ is the set of weights of the neural network architecture 1306, K is the number of bins over which the depth range is discretized, k_i^* is the index of the bin containing the ground truth depth for pixel i, and p_{θ,i}(k_i^* = j) is the prediction of the neural network architecture 1306 of the probability that the ground truth depth is in bin j. However, this is merely an example, and other loss functions may be used in other examples.

Turning to FIG. 15, FIG. 15 is a schematic diagram of an example system 1500 for obtaining a first depth probability volume 1502, which may be similar to or the same as the first depth probability volume 1200 of FIG. 12. The system 1500 of FIG. 15 is arranged to process a first frame 1504 representing a first observation of a scene and a second frame 1506 representing a second observation of the scene, for example before or after the first observation. The first and second observations in this example at least partially overlap (for example, so that both include observations of the same portion of the scene).

The first and second frames 1504, 1506 are processed by a photometric error calculation engine 1508 to generate, for each of a plurality of portions of the first frame 1504, a set of photometric errors 1510, each associated with a different respective depth estimate of a first plurality of depth estimates 1512. The photometric errors may be obtained by warping the first frame 1504 into the second frame 1506 for each of the first plurality of depth estimates 1512 and determining a difference between the warped first frame 1504 and the second frame 1506. In some cases, the difference is a sum of squared differences between pixel values of the warped first frame 1504 and of the second frame 1506 over a patch of pixels, for example 3×3 pixels in size, although this is merely an example. Warping the first frame 1504 in this way may be considered to correspond to mapping pixels of the first frame 1504 to corresponding positions in the second frame 1506, for example as described with reference to FIG. 2. For example, the photometric errors that are calculated iteratively for respective depth values in the example described with reference to FIG. 2, in order to determine the depth value that minimizes the photometric error, may be output as the set of photometric errors 1510 by the photometric error calculation engine 1508 in the example of FIG. 15. The example of FIG. 2 involves calculating an uncertainty measurement associated with the depth estimate obtained by minimizing the photometric error, for example using a Jacobian term. In contrast, in the example of FIG. 15, the set of photometric errors 1510 are themselves taken as respective uncertainty measurements, each associated with a respective depth estimate.
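The sum of squared differences over a 3×3 patch may be sketched as follows. The sketch assumes that the first frame has already been warped for the depth estimate under test and that both frames are given as lists of rows of pixel intensities; the function name `patch_ssd` and the pixel values are hypothetical.

```python
def patch_ssd(warped, target, row, col, radius=1):
    """Sum of squared differences over a square patch centered on (row, col).

    radius=1 gives a 3x3 patch. The caller must ensure that the patch lies
    entirely inside both images.
    """
    total = 0.0
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            diff = warped[r][c] - target[r][c]
            total += diff * diff
    return total


# Hypothetical 3x3 frames: every pixel differs by 0.5, so SSD = 9 * 0.25.
warped_frame = [[1.0, 1.0, 1.0] for _ in range(3)]
second_frame = [[0.5, 0.5, 0.5] for _ in range(3)]
print(patch_ssd(warped_frame, second_frame, 1, 1))  # 2.25
```

Evaluating this error for every depth estimate of a pixel yields one column of the cost volume referred to in connection with FIG. 15.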

The warping of the first frame 1504 aims to replicate the second observation of the scene as captured in the second frame 1506 (for example, as observed by a camera with the same pose as a second pose of the camera during capture of the second frame 1506). The first frame 1504 is transformed in this way for each of the first plurality of depth estimates 1512, each of which is a hypothesized depth of the scene with respect to a first pose of the camera during capture of the first frame 1504. Although the depth of the scene with respect to the first pose is typically non-uniform, the warping may be performed more efficiently by assuming that the entire scene is at the same depth and then calculating the photometric error for that depth estimate on a per-pixel (or per-image-patch) basis. This approach may be performed repeatedly for the first plurality of depth estimates 1512 for each of a plurality of pixels of the first frame 1504 to generate a cost volume, from which the first depth probability volume 1502 may be obtained. It is to be appreciated that the first and second poses of the camera may be obtained using any suitable method, as explained with reference to FIG. 2.

To simplify fusion of the first depth probability volume 1502 with a second depth probability volume obtained using the system 1300 of FIG. 13, the midpoint of each depth bin for which a probability value is output by the neural network architecture 1306 may be used as a respective depth estimate of the first depth probability volume 1502. This need not be the case in other examples, though.

In some examples, the first and second frames 1504, 1506 are normalized before the first frame 1504 is warped and/or before the set of photometric errors 1510 is calculated. Normalization may be performed by, for each of the first and second frames 1504, 1506 respectively, subtracting the mean pixel value from each of the pixel values and dividing each of the resulting values by the standard deviation of that frame. This allows the underlying photometric difference between the warped first frame 1504 and the second frame 1506 to be determined more accurately for a given depth estimate, without being unduly affected by changes in illumination between the first and second observations of the scene.
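The normalization may be sketched as follows, as an illustration only. The sketch assumes that a frame is a list of rows of pixel intensities, that the population standard deviation is used, and that the frame is not of constant intensity (so that the standard deviation is non-zero); the function name `normalize` and the pixel values are hypothetical.

```python
import math


def normalize(frame):
    """Subtract the mean pixel value and divide by the standard deviation.

    Assumptions for this sketch: population standard deviation, and a frame
    that is not of constant intensity.
    """
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    return [[(p - mean) / std for p in row] for row in frame]


# The second frame is the first under a change of gain and offset, standing
# in for a change in illumination between the two observations.
dark = [[10.0, 20.0], [30.0, 40.0]]
bright = [[2.0 * p + 50.0 for p in row] for row in dark]
norm_dark = normalize(dark)
norm_bright = normalize(bright)
print(norm_dark)
print(norm_bright)
```

After normalization the two frames agree, so a photometric error computed between them would reflect scene structure and not the change in gain and offset.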

As discussed above, the set of photometric errors 1510 obtained by the photometric error calculation engine 1508 of FIG. 15 may be considered to form a cost volume. To obtain the first depth probability volume 1502 from the set of photometric errors 1510, the system 1500 of FIG. 15 includes a scaling engine 1512, which is configured to scale the photometric errors 1510 to respective probability values 1514 (which may be considered to correspond to uncertainty measurements associated with depth estimates of the first plurality of depth estimates 1512). In one case, the scaling includes scaling the negative of the squared photometric error for each pixel individually so that, after scaling, the negatives of the squared photometric errors for each of the first plurality of depth estimates 1512 for a given pixel sum to one. The scaled values may then be taken as the respective probability values 1514 associated with given depth estimates of the first plurality of depth estimates 1512, thereby generating the first depth probability volume 1502.
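The exact mapping from negative squared photometric errors to values that sum to one is not specified above. As one possible illustration, and as an assumption of this sketch only, a softmax over the negative squared errors of a pixel may be used: it yields positive values that sum to one and assigns the largest probability to the depth estimate with the smallest error. The function name `errors_to_probabilities` and the error values are hypothetical.

```python
import math


def errors_to_probabilities(photometric_errors):
    """Map the photometric errors of one pixel to probabilities.

    Assumption for this sketch: a softmax over the negative squared errors.
    This is one mapping that sums to one; it is not stated to be the mapping
    used by the scaling engine.
    """
    scores = [-(e * e) for e in photometric_errors]
    top = max(scores)  # subtracted for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical errors for three depth estimates of one pixel.
probabilities = errors_to_probabilities([0.9, 0.1, 0.6])
print(probabilities)
```

In this sketch the second depth estimate, which has the smallest photometric error, receives the largest probability value.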

図１６は、シーンの融合奥行き推定を取得する例示的な方法１６００を示すフロー図であり、シーンの観察は、ピクセルの配列をそれぞれ含む複数のフレームでキャプチャされる。融合奥行き推定は、図１２～図１５を参照して説明されたような第１の奥行き確率体積及び第２の奥行き確率体積を使用して取得される。 FIG. 16 is a flow diagram illustrating an exemplary method 1600 for obtaining a fusion depth estimate of a scene, where the observation of the scene is captured in multiple frames, each containing an array of pixels. The fusion depth estimation is obtained using a first depth probability volume and a second depth probability volume as described with reference to FIGS. 12-15.

Item 1602 of FIG. 16 involves obtaining a fusion probability volume. At item 1602, the first depth estimate and the second depth estimate (which form part of the first depth probability volume and the second depth probability volume, respectively) are probabilistically fused by combining the first plurality of uncertainty measurements with the second plurality of uncertainty measurements, to generate the fusion probability volume. The first and second pluralities of uncertainty measurements may be combined in various different ways. In one case, in which the first and second pluralities of uncertainty measurements are the probability values associated with the first and second depth probability volumes, the fusion probability volume is obtained by combining (for example, multiplying) the probability value associated with each depth estimate of the first plurality of depth estimates with the probability value associated with the corresponding depth estimate of the second plurality of depth estimates, to obtain a fusion value for each depth estimate. In some cases, the fusion values for the depth estimates for a given pixel are then scaled to sum to one, to generate a fusion probability value for each depth estimate. This scaling may be omitted in other cases, for example where the probability values associated with the first plurality of depth estimates and the second plurality of depth estimates have each already been scaled to sum to one for a given pixel.
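The multiply-then-rescale fusion described for item 1602 can be sketched for a single pixel as follows. This is a minimal illustration in Python, not the implementation of the described system: the function name `fuse_pixel`, the list-based representation of one pixel's depth bins, and the uniform fallback for an all-zero product are assumptions made for the example.

```python
def fuse_pixel(p_first, p_second):
    """Fuse two per-pixel depth distributions defined over the same bins."""
    if len(p_first) != len(p_second):
        raise ValueError("both distributions must use the same depth bins")
    # Combine the probability values bin by bin (multiplication, as in the text).
    fused = [a * b for a, b in zip(p_first, p_second)]
    total = sum(fused)
    if total == 0.0:
        # Hypothetical fallback (not from the text): the two sources share no
        # support, so return a uniform distribution rather than divide by zero.
        return [1.0 / len(fused)] * len(fused)
    # Scale so that the fusion values for this pixel sum to one.
    return [v / total for v in fused]


geometric = [0.1, 0.6, 0.2, 0.1]   # e.g. from the geometric reconstruction
network = [0.25, 0.25, 0.4, 0.1]   # e.g. from the neural network architecture
fused = fuse_pixel(geometric, network)
```

In this toy case the fused distribution peaks at the second bin, where both sources assign appreciable probability.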

In the example of FIG. 16, a depth probability function is obtained using the fusion probability volume, both to avoid quantization of the depth estimate of the scene obtained using the fusion probability volume and to obtain a function suitable for use in a subsequent optimization step (discussed further with reference to items 1604 and 1606 of FIG. 16). The depth probability function represents a parameterization of the fusion probability volume, and allows continuous depth values to be obtained (rather than merely the discrete depth estimates of the fusion probability volume). The depth probability function may be obtained using any suitable technique for parameterizing a discrete distribution, such as a kernel density estimation (KDE) technique, for example with Gaussian basis functions.
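A minimal Python sketch of a Gaussian kernel density estimate over the bins of one pixel is given below, to show how a discrete distribution can yield a continuous function of depth. The function name, the single shared `bandwidth` parameter, and the example bin depths are assumptions for illustration; the text does not specify how the kernels are parameterized.

```python
import math


def depth_probability_function(bin_depths, bin_probs, bandwidth):
    """Return a continuous density f(d) built from discrete bin probabilities."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))

    def f(d):
        # Weighted sum of Gaussian kernels, one centred on each depth bin.
        return sum(
            p * norm * math.exp(-0.5 * ((d - c) / bandwidth) ** 2)
            for c, p in zip(bin_depths, bin_probs)
        )

    return f


# Three bins at depths 1, 2 and 3 (arbitrary units) for a single pixel.
f = depth_probability_function([1.0, 2.0, 3.0], [0.2, 0.7, 0.1], bandwidth=0.5)
density_between_bins = f(2.3)  # can be evaluated at any depth, not only at bin centres
```

Because the result is smooth in the depth, it can be differentiated, which is what a gradient-based optimizer needs.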

At items 1604 and 1606 of FIG. 16, a fusion depth estimate of the scene is obtained from the fusion probability volume (in this case from the depth probability function obtained from the fusion probability volume, although this is merely an example). In the example of FIG. 16, obtaining the fusion depth estimate of the scene includes optimizing a cost function at item 1604. The cost function in this case includes a first cost term obtained using the fusion probability volume and a second cost term that includes a local geometric constraint on the depth values. The cost function $c(d)$ may be expressed as:

$c(d) = c_1(d) + \lambda c_2(d)$

where $d$ is the depth value to be estimated, $c_1(d)$ is the first cost term, $c_2(d)$ is the second cost term, and $\lambda$ is a parameter used to adjust the contribution of the second cost term to the cost function. The parameter $\lambda$ may be tuned empirically to obtain a suitable estimate of the depth values. A suitable value for $\lambda$ in one case is $1 \times 10^{7}$, although this is merely an example.

The first cost term depends on the fusion probability volume and, in the example of FIG. 16, on the depth probability function obtained from the fusion probability volume. In this case, the first cost term may be expressed as:

[equation not reproduced in this text]

where $f_i(d_i)$ is the output of the depth probability function for pixel $i$ of a given input frame (representing an observation of the scene), evaluated at depth $d_i$.
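As an illustration of how a cost term can depend on the depth probability function, the Python sketch below assumes a negative log-likelihood form, summed over pixels. That form is an assumption made for this example and is not confirmed by the text; the names `first_cost_term`, `total_cost`, `toy_density` and the guard value `eps` are likewise hypothetical.

```python
import math


def first_cost_term(depth_functions, depths, eps=1e-12):
    """Assumed form: sum of negative log densities, low when each depth is likely."""
    # eps (an assumption) guards against log(0) where a density vanishes.
    return -sum(math.log(f(d) + eps) for f, d in zip(depth_functions, depths))


def total_cost(c1, c2, lam):
    """c(d) = c_1(d) + lambda * c_2(d), the combination given in the text."""
    return c1 + lam * c2


def toy_density(d):
    """Stand-in for f_i: an unnormalised bump centred on depth 2.0."""
    return math.exp(-((d - 2.0) ** 2))


cost_near = first_cost_term([toy_density], [2.0])  # depth at the peak
cost_far = first_cost_term([toy_density], [3.0])   # depth away from the peak
```

Under this assumed form, a depth that the depth probability function considers likely gives a lower cost than an unlikely one, so minimizing the term pulls each depth towards high-probability values.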

By fusing the first depth probability volume and the second depth probability volume, the fusion probability volume typically has greater local consistency than would be obtained using geometric reconstruction or the neural network architecture alone. In the example of FIG. 16, local consistency is further improved by including the second cost term, which may be considered a regularization term. The second cost term in this case imposes a local geometric constraint during optimization of the cost function, which better preserves local geometry.

A system 1700 for obtaining a fusion depth estimate 1702 using the method 1600 of FIG. 16 is shown schematically in FIG. 17. A fusion depth probability volume 1704 is input to a depth estimation engine 1706 of the system 1700. The depth estimation engine 1706 performs the cost function optimization of item 1604 of FIG. 16. The first cost term of the cost function depends on the fusion depth probability volume 1704 input to the depth estimation engine 1706, and in this case may be expressed using the expression for $c_1(d)$ above. Hence, in FIG. 17, the depth estimation engine 1706 is configured to obtain the depth probability function from the fusion depth probability volume 1704 in order to compute the first cost term. In other cases, though, the first cost term may be obtained from the fusion depth probability volume 1704 itself, or the depth estimation engine 1706 may be configured to receive the depth probability function rather than the fusion depth probability volume 1704.

The system 1700 of FIG. 17 also includes a further neural network architecture 1708, which is configured to receive an input frame 1710 representing an observation of the scene for which the fusion depth estimate is to be generated, and to generate geometric constraint data 1712 for use in generating the second cost term of the cost function. The input frame 1710 in this case is the input frame processed by the neural network architecture to generate the second depth probability volume, and is one of the frames processed by the geometric reconstruction engine to generate the first depth probability volume, although this is merely an example.

The geometric constraint data 1712 in the example of FIG. 17 represents a surface orientation estimate and an occlusion boundary estimate, which are used to generate the second cost term. The surface orientation estimate represents, for example, a surface normal for a given pixel of the input frame 1710, as predicted by the further neural network architecture 1708. As the skilled person will appreciate, any suitably trained neural network architecture may be used as the further neural network architecture 1708. Using the surface orientation estimate in the second cost term improves the preservation of local geometry in the fusion depth estimate obtained by optimizing the cost function. For example, the second cost term is generally smaller where the surface orientation estimates are similar for adjacent pixels (indicating, for example, that these pixels have a similar orientation, as would be expected for a continuous, planar surface).

However, a scene typically includes depth discontinuities at object boundaries (which may be referred to as occlusion boundaries). At such boundaries, the surface orientation estimates for adjacent pixels of an input frame representing an observation of the scene generally differ from each other. The surface orientation estimates may be unreliable in these regions, because part of an object in these regions may be occluded due to the abrupt change in depth at the object boundary. An observation of these parts of the object may therefore be absent from the input frame, which can affect the reliability of the surface orientation estimates, and the reliability of a cost term based on differences between the surface orientation estimates for adjacent pixels of an image representing an observation of the scene.

To compensate for this, the second cost term in the example of FIG. 17 masks the regularization term at occlusion boundaries. In other words, the second cost term is weighted by a value, such as a value between zero and one, to adjust its contribution for respective pixels of the input frame 1710: the contribution is lower for pixels corresponding to unreliable regions of the input frame 1710, such as pixels corresponding to occlusion boundaries, and higher for pixels corresponding to more reliable regions. For example, for pixels that lie on an occlusion boundary, the second cost term may be weighted by a weight with a value of zero, so that the second cost term does not contribute to the optimization of the cost function for these pixels.

In some cases, the further neural network architecture 1708 outputs, as the occlusion boundary estimate, the probability that a given pixel belongs to an occlusion boundary. In such cases, a pixel may be considered to lie on an occlusion boundary where this probability is equal to or exceeds a predetermined threshold, such as 0.4.
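The thresholding step can be sketched in Python as below. The function name `occlusion_mask` and the flat list of per-pixel probabilities are assumptions for illustration; the threshold of 0.4 and the zero weight on boundary pixels follow the example values given in the text.

```python
def occlusion_mask(boundary_probs, threshold=0.4):
    """Return a weight of 0 for pixels on an occlusion boundary, 1 otherwise."""
    # A pixel is treated as lying on a boundary when its predicted
    # probability is equal to or exceeds the threshold.
    return [0 if p >= threshold else 1 for p in boundary_probs]


mask = occlusion_mask([0.05, 0.4, 0.9, 0.39])
```

A binary mask is the simplest choice; a weight between zero and one, as the text also allows, would down-weight uncertain pixels rather than exclude them.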

In the example of FIG. 17, the second cost term $c_2(d)$, which is generated by the depth estimation engine 1706 and used in optimizing the cost function, may be expressed as:

[equation not reproduced in this text]

In this expression, $b_i \in \{0, 1\}$ is the value of a mask based on the occlusion boundary estimate for pixel $i$ of the input frame 1710, $\langle \cdot, \cdot \rangle$ denotes the dot product operator, the surface orientation estimate is the one output by the further neural network architecture 1708, $K$ is a matrix representing intrinsic parameters associated with the camera used to capture the input frame 1710 (sometimes referred to as a camera intrinsic matrix), the pixel coordinates of pixel $i$ are expressed as homogeneous pixel coordinates, and $W$ is the width of the image in pixels.
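One way the listed quantities could combine is sketched below in Python. This is a guess at the structure, made for illustration only: it assumes a pinhole camera, assumes that each pixel is compared with its right-hand neighbour and with the pixel one image width later (the pixel below), and assumes a squared residual. None of these choices is confirmed by the text, and all function names are hypothetical.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """3D point depth * K^-1 * (u, v, 1) for a pinhole camera with intrinsics K."""
    return (depth * (u - cx) / fx, depth * (v - cy) / fy, depth)


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def second_cost_term(depths, normals, mask, width, intrinsics):
    """Masked planarity residuals between each pixel and its right/lower neighbour."""
    fx, fy, cx, cy = intrinsics
    height = len(depths) // width
    cost = 0.0
    for i, d in enumerate(depths):
        u, v = i % width, i // width
        p = back_project(u, v, d, fx, fy, cx, cy)
        neighbours = []
        if u + 1 < width:
            neighbours.append(i + 1)      # pixel to the right
        if v + 1 < height:
            neighbours.append(i + width)  # pixel below, W entries later
        for j in neighbours:
            q = back_project(j % width, j // width, depths[j], fx, fy, cx, cy)
            diff = tuple(a - b for a, b in zip(p, q))
            # On a plane, the normal is orthogonal to any in-plane difference,
            # so the residual vanishes when depths agree with the normal.
            cost += mask[i] * dot(normals[i], diff) ** 2
    return cost


# 2 x 2 image of a fronto-parallel plane at depth 2, normals facing the camera.
flat = second_cost_term([2.0] * 4, [(0.0, 0.0, -1.0)] * 4, [1] * 4, 2, (1.0, 1.0, 0.5, 0.5))
```

In this sketch a mask value of zero removes a pixel's residuals entirely, which matches the described behaviour at occlusion boundaries.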

In FIG. 17, the cost function is optimized by the depth estimation engine 1706 using gradient descent, to obtain the fusion depth estimate 1702 corresponding to the depth values $d$ that minimize the value of the cost function (item 1606 of FIG. 16). This is merely an example, though, and in other cases the fusion depth estimate may be obtained using a different optimization technique.
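Gradient descent on a cost of the form $c_1 + \lambda c_2$ can be illustrated with a one-pixel toy problem in Python. The quadratic cost, the step size, the iteration count and the finite-difference gradient are all assumptions chosen to keep the example short; they are not taken from the described system.

```python
def minimise(cost, d0, step=0.05, iters=500, h=1e-5):
    """Plain gradient descent on a scalar cost, using a central-difference gradient."""
    d = d0
    for _ in range(iters):
        grad = (cost(d + h) - cost(d - h)) / (2.0 * h)
        d -= step * grad
    return d


LAM = 0.5  # weight of the second term (toy value)


def toy_cost(d):
    """Data term pulling towards depth 2.0 plus a regulariser pulling towards 2.4."""
    return (d - 2.0) ** 2 + LAM * (d - 2.4) ** 2


d_star = minimise(toy_cost, d0=1.0)
```

The minimizer lies between the two targets, at 6.4 / 3, which shows how the weight on the second term trades the data term against the geometric constraint.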

FIG. 18 is a flow diagram showing an example method 1800 of obtaining a second fusion depth probability volume. The method 1800 of FIG. 18 may be used to obtain a second fusion depth probability volume, associated with a second frame of video data representing a second observation of a scene, from a first fusion depth probability volume associated with a first frame of the video data representing a first observation of the scene, where the first frame is, for example, before or after the second frame. By using the first fusion depth probability volume to obtain the second fusion depth probability volume, information about the depth of the scene can be retained across a plurality of frames. This may improve the depth estimate for the second frame compared with recomputing the depth estimate for the second frame without using information from the first frame.

Propagating the information represented by the first fusion depth probability volume into the second frame is non-trivial, because the first fusion depth probability volume represents a depth probability distribution for respective pixels of the first frame. To address this, item 1802 of FIG. 18 involves converting the first fusion depth probability volume to a first occupancy probability volume. The first fusion depth probability volume for the first frame may be obtained using any of the methods described with reference to FIGS. 12 to 17. The first occupancy probability volume may be considered to be an occupancy-based probability volume, such that, for each depth along a ray propagating towards the scene from the camera at the first pose associated with capture of the first frame, there is a probability that the associated point in space is occupied.

In one case, the first occupancy probability volume is obtained by first determining the probability that a voxel $S_{k,i}$ (for example, a three-dimensional volume element corresponding to the depth estimate associated with bin $k$ of the first depth probability volume along the ray associated with pixel $i$ of the first frame) is occupied, conditioned on the depth belonging to bin $j$ of the first depth probability volume:

[equation not reproduced in this text]

From this, the first occupancy probability volume $p(S_{k,i} = 1)$ may be obtained using the following equation:

[equation not reproduced in this text]

where $p_i(k_i^* = k)$ is the probability value of bin $k$ of the first depth probability volume for pixel $i$ of the first frame, and $K$ is the width of the first frame in pixels.
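The conversion from a depth distribution to occupancy probabilities along one ray can be illustrated in Python under an explicit assumption about the conditional probability: a voxel nearer than the surface is free, the voxel containing the surface is occupied, and a voxel behind the surface is unknown, with probability 0.5. That conditional model, the `behind` parameter and the function name are assumptions for this sketch and are not taken from the text; only the marginalization over the surface bin is standard.

```python
def depth_to_occupancy(depth_probs, behind=0.5):
    """Occupancy probability per bin along one ray, from a depth distribution."""
    occupancy = []
    cumulative = 0.0  # probability that the surface lies in a nearer bin
    for p in depth_probs:
        # Surface in this bin: occupied. Surface nearer: unknown ('behind').
        # Surface farther: this voxel is free and contributes zero.
        occupancy.append(p + behind * cumulative)
        cumulative += p
    return occupancy


occ = depth_to_occupancy([0.1, 0.7, 0.2])
```

Unlike the depth distribution, the occupancy values along a ray need not sum to one, since each voxel is treated as its own binary variable.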

Item 1804 of FIG. 18 involves warping the first occupancy probability volume, based on pose data representing poses of the camera during observation of the scene, to obtain a second occupancy probability volume associated with the second frame. The warping of the first occupancy probability volume may otherwise be similar to the warping of the first frame to obtain the photometric error 1510 described with reference to FIG. 15, and typically uses, as the pose data, first pose data representing a first pose of the camera during capture of the first frame and second pose data representing a second pose of the camera during capture of the second frame. In this way, the first occupancy probability volume can be warped into the second frame. In some cases, the second frame may include some pixels for which there is no corresponding warped first occupancy probability volume. For these pixels, a predetermined (for example, default) value may be used for the occupancy probability, such as a value of 0.01, although this is merely an example.
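The handling of pixels with no warped data can be sketched in Python as below. The warp itself is not shown; representing a pixel without warped data as `None`, and the function name `fill_unobserved`, are assumptions for the example. The default of 0.01 is the example value from the text.

```python
DEFAULT_OCCUPANCY = 0.01  # example default value mentioned in the text


def fill_unobserved(warped_rays, num_bins, default=DEFAULT_OCCUPANCY):
    """Replace rays with no warped data (None) by a constant occupancy profile."""
    return [
        ray if ray is not None else [default] * num_bins
        for ray in warped_rays
    ]


# Two pixels of the second frame: one with warped occupancies, one without.
filled = fill_unobserved([[0.2, 0.8], None], num_bins=2)
```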

At item 1806 of FIG. 18, the second occupancy probability volume is converted to a second fusion depth probability volume associated with the second frame. This conversion may be performed using the following equation to obtain the probability value $p_i(k_i^* = k)$ of bin $k$ of the second depth probability distribution for pixel $i$ of the second frame:

[equation not reproduced in this text]

This equation may be used to generate the probability value for each bin of the second fusion depth probability distribution for each of a plurality of pixels of the second frame. The second fusion depth probability distribution may then be scaled so that the distribution along each ray sums to one, to obtain the second fusion depth probability volume. A fusion depth estimate for the second frame may then be obtained from the second fusion depth probability volume, for example as described with reference to FIGS. 16 and 17.
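A Python sketch of the reverse conversion along one ray follows. It assumes that the surface lies in the first occupied voxel along the ray, so that the unnormalized probability of bin k is the occupancy of bin k multiplied by the probability that every nearer bin is free. This model and the function name are assumptions for illustration and are not taken from the text; the final rescaling to sum to one along the ray is the step the text describes.

```python
def occupancy_to_depth(occupancy):
    """Depth distribution along one ray: first occupied voxel, then rescaled."""
    probs = []
    free_so_far = 1.0  # probability that every nearer voxel is empty
    for occ in occupancy:
        probs.append(free_so_far * occ)
        free_so_far *= 1.0 - occ
    total = sum(probs)
    if total == 0.0:
        raise ValueError("ray has zero occupancy everywhere")
    # Scale so that the distribution sums to one along the ray.
    return [p / total for p in probs]


depth_probs = occupancy_to_depth([0.1, 0.75, 0.6])
```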

FIG. 19 is a flow diagram showing an example method 1900 of estimating the depth of a scene according to a further example.

At item 1902, a first depth probability volume for the scene is generated using a geometric reconstruction of the scene. The first depth probability volume is, for example, the same as or similar to the first depth probability volume described with reference to FIG. 12, and may be generated, for example, as described with reference to FIG. 15.

At item 1904, a second depth probability volume for the scene is generated using a neural network architecture. The second depth probability volume is, for example, the same as or similar to the second depth probability volume described with reference to FIG. 12, and may be generated, for example, as described with reference to FIGS. 13 and 14.

At item 1906, a fusion depth probability volume for the scene is generated using the first depth probability volume and the second depth probability volume, and at item 1908, a fusion depth estimate for the scene is generated using the fusion depth probability volume. The generation of the fusion depth probability volume and the fusion depth estimate at items 1906 and 1908 of FIG. 19 may use a method similar to or the same as the methods of FIG. 16 and/or FIG. 17.

FIG. 20 is a schematic diagram of an image processing system 2000 to estimate the depth of a scene according to a further example. The image processing system 2000 includes a fusion engine 2002, which receives a first depth probability volume 2004 from a geometric reconstruction engine 2006 and a second depth probability volume 2008 from a neural network architecture 2010, and fuses the first depth probability volume 2004 and the second depth probability volume 2008 to output a fusion depth probability volume 2012 for the scene. The image processing system 2000 also includes a depth estimation engine 2014, which uses the fusion depth probability volume 2012 to estimate the depth of the scene (which may be referred to as a fusion depth estimate 2016).

In the example of FIG. 20, the image processing system 2000 is configured to process input frames 2018 representing respective observations of the scene to generate the fusion depth estimate 2016. The input frames 2018 include, for example, a first frame, which is processed by the neural network architecture 2010 to generate the second depth probability volume 2008, for example as described with reference to FIGS. 12 to 14. The input frames 2018 may also include a second frame. In such examples, both the first frame and the second frame may be processed by the geometric reconstruction engine 2006 to generate the first depth probability volume 2004, for example as described with reference to FIGS. 12 and 15. The generation of the fusion depth probability volume 2012 by the fusion engine 2002, and the generation of the fusion depth estimate 2016 by the depth estimation engine 2014, may be as described with reference to FIGS. 16 and 17, for example.

The above examples are to be understood as illustrative. Further examples are envisaged.

In the example of FIG. 16, the cost function includes a first cost term and a second cost term. In other examples that are otherwise the same as or similar to the example of FIG. 16, the cost function may omit the second cost term and may, for example, include solely the first cost term.

The fusion depth probability volume obtained for a first frame by the method 1900 of FIG. 19 or the system 2000 of FIG. 20 may be warped as described with reference to FIG. 18 to obtain a depth probability volume for a second frame.

It is to be appreciated that the estimation of the depth of a scene as described with reference to FIGS. 12 to 17, 19 and 20 need not be performed for each observation of the scene. Instead, the depth may be estimated for a subset of observations (for example, a subset of frames), which may be referred to as keyframes. This can reduce processing requirements. Similarly, the method of FIG. 19 need not be performed for each frame subsequent to a keyframe for which depth has been estimated. For example, the method of FIG. 19 may be omitted where the pose of the camera has changed significantly between the first frame and the second frame.

For the first frame of a video for which depth is to be estimated, the depth may be estimated using a neural network architecture such as those described above (for example, by computing the depth estimate from the second depth probability volume, without fusing the second depth probability volume with a first depth probability volume). After at least one further frame of the video has been obtained, the first depth probability volume may be computed and fused with the second depth probability volume to obtain a fusion depth probability volume, from which a fusion depth estimate for the scene is generated.

In the examples of FIGS. 16 and 17, the fusion depth estimate is generated using a cost function based on a depth probability function derived from the fusion depth probability volume. This approach tends to be more accurate in featureless regions, where the predictions from the neural network architecture are relatively uncertain and hence susceptible to false minima. In other examples, though, the fusion depth estimate for a given pixel may be taken as the depth estimate for that pixel that has the highest probability in the fusion depth probability volume.
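The simpler alternative of taking the highest-probability depth estimate can be sketched for one pixel in Python. The function name and the example bin depths are assumptions for illustration.

```python
def argmax_depth(bin_depths, fused_probs):
    """Depth of the bin with the highest fused probability for one pixel."""
    best = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    return bin_depths[best]


depth = argmax_depth([0.5, 1.0, 2.0, 4.0], [0.1, 0.55, 0.3, 0.05])
```

This returns only depths that coincide with bin values, so the result is quantized to the bins, which is the limitation the continuous depth probability function is described as avoiding.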

In the examples of FIGS. 12 to 20, the fusion depth estimate is a dense depth estimate. In other cases, though, similar methods or systems may be used to obtain semi-dense or sparse depth estimates, for example by processing a subset of the pixels of an input frame using the methods herein. Additionally or alternatively, there may be a one-to-one, one-to-many or many-to-one mapping between respective depth estimates of a fusion depth estimate obtained in accordance with the examples of any one of FIGS. 12 to 20 and the pixels of the input frame for which the fusion depth estimate is obtained.

The examples of FIGS. 12 to 20 are described with reference to processing frames representing observations of a scene. It is to be appreciated, though, that these methods and/or systems may alternatively be used to process still images rather than frames of a video.

The method 1900 of FIG. 19 and/or the system 2000 of FIG. 20 may use any of the systems or apparatus described herein, such as the capture devices of FIGS. 1A to 1C, the computing system 700 of FIG. 7A and/or the robotic device 760 of FIG. 7B. Instructions to perform the method 1900 of FIG. 19 or to implement the system 2000 of FIG. 20 may be stored on a non-transitory computer-readable storage medium such as that described with reference to FIG. 11.

It is to be understood that any feature described in relation to any one example may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the examples, or any combination of any other of the examples. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

An image processing system to estimate depth for a scene, comprising:
a fusion engine to receive a first depth estimate from a geometric reconstruction engine and a second depth estimate from a neural network architecture, and to probabilistically fuse the first depth estimate and the second depth estimate to output a fusion depth estimate for the scene,
wherein the fusion engine is configured to receive an uncertainty measurement for the first depth estimate from the geometric reconstruction engine and an uncertainty measurement for the second depth estimate from the neural network architecture, and
wherein the fusion engine is configured to use the uncertainty measurements to probabilistically fuse the first depth estimate and the second depth estimate.

The system according to claim 1, wherein the fusion engine is configured to receive a surface orientation estimate, and an uncertainty measurement for the surface orientation estimate, from the neural network architecture, and to use the surface orientation estimate and the uncertainty measurement for the surface orientation estimate to probabilistically fuse the first depth estimate and the second depth estimate.

The system according to claim 2, wherein the surface orientation estimate comprises one or more of:
a depth gradient estimate in a first direction;
a depth gradient estimate in a direction orthogonal to the first direction; and
a surface normal estimate.

The system according to any one of claims 1 to 3, wherein the fusion engine is configured to determine a scale estimate when probabilistically fusing the first depth estimate and the second depth estimate.

The system according to any one of claims 1 to 4, wherein:
the scene is captured in a first frame of video data;
the second depth estimate is received for the first frame of video data;
the first depth estimate comprises a plurality of first depth estimates for the first frame of video data, at least one of the plurality of first depth estimates being generated using a second frame of video data that differs from the first frame of video data; and
the fusion engine is configured to iteratively output the fusion depth estimate for the scene, at each iteration processing the second depth estimate and one of the plurality of first depth estimates.

The system according to any one of claims 1 to 5, wherein the first depth estimation, the second depth estimation, and the fusion depth estimation each include a depth map for a plurality of pixels.

The system according to any one of claims 1 to 6, wherein the first depth estimate is a semi-dense depth estimate, and the second depth estimate and the fusion depth estimate each comprise a dense depth estimate.

The system according to any one of claims 1 to 7, comprising:
a monocular camera to capture frames of video data;
a tracking system to determine poses of the monocular camera during observation of the scene; and
the geometric reconstruction engine,
wherein the geometric reconstruction engine is configured to use the poses from the tracking system and the frames of video data to generate depth estimates for at least a subset of pixels from the frames of video data, the geometric reconstruction engine being configured to minimize a photometric error to generate the depth estimates.

The system according to any one of claims 1 to 8, comprising the neural network architecture,
wherein the neural network architecture comprises one or more neural networks and is configured to receive pixel values of a frame of video data and to predict:
a depth estimate for each of a first set of image portions, to generate the second depth estimation;
at least one surface orientation estimate for each of a second set of image portions;
one or more uncertainty measurements associated with each depth estimate; and
one or more uncertainty measurements associated with each surface orientation estimate.

A method of estimating the depth of a scene, the method comprising:
generating a first depth estimation of the scene using a geometric reconstruction of the scene, the geometric reconstruction being configured to output an uncertainty measurement of the first depth estimation;
generating a second depth estimation of the scene using a neural network architecture, the neural network architecture being configured to output an uncertainty measurement of the second depth estimation; and
probabilistically fusing the first depth estimation and the second depth estimation, using the uncertainty measurements, to generate a fusion depth estimation of the scene.

The method according to claim 10, comprising, before generating the first depth estimation:
acquiring, from a camera, image data representing two or more views of the scene,
wherein generating the first depth estimation comprises:
obtaining a pose estimate of the camera; and
generating the first depth estimation by minimizing at least a photometric error that is a function of the pose estimate and the image data.

The method according to claim 10, comprising, before generating the first depth estimation:
acquiring, from a camera, image data representing one or more views of the scene,
wherein generating the second depth estimation comprises:
receiving the image data at the neural network architecture;
predicting, using the neural network architecture, a depth estimate for each of a set of image portions, to generate the second depth estimation;
predicting, using the neural network architecture, at least one surface orientation estimate for each of the set of image portions; and
predicting, using the neural network architecture, a set of uncertainty measurements for each depth estimate and each surface orientation estimate.

The method of claim 12, wherein the surface orientation estimate comprises one or more of:
a depth gradient estimate in a first direction;
a depth gradient estimate in a direction orthogonal to the first direction; and
a surface normal estimate.

The method according to claim 10, comprising, before generating the first depth estimation:
acquiring, from a camera, image data representing two or more views of the scene, the image data comprising a plurality of pixels,
wherein generating the first depth estimation comprises:
obtaining a pose estimate of the camera; and
generating a semi-dense depth estimation comprising depth estimates for a portion of the pixels in the image data,
wherein generating the second depth estimation comprises generating a dense depth estimation for the pixels in the image data, and
wherein probabilistically fusing the first depth estimation and the second depth estimation comprises outputting a dense depth estimation for the pixels in the image data.

The method according to any one of claims 10 to 14, wherein the method is repeated iteratively and, for a subsequent iteration:
the method comprises determining whether to generate the second depth estimation; and
probabilistically fusing the first depth estimation and the second depth estimation comprises, in response to a determination not to generate the second depth estimation, using a set of previous values of the second depth estimation.

The method according to any one of claims 10 to 15, wherein the method is applied to frames of video data and, for a given frame of video data, probabilistically fusing the first depth estimation and the second depth estimation comprises optimizing a cost function that includes a first cost term associated with the first depth estimation and a second cost term associated with the second depth estimation, wherein:
the first cost term comprises a function of fusion depth estimate values, first depth estimate values, and uncertainty values for the first depth estimation;
the second cost term comprises a function of fusion depth estimate values, second depth estimate values, and uncertainty values for the second depth estimation; and
the cost function is optimized to determine the fusion depth estimate values.
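
As an illustration of a cost function with one term per depth source, the following minimal sketch assumes that each term is a squared residual divided by a variance, for a single pixel. Under that assumption (a Gaussian model, which the claim does not require) the minimizer has a closed form, the inverse-variance weighted mean. The function names and the numeric values are hypothetical.

```python
def fusion_cost(d, d_geo, var_geo, d_net, var_net):
    """Two-term cost for one pixel (assumed form: squared residual / variance)."""
    first_term = (d - d_geo) ** 2 / var_geo    # geometric reconstruction term
    second_term = (d - d_net) ** 2 / var_net   # neural network term
    return first_term + second_term


def fuse_depth(d_geo, var_geo, d_net, var_net):
    """Closed-form minimizer of fusion_cost: inverse-variance weighted mean."""
    w_geo = 1.0 / var_geo
    w_net = 1.0 / var_net
    fused = (w_geo * d_geo + w_net * d_net) / (w_geo + w_net)
    fused_var = 1.0 / (w_geo + w_net)
    return fused, fused_var


# Hypothetical values: a confident geometric depth and a less certain network depth.
fused, fused_var = fuse_depth(d_geo=2.0, var_geo=0.01, d_net=2.4, var_net=0.09)
```

In this sketch the fused value lies closer to the estimate with the smaller variance, which is the behavior that weighting by uncertainty is meant to produce.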

The method of claim 16, wherein optimizing the cost function comprises determining a scale factor for the fusion depth estimation, the scale factor indicating the scale of the fusion depth estimation relative to the scene.
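
One way a scale factor could enter such an optimization is sketched below. It assumes, purely for illustration, that the geometric depths are known only up to an unknown scale, that the network depths are metric, and that a single scalar is fitted by weighted least squares. The claim itself does not prescribe this formulation, and all names are hypothetical.

```python
def estimate_scale(d_geo, d_net, var):
    """Return s minimizing sum_i (s * d_geo[i] - d_net[i])**2 / var[i].

    Assumption: d_geo is up-to-scale, d_net is metric, var holds per-pixel variances.
    Setting the derivative with respect to s to zero gives the ratio below.
    """
    numerator = sum(g * n / v for g, n, v in zip(d_geo, d_net, var))
    denominator = sum(g * g / v for g, v in zip(d_geo, var))
    return numerator / denominator


# Hypothetical pixels where the network depths are exactly twice the geometric ones.
d_geo = [1.0, 2.0, 4.0]
d_net = [2.0, 4.0, 8.0]
var = [0.1, 0.2, 0.4]
scale = estimate_scale(d_geo, d_net, var)
```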

The method of claim 16 or 17, comprising generating at least one surface orientation estimate for the scene using the neural network architecture, wherein:
the neural network architecture is configured to output an uncertainty measurement for each of the at least one surface orientation estimate;
the cost function includes a third cost term associated with the at least one surface orientation estimate; and
the third cost term comprises a function of fusion depth estimate values, surface orientation estimate values, and uncertainty values for each of the at least one surface orientation estimate.
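
A cost term tied to surface orientation can be sketched as follows, assuming the surface orientation estimate is a depth gradient along one image row and that the gradient of the fused depth is approximated by a forward difference. Both choices, and all names and values, are assumptions made for illustration.

```python
def gradient_cost(fused_row, grad_pred, grad_var):
    """Sum of squared differences between fused-depth gradients and predicted gradients.

    fused_row: fused depth values along one image row.
    grad_pred: predicted depth gradient between each pair of neighboring pixels.
    grad_var:  variance (uncertainty) of each predicted gradient.
    """
    cost = 0.0
    for i in range(len(fused_row) - 1):
        residual = (fused_row[i + 1] - fused_row[i]) - grad_pred[i]
        cost += residual ** 2 / grad_var[i]
    return cost


# Hypothetical row: the network predicts a slope of 0.5 per pixel.
grad_pred = [0.5, 0.5, 0.5]
grad_var = [0.25, 0.25, 0.25]
cost_flat = gradient_cost([2.0, 2.0, 2.0, 2.0], grad_pred, grad_var)
cost_ramp = gradient_cost([2.0, 2.5, 3.0, 3.5], grad_pred, grad_var)
```

A fused depth row whose slope matches the predicted gradient incurs no cost from this term, while a flat row is penalized in proportion to the confidence of the prediction.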

The method according to claim 10, wherein the geometric reconstruction of the scene is configured to generate a first depth probability volume of the scene, the first depth probability volume comprising:
a first plurality of depth estimates, including the first depth estimation; and
a first plurality of uncertainty measurements, each associated with a respective depth estimate of the first plurality of depth estimates,
wherein the uncertainty measurement associated with a given depth estimate of the first plurality of depth estimates represents a probability that a given region of the scene is at the depth represented by the given depth estimate, and
wherein the neural network architecture is configured to output a second depth probability volume of the scene, the second depth probability volume comprising:
a second plurality of depth estimates, including the second depth estimation; and
a second plurality of uncertainty measurements, each associated with a respective depth estimate of the second plurality of depth estimates,
wherein the uncertainty measurement associated with a given depth estimate of the second plurality of depth estimates represents a probability that a given region of the scene is at the depth represented by the given depth estimate.

The method of claim 19, wherein:
generating the second depth estimation of the scene comprises processing image data representing an image of the scene, using the neural network architecture, to generate the second depth probability volume; and
the second plurality of depth estimates comprises a plurality of sets of depth estimates, each associated with a different respective portion of the image of the scene.

The method of claim 19 or 20, wherein the second plurality of depth estimates comprises depth estimates having predefined values.

The method of claim 21, wherein the spacing between the predefined values is non-uniform.

The method of claim 21 or 22, wherein the predefined values comprise a plurality of log-depth values within a predefined depth range.
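
Predefined depth values that are uniform in log-depth, and therefore non-uniformly spaced in depth, can be generated as in the sketch below. The depth range and the number of bins are hypothetical.

```python
import math


def log_depth_bins(d_min, d_max, count):
    """Return `count` depth values spaced uniformly in log-depth between d_min and d_max."""
    log_min = math.log(d_min)
    log_max = math.log(d_max)
    step = (log_max - log_min) / (count - 1)
    return [math.exp(log_min + k * step) for k in range(count)]


# Hypothetical range of 0.5 to 8.0 with 5 bins: each bin is twice the previous one.
bins = log_depth_bins(0.5, 8.0, 5)
```

With this spacing the bins are packed more closely at near depths and spread further apart at far depths.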

The method according to any one of claims 19 to 23, wherein generating the first depth probability volume of the scene comprises:
processing a first frame of video data representing a first observation of the scene and a second frame of video data representing a second observation of the scene to generate, for each of a plurality of portions of the first frame, a set of photometric errors, each photometric error being associated with a different respective depth estimate of the first plurality of depth estimates; and
scaling the photometric errors to convert the photometric errors into respective probability values.
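
The conversion of scaled photometric errors into probabilities can be sketched for a single image portion as follows. The sketch assumes a softmax over negated, scaled errors, which is one common choice and is not stated in the claim. The scale value and the error values are hypothetical.

```python
import math


def errors_to_probabilities(errors, scale):
    """Map per-depth photometric errors to probabilities (assumed: softmax of -error/scale)."""
    scaled = [-e / scale for e in errors]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical photometric errors for four candidate depths of one pixel.
errors = [0.9, 0.2, 0.05, 0.4]
probs = errors_to_probabilities(errors, scale=0.1)
```

The candidate depth with the smallest photometric error receives the largest probability, and the probabilities over all candidate depths sum to one.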

The method of any one of claims 19 to 24, wherein probabilistically fusing the first depth estimation and the second depth estimation using the uncertainty measurements comprises combining the first plurality of uncertainty measurements with the second plurality of uncertainty measurements to generate a fusion probability volume.
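
Combining the two sets of probabilities can be sketched for one pixel as below. The sketch assumes that both volumes share the same depth bins and that they are combined by an element-wise product followed by renormalization, which treats the two sources as independent. The claim does not fix the combination rule, and the names and values are hypothetical.

```python
def fuse_probabilities(p_geo, p_net):
    """Fuse two per-pixel depth distributions defined over the same depth bins."""
    product = [a * b for a, b in zip(p_geo, p_net)]
    total = sum(product)
    if total == 0.0:
        # The two distributions do not overlap: fall back to a uniform distribution.
        return [1.0 / len(product)] * len(product)
    return [p / total for p in product]


# Hypothetical distributions over four depth bins for one pixel.
p_geo = [0.1, 0.6, 0.2, 0.1]
p_net = [0.25, 0.25, 0.4, 0.1]
fused_probs = fuse_probabilities(p_geo, p_net)
```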

The method of claim 25, wherein generating the fusion depth estimation of the scene comprises obtaining the fusion depth estimation of the scene from the fusion probability volume.

The method of claim 25 or 26, comprising:
obtaining a depth probability function using the fusion probability volume; and
obtaining the fusion depth estimation using the depth probability function.
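
A depth probability function derived from discrete bin probabilities can be sketched for one pixel as follows. The sketch assumes a Gaussian kernel placed on each depth bin and a simple grid search for the most likely depth. Both are illustrative choices rather than the claimed procedure, and the bandwidth and bin values are hypothetical.

```python
import math


def depth_probability_function(bin_depths, bin_probs, bandwidth):
    """Return a continuous density over depth built from discrete bin probabilities."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)

    def density(d):
        return sum(
            p * math.exp(-0.5 * ((d - c) / bandwidth) ** 2) / norm
            for c, p in zip(bin_depths, bin_probs)
        )

    return density


def most_likely_depth(density, d_min, d_max, steps=1000):
    """Grid search for the depth that maximizes the density."""
    candidates = [d_min + (d_max - d_min) * i / steps for i in range(steps + 1)]
    return max(candidates, key=density)


# Hypothetical fused probabilities: two neighboring bins are equally likely.
bin_depths = [1.0, 2.0, 3.0, 4.0]
bin_probs = [0.05, 0.45, 0.45, 0.05]
density = depth_probability_function(bin_depths, bin_probs, bandwidth=1.0)
depth = most_likely_depth(density, 1.0, 4.0)
```

Because the function is continuous, the resulting depth can fall between two bins instead of being restricted to the predefined bin values.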

The method according to any one of claims 25 to 27, wherein obtaining the fusion depth estimation comprises optimizing a cost function comprising:
a first cost term obtained using the fusion probability volume; and
a second cost term comprising a local geometric constraint on depth values.

The method of claim 28, comprising:
receiving a surface orientation estimate and an occlusion boundary estimate from a further neural network architecture; and
generating the second cost term using the surface orientation estimate and the occlusion boundary estimate.

The method according to any one of claims 25 to 29, wherein the fusion depth probability volume is a first fusion depth probability volume associated with a first frame of video data representing a first observation of the scene, the method comprising:
converting the first fusion depth probability volume into a first occupancy probability volume;
warping the first occupancy probability volume, based on pose data representing a pose of a camera during observation of the scene, to obtain a second occupancy probability volume associated with a second frame of video data representing a second observation of the scene; and
converting the second occupancy probability volume into a second fusion depth probability volume associated with the second frame.

An image processing system that estimates the depth of a scene, the system comprising:
a fusion engine that receives a first depth probability volume from a geometric reconstruction engine and a second depth probability volume from a neural network architecture, and fuses the first depth probability volume and the second depth probability volume to output a fusion depth probability volume of the scene; and
a depth estimation engine that estimates the depth of the scene using the fusion depth probability volume.

A method of estimating the depth of a scene, the method comprising:
generating a first depth probability volume of the scene using a geometric reconstruction of the scene;
generating a second depth probability volume of the scene using a neural network architecture;
fusing the first depth probability volume and the second depth probability volume to generate a fusion depth probability volume of the scene; and
generating a fusion depth estimation of the scene using the fusion depth probability volume.

A computing system comprising:
a monocular capture device that provides frames of video;
a simultaneous localization and mapping system that provides pose data for the monocular capture device;
the system according to claim 1 or 31;
a semi-dense multi-view stereo component that receives the pose data and the frames of video and implements the geometric reconstruction engine; and
electronic circuitry that implements the neural network architecture.

A robotic device comprising:
the computing system according to claim 33;
one or more actuators that enable the robotic device to interact with a surrounding three-dimensional environment, at least a portion of the surrounding three-dimensional environment being shown in the scene; and
an interaction engine comprising at least one processor that controls the one or more actuators,
wherein the interaction engine uses the fusion depth estimation to interact with the surrounding three-dimensional environment.

A non-transitory computer-readable storage medium comprising computer-executable instructions that, when executed by a processor, cause a computing device to perform the method of any one of claims 10 to 30.