JP2023035903A

JP2023035903A - Depth estimation method and depth estimation device

Info

Publication number: JP2023035903A
Application number: JP2022129542A
Authority: JP
Inventors: 良太朗角田; Ryotaro Tsunoda
Original assignee: Morpho Inc
Current assignee: Morpho Inc
Priority date: 2021-08-30
Filing date: 2022-08-16
Publication date: 2023-03-13

Abstract

To provide a technique for acquiring a high-quality depth map. SOLUTION: A depth estimation method includes acquiring a plurality of depth maps and outputting a single output depth map by combining the plurality of depth maps while reducing an average difference in distance value between adjacent pixels, compared to a case of directly using distance values contained in the plurality of depth maps. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a depth estimation method and a depth estimation device.

An image captured by an imaging device such as a camera represents luminance information of a subject, whereas a depth map (also called a depth image) detected by a ranging sensor such as a ToF (Time of Flight) sensor or LiDAR (Light Detection and Ranging) represents the distance, or depth information, between the ranging sensor and the subject. Such a depth map can be used, for example, for photo processing of captured images and for object detection for the autonomous operation of vehicles, robots, and the like.

With the advancement of AI (Artificial Intelligence) technology, depth estimation models have been developed that estimate, from an image acquired by an imaging device, a depth map representing the distance between the subject and the imaging device, that is, the depth. For example, MiDaS (https://github.com/intel-isl/MiDaS) and DPT (https://github.com/intel-isl/DPT) are known as depth estimation models for monocular images.

Meanwhile, as mobile terminals such as smartphones and tablets have become more sophisticated in recent years, ranging sensors such as ToF (Time of Flight) sensors and LiDAR (Light Detection and Ranging) sensors have come to be installed in mobile terminals. For example, Patent Literature 1 discloses a processing system that aligns depth maps acquired by a ToF sensor and a stereo camera and outputs an optimized depth map.

Patent Literature 1: Japanese Patent Application Laid-Open No. 2020-042772

However, a depth map acquired with a ToF sensor typically has accurate distance values but may contain many missing pixels. A depth estimation model based on a deep neural network, on the other hand, outputs a depth map that is consistent as a whole, but in some cases it cannot obtain accurate distance values or capture fine textures.

For this reason, simply filling in missing pixels with a depth map obtained from a camera, as in the technique of Patent Literature 1, makes the boundary between the pixel region having distance values obtained by the ToF sensor and the remaining pixel region stand out unnaturally. In other words, a high-quality depth map cannot be obtained.

In view of the above problem, one object of the present disclosure is to provide a technique for obtaining a high-quality depth map.

One aspect of the present disclosure includes acquiring a plurality of depth maps, and combining the plurality of depth maps and outputting one output depth map while reducing the average difference between the distance values of adjacent pixels compared with a case where the distance values contained in the plurality of depth maps are used as they are.

According to the present disclosure, a technique for obtaining a high-quality depth map can be provided.

FIG. 1 is a schematic diagram illustrating depth estimation processing according to one embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating a depth estimation system according to one embodiment of the present disclosure.
FIG. 3 is a diagram illustrating depth maps according to one embodiment of the present disclosure.
FIG. 4 is a block diagram illustrating the hardware configuration of a depth estimation device according to one embodiment of the present disclosure.
FIG. 5 is a block diagram illustrating the functional configuration of a depth estimation device according to one embodiment of the present disclosure.
FIG. 6 is a flowchart illustrating depth estimation processing according to one embodiment of the present disclosure.
FIG. 7 is a schematic diagram illustrating depth estimation processing according to another embodiment of the present disclosure.

Embodiments of the present disclosure will be described below with reference to the drawings.

The following embodiments disclose a depth estimation device that combines a depth map or depth image (hereinafter collectively referred to as a depth map) inferred by a depth estimation model from an image (for example, an RGB image) of a measurement target area with a depth map obtained from a ranging sensor, according to a cost function including the constraints described below. For example, the depth estimation device of the present disclosure can be used to implement depth completion, which complements a depth map acquired by a ranging sensor to an image level equivalent to that of an RGB image. Throughout this specification, a depth map is two-dimensional data having a distance value for each pixel.

[Overview]
To outline an embodiment of the present disclosure described later: as shown in FIG. 1, the depth estimation device 100 combines a depth map T acquired by a ToF sensor for a measurement target area with a depth map P inferred from an RGB image of the measurement target area by a trained depth estimation model, and generates a combined depth map O. In doing so, the depth estimation device 100 uses a cost function to construct the depth map O such that, for each pixel of the depth map O, the distance value matches that of the depth map T when the depth map T has a distance value for the corresponding pixel and matches that of the depth map P when it does not, and such that the distance value of each pixel of the depth map O approaches the distance values of its adjacent pixels.

To realize such combining, the depth estimation device 100 combines the depth map T and the depth map P according to a cost function that includes the following three constraints:
(Constraint 1) When the depth map T has a distance value corresponding to the pixel of interest, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map T.
(Constraint 2) When the depth map T has no distance value corresponding to the pixel of interest, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map P.
(Constraint 3) Bring the distance value of the pixel of interest close to the distance values of the pixels neighboring the pixel of interest.

The depth estimation device 100 of the embodiments described later constructs the depth map O using the distance values of the depth map T for pixels whose distance values are not missing in the depth map T, and using the distance values of the depth map P for pixels whose distance values are missing in the depth map T. As a result, a depth map O that is highly accurate globally can be obtained. In addition, because the depth map O is constructed so that the distance value of each pixel approaches the distance values of its adjacent pixels, a depth map O that is smoothed between adjacent pixels can be obtained.

[Depth estimation system]
First, a depth estimation system according to one embodiment of the present disclosure will be described with reference to FIGS. 2 to 4. FIG. 2 is a block diagram illustrating a depth estimation system according to one embodiment of the present disclosure.

As shown in FIG. 2, the depth estimation system 10 has a camera 20, a ToF sensor 30, a preprocessing device 40, and a depth estimation device 100.

The camera 20 images the measurement target area and generates an RGB image of it. For example, the camera 20 may be a monocular camera that generates a monocular RGB image of the measurement target area including the subject. The generated RGB image is passed to the preprocessing device 40. However, the depth estimation system according to the present disclosure is not limited to the camera 20 and may include any other type of imaging device that images the measurement target area. Likewise, the depth estimation system is not limited to RGB images and may acquire or process image data in other formats that can be converted into a depth map by the preprocessing device 40 and the inference engine 41.

The ToF sensor 30 detects the distance (depth) between each subject in the measurement target area and the ToF sensor 30, and generates ToF data or a ToF image (hereinafter collectively referred to as ToF data). The generated ToF data is passed to the preprocessing device 40. However, the depth estimation system according to the present disclosure is not limited to the ToF sensor 30; it may include any other suitable type of ranging sensor capable of generating a depth map, such as a LiDAR sensor, and may acquire ranging data corresponding to the type of ranging sensor provided.

The preprocessing device 40 preprocesses the RGB image acquired from the camera 20 and acquires a depth map P as the inference result of the inference engine 41. Here, the inference engine 41 receives an RGB image as input and outputs a depth map P indicating the distance (depth) between each subject in the measurement target area and the camera 20. For example, the inference engine 41 may be an existing depth estimation model such as MiDaS or DPT, or a model trained (for example, distilled) from any one or more existing depth estimation models. The inference engine 41 may be installed in the preprocessing device 40, or it may be installed on an external server (not shown), with the inference result passed to the preprocessing device 40 via a network.

Specifically, the preprocessing device 40 inputs the acquired RGB image to the inference engine 41 and acquires the depth map P as the inference result. Typically, the depth map P is consistent as a whole, but it does not necessarily represent accurate distance values and may not represent fine textures. The preprocessing device 40 may then rescale the depth map P to match the size of the ToF data T.

Meanwhile, the preprocessing device 40 may perform preprocessing such as noise removal on the acquired ToF data. For example, the preprocessing device 40 performs an opening operation on the ToF data to remove isolated pixels, because pixels with isolated distance values are likely to be noise. The preprocessing device 40 may also preprocess distant-view pixels of the ToF data so as to bring them closer to the depth map P. In general, the range over which the ToF sensor 30 can measure distance well is several meters, and the distant-view portion of the ToF data may be corrected so as to be close to the distance values of the corresponding portion of the depth map P obtained by the inference engine 41.
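The opening step described above can be sketched as follows. This is a minimal illustration and not the implementation of the disclosure: it assumes that missing ToF pixels are stored as NaN, that the opening is applied to the validity mask rather than to the distance values, and that a 3x3 structuring element is used. The function names are hypothetical.

```python
import numpy as np

def binary_erode(mask):
    """3x3 erosion: a pixel stays True only if its whole 3x3 neighborhood is True."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.ones((h, w), dtype=bool)
    for di in range(3):
        for dj in range(3):
            out &= padded[di:di + h, dj:dj + w]
    return out

def binary_dilate(mask):
    """3x3 dilation: a pixel becomes True if any pixel in its 3x3 neighborhood is True."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    for di in range(3):
        for dj in range(3):
            out |= padded[di:di + h, dj:dj + w]
    return out

def remove_isolated_tof_pixels(tof):
    """Opening (erosion, then dilation) of the validity mask of ToF data.

    Assumption: missing pixels are NaN. Pixels whose valid region is too
    small to survive the erosion are marked as missing in the result.
    """
    valid = ~np.isnan(tof)
    opened = binary_dilate(binary_erode(valid))
    cleaned = tof.copy()
    cleaned[~opened] = np.nan
    return cleaned
```

With this choice of structuring element, a single valid pixel surrounded by missing pixels is discarded, while a solid 3x3 block of valid pixels is kept unchanged.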

The preprocessing device 40 may also refer to the ToF data and adaptively shift the vicinity of the center of the inference result toward the near view. ToF data typically cannot capture very near objects or objects of particular colors. Consequently, if the ToF data T and the depth map P are combined without preprocessing, the depth map O is scaled to the distant-view portion and a subject in the central portion ends up in the distant view. The preprocessing device 40 passes the ToF data T and the depth map P preprocessed in this manner to the depth estimation device 100.

The depth estimation device 100 combines the ToF data T and the depth map P obtained from the preprocessing device 40 according to a cost function and generates a combined depth map O. The cost function according to one embodiment of the present disclosure may include the following three constraints:
(Constraint 1) When the ToF data T has a distance value corresponding to the pixel of interest, bring the distance value of the pixel of interest in the depth map O close to the distance value of the ToF data T.
(Constraint 2) When the ToF data T has no distance value corresponding to the pixel of interest, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map P.
(Constraint 3) Bring the distance value of the pixel of interest close to the distance values of the pixels neighboring the pixel of interest.
That is, the depth estimation device 100 constructs the depth map O using the distance values of the ToF data T for pixels whose distance values are not missing in the ToF data T, and using the distance values of the depth map P for pixels whose distance values are missing in the ToF data T. Furthermore, the depth estimation device 100 constructs the depth map O so that the distance value of each pixel of the depth map O approaches the distance values of its adjacent pixels. This makes it possible to obtain a smoothed depth map O that is highly accurate globally.

For example, as shown in FIG. 3, when an inferred depth map P and measured ToF data T are acquired for the measurement target area, the depth estimation device 100 can obtain the combined depth map O shown in the figure according to the cost function described above. As can be observed from FIG. 3, the depth map O is considered to reproduce the depth of each object in the measurement target area better than either the ToF data T or the depth map P.

Here, the depth estimation device 100 is implemented by a computing device such as a smartphone, tablet, or personal computer, and may have, for example, the hardware configuration shown in FIG. 4. That is, the depth estimation device 100 has a storage device 101, a processor 102, a user interface (UI) device 103, and a communication device 104, which are interconnected via a bus B.

Programs or instructions that realize the various functions and processes of the depth estimation device 100 described later may be downloaded from an external device via a network or the like, or may be provided from a removable storage medium such as a CD-ROM (Compact Disk-Read Only Memory) or flash memory.

The storage device 101 is implemented by one or more non-transitory storage media such as random access memory, flash memory, or a hard disk drive, and stores installed programs or instructions together with files, data, and the like used to execute them.

The processor 102 may be implemented by one or more CPUs (Central Processing Units), each of which may consist of one or more processor cores, GPUs (Graphics Processing Units), processing circuitry, or the like. The processor 102 executes the various functions and processes of the depth estimation device 100 described later in accordance with the programs and instructions stored in the storage device 101 and data such as the parameters needed to execute them.

The user interface (UI) device 103 may consist of input devices such as a keyboard, mouse, camera, and microphone, output devices such as a display, speaker, headset, and printer, and input/output devices such as a touch panel, and it provides the interface between the user and the depth estimation device 100. For example, the user operates the depth estimation device 100 by operating a GUI (Graphical User Interface) displayed on a display or touch panel with a keyboard, mouse, or the like.

The communication device 104 is implemented by various communication circuits that perform communication processing with external devices and with communication networks such as the Internet and a LAN (Local Area Network).

However, the hardware configuration described above is merely an example, and the depth estimation device 100 according to the present disclosure may be implemented with any other suitable hardware configuration. For example, some or all of the camera 20, the ToF sensor 30, and the preprocessing device 40 may be incorporated into the depth estimation device 100.

[Depth estimation device]
Next, the depth estimation device 100 according to one embodiment of the present disclosure will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the functional configuration of the depth estimation device 100 according to one embodiment of the present disclosure.

As shown in FIG. 5, the depth estimation device 100 has an acquisition unit 110 and a derivation unit 120.

The acquisition unit 110 acquires a first depth map acquired by a ranging sensor for the measurement target area and a second depth map inferred from an image of the measurement target area by a trained inference engine. That is, the acquisition unit 110 acquires the ToF data T and the depth map P from the preprocessing device 40 and passes them to the derivation unit 120.

Here, the ToF data T may be data obtained by the preprocessing device 40 preprocessing the detection result of the ToF sensor 30. The preprocessing may be, for example, an opening operation for removing noise or a correction of the distant-view portion.

The depth map P may be data obtained by resizing the inference result of the inference engine 41 for the RGB image captured by the camera 20 so as to match the size of the ToF data T. For example, the ToF data T and the depth map P may be resized into two-dimensional data 224 pixels wide and 168 pixels high.
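As a rough illustration of the resizing step, the following sketch brings an inferred depth map to the 224 x 168 size mentioned above using nearest-neighbor sampling. The interpolation method, the function name `resize_nearest`, and the 512 x 384 input size are assumptions made for the example; the disclosure does not specify them.

```python
import numpy as np

def resize_nearest(depth, out_h, out_w):
    """Nearest-neighbor resize of a 2-D depth map to (out_h, out_w)."""
    in_h, in_w = depth.shape
    # Map every output row/column back to a source row/column index.
    rows = np.minimum((np.arange(out_h) * in_h) // out_h, in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w) // out_w, in_w - 1)
    return depth[rows[:, None], cols[None, :]]

# Hypothetical inference result, 384 rows by 512 columns.
p_raw = np.linspace(0.0, 1.0, 384 * 512).reshape(384, 512)

# Depth map P resized to width 224 and height 168 (array shape is rows x columns).
p = resize_nearest(p_raw, 168, 224)
```

Nearest-neighbor sampling is used here only because it avoids blending distance values across object boundaries; a different interpolation could equally be chosen.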

The derivation unit 120 derives a third depth map from the first depth map and the second depth map according to a cost function. Here, the cost function includes:
a first constraint for bringing the distance value of the pixel of interest in the third depth map close to the distance value of the first depth map when a distance value corresponding to the pixel of interest exists in the first depth map;
a second constraint for bringing the distance value of the pixel of interest in the third depth map close to the distance value of the second depth map when no distance value corresponding to the pixel of interest exists in the first depth map; and
a third constraint for bringing the distance value of the pixel of interest close to the distance values of the pixels neighboring the pixel of interest.

Specifically, the derivation unit 120 combines the ToF data T and the depth map P according to a cost function including the first to third constraints, and generates the combined depth map O. In one embodiment, the cost function E(x) is formulated as Equation (1), in which x is the depth map O, T is the ToF data T, P is the depth map P, and I is the RGB image. In addition, w_0, w_1, ε, and M are parameters, and ∇ is the gradient operator. The derivation unit 120 obtains the depth map x that minimizes Equation (1) and uses it as the depth map O.
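The image of Equation (1) is not reproduced in this text. One form that is consistent with the description of its three terms is sketched below. The index set \Omega_T (the pixels for which T holds a distance value), the forward-difference definition of \nabla x, and the exact placement of w_0, w_1, M, and \varepsilon are assumptions made for illustration and may differ from the published formula.

```latex
\begin{aligned}
E(x) ={}& \sum_{(i,j)\in\Omega_T} \left(x_{i,j} - T_{i,j}\right)^2
      + w_0 \sum_{(i,j)\notin\Omega_T} \left(x_{i,j} - P_{i,j}\right)^2
      + w_1 \sum_{i,j} \frac{(\nabla x)_{i,j}^2}{M\,(\nabla I)_{i,j}^2 + \varepsilon}, \\
(\nabla x)_{i,j}^2 ={}& \left(x_{i,j+1} - x_{i,j}\right)^2 + \left(x_{i+1,j} - x_{i,j}\right)^2 .
\end{aligned}
```

Under this reading, the first sum ties x to T where T is available, the second ties x to P elsewhere with weight w_0, and the third penalizes differences between adjacent pixels, less strongly where the RGB image has a large gradient.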

The first term on the right side of Equation (1) relates to the first constraint: for pixels for which the ToF data T has a distance value, it requires the corresponding pixel of the final output x to match the distance value of the ToF data T.

The second term on the right side of Equation (1) relates to the second constraint: for pixels whose distance values are missing in the ToF data T, it requires the corresponding pixel of the final output x to match the distance value of the depth map P.

The third term on the right side of Equation (1) relates to the third constraint. Its numerator requires the distance values of the pixel of interest in the final output x and of its adjacent pixels to be close, that is, it requires smoothing. Its denominator weakens the smoothing effect of the numerator in edge regions between the imaged subject and the background.

The parameters w_0 and w_1 are positive weights for balancing the influence of the three terms (in particular, w_0 may be set to less than 1 so that the ToF data T is given more weight than the inferred depth map P). The parameter M is a positive weight that defines how much the smoothing effect is weakened in edge regions. The parameter ε is a small positive constant for avoiding division by zero. Note that (∇I)^2 can be defined by computing a differential image for each of the RGB channels and averaging them along the channel direction.
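The role of M, ε, and (∇I)^2 can be illustrated with the following sketch, which computes a per-pixel smoothing weight from an RGB image. It assumes forward differences toward the right and downward neighbors, a denominator of the form M(∇I)^2 + ε, and a weight w_1 divided by that denominator; these choices and the function names are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def squared_image_gradient(rgb):
    """(grad I)^2 per pixel: squared forward differences in each RGB channel,
    summed over the two directions and averaged along the channel axis."""
    img = rgb.astype(np.float64)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, :-1, :] = img[:, 1:, :] - img[:, :-1, :]   # right neighbor minus pixel
    dy[:-1, :, :] = img[1:, :, :] - img[:-1, :, :]   # lower neighbor minus pixel
    return (dx ** 2 + dy ** 2).mean(axis=2)

def smoothness_weight(rgb, w1=1.0, m=10.0, eps=1e-6):
    """Assumed smoothing weight w1 / (M * (grad I)^2 + eps).

    The weight is large in flat image regions and small at image edges,
    so smoothing is weakened where the RGB image has strong gradients.
    """
    return w1 / (m * squared_image_gradient(rgb) + eps)
```

Because the weight depends only on the RGB image and the parameters, it can be computed once before the depth map is solved for.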

The derivation unit 120 can obtain the x that minimizes Equation (1) as follows. For simplicity of explanation, a coefficient G is introduced; since G does not depend on x, it can be calculated in advance. The cost function of Equation (1) can then be rewritten as Equation (2). Since Equation (2) has the form of a sum of squares of linear expressions, the x that minimizes E(x) can be obtained exactly as the least-squares solution of a linear system, as follows.
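The definition of G and Equation (2) are not reproduced in this text. Assuming that the smoothness term of Equation (1) is a weight w_1 divided by M(\nabla I)^2 + \varepsilon and multiplied by squared forward differences of x, one consistent way to write them is the following sketch; \Omega_T denotes the pixels for which T holds a distance value, and all of these forms are assumptions rather than the published equations.

```latex
\begin{aligned}
G_{i,j} ={}& \frac{w_1}{M\,(\nabla I)_{i,j}^2 + \varepsilon}, \\
E(x) ={}& \sum_{(i,j)\in\Omega_T} \left(x_{i,j} - T_{i,j}\right)^2
      + \sum_{(i,j)\notin\Omega_T} \left(\sqrt{w_0}\,\left(x_{i,j} - P_{i,j}\right)\right)^2 \\
     & + \sum_{i,j} \left[ \left(\sqrt{G_{i,j}}\,\left(x_{i,j+1} - x_{i,j}\right)\right)^2
      + \left(\sqrt{G_{i,j}}\,\left(x_{i+1,j} - x_{i,j}\right)\right)^2 \right].
\end{aligned}
```

Every squared quantity is a linear expression in the unknown pixel values of x, which is what allows the minimization to be posed as a linear least-squares problem.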

Now consider the condition on x under which E(x) = 0. This holds when the linear expression inside every term of E(x) is zero. Therefore, it suffices that Equation (3), together with its accompanying definitions, holds for every pixel (i, j). When pixel (i, j) is at the right edge or bottom edge of the two-dimensional data, so that x_{i,j+1} or x_{i+1,j} is undefined, the first-order term in which the undefined variable appears is set to zero.

Equation (3) can be expressed in matrix form as Equation (4). Here, the pixels of x are arranged one-dimensionally in raster-scan order. This linear system has more conditions than variables and is therefore over-determined; in general it has no exact solution, so it is appropriate to seek a least-squares solution. This least-squares solution coincides with the x that exactly minimizes E(x) (ideally to zero). Therefore, the derivation unit 120 can obtain the x that minimizes the cost function E(x) by obtaining the least-squares solution of Equation (4).
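The construction of the over-determined system and its least-squares solution can be sketched as follows. The row layout (one data row per pixel, plus one smoothness row toward the right neighbor and one toward the lower neighbor where they exist), the use of sqrt(w0) and sqrt(g) as row scalings, NaN as the marker for missing ToF values, and the function name `fuse_depth` are all assumptions for illustration. The sketch builds a dense matrix, which is only practical for very small maps; at the resolutions mentioned in the text a sparse solver would be needed.

```python
import numpy as np

def fuse_depth(tof, pred, g, w0=0.01):
    """Least-squares fusion of ToF data and an inferred depth map (sketch).

    tof:  HxW array, NaN where the distance value is missing (assumption).
    pred: HxW inferred depth map.
    g:    HxW non-negative smoothness coefficients, independent of x.
    """
    h, w = tof.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)      # raster-scan index of each pixel
    rows, rhs = [], []

    def add_row(entries, value):
        row = np.zeros(n)
        for k, c in entries:
            row[k] = c
        rows.append(row)
        rhs.append(value)

    for i in range(h):
        for j in range(w):
            k = idx[i, j]
            if not np.isnan(tof[i, j]):
                add_row([(k, 1.0)], tof[i, j])            # x = T where T exists
            else:
                s = np.sqrt(w0)
                add_row([(k, s)], s * pred[i, j])         # x = P where T is missing
            s = np.sqrt(g[i, j])
            if j + 1 < w:                                 # right neighbor exists
                add_row([(idx[i, j + 1], s), (k, -s)], 0.0)
            if i + 1 < h:                                 # lower neighbor exists
                add_row([(idx[i + 1, j], s), (k, -s)], 0.0)

    a = np.array(rows)
    b = np.array(rhs)
    x, *_ = np.linalg.lstsq(a, b, rcond=None)             # minimizes ||a x - b||^2
    return x.reshape(h, w)
```

With all smoothness coefficients set to zero, the result simply copies T where it exists and P elsewhere; positive coefficients pull adjacent pixels toward each other.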

As a specific example, consider a case where the following ToF data T, depth map P, coefficient G, and final output x are given. Here, the distance values of the final output x are yet to be determined. T_{1,2} and T_{2,1} are blank, which means that the distance values of those pixels are missing.

For these inputs, the first, second, and third terms on the right side of the cost function of Equation (2) are each written out explicitly, with w_0 = 0.01 used in the second term.

Putting these together gives the cost function E(x) for this example, which can be seen to be a sum of squares of linear expressions in x. The exact solution x that minimizes the cost can therefore be derived as the least-squares solution of the linear system of Equation (5).

The least-squares solution of Equation (5) is easily obtained by singular value decomposition or the like. In this way, by obtaining the least-squares solution of Equation (5), the derivation unit 120 can derive the x that minimizes the cost function E(x) in a reasonable computation time and can obtain the depth map O combined from the ToF data T and the depth map P.
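A small numerical illustration of solving such a system by singular value decomposition is sketched below for a 2 x 2 map in which T_{1,2} and T_{2,1} are missing and w_0 = 0.01, as in the example above. The distance values, the coefficient g, the row layout of the matrix, and the sqrt(w_0) scaling are hypothetical; the example's actual numbers are not reproduced in this text.

```python
import numpy as np

w0 = 0.01                   # value stated in the text
s = np.sqrt(w0)             # assumed scaling of the rows that tie x to P
t11, t22 = 1.0, 2.0         # hypothetical ToF distance values
p12, p21 = 1.5, 1.4         # hypothetical inferred distance values
r = np.sqrt(1.0)            # square root of a hypothetical coefficient g = 1

# Unknowns in raster-scan order: x11, x12, x21, x22.
a = np.array([
    [1, 0, 0, 0],           # x11 = T11
    [0, 0, 0, 1],           # x22 = T22
    [0, s, 0, 0],           # sqrt(w0) * x12 = sqrt(w0) * P12
    [0, 0, s, 0],           # sqrt(w0) * x21 = sqrt(w0) * P21
    [-r, r, 0, 0],          # x12 - x11 = 0
    [-r, 0, r, 0],          # x21 - x11 = 0
    [0, -r, 0, r],          # x22 - x12 = 0
    [0, 0, -r, r],          # x22 - x21 = 0
], dtype=float)
b = np.array([t11, t22, s * p12, s * p21, 0, 0, 0, 0], dtype=float)

# Least-squares solution through the thin SVD: x = V diag(1/sigma) U^T b.
u, sv, vt = np.linalg.svd(a, full_matrices=False)
x = vt.T @ ((u.T @ b) / sv)
depth_map_o = x.reshape(2, 2)
```

The matrix has full column rank because every unknown appears in a data row, so all singular values are positive and the division by `sv` is well defined.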

[Depth estimation processing]
Next, depth estimation processing according to one embodiment of the present disclosure will be described with reference to FIG. 6. The depth estimation processing is executed by the depth estimation device 100 described above; more specifically, it may be realized by one or more processors 102 of the depth estimation device 100 executing one or more programs or instructions stored in one or more storage devices 101. For example, the depth estimation processing can be started when the user of the depth estimation device 100 launches an application or the like related to the processing.

FIG. 6 is a flowchart illustrating depth estimation processing according to one embodiment of the present disclosure.

As shown in FIG. 6, in step S101, the depth estimation device 100 acquires the ToF data T and the depth map P inferred from the RGB image I of the measurement target area. Specifically, the camera 20 images the measurement target area to acquire the RGB image I, and the ToF sensor 30 measures the same area to acquire ToF data indicating the distance between the ToF sensor 30 and each object in the area.

Next, the preprocessing device 40 preprocesses the acquired ToF data to obtain the ToF data T. The preprocessing device 40 also uses the inference engine 41 to generate the depth map P from the RGB image I. For example, the ToF data T may be obtained by applying opening processing, correction processing, and the like to the ToF data acquired from the ToF sensor 30. The depth map P may be obtained by resizing the inference result of the inference engine 41 so that it matches the size of the ToF data T.
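The resizing step can be sketched as follows. The disclosure does not say which interpolation is used, so nearest-neighbour sampling is assumed here purely for illustration, and the function name `resize_nearest` is hypothetical.

```python
def resize_nearest(depth, out_h, out_w):
    """Resize a 2-D list of distance values by nearest-neighbour sampling."""
    in_h, in_w = len(depth), len(depth[0])
    return [
        [
            depth[min(in_h - 1, (r * in_h) // out_h)][min(in_w - 1, (c * in_w) // out_w)]
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# A 2x2 inference result resized to a 4x4 ToF grid (sizes are made up).
pred = [[1.0, 2.0],
        [3.0, 4.0]]
resized = resize_nearest(pred, 4, 4)
```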

The ToF data T and the depth map P obtained in this way are provided to the depth estimation device 100.

In step S102, the depth estimation device 100 combines the ToF data T and the depth map P according to a cost function and derives the synthesized depth map O. For example, the cost function may include the following three constraints:
(Constraint 1) When a distance value corresponding to the pixel of interest exists in the ToF data T, bring the distance value of the pixel of interest in the depth map O close to the distance value of the ToF data T.
(Constraint 2) When no distance value corresponding to the pixel of interest exists in the ToF data T, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map P.
(Constraint 3) Bring the distance value of the pixel of interest close to the distance values of its neighboring pixels.

Specifically, the cost function may be formulated as

where x is the depth map O, T is the ToF data T, P is the depth map P, and I is the RGB image. Further, w_0, w_1, ε, and M are parameters, and ∇ is the gradient operator. The depth estimation device 100 takes the depth map x that minimizes the cost function E(x) as the depth map O. The depth map x that minimizes the cost function E(x) can be obtained as the least-squares solution of the linear equation obtained by setting E(x) = 0.
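The formula itself is not reproduced in this text. One form that is consistent with the three constraints and with the symbols listed above is sketched below, as an assumption only. Here \Omega_T is notation introduced for this sketch (the set of pixels that have a distance value in T), and the exact shape of the denominator in the third term is a guess; the description states only that a denominator in the third term weakens smoothing in edge regions.

```latex
E(x) = \sum_{i \in \Omega_T} (x_i - T_i)^2
     + w_0 \sum_{i \notin \Omega_T} (x_i - P_i)^2
     + w_1 \sum_{i} \frac{\lVert \nabla x_i \rVert^2}{\lVert \nabla I_i \rVert^{M} + \varepsilon}
```

With a finite-difference gradient, every term in this form is the square of an expression that is linear in x, which matches the statement that the minimizer can be found as a linear least-squares solution.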

According to the embodiment described above, the depth estimation device 100 uses a cost function including the following three constraints to combine the depth map T acquired from the ranging sensor with the depth map P inferred from an image from the imaging device, and thereby constructs the synthesized depth map O:
(Constraint 1) When a distance value corresponding to the pixel of interest exists in the depth map T, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map T.
(Constraint 2) When no distance value corresponding to the pixel of interest exists in the depth map T, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map P.
(Constraint 3) Bring the distance value of the pixel of interest close to the distance values of its neighboring pixels.
This yields a depth map O that is highly accurate globally and smoothed between adjacent pixels.

[Other Examples]
The above embodiment described a mode in which a cost function including three constraints is used to combine the depth map T acquired from the ranging sensor with the depth map P inferred from an image from the imaging device, thereby constructing the synthesized depth map O. However, the depth maps to be combined need not be this combination. As one example, instead of inferring the depth map P from an image from the imaging device using a trained model, the depth estimation device 100 may generate the depth map P based on images captured by a stereo camera. It is known that, from two or more images captured by a stereo camera, distance (depth) information from the camera to the subject can be obtained by the principle of triangulation based on the parallax between viewpoints (imaging points). In this example, as shown in FIG. 7, the depth estimation device 100 generates, based on the images captured by the stereo camera, a depth map P having a distance value for each pixel. The generated depth map P is combined with the depth map T acquired from the ranging sensor using the cost function including the three constraints described above, and the synthesized depth map O is finally output.
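For a rectified stereo pair, the triangulation mentioned above reduces to depth = focal length × baseline / disparity. The following sketch applies that standard relation; the function name and the numeric values are illustrative assumptions, and the disclosure does not specify how disparity is computed.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Return depth in metres for a rectified stereo pair, or None if invalid.

    disparity_px: horizontal shift of the same point between the two images.
    focal_px:     focal length expressed in pixels.
    baseline_m:   distance between the two camera centres in metres.
    """
    if disparity_px is None or disparity_px <= 0:
        return None  # no match found: the pixel has no distance value
    return focal_px * baseline_m / disparity_px

# Hypothetical camera: 800 px focal length, 10 cm baseline.
row = [depth_from_disparity(d, focal_px=800.0, baseline_m=0.1) for d in (40.0, 20.0, 0.0)]
```

Pixels for which no disparity is available stay empty, which is the kind of gap the combination with the depth map T is meant to fill.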

As described above, the third term on the right side of equation (1) requires, simply put, that the distance value at a pixel be brought close to the distance values at the pixels adjacent to it. In other words, it requires smoothing of distance values between pixels. According to one example, finding the distance values that minimize the cost function of equation (1) smooths the distance values of adjacent pixels while keeping them from departing greatly from the distance values of the input depth maps. Smoothing here means, for example, smoothing the plurality of distance values contained in one subject. The distance values as a whole may be smoothed by decreasing the difference between the distance values of one pair of adjacent pixels while increasing the difference for another pair. Smoothing therefore does not necessarily reduce the difference in distance values for every pair of adjacent pixels. According to one example, the smoothing processing reduces the average of the differences between adjacent distance values (the distance values of two adjacent pixels). According to one example, for a given subject, the average difference between the adjacent distance values contained in that subject is reduced. According to another example, for the depth map as a whole, the average difference between the adjacent distance values contained in the entire depth map is reduced. According to some examples, the distance value at a pixel of the output depth map differs from every distance value at the corresponding pixels of the plurality of depth maps.
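The quantity being reduced can be made concrete with a small sketch. The disclosure does not define the metric precisely, so the mean absolute difference over horizontally and vertically adjacent pixels is assumed here, and the sample rows are made up.

```python
def mean_adjacent_difference(depth):
    """Mean absolute difference over horizontally and vertically adjacent pixels."""
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # right-hand neighbour
                total += abs(depth[r][c + 1] - depth[r][c])
                count += 1
            if r + 1 < h:  # neighbour below
                total += abs(depth[r + 1][c] - depth[r][c])
                count += 1
    return total / count if count else 0.0

naive = [[2.0, 3.0, 2.0]]      # centre pixel copied directly from another map
smoothed = [[2.0, 2.2, 2.0]]   # same row after smoothing
```

Comparing `mean_adjacent_difference(naive)` with `mean_adjacent_difference(smoothed)` gives one way to check that a combined map is smoother than a direct fill.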

Now consider the case where a certain pixel (hereinafter, the first pixel) and a pixel adjacent to it (hereinafter, the second pixel) both relate to one subject. In this case, the following (A), (B), and (C) hold.
(A) If the distance values of the first pixel and the second pixel are both obtained from the depth map P, the difference between them is small, and the smoothing effect of the third term on the right side of equation (1) is small.
(B) If the distance values of the first pixel and the second pixel are both obtained from the depth map T, the difference between them is small, and the smoothing effect of the third term on the right side of equation (1) is small.
(C) However, if one of the two distance values is obtained from the depth map P and the other from the depth map T, the difference between them can be relatively large, in which case the smoothing effect of the third term on the right side of equation (1) can be large.

For the one subject above, if the depth map T is simply used as a base and its missing portions are filled in from the depth map P, the average difference between adjacent distance values becomes large. The same applies to the depth map as a whole. When the average difference between adjacent distance values is large, the boundaries between pixels stand out unnaturally.

In contrast, the present embodiment smooths the distance values as described above, so the average difference between adjacent distance values becomes small for one subject or for the depth map as a whole. According to one example, the smoothing of distance values described above is applied to each of a plurality of subjects in the synthesized depth map. Note that the average difference between adjacent distance values can also be reduced by a calculation method other than equation (1).

When the first pixel relates to a subject and the second pixel relates to the background, the first pixel and the second pixel form an edge region. In this case, there is a considerable difference between the distance value of the first pixel and that of the second pixel, and the difference between the distance value of the subject and that of the background should be preserved. In edge regions, therefore, the denominator of the third term of equation (1) weakens the smoothing of distance values. The difference in distance values in the edge region can thus be maintained appropriately without being made excessively small.
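The effect of such a denominator can be sketched as a per-pair smoothing weight. The formula is not reproduced in this text, so the form w_1 / (|g|^M + ε), the choice of guide gradient g, and the default values below are all assumptions made for illustration.

```python
def smoothness_weight(guide_gradient, w1=0.5, eps=1e-3, m=2.0):
    """Weight applied to the neighbour-difference term for one pixel pair.

    guide_gradient: difference of the guide signal (for example image
    intensity) between the two pixels; large values indicate an edge.
    """
    return w1 / (abs(guide_gradient) ** m + eps)

flat = smoothness_weight(0.0)   # inside a subject: strong smoothing
edge = smoothness_weight(1.0)   # across an edge: smoothing is weakened
```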

According to one example, to simplify the computation, the degree of smoothing of distance values within a subject can be made equal to that in edge regions. In this case, the denominator of the third term on the right side of equation (1) can be omitted. Setting w_1 in the third term on the right side of equation (1) to a relatively small value then prevents the distance values from being smoothed excessively in edge regions. According to another example, as expressed by equation (1), the smoothing effect can be strengthened within a subject and weakened in edge regions.

The name "depth estimation device 100" contains the word "estimation" because the device does not simply combine two depth maps but estimates distance values by smoothing as described above. The depth estimation device 100 may also be called a depth map synthesis device or a depth map generation device.

The depth map P inferred from an RGB image and the depth map T obtained by a ranging sensor have been given as examples of depth maps input to the depth estimation device 100. According to other examples, different depth maps can be input to the depth estimation device 100. Non-limiting variations of the input depth maps are shown below.

[Example 1]
For example, a depth map S obtained by stereo matching and a depth map W inferred from a wide-angle camera image can be input to the depth estimation device 100. Stereo matching can calculate relatively accurate distance values except in occlusion areas. However, stereo matching can calculate distance values (depth) only over a field of view equivalent to that of the telephoto camera, and cannot calculate distance values near the edges of the wide-angle camera image. Single-camera depth estimation on the wide-angle camera, on the other hand, can estimate distance values over the entire image. That is, the number of effective pixels of the depth map S is smaller than that of the depth map W. The depth estimation device 100 can therefore create the depth map O by combining the depth map S and the depth map W according to a cost function including the following three constraints.
(Constraint 1) When a distance value corresponding to the pixel of interest exists in the depth map S, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map S.
(Constraint 2) When no distance value corresponding to the pixel of interest exists in the depth map S, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map W.
(Constraint 3) Bring the distance value of the pixel of interest close to the distance values of its adjacent pixels.
In this way, a depth map O having distance values over the entire area, that is, at all pixels, can be generated. According to one example, using a wide-angle camera together with a telephoto camera provides the imaging data needed to generate this depth map O. According to another example, the depth maps can be combined by a different method. For example, the output depth map may be synthesized by using the distance value of the depth map S where one exists and the distance value of the depth map W where the depth map S has none. According to one example, the distance values of the central portion of the output depth map can be taken from the depth map S, and those of the peripheral portion surrounding the central portion can be taken from the depth map W.

[Example 2]
In Example 2, in addition to the depth map S and the depth map W of Example 1, a depth map T obtained by a ranging sensor such as a ToF sensor is used as an input depth map. That is, three depth maps are input to the depth estimation device 100. The depth estimation device 100 then combines the three input depth maps based on their reliability. According to one example, the depth estimation device 100 can synthesize the depth map O based on a cost function including the following four constraints.
(Constraint 1) When a distance value corresponding to the pixel of interest exists in the depth map T, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map T.
(Constraint 2) When no distance value corresponding to the pixel of interest exists in the depth map T, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map S.
(Constraint 3) When no distance value corresponding to the pixel of interest exists in either the depth map T or the depth map S, bring the distance value of the pixel of interest in the depth map O close to the distance value of the depth map W.
(Constraint 4) Bring the distance value of the pixel of interest close to the distance values of its adjacent pixels.
This example assumes that the distance values of the depth map T are the most reliable, those of the depth map S the next most reliable, and those of the depth map W the least reliable. In this way, a plurality of depth maps can be combined based on their reliability to generate the depth map O. According to another example, different input depth maps can be adopted. According to another example, the depth maps can be combined by a different method. For example, the output depth map may be synthesized by using the distance value of the depth map T where one exists, the distance value of the depth map S where the depth map T has none, and the distance value of the depth map W where neither the depth map T nor the depth map S has one.
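The direct alternative described at the end of Example 2 can be sketched as a priority fill. In this sketch, `None` marks a missing distance value and the function name is hypothetical; no smoothing is applied, so this corresponds to the simple method, not to the cost-function method.

```python
def fill_by_priority(maps):
    """Combine equally sized depth maps listed from most to least reliable.

    Each map is a 2-D list in which None marks a missing distance value.
    For every pixel, the first map in the list that has a value wins.
    """
    h, w = len(maps[0]), len(maps[0][0])
    out = [[None] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for m in maps:
                if m[r][c] is not None:
                    out[r][c] = m[r][c]
                    break
    return out

T = [[1.0, None, None]]   # ranging sensor: most reliable, sparse
S = [[1.1, 2.0, None]]    # stereo matching: next most reliable
W = [[1.2, 2.1, 3.0]]     # wide-angle inference: dense, least reliable
O = fill_by_priority([T, S, W])
```

Passing only `[S, W]` gives the two-map variant described for Example 1.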

[Example 3]
As input depth maps, a depth map D obtained by dual-camera parallax estimation and a depth map T obtained by a ranging sensor such as a ToF sensor can be input to the depth estimation device 100. In dual-camera parallax estimation, occlusion areas cannot be matched and therefore generally become invalid areas. If the distance values of an invalid area are filled in based on the surrounding distance values, the invalid area becomes an area whose distance values have low reliability. In addition, where there is a repeating-pattern area or a textureless area, one point may match multiple locations, which can lower the accuracy of the resulting distance values.
That is, the depth map D can be classified into:
- Region D1 (low reliability): occlusion regions;
- Region D2 (low reliability): repeating-pattern regions or textureless regions (flat regions);
- Region D3 (high reliability): regions other than regions D1 and D2.
On the other hand, the depth map T obtained by a ranging sensor such as a ToF sensor includes:
- Region T1 (low reliability): occlusion regions caused by image alignment;
- Region T2 (low reliability): places that infrared light does not reach (for example, distant views);
- Region T3 (low reliability): repeating-pattern regions or textureless regions;
- Region T4 (high reliability): regions other than regions T1, T2, and T3.
Note that when the ToF sensor and the reference image lie in different directions relative to the base image, the occlusions appear on different sides, so the two depth maps can be said to be complementary.
In this way, the depth maps D and T each have high-reliability regions and low-reliability regions. The depth estimation device 100 can therefore synthesize the depth map O based on a cost function including the following three constraints.
(Constraint 1) When a distance value corresponding to the pixel of interest exists in region D3 or region T4, bring the distance value of the pixel of interest in the depth map O close to one of those distance values.
(Constraint 2) When no distance value corresponding to the pixel of interest exists in region D3 or region T4, bring the distance value of the pixel of interest in the depth map O close to a distance value of one of regions D1, D2, T1, T2, and T3.
(Constraint 3) Bring the distance value of the pixel of interest close to the distance values of its adjacent pixels.
As a result, the high-reliability regions of the plurality of input depth maps are reflected preferentially in the depth map O, and a highly accurate depth map can be synthesized.

In Example 3 above, each input depth map was classified into two kinds of regions: regions whose distance values have high reliability and regions whose distance values have low reliability. According to another example, one input depth map can be divided into three or more regions of differing reliability. For example, the depth map D is divided into a high-reliability region D3, a region D2 less reliable than region D3, and a region D1 less reliable than region D2. Similarly, the depth map T is divided into a high-reliability region T4, a region T3 less reliable than region T4, a region T2 less reliable than region T3, and a region T1 less reliable than region T2. According to one example, for the depth map T acquired by a ranging sensor such as a ToF sensor, a reliability map is obtained together with the distance values. For the depth map D, on the other hand, a reliability score is assigned to each region, for example by the preprocessing device. For example, data in which each pixel is assigned a distance value and a reliability score for that distance value is input to the depth estimation device 100 as an input depth map. When the depth maps D and T are the input depth maps, according to one example, the regions in descending order of reliability are T4, D3, T3, T2, T1, D2, and D1, and the depth estimation device 100 reflects them in the depth map O in that order by a method equivalent to the cost-function method described above.
By assigning a reliability to each region of the input depth maps and preferentially reflecting the more reliable regions in the output depth map, an even higher-quality depth map can be provided.
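A simplified sketch of reliability-ranked selection follows. It assumes each input pixel carries a (distance, score) pair with a numeric score, which is an illustrative encoding, and it selects values directly without the smoothing term, so it is a simplification and not the cost-function method itself. The function name and sample scores are hypothetical.

```python
def select_most_reliable(candidates):
    """Pick, per pixel, the distance value with the highest reliability score.

    `candidates` is a list of equally sized maps; each map is a 2-D list of
    (distance, score) tuples, or None where no value exists.
    """
    h, w = len(candidates[0]), len(candidates[0][0])
    out = [[None] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            valid = [m[r][c] for m in candidates if m[r][c] is not None]
            if valid:
                out[r][c] = max(valid, key=lambda v: v[1])[0]
    return out

D = [[(2.0, 0.9), (2.5, 0.2), None]]   # dual-camera map with per-pixel scores
T = [[(2.1, 0.6), (2.4, 0.8), None]]   # ToF map with per-pixel scores
O = select_most_reliable([D, T])
```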

REFERENCE SIGNS LIST
10 depth estimation system
20 camera
30 ToF sensor
40 preprocessing device
100 depth estimation device
110 acquisition unit
120 derivation unit

Claims

1. A depth estimation method comprising:
obtaining a plurality of depth maps; and
combining the plurality of depth maps to output one output depth map while reducing the average difference in distance values between adjacent pixels, compared to a case in which the distance values contained in the plurality of depth maps are used as they are.

2. The depth estimation method according to claim 1, wherein a distance value at a pixel of the output depth map differs from the distance values at the corresponding pixels of the plurality of depth maps.

3. The depth estimation method according to claim 1, wherein the plurality of depth maps includes a first depth map and a second depth map, the method comprising:
obtaining the first depth map using a ranging sensor; and
obtaining the second depth map using an imaging device.

4. The depth estimation method according to claim 1, wherein:
the plurality of depth maps includes a first depth map and a second depth map;
the first depth map is a depth map in which distance values are missing at some pixels; and
the second depth map is a depth map without missing distance values.

5. A depth estimation method comprising:
obtaining a first depth map by stereo matching;
obtaining a second depth map inferred from an image; and
combining a plurality of depth maps including the first depth map and the second depth map to output an output depth map.

6. The depth estimation method according to claim 5, wherein the number of effective pixels of the first depth map is smaller than the number of effective pixels of the second depth map.

7. The depth estimation method according to claim 5, wherein the image is captured with a wide-angle camera.

8. The depth estimation method according to claim 6, wherein distance values of a central portion of the output depth map are taken from the first depth map, and distance values of a peripheral portion surrounding the central portion of the output depth map are taken from the second depth map.

9. The depth estimation method according to claim 5, further comprising obtaining a third depth map obtained by a ToF sensor, wherein the output depth map is obtained by combining the first depth map, the second depth map, and the third depth map.

10. A depth estimation device comprising:
an acquisition unit that acquires a first depth map acquired by a ranging sensor for a measurement target area and a second depth map generated from an image of the measurement target area; and
a derivation unit that derives a third depth map from the first depth map and the second depth map according to a cost function,
wherein the cost function includes:
a first constraint that, when a distance value corresponding to a pixel of interest exists in the first depth map, brings the distance value of the pixel of interest in the third depth map close to the distance value of the first depth map;
a second constraint that, when no distance value corresponding to the pixel of interest exists in the first depth map, brings the distance value of the pixel of interest in the third depth map close to the distance value of the second depth map; and
a third constraint that brings the distance value of the pixel of interest close to the distance values of neighboring pixels of the pixel of interest.

11. The depth estimation device according to claim 10, wherein the derivation unit derives the third depth map such that the value of the cost function is minimized.

12. The depth estimation device according to claim 10, wherein the cost function is defined so as to weaken the third constraint at discontinuities of distance values in the first depth map or the second depth map.