
JP2021515939A - Monocular depth estimation method and its devices, equipment and storage media

Info

Publication number: JP2021515939A
Application number: JP2020546428A
Authority: JP
Inventors: 郭▲曉▼▲陽▼; 李▲鴻▼升; 伊▲帥▼; 任思捷; 王▲曉▼▲剛▼
Original assignee: Shenzhen Sensetime Technology Co Ltd
Current assignee: Shenzhen Sensetime Technology Co Ltd
Priority date: 2018-05-22
Filing date: 2019-02-27
Publication date: 2021-06-24
Anticipated expiration: 2039-02-27
Also published as: SG11202008787UA; CN108961327A; CN108961327B; JP7106665B2; WO2019223382A1

Abstract

An embodiment of the present application provides a monocular depth estimation method comprising: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image, wherein the monocular depth estimation network model is obtained by supervised training using disparity maps output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed. Embodiments of the present application also provide a monocular depth estimation device, equipment, and a storage medium.

Description

(Cross-reference to related applications)
This application is based on, and claims priority to, Chinese patent application No. 201810496541.6, filed on May 22, 2018, the entire disclosure of which is incorporated herein by reference.

The embodiments of the present application relate to the field of artificial intelligence, and in particular to a monocular depth estimation method and a device, equipment, and storage medium therefor.

Monocular depth estimation is an important problem in computer vision. Its concrete task is to predict the depth of every pixel in an image; the image formed by the depth values of all pixels is called a depth map. Monocular depth estimation matters for obstacle detection in autonomous driving, three-dimensional scene reconstruction, and stereoscopic scene analysis. It can also indirectly improve the performance of other computer vision tasks such as object detection, target tracking, and target recognition.

The current problem is that training a neural network for monocular depth estimation requires a large amount of labeled data, and labeled data is expensive to acquire. In outdoor environments, labeled data can be acquired with lidar (laser radar), but such data is very sparse, and a monocular depth estimation network trained on it lacks sharp edges and cannot capture accurate depth information for small objects.

The embodiments of the present application provide a monocular depth estimation method and a device, equipment, and storage medium therefor.

The technical solutions of the embodiments of the present application are implemented as follows.

An embodiment of the present application provides a monocular depth estimation method comprising: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image, wherein the monocular depth estimation network model is obtained by supervised training using disparity maps output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed.

An embodiment of the present application provides a monocular depth estimation device comprising: an acquisition module configured to acquire an image to be processed; an execution module configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image, wherein the monocular depth estimation network model is obtained by supervised training using disparity maps output by a first binocular matching neural network model; and an output module configured to output the analysis result of the image to be processed.

An embodiment of the present application provides monocular depth estimation equipment comprising a processor and a memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the monocular depth estimation method provided by the embodiments of the present application.

An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the monocular depth estimation method provided by the embodiments of the present application.

In the embodiments of the present application, an image to be processed is acquired, the image is input into a monocular depth estimation network model obtained by supervised training using disparity maps output by a first binocular matching neural network model, an analysis result of the image is obtained, and the analysis result is output. The monocular depth estimation network can thus be trained with less data labeled with depth maps, or with none at all. The embodiments also provide a more efficient, unsupervised way of fine-tuning the binocular disparity network, which indirectly improves the quality of monocular depth estimation.

A first flowchart of the monocular depth estimation method according to an embodiment of the present application.
A schematic diagram of depth estimation for a single image according to an embodiment of the present application.
A schematic diagram of training the second binocular matching neural network model according to an embodiment of the present application.
A schematic diagram of training the monocular depth estimation network model according to an embodiment of the present application.
A schematic diagram of images related to the loss function according to an embodiment of the present application.
A second flowchart of the monocular depth estimation method according to an embodiment of the present application.
A schematic diagram of the effect of the loss function according to an embodiment of the present application.
A schematic diagram of visualized depth estimation results according to an embodiment of the present application.
A schematic structural diagram of the monocular depth estimation device according to an embodiment of the present application.
A schematic diagram of the hardware entity of the monocular depth estimation equipment according to an embodiment of the present application.

To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the specific technical solutions of the application are described in more detail below with reference to the drawings of the embodiments. The following embodiments serve to explain the present application and do not limit its scope.

In the description that follows, suffixes used to denote elements, such as "module", "component", or "unit", are used only to aid the description of the present application and have no specific meaning in themselves. "Module", "component", and "unit" may therefore be used interchangeably.

In general, when a deep neural network is used to predict the depth map of a single image, three-dimensional modeling of the scene shown in the image can be performed from that one image alone, yielding the depth of every pixel. The monocular depth estimation method provided by the embodiments of the present application is obtained by training a neural network. The training data comes from disparity maps output by binocular matching, so no expensive depth acquisition equipment such as lidar is required. The binocular matching algorithm that supplies the training data is also implemented as a neural network. That network achieves good results when pre-trained only on a large number of virtual binocular image pairs rendered by a rendering engine, and can achieve better results when further fine-tuned on real data.

The technical solutions of the present application are described further below with reference to the drawings and embodiments.

An embodiment of the present application provides a monocular depth estimation method for use in a computing device. The functions of the method may be implemented by a processor in a server calling program code, and the program code may of course be stored in a computer storage medium; the server therefore includes at least a processor and a storage medium. FIG. 1A is a first flowchart of the monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 1A, the method includes the following steps.

In step S101, an image to be processed is acquired.

Here, the image to be processed may be acquired by a mobile terminal and may include an image of any scene. In implementation, the mobile terminal may be any of various types of devices with information processing capability, for example a mobile phone, a personal digital assistant (PDA), a navigator, a digital telephone, a video telephone, a smart watch, a smart bracelet, a wearable device, or a tablet. In implementation, the server may be a computing device with information processing capability, for example a mobile terminal such as a mobile phone, tablet, or laptop, or a fixed terminal such as a personal computer or a server cluster.

In step S102, the image to be processed is input into a monocular depth estimation network model obtained by supervised training using disparity maps output by a first binocular matching neural network model, and an analysis result of the image is obtained.

In the embodiments of the present application, the monocular depth estimation network model is obtained mainly through three steps. In the first step, a binocular matching neural network is pre-trained on synthetic binocular data rendered by a rendering engine. In the second step, the binocular matching neural network from the first step is fine-tuned on real-scene data. In the third step, the binocular matching neural network from the second step is used to supervise a monocular depth estimation network, which is trained in this way. In the prior art, monocular depth estimation is generally trained either on a large amount of labeled real data or with unsupervised methods. However, a large amount of labeled real data is costly to acquire, and a monocular depth estimation network trained directly with an unsupervised method cannot handle depth estimation in occluded regions, so its results are poor. In the present application, by contrast, the sample data for the monocular depth estimation network model comes from the disparity maps output by the first binocular matching neural network model; that is, the present application performs monocular depth prediction using binocular disparity. The method therefore does not require a large amount of labeled data and still achieves a good training result.

In step S103, the analysis result of the image to be processed is output. Here, the analysis result means the depth map corresponding to the image to be processed. After the image to be processed is acquired, it is input into the trained monocular depth estimation network model. Because this model generally outputs the disparity map corresponding to the image rather than a depth map, the corresponding depth map must additionally be determined from the disparity map output by the model, the lens baseline length of the camera that captured the image, and the lens focal length of that camera.

FIG. 1B is a schematic diagram of depth estimation for a single image according to an embodiment of the present application. As shown in FIG. 1B, image 11 is the image to be processed, and image 12 is the depth map corresponding to image 11.

In practice, the depth map corresponding to the image to be processed may be determined as the ratio of the product of the lens baseline length and the lens focal length to the output disparity map corresponding to the image.
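The ratio described above can be sketched per pixel as follows. This is a minimal illustration rather than the applicant's implementation: the function name `disparity_to_depth`, the units (metres for the baseline, pixels for focal length and disparity), and the choice to map non-positive disparities to infinite depth are all assumptions made for the example.

```python
def disparity_to_depth(disparity, baseline_m, focal_px, min_disparity=1e-6):
    """Convert a disparity map (pixels) into a depth map (metres).

    Applies depth = baseline * focal_length / disparity to every pixel.
    Pixels whose disparity is at or below min_disparity are given infinite
    depth (an assumption of this sketch, to avoid division by zero).
    """
    depth = []
    for row in disparity:
        depth.append([
            baseline_m * focal_px / d if d > min_disparity else float("inf")
            for d in row
        ])
    return depth


# Hypothetical 2x2 disparity map, 0.5 m baseline, 640 px focal length.
disparity_map = [[64.0, 32.0], [16.0, 0.0]]
depth_map = disparity_to_depth(disparity_map, baseline_m=0.5, focal_px=640.0)
print(depth_map)
```

Larger disparities correspond to nearer surfaces, so halving the disparity doubles the estimated depth.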

Based on the method embodiment above, an embodiment of the present application further provides a monocular depth estimation method, which includes the following steps.

In step S111, synthesized binocular images with depth labels, each including a synthesized left image and a synthesized right image, are acquired as synthetic sample data.

In some embodiments, the method further includes: step S11, constructing a virtual 3D scene with a rendering engine; step S12, mapping the 3D scene to a binocular image through two virtual cameras; step S13, obtaining depth data for the synthesized binocular image from the position and the orientation used when constructing the virtual 3D scene and the lens focal length of the virtual cameras; and step S14, labeling the binocular image with the depth data to obtain the synthesized binocular image.

In step S112, a second binocular matching neural network model is trained on the acquired synthetic sample data.

In practice, step S112 may be implemented by the following step. Step S1121: the second binocular matching neural network model is trained on the synthesized binocular images to obtain a trained second binocular matching neural network model whose outputs are a disparity map and an occlusion map. Here, the disparity map expresses, in pixels, the disparity distance between each pixel in the left image and its corresponding pixel in the right image, and the occlusion map expresses whether the pixel in the right image that corresponds to each pixel in the left image is occluded by an object.
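To make the relationship between the two outputs concrete, the sketch below derives an occlusion map from a left-image disparity map for a rectified pair. It is an illustration under stated assumptions, not the network described in the source: it assumes a left pixel at column x matches right column x - d, that when several left pixels land on the same right column only the one with the largest disparity (the nearest surface) is visible, and that pixels falling outside the right image count as occluded. The function name `occlusion_from_disparity` is hypothetical.

```python
def occlusion_from_disparity(disparity):
    """Return a 0/1 map: 1 where a left-image pixel has no visible match on the right."""
    occlusion = []
    for row in disparity:
        width = len(row)
        nearest = {}   # right column -> largest disparity that lands on it
        targets = []   # right column that each left pixel maps to
        for x, d in enumerate(row):
            xr = int(round(x - d))
            targets.append(xr)
            if 0 <= xr < width and d > nearest.get(xr, float("-inf")):
                nearest[xr] = d
        occ_row = []
        for x, d in enumerate(row):
            xr = targets[x]
            # Visible only if inside the right image and not hidden by a nearer pixel.
            visible = 0 <= xr < width and d == nearest[xr]
            occ_row.append(0 if visible else 1)
        occlusion.append(occ_row)
    return occlusion


# One image row: background at disparity 0 with a foreground object at disparity 2.
row_disparity = [[0.0, 0.0, 0.0, 2.0, 2.0, 0.0]]
occlusion_map = occlusion_from_disparity(row_disparity)
print(occlusion_map)
```

In the example row, the foreground object shifts two pixels to the left in the right image and hides the two background pixels just to its left, which are therefore flagged as occluded.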

FIG. 1C is a schematic diagram of training the second binocular matching neural network model according to an embodiment of the present application. As shown in FIG. 1C, image 11 is the left image of the synthesized binocular image and image 12 is the right image of the synthesized binocular image; [symbol not reproduced] denotes the pixel values of all pixels in left image 11, and [symbol not reproduced] denotes the pixel values of all pixels in right image 12. Image 13 is the occlusion map output by the trained second binocular matching neural network model, image 14 is the disparity map output by the trained second binocular matching neural network model, and reference numeral 15 denotes the second binocular matching neural network model.

In step S113, the parameters of the trained second binocular matching neural network model are adjusted on the acquired real sample data to obtain the first binocular matching neural network model.

Step S113 may be implemented in two ways. The first implementation uses the following step. Step S1131a: supervised training of the trained second binocular matching neural network model is performed on acquired real binocular data with depth labels, thereby adjusting the weights of the trained second binocular matching neural network model and obtaining the first binocular matching neural network model. Because the acquired data is real binocular data with depth labels, it can be used directly for supervised training of the second binocular matching neural network trained in step S112, which adjusts its weights, further improves its performance, and yields the first binocular matching neural network model. In this part, the binocular disparity network needs to adapt to real data: real binocular data with depth labels may be used to fine-tune the binocular disparity network directly by supervised training and so adjust the network weights. The second implementation uses the following step. Step S1131b: unsupervised training of the trained second binocular matching neural network model is performed on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model and obtaining the first binocular matching neural network model. Unsupervised training here means training on binocular data alone, without depth labels; the process may be implemented with an unsupervised fine-tuning method.
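The passage above does not say how the unsupervised training signal is computed at this point; the later discussion of the loss function names a reconstruction error as one of its terms. The sketch below shows one common way to compute such an error for a rectified pair: reconstruct the left image by sampling the right image at column x - d with linear interpolation, then take the mean absolute difference. The function names, the clamping at image borders, and the use of an L1 difference are assumptions of this example and are not taken from the source.

```python
import math


def warp_right_to_left(right_row, disparity_row):
    """Reconstruct one left-image row by sampling the right row at x - d."""
    width = len(right_row)
    out = []
    for x, d in enumerate(disparity_row):
        xs = min(max(x - d, 0.0), width - 1.0)  # clamp sample position to the image
        x0 = int(math.floor(xs))
        x1 = min(x0 + 1, width - 1)
        w = xs - x0                              # linear interpolation weight
        out.append((1.0 - w) * right_row[x0] + w * right_row[x1])
    return out


def reconstruction_error(left, right, disparity):
    """Mean absolute difference between the left image and its reconstruction."""
    total, count = 0.0, 0
    for l_row, r_row, d_row in zip(left, right, disparity):
        recon = warp_right_to_left(r_row, d_row)
        total += sum(abs(a - b) for a, b in zip(l_row, recon))
        count += len(l_row)
    return total / count


# Toy single-row images in which the true disparity is 1 pixel everywhere.
right_image = [[10.0, 20.0, 30.0, 40.0]]
left_image = [[10.0, 10.0, 20.0, 30.0]]
good = reconstruction_error(left_image, right_image, [[1.0, 1.0, 1.0, 1.0]])
bad = reconstruction_error(left_image, right_image, [[0.0, 0.0, 0.0, 0.0]])
print(good, bad)
```

A disparity map that matches the scene gives a low error and a wrong one gives a higher error, which is what lets a network be tuned on binocular images alone.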

In step S114, the monocular depth estimation network model is supervised by the disparity maps output by the first binocular matching neural network model, and is trained in this way.

Step S114 may be implemented in two ways. The first implementation uses the following steps. Step S1141a: the left image or the right image of the real binocular data with depth labels, which includes a left image and a right image, is acquired as a training sample. Step S1142a: the monocular depth estimation network model is trained on the left image or the right image of the real binocular data with depth labels. When a deep neural network is used to predict the depth map of a single image, three-dimensional modeling of the scene shown in the image can be performed from that one image alone, yielding the depth of every pixel. The monocular depth estimation network model may therefore be trained on the left image or the right image of the real binocular data with depth labels, where this data is the real binocular data with depth labels used in step S1131a.

The second implementation uses the following steps. Step S1141b: the real binocular data without depth labels, which includes a left image and a right image, is input into the first binocular matching neural network model to obtain the corresponding disparity map. Step S1142b: the depth map corresponding to the disparity map is determined from the disparity map, the lens baseline length of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera. Step S1143b: the left image or the right image of the real binocular data without depth labels is used as sample data, and the monocular depth estimation network model is supervised by the depth map corresponding to the disparity map and trained in this way. As above, a deep neural network can predict the depth of every pixel from a single image. Therefore, the left image or the right image of the real binocular data without depth labels used in step S1131b, which is also the data used in step S1141b, is taken as sample data, the monocular depth estimation network model is supervised by the depth map corresponding to the disparity map output in step S1141b, and the trained monocular depth estimation network model is obtained.
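Steps S1141b to S1143b amount to using the binocular network as a teacher that produces pseudo labels for the monocular network. The sketch below illustrates only that data flow. The teacher and student are stand-in callables, and the names `make_pseudo_labels` and `l1_loss`, the L1 training loss, and the assumption of strictly positive teacher disparities are all hypothetical choices for the example, not details from the source.

```python
def make_pseudo_labels(stereo_pairs, teacher, baseline_m, focal_px):
    """Turn unlabeled (left, right) pairs into (left image, pseudo depth map) samples."""
    samples = []
    for left, right in stereo_pairs:
        disparity = teacher(left, right)                 # S1141b: teacher disparity map
        depth = [[baseline_m * focal_px / d for d in row]  # S1142b: disparity -> depth
                 for row in disparity]                   # (assumes d > 0 everywhere)
        samples.append((left, depth))                    # S1143b: left image is the input
    return samples


def l1_loss(predicted, target):
    """Mean absolute error between two maps of equal shape."""
    diffs = [abs(p - t)
             for p_row, t_row in zip(predicted, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)


# Stand-in teacher: always predicts a disparity of 8 pixels (illustration only).
def toy_teacher(left, right):
    return [[8.0 for _ in row] for row in left]


# Stand-in student: always predicts a depth of 38 metres (illustration only).
def toy_student(left):
    return [[38.0 for _ in row] for row in left]


pairs = [([[0.1, 0.2]], [[0.2, 0.3]])]
samples = make_pseudo_labels(pairs, toy_teacher, baseline_m=0.5, focal_px=640.0)
image, pseudo_depth = samples[0]
loss = l1_loss(toy_student(image), pseudo_depth)
print(pseudo_depth, loss)
```

In a real training loop the loss would drive gradient updates of the student network; only the single left image is ever shown to the student, which is what makes the resulting model monocular.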

FIG. 1D is a schematic diagram of training the monocular depth estimation network model according to an embodiment of the present application. As shown in FIG. 1D, part (a) shows real binocular data without depth labels being input into the first binocular matching neural network model to obtain the corresponding disparity map 13, where the real binocular data without depth labels includes left image 11 and right image 12, and reference numeral 15 denotes the first binocular matching neural network model. Part (b) of FIG. 1D shows the left image or the right image of the real binocular data without depth labels being used as sample data, with the monocular depth estimation network model supervised by the depth map corresponding to disparity map 13 and trained in this way, where the output of the monocular depth estimation network model for the sample data is disparity map 14, and reference numeral 16 denotes the monocular depth estimation network model.

In step S115, an image to be processed is acquired.

Here, once the trained monocular depth estimation network model has been obtained, it can be put to use; that is, it can be used to obtain the depth map corresponding to the image to be processed.

In step S116, the image to be processed is input into the monocular depth estimation network model, which is obtained through supervised training with the disparity map output by the first binocular matching neural network model, to obtain an analysis result of the image to be processed.

In step S117, the analysis result of the image to be processed is output, the analysis result including the disparity map output by the monocular depth estimation network model.

In step S118, the depth map corresponding to the disparity map is determined from the disparity map output by the monocular depth estimation network model, the lens baseline length of the camera that captured the image input into the monocular depth estimation network model, and the lens focal length of that camera.

In step S119, the depth map corresponding to the disparity map is output.

In step S121, synthesized binocular images with depth labels, each including a synthesized left image and a synthesized right image, are acquired as synthetic sample data.

In step S122, the second binocular matching neural network model is trained on the acquired synthetic sample data.

Here, a second binocular matching neural network model trained on synthetic data can exhibit higher generalization ability.

In step S123, the loss function is determined using equation (1).

[Equation (1)]

Here, the quantities defined for equation (1) are, respectively: the loss function provided by the embodiments of the present application; the reconstruction error; a term requiring that the disparity map output by the first binocular matching network model deviate only slightly from the disparity map output by the trained second binocular matching network model; a term constraining the output gradient of the first binocular matching network model to be consistent with the output gradient of the trained second binocular matching network model; and the intensity coefficient. A further symbol denotes a regularization term.

In some embodiments, equation (1) of step S123 may be further broken down by the equations of the following steps. That is, the method further includes the following.

In step S1231, the reconstruction error is determined using equation (2) or equation (3).

[Equations (2) and (3)]

Here, the quantities defined for equations (2) and (3) are, respectively: the number of pixels in the image; the pixel value of the occlusion map output by the trained second binocular matching network model; the pixel value of the left image of the real binocular data without depth labels; the pixel value of the right image of the real binocular data without depth labels; the pixel value of the image synthesized by sampling the right image, that is, the reconstructed left image; the pixel value of the image synthesized by sampling the left image, that is, the reconstructed right image; the pixel value of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels; the pixel value of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels; the pixel coordinates of a pixel; a symbol denoting the output of the trained second binocular matching network model; a symbol denoting the right image or data related to the right image; a symbol denoting the left image or data related to the left image; and a symbol denoting the RGB (red, green and blue) values of an image pixel.

In step S1232, equation (4) or equation (5) is used to determine the term requiring that the disparity map output by the first binocular matching network model deviate only slightly from the disparity map output by the trained second binocular matching network model.

[Equations (4) and (5)]

Here, the quantities defined for equations (4) and (5) are, respectively: the pixel value of the disparity map output by the trained second binocular matching network for the left image of the sample data; the pixel value of the disparity map output by the trained second binocular matching network for the right image of the sample data; the pixel value of the disparity map output by the first binocular matching network for the left image of the real binocular data without depth labels; the pixel value of the disparity map output by the first binocular matching network for the right image of the real binocular data without depth labels; and the intensity coefficient.

In step S1233, equation (6) or equation (7) is used to determine the term requiring that the output gradient of the first binocular matching network model be consistent with the output gradient of the second binocular matching network model.

[Equations (6) and (7)]

Here, the quantities defined for equations (6) and (7) are, respectively: the gradient of the disparity map output by the first binocular matching network for the left image of the real binocular data without depth labels; the gradient of the disparity map output by the first binocular matching network for the right image of the real binocular data without depth labels; the gradient of the disparity map output by the trained second binocular matching network for the left image of the sample data; the gradient of the disparity map output by the trained second binocular matching network for the right image of the sample data; and a symbol denoting the left image or data related to the left image.
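The way the regularization terms of steps S1232 and S1233 combine with the reconstruction error can be sketched as follows. This is only an illustration under stated assumptions: the exact forms of equations (1) and (4) to (7) are not reproduced here, so the sketch assumes mean absolute differences on a single scanline, and the weights `gamma_dev` and `gamma_grad` are hypothetical stand-ins for the intensity coefficients.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def horizontal_gradient(row):
    """Forward difference along one scanline of a disparity map."""
    return [row[i + 1] - row[i] for i in range(len(row) - 1)]


def regularized_loss(reconstruction_error, student_disp, teacher_disp,
                     gamma_dev=0.1, gamma_grad=0.2):
    """Reconstruction error plus two regularizers (assumed forms).

    student_disp: disparities from the network being fine-tuned
                  (the first binocular matching network).
    teacher_disp: disparities from the frozen, synthetically trained
                  second binocular matching network.
    gamma_dev, gamma_grad: hypothetical weighting coefficients.
    """
    deviation = mean_abs_diff(student_disp, teacher_disp)
    gradient = mean_abs_diff(horizontal_gradient(student_disp),
                             horizontal_gradient(teacher_disp))
    return reconstruction_error + gamma_dev * deviation + gamma_grad * gradient


teacher = [10.0, 10.0, 12.0, 12.0]
student = [11.0, 11.0, 13.0, 13.0]  # constant offset, identical gradients
loss = regularized_loss(0.5, student, teacher)
```

In this toy case the student differs from the teacher by a constant offset, so only the deviation term contributes; the gradient term is zero because both scanlines have the same edge structure.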

In step S124, using the loss function (Loss), unsupervised training of the trained second binocular matching neural network model is performed on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

Here, the loss function (Loss) uses the output of the second binocular matching neural network trained in step S122 to regularize the fine-tuning. This avoids the unclear, blurred predictions that are widespread in prior-art unsupervised fine-tuning and improves the first binocular matching network obtained by fine-tuning, which in turn indirectly improves the monocular depth network obtained under the supervision of the first binocular matching network.

FIG. 1E is a schematic diagram of images related to the loss function according to an embodiment of the present application. As shown in FIG. 1E, panel (a) is the left image of the real binocular data without depth labels; panel (b) is the right image of the real binocular data without depth labels; panel (c) is the disparity map output after the real binocular image without depth labels, formed from panels (a) and (b), is input into the trained second binocular matching neural network model; panel (d) is the image obtained by sampling the right image of panel (b) and combining it with the disparity map of panel (c) to reconstruct the left image; panel (e) is the image obtained by taking the difference between each pixel of the left image of panel (a) and the corresponding pixel of the reconstructed left image of panel (d), that is, the reconstruction error map of the left image; and panel (f) is the occlusion map output after the real binocular image without depth labels, formed from panels (a) and (b), is input into the trained second binocular matching neural network model. Every red box 11 in panel (d) marks a part of the reconstructed left image that differs from the real left image of panel (a), and every red box 12 in panel (e) marks a part of the reconstruction error map where an error is present, that is, an occluded part.

When the unsupervised fine-tuning of the binocular disparity network described in step S124 is carried out, the left image must be reconstructed from the right image. Because regions where occlusion exists cannot be reconstructed correctly, the occlusion map is used to remove the erroneous training signal from those regions, which improves the unsupervised fine-tuning.
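The role of the occlusion map in the reconstruction error can be illustrated on a single scanline. The following is a minimal sketch, not the patent's equation (2): it assumes that a left pixel at column x corresponds to the right pixel at column x minus its disparity, uses linear interpolation with border clamping, and treats the mask value 1 as "not occluded" and 0 as "occluded". All function names are hypothetical.

```python
def sample_row(row, x):
    """Linearly interpolate row at fractional position x, clamped to the borders."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(x)
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1.0 - w) * row[x0] + w * row[x1]


def reconstruct_left_row(right_row, left_disparity_row):
    """Rebuild the left scanline by sampling the right scanline at x - d."""
    return [sample_row(right_row, x - d)
            for x, d in enumerate(left_disparity_row)]


def masked_reconstruction_error(left_row, right_row, left_disparity_row, mask_row):
    """Mean absolute reconstruction error, counting only pixels whose mask is 1."""
    rec = reconstruct_left_row(right_row, left_disparity_row)
    weighted = sum(m * abs(l - r) for m, l, r in zip(mask_row, left_row, rec))
    return weighted / len(left_row)


right_row = [1.0, 2.0, 3.0, 4.0, 5.0]
left_row = [9.0, 1.0, 2.0, 3.0, 4.0]   # column 0 has no counterpart in the right row
disparity = [1.0, 1.0, 1.0, 1.0, 1.0]

masked = masked_reconstruction_error(left_row, right_row, disparity, [0, 1, 1, 1, 1])
unmasked = masked_reconstruction_error(left_row, right_row, disparity, [1, 1, 1, 1, 1])
```

With the mask applied, the pixel that cannot be reconstructed contributes nothing and the error is zero even though the disparities are correct everywhere. Without the mask, the same correct disparities are penalized, which is the erroneous training signal the occlusion map is meant to remove.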

In step S125, the monocular depth estimation network model is supervised by the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.

Here, the sample image for the monocular depth estimation network model may be the left image of the real binocular data without depth labels, or the right image of the real binocular data without depth labels. When the left image is used as the sample image, the loss function is determined using equations (1), (2), (4) and (6); when the right image is used as the sample image, the loss function is determined using equations (1), (3), (5) and (7).

In the embodiments of the present application, the step of supervising the monocular depth estimation network model by the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model, means supervising the monocular depth estimation network model by the depth map corresponding to the disparity map output by the first binocular matching neural network model, that is, providing supervision information, thereby training the monocular depth estimation network model.

In step S126, an image to be processed is acquired.

In step S127, the image to be processed is input into the monocular depth estimation network model, which is obtained through supervised training with the disparity map output by the first binocular matching neural network model, to obtain an analysis result of the image to be processed.

In step S128, the analysis result of the image to be processed is output, the analysis result including the disparity map output by the monocular depth estimation network model.

In step S129, the depth map corresponding to the disparity map is determined from the disparity map output by the monocular depth estimation network model, the lens baseline length of the camera that captured the image input into the monocular depth estimation network model, and the lens focal length of that camera.

In step S130, the depth map corresponding to the disparity map is output.

In the embodiments of the present application, when the image to be processed is a street-scene image, the trained monocular depth estimation network model can be used to predict the depth of the street-scene image.

Based on the above method embodiments, an embodiment of the present application further provides a monocular depth estimation method. FIG. 2A is a second flowchart of an implementation of the monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 2A, the method includes the following.

In step S201, a binocular matching network is trained using synthetic data rendered by a rendering engine, to obtain disparity maps of binocular images.

Here, the input of the binocular matching network is a pair of binocular images (a left image and a right image), and its output is a disparity map and an occlusion map. The disparity map expresses, in pixels, the disparity between each pixel of the left image and the corresponding pixel of the right image. The occlusion map expresses whether the pixel in the right image corresponding to each pixel of the left image is occluded by another object: because of the change in viewing angle, some regions of the left image are occluded by other objects in the right image, and the occlusion map labels whether each left-image pixel is occluded in the right image. In this part, the binocular matching network is trained on synthetic data generated by a computer rendering engine. First, a number of virtual 3D scenes are built with the rendering engine; then two virtual cameras map each 3D scene to a pair of binocular images, yielding the synthetic data. Since accurate depth data and data such as the camera focal length are also available from the rendering engine, the binocular matching network can be trained directly on these labeled data in a supervised manner.

In step S202, using the loss function, the binocular matching network obtained in step S201 is fine-tuned on real binocular image data by an unsupervised fine-tuning method.

In this part, the binocular disparity network needs to adapt to real data, even though it is trained without supervision on real binocular data that has no depth labels. Unsupervised training here means training with binocular data alone, in the absence of depth labels. The embodiments of the present application provide a new unsupervised fine-tuning method, namely unsupervised fine-tuning using the loss function of the above embodiment. The main purpose of the loss function provided by the embodiments of the present application is to fine-tune the binocular disparity network on real binocular data without degrading what was learned in pre-training; during fine-tuning, the preliminary output of the pre-trained binocular disparity network obtained in step S201 provides guidance and regularization.

FIG. 2B is a schematic diagram of the effect of the loss function according to an embodiment of the present application. As shown in FIG. 2B, image 21 is the disparity map obtained with a prior-art loss function, and image 22 is the disparity map obtained with the loss function provided by the embodiments of the present application. The prior-art loss function does not treat occluded regions separately and also optimizes the image reconstruction error of occluded regions toward zero, which produces incorrect disparity predictions in occluded regions and blurs the edges of the disparity map. In contrast, the loss function of the present application uses the occlusion map to remove the erroneous training signal from those regions, improving the unsupervised fine-tuning.

In step S203, the binocular matching network obtained in step S202 is used to supervise monocular depth estimation on real data, finally yielding the monocular depth estimation network. Here, the input of the monocular depth estimation network is a single monocular image, and its output is a depth map. Step S202 yields a binocular disparity network fine-tuned on real data. For each pair of binocular images, the binocular disparity network predicts a disparity map, and from the disparity map $D$, the binocular lens baseline length $b$ and the lens focal length $f$, the depth map $d$ corresponding to the disparity map can be calculated by equation (8):

$$d = \frac{b \cdot f}{D} \tag{8}$$

To train the monocular depth network to predict depth maps, the left image of each binocular image pair may be used as the input of the monocular depth network, and the depth map calculated from the output of the binocular disparity network may be used as supervision, thereby training the monocular depth network and obtaining the final result. In practical applications, a depth estimation module for unmanned driving can be trained by the monocular depth estimation method of the embodiments of the present application and used for three-dimensional scene reconstruction or obstacle detection. In addition, the unsupervised fine-tuning method provided by the embodiments of the present application improves the performance of the binocular disparity network.
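The conversion in step S203 from a disparity map to a depth map (depth equals baseline length times focal length divided by disparity) can be sketched as below. This is a minimal illustration: the function name, the nested-list representation of the map, the `eps` guard and the choice to return infinity for zero disparity are assumptions, and the focal length is assumed to be expressed in pixels so that the depth comes out in the units of the baseline.

```python
def disparity_to_depth(disparity, baseline, focal_length, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map via d = b * f / D.

    disparity: list of rows of disparity values in pixels.
    baseline: distance b between the two lens centres (for example in metres).
    focal_length: focal length f in pixels.
    Pixels whose disparity is at or below eps get depth infinity.
    """
    depth = []
    for row in disparity:
        depth.append([
            baseline * focal_length / d if d > eps else float("inf")
            for d in row
        ])
    return depth


D = [[10.0, 20.0], [40.0, 0.0]]
depth = disparity_to_depth(D, baseline=0.5, focal_length=720.0)
```

Doubling the disparity halves the depth, so nearby objects (large disparity) are resolved much more finely than distant ones.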

In the prior art, supervised monocular depth estimation methods can obtain only a very limited amount of accurately labeled data, and obtaining it is very difficult. The performance of unsupervised methods based on reconstruction error is usually limited by the ambiguity of pixel matching. To solve these problems, the embodiments of the present application provide a new monocular depth estimation method that overcomes the limitations of prior-art supervised and unsupervised depth estimation methods. The method of the embodiments trains a binocular matching network on cross-modal synthetic data and uses it to supervise a monocular depth estimation network. Because the binocular matching network obtains disparity from the pixel matching relationship between the left and right images rather than from semantic features, it generalizes effectively from synthetic data to real data.

The method of the embodiments of the present application mainly comprises three steps. First, a binocular matching network is trained on synthetic data to predict occlusion maps and disparity maps from binocular images. Second, the trained binocular matching network is optionally fine-tuned, with or without supervision, on the available real data. Third, the monocular depth estimation network is trained under the supervision of the binocular matching network obtained in the second step, which has been fine-tuned on real data. In this way, by going through the binocular matching network, synthetic data can be exploited more effectively for monocular depth estimation, improving performance.

The first step, training the binocular matching network on synthetic data, is as follows. At present, graphics rendering engines can generate large numbers of synthetic images with depth information. However, because monocular depth estimation is very sensitive to the semantic information of the input scene, training a monocular depth estimation network by directly merging such synthetic image data with real data usually performs poorly. The large modality gap between synthetic data and real data renders auxiliary training on synthetic data ineffective. A binocular matching network, however, has stronger generalization ability: one trained on synthetic data can output good disparity maps on real data as well. The embodiments of the present application therefore use binocular matching network training as a bridge between synthetic data and real data to improve the performance of monocular depth training.

First, the binocular matching network is pre-trained on a large amount of synthetic binocular data. Unlike conventional structures, the binocular matching network of the embodiments estimates multi-scale occlusion maps in addition to the disparity map. Here, an occlusion map indicates, in the correct image, whether the pixel in the right image corresponding to a pixel of the left image is occluded by another object. In the next step, the occlusion map is used by the unsupervised fine-tuning method to avoid erroneous estimates. An occlusion map with accurate labels may be obtained from the accurately labeled disparity map by a left-right disparity consistency check, using equation (9).

[Equation (9)]

Here, one subscript denotes the row index of a value in the image and the other subscript denotes its column index. The remaining symbols of equation (9) denote, respectively, the disparity maps of the left and right images, and the disparity map of the left image reconstructed from the right image. In non-occluded regions, the left disparity map agrees with the disparity map of the left image reconstructed from the right image. The threshold of the consistency check is 1. The occlusion map is 0 in occluded regions and 1 in non-occluded regions. Accordingly, this embodiment uses equation (10) to compute the loss (Loss) for training the binocular matching network on synthetic data.

[Equation (10)]

At this stage, the loss function consists of two parts: the disparity map estimation error and the occlusion map estimation error. Disparity and occlusion predictions are also produced at the multi-scale intermediate layers of the binocular disparity network, and they are applied directly with the loss weights of the multi-scale predictions. The remaining symbols of equation (10) denote, respectively, the disparity map estimation error of each layer, the occlusion map estimation error of each layer, and the layer index. To train the disparity map, an L1 loss function is adopted to avoid the influence of outliers and make the training process more robust. To train the occlusion map, the occlusion map estimation error is expressed by equation (11), and the occlusion map is trained as a classification task with a binary cross-entropy loss.

[Equation (11)]

Here, the symbols of equation (11) denote, respectively, the total number of pixels in the image, the occlusion map with accurate labels, and the occlusion map output by the trained binocular matching network.
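The left-right consistency check and the binary cross-entropy loss for the occlusion map can be sketched on one scanline as follows. This is an illustration under assumptions rather than a transcription of equations (9) and (11): it uses nearest-pixel lookup instead of interpolation, assumes a left pixel at column x matches the right pixel at column x minus its disparity, labels pixels that fall outside the right image as occluded, and clamps predicted probabilities by a hypothetical `eps` to keep the logarithm finite. The threshold of 1 and the 0 (occluded) / 1 (non-occluded) convention follow the text.

```python
import math


def occlusion_row_from_disparity(left_disp, right_disp, threshold=1.0):
    """Label each left pixel 1 (not occluded) or 0 (occluded).

    A pixel is kept as non-occluded when its left disparity agrees, within
    the threshold, with the right disparity found at its matching column.
    """
    labels = []
    for x, d in enumerate(left_disp):
        xr = int(round(x - d))
        if xr < 0 or xr >= len(right_disp):
            labels.append(0)  # assumption: out-of-view pixels count as occluded
            continue
        labels.append(1 if abs(d - right_disp[xr]) <= threshold else 0)
    return labels


def binary_cross_entropy(target, predicted, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(target, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(target)


left_disp = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0]
right_disp = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
labels = occlusion_row_from_disparity(left_disp, right_disp)
bce = binary_cross_entropy([1, 0], [0.5, 0.5])
```

In the example, the two rightmost left pixels carry a disparity of 3 but land on right pixels whose disparity is 1, so the check fails and they are labeled occluded.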

The second step, training the binocular matching network obtained in the first step on real data using a supervised or unsupervised fine-tuning method, is as follows. The embodiments of the present application fine-tune the trained binocular matching network in one of two ways. In the supervised fine-tuning method, only a multi-scale L1 regression loss function, that is, the disparity map estimation error, is adopted to reduce the error of the earlier pixel matching predictions; see equation (12).

[Equation (12)]

The results show that even with a small amount of labeled data, for example 100 images, the binocular matching network can adapt from the synthetic modality to the real modality. In the unsupervised fine-tuning method, as shown by image 21 in FIG. 2B, prior-art unsupervised fine-tuning yields blurry disparity maps and poor performance. The causes are the limitations of the unsupervised loss and the ambiguity of pixel matching based on RGB values alone. The embodiments of the present application therefore introduce additional regularization terms whose constraints improve performance. From the real data, the corresponding occlusion map and disparity map are obtained from the trained binocular matching network before fine-tuning, and each is denoted by its own symbol. These two kinds of data are used to regularize the training process. For how the unsupervised fine-tuning loss function provided by the embodiments of the present application is obtained, refer to the description in the preceding embodiment.

The third step, training the monocular depth estimation network, is as follows. Up to this point, the inventors have trained the binocular matching network cross-modally on a large amount of synthetic data and fine-tuned it on real data. To train the final monocular depth estimation network, the embodiments of the present application use the disparity maps predicted by the trained binocular matching network as training data. The monocular depth estimation loss is obtained from the several parts shown in equation (13).

[Equation (13)]

Here, the symbols of equation (13) denote, respectively, the total number of pixels, the disparity map output by the monocular depth estimation network, and the disparity map output by the trained binocular matching network or by the network obtained by fine-tuning it. Note that equations (9) to (13) are all described for the case in which the monocular depth estimation network uses the left image of the real data as the training sample.

In the experiments, because the monocular depth estimation network is sensitive to changes in viewing angle, no cropping or scaling is applied to the training data. Both the input of the monocular depth estimation network and the disparity maps that supervise it are obtained from the trained binocular matching network.

FIG. 2C is a schematic diagram of visualized depth estimation results according to an embodiment of the present application, showing the depth maps of three different street-scene images obtained by prior-art methods and by the monocular depth estimation method of the embodiments. The first row is the input of the monocular depth estimation network, that is, the three street-scene images. The second row is the depth data obtained by nearest-neighbor interpolation of the sparse laser radar depth maps. The third to fifth rows are the depth maps of the three input images obtained by three different prior-art monocular depth estimation methods. The results of the present application are shown in the last three rows. Images 21, 22 and 23 are the depth maps of the three input images from the monocular depth network supervised directly by the binocular matching network trained on synthetic data in the first step. Images 24, 25 and 26 are the depth maps from the monocular depth network whose training data are the disparity maps output by the binocular matching network after fine-tuning with the unsupervised loss function provided by the embodiments. Images 27, 28 and 29 are the depth maps from the monocular depth network whose training data are the disparity maps output by the binocular matching network after supervised fine-tuning. As images 21 to 29 show, the model obtained by the monocular depth estimation method of the embodiments of the present application can capture finer scene structure.

An embodiment of the present application provides a monocular depth estimation device. FIG. 3 is a schematic structural diagram of the monocular depth estimation device of the embodiment. As shown in FIG. 3, the device 300 includes: an acquisition module 301 configured to acquire an image to be processed; an execution module 302 configured to input the image to be processed into a monocular depth estimation network model, which is obtained by supervised training using disparity maps output by a first binocular matching neural network model, and to obtain an analysis result of the image to be processed; and an output module 303 configured to output the analysis result of the image to be processed.
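As a rough illustration of how the three modules of device 300 might fit together, the following Python sketch chains an acquisition step, an execution step and an output step. All function names are hypothetical, and `dummy_model` is only a stand-in for a trained monocular depth estimation network model; it is not the model described in this application.

```python
from typing import Callable, List

Image = List[List[float]]         # grayscale image as rows of pixel values
DisparityMap = List[List[float]]  # one disparity value per pixel


def acquire_image(source: Image) -> Image:
    """Acquisition module: here it simply copies an in-memory image."""
    return [row[:] for row in source]


def run_model(model: Callable[[Image], DisparityMap], image: Image) -> DisparityMap:
    """Execution module: feed the image to the trained monocular model."""
    return model(image)


def output_result(result: DisparityMap) -> DisparityMap:
    """Output module: hand the analysis result back to the caller."""
    return result


def dummy_model(image: Image) -> DisparityMap:
    # Hypothetical stand-in for a trained network: constant disparity of 1.0.
    return [[1.0 for _ in row] for row in image]


def estimate(source: Image, model: Callable[[Image], DisparityMap] = dummy_model) -> DisparityMap:
    """Run acquisition, execution and output in order."""
    return output_result(run_model(model, acquire_image(source)))
```

In a real system the callable passed as `model` would wrap the trained network, and the output step might write a file or return a depth map instead.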

In some embodiments, the device further includes a third training module configured to supervise the monocular depth estimation network model with the disparity maps output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.

In some embodiments, the device further includes a first training module configured to train a second binocular matching neural network model based on acquired synthetic sample data, and a second training module configured to adjust the parameters of the trained second binocular matching neural network model based on acquired real sample data to obtain the first binocular matching neural network model.

In some embodiments, the device further includes a first acquisition module configured to acquire, as the synthetic sample data, depth-labeled synthetic binocular images comprising a synthetic left image and a synthetic right image.

In some embodiments, the first training module includes a first training unit configured to train the second binocular matching neural network model based on the synthetic binocular images, obtaining a trained second binocular matching neural network model whose outputs are a disparity map and an occlusion map. Here, the disparity map expresses, in pixels, the disparity distance between each pixel in the left image and the corresponding pixel in the right image, and the occlusion map expresses whether the pixel in the right image that corresponds to each pixel in the left image is occluded by an object.

In some embodiments, the device further includes: a construction module configured to construct a virtual 3D scene with a rendering engine; a mapping module configured to map the 3D scene into binocular images with two virtual cameras; a second acquisition module configured to acquire depth data of the synthetic binocular images based on the position at which the virtual 3D scene is constructed, the direction in which it is constructed, and the lens focal length of the virtual cameras; and a third acquisition module configured to label the binocular images based on the depth data to obtain the synthetic binocular images.
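Since the binocular matching network outputs disparity in pixels, rendered depth labels can be turned into disparity labels with the standard rectified-stereo relation d = f·B/Z. The sketch below assumes rectified virtual cameras, a focal length expressed in pixels and a baseline in meters; the function names and units are illustrative assumptions and are not specified in the text.

```python
from typing import List


def depth_to_disparity(depth_m: float, focal_px: float, baseline_m: float) -> float:
    """Convert a rendered depth (meters) to a disparity (pixels): d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def label_disparity(depth_map: List[List[float]], focal_px: float,
                    baseline_m: float) -> List[List[float]]:
    """Apply the conversion to every pixel of a rendered depth map."""
    return [[depth_to_disparity(z, focal_px, baseline_m) for z in row]
            for row in depth_map]
```

For example, with an assumed focal length of 720 pixels and a baseline of 0.5 m, a point 10 m away has a disparity of 36 pixels.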

In some embodiments, the second training module includes a second training unit configured to perform supervised training of the trained second binocular matching neural network model based on acquired depth-labeled real binocular data, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

In some embodiments, the second training unit in the second training module is further configured to perform unsupervised training of the trained second binocular matching neural network model based on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

In some embodiments, the second training unit in the second training module includes a second training component configured to use a loss function to perform unsupervised training of the trained second binocular matching neural network model based on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

In some embodiments, the device further includes a first determination module configured to determine the loss function using equation (14)

where

denotes the loss function,

denotes the reconstruction error, and

denotes an intensity coefficient.
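Equation (14) itself is not reproduced in this text. Judging from the terms listed for the corresponding method item (a reconstruction error, a term keeping the fine-tuned disparity close to the disparity of the trained second model, a gradient-consistency term, and intensity coefficients), one plausible reading is a weighted sum. The symbols below are assumptions chosen for illustration, not the notation of the application:

```latex
\[
L \;=\; L_{\mathrm{recons}} \;+\; \gamma_{1}\, L_{\mathrm{ab}} \;+\; \gamma_{2}\, L_{\mathrm{reg}}
\]
```

Here $L_{\mathrm{recons}}$ stands for the reconstruction error, $L_{\mathrm{ab}}$ for the disparity-deviation term, $L_{\mathrm{reg}}$ for the gradient-consistency term, and $\gamma_{1}$, $\gamma_{2}$ for the intensity coefficients that weight the two regularizers.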

In some embodiments, the device further includes a second determination module configured to determine the reconstruction error using equation (15)

or equation (16)

where

denotes the pixel values of the image synthesized by sampling the right image,

denotes the pixel values of the image synthesized by sampling the left image,

denotes the pixel values of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels, and

denotes the pixel coordinates of a pixel.
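To make the idea of a reconstruction error concrete, the following Python sketch synthesizes a left view by sampling the right image with a left disparity map and averages the absolute difference over non-occluded pixels. It is a simplified sketch under stated assumptions, not the formula of equations (15) and (16): it uses nearest-pixel sampling instead of interpolation, the convention that a left pixel at column x appears in the right image at column x - d, and an occlusion map whose value 1 marks occluded pixels. All names are hypothetical.

```python
from typing import List

Image = List[List[float]]


def warp_right_to_left(right: Image, disp_left: Image) -> Image:
    """Synthesize a left view by sampling the right image at column x - d."""
    height, width = len(right), len(right[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            src_x = int(round(x - disp_left[y][x]))
            src_x = min(max(src_x, 0), width - 1)  # clamp to the image border
            row.append(right[y][src_x])
        warped.append(row)
    return warped


def reconstruction_error(left: Image, right: Image,
                         disp_left: Image, occlusion: Image) -> float:
    """Absolute error between the left image and the warped right image,
    counted only where occlusion is 0 and divided by the total pixel count."""
    warped = warp_right_to_left(right, disp_left)
    n_pixels = len(left) * len(left[0])
    total = 0.0
    for y in range(len(left)):
        for x in range(len(left[0])):
            visible = 1.0 - occlusion[y][x]
            total += visible * abs(left[y][x] - warped[y][x])
    return total / n_pixels
```

Masking with the occlusion map keeps pixels that have no valid counterpart in the other view from dominating the error; a symmetric term for the right image could be built the same way with the roles of the two images swapped.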

In some embodiments, the device further includes a third determination module configured to determine, using equation (17)

or equation (18)

that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, where

denotes the pixel values of the disparity map output by the trained second binocular matching network model for the left image of the sample data,

denotes the pixel values of the disparity map output by the trained second binocular matching network model for the right image of the sample data, and

denotes an intensity coefficient.
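One way to picture such a deviation term is as a weighted mean absolute difference between the disparity map of the fine-tuned (first) model and that of the trained second model. The weighting used below, occlusion value plus a small coefficient `gamma`, is an assumption made for illustration only; equations (17) and (18) are not reproduced in this text, and the function name is hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def disparity_deviation(disp_new: Map2D, disp_old: Map2D,
                        occlusion: Map2D, gamma: float) -> float:
    """Weighted mean of |d_new - d_old| over all pixels.

    Assumed weighting: occluded pixels (occlusion = 1) get weight 1 + gamma,
    visible pixels (occlusion = 0) get the small weight gamma.
    """
    n_pixels = len(disp_new) * len(disp_new[0])
    total = 0.0
    for y in range(len(disp_new)):
        for x in range(len(disp_new[0])):
            weight = occlusion[y][x] + gamma
            total += weight * abs(disp_new[y][x] - disp_old[y][x])
    return total / n_pixels
```

With this weighting the fine-tuned model is held close to the earlier prediction mainly where the reconstruction error gives no signal, which is one common motivation for a term of this kind.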

In some embodiments, the device further includes a fourth determination module configured to determine, using equation (19)

or equation (20)

that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, where

denotes the gradient of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,

denotes the gradient of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,

denotes the gradient of the disparity map output by the trained second binocular matching network model for the left image of the sample data, and

denotes the gradient of the disparity map output by the trained second binocular matching network model for the right image of the sample data.
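A gradient-consistency term of this kind can be sketched as the mean absolute difference between the spatial gradients of two disparity maps. The Python example below uses forward differences along both image axes and normalizes by the pixel count; these choices and the function names are assumptions for illustration, since equations (19) and (20) are not reproduced in this text.

```python
from typing import List, Tuple

Map2D = List[List[float]]


def gradients(disp: Map2D) -> Tuple[Map2D, Map2D]:
    """Return forward differences along x (columns) and along y (rows)."""
    grad_x = [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in disp]
    grad_y = [[disp[y + 1][x] - disp[y][x] for x in range(len(disp[0]))]
              for y in range(len(disp) - 1)]
    return grad_x, grad_y


def gradient_mismatch(disp_new: Map2D, disp_old: Map2D) -> float:
    """Sum of absolute gradient differences divided by the pixel count."""
    n_pixels = len(disp_new) * len(disp_new[0])
    new_x, new_y = gradients(disp_new)
    old_x, old_y = gradients(disp_old)
    total = 0.0
    for new_g, old_g in ((new_x, old_x), (new_y, old_y)):
        for new_row, old_row in zip(new_g, old_g):
            for a, b in zip(new_row, old_row):
                total += abs(a - b)
    return total / n_pixels
```

Because only differences between neighboring pixels are compared, adding a constant offset to one disparity map leaves this term at zero, whereas changing the slope of the disparity surface does not.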

In some embodiments, the depth-labeled real binocular data includes left images and right images. Correspondingly, the third training module includes a first acquisition unit configured to acquire the left images or the right images of the depth-labeled real binocular data as training samples, and a first training unit configured to train the monocular depth estimation network model based on the left images or the right images of the depth-labeled real binocular data.

In some embodiments, the real binocular data without depth labels includes left images and right images. Correspondingly, the third training module further includes: a second acquisition unit configured to input the real binocular data without depth labels into the first binocular matching neural network model to obtain the corresponding disparity map; a first determination unit configured to determine the depth map corresponding to the disparity map based on the disparity map, the lens baseline length of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and a second training unit configured to take the left images or the right images of the real binocular data without depth labels as sample data and to supervise the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
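The conversion from a disparity map to a depth map using the baseline length and the focal length follows the standard rectified-stereo relation Z = f·B/d. The sketch below assumes a focal length in pixels, a baseline in meters and disparities in pixels; the function name and the small lower bound on disparity that avoids division by zero are illustrative assumptions.

```python
from typing import List


def disparity_to_depth(disp_map: List[List[float]], focal_px: float,
                       baseline_m: float, min_disp: float = 1e-6) -> List[List[float]]:
    """Convert a disparity map (pixels) to a depth map (meters): Z = f * B / d.

    Disparities below min_disp are clamped so that distant or invalid pixels
    yield a very large depth instead of a division by zero.
    """
    return [[focal_px * baseline_m / max(d, min_disp) for d in row]
            for row in disp_map]
```

The resulting depth map can then serve as the supervision target for the monocular network, with the left or right image of the same pair as its input.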

In some embodiments, the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model. Correspondingly, the device further includes: a fifth determination module configured to determine the depth map corresponding to the disparity map based on the disparity map output by the monocular depth estimation network model, the lens baseline length of the camera that captured the image input into the monocular depth estimation network model, and the lens focal length of that camera; and a first output module configured to output the depth map corresponding to the disparity map.

It should be noted that the description of the above device embodiments is similar to that of the method embodiments, and the device embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the device embodiments of the present application, refer to the description of the method embodiments. In the embodiments of the present application, if the above monocular depth estimation method is implemented in the form of a software functional module and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments, in essence or in the part that contributes beyond the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device to execute all or part of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a USB flash drive, a mobile hard disk, a ROM (Read Only Memory), a magnetic disk or an optical disk. The embodiments of the present application are therefore not limited to any particular combination of hardware and software.

Correspondingly, an embodiment of the present application provides monocular depth estimation equipment including a processor and a memory storing a computer program runnable on the processor, wherein the processor implements the steps of the monocular depth estimation method when executing the program. Correspondingly, an embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the monocular depth estimation method when executed by a processor. It should be pointed out that the description of the above storage medium and equipment embodiments is similar to that of the method embodiments, and these embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium and equipment embodiments of the present application, refer to the description of the method embodiments.

FIG. 4 is a schematic diagram of the hardware entity of the monocular depth estimation equipment of an embodiment of the present application. As shown in FIG. 4, the hardware entity of the monocular depth estimation equipment 400 includes a memory 401, a communication bus 402 and a processor 403. The memory 401 is configured to store instructions and applications executable by the processor 403, and can also cache data that is to be processed or has been processed by the processor 403 and by each module in the monocular depth estimation equipment 400; it may be implemented by FLASH (registered trademark) (flash memory) or RAM (Random Access Memory). The communication bus 402 enables the monocular depth estimation equipment 400 to communicate with other terminals or servers over a network, and also provides the connection and communication between the processor 403 and the memory 401. The processor 403 typically controls the overall operation of the monocular depth estimation equipment 400.

It should be noted that, in this specification, the terms "include", "consist of" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. Unless otherwise stated, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software combined with a necessary general-purpose hardware platform. They can of course also be implemented by hardware, but in many cases the former is the preferred implementation. Based on this understanding, the technical solutions of the present application, in essence or in the part that contributes beyond the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (for example a ROM/RAM, a magnetic disk or an optical disk) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of the present application.

The present application is described with reference to flowcharts and/or block diagrams of methods, equipment (devices) and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps is performed on the computer or other programmable data processing device to produce computer-implemented processing. The instructions executed on the computer or other programmable data processing device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above are merely preferred embodiments of the present application and do not limit the patent scope of the present application. Any equivalent structure or equivalent flow transformation made using the contents of the specification and drawings of the present application, and any direct or indirect application of them in other related technical fields, likewise falls within the scope of patent protection of the present application.

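In the embodiments of the present application, an image to be processed is acquired, the image to be processed is input into a monocular depth estimation network model obtained by supervised training using disparity maps output by a first binocular matching neural network model to obtain an analysis result of the image to be processed, and the analysis result of the image to be processed is output. This makes it possible to train the monocular depth estimation network using less data labeled with depth maps, or none at all, and provides a more efficient, unsupervised, fine-tunable method based on a network that uses binocular disparity, thereby indirectly improving the performance of monocular depth estimation.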
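For example, the present application provides the following items.
(Item 1)
A monocular depth estimation method, comprising:
a step of acquiring an image to be processed;
a step of inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is obtained by supervised training using disparity maps output by a first binocular matching neural network model; and
a step of outputting the analysis result of the image to be processed.
(Item 2)
The method according to item 1, wherein the training process of the first binocular matching neural network model includes:
a step of training a second binocular matching neural network model based on acquired synthetic sample data to obtain a trained second binocular matching neural network model; and
a step of adjusting the parameters of the trained second binocular matching neural network model based on acquired real sample data to obtain the first binocular matching neural network model.
(Item 3)
The method according to item 2, further comprising a step of acquiring, as the synthetic sample data, depth-labeled synthetic binocular images comprising a synthetic left image and a synthetic right image.
(Item 4)
The method according to item 3, wherein the step of training the second binocular matching neural network model based on the acquired synthetic sample data includes:
a step of training the second binocular matching neural network model based on the synthetic binocular images to obtain a trained second binocular matching neural network model whose outputs are a disparity map and an occlusion map, wherein the disparity map expresses, in pixels, the disparity distance between each pixel in the left image and the corresponding pixel in the right image, and the occlusion map expresses whether the pixel in the right image that corresponds to each pixel in the left image is occluded by an object.
(Item 5)
The method according to item 2, wherein the step of adjusting the parameters of the trained second binocular matching neural network model based on the acquired real sample data to obtain the first binocular matching neural network model includes:
a step of performing supervised training of the trained second binocular matching neural network model based on acquired depth-labeled real binocular data, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 6)
The method according to item 2, wherein the step of adjusting the parameters of the trained second binocular matching neural network model based on the acquired real sample data to obtain the first binocular matching neural network model further includes:
a step of performing unsupervised training of the trained second binocular matching neural network model based on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 7)
The method according to item 6, wherein the step of performing unsupervised training of the trained second binocular matching neural network model based on the acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model, includes:
a step of using a loss function to perform unsupervised training of the trained second binocular matching neural network model based on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.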
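(Item 8)
The method according to item 7, further comprising a step of determining the loss function using the formula
(Formula 87)
where
(Formula 88) denotes the loss function,
(Formula 89) denotes the reconstruction error,
(Formula 90) denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model,
(Formula 91) denotes the constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and
(Formula 92) denotes an intensity coefficient.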
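(Item 9)
The method according to item 8, further comprising a step of determining the reconstruction error using the formula
(Formula 93)
or
(Formula 94)
where
(Formula 95) denotes the number of pixels in the image,
(Formula 96) denotes the pixel values of the occlusion map output by the trained second binocular matching network model,
(Formula 97) denotes the pixel values of the left image of the real binocular data without depth labels,
(Formula 98) denotes the pixel values of the right image of the real binocular data without depth labels,
(Formula 99) denotes the pixel values of the image synthesized by sampling the right image,
(Formula 100) denotes the pixel values of the image synthesized by sampling the left image,
(Formula 101) denotes the pixel values of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 102) denotes the pixel values of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels, and
(Formula 103) denotes the pixel coordinates of a pixel.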
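(Item 10)
The method according to item 8, further comprising a step of determining, using the formula
(Formula 104)
or
(Formula 105)
that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model,
where
(Formula 106) denotes the number of pixels in the image,
(Formula 107) denotes the pixel values of the occlusion map output by the trained second binocular matching network model,
(Formula 108) denotes the pixel values of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 109) denotes the pixel values of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(Formula 110) denotes the pixel values of the disparity map output by the trained second binocular matching network model for the left image,
(Formula 111) denotes the pixel values of the disparity map output by the trained second binocular matching network model for the right image,
(Formula 112) denotes the pixel coordinates of a pixel, and
(Formula 113) denotes an intensity coefficient.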
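(Item 11)
The method according to item 8, further comprising a step of determining, using the formula
(Formula 114)
or
(Formula 115)
that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model,
where
(Formula 116) denotes the number of pixels in the image,
(Formula 117) denotes the gradient of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 118) denotes the gradient of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(Formula 119) denotes the gradient of the disparity map output by the trained second binocular matching network model for the left image,
(Formula 120) denotes the gradient of the disparity map output by the trained second binocular matching network model for the right image, and
(Formula 121) denotes the pixel coordinates of a pixel.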
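(Item 12)
The method according to item 5, wherein the depth-labeled real binocular data includes left images and right images, and correspondingly the training process of the monocular depth estimation network model includes:
a step of acquiring the left images or the right images of the depth-labeled real binocular data as training samples; and
a step of training the monocular depth estimation network model based on the left images or the right images of the depth-labeled real binocular data.
(Item 13)
The method according to any one of items 6 to 11, wherein the real binocular data without depth labels includes left images and right images, and correspondingly the training process of the monocular depth estimation network model includes:
a step of inputting the real binocular data without depth labels into the first binocular matching neural network model to obtain the corresponding disparity map;
a step of determining the depth map corresponding to the disparity map based on the disparity map, the lens baseline length of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and
a step of taking the left images or the right images of the real binocular data without depth labels as sample data and supervising the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
(Item 14)
The method according to item 12 or 13, wherein the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model, the method correspondingly further comprising:
a step of determining the depth map corresponding to the disparity map based on the disparity map output by the monocular depth estimation network model, the lens baseline length of the camera that captured the image input into the monocular depth estimation network model, and the lens focal length of that camera; and
a step of outputting the depth map corresponding to the disparity map.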
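(Item 15)
A monocular depth estimation device, comprising:
an acquisition module configured to acquire an image to be processed;
an execution module configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is obtained by supervised training using disparity maps output by a first binocular matching neural network model; and
an output module configured to output the analysis result of the image to be processed.
(Item 16)
The device according to item 15, further comprising a first training module configured to train a second binocular matching neural network model based on acquired synthetic sample data to obtain a trained second binocular matching neural network model, and a second training module configured to adjust the parameters of the trained second binocular matching neural network model based on acquired real sample data to obtain the first binocular matching neural network model.
(Item 17)
The device according to item 16, further comprising a first acquisition module configured to acquire, as the synthetic sample data, depth-labeled synthetic binocular images comprising a synthetic left image and a synthetic right image.
(Item 18)
The device according to item 17, wherein the first training module includes a first training unit configured to train the second binocular matching neural network model based on the synthetic binocular images to obtain a trained second binocular matching neural network model whose outputs are a disparity map and an occlusion map, wherein the disparity map expresses, in pixels, the disparity distance between each pixel in the left image and the corresponding pixel in the right image, and the occlusion map expresses whether the pixel in the right image that corresponds to each pixel in the left image is occluded by an object.
(Item 19)
The device according to item 16, wherein the second training module includes a second training unit configured to perform supervised training of the trained second binocular matching neural network model based on acquired depth-labeled real binocular data, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 20)
The device according to item 16, wherein the second training unit is further configured to perform unsupervised training of the trained second binocular matching neural network model based on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 21)
The device according to item 20, wherein the second training unit includes a second training component configured to use a loss function to perform unsupervised training of the trained second binocular matching neural network model based on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.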
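(Item 22)
The device according to item 21, further comprising a first determination module configured to determine the loss function using the formula
(Formula 122)
where
(Formula 123) denotes the loss function,
(Formula 124) denotes the reconstruction error,
(Formula 125) denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model,
(Formula 126) denotes the constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and
(Formula 127) denotes an intensity coefficient.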
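(Item 23)
The device according to item 22, further comprising a second determination module configured to determine the reconstruction error using the formula
(Formula 128)
or
(Formula 129)
where
(Formula 130) denotes the number of pixels in the image,
(Formula 131) denotes the pixel values of the occlusion map output by the trained second binocular matching network model,
(Formula 132) denotes the pixel values of the left image of the real binocular data without depth labels,
(Formula 133) denotes the pixel values of the right image of the real binocular data without depth labels,
(Formula 134) denotes the pixel values of the image synthesized by sampling the right image,
(Formula 135) denotes the pixel values of the image synthesized by sampling the left image,
(Formula 136) denotes the pixel values of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 137) denotes the pixel values of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels, and
(Formula 138) denotes the pixel coordinates of a pixel.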
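(Item 24)
The device according to item 22, further comprising a third determination module configured to determine, using the formula
(Formula 139)
or
(Formula 140)
that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model,
where
(Formula 141) denotes the number of pixels in the image,
(Formula 142) denotes the pixel values of the occlusion map output by the trained second binocular matching network model,
(Formula 143) denotes the pixel values of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 144) denotes the pixel values of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(Formula 145) denotes the pixel values of the disparity map output by the trained second binocular matching network model for the left image,
(Formula 146) denotes the pixel values of the disparity map output by the trained second binocular matching network model for the right image,
(Formula 147) denotes the pixel coordinates of a pixel, and
(Formula 148) denotes an intensity coefficient.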
（項目２５）
さらに、式
（化１４９）

、または、
（化１５０）

を利用して前記第一両眼マッチングネットワークモデルの出力勾配が前記第二両眼マッチングネットワークモデルの出力勾配に一致することを決定するように構成された第四決定モジュールを含み、ここで、前記
（化１５１）

は画像における画素の数を表し、前記
（化１５２）

は深度ラベルなしの実両眼データのうちの左画像の第一両眼マッチングネットワークモデルによって出力された視差マップの勾配を表し、前記
（化１５３）

は深度ラベルなしの実両眼データのうちの右画像の第一両眼マッチングネットワークモデルによって出力された視差マップの勾配を表し、前記
（化１５４）

は左画像の訓練後の第二両眼マッチングネットワークモデルによって出力された視差マップの勾配を表し、前記
（化１５５）

は右画像の訓練後の第二両眼マッチングネットワークモデルによって出力された視差マップの勾配を表し、前記
（化１５６）

は画素点の画素座標を表す項目２２に記載の装置。
（項目２６）
前記深度ラベル付きの実両眼データは左画像および右画像を含み、それに対して、さらに、前記深度ラベル付きの実両眼データのうちの左画像または右画像を訓練サンプルとして取得し、そして前記深度ラベル付きの実両眼データのうちの左画像または右画像に基づいて単眼深度推定ネットワークモデルを訓練するように構成された第三訓練モジュールを含む項目１９に記載の装置。
（項目２７）
前記深度ラベルなしの実両眼データは左画像および右画像を含み、それに対して、さらに、前記深度ラベルなしの実両眼データを前記第一両眼マッチングニューラルネットワークモデルに入力し、対応する視差マップを得て、前記対応する視差マップ、前記深度ラベルなしの実両眼データを撮影するカメラのレンズ基線長および前記深度ラベルなしの実両眼データを撮影するカメラのレンズ焦点距離に基づき、前記視差マップの対応する深度マップを決定し、そして前記深度ラベルなしの実両眼データのうちの左画像または右画像をサンプルデータとし、前記視差マップの対応する深度マップに基づいて単眼深度推定ネットワークモデルを教示し、それにより前記単眼深度推定ネットワークモデルを訓練するように構成された第三訓練モジュールを含む項目２０から２５のいずれか一項に記載の装置。
（項目２８）
前記処理対象の画像の解析結果は前記単眼深度推定ネットワークモデルにより出力される視差マップを含み、それに対して、さらに、前記単眼深度推定ネットワークモデルにより出力される視差マップ、前記単眼深度推定ネットワークモデルに入力される画像を撮影するカメラのレンズ基線長および前記単眼深度推定ネットワークモデルに入力される画像を撮影するカメラのレンズ焦点距離に基づき、前記視差マップの対応する深度マップを決定するように構成された第五決定モジュールと、前記視差マップの対応する深度マップを出力するように構成された第一出力モジュールと、を含む項目２６または２７に記載の装置。
（項目２９）
プロセッサおよびプロセッサにおいて運用可能なコンピュータプログラムが記憶されたメモリを含む単眼深度推定機器であって、前記プロセッサは前記プログラムを実行する時に項目１から１４のいずれか一項に記載の単眼深度推定方法におけるステップを実現する単眼深度推定機器。
（項目３０）
コンピュータプログラムが記憶されたコンピュータ読み取り可能記憶媒体であって、該コンピュータプログラムはプロセッサにより実行される時に項目１から１４のいずれか一項に記載の単眼深度推定方法におけるステップを実現するコンピュータ読み取り可能記憶媒体。
In the embodiments of the present application, an image to be processed is acquired; the image is input into a monocular depth estimation network model obtained by supervised training with the disparity map output by a first binocular matching neural network model, yielding an analysis result of the image; and the analysis result is output. The monocular depth estimation network can therefore be trained with little or no data carrying depth map labels. The embodiments also provide a more effective method of unsupervised fine-tuning of the binocular disparity network, which indirectly improves the quality of monocular depth estimation.
For example, the present application provides the following items.
(Item 1)
A monocular depth estimation method, comprising:
a step of acquiring an image to be processed;
a step of inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is obtained by supervised training with the disparity map output by a first binocular matching neural network model; and
a step of outputting the analysis result of the image to be processed.
(Item 2)
The method according to item 1, wherein the training process of the first binocular matching neural network model comprises:
a step of training a second binocular matching neural network model based on acquired synthetic sample data to obtain a trained second binocular matching neural network model; and
a step of adjusting the parameters of the trained second binocular matching neural network model based on acquired real sample data to obtain the first binocular matching neural network model.
(Item 3)
The method according to item 2, further comprising a step of acquiring, as the synthetic sample data, synthetic binocular images with depth labels, including a synthesized left image and a synthesized right image.
(Item 4)
The method according to item 3, wherein the step of training the second binocular matching neural network model based on the acquired synthetic sample data comprises:
a step of training the second binocular matching neural network model based on the synthetic binocular images to obtain a trained second binocular matching neural network model whose outputs are a disparity map and an occlusion map, wherein the disparity map represents, in units of pixels, the disparity distance between each pixel in the left image and the corresponding pixel in the right image, and the occlusion map represents whether the pixel in the right image corresponding to each pixel in the left image is occluded by an object.
(Item 5)
The method according to item 2, wherein the step of adjusting the parameters of the trained second binocular matching neural network model based on the acquired real sample data to obtain the first binocular matching neural network model comprises:
a step of performing supervised training of the trained second binocular matching neural network model based on acquired real binocular data with depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 6)
The method according to item 2, wherein the step of adjusting the parameters of the trained second binocular matching neural network model based on the acquired real sample data to obtain the first binocular matching neural network model further comprises:
a step of performing unsupervised training of the trained second binocular matching neural network model based on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 7)
The method according to item 6, wherein the step of performing unsupervised training of the trained second binocular matching neural network model based on the acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model, comprises:
a step of performing, using a loss function, unsupervised training of the trained second binocular matching neural network model based on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 8)
The method according to item 7, further comprising a step of determining the loss function using the formula
(Formula 87)
where
(Formula 88)
represents the loss function,
(Formula 89)
represents the reconstruction error,
(Formula 90)
represents that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model,
(Formula 91)
represents that the output gradient of the first binocular matching network model is constrained to be consistent with the output gradient of the trained second binocular matching network model, and
(Formula 92)
represents a strength coefficient.
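The formula images are not reproduced in this text, so the exact expression of item 8 is not available here. As a rough sketch only, a loss built from the three terms that item 8 describes is commonly written as a weighted sum. The symbols below, and the use of two separate weights, are assumptions made for illustration and are not taken from the patent's formulas.

```latex
% Illustrative notation only (assumed, not the patent's symbols)
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{recon}}
          \;+\; \gamma_{1}\,\mathcal{L}_{\mathrm{dev}}
          \;+\; \gamma_{2}\,\mathcal{L}_{\mathrm{grad}}
```

In this sketch, \mathcal{L}_{\mathrm{recon}} stands for the reconstruction error, \mathcal{L}_{\mathrm{dev}} for the term that keeps the new disparity map close to the one from the trained second model, \mathcal{L}_{\mathrm{grad}} for the gradient-consistency term, and \gamma_{1}, \gamma_{2} for strength coefficients.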
(Item 9)
The method according to item 8, further comprising a step of determining the reconstruction error using the formula
(Formula 93)
or
(Formula 94)
where
(Formula 95)
represents the number of pixels in the image,
(Formula 96)
represents a pixel value of the occlusion map output by the trained second binocular matching network model,
(Formula 97)
represents a pixel value of the left image of the real binocular data without depth labels,
(Formula 98)
represents a pixel value of the right image of the real binocular data without depth labels,
(Formula 99)
represents a pixel value of the image synthesized by sampling the right image,
(Formula 100)
represents a pixel value of the image synthesized by sampling the left image,
(Formula 101)
represents a pixel value of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 102)
represents a pixel value of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels, and
(Formula 103)
represents the pixel coordinates of a pixel.
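The quantities listed in item 9 can be made concrete with a small sketch of an occlusion-masked reconstruction error on a single image row. This is an illustration under stated assumptions, not the patent's formula: it assumes rectified images in which left pixel x matches right pixel x - d, nearest-neighbour sampling, an occlusion value of 1 meaning "occluded", and a mean absolute difference over all pixels. The function names are hypothetical.

```python
def warp_row_from_right(right_row, disp_left_row):
    """Synthesize a left-view row by sampling the right row at x - d.

    Assumption: rectified stereo, nearest-neighbour sampling, and
    coordinates clamped to the image border.
    """
    width = len(right_row)
    synthesized = []
    for x, d in enumerate(disp_left_row):
        src = int(round(x - d))
        src = min(max(src, 0), width - 1)  # clamp to valid columns
        synthesized.append(right_row[src])
    return synthesized


def masked_reconstruction_error(left_row, right_row, disp_left_row, occluded_row):
    """Mean absolute difference between the left row and its synthesis.

    Assumption: occluded_row holds 1 for occluded pixels and 0 otherwise;
    occluded pixels contribute nothing to the error.
    """
    synthesized = warp_row_from_right(right_row, disp_left_row)
    total = 0.0
    for left_value, synth_value, occ in zip(left_row, synthesized, occluded_row):
        total += (1 - occ) * abs(left_value - synth_value)
    return total / len(left_row)
```

With a correct disparity, the only mismatch comes from pixels that have no counterpart in the other view, which is why masking them out with the occlusion map keeps them from distorting the training signal.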
(Item 10)
The method according to item 8, further comprising a step of using the formula
(Formula 104)
or
(Formula 105)
to determine that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, where
(Formula 106)
represents the number of pixels in the image,
(Formula 107)
represents a pixel value of the occlusion map output by the trained second binocular matching network model,
(Formula 108)
represents a pixel value of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 109)
represents a pixel value of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(Formula 110)
represents a pixel value of the disparity map output by the trained second binocular matching network model for the left image,
(Formula 111)
represents a pixel value of the disparity map output by the trained second binocular matching network model for the right image,
(Formula 112)
represents the pixel coordinates of a pixel, and
(Formula 113)
represents a strength coefficient.
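Item 10 names an occlusion map, two disparity maps and a strength coefficient, but the formula itself is not reproduced in this text. The sketch below shows one plausible way these quantities could combine into a deviation penalty for a single image row. The weighting scheme (full weight on occluded pixels, the smaller weight `gamma` elsewhere) is purely an assumption for illustration, and the function name is hypothetical.

```python
def deviation_penalty(disp_first, disp_second, occluded, gamma):
    """Weighted mean absolute difference between two disparity rows.

    Assumption (not from the patent text): occluded pixels, marked 1,
    get weight 1; all other pixels get the smaller weight `gamma`.
    """
    total = 0.0
    for d_first, d_second, occ in zip(disp_first, disp_second, occluded):
        weight = occ + (1 - occ) * gamma
        total += weight * abs(d_first - d_second)
    return total / len(disp_first)
```

The intuition behind such a weighting is that the reconstruction error carries no signal in occluded regions, so the prediction of the trained second model is the only available guide there.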
(Item 11)
The method according to item 8, further comprising a step of using the formula
(Formula 114)
or
(Formula 115)
to determine that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, where
(Formula 116)
represents the number of pixels in the image,
(Formula 117)
represents the gradient of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 118)
represents the gradient of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(Formula 119)
represents the gradient of the disparity map output by the trained second binocular matching network model for the left image,
(Formula 120)
represents the gradient of the disparity map output by the trained second binocular matching network model for the right image, and
(Formula 121)
represents the pixel coordinates of a pixel.
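A gradient-consistency term of the kind item 11 describes can be sketched as follows for one row of a disparity map. The use of forward differences along the row and a mean absolute difference is an assumption for illustration; the patent's own formula is not reproduced in this text, and the function names are hypothetical.

```python
def horizontal_gradient(row):
    """Forward difference between neighbouring pixels of one row."""
    return [row[i + 1] - row[i] for i in range(len(row) - 1)]


def gradient_consistency(disp_first_row, disp_second_row):
    """Mean absolute difference between the gradients of two disparity rows.

    Assumption: the sum is normalized by the number of pixels in the row.
    """
    grad_first = horizontal_gradient(disp_first_row)
    grad_second = horizontal_gradient(disp_second_row)
    total = sum(abs(a - b) for a, b in zip(grad_first, grad_second))
    return total / len(disp_first_row)
```

A term like this is zero when the two disparity maps differ only by a constant offset, so it preserves the shape of edges and surfaces while leaving absolute values free to change.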
(Item 12)
The method according to item 5, wherein the real binocular data with depth labels includes left images and right images, and correspondingly the training process of the monocular depth estimation network model comprises:
a step of acquiring a left image or a right image of the real binocular data with depth labels as a training sample; and
a step of training the monocular depth estimation network model based on the left image or the right image of the real binocular data with depth labels.
(Item 13)
The method according to any one of items 6 to 11, wherein the real binocular data without depth labels includes left images and right images, and correspondingly the training process of the monocular depth estimation network model comprises:
a step of inputting the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map;
a step of determining the depth map corresponding to the disparity map based on the corresponding disparity map, the lens baseline length of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and
a step of using a left image or a right image of the real binocular data without depth labels as sample data and supervising the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
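The conversion from a disparity map to a depth map using the baseline length and focal length follows the standard relation for rectified stereo, depth = focal length × baseline / disparity. The sketch below assumes the focal length is given in pixels and the baseline in metres, and it floors very small disparities to avoid division by zero; the function name and the `min_disparity` parameter are hypothetical.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to a depth map (metres).

    Uses depth = focal_length * baseline / disparity for rectified stereo.
    Disparities below `min_disparity` are floored (an assumption made
    here to keep the division well defined).
    """
    return [
        [focal_length_px * baseline_m / max(d, min_disparity) for d in row]
        for row in disparity_px
    ]
```

For example, with a focal length of 400 pixels and a baseline of 0.5 m, a disparity of 100 pixels corresponds to a depth of 2 m, and halving the disparity doubles the depth.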
(Item 14)
The method according to item 12 or 13, wherein the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model, and correspondingly the method further comprises:
a step of determining the depth map corresponding to the disparity map based on the disparity map output by the monocular depth estimation network model, the lens baseline length of the camera that captured the image input into the monocular depth estimation network model, and the lens focal length of that camera; and
a step of outputting the depth map corresponding to the disparity map.
(Item 15)
A monocular depth estimation apparatus, comprising:
an acquisition module configured to acquire an image to be processed;
an execution module configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is obtained by supervised training with the disparity map output by a first binocular matching neural network model; and
an output module configured to output the analysis result of the image to be processed.
(Item 16)
The apparatus according to item 15, further comprising: a first training module configured to train a second binocular matching neural network model based on acquired synthetic sample data to obtain a trained second binocular matching neural network model; and a second training module configured to adjust the parameters of the trained second binocular matching neural network model based on acquired real sample data to obtain the first binocular matching neural network model.
(Item 17)
The apparatus according to item 16, further comprising a first acquisition module configured to acquire, as the synthetic sample data, synthetic binocular images with depth labels, including a synthesized left image and a synthesized right image.
(Item 18)
The apparatus according to item 17, wherein the first training module comprises a first training unit configured to train the second binocular matching neural network model based on the synthetic binocular images to obtain a trained second binocular matching neural network model whose outputs are a disparity map and an occlusion map, wherein the disparity map represents, in units of pixels, the disparity distance between each pixel in the left image and the corresponding pixel in the right image, and the occlusion map represents whether the pixel in the right image corresponding to each pixel in the left image is occluded by an object.
(Item 19)
The apparatus according to item 16, wherein the second training module comprises a second training unit configured to perform supervised training of the trained second binocular matching neural network model based on acquired real binocular data with depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 20)
The apparatus according to item 16, wherein the second training unit is further configured to perform unsupervised training of the trained second binocular matching neural network model based on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 21)
The apparatus according to item 20, wherein the second training unit comprises a second training component configured to perform, using a loss function, unsupervised training of the trained second binocular matching neural network model based on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.
(Item 22)
The apparatus according to item 21, further comprising a first determination module configured to determine the loss function using the formula
(Formula 122)
where
(Formula 123)
represents the loss function,
(Formula 124)
represents the reconstruction error,
(Formula 125)
represents that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model,
(Formula 126)
represents that the output gradient of the first binocular matching network model is constrained to be consistent with the output gradient of the trained second binocular matching network model, and
(Formula 127)
represents a strength coefficient.
(Item 23)
The apparatus according to item 22, further comprising a second determination module configured to determine the reconstruction error using the formula
(Formula 128)
or
(Formula 129)
where
(Formula 130)
represents the number of pixels in the image,
(Formula 131)
represents a pixel value of the occlusion map output by the trained second binocular matching network model,
(Formula 132)
represents a pixel value of the left image of the real binocular data without depth labels,
(Formula 133)
represents a pixel value of the right image of the real binocular data without depth labels,
(Formula 134)
represents a pixel value of the image synthesized by sampling the right image,
(Formula 135)
represents a pixel value of the image synthesized by sampling the left image,
(Formula 136)
represents a pixel value of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 137)
represents a pixel value of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels, and
(Formula 138)
represents the pixel coordinates of a pixel.
(Item 24)
The apparatus according to item 22, further comprising a third determination module configured to use the formula
(Formula 139)
or
(Formula 140)
to determine that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, where
(Formula 141)
represents the number of pixels in the image,
(Formula 142)
represents a pixel value of the occlusion map output by the trained second binocular matching network model,
(Formula 143)
represents a pixel value of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 144)
represents a pixel value of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(Formula 145)
represents a pixel value of the disparity map output by the trained second binocular matching network model for the left image,
(Formula 146)
represents a pixel value of the disparity map output by the trained second binocular matching network model for the right image,
(Formula 147)
represents the pixel coordinates of a pixel, and
(Formula 148)
represents a strength coefficient.
(Item 25)
The apparatus according to item 22, further comprising a fourth determination module configured to use the formula
(Formula 149)
or
(Formula 150)
to determine that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, where
(Formula 151)
represents the number of pixels in the image,
(Formula 152)
represents the gradient of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(Formula 153)
represents the gradient of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(Formula 154)
represents the gradient of the disparity map output by the trained second binocular matching network model for the left image,
(Formula 155)
represents the gradient of the disparity map output by the trained second binocular matching network model for the right image, and
(Formula 156)
represents the pixel coordinates of a pixel.
(Item 26)
The apparatus according to item 19, wherein the real binocular data with depth labels includes left images and right images, and correspondingly the apparatus further comprises a third training module configured to acquire a left image or a right image of the real binocular data with depth labels as a training sample, and to train the monocular depth estimation network model based on the left image or the right image of the real binocular data with depth labels.
(Item 27)
The apparatus according to any one of items 20 to 25, wherein the real binocular data without depth labels includes left images and right images, and correspondingly the apparatus further comprises a third training module configured to: input the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map; determine the depth map corresponding to the disparity map based on the corresponding disparity map, the lens baseline length of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and use a left image or a right image of the real binocular data without depth labels as sample data and supervise the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
(Item 28)
The apparatus according to item 26 or 27, wherein the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model, and correspondingly the apparatus further comprises: a fifth determination module configured to determine the depth map corresponding to the disparity map based on the disparity map output by the monocular depth estimation network model, the lens baseline length of the camera that captured the image input into the monocular depth estimation network model, and the lens focal length of that camera; and a first output module configured to output the depth map corresponding to the disparity map.
(Item 29)
A monocular depth estimation device comprising a processor and a memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the monocular depth estimation method according to any one of items 1 to 14.
(Item 30)
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the monocular depth estimation method according to any one of items 1 to 14.

Claims

A monocular depth estimation method, comprising:
a step of acquiring an image to be processed;
a step of inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is obtained by supervised training with the disparity map output by a first binocular matching neural network model; and
a step of outputting the analysis result of the image to be processed.

The method according to claim 1, wherein the training process of the first binocular matching neural network model comprises:
a step of training a second binocular matching neural network model based on acquired synthetic sample data to obtain a trained second binocular matching neural network model; and
a step of adjusting the parameters of the trained second binocular matching neural network model based on acquired real sample data to obtain the first binocular matching neural network model.

The method according to claim 2, further comprising a step of acquiring, as the synthetic sample data, synthetic binocular images with depth labels, including a synthesized left image and a synthesized right image.

The method according to claim 3, wherein the step of training the second binocular matching neural network model based on the acquired synthetic sample data comprises:
a step of training the second binocular matching neural network model based on the synthetic binocular images to obtain a trained second binocular matching neural network model whose outputs are a disparity map and an occlusion map, wherein the disparity map represents, in units of pixels, the disparity distance between each pixel in the left image and the corresponding pixel in the right image, and the occlusion map represents whether the pixel in the right image corresponding to each pixel in the left image is occluded by an object.

The method according to claim 2, wherein the step of adjusting the parameters of the trained second binocular matching neural network model based on the acquired real sample data to obtain the first binocular matching neural network model comprises:
a step of performing supervised training of the trained second binocular matching neural network model based on acquired real binocular data with depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

The method according to claim 2, wherein the step of adjusting the parameters of the trained second binocular matching neural network model based on the acquired real sample data to obtain the first binocular matching neural network model further comprises:
a step of performing unsupervised training of the trained second binocular matching neural network model based on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

The method according to claim 6, wherein the step of performing unsupervised training of the trained second binocular matching neural network model based on the acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model, comprises:
a step of performing, using a loss function, unsupervised training of the trained second binocular matching neural network model based on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

The method according to claim 7, further comprising a step of determining the loss function using the formula
(formula)
where
(formula)
represents the loss function,
(formula)
represents the reconstruction error,
(formula)
represents that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model,
(formula)
represents that the output gradient of the first binocular matching network model is constrained to be consistent with the output gradient of the trained second binocular matching network model, and
(formula)
represents a strength coefficient.

The method according to claim 8, further comprising a step of determining the reconstruction error using the formula
(formula)
or
(formula)
where
(formula)
represents the number of pixels in the image,
(formula)
represents a pixel value of the occlusion map output by the trained second binocular matching network model,
(formula)
represents a pixel value of the left image of the real binocular data without depth labels,
(formula)
represents a pixel value of the right image of the real binocular data without depth labels,
(formula)
represents a pixel value of the image synthesized by sampling the right image,
(formula)
represents a pixel value of the image synthesized by sampling the left image,
(formula)
represents a pixel value of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(formula)
represents a pixel value of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels, and
(formula)
represents the pixel coordinates of a pixel.

The method according to claim 8, further comprising a step of using the formula
(formula)
or
(formula)
to determine that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, where
(formula)
represents the number of pixels in the image,
(formula)
represents a pixel value of the disparity map output by the trained second binocular matching network model for the left image,
(formula)
represents a pixel value of the disparity map output by the trained second binocular matching network model for the right image,
(formula)
represents the pixel coordinates of a pixel, and
(formula)
represents a strength coefficient.

Further included is a step of using the formula
(formula)
or
(formula)
to determine that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, where
(formula)
represents the number of pixels in the image,
(formula)
represents the gradient of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels,
(formula)
represents the gradient of the disparity map output by the first binocular matching network model for the right image of the real binocular data without depth labels,
(formula)
represents the gradient of the disparity map output by the trained second binocular matching network model for the left image, and
(formula)
represents the gradient of the disparity map output by the trained second binocular matching network model for the right image.

The method according to claim 5, wherein the real binocular data with depth labels includes left images and right images, and correspondingly the training process of the monocular depth estimation network model comprises:
a step of acquiring a left image or a right image of the real binocular data with depth labels as a training sample; and
a step of training the monocular depth estimation network model based on the left image or the right image of the real binocular data with depth labels.

The method according to any one of claims 6 to 11, wherein the real binocular data without depth labels includes left images and right images, and correspondingly the training process of the monocular depth estimation network model comprises:
a step of inputting the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map;
a step of determining the depth map corresponding to the disparity map based on the corresponding disparity map, the lens baseline length of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and
a step of using a left image or a right image of the real binocular data without depth labels as sample data and supervising the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.

The method of claim 12 or 13, wherein the analysis result of the image to be processed includes the parallax map output by the monocular depth estimation network model, the method further including:
a step of determining the depth map corresponding to the parallax map based on the parallax map output by the monocular depth estimation network model, the lens baseline length of the camera that captures the image input to the monocular depth estimation network model, and the lens focal length of the camera that captures the image input to the monocular depth estimation network model; and
a step of outputting the depth map corresponding to the parallax map.
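
As an illustration only, the conversion from a parallax map to a depth map using the lens baseline length and the lens focal length can be sketched as follows. The claims do not state the conversion formula here, so the sketch assumes the standard rectified-stereo relation depth = focal length × baseline / parallax. The function name `disparity_to_depth`, its parameters, the units (pixels and metres), and the sample values are hypothetical and are not taken from the application.

```python
import numpy as np


def disparity_to_depth(disparity, focal_length_px, baseline_m, eps=1e-6):
    """Convert a parallax (disparity) map in pixels to a depth map in metres.

    Assumes the standard rectified-stereo relation Z = f * B / d; this
    relation is an assumption of the sketch, not a formula quoted from
    the claims.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    # Pixels with zero or negative disparity have no finite depth;
    # they are left at 0 here as an "invalid" marker.
    valid = disparity > eps
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth


if __name__ == "__main__":
    # Hypothetical 2x2 parallax map, focal length and baseline.
    sample = np.array([[50.0, 25.0], [0.0, 10.0]])
    print(disparity_to_depth(sample, focal_length_px=500.0, baseline_m=0.2))
```

Under this assumed relation, the same routine would serve both for building depth supervision from the parallax maps of the first binocular matching neural network model and for converting the parallax map output by the monocular depth estimation network model at inference time.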

A monocular depth estimation device, including:
an acquisition module configured to acquire an image to be processed;
an execution module configured to input the image to be processed into a trained monocular depth estimation network model and obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is obtained by supervised training with the disparity map output by a first binocular matching neural network model; and
an output module configured to output the analysis result of the image to be processed.

The device according to claim 15, further including: a first training module configured to train a second binocular matching neural network model based on acquired synthetic sample data to obtain a trained second binocular matching neural network model; and a second training module configured to adjust the parameters of the trained second binocular matching neural network model based on acquired real sample data to obtain the first binocular matching neural network model.

The device according to claim 16, further including a first acquisition module configured to acquire, as the synthetic sample data, a synthetic binocular image with depth labels that includes a synthetic left image and a synthetic right image.

The device according to claim 17, wherein the first training module includes a first training unit configured to train the second binocular matching neural network model based on the synthetic binocular image to obtain a trained second binocular matching neural network model whose output is a disparity map and an occlusion map, wherein the disparity map represents the pixel-wise disparity distance between each pixel point in the left image and the corresponding pixel point in the right image, and the occlusion map represents whether each pixel point in the left image and its corresponding pixel point in the right image are occluded by an object.

The device according to claim 16, wherein the second training module includes a second training unit configured to perform supervised training of the trained second binocular matching neural network model based on acquired real binocular data with depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

The device according to claim 16, wherein the second training unit is further configured to perform unsupervised training of the trained second binocular matching neural network model based on acquired real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

The device according to claim 20, wherein the second training unit includes a second training component configured to use a loss function to perform unsupervised training of the trained second binocular matching neural network model based on the real binocular data without depth labels, thereby adjusting the weights of the trained second binocular matching neural network model to obtain the first binocular matching neural network model.

The device according to claim 21, further including a first determination module configured to determine the loss function using the formula

, where

represents the loss function,

represents the reconstruction error, and

represents a strength coefficient.

The device according to claim 22, further including a second determination module configured to determine the reconstruction error using the formula

, or

, where

represents the number of pixels in the image, and

represents the pixel coordinates of a pixel point.

The device according to claim 22, further including a third determination module configured to use the formula

, or

to determine that the parallax map output by the first binocular matching network model is less biased than the parallax map output by the trained second binocular matching network model, where

represents the number of pixels in the image,

represents the pixel coordinates of a pixel point, and

represents a strength coefficient.

The device further includes a fourth determination module configured to use the formula

, or

to determine that the output gradient of the first binocular matching network model matches the output gradient of the second binocular matching network model, where

represents the number of pixels in the image.

The device according to claim 19, wherein the depth-labeled real binocular data includes a left image and a right image, the device further including a third training module configured to acquire the left image or the right image of the depth-labeled real binocular data as a training sample and to train the monocular depth estimation network model based on the left image or the right image of the depth-labeled real binocular data.

The device according to any one of claims 20 to 25, wherein the real binocular data without depth labels includes a left image and a right image, the device further including a third training module configured to: input the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding parallax map; determine the depth map corresponding to the parallax map based on the parallax map, the lens baseline length of the camera that captures the real binocular data without depth labels, and the lens focal length of the camera that captures the real binocular data without depth labels; and use the left image or the right image of the real binocular data without depth labels as sample data and supervise the monocular depth estimation network model with the depth map corresponding to the parallax map, thereby training the monocular depth estimation network model.

The device according to claim 26 or 27, wherein the analysis result of the image to be processed includes the parallax map output by the monocular depth estimation network model, the device further including: a fifth determination module configured to determine the depth map corresponding to the parallax map based on the parallax map output by the monocular depth estimation network model, the lens baseline length of the camera that captures the image input to the monocular depth estimation network model, and the lens focal length of the camera that captures the image input to the monocular depth estimation network model; and a first output module configured to output the depth map corresponding to the parallax map.

A monocular depth estimation device including a memory and a processor, wherein the memory stores a computer program that can run on the processor, and the processor, when executing the program, implements the steps of the monocular depth estimation method according to any one of claims 1 to 14.

A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the monocular depth estimation method according to any one of claims 1 to 14.