JP2022160382A

JP2022160382A - Method and system for generating learning data for machine learning

Info

Publication number: JP2022160382A
Application number: JP2022062743A
Authority: JP
Inventors: ゴンホチャ; Geon Ho Cha; ホドクジャン; Deok Ho; ドンユンウィ; Dongyoon Wee
Original assignee: Line Corp; Naver Corp
Current assignee: Z Intermediate Global Corp; Naver Corp
Priority date: 2021-04-06
Filing date: 2022-04-05
Publication date: 2022-10-19
Anticipated expiration: 2042-04-05
Also published as: KR20220138707A; JP7285986B2; KR102581001B1

Abstract

To provide a method and system for generating learning data for machine learning, and a method and system for generating a sample image. SOLUTION: A learning data generation method for machine learning includes: a step (S110) of generating a sample image using a target image (including a depth map generated from the target image); a step (S120) of estimating depth values of pixels included in the sample image, to generate a depth map for the sample image; and a step (S130, S140) of generating learning data using the depth map generated from the target image and the depth map generated from the sample image. SELECTED DRAWING: Figure 3

Description

The present invention relates to a learning data generation method and system, and more particularly to a learning data generation method for machine learning and a machine learning method using the same.

In three-dimensional computer graphics, a depth map provides information about the distance from a viewpoint to the surface of a subject. Three-dimensional information acquired from depth maps is actively used in 3D modeling, robotics, medicine, aviation, national defense, autonomous driving, and other fields.

Meanwhile, artificial intelligence, the core of the fourth industrial revolution, has made dramatic progress through deep learning (machine learning in the broad sense), which adds neural networks that mimic the human brain to machine learning (machine learning in the narrow sense).

With this development of machine learning, the recent field of depth estimation focuses on using machine learning to perform three-dimensional reconstruction from two-dimensional images.

In machine-learning-based depth estimation, consecutive images are used when a depth estimation model is trained on an unsupervised basis. However, the depth estimation techniques known so far apply the static scene assumption, that is, the assumption that objects in consecutive images do not move. As a result, when an image contains a dynamic object, the consecutive images cannot be fully used as learning data.

For example, as shown in FIG. 1, when images are collected by a camera mounted on a vehicle 100 and the collected images include other moving cars, those cars must be excluded from the learning data of the depth estimation model, which makes the depth estimation result inaccurate.

To solve the above problem and improve inference performance, the present invention proposes a new learning data generation method that allows images containing dynamic objects to be fully used as learning data in a depth estimation model that uses such consecutive images.

Non-Patent Document 1: Godard, Clement, et al. "Digging Into Self-Supervised Monocular Depth Estimation," ICCV, 2019.

An object of the present invention is to provide a new learning data generation method that allows images containing dynamic objects to be fully used as learning data.

Another object of the present invention is to provide a method and system for generating self-samples for unsupervised machine learning and using them to generate learning data.

More specifically, the present invention relates to a machine learning data generation method and system that enable highly accurate depth estimation from a single image containing dynamic objects.

Further, the present invention relates to a method and system for generating, from a single image, a plurality of self-samples used in unsupervised depth estimation learning.

To solve the above problem, the present invention provides a learning data generation method for machine learning, comprising: generating a sample image using a target image and a depth map of the target image; estimating depth values of pixels included in the sample image to generate a depth map for the sample image; and generating at least part of learning data using the depth map of the target image and the depth map generated from the sample image.

The present invention also provides a learning data generation system for machine learning, comprising: a storage unit that stores a target image; and a control unit that estimates depth values of pixels included in the target image to generate a depth map for the target image, wherein the control unit generates a sample image using the target image and the depth map, estimates depth values of pixels included in the sample image to generate a depth map for the sample image, performs a rigid transformation on the depth map generated from the target image, and generates at least part of learning data using the depth map generated from the sample image and the rigidly transformed depth map.

The present invention further provides a program that is executed by one or more processes in an electronic device and can be stored in a computer-readable recording medium, the program including commands for executing: estimating depth values of pixels included in a target image to generate a depth map for the target image; generating a sample image using the target image and the depth map; estimating depth values of pixels included in the sample image to generate a depth map for the sample image; performing a rigid transformation on the depth map generated from the target image; and generating at least part of learning data using the depth map generated from the sample image and the rigidly transformed depth map.

The present invention further provides a program that is executed by one or more processes in an electronic device and can be stored in a computer-readable recording medium, the program including commands for executing: estimating depth values of pixels included in a target image; mapping the pixels included in the target image onto a three-dimensional space based on their depth values; performing a rigid transformation with preset parameters on the pixels mapped onto the three-dimensional space; and projecting the rigidly transformed pixels onto a two-dimensional plane to generate a sample image.
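The four steps listed above (depth estimation, mapping to three-dimensional space, rigid transformation, projection to a two-dimensional plane) can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the pinhole camera model and its intrinsics `fx`, `fy`, `cx`, `cy`, the function name `generate_sample_image`, the nearest-pixel rounding, and the nearest-depth tie-break are all assumptions introduced here, and the depth map is taken as given rather than estimated by a model.

```python
from typing import List, Optional, Sequence

def generate_sample_image(
    image: List[List[int]],
    depth: List[List[float]],
    R: Sequence[Sequence[float]],   # 3x3 rotation (assumed parameterization)
    t: Sequence[float],             # 3-vector translation
    fx: float, fy: float, cx: float, cy: float,  # hypothetical pinhole intrinsics
) -> List[List[Optional[int]]]:
    """Map pixels to 3D using depth, apply x' = R x + t, and project back to 2D."""
    height, width = len(image), len(image[0])
    sample: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    nearest = [[float("inf")] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            z = depth[v][u]
            # Map pixel (u, v) onto three-dimensional space using its depth value.
            point = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Rigid transformation with preset parameters (R, t).
            moved = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
            if moved[2] <= 0:
                continue  # the point ended up behind the camera
            # Project the transformed point onto the two-dimensional image plane.
            u2 = round(fx * moved[0] / moved[2] + cx)
            v2 = round(fy * moved[1] / moved[2] + cy)
            inside = 0 <= u2 < width and 0 <= v2 < height
            if inside and moved[2] < nearest[v2][u2]:
                nearest[v2][u2] = moved[2]   # keep the closest surface per pixel
                sample[v2][u2] = image[v][u]
    return sample

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
row = generate_sample_image([[1, 2, 3]], [[2.0, 2.0, 2.0]],
                            IDENTITY, [2.0, 0.0, 0.0], 1.0, 1.0, 0.0, 0.0)
```

Pixels that no source pixel lands on stay `None`; how such holes are handled is not stated in this part of the text.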

As described above, according to the present invention, a plurality of self-samples can be generated from a single target image by applying various rigid transformation parameters. The present invention can therefore secure an unlimited number of images for depth estimation learning.

In addition, since every region of a sample image generated by the self-sample generation method according to the present invention is a static region, there is no need to filter out dynamic regions, which reduce depth estimation accuracy, from the sample image. The present invention can therefore use the entire sample image for learning.

Furthermore, by applying various rigid transformation parameters when generating sample images, the learning data generation method according to the present invention can calculate losses in a variety of situations. Specifically, according to the present invention, losses are calculated for all the ways in which the camera collecting the target image can move, so applying the loss function according to the present invention to machine learning enables highly accurate depth estimation.

FIG. 1 is a conceptual diagram for explaining a depth estimation method used during autonomous driving.
FIG. 2 is a conceptual diagram for explaining a system according to the present invention.
FIG. 3 is a flowchart for explaining a learning data generation method according to the present invention.
FIGS. 4a and 4b are conceptual diagrams showing how the learning data generation method according to the present invention is carried out.
FIG. 5 is a flowchart for explaining a sample image generation method according to the present invention.
FIGS. 6a to 6c are conceptual diagrams showing the sample image generation method according to the present invention.
FIGS. 7a and 7b are conceptual diagrams showing a conventional depth estimation learning method.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The same or similar components are denoted by the same reference numerals regardless of the drawing numbers, and redundant descriptions thereof are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification and do not in themselves have distinct meanings or roles. In describing the embodiments of the present invention, detailed descriptions of related known techniques are omitted where they would obscure the gist of the embodiments. Furthermore, the accompanying drawings are provided only to aid understanding of the embodiments; the technical idea of the present invention is not limited by the accompanying drawings and should be understood to cover all modifications, equivalents, and alternatives that fall within the idea and technical scope of the present invention.

Terms including ordinal numbers, such as "first" and "second," are used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is referred to as being "coupled" or "connected" to another component, it may be directly coupled or connected to the other component, or other components may be present in between. In contrast, when a component is referred to as being "directly coupled" or "directly connected" to another component, it should be understood that no other component is present in between.

Singular expressions include plural expressions unless otherwise specified.

In this specification, terms such as "include" and "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present invention relates to a method of generating self-samples for estimating depth from a single image and of generating machine learning data using the self-samples.

In the present invention, for convenience of description, the original image used for depth estimation learning is referred to as a "target image." The target image may be an image collected from a camera. More specifically, the target image may be an image collected from a camera mounted on a vehicle 200.

Meanwhile, the plurality of images that are generated from the target image and used for unsupervised depth estimation learning are referred to as "sample images." A sample image is generated based on the target image but is not identical to it. A plurality of sample images may be generated from the target image, and the sample images differ from one another. In the present invention, such a plurality of sample images are also referred to as "self-samples."

The present invention generates learning data using the target image and the sample images. In this specification, "learning data" means data used for machine learning, namely: the target image and its depth map; a sample image and its depth map; loss data calculated based on the depth map of the target image and the depth map of the sample image; and data generated in all computation processes for minimizing the value of the loss data.

"Generating learning data" in this specification includes generating a sample image, generating a depth map from the target image, generating a depth map from the sample image, calculating loss data using a loss function, and using the calculated loss data to update the weights required for depth estimation.

The system according to the present invention generates a plurality of sample images from a target image and uses the target image and the sample images to generate learning data for depth estimation. Before the present invention is described in detail, the system according to the present invention is described.

FIG. 2 is a conceptual diagram for explaining the system according to the present invention.

First, as shown in FIG. 2, the vehicle 200 means any means of transportation that travels on roads or tracks. The vehicle 200 may include at least one camera 210 for capturing images. Specifically, the vehicle may include a plurality of cameras that capture the same direction, or a plurality of cameras that each capture a different direction. In this specification, a "target image" means a single image captured in a specific direction.

Meanwhile, the depth estimation system 300 according to the present invention includes at least one of a communication unit 310, a storage unit 320, and a control unit 330. The system 300 may be included in the vehicle 200 or may be a separate server located outside the vehicle 200. In this specification, the vehicle 200 and the system 300 are described separately for convenience, but the system 300 may be included in the vehicle 200.

The communication unit 310 can communicate with at least one of the vehicle 200, external storage (for example, a database 340), an external server, and a cloud server.

The external server or cloud server may be configured to perform at least part of the role of the control unit 330. That is, data processing, data computation, and the like may be performed on the external server or cloud server, and the present invention places no particular restriction on how this is done.

The communication unit 310 can support various communication methods in accordance with the communication standards of the communication target (for example, an electronic device, an external server, or another device).

For example, the communication unit 310 may communicate with the communication target using at least one of WLAN (Wireless LAN), Wi-Fi (Wireless Fidelity), Wi-Fi Direct (Wireless Fidelity Direct), DLNA (registered trademark) (Digital Living Network Alliance), WiBro (Wireless Broadband), WiMAX (Worldwide Interoperability for Microwave Access), HSDPA (High-Speed Downlink Packet Access), HSUPA (High-Speed Uplink Packet Access), LTE (Long Term Evolution), LTE-A (Long Term Evolution-Advanced), 5G (5th Generation Mobile Telecommunication), Bluetooth (registered trademark), RFID (Radio Frequency Identification), IrDA (Infrared Data Association), UWB (Ultra Wide Band), ZigBee, NFC (Near Field Communication), and Wireless USB (Wireless Universal Serial Bus) technologies.

Next, the storage unit 320 may store various information related to the present invention. The storage unit 320 may be provided in the system 300 itself. Alternatively, at least part of the storage unit 320 may be at least one of the database (DB) 340 and cloud storage (or a cloud server). That is, the storage unit 320 need only be a space in which the information required by the system and method according to the present invention is stored, and it is not constrained to a particular physical location. Accordingly, the storage unit 320, the database 340, external storage, and cloud storage (or a cloud server) are not distinguished below and are all referred to as the storage unit 320.

The information stored in the storage unit 320 for sample image generation and depth estimation according to the present invention may include a target image and a plurality of sample images generated from the target image.

Next, the control unit 330 is configured to control the overall operation of the system 300 according to the present invention. The control unit 330 can process signals, data, and information input or output through the components described above, and can provide or process information or functions appropriate for a user.

The control unit 330 includes at least one central processing unit (CPU) and can perform the functions according to the present invention. The control unit 330 can also perform artificial-intelligence-based data processing, and can generate sample images and estimate depth according to the present invention. Further, the control unit 330 can generate sample images and estimate depth by at least one of machine learning and deep learning.

Before the learning data generation method according to the present invention is described, a conventional depth estimation learning method that uses consecutive images is described.

Conventionally, two target images captured by the same camera from different viewpoints (hereinafter also referred to as a first target image and a second target image) are used. It is assumed that, while the two target images are captured, only the camera moves and all objects in the images are stationary. Under this assumption, the second target image is regarded as being formed by a rigid transformation of the pixels included in the first target image. Rigid transformation parameters (hereinafter also simply "parameters") can be calculated by applying an ego-motion estimator to the first target image and the second target image.

A known model can be used as the ego-motion estimator. For example, the ego-motion estimator disclosed in Non-Patent Document 1 may be used, but the estimator is not limited thereto.

Once the rigid transformation parameters are calculated, an inverse rigid transformation becomes possible. Specifically, the rigid transformation parameters for a three-dimensional image may be calculated in the form of a 4×4 matrix. Multiplying the pixels (pixel vectors) that make up the second target image by the inverse of this parameter matrix yields the inverse rigid transformation result.
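As a rough illustration of the 4×4 parameter matrix and its inverse, the sketch below assumes the usual homogeneous form, with a 3×3 rotation R in the upper-left block and a translation t in the last column; the text does not spell out this layout, so it is an assumption. The helper names `make_rigid`, `invert_rigid`, and `apply_rigid` are hypothetical. For this form the inverse can be written in closed form as rotation Rᵀ and translation −Rᵀt, so no general matrix inversion is needed.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

def make_rigid(R: Sequence[Sequence[float]], t: Sequence[float]) -> Matrix:
    """Assemble the 4x4 matrix [[R, t], [0, 0, 0, 1]]."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def invert_rigid(T: Matrix) -> Matrix:
    """Invert a rigid 4x4 matrix: rotation becomes R^T, translation becomes -R^T t."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def apply_rigid(T: Matrix, p: Sequence[float]) -> Tuple[float, ...]:
    """Multiply T by the homogeneous pixel vector (x, y, z, 1)."""
    v = [p[0], p[1], p[2], 1.0]
    return tuple(sum(T[i][j] * v[j] for j in range(4)) for i in range(3))

# 90-degree rotation about the z axis plus a translation.
T = make_rigid([[0, -1, 0], [1, 0, 0], [0, 0, 1]], [1, 2, 3])
moved = apply_rigid(T, (1, 0, 0))
restored = apply_rigid(invert_rigid(T), moved)
```

Applying the inverse to a transformed point returns the original point, which is the property the inverse rigid transformation relies on.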

Performing the inverse rigid transformation on all the pixels that make up the second target image generates a new image.

Specifically, as shown in FIG. 7a, assuming that the second target image 740 is the result of a rigid transformation of the first target image 710, a pixel p2 of the first target image 710 is rigidly transformed to one pixel p3 of the second target image 740.

The ego-motion estimator (G) can calculate the rigid transformation parameters for the first target image 710 and the second target image 740.

Then, when the inverse rigid transformation (T) is performed on a pixel p3 of the second target image 740 using the calculated rigid transformation parameters, an inversely transformed pixel p3' is generated. Performing the inverse rigid transformation on all the pixels of the second target image 740 generates an inverse rigid transformation image 740'. In this specification, an image generated as a result of the inverse rigid transformation is referred to as an inverse rigid transformation image.

Under the above assumption, when the inverse rigid transformation is performed on a specific pixel of the second target image, that pixel must move to the same position as the corresponding pixel of the first target image.

Using the first target image and the inverse rigid transformation image, a photometric loss is calculated as in Formula 1 below. The photometric loss is an error calculated on the assumption that all objects in the images are rigid and stationary and that the two images were captured while only the camera moved.

(Formula 1)
Since Formula 1 is the formula disclosed in Non-Patent Document 1, a detailed description thereof is omitted.
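The formula itself is not reproduced in this text. For orientation only, the photometric error defined in Non-Patent Document 1 (Godard et al., ICCV 2019) has the form below. Whether Formula 1 of this publication is identical to it is an assumption. Here $I_a$ stands for the reference image (the first target image in the description above) and $I_b$ for the reconstructed image compared against it; both symbols are borrowed from that paper, not from this publication.

```latex
% Photometric error as defined in Godard et al. (ICCV 2019), which uses \alpha = 0.85.
% Assumption: Formula 1 of this publication corresponds to this expression.
\[
  pe(I_a, I_b) \;=\; \frac{\alpha}{2}\,\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr)
  \;+\; (1 - \alpha)\,\lVert I_a - I_b \rVert_1
\]
```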

Meanwhile, as shown in FIG. 7b, depth estimation 730 on the first target image 710 generates a depth map 720. The inverse rigid transformation image 740' generated in the process of FIG. 7a is warped (W) with the depth map 720. The loss according to Formula 1 above is then calculated using a pixel p3'' included in the warped image and the pixels included in the first target image 710. The machine learning for depth estimation described with reference to FIGS. 7a and 7b trains the model to minimize the photometric loss.

The depth estimation learning method described above gives inaccurate results when objects in the images move. To prevent this, moving objects in the images are filtered out, but then the entire target image cannot be used for learning.

The present invention provides a machine learning data generation method that can use a target image for learning even when it contains dynamic objects, and that can generate an unlimited number of consecutive images for learning.

Hereinafter, together with the configuration described above, the machine learning data generation method according to the present invention is described in more detail with reference to the accompanying drawings.

FIG. 3 is a flowchart for explaining the machine learning data generation method according to the present invention, and FIGS. 4a and 4b are conceptual diagrams showing how the machine learning data generation method is carried out.

First, a step (S110) of generating self-samples using a target image is performed.

A plurality of sample images are generated, each with different rigid transformation parameters. Each of the sample images is used for depth estimation learning. This specification describes a method of performing depth estimation learning with one sample image and the target image. The sample image generation method is described later.
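One way to obtain a different set of rigid transformation parameters for each sample image is to draw them at random, as in the sketch below. The text does not say how the parameters are chosen, so everything here is an illustrative assumption: the function name `random_rigid_params`, the restriction to rotation about the vertical axis, and the ranges `max_angle_deg` and `max_shift`.

```python
import math
import random
from typing import List, Tuple

Rotation = List[List[float]]

def random_rigid_params(
    n: int,
    max_angle_deg: float = 5.0,   # hypothetical rotation range
    max_shift: float = 0.5,       # hypothetical translation range
    seed: int = 0,
) -> List[Tuple[Rotation, List[float]]]:
    """Draw n (R, t) pairs: a rotation about the y axis and a 3D translation."""
    rng = random.Random(seed)
    params = []
    for _ in range(n):
        a = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
        R = [[math.cos(a), 0.0, math.sin(a)],
             [0.0, 1.0, 0.0],
             [-math.sin(a), 0.0, math.cos(a)]]
        t = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
        params.append((R, t))
    return params

params = random_rigid_params(4)
```

Each `(R, t)` pair would then drive the generation of one sample image from the same target image.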

After the self-samples are generated, a step (S120) of estimating the depth values of the pixels included in the self-samples is performed.

A depth map is calculated by applying a depth estimation method to the sample image. The depth estimation model applied here is the same model that was applied to the target image when the sample image was generated. The depth estimation model and the generation of sample images are described later.

When the depth estimation step is performed, a depth map is generated. The depth map includes pixel coordinate information and depth value information for each pixel. The pixel coordinate information is the coordinates corresponding to a pixel of the target image, and the depth value of a pixel is information indicating the depth value calculated for that pixel. The depth map may be output as an image. A depth map image includes a plurality of pixels, each of which includes coordinate information and depth information. In place of the color information matched to each pixel of the target image, the depth map image defines a depth value.
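The structure described above can be pictured as a grid with the same coordinates as the image, holding a depth value where the image holds a color. The sketch below uses nested lists for both. The names `iter_depth_map` and `same_grid` and the nested-list representation are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

def iter_depth_map(depth: List[List[float]]) -> Iterator[Tuple[int, int, float]]:
    """Yield (x, y, depth value) for every pixel coordinate of the depth map."""
    for y, row in enumerate(depth):
        for x, d in enumerate(row):
            yield x, y, d

def same_grid(image: List[List[int]], depth: List[List[float]]) -> bool:
    """Check that the depth map covers exactly the pixel coordinates of the image."""
    return len(image) == len(depth) and all(
        len(img_row) == len(d_row) for img_row, d_row in zip(image, depth)
    )

image = [[10, 20], [30, 40]]          # color (here: intensity) per pixel
depth = [[1.0, 2.0], [3.0, 4.0]]      # depth value per pixel
entries = list(iter_depth_map(depth))
```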

サンプル画像から算出されたデプスマップは、座標情報及び深度情報を含む。デプスマップに含まれる座標情報は、サンプル画像に含まれる座標情報と同じ情報であり、それぞれの座標情報に深度情報がマッチングされる。サンプル画像とデプスマップの関係は、対象画像と対象画像から生成されたデプスマップの関係と同じである。 A depth map calculated from the sample image includes coordinate information and depth information. The coordinate information included in the depth map is the same information as the coordinate information included in the sample image, and the depth information is matched with each piece of coordinate information. The relationship between the sample image and the depth map is the same as the relationship between the target image and the depth map generated from the target image.

次に、対象画像から算出されたデプスマップを構成するピクセルを３次元空間上にマッピングし、その後マッピングされたピクセルに対して剛体変換を行うステップ（Ｓ１３０）が行われる。 Next, a step (S130) of mapping the pixels constituting the depth map calculated from the target image onto a three-dimensional space and then performing rigid body transformation on the mapped pixels is performed.

剛体変換とは、全ての点のペア間のユークリッド距離を保持する幾何学的変換を意味する。剛体変換には、平行移動、回転、反射、又はそれらの組み合わせが含まれる。剛体変換が行われた後、全てのオブジェクトは同じ形状及び大きさを保持する。 A rigid transformation means a geometric transformation that preserves the Euclidean distance between all pairs of points. Rigid transformations include translations, rotations, reflections, or combinations thereof. All objects retain the same shape and size after rigid transformations are performed.
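The distance-preserving property can be checked numerically. The following is a minimal sketch, not part of the disclosed method: the helper names (`rotate_z`, `translate`, `rigid`) and the sample points are assumptions chosen for illustration.

```python
import math

def rotate_z(p, theta):
    """Rotate point p = (x, y, z) about the Z axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def translate(p, t):
    """Shift point p by the offset t = (tx, ty, tz)."""
    return tuple(a + b for a, b in zip(p, t))

def rigid(p, theta, t):
    """Rotation about Z followed by a translation (one possible rigid transformation)."""
    return translate(rotate_z(p, theta), t)

# Two illustrative points and an arbitrary transformation.
a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
theta, t = math.radians(30), (0.5, -1.0, 2.0)

before = math.dist(a, b)
after = math.dist(rigid(a, theta, t), rigid(b, theta, t))
print(before, after)  # the two distances agree up to floating-point rounding
```

Because the distance between every pair of points is unchanged, the shape and size of any object built from those points are unchanged as well.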

本明細書においては、前記剛体変換を行う一実施形態として、平行移動、回転、又はそれらの組み合わせについて説明するが、本明細書における剛体変換には、平行移動及び回転以外の他の種類の変換も含まれる。 In this specification, translation, rotation, or a combination thereof is described as one embodiment of the rigid transformation, but the rigid transformation herein also includes types of transformation other than translation and rotation.

一方、３次元空間上にマッピングされたピクセルのそれぞれは、座標情報を含むベクトルからなるようにしてもよい。すなわち、それぞれのピクセルは、Ｘ軸座標情報及びＹ軸座標情報を含むベクトルからなるようにしてもよい。本明細書においては、ピクセルのそれぞれの座標情報を含むベクトルをピクセルベクトルという。 On the other hand, each pixel mapped on the three-dimensional space may consist of a vector containing coordinate information. That is, each pixel may consist of a vector containing X-axis coordinate information and Y-axis coordinate information. In this specification, a vector containing coordinate information of each pixel is called a pixel vector.

剛体変換は、ピクセルベクトルに剛体変換パラメータを含む行列を掛けることにより行われるようにしてもよい。本明細書においては、剛体変換パラメータを含む行列をパラメータ行列という。 A rigid transformation may be performed by multiplying the pixel vector by a matrix containing the rigid transformation parameters. In this specification, a matrix containing rigid transformation parameters is referred to as a parameter matrix.

一実施形態において、剛体変換パラメータは、移動しようとするＸ軸方向の距離値、Ｙ軸方向の距離値、及びＺ軸方向の距離値の少なくとも１つを含む。また、剛体変換パラメータは、Ｘ軸を基準とする回転角度、Ｙ軸を基準とする回転角度、及びＺ軸を基準とする回転角度の少なくとも１つを含む。前述したように、平行移動及び回転からなる剛体変換を行う場合、剛体変換パラメータは、最大６つの異なるパラメータを含む。ただし、それに限定されるものではなく、前記平行移動及び回転以外の他の種類の剛体変換が行われる場合、剛体変換パラメータは、前述した６つのパラメータ以外の他のパラメータを含んでもよい。 In one embodiment, the rigid transformation parameters include at least one of an X-axis distance value, a Y-axis distance value, and a Z-axis distance value to be moved. Also, the rigid body transformation parameter includes at least one of a rotation angle with respect to the X-axis, a rotation angle with respect to the Y-axis, and a rotation angle with respect to the Z-axis. As mentioned above, when performing a rigid transformation consisting of translation and rotation, the rigid transformation parameters include up to 6 different parameters. However, it is not limited to this, and when other types of rigid body transformations other than the translation and rotation are performed, the rigid body transformation parameters may include parameters other than the six parameters described above.

一方、２次元上で剛体変換を行う際に用いられるパラメータ行列は３×３の形であり、３次元上で剛体変換を行う際に用いられるパラメータ行列は４×４の形である。 On the other hand, the parameter matrix used for rigid transformation in two dimensions is 3×3, and the parameter matrix used for rigid transformation in three dimensions is 4×4.
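The 4×4 shape follows from writing a three-dimensional pixel vector in homogeneous form, with a trailing 1, so that translation becomes a matrix product. The sketch below assumes that convention; the source does not spell out the matrix layout, and the function names are hypothetical.

```python
import math

def translation_matrix(tx, ty, tz):
    """4x4 homogeneous matrix that shifts a point by (tx, ty, tz)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def rotation_z_matrix(theta):
    """4x4 homogeneous matrix that rotates a point about the Z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1]]

def matvec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

pixel_vector = [2.0, 3.0, 5.0, 1.0]  # (x, y, z) plus the homogeneous 1
moved = matvec(translation_matrix(1.0, -1.0, 0.5), pixel_vector)
print(moved)  # [3.0, 2.0, 5.5, 1.0]
```

The two-dimensional case works the same way with a vector (x, y, 1) and a 3×3 matrix.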

対象画像から生成されたデプスマップは、座標情報及び深度情報を含む。ここで、座標情報は、２次元上の座標を定義する座標情報である。例えば、対象画像から生成されたデプスマップは、Ｘ軸座標情報及びＹ軸座標情報を含んでもよい。 A depth map generated from the target image includes coordinate information and depth information. Here, the coordinate information is coordinate information that defines two-dimensional coordinates. For example, a depth map generated from a target image may include X-axis coordinate information and Y-axis coordinate information.

剛体変換のために、デプスマップを構成するピクセルは、３次元空間上にマッピングされるようにしてもよい。具体的には、デプスマップを構成する複数のピクセルは、ベクトルに変換されるようにしてもよい。ここで、デプスマップを構成するそれぞれのピクセルは、３次元ベクトルに変換される。具体的には、３次元ベクトルを生成する際に、ピクセルに含まれる２次元座標情報及び深度情報が共に活用される。すなわち、ピクセルに含まれる深度情報が特定の軸に関する座標情報として活用される。 For rigid transformation, the pixels that make up the depth map may be mapped onto a three-dimensional space. Specifically, a plurality of pixels forming the depth map may be converted into a vector. Here, each pixel making up the depth map is transformed into a three-dimensional vector. Specifically, when generating a three-dimensional vector, both two-dimensional coordinate information and depth information included in pixels are utilized. That is, depth information included in pixels is used as coordinate information about a specific axis.

前述のように生成された３次元ベクトルにサンプル画像の生成時に適用された剛体変換パラメータを同一に適用して剛体変換を行う。 Rigid body transformation is performed by applying the same rigid body transformation parameters that were applied when the sample image was generated to the three-dimensional vector generated as described above.

その後、剛体変換により新たに生成されたベクトルに含まれる３種類の座標情報のいずれかを深度情報に変換して新たなピクセルを生成する。デプスマップに含まれる全てのピクセルに対して上記過程を適用すると、新たなデプスマップが生成される。 After that, any one of the three types of coordinate information included in the vector newly generated by the rigid body transformation is converted into depth information to generate a new pixel. Applying the above process to all pixels contained in the depth map produces a new depth map.

一実施形態において、デプスマップに含まれる深度情報をＺ軸座標情報に変換して３次元ベクトルを生成する。サンプル画像の生成時に適用された剛体変換パラメータは、４×４行列の形である。前記行列をデプスマップから生成された３次元ベクトルに掛けて新たなベクトルを生成する。剛体変換されたベクトルに含まれるＺ軸座標情報を深度情報に変換して新たなピクセルを生成する。 In one embodiment, the depth information contained in the depth map is converted to Z-axis coordinate information to generate a three-dimensional vector. The rigid transformation parameters applied when generating the sample images are in the form of a 4x4 matrix. A new vector is generated by multiplying the 3D vector generated from the depth map by the matrix. A new pixel is generated by converting the Z-axis coordinate information included in the rigid-transformed vector into depth information.
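As an illustration of this embodiment, the sketch below turns each depth-map pixel into a homogeneous vector (x, y, depth, 1), multiplies it by a 4×4 parameter matrix, and reads the resulting Z coordinate back as the new depth. It is a simplified sketch under stated assumptions: the depth map is a nested list, the function names are hypothetical, and resampling the transformed points back onto a pixel grid is left out.

```python
def matvec(m, v):
    """Multiply matrix m by column vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def transform_depth_map(depth, matrix):
    """Return (x, y, depth) triples after applying a 4x4 rigid matrix.

    depth  -- nested list, depth[v][u] is the depth at pixel (u, v)
    matrix -- 4x4 parameter matrix (assumed homogeneous layout)
    """
    out = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            x, y, z, _ = matvec(matrix, [float(u), float(v), float(d), 1.0])
            out.append((x, y, z))  # the Z coordinate is read back as the depth
    return out

depth_map = [[2.0, 2.5],
             [3.0, 4.0]]
# Illustrative parameters: move 1 along X and 0.5 along Z, no rotation.
shift = [[1, 0, 0, 1.0],
         [0, 1, 0, 0.0],
         [0, 0, 1, 0.5],
         [0, 0, 0, 1]]
result = transform_depth_map(depth_map, shift)
print(result)
```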

本明細書においては、説明の便宜上、対象画像から生成されたデプスマップを剛体変換して生成されたデプスマップを第１デプスマップといい、サンプル画像から生成されたデプスマップを第２デプスマップという。 In this specification, for convenience of explanation, the depth map obtained by applying the rigid transformation to the depth map generated from the target image is referred to as the first depth map, and the depth map generated from the sample image is referred to as the second depth map.

最後に、自己サンプルに含まれるピクセルの深度値と、剛体変換が行われた対象画像に含まれるピクセルの深度値を用いて、学習データを生成するステップ（Ｓ１４０）が行われる。 Finally, a step of generating learning data using the depth values of the pixels included in the self-samples and the depth values of the pixels included in the target image subjected to the rigid body transformation is performed (S140).

第１及び第２デプスマップは、同じ大きさ及び形状に形成され、同じ数のピクセルを含む。第１及び第２デプスマップは、互いに同じ座標情報を含むピクセルを含む。 The first and second depth maps are formed with the same size and shape and contain the same number of pixels. The first and second depth maps contain pixels that contain the same coordinate information as each other.

第１及び第２デプスマップのそれぞれに含まれるピクセルのうち、互いに同じ座標情報を含むピクセルに含まれる深度情報を用いて損失を算出する。このとき、第１及び第２デプスマップを構成する全てのピクセルのそれぞれに含まれる深度情報が活用される。 A loss is calculated using depth information included in pixels having the same coordinate information among pixels included in each of the first and second depth maps. At this time, depth information included in each of all pixels forming the first and second depth maps is used.

ここで、第１及び第２デプスマップに含まれる一部の深度値は、学習データの生成に活用されないこともある。具体的には、本発明は、サンプル画像を生成した後、マスクマップを生成するステップをさらに含む。 Here, some depth values included in the first and second depth maps may not be used to generate learning data. Specifically, the present invention further includes generating a mask map after generating the sample image.

対象画像から生成されたサンプル画像は、剛体変換が行われたものであるので、対象画像に対するズレが存在する。よって、サンプル画像には対象画像に対応するピクセルが存在しない領域が存在することがある。 Since the sample image generated from the target image has undergone rigid body transformation, there is a deviation from the target image. Therefore, the sample image may have areas where there are no pixels corresponding to the target image.

簡単に言えば、サンプル画像は、画像中の全てのオブジェクトが剛体であり、停止した状態でカメラのみ移動して対象画像を撮影した後のその撮影された画像であることを前提にする。カメラが移動する場合、カメラの視野から外れる領域が存在するので、本発明は、マスクマップを生成し、カメラの視野から外れた領域が深度推定学習に用いられないようにする。 Simply put, the sample image is assumed to be the image that would be captured if every object in the scene of the target image were a stationary rigid body and only the camera moved. When the camera moves, some areas fall outside its field of view, so the present invention generates a mask map to keep those areas from being used for depth estimation learning.

マスクマップは、対象画像及びサンプル画像に基づいて生成される。具体的には、後述するサンプル画像の生成時に一部のピクセルに損失が発生するので、基本値の色情報を含む新たなピクセルが生成される。 A mask map is generated based on the target image and the sample image. Specifically, some pixels are lost when the sample image is generated, as described later, so new pixels carrying default color information are generated in their place.

マスクマップは、対象画像と同じ大きさ及び形状に生成され、座標情報及びフィルタ情報を含む。マスクマップに含まれる座標情報は、対象画像に含まれる座標情報と同じである。それぞれの座標情報にはフィルタ情報がマッチングされる。前記フィルタ情報は、０又は１で定義された情報であり、今後の深度推定学習時にサンプル画像の一部の領域をフィルタリングするのに用いられる。 The mask map is generated to have the same size and shape as the target image, and includes coordinate information and filter information. The coordinate information included in the mask map is the same as the coordinate information included in the target image. Filter information is matched to each piece of coordinate information. The filter information is information defined as 0 or 1, and is used to filter a partial region of the sample image during depth estimation learning in the future.

一方、マスクマップは、サンプル画像と同じ大きさ及び形状に形成され、サンプル画像と同じ数のピクセルを含む。マスクマップ及びサンプル画像は、互いに対応するピクセルをそれぞれ含む。 On the other hand, the mask map is formed in the same size and shape as the sample image and contains the same number of pixels as the sample image. The mask map and the sample image each contain pixels that correspond to each other.

マスクマップを生成する際に、マスクマップに含まれるフィルタ情報は、サンプル画像に含まれるピクセルの種類に応じて決定される。具体的には、サンプル画像は、剛体変換された３次元ピクセルが投影されたピクセルと、新たに生成されたピクセルとを含む。前記投影されたピクセルと同じ座標情報を有するピクセルのフィルタ情報は１に設定する。それに対して、前記新たに生成されたピクセルと同じ座標情報を有するピクセルのフィルタ情報は０に設定する。前述した方式でマスクマップに含まれる全てのピクセルのフィルタ情報を定義することができる。 When generating the mask map, the filter information included in the mask map is determined according to the types of pixels included in the sample image. Specifically, the sample image includes pixels onto which rigid-transformed 3D pixels are projected and newly generated pixels. Filter information of pixels having the same coordinate information as the projected pixel is set to one. On the other hand, filter information of pixels having the same coordinate information as the newly generated pixel is set to zero. Filter information of all pixels included in the mask map can be defined in the manner described above.
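The rule above amounts to marking each coordinate with 1 when a projected pixel landed there and 0 when the pixel was newly generated. A minimal sketch, assuming the set of coordinates that received a projected pixel is already known (the function name and arguments are hypothetical):

```python
def build_mask_map(width, height, projected_coords):
    """Return a height x width grid of filter values.

    projected_coords -- iterable of (u, v) coordinates in the sample image
                        that were filled by a projected pixel; every other
                        coordinate is assumed to hold a newly generated pixel.
    """
    covered = set(projected_coords)
    return [[1 if (u, v) in covered else 0 for u in range(width)]
            for v in range(height)]

# Illustrative 3x2 sample image whose left column received no projected pixel.
mask = build_mask_map(3, 2, [(1, 0), (2, 0), (1, 1), (2, 1)])
print(mask)  # [[0, 1, 1], [0, 1, 1]]
```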

一実施形態においては、図４ａに示すように、マスクマップＭは、対象画像４１０及びサンプル画像４１０’により生成される。マスクマップＭは、２つの領域からなる。第一に、サンプル画像４１０’の全領域のうち、対象画像４１０に含まれるピクセルに対応するピクセルを含まない領域にマッチングされる第１領域Ｍ０である。第１領域Ｍ０に含まれるピクセルは、座標情報及び０で定義されたフィルタ情報を含む。第二に、サンプル画像４１０’の全領域のうち、対象画像４１０に含まれるピクセルに対応するピクセルを含む領域にマッチングされる第２領域Ｍ１である。第２領域Ｍ１に含まれるピクセルは、座標情報及び１で定義されたフィルタ情報を含む。 In one embodiment, as shown in FIG. 4a, the mask map M is generated from a target image 410 and a sample image 410'. The mask map M consists of two regions. The first region M0 is matched with the part of the sample image 410' that contains no pixels corresponding to pixels of the target image 410. Pixels in the first region M0 include coordinate information and filter information defined as 0. The second region M1 is matched with the part of the sample image 410' that contains pixels corresponding to pixels of the target image 410. Pixels in the second region M1 include coordinate information and filter information defined as 1.

前述したマスクマップは、本発明による深度推定学習のための損失の算出に活用される。 The mask map described above is used to calculate the loss for depth estimation learning according to the present invention.

具体的には、図４ｂに示すように、深度推定４３０により、対象画像４１０からデプスマップ４２０が生成される。デプスマップ４２０に剛体変換（Ｔ）を適用すると、第１デプスマップ４２０’が生成される。ここで、第１デプスマップ４２０’の生成に適用される剛体変換パラメータは、サンプル画像の生成に用いられる剛体変換パラメータと同じパラメータである。 Specifically, the depth estimation 430 produces a depth map 420 from the target image 410, as shown in FIG. 4b. Applying a rigid transformation (T) to depth map 420 produces a first depth map 420'. Here, the rigid transformation parameter applied to generate the first depth map 420' is the same as the rigid transformation parameter used to generate the sample image.

一方、深度推定４３０により、サンプル画像４１０’から第２デプスマップ４２０’’が生成される。ここで、第２デプスマップ４２０’’の生成に用いられる深度推定モデルとしては、対象画像４１０からデプスマップ４２０を生成する際に用いられる深度推定モデルと同じモデルを用いる。 Meanwhile, depth estimation 430 produces a second depth map 420'' from the sample image 410'. The depth estimation model used to generate the second depth map 420'' is the same model that is used to generate the depth map 420 from the target image 410.

その後、第１及び第２デプスマップ４２０’、４２０’’とマスクマップＭを活用して、等尺性の一貫性の損失（Ｉｓｏｍｅｔｒｉｃｃｏｎｓｉｓｔｅｎｃｙｌｏｓｓ）が算出される。 Then, using the first and second depth maps 420' and 420'' and the mask map M, isometric consistency loss is calculated.

一実施形態において、第１及び第２デプスマップを用いた損失の算出は、下記数式２のように行われる。 In one embodiment, loss calculation using the first and second depth maps is performed as in Equation 2 below.

（数式２）
上記数式２において、Ｄｓｅｌｆ（上付き文字を含む）とは、第２デプスマップを構成するピクセルのうち、特定の座標（ｕ，ｖ）に存在するピクセルにマッチングされた深度値を意味し、Ｄｓｅｌｆ（上付き文字を含まない）とは、第１デプスマップを構成するピクセルのうち、特定の座標（ｕ，ｖ）に存在するピクセルにマッチングされた深度値を意味する。

(Formula 2)
In Equation 2, Dself with a superscript denotes the depth value matched to the pixel at specific coordinates (u, v) among the pixels constituting the second depth map, and Dself without a superscript denotes the depth value matched to the pixel at the same coordinates (u, v) among the pixels constituting the first depth map.

また、上記数式２において、ｋは、１つの対象画像から生成され、学習に用いられた自己サンプルの数であり、ｔは、学習に用いられた対象画像の数である。 In Equation 2 above, k is the number of self-samples generated from one target image and used for learning, and t is the number of target images used for learning.

一方、上記数式２において、Ｖは、前述したマスクマップに対応する変数であり、特定の座標（ｕ，ｖ）にマッチングされるフィルタ情報値である。Ｖは、０又は１である。 Meanwhile, in Equation 2, V is a variable corresponding to the mask map described above, and is the filter information value matched to specific coordinates (u, v). V is 0 or 1.
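Equation 2 itself is not reproduced in this text, so the following is only a sketch of how the quantities defined above could be combined for a single target image and a single self-sample. The absolute-difference form and the normalization by the number of unmasked pixels are assumptions, not taken from the source, and the function name is hypothetical.

```python
def masked_depth_loss(first, second, mask):
    """Mean absolute depth difference over pixels whose filter value is 1.

    first  -- first depth map (rigid-transformed depth map of the target image)
    second -- second depth map (estimated from the sample image)
    mask   -- mask map V with values 0 or 1, same shape as the depth maps
    """
    total, count = 0.0, 0
    for row1, row2, row_mask in zip(first, second, mask):
        for d1, d2, v in zip(row1, row2, row_mask):
            if v == 1:  # pixels with V = 0 do not contribute
                total += abs(d2 - d1)
                count += 1
    return total / count if count else 0.0

first_map = [[1.0, 2.0], [3.0, 4.0]]
second_map = [[1.5, 2.0], [9.0, 3.0]]
mask_map = [[1, 1], [0, 1]]
loss = masked_depth_loss(first_map, second_map, mask_map)
print(loss)  # 0.5; the large difference at the masked pixel is ignored
```

Summing this quantity over the k self-samples of each of the t target images would give a batch-level value.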

前述したように、本発明によるサンプル画像は、全ての領域が静的領域からなるので、サンプル画像において深度推定正確度を低減させる動的領域をフィルタリングする必要がなくなる。よって、本発明は、サンプル画像全体を学習に活用することができる。 As described above, the sample image according to the present invention consists entirely of static regions, so there is no need to filter out dynamic regions, which would otherwise reduce depth estimation accuracy. The present invention can therefore use the entire sample image for learning.

以下、前述した学習データの生成に活用されるサンプル画像生成方法についてより具体的に説明する。 The sample image generation method used to generate the learning data described above will be described in more detail below.

図５は本発明によるサンプル画像生成方法を説明するためのフローチャートであり、図６ａ～図６ｃは本発明によるサンプル画像生成方法を示す概念図である。 FIG. 5 is a flow chart for explaining the sample image generation method according to the present invention, and FIGS. 6a to 6c are conceptual diagrams showing the sample image generation method according to the present invention.

まず、図５に示すように、本発明によるサンプル画像生成方法においては、対象画像に含まれるピクセルの深度値を推定するステップ（Ｓ２１０）が行われる。 First, as shown in FIG. 5, in the sample image generating method according to the present invention, a step (S210) of estimating depth values of pixels included in the target image is performed.

深度推定ステップにおいて、画像を撮影するカメラの視点を基準として画像に含まれるオブジェクトとの距離が算出される。当該距離値は、対象画像に含まれるピクセル毎に算出される。すなわち、深度推定ステップにおいて、深度対象画像に含まれるピクセルが示すオブジェクトと基準視点間の距離が算出される。 In the depth estimation step, the distance to each object in the image is calculated with the viewpoint of the capturing camera as the reference. The distance value is calculated for each pixel of the target image. That is, the depth estimation step calculates the distance between the reference viewpoint and the object represented by each pixel of the target image.

深度推定ステップが行われると、デプスマップが生成される。深度推定モデルとしては、公知の様々なモデルを用いることができる。例えば、非特許文献１に開示された深度推定モデルが用いられてもよいが、これに限定されるものではない。 Once the depth estimation step is performed, a depth map is generated. Various known models can be used as the depth estimation model. For example, the depth estimation model disclosed in Non-Patent Document 1 may be used, but it is not limited to this.

一実施形態においては、図６ａに示すように、対象画像６１０に対する深度推定６３０によりデプスマップ６２０が生成される。デプスマップ６２０は、ピクセルにマッチングされた深度情報に応じて異なる色で表現した画像であり、対象画像に含まれるオブジェクトのそれぞれの深度を可視化する。対象画像６１０及びデプスマップ６２０は、サンプル画像の生成に活用される。 In one embodiment, a depth map 620 is generated by depth estimation 630 for a target image 610, as shown in FIG. 6a. The depth map 620 is an image expressed in different colors according to the depth information matched to the pixels, and visualizes the depth of each object included in the target image. Target image 610 and depth map 620 are utilized to generate a sample image.

次に、対象画像に含まれるピクセルの深度値により、対象画像に含まれるピクセルを３次元空間上にマッピングするステップ（Ｓ２２０）が行われる。 Next, a step of mapping the pixels included in the target image onto a three-dimensional space based on the depth values of the pixels included in the target image (S220) is performed.

対象画像は、対象画像を構成するピクセル座標情報及び色情報を含む。対象画像は、２次元画像であるので、ピクセル座標情報は、２つの軸に関する座標情報を含む。説明の便宜上、対象画像に含まれるピクセル座標情報は、Ｘ軸座標情報及びＹ軸座標情報を含むと説明する。 The target image includes pixel coordinate information and color information that make up the target image. Since the target image is a two-dimensional image, the pixel coordinate information includes coordinate information regarding two axes. For convenience of explanation, the pixel coordinate information included in the target image will be described as including X-axis coordinate information and Y-axis coordinate information.

前述したステップＳ２１０で生成されたデプスマップは、ピクセル座標情報及び深度情報を含む。デプスマップに含まれるピクセル座標情報は、２つの軸に関する座標情報を含む。説明の便宜上、対象画像に含まれるピクセル座標情報は、Ｘ軸座標情報及びＹ軸座標情報を含むと説明する。 The depth map generated in step S210 described above includes pixel coordinate information and depth information. The pixel coordinate information included in the depth map includes coordinate information regarding two axes. For convenience of explanation, the pixel coordinate information included in the target image will be described as including X-axis coordinate information and Y-axis coordinate information.

特定の対象画像から生成されたデプスマップは、対象画像に含まれる特定のピクセルの座標情報及び当該ピクセルの深度情報を含む。ステップＳ２２０において、デプスマップに含まれる深度情報を対象画像にマッチングする。対象画像及びデプスマップのそれぞれは、互いに同じ座標情報を含む。特定の座標情報に対応する深度情報は、前記特定の座標情報と同じ座標情報に対応するピクセルにマッチングされる。よって、３次元画像が生成される。 A depth map generated from a specific target image includes coordinate information of specific pixels included in the target image and depth information of the pixels. In step S220, the depth information included in the depth map is matched to the target image. Each of the target image and the depth map includes the same coordinate information as each other. Depth information corresponding to specific coordinate information is matched to pixels corresponding to the same coordinate information as the specific coordinate information. A three-dimensional image is thus generated.

一方、３次元画像を生成する際に、カメラキャリブレーション（ｃａｍｅｒａｃａｌｉｂｒａｔｉｏｎ）過程が行われるようにしてもよい。具体的には、２次元画像を３次元画像に変換する過程で、所定のパラメータを有するマトリクスが対象画像を構成するピクセルのそれぞれに適用されるようにしてもよい。前記所定のパラメータは、ピンホールカメラのモデルによって異なる。 Meanwhile, a camera calibration process may be performed when generating a 3D image. Specifically, in the process of transforming a two-dimensional image into a three-dimensional image, a matrix having predetermined parameters may be applied to each pixel forming the target image. The predetermined parameter differs depending on the model of the pinhole camera.

ここで、カメラキャリブレーションのための所定のパラメータは、カメラ外部パラメータ（ｅｘｔｒｉｎｓｉｃｐａｒａｍｅｔｅｒ）及びカメラ内部パラメータ（ｉｎｔｒｉｎｓｉｃｐａｒａｍｅｔｅｒ）の少なくとも一方を含んでもよい。ここで、カメラ内部パラメータは、焦点距離（ｆｏｃａｌｌｅｎｇｔｈ）、主点（ｐｒｉｎｃｉｐａｌｐｏｉｎｔ）及び非対称係数（ｓｋｅｗｃｏｅｆｆｉｃｉｅｎｔ）の少なくとも１つに関するパラメータを含んでもよい。前記カメラキャリブレーションのための所定のパラメータは、２次元画像及び３次元画像のいずれか一方から他方に変換する際に適用されるようにしてもよい。 Here, the predetermined parameters for camera calibration may include at least one of camera extrinsic parameters and camera intrinsic parameters. Here, the camera intrinsic parameters may include parameters related to at least one of focal length, principal point and skew coefficient. The predetermined parameters for camera calibration may be applied when converting from one of the two-dimensional image and the three-dimensional image to the other.
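The source does not give the calibration formulas. As a hedged illustration, the sketch below uses the standard pinhole camera model, in which the intrinsic parameters are the focal lengths (`fx`, `fy`), the principal point (`cx`, `cy`), and a skew coefficient. The parameter values and function names are assumptions, and extrinsic parameters are omitted.

```python
def back_project(u, v, depth, fx, fy, cx, cy, skew=0.0):
    """Lift pixel (u, v) with a known depth to camera coordinates (X, Y, Z)."""
    y = (v - cy) / fy * depth
    x = ((u - cx) - skew * (v - cy) / fy) / fx * depth
    return (x, y, depth)

def project(point, fx, fy, cx, cy, skew=0.0):
    """Project a camera-space point (X, Y, Z) back to pixel coordinates (u, v)."""
    x, y, z = point
    u = fx * x / z + skew * y / z + cx
    v = fy * y / z + cy
    return (u, v)

# Illustrative intrinsics for a 640x480 image.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point = back_project(420.0, 140.0, 2.0, **K)
pixel = project(point, **K)
print(point, pixel)  # projecting the lifted point recovers the original pixel
```

The same parameters serve both directions, which matches the statement that they may be applied when converting from either representation to the other.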

前述した方式で生成された３次元画像は、ピクセル座標情報及び色情報を含む。３次元画像に含まれるピクセル座標情報は、３つの軸に関する座標情報を含む。説明の便宜上、対象画像に含まれるピクセル座標情報は、Ｘ軸座標情報、Ｙ軸座標情報及びＺ軸座標情報を含むと説明する。前記３次元画像に含まれるＸ軸座標情報及びＹ軸座標情報は、対象画像に含まれる座標情報であり、Ｚ軸座標情報は、デプスマップに含まれる深度情報である。 A 3D image generated by the above method includes pixel coordinate information and color information. Pixel coordinate information contained in a three-dimensional image includes coordinate information regarding three axes. For convenience of explanation, the pixel coordinate information included in the target image will be described as including X-axis coordinate information, Y-axis coordinate information, and Z-axis coordinate information. The X-axis coordinate information and Y-axis coordinate information included in the three-dimensional image are coordinate information included in the target image, and the Z-axis coordinate information is depth information included in the depth map.

前述した方式で生成された３次元画像に含まれる座標情報を３次元空間上にマッピングする場合、３次元画像を構成するピクセルのそれぞれは３次元空間上にマッピングされる。 When the coordinate information included in the 3D image generated by the method described above is mapped onto the 3D space, each pixel forming the 3D image is mapped onto the 3D space.

一実施形態においては、図６ａに示すように、対象画像６１０に含まれるピクセルｐ１に、ピクセルｐ１に対応する深度値をマッピングする。よって、Ｘ－Ｙ平面上に位置していたピクセルｐ１が３次元空間上にリフティング（Ｌ）される。対象画像６１０に含まれる全てのピクセルに深度値をマッピングすることにより、３次元画像を生成することができる。 In one embodiment, a depth value corresponding to pixel p1 is mapped to pixel p1 contained in target image 610, as shown in FIG. 6a. Therefore, the pixel p1 located on the XY plane is lifted (L) onto the three-dimensional space. By mapping depth values to all pixels contained in target image 610, a three-dimensional image can be generated.
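The lifting step can be sketched as pairing each pixel's two image coordinates with its depth value as the third coordinate while carrying the color along. This minimal example assumes nested lists for the image and the depth map and skips camera calibration; `lift_image` is a hypothetical name.

```python
def lift_image(colors, depth):
    """Return a list of ((x, y, z), color) entries, one per pixel.

    colors -- nested list, colors[y][x] is the color of pixel (x, y)
    depth  -- nested list of the same shape holding the depth of each pixel
    """
    points = []
    for y, (color_row, depth_row) in enumerate(zip(colors, depth)):
        for x, (color, d) in enumerate(zip(color_row, depth_row)):
            # X and Y come from the image, Z comes from the depth map.
            points.append(((float(x), float(y), float(d)), color))
    return points

lifted = lift_image([["r", "g"]], [[1.5, 2.5]])
print(lifted)  # [((0.0, 0.0, 1.5), 'r'), ((1.0, 0.0, 2.5), 'g')]
```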

次に、３次元空間上にマッピングされたピクセルに対して、予め設定されたパラメータで剛体変換を行うステップ（Ｓ２３０）が行われる。 Next, a step (S230) of performing rigid body transformation with preset parameters on the pixels mapped on the three-dimensional space is performed.

３次元空間上にマッピングされたピクセルのそれぞれは、座標情報を含むベクトルからなるようにしてもよい。すなわち、それぞれのピクセルは、Ｘ軸座標情報、Ｙ軸座標情報及びＺ軸座標情報を含むベクトルからなるようにしてもよい。 Each of the pixels mapped on the three-dimensional space may consist of a vector containing coordinate information. That is, each pixel may consist of a vector containing X-axis coordinate information, Y-axis coordinate information, and Z-axis coordinate information.

３次元上で剛体変換を行う際に用いられるパラメータ行列は４×４の形である。具体的には、平行移動のためのパラメータ行列は、移動しようとするＸ軸方向の距離値、Ｙ軸方向の距離値、及びＺ軸方向の距離値を含む４×４行列からなる。一方、回転のためのパラメータ行列は、回転の基準となる軸毎に異なる行列を含んでもよい。 The parameter matrix used for rigid transformation in three dimensions is a 4×4 matrix. Specifically, the parameter matrix for translation is a 4×4 matrix containing the distance to move along the X axis, the distance to move along the Y axis, and the distance to move along the Z axis. The parameter matrix for rotation, on the other hand, may include a different matrix for each axis of rotation.

例えば、回転のためのパラメータ行列は、Ｘ軸を基準とする回転角度情報を含む４×４行列、Ｙ軸を基準とする回転角度情報を含む４×４行列、及びＺ軸を基準とする回転角度情報を含む４×４行列を含む。 For example, the parameter matrices for rotation include a 4×4 matrix containing rotation angle information about the X axis, a 4×4 matrix containing rotation angle information about the Y axis, and a 4×4 matrix containing rotation angle information about the Z axis.

前記ピクセルベクトルのそれぞれに前記パラメータ行列を予め設定された順序で掛けてＸ軸座標情報、Ｙ軸座標情報及びＺ軸座標情報を含むベクトルを算出することができる。 A vector including X-axis coordinate information, Y-axis coordinate information, and Z-axis coordinate information may be calculated by multiplying each of the pixel vectors by the parameter matrix in a predetermined order.
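The order of multiplication matters because matrix products do not commute in general. The sketch below builds the per-axis rotation matrices and a translation matrix in the usual homogeneous form and composes them in a chosen order. The matrix layout and the helper names are assumptions, since the source does not list the matrix entries.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def matvec(m, v):
    """Multiply a 4x4 matrix by a homogeneous column vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def rotation_x(t):
    c, s = math.cos(t), math.sin(t)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def rotation_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]

def rotation_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def compose(*matrices):
    """Combine matrices so that the first argument is applied to the vector first."""
    result = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    for m in matrices:
        result = matmul(m, result)
    return result

p = [1.0, 0.0, 0.0, 1.0]
rotate_then_move = matvec(compose(rotation_z(math.pi / 2), translation(1, 0, 0)), p)
move_then_rotate = matvec(compose(translation(1, 0, 0), rotation_z(math.pi / 2)), p)
print(rotate_then_move)  # approximately (1, 1, 0)
print(move_then_rotate)  # approximately (0, 2, 0)
```

Fixing the order in advance, as the text describes, keeps the sample image and the first depth map consistent with each other.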

剛体変換されたベクトルは、特定のピクセルに対して剛体変換を行った場合、特定のピクセルの新たな座標情報を含む。３次元画像を構成する全てのピクセルに対して剛体変換を行うと、新たな３次元画像を生成することができる。前述した方法で生成された３次元画像に含まれるピクセルは、原本の３次元画像に含まれるピクセルと同じ色情報を含み、異なる座標情報を含む。 When a rigid transformation is performed on a specific pixel, the resulting vector contains the new coordinate information of that pixel. Performing the rigid transformation on all pixels of the three-dimensional image produces a new three-dimensional image. The pixels of the three-dimensional image generated in this way carry the same color information as the corresponding pixels of the original three-dimensional image, but different coordinate information.

一実施形態においては、図６ｂに示すように、３次元空間上に位置するピクセルｐ１に対して剛体変換（Ｔ）を行うと、ピクセルｐ１の３次元空間上の座標が変更される。剛体変換が行われたピクセルｐ１’は、既存のピクセルｐ１と同じ色情報を含み、異なる座標情報を含む。 In one embodiment, as shown in FIG. 6b, when a rigid transformation (T) is performed on a pixel p1 located in the three-dimensional space, the coordinates of the pixel p1 in the three-dimensional space are changed. A pixel p1' subjected to rigid body transformation contains the same color information as the existing pixel p1, but contains different coordinate information.

最後に、剛体変換が行われたピクセルを２次元平面に投影して自己サンプルを生成するステップ（Ｓ２４０）が行われる。 Finally, the step of projecting the rigid-transformed pixels onto a two-dimensional plane to generate self-samples (S240) is performed.

剛体変換が行われたピクセルは、予め設定された平面上に投影される。具体的には、前記予め設定された平面は、対象画像を３次元空間上にリフティングする際に対象画像が配置される平面であり得る。 The pixels that have been rigidly transformed are projected onto a preset plane. Specifically, the preset plane may be a plane on which the target image is arranged when the target image is lifted onto the three-dimensional space.

例えば、対象画像をＸ－Ｙ平面上に配置した状態で対象画像に深度値をマッピングした場合、剛体変換された画像が投影される平面はＸ－Ｙ平面であり得る。 For example, if the target image is placed on the XY plane and depth values are mapped onto the target image, the plane onto which the rigid transformed image is projected may be the XY plane.

一方、前記投影過程において、カメラキャリブレーション過程が行われるようにしてもよい。カメラキャリブレーションについては前述したので、具体的な説明は省略する。 Meanwhile, a camera calibration process may be performed in the projection process. Since the camera calibration has been described above, a detailed description will be omitted.

剛体変換が行われたピクセルは、３次元空間上の座標を定義する座標情報及び色情報を含む。具体的には、剛体変換が行われたピクセルは、Ｘ軸座標情報、Ｙ軸座標情報及びＺ軸座標情報を含む。 A pixel that has undergone rigid body transformation includes coordinate information and color information that define coordinates in a three-dimensional space. Specifically, the rigid-transformed pixels include X-axis coordinate information, Y-axis coordinate information, and Z-axis coordinate information.

剛体変換が行われたピクセルを投影する際に、前記座標情報を構成するＸ軸座標情報、Ｙ軸座標情報及びＺ軸座標情報のいずれかが削除されるようにしてもよい。例えば、剛体変換された画像をＸ－Ｙ平面に投影する場合、剛体変換が行われたピクセルに含まれるＺ軸座標情報が削除される。 Any one of the X-axis coordinate information, the Y-axis coordinate information, and the Z-axis coordinate information constituting the coordinate information may be deleted when projecting the pixels subjected to the rigid body transformation. For example, when projecting a rigid-transformed image onto the XY plane, the Z-axis coordinate information contained in the rigid-transformed pixels is deleted.

剛体変換が行われたピクセルに含まれるＸ軸座標情報、Ｙ軸座標情報及びＺ軸座標情報のいずれかが削除されることにより、ピクセルが２次元平面上に配置される。剛体変換された画像に含まれる全てのピクセルを予め設定された平面上に投影した後、サンプル画像を生成する。 Pixels are arranged on a two-dimensional plane by deleting any of the X-axis coordinate information, Y-axis coordinate information, and Z-axis coordinate information included in the pixels subjected to rigid body transformation. A sample image is generated after projecting all pixels contained in the rigid-transformed image onto a preset plane.

ここで、剛体変換された３次元画像が投影される領域は、既存の対象画像が存在する領域とは異なる。サンプル画像は、既存の対象画像が存在する領域を基準として生成される。既存の対象画像が存在する領域外に投影された情報は、サンプル画像の生成に用いられない。 Here, the area onto which the rigid-transformed three-dimensional image is projected differs from the area in which the existing target image exists. A sample image is generated with reference to an area in which an existing target image exists. Information projected outside the existing target image is not used to generate the sample image.

対象画像と剛体変換された３次元画像とは同じ数のピクセルを含むので、剛体変換された３次元画像に含まれる一部のピクセルをサンプル画像の生成に活用しない場合、一部のピクセルの損失が発生する。 The target image and the rigid-transformed three-dimensional image contain the same number of pixels, so when some pixels of the rigid-transformed three-dimensional image are not used to generate the sample image, those pixels are lost.

このため、サンプル画像は、２種類のピクセルからなる。具体的には、サンプル画像は、剛体変換が行われたピクセルを投影して生成されたピクセルと、サンプル画像の生成時に新たに生成されたピクセルとを含んでもよい。 Therefore, the sample image consists of two types of pixels. Specifically, the sample image may include pixels generated by projecting pixels that have undergone rigid transformation, and pixels newly generated when the sample image is generated.

剛体変換が行われたピクセルを投影して生成されたピクセルは、既存のピクセルに含まれる色情報をそのまま含む。投影されたピクセルのみでサンプル画像を形成する場合、対象画像とはピクセルの数が異なるようになる。このため、サンプル画像の生成時にピクセルが投影されない領域に新たなピクセルを形成する。新たに生成されたピクセルは、予め設定された色情報（例えば、黒色又は白色に対応する色情報）を含む。 A pixel generated by projecting a pixel subjected to rigid body transformation includes color information contained in the existing pixel as it is. If only the projected pixels form the sample image, it will have a different number of pixels than the target image. Therefore, new pixels are formed in areas where no pixels were projected when the sample image was generated. The newly generated pixels contain preset color information (eg, color information corresponding to black or white).

例えば、対象画像がＸ－Ｙ平面に配置され、０＜Ｘ＜Ａ、０＜Ｙ＜Ｂ領域に配置されるとすると、剛体変換された後にＸ－Ｙ平面上に投影されるピクセルのうち、０＜Ｘ＜Ａ、０＜Ｙ＜Ｂ領域に投影されるピクセルのみサンプル画像の生成に活用され、前記領域から外れて投影されるピクセルはサンプル画像の生成に活用されない。前記領域のうち、ピクセルが投影されない地点には、新たなピクセルが生成される。 For example, if the target image is placed on the XY plane and placed in the regions 0<X<A and 0<Y<B, among the pixels projected onto the XY plane after rigid body transformation, Only pixels projected into the 0<X<A, 0<Y<B region are used to generate the sample image, and pixels projected outside the region are not used to generate the sample image. New pixels are generated at those points in the area where no pixels are projected.
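The projection and fill rule in this example can be sketched as follows. Several details are assumptions added for illustration and are not stated in the source: coordinates are rounded to the nearest pixel index, the valid region is taken as `0 <= x < width` and `0 <= y < height`, and when two pixels land on the same coordinate the one with the smaller Z value is kept.

```python
def project_to_sample(points, width, height, fill=(0, 0, 0)):
    """Project ((x, y, z), color) entries onto the X-Y plane.

    Pixels that fall outside the original image area are discarded, and
    positions that receive no pixel keep the preset fill color.
    """
    image = [[fill] * width for _ in range(height)]
    nearest = {}  # smallest Z seen so far at each covered coordinate
    for (x, y, z), color in points:
        u, v = round(x), round(y)  # drop Z and snap to the pixel grid
        if not (0 <= u < width and 0 <= v < height):
            continue  # projected outside the target image area
        if (u, v) not in nearest or z < nearest[(u, v)]:
            nearest[(u, v)] = z
            image[v][u] = color
    return image

# A 2x2 image whose pixels were all shifted by +1 along X.
moved = [((1.0, 0.0, 2.0), "a"), ((2.0, 0.0, 2.0), "b"),
         ((1.0, 1.0, 2.0), "c"), ((2.0, 1.0, 2.0), "d")]
sample = project_to_sample(moved, 2, 2, fill="-")
print(sample)  # [['-', 'a'], ['-', 'c']]
```

In the output, the right column holds projected pixels, the left column holds newly generated fill pixels, and pixels "b" and "d" are lost because they fall outside the area.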

一実施形態においては、図６ｃに示すように、剛体変換が行われたピクセルｐ１’は、最初に対象画像が配置されたＸ－Ｙ平面に投影（Ｐ）される。よって、２次元上に投影されたピクセルｐ１’’は、対象画像に含まれるピクセルと同じ色情報及び異なる座標情報を含む。剛体変換された３次元画像に含まれる全てのピクセルをＸ－Ｙ平面上に投影する場合、２次元サンプル画像６１０’が生成される。ここで、対象画像が配置された領域Ａに投影されたピクセルのみサンプル画像の生成に活用され、他のピクセルはサンプル画像の生成に活用されない。 In one embodiment, as shown in FIG. 6c, the rigid-transformed pixel p1' is projected (P) onto the X-Y plane on which the target image was originally placed. The projected two-dimensional pixel p1'' therefore carries the same color information as the corresponding pixel of the target image, but different coordinate information. Projecting all pixels of the rigid-transformed three-dimensional image onto the X-Y plane produces a two-dimensional sample image 610'. Here, only the pixels projected onto the area A where the target image is placed are used to generate the sample image; the other pixels are not used.

前述したステップＳ２３０で剛体変換時に適用されるパラメータに応じて異なるサンプル画像が生成される。本発明は、対象画像に複数の剛体変換パラメータを適用して複数のサンプル画像を生成する。 Different sample images are generated according to the parameters applied during the rigid body transformation in step S230 described above. The present invention applies multiple rigid transformation parameters to a target image to generate multiple sample images.

前述したように、本発明は、画像中の全てのオブジェクトが剛体からなり、停止した状態を維持するサンプル画像を生成し、機械学習データの生成に活用する。以下、本発明による深度推定学習のための学習データの生成について具体的に説明する。 As described above, the present invention generates a sample image in which all objects in the image are rigid bodies and maintains a stationary state, and utilizes it for generating machine learning data. Generation of learning data for depth estimation learning according to the present invention will be specifically described below.

The isometric consistency loss calculated in the manner described above can be used in setting a loss function for depth estimation learning.

In one embodiment, the loss function using the isometric consistency loss calculated according to the present invention is set as in Equation 3 below.

(Equation 3)
In Equation 3, Lp is the photometric loss calculated in the manner described with reference to FIGS. 7a and 7b, Ls is the smoothness loss, and Lissg is the isometric consistency loss calculated by the method according to the present invention. Since the photometric loss and the smoothness loss are well-known loss functions, a detailed description of them is omitted.
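The text of Equation 3 is not reproduced here, so the following is only a hedged sketch of how the three terms might be combined. It assumes a weighted sum, with hypothetical weights `w_s` and `w_issg`, and assumes that the isometric consistency loss is a mean absolute difference between corresponding depth values, optionally restricted by a mask map. Neither assumption is confirmed by the source.

```python
from typing import List, Optional

DepthMap = List[List[float]]
MaskMap = List[List[bool]]


def isometric_consistency_loss(sample_depth: DepthMap,
                               transformed_depth: DepthMap,
                               mask: Optional[MaskMap] = None) -> float:
    """Mean absolute depth difference over corresponding pixels.

    Assumption: an L1 difference. Pixels whose mask entry is False are
    excluded, mirroring the role of the mask map M.
    """
    total, count = 0.0, 0
    for r, (row_s, row_t) in enumerate(zip(sample_depth, transformed_depth)):
        for c, (d_s, d_t) in enumerate(zip(row_s, row_t)):
            if mask is not None and not mask[r][c]:
                continue
            total += abs(d_s - d_t)
            count += 1
    return total / count if count else 0.0


def total_loss(l_p: float, l_s: float, l_issg: float,
               w_s: float = 1e-3, w_issg: float = 1.0) -> float:
    """Hypothetical weighted sum of the photometric (Lp), smoothness (Ls)
    and isometric consistency (Lissg) losses."""
    return l_p + w_s * l_s + w_issg * l_issg
```

The default weights are placeholders for illustration only; suitable values would have to be chosen for the model and data at hand.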

Loss data is calculated based on the depth map generated from the target image and the depth map generated from the sample image, and the weights used for depth estimation are then updated based on the loss data. After the weights are updated, the depth estimation model with the updated weights is used to regenerate the depth map of the target image; that depth map is used to regenerate the sample image, and a depth map is regenerated from the sample image. The loss data is then recalculated based on the depth map of the target image and the depth map of the sample image. These operations are repeated until the number of learning data generations reaches a preset number, which must be large enough to ensure the reliability of the depth estimation model. In this way, the optimal weights for depth estimation can be found.
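The iteration described above can be outlined as a loop over caller-supplied steps. This is a structural sketch only: every callable (`estimate_depth`, `make_sample`, `transform_depth`, `compute_loss`, `update_weights`) is a hypothetical placeholder, and a real system would typically update the weights with a gradient-based optimizer rather than from the scalar loss alone. Following claim 2, the target depth map is rigidly transformed before it is compared with the depth map of the sample image.

```python
from typing import Any, Callable, List, Tuple


def train_depth_model(weights: Any,
                      target_image: Any,
                      estimate_depth: Callable[[Any, Any], Any],
                      make_sample: Callable[[Any, Any], Any],
                      transform_depth: Callable[[Any], Any],
                      compute_loss: Callable[[Any, Any], float],
                      update_weights: Callable[[Any, float], Any],
                      num_iterations: int) -> Tuple[Any, List[float]]:
    """Repeat depth estimation, sample generation, and weight update."""
    losses: List[float] = []
    for _ in range(num_iterations):
        # Depth map of the target image with the current weights.
        target_depth = estimate_depth(weights, target_image)
        # Sample image regenerated from the target image and its depth map.
        sample_image = make_sample(target_image, target_depth)
        # Depth map regenerated from the sample image.
        sample_depth = estimate_depth(weights, sample_image)
        # Target depth map after the same rigid transformation.
        reference_depth = transform_depth(target_depth)
        loss = compute_loss(sample_depth, reference_depth)
        losses.append(loss)
        weights = update_weights(weights, loss)
    return weights, losses
```

Returning the per-iteration losses alongside the final weights makes it easy to check whether the preset number of iterations was large enough for the loss to settle.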

As described above, because a sample image generated according to the present invention is generated from the target image, every object in the sample image is rigid and stationary. Accordingly, when depth estimation learning is performed using such sample images, there is no need to filter out non-rigid or moving objects, and the entire area of the target image can be used for depth estimation learning.

Meanwhile, the present invention described above can be implemented as a program that is executed by one or more processes on a computer and that can be stored in a computer-readable medium (or recording medium).

The present invention can also be implemented as computer-readable code or instructions on a program recording medium. That is, the present invention can be provided in the form of a program.

Meanwhile, the computer-readable medium includes any type of recording device in which data readable by a computer system is stored. Examples of computer-readable media include an HDD (Hard Disk Drive), SSD (Solid State Disk), SDD (Silicon Disk Drive), ROM, RAM, CD-ROM, magnetic tape, floppy (registered trademark) disk, and optical data storage device.

The computer-readable medium may also be a server or cloud storage that includes storage and that an electronic device can access through communication. In this case, the computer can download the program according to the present invention from the server or cloud storage via wired or wireless communication.

Furthermore, in the present invention, the computer described above is an electronic device equipped with a processor, that is, a central processing unit (CPU), and its type is not particularly limited.

The foregoing detailed description is illustrative and should not be construed as limiting in any respect. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention fall within its scope.

200 Vehicle
210 Camera
300 Depth estimation system
310 Communication unit
320 Storage unit
330 Control unit
340 Database (DB)
410 Target image
410' Sample image
420 Depth map
420' First depth map
420'' Second depth map
430 Depth estimation
610 Target image
610' Two-dimensional sample image
620 Depth map
630 Depth estimation
710 First target image
730 Depth estimation
740 Second target image
740' Inverse rigid transformation image
A Region where the target image is placed
M Mask map
M0 First region
M1 Second region

Claims

1. A learning data generation method for machine learning, comprising:
generating a sample image using a target image and a depth map of the target image;
estimating depth values of pixels included in the sample image to generate a depth map for the sample image; and
generating at least part of learning data using the depth map of the target image and the depth map generated from the sample image.

2. The learning data generation method for machine learning according to claim 1, further comprising mapping pixels constituting the depth map generated from the target image onto a three-dimensional space and then performing a rigid transformation on the mapped pixels,
wherein generating at least part of the learning data uses the depth map generated from the sample image and the rigid-transformed depth map.

3. The learning data generation method for machine learning according to claim 2, wherein at least part of the learning data includes loss data, and
generating at least part of the learning data comprises calculating the loss data based on depth information of mutually corresponding pixels in the depth map generated from the sample image and in the rigid-transformed depth map.

4. The learning data generation method for machine learning according to claim 3, wherein generating the sample image comprises:
estimating depth values of pixels included in the target image;
mapping the pixels included in the target image onto a three-dimensional space based on the depth values of the pixels included in the target image;
performing a rigid transformation with preset parameters on the pixels mapped onto the three-dimensional space; and
projecting the rigid-transformed pixels onto a two-dimensional plane to generate the sample image.

5. The learning data generation method for machine learning according to claim 4, wherein the rigid transformation of the depth map generated from the target image uses the same parameters as the preset parameters.

further comprising generating a mask map using the target image and the sample image;
The loss data is
6. The depth map of the sample image is generated based on a remaining area of the entire area of the depth map subjected to rigid body transformation, excluding an area excluded based on the mask map. 3. The learning data generation method for machine learning according to any one of 1.

7. A sample image generation method for machine learning, comprising:
estimating depth values of pixels included in a target image;
mapping the pixels included in the target image onto a three-dimensional space based on the depth values of the pixels included in the target image;
performing a rigid transformation with preset parameters on the pixels mapped onto the three-dimensional space; and
projecting the rigid-transformed pixels onto a two-dimensional plane to generate a sample image.

8. The method of claim 7, wherein the sample image has the same shape and size as the target image and includes the same number of pixels as the target image.

9. The method of claim 8, wherein each pixel included in the target image includes color information, and each pixel projected onto the two-dimensional plane includes the color information of the corresponding pixel of the target image.

10. The method of claim 9, wherein the sample image includes, of the pixels projected onto the two-dimensional plane, only those projected into a preset region.

11. The sample image generation method for machine learning according to claim 10, wherein the sample image comprises a plurality of pixels formed by projecting the rigid-transformed pixels and a plurality of pixels newly generated when the sample image is generated.

12. The method of claim 11, wherein the newly generated pixels all include the same preset color information.

13. The sample image generation method for machine learning according to claim 8, wherein the preset parameters include a parameter relating to at least one of rotation and translation.

14. A learning data generation system for machine learning, comprising:
a storage unit that stores a target image; and
a control unit that estimates depth values of pixels included in the target image to generate a depth map for the target image,
wherein the control unit:
generates a sample image using the target image and the depth map;
estimates depth values of pixels included in the sample image to generate a depth map for the sample image;
performs a rigid transformation on the depth map generated from the target image; and
generates learning data using the depth map generated from the sample image and the rigid-transformed depth map.

15. A sample image generation system for machine learning, comprising:
a storage unit that stores a target image; and
a control unit that estimates depth values of pixels included in the target image, maps the pixels included in the target image onto a three-dimensional space based on those depth values, performs a rigid transformation with preset parameters on the mapped pixels, and projects the rigid-transformed pixels onto a two-dimensional plane to generate a sample image.

16. A computer program comprising a plurality of instructions that, when executed, perform the steps of:
estimating depth values of pixels included in a target image to generate a depth map for the target image;
generating a sample image using the target image and the depth map;
estimating depth values of pixels included in the sample image to generate a depth map for the sample image;
performing a rigid transformation on the depth map generated from the target image; and
generating learning data using the depth map generated from the sample image and the rigid-transformed depth map.

17. A computer program comprising a plurality of instructions that, when executed, perform the steps of:
estimating depth values of pixels included in a target image;
mapping the pixels included in the target image onto a three-dimensional space based on the depth values of the pixels included in the target image;
performing a rigid transformation with preset parameters on the pixels mapped onto the three-dimensional space; and
projecting the rigid-transformed pixels onto a two-dimensional plane to generate a sample image.