
JP2020119333A - Image processing method, image processing device, image processing system, imaging device, program, and storage medium

Info

Publication number: JP2020119333A
Application number: JP2019010512A
Authority: JP
Inventors: 義明井田; Yoshiaki Ida
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2019-01-24
Filing date: 2019-01-24
Publication date: 2020-08-06
Anticipated expiration: 2039-01-24
Also published as: JP7246943B2

Abstract

To provide an image processing method capable of estimating normal information with high accuracy for a general subject with no constraints on reflection characteristics or shape. SOLUTION: An image processing method includes a first step of acquiring a first normal map of a subject space and reference data as input data, a second step of acquiring learning information, and a third step of estimating a second normal map based on the input data and the learning information. The first normal map is either a normal map acquired using the photometric stereo method or the shape-from-shading method, or a normal map acquired using a distance map. The reference data includes a captured image of the subject space. In the third step, where N is an integer of 2 or more, the second normal map is estimated by executing a step of generating intermediate data by applying to the input data an n-th linear transformation based on the learning information and an n-th nonlinear transformation, in order from n = 1 to n = N, and a step of applying to the intermediate data an (N+1)-th linear transformation based on the learning information. SELECTED DRAWING: Figure 3

Description

The present invention relates to an image processing method, an image processing device, an image processing system, an imaging device, a program, and a storage medium for acquiring normal information of a subject.

A known technique acquires surface normal information (hereinafter, normal information) as shape information of a subject from a captured image obtained by imaging the subject with an imaging device such as a digital camera. Methods for acquiring normal information include the shape-from-shading method and the photometric stereo method. The shape-from-shading method can estimate normal information from a single captured image, but requires assumptions such as uniform reflectance of the target object and a smoothly varying subject shape. The photometric stereo method assumes a reflection characteristic based on the surface normal of the subject and the light source direction, and estimates normal information from the luminance of the subject at a plurality of light source positions and the assumed reflection characteristic. Because it uses images captured at a plurality of light source positions, it can estimate normal information under fewer assumptions than the shape-from-shading method. The Lambertian reflection model, which follows Lambert's cosine law, is often used as the assumed reflection characteristic of the subject.
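As an illustration of the photometric stereo method under the Lambertian assumption described above, the following is a minimal sketch, not the method of any cited document. It assumes known unit light directions, no shadows or specular highlights, and solves a per-pixel least-squares problem; the function name `photometric_stereo` and the array layouts are hypothetical.

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Least-squares Lambertian photometric stereo (illustrative sketch).

    intensities: (K, H, W) luminance observed under K light directions.
    light_dirs:  (K, 3) unit vectors pointing toward each light source.
    Returns unit normals (H, W, 3) and albedo (H, W).
    """
    k, h, w = intensities.shape
    # Lambert's cosine law: i = albedo * (s . n).
    # Solve S g = i for g = albedo * n at every pixel at once.
    g, *_ = np.linalg.lstsq(light_dirs, intensities.reshape(k, -1), rcond=None)
    albedo = np.linalg.norm(g, axis=0)
    normals = g / np.maximum(albedo, 1e-12)  # avoid division by zero
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```

With exactly three non-coplanar lights the system is determined; with more lights it is overdetermined, and pixels that violate the Lambertian assumption (highlights, shadows) bias the least-squares solution, which is the failure mode the text goes on to describe.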

In general, reflection at an object consists of specular reflection and diffuse reflection. Specular reflection is regular reflection at the object surface, namely Fresnel reflection that follows the Fresnel equations at the surface (interface). Diffuse reflection is reflection in which light passes through the surface of the subject, is scattered inside the object, and returns. Specularly reflected light cannot be described by Lambert's cosine law, so when the reflected light from the subject observed by the imaging device contains specular reflection, the shape-from-shading method and the photometric stereo method cannot estimate normal information accurately. Shaded areas that light from the source does not reach also deviate from the assumed reflection model, and normal information cannot be estimated accurately there. In addition, for subjects with rough surfaces or translucent bodies, the diffuse reflection component also deviates from Lambert's cosine law. Furthermore, normal information cannot be estimated accurately when interreflection occurs, or for metallic or transparent bodies in which no diffuse reflection component is observed.

Patent Document 1 discloses a method that obtains normal information with high accuracy from a plurality of normal candidates, acquired using four or more light sources, by excluding the influence of the specular reflection component. Non-Patent Document 1 discloses a method that applies a convolutional neural network to estimate normal information from a single captured image.

Patent Document 1: JP 2010-122158 A

Non-Patent Document 1: D. Eigen, et al., "Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture", arXiv:1411.4734 (2014)

However, the method disclosed in Patent Document 1 cannot estimate normal information when a plurality of the captured images are affected by the specular reflection component. It also cannot estimate normal information when shaded areas occur, when the subject has a reflection characteristic that deviates from Lambert's cosine law, when interreflection occurs, or when no diffuse reflection component is observed from the subject. The method disclosed in Non-Patent Document 1 rarely produces failed regions where normal information cannot be estimated, but its estimation accuracy is lower than that of a normal map acquired with high accuracy by the photometric stereo method or the like.

An object of the present invention is to provide an image processing method, an image processing device, an image processing system, an imaging device, an image processing program, and a storage medium capable of estimating normal information with high accuracy for a general subject with no constraints on reflection characteristics or shape.

An image processing method according to one aspect of the present invention includes a first step of acquiring a first normal map of a subject space and reference data as input data, a second step of acquiring learning information learned in advance, and a third step of estimating a second normal map based on the input data and the learning information. The first normal map is either a normal map acquired using the photometric stereo method or the shape-from-shading method, or a normal map acquired using a distance map. The reference data includes a captured image of the subject space. In the third step, where N is an integer of 2 or more and n is an integer from 1 to N, the second normal map is estimated by executing a step of generating intermediate data by applying to the input data an n-th linear transformation based on the learning information and an n-th nonlinear transformation, in order from n = 1 to n = N, and a step of applying to the intermediate data an (N+1)-th linear transformation based on the learning information.

An image processing method according to another aspect of the present invention includes a first step of acquiring a first normal map of a subject space and reference data as input data, a second step of acquiring learning information learned in advance, and a third step of estimating a second normal map based on the input data and the learning information. The reference data includes at least one of a label map in which each region of a captured image of the subject space is labeled based on the characteristics of the subject, a reliability map representing the reliability of the normal information contained in the first normal map, and a distance map. In the third step, where N is an integer of 2 or more and n is an integer from 1 to N, the second normal map is estimated by executing a step of generating intermediate data by applying to the input data an n-th linear transformation based on the learning information and an n-th nonlinear transformation, in order from n = 1 to n = N, and a step of applying to the intermediate data an (N+1)-th linear transformation based on the learning information.

According to the present invention, it is possible to provide an image processing method, an image processing device, an image processing system, an imaging device, an image processing program, and a storage medium capable of estimating normal information with high accuracy for a general subject with no constraints on reflection characteristics or shape.

FIG. 1 is an external view of the imaging device of Embodiment 1.
FIG. 2 is a block diagram of the imaging device of Embodiment 1.
FIG. 3 is a flowchart showing the second normal map estimation process of Embodiment 1.
FIG. 4 is a diagram showing the network structure of a CNN (Convolutional Neural Network), which is one form of deep learning (Embodiments 1 and 2).
FIG. 5 is a flowchart showing the learning of the learning information in Embodiment 1.
FIG. 6 is an external view of the image processing system of Embodiment 2.
FIG. 7 is a block diagram of the image processing system of Embodiment 2.
FIG. 8 is a flowchart showing the second normal map estimation process of Embodiment 2.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In each drawing, the same members are given the same reference numerals, and duplicate descriptions are omitted.

The present invention uses deep learning to estimate, from a first normal map that contains failed regions or regions of insufficient accuracy, a second normal map that improves on the first. A normal map is a map in which the normal information of the subject space is arranged as an image, and normal information refers to a normal direction vector or to the individual degrees of freedom that represent the normal. In addition to the first normal map, at least one of a captured image, a label map, a reliability map, and a distance map is supplied as input to the deep learning, and the second normal map is estimated with reference to these data. By using deep learning to learn the correspondence from input data consisting of the first normal map and at least one of the captured image, label map, reliability map, and distance map, normal information can be estimated with high accuracy.

This embodiment describes a case where the image processing method of the present invention is applied to an imaging device 100. FIG. 1 is an external view of the imaging device 100. FIG. 2 is a block diagram of the imaging device 100.

The imaging device 100 has an imaging unit 101 that acquires an image of the subject space as a captured image. The imaging unit 101 has an imaging optical system 101a that collects light incident from the subject space, and an image sensor 101b having a plurality of pixels. The image sensor 101b is, for example, a CCD (Charge Coupled Device) sensor or a CMOS (Complementary Metal-Oxide Semiconductor) sensor. The image processing unit 102 acquires a partial region of the first normal map of the subject space and estimates normal information (the second normal map) from input data that combines it with the corresponding region of the captured image. The image processing unit 102 has a learning unit 102a, an estimation unit 102b, and an acquisition unit 102c. When the second normal map is estimated, the learning information stored in the storage unit 103 is read out and used. The estimated second normal map is displayed on a display unit 104 such as a liquid crystal display, or saved to a recording medium 105. Instead of the second normal map, an image generated from the second normal map (for example, a rendered image) may be displayed on the display unit 104 or saved to the recording medium 105. Alternatively, the first normal map, the captured image, or the input data may be saved to the recording medium 105, and the second normal map may be estimated at an arbitrary later time. The light source 110 is selectively turned on during imaging, which allows the imaging unit 101 to capture images under a plurality of different light source environments. In this embodiment the light source 110 is built into the imaging device 100, but it may be an external device separate from the imaging device 100. The system controller 106 performs the series of controls described above.

The estimation of the second normal map executed by the image processing unit 102 is described below with reference to FIG. 3. FIG. 3 is a flowchart showing the second normal map estimation process.

In step S101, the acquisition unit 102c acquires learning information and a plurality of captured images of the subject taken under different light source environments. Learning information is information learned in advance that relates the input data, described later, to the second normal map to be estimated.

In step S102, the acquisition unit 102c acquires, as input data, a partial region of the first normal map and the partial regions of a plurality of captured images of the subject taken under different light sources. In this embodiment, the first normal map is one that the acquisition unit 102c has obtained in advance by the photometric stereo method from a plurality of captured images of the subject taken under different light source environments. The regions taken from each of the captured images are selected so that they lie at the same position in every image. Alignment processing such as electronic image stabilization may be applied to each of the captured images. The input data need not use all of the captured images: a single captured image may be used, or captured images different from those used for the photometric stereo method may be used. The second normal map is estimated one partial region at a time, with the input data as the unit. A partial region only needs to be some region of each captured image, and may be the entire captured image. The method of acquiring the first normal map is not limited to the photometric stereo method. The first normal map may instead be acquired by the shape-from-shading method, or from a distance map of the subject space obtained from parallax images. In that case, step S101 only needs to acquire the captured images required by the chosen acquisition method and by the input data. The images may be captured with a predetermined light source or with ambient light alone, and either a single captured image or parallax images may be acquired. The first normal map may also be stored in the recording medium 105 in advance. When the first normal map is read from the recording medium 105, step S101 only needs to acquire the captured images required for the input data.
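The text notes that the first normal map may be derived from a distance map. One common way to do this, shown here only as a hedged sketch, is to take finite-difference gradients of the depth and normalize the vector (-dz/dx, -dz/dy, 1). The orthographic camera model, the sign convention, and the name `normals_from_depth` are assumptions, not details taken from the source.

```python
import numpy as np

def normals_from_depth(depth, pixel_pitch=1.0):
    """Approximate unit normals (H, W, 3) from a depth map z(x, y).

    Assumes an orthographic view so that image coordinates map linearly
    to scene coordinates with spacing `pixel_pitch` (an assumption).
    """
    # np.gradient returns derivatives along axis 0 (y) and axis 1 (x).
    dz_dy, dz_dx = np.gradient(depth, pixel_pitch)
    n = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    return n / np.linalg.norm(n, axis=2, keepdims=True)
```

Normals obtained this way inherit the noise and resolution of the distance map, which is one source of the low-accuracy regions that the second normal map is meant to correct.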

In step S103, the estimation unit 102b uses the learning information to estimate partial normal information from the input data. Step S103 processes the input data with a multilayer neural network. The details of the estimation performed in step S103 are described below with reference to FIG. 4.

FIG. 4 shows the network structure of a CNN (Convolutional Neural Network), which is one form of deep learning. However, any multilayer neural network may be used, including methods other than a CNN (for example, a DBN (Deep Belief Network)).

The CNN has a multilayer structure, and each layer executes a linear transformation based on the learning information and a nonlinear transformation. The nonlinear transformation may or may not be based on the learning information. Where n is an integer from 1 to N, the n-th layer is called layer n, and the linear and nonlinear transformations in layer n are called the n-th linear transformation and the n-th nonlinear transformation, respectively. N is an integer of 2 or more. In the first layer, the input data 201 is convolved with each of a plurality of filters 202 (the first linear transformation, consisting of a plurality of linear functions). The result is then transformed by a nonlinear function called an activation function (AF in FIG. 4); this is the first nonlinear transformation. The input data 201 is drawn as multiple sheets in FIG. 4 because it has multiple channels. In this embodiment, the input data has the channels of the first normal map plus three RGB (Red, Green, Blue) channels for each captured image used in the input data. The number of channels of the first normal map depends on how the normal information is represented: three channels if the three components of the normal vector are each assigned a channel, or two channels if the direction of the normal vector is expressed by two angles. Even when the input data has RGB channels, each color may be input to the CNN separately.
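To make the channel counting above concrete, the following sketch converts between the three-component and two-angle representations of a unit normal and stacks a normal map with RGB images into one multi-channel input. The choice of polar and azimuth angles and the function names are assumptions for illustration; the source does not say which two angles are used.

```python
import numpy as np

def normals_to_angles(normals):
    """(H, W, 3) unit normals -> (H, W, 2) [polar, azimuth] in radians."""
    polar = np.arccos(np.clip(normals[..., 2], -1.0, 1.0))
    azimuth = np.arctan2(normals[..., 1], normals[..., 0])
    return np.stack([polar, azimuth], axis=-1)

def angles_to_normals(angles):
    """(H, W, 2) [polar, azimuth] -> (H, W, 3) unit normals."""
    polar, azimuth = angles[..., 0], angles[..., 1]
    return np.stack([np.sin(polar) * np.cos(azimuth),
                     np.sin(polar) * np.sin(azimuth),
                     np.cos(polar)], axis=-1)

def build_input(normal_map, images):
    """Stack a normal map (H, W, Cn) with K RGB images (H, W, 3) each
    into input data of shape (H, W, Cn + 3K)."""
    return np.concatenate([normal_map, *images], axis=-1)
```

For example, a three-channel normal map with two RGB images gives 3 + 2 × 3 = 9 input channels, while the two-angle representation gives 8.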

There are a plurality of filters 202, and the convolution of each with the input data 201 is computed separately. The coefficients of the filters 202 are determined from the learning information. The learning information may be the filter coefficients themselves, or the coefficients obtained by fitting the filters with a predetermined function. Each filter 202 has the same number of channels as the input data 201, making it a three-dimensional filter (the third dimension being the channel). A constant determined from the learning information (which may be negative) may be added to the result of the convolution.
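The linear transformation described above (one three-dimensional filter per output channel, plus an added constant) can be sketched as follows. This is an unoptimized illustration with hypothetical names and array layouts. It uses the sliding dot product that CNN libraries conventionally call convolution, and computes only positions where the filter fits entirely inside the data.

```python
import numpy as np

def conv_layer(x, filters, bias):
    """Multi-channel 'valid' convolution layer (illustrative sketch).

    x:       (H, W, C) input data.
    filters: (F, kh, kw, C) one 3-D filter per output channel.
    bias:    (F,) constants added after the convolution.
    Returns an array of shape (H - kh + 1, W - kw + 1, F).
    """
    f, kh, kw, _ = filters.shape
    h, w, _ = x.shape
    out = np.empty((h - kh + 1, w - kw + 1, f))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]
            # Sum over height, width, and channel for every filter at once.
            out[i, j] = np.tensordot(filters, patch, axes=3) + bias
    return out
```

The output has as many channels as there are filters, regardless of the number of input channels.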

Examples of the activation function f(x) include the following equations (1) to (3).

$$f(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

$$f(x) = \tanh(x) \tag{2}$$

$$f(x) = \max(x, 0) \tag{3}$$

Equation (1) is called the sigmoid function, equation (2) the hyperbolic tangent function, and equation (3) ReLU (Rectified Linear Unit). The max in equation (3) denotes the MAX function, which outputs the largest of its arguments. Equations (1) to (3) are all monotonically increasing functions. Maxout may also be used as the activation function. Maxout is a MAX function that outputs, at each pixel, the largest signal value among the plurality of images output by the n-th linear transformation.
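The activation functions named above can be written in a few lines. The sketch below uses the standard definitions of the sigmoid, hyperbolic tangent, and ReLU functions. The `maxout` helper follows the description in the text (a per-pixel maximum over the images output by the linear transformation); treating the last array axis as the image index is an assumption.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent function."""
    return np.tanh(x)

def relu(x):
    """ReLU: max(x, 0), applied elementwise."""
    return np.maximum(x, 0.0)

def maxout(feature_maps):
    """Per-pixel maximum over the last axis: (H, W, C) -> (H, W)."""
    return feature_maps.max(axis=-1)
```

Note that practical Maxout layers often take the maximum within groups of channels rather than over all of them, so the output keeps several channels.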

The input data after the first linear transformation and the first nonlinear transformation is called the first transformed input data 203. Each channel of the first transformed input data 203 is generated from the convolution of the input data 201 with one of the filters 202. The number of channels of the first transformed input data 203 therefore equals the number of filters 202.

In the second layer, as in the first, the first transformed input data 203 is convolved with a plurality of filters 204 determined from the learning information (the second linear transformation) and then passed through an activation function (the second nonlinear transformation). The filters 204 used in the second layer are normally not the same as the filters 202 used in the first layer, and their size and number need not match either. However, the number of channels of each filter 204 matches the number of channels of the first transformed input data 203. Repeating the same operation up to layer N yields the intermediate data 210. Finally, in layer N+1, the convolution of the intermediate data 210 with filters 211 and the addition of a constant (the (N+1)-th linear transformation) yield partial normal information 212, which is the second normal map estimated for the input data 201. The filters 211 and the constant are also determined from the learning information. The number of channels of the partial normal information 212 depends on how the normal information is represented, and equals the number of channels of the first normal map. The number of filters 211 therefore equals the number of channels of the partial normal information 212. Each channel of the partial normal information 212 is obtained from an operation that includes the convolution of the intermediate data 210 with one of the filters 211. The input data 201 and the partial normal information 212 need not have the same size. Because no data exists outside the input data 201, computing the convolution only where data exists makes the result smaller than the input. The size can, however, be preserved by imposing a periodic boundary condition or the like. In this embodiment, the filter coefficients of every m-th linear transformation (m = 1 to N+1) all differ from one another.
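The layer sequence described above (N rounds of linear plus nonlinear transformation, then a final linear transformation alone) can be sketched end to end. In this illustration the random weights merely stand in for learning information, ReLU is chosen arbitrarily as the activation function, and the channel counts, 3×3 filter size, and function names are all assumptions. `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv(x, filters, bias):
    """'Valid' convolution: x (H, W, C), filters (F, kh, kw, C), bias (F,)."""
    kh, kw = filters.shape[1:3]
    windows = sliding_window_view(x, (kh, kw), axis=(0, 1))  # (H', W', C, kh, kw)
    return np.einsum("ijckl,fklc->ijf", windows, filters) + bias

def estimate_patch(x, layers):
    """layers: list of N + 1 (filters, bias) pairs (stand-in learning information)."""
    *hidden, last = layers
    for filters, bias in hidden:                     # n = 1 .. N
        x = np.maximum(conv(x, filters, bias), 0.0)  # n-th linear + nonlinear
    return conv(x, *last)                            # (N+1)-th linear only

rng = np.random.default_rng(0)
channels = [9, 16, 16, 3]  # 9-channel input, N = 2 hidden layers, 3-channel output
layers = [(0.1 * rng.normal(size=(c_out, 3, 3, c_in)), np.zeros(c_out))
          for c_in, c_out in zip(channels[:-1], channels[1:])]
patch = rng.normal(size=(32, 32, 9))
out = estimate_patch(patch, layers)
print(out.shape)  # (26, 26, 3)
```

Each 3×3 convolution computed only where data exists trims one pixel from every border, so three layers reduce a 32×32 patch to 26×26. This is the size reduction the text mentions; padding the input would keep the size instead.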

Deep learning achieves high performance because repeating the nonlinear transformation many times through a multilayer structure yields strong nonlinearity. If there were no activation functions to provide the nonlinear transformation, and the network consisted of linear transformations only, then however many layers it had, an equivalent single-layer linear transformation would exist, and the multilayer structure would be pointless. Deep learning is said to reach higher performance with more layers because more layers give stronger nonlinearity. In general, a network with at least three layers is called deep learning. Using deep learning configured in this way, normal information (the second normal map) with fewer failed regions can be estimated.
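The claim that stacked linear transformations collapse into a single one can be checked numerically. The small matrices below are arbitrary illustrative values, with matrix products standing in for convolutions (which are also linear).

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [2.0,  0.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

def linear_net(v):
    """Two layers with no activation function."""
    return W2 @ (W1 @ v)

def relu_net(v):
    """The same two layers with ReLU in between."""
    return W2 @ np.maximum(W1 @ v, 0.0)

# Without an activation, the two layers equal the single matrix W2 @ W1.
single_layer = (W2 @ W1) @ x
print(linear_net(x), single_layer)      # both equal 1.0

# Any linear map satisfies f(v) + f(-v) = 0; the ReLU network does not,
# so no single linear layer can reproduce it.
print(linear_net(x) + linear_net(-x))   # equals 0.0
print(relu_net(x) + relu_net(-x))       # equals 3.0
```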

In step S104, the estimation unit 102b determines whether estimation of the second normal map has been completed for a predetermined region of the captured image. If partial normal information has been generated for the whole predetermined region, the process proceeds to step S105. Otherwise, the process returns to step S102 and acquires input data for which normal information has not yet been estimated.

In step S105, the estimation unit 102b outputs the second normal map. The second normal map is generated by combining the generated partial normal information. When the input data covers the entire image region, the partial normal information can be used as the second normal map as it is.
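The loop over partial regions (steps S102 to S104) and the final assembly (step S105) might be organized as in the sketch below. It assumes non-overlapping tiles and an estimator that returns output with the same height and width as its input tile. Both are simplifications, and `estimate_patch` is a hypothetical callable standing in for the trained network.

```python
import numpy as np

def estimate_normal_map(input_data, estimate_patch, tile=64):
    """Estimate a normal map tile by tile and assemble the result.

    input_data:     (H, W, C) first normal map stacked with reference data.
    estimate_patch: callable mapping an (h, w, C) tile to (h, w, Cout).
    """
    h, w, _ = input_data.shape
    out = None
    for top in range(0, h, tile):          # S102 to S104: visit every region
        for left in range(0, w, tile):
            part = estimate_patch(input_data[top:top + tile, left:left + tile])
            if out is None:
                out = np.zeros((h, w, part.shape[-1]), dtype=part.dtype)
            out[top:top + tile, left:left + tile] = part
    return out                             # S105: combined second normal map
```

If the network shrinks its input, as a convolution without padding does, the tiles would need to overlap by the lost margin so that the outputs still cover the whole region.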

Through the above processing, the second normal map can be estimated from the first normal map with reference to the captured images.

Even when parallax images were not used to acquire the first normal map, a plurality of captured images from different viewpoints may be input to obtain the second normal map. Because the reflection angle of light changes with the viewpoint, the influence of the subject's reflection characteristics changes as well. Inputting a plurality of captured images in which the same subject reflects light differently can improve the estimation accuracy.

The learning of the learning information is described below with reference to FIG. 5. FIG. 5 is a flowchart showing the learning of the learning information. As long as it takes place before the second normal map is estimated, learning may be performed by the learning unit 102a or by a computing device separate from the imaging device 100. This embodiment describes the case where the learning unit 102a performs the learning.

In step S201, the learning unit 102a acquires one or more pairs of learning data. A pair of learning data consists of learning input data, which includes a first normal map and a captured image, and a learning normal map of the same subject that is more accurate than the first normal map. Depending on how it was acquired, the first normal map may have low resolution or may contain failed regions (regions with low accuracy or with no normal acquired) caused by conditions the acquisition method handles poorly. In this embodiment, the estimation process corrects the first normal map into a more accurate second normal map. The learning data must therefore match the estimation-time data in the conditions under which the first normal map fails and in the effects of those failures. For this reason, the first normal map used for learning should be acquired in the same way as the one used for estimation, and the captured images used for learning should correspond to those used for estimation in number and imaging conditions. The high-accuracy normal information used as learning data (the learning normal map) must be more accurate than the first normal map and free of failed regions. The accuracy of this normal information determines the accuracy of the estimated second normal map.

It is desirable that the high-accuracy normal information used as learning data cover subjects with various shapes and normal vectors as well as subjects with various reflection characteristics. For example, if the learning data contains no glossy subject, there is no learning data built from the first normal map, captured image, reference data (described later), and second normal map of a subject with such reflection characteristics. The effect of normal information estimation may therefore not be sufficiently obtained for such subjects.

The learning data may be prepared by simulation or from actually measured information. In the simulation case, an image corresponding to a captured image is generated by CG rendering of a 3D model to which reflection characteristics have been assigned, and the first normal map is acquired from the generated image by an acquisition method equivalent to that of the acquisition unit 102c. Because the 3D model is known, the learning normal map is also known.

When actually measured information is used, a subject whose shape is known (a subject whose normal information is known) is photographed, and normal information corresponding to the first normal map is acquired by an acquisition method equivalent to that of the acquisition unit 102c. Accurate normal information is thereby obtained. In this case as well, the learning normal map is known.

For a subject whose shape is unknown, learning may also be performed using normal information (a second normal map) acquired by a different method. Examples include acquiring the shape by laser ranging, acquiring the shape from images illuminated with structured illumination, acquiring the shape from the direction of specular reflection and the incident direction of the light source, and acquiring the shape with a contact-type measuring instrument. By selecting an acquisition method suited to conditions such as the size, shape, and reflection characteristics of the subject, normal information can be acquired for general subjects with higher accuracy than the first normal map.

The conditions under which the first normal map tends to fail will now be described, starting with the photometric stereo method. When normals are acquired by the photometric stereo method, the normal direction is obtained from the change in luminance between images taken under different light source environments. Some photometric stereo methods support general reflection characteristics, but because they increase the number of shots and the computational load, diffuse reflection following the Lambert model is assumed in most cases. An incorrect normal is therefore obtained where specular reflection is observed. Normals also tend to fail because of luminance changes caused by cast shadows where the incident light is blocked, by attached shadows where the incident light does not reach the surface, and by interreflection in which light reflected from other objects is incident. Subjects with rough surfaces and translucent bodies with strong internal scattering exhibit reflection characteristics that deviate from the Lambert model, and no diffuse reflection is observed on glossy metal bodies or on transparent bodies such as glass, so normals tend to fail for these subjects as well. When the ambient light is strong, the contribution of the incident light from the light source becomes relatively small and the acquisition accuracy decreases. The Lambert model is often assumed in the shape-from-shading method as well, so failures tend to occur for the same kinds of subjects. In addition, when a continuous change of the normal is assumed, fine surface irregularities may not be reproduced.
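The Lambertian assumption discussed above can be illustrated with a minimal least-squares photometric stereo sketch. This is not the acquisition method of the acquisition unit itself; the function name `photometric_stereo`, the array layouts, and the use of NumPy are assumptions made for illustration. Under the model I = albedo * (l . n), the scaled normal is recovered per pixel by solving a linear system, which is exactly why pixels with specular highlights or shadows yield wrong normals.

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Lambertian photometric stereo (illustrative sketch).

    intensities: (K, H, W) luminance under K light sources (assumed layout).
    light_dirs:  (K, 3) unit light direction vectors (assumed known).
    Returns unit normals (H, W, 3) and albedo (H, W).
    """
    k, h, w = intensities.shape
    # Solve light_dirs @ b = I for b = albedo * normal at every pixel.
    b, *_ = np.linalg.lstsq(light_dirs, intensities.reshape(k, -1), rcond=None)
    albedo = np.linalg.norm(b, axis=0)
    normals = b / np.maximum(albedo, 1e-12)  # guard against zero albedo
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```

Any luminance that does not follow the model (specular reflection, shadows, interreflection) enters the least-squares fit as if it were diffuse shading, which biases the recovered normal.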

When normal information is acquired from a distance map, incorrect normal information is estimated where the distance changes discontinuously at the boundary between near and far subjects. Because normal information corresponds to the derivative of distance information, it is also strongly affected by noise in the distance map. Depending on how the distance map is acquired, the distance map itself may contain failures, in which case the normal information fails as well. For example, when distance is acquired from multi-viewpoint images based on the correlation between viewpoints, the distance tends to fail for specular reflectors and transparent bodies whose appearance differs between viewpoints, for occlusions, for subjects with little texture and little luminance variation across positions, and for subjects with periodic structures. When distance is acquired by the time-of-flight method, the distance cannot be acquired for specular reflectors or transparent bodies that return no light, and the acquisition accuracy decreases due to noise for low-reflectance objects or in environments with strong external light.
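The statement that normal information corresponds to the derivative of distance information can be sketched as follows. This is a simplified illustration, assuming an orthographic view, a uniform pixel pitch, and a hypothetical function name `normals_from_depth`; it is not the procedure specified in the source.

```python
import numpy as np

def normals_from_depth(depth, pixel_pitch=1.0):
    """Approximate surface normals from a distance map (illustrative sketch).

    depth: (H, W) distance map; pixel_pitch: assumed uniform sample spacing.
    """
    # Finite-difference derivatives along rows (y) and columns (x).
    dz_dy, dz_dx = np.gradient(depth, pixel_pitch)
    # For a surface z = f(x, y), an unnormalised normal is (-df/dx, -df/dy, 1).
    n = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```

Because the normal is built from differences of neighbouring distance values, noise in the distance map is amplified, and a depth discontinuity between near and far subjects produces a steep, spurious normal at the boundary.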

ステップＳ２０２では、学習部１０２ａは、学習データから複数の学習ペアを取得する。学習ペアは、学習用入力データと学習用部分法線情報とからなる。学習用入力データは、第１の法線マップと撮影画像から取得される。学習用入力データのサイズは、ステップＳ１０２で取得された入力データと同じである。学習用部分法線情報は、学習用法線マップから、領域の中心が学習用入力データと同じ被写体位置になるように取得される。学習用部分法線情報のサイズは、ステップＳ１０３で生成された部分法線情報と同じである。 In step S202, the learning unit 102a acquires a plurality of learning pairs from the learning data. The learning pair consists of learning input data and learning partial normal information. The input data for learning is acquired from the first normal map and the captured image. The size of the learning input data is the same as the size of the input data acquired in step S102. The learning partial normal information is acquired from the learning normal map such that the center of the area is at the same subject position as the learning input data. The size of the learning partial normal information is the same as the size of the partial normal information generated in step S103.
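Step S202 can be sketched as cropping an input patch and a target patch that share the same centre. The function name `extract_learning_pair`, the channel-last array layout, and the odd patch sizes are assumptions for illustration; the actual sizes are those used in steps S102 and S103.

```python
import numpy as np

def extract_learning_pair(first_normal, image, target_normal, cy, cx, in_size, out_size):
    """Crop one learning pair centred on pixel (cy, cx) (illustrative sketch).

    first_normal:  (H, W, 3) first normal map.
    image:         (H, W, C) captured image(s) stacked along the channel axis.
    target_normal: (H, W, 3) learning normal map.
    in_size, out_size: assumed odd patch sizes, with out_size <= in_size.
    """
    hi, ho = in_size // 2, out_size // 2
    # Learning input data: first normal map and captured image, same region.
    stacked = np.concatenate([first_normal, image], axis=-1)
    inp = stacked[cy - hi:cy - hi + in_size, cx - hi:cx - hi + in_size]
    # Learning partial normal information: same centre, possibly smaller region.
    tgt = target_normal[cy - ho:cy - ho + out_size, cx - ho:cx - ho + out_size]
    return inp, tgt
```

The caller is assumed to pick centres far enough from the image border that both crops lie inside the maps.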

In step S203, the learning unit 102a acquires the learning information by learning from the plurality of learning pairs. The learning uses the same network structure as the estimation of the second normal map. In this embodiment, the learning input data is input to the network structure of FIG. 4, and the error between the output result and the learning partial normal information is calculated. The error of the normal information may be the difference of each component, or may be the value obtained by subtracting from 1 the inner product of the normal vector of the learning partial normal information and the normal vector of the output result. The coefficients of the filters and the constants to be added (the learning information) used in the first to (N+1)-th layers are updated and optimized, for example by backpropagation, so that the error is minimized. The initial values of the filters and constants may be determined, for example, from random numbers. Pre-training, such as an autoencoder that learns the initial values of each layer in advance, may also be performed.
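The two error definitions mentioned for step S203 can be sketched as below. The function name `normal_losses`, the mean reduction over pixels, and the use of a squared difference for the per-component error are assumptions; the source only states that a component difference or one minus the inner product may be used.

```python
import numpy as np

def normal_losses(pred, target, eps=1e-8):
    """Two candidate errors between normal maps (illustrative sketch).

    pred, target: arrays of shape (H, W, 3) holding normal vectors.
    """
    # Normalise so the inner product equals the cosine of the angle between normals.
    pred_u = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_u = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    # Option 1: per-component difference (squared and averaged here, by assumption).
    component_error = np.mean((pred_u - target_u) ** 2)
    # Option 2: one minus the inner product, zero when the normals coincide.
    cosine_error = np.mean(1.0 - np.sum(pred_u * target_u, axis=-1))
    return component_error, cosine_error
```

The second form depends only on the angle between the two normals, so it ranges from 0 for identical directions to 2 for opposite directions.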

A method in which all learning pairs are input to the network structure and the learning information is updated using all of that information is called batch learning. In batch learning, the computational load becomes enormous as the number of learning pairs increases. A method in which only one learning pair is used for each update of the learning information, with a different learning pair used for each update, is called online learning. In online learning, the amount of computation does not grow as learning pairs are added, but each update is strongly affected by the noise present in a single learning pair. It is therefore desirable to learn with the mini-batch method, which lies between these two methods. In the mini-batch method, several pairs are extracted from all the learning pairs and used to update the learning information, and different learning pairs are extracted for the next update. Repeating this reduces the drawbacks of both batch learning and online learning and makes a high estimation effect easier to obtain.
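The mini-batch selection described above can be sketched as follows. The generator name `minibatch_indices` and the shuffle-then-slice strategy are assumptions for illustration; the source only requires that a few pairs be extracted per update and that different pairs be used in the next update.

```python
import numpy as np

def minibatch_indices(num_pairs, batch_size, rng):
    """Yield index arrays that partition the learning pairs into mini-batches."""
    # Shuffle once per pass so each update sees a different subset of pairs.
    order = rng.permutation(num_pairs)
    for start in range(0, num_pairs, batch_size):
        yield order[start:start + batch_size]
```

Setting `batch_size` to `num_pairs` reduces this to batch learning, and setting it to 1 reduces it to online learning, which is why the mini-batch method sits between the two.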

ステップＳ２０４では、学習部１０２ａは、学習された学習情報を出力する。本実施例では、出力された学習情報は記憶部１０３に記憶される。 In step S204, the learning unit 102a outputs the learned learning information. In this embodiment, the output learning information is stored in the storage unit 103.

以上の処理によって、反射特性や形状の制約がない一般的な被写体において、高精度に法線情報を推定可能な学習情報を学習することができる。 By the above processing, it is possible to learn the learning information capable of estimating the normal line information with high accuracy in a general subject having no restriction on the reflection characteristic or the shape.

In addition to the above processing, techniques for improving the performance of the CNN may be used together. For example, to improve robustness, dropout or pooling (a form of downsampling) may be performed in each layer of the network.

また、本実施例では、撮影画像の部分領域から入力データを取得したが、撮影画像に任意の画像処理を行った後の画像から取得してもよい。例えば、撮影画像から鏡面反射の影響を受けない拡散反射画像を生成し、拡散反射画像から入力データを取得してもよい。これにより、反射特性の異なる被写体においても鏡面反射を除いた成分に絞って学習および推定ができ、推定精度を向上することができる。 Further, in the present embodiment, the input data is acquired from the partial area of the captured image, but it may be acquired from the image after performing arbitrary image processing on the captured image. For example, a diffuse reflection image that is not affected by specular reflection may be generated from the captured image, and the input data may be acquired from the diffuse reflection image. As a result, even for subjects having different reflection characteristics, learning and estimation can be performed by focusing on the components excluding specular reflection, and the estimation accuracy can be improved.

以上説明した構成により、反射特性や形状の制約がない一般的な被写体において、高精度に法線情報を推定することができる。 With the configuration described above, it is possible to estimate the normal line information with high accuracy in a general subject having no restriction on the reflection characteristic or shape.

本実施例では、本発明の画像処理方法を画像処理システム５００に適用した場合について説明する。図６は、画像処理システム５００の外観図である。図７（ａ）は、画像処理システム５００のブロック図である。 In this embodiment, a case where the image processing method of the present invention is applied to the image processing system 500 will be described. FIG. 6 is an external view of the image processing system 500. FIG. 7A is a block diagram of the image processing system 500.

The image processing system 500 includes an imaging device 300 that obtains a captured image, an image processing device 301 that estimates the second normal map, and a server 305 that performs learning. In this embodiment, at least one of a label map, a reliability map, and a distance map is used as the reference data for estimating the second normal map, instead of the captured image used in the first embodiment. This enables highly accurate estimation of normal information.

The basic configuration of the imaging device 300 is the same as that of the imaging device 100 of the first embodiment, except for the image processing unit that estimates the second normal map and learns the learning information. A captured image taken by the imaging device 300 is stored in the storage unit 302 in the image processing device 301. The image processing device 301 is connected to the network 304 by wire or wirelessly and accesses the server 305, which is likewise connected to the network 304. The server 305 has a learning unit 307, which learns the learning information for estimating the second normal map from the first normal map acquired by the acquisition unit 311 and the reference data, and a storage unit 306, which stores the learning information. The estimation unit 303 acquires the learning information from the storage unit 306 and estimates the second normal map. The second normal map is output to at least one of the display device 308, the recording medium 309, and the output device 310. Instead of the second normal map, an image generated based on the second normal map (for example, a rendered image) may be output.

表示装置３０８は、例えば、液晶ディスプレイやプロジェクタなどである。ユーザーは、表示装置３０８を介して、処理途中の法線情報を確認しながら作業を行うことができる。記録媒体３０９は、例えば、半導体メモリ、ハードディスク、およびネットワーク上のサーバー等である。出力装置３１０は、例えば、プリンタなどである。画像処理装置３０１は、必要に応じて現像処理やその他の画像処理を行う機能を有していてよい。 The display device 308 is, for example, a liquid crystal display or a projector. The user can work while confirming the normal line information during the processing via the display device 308. The recording medium 309 is, for example, a semiconductor memory, a hard disk, a server on the network, or the like. The output device 310 is, for example, a printer or the like. The image processing apparatus 301 may have a function of performing development processing and other image processing as needed.

The estimation of the second normal map executed by the image processing device 301 will now be described with reference to FIG. 8, which is a flowchart showing the estimation process of the second normal map.

ステップＳ３０１では、推定部３０３は、記憶部３０２から撮影画像を取得する。 In step S301, the estimation unit 303 acquires a captured image from the storage unit 302.

In step S302, the estimation unit 303 acquires the first normal map and the reference data. The first normal map has been acquired in advance by the acquisition unit 311. In this embodiment, the reference data is at least one of a label map, a reliability map, and a distance map. As in the first embodiment, the captured image may also be used as reference data. The first normal map and the distance map may be acquired as described in the first embodiment. The label map is a distribution in which each area of the captured image is labeled based on the characteristics of the subject. The characteristics of the subject are, for example, its transmission and reflection characteristics; labeling by subject material indicates what transmission and reflection characteristics the subject has. Areas that cause failures when the first normal map is acquired, such as specular reflection areas, interreflection areas, and shadow areas, may also be detected and labeled. The reliability map is a distribution indicating how reliable the normal information is at each position of the first normal map. When the first normal map has been acquired by the photometric stereo method, for example, the reliability map can be generated from the amount of difference between the luminance value predicted from the first normal map and the Lambert model and the actual luminance value.
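The reliability map described above can be sketched by re-rendering the luminance with the Lambert model and comparing it to the observed luminance. The function name `reliability_map`, the availability of an albedo estimate, and the mapping `1 / (1 + residual)` from residual to reliability are all assumptions for illustration; the source only states that the difference amount can be used.

```python
import numpy as np

def reliability_map(normals, albedo, intensities, light_dirs):
    """Reliability from the Lambertian re-rendering residual (illustrative sketch).

    normals: (H, W, 3) first normal map; albedo: (H, W) assumed estimate.
    intensities: (K, H, W) observed luminance; light_dirs: (K, 3) unit vectors.
    """
    # Luminance predicted by the Lambert model, clipped where the light is behind the surface.
    shading = np.clip(np.einsum('kc,hwc->khw', light_dirs, normals), 0.0, None)
    predicted = albedo[None] * shading
    # Mean absolute difference over the K light sources.
    residual = np.mean(np.abs(predicted - intensities), axis=0)
    # Assumed mapping: 1 where the model fits exactly, approaching 0 for large residuals.
    return 1.0 / (1.0 + residual)
```

Pixels with specular highlights, shadows, or interreflection deviate from the Lambertian prediction and therefore receive lower reliability.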

ステップＳ３０３では、推定部３０３は、参照データに基づいて、使用するネットワーク構造と学習情報、入力データのサイズ、および推定を実行する領域を決定する。本実施例では、推定部３０３は、図４を用いて説明したＣＮＮを使用して法線情報を推定する。また、本実施例では、参照データに応じて個別に学習させた学習情報を用いるため、ネットワーク構造および入力データのサイズも学習時に使用したものと同じになる。取得した参照データに応じた学習情報を用いることで、より高精度に第２の法線マップを推定することができる。 In step S303, the estimation unit 303 determines a network structure to be used and learning information, a size of input data, and an area to be estimated based on the reference data. In this embodiment, the estimation unit 303 estimates the normal line information using the CNN described with reference to FIG. Further, in this embodiment, since the learning information individually learned according to the reference data is used, the network structure and the size of the input data are the same as those used at the time of learning. By using the learning information according to the acquired reference data, the second normal map can be estimated with higher accuracy.

参照データとしてラベルマップを用いる場合、第１の法線マップで破綻が生じやすい領域がその種類に応じてラベル付けされ、破綻が生じにくい領域についてもラベル付けされている。例えば、金属体や透明体など反射特性の異なる被写体ごとにラベル付けされている場合、被写体の種類ごとに学習した学習情報を使用する。これにより、より精度の高い推定が可能となる。使用するネットワーク構造と学習情報、および入力データのサイズは学習時の条件によって決まる。また、推定を実行する領域を破綻が生じやすい領域としてラベル付けされた領域に限ってもよく、これによって推定を高速に実行できる。 When a label map is used as the reference data, areas that are likely to fail in the first normal map are labeled according to their types, and areas that are less likely to fail are also labeled. For example, in the case where a subject such as a metal body or a transparent body having different reflection characteristics is labeled, learning information learned for each type of subject is used. This enables more accurate estimation. The network structure to be used, learning information, and the size of input data are determined by the learning conditions. Further, the region where the estimation is performed may be limited to the region labeled as the region where the failure is likely to occur, whereby the estimation can be performed at high speed.
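The label-based selection described above can be sketched as a lookup from label to learning information plus a mask of where to run the estimation. The function name `select_learning_info`, the integer label encoding, and the dictionary `learned_by_label` are hypothetical and not taken from the source.

```python
import numpy as np

def select_learning_info(label_map, learned_by_label, skip_labels=()):
    """Map each label to its learning information and estimation mask (sketch).

    label_map: (H, W) integer labels (assumed encoding).
    learned_by_label: dict from label to learning information learned for it.
    skip_labels: labels regarded as unlikely to fail, excluded to save time.
    """
    plan = {}
    for label in np.unique(label_map):
        label = int(label)
        if label in skip_labels:
            continue  # keep the first normal map as-is in these areas
        plan[label] = (learned_by_label[label], label_map == label)
    return plan
```

Restricting the plan to failure-prone labels corresponds to limiting the estimation region for speed, as described in the paragraph above.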

参照データとして信頼度マップを用いる場合、第１の法線マップで破綻が生じている可能性の度合いが信頼度によって表されている。そこで、例えば、信頼度が低いほど入力データのサイズを大きくすればよい。また、信頼度が高い領域を含むように入力データのサイズを決定してもよい。これにより、より精度の高い推定が可能となる。また、推定を実行する領域を信頼度が所定の閾値より低い領域に限ってもよく、これによって推定を高速に実行できる。 When the reliability map is used as the reference data, the degree of possibility that a failure has occurred in the first normal map is represented by the reliability. Therefore, for example, the smaller the reliability, the larger the size of the input data. Also, the size of the input data may be determined so as to include a region with high reliability. This enables more accurate estimation. Further, the region where the estimation is performed may be limited to the region where the reliability is lower than a predetermined threshold value, whereby the estimation can be performed at high speed.
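One possible way to turn the reliability map into an input size and an estimation region is sketched below. The function name `plan_estimation`, the linear size rule, and the default values are assumptions; note that each input size would need learning information learned for that size.

```python
import numpy as np

def plan_estimation(reliability, threshold=0.5, min_size=9, max_size=33):
    """Choose where to estimate and how large the input should be (sketch).

    reliability: (H, W) values in [0, 1]; the other parameters are assumed defaults.
    """
    # Estimate only where the first normal map is likely to have failed.
    mask = reliability < threshold
    # Lower reliability -> larger input, so that reliable surroundings are included.
    size = min_size + (1.0 - reliability) * (max_size - min_size)
    # Force odd sizes so every patch has a well-defined centre pixel.
    size = np.floor(size / 2.0).astype(int) * 2 + 1
    return mask, size
```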

When a distance map is used as the reference data, the acquisition accuracy of the first normal map may vary with distance. For example, when the light sources of the imaging device 300 are turned on sequentially and normals are acquired by the photometric stereo method, the light sources are fixed to the imaging device 300, so the farther the subject is, the smaller the difference in light source direction between the captured images taken with each light source turned on, and the smaller the resulting change in luminance. The smaller the change in luminance, the lower the normal acquisition accuracy, so the distance map corresponds to the reliability of the normals. The size of the input data may therefore be changed according to the distance map. Points where the distance map changes discontinuously are boundary areas between near and far subjects, where incorrect normals are easily calculated when a normal map is acquired from the distance map. In boundary areas, learning information learned from boundary-area data may therefore be used, or the estimation may be executed only in the boundary areas.
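The loss of light direction difference with distance can be illustrated with simple geometry. The sketch assumes two light sources separated by a fixed baseline, placed symmetrically about the optical axis, with the subject on the axis; the function name `light_angle_deg` and this geometry are assumptions, not values from the source.

```python
import math

def light_angle_deg(baseline_m, distance_m):
    """Angle in degrees between the directions to two camera-mounted light sources."""
    # Each source is offset by baseline/2 from the axis, seen from the subject at distance d.
    return math.degrees(2.0 * math.atan(baseline_m / (2.0 * distance_m)))
```

For a fixed baseline the angle shrinks roughly in inverse proportion to distance, so the luminance differences that photometric stereo relies on become small for far subjects.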

ネットワーク構造は、各層で使用するフィルタのサイズだけでなく、１つの層で使用されるフィルタの数や層数なども含む。 The network structure includes not only the size of filters used in each layer, but also the number of filters used in one layer, the number of layers, and the like.

学習情報は参照データに応じて学習されており、対応した学習情報が使用される。これにより、より精度の高い推定が可能となる。 The learning information is learned according to the reference data, and the corresponding learning information is used. This enables more accurate estimation.

ステップＳ３０４では、推定部３０３は、第１の法線マップと参照データとから入力データを取得する。 In step S304, the estimation unit 303 acquires input data from the first normal map and reference data.

ステップＳ３０５では、推定部３０３は、学習情報から部分法線情報を生成する。 In step S305, the estimation unit 303 generates partial normal information from the learning information.

ステップＳ３０６では、推定部３０３は、既定の領域を推定し終えたかどうかを判定する。既定の領域を推定し終えた場合、ステップＳ３０７に進み、そうでない場合、ステップＳ３０４に戻る。 In step S306, the estimation unit 303 determines whether or not the estimation of the default area has been completed. If the estimation of the predetermined area is completed, the process proceeds to step S307, and if not, the process returns to step S304.

ステップＳ３０７では、推定部３０３は、第２の法線マップを出力する。 In step S307, the estimation unit 303 outputs the second normal line map.
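The loop formed by steps S304 to S307 can be sketched as below. The function name `estimate_second_normal_map`, the callable `estimate_patch` standing in for the learned CNN, the odd patch sizes, and the choice to test the region mask only at each patch centre are assumptions for illustration.

```python
import numpy as np

def estimate_second_normal_map(first_normal, reference, region_mask,
                               estimate_patch, in_size, out_size):
    """Tile the estimation over the predetermined region (illustrative sketch).

    first_normal: (H, W, 3); reference: (H, W, C); region_mask: (H, W) bool.
    estimate_patch: hypothetical callable mapping an (in_size, in_size, 3 + C)
    input patch to an (out_size, out_size, 3) patch of partial normal information.
    """
    h, w, _ = first_normal.shape
    hi, ho = in_size // 2, out_size // 2
    inputs = np.concatenate([first_normal, reference], axis=-1)
    second = first_normal.copy()  # areas that are not estimated keep the first normals
    for cy in range(hi, h - hi, out_size):
        for cx in range(hi, w - hi, out_size):
            if not region_mask[cy, cx]:
                continue
            patch = inputs[cy - hi:cy + hi + 1, cx - hi:cx + hi + 1]            # S304
            second[cy - ho:cy + ho + 1, cx - ho:cx + ho + 1] = estimate_patch(patch)  # S305
    return second  # S307, reached once the loop (S306) has covered the region
```

Stepping the centres by `out_size` makes the output patches tile without overlap; overlapping patches with blending would be another reasonable design.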

Note that step S304 may be executed before steps S302 and S303. In that case, in steps S302 and S303, the first normal map and the reference data are acquired for the partial area acquired in step S304, and the corresponding learning information and the like are acquired.

The learning of the learning information executed by the learning unit 307 will now be described. In this embodiment, different learning information is learned for each type of reference data. The learning method follows the flowchart of FIG. 5.

まず、学習データをシミュレーション（ＣＧレンダリング）によって生成する場合について説明する。例えば、所定の反射特性に設定して法線情報からレンダリング画像を生成し、レンダリング画像から第１の法線マップを取得することで、反射特性に対応した一対の学習データを得る。類似する複数の反射特性を１つのラベルとして対応付ける場合には、類似する反射特性それぞれにおいて学習データを得ることが望ましい。学習データに対してステップＳ２０１からＳ２０４を実行し、その後、異なる反射特性の被写体に対して同様の手順を繰り返す。信頼度マップや距離マップを参照データとして使用する場合も被写体領域の特性別に同様の手順で学習すればよい。 First, a case where learning data is generated by simulation (CG rendering) will be described. For example, a pair of learning data corresponding to the reflection characteristic is obtained by setting a predetermined reflection characteristic, generating a rendering image from the normal information, and acquiring a first normal map from the rendering image. When associating a plurality of similar reflection characteristics as one label, it is desirable to obtain learning data for each of the similar reflection characteristics. Steps S201 to S204 are executed on the learning data, and then the same procedure is repeated for the subject having different reflection characteristics. Even when the reliability map or the distance map is used as the reference data, the same procedure may be learned for each characteristic of the subject region.

また、形状が既知の実被写体を用いて学習データを生成する場合について説明する。この場合、既知の形状を持つ被写体に対して推定時に用いる第１の法線マップおよび参照データを取得することで学習データを得る。被写体の距離によって学習情報を変える場合、被写体の距離を変えながら撮影画像、第１の法線マップ、および参照データを取得すればよい。被写体の反射特性が既知の場合、異なる反射特性の被写体に変えながら学習データを取得すればよい。厳密に反射特性が既知でなくとも、材質を変えながら学習データを取得すればよい。この場合、信頼度マップは事前に取得することが困難であるため、第１の法線マップと同時に信頼度マップを取得し、様々な信頼度の被写体空間について学習データを取得する必要がある。そして、同じ分類にあたる学習データごとにステップＳ２０１からＳ２０４を実行して学習情報を生成する。 A case will be described in which learning data is generated using a real subject whose shape is already known. In this case, learning data is obtained by acquiring a first normal map and reference data used for estimation for a subject having a known shape. When the learning information is changed according to the distance to the subject, the captured image, the first normal map, and the reference data may be acquired while changing the distance to the subject. When the reflection characteristics of the subject are known, learning data may be acquired while changing to a subject having a different reflection characteristic. Even if the reflection characteristics are not exactly known, learning data may be acquired while changing the material. In this case, since it is difficult to acquire the reliability map in advance, it is necessary to acquire the reliability map at the same time as the first normal map and acquire the learning data for the subject spaces of various reliability. Then, the learning information is generated by executing steps S201 to S204 for each learning data that falls into the same classification.

In this embodiment, the image processing system 500 shown in FIG. 7A has been described as an example of an image processing system to which the present invention can be applied, but the present invention is not limited to this. For example, as shown in FIG. 7B, the server 401 may have the functions of both the image processing device 301 and the server 305 of FIG. 7A. In this case, the server 401 acquires, via the network 402, captured images output from an image output device such as an imaging device, a smartphone, a tablet terminal, or a personal computer (PC), and executes the image processing method of the present invention on the acquired captured images.

With the configuration described above, normal information can be estimated with high accuracy for general subjects with no restriction on reflection characteristics or shape.

(Other embodiments)

The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

According to each embodiment, it is possible to provide an image processing device, an imaging device, an image processing method, an image processing program, and a storage medium capable of accurately estimating normal information for general subjects with no restriction on reflection characteristics or shape.

以上、本発明の好ましい実施形態について説明したが、本発明はこれらの実施形態に限定されず、その要旨の範囲内で種々の変形及び変更が可能である。 Although the preferred embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various modifications and changes can be made within the scope of the gist thereof.

201 Input data
210 Intermediate data

Claims

An image processing method comprising:
a first step of acquiring a first normal map of a subject space and reference data as input data;
a second step of acquiring learning information learned in advance; and
a third step of estimating a second normal map based on the input data and the learning information,
wherein the first normal map is either a normal map acquired using a photometric stereo method or a shape-from-shading method, or a normal map acquired using a distance map,
the reference data includes a captured image of the subject space, and
in the third step, where N is an integer of 2 or more and n is an integer from 1 to N, the second normal map is estimated by executing a step of generating intermediate data by performing, on the input data, the n-th linear transformation based on the learning information and the n-th nonlinear transformation in order from n = 1 to n = N, and a step of performing the (N+1)-th linear transformation based on the learning information on the intermediate data.

The image processing method according to claim 1, wherein the captured images are a plurality of images obtained by capturing the subject space under a plurality of different light source environments.

The image processing method according to claim 2, wherein in the first step, a partial area of the first normal map and a partial area of the reference data are acquired as the input data, and the input data includes partial areas at the same position extracted from each of the plurality of images.

The image processing method according to claim 1, wherein the reference data includes at least one of a label map in which each area of the captured image is labeled based on characteristics of the subject, a reliability map representing the reliability of the normal information included in the first normal map, and a distance map.

A first step of acquiring a first normal map of the object space and reference data as input data;
A second step of acquiring learning information learned in advance,
A third step of estimating a second normal map based on the input data and the learning information,
The reference data includes at least one of a label map that labels each area of a captured image of the subject space based on the characteristics of the subject, a reliability map that indicates the reliability of the normal information included in the first normal map, and a distance map,
In the third step, where N is an integer greater than or equal to 2 and n is an integer from 1 to N, the second normal map is estimated by executing a step of generating intermediate data by performing the nth linear conversion and the nth nonlinear conversion based on the learning information on the input data in order for n from 1 to N, and a step of performing the (N+1)th linear conversion based on the learning information on the intermediate data.

In the first step, a partial area of the first normal map and a partial area of the reference data are acquired as input data,
The image processing method according to claim 4, wherein the partial region is extracted based on at least one of the label map, the reliability map, and the distance map.

7. The partial area is an area of the label map whose label represents a subject for which the first normal map has low accuracy, or a low-reliability area of the reliability map. The image processing method described in.
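One way to pick partial areas from a reliability map is sketched below. It assumes the map is a nested list of values in [0, 1], that the image is divided into non-overlapping square tiles, and that a tile is selected when its mean reliability falls below a threshold. The tiling, the mean, the threshold, and the name `low_reliability_patches` are all illustrative assumptions rather than details from the claims.

```python
def low_reliability_patches(reliability, size, threshold):
    """Return (top, left) of each size x size tile whose mean reliability is below threshold."""
    height, width = len(reliability), len(reliability[0])
    positions = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            tile = [
                reliability[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            ]
            if sum(tile) / len(tile) < threshold:
                positions.append((top, left))
    return positions
```

Restricting the estimation to the returned positions would limit the computation to areas where the first normal map is least trustworthy.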

The learning information is information learned based on learning input data, which includes a first normal map and reference data for a subject space different from the subject space, and a learning normal map that corresponds to the learning input data and differs from the first normal map,
The image processing method according to claim 1, wherein the learning input data includes a subject for which the first normal map has low accuracy.

9. The image processing method according to claim 8, wherein the learning input data includes at least one of a subject that is specularly reflected, a subject that has a rough surface, a metal body, a transparent body, and a translucent body.

The image processing method according to claim 1, wherein the learning information corresponding to the linear conversion is determined based on the reference data.

The image processing method according to claim 10, wherein the size of the filter used for the linear conversion is determined based on the reference data.

The image processing method according to claim 1, wherein the first normal map is a normal map acquired based on a photometric stereo method or a shape-from-shading method.

An acquisition unit that acquires a first normal map of the subject space and reference data as input data, and acquires learning information learned in advance;
An estimation unit that estimates a second normal map based on the input data and the learning information,
The first normal map is either a normal map acquired using a photometric stereo method or a shape-from-shading method, or a normal map acquired using a distance map,
The reference data includes a captured image of the subject space,
Where N is an integer greater than or equal to 2 and n is an integer from 1 to N, the estimation unit estimates the second normal map by generating intermediate data by performing the nth linear conversion and the nth nonlinear conversion based on the learning information on the input data in order for n from 1 to N, and performing the (N+1)th linear conversion based on the learning information on the intermediate data.

An acquisition unit that acquires a first normal map of the subject space and reference data as input data, and acquires learning information learned in advance;
An estimation unit that estimates a second normal map based on the input data and the learning information,
The reference data includes at least one of a label map that labels each area of a captured image of the subject space based on the characteristics of the subject, a reliability map that indicates the reliability of the normal information included in the first normal map, and a distance map,
Where N is an integer greater than or equal to 2 and n is an integer from 1 to N, the estimation unit estimates the second normal map by generating intermediate data by performing the nth linear conversion and the nth nonlinear conversion based on the learning information on the input data in order for n from 1 to N, and performing the (N+1)th linear conversion based on the learning information on the intermediate data.

The image processing apparatus according to claim 13, further comprising a storage unit that stores the learning information.

An image processing system comprising: the image processing apparatus according to claim 13; and an image output device that outputs a captured image of a subject space to the image processing apparatus.

An image pickup apparatus comprising: an imaging unit that acquires an image of the subject space as a captured image; and an image processing unit that executes the image processing method according to claim 1 on the captured image.

A program for causing a computer to execute the image processing method according to any one of claims 1 to 12.

A computer-readable storage medium storing the program according to claim 18.