JP7190842B2

JP7190842B2 - Information processing device, control method and program for information processing device

Info

Publication number: JP7190842B2
Application number: JP2018152718A
Authority: JP
Inventors: 誠冨岡; 大輔小竹; 望糟谷; 雅博鈴木
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2017-11-02
Filing date: 2018-08-14
Publication date: 2022-12-16
Anticipated expiration: 2038-08-14
Also published as: JP2019087229A

Description

The present invention relates to an information processing device, a control method for an information processing device, and a program.

Measuring the position and orientation of an imaging device from image information serves many purposes, such as aligning real space with virtual objects in mixed reality and augmented reality, self-position estimation of robots and automobiles, and three-dimensional modeling of objects and spaces.

Non-Patent Document 1 discloses a method that uses a learning model trained in advance to estimate, from an image, geometric information (depth information) that serves as an index for calculating a position and orientation, and then calculates the position and orientation based on the estimated depth information.

Non-Patent Document 1: K. Tateno, F. Tombari, I. Laina and N. Navab, "CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Non-Patent Document 2: Z. Zhang, "A flexible new technique for camera calibration", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330-1334, 2000.
Non-Patent Document 3: J. Engel, T. Schöps and D. Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM", European Conference on Computer Vision (ECCV), 2014.
Non-Patent Document 4: E. Shelhamer, J. Long and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", Transaction on Pattern Analysis and Machine Intelligence (PAMI), vol. 39, pp. 640-651, 2017.
Non-Patent Document 5: A. Kendall, M. Grimes and R. Cipolla, "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization", International Conference on Computer Vision (ICCV), 2015, pp. 2938-2946.
Non-Patent Document 6: S. Holzer, V. Lepetit, S. Ilic, S. Hinterstoisser, N. Navab, C. Cagniart and K. Konolige, "Multimodal Templates for Real-Time Detection of Texture-less Objects in Heavily Cluttered Scenes", European Conference on Computer Vision (ECCV), 2011.
Non-Patent Document 7: J. Engel, T. Schöps and D. Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM", European Conference on Computer Vision (ECCV), 2014.

Non-Patent Document 1 presupposes that the scene in which the learning images used to train the learning model were captured is similar to the scene shown in the input image captured by the imaging device. A solution has therefore been sought that improves the accuracy of geometric information estimation even when the scenes are not similar.

The present invention has been made in view of the above problem, and its object is to provide a technique for acquiring a position and orientation with high accuracy.

An information processing apparatus according to the present invention that achieves the above object comprises:
holding means for holding a plurality of learning models, each trained using, as teacher data, captured images captured by an imaging device and depth information of those captured images, and each used for estimating depth information corresponding to an input image, in association with the imaging positions of the captured images used as the teacher data;
selection means for selecting, from the plurality of learning models, a learning model suited to the scene shown in the input image, based on an evaluation result obtained from the degree of matching between any one of the imaging positions held in association with the learning models and the imaging position at which the input image was captured; and
estimation means for estimating first depth information using the input image and the selected learning model,
wherein the estimation means further estimates second depth information from the input image by a motion stereo method and further estimates, for each learning model, third depth information based on the input image and that learning model, and the selection means calculates the evaluation result of each learning model such that the evaluation result is higher as the degree of matching between the second depth information and the third depth information is higher.

According to the present invention, geometric information can be estimated with high accuracy.

FIG. 1 is a diagram showing the functional configuration of the information processing device 1 in Embodiment 1.
FIG. 2 is a diagram showing the data structure of the learning model group holding unit in Embodiment 1.
FIG. 3 is a diagram showing the hardware configuration of the information processing device 1 in Embodiment 1.
FIG. 4 is a flowchart showing the processing procedure in Embodiment 1.
FIG. 5 is a flowchart showing the processing procedure of step S140 in Embodiment 1.
FIG. 6 is a flowchart showing the processing procedure of step S1120 in Embodiment 1.
FIG. 7 is a flowchart showing the processing procedure of step S1120 in Embodiment 2.
FIG. 8 is a flowchart showing the processing procedure of step S1120 in Embodiment 3.
FIG. 9 is a diagram showing an example of a GUI that presents information characterizing a learning model in Embodiment 6.
FIG. 10 is a diagram showing the functional configuration of the information processing device 2 in Embodiment 3.
FIG. 11 is a diagram showing the functional configuration of the information processing device 3 in one modification of Embodiment 5.
FIG. 12 is a diagram showing the functional configuration of the information processing device 4 in Embodiment 9.
FIG. 13 is a flowchart showing the processing procedure in Embodiment 9.

Embodiments are described below with reference to the drawings. The configurations shown in the following embodiments are merely examples, and the present invention is not limited to the illustrated configurations.

(Embodiment 1)
This embodiment describes a case in which the present invention is applied to the alignment of real space and virtual objects in a mixed reality system, that is, to measuring the position and orientation of an imaging device in real space for use in drawing virtual objects. A user experiencing mixed reality holds a mobile terminal such as a smartphone or tablet and, through the display of the mobile terminal, observes real space on which virtual objects are superimposed. In this embodiment, the mobile terminal is equipped with a monocular RGB camera as the imaging device. A CG image of a virtual object, drawn based on the position and orientation of the camera in real space, is superimposed on the image captured by the camera and presented to the user.

The position and orientation of the imaging device are calculated using geometric information that a learning model estimates from an input image captured by the imaging device. In this embodiment, the geometric information estimated by the learning model is a depth map, that is, depth information estimated for each pixel of the input image. The learning model is a CNN (Convolutional Neural Network). Specifically, each pixel of an image captured at a time t' earlier than the current frame (hereinafter, the previous frame) is projected onto the image captured at time t (hereinafter, the current frame), based on the depth information that the learning model estimates with the previous frame as input (hereinafter, the previous depth map). Projection here means calculating where a pixel of the previous frame appears in the current frame. First, the three-dimensional coordinates (X_{t-1}, Y_{t-1}, Z_{t-1}) of the pixel in the camera coordinate system of the previous frame are calculated by Equation 1 from the image coordinates (u_{t-1}, v_{t-1}) of the pixel in the previous frame, the internal parameters of the camera (f_x, f_y, c_x, c_y), and the depth value D of the pixel in the previous depth map.

Next, the three-dimensional coordinates (X_t, Y_t, Z_t) of that feature point in the camera coordinate system of the current frame are calculated by Equation 2, using t_{(t-1)→t} and R_{(t-1)→t}, the position and orientation of the camera that captured the current frame relative to the position of the camera that captured the previous frame.

Next, Equation 3 converts the three-dimensional coordinates (X_t, Y_t, Z_t) of that feature point in the camera coordinate system of the current frame into the image coordinates (u_t, v_t) of the current frame.

In this embodiment, the processing of Equations 1 to 3 is called projection. The position and orientation t_{(t-1)→t} and R_{(t-1)→t} are calculated so as to minimize the difference between the luminance value of each pixel (u_{t-1}, v_{t-1}) in the previous frame and the luminance value of the pixel (u_t, v_t) in the current frame onto which it is projected. Finally, the position and orientation t_{w→t} and R_{w→t} of the camera that captured the current frame with respect to the world coordinate system are calculated by Equation 4, using the position and orientation t_{w→(t-1)} and R_{w→(t-1)} of the camera that captured the previous frame with respect to the world coordinate system.
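The equation images for Equations 1 to 4 are not reproduced in this text. Under the usual pinhole camera model, and assuming that each pair (R_{a→b}, t_{a→b}) maps coordinates expressed in frame a to coordinates expressed in frame b, they would plausibly take the following form. This is a reconstruction from the surrounding definitions, not the patent's own notation; in particular, the order of the product in Equation 4 depends on that assumed convention.

```latex
\begin{align}
\begin{pmatrix} X_{t-1} \\ Y_{t-1} \\ Z_{t-1} \end{pmatrix}
  &= D \begin{pmatrix} (u_{t-1} - c_x)/f_x \\ (v_{t-1} - c_y)/f_y \\ 1 \end{pmatrix} \tag{1} \\
\begin{pmatrix} X_t \\ Y_t \\ Z_t \end{pmatrix}
  &= R_{(t-1)\to t} \begin{pmatrix} X_{t-1} \\ Y_{t-1} \\ Z_{t-1} \end{pmatrix} + t_{(t-1)\to t} \tag{2} \\
\begin{pmatrix} u_t \\ v_t \end{pmatrix}
  &= \begin{pmatrix} f_x X_t / Z_t + c_x \\ f_y Y_t / Z_t + c_y \end{pmatrix} \tag{3} \\
\begin{pmatrix} R_{w\to t} & t_{w\to t} \\ \mathbf{0}^{\top} & 1 \end{pmatrix}
  &= \begin{pmatrix} R_{(t-1)\to t} & t_{(t-1)\to t} \\ \mathbf{0}^{\top} & 1 \end{pmatrix}
     \begin{pmatrix} R_{w\to (t-1)} & t_{w\to (t-1)} \\ \mathbf{0}^{\top} & 1 \end{pmatrix} \tag{4}
\end{align}
```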

The learning model is trained in advance, from a plurality of images and a plurality of depth maps capturing the same field of view at the same time as those images, so that it can estimate the corresponding depth map when an image is input. For example, a learning model trained on learning images of indoor scenes can estimate depth maps with high accuracy when an indoor image is input. When an outdoor image is input to that learning model, however, the accuracy of the output depth map drops. This embodiment therefore describes a way to select, from among a plurality of learning models each trained for a particular scene, a learning model that can calculate geometric information with high accuracy for the scene in which the input image was captured: namely, selecting the learning model whose learning images are similar to the input image. Scenes are, for example, indoor scenes and outdoor scenes, rooms in Japanese houses and rooms in Western houses, and office scenes and factory scenes.

In this embodiment, the position and orientation of the imaging device are expressed by six parameters: three parameters representing the position of the camera in world coordinates defined in real space, and three parameters representing the orientation of the camera. Unless otherwise noted, the position and orientation of the camera are referred to as the camera position and orientation. The three-dimensional coordinate system defined on the camera, with the optical axis of the camera as the Z axis, the horizontal direction of the image as the X axis, and the vertical direction of the image as the Y axis, is called the camera coordinate system.

<Configuration of information processing device>
FIG. 1 is a diagram showing an example of the functional configuration of the information processing device 1 in this embodiment. The information processing device 1 includes an image input unit 110, a learning model selection unit 120, a learning model group holding unit 130, a geometric information estimation unit 140, and a position and orientation acquisition unit 150. The image input unit 110 is connected to the imaging device 11 mounted on the mobile terminal and to the display information generation unit 12. The position and orientation acquisition unit 150 is connected to the display information generation unit 12. The display information generation unit 12 is connected to the display unit 13. FIG. 1 is only one example of a device configuration and does not limit the scope of application of the present invention. In the example of FIG. 1, the display information generation unit 12 and the display unit 13 are configured outside the information processing device 1, but they may instead be included in the information processing device 1. Alternatively, the display information generation unit 12 may be included in the information processing device 1 while the display unit 13 is configured as a device external to the information processing device 1.

The image input unit 110 receives image data of two-dimensional images of the scene captured by the imaging device 11 in time series (for example, 60 frames per second) and outputs it to the learning model selection unit 120, the geometric information estimation unit 140, the position and orientation acquisition unit 150, and the display information generation unit 12.

The learning model selection unit 120 selects a learning model from among the learning models held by the learning model group holding unit 130, based on the input image received by the image input unit 110, and outputs the selection result to the geometric information estimation unit 140.

The learning model group holding unit 130 holds a plurality of learning models. Details of its data structure are described later. The geometric information estimation unit 140 inputs the input image received by the image input unit 110 to the learning model selected by the learning model selection unit 120 and estimates geometric information. It then outputs the estimated geometric information to the position and orientation acquisition unit 150.

The position and orientation acquisition unit 150 calculates and acquires the position and orientation of the imaging device based on the input image received by the image input unit 110 and the geometric information supplied by the geometric information estimation unit 140. It then outputs the acquired position and orientation information to the display information generation unit 12.

The display information generation unit 12 renders a CG image of the virtual object using the position and orientation acquired from the position and orientation acquisition unit 150 and the internal and external parameters of the camera held by a holding unit (not shown). It then generates a composite image by superimposing the CG image on the input image received by the image input unit 110, and outputs the composite image to the display unit 13. The display unit 13 is the display of the mobile terminal and displays the composite image generated by the display information generation unit 12.

FIG. 2 is a diagram showing the data structure of the learning model group holding unit 130. In this embodiment, at least two learning models are held. For each learning model, at least one of the learning images used to train that learning model is also held. A learning model is, for example, a data file in which a CNN (Convolutional Neural Network) discriminator is saved in binary format.
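As a minimal sketch of the data structure described for FIG. 2, the holding unit can be pictured as a list of records, each pairing one model file with the learning images kept for it. The class and field names (`LearningModelEntry`, `LearningModelGroup`, `model_path`, `learning_image_paths`) are hypothetical and chosen only for illustration; the text specifies only the two "at least" constraints.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class LearningModelEntry:
    """One record of the learning model group (hypothetical layout)."""
    model_path: Path                               # binary file holding the trained CNN
    learning_image_paths: List[Path] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The embodiment keeps at least one learning image per learning model.
        if len(self.learning_image_paths) < 1:
            raise ValueError("each learning model needs at least one learning image")


@dataclass
class LearningModelGroup:
    """Container standing in for the learning model group holding unit."""
    entries: List[LearningModelEntry]

    def __post_init__(self) -> None:
        # The embodiment keeps at least two learning models.
        if len(self.entries) < 2:
            raise ValueError("the group must hold at least two learning models")
```

Keeping the images next to the model file lets the selection step compare the input image against each model's learning images without loading the model itself.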

FIG. 3 is a diagram showing the hardware configuration of the information processing device 1. H11 is a CPU, which controls the various devices connected to the system bus H20. H12 is a ROM, which stores a BIOS (Basic Input/Output System) program and a boot program. H13 is a RAM, which is used as the main storage device of the CPU H11. H14 is an external memory, which stores the programs processed by the information processing device 1. The input unit H15 is a keyboard, a mouse, a robot controller, or the like, and performs processing related to the input of information. The display unit H16 outputs the calculation results of the information processing device 1 to a display device in accordance with instructions from H11. The display device may be of any type, such as a liquid crystal display device, a projector, or an LED indicator. It may also be the display unit 13. H17 is a communication interface, which performs information communication via a network. The communication interface may be Ethernet (registered trademark) or any other type, such as USB, serial communication, or wireless communication. H18 is an input/output unit (I/O) and is connected to the camera H19. The camera H19 corresponds to the imaging device 11.

<Processing>
Next, the processing procedure in this embodiment is described. FIG. 4 is a flowchart showing the processing procedure carried out by an information processing system that includes the information processing device 1 of this embodiment.

In step S110, the system is initialized. That is, the program is read from the external memory H14 and the information processing device 1 is put into an operable state. The parameters of each device connected to the information processing device 1 (such as the imaging device 11) and the initial position and orientation of the imaging device 11 are also read. The internal parameters of the imaging device 11 (focal lengths f_x (horizontal direction of the image) and f_y (vertical direction of the image), image center position c_x (horizontal direction of the image) and c_y (vertical direction of the image), and lens distortion parameters) are calibrated in advance by Zhang's method described in Non-Patent Document 2.

In step S120, the imaging device 11 captures the scene and outputs the image to the image input unit 110.

In step S130, the image input unit 110 acquires, as the input image, the image of the scene captured by the imaging device 11. In this embodiment, the input image is an RGB image.

In step S140, the learning model selection unit 120 calculates an evaluation value for each learning model using the learning images held by the learning model group holding unit 130, and selects a learning model based on the calculated evaluation values. Details of the processing in step S140 are described later with reference to FIGS. 5 and 6.

In step S150, the geometric information estimation unit 140 estimates geometric information using the learning model selected in step S140. Specifically, the geometric information estimation unit 140 inputs the input image to the learning model and estimates a depth map, which is the geometric information. In this embodiment, the image of the previous frame is input to the learning model and the previous depth map is estimated.

In step S160, the position and orientation acquisition unit 150 calculates and acquires the position and orientation of the imaging device 11 using the geometric information (depth map) calculated in step S150. Specifically, each pixel of the previous frame is first projected onto the current frame based on the depth map estimated by the learning model. Next, the position and orientation are calculated by the method of Engel et al. (Non-Patent Document 3) so as to minimize the luminance difference between the pixel values of the projected pixels of the previous frame and the pixel values of the current frame.

In step S170, the display information generation unit 12 renders a CG image of the virtual object using the position and orientation of the imaging device 11 calculated in step S160, generates a composite image by superimposing the CG image on the input image, and passes the composite image to the display unit 13. The display unit 13, which is the display of the mobile device, then displays the composite image.
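The projection and luminance comparison used in step S160 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the relative pose (R, t) is assumed to map previous-frame camera coordinates to current-frame camera coordinates, lens distortion is ignored, and nearest-neighbour lookup replaces any interpolation. The sketch evaluates the cost for a given pose; it does not implement the minimization of Engel et al.

```python
import numpy as np


def project_previous_to_current(depth_prev, intrinsics, R, t):
    """Back-project, transform, and re-project every previous-frame pixel.

    depth_prev: (H, W) previous depth map. intrinsics: (fx, fy, cx, cy).
    R: (3, 3) rotation, t: (3,) translation (assumed previous -> current).
    Returns u_t, v_t, z_t, each of shape (H, W).
    """
    fx, fy, cx, cy = intrinsics
    h, w = depth_prev.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    # Pixel plus depth -> 3D point in the previous camera frame.
    X = (u - cx) / fx * depth_prev
    Y = (v - cy) / fy * depth_prev
    p_prev = np.stack([X, Y, depth_prev], axis=-1)
    # Rigid transform into the current camera frame.
    p_cur = p_prev @ R.T + t
    z = p_cur[..., 2]
    # Perspective projection onto the current image.
    with np.errstate(divide="ignore", invalid="ignore"):
        u_t = fx * p_cur[..., 0] / z + cx
        v_t = fy * p_cur[..., 1] / z + cy
    return u_t, v_t, z


def photometric_cost(gray_prev, gray_cur, depth_prev, intrinsics, R, t):
    """Mean squared luminance difference over pixels that project into the image."""
    u_t, v_t, z = project_previous_to_current(depth_prev, intrinsics, R, t)
    h, w = gray_cur.shape
    ok = np.isfinite(u_t) & np.isfinite(v_t) & (z > 0)
    ui = np.where(ok, np.rint(u_t), -1).astype(int)
    vi = np.where(ok, np.rint(v_t), -1).astype(int)
    ok &= (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)
    if not ok.any():
        return float("inf")
    diff = gray_prev[ok].astype(np.float64) - gray_cur[vi[ok], ui[ok]]
    return float(np.mean(diff ** 2))
```

A pose estimator would search over (R, t) for the value that makes `photometric_cost` smallest, for example with a Gauss-Newton style optimizer on a pose parameterization.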

In step S180, it is determined whether to terminate the system. Specifically, if the user has entered an end command through an input unit (not shown), the processing ends; otherwise, the processing returns to step S120 and continues.
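The overall loop of steps S120 to S180 can be sketched as below. Every callable passed in is a caller-supplied stand-in with a hypothetical signature, and the handling of the very first frame (no previous frame yet, so the initial pose from S110 is used) is an assumption of this sketch rather than something the text specifies.

```python
from typing import Any, Callable


def run_pipeline(capture: Callable[[], Any],
                 select_model: Callable[[Any], Any],
                 estimate_depth: Callable[[Any, Any], Any],
                 estimate_pose: Callable[[Any, Any, Any, Any], Any],
                 display: Callable[[Any, Any], None],
                 should_stop: Callable[[], bool],
                 initial_pose: Any) -> Any:
    """Drive steps S120 to S180 and return the last pose."""
    model = None
    prev_frame = None
    pose = initial_pose                      # read during S110
    while True:
        frame = capture()                    # S120, S130
        if model is None:                    # S140: selection runs only once
            model = select_model(frame)
        if prev_frame is not None:
            prev_depth = estimate_depth(model, prev_frame)              # S150
            pose = estimate_pose(prev_frame, prev_depth, frame, pose)   # S160
        display(frame, pose)                 # S170
        prev_frame = frame
        if should_stop():                    # S180
            return pose
```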

<Learning model selection processing>
FIG. 5 is a flowchart showing the procedure of the learning model selection processing of S140 in this embodiment.

In step S1110, the learning model selection unit 120 determines whether it has already been decided which of the learning models held by the learning model group holding unit 130 is to be used. If the learning model to be used has not yet been decided, the processing proceeds to S1120. If it has already been decided, the processing ends.

In step S1120, the learning model selection unit 120 calculates an evaluation value for each learning model held by the learning model group holding unit 130 by a similar image search between the input image and the learning images. The evaluation value is the degree of similarity between the input image captured by the imaging device 11 and a learning image, and represents how well the learning model fits the scene captured by the imaging device. In this embodiment, the evaluation value is a continuous value from 0 to 1, and the closer it is to 1, the better the fit. Details of the evaluation value calculation processing are described later with reference to FIG. 6.

In step S1130, the learning model selection unit 120 selects the learning model whose learning image yielded the largest evaluation value calculated in S1120. At this point, the learning model is loaded into a storage unit such as the RAM H13 so that the geometric information estimation unit 140 can estimate geometric information.
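Steps S1110 and S1130 amount to a guarded arg-max over the evaluation values. A minimal sketch, assuming the evaluation values are available as a mapping from a model name to a score (the function name and the mapping are hypothetical):

```python
from typing import Dict, Optional


def select_learning_model(evaluation_values: Dict[str, float],
                          already_selected: Optional[str] = None) -> str:
    """Return the name of the learning model to use."""
    # S1110: if a model was chosen earlier, keep it and skip re-evaluation.
    if already_selected is not None:
        return already_selected
    if not evaluation_values:
        raise ValueError("no learning models to choose from")
    # S1130: pick the model whose evaluation value is largest.
    return max(evaluation_values, key=evaluation_values.get)
```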

<Evaluation value calculation processing>
FIG. 6 is a flowchart showing the detailed procedure of the learning model evaluation value calculation processing of S1120 in this embodiment.

In step S1210, the learning model selection unit 120 loads, from the learning model group holding unit 130 into the RAM H13, the learning images used to train a learning model whose evaluation value has not yet been calculated.

In step S1220, the learning model selection unit 120 calculates the evaluation value of the learning model based on the similarity between the input image and the learning image. This embodiment uses the pHash method, which reduces the image, applies a discrete cosine transform to the luminance information of the image, and calculates a hash value from the low-frequency components. First, the learning model selection unit 120 calculates the hash values of the input image and the learning image and then calculates the Hamming distance between them. In this embodiment, the similarity between the input image and a learning model is this Hamming distance. Using the largest of the Hamming distances calculated for the models, the Hamming distance of each model is normalized to a continuous value from 0 to 1. In this embodiment, this normalized value is used as the evaluation value of each learning model.

In step S1230, the learning model selection unit 120 determines whether evaluation values have been calculated for all learning models. If all evaluation values have been calculated, the processing ends. If there is a learning model whose evaluation value has not yet been calculated, the processing returns to step S1210. It is not necessary, however, to evaluate all the learning models held by the learning model group holding unit 130; the configuration may instead calculate evaluation values only for the N most frequently used learning models (N is an integer). In that case, step S1230 determines whether evaluation of those top N learning models has been completed.

<Effects>
As described above, in Embodiment 1 an evaluation value is calculated for each of a plurality of learning models, and a learning model with a high evaluation value is selected. To do so, the similarity between the input image and the learning images used to train each learning model is calculated, and a learning model with high similarity is given a high evaluation value. The position and orientation of the imaging device are then calculated using the geometric information estimated with the learning model that has a high evaluation value. Selecting a learning model whose learning images are similar to the input image in this way allows the learning model to estimate geometric information with high accuracy. Consequently, when this estimated geometric information is used to obtain the position and orientation of the imaging device, for example, the position and orientation can be calculated with high accuracy. Another use of the estimated geometric information is image recognition in applications such as the automated driving of automobiles, described later.

Furthermore, selecting and using a small-scale learning model trained for an individual scene makes it possible to estimate geometric information even on a computer with a small memory capacity. As a result, the position and orientation of the imaging device can be calculated even on a mobile terminal.

In addition, selecting and using a small-scale learning model trained for an individual scene makes it possible to estimate geometric information in less execution time than with a large-scale learning model. As a result, the position and orientation of the imaging device can be calculated in a short time.

<Modification>
In Embodiment 1, the learning model group holding unit 130 holds the learning images used to train the learning models. However, the held images need not be the learning images themselves, as long as they can characterize the learning model. For example, they may be reduced-size versions of the learning images, images cropped from part of a learning image, or images similar to the learning images.

In Embodiment 1, the pHash method is used to evaluate the learning models. However, the evaluation method is not limited to pHash; any method that can calculate the similarity between the input image and the learning images used to train a learning model may be used. Specifically, the similarity between color histograms calculated from the input image and from the learning image may be used. Alternatively, HOG (Histogram of Oriented Gradients) features, which are histograms of the luminance gradient directions in local regions, may be calculated for the input image and the learning image, and the similarity between the HOG features may be used. The similarity between GIST features may also be used; a GIST feature divides the image into small regions and takes as its feature amount the responses obtained by applying Gabor filters of multiple orientations and frequencies to those regions.
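As one concrete instance of the color histogram variant, the following sketch compares two images by histogram intersection. The specification does not say which histogram layout or similarity measure is used, so the joint RGB binning, the intersection measure, and the function names `color_histogram` and `histogram_intersection` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B) values


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 bins."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1 means the normalized histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

The intersection value can be used directly as the evaluation value of the learning model whose learning image produced the second histogram.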

The number of local features detected in the input image and in the learning image whose feature amounts are similar may also be used as the evaluation value. As the local features, for example, SIFT feature points may be used, whose feature amount is a histogram of gradient orientations within a local region of a smoothed image. Alternatively, ORB feature points may be used, whose feature amount is a binary code generated from brightness comparisons between pairs of points within a local image region. The local features may also be image features obtained by locating characteristic positions such as corners with the Harris corner detection method and using the surrounding color information as the feature amount, or by using a template of the surrounding small region as the feature amount. Character information detected by character recognition may also be used as a local feature. Multiple types of image features may be used in combination.
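Counting similar local features can be sketched for binary descriptors of the ORB kind as follows. Descriptors are represented here as plain Python integers, and the Hamming-distance threshold `max_distance` as well as the function names are hypothetical choices for illustration; the specification does not define when two feature amounts count as similar.

```python
from typing import List


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def count_matches(input_desc: List[int], learning_desc: List[int], max_distance: int = 2) -> int:
    """Count input descriptors whose nearest learning descriptor is within max_distance bits.

    The returned count plays the role of the evaluation value.
    """
    if not learning_desc:
        return 0
    matches = 0
    for d in input_desc:
        nearest = min(hamming_distance(d, t) for t in learning_desc)
        if nearest <= max_distance:
            matches += 1
    return matches
```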

Furthermore, the similarity between Bag-of-Visual-Words representations, obtained by converting local features into a histogram through vector quantization, may be used as the evaluation value. Alternatively, a Bag-of-Visual-Words representation may be calculated in advance from the learning images of each learning model, and a classifier based on an SVM (support vector machine), which computes the decision boundary that maximizes the distance to the feature spaces in which the respective features lie, may be used. In this case, the feature amount calculated from the input image is fed to the SVM, a vote is cast for the learning model that the SVM identifies, and the learning model with the most votes may be selected. As a further alternative, the input image may be fed to the neural network proposed by Shelhamer et al. (Non-Patent Document 4), trained in advance on the learning images of each model to output a label corresponding to each learning model, and the learning model corresponding to the output label may be selected.
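The Bag-of-Visual-Words similarity can be sketched as below, assuming a codebook of visual words is already available (how the codebook is built is not covered here). Cosine similarity is used as the histogram comparison; that choice, and all function names, are assumptions for illustration rather than details from the specification.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Vector quantization: index of the closest visual word (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def bovw_histogram(descriptors: Sequence[Vector], codebook: Sequence[Vector]) -> List[float]:
    """Normalized histogram of visual-word occurrences for one image."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def cosine_similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Cosine similarity between two histograms; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```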

The input is not limited to a single image and may be multiple images; in that case, the sum, maximum, minimum, mean, or median of the similarities between each input image and the learning image may be used as the evaluation value.
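Combining the per-image similarities into a single evaluation value can be sketched as follows. The function name `aggregate_similarity` and the `mode` strings are hypothetical; only the five aggregation rules themselves come from the text.

```python
import statistics
from typing import Callable, Dict, Sequence

# The five aggregation rules named in the text.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": sum,
    "max": max,
    "min": min,
    "mean": statistics.mean,
    "median": statistics.median,
}


def aggregate_similarity(similarities: Sequence[float], mode: str = "mean") -> float:
    """Combine the similarities of several input images into one evaluation value."""
    if not similarities:
        raise ValueError("at least one similarity is required")
    return float(AGGREGATORS[mode](similarities))
```

Using the minimum is the most conservative choice (every input image must resemble the learning image), while the maximum rewards a model as soon as any one input image matches well.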

In Embodiment 1, the learning model group holding unit 130 holds the learning images, and hash values and feature amounts are calculated when the evaluation value is calculated. However, hash values and feature amounts may instead be calculated from the learning images in advance and held by the learning model group holding unit 130. This removes the need to calculate hash values and feature amounts from the learning images at selection time, so a learning model can be selected in a short time.
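The benefit of precomputing can be sketched as follows. Note that this sketch uses a simple average hash as a stand-in for pHash (which is DCT-based), purely to keep the example short; the function names and the dictionary layout of the stored hashes are likewise assumptions.

```python
from typing import Dict, List, Sequence

Gray = Sequence[Sequence[int]]  # small grayscale image as rows of 0..255 values


def average_hash(image: Gray) -> int:
    """Stand-in perceptual hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def precompute_hashes(learning_images: Dict[str, List[Gray]]) -> Dict[str, List[int]]:
    """Done once, offline: hash every learning image of every learning model."""
    return {m: [average_hash(img) for img in imgs] for m, imgs in learning_images.items()}


def select_model(input_image: Gray, stored_hashes: Dict[str, List[int]]) -> str:
    """At selection time only the input image is hashed; stored hashes are reused."""
    query = average_hash(input_image)

    def best_distance(hashes: List[int]) -> int:
        return min(bin(query ^ h).count("1") for h in hashes)

    return min(stored_hashes, key=lambda m: best_distance(stored_hashes[m]))
```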

In Embodiment 1, each learning model held by the learning model group holding unit 130 is a data file in which the classifier is output in binary format. However, any format from which the geometric information estimation unit 140 can estimate geometric information is acceptable; for example, the model may be held as a data file containing the CNN's network structure and weights. Alternatively, the network structure may be fixed in advance and only the weights extracted and output to a data file. Extracting only the weights makes the data file smaller than outputting the classifier itself in binary format.

In Embodiment 1, the evaluation value is calculated from the degree of similarity between the learning image and the input image. However, any calculation method may be used as long as it gives a high evaluation value when the scenes in which the learning image and the input image were captured are similar. For example, scene information for the scenes in which the learning image and the input image were captured may be detected, and the evaluation value calculated from how well the scene information matches. Scene information here means a scene category such as indoor, outdoor, coast, mountain, or road. Scene information is detected with a scene discrimination learning model that discriminates the scene category. For example, the scene discrimination learning model is a neural network trained by deep learning to output 1 if the input image belongs to the category in question and 0 otherwise. Alternatively, multiple local features may be detected from a single image, and a GLC (generalized local correlation) feature, which lists the means and correlation values of those feature amounts, calculated. The scene category may then be discriminated by an SVM (support vector machine) that computes the decision boundary maximizing the distance between the feature spaces in which the GLC features of the respective categories lie.

In Embodiment 1, each pixel of the previous frame is projected onto the current frame based on the depth map that the geometric information estimation unit 140 estimated with the learning model, and the position and orientation are calculated so as to minimize the luminance difference between the projected pixels of the previous frame and the corresponding pixels of the current frame. However, any method that obtains the position and orientation based on the output of the learning model may be used. For example, the geometric information estimation unit may estimate depth maps for both the previous frame and the current frame using the learning model. The position and orientation are then calculated with the ICP (Iterative Closest Point) method, which repeatedly updates the position and orientation so as to minimize the distance between the three-dimensional point of each pixel in the current depth map and its nearest neighbor among the three-dimensional points of the pixels in the previous depth map. Alternatively, the position and orientation acquisition unit 150 may calculate local features from the previous frame and the current frame and find corresponding points that indicate the same structure in both. The position and orientation may then be calculated by solving the PnP problem so as to minimize the distance between corresponding points when the local features of the previous frame are projected onto the current frame using the depth values of the depth map. The geometric information estimation unit may also use the output of the learning model indirectly; that is, the position and orientation may be calculated from geometric information obtained by correcting the output of the learning model with time-series filtering (as described in Non-Patent Document 1).
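For reference, the luminance-difference minimization of Embodiment 1 can be written in a common direct-alignment form. The notation below is assumed for illustration and is not taken from the specification: K is the camera intrinsic matrix, \tilde{u} is a pixel in homogeneous coordinates, D_prev is the depth map estimated for the previous frame, (R, t) is the relative rotation and translation, and \pi is perspective division. The use of a sum of squared differences is also an assumption; the text only states that the luminance difference is minimized.

```latex
% Assumed notation (not from the source): K, \tilde{u}, D_{\mathrm{prev}}, (R, t), \pi as described above.
\begin{align}
  p(u) &= D_{\mathrm{prev}}(u)\, K^{-1} \tilde{u} \\
  u'(u; R, t) &= \pi\bigl( K \, ( R\, p(u) + t ) \bigr) \\
  (R^{*}, t^{*}) &= \operatorname*{arg\,min}_{R,\,t} \sum_{u}
      \bigl( I_{\mathrm{cur}}(u'(u; R, t)) - I_{\mathrm{prev}}(u) \bigr)^{2}
\end{align}
```

The first line back-projects a previous-frame pixel to a three-dimensional point using the estimated depth, the second projects that point into the current frame, and the third selects the pose for which the two luminance values agree best over all pixels.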

In Embodiment 1, the geometric information estimated by the learning model is a depth map. However, any learning model may be applied to this embodiment as long as the position and orientation of the imaging device can be calculated from the geometric information it outputs. For example, a learning model that calculates, as geometric information, salient points in the input image for use in obtaining the position and orientation may be used. In this case, the geometric information estimation unit 140 estimates salient points from the previous frame and the current frame using the learning model, and the position and orientation acquisition unit calculates the position and orientation with the five-point algorithm based on salient points that indicate the same structure in both frames. Alternatively, a learning model (Non-Patent Document 5) trained to take the two images of the previous and current frames as input and to estimate, as geometric information, the six degrees of freedom of the position and orientation between them may be used. In this case, instead of the position and orientation acquisition unit 150 calculating the position and orientation, the position and orientation that the geometric information estimation unit 140 estimated with the learning model may be input directly to the display information generation unit 12 as the position and orientation acquisition result.

Embodiment 1 describes an example applied to aligning real space with virtual objects in a mixed reality system. However, the information processing device 1 described in this embodiment is not limited to that application and can be applied to anything that uses the geometric information output by the learning model or the position and orientation acquisition result. For example, the information processing device 1 may be mounted on an autonomous mobile robot or an automobile and used as an autonomous mobile system that estimates the self-position of the robot or automobile. Such an autonomous mobile system may include a movement mechanism such as an electric motor and a control unit that decides actions based on the position and orientation calculated by the position and orientation acquisition unit 150 and controls the movement mechanism. The device may also be attached to the tip of an industrial robot hand and used as a robot system that measures the position and orientation of the robot hand. Such a robot system may include a manipulator such as a robot arm, a gripping device such as a suction hand, and a control unit that controls the manipulator and the gripping device based on the position and orientation calculated by the position and orientation acquisition unit 150.

The information processing device 1 is not limited to position and orientation estimation and may also be used for three-dimensional reconstruction. For example, it may be used as a measurement system for generating CAD models of industrial parts, buildings, and the like. Such a measurement system may further include a three-dimensional model generation unit that generates a three-dimensional model from the geometric information estimated by the geometric information estimation unit 140. It may also be used as a device that obtains a depth map with high accuracy from a camera that cannot itself acquire depth maps, such as an RGB camera or a camera that captures grayscale images.

Embodiment 1 describes a configuration in which the mobile device includes the learning model selection unit 120 and the learning model group holding unit 130. However, a cloud server may provide and execute some of the functions of the information processing device described in Embodiment 1. For example, the cloud server may include the learning model selection unit 120 and the learning model group holding unit 130. In this configuration, the mobile terminal first transfers the input image to the cloud server using a communication unit (not shown). Next, the learning model selection unit 120 on the cloud server calculates evaluation values for the learning models held by the learning model group holding unit 130 on the cloud server and selects a learning model based on the evaluation results. The cloud server then transfers the selected learning model to the mobile terminal using the communication unit. With this configuration, the mobile device neither needs to hold a large group of learning models nor needs to perform the computation for learning model selection, so the present invention can be applied even to a mobile device equipped with only a small-scale computer.

If the camera that captured the learning images differs from the imaging device 11, the camera parameters of the camera that captured the learning images may also be held for each learning model held by the learning model group holding unit 130. In this case, as in Non-Patent Document 1, the geometric information that the geometric information estimation unit 140 estimated with the learning model is corrected, based on the camera parameters of the imaging device 11 and those of the camera that captured the learning images, so that its scale matches the imaging device 11.
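One simple form such a scale correction could take is sketched below: the estimated depth is multiplied by the ratio of the current camera's focal length to that of the camera used for the learning images. Treating the focal length as the only relevant camera parameter is an assumption made for this sketch, and the function name `rescale_depth` and its parameters are hypothetical; the specification only says that the correction is based on the two sets of camera parameters.

```python
from typing import List, Sequence


def rescale_depth(
    depth_map: Sequence[Sequence[float]],
    f_train: float,
    f_current: float,
) -> List[List[float]]:
    """Scale an estimated depth map by the focal length ratio f_current / f_train.

    Assumption: both focal lengths are expressed in pixels and other intrinsic
    differences (principal point, distortion) are ignored.
    """
    if f_train <= 0 or f_current <= 0:
        raise ValueError("focal lengths must be positive")
    scale = f_current / f_train
    return [[d * scale for d in row] for row in depth_map]
```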

This embodiment describes a configuration in which the imaging device that captures images is an RGB camera. However, the imaging device is not limited to an RGB camera; any camera that captures images of real space may be used. For example, it may be a camera that captures grayscale images, or a camera that can capture depth information, range images, or three-dimensional point cloud data. It may also be a monocular camera, or a camera provided with two or more cameras or sensors.

(Embodiment 2)
In Embodiment 1, the similarity between the input image and the learning images used to train a learning model is calculated as the evaluation value. In contrast, Embodiment 2 describes an example in which the evaluation value of a learning model is calculated by comparing the types of objects detected in the input image with the types of objects detected in advance in the learning images.

<Configuration of information processing device>
The configuration of the information processing device in Embodiment 2 is the same as FIG. 1, which shows the configuration of the information processing device 1 described in Embodiment 1, so its description is omitted. What differs from Embodiment 1 is the information held by the learning model group holding unit 130. In this embodiment, the learning model group holding unit 130 holds at least two learning models. In addition, for each learning model it holds an object information list describing object information detected in advance from the learning images used to train that model. In this embodiment, one object information list is associated with each learning model, and the list holds object information (in this embodiment, object types) such as "desk", "table", "bed", and "chair".

<Processing>
The overall processing procedure in Embodiment 2 is the same as FIG. 4, which shows the processing procedure of the information processing device 1 described in Embodiment 1, so its description is omitted. The details of step S140 for the learning model in FIG. 4 are the same as in FIG. 5, so their description is also omitted. What differs from Embodiment 1 is the evaluation value calculation process in which the learning model selection unit calculates the evaluation value of a learning model in step S1120 of FIG. 5. In this embodiment, the process of FIG. 7 is performed in place of that of FIG. 6.

The learning model evaluation value calculation process according to this embodiment is described below with reference to FIG. 7. Step S1340 in FIG. 7 is the same as step S1230 in FIG. 6, so its description is omitted.

In step S1310, the learning model selection unit 120 performs object detection on the input image and stores the detected object information in a detected object type list. Object detection uses an object detection learning model that determines whether an object is present. This object detection learning model is a neural network trained by deep learning to output 1 if the input image contains the target object and 0 if it does not. Objects are detected in the input image with the object detection learning model, and if a target object type is present, the corresponding object information is recorded in the detected object type list.

In step S1320, the learning model selection unit 120 loads the object information list associated with a learning model whose evaluation value has not yet been calculated from the learning model group holding unit 130 into H13, which is a RAM.

In step S1330, the learning model selection unit 120 compares the detected object type list obtained in step S1310 with the object information list loaded in step S1320 and searches for object information whose object type matches. When a matching object type is found, a vote is cast for the learning model in whose learning images that object was detected. Specifically, each learning model is assigned a memory that holds an integer, and if there is matching object information, the memory value is incremented by 1. This memory is assumed to have been initialized to 0 in the initialization step S110 of FIG. 4. The evaluation value in this embodiment is the number of votes.
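The voting of steps S1310 to S1330 can be sketched as follows. The sketch assumes that one vote is cast per matching object type and that the object information lists are available as a dictionary keyed by a model identifier; the function name and data layout are hypothetical.

```python
from typing import Dict, Iterable, List


def vote_by_object_type(
    detected_types: Iterable[str],
    object_info_lists: Dict[str, List[str]],
) -> Dict[str, int]:
    """Return the number of votes (the evaluation value) for each learning model."""
    # Vote counters start at 0, mirroring the initialization in step S110.
    votes = {model_id: 0 for model_id in object_info_lists}
    detected = set(detected_types)  # detected object type list from step S1310
    for model_id, types in object_info_lists.items():  # list loaded in step S1320
        # One vote per object type present in both lists (step S1330).
        votes[model_id] += len(detected & set(types))
    return votes
```

For an input image in which "desk" and "chair" are detected, a model whose list contains both receives two votes, while a model whose list contains only "chair" receives one.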

<Effect>
As described above, Embodiment 2 compares the object information detected from the learning images used to train a learning model with the object information detected from the input image, and gives the learning model a higher evaluation value the more objects of the same type appear in both. The position and orientation of the imaging device are then calculated using geometric information estimated with a learning model that has a large evaluation value. This makes it possible to select a learning model whose learning images show the same types of objects as the input image, so the learning model can estimate geometric information with high accuracy and the position and orientation of the imaging device can be obtained with high accuracy.

<Modification>
In Embodiment 2, an object detection learning model trained in advance by machine learning is used to detect objects in the learning images and the input image. However, any object detection method that can determine whether an object type is present may be used. For example, local features may be calculated in advance for each object type, and the object type may be judged to be detected if the number of those local features that match local features detected in the input image is at or above a predetermined threshold. Alternatively, template images cut out from images of the objects may be held in advance, and objects may be detected in the learning images and the input image by template matching.

In the object detection process of Embodiment 2, evaluation values are assigned to learning models using the binary result of whether or not an object type was detected. However, any evaluation value that becomes higher when the same types of objects appear in the input image and the learning images may be used. For example, an object detection learning model trained to output a continuous value from 0 to 1 representing the probability that an object is present may be used, and evaluation values assigned to the learning models based on those values. Specifically, in step S1330, each learning model may be assigned a memory that holds a real number, and the product of the presence probability of an object type detected in the learning images and the presence probability of that object type detected in the input image may be added to the memory value, with the resulting value used as the evaluation value.
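The probability-weighted variant can be sketched as follows. The dictionaries mapping object types to presence probabilities, and the function name `probabilistic_score`, are assumed representations for illustration; object types missing from either dictionary are treated as contributing nothing.

```python
from typing import Dict


def probabilistic_score(
    input_probs: Dict[str, float],
    learning_probs: Dict[str, float],
) -> float:
    """Accumulate, per object type, the product of the two presence probabilities.

    input_probs: presence probability of each object type in the input image.
    learning_probs: presence probability of each object type in the learning images.
    """
    score = 0.0  # real-valued memory for this learning model
    for object_type, p_input in input_probs.items():
        if object_type in learning_probs:
            score += p_input * learning_probs[object_type]
    return score
```

Compared with binary voting, an object that is detected with low confidence in either image contributes only a small amount, so uncertain detections influence the selection less.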

(Embodiment 3)
In Embodiment 1, the similarity between the input image and the learning images used to train a learning model is calculated as the evaluation value. In Embodiment 2, the evaluation value of a learning model is calculated by comparing the object types detected in the input image with the object types detected in advance in the learning images. In contrast, Embodiment 3 describes an example in which the evaluation value of a learning model is calculated using position information on where the input image and the learning images used to train the learning model were captured.

<Configuration of information processing device>
As shown in FIG. 10, the information processing device 2 in Embodiment 3 further includes a position information acquisition unit 1000 in addition to the configuration of FIG. 1, which shows the information processing device 1 described in Embodiment 1. The position information acquisition unit 1000 receives sensor information such as GPS signals or WiFi signals from a position information acquisition sensor (not shown) and calculates, as position information, coordinate values or the identification ID of an access point, for example. The position information acquisition unit 1000 outputs the acquired position information to the learning model selection unit 120. The information held by the learning model group holding unit 130 also differs from Embodiment 1. In this embodiment, the learning model group holding unit 130 holds at least two learning models together with position information lists describing the position information of where the learning images used to train each model were captured.

<Processing>
The processing procedure in Embodiment 3 is the same as FIG. 4, which shows the processing procedure of the information processing device 1 described in Embodiment 1, so its description is omitted. The details of step S140 in FIG. 4 are the same as in FIG. 5, so their description is also omitted. What differs from Embodiment 1 is the evaluation value calculation process in which the learning model selection unit calculates the evaluation value of a learning model in step S1120 of FIG. 5. In this embodiment, the process of FIG. 8 is performed in place of that of FIG. 6.

The learning model evaluation value calculation process according to this embodiment is described below with reference to FIG. 8. Step S1340 in FIG. 8 is the same as step S1230 in FIG. 6, so its description is omitted.

In step S1410, the position information acquisition unit 1000 acquires position information, such as latitude and longitude or the identification ID of an access point, from sensor information such as GPS signals or WiFi signals that was acquired from the position information acquisition sensor when the learning images held by the learning model group holding unit 130 were captured.

In step S1420, the learning model selection unit 120 loads the position information list associated with a learning model whose evaluation value has not yet been calculated from the learning model group holding unit 130 into H13, which is a RAM.

In step S1430, the learning model selection unit 120 compares the position information acquired in step S1410 with the position information list loaded in step S1420 and searches for matching position information. Specifically, if the identification ID of a WiFi access point is used as position information, then whenever a matching access point identification ID is found, a vote is cast for the learning model whose position information list contains it.

If latitude and longitude obtained from GPS are used as position information, it is determined whether the distance between the position information acquired in step S1410 and the position information in the position information list held by the learning model group holding unit is within a predetermined threshold. Then, if the country or region of the coordinates derived from the position information is the same, the position information is determined to match, and the learning model is voted for. Each learning model is allocated a memory that holds an integer, and the memory value is incremented by 1 whenever matching position information is found. This memory is assumed to have been initialized to 0 in initialization step S110 of FIG. 4. The evaluation value in this embodiment is the number of votes.
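
The voting in step S1430 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout, the haversine distance, and the 1 km threshold are all assumptions introduced here, and the country or region check mentioned above is left out for brevity.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def count_votes(observed, position_lists, max_km=1.0):
    """Return {model_name: votes}; counters start at 0, as after step S110."""
    votes = {name: 0 for name in position_lists}
    for name, entries in position_lists.items():
        for entry in entries:
            if isinstance(entry, str):
                # Access point ID: vote on an exact match.
                matched = entry in observed["ap_ids"]
            else:
                # (lat, lon) pair: vote when within the distance threshold.
                matched = haversine_km(observed["latlon"], entry) <= max_km
            if matched:
                votes[name] += 1
    return votes

# Hypothetical data: one observation and two models with their position lists.
observed = {"ap_ids": {"ap-01"}, "latlon": (35.6812, 139.7671)}
position_lists = {
    "model_a": ["ap-01", (35.6815, 139.7670)],
    "model_b": ["ap-99", (34.7025, 135.4959)],
}
votes = count_votes(observed, position_lists)
best = max(votes, key=votes.get)  # the evaluation value is the number of votes
```

Here each list entry contributes at most one vote, so a model whose learning images were captured near the observed position accumulates a higher count.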

<Effect>
As described above, in the third embodiment, a higher evaluation value is given to a learning model the more closely the position information at which the input image was captured matches the position information at which the learning images used to train that model were captured. This makes it possible to select a learning model whose learning images were captured at a position matching that of the input image, so the learning model can estimate geometric information with high accuracy, and the position and orientation of the imaging device can be calculated with high accuracy.

<Modification>
In the third embodiment, latitude and longitude obtained from GPS and the identification ID of a WiFi access point are given as examples of position information. However, any position information may be used as long as it can identify the positions at which the input image and the learning images were captured. For example, it may be a country name, region name, or address derived from the latitude and longitude, or one of the identification IDs other than a WiFi access point ID described later.

The examples given for measuring position information obtain latitude and longitude from GPS signals and access point identification IDs from WiFi signals. However, any method may be used as long as position information can be measured. For example, if a place name appears in the input image, position information may be calculated from the place name. The identification ID of an infrared beacon may be detected with an optical sensor and used as position information, or the identification ID of an ultrasonic beacon may be detected with a microphone. The identification ID of a radio beacon based on BLE (Bluetooth (registered trademark) Low Energy) may be detected and used. The base station ID of a 3G or 4G mobile network may also be measured and used as position information. Furthermore, only one of the kinds of position information listed here may be used, or several may be used together.

(Embodiment 4)
In the first embodiment, the degree of similarity between the input image and the learning images used to train each learning model is calculated as the evaluation value. In the second embodiment, the evaluation value is calculated by comparing the object types detected in the input image with object types detected in advance in the learning images. In the third embodiment, the evaluation value is calculated using the position information at which the input image and the learning images were captured. With the methods described so far, however, a learning model trained on learning images whose appearance differs from the input image may be selected even when the shooting location is the same: for example, a model trained on images captured at night when the input image shows a daytime scene, a model trained on snowy winter scenes when the input image was captured in spring, or a model trained on sunny-day images when the input image was captured in the rain. Using a learning model trained on learning images with a different appearance lowers the accuracy of the geometric information that the model estimates, which makes it difficult to obtain the position and orientation with high accuracy. Therefore, in the fourth embodiment, a higher evaluation value is calculated for a learning model the more closely the situation information matches between the input image and the learning images used to train the model, where situation information describes conditions that can change the appearance of an image, such as the date and time of capture, the season, and the weather.

<Configuration of information processing device>
The configuration of the information processing apparatus according to the fourth embodiment is the same as that shown in FIG. 1 for the information processing apparatus 1 described in the first embodiment, so its description is omitted. The information held by the learning model group holding unit 130 differs from that in the first embodiment. In this embodiment, the learning model group holding unit 130 holds at least two learning models. In addition, for each learning model, it holds situation information describing the conditions, capable of changing the appearance of an image, under which the learning images used to train that model were captured. In this embodiment, date and time information is used as a specific example of situation information. One situation information list is associated with each learning model, and the list holds the situation information recorded when the learning images were captured. The information processing apparatus is also assumed to have an internal clock from which the shooting time can be obtained.

<Processing>
The overall processing procedure in the fourth embodiment is the same as that of the information processing apparatus 1 described in the first embodiment and shown in FIG. 4, so its description is omitted. The details of the learning model selection in step S140 of FIG. 4 are the same as in FIG. 5, so their description is also omitted. What differs from the first embodiment is the evaluation value calculation process in step S1120 of FIG. 5, in which the learning model selection unit calculates the evaluation value of each learning model.

In the evaluation value calculation process, a higher evaluation value is given to a learning model the more closely the date and time information acquired from the internal clock matches the date and time information held as situation information by the learning model group holding unit. Specifically, if the difference between the shooting time held in the situation information list and the shooting time of the input image is within a predetermined time, and the month and day of shooting are within a predetermined number of days of each other, the learning model is regarded as having been trained on images captured at a matching time and season. Each learning model is allocated a memory that can hold a binary value (1: True / 0: False) representing whether the situation information matches, and the memory is set to 1 (True) when the learning model is judged to have been trained on images captured at a matching time and season. The memory is assumed to have been initialized to 0 (False) in initialization step S110 of FIG. 4.
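
The date and time check above could be sketched as follows. The function names, the 120-minute and 30-day thresholds, the wrap-around handling at midnight and at the new year, and the rule that a single matching list entry sets the flag are assumptions made for illustration; the description only states that both differences must fall within predetermined limits.

```python
from datetime import datetime

def situation_matches(input_time, training_time, max_minutes=120, max_days=30):
    """Return True when time of day and time of year are both close enough."""
    # Time-of-day difference in minutes, wrapping around midnight.
    m1 = input_time.hour * 60 + input_time.minute
    m2 = training_time.hour * 60 + training_time.minute
    minute_gap = abs(m1 - m2)
    minute_gap = min(minute_gap, 24 * 60 - minute_gap)
    # Day-of-year difference, wrapping around the new year (365-day approximation).
    d1 = input_time.timetuple().tm_yday
    d2 = training_time.timetuple().tm_yday
    day_gap = abs(d1 - d2)
    day_gap = min(day_gap, 365 - day_gap)
    return minute_gap <= max_minutes and day_gap <= max_days

def evaluate_models(input_time, situation_lists):
    """Return {model_name: 0 or 1}; flags start at 0 (False), as after step S110."""
    flags = {name: 0 for name in situation_lists}
    for name, times in situation_lists.items():
        if any(situation_matches(input_time, t) for t in times):
            flags[name] = 1
    return flags

# Hypothetical situation information lists for three models.
situation_lists = {
    "model_day": [datetime(2017, 8, 1, 12, 30)],
    "model_night": [datetime(2017, 8, 10, 23, 0)],
    "model_winter": [datetime(2018, 1, 15, 13, 0)],
}
flags = evaluate_models(datetime(2018, 8, 14, 13, 0), situation_lists)
```

Comparing time of day and day of year separately, rather than full timestamps, lets learning images from a previous year still count as the same season.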

<Effect>
As described above, in the fourth embodiment, a higher evaluation value is given to a learning model the more closely the situation information that can change the appearance of an image matches between the input image and the learning images used to train the model. This makes it possible to select a learning model whose learning images were captured under conditions matching those of the input image, so the learning model can estimate geometric information with high accuracy, and the position and orientation of the imaging device can be calculated with high accuracy.

<Modification>
In the fourth embodiment, date and time information acquired from the electronic clock held by the information processing apparatus is used as the situation information. However, any method of obtaining the date and time information may be used as long as the current time can be obtained. It may be obtained from an external server over a network via the I/F (H17), or the user may enter it using an input means such as a keyboard.

In the fourth embodiment, date and time information is used as situation information. However, the situation information is not limited to date and time information, and any information representing conditions that change the appearance of a captured image may be used. For example, instead of using the date and time information directly, a time category derived from it, such as morning, noon, evening, night, or twilight, may be used as situation information. In that case, the situation information list holds the time category in which each learning image was captured, and a higher evaluation value may be calculated for learning models whose category matches. Season information such as spring, summer, autumn, or winter may also be used as situation information. In that case, the situation information list holds the season in which each learning image was captured, and a higher evaluation value may be calculated for learning models whose season matches. Weather information such as sunny, cloudy, rain, or snow, obtained over a network via the I/F (H17) from a website that distributes weather data, may also be used as situation information. In that case, the situation information list holds the weather under which each learning image was captured, and a higher evaluation value may be calculated for learning models whose weather matches.
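
A rough sketch of category-based situation information follows. The hour boundaries, the northern-hemisphere meteorological seasons, the omission of the twilight category (which would need sun position data), and the idea of scoring by the number of agreeing fields are all assumptions for illustration, not details given in this description.

```python
from datetime import datetime

def time_category(t):
    """Map the hour of day to a coarse category (boundaries are illustrative)."""
    hour = t.hour
    if 5 <= hour < 10:
        return "morning"
    if 10 <= hour < 16:
        return "noon"
    if 16 <= hour < 19:
        return "evening"
    return "night"

def season(t):
    """Map the month to a season (northern-hemisphere assumption)."""
    return {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer",
            9: "autumn", 10: "autumn", 11: "autumn"}[t.month]

def category_score(input_situation, training_situation):
    """Count how many situation fields (time, season, weather) agree."""
    return sum(1 for key, value in input_situation.items()
               if training_situation.get(key) == value)

now = datetime(2018, 8, 14, 13, 0)
# The weather value would come from an external source; it is hard-coded here.
input_situation = {"time": time_category(now), "season": season(now), "weather": "sunny"}
training_situation = {"time": "noon", "season": "summer", "weather": "rain"}
score = category_score(input_situation, training_situation)
```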

(Embodiment 5)
In the first embodiment, the degree of similarity between the input image and the learning images used to train each learning model is calculated as the evaluation value. In the second embodiment, the evaluation value is calculated by comparing the object types detected in the input image with object types detected in advance in the learning images. In the third embodiment, the evaluation value is calculated using the position information at which the input image and the learning images were captured. In the fourth embodiment, the evaluation value is calculated from the degree of matching of situation information that can change the appearance of a captured image. In contrast, the fifth embodiment describes an example in which the evaluation value of a learning model is calculated by a method that combines the first to fourth embodiments.

<Configuration of information processing device>
The configuration of the information processing apparatus according to the fifth embodiment is the same as that shown in FIG. 1 for the information processing apparatus 1 described in the first embodiment, so its description is omitted. The information held by the learning model group holding unit 130 differs from that of the first embodiment. In this embodiment, the learning model group holding unit 130 holds at least two learning models and, for each model, one or more of the learning images used to train that model, as described in the first embodiment. As described in the second embodiment, it also holds an object information list describing the object information detected in advance in the learning images of that model. As described in the third embodiment, it holds a position information list describing the position information at which the learning images of that model were captured. Furthermore, as described in the fourth embodiment, it holds a situation information list describing the situation information at the time the learning images of that model were captured.

<Processing>
The processing procedure in the fifth embodiment is the same as that of the information processing apparatus 1 described in the first embodiment and shown in FIG. 4, so its description is omitted. What differs from the first embodiment is the procedure by which the learning model selection unit 120 selects a learning model in step S140.

In step S140, the learning model selection unit 120 calculates the evaluation value of each learning model using the object information list held by the learning model group holding unit 130. At this time, the evaluation value described in step S1220 of the first embodiment is calculated as evaluation value 1. Next, for the vote counts of each learning model described in the second, third, and fourth embodiments, each count is divided by the maximum count to give a normalized continuous value from 0 to 1, and these values are calculated as evaluation value 2, evaluation value 3, and evaluation value 4, respectively.

In step S1130 of FIG. 5, the learning model selection unit 120 selects the learning model to be used based on evaluation values 1 to 4 calculated in step S1220, step S1330, and step S140. In this embodiment, the learning model with the largest sum of evaluation values 1 to 4 is selected, and the learning model selection unit 120 loads that learning model into a recording unit (for example, the RAM H13) so that the geometric information estimation unit 140 can estimate geometric information.
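
The combination of evaluation values can be sketched as below. It assumes that evaluation value 1 is a similarity already scaled to the range 0 to 1 and that evaluation values 2 to 4 come from the vote counts of the second to fourth embodiments; the function names and the sample numbers are hypothetical.

```python
def normalize_votes(votes):
    """Divide each vote count by the maximum count, giving values in [0, 1]."""
    top = max(votes.values())
    if top == 0:
        # No model received a vote; avoid dividing by zero.
        return {name: 0.0 for name in votes}
    return {name: count / top for name, count in votes.items()}

def select_model(similarity, object_votes, position_votes, situation_flags):
    """Return (best_model, totals), where totals sums evaluation values 1 to 4."""
    value2 = normalize_votes(object_votes)
    value3 = normalize_votes(position_votes)
    value4 = normalize_votes(situation_flags)
    totals = {name: similarity[name] + value2[name] + value3[name] + value4[name]
              for name in similarity}
    return max(totals, key=totals.get), totals

best, totals = select_model(
    similarity={"a": 0.8, "b": 0.6},      # evaluation value 1
    object_votes={"a": 2, "b": 4},        # basis of evaluation value 2
    position_votes={"a": 3, "b": 0},      # basis of evaluation value 3
    situation_flags={"a": 1, "b": 0},     # basis of evaluation value 4
)
```

Normalizing by the maximum count keeps any one criterion from dominating the sum simply because its vote counts happen to be larger.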

<Effect>
As described above, in the fifth embodiment, the evaluation value of a learning model is calculated so that it becomes higher as the similarity between the input image and the learning images, the degree of matching of the object types detected in them, and the degree of matching of the position information at which they were captured become higher. By selecting a learning model for which the input image and the learning images are similar, show the same types of objects, were captured at matching positions, and match in shooting time, season, and weather, geometric information can be estimated with high accuracy. The position and orientation of the imaging device can therefore be acquired with high accuracy.

<Modification>
In the fifth embodiment, the evaluation value of a learning model is calculated from the similarity between the input image and the learning images, the degree of matching of the object types detected in them, the degree of matching of the position information at which they were captured, and the degree of matching of the situation information that can change the appearance of a captured image. However, a configuration that uses two of these four is also possible. In determining the learning model in step S1130, for example, the learning model with the largest predetermined weighted average of evaluation values 1 to 3 may be selected. Alternatively, among the learning models whose evaluation values are each at or above a predetermined threshold, the one with the largest value may be selected, or the learning model that has the largest single evaluation value among all the evaluation values may be selected.

The learning models may also be narrowed down step by step based on the individual evaluation values. For example, the candidates are first narrowed down by position information and situation information, and then a learning model trained on learning images that are similar to the input image or show the same object types is selected. Specifically, the learning models whose evaluation value 3 and evaluation value 4 are at or above a predetermined threshold are picked out, and from among them the learning model with the largest sum of evaluation value 1 and evaluation value 2 is selected. This reduces the amount of similar image search and object detection processing, both of which are computationally expensive.
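
The staged narrowing can be sketched as follows. The function name, the 0.5 threshold, and the use of a callable for the costly part are assumptions for illustration; the point shown is that the image similarity and object detection scores are computed only for models that survive the cheap filter.

```python
def select_by_narrowing(cheap, expensive, threshold=0.5):
    """cheap: {name: (value3, value4)}; expensive: callable name -> value1 + value2."""
    # Stage 1: keep models whose position and situation values pass the threshold.
    survivors = [name for name, (v3, v4) in cheap.items()
                 if v3 >= threshold and v4 >= threshold]
    if not survivors:
        return None
    # Stage 2: run the costly scoring only on the survivors.
    return max(survivors, key=expensive)

calls = []
scores = {"a": 1.1, "b": 1.7, "c": 1.9}  # hypothetical value1 + value2 per model

def expensive(name):
    calls.append(name)  # record which models were actually scored
    return scores[name]

cheap = {"a": (1.0, 1.0), "b": (0.8, 1.0), "c": (0.2, 1.0)}
best = select_by_narrowing(cheap, expensive)
```

In this example model "c" would have had the highest combined score, but it is filtered out by its low position value and never scored.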

When the information processing apparatus according to the present invention is mounted on an automobile, it may be used to control a moving mechanism such as an electric motor in automated driving, to assist acceleration, deceleration, and steering while a person is driving, or as a navigation system. The apparatus is not limited to being mounted on an automobile: it may be implemented in the cloud, and the results processed over a network may be used for automobile control, driving assistance, navigation, and the like.

When the information processing apparatus according to the present invention is used for automobiles, the learning model selection unit 120 can also select a learning model using driving information acquired via the communication I/F (H17) from a car navigation system, various sensors, and various control devices mounted on the automobile. In such a configuration, any method may be used as long as it selects a learning model using driving information obtained from the automobile. Specifically, a scene category (urban area, mountainous area, seaside area, inside a tunnel, highway) attached to the map information of the car navigation system may be acquired as driving information, and a learning model may be selected based on the scene category information as described in the first embodiment. As another selection method, people, automobiles, traffic lights, and signs on the road, their number and density, and road conditions (number of lanes; road surface such as asphalt or dirt) may be acquired as driving information from a camera mounted on the automobile and used as object information to select a learning model as described in the second embodiment. Alternatively, address information for the automobile's current location calculated by the car navigation system, place name information recognized from traffic signs photographed by a camera mounted on the automobile, or sensor information obtained from GPS, WiFi, or various beacons may be acquired as driving information, and a learning model may be selected based on the position information obtained from these as described in the third embodiment. Furthermore, time information obtained from the car navigation system, whether the lights are on (which can be used to distinguish day from night), and whether the wipers are operating (which can be used to distinguish clear weather from rain) may be used as situation information to select a learning model as described in the fourth embodiment. The vehicle model and the mounting position and orientation of the camera on the automobile may also be used as situation information, with the evaluation value calculated so that it is higher for learning models trained on images captured by a camera mounted on the same vehicle model at the same position and orientation. Several examples of applying the methods of the first to fourth embodiments to automobiles have been given here; any one of them may be used alone, or several may be combined. Although selecting a learning model based on the driving information of the own vehicle has been given as an example, any driving information that can be used to select a learning model may be used: a learning model may be selected based on driving information acquired by surrounding automobiles, or on driving information acquired from traffic lights, road signs, fixed cameras installed at the roadside, and various sensors.
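
As one way to picture the use of vehicle signals as situation information, the sketch below derives a day or night flag from the lights and a clear or rain flag from the wipers, then counts matching entries per model. The function names, the field names, and the exact-match voting rule are assumptions for illustration only.

```python
def situation_from_driving_info(headlights_on, wipers_on):
    """Derive coarse situation information from two vehicle signals."""
    return {
        "time": "night" if headlights_on else "day",
        "weather": "rain" if wipers_on else "clear",
    }

def driving_votes(driving_situation, situation_lists):
    """Count, per model, the stored situations that agree on both fields."""
    return {name: sum(1 for s in entries if s == driving_situation)
            for name, entries in situation_lists.items()}

current = situation_from_driving_info(headlights_on=True, wipers_on=True)
# Hypothetical situation information lists for two models.
situation_lists = {
    "night_rain": [{"time": "night", "weather": "rain"}],
    "day_clear": [{"time": "day", "weather": "clear"}],
}
votes = driving_votes(current, situation_lists)
```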

(Embodiment 6)
In the first to fifth embodiments, a learning model is selected based on the similarity between the input image and the learning images, the objects detected in those images, the position information at the time of image capture, and situation information that can change the appearance of the captured image. In contrast, in the sixth embodiment, the evaluation value of a learning model is calculated by comparing geometric information estimated from the input images by motion stereo (second geometric information) with geometric information estimated using each learning model held by the learning model group holding unit (third geometric information). The geometric information output by the learning model selected by the learning model selection unit is the first geometric information.

That is, the first geometric information is the geometric information output by the selected learning model and is used to acquire the position and orientation. The second geometric information is geometric information obtained by motion stereo or a similar method and is used to select a learning model. The third geometric information is the geometric information output by each model in the learning model group and is also used to select a learning model.

<Configuration of information processing device>
The configuration of the information processing apparatus according to the sixth embodiment is the same as that shown in FIG. 1 for the information processing apparatus 1 described in the first embodiment, so its description is omitted. What differs from the first embodiment is the role of each component and the input/output relationships of their data.

The geometric information estimation unit 140 feeds the input image received from the image input unit 110 into the learning model selected by the learning model selection unit 120 and estimates the first geometric information. The geometric information estimation unit 140 also calculates the second geometric information based on a plurality of images received from the image input unit 110. The method of calculating the second geometric information is described later. In addition, the geometric information estimation unit 140 feeds the input image into each learning model held by the learning model group holding unit 130 and estimates the third geometric information. It then outputs the second geometric information and the third geometric information to the learning model selection unit 120, and outputs the first geometric information to the position and orientation acquisition unit 150.

The learning model selection unit 120 calculates the evaluation value of each learning model held by the learning model group holding unit 130 based on the input image received from the image input unit 110 and on the second and third geometric information estimated by the geometric information estimation unit 140. It selects a learning model based on the evaluation values and outputs it to the geometric information estimation unit 140. The learning model group holding unit 130 holds at least two learning models.

<Processing>
The processing procedure of the information processing apparatus in the sixth embodiment is the same as FIG. 4, which shows the processing procedure of the information processing apparatus 1 described in the first embodiment, and FIG. 5, which shows the details of the learning model selection process in step S140, so its description is omitted. The differences from the first embodiment are that the geometric information estimation unit 140 additionally calculates the second and third geometric information, and the procedure for evaluating the learning models.

In step S150, the geometric information estimation unit 140 uses each learning model to estimate a third depth map, which is the third geometric information. It also calculates the second geometric information based on the input images. In this embodiment, the second geometric information is a second depth map calculated by the motion stereo method from a first input image captured by the imaging device 11 at a first time t and a second input image captured by the imaging device 11 at a second time t+1, after the imaging device 11 has been moved by a predetermined amount (for example, 10 cm in the X-axis direction of the camera coordinate system). The predetermined amount of movement is used as the baseline length to define the scale of the depth.
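
To illustrate how a known baseline fixes the depth scale, the sketch below uses the standard rectified-stereo relation depth = focal length x baseline / disparity. It assumes a pure sideways translation between the two views and a per-pixel disparity that has already been found by matching; the function name and the 500-pixel focal length are hypothetical, and this description does not specify the matching method.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres from disparity in pixels, for a sideways camera translation."""
    if disparity_px <= 0:
        # No measurable parallax: treat the point as infinitely far away.
        return float("inf")
    return focal_px * baseline_m / disparity_px

# 10 cm baseline as in the example above; focal length is an assumed value.
depth = depth_from_disparity(disparity_px=25.0, focal_px=500.0, baseline_m=0.10)
```

Doubling the baseline doubles the disparity for the same point, which is why the amount of movement must be known for the depth to be expressed in metric units.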

Then, the learning model selection unit 120 calculates the evaluation value of each learning model held by the learning model group holding unit 130. The learning model selection unit 120 calculates, as the evaluation value, the reciprocal of the sum over all pixels in the image of the per-pixel depth difference between the third depth map and the second depth map. The learning model selection unit 120 then selects the learning model with the largest evaluation value.
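
The depth map comparison can be sketched as follows, with depth maps represented as nested lists. Taking the absolute value of each per-pixel difference and adding a small constant to avoid division by zero are assumptions made here; the function names and sample values are hypothetical.

```python
def depth_evaluation(third_depth, second_depth, eps=1e-9):
    """Reciprocal of the summed per-pixel absolute depth difference."""
    total = sum(abs(p - q)
                for row3, row2 in zip(third_depth, second_depth)
                for p, q in zip(row3, row2))
    return 1.0 / (total + eps)

def select_by_depth(third_depths, second_depth):
    """Return (best_model, scores) where the best model has the largest score."""
    scores = {name: depth_evaluation(depth, second_depth)
              for name, depth in third_depths.items()}
    return max(scores, key=scores.get), scores

# Second depth map from motion stereo, third depth maps from each learning model.
second = [[1.0, 2.0], [3.0, 4.0]]
third_depths = {
    "indoor": [[1.1, 2.1], [2.9, 4.2]],
    "outdoor": [[5.0, 6.0], [7.0, 8.0]],
}
best, scores = select_by_depth(third_depths, second)
```

A model whose depth map stays close to the motion stereo result has a small summed difference and therefore a large evaluation value.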

<Effect>
In Embodiment 6, the evaluation value of each learning model is calculated by comparing the third geometric information, which the geometric information estimation unit 140 estimates using the learning model, with the second geometric information, which is estimated from the input images by motion stereo. This makes it possible to select a learning model that outputs geometric information similar to the second geometric information estimated by motion stereo, so the position and orientation of the imaging device can be acquired with high accuracy.

Furthermore, even when a learning model has been trained on learning images of multiple scenes, or when no information characterizing the learning models is held, a learning model that can output geometric information with high accuracy can be selected. The position and orientation of the imaging device can therefore be acquired with high accuracy.

<Modification>
In Embodiment 6, the second geometric information is calculated by the motion stereo method using an RGB camera as the imaging device 11 that captures images. However, if the imaging device 11 is a camera that can capture depth information, range images, or three-dimensional point cloud data, the depth information obtained from them may be used as the second geometric information. When a camera equipped with two or more cameras or sensors is used, the depth calculated by stereo matching of the images from the multiple cameras may be used as the second geometric information. Furthermore, if the information processing apparatus is additionally equipped with a sensor that can obtain depth information, such as LiDAR or millimeter-wave radar, the depth information obtained from it may be used as the second geometric information.

In Embodiment 6, the baseline length is determined by moving the camera by a specified amount. However, if the information processing apparatus 1 includes a movement measurement sensor that can estimate the amount of camera movement, such as an inertial measurement unit (IMU), the sensor information (amount of movement) acquired by the sensor may be used as the baseline length to define the scale.

Alternatively, the learning model selection unit 120 may assign an evaluation value to each learning model held by the learning model group holding unit 130 while the baseline length between the two images used for motion stereo remains unknown.

Specifically, the learning model selection unit 120 first calculates a second depth map by normalizing the depth map obtained by motion stereo with the baseline length unknown, using, for example, its mean, median, maximum, or minimum value. Next, the learning model selection unit 120 inputs the input image to each learning model held by the learning model group holding unit 130 and calculates a third depth map by normalizing the resulting depth map with its mean, median, maximum, or minimum value. Then, using the reciprocal of the sum over the whole image of the depth differences between the third depth map and the second depth map as the evaluation value, it selects the learning model with the largest evaluation value.

As a result, even if the baseline of the motion stereo is unknown, a learning model whose third depth map matches the second depth map in overall structure can be selected, and the position and orientation can be acquired with high accuracy.
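The scale-free comparison in this modification might look like the following sketch. It normalizes both depth maps by their median, which is one of the options the text lists; the assumptions are that all depth values are positive, that the difference is taken as an absolute value, and that `eps` and the function names are illustrative only.

```python
from statistics import median


def normalize_by_median(depth):
    """Divide every depth value by the median of the map (assumes positive depths)."""
    m = median(d for row in depth for d in row)
    return [[d / m for d in row] for row in depth]


def normalized_evaluation_value(third_depth, second_depth, eps=1e-9):
    """Reciprocal of the summed difference between median-normalized depth maps."""
    n3 = normalize_by_median(third_depth)
    n2 = normalize_by_median(second_depth)
    total = sum(abs(a - b)
                for row3, row2 in zip(n3, n2)
                for a, b in zip(row3, row2))
    return 1.0 / (total + eps)  # eps (an assumption) avoids division by zero
```

Because both maps are divided by their own median, a depth map that differs from the motion stereo result only by a global scale factor still receives a high evaluation value.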

In Embodiment 6, a learning model whose third geometric information matches the second geometric information is selected. However, the learning model can also be determined using only the third geometric information, without calculating the second geometric information. For example, the learning model selection unit 120 may assign an evaluation value to each learning model based on the residual that remains when the position and orientation acquisition unit 150 calculates the position and orientation from the third geometric information output by that learning model. Specifically, the position and orientation acquisition unit 150 projects each pixel of the previous frame onto the current frame based on the depth value of each pixel of the previous depth map estimated by each learning model, and calculates the position and orientation so as to minimize the luminance error between each pixel of the previous frame before projection and the corresponding pixel of the current frame after projection. The residual of the luminance error at that point is input to the learning model selection unit 120. The learning model selection unit 120 then uses the reciprocal of the residual input from the position and orientation acquisition unit 150 as the evaluation value and selects the learning model with the largest evaluation value. When the position and orientation acquisition unit 150 calculates the position and orientation by iterative computation so that the error gradually decreases, it may measure the number of iterations or the time taken until the error converges and input these to the learning model selection unit 120. In that case, the learning model selection unit 120 may use the reciprocal of the number of iterations or of the time as the evaluation value and select the learning model with the largest evaluation value.
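The luminance residual used in this modification can be illustrated with the sketch below. It only evaluates the residual for one given pose and omits the pose optimization itself. The pinhole intrinsics tuple `(fx, fy, cx, cy)`, the nearest-neighbor lookup, the use of a mean absolute difference, and all function names are assumptions made for illustration.

```python
def photometric_residual(prev_img, cur_img, prev_depth, intrinsics, rotation, translation):
    """Mean absolute luminance difference after projecting previous-frame pixels
    into the current frame using the previous depth map and a given pose."""
    fx, fy, cx, cy = intrinsics
    height, width = len(prev_img), len(prev_img[0])
    total, count = 0.0, 0
    for v in range(height):
        for u in range(width):
            z = prev_depth[v][u]
            if z <= 0:
                continue  # skip invalid depth
            # Back-project the pixel to a 3D point in the previous camera frame.
            p = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Transform the point into the current camera frame.
            q = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                 for i in range(3)]
            if q[2] <= 0:
                continue  # point is behind the current camera
            # Project to the current image (nearest pixel, an assumption).
            u2 = int(round(fx * q[0] / q[2] + cx))
            v2 = int(round(fy * q[1] / q[2] + cy))
            if 0 <= u2 < width and 0 <= v2 < height:
                total += abs(prev_img[v][u] - cur_img[v2][u2])
                count += 1
    return total / count if count else float("inf")


def residual_evaluation_value(residual, eps=1e-9):
    """Reciprocal of the residual; eps (an assumption) avoids division by zero."""
    return 1.0 / (residual + eps)
```

A depth map that is closer to the true scene geometry warps the previous frame onto the current frame more accurately, so its residual is smaller and its evaluation value larger.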

Alternatively, the configuration may calculate a second depth map for each learning model by time-series filtering, using the third depth map output by each learning model held by the learning model group holding unit 130 as the initial value. In this case, the learning model selection unit 120 assigns an evaluation value to each learning model based on the amount of change in the depth values of its second depth map during the time-series filtering. Specifically, the reciprocal of, for example, the mean, median, maximum, minimum, or sum of the per-pixel depth changes caused by the time-series filtering may be calculated as the evaluation value of each learning model, and the learning model with the largest evaluation value may be selected. Alternatively, the variance or reliability of the depth of each pixel (the uncertainty map in Non-Patent Document 1) may be obtained in the time-series filtering, and the learning model for which, for example, the mean, median, maximum, minimum, or sum of the reciprocal of the variance or of the reliability is largest may be selected.

Embodiment 6 describes the case where the geometric information output by the learning model is a depth map. However, a learning model may be used that takes two input images captured at different times and outputs, as geometric information, the six parameters of the relative position and orientation between the two images. The relative position and orientation output by the learning model is referred to as the first relative position and orientation. When such a learning model is used, the learning model selection unit 120 detects and matches feature points between the two images and calculates a second relative position and orientation using the five-point algorithm, based on the correspondences between feature points whose feature descriptors match in the two images. The learning model selection unit 120 then uses the reciprocal of the squared distance between the six parameters of the first relative position and orientation and those of the second relative position and orientation as the evaluation value, and selects the learning model with the largest evaluation value.
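The six-parameter comparison can be sketched as below. The text does not say how the translation and rotation components are parameterized or weighted, so this sketch simply assumes that both poses are given as six numbers in the same parameterization and scale and weights them equally; `eps` and the function name are hypothetical.

```python
def pose_evaluation_value(first_pose, second_pose, eps=1e-9):
    """Reciprocal of the squared distance between two six-parameter relative poses."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(first_pose, second_pose))
    return 1.0 / (squared_distance + eps)  # eps (an assumption) avoids division by zero
```

The learning model whose first relative position and orientation yields the largest value against the second relative position and orientation would then be selected.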

In Embodiment 6, the learning model selection unit 120 assigns evaluation values to the learning models by comparing the second depth map calculated by motion stereo with the third depth map output by each learning model. However, if an object of known size can be detected in the input image, a high evaluation value may be given to a learning model whose third depth map is consistent with the size of the object.

As shown in FIG. 11, the information processing apparatus 3 in this modification includes an object model holding unit 1110 and an object detection unit 1120 in addition to the configuration of the information processing apparatus 1 described in Embodiment 1. The object model holding unit 1110 holds an object model whose shape is known. The object detection unit 1120 detects an object in the input image and aligns the object model with it. The learning model selection unit 120 then compares the distance to the object surface, which can be calculated once the object detection unit 1120 has detected the object and aligned the object model with the input image, with the depth values of the object region in the depth map output by the learning model. An evaluation value is assigned to the learning model based on the comparison result. The object model holding unit 1110 and the learning model group holding unit 130 may be configured as a single unit.

The object model in this modification is three-dimensional CAD data of a general object whose size and shape are roughly constant, such as a can, a PET bottle, or a human hand. Specifically, the object detection unit 1120 first detects the object in the input image. For object detection, the CAD model is aligned with the object appearing in the input image by, for example, the Line2D method (Non-Patent Document 6), which aligns a gradient image, that is, the derivative of the input image, with the silhouettes obtained when the CAD data is viewed from various directions. The method of aligning the object is not limited to this one. Next, the distance from the camera to the surface of the CAD model is calculated based on the camera's intrinsic parameters. Finally, the learning model selection unit 120 inputs the input image to the learning model, calculates the difference between the depth values of the object region in the resulting depth map (the geometric information) and the distance values of that region calculated from the CAD model, and calculates the reciprocal of the sum of these differences over the entire object region as the evaluation value. The learning model selection unit 120 selects the learning model with the largest evaluation value. Although this modification describes the use of a general object of known shape, an artificial marker printed with a specific pattern of known size, or a three-dimensional object whose size and shape are uniquely determined, may be used instead.

This makes it possible to select a learning model that correctly estimates the size of an object of known size appearing in the input image. The accuracy of the depth map output by the learning model improves, and with it the accuracy of position and orientation acquisition.
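The object-region comparison can be sketched as follows, assuming that object detection and CAD model alignment have already produced a per-pixel distance map and a binary mask of the object region. The mask representation, the absolute difference, `eps`, and the function name are assumptions for illustration.

```python
def object_region_evaluation_value(model_depth, cad_distance, object_mask, eps=1e-9):
    """Reciprocal of the summed depth difference inside the detected object region."""
    total = 0.0
    for row_depth, row_cad, row_mask in zip(model_depth, cad_distance, object_mask):
        for depth, distance, inside in zip(row_depth, row_cad, row_mask):
            if inside:  # only pixels covered by the aligned CAD model contribute
                total += abs(depth - distance)
    return 1.0 / (total + eps)  # eps (an assumption) avoids division by zero
```

Pixels outside the mask are ignored, so a learning model is rewarded only for agreeing with the known object geometry.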

(Embodiment 7)
In Embodiments 1 to 6, the user cannot check whether the selected learning model is suited to the scene appearing in the input image. Embodiment 7 therefore describes an example in which, so that the user can check, the display unit displays information characterizing the learning model, a composite image in which a CG image of a virtual object is composited based on the output of the learning model, a three-dimensional shape of the scene generated based on the output of the learning model, and the like.

<Configuration of information processing apparatus>
Part of the configuration of the information processing apparatus in Embodiment 7 is the same as FIG. 1, which shows the configuration of the information processing apparatus 1 described in Embodiment 1, so its description is omitted. The difference from Embodiment 1 is that a display information generation unit 12 and a display unit 13 are incorporated in the information processing apparatus 1.

The learning model group holding unit 130 holds at least two learning models and information lists that characterize the learning models. An information list that characterizes a learning model may be the learning images described in Embodiment 1, the object information list described in Embodiment 2, or the position information list described in Embodiment 3. All of these lists may be held, or only some of them. In this embodiment, the learning model group holding unit 130 holds all three types of lists as the information lists that characterize the learning models.

The learning model selection unit 120 outputs to the display information generation unit 12 the information lists characterizing the learning models held by the learning model group holding unit 130 and the geometric information obtained by inputting the input image to each learning model held by the learning model group holding unit 130.

The display information generation unit 12 generates a first composite image in which the information lists characterizing the learning models are rendered as text. It also generates a second composite image in which a virtual object is composited based on the geometric information (the first geometric information or the third geometric information) estimated by the geometric information estimation unit 140. In addition, it generates a third composite image in which the three-dimensional shape of the scene appearing in the input image, created based on the geometric information (the first geometric information or the third geometric information) estimated by the geometric information estimation unit 140, is rendered. These composite images are output to the display unit 13 as display information. At least one of them may be generated as the display information.

The display unit 13 is, for example, a window on the display of a mobile terminal, and presents the display information input from the display information generation unit 12.

FIG. 9 shows a GUI 100, which is an example of the display information presented by the display unit 13 in this embodiment.

G110 is a window for presenting the information lists that characterize the learning models, G120 is a window for presenting a composite image in which a CG image of a virtual object is composited, and G130 is a window for presenting geometric information and three-dimensional shapes. G140 is a window for presenting the input image and information detected from the input image. G1410 is a frame indicating the learning model selected by the learning model selection unit 120.

G1110 is an example in which the model names of the learning models held by the learning model group holding unit 130 are presented in window G110. G1120 is an example in which the position information list held by the learning model group holding unit 130 is presented. G1130 is an example in which the object information list held by the learning model group holding unit 130 is presented. G1131 is an example in which the learning images used to train each learning model held by the learning model group holding unit 130 are presented. G1140 is an example in which the types of objects appearing in the input image and the position information of where the input image was captured are presented in window G140. G1150 is an example in which the input image is presented in window G140. By comparing the contents presented in windows G110 and G140, the user can check whether the learning model selected by the learning model selection unit 120 is appropriate.

G1210 is a CG image of a virtual object composited onto the input image in window G120 based on the geometric information estimated by the geometric information estimation unit 140. G1220 is an example in which an image with the virtual object G1210 composited onto the input image is presented. G1230 is an example in which the evaluation value of each model calculated by the learning model selection unit 120 is presented. The example shown here superimposes a human model as a CG image on the bed in the input image, and additionally superimposes a CG image of the bed size obtained from the third geometric information. The user can then check whether the input image and the CG image are consistent.

Specifically, the user can judge whether the learning model selected by the learning model selection unit 120 is appropriate by checking whether the scale of the CG image matches that of the bed, whether the size of the bed matches its real-world scale, and whether the CG image directly faces the bed surface.

G1310 is an example in which the result of reconstructing the three-dimensional shape of the scene in which the input image was captured, based on the geometric information estimated by the geometric information estimation unit 140, is presented in window G130. The user can judge whether the learning model selected by the learning model selection unit 120 is appropriate by checking whether the presented three-dimensional shape is distorted and whether the depth scale differs from the real scene.

<Processing>
The overall processing procedure in Embodiment 7 is the same as FIG. 4, which illustrates the processing procedure of the information processing apparatus 1 described in Embodiment 1, so its description is omitted. The processing that differs from Embodiment 1 is the procedure by which the display information generation unit 12 generates the display information.

In step S170, the display information generation unit 12 renders the information lists characterizing the learning models at the position of window G110 of the display unit 13. Specifically, it renders the learning images G1131 used to train the learning models held by the learning model group holding unit 130, the position information list G1120, and the object information list G1130 at predetermined positions to generate display information. It also composites a CG image of a virtual object onto the input image based on the geometric information estimated by the geometric information estimation unit 140.

Specifically, a principal plane is first obtained from the depth map (the geometric information) by plane fitting combined with the RANSAC method. Next, the normal direction of the principal plane is calculated. Finally, the CG image of the virtual object is rendered on the principal plane at a predetermined position (for example, G120) to generate display information. The distance between any two points on the depth map may also be rendered, as shown in G1210. In addition, the result of reconstructing the three-dimensional shape of the scene in which the input image was captured, calculated based on the geometric information estimated by the geometric information estimation unit 140, is rendered at a predetermined position (for example, G130) to generate display information. Specifically, a projected image is generated by projecting each pixel of the input image onto an arbitrary virtual camera based on the depth value of the corresponding pixel of the depth map, and is added to the display information. The display unit 13 presents the display information generated in this way on the display.
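The principal plane step can be illustrated with a small RANSAC plane fit over the back-projected depth map. This is a sketch under assumptions rather than the procedure of the embodiment: the pinhole intrinsics `fx, fy, cx, cy`, the inlier threshold, the iteration count, the fixed random seed, and all function names are hypothetical.

```python
import math
import random


def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (list of rows) into 3D points in the camera frame."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) / fx * z, (v - cy) / fy * z, z))
    return points


def plane_through(p, q, r):
    """Return (unit normal n, offset d) with n.x + d = 0, or None if the points are collinear."""
    a = [q[i] - p[i] for i in range(3)]
    b = [r[i] - p[i] for i in range(3)]
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])
    length = math.sqrt(sum(c * c for c in n))
    if length < 1e-12:
        return None
    n = tuple(c / length for c in n)
    return n, -sum(n[i] * p[i] for i in range(3))


def fit_principal_plane(points, threshold=0.01, iterations=200, seed=0):
    """RANSAC: keep the plane through three random points that has the most inliers."""
    rng = random.Random(seed)
    best_plane, best_count = None, -1
    for _ in range(iterations):
        plane = plane_through(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        count = sum(1 for pt in points
                    if abs(n[0] * pt[0] + n[1] * pt[1] + n[2] * pt[2] + d) < threshold)
        if count > best_count:
            best_plane, best_count = plane, count
    return best_plane, best_count
```

The unit normal returned with the best plane corresponds to the normal direction of the principal plane, which could then be used to orient the CG image of the virtual object.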

<Effect>
As described above, Embodiment 7 presents information characterizing the learning models, a composite image in which CG of a virtual object is composited, and the geometric information output by the learning model or the three-dimensional shape reconstructed from that geometric information. This allows the user to check how well each learning model fits the scene appearing in the input image and whether the correct learning model was selected. Furthermore, if an inappropriate learning model has been selected, the processing can be redone and an appropriate learning model reselected, so the position and orientation can be acquired with high accuracy.

<Modification>
Embodiment 7 describes a configuration that presents display information rendering the information characterizing the learning models, a composite image in which CG of a virtual object is composited, and the geometric information output by the learning model or the three-dimensional shape reconstructed from it. However, not all three kinds of display information need to be presented; a configuration that presents at least one of them is also possible.

The learning model selection unit 120 can also select a learning model based on input information that the user enters with an input unit such as a mouse or keyboard after viewing the display information. In FIG. 9, G1420 is a radio button and G1430 is an input button. The user presses the input button to tell the information processing apparatus to use the model checked with the radio button. With this configuration, the user can select a learning model by referring to the information that characterizes the learning models, and the learning model can calculate geometric information with high accuracy for the scene in which the input image was captured. The position and orientation can therefore be acquired with high accuracy. In addition, if the user judges that the automatically selected learning model is inappropriate, the user can correct the selection.

(Embodiment 8)
In Embodiments 1 to 5, the learning model is selected only once, at the beginning. This makes it difficult to cope with cases where the scene appearing in the input image changes, for example because the user moves while experiencing mixed reality. Embodiment 8 therefore describes an example in which the evaluation values of the learning models continue to be recalculated even after a learning model has been selected.

<Configuration of information processing apparatus>
The configuration of the information processing apparatus in Embodiment 8 is the same as FIG. 1, which shows the configuration of the information processing apparatus 1 described in Embodiment 1, so its description is omitted. The differences from Embodiment 1 are that the geometric information estimation unit 140 calculates the second geometric information based on the input images and outputs it to the learning model selection unit 120, and that the learning model selection unit 120 recalculates the evaluation values of the learning models.

The geometric information estimation unit 140 additionally calculates the third geometric information using the learning models held by the learning model group holding unit 130 and the input image input from the image input unit 110. It also calculates the second geometric information from the input images by motion stereo and outputs it to the learning model selection unit 120.

The learning model selection unit 120 calculates an evaluation value for each learning model held by the learning model group holding unit 130 based on the third geometric information and the second geometric information input from the geometric information estimation unit 140. It then outputs the evaluation results to the geometric information estimation unit 140.

<Processing>
The overall processing procedure in Embodiment 8 is the same as FIG. 4, which illustrates the processing procedure of the information processing apparatus 1 described in Embodiment 1, so its description is omitted. The processing differs from Embodiment 1 in that the evaluation values of the learning models are recalculated and the learning model is reselected even after a learning model has been selected in step S140, and in that the geometric information estimation unit 140 calculates the second geometric information in step S150.

In the details of step S140 in this embodiment, the processing of step S1110 in FIG. 5 is removed. That is, the learning model selection unit 120 calculates the evaluation values of the learning models regardless of whether a learning model has already been determined.

In step S150 of FIG. 4, the geometric information estimation unit 140 estimates a third depth map, which is the third geometric information, using the learning model. It also calculates the second geometric information from the input images using motion stereo.

Furthermore, in step S150 according to this embodiment, the learning model selection unit 120 recalculates the evaluation values of the learning models based on the third geometric information and the second geometric information obtained by the geometric information estimation unit 140. Specifically, let t be the time at which the input image used to calculate the evaluation values in step S1120 of FIG. 5 was captured. For an arbitrary time t' after t, the learning model selection unit 120 takes the third depth map (the third geometric information) estimated (updated) by the geometric information estimation unit 140 at t' and the second geometric information estimated (updated) by the same unit at t', and recalculates the evaluation value of each learning model as the reciprocal of the sum of the depth differences between the two. The learning model selection unit 120 then newly selects the learning model with the largest evaluation value.
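The re-evaluation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the depth-map layout (nested lists), the use of the absolute difference per pixel, the small constant `eps` that guards against division by zero, and the names `evaluation_value` and `select_model` are all assumptions made for the example.

```python
from typing import Dict, List

DepthMap = List[List[float]]  # row-major depth values; this layout is an assumption


def evaluation_value(third: DepthMap, second: DepthMap, eps: float = 1e-9) -> float:
    """Reciprocal of the summed absolute depth difference between two maps."""
    total = sum(
        abs(d3 - d2)
        for row3, row2 in zip(third, second)
        for d3, d2 in zip(row3, row2)
    )
    return 1.0 / (total + eps)  # eps (assumed) avoids division by zero


def select_model(third_maps: Dict[str, DepthMap], second: DepthMap) -> str:
    """Return the name of the model whose third depth map scores highest."""
    return max(third_maps, key=lambda name: evaluation_value(third_maps[name], second))
```

Here `third_maps` would hold one third depth map per candidate learning model at time t', and `second` the motion-stereo depth map at the same time.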

<Effect>
In the eighth embodiment, even after a learning model has been selected, the evaluation values of the learning models are calculated again and the learning model is reselected. As a result, the learning models can be re-evaluated when, for example, the user moves during a mixed reality experience and the scene in the input image changes. Selecting the learning model with the highest re-evaluated value lets the learning model calculate geometric information with high accuracy for the scene in the current input image, and the position and orientation of the imaging device can therefore be obtained with high accuracy.

<Modification>
The evaluation values of the learning models may be recalculated at any timing. They may be recalculated at fixed time intervals, or each time a keyframe is added as described in Non-Patent Document 1. They may be recalculated when, according to the position and orientation acquisition results, the imaging device 11 has moved by a predetermined amount or more. They may be recalculated when the evaluation value of the currently selected learning model decreases or falls below a predetermined threshold. They may be recalculated when the scene or the object types in the input image change, or when the position information changes (for example, when a new WiFi access point is found or the GPS position changes). They may also be recalculated when a certain amount of time has passed or when the weather changes.
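Several of the triggers listed above can be combined in a simple predicate. The sketch below is only an illustration: the class name `ReevaluationPolicy`, its fields, and every threshold value are hypothetical and do not come from the text.

```python
from dataclasses import dataclass


@dataclass
class ReevaluationPolicy:
    # All thresholds are illustrative assumptions, not values from the text.
    min_interval_s: float = 10.0  # fixed time interval between re-evaluations
    min_travel_m: float = 2.0     # camera movement since the last re-evaluation
    min_score: float = 0.5        # floor for the selected model's evaluation value

    def should_reevaluate(self, elapsed_s: float, travel_m: float,
                          current_score: float, keyframe_added: bool = False) -> bool:
        """True if any one of the configured triggers fires."""
        return (
            elapsed_s >= self.min_interval_s
            or travel_m >= self.min_travel_m
            or current_score < self.min_score
            or keyframe_added
        )
```

Triggers such as a scene change, a new access point, or a weather change could be added as further boolean inputs in the same way.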

In the eighth embodiment, the geometric information estimation unit 140 calculates the second geometric information by motion stereo. However, the second geometric information may instead be integrated over time to improve its accuracy before it is used to evaluate the learning models. For example, taking the third geometric information estimated by the geometric information estimation unit 140 at time t' as the initial value, the second geometric information is calculated by time-series filtering over the input images up to an arbitrary time t'+i (as described in Non-Patent Document 1). The depth maps calculated in this way at a plurality of times t' are then integrated into a second depth map. Integration here means, as described in Non-Patent Document 7, calculating by pose graph optimization the camera position and orientation at each of the times t' at which the input images used to generate the depth maps were captured, and then smoothing the plurality of second depth maps using the resulting camera poses. Alternatively, the residual of the pose graph optimization may be used as the evaluation value and the learning model with the smallest residual selected, or the processing time required for the pose graph optimization may be used as the evaluation value and the learning model with the shortest processing time selected. In this way, the evaluation values become more accurate as the optimization progresses and a more appropriate learning model can be selected, which improves the accuracy of position and orientation acquisition.

In the eighth embodiment, the learning model selection unit 120 recalculates the evaluation value of each model by comparing the second depth map, which is the second geometric information calculated by the geometric information estimation unit 140, with the third depth map output by the learning model. However, the method of selecting the learning model is not limited to this. The evaluation value of each learning model may instead be recalculated by the methods described in the first to seventh embodiments, using the input image at time t'.

In the eighth embodiment, the learning model selection unit 120 recalculates the evaluation value of each model by comparing the second depth map calculated by the geometric information estimation unit 140 with the third depth map output by the learning model. Alternatively, evaluation values may be assigned based on the degree of agreement between the third geometric information and the second geometric information at a plurality of times t', with higher agreement giving a higher evaluation value, and the learning model selected based on those values. Specifically, the learning model selection unit 120 calculates, for each of the times t', a first evaluation value that is the sum of the differences between the depth values of the third depth map and the second depth map. It then calculates, as a second evaluation value, the reciprocal of, for example, the mean, median, maximum, minimum, or sum of the first evaluation values, and selects the learning model with the largest second evaluation value. This makes it possible to select a learning model that can calculate the third geometric information with high accuracy across input images from multiple times. Consequently, even if an unsuitable learning model is selected at first, the learning model selection unit 120 can progressively reselect a learning model that estimates geometric information with high accuracy, and the position and orientation can be acquired with high accuracy.
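The two-stage evaluation over multiple times can be sketched as below. This is an illustrative reading, assuming that the first evaluation values (summed depth differences, where smaller is better) have already been computed per model and per time, and that the second evaluation value is the reciprocal of the chosen aggregate. The function names and the guard constant `eps` are hypothetical.

```python
from statistics import mean, median
from typing import Callable, Dict, Sequence


def second_evaluation(first_values: Sequence[float],
                      aggregate: Callable[[Sequence[float]], float] = mean,
                      eps: float = 1e-9) -> float:
    """Reciprocal of an aggregate (mean, median, max, min, sum) of first evaluation values."""
    return 1.0 / (aggregate(first_values) + eps)  # eps (assumed) avoids division by zero


def select_over_time(first_by_model: Dict[str, Sequence[float]],
                     aggregate: Callable[[Sequence[float]], float] = mean) -> str:
    """Pick the model with the largest second evaluation value."""
    return max(first_by_model,
               key=lambda name: second_evaluation(first_by_model[name], aggregate))
```

The choice of aggregate matters: with the mean or maximum, a single badly estimated frame penalizes a model heavily, whereas the median tolerates occasional outliers.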

Recalculating the evaluation values may change the learning model selected by the learning model selection unit 120. Because the learning model that the geometric information estimation unit 140 uses to estimate geometric information then changes, the output may differ greatly before and after the change. To handle this, in step S150 the geometric information estimation unit 140 may, for a predetermined period, calculate the weighted sum of the geometric information output by the two learning models as the third geometric information. Specifically, the depth map is corrected according to the following equation, using a predetermined number of frames N that represents the model switching period and the number of frames α elapsed since the start of switching.

Here, D_1 is the depth map output by the learning model before the change, D_2 is the depth map output by the learning model after the change, and D is the corrected depth map.
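The equation itself is not reproduced in this text. One weighting that is consistent with the stated definitions of N, α, D_1, D_2, and D is a linear cross-fade, D = (1 - α/N) D_1 + (α/N) D_2. The linear form is an assumption, as are the function name `blend_depth` and the clamping of the weight once the switching period has elapsed.

```python
from typing import List

DepthMap = List[List[float]]  # row-major depth values; this layout is an assumption


def blend_depth(d1: DepthMap, d2: DepthMap, alpha: int, n: int) -> DepthMap:
    """Cross-fade from the old model's map d1 to the new model's map d2.

    Assumed weighting: D = (1 - alpha/n) * d1 + (alpha/n) * d2.
    """
    w = min(max(alpha / n, 0.0), 1.0)  # clamp so frames past the window use d2 only
    return [
        [(1.0 - w) * a + w * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(d1, d2)
    ]
```

With this form, the output equals D_1 at the start of switching (α = 0) and D_2 once α reaches N, so the depth map passed to pose estimation changes gradually rather than in a single frame.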

(Embodiment 9)
The first to eighth embodiments described methods of selecting, from a plurality of learning models created in advance, a learning model that can accurately estimate geometric information in the scene to which the information processing apparatus is applied. This embodiment describes a method of generating the learning models used by the information processing apparatus, based on RGB images and depth maps acquired by a depth sensor serving as the second imaging device 91. In particular, it describes a method of recognizing the scene type from the RGB images captured by the second imaging device 91 and creating a separate learning model for each type. In this embodiment, the second imaging device 91 is a TOF sensor capable of acquiring both RGB images and depth maps.

<Configuration of information processing apparatus>
First, the configuration of the information processing apparatus 4 according to the ninth embodiment will be described with reference to FIG. 12. It differs from the configuration of the information processing apparatus 1 shown in FIG. 1 and described in the first embodiment in that a second image input unit 910, a learning data classification unit 920, a learning data holding unit 930, and a learning model generation unit 940 are added. The second image input unit 910 is connected to the second imaging device 91.

The second image input unit 910 receives, in time series (for example, 60 frames per second), the image data of two-dimensional images of the scene captured by the second imaging device 91 (hereinafter, model learning images) and the corresponding depth maps (hereinafter, model learning depth maps), and outputs them to the learning data classification unit 920. A model learning image and its model learning depth map are together referred to as learning data.

The learning data classification unit 920 recognizes the scene type based on the learning data received from the second image input unit 910, classifies the learning data by type, and outputs it to the learning data holding unit 930. The classification method is described later.

The learning data holding unit 930 holds the learning data classified by the learning data classification unit 920, sorted by scene type. The learning data holding unit 930 is, for example, an SSD (Solid State Drive). It has a separate folder for each scene type, and each item of learning data is held in the folder that corresponds to its classification result from the learning data classification unit 920. A model learning image and a model learning depth map acquired at the same time are assigned a common ID (for example, a serial number or the time) and are thereby associated with each other.

Based on the classification results of the learning data classification unit 920, the learning model generation unit 940 generates learning models using the learning data stored in the learning data holding unit 930. It outputs the generated learning models to the learning model group holding unit 130.

Next, the processing procedure in this embodiment will be described with reference to the flowchart of FIG. 13. This embodiment differs from the first embodiment in that, in addition to the processing procedure described there, the following steps are added: a model learning image / model learning depth map imaging step S910, a model learning image / model learning depth map input step S920, an image classification step S930, a learning data holding step S940, a learning data collection completion determination step S950, a learning model generation step S960, and a learning model holding step S970.

In this embodiment, the initialization step S110 described in the first embodiment is executed first to initialize the system. Next, steps S910 to S970 described below are executed to generate the learning models. Then the processing from step S120 onward described in the first embodiment is executed to calculate the position and orientation of the imaging device 11.

In step S910, the second imaging device 91 captures the scene and outputs an RGB image and a depth map to the second image input unit 910. The process then proceeds to step S920.

In step S920, the second image input unit 910 acquires the image and the depth map captured by the second imaging device 91 as a model learning image and a model learning depth map. The process then proceeds to step S930.

In step S930, the learning data classification unit 920 recognizes the scene type from the model learning image and classifies the learning data. This embodiment uses the scene discrimination learning model described in the modification of the first embodiment. The scene discrimination learning model is a neural network trained in advance by deep learning so that it outputs 1 if the input image belongs to the category in question and 0 otherwise. The model learning image is input to the scene discrimination learning model, and the category obtained is taken as the classification result for the learning data. The process then proceeds to step S940.

In step S940, the learning data classification unit 920 stores the learning data in the learning data holding unit 930 based on the classification result obtained in step S930. Specifically, the model learning image and the model learning depth map are stored in the folder corresponding to the classification result. The process then proceeds to step S950.

In step S950, it is determined whether the collection of learning data has been completed. Here, data collection is determined to be complete when the user enters an end command using an input device (not shown). If data collection is determined to be complete, the process proceeds to step S960. Otherwise, the process returns to step S910 and data collection continues.

In step S960, the learning model generation unit 940 uses the learning data held by the learning data holding unit 930 to generate a learning model for each category into which the learning data classification unit 920 has classified the data. In other words, one learning model is generated per folder in the learning data holding unit 930. Specifically, a learning image is selected at random from the relevant folder, and the learning model is trained so as to minimize the error between the geometric information it outputs for that image and the learning depth map corresponding to the selected image. This is repeated. The method of generating a learning model is described in detail in Non-Patent Document 1 and can be applied here.
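The training loop of step S960 (random selection from one category, then an update that reduces the error against the paired depth map) can be sketched with a deliberately tiny model. The two-parameter linear model below is a stand-in chosen so the example stays self-contained. It is not the network of Non-Patent Document 1, and the function name, learning rate, and step count are assumptions.

```python
import random
from typing import List, Tuple

# (flattened image intensities, flattened depth values) for one learning sample
Sample = Tuple[List[float], List[float]]


def train_category(samples: List[Sample], steps: int = 2000,
                   lr: float = 0.1, seed: int = 0) -> Tuple[float, float]:
    """Fit depth ~ w * intensity + b by stochastic gradient descent on one category."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        image, depth = rng.choice(samples)  # random pick from the category's folder
        n = len(image)
        # Gradient of the mean squared error between predicted and reference depth.
        grad_w = sum(2.0 * (w * x + b - d) * x for x, d in zip(image, depth)) / n
        grad_b = sum(2.0 * (w * x + b - d) for x, d in zip(image, depth)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Calling `train_category` once per folder yields one parameter set per scene type, which mirrors the one-model-per-category structure described above.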

The learning model generation unit 940 stores the learning models generated for each category classified by the learning data classification unit 920 in the learning model group holding unit 130. At this time, the learning images are also copied to and held in the learning model group holding unit 130.

<Effect>
As described above, in the ninth embodiment the model learning images and model learning depth maps are classified based on the scene discrimination results for the model learning images, and a learning model is generated for each scene type. By generating a learning model for each scene type in this way, and using the learning model whose scene type matches that of the captured image when calculating the position and orientation as described in the first embodiment, the position and orientation of the imaging device can be calculated with high accuracy.

<Modification>
In the ninth embodiment, a scene discrimination learning model is used to discriminate the scene in the model learning image. However, any method may be used as long as it discriminates the type of scene in the model learning image. For example, as described in the first embodiment, the decision boundaries in the GLC feature space may be calculated in advance for each scene type by an SVM, and the scene type may be discriminated according to which category the GLC features detected from the model learning image fall into. Alternatively, a color histogram may be created in advance for each scene type, and the model learning image may be classified into the scene type whose color histogram best matches its own.
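The color-histogram variant can be sketched as follows. The text does not say how the match between histograms is measured. This example assumes histogram intersection on a coarse, normalized RGB histogram, and the bin count and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB values


def color_histogram(pixels: List[Pixel], bins: int = 4) -> List[float]:
    """Normalized joint RGB histogram with `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        hist[(r // step) * bins * bins + (g // step) * bins + (b // step)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity in [0, 1]; 1 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def classify_scene(pixels: List[Pixel], references: Dict[str, List[float]]) -> str:
    """Return the scene type whose reference histogram matches best."""
    hist = color_histogram(pixels)
    return max(references, key=lambda s: histogram_intersection(hist, references[s]))
```

The reference histograms must be built with the same bin count that `classify_scene` uses (the default of 4 here).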

In the ninth embodiment, the learning data is classified based on the scene in the model learning image, and learning models are generated accordingly. The learning data can also be classified according to the types of objects in the model learning image. That is, the learning data can be classified by object detection result, such as "desk", "table", "car", or "traffic light". The object detection methods exemplified in the second embodiment can be used to detect objects in the learning images. A learning model is generated for each set of learning data classified in this way. Alternatively, the co-occurrence probabilities of objects may be calculated in advance and the learning data classified based on them. A co-occurrence probability is the probability that certain objects are observed together, for example the probability that a "chabudai" (low dining table), a "TV", and a "bed" are observed at the same time, or that a "desk", a "chair", a "PC", and a "display" are observed at the same time. Although no explicit scene detection is performed, using these probabilities allows the learning data to be classified by scene: a scene in which a "chabudai", a "TV", and a "bed" are observed is a Japanese house, and a scene in which a "desk", a "chair", a "PC", and a "display" are observed is an office.

The learning data can also be classified using the position information of the place where it was acquired. As described in the third embodiment, position information is, for example, latitude and longitude coordinates or the identification ID of a WiFi access point. Specifically, latitude and longitude may be divided at predetermined intervals and the learning data classified by the resulting cells. The learning data may also be classified by the identification ID of the WiFi access point observed when it was acquired. Furthermore, categories such as country, region, or sea/mountain/road may be identified from map information (not shown) using the position calculated from GPS, and the learning data classified by those categories. A learning model is generated for each set of learning data classified in this way.
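Dividing latitude and longitude at fixed intervals amounts to mapping each coordinate to a grid cell index. A minimal sketch, assuming a cell size of 0.01 degrees and hypothetical function names and key formats:

```python
import math
from typing import Optional, Tuple


def grid_cell(lat_deg: float, lon_deg: float, step_deg: float = 0.01) -> Tuple[int, int]:
    """Index of the latitude/longitude cell that contains the given position."""
    return (math.floor(lat_deg / step_deg), math.floor(lon_deg / step_deg))


def location_category(lat_deg: float, lon_deg: float,
                      wifi_id: Optional[str] = None, step_deg: float = 0.01) -> str:
    """Category key: the access point ID if one was observed, else the grid cell."""
    if wifi_id is not None:
        return f"wifi:{wifi_id}"
    row, col = grid_cell(lat_deg, lon_deg, step_deg)
    return f"grid:{row}:{col}"
```

`math.floor` is used rather than `int()` so that negative coordinates fall into the correct cell. The resulting key can serve directly as the folder name under which the learning data is stored.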

The learning data can also be classified by conditions that can change the appearance of the image, such as the date and time of acquisition, the season, and the weather. For example, the learning data may be classified by hour of capture. The capture time may also be divided into categories such as morning, noon, evening, and night, and the learning data classified by those categories. The data may be classified by the date of capture, or the dates may be grouped into seasons (spring/summer/autumn/winter) and the data classified by season. The data may also be classified by weather category (sunny/cloudy/rain/snow) obtained over the network, via the I/F (H17), from a website that distributes weather information. A learning model is generated for each set of learning data classified in this way.
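Time-based classification can be sketched with the standard `datetime` type. The hour boundaries and the month-to-season mapping below (Northern Hemisphere meteorological seasons) are assumptions, since the text does not define them. The function names are hypothetical.

```python
from datetime import datetime

# Month-to-season mapping; Northern Hemisphere convention is an assumption.
_SEASONS = {12: "winter", 1: "winter", 2: "winter",
            3: "spring", 4: "spring", 5: "spring",
            6: "summer", 7: "summer", 8: "summer",
            9: "autumn", 10: "autumn", 11: "autumn"}


def time_of_day(ts: datetime) -> str:
    """Coarse time-of-day category; the hour boundaries are illustrative."""
    hour = ts.hour
    if 5 <= hour < 11:
        return "morning"
    if 11 <= hour < 16:
        return "noon"
    if 16 <= hour < 19:
        return "evening"
    return "night"


def season(ts: datetime) -> str:
    return _SEASONS[ts.month]


def capture_category(ts: datetime) -> str:
    """Combined key, e.g. for use as a folder name for the learning data."""
    return f"{season(ts)}/{time_of_day(ts)}"
```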

The learning data can also be classified based on the model learning depth map. For example, the learning data may be classified based on the mean, maximum, minimum, median, or variance of the depth values in the model learning depth map. The learning data may also be classified based on the degree of unevenness of the model learning depth map. To determine the degree of unevenness, for example, a principal plane is calculated by plane fitting to the model learning depth map, and the number of three-dimensional points, calculated from the depth map, that lie at a predetermined distance from the principal plane is used. Alternatively, a normal may be calculated for each pixel of the model learning depth map, pixels for which the inner product of the normal with those of the surrounding pixels is at or below a predetermined distance may be given the same label, and the number of resulting labels may be used as the degree of unevenness.
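The depth-statistics variant can be sketched with the standard `statistics` module. The text names the statistics but not how they map to categories. The median-based near/mid/far split, its thresholds, and the function names are assumptions made for illustration.

```python
from statistics import mean, median, pvariance
from typing import Dict, List


def depth_statistics(depth: List[List[float]]) -> Dict[str, float]:
    """Mean, maximum, minimum, median, and population variance of a depth map."""
    values = [d for row in depth for d in row]
    return {
        "mean": mean(values),
        "max": max(values),
        "min": min(values),
        "median": median(values),
        "variance": pvariance(values),
    }


def depth_range_category(depth: List[List[float]],
                         near_m: float = 2.0, far_m: float = 10.0) -> str:
    """Classify by median depth; the thresholds (in meters) are illustrative."""
    m = depth_statistics(depth)["median"]
    if m < near_m:
        return "near"
    if m < far_m:
        return "mid"
    return "far"
```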

The user may also enter the scene, objects, position information, and so on described above through an input means (not shown), and the learning data may be classified based on that input. The learning data may also be classified by the intended use of the learning model. That is, the user enters an application type, such as position and orientation estimation for an in-vehicle camera in automated driving, or camera position and orientation estimation for superimposing CG on a smartphone or tablet, and the learning data is classified by the entered application type. A learning model is generated for each set of learning data classified in this way.

When the second imaging device 91 is mounted on equipment that has a car navigation system, such as an automobile, and learning data is acquired there, the scene type attached to the map information of the car navigation system (urban area, mountainous area, seaside area, tunnel, highway) may be used as the scene discrimination result to classify the learning data. People, vehicles, traffic lights, and signs on the road, their number and density, and the road conditions (number of lanes; road surface such as asphalt or dirt) may be obtained from a camera mounted on the automobile and used as object information to classify the learning data. The learning data may also be classified based on position information obtained from the address at which the automobile is traveling as calculated by the car navigation system, place names recognized from road signs captured by the on-board camera, or sensor information from GPS, WiFi, and various beacons. The learning data can also be classified based on the time information obtained from the car navigation system, whether the lights are on (usable for day/night discrimination), and whether the wipers are operating (usable for sunny/rain discrimination). The learning data can also be classified by vehicle model or by the mounting position and orientation of the camera on the automobile. A learning model can be generated for each class resulting from such classification.

The learning data can also be classified by acquisition sequence. That is, the period from when the information processing apparatus 4 is started until it is terminated is treated as one sequence, and all data acquired during that period is classified into the same category. Learning models can also be generated in this way.

The classification methods described so far are examples. Any classification method may be used as long as it yields learning models that can estimate geometric information with high accuracy. The methods described above may be used individually or in any combination.

In the ninth embodiment, the learning data classification unit 920 classifies the learning data immediately after the second imaging device 91 captures it and stores it in the learning data holding unit 930. However, the second imaging device 91 may instead capture and accumulate model learning images and model learning depth maps in advance, with the learning data classification unit 920 classifying the learning data afterwards. Image recognition and the training of learning models are generally computationally expensive. With this configuration, the learning data can be acquired in advance on hardware with limited computing resources, and the image recognition and model training can be performed on hardware with greater resources.

This configuration also makes it possible to combine learning data captured separately by a plurality of second imaging devices 91, or to add further learning data later to learning data that has already been acquired. Furthermore, as described in the eighth embodiment, the calculated second geometric information may be used as the depth map and combined with the image captured by the imaging device 11 to form learning data. The learning model generation unit 940 may also additionally train an already-trained learning model using learning data added by the methods described above.

In the ninth embodiment, the learning data classification unit 920 classifies all of the learning data captured by the second imaging device 91. However, it is not necessary to classify all of it; only part of the learning data may be classified. For example, the learning data may be classified and held in the learning data holding unit 930 only once in every 60 acquisitions by the second imaging device 91. Alternatively, the learning data classification unit 920 may have the learning data holding unit 930 hold newly acquired learning data only if its similarity to the learning data already held there is low. The similarity here is based on, for example, the differences in the mean, maximum, minimum, median, and variance of image brightness. The recognition likelihood obtained when the scene discrimination learning model performs scene recognition on the model learning image can also be used. Specifically, this is the distance between the recognition likelihood values (the degree of match for each scene) that the scene discrimination learning model computes immediately before producing its 0/1 output indicating whether the image belongs to a given scene, and the recognition likelihood values of the learning data already held. Collecting data so as to reduce the number of similar learning data items, or so that the learning data are more widely separated in similarity, reduces the time required to generate a learning model and can improve the recognition accuracy of the learning model.
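
The thinning and similarity check described above can be sketched as follows. This is only an illustrative sketch: the function names, the Euclidean distance over the five brightness statistics, and the threshold `min_distance` are assumptions and are not specified in this description.

```python
import numpy as np

def brightness_stats(image):
    """Return [mean, max, min, median, variance] of the image brightness."""
    px = np.asarray(image, dtype=np.float64).ravel()
    return np.array([px.mean(), px.max(), px.min(), np.median(px), px.var()])

def should_hold(new_image, held_stats, min_distance=10.0):
    """True if the new image is dissimilar to every sample already held.

    The distance measure and threshold are hypothetical choices.
    """
    new = brightness_stats(new_image)
    return all(np.linalg.norm(new - s) >= min_distance for s in held_stats)

def collect(images, every_n=60, min_distance=10.0):
    """Consider one acquisition in every `every_n` and hold only dissimilar ones."""
    held_images, held_stats = [], []
    for i, img in enumerate(images):
        if i % every_n != 0:
            continue  # skip acquisitions between sampling points
        if should_hold(img, held_stats, min_distance):
            held_images.append(img)
            held_stats.append(brightness_stats(img))
    return held_images
```

With `every_n=60`, only every 60th acquisition is examined, matching the example in the text; the recognition-likelihood variant could replace `brightness_stats` with the per-scene likelihood vector.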

The ninth embodiment described a method in which a learning model is first generated and the position and orientation are then estimated using that learning model. However, generation of the learning model and position and orientation estimation using it can also be performed by separate devices. For example, a first information processing device generates the learning model and uploads it to a cloud server via a network. A separate, second information processing device then loads the learning model from the cloud server via the network and uses it to estimate the position and orientation. Alternatively, when generating the learning model, the first information processing device may acquire the learning data and upload it to the server, while the image classification step S930 and the learning model generation step S960 are performed by a second information processing device on the cloud server.

In the ninth embodiment, acquisition of the learning data ends when the user inputs an end command. Any method may be used to determine the end of acquisition as long as the learning data needed for training has been collected. For example, the learning data classification unit 920 may determine that acquisition is finished after a predetermined time has elapsed, or when a predetermined number of learning data items have been acquired. It may also determine that acquisition is finished when the number of learning data items in each classification category exceeds a predetermined number. Alternatively, the learning model generation unit 940 may calculate the degree of training of the learning model and determine that training is finished once it has converged.
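
The first three termination criteria can be sketched as a single check. The function name, parameter names, and the idea of passing the scene labels as a list are assumptions made for illustration only.

```python
import time
from collections import Counter

def acquisition_finished(start_time, labels, *, max_seconds=None,
                         max_samples=None, min_per_category=None,
                         categories=(), now=None):
    """Return True if any enabled termination criterion is satisfied.

    labels: classification category of each learning data item held so far.
    All thresholds are hypothetical, user-chosen values.
    """
    now = time.monotonic() if now is None else now
    # Criterion 1: a predetermined time has elapsed.
    if max_seconds is not None and now - start_time >= max_seconds:
        return True
    # Criterion 2: a predetermined number of items has been acquired.
    if max_samples is not None and len(labels) >= max_samples:
        return True
    # Criterion 3: every category exceeds a predetermined count.
    if min_per_category is not None and categories:
        counts = Counter(labels)
        if all(counts[c] > min_per_category for c in categories):
            return True
    return False
```

The convergence-based criterion is omitted here because it depends on how the learning model generation unit measures the degree of training.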

In the present embodiment, classifying images means holding the learning data in separate folders of the learning data holding unit 930 according to the classification result produced by the learning data classification unit 920. However, the data need not be separated into folders; the learning data holding unit 930 may instead hold a list that records the classification result for each learning data item.

In Embodiments 1 to 5, the learning model group holding unit 130 holds an object information list, a position information list, and a situation information list in addition to at least two learning models. These lists may be generated by the learning data classification unit 920 as needed. That is, the information required in Embodiments 1 to 5 can be added by also performing processing that detects object types from the model learning images and adds them to the object information list, and that adds the acquired position information and situation information to the position information list and the situation information list. The processing described in the ninth embodiment that copies the learning images to the learning model group holding unit and holds them there may be omitted if the learning images are not needed when selecting a learning model.

The second imaging device 91 is not limited to a TOF sensor and may be any device capable of acquiring an image and a depth map. Specifically, it may be a depth camera that projects a pattern and estimates depth. It may also be a stereo camera in which two cameras are arranged side by side, depth is calculated by stereo matching, and the result is output as a depth map. Furthermore, it may be a device that combines a 3D LiDAR (Light Detection and Ranging) sensor with a camera and outputs a depth map obtained by converting the depth values acquired by the LiDAR into image coordinates. The image is not limited to an RGB image and may be a grayscale image. Alternatively, the second imaging device 91 may acquire images and depth maps in advance and store them in a recording device (not shown), and the second input unit 910 may input the images and depth maps from that recording device.
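
As an illustration of converting LiDAR depth values into image coordinates, the following sketch projects 3D points into a depth map with a pinhole camera model. It assumes that the points have already been transformed into the camera coordinate frame and that the intrinsic parameters `fx`, `fy`, `cx`, `cy` are known; neither the function name nor these conventions come from this description.

```python
import numpy as np

def lidar_to_depth_map(points_cam, fx, fy, cx, cy, width, height):
    """Project (N, 3) points in the camera frame to a depth map.

    Pixels with no measurement are set to 0. When several points fall on
    the same pixel, the nearest depth is kept.
    """
    pts = np.asarray(points_cam, dtype=np.float64)
    pts = pts[pts[:, 2] > 0]  # keep only points in front of the camera
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], pts[inside, 2]
    depth = np.full((height, width), np.inf)
    # Unbuffered minimum so that duplicate pixel indices are handled correctly.
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

The resulting map is sparse; whether and how it is densified before being used as a model learning depth map is left open here.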

<Effects of each embodiment>
In the first embodiment, an evaluation value is assigned to each of a plurality of learning models, and a learning model with a high evaluation value is selected. To do so, the similarity between the input image and the learning images used to train each learning model is calculated, and the evaluation value of each learning model is calculated so that the higher the similarity, the higher the evaluation value. The position and orientation of the imaging device are then calculated using the geometric information estimated with the learning model that has a high evaluation value. By selecting a learning model whose learning images are similar to the input image in this way, the learning model can estimate geometric information with high accuracy, and the position and orientation of the imaging device can be calculated with high accuracy.
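
A minimal sketch of similarity-based selection follows. The similarity measure used here (intersection of normalized brightness histograms) and all function names are assumptions chosen for illustration; this section does not specify how the similarity is computed.

```python
import numpy as np

def histogram(image, bins=16):
    """Normalized brightness histogram of an 8-bit image."""
    h, _ = np.histogram(np.asarray(image).ravel(), bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def similarity(a, b):
    """Histogram intersection in [0, 1]; 1 means identical histograms."""
    return float(np.minimum(histogram(a), histogram(b)).sum())

def evaluate_models(input_image, training_images_by_model):
    """Evaluation value per model: best similarity to any of its learning images.

    Each model is assumed to have at least one learning image.
    """
    return {name: max(similarity(input_image, t) for t in images)
            for name, images in training_images_by_model.items()}

def select_model(input_image, training_images_by_model):
    """Return the name of the model with the highest evaluation value."""
    scores = evaluate_models(input_image, training_images_by_model)
    return max(scores, key=scores.get)
```

Any other image similarity measure could be substituted for `similarity` without changing the selection logic.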

In the second embodiment, object information detected from the learning images used to train each learning model is compared with object information detected from the input image, and the evaluation value of the learning model is calculated so that the more objects of the same type appear in both, the higher the evaluation value. The position and orientation of the imaging device are then calculated using the geometric information estimated with a learning model that has a high evaluation value. This makes it possible to select a learning model whose learning images show the same types of objects as the input image, so the learning model can estimate geometric information with high accuracy and the position and orientation of the imaging device can be calculated with high accuracy.

In the third embodiment, the evaluation value of each learning model is calculated so that the more closely the position information of where the input image was captured matches that of where the learning images used to train the learning model were captured, the higher the evaluation value. This makes it possible to select a learning model whose learning images were captured at a position matching that of the input image, so the learning model can estimate geometric information with high accuracy and the position and orientation of the imaging device can be calculated with high accuracy.

In the fourth embodiment, a higher evaluation value is assigned to a learning model the more closely the situation information, which describes conditions that can change the appearance of an image, matches between the input image and the learning images used to train the learning model. This makes it possible to select a learning model whose learning images were captured under conditions matching those of the input image, so the learning model can estimate geometric information with high accuracy and the position and orientation of the imaging device can be calculated with high accuracy.

In the fifth embodiment, the evaluation value of each learning model is calculated so that it is higher the higher the similarity between the input image and the learning images, the degree of matching between the object types detected from the input image and the learning images, and the degree of matching between the positions at which the input image and the learning images were captured. More specifically, a learning model is selected for which the input image and the learning images are similar, objects of the same type are captured in both, and the capture positions match. As a result, the learning model can estimate geometric information with high accuracy, and the position and orientation of the imaging device can be calculated with high accuracy.
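
One way to combine the three criteria is sketched below. Multiplying the three scores, so that a model must do well on all of them, and using the Jaccard index for the object-type match are assumptions for illustration; the combination rule is not fixed in this section.

```python
from dataclasses import dataclass

def object_match(input_objects, training_objects):
    """Jaccard index of the detected object types (hypothetical measure)."""
    a, b = set(input_objects), set(training_objects)
    return len(a & b) / len(a | b) if (a | b) else 0.0

@dataclass
class ModelScores:
    image_similarity: float  # each score is assumed to lie in [0, 1]
    object_match: float
    position_match: float

def combined_evaluation(s):
    """Product of the three scores: high only if all three are high."""
    return s.image_similarity * s.object_match * s.position_match

def select_model(scores_by_model):
    """Return the name of the model with the highest combined evaluation."""
    return max(scores_by_model,
               key=lambda name: combined_evaluation(scores_by_model[name]))
```

A weighted sum would be an equally plausible choice when one criterion should be allowed to compensate for another.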

In the sixth embodiment, the evaluation value of each learning model is calculated so that the more similar the third geometric information, estimated by the geometric information estimation unit 140 using the learning model, is to the second geometric information, estimated from the input images by motion stereo, the higher the evaluation value. This makes it possible to select a learning model that can output third geometric information similar to the second geometric information estimated by motion stereo, and the position and orientation of the imaging device can be calculated with high accuracy.
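
The comparison between the second and third geometric information can be sketched as follows, treating both as depth maps of the same size. The error measure (mean relative depth difference over pixels where both maps are valid), the mapping to a score, and the assumption that both maps share the same scale are illustrative choices, not part of this description.

```python
import numpy as np

def depth_agreement(second_depth, third_depth):
    """Score in (0, 1]; 1 means the two depth maps agree exactly.

    A depth of 0 marks a pixel without a valid estimate.
    """
    second = np.asarray(second_depth, dtype=np.float64)
    third = np.asarray(third_depth, dtype=np.float64)
    valid = (second > 0) & (third > 0)
    if not valid.any():
        return 0.0
    rel_err = np.abs(second[valid] - third[valid]) / second[valid]
    return float(1.0 / (1.0 + rel_err.mean()))

def select_by_agreement(second_depth, third_depth_by_model):
    """Return the best model name and the evaluation value of every model."""
    scores = {name: depth_agreement(second_depth, d)
              for name, d in third_depth_by_model.items()}
    return max(scores, key=scores.get), scores
```

Because motion stereo typically yields depth only where image texture allows matching, the `valid` mask restricts the comparison to pixels for which both estimates exist.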

Moreover, a learning model that can output geometric information with high accuracy can be selected even when a learning model has been trained on learning images of multiple scenes, or when no information characterizing the learning model is held. The position and orientation of the imaging device can therefore be calculated with high accuracy.

In the seventh embodiment, information characterizing the learning models, a composite image in which CG of a virtual object is composited, the geometric information output by a learning model, and a three-dimensional shape reconstructed from that geometric information are presented. This allows the user to visually check how well each learning model fits the scene shown in the input image and whether the correct learning model has been selected. The user can also select a learning model based on the displayed information. Furthermore, when an inappropriate learning model has been selected, the user can decide to redo the processing and reselect an appropriate learning model. The position and orientation can therefore be calculated with high accuracy.

In the eighth embodiment, even after a learning model has been selected, the evaluation values of the learning models are calculated again and a learning model is reselected. In this way, the learning models can be re-evaluated even when the scene shown in the input image changes, for example because the user moves while experiencing mixed reality. By selecting a learning model with a high re-evaluation result, the learning model can calculate geometric information with high accuracy for the scene shown in the input image at that time, and the position and orientation of the imaging device can be calculated with high accuracy.

In the ninth embodiment, the model learning images and model learning depth maps are classified based on the scene discrimination results for the model learning images, and a learning model is generated for each scene type. By generating a learning model for each scene type in this way and, when calculating the position and orientation as described in the first embodiment, using the learning model whose scene type matches that of the captured image, the position and orientation of the imaging device can be calculated with high accuracy.

<Definitions>
The image input unit in the present invention may be anything that inputs an image of real space. For example, it may input an image from a camera that captures grayscale images or from a camera that captures RGB images. It may input an image from a camera capable of capturing depth information, range images, or three-dimensional point cloud data. The camera may be a monocular camera, or images captured by a camera comprising two or more cameras or sensors may be input. Furthermore, an image captured by a camera may be input directly or via a network.

The learning model in the present invention may be anything that outputs geometric information when given a camera image as input. An example is a neural network or CNN (Convolutional Neural Network) trained in advance to output geometric information when a camera image is input. The geometric information estimated by the learning model is, for example, a depth map, which is depth information estimated for each pixel of the input image. The learning model may also be one that calculates, as geometric information, salient points in the input image for use in acquiring the position and orientation. It may also be a learning model trained to take two images, the previous frame and the current frame, as input and to estimate, as geometric information, the six-degree-of-freedom position and orientation between them.

The learning model group holding unit may be anything that holds at least two (a plurality of) learning models. In addition to the learning models, it may also hold information lists that characterize the held learning models. Information characterizing a learning model is information used to calculate an evaluation value representing how well the learning model fits the scene shown in the input image. Specifically, the information lists characterizing a learning model are the learning images used to train the learning model, a list of information on the objects shown in the learning images, a list of position information recorded when the learning images were captured, and a situation information list describing conditions that can change the appearance of an image, such as the date and time, the season, and the weather when the learning images were captured. The internal parameters of the camera that captured the learning images may also be held for each learning model.

The learning model selection unit may be anything that assigns evaluation values to the learning models held by the learning model group holding unit. The evaluation value here is an index representing how well a learning model fits the scene captured by the imaging device. Specifically, it is the similarity between the input image and the learning images, or the degree of matching of the object types detected from those images or of the position information recorded when the images were captured.

The evaluation value is not limited to the above calculation methods and may be calculated based on how closely the geometric information output by a learning model approximates geometric information measured from the input image. Specifically, the evaluation value of a learning model may be calculated by comparing the second geometric information, calculated from the input images by the motion stereo method, with the third geometric information output by the learning model.

Furthermore, the learning model selection unit may be configured to select a learning model based on the evaluation values. Alternatively, the information lists characterizing the learning models may be presented on a display, such as that of a mobile device, and the learning model specified by the user's input may be selected.

The geometric information estimation unit may be anything that inputs an input image to a learning model and calculates geometric information. In the present invention in particular, the output that the geometric information estimation unit obtains by inputting the input image to the learning model selected by the learning model selection unit is called the first geometric information, and the position and orientation acquisition unit uses it to acquire the position and orientation. The geometric information estimation unit may also estimate third geometric information as an index from which the learning model selection unit calculates the evaluation values. In this case, the output obtained by inputting the input image to each learning model held by the learning model group holding unit is called the third geometric information, and the learning model selection unit uses it to calculate the evaluation value of each learning model. Furthermore, the geometric information estimation unit may be configured to calculate second geometric information from the input images using motion stereo.

The position and orientation acquisition unit may be anything that calculates the position and orientation of the camera using the geometric information output by a learning model. For example, each pixel of the previous frame may be projected onto the current frame using the geometric information output by the learning model, and the position and orientation may be calculated so as to minimize the brightness difference between the projected pixels of the previous frame and the corresponding pixels of the current frame. The calculation is also not limited to using the first geometric information output by the learning model as is; the geometric information estimation unit may compute integrated geometric information by also incorporating the second geometric information through time-series filtering, and that integrated geometric information may be used to calculate the position and orientation of the camera.
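
The brightness difference that is minimized can be sketched as a cost function of a candidate pose. The sketch assumes a pinhole camera with intrinsic matrix `K`, a depth map for the previous frame, nearest-pixel lookup in the current frame, and a pose given as a rotation matrix `R` and translation `t` from the previous to the current camera frame; these conventions and the function name are assumptions. An actual implementation would minimize this cost over the six pose parameters with a nonlinear optimizer.

```python
import numpy as np

def photometric_error(prev_img, cur_img, prev_depth, K, R, t):
    """Mean squared brightness difference after warping prev_img into cur_img."""
    K = np.asarray(K, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64).reshape(3, 1)
    h, w = prev_img.shape
    v, u = np.mgrid[0:h, 0:w]
    z = np.asarray(prev_depth, dtype=np.float64)
    valid = z > 0  # pixels of the previous frame that have a depth value
    # Back-project previous-frame pixels to 3D points in the previous camera frame.
    pix = np.stack([u[valid], v[valid], np.ones(int(valid.sum()))])  # (3, N)
    pts = np.linalg.inv(K) @ pix * z[valid]
    # Transform the points into the current camera frame and project them.
    pts_cur = np.asarray(R, dtype=np.float64) @ pts + t
    front = pts_cur[2] > 0
    proj = K @ pts_cur[:, front]
    u2 = np.round(proj[0] / proj[2]).astype(int)
    v2 = np.round(proj[1] / proj[2]).astype(int)
    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    if not inside.any():
        return float("inf")  # no overlap between the two frames
    src = prev_img[valid][front][inside].astype(np.float64)
    dst = cur_img[v2[inside], u2[inside]].astype(np.float64)
    return float(np.mean((src - dst) ** 2))
```

The pose that yields the smallest value of this cost is taken as the relative position and orientation between the two frames.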

(Other embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of that system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

1: information processing device, 11: imaging device, 12: display information generation unit, 13: display unit, 110: image input unit, 120: learning model selection unit, 130: learning model group holding unit, 140: geometric information estimation unit, 150: position information acquisition unit

Claims

An information processing apparatus comprising:
holding means for holding a plurality of learning models, each trained using, as teacher data, a captured image captured by an imaging device and depth information of the captured image, and each used to estimate depth information corresponding to an input image, in association with the imaging position of the captured image used as the teacher data;
selection means for selecting, from the plurality of learning models, a learning model suited to the scene captured in the input image, based on an evaluation result obtained from the degree of matching between any of the imaging positions held in association with the learning models and the imaging position at which the input image was captured; and
estimating means for estimating first depth information using the input image and the selected learning model,
wherein the estimating means further estimates second depth information from the input image by a motion stereo method and further estimates, for each of the learning models, third depth information based on the input image and the learning model, and the selection means calculates the evaluation result of the learning model such that the higher the degree of matching between the second depth information and the third depth information, the higher the evaluation result.

The information processing apparatus according to claim 1, further comprising acquisition means for acquiring the position and orientation of the imaging device based on the first depth information.

The information processing apparatus according to claim 1, further comprising a sensor for measuring the amount of movement of the imaging device,
wherein the estimating means calculates the evaluation result of the learning model such that the higher the degree of matching between the third depth information and at least one of the sensor information measured by the sensor and depth information calculated based on the sensor information, the higher the evaluation result.

An information processing apparatus comprising:
holding means for holding a plurality of learning models, each trained using, as teacher data, a captured image captured by an imaging device and depth information of the captured image, and each used to estimate depth information corresponding to an input image, in association with the imaging position of the captured image used as the teacher data;
selection means for selecting, from the plurality of learning models, a learning model suited to the scene captured in the input image, based on an evaluation result obtained from the degree of matching between any of the imaging positions held in association with the learning models and the imaging position at which the input image was captured; and
estimating means for estimating first depth information using the input image and the selected learning model,
wherein the estimating means further estimates, for each learning model, third depth information based on the input image and the learning model, and
the selection means calculates the evaluation result of the learning model such that the more closely the size or shape of a known object detected from the input image matches the third depth information, the higher the evaluation result.

The information processing apparatus according to claim 2, further comprising generating means for generating display information based on at least one of the input image, the first depth information, the information held by the holding means, the evaluation result, and the position and orientation.

The information processing apparatus according to claim 5, wherein the generating means generates the display information by compositing a CG image of a virtual object onto the input image based on the first depth information.

The information processing apparatus according to claim 5, wherein the estimating means further estimates, for each learning model, third depth information based on the input image and the learning model, and
the generating means generates the display information by compositing a CG image of a virtual object onto the input image based on the third depth information.

The information processing apparatus according to claim 5, wherein the generating means generates the display information by reconstructing, based on the first depth information, the three-dimensional shape of the scene captured in the input image.

The information processing apparatus according to claim 5, wherein the estimating means further estimates, for each learning model, third depth information based on the input image and the learning model, and
the generating means generates the display information by reconstructing, based on the third depth information, the three-dimensional shape of the scene captured in the input image.

The information processing apparatus according to any one of claims 5 to 9, further comprising a display unit that displays the display information.

The information processing apparatus according to claim 1, wherein the estimating means updates the second depth information and the third depth information based on a second input image captured by the imaging device at a second time different from the time at which the input image was captured, and
the selection means recalculates the evaluation result of the learning model such that the higher the degree of matching between the updated second depth information and the updated third depth information, the higher the evaluation result.

The information processing apparatus according to claim 1, wherein the estimating means estimates a plurality of pieces of second depth information and a plurality of pieces of third depth information based on a plurality of input images captured by the imaging device at a plurality of times, and
the selection means calculates the evaluation result of the learning model such that the higher the degree of matching between the plurality of pieces of second depth information and the plurality of pieces of third depth information, the higher the evaluation result.

The information processing apparatus according to any one of claims 1 to 12, further comprising:
second image input means for inputting a third input image and fourth depth information acquired by a second imaging device;
learning data classification means for classifying the third input image and the fourth depth information according to the type of scene appearing in the third input image or the fourth depth information;
learning data holding means for holding the third input image and the fourth depth information for each scene type based on the classification result of the learning data classification means; and
learning model generating means for training the plurality of learning models for each scene type using the third input image and the fourth depth information held by the learning data holding means, and for causing the learning data holding means to hold the learning models.

A control method for an information processing apparatus comprising holding means for holding a plurality of learning models, each trained using, as teacher data, a captured image captured by an imaging device and depth information of the captured image, and each used to estimate depth information corresponding to an input image, in association with the imaging position of the captured image used as the teacher data, the method comprising:
a selection step of selecting, from the plurality of learning models, a learning model suited to the scene captured in the input image, based on an evaluation result obtained from the degree of matching between any of the imaging positions held in association with the learning models and the imaging position at which the input image was captured; and
an estimation step of estimating first depth information using the input image and the selected learning model,
wherein, in the estimation step, second depth information is further estimated from the input image by a motion stereo method and third depth information is further estimated for each learning model based on the input image and the learning model, and, in the selection step, the evaluation result of the learning model is calculated such that the higher the degree of matching between the second depth information and the third depth information, the higher the evaluation result.

A program for causing a computer to function as the information processing apparatus according to any one of claims 1 to 13.