JP2019124539A

JP2019124539A - Information processing device, control method therefor, and program

Info

Publication number: JP2019124539A
Application number: JP2018004470A
Authority: JP
Inventors: Makoto Tomioka (冨岡 誠); Kazuhiko Kobayashi (小林 一彦); Masahiro Suzuki (鈴木 雅博)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-01-15
Filing date: 2018-01-15
Publication date: 2019-07-25
Anticipated expiration: 2038-01-15
Also published as: JP7133927B2

Abstract

To calculate the position and orientation of an imaging device at high speed and with high accuracy even while the imaging device is moving. SOLUTION: The information processing device comprises: an input unit for inputting visual information captured by an imaging device at a first time; a holding unit for holding a learning model for predicting, on the basis of the visual information inputted by the input unit, at least one of visual information and geometric information at a second time at or after the first time; a prediction unit for predicting at least one of the visual information and the geometric information at the second time using the learning model; and a first calculation unit for calculating the position and orientation of the imaging device at the second time on the basis of at least one of the visual information and the geometric information at the second time predicted by the prediction unit. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus, a control method therefor, and a program.

Measuring the position and orientation of an imaging device from image information serves various purposes, such as aligning real space with virtual objects in mixed reality and augmented reality, self-localization of robots and automobiles, and three-dimensional modeling of objects and spaces.

Non-Patent Document 1 discloses a method that uses a learning model trained in advance to estimate geometric information (depth information), which serves as an index for calculating position and orientation from an image, and then calculates the position and orientation from the estimated depth information.

Non-Patent Document 1: K. Tateno, F. Tombari, I. Laina, and N. Navab, "CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction," in Proc. CVPR, 2017.
Non-Patent Document 2: Z. Zhang, "A flexible new technique for camera calibration," Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330-1334, 2000.
Non-Patent Document 3: X. Shi, Z. Chen, H. Wang, and D. Yeung, "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting," in Proc. NIPS, 2015.
Non-Patent Document 4: I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, "Deeper depth prediction with fully convolutional residual networks," in Proc. 3DV, 2016.
Non-Patent Document 5: K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. ICCV, 2017.

In Non-Patent Document 1, the geometric information (depth information) at the time an input image was captured is estimated from that input image, and the position and orientation of the imaging device are then calculated. Estimating the geometric information and calculating the position and orientation both take considerable time. Non-Patent Document 1 assumes that the imaging device does not move significantly during this calculation. There is therefore a need for a solution that calculates the position and orientation at high speed and with high accuracy when the imaging device moves while the geometric information is being estimated or the position and orientation are being calculated.

To solve this problem, an information processing apparatus according to the present invention has, for example, the following configuration. That is, the apparatus comprises:
input means for inputting visual information captured by an imaging device at a first time;
holding means for holding a learning model for predicting, based on the visual information input by the input means, at least one of visual information and geometric information at a second time at or after the first time;
prediction means for predicting at least one of the visual information and the geometric information at the second time using the learning model; and
first calculation means for calculating the position and orientation of the imaging device at the second time based on at least one of the visual information and the geometric information at the second time predicted by the prediction means.

According to the present invention, the position and orientation can be calculated at high speed and with high accuracy even while the imaging device is moving.

FIG. 1 is a diagram showing a configuration example of the driving control system in the first embodiment.
FIG. 2 is a diagram showing the configuration of the information processing apparatus in the first embodiment.
FIG. 3 is a diagram showing the data structure of the holding unit in the first embodiment.
FIG. 4 is a diagram showing the hardware configuration of the information processing apparatus in the first embodiment.
FIG. 5 is a flowchart showing the processing procedure in the first embodiment.
FIG. 6 is a flowchart showing the processing procedure of position and orientation calculation in the second embodiment.
FIG. 7 is a flowchart showing the processing procedure in the third embodiment.
FIG. 8 is a flowchart showing the processing procedure of object information calculation in the third embodiment.
FIG. 9 is a diagram showing an example of a GUI that presents display information in the third embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings. The configurations shown in the following embodiments are merely examples, and the present invention is not limited to the illustrated configurations.

[First Embodiment]
This embodiment describes an example in which the information processing apparatus is mounted on a moving body such as an automobile. The information processing apparatus calculates the position and orientation of the automobile (hereinafter, "position and orientation") and, based on the result, controls the automobile's steering, accelerator, brakes, and so on; that is, it performs automated driving. The imaging device in this embodiment is a monocular camera that captures image data with three components (R, G, and B) per pixel. It is mounted near the windshield of the automobile with its optical axis aligned with the automobile's straight-ahead direction. The mounting position of the imaging device relative to the automobile's center of gravity is known and is stored in advance in a non-volatile memory or the like. The position and orientation of the automobile can therefore be obtained by applying a geometric transformation, based on the stored information, to the position and orientation of the imaging device calculated from the captured image. In other words, obtaining the position and orientation of the imaging device is equivalent to obtaining those of the automobile. The calculation of the position and orientation of the imaging device is described later. The automobile is controlled so that its calculated position and orientation match the planned route to the destination calculated in advance by a car navigation system.
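
The geometric transformation from camera pose to vehicle pose can be sketched as a composition of homogeneous transforms. This is a minimal illustration, not the apparatus's actual implementation: the 4x4 matrix representation, the function names, and the numeric mounting offset are all assumptions made for the example.

```python
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def vehicle_pose_from_camera_pose(T_world_camera, T_vehicle_camera):
    """Return the vehicle pose in the world frame.

    T_world_camera:   camera pose in the world frame (estimated from images).
    T_vehicle_camera: fixed mounting pose of the camera in the vehicle frame
                      (hypothetical name for the stored mounting information).
    Since T_world_camera = T_world_vehicle @ T_vehicle_camera, the vehicle
    pose follows by right-multiplying with the inverse mounting transform.
    """
    return T_world_camera @ np.linalg.inv(T_vehicle_camera)

# Assumed values: camera offset (1.5, 0.0, 1.2) from the center of gravity.
T_vehicle_camera = make_transform(np.eye(3), [1.5, 0.0, 1.2])
T_world_camera = make_transform(np.eye(3), [11.5, 2.0, 1.2])
T_world_vehicle = vehicle_pose_from_camera_pose(T_world_camera, T_vehicle_camera)
```

Because the mounting transform is constant, it only needs to be inverted once at start-up in practice.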

The position and orientation of the imaging device are calculated using geometric information that a machine-learning estimator estimates from an input image captured by the imaging device. In this embodiment, the geometric information estimated by the estimator is a depth map, that is, depth information estimated for each pixel of the input image. Specifically, each point of the depth map that the estimator estimated from the input image captured at a time t' other than time t is first projected onto the input image captured at time t. Projection here means calculating which pixel of the image at time t the three-dimensional point (a position X, Y, Z in space) obtained from each depth value of the depth map at time t' falls on. The position and orientation are then calculated so as to minimize the difference between the brightness of the input-image pixel corresponding to each pixel of the source depth map and the brightness of the input-image pixel at its projection destination. These methods are described in detail in Non-Patent Document 1, which can be relied on here.
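
The projection and brightness comparison described above can be sketched as follows. This is an illustrative sketch under assumptions, not the method of Non-Patent Document 1: it assumes a pinhole camera without lens distortion, grayscale images, nearest-pixel lookup, and a mean-squared error; the function names are hypothetical.

```python
import numpy as np

def backproject(depth, K):
    """Convert a depth map (H, W) into 3D points (H*W, 3) in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth.ravel()
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def photometric_error(img_src, depth_src, img_dst, K, R, t):
    """Mean squared brightness difference after projecting source pixels into
    the destination image. (R, t) maps source-camera coordinates to
    destination-camera coordinates."""
    img_src = np.asarray(img_src, dtype=float)
    img_dst = np.asarray(img_dst, dtype=float)
    pts = backproject(depth_src, K) @ R.T + t
    valid = pts[:, 2] > 1e-6                      # keep points in front of the camera
    z = np.where(valid, pts[:, 2], 1.0)           # avoid division by zero
    u = np.rint(K[0, 0] * pts[:, 0] / z + K[0, 2]).astype(int)
    v = np.rint(K[1, 1] * pts[:, 1] / z + K[1, 2]).astype(int)
    h, w = img_dst.shape
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    if not valid.any():
        return float("inf")
    diff = img_src.ravel()[valid] - img_dst[v[valid], u[valid]]
    return float(np.mean(diff ** 2))
```

A pose estimator would search over (R, t) for the value that minimizes this error, typically with an iterative nonlinear least-squares solver.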

In this case, the position and orientation become available at time t + Δt, where Δt is the processing time required for the CNN (Convolutional Neural Network) to estimate depth and for the position and orientation to be calculated from it. If the automobile is traveling, it keeps moving during the time Δt. At time t + Δt, when the position and orientation have been calculated and control is applied, the automobile is therefore controlled using the position and orientation at time t, so control lags. This embodiment therefore describes a method that calculates the position and orientation at time t + Δt from an image and a depth map predicted by a learning model capable of predicting the image and depth map Δt into the future.

In this embodiment, the position and orientation of the imaging device consist of six parameters: three parameters representing the position of the camera in world coordinates defined in real space and three parameters representing the orientation of the camera. The three-dimensional coordinate system defined on the camera, with the optical axis of the camera as the Z axis, the horizontal direction of the image as the X axis, and the vertical direction of the image as the Y axis, is called the camera coordinate system.
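
As a sketch of how the six parameters might be turned into a usable transform, the following converts them into a 4x4 matrix. The text does not say how the three orientation parameters are encoded; an axis-angle (rotation vector) encoding is assumed here purely for illustration, and `pose_to_matrix` is a hypothetical name.

```python
import numpy as np

def pose_to_matrix(pose):
    """Convert (tx, ty, tz, rx, ry, rz) into a 4x4 homogeneous transform.

    The last three values are ASSUMED to be an axis-angle rotation vector,
    converted to a rotation matrix with Rodrigues' formula.
    """
    t = np.asarray(pose[:3], dtype=float)
    r = np.asarray(pose[3:], dtype=float)
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        R = np.eye(3)
    else:
        k = r / angle                              # unit rotation axis
        Kx = np.array([[0.0, -k[2], k[1]],
                       [k[2], 0.0, -k[0]],
                       [-k[1], k[0], 0.0]])        # cross-product matrix of k
        R = np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

For example, `pose_to_matrix([0, 0, 0, 0, 0, np.pi / 2])` represents a quarter-turn about the camera's Z (optical) axis with no translation.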

FIG. 1 is a schematic configuration diagram of an automobile 1 equipped with the information processing apparatus 10 according to the first embodiment. The automobile 1 is equipped with an imaging device 11, which is a camera, and a control device 12 that controls the accelerator and brake equipment, which changes the speed of the automobile by changing the rotational speed of the tires, and the steering device, which changes the direction of travel by changing the orientation of the tires.

FIG. 2 is a functional configuration diagram of the information processing apparatus 10 in this embodiment. The information processing apparatus 10 includes an input unit 110, a prediction unit 120, a holding unit 130, a calculation unit 140, and a control unit 150. The input unit 110 is connected to the imaging device 11 mounted on the automobile. The control unit 150 is connected to the control device 12, which controls the steering, accelerator, and brake equipment of the automobile. Note that FIG. 1 is one example of a device configuration and does not limit the scope of application of the present invention.

The input unit 110 receives two-dimensional images of the scene captured by the imaging device 11 as input images in time series (for example, 60 frames per second) and outputs them to the prediction unit 120 and the holding unit 130.

The prediction unit 120 uses the learning model held by the holding unit 130 to predict, from the image received from the input unit 110, the image a preset time later and its geometric information. Hereinafter, the predicted image is called the predicted image and the predicted geometric information is called the predicted geometric information.

The holding unit 130 holds the learning model. Given an input image, the learning model can predict the image and depth map Δt into the future. The learning model internally keeps a history of the information used for prediction and updates its internal state each time an image is input. Because it predicts by taking into account how far each pixel in the latest image has moved relative to past inputs, it remains applicable when the vehicle speed changes. The holding unit 130 also associates images of the scene captured in advance with their positions and orientations to form key frames, and holds the group of key frames as a map. The data structure of the holding unit 130 and the details of the learning model are described later.

The calculation unit 140 calculates the position and orientation of the imaging device 11 based on the predicted image and predicted geometric information predicted by the prediction unit 120 and the map held by the holding unit 130. The calculation unit 140 then obtains the position and orientation of the automobile from the calculated position and orientation of the imaging device and supplies them to the control unit 150.

The control unit 150 calculates control values for the automobile based on the position and orientation of the automobile received from the calculation unit 140 and supplies them to the control device 12.

The control device 12 controls the automobile 1 based on the control values calculated by the control unit 150.

FIG. 3 is a diagram showing the data structure held by the holding unit 130. In this embodiment, the holding unit 130 holds the learning model that the prediction unit 120 uses to predict the predicted image and the geometric information. The learning model is, for example, a data file in which a CNN classifier is stored in binary format. The holding unit 130 also holds a map that serves as the reference for calculating position and orientation. The map holds a plurality of key frames, each of which associates an image of the scene with a position and orientation. The position and orientation here are the six parameters of the position and orientation of the imaging device 11 at the time it captured the key frame.
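
One way to picture the data structure described for the holding unit is the following sketch. The class and field names (`Keyframe`, `HoldingUnit`, `model_path`) are hypothetical and chosen only for illustration; the actual layout in FIG. 3 may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Three position parameters followed by three orientation parameters.
Pose = Tuple[float, float, float, float, float, float]

@dataclass
class Keyframe:
    image: List[List[float]]  # scene image captured in advance (grayscale here for brevity)
    pose: Pose                # pose of the imaging device when the image was captured

@dataclass
class HoldingUnit:
    model_path: str                                          # binary learning-model file (assumed to live on disk)
    keyframes: List[Keyframe] = field(default_factory=list)  # the map

    def add_keyframe(self, image: List[List[float]], pose: Pose) -> None:
        """Register one image and its pose as a key frame of the map."""
        self.keyframes.append(Keyframe(image, pose))
```

Storing the pose alongside each image lets the calculation unit look up a reference view and its known pose in a single step.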

FIG. 4 is a diagram showing the hardware configuration of the information processing apparatus 10. The information processing apparatus 10 includes a CPU 311, a ROM 312, a RAM 313, an external memory 314, an input unit 315, a display unit 316, a communication I/F 317, and an I/O 318.

The CPU 311 governs overall control of the information processing apparatus 10 by controlling the various devices connected to the system bus 321. The ROM 312 stores a BIOS program and a boot program. The RAM 313 is used as the main memory of the CPU 311 and stores the OS (operating system) and application programs that make the apparatus function as the information processing apparatus 10. The external memory 314 is typically a large-capacity storage device such as a hard disk and stores the OS, the driving-support application program described in the embodiments, and various data. The input unit 315 consists of a touch panel display and various input buttons and handles the input of information and the like. The display unit 316 outputs the calculation results of the information processing apparatus 10 to a display device in accordance with instructions from the CPU 311. The display device may be of any type, such as a liquid crystal display, a projector, or an LED indicator. The communication I/F (interface) 317 performs information communication via a network; it may be, for example, an Ethernet I/F, and any type such as USB, serial communication, or wireless communication may be used. The I/O 318 is connected to 319 of the imaging device 11 and to the control device 12.

In the above configuration, when the apparatus is powered on, the CPU 311 executes the boot program stored in the ROM 312, thereby loading the OS from the external memory 314 into the RAM 313 and executing it. Under the control of the OS, the CPU 311 then loads the application program for driving the automobile 1 from the external memory 314 into the RAM 313 and executes it, which realizes the functional configuration shown in FIG. 2. The holding unit 130 in FIG. 2 is realized by the RAM 313. In this embodiment the components in FIG. 2 are realized by software executed by the CPU 311, but some of them may be realized by hardware independent of the CPU 311.

Next, the processing procedure of the information processing apparatus 10 in this embodiment is described with reference to the flowchart of FIG. 5. The procedure consists of the following steps: initialization S110, image capture S120, image input S130, geometric information prediction S140, position and orientation calculation S150, control value calculation S160, control S170, and system termination determination S180.

In step S110, the CPU 311 initializes the system. That is, it reads the OS and programs from the external memory 314 and puts the information processing apparatus 10 into an operable state. It also reads the parameters of each device connected to the information processing apparatus 10 (such as the imaging device 11) and the initial position and orientation of the imaging device 11. The internal parameters of the imaging device 11 (the focal lengths fx in the horizontal image direction and fy in the vertical image direction, the image center position cx in the horizontal image direction and cy in the vertical image direction, and the lens distortion parameters) are assumed to have been calibrated in advance by Zhang's method (see Non-Patent Document 2). In addition, each control device of the automobile 1 is started and put into an operable and controllable state.

In step S120, the imaging device 11 captures the scene and supplies the captured image to the input unit 110.

In step S130, the input unit 110 acquires the image captured by the imaging device 11 as an input image. In this embodiment, the input image is an RGB image in which each pixel consists of the three components R, G, and B.

In step S140, the prediction unit 120 refers to the learning model held by the holding unit 130 and estimates the predicted image and the predicted geometric information from the input image. The machine-learning estimator in this embodiment is a CNN (Convolutional Neural Network). This CNN is assumed to have been trained so that, given an image at a time t, it can predict the predicted image and the predicted geometric information at time t + Δt, that is, Δt later. The network configuration and training method of the CNN are described later.

In step S150, the calculation unit 140 calculates the position and orientation of the imaging device 11 based on the predicted image and predicted geometric information predicted by the prediction unit 120 and the map held by the holding unit 130. Specifically, the calculation unit 140 projects the predicted geometric information onto the nearest key frame in the map using the position and orientation calculation method described above. The calculation unit 140 then calculates the position and orientation of the imaging device 11 at time t + Δt so as to minimize the brightness difference between each pixel value of the predicted image and the pixel value of the key frame associated with it by the projection.

In step S160, the control unit 150 calculates a control value for steering the automobile so as to minimize the distance between the position and orientation obtained in step S150 and the route information calculated in advance from the automobile's departure point and destination. The route information here is a list of three-dimensional positions (called anchor points) representing positions on the route that the car navigation system calculated in advance from the departure point and the destination. The control unit 150 selects the anchor point with the smallest squared distance to the calculated position of the imaging device 11, calculates a direction vector along which the distance between that anchor point and the imaging device 11 decreases, and supplies it to the control device 12.
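
The anchor-point selection and direction-vector calculation of step S160 can be sketched as follows. This is a minimal illustration; normalizing the direction vector to unit length and the name `steering_direction` are assumptions not stated in the text.

```python
import math

def steering_direction(camera_position, anchor_points):
    """Select the anchor point with the smallest squared distance to the camera
    and return it together with the unit vector from the camera toward it."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, camera_position))

    anchor = min(anchor_points, key=sq_dist)
    norm = math.sqrt(sq_dist(anchor))
    if norm == 0.0:
        # Already on the anchor point: no preferred direction.
        return anchor, (0.0, 0.0, 0.0)
    direction = tuple((a - b) / norm for a, b in zip(anchor, camera_position))
    return anchor, direction

# Usage with assumed coordinates: the closer of two anchor points is chosen.
anchor, direction = steering_direction((0.0, 0.0, 0.0),
                                       [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)])
```

Moving along the returned vector reduces the distance to the selected anchor point, which is the property the step requires of the direction vector.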

In step S170, the control device 12 controls the steering in accordance with the supplied information so that the automobile 1 heads in the direction of the calculated direction vector.

Then, in step S180, the CPU 311 repeats the processing from step S120 onward until the system terminates (the user instructs that driving support stop, or the automobile arrives at the destination).
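
The overall flow of steps S120 to S180 can be summarized as a simple loop. In this sketch all six callables are hypothetical stand-ins for the units in FIG. 2 and for the termination check; initialization (S110) is assumed to have been done before the loop starts.

```python
def run(capture, predict, localize, compute_control, actuate, should_stop):
    """Repeat steps S120 to S170 until should_stop() reports termination (S180)."""
    while True:
        image = capture()                                   # S120, S130: capture and input
        predicted_image, predicted_depth = predict(image)   # S140: predict for time t + dt
        pose = localize(predicted_image, predicted_depth)   # S150: position and orientation
        control_value = compute_control(pose)               # S160: control value
        actuate(control_value)                              # S170: control the automobile
        if should_stop():                                   # S180: termination determination
            break
```

Passing the units in as callables keeps the sketch independent of any particular model or vehicle interface.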

Next, the network configuration and training method of the CNN that estimates the predicted image and the predicted geometric information are described. The CNN in this embodiment is based on ConvLSTM (Convolutional Long Short-Term Memory), a network structure that can retain information about past input images, and FCN (Fully Convolutional Network), a network structure that can estimate a depth map from an image. ConvLSTM estimates the predicted image, that is, a future image, from the images input in the past, and FCN estimates the predicted geometric information from the predicted image. Concretely, the output layer of ConvLSTM is first branched into two. One branch is the first output and outputs the predicted image. The other branch is connected to FCN as the input layer of FCN, and the output of FCN is the second output, which outputs the predicted geometric information. The following training procedure is then carried out so that the predictions are accurate.

Specifically, ConvLSTM and FCN are first trained separately and are then connected and trained together. First, ConvLSTM is trained so that, given the image at time t, it can estimate the predicted image at time t + Δt. From training images captured in advance by an on-board camera at intervals of Δt, a predetermined number M of images up to and including time t are input to ConvLSTM in sequence. Only ConvLSTM is trained at this stage, so as to minimize the error between the predicted image that ConvLSTM outputs once the image at time t has been input and the image at time t + Δt. Trained in this way, the ConvLSTM layer internally retains the image information input in the past whenever the imaging device 11 inputs an image, and predicts the image Δt later by taking into account the amount of change in that past image information. In particular, training with images captured at different vehicle speeds allows the model, at estimation time, to predict the image Δt later in accordance with the vehicle speed even when images captured at different vehicle speeds are input.
Next, so that the FCN layer can predict geometric information from an image, FCN is trained to minimize the error in depth values between the predicted geometric information estimated by inputting a training image captured in advance at time t into FCN and a depth map captured in advance at the same time t by a depth camera. FCN can then estimate a depth map from an input image. Finally, to reduce inconsistency between the prediction results of ConvLSTM and FCN, ConvLSTM and FCN are connected and trained together. With the training images captured in advance at intervals of Δt as input, the network is trained so as to reduce both the error between the first output of ConvLSTM and the image at time t + Δt and the error between the output of FCN and the depth map at time t + Δt. The CNN trained in this way is used as the learning model.
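
The three training stages can be written as loss functions. The text only speaks of minimizing an "error", so the squared L2 norm and the weighting factor λ below are assumptions made for illustration. Here I denotes a captured image, D a depth map captured by the depth camera, and a hat denotes a network output.

```latex
\begin{align}
\mathcal{L}_{\mathrm{img}}   &= \bigl\lVert \hat{I}_{t+\Delta t} - I_{t+\Delta t} \bigr\rVert_2^2
  && \text{(stage 1: ConvLSTM only)} \\
\mathcal{L}_{\mathrm{depth}} &= \bigl\lVert \hat{D}_{t} - D_{t} \bigr\rVert_2^2
  && \text{(stage 2: FCN only, input } I_t \text{)} \\
\mathcal{L}_{\mathrm{joint}} &= \bigl\lVert \hat{I}_{t+\Delta t} - I_{t+\Delta t} \bigr\rVert_2^2
  + \lambda \, \bigl\lVert \hat{D}_{t+\Delta t} - D_{t+\Delta t} \bigr\rVert_2^2
  && \text{(stage 3: ConvLSTM and FCN connected)}
\end{align}
```

In stage 3 both terms depend on the ConvLSTM weights, since the FCN input comes from the ConvLSTM branch, which is what allows joint training to reduce inconsistency between the two outputs.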

Details of ConvLSTM are disclosed in Non-Patent Document 3, and details of the FCN that estimates a depth map are disclosed in Non-Patent Document 4; both can be relied on here.

<Effect>
In the first embodiment, the prediction unit predicts a predicted image and predicted geometric information for a time at or after the time at which the imaging device captured the input image, and the calculation unit calculates the position and orientation of the imaging device based on the prediction result. Calculating the position and orientation from the prediction result in this way avoids the delay caused by the time-consuming estimation of geometric information and calculation of the position and orientation, so the position and orientation can be obtained at high speed.

<Modifications>
The calculation unit 140 in the first embodiment calculated the position and orientation based on the predicted geometric information, the predicted image, and the map. However, the method is not limited to the one described in the first embodiment. Any configuration may be used as long as it can calculate the position and orientation of the imaging device 11 at a time later than the time at which the imaging device 11 captured the input image, using the predicted geometric information, the predicted image, or both, as predicted by the prediction unit 120.

Specifically, image features may be extracted from the predicted image and from the images held in the key frames of the map, and the position and orientation may be calculated with the five-point algorithm, which computes the position and orientation from the positional relationship between features with matching feature descriptors. An image feature here is a point with distinctive characteristics, such as a corner in the image, and a feature descriptor is an identifier computed from a small image patch around the feature point so as to describe each image feature uniquely. SIFT (Scale-Invariant Feature Transform) or AKAZE (Accelerated-KAZE) descriptors can be used. In this configuration, the learning model only needs to predict the predicted image and does not need to predict the predicted geometric information. How the learning model is constructed for this configuration is described later.

When the key frames additionally hold depth maps, the depth map of a key frame may be projected onto the depth map that constitutes the predicted geometric information, and the position and orientation may be calculated with the ICP (Iterative Closest Point) algorithm, which computes the position and orientation that minimize the distances between nearest-neighbor three-dimensional points. In this configuration, the learning model only needs to predict the predicted geometric information and does not need to predict the predicted image. How the learning model is constructed for this configuration is also described later.

In the first embodiment, a learning model combining a ConvLSTM and an FCN was used. However, any configuration may be used as long as it can predict information for calculating the position and orientation at a time later than time t at which the imaging device 11 captured the image. Specifically, whereas the first embodiment uses neural networks that take local positional relationships in the image into account, such as the ConvLSTM and the FCN, an LSTM or a dense network may be used instead so that the network takes the entire image into account. In addition, in the present embodiment the output of the ConvLSTM is split in two, with one output giving the predicted image and the other serving as the input from which the FCN estimates the depth map. However, separate learning models may be prepared for prediction, such as one that estimates only the predicted image and one that calculates the predicted geometric information. Furthermore, as described in the modifications above, when only the predicted image needs to be predicted, only the ConvLSTM may be used. When only the predicted geometric information needs to be predicted, the output of the ConvLSTM need not be split in two as in the first embodiment, and a single output layer may be connected to the input layer of the FCN.

In the first embodiment, the imaging device 11 that captures images was described as an RGB camera. However, it is not limited to an RGB camera. Any camera that captures images of real space may be used, for example a camera that captures grayscale images. It may be a monocular camera, or a camera provided with two or more cameras or sensors. It may also be a device capable of acquiring geometric information such as depth information, range images, or three-dimensional point cloud data. Specific examples include a TOF (Time of Flight) camera, LiDAR (Laser Imaging Detection and Ranging), and a sonar sensor. When a device other than an RGB camera is used, the learning model is configured to take as input the images or geometric information acquired by that camera or measuring device and to predict geometric information for a time later than the acquisition time. In such a configuration, a learning model suited to each sensor is prepared in advance. For example, when the predicted geometric information at time t + Δt is predicted from the input geometric information (depth map) at time t acquired by a TOF camera and the position and orientation are calculated from it, the learning model is trained in advance so that, given the input geometric information at time t, the error in depth values between the predicted geometric information and the input geometric information at time t + Δt is reduced.

In the first embodiment, the automobile was controlled based on the position and orientation calculated from the prediction result. The prediction can also be used to shorten the processing time when the position and orientation are calculated from the actual input image rather than from the predicted image. Specifically, the position and orientation previously calculated from the predicted image are used as the initial value of the optimization in the position and orientation calculation. Using the position and orientation obtained in advance by prediction as the initial value reduces the number of iterations needed for the position and orientation calculation to converge. Moreover, because the result is fitted to actual measurements rather than relying on uncertain predictions, the position and orientation can be calculated both quickly and accurately.
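The effect of the initial value on the iteration count can be illustrated with a toy example. The sketch below is not the optimization used in the embodiment: it assumes a one-dimensional quadratic cost minimized by fixed-step gradient descent, and the function name, step size, and tolerance are all hypothetical. It shows only that starting near the solution, as a predicted pose would, converges in fewer iterations.

```python
def iterations_to_converge(initial, target, step=0.25, tol=1e-6, max_iter=1000):
    """Count gradient-descent iterations on the cost (x - target)**2.

    `initial` stands in for the initial pose estimate and `target` for the
    pose that best fits the actual measurement (both 1-D toy values).
    """
    x = initial
    for i in range(max_iter):
        grad = 2.0 * (x - target)      # derivative of (x - target)**2
        if abs(grad) < tol:            # converged
            return i
        x -= step * grad               # fixed-step update
    return max_iter


# Cold start far from the solution versus warm start from a predicted pose.
cold = iterations_to_converge(initial=0.0, target=10.0)
warm = iterations_to_converge(initial=9.9, target=10.0)
print(cold, warm)
```

In a real system the cost would be a photometric or geometric residual over a six-degree-of-freedom pose, but the same reasoning applies: the closer the initial value is to the optimum, the fewer iterations are needed.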

In the first embodiment, the calculation unit 140 calculated the position and orientation of the imaging device 11 (and hence of the automobile 1) using the predicted image and predicted geometric information predicted by the prediction unit 120. The predicted image and predicted geometric information may instead be used together with values calculated by SLAM (Simultaneous Localization and Mapping) that does not use prediction. Specifically, three-dimensional points representing a point cloud of the three-dimensional space in the scene are first calculated by the method of Non-Patent Document 1 from a plurality of input images captured at or before time t. The calculated three-dimensional points are then projected onto the predicted geometric information. Next, the projected three-dimensional points and the depths of the predicted geometric information are integrated by a weighted sum, and the resulting, more accurate geometric information is used as the predicted geometric information. In this modification, the reliability described in Non-Patent Document 1 can be used as the weight. In the method of Non-Patent Document 1, three-dimensional points are calculated by motion stereo from input images at multiple times, and the reliability of each point is the reciprocal of the variance of its depth value. With this configuration, three-dimensional points obtained by SLAM with high reliability are used in place of the corresponding points of the depth map predicted by the prediction unit 120, or are integrated with them, so that the predicted geometric information becomes more accurate and the position and orientation can be calculated with high accuracy.
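The weighted-sum integration can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Document 1: depth maps are represented as flat lists, `None` marks pixels onto which no SLAM point projects, and the weight given to the predicted depth (`predicted_variance`) is an assumption, because the source specifies only that the SLAM reliability is the reciprocal of the depth variance.

```python
def fuse_depth(predicted, slam, slam_variance, predicted_variance=1.0):
    """Fuse predicted depths with projected SLAM depths, pixel by pixel.

    predicted      -- depths from the prediction unit
    slam           -- projected SLAM depths, or None where no point projects
    slam_variance  -- depth variance of each SLAM point (None where absent)
    """
    fused = []
    for d_pred, d_slam, var in zip(predicted, slam, slam_variance):
        if d_slam is None:
            fused.append(d_pred)               # nothing to fuse with
            continue
        w_slam = 1.0 / var                     # reliability = 1 / variance
        w_pred = 1.0 / predicted_variance      # assumed weight for the prediction
        fused.append((w_pred * d_pred + w_slam * d_slam) / (w_pred + w_slam))
    return fused


print(fuse_depth([10.0, 5.0], [12.0, None], [1.0, None]))
```

A SLAM point with small variance dominates the fused value, which approaches the "use the SLAM point in place of the predicted depth" behavior described above as its variance approaches zero.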

In the first embodiment, based on the input image at time t captured by the imaging device 11, the prediction unit 120 estimated the predicted image and predicted geometric information at a single time t + Δt, and the calculation unit 140 calculated the position and orientation of the imaging device 11 at time t + Δt. However, as long as what the prediction unit 120 predicts is a predicted image or geometric information for a time at or after the capture time t, it may be configured to estimate predicted images and predicted geometric information for a plurality of times further in the future. Specifically, the prediction unit 120 inputs the predicted image at time t + nΔτ into the learning model and predicts the predicted image at time t + (n + 1)Δτ. In this way, predicted images and predicted geometric information for times beyond t + Δt can be obtained without changing the learning model, and the position and orientation at those times can be calculated. By controlling the automobile so as to minimize the distance between the future positions and orientations calculated for these multiple times and the previously calculated route, the automobile can be controlled so that the error at later times becomes small even if the position and orientation error relative to the route is large in the immediate term. This avoids sudden braking and allows the automobile to be controlled stably.
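The recursive use of the learning model can be sketched as a simple rollout loop. The function below is illustrative only: `model` is a hypothetical callable that maps the image at one time step to the predicted image one interval Δτ later, and a trivial stand-in is used in place of a trained network.

```python
def rollout(model, image, num_steps):
    """Predict num_steps future frames by feeding each prediction back in.

    The n-th element of the result corresponds to time t + n * dtau,
    for n = 1 .. num_steps.
    """
    predictions = []
    current = image
    for _ in range(num_steps):
        current = model(current)       # prediction for the next time step
        predictions.append(current)
    return predictions


# Stand-in "model" that just increments a number, to show the data flow.
print(rollout(lambda frame: frame + 1, 0, 3))
```

Because each step consumes the previous prediction, errors accumulate with the number of steps, which is one reason the later modifications limit the prediction count N when prediction accuracy is low.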

In the first embodiment, the automobile was controlled based on the position and orientation calculated from the prediction result. Whether or not to use the prediction can also be decided according to the prediction performance. Specifically, the calculation unit 140 compares the predicted image for time t + Δt, predicted from the input image at time t, with the input image actually captured by the imaging device 11 at a time near t + Δt. If the mean, maximum, minimum, and median of the per-pixel luminance differences are within a predetermined threshold, the prediction performance is judged to be high, and the predicted image and predicted geometric information are used for the position and orientation calculation. If the threshold is reached or exceeded, the prediction performance is judged to be low, and the position and orientation are calculated from the input image without using the predicted image or predicted geometric information. When the prediction performance is low, prediction may also be stopped. This configuration prevents the accuracy of position and orientation estimation from deteriorating when the prediction accuracy is low.
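The decision can be sketched as follows. This is a minimal illustration under stated assumptions: images are flat lists of luminance values of equal length, the function names are hypothetical, and a single selectable statistic is compared with one threshold, whereas the source does not specify how the four statistics are combined or how equality with the threshold is treated.

```python
from statistics import mean, median


def luminance_stats(predicted, observed):
    """Statistics of the absolute per-pixel luminance differences."""
    diffs = [abs(p - o) for p, o in zip(predicted, observed)]
    return {
        "mean": mean(diffs),
        "max": max(diffs),
        "min": min(diffs),
        "median": median(diffs),
    }


def use_prediction(predicted, observed, threshold, statistic="mean"):
    """Return True if the chosen statistic is within the threshold."""
    return luminance_stats(predicted, observed)[statistic] <= threshold


print(use_prediction([100, 102, 98], [101, 100, 98], threshold=2.0))
print(use_prediction([0, 0], [50, 60], threshold=10.0))
```

When `use_prediction` returns False, the caller would fall back to calculating the position and orientation from the actual input image, as described above.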

In the preceding modification, whether or not to use the prediction was decided based on the degree of agreement between the predicted image and the input image. The control values may also be changed based on that degree of agreement. For example, the automobile is controlled so that the upper limit of its speed is raised when the prediction accuracy is high and lowered when the prediction accuracy is low. Lowering the speed when the prediction performance is poor allows the automobile to be controlled safely. When the prediction performance is poor, the driver may also be asked to take control of the automobile, or a management center may be asked via a network to control the automobile.

In the preceding modification, the control values were changed based on the degree of agreement between the predicted image and the input image. The prediction parameters may also be adjusted dynamically based on that degree of agreement. The prediction parameters here are the prediction time Δt in the predicted time t + Δt, the prediction count N (the maximum value of n), which indicates how far ahead predictions are made among the predicted times t + nΔτ when predictions are made for multiple times, and the prediction interval Δτ. As a specific example of adjustment, when the calculation unit 140 judges the prediction accuracy to be low, it may shorten the prediction time Δt, reduce the prediction count N, or reduce the prediction interval Δτ. In general, prediction accuracy decreases as the prediction period becomes longer, but this adjustment makes it possible to maintain the accuracy of the values predicted by the prediction unit 120. To reflect the prediction time Δt and the prediction interval Δτ in the predictor, an input layer that receives one item of time information in addition to the input image is added to the predictor. With this network configuration, the input image at time t and an arbitrary value of Δt are input, and the predictor is trained so that its predicted value matches the image at time t + Δt. An ROI (region of interest) representing the region in which prediction is performed may also be used as a prediction parameter. In other words, prediction may be performed for only part of the input image. In this case, the input image and the predicted image are each divided into predetermined small regions, and the calculation unit 140 calculates the prediction accuracy for each small region. Only the small regions whose prediction accuracy is equal to or greater than a predetermined threshold are then used for prediction at subsequent times.
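The per-region selection can be sketched as follows. The sketch assumes images given as two-dimensional lists of luminance values, square blocks of a hypothetical size `block`, and mean absolute luminance error as the (inverse) measure of prediction accuracy. The source does not define the accuracy measure, so "accuracy at or above a threshold" is rendered here as "error at or below a threshold".

```python
def reliable_blocks(predicted, observed, block, max_error):
    """Return the top-left (row, col) of each block predicted well enough.

    A block is kept when its mean absolute luminance error is at most
    max_error; only kept blocks would be used for later predictions.
    """
    rows, cols = len(predicted), len(predicted[0])
    keep = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            errors = [
                abs(predicted[r][c] - observed[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            if sum(errors) / len(errors) <= max_error:
                keep.append((r0, c0))
    return keep


pred = [[0, 0, 0, 0], [0, 0, 0, 0]]
obs = [[0, 0, 9, 9], [0, 0, 9, 9]]
print(reliable_blocks(pred, obs, block=2, max_error=1.0))
```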

In the preceding modification, the degree of agreement between the predicted image and the input image was used to adjust the prediction parameters. However, there is no particular restriction as long as the prediction parameters are adjusted based on the input to the information processing apparatus 10, the values predicted by the prediction unit 120, the values calculated by the calculation unit 140, the control values of the control unit 150, or the state of the automobile on which the information processing apparatus 1 is mounted. Specifically, the larger the squared error between the predicted position and orientation at time t + Δt, which the calculation unit 140 calculates from the predicted image and predicted geometric information, and the position and orientation calculated from the input image captured by the imaging device 11 near time t + Δt, the smaller the prediction time Δt, the prediction interval Δτ, and the prediction count N are made.

The prediction parameters may also be adjusted based on the state of the automobile. As an example, a configuration is described here in which the prediction parameters are adjusted according to the braking distance of the automobile measured in advance. For an automobile with a long braking distance, it takes time from issuing a control command until the automobile completes the desired motion, so it is desirable to control the automobile using predictions further ahead and thereby shorten the time until the desired motion is performed. To this end, the heavier the automobile, the larger the prediction time Δt, the prediction count N, and the prediction interval Δτ are made. Besides the braking distance, each parameter may be increased as the weight becomes larger, the tire contact area becomes smaller, the speed of the automobile becomes higher, or the number of occupants becomes larger. The prediction range, that is, the region of the image for which prediction is performed, may also be adjusted according to the speed and steering of the automobile. Specifically, the prediction range is made smaller when the automobile is fast and larger when it is slow. Adjusting the prediction range in this way keeps the time required for prediction constant even when the prediction interval or prediction time becomes longer or the prediction count increases.

In the first embodiment, the map held by the holding unit 130 was a group of key frames, each associating an image of the scene captured in advance with the position and orientation at that time. However, the map is not limited to this configuration and may be anything that allows the position and orientation to be calculated relative to the predicted image or predicted geometric information predicted by the prediction unit 120. Specifically, a three-dimensional point cloud acquired by LiDAR may be projected onto images captured by an omnidirectional camera to find corresponding points, and a four-dimensional point cloud, in which the luminance of the corresponding image point is added to each three-dimensional point as a fourth parameter, may be held as the map. In this configuration, the four-dimensional point cloud is projected onto the predicted image, and the viewpoint position of the predicted image can be calculated so as to reduce the luminance difference between the predicted image and the projected points. Alternatively, only the three-dimensional point cloud acquired by LiDAR may be held as a 3D map. In this case, the three-dimensional point cloud may be projected onto the depth map that constitutes the predicted geometric information, and the position and orientation may be calculated with the ICP algorithm so as to minimize the distance between the three-dimensional point cloud and the predicted depth map. In this way, the information processing apparatus can also be applied to configurations in which the holding unit 130 holds a different kind of map.

In the first embodiment, the holding unit 130 held a map created in advance. However, instead of using a map created in advance when the calculation unit 140 calculates the position and orientation, a SLAM configuration may be adopted in which the position and orientation are estimated while a map is created based on the predicted position and orientation. CNN-SLAM can be used as the method of generating the map. CNN-SLAM is disclosed in detail in Non-Patent Document 1, which can be used here. Whereas CNN-SLAM uses the input image captured by the imaging device 11, a predicted image may be used in place of the input image when it is applied to this configuration. This makes it possible to estimate the position of the automobile and control it while performing prediction, even in places for which no map has been created in advance.

In the first embodiment, the holding unit 130 held a map. However, the holding unit 130 need not hold a map. Specifically, the calculation unit 140 calculates the change in position and orientation from time t'' to time t + Δt using the predicted image that the prediction unit 120 predicted from the input image at time t'', one time step before time t, the position and orientation that the calculation unit 140 calculated for it, and the predicted image for time t + Δt that the prediction unit 120 predicted from the input image at time t. This processing is performed each time the input unit 110 inputs an image, and the position and orientation of the imaging device 11 can be calculated by accumulating the changes in position and orientation calculated at each time.
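Accumulating the per-step changes amounts to composing rigid transforms. The sketch below makes a simplifying assumption that is not in the source: poses are planar, written as (x, y, heading in radians), and each change is expressed in the frame of the previous pose. The embodiment itself deals with full three-dimensional position and orientation, and the function names are hypothetical.

```python
import math


def compose(pose, delta):
    """Apply a relative motion `delta`, given in the frame of `pose`."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        theta + dtheta,
    )


def integrate(initial_pose, deltas):
    """Accumulate a sequence of relative motions into an absolute pose."""
    pose = initial_pose
    for delta in deltas:
        pose = compose(pose, delta)
    return pose


# Move 1 unit forward while turning 90 degrees left, then 1 unit forward again.
print(integrate((0.0, 0.0, 0.0), [(1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)]))
```

Because every step is built on the previous result, errors in the individual changes accumulate over time, which is the usual trade-off of a map-free configuration of this kind.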

In the first embodiment, the learning model internally retains the history of input images, so prediction can be performed in accordance with the speed of the automobile. However, a plurality of learning models corresponding to different vehicle speeds may be prepared and switched according to the vehicle speed. In that case, the holding unit 130 holds a plurality of learning models. The number of learning models held is desirably as many as the vehicle speeds to be handled. For example, learning models are prepared at 10 km/h intervals (10 km/h, 20 km/h, and so on), and the learning model whose speed is closest to the current speed of the automobile 1 is selected for use. The vehicle speed is acquired from the control device 12.
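Selecting the learning model nearest to the current vehicle speed can be sketched as a dictionary lookup. The names below are hypothetical, and strings stand in for the trained models that the holding unit 130 would hold.

```python
def select_model(models_by_speed, vehicle_speed):
    """Return the model trained for the speed closest to vehicle_speed.

    models_by_speed maps a training speed in km/h to its learning model.
    """
    nearest = min(models_by_speed, key=lambda speed: abs(speed - vehicle_speed))
    return models_by_speed[nearest]


# Placeholder models prepared at 10 km/h intervals.
models = {10: "model_10kmh", 20: "model_20kmh", 30: "model_30kmh"}
print(select_model(models, 23))
```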

In the first embodiment, the automobile was controlled based on the position and orientation calculated by the calculation unit 140. However, the object to be controlled is not limited to an automobile and may be any device that moves autonomously. For example, the information processing apparatus 10 of the embodiment may be applied to autonomous mobile robots, including AGVs (automated guided vehicles) that transport goods in factories, warehouses, and shops, and to AMRs (autonomous mobile robots).

The apparatus is not limited to automated driving and may also be used to assist a person's driving operations. Specifically, sounding an alert when the control value calculated at high speed through prediction diverges from the person's operation draws the person's attention quickly. In such a case, the person's operation may also be regarded as erroneous and the automobile may be controlled with the calculated control value.

Furthermore, the information processing apparatus may be used as an apparatus that calculates position and orientation. Specifically, the method of the present invention may be applied to the alignment of real space and virtual objects in a mixed reality system, that is, to measuring the position and orientation of the imaging device 11 in real space for use in rendering virtual objects. In this configuration, the control unit 150 and the control device 12 in FIG. 1 are not needed. A CG image of the virtual object, rendered based on the position and orientation that the calculation unit 140 calculated from the predicted image and predicted geometric information, is superimposed and presented to the user through the display of a mobile terminal or head-mounted display. This configuration reduces the delay before the image is presented that has conventionally been caused by the computation time for position and orientation calculation and CG rendering.

[Second Embodiment]
In the first embodiment, the position and orientation of the imaging device 11 were calculated, and the automobile 1 was controlled, using the predicted image, the predicted geometric information, and a map created from past input images that serves as the reference for calculating position and orientation. The second embodiment describes an example in which the map is created not only from past input images but also from predicted images and predicted geometric information, and the position and orientation of the imaging device are calculated using the created map.

Even without a map created in advance, the position and orientation of the imaging device 11 can be calculated by using Visual SLAM (Simultaneous Localization and Mapping), which estimates the device's own position while creating a map from input images. In Visual SLAM, an input image and the position and orientation calculated from it are associated and held as a key frame. Conventional Visual SLAM creates key frames only at positions passed in the past (positions at which input images were captured). In the second embodiment, key frames are therefore not created only at points passed in the past. Instead, the predicted image and predicted geometric information are used to generate key frames near points to which the automobile is predicted to travel. The created key frames are then updated using input images to improve their accuracy. The second embodiment describes how key frames are generated and updated, and how the position and orientation of the automobile are calculated and the automobile is controlled using them.

The configuration of the apparatus in the second embodiment is the same as that shown in FIG. 1 for the information processing apparatus 1 of the first embodiment, so its description is omitted. In the second embodiment, the calculation unit 140 calculates a key frame for a time t′ later than the time t at which the imaging device 11 captured the input image.

The calculation unit 140 also updates the key frames held by the holding unit 130. The second embodiment differs from the first in that created or updated key frames are output to the holding unit 130 and held there. In the second embodiment, a key frame is a data structure that associates an RGB image (either an image of the scene captured by the imaging device 11 or a predicted image produced by the prediction unit 120), a depth map as geometric information, and the position and orientation of the viewpoint. The set of key frames is held in the holding unit 130 as the map.
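The key frame and map described above can be pictured as a small data structure. The following is a minimal sketch, not the patent's implementation: the class names `KeyFrame` and `Map`, the `predicted` flag, and the 4x4 matrix pose representation are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Image = List[List[Tuple[int, int, int]]]  # H x W grid of RGB pixels
DepthMap = List[List[float]]              # H x W grid of depth values
Pose = Tuple[Tuple[float, ...], ...]      # 4x4 viewpoint pose matrix (assumed representation)


@dataclass
class KeyFrame:
    """Associates an RGB image, a depth map, and the viewpoint pose."""
    rgb: Image
    depth: DepthMap
    pose: Pose
    predicted: bool = False  # hypothetical flag: True if built from a predicted image


@dataclass
class Map:
    """The map is simply the set of key frames held by the holding unit."""
    keyframes: List[KeyFrame] = field(default_factory=list)

    def add(self, keyframe: KeyFrame) -> None:
        self.keyframes.append(keyframe)
```

Keeping a flag for predicted key frames is a design choice of this sketch only; it would make it easy to distinguish key frames built from captured images from those built ahead of the vehicle.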

The overall processing procedure in the second embodiment is the same as that shown in FIG. 5 for the information processing apparatus 1 of the first embodiment, so its description is omitted. The difference from the first embodiment is that, in step S150, the calculation unit 140 not only calculates the position and orientation but also calculates and updates a key frame for a time t′ later than time t.

FIG. 6 is a flowchart showing the details of the processing of step S150 performed by the calculation unit 140 of the information processing apparatus 10 in the second embodiment.

In step S2110, the calculation unit 140 calculates the position and orientation of the imaging device 11 using the predicted image from the prediction unit 120 and a key frame held by the holding unit 130. Specifically, the depth map of the key frame is first projected onto the predicted image. The position and orientation at time t + Δt are then calculated so as to minimize the luminance difference between each pixel of the predicted image and the corresponding pixel, associated by the projection, of the image held in the key frame.
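The quantity minimized in step S2110 can be sketched as a photometric cost. The code below only evaluates that cost for one candidate pose; the minimization itself (for example by an iterative least-squares solver) is left out. The function name `photometric_cost`, the pinhole intrinsics `fx, fy, cx, cy`, nearest-pixel lookup, and the use of grayscale intensities are assumptions of this sketch, not details given in the text.

```python
def photometric_cost(key_img, key_depth, pred_img, R, t, fx, fy, cx, cy):
    """Mean squared luminance difference after projecting the key frame's
    depth map into the predicted image with the candidate pose (R, t).

    key_img, pred_img: H x W grids of grayscale intensities.
    key_depth: H x W grid of depths (values <= 0 are treated as invalid).
    R: 3x3 rotation, t: 3-vector, mapping key-frame camera coordinates
    into the predicted camera's coordinates.
    """
    h, w = len(pred_img), len(pred_img[0])
    total, count = 0.0, 0
    for v, (img_row, depth_row) in enumerate(zip(key_img, key_depth)):
        for u, (intensity, d) in enumerate(zip(img_row, depth_row)):
            if d <= 0.0:
                continue
            # Back-project the pixel to a 3D point in the key-frame camera.
            p = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Transform the point into the predicted camera.
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0.0:
                continue
            # Project to the nearest pixel of the predicted image.
            u2 = round(fx * q[0] / q[2] + cx)
            v2 = round(fy * q[1] / q[2] + cy)
            if 0 <= u2 < w and 0 <= v2 < h:
                total += (pred_img[v2][u2] - intensity) ** 2
                count += 1
    return total / count if count else float("inf")
```

A pose search would call this function repeatedly and keep the pose with the lowest cost; a practical system would more likely use sub-pixel interpolation and analytic gradients.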

In step S2120, the calculation unit 140 calculates a key frame and outputs it to the holding unit 130. The calculation unit 140 first decides whether to add a key frame. Specifically, it selects the key frame whose squared distance to the position and orientation calculated in step S2110 is smallest, and decides to add a key frame if that squared distance is at least a predetermined distance and the viewing direction differs by at least a predetermined angle. When a key frame is to be added, the calculation unit 140 outputs the predicted image and predicted geometric information from the prediction unit 120, together with the position and orientation calculated in step S2110, to the holding unit 130 as a key frame, and the holding unit 130 holds it.
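The add-or-skip decision of step S2120 can be sketched as follows. This is an illustrative sketch under assumptions: each pose is reduced to a 3D position and a unit viewing-direction vector, key frames are plain dictionaries, and an empty map always triggers an addition. None of these representation choices comes from the text.

```python
import math


def should_add_keyframe(position, view_dir, keyframes, min_sq_dist, min_angle_rad):
    """Decide whether a new key frame should be added.

    position: (x, y, z) calculated in step S2110.
    view_dir: unit viewing-direction vector of the calculated pose.
    keyframes: list of dicts with "position" and "view_dir" entries.
    """
    if not keyframes:
        return True  # assumption: an empty map always accepts a key frame

    def sq_dist(kf):
        return sum((a - b) ** 2 for a, b in zip(position, kf["position"]))

    # Select the key frame with the smallest squared distance.
    nearest = min(keyframes, key=sq_dist)

    # Angle between the two viewing directions (clamped for numerical safety).
    cos_angle = sum(a * b for a, b in zip(view_dir, nearest["view_dir"]))
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))

    # Both conditions must hold: far enough away AND looking elsewhere.
    return sq_dist(nearest) >= min_sq_dist and angle >= min_angle_rad
```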

In step S2130, the calculation unit 140 updates the depth values of the depth map held in a key frame, based on the predicted geometric information and the position and orientation calculated in step S2110. The calculation unit 140 first selects, from the map held by the holding unit 130, a key frame whose distance and viewing-direction difference from the position and orientation calculated in step S2110 are below predetermined thresholds. It then projects the depth map that constitutes the predicted geometric information onto the selected key frame. The depth map held in the key frame is then updated by time-series filtering (smoothing) with the projected predicted depth map. In the time-series filtering, the ICP algorithm is first used to find correspondences between three-dimensional points, computed from the two depth maps, that are regarded as the same point. Next, the weighted average position of each pair of corresponding points is calculated. The depth map of the key frame in the holding unit 130 is then updated, using the weighted average position as the depth value of the corresponding pixel.
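The weighted-average part of this update can be sketched per pixel. The sketch below skips the ICP correspondence search entirely and assumes the predicted depth map has already been projected into the key frame so that pixels at the same index correspond. The per-pixel running weight and the function name `fuse_depth` are also assumptions of this example.

```python
def fuse_depth(kf_depth, kf_weight, pred_depth, pred_weight=1.0):
    """Fuse a projected predicted depth map into a key frame's depth map.

    kf_depth, kf_weight: H x W grids of current depths and accumulated weights.
    pred_depth: H x W grid of projected predicted depths (<= 0 means invalid).
    pred_weight: positive weight given to the new prediction.
    Returns the updated (depth, weight) grids.
    """
    fused_depth, fused_weight = [], []
    for d_row, w_row, p_row in zip(kf_depth, kf_weight, pred_depth):
        out_d, out_w = [], []
        for d, w, p in zip(d_row, w_row, p_row):
            if p <= 0.0:
                # No valid prediction for this pixel: keep the old value.
                out_d.append(d)
                out_w.append(w)
            else:
                total = w + pred_weight
                out_d.append((w * d + pred_weight * p) / total)
                out_w.append(total)
        fused_depth.append(out_d)
        fused_weight.append(out_w)
    return fused_depth, fused_weight
```

Accumulating the weight means that a depth value confirmed by many predictions moves less with each new one, which is the smoothing behavior the text describes.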

＜Effect＞
As described above, in the second embodiment key frames are created using the predicted image and predicted geometric information. The created key frames are then updated over time using predicted images and predicted geometric information, and the position and orientation are calculated using the resulting map. With this configuration, key frames, which conventional Visual SLAM could create only at points the imaging device 11 had already passed, can be created near points it will pass in the future. Updating these pre-created key frames over time yields an accurate depth map as geometric information. Using the accurate map created in this way, the position and orientation can be calculated with high accuracy.

＜Modifications＞
In the second embodiment, each time a predicted image was produced, the calculation unit 140 updated, by time-series filtering, the depth values of the depth maps of the key frames in the map held by the holding unit 130. However, the map update method is not limited to this, as long as it can improve the accuracy of the map. For example, the method of Non-Patent Document 1 can be used, in which the depth map of a key frame is updated by time-series filtering with depth values calculated by motion stereo from input images or predicted images. Multiple key frames created in the past may also be integrated and updated. Integration here means first calculating the positions and orientations of the key frames by pose graph optimization so that they are mutually consistent, and then using the resulting positions and orientations to smooth the depth maps of the key frames. The smoothing can use the same method as the time-series filtering described above. Updating the map with input images and predicted images from multiple times in this way produces a highly accurate map and improves the accuracy of the position and orientation calculation.

In the second embodiment, the position and orientation were calculated from a map generated from the values predicted by the prediction unit 120. The control values of the vehicle may also be changed based on that map. Specifically, the calculation unit 140 first calculates the three-dimensional shape of the scene from the depth map of a predicted key frame. If a three-dimensional structure exists in the vehicle's direction of travel, the control device 12 calculates a control value that reduces the vehicle's speed. Alternatively, the three-dimensional shape may be calculated from the predicted map, the degree of unevenness of the road surface calculated from it, and a control value calculated so that the vehicle avoids positions where large unevenness is predicted. Controlling the vehicle with a map built from predicted values in this way allows control to begin before the sensor actually measures the scene, which reduces sudden braking.

Also in the second embodiment, when a map created in advance is held as described in the first embodiment, whether to calculate the position and orientation using that map can be switched based on the difference between a predicted key frame and the map created in advance. The difference here is a value, calculated by the calculation unit 140, that represents the degree of agreement between the predicted key frame and the map. Specifically, the map created in advance is projected onto the depth map of the predicted key frame, and the mean, median, maximum, or minimum of the depth differences can be used, or the same statistics calculated for each small region after dividing the depth map into small regions. If the difference exceeds a predetermined threshold, the scene is judged to have changed since the map was created, and the position and orientation are calculated by the method of the second embodiment. If the difference is at or below the threshold, the position and orientation are calculated using the map created in advance, as in the first embodiment. Deciding whether to use the map created in advance based on its agreement with the predicted map allows the position and orientation to be calculated stably and accurately even when, for example, construction work has changed the scenery since the map was created.
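The switching rule can be sketched as a few depth-difference statistics and a threshold test. This sketch assumes the prior map has already been projected into the predicted key frame as a depth map of the same size, treats non-positive depths as missing, and computes whole-image statistics only (the per-region variant is omitted). The function names and the fallback when no pixels overlap are assumptions.

```python
from statistics import mean, median


def depth_difference_stats(prior_depth, predicted_depth):
    """Statistics of absolute depth differences over pixels valid in both maps."""
    diffs = [
        abs(a - b)
        for row_a, row_b in zip(prior_depth, predicted_depth)
        for a, b in zip(row_a, row_b)
        if a > 0.0 and b > 0.0
    ]
    if not diffs:
        return None
    return {"mean": mean(diffs), "median": median(diffs),
            "max": max(diffs), "min": min(diffs)}


def use_prior_map(prior_depth, predicted_depth, threshold, statistic="median"):
    """True if the prior map still agrees with the prediction.

    A difference above the threshold means the scene is judged to have
    changed, so the prior map is not used.
    """
    stats = depth_difference_stats(prior_depth, predicted_depth)
    if stats is None:
        return False  # assumption: no overlap means the prior map cannot be trusted
    return stats[statistic] <= threshold
```

Which statistic to threshold is a tuning choice: the median tolerates a few changed pixels, while the maximum reacts to any single large change.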

［Third Embodiment］
The first embodiment described a method of calculating the position and orientation of the imaging device 11 from the predicted image and predicted geometric information, based on a map created in advance, and controlling the vehicle. The second embodiment described a method of generating and updating the map using the predicted image and predicted geometric information. The third embodiment describes a method of using the predicted image and predicted geometric information to calculate object information about the objects that appear in them, namely the object types and the positions and movement amounts of those objects, and controlling the vehicle based on that information.

In vehicle control, it is important not only to estimate the vehicle's own position but also to predict the positions, orientations, and movement amounts of surrounding objects. As one concrete example, suppose that a bicycle is traveling straight ahead slightly to the front right of the moving vehicle, and that a car is parked ahead of the bicycle. The bicycle can be predicted to swerve to the left to avoid the parked car and collide with the vehicle, so the collision can be avoided by reducing the vehicle's speed in advance. As another concrete example, suppose that an indistinct object darts out from in front of a parked car. The indistinct object may be a person, and a collision is anticipated, so the collision can be avoided by reducing the vehicle's speed as soon as the collision is anticipated. The third embodiment describes a method of predicting the surrounding situation, the types of surrounding objects, and their future positions and movement amounts in this way, and controlling the vehicle so that it does not collide with obstacles.

The configuration of the apparatus in the third embodiment is the same as that shown in FIG. 1 for the information processing apparatus 1 of the first embodiment, so its description is omitted. In the third embodiment, the calculation unit 140 calculates object types and their positions as object information for a time t′ later than the time t at which the imaging device 11 captured the input image. It also calculates collision possibility information representing the probability of collision with each object, and outputs that information to the control unit 150. The control unit 150 calculates a control value for the vehicle based on the collision possibility information calculated by the calculation unit 140 and outputs it to the control device 12.

FIG. 7 is a flowchart of the information processing apparatus in the third embodiment. It differs from the first embodiment in that the object information calculation of step S310 replaces the position and orientation calculation of step S150.

In step S310, the calculation unit 140 calculates the types, positions, and movement amounts of the objects contained in the predicted image and predicted geometric information produced by the prediction unit 120. Based on these, it calculates the following as collision possibility information between the vehicle and the objects: a collision probability value for each object (a real number from 0 to 1, where values closer to 1 indicate a higher probability of collision), the predicted collision time, and the predicted position of the object at the predicted collision time. The details of step S310 are described later.

In step S160, the control unit 150 calculates a control value for the vehicle so as to avoid a collision, based on the collision possibility information that the calculation unit 140 calculated in step S310. Specifically, when there is an object whose collision possibility is at or above a predetermined threshold, the brake is controlled to reduce speed and the steering is controlled to avoid the predicted position of the object at the predicted collision time. In this embodiment, the prediction unit 120 predicts images and geometric information for multiple times (t + nΔt, n = 1, …, N).

FIG. 8 is a flowchart showing the details of the object information calculation of step S310 of the information processing apparatus in the third embodiment. The processing consists of step S2110, which detects the object types from the predicted image; step S2120, which calculates the position of each object; and step S2130, which calculates the possibility of collision between those objects and the vehicle.

In step S2110, the calculation unit 140 detects object types from the input image. In this embodiment, object types are detected with Mask R-CNN, a deep neural network that calculates object candidate regions from the image, estimates the object type of each candidate region, and calculates an object label for each pixel of the image. Mask R-CNN is described in detail in Non-Patent Document 5, which can be referred to. However, the method is not limited to this, as long as it can detect object types from an image.

In step S2120, the calculation unit 140 finds the pixel correspondences, that is, which pixels represent the same point, between the input image and the predicted image for each time t + nΔt produced by the prediction unit 120, and thereby calculates where in the predicted image each object detected in step S2110 is located. Specifically, it calculates the optical flow between the input image and the predicted image to find which pixel of the predicted image each pixel of the input image corresponds to. Next, the calculation unit 140 assigns the object type label of each pixel, calculated in step S2110, to the corresponding pixel of the predicted image. The same label is then assigned to the pixel of the predicted geometric information that coincides with each pixel of the predicted image. As a result, an object type is assigned to each three-dimensional point of the predicted geometric information. From this, the predicted position is derived: three-dimensional information indicating where in three-dimensional space each object is located at each predicted time t + nΔt.
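The label transfer in step S2120 can be sketched as a forward warp of a label image along a dense flow field. This sketch assumes the optical flow has already been computed by some other component and is given as per-pixel `(dx, dy)` displacements from the input image to the predicted image. The function name, the background label value 0, and nearest-pixel rounding are assumptions.

```python
def propagate_labels(labels, flow, background=0):
    """Move per-pixel object labels from the input image into the predicted image.

    labels: H x W grid of integer object-type labels for the input image.
    flow: H x W grid of (dx, dy) displacements to the predicted image.
    Returns an H x W label grid for the predicted image.
    """
    h, w = len(labels), len(labels[0])
    out = [[background] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            label = labels[y][x]
            if label == background:
                continue
            dx, dy = flow[y][x]
            nx, ny = x + round(dx), y + round(dy)
            # Labels that leave the image are dropped.
            if 0 <= nx < w and 0 <= ny < h:
                out[ny][nx] = label
    return out
```

Because the predicted depth map shares the predicted image's pixel grid, the returned label grid can be read at the same indices to label the three-dimensional points of the predicted geometric information.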

In step S2130, the calculation unit 140 calculates, from the predicted position of each object, the possibility that the automobile collides with the object at each time t + nΔt. The collision possibility is a value that is large when three-dimensional points of the predicted geometric information to which object type labels were assigned in step S2120 lie in a predetermined region in front of the automobile, and that becomes smaller the farther they are from that region. It is calculated as follows. First, the space the automobile travels through before it stops, which can be obtained from its braking distance, is taken as the braking distance space, and the minimum distance d between the three-dimensional points of each class and the braking distance space is calculated. Next, e^(−d), where e is the base of the natural logarithm, is taken as the collision possibility value (a value between 0 and 1). A memory area that stores a real number from 0 to 1 is allocated in advance for each object detected in step S2110, and the collision possibility value is stored there.
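The collision possibility value can be sketched directly from this definition. The sketch assumes the braking distance space is approximated by an axis-aligned box in the vehicle's coordinate frame; the text does not specify its shape, and the function names are hypothetical.

```python
import math


def distance_to_box(point, box_min, box_max):
    """Euclidean distance from a 3D point to an axis-aligned box (0 if inside)."""
    return math.sqrt(sum(
        max(lo - p, 0.0, p - hi) ** 2
        for p, lo, hi in zip(point, box_min, box_max)
    ))


def collision_possibility(points, box_min, box_max):
    """exp(-d), where d is the minimum distance between an object's
    3D points and the braking distance space.

    Returns 1.0 when any point lies inside the space and decays toward 0
    as the object moves away from it.
    """
    d = min(distance_to_box(p, box_min, box_max) for p in points)
    return math.exp(-d)
```

For example, with a box spanning `(-1, 0, 0)` to `(1, 2, 10)`, an object with a point at `(0, 1, 5)` yields a possibility of 1.0, while an object whose nearest point is one unit outside the box yields about 0.37.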

＜Effect＞
As described above, in the third embodiment the predicted image and predicted geometric information are used to calculate information about surrounding objects, namely the type and predicted position of each object, and the possibility of a future collision between the vehicle and each object. If that possibility is high, the vehicle is controlled so that it does not collide with the object. In this way, object positions can be calculated before the corresponding input image is actually acquired. Calculating the collision possibility from the predicted object positions and controlling the vehicle to avoid collisions based on it allows the vehicle to be controlled safely.

＜Modifications＞
In the third embodiment, the predicted position of an object was calculated from the optical flow between the input image and the predicted image produced by the prediction unit 120. However, the prediction unit 120 may instead use a learning model to calculate the movement amount of objects as geometric information. In this case, given the input image at time t, the learning model outputs for each pixel a two-dimensional movement amount (x, y) indicating where that pixel appears at time t + Δt. The model can be trained to minimize the error between its output for a time series of images (t = 0, …, m) and the optical flow between the images at t = m and t = m + 1. For example, the ConvLSTM described in the first embodiment can be used as this learning model. The learning model may also be configured to output, for each pixel, a movement amount (x, y, z) in three-dimensional space rather than a two-dimensional movement amount. Specifically, it can be trained to minimize the error between its output for a time series of images (t = 0, …, m) and a three-dimensional optical flow that stores, for each pixel, the three components of the three-dimensional vector relating corresponding pixels of the depth maps at t = m and t = m + 1. For example, a configuration in which the ConvLSTM and the FCN described in the first embodiment are connected can be used as the learning model in this case. When the prediction unit 120 predicts a two-dimensional or three-dimensional optical flow as predicted geometric information with a learning model in this way, the calculation unit 140 does not need to calculate the optical flow, so object positions and the collision possibility information can be calculated quickly.

In the third embodiment, object types were detected and the collision possibility was calculated for each object type. However, as long as the collision possibility can be calculated, the object types need not be calculated explicitly. In that case, the distance between each three-dimensional point and the braking distance space is calculated from the predicted geometric information, and the collision possibility is calculated for each group of three-dimensional points. Likewise, when the prediction unit 120 predicts optical flow as geometric information as in the modification above, the calculation unit 140 may calculate the collision possibility without calculating object types. With this configuration, collisions can be avoided even when the object type is not clearly known, and the vehicle can be controlled safely.

In the third embodiment, the prediction unit 120 produced the predicted image and predicted geometric information from the input image captured by the imaging device 11 at time t. However, prediction results can also be integrated over time using predicted images and predicted geometric information from earlier predictions. Specifically, the depth map of a key frame in the map held by the holding unit 130 is integrated, by time-series filtering, with the depth map that constitutes the predicted geometric information for time t + Δt predicted from the input image at time t. In addition, the reciprocal of the variance of the position of each three-dimensional point integrated by the time-series filtering can be used as a reliability, and the collision possibility can be calculated taking that reliability into account. Specifically, the product of the reliability of each three-dimensional point and the collision possibility value of the third embodiment is used as a reliability-weighted collision possibility value. The control unit 150 then controls the vehicle to avoid a collision with the object when the reliability-weighted collision possibility value is at or above a predetermined threshold.
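The reliability weighting can be sketched in a few lines. For brevity, this sketch reduces the position of a three-dimensional point to a scalar depth observed over several time steps, and adds a small `eps` to the variance to avoid division by zero. Both simplifications, as well as the function names, are assumptions not stated in the text.

```python
from statistics import pvariance


def reliability(depth_samples, eps=1e-6):
    """Reciprocal of the variance of a point's integrated depth samples."""
    return 1.0 / (pvariance(depth_samples) + eps)


def weighted_collision_possibility(depth_samples, possibility, eps=1e-6):
    """Product of the point's reliability and its collision possibility."""
    return reliability(depth_samples, eps) * possibility


def should_avoid(depth_samples, possibility, threshold, eps=1e-6):
    """True when the reliability-weighted value reaches the threshold."""
    return weighted_collision_possibility(depth_samples, possibility, eps) >= threshold
```

Note that a reciprocal variance is unbounded, so the weighted value no longer lies in the range 0 to 1; the threshold has to be chosen on that scale.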

In the third embodiment, the predicted position of an object was estimated from the prediction of the prediction unit 120. The predicted position can be calculated with even higher accuracy by integrating the prediction result with a three-dimensional point cloud of the surroundings obtained by SLAM. Specifically, a three-dimensional point cloud of the scene is first calculated by SLAM from past input images. Any SLAM method that can obtain a three-dimensional point cloud from input images may be used; for example, the method of Non-Patent Document 1 can be used. Next, objects are detected in the input image and the three-dimensional point cloud is segmented by object. The predicted geometric information for time t + Δt is then matched against the per-object point clouds with the ICP algorithm. In this way, the location at time t + Δt of each object obtained by SLAM is calculated. With this configuration, the global movement amount of an object can be calculated even when the prediction accuracy is low, which improves the accuracy of object position calculation.

In the third embodiment, the calculation unit 140 predicted object positions by first detecting objects in the input image and then projecting them onto the predicted image. However, object detection need not be performed on the input image, as long as it is possible to calculate where the objects are in the predicted image or predicted geometric information. For example, object detection may be performed on the predicted image.

Object detection may also be performed on the predicted images for multiple times, and the object type determined by taking into account how the detection likelihood changes across those predicted images. Specifically, object detection is first performed on the predicted image for each time, and the change of each object's likelihood over time is then calculated. The likelihood here is the probability value from 0 to 1, output by the object detector, that expresses how likely the object is to belong to each object type.

In position and orientation calculation, accuracy improves when moving objects and the like are excluded from the calculation as outliers. The movement amount of each object may therefore be calculated from the prediction result, and the position and orientation calculated using only objects that are not moving. Specifically, the optical flow between the prediction result and the input image is calculated to determine whether each object is moving. The position and orientation are then calculated using only the three-dimensional points on objects judged to be stationary. Conventionally, outliers were removed with RANSAC, M-estimation, or the like while the position and orientation were being calculated. With this configuration, three-dimensional points on moving objects can be identified in advance, so the position and orientation can be calculated with high accuracy.

In the first to third embodiments, the user could not check information such as the prediction result of the prediction unit 120, the position and orientation calculated from it, and the positions of objects. Therefore, in this modification such information is presented as display information on a liquid crystal display or a HUD (Head-Up Display). FIG. 9 is a view showing an example of what the display device shows in this modification. In this modification, the display information is presented on a liquid crystal screen. The processing related to display is performed by the control unit 150 (CPU 311), and the display information is displayed on the display unit 316.

The GUI 100 is a window for presenting display information. G120 is a window for presenting the control information of the first embodiment and the predicted object positions and collision possibility information of the third embodiment. G140 is a window for displaying the predicted images, the predicted geometric information, and the position and orientation calculation result of the first embodiment.

G121 and G123 are examples in which the predicted positions of objects calculated by the calculation unit 140 are superimposed on the input image. The color, frame thickness, or shape of G121 and G123 may be changed based on the degree of coincidence between the actual measured value and the predicted value. G122 and G124 are examples in which the collision possibility value for each object is displayed. Here, G121 is an oncoming vehicle for which the collision possibility is determined to be low. G123 is a bicycle that is predicted to swerve into the roadway to avoid a car parked ahead of it, so its collision possibility value has increased. G125 is an example of an alert presented when the collision possibility exceeds a predetermined threshold. Similarly, the color, frame thickness, or shape of G121 and G123 may be changed according to the collision possibility value. The collision possibility value may also be visualized as a heat map.

G126 is an example in which control values are presented: the current speed and steering angle of the automobile are shown on the left side of the arrow, and the control values as changed in response to the increase in collision possibility are shown on the right side of the arrow. To present the control direction to people outside the automobile, it may also be shown as an arrow on a display mounted on the front of the automobile. G127 presents the predicted position and orientation described in the first embodiment. G128 presents the predicted position and orientation when the control values are changed in response to the increase in collision possibility. G129 visualizes the map created from the prediction result as described in the second embodiment. Specifically, the key frames in the map created by prediction are projected onto and superimposed on the input image.

In G140, the bottom image is the input image, and each image above it is a predicted image that looks further into the future. G141 is the time of each predicted image, and G142 presents the predicted position and orientation at that time. Although this modification illustrates a method of presenting predicted images, a three-dimensional point group calculated from the depth map may instead be presented as the predicted geometric information. In that case, the size and color of the points may be changed based on the degree of coincidence between the actual measured value and the predicted value, or based on the collision possibility value.
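As one illustration of changing the frame appearance and raising an alert according to the collision possibility value, the following sketch maps a value to a color, a frame thickness, and an alert flag. The value range of 0 to 1, the green-to-red color ramp, the thickness range, the default threshold of 0.7, and the names `FrameStyle` and `frame_style` are all assumptions; the text only says that these attributes may be changed and that an alert is presented above a predetermined threshold.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FrameStyle:
    color: Tuple[int, int, int]  # (R, G, B), each 0 to 255
    thickness: int               # frame line width in pixels
    alert: bool                  # True when an alert should be shown


def frame_style(collision_possibility: float, alert_threshold: float = 0.7) -> FrameStyle:
    """Map a collision possibility value to a display style (assumed mapping)."""
    p = min(max(collision_possibility, 0.0), 1.0)  # clamp to the assumed 0..1 range
    color = (int(round(255 * p)), int(round(255 * (1.0 - p))), 0)  # green to red
    thickness = 1 + int(round(4 * p))                              # 1 to 5 pixels
    return FrameStyle(color=color, thickness=thickness, alert=p > alert_threshold)
```

The same mapping could equally be driven by the degree of coincidence between measured and predicted values instead of the collision possibility.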

<Summary of effects>
In the first embodiment, the prediction unit predicts a predicted image and predicted geometric information for a time at or after the time when the imaging device captured the input image, and the calculation unit calculates the position and orientation of the imaging device based on the prediction result. Calculating the position and orientation from the prediction result in this way avoids the delay caused by the time-consuming estimation of geometric information and calculation of position and orientation, so the position and orientation can be obtained at high speed.

In the second embodiment, the predicted image and predicted geometric information are used to create key frames of the map, which serves as an index for position and orientation calculation, near points that the imaging device is predicted to pass in the future. The created key frames are then updated in time series using the predicted images and predicted geometric information, and the position and orientation are calculated using the created map. In conventional Visual SLAM, key frames could be created only at points the imaging device had already passed. With this configuration, they can also be created at points the imaging device is predicted to pass in the future. Furthermore, updating the key frames created in advance in time series yields an accurate depth map as geometric information. Using the accurate map created in this way, the position and orientation can be calculated with high accuracy.

In the third embodiment, the predicted image and predicted geometric information are used to calculate surrounding object information, namely the type and the predicted position of each object, and to calculate the possibility of a future collision between the automobile and each object. If the possibility of a future collision is high, the automobile is controlled so that it does not collide with the object. In this way, object positions can be calculated before the corresponding input image is actually acquired. In addition, because the collision possibility between an object and the automobile is obtained from the predicted object position and the automobile is controlled to avoid a collision based on that possibility, the automobile can be controlled safely.

<Summary of definitions>
The image input unit 110 in the present invention may be of any type as long as it inputs an image, that is, visual information obtained by imaging real space. For example, it may input an image from a camera that captures grayscale (monochrome) images, or from a camera that captures RGB images. It may input an image from a camera capable of capturing depth information, distance images, or three-dimensional point group data. The camera may be a monocular camera, or the input may be images captured by a camera having two or more cameras or sensors. Furthermore, an image captured by the camera may be input directly or may be input via a network.

The prediction unit 120 predicts a predicted image or predicted geometric information for a time at or after the time when the imaging device captured the visual information. The visual information here is an image or a depth map captured by the imaging device. The predicted image is an image predicted by the prediction unit. The predicted geometric information is a depth map or an optical flow predicted by the prediction unit. The learning model that the prediction unit uses for prediction is a CNN (Convolutional Neural Network). This CNN has been trained so that, given an image at a certain time t, it can predict either or both of a predicted image and predicted geometric information at time t + Δt, that is, Δt later.
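The relationship between an input at time t and predictions at t + Δt can be sketched as a rollout loop. Here `model` is a hypothetical callable standing in for the trained CNN (no network is defined), and it is assumed to return a predicted image together with a predicted depth map. Feeding each predicted image back in to reach later times follows the claim that a third time may be predicted from the second-time prediction; the names `rollout`, `dt`, and `steps` are illustrative.

```python
from typing import Callable, List, Tuple

import numpy as np

# (predicted image, predicted depth map); the pairing is an assumed output format.
Prediction = Tuple[np.ndarray, np.ndarray]


def rollout(model: Callable[[np.ndarray], Prediction], image: np.ndarray,
            t: float, dt: float, steps: int) -> List[Tuple[float, np.ndarray, np.ndarray]]:
    """Predict images and depth maps at t + dt, t + 2*dt, ..., t + steps*dt.

    model is a stand-in for the learned predictor: given an image at some time,
    it returns the predicted image and depth map dt later.
    """
    results = []
    current = image
    for k in range(1, steps + 1):
        predicted_image, predicted_depth = model(current)
        results.append((t + k * dt, predicted_image, predicted_depth))
        current = predicted_image  # feed the prediction back in for the next step
    return results
```

Because each step consumes the previous prediction, errors can accumulate with the number of steps, which is one reason the number of predictions and the prediction interval are treated as adjustable parameters in the claims.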

The holding unit 130 holds the learning model described above. The holding unit 130 may further hold a map that serves as an index for position and orientation calculation. The map may be configured to hold a plurality of key frames, each of which associates an image of the scene with a position and orientation. The image of the scene is either or both of an image captured by the imaging device and a predicted image predicted by the prediction unit. The map may instead hold geometric information, or may hold both the image of the scene and the geometric information.
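One possible in-memory layout for such a map is sketched below. The field names, the 4x4 pose matrix convention, the `predicted` flag, and the nearest-key-frame lookup are assumptions added for illustration; the text only requires that each key frame associate a scene image (and optionally geometric information) with a position and orientation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class Keyframe:
    image: np.ndarray                   # captured image or predicted image of the scene
    pose: np.ndarray                    # assumed 4x4 matrix holding position and orientation
    depth: Optional[np.ndarray] = None  # optional geometric information (depth map)
    predicted: bool = False             # True if built from a predicted image


@dataclass
class Map:
    keyframes: List[Keyframe] = field(default_factory=list)

    def add(self, keyframe: Keyframe) -> None:
        self.keyframes.append(keyframe)

    def nearest(self, position: np.ndarray) -> Keyframe:
        """Return the key frame whose position is closest to the given 3D position.

        Raises ValueError if the map holds no key frames.
        """
        return min(self.keyframes,
                   key=lambda kf: float(np.linalg.norm(kf.pose[:3, 3] - position)))
```

A lookup such as `nearest` is one way a position and orientation calculation might choose which key frame to compare the current image against.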

The calculation unit 140 calculates the position and orientation of the imaging device using either or both of the predicted image and the predicted geometric information predicted by the prediction unit. The calculation unit may also use either or both of the predicted image and the predicted geometric information to calculate the map that serves as an index for position and orientation calculation. Furthermore, the calculation unit may calculate object information, namely the type or the movement amount of an object included in the predicted image or the predicted geometric information predicted by the prediction unit. The calculation unit may further calculate collision possibility information, which is the probability of a collision with an object.

The control unit 150 calculates a control value for controlling the device on which this information processing apparatus is mounted, based on one or more of the input image, the predicted image and predicted geometric information predicted by the prediction unit, and the position and orientation, map, object information, and collision possibility information calculated by the calculation unit.

(Other embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

DESCRIPTION OF SYMBOLS: 1 ... automobile, 10 ... information processing apparatus, 11 ... imaging device, 12 ... control device, 110 ... input unit, 120 ... prediction unit, 130 ... holding unit, 140 ... calculation unit, 150 ... control unit

Claims

An information processing apparatus comprising:
input means for inputting visual information captured by an imaging device at a first time;
holding means for holding a learning model for predicting, based on the visual information input by the input means, at least one of visual information and geometric information at a second time after the first time;
prediction means for predicting at least one of visual information and geometric information at the second time using the learning model; and
first calculation means for calculating the position and orientation of the imaging device at the second time based on at least one of the visual information and the geometric information at the second time predicted by the prediction means.

An information processing apparatus comprising:
input means for inputting visual information captured by an imaging device at a first time;
holding means for holding a learning model for predicting, based on the visual information input by the input means, at least one of visual information and geometric information at a second time after the first time;
prediction means for predicting at least one of visual information and geometric information at the second time using the learning model; and
second calculation means for calculating, based on at least one of the visual information and the geometric information predicted by the prediction means, object information included in at least one of the input visual information at the first time and the predicted visual information or geometric information at the second time.

The information processing apparatus according to claim 1 or 2, wherein the prediction means predicts visual information or geometric information at a third time after the second time based on at least one of the visual information or geometric information at the second time predicted by the prediction means.

The information processing apparatus according to claim 1, wherein the first calculation means calculates the position and orientation of the imaging device based on at least one of the predicted visual information or geometric information at the second time and visual information acquired by the imaging device at a time near the second time.

The information processing apparatus according to claim 2, wherein the second calculation means calculates the position and orientation of the imaging device based on at least one of the predicted visual information or geometric information at the second time and visual information acquired by the imaging device at a time near the second time.

The information processing apparatus according to claim 2, wherein the object information is an object type of an object included in at least one of the input visual information, the predicted visual information, or geometric information.

The information processing apparatus according to claim 2 or 6, wherein the object information is at least one of information indicating a position of an object included in at least one of the input visual information, the predicted visual information, and the geometric information, and information indicating an amount of movement of the object.

An information processing apparatus comprising:
input means for inputting visual information captured by an imaging device;
holding means for holding a learning model for predicting, based on the visual information input by the input means, at least one of visual information or geometric information at a second time after a first time related to the visual information;
prediction means for predicting at least one of visual information and geometric information at the second time using the learning model;
the first calculation means according to claim 1; and
the second calculation means according to claim 2.

The information processing apparatus according to any one of claims 1 or 8, further comprising control means for calculating, based on at least one of the visual information and the geometric information predicted by the prediction means, a control value for controlling a moving body equipped with the imaging device, and for performing the control.

The information processing apparatus according to claim 9, wherein the control means calculates the control value based on the degree of coincidence between at least one of the predicted visual information or geometric information at the second time and visual information further acquired by the imaging device at a time near the second time.

The information processing apparatus according to claim 9, wherein the first calculation means calculates the position and orientation of the imaging device at the second time, and the control means calculates a control value for controlling the moving body such that the position and orientation of the imaging device match the calculated position and orientation of the imaging device at the second time.

The information processing apparatus according to any one of claims 9 to 11, wherein the control means calculates a control value for controlling the moving body based on the object information calculated by the second calculation means.

The information processing apparatus according to any one of claims 1 to 12, wherein the prediction means adjusts a prediction parameter based on the degree of coincidence between at least one of the predicted visual information or geometric information at the second time and visual information further acquired by the imaging device at a time near the second time.

The information processing apparatus according to claim 1, wherein the prediction means adjusts a prediction parameter based on the position and orientation of the imaging device at the second time calculated by the first calculation means.

The information processing apparatus according to any one of claims 2, 5, 6, 7, and 8, wherein the prediction means adjusts a prediction parameter based on the object information calculated by the second calculation means.

The information processing apparatus according to any one of claims 9 to 12, further comprising measuring means for measuring a measurement value of at least one of the state of the moving body and the situation around the moving body, wherein the prediction means adjusts a prediction parameter based on the measurement value.

The information processing apparatus according to any one of claims 13 to 16, wherein the prediction parameter is at least one of the time to be predicted by the prediction means, the number of predictions, the prediction time interval, and the prediction range.

The information processing apparatus according to any one of claims 1, 3, 4, 8 to 12, 14, and 16, further comprising:
generation means for generating display information based on at least one of the visual information captured by the imaging device, the geometric information at the second time predicted by the prediction means, the visual information at the second time predicted by the prediction means, and the position and orientation calculated by the first calculation means; and
display means for presenting the display information generated by the generation means.

The information processing apparatus according to any one of claims 2, 5, 6, 7, 8 to 12, and 15, further comprising:
generation means for generating display information based on at least one of the visual information captured by the imaging device, the geometric information at the second time predicted by the prediction means, the visual information at the second time predicted by the prediction means, and the position and orientation calculated by the first calculation means; and
display means for presenting the display information generated by the generation means.

The information processing apparatus according to any one of claims 1 to 19, wherein the visual information includes any one of a grayscale image, an RGB image, depth information, a distance image, and an image obtained by imaging three-dimensional point group data.

The information processing apparatus according to any one of claims 1 to 20, wherein the geometric information is a depth map of depth information estimated for each pixel of the visual information.

A control method of an information processing apparatus, comprising:
an input step of inputting visual information captured by an imaging device at a first time;
a prediction step of predicting at least one of visual information or geometric information at a second time after the first time, using a learning model for predicting at least one of visual information or geometric information at the second time based on the visual information input in the input step; and
a first calculation step of calculating the position and orientation of the imaging device at the second time based on at least one of the visual information and the geometric information at the second time predicted in the prediction step.

A control method of an information processing apparatus, comprising:
an input step of inputting visual information captured by an imaging device at a first time;
a prediction step of predicting at least one of visual information or geometric information at a second time after the first time, using a learning model for predicting at least one of visual information or geometric information at the second time based on the visual information input in the input step; and
a second calculation step of calculating, based on at least one of the visual information and the geometric information predicted in the prediction step, object information included in at least one of the input visual information at the first time and the predicted visual information or geometric information at the second time.

A program that, when read and executed by a computer, causes the computer to perform each step of the control method according to claim 22 or 23.