JP6916091B2

JP6916091B2 - Position / orientation estimation system and position / orientation estimation device

Info

Publication number: JP6916091B2
Application number: JP2017217482A
Authority: JP
Inventors: Yusuke Sekikawa (関川 雄介)
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2017-11-10
Filing date: 2017-11-10
Publication date: 2021-08-11
Anticipated expiration: 2037-11-10
Also published as: JP2019091102A

Description

The present invention relates to a position/orientation estimation system and a position/orientation estimation device that estimate the position and orientation of a self-propelled body based on optical sensing data obtained by observation from that self-propelled body.

Conventionally, the Global Positioning System (hereinafter "GPS") has been used to measure the position and orientation of a self-propelled body (described below using a vehicle as an example). The vehicle is equipped with a GPS receiver, and by receiving signals from a plurality of GPS satellites with this receiver, the position of the host vehicle can be measured by a code positioning method or a carrier-phase positioning method.

However, in places such as tunnels where the GPS receiver cannot receive signals from GPS satellites, the position of the host vehicle cannot be measured by GPS. Methods other than GPS for measuring the position of the host vehicle include wheel odometry and visual odometry. Wheel odometry measures the trajectory of the host vehicle by integrating the direction and rotation count of its wheels, and thereby estimates the position and orientation of the host vehicle. Visual odometry estimates the trajectory of the host vehicle from a plurality of consecutive images taken by a camera fixed to the host vehicle, and thereby estimates the position and orientation of the host vehicle.

For visual odometry, model-based methods have long been studied, but in recent years learning-based methods using a deep neural network (hereinafter "DNN") have attracted attention (see, for example, Patent Document 1).

R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni. VINet: Visual-inertial odometry as a sequence-to-sequence learning problem. In AAAI, pages 3995-4001, 2017.

In visual odometry, accurately estimating the position and orientation of the host vehicle requires modeling long-term correlations using an image sensor with high temporal resolution. Event cameras are attracting attention as image sensors with such high temporal resolution.

However, although conventional DNNs can capture short-term correlations, capturing long-term correlations makes the processing load of the system (processing time and memory usage) excessive, which has not been practical.

Accordingly, an object of the present invention is to provide a position/orientation estimation system, a position/orientation estimation method, and a position/orientation estimation program that perform visual odometry using a CNN capable of modeling long-term correlations (Long Short-Term CNN, hereinafter "LSTCNN").

To perform visual odometry, the position/orientation estimation device of the present invention decomposes the spatiotemporal three-dimensional CNN applied to the image sensor data into a spatial two-dimensional CNN and a temporal one-dimensional CNN, and executes them separately. This lengthens the time range over which convolution can be performed without making the processing load excessive.
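As a rough illustration of why splitting a spatiotemporal convolution can reduce cost, the following sketch compares the number of kernel weights in a single full 3D convolution layer with that of a 2D spatial convolution followed by a 1D temporal convolution. The kernel sizes and channel counts are hypothetical values chosen for the example and do not come from the patent; the actual network applies stacks of such layers to different tensors.

```python
def conv3d_weights(k_space, k_time, c_in, c_out):
    """Kernel weights of one full 3D convolution (k_space x k_space x k_time)."""
    return k_space * k_space * k_time * c_in * c_out


def factorized_weights(k_space, k_time, c_in, c_mid, c_out):
    """Kernel weights of a 2D spatial conv followed by a 1D temporal conv."""
    spatial = k_space * k_space * c_in * c_mid   # 2D convolution over (u, v)
    temporal = k_time * c_mid * c_out            # 1D convolution over time
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical sizes: 3x3 spatial kernel, 9-step temporal kernel, 64 channels.
    full = conv3d_weights(3, 9, 64, 64)
    split = factorized_weights(3, 9, 64, 64, 64)
    print(full, split)
```

With these example sizes the factorized form needs fewer weights than the full 3D kernel, and the gap widens as the temporal kernel grows.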

A position/orientation estimation system according to one aspect of the present invention comprises: an optical sensing device that generates three-dimensional optical sensing data including two-dimensional position information and one-dimensional time information; and a position/orientation estimation device that estimates the position/orientation of the optical sensing device by visual odometry, based on the optical sensing data input in time series. The position/orientation estimation device comprises: a two-dimensional convolution unit consisting of a plurality of two-dimensional CNN modules, each of which receives the position information of one of a plurality of consecutive frames of the optical sensing data and outputs a feature; a one-dimensional convolution unit consisting of a one-dimensional CNN module that receives the plurality of features output from the respective two-dimensional CNN modules and outputs, as a local change, the change in position/orientation between adjacent frames; and an accumulation unit that accumulates the local changes of the plurality of frames to obtain an accumulated value and adds the accumulated value to an initial value of the position/orientation, thereby obtaining the position/orientation of the optical sensing device after the plurality of frames.

With this configuration, the three-dimensional CNN that would be applied to optical sensing data consisting of two-dimensional position information and one-dimensional time information input in time series is split into a two-dimensional CNN for the position information and a one-dimensional CNN for the time information. The time range over which convolution can be performed can therefore be lengthened (that is, the number of optical sensing data items along the time axis can be increased) without making the processing load excessive.

In the above position/orientation estimation system, the optical sensing device may be an event camera. An event camera has high time resolution and therefore produces many frames per unit time, but with this configuration the convolution can still be performed effectively over such a large number of frames (a long time) to estimate the position/orientation.

In the above position/orientation estimation system, the position/orientation estimation device may further include a preprocessing unit that lowers the time resolution of the optical sensing data input from the optical sensing device and generates the plurality of frames at the lowered time resolution. This configuration prevents the input data to the convolution from becoming too sparse along the time axis because the time resolution of the optical sensing data is too high.

In the above position/orientation estimation system, the optical sensing device may be fixed to a vehicle so as to sense the outside of the vehicle, and the accumulation unit may accumulate the local changes on a model basis, expressing each local change by parameters of a straight-ahead change and an angle change. With this configuration, the local changes can be accumulated with a model having few parameters by exploiting the constraints on vehicle motion.

In the above position/orientation estimation system, each of the two-dimensional CNN modules may be an LSTM module.

A position/orientation estimation device according to one aspect of the present invention is used together with an optical sensing device that generates three-dimensional optical sensing data including two-dimensional position information and one-dimensional time information, and estimates the position/orientation of the optical sensing device by visual odometry based on the optical sensing data input in time series. The device comprises: a two-dimensional convolution unit consisting of a plurality of two-dimensional CNN modules, each of which receives the position information of one of a plurality of consecutive frames of the optical sensing data and outputs a feature; a one-dimensional convolution unit consisting of a one-dimensional CNN module that receives the plurality of features output from the respective two-dimensional CNN modules and outputs, as a local change, the change in position/orientation between adjacent frames; and an accumulation unit that accumulates the local changes of the plurality of frames to obtain an accumulated value and adds the accumulated value to an initial value of the position/orientation, thereby obtaining the position/orientation of the optical sensing device after the plurality of frames.

With this configuration as well, the three-dimensional CNN that would be applied to optical sensing data consisting of two-dimensional position information and one-dimensional time information input in time series is split into a two-dimensional CNN for the position information and a one-dimensional CNN for the time information. The time range over which convolution can be performed can therefore be lengthened (the number of optical sensing data items along the time axis can be increased) without making the processing load excessive.

According to the present invention, the three-dimensional CNN that would be applied to optical sensing data consisting of two-dimensional position information and one-dimensional time information input in time series is split into a two-dimensional CNN for the position information and a one-dimensional CNN for the time information. The time range over which convolution can be performed can therefore be lengthened (the number of optical sensing data items along the time axis can be increased) without making the processing load excessive.

FIG. 1 is a block diagram showing the configuration of a position/orientation estimation system according to an embodiment of the present invention.
FIG. 2 is a diagram comparing the time resolution of event data from the event camera of the embodiment with that of images from an ordinary camera.
FIG. 3 is a diagram showing the network structure and data flow in the visual odometry of the embodiment.
FIG. 4 is a diagram showing the network structure of the one-dimensional CNN module 23 and the accumulation unit 24 of the embodiment.
FIG. 5 is a diagram showing the model of a parallel two-wheeled vehicle of the embodiment.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described below are examples of how the present invention may be carried out and do not limit the present invention to the specific configurations described. In carrying out the present invention, a specific configuration suited to the embodiment may be adopted as appropriate.

FIG. 1 is a block diagram showing the configuration of a position/orientation estimation system according to an embodiment of the present invention. The position/orientation estimation system 100 consists of an optical sensing device 10 and a position/orientation estimation device 20. The position/orientation estimation device 20 includes a preprocessing unit 21, a two-dimensional convolution unit 22, a one-dimensional convolution unit 23, and an accumulation unit 24. The optical sensing device 10 is fixed to a vehicle, which is a self-propelled body. It optically senses the outside of the vehicle to generate event frames as optical sensing data, and outputs the generated event frames to the position/orientation estimation device 20 in time-series order.

In the present embodiment, an event camera, which is a biologically inspired camera, is adopted as the optical sensing device 10. In an ordinary camera, each pixel measures the number of photons accumulated during a predetermined exposure time, and the measurement results of all pixels are output simultaneously as one frame. In an event camera, by contrast, each pixel operates asynchronously. When the event camera detects a difference between the previously detected intensity (brightness) and the currently detected intensity (brightness), it outputs an event frame as one frame of optical sensing data. The event data that make up an event frame (hereinafter also simply "frame") include the position of the pixel at which the intensity difference occurred, a timestamp as time information, and polarity information indicating whether the detected intensity increased or decreased.

FIG. 2 is a diagram comparing the time resolution of event data from an event camera with that of images from an ordinary camera. An event camera has a time resolution on the order of microseconds, which is far higher than that of an ordinary camera (for example, 30 frames/second). Because an event camera detects only the polarity of intensity changes rather than absolute intensity, it also has a wide dynamic range. For example, the dynamic range of an ordinary camera is about 50 dB, whereas that of an event camera is about 120 dB. Furthermore, whereas an ordinary camera can detect only regions of relatively high intensity, an event camera can detect dark and bright regions at the same time.

Because of these characteristics, event cameras are well suited to automated driving scenes. However, since the time resolution of an event camera is high as described above, the position/orientation estimation device 20 must be able to handle correlations over a long time (many frames).

The position/orientation estimation device 20 acquires a plurality of event frames arranged in time series from the optical sensing device 10, reduces their number, and extracts frames e_1 to e_K. Here, K is the number of frames that the position/orientation estimation device 20 can process at one time (hereinafter also the "allowable number of frames"). From the frames e_1 to e_K, the position/orientation estimation device 20 obtains, by visual odometry, the position/orientation p_K of the host vehicle one to K frames later.

That is, the position/orientation estimation device 20 executes visual odometry according to equation (1) below.

[Equation (1)]

Here, e_k is the frame at time step k (hereinafter the "k-th frame"), and M × N is the number of pixels of the event camera. In addition, m is additional sensing data obtained by a method other than visual odometry, such as wheel odometry or an inertial measurement unit. Whether m is included in the input data is optional. Finally, p_k is the position/orientation of the optical sensing device (and of the host vehicle to which it is fixed) at the k-th frame.

Specifically, the position/orientation estimation device 20 obtains the changes Δp_1 to Δp_K in the position and orientation of the host vehicle between adjacent frames (hereinafter also "local changes"), accumulates (concatenates) them in order, and adds the result to the initial value p_0 of the position and orientation of the host vehicle (hereinafter the "initial position/orientation") to calculate the position/orientation p_K of the host vehicle at the K-th frame. Expressed as a formula, this is equation (2) below.

[Equation (2)]
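The accumulation described above can be sketched as follows. This is a minimal illustration that assumes the local changes are simply added component-wise to the initial position/orientation; the function name and the tuple layout are hypothetical, and the embodiment's own accumulation is model-based rather than a plain sum.

```python
def accumulate_pose(p0, deltas):
    """Add local changes (delta p_1 .. delta p_K) in order to the initial pose p_0.

    Assumption: poses and local changes are same-length sequences of floats
    and accumulation is plain component-wise addition.
    """
    pose = list(p0)
    for delta in deltas:
        pose = [a + b for a, b in zip(pose, delta)]
    return pose


if __name__ == "__main__":
    p0 = [1.0, 0.0]                      # hypothetical initial pose
    deltas = [[0.5, 0.25], [0.5, 0.25]]  # hypothetical local changes
    print(accumulate_pose(p0, deltas))
```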

As described above, because the event camera has high time resolution, the LSTCNN in the position/orientation estimation device 20 must be able to handle correlations over a long time (many frames, for example several thousand). The position/orientation estimation device 20 therefore decomposes the spatiotemporal three-dimensional CNN that would be applied to the plurality of frames input from the optical sensing device 10 into a spatial two-dimensional CNN and a temporal one-dimensional CNN. To this end, the position/orientation estimation device 20 includes the preprocessing unit 21, which reduces the number of event frames input from the optical sensing device 10. It also includes the two-dimensional convolution unit 22, which executes the spatial two-dimensional CNN, and the one-dimensional convolution unit 23, which executes the temporal one-dimensional CNN, so that the three-dimensional CNN over the plurality of frames is carried out as a two-dimensional CNN and a one-dimensional CNN.

The preprocessing unit 21 is described next. The event frames output from the optical sensing device 10 are asynchronous at each pixel, and each event frame consists of four-dimensional event data {u, v, t, p}. Here, u and v are the position at which the event was detected, t is the time at which the event was detected (timestamp), and p is the polarity of the detected event. The preprocessing unit 21 converts these event data into spatiotemporal event frames.

To construct the event frames, the preprocessing unit 21 projects each datum u, v, t, p onto the corresponding spatiotemporal position of a three-dimensional tensor. The time resolution of the event camera is very fine, about 1 microsecond, so if the three-dimensional tensor were prepared at that granularity, the frames input to the two-dimensional convolution unit 22 would be too sparse to process efficiently with a CNN. The preprocessing unit 21 therefore lowers the time resolution of the event frames to a time resolution τ (for example, about 1,000 microseconds) that is sufficiently coarse yet still finer than the frame period of an ordinary camera (for example, 30 frames/second).
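The projection of events onto a coarser spatiotemporal grid can be sketched as below. This is a simplified illustration, not the preprocessing unit's actual procedure: it assigns each event to a single time bin of width τ and accumulates signed polarity, both of which are assumptions. The function name and argument layout are hypothetical.

```python
def bin_events(events, width, height, tau_us, num_bins):
    """Project (u, v, t, p) events onto a num_bins x height x width grid.

    Assumptions: t is in microseconds from the start of the window, each
    event falls into exactly one bin of width tau_us, and polarity is
    accumulated as +1 (increase) or -1 (decrease).
    """
    frames = [[[0] * width for _ in range(height)] for _ in range(num_bins)]
    for u, v, t_us, polarity in events:
        k = int(t_us // tau_us)  # index of the coarse time bin
        if 0 <= k < num_bins and 0 <= u < width and 0 <= v < height:
            frames[k][v][u] += 1 if polarity > 0 else -1
    return frames


if __name__ == "__main__":
    # Three hypothetical events on a 2x2 sensor, binned at tau = 1,000 us.
    events = [(0, 0, 10, 1), (0, 0, 500, 1), (1, 1, 1500, -1)]
    frames = bin_events(events, width=2, height=2, tau_us=1000, num_bins=2)
    print(frames)
```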

To preserve the fine time information obtained from the optical sensing device 10, the preprocessing unit 21 calculates three weighting coefficients [symbols] for each event using equations (4) to (6) below.

[Equations (4) to (6)]

Here, t is the timestamp of the event, [symbol] is the discretized timestamp closest to t, and [expression].

In this way, the preprocessing unit 21 generates a plurality of frames whose time resolution is lower than that of the event frames input from the optical sensing device 10, and inputs them to the two-dimensional convolution unit 22.

The two-dimensional convolution unit 22 consists of a plurality of two-dimensional CNN modules 22-1 to 22-K, each of which executes a two-dimensional CNN on a frame input from the preprocessing unit 21, and the one-dimensional convolution unit 23 consists of a one-dimensional CNN module. Each of the two-dimensional CNN modules 22-1 to 22-K processes a time-divided, short-time event frame of size M × N × L, and the one-dimensional CNN module processes long-time features of size F × 1 × T. Here, F is the length of the feature output from each of the two-dimensional CNN modules 22-1 to 22-K, and T satisfies T = K / L.

FIG. 3 is a diagram showing the network structure and data flow in the visual odometry of the embodiment. L is set to any value between 1 and K according to the characteristics of the input frames. For example, L = 100 may be set when K = 3000.
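The relation T = K / L can be checked with a short sketch that splits K frames into T consecutive chunks of L frames each. With the example values K = 3000 and L = 100 this gives T = 30. The function name is hypothetical, and the requirement that L divide K evenly is an assumption made for the example.

```python
def split_into_chunks(num_frames, chunk_len):
    """Return (start, end) frame index ranges for T = num_frames / chunk_len chunks."""
    if chunk_len <= 0 or num_frames % chunk_len != 0:
        raise ValueError("chunk_len must be positive and divide num_frames")
    return [(s, s + chunk_len) for s in range(0, num_frames, chunk_len)]


if __name__ == "__main__":
    chunks = split_into_chunks(3000, 100)  # K = 3000, L = 100
    print(len(chunks), chunks[0], chunks[-1])
```

Each chunk corresponds to one M × N × L input to the two-dimensional convolution, and the T resulting features form the F × 1 × T input to the one-dimensional convolution.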

Each of the two-dimensional CNN modules 22-1 to 22-K performs a spatial (two-dimensional) convolution, and the one-dimensional CNN module 23 performs a temporal (one-dimensional) convolution. The structure of each two-dimensional CNN module 22-1 to 22-K may be the same as, for example, the convolutional part of the VGG-16 network (Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.). The one-dimensional CNN module 23 is constructed by stacking ordinary convolution modules similar to those of WaveNet (van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. CoRR abs/1609.03499.).

FIG. 4 is a diagram showing the network structure of the one-dimensional CNN module 23 and the accumulation unit 24. As in WaveNet, the one-dimensional CNN module 23 models complex temporal dependencies using gated activation units, as expressed by equation (3) below.

[Equation (3)]

Here, f is the network of the two-dimensional CNN modules 22-1 to 22-K, and g is the network of the one-dimensional CNN module 23. The one-dimensional CNN network g has a structure of O layers (O is a natural number). The output of the (O-1)-th layer is the change between adjacent frames (local change) estimated from consecutive frames, and the O-th layer is a parameterless Model-based Pose Concatenation (hereinafter "MPC"). The MPC is implemented as the accumulation unit 24. The MPC module, that is, the accumulation unit 24, estimates the position/orientation p_{k+1} at time step k+1 by updating the position/orientation p_k at time step k using the local change Δp_k at time step k.
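For reference, the gated activation unit used in WaveNet multiplies a tanh "filter" branch by a sigmoid "gate" branch, each computed by its own convolution of the same input. The sketch below shows that unit for a single-channel 1D signal in plain Python. It illustrates the WaveNet-style unit only and is not a reproduction of equation (3); the helper names and the use of an unpadded ("valid") convolution are assumptions.

```python
import math


def conv1d_valid(x, w):
    """Unpadded 1D cross-correlation, as used by deep-learning 'convolutions'."""
    n = len(w)
    return [sum(w[j] * x[i + j] for j in range(n)) for i in range(len(x) - n + 1)]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gated_activation(x, w_filter, w_gate):
    """WaveNet-style gate: tanh(w_filter * x) multiplied by sigmoid(w_gate * x)."""
    filt = conv1d_valid(x, w_filter)
    gate = conv1d_valid(x, w_gate)
    return [math.tanh(a) * sigmoid(b) for a, b in zip(filt, gate)]


if __name__ == "__main__":
    features = [0.2, -0.1, 0.4, 0.0, 0.3]  # hypothetical temporal feature sequence
    print(gated_activation(features, [0.5, -0.5], [1.0, 1.0]))
```

The sigmoid branch acts as a soft switch in the range (0, 1) that controls how much of the tanh branch passes to the next layer.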

In the final layer of the one-dimensional convolution unit 23, the error of the estimated position/orientation is calculated. This error is used to update the parameters of the two-dimensional convolution unit 22 and the one-dimensional convolution unit 23. The accumulation unit 24, as the MPC module, can efficiently compute stable derivatives (that is, derivatives with respect to the parameters of each neuron) from the final error. The MPC module is based on the following observations.

FIG. 5 is a diagram showing the model of a parallel two-wheeled vehicle. As shown in FIG. 5, the motion of a vehicle is constrained and can be expressed with fewer parameters than the parameter set of the Lie algebra se(2) or se(3). In the parallel two-wheeled vehicle model, local motion can be expressed by the parameters of straight-ahead velocity v and angular velocity ω. The accumulation unit 24 of the present embodiment, however, uses the straight-ahead change ΔL and the angle change Δθ, and defines the position/orientation error in terms of ΣΔL and ΣΔθ. This parameterization and redefinition of the position/orientation error make the computation of the derivatives of the error function with respect to each local change easy and stable.

For simplicity, the two-dimensional case, in which the vehicle travels on a two-dimensional plane (movement in the height direction of the vehicle is not considered), is described first. In the two-dimensional case, the position/orientation of the vehicle can be expressed by the vehicle position and heading angle [symbol].

Ordinarily, the position/orientation is updated by equation (7) below using the local change [symbol] during the local time Δt at the k-th frame.

[Equation (7)]

Here, Δθ_k is given by [expression], and ΔL is calculated from v and ω by equation (8) below.

[Equation (8)]
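A pose update of this kind can be sketched with the standard textbook model of a vehicle moving at constant straight-ahead velocity v and angular velocity ω over a short interval Δt. The formulas below (angle change ωΔt, chord length 2(v/ω)sin(ωΔt/2), and a step taken along the mid-interval heading) are the usual constant-velocity arc geometry and are an assumption here; they are not taken from equations (7) and (8), whose exact form is not reproduced in this text.

```python
import math


def chord_length(v, omega, dt):
    """Straight-line distance covered in dt at constant v and omega (assumed model)."""
    if abs(omega) < 1e-9:
        return v * dt  # straight motion: chord equals arc length
    return 2.0 * (v / omega) * math.sin(omega * dt / 2.0)


def update_pose(pose, v, omega, dt):
    """Advance (x, y, theta) by one step of constant-velocity arc motion."""
    x, y, theta = pose
    d_theta = omega * dt
    d_l = chord_length(v, omega, dt)
    heading = theta + d_theta / 2.0  # direction of the chord
    return (x + d_l * math.cos(heading), y + d_l * math.sin(heading), theta + d_theta)


if __name__ == "__main__":
    pose = (0.0, 0.0, 0.0)
    for _ in range(4):  # four hypothetical steps of a gentle left turn
        pose = update_pose(pose, v=1.0, omega=0.1, dt=0.5)
    print(pose)
```

Because each new position depends on the heading accumulated so far, an error in an early angle change propagates into every later position, which is the coupling the following paragraphs discuss.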

Because the error at time step k-1 depends on the error at time step k, the derivatives of the position/orientation p_K with respect to each local change z_k are related nonlinearly through equations (7) and (8). As a result, the computational load is high and implementation is difficult. Moreover, when the accumulated position/orientation is far from the true position/orientation, the derivatives themselves become unstable.

Under the conventional definition of error, the position/orientation error at time step k depends on the position/orientation errors at time steps 1 to k-1, so its derivative is affected by past derivatives through past applications of equation (8). The derivative therefore becomes a function of the errors over the entire interval in which the position/orientation is integrated, and the computational load increases. In the present embodiment, by contrast, the derivative of the finally integrated error with respect to the local change at each time step is independent for each time step, so the computation is simple and lightweight.

Specifically, the cumulative unit 24 of the present embodiment replaces the usual expression of the local change amount with a different one, and likewise replaces the usual notation for the error of the accumulated path and angle.
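The difference between the two error definitions can be illustrated with a short derivation. It assumes a standard planar dead-reckoning model, which is not necessarily the exact form of equations (7) and (8); the symbols $\Delta L_k$, $\Delta\theta_k$, $L_K$ and $\Theta_K$ are chosen here for illustration only.

```latex
% Assumed model: heading and position accumulated from local changes.
\theta_k = \theta_0 + \sum_{j=1}^{k} \Delta\theta_j, \qquad
x_K = x_0 + \sum_{k=1}^{K} \Delta L_k \cos\theta_k

% Derivative of the accumulated position with respect to one local angle
% change: every later time step contributes, so the terms are coupled.
\frac{\partial x_K}{\partial \Delta\theta_j}
  = -\sum_{k=j}^{K} \Delta L_k \sin\theta_k

% Accumulated path length and accumulated angle instead:
L_K = \sum_{k=1}^{K} \Delta L_k, \qquad
\Theta_K = \sum_{k=1}^{K} \Delta\theta_k

% Their derivatives are constant and do not involve other time steps.
\frac{\partial L_K}{\partial \Delta L_j} = 1, \qquad
\frac{\partial \Theta_K}{\partial \Delta\theta_j} = 1, \qquad
\frac{\partial L_K}{\partial \Delta\theta_j}
  = \frac{\partial \Theta_K}{\partial \Delta L_j} = 0
```

Under this assumed model, an error defined on the accumulated path and angle has a gradient that can be evaluated per time step without revisiting earlier steps.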

The (O−1)-th layer, that is, the output layer of the one-dimensional convolution unit 23, produces its output, and the O-th layer, acting as an MPC module, updates the current movement of the vehicle using equation (7). Here, qk is defined as in equation (9) below.

The cumulative position / orientation error Laccum calculated by the MPC module is defined by equation (10) below, which uses the true value of the accumulated path and angle.

The corresponding Jacobian matrix with respect to the local change amount is calculated by equation (11) below.

In addition to the cumulative position / orientation error Laccum described above, the MPC module also calculates the local change amount error Llocal by equation (12) below.

The sum of the cumulative position / orientation error Laccum and the local movement error Llocal is used to train the network. The adjustment parameters λ1 and λ2 are set so as to emphasize Llocal in the early stage of training and Laccum in the later stage of training. This improves the accuracy of the concatenated position and orientation estimate.
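One way to realise such a weighting is sketched below. The linear schedule, the mapping of λ1 to Laccum and λ2 to Llocal, and the function names `loss_weights` and `total_loss` are assumptions made for illustration; the text only states that Llocal is emphasized early and Laccum late.

```python
def loss_weights(step, total_steps):
    """Return (lambda1, lambda2) for a hypothetical linear schedule."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    lambda1 = progress          # assumed weight on L_accum, grows during training
    lambda2 = 1.0 - progress    # assumed weight on L_local, shrinks during training
    return lambda1, lambda2


def total_loss(l_accum, l_local, step, total_steps):
    """Weighted sum of the cumulative error and the local error."""
    lambda1, lambda2 = loss_weights(step, total_steps)
    return lambda1 * l_accum + lambda2 * l_local


# At the start only the local error counts; at the end only the cumulative error.
start = total_loss(2.0, 4.0, step=0, total_steps=100)
middle = total_loss(2.0, 4.0, step=50, total_steps=100)
end = total_loss(2.0, 4.0, step=100, total_steps=100)
```

Any monotone schedule would satisfy the description equally well; the linear ramp is simply the easiest to read.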

Next, training of the position / orientation estimation device 20 is described. For training, a continuous sequence of 2K frames, twice the allowable number of frames K, is randomly extracted from the data set as input data. The position / orientation estimation device 20 feeds the time-divided (M × N × L) input data into the two-dimensional convolution unit 22 2T times, thereby obtaining a tensor of size F × 1 × 2T.

These tensors are input to the one-dimensional convolution unit 23, which calculates the position and orientation and the position / orientation error. The one-dimensional convolution unit 23 is updated using this error. The error immediately before the two-dimensional convolution unit 22 is divided into T short-time errors, which are used to update the two-dimensional convolution unit 22 T times. Note that 2K frames are needed in order to obtain a valid T from the one-dimensional convolution unit 23.

In the present embodiment, no padding is applied to the ordinary convolution layers, so the output can be reduced by up to half the kernel size. If additional sensing data m are available, they are concatenated with the output of the two-dimensional convolution unit 22, and their temporal information is modeled by the one-dimensional convolution unit 23 together with the output of the two-dimensional convolution unit 22.
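The effect of omitting padding on the length of the time axis can be sketched with the standard "valid" convolution arithmetic. This is general convolution bookkeeping rather than a formula taken from the description; the function names and the example kernel sizes are hypothetical.

```python
def valid_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding ("valid" mode)."""
    if kernel_size > input_length:
        return 0
    return (input_length - kernel_size) // stride + 1


def stacked_valid_length(input_length, kernel_sizes):
    """Output length after several unpadded 1D convolution layers in sequence."""
    length = input_length
    for kernel_size in kernel_sizes:
        length = valid_output_length(length, kernel_size)
    return length


# A 20-step sequence through unpadded kernels of size 3, 3 and 5: 20 -> 18 -> 16 -> 12.
remaining = stacked_valid_length(20, [3, 3, 5])
```

Each unpadded layer with kernel size k removes k − 1 time steps in total, which is why the input sequence has to be longer than the number of valid outputs required.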

To optimize the two-dimensional CNN modules 22-1 to 22-K of the two-dimensional convolution unit 22 and the one-dimensional CNN module of the one-dimensional convolution unit 23, Adam (Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.) can be used with the following hyperparameters: learning rate l = 0.001, β1 = 0.9, β2 = 0.999, and ε = 0.00000001.
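For reference, a single Adam update with these hyperparameters can be written in plain Python as follows. This is a minimal sketch of the published update rule applied to a flat list of scalar parameters; the function name `adam_step` and the list-based interface are assumptions for illustration, not the training code of the embodiment.

```python
import math


def adam_step(params, grads, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# First step from zero moments: the update magnitude is close to the learning rate.
params, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

In practice an existing optimizer implementation from a deep learning framework would normally be used instead of hand-written code.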

As described above, according to the position / orientation estimation system 100 of the present embodiment, the three-dimensional CNN that would otherwise be applied to optical sensing data consisting of two-dimensional position information and one-dimensional time information input in time series is split into a two-dimensional CNN for the position information and a one-dimensional CNN for the time information. The time range over which convolution can be performed can therefore be lengthened (the number of optical sensing data in the time direction can be increased) without making the processing load excessive.
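The saving from this split can be illustrated by counting kernel weights. The comparison below is a rough sketch under simplifying assumptions (a single layer, and freely chosen kernel sizes and channel counts); the function names and numbers are hypothetical and are not taken from the embodiment.

```python
def kernel_weights_3d(kh, kw, kt, c_in=1, c_out=1):
    """Weights in one 3D convolution kernel bank of size kh x kw x kt."""
    return kh * kw * kt * c_in * c_out


def kernel_weights_split(kh, kw, kt, c_in=1, c_mid=1, c_out=1):
    """Weights when a 2D spatial kernel is followed by a 1D temporal kernel."""
    spatial = kh * kw * c_in * c_mid
    temporal = kt * c_mid * c_out
    return spatial + temporal


# A 3 x 3 spatial kernel spanning 9 time steps, single channel throughout.
full = kernel_weights_3d(3, 3, 9)        # 3 * 3 * 9
split = kernel_weights_split(3, 3, 9)    # 3 * 3 + 9
```

In the 3D case the cost grows with the product of the spatial and temporal extents, whereas in the split case the temporal extent only adds to the cost, so a longer time window is comparatively cheap.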

In the present invention, the three-dimensional CNN that would otherwise be applied to optical sensing data consisting of two-dimensional position information and one-dimensional time information input in time series is split into a two-dimensional CNN for the position information and a one-dimensional CNN for the time information. The time range over which convolution can be performed can therefore be lengthened (the number of optical sensing data in the time direction can be increased) without making the processing load excessive. The invention is useful as, for example, a position / orientation estimation system that estimates the position and orientation of a self-propelled body based on images captured from the self-propelled body.

10 Optical sensing device
20 Position / orientation estimation device
21 Preprocessing unit
22 Two-dimensional convolution unit
23 One-dimensional convolution unit
24 Cumulative unit
100 Position / orientation estimation system

Claims

A position / orientation estimation system comprising:
an optical sensing device that generates three-dimensional optical sensing data including two-dimensional position information and one-dimensional time information; and
a position / orientation estimation device that estimates the position and orientation of the optical sensing device by visual odometry based on the optical sensing data input in time series,
wherein the position / orientation estimation device comprises:
a two-dimensional convolution unit composed of a plurality of two-dimensional CNN modules, each of which receives the position information of one of a plurality of consecutive frames composed of the optical sensing data and outputs a feature amount;
a one-dimensional convolution unit composed of a one-dimensional CNN module that receives the plurality of feature amounts output from the plurality of two-dimensional CNN modules and outputs, as a local change amount, the amount of change in position and orientation between adjacent frames; and
a cumulative unit that obtains a cumulative value of the local change amounts by accumulating the local change amounts of the plurality of frames, and obtains the position and orientation of the optical sensing device after the plurality of frames by adding the cumulative value to an initial value of the position and orientation,
wherein the optical sensing device is fixed to a vehicle so as to sense the outside of the vehicle, and
the cumulative unit accumulates the local change amounts on a model basis and expresses the local change amount with parameters of a straight-ahead change amount and an angle change amount.

The position / orientation estimation system according to claim 1, wherein the optical sensing device is an event camera.

The position / orientation estimation system according to claim 1 or 2, wherein the position / orientation estimation device further includes a preprocessing unit that lowers the time resolution of the optical sensing data input from the optical sensing device to generate the plurality of frames with the lowered time resolution.

The position / orientation estimation system according to any one of claims 1 to 3, wherein the cumulative unit uses a first error for the parameters and a second error for the local change amount for learning by weighting them with a first weight and a second weight, respectively, and performs the learning with the second weight made heavier in the initial stage of the learning and the first weight made heavier in the later stage of the learning.

A position / orientation estimation device that is used together with an optical sensing device that generates three-dimensional optical sensing data including two-dimensional position information and one-dimensional time information, and that estimates the position and orientation of the optical sensing device by visual odometry based on the optical sensing data input in time series, the position / orientation estimation device comprising:
a two-dimensional convolution unit composed of a plurality of two-dimensional CNN modules, each of which receives the position information of one of a plurality of consecutive frames composed of the optical sensing data and outputs a feature amount;
a one-dimensional convolution unit composed of a one-dimensional CNN module that receives the plurality of feature amounts output from the plurality of two-dimensional CNN modules and outputs, as a local change amount, the amount of change in position and orientation between adjacent frames; and
a cumulative unit that obtains a cumulative value of the local change amounts by accumulating the local change amounts of the plurality of frames, and obtains the position and orientation of the optical sensing device after the plurality of frames by adding the cumulative value to an initial value of the position and orientation,
wherein the optical sensing device is fixed to a vehicle so as to sense the outside of the vehicle, and
the cumulative unit accumulates the local change amounts on a model basis and expresses the local change amount with parameters of a straight-ahead change amount and an angle change amount.