JP2021189917A

JP2021189917A - Object detection system, object detection method, and object detection program

Info

Publication number: JP2021189917A
Application number: JP2020096497A
Authority: JP
Inventors: David Jimenez; Kohei Matsuda
Original assignee: ZMP Inc
Current assignee: ZMP Inc
Priority date: 2020-06-02
Filing date: 2020-06-02
Publication date: 2021-12-13
Anticipated expiration: 2040-06-02
Also published as: JP7122721B2

Abstract

PROBLEM TO BE SOLVED: To provide an object detection system configured to allow a user to define an obstacle and detect an object, such as an obstacle, with high accuracy, while enabling high-speed processing.

SOLUTION: An object detection system 10 includes: image generation means 30 which generates a three-dimensional image 31 on the basis of imaging data obtained by imaging means 20; perspective data generation means 40 which generates perspective data 41 by perspective processing on the basis of the three-dimensional image generated by the image generation means; and analysis means 50 which detects the depth where an object exists in each subdivided part by horizontal subdivision and depth subdivision based on the perspective data, and combines the detection results to detect the object. The perspective data generation means extracts cross sections, as slice images, in a depth direction of the three-dimensional image, and generates the perspective data by reducing the dimensions of the slice images. The analysis means extracts cross sections, as slice data, in a height direction on the basis of the perspective data, to detect a direction and a distance of the object.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an object detection system, an object detection method, and an object detection program for detecting objects in a three-dimensional image. A three-dimensional original image captured by, for example, a camera is converted into low-dimensional bird's-eye view data by bird's-eye view processing, and objects such as people and obstacles in the original image are detected based on the bird's-eye view data.

In the automated driving of a traveling vehicle such as an automobile, objects such as people, obstacles, and road edges must be detected in the forward field of view of the vehicle, the drivable area must be confirmed, and the drive of the vehicle must be controlled so as to avoid collision with the objects. Conventionally, objects ahead in the traveling direction are detected as follows.

First, a stereo camera attached to, for example, the front of the traveling vehicle images the area ahead of the vehicle, and a three-dimensional image is generated from the left and right pair of images of the stereo camera. Bird's-eye view processing is then performed on this three-dimensional image to generate a bird's-eye view image of the imaging range as seen from above. This bird's-eye view processing, that is, the image conversion from the three-dimensional image to the bird's-eye view image, is performed by a neural network. A so-called convolutional neural network is used, and it is trained by deep learning so that the desired bird's-eye view image is obtained. Next, objects within the area of the bird's-eye view image are detected by image processing based on the bird's-eye view image obtained in this way. This object detection processing likewise uses a neural network trained by deep learning so that the desired object detection can be performed.

In contrast, Non-Patent Document 1, for example, discloses a method of detecting obstacles from a single color image by using a convolutional neural network.

Non-Patent Document 1: D. Levi, N. Garnett and E. Fetaya, "StixelNet: A Deep Convolutional Network for Obstacle Detection and Road Segmentation", http://www.bmva/2015/papers/paper109.pdf
Non-Patent Document 2: C. Godard, O. M. Aodha, M. Firman and G. Brostow, "Digging Into Self-Supervised Monocular Depth Estimation", https://arxiv.org/abs/1806.01260
Non-Patent Document 3: J. Castorena, U. S. Kamilov and P. T. Boufounos, "AUTOCALIBRATION OF LIDAR AND OPTICAL CAMERAS VIA EDGE ALIGNMENT", https://www.merl.com/publications/docs/TR2016-009.pdf

However, converting a three-dimensional image directly into a bird's-eye view image involves an enormous amount of data and takes time to process. When, for example, three-dimensional images of the area ahead of an automobile are to be processed continuously as the automobile travels, the processing may not keep up. In addition, the accuracy of object detection falls markedly as the distance from the stereo camera to the object increases. The obstacle detection method of Non-Patent Document 1, on the other hand, uses two-dimensional images from a monocular camera, and is not intended to detect obstacles directly from a three-dimensional image.

In view of the above, a first object of the present invention is to provide an object detection system that can process data quickly, allows a user to define obstacles, and can detect objects such as obstacles with higher accuracy. A second object is to provide an object detection method, and a third object is to provide an object detection program.

The first object of the present invention is achieved by an object detection system that includes: image generation means that generates a three-dimensional image based on imaging data acquired by imaging means; bird's-eye view data generation means that generates bird's-eye view data by bird's-eye view processing based on the three-dimensional image generated by the image generation means; and analysis means that, based on the bird's-eye view data, detects the depth at which an object exists in each divided portion obtained by horizontal division and depth division, and detects the object by combining the detection results. The bird's-eye view data generation means takes out cross sections at a plurality of positions in the depth direction of the three-dimensional image as slice images and reduces the dimension of each slice image to generate the bird's-eye view data. The analysis means extracts cross sections at a plurality of positions in the height direction as slice data based on the bird's-eye view data, and thereby detects the direction and distance of the object.

Preferably, the bird's-eye view data generation means is composed of an autoencoder consisting of a convolutional neural network, and the autoencoder reduces the dimension of each slice image to generate the bird's-eye view data.
The neural network is preferably a multi-layer neural network consisting of an input layer, at least one intermediate layer, and an output layer. During training, each slice image input to the input layer is converted into low-dimensional intermediate data in one of the intermediate layers and is then decoded in the output layer into reconstructed data of the same dimension as the slice image. The network is trained by deep learning so that the reconstructed data can reproduce the objects in the slice image. After training, the intermediate data is output from the intermediate layer to the analysis means as the bird's-eye view data.
The bird's-eye view data generation means preferably slices each slice image further in the horizontal direction to generate slice pieces, and reduces the dimension of the slice pieces to generate the bird's-eye view data.
The bird's-eye view data is preferably a set of feature vectors obtained by mapping each slice image or slice piece, as a vector, to a non-sparse feature space.
The analysis means is preferably composed of a convolutional neural network and is trained by deep learning.

The second object is achieved by an object detection method that includes: an image generation step of generating a three-dimensional image based on imaging data; a bird's-eye view data generation step of generating bird's-eye view data by bird's-eye view processing based on the three-dimensional image generated in the image generation step; and an analysis step of detecting, based on the bird's-eye view data, the depth at which an object exists in each divided portion obtained by horizontal division and depth division, and detecting the object by combining the detection results. In the bird's-eye view data generation step, cross sections at a plurality of positions in the depth direction of the three-dimensional image are taken out as slice images, and the dimension of each slice image is reduced to generate the bird's-eye view data. In the analysis step, the object is detected by extracting cross sections at a plurality of positions in the height direction as slice data based on the bird's-eye view data.

The third object is achieved by an object detection program that causes a computer to execute: an image generation procedure of generating a three-dimensional image based on imaging data; a bird's-eye view data generation procedure of generating bird's-eye view data by bird's-eye view processing based on the three-dimensional image generated in the image generation procedure; and an analysis procedure of detecting, based on the bird's-eye view data, the depth at which an object exists in each divided portion obtained by horizontal division and depth division, and detecting the object by combining the detection results. In the bird's-eye view data generation procedure, the program causes the computer to take out cross sections at a plurality of positions in the depth direction of the three-dimensional image as slice images and to reduce the dimension of each slice image to generate the bird's-eye view data. In the analysis procedure, the program causes the computer to detect the object by extracting cross sections at a plurality of positions in the height direction as slice data based on the bird's-eye view data.

Thus, the present invention can provide an object detection system, an object detection method, and an object detection program that have a simple configuration, can process data quickly, and can detect objects such as obstacles with higher accuracy.

FIG. 1 is a block diagram showing the configuration of an embodiment of the object detection system according to the present invention.
FIG. 2 shows, for the object detection system of FIG. 1, (A) a captured image, (B) a three-dimensional image, and (C) a depth slice.
FIG. 3 is a schematic diagram showing the operating principle of the autoencoder constituting the three-dimensional image generation means in the object detection system of FIG. 1.
FIG. 4 is an explanatory diagram of the operation of the autoencoder of FIG. 3 (A) during training and (B) during operation.
FIG. 5 is a flowchart showing the operation of the object detection system of FIG. 1.
FIG. 6 is an explanatory diagram of the operation of the three-dimensional image generation means in the object detection system of FIG. 1.
FIG. 7 is a schematic diagram showing slice pieces in the object detection system of FIG. 1.
FIG. 8 shows, for the object detection system of FIG. 1, (A) a schematic diagram of object detection and (B) an explanatory diagram of an actual object detection state.
FIG. 9 is an explanatory diagram showing, for the object detection system of FIG. 1, (A) depth slices, (B) dimension-reduced feature vectors, (C) a set of feature vectors for each depth slice, and (D) a tensor formed by combining the feature vectors.
FIG. 10 is a schematic diagram showing, in sequence, the procedure for creating bird's-eye view data in the object detection system of FIG. 1.
FIG. 11 is a schematic diagram showing, in sequence, the analysis procedure of the neural network of the analysis means in the object detection system of FIG. 1.
FIG. 12 is a schematic diagram showing part of the series of vectors obtained by the analysis procedure of FIG. 11.
FIG. 13 shows an experimental example at a construction site using the object detection system of FIG. 1: (A) a captured image and (B) a schematic diagram of object detection.
FIG. 14 shows an experimental example in an urban environment using the object detection system of FIG. 1: (A) a captured image and (B) a schematic diagram of object detection.

Hereinafter, the present invention will be described in detail based on the embodiments shown in the drawings.
FIG. 1 is a block diagram of an embodiment of an object detection system 10 according to the present invention. FIG. 2(A) shows a captured image, FIG. 2(B) shows a three-dimensional image 31, and FIG. 2(C) shows a depth slice 31a. FIG. 3 is a schematic diagram showing the operating principle of the autoencoder 70 constituting the three-dimensional image generation means 30 in the object detection system 10 of FIG. 1.
In FIG. 1, the object detection system 10 is composed of imaging means 20, three-dimensional image generation means 30, bird's-eye view data generation means 40, analysis means 50, and a control unit 60 that controls the imaging means 20, the three-dimensional image generation means 30, the bird's-eye view data generation means 40, and the analysis means 50 by a program.

The control unit 60 is composed of a computer. By executing the object detection program according to the present invention, which is installed in advance, it controls the imaging means 20, the three-dimensional image generation means 30, the bird's-eye view data generation means 40, and the analysis means 50 described above, and thereby realizes the object detection system 10 and the object detection method according to the present invention.

The imaging means 20 is a monocular camera, a stereo camera, or a lidar of known configuration. It is arranged to image the area ahead of, for example, an automobile, and can image the forward field successively as the automobile travels. The imaging means 20 is controlled by the control unit 60 to capture images at predetermined time intervals, and sequentially outputs imaging signals 21. The lidar, also called a laser radar, is a sensor that performs laser imaging detection and ranging, and is also written LIDAR. A three-dimensional lidar can be used as the lidar.

Based on each imaging signal 21 from the imaging means 20, the three-dimensional image generation means 30 detects the depth (distance) to the objects appearing in the captured image 21a shown in FIG. 2(A), and generates the three-dimensional image 31 shown in FIG. 2(B). The three-dimensional image generation means 30 is controlled by the control unit 60 and generates a three-dimensional image 31 from each of the sequentially input imaging signals 21.

When the imaging means 20 is a monocular camera, the color captured image 21a given by the imaging signal 21 is a two-dimensional image representing a plane, as shown in FIG. 2(A). Distance information for each point of the two-dimensional image (a point cloud) can nevertheless be generated by a known method such as the one disclosed in Non-Patent Document 2.
To fuse the two-dimensional captured image with the point cloud, the three-dimensional image generation means 30 projects the color intensity of the image onto the corresponding points. In other words, it generates the three-dimensional image 31 by an algorithm that "paints" each point of the point cloud with the corresponding color in the scene (the situation represented by one camera image). As inputs, the three-dimensional image generation means 30 requires the color captured image of the monocular camera and the distance information (point cloud) for each point of that captured image.

When the imaging means 20 is a stereo camera, the color imaging signal 21 contains a pair of captured images from the left and right cameras. Based on the parallax of the subject between the two captured images, the captured image of one camera and the distance information for each point of that captured image (a pseudo point cloud) can be generated.
The three-dimensional image 31 extends in the horizontal direction H, the vertical direction V, and the depth direction D, as shown in FIG. 2(B). When the imaging means 20 is a three-dimensional lidar, the three-dimensional image 31 may be acquired by a known method such as the one reported in Non-Patent Document 3. For example, a two-dimensional monochrome image of 800 × 600 pixels can be combined with the depth information acquired by the three-dimensional lidar to obtain the three-dimensional image 31 that is sliced in the depth direction as described later.
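For the stereo case, the step from parallax to per-pixel distance can be illustrated with the standard pinhole stereo relation Z = f·B / d (depth equals focal length times baseline divided by disparity). The following is a minimal sketch under that assumption; the function name, the focal length, and the baseline are hypothetical values chosen for illustration and do not come from the patent.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) via Z = f * B / d.

    Pixels with zero or negative disparity have no valid match and are
    assigned an infinite depth.
    """
    disparity = np.asarray(disparity_px, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical 2 x 2 disparity map, focal length 800 px, baseline 0.12 m.
disparity = np.array([[64.0, 32.0], [0.0, 8.0]])
depth = disparity_to_depth(disparity, focal_px=800.0, baseline_m=0.12)
```

Pairing each depth value with the color of the same pixel gives the kind of pseudo point cloud the text describes. Note how small disparities map to large depths, which is one reason distant objects are harder to range accurately.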

The bird's-eye view data generation means 40 performs bird's-eye view processing on each three-dimensional image 31 from the three-dimensional image generation means 30 and generates dimension-reduced bird's-eye view data 41. The bird's-eye view data generation means 40 includes an autoencoder 42 consisting of a convolutional neural network. It is controlled by the control unit 60 and generates dimension-reduced bird's-eye view data 41 from each of the sequentially input three-dimensional images 31.

Specifically, the bird's-eye view data generation means 40 first slices the three-dimensional image 31 at equal intervals in the depth direction, as shown in FIG. 2(B), to obtain a plurality of depth slices 31a. The points of the point cloud are thereby divided by distance. Each depth slice 31a has the same number of pixels as the original camera image (for example, 960 × 1280).

Next, as shown in FIG. 2(C), the bird's-eye view data generation means 40 further slices each of the obtained depth slices 31a in the horizontal direction to obtain a plurality of vertically long slice pieces 31b. When a depth slice 31a of 960 × 1280 pixels is divided into, for example, 80 pieces, each slice piece 31b has 960 × 16 pixels. The bird's-eye view data generation means 40 then inputs these slice pieces 31b to the input layer of the convolutional neural network of the autoencoder described later. At that time, as shown in FIG. 10(A), each slice piece 31b is further divided into a plurality of pieces in the height direction for processing. This also improves the object detection accuracy in the height direction.

An autoencoder 70 is generally configured as shown schematically in FIG. 3. The autoencoder 70 is a multi-layer neural network consisting of, for example, an input layer 71, intermediate layers 72, 73, and 74, and an output layer 75. In FIG. 3, for convenience of explanation, the layers 71 to 75 are shown as six-dimensional, four-dimensional, two-dimensional, four-dimensional, and six-dimensional, respectively. In this autoencoder 70, the input data is reduced to two dimensions by two stages of encoding, is then reconstructed to six dimensions by two stages of decoding, and is output from the output layer 75.
The autoencoder 70 is trained by deep learning on a large number of sample data. The input data input to the input layer 71 is compared with the reconstructed data output from the output layer 75 so that the reconstructed data comes to have the same features as the input data. Specifically, with the input data denoted $I$, the intermediate layer 72 maps it to a four-dimensional space by the function $f(I) = h$, and the intermediate layer 73 maps it to a two-dimensional space by the function $g(h) = e$; this $e$ is the two-dimensional data. The intermediate layer 74 and the output layer 75 then map to a four-dimensional space and a six-dimensional space by the functions $j$ and $k$, respectively, and the output layer 75 outputs the six-dimensional reconstructed data $I_r$. All of these mappings are non-linear. The functions $f$, $g$, $j$, and $k$ are unknown. They are determined by so-called deep learning: a large number of sample data are input to the input layer 71, and the squared difference $(I - I_r)^2$ between the input data $I$ and the reconstructed data $I_r$ is minimized.
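The 6-4-2-4-6 structure described for FIG. 3 can be sketched as a forward pass with a squared reconstruction error. This is an illustrative toy, not the patent's network: the `tanh` activation, the random initial weights, and all names are assumptions, and no training loop is shown. Training would adjust the weights and biases to reduce `loss` over many samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths as drawn in FIG. 3: input 71, intermediate 72-74, output 75.
sizes = [6, 4, 2, 4, 6]

# One weight matrix and bias per mapping f, g, j, k (untrained, random).
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

def forward(x):
    """Return the activations of every layer, input first."""
    activations = [x]
    for W, b in zip(weights, biases):
        # tanh is an assumed non-linearity; the text only says "non-linear".
        activations.append(np.tanh(W @ activations[-1] + b))
    return activations

I = rng.uniform(-1.0, 1.0, size=6)          # six-dimensional input data
acts = forward(I)
h, e, I_r = acts[1], acts[2], acts[4]       # h = f(I), e = g(h), I_r = k(j(e))

# Quantity minimised during training: squared reconstruction error.
loss = float(np.sum((I - I_r) ** 2))
```

After training, only the first two mappings are needed at run time: `e` is the low-dimensional code that plays the role of the bird's-eye view data.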

In the object detection system 10, only the encoder portion of an autoencoder 70 that has likewise been sufficiently trained by deep learning is used, and the bird's-eye view data 41 is obtained by taking the encoded data out of the intermediate layers 72 to 74.

Accordingly, in the object detection system 10, as shown in FIG. 4(A), the autoencoder 42 encodes each slice piece 31b, which is the input data, in an encoder portion 42a to generate the dimension-reduced bird's-eye view data 41, and then decodes the bird's-eye view data 41 in a decoder portion 42b to generate reconstructed data 43. The autoencoder is trained by deep learning so that the reconstructed data has the same features, with respect to objects, as the input slice piece 31b.
After this training, during actual operation, the autoencoder 42 uses only the encoder portion 42a, as shown in FIG. 4(B), to encode the 80 slice pieces 31b of one depth slice 31a described above and generate the dimension-reduced bird's-eye view data 41. Each slice piece 31b is encoded to a low dimension by the autoencoder 42 and is mapped, as a feature vector representing the presence of an object, to a low-dimensional feature space that is not sparse (hereinafter called a non-sparse feature space). Because the slice pieces 31b are divided in the horizontal direction, spatial information in the horizontal direction is retained when they are mapped to the non-sparse feature space, and the series of feature vectors represents the presence of objects in the original captured image with high accuracy.
In this way, for one scene, the autoencoder 42 generates bird's-eye view data 41 consisting of a series of feature vectors for each depth slice 31a. In this bird's-eye view data 41, each depth slice 31a represents a depth position and each feature vector represents a horizontal position. Combining these series of feature vectors forms a tensor that serves as the bird's-eye view data 41.

For a monochrome image of 1280 × 960 pixels, a matrix $I_{1280,960}$ is handled, each element of which lies in $[0, 255]$. By combining this matrix $I$ with the depth information, that is, by adding to each pixel its distance $d_{i,j}$, a tensor of size $[1280, 960, 2]$ is obtained. A number of depth intervals $n_d$ (here $n_d = 64$) is then defined over the range from 0 m to the maximum depth, and the tensor is divided into $n_d$ matrices. Each matrix $a^i$ takes in all elements whose depth lies within the range defined by the $i$-th depth interval. For example, if the first depth interval is 0 m to 1 m, the first matrix $a^0$ is generated by taking the intensity information of all tensor elements whose depth is within this range (0 m to 1 m) and filling the remaining space with zeros. In this way, $n_d$ sparse matrices are obtained.
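The division of the intensity-plus-depth tensor into $n_d$ sparse matrices can be sketched as follows. This is a minimal illustration: the function name is hypothetical, equal-width half-open intervals are an assumption (the text only gives 0 m to 1 m as an example), and a tiny 2 × 3 image with $n_d = 4$ stands in for the 1280 × 960 image and $n_d = 64$ of the description.

```python
import numpy as np

def depth_slices(intensity, depth, n_d, max_depth):
    """Split an intensity image into n_d sparse matrices a^i by depth.

    a^i keeps the intensity of every pixel whose depth falls in the i-th
    interval [edges[i], edges[i+1]) and is zero everywhere else.
    """
    edges = np.linspace(0.0, max_depth, n_d + 1)
    slices = np.zeros((n_d,) + intensity.shape, dtype=intensity.dtype)
    for i in range(n_d):
        mask = (depth >= edges[i]) & (depth < edges[i + 1])
        slices[i][mask] = intensity[mask]
    return slices

# Toy data: intensities in [0, 255] and per-pixel distances in metres.
intensity = np.array([[10, 20, 30],
                      [40, 50, 60]])
depth = np.array([[0.5, 1.5, 2.5],
                  [3.5, 0.2, 9.0]])

a = depth_slices(intensity, depth, n_d=4, max_depth=4.0)
```

Here `a[0]` holds the two pixels nearer than 1 m, and the pixel at 9.0 m lies beyond the maximum depth and appears in no slice.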

Each of these sparse matrices $a^i$ represents a depth. Each sparse matrix $a^i$ is further divided into $n_w$ columns of width $w$: with $w = 16$, a matrix $a^i$ of size $(1280, 960)$ becomes 80 matrices of size $(16, 960)$. These new, smaller matrices $b^{i,j}$ represent horizontal positions. For example, the matrix $b^{0,0}$ represents the leftmost information of the image in the range 0 m to 1 m. Each matrix $b^{i,j}$ is then encoded as described above by the previously trained autoencoder. The expression $b'^{i,j} = g(f(b^{i,j}))$ then yields a representation in a smaller latent space; here each matrix is encoded into a vector of size 500.
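The split of one depth-slice matrix $a^i$ into $n_w = 80$ strips $b^{i,j}$ of width $w = 16$ can be sketched with a single reshape. The function name is hypothetical, and the layout follows the text in treating the first axis (length 1280) as the horizontal one. The encoding step $g(f(\cdot))$ is not shown because it depends on the trained autoencoder.

```python
import numpy as np

def column_strips(a_i, w):
    """Split a (width, height) matrix into width // w strips of shape (w, height).

    Strip j covers horizontal positions j*w .. (j+1)*w - 1.
    """
    width, height = a_i.shape
    if width % w != 0:
        raise ValueError("image width must be a multiple of the strip width")
    return a_i.reshape(width // w, w, height)

# Stand-in depth slice with the sizes quoted in the text.
a_i = np.arange(1280 * 960, dtype=np.float32).reshape(1280, 960)
b_i = column_strips(a_i, w=16)   # b_i[j] corresponds to b^{i,j}
```

Because the reshape only regroups rows, `b_i` is a view on `a_i` and no pixel data is copied.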

Note that only two functions, f and g, are shown here for clarity; the number of functions may be three or more, since it grows with the number of hidden layers chosen for the neural network. From these vectors b'^{i,j}, where i denotes depth and j denotes horizontal position, the operation of the following formula (1) is performed, with ∩ defined as the concatenation operator that joins matrices. Concatenating over depth in this way yields the matrices c^j (80 in this case), each of which represents one horizontal position.
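Formula (1) is not reproduced in this text, so the following is only one plausible reading of the concatenation over depth: for each horizontal index j, the vectors for all depth indices i are stacked as the rows of one matrix. The name `group_by_horizontal_position` and the row-stacking layout are assumptions of this sketch.

```python
def group_by_horizontal_position(features):
    """Regroup per-piece feature vectors by horizontal position.

    features[i][j] is the feature vector for depth index i and horizontal
    index j. The result c satisfies c[j][i] == features[i][j], so each c[j]
    is a matrix with one row per depth interval (assumed layout).
    """
    n_d = len(features)
    n_w = len(features[0])
    return [[features[i][j] for i in range(n_d)] for j in range(n_w)]


# Two depth intervals, three horizontal positions, vectors of length 2.
features = [[[i, j] for j in range(3)] for i in range(2)]
c = group_by_horizontal_position(features)
print(len(c), len(c[0]))  # number of matrices, rows per matrix
print(c[1])               # matrix for horizontal position j = 1
```

With the sizes given in the text, this would produce 80 matrices, each with 64 rows.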

Each of these matrices c^j is input to a neural network trained by deep learning, and classification is performed. This neural network learns in which depth interval an object is present, or that no object is present at all. That is, the neural network is trained by deep learning to represent the function h(c^j) = μ^j, where μ^j ∈ [0, n_d + 1] is an integer. Each μ^j is then converted into a vector v^j of length n_d + 1 that is zero everywhere except at the μ^j-th element, and for each j the matrix M is generated from the output of the neural network by the following formula (2).

This matrix M, of size (n_w, n_d), has a 1 in each column (horizontal position) at the row corresponding to the distance at which an object is present. This matrix M is the final output of the neural network.
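The conversion from class labels to the matrix M can be sketched as below. Formula (2) is not reproduced in this text, so this is an assumed reading: each label becomes a one-hot vector, and the vectors are placed side by side so that column j belongs to horizontal position j. The names `one_hot` and `build_m` are hypothetical, labels are treated as zero-based indices, and the text does not say which label stands for "no object".

```python
def one_hot(mu, length):
    """Return a vector of zeros with a single 1 at index mu."""
    if not 0 <= mu < length:
        raise ValueError("label out of range")
    v = [0] * length
    v[mu] = 1
    return v


def build_m(mus, length):
    """Build M so that column j is the one-hot vector of label mus[j]."""
    vectors = [one_hot(mu, length) for mu in mus]
    # Transpose: rows index the depth class, columns the horizontal position.
    return [[v[k] for v in vectors] for k in range(length)]


# Three horizontal positions with labels 2, 0 and 3 out of four classes.
for row in build_m([2, 0, 3], length=4):
    print(row)
```

Each column contains exactly one 1, which is what lets the result be read as a plan view with one detected depth per horizontal position.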

The analysis means 50 detects an object by extracting cross sections at a plurality of locations in the height direction as slice data, based on the bird's-eye view data 41 from the bird's-eye view data generation means 40. The analysis means 50 is composed of a convolutional neural network and, under the control of the control unit 60, detects objects in the imaging screen from the sequentially input bird's-eye view data 41. Convolutional neural networks are widely used for image recognition by machine learning and can perform image recognition with high accuracy.

Specifically, the analysis means 50 classifies, for the bird's-eye view data 41, at which of all the depth positions an object exists. The analysis means 50 sweeps the feature vectors corresponding to the 80 slice pieces 31b described above from one side (for example, the left) to the other (for example, the right) in the horizontal direction, and determines, for each horizontal position, the depth position at which an object exists.

The criteria by which the analysis means 50 judges the presence of an object are learned in advance by deep learning and are set as appropriate according to the type of object. The user of the object detection system 10 can thus define obstacles by object type, and the object detection system 10 learns to recognize those object types as obstacles. Specifically, in an urban environment that includes roads on which automobiles travel, the objects to be detected are vehicles, people, sidewalks, and the like, while in a construction zone where construction vehicles and workers come and go, the objects to be detected are construction vehicles and workers. Criteria for the objects to be detected are defined to suit each such zone environment. For example, in an urban environment, the nearest object in each slice piece 31b is located and marked by its distance, and the distance of the marked object is classified into levels from 0 to n_d + 1. These classes serve as the target classes for training the neural network.

The object detection system 10 of the present embodiment is configured as described above and operates as follows, in accordance with the flowchart of FIG. 5.
In step ST1, an image is captured by a monocular camera serving as the imaging means 20, and in step ST2 depth estimation for the monocular camera is performed; the color intensity of the monochrome image is obtained as shown in step ST3, and the depth is obtained in step ST4. When the imaging means 20 is a stereo camera, color imaging is performed in step ST1a and depth estimation in step ST2a; when the imaging means 20 is a LIDAR, imaging is performed in step ST1b.

Next, in step ST5, the color intensity values are projected onto the corresponding three-dimensional points. In step ST6 the three-dimensional image 31 is sliced in the depth direction, and in step ST7 two-dimensional depth slices 31a are obtained.
In step ST8, each depth slice 31a is then sliced into pieces of a predetermined width, each representing a horizontal direction, and in step ST9 the slice pieces 31b are obtained. In step ST10, each slice piece 31b is input to the autoencoder and nonlinearly encoded to reduce its dimension, which yields the feature vectors in step ST11. In step ST12, the feature vectors are concatenated over depth to form two-dimensional matrices, each representing a horizontal direction, which yields the feature matrices in step ST13. Finally, in step ST14, the feature matrices are input to the neural network and each feature matrix is classified. As a result, in step ST15, the class assigned for each horizontal direction corresponds to a depth level indicating the depth at which an object exists.

The autoencoder 42 of the bird's-eye view data generation means 40 operates, during training and during operation, as shown in the flowchart of FIG. 6.
First, in step ST21, a two-dimensional slice piece 31b is input to the first layer of the neural network in the autoencoder 42, and in step ST22 it becomes hidden-layer feature data 1 through nonlinear encoding. It is then input to the second layer in step ST23 and becomes hidden-layer feature data 2 through nonlinear encoding in step ST24. Nonlinear encoding continues layer by layer in the same way, and when the data is input to the n-th layer of the neural network in step ST26, it becomes the encoded feature vector in step ST27.

The feature vector is then input to the (n+1)-th layer in step ST28 and becomes hidden-layer feature data (n+1) through nonlinear encoding in step ST29. Nonlinear encoding again continues layer by layer, and when the data is input to the 2n-th layer of the neural network in step ST30, it becomes a reconstructed two-dimensional slice piece through nonlinear encoding in step ST31.
As shown in FIG. 7, a large number of sample data are input repeatedly, and the autoencoder is trained by deep learning so that the error between the input data (the slice piece 31b) and the reconstructed data (the reconstructed two-dimensional slice piece) is minimized. During operation of the autoencoder, the feature vector of step ST27 is analyzed by the analysis means 50, as shown in step ST32, and objects are detected. Such deep-learning training of the autoencoder is performed using, for example, several thousand or more sample data.
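The text does not say which error measure is minimized during training. As an illustration only, the sketch below uses the mean squared error between a slice piece and its reconstruction, a common choice for autoencoders. The function name `reconstruction_error` and the list-of-rows layout are assumptions.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between two equally shaped 2D pieces (assumed metric)."""
    total = 0.0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            total += (o - r) ** 2
            count += 1
    return total / count


# Two of the four pixels differ by 2, so the mean of the squares is 2.0.
print(reconstruction_error([[0, 2], [4, 6]], [[0, 0], [4, 4]]))  # 2.0
```

Training would adjust the layer weights to drive this value down over the sample data. At run time only the encoder half is needed, since the feature vector of step ST27 is what the analysis means consumes.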

In this way, in the object detection system 10, the analysis means 50 detects, for the imaging screen shown in FIG. 2(A), the horizontal position of and the distance to the nearest object, as shown schematically in plan view in FIG. 8(A). Seen from the imaging position of the imaging means 20, this detection result in practice locates objects within a fan-shaped region, as shown in FIG. 8(B).
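The fan-shaped layout follows from camera geometry: each horizontal slot corresponds to a viewing direction, so equal depths at different slots land at different lateral offsets. The sketch below maps a slot index and a forward depth to ground-plane coordinates under a pinhole camera model. The 90 degree horizontal field of view is a hypothetical value (the text gives none), and treating the detected depth as the distance along the optical axis is likewise an assumption.

```python
import math


def column_to_ground_point(j, depth_m, n_w=80, hfov_deg=90.0):
    """Return (x, z) in metres for horizontal slot j at forward depth depth_m.

    Assumes a pinhole camera with horizontal field of view hfov_deg
    (hypothetical) and n_w equal-width slots across the image.
    """
    # Normalised horizontal coordinate of the slot centre, in [-1, 1].
    u = ((j + 0.5) / n_w) * 2.0 - 1.0
    # For a pinhole camera, x / z = u * tan(hfov / 2).
    x = depth_m * u * math.tan(math.radians(hfov_deg) / 2.0)
    return x, depth_m


# The leftmost, a central, and the rightmost slot at 10 m forward depth.
for j in (0, 40, 79):
    print(j, column_to_ground_point(j, 10.0))
```

The lateral spread grows linearly with depth, which is why the rectangular grid of FIG. 8(A) opens into a sector when drawn in real-world coordinates.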

Next, an example of object detection on an actual imaging screen is described.
A plurality of depth slices 31a are generated for one three-dimensional image 31, and each depth slice 31a (960 × 1280 pixels) is divided in the horizontal direction into a plurality of slice pieces 31b (960 × 16 pixels), as shown in FIG. 9(A). Encoding each slice piece 31b yields as many feature vectors as there are slice pieces 31b, as shown in FIG. 9(B). Once all the depth slices 31a of the three-dimensional image 31 have been encoded, a set of 80 feature vectors is obtained for each depth slice 31a, as shown in FIG. 9(C). Finally, taking the feature vector for each horizontal position from each depth slice 31a and combining them yields a tensor composed of a series of feature vectors, as shown in FIG. 9(D).

The object detection system 10 operates as described above. A specific example follows in which the analysis means 50 combines the vectors output from the neural network to obtain the distance to the nearest object at every horizontal position and thereby detects the position of the nearest object in the scene.

Since the bird's-eye view data 41 is a matrix containing, for one scene, one feature vector for every slice piece 31b, the analysis means 50 inputs the bird's-eye view data 41 to a neural network trained by deep learning and classifies it in order to detect the depth layer in which the nearest object exists.

FIG. 10 shows, step by step, the procedure for creating the bird's-eye view data 41 in the object detection system 10 of FIG. 1; FIG. 11 shows, step by step, the analysis procedure of the neural network of the analysis means 50 in the object detection system 10 of FIG. 1; and FIG. 12 shows part of the series of vectors obtained by the analysis procedure of FIG. 11.
As shown at the left end of FIG. 10(A), in the above matrix the feature vectors arranged in the horizontal direction H (corresponding to one set of slice pieces 31b) are aligned along the depth direction D for each depth slice 31a. From the vectors making up this matrix, the analysis means 50 takes out the feature vectors aligned in the depth direction D at each horizontal position, as shown in FIG. 10(B), and combines them as shown in FIG. 10(C), thereby generating a tensor composed of a series of feature vectors, as shown in FIG. 10(D).

As shown in FIG. 11, the analysis means 50 processes this tensor to detect objects using, for example, a five-layer neural network (see Non-Patent Document 1), such as a convolutional neural network, preferably a perceptron.
In FIG. 11, taking the imaging screen to have width w = 24 and height h = 370 pixels, with minimum height h_min = 140, the first layer of the neural network convolves the 24 × 370 × 3 input image with 64 filters of size 11 × 5 × 3 at every pixel position (stride 1). The second layer uses 200 kernels of size 5 × 3 × 64. The max-pooling layers compute the maximum over disjoint regions of size 8 × 4 for the first layer and 4 × 3 for the second layer; that is, the pooling regions do not overlap. The fully connected hidden layers (the third and fourth layers) have 1024 and 2048 neurons, respectively, and the output layer (the fifth layer) has 50 neurons.
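The layer sizes quoted above fix some of the weight counts directly, which gives a rough feel for where the parameters sit. The sketch below is plain arithmetic on the stated sizes. Bias terms are left out, and the input size of the first fully connected layer is not computed because the text does not state the padding. The helper names are hypothetical.

```python
def conv_weights(n_filters, k_a, k_b, in_channels):
    """Weights in a convolution layer: one k_a x k_b x in_channels kernel per filter."""
    return n_filters * k_a * k_b * in_channels


def dense_weights(n_in, n_out):
    """Weights in a fully connected layer from n_in to n_out neurons."""
    return n_in * n_out


print(conv_weights(64, 11, 5, 3))    # first layer:  10560
print(conv_weights(200, 5, 3, 64))   # second layer: 192000
print(dense_weights(1024, 2048))     # third to fourth layer: 2097152
print(dense_weights(2048, 50))       # fourth to output layer: 102400
```

On these figures the fully connected layers hold most of the weights, as is typical for networks of this shape.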

The vector output from the output layer (right end of FIG. 11) is a vector V whose elements are all 0 except for one box (the portion filled in black at the right end of FIG. 11), which marks the position where the neural network estimates the nearest object to be. The elements of this vector V are spaced in units of meters, so the position of the black portion represents the distance to the detected object. This processing is repeated for every matrix in the scene.

As shown in FIG. 12, by combining the vectors output from the neural network, the analysis means 50 obtains the distance to the nearest object at every horizontal position and can detect the position of the nearest object in the scene. The analysis means 50 thus determines, based on the bird's-eye view data 41, whether an object exists in the scene and estimates the distance to it.
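Combining the per-position output vectors can be sketched as follows. Several details are assumptions for illustration: each vector is one-hot, the index of the 1 times a bin size in meters is taken as the distance (the lower edge of the bin), and "no object" is represented either by an all-zero vector or by a designated index passed as `no_object_index`. The function names are hypothetical.

```python
def nearest_per_column(vectors, bin_size_m=1.0, no_object_index=None):
    """Return the detected distance in metres for each horizontal position.

    A position with no detection yields None.
    """
    distances = []
    for v in vectors:
        k = v.index(1) if 1 in v else None
        if k is None or k == no_object_index:
            distances.append(None)
        else:
            distances.append(k * bin_size_m)
    return distances


def closest_object(distances):
    """Return (distance, horizontal index) of the closest detection, or None."""
    hits = [(d, j) for j, d in enumerate(distances) if d is not None]
    return min(hits) if hits else None


vectors = [[0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
distances = nearest_per_column(vectors)
print(distances)                  # [3.0, 1.0, None]
print(closest_object(distances))  # (1.0, 1)
```

The first function answers "how far is the nearest object at each horizontal position", and the second picks the single closest object in the scene.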

As described above, according to the object detection system 10 of the present invention, the bird's-eye view data generation means 40 converts each slice image of the three-dimensional image 31, which the image generation means generates from the imaging data of the imaging means 20, into bird's-eye view data of reduced dimension. The amount of data decreases in proportion to the reduction in dimension, so the analysis means 50 detects objects more quickly. Therefore, when objects ahead are detected from a three-dimensional image 31 of the forward view of an automobile, for example, objects in the forward view can be detected successively as the automobile travels, and objects such as obstacles can be avoided. Moreover, because objects are detected based on cross sections of the three-dimensional image 31 in the depth direction, the accuracy of detection in the depth direction improves and the distance to an object can be determined more accurately.

The bird's-eye view data generation means 40 is composed of an autoencoder 42 consisting of a convolutional neural network, and the autoencoder 42 reduces the dimension of each slice image to generate the bird's-eye view data 41. The convolutional neural network is a multilayer neural network consisting of an input layer 71, at least one intermediate layer 72 to 74, and an output layer 75. During training, each slice image input to the input layer 71 is converted into low-dimensional intermediate data in one of the intermediate layers 72 to 74 and then decoded in the output layer 75 into reconstructed data of the same dimension as the slice image, and the network is trained by deep learning so that the reconstructed data can reproduce the objects in the slice image. After training, the intermediate data from the intermediate layers 72 to 74 is output to the analysis means 50 as the bird's-eye view data 41.

With this configuration of the bird's-eye view data generation means 40, a neural network is used, and training the neural network sufficiently allows the bird's-eye view data 41 to be generated with higher accuracy, so objects are detected with higher accuracy.

The bird's-eye view data generation means 40 further slices each slice image in the horizontal direction to generate slice pieces 31b and reduces the dimension of these slice pieces 31b to generate the bird's-eye view data 41. Dividing each slice image in the horizontal direction gives a degree of control over the horizontal direction in the subsequent dimension reduction, so objects can be detected with higher accuracy in the horizontal direction. In addition, processing the slice pieces 31b one after another in sequence allows one three-dimensional image 31 to be converted into the bird's-eye view data 41 more quickly.

The bird's-eye view data 41 consists of feature vectors obtained by treating each slice image or slice piece 31b as a vector and mapping it to a feature space that is not sparse (hereinafter, a non-sparse feature space). Because the bird's-eye view data 41 consists of feature vectors rather than a visible bird's-eye view image, the time needed for conversion into the bird's-eye view data 41 is shortened further, and the bird's-eye view data is generated in a short time.

The analysis means 50 is composed of a convolutional neural network and is trained by deep learning. With sufficient deep-learning training, the three-dimensional image 31 is converted into the bird's-eye view data 41 more accurately, and objects can be detected with higher accuracy based on this bird's-eye view data 41. The invention is described in more detail below by way of examples.

The imaging means 20 and the control unit 60 of the object detection system 10 used a computer with the following configuration.
Imaging means: stereo camera (manufactured by ZMP Inc., model: Robovision 2)
Control unit:
CPU: Intel (registered trademark), model: Core (registered trademark) i7-8700
RAM (random access memory): 32 GB
Storage device: 1 TB
GPU: NVIDIA (registered trademark), model: GeForce (registered trademark) RTX2070, RAM: 8 GB

FIG. 13 shows an experimental example of object detection at a construction site. As shown in FIG. 13(A), two workers A and B are visible in the imaging screen 21a, and the remaining area is the "drivable area" for construction vehicles. The color imaging signal 21 from the stereo camera has 1280 × 960 input pixels, which were downscaled in software to 640 × 480. The frame rate was 12.5 frames per second (fps). Object detection on this imaging screen 21a by the object detection system 10 gave the detection result shown in FIG. 13(B). The reconstructed two-dimensional slice piece shown in FIG. 13(B) (see step ST31 in FIG. 6) has 80 × 60 pixels, and the computation time to obtain the output image of FIG. 13(B) was 8 ms.
In this detection result, the x-axis represents horizontal position and the y-axis represents depth. Where no object is detected the background stays black, but where an object is detected (here, the two workers A and B), the portion from the nearest distance outward at that horizontal position is displayed in a slightly white tone, showing that an object is present. In FIG. 13(B), the two workers A and B are each clearly detected, and it can be confirmed that each is placed at a depth corresponding to its distance.
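As a quick check on the reported figures, the frame period at 12.5 fps can be compared with the 8 ms computation time. The helper `frame_budget` is hypothetical, and the text does not state whether the 8 ms covers every stage of the pipeline, so the result is only a rough indication of headroom.

```python
def frame_budget(fps, compute_ms):
    """Return the frame period in ms and the fraction of it used by compute."""
    period_ms = 1000.0 / fps
    return period_ms, compute_ms / period_ms


period_ms, fraction = frame_budget(12.5, 8.0)
print(period_ms)  # 80.0 ms between frames
print(fraction)   # about 0.1 of the frame period
```

At these values the computation occupies roughly a tenth of the time between frames.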

FIG. 14 shows an experimental example of object detection in an urban environment, acquired under the same conditions as FIG. 13(A). The aim was to detect, on city roads, all obstacles including pedestrians, sidewalks, trees, and vehicles, excluding only the obstacle-free road available for driving.
As shown in FIG. 14(A), the imaging screen 21a shows the view ahead captured from a vehicle traveling on the road, in which the vehicle ahead C, the sidewalk D at the left edge, and the road boundary fence E on the right are visible. The color imaging signal 21 from the stereo camera has 1280 × 960 input pixels, which were downscaled in software to 640 × 480. The frame rate was 12.5 frames per second (fps). Object detection on this imaging screen 21a by the object detection system 10 gave the detection result shown in FIG. 14(B). The reconstructed two-dimensional slice piece shown in FIG. 14(B) (see step ST31 in FIG. 6) has 80 × 60 pixels, and the computation time to obtain the output image of FIG. 14(B) was 8 ms.
In FIG. 14(B), it can be seen that the vehicle ahead C, the sidewalk D at the left edge, and the road boundary fence E are each detected. In this case, with the imaging screen 21a captured at 12.5 fps from the moving vehicle, a good value of about 88% was obtained for Intersection over Union (referred to as IoU accuracy), an evaluation metric for object detection. To confirm the actual distance to and position of an object, a simple projection into three-dimensional space is required, as shown in FIG. 8(B).
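Intersection over Union is the ratio of the overlap between prediction and ground truth to their combined extent. The text does not say on what representation the 88% figure was computed, so the sketch below assumes binary occupancy grids of equal shape. The function name `iou` and the convention of returning 1.0 when both grids are empty are choices made for this example.

```python
def iou(pred, truth):
    """Intersection over Union of two equally shaped binary grids."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty grids agree perfectly.
    return inter / union if union else 1.0


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
print(iou(pred, truth))  # 2 shared cells out of 4 occupied cells: 0.5
```

A value of about 0.88 on this scale would mean that predicted and true occupied cells overlap on roughly 88% of their union.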

The present invention can be implemented in various forms without departing from its spirit. For example, although a stereo camera is used as the imaging means 20 in the embodiment described above, the three-dimensional image 31 can also be obtained using, for example, a forward-monitoring LIDAR of the kind used in self-driving vehicles, or using a monocular camera, by combining the image captured by the monocular camera with a point cloud by a conventionally known method.

10: Object detection system, 20: Imaging means, 21: Imaging signal,
21a: Imaging screen, 30: Three-dimensional image generation means, 31: Three-dimensional image, 31a: Depth slice, 31b: Slice piece, 40: Bird's-eye view data generation means, 41: Bird's-eye view data, 42: Autoencoder, 42a: Encoder part,
42b: Decoder part, 50: Analysis means, 60: Control unit, 70: Autoencoder, 71: Input layer, 72 to 74: Intermediate layers, 75: Output layer

Claims

An object detection system comprising: an imaging means; an image generation means that generates a three-dimensional image based on imaging data acquired by the imaging means; a bird's-eye view data generation means that generates bird's-eye view data by bird's-eye view processing based on the three-dimensional image generated by the image generation means; and an analysis means that, based on the bird's-eye view data, detects the depth at which an object exists for each divided portion under horizontal division and depth division, and detects the object by combining the detection results,
wherein the bird's-eye view data generation means extracts cross sections at a plurality of locations in the depth direction of the three-dimensional image as slice images and reduces the dimension of each slice image to generate the bird's-eye view data, and
wherein the analysis means detects the object by extracting cross sections at a plurality of locations in the height direction as slice data based on the bird's-eye view data.

The object detection system according to claim 1, wherein the bird's-eye view data generation means is composed of an autoencoder consisting of a convolutional neural network, and
the autoencoder reduces the dimension of each slice image to generate the bird's-eye view data.

The object detection system according to claim 2, wherein the convolutional neural network is a multilayer neural network consisting of an input layer, at least one intermediate layer, and an output layer; during training, each slice image input to the input layer is converted into low-dimensional intermediate data in one of the intermediate layers and then decoded in the output layer into reconstructed data of the same dimension as the slice image, and the network is trained by deep learning so that the reconstructed data can reproduce the object in the slice image; and after training, the intermediate data is output from the intermediate layer to the analysis means as the bird's-eye view data.

The object detection system according to any one of claims 1 to 3, wherein the bird's-eye view data generation means further slices each slice image in the horizontal direction to generate slice pieces and reduces the dimension of the slice pieces to generate the bird's-eye view data.

The object detection system according to any one of claims 1 to 4, wherein the bird's-eye view data is a feature vector mapped to a non-sparse feature space with each slice image or slice piece as a vector.

The object detection system according to any one of claims 1 to 5, wherein the analysis means is composed of a convolutional neural network and is trained by deep learning.

An object detection method comprising: an image generation stage of generating a three-dimensional image based on imaging data; a bird's-eye view data generation stage of generating bird's-eye view data by bird's-eye view processing based on the three-dimensional image generated in the image generation stage; and an analysis stage of detecting, based on the bird's-eye view data, the depth at which an object exists for each divided portion under horizontal division and depth division, and detecting the object by combining the detection results,
wherein, in the bird's-eye view data generation stage, cross sections at a plurality of locations in the depth direction of the three-dimensional image are extracted as slice images and each slice image is reduced in dimension to generate the bird's-eye view data, and
wherein, in the analysis stage, the object is detected by extracting cross sections at a plurality of locations in the height direction as slice data based on the bird's-eye view data.

An object detection program that causes a computer to execute: an image generation procedure of generating a three-dimensional image based on imaging data; a bird's-eye view data generation procedure of generating bird's-eye view data by bird's-eye view processing based on the three-dimensional image generated in the image generation procedure; and an analysis procedure of detecting, based on the bird's-eye view data, the depth at which an object exists for each divided portion under horizontal division and depth division, and detecting the object by combining the detection results,
wherein, in the bird's-eye view data generation procedure, cross sections at a plurality of locations in the depth direction of the three-dimensional image are extracted as slice images and each slice image is reduced in dimension to generate the bird's-eye view data, and
wherein, in the analysis procedure, the computer is caused to detect the direction and distance of the object by extracting cross sections at a plurality of locations in the height direction as slice data based on the bird's-eye view data.