JP7071215B2

JP7071215B2 - Recognition device

Info

Publication number: JP7071215B2
Application number: JP2018099724A
Authority: JP
Inventors: 博翔陳; ホセインテヘラニニキネジャド; ジョンヴィジャイ; 誠一三田; 咲子西野; 和寿石丸
Original assignee: Toyota School Foundation; Denso Corp
Current assignee: Toyota School Foundation; Denso Corp
Priority date: 2018-05-24
Filing date: 2018-05-24
Publication date: 2022-05-18
Anticipated expiration: 2038-05-24
Also published as: JP2019204338A

Description

The present disclosure relates to a recognition device and a recognition method.

Conventionally, as a technique for recognizing objects and regions in a captured image, a method has been disclosed that first detects object candidate regions from a parallax image and then recognizes objects in the image using a luminance image (see, for example, Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Publication No. 2014-197378

However, the technique described in Patent Document 1 has difficulty recognizing objects that are hard to recognize in a parallax image, such as distant objects. A different technique with higher recognition accuracy has therefore been desired.

The present invention can be realized in the following forms.

According to one aspect of the present invention, there is provided a recognition device (100) that recognizes a region and an object by using a trained neural network. The recognition device (100) includes: a captured image feature map extraction unit (104) that extracts a feature map of a captured image containing the region and the object; a distance image feature map extraction unit (105) that extracts a feature map of a distance image containing the region and the object; a feature map connecting unit (106) that connects the feature map extracted from the captured image and the feature map extracted from the distance image; a region segmentation unit (107) that generates a feature map used for segmentation of the region by using the connected feature map, the feature map extracted from the captured image, and the feature map extracted from the distance image; a region output unit (109) that performs semantic segmentation associating the image with the region by using the feature map used for segmentation of the region; an object segmentation unit (108) that generates a feature map used for segmentation of the object by using the connected feature map, the feature map extracted from the captured image, and the feature map extracted from the distance image; and an object output unit (110) that performs semantic segmentation associating the image with the object by using the feature map used for segmentation of the object. The captured image feature map extraction unit, the distance image feature map extraction unit, the feature map connecting unit, the region segmentation unit, and the object segmentation unit are configured by the neural network.

According to the recognition device of this aspect, semantic segmentation is performed from the connected feature map, the feature map extracted from the captured image, and the feature map extracted from the distance image, so recognition accuracy is improved.

FIG. 1 is a functional block diagram of a vehicle equipped with the recognition device.
FIG. 2 is a block diagram showing the configuration of the recognition device.
FIG. 3 is a diagram showing a flowchart of the recognition process.
FIG. 4 is a diagram explaining the captured image feature map extraction unit 104.
FIG. 5 is a diagram explaining the distance image feature map extraction unit 105.
FIG. 6 is a diagram explaining the region segmentation unit 107.
FIG. 7 is a diagram explaining the object segmentation unit 108.
FIG. 8 is a diagram showing an example of a luminance image acquired by the image acquisition unit, together with the region image and the object image after semantic segmentation.
FIG. 9 is a diagram showing a comparative example using FuseNet.
FIG. 10 is a diagram showing a comparative example using U-Net.
FIG. 11 is a block diagram showing the configuration of a recognition device according to a modification.

A. First Embodiment
As shown in FIG. 1, in the present embodiment the recognition device 100 is mounted on a vehicle 10. The recognition device 100 may instead be mounted on something other than a vehicle, such as a ship or a drone.

The vehicle 10 further includes a captured image acquisition unit 21 and a distance image acquisition unit 22. In the present embodiment, both units are mounted so that the area in front of the vehicle 10 is their imaging range. A monocular camera is used as the captured image acquisition unit 21, and a stereo camera is used as the distance image acquisition unit 22. The recognition device 100 acquires the captured image from the captured image acquisition unit 21 and the distance image from the distance image acquisition unit 22.

The recognition device 100 is configured as a well-known computer including a CPU 11 and a memory 12 such as a ROM and a RAM. The recognition device 100 desirably uses a chip dedicated to the convolution operations of a neural network. Using the CPU 11 and the memory 12, the recognition device 100 executes a program stored in the memory 12 and thereby performs the recognition process described later. Specifically, the recognition device 100 uses a trained neural network to recognize objects and regions in the images from the captured image and the distance image acquired by the captured image acquisition unit 21 and the distance image acquisition unit 22. The recognition device 100 inputs the recognition result obtained by the recognition process to the control unit 30 of the vehicle 10, and the control unit 30 controls the operation of the vehicle 10 by using that result.

In the present embodiment, a convolutional neural network (CNN) is used as the neural network, but another type of neural network may be used. Examples of the object in the present embodiment include a vehicle ahead of the vehicle 10 and an obstacle such as a stone. An example of the region in the present embodiment is the area in which the vehicle 10 can travel. In the present embodiment, the neural network is trained in advance on an arbitrary number (for example, 9,000) of captured images and distance images in which the regions and objects have been labeled.

As shown in FIG. 2, the recognition device 100 includes a captured image input unit 101, a distance image input unit 102, a captured image feature map extraction unit 104, a distance image feature map extraction unit 105, a feature map connecting unit 106, a region segmentation unit 107, an object segmentation unit 108, a region output unit 109, and an object output unit 110. Of these, the captured image feature map extraction unit 104, the distance image feature map extraction unit 105, the feature map connecting unit 106, the region segmentation unit 107, and the object segmentation unit 108 constitute the neural network 103. In practice, each of these units is realized by the CPU 11 executing a program stored in advance (mainly matrix operations and convolution operations). The processing performed by each unit is shown in FIG. 3. Because each process is carried out as data is passed along the arrows, the description below treats the processing not as the sequential steps of a flowchart but as a set of blocks, each executing its own process, organized around the data flow.
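The data flow among the units can be sketched as a small dependency graph. This is only an illustration: the reference numerals come from the description, while the dictionary layout and the helper name `processing_order` are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Each key is a unit (by reference numeral); its value is the set of units
# whose output it consumes. The layout is an illustrative assumption.
DATA_FLOW = {
    104: {101},            # captured image feature map extraction <- captured image input
    105: {102},            # distance image feature map extraction <- distance image input
    106: {104, 105},       # feature map connecting unit
    107: {104, 105, 106},  # region segmentation: connected map plus both direct paths
    108: {104, 105, 106},  # object segmentation: same three sources
    109: {107},            # region output
    110: {108},            # object output
}


def processing_order(graph):
    """Return one valid evaluation order in which every unit follows its inputs."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(processing_order(DATA_FLOW))
```

Any order returned by `processing_order` places units 104 and 105 before 106, and 106 before 107 and 108, which matches the arrows described for FIG. 3.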

As shown in FIG. 3, when the recognition process starts, the captured image input unit 101 of the recognition device 100 inputs the captured image acquired from the captured image acquisition unit 21 to the neural network 103. The acquired captured image may be resized, corrected for distortion, or otherwise preprocessed before being input to the neural network 103.

In the present embodiment, a luminance image, in which each pixel has been converted to a luminance value (0 to 255), is used as the captured image, but a color image may be used instead. Using a luminance image reduces the influence of differences in illuminance and, because a luminance image carries less information than a color image, allows faster processing. Using a parallax image as the distance image together with a luminance image as the captured image is preferable because it makes distant objects easier to detect.

The distance image input unit 102 of the recognition device 100 inputs the distance image acquired from the distance image acquisition unit 22 to the neural network 103. The acquired distance image may be resized, corrected for distortion, or otherwise preprocessed before being input to the neural network 103.

In the present embodiment, a parallax image is used as the distance image, but the distance image is not limited to this. For example, a depth image acquired from a depth camera or a distance image acquired from a LIDAR or a millimeter-wave radar may be used. A parallax image is an image in which each pixel is given a value (0 to 255) corresponding to the parallax, and in the present embodiment a larger parallax is rendered brighter. In a parallax image, the extent of an object and the boundaries of a region stand out as features more clearly than in a luminance image, so the accuracy of boundaries in the output can be improved. For example, when a black vehicle is traveling on a black road surface such as asphalt, the boundary between the road surface and the vehicle is difficult to distinguish in a luminance image or a color image, but it becomes clear in a parallax image. In a luminance image or a color image, the appearance of a vehicle also affects the output, and using a distance image mitigates this effect. Using a distance image also mitigates the influence of brightness at the time of imaging. In this way, by using both a captured image and a distance image as input images, the present embodiment lets the two kinds of image complement each other's characteristics.
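The 0 to 255 parallax encoding can be sketched as below. The description only states the value range and that a larger parallax is brighter. The linear scaling, the clipping, and the parameter `max_disparity` are assumptions made for this example.

```python
def disparity_to_gray(disparity, max_disparity):
    """Map raw disparities to 0..255 so that a larger disparity is brighter.

    Linear scaling with clipping is an assumption; the source does not
    specify how the 0..255 values are derived.
    """
    if max_disparity <= 0:
        raise ValueError("max_disparity must be positive")
    image = []
    for row in disparity:
        gray_row = []
        for d in row:
            clipped = min(max(d, 0.0), max_disparity)  # keep within [0, max]
            gray_row.append(round(255 * clipped / max_disparity))
        image.append(gray_row)
    return image


if __name__ == "__main__":
    # Nearer points have a larger disparity and therefore a brighter pixel.
    print(disparity_to_gray([[0.0, 16.0, 64.0, 100.0]], max_disparity=64.0))
```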

Next, the captured image feature map extraction unit 104 of the recognition device 100 extracts a feature map of the captured image. That is, in accordance with the trained model stored in the memory 12, the captured image feature map extraction unit 104 extracts from the captured image a feature map useful for segmenting the objects and regions captured in it.

As shown in FIG. 4, in the present embodiment the captured image feature map extraction unit 104 includes a plurality of imaging blocks SB1 to SB5 (hereinafter also simply called "imaging blocks SB"), each including convolutional layers and a pooling layer. In the present embodiment, each imaging block SB has two convolutional layers and one pooling layer, which process data in this order. In FIG. 3 and the subsequent figures, a convolutional layer is labeled "Conv." and a pooling layer is labeled "Pooling". The number of imaging blocks SB is one more than the number of region blocks described later, so in the present embodiment the captured image feature map extraction unit 104 includes five imaging blocks SB. As shown in FIG. 3, the feature maps extracted by the imaging blocks SB are output to the region segmentation unit 107 and the object segmentation unit 108.
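A single imaging block (two convolutional layers followed by one pooling layer) can be sketched as follows. The 2x2 max pooling with stride 2 is an assumption, since the description does not state the pooling type or size, and the convolutions are passed in as plain callables (`conv1`, `conv2`) so that the example stays self-contained. The function returns both the pre-pooling map, which the later description routes to the segmentation units, and the pooled map, which goes to the next block.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2-D list (even height and width assumed)."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]


def imaging_block(fmap, conv1, conv2):
    """Apply two convolution callables, then pooling.

    Returns (pre_pool, pooled): pre_pool is the output of the last
    convolution, pooled is the output of the pooling layer.
    """
    pre_pool = conv2(conv1(fmap))
    return pre_pool, max_pool_2x2(pre_pool)


if __name__ == "__main__":
    identity = lambda x: x  # placeholder for a trained convolution
    fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    skip, pooled = imaging_block(fmap, identity, identity)
    print(pooled)
```

The distance blocks KB have the same layer layout, so the same sketch applies to them.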

The captured image feature map extraction unit 104 further includes, downstream of the imaging blocks SB, an imaging upsampling block SBU (hereinafter also called the "imaging US block SBU"). The imaging US block SBU includes convolutional layers and an upsampling layer. In FIG. 3 and the subsequent figures, an upsampling layer is labeled "US". In the present embodiment, the imaging US block SBU has two convolutional layers and one upsampling layer, which process data in this order. The feature map extracted by the imaging US block SBU is output to the feature map connecting unit 106.

The distance image feature map extraction unit 105 of the recognition device 100 extracts a feature map of the distance image. That is, in accordance with the trained model stored in the memory 12, the distance image feature map extraction unit 105 extracts from the distance image a feature map useful for segmenting the objects and regions captured in it.

As shown in FIG. 5, in the present embodiment the distance image feature map extraction unit 105 includes a plurality of distance blocks KB1 to KB5 (hereinafter also simply called "distance blocks KB"), each including convolutional layers and a pooling layer. In the present embodiment, each distance block KB has two convolutional layers and one pooling layer, which process data in this order. The number of distance blocks KB is the same as the number of imaging blocks SB, so in the present embodiment the distance image feature map extraction unit 105 includes five distance blocks KB. The feature maps extracted by the distance blocks KB are output to the region segmentation unit 107 and the object segmentation unit 108.

The distance image feature map extraction unit 105 further includes, downstream of the distance blocks KB, a distance upsampling block KBU (hereinafter also called the "distance US block KBU"). The distance US block KBU includes convolutional layers and an upsampling layer. In the present embodiment, the distance US block KBU has two convolutional layers and one upsampling layer, which process data in this order. The feature map extracted by the distance US block KBU is output to the feature map connecting unit 106.

As shown in FIGS. 3 to 5, the feature map connecting unit 106 of the recognition device 100 connects the feature map extracted from the captured image with the feature map extracted from the distance image. Specifically, the feature map connecting unit 106 connects the feature map extracted by the most downstream imaging block SB5, the feature map extracted by the most downstream distance block KB5, the feature map extracted by the imaging US block SBU, and the feature map extracted by the distance US block KBU. More specifically, the feature map connecting unit 106 connects (i) the feature map extracted by the upsampling layer of the imaging US block SBU, (ii) the feature map extracted by the upsampling layer of the distance US block KBU, (iii) the feature map extracted by the convolutional layer preceding the pooling layer in the fifth imaging block SB5, and (iv) the feature map extracted by the convolutional layer preceding the pooling layer in the fifth distance block KB5.
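A quick shape check shows why these four feature maps can be connected. The sketch assumes that convolutions preserve spatial size, that each pooling layer halves height and width, and that the upsampling layer doubles them. None of these factors, nor the channel counts used below, are stated in the description. Under those assumptions, the output of the US block has the same spatial size as the pre-pooling output of the fifth block, so the maps can be joined along the channel axis.

```python
def trace_shapes(height, width, num_blocks=5):
    """Return (pre_pool_sizes, us_block_output_size) for one encoder branch.

    Assumes size-preserving convolutions, 2x pooling and 2x upsampling.
    """
    pre_pool = []
    h, w = height, width
    for _ in range(num_blocks):
        pre_pool.append((h, w))   # size seen by the convolution before pooling
        h, w = h // 2, w // 2     # pooling halves each side
    us_output = (h * 2, w * 2)    # US block: convolutions, then one upsampling
    return pre_pool, us_output


def concat_channels(*feature_maps):
    """Connect (channels, height, width) shapes along the channel axis."""
    sizes = {(h, w) for _, h, w in feature_maps}
    if len(sizes) != 1:
        raise ValueError("spatial sizes differ: %s" % sorted(sizes))
    h, w = sizes.pop()
    return (sum(c for c, _, _ in feature_maps), h, w)


if __name__ == "__main__":
    pre_pool, us_out = trace_shapes(256, 512)
    print(pre_pool[-1], us_out)
    # Hypothetical channel counts for maps (i) to (iv).
    print(concat_channels((64, *us_out), (64, *us_out),
                          (32, *pre_pool[-1]), (32, *pre_pool[-1])))
```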

After that, the recognition device 100 uses the connected feature map, the feature map extracted from the captured image, and the feature map extracted from the distance image to generate a feature map used for segmentation of the region and a feature map used for segmentation of the object. Here, segmentation of a region means designating a specific region in an image in units of pixels, and segmentation of an object means designating a specific object in an image in units of pixels.

In the present embodiment, as shown in FIG. 6, the region segmentation unit 107 includes a plurality of region blocks RB1 to RB4 (hereinafter also simply called "region blocks RB"). In the present embodiment, the region segmentation unit 107 includes four region blocks RB. Each region block RB includes deconvolution layers, an upsampling layer, and a concatenation layer. In FIG. 6 and the subsequent figures, a deconvolution layer is labeled "Deconv." and a concatenation layer is labeled "Concat". In the present embodiment, each region block RB has two deconvolution layers, an upsampling layer, and a concatenation layer, which process data in this order. The concatenation layer of a region block RB concatenates the feature map extracted by the upsampling layer with the feature maps extracted by an imaging block SB of the captured image feature map extraction unit 104 and by a distance block KB of the distance image feature map extraction unit 105.
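One region block can be sketched in plain Python as follows. A feature map is represented here as a list of channels, each a 2-D list. The nearest-neighbor 2x upsampling is an assumption, since the description does not name the upsampling method, and the two deconvolution layers are passed in as placeholder callables (`deconv1`, `deconv2`).

```python
def upsample_2x(channel):
    """Nearest-neighbor upsampling: repeat every pixel twice in both directions."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def region_block(channels, image_skip, distance_skip, deconv1, deconv2):
    """Two deconvolutions, one upsampling, then channel concatenation.

    image_skip and distance_skip are the feature maps received directly
    from an imaging block and a distance block.
    """
    x = deconv2(deconv1(channels))
    x = [upsample_2x(ch) for ch in x]
    # Concatenation layer: stack the channels of all three sources.
    return x + image_skip + distance_skip


if __name__ == "__main__":
    identity = lambda x: x  # placeholder for a trained deconvolution
    out = region_block([[[1]]], [[[0, 0], [0, 0]]], [[[9, 9], [9, 9]]],
                       identity, identity)
    print(len(out), out[0])
```

The object blocks BB described for FIG. 7 have the same layer layout, so the same sketch applies to them.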

The region segmentation unit 107 further includes, downstream of the region blocks RB, a region dropout block RBD (hereinafter also called the "region DO block RBD"). The region DO block RBD includes deconvolution layers and dropout layers. In the present embodiment, the region DO block RBD has two deconvolution layers and two dropout layers arranged alternately, which process data in this order. The feature map extracted by the region DO block RBD is output to the region output unit 109. By including dropout layers, the present embodiment can avoid overfitting.
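The effect of a dropout layer can be sketched as below. The description does not give a dropout rate or say how activations are scaled. The "inverted dropout" scaling used here (dividing kept values by the keep probability during training) is a common convention and is an assumption, as are the parameter names.

```python
import random


def dropout(values, rate, training, rng=random):
    """Randomly zero a fraction `rate` of activations during training.

    At inference time (training=False) the input passes through unchanged.
    Kept values are divided by (1 - rate) so the expected sum is preserved.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(dropout([1.0] * 8, rate=0.5, training=True, rng=rng))
    print(dropout([1.0] * 8, rate=0.5, training=False))
```

Because units are dropped only during training, the trained network used by the recognition device would run with `training=False`.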

The object segmentation unit 108 of the recognition device 100 generates a feature map used for segmentation of the object by using the connected feature map, the feature map extracted from the captured image, and the feature map extracted from the distance image.

In the present embodiment, as shown in FIG. 7, the object segmentation unit 108 includes a plurality of object blocks BB1 to BB4 (hereinafter also simply called "object blocks BB"). In the present embodiment, the object segmentation unit 108 includes four object blocks BB. Each object block BB includes deconvolution layers, an upsampling layer, and a concatenation layer. In the present embodiment, each object block BB has two deconvolution layers, an upsampling layer, and a concatenation layer, which process data in this order. The concatenation layer of an object block BB concatenates the feature map extracted by the upsampling layer with the feature maps extracted by an imaging block SB of the captured image feature map extraction unit 104 and by a distance block KB of the distance image feature map extraction unit 105.

The object segmentation unit 108 further includes, downstream of the object blocks BB, an object dropout block BBD (hereinafter also called the "object DO block BBD"). The object DO block BBD includes deconvolution layers and dropout layers. In the present embodiment, the object DO block BBD has two deconvolution layers and two dropout layers arranged alternately, which process data in this order. The feature map extracted by the object DO block BBD is output to the object output unit 110. By including dropout layers, the present embodiment can avoid overfitting.

Then, the region output unit 109 of the recognition device 100 performs, from the feature map extracted by the region segmentation unit 107, semantic segmentation that associates the image with the region. In the present embodiment, the region output unit 109 performs the semantic segmentation through a conversion that uses a sigmoid activation function and a binary cross-entropy error function.

The object output unit 110 of the recognition device 100 performs, from the feature map extracted by the object segmentation unit 108, semantic segmentation that associates the image with the object. In the present embodiment, the object output unit 110 performs the semantic segmentation through a conversion that uses a sigmoid activation function and a binary cross-entropy error function.
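The two functions named for the output units can be sketched as follows. The sigmoid maps each pixel's score to a value between 0 and 1, and the binary cross-entropy compares such values with 0/1 labels. The description does not say how the per-pixel decision is made, so the 0.5 threshold in `to_mask` and all function names are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def to_mask(scores, threshold=0.5):
    """Turn a 2-D map of raw scores into a per-pixel 0/1 mask (threshold assumed)."""
    return [[1 if sigmoid(s) >= threshold else 0 for s in row] for row in scores]


if __name__ == "__main__":
    print(to_mask([[-2.0, 3.0], [0.5, -0.1]]))
    print(binary_cross_entropy([sigmoid(3.0), sigmoid(-2.0)], [1, 0]))
```

In this sketch the same pair of functions would serve both the region output and the object output, each applied to its own feature map.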

With the above, the image data obtained by the captured image acquisition unit 21 and the distance image acquisition unit 22 has been processed by the recognition device 100, and the recognition process for one set of images captured by the captured image acquisition unit 21 and the distance image acquisition unit 22 ends. The recognition device 100 inputs the recognition result obtained by the recognition process to the control unit 30 of the vehicle 10. The processing described above is repeated for as long as the captured image acquisition unit 21 and the distance image acquisition unit 22 continue imaging.

FIG. 8 shows an example of a luminance image acquired by the captured image acquisition unit 21, together with the region image and the object image after semantic segmentation. In FIG. 8, the vehicle ahead is recognized as the object, and the area in which the vehicle can travel is recognized as the region. As FIG. 8 shows, the boundary of the object and the boundary of the region are clearly separated.

In the present embodiment, the feature map extracted by each imaging block SB is output to the next imaging block SB downstream and also to a region block RB and an object block BB. Let K be the number of region blocks RB and let n be an integer from 1 to K. The n-th imaging block SB counted from upstream outputs a feature map to the (n+1)-th imaging block SB and also to the (K-n+1)-th region block RB and the (K-n+1)-th object block BB counted from upstream. The (n+1)-th imaging block SB receives the feature map extracted by the pooling layer of the n-th imaging block SB, whereas the (K-n+1)-th region block RB and the (K-n+1)-th object block BB receive the feature map extracted by the convolutional layer preceding the pooling layer of the n-th imaging block SB. Because the present embodiment has four region blocks, the first imaging block SB1 (n = 1), for example, outputs to the second imaging block SB2 and also to the fourth region block RB4 and the fourth object block BB4.

Likewise, in the present embodiment, the feature map extracted by each distance block KB is output to the next distance block KB downstream and also to a region block RB and an object block BB. Let K be the number of region blocks RB and let n be an integer from 1 to K. The n-th distance block KB counted from upstream outputs a feature map to the (n+1)-th distance block KB and also to the (K-n+1)-th region block RB and the (K-n+1)-th object block BB counted from upstream. The (n+1)-th distance block KB receives the feature map extracted by the pooling layer of the n-th distance block KB, whereas the (K-n+1)-th region block RB and the (K-n+1)-th object block BB receive the feature map extracted by the convolutional layer preceding the pooling layer of the n-th distance block KB. Because the present embodiment has four region blocks, the first distance block KB1 (n = 1), for example, outputs to the second distance block KB2 and also to the fourth region block RB4 and the fourth object block BB4.
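The routing rule stated above (the n-th imaging or distance block feeds the (K-n+1)-th region and object blocks) can be tabulated with a few lines of code. The function and variable names are hypothetical. Only the index rule and K = 4 come from the description.

```python
def skip_target(n, num_decoder_blocks=4):
    """Return the 1-based index of the region/object block fed by encoder block n."""
    if not 1 <= n <= num_decoder_blocks:
        raise ValueError("n must be between 1 and K")
    return num_decoder_blocks - n + 1


def routing_table(num_decoder_blocks=4):
    """Map each imaging/distance block to the region and object blocks it feeds."""
    table = {}
    for n in range(1, num_decoder_blocks + 1):
        k = skip_target(n, num_decoder_blocks)
        table[(f"SB{n}", f"KB{n}")] = (f"RB{k}", f"BB{k}")
    return table


if __name__ == "__main__":
    for sources, targets in routing_table().items():
        print(sources, "->", targets)
```

With K = 4, the most upstream encoder blocks, which have the highest resolution, feed the most downstream segmentation blocks.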

In other words, in the neural network of the present embodiment, in addition to the feature map connected by the feature map connecting unit 106, the feature maps extracted by the captured image feature map extraction unit 104 and the distance image feature map extraction unit 105 are output directly to the region segmentation unit 107 and the object segmentation unit 108 without passing through the feature map connecting unit 106. In a general neural network, errors propagate less readily as the number of layers increases, so learning efficiency falls and the boundaries of objects and regions become blurred. In the present embodiment, by contrast, the region segmentation unit 107 and the object segmentation unit 108 on the output side of the network collate information from the captured image feature map extraction unit 104 and the distance image feature map extraction unit 105 on the input side, which carry a large amount of boundary information. As a result, the boundaries of objects are not blurred and accuracy can be improved.

ここで、本実施形態のニューラルネットワークの構造は、既知の他の構造とは異なる。図９に示すＦｕｓｅＮｅｔを用いた比較例１００Ｙは、本実施形態と比較して、（ｉ）領域セグメンテーション部１０７及び物体セグメンテーション部１０８の代わりにセグメンテーション部１０７Ｙを備え、（ｉｉ）領域出力部１０９及び物体出力部１１０を備える代わりに出力部１０９Ｙを備える点で異なる。この相違点によって、本実施形態のニューラルネットワークの構造は、比較例１００Ｙと比較して、物体や領域の境界がより明確となる。 Here, the structure of the neural network of the present embodiment differs from other known structures. Comparative Example 100Y using FuseNet, shown in FIG. 9, differs from the present embodiment in that (i) it includes a segmentation unit 107Y instead of the region segmentation unit 107 and the object segmentation unit 108, and (ii) it includes an output unit 109Y instead of the region output unit 109 and the object output unit 110. Owing to these differences, the structure of the neural network of the present embodiment makes the boundaries of objects and regions clearer than Comparative Example 100Y does.

また、図１０に示すＵ－Ｎｅｔを用いた比較例１００Ｚは、本実施形態と比較して、距離画像入力部１０２、距離画像特徴マップ抽出部１０５、及び物体セグメンテーション部１０８、物体出力部１１０を備えず、領域セグメンテーション部１０７の代わりにセグメンテーション部１０７Ｚを備え、領域出力部１０９の代わりに出力部１０９Ｚを備える点が異なる。この相違点によって、本実施形態のニューラルネットワークの構造は、撮像画像特徴マップ抽出部１０４において撮像画像の特徴マップの抽出に特化しているとともに、距離画像特徴マップ抽出部１０５において距離画像の特徴マップの抽出に特化している点で比較例１００Ｚと異なる。この結果、本実施形態のニューラルネットワークの構造は、比較例１００Ｚと比較して、物体の境界と領域の境界との境界がより明確となる。 Further, Comparative Example 100Z using U-Net, shown in FIG. 10, differs from the present embodiment in that it does not include the distance image input unit 102, the distance image feature map extraction unit 105, the object segmentation unit 108, or the object output unit 110, and in that it includes a segmentation unit 107Z instead of the region segmentation unit 107 and an output unit 109Z instead of the region output unit 109. Owing to these differences, the neural network of the present embodiment differs from Comparative Example 100Z in that the captured image feature map extraction unit 104 is specialized for extracting the feature map of the captured image and the distance image feature map extraction unit 105 is specialized for extracting the feature map of the distance image. As a result, the structure of the neural network of the present embodiment makes the boundaries of objects and the boundaries of regions clearer than Comparative Example 100Z does.

Ｂ．変形例 B. Modification
図１１に示す変形例の認識装置１００Ａは、上述の認識装置１００と比較して、さらに、カラー画像をニューラルネットワークに入力するカラー画像入力部１０２Ａと、カラー画像の特徴マップを抽出するカラー画像特徴マップ抽出部１０５Ａと、歩行者のセグメンテーションに用いる特徴マップを生成する歩行者セグメンテーション部１０８Ａと、画像と歩行者とを関連付けるセマンティックセグメンテーションを行う歩行者出力部１１０Ａと、を備える点で異なる。さらに、変形例の認識装置１００Ａは、上述の認識装置１００と比較して、（ｉ）カラー画像特徴マップ抽出部１０５Ａにより抽出された特徴マップが、特徴マップ連結部１０６、領域セグメンテーション部１０７、及び物体セグメンテーション部１０８へ出力されるとともに、（ｉｉ）撮像画像特徴マップ抽出部１０４、距離画像特徴マップ抽出部１０５、及び特徴マップ連結部１０６により抽出された特徴マップが歩行者セグメンテーション部１０８Ａへ出力される点が異なる。 The recognition device 100A of the modification shown in FIG. 11 differs from the recognition device 100 described above in that it further includes a color image input unit 102A that inputs a color image to the neural network, a color image feature map extraction unit 105A that extracts a feature map of the color image, a pedestrian segmentation unit 108A that generates a feature map used for segmentation of pedestrians, and a pedestrian output unit 110A that performs semantic segmentation associating the image with pedestrians. The recognition device 100A also differs from the recognition device 100 in that (i) the feature map extracted by the color image feature map extraction unit 105A is output to the feature map connecting unit 106, the region segmentation unit 107, and the object segmentation unit 108, and (ii) the feature maps produced by the captured image feature map extraction unit 104, the distance image feature map extraction unit 105, and the feature map connecting unit 106 are output to the pedestrian segmentation unit 108A.

この変形例のように、本開示において入力する画像は、２種類ではなく３種類以上であってもよく、出力する画像は、２種類ではなく３種類以上であってもよい。 As in this modification, the number of image types input in the present disclosure is not limited to two and may be three or more, and likewise the number of image types output is not limited to two and may be three or more.
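
The generalization to more inputs and outputs can be sketched as a wiring list. The Python below is an illustration, not the patented implementation: `build_wiring` is a hypothetical helper, units are identified only by their reference signs, and the sketch assumes that every extraction unit feeds every segmentation unit directly, which for the pair 105A and 108A is not stated explicitly in the description of FIG. 11.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (source unit, destination unit)


def build_wiring(extractors: List[str], segmenters: List[str],
                 connector: str = "106") -> List[Edge]:
    """List the feature-map connections for any number of inputs and outputs."""
    edges: List[Edge] = []
    # Every extraction unit feeds the feature map connecting unit.
    for ex in extractors:
        edges.append((ex, connector))
    # The connected feature map feeds every segmentation unit.
    for seg in segmenters:
        edges.append((connector, seg))
    # Direct connections that bypass the connecting unit (assumed complete).
    for ex in extractors:
        for seg in segmenters:
            edges.append((ex, seg))
    return edges


if __name__ == "__main__":
    base = build_wiring(["104", "105"], ["107", "108"])
    variant = build_wiring(["104", "105", "105A"], ["107", "108", "108A"])
    print(len(base), len(variant))
```

Under these assumptions the two-input, two-output device has 8 connections and the three-input, three-output modification has 15, so adding an image type or an output type only extends the lists passed to the helper.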

本開示は、上述の実施形態および変形例に限られるものではなく、その趣旨を逸脱しない範囲において種々の構成で実現することができる。例えば、発明の概要の欄に記載した各形態中の技術的特徴に対応する本実施形態、変形例中の技術的特徴は、上述の課題の一部又は全部を解決するために、あるいは、上述の効果の一部又は全部を達成するために、適宜、差し替えや、組み合わせを行うことが可能である。また、その技術的特徴が本明細書中に必須なものとして説明されていなければ、適宜、削除することが可能である。 The present disclosure is not limited to the embodiment and modification described above, and can be realized in various configurations without departing from its gist. For example, the technical features of the present embodiment and the modification that correspond to the technical features of each aspect described in the Summary of the Invention may be replaced or combined as appropriate in order to solve some or all of the problems described above, or to achieve some or all of the effects described above. Further, any technical feature not described as essential in this specification may be deleted as appropriate.

１０車両、１１ＣＰＵ、１２メモリ、２１撮像画像取得部、２２距離画像取得部、３０制御部、１００認識装置、１００Ａ認識装置、１００Ｙ、１００Ｚ比較例、１０１撮像画像入力部、１０２距離画像入力部、１０２Ａカラー画像入力部、１０４撮像画像特徴マップ抽出部、１０５距離画像特徴マップ抽出部、１０５Ａカラー画像特徴マップ抽出部、１０６特徴マップ連結部、１０７領域セグメンテーション部、１０７Ｙセグメンテーション部、１０８物体セグメンテーション部、１０８Ａ歩行者セグメンテーション部、１０９領域出力部、１０９Ｙ出力部、１１０Ａ歩行者出力部、１１０物体出力部、ＢＢ物体ブロック、ＢＢＤ物体ＤＯブロック、ＫＢ距離ブロック、ＫＢＵ距離ＵＳブロック、ＲＢ領域ブロック、ＲＢＤ領域ＤＯブロック、ＳＢ撮像ブロック、ＳＢＵ撮像ＵＳブロック 10 vehicle, 11 CPU, 12 memory, 21 captured image acquisition unit, 22 distance image acquisition unit, 30 control unit, 100 recognition device, 100A recognition device, 100Y, 100Z comparative examples, 101 captured image input unit, 102 distance image input unit, 102A color image input unit, 104 captured image feature map extraction unit, 105 distance image feature map extraction unit, 105A color image feature map extraction unit, 106 feature map connecting unit, 107 region segmentation unit, 107Y segmentation unit, 108 object segmentation unit, 108A pedestrian segmentation unit, 109 region output unit, 109Y output unit, 110A pedestrian output unit, 110 object output unit, BB object block, BBD object DO block, KB distance block, KBU distance US block, RB region block, RBD region DO block, SB imaging block, SBU imaging US block

Claims

A recognition device (100) that recognizes a region and an object using a trained neural network, the recognition device comprising:
a captured image feature map extraction unit (104) that extracts a feature map of a captured image in which the region and the object are included;
a distance image feature map extraction unit (105) that extracts a feature map of a distance image in which the region and the object are included;
a feature map connecting unit (106) that connects the feature map extracted from the captured image and the feature map extracted from the distance image;
a region segmentation unit (107) that generates a feature map used for segmentation of the region by using the connected feature map, the feature map extracted from the captured image, and the feature map extracted from the distance image;
a region output unit (109) that performs semantic segmentation associating the image with the region by using the feature map used for segmentation of the region;
an object segmentation unit (108) that generates a feature map used for segmentation of the object by using the connected feature map, the feature map extracted from the captured image, and the feature map extracted from the distance image; and
an object output unit (110) that performs semantic segmentation associating the image with the object by using the feature map used for segmentation of the object,
wherein the captured image feature map extraction unit, the distance image feature map extraction unit, the feature map connecting unit, the region segmentation unit, and the object segmentation unit are configured by the neural network.

The recognition device according to claim 1,
wherein the neural network is a convolutional neural network.

The recognition device according to claim 1 or 2,
wherein the captured image is a luminance image.

The recognition device according to any one of claims 1 to 3,
wherein the distance image is a parallax image.