JP2016153984A

JP2016153984A - Neural network processor, neural network processing method, detection device, detection method, and vehicle

Info

Publication number: JP2016153984A
Application number: JP2015032258A
Authority: JP
Inventors: 育郎佐藤; Ikuro Sato
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2015-02-20
Filing date: 2015-02-20
Publication date: 2016-08-25
Anticipated expiration: 2035-02-20
Also published as: JP6360802B2

Abstract

PROBLEM TO BE SOLVED: To provide a neural network processor, a neural network processing method, a detection device, and a detection method which detect the position of a detection object with a small processing load and with high accuracy, and to provide a vehicle having such a detection device. SOLUTION: A neural network processor (1) is provided, which includes a horizontal direction processing part (120h) for performing non-linearization processing and pooling processing on a map based on an input image, and outputting a row vector (51) indicating the horizontal position of a detection object in the input image while maintaining the horizontal resolution of the input image; and a vertical direction processing part (120v) for performing non-linearization processing and pooling processing on the map based on the input image, and outputting a column vector (52) indicating the vertical position of the detection object in the input image while maintaining the vertical resolution of the input image. SELECTED DRAWING: Figure 3

Description

The present invention relates to a neural network processing device, a neural network processing method, a detection device, and a detection method for detecting a detection target, and to a vehicle having such a detection device.

Detecting objects on the road (such as a preceding vehicle) with an in-vehicle camera can support safe driving of an automobile. Detecting the presence of an object and its position is therefore a technical problem. As one method of object detection, an image recognition method using a convolutional neural network is known (for example, Non-Patent Document 1).

In a convolutional neural network, a convolution operation on the input image and a reduction process called pooling are repeated, and once the resolution has become sufficiently small, the signal is input to a fully connected network (multilayer perceptron) to obtain the final output.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten Digit Recognition with a Back-Propagation Network", Advances in Neural Information Processing Systems (NIPS), pp. 396-404, 1990.

However, such a convolutional neural network has the following problems.

The first is the problem of detection accuracy. An ordinary convolutional neural network pools the input image in both the horizontal and vertical directions, so the resolution decreases. The accuracy with which an object can be located depends on the resolution, and because the resolution decreases, the position of the object cannot always be detected accurately. On the other hand, if no pooling is performed at all, the dimension of the vector input to the fully connected network becomes extremely large, which not only strains memory but also makes overfitting more likely. Omitting pooling is therefore impractical.

There is also the problem of processing load. In an ordinary convolutional neural network, the size of the input image must be determined in advance. However, it is not known which image size is best suited to detecting the position of an object. The image from the in-vehicle camera must therefore be resized to several sizes to generate a plurality of pyramid images, and a sliding window must be applied to each of them. As a result, the processing load inevitably becomes large.

The present invention has been made in view of these problems. An object of the present invention is to provide a neural network processing device, a neural network processing method, a detection device, and a detection method for detecting the position of a detection target with a small processing load and high accuracy, and to provide a vehicle having such a detection device.

According to one aspect of the present invention, there is provided a neural network processing device comprising: a horizontal direction processing unit that performs non-linearization processing and pooling processing on a map based on an input image and outputs, while maintaining the horizontal resolution of the input image, a row vector indicating the horizontal position of a detection target in the input image; and a vertical direction processing unit that performs non-linearization processing and pooling processing on the map based on the input image and outputs, while maintaining the vertical resolution of the input image, a column vector indicating the vertical position of the detection target in the input image.

According to this configuration, the row vector and the column vector are output while maintaining the horizontal and vertical resolutions of the input image, so the position of the detection target can be detected with high accuracy.

Preferably, the horizontal direction processing unit has a first pooling unit that performs pooling processing only in the vertical direction on the map based on the input image, and the vertical direction processing unit has a second pooling unit that performs pooling processing only in the horizontal direction on the map based on the input image.

According to this configuration, the first pooling unit pools only in the vertical direction, so the row vector can be output while maintaining the horizontal resolution, and the second pooling unit pools only in the horizontal direction, so the column vector can be output while maintaining the vertical resolution.
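The effect of one-directional pooling on resolution can be illustrated with a minimal NumPy sketch. Max pooling over non-overlapping windows, the function names `pool_vertical` and `pool_horizontal`, and the 4 × 6 example map are assumptions made for illustration; the text above does not fix the pooling type or the window size.

```python
import numpy as np

def pool_vertical(feature_map, k):
    """Max-pool over non-overlapping k x 1 windows (vertical direction only)."""
    h, w = feature_map.shape
    assert h % k == 0, "height must be divisible by k in this sketch"
    # Group every k consecutive rows and keep the maximum of each group.
    return feature_map.reshape(h // k, k, w).max(axis=1)

def pool_horizontal(feature_map, k):
    """Max-pool over non-overlapping 1 x k windows (horizontal direction only)."""
    h, w = feature_map.shape
    assert w % k == 0, "width must be divisible by k in this sketch"
    # Group every k consecutive columns and keep the maximum of each group.
    return feature_map.reshape(h, w // k, k).max(axis=2)

fmap = np.arange(24, dtype=float).reshape(4, 6)   # 4 rows, 6 columns
print(pool_vertical(fmap, 2).shape)    # (2, 6): all 6 columns survive
print(pool_horizontal(fmap, 2).shape)  # (4, 3): all 4 rows survive
```

Repeating the vertical-only pooling until a single row remains yields a vector with one element per input column, which is how the horizontal branch can end in a row vector at full horizontal resolution.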

Preferably, the horizontal direction processing unit has a first horizontal direction map generation unit and a second horizontal direction map generation unit, and the vertical direction processing unit has a first vertical direction map generation unit and a second vertical direction map generation unit. The first horizontal direction map generation unit performs non-linearization processing and vertical-only pooling processing on the map based on the input image to generate a first output map. The first vertical direction map generation unit performs non-linearization processing and horizontal-only pooling processing on the map based on the input image to generate a second output map. The second horizontal direction map generation unit performs non-linearization processing and pooling processing on the first output map and the second output map to generate a third output map whose horizontal resolution equals that of the first output map. The second vertical direction map generation unit performs non-linearization processing and pooling processing on the first output map and the second output map to generate a fourth output map whose vertical resolution equals that of the second output map. The row vector is generated based on the third output map, and the column vector is generated based on the fourth output map.

According to this configuration, the column vector is generated using also the first output map generated by the first horizontal direction map generation unit, and the row vector is generated using also the second output map generated by the first vertical direction map generation unit. Detection accuracy therefore improves.

Specifically, the second horizontal direction map generation unit may have: a first non-linearization processing unit that performs convolution operations on the first output map and the second output map to generate a first intermediate map and a second intermediate map, respectively, each having the same resolution as the first output map, and adds the first intermediate map and the second intermediate map to generate a third intermediate map; and a third pooling unit that performs vertical-only pooling processing on the third intermediate map to generate the third output map. The second vertical direction map generation unit may have: a second non-linearization processing unit that performs convolution operations on the first output map and the second output map to generate a fourth intermediate map and a fifth intermediate map, respectively, each having the same resolution as the second output map, and adds the fourth intermediate map and the fifth intermediate map to generate a sixth intermediate map; and a fourth pooling unit that performs horizontal-only pooling processing on the sixth intermediate map to generate the fourth output map.

More specifically, the horizontal and vertical resolutions of the second intermediate map may be p1 times and 1/q1 times the horizontal and vertical resolutions of the second output map, respectively (p1 and q1 are integers). The first non-linearization processing unit may generate the second intermediate map by repeating the following processing while shifting a first filter by q1 pixels at a time in the vertical direction of the second output map: the inner product of the pixel values of a part of the second output map and the filter coefficients of the first filter is set as the value of the pixel in the second intermediate map that corresponds to that part, and of the (p1−1) pixels to its right. Likewise, the horizontal and vertical resolutions of the fourth intermediate map may be 1/p2 times and q2 times the horizontal and vertical resolutions of the first output map, respectively (p2 and q2 are integers). The second non-linearization processing unit may generate the fourth intermediate map by repeating the following processing while shifting a second filter by p2 pixels at a time in the horizontal direction of the first output map: the inner product of the pixel values of a part of the first output map and the filter coefficients of the second filter is set as the value of the pixel in the fourth intermediate map that corresponds to that part, and of the (q2−1) pixels below it.
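The resolution change for the second intermediate map can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the filter height is taken to equal the vertical shift q (non-overlapping windows), the filter width is odd with zero padding so that the horizontal shift is one pixel, and the name `conv_replicate` is hypothetical.

```python
import numpy as np

def conv_replicate(src, filt, p, q):
    """Map a (H, W) array to (H // q, W * p).

    Each inner product of a src patch with filt is written to one output
    pixel and to the (p - 1) pixels to its right; the filter moves q pixels
    at a time in the vertical direction of src.
    Assumptions of this sketch: filt has height q and odd width.
    """
    H, W = src.shape
    fh, fw = filt.shape
    assert fh == q and fw % 2 == 1 and H % q == 0
    pad = fw // 2
    padded = np.pad(src, ((0, 0), (pad, pad)))   # zero-pad horizontally only
    out = np.zeros((H // q, W * p))
    for i in range(H // q):                      # vertical shift of q pixels
        for j in range(W):                       # horizontal shift of 1 pixel
            patch = padded[i * q:(i + 1) * q, j:j + fw]
            value = np.sum(patch * filt)         # inner product with the filter
            out[i, j * p:(j + 1) * p] = value    # pixel plus (p - 1) to its right
    return out

second_output = np.ones((4, 3))                  # full height, reduced width
second_intermediate = conv_replicate(second_output, np.ones((2, 1)), p=2, q=2)
print(second_intermediate.shape)                 # (2, 6)
```

The fourth intermediate map follows by symmetry: applying the same function to the transposed first output map and transposing the result replicates each value downward and shifts the filter horizontally.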

An image in which the pixels corresponding to the upper, lower, left, and right ends of the detection target in a sample image have a first value and the other pixels have a second value may be used as the input image, and the horizontal direction processing unit and the vertical direction processing unit may perform the non-linearization processing, including a convolution operation using a filter whose center is the first value and whose other coefficients are the second value, to output the row vector and the column vector as correct data.
According to this configuration, correct data can be generated from the sample image and used for learning processing.
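A minimal sketch of this correct-data generation follows, assuming the first value is 1 and the second value is 0, a 3 × 3 filter, and max pooling collapsed into a single maximum per direction. The helper names (`corner_image`, `conv_same`, `correct_vectors`) are hypothetical, and the real network alternates several convolution and pooling stages.

```python
import numpy as np

def corner_image(height, width, xl, xr, yt, yb):
    """Label image: 1 at the four bounding-box corners, 0 elsewhere."""
    img = np.zeros((height, width))
    img[np.ix_([yt, yb], [xl, xr])] = 1.0
    return img

def conv_same(img, filt):
    """Zero-padded 2-D sliding-window product with an odd-sized filter."""
    fh, fw = filt.shape
    padded = np.pad(img, ((fh // 2, fh // 2), (fw // 2, fw // 2)))
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(padded[y:y + fh, x:x + fw] * filt)
    return out

def correct_vectors(label_img):
    """Return (row vector, column vector) used as correct data."""
    center_filter = np.zeros((3, 3))
    center_filter[1, 1] = 1.0                    # center 1, all other taps 0
    fmap = conv_same(label_img, center_filter)   # leaves the label image unchanged
    row_vec = fmap.max(axis=0)                   # pool away the vertical direction
    col_vec = fmap.max(axis=1)                   # pool away the horizontal direction
    return row_vec, col_vec

row_vec, col_vec = correct_vectors(corner_image(5, 7, xl=1, xr=4, yt=0, yb=3))
```

Because the center-only filter passes the label image through unchanged, the pooled outputs are 1 exactly at the edge columns and edge rows, so the same network structure can produce its own training targets.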

Alternatively, an image in which the pixels where the detection target is present in the sample image have the first value and the other pixels have the second value may be used as the input image, and the horizontal direction processing unit and the vertical direction processing unit may perform the non-linearization processing, including a convolution operation using a filter whose center is the first value and whose other coefficients are the second value, to output the row vector and the column vector as correct data.
According to this configuration, correct data can be generated from the sample image and used for learning processing. This is particularly effective when there are a plurality of detection targets.

The neural network parameters used in the non-linearization processing may be generated based on the sample image and the correct data.

Preferably, the neural network parameters are generated so that the horizontal direction processing unit outputs a row vector in which the columns corresponding to the left end and the right end of the detection target have the first value and the columns left of the left end and right of the right end have the second value, and so that the vertical direction processing unit outputs a column vector in which the rows corresponding to the upper end and the lower end of the detection target have the first value and the rows above the upper end and below the lower end have the second value.
In this way, the value of each column of the row vector and the value of each row of the column vector indicate the position of the detection target.

Preferably, the sample images include a sample image containing a first detection target and a sample image containing a second detection target different from the first detection target.
According to this configuration, a plurality of types of detection targets can be detected.

Preferably, the sample images also include a sample image containing only a part of the detection target.
According to this configuration, the detection target can be detected even when the input image contains only a part of it.

A region specifying unit that specifies, based on the row vector, a region of the input image containing the detection target may be provided, and the vertical direction processing unit may perform the non-linearization processing and the pooling processing on the specified region as the map based on the input image.
According to this configuration, detection accuracy can be improved when the detection target is vertically long.

When it is determined, based on the row vector, that the input image does not contain the detection target, the vertical direction processing unit preferably does not perform the non-linearization processing or the pooling processing.
According to this configuration, the vertical direction processing unit performs no processing when there is no detection target, which reduces the amount of computation.
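The two options above (restricting the vertical branch to a region chosen from the row vector, and skipping it entirely when nothing is found) can be combined in a short control-flow sketch. The callables `horizontal_net` and `vertical_net` are hypothetical stand-ins for the two processing units, and the presence test `row_vec >= threshold` is an assumption; the text does not state how the absence of a target is judged.

```python
import numpy as np

def detect(image, horizontal_net, vertical_net, threshold=0.5):
    """Run the horizontal branch first; run the vertical branch only if needed."""
    row_vec = horizontal_net(image)
    hits = np.flatnonzero(row_vec >= threshold)   # assumed presence test
    if hits.size == 0:
        return row_vec, None                      # no target: skip the vertical branch
    left, right = hits[0], hits[-1]
    region = image[:, left:right + 1]             # columns that contain the target
    return row_vec, vertical_net(region)

# Stand-in "networks" for demonstration only: column-wise and row-wise maxima.
image = np.zeros((4, 6))
image[1, 2] = 1.0
image[3, 4] = 1.0
row_vec, col_vec = detect(image, lambda m: m.max(axis=0), lambda m: m.max(axis=1))
```

Because the row vector keeps the horizontal resolution of the input image, its indices can be used directly as column indices when cropping the region.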

According to another aspect of the present invention, there is provided a detection device comprising: the neural network processing device described above; and a specifying unit that specifies the position of the detection target in the input image based on the row vector and the column vector output by the neural network processing device.

Preferably, the specifying unit treats a column of the row vector whose value satisfies a first threshold condition as a left end candidate or right end candidate of the detection target, and treats a row of the column vector whose value satisfies a second threshold condition as an upper end candidate or lower end candidate of the detection target.
According to this configuration, noise in the output row vector and column vector can be removed, which improves detection accuracy.

Preferably, the left end candidate and the right end candidate specified by the specifying unit are separated by at least a first number of pixels, and the upper end candidate and the lower end candidate specified by the specifying unit are separated by at least a second number of pixels.
This configuration likewise removes noise in the output row vector and column vector and improves detection accuracy.
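The candidate selection described above can be sketched as a threshold followed by a minimum-separation filter. The form of the threshold condition (`value >= threshold`), the function name `edge_pairs`, and the example numbers are assumptions for illustration; the same function applies to the row vector (left/right ends) and the column vector (upper/lower ends).

```python
import numpy as np

def edge_pairs(vector, threshold, min_gap):
    """Return index pairs (start, end) that could be opposite edges of a target.

    An index is a candidate if its value meets the threshold; a pair is kept
    only if the two candidates are at least min_gap pixels apart.
    """
    candidates = np.flatnonzero(np.asarray(vector) >= threshold)
    return [(int(a), int(b))
            for i, a in enumerate(candidates)
            for b in candidates[i + 1:]
            if b - a >= min_gap]

row_vec = [0.1, 0.9, 0.8, 0.0, 0.0, 0.95, 0.2]
print(edge_pairs(row_vec, threshold=0.5, min_gap=3))   # [(1, 5), (2, 5)]
```

The weak responses at indices 0 and 6 are dropped by the threshold, and the adjacent pair (1, 2) is dropped by the separation rule, so neighboring responses to the same edge are not mistaken for a narrow object.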

According to another aspect of the present invention, there is provided a vehicle comprising: a vehicle body; a camera attached to the vehicle body; and the detection device described above, which processes an image from the camera as the input image.

According to another aspect of the present invention, there is provided a neural network processing method comprising: a step of performing non-linearization processing and pooling processing on a map based on an input image and outputting, while maintaining the horizontal resolution of the input image, a row vector indicating the horizontal position of a detection target in the input image; and a step of performing non-linearization processing and pooling processing on the map based on the input image and outputting, while maintaining the vertical resolution of the input image, a column vector indicating the vertical position of the detection target in the input image.

According to another aspect of the present invention, there is provided a detection method comprising: a step of performing non-linearization processing and pooling processing on a map based on an input image and outputting, while maintaining the horizontal resolution of the input image, a row vector indicating the horizontal position of a detection target in the input image; a step of performing non-linearization processing and pooling processing on the map based on the input image and outputting, while maintaining the vertical resolution of the input image, a column vector indicating the vertical position of the detection target in the input image; and a step of specifying the position of the detection target in the input image based on the row vector and the column vector.

A row vector indicating the horizontal position of the detection target in the input image is output while maintaining the horizontal resolution of the input image, and a column vector indicating the vertical position of the detection target in the input image is output while maintaining the vertical resolution of the input image, so the position of the detection target can be detected accurately. Furthermore, because no restriction is placed on the size or aspect ratio of the detection target, the processing load can be reduced.

A diagram outlining the processing operation of the neural network processing unit 1 according to the first embodiment.
A block diagram showing the schematic configuration of the detection device 100 according to the first embodiment.
A block diagram showing the schematic configuration of the neural network processing unit 1 according to the first embodiment.
A diagram outlining the processing operation of the neural network processing unit 1.
A diagram explaining the processing of the non-linearization processing unit 121h.
A diagram explaining the convolution operation in the non-linearization processing unit 121h.
A diagram explaining the pooling processing of the pooling unit 122h in the horizontal direction map generation unit 12h.
A diagram explaining an input image for generating correct data from a sample image.
A diagram schematically showing the processing of the neural network processing unit 1 when generating correct data.
A block diagram showing the schematic configuration of the neural network processing unit 1' according to the second embodiment.
A block diagram showing the schematic configuration of the horizontal direction map generation unit 12h' and the vertical direction map generation unit 12v'.
A diagram outlining the processing operation of the horizontal direction map generation unit 12h' and the vertical direction map generation unit 12v'.
A diagram explaining how the map 93 is generated.
A diagram explaining how the map 64' is generated.
A diagram schematically showing the processing operation of the neural network processing unit 1' according to the second embodiment.
A diagram explaining how the specifying unit 3 specifies the left and right end positions of the detection target.
A diagram explaining how the specifying unit 3 specifies the upper and lower ends of the detection target.
A diagram showing a first example of the processing operation of the specifying unit 3.
A diagram explaining noise removal by the specifying unit 3.
A flowchart showing the noise removal procedure.
A diagram showing a second example of the processing operation of the specifying unit 3.
A diagram showing a third example of the processing operation of the specifying unit 3.
A diagram showing a fourth example of the processing operation of the specifying unit 3.
A diagram showing a fifth example of the processing operation of the specifying unit 3.
A diagram showing a sixth example of the processing operation of the specifying unit 3.
A diagram explaining a method of calculating the distance Z in the traveling direction between the in-vehicle camera 101 and the detection target 102.
A diagram explaining a method of calculating the lateral distances X_L and X_R between the in-vehicle camera 101 and the detection target 102.
A block diagram showing the schematic configuration of the neural network processing unit 1'' according to the fourth embodiment.
A diagram outlining the processing operation of the neural network processing unit 1 according to the fifth embodiment.
A diagram explaining an input image for generating correct data from a sample image.
A diagram explaining an input image for generating correct data from another sample image.
A diagram explaining the processing operation of the region specifying unit 13.

Embodiments according to the present invention are described in detail below with reference to the drawings.

(First embodiment)
In the first embodiment, the positions of the upper, lower, left, and right ends of the detection target in the input image, more specifically the left end position xl, the right end position xr, the upper end position yt, and the lower end position yb, are specified without impairing the resolution of the input image.

To that end, consider a virtual binary image with the same resolution as the input image. In this binary image, only the pixels (xl, yt), (xr, yt), (xl, yb), and (xr, yb) have the value 1, and all other pixels have the value 0. If such a binary image can be obtained by processing the input image, the upper, lower, left, and right ends of the detection target follow immediately.

Furthermore, if this binary image is regarded as a matrix, it factors into the product of a column vector in which only rows yt and yb have the value 1 and all other rows are 0, and a row vector in which only columns xl and xr have the value 1 and all other columns are 0. If the number of rows of the column vector and the number of columns of the row vector equal the vertical and horizontal resolutions of the input image, respectively, the positions of the upper, lower, left, and right ends of the detection target can be specified without impairing the resolution of the input image.
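The factorization can be checked numerically with a small NumPy sketch. The image size (5 × 7) and the edge positions are example values chosen for illustration only.

```python
import numpy as np

a, b = 5, 7                    # vertical and horizontal pixel counts
xl, xr, yt, yb = 1, 4, 0, 3    # example edge positions (assumed)

# Virtual binary image: 1 only at the four corner pixels.
binary = np.zeros((a, b), dtype=int)
binary[np.ix_([yt, yb], [xl, xr])] = 1

# Column vector (a rows) and row vector (b columns).
col_vec = np.zeros(a, dtype=int)
col_vec[[yt, yb]] = 1
row_vec = np.zeros(b, dtype=int)
row_vec[[xl, xr]] = 1

# The outer product of the column vector and the row vector is the binary image.
product = np.outer(col_vec, row_vec)
print(np.array_equal(product, binary))   # True
```

The two vectors together hold a + b values instead of the a × b values of the image, yet they keep one element per row and per column, which is why no resolution is lost.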

The present embodiment therefore discloses a neural network processing device (neural network processing unit) that applies neural network processing to an input image and outputs such a column vector and row vector without impairing the resolution of the input image.

FIG. 1 is a diagram outlining the processing operation of the neural network processing unit 1 according to the first embodiment. As shown in FIG. 1(a), suppose the input image has a pixels in the vertical direction and b pixels in the horizontal direction (hereinafter also expressed simply as "the input image has b × a pixels"). The specific processing is described later; in this case, the neural network processing unit 1 outputs a row vector 51 with b elements (columns) and a column vector 52 with a elements (rows).

As shown in FIG. 1(b), suppose that when the boundary of the detection target in the input image is defined by a rectangle, its upper-left, upper-right, lower-left, and lower-right pixel positions are (xl, yt), (xr, yt), (xl, yb), and (xr, yb), respectively. The neural network processing unit 1 then applies neural network processing to the input image so as to output a row vector 51 in which only columns xl and xr have the value 1 and the other columns have the value 0, and a column vector 52 in which only rows yt and yb have the value 1 and the other rows have the value 0.

行ベクトル５１において値が１である列が検出対象物の左端位置および／または右端位置を示している。列ベクトル５２において値が１である行が検出対象物の上端位置および／または下端位置を示している。よって、このような行ベクトル５１および列ベクトル５２に基づいて、検出対象物の位置を特定できる。 A column having a value of 1 in the row vector 51 indicates a left end position and / or a right end position of the detection target. A row having a value of 1 in the column vector 52 indicates the upper end position and / or the lower end position of the detection target. Therefore, the position of the detection target can be specified based on such row vector 51 and column vector 52.
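
The ideal outputs described above can be sketched in a few lines. This is only an illustration of the encoding, not the processing performed by the neural network processing unit 1; the function name `target_vectors` and the 0-based indexing are assumptions made for the example.

```python
import numpy as np

def target_vectors(width, height, xl, xr, yt, yb):
    """Build the ideal row vector 51 and column vector 52 for one bounding box.

    Coordinates are 0-based pixel indices (an assumption; the text does not
    fix an indexing convention).
    """
    row = np.zeros(width, dtype=np.float32)   # one element per image column (b elements)
    col = np.zeros(height, dtype=np.float32)  # one element per image row (a elements)
    row[[xl, xr]] = 1.0   # left and right ends of the detection target
    col[[yt, yb]] = 1.0   # upper and lower ends of the detection target
    return row, col

row, col = target_vectors(width=8, height=6, xl=2, xr=5, yt=1, yb=4)
```

Reading the positions back is then a matter of locating the elements equal to 1 in each vector.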

以下、具体例として、車載カメラで車両の前方を撮影し、得られた画像を入力画像として検出対象物（例えば、人、他の車両、標識）の上下左右端の位置を検出する検出装置を説明する。 Hereinafter, as a specific example, a detection device is described that captures the area in front of a vehicle with an in-vehicle camera and, using the obtained image as an input image, detects the positions of the upper, lower, left and right ends of a detection target (for example, a person, another vehicle, or a sign).

図２は、第１の実施形態に係る検出装置１００の概略構成を示すブロック図である。検出装置１００は、ニューラルネットワーク処理部１と、学習部２と、特定部３とを備えている。本検出装置１００は、予めニューラルネットワーク処理部１および学習部２が、サンプル画像を用いてニューラルネットワークパラメータを生成する（このことを「学習処理」ともいう）。そして、ニューラルネットワーク処理部１および特定部３が、生成されたニューラルネットワークパラメータを用いて、入力画像から検出対象物を検出する（このことを「検出処理」ともいう）。 FIG. 2 is a block diagram illustrating a schematic configuration of the detection apparatus 100 according to the first embodiment. The detection apparatus 100 includes a neural network processing unit 1, a learning unit 2, and a specifying unit 3. In this detection apparatus 100, the neural network processing unit 1 and the learning unit 2 generate neural network parameters in advance using sample images (this is also referred to as “learning processing”). Then, the neural network processing unit 1 and the specifying unit 3 detect a detection target from the input image using the generated neural network parameters (this is also referred to as “detection processing”).

学習処理において、まずニューラルネットワーク処理部１は、学習処理用のニューラルネットワークパラメータを用い、事前に用意されたサンプル画像を加工して得られた入力画像に対してニューラルネットワーク処理を行い、正解データを生成する。続いて、ニューラルネットワーク処理部１は、同サンプル画像を入力画像としてニューラルネットワーク処理し、行ベクトル５１および列ベクトル５２を出力する。そして、学習部２は、これら行ベクトル５１および列ベクトル５２が上記正解データにできるだけ近づくよう、検出処理用のニューラルネットワークパラメータを生成する。 In the learning process, the neural network processing unit 1 first performs neural network processing, using the neural network parameters for learning processing, on an input image obtained by processing a sample image prepared in advance, and thereby generates correct data. Subsequently, the neural network processing unit 1 performs neural network processing using the same sample image as the input image, and outputs a row vector 51 and a column vector 52. The learning unit 2 then generates the neural network parameters for detection processing so that the row vector 51 and the column vector 52 come as close as possible to the correct data.

検出処理において、ニューラルネットワーク処理部１には、車載カメラからの画像が入力画像として入力される。そして、ニューラルネットワーク処理部１は、学習部２によって生成された検出処理用のニューラルネットワークパラメータを用いて入力画像に対してニューラルネットワーク処理を行い、検出対象物の位置を示す行ベクトル５１および列ベクトル５２を出力する。特定部３は、出力された行ベクトル５１および列ベクトル５２を解釈して、検出対象物の位置を特定する。 In the detection process, an image from the in-vehicle camera is input to the neural network processing unit 1 as an input image. The neural network processing unit 1 then performs neural network processing on the input image using the neural network parameters for detection processing generated by the learning unit 2, and outputs a row vector 51 and a column vector 52 indicating the position of the detection target. The specifying unit 3 interprets the output row vector 51 and column vector 52 and specifies the position of the detection target.

なお、学習処理および検出処理において、ニューラルネットワーク処理部１におけるニューラルネットワーク構造は共通している。そこで、まずはニューラルネットワーク構造を説明し、続いて学習処理および検出処理について詳しく説明する。 In the learning process and the detection process, the neural network structure in the neural network processing unit 1 is common. Therefore, first, the neural network structure will be described, and then the learning process and the detection process will be described in detail.

図３は、第１の実施形態に係るニューラルネットワーク処理部１の概略構成を示すブロック図である。図３に示すように、ニューラルネットワーク処理部１は、非線形化処理部１１と、水平方向処理部１２０ｈと、垂直方向処理部１２０ｖとを有する。水平方向処理部１２０ｈは１以上の水平方向マップ生成部１２ｈを有し、垂直方向処理部１２０ｖは１以上の垂直方向マップ生成部１２ｖを有する。 FIG. 3 is a block diagram illustrating a schematic configuration of the neural network processing unit 1 according to the first embodiment. As shown in FIG. 3, the neural network processing unit 1 includes a non-linearization processing unit 11, a horizontal direction processing unit 120h, and a vertical direction processing unit 120v. The horizontal direction processing unit 120h includes one or more horizontal direction map generation units 12h, and the vertical direction processing unit 120v includes one or more vertical direction map generation units 12v.

なお、非線形化処理部１１はなくてもよいし、２以上の非線形化処理部１１が縦続接続されていてもよい。また、水平方向マップ生成部１２ｈや垂直方向マップ生成部１２ｖの数にも制限はないが、両者の数は等しい。 Note that the non-linearization processing unit 11 may not be provided, and two or more non-linearization processing units 11 may be connected in cascade. The number of horizontal direction map generation units 12h and vertical direction map generation units 12v is not limited, but the number of both is the same.

初段の非線形化処理部１１は入力画像をマップとして畳み込み演算および活性化処理を行い、新たなマップを出力する。入力画像がグレースケールである場合、マップの数は１つである。入力画像がＲ画像、Ｇ画像およびＢ画像で構成される場合、マップの数はＲ画像、Ｇ画像およびＢ画像の３つであってもよい。非線形化処理部１１が複数段ある場合、前段の非線形化処理部１１から出力されるマップを、次段の非線形化処理部１１が同様の処理をする。具体的な処理内容は、次に説明する水平方向マップ生成部１２ｈの非線形化処理部１２１ｈなどにおける処理と同様である。非線形化処理部１１による処理には、物体のエッジなどを抽出する役割がある。 The first-stage non-linearization processing unit 11 performs convolution calculation and activation processing using the input image as a map, and outputs a new map. When the input image is grayscale, the number of maps is one. When the input image is composed of an R image, a G image, and a B image, the number of maps may be three, an R image, a G image, and a B image. When there are a plurality of non-linearization processing units 11, the next-stage non-linearization processing unit 11 performs the same processing on the map output from the preceding non-linearization processing unit 11. The specific processing content is the same as the processing in the non-linearization processing unit 121h of the horizontal direction map generation unit 12h described below. The processing by the non-linearization processing unit 11 has a role of extracting an edge of an object.

水平方向処理部１２０ｈおよび垂直方向処理部１２０ｖには、入力画像に基づくマップ、すなわち、最終段の非線形化処理部１１から出力されるマップか、非線形化処理部１１がない場合には入力画像が入力される。検出対象物が１種類である場合、水平方向処理部１２０ｈは１つのマップ（すなわち行ベクトル５１）を出力し、垂直方向処理部１２０ｖは１つのマップ（すなわち列ベクトル５２）を出力する。 The horizontal direction processing unit 120h and the vertical direction processing unit 120v receive a map based on the input image, that is, either the map output from the final-stage non-linearization processing unit 11 or, when no non-linearization processing unit 11 is provided, the input image itself. When there is one type of detection target, the horizontal direction processing unit 120h outputs one map (that is, the row vector 51), and the vertical direction processing unit 120v outputs one map (that is, the column vector 52).

水平方向処理部１２０ｈにおける各水平方向マップ生成部１２ｈは、非線形化処理部１２１ｈと、プーリング部１２２ｈとを有する。非線形化処理部１２１ｈは、マップに対して非線形化処理を行う。非線形化処理によってマップの数は変化し得るが、マップの解像度は不変である。プーリング部１２２ｈは、マップに対して垂直方向にのみのプーリングを行う。この処理によって、マップの垂直方向の解像度は減るが、水平方向の解像度は不変である。 Each horizontal direction map generation unit 12h in the horizontal direction processing unit 120h includes a non-linearization processing unit 121h and a pooling unit 122h. The non-linearization processing unit 121h performs non-linearization processing on the map. The non-linearization processing may change the number of maps, but it does not change the resolution of the map. The pooling unit 122h pools the map only in the vertical direction. This processing reduces the vertical resolution of the map but leaves the horizontal resolution unchanged.

垂直方向処理部１２０ｖにおける各垂直方向マップ生成部１２ｖも、非線形化処理部１２１ｖと、プーリング部１２２ｖとを有する。非線形化処理部１２１ｖは、マップに対して非線形化処理を行う。非線形化処理によってマップの数は変化し得るが、マップの解像度は不変である。プーリング部１２２ｖは、マップに対して水平方向にのみのプーリングを行う。この処理によって、マップの水平方向の解像度は減るが、垂直方向の解像度は不変である。 Each vertical direction map generation unit 12v in the vertical direction processing unit 120v also includes a non-linearization processing unit 121v and a pooling unit 122v. The nonlinear processing unit 121v performs nonlinear processing on the map. Although the number of maps can be changed by the non-linearization process, the resolution of the map is not changed. The pooling unit 122v performs pooling only in the horizontal direction with respect to the map. This process reduces the horizontal resolution of the map, but does not change the vertical resolution.

図４は、ニューラルネットワーク処理部１の処理動作の概略を説明する図であり、マップの解像度の変遷に特に着目したものである。同図において、実線矢印は非線形化処理部１１，１２１ｈ，１２１ｖによる非線形化処理を意味しており、破線矢印はプーリング部１２２ｈ，１２２ｖによるプーリング処理を意味している。図４の表現によれば、入力画像のマップ数がｍ０であり、ｋ段目（最終段を除く）の非線形化処理後のマップ数がｍｋであり、最終段の非線形化処理後のマップ数が１であることが明示される。また、プーリング処理の度にマップの解像度が垂直方向または水平方向にのみ変化していく様子も明示される。 FIG. 4 is a diagram outlining the processing operation of the neural network processing unit 1, with particular attention to how the resolution of the map changes. In the figure, solid arrows denote non-linearization processing by the non-linearization processing units 11, 121h and 121v, and broken arrows denote pooling processing by the pooling units 122h and 122v. The representation in FIG. 4 makes explicit that the number of maps of the input image is m0, that the number of maps after the k-th stage of non-linearization processing (excluding the final stage) is mk, and that the number of maps after the final stage of non-linearization processing is 1. It also shows how the resolution of the map changes only in the vertical direction or only in the horizontal direction with each pooling process.

図５は、非線形化処理部１２１ｈの処理を説明する図である。非線形化処理部１１，１２１ｈ，１２１ｖの処理は共通しているため、特に断らない限り非線形化処理部１２１ｈについて説明する。 FIG. 5 is a diagram illustrating the processing of the non-linearization processing unit 121h. Since the processing of the nonlinear processing units 11, 121h, and 121v is common, the nonlinear processing unit 121h will be described unless otherwise specified.

非線形化処理部１２１ｈは、マップ６１〜６３に対して、畳み込み演算、活性化処理およびバイアス加算処理を行い、中間マップ７１〜７３を生成する。同図では、３つのマップ６１〜６３から３つの中間マップ７１〜７３を生成する例を示しているが、入力されるマップおよび中間マップの数に制限はない。ただし、初段の非線形化処理部１２１ｈが生成する中間マップ数と、初段の非線形化処理部１２１ｖが生成する中間マップ数は等しい。２段目以降も同様である。そして、最終段の非線形化処理部１２１ｈ，１２１ｖは１つの中間マップを生成する。 The non-linearization processing unit 121h performs convolution operations, activation processing, and bias addition processing on the maps 61 to 63, and generates intermediate maps 71 to 73. In the figure, an example is shown in which three intermediate maps 71 to 73 are generated from three maps 61 to 63, but the number of input maps and intermediate maps is not limited. However, the number of intermediate maps generated by the first-stage nonlinearization processing unit 121h is equal to the number of intermediate maps generated by the first-stage nonlinearization processing unit 121v. The same applies to the second and subsequent stages. Then, the nonlinear processing units 121h and 121v in the final stage generate one intermediate map.

図５において、実線矢印は畳み込み演算および活性化処理を示している。畳み込み演算の詳細は図６を用いて後述するが、畳み込み演算に用いられるフィルタの係数は実線矢印ごとに異なり得る。活性化処理は、シグモイド関数などの活性化関数を用い、畳み込み演算の結果を活性化する処理である。また、一点鎖線矢印はバイアス加算処理を示しており、スカラーであるバイアスが１と掛け合わされて、活性化処理の結果の全要素に足される。バイアスは一点鎖線矢印ごとに異なり得る。 In FIG. 5, the solid line arrows indicate the convolution operation and the activation process. Details of the convolution operation will be described later with reference to FIG. 6, but the coefficient of the filter used for the convolution operation may be different for each solid line arrow. The activation process is a process for activating the result of the convolution operation using an activation function such as a sigmoid function. A one-dot chain arrow indicates a bias addition process. A scalar bias is multiplied by 1, and is added to all elements of the result of the activation process. The bias can be different for each dashed line arrow.

図５の例では、非線形化処理部１２１ｈにはマップ６１〜６３が入力され、マップ６１に対する畳み込み演算および活性化処理の結果と、マップ６２に対する畳み込み演算および活性化処理の結果と、マップ６３に対する畳み込み演算および活性化処理の結果と、が加算され、さらにバイアスが加算されて、引き続くプーリング部１２２ｈに入力される中間マップ７１が生成されることなどが示されている。 In the example of FIG. 5, the maps 61 to 63 are input to the non-linearization processing unit 121h. The figure shows, among other things, that the result of the convolution operation and activation processing on the map 61, the result of the convolution operation and activation processing on the map 62, and the result of the convolution operation and activation processing on the map 63 are added together, and a bias is further added, to generate the intermediate map 71 that is input to the subsequent pooling unit 122h.
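
As a rough sketch of the data flow in FIG. 5, the following NumPy code follows the order stated in the text: each input map is filtered and passed through a sigmoid, the activated results are summed, and a scalar bias is added to every element. The function names, the array layout and the use of zero padding with odd-sized filters are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def correlate_same(m, k):
    """Slide filter k over map m with zero padding; the output has m's shape.

    Assumes an odd filter size. Each output pixel is the inner product of
    the filter coefficients and the map values under the filter.
    """
    kh, kw = k.shape
    padded = np.pad(m, ((kh // 2, kh // 2), (kw // 2, kw // 2)))  # zeros outside the map
    out = np.empty(m.shape, dtype=np.float64)
    for y in range(m.shape[0]):
        for x in range(m.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * k)
    return out

def nonlinearize(maps, filters, biases):
    """maps: (J, H, W); filters: (I, J, kh, kw); biases: (I,). Returns (I, H, W).

    One filter per (output map, input map) pair and one bias per output map,
    mirroring the solid and dash-dotted arrows of FIG. 5.
    """
    out = []
    for i in range(filters.shape[0]):
        acc = sum(sigmoid(correlate_same(maps[j], filters[i, j]))
                  for j in range(maps.shape[0]))
        out.append(acc + biases[i])  # scalar bias added to all elements
    return np.stack(out)

maps = np.zeros((3, 4, 5))
filters = np.zeros((3, 3, 3, 3))
biases = np.array([0.1, 0.2, 0.3])
intermediate = nonlinearize(maps, filters, biases)
```

With all-zero filters every activated term is sigmoid(0) = 0.5, so each intermediate map equals 1.5 plus its bias, which makes the summation and bias steps easy to check by hand.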

上記フィルタ係数およびバイアスがニューラルネットワークパラメータである。学習処理用ニューラルネットワークパラメータは予め定められた固定値であり、検出処理用ニューラルネットワークパラメータは学習部２によって生成される。 The filter coefficients and bias are the neural network parameters. The learning processing neural network parameter is a predetermined fixed value, and the detection processing neural network parameter is generated by the learning unit 2.

図６は、非線形化処理部１２１ｈにおける畳み込み演算を説明する図である。図６（ａ）はマップのサイズが減少する畳み込み演算であり、図６（ｂ）はマップのサイズが不変である畳み込み演算である。 FIG. 6 is a diagram for explaining the convolution operation in the non-linearization processing unit 121h. FIG. 6A shows a convolution operation in which the map size decreases, and FIG. 6B shows a convolution operation in which the map size is unchanged.

図６（ａ）の畳み込み演算では、非線形化処理部１２１ｈは、まず畳み込み演算前のマップ６４の画素（１，１）を左上とする例えば３×３画素の領域にフィルタ８０を設定する。そして、非線形化処理部１２１ｈは、フィルタ係数とマップ６４の画素値との内積を、畳み込み演算後のマップ８１の画素（１，１）の値とする。以下、非線形化処理部１２１ｈはフィルタ８０の位置を１画素ずつ右にシフトしながら、畳み込み演算後のマップ８１の各画素値を算出する。この場合、畳み込み演算後のマップ８１における右端２列および下端２行の値は算出されない。よって、畳み込み演算後のマップ８１は、畳み込み演算前のマップ６４に比べて、水平方向および垂直方向とも２画素ずつサイズが減少する。 In the convolution operation of FIG. 6A, the non-linearization processing unit 121h first sets the filter 80 in, for example, a 3 × 3 pixel region in which the pixel (1, 1) of the map 64 before the convolution operation is at the upper left. Then, the non-linearization processing unit 121h sets the inner product of the filter coefficient and the pixel value of the map 64 as the value of the pixel (1, 1) of the map 81 after the convolution calculation. Thereafter, the non-linearization processing unit 121h calculates each pixel value of the map 81 after the convolution operation while shifting the position of the filter 80 to the right by one pixel. In this case, the values of the rightmost two columns and the lowermost two rows in the map 81 after the convolution operation are not calculated. Therefore, the size of the map 81 after the convolution operation is reduced by two pixels in both the horizontal direction and the vertical direction, compared to the map 64 before the convolution operation.

一方、図６（ｂ）の畳み込み演算では、非線形化処理部１２１ｈは、畳み込み演算前のマップ６４の画素（１，１）を中央とする例えば３×３画素の領域にフィルタ８０を設定する。そして、非線形化処理部１２１ｈは、フィルタ係数とマップ６４の画素値との内積を、畳み込み演算後のマップ８２の画素（１，１）の値とする。ただし、畳み込み演算前のマップ６４からはみ出る領域の画素の値は０とする（Zero paddingと呼ばれる）。以下、非線形化処理部１２１ｈはフィルタ８０の位置を１画素ずつ右にシフトしながら、畳み込み演算後のマップ８２の各画素値を算出する。この場合、畳み込み演算後のマップ８２のサイズは、畳み込み演算前のマップ６４のサイズと等しい。 On the other hand, in the convolution operation of FIG. 6B, the non-linearization processing unit 121h sets the filter 80 in, for example, a 3 × 3 pixel region centered on the pixel (1, 1) of the map 64 before the convolution operation. Then, the non-linearization processing unit 121h sets the inner product of the filter coefficient and the pixel value of the map 64 as the value of the pixel (1, 1) of the map 82 after the convolution calculation. However, the value of the pixel in the region that protrudes from the map 64 before the convolution operation is set to 0 (referred to as zero padding). Thereafter, the non-linearization processing unit 121h calculates each pixel value of the map 82 after the convolution calculation while shifting the position of the filter 80 to the right by one pixel. In this case, the size of the map 82 after the convolution operation is equal to the size of the map 64 before the convolution operation.
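
The difference between the two variants of FIG. 6 can be checked on a small map. The sketch below is an assumption-laden illustration (the helper `correlate` and its `zero_pad` flag are hypothetical names): without padding a 3 × 3 filter shrinks a 5 × 6 map to 3 × 4, while zero padding keeps it at 5 × 6.

```python
import numpy as np

def correlate(m, k, zero_pad):
    """Inner product of filter k with each window of map m.

    zero_pad=False corresponds to FIG. 6(a): the filter never leaves the map.
    zero_pad=True corresponds to FIG. 6(b): pixels outside the map count as 0.
    """
    kh, kw = k.shape
    if zero_pad:
        m = np.pad(m, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    oh, ow = m.shape[0] - kh + 1, m.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(m[y:y + kh, x:x + kw] * k)
    return out

m = np.arange(30, dtype=float).reshape(5, 6)  # 5 rows, 6 columns
k = np.ones((3, 3))
shrunk = correlate(m, k, zero_pad=False)
same = correlate(m, k, zero_pad=True)
print(shrunk.shape, same.shape)  # (3, 4) (5, 6)
```

The interior of the padded result coincides with the unpadded result, so the two variants differ only in how the border pixels are handled.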

いずれの場合でもマップ６４が縮小されるわけではないので、マップのサイズ（画素数）が変化するか否かに関わらず、畳み込み演算の前後でマップの解像度は等しいと考えることができる。 In any case, since the map 64 is not reduced, it can be considered that the resolution of the map is the same before and after the convolution operation regardless of whether the size (number of pixels) of the map changes or not.

なお、フィルタ８０のサイズに特に制限はないが、以下の具体例では、フィルタ８０の水平方向画素数および垂直方向画素数を互いに等しい奇数とし、かつ、すべての非線形化処理部１２１ｈ，１２１ｖで共通のサイズとする。各フィルタ８０の係数は学習処理によって生成されるものであり、非線形化処理部１２１ｈ，１２１ｖごとに異なり得る。また、図６（ｂ）のマップサイズが不変である畳み込み演算が行われるものとする。 The size of the filter 80 is not particularly limited, but in the following specific example the numbers of horizontal and vertical pixels of the filter 80 are the same odd number, and the filter size is common to all the non-linearization processing units 121h and 121v. The coefficients of each filter 80 are generated by the learning process and may differ for each of the non-linearization processing units 121h and 121v. It is also assumed that the convolution operation of FIG. 6B, in which the map size is unchanged, is performed.

図７は、水平方向マップ生成部１２ｈにおけるプーリング部１２２ｈのプーリング処理を説明する図である。このプーリング部１２２ｈは、水平方向の解像度を変えることなく、言い換えると、水平方向の画素数を保ったまま、垂直方向にのみマップをプーリング（縮小）する。以下、垂直方向の画素数を１／２にプーリングする例を具体的に説明する。 FIG. 7 is a diagram illustrating the pooling process of the pooling unit 122h in the horizontal direction map generation unit 12h. The pooling unit 122h pools (reduces) the map only in the vertical direction without changing the horizontal resolution, in other words, while maintaining the number of pixels in the horizontal direction. Hereinafter, an example in which the number of pixels in the vertical direction is pooled to ½ will be described in detail.

まず、プーリング部１２２ｈは非線形化処理部１２１ｈからの中間マップ７１を水平方向１画素×垂直方向２画素のグリッド８４に分割する（図７（ａ））。グリッド８４の総数は中間マップ７１の１／２となる。次いで、プーリング部１２２ｈは各グリッド８４内で最大値を有する画素を選択する（図７（ｂ）の黒塗り画素が選択された画素を示す）。そして、プーリング部１２２ｈは、各グリッド８４内において、選択された画素で他の画素を埋めてプーリング処理後のマップ８５を生成する（図７（ｃ））。これにより、非線形化処理部１２１ｈが生成した中間マップ７１は垂直方向にのみプーリングされる。 First, the pooling unit 122h divides the intermediate map 71 from the non-linearization processing unit 121h into grids 84 of 1 pixel in the horizontal direction × 2 pixels in the vertical direction (FIG. 7A). The total number of grids 84 is half the number of pixels in the intermediate map 71. Next, the pooling unit 122h selects the pixel having the maximum value in each grid 84 (the black pixels in FIG. 7B indicate the selected pixels). Then, within each grid 84, the pooling unit 122h fills the other pixel with the selected pixel to generate the map 85 after the pooling process (FIG. 7C). As a result, the intermediate map 71 generated by the non-linearization processing unit 121h is pooled only in the vertical direction.

なお、図７（ｂ）において、プーリング部１２２ｈは、各グリッド８４内での最大値を選択するのではなく、グリッド８４内画素の平均値を用いてもよい。また、プーリング部１２２ｈは垂直方向の画素数を（１／２）ⁿ（ｎは１以上の整数）にプーリングするのが望ましいが、１／３や１／５などにプーリングしてもよい。 In FIG. 7B, the pooling unit 122h may use the average value of the pixels in the grid 84 instead of selecting the maximum value in each grid 84. The pooling unit 122h preferably pools the number of pixels in the vertical direction to (1/2) ⁿ (n is an integer of 1 or more), but may be pooled to 1/3 or 1/5.
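
A minimal sketch of such vertical-only pooling, assuming the map is a 2-D NumPy array whose height is divisible by the pooling factor. The function name `pool_vertical` and its `mode` parameter are hypothetical; the code returns the reduced map directly rather than first filling each grid with the selected value.

```python
import numpy as np

def pool_vertical(m, factor=2, mode="max"):
    """Reduce the number of rows by `factor`, leaving the columns unchanged."""
    h, w = m.shape
    if h % factor:
        raise ValueError("height must be divisible by the pooling factor")
    # Each (factor x 1) grid becomes one slice along axis 1.
    blocks = m.reshape(h // factor, factor, w)
    return blocks.max(axis=1) if mode == "max" else blocks.mean(axis=1)

m = np.array([[1, 5, 2],
              [4, 3, 8],
              [7, 0, 6],
              [9, 2, 1]], dtype=float)
pooled_max = pool_vertical(m)                # [[4, 5, 8], [9, 2, 6]]
pooled_mean = pool_vertical(m, mode="mean")  # [[2.5, 4, 5], [8, 1, 3.5]]
```

Both variants halve the four rows to two while keeping all three columns, which is the property the horizontal direction map generation unit relies on.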

一方、垂直方向マップ生成部１２ｖにおけるプーリング部１２２ｖは、垂直方向の解像度を変えることなく、言い換えると、垂直方向の画素数を保ったまま、水平方向にのみマップをプーリングする。その他は水平方向マップ生成部１２ｈのプーリング部１２２ｈと同様である。 On the other hand, the pooling unit 122v in the vertical direction map generation unit 12v pools the map only in the horizontal direction without changing the vertical resolution, in other words, while maintaining the number of pixels in the vertical direction. In other respects it is the same as the pooling unit 122h of the horizontal direction map generation unit 12h.
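
To see how the two branches shrink a map over several stages, the sizes can be traced with simple integer arithmetic. The sketch assumes size-preserving convolutions and a halving at every pooling stage; `output_shapes` is a hypothetical helper, and shapes are written as (rows, columns).

```python
def output_shapes(width, height, stages):
    """Return the final (rows, cols) of the horizontal and vertical branches."""
    h_rows, v_cols = height, width
    for _ in range(stages):
        h_rows //= 2   # pooling unit 122h halves the rows only
        v_cols //= 2   # pooling unit 122v halves the columns only
    return (h_rows, width), (height, v_cols)

h_shape, v_shape = output_shapes(width=512, height=256, stages=8)
print(h_shape, v_shape)  # (1, 512) (256, 2)
```

For a 512 × 256 input and eight stages per branch, the horizontal branch ends with a single row of 512 elements, while the vertical branch ends with 256 rows and two columns.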

このように、１つのプーリング部１２２ｈ，１２２ｖを経るごとにマップの垂直方向画素数および水平方向画素数はそれぞれ１／２になる。例えば、８段ずつのプーリング部１２２ｈ，１２２ｖが設けられ、入力画像の画素数が５１２×２５６である場合、最終的には１つの行ベクトル５１および２つの列ベクトル５２が出力される。 In this way, each pass through a pooling unit 122h or 122v halves the number of vertical pixels or the number of horizontal pixels of the map, respectively. For example, when eight stages each of the pooling units 122h and 122v are provided and the number of pixels of the input image is 512 × 256, one row vector 51 and two column vectors 52 are finally output.
続いて、学習処理について説明する。 Next, the learning process is described.

学習処理を行うためには、あるサンプル画像に対する理想的な行ベクトル５１および列ベクトル５２を正解データとして予め用意する必要がある。ニューラルネットワーク処理部１によって正解データを生成する手法を説明する。 In order to perform the learning process, it is necessary to prepare ideal row vectors 51 and column vectors 52 for a certain sample image as correct data in advance. A method of generating correct data by the neural network processing unit 1 will be described.

図８は、サンプル画像から正解データを生成するための入力画像を説明する図である。同図（ａ）はサンプル画像の例であり、人手によって検出対象物の上下左右端の位置を特定し、同図（ｂ）に示す入力画像を生成する。入力画像は、サンプル画像と同じ画素数を有し、検出対象物の上下左右端における画素の値のみ１で、他の画素の値は０とする。 FIG. 8 is a diagram explaining the input image used to generate correct data from a sample image. FIG. 8A is an example of a sample image. The positions of the upper, lower, left and right ends of the detection target are specified by hand, and the input image shown in FIG. 8B is generated. The input image has the same number of pixels as the sample image; only the pixels at the upper, lower, left and right ends of the detection target have the value 1, and all other pixels have the value 0.

図８（ｂ）に示す入力画像がニューラルネットワーク処理部１に入力される。学習処理において、マップ数は１とする。そして、学習用ニューラルネットワークパラメータのうち、バイアスはすべて０とする。また、フィルタ係数は、フィルタの中央の値のみ１とし、他の値は０とする。なお、学習処理時と検出処理時において、フィルタのサイズは共通にしておく。 An input image shown in FIG. 8B is input to the neural network processing unit 1. In the learning process, the number of maps is 1. Of the learning neural network parameters, all biases are set to zero. The filter coefficient is set to 1 only at the center value of the filter, and 0 is set to other values. Note that the size of the filter is made common during the learning process and the detection process.

図９は、正解データ生成時のニューラルネットワーク処理部１の処理を模式的に示す図である。図示のように、非線形化処理部１１，１２１ｈ，１２１ｖ，プーリング部１２２ｈ，１２２ｖの処理を経て、行ベクトル５１および列ベクトル５２が出力される。これら行ベクトル５１および列ベクトル５２が正解データであり、サンプル画像と正解データとの組が得られる。 FIG. 9 is a diagram schematically illustrating processing of the neural network processing unit 1 when generating correct answer data. As illustrated, the row vector 51 and the column vector 52 are output through the processing of the non-linearization processing units 11, 121h, 121v and the pooling units 122h, 122v. These row vector 51 and column vector 52 are correct answer data, and a set of sample images and correct answer data is obtained.

検出精度を高めるためには、できるだけ多種多様のサンプル画像を用いる。例えば、検出対象物が一部だけ含まれるサンプル画像を用いるのが望ましい。これにより、検出対象物が一部だけしか入力画像に写っていない場合であっても、検出対象物を精度よく検出できる。また、検出対象物を囲う矩形の縦横比が様々なサンプル画像を用いるのが望ましい。これにより、検出対象物が撮影された方向によって縦横比が異なる場合（例えば検出対象物が車両である場合）であっても、検出対象物を精度よく検出できる。 In order to increase the detection accuracy, as many sample images as possible are used. For example, it is desirable to use a sample image that includes only a part of the detection object. As a result, even when only a part of the detection object is shown in the input image, the detection object can be accurately detected. In addition, it is desirable to use sample images having various aspect ratios of rectangles surrounding the detection target. Thereby, even when the aspect ratio differs depending on the direction in which the detection target is photographed (for example, when the detection target is a vehicle), the detection target can be detected with high accuracy.

その他、画素数が多いサンプル画像や少ないサンプル画像、検出対象物が中央にあるサンプル画像や端にあるサンプル画像、対象検出物が大きいサンプル画像や小さいサンプル画像、検出対象物が正面を向いているサンプル画像や斜めを向いているサンプル画像などを用いるのが望ましい。 In addition, it is desirable to use sample images with many pixels and with few pixels, sample images in which the detection target is at the center and at the edge, sample images in which the detection target is large and small, and sample images in which the detection target faces the front and faces obliquely.

学習部２には、上記の処理によって得られた正解データ（つまり行ベクトル５１および列ベクトル５２）ｔⁱ _m（ｉはサンプル画像のインデックスであり、ｍは行ベクトルであるか列ベクトルであるかを示すインデックス）、および、ニューラルネットワークパラメータをＷとしてサンプル画像ｘⁱをニューラルネットワーク処理部１が処理したときの出力（つまり行ベクトル５１および列ベクトル５２）ｆ_m（ｘⁱ；Ｗ）が入力される。 The learning unit 2 receives the correct data t^i_m obtained by the above processing (that is, the row vector 51 and the column vector 52; i is the index of the sample image, and m is an index indicating whether the vector is the row vector or the column vector), and the output f_m(x^i; W) produced when the neural network processing unit 1 processes the sample image x^i with the neural network parameters W (again, the row vector 51 and the column vector 52).
そして、学習部２は下記（１）式のように目的関数Ｅ（Ｗ）を定義する。 Then, the learning unit 2 defines an objective function E(W) as in the following equation (1).

ここで、ｎはサンプル画像の数である。目的関数Ｅ（Ｗ）はニューラルネットワークパラメータＷ（つまり、フィルタ係数およびバイアス）の関数であり、出力ｆ_m（ｘⁱ；Ｗ）と正解データｔⁱ _mとが一致する場合のみ０となり、一致しない場合は０より大きくなる。 Here, n is the number of sample images. The objective function E(W) is a function of the neural network parameters W (that is, the filter coefficients and biases). It is 0 only when the output f_m(x^i; W) matches the correct data t^i_m, and is greater than 0 otherwise.
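
Equation (1) itself is not reproduced in this text. One objective with the stated properties (zero only when every output equals its correct data, positive otherwise) is a sum of squared differences over the n sample images and the two vectors. The specific form and the factor 1/2 below are assumptions for illustration, not the equation from the source.

```latex
E(W) \;=\; \frac{1}{2} \sum_{i=1}^{n} \sum_{m}
  \bigl\lVert f_m(x^{i}; W) - t^{i}_{m} \bigr\rVert^{2}
```

Here m ranges over the row vector and the column vector, and each norm is taken over the elements of the corresponding vector.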

そこで、学習部２は、目的関数Ｅ（Ｗ）が可能な限り小さくなるようニューラルネットワークパラメータＷを生成する。目的関数Ｅ（Ｗ）を最小化するには、例えば公知の誤差逆伝搬法を適用することができ、具体的には目的関数Ｅ（Ｗ）が収束するまで、下記（２）式に示す更新則を最終段の水平方向マップ生成部１２ｈ側および垂直方向マップ生成部１２ｖ側から順に適用すればよい。 Therefore, the learning unit 2 generates the neural network parameters W so that the objective function E(W) becomes as small as possible. To minimize the objective function E(W), the well-known error backpropagation method can be applied, for example. Specifically, the update rule shown in the following equation (2) is applied in order, starting from the final-stage horizontal direction map generation unit 12h side and vertical direction map generation unit 12v side, until the objective function E(W) converges.
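
Equation (2) is likewise not reproduced in this text. A gradient-descent update of the kind commonly used with error backpropagation is shown below purely as an assumed stand-in; the learning rate ε is a hypothetical symbol that does not appear in the source.

```latex
W \;\leftarrow\; W - \varepsilon \, \frac{\partial E(W)}{\partial W},
\qquad \varepsilon > 0
```

Under this reading, the partial derivatives are evaluated for the final-stage parameters first and then propagated toward the input, stage by stage, in both the horizontal and the vertical branch.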

このようにして生成されたフィルタ係数およびバイアスが、検出処理用ニューラルネットワークパラメータとして、検出処理に用いられる。 The filter coefficients and biases generated in this way are used in the detection process as the neural network parameters for detection processing.
続いて、検出処理について説明する。 Next, the detection process is described.

検出処理時は、ニューラルネットワーク処理部１の各非線形化処理部１１，１２１ｈ，１２１ｖには、検出処理用ニューラルネットワークパラメータが設定されている。また、車載カメラからの画像が入力画像としてニューラルネットワーク処理部１に入力される。そして、図１に示すように、ニューラルネットワーク処理部１は行ベクトル５１および列ベクトル５２を生成する。 At the time of detection processing, neural network parameters for detection processing are set in the non-linearization processing units 11, 121h, 121v of the neural network processing unit 1. An image from the in-vehicle camera is input to the neural network processing unit 1 as an input image. Then, as shown in FIG. 1, the neural network processing unit 1 generates a row vector 51 and a column vector 52.

検出処理用ニューラルネットワークパラメータが適切であれば、行ベクトル５１および列ベクトル５２において、検出対象物の上下左右端に対応する列および行の値が１（あるいは１に近い値）となり、他の値が０（あるいは０に近い値）となる。よって、特定部３は、行ベクトル５１および列ベクトル５２に基づいて、入力画像における検出対象物の上下左右端の位置を特定できる。なお、特定部３ついては、第３の実施形態で詳述する。 If the neural network parameters for detection processing are appropriate, the column and row values corresponding to the upper, lower, left and right edges of the detection target in the row vector 51 and the column vector 52 are 1 (or values close to 1), and other values Becomes 0 (or a value close to 0). Therefore, the specifying unit 3 can specify the positions of the upper, lower, left and right ends of the detection target in the input image based on the row vector 51 and the column vector 52. The specifying unit 3 will be described in detail in the third embodiment.
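
Since the outputs are only close to 0 or 1 in practice, one simple way to read off the positions is to threshold each vector. The sketch below is an assumption made for illustration and is not the method of the specifying unit 3, which the source defers to the third embodiment; the function name, the threshold of 0.5 and the single-object restriction are all hypothetical.

```python
import numpy as np

def edges_from_vectors(row, col, threshold=0.5):
    """Return (xl, xr, yt, yb) as 0-based indices, or None if nothing is found.

    Assumes a single detection target, so the first and last elements above
    the threshold are taken as the two ends in each direction.
    """
    xs = np.flatnonzero(np.asarray(row) >= threshold)
    ys = np.flatnonzero(np.asarray(col) >= threshold)
    if xs.size == 0 or ys.size == 0:
        return None
    return int(xs[0]), int(xs[-1]), int(ys[0]), int(ys[-1])

row = [0.02, 0.90, 0.10, 0.00, 0.97, 0.05]
col = [0.10, 0.95, 0.00, 0.88]
box = edges_from_vectors(row, col)
```

For these example vectors the function yields columns 1 and 4 as the left and right ends and rows 1 and 3 as the upper and lower ends.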

このように、第１の実施形態では、行ベクトル５１を生成する水平方向処理部１２０ｈと、列ベクトル５２を生成する垂直方向処理部１２０ｖとを別々に設ける。そして、水平方向処理部１２０ｈにおけるプーリング部１２２ｈは、マップを水平方向にはプーリングせず垂直方向にのみプーリングする。垂直方向処理部１２０ｖにおけるプーリング部１２２ｖは、マップを垂直方向にはプーリングせず水平方向にのみプーリングする。 As described above, in the first embodiment, the horizontal direction processing unit 120 h that generates the row vector 51 and the vertical direction processing unit 120 v that generates the column vector 52 are provided separately. And the pooling part 122h in the horizontal direction processing part 120h does not pool the map in the horizontal direction but only in the vertical direction. The pooling unit 122v in the vertical processing unit 120v does not pool the map in the vertical direction, but pools the map only in the horizontal direction.

そのため、入力画像の水平方向の解像度を保ったまま検出対象物の水平方向の位置を示す行ベクトル５１を生成できるとともに、入力画像の垂直方向の解像度を保ったまま検出対象物の垂直方向の位置を示す列ベクトル５２を生成できる。したがって、高精度に検出対象物の位置を検出できる。 Therefore, the row vector 51 indicating the horizontal position of the detection target can be generated while maintaining the horizontal resolution of the input image, and the column vector 52 indicating the vertical position of the detection target can be generated while maintaining the vertical resolution of the input image. Consequently, the position of the detection target can be detected with high accuracy.

また、本実施形態によれば、検出対象物の大きさや縦横比に何らの制限もないため、ピラミッド画像の生成やスライディングウィンドウの適用が不要であり、ニューラルネットワーク処理部１の処理負荷を軽減できる。 In addition, according to the present embodiment, since there is no limitation on the size and aspect ratio of the detection target object, generation of a pyramid image and application of a sliding window are unnecessary, and the processing load on the neural network processing unit 1 can be reduced. .

（第２の実施形態） (Second Embodiment)
次に説明する第２の実施形態では、垂直方向処理部１２０ｖで生成されるマップも使用して水平方向処理部１２０ｈが行ベクトル５１を生成するとともに、水平方向処理部１２０ｈで生成されるマップも使用して垂直方向処理部１２０ｖが列ベクトル５２を生成するものである。 In the second embodiment described next, the horizontal direction processing unit 120h generates the row vector 51 by also using the maps generated by the vertical direction processing unit 120v, and the vertical direction processing unit 120v generates the column vector 52 by also using the maps generated by the horizontal direction processing unit 120h.

図１０は、第２の実施形態に係るニューラルネットワーク処理部１’の概略構成を示すブロック図である。図３との相違点として、少なくとも２つの水平方向マップ生成部１２ｈ，１２ｈ’および垂直方向マップ生成部１２ｖ，１２ｖ’が設けられる。初段の水平方向マップ生成部１２ｈおよび垂直方向マップ生成部１２ｖは第１の実施形態で説明したものと同様である。これに対し、２段目以降の水平方向マップ生成部１２ｈ’は、前段の水平方向マップ生成部１２ｈ（１２ｈ’）から出力されるマップのみならず、前段の垂直方向マップ生成部１２ｖ（１２ｖ’）から出力されるマップも用いて、新たなマップを出力する。また、２段目以降の垂直方向マップ生成部１２ｖ’は、前段の垂直方向マップ生成部１２ｖ（１２ｖ’）から出力されるマップのみならず、前段の水平方向マップ生成部１２ｈ（１２ｈ’）から出力されるマップも用いて、新たなマップを出力する。 FIG. 10 is a block diagram illustrating a schematic configuration of the neural network processing unit 1' according to the second embodiment. Unlike FIG. 3, at least two horizontal direction map generation units 12h and 12h' and at least two vertical direction map generation units 12v and 12v' are provided. The first-stage horizontal direction map generation unit 12h and vertical direction map generation unit 12v are the same as those described in the first embodiment. In contrast, each horizontal direction map generation unit 12h' in the second and subsequent stages outputs a new map using not only the map output from the preceding horizontal direction map generation unit 12h (12h') but also the map output from the preceding vertical direction map generation unit 12v (12v'). Likewise, each vertical direction map generation unit 12v' in the second and subsequent stages outputs a new map using not only the map output from the preceding vertical direction map generation unit 12v (12v') but also the map output from the preceding horizontal direction map generation unit 12h (12h').

図１１は、水平方向マップ生成部１２ｈ’および垂直方向マップ生成部１２ｖ’の概略構成を示すブロック図である。図示のように、２段目の水平方向マップ生成部１２ｈ’は非線形化処理部１２１ｈ’を有し、初段の水平方向マップ生成部１２ｈからの出力マップおよび初段の垂直方向マップ生成部１２ｖからの出力マップに対して非線形化処理を行って中間マップを生成する。また、垂直方向マップ生成部１２ｖ’は同様の非線形化処理部１２１ｖ’を有する。なお、プーリング部１２２ｈ，１２２ｖは図３と同様である。 FIG. 11 is a block diagram showing a schematic configuration of the horizontal direction map generator 12h ′ and the vertical direction map generator 12v ′. As shown in the figure, the second-stage horizontal map generation unit 12h ′ has a non-linearization processing unit 121h ′, and outputs an output map from the first-stage horizontal direction map generation unit 12h and a first-stage vertical direction map generation unit 12v. An intermediate map is generated by performing a non-linearization process on the output map. Further, the vertical direction map generation unit 12v ′ includes a similar non-linearization processing unit 121v ′. The pooling portions 122h and 122v are the same as those in FIG.

図１２は、水平方向マップ生成部１２ｈ’および垂直方向マップ生成部１２ｖ’の処理動作の概略を説明する図である。同図において、水平方向マップ生成部１２ｈ’に入力されるのは、初段の水平方向マップ生成部１２ｈから出力される２つのマップ６４，６５であり、水平方向マップ生成部１２ｈ’が出力するのは２つのマップ９１，９２とする。また、垂直方向マップ生成部１２ｖ’に入力されるのは、初段の垂直方向マップ生成部１２ｖから出力される２つのマップ６６，６７であり、垂直方向マップ生成部１２ｖ’が出力するのは２つのマップ９３，９４とする。 FIG. 12 is a diagram outlining the processing operation of the horizontal direction map generation unit 12h' and the vertical direction map generation unit 12v'. In the figure, the inputs to the horizontal direction map generation unit 12h' are the two maps 64 and 65 output from the first-stage horizontal direction map generation unit 12h, and its outputs are the two maps 91 and 92. Similarly, the inputs to the vertical direction map generation unit 12v' are the two maps 66 and 67 output from the first-stage vertical direction map generation unit 12v, and its outputs are the two maps 93 and 94.

As illustrated, the non-linearization processing unit 121h′ generates an intermediate map 74 from the maps 64 to 67. The resolution of the intermediate map 74 equals that of the maps 64 and 65. The pooling unit 122h then pools the intermediate map 74 only in the vertical direction to generate the map 91, so the horizontal resolution of the map 91 equals that of the map 64. Similarly, the non-linearization processing unit 121h′ generates an intermediate map 75 from the maps 64 to 67, and the pooling unit 122h generates the map 92 from the intermediate map 75.

Meanwhile, the non-linearization processing unit 121v′ generates an intermediate map 76 from the maps 64 to 67. The resolution of the intermediate map 76 equals that of the maps 66 and 67. The pooling unit 122v then pools the intermediate map 76 only in the horizontal direction to generate the map 93, so the vertical resolution of the map 93 equals that of the map 66. Similarly, the non-linearization processing unit 121v′ generates an intermediate map 77 from the maps 64 to 67, and the pooling unit 122v generates the map 94 from the intermediate map 77.
Because the maps 91 to 94 are all generated in the same way, the generation of the map 93 is described in detail below as a representative example.

FIG. 13 illustrates how the map 93 is generated. Assume that the input image has C × R pixels. As a result of the processing by the first-stage horizontal direction map generation unit 12h, the maps 64 and 65 input to the vertical direction map generation unit 12v′ have C × R/2 pixels. As a result of the processing by the first-stage vertical direction map generation unit 12v, the maps 66 and 67 input to the vertical direction map generation unit 12v′ have C/2 × R pixels.

The non-linearization processing unit 121v′ convolves the maps 66 and 67 to generate intermediate maps 66′ and 67′, respectively. Like the maps 66 and 67, the intermediate maps 66′ and 67′ have C/2 × R pixels. The convolution here is, for example, the processing shown in FIG. 5(b).

The non-linearization processing unit 121v′ also convolves the maps 64 and 65 to generate intermediate maps 64′ and 65′, respectively. In doing so, it halves the number of horizontal pixels and doubles the number of vertical pixels of the maps 64 and 65, as described below.

FIG. 14 illustrates how the intermediate map 64′ is generated. In the figure, the filter has 3 × 3 pixels. As shown in FIG. 14(a), the non-linearization processing unit 121v′ first positions the filter so that its center overlaps the pixel (1, 1) of the map 64. It then computes the inner product of the filter coefficients and the pixel values of the part of the map 64 covered by the filter, and assigns the result to the upper-left pixel (1, 1) of the intermediate map 64′ and to the pixel (1, 2) directly below it. In other words, a single inner product is assigned to two vertically adjacent pixels of the intermediate map 64′.

Next, as shown in FIG. 14(b), the non-linearization processing unit 121v′ shifts the filter two pixels to the right so that its center overlaps the pixel (3, 1) of the map 64. It then assigns the inner product of the filter coefficients and the covered pixel values of the map 64 to the pixel (2, 1) of the intermediate map 64′ and to the pixel (2, 2) directly below it.
Processing one row of the map 64 in this way, shifting the filter two pixels to the right each time, determines the values of two rows of the intermediate map 64′.

After the first row of the map 64 has been processed, as shown in FIG. 14(c), the non-linearization processing unit 121v′ shifts the filter one pixel down so that its center overlaps the pixel (1, 2) of the map 64, computes the inner product, and assigns it to the pixel (1, 3) of the intermediate map 64′ and to the pixel (1, 4) directly below it.

Shifting the filter in this way produces the intermediate map 64′ with C/2 × R pixels from the map 64 with C × R/2 pixels.

More generally, to generate the intermediate map 64′ by scaling the number of horizontal pixels of the map 64 by 1/p and the number of vertical pixels by q (p and q being arbitrary integers), the non-linearization processing unit 121v′ assigns the inner product of the filter coefficients and the covered pixel values of the map 64 to the corresponding pixel of the intermediate map 64′ and to the (q−1) pixels below it. Performing this while shifting the filter p pixels at a time in the horizontal direction of the map 64 and one pixel at a time in the vertical direction yields the intermediate map 64′.
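The resampling convolution described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation in the specification: the function name `resample_conv`, the zero padding at the map border, and the requirement that the map width be divisible by p are all assumptions made here.

```python
def resample_conv(src, kernel, p=2, q=2):
    """Convolve `src` (a list of rows) with `kernel`, moving the filter
    p pixels at a time horizontally and 1 pixel at a time vertically.
    Each inner product is written to q vertically adjacent output pixels.
    Zero padding outside the map is an assumption of this sketch."""
    rows, cols = len(src), len(src[0])
    kh, kw = len(kernel), len(kernel[0])
    out_cols = cols // p                      # assumes cols is divisible by p
    out = [[0.0] * out_cols for _ in range(rows * q)]
    for r in range(rows):                     # filter centre moves 1 pixel down
        for oc in range(out_cols):
            c = oc * p                        # filter centre moves p pixels right
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr = r + i - kh // 2
                    cc = c + j - kw // 2
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += src[rr][cc] * kernel[i][j]
            for k in range(q):                # copy to q pixels stacked vertically
                out[r * q + k][oc] = acc
    return out
```

With p = q = 2, a map of C × R/2 pixels becomes a map of C/2 × R pixels, matching the case of the intermediate map 64′.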

The intermediate maps 64′ and 65′ in FIG. 13 are generated as described above. The non-linearization processing unit 121v′ then sums the intermediate maps 64′ to 67′ pixel by pixel to generate the intermediate map 76 with C/2 × R pixels (two-dot chain line in FIG. 13). Although not illustrated, a bias is added after the summation. The pooling unit 122v then pools the intermediate map 76 to generate the map 93 with C/4 × R pixels. The map 94 in FIG. 12 can be generated in the same way.
By swapping the horizontal and vertical directions in the above description, the non-linearization processing unit 121h′ can generate the maps 91 and 92.
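The summation and pooling steps for the map 93 can be sketched as below. The name `combine_and_pool`, the use of max pooling, and the pooling width of 2 are assumptions made for illustration; the passage only states that the maps are summed, a bias is added, and the result is pooled horizontally. Any activation function is omitted because the passage does not specify one.

```python
def combine_and_pool(maps, bias=0.0, pool=2):
    """Sum equally sized maps pixel by pixel, add a bias, then pool each
    row horizontally. Max pooling is an assumption of this sketch."""
    rows, cols = len(maps[0]), len(maps[0][0])
    # Pixel-wise sum of all intermediate maps, plus the bias.
    summed = [[sum(m[r][c] for m in maps) + bias for c in range(cols)]
              for r in range(rows)]
    # Horizontal-only pooling: the number of rows is unchanged.
    pooled = [[max(row[c:c + pool]) for c in range(0, cols, pool)]
              for row in summed]
    return summed, pooled
```

Pooling only along each row halves the horizontal pixel count (C/2 to C/4) while keeping the vertical resolution R.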

That is, to generate an intermediate map (not shown; it corresponds to the intermediate map 64′ in FIGS. 13 and 14) by scaling the number of horizontal pixels of the map 66 by p and the number of vertical pixels by 1/q (p and q being arbitrary integers), the non-linearization processing unit 121h′ assigns the inner product of the filter coefficients and the covered pixel values of the map 66 to the corresponding pixel of the intermediate map and to the (p−1) pixels to its right. Performing this while shifting the filter one pixel at a time in the horizontal direction of the map 66 and q pixels at a time in the vertical direction yields the intermediate map.

The following explains how detection accuracy improves when the horizontal direction processing unit 120h also uses maps generated by the vertical direction processing unit 120v, and the vertical direction processing unit 120v also uses maps generated by the horizontal direction processing unit 120h.

FIG. 15 schematically shows the processing operation of the neural network processing unit 1′ according to the second embodiment. For simplicity, it shows the intermediate map 76 of FIG. 13 being generated from one map 64 output from the first-stage horizontal direction map generation unit 12h and one map 66 output from the first-stage vertical direction map generation unit 12v. The detection target is assumed to be a person.

When a person appears in the input image, the map 64 contains features of the person compressed in the vertical direction, and the map 66 contains features of the person compressed in the horizontal direction. In this embodiment, the intermediate map 76 is generated from these two maps 64 and 66. A pixel 76a of the intermediate map 76 is generated from the inner product over a rectangular region 64a of the map 64 and the inner product over a rectangular region 66a of the map 66. If the pixel 76a were generated from the rectangular region 66a alone, it would be generated only from the image around the tip of the person's foot in the map 66.

However, it can be difficult to detect a person's position accurately from an image of only the tip of a foot, which is a very small part of the body, because many regions resembling the tip of a foot may exist in the image.

In this embodiment, by contrast, the pixel 76a is generated using not only the rectangular region 66a but also the rectangular region 64a. The pixel 76a is therefore generated from an image covering a vertically wider range than the tip of the foot (specifically, from the base of the leg to around the waist and abdomen). As a result, the person is detected using information from a wider region of the body, which improves detection accuracy.

Thus, in the second embodiment, the horizontal direction processing unit 120h generates the row vector 51 using maps generated by the vertical direction processing unit 120v as well, and the vertical direction processing unit 120v generates the column vector 52 using maps generated by the horizontal direction processing unit 120h as well. The position of the detection target can therefore be detected with higher accuracy.

(Third embodiment)
The third embodiment describes the specifying unit 3 in detail. This embodiment assumes a single type of detection target.

FIG. 16 illustrates how the specifying unit 3 identifies the left and right end positions of the detection target. It shows an example in which processing the input image yields two row vectors 51a and 51b.

Owing to the pooling performed by the pooling unit 122h in the horizontal direction processing unit 120h, the row vector 51a contains information on the upper-half region 51A of the input image. Similarly, the row vector 51b contains information on the lower-half region 51B of the input image. In other words, the row vectors 51a and 51b correspond to the upper-half region 51A and the lower-half region 51B of the input image, respectively.

Accordingly, if the row vector 51a contains a value of 1 (or a value close to 1; the same applies below), as in columns xla and xra in the example of FIG. 16, the upper-half region 51A of the input image contains a left end candidate and/or a right end candidate of the detection target. If the row vector 51a contains no value of 1, the upper-half region 51A contains no left or right end candidate.

Similarly, if the row vector 51b contains a value of 1 (columns xlb and xrb in the example of FIG. 16), the lower-half region 51B of the input image contains a left end candidate and/or a right end candidate of the detection target. If the row vector 51b contains no value of 1, the lower-half region 51B contains no left or right end candidate.

FIG. 17 illustrates how the specifying unit 3 identifies the upper and lower ends of the detection target. It shows an example in which processing the input image yields three column vectors 52a to 52c.

Owing to the pooling performed by the pooling unit 122v in the vertical direction processing unit 120v, the column vector 52a contains information on the left region 52A, one of three equal vertical strips into which the input image is divided. Similarly, the column vectors 52b and 52c contain information on the central region 52B and the right region 52C, respectively. In other words, the column vectors 52a to 52c correspond to the left region 52A, the central region 52B, and the right region 52C of the input image, respectively.

Accordingly, if the column vector 52a contains a value of 1, the left region 52A of the input image contains an upper end candidate and/or a lower end candidate of the detection target. If the column vector 52a contains no value of 1, the left region 52A contains no upper or lower end candidate.

Similarly, if the column vector 52b contains a value of 1 (rows ytb and ybb in the example of FIG. 17), the central region 52B of the input image contains an upper end candidate and/or a lower end candidate of the detection target. If the column vector 52b contains no value of 1, the central region 52B contains no upper or lower end candidate.

Likewise, if the column vector 52c contains a value of 1 (rows ytc and ybc in the example of FIG. 17), the right region 52C of the input image contains an upper end candidate and/or a lower end candidate of the detection target. If the column vector 52c contains no value of 1, the right region 52C contains no upper or lower end candidate.

FIG. 18 shows a first example of the processing operation of the specifying unit 3, in which processing the input image yields two row vectors 51a and 51b and three column vectors 52a to 52c.

In FIG. 18, the row vectors 51a and 51b both have the value 1 in the same two columns xl and xr. The pixel positions corresponding to these columns are the left and right end candidates of the detection target. Likewise, the column vectors 52a to 52c all have the value 1 in the same two rows yt and yb. The pixel positions corresponding to these rows are the upper and lower end candidates of the detection target.

FIG. 18 is a simple example, in which the specifying unit 3 can determine that the detection target lies in the rectangular region bounded by the coordinates (xl, yt), (xr, yt), (xl, yb), and (xr, yb).

In practice, the row vector 51 and the column vector 52 may contain noise, so their values may lie between 0 and 1. The closer a value is to 1, the more likely that column or row corresponds to an end of the detection target, but values in the neighborhood of such a column or row may also be close to 1. The specifying unit 3 may therefore remove noise as follows.

FIG. 19 illustrates noise removal by the specifying unit 3, using the example of removing noise from the two row vectors 51a and 51b. FIG. 20 is a flowchart of the noise removal procedure.

First, the specifying unit 3 selects the maximum value among the columns of one row vector 51a (step S1 in FIG. 20). In the example of FIG. 19, assume that column x1 holds the maximum. The specifying unit 3 then compares the maximum with a predetermined threshold (step S2 in FIG. 20).

If the maximum is smaller than the threshold (NO in step S2), the specifying unit 3 determines that the row vector 51a contains no left or right end of the detection target and sets the parameter flag to false. It then proceeds to process the next row vector 51b (step S5).

If the maximum is greater than or equal to the threshold (YES in step S2), the specifying unit 3 takes the column holding the maximum (column x1 in the example of FIG. 19) as a left/right end candidate of the detection target (step S3 in FIG. 20) and sets the parameter flag to true. It then sets a mask over a predetermined region around that column (in the example of FIG. 19, two columns on each side of column x1) (step S4 in FIG. 20). Next, the specifying unit 3 selects the maximum among the unmasked columns of the row vector 51a (step S5). In the example of FIG. 19, assume that column x2 holds this maximum.
Steps S2 to S5 are then repeated until the selected maximum falls below the threshold (that is, until the parameter flag is set to false).

When the processing of the row vector 51a is complete, the same processing is performed on the row vector 51b (step S6). Noise is removed from the column vectors 52 in the same way.
When two left/right end candidates are detected, the specifying unit 3 can treat the one on the left as the left end candidate and the one on the right as the right end candidate.
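The loop of steps S1 to S5 amounts to greedy peak picking with a suppression mask. A minimal sketch follows, assuming a hypothetical function name `find_edge_candidates`, a threshold of 0.5, and a mask radius of two entries on each side. None of these values is fixed by the text, and the sketch also masks the selected entry itself so that it is not selected again.

```python
def find_edge_candidates(vec, threshold=0.5, mask_radius=2):
    """Return sorted indices of end candidates in one row or column vector.
    Repeatedly pick the largest unmasked value; stop when it drops below
    the threshold. Threshold and mask radius are illustrative assumptions."""
    n = len(vec)
    masked = [False] * n
    candidates = []
    while True:
        free = [i for i in range(n) if not masked[i]]
        if not free:
            break
        best = max(free, key=lambda i: vec[i])   # steps S1 / S5: pick the maximum
        if vec[best] < threshold:                # step S2: NO, so flag = false
            break
        candidates.append(best)                  # step S3: record the candidate
        lo = max(0, best - mask_radius)
        hi = min(n, best + mask_radius + 1)
        for i in range(lo, hi):                  # step S4: mask the neighbourhood
            masked[i] = True
    return sorted(candidates)
```

When exactly two indices are returned, the smaller one can be read as the left (or upper) end candidate and the larger one as the right (or lower) end candidate.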

Because of the mask, the left end candidate and the right end candidate of the detection target are detected at positions separated by at least the number of pixels determined by the mask size. The same holds for the upper end candidate and the lower end candidate. In what follows, the left and right end candidates are assumed to have been detected from the row vector 51, and the upper and lower end candidates from the column vector 52, in this manner.

Note that this processing may find no left/right end candidate, only one, or three or more. When the application guarantees, for example, that one detection target fits within the input image, the threshold test of step S2 may be skipped and the first two columns selected as maxima may be taken as the left and right end candidates. The same applies to the upper and lower end candidates.

FIG. 21 shows a second example of the processing operation of the specifying unit 3. In the figure, left and right end candidates xla and xra are detected in the row vector 51a, and left and right end candidates xlb and xrb are detected in the row vector 51b. Upper and lower end candidates ytb and ybb are detected in the column vector 52b, and upper and lower end candidates ytc and ybc are detected in the column vector 52c. Drawing a total of eight lines at the positions of these candidates forms crosses (cross points) at four points: (xla, ytb), (xra, ytc), (xlb, ybb), and (xrb, ybc). These four points are the candidate upper-left, upper-right, lower-left, and lower-right points of the detection target.

However, the sides of the quadrilateral obtained by connecting these four points are not parallel to the horizontal and vertical directions. The specifying unit 3 can therefore identify the position of the detection target as the rectangle connecting the four points (xl, yt), (xr, yt), (xl, yb), and (xr, yb) obtained from averages, for example as follows.
xl = (xla + xlb) / 2
xr = (xra + xrb) / 2
yt = (ytb + ytc) / 2
yb = (ybb + ybc) / 2
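As a small worked illustration of these averages, the following sketch returns the axis-aligned rectangle for the FIG. 21 case. The function name `box_from_candidates` and the return order are hypothetical choices made here.

```python
def box_from_candidates(xla, xra, xlb, xrb, ytb, ybb, ytc, ybc):
    """Average paired end candidates into one axis-aligned rectangle,
    returned as (xl, yt, xr, yb)."""
    xl = (xla + xlb) / 2   # left end: mean of the two row-vector candidates
    xr = (xra + xrb) / 2   # right end
    yt = (ytb + ytc) / 2   # upper end: mean of the two column-vector candidates
    yb = (ybb + ybc) / 2   # lower end
    return xl, yt, xr, yb
```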

FIG. 22 shows a third example of the processing operation of the specifying unit 3. In the figure, left and right end candidates xla and xra are detected in the row vector 51a, and left and right end candidates xlb and xrb are detected in the row vector 51b. Upper and lower end candidates ytb and ybb are detected in the column vector 52b. In this case as well, crosses (cross points) are formed at four points: (xla, ytb), (xra, ytb), (xlb, ybb), and (xrb, ybb). These four points are the candidate upper-left, upper-right, lower-left, and lower-right points of the detection target.
The specifying unit 3 can then identify the position of the detection target as, for example, the rectangle connecting four points obtained from averages.

As shown in FIGS. 21 and 22, when crosses are formed at four points, the specifying unit 3 can determine that the detection target is included in the input image.
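One way to count cross points programmatically is sketched below. It rests on an interpretation of FIGS. 21 to 23 that the text does not state explicitly: a candidate x from the k-th row vector is treated as a vertical line segment spanning the k-th horizontal band, a candidate y from the j-th column vector as a horizontal segment spanning the j-th vertical strip, and a cross exists where the two segments intersect. The function name `cross_points` and the equal-sized bands and strips are assumptions.

```python
def cross_points(row_cands, col_cands, width, height):
    """row_cands[k]: x candidates from the k-th row vector (k-th band).
    col_cands[j]: y candidates from the j-th column vector (j-th strip).
    Returns the (x, y) points where a vertical and a horizontal segment cross."""
    band_h = height / len(row_cands)     # height of each horizontal band
    strip_w = width / len(col_cands)     # width of each vertical strip
    points = []
    for k, xs in enumerate(row_cands):
        for j, ys in enumerate(col_cands):
            for x in xs:
                for y in ys:
                    in_strip = j * strip_w <= x < (j + 1) * strip_w
                    in_band = k * band_h <= y < (k + 1) * band_h
                    if in_strip and in_band:
                        points.append((x, y))
    return points
```

Under this reading, four returned points correspond to the situation of FIGS. 21 and 22, and fewer than four points call for an application-specific interpretation.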

FIG. 23 shows a fourth example of the processing operation of the specifying unit 3. In the figure, a left/right end candidate xa is detected in the row vector 51a, and a left/right end candidate xb is detected in the row vector 51b. Upper and lower end candidates yt and yb are detected in the column vector 52a. Crosses (cross points) are therefore formed at two points, (xa, yt) and (xb, yb).

In this case, one possible interpretation is that the upper, lower, and right ends of the detection target are within the input image and only the left end lies outside it. The specifying unit 3 therefore assumes that only part of the detection target (a vehicle in the example of FIG. 23) is contained in the region 52A of the input image and identifies, for example, the upper-right coordinates ((xa + xb)/2, yt) and the lower-right coordinates ((xa + xb)/2, yb) of the detection target.

FIG. 24 shows a fifth example of the processing operation of the specifying unit 3. In the figure, upper and lower end candidates yt and yb are detected in the column vector 52c, whereas no left or right end candidate is detected in the row vectors 51a and 51b. No cross is therefore formed. In this case, the specifying unit 3 can interpret the upper and lower end candidates yt and yb as false detections and determine that the detection target is not included in the input image.

FIG. 25 shows a sixth example of the processing operation of the specifying unit 3. In the figure, left and right end candidates xla and xra are detected in the row vector 51a, and left and right end candidates xlb and xrb are detected in the row vector 51b, whereas no upper or lower end candidate is detected in the column vectors 52a to 52c. No cross is therefore formed.

In this case, however, one possible interpretation is that the left and right ends of the detection target are within the input image and the upper and lower ends lie outside it. The specifying unit 3 can therefore assume that part of the detection target (a person in the example of FIG. 25) is included in the input image and identify, for example, the left end coordinate (xla + xlb)/2 and the right end coordinate (xra + xrb)/2 of the detection target.

In FIGS. 23 to 25, fewer than four crosses are formed, and various interpretations are possible in such cases. For example, in FIG. 24, the specifying unit 3 may instead interpret the absence of left and right end candidates in the row vectors 51a and 51b as a detection error and identify only the upper end coordinate yt and the lower end coordinate yb of the detection target. Likewise, in FIG. 25, the specifying unit 3 may interpret the left and right end candidates as false detections and determine that the detection target is not included in the input image.

The interpretation to apply when fewer than four crosses are formed should therefore be determined in advance according to the application of the detection device 100, the shape of the detection target (a person, for example, is expected to be vertically elongated), and similar factors.

The specifying unit 3 may further calculate the positional relationship between the in-vehicle camera and the detection target based on (at least some of) the detected upper, lower, left, and right end positions of the detection target.

FIG. 26 illustrates a method of calculating the distance Z in the traveling direction between the in-vehicle camera 101 and the detection target 102 (here, a preceding vehicle). In the figure, a vehicle 110 includes a vehicle body 111, the in-vehicle camera 101 mounted on it, and the detection device 100. The in-vehicle camera 101 is assumed to be installed at a known height h (for example, 130 cm). A virtual image plane is set at the focal length of the in-vehicle camera 101, f pixels from the camera. In this image plane, the center is the origin, the horizontal direction (the traveling direction of the vehicle 110) is the x axis, and the vertical direction (pointing vertically downward) is the y axis.

Suppose that the coordinate of the lower end position of the detection target 102 specified by the specifying unit 3 of the detection device 100 is yb. Then, as illustrated, from the similarity of triangles, the specifying unit 3 can calculate the distance Z in the traveling direction between the in-vehicle camera 101 and the detection target 102 using the following equation.
Z = fh / yb

FIG. 27 is a diagram illustrating a method for calculating the lateral distances X_L and X_R between the in-vehicle camera 101 and the detection target 102. The coordinate axes are set as in FIG. 26, and the coordinates of the left end position and the right end position of the detection target 102 specified by the specifying unit 3 are assumed to be xl and xr, respectively.
Then, as illustrated, from the similarity of triangles, the specifying unit 3 can calculate the lateral distances X_L and X_R between the in-vehicle camera 101 and the detection target 102 using the following equations.
X_L = Z xl / f
X_R = Z xr / f
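
The similar-triangle relations for Z, X_L, and X_R can be illustrated with a short Python sketch. The function names and the guard on yb are assumptions made for illustration and do not appear in the specification; the guard reflects the fact that the ground-plane relation only yields a positive distance when the lower end lies below the image center.

```python
def longitudinal_distance(f_px, camera_height, yb):
    """Distance Z in the traveling direction, Z = f * h / yb.

    f_px: focal length in pixels; camera_height: known height h of the camera;
    yb: lower-end coordinate of the target (y-axis pointing downward,
    origin at the image center).
    """
    if yb <= 0:
        # Assumption: a lower end at or above the image center gives no
        # meaningful ground-plane distance, so it is rejected here.
        raise ValueError("yb must be positive (below the image center)")
    return f_px * camera_height / yb


def lateral_distances(f_px, z, xl, xr):
    """Lateral distances X_L = Z * xl / f and X_R = Z * xr / f."""
    return z * xl / f_px, z * xr / f_px


# Example values (hypothetical): f = 1000 px, h = 1.3 m, yb = 100 px.
z = longitudinal_distance(1000.0, 1.3, 100.0)
x_left, x_right = lateral_distances(1000.0, z, -50.0, 50.0)
```

The distances come out in the same unit as the camera height h, since f, yb, xl, and xr are all expressed in pixels.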

As described above, in the third embodiment, the specifying unit 3 interprets the row vector 51 and the column vector 52 appropriately, so that the position of the detection target can be specified flexibly even when only a part of the detection target is included in the input image.

(Fourth embodiment)
The first to third embodiments described above assumed a single type of detection target (for example, a person). In contrast, the fourth embodiment described below specifies the positions of multiple types of detection targets (for example, people and vehicles).

FIG. 28 is a block diagram showing a schematic configuration of a neural network processing unit 1'' according to the fourth embodiment. When there are n types of detection targets (referred to as detection targets 1 to n), the difference from the first embodiment (FIG. 3) is that the row vector 51 and the column vector 52 output by the horizontal direction processing unit 120h and the vertical direction processing unit 120v, respectively, each have n maps. The row vector 51 and the column vector 52 in the i-th map (i = 1 to n) indicate the detection result for detection target i.
Note that, depending on the number of pixels of the input image and the number of stages of the pooling unit 122h, a plurality of row vectors 51 and column vectors 52 may be output per map.

In the learning process, a plurality of sample images, each showing one of the detection targets 1 to n, are used. To generate the correct answer data, n input images 1 to n are input to the neural network processing unit 1'' for each sample image. When the sample image contains detection target i, all pixels of each input image k (k = 1 to n, k ≠ i) are 0, and in input image i only the pixels at the upper, lower, left, and right ends of the detection target are 1 and the other pixels are 0. The rest is the same as in the first embodiment.
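
As a minimal sketch of how the n input images for one sample image might be assembled, the following Python function takes an already prepared binary edge image for the detected class and fills the remaining classes with zeros. The function name, the 0-based class index, and the nested-list image representation are assumptions for illustration; how the edge pixels themselves are marked is left to the caller.

```python
def make_class_input_images(edge_mask, class_index, n_classes):
    """Build n input images for one sample image.

    edge_mask: 2-D list of 0/1 values marking the end pixels of the target.
    class_index: 0-based index of the target type shown in the sample image
                 (an assumption; the text numbers the targets 1 to n).
    Returns a list of n images; only images[class_index] is non-zero.
    """
    if not 0 <= class_index < n_classes:
        raise ValueError("class_index out of range")
    height = len(edge_mask)
    width = len(edge_mask[0]) if height else 0
    images = []
    for k in range(n_classes):
        if k == class_index:
            images.append([row[:] for row in edge_mask])  # copy of the edge image
        else:
            images.append([[0] * width for _ in range(height)])  # all-zero image
    return images
```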

In the detection process, the position of the detection target is specified for each map. That is, to detect detection target i, the specifying unit 3 specifies the position of detection target i based on the row vector 51 and the column vector 52 in the i-th map.

As described above, in the fourth embodiment, the row vector 51 and the column vector 52 output by the neural network processing unit 1'' each have n maps, and the learning process is performed using sample images containing each detection target, so that n types of detection targets can be detected separately. This embodiment can also be applied to the second and third embodiments.

(Fifth embodiment)
In the fifth embodiment described next, the neural network structure itself is the same as in the first embodiment and the others, but the form of the output row vector and column vector differs from that of the first to fourth embodiments. The detection target in the input image is specified not necessarily as a rectangle defined by its upper, lower, left, and right ends, but with an arbitrary shape.

FIG. 29 is a diagram illustrating an overview of the processing operation of the neural network processing unit 1 according to the fifth embodiment, and corresponds to FIG. 1(b). When the detection target in the input image is the same rectangle as in FIG. 1(b), the neural network processing unit 1 outputs a row vector 51' in which only columns xl to xr have the value 1 and the other columns have the value 0, and a column vector 52' in which only rows yt to yb have the value 1 and the other rows have the value 0. That is, in the row vector 51' and the column vector 52', the columns and rows corresponding to positions where the detection target is present have the value 1, and the columns and rows corresponding to positions where no detection target is present have the value 0.
To output such a row vector 51' and column vector 52', the learning process may be performed as follows.

FIG. 30 is a diagram illustrating an input image for generating correct answer data from a sample image, and corresponds to FIG. 8. In this embodiment, for the sample image shown in FIG. 28(a), the input image shown in (b) of the same figure is generated manually. That is, in the input image, pixels where the detection target is present have the value 1, and pixels where no detection target is present have the value 0.

FIG. 31 is a diagram illustrating an input image for generating correct answer data from another sample image. The sample image shown in FIG. 31(a) contains a plurality of detection targets. In the input image in this case (FIG. 31(b)), pixels where at least one detection target is present have the value 1, and the other pixels have the value 0. With such an input image, the region in which the pixel value is 1 is not limited to a rectangle.

An input image such as those described above is input to the neural network processing unit 1. The neural network parameters for learning are as described above. By performing the learning process in this way to generate the neural network parameters for detection, the detection process on an input image from the in-vehicle camera outputs a row vector 51' and a column vector 52' as shown in FIG. 29.
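
The relationship between a binary input image and the resulting row and column vectors can be sketched as a projection. The Python function below takes the maximum over each column and over each row; this is a simplifying assumption for illustration, since in the specification the correct answer data is produced by the network itself using the learning parameters. The function name is hypothetical.

```python
def project_mask(mask):
    """Project a binary mask (2-D list of 0/1) onto the two axes.

    Returns (row_vector, column_vector):
    row_vector[x] is 1 if any pixel in column x is 1,
    column_vector[y] is 1 if any pixel in row y is 1.
    """
    row_vector = [max(column) for column in zip(*mask)]
    column_vector = [max(row) for row in mask]
    return row_vector, column_vector


# A 3x4 mask with a target occupying columns 1-2 and rows 1-2.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
row_vector, column_vector = project_mask(mask)
```

With a non-rectangular mask the same projection still applies, which matches the point that the region of 1-valued pixels need not be a rectangle.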

In this embodiment, the specifying unit 3 can determine, without any particularly complicated interpretation, that a detection target is present at the positions of the columns and rows whose values are 1 in the row vector 51' and the column vector 52'.

Depending on the application, it may not be necessary to specify the position of each detection target in detail; a rough position may suffice. For example, even a detection result such as "there is a group of people on the left" can contribute sufficiently to safe driving. The approach of this embodiment is therefore effective particularly when there are multiple detection targets.

(Sixth embodiment)
The sixth embodiment described next assumes that the detection target is vertically long, such as a two-wheeled vehicle or a person. The neural network processing unit in the detection device first detects the horizontal position of the detection target in the input image and then detects its vertical position.

FIG. 31 is a block diagram showing a schematic configuration of a neural network processing unit 1''' according to the sixth embodiment. The neural network processing unit 1''' performs the detection process as follows. It is first assumed that no more than one detection target is present. The following description focuses on the differences from the embodiments described above.

The map generated from the input image by the non-linearization processing unit 11 is input only to the horizontal direction processing unit 120h, not to the vertical direction processing unit 120v. The horizontal direction processing unit 120h and the vertical direction processing unit 120v do not exchange information directly.

The horizontal direction processing unit 120h outputs one row vector 51. In the row vector 51, as shown in FIG. 29, the columns corresponding to positions where the detection target is present have the value 1. Various methods are conceivable for limiting the number of output row vectors 51 to one. For example, n stages of pooling units 122h, each of which pools the number of pixels in the vertical direction to 1/2, may be provided, and either an input image whose number of pixels in the vertical direction is 2ⁿ may be input to the horizontal direction processing unit 120h, or an input image resized or cropped in advance so that its number of pixels in the vertical direction is 2ⁿ may be input. Alternatively, the number of pixels in the vertical direction of the input image may be arbitrary, and the reduction ratio in the pooling units 122h may be adjusted according to that number so that one row vector 51 is finally output.
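
The first option, n stages that each halve the vertical pixel count of a map with 2ⁿ rows, can be sketched in Python as follows. Max pooling over adjacent row pairs is an assumption (this passage does not state the pooling operator), the non-linearization steps between stages are omitted, and the function names are hypothetical.

```python
def pool_vertical_half(feature_map):
    """Halve the number of rows by taking the max of each adjacent row pair.

    The horizontal resolution (number of columns) is left unchanged.
    """
    if len(feature_map) % 2:
        raise ValueError("number of rows must be even")
    return [
        [max(a, b) for a, b in zip(top, bottom)]
        for top, bottom in zip(feature_map[0::2], feature_map[1::2])
    ]


def to_row_vector(feature_map):
    """Apply vertical halving until one row remains.

    Returns (row_vector, stages). Requires a power-of-two number of rows.
    """
    height = len(feature_map)
    if height == 0 or height & (height - 1):
        raise ValueError("number of rows must be a power of two")
    stages = 0
    while len(feature_map) > 1:
        feature_map = pool_vertical_half(feature_map)
        stages += 1
    return feature_map[0], stages
```

For a map with 4 rows this runs 2 stages, and in general a map with 2ⁿ rows needs n stages to reach a single row vector.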

The neural network processing unit 1''' of this embodiment has a region specifying unit 13 that specifies, based on the row vector, a region of the input image that includes the detection target. In the following, the region specifying unit 13 is assumed to specify a rectangular region.

FIG. 32 is a diagram illustrating the processing operation of the region specifying unit 13. The region specifying unit 13 specifies a rectangular region as follows. The center of the rectangular region coincides with the center of the columns whose value is 1 in the row vector 51. The number of pixels in the horizontal direction (width) of the rectangular region may be a predetermined fixed value (for example, 64 pixels), may be selected from predetermined options (for example, the three options of 32, 128, and 512 pixels) according to the number of pixels of the input image, or may be made larger as the number of consecutive 1 values increases. The number of pixels in the vertical direction (height) of the rectangular region equals the number of pixels in the vertical direction of the input image.
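
A minimal Python sketch of the fixed-width variant follows. The function name, the (left, top, width, height) return convention, and the clamping of the region to the image borders are assumptions for illustration; the specification does not say how a region that would extend past the image edge is handled. A single run of 1 values is assumed.

```python
def specify_region(row_vector, image_height, width=64):
    """Return (left, top, width, height) of the rectangular region, or None.

    The region is centered on the middle of the 1-valued columns and spans
    the full image height. None is returned when no column has the value 1.
    """
    ones = [x for x, value in enumerate(row_vector) if value == 1]
    if not ones:
        return None
    image_width = len(row_vector)
    width = min(width, image_width)
    center = (ones[0] + ones[-1]) // 2
    left = center - width // 2
    # Assumption: shift the region inward if it would cross the image border.
    left = max(0, min(left, image_width - width))
    return (left, 0, width, image_height)
```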

Returning to FIG. 31, the rectangular region specified by the region specifying unit 13 is input to the vertical direction processing unit 120v as a map based on the input image. One or more non-linearization processing units may be provided between the region specifying unit 13 and the vertical direction processing unit 120v. When the row vector 51 contains no 1 values, that is, when no detection target is detected, the region specifying unit 13 does not specify a rectangular region, and the vertical direction processing unit 120v therefore performs no processing.

The vertical direction processing unit 120v, having received the rectangular region, processes it and outputs one column vector 52. In the column vector 52, as shown in FIG. 29, the rows corresponding to positions where the detection target is present have the value 1.

If the number of pixels in the horizontal direction of the rectangular region is a fixed value, a number of stages of pooling units 122v matching that value may be provided. When the number of pixels in the horizontal direction of the rectangular region is selected from a plurality of options, as many vertical direction processing units 120v as there are options, each with a different number of stages of pooling units 122v, may be provided, and one of them processes the rectangular region. For example, when the options are 32 (= 2⁵), 128 (= 2⁷), and 512 (= 2⁹) pixels, three vertical direction processing units 120v may be provided, having 5, 7, and 9 stages of pooling units 122v, respectively, and the vertical direction processing unit 120v corresponding to the selected number of horizontal pixels performs the processing. When the number of pixels in the horizontal direction is arbitrary, the reduction ratio in the pooling units 122v may again be adjusted.
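
The correspondence between region width and number of pooling stages in this example can be checked with a few lines of Python, assuming each stage halves the horizontal pixel count so that a width of 2ᵏ needs k stages. The names below are hypothetical.

```python
def pooling_stages_for_width(width):
    """Number of halving stages needed to reduce `width` columns to one."""
    if width < 1 or width & (width - 1):
        raise ValueError("width must be a power of two")
    return width.bit_length() - 1


# One vertical direction processing unit per width option, keyed by width.
STAGES_BY_WIDTH = {w: pooling_stages_for_width(w) for w in (32, 128, 512)}
```

A dispatcher could then look up the selected width in `STAGES_BY_WIDTH` to choose which of the three processing units handles the region.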

Based on the single row vector 51 and the single column vector 52 obtained as described above, the specifying unit 3 (FIG. 2) detects the detection target in the input image.

When there are two or more detection targets, the region specifying unit 13 specifies a rectangular region for each detection target, and the vertical direction processing unit 120v processes each of them.

The advantages of this processing are as follows. When the detection targets are two-wheeled vehicles or pedestrians, it is highly unlikely that detection targets are lined up in the vertical direction. Therefore, by first generating the row vector 51 to specify the horizontal position of the detection target and then specifying the rectangular region, the detection target is located near the center of that region. As a result, the features of the detection target are sufficiently extracted in the processing by the vertical direction processing unit 120v, and improved detection accuracy can be expected. Furthermore, when it is determined based on the row vector 51 that no detection target is present, the vertical direction processing unit 120v need not perform any processing, which reduces the amount of computation.

In the neural network processing unit 1''' of this embodiment, the neural network parameters of the horizontal direction processing unit 120h and those of the vertical direction processing unit 120v can be learned separately. That is, the correct answer data for the row vector 51 is generated by the processing of the horizontal direction processing unit 120h using the neural network parameters for the learning process described in the first embodiment. The learning unit 2 (FIG. 2) then generates the neural network parameters for the detection process so that the row vector 51 approaches the correct answer data. The same applies to the neural network parameters of the vertical direction processing unit 120v.

As described above, in the sixth embodiment, pooling in the vertical direction is first performed to generate the row vector 51, and a rectangular region including the detection target is specified based on the row vector 51. Pooling in the horizontal direction is then performed on the rectangular region to generate the column vector 52. Because the rectangular region contains the features of the detection target, the detection accuracy can be improved. In addition, when it is found at the point of vertical pooling that no detection target is present, horizontal pooling is not performed, which reduces the amount of computation.
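
The control flow of this two-stage detection, including the early exit when no target is found, can be sketched in Python. The three callables stand in for the horizontal direction processing unit, the region specifying unit, and the vertical direction processing unit; the simple max projections used as stand-ins here are assumptions for illustration and are not the trained networks of the specification.

```python
def detect_two_stage(image_map, horizontal_unit, region_unit, vertical_unit):
    """Run horizontal detection first; run vertical detection only if needed.

    Returns (row_vector, column_vector); column_vector is None when the
    row vector indicates that no target is present.
    """
    row_vector = horizontal_unit(image_map)
    region = region_unit(row_vector)
    if region is None:
        return row_vector, None  # early exit: vertical stage is skipped
    left, width = region
    crop = [row[left:left + width] for row in image_map]
    return row_vector, vertical_unit(crop)


# Stand-ins (assumptions): plain max projections and a run-of-ones region.
def max_over_rows(m):
    return [max(column) for column in zip(*m)]


def max_over_columns(m):
    return [max(row) for row in m]


def run_of_ones(row_vector):
    ones = [x for x, value in enumerate(row_vector) if value == 1]
    return (ones[0], ones[-1] - ones[0] + 1) if ones else None
```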

In this embodiment, pooling in the vertical direction is performed first because the detection target is assumed to be vertically long; when the detection target is horizontally long, pooling in the horizontal direction may be performed first.

The first to sixth embodiments described above mainly concern detecting people, vehicles, and the like in images from an in-vehicle camera, but the invention is also applicable to other uses.

The embodiments described above are set forth so that a person having ordinary knowledge in the technical field to which the present invention belongs can carry out the invention. Various modifications of the above embodiments can naturally be made by those skilled in the art, and the technical idea of the present invention can also be applied to other embodiments. Accordingly, the present invention is not limited to the described embodiments and should be accorded the widest scope consistent with the technical idea defined by the claims.

1, 1', 1'', 1''' Neural network processing unit
2 Learning unit
3 Specifying unit
11 Non-linearization processing unit
12h, 12h' Horizontal direction map generation unit
12v, 12v' Vertical direction map generation unit
121h, 121h', 121v, 121v' Non-linearization processing unit
122h, 122v Pooling unit
120h Horizontal direction processing unit
120v Vertical direction processing unit
13 Region setting unit
51, 51a, 51b, 51' Row vector
52, 52a to 52c, 52' Column vector
100 Detection device
101 In-vehicle camera
102 Detection target
110 Vehicle
111 Vehicle body

Claims

1. A neural network processing device comprising:
a horizontal direction processing unit that performs non-linearization processing and pooling processing on a map based on an input image and outputs a row vector indicating a horizontal position of a detection target in the input image while maintaining the horizontal resolution of the input image; and
a vertical direction processing unit that performs non-linearization processing and pooling processing on the map based on the input image and outputs a column vector indicating a vertical position of the detection target in the input image while maintaining the vertical resolution of the input image.

2. The neural network processing device according to claim 1, wherein
the horizontal direction processing unit includes a first pooling unit that performs pooling processing only in the vertical direction on the map based on the input image, and
the vertical direction processing unit includes a second pooling unit that performs pooling processing only in the horizontal direction on the map based on the input image.

3. The neural network processing device according to claim 1, wherein
the horizontal direction processing unit includes a first horizontal direction map generation unit and a second horizontal direction map generation unit,
the vertical direction processing unit includes a first vertical direction map generation unit and a second vertical direction map generation unit,
the first horizontal direction map generation unit performs non-linearization processing and pooling processing only in the vertical direction on the map based on the input image to generate a first output map,
the first vertical direction map generation unit performs non-linearization processing and pooling processing in the horizontal direction on the map based on the input image to generate a second output map,
the second horizontal direction map generation unit performs non-linearization processing and pooling processing on the first output map and the second output map to generate a third output map having the same horizontal resolution as the first output map,
the second vertical direction map generation unit performs non-linearization processing and pooling processing on the first output map and the second output map to generate a fourth output map having the same vertical resolution as the second output map,
the row vector is generated based on the third output map, and
the column vector is generated based on the fourth output map.

4. The neural network processing device according to claim 3, wherein
the second horizontal direction map generation unit includes:
a first non-linearization processing unit that performs convolution operations on the first output map and the second output map to generate a first intermediate map and a second intermediate map, respectively, each having the same resolution as the first output map, and adds the first intermediate map and the second intermediate map to generate a third intermediate map; and
a third pooling unit that performs pooling processing only in the vertical direction on the third intermediate map to generate the third output map, and
the second vertical direction map generation unit includes:
a second non-linearization processing unit that performs convolution operations on the first output map and the second output map to generate a fourth intermediate map and a fifth intermediate map, respectively, each having the same resolution as the second output map, and adds the fourth intermediate map and the fifth intermediate map to generate a sixth intermediate map; and
a fourth pooling unit that performs pooling processing only in the horizontal direction on the sixth intermediate map to generate the fourth output map.

5. The neural network processing device according to claim 4, wherein
the horizontal and vertical resolutions of the second intermediate map are p1 times and 1/q1 times (p1 and q1 being integers), respectively, the horizontal and vertical resolutions of the second output map,
the first non-linearization processing unit generates the second intermediate map by repeating, while shifting a first filter by q1 pixels in the vertical direction of the second output map, a process of setting the inner product of the pixel values of a portion of the second output map and the filter coefficients of the first filter as the value of the pixel in the second intermediate map corresponding to that portion and of the (p1-1) pixels to its right,
the horizontal and vertical resolutions of the fourth intermediate map are 1/p2 times and q2 times (p2 and q2 being integers), respectively, the horizontal and vertical resolutions of the first output map, and
the second non-linearization processing unit generates the fourth intermediate map by repeating, while shifting a second filter by p2 pixels in the horizontal direction of the first output map, a process of setting the inner product of the pixel values of a portion of the first output map and the filter coefficients of the second filter as the value of the pixel in the fourth intermediate map corresponding to that portion and of the (q2-1) pixels below it.

6. The neural network processing device according to claim 1, wherein
an image in which the pixels corresponding to the upper, lower, left, and right ends of the detection target in a sample image have a first value and the other pixels have a second value is used as the input image, and
the horizontal direction processing unit and the vertical direction processing unit perform the non-linearization processing including a convolution operation using a filter whose center is the first value and whose other coefficients are the second value, and output the row vector and the column vector as correct answer data.

7. The neural network processing device according to claim 1, wherein
an image in which the pixels where the detection target is present in a sample image have a first value and the other pixels have a second value is used as the input image, and
the horizontal direction processing unit and the vertical direction processing unit perform the non-linearization processing including a convolution operation using a filter whose center is the first value and whose other coefficients are the second value, and output the row vector and the column vector as correct answer data.

8. The neural network processing device according to claim 6 or 7, wherein neural network parameters used in the non-linearization processing are generated based on the sample image and the correct answer data.

9. The neural network processing device according to claim 8, wherein the neural network parameters are generated such that
the horizontal direction processing unit outputs the row vector in which the columns corresponding to the left end and the right end of the detection target have the first value and the columns corresponding to positions to the left of the left end and to the right of the right end have the second value, and
the vertical direction processing unit outputs the column vector in which the rows corresponding to the upper end and the lower end of the detection target have the first value and the rows corresponding to positions above the upper end and below the lower end have the second value.

10. The neural network processing device according to claim 6, wherein the sample images include a sample image containing a first detection target and a sample image containing a second detection target different from the first detection target.

11. The neural network processing device according to claim 6, wherein the sample images include a sample image containing only a part of the detection target.

12. The neural network processing device according to claim 1, further comprising a region specifying unit that specifies, based on the row vector, a region of the input image that includes the detection target, wherein
the vertical direction processing unit performs non-linearization processing and pooling processing on the specified region as the map based on the input image.

13. The neural network processing device according to claim 12, wherein, when it is determined based on the row vector that the detection target is not included in the input image, the vertical direction processing unit does not perform the non-linearization processing and the pooling processing.

14. A detection device comprising:
the neural network processing device according to any one of claims 1 to 13; and
a specifying unit that specifies a position of the detection target in the input image based on the row vector and the column vector output by the neural network processing device.

15. The detection device according to claim 14, wherein the specifying unit
sets a column of the row vector whose value satisfies a first threshold condition as a left end candidate or a right end candidate of the detection target, and
sets a row of the column vector whose value satisfies a second threshold condition as an upper end candidate or a lower end candidate of the detection target.

16. The detection device according to claim 14 or 15, wherein
the left end candidate and the right end candidate of the detection target specified by the specifying unit are separated by at least a first number of pixels, and
the upper end candidate and the lower end candidate of the detection target specified by the specifying unit are separated by at least a second number of pixels.

17. A vehicle comprising:
a vehicle body;
a camera attached to the vehicle body; and
the detection device according to claim 14, which processes an image from the camera as the input image.

18. A neural network processing method comprising:
a step of performing non-linearization processing and pooling processing on a map based on an input image and outputting a row vector indicating a horizontal position of a detection target in the input image while maintaining the horizontal resolution of the input image; and
a step of performing non-linearization processing and pooling processing on the map based on the input image and outputting a column vector indicating a vertical position of the detection target in the input image while maintaining the vertical resolution of the input image.

19. A detection method comprising:
a step of performing non-linearization processing and pooling processing on a map based on an input image and outputting a row vector indicating a horizontal position of a detection target in the input image while maintaining the horizontal resolution of the input image;
a step of performing non-linearization processing and pooling processing on the map based on the input image and outputting a column vector indicating a vertical position of the detection target in the input image while maintaining the vertical resolution of the input image; and
a step of specifying the position of the detection target in the input image based on the row vector and the column vector.