JP2011248525A

JP2011248525A - Object detection device and detection method thereof

Info

Publication number: JP2011248525A
Application number: JP2010119587A
Authority: JP
Inventors: Seiji Ishikawa; Joo Kooi Tan; Yuki Nakajima; Takashi Morie
Original assignee: Kyushu Institute of Technology NUC
Current assignee: Kyushu Institute of Technology NUC
Priority date: 2010-05-25
Filing date: 2010-05-25
Publication date: 2011-12-08

Abstract

PROBLEM TO BE SOLVED: To provide an object detection device and a detection method thereof that detect an object image in an image frame accurately and in a short time, and that can also predict the likely movement of the object. SOLUTION: The object detection device and detection method detect an object image 11 to be detected from an image frame 10 obtained by image input means. The method includes: a model creation step in which model creation means creates an image model of the object image 11 in advance; a feature amount calculation step in which feature amount calculation means calculates, as feature amounts, the luminance gradients of the pixels in the region of the image frame 10 that corresponds to the created image model of the object image 11; and an identification step in which identification means determines whether the object image 11 exists in the image frame 10, using the result calculated by the feature amount calculation means and a previously obtained learning result of the feature amounts of the object image 11.

Description

The present invention relates to an object detection device and a detection method for detecting an object image to be detected in an acquired image frame.

With the growing demand for security cameras and in-vehicle cameras, and with the improved performance of personal computers and digital cameras, research on the automatic recognition and understanding of objects in images (image frames) by computer is being actively pursued in Japan and overseas. In particular, automatic recognition of human movements and actions by computer is highly anticipated for applications such as robot vision and ITS (Intelligent Transport Systems).

A representative technique for detecting people in images is the method proposed by Dalal et al. in Non-Patent Document 1, which uses HOG feature amounts and an SVM. In this method, a small rectangular region is moved step by step over the image, HOG feature amounts are calculated inside it, and these are used to judge whether the small rectangular region contains a human image.
As other methods, Non-Patent Document 2 discloses a method using Joint features that express co-occurrence among multiple HOG feature amounts, and Non-Patent Document 3 discloses a method of obtaining HOG feature amounts while varying the block size.
The HOG (Histograms of Oriented Gradients) feature amount mentioned above is a histogram feature obtained from the luminance gradient information of each pixel in a small rectangular region called a cell, by accumulating the luminance gradient magnitudes according to the luminance gradient direction.
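To make the accumulation step concrete, the following Python sketch builds the histogram for one cell from per-pixel gradient magnitudes and orientations. The choice of 9 unsigned orientation bins over 0 to 180 degrees and hard (nearest-bin) voting are assumptions borrowed from common HOG practice, not details stated in this document, and the function name `cell_histogram` is hypothetical.

```python
import numpy as np


def cell_histogram(magnitude, orientation_deg, n_bins=9):
    """Accumulate gradient magnitudes into orientation bins for one cell.

    Assumes unsigned orientations in [0, 180) degrees and hard voting;
    the bilinear vote interpolation of Dalal and Triggs is omitted.
    """
    bin_width = 180.0 / n_bins
    # Fold orientations into [0, 180) and map each to a bin index.
    bins = (np.mod(orientation_deg, 180.0) // bin_width).astype(int)
    hist = np.zeros(n_bins)
    # np.add.at accumulates correctly when several pixels fall in the same bin.
    np.add.at(hist, bins.ravel(), np.asarray(magnitude, dtype=float).ravel())
    return hist
```

For a 2×2 cell with unit magnitudes and orientations 0, 10, 90 and 179 degrees, bins 0, 4 and 8 receive 2, 1 and 1 votes respectively.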

Non-Patent Document 1: N. Dalal, B. Triggs, "Histograms of Oriented Gradients for Human Detection", IEEE CVPR, pp. 886-893, 2005.
Non-Patent Document 2: Hironobu Fujiyoshi, "Object Detection by Joint Features Focusing on Relevance of Local Features", IEICE Technical Committee, 2009.
Non-Patent Document 3: Q. Zhu, S. Avidan, M. Yeh, K. Cheng, "Fast Human Detection Using a Cascade of Histograms of Oriented Gradients", IEEE CVPR, June 2006.

However, the conventional methods described above still have the following problems.
Because the method of Non-Patent Document 1 moves the small rectangular region over the entire image, it computes HOG feature amounts (i.e., histograms of gradient directions in local regions) even for background whose shape differs from that of a human image, and uses them for human detection.
The method of Non-Patent Document 2 uses boosting to select the feature amounts that prove effective, but it ends up selecting feature amounts even from parts that are almost entirely background.
Moreover, in both Non-Patent Documents 1 and 2 the cells used for detection are fixed to a grid, leaving no freedom in how the calculation is performed.
The method of Non-Patent Document 3, like those of Non-Patent Documents 1 and 2, also computes HOG feature amounts for background whose shape differs from that of a human image.

As described above, the methods of Non-Patent Documents 1 to 3 include background whose shape differs from a human image among the targets of human detection. As a result, a part that is obviously background may be erroneously detected as a human image, or a human image may go undetected because of the background, which tends to lower detection accuracy. In addition, because HOG feature amounts must be calculated over the entire image, background included, the calculation takes a long time.
Furthermore, since the methods of Non-Patent Documents 1 to 3 aim only at detecting people in images, they do not consider predicting human behavior.

The present invention has been made in view of these circumstances, and its object is to provide an object detection device and a detection method thereof that can detect an object in an image frame accurately and in a short time, and can also predict the likely movement of the object.

An object detection device according to a first invention that achieves the above object is a device for detecting an object image to be detected from an image frame obtained by image input means, and comprises:
model creation means for creating an image model of the object image in advance;
feature amount calculation means for calculating, as a feature amount, the luminance gradient of each pixel of the image frame for the region of the image frame corresponding to the image model of the object image created by the model creation means; and
identification means for determining whether the object image exists in the image frame, using the result calculated by the feature amount calculation means and a learning result of the feature amounts of the object image obtained in advance.

In the object detection device according to the first invention, it is preferable that the learning result of the feature amounts of the object image include learning data for each orientation of the object image, and that, on condition that the object image is determined to exist in the image frame, the identification means further identify the orientation of the object image.
Here, the object image is preferably a human image.

An object detection method according to a second invention that achieves the above object is a method for detecting an object image to be detected from an image frame obtained by image input means, and comprises:
a model creation step of creating an image model of the object image in advance;
a feature amount calculation step of calculating, as a feature amount, the luminance gradient of each pixel of the image frame for the region of the image frame corresponding to the image model of the object image created in the model creation step; and
an identification step of determining whether the object image exists in the image frame, using the result calculated in the feature amount calculation step and a learning result of the feature amounts of the object image obtained in advance.

In the object detection method according to the second invention, it is preferable that the learning result of the feature amounts of the object image include learning data for each orientation of the object image, and that, on condition that the object image is determined to exist in the image frame, the orientation of the object image be further identified in the identification step.
Here, the object image is preferably a human image.

The object detection device and detection method according to the present invention use an image model, created in advance, of the object image to be detected. For the region of the image frame corresponding to this image model, the luminance gradient of each pixel is calculated as a feature amount, and the calculated result is used together with a previously obtained learning result of the object image's feature amounts to determine whether the object image exists in the image frame. Portions that do not match the shape of the object image, such as the background of the image frame, can therefore be excluded from the detection targets in advance. This reduces cases in which a part that is obviously background is erroneously detected as an object or in which an object goes undetected because of the background, and it removes the need to calculate feature amounts for the entire image, background included.
Therefore, the object image can be detected in the image frame accurately and in a short time.

Furthermore, when the learning result of the feature amounts includes learning data for each orientation of the object image and the orientation is identified, human behavior can be predicted if the object image is a human image. For example, if the image frame comes from an in-vehicle camera, a human image facing toward the road can be detected, so the driver can be warned that the person may step out into the road.

FIG. 1 is an explanatory diagram of an object detection method according to an embodiment of the present invention.
FIG. 2 is a flowchart of the processing performed in the model creation step of the object detection method.
FIGS. 3(A) to 3(E) are explanatory diagrams showing the process of creating the image model of a person in the model creation step.
FIG. 4 is an explanatory diagram of the processing performed in the feature amount calculation step.
FIG. 5 is an explanatory diagram of the processing performed in the identification step.
FIG. 6 is a flowchart of the object detection method.
FIGS. 7(A) and 7(B) are explanatory diagrams of the person learning images and the non-person learning images, respectively, used in the person detection experiment.
FIGS. 8(A) to 8(C) are explanatory diagrams of the person evaluation images, the non-person evaluation images, and the image model of a human image, respectively, used in the person detection experiment.
FIG. 9 is a graph showing the results of the person detection experiment.
FIG. 10(A) is an explanatory diagram of the learning images used in the body direction detection experiment, FIG. 10(B) of the evaluation images for environment 1, and FIG. 10(C) of the evaluation images for environment 2.
FIGS. 11(A) to 11(E) are explanatory diagrams of the image models of human images used in the body direction detection experiment.

Next, embodiments of the present invention will be described with reference to the accompanying drawings to aid understanding of the invention.
As shown in FIG. 1, an object detection device according to an embodiment of the present invention (hereinafter also simply called the detection device) detects a human image 11 (an example of an object image) to be detected from an image frame 10 (for example, a video frame or a photograph) obtained by a single video camera (an example of image input means). The detection device can be implemented on a conventionally known computing device (i.e., a computer) equipped with RAM, a CPU, ROM, I/O, and a bus connecting these elements, but is not limited to this. The details are described below.

The detection device includes storage means, model creation means, feature amount calculation means, and identification means.
The storage means consists of memory (RAM or ROM) and stores the video captured by the video camera. The video camera may be any fixed or mobile camera, such as a CCD camera, a high-speed camera, a handheld camera, a digital VTR, or a digital video camera.
The model creation means, the feature amount calculation means, and the identification means are realized by the CPU executing a predetermined program.

First, the model creation means will be described with reference to FIGS. 2 and 3.
In step 11 (ST11), the model creation means calculates the luminance gradients of human images (images that contain a person) stored in advance in a database. The luminance gradient indicates the degree of luminance change in the vicinity of the pixel of interest, and it takes large values in the boundary regions (contours) of object images within the image frame. All of the images in the database may be used as the human images, or only some of them.
With I(x, y) the luminance value of pixel (x, y), and fx and fy the luminance gradients in the x and y directions, respectively, the luminance gradient is calculated by Expression (1).
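Expression (1) itself is not reproduced in this text. As an illustration only, the sketch below assumes the central-difference form common in HOG implementations: fx = I(x+1, y) - I(x-1, y), fy = I(x, y+1) - I(x, y-1), with magnitude sqrt(fx^2 + fy^2) and orientation arctan(fy / fx). The function name `luminance_gradient` and the replication of border pixels are assumptions.

```python
import numpy as np


def luminance_gradient(image):
    """Central-difference gradients with magnitude and unsigned orientation.

    Assumed form (Expression (1) is not reproduced in the source text):
      fx = I(x+1, y) - I(x-1, y),  fy = I(x, y+1) - I(x, y-1)
    Arrays are indexed [y, x]; border pixels are replicated.
    """
    I = np.asarray(image, dtype=float)
    P = np.pad(I, 1, mode="edge")
    fx = P[1:-1, 2:] - P[1:-1, :-2]  # I(x+1, y) - I(x-1, y)
    fy = P[2:, 1:-1] - P[:-2, 1:-1]  # I(x, y+1) - I(x, y-1)
    magnitude = np.hypot(fx, fy)
    # arctan2 copes with fx == 0; folding into [0, 180) gives unsigned orientation.
    orientation = np.degrees(np.arctan2(fy, fx)) % 180.0
    return fx, fy, magnitude, orientation
```

On a horizontal luminance ramp, fx is 2 in the interior (1 at the replicated borders), fy is 0, and the orientation is 0 degrees.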

Next, in step 12 (ST12), the average value of the luminance gradients described above is calculated for every pixel of the image, creating the image shown in FIG. 3(A), i.e., an average luminance gradient image.
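The averaging in ST12 can be sketched as follows, assuming the database images are already cropped and aligned to the same size so that a pixel-wise mean over images is meaningful. The gradient here uses `numpy.gradient` (central differences in the interior, one-sided at the edges) as a stand-in for Expression (1), and `average_gradient_image` is a hypothetical name.

```python
import numpy as np


def average_gradient_image(images):
    """Pixel-wise mean gradient magnitude over a list of aligned person images."""
    total = None
    for img in images:
        # np.gradient returns the derivatives along rows (y) and columns (x).
        gy, gx = np.gradient(np.asarray(img, dtype=float))
        magnitude = np.hypot(gx, gy)
        total = magnitude if total is None else total + magnitude
    return total / len(images)
```

Averaging a flat image with a horizontal ramp of slope 1 yields 0.5 at every pixel.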
In step 13 (ST13), the contour (edge) portion shown in FIG. 3(B) is extracted from the average luminance gradient image, creating a contour image (edge image) of the human image.
Further, in step 14 (ST14), the shadow image (silhouette) of the human image shown in FIG. 3(C) is created from the average luminance gradient image, and the skeleton image of the human image shown in FIG. 3(D) is created from this shadow image. To create the skeleton image from the shadow image, a distance transform is first applied to the shadow image. The distance transform calculates, for each white pixel of the shadow image, its distance from the boundary between the white and black regions. The skeleton image is then obtained by keeping only the local maxima of the distance-transformed image.
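The distance transform and local-maximum search in ST14 can be sketched as below. This version assumes a binary shadow image, treats pixels outside the image as black, and uses the city-block (4-neighbour) distance computed in two raster passes; the document does not say which distance metric is used, so a Euclidean transform would be an equally valid choice. Both function names are hypothetical.

```python
import numpy as np


def distance_transform_cityblock(silhouette):
    """City-block distance from each white (True) pixel to the nearest black pixel."""
    fg = np.pad(np.asarray(silhouette, dtype=bool), 1)  # padded border is black
    h, w = fg.shape
    d = np.where(fg, h + w, 0).astype(int)  # h + w exceeds any possible distance
    # Forward pass: propagate distances from the top and left neighbours.
    for y in range(1, h):
        for x in range(1, w):
            if fg[y, x]:
                d[y, x] = min(d[y, x], d[y - 1, x] + 1, d[y, x - 1] + 1)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for y in range(h - 2, -1, -1):
        for x in range(w - 2, -1, -1):
            if fg[y, x]:
                d[y, x] = min(d[y, x], d[y + 1, x] + 1, d[y, x + 1] + 1)
    return d[1:-1, 1:-1]


def skeleton_from_distance(d):
    """Keep white pixels whose distance is >= that of all 8 neighbours."""
    P = np.pad(d, 1)
    h, w = d.shape
    center = P[1:-1, 1:-1]
    is_max = center > 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = P[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_max &= center >= neighbour
    return is_max
```

For a 3×5 white rectangle, the interior row reaches distance 2 and its three centre pixels form the skeleton.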

Finally, in step 15 (ST15), the contour image of the human image shown in FIG. 3(B) and the skeleton image shown in FIG. 3(D) are combined (their logical OR is taken) to create the image model of the human image shown in FIG. 3(E).
In this way, the image model used to detect the human image 11 in the image frame 10 can be created in advance. The creation of the image model by the model creation means is not limited to this procedure; for example, the model can also be created by background subtraction on continuously captured images.
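Taking the logical OR of the contour and skeleton images (ST15) is a single array operation once the contour image is available. The sketch below assumes the contour image of ST13 is obtained by thresholding the average luminance gradient image with a hypothetical parameter `edge_threshold`; the document does not specify how the edge portion is extracted.

```python
import numpy as np


def build_image_model(avg_gradient, skeleton, edge_threshold):
    """Return the binary image model (mask) as contour OR skeleton."""
    # ST13 (assumed): pixels with a strong average gradient form the contour image.
    contour = np.asarray(avg_gradient, dtype=float) > edge_threshold
    # ST15: the model is the logical OR of the contour and skeleton images.
    return contour | np.asarray(skeleton, dtype=bool)
```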

Next, the feature amount calculation means will be described.
As shown in FIG. 1, the feature amount calculation means scans a rectangular region 12, set around the image model of the human image created by the model creation means, over the image frame 10 from left (one side) to right (the other side), and from the top of the image frame 10 to the bottom. Here, the image model of the human image is a mask composed of the white portion of FIG. 3(E). The rectangular region 12 is a region containing the image model; here, the region shown in FIG. 3(E) is used as is.
A conventionally known raster scan can be used as the scanning method, but other methods may also be used. The scanning direction is likewise not limited to the one described above.
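A raster scan of the rectangular region 12 over the image frame 10, left to right and top to bottom, can be written as a simple generator of window positions. The one-pixel default `step` is an assumption; the document does not state the scan stride.

```python
def raster_scan(frame_h, frame_w, win_h, win_w, step=1):
    """Yield (top, left) corners of a window scanned left-to-right, top-to-bottom."""
    for top in range(0, frame_h - win_h + 1, step):
        for left in range(0, frame_w - win_w + 1, step):
            yield top, left
```

For example, `list(raster_scan(4, 4, 2, 2, step=2))` visits (0, 0), (0, 2), (2, 0) and (2, 2) in that order.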

Then, for the region of the image frame 10 corresponding to the image model of the human image being scanned, that is, for each pixel of the image frame 10 located at the coordinates of a pixel constituting the image model (the white portion in FIG. 3(E)), the luminance gradient is calculated in turn as a feature amount. Specifically, as shown in FIG. 4, with the image model of the human image used as a mask, a rectangular region of 5×5 pixels forming one cell is taken in the image frame 10, centered on each pixel of the image frame 10 at the mask coordinates, and the HOG feature amount of each such region is calculated (i.e., a histogram of luminance gradient orientations is created).
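The mask-restricted feature extraction can be sketched as follows: for each white pixel of the model mask, placed at window offset (top, left), a 5×5 cell is taken around the corresponding frame pixel and a magnitude-weighted orientation histogram is built. The gradient arrays are assumed to be precomputed for the whole frame, cells are clipped at the frame border, and the 9-bin unsigned binning is an assumption carried over from common HOG practice; `masked_hog` is a hypothetical name.

```python
import numpy as np


def masked_hog(magnitude, orientation_deg, mask, top, left, n_bins=9, half=2):
    """Concatenate one orientation histogram per mask pixel (5x5 cells by default)."""
    h, w = magnitude.shape
    bin_width = 180.0 / n_bins
    features = []
    # np.nonzero lists mask pixels row by row, i.e. left-to-right, top-to-bottom.
    for my, mx in zip(*np.nonzero(mask)):
        cy, cx = top + my, left + mx
        # Cell centred on the frame pixel, clipped at the frame border (assumption).
        y0, y1 = max(cy - half, 0), min(cy + half + 1, h)
        x0, x1 = max(cx - half, 0), min(cx + half + 1, w)
        mag = magnitude[y0:y1, x0:x1].ravel()
        ori = np.mod(orientation_deg[y0:y1, x0:x1], 180.0).ravel()
        hist = np.zeros(n_bins)
        np.add.at(hist, (ori // bin_width).astype(int), mag)
        features.append(hist)
    return np.concatenate(features) if features else np.zeros(0)
```

Only pixels under the mask contribute, so background pixels outside the model are never turned into features.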

The HOG feature amounts are calculated in sequence (in the direction of the arrows) over the region of the image frame 10 masked by the image model of the human image, from left (one side) to right (the other side) and from the top of the image frame 10 to the bottom, although the order is not limited to this. The HOG feature amounts can be obtained, for example, by the same method as in Non-Patent Document 1.
Next, the calculated HOG feature amounts are normalized by Expression (2).

Here, a is an element of the HOG feature amounts, f is the feature vector formed by arranging the HOG features calculated in each cell, v is the normalized feature amount, and ε is a small positive real number that prevents the denominator from becoming 0.
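Expression (2) is likewise not reproduced in this text. Given the definitions above, one plausible reading is the L2 normalization used by Dalal and Triggs, v = a / sqrt(||f||^2 + ε^2), applied to every element a of f. The sketch below implements that assumed form; the default ε value and the name `normalize_hog` are arbitrary choices.

```python
import numpy as np


def normalize_hog(f, eps=1e-3):
    """Assumed L2 normalization: v = a / sqrt(||f||^2 + eps^2) for each element a of f."""
    f = np.asarray(f, dtype=float)
    # eps keeps the denominator positive even when f is all zeros.
    return f / np.sqrt(np.sum(f ** 2) + eps ** 2)
```

With `eps=0`, the vector [3, 4] becomes [0.6, 0.8]; an all-zero vector stays zero when `eps` is positive.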
Because HOG forms histograms of the gradients of neighboring pixels over local regions, it is relatively insensitive to illumination and shadows and robust to local geometric changes.
With the method described above, only the pixels of the image frame 10 located at the coordinates of the image model of the human image are used for human detection, so the feature amounts can be calculated over a minimal region and unnecessary feature amounts, such as those of the background, are eliminated.

The feature amount calculation means can also obtain the HOG feature amounts of human images in advance, by the same method as above, and build a database of learning results (i.e., create a human detector).
Specifically, using the feature vector V whose elements are the values v described above, a conventionally known SVM (support vector machine) is trained with human images as positive data and non-human images as negative data. An SVM forms a pattern classifier from linear input elements; it learns the parameters of those elements from training samples under the margin-maximization criterion, i.e., by finding the separating hyperplane whose distance to the nearest data points is largest.
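As a concrete illustration of margin-maximizing training, the sketch below trains a linear SVM with the Pegasos stochastic sub-gradient method on feature vectors V labelled +1 (person) and -1 (non-person). Pegasos is a swapped-in technique chosen for brevity; the document only says a conventionally known SVM is used, and a library solver would normally be preferred. The bias is handled by appending a constant 1 to each vector, so it is regularized along with the weights, and `lam`, `epochs` and `seed` are hypothetical parameters.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic sub-gradient training of a linear SVM.

    X: (n, d) feature vectors V; y: +1 for person (positive), -1 for non-person.
    A constant 1 is appended to each sample so the last weight acts as the bias.
    """
    X = np.hstack([np.asarray(X, dtype=float), np.ones((len(X), 1))])
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)             # decreasing step size
            violates = y[i] * (X[i] @ w) < 1  # hinge loss is active for this sample
            w *= 1.0 - eta * lam              # step on the (lam/2)||w||^2 term
            if violates:
                w += eta * y[i] * X[i]        # sub-gradient step on the hinge loss
    return w


def predict(w, X):
    """Return +1 (person) or -1 (non-person) from the sign of w.x + b."""
    X = np.hstack([np.asarray(X, dtype=float), np.ones((len(X), 1))])
    return np.where(X @ w >= 0, 1, -1)
```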

Other learning methods may also be used as long as they can distinguish human images from non-human images, for example boosting methods such as AdaBoost, LPBoost, BrownBoost, or LogitBoost, or a neural network.

It is also preferable to include learning data for each orientation of the human image in the learning results and thereby build a database of learning results for orientation (i.e., create a body direction discriminator).
Here, as shown in FIG. 5, images of people facing, for example, front, right, or left are used as learning images, and an SVM is trained in the conventionally known manner with feature vectors V whose elements are v, obtained by the same method as above. Specifically, using front-facing, right-facing, and left-facing images of people, three (i.e., a plurality of) two-class classifiers (direction discriminators) are built in a first layer and a second layer by the one-versus-rest and one-versus-one methods.

Here, the one-versus-rest classifiers are front vs. the rest, right vs. the rest, and left vs. the rest, and the one-versus-one classifiers are front vs. right and front vs. left. The number of classifiers and the conditions of the one-versus-rest and one-versus-one methods are not limited to those described above and can be changed as appropriate for the detection target.
The orientation of the human image can thus be identified by voting among the classifiers built by the one-versus-rest method. If the orientations receive equal numbers of votes, the final decision is made with the classifiers built by the one-versus-one method.
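The voting and tie-break logic might look like the sketch below. It assumes each one-versus-rest classifier casts one vote for its orientation when its decision value is positive, and that a tie is resolved by counting the votes of the front/right and front/left one-versus-one classifiers among the tied orientations. The document does not spell out the second-layer rule, so this rule, the function name `decide_direction`, and the final fallback (first listed orientation wins) are assumptions.

```python
def decide_direction(ovr_scores, front_vs_right, front_vs_left):
    """Pick 'front', 'right' or 'left' from classifier decision values.

    ovr_scores: dict mapping orientation -> one-versus-rest decision value.
    front_vs_right: one-versus-one value, > 0 means 'front', otherwise 'right'.
    front_vs_left:  one-versus-one value, > 0 means 'front', otherwise 'left'.
    """
    # First layer: each one-versus-rest classifier votes for its orientation if positive.
    votes = {label: int(score > 0) for label, score in ovr_scores.items()}
    best = max(votes.values())
    tied = [label for label, count in votes.items() if count == best]
    if len(tied) == 1:
        return tied[0]
    # Second layer (hypothetical rule): count one-versus-one votes among the tied labels.
    ovo_votes = {"front": 0, "right": 0, "left": 0}
    ovo_votes["front" if front_vs_right > 0 else "right"] += 1
    ovo_votes["front" if front_vs_left > 0 else "left"] += 1
    # max() keeps the first tied label if the one-versus-one votes are also equal.
    return max(tied, key=lambda label: ovo_votes[label])
```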

Finally, the identification means will be described.
After the feature amount calculation means has calculated the HOG feature amounts of the image frame 10 for the region masked by the image model of the human image, the identification means uses this result and the previously obtained learning result of the human image's feature amounts to determine, with the human detector described above, whether the human image 11 exists in the image frame 10. That is, it determines whether the calculated result falls in the region of human images or in the region of non-human images learned by the SVM.

It is also preferable that, on condition that the human image 11 is determined to exist in the image frame 10, the identification means further identify the orientation of the human image 11.
In this case, as shown in FIG. 1, the plural regions 13 (three in this example) in which the human image 11 was detected are first integrated into one region 14. Specifically, the overlapping portions of the regions 13 whose feature amounts agree are superimposed, and an integration method is applied in combination to form a single region 14 centered on the human image 11. An existing technique, the MeanShift method, is used for integration here, but the region can also be determined simply by averaging the ROI information of the overlapping regions (the position (x, y) of the upper-left corner of each rectangle and its height and width).
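The simpler alternative mentioned above, averaging the ROI information of overlapping regions, can be sketched as follows. Boxes are (x, y, width, height) tuples with (x, y) the upper-left corner; boxes are grouped greedily by pairwise overlap and each group is replaced by its element-wise mean. This is not the MeanShift method, and the greedy grouping does not merge two groups that only become connected through a later box. Both function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (x, y, width, height) rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def merge_detections(boxes):
    """Group overlapping boxes greedily and return the mean ROI of each group."""
    groups = []
    for box in boxes:
        for group in groups:
            if any(overlaps(box, member) for member in group):
                group.append(box)
                break
        else:  # no existing group overlaps this box
            groups.append([box])
    merged = []
    for group in groups:
        n = len(group)
        merged.append(tuple(sum(b[k] for b in group) / n for k in range(4)))
    return merged
```

Three overlapping detections of one person collapse into a single averaged region, while a distant detection stays separate.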
Then, using the HOG feature amounts calculated for the image frame 10 and the previously obtained learning result of the human image's feature amounts, the body direction discriminator described above identifies the orientation of the human image 11 in the image frame 10.

Next, an object detection method according to an embodiment of the present invention will be described with reference to FIG. 6.
First, the learning results of the feature amounts of human images and non-human images, obtained in advance by the feature amount calculation means, are input to the storage means. The human images include, for example, images of adults and children, and of people facing front, right, or left. The non-human images include, for example, vehicles, buildings, trees, pillars, and signboards.
Using these feature amounts, the feature amount calculation means builds the human detector and body direction discriminator described above (the above constitutes the preparation step).

Next, in step 1 (ST1), video (unknown images) is captured by the video camera, and the image frame 10 is input to the storage means.
In step 2 (ST2), the model creation means performs ST11 to ST15 shown in FIG. 2 to create the image model of the human image 11, and sets a rectangular region 12 containing it. The rectangular region 12 is smaller than the image frame 10. The image model of the human image 11 may also be created beforehand, before the video is captured (the above constitutes the model creation step).

In step 3 (ST3), as shown in FIG. 1, the rectangular region 12 is scanned over the image frame 10, and HOG feature amounts are calculated for the region of the image frame 10 masked by the image model of the human image 11. In step 4 (ST4), the obtained feature amounts are normalized with Expression (2) (the above constitutes the feature amount calculation step).

In step 5 (ST5), the human detector described above determines whether or not the human image 11 exists in the image frame 10, using the calculation result from ST4 and the learning result of the feature amount of the human image 11 obtained in advance. If the human image 11 is determined to exist in the image frame 10, the body direction discriminator described above can, on that condition, further identify the orientation of the human image (end of the identification step).
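The two-stage decision in ST5 can be illustrated with linear classifiers. This sketch assumes that the human detector and each body direction discriminator are linear SVMs represented by a weight vector and a bias, and that the direction is chosen as the label with the highest decision value; both the argmax rule and the names `linear_score` and `identify` are hypothetical, not taken from the patent.

```python
def linear_score(weights, bias, feature):
    """Decision value w . x + b of a linear classifier such as a linear SVM."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def identify(feature, human_detector, direction_classifiers, threshold=0.0):
    """Return None if no human is detected in the window; otherwise return the
    direction label whose classifier gives the highest decision value.

    human_detector: (weights, bias)
    direction_classifiers: dict mapping a label (e.g. "L", "F", "R") to (weights, bias)
    """
    weights, bias = human_detector
    if linear_score(weights, bias, feature) <= threshold:
        return None  # no human image: direction is not evaluated
    # Direction is identified only on condition that a human was detected.
    return max(
        direction_classifiers,
        key=lambda label: linear_score(*direction_classifiers[label], feature),
    )
```

For example, with `dirs = {"L": ([1.0, -1.0], 0.0), "F": ([0.0, 1.0], 0.0), "R": ([-1.0, 1.0], 0.0)}` and a detector `([1.0, 0.0], -0.5)`, the feature `[1.0, 2.0]` is detected as a human and labeled `"F"`.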

Next, examples carried out to confirm the effects of the present invention will be described.
First, a human image detection experiment will be described.
Here, 1602 human images and 3235 non-human images were used as learning images. An example of a human image (human learning image) is shown in FIG. 7A, and an example of a non-human image (non-human learning image) is shown in FIG. 7B.
In addition, 1000 human images and 1000 non-human images were used as evaluation images for the experiment. An example of a human image (human evaluation image) is shown in FIG. 8A, and an example of a non-human image (non-human evaluation image) is shown in FIG. 8B.

Using the above images, human image detection accuracy was compared. Example 1 uses the image model of a human image shown in FIG. 8C, and Conventional Example 1 uses the method described in Non-Patent Document 1, which does not use an image model.
The identification results were compared using DET (Detection Error Tradeoff) curves. A DET curve plots, on a log-log graph, the false detection rate (the probability of recognizing a non-human as a human) on the horizontal axis against the non-detection rate (the probability of recognizing a human as a non-human) on the vertical axis. Varying the threshold of the discriminator traces out the curve, so the non-detection rate can be compared at any given false detection rate.
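The two error rates on a DET curve can be computed from classifier scores as sketched below. The sketch assumes a sample is classified as human when its score is at or above the threshold; the function names `det_points` and `false_detection_at_miss_rate` are hypothetical, and plotting on log-log axes is omitted.

```python
def det_points(human_scores, nonhuman_scores, thresholds):
    """For each threshold, return (false_detection_rate, miss_rate).
    A sample is classified as human when score >= threshold."""
    points = []
    for t in thresholds:
        # False detection: a non-human sample scored as human.
        false_det = sum(s >= t for s in nonhuman_scores) / len(nonhuman_scores)
        # Miss (non-detection): a human sample scored as non-human.
        miss = sum(s < t for s in human_scores) / len(human_scores)
        points.append((false_det, miss))
    return points


def false_detection_at_miss_rate(human_scores, nonhuman_scores, target_miss):
    """Lowest false detection rate among thresholds whose miss rate <= target_miss.
    Every observed score is tried as a threshold; the smallest one yields miss rate 0."""
    candidates = sorted(set(human_scores) | set(nonhuman_scores))
    best = 1.0
    for false_det, miss in det_points(human_scores, nonhuman_scores, candidates):
        if miss <= target_miss:
            best = min(best, false_det)
    return best
```

Reading off the false detection rate at a fixed miss rate, as `false_detection_at_miss_rate` does, is how a single operating point such as "non-detection rate 0.5%" is compared between two methods.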

The DET curves obtained in this experiment are shown in FIG. 9.
As shown in FIG. 9, at a non-detection rate of 0.5%, the false detection rate was 19.1% for Conventional Example 1 (♦) and 7.5% for Example 1 (■), a reduction of 11.6 percentage points. The processing time per image was 29.1 milliseconds for Example 1 and 54.4 milliseconds for Conventional Example 1, so the processing time was roughly halved.
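The reported figures for Example 1 versus Conventional Example 1 can be checked directly:

$$
19.1\% - 7.5\% = 11.6\ \text{percentage points}, \qquad \frac{11.6}{19.1} \approx 0.61, \qquad \frac{29.1\ \text{ms}}{54.4\ \text{ms}} \approx 0.535
$$

That is, the false detection rate fell by about 61% in relative terms, and the per-image processing time was about 53.5% of the conventional method's, a speedup of roughly 1.87 times.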

Next, a body direction detection experiment will be described.
Here, 217 human images per direction (front, right, and left) were used as learning images. An example of these human images is shown in FIG. 10A.
In addition, 200 human images per direction (front, right, and left) were used as evaluation images in each of two environments: a staged environment 1 in which people do not overlap, and a real environment 2. FIG. 10B shows an example of a human image in environment 1 (data set 1), and FIG. 10C shows an example of a human image in real environment 2 (data set 2).

Using the above images, body direction detection accuracy was compared. Example 2 uses the human image models shown in FIGS. 11A to 11E, and Conventional Example 2 uses the method described in Non-Patent Document 1, which does not use an image model. FIGS. 11A to 11C are the human image models for the left-facing SVM (L), the front-facing SVM (F), and the right-facing SVM (R), respectively. The front-left model SVM (FL) shown in FIG. 11D was created from FIGS. 11A and 11B, and the front-right model SVM (FR) shown in FIG. 11E was created from FIGS. 11B and 11C.
The results obtained in this experiment are shown in Table 1.

As is clear from Table 1, Example 2 achieved a higher correct-answer rate than Conventional Example 2 regardless of the environmental conditions. The processing time was 26.7 milliseconds for Example 2 and 60.6 milliseconds for Conventional Example 2, so the processing time was reduced to less than half.
In the above examples, human image detection was performed on a computer running Microsoft Windows XP Professional with an Intel(R) Celeron(R) 2.80 GHz processor.

These results show that, with the object detection device and detection method of the present invention, both human detection and body direction detection perform better than the conventional method. This is presumably because using only the feature amounts within the image model of the human image reduces the influence of the background.

The present invention has been described above with reference to embodiments; however, it is not limited to the configurations described in those embodiments, and it also includes other embodiments and modifications conceivable within the scope of the matters described in the claims. For example, configuring the object detection device and detection method of the present invention by combining some or all of the above embodiments and modifications also falls within the scope of the present invention.
In the above embodiment, a human image, as one example of an object image, was the detection target; however, an animal image, for example, may also be used. Depending on where the object detection device is installed, images of vehicles such as cars, trains, ships, or airplanes may also be detection targets, which helps create a driving (operating) environment with greater emphasis on safety.

In the above embodiment, feature amounts were obtained for all pixels corresponding to the image model of the human image; however, based on the idea of AdaBoost, only the blocks (each consisting of one or more pixels) that are effective for recognition may be selected and used.
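The AdaBoost-based selection of effective blocks mentioned above could look like the following sketch. It assumes each block has already been reduced to a single scalar feature per training image and uses discrete AdaBoost with one-feature decision stumps, taking the block picked in each round as an effective block. The function name `select_blocks_adaboost` and the data layout are hypothetical; the patent does not specify these details.

```python
import math


def select_blocks_adaboost(samples, labels, n_rounds):
    """Discrete AdaBoost with one-feature decision stumps.

    samples: list of feature vectors, one scalar per block (assumed layout)
    labels:  +1 for human, -1 for non-human
    Returns the block index chosen in each boosting round.
    """
    n, d = len(samples), len(samples[0])
    weights = [1.0 / n] * n
    chosen = []
    for _ in range(n_rounds):
        best = None  # (weighted error, block index, threshold, polarity)
        for j in range(d):
            for t in sorted({s[j] for s in samples}):
                for polarity in (1, -1):
                    err = sum(
                        w for w, s, y in zip(weights, samples, labels)
                        if (polarity if s[j] >= t else -polarity) != y
                    )
                    if best is None or err < best[0]:
                        best = (err, j, t, polarity)
        err, j, t, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep log() finite
        alpha = 0.5 * math.log((1 - err) / err)
        # Misclassified samples gain weight; correctly classified ones lose it.
        weights = [
            w * math.exp(-alpha * y * (polarity if s[j] >= t else -polarity))
            for w, s, y in zip(weights, samples, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
        chosen.append(j)
    return chosen
```

A block may be chosen in more than one round; deduplicating the returned indices gives the set of blocks whose features would be kept.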
In the above embodiment, the image frame was obtained by a single video camera; however, image frames may also be obtained by a plurality of video cameras (for example, two or three) installed at different locations. In this case, the processing described in the above embodiment is performed on each obtained image frame.

10: Image frame, 11: Human image (object image), 12: Rectangular area, 13, 14: Areas

Claims

1. An object detection device for detecting an object image to be detected from an image frame obtained by image input means, the device comprising:
model creation means for creating an image model of the object image in advance;
feature amount calculation means for calculating, as a feature amount, the luminance gradient of each pixel of the image frame for a region in the image frame corresponding to the image model of the object image created by the model creation means; and
identification means for determining whether or not the object image exists in the image frame, using the calculation result of the feature amount calculation means and a learning result of the feature amount of the object image obtained in advance.

2. The object detection device according to claim 1, wherein the learning result of the feature amount of the object image includes learning data for each direction of the object image, and, on condition that the object image is determined to exist in the image frame, the identification means further identifies the direction of the object image.

3. The object detection apparatus according to claim 1, wherein the object image is a human image.

4. An object detection method for detecting an object image to be detected from an image frame obtained by image input means, the method comprising:
a model creation step of creating an image model of the object image in advance;
a feature amount calculation step of calculating, as a feature amount, the luminance gradient of each pixel of the image frame for a region in the image frame corresponding to the image model of the object image created in the model creation step; and
an identification step of determining whether or not the object image exists in the image frame, using the calculation result of the feature amount calculation step and a learning result of the feature amount of the object image obtained in advance.

5. The object detection method according to claim 4, wherein the learning result of the feature amount of the object image includes learning data for each direction of the object image, and, on condition that the object image is determined to exist in the image frame, the direction of the object image is further identified in the identification step.

6. The object detection method according to claim 4, wherein the object image is a human image.