JP2012083855A - Object recognition device and object recognition method

Info

Publication number: JP2012083855A
Application number: JP2010227770A
Authority: JP
Inventors: Murahito Hattori; 祐人服部; Osamu Hasegawa; 修長谷川; Wataru Kasai; 航笠井; Keisuke Takada; 圭佑高田
Original assignee: Tokyo Institute of Technology NUC; Toyota Motor Corp
Current assignee: Tokyo Institute of Technology NUC; Toyota Motor Corp
Priority date: 2010-10-07
Filing date: 2010-10-07
Publication date: 2012-04-26

Abstract

PROBLEM TO BE SOLVED: To recognize a three-dimensional object faster and more accurately than conventional methods.
SOLUTION: An object recognition device 1 includes: a database 2 that stores stored images, feature points extracted in advance from the stored images, and positional relation information of the feature points; stored image specifying means 11 that searches for corresponding feature points between the feature points of the stored images in the database 2 and feature points extracted from an input image, casts onto the input image voting points calculated from the positional relation information of the stored-image feature points, judges the similarity between the input image and a stored image to be higher the more concentrated the voting points are and the larger the number of votes is, and thereby specifies the stored image similar to the input image; and determination means 12 that generates a recall image from the specified stored image, compares the recall image with the input image, and, when the two are judged to be similar, determines that the recognition target has been detected in the input image.

Description

The present invention relates to an object recognition device and an object recognition method.

Object recognition is an indispensable function for giving a robot visual ability, for example, and is one of the important research themes in computer vision (see Non-Patent Document 1). Class classification of general object images under somewhat restricted viewpoints, such as an airplane seen from the side or a road sign seen from the front, is also a very active research area, partly because common datasets exist.

Patent Document 1: JP 2007-148872 A
Patent Document 2: JP 2007-280040 A

Non-Patent Document 1: K. Yanai, "Present state and future of general object recognition," IPSJ Transactions on Computer Vision and Image Media, vol. 48, no. SIG16 (CVIM19), pp. 1-24, 2007.
Non-Patent Document 2: M. Sun, H. Su, S. Savarese, and L. Fei-Fei, "A multiview probabilistic model for 3d object classes," IEEE Int. Conf. Comput. Vision and Pattern Recognit., pp. 1247-1254, 2009.

However, the methods for general objects disclosed in Non-Patent Document 1 and elsewhere are still of limited use in real environments. One reason is that, although the recognition target is a three-dimensional object, most existing studies restrict the viewpoints from which it can be recognized and therefore cannot cope with the intra-class variation caused by changes in the target's orientation.

To address this problem, studies tackling three-dimensional object recognition for general objects have appeared in recent years (see, for example, Non-Patent Document 2). In three-dimensional object recognition, a target can be considered correctly recognized only when its orientation has also been estimated. Even in the conventional studies exemplified by Non-Patent Document 2, however, the evaluation of orientation estimation is given little weight, and few studies carry the evaluation through to a quantitative assessment.

The methods exemplified in Non-Patent Document 2 are all model-based methods that learn the features of a target while taking its geometric structure into account. For three-dimensional object recognition, where intra-class variation is large, a model-based method that uses three-dimensional structural information of the target is a reasonable approach. However, such methods generally have a high learning cost, for example because the shooting angle of the target is required at learning time, and the description of the target tends to become complicated. The problems of model-based methods are described in more detail below.

First, handling three-dimensional geometric structure generally requires calculating the connections between various feature regions across arbitrary viewpoints and their relationships (projective transformations and the like). As a result, the representation of the target model, and hence the implementation, tends to become complicated.

In addition, information such as the shooting angle of each learning image must be attached to the learning data, so the cost of creating the learning data is often high.

Moreover, in three-dimensional object recognition, the shape and appearance of individual targets vary greatly with the viewpoint even within the same class. As a result, recognition often takes a long time.

Furthermore, these methods recognize a target using feature quantities extracted from an image, and because the feature quantities capture only the characteristic information of the image, they carry less information than the image itself. Consequently, when recognition is performed against a complicated background (scene) that contains many feature quantities similar to the learned ones, misrecognition can occur. In other words, the discriminative power of the feature quantities strongly affects recognition accuracy. On the other hand, when highly discriminative feature quantities are used, a target that belongs to the same class but does not appear in the learning images may fail to be detected as a member of that class. Moreover, although recognition evaluates the structural connections between feature quantities, it does not consider similarity of the overall appearance. For these reasons, recognition accuracy is low (the false detection rate is high).

As other techniques related to the present invention, Patent Documents 1 and 2 disclose a face authentication device that can authenticate a person despite partial changes in the face image that are independent of the passage of time, such as changes in glasses or a beard. The face authentication device collates an authentication target image extracted from an input image with a registered image learned in advance. If the collation finds that they are not the same person, the device generates from the registered image a recall image close to the registered image, and collates again with the registered image using the recall image in place of the authentication target image. Interpolating regions partly hidden by glasses or sunglasses with a recall image generated from the registered image does improve the detection rate, but it is difficult to improve the accuracy of the interpolated image, and the approach is insufficient for reducing the false authentication rate.

Accordingly, an object of the present invention is to solve the problems described above and to provide an object recognition device and an object recognition method capable of three-dimensional object recognition that is faster and more accurate than conventional methods.

An object recognition device according to a first aspect of the present invention includes: storage means that stores a plurality of stored images each including a recognition target, feature points extracted in advance from each of the stored images, and positional relation information of the extracted feature points with respect to a reference position in the stored image; stored image specifying means that extracts feature points from an input image, searches for corresponding feature points between the feature points of the stored images stored in the storage means and the feature points extracted from the input image, casts onto the input image, for each corresponding point found, a voting point calculated from the positional relation information of the stored-image feature point, judges the similarity between the input image and a stored image to be higher the more concentrated the voting points are and the larger the number of votes is, and, as a result of that judgment, specifies the stored image similar to the input image; and determination means that generates, using the stored image specified by the stored image specifying means, a recall image showing how the input image appears, compares the generated recall image with the input image to judge whether they are similar, and, when they are judged to be similar, determines that the recognition target has been detected in the input image.

This makes it possible to achieve three-dimensional object recognition that is faster and more accurate than conventional methods.

The determination means may generate the recall image by combining a plurality of stored images specified by the stored image specifying means in accordance with the similarity of each to the input image.
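
As an illustrative sketch only, combining stored images in accordance with their similarity can be pictured as a weighted average of pixel values. The function name `blend_recall_image`, the nested-list image representation, and the choice of a normalized weighted average are assumptions made for this example; the text above does not fix the combining formula.

```python
def blend_recall_image(stored_images, similarities):
    """Blend equally sized grayscale images (lists of rows of pixel values).

    Assumption: each stored image contributes in proportion to its
    similarity to the input image (normalized weighted average).
    """
    total = sum(similarities)
    if total <= 0:
        raise ValueError("at least one similarity must be positive")
    height, width = len(stored_images[0]), len(stored_images[0][0])
    recall = [[0.0] * width for _ in range(height)]
    for image, similarity in zip(stored_images, similarities):
        weight = similarity / total
        for y in range(height):
            for x in range(width):
                recall[y][x] += weight * image[y][x]
    return recall


# Two 1x2 images; the second is three times as similar as the first.
recall = blend_recall_image([[[0, 0]], [[100, 200]]], [1, 3])
```

In this sketch the stored images are assumed to have already been resized to a common size before blending.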

Furthermore, when comparing the recall image with the input image to judge whether they are similar, the determination means may judge the recall image and the input image to be more similar the larger the total number of voting points in the input image that were used to generate the recall image.

The storage means may further store orientation information of the recognition target in association with each stored image. In that case, when the determination means judges the recall image and the input image to be similar, it determines that the recognition target has been detected in the input image and also estimates the orientation of the detected recognition target from the orientation information of the stored images used to generate the recall image.

An object recognition device according to a second aspect of the present invention includes: a database that stores, for each class model, a plurality of template models, each template model including a stored image that includes a recognition target, feature points extracted in advance from the stored image, positional relation information of the extracted feature points with respect to a reference position in the stored image, an average scale size of the feature points, and a template size that is the size of the stored image normalized by the average scale size; a corresponding point search unit that extracts feature points from an input image and matches the feature points of the stored images stored in the database against the feature points extracted from the input image, thereby finding, as corresponding points, the feature points of the input image that correspond to feature points of the stored images; a voting processing unit that, for each corresponding point of the input image found by the corresponding point search unit, calculates a voting point based on the positional relation information of the stored-image feature point stored in the database and casts the voting point in the input image; a voting point clustering unit that clusters the voting points cast in the input image by the voting processing unit and, when the number of votes of the voting points in one cluster is equal to or greater than a predetermined threshold, judges that the recognition target exists centered on the center of that cluster; a candidate frame generation unit that, for each cluster center at which the voting point clustering unit has judged the recognition target to exist, generates a rectangular region based on the average scale size and the template size of the template model corresponding to the cluster center, as a candidate frame indicating the range in which the recognition target exists in the input image; a recall image generation unit that, for each candidate frame generated by the candidate frame generation unit, generates a recall image showing how the input image appears, using the stored image of the template model corresponding to the candidate frame; and a confidence value calculation unit that compares the recall image generated by the recall image generation unit with the image within the candidate frame in the input image to calculate a degree of difference, removes the candidate frame as a false detection when the calculated degree of difference is greater than a predetermined threshold, and, for each candidate frame not removed, determines that the recognition target has been detected in the range of the input image indicated by the candidate frame.
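
The clustering and candidate frame steps of the second aspect can be sketched as follows. This is a hedged illustration, not the disclosed implementation: the greedy radius-based clustering, the function names `cluster_votes` and `candidate_frame`, and the use of an input-side scale value `input_scale` (for example, the mean scale of the matched input feature points) to convert the normalized template size back to pixels are all assumptions.

```python
def cluster_votes(votes, radius, min_votes):
    """Greedy radius clustering of (x, y) voting points.

    Returns a list of ((center_x, center_y), vote_count) for clusters whose
    vote count reaches min_votes. The clustering algorithm itself is an
    assumption; the text only requires that voting points be clustered.
    """
    clusters = []  # each entry: [sum_x, sum_y, count]
    for x, y in votes:
        for cluster in clusters:
            cx, cy = cluster[0] / cluster[2], cluster[1] / cluster[2]
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                cluster[0] += x
                cluster[1] += y
                cluster[2] += 1
                break
        else:
            clusters.append([x, y, 1])
    return [((c[0] / c[2], c[1] / c[2]), c[2]) for c in clusters if c[2] >= min_votes]


def candidate_frame(center, template_size, input_scale):
    """Rectangle (left, top, right, bottom) centered on a cluster center.

    template_size is (width, height) normalized by the average scale size;
    multiplying by input_scale (hypothetical) restores a size in pixels.
    """
    width = template_size[0] * input_scale
    height = template_size[1] * input_scale
    cx, cy = center
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


detections = cluster_votes([(10, 10), (12, 10), (11, 12), (100, 100)], radius=5, min_votes=3)
frame = candidate_frame((50, 40), (4.0, 2.0), 10.0)
```

The isolated vote at (100, 100) forms its own cluster and is discarded because it does not reach the vote threshold.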

For each candidate frame generated by the candidate frame generation unit, the recall image generation unit may generate the recall image by combining the stored images of a plurality of template models corresponding to the candidate frame in accordance with the number of votes that each template model obtained in the input image.

Furthermore, the confidence value calculation unit may compare the recall image generated by the recall image generation unit with the image within the candidate frame in the input image to calculate a degree of difference, calculate the total number of votes obtained in the input image from all the template models corresponding to the candidate frame, remove the candidate frame as a false detection when a confidence value calculated from the reciprocal of the degree of difference and the total number of votes is smaller than a predetermined threshold, and, for each candidate frame not removed, determine that the recognition target has been detected in the range of the input image indicated by the candidate frame.
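
A minimal sketch of this confidence-based filtering is given below. The text states only that the confidence value is calculated from the reciprocal of the degree of difference and the total number of votes; the mean absolute pixel difference used as the degree of difference, the product used to combine the two quantities, the small constant `eps`, and all function names are assumptions for illustration.

```python
def dissimilarity(recall, patch):
    """Mean absolute pixel difference between two equally sized grayscale images.

    Assumption: the text does not specify the difference measure.
    """
    diffs = [abs(a - b)
             for row_r, row_p in zip(recall, patch)
             for a, b in zip(row_r, row_p)]
    return sum(diffs) / len(diffs)


def confidence(recall, patch, total_votes, eps=1e-6):
    """Assumed combination: total votes times the reciprocal of the difference.

    eps avoids division by zero when the two images are identical.
    """
    return total_votes / (dissimilarity(recall, patch) + eps)


def filter_candidates(candidates, threshold):
    """Keep candidate frames whose confidence value reaches the threshold."""
    return [c for c in candidates
            if confidence(c["recall"], c["patch"], c["votes"]) >= threshold]


candidates = [
    {"recall": [[10, 20]], "patch": [[12, 16]], "votes": 30},   # close match
    {"recall": [[0, 0]], "patch": [[100, 100]], "votes": 30},   # poor match
]
kept = filter_candidates(candidates, threshold=5.0)
```

With equal vote counts, the candidate whose recall image differs strongly from the input patch falls below the threshold and is removed as a false detection.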

The template model may further include orientation information indicating the orientation of the recognition target, and the device may further include an orientation estimation unit that, when the confidence value calculation unit determines that the recognition target has been detected, estimates the orientation of the detected recognition target from the orientation information of the plurality of template models corresponding to the candidate frame of the detected recognition target.
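
One simple way to picture orientation estimation from several template models is a vote-weighted majority over their orientation labels. This rule, the function name `estimate_orientation`, and the example labels are assumptions for illustration; the text above says only that the estimate is based on the orientation information of the corresponding template models.

```python
from collections import defaultdict


def estimate_orientation(template_votes):
    """Return the orientation label with the largest accumulated vote count.

    template_votes: iterable of (orientation_label, vote_count) pairs, one
    per template model that contributed to the candidate frame.
    """
    totals = defaultdict(int)
    for label, votes in template_votes:
        totals[label] += votes
    if not totals:
        raise ValueError("no template models contributed to this candidate frame")
    return max(totals, key=totals.get)


orientation = estimate_orientation([("front", 5), ("left", 7), ("front", 4)])
```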

An object recognition method according to a third aspect of the present invention includes: a storage step of storing in advance a plurality of stored images each including a recognition target, feature points extracted in advance from each of the stored images, and positional relation information of the extracted feature points with respect to a reference position in the stored image; a stored image specifying step of extracting feature points from an input image, searching for corresponding feature points between the feature points of the stored images stored in the storage step and the feature points extracted from the input image, casting onto the input image, for each corresponding point found, a voting point calculated from the positional relation information of the stored-image feature point, judging the similarity between the input image and a stored image to be higher the more concentrated the voting points are and the larger the number of votes is, and, as a result of that judgment, specifying the stored image similar to the input image; and a determination step of generating, using the stored image specified in the stored image specifying step, a recall image showing how the input image appears, comparing the generated recall image with the input image to judge whether they are similar, and, when they are judged to be similar, determining that the recognition target has been detected in the input image.

An object recognition method according to a fourth aspect of the present invention includes: a storage step of storing in advance, for each class model, a plurality of template models, each template model including a stored image that includes a recognition target, feature points extracted in advance from the stored image, positional relation information of the extracted feature points with respect to a reference position in the stored image, an average scale size of the feature points, and a template size that is the size of the stored image normalized by the average scale size; a corresponding point search step of extracting feature points from an input image and matching the feature points of the stored images stored in the storage step against the feature points extracted from the input image, thereby finding, as corresponding points, the feature points of the input image that correspond to feature points of the stored images; a voting processing step of calculating, for each corresponding point of the input image found in the corresponding point search step, a voting point based on the positional relation information of the stored-image feature point stored in the storage step, and casting the voting point in the input image; a voting point clustering step of clustering the voting points cast in the input image in the voting processing step and, when the number of votes of the voting points in one cluster is equal to or greater than a predetermined threshold, judging that the recognition target exists centered on the center of that cluster; a candidate frame generation step of generating, for each cluster center at which the recognition target was judged to exist in the voting point clustering step, a rectangular region based on the average scale size and the template size of the template model corresponding to the cluster center, as a candidate frame indicating the range in which the recognition target exists in the input image; a recall image generation step of generating, for each candidate frame generated in the candidate frame generation step, a recall image showing how the input image appears, using the stored image of the template model corresponding to the candidate frame; and a confidence value calculation step of comparing the recall image generated in the recall image generation step with the image within the candidate frame in the input image to calculate a degree of difference, removing the candidate frame as a false detection when the calculated degree of difference is greater than a predetermined threshold, and, for each candidate frame not removed, determining that the recognition target has been detected in the range of the input image indicated by the candidate frame.

According to the present invention, it is possible to provide an object recognition device and an object recognition method capable of three-dimensional object recognition that is faster and more accurate than conventional methods.

FIG. 1 is a functional block diagram showing the configuration of the object recognition device according to Embodiment 1.
FIG. 2 is a diagram for explaining the learning model according to Embodiment 1.
FIG. 3 is a flowchart outlining the recognition procedure according to Embodiment 1.
FIG. 4 is an image for explaining the position vectors of feature points according to Embodiment 1.
FIG. 5 is an image for explaining the voting of center candidate points according to Embodiment 1.
FIG. 6 is an image for explaining corresponding point search and voting processing in an input image according to Embodiment 1.
FIG. 7 is an image showing an example of a recall image according to Embodiment 1.
FIG. 8 is an image for explaining the method of generating a recall image according to Embodiment 1.
FIG. 9 is an image for explaining false detection removal using a recall image according to Embodiment 1.
FIG. 10 is an image showing a recognition result in a case of correct detection according to Embodiment 1.
FIG. 11 is an image showing a recognition result in a case of erroneous detection according to Embodiment 1.
FIG. 12 is a graph showing class detection results according to Embodiment 1.
FIG. 13 is a graph showing orientation estimation results according to Embodiment 1.

Before describing each embodiment of the present invention, the basic configuration of the present invention will be described.
First, for three-dimensional object recognition of general objects that are rigid bodies with relatively little change in shape (an automobile is used as the example in the embodiment described later), the present invention adopts an appearance-based method that learns and recognizes each learning image independently. In addition, as described later, to confirm the effect of the present invention, the PASCAL VOC 2006 dataset is used to evaluate the class detection accuracy for three-dimensional objects and to quantitatively evaluate the results of orientation estimation.

The present invention uses voting processing, which has a relatively low computational cost (M. Takagi and H. Fujiyoshi, "Traffic road sign recognition using SIFT features," IEEJ Transactions C, vol. 129, no. 5, pp. 824-831, 2009), to determine the presence or absence of each learned recognition target, and superimposes those results to estimate the range in which the recognition target exists and its orientation.

The present invention is characterized in that, at recognition time, it recalls the appearance of the recognition target (that is, it generates a recall image). This estimates how the recognition target appears within the recognized range, which makes it possible to check visually how the system is recognizing the target. In the present invention, false detections are removed by comparing this recall result with the recognized range as images, thereby improving recognition accuracy. In principle, the recognition target of the present invention is not limited to automobiles and may be any other object, provided it is a rigid body with little change in shape.

Embodiment 1
Embodiments of the present invention will be described below with reference to the drawings. In the description that follows, reference signs introduced earlier are reused as necessary.

FIG. 1 is a functional block diagram showing the configuration of the object recognition device according to the present embodiment. The object recognition device 1 includes a database 2 that stores a learning model, a corresponding point search unit 3, a voting processing unit 4, a voting point clustering unit 5, a candidate frame generation unit 6, a recall image generation unit 7, a confidence value calculation unit 8, and an orientation estimation unit 9.

The stored image specifying means 11 has the functions of the corresponding point search unit 3, the voting processing unit 4, the voting point clustering unit 5, and the candidate frame generation unit 6. More specifically, the corresponding point search unit 3 extracts feature points from the input image and searches for corresponding feature points between the feature points of the stored images stored in the database 2 and the feature points extracted from the input image. The voting processing unit 4 casts onto the input image, for each corresponding point found, a voting point calculated from the positional relation information of the stored-image feature point. The voting point clustering unit 5 judges the similarity between the input image and a stored image to be higher the more concentrated the voting points are and the larger the number of votes is. The candidate frame generation unit 6 specifies, as a result of that judgment, the stored image similar to the input image.
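
The voting step can be sketched as follows, under stated assumptions. Here the positional relation information is taken to be an offset vector from a stored feature point to the reference position of the stored image, and the offset is rescaled by the ratio of the matched feature scales before being added to the input feature position. The scale-ratio rule, the omission of rotation, and the names `vote_point` and `cast_votes` are illustrative assumptions rather than details given in the text.

```python
def vote_point(input_xy, input_scale, stored_scale, location_vector):
    """Voting point in the input image for one matched feature pair.

    location_vector: offset from the stored feature point to the reference
    position in the stored image (assumed representation).
    The offset is rescaled by input_scale / stored_scale (assumption).
    """
    ratio = input_scale / stored_scale
    return (input_xy[0] + ratio * location_vector[0],
            input_xy[1] + ratio * location_vector[1])


def cast_votes(matches):
    """Compute one voting point per matched feature pair."""
    return [vote_point(m["input_xy"], m["input_scale"],
                       m["stored_scale"], m["location_vector"])
            for m in matches]


votes = cast_votes([
    {"input_xy": (100, 50), "input_scale": 4.0,
     "stored_scale": 2.0, "location_vector": (10, -5)},
])
```

When many matched pairs belong to the same object, their voting points fall close together, which is what the subsequent clustering step exploits.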

The determination means 12 has the functions of the recall image generation unit 7, the confidence value calculation unit 8, and the orientation estimation unit 9. More specifically, the recall image generation unit 7 generates a recall image showing how the input image appears, using the stored image specified by the stored image specifying means 11. The confidence value calculation unit 8 compares the generated recall image with the input image to judge whether they are similar and, when they are judged to be similar, determines that the recognition target has been detected in the input image. The orientation estimation unit 9 estimates the orientation of the detected recognition target.

The database 2 is storage means that stores a learning model. In the present embodiment, the collection of all template models for one class is treated as a class model, and the collection of all class models is treated as the learning model. That is, the learning model contains a plurality of class models, and each class model contains a plurality of template models. In the example shown in FIG. 1, the learning model contains a class model 21, and the class model 21 contains template models 211, 212, and 213. Although the present embodiment is described with a learning model that has a single class model (the automobile class), the learning model may have a plurality of class models (multi-class).

The object recognition device 1 learns the recognition target in advance and performs recognition using the learning model stored beforehand in the database 2. During learning, each time a learning image is presented, the object recognition device 1 incrementally learns the recognition target contained in that learning image. The recognition target is learned by constructing a template model for every learning image supplied.

As shown in FIG. 2, each template model contains a grayscale image of the region containing the recognition target (Learning image; hereinafter sometimes referred to as the stored image), the feature points and their position vectors (Feature point & Location vector), and template information (Template information). The template information contains the average scale size of the feature points (Average scale size of feature points), the size of the stored image normalized by the average scale size (Template size (width, height); hereinafter sometimes referred to as the template size), and an orientation label (Pose label).
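The template model described above can be pictured as a small record type. The following is only an illustrative sketch: the class names, field names, and the use of nested lists for the grayscale image are assumptions made for this example and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FeaturePoint:
    """One local feature of a stored image (field names are illustrative)."""
    x: float                               # coordinates in the stored image
    y: float
    scale: float                           # scale size of the feature point
    orientation: float                     # luminance gradient direction (radians)
    descriptor: List[float]                # e.g. a SURF descriptor
    location_vector: Tuple[float, float]   # offset relative to the reference point


@dataclass
class TemplateModel:
    """Template model built from one learning image."""
    stored_image: List[List[int]]          # grayscale pixels of the target region
    features: List[FeaturePoint]
    pose_label: str                        # orientation label supplied by the user
    average_scale: float = field(init=False)
    template_size: Tuple[float, float] = field(init=False)

    def __post_init__(self) -> None:
        height = len(self.stored_image)
        width = len(self.stored_image[0])
        # Average scale size of the feature points.
        self.average_scale = sum(f.scale for f in self.features) / len(self.features)
        # Template size: stored-image size normalized by the average scale size.
        self.template_size = (width / self.average_scale, height / self.average_scale)
```

Deriving the average scale size and the template size in `__post_init__` mirrors the statement that the device computes these two values, while the orientation label is supplied from outside.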

When a learning image is supplied, the object recognition device 1 extracts local features (hereinafter sometimes referred to as feature points) within the region of the learning image where the recognition target is present, and calculates a position vector for each extracted feature point. The position vector is described in detail later. In the present embodiment, the object recognition device 1 calculates the average scale size of the feature points and the template size, and the orientation label is supplied by the user.

Next, an outline of the recognition procedure performed by the object recognition device 1 is described with reference to FIG. 3. The object recognition device 1 recognizes the target using the learning model constructed in advance. The recognition procedure broadly consists of two stages: a candidate frame detection procedure (S102) and a false detection removal procedure (S103). First, the candidate frame detection procedure (S102) detects ranges of the input image in which the recognition target is considered to be present (hereinafter sometimes referred to as candidate frames). The false detection removal procedure (S103) then removes, from the detected candidate frames, those considered to be false detections. In more detail, these stages comprise the following steps.

S101: The object recognition device 1 acquires an input image (input image acquisition process).
S111: The corresponding point search unit 3 searches for feature point correspondences between the learning model and the input image (corresponding point search process).
S112: For each feature point of the input image that has a correspondence with the learning model, the voting processing unit 4 uses the positional relationship information of the feature points in the learning model to vote for a center candidate point for that feature point (voting process).
S113: The voting point clustering unit 5 clusters the center candidate points voted onto the input image (hereinafter sometimes simply referred to as voting points) and obtains the clusters that have collected at least a predetermined threshold number of votes (voting point clustering process).
S114: The candidate frame generation unit 6 generates a rectangular region (candidate frame) surrounding each cluster center (candidate frame generation process).
S116: For each candidate frame in the input image, the recall image generation unit 7 generates a recall image using the stored images of the learning model that correspond to the candidate frame (appearance recall process).
S117: The confidence value calculation unit 8 calculates the similarity between the recall image and the actual image (the image inside the candidate frame in the input image) as a confidence value and removes falsely detected candidate frames (confidence value calculation process).
S117: The orientation estimation unit 9 estimates the orientation of the detected recognition target from the orientation labels of the template models used to calculate the candidate frame (orientation estimation process).
S104: The object recognition device 1 outputs whether the recognition target is present, the range in which it is present, and its orientation (recognition result output process).
The processing performed by each unit is described in detail below.

The corresponding point search unit 3 extracts feature points from the input image and matches all feature points in the learning model against the feature points extracted from the input image. In the present embodiment, SURF is used for the feature points extracted from the input image, but the type of feature point is not limited to this, and other types of feature points used for object recognition, such as SIFT, may be used.

Let p_i be the i-th feature point in the learning model, q_j the j-th feature point in the input image, and d_ij the distance between them. The nearest neighbor of feature point p_i is then the feature point q_{j_1NN} with index j_1NN = argmin_j d_ij. If this feature point q_{j_1NN} were simply taken as the corresponding point, many false correspondences could arise, for example when the input image contains several similar feature points. For this reason, the present embodiment uses as corresponding points only those feature points q_j that additionally satisfy the following Expression (1). In Expression (1), j_2NN denotes the index of the second nearest feature point, and t is a predetermined threshold.
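Expression (1) itself is not reproduced in this text. Based on the explanation that the nearest-neighbor distance must be at most a fixed ratio of the second-nearest distance, it can be reconstructed as the following ratio test; treat this as a reconstruction from the surrounding description, not as the original figure.

```latex
\frac{d_{i j_{1NN}}}{d_{i j_{2NN}}} \le t
\qquad \text{(1)}
```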

Expression (1) states that the distance d_{i j_1NN} to the nearest neighbor is no more than a fixed ratio of the distance d_{i j_2NN} to the second nearest feature point. This condition allows the corresponding point search to be performed with fewer false correspondences.
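The corresponding point search with this ratio condition can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between descriptors and a brute-force search; the function name `match_ratio_test` and the default threshold `t = 0.7` are choices made for this example and are not taken from the specification.

```python
import math
from typing import Dict, Sequence


def match_ratio_test(model_desc: Sequence[Sequence[float]],
                     input_desc: Sequence[Sequence[float]],
                     t: float = 0.7) -> Dict[int, int]:
    """Map each model feature index i to its accepted input feature index j."""
    matches: Dict[int, int] = {}
    for i, p in enumerate(model_desc):
        # Distances d_ij from model feature p_i to every input feature q_j.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(input_desc))
        if len(dists) < 2:
            continue  # the ratio test needs a second nearest neighbor
        (d1, j1), (d2, _) = dists[0], dists[1]
        # Accept only if the nearest distance is at most t times the second nearest.
        if d1 <= t * d2:
            matches[i] = j1
    return matches
```

A model feature whose two closest input features are almost equally distant is rejected, which is how ambiguous matches among similar-looking features are filtered out.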

The voting processing unit 4 uses a voting process to determine whether the recognition target is present and in which region it is present. Because the voting process estimates the existence range with a non-iterative procedure, namely calculating a uniquely determined voting point for every matched feature point, its computational cost is low and it helps improve recognition speed.

In the voting process, the positional relationship information of the feature points in the learning model is used to estimate, for each corresponding feature point in the input image, a center candidate point of the recognition target contained in the input image, and that point is voted. This technique is an application of the generalized Hough transform: instead of counting the votes for each entry of a hash table, the center candidate points voted onto the actual image region are clustered and the number of votes in each cluster is counted. The flow of the center candidate point voting process is described below.

First, as preparation, a reference point is assigned to each stored image of the template models at learning time. In the present embodiment, the center of the stored image is set as the reference point, and in each stored image the positional relationship information between the reference point and each feature point is calculated. The calculated positional relationship information is then given to the template model as a position vector for each feature point. Although the center of the stored image is used as the reference point in the present embodiment, this is not a limitation, and any other position may be set as the reference point.

For example, as shown in the upper left of FIG. 4, the position vector (location vector) of a feature point (feature point) is calculated with the center (Center point) as the reference point. As shown in the lower right of FIG. 4, for the recognition target (an automobile) in the stored image, a position vector is calculated for each of three feature points with the center of the stored image as the reference.

Next, based on a feature point of the stored image of the template model and the position vector of that feature point, the reference point for the corresponding feature point in the input image, that is, the center candidate point, is obtained. Here, the feature point and position vector in the stored image of the template model and the feature point in the input image are given as follows.
(i) Feature point in the stored image of the template model:
Coordinates: (x_temp, y_temp)
Scale: σ_temp
Luminance gradient direction: θ_temp
Position vector: (Δx, Δy)
(ii) Feature point in the input image:
Coordinates: (x_in, y_in)
Scale: σ_in
Luminance gradient direction: θ_in

Then, from the coordinates, scale, luminance gradient direction, and position vector of the feature point of the stored image and the coordinates, scale, and luminance gradient direction of the feature point of the input image, the center candidate point (X, Y) in the input image can be obtained by the following Expressions (2) and (3), where θ = arctan(Δy/Δx).
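Expressions (2) and (3) are not reproduced in this text. One form that is consistent with the quantities listed above scales the position vector by the ratio of scales and rotates it by the difference in luminance gradient direction. The form below is a hedged reconstruction that assumes the position vector points from the feature point toward the reference point; the original expressions may differ in sign convention.

```latex
X = x_{in} + \frac{\sigma_{in}}{\sigma_{temp}} \sqrt{\Delta x^{2} + \Delta y^{2}}\,
    \cos\left(\theta + \theta_{in} - \theta_{temp}\right) \qquad \text{(2)}

Y = y_{in} + \frac{\sigma_{in}}{\sigma_{temp}} \sqrt{\Delta x^{2} + \Delta y^{2}}\,
    \sin\left(\theta + \theta_{in} - \theta_{temp}\right) \qquad \text{(3)}
```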

The above processing is performed, for each template model, on all corresponding feature points, and the center candidate points are voted onto the input image. That is, for one template model, for each feature point of the stored image that has a correspondence with a feature point of the input image, a center candidate point is calculated using information such as the position vector of that feature point, and the calculated center candidate point is voted onto the input image.
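The calculation of one voting point can be sketched as follows. This is an illustrative sketch only: it assumes that the position vector points from the feature point toward the reference point, that it is scaled by the ratio of the two feature scales, and that it is rotated by the difference in luminance gradient direction. The function name and argument names are hypothetical, and `atan2` is used in place of arctan(Δy/Δx) so that the quadrant of the position vector is preserved.

```python
import math
from typing import Tuple


def center_candidate(temp_scale: float, temp_theta: float,
                     location_vector: Tuple[float, float],
                     in_xy: Tuple[float, float],
                     in_scale: float, in_theta: float) -> Tuple[float, float]:
    """Return the center candidate point (X, Y) voted by one matched feature."""
    dx, dy = location_vector
    # Length of the position vector, rescaled to the input image.
    length = math.hypot(dx, dy) * (in_scale / temp_scale)
    # Direction of the position vector, rotated by the orientation difference.
    angle = math.atan2(dy, dx) + (in_theta - temp_theta)
    return (in_xy[0] + length * math.cos(angle),
            in_xy[1] + length * math.sin(angle))
```

Calling this function once per matched feature pair yields the set of voting points that is clustered in the next stage.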

When a recognition target identical or similar to that of the stored image is present in the input image (that is, when a recognition target of the same class is present), the positional relationship of the feature points to the reference point in the stored image and the positional relationship of the feature points to the center candidate points in the input image are expected to be similar. In other words, because the positions of all feature points belonging to the same recognition target in the stored image are defined by position vectors from one and the same reference point, if a recognition target of the same class as the stored image is present in the input image, the center candidate points estimated for the feature points in the input image from the position vectors and other information of the corresponding stored-image feature points are likely to concentrate at a particular location, just as the reference point does in the stored image. Therefore, if a recognition target of the same class is present in the input image, the voted center candidate points are expected to concentrate near the center of that recognition target.

For example, when the center candidate points (voting points) are scattered, as shown in the left part of FIG. 5, it is likely that no recognition target is present. Conversely, when the voting points are concentrated, as shown in the right part of the same figure, it is likely that a recognition target is present.

Once the center candidate points have been voted onto the input image, the voting point clustering unit 5 clusters those center candidate points (voting points). The voting point clustering unit 5 groups neighboring voting points into the same cluster and then obtains, for each cluster, the number of votes it contains. When the number of votes in a cluster is equal to or greater than a predetermined threshold (hereinafter sometimes referred to as the vote threshold), it is judged that a recognition target centered on that cluster center is present.

For example, FIG. 6 shows a case in which the number of votes in a cluster of the input image (Input image) is equal to or greater than the vote threshold. In this case, it is judged that the recognition target (an automobile) contained in the stored image of the template model (Template model) is present, centered on the cluster center in the input image.

In the present embodiment, the voting points are clustered based on the TOD (Threshold Order-Dependent) algorithm (M. Friedman and A. Kandel, "Introduction to Pattern Recognition," pp.70-73, World Scientific Publishing Company, 1999.). The TOD algorithm can process data sequentially, and because its processing is extremely simple, it can cluster data at high speed. The clustering of the voting points is not limited to the TOD algorithm and may be based on other known clustering methods.

Clustering based on the TOD algorithm is described below. Let v be the coordinates of a voting point, and let w be a vector whose elements are the scale size of the feature point associated with the voting point and the template size. The processing is then as follows.

Step 1: Set the clustering threshold T. This is the maximum distance between voting points that are placed in the same cluster. In the present embodiment, a maximum distance per unit scale size is determined in advance, and for each input image this distance multiplied by the average scale size of all feature points is set as the clustering threshold T. The threshold can thus be set to a value relative to the size of the recognition target.

Step 2: Let C be the set of cluster centers, and make the first input c_0 an element of C. Also create the set P_c0 of orientation labels for cluster center c_0 and make the orientation label of the first input its element. Further, set the number of votes ε_c0 for cluster center c_0 to 1.

Step 3: Taking c_new as a new input, search for the cluster center nearest to c_new, c_NN = argmin_{c∈C} ‖v_cnew − v_c‖.

Step 4: If the distance between v_cnew and v_cNN exceeds the clustering threshold T, add c_new to the set C. Further, create P_cnew with the orientation label of c_new as its element, set the number of votes ε_cnew for c_new to 1, and return to Step 3.

Step 5: If the distance between v_cnew and v_cNN is equal to or less than the clustering threshold T, increment ε_cNN, the total number of votes for c_NN, by 1 and update the values of v_cNN and w_cNN as shown in the following Expressions (4) and (5). Further, add the orientation label of c_new to the set P_cNN and return to Step 3.
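Expressions (4) and (5) are not reproduced in this text. The description of candidate frame generation states that the value held by a cluster center ends up being something like the average over the voting points in the cluster, which suggests a running-mean update. The following is a hedged reconstruction on that assumption, with ε_cNN taken as the vote count after the increment; the original expressions may weight the terms differently.

```latex
v_{c_{NN}} \leftarrow \frac{(\varepsilon_{c_{NN}} - 1)\, v_{c_{NN}} + v_{c_{new}}}{\varepsilon_{c_{NN}}} \qquad \text{(4)}

w_{c_{NN}} \leftarrow \frac{(\varepsilon_{c_{NN}} - 1)\, w_{c_{NN}} + w_{c_{new}}}{\varepsilon_{c_{NN}}} \qquad \text{(5)}
```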

Step 6: Finally, once all voting points have been input, determine for every generated cluster center whether its number of votes is equal to or greater than the vote threshold. Cluster centers whose number of votes is less than the vote threshold are deleted. In the present embodiment, the above processing is first executed for each template model to perform the clustering.
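Steps 1 to 6 can be sketched as a single sequential pass over the voting points. The sketch below is illustrative only: it assumes that a cluster center is updated as the running mean of its votes, omits the vector w for brevity, and uses hypothetical names (`Cluster`, `tod_cluster`, `vote_threshold`) that do not appear in the specification.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Cluster:
    center: Point                  # v: position of the cluster center
    votes: int = 1                 # epsilon: number of votes in the cluster
    pose_labels: List[str] = field(default_factory=list)  # P: orientation labels


def tod_cluster(votes: List[Tuple[Point, str]], T: float,
                vote_threshold: int) -> List[Cluster]:
    """Cluster (point, orientation label) votes in input order."""
    clusters: List[Cluster] = []
    for v, pose in votes:
        # Step 3: find the nearest existing cluster center, if any.
        nearest = min(clusters, key=lambda c: math.dist(c.center, v), default=None)
        if nearest is None or math.dist(nearest.center, v) > T:
            # Steps 2 and 4: start a new cluster at this vote.
            clusters.append(Cluster(center=v, votes=1, pose_labels=[pose]))
        else:
            # Step 5: absorb the vote and move the center to the running mean.
            nearest.votes += 1
            n = nearest.votes
            nearest.center = tuple(((n - 1) * c + x) / n
                                   for c, x in zip(nearest.center, v))
            nearest.pose_labels.append(pose)
    # Step 6: discard clusters with fewer votes than the vote threshold.
    return [c for c in clusters if c.votes >= vote_threshold]
```

Because each vote is compared only with the current cluster centers and is never revisited, the result depends on the input order, which is the trade-off the order-dependent algorithm makes for speed.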

The voting point clustering unit 5 executes the clustering of Steps 1 to 6 described above for each template model. Then, taking the cluster centers remaining for each template model as voting points, it performs clustering again for each class model. The cluster centers obtained by this second clustering are the final clustering result. The second clustering serves to consolidate the recognition results of the individual template models, so no cluster centers are deleted based on the number of votes. The clustering threshold T used in the second clustering was set to a value proportional to the clustering threshold used in the first clustering.

As a result of the clustering, each generated cluster corresponds to one or more template models. The above processing therefore specifies the template models (and their stored images) that are similar to the input image. The similarity between the input image and the stored image of a template model is judged to be higher the more concentrated the voting points are in the input image and the larger the number of votes is.

For the center of each cluster generated by the clustering process, the candidate frame generation unit 6 generates a rectangular region centered on its coordinates as a candidate frame. The rectangular region is created based on the average scale size and the template size of the one or more template models that correspond to the cluster. More specifically, w, which is updated during clustering, contains the width and height of the rectangular region (that is, the template size) as parameters for generating the rectangular region. Because w is updated sequentially by Expression (5) during the first and second clustering processes, the desired rectangular region size is recorded for the final cluster center. In other words, the template size held by the cluster center is updated continually during clustering, and the value finally obtained, which is something like the average over the voting points belonging to the cluster, is used as the size of the rectangular region. In this way, candidate frames in which a recognition target is thought to be present are generated in the input image. As a result of candidate frame generation, each candidate frame corresponds to one cluster center, and each candidate frame corresponds to one or more template models.

For each generated candidate frame, the recall image generation unit 7 recalls its appearance as an image using the stored images of the template models that correspond to the candidate frame. The image recalled here (hereinafter sometimes referred to as the recall image) is a visual representation of how the system has recognized the target; it is an estimate of how the target appears. For example, for the candidate frame shown in the left part of FIG. 7, a recall image such as that shown in the right part of the same figure is generated.

The method of generating the recall image is simple. First, when a candidate frame is generated, information on how many votes were obtained from which of the template models corresponding to the cluster is stored for the cluster center of the candidate frame. Then, for each template model, its stored image is resized to the size of the candidate frame and its brightness is reduced, after which the stored images are superimposed. The brightness of each stored image is reduced according to a weight that reflects the clustering result. In the present embodiment, this is calculated by multiplying the luminance values of all pixels of each stored image by the weight that reflects the clustering result. Here, the weight is the proportion of the votes of the template model (stored image) in question to the total number of votes stored for the cluster center. As a result, one recall image is finally generated for each candidate frame.

For example, in the case shown in FIG. 8, the candidate frame shown in the upper right (Candidate window detection) is generated in the input image from the voting result shown in the upper left (Voting result). The stored images of the three template models shown in the lower left (Learning image of template model) are then each resized, weighted in brightness, and combined to generate the recall image shown in the lower right (Recalled image). In the illustrated example, the brightness weights are the proportions (0.3, 0.5, and 0.2) of the votes of each template model (3, 5, and 2 votes, respectively, in the figure) to the total number of votes stored for the cluster center (10 votes in the figure).
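The vote-weighted superimposition can be sketched as follows. This is a minimal illustration that assumes the stored images have already been resized to the candidate frame size and are held as nested lists of luminance values; the function name `recall_image` is hypothetical.

```python
from typing import List, Sequence

Image = List[List[float]]


def recall_image(resized_images: Sequence[Image], votes: Sequence[int]) -> Image:
    """Blend stored images, each weighted by its share of the cluster's votes."""
    total = sum(votes)
    weights = [v / total for v in votes]   # e.g. 3, 5, 2 votes -> 0.3, 0.5, 0.2
    height = len(resized_images[0])
    width = len(resized_images[0][0])
    # Each output pixel is the weighted sum of the corresponding input pixels.
    return [[sum(w * img[r][c] for w, img in zip(weights, resized_images))
             for c in range(width)]
            for r in range(height)]
```

Since the weights sum to 1, the blended luminance stays within the range of the input images, so no separate normalization step is needed in this sketch.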

The confidence value calculation unit 8 compares the generated recall image with the actual image (the image of the candidate frame in the input image) to calculate a confidence value for the candidate frame, and removes falsely detected candidate frames based on the calculated confidence value. That is, when the recall image and the actual image are compared and found to be similar, the candidate frame is regarded as correctly detected. When the two images are not similar, the candidate frame is regarded as falsely detected. After removing the candidate frames regarded as falsely detected, the confidence value calculation unit 8 determines that the recognition target (the class of the template models corresponding to the candidate frame) has been detected in the range of the input image indicated by each remaining candidate frame.

In the present embodiment, the dissimilarity between the recall image and the actual image is calculated as the average luminance difference over all pixels, and when this value is larger than a preset threshold, the candidate frame is removed as a false detection. False candidate frames may also be removed based on evaluation criteria other than the average luminance difference over all pixels.
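The dissimilarity check can be sketched as follows. This is an illustrative sketch that assumes the difference is taken as an absolute value per pixel and that both images have the same size; the function names and the `threshold` parameter are hypothetical.

```python
from typing import List

Image = List[List[float]]


def mean_luminance_difference(actual: Image, recalled: Image) -> float:
    """Average absolute luminance difference over all pixels."""
    height, width = len(actual), len(actual[0])
    total = sum(abs(actual[r][c] - recalled[r][c])
                for r in range(height) for c in range(width))
    return total / (width * height)


def is_false_detection(actual: Image, recalled: Image, threshold: float) -> bool:
    """A candidate frame is rejected when the dissimilarity exceeds the threshold."""
    return mean_luminance_difference(actual, recalled) > threshold
```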

In the present embodiment, for each candidate frame remaining after the false detection removal based on the above dissimilarity, a confidence value is further calculated that takes into account the number of votes in addition to the dissimilarity between the recall image and the actual image (the average luminance difference over all pixels), and when this value is smaller than a predetermined threshold, the candidate frame is removed as a false detection. The reason is as follows: in judging false detections it is preferable to compare the images by similarity without considering the number of votes, but once a frame has been judged to be correctly detected, for the same image similarity a larger number of votes is considered to indicate a higher similarity to the target.

The confidence value is calculated by the following Expression (6), where w and h are the width and height of the candidate frame, I_est is the luminance value of the actual image of the candidate frame, I_rec is the luminance value of the recall image, and ε is the total number of votes of the candidate frame. Using this value, the confidence of the candidate frames can be compared.

For example, in the case shown in FIG. 9, of the three candidate frames shown in the upper right, all but the one candidate frame shown in the lower left are removed as false detections as a result of comparison with the recall images.

The orientation estimation unit 9 estimates the orientation of the recognition target for each candidate frame. Let P be the set of orientation labels, A_class(p) the total number of template models in the class model of the recognition target that have orientation label p ∈ P, and A_est(p) the number of votes, among all votes for the candidate frame, whose orientation label is p. The orientation of the recognition target is then estimated by the following Equation (7). That is, the estimate is the orientation label that received the most votes relative to the number of templates learned with that label.
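The rule described in words above, choosing the label p ∈ P that maximizes A_est(p) / A_class(p), can be sketched in Python as follows. This is a minimal illustration under the definitions given in the text; the function name, argument names, and label strings are hypothetical, and ties are resolved arbitrarily by iteration order.

```python
from collections import Counter

def estimate_orientation(vote_labels, templates_per_label):
    """Return the label p maximizing A_est(p) / A_class(p).

    vote_labels: orientation label of every vote cast for one candidate frame.
    templates_per_label: A_class(p), the number of learned templates per label.
    """
    a_est = Counter(vote_labels)  # A_est(p)
    best_label, best_ratio = None, -1.0
    for p, a_class in templates_per_label.items():
        if a_class <= 0:
            continue  # a label with no learned templates cannot be normalized
        ratio = a_est.get(p, 0) / a_class
        if ratio > best_ratio:
            best_label, best_ratio = p, ratio
    return best_label

# "front" has more raw votes, but "left" has more votes per learned template
votes = ["front"] * 30 + ["left"] * 20
print(estimate_orientation(votes, {"front": 60, "left": 10}))
```

Normalizing by A_class(p) keeps orientations that happen to have many learned templates from dominating the estimate merely because they can cast more votes.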

Next, the effects of the present embodiment are described. To demonstrate its effectiveness, class detection and orientation estimation experiments were performed with the object recognition device 1 on real-environment images of real-world scenes.

The experiments used the PASCAL Visual Object Classes (VOC) Challenge ("PASCAL Challenge", http://www.pascal-network.org/challenges/VOC/), a data set often used in evaluation experiments for three-dimensional object recognition methods. In the present embodiment, performance was evaluated on the "car" class of the PASCAL VOC 2006 dataset (M. Everingham, A. Zisserman, C.K.I. Williams, and L. Van Gool, "The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results," Technical report, PASCAL Network, 2006.), for which results of recent studies are available.

In the "car" class data, 854 cars appearing in 553 of the 2,618 training images are given as training targets, and 854 cars appearing in 544 of the 2,686 test images are the detection targets. All images show cluttered real environments; the cars to be detected are of various types, and their orientations and sizes vary widely.

As in other three-dimensional object recognition methods, the "car" class data of the 3D objects dataset (S. Savarese and L. Fei-Fei, "3d generic object categorization, localization and pose estimation," IEEE Int. Conf. Comput. Vision, pp.1-8, 2007.) was also used for training, in addition to the training data described above. For each of 10 types of car, this dataset provides 48 images taken from 8 directions × 2 heights × 3 scales, together with the shooting position information and the target region information.

FIGS. 10 and 11 show examples of processing results for the above data sets. FIG. 10 shows examples of correct detections. The small image at the upper left or lower right of each image is the recall image for the corresponding detection result. The targets are recognized correctly even when they are partially occluded or when an image contains multiple recognition targets. FIG. 11 shows examples of false detections. These include cases in which the region size is inappropriate and cases of clear misrecognition.

Next, a car class detection experiment was performed using the above test data. The results are shown in FIG. 12, which compares the present embodiment with Sun&Su CVPR09 (Non-Patent Document 2). The evaluation used the precision-recall curve and its AP (Average Precision).
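For reference, a precision-recall curve and an average precision can be computed from scored detections roughly as in the following Python sketch. It sums precision over recall increments without interpolation, which is one common definition and is not necessarily the exact protocol of the PASCAL VOC 2006 evaluation; the function and argument names are hypothetical, and the matching of detections to ground-truth objects is assumed to have been done beforehand.

```python
def precision_recall(scores, is_true_positive, num_ground_truth):
    """Precision and recall after each detection, in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precision, recall = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision.append(tp / (tp + fp))
        recall.append(tp / num_ground_truth)
    return precision, recall

def average_precision(precision, recall):
    """Non-interpolated AP: precision summed over each increase in recall."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Four detections, two of them correct, against two ground-truth objects
prec, rec = precision_recall([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
print(average_precision(prec, rec))
```

In this setting the detection score would be the confidence value of each candidate frame, so that lowering the threshold traces the curve from high precision toward high recall.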

As FIG. 12 shows, the AP of the existing method (Sun&Su CVPR09) is 0.310, whereas the AP of the present embodiment (Our method) is 0.323, so a higher detection accuracy is achieved. As for the processing time required for recognition, the recognition time per image in the present embodiment, including feature extraction, was 12.6 seconds on average (3.2 GHz, Matlab). In contrast, according to a personal communication, the recognition time of the existing method is about 300 seconds per image (2.2 GHz, Matlab). The effectiveness of the present embodiment was therefore confirmed in terms of processing speed as well as detection accuracy.

For reference, FIG. 12 also plots the results of Liebelt CVPR08 (J. Liebelt, C. Schmid, and K. Schertler, "Viewpoint-independent object class detection using 3d feature maps," IEEE Int. Conf. Comput. Vision and Pattern Recognit., pp.1-8, 2008.), Su&Sun ICCV09 (H. Su, M. Sun, L. Fei-Fei, and S. Savarese, "Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories," IEEE Int. Conf. Comput. Vision, 2009.), and the four teams that participated in PASCAL VOC 2006. Note, however, that Liebelt CVPR08 used CG models prepared by its authors for training, and Su&Sun ICCV09 used video clips prepared by its authors in addition to the training data used in this experiment. The test data and the evaluation method are therefore the same, but the data used for training differ. In addition, the methods of the teams that participated in PASCAL VOC 2006 aim only at detecting the target and do not consider orientation estimation.

Next, orientation estimation was performed for those detection results on the above test data that had been recognized correctly. FIG. 13 shows the orientation estimation results of the present embodiment and of the comparison method (Sun&Su CVPR09). Note that the PASCAL VOC 2006 dataset provides orientation labels for only four directions. The present embodiment therefore estimated which of those four directions each target belongs to. When a target in a test image did not belong to any of the four directions, its estimation result was excluded from the evaluation.

As shown in FIG. 13, the present embodiment estimated every orientation more accurately than the comparison method (Sun&Su CVPR09). The average accuracy was 62% for the existing method and 86% for the present embodiment, which confirms the high accuracy of orientation estimation in the present embodiment.

As described above, the present embodiment realizes a relatively fast and accurate appearance-based three-dimensional object recognition method built on two stages of processing: candidate frame detection by voting, and false detection removal by appearance recall. To confirm the effects of the present embodiment, experiments were performed using the PASCAL VOC 2006 dataset. The results were superior to those of the existing method in orientation estimation accuracy and recognition time, and equal to or better than those of the existing method in recognition accuracy.
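The first of the two stages, candidate detection by voting, can be illustrated roughly by the following Python sketch. Each corresponding point votes for a reference position in the input image by adding the stored offset, and dense groups of votes are taken as candidate centers. Rescaling the offset by the ratio of feature scales and grouping votes on a fixed grid are assumptions made here for brevity and stand in for the clustering actually used in the embodiment; all names and parameters (`cast_votes`, `cluster_votes`, `cell`, `min_votes`) are hypothetical.

```python
from collections import defaultdict

def cast_votes(matches):
    """Each match votes for a reference position in the input image.

    matches: tuples (x, y, scale_in, dx, dy, scale_stored), where (x, y) and
    scale_in describe the input feature point, and (dx, dy) is the offset from
    the stored feature point to the reference position of its stored image.
    """
    votes = []
    for x, y, scale_in, dx, dy, scale_stored in matches:
        k = scale_in / scale_stored  # ASSUMED: offset rescaled by the scale ratio
        votes.append((x + k * dx, y + k * dy))
    return votes

def cluster_votes(votes, cell=20.0, min_votes=3):
    """Group votes on a grid and keep cells holding at least min_votes votes."""
    bins = defaultdict(list)
    for vx, vy in votes:
        bins[(int(vx // cell), int(vy // cell))].append((vx, vy))
    centers = []
    for pts in bins.values():
        if len(pts) >= min_votes:
            cx = sum(p[0] for p in pts) / len(pts)
            cy = sum(p[1] for p in pts) / len(pts)
            centers.append((cx, cy, len(pts)))
    return centers

# Three matches agree on a position near (101, 101); one outlier votes elsewhere
matches = [
    (90.0, 95.0, 2.0, 5.0, 2.5, 1.0),
    (110.0, 100.0, 1.0, -9.0, 1.0, 1.0),
    (100.0, 80.0, 1.0, 2.0, 22.0, 1.0),
    (300.0, 300.0, 1.0, 0.0, 0.0, 1.0),
]
print(cluster_votes(cast_votes(matches)))
```

A fixed grid can split one cluster across neighboring cells, so a practical implementation would more likely use a proper clustering method; the grid is used here only to keep the sketch short.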

In contrast to conventional model-based methods, the present embodiment adopts an appearance-based method that independently learns and recognizes each given appearance, which makes it easy to compensate for the drawbacks of the conventional methods.

In addition, by using the voting process, the present embodiment learns and recognizes the various appearances of a target independently and superimposes the results, which makes it possible to recognize the target at the class level. A feature of the present embodiment is that an appearance is recalled for the recognition result obtained by the voting process, and misrecognition is removed by comparing the recalled appearance with the recognition result. This further improves the recognition accuracy.
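The recall-and-compare step can be sketched in Python as follows. The claims state that the stored images are combined according to the number of votes; the vote-weighted average used here is one assumed way of doing so, and the stored images are assumed to have been resized to the candidate frame beforehand. The function names and the parameter `max_difference` are hypothetical.

```python
import numpy as np

def recall_image(stored_images, votes):
    """Blend same-sized stored images, weighted by the votes each one received."""
    imgs = np.stack([np.asarray(im, dtype=np.float64) for im in stored_images])
    weights = np.asarray(votes, dtype=np.float64)
    if weights.sum() <= 0:
        raise ValueError("at least one stored image must have received a vote")
    weights = weights / weights.sum()
    # Contract the template axis: (n,) with (n, h, w) gives (h, w)
    return np.tensordot(weights, imgs, axes=1)

def is_false_detection(frame, recalled, max_difference):
    """True when the mean absolute luminance difference exceeds the threshold."""
    diff = np.abs(np.asarray(frame, dtype=np.float64) - recalled).mean()
    return bool(diff > max_difference)

# Two 2x2 templates with luminance 0 and 100, receiving 1 and 3 votes
recalled = recall_image([np.zeros((2, 2)), np.full((2, 2), 100.0)], [1, 3])
print(recalled)
print(is_false_detection(np.full((2, 2), 70.0), recalled, max_difference=10.0))
```

Weighting by votes lets the templates that best explain the matched feature points dominate the recalled appearance, so a frame supported only by scattered, inconsistent matches tends to produce a recall image that differs from the actual image.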

Furthermore, because the present embodiment learns from each training image independently, the description of the recognition target is simple, and additional learning can be performed easily.

Note that the present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit.
For example, although the above embodiment has been described as a hardware configuration, the present invention is not limited to this, and any of the processing can also be realized by causing a CPU (Central Processing Unit) to execute a computer program. The program can be stored using various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memories (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (random access memory)). The program may also be supplied to a computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to a computer via a wired communication path, such as an electric wire or an optical fiber, or via a wireless communication path.

1 Object recognition device
2 Database
3 Corresponding point search unit
4 Voting processing unit
5 Voting point clustering unit
6 Candidate frame generation unit
7 Recall image generation unit
8 Confidence value calculation unit
9 Orientation estimation unit
11 Stored image specifying means
12 Determination means
21 Class model
211, 212, 213 Template models

Claims

An object recognition device comprising:
storage means for storing a plurality of stored images each including a recognition target, feature points extracted in advance from each of the plurality of stored images, and positional relationship information of the extracted feature points with respect to a reference position in the stored image;
stored image specifying means for extracting feature points from an input image, searching for corresponding feature points between the feature points of the stored images stored in the storage means and the feature points extracted from the input image, casting onto the input image, for each corresponding point found, a voting point calculated based on the positional relationship information of the feature point of the stored image, determining that the similarity between the input image and the stored image is higher as the voting points are more concentrated and the number of votes is larger, and specifying, as a result of the determination, the stored image similar to the input image; and
determination means for generating, using the stored image specified by the stored image specifying means, a recall image indicating how the input image appears, comparing the generated recall image with the input image to determine whether they are similar, and determining, when they are determined to be similar, that the recognition target has been detected in the input image.

The object recognition device according to claim 1, wherein the determination means generates the recall image by combining the plurality of stored images specified by the stored image specifying means in accordance with the similarity of each stored image to the input image.

The object recognition device according to claim 1, wherein, when comparing the recall image with the input image to determine whether they are similar, the determination means determines that the recall image and the input image are more similar as the total number of voting points in the input image used for generating the recall image is larger.

The object recognition device according to any one of claims 1 to 3, wherein
the storage means stores orientation information of the recognition target in association with the stored image, and
the determination means compares the recall image with the input image to determine whether they are similar, determines, when they are determined to be similar, that the recognition target has been detected in the input image, and estimates the orientation of the detected recognition target based on the orientation information of the stored images used for generating the recall image.

An object recognition device comprising:
a database that stores, for each class model, a plurality of template models each including a stored image including a recognition target, feature points extracted in advance from the stored image, positional relationship information of the extracted feature points with respect to a reference position in the stored image, an average scale size of the feature points, and a template size that is a value obtained by normalizing the size of the stored image by the average scale size;
a corresponding point search unit that extracts feature points from an input image and matches the feature points of the stored images stored in the database against the feature points extracted from the input image, thereby searching for, as corresponding points, feature points of the input image that correspond to feature points of the stored images;
a voting processing unit that, for each corresponding point of the input image found by the corresponding point search unit, calculates a voting point based on the positional relationship information of the feature point of the stored image stored in the database, and casts the voting point onto the input image;
a voting point clustering unit that clusters the voting points cast onto the input image by the voting processing unit and, when the number of voting points included in the same cluster is equal to or greater than a predetermined threshold, determines that the recognition target exists around the center of the cluster;
a candidate frame generation unit that, for each cluster center around which the voting point clustering unit has determined that the recognition target exists, generates a candidate frame, which is a rectangular region generated based on the average scale size and the template size of the template model corresponding to the cluster center and which indicates the range in the input image in which the recognition target exists;
a recall image generation unit that, for the candidate frame generated by the candidate frame generation unit, generates a recall image indicating how the input image appears, using the stored image of the template model corresponding to the candidate frame; and
a confidence value calculation unit that compares the recall image generated by the recall image generation unit with the image of the candidate frame in the input image to calculate a degree of difference, removes the candidate frame as a false detection when the calculated degree of difference is greater than a predetermined threshold, and determines, for a candidate frame that has not been removed, that the recognition target has been detected in the range of the input image indicated by the candidate frame.

The object recognition device according to claim 5, wherein the recall image generation unit generates the recall image for the candidate frame generated by the candidate frame generation unit by combining the stored images of the plurality of template models corresponding to the candidate frame in accordance with the number of votes in the input image obtained from each of the template models.

The object recognition device according to claim 5, wherein the confidence value calculation unit compares the recall image generated by the recall image generation unit with the image of the candidate frame in the input image to calculate a degree of difference, calculates the total number of votes in the input image obtained from all the template models corresponding to the candidate frame, removes the candidate frame as a false detection when a confidence value calculated based on the reciprocal of the calculated degree of difference and the total number of votes is smaller than a predetermined threshold, and determines, for a candidate frame that has not been removed, that the recognition target has been detected in the range of the input image indicated by the candidate frame.

The object recognition device according to claim 5, wherein
each template model further includes orientation information indicating the orientation of the recognition target, and
the object recognition device further comprises an orientation estimation unit that, when the confidence value calculation unit determines that the recognition target has been detected, estimates the orientation of the detected recognition target for its candidate frame based on the orientation information of the plurality of template models corresponding to the candidate frame.

An object recognition method comprising:
a storage step of storing in advance a plurality of stored images each including a recognition target, feature points extracted in advance from each of the plurality of stored images, and positional relationship information of the extracted feature points with respect to a reference position in the stored image;
a stored image specifying step of extracting feature points from an input image, searching for corresponding feature points between the feature points of the stored images stored in the storage step and the feature points extracted from the input image, casting onto the input image, for each corresponding point found, a voting point calculated based on the positional relationship information of the feature point of the stored image, determining that the similarity between the input image and the stored image is higher as the voting points are more concentrated and the number of votes is larger, and specifying, as a result of the determination, the stored image similar to the input image; and
a determination step of generating, using the stored image specified in the stored image specifying step, a recall image indicating how the input image appears, comparing the generated recall image with the input image to determine whether they are similar, and determining, when they are determined to be similar, that the recognition target has been detected in the input image.

An object recognition method comprising:
a storage step of storing in advance, for each class model, a plurality of template models each including a stored image including a recognition target, feature points extracted in advance from the stored image, positional relationship information of the extracted feature points with respect to a reference position in the stored image, an average scale size of the feature points, and a template size that is a value obtained by normalizing the size of the stored image by the average scale size;
a corresponding point search step of extracting feature points from an input image and matching the feature points of the stored images stored in the storage step against the feature points extracted from the input image, thereby searching for, as corresponding points, feature points of the input image that correspond to feature points of the stored images;
a voting processing step of calculating, for each corresponding point of the input image found in the corresponding point search step, a voting point based on the positional relationship information of the feature point of the stored image stored in the storage step, and casting the voting point onto the input image;
a voting point clustering step of clustering the voting points cast onto the input image in the voting processing step and, when the number of voting points included in the same cluster is equal to or greater than a predetermined threshold, determining that the recognition target exists around the center of the cluster;
a candidate frame generation step of generating, for each cluster center around which the recognition target has been determined to exist in the voting point clustering step, a candidate frame, which is a rectangular region generated based on the average scale size and the template size of the template model corresponding to the cluster center and which indicates the range in the input image in which the recognition target exists;
a recall image generation step of generating, for the candidate frame generated in the candidate frame generation step, a recall image indicating how the input image appears, using the stored image of the template model corresponding to the candidate frame; and
a confidence value calculation step of comparing the recall image generated in the recall image generation step with the image of the candidate frame in the input image to calculate a degree of difference, removing the candidate frame as a false detection when the calculated degree of difference is greater than a predetermined threshold, and determining, for a candidate frame that has not been removed, that the recognition target has been detected in the range of the input image indicated by the candidate frame.