JP2006065399A

JP2006065399A - Image processing device and method, storage medium and program

Info

Publication number: JP2006065399A
Application number: JP2004244018A
Authority: JP
Inventors: Hirotaka Suzuki; 洋貴鈴木; Kenichi Hidai; 健一日台; Masahiro Fujita; 雅博藤田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2004-08-24
Filing date: 2004-08-24
Publication date: 2006-03-09
Anticipated expiration: 2024-08-24
Also published as: JP4605445B2

Abstract

PROBLEM TO BE SOLVED: To reliably recognize an object regardless of enlargement or reduction of an image.

SOLUTION: A multi-resolution generation unit 21 generates a multi-resolution image from a model image, a feature amount extraction unit 23 extracts the feature amounts at the feature points of each resolution, and the feature amounts are registered in a model dictionary registration unit 24. A multi-resolution generation unit 31 generates a multi-resolution image from the input object image, and a feature amount comparison unit 35 compares the feature amounts at its feature points with the feature amounts registered in the model dictionary registration unit 24. This comparison is carried out using a kd tree constructed by a kd tree construction unit 34. A model posture estimation unit 36 estimates the posture of the model image included in the object image based on the comparison result from the feature amount comparison unit 35, and outputs a posture parameter of the object. This invention can be applied to a robot.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to an image processing apparatus and method, a recording medium, and a program, and more particularly to an image processing apparatus and method, a recording medium, and a program that are robust to changes in viewpoint and brightness and can reliably recognize an object regardless of enlargement or reduction of the image.

For example, many of the object recognition techniques that have been put to practical use for recognizing a target object with a robot rely on template matching based on the residual sequential test method (SSDA) or on cross-correlation coefficients. Template matching is effective in the special case where the object to be detected can be assumed to appear in the input image without deformation, but it is not effective for recognizing objects in general images, where the viewpoint and illumination are not constant.

On the other hand, shape matching techniques have also been proposed that match the shape features of the target object against the shape features of each region cut out of the input image by image segmentation. In a general object recognition environment such as the one described above, however, the segmentation results are unstable, and it is difficult to obtain a good shape description of the objects in the input image. Recognition becomes particularly difficult when the object to be detected is partially hidden by another object.

In contrast to these matching techniques, which use global features of the whole input image or of whole regions, methods have also been proposed that extract characteristic points and edges from the image, represent the spatial relationships of the line segment sets and edge sets they form as line drawings or graphs, and perform matching based on the structural similarity between the line drawings or graphs. These methods work well for certain specialized objects, but image deformation sometimes prevents a stable structure between feature points from being extracted, which makes it especially difficult to recognize partially hidden objects such as those mentioned above.

Matching techniques have therefore been proposed that extract characteristic points (feature points) from an image and use feature amounts obtained from the image information at each feature point and in its local neighborhood. Because such techniques use local feature amounts that are invariant to partial image deformation around the feature points, they detect objects more stably than the techniques described above, both when the image is deformed and when the object to be detected is partially hidden. Two methods have been proposed for extracting feature points that are invariant to scaling. The first builds a scale space of the image and, among the local maxima and local minima of the Difference of Gaussian (DoG) filter output of each scale image, extracts as scale feature points those whose positions do not change with changes in the scale direction (Non-Patent Document 1 or Non-Patent Document 2). The second builds a scale space of the image and, among the corner points extracted from each scale image by the Harris corner detector, extracts as feature points those that give a local maximum of the Laplacian of Gaussian (LoG) filter output of the scale-space image (Non-Patent Document 3).

Furthermore, it is preferable to select, at the feature points extracted in this way, feature amounts that are invariant to changes in the line of sight. For example, Schmid and Mohr have proposed a matching technique that takes corners detected with the Harris corner detector as feature points and uses rotation-invariant feature amounts in the vicinity of those feature points (Non-Patent Document 4).

Non-Patent Document 1: D. Lowe, “Object recognition from local scale-invariant features,” in Proc. International Conference on Computer Vision, Vol. 2, pp. 1150-1157, September 20-25, 1999, Corfu, Greece.
Non-Patent Document 2: D. Lowe, “Distinctive image features from scale-invariant keypoints,” accepted for publication in the International Journal of Computer Vision, 2004.
Non-Patent Document 3: K. Mikolajczyk, C. Schmid, “Indexing based on scale invariant interest points,” International Conference on Computer Vision, pp. 525-531, July 2001.
Non-Patent Document 4: Schmid, C., and R. Mohr, “Local grayvalue invariants for image retrieval,” IEEE PAMI, 19, 5, 1997, pp. 530-534.

However, the feature amounts of corners are not invariant to enlargement or reduction of the image, so accurate recognition is difficult when the image has been scaled.

The present invention has been made in view of this situation, and makes it possible to reduce the influence of viewpoint and brightness changes and to reliably recognize an object even when the image has been enlarged or reduced.

The image processing apparatus of claim 1 is an image processing apparatus that recognizes a pre-registered model image from an input image, and comprises: multi-resolution image generation means for generating a multi-resolution image, consisting of a plurality of images of different resolutions, by reducing the resolution of the input image at a predetermined rate; feature point extraction means for extracting feature points from the image of each resolution of the multi-resolution image; feature amount extraction means for extracting at least two local feature amounts at each feature point; comparison means for comparing the feature amounts of the input image with the feature amounts of the model image and generating candidate corresponding feature point sets, each being a set of feature points having similar feature amounts; and posture estimation means for estimating the posture of the input image based on the candidate corresponding feature point sets.
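The multi-resolution image generation described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function names `downscale` and `build_pyramid`, the reduction ratio of 0.5, the number of levels, and the use of nearest-neighbour sampling are all assumptions made for the example. The text only states that the resolution is reduced at a predetermined rate; whether each level is derived from the original image or from the previous level is not stated here, and the sketch reduces the previous level.

```python
def downscale(image, ratio):
    """Resample a 2-D grayscale image (list of rows) by `ratio`.

    Nearest-neighbour sampling is an assumption made for brevity.
    """
    src_h, src_w = len(image), len(image[0])
    dst_h = max(1, int(src_h * ratio))
    dst_w = max(1, int(src_w * ratio))
    return [
        [image[min(src_h - 1, int(y / ratio))][min(src_w - 1, int(x / ratio))]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


def build_pyramid(image, ratio=0.5, levels=3):
    """Return a list of images, each `ratio` times the size of the previous one."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downscale(pyramid[-1], ratio))
    return pyramid


# Hypothetical 16x16 test image.
image = [[(x + y) % 256 for x in range(16)] for y in range(16)]
pyramid = build_pyramid(image, ratio=0.5, levels=3)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

Feature points would then be extracted from every level of `pyramid`, so that an object appearing at a different size in the input image can still be matched at some level.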

The feature amount extraction means can extract a direction histogram of the density gradient in the vicinity of the feature point as a first type of feature amount, and a dimensionally degenerated density gradient vector as a second type of feature amount.
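A direction histogram of the density gradient, the first type of feature amount, might be computed along the following lines. This is a sketch under stated assumptions: the function name `orientation_histogram`, the central-difference gradient, the 8 bins, the weighting of each vote by gradient magnitude, and the normalization are choices made for the example and are not taken from the text.

```python
import math


def orientation_histogram(patch, bins=8):
    """Histogram of gradient directions over the interior of a 2-D patch.

    Each interior pixel votes for the bin of its gradient direction,
    weighted by gradient magnitude (an assumption of this sketch).
    """
    hist = [0.0] * bins
    height, width = len(patch), len(patch[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # central difference in x
            gy = patch[y + 1][x] - patch[y - 1][x]  # central difference in y
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = int(angle / (2 * math.pi) * bins) % bins
            hist[index] += magnitude
    total = sum(hist)
    return [value / total for value in hist] if total else hist


# Hypothetical 5x5 patch whose intensity increases only along x,
# so every gradient points in the +x direction (bin 0).
patch = [[x * 10 for x in range(5)] for _ in range(5)]
hist = orientation_histogram(patch)
```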

The comparison means can generate the candidate corresponding feature point sets by performing a k Nearest Neighbor (k-NN) search, based on the feature amounts of the input image, over the feature amount groups of the model image, which are organized as a kd tree for each type.
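The kd tree and k-NN search can be illustrated with a small hand-written tree. This is only a sketch: the names `build_kd_tree`, `knn_search`, and `nearest_labels`, the squared Euclidean distance, and the two-dimensional feature vectors are assumptions for the example, and a practical system would hold much higher-dimensional feature amounts.

```python
import heapq


def build_kd_tree(points, depth=0):
    """Build a kd tree from a list of (vector, label) entries."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda entry: entry[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_search(node, query, k, heap=None):
    """Collect the k nearest entries in a max-heap of (-distance, label)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    vector, label = node["point"]
    dist = squared_distance(vector, query)
    if len(heap) < k:
        heapq.heappush(heap, (-dist, label))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, label))
    diff = query[node["axis"]] - vector[node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    knn_search(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than
    # the current k-th best distance.
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn_search(far, query, k, heap)
    return heap


def nearest_labels(tree, query, k):
    """Labels of the k nearest model features, closest first."""
    return [label for _, label in sorted(knn_search(tree, query, k), reverse=True)]


# Hypothetical model feature amounts of one type.
model_features = [((0.0, 0.0), "m0"), ((1.0, 0.0), "m1"), ((0.0, 1.0), "m2"),
                  ((5.0, 5.0), "m3"), ((0.9, 0.1), "m4")]
tree = build_kd_tree(model_features)
matches = nearest_labels(tree, (1.0, 0.0), k=2)
```

One tree would be built per feature amount type, and each feature amount of the input image would be used as a query against the tree of its own type.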

The comparison means can take, as the candidate corresponding feature point sets, the feature points whose feature amounts were extracted in common for every type.
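Keeping only the correspondences found in common for every type amounts to a set intersection. The sketch below assumes that each type's search yields a set of (model feature point, object feature point) identifiers; the identifiers themselves are hypothetical.

```python
# Candidate correspondences found independently for each feature amount type
# (hypothetical identifiers).
type1_candidates = {("model_3", "object_7"), ("model_5", "object_2"), ("model_8", "object_9")}
type2_candidates = {("model_3", "object_7"), ("model_5", "object_4"), ("model_8", "object_9")}

# Only correspondences supported by both types are kept as candidates.
common_candidates = sorted(type1_candidates & type2_candidates)
```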

The apparatus can further comprise narrowing means for narrowing down the candidate corresponding feature point sets by voting them onto a parameter space defined by the image transformation parameters that determine the position and posture of the model image and comparing the maximum number of votes with a threshold, and the posture estimation means can estimate the posture of the input image based on the narrowed-down candidate corresponding feature point sets.
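The voting step can be sketched as a coarse histogram over transformation parameters. Everything specific here is an assumption for illustration: each feature point is taken to carry a position, scale, and orientation `(x, y, scale, orientation)`, each candidate set is mapped to a scale ratio, rotation, and translation, and the bin widths in `steps` and the threshold `min_votes` are arbitrary.

```python
import math
from collections import defaultdict


def pair_to_parameters(model_kp, object_kp):
    """Map one candidate correspondence to (scale, rotation, dx, dy)."""
    mx, my, m_scale, m_ori = model_kp
    ox, oy, o_scale, o_ori = object_kp
    scl = o_scale / m_scale
    theta = (o_ori - m_ori) % (2 * math.pi)
    dx = ox - scl * (math.cos(theta) * mx - math.sin(theta) * my)
    dy = oy - scl * (math.sin(theta) * mx + math.cos(theta) * my)
    return scl, theta, dx, dy


def quantize(params, steps):
    """Index of the parameter-space bin that contains `params`."""
    return tuple(int(math.floor(p / s)) for p, s in zip(params, steps))


def narrow_by_voting(pairs, steps, min_votes):
    """Keep the pairs in the most-voted bin if it reaches `min_votes`."""
    if not pairs:
        return []
    bins = defaultdict(list)
    for pair in pairs:
        bins[quantize(pair_to_parameters(*pair), steps)].append(pair)
    best = max(bins.values(), key=len)
    return best if len(best) >= min_votes else []


# Three consistent correspondences (translation by (10, 5)) and one outlier.
outlier = ((1.0, 1.0, 1.0, 0.0), (40.0, 40.0, 2.0, 1.0))
pairs = [
    ((0.0, 0.0, 1.0, 0.0), (10.0, 5.0, 1.0, 0.0)),
    ((2.0, 0.0, 1.0, 0.0), (12.0, 5.0, 1.0, 0.0)),
    ((0.0, 3.0, 1.0, 0.0), (10.0, 8.0, 1.0, 0.0)),
    outlier,
]
kept = narrow_by_voting(pairs, steps=(0.25, 0.5, 4.0, 4.0), min_votes=3)
```

If the most-voted bin falls below the threshold, the sketch returns an empty list, which corresponds to treating the model as not detected at this stage.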

The posture estimation means can project into a parameter space the image transformation parameters, which determine the position and posture of the model image, determined by N randomly selected candidate corresponding feature point sets; find, among the clusters formed in the parameter space, the cluster with the largest number of members; and output, as the recognition result for the model image, the image transformation parameters determining the position and posture of the model image that are obtained from those members by the least squares method.
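The final least-squares step can be illustrated on its own. The sketch below assumes that the transformation is a two-dimensional similarity transform (scale, rotation, translation) and that the members of the winning cluster are available as matched point lists; the function name `fit_similarity` is hypothetical, and the random sampling and clustering that precede this step are not shown. At least two distinct model points are required.

```python
import math


def fit_similarity(model_points, object_points):
    """Least-squares similarity transform mapping model points to object points.

    Solves q = s * R(theta) * p + t in closed form and returns
    (scale, theta, dx, dy).
    """
    n = len(model_points)
    mcx = sum(p[0] for p in model_points) / n
    mcy = sum(p[1] for p in model_points) / n
    ocx = sum(q[0] for q in object_points) / n
    ocy = sum(q[1] for q in object_points) / n
    num_a = num_b = denom = 0.0
    for (mx, my), (ox, oy) in zip(model_points, object_points):
        px, py = mx - mcx, my - mcy  # centered model point
        qx, qy = ox - ocx, oy - ocy  # centered object point
        num_a += px * qx + py * qy
        num_b += px * qy - py * qx
        denom += px * px + py * py
    a, b = num_a / denom, num_b / denom  # a = s*cos(theta), b = s*sin(theta)
    scale = math.hypot(a, b)
    theta = math.atan2(b, a)
    dx = ocx - (a * mcx - b * mcy)
    dy = ocy - (b * mcx + a * mcy)
    return scale, theta, dx, dy


# Hypothetical data: unit square scaled by 2, rotated by 90 degrees,
# and translated by (3, 4).
model_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
object_points = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0), (1.0, 6.0)]
scale, theta, dx, dy = fit_similarity(model_points, object_points)
```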

The posture estimation means can detect the centroid of the cluster with the largest number of members and output, as the recognition result for the model image, the image transformation parameters given by the centroid, which determine the position and posture of the model image.
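Finding the largest cluster in parameter space and taking its centroid might look like the following. This is a sketch, not the clustering used in the embodiment: the nearest-centroid assignment with a fixed `radius`, the plain Euclidean distance over parameters of mixed units, and the sample values are all assumptions.

```python
import math


def cluster_parameters(samples, radius):
    """Greedy clustering: join the nearest cluster within `radius`, else start a new one."""
    clusters = []
    for sample in samples:
        nearest = None
        nearest_dist = radius
        for cluster in clusters:
            dist = math.dist(sample, cluster["centroid"])
            if dist <= nearest_dist:
                nearest, nearest_dist = cluster, dist
        if nearest is None:
            clusters.append({"centroid": list(sample), "members": [sample]})
            continue
        nearest["members"].append(sample)
        count = len(nearest["members"])
        # Recompute the centroid as the mean of all members.
        nearest["centroid"] = [
            sum(member[i] for member in nearest["members"]) / count
            for i in range(len(sample))
        ]
    return clusters


# Hypothetical parameter vectors (scale, rotation, dx, dy), each derived
# from one random selection of candidate correspondences.
samples = [
    (1.0, 0.0, 10.0, 5.0),
    (1.0, 0.0, 10.2, 5.0),
    (1.0, 0.0, 9.8, 5.0),
    (2.0, 1.0, 40.0, 40.0),
]
clusters = cluster_parameters(samples, radius=1.0)
best = max(clusters, key=lambda cluster: len(cluster["members"]))
estimated_parameters = best["centroid"]
```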

The multi-resolution image generation means can generate the multi-resolution image with coarser precision than during learning.

The image processing method of claim 9 is an image processing method for an image processing apparatus that recognizes a pre-registered model image from an input image, and includes: a multi-resolution image generation step of generating a multi-resolution image, consisting of a plurality of images of different resolutions, by reducing the resolution of the input image at a predetermined rate; a feature point extraction step of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step of extracting at least two local feature amounts at each feature point; a comparison step of comparing the feature amounts of the input image with the feature amounts of the model image and generating candidate corresponding feature point sets, each being a set of feature points having similar feature amounts; and a posture estimation step of estimating the posture of the input image based on the candidate corresponding feature point sets.

The program on the recording medium of claim 10 is a program for an image processing apparatus that recognizes a pre-registered model image from an input image, and includes: a multi-resolution image generation step of generating a multi-resolution image, consisting of a plurality of images of different resolutions, by reducing the resolution of the input image at a predetermined rate; a feature point extraction step of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step of extracting at least two local feature amounts at each feature point; a comparison step of comparing the feature amounts of the input image with the feature amounts of the model image and generating candidate corresponding feature point sets, each being a set of feature points having similar feature amounts; and a posture estimation step of estimating the posture of the input image based on the candidate corresponding feature point sets.

The program of claim 11 is a program for an image processing apparatus that recognizes a pre-registered model image from an input image, and causes a computer to execute: a multi-resolution image generation step of generating a multi-resolution image, consisting of a plurality of images of different resolutions, by reducing the resolution of the input image at a predetermined rate; a feature point extraction step of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step of extracting at least two local feature amounts at each feature point; a comparison step of comparing the feature amounts of the input image with the feature amounts of the model image and generating candidate corresponding feature point sets, each being a set of feature points having similar feature amounts; and a posture estimation step of estimating the posture of the input image based on the candidate corresponding feature point sets.

The image processing apparatus of claim 12 is an image processing apparatus that learns a model image to be compared with an image to be recognized, and comprises: multi-resolution image generation means for generating a multi-resolution image, consisting of a plurality of images of different resolutions, with finer precision than during recognition by reducing the resolution of the model image at a predetermined rate; feature point extraction means for extracting feature points from the image of each resolution of the multi-resolution image; feature amount extraction means for extracting at least two local feature amounts at each feature point; and registration means for registering the feature amounts of the model image.

The registration means can register the feature amount groups of the model image as a kd tree for each type.

The image processing method of claim 15 is an image processing method for an image processing apparatus that learns a model image to be compared with an image to be recognized, and includes: a multi-resolution image generation step of generating a multi-resolution image, consisting of a plurality of images of different resolutions, with finer precision than during recognition by reducing the resolution of the model image at a predetermined rate; a feature point extraction step of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step of extracting at least two local feature amounts at each feature point; and a registration step of registering the feature amounts of the model image.

The program on the recording medium of claim 16 is a program for an image processing apparatus that learns a model image to be compared with an image to be recognized, and includes: a multi-resolution image generation step of generating a multi-resolution image, consisting of a plurality of images of different resolutions, with finer precision than during recognition by reducing the resolution of the model image at a predetermined rate; a feature point extraction step of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step of extracting at least two local feature amounts at each feature point; and a registration step of registering the feature amounts of the model image.

The program of claim 17 is a program for an image processing apparatus that learns a model image to be compared with an image to be recognized, and causes a computer to execute: a multi-resolution image generation step of generating a multi-resolution image, consisting of a plurality of images of different resolutions, with finer precision than during recognition by reducing the resolution of the model image at a predetermined rate; a feature point extraction step of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step of extracting at least two local feature amounts at each feature point; and a registration step of registering the feature amounts of the model image.

In the present invention, a multi-resolution image is generated from the input image, and feature points are extracted from the image of each resolution of the multi-resolution image. At least two local feature amounts are extracted at each feature point and compared with the feature amounts of the model image, candidate corresponding feature point sets are generated as sets of feature points having similar feature amounts, and the posture of the input image is estimated based on them.

Also in the present invention, a multi-resolution image is generated with finer precision than during recognition. Feature points are extracted from the image of each resolution of the multi-resolution image, and at least two local feature amounts are extracted at each feature point and registered.

According to the present invention, an object can be recognized. In particular, the object can be reliably recognized even when the image has been enlarged or reduced, regardless of image changes caused by changes in viewpoint or brightness, and even when the object is partially hidden.

Also according to the present invention, feature amounts that make it possible to recognize an object can be registered. In particular, even when the image of the object to be recognized is enlarged or reduced relative to the registered image, feature points and feature amounts can be extracted and registered that allow recognition that is robust to image changes caused by posture changes, regardless of image changes caused by changes in viewpoint or brightness.

The best mode for carrying out the present invention is described below. The correspondence between the disclosed invention and the embodiments is illustrated as follows. Even if an embodiment is described in the specification but is not described here as corresponding to an invention, that does not mean the embodiment does not correspond to that invention. Conversely, even if an embodiment is described here as corresponding to an invention, that does not mean the embodiment does not correspond to inventions other than that invention.

Furthermore, this description does not cover all of the inventions described in the specification. In other words, it does not deny the existence of inventions that are described in the specification but not claimed in this application, that is, inventions that may be filed as divisional applications or added by amendment in the future.

The image processing apparatus of claim 1 is an image processing apparatus that recognizes a pre-registered model image from an input image (for example, the image processing apparatus 1 having the recognition unit 12 in FIG. 1), and comprises: multi-resolution image generation means (for example, the multi-resolution generation unit 31 in FIG. 1 that executes the process of step S132 in FIG. 13) for generating a multi-resolution image (for example, the multi-resolution image of FIG. 4), consisting of a plurality of images of different resolutions, by reducing the resolution of the input image at a predetermined rate; feature point extraction means (for example, the feature point extraction unit 32 in FIG. 1 that executes the process of step S136 in FIG. 13) for extracting feature points from the image of each resolution of the multi-resolution image; feature amount extraction means (for example, the feature amount extraction unit 33 in FIG. 1 that executes the processes of steps S138 to S142 in FIG. 13 and steps S143 to S145 in FIG. 14) for extracting at least two local feature amounts at each feature point (for example, the direction histogram of the density gradient in the vicinity of the feature point in FIG. 8 (type 1 feature amount) and the type 2 feature amount resampled by linear interpolation shown at the bottom of FIG. 11); comparison means (for example, the feature amount comparison unit 35 in FIG. 1 that executes the processes of steps S150 and S151 in FIG. 14) for comparing the feature amounts of the input image with the feature amounts of the model image and generating candidate corresponding feature point sets, each being a set of feature points having similar feature amounts; and posture estimation means (for example, the model posture estimation unit 36 in FIG. 1 that executes the process of step S158 in FIG. 15) for estimating the posture of the input image based on the candidate corresponding feature point sets.

The feature amount extraction means extracts a direction histogram of the density gradient in the vicinity of the feature point (for example, the direction histogram of the density gradient in the vicinity of the feature point in FIG. 8) as a first type of feature amount, and a dimensionally degenerated density gradient vector (for example, the type 2 feature amount resampled by linear interpolation shown at the bottom of FIG. 11) as a second type of feature amount.

The comparison means generates the candidate corresponding feature point sets by performing a k Nearest Neighbor (k-NN) search, based on the feature amounts of the input image, over the feature amount groups of the model image, which are organized as a kd tree for each type (for example, the process of FIG. 17).

The comparison means takes, as the candidate corresponding feature point sets, the feature points whose feature amounts were extracted in common for every type (for example, the pair of the square and the circle and the pair of the square and the cross in FIG. 17, where the square, pentagon, triangle, circle, and cross figures in the drawing represent feature points).

The image processing apparatus further comprises narrowing means (for example, the corresponding feature point pair narrowing unit 61 in FIG. 21 that executes the processes of steps S301 to S310 in FIG. 22) for narrowing down the candidate corresponding feature point sets by voting them onto a parameter space defined by the image transformation parameters that determine the position and posture of the model image (for example, the parameter space defined by the image transformation parameters (scl, θ, dX, dY)) and comparing the maximum number of votes with a threshold, and the posture estimation means estimates the posture of the input image based on the narrowed-down candidate corresponding feature point sets.

The posture estimation means projects into a parameter space the image transformation parameters (for example, those of a Euclidean transformation, similarity transformation, affine transformation, or projective transformation), which determine the position and posture of the model image, determined by N randomly selected candidate corresponding feature point sets; finds, among the clusters formed in the parameter space, the cluster with the largest number of members; and outputs, as the recognition result for the model image, the image transformation parameters determining the position and posture of the model image that are obtained from those members (that is, the group of corresponding feature point sets) by the least squares method (for example, the processes of steps S201 to S206 in FIG. 19).

The multi-resolution image generation means generates the multi-resolution image with coarser precision than during learning (for example, the process of step S132 in FIG. 13).

The image processing method of claim 9, the program on the recording medium of claim 10, and the program of claim 11 are an image processing method or program for an image processing apparatus that recognizes a pre-registered model image from an input image (for example, the image processing apparatus 1 having the recognition unit 12 in FIG. 1), and include: a multi-resolution image generation step (for example, step S132 in FIG. 13) of generating a multi-resolution image (for example, the multi-resolution image of FIG. 4), consisting of a plurality of images of different resolutions, by reducing the resolution of the input image at a predetermined rate; a feature point extraction step (for example, step S136 in FIG. 13) of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step (for example, steps S138 to S142 in FIG. 13 and steps S143 to S145 in FIG. 14) of extracting at least two local feature amounts at each feature point (for example, the direction histogram of the density gradient in the vicinity of the feature point in FIG. 8 and the type 2 feature amount resampled by linear interpolation shown at the bottom of FIG. 11); a comparison step (for example, steps S150 and S151 in FIG. 14) of comparing the feature amounts of the input image with the feature amounts of the model image and generating candidate corresponding feature point sets, each being a set of feature points having similar feature amounts; and a posture estimation step (for example, step S158 in FIG. 15) of estimating the posture of the input image based on the candidate corresponding feature point sets.

請求項１２の画像処理装置は、認識の対象とされる画像と比較するためのモデル画像を学習する画像処理装置（例えば、図１の学習部11を有する画像処理装置1）において、前記モデル画像の解像度を、予め定められている割合で低下させることで、複数の異なる解像度の画像からなる多重解像度画像（例えば、図４の多重解像度画像）を、認識時における場合より細かい精度で生成する多重解像度画像生成手段（例えば、図２のステップS12の処理を実行する図１の多重解像度生成部21）と、前記多重解像度画像のそれぞれの解像度の画像について特徴点を抽出する特徴点抽出手段（例えば、図２のステップS16の処理を実行する図１の特徴点抽出部22）と、前記特徴点における少なくとも２つの局所的な特徴量（例えば、図８の特徴点近傍の濃度勾配の方向ヒストグラム、図１１の最下段の線形補間リサンプリングされたタイプ2の特徴量）を抽出する特徴量抽出手段（例えば、図２のステップS19、図３のステップS25の処理を実行する図１の特徴量抽出部23）と、前記モデル画像の前記特徴量を登録する登録手段（例えば、図３のステップS29の処理を実行する図１のモデル辞書登録部24）とを備えることを特徴とする。 The image processing apparatus according to claim 12 is an image processing apparatus that learns a model image to be compared with an image to be recognized (for example, the image processing apparatus 1 having the learning unit 11 in FIG. 1), and includes: multi-resolution image generation means (for example, the multi-resolution generation unit 21 in FIG. 1 that executes step S12 in FIG. 2) for generating, with finer accuracy than at recognition time, a multi-resolution image made up of a plurality of images of different resolutions (for example, the multi-resolution image in FIG. 4) by reducing the resolution of the model image at a predetermined ratio; feature point extraction means (for example, the feature point extraction unit 22 in FIG. 1 that executes step S16 in FIG. 2) for extracting feature points from the image of each resolution of the multi-resolution image; feature amount extraction means (for example, the feature amount extraction unit 23 in FIG. 1 that executes step S19 in FIG. 2 and step S25 in FIG. 3) for extracting at least two local feature amounts at each feature point (for example, the direction histogram of the density gradient in the vicinity of the feature point in FIG. 8, and the linearly interpolated, resampled type 2 feature amount in the bottom row of FIG. 11); and registration means (for example, the model dictionary registration unit 24 in FIG. 1 that executes step S29 in FIG. 3) for registering the feature amounts of the model image.

前記登録手段は、前記モデル画像の特徴量を、前記タイプ毎にkdツリーとして登録する（例えば、図１のkdツリー構築部34の処理）。 The registration unit registers the feature amount of the model image as a kd tree for each type (for example, processing of the kd tree construction unit 34 in FIG. 1).

請求項１５の画像処理方法、請求項１６の記録媒体のプログラム、請求項１７のプログラムは、認識の対象とされる画像と比較するためのモデル画像を学習する画像処理装置（例えば、図１の学習部11を有する画像処理装置1）の画像処理方法またはプログラムにおいて、前記モデル画像の解像度を、予め定められている割合で低下させることで、複数の異なる解像度の画像からなる多重解像度画像（例えば、図４の多重解像度画像）を、認識時における場合より細かい精度で生成する多重解像度画像生成ステップ（例えば、図２のステップS12）と、前記多重解像度画像のそれぞれの解像度の画像について特徴点を抽出する特徴点抽出ステップ（例えば、図２のステップS16）と、前記特徴点における少なくとも２つの局所的な特徴量（例えば、図８の特徴点近傍の濃度勾配の方向ヒストグラム、図１１の最下段の線形補間リサンプリングされたタイプ2の特徴量）を抽出する特徴量抽出ステップ（例えば、図２のステップS19、図３のステップS25）と、前記モデル画像の前記特徴量を登録する登録ステップ（例えば、図３のステップS29）とを含むことを特徴とする。 The image processing method according to claim 15, the program on the recording medium according to claim 16, and the program according to claim 17 are an image processing method or program for an image processing apparatus that learns a model image to be compared with an image to be recognized (for example, the image processing apparatus 1 having the learning unit 11 in FIG. 1), and include: a multi-resolution image generation step (for example, step S12 in FIG. 2) of generating, with finer accuracy than at recognition time, a multi-resolution image made up of a plurality of images of different resolutions (for example, the multi-resolution image in FIG. 4) by reducing the resolution of the model image at a predetermined ratio; a feature point extraction step (for example, step S16 in FIG. 2) of extracting feature points from the image of each resolution of the multi-resolution image; a feature amount extraction step (for example, step S19 in FIG. 2 and step S25 in FIG. 3) of extracting at least two local feature amounts at each feature point (for example, the direction histogram of the density gradient in the vicinity of the feature point in FIG. 8, and the linearly interpolated, resampled type 2 feature amount in the bottom row of FIG. 11); and a registration step (for example, step S29 in FIG. 3) of registering the feature amounts of the model image.

以下、本発明を適用した具体的な実施の形態について、図面を参照しながら詳細に説明する。この実施の形態は、本発明を、複数のオブジェクトを含む入力画像であるオブジェクト画像と、検出対象となるモデルを含むモデル画像（予め登録されている）とを比較し、オブジェクト画像からモデルを抽出する画像処理装置に適用したものである。 Hereinafter, specific embodiments to which the present invention is applied will be described in detail with reference to the drawings. In this embodiment, the present invention compares an object image, which is an input image including a plurality of objects, with a model image (registered in advance) including a model to be detected, and extracts a model from the object image. This is applied to an image processing apparatus.

本実施の形態における画像処理装置の概略構成を図１に示す。この画像処理装置１はモデルの学習処理を行う学習部１１と、入力画像中の物体を認識する認識部１２の２つの部分から構成される。 A schematic configuration of an image processing apparatus according to the present embodiment is shown in FIG. The image processing apparatus 1 includes two parts, a learning unit 11 that performs model learning processing and a recognition unit 12 that recognizes an object in an input image.

学習部１１は、多重解像度生成部２１、特徴点抽出部２２、特徴量抽出部２３、およびモデル辞書登録部２４により構成されている。 The learning unit 11 includes a multi-resolution generation unit 21, a feature point extraction unit 22, a feature amount extraction unit 23, and a model dictionary registration unit 24.

多重解像度生成部２１は、入力されたモデル画像から多重解像度画像を生成する。特徴点抽出部２２は、多重解像度生成部２１により生成された多重解像度の各画像から特徴点を抽出する。特徴量抽出部２３は、特徴点抽出部２２により抽出された各特徴点の特徴量を抽出する。モデル辞書登録部２４は、特徴量抽出部２３により抽出されたモデル画像の特徴量群を登録する。 The multi-resolution generation unit 21 generates a multi-resolution image from the input model image. The feature point extraction unit 22 extracts feature points from each multi-resolution image generated by the multi-resolution generation unit 21. The feature amount extraction unit 23 extracts the feature amount of each feature point extracted by the feature point extraction unit 22. The model dictionary registration unit 24 registers the feature amount group of the model image extracted by the feature amount extraction unit 23.

認識部１２は、多重解像度生成部３１、特徴点抽出部３２、特徴量抽出部３３、kdツリー構築部３４、特徴量比較部３５、およびモデル姿勢推定部３６により構成される。 The recognition unit 12 includes a multi-resolution generation unit 31, a feature point extraction unit 32, a feature amount extraction unit 33, a kd tree construction unit 34, a feature amount comparison unit 35, and a model posture estimation unit 36.

多重解像度生成部３１は、入力されたオブジェクト画像から、各多重解像度の画像を生成する。特徴点抽出部３２は、多重解像度生成部３１により生成された多重解像度の画像から特徴点を抽出する。特徴量抽出部３３は、特徴点抽出部３２により抽出された各特徴点の特徴量を抽出する。これらの多重解像度生成部３１、特徴点抽出部３２、および特徴量抽出部３３により行われる処理は、学習部１１における多重解像度生成部２１、特徴点抽出部２２、および特徴量抽出部２３において行われる処理と同様の処理である。 The multi-resolution generation unit 31 generates multi-resolution images from the input object image. The feature point extraction unit 32 extracts feature points from the multi-resolution image generated by the multi-resolution generation unit 31. The feature quantity extraction unit 33 extracts the feature quantity of each feature point extracted by the feature point extraction unit 32. The processing performed by the multi-resolution generation unit 31, the feature point extraction unit 32, and the feature amount extraction unit 33 is performed by the multi-resolution generation unit 21, the feature point extraction unit 22, and the feature amount extraction unit 23 in the learning unit 11. This is the same processing as that described.

kdツリー構築部３４は、モデル辞書登録部２４に登録されている特徴量からkdツリーを構築する。特徴量比較部３５は、特徴量抽出部３３により抽出された特徴量と、kdツリー構築部３４により構築されたkdツリーとして表現された認識対象となる全モデル画像（またはモデル毎処理を行う場合には各モデル画像）の特徴量群を比較する。モデル姿勢推定部３６は、特徴量比較部３５による比較結果に基づいて、オブジェクト画像に含まれるモデルの有無とその姿勢（モデル姿勢）を推定し、そのモデルの姿勢を表すパラメータ（物体姿勢パラメータ）を出力する。 The kd tree construction unit 34 constructs a kd tree from the feature amounts registered in the model dictionary registration unit 24. The feature amount comparison unit 35 compares the feature amounts extracted by the feature amount extraction unit 33 with the feature amount group of all model images to be recognized (or of each model image, when processing is performed model by model), which is represented as the kd tree constructed by the kd tree construction unit 34. Based on the comparison result from the feature amount comparison unit 35, the model posture estimation unit 36 estimates whether a model is present in the object image and, if so, its posture (model posture), and outputs parameters representing that posture (object posture parameters).

なお、学習部１１と認識部１２は、常に両方が同時に存在する必要はない。学習部１１により予め学習された結果、必要な情報が登録されたモデル辞書登録部２４を認識部１２に搭載するか、或いは無線で利用できるようにするようにしてもよい。 Note that both the learning unit 11 and the recognition unit 12 do not always have to exist simultaneously. As a result of learning in advance by the learning unit 11, the model dictionary registration unit 24 in which necessary information is registered may be mounted on the recognition unit 12 or may be used wirelessly.

次に、図２と図３のフローチャートを参照して、学習部１１における学習処理について説明する。 Next, the learning process in the learning unit 11 will be described with reference to the flowcharts of FIGS. 2 and 3.

多重解像度生成部２１は、後述するステップＳ２８において、全モデルを処理したと判定するまで、ステップＳ１１乃至Ｓ２７の処理を繰り返す。そこで、ステップＳ１１において、多重解像度生成部２１は、１つの未処理モデルを選択する。ステップＳ１２において、多重解像度生成部２１は、多重解像度群を生成する。具体的には、多重解像度生成部２１は、入力された学習対象のモデル画像を所定の倍率に従って縮小し、多重解像度画像群を生成する。例えば、最低解像度の画像である原画像からの縮小率をα、出力する多重解像度画像の数をＮ（原画像を含む）とするとき、ｋ番目（原画像をｋ＝０とする）の多重解像度の解像度画像Ｉ^[k]は、原画像Ｉ^[0]を縮小率α×（Ｎ−ｋ）で、線形補間縮小することで生成される。 The multi-resolution generation unit 21 repeats the processes in steps S11 to S27 until it is determined in step S28 described later that all models have been processed. Therefore, in step S11, the multi-resolution generation unit 21 selects one unprocessed model. In step S12, the multiresolution generation unit 21 generates a multiresolution group. Specifically, the multiresolution generation unit 21 reduces the input model image to be learned according to a predetermined magnification, and generates a multiresolution image group. For example, when the reduction ratio from the original image that is the lowest resolution image is α and the number of output multi-resolution images is N (including the original image), the kth (original image is k = 0) multiplexing The resolution image I ^[k] of the resolution is generated by linearly reducing the original image I ^[0] at a reduction rate α × (N−k).

あるいは他の方法としては、解像度の一段階低い画像を生成するための縮小率をγ（固定値）とする、つまりＩ^[0]を縮小率γ^kで、線形補間縮小することでＩ^[k]を生成する方法も考えられる。 Alternatively, the reduction ratio used to generate the image one resolution step lower may be a fixed value γ; that is, I^[k] may be generated by reducing I^[0] at the reduction ratio γ^k with linear interpolation.

図４は、パラメータＮ＝１０，α＝0.1とした場合に生成される多重解像度画像群を示す。図４の例においては、原画像Ｉ^[0]を縮小率0.9で縮小した画像Ｉ^[1]、縮小率0.8で縮小した画像Ｉ^[2]、・・・、縮小率0.1で縮小した画像Ｉ^[9]の合計１０段階の多重解像度画像が生成されている。縮小率を規定する係数ｋの値が大きくなるほど画像がより小さい大きさに縮小される結果、各フレームの画枠自体も、係数ｋの値が大きい程小さくなる。 FIG. 4 shows the multi-resolution image group generated with parameters N = 10 and α = 0.1. In the example of FIG. 4, a total of ten multi-resolution levels are generated: the image I^[1] obtained by reducing the original image I^[0] at a ratio of 0.9, the image I^[2] reduced at a ratio of 0.8, ..., and the image I^[9] reduced at a ratio of 0.1. The larger the coefficient k that defines the reduction ratio, the smaller the image becomes, so the frame of each image also shrinks as k increases.
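The two reduction schedules described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the 640 × 480 frame size are hypothetical, and only the ratios α × (N − k) and γ^k come from the text.

```python
def linear_ratios(n_levels, alpha):
    """Reduction ratio alpha * (N - k) for k = 0 .. N-1."""
    return [alpha * (n_levels - k) for k in range(n_levels)]


def geometric_ratios(n_levels, gamma):
    """Alternative schedule: ratio gamma ** k for k = 0 .. N-1."""
    return [gamma ** k for k in range(n_levels)]


def scaled_size(width, height, ratio):
    """Frame size after reduction, rounded to whole pixels (at least 1 x 1)."""
    return max(1, round(width * ratio)), max(1, round(height * ratio))


if __name__ == "__main__":
    # N = 10 and alpha = 0.1 reproduce the ratios of the FIG. 4 example.
    for k, ratio in enumerate(linear_ratios(10, 0.1)):
        print(k, round(ratio, 2), scaled_size(640, 480, ratio))
```

With N = 10 and α = 0.1 the linear schedule gives ratios 1.0, 0.9, ..., 0.1, so level k = 0 is the unreduced original image.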

次に、特徴点抽出部２２は、後述するステップＳ２７において、全解像度画像を処理したと判定するまで、ステップＳ１３乃至Ｓ２６の処理を繰り返し、多重解像度生成部２１により生成された各解像度画像Ｉ^[k]（ｋ＝０，・・・，Ｎ−１）から、画像の拡大縮小変換（スケール変換）があってもロバストに抽出されるような特徴点（スケール不変特徴点）を抽出するのであるが、スケール不変特徴点の抽出法としては、画像のスケールスペースを構築し、各スケール画像のDifference of Gaussian（DoG）フィルタ出力の局所極大点（局所的な所定の範囲の最大点）及び局所極小点（局所的な所定の範囲の最小点）のうち、スケール方向の変化によっても位置が変化しない点をスケール特徴点として抽出する方法（D. Lowe, “Object recognition from local scale-invariant features,” in Proc. International Conference on Computer Vision, Vol. 2, pp. 1150-1157, September 20-25, 1999, Corfu, Greece.）や、画像のスケールスペースを構築し、各スケール画像からHarrisコーナー検出器により抽出されたコーナー点のうち、スケールスペース画像のLaplacian of Gaussian（LoG）フィルタ出力の局所極大を与える点を特徴点として抽出する方法（K. Mikolajczyk, C. Schmid, “Indexing based on scale invariant interest points,” International Conference on Computer Vision, 525-531, July 2001.）などがある。スケール不変特徴点が抽出できる手法であれば、どのような抽出法でも特徴点抽出部２２に適用が可能である。 Next, the feature point extraction unit 22 repeats steps S13 to S26 until it determines in step S27, described later, that all resolution images have been processed, and extracts from each resolution image I^[k] (k = 0, ..., N−1) generated by the multi-resolution generation unit 21 feature points that can be extracted robustly even when the image is enlarged or reduced (scale conversion), that is, scale-invariant feature points. One method for extracting scale-invariant feature points constructs a scale space of the image and, among the local maximum points (maximum points within a predetermined local range) and local minimum points (minimum points within a predetermined local range) of the Difference of Gaussian (DoG) filter output of each scale image, extracts as scale feature points those whose position does not change with changes in the scale direction (D. Lowe, “Object recognition from local scale-invariant features,” in Proc. International Conference on Computer Vision, Vol. 2, pp. 1150-1157, September 20-25, 1999, Corfu, Greece.). Another constructs a scale space of the image and, among the corner points extracted from each scale image by the Harris corner detector, extracts as feature points those that give a local maximum of the Laplacian of Gaussian (LoG) filter output of the scale space image (K. Mikolajczyk, C. Schmid, “Indexing based on scale invariant interest points,” International Conference on Computer Vision, 525-531, July 2001.). Any method that can extract scale-invariant feature points can be applied to the feature point extraction unit 22.

ここでは発明の一実施の形態として、スケール不変特徴点の抽出法として、D.ロー（D. Lowe）が提案する方法（“Distinctive image features from scale-invariant keypoints,” accepted for publication in the International Journal of Computer Vision, 2004.）を基礎とした方法を説明する。この手法では、スケール不変特徴点抽出対象画像のスケールスペース表現（T. Lindeberg, “Scale-space: A framework for handling image structures at multiple scales.”, Journal of Applied Statistics, vol. 21, no. 2, pp. 224-270, 1994”）を介して、当該画像のDoGフィルタ出力から、スケール方向も考慮に入れた局所極大点及び局所極小点が特徴点として抽出される。 Here, as an embodiment of the invention, as a method for extracting scale invariant feature points, a method proposed by D. Lowe (“Distinctive image features from scale-invariant keypoints,” accepted for publication in the International Journal of Computer Vision, 2004.). In this method, scale space representation of scale-invariant feature point extraction target image (T. Lindeberg, “Scale-space: A framework for handling image structures at multiple scales.”, Journal of Applied Statistics, vol. 21, no. 2, pp. 224-270, 1994 "), local maximum points and local minimum points taking into account the scale direction are extracted as feature points from the DoG filter output of the image.

そこで、ステップＳ１３において、特徴点抽出部２２は、各解像度画像のうちの未処理解像度画像を選択する。そして、ステップＳ１４において、特徴点抽出部２２は、スケールスペースの解像度画像を生成する。すなわち、スケール不変特徴点抽出対象画像Ｉ（多重解像度生成部２１で生成された各解像度画像（ｋ＝０，１，２，・・・，９の各解像度画像）のうちの１つの解像度画像が特徴点抽出対象画像となる）のスケールスペースが生成される。スケールスペースのｓ番目（ｓ＝０,・・・，Ｓ−１）の解像度画像Ｌ_sは、対象画像Ｉを式（１）に示される２次元ガウス関数を用いて、σ＝ｋ^s σ₀で畳み込み積分（ガウスフィルタリング）することで生成される。 Therefore, in step S13, the feature point extraction unit 22 selects an unprocessed resolution image from among the resolution images. In step S14, the feature point extraction unit 22 generates the resolution images of the scale space. That is, the scale space of the scale-invariant feature point extraction target image I is generated (one of the resolution images generated by the multi-resolution generation unit 21, namely the resolution images for k = 0, 1, 2, ..., 9, serves as the feature point extraction target image). The s-th (s = 0, ..., S−1) resolution image L_s of the scale space is generated by convolving the target image I (Gaussian filtering) with the two-dimensional Gaussian function shown in equation (1), using σ = k^s σ₀.

ここでσ₀は、対象画像Ｉのノイズ除去を目的としたぼかし度を決めるパラメータであり、ｋはスケールスペースの各解像度間で共通のぼかし度に関するコンスタントファクタであり、解像度画像Ｉ^[k]のｋとは別のファクタである。なお、画像の水平方向をＸ軸、垂直方向をＹ軸としている。 Here, σ ₀ is a parameter for determining the blur level for the purpose of noise removal of the target image I, k is a constant factor regarding the blur level common to each resolution of the scale space, and the resolution image I ^[k] This is a different factor from k. The horizontal direction of the image is the X axis and the vertical direction is the Y axis.

図５は、このようにして生成されたスケールスペースの例を表している。この例においては、画像Ｉにそれぞれ以下の５個の２次元ガウス関数を用いて生成された解像度画像Ｌ₀乃至Ｌ₄を表している。 FIG. 5 shows an example of a scale space generated in this way. This example shows the resolution images L₀ to L₄ generated by applying the following five two-dimensional Gaussian functions to the image I.

なお、式（２）乃至式（６）の右辺の畳み込み積分の記号の右辺の項は、次式を表す。すなわち、実質的に式（１）と同一である。 Note that the term on the right side of the convolution integral symbol on the right side of Equations (2) to (6) represents the following equation. That is, it is substantially the same as the formula (1).

図５では、解像度レベル数Ｓ＝５とされている。 In FIG. 5, the number of resolution levels S = 5.
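The scale space construction can be sketched in plain Python as below. Equation (1) is not reproduced in this text, so the sketch assumes the standard normalised Gaussian; the kernel truncation at about 3σ, the border clamping, and the values σ₀ = 1.0 and k = √2 in the usage example are all illustrative assumptions, as are the function names.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(kernel[j + radius] * row[min(max(x + j, 0), width - 1)]
                 for j in range(-radius, radius + 1))
             for x in range(width)]
            for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian filtering: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def scale_space(image, sigma0, k, levels):
    """Return [L_0, ..., L_{S-1}], with L_s blurred at sigma = k**s * sigma0."""
    return [gaussian_blur(image, (k ** s) * sigma0) for s in range(levels)]


if __name__ == "__main__":
    impulse = [[0.0] * 15 for _ in range(15)]
    impulse[7][7] = 1.0
    for s, level in enumerate(scale_space(impulse, 1.0, 2 ** 0.5, 5)):
        print(s, round(level[7][7], 4))  # the peak flattens as s grows
```

Calling `scale_space(image, sigma0, k, 5)` yields five images corresponding to L₀ through L₄ for S = 5.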

次に、ステップＳ１５で、特徴点抽出部２２は、DoGフィルタ出力画像を演算する。すなわち、このように得られた特徴点抽出対象のスケールスペースの各解像度画像Ｌ_sのDoGフィルタ出力画像が求められる。このDoGフィルタは、画像の輪郭強調のために用いられる２次微分フィルタの一種であり、人間の視覚系で網膜から外側膝状体で中継されるまでに行われている処理の近似モデルとして、LoGフィルタと共によく用いられるものである。DoGフィルタの出力は、２つのガウスフィルタ出力画像の差分を取ることで効率よく得られる。すなわち、図５の中央の列に示されるように、ｓ番目（ｓ＝０,・・・，Ｓ−２）の解像度のDoGフィルタ出力画像Ｄ_sは、解像度画像Ｌ_sを、その１段上の階層の解像度画像Ｌ_s+1から減算する（Ｌ_s+1−Ｌ_sを演算する）ことで得られる。 Next, in step S15, the feature point extraction unit 22 computes the DoG filter output images. That is, the DoG filter output image is obtained for each resolution image L_s of the scale space of the feature point extraction target obtained in this way. The DoG filter is a kind of second-order differential filter used for contour enhancement of images and, together with the LoG filter, is often used as an approximate model of the processing performed in the human visual system between the retina and the relay at the lateral geniculate body. The output of the DoG filter can be obtained efficiently by taking the difference between two Gaussian filter output images. That is, as shown in the center column of FIG. 5, the DoG filter output image D_s of the s-th (s = 0, ..., S−2) resolution is obtained by subtracting the resolution image L_s from the resolution image L_s+1 one level above it (computing L_s+1 − L_s).
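As a small illustration of the subtraction L_s+1 − L_s, the sketch below takes a list of already blurred images (nested lists of equal size) and returns the DoG images. The function name and the list-of-lists image representation are assumptions made for this example.

```python
def difference_of_gaussian(levels):
    """Return [D_0, ..., D_{S-2}] with D_s = L_{s+1} - L_s, pixel by pixel."""
    dogs = []
    for lower, upper in zip(levels, levels[1:]):
        dogs.append([[u - l for l, u in zip(row_l, row_u)]
                     for row_l, row_u in zip(lower, upper)])
    return dogs


if __name__ == "__main__":
    # Three tiny 1 x 2 "images" standing in for L_0, L_1, L_2.
    levels = [[[1.0, 2.0]], [[1.5, 2.0]], [[3.0, 1.0]]]
    print(difference_of_gaussian(levels))
```

S blurred images produce S − 1 DoG images, which matches the index range s = 0, ..., S−2 given in the text.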

次に、ステップＳ１６で、特徴点抽出部２２は、スケール不変特徴点を抽出する。具体的には、DoGフィルタ出力画像Ｄ_s（ｓ＝１,・・・，Ｓ−３）上のピクセルのうち、DoGフィルタ出力画像Ｄ_sの直接近傍領域（本実施の形態の場合、所定の位置の３×３個の画素の領域）、それより１段下位のDoGフィルタ出力画像Ｄ_s-1、並びにそれより１段上位のDoGフィルタ出力画像Ｄ_s+1上の同位置（対応する位置）の直接近傍領域の合わせて２７ピクセルにおいて、局所極大（２７ピクセルのうちの最大値）、局所極小（２７ピクセルのうちの最小値）となるピクセルがスケール不変特徴点として抽出され、特徴点群Ｋ_s（ｓ＝１,・・・，Ｓ−３）として保持される。図５の右側の列に、この特徴点群Ｋ_sが示されている。こうして抽出された特徴点はファクタがｋ²の解像度変化（つまりスケール変化）に対して、位置の不変性を持つスケール不変特徴点である。 Next, in step S16, the feature point extraction unit 22 extracts scale-invariant feature points. Specifically, among the pixels of the DoG filter output image D_s (s = 1, ..., S−3), a pixel is extracted as a scale-invariant feature point if it is a local maximum (the largest of the 27 pixels) or a local minimum (the smallest of the 27 pixels) over a total of 27 pixels: its direct neighborhood in D_s (in this embodiment, the 3 × 3 pixel region at the given position) together with the direct neighborhoods at the same (corresponding) position in the DoG filter output image D_s-1 one level below and the DoG filter output image D_s+1 one level above. The extracted points are held as the feature point group K_s (s = 1, ..., S−3), which is shown in the right-hand column of FIG. 5. The feature points extracted in this way are scale-invariant feature points whose positions are invariant to a resolution change (that is, a scale change) by a factor of k².
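The 27-pixel extremum test can be sketched as follows. The input is a list of DoG images D_0, ..., D_{S-2} as nested lists; the function name is hypothetical, and requiring a strict extremum (so that flat regions are not reported) is an assumption of this sketch that the text does not specify.

```python
def scale_space_extrema(dogs):
    """Return (s, y, x) for pixels that are the max or min of their 3x3x3 block."""
    keypoints = []
    height, width = len(dogs[0]), len(dogs[0][0])
    # Only levels with a neighbour below and above are scanned (s = 1 .. S-3).
    for s in range(1, len(dogs) - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                block = [dogs[s + ds][y + dy][x + dx]
                         for ds in (-1, 0, 1)
                         for dy in (-1, 0, 1)
                         for dx in (-1, 0, 1)]
                value = dogs[s][y][x]
                # Assumption: ties do not count, so the extremum must be unique.
                unique = block.count(value) == 1
                if unique and (value == max(block) or value == min(block)):
                    keypoints.append((s, y, x))
    return keypoints


if __name__ == "__main__":
    dogs = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    dogs[1][1][1] = 1.0  # a single bump in the middle level
    print(scale_space_extrema(dogs))
```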

特徴点抽出部２２は、後述するステップＳ２７で、全解像度画像を処理したと判定するまで、ステップＳ１３乃至Ｓ１６の処理を繰り返し、多重解像度生成部２１により生成された多重解像度レベル画像Ｉ^[k]のそれぞれに対し、スケール不変特徴点群を抽出する。 The feature point extraction unit 22 repeats steps S13 to S16 until it determines in step S27, described later, that all resolution images have been processed, and extracts a scale-invariant feature point group for each of the multi-resolution level images I^[k] generated by the multi-resolution generation unit 21.

次に、特徴量抽出部２３は、ステップＳ１７乃至Ｓ２５の処理を、ステップＳ２６で全特徴点を処理したと判定するまで繰り返し、各多重解像度レベル画像Ｉ^[k]から抽出された各特徴点における特徴量を抽出する。以下においては、特徴点における特徴量を、文脈に応じて、特徴点特徴量または単に特徴量と呼ぶ。 Next, the feature amount extraction unit 23 repeats steps S17 to S25 until it determines in step S26 that all feature points have been processed, and extracts the feature amount at each feature point extracted from each multi-resolution level image I^[k]. In the following, the feature amount at a feature point is called the feature point feature amount, or simply the feature amount, depending on the context.

特徴点特徴量としては、画像の回転変換、明度変化に対して不変な特徴量が用いられる。１つの特徴点に対して、複数の特徴量をあててもかまわない。その場合には、後段の特徴量比較部３５において、異なる特徴量での比較結果を統合する処理が必要となる。この実施の形態の場合、特徴量として、当該特徴点が抽出された画像の特徴点近傍領域の濃度勾配情報（各点における濃度勾配強度及び濃度勾配方向）から導出される２つの特徴量が用いられる。１つは、当該特徴点近傍領域における支配的な濃度勾配方向（以下、カノニカル方向と呼ぶ）で補正された方向ヒストグラムであり、他の１つは、カノニカル方向で補正された低次元縮退された濃度勾配ベクトルである。 As the feature point feature amount, a feature amount that is invariant to rotation of the image and to changes in brightness is used. A plurality of feature amounts may be assigned to one feature point. In that case, the feature amount comparison unit 35 in the subsequent stage must integrate the comparison results obtained with the different feature amounts. In this embodiment, two feature amounts are used, both derived from the density gradient information (the density gradient strength and density gradient direction at each point) of the region near the feature point in the image from which the feature point was extracted. One is a direction histogram corrected by the dominant density gradient direction in the region near the feature point (hereinafter called the canonical direction); the other is a dimensionally reduced, low-dimensional density gradient vector corrected by the canonical direction.

第１の特徴量（タイプ１の特徴量）は、特徴点近傍の濃度勾配方向に関するヒストグラム（方向ヒストグラム）を、その支配的方向でゼロ補正したものである。この第１の特徴量を抽出するために、特徴量抽出部２３は、ステップＳ１７において、１つの未処理特徴点を選択する。そして、ステップＳ１８で、特徴量抽出部２３は、濃度勾配強度Ｍ_x,yと方向Ｒ_x,yを求める。すなわち、図６に示されるように、特徴点近傍（本実施の形態では、当該特徴点Ｐを中心として直径７ピクセル（半径3.5ピクセル）の範囲に入るピクセル群）の濃度勾配強度Ｍ_x,y、及び方向Ｒ_x,yが、それぞれ式（８）と式（９）により求められる。同式中のｘ，ｙは、濃度勾配を求めるピクセルの画像上の座標であり、Ｉ_x,yは、その画素値である。 The first feature amount (type 1 feature amount) is a histogram of the density gradient directions in the vicinity of the feature point (direction histogram), zero-corrected by its dominant direction. To extract this first feature amount, the feature amount extraction unit 23 selects one unprocessed feature point in step S17. In step S18, the feature amount extraction unit 23 obtains the density gradient strength M_x,y and the direction R_x,y. That is, as shown in FIG. 6, the density gradient strength M_x,y and the direction R_x,y in the vicinity of the feature point (in this embodiment, the group of pixels within a range of 7 pixels in diameter (3.5 pixels in radius) centered on the feature point P) are obtained by equations (8) and (9), respectively. In these equations, x and y are the image coordinates of the pixel for which the density gradient is obtained, and I_x,y is its pixel value.
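Equations (8) and (9) are not reproduced in this text, so the sketch below should be read as one plausible instance rather than the patent's exact formulas. It assumes forward differences between neighbouring pixels, a direction measured in degrees in the range [0, 360), and an image indexed as `image[y][x]`; all names are hypothetical.

```python
import math


def gradient(image, x, y):
    """Return (strength M, direction R in degrees) at pixel (x, y).

    Assumption: forward differences, since equations (8) and (9)
    are not available in the text.
    """
    dx = image[y][x + 1] - image[y][x]
    dy = image[y + 1][x] - image[y][x]
    strength = math.hypot(dx, dy)
    direction = math.degrees(math.atan2(dy, dx)) % 360.0
    return strength, direction


def neighbourhood(cx, cy, radius=3.5):
    """Pixel coordinates lying within `radius` of the feature point (cx, cy)."""
    r = int(radius)
    return [(cx + dx, cy + dy)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)
            if dx * dx + dy * dy <= radius * radius]


if __name__ == "__main__":
    ramp = [[float(x) for x in range(4)] for _ in range(4)]
    print(gradient(ramp, 1, 1))        # horizontal ramp: direction 0 degrees
    print(len(neighbourhood(10, 10)))  # pixels in the 7-pixel-diameter disc
```

With a radius of 3.5 pixels the disc contains 37 pixel positions, which is the count a normalisation by "number of neighbouring pixels" would use under this reading.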

次に、ステップＳ１９で、特徴量抽出部２３は方向ヒストグラムを生成する。具体的には、特徴点近傍中の各ピクセルの方向Ｒ_x,yに基づいて、階級幅Δθ、階級数360°／Δθの方向ヒストグラム（本実施の形態では、Δθ＝１０°）の該当する階級に、各ピクセルの度数が累積される。このとき、図７に示されるように階級の量子化誤差の影響を小さくするため、度数（図７における縦軸）としては、階級（図７における横軸）の中心値から方向Ｒ_x,yへの距離の近さに比例した値が累積される。つまり、方向Ｒ_x,yから最も近い２つの階級をｇ，ｇ＋１とし、それぞれの中心値と方向Ｒ_x,yとの距離をｄ₁，ｄ₂とすると、階級ｇ，ｇ＋１に加算する度数値は、それぞれｄ₂／（ｄ₁＋ｄ₂），ｄ₁／（ｄ₁＋ｄ₂）となる。これにより、量子化誤差が少なくなる。 Next, in step S19, the feature amount extraction unit 23 generates a direction histogram. Specifically, based on the direction R_x,y of each pixel in the vicinity of the feature point, the frequency of each pixel is accumulated in the corresponding class of a direction histogram with class width Δθ and 360°/Δθ classes (Δθ = 10° in this embodiment). To reduce the effect of class quantization error, as shown in FIG. 7, the value accumulated as the frequency (vertical axis in FIG. 7) is proportional to how close the direction R_x,y is to the center value of the class (horizontal axis in FIG. 7). That is, if the two classes closest to the direction R_x,y are g and g+1, and the distances between their center values and the direction R_x,y are d₁ and d₂, the frequency values added to classes g and g+1 are d₂/(d₁+d₂) and d₁/(d₁+d₂), respectively. This reduces the quantization error.

次に、ステップＳ２０で、特徴量抽出部２３は度数を正規化する。すなわち、得られた方向ヒストグラムの度数が、特徴点近傍ピクセル数（直径７ピクセルの範囲に入るピクセル数）で割算することにより正規化される。このように、勾配方向のみを累積することで、明度変化に対して強い特徴量を得ることができる。 Next, in step S20, the feature amount extraction unit 23 normalizes the frequency. That is, the frequency of the obtained direction histogram is normalized by dividing by the number of pixels near the feature point (the number of pixels falling within the range of 7 pixels in diameter). In this way, by accumulating only the gradient direction, it is possible to obtain a feature quantity that is strong against changes in brightness.
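Steps S19 and S20 can be sketched as below. The weights d₂/(d₁+d₂) and d₁/(d₁+d₂) and the division by the number of neighbourhood pixels follow the text; placing the centre of class g at (g + 0.5)·Δθ is an assumption of this sketch, as are the function names.

```python
import math


def direction_histogram(directions, bin_width=10.0):
    """Accumulate directions (degrees) into a cyclic, soft-binned histogram."""
    n_bins = int(round(360.0 / bin_width))
    hist = [0.0] * n_bins
    for theta in directions:
        # Position in units of bin widths, measured from the centre of bin 0
        # (assumed to lie at bin_width / 2).
        pos = (theta % 360.0) / bin_width - 0.5
        g = math.floor(pos)
        frac = pos - g                      # equals d1 / (d1 + d2)
        hist[g % n_bins] += 1.0 - frac      # class g gets d2 / (d1 + d2)
        hist[(g + 1) % n_bins] += frac      # class g+1 gets d1 / (d1 + d2)
    return hist


def normalise(hist, n_pixels):
    """Divide every frequency by the number of neighbourhood pixels."""
    return [v / n_pixels for v in hist]


if __name__ == "__main__":
    dirs = [5.0, 10.0, 82.0, 82.0]
    hist = normalise(direction_histogram(dirs), len(dirs))
    print([round(v, 3) for v in hist[:10]])
```

Because each pixel contributes a total weight of 1 split between two classes, the normalised histogram sums to 1 regardless of how the directions fall relative to the class boundaries.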

さらに、特徴量抽出部２３は、ステップＳ２１でカノニカル方向を抽出し、ステップＳ２２で角度をカノニカル方向で正規化する。具体的には、回転変換に不変な特徴量とするために、得られた方向ヒストグラムの強いピークを与える角度としてのカノニカル方向が抽出され、そのカノニカル方向としての角度が０度になるようにヒストグラムをシフトすることで、角度の正規化が行われる。コーナー付近に抽出された特徴点に関するヒストグラムでは、そのエッジに垂直な方向に複数の強いピークが現れるため、このような場合は、強いピークごとにその角度が０度になるように補正した（正規化した）方向ヒストグラムが生成される。つまり、カノニカル方向の数だけ、別々に特徴量が生成される。ピークがカノニカル方向であるための基準は、例えば、最大累積値の８０％以上の累積値を与えるピーク方向とされる。 Further, the feature amount extraction unit 23 extracts the canonical direction in step S21 and normalizes the angle by the canonical direction in step S22. Specifically, to obtain a feature amount that is invariant to rotation, the canonical direction is extracted as the angle that gives a strong peak in the obtained direction histogram, and the angle is normalized by shifting the histogram so that the canonical direction becomes 0 degrees. In the histogram of a feature point extracted near a corner, several strong peaks appear in the directions perpendicular to its edges; in such a case, a direction histogram corrected (normalized) so that the angle becomes 0 degrees is generated for each strong peak. In other words, a separate feature amount is generated for each canonical direction. The criterion for a peak to be a canonical direction is, for example, that it is a peak direction giving an accumulated value of 80% or more of the maximum accumulated value.

例えば、図８に示される方向ヒストグラムにおいては、角度80度の度数Ｖ₈₀と角度200度の度数Ｖ₂₀₀の２つのピークが存在する。すなわち、角度80度と角度200度が、カノニカル方向となる。この場合、図９に示されるように、カノニカル方向としての角度80度が０度となるように正規化されたヒストグラムと、図１０に示されるように、カノニカル方向としての角度200度が０度になるように正規化されたヒストグラムが生成される。 For example, the direction histogram shown in FIG. 8 has two peaks: the frequency V₈₀ at an angle of 80 degrees and the frequency V₂₀₀ at an angle of 200 degrees. That is, the angles of 80 degrees and 200 degrees are the canonical directions. In this case, a histogram normalized so that the canonical direction of 80 degrees becomes 0 degrees is generated, as shown in FIG. 9, and a histogram normalized so that the canonical direction of 200 degrees becomes 0 degrees is generated, as shown in FIG. 10.
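The canonical direction step can be illustrated with a short sketch. The 80% criterion and the cyclic shift follow the text; treating a "peak" as a class that is larger than its left neighbour and at least as large as its right neighbour is an assumption of this sketch, and the function names are hypothetical.

```python
def canonical_bins(hist, ratio=0.8):
    """Indices of local peaks whose value is at least `ratio` of the maximum."""
    n = len(hist)
    top = max(hist)
    return [g for g in range(n)
            if hist[g] >= ratio * top
            and hist[g] > hist[(g - 1) % n]
            and hist[g] >= hist[(g + 1) % n]]


def shift_to_zero(hist, g):
    """Cyclically shift the histogram so that class g becomes class 0."""
    return hist[g:] + hist[:g]


def type1_features(hist, ratio=0.8):
    """One rotation-normalised histogram per canonical direction."""
    return [shift_to_zero(hist, g) for g in canonical_bins(hist, ratio)]


if __name__ == "__main__":
    hist = [0.0] * 36
    hist[8], hist[20], hist[5] = 1.0, 0.9, 0.3   # peaks near 80 and 200 degrees
    for feature in type1_features(hist):
        print(feature[:4], len(feature))
```

In this example the classes at indices 8 and 20 both pass the 80% test, so two 36-dimensional feature vectors are produced, mirroring the two normalized histograms of FIG. 9 and FIG. 10.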

以上の処理で得られるタイプ１の特徴量は、方向ヒストグラムの階級数と同じ次元の特徴ベクトル（本実施の形態では、３６（＝360°／10°）次元ベクトル、すなわち、３６個の階級の度数を表わす数字からなるベクトル）となる。 The type 1 feature quantity obtained by the above processing is a feature vector having the same dimension as the number of classes in the direction histogram (in this embodiment, a 36 (= 360 ° / 10 °) dimension vector, that is, 36 classes. A vector of numbers representing frequencies).

次に、第２の特徴量（タイプ２の特徴量）として、低次元縮退濃度勾配ベクトルが求められる。タイプ１の特徴量が、特徴点近傍内ピクセルの空間的配置を無視し、特徴点近傍局所領域での濃度勾配ベクトルの方向の傾向（頻度）のみに注目しているのに対し、タイプ２の特徴量は、特徴点近傍の各濃度勾配ベクトルの空間的配置に注目する。この２種類の特徴量を後述する手法で特徴量比較に用いることで、視点変化、明度変化に強い認識を実現する。 Next, a low-dimensional degenerate density gradient vector is obtained as the second feature amount (type 2 feature amount). The type 1 feature quantity ignores the spatial arrangement of the pixels in the vicinity of the feature point and focuses only on the direction tendency (frequency) of the density gradient vector in the local region near the feature point, whereas the type 2 feature quantity For the feature amount, attention is paid to the spatial arrangement of each density gradient vector in the vicinity of the feature point. By using these two types of feature amounts for feature amount comparison by a method to be described later, recognition that is strong against viewpoint change and brightness change is realized.

タイプ２の特徴量の抽出のために、まず、ステップＳ２３で、特徴量抽出部２３は、特徴点近傍画像を回転補正する。すなわち、上述の処理で得られた特徴点近傍のカノニカル方向が０度になるように特徴点近傍画像が回転補正される。さらに、ステップＳ２４で、特徴量抽出部２３は、濃度勾配ベクトル群を演算する。例えば、図１１の上段に示されている特徴点近傍のピクセルの濃度勾配が、図８に示されるように分布している場合、上述したように、カノニカル方向は、80度と200度の方向となる。そこで、図１１の中段の左側の図に示されるように、上段の画像をカノニカル方向80度が０度になるように、特徴点近傍画像が、この例の場合時計方向に回転される。そして、その濃度勾配ベクトル群が演算される。このことは、結局、図８の角度80度のカノニカル方向を０度として正規化して得られた図９の方向ヒストグラムの濃度勾配ベクトル群を得ることに等しい。 To extract the type 2 feature amount, first, in step S23, the feature amount extraction unit 23 rotationally corrects the feature point neighborhood image. That is, the feature point neighborhood image is rotated so that the canonical direction in the vicinity of the feature point obtained by the processing described above becomes 0 degrees. Then, in step S24, the feature amount extraction unit 23 computes the density gradient vector group. For example, when the density gradients of the pixels near the feature point shown in the upper part of FIG. 11 are distributed as shown in FIG. 8, the canonical directions are 80 degrees and 200 degrees, as described above. Therefore, as shown on the left of the middle row of FIG. 11, the feature point neighborhood image in the upper row is rotated, clockwise in this example, so that the canonical direction of 80 degrees becomes 0 degrees, and its density gradient vector group is computed. This is equivalent to obtaining the density gradient vector group of the direction histogram of FIG. 9, which is obtained by normalizing the canonical direction of 80 degrees in FIG. 8 to 0 degrees.

また、同様に、図１１の中段の右側に示されるように、特徴点近傍画像が、200度のカノニカル方向が０度になるように回転補正される。そして、その画像の濃度勾配ベクトル群が演算される。このことは、図８の角度200度のカノニカル方向を０度として正規化することで得られた図１０の方向ヒストグラムの濃度勾配ベクトル群を得ることに等しい。 Similarly, as shown on the right side of the middle stage of FIG. 11, the feature point neighborhood image is rotationally corrected so that the canonical direction of 200 degrees is 0 degrees. Then, the density gradient vector group of the image is calculated. This is equivalent to obtaining the density gradient vector group of the direction histogram of FIG. 10 obtained by normalizing the canonical direction of the angle of 200 degrees of FIG. 8 as 0 degree.

次に、ステップＳ２５において、特徴量抽出部２３は、濃度勾配ベクトル群を次元縮退する。すなわち、数ピクセル程度の特徴点抽出位置のずれを吸収できるようにするために、この濃度勾配ベクトル群が、図１１の下段の左右に示されているように、例えば、直径７ピクセルの円の内側にほぼ内接する四角形内の５×５ピクセルのベクトル群から、３×３個のベクトル群に線形補間リサンプルすることで次元縮退される。 Next, in step S25, the feature amount extraction unit 23 performs dimension reduction on the density gradient vector group. That is, in order to be able to absorb the deviation of the feature point extraction position of about several pixels, this density gradient vector group is, for example, a circle of 7 pixels in diameter as shown on the left and right in the lower part of FIG. Dimensional degeneration is performed by linearly re-sampling a vector group of 5 × 5 pixels in a rectangle almost inscribed inward into a 3 × 3 vector group.

Specifically, as shown in FIG. 12, linear interpolation resampling computes each pixel value of the resampled image from the four neighboring pixels of the original image, weighted by distance ratios, according to the following equation.

$$f(X, Y) = (1-q)\,\{(1-p)\,f(x, y) + p\,f(x+1, y)\} + q\,\{(1-p)\,f(x, y+1) + p\,f(x+1, y+1)\} \tag{10}$$

In the above equation, (X, Y) is a pixel of the resampled image; (x, y), (x+1, y), (x, y+1), and (x+1, y+1) are the original image pixels neighboring the resampled pixel (X, Y); f(a, b) is the pixel value at coordinates (a, b); and p and q are the distance ratios, in the x-coordinate and y-coordinate directions respectively, from the neighboring pixels to the resampled pixel (X, Y), as shown in FIG. 12.
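As an illustration of equation (10), the following minimal Python sketch evaluates the interpolated value at a real-valued position. The function name `bilinear_sample`, the list-of-rows image layout, and the clamping at the image border are assumptions made for this example and are not taken from the specification.

```python
def bilinear_sample(img, X, Y):
    """Evaluate equation (10) at the real-valued position (X, Y).

    img is a list of rows (at least 2 x 2), so img[y][x] plays the role of f(x, y).
    Assumes X >= 0 and Y >= 0.
    """
    x, y = int(X), int(Y)            # upper-left neighbour (x, y)
    # Clamp so that the neighbours (x+1, y+1) stay inside the image (an assumption).
    x = min(x, len(img[0]) - 2)
    y = min(y, len(img) - 2)
    p, q = X - x, Y - y              # distance ratios along x and y
    top = (1 - p) * img[y][x] + p * img[y][x + 1]
    bottom = (1 - p) * img[y + 1][x] + p * img[y + 1][x + 1]
    return (1 - q) * top + q * bottom


if __name__ == "__main__":
    image = [[0, 10],
             [20, 30]]
    print(bilinear_sample(image, 0.5, 0.5))  # centre of the four pixels
```

When p = q = 0 the result is simply f(x, y), and when p = q = 0.5 it is the mean of the four neighbours.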

The type 2 feature quantity is then obtained by assigning the x and y components of each dimension-reduced vector to the dimensions of the feature vector. When the vectors are resampled to a 3 × 3 vector group by linear interpolation resampling, the feature quantity has 18 (= 3 × 3 × 2) dimensions.

If the target image size after resampling is half the original image size or less, the resampling error can be reduced as follows: the original image is repeatedly reduced by a factor of 0.5 until the smallest power-of-0.5 image that is still larger than the target size is obtained, and the resampling of equation (10) is then applied to that image. For example, to create an image 0.2 times the size of the original by linear interpolation resampling, the 0.5-times resampling is applied twice to obtain an image 0.25 times the size of the original, and the linear interpolation resampling of equation (10) is applied to that 0.25-times image.
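The staged reduction described above can be sketched as a small planning function. The name `halving_plan` and its return format are hypothetical; the sketch only computes how many 0.5-times reductions precede the final resampling with equation (10), and does not perform the resampling itself.

```python
def halving_plan(target_scale):
    """Return (halvings, intermediate_scale, final_factor) for a target scale in (0, 1].

    halvings:            number of 0.5x reductions applied first
    intermediate_scale:  scale of the image after those reductions
    final_factor:        factor left for the final resampling with equation (10)
    """
    if not 0 < target_scale <= 1:
        raise ValueError("target_scale must be in (0, 1]")
    halvings, scale = 0, 1.0
    # Halve only while the halved image would still be larger than the target.
    while scale * 0.5 > target_scale:
        scale *= 0.5
        halvings += 1
    return halvings, scale, target_scale / scale


if __name__ == "__main__":
    # The 0.2x example from the text: two halvings to 0.25x, then a 0.8x resample.
    print(halving_plan(0.2))
```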

In step S26, the feature quantity extraction unit 23 determines whether all feature points have been processed. If an unprocessed feature point remains, the process returns to step S17 and the subsequent processing is repeated. If it is determined in step S26 that all feature points have been processed (that is, steps S17 to S25 have been performed for every feature point), the feature point extraction unit 22 determines in step S27 whether all resolution images have been processed. If an unprocessed resolution image remains, the process returns to step S13 and the subsequent processing is repeated. If it is determined that steps S13 to S25 have been performed for all resolution images, the multi-resolution generation unit 21 determines in step S28 whether all models have been processed. If an unprocessed model remains, the process returns to step S11 and the subsequent processing is repeated. If it is determined that steps S11 to S25 have been executed for all models, the process proceeds to step S29.

In step S29, the model dictionary registration unit 24 labels and registers the feature point feature quantities extracted as described above. Each feature quantity is labeled so that it can be traced to the model, the image within that model's multi-resolution image group, the scale, and the feature point from which it was extracted, and is then registered in the model dictionary.

In this way, the model images of the target objects to be recognized are registered in advance in the model dictionary registration unit 24 as feature quantities.

When the image processing apparatus 1 has both the learning unit 11 and the recognition unit 12, the recognition unit 12 can use the model dictionary registration unit 24 as it is. When the learning unit 11 and the recognition unit 12 are configured as separate image processing apparatuses, the model dictionary registration unit 24 in which the necessary information has been registered as described above is either installed in the image processing apparatus having the recognition unit 12 or made accessible to it by wireless communication.

Next, the processing performed by the recognition unit 12 when recognizing an object in an input image will be described with reference to the flowcharts of FIGS. 13 to 15.

In steps S131 to S147, the multi-resolution generation unit 31, the feature point extraction unit 32, and the feature quantity extraction unit 33 apply to the input object image the same processing that the multi-resolution generation unit 21, the feature point extraction unit 22, and the feature quantity extraction unit 23 of the learning unit 11 perform in steps S11 to S27. The description would be repetitive and is therefore omitted. However, the configuration of the multi-resolution images, which is determined by the parameters N and α, differs between recognition and learning.

During learning, the multi-resolution generation unit 21 generates multi-resolution images over a wide magnification range in fine steps, whereas during recognition the multi-resolution generation unit 31 generates multi-resolution images in coarse steps. Specifically, the parameters used in the present embodiment are N = 10 and α = 0.1 for learning in step S12, and N = 2 and α = 0.5 for recognition in step S132. The reasons are as follows.

1) To raise recognition accuracy, it is desirable to compare feature quantities using as much feature point feature quantity information as possible. In other words, it is desirable to extract feature points from a larger number of multi-resolution images.
2) To obtain robustness against scale changes, it is desirable to make the scale range of the multi-resolution image configuration as wide as possible.
3) Real-time performance is not very important during model learning, so the number of multi-resolution images of the model image can be increased and the scale range widened when extracting and storing feature point feature quantities.
4) In the present embodiment, each feature point feature quantity extracted from the object image is compared by a k-Nearest Neighbor (k-NN) search (described later) on a kd tree constructed from all feature point feature quantities of all models. The computational cost of the comparison therefore increases in proportion to the number of feature points extracted from the object image. With respect to the number of model feature points, however, when the kd tree is constructed from all recognition target models and the total number of model feature points is n, the cost can be kept to the order of log n (that is, O(log n)).
5) On the other hand, real-time performance is important during recognition, so the computational cost must be reduced by using as few multi-resolution images as possible.
6) Nevertheless, if no multi-resolution images are generated from the input object image and only the original image is used, an object cannot be recognized when the recognition target object in the object image is larger than the original model image.

For these reasons, as shown in FIG. 16, a larger multi-resolution image group (k = 0 to 9) covering a wider range is generated from the model image during learning (N = 10, α = 0.1), and more feature points are extracted from it. During recognition, only the minimum multi-resolution image group needed for recognition (k = 0, 1) is generated from the object image (N = 2, α = 0.5), feature points are extracted from it, and the feature quantities are compared by a k-NN search on the kd tree. This makes accurate recognition possible at low computational cost. FIG. 16 shows a case in which the original object image is too large for a model image level of corresponding scale to exist, but reducing the original image (k = 0) by a factor of 0.5 (k = 1) makes a model image level of corresponding scale available.
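To make the two parameter settings concrete, the sketch below lists the scale of each pyramid level. The rule scale_k = 1 − α·k is an assumption inferred from the values quoted here (k = 1 gives 0.5 when α = 0.5, and ten levels with α = 0.1 remain positive); the exact definition is given elsewhere in the specification and may differ. The function name `pyramid_scales` is hypothetical.

```python
def pyramid_scales(n_levels, alpha):
    """Scale of each level k = 0 .. n_levels - 1, ASSUMING scale_k = 1 - alpha * k."""
    # round() only hides floating-point noise such as 0.30000000000000004.
    return [round(1.0 - alpha * k, 10) for k in range(n_levels)]


if __name__ == "__main__":
    print("learning   :", pyramid_scales(10, 0.1))  # fine steps, wide range
    print("recognition:", pyramid_scales(2, 0.5))   # original image and half size only
```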

When the processing of steps S131 to S145 has been performed for all feature points and all resolution images, the process proceeds to step S148.

As described later, each feature point feature quantity extracted from the object image (a dimension-reduced density gradient vector group) is compared with the feature point feature quantities of the registered recognition target models and is combined with similar model feature point feature quantities to form candidate corresponding feature point pairs. The simplest comparison method is an exhaustive search: for each feature point feature quantity of the object image, the similarity to every feature point feature quantity of every recognition target model is calculated, and corresponding feature point pairs are selected according to that similarity. However, exhaustive search is impractical in terms of computational cost. In the embodiment of the present invention, therefore, a tree search method using a data structure called a kd tree (J. H. Friedman, J. L. Bentley, R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Transactions on Mathematical Software, Vol. 3, No. 3, pp. 209-226, September 1977) is used to search a large data set at high speed. "kd tree" means a k-dimensional tree structure.

When only some of the models registered in the model dictionary during the learning process so far need to be recognized, the kd tree construction unit 34 constructs, in step S148, kd trees from all feature point feature quantities of only the models to be recognized. In the present embodiment, a 36d tree (k = 36) for the type 1 feature quantities and an 18d tree (k = 18) for the type 2 feature quantities are constructed. Each leaf (terminal node) of a tree holds one feature point feature quantity together with a label that identifies the model, the image within that model's multi-resolution image group, the scale, and the feature point from which the feature quantity was extracted.
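A labeled kd tree with a k-NN query can be sketched in plain Python as follows. This is a generic textbook variant, not the construction of the cited paper or of the embodiment: it stores one labeled point at every node (rather than only at the leaves, as described above), splits at the median, and uses squared Euclidean distance. The names `build_kdtree` and `knn` and the label strings are hypothetical.

```python
import heapq


def build_kdtree(items, depth=0):
    """items: list of (feature_vector, label) tuples. Returns a nested dict or None."""
    if not items:
        return None
    axis = depth % len(items[0][0])              # cycle through the dimensions
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2                        # median split
    return {
        "point": items[mid][0], "label": items[mid][1], "axis": axis,
        "left": build_kdtree(items[:mid], depth + 1),
        "right": build_kdtree(items[mid + 1:], depth + 1),
    }


def knn(tree, query, k):
    """Return up to k (squared_distance, label) pairs, nearest first."""
    heap = []      # max-heap via negated distances, at most k entries
    visited = 0    # unique tie-breaker so labels are never compared

    def visit(node):
        nonlocal visited
        if node is None:
            return
        d2 = sum((a - b) ** 2 for a, b in zip(node["point"], query))
        visited += 1
        entry = (-d2, visited, node["label"])
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, entry)
        diff = query[node["axis"]] - node["point"][node["axis"]]
        near, far = ((node["left"], node["right"]) if diff < 0
                     else (node["right"], node["left"]))
        visit(near)
        # Visit the far side only if the splitting plane is closer than the k-th best.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(tree)
    return [(-neg_d2, label) for neg_d2, _, label in sorted(heap, reverse=True)]


if __name__ == "__main__":
    feats = [((0.0, 0.0), "model A / image 0 / point 3"),
             ((1.0, 0.0), "model A / image 1 / point 7"),
             ((5.0, 5.0), "model B / image 0 / point 1")]
    print(knn(build_kdtree(feats), (0.2, 0.1), 2))
```

In the embodiment the vectors would have 36 or 18 dimensions, and the label would carry the model, image, scale, and feature point identifiers.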

On the other hand, when all models registered in the model dictionary are to be recognized, the tree is rebuilt each time a model is additionally learned, and the tree itself is registered in the model dictionary. In this case, the kd tree construction processing in step S148 is omitted.

In step S149, the feature quantity comparison unit 35 selects an unprocessed feature point of the object image. In step S150, it pairs the type 1 feature point feature quantity of the object image with the k most similar model feature point feature quantities. Similarly, in step S151, it pairs the type 2 feature point feature quantity of the object image with the k most similar model feature point feature quantities. That is, each feature point feature quantity of the object image extracted by the feature point extraction unit 32 and the feature quantity extraction unit 33 is paired by the feature quantity comparison unit 35, through a k-NN search, with the k model feature point feature quantities most similar to it (four in the example of FIG. 17). Although the same letter k is used, the k of the k-NN search and the k of the kd tree are independent values; they may differ or coincide. In the present embodiment, the Euclidean distance of equation (11), for which a larger value indicates less similarity, is used as the dissimilarity in the k-NN search for type 1 feature quantities, and the cosine correlation value of equation (12), for which a larger value indicates greater similarity, is used as the similarity for type 2 feature quantities.
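The body of equation (11) is not reproduced in this text. Assuming it is the ordinary Euclidean distance over the N dimensions of the two feature vectors, which is what the symbol definitions given for it suggest, it would read:

$$\mathrm{distance}(u_V, v_V) = \sqrt{\sum_{n=1}^{N} (u_n - v_n)^2} \tag{11}$$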

In equation (11), u_V and v_V are the feature vectors whose dissimilarity is calculated, u_n and v_n are the values of u_V and v_V in the n-th dimension, and N is the number of dimensions of the vectors u_V and v_V.
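The body of equation (12) is likewise not reproduced in this text. A standard cosine correlation consistent with the description would be the following; whether the original formula additionally offsets or rescales this value (for example to map it into the range 0 to 1) cannot be determined from the surrounding text, so this form is an assumption:

$$\mathrm{similarity}(u_V, v_V) = \frac{u_V \cdot v_V}{|u_V|\,|v_V|} \tag{12}$$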

In equation (12), u_V and v_V are the feature vectors whose similarity is calculated, and u_V · v_V denotes their inner product. When extracting the k pairs with similar feature quantities, a threshold test may also be applied to the dissimilarity (for type 1 feature quantities) or the similarity (for type 2 feature quantities). The cosine correlation value is used as the similarity measure for type 2 feature quantities so that the comparison is not affected by changes in the magnitude of the local density gradient vectors caused by changes in brightness. Alternatively, instead of the cosine correlation similarity, u_V and v_V may each be normalized to a vector length of 1 and their Euclidean distance used as the dissimilarity for type 2 feature quantities. In this case as well, the comparison is not affected by changes in the magnitude of the local density gradient vectors caused by changes in brightness.
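The invariance to brightness described above, and the relation between the two alternatives, can be checked numerically. The sketch below assumes the plain cosine form for the similarity; for unit-length vectors the squared Euclidean distance equals 2 − 2·cos, so both measures rank candidates identically. The function names are illustrative, and zero-length vectors are not handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (larger means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalized_euclidean(u, v):
    """Euclidean distance after scaling both vectors to length 1."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.sqrt(sum((a / norm_u - b / norm_v) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    u = [1.0, 2.0, 3.0]
    v = [2.0, 1.0, 0.0]
    brighter_v = [3.0 * b for b in v]   # same gradient directions, larger magnitudes
    print(cosine_similarity(u, v), cosine_similarity(u, brighter_v))
    print(normalized_euclidean(u, v) ** 2, 2 - 2 * cosine_similarity(u, v))
```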

The feature quantity comparison unit 35 performs the processing of steps S149 to S151 for each feature point of the object image. In step S152, it determines whether all feature points have been processed. If an unprocessed feature point remains, the process returns to step S149 and the subsequent processing is repeated. If it is determined in step S152 that all feature points have been processed, the process proceeds to step S153.

Because two types of feature quantity, type 1 and type 2, are used, the feature quantity comparison unit 35 first obtains, by the method described above, the feature point pairs for the feature points of the input object image separately for each feature quantity type. Then, in step S153, it selects as candidate corresponding feature point pairs only those pairs extracted in common for both type 1 and type 2, and classifies them by model. The candidate corresponding feature point pairs are supplied to the model posture estimation unit 36 in the subsequent stage. Since the model posture estimation unit 36 processes each model separately, passing the extracted candidate corresponding feature point pairs already classified by model makes the processing more efficient.

FIG. 17 schematically illustrates the above processing. The kd tree construction unit 34 generates a 36d tree structure for the type 1 feature quantities and an 18d tree structure for the type 2 feature quantities. For each feature quantity of the object image, a k-NN search (here, k = 4) of the type 1 36d tree structure finds four similar pairs for the type 1 feature quantity. In this example, the feature point feature quantity of the object image drawn as a square (the squares, pentagons, triangles, circles, and crosses in the figure each represent a feature point feature quantity) is found to be similar to the pentagon, triangle, circle, and cross in the type 1 36d tree structure. Likewise, a k-NN search of the type 2 18d tree structure finds four similar pairs for the type 2 feature quantity. In this example, the square of the object image is found to be similar to the parallelogram, cross, circle, and rhombus in the type 2 18d tree structure.

The pairs common to the four similar pairs for the type 1 feature quantity and the four similar pairs for the type 2 feature quantity are then selected. In this example, the type 1 similar pairs are square and pentagon, square and triangle, square and circle, and square and cross, while the type 2 similar pairs are square and parallelogram, square and cross, square and circle, and square and rhombus. The pairs square and circle and square and cross are therefore common to both types, and they are selected as the candidate corresponding feature point pairs.
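The selection of common pairs amounts to an intersection of the two pair lists. The following sketch reproduces the FIG. 17 example with the shape names standing in for feature point identifiers; the function name `common_pairs` and the tuple representation are assumptions made for illustration.

```python
def common_pairs(type1_pairs, type2_pairs):
    """Keep only the (object_point, model_point) pairs found for both feature types.

    The order of the type 1 list is preserved.
    """
    type2_set = set(type2_pairs)
    return [pair for pair in type1_pairs if pair in type2_set]


if __name__ == "__main__":
    type1 = [("square", "pentagon"), ("square", "triangle"),
             ("square", "circle"), ("square", "cross")]
    type2 = [("square", "parallelogram"), ("square", "cross"),
             ("square", "circle"), ("square", "rhombus")]
    print(common_pairs(type1, type2))
```

In practice each element of a pair would be a label identifying a feature point (model, image, scale, point), so that the surviving pairs can be grouped by model before being passed on.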

In the processing described above, one kd tree is constructed per feature quantity type from all feature point feature quantities of all recognition target models, and the k-NN of each feature point feature quantity of the input image is searched in it. Alternatively, a kd tree may be constructed for each feature quantity type and each model, and the k-NN of each feature point feature quantity of the input image searched model by model. In either case, the output is a group of candidate corresponding feature point pairs classified by model, and the subsequent processing described later is the same.

The above processing extracts pairs (of a model feature point and an object feature point) whose local density gradient information near the feature points is similar. Viewed macroscopically, however, the pairs obtained in this way include not only "true feature point pairs (inliers)", for which the spatial relationship between the corresponding feature points is consistent with the posture of the model in the object image (the model posture), but also "false feature point pairs (outliers)", for which it is not.

FIG. 18 schematically shows inliers and outliers. As shown in the figure, if the triangular model image on the left corresponds to the triangular detection target object in the object image on the right, the feature points P1 to P4 near the vertices of the triangle in the model image correspond to the feature points P11 to P14 of the detection target object, respectively. That is, feature point P1 corresponds to P11, P2 to P12, P3 to P13, and P4 to P14. These candidate corresponding feature point pairs are therefore inliers. In FIG. 18, inliers are indicated by solid lines.

By contrast, feature point P5 of the model image lies near the center of the interior of the triangle, and feature point P6 lies outside the triangle, close to its perimeter. The object image feature point P15 paired with P5 and the object image feature point P16 paired with P6 are both far from the detection target object. That is, the candidate corresponding feature point pair P5 and P15 and the pair P6 and P16 are outliers. In FIG. 18, outliers are indicated by broken lines.

One conceivable way to derive, from the group of candidate corresponding feature point pairs, the image transformation parameters that determine the position and posture of the model image in the input image is to obtain estimated image transformation parameters by least squares estimation. A more accurate model posture can be obtained by repeatedly eliminating the pairs whose spatial relationship is inconsistent with the estimated model posture and re-deriving the estimated image transformation parameters by least squares estimation from the remaining pairs.

However, it is known that least squares estimation generally does not give satisfactory results when the group of candidate corresponding feature point pairs contains many outliers or contains outliers that deviate extremely from the true image transformation parameters (Hartley R., Zisserman A., "Multiple View Geometry in Computer Vision", Chapter 3, pp. 69-116, Cambridge University Press, 2000). The model posture estimation unit 36 of the present embodiment therefore extracts the "true feature point pairs (inliers)" from the spatial relationships within the group of candidate corresponding feature point pairs under the constraint of a particular image transformation, and estimates the image transformation parameters that determine the position and posture of the model using the extracted inliers.

The model posture estimation processing by the model posture estimation unit 36 is performed for each recognition target model: for each model, it determines whether the model is present and, if so, estimates its posture. In the following description, "candidate corresponding feature point pairs" means the subset of the pairs output by the feature quantity comparison unit 35 that relate to the model in question.

Possible image transformations include the Euclidean transformation, the similarity transformation, the affine transformation, and the projective transformation. In the present embodiment, posture estimation under the constraint of the affine transformation is described in detail. The affine transformation extends the similarity transformation (translation and rotation, that is, the Euclidean transformation, plus scaling) by also allowing shear deformation. It preserves geometric properties: points that lie on a straight line in the original figure still lie on a straight line after the transformation, and parallel lines remain parallel. At least three candidate corresponding feature point pairs are required to determine the affine transformation parameters.

As stated above, the affine transformation parameters cannot be determined unless there are at least three candidate corresponding feature point pairs. Therefore, after selecting one unprocessed model in step S154, the model posture estimation unit 36 determines in step S155 whether there are three or more candidate corresponding feature point pairs. If there are two or fewer, the model posture estimation unit 36 outputs "unrecognizable" in step S156, on the grounds that the model is not present in the object image (input image) or that model posture detection has failed. If there are three or more, the model posture can be detected, so the model posture estimation unit 36 estimates the affine transformation parameters. For this purpose it performs a coordinate conversion in step S157: the model feature point position coordinates of the candidate corresponding feature point pairs are converted into position coordinates on the original model image, and the object image feature point position coordinates are converted into position coordinates on the original object image. Then, in step S158, the model posture estimation unit 36 performs the posture estimation processing.

The affine transformation parameters are now described. The affine transformation from a model feature point [x y]^T to an object feature point [u v]^T is given by the following equation (13).
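The body of equation (13) is not reproduced in this text. The standard form of a two-dimensional affine transformation, consistent with the parameters a₁, …, a₄ and [b₁ b₂]^T defined for it, would be as follows; the placement of a₁ to a₄ within the matrix is an assumption:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \tag{13}$$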

In equation (13), a_i (i = 1, …, 4) are the parameters that determine rotation, scaling, and shear deformation, and [b₁ b₂]^T is the translation parameter. Since there are six affine transformation parameters to determine, a₁, …, a₄ and b₁, b₂, they can be determined if three candidate corresponding feature point pairs are available.

Let the pair group P consisting of three candidate corresponding feature point pairs be ([x₁ y₁]^T, [u₁ v₁]^T), ([x₂ y₂]^T, [u₂ v₂]^T), and ([x₃ y₃]^T, [u₃ v₃]^T). The relationship between the pair group P and the affine transformation parameters can then be expressed by the linear system shown in the following equation (14).
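The body of equation (14) is not reproduced in this text. Writing u = a₁x + a₂y + b₁ and v = a₃x + a₄y + b₂ for each of the three pairs gives a system of the following shape; the ordering of the unknowns and of the rows is an assumption and may differ from the original figure:

$$\begin{bmatrix} x_1 & y_1 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_1 & y_1 & 0 & 1 \\ x_2 & y_2 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_2 & y_2 & 0 & 1 \\ x_3 & y_3 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_3 & y_3 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} u_1 \\ v_1 \\ u_2 \\ v_2 \\ u_3 \\ v_3 \end{bmatrix} \tag{14}$$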

Rewriting equation (14) as A x_V = b_V (the subscript V indicates that the symbol it is attached to, for example the x in x_V, is a vector; the same applies below), the least squares solution for the affine transformation parameters x_V is given by the following equation (15).

x_V = A⁻¹ b_V ... (15)
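As a minimal sketch of how three pairs determine the six parameters: assuming the transformation u = a₁x + a₂y + b₁, v = a₃x + a₄y + b₂, the 6×6 system splits into two independent 3×3 systems, one for (a₁, a₂, b₁) and one for (a₃, a₄, b₂). The code below solves these by Cramer's rule rather than by inverting A as in equation (15); the function names are hypothetical and not taken from the source.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Solve the 3x3 system m @ t = rhs by Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate (collinear) model points")
    sol = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]  # replace one column by the right-hand side
        sol.append(det3(mc) / d)
    return sol


def affine_from_three_pairs(pairs):
    """pairs: three ((x, y), (u, v)) tuples -> (a1, a2, a3, a4, b1, b2)."""
    m = [[x, y, 1.0] for (x, y), _ in pairs]
    a1, a2, b1 = solve3(m, [u for _, (u, _v) in pairs])
    a3, a4, b2 = solve3(m, [v for _, (_u, v) in pairs])
    return a1, a2, a3, a4, b1, b2
```

If the three model points are collinear the system is singular, so such a pair group cannot yield parameters and would have to be redrawn.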

When pair groups P containing one or more outliers are repeatedly selected at random from the candidate corresponding feature point pair group, the resulting affine transformation parameters are projected to scattered locations in the parameter space. In contrast, when pair groups P consisting only of inliers are repeatedly selected at random, the resulting affine transformation parameters are all very similar to the true affine transformation parameters of the model posture, that is, they lie close together in the parameter space. Therefore, if the process of randomly selecting a pair group P from the candidate corresponding feature point pair group and projecting its affine transformation parameters onto the parameter space is repeated, the inliers form a dense cluster (one with many members) in the parameter space, while the outliers appear scattered. In other words, if clustering is performed in the parameter space, the elements of the cluster with the largest number of members are the inliers.

The posture estimation process in the model posture estimation unit 36 is described in detail with reference to the flowchart of FIG. 19. The NN (Nearest Neighbor) method is used as the clustering method in the model posture estimation unit 36. Because the parameters b₁ and b₂ described above can take a wide range of values depending on the recognition target image, clustering in the full x_V space would make the choice of clustering threshold depend on the recognition target. The model posture estimation unit 36 therefore performs clustering only in the parameter space defined by the parameters a₁, …, a₄ (hereinafter written a_V), under the assumption that "there are almost no pair groups P that yield affine transformation parameters whose a₁, …, a₄ are similar to the true parameters but whose b₁, b₂ differ." Even if a situation arises in which this assumption does not hold, the problem can easily be avoided by clustering in the parameter space defined by b₁ and b₂ independently of the a_V space and taking that result into account.

First, in step S201, the model posture estimation unit 36 performs initialization. Specifically, it sets the count value cnt, a variable representing the number of iterations, to cnt = 1, randomly selects three pairs from the candidate corresponding feature point pair group as pair group P₁, and computes the affine transformation parameters a_V1. It also sets the variable N representing the number of clusters to N = 1 and creates a cluster Z₁ centered on a_V1 in the affine transformation parameter space a_V. It sets the centroid c_V1 of cluster Z₁ to c_V1 = a_V1, sets the variable nz₁ representing the number of members of the cluster to nz₁ = 1, and updates the counter value cnt to cnt = 2.

Next, in step S202, the model posture estimation unit 36 randomly selects three pairs from the candidate corresponding feature point pair group as pair group P_cnt and computes the affine transformation parameters a_Vcnt. It then projects the computed affine transformation parameters a_Vcnt onto the parameter space.

Next, in step S203, the model posture estimation unit 36 clusters the affine transformation parameter space by the NN method. Specifically, it first finds, according to the following equation (16), the minimum distance d_min among the distances d(a_Vcnt, c_Vi) between the affine transformation parameters a_Vcnt and the centroid c_Vi (i = 1, …, N) of each cluster Z_i.

d_min = min_{1≤i≤N} { d(a_Vcnt, c_Vi) } ... (16)

Then, for a predetermined threshold τ (for example, τ = 0.1), if d_min < τ, the model posture estimation unit 36 assigns a_Vcnt to the cluster Z_i that gives d_min and updates the centroid c_Vi of cluster Z_i using all of its members, including a_Vcnt. The number of members nz_i of cluster Z_i is also updated to nz_i = nz_i + 1. If, on the other hand, d_min ≥ τ, the model posture estimation unit 36 creates a new cluster Z_N+1 in the affine transformation parameter space a_V with a_Vcnt as its centroid c_VN+1, sets the number of members of that cluster to nz_N+1 = 1, and sets the number of clusters to N = N + 1.
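The assignment rule of step S203 can be sketched as follows. This is an illustration only: the source does not define the distance d, so Euclidean distance over (a₁, …, a₄) is assumed here, and the function and argument names are hypothetical.

```python
import math


def nn_cluster_step(a, centroids, members, tau=0.1):
    """Assign parameter vector `a` to the nearest cluster or open a new one.

    centroids: list of centroid vectors (mutated in place)
    members:   list of member lists, parallel to centroids (mutated in place)
    Returns the index of the cluster that received `a`.
    """
    # Euclidean distance is an assumption; the source leaves d unspecified.
    dists = [math.dist(a, c) for c in centroids]
    i = min(range(len(dists)), key=dists.__getitem__) if dists else -1
    if i >= 0 and dists[i] < tau:
        members[i].append(list(a))
        n = len(members[i])
        # Recompute the centroid over all members, including the new one.
        centroids[i] = [sum(m[k] for m in members[i]) / n for k in range(len(a))]
        return i
    # d_min >= tau (or no cluster yet): start a new cluster centered on `a`.
    centroids.append(list(a))
    members.append([list(a)])
    return len(centroids) - 1
```

Calling the function with empty lists reproduces the initialization of step S201, since the first vector always opens cluster Z₁.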

Subsequently, in step S204, the model posture estimation unit 36 determines whether the termination condition for the iteration is satisfied. The termination condition can be set, for example, so that iteration ends when the largest number of members exceeds a predetermined threshold (for example, 15) and the difference between the largest and second-largest numbers of members exceeds a predetermined threshold (for example, 3), or when the count value cnt of the iteration counter exceeds a predetermined threshold (for example, 5000). If it is determined in step S204 that the termination condition is not satisfied (No), the model posture estimation unit 36 sets the count value cnt to cnt = cnt + 1 in step S205, returns the process to step S202, and repeats the subsequent processing.
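The example termination condition of step S204 can be written as a small predicate. The default values are the examples given above (15, 3, 5000); the function and parameter names are hypothetical.

```python
def should_stop(member_counts, cnt, min_top=15, min_gap=3, max_iter=5000):
    """Return True when the iteration of steps S202 to S205 should end.

    member_counts: number of members nz_i of every cluster
    cnt:           current value of the iteration counter
    """
    counts = sorted(member_counts, reverse=True)
    top = counts[0] if counts else 0
    second = counts[1] if len(counts) > 1 else 0
    dominant = top > min_top and (top - second) > min_gap
    return dominant or cnt > max_iter
```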

If, on the other hand, it is determined in step S204 that the termination condition is satisfied (Yes), then in step S206 the model posture estimation unit 36 proceeds as follows. If fewer than three inlier pairs were obtained by the above processing, the affine transformation parameters cannot be determined, so it outputs "recognition target model not detected" as the recognition result. If three or more inlier pairs were extracted, it estimates the affine transformation parameters that determine the model posture from the inliers by the least squares method and outputs them as the recognition result.

Let the inliers be ([x_IN1 y_IN1]^T, [u_IN1 v_IN1]^T), ([x_IN2 y_IN2]^T, [u_IN2 v_IN2]^T), and so on. The relationship between the inliers and the affine transformation parameters can then be expressed by the linear system shown in the following equation (17).

Rewriting equation (17) as A_IN x_VIN = b_VIN, the least squares solution for the affine transformation parameters x_VIN is given by the following equation (18).

x_VIN = (A_IN^T A_IN)⁻¹ A_IN^T b_VIN ... (18)
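Equation (18) is the normal-equation form of least squares. The following sketch builds A_IN and b_VIN from a list of inlier pairs and solves (A_IN^T A_IN) x = A_IN^T b_VIN by Gauss-Jordan elimination. The row layout of A_IN and the parameter order (a₁, a₂, a₃, a₄, b₁, b₂) are assumptions, since equation (17) is not reproduced in this text, and the function names are hypothetical.

```python
def solve(mat, rhs):
    """Solve mat @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(mat, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def affine_least_squares(inliers):
    """inliers: list of ((x, y), (u, v)) -> [a1, a2, a3, a4, b1, b2]."""
    rows, b = [], []
    for (x, y), (u, v) in inliers:
        rows.append([x, y, 0.0, 0.0, 1.0, 0.0])  # u = a1*x + a2*y + b1
        b.append(u)
        rows.append([0.0, 0.0, x, y, 0.0, 1.0])  # v = a3*x + a4*y + b2
        b.append(v)
    # Normal equations: (A^T A) x = A^T b, as in equation (18).
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    atb = [sum(r[i] * bi for r, bi in zip(rows, b)) for i in range(6)]
    return solve(ata, atb)
```

With exactly three non-collinear inliers this reduces to the exact solution; with more, it averages out the residual error across all inlier pairs.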

In step S206, the model posture estimation unit 36 outputs the model posture determined by the affine transformation parameters x_VIN as the model recognition result.

Returning to FIG. 15, after the processing of step S158 or step S156, the model posture estimation unit 36 determines in step S159 whether all models have been processed. If a model that has not yet been processed remains, the process returns to step S154 and the subsequent processing is repeated. If it is determined in step S159 that all models have been processed, the process ends.

The processing of steps S154 through S159 of FIG. 15 described above is performed for each recognition target model. This processing is shown schematically in FIG. 20. In this example, three candidate corresponding feature point pairs p1, p3, and p4 are first selected at random from the candidate corresponding feature point pairs p1 through p6, and the affine parameters computed from them are projected onto the parameter space. Next, three candidate corresponding feature point pairs p3, p4, and p6 are selected at random, and the affine parameters computed from them are projected onto the parameter space. The same processing is repeated further; in this example, the three candidate corresponding feature point pairs p5, p4, and p1 are selected, and the affine parameters computed from them are projected onto the parameter space. Affine parameters that lie close together in the parameter space are then clustered, and the model posture is determined by applying the least squares method to the clustered affine transformation parameters.

With the above method, even when the candidate corresponding feature point pair group contains many outliers, the outliers can be excluded and the posture can be estimated (the transformation parameters derived) with high accuracy.

The above embodiment has described posture estimation under the affine transformation constraint in detail. Under the affine transformation constraint, for a three-dimensional object dominated by planar regions, such as a box or a book, posture estimation that is robust to viewpoint changes with respect to the dominant plane is possible. However, robust posture estimation of a three-dimensional object dominated by curved or uneven surfaces requires extending the affine transformation constraint to a projective transformation constraint. Even in that case, the above method can be extended straightforwardly; only the dimensionality of the transformation parameters to be estimated increases.

The model posture determined in this way is indicated, for example, by the broken lines in FIG. 16 and FIG. 18. As these figures show, the present embodiment not only detects whether a detection target object corresponding to the model image is present, but also, when the detection target object is present, estimates and outputs its posture.

Note that the model posture estimated by the model posture estimation unit 36 is a posture relative to the detection target object in the object image. Taking the model posture as the reference posture, this means that the model posture estimation unit 36 estimates the posture of the detection target object relative to the model image.

In the above description, the threshold τ is assumed to be a constant. However, when performing the iterative processing of steps S203 through S206, a technique resembling the so-called "annealing method" may be applied: a relatively large threshold τ is used at first to extract inliers roughly, and a progressively smaller threshold τ is used as the number of iterations increases. This allows inliers to be extracted accurately.
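One possible shape for such a shrinking threshold is a geometric decay with a floor. The source does not specify a schedule, so the starting value, decay rate, and function name below are all hypothetical; only the floor of 0.1 echoes the example value of τ given earlier.

```python
def annealed_tau(cnt, tau_start=0.3, tau_end=0.1, decay=0.999):
    """Clustering threshold for iteration `cnt` (1-based).

    Starts at tau_start and shrinks geometrically, never dropping below tau_end.
    All default values are illustrative assumptions.
    """
    return max(tau_end, tau_start * decay ** (cnt - 1))
```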

Also, in the above description, the process of randomly selecting a pair group P from the candidate corresponding feature point pair group and projecting its affine transformation parameters onto the parameter space is repeated, the elements of the cluster with the largest number of members in the parameter space are taken as inliers, and the affine transformation parameters that determine the model posture are estimated by the least squares method. The invention is not limited to this; for example, the centroid of the cluster with the largest number of members may itself be used as the affine transformation parameters that determine the model posture. Furthermore, each set may consist of three or more feature points.

As described above, during object recognition on the object image, the multi-resolution generation unit 31, the feature point extraction unit 32, the feature amount extraction unit 33, the feature amount comparison unit 35, and the model posture estimation unit 36 process the object image sequentially each time it is updated, so real-time object recognition is possible. In addition, the feature point pairs extracted by the feature amount comparison unit 35 are classified by model, and the model posture estimation unit 36 estimates the posture for each model, so model objects can be recognized even in an input image that contains multiple model objects.

The larger the proportion of outliers in the candidate corresponding feature point pair group generated by the feature amount comparison unit 35, the lower the probability that the model posture estimation unit 36 selects inliers, and the more iterations are needed to estimate the model posture, which increases the computation time. It is therefore desirable to remove as many outliers as possible from the candidate corresponding feature point pair group input to the model posture estimation unit 36. To this end, in the image processing apparatus 1 of the present embodiment, a corresponding feature point pair narrowing unit 61 that narrows down the corresponding feature point pairs can be inserted between the feature amount comparison unit 35 and the model posture estimation unit 36, as shown in FIG. 21. The corresponding feature point pair narrowing unit 61 performs a generalized Hough transform for each model under an image transformation constraint with fewer parameter dimensions than the constraint assumed in the model posture estimation process of the model posture estimation unit 36 (for example, if the model posture estimation unit 36 uses the affine transformation constraint, the corresponding feature point pair narrowing unit 61 uses the similarity transformation constraint). In this way it obtains a coarse estimate of the image transformation parameters and extracts the group of corresponding feature point pairs that support it.
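A rough calculation illustrates why the outlier ratio matters. Assuming, as a simplification not stated in the source, that the three pairs are drawn independently, the probability that all three are inliers is (1 − ε)³ for an outlier ratio ε, and the expected number of draws needed to obtain one all-inlier pair group is the reciprocal of that probability. The function name is hypothetical.

```python
def expected_draws(outlier_ratio, sample_size=3):
    """Expected number of random pair groups drawn until one is all inliers.

    Treats each of the `sample_size` picks as independent, which is an
    approximation when sampling without replacement from a finite set.
    """
    p_all_inliers = (1.0 - outlier_ratio) ** sample_size
    return 1.0 / p_all_inliers
```

Under this approximation, a candidate set that is half outliers needs on average eight draws per all-inlier pair group, whereas a set with no outliers needs one.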

Next, the process by which the corresponding feature point pair narrowing unit 61 narrows down the candidate corresponding feature point pairs when the similarity transformation constraint is used as the image transformation constraint is described with reference to the flowchart of FIG. 22. A similarity transformation consists of a translation (on a two-dimensional image, a displacement dX along the X axis and a displacement dY along the Y axis) and a rotation (θ), combined with a scaling (scl).

In step S301, the corresponding feature point pair narrowing unit 61 defines a range for each axis of the four-dimensional parameter space defined by the similarity transformation parameters dX, dY, θ, and scl, and divides each range into bins of a specified step width, thereby creating the voting space of the generalized Hough transform. For example, if the dX and dY axes each have a range of ±200 and a step width of 20 (in pixels), the θ axis has a range of 0 to 360 and a step width of 15 (in degrees), and the scl axis has a range of 0.4 to 1.6 and a step width of 0.3 (as a scaling ratio), this yields 20 × 20 × 24 × 5 = 48,000 voting bins.
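The bin count for the example ranges can be checked by counting bins on each of the four axes and multiplying. In this sketch the dX, dY, and θ axes are treated as half-open ranges, while the scl axis counts both endpoints so that it yields the five representative values 0.4, 0.7, 1.0, 1.3, and 1.6; that distinction, and the helper name, are assumptions made for illustration. Note that the product must include both the dX and the dY axis.

```python
def axis_bins(lo, hi, step, closed=False):
    """Number of bins on one axis; `closed` counts a bin at both endpoints."""
    n = round((hi - lo) / step)
    return n + 1 if closed else n


n_dx = axis_bins(-200, 200, 20)                # 20 bins of 20 pixels
n_dy = axis_bins(-200, 200, 20)                # 20 bins of 20 pixels
n_theta = axis_bins(0, 360, 15)                # 24 bins of 15 degrees
n_scl = axis_bins(0.4, 1.6, 0.3, closed=True)  # 0.4, 0.7, 1.0, 1.3, 1.6

total = n_dx * n_dy * n_theta * n_scl
print(n_dx, n_dy, n_theta, n_scl, total)
```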

The processing from step S302 onward is performed for each model. In step S302, the corresponding feature point pair narrowing unit 61 initializes the voting space and selects an unprocessed model. That is, for example, the value of every voting bin is set to 0 and one model is selected.

Steps S303 through S309 are performed for each candidate corresponding feature point pair of the selected model. In step S303, therefore, the corresponding feature point pair narrowing unit 61 selects one unprocessed candidate corresponding feature point pair as the processing target.

Subsequently, in step S304, the corresponding feature point pair narrowing unit 61 obtains the rotation angle θ. For example, suppose that the canonical direction of feature point P in the model image is D_P, as shown in FIG. 23A, and that the canonical direction of feature point Q in the object image, which is paired with feature point P, is D_Q, as shown in FIG. 23B. Then, as shown in FIG. 24, the difference between the canonical directions D_Q and D_P obtained when feature points P and Q are placed at corresponding positions is the rotation angle θ.
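A minimal sketch of this step, assuming both canonical directions are given in degrees and that θ is taken as D_Q minus D_P wrapped into the range 0 to 360 used for the θ axis of the voting space. The sign convention and the function name are assumptions.

```python
def rotation_angle(d_p_deg, d_q_deg):
    """Rotation angle in [0, 360) from model direction D_P to object direction D_Q."""
    return (d_q_deg - d_p_deg) % 360.0
```

The modulo keeps the result on the θ axis even when the raw difference is negative.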

Steps S305 through S309 are performed for each representative scl value of the bins defined on the scl axis (in the above example, the representative scl values of the five bins on the scl axis are 0.4, 0.7, 1.0, 1.3, and 1.6). The corresponding feature point pair narrowing unit 61 therefore first selects one unprocessed scl value in step S305.

Next, in step S306, the corresponding feature point pair narrowing unit 61 obtains the translation amounts dX and dY for the case where the model image is enlarged or reduced by the selected scl value, as shown in FIG. 25. The translation amounts dX and dY can be derived by the following equations (19) and (20) from the rotation angle θ obtained in step S304, the coordinates Xi, Yi of feature point Q on the object image, and the polar coordinates of feature point P on the model image, namely its distance γ from the origin and its angle α from the axis.

dX = Xi − scl · γ · cos(α + θ) ... (19)
dY = Yi − scl · γ · sin(α + θ) ... (20)
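Equations (19) and (20) translate directly into code. The sketch below assumes that α and θ are given in degrees, matching the θ axis of the voting space; the function and argument names are hypothetical.

```python
import math


def translation(xi, yi, gamma, alpha_deg, theta_deg, scl):
    """Translation (dX, dY) per equations (19) and (20).

    xi, yi:    coordinates of feature point Q on the object image
    gamma:     distance of model feature point P from the origin
    alpha_deg: angle of P from the axis, in degrees (unit is an assumption)
    theta_deg: rotation angle from step S304, in degrees
    scl:       representative scaling value being tested
    """
    ang = math.radians(alpha_deg + theta_deg)
    dx = xi - scl * gamma * math.cos(ang)
    dy = yi - scl * gamma * math.sin(ang)
    return dx, dy
```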

Now that the four parameters that determine the similarity transformation (scl, θ, dX, dY) have been obtained, the corresponding feature point pair narrowing unit 61 casts a vote for them in the voting space in step S307. Specifically, a vote is added to the bin of the voting space that contains these parameters.

Following the branches of steps S308 and S309, the corresponding feature point pair narrowing unit 61 performs the above processing from step S303 for all corresponding feature point pairs of the model and for all scl values. That is, in step S308 the corresponding feature point pair narrowing unit 61 determines whether all scl values have been processed; if an unprocessed scl value remains, it returns the process to step S305 and repeats the subsequent processing. If it is determined in step S308 that all scl values have been processed, the corresponding feature point pair narrowing unit 61 determines in step S309 whether all corresponding feature point pairs have been processed; if an unprocessed pair remains, it returns the process to step S303 and repeats the subsequent processing.

If it is determined in step S309 that all corresponding feature point pairs have been processed, the corresponding feature point pair narrowing unit 61 finds, in step S310, the bin that received the most votes in the voting of step S307, together with its vote count. In step S311, the corresponding feature point pair narrowing unit 61 determines whether the maximum vote count obtained in step S310 is greater than or equal to a threshold that is set in advance. If the maximum vote count is greater than or equal to the threshold, the corresponding feature point pair narrowing unit 61 determines in step S312 that the model is present and outputs the candidate corresponding feature point pairs that voted for the winning bin. That is, the similarity transformation parameters represented by the winning bin give the estimated posture of the model on the object image under the similarity transformation constraint, and the candidate corresponding feature point pairs that voted for the winning bin are the inliers supporting that estimated posture (possibly including a very small number of outliers). If it is determined in step S311 that the maximum vote count is not greater than or equal to the threshold, the corresponding feature point pair narrowing unit 61 determines in step S313 that the model is absent. For a model that is not present in the object image, the votes are scattered, so the maximum vote count in that model's generalized Hough transform result stays below the threshold. In this case, it is determined that the model was not detected.
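The peak search and threshold test of steps S310 through S313 can be sketched with a dictionary keyed by bin index. This is an illustration under assumed data structures: each vote is a (bin_key, pair_id) tuple, where bin_key is a tuple of bin indices along the four axes; none of these names come from the source.

```python
from collections import defaultdict


def hough_peak(votes, min_votes):
    """Find the winning bin and its supporting pairs.

    votes:     iterable of (bin_key, pair_id) tuples, one per vote cast
    min_votes: threshold on the maximum vote count
    Returns (bin_key, pair_ids) if the model is judged present, else None.
    """
    bins = defaultdict(list)
    for key, pair_id in votes:
        bins[key].append(pair_id)
    if not bins:
        return None
    best = max(bins, key=lambda k: len(bins[k]))
    if len(bins[best]) >= min_votes:
        return best, bins[best]  # model present: pairs that support the peak
    return None                  # model absent: votes too scattered
```

A sparse dictionary is used instead of a dense four-dimensional array because most bins receive no votes.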

The processing of steps S302 through S313 described above is performed for all recognition target models, following the branch of step S314. That is, after the processing of step S312 or step S313, the corresponding feature point pair narrowing unit 61 determines in step S314 whether all models have been processed. If a model that has not yet been processed remains, the process returns to step S302 and the subsequent processing is repeated. If it is determined in step S314 that all models have been processed, the process of narrowing down the candidate corresponding feature point pairs ends.

For each model determined to be present in step S312, the inliers for that model are supplied to the model posture estimation unit 36, which performs a more rigorous estimation of the model posture by the method described earlier.

In the processing by the corresponding feature point pair narrowing unit 61 detailed above, a coarse estimate of the image transformation parameters is made under an image transformation constraint with fewer parameter dimensions than the constraint assumed in the model posture estimation process of the model posture estimation unit 36, so the candidate corresponding feature point pairs can be narrowed down quickly. Because the model posture estimation unit 36 then estimates the image transformation parameters from the narrowed-down feature point pairs, the image transformation parameters can be estimated with high accuracy faster than when the corresponding feature point pair narrowing unit 61 is not inserted before the model posture estimation unit 36. Reducing the dimensionality of the image transformation parameters in the corresponding feature point pair narrowing unit 61 introduces an error between the actual transformation and the estimated transformation, but this problem is resolved by making each bin of the voting space large enough to tolerate that error.

As described above, the image processing apparatus 1 of the present embodiment makes the following possible.

1. Because a group of resolution images is constructed, noise is removed as a result, and recognition that is robust to noise becomes possible.
2. Because feature quantities in which the canonical direction of the local-region density gradient at a feature point is normalized to 0 degrees are used, recognition that is robust to rotation becomes possible.
3. Because direction information, rather than the intensity information of the local-region density gradient that is easily affected by brightness changes, is used for the feature quantities and their matching, recognition that is robust to brightness changes becomes possible.
4. Because feature quantities are matched across all resolutions, recognition that is robust to enlargement and reduction becomes possible.
5. Because an affine constraint and a generalized Hough transform are used after the feature point pairs are extracted, recognition that is robust to affine transformation becomes possible.
6. Because matching between local feature quantities is used, recognition is possible even when the object to be recognized is partially hidden.
7. Because feature point pairs are extracted for each model and the posture is estimated for each model, model objects can be recognized even in an input image that contains a plurality of model objects.
8. During learning, feature points and feature quantities are extracted over a wider scale range with finer scale sampling, while during recognition they are extracted over a coarse scale range with coarse scale sampling, and the feature quantities are compared by a k-NN search using a kd tree. This reduces the computation cost of recognition without lowering recognition accuracy. That is, objects can be recognized accurately in real time.
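The k-NN search over a kd tree mentioned in item 8 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the construction used by the kd tree construction unit 34: the squared Euclidean distance, the dictionary-based node layout, and the names `build_kdtree` and `knn` are hypothetical, and labels are assumed to be strings.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a kd tree from a list of (vector, label) entries."""
    if not points:
        return None
    axis = depth % len(points[0][0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2  # median entry becomes the splitting node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def knn(node, query, k, heap=None):
    """Return up to k (squared_distance, label) entries, nearest first."""
    top = heap is None
    if top:
        heap = []
    if node is not None:
        vec, label = node["point"]
        d2 = sum((a - b) ** 2 for a, b in zip(vec, query))
        # Max-heap on distance, emulated by storing negated distances.
        if len(heap) < k:
            heapq.heappush(heap, (-d2, label))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, label))
        diff = query[node["axis"]] - vec[node["axis"]]
        if diff < 0:
            near, far = node["left"], node["right"]
        else:
            near, far = node["right"], node["left"]
        knn(near, query, k, heap)
        # Visit the far side only if the splitting plane is closer than
        # the current k-th best candidate.
        if len(heap) < k or diff * diff < -heap[0][0]:
            knn(far, query, k, heap)
    if top:
        return sorted((-neg_d2, lbl) for neg_d2, lbl in heap)
    return heap
```

A possible usage, with made-up two-dimensional vectors standing in for model feature quantities:

```python
model = [((0.0, 0.0), "a"), ((5.0, 5.0), "d"), ((4.0, 4.5), "e")]
tree = build_kdtree(model)
nearest = knn(tree, (4.2, 4.4), k=2)  # entries for "e" and then "d"
```

The tree is built once from the model feature quantities, so each query from the input image avoids a full linear scan in the typical case.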

The series of processes described above can be executed by hardware or by software. In this case, for example, the image processing apparatus 1 is configured by a personal computer as shown in FIG. 26.

In FIG. 26, a CPU (Central Processing Unit) 121 executes various processes according to a program stored in a ROM (Read Only Memory) 122 or a program loaded from a storage unit 128 into a RAM (Random Access Memory) 123. The RAM 123 also stores, as appropriate, data that the CPU 121 needs in order to execute the various processes.

The CPU 121, the ROM 122, and the RAM 123 are connected to one another via a bus 124. An input/output interface 125 is also connected to the bus 124.

Connected to the input/output interface 125 are an input unit 126 including a keyboard and a mouse; an output unit 127 including a display such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display) and a speaker; a storage unit 128 including a hard disk; and a communication unit 129 including a modem. The communication unit 129 performs communication processing via networks including the Internet.

A drive 130 is also connected to the input/output interface 125 as necessary. A removable medium 131 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive as appropriate, and a computer program read from it is installed in the storage unit 128 as necessary.

When the series of processes is executed by software, the program constituting the software is installed from a network or a recording medium onto a computer built into dedicated hardware, or onto, for example, a general-purpose personal computer that can execute various functions once various programs are installed.

As shown in FIG. 26, this recording medium is constituted not only by the removable medium 131 on which the program is recorded and which is distributed separately from the apparatus main body to provide the program to users, such as a magnetic disk (including a floppy disk), an optical disk (including a CD-ROM (Compact Disk-Read Only Memory) and a DVD (Digital Versatile Disk)), a magneto-optical disk (including an MD (Mini-Disk)), or a semiconductor memory, but also by the ROM 122 on which the program is recorded or the hard disk included in the storage unit 128, which are provided to users already built into the apparatus main body.

In this specification, the steps describing the program recorded on the recording medium include not only processes performed in time series in the described order, but also processes that are executed in parallel or individually and not necessarily in time series.

Further, in this specification, a system refers to an entire apparatus constituted by a plurality of devices.

The present invention can be applied to a robot apparatus.

FIG. 1 is a block diagram showing a configuration example of an image processing apparatus to which the present invention is applied.
FIG. 2 is a flowchart explaining the learning process of the learning unit of FIG. 1.
FIG. 3 is a flowchart explaining the learning process of the learning unit of FIG. 1.
FIG. 4 is a diagram explaining resolution images.
FIG. 5 is a diagram explaining the scale space of a DoG filter.
FIG. 6 is a diagram explaining the density gradient directions in the vicinity of a feature point.
FIG. 7 is a diagram explaining how the frequencies of a histogram are calculated.
FIG. 8 is a diagram showing an example of a direction histogram.
FIG. 9 is a diagram showing an example of a direction histogram.
FIG. 10 is a diagram showing an example of a direction histogram.
FIG. 11 is a diagram explaining the feature quantity extraction process.
FIG. 12 is a diagram showing an example of resampling.
FIG. 13 is a flowchart explaining the recognition process of the recognition unit of FIG. 1.
FIG. 14 is a flowchart explaining the recognition process of the recognition unit of FIG. 1.
FIG. 15 is a flowchart explaining the recognition process of the recognition unit of FIG. 1.
FIG. 16 is a diagram explaining the multi-resolution at learning time and at recognition time.
FIG. 17 is a diagram explaining the feature quantity comparison process.
FIG. 18 is a diagram explaining inliers and outliers.
FIG. 19 is a flowchart explaining the details of the posture estimation process.
FIG. 20 is a diagram explaining the posture estimation process of FIG. 19.
FIG. 21 is a block diagram showing another configuration example of an image processing apparatus to which the present invention is applied.
FIG. 22 is a flowchart explaining the process of the corresponding feature point pair narrowing unit of FIG. 21.
FIG. 23 is a diagram explaining corresponding feature points of a model image and an object image.
FIG. 24 is a diagram explaining the rotation conversion angle of feature points of a model image and an object image.
FIG. 25 is a diagram explaining the amount of translation.
FIG. 26 is a block diagram showing a configuration example of a personal computer.

Explanation of symbols

1 image processing apparatus, 11 learning unit, 12 recognition unit, 21 multi-resolution generation unit, 22 feature point extraction unit, 23 feature quantity extraction unit, 24 model dictionary registration unit, 31 multi-resolution generation unit, 32 feature point extraction unit, 33 feature quantity extraction unit, 34 kd tree construction unit, 35 feature quantity comparison unit, 36 model posture estimation unit

Claims

An image processing apparatus that recognizes a pre-registered model image from an input image, comprising:
multi-resolution image generation means for generating a multi-resolution image composed of a plurality of images of different resolutions by reducing the resolution of the input image at a predetermined rate;
feature point extraction means for extracting feature points for each resolution image of the multi-resolution image;
feature quantity extraction means for extracting at least two local feature quantities at the feature points;
comparison means for comparing the feature quantities of the input image with the feature quantities of the model image, and generating a candidate corresponding feature point set as a set of feature points having similar feature quantities; and
posture estimation means for estimating the posture of the input image based on the candidate corresponding feature point set.

The image processing apparatus according to claim 1, wherein the feature quantity extraction means extracts a density gradient direction histogram in the vicinity of the feature point as a first type of feature quantity, and extracts a dimensionally degenerated density gradient vector as a second type of feature quantity.

The image processing apparatus according to claim 2, wherein the comparison means generates the candidate corresponding feature point set by performing a k-Nearest Neighbor (k-NN) search, based on the feature quantities of the input image, over the feature quantity group of the model image, which is organized as a kd tree for each type.

The image processing apparatus according to claim 3, wherein the comparison means sets, as the candidate corresponding feature point set, the feature points having feature quantities extracted in common for each type.

The image processing apparatus according to claim 4, further comprising narrowing means for narrowing down the candidate corresponding feature point set by voting it onto a parameter space defined by image conversion parameters that determine the position and posture of the model image, and comparing the maximum number of votes with a threshold value,
wherein the posture estimation means estimates the posture of the input image based on the narrowed-down candidate corresponding feature point set.

The image processing apparatus according to claim 1, wherein the posture estimation means projects onto a parameter space the image conversion parameters that determine the position and posture of the model image and that are determined by N randomly selected sets of candidate corresponding feature points, and outputs, as a recognition result for recognizing the model image, the image conversion parameters determining the position and posture of the model image that are obtained by the least squares method from the members of a cluster formed on the parameter space.

The image processing apparatus according to claim 6, wherein the posture estimation means detects the centroid of the cluster having the largest number of members, and outputs the image conversion parameters determining the position and posture of the model image that are given by the centroid as a recognition result for recognizing the model image.

The image processing apparatus according to claim 1, wherein the multi-resolution image generation means generates the multi-resolution image with coarser precision than in the case of learning.

An image processing method of an image processing apparatus that recognizes a pre-registered model image from an input image, comprising:
a multi-resolution image generation step of generating a multi-resolution image composed of a plurality of images of different resolutions by reducing the resolution of the input image at a predetermined rate;
a feature point extraction step of extracting feature points for each resolution image of the multi-resolution image;
a feature quantity extraction step of extracting at least two local feature quantities at the feature points;
a comparison step of comparing the feature quantities of the input image with the feature quantities of the model image, and generating a candidate corresponding feature point set as a set of feature points having similar feature quantities; and
a posture estimation step of estimating the posture of the input image based on the candidate corresponding feature point set.

A recording medium on which is recorded a computer-readable program of an image processing apparatus that recognizes a pre-registered model image from an input image, the program comprising:
a multi-resolution image generation step of generating a multi-resolution image composed of a plurality of images of different resolutions by reducing the resolution of the input image at a predetermined rate;
a feature point extraction step of extracting feature points for each resolution image of the multi-resolution image;
a feature quantity extraction step of extracting at least two local feature quantities at the feature points;
a comparison step of comparing the feature quantities of the input image with the feature quantities of the model image, and generating a candidate corresponding feature point set as a set of feature points having similar feature quantities; and
a posture estimation step of estimating the posture of the input image based on the candidate corresponding feature point set.

A program of an image processing apparatus that recognizes a pre-registered model image from an input image, the program causing a computer to execute:
a multi-resolution image generation step of generating a multi-resolution image composed of a plurality of images of different resolutions by reducing the resolution of the input image at a predetermined rate;
a feature point extraction step of extracting feature points for each resolution image of the multi-resolution image;
a feature quantity extraction step of extracting at least two local feature quantities at the feature points;
a comparison step of comparing the feature quantities of the input image with the feature quantities of the model image, and generating a candidate corresponding feature point set as a set of feature points having similar feature quantities; and
a posture estimation step of estimating the posture of the input image based on the candidate corresponding feature point set.

An image processing apparatus for learning a model image for comparison with an image to be recognized, comprising:
multi-resolution image generation means for generating, by reducing the resolution of the model image at a predetermined rate, a multi-resolution image composed of a plurality of images of different resolutions with finer precision than in the case of recognition;
feature point extraction means for extracting feature points for each resolution image of the multi-resolution image;
feature quantity extraction means for extracting at least two local feature quantities at the feature points; and
a registration unit configured to register the feature quantities of the model image.

The image processing apparatus according to claim 12, wherein the feature quantity extraction means extracts a density gradient direction histogram in the vicinity of the feature point as a first type of feature quantity, and extracts a dimensionally degenerated density gradient vector as a second type of feature quantity.

The image processing apparatus according to claim 13, wherein the registration unit registers the feature amount group of the model image as a kd tree for each type.

An image processing method of an image processing apparatus for learning a model image for comparison with an image to be recognized, comprising:
a multi-resolution image generation step of generating, by reducing the resolution of the model image at a predetermined rate, a multi-resolution image composed of a plurality of images of different resolutions with finer precision than in the case of recognition;
a feature point extraction step of extracting feature points for each resolution image of the multi-resolution image;
a feature quantity extraction step of extracting at least two local feature quantities at the feature points; and
a registration step of registering the feature quantities of the model image.

A recording medium on which is recorded a computer-readable program of an image processing apparatus for learning a model image for comparison with an image to be recognized, the program comprising:
a multi-resolution image generation step of generating, by reducing the resolution of the model image at a predetermined rate, a multi-resolution image composed of a plurality of images of different resolutions with finer precision than in the case of recognition;
a feature point extraction step of extracting feature points for each resolution image of the multi-resolution image;
a feature quantity extraction step of extracting at least two local feature quantities at the feature points; and
a registration step of registering the feature quantities of the model image.

A program of an image processing apparatus for learning a model image for comparison with an image to be recognized, the program causing a computer to execute:
a multi-resolution image generation step of generating, by reducing the resolution of the model image at a predetermined rate, a multi-resolution image composed of a plurality of images of different resolutions with finer precision than in the case of recognition;
a feature point extraction step of extracting feature points for each resolution image of the multi-resolution image;
a feature quantity extraction step of extracting at least two local feature quantities at the feature points; and
a registration step of registering the feature quantities of the model image.