JP2007249592A

JP2007249592A - Three-dimensional object recognition system

Info

Publication number: JP2007249592A
Application number: JP2006071857A
Authority: JP
Inventors: Masahiro Tomono; 正裕友納
Original assignee: Japan Science and Technology Agency
Current assignee: Japan Science and Technology Agency
Priority date: 2006-03-15
Filing date: 2006-03-15
Publication date: 2007-09-27
Anticipated expiration: 2026-03-15
Also published as: JP4709668B2

Abstract

PROBLEM TO BE SOLVED: To provide a three-dimensional object recognition system that recognizes an object from a single image and estimates the camera pose for that image.

SOLUTION: Edge extraction processing 202 extracts image edge points from an input image. Feature vector generation processing 203 generates a feature vector for each image edge point extracted by the edge extraction processing 202. Two-dimensional model matching processing 207 takes as input the image edge point group with feature vectors attached, refers to the object models in object model storage 107, and searches for an object model that has a two-dimensional model whose feature vectors match those of the input image edge point group. It also builds the set of pairs of matched input image edge points and model image edge points. Three-dimensional pose estimation processing 208 takes the object model found by the two-dimensional model matching processing 207 and searches for a camera pose at which the projection of the object's three-dimensional edge model matches the input image edge point group. Finally, the name of the matched object model and the camera pose are output.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to generating an object model used for object recognition from an image sequence captured while moving an image input device (camera), and to recognizing an object in a camera image using that object model and estimating the three-dimensional pose of the object.

Object recognition as addressed by the present invention is a process that takes a camera image as input, matches a group of predefined object models against the input image, and outputs the best-matching object model. If no object model matches, nothing is output. In addition, the three-dimensional pose of the matching object model is estimated. This is particularly important in applications such as object manipulation by robots.
In general, object recognition requires an object model, but generating an object model manually takes a great deal of labor. Manual generation also handles objects with complex shapes poorly and is prone to error.

Many methods for recognizing objects from camera images, including systems for generating object models, have been proposed. Object recognition generally compares the features of an object model with features extracted from the input image to find an object model that matches well. The features used for recognition are either two-dimensional or three-dimensional.
Methods using two-dimensional features include, for example, Patent Document 1 and Non-Patent Document 1. In the method described in Patent Document 1, images of the target object are captured while the camera angle is changed little by little, and features are extracted to form an object model. The method of Non-Patent Document 1 likewise captures images of the target object while changing the camera angle and extracts scale-invariant features to form an object model. Although the feature quantities used for matching differ, both methods use two-dimensional features to find a model image that matches the input image well.

A method using three-dimensional features is described, for example, in Patent Document 2. The method of Patent Document 2 takes stereo images as input and uses, for recognition, three-dimensional features extracted from the three-dimensional shape of the object. Specifically, an object model is generated with edge segments as its constituent units, and the branch points, bend points, inflection points, and transition points of the edge segments in three-dimensional space are used as recognition features. These three-dimensional features are used to check whether the specified object is present in the image. If it is, the three-dimensional object model is coordinate-transformed so that its position agrees with the three-dimensional features of the input image, and the three-dimensional pose of the object is thereby estimated.

Patent Document 1: JP 2003-271929 A
Patent Document 2: Japanese Patent No. 2961264
Non-Patent Document 1: D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60:91-110, 2004.
Non-Patent Document 2: Tomono, "3D edge reconstruction for dense object model generation from image sequences," Proceedings of the 23rd Annual Conference of the Robotics Society of Japan, 2005.
Non-Patent Document 3: Tomono, "Construction of 3D map from monocular camera image sequence based on shape reconstruction with baseline length selection," Proceedings of the 10th Robotics Symposia, pp. 159-164, 2005.
Non-Patent Document 4: J. Canny, "A Computational Approach to Edge Detection," IEEE Trans. PAMI, Vol. 8, No. 6, pp. 679-698, 1986.

The methods using two-dimensional features described above can identify the object shown in an image, but because they use only two-dimensional features, they cannot estimate the accurate three-dimensional pose of the object.
Methods using three-dimensional features, on the other hand, have the following problems. Patent Document 2 uses stereo images to obtain three-dimensional features from the input image, which requires a multi-lens stereo camera and incurs equipment cost and calibration effort. In addition, when the target object is far away, the stereo camera cannot obtain enough parallax for three-dimensional reconstruction, so no meaningful three-dimensional features can be extracted and recognition fails.
An alternative is to obtain three-dimensional features without a stereo camera, by capturing several input images while moving a single monocular camera. However, capturing images while moving the camera appropriately takes time, and so does the three-dimensional reconstruction. This time is not much of a problem for object model generation, but it can be a major drawback for object recognition, which usually has to run in real time while the robot is operating.
An object of the present invention is to address these problems and to perform object recognition and three-dimensional pose estimation stably from a single camera image.

To achieve the above object, the present invention is a three-dimensional object recognition system that performs object recognition from a single input image, comprising: object model storage means for storing, as an object model, a set consisting of an object name, a three-dimensional edge model, and two-dimensional models (for each of a plurality of images: the camera pose, the image edge point group, the feature vector of each image edge point, and the three-dimensional edge point corresponding to each image edge point); image edge extraction means for extracting an image edge point group from the input image; feature vector generation means for generating feature vectors of the image edge points extracted by the image edge extraction means; two-dimensional model matching means for searching for an object model that matches the input image by comparing the feature vectors of the image edge points of the input image obtained by the feature vector generation means with the feature vectors of the image edge points of the object models stored in the object model storage means; and three-dimensional pose estimation means for obtaining a camera pose that increases the degree to which the positions at which the three-dimensional edge points of the three-dimensional edge model of the retrieved object model are projected onto the input image coincide with the positions of the image edge points of the input image. The system outputs the object name and the camera pose for the input image.
In the present invention, an object model consists of two-dimensional image edge points and the three-dimensional edge points obtained by reconstructing them in three dimensions. A large number of image edge points are used as two-dimensional features for stable recognition, while the three-dimensional edge points are used to estimate the three-dimensional pose of the object accurately.

An object model generation system for the three-dimensional object recognition system, which generates the object model used for object recognition from a camera image sequence, may also be configured. It comprises: model image selection means that takes as input an image sequence and the sequence of camera poses at which the images were captured, and selects the model images used for object recognition together with the camera poses at which those model images were captured; edge extraction means that takes a model image as input and extracts an image edge point group; feature vector generation means that generates feature vectors of the image edge points extracted by the edge extraction means; three-dimensional model projection means that takes as input the camera pose corresponding to the model image and the three-dimensional edge model of the object, and computes the projection of each three-dimensional edge point in the three-dimensional edge model onto the model image to obtain projected edge points; and edge point association means that generates the correspondence between three-dimensional edge points and image edge points from the positional relationship between the projected edge points computed by the three-dimensional model projection means and the image edge points extracted by the edge extraction means. The system outputs, as an object model, a set consisting of an object name, a three-dimensional edge model, and two-dimensional models (for each of a plurality of images: the camera pose, the image edge point group, the feature vector of each image edge point, and the three-dimensional edge point corresponding to each image edge point).

When generating the feature vector of an image edge point, the feature vector generation means may obtain a circular region centered on that image edge point, set the radius of the circular region so that the sum of the edge strengths on its circumference is a global or local maximum, and generate the feature vector from the image information contained in the circular region.
A computer program that causes a computer system to implement the above systems is also part of the present invention.

According to the present invention, both two-dimensional edge points (image edge points) and three-dimensional edge points are used, so that object identification and three-dimensional pose estimation can be performed from a single input image. Furthermore, because each edge point is used as an independent feature point, recognition is more stable even for objects with few features.
The object model used by this three-dimensional object recognition system can also be created easily.

Embodiments for carrying out the present invention are described below with reference to the drawings.
<Overview>
The present invention consists of two phases: object model generation and object recognition.
The object model generation phase takes as input an image sequence of the target object captured from a plurality of camera viewpoints, the camera motion during capture, and a three-dimensional edge model reconstructed from the image sequence. It outputs the two-dimensional edge points extracted from images in the sequence and the correspondence between those two-dimensional edge points and the three-dimensional edge points.

This processing requires cutting out only the target object from the image sequence. Various methods for doing so already exist. A simple one is for a person to cut out the object manually on a computer screen using a pointing device such as a mouse.
The image sequence of the target object used here may come from a monocular camera or from a multi-lens stereo camera. The camera motion is the time series of camera poses at which the images of the sequence were captured. If each camera pose is known, edges extracted from the images can be reconstructed in three dimensions on the principle of stereo vision, even with a monocular camera.

Reconstruction of three-dimensional edges with a monocular camera is described, for example, in Non-Patent Documents 2 and 3. There, a small number of highly distinctive feature points are tracked across the images of a monocular camera image sequence, and the camera motion is estimated using the factorization method and back-projection (reprojection) error minimization. Edge points are then associated between images using this camera motion and reconstructed in three dimensions by triangulation to obtain a three-dimensional edge model.
In object model generation, an object model is generated from the image sequence and the camera motion as follows. First, several images are selected from the image sequence as model images, and the edge points of the object are extracted from each model image.
Because the camera pose at which each image was captured is known, projecting the three-dimensional edge model onto an image according to that camera pose should make the projections of the three-dimensional edge points nearly coincide with the two-dimensional edge points. This allows the two-dimensional edge points of each image to be associated with three-dimensional edge points. In addition, a feature vector for identifying each edge point is generated from the region around it.
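
The triangulation step mentioned above can be illustrated with a minimal sketch. It uses standard linear (DLT) triangulation, which is one common way to recover a 3D point from known camera poses and is not necessarily the procedure of Non-Patent Documents 2 and 3. The function names `triangulate` and `project` and the use of 3x4 projection matrices are assumptions made for illustration.

```python
import numpy as np

def triangulate(projections, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    projections: list of 3x4 camera projection matrices (assumed known).
    pixels: list of (u, v) observations of the same edge point, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point X with a 3x4 projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```

With noise-free observations from two displaced cameras, `triangulate` recovers the original point. With real edge points the result depends on the baseline between views, which is why distant objects are hard to reconstruct.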

In the three-dimensional object recognition phase, object recognition uses both the two-dimensional and the three-dimensional edge points obtained as described above. First, the object is recognized at the image level using the two-dimensional edge points; then the three-dimensional pose of the object is estimated using the two-dimensional and three-dimensional edge points.
In image-level recognition, feature vectors are generated for the two-dimensional edge points of the input image and matched against the feature vectors of the two-dimensional edge points of the model images. Because the feature vectors match even when the camera angle differs somewhat, the input image matches one of the model images in many cases. Once two-dimensional edge points match between the input image and a model image, the two-dimensional edge points of the input image can be put in correspondence with three-dimensional edge points, which makes three-dimensional pose estimation possible.
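
As a rough illustration of image-level matching, the sketch below pairs input edge points with model edge points by feature vector and picks the two-dimensional model with the most pairs. The nearest-neighbour search, the Euclidean distance threshold `max_dist`, the vote count `min_matches`, and the function names are all assumptions. This section does not state which matching criterion is used.

```python
import numpy as np

def match_edge_points(input_feats, model_feats, max_dist):
    """Return (input_index, model_index) pairs whose feature vectors are close.

    Nearest-neighbour search with a Euclidean threshold is an assumption of
    this sketch, not a criterion taken from the source.
    """
    diffs = input_feats[:, None, :] - model_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    nearest = dists.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest) if dists[i, j] <= max_dist]

def best_model(input_feats, models, max_dist=0.5, min_matches=3):
    """models: dict mapping a 2D model ID to an (M, D) array of feature vectors."""
    best_id, best_pairs = None, []
    for model_id, feats in models.items():
        pairs = match_edge_points(input_feats, feats, max_dist)
        if len(pairs) > len(best_pairs):
            best_id, best_pairs = model_id, pairs
    if len(best_pairs) < min_matches:
        # No model matches well enough, so nothing is output.
        return None, []
    return best_id, best_pairs
```

The returned pairs are what links input edge points to three-dimensional edge points, because each model edge point already stores its corresponding three-dimensional point.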

The present invention treats edges not as line segments but as independent edge points, for the following reasons.
The first advantage is that edge points are obtained in large numbers, so recognition can be made stable by processing them statistically. Feature points commonly used for recognition include edge corner points and branch points, but an object yields a few dozen of these at most, whereas edge points are obtained on the order of thousands to tens of thousands.
The second advantage is that individual edge points are obtained with minimal image processing, and therefore more stably than higher-order features such as edge line segments. Treating edges as line segments can introduce errors in steps such as line fitting, and the end points of a segment are hard to determine accurately.
The third advantage is that edge points are easy to associate with a three-dimensional shape model. Unless an edge point lies on an apparent contour (silhouette), it can be extracted at almost the same position even when the camera viewpoint changes, which is convenient for generating a three-dimensional model from a camera image sequence. Moreover, a three-dimensional model composed of edge points shows the shape clearly, so it is also suitable for visual inspection by a person.

<Object model generation>
An embodiment of the object model generation process of the present invention is described with reference to FIG. 1.
The model image selection process 101 takes as input the image sequence from the camera and the estimated camera motion, and outputs model images J selected at predetermined intervals together with their camera poses T. The selection may be done manually by an operator who views the displayed images. Alternatively, intervals for the amount of translation and rotation between camera poses may be set in advance, and the camera pose may be selected automatically whenever the movement exceeds those intervals.
The three-dimensional edge model generation process 102 generates a three-dimensional edge model from the image sequence and the camera pose sequence. This generation is described in Non-Patent Documents 2 and 3, as noted above, and allows a three-dimensional edge model to be generated even from images captured with a monocular camera.
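
The automatic selection rule of the model image selection process 101 can be sketched as below. Representing each camera pose as a rotation matrix plus a translation vector, always keeping the first frame, and measuring movement relative to the last selected frame are assumptions of this sketch. The names `select_model_images`, `min_translation`, and `min_rotation` are hypothetical.

```python
import numpy as np

def rotation_angle(R_a, R_b):
    """Angle in radians of the relative rotation between two rotation matrices."""
    cos_angle = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def select_model_images(poses, min_translation, min_rotation):
    """Return the indices of the frames to keep as model images.

    poses: list of (R, t) pairs, R a 3x3 rotation matrix and t a 3-vector.
    The first frame is always kept (an assumption of this sketch). A later
    frame is kept when it has moved or turned more than the given thresholds
    relative to the last kept frame.
    """
    if not poses:
        return []
    selected = [0]
    for i in range(1, len(poses)):
        R_last, t_last = poses[selected[-1]]
        R_i, t_i = poses[i]
        moved = np.linalg.norm(np.asarray(t_i) - np.asarray(t_last))
        turned = rotation_angle(np.asarray(R_last), np.asarray(R_i))
        if moved > min_translation or turned > min_rotation:
            selected.append(i)
    return selected
```

Smaller thresholds give more model images and denser viewpoint coverage, at the cost of a larger object model.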

The edge extraction process 103 takes the model image J from the model image selection process 101 and extracts the image edge point group G, for example with the Canny operator proposed in Non-Patent Document 4.
The Canny operator smooths the image with a Gaussian function and then takes the first derivative of the image. Points at which the derivative magnitude is a local maximum along the edge normal direction (the derivative direction) are extracted as edge points. With this kind of edge extraction, the edge position can shift depending on the size of the target object in the image. This shift can be reduced by automatically adjusting the variance term of the Gaussian function in the Canny operator.
The edge extraction process uses existing techniques; see Non-Patent Document 4 and related work for details.
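
The following is a simplified sketch of the smoothing, differentiation, and local-maximum steps described above. It is not a full Canny implementation: hysteresis thresholding and the automatic adjustment of the Gaussian variance are omitted. The function names and the `sigma` and `threshold` parameters are assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel truncated at about three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(image, sigma):
    """Separable Gaussian smoothing with edge-replicated borders."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def edge_points(image, sigma=1.0, threshold=0.1):
    """Return (u, v, direction) for pixels whose gradient magnitude is a local
    maximum along the gradient direction and at least `threshold`."""
    img = smooth(image, sigma)
    gy, gx = np.gradient(img)          # derivatives along rows (v) and columns (u)
    mag = np.hypot(gx, gy)
    direction = np.arctan2(gy, gx)
    h, w = img.shape
    points = []
    for v in range(1, h - 1):
        for u in range(1, w - 1):
            m = mag[v, u]
            if m < threshold:
                continue
            # Step one pixel along the gradient, i.e. the edge normal.
            du = int(round(np.cos(direction[v, u])))
            dv = int(round(np.sin(direction[v, u])))
            if m >= mag[v + dv, u + du] and m >= mag[v - dv, u - du]:
                points.append((u, v, float(direction[v, u])))
    return points
```

Each returned tuple carries the position (u, v) and the derivative direction, which are the two fields an image edge point has immediately after extraction.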

The three-dimensional model projection process 104 takes as input the three-dimensional edge model of the object generated by the three-dimensional edge model generation process 102, together with the model image J and camera pose T from the model image selection process 101. Using the perspective projection formula with the camera pose T, it projects each edge point in the three-dimensional edge model onto the model image to obtain the projected edge point group.
The edge point association process 105 finds pairs that are close in position between the image edge point group G extracted by the edge extraction process 103 and the projected edge point group obtained by the three-dimensional model projection process 104, and generates the correspondence between image edge points q and three-dimensional edge points P.
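
A minimal sketch of the perspective projection used by process 104 is shown below, assuming a pinhole camera. The pose convention (a rotation matrix `R` from camera to world coordinates and a camera centre `c`) and the intrinsics `f`, `cx`, `cy` are assumptions. The source only states that the standard perspective projection formula is applied with camera pose T.

```python
import numpy as np

def project_points(points_3d, R, c, f, cx, cy):
    """Project world-frame 3D edge points into an image.

    R (3x3) rotates camera coordinates into world coordinates and c is the
    camera centre in world coordinates; f, cx, cy are pinhole intrinsics in
    pixels. All of these conventions are assumptions of this sketch.
    Returns an (N, 2) array of (u, v) and a mask of points in front of the camera.
    Points behind the camera get NaN coordinates.
    """
    pts = np.asarray(points_3d, dtype=float)
    cam = (pts - c) @ R                 # equals R.T @ (p - c) for each point p
    in_front = cam[:, 2] > 0
    z = np.where(in_front, cam[:, 2], np.nan)
    u = f * cam[:, 0] / z + cx
    v = f * cam[:, 1] / z + cy
    return np.stack([u, v], axis=1), in_front
```

Callers would normally drop the points where the mask is false before associating projections with image edge points.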

The feature vector generation process 106 takes the model image J as input and attaches a feature vector B to each image edge point q, which by then carries the correspondence obtained by the edge point association process 105. The feature vector B is created from the local image around the image edge point q by the method described later.
The object model storage 107 stores, as an object model, the object model name, the three-dimensional edge model, and the two-dimensional models that bundle the edge point correspondences generated by the edge point association process 105.

<Description of the object model>
Next, an example of the structure of the object model used in the object recognition system of the present invention is described with reference to FIG. 2. The object model described here is created by the object model generation process of FIG. 1 and stored in the object model storage 107.
As shown in FIG. 2(a), an object model consists of an object model name, a three-dimensional edge model, and two-dimensional models.
The object model name is assigned to the object by the operator and usually reflects the target object. In FIG. 2(a), the name is desk1. The three-dimensional edge model is the set of three-dimensional edge points Pi reconstructed from the image sequence as described above. A two-dimensional model is the information extracted from an image of the target object captured from one camera viewpoint. In general, one object model has several two-dimensional models.

As shown in FIG. 2(b), a two-dimensional model consists of a two-dimensional model ID, a model image J, a camera pose T, and an edge point set G.
The two-dimensional model ID is a symbol (for example, M1) that uniquely identifies the two-dimensional model. The model image J is a single image selected from the image sequence of the target object. The camera pose T is the pose of the camera when the model image was captured and consists of six variables giving the position (x, y, z) and orientation (θ, φ, ψ) in some three-dimensional coordinate system. The coordinate system is arbitrary, but its origin is usually set at the camera pose of the first image in the sequence. The edge point set G is the set of information on the image edge points q extracted from the model image.
The information on an image edge point q is used for recognition as a two-dimensional feature. FIG. 2(c) shows the structure of this information for one image edge point q: the position (u, v) in the model image, the direction (a), the scale (S), the feature vector (B), and the corresponding three-dimensional edge point (Pj).
The position (u, v) in the model image is obtained by the edge extraction operator in the edge extraction process 103. The direction (a) is the derivative direction of the image at that position and is also obtained by the edge extraction operator. The scale (S) is the size of the neighborhood region of edge point q; how it is obtained is explained later. The feature vector B is multidimensional numerical information extracted from the local image around the edge point. Various feature vectors can be used, but for stable matching it is desirable that they have rotation invariance, scale invariance, illumination invariance, and tolerance to distortion caused by changes in camera viewpoint. The three-dimensional edge point Pj records the correspondence of the image edge point between the model image and the three-dimensional edge model.
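
The object model layout of FIG. 2 can be written down as a small set of record types. This is only a sketch of the fields listed above. The class and field names, the use of an index to refer to the corresponding three-dimensional edge point, and the Python types are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ImageEdgePoint:
    u: float                               # position in the model image
    v: float
    direction: float                       # a: derivative direction at (u, v)
    scale: Optional[float] = None          # S: neighborhood size, filled in later
    feature: Optional[List[float]] = None  # B: feature vector, filled in later
    point3d_index: Optional[int] = None    # index of Pj, None if no correspondence

@dataclass
class Model2D:
    model_id: str                          # e.g. "M1"
    image: object                          # model image J (pixel data)
    camera_pose: Tuple[float, float, float, float, float, float]  # x, y, z, θ, φ, ψ
    edge_points: List[ImageEdgePoint] = field(default_factory=list)  # G

@dataclass
class ObjectModel:
    name: str                              # e.g. "desk1"
    edge_points_3d: List[Tuple[float, float, float]]  # three-dimensional edge model
    models_2d: List[Model2D] = field(default_factory=list)
```

For example, `ObjectModel(name="desk1", edge_points_3d=[(0.0, 0.0, 1.0)])` creates a model with one three-dimensional edge point and no two-dimensional models yet. Two-dimensional models are appended as each model image is processed.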

<Procedure for generating a two-dimensional model>
Next, the procedure for generating a two-dimensional model is described with reference to FIG. 3. The processing described with FIG. 3 corresponds to the three-dimensional model projection process 104, the edge point association process 105, and the feature vector generation process 106 in FIG. 1.
The inputs here are the model image J and camera pose T selected by the model image selection process 101, and the image edge point group G extracted by the edge extraction process 103.
At this point, each image edge point q in the image edge point group G has only the position and direction fields of the structure shown in FIG. 2(c). The purpose of the flowchart in FIG. 3 is to obtain, for each image edge point q, the scale S, the feature vector B, and the corresponding three-dimensional edge point P of FIG. 2(c).

First, the three-dimensional edge point group is projected onto the model image J according to the camera pose T (S110).
Next, in step S112, one image edge point q is taken from the image edge point group G. Then, in step S114, among the projections of the three-dimensional edge points onto the model image J, the one closest in position to edge point q is found. The Euclidean distance between the two points may be used as the criterion. This is illustrated in FIG. 4.
FIG. 4 shows the three-dimensional edge model being projected onto image J: using the camera pose, a straight line is drawn from the camera center to each three-dimensional edge point, and the point where the line intersects image J is found. In FIG. 4, the projection of the three-dimensional edge point P onto image J coincides with the image edge point q, so P is set as the three-dimensional edge point corresponding to q.

Next, the feature vector B of the edge point q is obtained (S116). The scale of q is obtained at the same time. How these are computed is described later. Next, step S118 checks whether any edge points remain in the image edge point group G; if so, the process returns to step S112. If none remain, step S120 combines the model image J, the camera posture T, and the image edge point group G into two-dimensional model information, assigns it an automatically determined two-dimensional model ID, and stores it in the object model storage.
If no projection of a three-dimensional edge point lies within a predetermined distance of the edge point q, q may be treated as having no correspondence. In that case, the feature vector generation (S116) in the flowchart of FIG. 3 is skipped for any edge point that has no corresponding three-dimensional edge point P.
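The projection and association steps (S110 to S114), including the optional distance cutoff, can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes a pinhole camera with focal length `f`, the convention P' = R·P + T for the camera posture, and points that lie in front of the camera. The names `project_points`, `associate_edge_points`, and `max_dist` are hypothetical.

```python
import numpy as np

def project_points(points_3d, R, T, f):
    """Project Nx3 points with a pinhole camera (assumed convention: P' = R @ P + T)."""
    cam = points_3d @ R.T + T              # Nx3 camera-frame coordinates
    return f * cam[:, :2] / cam[:, 2:3]    # Nx2 image coordinates

def associate_edge_points(image_pts, points_3d, R, T, f, max_dist):
    """For each image edge point q, return the index of the 3D edge point whose
    projection is nearest (Euclidean distance), or -1 if none is within max_dist."""
    proj = project_points(points_3d, R, T, f)
    # Pairwise distances, shape (number of image points, number of 3D points)
    d = np.linalg.norm(image_pts[:, None, :] - proj[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    nearest_dist = d[np.arange(len(image_pts)), nearest]
    return np.where(nearest_dist <= max_dist, nearest, -1)
```

Edge points that receive -1 here are the ones for which feature vector generation would be skipped. The brute-force distance matrix is only practical for small point sets; a spatial index would be a natural substitute for larger ones.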

<Feature vector generation method>
The feature vector generation (S116) in FIG. 3 is now described in detail.
In general, an object appears at different sizes in different images. To match a wide range of input images with few model images, the feature vector should be generated so that it does not depend on the size of the object in the image. This requires choosing the size of the neighborhood region from which the feature vector is generated according to the size of the object.
To this end, the radius S of the neighborhood region is determined as follows and used as the edge point scale S shown in FIG. 2(c).

$$S = K \cdot \operatorname*{arg\,max}_{r} \sum_{\theta} V(r, \theta)$$

Here, (r, θ) are polar coordinates centered on the edge point q, V(r, θ) is the edge strength at the point (r, θ) on the circle of radius r centered on q, and K is a suitable proportionality constant.
This expression finds the radius that maximizes the sum of the edge strengths on the circle centered on q. Intuitively, it corresponds to the circle that best touches the edges surrounding q, as in FIG. 5. Since S is proportional to the size of the object in the image, normalizing the local image of the neighborhood region by S makes the feature vector invariant to the size of the object. Although the expression above takes the global maximum, the first local maximum found when searching from a suitable initial value of S may be used instead.
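The radius selection described above can be sketched as a direct search over candidate radii. This is an illustrative sketch under stated assumptions: the edge strength is given as a 2D array indexed `[y, x]`, the circle is sampled at a fixed number of angles with nearest-pixel rounding, and samples falling outside the image are ignored. The function name `select_scale` and the parameters `radii` and `n_angles` are hypothetical.

```python
import numpy as np

def select_scale(edge_strength, q, radii, K=1.0, n_angles=72):
    """Return K times the radius whose circle around q = (x, y) collects the
    largest summed edge strength (global maximum over the candidate radii)."""
    h, w = edge_strength.shape
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    best_r, best_sum = None, -np.inf
    for r in radii:
        # Sample the circle of radius r and round to the nearest pixel
        xs = np.rint(q[0] + r * np.cos(thetas)).astype(int)
        ys = np.rint(q[1] + r * np.sin(thetas)).astype(int)
        inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        total = edge_strength[ys[inside], xs[inside]].sum()
        if total > best_sum:
            best_r, best_sum = r, total
    return K * best_r
```

To use the first local maximum instead of the global one, the loop could start from an initial radius and stop as soon as the sum begins to decrease.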

FIG. 6 shows an example of neighborhood regions obtained independently of object size. In FIG. 6, the center of each circle is an image edge point and its radius is the size of the neighborhood region. The desk appears at different sizes in the left and right images, and the circles are sized roughly in proportion to the desk.
Once the neighborhood region of an edge point is determined, a feature vector is generated from the local image it contains. First, the local image is normalized by the neighborhood region size S obtained from the expression above, so that the neighborhood region contains the same number of pixels regardless of the size of the object.
Various feature vectors can be used; one example is the feature vector of the SIFT method proposed in Non-Patent Document 1. In SIFT, the local image around a feature point is divided into 4 × 4 blocks, and the histogram values of the gradient directions of the pixels in each block are concatenated to form the feature vector. The direction histogram is quantized in 45° steps, giving 8 bins, so the result is a 4 × 4 × 8 = 128-dimensional vector. To make the feature vector invariant to rotation of the object, the direction histogram is built from angles measured relative to the normal direction of the edge point of interest. Note that SIFT uses the extrema of a DoG (Difference of Gaussians) filter as feature points, whereas the present invention uses edge points instead.
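A simplified version of such a 4 × 4 × 8 direction histogram is sketched below. It is not the SIFT reference implementation: it assumes the patch has already been resampled to a fixed size from the scale-S neighborhood, weights each pixel by gradient magnitude, omits Gaussian weighting and bin interpolation, and makes only the gradient directions relative to the edge normal (the block grid itself is not rotated). The function name `orientation_descriptor` is hypothetical.

```python
import numpy as np

def orientation_descriptor(patch, normal_angle, grid=4, bins=8):
    """Build a grid x grid x bins histogram of gradient directions measured
    relative to the edge normal direction (radians), then L2-normalize it."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    # Gradient direction relative to the edge normal, wrapped to [0, 2*pi)
    rel = (np.arctan2(gy, gx) - normal_angle) % (2.0 * np.pi)
    bin_idx = np.floor(rel / (2.0 * np.pi / bins)).astype(int) % bins
    h, w = patch.shape
    rows = np.minimum(np.arange(h) * grid // h, grid - 1)   # block row per pixel row
    cols = np.minimum(np.arange(w) * grid // w, grid - 1)   # block column per pixel column
    desc = np.zeros((grid, grid, bins))
    # Accumulate gradient magnitude into the (block row, block column, direction bin) cells
    np.add.at(desc, (rows[:, None], cols[None, :], bin_idx), mag)
    vec = desc.ravel()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

With the defaults, the result has 4 × 4 × 8 = 128 components, matching the dimensionality stated in the text.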

Because this feature vector is based on pixel direction information obtained from the gradient image, it is more invariant to illumination changes than using the grayscale image directly. It is also rotation invariant, as described above. Since the local image is normalized by the size of the neighborhood region, it is scale invariant as well, that is, independent of the size of the object. Furthermore, because the SIFT feature vector is a local feature, it changes little even if the image as a whole is somewhat distorted. For this reason, if images captured while changing the camera angle in steps of, for example, 30° are used as model images, one of them will match the input image in most cases. Thus, by selecting several images from the image sequence used to generate the three-dimensional edge model and using them as model images, a small number of model images can cover images of the object taken from a variety of camera viewpoints.
The object model of the present invention is generated as described above.

<Object recognition>
An outline of the object recognition processing of the present invention is given with reference to FIG. 7. This processing uses the object model generated by the processing described above, which includes the two-dimensional model and the three-dimensional edge model shown in FIG. 2, to recognize an object from a single input image and also to estimate the camera posture of that image.
The edge extraction processing 202 extracts image edges from the input image. It is the same as the edge extraction processing 103 used in the object model generation of FIG. 1.
The feature vector generation processing 203 generates a feature vector for each image edge point q extracted by the edge extraction processing 202. It is the same as the feature vector generation processing 106 used in the object model generation of FIG. 1, with one difference: during object model generation, feature vectors are generated only for the image edge points q that have a corresponding three-dimensional edge point P, whereas during object recognition they are generated for all image edge points q.

The two-dimensional model matching processing 207 takes as input the image edge point group to which feature vectors were added by the feature vector generation processing 203, refers to the object models in the object model storage 107, and finds an object model having a two-dimensional model whose feature vectors match those of the input image edge point group well. It also builds a set H of pairs of matched input image edge points and model image edge points.
The three-dimensional posture estimation processing 208 takes as input the object model and two-dimensional model found by the two-dimensional model matching processing 207, together with the edge point pair set H, and finds the camera posture at which the projection of the object's three-dimensional edge model agrees well with the input image edge point group. The best-matching object model name and the camera posture are then output as the final recognition result.

<Detailed processing procedure for object recognition>
An example of the object recognition procedure of the present invention is described with reference to FIG. 8. This flowchart details the two-dimensional model matching processing 207 and the three-dimensional posture estimation processing 208 of FIG. 7.
First, the object model storage is searched for two-dimensional models M that have many model image edge points matching the input image edge point group (S212). Edge points are matched according to the degree of coincidence between the feature vectors of the input image edge points obtained by the feature vector generation processing 203 and the feature vectors of the image edge points of the two-dimensional model M.
The degree of coincidence between feature vectors can be computed in various ways; for example, the Euclidean distance or the correlation between the feature vectors may be used. Since more than one two-dimensional model M may be found, each is registered as a candidate in a set D. For each two-dimensional model M, the set of matched pairs of model image edge points and input image edge points is also stored.
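Step S212 can be illustrated with a brute-force nearest-neighbor search using the Euclidean distance between feature vectors. This is a sketch under assumptions, not the patent's procedure: a pair is accepted when the nearest model descriptor is closer than a threshold `max_dist`, and a two-dimensional model enters the candidate set D when it has at least `min_matches` pairs. Both thresholds and the names `match_features` and `candidate_models` are hypothetical.

```python
import numpy as np

def match_features(input_desc, model_desc, max_dist):
    """Pair each input descriptor with its nearest model descriptor (Euclidean
    distance) and keep pairs closer than max_dist.
    Returns a list of (input_index, model_index) tuples."""
    d = np.linalg.norm(input_desc[:, None, :] - model_desc[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    keep = d[np.arange(len(input_desc)), nearest] < max_dist
    return [(int(i), int(nearest[i])) for i in np.flatnonzero(keep)]

def candidate_models(input_desc, models, max_dist, min_matches):
    """Return {model_id: pairs} for every 2D model with enough matched pairs.
    `models` maps a 2D model ID to its array of edge point descriptors."""
    result = {}
    for model_id, model_desc in models.items():
        pairs = match_features(input_desc, model_desc, max_dist)
        if len(pairs) >= min_matches:
            result[model_id] = pairs
    return result
```

The returned dictionary plays the role of the set D together with the stored pair set for each two-dimensional model M.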

Next, one two-dimensional model M is taken from the set D (S214), and the transformation parameters between the input image and the model image are obtained from the edge point pair set for M (S216). The transformation here is a two-dimensional change of position and shape that maps the target object in the model image so that it fits the input image well. Examples include similarity transformations and affine transformations. Edge point pairs that violate the obtained transformation parameters are removed (S218). This processing is described in detail later. These two steps (S216 and S218) remove incorrect edge point correspondences and thereby improve the accuracy of the camera posture estimation in the later step (S222).
In step S220, the correspondence between input image edge points and three-dimensional edge points is obtained. The three-dimensional edge point corresponding to each model image edge point was obtained when the object model was generated, so once the previous step (S218) has established the correspondence between input image edge points and model image edge points, the correspondence between input image edge points and three-dimensional edge points follows.
Next, the camera posture is obtained so that the projections of the three-dimensional edge points onto the input image coincide in position with the input image edge points (S222). The specific method is described in detail later. At this point, the number of edge point pairs whose positions agree well and the sum of their position errors are computed as the degree of coincidence.
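The degree of coincidence computed in S222 can be sketched as a count of well-aligned pairs plus their summed position error. The sketch assumes a pinhole camera with the convention P' = R·P + T, corresponding rows in `points_3d` and `image_pts`, and a pixel tolerance `tol` that decides which pairs count as agreeing well. The tolerance and the function name `match_degree` are hypothetical.

```python
import numpy as np

def match_degree(points_3d, image_pts, R, T, f, tol):
    """Return (number of pairs whose projection lies within tol of the image
    edge point, sum of the position errors of those pairs)."""
    cam = points_3d @ R.T + T                 # 3D edge points in the camera frame
    proj = f * cam[:, :2] / cam[:, 2:3]       # pinhole projection onto the image
    err = np.linalg.norm(proj - image_pts, axis=1)
    inliers = err <= tol
    return int(inliers.sum()), float(err[inliers].sum())
```

How the two numbers are combined into a single score for the threshold test in S224 is left open here, since the text does not specify it.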

Next, step S224 checks whether the degree of coincidence exceeds a predetermined threshold. If it does (YES in S224), the object model having that two-dimensional model and the camera posture obtained in the previous step (S222) are registered as a candidate recognition result (S226). If it does not (NO in S224), that two-dimensional model is rejected. Step S228 checks whether any two-dimensional models remain in the set D; if so (YES in S228), the process repeats from the step that takes a remaining two-dimensional model from the set D (S214).

This procedure may yield more than one final recognition result. When several candidates are obtained, one may, for example, adopt the candidate with the highest degree of coincidence, or narrow the candidates down using consistency with other sensor information or with earlier recognition results.
In the step that builds the set D of two-dimensional models similar to the input image (S212), matching the feature vectors of the input image edge point group against every two-dimensional model in the object model storage 107 takes a great deal of computation time. The feature vectors may therefore be indexed with a k-d tree, as proposed in Non-Patent Document 1, to speed up matching.

<Filtering of the edge point pair set (S218)>
Because the feature vector is a local feature, edge point pairs matched on feature vectors alone can contain many errors. Incorrect pairs are therefore removed (S218) using a constraint based on the shape of the object in the image. To do this, the transformation between the input image and the model image is first obtained (S216). Either a similarity transformation or an affine transformation can be used; the similarity transformation is described here as an example.
Since each image edge point carries position, direction, and scale information, a single pair consisting of an input image edge point and a model image edge point yields the translation (difference in position), rotation (difference in direction), and scaling (ratio of scales) of the edge point on the image. These quantities are computed for every edge point pair, and clustering or voting is applied to find the values supported by the largest number of pairs. The resulting values define the similarity transformation between the input image and the model image. For an affine transformation, the parameters can be computed from two edge point pairs.
Next, edge point pairs whose similarity transformation parameters differ greatly from the obtained ones are removed from the edge point pair set. This eliminates many of the incorrect edge point correspondences.
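The voting and filtering of S216 and S218 can be sketched as below. Several choices here are assumptions made for illustration and are not taken from the text: each edge point is a row `(x, y, direction in radians, scale)`; the translation is computed as `p_in - s * R(rot) * p_model` rather than as a plain position difference, so that it stays constant across pairs when the rotation or scale change is non-zero; votes are binned on a fixed grid; and angle wraparound is ignored when comparing against the consensus. The bin sizes, `tol`, and the function names are hypothetical.

```python
import numpy as np

def similarity_votes(model_pts, input_pts):
    """Return one (tx, ty, rotation, log scale ratio) vote per edge point pair."""
    rot = (input_pts[:, 2] - model_pts[:, 2] + np.pi) % (2 * np.pi) - np.pi
    ratio = input_pts[:, 3] / model_pts[:, 3]
    c, s = np.cos(rot), np.sin(rot)
    mx, my = model_pts[:, 0], model_pts[:, 1]
    tx = input_pts[:, 0] - ratio * (c * mx - s * my)
    ty = input_pts[:, 1] - ratio * (s * mx + c * my)
    return np.stack([tx, ty, rot, np.log(ratio)], axis=1)

def filter_pairs(model_pts, input_pts,
                 bin_sizes=(20.0, 20.0, np.radians(15), np.log(1.3)), tol=1.5):
    """Vote in a quantized parameter space, take the most populated cell as the
    consensus transformation, and flag the pairs that stay close to it."""
    votes = similarity_votes(model_pts, input_pts)
    sizes = np.asarray(bin_sizes)
    cells = np.floor(votes / sizes).astype(int)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    winners = np.all(cells == uniq[counts.argmax()], axis=1)
    consensus = np.median(votes[winners], axis=0)
    keep = np.all(np.abs(votes - consensus) <= tol * sizes, axis=1)
    return consensus, keep
```

Pairs with `keep == False` are the ones S218 would discard. A fixed grid can split a true cluster across neighboring cells, which is why the final test uses a tolerance around the consensus instead of cell membership.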

<3D edge model matching (S222)>
The processing that computes the camera posture from the correspondence between input image edge points and three-dimensional edge points (S222) is described next.
First, the camera posture at which the model image was captured is used as the initial value of the camera posture. This is because similar image features suggest that the camera posture of the model image is close to that of the input image. Next, the camera posture is computed so that the projections of the three-dimensional edge points onto the input image agree well in position with the edge points of the input image. This is done by minimizing an error function E with respect to R and T.
Here, R is the rotation matrix of the camera posture to be found and T is the translation vector of the camera. P_i = (X_i, Y_i, Z_i)^T is a three-dimensional edge point, P_i' = (X_i', Y_i', Z_i')^T is P_i after the coordinate transformation by the camera posture R, T, q_i = (u_i, v_i) is the two-dimensional edge point corresponding to P_i, and f is the focal length of the camera. E is the sum of the errors between the points obtained by projecting the three-dimensional edge points onto the input image under the camera posture given by R and T and the corresponding two-dimensional edge points.
Because this is a nonlinear minimization problem, it is solved with a method such as steepest descent or Newton's method, using the camera posture of the model image described above as the initial values of R and T.
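The expression for E is not reproduced in this text. One form consistent with the symbol definitions above is the usual sum of squared reprojection errors. Two points in it are assumptions rather than statements from the source: that the error is squared, and that the coordinate transformation is P_i' = R P_i + T.

```latex
E = \sum_{i} \left[ \left( u_i - f \frac{X_i'}{Z_i'} \right)^{2}
                  + \left( v_i - f \frac{Y_i'}{Z_i'} \right)^{2} \right],
\qquad P_i' = R P_i + T
```

Under this form, E is nonlinear in R and T because of the division by Z_i', which is why an iterative method started from the model image's camera posture is needed.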
As described above, the three-dimensional object recognition system of the present invention identifies an object and estimates its three-dimensional posture from a single input image.

Depending on the conditions under which the system is used, the distance between the camera and the target object may be constant, so that the size of the object in the image stays nearly constant. In that case, the processing for scale invariance may be omitted. Specifically, the step that determines the neighborhood region size during feature vector generation is skipped and a fixed value given in advance is used instead.

<Recognition example>
FIG. 9 shows an example of object recognition by the present invention. FIG. 9(a) is the three-dimensional edge model of an object model (a sink) created in the object model generation phase. FIGS. 9(b) to 9(d) show the results of object recognition performed on one camera image each, using the object model that includes the three-dimensional edge model of FIG. 9(a). In each of FIGS. 9(b) to 9(d) the object (the sink) is recognized successfully, and the three-dimensional edge model is overlaid on the camera image at the camera posture estimated by the system.
In FIG. 9(b) the object appears smaller in the image than in FIGS. 9(c) and 9(d), yet recognition succeeds. In FIG. 9(d) part of the object is cut off in the image, yet recognition also succeeds.
The object in this example (the sink) consists mostly of straight lines and has few feature points such as corners and junctions. With a method that relies on such feature points, recognition becomes unstable when part of the object is missing from the image, because even fewer feature points can be extracted. The method of the present invention uses all edge points, so many two-dimensional edge points remain available even when part of the object is missing, and recognition is highly stable, as FIG. 9 shows.

The present invention can be applied to, for example, visual recognition for robots, monitoring systems, and recognition of the surrounding environment by mobile bodies.

FIG. 1 is a block diagram showing an embodiment of object model generation in the present invention.
FIG. 2 is an explanatory diagram showing an example of the data structure of an object model.
FIG. 3 is a flowchart showing details of the two-dimensional model generation procedure.
FIG. 4 is an explanatory diagram showing the correspondence between three-dimensional edge points and image edge points.
FIG. 5 is a diagram explaining the concept of scale invariance of the edge point neighborhood region.
FIG. 6 is a diagram showing an example of scale-invariant edge point extraction.
FIG. 7 is a block diagram showing an embodiment of object recognition in the present invention.
FIG. 8 is a flowchart showing details of the object recognition procedure.
FIG. 9 is a diagram showing an example of object recognition results obtained by the present invention.

Claims

A three-dimensional object recognition system that performs object recognition from a single input image, comprising:
object model storage means for storing, as an object model, an object name, a three-dimensional edge model, and a two-dimensional model (a camera posture for each of a plurality of images, an image edge point group, a feature vector of each image edge point, and a three-dimensional edge point corresponding to each image edge point);
image edge extraction means for extracting an image edge point group from the input image;
feature vector generation means for generating a feature vector of each image edge point extracted by the image edge extraction means;
two-dimensional model matching means for comparing the feature vectors of the image edge points of the input image obtained by the feature vector generation means with the feature vectors of the image edge points of the object models stored in the object model storage means, and searching for an object model that matches the input image; and
three-dimensional posture estimation means for determining a camera posture such that the positions at which the three-dimensional edge points of the three-dimensional edge model of the retrieved object model are projected onto the input image have a high degree of coincidence with the positions of the image edge points of the input image,
wherein the system outputs an object name and the camera posture of the input image.

An object model generation system for a three-dimensional object recognition system, which generates an object model used for object recognition from a camera image sequence, comprising:
model image selection means for receiving an image sequence and the camera posture sequence at the time the image sequence was captured, and selecting a model image to be used for object recognition and the camera posture at which the model image was captured;
edge extraction means for receiving the model image and extracting an image edge point group;
feature vector generation means for generating a feature vector of each image edge point extracted by the edge extraction means;
three-dimensional model projection means for receiving the camera posture corresponding to the model image and a three-dimensional edge model of the object, and computing the projection of each three-dimensional edge point of the three-dimensional edge model onto the model image to obtain projected edge points; and
edge point association means for generating correspondences between the three-dimensional edge points and the image edge points from the positional relationship between the projected edge points computed by the three-dimensional model projection means and the image edge points extracted by the edge extraction means,
wherein an object name, a three-dimensional edge model, and a two-dimensional model (a camera posture for each of a plurality of images, an image edge point group, a feature vector of each image edge point, and a three-dimensional edge point corresponding to each image edge point) are output as the object model.

The three-dimensional object recognition system according to claim 1 or the object model generation system for a three-dimensional object recognition system according to claim 2, wherein, when generating the feature vector of an image edge point, the feature vector generation means determines the radius of a circular region centered on the image edge point so that the sum of the edge strengths on its circumference is a global or local maximum, and generates the feature vector from the image information contained in the circular region.

A computer program for causing a computer system to function as:
image edge extraction means for extracting an image edge point group from an input image;
feature vector generation means for generating a feature vector of each image edge point extracted by the image edge extraction means;
two-dimensional model matching means for comparing the feature vectors of the image edge points of the input image obtained by the feature vector generation means with the feature vectors of the image edge points of the object models stored in object model storage means, and searching for an object model that matches the input image; and
three-dimensional posture estimation means for determining a camera posture such that the positions at which the three-dimensional edge points of the three-dimensional edge model of the retrieved object model are projected onto the input image have a high degree of coincidence with the positions of the image edge points of the input image,
and to output an object name and the camera posture of the input image.

A computer program for causing a computer system to function as:
model image selection means for receiving an image sequence and the camera posture sequence at the time the image sequence was captured, and selecting a model image to be used for object recognition and the camera posture at which the model image was captured;
edge extraction means for receiving the model image and extracting an image edge point group;
feature vector generation means for generating a feature vector of each image edge point extracted by the edge extraction means;
three-dimensional model projection means for receiving the camera posture corresponding to the model image and a three-dimensional edge model of the object, and computing the projection of each three-dimensional edge point of the three-dimensional edge model onto the model image to obtain projected edge points; and
edge point association means for generating correspondences between the three-dimensional edge points and the image edge points from the positional relationship between the projected edge points computed by the three-dimensional model projection means and the image edge points extracted by the edge extraction means,
and to output, as an object model, an object name, a three-dimensional edge model, and a two-dimensional model (a camera posture for each of a plurality of images, an image edge point group, a feature vector of each image edge point, and a three-dimensional edge point corresponding to each image edge point).