JP2011008687A - Image processor

Info

Publication number: JP2011008687A
Application number: JP2009153788A
Authority: JP
Inventors: Masatake Takahashi; 真毅高橋
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2009-06-29
Filing date: 2009-06-29
Publication date: 2011-01-13

Abstract

PROBLEM TO BE SOLVED: To provide an image processor which superimposes a virtual object on an image of real space. SOLUTION: The image processor includes a camera for obtaining moving image data. The image processor detects feature points for long-term tracking at intervals of a prescribed number of frames of the moving image data, and detects feature points for short-term tracking in all frames of the moving image data. It performs inter-frame feature point tracking on the basis of the respective feature quantities of the long-term tracking feature points detected from two different frames, and performs inter-frame feature point tracking by block matching based on luminance values in the neighborhood of the long-term or short-term tracking feature points detected from two different frames. It estimates the position and orientation of the camera in three-dimensional space on the basis of the tracked feature points, and estimates the three-dimensional positions of the feature points on the basis of the tracked feature points and the estimated position and orientation of the camera. On the basis of the estimated camera position and orientation information and the three-dimensional position information of the feature points, it superimposes externally input virtual object data on the moving image data and outputs the result.

Description

The present invention relates to an image processing apparatus that composites computer graphics data onto photographed image data.

Virtual reality systems that realize augmented reality have long been known. In such a system, real space is captured as moving image data with a camera, a virtual object generated by CG or the like is superimposed so that it appears to exist in the scene, and the result is displayed on an HMD (Head Mount Display), which is a small display device worn on the head, or on a similar device (see, for example, Non-Patent Document 1). The system places markers in real space to specify the display position (three-dimensional position) of the virtual object, and determines where to display the superimposed virtual object by detecting and tracking the markers in the moving image data. However, a marker-based system requires a marker to be installed in real space for every position at which a virtual object is displayed, which limits the environments and applications in which the system can be used.

To solve this problem, techniques have been proposed that, instead of markers, detect and track feature points having specific image features in the captured images and use them to determine the display position of the virtual object (see, for example, Non-Patent Document 2). The feature point detection and tracking technique described in Non-Patent Document 2 uses the FAST feature detector (see Non-Patent Document 3) to detect corner features in each image and uses the pixels surrounding each detected corner feature as a template. Corner features are tracked by template matching between images, the camera position and orientation and the three-dimensional positions of the feature points are estimated from the tracking results, and the display position of the virtual object is thereby determined. The FAST feature detector examines the pixels on the circumference of a circle of predetermined radius R centered on an arbitrary pixel in the image. If N or more consecutive pixels on the circumference each differ in luminance from the center pixel by a predetermined threshold TH1 or more, or if N or more consecutive pixels each differ by -TH1 or less, the center of the circle is taken as the detected position of a feature point, and the luminance values of the center pixel and of the pixels on the circumference are taken as its corner feature.
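The segment test described above can be sketched as follows. This is a minimal illustration rather than the detector of Non-Patent Document 3: the radius-3 circle of 16 pixels, the default values `th1=20` and `n=9`, and the function names are assumptions chosen for the example, and the caller is assumed to keep (u, v) at least 3 pixels away from the image border.

```python
# Offsets (du, dv) of the 16 pixels on a circle of radius 3, listed in
# circular order. Using R = 3 is an assumption made for this sketch.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def longest_circular_run(flags):
    """Length of the longest run of True values, treating the list as circular."""
    if all(flags):
        return len(flags)
    best = run = 0
    for flag in flags + flags:  # doubling the list handles wrap-around runs
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def is_fast_corner(img, u, v, th1=20, n=9):
    """Segment test at pixel (u, v); img is a list of rows of luminance values."""
    center = img[v][u]
    diffs = [img[v + dv][u + du] - center for du, dv in CIRCLE]
    brighter = [d >= th1 for d in diffs]   # difference is TH1 or more
    darker = [d <= -th1 for d in diffs]    # difference is -TH1 or less
    return (longest_circular_run(brighter) >= n
            or longest_circular_run(darker) >= n)
```

A pixel is accepted only when the bright or dark arc is contiguous, which is why the run length is computed on the circular list rather than by counting matching pixels.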

Non-Patent Document 1: Hirokazu Kato, Mark Billinghurst, Koichi Asano, Keihachiro Tachibana: Augmented reality system based on marker tracking and its calibration, Transactions of the Virtual Reality Society of Japan, Vol. 4, No. 4, pp. 607-616 (1999)
Non-Patent Document 2: Parallel Tracking and Mapping for Small AR Workspaces, In Proc. International Symposium on Mixed and Augmented Reality (ISMAR'07, Nara)
Non-Patent Document 3: E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Proc. 9th European Conference on Computer Vision (ECCV'06), Graz, May 2006.

However, methods based on image features are more susceptible than marker-based methods to changes in conditions during shooting, and the estimated camera position and orientation and the estimated three-dimensional positions of feature points tend to become unstable. For example, with the FAST feature detector described in Non-Patent Document 3, when the captured images have low contrast and are strongly affected by noise, or when the illumination changes during long shooting, it is difficult to keep detecting the same position as a feature point stably across multiple images, regardless of whether the subject or the camera itself is moving. Likewise, when the camera moves during shooting and the apparent shape of the subject changes greatly, it is difficult to keep detecting the same position on the subject as a feature point, which also destabilizes the estimated camera position and orientation and the estimated three-dimensional positions of feature points. Because such changes become more likely as the shooting time increases, the stability of the estimation results during long shooting is a problem.

One possible solution is to use image features that are more robust than the FAST feature detector to image rotation, scaling, affine deformation, illumination changes, and added noise. An example of such an image feature is the SIFT (Scale-invariant feature transform) feature. To detect feature points, SIFT first applies DoG (Difference-of-Gaussian) processing to the target image and takes the extrema of the resulting DoG images as feature point candidates. It then removes from the candidates the points that are susceptible to noise, namely points whose principal curvature is at or above a predetermined threshold and points whose contrast is at or below a predetermined threshold. The luminance gradient direction at each detected feature point and the luminance gradient histogram of its neighborhood are used as the feature quantity of that point. Features can be tracked across images by comparing the feature quantities of detected feature points between images and finding the pairs with high similarity. Other image features of this kind include the GLOH (Gradient Location and Orientation Histogram) feature.

However, image features such as SIFT and GLOH require far more computation for feature detection than the FAST feature detector, which makes real-time processing difficult. It is therefore difficult to realize augmented reality by capturing real space as moving image data with a camera and displaying it in real time while superimposing a virtual object generated by computer graphics or the like on the moving image data.

The present invention has been made in view of these circumstances. Its object is to provide an image processing apparatus that can track feature points stably under changing shooting conditions when realizing augmented reality in which a virtual object generated by computer graphics or the like is superimposed on real space.

The present invention is an image processing apparatus that superimposes image data of a virtual object on captured moving image data and outputs the result. The apparatus comprises: a camera that captures the surrounding environment to obtain moving image data; a first feature detection unit that detects long-term tracking feature points based on a first image feature at every predetermined frame interval of the moving image data; a second feature detection unit that detects short-term tracking feature points based on a second image feature in all frames of the moving image data; a first feature tracking unit that tracks feature points between frames based on the feature quantities of the long-term tracking feature points detected from two different frames; a second feature tracking unit that tracks feature points between frames by block matching based on luminance values in the neighborhood of the long-term or short-term tracking feature points detected from the two different frames; a camera position and orientation estimation unit that estimates the position and orientation of the camera in three-dimensional space based on the feature points tracked by the second feature tracking unit; a feature point three-dimensional position estimation unit that estimates the three-dimensional positions of the feature points based on the feature points tracked by the first feature tracking unit and the estimated position and orientation of the camera in three-dimensional space; and an image composition unit that, based on the estimated camera position and orientation information in three-dimensional space and the three-dimensional position information of the feature points, superimposes image data of a virtual object input from outside on the moving image data captured by the camera and outputs the result.

In the present invention, the camera position and orientation estimation unit estimates the camera position and orientation information in three-dimensional space for every captured frame, and the feature point three-dimensional position estimation unit estimates the three-dimensional position information of the feature points at every predetermined interval of captured frames.

In the present invention, the camera position and orientation estimation unit performs an initial estimate of the camera position and orientation for the frame using the tracked short-term tracking feature points, and then determines the camera position and orientation for the frame using both the tracked short-term tracking feature points and the tracked long-term tracking feature points.

In the present invention, the feature point three-dimensional position estimation unit estimates three-dimensional positions for the tracked long-term tracking feature points.

According to the present invention, detection of the image feature that requires a large amount of computation is performed only at every predetermined frame interval, which reduces the amount of computation. For short-term changes in camera position and orientation in frames where that costly detection is not performed, feature points are detected with the image feature that requires little computation and are tracked by block matching, so that accurate and smooth changes in camera position and orientation can be estimated. As a result, stable camera position and orientation estimation by feature point tracking in arbitrary images, together with three-dimensional position estimation of feature points, can be achieved in real time, and the processing that realizes augmented reality by superimposing a virtual object generated by computer graphics or the like on real space can be executed stably.
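The division of detection work between frames can be illustrated with a small sketch. The interval of 10 frames, the cost figures in the usage note, and the function names are hypothetical values chosen for illustration; the text only states that the costly detection runs at a predetermined frame interval and the cheap detection runs on every frame.

```python
def detectors_for_frame(frame_index, long_term_interval=10):
    """Return which feature detectors run on the given frame.

    The short-term (cheap) detector runs on every frame; the long-term
    (costly) detector runs only on every long_term_interval-th frame.
    """
    detectors = ["short_term"]
    if frame_index % long_term_interval == 0:
        detectors.append("long_term")
    return detectors


def average_detection_cost(cost_long, cost_short, long_term_interval):
    """Average per-frame detection cost when the costly detector is amortized."""
    return cost_short + cost_long / long_term_interval
```

For example, if the costly detector took 50 time units and the cheap one 2, running the costly detector every 10 frames would give an average of `average_detection_cost(50.0, 2.0, 10)`, that is 7 units per frame instead of 52.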

FIG. 1 is a block diagram showing the configuration of an image processing apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram of the coordinate systems handled by the image processing apparatus of the present invention.
FIG. 3 is an operation flow of the camera position and orientation estimation processing in the image processing apparatus of the present invention.
FIG. 4 is an explanatory diagram of the range in which a feature point can exist in an adjacent frame.
FIG. 5 is an operation flow of the three-dimensional position estimation processing of feature points in the image processing apparatus of the present invention.

An image processing apparatus according to an embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the embodiment. In this figure, reference numeral 1 denotes a camera that captures the surrounding environment and outputs moving image data. Reference numeral 2 denotes a first feature detection unit that detects feature points based on a predetermined first image feature, as long-term tracking feature points, from predetermined frames of the moving image data output from the camera 1. Reference numeral 3 denotes a second feature detection unit that detects feature points based on a predetermined second image feature, as short-term tracking feature points, from each frame of the moving image data output from the camera 1. Reference numeral 4 denotes a first feature tracking unit that tracks feature points between frames based on the feature quantities of the long-term tracking feature points detected from two different frames. Reference numeral 5 denotes a second feature tracking unit that tracks feature points between frames by block matching based on luminance values in the neighborhood of the long-term or short-term tracking feature points detected from two different frames.

Reference numeral 6 denotes a camera position and orientation estimation unit that estimates the position and orientation of the camera 1 based on the short-term tracking feature points tracked by the second feature tracking unit 5. Reference numeral 7 denotes a feature point three-dimensional position estimation unit that estimates the three-dimensional positions of feature points based on the long-term tracking feature points tracked by the first feature tracking unit 4 and the estimated camera position and orientation information. Reference numeral 8 denotes a storage unit that stores feature point map information consisting of the observed positions of the long-term tracking feature points on the frames, their estimated three-dimensional positions, and the camera position and orientation for each frame. Reference numeral 9 denotes a CG data storage unit in which computer graphics (hereinafter, CG) data of the virtual object to be superimposed is stored in advance. Reference numeral 10 denotes an image composition unit that, based on the estimated camera position and orientation information and the three-dimensional position information of the feature points stored in the storage unit 8, superimposes the CG data stored in the CG data storage unit 9 on the moving image data output from the camera 1 and outputs the composite image.

The image processing apparatus shown in FIG. 1 uses two different kinds of image features, one for long-term tracking and one for short-term tracking. In the following description, the first image feature detected by the first feature detection unit 2 is the SIFT feature described above, and the second image feature detected by the second feature detection unit 3 is the corner feature obtained with the FAST feature detector described above. However, the first image feature may be any image feature that is more robust than the second image feature to image rotation, scaling, affine deformation, illumination changes, added noise, and the like. Instead of the SIFT feature, the apparatus may use, for example, the GLOH feature described above, the SURF (Speeded Up Robust Features) feature, or the LESH (Local Energy based Shape Histogram) feature. Similarly, the second image feature may be any image feature that requires less computation than the first image feature. Instead of the corner feature detected by the FAST feature detector, the apparatus may use, for example, corner features detected by the SUSAN corner detector or the Harris operator.

Before the operation of the image processing apparatus shown in FIG. 1 is described, the coordinate systems, symbols, and equations for the three-dimensional and two-dimensional spaces handled by the apparatus are explained with reference to FIG. 2. In FIG. 2, the world coordinate system (Xw, Yw, Zw) represents the three-dimensional space in which the subject photographed by the camera 1 exists, and is defined by the XwYwZw coordinate axes in the figure. The camera coordinate system (Xc, Yc, Zc) is a local coordinate system representing three-dimensional space with the viewpoint position of the camera 1 as its origin, and is defined by the XcYcZc coordinate axes in the figure. The Zc axis of the camera coordinate system represents the direction of the optical axis (line of sight) of the camera.

The projective coordinate system (u, v) is the coordinate system within the image plane (frame) onto which the three-dimensional space photographed by the camera 1 is projected, and is defined by the UV coordinates in the figure. The relationship among these three coordinate systems can be expressed by equations (1) and (2), using the transformation matrix Ecw from the world coordinate system (Xw, Yw, Zw) to the camera coordinate system (Xc, Yc, Zc) and the projection model CamProj(P) from the camera coordinate system (Xc, Yc, Zc) to the image plane.

Pc = Ecw Pw ... (1)
Pv = CamProj(Pc) = CamProj(Ecw Pw) ... (2)

Here, Pw, Pc, and Pv denote, respectively, the position Pw = (Xp, Yp, Zp)^t of a feature point P in the world coordinate system, its position Pc = (Xcp, Ycp, Zcp)^t in the camera coordinate system, and its position Pv = (Up, Vp)^t in the projective coordinate system on the image plane (frame) onto which P is projected. The superscript t denotes the transpose. The transformation matrix Ecw represents the position and orientation of the camera 1 in the world coordinate system and changes as the camera 1 moves. In the following description, the transformation matrix Ecw_n, which represents the position and orientation of the camera 1 at the n-th frame captured by the camera 1 (hereinafter, frame n), is used as appropriate.
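Equations (1) and (2) can be illustrated with a short sketch. The text does not specify the form of CamProj, so a simple pinhole model is assumed here; the intrinsic parameters `fx`, `fy`, `cx`, `cy` and their default values, the 3x4 [R | t] layout of Ecw, and all function names are hypothetical.

```python
def transform_to_camera(e_cw, p_w):
    """Equation (1): Pc = Ecw Pw, with Ecw given as three rows [r0, r1, r2, t]."""
    x, y, z = p_w
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in e_cw)


def cam_proj(p_c, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Assumed pinhole form of CamProj: perspective divide, then intrinsics."""
    xc, yc, zc = p_c
    if zc <= 0:
        raise ValueError("point is not in front of the camera")
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def project(e_cw, p_w):
    """Equation (2): Pv = CamProj(Ecw Pw)."""
    return cam_proj(transform_to_camera(e_cw, p_w))
```

With an identity rotation and a translation of 2 along the optical axis, `project([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 2]], (1, 1, 2))` maps the world point to camera coordinates (1, 1, 4) and then to the image position (445.0, 365.0) under the assumed intrinsics.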

Next, the operation of the image processing apparatus shown in FIG. 1 is described. The operation is broadly divided into camera position and orientation estimation processing and three-dimensional position estimation processing of feature points, and these two processing operations are executed in parallel. First, the camera position and orientation estimation processing in the image processing apparatus shown in FIG. 1 is described with reference to FIG. 3, which is a flowchart of that processing.

First, the second feature detection unit 3 receives one frame of image data output from the camera 1 and detects the short-term tracking feature points in this frame using the FAST feature detector (step S1). The second feature detection unit 3 outputs the projective coordinates of the detected short-term tracking feature points on the image plane to the second feature tracking unit 5. An image pyramid (a set of images representing the input at multiple resolutions) may also be generated for the input image data so that feature points are detected in the lower-resolution images as well.
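The optional image pyramid mentioned above could be built as in the following sketch. The text does not say how the lower-resolution images are produced, so 2x2 averaging with a halving of each dimension per level is an assumption, as are the function names and the default of three levels.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block of luminance values."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4
             for x in range(w)]
            for y in range(h)]


def build_pyramid(img, levels=3):
    """Return [full resolution, half, quarter, ...] with up to `levels` images."""
    pyramid = [img]
    for _ in range(levels - 1):
        top = pyramid[-1]
        if len(top) < 2 or len(top[0]) < 2:
            break  # too small to halve again
        pyramid.append(downsample(top))
    return pyramid
```

Running the feature detector on every level of `build_pyramid(frame)` would also yield feature points at coarser scales, at the cost of extra detection passes on the smaller images.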

Next, the second feature tracking unit 5 tracks the short-term tracking feature points in two consecutive frames (frame n-1 and frame n) by block matching over a region of N×N pixels (N is a natural number) around each feature point (step S2). When the first frame (frame 0) is input, there is no frame to track against, so this step is skipped, and the subsequent steps S3 to S6 are skipped as well. As an example, the SSD (Sum of Squared Difference) given by equation (3) is used as the evaluation criterion for block matching over the N×N neighborhood of a feature point.

In equation (3), F_n(u, v) denotes the luminance value at coordinates (u, v) of frame n, and the coordinates (u_j, v_j) and (u_i, v_i) are the coordinates of feature point j in frame n-1 and feature point i in frame n, respectively. For feature point j in frame n-1, the second feature tracking unit 5 finds, among all feature points detected in frame n, the feature point i = k that minimizes the SSD value given by equation (3), and outputs the pair of feature point j and feature point i = k as the tracking result. However, if TH < SSD for a predetermined threshold TH, tracking of feature point j is regarded as having failed and no result is output.
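The matching rule described above can be sketched as follows. Equation (3) itself is not reproduced in this text, so the code assumes the usual SSD definition: the sum, over the N×N block, of squared differences between F_{n-1} around (u_j, v_j) and F_n around (u_i, v_i). The function names, the odd block size `n=3`, and the requirement that points lie at least n // 2 pixels from the border are assumptions of the sketch.

```python
def ssd(frame_prev, frame_cur, pj, pi, n=3):
    """Sum of squared luminance differences over the n x n blocks at pj and pi.

    Frames are lists of rows; points are (u, v). Points are assumed to be at
    least n // 2 pixels inside the border (negative indices would wrap).
    """
    (uj, vj), (ui, vi) = pj, pi
    half = n // 2
    total = 0
    for dv in range(-half, half + 1):
        for du in range(-half, half + 1):
            d = frame_prev[vj + dv][uj + du] - frame_cur[vi + dv][ui + du]
            total += d * d
    return total


def track_point(frame_prev, frame_cur, pj, candidates, th, n=3):
    """Return the candidate in frame n with minimal SSD, or None on failure.

    Tracking fails when there are no candidates or when TH < SSD at the minimum.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda pi: ssd(frame_prev, frame_cur, pj, pi, n))
    if ssd(frame_prev, frame_cur, pj, best, n) > th:
        return None
    return best
```

Returning `None` rather than the best-scoring candidate mirrors the rule that no tracking result is output when the minimum SSD exceeds TH.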

In the above, the matching candidates for feature point j in frame n-1 are all the feature points detected in frame n. In ordinary shooting, however, the movement of a feature point between two consecutive frames is likely to be confined to a small range. Therefore, using a threshold d0 on the upper limit of the distance a feature point can move, matching against feature point j may be restricted to the feature points that satisfy equation (4).

With this configuration, it is possible to avoid outputting an incorrect pair of feature points as a tracking result because of mismatching, for example when similar textures exist over a wide area, and the amount of computation required for feature point tracking can also be reduced.
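The candidate restriction can be sketched as a simple distance gate. Equation (4) is not reproduced in this text, so the Euclidean distance and the non-strict comparison with d0 used here are assumptions, as is the function name.

```python
import math


def gate_candidates(pj, candidates, d0):
    """Keep only the candidates of frame n that lie within distance d0 of pj.

    pj is the (u, v) position of feature point j in frame n-1. The Euclidean
    distance and the <= comparison are assumed forms of the gating condition.
    """
    uj, vj = pj
    return [(ui, vi) for ui, vi in candidates
            if math.hypot(ui - uj, vi - vj) <= d0]
```

Applying the gate before block matching shrinks the candidate list, which both removes distant look-alike textures and reduces the number of block comparisons per feature point.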

As the evaluation measure for block matching, SAD (Sum of Absolute Difference) or ZNCC (Zero-mean Normalized Cross-Correlation) may be used instead of SSD. With ZNCC, however, the feature point with the maximum evaluation value, rather than the minimum, is selected as the tracking result.
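The two alternative measures can be sketched on flattened blocks of luminance values. The function names and the choice to return 0.0 for a block with no luminance variation are assumptions of this sketch; the definitions themselves are the standard ones for SAD and ZNCC.

```python
import math


def sad(block_a, block_b):
    """Sum of absolute differences; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def zncc(block_a, block_b):
    """Zero-mean normalized cross-correlation in [-1, 1]; larger is more similar."""
    n = len(block_a)
    mean_a = sum(block_a) / n
    mean_b = sum(block_b) / n
    da = [a - mean_a for a in block_a]
    db = [b - mean_b for b in block_b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # flat block: correlation is undefined, treat as no match
    return sum(x * y for x, y in zip(da, db)) / denom
```

Because ZNCC subtracts each block's mean and normalizes by its spread, a block and a brightened or contrast-scaled copy of it still score close to 1, whereas their SSD or SAD would be large.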

Next, the camera position and orientation estimation unit 6 performs an initial estimate of the camera position and orientation (transformation matrix Ecw_n) using the feature point tracking results output from the second feature tracking unit 5 (step S3). The transformation matrix Ecw_n is estimated from a predetermined number N1 of randomly selected feature point pairs. Here, the estimate is obtained with an M-estimator, which is one of the robust estimation methods, by minimizing M_1 in equation (5).

In equation (5), e_i is the measurement error of feature point i (0 ≦ i < N1), and f_1(e) is an evaluation function of the measurement error used to suppress the influence of outliers. When a common feature point is observed in consecutive frames n-1 and n, the feature point observed in frame n ideally lies on the epipolar line shown in FIG. 4. The measurement error e_i of the feature point is therefore defined by equation (6) as the distance between the feature point i = (u_i, v_i) observed in frame n and the epipolar line a_i U + b_i V + c_i = 0 on frame n.

The epipolar line a_i U + b_i V + c_i = 0 is uniquely determined by the feature point j = (u_j, v_j), obtained as the output of the second feature tracking unit 5 as the point corresponding to feature point i = (u_i, v_i), and by the camera positions and orientations (transformation matrices Ecw_{n-1} and Ecw_n). Accordingly, finding the transformation matrix Ecw_n that minimizes M_1 in equation (5) completes the initial estimation of the camera position and orientation. The transformation matrix Ecw_n representing the estimated camera position and orientation is temporarily stored in the storage unit 8 as the camera position and orientation estimation result for the frame.
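As a sketch of how a cost of this kind behaves, the following evaluates a robust sum of point-to-epipolar-line distances. The text does not specify f_1, so the Huber function used here is an assumption, as are the function names and the tuning constant k. The epipolar lines are taken as given (a, b, c) coefficients rather than derived from camera poses.

```python
import math


def epipolar_distance(point, line):
    """Perpendicular distance from point (u, v) to the line a*U + b*V + c = 0."""
    u, v = point
    a, b, c = line
    return abs(a * u + b * v + c) / math.hypot(a, b)


def huber(e, k=1.0):
    """Huber function (assumed stand-in for f_1): quadratic near zero, linear beyond k."""
    e = abs(e)
    return 0.5 * e * e if e <= k else k * (e - 0.5 * k)


def robust_cost(points, lines, k=1.0):
    """Sum of robustified epipolar distances over corresponding point/line pairs."""
    return sum(huber(epipolar_distance(p, l), k) for p, l in zip(points, lines))
```

With k = 1, a point 5 pixels from its line contributes 4.5 to the cost instead of the 12.5 that a plain half-squared error would give, which is how an evaluation function of this type limits the influence of outliers.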

Although the above description uses an M-estimator as an example of a method for estimating the transformation matrix Ecw_n representing the camera position and orientation, other known robust estimation methods such as RANSAC (RANdom SAmple Consensus) or LMedS (Least Median of Squares) estimation may be used instead. In addition, when M_1 in equation (5) is less than or equal to a predetermined threshold TH_M, steps S4 and S5 described below may be skipped, and the estimate obtained in step S3 may be used as the transformation matrix Ecw_n representing the camera position and orientation of frame n.

Next, the first feature tracking unit 4 tracks the long-term tracking feature points in two consecutive frames (frame n-1, frame n) by block matching with equation (3) as the evaluation criterion, in the same way as for the short-term tracking feature points in step S2 (step S4). Unlike step S2, however, the feature points tracked here are the feature points Pw_i (0 ≤ i < number of stored feature points) whose three-dimensional positions have already been estimated and stored in the storage unit 8 by the processing described later. Using equation (2), the projected positions Pv_{n-1,i} and Pv_{n,i} of feature point Pw_i on frame n-1 and frame n are obtained from equations (7) and (8), respectively.

Pv_{n-1,i} = CamProj(Ecw_{n-1} Pw_i) ... (7)
Pv_{n,i} = CamProj(Ecw_n Pw_i) ... (8)
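Equations (7) and (8) can be illustrated with a short sketch. CamProj is defined by equation (2), which lies outside this passage, so the pinhole model and the intrinsic parameters f, cx, cy below are assumptions made only for illustration. The 4x4 matrix layout (row-major nested lists mapping world to camera coordinates) is likewise assumed.

```python
def transform(E, Pw):
    """Apply a 4x4 world-to-camera matrix E to a 3D world point Pw."""
    x, y, z = Pw
    h = (x, y, z, 1.0)  # homogeneous coordinates
    return tuple(sum(E[r][c] * h[c] for c in range(4)) for r in range(3))


def cam_proj(Pc, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame point (assumed form of CamProj)."""
    X, Y, Z = Pc
    if Z <= 0:
        raise ValueError("point is behind the camera")
    return (f * X / Z + cx, f * Y / Z + cy)


def project(E, Pw, **intrinsics):
    """Pv = CamProj(E * Pw), mirroring the structure of equations (7) and (8)."""
    return cam_proj(transform(E, Pw), **intrinsics)
```

Calling `project` once with Ecw_{n-1} and once with Ecw_n for the same Pw_i yields the two projected positions used as the starting points for block matching in step S4.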

Also unlike the short-term case, the matching candidates for the long-term tracking feature point Pv_{n-1,i} in frame n-1 are all points lying within a distance d1 of the long-term tracking feature point Pv_{n,i}, and block matching is performed against these candidates. The rest of the processing is the same as in step S2.
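The candidate restriction described above can be sketched as follows. The data layout (a list of candidate point positions with a parallel list of flattened luminance blocks) and the function names are hypothetical. SSD is used as the score on the assumption that it matches the criterion of equation (3).

```python
import math


def candidates_within(center, points, d1):
    """Indices of points lying within distance d1 of the predicted position."""
    cu, cv = center
    return [i for i, (u, v) in enumerate(points)
            if math.hypot(u - cu, v - cv) <= d1]


def match_in_window(template, center, points, blocks, d1):
    """Best SSD match among candidates near the projected position, or None."""
    idxs = candidates_within(center, points, d1)
    if not idxs:
        return None  # no candidate close enough to the projection
    return min(idxs,
               key=lambda i: sum((a - b) ** 2 for a, b in zip(template, blocks[i])))
```

Restricting the search to the neighborhood of the projected position means that a visually identical block far from the prediction is never considered, which keeps the search cheap and avoids gross mismatches.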

Next, the camera position and orientation estimation unit 6 estimates the transformation matrix Ecw_n representing the camera position and orientation using all the feature point tracking results output from the second feature tracking unit 5 in steps S2 and S4 (step S5). To estimate the transformation matrix Ecw_n, a predetermined number N2 (N2 < N1) of feature point pairs are selected at random. The estimation procedure is the same as in step S3, so a detailed description is omitted here. The camera position and orientation estimation unit 6 then stores the transformation matrix Ecw_n, which is the final camera position and orientation estimation result for frame n, in the feature point map in the storage unit 8.

Next, the image composition unit 10 uses the transformation matrix Ecw_n, which is the camera position and orientation estimation result for frame n stored in the storage unit 8, together with the three-dimensional position information of the feature points obtained by the processing described later, to superimpose CG data representing a virtual object stored in the CG data storage unit 9 on the image data of the captured frame n according to a predetermined condition, and outputs the result (step S6). Depending on the application, the predetermined condition may, for example, treat a region where many feature points are distributed as a marker region formed by an arbitrary image and display the CG data representing the virtual object on that marker region, or treat regions where feature points are distributed as obstacles and display the CG data representing the virtual object in a space where no feature points exist.

By repeating the processing of steps S1 to S6 shown in FIG. 3 for each frame input from the camera 1, image data in which CG data is composited onto the photographed image data can be generated and output in real time.

Next, the three-dimensional position estimation of feature points in the image processing apparatus shown in FIG. 1 is described with reference to FIG. 5. First, when the camera 1 starts shooting, the feature point map held in the storage unit 8 is initialized to an empty state (step S11). The transformation matrix Ecw_0 representing the camera position and orientation of the first frame is set so that the camera coordinate system (Xc, Yc, Zc) and the world coordinate system (Xw, Yw, Zw) coincide, and is stored in the storage unit 8 as part of the feature point map.

Next, the first feature detection unit 2 takes in image data from the camera 1 at every predetermined frame interval N and detects long-term tracking feature points based on SIFT features (step S12). Hereinafter, the frames input to the first feature detection unit 2 at every frame interval N are called key frames. The first feature detection unit 2 outputs the positions of the detected long-term tracking feature points and their SIFT feature quantities to the first feature tracking unit 4.
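The split between per-frame and per-key-frame detection can be sketched as a simple schedule. This assumes that key frame k corresponds to frame index kN, which is consistent with the notation Ecw_{nN} used for key frame n; the function names and the string labels are hypothetical.

```python
def is_keyframe(frame_index, interval):
    """True when the frame falls on the key frame grid (every `interval` frames)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return frame_index % interval == 0


def schedule(num_frames, interval):
    """Yield (frame_index, detectors to run) for each frame of the sequence."""
    for n in range(num_frames):
        detectors = ["short_term"]          # lightweight detection on every frame
        if is_keyframe(n, interval):
            detectors.append("long_term")   # costly detection on key frames only
        yield n, detectors
```

With an interval of N, the costly detector runs on roughly 1/N of the frames, which is the source of the reduction in computation that the description attributes to this arrangement.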

Next, the first feature tracking unit 4 tracks the long-term tracking feature points in two consecutive key frames (key frame n-1, key frame n) by matching their SIFT feature quantities (step S13). When the first key frame (key frame 0) is input, however, there is no earlier key frame to pair it with, so this step and all subsequent steps are skipped and processing returns to step S12. As a method for finding matching pairs of feature points, the ANN (Approximate Nearest Neighbor) algorithm can be used, for example (S. Arya, D. M. Mount, R. Silverman, A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching", Journal of the ACM, Vol. 45, No. 6, pp. 891-923, 1998). The ANN algorithm is a local search method based on a k-d tree structure that approximately finds the pairs of feature points with the highest similarity, which allows matching pairs to be found quickly.
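To show what the matching step computes, the following sketch finds the nearest descriptor by exhaustive search. It is an exact brute-force stand-in, not the k-d tree based ANN algorithm cited above, and it returns the result that ANN approximates at lower cost. Descriptors are assumed to be plain lists of numbers, and the `max_sq_dist` rejection threshold is a hypothetical addition.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_descriptors(desc_prev, desc_curr, max_sq_dist=float("inf")):
    """For each descriptor in the previous key frame, find its nearest
    descriptor in the current key frame. Returns a list of (i, j) index pairs."""
    matches = []
    if not desc_curr:
        return matches
    for i, d in enumerate(desc_prev):
        j = min(range(len(desc_curr)), key=lambda k: sq_dist(d, desc_curr[k]))
        if sq_dist(d, desc_curr[j]) <= max_sq_dist:
            matches.append((i, j))
    return matches
```

The exhaustive search costs time proportional to the product of the two descriptor counts, which is the cost that a tree-based approximate search is meant to avoid.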

Next, the feature point three-dimensional position estimation unit 7 determines whether the feature point pairs obtained in step S13 include any new feature points, based on whether their three-dimensional positions are already stored in the feature point map in the storage unit 8 (step S14). If there are feature points not yet stored in the feature point map in the storage unit 8, the feature point three-dimensional position estimation unit 7 estimates an initial value for the three-dimensional position of each such feature point by triangulation, using the transformation matrices Ecw_{(n-1)N} and Ecw_{nN} representing the estimated camera positions and orientations of key frame n-1 and key frame n, and the projected coordinates (u_i, v_i), (u_i, v_i) of the feature point observed in key frame n-1 and key frame n, respectively. It then stores the three-dimensional position of the feature point and the projected coordinates (u_i, v_i), (u_i, v_i) observed in key frame n-1 and key frame n in the feature point map in the storage unit 8 (step S15).
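The text names triangulation without specifying a method. One common choice, shown here purely as an assumed illustration, is the midpoint method: each observation is back-projected into a viewing ray in world coordinates, and the 3D point is taken as the midpoint of the shortest segment joining the two rays. The sketch assumes the camera centers and ray directions have already been computed from the poses and projected coordinates.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the closest points on the rays c1 + s*d1 and c2 + t*d2."""
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; no stable intersection")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))
```

The parallel-ray check reflects a practical limit of any two-view triangulation: when the camera has barely moved between the two key frames, the rays are nearly parallel and the depth is poorly constrained.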

Next, the feature point three-dimensional position estimation unit 7 updates the three-dimensional positions of the feature points by minimizing M_2 in equation (9) using an M-estimator, which is one of the robust estimation methods (step S16).

In equation (9), e_{i,j} is the observation error of feature point j in key frame i, and f_2(e) is an evaluation function of the measurement error used to suppress the influence of outliers. Since the camera positions and orientations for key frame 0 through key frame n have already been estimated, the feature point map in the storage unit 8 is updated with the three-dimensional position of each feature point obtained by minimizing M_2 (step S17), and processing returns to step S12.
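Equation (9) is not reproduced in this text. A form consistent with the description, given as an assumption rather than as the published formula, sums the evaluation function over all key frames and the feature points observed in them. Reading the observation error as a reprojection error (the distance between the observed coordinates and the projection of the estimated 3D position) is a further assumption:

```latex
M_2 = \sum_{i=0}^{n} \sum_{j} f_2(e_{i,j}) \qquad (9)

e_{i,j} = \bigl\lVert (u_{i,j}, v_{i,j}) - \mathrm{CamProj}(Ecw_{iN}\, Pw_j) \bigr\rVert
```

Here the inner sum runs over the feature points j observed in key frame i, and (u_{i,j}, v_{i,j}) denotes the observed coordinates of feature point j in that key frame.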

As described above, the computationally expensive image feature detection is performed only at predetermined frame intervals, which reduces the amount of computation. For short-term changes in camera position and orientation in the frames where that expensive detection is not performed, feature points are detected with a computationally cheap image feature detection and tracked by block matching, which makes it possible to estimate smooth changes in camera position and orientation with high accuracy. As a result, stable estimation of camera position and orientation by tracking feature points in arbitrary images, together with estimation of the three-dimensional positions of the feature points, can be achieved in real time. The processing of a virtual reality system that realizes augmented reality by superimposing a virtual object generated by CG or the like on moving image data of a real space captured by a camera, and displaying the result on an HMD (Head Mount Display) or the like, can therefore be executed stably.

A program for realizing the functions of the first feature detection unit 2, the second feature detection unit 3, the first feature tracking unit 4, the second feature tracking unit 5, the camera position and orientation estimation unit 6, the feature point three-dimensional position estimation unit 7, and the image composition unit 10 shown in FIG. 1 may be recorded on a computer-readable recording medium, and the processing of generating and outputting image data in which CG data is composited onto photographed image data may be performed by having a computer system read and execute the program recorded on the medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system equipped with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as volatile memory (RAM) inside a computer system that acts as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. The "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

The present invention is applicable to virtual reality systems that realize augmented reality by capturing a real space as moving image data with a camera and displaying a virtual object, generated by computer graphics or the like, superimposed on that moving image data.

DESCRIPTION OF SYMBOLS: 1 ... camera, 2 ... first feature detection unit, 3 ... second feature detection unit, 4 ... first feature tracking unit, 5 ... second feature tracking unit, 6 ... camera position and orientation estimation unit, 7 ... feature point three-dimensional position estimation unit, 8 ... storage unit, 9 ... CG data storage unit, 10 ... image composition unit

Claims

An image processing apparatus that superimposes image data of a virtual object on captured moving image data and outputs the result, comprising:
a camera that acquires moving image data by photographing the surrounding environment;
a first feature detection unit that detects long-term tracking feature points based on a first image feature at predetermined frame intervals of the moving image data;
a second feature detection unit that detects short-term tracking feature points based on a second image feature for all frames of the moving image data;
a first feature tracking unit that tracks feature points between frames based on the feature quantities of the long-term tracking feature points detected from two different frames;
a second feature tracking unit that tracks feature points between frames by block matching based on luminance values in the neighborhood of the long-term tracking feature points or the short-term tracking feature points detected from two different frames;
a camera position and orientation estimation unit that estimates the position and orientation of the camera in a three-dimensional space based on the feature points tracked by the second feature tracking unit;
a feature point three-dimensional position estimation unit that estimates the three-dimensional positions of the feature points based on the feature points tracked by the first feature tracking unit and the estimated position and orientation of the camera in the three-dimensional space; and
an image composition unit that, based on the estimated position and orientation information of the camera in the three-dimensional space and the three-dimensional position information of the feature points, superimposes image data of the virtual object input from the outside on the moving image data captured by the camera and outputs the result.

The image processing apparatus according to claim 1, wherein the camera position and orientation estimation unit estimates the position and orientation information of the camera in the three-dimensional space for all captured frames, and
the feature point three-dimensional position estimation unit estimates the three-dimensional position information of the feature points at every predetermined frame interval.

The image processing apparatus according to claim 1, wherein the camera position and orientation estimation unit
performs an initial estimation of the camera position and orientation of a frame using the tracked short-term tracking feature points, and
determines the camera position and orientation of the frame using the tracked short-term tracking feature points and long-term tracking feature points.

The image processing apparatus according to claim 1, wherein the feature point three-dimensional position estimation unit
estimates three-dimensional positions for the tracked long-term tracking feature points.