JP2015184773A

JP2015184773A - Image processor and three-dimensional object tracking method

Info

Publication number: JP2015184773A
Application number: JP2014058625A
Authority: JP
Inventors: 宏将武井; Hiromasa Takei
Original assignee: Nihon Unisys Ltd
Current assignee: Nihon Unisys Ltd
Priority date: 2014-03-20
Filing date: 2014-03-20
Publication date: 2015-10-22
Anticipated expiration: 2034-03-20
Also published as: JP6310288B2

Abstract

PROBLEM TO BE SOLVED: To enable the motion of a target object in a three-dimensional space to be tracked with high accuracy from a two-dimensional video image picked up by a monocular camera.

SOLUTION: An image processor includes a boundary extraction part 11 for extracting a boundary of a target object from a picked-up two-dimensional image, a radiation shape generation part 12 for generating a radiation shape by connecting each point of the boundary with a camera center, a sample point extraction part 14 for extracting, from three-dimensional data of the target object projected on the two-dimensional image, sample points that become the boundary when the target object appears in the two-dimensional image, and a positioning part 15 for positioning the three-dimensional data so that the sample points lie on the radiation shape. By positioning the three-dimensional data of the target object against the three-dimensional radiation shape generated from the two-dimensional image, the image processor acquires an accurate three-dimensional position of the target object from the positioned three-dimensional data, and it performs three-dimensional tracking of the target object by repeating this acquisition at each prescribed frame interval of a two-dimensional video image.

Description

The present invention relates to an image processing apparatus and a three-dimensional object tracking method, and is particularly suitable for an image processing apparatus that tracks the motion of a target object in a three-dimensional space from a two-dimensional video.

Many methods for tracking a target object in a single video have been proposed in the field of image processing. Tracking a target object with a particle filter is an especially well-known method. However, most of these methods track the target object within the two-dimensional video and cannot track its motion in a three-dimensional space.

In research on markerless AR (Augmented Reality), there are several studies on systems that recognize the three-dimensional space in a video. In particular, PTAM (Parallel Tracking and Mapping) is well known as work on recognizing a three-dimensional space with a monocular camera. PTAM defines a three-dimensional coordinate system by recognizing a nearly planar region from image feature points in the video, and recognizes three-dimensional positions from the relative positions of image feature points between frames of the video. However, PTAM is designed for tracking space, and its recognition accuracy and tracking accuracy in a three-dimensional space are insufficient for object tracking.

The technique described in Patent Document 1 has also been proposed for processing a two-dimensional image to determine the three-dimensional position and orientation of a target object. The image processing apparatus described in Patent Document 1 includes an image acquisition unit that acquires an input image generated by imaging a real space with an imaging device, a recognition unit that recognizes the relative position and orientation between the real space and the imaging device based on the positions of one or more feature points appearing in the input image, and an application unit that provides an augmented reality application using the recognized relative position and orientation.

According to the technique described in Patent Document 1, the three-dimensional position and orientation of a target object can be detected from a two-dimensional image captured by a monocular camera. However, because this technique identifies characteristic parts of the two-dimensional image and detects the three-dimensional position and orientation of the target object based on the positions of those feature points, the detection accuracy depends heavily on the number of feature points extracted and on the accuracy of their extraction. The technique is therefore suited to spaces with many feature points or to large spaces, but the detection accuracy of the three-dimensional position and orientation degrades in spaces with few feature points or in confined, narrow spaces.

In contrast, a technique called motion capture exists as a method for three-dimensional tracking of objects. However, motion capture requires special measuring equipment, so it is difficult to introduce at sites where such hardware cannot readily be used.

Patent Document 1: JP 2013-225245 A

The present invention has been made to solve these problems. Its object is to enable the motion of a target object in a three-dimensional space to be tracked accurately from a two-dimensional video captured by a monocular camera, regardless of the nature of the space being processed.

To solve the above problems, the present invention extracts two-dimensional images as still images, at predetermined frame intervals, from a two-dimensional video generated by imaging a real space with an imaging device, and tracks the motion of the target object in a three-dimensional space by sequentially acquiring spatial information on the three-dimensional position and orientation of the target object from the extracted two-dimensional image of each frame. Specifically, the boundary of the target object is extracted from the two-dimensional image of the current frame, and a radial shape is generated in the coordinate system of the three-dimensional space from the plurality of straight lines that connect each point of the extracted boundary with the center position of the imaging device. Meanwhile, the three-dimensional data of the target object is projected onto the two-dimensional image using the three-dimensional position of the target object obtained from the two-dimensional image of the previous frame, and the points that become the boundary when the target object appears in the two-dimensional image are extracted from that data as sample points of the three-dimensional data. The three-dimensional data is then aligned so that the sample points lie on the radial shape, and the spatial information on the three-dimensional position and orientation of the target object is acquired from the aligned three-dimensional data.

According to the present invention configured as described above, the two-dimensional space of the two-dimensional image and the three-dimensional space of the three-dimensional data can be linked by aligning the three-dimensional radial shape, generated from the two-dimensional image of the target object captured by the imaging device, with the three-dimensional data of the target object. Because the three-dimensional data carries the three-dimensional position and orientation of the target object, an accurate three-dimensional position and orientation of the target object can be acquired from the aligned three-dimensional data. Performing this processing on each of the two-dimensional images acquired at predetermined frame intervals from the two-dimensional video captured by the imaging device tracks the target object. The motion of the target object in a three-dimensional space can thus be tracked accurately from a two-dimensional video captured by a monocular imaging device, regardless of the nature of the space being processed.

A block diagram showing an example of the functional configuration of the image processing apparatus according to the first embodiment.
A diagram explaining the principle of the pinhole camera model used in this embodiment.
A diagram showing an example of the radial shape generated by the radial shape generation unit of this embodiment.
A diagram showing the relationship between the coordinate system of the two-dimensional image plane and the coordinate system of the three-dimensional space.
A diagram showing the result of position correction of the three-dimensional data by the position correction unit of this embodiment.
A flowchart showing an example of the operation of the image processing apparatus according to the first embodiment.
A block diagram showing an example of the functional configuration of the image processing apparatus according to the second embodiment.
A diagram explaining foreground and background labeling using a Gaussian mixture distribution.
A diagram comparing the Cauchy distribution with the normal distribution.
A flowchart showing an example of the operation of the image processing apparatus according to the second embodiment.

(First embodiment)
A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram illustrating an example of the functional configuration of an image processing apparatus 100 according to the first embodiment. As shown in FIG. 1, the image processing apparatus 100 according to the first embodiment includes, as its functional configuration, a 2D video acquisition unit 10, a 2D image acquisition unit 20, and a tracking unit 30. The tracking unit 30 includes, as its specific functional configuration, a boundary extraction unit 11, a radial shape generation unit 12, a three-dimensional data projection unit 13, a sample point extraction unit 14, an alignment unit 15, a spatial information acquisition unit 16, and a camera parameter storage unit 17.

Each of the above functional blocks can be implemented in hardware, in a DSP (Digital Signal Processor), or in software. When implemented in software, for example, each functional block is actually built around the CPU, RAM, ROM, and so on of a computer, and is realized by running a program stored in a recording medium such as RAM, ROM, a hard disk, or a semiconductor memory.

The main problem to be solved in this embodiment is "determining the three-dimensional position and orientation of the target object appearing in a two-dimensional image". If the two-dimensional video captured by the monocular camera 200 is regarded as a succession of still two-dimensional images, then once the three-dimensional position and orientation of the target object can be determined for each two-dimensional image, three-dimensional tracking can be performed by acquiring the position and orientation continuously.

The image, on a two-dimensional image, of a target object whose three-dimensional position and orientation are known can easily be obtained from the correspondence between the image plane and the three-dimensional space. The problem to be solved in this embodiment, however, is the inverse problem, and its solution is not simple. To address it, this embodiment links the two-dimensional image and the three-dimensional space using the pinhole camera model shown in FIG. 2.

As shown in FIG. 2, the pinhole camera model has a camera center 21 and an image plane 22. A point 25 on the target object 23 is imaged at the intersection 24 of the image plane 22 with the straight line 26 connecting the point 25 and the camera center 21. Conversely, when the point 25 on the target object 23 projects to a point 24 on the boundary line of the image on the image plane 22, the target object 23 is tangent, at the point 25, to the straight line 26 connecting the camera center 21 and the point 24. In other words, a point 25 that forms the boundary of the image of the target object 23 projected onto the two-dimensional image lies on the straight line 26 connecting the camera center 21 and the point 24 on the boundary of the target object in the two-dimensional image.

This embodiment therefore matches the "boundary of the image formed by projecting the three-dimensional data of the target object onto the two-dimensional image plane" to the "boundary of the target object appearing in the two-dimensional image", and acquires the three-dimensional position and orientation of the target object from the three-dimensional data in that state.

The distance between the camera center 21 and the image plane 22 is called the focal length, and the number of pixels per unit distance is called the resolution. In this embodiment, the monocular camera 200 is fixed and the target object moves. Calibration is performed in advance to calculate the focal length and resolution of the monocular camera 200, and these values are stored beforehand in the camera parameter storage unit 17 as camera parameters. The auto-zoom function of the monocular camera 200 is disabled.

Each functional block shown in FIG. 1 is configured to perform the processing described above. The 2D video acquisition unit 10 acquires a two-dimensional video generated by imaging a real space with the monocular camera 200. The 2D image acquisition unit 20 acquires a two-dimensional image as a still image every n frames (n is any number equal to or greater than 1) from the two-dimensional video acquired by the 2D video acquisition unit 10.
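The every-n-frames acquisition described above can be sketched as follows. This is only a minimal illustration: the function name `sample_frames` and the treatment of the video as a Python iterable of frames are assumptions made here, not part of the disclosure.

```python
from itertools import islice


def sample_frames(frames, n=1):
    """Return an iterator over every n-th frame of `frames`, starting with the first.

    `frames` is any iterable of frame objects (hypothetical stand-in for the
    video stream); `n` is the frame interval and must be 1 or greater.
    """
    if n < 1:
        raise ValueError("n must be 1 or greater")
    return islice(frames, 0, None, n)
```

Because `islice` is lazy, the same sketch would apply to a live stream as well as to a video read back from memory.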

In the example of FIG. 1, the monocular camera 200 is connected to the image processing apparatus 100, such as a personal computer, and the 2D video acquisition unit 10 acquires the two-dimensional video captured by the monocular camera 200 in real time; however, the present invention is not limited to this. For example, the two-dimensional video captured by the monocular camera 200 may be stored in a memory and read in later by the 2D video acquisition unit 10.

The tracking unit 30 tracks the motion of the target object in the three-dimensional space by acquiring spatial information on the three-dimensional position and orientation of the target object from the two-dimensional image of each frame acquired by the 2D image acquisition unit 20. Specifically, the three-dimensional position and orientation of the target object are acquired by the functional blocks 11 to 17 described below.

The boundary extraction unit 11 extracts the boundary of the target object from the two-dimensional image acquired for the current frame by the 2D image acquisition unit 20. The boundary of the target object in the two-dimensional image is the line at the border between the image of the target object and the image of the background; a boundary line between the constituent faces of the target object is not necessarily a boundary of the target object in the two-dimensional image. For example, the boundary extraction unit 11 extracts the target object from the two-dimensional image by foreground extraction and then extracts the boundary from the extracted target object. The boundary can be extracted, for example, by so-called edge detection (processing that identifies locations where the brightness or color of the image changes sharply, that is, discontinuously).
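As a rough illustration of taking the boundary from an already extracted foreground, the sketch below marks every foreground pixel that touches the background or the image edge. It is a simple stand-in for the edge detection mentioned above, not the method of the disclosure; the function name `extract_boundary` and the representation of the foreground as a nested list of 0/1 values are assumptions.

```python
def extract_boundary(mask):
    """Return (w, h) coordinates of foreground pixels adjacent to the background.

    `mask` is a list of rows; a truthy entry means "target object" (foreground).
    A foreground pixel is a boundary pixel when one of its four neighbours is
    background or lies outside the image.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    boundary = []
    for h in range(height):
        for w in range(width):
            if not mask[h][w]:
                continue
            for dw, dh in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nw, nh = w + dw, h + dh
                outside = not (0 <= nw < width and 0 <= nh < height)
                if outside or not mask[nh][nw]:
                    boundary.append((w, h))
                    break
    return boundary
```

Interior pixels of the object, including edges between its faces, are not reported, which matches the distinction drawn above between the object-background boundary and face boundaries.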

The radial shape generation unit 12 generates a radial shape from the plurality of straight lines that connect the center position of the monocular camera 200 with each point of the boundary extracted by the boundary extraction unit 11. FIG. 3 is a diagram illustrating an example of the radial shape generated by the radial shape generation unit 12. For convenience of illustration, FIG. 3 shows only four of the straight lines 36 constituting the radial shape. As shown in FIG. 3, the radial shape generation unit 12 generates the radial shape in the coordinate system of the three-dimensional space from the plurality of straight lines 36 that connect the camera center 31, obtained from the camera parameters stored in the camera parameter storage unit 17, with each point 34 on the boundary of the target object in the two-dimensional image on the image plane 32.

In this embodiment, the relationship between the coordinate system of the two-dimensional image plane and the coordinate system of the three-dimensional space is defined as shown in FIG. 4. In the coordinate system on the two-dimensional image plane 32, the width direction is the w-axis and the height direction is the h-axis. In the coordinate system of the three-dimensional space, the camera center 31 is the origin, the direction of the perpendicular dropped from the camera center 31 to the image plane 32 is the z-axis, the direction orthogonal to the z-axis and parallel to the w-axis of the image plane 32 is the x-axis, and the direction orthogonal to the z-axis and parallel to the h-axis of the image plane 32 is the y-axis. The intersection of the z-axis with the image plane 32 is the image center.
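Under the coordinate system just defined, each boundary pixel determines one straight line through the camera center. The following sketch computes the unit direction of each line; it assumes the image center is known in pixel coordinates (`image_center`, a hypothetical parameter) and that `focal_length` and `resolution` are the calibrated camera parameters, expressed in a unit of length and in pixels per that unit respectively.

```python
import math


def boundary_rays(boundary_pixels, focal_length, resolution, image_center):
    """Return unit direction vectors of the lines from the camera center
    (the origin) through each boundary pixel on the image plane z = focal_length.

    boundary_pixels: iterable of (w, h) pixel coordinates.
    resolution: pixels per unit distance on the image plane.
    image_center: (w, h) pixel coordinates where the z-axis meets the image plane.
    """
    cw, ch = image_center
    rays = []
    for w, h in boundary_pixels:
        # Convert pixel offsets from the image center into metric x, y.
        x = (w - cw) / resolution
        y = (h - ch) / resolution
        z = focal_length
        norm = math.sqrt(x * x + y * y + z * z)
        rays.append((x / norm, y / norm, z / norm))
    return rays
```

Since every line passes through the origin, storing only the direction is enough to represent the radial shape in this sketch.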

The three-dimensional data projection unit 13 projects three-dimensional data representing the target object onto the two-dimensional image based on the pinhole camera model. The three-dimensional data projected here is, for example, three-dimensional mesh data in which the same shape as the target object is represented by a plurality of triangular or quadrangular meshes. When the first frame is processed immediately after tracking starts, the three-dimensional data is projected at a predetermined initial position. The initial position is arbitrary, but it is preferable to set, as the initial position, a position at which the data is projected near the position of the target object appearing in the two-dimensional image.
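The pinhole projection of a single vertex of the three-dimensional data can be written in a few lines. This is a sketch under the same assumptions as the camera model in this embodiment (camera center at the origin, optical axis along z); the function name `project_point` and the `image_center` parameter are introduced here for illustration only.

```python
def project_point(point, focal_length, resolution, image_center):
    """Project a 3D point (x, y, z) in camera coordinates to (w, h) pixel coordinates.

    The point is first mapped to the image plane z = focal_length by similar
    triangles, then converted to pixels using `resolution` (pixels per unit
    distance) and the pixel position of the image center.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    cw, ch = image_center
    w = cw + resolution * focal_length * x / z
    h = ch + resolution * focal_length * y / z
    return (w, h)
```

Applying this to every node of the mesh gives the virtual projection from which the boundary points of the three-dimensional data can be computed.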

That is, the boundary extracted by the boundary extraction unit 11 gives the position of the target object in the two-dimensional image. The position of the monocular camera 200 is also known from the camera parameters stored in the camera parameter storage unit 17. From this information, the approximate position of the target object in the three-dimensional space can be estimated with the pinhole camera model.

When the second and subsequent frames are processed, the three-dimensional data projection unit 13 projects the three-dimensional data representing the target object using the three-dimensional position of the target object that the spatial information acquisition unit 16 obtained from the two-dimensional image of the previous frame.

The sample point extraction unit 14 extracts, from the three-dimensional data of the target object projected onto the two-dimensional image by the three-dimensional data projection unit 13, a plurality of points that become the boundary when the target object appears in the two-dimensional image, as sample points of the three-dimensional data. In FIG. 3, reference numeral 33 denotes the target object represented by the three-dimensional data projected at its position before alignment, and reference numeral 35 denotes the plurality of sample points extracted from the target object 33 represented by that three-dimensional data. The example of FIG. 3 shows four sample points 35 corresponding to the four boundary points 34 on the two-dimensional image (before the three-dimensional data is aligned, their positions do not correspond exactly to the boundary points 34).

For example, when the projected three-dimensional data is mesh data, the target object is represented by a plurality of triangular or quadrangular meshes, each formed by connecting three or four nodes. If a normal is extended from each mesh face, the angle of the normal as seen from the camera center is approximately 90 degrees at locations that form the boundary of the target object. The sample point extraction unit 14 therefore extends a normal from each of the mesh faces that share a given node and checks the angle of each normal as seen from the camera center. If meshes whose normal angle is less than 90 degrees and meshes whose normal angle is greater than 90 degrees are both present, that node is extracted as a sample point. By performing this processing for a plurality of nodes, the sample point extraction unit 14 extracts a plurality of points that become the boundary when the target object appears in the two-dimensional image, as sample points of the three-dimensional data.
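The node test described above can be sketched as follows. The sketch classifies each face by the sign of the dot product between its outward normal and the direction toward the camera (positive means an angle below 90 degrees) and keeps the nodes that are shared by faces of both kinds. The function name `silhouette_nodes` is hypothetical, and orienting each normal away from the mesh centroid is a simplifying assumption that holds for convex shapes only.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def silhouette_nodes(vertices, faces, camera=(0.0, 0.0, 0.0)):
    """Return indices of nodes shared by camera-facing and away-facing faces.

    vertices: list of (x, y, z); faces: list of index tuples (3 or 4 nodes).
    Assumption: the mesh is convex, so normals can be oriented outward by
    pointing them away from the centroid of all vertices.
    """
    n = len(vertices)
    center = tuple(sum(v[k] for v in vertices) / n for k in range(3))
    front = [False] * n
    back = [False] * n
    for face in faces:
        a, b, c = (vertices[i] for i in face[:3])
        normal = _cross(_sub(b, a), _sub(c, a))
        fc = tuple(sum(vertices[i][k] for i in face) / len(face) for k in range(3))
        if _dot(normal, _sub(fc, center)) < 0:
            normal = tuple(-x for x in normal)  # flip so the normal points outward
        facing = _dot(normal, _sub(camera, fc))  # > 0: angle below 90 degrees
        for i in face:
            if facing > 0:
                front[i] = True
            elif facing < 0:
                back[i] = True
    return [i for i in range(n) if front[i] and back[i]]
```

For a cube centered on the optical axis, for instance, only the four nodes of the face nearest the camera satisfy the mixed condition, which agrees with the square outline such a cube has in the image.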

For convenience of explanation, FIG. 3 shows the three-dimensional data of the target object 33 actually projected onto the two-dimensional image, but the three-dimensional data does not necessarily have to be projected and displayed on the two-dimensional image. That is, the three-dimensional data of the target object 33 can be projected virtually and the plurality of sample points extracted by calculation.

The alignment unit 15 aligns the three-dimensional data so that the sample points extracted by the sample point extraction unit 14 lie on the radial shape generated by the radial shape generation unit 12. For example, the alignment unit 15 uses the so-called ICP (Iterative Closest Point) algorithm: it obtains correspondences between the straight lines 36 of the radial shape and the sample points 35 by nearest points, and aligns the three-dimensional data by repeating a transformation that minimizes the distances between the corresponding pairs.
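To make the iterate-match-minimize structure concrete, the sketch below implements a deliberately reduced, translation-only variant of point-to-line ICP in plain Python. It is not the alignment of the disclosure, which also recovers orientation; a fuller version would additionally estimate a rotation in each iteration. The function names and the representation of each line by a unit direction through the origin are assumptions made for this illustration.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def residual(p, d):
    """Vector from the closest point on the line {s*d} to p (d has unit length)."""
    s = sum(pc * dc for pc, dc in zip(p, d))
    return tuple(pc - s * dc for pc, dc in zip(p, d))


def line_distance(p, d):
    return math.sqrt(sum(c * c for c in residual(p, d)))


def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a, b):
    """Solve the 3x3 linear system a x = b by Cramer's rule."""
    d = _det3(a)
    return tuple(
        _det3([[b[i] if j == k else a[i][j] for j in range(3)] for i in range(3)]) / d
        for k in range(3)
    )


def align_translation(points, rays, iterations=10):
    """Translation-only ICP: move `points` so that they lie on the lines `rays`.

    Each iteration (1) matches every point to its nearest line and (2) solves
    the least-squares translation t from sum(P_i) t = -sum(P_i p_i), where
    P_i = I - d_i d_i^T projects onto the plane orthogonal to line i.
    Returns (total_translation, moved_points).
    """
    pts = [tuple(p) for p in points]
    total = (0.0, 0.0, 0.0)
    for _ in range(iterations):
        matched = [min(rays, key=lambda d: line_distance(p, d)) for p in pts]
        a = [[0.0] * 3 for _ in range(3)]
        b = [0.0] * 3
        for p, d in zip(pts, matched):
            r = residual(p, d)  # equals P_i p_i
            for i in range(3):
                b[i] -= r[i]
                for j in range(3):
                    a[i][j] += (1.0 if i == j else 0.0) - d[i] * d[j]
        t = _solve3(a, b)
        pts = [tuple(pc + tc for pc, tc in zip(p, t)) for p in pts]
        total = tuple(x + y for x, y in zip(total, t))
    return total, pts
```

The linear system is solvable as long as at least two matched lines are not parallel; with a good starting position, such as the pose from the previous frame, the nearest-line matching is more likely to pick the intended correspondences.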

FIG. 5 is a diagram illustrating the result of aligning the three-dimensional data with the alignment unit 15. As shown in FIG. 5, once the three-dimensional data has been aligned, the plurality of sample points 35 extracted from the three-dimensional data lie on the plurality of straight lines 36 constituting the radial shape generated by the radial shape generation unit 12.

The spatial information acquisition unit 16 acquires spatial information on the three-dimensional position and orientation of the target object from the three-dimensional data aligned in this way by the alignment unit 15. Because the three-dimensional data is three-dimensional mesh data, it inherently holds spatial information on the three-dimensional position and orientation of the target object. The spatial information acquisition unit 16 therefore only needs to acquire the three-dimensional position and orientation held by the aligned three-dimensional mesh data.

FIG. 6 is a flowchart showing an example of the operation of the image processing apparatus 100 according to the first embodiment configured as described above. The flowchart shown in FIG. 6 starts when the user turns on the image processing apparatus 100 and performs an operation instructing it to track the target object.

First, the 2D video acquisition unit 10 acquires from the monocular camera 200 a two-dimensional video generated by imaging a real space that includes the target object whose three-dimensional position and orientation are to be determined (step S1). The 2D image acquisition unit 20 then acquires a two-dimensional image for one frame from the two-dimensional video acquired by the 2D video acquisition unit 10 (step S2).

Next, the boundary extraction unit 11 extracts the boundary between the target object and the background from the two-dimensional image acquired by the 2D image acquisition unit 20 (step S3). The radial shape generation unit 12 then generates a radial shape in the coordinate system of the three-dimensional space from the plurality of straight lines that connect the center position of the monocular camera 200 with each point of the boundary extracted by the boundary extraction unit 11 (step S4).

Meanwhile, the three-dimensional data projection unit 13 projects the three-dimensional mesh data representing the target object based on the pinhole camera model (step S5). The sample point extraction unit 14 then extracts, from the three-dimensional data of the target object projected onto the two-dimensional image by the three-dimensional data projection unit 13, the points that become the boundary when the target object appears in the two-dimensional image, as sample points of the three-dimensional data (step S6).

Note that steps S2 to S6 need not be performed in the order described above. For example, steps S5 and S6 may be performed first, followed by steps S2 to S4. Alternatively, steps S2 to S4 and steps S5 and S6 may be performed at the same time. However, if step S3 is performed before step S5, the position of the target object in the two-dimensional image is already known, so when the first frame is processed, the position where the target object actually is, or its vicinity, can be estimated as the initial position at which to project the three-dimensional data.

Next, the alignment unit 15 aligns the three-dimensional data, for example using the ICP algorithm, so that the sample points extracted by the sample point extraction unit 14 lie on the radial shape generated by the radial shape generation unit 12 (step S7). The spatial information acquisition unit 16 then acquires spatial information on the three-dimensional position and orientation of the target object from the three-dimensional data aligned by the alignment unit 15 (step S8).
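
One iteration of an ICP-style alignment needs, for every sample point, the nearest point on the radial shape. The sketch below shows only that correspondence step, assuming each ray starts at the origin (the camera center) and is given as a unit direction; the function names are hypothetical, and the transformation update that would complete an ICP iteration is omitted.

```python
import math

def closest_point_on_ray(p, d):
    """Closest point to p on the ray {t * d : t >= 0}, where d is a unit vector."""
    t = max(0.0, sum(pi * di for pi, di in zip(p, d)))
    return tuple(t * di for di in d)

def nearest_correspondence(p, rays):
    """Return (closest point, distance) over all rays of the radial shape."""
    best = None
    for d in rays:
        q = closest_point_on_ray(p, d)
        dist = math.dist(p, q)
        if best is None or dist < best[1]:
            best = (q, dist)
    return best

# A sample point and two illustrative rays.
match = nearest_correspondence((1.0, 0.0, 2.0), [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)])
```

A full alignment would repeat this search and then update the rigid transformation of the three-dimensional data so that the summed distances decrease.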

Next, the image processing apparatus 100 determines whether to end tracking of the target object (step S9). For example, when the user turns off the image processing apparatus 100 or performs an operation to stop the tracking process, the image processing apparatus 100 determines that tracking is to end, and the processing of the flowchart shown in FIG. 6 ends. Otherwise, the process returns to step S1 and continues with the next frame.

As described in detail above, according to the first embodiment, aligning the three-dimensional radial shape generated from the two-dimensional image of the target object captured by the monocular camera 200 with the three-dimensional data of the target object links the two-dimensional space of the two-dimensional image to the three-dimensional space of the three-dimensional data. Because the three-dimensional data carries the three-dimensional position and orientation of the target object, an accurate three-dimensional position and orientation of the target object can be obtained from the aligned three-dimensional data. Performing this processing on each of the two-dimensional images acquired at predetermined frame intervals from the two-dimensional video captured by the monocular camera 200 makes it possible to track the target object. As a result, the movement of the target object in the three-dimensional space can be tracked accurately from the two-dimensional images captured by the monocular camera 200, regardless of properties of the space being processed, such as whether it has many feature points or whether it is large.

(Second Embodiment)
Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 7 is a block diagram showing an example of the functional configuration of the image processing apparatus 100' according to the second embodiment. In FIG. 7, components given the same reference numerals as in FIG. 1 have the same functions, so redundant description is omitted here.

As shown in FIG. 7, the tracking unit 30' according to the second embodiment further includes a position correction unit 18 as part of its functional configuration. In place of the boundary extraction unit 11 and the three-dimensional data projection unit 13, it includes a boundary extraction unit 11' and a three-dimensional data projection unit 13'. The boundary extraction unit 11' specifically comprises a foreground extraction processing unit 11a and a boundary extraction processing unit 11b.

The foreground extraction processing unit 11a extracts the target object from the two-dimensional image acquired by the two-dimensional image acquisition unit 20 by performing foreground extraction using learning data generated from teacher data. The boundary extraction processing unit 11b extracts the boundary from the target object extracted by the foreground extraction processing unit 11a. The first embodiment likewise extracts the target object from the two-dimensional image by foreground extraction and then extracts the boundary from the extracted target object; in the second embodiment, however, the foreground extraction is performed by supervised machine learning, that is, machine learning with teacher data.

To improve how well tracking follows the target object, it is desirable to perform the foreground extraction and boundary extraction in real time, that is, at high speed. A well-known foreground extraction method is, for example, foreground extraction using graph cuts. Although this method offers high extraction accuracy, it requires solving what is called the minimum cut problem, so its processing cost is high. It is therefore unsuited to the real-time processing that tracking requires. In this embodiment, foreground extraction is instead performed using learning data generated from teacher data, a method whose extraction accuracy is somewhat lower but which is fast and suited to real-time processing. The alignment process that follows is then designed to tolerate the noise introduced by the foreground extraction.

First, the generation of the learning data will be described. In this embodiment, images of the target object and images of the background are used as teacher data. The foreground extraction processing unit 11a creates the learning data by modeling the data extracted from this teacher data with a mixture of normal distributions. A mixture of normal distributions is a probability distribution formed by multiplying several normal distributions by weights that sum to 1 and adding the results, and so can have multiple peaks. It is often used to model data whose frequencies have more than one peak.
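
Written out, the mixture described above takes the following standard form. The symbols are notation introduced here for illustration and do not appear in the specification: $K$ is the number of normal distributions, $\pi_k$ are the weights, and $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are the mean and covariance of the $k$-th normal distribution.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```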

Here, the method of creating the model will be described. To create the model, the foreground extraction processing unit 11a represents the color of each pixel of the teacher data as a three-dimensional coordinate value (referred to here as a color coordinate value). RGB values and HSV values are well-known ways of representing color. This embodiment uses HSV values, a representation close to how humans perceive color. An HSV value is represented as a point in a cone-shaped coordinate system. Although an example using HSV values as the color model is described here, another model (for example, CIE L*a*b*) may be used.

The foreground extraction processing unit 11a converts the RGB value of each pixel of the two-dimensional image into an HSV value, and then converts this into the corresponding coordinate value obtained when the cone representing the HSV coordinate system is embedded in three-dimensional space. In this way, the RGB value of each pixel is converted into a three-dimensional color coordinate value. The foreground extraction processing unit 11a converts the RGB values of the foreground and background pixels into three-dimensional color coordinate values and computes the probability distribution of these color coordinate values by maximum likelihood estimation of a mixture of normal distributions. Maximum likelihood estimation is a technique that finds the probability distribution under which the given data is most probable. The computed mixture of normal distributions is used as the learning data.
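
The conversion from RGB to a color coordinate value can be sketched as follows. The specification does not give the embedding formula, so this example assumes one common choice: hue becomes the angle around the cone axis, saturation times value becomes the radius, and value becomes the height. The function name is hypothetical.

```python
import colorsys
import math

def rgb_to_color_coordinate(r, g, b):
    """Map an 8-bit RGB color to a point in an HSV cone embedded in 3D space."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    angle = 2.0 * math.pi * h   # hue as an angle around the cone axis
    radius = s * v              # the cone narrows toward black (v = 0)
    return (radius * math.cos(angle), radius * math.sin(angle), v)

red = rgb_to_color_coordinate(255, 0, 0)
white = rgb_to_color_coordinate(255, 255, 255)
black = rgb_to_color_coordinate(0, 0, 0)
```

With this embedding, colors that look similar lie close together in Euclidean terms, which a raw (h, s, v) triple does not guarantee because hue wraps around.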

Each normal distribution in this mixture can be labeled as either foreground or background. That is, if many foreground color coordinate values are gathered near the center of a normal distribution, that distribution is labeled as foreground; if many background color coordinate values are gathered there, it is labeled as background. In the example of FIG. 8, foreground data is gathered around the left normal distribution and background data around the right one. The left normal distribution is therefore labeled as representing the foreground, and the right one as representing the background.

Next, the foreground extraction method using the learning data generated as described above will be described. The foreground extraction processing unit 11a determines, for each pixel of the two-dimensional image acquired by the two-dimensional image acquisition unit 20, whether the pixel is foreground or background, and performs foreground extraction by extracting the pixels determined to be foreground.

Specifically, the foreground extraction processing unit 11a calculates the Mahalanobis distance between the color coordinate value converted from the RGB value of each pixel of the two-dimensional image and the center of each normal distribution in the learning data. The Mahalanobis distance is a distance that takes the variance of the distribution into account. The foreground extraction processing unit 11a assigns to each pixel the label (foreground or background) of the normal distribution with the smallest Mahalanobis distance, thereby determining whether the pixel is foreground or background.
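
The per-pixel decision can be illustrated with the short sketch below. To stay self-contained it assumes, as a simplification not stated in the specification, that every normal distribution has a diagonal covariance, so the squared Mahalanobis distance reduces to a variance-weighted sum of squares. The component values and names are made up for the example.

```python
def mahalanobis_sq_diag(x, mean, var):
    """Squared Mahalanobis distance for a diagonal covariance (simplifying assumption)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

def classify_pixel(x, components):
    """components: list of (label, mean, var); return the label of the nearest one."""
    label, _, _ = min(components, key=lambda c: mahalanobis_sq_diag(x, c[1], c[2]))
    return label

# Illustrative learning data: one foreground and one background distribution.
components = [
    ("foreground", (1.0, 0.0, 1.0), (0.01, 0.01, 0.01)),
    ("background", (0.0, 0.0, 0.2), (0.04, 0.04, 0.04)),
]
label_a = classify_pixel((0.9, 0.05, 0.95), components)
label_b = classify_pixel((0.1, 0.0, 0.3), components)
```

Because `classify_pixel` looks only at one pixel's color coordinate value, the calls are independent of each other and can be distributed across threads or GPU kernels.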

Because this process only computes Mahalanobis distances for each pixel, it is very fast. In addition, because adjacency between pixels is not considered, all pixels can be processed independently. The method is therefore very well suited to acceleration by parallel processing on multiple CPUs or by GPGPU (general-purpose computing on graphics processing units).

After determining foreground or background for each pixel of the two-dimensional image acquired by the two-dimensional image acquisition unit 20 as described above, the foreground extraction processing unit 11a extracts the target object by extracting the pixels determined to be foreground. With this method, regions other than the target object may also be extracted as foreground, leaving noise that is not part of the target object. For this reason, alignment is performed in a way that takes this noise into account, as described below.

The position correction unit 18 corrects the position at which the three-dimensional data projection unit 13' projects the three-dimensional data onto the two-dimensional image. Specifically, it calculates the center position of the target object extracted by the boundary extraction unit 11' from the two-dimensional image of the previous frame and the center position of the target object extracted from the two-dimensional image of the current frame, and corrects the position at which the three-dimensional data is projected based on the difference between the two center positions. That is, when processing the second and subsequent frames, the three-dimensional data projection unit 13' uses the three-dimensional position of the target object obtained from the two-dimensional image of the previous frame to set the position on the two-dimensional image at which the three-dimensional data is projected. The position correction unit 18 converts the above difference into a difference in three-dimensional space and adds it to the position of the three-dimensional data from the previous frame, thereby correcting the position at which the three-dimensional data is projected.
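
The specification does not state how the two-dimensional difference is converted into a three-dimensional one. The sketch below shows one possible conversion under explicit assumptions: a pinhole camera with hypothetical focal lengths `fx` and `fy` in pixels, and a depth Z that is taken to be unchanged between the two frames.

```python
def correct_projection_position(prev_position, prev_center, curr_center, fx, fy):
    """Shift the previous 3D position by the displacement of the 2D center.

    Assumption: depth Z is unchanged between frames, so a pixel shift (du, dv)
    corresponds to (du * Z / fx, dv * Z / fy) under the pinhole model.
    """
    X, Y, Z = prev_position
    du = curr_center[0] - prev_center[0]
    dv = curr_center[1] - prev_center[1]
    return (X + du * Z / fx, Y + dv * Z / fy, Z)

# The center moved 50 px right and 25 px up between frames (illustrative values).
corrected = correct_projection_position(
    (0.0, 0.0, 2.0), (320.0, 240.0), (370.0, 215.0), 500.0, 500.0
)
```

The result only serves as a better starting position for the subsequent alignment, which can still refine the depth and the orientation.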

The issue here is how to calculate the center position of the target object on the two-dimensional image. As noted above, the result of the target object extraction in the boundary extraction unit 11' contains noise. If the pixel values of the target object extracted by the boundary extraction unit 11' were simply averaged, the center position would be shifted by the noise. A center position estimate that is robust to noise is therefore required. The position correction unit 18 accordingly calculates the center position of the target object by applying maximum likelihood estimation of a Cauchy distribution to the pixel values of the target object extracted by the boundary extraction unit 11'.

FIG. 9 compares the Cauchy distribution with the normal distribution. The Cauchy distribution has heavier tails than the normal distribution and is known to be robust to noise. Estimating the center of the target object's pixels by maximum likelihood estimation with a Cauchy distribution therefore minimizes the effect of the noise produced in the boundary extraction unit 11' and locates the center of the target object more accurately.
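
The robustness argument can be made concrete with a small sketch. It assumes that the estimate is taken over the coordinates of the extracted pixels and that the scale parameter `gamma` of an isotropic two-dimensional Cauchy distribution is fixed in advance; neither detail is given in the specification. The location is found by a fixed-point iteration in which each pixel is weighted by 1 / (gamma² + squared distance), so distant noise pixels contribute very little.

```python
import statistics

def cauchy_center(points, gamma=2.0, iterations=50):
    """Maximum likelihood location of an isotropic 2D Cauchy with fixed scale gamma."""
    cx = statistics.median([p[0] for p in points])  # robust starting point
    cy = statistics.median([p[1] for p in points])
    for _ in range(iterations):
        weights = [1.0 / (gamma ** 2 + (x - cx) ** 2 + (y - cy) ** 2) for x, y in points]
        total = sum(weights)
        cx = sum(w * x for w, (x, _) in zip(weights, points)) / total
        cy = sum(w * y for w, (_, y) in zip(weights, points)) / total
    return (cx, cy)

# Five object pixels around (10, 10) plus one noise pixel far away.
pixels = [(9, 10), (10, 9), (10, 10), (10, 11), (11, 10), (100, 100)]
center = cauchy_center(pixels)
mean = (sum(x for x, _ in pixels) / len(pixels), sum(y for _, y in pixels) / len(pixels))
```

In this example the plain mean is pulled to (25, 25) by the single noise pixel, while the Cauchy estimate stays next to the cluster.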

The three-dimensional data projection unit 13' then projects the three-dimensional data at the position on the two-dimensional image corrected by the position correction unit 18. The position corrected by the position correction unit 18 is closer to where the target object actually is in the current frame than the position set using the three-dimensional position of the target object obtained from the two-dimensional image of the previous frame. This allows the alignment unit 15 to align the three-dimensional data based on the ICP algorithm with higher accuracy.

FIG. 10 is a flowchart showing an operation example of the image processing apparatus 100' according to the second embodiment configured as described above. The flowchart shown in FIG. 10 starts when the user turns on the image processing apparatus 100' and performs an operation instructing it to track the target object.

First, the two-dimensional video acquisition unit 10 acquires, from the monocular camera 200, a two-dimensional video generated by imaging a real space that includes the target object whose three-dimensional position and orientation are to be determined (step S11). The two-dimensional image acquisition unit 20 then acquires a two-dimensional image for one frame from the two-dimensional video acquired by the two-dimensional video acquisition unit 10 (step S12).

Next, the boundary extraction unit 11' extracts the target object from the two-dimensional image acquired by the two-dimensional image acquisition unit 20 by foreground extraction, and then extracts the boundary between the extracted target object and the background (step S13). The radial shape generation unit 12 then generates a radial shape in the coordinate system of the three-dimensional space from a plurality of straight lines, each connecting the center position of the monocular camera 200 to one point of the boundary extracted by the boundary extraction unit 11' (step S14).

Meanwhile, the three-dimensional data projection unit 13' projects the three-dimensional data onto the two-dimensional image based on the pinhole camera model (step S15). The position at which the three-dimensional data is projected is arbitrary when the first frame is processed; for the second and subsequent frames, it is the three-dimensional position of the target object obtained from the two-dimensional image of the previous frame. The position correction unit 18 corrects the position set in this way by the three-dimensional data projection unit 13' (step S16).

The three-dimensional data projection unit 13' projects the three-dimensional mesh data representing the target object at the position corrected by the position correction unit 18 (step S17). The sample point extraction unit 14 then extracts, from the three-dimensional data of the target object projected onto the two-dimensional image by the three-dimensional data projection unit 13', the points that form the boundary when the target object appears in the two-dimensional image, as sample points of the three-dimensional data (step S18).

Next, the alignment unit 15 aligns the three-dimensional data, for example using the ICP algorithm, so that the sample points extracted by the sample point extraction unit 14 lie on the radial shape generated by the radial shape generation unit 12 (step S19). The spatial information acquisition unit 16 then acquires spatial information on the three-dimensional position and orientation of the target object from the three-dimensional data aligned by the alignment unit 15 (step S20).

Next, the image processing apparatus 100' determines whether to end tracking of the target object (step S21). If it determines that tracking is to end, the processing of the flowchart shown in FIG. 10 ends. Otherwise, the process returns to step S11 and continues with the next frame.

As described in detail above, according to the second embodiment, the process of extracting the target object from the two-dimensional image and the process of aligning the three-dimensional data projected onto the two-dimensional image are performed at high speed, which improves how well tracking follows the target object. At the same time, the effect of the noise left behind by speeding up the extraction of the target object is minimized, so the movement of the target object in the three-dimensional space can be tracked accurately.

In the first and second embodiments, mesh data was described as an example of the three-dimensional data, but data other than mesh data may be used as long as it carries the three-dimensional position and orientation of the target object. For example, CAD data may be used. CAD data consists of a plurality of equations representing surfaces and boundary lines in three-dimensional space. Considering the projection of each surface equation and boundary line onto the image plane toward the camera center, the CAD data is expressed as equations representing surfaces and boundary lines in the two-dimensional space of the image plane. The boundary line of the target object as it appears in the two-dimensional image can then be obtained by taking the outermost contour of the shape formed by projecting all the surfaces and boundary lines that make up the target object. The sample point extraction unit 14 extracts a plurality of sample points from this boundary line. However, converting the CAD data into mesh data before projection is preferable because processing is faster. This processing can also be applied to other three-dimensional data if it can be converted into mesh data.

In the first and second embodiments, the boundary of the target object was extracted from the two-dimensional image by first extracting the target object by foreground extraction and then extracting the boundary from the extracted target object, but the present invention is not limited to this. The boundary of the target object may be extracted by any other known method.

In the first and second embodiments, the radial shape was generated from a plurality of straight lines connecting the camera center to each point of the boundary of the target object extracted from the two-dimensional image, but the radial shape need not pass through every point on the boundary. That is, several representative points may be extracted from the boundary of the target object, and the radial shape may be generated from a plurality of straight lines connecting the camera center to each of the extracted representative points.
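
How the representative points are chosen is not specified. One simple possibility, shown here purely as an assumption, is to keep every `step`-th point of an ordered boundary; the function name and the parameter are hypothetical.

```python
def representative_points(boundary, step):
    """Keep every step-th point of an ordered boundary (one possible selection rule)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return boundary[::step]

# Ten illustrative boundary pixels reduced to four representative points.
boundary = [(x, 0) for x in range(10)]
selected = representative_points(boundary, 3)
```

Fewer rays reduce the cost of the nearest-point search during alignment, at the price of a coarser radial shape.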

In the first and second embodiments, an example was described in which the position of the three-dimensional data is aligned using the ICP algorithm, but the present invention is not limited to this. For example, other algorithms used for aligning three-dimensional point clouds, such as the Softassign algorithm or the EM-ICP algorithm, may be used.

In addition, the first and second embodiments are merely examples of how the present invention may be embodied, and the technical scope of the present invention must not be interpreted restrictively on their basis. That is, the present invention can be implemented in various forms without departing from its gist or its main features.

DESCRIPTION OF SYMBOLS
10 two-dimensional video acquisition unit
20 two-dimensional image acquisition unit
11, 11' boundary extraction unit
12 radial shape generation unit
13, 13' three-dimensional data projection unit
14 sample point extraction unit
15 alignment unit
16 spatial information acquisition unit
17 camera parameter storage unit
18 position correction unit
100, 100' image processing apparatus
200 monocular camera

Claims

1. An image processing apparatus comprising:
a two-dimensional video acquisition unit that acquires a two-dimensional video generated by imaging a real space using an imaging device;
a two-dimensional image acquisition unit that acquires two-dimensional images as still images at predetermined frame intervals from the two-dimensional video acquired by the two-dimensional video acquisition unit; and
a tracking unit that tracks the movement of a target object in a three-dimensional space by acquiring spatial information on the three-dimensional position and orientation of the target object from the two-dimensional image of each frame acquired by the two-dimensional image acquisition unit,
wherein the tracking unit comprises:
a boundary extraction unit that extracts a boundary of the target object from the two-dimensional image acquired in the current frame by the two-dimensional image acquisition unit;
a radial shape generation unit that generates a radial shape in a coordinate system of the three-dimensional space from a plurality of straight lines connecting the center position of the imaging device to each point of the boundary extracted by the boundary extraction unit;
a three-dimensional data projection unit that projects three-dimensional data representing the target object onto the two-dimensional image using the three-dimensional position of the target object obtained from the two-dimensional image of the previous frame acquired by the two-dimensional image acquisition unit;
a sample point extraction unit that extracts, from the three-dimensional data of the target object projected onto the two-dimensional image, points that form a boundary when the target object appears in the two-dimensional image, as sample points of the three-dimensional data;
an alignment unit that aligns the three-dimensional data so that the sample points extracted by the sample point extraction unit lie on the radial shape generated by the radial shape generation unit; and
a spatial information acquisition unit that acquires spatial information on the three-dimensional position and orientation of the target object from the three-dimensional data aligned by the alignment unit.

2. The image processing apparatus according to claim 1, further comprising a position correction unit that calculates the center position of the target object extracted by the boundary extraction unit from the two-dimensional image of the previous frame and the center position of the target object extracted from the two-dimensional image of the current frame, and corrects the position at which the three-dimensional data is projected based on the difference between the two center positions,
wherein the three-dimensional data projection unit projects the three-dimensional data representing the target object at the position on the two-dimensional image corrected by the position correction unit.

3. The image processing apparatus according to claim 2, wherein the boundary extraction unit comprises:
a foreground extraction processing unit that extracts the target object from the two-dimensional image by performing foreground extraction using learning data generated from teacher data; and
a boundary extraction processing unit that extracts the boundary from the target object extracted by the foreground extraction processing unit.

4. The image processing apparatus according to claim 3, wherein the position correction unit calculates the center position of the target object by applying maximum likelihood estimation of a Cauchy distribution to pixel values of the two-dimensional image.

5. The image processing apparatus according to claim 1, wherein the alignment unit aligns the three-dimensional data by repeatedly obtaining correspondences between the radial shape generated by the radial shape generation unit and the sample points extracted by the sample point extraction unit from nearest points, and applying a transformation that minimizes the distance between the obtained correspondences.

The image processing apparatus according to claim 1, wherein the three-dimensional data is three-dimensional mesh data in which the same shape as the target object is represented by a plurality of triangular or quadrilateral meshes.
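
With triangle mesh data of this kind, one common way to pick sample points that fall on the boundary of the projected object is to collect the vertices of silhouette edges, that is, edges shared by one triangle facing the camera and one facing away. The sketch below illustrates that general technique and is not taken from the specification. It assumes a closed mesh with consistently outward-wound triangles and a camera center at the origin; the function name `silhouette_vertices` is hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def silhouette_vertices(vertices, triangles, camera=(0.0, 0.0, 0.0)):
    """Return sorted indices of vertices lying on silhouette edges.

    Assumption: triangles are wound so that their normals point outward.
    A triangle faces the camera when its normal points against the
    viewing direction from the camera to the triangle.
    """
    facing = []
    edge_faces = {}
    for f, (i, j, k) in enumerate(triangles):
        a, b, c = vertices[i], vertices[j], vertices[k]
        normal = cross(sub(b, a), sub(c, a))
        facing.append(dot(normal, sub(a, camera)) < 0)
        for edge in ((i, j), (j, k), (k, i)):
            edge_faces.setdefault(tuple(sorted(edge)), []).append(f)
    result = set()
    for edge, faces in edge_faces.items():
        # A silhouette edge separates a front-facing and a back-facing triangle.
        if len(faces) == 2 and facing[faces[0]] != facing[faces[1]]:
            result.update(edge)
    return sorted(result)
```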

A first step in which a 2D video acquisition unit of the image processing apparatus acquires a 2D video generated by imaging a real space using an imaging device;
A second step in which a two-dimensional image acquisition unit of the image processing apparatus acquires a two-dimensional image as a still image from the two-dimensional video acquired by the two-dimensional video acquisition unit;
A third step in which a boundary extraction unit of the image processing apparatus extracts a boundary of the target object from the two-dimensional image acquired in the current frame by the two-dimensional image acquisition unit;
A fourth step in which the radial shape generation unit of the image processing apparatus generates, on a coordinate system in a three-dimensional space, a radial shape composed of a plurality of straight lines each connecting the center position of the imaging device and a point of the boundary extracted by the boundary extraction unit;
A fifth step in which the three-dimensional data projection unit of the image processing apparatus projects three-dimensional data representing the target object onto the two-dimensional image, using the three-dimensional position of the target object obtained from the two-dimensional image of the previous frame acquired by the two-dimensional image acquisition unit;
A sixth step in which the sample point extraction unit of the image processing apparatus extracts, as sample points, points that become the boundary when the target object appears in the two-dimensional image, from the three-dimensional data of the target object projected onto the two-dimensional image;
A seventh step in which the alignment unit of the image processing apparatus aligns the three-dimensional data so that the sample points extracted by the sample point extraction unit are positioned on the radial shape generated by the radial shape generation unit;
An eighth step in which the spatial information acquisition unit of the image processing apparatus acquires the spatial information of the three-dimensional position and orientation of the target object from the three-dimensional data aligned by the alignment unit;
A three-dimensional tracking method comprising the above steps, wherein the processing from the first step to the eighth step is performed at predetermined frame intervals of the two-dimensional video, and the spatial information on the three-dimensional position and orientation of the target object is obtained from the two-dimensional image of each frame, thereby tracking the movement of the target object in the three-dimensional space.
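
The fourth step (generating the radial shape) and the fifth step (projecting the three-dimensional data) can be illustrated together with a simple camera model. The sketch below assumes an ideal pinhole camera without lens distortion, placed at the origin and looking along the +z axis; the intrinsic parameters `fx`, `fy`, `cx`, `cy` and all function names are hypothetical and are not taken from the claims. Each boundary pixel is turned into the unit direction of a straight line through the camera center, and `project` is the inverse mapping from a three-dimensional point to a pixel.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Unit direction of the line through the camera center and pixel (u, v)."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    z = 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)


def radial_shape(boundary_pixels, fx, fy, cx, cy):
    """One line direction per boundary pixel; every line starts at the origin."""
    return [pixel_to_ray(u, v, fx, fy, cx, cy) for u, v in boundary_pixels]


def project(point, fx, fy, cx, cy):
    """Project a 3D point given in camera coordinates (z > 0) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)
```

Any point on a generated line projects back to the boundary pixel it came from, which is why sample points that lie on the radial shape reproduce the observed boundary in the image.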