JP2008140101A

JP2008140101A - Unconstrained and real-time hand tracking device using no marker

Info

Publication number: JP2008140101A
Application number: JP2006325202A
Authority: JP
Inventors: Gumpp Thomas; トマス・グンプ; Azad Pedram; ペドラム・アザド; Welke Kai; カイ・ウェルケ; Erhan Oztop; エーハン・オズトップ; Cheng Gordon; ゴードン・チェン
Original assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Current assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Priority date: 2006-12-01
Filing date: 2006-12-01
Publication date: 2008-06-19

Abstract

PROBLEM TO BE SOLVED: To provide a device that captures the motion of a person's hand in real time, without constraints and without using markers.

SOLUTION: A hand tracker 90 includes a storage unit for a three-dimensional hand model 116; an image processor 104 that extracts a skin color binary map 140 and an edge image 142; particle filter units 108 and 110, each of which estimates the posture and position of the hand in a current frame 102 by applying particle filtering to the hand model 116 based on the skin color map 140 and the edge image 142; and combining and updating modules 112 and 114 that update the hand model 116 using the estimation result. The particle filter units 108 and 110 include z-buffers and calculate the probabilities of the particle configurations by comparing z-buffer projection images of the particle configurations with the skin color map 140 and the edge image 142.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to human hand tracking systems, and more particularly to an effective, unconstrained, real-time, markerless human hand tracking system.

Any robot designed to interact with people in a natural way needs an unconstrained hand tracker. The capabilities that a humanoid can exhibit with a markerless hand tracker range from skill learning, through interactive imitation for human interaction (such as handing over and exchanging objects by means of eye-hand coordination), to gesture understanding. A person can also use such a system to draw the robot's attention by pointing. The hand tracker system is intended to provide what these tasks require.

Marker-based and glove-based hand trackers have been studied extensively, and these techniques usually give good tracking results. In the applications we are aiming at, however, the robot must be able to track any arbitrary hand in a natural environment, so it cannot rely on preparations such as attached markers. For this reason, we chose a markerless solution.

Markerless hand tracking is usually performed by interpreting a video input stream and searching the captured frames for cues. A commonly used cue is the contour of the hand, which can be found by applying an edge filter to the input frame. Another cue is the visible surface of the hand, which can be extracted by skin color filtering. The hand configuration estimated by the tracker is compared with the cues in the input image to obtain the probability of a predefined gesture. The comparison with the cues uses either a posture database or a projection of a 3D hand model.

The problem of recognizing predefined hand postures is usually solved by systems that use a posture database. One example is the appearance-based gesture recognition disclosed in Non-Patent Document 1, which uses hidden Markov models to detect gestures from a sign language database.

Closer to the object of the present invention are hand trackers with multiple degrees of freedom, which Erol et al. recently reviewed thoroughly in Non-Patent Document 2. That review presents recent developments in this field, divided into three categories: single frame estimators, single hypothesis estimators, and multiple hypothesis estimators.

[Single frame estimator]
The first category consists of single frame pose estimators, which do not rely on prior knowledge from previous frames. One possible solution is a global search over a set of templates labeled with kinematic parameters. In Non-Patent Document 3, these templates are generated using a 3D hand model and stored in a tree structure. To detect a pose, the tree is searched without prior knowledge.

Because these systems are independent of previous frames, they do not need to recover from tracking errors in earlier frames, which results in robust behavior. However, because a template database must be built, the expressiveness of such a system is limited to a discrete subset of poses, which conflicts with the goal of unconstrained tracking pursued by the present invention. A global search is computationally very expensive. Using a continuous joint space for tracking is not practical in a real-time system. By tracking over many frames, only a local search in the configuration space is required, so a system that uses fewer resources can be realized.

[Single hypothesis estimator]
The second category comprises hand trackers that use a single hypothesis obtained from the previous estimate as prior knowledge. The most common strategy for fitting a model to the cues extracted from the image is to use standard optimization techniques.

Non-Patent Document 4 proposes a divide-and-conquer strategy. First the global motion of the hand is estimated, and then the joint angles are estimated. This procedure is applied iteratively until convergence. Kalman filters have also been used to solve the single hypothesis tracking problem. In Non-Patent Document 5, the hand is tracked using an unscented Kalman filter (UKF).

Because these systems use gradient-based search, they can reach only local minima, and there is always a risk of tracking errors caused by failing to reach the global minimum.

[Multiple hypothesis estimator]
In contrast to single hypothesis trackers, multiple hypothesis trackers try to maintain multiple pose estimates while tracking a sequence. This increases the chance of reaching the global minimum. This idea is best realized within the framework of Bayesian filtering, which maintains a probability distribution over the state up to the current frame. One well-known technique for implementing a recursive Bayesian filter is the particle filter.

Particle filtering is also known as the Condensation algorithm, introduced by Michael Isard and Andrew Blake in Non-Patent Document 6. The Condensation algorithm was designed to track curves in clutter. The Kalman filter is unsuitable for such tasks because it is based on a Gaussian density, which cannot represent alternative hypotheses simultaneously.

A particle is a pair $(s_n, \Pi_n)$, where $s_n$ is the configuration vector of particle $n$ and $\Pi_n$ is the probability of that configuration. For a hand tracker, each value in the configuration vector represents an angle or a translation of the kinematic hand model, so the dimension of the vector equals the number of degrees of freedom (DOF) of the hand motion. $\Pi_n$ is computed by evaluating the state $s_n$; this is done by searching the video frame for different cues, such as the visible surface or the edges. In this embodiment, both the visible surface and the edges of the hand image are used. The probability distribution is modeled by the particle set $S = \{(s_1, \Pi_1), \ldots, (s_N, \Pi_N)\}$.

The main steps of each particle filter iteration are: sampling the particle set $S_t$ from the set $S_{t-1}$ of the previous iteration; evaluating each $s_n$ to obtain its probability $\Pi_n$; and computing the weighted mean $s_{\mathrm{mean}}$. $s_{\mathrm{mean}}$ represents the current estimate of the hand pose.

Non-Patent Document 1: P. Dreuw, "Appearance-based gesture recognition," Diploma Thesis, RWTH Aachen University, Aachen, Germany, January 2005.
Non-Patent Document 2: A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly, "A review on vision-based full dof hand motion estimation," in CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops. Washington, DC, USA: IEEE Computer Society, 2005, p. 75.
Non-Patent Document 3: B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla, "Hand pose estimation using hierarchical detection," 2004. [Online]. Available: citeseer.ist.psu.edu/stenger04hand.html
Non-Patent Document 4: Y. Wu and T. S. Huang, "Capturing articulated human hand motion: A divide-and-conquer approach," ICCV (IEEE International Conference on Computer Vision) '99, vol. 01, p. 606, 1999.
Non-Patent Document 5: B. Stenger, P. R. S. Mendonca, and R. Cipolla, "Model-based 3d tracking of an articulated hand," CVPR (IEEE Conference on Computer Vision and Pattern Recognition), vol. 02, p. 310, 2001.
Non-Patent Document 6: M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," International Journal of Computer Vision, vol. 29, no. 1, pp. 5-28, 1998. [Online]. Available: citeseer.ist.psu.edu/isard98condensation.html
Non-Patent Document 7: J. Lee and T. L. Kunii, "Model-based analysis of hand posture," IEEE Computer Graphics and Applications, vol. 15, no. 5, pp. 77-86, 1995.
Non-Patent Document 8: N. Shimada, K. Kimura, and Y. Shirai, "Real-time 3-d hand posture estimation based on 2-d appearance retrieval using monocular camera," RATFG-RTS (International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems), vol. 00, p. 0023, 2001.
Non-Patent Document 9: A. Ude, V. Wyart, M. Lin, and G. Cheng, "Distributed visual attention on a humanoid robot," in IEEE-RAS/RSJ International Conference on Humanoid Robots (Humanoids 2005), 2005.
Non-Patent Document 10: J. Deutscher, A. Blake, and I. Reid, "Articulated body motion capture by annealed particle filtering," CVPR, vol. 02, p. 2126, 2000.

Because a reliable hand tracker is needed, methods of the third category appear more promising than single hypothesis trackers. However, they are computationally more expensive. Since any robot designed to interact with people in a natural way requires real-time tracking capability, an effective hand tracking device of the third category is needed.

Accordingly, one object of the present invention is to provide an unconstrained, real-time, markerless hand tracking system of the multiple hypothesis estimator type, namely a device that captures human hand motion effectively and reliably in real time.

According to a first aspect of the present invention, a hand tracking device for tracking a hand in a video sequence using particle filtering includes: storage means for storing a three-dimensional hand model; cue extraction means for extracting cues to the hand posture from a current frame of the video sequence; estimation means for estimating the hand posture in the current frame by applying particle filtering to the cues and the current configuration of the three-dimensional hand model; and means for updating the configuration of the three-dimensional hand model using the posture estimated by the estimation means.

The estimation means includes means for generating a predetermined number of particles, each representing a configuration of the three-dimensional hand model, by adding random noise to the three-dimensional hand model, and z-buffer processing means for generating a z-buffer image for each of the particles. The z-buffer image is a projection of the three-dimensional hand model onto the image plane of the image capture device that outputs the video sequence. The estimation means further includes probability calculation means for calculating the probability of each hand configuration based on the cues and the z-buffer image, and weighted sum means for calculating a weighted sum of the hand configurations, using the probability of each particle as the weight of the corresponding hand configuration.

The three-dimensional hand model is stored in advance in the storage means. Cues to the hand posture are extracted from the current frame of the video sequence by the cue extraction means. Skin-colored surfaces and/or edges are used as cues. The estimation means applies particle filtering to the three-dimensional hand model and the cues to estimate the hand posture in the current frame. The configuration of the three-dimensional hand model is updated with the estimated posture.

In the estimation means, a predetermined number of particle configurations are generated. For each hand configuration, a z-buffer image of its projection is generated. By comparing the cues with the z-buffer image, the estimation means can calculate the probability of the hand configuration. The configuration of the three-dimensional hand model is updated with the weighted sum of the estimated hand postures.

Because z-buffer images are used for the probability calculation, only the visible parts, and not the hidden parts, are searched. Tracking can therefore be performed quickly, which makes real-time tracking possible.
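To illustrate why a z-buffer restricts the comparison to visible parts, the following minimal Python sketch renders a toy model and keeps, for each pixel, only the surface nearest to the camera. It is not the rendering used in the embodiment: the sphere primitives, the orthographic projection, and the names `render_zbuffer`, `depth`, and `owner` are assumptions made for this example.

```python
import math

def render_zbuffer(spheres, width, height):
    """Orthographic z-buffer of spheres given as (cx, cy, cz, radius).

    Smaller z is closer to the camera. Returns (depth, owner), where
    owner[y][x] is the index of the visible sphere, or -1 for background.
    Spheres and orthographic projection are simplifying assumptions.
    """
    depth = [[math.inf] * width for _ in range(height)]
    owner = [[-1] * width for _ in range(height)]
    for idx, (cx, cy, cz, r) in enumerate(spheres):
        for y in range(max(0, int(cy - r)), min(height, int(cy + r) + 1)):
            for x in range(max(0, int(cx - r)), min(width, int(cx + r) + 1)):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                if d2 > r * r:
                    continue                      # pixel lies outside the disc
                z = cz - math.sqrt(r * r - d2)    # front surface of the sphere
                if z < depth[y][x]:               # keep only the nearest surface
                    depth[y][x] = z
                    owner[y][x] = idx
    return depth, owner

# A small "finger" sphere (index 1) in front of a larger "palm" sphere (index 0).
spheres = [(8.0, 8.0, 20.0, 6.0), (8.0, 8.0, 10.0, 2.0)]
depth, owner = render_zbuffer(spheres, 16, 16)
```

In this toy scene the small sphere hides the centre of the large one, so pixels there belong to the front sphere only, and the hidden surface behind them never enters the comparison with the image cues.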

Preferably, the three-dimensional hand model includes a palm portion and a finger portion. The probability calculation means includes first evaluation means for calculating the probabilities of a first part of the particles. In each particle configuration of the first part, the finger portion is replaced with the current finger portion of the three-dimensional model. The probability calculation means further includes second evaluation means for calculating the probabilities of a second part of the particles. In each particle of the second part, the palm portion is replaced with the current palm portion of the three-dimensional model. The weighted sum means includes first summing means for calculating a weighted sum of the first part of the particles, using their respective probabilities as weights, and second summing means for calculating a weighted sum of the second part of the particles representing hand configurations, using their respective probabilities as weights. The combining means includes means for combining the palm portion of the weighted sum output by the first summing means with the finger portion of the weighted sum output by the second summing means.

The first evaluation means calculates the probabilities of the first part of the particles. Because the finger portions of the particle configurations in the first part have been replaced with the current finger portion of the hand model, only the palm portion is taken into account in the probability calculation. Similarly, the second evaluation means calculates the probabilities of the second part of the particles. Because the palm portions of the particle configurations in the second part have been replaced with the current palm portion of the hand model, only the finger portion is taken into account in the probability calculation. In this way, the weighted sum of the palm postures is calculated from the output of the first evaluation means, and the weighted sum of the finger postures is calculated from the output of the second evaluation means. Combining these results gives an estimate of the posture of the entire hand.

Because the search space is decomposed into two subspaces (the palm portion and the finger portion), the computational cost is substantially reduced, and real-time tracking becomes feasible.
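The decomposition into a palm subspace and a finger subspace can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the embodiment itself: the number of palm DOF (`PALM_DOF = 6`), the Gaussian noise, the even split of the particle budget, and the names `split_particles` and `combine` are all hypothetical.

```python
import random

PALM_DOF = 6  # assumed: 3 translations + 3 rotations for the palm

def split_particles(model, sigma, count, rng):
    """Return (palm_set, finger_set).

    Palm particles perturb only the palm DOF and reuse the model's current
    finger values; finger particles perturb only the finger DOF and reuse
    the model's current palm values.
    """
    def noisy(i):
        return model[i] + rng.gauss(0.0, sigma[i])
    n = len(model)
    palm_set = [[noisy(i) if i < PALM_DOF else model[i] for i in range(n)]
                for _ in range(count // 2)]
    finger_set = [[model[i] if i < PALM_DOF else noisy(i) for i in range(n)]
                  for _ in range(count - count // 2)]
    return palm_set, finger_set

def combine(palm_estimate, finger_estimate):
    """Take the palm DOF from one estimate and the finger DOF from the other."""
    return palm_estimate[:PALM_DOF] + finger_estimate[PALM_DOF:]

rng = random.Random(1)
model = [0.0] * PALM_DOF + [0.3, 0.6]   # 6 palm DOF + 2 hypothetical finger angles
sigma = [0.05] * len(model)
palm_set, finger_set = split_particles(model, sigma, 100, rng)
# The first particle of each set stands in for the two weighted sums here.
new_model = combine(palm_set[0], finger_set[0])
```

Each set explores only its own low-dimensional subspace, which is why far fewer particles are needed than for a search over all DOF at once.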

More preferably, the first evaluation means is implemented by a plurality of first calculation units that can operate in parallel. Each of the first calculation units is programmed to calculate the probabilities of a portion of the first part of the hand configurations.

Even more preferably, the second evaluation means is implemented by a plurality of second calculation units that can operate in parallel. Each of the second calculation units is programmed to calculate the probabilities of a portion of the second part of the particle configurations.

Because the computationally expensive evaluation of each particle is independent of all other particles, the evaluations can be carried out in parallel on multiple computers. The time required for each iteration can therefore be reduced, and real-time tracking becomes feasible.
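Because each particle is scored independently, the particle set can be cut into chunks that are evaluated concurrently and then reassembled in order. The sketch below shows this pattern with Python's standard `concurrent.futures` module. It is only an analogy for the multiple computers mentioned above: the worker threads stand in for separate machines, and `evaluate_parallel` and `toy_score` are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_parallel(particles, score, workers=4):
    """Score particles in roughly equal chunks, one chunk per worker.

    The result order matches the input order, so each probability can be
    paired with its particle afterwards.
    """
    if not particles:
        return []
    size = -(-len(particles) // workers)  # ceiling division
    chunks = [particles[i:i + size] for i in range(0, len(particles), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunk_results = list(
            pool.map(lambda chunk: [score(p) for p in chunk], chunks))
    return [prob for chunk in chunk_results for prob in chunk]

def toy_score(particle):
    """Placeholder probability function (assumption, not the patent's)."""
    return 1.0 / (1.0 + sum(v * v for v in particle))

particles = [[0.1 * i, -0.1 * i] for i in range(10)]
probabilities = evaluate_parallel(particles, toy_score, workers=3)
```

In CPython, threads do not speed up CPU-bound scoring; distributing the chunks to separate processes or machines, as the text describes, is what would reduce the time per iteration.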

<Features>
[Example of hand tracker implementation]
A. Unconstrained tracking
Particle filters appear promising because their computational cost can be handled by balanced parallel execution over the particles: when the cost grows, it grows in the form of independent probability calculations. The first embodiment of the present invention therefore uses a particle filter to estimate the hand posture. Because the human hand has a relatively large number of DOF, the search space tends to be large. In this embodiment, the search space is therefore decomposed into two parts: the hand and the fingers. After estimating the hand posture and the finger postures separately, the device combines them to obtain a new posture of the entire hand.

A gesture recognition system that learns the appearance of gestures is an example of a specialized hand tracker. Such a system performs well within its domain, but it is not designed to track arbitrary hand configurations. We want a humanoid robot to interact naturally with people without introducing artificial restrictions. Our strategy is therefore to provide an unbiased representation of the tracked hand configuration, so that the behavior modules of our humanoid can react to hand motion and interpret it. This can serve as the basis for a variety of tasks with different requirements, such as imitation, attention drawing, intention estimation, eye-hand coordination, and gesture recognition.

To achieve this goal, this embodiment adopts a model-based strategy, using a model similar to the one disclosed in Non-Patent Document 7. Using a 3D hand model as the means of representing configurations removes the restriction to a predetermined set of hand postures. In Non-Patent Document 8, Shimada et al. use a 3D hand model to build a labeled database of hand posture templates, which is then used to track hand motion. The result of that strategy is essentially a hand posture with discretized kinematic parameters.

In contrast, this embodiment projects the 3D hand model directly into the scene, so that accurate kinematic parameters can be obtained over a continuous range of joint angles. Using a database is generally computationally cheap, but the reduced computation comes at the price of a considerable amount of memory and search time.

Using the model directly has several advantages. Its structural parameters can be changed dynamically, and the model can be exchanged completely to fit a different hand. The details of the projection can also be changed freely on the fly, which is particularly important for humanoid robot heads that use foveal and wide-angle camera systems with different projection characteristics. A template-based system, by contrast, would require a separate database for each camera. In addition, the degrees of freedom of the model can be adjusted for different tasks. Because all postures are generated at run time, the system of this embodiment avoids the exponential growth in memory that a posture database would require and is not limited in dimensionality.

B. Real-time requirements
One of the conclusions of Non-Patent Document 2 was that real-time 3D hand trackers are lacking. To interact with the environment and with people, a real-time system must be created. Introducing long delays into the tracking process is therefore unacceptable. The goal is to build a real-time tracking system and thereby establish a natural means of communication between robots and people.

With this requirement in mind, a particle filter system appears promising. It can be parallelized in a way that minimizes the communication required between the distributed parts of the system, so it scales very well. By exploiting this property, we were able to implement a distributed particle filter based on the distributed vision cluster proposed in Non-Patent Document 9. The computationally most expensive task is calculating the probability of each particle. Here we chose a configuration similar to that of Non-Patent Document 8: one computer manages the tracking and performs image capture and preprocessing, while several cluster nodes compute the probability function (see FIG. 15).

One application that requires real-time operation is imitation. To achieve this, the angles estimated by the tracker system are converted into joint angles of a robot hand. A commercially available robot hand was used to imitate the hand posture. The results are shown in FIG. 1.

［マーカ不使用のトラッカシステム］ [Marker-free tracker system]
Ａ．データフローの概観 A. Data Flow Overview
図２はこの実施の形態に従ったデータフローの概観を示す。図２を参照して、人の手の入力画像５０は、肌色フィルタによってフィルタ処理され、肌色の面を示すバイナリマップ５２が生成される。バイナリマップ５２を用いて、入力画像５０から切出された手画像５４が得られる。ソーベルフィルタ等のエッジフィルタを画像５４に適用することにより、エッジフィルタ処理された画像５６が得られる。 FIG. 2 shows an overview of the data flow according to this embodiment. Referring to FIG. 2, an input image 50 of a human hand is filtered by a skin color filter, and a binary map 52 indicating the skin-colored surfaces is generated. Using the binary map 52, a hand image 54 cut out from the input image 50 is obtained. By applying an edge filter such as a Sobel filter to the image 54, an edge-filtered image 56 is obtained.

先の繰返しで準備された３Ｄの手モデル５８から、３Ｄ手モデル５８にランダムノイズ５９を付加することによって、所定の数（この実施の形態では３０００）のパーティクルが生成される。パーティクルの各々は可能な手の姿勢を表す。 By adding random noise 59 to the 3D hand model 58 prepared from the previous iteration, a predetermined number (3000 in this embodiment) of particles is generated. Each particle represents a possible hand posture.
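The particle generation step can be sketched as follows. This is a minimal illustration, not the patented implementation: the description only says that random noise is added, so the Gaussian distribution, the value of `sigma`, and the function name `generate_particles` are assumptions made for the example. The vector length of 14 follows the dimension of the shape vector given later in the description.

```python
import random

def generate_particles(model, n_particles=3000, sigma=0.05, seed=None):
    """Return n_particles noisy copies of the model's shape vector.

    Gaussian noise with standard deviation sigma is an assumption;
    the description only states that random noise is added.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in model]
            for _ in range(n_particles)]

model = [0.0] * 14            # current estimate of the hand shape vector
particles = generate_particles(model, seed=0)
```

Each inner list is one hypothesis about the hand posture. In practice the noise scale would probably differ between position, orientation, and finger angles.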

パーティクルの各々について、ｚバッファ画像６０が生成される。モデル形状のｚバッファ画像６０は、それぞれブロック６２と６４で示すように、肌色フィルタ処理されたバイナリマップ５２及びエッジフィルタ処理された画像５６と比較される。これらの結果を用いて、ブロック６６に示されるように、パーティクルのモデル形状の手の姿勢確率が計算される。これがどのように行なわれるかは、後に詳細に説明する。 A z-buffer image 60 is generated for each of the particles. The model-shaped z-buffer image 60 is compared to the skin-filtered binary map 52 and the edge-filtered image 56, as indicated by blocks 62 and 64, respectively. Using these results, as shown in block 66, the hand posture probabilities of the particle model shape are calculated. How this is done will be described in detail later.

ブロック６８に示すようなパーティクルの形状の加重合計により、３Ｄ手モデル５８の新たな形状が推定される。形状の確率は、この計算において重みとして用いられる。 The new shape of the 3D hand model 58 is estimated by the weighted sum of the particle shapes as shown in block 68. The shape probabilities are used as weights in this calculation.

フレームサイクルの各々について、上述の計算が行なわれ、繰返しごとの手姿勢が推定される。以下では、上述の処理の各々をさらに詳細に説明する。 For each frame cycle, the above calculation is performed to estimate the hand posture for each iteration. Hereinafter, each of the above-described processes will be described in more detail.

Ｂ．画像の取得と前処理 B. Image Acquisition and Preprocessing
図３を参照して、このシステムでは、ヒューマノイドの頭部に見られるような、眼球間の距離の小さい較正済のステレオカメラ３２を用いる。ステレオカメラを選択したのは、手について２つの視点があれば、深さ推定において１つの視野では無視できるほど小さい誤差も、第２の視野では目に見えるものとなるため、３Ｄ空間での位置づけがより良好となるからである。 Referring to FIG. 3, the system uses a calibrated stereo camera 32 with a small interocular distance, as found in the head of a humanoid. A stereo camera was chosen because, with two views of the hand, a depth-estimation error that is negligibly small in one view becomes visible in the second view, so that localization in 3D space is improved.

手３０のステレオ画像を撮影したあと、ＨＳＶ空間で肌色を検出し、画像中の肌色の面のバイナリマップを生成し、これを肌面の手がかりの評価にも用いる。その後、このマスクを用いて画像のうち無関係の部分全てをカットし、元の画像から得られた面にエッジフィルタを適用して、手のエッジ画像を得る。これらの画像に基づき、ロボットの手３４は手３０の姿勢を模倣するように制御される。 After photographing a stereo image of the hand 30, the skin color is detected in the HSV space, a binary map of the skin color surface in the image is generated, and this is also used for evaluation of the clues on the skin surface. Thereafter, using this mask, all irrelevant portions of the image are cut, and an edge filter is applied to the surface obtained from the original image to obtain an edge image of the hand. Based on these images, the robot hand 34 is controlled to mimic the posture of the hand 30.
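The skin-color detection and masking steps might look like the sketch below. The threshold constants and the function names are hypothetical, since the description does not give the HSV ranges that were used, and a real system would tune them for its camera and lighting.

```python
import colorsys

# Assumed thresholds, not taken from the source document.
H_MAX, S_MIN, S_MAX, V_MIN = 50 / 360, 0.15, 0.75, 0.35

def is_skin(r, g, b):
    """Classify one 8-bit RGB pixel as skin using fixed HSV thresholds."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h <= H_MAX and S_MIN <= s <= S_MAX and v >= V_MIN

def skin_mask(image):
    """image: rows of (r, g, b) tuples -> rows of 0/1 (binary skin map)."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in image]

def apply_mask(image, mask):
    """Black out every pixel outside the mask (cut the irrelevant parts)."""
    return [[px if m else (0, 0, 0) for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]

image = [[(224, 172, 105), (30, 60, 200)],
         [(10, 10, 10), (198, 134, 66)]]
mask = skin_mask(image)          # binary map of skin-colored pixels
hand = apply_mask(image, mask)   # cut-out hand image for the edge filter
```

The binary map serves both as the surface cue and as the mask that restricts the edge filter to the hand region.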

Ｃ．３Ｄの手モデル C. 3D Hand Model
円錐断面のみからなる３Ｄの手モデル（図４、図５）を用いる。図４を参照して、このモデル５８は非常に高速にフレームに投影して手がかりを検出することが可能であり、その一方で依然として、手と指の動きをトラッキングするためには十分正確である。図４において、「ＤＩＰ」、「ＰＩＰ」、「ＭＣＰ」、「ＣＭＣ」、「ＩＰ」及び「ＴＭ」は、「遠位指節間関節」、「近位指節間関節」、「中手指節間関節」、「手根中手関節」、「指節間関節」及び「拇指手根中手関節」をそれぞれ意味する。図４において、白の丸は１ＤＯＦの関節を示し、黒の丸は２ＤＯＦの関節を示す。 A 3D hand model consisting only of cone sections (FIGS. 4 and 5) is used. Referring to FIG. 4, this model 58 can be projected onto a frame very quickly for cue detection, while still being accurate enough to track hand and finger movements. In FIG. 4, "DIP", "PIP", "MCP", "CMC", "IP" and "TM" denote the distal interphalangeal joint, the proximal interphalangeal joint, the metacarpophalangeal joint, the carpometacarpal joint, the interphalangeal joint, and the trapeziometacarpal joint (the carpometacarpal joint of the thumb), respectively. In FIG. 4, white circles indicate 1-DOF joints and black circles indicate 2-DOF joints.

人の手のＤＯＦは２２であるが、複雑さを避けるために、この運動学的手モデル５８の自由度を、各指ごとに屈曲に関する１自由度に制限した。非特許文献２で見られるように、θ_ＤＩＰ＝θ_ＰＩＰと設定した。θ_ＰＩＰは各指のＰＩＰ角度を表し、θ_ＤＩＰは各指のＤＩＰ角度を表す。モデル５８の関節の角度を表すために、以下でも同様の表記を用いる。 A human hand has 22 DOF, but to avoid complexity the degrees of freedom of this kinematic hand model 58 were limited to one degree of freedom of flexion per finger. As in Non-Patent Document 2, θ_DIP = θ_PIP is set, where θ_PIP is the PIP angle of each finger and θ_DIP is the DIP angle of each finger. The same notation is used below for the joint angles of the model 58.

加えて、このトラッカは、検索空間の次元数を減じるために、θ_ＰＩＰ＝θ_ＭＣＰと仮定している。全体のモデルは、手の姿勢と位置について６の自由度、各指について２の自由度から成る。このシステムの現在の版では、拇指は標準位置に固定されている。従って、各パーティクルの形状ベクトルｓ_ｎは次元１４を有する。 In addition, to reduce the dimensionality of the search space, this tracker assumes θ_PIP = θ_MCP. The overall model consists of 6 degrees of freedom for the orientation and position of the hand and 2 degrees of freedom for each finger. In the current version of the system, the thumb is fixed in a standard position. Therefore, the shape vector s_n of each particle has dimension 14 (6 for the hand plus 2 for each of the four remaining fingers).
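One possible layout of the 14-dimensional shape vector is sketched below. The ordering of the entries, the representation of orientation as three angles, and the names `HandConfig`, `unpack`, and `joint_angles` are assumptions for illustration. The description also does not say here what the second degree of freedom of each finger is, so it is carried through unnamed.

```python
from dataclasses import dataclass
from typing import List, Tuple

FINGERS = ("index", "middle", "ring", "little")  # the thumb is fixed

@dataclass
class HandConfig:
    position: Tuple[float, float, float]      # x, y, z of the hand
    orientation: Tuple[float, float, float]   # three rotation angles (assumed)
    fingers: List[Tuple[float, float]]        # (flexion, second DOF) per finger

def unpack(s):
    """Split a 14-dimensional shape vector into hand pose and finger parts."""
    if len(s) != 14:
        raise ValueError("expected 6 pose values + 4 x 2 finger values")
    fingers = [(s[6 + 2 * i], s[7 + 2 * i]) for i in range(len(FINGERS))]
    return HandConfig(tuple(s[0:3]), tuple(s[3:6]), fingers)

def joint_angles(flexion):
    """Coupled joints: theta_MCP = theta_PIP = theta_DIP = flexion."""
    return {"MCP": flexion, "PIP": flexion, "DIP": flexion}

cfg = unpack([float(i) for i in range(14)])
```

Coupling the three flexion angles is what reduces each finger to a single flexion parameter in the search space.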

Ｄ.自己重なりの解決策 D. Solution to Self-Occlusion
ドイッチャーは、非特許文献１０において画像の手がかりを評価する方法を提案している。彼は画像中の投影された体の部分の各々を直接評価している。手モデルの投影で自己と重なって隠れた部分の検索を避けるために、本実施の形態では彼の方法を修正した。隠れた部分ではなく、見える部分のみを検索するために、この実施の形態ではｚバッファ方策を提案する。 In Non-Patent Document 10, Deutscher proposes a method for evaluating image cues in which each projected body part is evaluated directly in the image. This embodiment modifies his method to avoid searching for parts of the projected hand model that are hidden by self-occlusion. To search only the visible parts and not the hidden ones, this embodiment proposes a z-buffer approach.

ｚバッファ処理の思想は、遠い対象物を先に描き、より近い、重なる対象物によって上書きされるように、重なって隠れた対象物を描くことである。ｚバッファ処理を用いてこの３Ｄ手モデルを正確に投影できるようにするためには、投影面からの距離に従って分類した指部分の順序付けしたリストが必要である。この投影面は、ステレオカメラシステム内で同一ではないので、各カメラについて別個のｚバッファを生成する必要がある。 The idea of z-buffering is to draw distant objects first, so that occluded objects are overwritten by closer, overlapping objects. To project this 3D hand model correctly using z-buffering, an ordered list of finger parts sorted by their distance from the projection plane is required. Because the projection planes of the two cameras in a stereo system are not identical, a separate z-buffer must be generated for each camera.

３Ｄ手モデル要素を投影するために、各カメラの較正マトリックスの逆行列を用いる。近似として、視点の座標系に変換した各指の重心を比較する。その後、投影面と垂直となったｚ軸に従い、最も大きいｚ座標から始まるよう指部分をソートする。リストを生成した後、指の各要素をリストの順序に従ってバッファに描き、それによって指部分がエッジ又は面の手がかりを表すかどうかをコード化する。こうすることにより、自己と重なる手がかりのインデックスを、重複する指のもので上書きすることになる。どちらもバッファの同じ位置に書込まれることになるからである。深さ情報は、エッジ及び面の輝度により表わされる。 To project the elements of the 3D hand model, the inverse of each camera's calibration matrix is used. As an approximation, the centroids of the fingers, transformed into the coordinate system of the viewpoint, are compared. The finger parts are then sorted along the z axis, which is now perpendicular to the projection plane, starting with the largest z coordinate. After the list is generated, each finger element is drawn into the buffer in list order, encoding whether it represents an edge cue or a surface cue. In this way, the index of a self-occluded cue is overwritten by that of the occluding finger, because both are written to the same position in the buffer. Depth information is represented by the intensity of the edges and surfaces.

その後図６に例示されるようにｚバッファを算出するが、ここでは、各画素についてエッジが予測されるか、手の面が予測されるか、又は何もないと予測されるかがわかっている。これらの投影は最終的に、前処理された画像との比較に用いられる。 The z-buffer is then computed as illustrated in FIG. 6; for each pixel it is known whether an edge, a hand surface, or nothing is expected. These projections are finally used for comparison with the preprocessed images.
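The far-to-near drawing order described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: each projected hand part is replaced by an axis-aligned rectangle, the buffer stores only the expected cue type per pixel (the description additionally encodes the finger part), and the names `render`, `EMPTY`, `SURFACE`, and `EDGE` are hypothetical.

```python
EMPTY, SURFACE, EDGE = 0, 1, 2

def render(parts, width, height):
    """Draw far parts first so that nearer parts overwrite them.

    Each part is a dict with the centroid depth 'z' (larger = farther)
    and a bounding box (x0, y0, x1, y1) that stands in for the
    projected cone section (a simplification for this sketch).
    """
    buf = [[EMPTY] * width for _ in range(height)]
    for part in sorted(parts, key=lambda p: p["z"], reverse=True):
        x0, y0, x1, y1 = part["box"]
        for y in range(max(y0, 0), min(y1, height - 1) + 1):
            for x in range(max(x0, 0), min(x1, width - 1) + 1):
                on_border = x in (x0, x1) or y in (y0, y1)
                buf[y][x] = EDGE if on_border else SURFACE
    return buf

parts = [
    {"z": 2.0, "box": (0, 0, 4, 4)},   # far part
    {"z": 1.0, "box": (2, 2, 6, 6)},   # near part, overlaps the far one
]
zbuf = render(parts, 8, 8)
```

At pixel (4, 4) the far part would predict an edge, but the nearer part covers it, so the buffer predicts a surface there. One such buffer would be built per camera.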

Ｅ．形状確率の評価 E. Evaluation of Shape Probability
前処理されたフレームで見出されたエッジと面との手がかりに基づいて、各パーティクルの形状を評価する。各パーティクルｎのエッジ確率Π_ｅと面確率Π_ａとを計算するために、以下の式を用いる。 The shape of each particle is evaluated on the basis of the edge and surface cues found in the preprocessed frame. The following equations are used to calculate the edge probability Π_e and the surface probability Π_a of each particle n.

ここでΦ^ａはバイナリの肌マスクであり、μ_ｎ ^ａ及びμ_ｎ ^ｍはパーティクルｎについて実際に見出された肌面の画素と欠落している肌面の画素とを含み、Φ^ｅはマスクによりカットされた勾配画像であり、Π_ｎ ^ａ／ｅはパーティクルｎについてｚバッファにより生じた全ての面／エッジを含む。肌色マスク２８０上とエッジフィルタ処理された画像２８２との上に描かれた投影モデルを図７に示す。

Here, Φ^a is the binary skin mask; μ_n^a and μ_n^m contain, for particle n, the skin-surface pixels actually found and the skin-surface pixels that are missing, respectively; Φ^e is the gradient image cut by the mask; and Π_n^{a/e} contains all surfaces/edges produced by the z-buffer for particle n. FIG. 7 shows the projected model drawn on the skin color mask 280 and on the edge-filtered image 282.

さらに、２つの指部分間の最短距離を計算することにより、指間の衝突も検出する。もしこの距離がそれらの半径の合計より小さければ、衝突があると仮定して、Π_ｃｏｌｌにより確率関数内でこの状態にペナルティを与える。次に各パーティクルの確率Π_ｎを以下のように計算する Furthermore, collisions between fingers are detected by calculating the shortest distance between two finger parts. If this distance is less than the sum of their radii, a collision is assumed, and this state is penalized in the probability function through Π_coll. The probability Π_n of each particle is then calculated as follows:

λを設定することにより、指数関数の入力範囲を選択することができる。[ｅ^−λ,１]の範囲内での確率を得るために、全ての重みω[ｉ]を実験により決定して、合計が１となるように正規化する。また、全てのΠ_ｎを合計が１になるように正規化する。その後、全てのパーティクルの重み平均を計算して、現在の手の形状ｓ_ｍｅａｎを推定する。

By setting λ, the input range of the exponential function can be selected. To obtain probabilities in the range [e^−λ, 1], all weights ω[i] are determined experimentally and normalized so that they sum to 1. All Π_n are also normalized so that they sum to 1. The weighted mean of all particles is then calculated to estimate the current hand shape s_mean.
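The normalization and weighted-mean steps can be illustrated with the sketch below. The equations themselves are not reproduced in this text, so the exponential form in `particle_probability` is only an assumed form that matches the stated properties (weights summing to 1 and values in [e^−λ, 1]); it should not be read as the patent's equation. The function names and the default `lam` are hypothetical.

```python
import math

def particle_probability(cues, weights, lam=5.0):
    """Combine cue scores in [0, 1] into a value in [exp(-lam), 1].

    Assumed form: exp(-lam * sum_i w_i * (1 - cue_i)), with the weights
    normalized to sum to 1. This is an illustration, not the source's formula.
    """
    total = sum(weights)
    err = sum(w / total * (1.0 - c) for w, c in zip(weights, cues))
    return math.exp(-lam * err)

def weighted_mean(particles, probs):
    """Normalize the probabilities to sum to 1 and average the shape vectors."""
    z = sum(probs)
    dim = len(particles[0])
    return [sum(p / z * s[d] for p, s in zip(probs, particles))
            for d in range(dim)]

s_mean = weighted_mean([[0.0, 0.0], [2.0, 4.0]], [1.0, 3.0])
```

A plain average like this ignores angle wrap-around, which a real implementation would have to handle for the orientation components.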

Ｆ．再サンプリングにおける検索空間の区分け F. Partitioning the Search Space in Resampling
パーティクルフィルタの次の繰返しのために、パーティクルの組を再サンプリングしなければならない。基本的に、非特許文献６でイサードが説明しているように、古いパーティクルの組に対する要素分解した（ｆａｃｔｏｒｅｄ）再サンプリングを用いる。この方策を用いることにより、手のトラッキングが可能となるが、個々の指のトラッキングはできない。姿勢形状がうまくアライメントされていないパーティクルは、モデル中の指を画像中の隣の指と比較してしまうことがあるために、指トラッキングの評価を低下させるという問題に直面した。

For the next iteration of the particle filter, the particle set must be resampled. Basically, factored resampling of the old particle set is used, as described by Isard in Non-Patent Document 6. This approach makes it possible to track the hand, but not the individual fingers. The problem encountered was that particles with a poorly aligned pose degrade the evaluation of finger tracking, because a finger of the model may be compared with the neighboring finger in the image.

性能を向上させるために、パーティクルの組を２つの分離した組に分割してみた。第１の組を、現在の指形状での手（掌）の姿勢をトラッキングするために用い、このため、前回の繰返しでの平均指形状をこのパーティクルの組に割当てる。すなわち、パーティクル形状の指要素を、前回の繰返しで得られた現在の指形状で置換える。姿勢は全てのパーティクルから、それらの重みを考慮して再サンプリングされる。第２の組は、前回知られた手（掌）の姿勢での指をトラッキングするのに用いられる。この目的のために、前回の繰返しでの平均の手（掌）姿勢をこの組の全てのパーティクルに割当てる。すなわち、パーティクル形状の手（掌）要素を、前回の繰返しで得られた現在の手（掌）形状と置換える。その後、全てのパーティクルから指形状を再サンプリングする。再サンプリングの後、両方の組をマージする。 In order to improve performance, the particle set was divided into two separate sets. The first set is used to track the posture of the hand (palm) with the current finger shape, and therefore the average finger shape from the previous iteration is assigned to this particle set. That is, the particle-shaped finger element is replaced with the current finger shape obtained in the previous iteration. The pose is resampled from all particles, taking their weights into account. The second set is used to track the finger in the previously known hand (palm) posture. For this purpose, the average hand (palm) posture from the previous iteration is assigned to all particles in this set. That is, the particle-shaped hand (palm) element is replaced with the current hand (palm) shape obtained in the previous iteration. Thereafter, the finger shape is resampled from all the particles. After resampling, merge both sets.

この方策を用いることによって、第２の組での指形状の検索にあたって前回知られた位置にパーティクルを集中させるという利点が得られ、実際上、検索空間の次元を６自由度だけ減じることができる。同時に、第１の組を用いて前回知られた指形状を用いて新たな手の姿勢を検索する。パーティクルの平均をこのようにして割当てることにより、多数の仮説を表すためのパーティクルフィルタの属性を保存する。 By using this measure, the advantage of concentrating the particles at the previously known position when searching for the finger shape in the second set is obtained, and the dimension of the search space can be actually reduced by 6 degrees of freedom. . At the same time, a new hand posture is searched using the previously known finger shape using the first set. By assigning the average of particles in this way, the particle filter attributes for representing a number of hypotheses are preserved.
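The partitioned resampling described above might be sketched as follows. Several details are assumptions made for the example: the first six entries of the shape vector are taken to be the hand pose, the two partitions are given equal size, and Gaussian diffusion noise with standard deviation `sigma` is added to the resampled part. The function name `resample_partitioned` is hypothetical.

```python
import random

POSE_DIM = 6  # assumed layout: first 6 entries are hand position/orientation

def resample_partitioned(particles, probs, mean_shape, sigma=0.02, rng=random):
    """Build a new particle set from two partitions.

    Partition 1 searches the hand pose: the pose is resampled by weight and
    the fingers are set to the previous mean finger shape.
    Partition 2 searches the fingers: the fingers are resampled by weight and
    the pose is set to the previous mean pose.
    """
    def noisy(values):
        return [x + rng.gauss(0.0, sigma) for x in values]

    n = len(particles)
    half = n // 2
    drawn = rng.choices(particles, weights=probs, k=n)
    pose_set = [noisy(s[:POSE_DIM]) + list(mean_shape[POSE_DIM:])
                for s in drawn[:half]]
    finger_set = [list(mean_shape[:POSE_DIM]) + noisy(s[POSE_DIM:])
                  for s in drawn[half:]]
    return pose_set + finger_set  # merge both sets

old = [[float(i)] * 14 for i in range(10)]
new = resample_partitioned(old, [1.0] * 10, [0.5] * 14,
                           sigma=0.0, rng=random.Random(0))
```

Because the second partition holds the pose fixed, its particles vary only in the finger dimensions, which is the reduction by 6 degrees of freedom that the text refers to.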

Ｇ．分散リアルタイムトラッキング G. Distributed Real-Time Tracking
パーティクルフィルタは、クラスタの並列化によく適している。というのも、計算上コストのかかる各パーティクルの評価が、全ての他のパーティクルから独立しているからである。しかし、パーティクルを全てのクラスタノードに分配し、対応の確率を集めるためには通信が必要である。ここでは、前処理された画像を分配することも必要である。 Particle filters are well suited to parallelization on a cluster, because the computationally expensive evaluation of each particle is independent of all other particles. However, communication is needed to distribute the particles to all cluster nodes and to collect the corresponding probabilities. Here, the preprocessed images must also be distributed.

この実施の形態のクラスタノードは標準的なギガビットのネットワークを介して相互に接続されている。画像を送るために必要なトラフィックを最小にするため、ブロードキャストを用いた。分散パーティクルフィルタを実現するために、非特許文献９で提案された分散視覚クラスタ（ＤｉｓｔｒｉｂｕｔｅｄＶｉｓｉｏｎＣｌｕｓｔｅｒ）を用いる。各ノードにおいて多数の評価スレッドを開始することで、マルチプロセッサ、マルチコアのシステムの利益を享受することもできる。ステレオ画像の評価を、パーティクルフィルタの繰返しごとに１画像のみにインターリーブすることで、性能をほぼ２倍にできることがわかった。全ての試験とベンチマークのために、実験的に決定した３０００個のパーティクルを用いた。より多くのパーティクルを用いてもそれほどトラッキングが改善されず、２０００個より少ないパーティクルでは結果がかなり低下した。 The cluster nodes of this embodiment are connected to each other via a standard gigabit network. Broadcast was used to minimize the traffic required to send images. In order to realize a distributed particle filter, a distributed vision cluster proposed in Non-Patent Document 9 is used. By starting a large number of evaluation threads in each node, it is possible to enjoy the benefits of a multiprocessor, multicore system. It was found that the performance can be almost doubled by interleaving the stereo image to only one image every time the particle filter is repeated. Experimentally determined 3000 particles were used for all tests and benchmarks. The tracking was not improved much with more particles, and the results were considerably degraded with fewer than 2000 particles.
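The scatter-and-gather pattern of the distributed evaluation can be illustrated on a single machine. In this sketch a thread pool stands in for the cluster nodes, which is an assumption made purely for illustration: the system described here uses networked PCs, and Python threads would not give a real speedup for CPU-bound scoring. The names `evaluate_chunk` and `evaluate_distributed` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_chunk(chunk, evaluate):
    """Work done on one node: score every particle in the chunk."""
    return [evaluate(p) for p in chunk]

def evaluate_distributed(particles, evaluate, n_workers=9):
    """Split the particles into chunks, score the chunks in parallel, and
    return the probabilities in the original particle order."""
    size = max(1, -(-len(particles) // n_workers))  # ceiling division
    chunks = [particles[i:i + size] for i in range(0, len(particles), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(evaluate_chunk, chunks,
                                [evaluate] * len(chunks)))
    return [prob for chunk in results for prob in chunk]

probs = evaluate_distributed(list(range(20)), lambda p: p * 2, n_workers=3)
```

Only the particles and their resulting probabilities cross the node boundary in each iteration, in addition to the preprocessed images that are broadcast once per frame.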

分散視覚クラスタにより、パーティクルフィルタは９個のクラスタを用いて６．８４倍の速度まで拡大することができたが、これは７６．１％の効率である。１０．２ＦＰＳに到達することができた。測定のために、クラスタノードではデュアル２．１６ＧＨｚのＣＰＵを用いた。性能尺度を図８に示す。 With the distributed vision cluster, the particle filter scaled to a speedup of 6.84 using 9 clusters, which corresponds to an efficiency of 76.1%. A rate of 10.2 FPS was reached. For the measurements, dual 2.16 GHz CPUs were used in the cluster nodes. The performance measurements are shown in FIG. 8.
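As a quick check of these figures, parallel efficiency is conventionally the speedup S divided by the number of processing units N. Taking the 9 clusters as 9 units (an assumption about how the source counts them):

```latex
E = \frac{S}{N} = \frac{6.84}{9} = 0.76
```

This is approximately the stated 76.1%; the small difference is within the rounding of the reported speedup.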

＜システムの構造＞ <System structure>
図９はこの実施の形態の手モデルトラッカ９０の全体構造を示す。図９を参照して、手モデルトラッカ９０は、ステレオカメラ３２から出力されるステレオ画像出力をキャプチャするためのフレーム撮影部１００と、フレーム撮影部１００によってキャプチャされたステレオ画像を記憶するためのフレームメモリ１０２と、フレームメモリ１０２に記憶されたステレオ画像を処理して、手の肌色バイナリマップ１４０とエッジ画像１４２とを出力するための画像前処理モジュール１０４と、これらを記憶するためのメモリ１０６とを含む。
FIG. 9 shows the overall structure of the hand model tracker 90 of this embodiment. Referring to FIG. 9, the hand model tracker 90 includes a frame capture unit 100 for capturing the stereo image output of the stereo camera 32; a frame memory 102 for storing the stereo images captured by the frame capture unit 100; an image preprocessing module 104 for processing the stereo images stored in the frame memory 102 and outputting a skin color binary map 140 and an edge image 142 of the hand; and a memory 106 for storing them.

手モデルトラッカ９０はさらに、３Ｄの手モデル１１６を記憶するための記憶部（メモリ）と、３Ｄの手モデル１１６にランダムなノイズを付加して３０００個のパーティクルを生成するための乱数発生モジュール１１８と、手の姿勢を推定するためのパーティクル形状を記憶するパーティクルメモリ１２０と、指の姿勢を推定するためのパーティクル形状を記憶するパーティクルメモリ１２２とを含む。 The hand model tracker 90 further includes a storage unit (memory) for storing the 3D hand model 116, and a random number generation module 118 for generating 3000 particles by adding random noise to the 3D hand model 116. And a particle memory 120 for storing the particle shape for estimating the posture of the hand, and a particle memory 122 for storing the particle shape for estimating the posture of the finger.

手モデルトラッカ９０はさらに、手の姿勢のためのパーティクルフィルタユニット１０８と、指の姿勢のためのパーティクルフィルタユニット１１０と、推定された手の姿勢と指の姿勢とを組合わせるための組合せブロック１１２と、組合せブロック１１２の出力を用いて３Ｄ手モデル１１６を更新するモデル更新モジュール１１４とを含む。 The hand model tracker 90 further includes a particle filter unit 108 for hand posture, a particle filter unit 110 for finger posture, and a combination block 112 for combining the estimated hand posture and finger posture. And a model update module 114 that updates the 3D hand model 116 using the output of the combination block 112.

図１０は画像前処理モジュール１０４のブロック図を示す。図１０を参照して、画像前処理モジュール１０４は、フレームメモリ１０２に記憶された手画像から肌色の面を検出し、肌色バイナリマップ１４０を出力するための肌色フィルタ１８０と、肌色バイナリマップ１４０をマスクとして用いて元の画像のうち無関係な部分をカットすることによって、フレームメモリ１０２に記憶された画像の一部を抽出するためのマスキングモジュール１８４と、マスキングモジュール１８４から出力された手の切出画像１８６からエッジ画像を検出するためのエッジフィルタ１８８とを含む。エッジフィルタ１８８の出力がエッジ画像１４２である。 FIG. 10 shows a block diagram of the image preprocessing module 104. Referring to FIG. 10, the image preprocessing module 104 includes a skin color filter 180 for detecting skin-colored surfaces in the hand image stored in the frame memory 102 and outputting the skin color binary map 140; a masking module 184 for extracting part of the image stored in the frame memory 102 by using the skin color binary map 140 as a mask to cut the irrelevant parts of the original image; and an edge filter 188 for detecting an edge image from the cut-out hand image 186 output by the masking module 184. The output of the edge filter 188 is the edge image 142.

図１１は手の姿勢のためのパーティクルフィルタユニット１０８の詳細をブロック図で示す。図１１を参照して、パーティクルフィルタユニット１０８は、肌色バイナリマップ１４０と、エッジ画像１４２と、手のパーティクルモデル形状とを記憶するためのパーティクル形状メモリ２００と、各々がその評価ユニットに割当てられたモデル形状の確率を計算するための、多数の評価ユニット２０２Ａ−２０２Ｎと、それぞれの確率を重みとして用いて、手の形状の加重合計を計算することによって、新たな手の姿勢を出力するための、加重合計モジュール２０４と、を含む。 FIG. 11 is a block diagram showing details of the particle filter unit 108 for the hand posture. Referring to FIG. 11, the particle filter unit 108 includes a particle shape memory 200 for storing the skin color binary map 140, the edge image 142, and the particle model shapes of the hand; a number of evaluation units 202A-202N, each for calculating the probabilities of the model shapes assigned to it; and a weighted sum module 204 for outputting a new hand posture by calculating a weighted sum of the hand shapes, using the respective probabilities as weights.

評価ユニット２０２Ａ−２０２Ｎの各々はパーティクルモデル形状のうち手の姿勢に関する部分のみを受け、これらを現在の指形状と組合せる。これらの形状を用いて、評価ユニット２０２Ａ−２０２Ｎはパーティクルの確率を計算する。 Each of the evaluation units 202A-202N receives only the part related to the hand posture of the particle model shape and combines them with the current finger shape. Using these shapes, the evaluation units 202A-202N calculate particle probabilities.

評価ユニット２０２Ａ−２０２Ｎの各々は同じ構造を有する。図１２は、評価ユニット２０２Ａ−２０２Ｎの一つである評価ユニット２０２Ｋの詳細を示す。図１２を参照して、評価ユニット２０２Ｋはこのユニットに割当てられた手の姿勢のパーティクルモデル形状の各々を現在の指形状と組合せ、結果として得られるパーティクル形状ベクトルを出力するためのベクトル形成モジュール２１８と、ベクトル形成モジュール２１８から出力されたパーティクル形状ベクトルを記憶するためのメモリ２２０と、パーティクル形状ベクトルの要素をカメラ３２の較正マトリックスの逆行列を用いて、カメラ３２の画像面に投影するのに適した座標系に変換するための座標変換モジュール２２２とを含む。この新たな座標系は、画像面にＸ軸及びＹ軸を有し、さらに画像面に垂直なＺ軸を有する。 Each of the evaluation units 202A-202N has the same structure. FIG. 12 shows details of the evaluation unit 202K, which is one of the evaluation units 202A-202N. Referring to FIG. 12, the evaluation unit 202K includes a vector forming module 218 for combining each of the hand-posture particle model shapes assigned to this unit with the current finger shape and outputting the resulting particle shape vector; a memory 220 for storing the particle shape vector output from the vector forming module 218; and a coordinate conversion module 222 for converting the elements of the particle shape vector, using the inverse of the calibration matrix of the camera 32, into a coordinate system suitable for projection onto the image plane of the camera 32. This new coordinate system has its X and Y axes in the image plane and its Z axis perpendicular to the image plane.

評価ユニット２０２Ｋはさらに、メモリ２２０に記憶された形状の手の部分の重心を計算するための重心計算モジュール２２４と、手の各部分のそれぞれの重心のＺ座標の降順で、手の各部分をソートし、手の各部分の順序付きのリスト２２８を出力するためのソートモジュール２２６と、ｚバッファ２３２と、リスト２２８の順序で、ｚバッファ２３２にパーティクル形状の投影画像を描画するための手部分描画モジュール２３０と、手の各部分間の衝突を検出し、衝突に関しそれぞれのペナルティを計算するための衝突検出モジュール２３４と、肌色バイナリマップとエッジ画像とを記憶するためのメモリ２３６と、メモリ２３６に記憶された肌色バイナリマップ及びエッジ画像と、ｚバッファ２３２内の画像と、衝突検出モジュール２３４によって計算された衝突ペナルティとを用いて、このユニット２０２Ｋに割当てられたパーティクル形状の各々の確率を計算するための確率計算モジュール２３８とを含む。結果として得られる、Ｋ番目の評価ユニット２０２Ｋに割当てられた手形状のｊ番目のパーティクルの確率Π_{Ｈ，Ｋ,ｊ}が、図１１に示す加重合計モジュール２０４に与えられる。 The evaluation unit 202K further includes a centroid calculation module 224 for calculating the centroid of each hand part of the shape stored in the memory 220; a sort module 226 for sorting the hand parts in descending order of the Z coordinates of their centroids and outputting an ordered list 228 of the hand parts; a z-buffer 232; a hand part drawing module 230 for drawing the projected image of the particle shape into the z-buffer 232 in the order of the list 228; a collision detection module 234 for detecting collisions between hand parts and calculating the corresponding penalties; a memory 236 for storing the skin color binary map and the edge image; and a probability calculation module 238 for calculating the probability of each particle shape assigned to this unit 202K, using the skin color binary map and the edge image stored in the memory 236, the image in the z-buffer 232, and the collision penalties calculated by the collision detection module 234. The resulting probability Π_{H,K,j} of the j-th hand-shape particle assigned to the K-th evaluation unit 202K is supplied to the weighted sum module 204 shown in FIG. 11.

指のためのパーティクルフィルタユニット１１０の詳細を図１３に示す。図１３を参照して、パーティクルフィルタユニット１１０は、肌色バイナリマップ１４０とエッジ画像１４２と、指のパーティクルモデル形状とを記憶するためのパーティクル形状メモリ２４０と、各々がその評価ユニットに割当てられたモデル形状の確率を計算する多数の評価ユニット２４２Ａ−２４２Ｎと、それぞれの確率を重みとして用いて、指形状を付加することにより、新たな指姿勢を出力するための加重合計モジュール２４４と、を含む。 Details of the particle filter unit 110 for the fingers are shown in FIG. 13. Referring to FIG. 13, the particle filter unit 110 includes a particle shape memory 240 for storing the skin color binary map 140, the edge image 142, and the particle model shapes of the fingers; a number of evaluation units 242A-242N, each calculating the probabilities of the model shapes assigned to it; and a weighted sum module 244 for outputting a new finger posture by summing the finger shapes, using the respective probabilities as weights.

評価ユニット２４２Ａ−２４２Ｎの各々は、パーティクルモデル形状のうち指姿勢に関する部分のみを受け、これらを現在の手形状と組合せる。これらの形状を用いて、評価ユニット２４２Ａ−２４２Ｎはパーティクルの確率を計算する。 Each of the evaluation units 242A-242N receives only the part related to the finger posture of the particle model shape and combines them with the current hand shape. Using these shapes, the evaluation unit 242A-242N calculates the probability of the particles.

評価ユニット２４２Ａ−２４２Ｎの各々は同じ構造を有する。図１４は評価ユニット２４２Ａ−２４２Ｎの一つである評価ユニット２４２Ｋの詳細を示す。図１４を参照して、評価ユニット２４２Ｋは、このユニットに割当てられた指のパーティクルモデル形状の各々を現在の手の形状と組合せ、結果として得られる手形状ベクトルを出力するためのベクトル形成モジュール２５８と、ベクトル形成モジュール２５８から出力されるパーティクル形状ベクトルを記憶するためのメモリ２６０と、パーティクル形状ベクトルの要素をカメラ３２の画像面に投影するのに適した座標系に変換するための座標変換モジュール２６２とを含む。この新たな座標系は、画像面にＸ軸及びＹ軸を有し、さらに画像面に垂直なＺ軸を有する。 Each of the evaluation units 242A-242N has the same structure. FIG. 14 shows details of the evaluation unit 242K, which is one of the evaluation units 242A-242N. Referring to FIG. 14, the evaluation unit 242K includes a vector forming module 258 for combining each of the finger particle model shapes assigned to this unit with the current hand shape and outputting the resulting hand shape vector; a memory 260 for storing the particle shape vector output from the vector forming module 258; and a coordinate conversion module 262 for converting the elements of the particle shape vector into a coordinate system suitable for projection onto the image plane of the camera 32. This new coordinate system has its X and Y axes in the image plane and its Z axis perpendicular to the image plane.

評価ユニット２４２Ｋはさらに、メモリ２６０に記憶された形状の手の各部分の重心を計算するための重心計算モジュール２６４と、手部分のそれぞれの重心のＺ座標の降順で、手の各部分をソートし、手の各部分の順序付きリスト２６８を出力するためのソートモジュール２６６と、ｚバッファ２７２と、リスト２６８の順序で、ｚバッファ２７２にパーティクル形状の投影画像を描画するための手部分描画モジュール２７０と、手の各部分間の衝突を検出し、衝突に関しそれぞれのペナルティを計算するための衝突検出モジュール２７４と、肌色バイナリマップとエッジ画像とを記憶するためのメモリ２７６と、メモリ２７６に記憶された肌色バイナリマップ及びエッジ画像と、ｚバッファ２７２内の画像と、衝突検出モジュール２７４によって計算された衝突ペナルティとを用いて、このユニット２４２Ｋに割当てられたパーティクル形状の各々の確率を計算するための確率計算モジュール２７８とを含む。結果として得られる、Ｋ番目の評価ユニット２４２Ｋに割当てられた指形状のｊ番目のパーティクルの確率Π_{Ｆ，Ｋ,ｊ}が、図１３に示す加重合計モジュール２４４に与えられる。 The evaluation unit 242K further includes a centroid calculation module 264 for calculating the centroid of each hand part of the shape stored in the memory 260; a sort module 266 for sorting the hand parts in descending order of the Z coordinates of their centroids and outputting an ordered list 268 of the hand parts; a z-buffer 272; a hand part drawing module 270 for drawing the projected image of the particle shape into the z-buffer 272 in the order of the list 268; a collision detection module 274 for detecting collisions between hand parts and calculating the corresponding penalties; a memory 276 for storing the skin color binary map and the edge image; and a probability calculation module 278 for calculating the probability of each particle shape assigned to this unit 242K, using the skin color binary map and the edge image stored in the memory 276, the image in the z-buffer 272, and the collision penalties calculated by the collision detection module 274. The resulting probability Π_{F,K,j} of the j-th finger-shape particle assigned to the K-th evaluation unit 242K is supplied to the weighted sum module 244 shown in FIG. 13.

＜動作＞ <Operation>
手モデルトラッカ９０は以下のように動作する。図９を参照して、３Ｄ手モデル１１６は図４及び図５に示す初期３Ｄ手モデル５８を記憶する。カメラ３２はステレオ画像信号を出力する。各ステレオフレームはフレーム撮影部１００によって撮影され、フレームメモリ１０２に記憶される。
The hand model tracker 90 operates as follows. Referring to FIG. 9, the 3D hand model 116 stores the initial 3D hand model 58 shown in FIGS. 4 and 5. The camera 32 outputs a stereo image signal. Each stereo frame is captured by the frame capture unit 100 and stored in the frame memory 102.

図１０を参照して、肌色フィルタ１８０はＨＳＶ空間における画像中の肌色面を検出して肌色バイナリマップ１４０を生成する。このマップはマスキングモジュール１８４によって、画像中の関連のない部分全てをカットするのに用いられ、これによって切出された手画像１８６が生成され、この結果として得られる、切出された手画像１８６にエッジフィルタ１８８が適用されてエッジ画像１４２が得られる。 Referring to FIG. 10, the skin color filter 180 detects the skin-colored surfaces of the image in HSV space and generates the skin color binary map 140. The masking module 184 uses this map to cut all irrelevant parts of the image, producing the cut-out hand image 186. The edge filter 188 is then applied to the cut-out hand image 186 to obtain the edge image 142.
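The edge filtering step can be illustrated with a Sobel operator, which the description names as one example of an edge filter. The sketch below works on a grayscale image given as a list of rows; the conversion of the cut-out hand image to grayscale and the function name `sobel_magnitude` are assumptions for the example.

```python
def sobel_magnitude(gray):
    """gray: 2D list of intensities. Returns the gradient magnitude per
    pixel, with the one-pixel image border left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[y-1][x+1] + 2 * gray[y][x+1] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2 * gray[y][x-1] - gray[y+1][x-1])
            # Vertical gradient: bottom row minus top row.
            gy = (gray[y+1][x-1] + 2 * gray[y+1][x] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2 * gray[y-1][x] - gray[y-1][x+1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# Vertical step edge: left half dark, right half bright.
gray = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel_magnitude(gray)
```

The two interior columns adjacent to the step respond strongly, while a flat region gives zero response.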

結果として得られる肌色バイナリマップ１４０とエッジ画像１４２とが手姿勢のためのパーティクルフィルタユニット１０８と指姿勢のためのパーティクルフィルタユニット１１０とに与えられる。 The resulting skin color binary map 140 and edge image 142 are provided to the particle filter unit 108 for hand posture and the particle filter unit 110 for finger posture.

一方で、乱数発生モジュール１１８がランダムノイズを発生し、これらを３Ｄ手モデル１１６に付加して３０００個のパーティクルを生成する。パーティクルの半数はパーティクルメモリ１２０に記憶され、残りの半数はパーティクルメモリ１２２に記憶される。前者は次の手姿勢を推定するのに用いられ、後者は次の指姿勢を推定するのに用いられる。 On the other hand, the random number generation module 118 generates random noise and adds these to the 3D hand model 116 to generate 3000 particles. Half of the particles are stored in the particle memory 120 and the remaining half are stored in the particle memory 122. The former is used to estimate the next hand posture, and the latter is used to estimate the next finger posture.

図１１を参照して、肌色バイナリマップ１４０とエッジ画像１４２とはパーティクル形状メモリ２００に記憶される。手姿勢のパーティクル形状はパーティクル形状メモリ２００に記憶される。これらはともに、評価ユニット２０２Ａ−２０２Ｎの各々に与えられる。現在の指形状が評価ユニット２０２Ａ−２０２Ｎの各々に供給される。 Referring to FIG. 11, skin color binary map 140 and edge image 142 are stored in particle shape memory 200. The particle shape of the hand posture is stored in the particle shape memory 200. Both are provided to each of the evaluation units 202A-202N. The current finger shape is supplied to each of the evaluation units 202A-202N.

図１２を参照して、手姿勢のパーティクル形状と、現在の指形状のパーティクルフィルタとがベクトル形成モジュール２１８によって組合されてパーティクル形状ベクトルが形成され、これはメモリ２２０に記憶される。座標変換モジュール２２２はベクトルの座標系を、ｚバッファ処理に適したものに変換する。重心計算モジュール２２４はパーティクル形状の手部分の各々の重心を計算する。 Referring to FIG. 12, the particle shape vector is formed by combining the particle shape of the hand posture and the particle filter of the current finger shape by the vector forming module 218, and this is stored in the memory 220. The coordinate conversion module 222 converts the vector coordinate system into one suitable for z-buffer processing. The center-of-gravity calculation module 224 calculates the center of gravity of each hand part of the particle shape.

ソートモジュール２２６は手部分をそれらのＺ座標の降順でソートして、リスト２２８を生成する。手部分描画モジュール２３０は手の各要素をｚバッファ２３２に描画して、より遠くの対象物が先に描画され、より近い、重なる対象物によって上書きされるようにし、指部分と、これがエッジ又は面の手がかりを表すかどうかを符号化する。手部分描画モジュール２３０により、自己に隠される手がかりのインデックスが、重なる指のものによって上書きされる。 Sort module 226 sorts the hand portions in descending order of their Z coordinates to generate list 228. The hand drawing module 230 draws each element of the hand in the z-buffer 232 so that the farther object is drawn first and overwritten by the closer, overlapping objects, and the finger part and the edge or Encodes whether to represent a clue to the surface. The hand part drawing module 230 overwrites the index of the clue that is hidden by itself with the overlapping finger.

図６に例示されるｚバッファ２３２を参照して、各画素について、エッジが予測されるか、面が予測されるか、何もないことが予測されるかは分かっている。これらの投影は最終的には確率計算モジュール２３８で前処理された画像との比較のために用いられる。 With reference to the z-buffer 232 illustrated in FIG. 6, it is known for each pixel whether an edge is predicted, a face is predicted, or nothing is predicted. These projections are ultimately used for comparison with the image preprocessed by the probability calculation module 238.

確率計算モジュール２３８は、上述の式（１）−（４）を用いて、前処理されたフレームで見出されメモリ２３６に記憶されたエッジ及び面の手がかりに基づいて各パーティクル形状の確率を計算する。 The probability calculation module 238 uses equations (1)-(4) above to calculate the probability of each particle shape on the basis of the edge and surface cues found in the preprocessed frame and stored in the memory 236.

衝突検出モジュール２３４は、２つの指部分間の最短距離を計算することによって、指間の衝突を検出する。もしこの距離が指の半径の合計より短ければ、衝突が生じていると仮定され、この状態にはΠ_ｃｏｌｌにより確率関数におけるペナルティが与えられる。その後各パーティクルの確率Π_ｎが、式（５）を用いて確率計算モジュール２３８によって計算される。 The collision detection module 234 detects collisions between fingers by calculating the shortest distance between two finger parts. If this distance is less than the sum of the finger radii, a collision is assumed, and this state is penalized in the probability function through Π_coll. The probability Π_n of each particle is then calculated by the probability calculation module 238 using equation (5).
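The collision test can be sketched as a segment-to-segment distance check. In this illustration each finger part is approximated by its axis segment with a single radius, which is a simplifying assumption (the model's cone sections may have varying radii), and the function names are hypothetical. How the resulting penalty enters Π_coll is not shown.

```python
def closest_distance(p1, q1, p2, q2):
    """Shortest distance between the 3D segments p1-q1 and p2-q2."""
    def sub(a, b): return [a[i] - b[i] for i in range(3)]
    def dot(a, b): return sum(a[i] * b[i] for i in range(3))
    def clamp(x): return max(0.0, min(1.0, x))
    d1, d2, r = sub(q1, p1), sub(q2, p2), sub(p1, p2)
    a, e, f = dot(d1, d1), dot(d2, d2), dot(d2, r)
    eps = 1e-12
    if a <= eps and e <= eps:          # both segments are points
        s = t = 0.0
    elif a <= eps:                     # first segment is a point
        s, t = 0.0, clamp(f / e)
    else:
        c = dot(d1, r)
        if e <= eps:                   # second segment is a point
            t, s = 0.0, clamp(-c / a)
        else:
            b = dot(d1, d2)
            denom = a * e - b * b      # zero when the segments are parallel
            s = clamp((b * f - c * e) / denom) if denom > eps else 0.0
            t = (b * s + f) / e
            if t < 0.0:
                t, s = 0.0, clamp(-c / a)
            elif t > 1.0:
                t, s = 1.0, clamp((b - c) / a)
    c1 = [p1[i] + s * d1[i] for i in range(3)]
    c2 = [p2[i] + t * d2[i] for i in range(3)]
    diff = sub(c1, c2)
    return dot(diff, diff) ** 0.5

def collides(part_a, part_b):
    """Each part is (start, end, radius). Collision if the axes come closer
    than the sum of the radii."""
    (p1, q1, ra), (p2, q2, rb) = part_a, part_b
    return closest_distance(p1, q1, p2, q2) < ra + rb

index = ((0, 0, 0), (1, 0, 0), 0.4)
middle = ((0, 1, 0), (1, 1, 0), 0.4)
bent = ((0, 0.5, 0), (1, 0.5, 0), 0.2)
```

Here `collides(index, middle)` is false because the axes are 1.0 apart and the radii sum to 0.8, while `collides(index, bent)` is true.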

こうして、各形状についてその確率Π_{Ｈ，Ｋ，ｊ}が計算され、図１１に示す加重合計モジュール２０４に与えられる。 Thus, the probabilities Π _{H, K, j} for each shape are calculated and provided to the weighted sum module 204 shown in FIG.

確率Π_{Ｈ，Ｋ，ｊ}の全てが与えられると、加重合計モジュール２０４は全ての手のパーティクルの加重平均を計算し、式（６）を用いて手の姿勢について現在の手形状ｓ_{ｍｅａｎ，Ｈ}を推定する。同様に、指姿勢のためのパーティクルフィルタ１１０は、式（６）を用いて指の姿勢について現在の指形状ｓ_{ｍｅａｎ，Ｆ}を推定する。手の姿勢についての手形状ｓ_{ｍｅａｎ，Ｈ}と、指の姿勢についての指形状ｓ_{ｍｅａｎ，Ｆ}とが与えられると、図９に示される組合せブロック１１２は、推定された手の姿勢を推定された指の姿勢と組合せることができる。結果として得られる推定された手全体の形状がモデル更新モジュール１１４に与えられ、更新モジュール１１４は推定された手全体の形状で３Ｄの手モデル１１６を更新する。 Given all of the probabilities Π_{H,K,j}, the weighted sum module 204 calculates the weighted mean of all hand particles and, using equation (6), estimates the current hand shape s_{mean,H} for the hand posture. Similarly, the particle filter unit 110 for the finger posture estimates the current finger shape s_{mean,F} for the finger posture using equation (6). Given the hand shape s_{mean,H} for the hand posture and the finger shape s_{mean,F} for the finger posture, the combination block 112 shown in FIG. 9 combines the estimated hand posture with the estimated finger posture. The resulting estimate of the overall hand shape is supplied to the model update module 114, which updates the 3D hand model 116 with it.

By repeating the above estimation for each captured stereo image, the hand model tracker 90 tracks the hand posture in real time. Because particle filters are used to estimate the posture of the whole hand, the estimate is unlikely to become trapped in a local minimum. The hand model tracker 90 therefore tracks the hand posture reliably. Decomposing the search space into hand posture and finger posture allows each iteration to be executed in a short time, which makes reliable real-time tracking possible.

<Computer hardware configuration>
The hand model tracker 90 described above can be realized by a plurality of computers connected to each other via a high-speed network. Referring to FIG. 15, a computer system 320 for realizing the hand model tracker 90 includes: a personal computer (PC) 332 connected to the camera 32, for capturing stereo images and preprocessing the captured images; PCs 334A-334X, each having a dual-core CPU configuration whose cores are configured to operate as the evaluation units 202A-202N and 242A-242N; a PC 338 for calculating the weighted averages of the hand shape and the finger shape from the probabilities output by the evaluation units 202A-202N and 242A-242N, and for combining the estimated hand posture with the estimated finger posture; a PC 336 for storing the 3D hand model and updating it using the output of the PC 338; a hand control unit 340 for controlling the robot hand according to the estimated hand posture; and a high-speed network 330 connecting the PCs 332, 334A-334X, 336, 338 and 340.
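The division of work among the evaluation units can be sketched as below. This is only an illustration of the idea of splitting the particle set into nearly equal chunks and evaluating the chunks in parallel. Threads on one machine stand in for the networked PCs, and `split_particles`, `evaluate_distributed`, and the `evaluate` callback are hypothetical names, not part of the described system.

```python
from concurrent.futures import ThreadPoolExecutor

def split_particles(particles, n_units):
    """Split particles into n_units nearly equal contiguous chunks."""
    base, extra = divmod(len(particles), n_units)
    chunks, start = [], 0
    for k in range(n_units):
        size = base + (1 if k < extra else 0)   # first `extra` chunks get one more
        chunks.append(particles[start:start + size])
        start += size
    return chunks

def evaluate_distributed(particles, evaluate, n_units):
    """Evaluate every particle, one chunk per worker, preserving particle order."""
    chunks = split_particles(particles, n_units)
    with ThreadPoolExecutor(max_workers=n_units) as pool:
        results = pool.map(lambda chunk: [evaluate(p) for p in chunk], chunks)
        return [prob for chunk_result in results for prob in chunk_result]
```

Because the probabilities come back in particle order, a single node can then form the weighted average over the full set, which mirrors the role given to the PC 338.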

FIG. 16 shows a typical configuration of the PC 332. Referring to FIG. 16, the PC 332 includes a dual-core CPU 376 having two computing units; a read-only memory (ROM) 378 for storing a boot-up program; a random access memory (RAM) 380 for storing the programs to be executed by the dual-core CPU 376 and the variables used by those programs; and a hard disk 374 for storing programs, initial values of variables, the initial 3D hand model, and other data.

The PC 332 further includes a digital versatile disc (DVD) drive 370 for reading data from and writing data to a DVD 382; a keyboard 366 and a mouse 368; a monitor 362; a video capture board 388 to which the camera 32 is connected; a memory port 372 for reading data from and writing data to a portable memory 384; and a network interface 396 that allows the PC 332 to be connected to the network 330.

The other PCs 334A-334X, 336, 338 and 340 have the same configuration. However, because all of these PCs can be controlled from the PC 332, they do not need a mouse, a keyboard, or a monitor.

<Results>
Evaluating tracking performance for complex articulated objects is difficult, because ground-truth data usually cannot be obtained easily from a video sequence. In hand tracking, more accurate methods such as data gloves or markers cannot readily be used either, because they change the appearance of the hand.

We chose to carry out a two-stage evaluation. First, using natural video, we showed that the tracker of the present invention works well in the real world by tracking transitions from one posture to another. Second, to demonstrate the tracker's performance, we generated artificial video sequences with known ground-truth data and evaluated the tracking error of the system.

A. Natural video sequence
Five hand postures were selected, and the transitions between each pair were tracked. After each transition ended, we noted the number of frames required for the tracker to converge completely to the correct posture. All videos were captured at 30 FPS.

Table 1 summarizes the number of frames required for the transition from posture [column] to posture [row]. The average number of frames required for convergence is 4.65. Although there was a slight palm offset, the converged finger shape was always correct.
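As a quick worked conversion (our arithmetic, not a figure stated in the source), the average convergence of 4.65 frames at the capture rate of 30 FPS corresponds to a settling time of roughly 0.16 seconds after a transition ends:

```latex
t_{\text{conv}} = \frac{4.65\ \text{frames}}{30\ \text{frames/s}} = 0.155\ \text{s}
```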

FIG. 17 shows the tracking sequence for the synthetic stereo projection of the hand (frames 150 to 400). The first row shows the input for the surface cue, and the second row shows the input for the edge cue.

B. Synthetic video sequence
The second experiment used synthetic video produced by projecting a stereo animation sequence of the 3D hand model (FIG. 18). Ground-truth data for each sequence can be derived from the sequence itself. The tracked values can therefore be compared with the ground-truth data, giving the results shown in FIG. 19.

FIG. 19 compares the estimated values with the ground-truth data for the two DOFs of the MCP joint of the index finger. As FIG. 19 shows, the error between the tracked values and the ground-truth data is very small.

FIG. 20 shows the hand posture error during tracking over 400 frames. As FIG. 20 shows, the mean error (θ_MCP) peaked at only 0.06 radians, at frame 300.
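One way such a per-frame error curve could be computed from synthetic sequences is sketched below. It assumes the error is the mean absolute difference between estimated and ground-truth joint angles over the DOFs of interest, with differences wrapped to [-π, π). This definition and the function name `per_frame_mean_error` are assumptions made for illustration; the source does not spell out how the mean error was formed.

```python
import numpy as np

def per_frame_mean_error(estimated, reference):
    """Mean absolute angular error per frame, in radians.

    Both inputs have shape (frames, dofs) and hold joint angles in radians.
    """
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff = estimated - reference
    diff = (diff + np.pi) % (2 * np.pi) - np.pi   # wrap differences to [-pi, pi)
    return np.abs(diff).mean(axis=1)
```

For scale, the reported peak of 0.06 radians corresponds to about 3.4 degrees (`np.degrees(0.06)`).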

<Conclusion>
This invention has presented an embodiment of a marker-free hand and finger tracking system using particle filters. We proposed a novel z-buffer-based method for accurately evaluating the cues of objects that, like the human hand, are articulated and partly hidden by self-occlusion. To allow the fingers to be tracked accurately, we proposed particle resampling that partitions the search space. Implementing this with particle filters distributed over a PC cluster made it possible to obtain tracking results at a high frame rate. The effectiveness of the tracking system was demonstrated by controlling a dexterous robot hand with 20 degrees of freedom in real time.

Combining this tracking system with a tracker for the whole human body would bring a very large benefit. The arm posture could be used to obtain the wrist position, which would greatly reduce the search space for the hand orientation.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing a sequence of motions of the robot hand controlled by the embodiment of the present invention.
FIG. 2 is a diagram showing an overview of the data flow according to the embodiment.
FIG. 3 is a diagram showing the setup of one embodiment of the present invention.
FIG. 4 is a diagram schematically showing the 3D hand model used in one embodiment of the present invention.
FIG. 5 is a diagram showing a projection image of the 3D hand model.
FIG. 6 is a diagram showing an image of the 3D hand model projected onto the z-buffer.
FIG. 7 is a diagram showing the projected model drawn on the skin color mask 280 and the edge-filtered image 282.
FIG. 8 is a diagram showing a performance measure of the cluster nodes used in one embodiment of the present invention.
FIG. 9 is a diagram showing the overall structure of the hand model tracker 90 of the embodiment.
FIG. 10 is a block diagram showing the image preprocessing module 104.
FIG. 11 is a block diagram showing details of the particle filter unit 108 for the hand posture.
FIG. 12 is a diagram showing details of the evaluation unit 202K of the particle filter unit 108 for the hand posture.
FIG. 13 is a diagram showing details of the particle filter unit 110 for the fingers.
FIG. 14 is a diagram showing details of the evaluation unit 242K of the particle filter unit 110 for the fingers.
FIG. 15 is a diagram showing the structure of a computer system for realizing one embodiment of the present invention.
FIG. 16 is a diagram showing a typical configuration of a personal computer (PC) used in the computer system 320 shown in FIG. 15.
FIG. 17 is a diagram showing the tracking sequence for the synthetic stereo projection of the hand (frames 150 to 400), in which the first row shows the input for the surface cue and the second row shows the input for the edge cue.
FIG. 18 is a diagram showing the sequence of synthetic video used in the second experiment, produced by projecting a stereo animation sequence of the 3D hand model.
FIG. 19 is a diagram showing the comparison between the ground-truth data and the estimated values for the two DOFs of the MCP joint of the index finger in the experiment.
FIG. 20 is a diagram showing the hand posture error observed in the experiment.

Explanation of symbols

30 hand
32 camera
34 robot hand
58, 116 3D hand model
90 hand model tracker
100 frame capture unit
102 frame memory
104 image preprocessing module
106, 220, 236, 260, 276 memory
108 particle filter unit for the hand
110 particle filter unit for the fingers
112 combination block
114 model update module
118 random number generation module
120 and 122 particle memory
200, 240 particle shape memory
202A-202N and 242A-242N evaluation units
204, 244 weighted sum module
218, 258 vector formation module
222, 262 coordinate transformation module
224, 264 center-of-gravity calculation module
226, 266 sort module
228, 268 list
230, 270 hand part drawing module
232, 272 z-buffer
234, 274 collision detection module
238, 278 probability calculation module

Claims

1. A hand tracking device for tracking a hand in a video sequence using particle filtering, comprising:
storage means for storing a three-dimensional hand model;
cue extraction means for extracting hand posture cues from the current frame of the video sequence;
estimating means for applying particle filtering to the cues and the current shape of the three-dimensional hand model to estimate the posture of the hand in the current frame; and
means for updating the shape of the three-dimensional hand model using the posture estimated by the estimating means,
wherein the estimating means includes:
means for adding random noise to the three-dimensional hand model to generate a predetermined number of particle shapes for the three-dimensional hand model;
z-buffer processing means for generating a z-buffer image for each of the particles representing a hand shape, the z-buffer image being a projection of the three-dimensional hand model onto the image of the video sequence;
probability calculating means for calculating the probability of each of the hand shapes based on the cues and the z-buffer image; and
weighted summing means for calculating a weighted sum of the hand shapes represented by the particles, using the probability of each hand shape as the weight of the particle representing that hand shape.

2. The hand tracking device according to claim 1, wherein the three-dimensional hand model includes a palm part and finger parts,
the probability calculating means includes:
first evaluation means for calculating the probabilities of a first portion of the particles, the finger parts of each particle shape in the first portion being replaced with the current finger parts of the three-dimensional model; and
second evaluation means for calculating the probabilities of a second portion of the particles, the palm part of each particle shape in the second portion being replaced with the current palm part of the three-dimensional model, and
the weighted summing means includes:
first summing means for calculating a weighted sum of the first portion of the particles using the respective probabilities as weights;
second summing means for calculating a weighted sum of the second portion of the particles using the respective probabilities as weights; and
means for combining the palm part of the weighted sum output by the first summing means with the finger parts of the weighted sum output by the second summing means.

3. The hand tracking device according to claim 2, wherein the first evaluation means is realized by a plurality of first calculation units operable in parallel, each of which is programmed to calculate the probabilities of a part of the first portion of the particle shapes.

4. The hand tracking device according to claim 2 or claim 3, wherein the second evaluation means is realized by a plurality of second calculation units operable in parallel, each of which is programmed to calculate the probabilities of a part of the second portion of the particle shapes.