JP3980341B2

JP3980341B2 - Eye position tracking method, eye position tracking device and program therefor

Info

Publication number: JP3980341B2
Application number: JP2001369076A
Authority: JP
Inventors: 慎二郎川戸; 信二鉄谷
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2001-12-03
Filing date: 2001-12-03
Publication date: 2007-09-26
Anticipated expiration: 2021-12-03
Also published as: JP2003168106A

Description

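[0001]
[Technical field to which the invention belongs]
The present invention relates to man-machine interface technology, and more particularly to a method and an apparatus for tracking the position of a person's eyes so that the gaze direction can be detected without error when the person operates a computer or the like by gaze.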
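[0002]
[Prior art]
With the recent development of computer technology, computer-based devices are used in every corner of people's lives. Without the skills to operate a computer, a person may even be unable to lead a satisfactory social life.
[0003]
To operate a computer, however, a person's intention must be conveyed to it. How to convey that intention efficiently, without error, and simply has been studied extensively and put into practical use. Such technology is generally called a man-machine interface.
[0004]
If the only aim is to operate the computer, text-based commands can be given to it through, for example, a keyboard. That, however, requires the person to memorize a large number of commands together with their syntax and required parameters. The GUI (Graphical User Interface) was devised to address this and is now mainstream.
[0005]
With a GUI, the user typically uses a pointing device such as a mouse to point at an icon or the like displayed on the screen and conveys an intention to the computer by a predetermined operation such as clicking, double-clicking, or dragging. There is therefore no need to memorize many commands, and errors are less likely.
[0006]
A GUI, on the other hand, requires operating a pointing device. A person with impaired hand movement, for example, has difficulty operating even a GUI-based computer. In some applications the hands are not free, which also makes a pointing device such as a mouse hard to use.
[0007]
Various techniques have therefore been considered for conveying human intention to a computer by gaze. When a person deliberately controls the gaze, it can serve as a pointing device that conveys the person's intention to the computer. When a person moves the gaze unconsciously, the computer can detect the movement and infer the person's intention.
[0008]
The prerequisite for this is a device that images the human eyeball and estimates the gaze direction. An eye camera is generally used for this purpose. By estimating a person's gaze direction with an eye camera, the person's intention can be inferred and used to operate the computer.
[0009]
When the gaze direction is estimated with an eye camera, the eyeball must be imaged at high resolution to raise the estimation accuracy. This narrows the eye camera's imaging range, and its depth of field is also shallow. As a result, the user must hold the face almost fixed, moving it neither back and forth nor left and right. Used alone, an eye camera therefore has very limited applications. It is also desirable that the system not be restricted to specific individuals but be able to detect and track the eyes of an unfamiliar face.
[0010]
To solve these problems, the present inventors conceived that it is effective to first detect the position of the user's eyes with a wide-field camera and to use that information to control the pan, tilt, and focus of the eye camera. In other words, the accuracy of gaze detection by the eye camera is raised by a subsystem that finds the position of the eyes by some method, tracks it, and successively outputs its direction and distance to the eye camera's controller.
[0011]
A user operating a computer is normally seated on a chair about 50 to 100 cm from the eye camera. Detecting the eye position as described above and using that information to control the eye camera is therefore expected to make gaze detection by the eye camera very accurate.
[0012]
In the assumed situation, the range over which the face moves is quite limited, so the face can be imaged large enough to fill the height of the frame. The present inventors realized that the positions of both eyes can be detected by detecting eyelid movement in a face image captured at this size. People blink unconsciously, so by waiting for a natural blink the eye positions can be detected without the user being aware of it. In situations where an eye camera is used, the user's cooperation can also be obtained.
[0013]
A blink is not a continuing motion. Once the eye positions have been detected, they must therefore be tracked by some method. One approach is to prepare many face image templates and to estimate the eye positions by matching the captured face image against them.
[0014]
Simple template matching, however, cannot cope with changes in face orientation. Even if the template is updated, the tracked point inevitably drifts away from the actual eye. These weaknesses of template matching must be overcome and the robustness of tracking improved.
[0015]
As for the distance information given to the eye camera, once the direction of the eyes is known, sufficient accuracy can be obtained by measuring the distance with, for example, an ultrasonic sensor. In the embodiment below, however, the distance to the user's face is estimated from binocular stereo images. Various other methods could of course be used, but distance measurement is not directly related to the eye detection and tracking of the present invention, so it is not described in detail below.
[0016]
Several eye position tracking methods already exist. One is disclosed in Japanese Patent Laid-Open No. 2000-163564. It tracks the eye position by so-called template matching, using an image pattern centered on the eye as the template. Another is disclosed in the paper "Development of a Real-Time Gaze Detection and Motion Recognition System" (リアルタイム視線検出・動作認識システムの開発), IEICE Technical Report PRMU99-151 (1999-11), pp. 9-14. In that example, the face is tracked by stereo processing of face images obtained from two video cameras, using images and three-dimensional coordinates of facial feature regions that include both corners of the left and right eyes. The center position of the eyeball is then estimated from the face position thus obtained and an offset vector pointing from the midpoint of the two corners of each eye toward the eyeball.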
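[0017]
[Problems to be solved by the invention]
The method disclosed in Japanese Patent Laid-Open No. 2000-163564, however, uses ordinary template matching and is therefore sensitive to rotation and scale change of the pattern. The example published in the IEICE Technical Report estimates the eye position solely from stereo processing of two images, so the distance from the camera to the eyes is restricted, and the pattern to be tracked must be registered in advance.
[0018]
In short, it has not been possible to track the eye positions of an arbitrary user accurately without constraining the position of the user's face.
[0019]
An object of the present invention is therefore to provide an eye position tracking method, an apparatus, and a program therefor that can track the eye positions robustly from ordinary face images without unduly constraining the user.
[0020]
Another object of the present invention is to provide an eye position tracking method, an apparatus, and a program therefor that can detect the eye positions accurately and robustly from ordinary face images even when the face rotates or moves, without unduly constraining the user.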
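[0021]
[Means for solving the problems]
The invention according to a first aspect of the present application is a method for tracking the positions of a subject's eyes in a series of images output by imaging means, performed in a computer that has a storage device and to which the imaging means for capturing video images of the subject's face is connected. The method comprises a step of extracting the eye positions in a face image output by the imaging means. The step of extracting the eye positions includes a step of calculating a difference image between two images output by the imaging means, and a step of extracting, from the difference image, the remaining difference region obtained by excluding the differences caused by movement of the subject's whole face between the two images. The step of extracting the remaining difference region has a step of calculating the movement amount that maximizes the region in which the pixel values of images obtained by translating each of the two images coincide with the pixel values of the difference image, and a step of deleting from the difference image the differences attributable to the calculated movement amount. The step of extracting the eye positions further includes a step of selecting two eye candidate points from the extracted remaining difference region according to the difference between the pixel values of the two images and, when the two candidate points satisfy a predetermined geometric condition, extracting them as the eye positions. The method further comprises: a step of saving in the storage device the between-the-eyes position determined by the extracted eye positions, the relative positions of both eyes with respect to the between-the-eyes position, and the between-the-eyes image pattern determined by the between-the-eyes position; an image acquisition step of acquiring an image subsequent to the image from which the eye positions were extracted; a step of predicting the between-the-eyes position of the face image in the subsequent image; a step of searching, in the vicinity of the predicted between-the-eyes position, for the point that is the center of the region best matching the saved between-the-eyes image pattern; a step of predicting the positions of both eyes in the subsequent image from the position of the point found and the saved relative positions of both eyes, and searching for the center point of each of two regions that satisfy a predetermined condition around the predicted regions; a step of updating the between-the-eyes position, the relative positions of both eyes, and the between-the-eyes image pattern stored in the storage device, taking the midpoint of the two center points found as the new between-the-eyes position, the region determined by the new between-the-eyes position and the relative positions of the two center points as the new relative positions of both eyes, and the image pattern of the region determined by the new between-the-eyes position as the new between-the-eyes image pattern; and a step of executing the processing from the image acquisition step on an image further subsequent to the subsequent image.
[0022]
Preferably, the step of predicting the between-the-eyes position includes a step of extrapolating the between-the-eyes position in the subsequent image from the between-the-eyes position in the image from which the eye positions were extracted and the between-the-eyes position in the image preceding that image.
[0023]
More preferably, the method further includes a step of, in response to the point not being found in the step of searching for the center point of each of the two regions, extracting by a predetermined method the between-the-eyes position of the face image output by the imaging means, the relative positions of both eyes with respect to the between-the-eyes position, and the between-the-eyes image pattern determined by the between-the-eyes position, and restarting the processing from the image acquisition step.
[0024]
The step of searching for the center point of each of the two regions may include a step of searching, in the neighborhood centered on each of the positions determined by the relative positions of both eyes with respect to the position of the point found, for the center of a region of predetermined shape whose mean pixel value is the darkest, and taking those centers as candidate points for both eyes.
[0025]
More preferably, the step of searching for the center point of each of the two regions may further include a step of judging whether the candidate points found in the step of searching for the center of the darkest region satisfy all of the following conditions, and treating the search as failed if any condition is not satisfied:
1) the distance between the candidate points is no less than a predetermined minimum;
2) the distance between the candidate points is no greater than a predetermined maximum; and
3) the angle between the straight line joining the candidate points and the scan line direction satisfies a predetermined relationship.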
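[0026]
The invention according to a second aspect of the present application is an apparatus for tracking the positions of a subject's eyes in a series of images output by imaging means, in a computer that has a storage device and to which the imaging means for capturing video images of the subject's face is connected. The apparatus comprises means for extracting the eye positions in a face image output by the imaging means. The means for extracting the eye positions includes means for calculating a difference image between two images output by the imaging means, and means for extracting, from the difference image, the remaining difference region obtained by excluding the differences caused by movement of the subject's whole face between the two images. The means for extracting the remaining difference region has means for calculating the movement amount that maximizes the region in which the pixel values of images obtained by translating each of the two images coincide with the pixel values of the difference image, and means for deleting from the difference image the differences attributable to the calculated movement amount. The means for extracting the eye positions further includes means for selecting two eye candidate points from the extracted remaining difference region according to the difference between the pixel values of the two images and, when the two candidate points satisfy a predetermined geometric condition, extracting them as the eye positions. The apparatus further comprises: means for saving in the storage device the between-the-eyes position determined by the extracted eye positions, the relative positions of both eyes with respect to the between-the-eyes position, and the between-the-eyes image pattern determined by the between-the-eyes position; image acquisition means for acquiring an image subsequent to the image from which the eye positions were extracted; prediction means for predicting the between-the-eyes position of the face image in the subsequent image; first search means for searching, in the vicinity of the predicted between-the-eyes position, for the point that is the center of the region best matching the saved between-the-eyes image pattern; second search means for predicting the positions of both eyes in the subsequent image from the position of the point found and the saved relative positions of both eyes, and searching for the center point of each of two regions that satisfy a predetermined condition around the predicted regions; update means for updating the between-the-eyes position, the relative positions of both eyes, and the between-the-eyes image pattern stored in the storage device, taking the midpoint of the two center points found as the new between-the-eyes position, the region determined by the new between-the-eyes position and the relative positions of the two center points as the new relative positions of both eyes, and the image pattern of the region determined by the new between-the-eyes position as the new between-the-eyes image pattern; and means for controlling the image acquisition means, the prediction means, the first and second search means, and the update means so that the processing from the image acquisition means onward is repeated on an image further subsequent to the subsequent image.
[0027]
Preferably, the prediction means includes means for extrapolating the between-the-eyes position in the subsequent image from the between-the-eyes position in the image from which the eye positions were extracted and the between-the-eyes position in the image preceding that image.
[0028]
More preferably, the apparatus further includes means for, in response to the second search means failing to find the point, extracting by a predetermined device the between-the-eyes position of the face image output by the imaging means, the relative positions of both eyes with respect to the between-the-eyes position, and the between-the-eyes image pattern determined by the between-the-eyes position, and controlling the image acquisition means, the prediction means, the first and second search means, and the update means so that the processing restarts from the image acquisition means.
[0029]
The second search means may include means for searching, in the neighborhood centered on each of the positions determined by the relative positions of both eyes with respect to the position of the point found, for the center of a region of predetermined shape whose mean pixel value is the darkest, and taking those centers as candidate points for both eyes.
[0030]
More preferably, the second search means may further include means for judging whether the candidate points found by the means for searching for the center of the darkest region satisfy all of the following conditions, and treating the search as failed if any condition is not satisfied:
1) the distance between the candidate points is no less than a predetermined minimum;
2) the distance between the candidate points is no greater than a predetermined maximum; and
3) the angle between the straight line joining the candidate points and the scan line direction satisfies a predetermined relationship.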
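[0031]
[Embodiments of the invention]
[Hardware configuration]
FIG. 1 shows the external appearance of a computer system 20 that implements the eye position tracking method and apparatus according to the present invention. The computer system 20 detects and tracks the eye positions, and a computer not shown in FIG. 1 controls the eye camera using the information on the distance to the user's face and the direction of the eyes supplied by the computer system 20. In this specification, "subject" means the user whose eye positions the system detects and tracks, and may include not only humans but anything that has the equivalent of "eyes", such as animals.
[0032]
Referring to FIG. 1, the computer system 20 includes: a computer 40 having a CD-ROM drive 50 for CD-ROMs (Compact Disc Read Only Memory) and an FD drive 52 for flexible disks (FD); a mouse 48, which is a first pointing device connected to the computer 40; a keyboard 46; a monitor 42; stereo video cameras 56 and 58 placed between the monitor 42 and the keyboard 46; and an eye camera 54 controlled by the separate computer mentioned above.
[0033]
FIG. 2 shows the internal configuration of the computer 40 together with the other components of the computer system 20. Referring to FIG. 2, the computer 40 includes a CPU (Central Processing Unit) 70, a memory 72 connected to the CPU 70, and a hard disk 74 connected to the CPU 70. The CPU 70 is connected to a network 80 through a network board (not shown), and the aforementioned computer 82, which controls the eye camera 54, is connected to the network 80. The configuration of the computer 82 is the same as that of the computer 40, so its details are not described here.
[0034]
The eye position tracking method and apparatus according to the present invention, described in detail below, are realized by the computer hardware described with reference to FIGS. 1 and 2 and by software executed by the CPU 70. The software is distributed on a storage medium such as a CD-ROM or over a network. When the software is installed on the computer 40, it is typically stored on the hard disk 74. When the user starts the software, or when it is started by the operating system (OS) of the computer 40, it is read from the hard disk 74 into the memory 72 and executed by the CPU 70.
[0035]
The software may use functions originally built into the OS of the computer 40, or some functions of other software installed separately from the OS. The software distributed to realize the present invention may therefore lack some of those functions, provided it is structured on the assumption that they are supplied by other software at run time. It may of course also be distributed as system software in which all the necessary functions are built in.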
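[0036]
[Principle]
In the present invention, the eye positions are detected on the basis of frame differences, focusing in particular on the brightness change caused by a blink. When a blink is to be detected from the inter-frame difference while the face is moving, pixels with large brightness changes appear in many places other than the blinking eyelids. The brightness change caused by eyelid movement must therefore be distinguished from that caused by movement of the whole face. Changes in mouth shape or eyebrow movement may also cause problems, but those brightness changes are rejected by condition tests after the eye candidate points have been extracted.
[0037]
Movement of the whole face can be divided into translation within the image plane, rotation within the image plane (tilting the head sideways), and rotation out of the image plane (turning the head, nodding). The frame rate is assumed to be fast enough that the movement between frames is small. The methods for cancelling each of these movements are described in turn below.
[0038]
Translation of the face within the image plane is cancelled as follows. Consider a colored disk 90 that appears in image f_{t-1}, shown in FIG. 3(A), and moves to the right by a distance d so that the next frame is image f_t, shown in FIG. 3(B). The disk carries a small white circular mark. The background need not be uniform, but it is not drawn.
[0039]
Taking the difference between the images in FIG. 3(A) and FIG. 3(B) gives the image shown in FIG. 4. In FIG. 4, the hatched regions are where the pixel value difference is large, and the white regions are where it is small. If the pixel positions with a large difference in the difference image D of FIG. 4 are traced back to the original images in FIG. 3(A) or (B), they can be further classified into the regions F and B shown at the left and right of the upper row of FIG. 5. The F of region F stands for Foreground, that is, points on the disk, and the B of region B stands for Background. Region B is the part that is visible in one image and not in the other. Region F is the part that is visible in both images and has mutually corresponding pixels. In real images, however, it is not known which part is the moving object, so regions F and B cannot be told apart.
[0040]
Now shift the pixels of image f_t that correspond to regions F and B (upper right of FIG. 5) by −d and overlay them on image f_{t-1}. As shown at the lower left of FIG. 5, there is a region M in which the pixel values coincide and a region U in which it cannot be said whether they coincide or not. Likewise, if the pixels of image f_{t-1} corresponding to regions F and B are shifted by +d and overlaid on image f_t, regions M and U become as shown at the lower right of FIG. 5. The movement amount ±d here is a two-dimensional vector.
[0041]
If the shift is some amount ±d' different from the actual shift ±d, the number of pixels belonging to region M is expected to be smaller than with the actual shift ±d. The shift that maximizes the number of pixels in region M can therefore be taken as the movement d of the disk 90 between frames. When counting the pixels of region M, the two shifts must always be equal in magnitude and opposite in direction: if the pixels of image f_{t-1} are shifted by d, those of image f_t are shifted by −d.
[0042]
In this way, among the pixels extracted by the inter-frame difference, the shift ±d that maximizes the number of pixels in region M is found, the images are shifted and overlaid, and the pixels whose values coincide in neither overlay are extracted. Those pixels belong to parts that moved in some way other than translation by d. If d is the movement of the whole face, the parts that moved otherwise come from blinks, changes in mouth shape, and the like. Cancelling the motion of the whole face from the inter-frame difference image therefore makes it possible to extract the parts that move with a blink, that is, the eye positions.
[0043]
Next, consider a turning motion. Referring to FIG. 6, suppose a face looking toward the cameras 56 and 58 starts to turn to its left. At the right edge of the face (the left edge in FIG. 6) a previously hidden part appears, and at the left edge of the face (the right edge in FIG. 6) a previously visible part is hidden by self-occlusion. Such parts are hidden in one of the two images used to compute the inter-frame difference, so, unlike the case of translation, there is no point in shifting them and looking for matching pixel values.
[0044]
Such parts, however, appear only at the periphery of the moving region. Their influence can therefore be removed by computing the inter-frame difference image (FIG. 4) and then trimming a fixed number of pixels from the left and right of the hatched regions. The central part of the face, on the other hand, can be regarded as translating even during a turning motion, so the method described above for translation (shifting the two images by ±d and looking for coinciding pixels) can be applied to it.
[0045]
Tilting the head sideways and nodding can be regarded as translation because the center of rotation lies off the face. In all cases, it is assumed that the movement between frames is small.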
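[0046]
[Software control structure]
The structure of the software that realizes eye detection and tracking based on the above principle is described below. As noted earlier, the software described below may assume at run time that it can use functions of the OS or functions already installed on the computer, and the distributed software may consequently not contain those functions itself. Even so, as long as the software carries the information on where in its control structure those functions are used, it falls within the technical scope of the present invention.
[0047]
In addition to the detection of eye candidate points already described, the processing below assumes that there are two eyes and tests whether an extracted pair of candidate points is a pair of eyes by using conditions such as the following: a blink occurs in both eyes at the same time; the distance between the eyes lies within a certain range; the tilt angle of the face lies within a certain range; and the gray-level pattern near the midpoint of the eyes is roughly left-right symmetric.
[0048]
Various thresholds are used in the processing below. Their values are determined empirically according to the application and the required accuracy.
[0049]
Referring to FIG. 7, the overall structure of the software is as follows. First, the eye positions are extracted by the procedure described in detail later with reference to FIG. 8 onward (step 100). In step 102, the extracted eye positions are sent to the computer 82, which controls the eye camera 54. In step 104, the eye positions are tracked. It is then judged whether the tracking succeeded, that is, whether the eye positions were lost (step 106). If tracking failed, control returns to step 100 and the processing is executed again from the eye position extraction. If tracking succeeded, control returns to step 102, and information such as the eye direction and the distance to the eyes obtained by tracking is sent to the computer 82. By repeating steps 100 to 106, information on the direction of and distance to the user's eyes is sent to the computer 82 continuously.
[0050]
The eye positions may not be extracted in step 100 of FIG. 7. In that case, step 100 is executed again on the next frame, and tracking starts once the eye positions have been extracted.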
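[0051]
The eye position extraction performed in step 100 of FIG. 7 is described with reference to FIG. 8. The processing below uses the principle described with reference to FIGS. 4 to 6. It assumes that a previous image has already been captured. At the very start of processing, the previous image may be set equal to the current image. First, in step 120, the current image is captured from the cameras 56 and 58. Of the two cameras, the processing below is actually performed on the image from camera 56. After the eyes have been detected, the distance to the eyes is calculated by stereo processing together with the image from camera 58. The roles of camera 56 and camera 58 may be swapped.
[0052]
Next, in step 122, the binarized inter-frame difference image D between the previous image and the current image is calculated. Specifically, for each pixel, the absolute value of the difference between the previous and current images is computed. If it is no less than a predetermined threshold N, the pixel value is set to "1"; if it is less than N, the pixel value is set to "0". A video signal fluctuates to some extent even when a stationary object is being imaged, so a threshold N of this kind is necessary. The threshold N can be regarded as the noise level.
[0053]
Next, in step 124, it is judged whether the number of pixels with value "1" in the difference image D is no less than a predetermined threshold G. If the result is "YES", control jumps to step 168 in FIG. 9, the eyes are deemed not detected, and the processing ends. In this case motion was detected at too many pixels, so the movement is taken to be larger than assumed and the system waits for the next frame.
[0054]
If the result in step 124 is "NO", step 126 judges whether the number of pixels with value "1" in the difference image D is no greater than a threshold E. The threshold E is smaller than the threshold G and is used to judge whether the face is substantially stationary. If the result in step 126 is "YES", control jumps to step 152 in FIG. 9. In this case few pixels were extracted by the inter-frame difference, and the face is considered to be at rest.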
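Steps 122 to 126 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation from the embodiment: it assumes each frame is a list of rows of grayscale integers, and the names `binary_difference` and `classify_motion` and the returned labels are hypothetical.

```python
def binary_difference(prev, cur, noise_level):
    """Step 122: 1 where |cur - prev| >= N (noise level), otherwise 0."""
    return [[1 if abs(c - p) >= noise_level else 0
             for p, c in zip(row_prev, row_cur)]
            for row_prev, row_cur in zip(prev, cur)]


def classify_motion(diff, g_threshold, e_threshold):
    """Steps 124 and 126: decide how to continue from the count of 1-pixels.

    Assumes e_threshold < g_threshold, as stated for thresholds E and G.
    """
    count = sum(sum(row) for row in diff)
    if count >= g_threshold:
        return "too_much_motion"   # step 124 YES: give up on this frame
    if count <= e_threshold:
        return "still"             # step 126 YES: skip motion cancellation
    return "moving"                # continue with step 128
```

The three labels correspond to the three exits described above: waiting for the next frame, jumping straight to isolated-point removal, or continuing with the motion cancellation.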
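[0055]
If the result in step 126 is "NO", control proceeds to step 128. In step 128, the pixels at both ends of each group of pixels with value "1" in the difference image D are deleted. This takes the turning motion into account, as described with reference to FIG. 6.
[0056]
Specifically, each scan line of the difference image D is searched rightward from the left end and leftward from the right end for the first pixel with value "1". Starting from each of those pixels, the values of k pixels are rewritten to "0". This trims a predetermined amount from the left and right of the regions of "1" pixels in the image.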
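The edge trimming of step 128 might look like the following sketch. It assumes the difference image is a list of rows of 0/1 integers and that both end positions are located on the untrimmed row, a detail the text leaves open; `trim_row_ends` is a hypothetical name.

```python
def trim_row_ends(diff, k):
    """Zero k pixels inward from the first 1 found from each end of every row."""
    out = [row[:] for row in diff]
    for src, dst in zip(diff, out):
        if 1 not in src:
            continue  # nothing to trim on this scan line
        left = src.index(1)                        # first 1 from the left
        right = len(src) - 1 - src[::-1].index(1)  # first 1 from the right
        for x in range(left, min(left + k, len(src))):
            dst[x] = 0
        for x in range(max(right - k + 1, 0), right + 1):
            dst[x] = 0
    return out
```

With this reading, a run of "1" pixels shorter than 2k on a scan line disappears entirely, which is consistent with treating it as an occlusion edge rather than as interior motion.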
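[0057]
Next, in step 130, the pixels corresponding to value "1" in the difference image D are extracted from the previous image and the current image. This yields the images shown in the middle row of FIG. 5. Then, in step 132, the movement amount ±d described above is calculated. The details of this calculation are described later with reference to FIG. 10.
[0058]
Once the movement amount ±d has been calculated, step 150 (FIG. 9) deletes from the difference image D the pixels corresponding to region M shown in the lower row of FIG. 5. In step 152, isolated-point removal and smoothing are applied to the difference image D. Specifically, if four or more pixels in total among the eight neighbors of the pixel being processed and the pixel itself have value "1", the pixel is set to "1". Otherwise it is set to "0". Isolated points thereby become "0" and are removed from the difference image D.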
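The isolated-point removal and smoothing of step 152 amounts to a 3×3 majority-style filter. A minimal sketch follows, assuming pixels outside the image count as 0 (the text does not say how borders are handled); `smooth_binary` is a hypothetical name.

```python
def smooth_binary(diff):
    """Set a pixel to 1 when at least 4 of the 9 pixels in its 3x3 window are 1."""
    h, w = len(diff), len(diff[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = sum(diff[yy][xx]
                        for yy in range(max(0, y - 1), min(h, y + 2))
                        for xx in range(max(0, x - 1), min(w, x + 2)))
            out[y][x] = 1 if total >= 4 else 0
    return out
```

A lone pixel sees a window total of 1 and is cleared, while every pixel of a 2×2 block sees a total of 4 and survives.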
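[0059]
Next, in step 154, labeling is applied to the difference image D. A label is thereby assigned to each region made up of mutually connected pixels with value "1". The center of each component obtained in this way is taken as a candidate point for an eye position (step 156).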
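Labeling and center extraction (steps 154 and 156) can be illustrated with a breadth-first flood fill. The sketch assumes 8-connectivity and uses the centroid as the "center" of a component; neither choice is specified in the text, and `component_centers` is a hypothetical name.

```python
from collections import deque


def component_centers(diff):
    """Return the centroid (x, y) of every 8-connected component of 1-pixels."""
    h, w = len(diff), len(diff[0])
    seen = [[False] * w for _ in range(h)]
    centers = []
    for y0 in range(h):
        for x0 in range(w):
            if diff[y0][x0] != 1 or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(x0, y0)]), []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if diff[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            centers.append((sum(p[0] for p in pixels) / len(pixels),
                            sum(p[1] for p in pixels) / len(pixels)))
    return centers
```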
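[0060]
It is judged whether the number of candidate points obtained is less than 2 (step 158). Since two eye positions are normally expected, if there are fewer than two candidate points the eyes are deemed not detected and processing proceeds to step 168. If there are two or more candidates, control proceeds to step 160.
[0061]
In step 160, for each candidate point, the sum S of the absolute differences between the previous and current pixel values is computed over a rectangular a×b region centered on the candidate point. The a×b rectangle has a size corresponding to the expected size of an eye. If a blink occurred between the previous and current images, one image shows the eye and the other shows the eyelid, so the absolute differences are large and the extent of the change should be close to the expected size of an eye. S is therefore computed for each candidate point, and the two points with the largest values are taken as the eye candidate points (the candidate pair).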
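The selection in step 160 can be sketched as below. The sketch assumes a is the rectangle width and b its height, both odd, and that candidates are integer (x, y) positions; parts of the rectangle that fall outside the image are ignored. `blink_score` and `top_two_candidates` are hypothetical names.

```python
def blink_score(prev, cur, center, a, b):
    """Sum S of |cur - prev| over an a-wide, b-high rectangle around center."""
    cx, cy = center
    h, w = len(prev), len(prev[0])
    total = 0
    for y in range(cy - b // 2, cy + b // 2 + 1):
        for x in range(cx - a // 2, cx + a // 2 + 1):
            if 0 <= x < w and 0 <= y < h:
                total += abs(cur[y][x] - prev[y][x])
    return total


def top_two_candidates(prev, cur, candidates, a, b):
    """Keep the two candidate points with the largest S."""
    ranked = sorted(candidates,
                    key=lambda c: blink_score(prev, cur, c, a, b),
                    reverse=True)
    return ranked[:2]
```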
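[0062]
Next, in step 162, it is judged whether the two candidate points obtained in this way satisfy the conditions below. Unless all the conditions are satisfied, the eyes are deemed not detected (step 168) and the processing ends.
[0063]
1) The sum S of absolute differences obtained in step 160 is no less than a predetermined threshold F at both candidate points.
[0064]
2) The distance between the candidate points is no less than a predetermined value Lmin.
3) The distance between the candidate points is no greater than a predetermined value Lmax.
[0065]
4) The angle between the straight line joining the candidate points and the scan line direction is no greater than a predetermined threshold of A degrees.
[0066]
The threshold F is determined in advance experimentally or statistically and is used to prevent parts other than the eyes from being extracted as eyes. Lmin and Lmax are the minimum and maximum values expected for the distance between the centers of the eyes. These values are determined statistically in advance. The threshold angle A is also set to a suitable value experimentally, and a value of about 30 to 45 degrees is normally used.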
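The four conditions of step 162 can be written as a single predicate. This is a sketch under the assumption that candidate points are (x, y) pixel coordinates with scan lines running along x; the function and parameter names are hypothetical, and the threshold values in the usage note are placeholders rather than values from the text.

```python
import math


def plausible_eye_pair(p, q, s_p, s_q, f_threshold, l_min, l_max, max_angle_deg):
    """Check conditions 1) to 4) for candidate points p and q with scores s_p, s_q."""
    if min(s_p, s_q) < f_threshold:           # 1) both S values must reach F
        return False
    dx, dy = q[0] - p[0], q[1] - p[1]
    distance = math.hypot(dx, dy)
    if not (l_min <= distance <= l_max):      # 2) and 3) distance within [Lmin, Lmax]
        return False
    # 4) angle between the joining line and the scan line direction
    angle = math.degrees(math.atan2(abs(dy), abs(dx)))
    return angle <= max_angle_deg
```

For example, `plausible_eye_pair((10, 20), (50, 25), 900, 800, 500, 30, 60, 30)` accepts a nearly horizontal pair about 40 pixels apart, while a vertically stacked pair is rejected by condition 4.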
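[0067]
If all four conditions above are satisfied, step 164 further judges whether the gray-level pattern near the midpoint of the two eye candidate points is roughly left-right symmetric. If it is judged to be roughly symmetric, the two candidate points are taken as the eye positions. If not, the eyes are deemed not found and the processing ends.
[0068]
The method of judging whether the pattern near the midpoint of the two eye candidate points is roughly left-right symmetric is as follows. Referring to FIG. 11, in a region 220 centered on the midpoint of the two eyes (hereinafter called the "between-the-eyes" point), the gray-level pattern on either side of that point should generally be roughly symmetric about the straight line that is perpendicular to the segment joining the two eye candidate points and passes through the between-the-eyes point (the midpoint of the two candidate points). The current image is therefore first rotated about the midpoint of the two candidate points so that the candidates lie along a scan line, and a region of g×h pixels centered on the midpoint (region 220) is cut out. As shown in the lower part of FIG. 11, the sum of the pixel values of the h pixels in each column of region 220 is computed, and each column sum is expressed as a percentage of the sum over all pixels. This gives a projection profile like graph 222 in FIG. 11. In this profile, sums over three adjacent columns are computed for groups that are mirror images of each other about the midpoint of the two candidates, on both the left and the right. The total of the left-right differences is then computed over all groups of three adjacent columns, and if it is no greater than a threshold p, the pattern near the midpoint of the two eye candidate points is judged to be roughly left-right symmetric.
[0069]
Specifically, referring to the lower part of FIG. 11, the sum of the values of the leftmost three columns 224L and the sum of the values of the rightmost three columns 224R are computed, and their difference p_1 is obtained. Likewise, the sum of the values of columns 2 to 4 from the left, 226L, and the sum of the values of columns 2 to 4 from the right, 226R, are computed, and their difference p_2 is obtained. The sum of the values of columns 3 to 5 from the left, 228L, and the sum of the values of columns 3 to 5 from the right, 228R, are computed, and their difference p_3 is obtained. The values p_4 to p_{g-2} are computed in the same way, the values p_1 to p_{g-2} are added up, and the total is compared with the threshold p to judge whether the gray-level pattern near the midpoint of the two eye candidate points is roughly left-right symmetric.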
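The projection-profile symmetry test can be sketched as follows. It assumes the g×h region has already been rotated and cropped, is given as h rows of g grayscale values, and that each p_i is the absolute difference between the mirrored three-column sums; `is_left_right_symmetric` is a hypothetical name.

```python
def is_left_right_symmetric(region, p_threshold):
    """Compare mirrored three-column sums of the percentage column profile."""
    g = len(region[0])
    total = sum(sum(row) for row in region)
    if total == 0:
        return True  # a uniformly black patch is trivially symmetric
    # Column sums expressed as a percentage of the sum over all pixels.
    profile = [100.0 * sum(row[x] for row in region) / total for x in range(g)]
    diff_sum = 0.0
    for i in range(g - 2):                        # p_1 ... p_{g-2}
        left = sum(profile[i:i + 3])              # columns i..i+2 from the left
        right = sum(profile[g - 3 - i:g - i])     # mirrored columns from the right
        diff_sum += abs(left - right)
    return diff_sum <= p_threshold
```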
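[0070]
In the description above, the two eye candidate points are first rotated about their midpoint so that they lie along a horizontal scan line, and the symmetry of the gray-level pattern is judged afterward. The symmetry test is not limited to this, however. For example, if the angle that the segment joining the two candidate points makes with the horizontal scan line is known, the same processing as in FIG. 11 can be realized by summing pixel values in the direction perpendicular to that segment. Also, although sums are taken per column in the example above, the differences between pixels that are mirror images about the midpoint may instead be computed and added up, and the pattern judged symmetric when the total is smaller than a threshold.
[0071]
Returning to step 132 of FIG. 8, the calculation of the movement amount d is now described in detail with reference to FIG. 10. Let the movement amount d (a two-dimensional vector) be (x, y), and search over the range x = −m, −m+1, …, m and y = −n, −n+1, …, n for the d that gives the largest total number of pixels belonging to region M as shown in the lower row of FIG. 5. First, the variable max is initialized to 0 (step 180). This variable stores the largest total number of pixels belonging to region M found so far among the difference images computed for the individual movement amounts.
[0072]
The variable x is set to −m (step 182), and the variable y is set to −n (step 184). What follows is repeated for each value of y.
[0073]
First, the total number M(d) of pixels belonging to region M for d = (x, y) is computed (step 186). It is then judged whether M(d) is greater than the variable max (step 188). If M(d) is no greater than max, nothing is done and control proceeds to step 192. If M(d) is greater than max, step 190 assigns the value of M(d) to max and the current values of x and y to X and Y, respectively. Control then proceeds to step 192.
[0074]
In step 192, 1 is added to y, and it is judged whether y now exceeds n (step 194). If y does not exceed n, control returns to step 186 and steps 186 to 194 are repeated for the new y. If y exceeds n, control proceeds to step 196.
[0075]
In step 196, 1 is added to x, and it is judged whether x now exceeds m (step 198). If x does not exceed m, control returns to step 184 and steps 184 to 198 are repeated for the new x. If x exceeds m, control proceeds to step 200, which outputs the value of max together with X and Y, the values of x and y at which that value of max was obtained, and the processing ends.
[0076]
In this way, the movement amount d that maximizes the number of pixels in region M is obtained. Various other ways of calculating d are possible. For example, the number of pixels that would fall in region M may be computed exhaustively in advance while x and y are varied over the m×n range and stored in a table, and the cell holding the maximum value in that table may then be found.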
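The search of FIG. 10 and the deletion of region M in step 150 can be sketched together. The sketch makes several assumptions that the text does not fix: images are lists of rows of grayscale integers, two pixel values "coincide" when they differ by at most `tol`, shifted positions that leave the image are ignored, and a difference pixel belongs to region M if it matches in either of the two overlays. All function names are hypothetical.

```python
def matched_pixels(f_prev, f_cur, diff, d, tol=2):
    """Region M for shift d: difference pixels that match after shifting by -d or +d."""
    h, w = len(diff), len(diff[0])
    dx, dy = d
    region_m = set()
    for y in range(h):
        for x in range(w):
            if not diff[y][x]:
                continue
            xa, ya = x - dx, y - dy   # pixel of f_cur shifted by -d onto f_prev
            xb, yb = x + dx, y + dy   # pixel of f_prev shifted by +d onto f_cur
            if (0 <= xa < w and 0 <= ya < h
                    and abs(f_cur[y][x] - f_prev[ya][xa]) <= tol):
                region_m.add((x, y))
            elif (0 <= xb < w and 0 <= yb < h
                    and abs(f_prev[y][x] - f_cur[yb][xb]) <= tol):
                region_m.add((x, y))
    return region_m


def find_shift(f_prev, f_cur, diff, m, n, tol=2):
    """Steps 180 to 200: exhaustive search for the d = (x, y) maximizing M(d)."""
    best, best_d = 0, (0, 0)
    for x in range(-m, m + 1):
        for y in range(-n, n + 1):
            count = len(matched_pixels(f_prev, f_cur, diff, (x, y), tol))
            if count > best:          # strict comparison, as in step 188
                best, best_d = count, (x, y)
    return best_d, best


def remove_region_m(diff, region_m):
    """Step 150: clear the difference pixels explained by the translation."""
    return [[0 if (x, y) in region_m else v for x, v in enumerate(row)]
            for y, row in enumerate(diff)]
```

Whatever remains set after `remove_region_m` is the part of the difference image that the whole-face translation does not explain, which is where blink candidates are then sought.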
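[0077]
To track the eyes detected as described above reliably and robustly from the next frame onward, the following tracking method is implemented in software. In this embodiment the eyes are not tracked directly. Instead, the pattern between the eyes (the between-the-eyes pattern) is tracked, and each eye is then searched for near the point whose position relative to the tracked between-the-eyes point is the same as the eye position in the previous frame.
[0078]
The between-the-eyes pattern is relatively stable even when the facial expression changes. In addition, the eyes and eyebrows cut in like wedges from both sides into the bright forehead and nose bridge, forming a gray-level pattern that makes it easy to achieve good positional accuracy by pattern matching.
[0079]
As described below, the template must be updated when pattern matching is performed with a between-the-eyes template. In this embodiment, a rectangular pattern whose center is the midpoint of the two eyes is used as the template, and the template is updated with the rectangular pattern detected in the current frame.
[0080]
The tracking process is described in detail below with reference to FIG. 12. It is assumed that the midpoint of the two eyes extracted in the previous frame, the between-the-eyes pattern (s×t pixels), and the relative positions e_r and e_l of the right and left eyes as seen from the midpoint (e_r = −e_l, both two-dimensional quantities) have already been saved.
[0081]
First, in step 240, the saved midpoint position, between-the-eyes pattern, and relative positions of the eyes are loaded as initial values. The current image (current frame) is then captured (step 242).
[0082]
In the next step 244, the midpoint of the two eyes in the current image is predicted from the midpoints in the previous image and in the image before that. Let X_{t-1} and X_{t-2} be the between-the-eyes positions in the previous frame and the frame before it (both two-dimensional quantities). The predicted position X_t in the current frame can then be extrapolated as X_t = 2X_{t-1} − X_{t-2}. At the first detection, X_{-1} = X_0 is used. The midpoint in the current image is predicted in this way because, when the face image is moving, the amount of movement can be regarded as roughly constant, so matching the between-the-eyes pattern after predicting the moved midpoint is more efficient.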
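The extrapolation in step 244 is a constant-velocity prediction, X_t = X_{t-1} + (X_{t-1} − X_{t-2}). A minimal sketch, assuming positions are (x, y) tuples and using the hypothetical name `predict_midpoint`:

```python
def predict_midpoint(x_prev, x_prev2=None):
    """Linear extrapolation X_t = 2 * X_{t-1} - X_{t-2}.

    When no earlier position exists (first detection), X_{t-2} is taken
    to be equal to X_{t-1}, so the prediction is the previous position itself.
    """
    if x_prev2 is None:
        x_prev2 = x_prev
    return (2 * x_prev[0] - x_prev2[0], 2 * x_prev[1] - x_prev2[1])
```

For instance, a midpoint that moved from (8, 9) to (10, 10) is predicted at (12, 11) in the next frame.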
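[0083]
In step 246, a matching process searches the vicinity of the midpoint predicted in step 244 for the pattern that best matches the saved between-the-eyes pattern. Let X_{t0} be the position at which the best match is obtained. In step 248, the right and left eye positions are then searched for in the neighborhoods (i×j pixels) centered on X_{t0} + e_r and X_{t0} + e_l, respectively. In this search, the point whose surrounding 5×5 pixels have the darkest mean pixel value is judged to be the eye.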
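The eye search of step 248 can be sketched as a scan for the darkest 5×5 mean. The sketch assumes the image is a list of rows of grayscale integers, that i is the neighborhood width and j its height, and that positions whose 5×5 window would leave the image are skipped; `darkest_point` is a hypothetical name. The template matching of step 246 is not shown.

```python
def darkest_point(image, center, i, j, window=5):
    """Return the (x, y) in the i-by-j neighborhood of center with the darkest mean."""
    h, w = len(image), len(image[0])
    cx, cy = center
    r = window // 2
    best = None
    for y in range(cy - j // 2, cy + j // 2 + 1):
        for x in range(cx - i // 2, cx + i // 2 + 1):
            if not (r <= x < w - r and r <= y < h - r):
                continue  # window would fall outside the image
            mean = sum(image[yy][xx]
                       for yy in range(y - r, y + r + 1)
                       for xx in range(x - r, x + r + 1)) / window ** 2
            if best is None or mean < best[0]:
                best = (mean, (x, y))
    return None if best is None else best[1]
```

The function would be called once around X_{t0} + e_r and once around X_{t0} + e_l, and the two results then passed to the geometric checks.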
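[0084]
Next, it is judged whether the eye positions found satisfy conditions 2), 3), and 4) used in step 162 of FIG. 9. If any one of the conditions is not satisfied, the eyes are deemed to have been mistracked or lost, tracking is judged to have failed (step 252), and the processing ends. In this case control returns from step 106 of FIG. 7 to step 100, and the processing is executed again from the detection of the eye positions.
[0085]
If all the conditions are judged to be satisfied, tracking is judged to have succeeded. The midpoint of the two eyes obtained by tracking, the between-the-eyes pattern (s×t pixels) centered on that point, and the relative positions e_r and e_l of the eyes with respect to the midpoint are saved (updated), and the tracking process ends.
[0086]
Once the positions of both eyes have been detected in the image from camera 56 in this way, the distance to the eyes is obtained by stereo processing together with the image from camera 58, and the parameters for controlling the eye camera 54 (distance and direction) can be determined by calculation. Well-known techniques can be applied to that processing, so it is not described in detail here.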
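[0087]
[Operation]
The eye position detection apparatus according to the present invention, whose structure has been described above, operates as follows. The stereo cameras 56 and 58 shown in FIG. 1 capture images of the user's face, and each outputs a video signal to the computer 40.
[0088]
When the software for detecting and tracking the eye positions is started, the eye positions are extracted from each of the video images obtained from the cameras 56 and 58, as shown in FIG. 8. Specifically, the current image is captured (step 120), the difference image D is calculated (step 122), and the judgments of steps 124 and 126 are made. If the result in step 124 or 126 is "YES", the eye positions are deemed not to have been detected, and the eye position extraction is executed again on the next frame image.
[0089]
If the results in steps 124 and 126 are both "NO", the pixels at both ends of the regions of "1" pixels are deleted in step 128. After the movement amount d has been calculated by the method already described (step 132), the pixels corresponding to region M, obtained by shifting and overlaying as shown in FIG. 5 with that movement amount d, are removed from the difference image D (step 150).
[0090]
Next, after isolated-point removal and smoothing (step 152) and labeling (step 154), the center of each resulting component is taken as an eye position candidate (step 156). If there are fewer than two candidates, the extraction is judged to have failed and processing restarts from the beginning. If there are two or more candidates, the two points with the largest sum of absolute inter-image differences in their neighborhood become the candidate points (step 160). If these candidate points satisfy all four conditions described above (step 162) and the pattern near their midpoint is judged to be left-right symmetric (step 164), the two candidate points are taken as the eye positions and the eye position extraction ends. If the test in step 162 or 164 fails, the eye positions are deemed not to have been extracted and processing restarts from the beginning.
[0091]
Once the direction of the eyes has been determined for both stereo cameras 56 and 58, that information is used to obtain the information needed to control the eye camera 54 (the direction of the eyes as seen from the eye camera 54 and the distance to the eyes). This information is sent to the computer 82 (step 102 of FIG. 7), and control moves to the eye position tracking process (step 104).
[0092]
At the start of tracking, the midpoint of the two eyes obtained by the eye position extraction (that is, the between-the-eyes position), the between-the-eyes image pattern, and the relative positions of the eyes as seen from the midpoint have been saved. First, this saved information is loaded as initial values (step 240). The current image is then captured (step 242), and the midpoint of the eyes in the current image is predicted from the midpoint in the previous image (the between-the-eyes position) and the midpoint in the image before that (step 244). This prediction is made, for example, by extrapolating from the midpoints in those two images. Near the predicted midpoint, the position that is the center of the pattern best matching the saved between-the-eyes pattern of the previous image is searched for, and the positions of the two eyes are searched for around the position showing the best match (step 248). If the eye positions obtained satisfy the prescribed conditions (YES in step 250), the midpoint of the two eyes is calculated from them, and that value, the between-the-eyes pattern centered on the midpoint, and the relative positions from the midpoint to the eyes are saved, which completes one round of tracking. In this case control follows the path from step 106 to step 102 in FIG. 7, and the next round of tracking is performed.
[0093]
If tracking fails, control follows the path from step 106 to step 100 in FIG. 7, and processing is executed again from the eye position extraction.
[0094]
In this way, the eye positions (directions) are extracted and tracked using the stereo cameras 56 and 58, and those values are used to calculate the parameters for controlling the eye camera 54 (eye direction and distance), so that the eye camera 54 can be controlled.
[0095]
FIGS. 13 to 15 show screen displays for a concrete example of eye position extraction and tracking according to this embodiment.
[0096]
FIG. 13 shows an example of an input image. FIG. 14 shows the difference image between this image and the previous image. The pattern 270 cut out to judge left-right symmetry is displayed at the upper left of the screen, and the between-the-eyes pattern 272 of the input image at the lower left. In the difference image, two locations 274 and 276 have been extracted as eye position candidates and are marked with white circles.
[0097]
FIG. 15 shows an example in which the eye positions 292 and 294 and the center position 290 are overlaid on the image. As FIG. 15 shows, the positions of both eyes are tracked correctly around the between-the-eyes point.
[0098]
The embodiment disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.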
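[Brief description of the drawings]
[FIG. 1] An external view of the computer system 20 that implements the eye position tracking method and apparatus according to the present invention.
[FIG. 2] A block diagram of the computer system 20 shown in FIG. 1 and its peripheral devices.
[FIG. 3] A diagram illustrating the principle of eye position extraction in an embodiment of the present invention.
[FIG. 4] A diagram showing an example of a difference image.
[FIG. 5] A diagram illustrating the principle of eye position extraction in an embodiment of the present invention.
[FIG. 6] A diagram showing the principle of cancelling the user's turning motion in an embodiment of the present invention.
[FIG. 7] A flowchart showing the overall control structure of the software that implements the eye position extraction method and apparatus in an embodiment of the present invention.
[FIG. 8] A flowchart of the software that implements eye position extraction in an embodiment of the present invention.
[FIG. 9] A flowchart of the software that implements eye position extraction in an embodiment of the present invention.
[FIG. 10] A flowchart of the software that implements the calculation of the movement amount d in the eye position extraction of an embodiment of the present invention.
[FIG. 11] A diagram explaining the process of judging left-right symmetry near the between-the-eyes point in an embodiment of the present invention.
[FIG. 12] A flowchart of the software that implements eye tracking in an embodiment of the present invention.
[FIG. 13] A diagram showing a display example of the current image in an embodiment of the present invention.
[FIG. 14] A diagram showing a display example of the difference image in an embodiment of the present invention.
[FIG. 15] A diagram showing a display example of eye position tracking in an embodiment of the present invention.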
[Explanation of reference numerals]
20: computer system; 40, 82: computers; 42: monitor; 46: keyboard; 48: mouse; 54: eye camera; 56, 58: stereo cameras.
[Technical field to which the invention belongs]
The present invention relates to a man-machine interface technique, and more particularly to a method and apparatus for tracking the position of a human eye for the purpose of detecting a human gaze direction without error when operating a computer or the like with the human gaze. .
[0002]
[Prior art]
With the recent development of computer technology, devices using computers have been used in every corner of people's lives. Without the skills to operate a computer, there is even a risk of not being able to live a satisfactory social life.
[0003]
On the other hand, in order to operate a computer, it is necessary to convey the intention of a person to the computer. Various researches have been conducted and put into practical use about how efficiently, without error, and simply transferring human intentions to a computer. Such a technique is generally called a man-machine interface.
[0004]
If it is only for operating the computer, a command for operating the computer on a text basis may be given to the computer via, for example, a keyboard. However, this requires humans to store these multiple commands along with the command syntax and the necessary parameters. Therefore, what is generally called GUI (Graphical User Interface) has been devised and is now mainstream.
[0005]
In the GUI, generally, a pointing device such as a mouse is used to point an icon displayed on the screen, and a predetermined operation such as clicking, double-clicking, dragging, or the like is given to the computer. Can do. Therefore, there is a feature that it is not necessary to memorize a large number of commands and errors are less likely to occur.
[0006]
On the other hand, the GUI needs to operate a pointing device. For this reason, for example, a person with a hand movement disorder is difficult to operate even with a computer employing a GUI. Depending on the application, the hand may not be freely used, and it may be difficult to use a pointing device such as a mouse.
[0007]
Thus, various techniques for transmitting human intentions to a computer using human eyes have been considered. When a human consciously manipulates the line of sight, the intention of the human can be given to the computer using the line of sight as a pointing device. In addition, when a human moves the line of sight unconsciously, the intention of the human can be estimated by detecting the movement of the line of sight by the computer.
[0008]
A prerequisite technology for this is a device that images a human eyeball and estimates the line-of-sight direction. For this purpose, an eye camera is generally used. By estimating the direction of the line of sight of a person using an eye camera, the intention of the person can be estimated and used for computer operations.
[0009]
When estimating the gaze direction using an eye camera, it is necessary to image the eyeball with high resolution in order to increase the estimation accuracy. However, for this purpose, the imaging range of the eye camera is narrowed. The depth of field is also shallow. As a result, when an eye camera is used, the user must almost fix the face so that the face does not move forward, backward, left or right. Therefore, when the eye camera is used alone, its usage form is very limited. It is also desirable to be able to detect and track an eye even on an unknown face, not a system that can be used only by a specific individual.
[0010]
In order to solve the problems associated with the use of such an eye camera, the inventor of the present application first detects the position of the user's eyes using a wide-field camera, and uses that information to pan and tilt the eye camera. I came up with the idea that controlling the focus is effective. In other words, by using a subsystem that detects the position of the human eye by some method, tracks it, and sequentially outputs the direction and distance to the eye camera control device, the accuracy of eye direction detection by the eye camera is adopted. It is a method to increase.
[0011]
Usually, when using a computer, the user is about 50 to 100 cm away from the eye camera and is considered to be sitting on a chair. Therefore, by detecting the eye position as described above and controlling the eye camera using the information, it is considered that the accuracy of the gaze direction detection by the eye camera becomes very high.
[0012]
In the assumed situation, the range of movement of the face is quite limited. For this reason, it is possible to capture a large image so that the face fills the height of the screen. Accordingly, the inventors of the present invention have come to realize that the position of both eyes can be detected by detecting the movement of the eyelid in the face image captured in this manner. Humans are blinking unconsciously, and by waiting for a natural blink, the position of the eyes can be detected without making the user aware of it. Especially in situations where an eye camera is used, it is possible to obtain user cooperation.
[0013]
Blink is not a continuous movement. Therefore, once the eye position is detected, it is necessary to track the eye position by some method. One method for that purpose is to prepare a large number of face image templates and estimate the position of the eyes by matching the captured face image with the template.
[0014]
However, in this case, simple template matching cannot cope with changes in face orientation. Even if the template is updated, it is inevitable that the tracking point gradually shifts from the actual eye. Therefore, it is necessary to overcome the weakness caused by such template matching and improve the robustness of the tracking process.
[0015]
As for the information on the distance given to the eye camera, sufficient accuracy can be obtained by measuring using, for example, an ultrasonic sensor as long as the direction of the eye is known. However, in the following embodiments, the distance to the user's face is estimated using a binocular stereo image. Of course, in addition to this, it is possible to measure the distance to the user's face using various methods, but since there is no direct relationship with eye position detection and tracking in the present invention, in the following description, A detailed description of distance measurement will not be given.
[0016]
There are already several examples of eye position tracking methods. One of them is disclosed in Japanese Patent Laid-Open No. 2000-163564. In this example, the position of the eyes is tracked using a so-called template matching technique using an image pattern centered on the eyes as a template. Other examples are described in IEICE Technical Report PRMU99-151 (1999-11), pp. It is disclosed in a paper entitled “Development of Real-Time Eye-Gaze Detection / Motion Recognition System” of 9-14. In this example, face tracking is performed by stereo processing of a face feature region image including both ends of the left and right eyes and an image using three-dimensional coordinates from face images obtained using two video cameras. The center position of the eyeball is estimated from the face position thus estimated and the offset vector from the midpoint of both ends of the left and right eyes toward the eyeball.
[0017]
[Problems to be solved by the invention]
However, the method disclosed in Japanese Patent Laid-Open No. 2000-163564 uses a normal template matching method, and has a problem that it is vulnerable to pattern rotation and scale change. In the example published in the IEICE Technical Report, the eye position is estimated only from stereo processing using two images, so the distance from the camera to the eye is limited, and a tracking pattern is registered in advance. There is a problem that needs to be kept.
[0018]
That is, until now, there has been a problem that the position of the eyes of a general user cannot be accurately tracked without restricting the position of the user's face.
[0019]
Therefore, an object of the present invention is to track the eye position from a general face image with high robustness, and to track the eye position without improperly restraining the user, and an apparatus therefor Is to provide a program.
[0020]
Another object of the present invention is that it is possible to detect the position of the eyes with high accuracy and robustness from a general face image even if the face rotates or moves, and improperly restrains the user. To provide an eye position tracking method, apparatus, and program therefor.
[0021]
[Means for Solving the Problems]
The invention according to the first aspect of the present application is a method by which a computer that has a storage device, and to which imaging means for capturing a video image of a subject's face is connected, tracks the position of the subject's eyes in a series of images output by the imaging means. The method includes a step of extracting an eye position in a face image output by the imaging means. The step of extracting the eye position includes a step of calculating a difference image between two images output by the imaging means, and a step of extracting, from the difference image, the remaining difference area that excludes the difference caused by movement of the subject's entire face between the two images. The step of extracting the remaining difference area includes a step of calculating the movement amount that maximizes the area in which the pixel values of the images obtained by translating each of the two images and the pixel values of the difference image coincide, and a step of deleting from the difference image the difference caused by the calculated movement amount. The step of extracting the eye position further includes a step of selecting second candidate points from the extracted remaining difference area according to the difference between the pixel values of the two images, and extracting the second candidate points as the eye position when they satisfy a predetermined geometric condition. The method further includes: a step of saving in the storage device the position between the eyebrows determined by the extracted eye position, the relative positions of both eyes with respect to the position between the eyebrows, and an image pattern between the eyebrows determined by the position between the eyebrows; an image acquisition step of acquiring an image subsequent to the image from which the eye position was extracted; a step of predicting the position between the eyebrows in the subsequent image; a step of searching, in the vicinity of the predicted position between the eyebrows, for the point at the center of the area that best matches the saved image pattern between the eyebrows; a step of predicting the positions of both eyes in the subsequent image based on the position of the searched point and the saved relative positions of both eyes, and searching around each predicted position for the center point of each of two regions that satisfy a predetermined condition; a step of updating the stored position between the eyebrows, relative positions of both eyes, and image pattern between the eyebrows, taking the midpoint of the two searched center points as the new position between the eyebrows, the positions determined by the new position between the eyebrows and the relative positions of the two center points as the new relative positions of both eyes with respect to the position between the eyebrows, and the image pattern of the area defined by the new position between the eyebrows as the new image pattern between the eyebrows; and a step of performing the processing from the image acquisition step on an image further subsequent to the subsequent image.
[0022]
Preferably, the step of predicting the position between the eyebrows includes a step of extrapolating the position between the eyebrows in the subsequent image from the position between the eyebrows in the image from which the eye position was extracted and the position between the eyebrows in the image preceding that image.
[0023]
More preferably, the method further includes a step of, in response to the points not being found in the step of searching for the center point of each of the two regions, extracting by a predetermined method the position between the eyebrows of the face image output by the imaging means, the relative positions of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows determined by the position between the eyebrows, and restarting the processing from the image acquisition step.
[0024]
In addition, the step of searching for the center point of each of the two regions may include a step of searching, in the vicinity of each of the positions determined by the relative positions of both eyes with respect to the position of the searched point, for the center of the region of a predetermined shape in which the average of the pixel values is darkest, and taking that center as a candidate point for the corresponding eye.
[0025]
More preferably, the step of searching for the center point of each of the two regions further determines whether the candidate points found in the step of searching for the center of the darkest region satisfy all of the following conditions:
1) the distance between the candidate points is not less than a predetermined minimum value;
2) the distance between the candidate points is not more than a predetermined maximum value; and
3) the angle formed by the straight line connecting the candidate points and the scanning line direction satisfies a predetermined relationship.
If any of the conditions is not satisfied, the search may be treated as having failed.
[0026]
The invention according to the second aspect of the present application is a device that, in a computer having a storage device and connected to imaging means for capturing a video image of a subject's face, tracks the position of the subject's eyes in a series of images output by the imaging means. The device includes means for extracting an eye position in a face image output by the imaging means. The means for extracting the eye position includes means for calculating a difference image between two images output by the imaging means, and means for extracting, from the difference image, the remaining difference area that excludes the difference caused by movement of the subject's entire face between the two images. The means for extracting the remaining difference area includes means for calculating the movement amount that maximizes the area in which the pixel values of the images obtained by translating each of the two images and the pixel values of the difference image coincide, and means for deleting from the difference image the difference caused by the calculated movement amount. The means for extracting the eye position further includes means for selecting second candidate points from the extracted remaining difference area according to the difference between the pixel values of the two images, and extracting the second candidate points as the eye position when they satisfy a predetermined geometric condition. The device further includes: means for saving in the storage device the position between the eyebrows determined by the extracted eye position, the relative positions of both eyes with respect to the position between the eyebrows, and an image pattern between the eyebrows determined by the position between the eyebrows; image acquisition means for acquiring an image subsequent to the image from which the eye position was extracted; prediction means for predicting the position between the eyebrows of the face image in the subsequent image; first search means for searching, in the vicinity of the predicted position between the eyebrows, for the point at the center of the area that best matches the saved image pattern between the eyebrows; second search means for predicting the positions of both eyes in the subsequent image based on the position of the searched point and the saved relative positions of both eyes, and searching around each predicted position for the center point of each of two regions that satisfy a predetermined condition; update means for updating the position between the eyebrows, the relative positions of both eyes, and the image pattern between the eyebrows stored in the storage device, taking the midpoint of the two searched center points as the new position between the eyebrows, the positions determined by the new position between the eyebrows and the relative positions of the two center points as the new relative positions of both eyes with respect to the position between the eyebrows, and the image pattern of the area defined by the new position between the eyebrows as the new image pattern between the eyebrows; and means for controlling the image acquisition means, the prediction means, the first and second search means, and the update means so as to repeat the processing from the image acquisition means on an image further subsequent to the subsequent image.
[0027]
Preferably, the prediction means includes means for extrapolating the position between the eyebrows in the subsequent image from the position between the eyebrows in the image from which the eye position was extracted and the position between the eyebrows in the image preceding that image.
[0028]
More preferably, the device further includes means for controlling the image acquisition means, the prediction means, the first and second search means, and the update means so that, in response to the second search means failing to find the points, the position between the eyebrows of the face image output by the imaging means, the relative positions of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows determined by the position between the eyebrows are extracted by predetermined means and the processing is resumed from the image acquisition means.
[0029]
Further, the second search means may include means for searching, in the vicinity of each of the positions determined by the relative positions of both eyes with respect to the position of the searched point, for the center of the region of a predetermined shape in which the average of the pixel values is darkest, and taking that center as a candidate point for the corresponding eye.
[0030]
More preferably, the second search means may further include means for determining whether the candidate points found by the means for searching for the center of the darkest region satisfy all of the following conditions, and for treating the search as having failed if any of the conditions is not satisfied:
1) the distance between the candidate points is not less than a predetermined minimum value;
2) the distance between the candidate points is not more than a predetermined maximum value; and
3) the angle formed by the straight line connecting the candidate points and the scanning line direction satisfies a predetermined relationship.
[0031]
DETAILED DESCRIPTION OF THE INVENTION
[Hardware configuration]
FIG. 1 shows the appearance of a computer system 20 that implements an eye position tracking method and apparatus according to the present invention. The computer system 20 detects and tracks the position of the eyes, and a computer (not shown) controls the eye camera using information about the distance to the user's face and the direction of the eyes given from the computer system 20. In this specification, "subject" means a user whose eye position the system detects and tracks, and can include not only human beings but also animals.
[0032]
Referring to FIG. 1, the computer system 20 includes: a computer 40 having a CD-ROM (Compact Disc Read Only Memory) drive 50 and an FD (flexible disk) drive 52; a mouse 48, which is a first pointing device, a keyboard 46, and a monitor 42, each connected to the computer 40; stereo video cameras 56 and 58 disposed between the monitor 42 and the keyboard 46; and an eye camera 54 controlled by a computer.
[0033]
FIG. 2 illustrates the internal configuration of the computer 40 together with other components of the computer system 20. Referring to FIG. 2, computer 40 includes a CPU (Central Processing Unit) 70, a memory 72 connected to CPU 70, and a hard disk 74 connected to CPU 70. The CPU 70 is connected to the network 80 via a network board (not shown), and the computer 82 that controls the eye camera 54 is connected to the network 80. Since the configuration of the computer 82 is the same as that of the computer 40, details thereof will not be described here.
[0034]
The eye position tracking method and apparatus according to the present invention, which will be described in detail below, are realized by the computer hardware described with reference to FIGS. 1 and 2 and software executed by the CPU 70. This software is distributed on the market either stored on a storage medium such as a CD-ROM or via a network. When this software is installed on the computer 40, it is typically stored on the hard disk 74. When the software is started by the user or by the operating system (OS) of the computer 40, it is read from the hard disk 74 into the memory 72 and executed by the CPU 70.
[0035]
Note that this software may use functions originally incorporated as part of the OS of the computer 40, or may use some functions of other software installed separately from the OS. Therefore, software distributed for realizing the present invention may lack some of these functions; it need only be configured on the assumption that such functions are provided by other software at run time. Of course, it may also be distributed as system software in which all necessary functions are incorporated in advance.
[0036]
[Principle]
In the present invention, the method for detecting the position of the eyes is based on the difference between frame images, and in particular focuses on the difference in brightness caused by blinking. When a blink is to be detected from the inter-frame difference and the face is moving, pixels with a large change in brightness also occur in many places other than the eyelids that moved in the blink. Therefore, it is necessary to distinguish between a change in brightness due to the movement of the eyelids and a change in brightness due to the movement of the entire face. Changes in mouth shape, eyebrow movement, and the like may also cause brightness changes, but those are rejected by a condition test applied after the eye candidate points are extracted.
[0037]
The movement of the entire face can be broken down into translation in the image plane, rotation in the image plane (tilting the head to one side), and rotation out of the image plane (turning the head sideways and nodding). It is assumed that the frame rate is sufficiently high and the movement between frames is small. Methods of canceling each of these movements are described below in order.
[0038]
The method of canceling the movement when the face is translated in the image plane is as follows. Consider the case where a colored disk 90 appearing in the image f_{t-1} shown in FIG. 3(A) moves to the right by a distance d, so that the image f_t shown in FIG. 3(B) is obtained in the next frame. The disk is marked with a small white circle. The background need not be uniform, but is not shown.
[0039]
Taking the difference of the images of FIG. 3(A) and FIG. 3(B) yields an image like that shown in FIG. 4. In FIG. 4, the shaded area is where the difference in pixel values is large, and the white area is where the difference is small. When the pixel positions of the difference image D of FIG. 4 at which the difference is large are traced back to the original images shown in FIG. 3(A) and (B), they can be further classified into the regions F and B shown at the left and right of the upper row of FIG. 5. The F of region F stands for foreground, that is, points on the disk, and the B of region B stands for background. Region B is the portion that is visible in one image and not visible in the other. Region F is the portion that is visible in both images and has corresponding pixels. In an actual image, however, it is not known which part is the moving object, so regions F and B cannot be distinguished.
[0040]
Here, as shown at the upper right of FIG. 5, try shifting the pixels of image f_t that correspond to regions F and B by −d and superimposing them on image f_{t-1}. Then, as shown at the lower left of FIG. 5, there is a region M where the pixel values match and a region U where they do not. Similarly, shifting the pixels of image f_{t-1} that correspond to regions F and B by +d and superimposing them on image f_t yields the regions M and U shown at the lower right of FIG. 5. The movement amount ±d here is a two-dimensional vector.
[0041]
If the shift amount used is some ±d′ different from the actual movement ±d, the number of pixels belonging to region M is expected to be smaller than when the actual ±d is used. Therefore, the shift amount that maximizes the number of pixels in region M can be taken to be the movement amount d of the disk 90 between frames. When counting the pixels in region M, if the shift applied to the pixels of image f_{t-1} is d, the shift applied to image f_t is −d; the two shifts must always have the same length and opposite directions.
[0042]
As described above, for the pixels extracted from the inter-frame difference image, the shift amount ±d that maximizes the number of pixels belonging to region M is found, and the images are shifted and superimposed. The pixels whose values match in neither superimposition are then extracted; these can be said to be portions that moved in some way other than the parallel movement by d. If the movement amount of the entire face is d, the parts that move other than by d are those arising from blinking, changes in mouth shape, and the like. Therefore, by canceling the movement of the entire face from the inter-frame difference image, it is possible to extract the portion that moved in a blink, that is, the eye position.
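As a non-limiting illustration of this principle, the following Python sketch deletes from a binary difference image every pixel that a given translation d can explain. The function name `residual_after_shift`, the nested-list image format, and the tolerance `tol` used to decide that two pixel values "match" are assumptions made for this example and do not appear in the specification.

```python
def residual_after_shift(prev, curr, diff, d, tol=0):
    """Return the pixels of `diff` that a translation by d = (dx, dy) cannot explain.

    prev, curr: grayscale images f_{t-1} and f_t as lists of rows.
    diff: binary difference image D (1 = large inter-frame difference).
    """
    h, w = len(diff), len(diff[0])
    dx, dy = d
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not diff[y][x]:
                continue
            # f_t shifted by -d onto f_{t-1}: compare curr(p) with prev(p - d).
            xb, yb = x - dx, y - dy
            back = (0 <= xb < w and 0 <= yb < h
                    and abs(curr[y][x] - prev[yb][xb]) <= tol)
            # f_{t-1} shifted by +d onto f_t: compare prev(p) with curr(p + d).
            xf, yf = x + dx, y + dy
            fwd = (0 <= xf < w and 0 <= yf < h
                   and abs(prev[y][x] - curr[yf][xf]) <= tol)
            # Keep only pixels that match in neither superimposition.
            if not (back or fwd):
                out[y][x] = 1
    return out
```

In this sketch a changed area survives only if it is wider than the shift itself, which is in line with the assumption that movement between frames is small.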
[0043]
Next, consider the movement of turning the head sideways. Referring to FIG. 6, consider a case where the face facing the cameras 56 and 58 turns to the left. In this case, a previously hidden part appears at the right end of the face (the left end in FIG. 6), and a previously visible part becomes hidden by self-occlusion at the left end of the face (the right end in FIG. 6). Because such a part is hidden in one of the two images used to calculate the inter-frame difference, shifting the images and checking whether pixel values coincide is meaningless there, unlike in the case of parallel movement.
[0044]
However, such a portion appears only at the periphery of the moving area. Therefore, after the inter-frame difference image (FIG. 4) is calculated, its influence can be removed by deleting a certain number of pixels from the left and right sides of the hatched area. On the other hand, the central part of the face can be regarded as undergoing parallel movement even during a turning motion, so the technique described above for parallel movement (shifting the two images by ±d and looking for matching pixels) can be applied.
[0045]
Note that tilting the head to one side and nodding can both be regarded as parallel movement, because the center of rotation lies off the face. In every case, however, it is assumed that the movement between frames is small.
[0046]
[Software control structure]
A software structure for realizing the eye detection process and tracking process based on the above principle will be described below. As noted above, the software described below may be written on the assumption that functions of the OS, or functions already installed in the computer, are used at run time, so the software may not include those functions itself. Even in such a case, the software belongs to the technical scope of the present invention as long as its control structure contains information on where such functions are to be used.
[0047]
In the processing described below, in addition to the processing for detecting eye candidate points already described, the extracted pair of eye candidate points is tested to determine whether it really is a pair of eyes. The test uses the conditions that blinks occur simultaneously in the left and right eyes, that there are two eyes, that the distance between the eyes falls within a certain range, that the tilt angle of the face is within a certain range, and that the gray-level pattern near the midpoint of the eyes is almost bilaterally symmetric.
[0048]
In the following processing, various threshold values are used, and the values are determined empirically depending on the application and the required accuracy.
[0049]
Referring to FIG. 7, the overall configuration of this software is as follows. First, in step 100, eye positions are extracted using a procedure described in detail later with reference to FIG. 8. In the following step 102, the extracted eye position is transmitted to the computer 82 that controls the eye camera 54. Then, in step 104, eye position tracking processing is performed. In step 106, it is determined whether the eye position was tracked successfully, that is, whether or not the eye position has been lost. If tracking failed, control returns to step 100 and processing starts again from the eye position extraction process. If tracking succeeded, control returns to step 102, and information such as the eye direction and the distance to the eye obtained by tracking is transmitted to the computer 82. By repeating the processing of steps 100 to 106 in this way, information regarding the direction and distance of the user's eyes can be transmitted to the computer 82 continuously.
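The control flow of steps 100 to 106 can be sketched, purely as an illustration, as a small loop over incoming frames. The callables `extract_eyes`, `track_eyes`, and `send` are hypothetical stand-ins for the extraction process, the tracking process, and the transmission to the computer 82; the sketch also assumes that re-extraction after a tracking failure begins with the next frame.

```python
def run_tracking_loop(frames, extract_eyes, track_eyes, send):
    """Drive extraction (step 100), transmission (102) and tracking (104/106)."""
    eyes = None   # last known eye positions; None means not yet found, or lost
    prev = None   # previous frame, needed for the inter-frame difference
    for frame in frames:
        if eyes is None:
            # Step 100: (re)extract the eye position; this may fail and return None.
            if prev is not None:
                eyes = extract_eyes(prev, frame)
        else:
            # Steps 104 and 106: track; None signals that the eyes were lost.
            eyes = track_eyes(frame, eyes)
        if eyes is not None:
            send(eyes)  # step 102: report the result to the eye camera controller
        prev = frame
```

Passing the three stages in as callables keeps the sketch independent of how extraction and tracking are actually implemented.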
[0050]
Note that the eye position may fail to be extracted in step 100 of FIG. 7. In that case, step 100 is executed again so that the eye position extraction process is performed on the next frame, and the tracking process is then started.
[0051]
With reference to FIG. 8, the eye position extraction process performed in step 100 of FIG. 7 will be described. The following processing uses the principle described with reference to FIGS. 3 to 6. It is assumed that the previous image has been captured before this processing starts. At the very beginning of processing, the previous image may simply be set equal to the current image. First, in step 120, current images from the cameras 56 and 58 are captured. The processing below is actually performed on the image from the camera 56, one of the two cameras. After the eyes are detected, the distance to the eyes is calculated by stereo processing together with the image from the camera 58. The roles of the camera 56 and the camera 58 may be exchanged.
[0052]
Subsequently, in step 122, the inter-frame difference binarized image D between the previous image and the current image is calculated. More specifically, the absolute value of the difference between the values of the previous image and the current image is obtained for each pixel, and if the value is equal to or greater than a predetermined threshold value N, the pixel value is set to “1”. If it is less than the threshold value N, the pixel value is set to “0”. Since the video signal varies to some extent even when a still object is imaged, it is necessary to use such a threshold value N. This threshold value N can be considered as a noise level.
[0053]
Subsequently, at step 124, it is determined whether or not the number of pixel values “1” in the difference image D is greater than or equal to a predetermined threshold value G. If the determination result is “YES”, the control jumps to step 168 shown in FIG. 9, and it is determined that no eye has been detected, and the process is terminated. In this case, since motion is detected in too many pixels, the next frame input is awaited on the assumption that there is more motion than expected.
[0054]
If the determination result in step 124 is "NO", it is determined in step 126 whether or not the number of pixels with value "1" in the difference image D is equal to or less than a threshold value E. This threshold value E is smaller than the aforementioned threshold value G, and is used for determining whether or not the face is substantially stationary. If the determination result of step 126 is "YES", the control jumps to step 152 shown in FIG. 9. In this case, the number of pixels extracted by the inter-frame difference is small, and the face is considered to be stationary.
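Steps 122 to 126 can be illustrated by the following Python sketch. The function names, the string labels returned by `classify_motion`, and the nested-list image format are assumptions made for this example; N, G, and E correspond to the thresholds named in the text.

```python
def binarize_difference(prev, curr, noise_level):
    """Step 122: 1 where |curr - prev| >= N (the noise level), else 0."""
    return [[1 if abs(c - p) >= noise_level else 0
             for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]


def classify_motion(diff, g_threshold, e_threshold):
    """Steps 124 and 126: decide how to proceed from the count of '1' pixels."""
    changed = sum(sum(row) for row in diff)
    if changed >= g_threshold:
        return "too_much_motion"   # give up on this frame (step 168)
    if changed <= e_threshold:
        return "stationary"        # skip motion cancellation (jump to step 152)
    return "moving"                # continue with step 128
```

For instance, with N = 5 the row pair `[10, 10, 10, 10]` and `[10, 12, 30, 5]` yields the binary row `[0, 0, 1, 1]`.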
[0055]
If the determination result in step 126 is "NO", the control proceeds to step 128. In step 128, the pixels at both ends of each group of pixels having the value "1" in the difference image D are deleted. This process takes into account the turning motion described with reference to FIG. 6.
[0056]
More specifically, each scanning line of the difference image D is searched rightward from its left end and leftward from its right end for the first pixel having the value "1". Then, starting from each pixel found, the values of k pixels are rewritten to "0". This processing removes a predetermined amount from the left and right edges of the region of pixels having the value "1" in the image.
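The edge trimming of step 128 might be sketched as follows. The function name `trim_scanline_ends` is hypothetical, and the sketch assumes that the k pixels are counted inward from the first "1" pixel found at each end, regardless of their values.

```python
def trim_scanline_ends(diff, k):
    """Zero k pixels inward from the first '1' found at each end of every row."""
    out = [row[:] for row in diff]
    for row in out:
        ones = [i for i, v in enumerate(row) if v]
        if not ones:
            continue  # nothing moved on this scanning line
        left, right = ones[0], ones[-1]
        for i in range(left, min(left + k, len(row))):
            row[i] = 0   # remove the left edge of the moving area
        for i in range(max(right - k + 1, 0), right + 1):
            row[i] = 0   # remove the right edge of the moving area
    return out
```

With k = 2, the row `[0, 1, 1, 1, 1, 1, 1, 0]` becomes `[0, 0, 0, 1, 1, 0, 0, 0]`.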
[0057]
Subsequently, in step 130, a pixel corresponding to the pixel value “1” of the difference image D is extracted from the previous image and the current image. This process is a process for obtaining the image shown in the middle of FIG. Thereafter, in step 132, the above-described processing for calculating the movement amount ± d is performed. Details of this processing will be described later with reference to FIG.
[0058]
Once the movement amount ±d has been calculated, in step 150 (FIG. 9) the pixels corresponding to the region M shown in the lower part of FIG. 5 are deleted from the difference image D. Then, in step 152, isolated point removal and smoothing are applied to the difference image D. More specifically, the value of the pixel being processed is set to "1" if four or more pixels in total, among the pixels neighboring it and the central pixel itself, have the value "1". Otherwise, the pixel value is set to "0". By this process, the pixel value of an isolated point becomes "0" and the point is deleted from the difference image D.
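The isolated point removal of step 152 can be illustrated with the sketch below. The specification does not state the size of the neighborhood; a 3 × 3 window (the pixel and its eight neighbors) is assumed here, and the function name is hypothetical.

```python
def remove_isolated_points(diff, min_count=4):
    """Set a pixel to 1 only if at least min_count pixels in its 3x3 window are 1.

    The 3x3 window size is an assumption; the text only says 'in the vicinity'.
    """
    h, w = len(diff), len(diff[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            count = 0
            # Clip the window at the image border instead of padding.
            for ny in range(max(y - 1, 0), min(y + 2, h)):
                for nx in range(max(x - 1, 0), min(x + 2, w)):
                    count += diff[ny][nx]
            out[y][x] = 1 if count >= min_count else 0
    return out
```

Under this assumption a lone "1" pixel is erased, while a 2 × 2 block of "1" pixels is preserved unchanged.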
[0059]
Subsequently, in step 154, a labeling process is performed on the difference image D. By this processing, a label is assigned to each of the regions composed of pixels having pixel values “1” that are continuous with each other. The center of each element thus obtained is set as a candidate point for the eye position (step 156).
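Steps 154 and 156 might be realized, for example, by a breadth-first labeling such as the one sketched below. The use of 8-connectivity and of the centroid as the "center" of each labeled region are assumptions of this example, as is the function name `candidate_points`.

```python
from collections import deque


def candidate_points(diff):
    """Label connected regions of '1' pixels and return each region's centroid (x, y)."""
    h, w = len(diff), len(diff[0])
    seen = [[False] * w for _ in range(h)]
    centers = []
    for y0 in range(h):
        for x0 in range(w):
            if not diff[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill one region starting from (x0, y0).
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            xs, ys = [], []
            while queue:
                x, y = queue.popleft()
                xs.append(x)
                ys.append(y)
                for ny in range(max(y - 1, 0), min(y + 2, h)):
                    for nx in range(max(x - 1, 0), min(x + 2, w)):
                        if diff[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centers
```

Regions are reported in the order in which a row-by-row scan first encounters them.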
[0060]
It is determined whether the number of obtained candidate points is less than 2 (step 158). Since the number of eye positions is generally assumed to be two, if the number of candidate points is less than two, no eye is detected, and the process proceeds to step 168. If the number of candidates is two or more, control proceeds to step 160.
[0061]
In step 160, for each candidate point, the sum S of the absolute values of the differences between the pixel values of the previous image and those of the current image is obtained over the rectangular a × b region centered on the candidate point. Here, the rectangle a × b has a size corresponding to the assumed eye size. If a blink occurred between the previous image and the current image, one image shows the eye and the other shows the eyelid, so the absolute value of the difference between the pixel values should be large, and the size of the changed area should be close to the expected eye size. Therefore, S is obtained for each candidate point in this way, and the two candidate points with the largest values of S are taken as the second candidate points (candidate point pair).
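Step 160 can be sketched as follows. The function names are hypothetical, candidate points are assumed to have been rounded to integer pixel coordinates (x, y), and the a × b window is simply clipped at the image border.

```python
def blink_strength(prev, curr, center, a, b):
    """Sum S of |curr - prev| over the a x b window centered on `center`."""
    h, w = len(prev), len(prev[0])
    cx, cy = center
    x0, y0 = cx - a // 2, cy - b // 2
    total = 0
    for y in range(max(y0, 0), min(y0 + b, h)):
        for x in range(max(x0, 0), min(x0 + a, w)):
            total += abs(curr[y][x] - prev[y][x])
    return total


def select_second_candidates(prev, curr, candidates, a, b):
    """Return the two candidates with the largest S as (S, point) pairs."""
    scored = [(blink_strength(prev, curr, c, a, b), c) for c in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:2]
```

Returning S together with each point lets a later condition test reuse the value without recomputing it.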
[0062]
Subsequently, at step 162, it is determined whether or not the two candidate points obtained as described above satisfy all of the following conditions. If any of the conditions is not satisfied, it is determined that no eye has been detected (step 168), and the process is terminated.
[0063]
1) The sum S of absolute values of differences obtained in step 160 is equal to or greater than a predetermined threshold value F at both of the two candidate points.
[0064]
2) The distance between candidate points is equal to or greater than a predetermined value Lmin.
3) The distance between candidate points is less than or equal to a predetermined value Lmax.
[0065]
4) The angle formed by the straight line connecting the candidate points and the scanning line direction is equal to or less than a predetermined threshold angle of A degrees.
[0066]
The threshold value F is determined in advance, experimentally or statistically, and is used to prevent portions other than the eyes from being extracted. Lmin and Lmax are the minimum and maximum values assumed for the distance between the centers of the eyes, respectively. These values are statistically determined in advance. The threshold angle of A degrees is set experimentally to an appropriate value; usually a value of about 30 to 45 degrees is used.
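The four conditions of step 162 can be checked, for example, as in the sketch below. The function name and parameter names are hypothetical; the sketch reads condition 1 as requiring S to reach F at both candidate points, and measures the angle against the horizontal scanning line direction.

```python
import math


def passes_eye_pair_test(p1, p2, s1, s2, f_min, l_min, l_max, a_deg):
    """Return True if the candidate pair (p1, p2) satisfies conditions 1) to 4)."""
    # 1) both blink strengths reach the threshold F
    if s1 < f_min or s2 < f_min:
        return False
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(dx, dy)
    # 2) and 3) the inter-eye distance lies within [Lmin, Lmax]
    if dist < l_min or dist > l_max:
        return False
    # 4) tilt of the connecting line from the scanning line, in [0, 90] degrees
    angle = math.degrees(math.atan2(abs(dy), abs(dx)))
    return angle <= a_deg
```

Taking absolute values before `atan2` makes the result independent of the order in which the two points are given.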
[0067]
If all the above four conditions are satisfied, it is further determined in step 164 whether or not the shade pattern near the midpoint of the two obtained eye candidate points is substantially bilaterally symmetric. If it is determined that they are substantially symmetrical, the two eye candidate points are set as eye positions. If it is not symmetrical, the process is terminated assuming that no eyes have been found.
[0068]
A method for determining whether or not the pattern near the midpoint of the two eye candidate points is substantially symmetric will be described. Referring to FIG. 11, in a region 220 centered on the midpoint of both eyes (this midpoint is hereinafter referred to as the position "between the eyebrows"), the left and right shading patterns should generally be nearly symmetric with respect to the straight line that passes through the position between the eyebrows (the midpoint of the two eye candidate points) and is perpendicular to the line segment connecting the two eye candidate points. Therefore, the current image is first rotated about the midpoint of the two eye candidate points so that both candidates lie on a horizontal scanning line, and a region of g × h pixels centered on the midpoint (region 220) is cut out. As shown in the lower part of FIG. 11, the sum of the pixel values of the h vertical pixels in this region 220 is calculated for each column, and each column sum is expressed as a percentage of the sum over all pixels, which yields a projection profile like the graph 222 shown in FIG. 11. Then, in this profile, the sum of three adjacent columns is calculated for each pair of left and right column groups located symmetrically about the midpoint. The differences between the left and right sums are added up over all such groups of three adjacent columns, and if the total is equal to or less than a threshold value p, the pattern near the midpoint of the two eye candidate points is determined to be almost symmetric.
[0069]
Specifically, referring to the lower part of FIG. 11, the sum of the values in the leftmost three columns 224L and the sum of the values in the rightmost three columns 224R are calculated, and their difference p_1 is obtained. Similarly, the sum of the values of the three columns 226L from the second to the fourth column from the left and the sum of the values of the three columns 226R from the second to the fourth column from the right are calculated, and their difference p_2 is obtained. The sum of the values of the three columns 228L from the third to the fifth column from the left and the sum of the values of the three columns 228R from the third to the fifth column from the right are calculated, and their difference p_3 is obtained. In the same manner, the values p_4 to p_{g-2} are calculated, the sum of p_1 to p_{g-2} is obtained, and that sum is compared with the threshold value p, so that it can be determined whether the shading pattern in the vicinity of the midpoint is substantially symmetric.
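The symmetry measure described above can be sketched as follows for a region that has already been rotated and cut out. The function name `symmetry_score` is hypothetical, and the use of the absolute value of each left-right difference is an assumption of this example; the region is then judged symmetric if the returned score does not exceed the threshold p.

```python
def symmetry_score(region):
    """Sum of p_1 .. p_{g-2} for a g-column, h-row region given as a list of rows."""
    h, g = len(region), len(region[0])
    total = sum(sum(row) for row in region)
    if total == 0:
        return 0.0  # a completely black region has a flat profile
    # Projection profile: each column sum as a percentage of the whole region.
    profile = [100.0 * sum(region[y][x] for y in range(h)) / total
               for x in range(g)]
    score = 0.0
    for i in range(g - 2):
        left = sum(profile[i:i + 3])            # columns i .. i+2 from the left
        right = sum(profile[g - 3 - i:g - i])   # columns i .. i+2 from the right
        score += abs(left - right)
    return score
```

Because the profile is expressed in percent, the score does not depend on the overall brightness of the region.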
[0070]
In the above description, the image is first rotated about the midpoint of the two eye candidate points so that they are aligned on a horizontal scanning line, and it is then determined whether or not the shading pattern is symmetric. However, the method of determining symmetry is not limited to this. For example, when the angle that the line segment connecting the two eye candidate points forms with the horizontal scanning line is known, the sum of pixel values can be calculated in the direction perpendicular to that line segment, which realizes the same processing as that shown in FIG. 11. Also, although the sum is calculated for each column in the above example, the difference may instead be calculated for each pair of pixels located symmetrically about the midpoint, and symmetry may be determined from the sum of those differences.
[0071]
Returning now to step 132 in FIG. 8, the process of calculating the movement amount d will be described in detail with reference to FIG. The movement amount d is a two-dimensional vector d = (x, y); for x = −m, −m+1, ..., m and y = −n, −n+1, ..., n, a search is made for the d that gives the maximum total number of pixels belonging to the region M. First, the initial value 0 is substituted for the variable max (step 180). This variable max stores the maximum, found so far in the calculation described below, of the total number of pixels belonging to the region M for each movement amount.
[0072]
Next, −m is substituted for the variable x (step 182), and −n is substituted for the variable y (step 184). The following processing is then repeated for each value of y.
[0073]
First, the total number M (d) of pixels belonging to the region M for d = (x, y) is obtained (step 186). Then, it is determined whether this M (d) is larger than the variable max (step 188). If M (d) is less than or equal to the variable max, control proceeds to step 192 without doing anything. If M (d) is larger than the variable max, in step 190, the value of M (d) is substituted for the variable max, and the values of x and y at that time are substituted for X and Y, respectively. Control continues to step 192.
[0074]
In step 192, 1 is added to y, and it is determined whether or not the result of y exceeds n (step 194). If y does not exceed n, the control returns to step 186, and the processing of steps 186 to 194 is repeated for the new y. If y exceeds n, control proceeds to step 196.
[0075]
In step 196, 1 is added to x. It is then determined whether the resulting x exceeds m (step 198). If x does not exceed m, control returns to step 184, and the processing of steps 184 to 198 is repeated for the new x. If x exceeds m, control proceeds to step 200, where the value of the variable max and the values X and Y at which that maximum was obtained are output, and the process ends.
[0076]
In this manner, the movement amount d that maximizes the number of pixels included in the region M is obtained. Various other methods for calculating the movement amount d are conceivable. For example, the number of pixels included in the region M may be computed in advance for each x and y in the search range and stored in a table, and the cell holding the maximum value in the table may then be looked up.
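The loop of steps 180 to 200 can be sketched as below. The computation of M(d) itself is left to a caller-supplied function, `count_region_m`, which is a hypothetical name; initializing X and Y to 0 is also an assumption, since the text only initializes max.

```python
def find_best_shift(count_region_m, m, n):
    """Exhaustive search for d = (x, y) with -m <= x <= m and -n <= y <= n.

    `count_region_m(x, y)` is a hypothetical callable returning M(d), the
    number of pixels belonging to region M for the shift d = (x, y).
    Returns (max, X, Y) as output in step 200.
    """
    best = 0                 # step 180: max = 0
    best_x, best_y = 0, 0    # assumption: not initialized in the text
    x = -m                   # step 182
    while x <= m:            # step 198: stop once x exceeds m
        y = -n               # step 184
        while y <= n:        # step 194: stop once y exceeds n
            count = count_region_m(x, y)            # step 186
            if count > best:                        # step 188
                best, best_x, best_y = count, x, y  # step 190
            y += 1           # step 192
        x += 1               # step 196
    return best, best_x, best_y  # step 200
```

Because the comparison in step 188 is strict, the first shift that reaches the maximum is the one reported when several shifts tie.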
[0077]
To track the eyes detected as described above reliably and robustly from the next frame onward, the following tracking method is implemented in software. In the present embodiment, the eyes are not tracked directly; instead, the pattern between the eyes (between the eyebrows) is tracked. Each eye is then searched for in the vicinity of the point whose position relative to the tracked between-the-eyebrows position is the same as the eye position in the previous frame.
[0078]
The pattern between the eyebrows is relatively stable even when the facial expression changes. Moreover, the eyes and eyebrows cut in like wedges from both sides into the bright area formed by the forehead and the bridge of the nose, producing a light-and-dark pattern whose position is easy to determine accurately by pattern matching.
[0079]
As described below, pattern matching with a between-the-eyebrows template requires the template to be updated. In the present embodiment, a rectangular pattern centered on the midpoint of both eyes is used as the template, and the rectangular pattern detected in the current frame is used to update it.
[0080]
Details of the tracking process are described below with reference to the drawings. It is assumed that the midpoint of both eyes extracted in the previous frame, the between-the-eyebrows pattern (s × t pixels), and the relative positions $e_r$ and $e_l$ of the right and left eyes as seen from the midpoint ($e_r = -e_l$; both are two-dimensional quantities) have been saved.
[0081]
First, in step 240, the saved midpoint position, between-the-eyebrows pattern, and relative positions of both eyes are read in as initial values. The current image (current frame) is then captured (step 242).
[0082]
In the next step 244, the midpoint position of both eyes in the current image is predicted from the midpoint positions of both eyes in the previous image and in the image before that. That is, letting the between-the-eyebrows positions in the previous frame and the frame before it be $X_{t-1}$ and $X_{t-2}$ (both two-dimensional quantities), the predicted position $X_t$ in the current frame is obtained by the extrapolation $X_t = 2X_{t-1} - X_{t-2}$. At the first detection, however, $X_{-1} = X_0$. The midpoint position is predicted in this way because, while the face is moving, the amount of movement between frames can be regarded as almost constant, and starting from the predicted position makes matching of the between-the-eyebrows pattern efficient.
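The extrapolation of step 244 amounts to a constant-velocity prediction. A minimal sketch, with `predict_midpoint` as a hypothetical name and positions represented as (x, y) tuples:

```python
def predict_midpoint(prev, prev2=None):
    """Predict X_t = 2 * X_{t-1} - X_{t-2} for 2-D positions.

    `prev` is X_{t-1} and `prev2` is X_{t-2}. When only one position is
    available (first detection), X_{-1} = X_0 is used, so the prediction
    equals the previous position.
    """
    if prev2 is None:
        prev2 = prev
    return (2 * prev[0] - prev2[0], 2 * prev[1] - prev2[1])
```

For example, a midpoint that moved from (7, 18) to (10, 20) is predicted at (13, 22) in the next frame.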
[0083]
In step 246, the vicinity of the midpoint position predicted in step 244 is searched for the pattern that most closely matches the saved between-the-eyebrows pattern. The position where the best match is obtained is denoted $X_{t0}$. Then, in step 248, the right and left eyes are searched for in neighborhoods (i × j pixels) centered on $X_{t0} + e_r$ and $X_{t0} + e_l$, respectively. In this search, the point for which the average pixel value of the 5 × 5 pixels centered on it is darkest is taken to be the eye.
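Steps 246 and 248 can be sketched as two small searches over a grayscale image stored as a list of rows. The text does not say which similarity measure is used for matching; the sum of absolute differences below is an assumption, as are the function names, the `radius` parameter, and the convention that positions are (x, y) tuples.

```python
def window_sad(image, template, top, left):
    """Sum of absolute differences between the template and an image window."""
    return sum(
        abs(image[top + r][left + c] - template[r][c])
        for r in range(len(template))
        for c in range(len(template[0]))
    )


def match_between_eyebrows(image, template, predicted, radius):
    """Step 246 (sketch): best-matching window center near `predicted`."""
    height, width = len(image), len(image[0])
    t_h, t_w = len(template), len(template[0])
    px, py = predicted
    best_pos, best_score = None, None
    for cy in range(py - radius, py + radius + 1):
        for cx in range(px - radius, px + radius + 1):
            top, left = cy - t_h // 2, cx - t_w // 2
            if top < 0 or left < 0 or top + t_h > height or left + t_w > width:
                continue  # window would leave the image
            score = window_sad(image, template, top, left)
            if best_score is None or score < best_score:
                best_pos, best_score = (cx, cy), score
    return best_pos


def darkest_point(image, center, half_i, half_j, box=5):
    """Step 248 (sketch): point whose box x box neighborhood is darkest."""
    height, width = len(image), len(image[0])
    r = box // 2
    cx0, cy0 = center
    best_pos, best_sum = None, None
    for cy in range(cy0 - half_j, cy0 + half_j + 1):
        for cx in range(cx0 - half_i, cx0 + half_i + 1):
            if cy - r < 0 or cx - r < 0 or cy + r >= height or cx + r >= width:
                continue  # neighborhood would leave the image
            # The box size is fixed, so comparing sums equals comparing means.
            total = sum(
                image[y][x]
                for y in range(cy - r, cy + r + 1)
                for x in range(cx - r, cx + r + 1)
            )
            if best_sum is None or total < best_sum:
                best_pos, best_sum = (cx, cy), total
    return best_pos
```

In use, `darkest_point` would be called twice, once around $X_{t0} + e_r$ and once around $X_{t0} + e_l$, with `half_i` and `half_j` chosen to cover the i × j neighborhood.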
[0084]
Next, it is determined whether the eye positions found by the search satisfy conditions 2), 3), and 4) used in step 162 of the eye position extraction process (step 250). If any one of the conditions is not satisfied, the tracking is judged to be wrong or lost, tracking failure is declared (step 252), and the process ends. In this case, control returns from step 106 in FIG. 7 to step 100, and processing is re-executed from the detection of the eye position.
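A plausibility test of this kind might look like the sketch below. It follows the three conditions listed in the claims (minimum distance, maximum distance, and the angle to the scanning line); interpreting the "predetermined relationship" for the angle as an upper bound on tilt is an assumption, and the function name and all threshold parameters are hypothetical.

```python
import math


def eyes_plausible(right_eye, left_eye, min_dist, max_dist, max_tilt_deg):
    """Return True if the two eye positions pass the geometric checks.

    Positions are (x, y) tuples in pixels. The checks are: distance not
    below min_dist, distance not above max_dist, and the tilt of the
    connecting line relative to the horizontal scanning line not above
    max_tilt_deg (assumed form of the angle condition).
    """
    dx = left_eye[0] - right_eye[0]
    dy = left_eye[1] - right_eye[1]
    dist = math.hypot(dx, dy)
    if dist < min_dist or dist > max_dist:
        return False
    # Angle between the connecting line and the scanning line, folded
    # into the range 0..90 degrees so the order of the eyes is irrelevant.
    tilt = abs(math.degrees(math.atan2(dy, dx)))
    tilt = min(tilt, 180.0 - tilt)
    return tilt <= max_tilt_deg
```

If this returns False for the tracked positions, the tracker would report failure and hand control back to eye position extraction.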
[0085]
If all of the conditions are satisfied, tracking is judged successful. The midpoint position of both eyes obtained by tracking, the between-the-eyebrows pattern (s × t pixels) centered on that point, and the relative positions $e_r$ and $e_l$ of the eyes with respect to the midpoint are saved (updated), and the tracking process ends.
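The update on successful tracking can be sketched as follows. It is assumed here that s is the width and t the height of the pattern, that the image is a list of rows, and that the pattern window lies inside the image; the function name and the dictionary layout are hypothetical.

```python
def update_tracking_state(image, right_eye, left_eye, s, t):
    """Recompute the saved midpoint, relative eye positions and template.

    Returns a dict with the midpoint of both eyes, the relative positions
    e_r and e_l (so that e_r = -e_l), and the s x t pattern cut out around
    the midpoint. Assumes s is the width and t the height in pixels.
    """
    mx = (right_eye[0] + left_eye[0]) / 2.0
    my = (right_eye[1] + left_eye[1]) / 2.0
    e_r = (right_eye[0] - mx, right_eye[1] - my)
    e_l = (left_eye[0] - mx, left_eye[1] - my)
    # Cut the pattern out around the midpoint rounded to whole pixels.
    left = int(round(mx)) - s // 2
    top = int(round(my)) - t // 2
    if top < 0 or left < 0 or top + t > len(image) or left + s > len(image[0]):
        raise ValueError("pattern window falls outside the image")
    template = [row[left:left + s] for row in image[top:top + t]]
    return {"midpoint": (mx, my), "e_r": e_r, "e_l": e_l, "template": template}
```

Refreshing the template every frame lets the pattern follow gradual changes in lighting and head pose, which is the reason the text gives for updating it.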
[0086]
Once the positions of both eyes have been detected in the image from the camera 56 in this way, the distance to the eyes is obtained by stereo processing together with the image from the camera 58, and the parameters (distance and direction) for controlling the eye camera 54 can be determined by calculation. Since well-known techniques can be applied to this processing, a detailed description is not given here.
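The text leaves the stereo computation to well-known techniques. One common form, shown here only as an illustration, is triangulation for two parallel, rectified cameras: depth equals focal length times baseline divided by disparity. The rectified geometry, the pinhole model, and every name and parameter below are assumptions that do not come from the source.

```python
import math


def eye_distance_and_direction(x_ref, x_other, y_ref, focal_px, baseline, cx, cy):
    """Depth and viewing angles of a point seen by two rectified cameras.

    x_ref, y_ref: pixel position in the reference camera image.
    x_other: x position of the same point in the other camera image.
    focal_px: focal length in pixels; baseline: camera separation.
    cx, cy: principal point of the reference camera.
    Returns (distance along the optical axis, pan angle, tilt angle),
    with the angles in radians relative to the reference camera axis.
    """
    disparity = abs(x_ref - x_other)
    if disparity == 0:
        raise ValueError("zero disparity: point is at infinity")
    depth = focal_px * baseline / disparity
    pan = math.atan2(x_ref - cx, focal_px)
    tilt = math.atan2(y_ref - cy, focal_px)
    return depth, pan, tilt
```

The resulting distance and angles would still have to be converted into the coordinate frame of the eye camera 54 before being used as control parameters.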
[0087]
[Operation]
The eye position detection apparatus according to the present invention, the structure of which has been described above, operates as follows. The stereo cameras 56 and 58 shown in FIG. 1 capture a user's face image, output a video signal, and provide it to the computer 40.
[0088]
When the software for eye position detection and tracking is activated as shown in FIG. 8, the eye position is extracted for each of the video images obtained from the cameras 56 and 58. Specifically, the current image is captured (step 120), the difference image D is calculated (step 122), and the determinations of steps 124 and 126 are performed. If the determination result in step 124 or 126 is “YES”, the eye position cannot be detected, and the eye position extraction process is executed again for the next frame image.
[0089]
If the determination results in steps 124 and 126 are both “NO”, the pixels at both ends of each region of pixel value “1” are deleted in step 128. Then, the movement amount d is calculated by the method already described (step 132), and the pixels corresponding to the region M, which is obtained by the movement and superposition described above using that movement amount d, are removed from the difference image D (step 150).
[0090]
Subsequently, after isolated point removal and smoothing (step 152) and labeling (step 154), the center of each resulting element is taken as an eye position candidate (step 156). If there are fewer than two candidates, extraction is judged to have failed and the process restarts from the beginning. If there are two or more candidates, the two points with the largest sum of absolute inter-image differences in their vicinity are selected as the candidate points (step 160). If these candidate points satisfy all four of the conditions described above (step 162) and the pattern near their midpoint is determined to be left-right symmetric (step 164), the two candidate points are taken as the eye positions and the eye position extraction process ends. If the test in step 162 or step 164 fails, the eye position is regarded as not having been extracted, and the process restarts from the beginning.
[0091]
Thus, when the eye positions (directions) have been determined for both of the stereo cameras 56 and 58, the information needed to control the eye camera 54 (the direction of the eyes and the distance from the eye camera 54 to the eyes) is obtained from them. This information is transmitted to the computer 82 (step 102 in FIG. 7), and control proceeds to the eye position tracking process (step 104).
[0092]
At the beginning of the tracking process, the midpoint position of both eyes (that is, the position between the eyebrows) obtained by the eye position extraction process, the image pattern between the eyebrows, and the relative positions of both eyes as seen from the midpoint position have been saved. First, the information saved in this way is read in as initial values (step 240). The current image is then captured (step 242), and the midpoint position of the eyes in the current image is predicted from the midpoint position of the eyes (the position between the eyebrows) in the previous image and that in the image before it (step 244). This prediction is performed, for example, by extrapolating from the midpoint positions in the previous image and the image before it. In the vicinity of the predicted midpoint position, the position that is the center of the pattern best matching the saved between-the-eyebrows pattern of the previous image is searched for (step 246), and the positions of both eyes are then searched for around the position that showed the best match (step 248). If the obtained positions of both eyes satisfy the predetermined conditions (YES in step 250), the midpoint position of both eyes is calculated from the positions of both eyes, the between-the-eyebrows pattern centered on that midpoint and the relative positions of both eyes from the midpoint are saved, and the tracking ends for the time being. In this case, the next tracking process is performed via the path from step 106 to step 102 in FIG. 7.
[0093]
If the tracking fails, the process is performed again from the eye position extraction process through the path from step 106 to step 100 in FIG.
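The alternation between extraction and tracking described here can be summarized as a small control loop. The sketch below is schematic: `detect`, `track` and `send` are hypothetical callables standing in for eye position extraction (step 100), tracking (step 104) and transmission of the control parameters (step 102), and a return value of None is assumed to signal failure.

```python
def run_eye_tracker(frames, detect, track, send):
    """Drive extraction and tracking over a sequence of frames.

    detect(frame) -> state or None       (eye position extraction)
    track(frame, state) -> state or None (eye position tracking)
    send(state)                          (pass parameters onward)
    """
    state = None
    for frame in frames:
        if state is None:
            # No valid eye positions: (re)start from extraction (step 100).
            state = detect(frame)
        else:
            # Eye positions known: track them (step 104). A None result
            # corresponds to the failure branch of step 106.
            state = track(frame, state)
        if state is not None:
            send(state)  # step 102
```

With this structure a tracking failure costs at least one frame, because extraction is only attempted again on the following frame.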
[0094]
In this way, the positions (directions) of the eyes are extracted and tracked using the stereo cameras 56 and 58, and the parameters (eye direction and distance) for controlling the eye camera 54 are calculated from these values, so that the eye camera 54 can be controlled.
[0095]
FIG. 13 to FIG. 15 show screen displays for specific examples of eye position extraction and tracking processing according to the present embodiment.
[0096]
FIG. 13 shows an input image example. FIG. 14 shows a difference image between this image and the previous image. A pattern 270 cut out to determine left / right symmetry is displayed on the upper left of the screen, and an inter-brow pattern 272 of the input image is displayed on the lower left. In the difference image, two places 274 and 276 are extracted as eye position candidates and displayed as white circles.
[0097]
FIG. 15 shows an example in which the positions of both eyes 292 and 294 and the center position 290 are displayed superimposed on the image. As shown in FIG. 15, the positions of both eyes are correctly tracked centering on the space between the eyebrows.
[0098]
The embodiment disclosed this time should be considered as illustrative in all points and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.
[Brief description of the drawings]
FIG. 1 is an external view of a computer system 20 that implements an eye position tracking method and apparatus according to the present invention.
FIG. 2 is a block diagram of the computer system 20 and peripheral devices shown in FIG.
FIG. 3 is a diagram illustrating the principle of eye position extraction according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a difference image.
FIG. 5 is a diagram illustrating the principle of eye position extraction according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a principle of canceling a movement of a user in one embodiment of the present invention.
FIG. 7 is a flowchart showing an overall control structure of software for realizing an eye position extraction method and apparatus according to an embodiment of the present invention.
FIG. 8 is a flowchart of software that realizes eye position extraction processing according to an embodiment of the present invention.
FIG. 9 is a flowchart of software that realizes eye position extraction processing according to an embodiment of the present invention.
FIG. 10 is a flowchart of software that realizes a calculation process of a movement amount d in the eye position extraction process according to the embodiment of the invention.
FIG. 11 is a diagram for describing processing for determining left-right symmetry in the vicinity of the eyebrows in an embodiment of the present invention.
FIG. 12 is a flowchart of software that realizes eye tracking processing according to the embodiment of the present invention.
FIG. 13 is a diagram showing a display example of a current image in the embodiment of the present invention.
FIG. 14 is a diagram showing a display example of a difference image in an embodiment of the present invention.
FIG. 15 is a diagram showing a display example of eye position tracking according to the embodiment of the present invention.
[Explanation of symbols]
20 computer system, 40,82 computer, 42 monitor, 46 keyboard, 48 mouse, 54 eye camera, 56, 58 stereo camera.

Claims

A method for tracking the position of an eye of a subject in a series of images output by imaging means, performed in a computer that has a storage device and to which the imaging means for capturing a video image of the subject's face is connected, the method comprising:
a step of extracting the position of the eyes in the face image output by the imaging means,
wherein the step of extracting the eye position comprises:
calculating a difference image between two images output by the imaging means; and
extracting, from the difference image, the remaining difference area excluding the difference due to movement of the subject's entire face between the two images,
wherein the step of extracting the remaining difference area includes:
calculating a movement amount that maximizes the area in which the pixel values of an image obtained by translating each of the two images coincide with the pixel values of the difference image; and
deleting, from the difference image, the difference resulting from the calculated movement amount,
the step of extracting the eye position further comprising selecting two candidate points from the extracted difference area in accordance with the difference between the pixel values of the two images and, when the two candidate points satisfy a predetermined geometric condition, extracting the two candidate points as the eye positions;
a step of saving, in the storage device, the position between the eyebrows defined by the extracted eye positions, the relative position of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows defined by the position between the eyebrows;
an image acquisition step of acquiring an image following the image from which the eye position was extracted;
a step of predicting the position between the eyebrows of the face image in the subsequent image;
a step of searching, in the vicinity of the predicted position between the eyebrows, for a point that is the center of the area that best matches the saved image pattern between the eyebrows;
a step of predicting the positions of both eyes in the subsequent image based on the position of the searched point and the saved relative position of both eyes, and searching for the central point of each of two regions that are located around the predicted positions and satisfy a predetermined condition;
a step of taking the midpoint of the two searched central points as the new position between the eyebrows, the relative positions of the two central points with respect to the new position between the eyebrows as the new relative position of both eyes, and the image pattern of the region defined by the new position between the eyebrows as the new image pattern between the eyebrows, and updating each of the position between the eyebrows, the relative position of both eyes, and the image pattern between the eyebrows stored in the storage device; and
a step of performing the processing from the image acquisition step on an image further following the subsequent image.

The method of claim 1, wherein the step of predicting the position between the eyebrows includes extrapolating the position between the eyebrows in the subsequent image from the position between the eyebrows in the image from which the eye position was extracted and the position between the eyebrows in the image preceding that image.

The method according to claim 1, further comprising, in response to the point not being found in the step of searching for the center point of each of the two regions, extracting by a predetermined method the position between the eyebrows of the face image output by the imaging means, the relative position of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows defined by the position between the eyebrows, and restarting the processing from the image acquisition step.

The method according to any one of claims 1 to 3, wherein the step of searching for the center point of each of the two regions includes
a step of searching, in the vicinity of each of the positions determined by the relative positions of both eyes with respect to the position of the searched point, for the center of the region of a predetermined shape whose average pixel value is darkest, and taking those centers as candidate points for both eyes.

The method of claim 4, wherein the step of searching for the center point of each of the two regions further includes a step of determining whether the candidate points found in the step of searching for the center of the darkest region satisfy all of the following conditions, and failing the search if any of the conditions is not satisfied:
1) the distance between the candidate points is not less than a predetermined minimum value;
2) the distance between the candidate points is not more than a predetermined maximum value; and
3) the angle formed by the straight line connecting the candidate points and the scanning line direction satisfies a predetermined relationship.

An apparatus for tracking the position of an eye of a subject in a series of images output by imaging means, provided in a computer that has a storage device and to which the imaging means for capturing a video image of the subject's face is connected, the apparatus comprising:
means for extracting the position of the eyes in the face image output by the imaging means,
wherein the means for extracting the eye position comprises:
means for calculating a difference image between two images output by the imaging means; and
means for extracting, from the difference image, the remaining difference area excluding the difference due to movement of the subject's entire face between the two images,
wherein the means for extracting the remaining difference area includes:
means for calculating a movement amount that maximizes the area in which the pixel values of an image obtained by translating each of the two images coincide with the pixel values of the difference image; and
means for deleting, from the difference image, the difference resulting from the calculated movement amount,
the means for extracting the eye position further comprising means for selecting two candidate points from the extracted difference area in accordance with the difference between the pixel values of the two images and, when the two candidate points satisfy a predetermined geometric condition, extracting the two candidate points as the eye positions;
means for saving, in the storage device, the position between the eyebrows defined by the extracted eye positions, the relative position of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows defined by the position between the eyebrows;
image acquisition means for acquiring an image following the image from which the eye position was extracted;
prediction means for predicting the position between the eyebrows of the face image in the subsequent image;
first search means for searching, in the vicinity of the predicted position between the eyebrows, for a point that is the center of the area that best matches the saved image pattern between the eyebrows;
second search means for predicting the positions of both eyes in the subsequent image based on the position of the searched point and the saved relative position of both eyes, and searching for the central point of each of two regions that are located around the predicted positions and satisfy a predetermined condition;
updating means for taking the midpoint of the two searched central points as the new position between the eyebrows, the relative positions of the two central points with respect to the new position between the eyebrows as the new relative position of both eyes, and the image pattern of the region defined by the new position between the eyebrows as the new image pattern between the eyebrows, and updating each of the position between the eyebrows, the relative position of both eyes, and the image pattern between the eyebrows stored in the storage device; and
means for controlling the image acquisition means, the prediction means, the first and second search means, and the updating means so as to repeat the processing from the image acquisition means for an image further following the subsequent image.

The apparatus of claim 6, wherein the prediction means includes means for extrapolating the position between the eyebrows in the subsequent image from the position between the eyebrows in the image from which the eye position was extracted and the position between the eyebrows in the image preceding that image.

The apparatus according to claim 6 or claim 7, further comprising means for, in response to the second search means failing to find the point, extracting by a predetermined method the position between the eyebrows of the face image output by the imaging means, the relative position of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows defined by the position between the eyebrows, and controlling the image acquisition means, the prediction means, the first and second search means, and the updating means so that the processing is resumed from the image acquisition means.

The apparatus according to any one of claims 6 to 8, wherein the second search means includes
means for searching, in the vicinity of each of the positions determined by the relative positions of both eyes with respect to the position of the searched point, for the center of the region of a predetermined shape whose average pixel value is darkest, and taking those centers as candidate points for both eyes.

The apparatus of claim 9, wherein the second search means further includes means for determining whether the candidate points found by the means for searching for the center of the darkest region satisfy all of the following conditions, and failing the search if any of the conditions is not satisfied:
1) the distance between the candidate points is not less than a predetermined minimum value;
2) the distance between the candidate points is not more than a predetermined maximum value; and
3) the angle formed by the straight line connecting the candidate points and the scanning line direction satisfies a predetermined relationship.

A computer-executable program for controlling a computer, which has an arithmetic device and a storage device and to which imaging means for capturing a video image of a subject's face is connected, so as to execute a process for tracking the position of an eye of the subject in a series of images output by the imaging means,
the process comprising:
a step in which the arithmetic device extracts the position of the eyes in the face image output by the imaging means,
wherein the step of extracting the eye position comprises:
calculating a difference image between two images output by the imaging means; and
extracting, from the difference image, the remaining difference area excluding the difference due to movement of the subject's entire face between the two images,
wherein the step of extracting the remaining difference area includes:
calculating a movement amount that maximizes the area in which the pixel values of an image obtained by translating each of the two images coincide with the pixel values of the difference image; and
deleting, from the difference image, the difference resulting from the calculated movement amount,
the step of extracting the eye position further comprising selecting two candidate points from the extracted difference area in accordance with the difference between the pixel values of the two images and, when the two candidate points satisfy a predetermined geometric condition, extracting the two candidate points as the eye positions;
a step of saving, in the storage device, the position between the eyebrows defined by the extracted eye positions, the relative position of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows defined by the position between the eyebrows;
an image acquisition step in which the arithmetic device acquires an image following the image from which the eye position was extracted;
a step in which the arithmetic device predicts the position between the eyebrows of the face image in the subsequent image;
a step in which the arithmetic device searches, in the vicinity of the predicted position between the eyebrows, for a point that is the center of the area that best matches the saved image pattern between the eyebrows;
a step in which the arithmetic device predicts the positions of both eyes in the subsequent image based on the position of the searched point and the saved relative position of both eyes, and searches for the central point of each of two regions that are located around the predicted positions and satisfy a predetermined condition;
a step in which the arithmetic device takes the midpoint of the two searched central points as the new position between the eyebrows, the relative positions of the two central points with respect to the new position between the eyebrows as the new relative position of both eyes, and the image pattern of the region defined by the new position between the eyebrows as the new image pattern between the eyebrows, and updates each of the position between the eyebrows, the relative position of both eyes, and the image pattern between the eyebrows stored in the storage device; and
a step of performing the processing from the image acquisition step on an image further following the subsequent image.

The program according to claim 11, wherein the step of predicting the position between the eyebrows includes a step of extrapolating the position between the eyebrows in the subsequent image from the position between the eyebrows in the image from which the eye position was extracted and the position between the eyebrows in the image preceding that image.

The program according to the preceding claims, wherein the process further comprises, in response to the point not being found in the step of searching for the center point of each of the two regions, extracting by a predetermined method the position between the eyebrows of the face image output by the imaging means, the relative position of both eyes with respect to the position between the eyebrows, and the image pattern between the eyebrows defined by the position between the eyebrows, and restarting the processing from the image acquisition step.

The program according to any one of claims 11 to 13, wherein the step of searching for the center point of each of the two regions includes
a step of searching, in the vicinity of each of the positions determined by the relative positions of both eyes with respect to the position of the searched point, for the center of the region of a predetermined shape whose average pixel value is darkest, and taking those centers as candidate points for both eyes.

The program according to claim 14, wherein the step of searching for the center point of each of the two regions further includes a step of determining whether the candidate points found in the step of searching for the center of the darkest region satisfy all of the following conditions, and failing the search if any of the conditions is not satisfied:
1) the distance between the candidate points is not less than a predetermined minimum value;
2) the distance between the candidate points is not more than a predetermined maximum value; and
3) the angle formed by the straight line connecting the candidate points and the scanning line direction satisfies a predetermined relationship.