JP2004157778A

JP2004157778A - Nose position extraction method, program for operating it on computer, and nose position extraction device

Info

Publication number: JP2004157778A
Application number: JP2002322952A
Authority: JP
Inventors: Shinjiro Kawato; 慎二郎川戸
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2002-11-06
Filing date: 2002-11-06
Publication date: 2004-06-03
Anticipated expiration: 2022-11-06
Also published as: JP3980464B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for extracting a face image from image information, identifying the nose position, and tracking that position in real time.
SOLUTION: Video that continuously captures a face is processed to detect the positions of both eyes (step S104). The locally brightest point (the point of highest luminance) is extracted within a fixed region below the eyes, and this point is judged to be the nose position (step S106). Once the nose position has been extracted, a small region containing the point is stored as a template (step S121), the point that best matches the template is searched for in the following frame (step S122), the locally brightest point around the matched point is judged to be the nose position, and the nose position is tracked in this way.
COPYRIGHT: (C)2004,JPO

Description

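[0001]
[Technical field of the invention]
The present invention relates to image processing of images from a camera or the like, and in particular to the field of image recognition for extracting the position of the nose of a person's face in an image.
[0002]
[Prior art]
Video conference systems, in which people at remote locations hold a meeting over a communication link, are in practical use. In these systems, however, transmitting the video itself increases the amount of communication data. For this reason, techniques are being studied in which feature data on, for example, the gaze, face orientation, and facial expression of the target person are extracted at each site and only the extracted data are exchanged. The receiving side generates and displays a virtual image of the person's face based on these data. This allows a video conference to be held efficiently while reducing the amount of communication data.
[0003]
Likewise, in an educational system using broadcasting, for example, it is desirable for the lecturer to proceed with the lecture while watching the reactions of students at the various sites. Here too, sending video from each site to the lecturer's location increases the amount of communication data. Moreover, once the number of students becomes large it is impractical to send video of every student. It is preferable to extract each student's reaction locally by some method, send only information indicating that reaction to the lecturer, and present it to the lecturer in an abstract form as "the reaction of the group of students".
[0004]
To realize such processing, it is necessary to recognize a person's expression, posture, gaze direction, and so on from the face image. This requires locating the face and then detecting the positions of facial parts such as the eyes, nose, and mouth, where changes in expression appear prominently, and the eye positions in particular.
[0005]
At present, as a technique for locating and tracking the position of a person's whole face in video, methods that use the color information of the video to detect and track skin color have been proposed. A simpler method assumes that the background moves little and only the person moves, and detects the face region from the inter-frame difference of the video.
[0006]
Once the approximate position of the whole face has been detected in this way, techniques proposed for detecting the eyes include matching the light-dark distribution of the image in the face region against templates prepared in advance, and finding the positions of facial parts by projecting the image of the face region in the vertical and horizontal directions.
[0007]
For example, conventional techniques include a method, proposed by the inventor of the present invention, of extracting a face image from a screen using the characteristics of the area between the eyebrows of a human face (see, for example, Patent Document 1), and a method of detecting the nostrils of a human face (see, for example, Patent Document 2).
[0008]
Furthermore, there have been attempts to detect the movement on the screen of a face image detected in this way, in particular the movement of the nose, and to use it like a mouse as an interface between a computer and a human that can be used by people with impaired hands (see, for example, Non-Patent Document 1).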
[0009]
[Patent Document 1]
Japanese Unexamined Patent Application Publication No. 2001-52176
[0010]
[Patent Document 2]
Japanese Unexamined Patent Application Publication No. H10-086696
[0011]
[Non-Patent Document 1]
15th International Conference on Vision Interface Proceedings, May 27-29, 2002, Calgary, Canada, pp. 354-361, "Nouse 'Use Your Nose as a Mouse' - a New Technology for Hands-free Games and Interfaces"
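[0012]
[Problems to be solved by the invention]
However, among conventional methods, those using template matching require a large number of templates to be prepared in order to detect accurately. This demands a large storage capacity and, depending on the processing power of the arithmetic unit, a long processing time for matching. In addition, it has not necessarily been clear how to detect the nose from a face image and how to track the nose position in real time.
[0013]
In the invention disclosed in Patent Document 2, two dark regions lying roughly side by side horizontally within the face region are judged to be the nose. With this method, however, the camera positions from which the nostrils can be photographed are limited to in front of and below the target person, which narrows the range of face orientations that can be tracked.
[0014]
Furthermore, Non-Patent Document 1 does not disclose a concrete algorithm for how the nose position is tracked.
[0015]
An object of the present invention is therefore to provide a nose position extraction device that can extract a face image from image information, further identify the nose position, and track that position in real time, together with a method for doing so and a program for implementing the method on a computer.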
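[0016]
[Means for solving the problems]
The nose position extraction method according to claim 1 includes the steps of: preparing digital data of the value of each pixel in a target image region that is a human face region; extracting eye positions by filtering processing on the target image region; and identifying, as the nose position, the point of highest luminance within a predetermined nose position search region corresponding to the extracted eye positions.
[0017]
In the nose position extraction method according to claim 2, in the method according to claim 1, the nose position search region is a quadrilateral region defined as follows, where L is the distance between the two eyes: its lower side is parallel to the reference line connecting the two eyes and lies at a distance L from the reference line; its upper side lies a distance 2/3 × L vertically above the lower side; and the two lateral sides connecting the upper and lower sides extend vertically from the two eyes, keeping the distance L between them.
[0018]
In the nose position extraction method according to claim 3, in the method according to claim 1, the step of preparing digital data includes a step of preparing digital data of the value of each pixel in the target image region for each piece of screen information in a sequence that is consecutive at predetermined intervals on the time axis. The step of identifying the nose position includes: a step of storing, as a template, a small region containing the nose position identified in the screen information corresponding to a certain time; and a step of tracking the nose position by successively repeating a procedure of searching the screen information that follows the screen information corresponding to that time for a region matching the template and judging the locally highest-luminance point within the matched region to be the new nose position.
[0019]
In the nose position extraction method according to claim 4, in the method according to claim 3, the step of identifying the nose position further includes a step of predicting the position of the nose tip from the history of past nose tip positions.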
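[0020]
The program according to claim 5 is a program for causing a computer to execute a method of extracting the nose position in a target image region, the program including the steps of: preparing digital data of the value of each pixel in a target image region that is a human face region; extracting eye positions by filtering processing on the target image region; and identifying, as the nose position, the point of highest luminance within a predetermined nose position search region corresponding to the extracted eye positions.
[0021]
In the program according to claim 6, in the configuration of the program according to claim 5, the nose position search region is a quadrilateral region defined as follows, where L is the distance between the two eyes: its lower side is parallel to the reference line connecting the two eyes and lies at a distance L from the reference line; its upper side lies a distance 2/3 × L vertically above the lower side; and the two lateral sides connecting the upper and lower sides extend vertically from the two eyes, keeping the distance L between them.
[0022]
In the program according to claim 7, in the configuration of the program according to claim 5, the step of preparing digital data includes a step of preparing digital data of the value of each pixel in the target image region for each piece of screen information in a sequence that is consecutive at predetermined intervals on the time axis. The step of identifying the nose position includes: a step of storing, as a template, a small region containing the nose position identified in the screen information corresponding to a certain time; and a step of tracking the nose position by successively repeating a procedure of searching the screen information that follows the screen information corresponding to that time for a region matching the template and judging the locally highest-luminance point within the matched region to be the new nose position.
[0023]
In the program according to claim 8, in the configuration of the program according to claim 7, the step of identifying the nose position further includes a step of predicting the position of the nose tip from the history of past nose tip positions.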
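[0024]
The nose position extraction device according to claim 9 includes: means for preparing digital data of the value of each pixel in a target image region that is a human face region; means for extracting eye positions by filtering processing on the target image region; and means for identifying, as the nose position, the point of highest luminance within a predetermined nose position search region corresponding to the extracted eye positions.
[0025]
In the nose position extraction device according to claim 10, in the device according to claim 9, the nose position search region is a quadrilateral region defined as follows, where L is the distance between the two eyes: its lower side is parallel to the reference line connecting the two eyes and lies at a distance L from the reference line; its upper side lies a distance 2/3 × L vertically above the lower side; and the two lateral sides connecting the upper and lower sides extend vertically from the two eyes, keeping the distance L between them.
[0026]
In the nose position extraction device according to claim 11, in the device according to claim 9, the means for preparing digital data prepares digital data of the value of each pixel in the target image region for each piece of screen information in a sequence that is consecutive at predetermined intervals on the time axis. The means for identifying the nose position includes: means for storing, as a template, a small region containing the nose position identified in the screen information corresponding to a certain time; and means for tracking the nose position by successively repeating a procedure of searching the screen information that follows the screen information corresponding to that time for a region matching the template and judging the locally highest-luminance point within the matched region to be the new nose position.
[0027]
In the nose position extraction device according to claim 12, in the device according to claim 11, the means for identifying the nose position further includes means for predicting the position of the nose tip from the history of past nose tip positions.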
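[0028]
[Embodiments of the invention]
[Hardware configuration]
A nose position extraction device according to an embodiment of the present invention is described below. This device is implemented by software running on a computer such as a personal computer or workstation, and detects the eye positions from video of a person's face. FIG. 1 shows the external appearance of the device.
[0029]
Referring to FIG. 1, the system 20 includes a computer main body 40 equipped with a CD-ROM (Compact Disc Read-Only Memory) drive 50 and an FD (Flexible Disk) drive 52; a display 42 connected to the computer main body 40 as a display device; a keyboard 46 and a mouse 48, likewise connected to the computer main body 40, as input devices; and a camera 30, connected to the computer main body 40, for capturing images. In the device of this embodiment, a video camera containing a CCD (a solid-state image sensor) is used as the camera 30, and the device detects the eye positions of the person who is in front of the camera 30 operating the system 20.
[0030]
FIG. 2 shows the configuration of the system 20 in block-diagram form. As shown in FIG. 2, the computer main body 40 of the system 20 includes, in addition to the CD-ROM drive 50 and the FD drive 52, a CPU (Central Processing Unit) 56, a ROM (Read Only Memory) 58, a RAM (Random Access Memory) 60, a hard disk 54, and an image capture device 68 for capturing images from the camera 30, each connected to a bus 66. A CD-ROM 62 is loaded into the CD-ROM drive 50, and an FD 64 is loaded into the FD drive 52.
[0031]
As already stated, the main part of this nose position extraction device is realized by computer hardware and by software executed by the CPU 56. Such software is generally distributed on a storage medium such as the CD-ROM 62 or the FD 64, read from the medium by the CD-ROM drive 50, the FD drive 52, or the like, and stored on the hard disk 54. Alternatively, if the device is connected to a network, the software is first copied from a server on the network to the hard disk 54. It is then read from the hard disk 54 into the RAM 60 and executed by the CPU 56. When the device is connected to a network, the software may also be loaded directly into the RAM 60 and executed without being stored on the hard disk 54.
[0032]
The computer hardware shown in FIGS. 1 and 2 and its operating principle are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the FD 64 or the hard disk 54.
[0033]
As a general trend in recent years, it is common for various program modules to be provided as part of a computer's operating system, and for application programs to proceed by calling these modules in a given sequence as needed. In that case, the software for realizing the nose position extraction device does not itself contain those modules, and the device is realized only when the software cooperates with the operating system on that computer. However, as long as a common platform is used, there is no need to distribute software that includes those modules. The software without the modules, and the recording medium on which it is recorded (and the data signal, when the software is distributed over a network), can be regarded as constituting the embodiment.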
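[0034]
[Basic principle of face image extraction]
Below, as a premise for describing the nose position detection and tracking methods according to the present invention, the procedure for identifying a face image on the screen and detecting the eye positions is first described, following the above-mentioned Japanese Unexamined Patent Application Publication No. 2001-52176.
[0035]
Referring to FIG. 3, the device of this embodiment focuses on the point between the eyebrows, located between the two eyes of a person's face (in the following description, the midpoint of the line segment joining the centers of the two eyes is called the "point between the eyebrows"). This point is hereinafter called the "BEP" (Between-Eyes-Point).
[0036]
As shown in FIG. 3(a), a circle of a certain radius is drawn on a person's face image with the point between the eyebrows as its center, and the brightness of each pixel along the circumference is examined. The result is roughly as shown in FIG. 3(b). In FIG. 3(b), the horizontal axis is the position of each pixel along the circumference and the vertical axis is the brightness of each pixel. The topmost point of the circle in FIG. 3(a) is taken as the origin of the horizontal axis in FIG. 3(b), and the pixels are laid out along the horizontal axis in the order encountered when going around the circle in FIG. 3(a) counterclockwise.
[0037]
FIG. 3(b) shows that the graph forms two repetitions of "peak and valley": peak, valley, peak, valley. The meaning is clear from the face image in FIG. 3(a). In a human face image, following the above circle around the point between the eyebrows gives first the forehead (high brightness), then the right eye (low brightness), then the nose (high brightness), then the left eye (low brightness), and finally the forehead again (high brightness), so bright and dark parts alternate twice. Within a face image, the point between the eyebrows shows this feature most strongly. Other parts show it little, and where they do, less strongly than the point between the eyebrows.
[0038]
The device of this embodiment therefore assumes that such a brightness distribution exists around the point between the eyebrows. It first detects the BEP by filtering with a filter referred to below as the "ring DFT (discrete Fourier transform) filter", and then detects the positions of the eyes on either side using the BEP as a reference. The ring DFT used in this embodiment is described later.
[0039]
The device of this embodiment detects the eye positions using software with the following control structure.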
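[0040]
Referring to FIG. 4, an image is first acquired (step 80). Here, one frame of the image from the camera 30 shown in FIGS. 1 and 2 is digitized by the image capture device 68 and stored in the image memory inside the image capture device 68, and the following processing is applied to this image. For continuous processing, the following processing is repeated for each frame of the image from the camera 30.
[0041]
In step 82, BEP candidate points are extracted from one frame of image data using the ring DFT filter described above. This processing is described later with reference to FIG. 5.
[0042]
Next, among the BEP candidate points extracted in step 82 (in general there are several), those satisfying the condition that there are exactly two dark regions (corresponding to the eyes) at symmetric positions on either side are sought (step 86). BEP candidate points that do not satisfy this condition are rejected here.
[0043]
In step 88, it is determined whether the processing of step 86 yielded exactly one pair of eyes (that is, whether exactly one BEP was detected). If exactly one pair was obtained, the eye detection processing ends (step 90).
[0044]
If, on the other hand, more than one pair of eyes was detected in step 88, processing returns to step 80, and the above processing is repeated on new frames until the eyes are detected.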
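[0045]
[Extraction of candidate points using the ring DFT filter]
The ring DFT filter described above is used in the extraction of BEP candidate points in step 82. The processing of step 82 is described below with reference to FIG. 5.
[0046]
First, in step 110, the image to be processed is smoothed and reduced by 1/2 in both the vertical and horizontal directions. In the experiments, for example, the brightness of the 5 × 5 pixels surrounding each target pixel was averaged and used as the brightness of that pixel, and the image was reduced at the same time through the choice of target pixels. Smoothing removes noise in the image (which contains a relatively large share of high-frequency components). In detecting the human BEP in particular, the spectral power component of wavenumber 2 is computed, as described later, so there is no risk that this smoothing removes information needed in later processing. Reducing the number of pixels to 1/4 in this step also speeds up processing. With a sufficiently fast processor, reducing the number of pixels may not be necessary. With a slower processor, reduction to a smaller image (fewer pixels) may be required. Reducing the image too far lowers the resolution and may degrade BEP detection accuracy, so it is useful to choose a suitable resolution by experiment.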
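The smoothing and 1/2 reduction of step 110 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `smooth_and_halve`, the list-of-rows image representation, and the clipping of the averaging window at the image border are all assumptions introduced here.

```python
def smooth_and_halve(image, window=5):
    """Average a window x window neighbourhood around every second pixel.

    `image` is a list of rows of brightness values. Sampling every second
    pixel in both directions halves the width and height, so the pixel
    count drops to roughly 1/4. Border handling (clipping the window to
    the image) is an assumption; the text does not specify it.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            total = 0
            count = 0
            for yy in range(max(0, y - half), min(height, y + half + 1)):
                for xx in range(max(0, x - half), min(width, x + half + 1)):
                    total += image[yy][xx]
                    count += 1
            row.append(total / count)
        result.append(row)
    return result
```

Choosing the target pixels with a stride of 2 is what performs the reduction "at the same time" as the averaging, so no separate resampling pass is needed.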
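[0047]
Next, the head region of the target person is estimated from the image obtained in this way (step 112). As mentioned earlier, this can be done with an algorithm that uses color information to track the skin-colored region, or with an algorithm that extracts, from the difference between the previous frame and the current frame, the region that appears to have moved between the two frames and takes it as the head region. This embodiment uses the inter-frame difference. The estimated region may have any shape, but a rectangular region is appropriate given the simplicity of the region calculation. Depending on the conditions, another shape may be more efficient. When the head has hardly moved, no inter-frame difference is obtained. In that case the head is assumed not to have moved, and the head region estimated in the immediately preceding processing is used.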
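One way to picture the head region estimate of step 112 is a bounding rectangle around the pixels that changed between two frames. The sketch below rests on assumptions: the function name `estimate_head_region`, the brightness threshold of 20, and the `(x_min, y_min, x_max, y_max)` return format are illustrative and do not come from the specification.

```python
def estimate_head_region(prev_frame, curr_frame, threshold=20, previous_region=None):
    """Return the bounding rectangle (x_min, y_min, x_max, y_max) of moved pixels.

    A pixel counts as "moved" when its brightness changed by more than
    `threshold` between the two frames (the threshold is an assumed value).
    If nothing moved, the previously estimated region is reused, as the
    text describes for a head that has hardly moved.
    """
    xs, ys = [], []
    for y, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for x, (p, c) in enumerate(zip(prev_row, curr_row)):
            if abs(c - p) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return previous_region
    return (min(xs), min(ys), max(xs), max(ys))
```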
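[0048]
Next, filtering with the ring DFT filter is performed within the head region obtained in this way (step 114). Specifically, starting for example from the top-left pixel of the head region, the following calculation is performed for each pixel in turn, on the pixels lying on a circle of predetermined size centered on that pixel, as shown in FIG. 6.
[0049]
[Equation 1]
F_2 = \sum_{k=0}^{N-1} f_k \, e^{-4\pi i k / N} \qquad (1)
[0050]
In this equation, N is the number of points on the circle, and k is the index assigned to each point in counterclockwise order, with the topmost point on the circle (the position corresponding to the "north pole") as 0. f_k (k = 0, ..., N-1) is the brightness of the k-th pixel on the circle, and i is the imaginary unit. Equation (1) is the n = 2 case of the DFT coefficient obtained by the general discrete Fourier transform shown next.
[0051]
[Equation 2]
F_n = \sum_{k=0}^{N-1} f_k \, e^{-2\pi i n k / N} \qquad (2)
[0052]
The transform in equation (1) computes the spectral power component of wavenumber 2 contained in the brightness waveform along the circle described above (see FIG. 3(b)). In this embodiment, the calculation used a circle radius of 7 pixels and N = 36. Because the size of the face region changes with the distance between the person and the camera, accuracy improves, when that distance is expected to vary considerably, if the radius of the circle is changed to suit the approximate size of the face region already obtained. If it is known that the person hardly moves in that way, the radius may be fixed in advance.
[0053]
This calculation gives, for every pixel in the head region, the value of the wavenumber-2 spectral power on the circle centered on that pixel.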
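The ring DFT calculation can be illustrated with a short sketch. It assumes the conventional DFT sign in the exponent, nearest-pixel sampling on the circle, and an image stored as a list of rows with y increasing downward. The names `ring_points`, `dft_coefficient`, and `ring_dft` are hypothetical; only the radius of 7 pixels and N = 36 come from the text.

```python
import cmath
import math


def ring_points(cx, cy, radius=7, n=36):
    """Sample n points on a circle, k = 0 at the top, counterclockwise.

    With image y growing downward, going counterclockwise from the top
    moves toward the left of the image first.
    """
    points = []
    for k in range(n):
        angle = 2 * math.pi * k / n
        x = cx - radius * math.sin(angle)
        y = cy - radius * math.cos(angle)
        points.append((int(round(x)), int(round(y))))
    return points


def dft_coefficient(samples, n_wave):
    """DFT coefficient F_n of a list of samples f_0 .. f_{N-1}."""
    n = len(samples)
    return sum(f * cmath.exp(-2j * math.pi * n_wave * k / n)
               for k, f in enumerate(samples))


def ring_dft(image, cx, cy, n_wave=2, radius=7, n=36):
    """F_n of the brightness along the circle centred on (cx, cy)."""
    samples = [image[y][x] for x, y in ring_points(cx, cy, radius, n)]
    return dft_coefficient(samples, n_wave)
```

For an ideal light-dark pattern such as `100 + 50 * cos(4 * pi * k / N)`, `dft_coefficient(samples, 2)` is real and positive while the wavenumber-1 coefficient vanishes. A uniformly bright neighbourhood gives a response of zero, which is why the filter is insensitive to overall brightness.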
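[0054]
The distribution of values obtained by performing the above calculation for each pixel contains parts where the value is especially high. At those parts, the surrounding circle is considered to contain a large wavenumber-2 component of the kind described above, so they qualify as BEP candidate points. In the present invention, following a closed curve (typically a circle) centered on each target point in the image and obtaining the information that results from applying a DFT to the pixel values along it (not only brightness, but possibly also hue, saturation, and so on) is called "filtering with a ring DFT filter".
[0055]
From the distribution over the target screen of the values filtered in this way by the ring DFT filter, the points showing local maxima are selected as BEP candidate points (step 116).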
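Selecting local maxima from the filter response (step 116) might look like the following. This is a sketch under assumptions: the 8-neighbour comparison, the strict inequality, and the function name `local_maxima` are illustrative choices, and the threshold stands in for the empirically chosen selection threshold.

```python
def local_maxima(response, threshold=0.0):
    """Return (x, y, value) for interior pixels that exceed all 8 neighbours.

    `response` is a 2-D list holding the ring DFT filter output per pixel.
    Values at or below `threshold` are ignored; the threshold itself is
    an assumed parameter to be set empirically.
    """
    height, width = len(response), len(response[0])
    peaks = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            value = response[y][x]
            if value <= threshold:
                continue
            neighbours = [response[yy][xx]
                          for yy in (y - 1, y, y + 1)
                          for xx in (x - 1, x, x + 1)
                          if (yy, xx) != (y, x)]
            if all(value > nb for nb in neighbours):
                peaks.append((x, y, value))
    return peaks
```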
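[0056]
The detected candidate points include the true BEP. As stated earlier, the true BEP is almost certainly surrounded by a bright, dark, bright, dark distribution of regions. As a result of the processing in step 114, the true BEP therefore almost certainly shows a local maximum, and it is extracted as a candidate point in step 116 almost without exception. A feature of this method is that the true BEP is extracted robustly and almost reliably in this way. The threshold used for selection is determined mainly empirically according to the features the target image should have.
[0057]
Next, in step 118, the BEP candidates are narrowed down among the multiple local maxima by taking into account local features characteristic of the BEP.
[0058]
For example, at the actual BEP there should be bright regions above (forehead) and below (nose) and dark regions to the left and right (the two eyes). The real part of the result of equation (1) must therefore be positive. Points that yield a non-positive real part are not the BEP and are removed from the candidates.
[0059]
For the same reason, the following can be said about the projections, in the vertical and horizontal directions, of the image centered on the true BEP. Referring to FIG. 7, as shown in (a), the light-dark distribution is darkest at the center in the vertical direction and brightest at the center in the horizontal direction. The distribution should also be roughly symmetric about the center. When there are multiple candidate points, the same vertical and horizontal projections are created for each, and those that do not meet the above conditions are rejected.
[0060]
As another criterion, the brightness centroid of a small region centered on the BEP candidate point is calculated, and if the distance between that centroid and the BEP candidate point exceeds a threshold, the candidate point is excluded.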
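The brightness-centroid criterion can be sketched as below. The window half-size of 3 pixels and the function name `centroid_offset` are assumptions, and the window is assumed to lie entirely inside the image. The caller would compare the returned distance against an empirically chosen threshold.

```python
import math


def centroid_offset(image, cx, cy, half_size=3):
    """Distance between (cx, cy) and the brightness centroid around it.

    The centroid is the brightness-weighted mean position over a square
    window of side 2 * half_size + 1 (an assumed size). The window is
    assumed to fit inside the image.
    """
    total = sum_x = sum_y = 0.0
    for y in range(cy - half_size, cy + half_size + 1):
        for x in range(cx - half_size, cx + half_size + 1):
            weight = image[y][x]
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    if total == 0:
        return 0.0
    return math.hypot(sum_x / total - cx, sum_y / total - cy)
```

A candidate whose surroundings are symmetric gives an offset near zero, while a candidate sitting next to a single bright patch is pulled toward it and can be rejected.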
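[0061]
The candidate points can be narrowed further using the following property of the ring DFT filter. In the general expression for F_n above (equation (2)), F_1 is calculated at each pixel by setting n = 1. The ratio to the F_2 obtained at that pixel (F_1/F_2) is then calculated, and the criterion used is that the smaller this value, the higher the probability that the point is the true BEP. For the following reason, this value can serve as a measure of how closely the light-dark distribution on the circle centered on a pixel matches (or how far it departs from) an ideal sine curve.
[0062]
The values calculated from equation (2) with n = 1, 2, ... indicate the spectral power of the wavenumber components with wavenumbers 1, 2, ... on the circle. If the light-dark distribution on the circle coincided ideally with the sine curve for n = 2, then F_n = 0 would hold for every n other than 2. In practice the light-dark distribution never coincides exactly with a sine curve, but if it is close to the ideal sine curve, F_1 will be small and F_2 will be relatively large. So if F_1/F_2 is small, the light-dark distribution around the target pixel can be considered close to that around an actual BEP, and if it is large, far from it. This is why F_1/F_2 can be used as a measure.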
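The two tests described above (a positive real part of F_2, and a small F_1/F_2) can be combined in a small scoring function. This is an illustrative sketch: taking the ratio of magnitudes |F_1|/|F_2| is an assumption about how the ratio is formed, and the names `dft_coefficient` and `bep_score` are hypothetical.

```python
import cmath
import math


def dft_coefficient(samples, n_wave):
    """DFT coefficient F_n of the samples taken around the circle."""
    n = len(samples)
    return sum(f * cmath.exp(-2j * math.pi * n_wave * k / n)
               for k, f in enumerate(samples))


def bep_score(samples):
    """Return |F_1| / |F_2|, or None when the candidate is rejected.

    A candidate is rejected when the real part of F_2 is not positive,
    i.e. when the bright regions are not above and below the centre.
    A smaller score means the ring looks more like an ideal
    wavenumber-2 sine curve.
    """
    f1 = dft_coefficient(samples, 1)
    f2 = dft_coefficient(samples, 2)
    if f2.real <= 0:
        return None
    return abs(f1) / abs(f2)
```

For example, an ideal ring `100 + 50 * cos(2 * a)` (with `a` the angle from the top) scores close to 0, while adding a lopsided term `25 * cos(a)` raises the score to about 0.5.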
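[0063]
F_3, F_4, and so on should, like F_1, be 0 for the ideal light-dark distribution. F_3/F_2, F_4/F_2, and so on could therefore also be used as criteria. However, these represent the amounts of higher wavenumber components and are therefore more susceptible to noise, so the results are less reliable than with F_1/F_2.
[0064]
Steps 86 and 88 of FIG. 4 then test whether this BEP is the true BEP.
[0065]
By the procedure described above, the system of this embodiment detects the BEP using the ring DFT filter. The ring DFT filter extracts feature points such as the BEP solely from the wavenumber components present in the light-dark distribution of the image. It is therefore little affected by variations in the overall brightness of the image. Also, even when the face is slightly tilted, the wavenumber components in the brightness distribution around a point are invariant to rotation of the image, so the above technique gives feature point extraction that is robust to rotation. The same applies when the face is turned slightly sideways. Even if the face is turned so far that both eyes are only just visible, as long as both eyes are present in the image the light-dark arrangement described above still exists around the point between the eyebrows, so the BEP can be extracted almost reliably with the technique described above. And even when the target person's eyes are closed, those regions are still darker than the forehead and nose, so the BEP can be detected almost reliably with the above technique. The BEP, and in turn the eye positions on either side of it, can therefore be detected with high reliability.
[0066]
In the example above, the DFT coefficient was calculated for points on a circle centered on each pixel. However, the present invention is not applicable only to points on a circle. The above calculation may be performed on any other closed curve, provided that the curve has a predetermined positional relationship to the point to be extracted as a feature point and the expected light-dark distribution along it is known. That said, it is the circle that gives results robust to rotation, so a circle will often be the best choice.
[0067]
Furthermore, although the above embodiment used the wavenumber components in the light-dark distribution on a single circle centered on each pixel, it will be clear to those skilled in the art that the number of circles used is not limited to one. For example, if it is known in advance that different light-dark distributions should exist at different distances from the center around the feature point to be extracted, the above calculation may be performed on multiple circles (or closed curves) accordingly, and the feature point may be extracted by combining the results.
[0068]
In the example above, the DFT was used to calculate the wavenumber components. The DFT is considered the most efficient choice, but all that the function in the above example needs to do is extract the wavenumber components in the light-dark distribution along the curve. It will therefore also be clear to those skilled in the art that the usable techniques are not limited to the DFT, and that any function for extracting wavenumber components, including the general Fourier transform, can be used.
[0069]
In addition, in the above embodiment the ring DFT filter was applied to pixel brightness. However, the present invention is not limited to this. For example, filtering with the ring DFT filter may be applied to values such as the hue or saturation of each pixel. Depending on the properties the feature point to be detected should have, the values to be filtered may also be obtained by applying a predetermined operation to the brightness, hue, saturation, or other values of each pixel.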
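[0070]
[Detection of the nose position from the face image]
The above makes it possible to identify the position of the point between the eyebrows and the positions of the eyes of a human face on the screen. The following describes the procedure for further identifying the nose position once the eye positions have been identified in this way, and for tracking that nose position.
[0071]
FIG. 8 illustrates the concept underlying the procedure for detecting the nose position from a face image in the present invention.
[0072]
Referring to FIG. 8, when light from a light source strikes a glossy spherical surface, a highlight spot is formed on the sphere where the light from the source is reflected.
[0073]
FIG. 9 illustrates the phenomenon that appears in a face image according to the concept shown in FIG. 8.
[0074]
As shown in FIG. 9, the nose tip is not an ideal sphere, but it can effectively be regarded as one and has a certain degree of gloss. A highlight reflecting the light source therefore forms on the nose tip, which is the most protruding part of the face.
[0075]
In the present invention, screen information containing a face and consecutive at predetermined intervals on the time axis, for example video of a face captured continuously, is first processed to detect the position of the point between the eyebrows and the positions of both eyes by the method based on filtering with the ring DFT filter described above.
[0076]
Then, as described below, the locally brightest point (the point of highest luminance) is extracted within a fixed region below the eyes. If the triangle formed by the two eye positions and that point satisfies certain geometric conditions, the point is judged to be the nose position.
[0077]
Once the nose position has been extracted, a small region containing that point is stored as a template. In the next frame, the point that best matches the template is searched for, and the locally brightest point around the matched point is judged to be the nose position. The nose position is tracked in this way.
[0078]
FIG. 10 illustrates the fixed region below the eyes in which the nose position is searched for after the eye positions have been detected.
[0079]
Referring to FIG. 10, the nose position is searched for in a quadrilateral region defined as follows, where L is the distance between the two eyes: its lower side is parallel to the line connecting the two eyes (the reference line) and lies at a distance L from the reference line; its upper side lies a distance 2/3 × L vertically above the lower side; and the two lateral sides connecting the upper and lower sides extend vertically from the two eyes, keeping the distance L between them. However, the distance between the upper and lower sides is not necessarily limited to 2/3 × L, nor is the spacing of the lateral sides limited to L. These values may be adjusted as appropriate according to the statistical properties of the face images to be detected.
[0080]
Within the fixed region shown in FIG. 10, the locally brightest point is extracted. That point can be identified as the position of the nose tip.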
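The search region and highlight extraction can be sketched as follows. The sketch interprets "vertical" as perpendicular to the reference line through the eyes, so the region rotates with a tilted face; this interpretation, the function name `find_nose`, and the brute-force scan over all pixels are assumptions made for illustration. The triangle condition mentioned earlier is not included because its details are not given.

```python
import math


def find_nose(image, eye_a, eye_b):
    """Return (x, y, brightness) of the brightest pixel in the search region.

    The region spans the eye-to-eye segment in the direction along the
    reference line, and the band from L/3 to L below that line, where L
    is the eye distance (upper side = lower side minus 2/3 * L).
    Image y is assumed to grow downward. Returns None if the region
    lies outside the image.
    """
    (ax, ay), (bx, by) = sorted([eye_a, eye_b])  # left-to-right in the image
    length = math.hypot(bx - ax, by - ay)
    ux, uy = (bx - ax) / length, (by - ay) / length  # along the reference line
    vx, vy = -uy, ux                                 # perpendicular, pointing down
    best = None
    for y in range(len(image)):
        for x in range(len(image[0])):
            dx, dy = x - ax, y - ay
            along = dx * ux + dy * uy
            below = dx * vx + dy * vy
            if 0 <= along <= length and length / 3 <= below <= length:
                if best is None or image[y][x] > best[2]:
                    best = (x, y, image[y][x])
    return best
```

Because only pixels inside the region are compared, a brighter spot elsewhere on the face (for example on the forehead) does not affect the result.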
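[0081]
FIG. 11 shows the face image when the face shown in FIG. 9 is turned slightly sideways. Even when the face is turned sideways to the extent shown in FIG. 11, the highlight indicating the nose tip is still found within the region shown in FIG. 10.
[0082]
FIG. 12 is a flowchart for explaining the nose position identification method and the nose position tracking method of the present invention.
[0083]
Referring to FIG. 12, the variable t that identifies the image (frame) to be processed is first initialized to 1 (step S100).
[0084]
Next, the image of the t-th frame is acquired (step S102), and the face image is extracted and the eye positions are identified (step S104). The processing in steps S102 and S104 is basically the same as the eye position detection described with reference to FIG. 4.
[0085]
Once the eye positions have been detected, the highlight point of the nose tip is extracted within the fixed region described with reference to FIG. 10 (step S106).
[0086]
If the nose tip highlight point is successfully extracted in the t-th frame, processing proceeds to step S112. If extraction fails, the variable t is incremented by 1 (step S110) and processing returns to step S102.
[0087]
In step S112, a small region of predetermined size and shape centered on the highlight point is saved as the nose tip template pattern T, for example on the hard disk 54.
[0088]
The nose tip template pattern may be a small region of predetermined size centered on the highlight point, or a small region of predetermined size offset from the highlight point by a predetermined distance.
[0089]
Next, the variable t is incremented by 1 (step S114), and the image of the (t+1)-th frame is acquired (step S116).
[0090]
Next, the position of the nose tip is predicted from the history of past nose tip positions (step S118). The prediction uses the nose tip position X(t) in the previous frame and the nose tip position X(t-1) in the frame before that, according to the following equation.
[0091]
X(t+1) = X(t) + X(t) - X(t-1)
If X(t-1) does not exist, X(t) is used as the value of X(t-1).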
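The prediction of step S118 is a constant-velocity extrapolation and can be written in a few lines. The function name `predict_next` and the representation of positions as `(x, y)` tuples in a list are assumptions for illustration.

```python
def predict_next(history):
    """Predict X(t+1) = X(t) + X(t) - X(t-1) from a list of (x, y) positions.

    `history` holds past nose tip positions, oldest first. When only one
    position is available, X(t-1) is taken to be X(t), so the prediction
    equals the last known position.
    """
    x_t = history[-1]
    x_prev = history[-2] if len(history) >= 2 else x_t
    return (2 * x_t[0] - x_prev[0], 2 * x_t[1] - x_prev[1])
```

For instance, a nose tip that moved from (10, 20) to (13, 22) is predicted at (16, 24) in the next frame.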
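[0092]
Next, a nose tip search region of predetermined size and shape centered on the predicted nose tip position is determined (step S120), and the matching point that best agrees with the template pattern T is searched for within the nose tip search region (step S122).
[0093]
The brightest point within a predetermined region centered on the matching point is then searched for, and that point is taken as the nose tip highlight point of the (t+1)-th frame (step S124). Processing then returns to step S112.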
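Steps S120 to S124 can be sketched as a template search followed by a brightness refinement. The text does not name a matching measure, so the sum of squared differences used here is an assumption, as are the search and refinement window sizes, the square odd-sized template, and the names `crop`, `ssd`, and `track_nose`.

```python
def crop(image, cx, cy, half):
    """Cut out the square patch of side 2 * half + 1 centred on (cx, cy)."""
    return [row[cx - half:cx + half + 1] for row in image[cy - half:cy + half + 1]]


def ssd(patch, template):
    """Sum of squared differences (an assumed matching measure)."""
    return sum((p - t) ** 2
               for patch_row, template_row in zip(patch, template)
               for p, t in zip(patch_row, template_row))


def track_nose(frame, template, predicted, search_half=4, refine_half=2):
    """Find the nose tip highlight near the predicted position.

    1. Search a window around `predicted` for the best template match.
    2. Return the brightest pixel in a small window around that match.
    Returns None if no template position fits inside the frame.
    """
    half = len(template) // 2
    height, width = len(frame), len(frame[0])
    px, py = predicted
    best = None
    for cy in range(max(half, py - search_half), min(height - half, py + search_half + 1)):
        for cx in range(max(half, px - search_half), min(width - half, px + search_half + 1)):
            score = ssd(crop(frame, cx, cy, half), template)
            if best is None or score < best[0]:
                best = (score, cx, cy)
    if best is None:
        return None
    _, mx, my = best
    brightest = None
    for y in range(max(0, my - refine_half), min(height, my + refine_half + 1)):
        for x in range(max(0, mx - refine_half), min(width, mx + refine_half + 1)):
            if brightest is None or frame[y][x] > frame[brightest[1]][brightest[0]]:
                brightest = (x, y)
    return brightest
```

In a tracking loop, the returned point would be appended to the position history and a fresh template cropped around it, mirroring the return to step S112.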
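[0094]
With the processing described above, the nose position can be detected in real time from screen information that is consecutive at predetermined intervals on the time axis, for example from consecutive frame images. Furthermore, by detecting the nose position in each piece of such consecutive screen information in succession, the nose position can be tracked.
[0095]
Such nose position tracking can be used, for example, in place of a mouse in a computer's human-machine interface.
[0096]
The embodiments disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.
[0097]
[Effects of the invention]
As described above, according to the present invention, the nose position can be detected in real time from consecutive screen information. Furthermore, by detecting the nose position in each piece of such consecutive screen information in succession, the nose position can be tracked.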
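[Brief description of the drawings]
[FIG. 1] An external view of a system according to an embodiment of the present invention.
[FIG. 2] A block diagram showing the hardware configuration of the system according to an embodiment of the present invention.
[FIG. 3] A diagram for explaining the principle of the present invention.
[FIG. 4] A flowchart of the eye position detection processing executed by the system according to Embodiment 1 of the present invention.
[FIG. 5] A flowchart of the processing for extracting candidate points for the point between the eyebrows from image data.
[FIG. 6] A diagram showing the calculation path of the ring DFT filter.
[FIG. 7] A schematic diagram for explaining the local features of the point between the eyebrows.
[FIG. 8] A diagram for explaining the concept underlying the procedure for detecting the nose position from a face image in the present invention.
[FIG. 9] A diagram for explaining the phenomenon that appears in a face image according to the concept shown in FIG. 8.
[FIG. 10] A diagram explaining the fixed region below the eyes in which the nose position is searched for after the eye positions have been detected.
[FIG. 11] A diagram showing the face image when the face shown in FIG. 9 is turned slightly sideways.
[FIG. 12] A flowchart for explaining the nose position identification method and the nose position tracking method of the present invention.
[Explanation of reference numerals]
20 nose position extraction device, 30 camera, 40 computer main body, 42 monitor.
[0001]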
TECHNICAL FIELD OF THE INVENTION
The present invention relates to image processing for processing an image from a camera or the like, and more particularly, to the field of image recognition for extracting a nose position of a person's face in an image.
[0002]
[Prior art]
Video conference systems, in which people at remote locations hold a meeting over a communication link, are in practical use. In these systems, however, transmitting the video itself increases the amount of communication data. For this reason, techniques are being studied in which feature data on, for example, the gaze, face orientation, and facial expression of the target person are extracted at each site and only the extracted data are exchanged. The receiving side generates and displays a virtual image of the person's face based on these data. This allows a video conference to be held efficiently while reducing the amount of communication data.
[0003]
Likewise, in an educational system using broadcasting, for example, it is desirable for the lecturer to proceed with the lecture while watching the reactions of students at the various sites. Here too, sending video from each site to the lecturer's location increases the amount of communication data. Moreover, once the number of students becomes large it is impractical to send video of every student. It is preferable to extract each student's reaction locally by some method, send only information indicating that reaction to the lecturer, and present it to the lecturer in an abstract form as "the reaction of the group of students".
[0004]
To realize such processing, it is necessary to recognize a person's expression, posture, gaze direction, and so on from the face image. This requires locating the face and then detecting the positions of facial parts such as the eyes, nose, and mouth, where changes in expression appear prominently, and the eye positions in particular.
[0005]
At present, as a technique for locating and tracking the position of a person's whole face in video, methods that use the color information of the video to detect and track skin color have been proposed. A simpler method assumes that the background moves little and only the person moves, and detects the face region from the inter-frame difference of the video.
[0006]
Once the approximate position of the whole face has been detected in this way, techniques proposed for detecting the eyes include matching the light-dark distribution of the image in the face region against templates prepared in advance, and finding the positions of facial parts by projecting the image of the face region in the vertical and horizontal directions.
[0007]
For example, conventional techniques include a method, proposed by the inventor of the present invention, of extracting a face image from a screen using the characteristics of the area between the eyebrows of a human face (see, for example, Patent Document 1), and a method of detecting the nostrils of a human face (see, for example, Patent Document 2).
[0008]
Furthermore, there have been attempts to detect the movement on the screen of a face image detected in this way, in particular the movement of the nose, and to use it like a mouse as an interface between a computer and a human that can be used by people with impaired hands (see, for example, Non-Patent Document 1).
[0009]
[Patent Document 1]
JP 2001-52176 A
[0010]
[Patent Document 2]
JP-A-10-086696
[0011]
[Non-patent document 1]
15th International Conference on Vision Interface Proceedings May 27-29, 2002 Calgary Canada pp. 354-361 "Nouse" Using the Nose as a Mouse "New Technology for Hands-Free Games and Interfaces" (15th International Conference on Vision Interface Proceedings May 27-29, 2002 Calgary, Canada, pp. 354-361, "Nouse""Use Your Nose as a Mouse" -a New technology For Hands-free Games and Interfaces)
[0012]
[Problems to be solved by the invention]
However, among conventional methods, those using template matching require a large number of templates to be prepared in order to detect accurately. This demands a large storage capacity and, depending on the processing power of the arithmetic unit, a long processing time for matching. In addition, it has not necessarily been clear how to detect the nose from a face image and how to track the nose position in real time.
[0013]
In the invention disclosed in Patent Document 2, two dark regions lying roughly side by side horizontally within the face region are judged to be the nose. With this method, however, the camera positions from which the nostrils can be photographed are limited to in front of and below the target person, which narrows the range of face orientations that can be tracked.
[0014]
Furthermore, Non-Patent Document 1 does not disclose a concrete algorithm for how the nose position is tracked.
[0015]
An object of the present invention is therefore to provide a nose position extraction device that can extract a face image from image information, further identify the nose position, and track that position in real time, together with a method for doing so and a program for implementing the method on a computer.
[0016]
[Means for Solving the Problems]
The nose position extraction method according to claim 1 includes the steps of: preparing digital data of the value of each pixel in a target image region that is a human face region; extracting eye positions by filtering processing on the target image region; and identifying, as the nose position, the point of highest luminance within a predetermined nose position search region corresponding to the extracted eye positions.
[0017]
In the nose position extraction method according to claim 2, in the method according to claim 1, the nose position search region is a quadrilateral region defined as follows, where L is the distance between the two eyes: its lower side is parallel to the reference line connecting the two eyes and lies at a distance L from the reference line; its upper side lies a distance 2/3 × L vertically above the lower side; and the two lateral sides connecting the upper and lower sides extend vertically from the two eyes, keeping the distance L between them.
[0018]
In the nose position extraction method according to claim 3, in the method according to claim 1, the step of preparing digital data includes a step of preparing digital data of the value of each pixel in the target image region for each piece of screen information in a sequence that is consecutive at predetermined intervals on the time axis. The step of identifying the nose position includes: a step of storing, as a template, a small region containing the nose position identified in the screen information corresponding to a certain time; and a step of tracking the nose position by successively repeating a procedure of searching the screen information that follows the screen information corresponding to that time for a region matching the template and judging the locally highest-luminance point within the matched region to be the new nose position.
[0019]
In the nose position extraction method according to claim 4, in the method according to claim 3, the step of identifying the nose position further includes a step of predicting the position of the nose tip from the history of past nose tip positions.
[0020]
The program according to claim 5 is a program for causing a computer to execute a method of extracting the nose position in a target image region, the program including the steps of: preparing digital data of the value of each pixel in a target image region that is a human face region; extracting eye positions by filtering processing on the target image region; and identifying, as the nose position, the point of highest luminance within a predetermined nose position search region corresponding to the extracted eye positions.
[0021]
In the program according to claim 6, in the configuration of the program according to claim 5, the nose position search region is a quadrilateral region defined as follows, where L is the distance between the two eyes: its lower side is parallel to the reference line connecting the two eyes and lies at a distance L from the reference line; its upper side lies a distance 2/3 × L vertically above the lower side; and the two lateral sides connecting the upper and lower sides extend vertically from the two eyes, keeping the distance L between them.
[0022]
In the program according to claim 7, in the configuration of the program according to claim 5, the step of preparing digital data includes a step of preparing digital data of the value of each pixel in the target image region for each piece of screen information in a sequence that is consecutive at predetermined intervals on the time axis. The step of identifying the nose position includes: a step of storing, as a template, a small region containing the nose position identified in the screen information corresponding to a certain time; and a step of tracking the nose position by successively repeating a procedure of searching the screen information that follows the screen information corresponding to that time for a region matching the template and judging the locally highest-luminance point within the matched region to be the new nose position.
[0023]
In the program according to claim 8, in the configuration of the program according to claim 7, the step of identifying the nose position further includes a step of predicting the position of the nose tip from the history of past nose tip positions.
[0024]
The nose position extraction device according to claim 9 includes: means for preparing digital data of the value of each pixel in a target image region that is a human face region; means for extracting eye positions by filtering processing on the target image region; and means for identifying, as the nose position, the point of highest luminance within a predetermined nose position search region corresponding to the extracted eye positions.
[0025]
In the nose position extraction device according to claim 10, in the device according to claim 9, the nose position search region is a quadrilateral region defined as follows, where L is the distance between the two eyes: its lower side is parallel to the reference line connecting the two eyes and lies at a distance L from the reference line; its upper side lies a distance 2/3 × L vertically above the lower side; and the two lateral sides connecting the upper and lower sides extend vertically from the two eyes, keeping the distance L between them.
[0026]
In the nose position extraction device according to claim 11, in the device according to claim 9, the means for preparing digital data prepares digital data of the value of each pixel in the target image region for each piece of screen information in a sequence that is consecutive at predetermined intervals on the time axis. The means for identifying the nose position includes: means for storing, as a template, a small region containing the nose position identified in the screen information corresponding to a certain time; and means for tracking the nose position by successively repeating a procedure of searching the screen information that follows the screen information corresponding to that time for a region matching the template and judging the locally highest-luminance point within the matched region to be the new nose position.
[0027]
A nose position extracting device according to a twelfth aspect is the nose position extracting device according to the eleventh aspect, wherein the means for specifying the nose position further includes a means for predicting a nose head existing position from a past nose head position history.
[0028]
BEST MODE FOR CARRYING OUT THE INVENTION
[Hardware configuration]
Hereinafter, a nose position extracting device according to an embodiment of the present invention will be described. This nose position extracting device is realized by software executed on a computer such as a personal computer or a workstation, and detects a position of an eye from an image of a person's face. FIG. 1 shows the appearance of the nose position extracting device.
[0029]
Referring to FIG. 1, a system 20 includes a computer main body 40 including a CD-ROM (Compact Disc Read-Only Memory) drive 50 and an FD (Flexible Disk) drive 52, and a display device connected to the computer main body 40. It includes a display 42, a keyboard 46 and a mouse 48 as input devices also connected to the computer main body 40, and a camera 30 connected to the computer main body 40 for capturing images. In the apparatus of this embodiment, a video camera including a CCD (solid-state imaging device) is used as the camera 30, and a process for detecting the position of the eyes of a person operating the system 20 in front of the camera 30 is performed. I do.
[0030]
FIG. 2 is a block diagram showing the configuration of the system 20. As shown in FIG. 3, the computer main body 40 constituting the system 20 includes a CPU (Central Processing Unit) 56 and a ROM (Read) connected to a bus 66, respectively, in addition to a CD-ROM drive 50 and an FD drive 52. It includes an only memory 58, a random access memory (RAM) 60, a hard disk 54, and an image capturing device 68 for capturing an image from the camera 30. A CD-ROM 62 is mounted on the CD-ROM drive 50. An FD 64 is mounted on the FD drive 52.
[0031]
As described above, the main part of the nose position extracting device is realized by computer hardware and software executed by the CPU 56. Generally, such software is stored and distributed in a storage medium such as a CD-ROM 62 or an FD 64, and is read from the storage medium by a CD-ROM drive 50 or an FD drive 52 and temporarily stored in a hard disk 54. Alternatively, when the device is connected to a network, the device is temporarily copied to a hard disk 54 from a server on the network. Then, the data is further read from the hard disk 54 to the RAM 60 and executed by the CPU 56. When a network connection is established, the program may be directly loaded into the RAM 60 and executed without being stored in the hard disk 54.
[0032]
The hardware itself and the operating principle of the computer shown in FIGS. 1 and 2 are general. Therefore, the most essential part of the present invention is the software stored in the storage medium such as the FD 64 and the hard disk 54.
[0033]
As a recent general tendency, various program modules are provided as part of a computer's operating system, and an application program proceeds with its processing by calling these modules in a predetermined sequence as needed. In such a case, the software for realizing the nose position extracting device does not itself include such modules, and the nose position extracting device is realized only when the software cooperates with the operating system on the computer. However, as long as a general platform is used, there is no need to distribute software that includes such modules. The software that does not include them, a recording medium on which that software is recorded, and a data signal carrying that software when it is distributed over a network can each be considered to constitute an embodiment.
[0034]
[Basic principle of face image extraction]
In the following, as a premise for explaining the nose position detection method and tracking method according to the present invention, the procedure for identifying a face image in a screen and detecting the eye positions according to the above-described Japanese Patent Application Laid-Open No. 2001-52176 will first be described.
[0035]
Referring to FIG. 3, the apparatus according to the present embodiment first detects, in the face of a person, the point between the eyebrows, which lies between the eyes (in the following description, the midpoint of the line connecting the centers of the two eyes is referred to as the point between the eyebrows). The point between the eyebrows is hereinafter referred to as “BEP” (Between-Eyes-Point).
[0036]
As shown in FIG. 3A, suppose that a circle of a certain radius is drawn around the point between the eyebrows in a face image of a person and the brightness of each pixel along the circumference is examined. The result is as schematically shown in FIG. 3B. In FIG. 3B, the horizontal axis represents the position of each pixel along the circumference, and the vertical axis represents the brightness of each pixel. The uppermost point of the circumference shown in FIG. 3A is taken as the origin of the horizontal axis in FIG. 3B, and the pixels are arranged along the horizontal axis in the order in which the circumference shown in FIG. 3A is traced.
[0037]
Referring to FIG. 3B, it can be seen that this graph consists of two repetitions of peak and valley, namely peak, valley, peak, valley. The meaning of this is clear from the face image shown in FIG. 3A. That is, tracing the above-mentioned circumference centered on the point between the eyebrows in a human face image, one passes first the forehead (high brightness), then the right eye (low brightness), the nose (high brightness), the left eye (low brightness), and finally the forehead again (high brightness), so that bright and dark portions alternate twice. In a face image, this feature appears most prominently at the point between the eyebrows. Other parts rarely show it, and even where they do, it is weaker than at the point between the eyebrows.
[0038]
Therefore, in the apparatus according to the present embodiment, on the assumption that such a brightness distribution exists around the point between the eyebrows, the BEP is first detected by filtering with a filter called a “ring DFT (discrete Fourier transform) filter”. The positions of the eyes on either side of the BEP are then detected based on the BEP. The ring DFT filter used in the present embodiment will be described later.
[0039]
The device of the present embodiment detects the position of the eyes using software having the following control structure.
[0040]
Referring to FIG. 4, first, an image is obtained (step 80). Here, one frame of the image obtained from the camera 30 shown in FIGS. 1 and 2 is digitized by the image capturing device 68 and stored in the image memory of the image capturing device 68, and the following processing is performed on it. When processing is performed continuously, the following processing is repeated for each frame of the image obtained from the camera 30.
[0041]
That is, in step 82, candidate points for the BEP are extracted from the image data for one frame by using the above-described ring DFT filter. This processing will be described later with reference to FIG. 5.
[0042]
Next, among the (generally multiple) BEP candidate points extracted in step 82, those satisfying the condition that two dark areas (corresponding to the eyes) exist at symmetrical positions on both sides are searched for (step 86). Candidate points that do not satisfy this condition are rejected here.
[0043]
In step 88, it is determined whether exactly one pair of eyes was obtained as a result of the processing in step 86 (that is, whether exactly one BEP was detected). If exactly one pair was obtained, the eye detection processing ends (step 90).
[0044]
On the other hand, if more than one pair of eyes is detected in step 88, the process returns to step 80, and the above processing is repeated on a new frame.
[0045]
[Extraction of Candidate Points Using Ring DFT Filter]
The above-described ring DFT filter is used in the extraction of BEP candidate points performed in step 82. Hereinafter, the process of step 82 will be described with reference to FIG. 5.
[0046]
First, in step 110, the image to be processed is smoothed and reduced in the vertical and horizontal directions. In the experiment, for example, the brightness of the 5 × 5 pixels around each target pixel was averaged to give the brightness of that pixel, and the image was reduced at the same time by the choice of target pixels. The smoothing removes noise (which has relatively high frequency components) contained in the image. In the detection of a human BEP, the spectral power component of wave number 2 is calculated, as described later, so this smoothing does not remove information required for the subsequent processing. Reducing the number of pixels to 1/4 in this step also speeds up the processing. If a sufficiently fast processor is used, it may be unnecessary to reduce the number of pixels. Conversely, if a slower processor is used, it may be necessary to reduce the image further (to fewer pixels). However, if the image is reduced too much, the resolution drops and the accuracy of BEP detection may fall. It is therefore useful to select an appropriate resolution by experiment.
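The smoothing and reduction of step 110 can be sketched as follows. This is only an illustration: the function name `smooth_and_reduce`, the list-of-rows image representation, the border clipping, and the choice of every second pixel in each direction (which gives the 1/4 pixel count mentioned above) are assumptions, not details taken from the embodiment.

```python
def smooth_and_reduce(img, window=5, step=2):
    """Box-average `img` (a list of rows of brightness values) over a
    window x window neighbourhood, keeping only every `step`-th pixel."""
    h, w = len(img), len(img[0])
    r = window // 2
    out = []
    for y in range(0, h, step):
        row = []
        for x in range(0, w, step):
            # Clip the averaging window at the image border (an assumption).
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            row.append(total / (len(ys) * len(xs)))
        out.append(row)
    return out
```

With `step=2` the output has half as many rows and half as many columns as the input, that is, one quarter of the pixels.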
[0047]
Subsequently, the head region of the target person is estimated from the image thus obtained (step 112). For this processing, an algorithm that tracks the skin-color area using color information, as described above, can be used, or an algorithm that extracts the area considered to have moved between two frames from the difference between the previous frame and the current frame and estimates the head region from it. In the present embodiment, the inter-frame difference is used. The estimated area may have any shape, but a rectangular area is appropriate because it keeps the area calculation simple. Depending on the conditions, however, another shape may be more efficient. If the head hardly moves, no inter-frame difference is obtained. In that case, the head is assumed not to have moved, and the head area estimated in the immediately preceding processing is used.
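A minimal sketch of head region estimation from the inter-frame difference might look like the following. The function name `head_region`, the brightness-change threshold of 15, and the use of a simple bounding rectangle are hypothetical choices made for illustration.

```python
def head_region(prev, curr, threshold=15, fallback=None):
    """Return (x_min, y_min, x_max, y_max) bounding the pixels whose
    brightness changed by more than `threshold` between two frames."""
    xs, ys = [], []
    for y, (row_p, row_c) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_p, row_c)):
            if abs(c - p) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        # No motion detected: reuse the previously estimated region.
        return fallback
    return (min(xs), min(ys), max(xs), max(ys))
```

Passing the previous estimate as `fallback` mirrors the rule that the preceding head area is reused when no difference is obtained.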
[0048]
Next, filtering using the ring DFT filter is performed within the range of the head region thus obtained (step 114). Specifically, for example, the following calculation is performed on the pixels on a circumference of a predetermined size as shown in FIG. 6.
[0049]
(Equation 1)

[0050]
In this expression, N is the number of points on the circumference, and k is the index assigned to each point, with k = 0 at the topmost point (the position corresponding to the “North Pole”) and increasing counterclockwise. f_k (k = 0, ..., N−1) is the brightness of the k-th pixel on the circumference, and i is the imaginary unit. Expression (1) corresponds to the case n = 2 of the DFT coefficients obtained by the following general discrete Fourier transform.
[0051]
(Equation 2)

[0052]
By the conversion shown in Expression (1), the spectral power component of wave number 2 contained in the above-described brightness variation waveform on the circumference (see FIG. 3B) is calculated. In the present embodiment, the calculation is performed with a circle radius of 7 pixels and N = 36. Since the size of the face area changes with the distance between the person and the camera, accuracy is higher, when that distance is expected to vary widely, if the radius of the circle is changed to match the approximate size of the face area already obtained. However, if it is known that the person hardly moves in that way, the radius may be fixed in advance.
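For reference, Expressions (1) and (2) can be written out as below. This is a reconstruction from the surrounding definitions, assuming the usual unnormalized DFT with a negative exponent. The lowercase f_k for the brightness of the k-th pixel and the symbol P_2 for the spectral power are notations introduced here, and any normalization factor used in the original expressions is not known.

```latex
% General discrete Fourier transform, Expression (2)
F_n = \sum_{k=0}^{N-1} f_k \, e^{-2\pi i \, n k / N}

% Case n = 2, Expression (1)
F_2 = \sum_{k=0}^{N-1} f_k \, e^{-4\pi i \, k / N}

% Spectral power of wave number 2 (assumed definition)
P_2 = \lvert F_2 \rvert^{2}
```

With N = 36 as in the embodiment, the exponent in the second line becomes −πik/9.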
[0053]
With this calculation, for all the pixels in the head region, the value of the spectral power of wave number 2 on the circumference around the pixel is calculated.
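The calculation for a single pixel can be sketched as follows, assuming the standard unnormalized DFT form for Expression (1) and nearest-pixel sampling of the circle, neither of which is specified in the text. The name `ring_dft` and its parameters are illustrative, and the circle is assumed to lie entirely inside the image.

```python
import cmath
import math

def ring_dft(img, cx, cy, radius=7, n_points=36, wave_number=2):
    """DFT coefficient F_n of the brightness sampled on a circle around (cx, cy).

    k = 0 is the topmost point and k increases counterclockwise, as in the text.
    Image rows grow downward, so "up" means decreasing y.
    """
    coeff = 0j
    for k in range(n_points):
        angle = 2 * math.pi * k / n_points
        # Counterclockwise from the top: the path first moves toward the left.
        x = int(round(cx - radius * math.sin(angle)))
        y = int(round(cy - radius * math.cos(angle)))
        coeff += img[y][x] * cmath.exp(-2j * math.pi * wave_number * k / n_points)
    return coeff
```

Applying `abs(ring_dft(img, x, y)) ** 2` to every pixel of the head region would give the distribution of wave number 2 spectral power described above. A uniformly bright neighbourhood yields a coefficient of essentially zero, which is why the filter is insensitive to overall brightness.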
[0054]
In the distribution of values obtained by performing the above calculation for each pixel, there are portions with particularly high values. These portions are considered to have a large wave number 2 component on the circumference around them, as described above, and therefore qualify as candidate points for the BEP. In this manner, tracing a closed curve, typically a circle, centered on each target point of the image and obtaining information by applying the DFT to the pixel values on that closed curve (not only brightness but also hue, saturation, and so on) is called “filtering by a ring DFT filter” in the present invention.
[0055]
In this way, a point showing a local maximum value is selected from the distribution of values filtered by the ring DFT filter in the target screen and set as a candidate point for BEP (step 116).
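The selection of local maxima in step 116 could be sketched as follows. The 8-neighbour comparison, the function name `local_maxima`, and the explicit `threshold` argument are assumptions. The text only states that the threshold is chosen empirically.

```python
def local_maxima(power, threshold):
    """Return (x, y) of interior pixels whose value exceeds `threshold`
    and is not smaller than any of its 8 neighbours."""
    h, w = len(power), len(power[0])
    peaks = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = power[y][x]
            if v <= threshold:
                continue
            neighbours = [power[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dx, dy) != (0, 0)]
            if all(v >= n for n in neighbours):
                peaks.append((x, y))
    return peaks
```

Here `power` stands for the map of filtered values over the head region, and each returned coordinate is a BEP candidate point.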
[0056]
The true BEP is included among the candidate points thus detected. As described above, an arrangement of light, dark, light, and dark areas almost certainly exists around the true BEP. Therefore, the processing in step 114 almost certainly yields a local maximum at the true BEP, and it is consequently almost always extracted as a candidate point in step 116. A feature of this method is thus that the true BEP is extracted robustly and almost without fail. Note that the threshold used for the selection is determined mainly empirically according to the characteristics of the target image.
[0057]
Subsequently, in step 118, processing of narrowing down BEP candidates is performed in consideration of local features characteristic of BEP among a plurality of local maximum values.
[0058]
For example, in an actual BEP, there should be a bright region above (forehead) and below (nose), and a dark region on the left and right (both eyes). Therefore, the calculation result of Expression (1) should always have a positive real part. Those that produce non-positive real parts are not BEPs and are excluded from the candidate.
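The sign condition can be seen by writing out the real part of Expression (1), assuming it has the standard DFT form with f_k denoting the brightness of the k-th pixel and k = 0 at the top of the circle:

```latex
\operatorname{Re} F_2 = \sum_{k=0}^{N-1} f_k \cos\!\left(\frac{4\pi k}{N}\right)
```

The cosine weight is +1 at k = 0 and k = N/2 (top and bottom, that is, forehead and nose) and −1 at k = N/4 and k = 3N/4 (left and right, that is, the eyes). Bright pixels above and below and dark pixels on both sides therefore make the sum positive.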
[0059]
For the same reason, the following can be said of the projections, in the vertical and horizontal directions, of an image centered on the true BEP. Referring to FIG. 7, the brightness distribution is darkest at the center in the vertical direction and brightest at the center in the horizontal direction, and each distribution should be approximately symmetric about the center. Therefore, when there are a plurality of candidate points, the same vertical and horizontal projections are created for each, and those that do not meet the above conditions are rejected.
[0060]
As another criterion, the brightness centroid (center of gravity) of a small area centered on the BEP candidate point is calculated, and if the distance between this centroid and the BEP candidate point exceeds a threshold, the candidate point is excluded.
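The centroid criterion can be illustrated with the sketch below. The square window, its half-width of 3 pixels, and the name `centroid_offset` are hypothetical. The window is assumed to lie inside the image.

```python
import math

def centroid_offset(img, cx, cy, half=3):
    """Distance between (cx, cy) and the brightness-weighted centroid of the
    (2 * half + 1)-pixel square window centred on it."""
    total = sx = sy = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            w = img[y][x]
            total += w
            sx += w * x
            sy += w * y
    if total == 0:
        return 0.0  # completely dark window: no offset can be measured
    return math.hypot(sx / total - cx, sy / total - cy)
```

A candidate would be rejected when the returned distance exceeds a threshold chosen for the application.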
[0061]
Furthermore, the candidate points can be narrowed down using the following property of the ring DFT filter. That is, F1 is calculated for each pixel by setting n = 1 in the above-described general expression for Fn (Expression (2)). The ratio of F1 to the F2 obtained for the same pixel (F1 / F2) is then calculated, and the criterion is that the smaller this value, the higher the probability that the pixel is the true BEP. This value is considered to indicate how closely the distribution of light and dark on the circle centered on a pixel matches the ideal sine curve (or how far it departs from it), for the following reason.
[0062]
The values calculated for n = 1, 2, ... in Expression (2) indicate the spectral powers of the components of wave number 1, 2, ..., respectively. If the distribution of light and dark on the circumference coincides exactly with a sine curve of wave number 2, then Fn = 0 for the other wave numbers n = 1, 3, 4, ... below N/2 (F0 merely reflects the mean brightness). Of course, the actual distribution of light and dark does not match the sine curve exactly, but if it is close to the ideal sine curve, F1 will be small and F2 relatively large. Therefore, if F1 / F2 is small, the light and dark distribution around the target pixel can be considered close to the actual distribution around the BEP, and if it is large, far from it. This is why F1 / F2 can be used as a measure.
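The behaviour of the ratio can be checked numerically with the sketch below. It assumes the standard unnormalized DFT for Expression (2) and takes the ratio of magnitudes |F1| / |F2|. The function names and the two synthetic brightness profiles are illustrative only.

```python
import cmath
import math

def dft_coefficient(samples, n):
    """n-th DFT coefficient of brightness values sampled around a closed curve."""
    N = len(samples)
    return sum(f * cmath.exp(-2j * math.pi * n * k / N) for k, f in enumerate(samples))

def f1_over_f2(samples):
    """Ratio |F1| / |F2|; smaller means closer to an ideal wave-number-2 pattern."""
    f2 = abs(dft_coefficient(samples, 2))
    return float("inf") if f2 == 0 else abs(dft_coefficient(samples, 1)) / f2

N = 36
# Ideal profile: mean brightness plus a pure wave-number-2 cosine.
ideal = [128 + 100 * math.cos(4 * math.pi * k / N) for k in range(N)]
# Same profile with an added wave-number-1 component (one side brighter).
lopsided = [v + 60 * math.cos(2 * math.pi * k / N) for k, v in enumerate(ideal)]
```

For `ideal` the ratio is zero up to rounding error, while for `lopsided` it equals the ratio of the two amplitudes, 60 / 100.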
[0063]
Note that, like F1, F3, F4, and so on should also be 0 for an ideal light-dark distribution. It is therefore conceivable to use F3 / F2, F4 / F2, or the like as a criterion. However, these represent higher wave number components and are therefore more susceptible to noise, so the results are less reliable than with F1 / F2.
[0064]
Thereafter, whether each BEP candidate is a true BEP is determined in steps 86 and 88 of FIG. 4.
[0065]
According to the procedure described above, the system of this embodiment detects the BEP using the ring DFT filter. The ring DFT filter extracts a feature point such as the BEP solely from the wave number components present in the light and dark distribution of the image. The result is therefore hardly affected by fluctuations in the overall brightness of the image. Moreover, the wave number component of the brightness distribution around a point is invariant to rotation of the image, so the method extracts feature points robustly even when the face is slightly tilted. The same holds when the face is turned slightly sideways. Even if the face is turned so far that both eyes are barely visible, the light and dark arrangement described above still exists around the point between the eyebrows as long as both eyes are present in the image, so the BEP can be extracted almost certainly using the technique described above. Further, even if the target person has closed eyes, the eye areas are still darker than the forehead and nose, so the BEP can again be detected almost certainly. The BEP and the positions of the eyes on either side of it can therefore be detected with high reliability.
[0066]
In the above example, the DFT coefficients were calculated for points on a circumference centered on each pixel. However, the invention is not limited to points on a circumference. The above calculation may be performed on any other closed curve, provided that the closed curve has a predetermined positional relationship with the point to be extracted as a feature point and the distribution of light and dark along it is known in advance. A circle is nevertheless often optimal, because it gives results that are robust to rotation.
[0067]
Furthermore, in the above-described embodiment, the wave number component in the light and dark distribution on one circle centered on each pixel is used, but it will be apparent to those skilled in the art that the number of circles used is not limited to one. For example, if it is known beforehand that regions at different distances from the feature point to be extracted should show different light and dark distributions, the above calculation may be performed on each of a plurality of circumferences (or closed curves) set accordingly, and the results may be combined to extract the feature point.
[0068]
In the above example, the DFT is used for calculating the wave number component. The DFT is considered the most efficient choice, but the function required in the above example only needs to be able to extract the wave number component in the light and dark distribution on the circumference. It will therefore also be apparent to those skilled in the art that the usable techniques are not limited to the DFT and that any function for extracting wave number components, including the general Fourier transform, can be used.
[0069]
Further, in the above-described embodiment, the ring DFT filter processes the brightness of the pixels. However, the present invention is not limited to this. For example, the hue or saturation of each pixel may be filtered by the ring DFT filter. It is also conceivable to filter a value obtained by performing a predetermined operation on the brightness, hue, saturation, and so on of each pixel according to the properties of the feature point to be detected.
[0070]
[Detection of nose position from face image]
According to the above description, the position between the eyebrows and the position of the eyes of the human face can be specified from the screen. In the following, a procedure for further specifying the position of the nose after the position of the eye is specified in this way and further tracking the position of the nose will be described.
[0071]
FIG. 8 is a diagram for explaining a concept serving as a premise of a procedure for detecting a nose position from a face image according to the present invention.
[0072]
Referring to FIG. 8, when light from a light source is applied to a glossy spherical surface, light from the light source is reflected on the spherical surface to form a highlight spot.
[0073]
FIG. 9 is a diagram for explaining a phenomenon appearing in a face image according to the concept shown in FIG.
[0074]
As shown in FIG. 9, the nose head (the tip of the nose) is not an ideal spherical surface, but it can be regarded as approximately spherical and has a certain degree of gloss. A highlight reflecting the light of the light source is therefore generated at the nose head in particular, since it is the most protruding part of the face.
[0075]
In the present invention, first, screen information that includes a face and is continuous at predetermined intervals on the time axis, for example a video image obtained by continuously capturing a face, is processed, and the position of the point between the eyebrows and the positions of both eyes are detected by the method using the above-described ring DFT filter.
[0076]
Then, as described below, a locally brightest point (a point with the highest luminance) is extracted in a certain range area below both eyes. If the triangle formed by both eye positions and the point meets certain geometric conditions, the point is determined to be the nose position.
[0077]
Furthermore, once the nose position has been extracted, a small area including that point is stored as a template, the next frame is searched for the point that best matches the template, and the locally brightest point around the matching point is determined to be the nose position. In this way the nose position is tracked.
[0078]
FIG. 10 is a diagram for explaining a fixed range area below both eyes for searching for a nose position after detecting the position of the eyes.
[0079]
Referring to FIG. 10, where the distance between both eyes is L, the area searched for the nose position is a quadrilateral region defined as follows. The lower side is parallel to the line connecting both eyes (the reference line) and separated from the reference line by a distance L. The upper side is separated vertically upward from the lower side by a distance 2/3 × L. The two lateral sides connecting the upper side and the lower side extend vertically from both eyes while maintaining a distance L between them. However, the distance between the upper side and the lower side is not necessarily limited to 2/3 × L, and the interval between the lateral sides is not limited to the distance L. These values may be adjusted as appropriate according to the properties of the target.
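The geometry of the search area can be sketched as follows. The sketch assumes image coordinates with y increasing downward and a roughly upright face, and it interprets "vertically" as perpendicular to the reference line. The function name `nose_search_area` is illustrative, and the two eye positions are assumed to be distinct.

```python
import math

def nose_search_area(eye_a, eye_b):
    """Corners (upper-left, upper-right, lower-right, lower-left) of the
    quadrilateral nose search area, in image coordinates with y growing downward."""
    left, right = sorted([eye_a, eye_b])  # order by x so the normal points down the face
    L = math.dist(left, right)            # inter-eye distance
    ux, uy = (right[0] - left[0]) / L, (right[1] - left[1]) / L  # along the reference line
    vx, vy = -uy, ux                                              # perpendicular, toward the nose

    def shift(p, d):
        return (p[0] + vx * d, p[1] + vy * d)

    upper, lower = L / 3, L  # lower side at L; upper side 2/3 * L above it
    return [shift(left, upper), shift(right, upper),
            shift(right, lower), shift(left, lower)]
```

For eyes at (40, 50) and (100, 50), L is 60 and the area spans x from 40 to 100 and y from 70 to 110.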
[0080]
Within the fixed range area shown in FIG. 10, the locally brightest point is extracted. That point can be specified as the position of the nose head.
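A simple way to pick the highlight is sketched below. It takes the single brightest pixel in the area rather than performing a true local-maximum analysis, which is a simplification. The function name `brightest_point` and the optional `inside` predicate (for non-rectangular areas) are hypothetical.

```python
def brightest_point(img, x_range, y_range, inside=lambda x, y: True):
    """Return ((x, y), brightness) of the brightest pixel within the given ranges
    for which inside(x, y) is true, or None if no pixel qualifies."""
    best = None
    for y in y_range:
        for x in x_range:
            if inside(x, y) and (best is None or img[y][x] > best[1]):
                best = ((x, y), img[y][x])
    return best
```

A return value of `None` corresponds to the case in which extraction of the highlight point fails.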
[0081]
FIG. 11 is a diagram showing a face image when the face shown in FIG. 9 faces slightly sideways.
Even when the face is turned to the side as shown in FIG. 11, it can be seen that the highlight indicating the nose head exists in the area shown in FIG.
[0082]
FIG. 12 is a flowchart for explaining a nose position specifying method and a nose position tracking method according to the present invention.
[0083]
Referring to FIG. 12, first, the value of a variable t for specifying an image (frame) to be processed is initialized to "1" (step S100).
[0084]
Subsequently, an image of the t-th frame is obtained (step S102), and a face image is extracted and the eye positions are specified (step S104). The processing in steps S102 and S104 is basically the same as the eye position detection processing described with reference to FIG. 4.
[0085]
When the position of the eye is detected, the highlight point of the nose head is extracted in the fixed area range described with reference to FIG. 10 (step S106).
[0086]
If the highlight point of the nose head is successfully extracted in the t-th frame, the process proceeds to step S112. On the other hand, if the extraction of the highlight point fails, the value of the variable t is incremented by 1 (step S110), and the process returns to step S102.
[0087]
In step S112, a small area of a predetermined size and shape centered on the highlight point is saved as a nose head template pattern T, for example, on the hard disk 54.
[0088]
The nose head template pattern may be a small area of a predetermined size centered on the highlight point, or a small area of a predetermined size offset by a predetermined distance from the highlight point.
[0089]
Subsequently, the value of the variable t is incremented by 1 (step S114), and an image of the (t + 1) th frame is obtained (step S116).
[0090]
Next, the nose head existence position is predicted from the past nose head position history (step S118). In this prediction, prediction is performed by the following equation using the nose head position X (t) in the previous frame and the nose head position X (t-1) in the frame two frames before.
[0091]
X(t+1) = X(t) + (X(t) − X(t−1))
If X(t−1) does not exist, X(t) is used as the value of X(t−1).
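The prediction rule amounts to a constant-velocity extrapolation and can be sketched as follows. The function name `predict_next` and the representation of the history as a list of (x, y) tuples are assumptions.

```python
def predict_next(history):
    """Predict X(t+1) = X(t) + (X(t) - X(t-1)) from a list of past (x, y) positions."""
    x_t = history[-1]
    x_prev = history[-2] if len(history) >= 2 else x_t  # no X(t-1): reuse X(t)
    return (2 * x_t[0] - x_prev[0], 2 * x_t[1] - x_prev[1])
```

For example, a nose head that moved from (10, 20) to (13, 18) is predicted at (16, 16) in the next frame, and with a single known position the prediction is that same position.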
[0092]
Next, a nose head search area having a predetermined size and shape centered on the predicted nose head position is determined (step S120), and the matching point that best matches the template pattern T is searched for in the nose head search area (step S122).
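The template search can be sketched as below. The text does not say which matching measure is used, so the sum of squared differences here is an assumption, as are the square search window and the name `match_template`. Positions refer to the top-left corner of the template.

```python
def match_template(img, template, predicted, search_radius):
    """Return the (x, y) top-left corner, within `search_radius` of the predicted
    top-left corner, where `template` has the smallest sum of squared
    differences against `img`."""
    th, tw = len(template), len(template[0])
    h, w = len(img), len(img[0])
    best_pos, best_cost = None, None
    y_lo, y_hi = max(0, predicted[1] - search_radius), min(h - th, predicted[1] + search_radius)
    x_lo, x_hi = max(0, predicted[0] - search_radius), min(w - tw, predicted[0] + search_radius)
    for y in range(y_lo, y_hi + 1):
        for x in range(x_lo, x_hi + 1):
            cost = sum((img[y + j][x + i] - template[j][i]) ** 2
                       for j in range(th) for i in range(tw))
            if best_cost is None or cost < best_cost:
                best_pos, best_cost = (x, y), cost
    return best_pos
```

Restricting the search to a window around the predicted position keeps the cost per frame small, which matters for real-time tracking.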
[0093]
A brightest point is searched for in a predetermined area centered on the matching point, and that point is set as a nose head highlight point of the (t + 1) th frame (step S124). Then, the process returns to step S112.
[0094]
With the processing as described above, the position of the nose can be detected in real time from continuous screen information at predetermined intervals on the time axis, for example, continuous frame images. Furthermore, the nose position can be tracked by continuously detecting the nose position in each of such continuous screen information.
[0095]
Such tracking of the nose position can be used, for example, in place of a mouse in the man-machine interface of a computer.
[0096]
The embodiments disclosed this time are to be considered in all respects as illustrative and not restrictive. The scope of the present invention is defined by the terms of the claims, rather than the description above, and is intended to include any modifications within the scope and meaning equivalent to the terms of the claims.
[0097]
[Effect of the Invention]
As described above, according to the present invention, the position of the nose can be detected in real time from continuous screen information. Furthermore, the nose position can be tracked by continuously detecting the nose position in each of such continuous screen information.
[Brief description of the drawings]
FIG. 1 is an external view of a system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a hardware configuration of a system according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the principle of the present invention.
FIG. 4 is a flowchart of eye position detection processing executed by the system according to the first embodiment of the present invention;
FIG. 5 is a flowchart of a process for extracting candidate points between eyebrows from image data.
FIG. 6 is a diagram illustrating a calculation path of a ring DFT filter.
FIG. 7 is a schematic diagram for explaining local features between eyebrows.
FIG. 8 is a diagram for explaining a concept serving as a premise of a procedure for detecting a nose position from a face image according to the present invention.
FIG. 9 is a diagram for explaining a phenomenon that appears in a face image according to the concept shown in FIG. 8;
FIG. 10 is a diagram illustrating a fixed range area below both eyes for searching for a nose position after detecting an eye position.
FIG. 11 is a diagram showing a face image when the face shown in FIG. 9 is slightly turned sideways.
FIG. 12 is a flowchart illustrating a nose position specifying method and a nose position tracking method according to the present invention.
[Explanation of symbols]
20 Nose position extraction device, 30 camera, 40 computer main body, 42 monitor.

Claims

1. A nose position extraction method comprising the steps of:
preparing digital data of the value of each pixel in a target image area that is a human face area;
extracting the positions of the eyes by filtering processing in the target image area; and
specifying, as a nose position, the point having the highest luminance in a predetermined nose position search area corresponding to the extracted eye positions.

2. The nose position extraction method according to claim 1, wherein, when the distance between both eyes is L, the nose position search area is a quadrilateral area whose lower side is parallel to a reference line connecting both eyes and separated from the reference line by a distance L, whose upper side is separated vertically upward from the lower side by a distance 2/3 × L, and whose two lateral sides connecting the upper side and the lower side extend in the vertical direction from both eyes while maintaining a distance L.

3. The nose position extraction method, wherein
the step of preparing the digital data includes a step of preparing digital data of the value of each pixel in the target image area for each piece of screen information continuous at predetermined intervals on the time axis, and
the step of specifying the nose position includes the steps of:
storing, as a template, a small area including the nose position specified in the screen information corresponding to a certain time; and
tracking the nose position by successively repeating a procedure of searching for an area matching the template in the screen information following the screen information corresponding to the certain time and determining the locally brightest point in the matched area as a new nose position.

4. The nose position extraction method according to claim 3, wherein the step of specifying the nose position further includes a step of predicting a nose head existing position from a past nose head position history.

5. A program for causing a computer to execute a method of extracting a nose position in a target image area, the program comprising the steps of:
preparing digital data of the value of each pixel in the target image area, which is a human face area;
extracting the positions of the eyes by filtering processing in the target image area; and
specifying, as a nose position, the point having the highest luminance in a predetermined nose position search area corresponding to the extracted eye positions.

6. The program according to claim 5, wherein, when the distance between both eyes is L, the nose position search area is a quadrilateral area whose lower side is parallel to a reference line connecting both eyes and separated from the reference line by a distance L, whose upper side is separated vertically upward from the lower side by a distance 2/3 × L, and whose two lateral sides connecting the upper side and the lower side extend in the vertical direction from both eyes while maintaining a distance L.

7. The program, wherein
the step of preparing the digital data includes a step of preparing digital data of the value of each pixel in the target image area for each piece of screen information continuous at predetermined intervals on the time axis, and
the step of specifying the nose position includes the steps of:
storing, as a template, a small area including the nose position specified in the screen information corresponding to a certain time; and
tracking the nose position by successively repeating a procedure of searching for an area matching the template in the screen information following the screen information corresponding to the certain time and determining the locally brightest point in the matched area as a new nose position.

8. The program according to claim 7, wherein the step of specifying the nose position further includes a step of predicting a nose head existing position from a past nose head position history.

9. A nose position extracting device comprising:
means for preparing digital data of the value of each pixel in a target image area that is a human face area;
means for extracting the positions of the eyes by filtering processing in the target image area; and
means for specifying, as a nose position, the point having the highest luminance in a predetermined nose position search area corresponding to the extracted eye positions.

10. The nose position extracting device according to claim 9, wherein, when the distance between both eyes is L, the nose position search area is a quadrilateral area whose lower side is parallel to a reference line connecting both eyes and separated from the reference line by a distance L, whose upper side is separated vertically upward from the lower side by a distance 2/3 × L, and whose two lateral sides connecting the upper side and the lower side extend in the vertical direction from both eyes while maintaining a distance L.

11. The nose position extracting device according to claim 9, wherein
the means for preparing the digital data prepares digital data of the value of each pixel in the target image area for each piece of screen information continuous at predetermined intervals on the time axis, and
the means for specifying the nose position includes:
means for storing, as a template, a small area including the nose position specified in the screen information corresponding to a certain time; and
means for tracking the nose position by successively repeating a procedure of searching for an area matching the template in the screen information following the screen information corresponding to the certain time and determining the locally brightest point in the matched area as a new nose position.

12. The nose position extracting device according to claim 11, wherein the means for specifying the nose position further includes means for predicting a nose head existing position from a past nose head position history.