JP2004030629A

JP2004030629A - Face detection apparatus, face detection method, robotic device, program, and recording medium

Info

Publication number: JP2004030629A
Application number: JP2003133601A
Authority: JP
Inventors: Hidehiko Morisada; 森貞　英彦
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-05-10
Filing date: 2003-05-12
Publication date: 2004-01-29
Anticipated expiration: 2023-05-12
Also published as: JP4329398B2

Abstract

<P>PROBLEM TO BE SOLVED: To provide a face detection apparatus and method, a robot device, a program and a recording medium capable of extremely improving a real time properties by reducing computational complexity. <P>SOLUTION: In the face detection apparatus, a scale converting and determining part 90 determines the scale of an input image and a template size indicating an average face, and a window segmenting part 91 segments a window image and sends the window image to a template matching part 92 together with a template. The template matching part 92 generates a matching result image being a two-dimensional array of correlative values correlating the input image and the template, divides the matching result image into a plurality of areas of the same size as the template to find the maximum value of the correlative values for each divided area, and extracts the maximum values being equal to or greater than a prescribed threshold as face candidates. Then similar processing is repeated for each scale, pre-processing is performed, and a pattern recognizing part 95 identifies the face image from the face candidates by SVM (Support Vector Machine). <P>COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、入力画像から対象物の顔を検出する顔検出装置及びその方法、並びに顔検出装置を搭載してエンターテインメント性の向上等を図ったロボット装置、並びに顔検出する動作をコンピュータに実行させるためのプログラム及びこのプログラムを記録した記録媒に関する。
【０００２】
【従来の技術】
電気的又は磁気的な作用を用いて人間（生物）の動作に似た運動を行う機械装置を「ロボット」という。我が国においてロボットが普及し始めたのは、１９６０年代末からであるが、その多くは、工場における生産作業の自動化・無人化等を目的としたマニピュレータ及び搬送ロボット等の産業用ロボット（Ｉｎｄｕｓｔｒｉａｌ
Ｒｏｂｏｔ）であった。
【０００３】
最近では、人間のパートナーとして生活を支援する、即ち住環境その他の日常生活上の様々な場面における人的活動を支援する実用ロボットの開発が進められている。このような実用ロボットは、産業用ロボットとは異なり、人間の生活環境の様々な局面において、個々に個性の相違した人間、又は様々な環境への適応方法を自ら学習する能力を備えている。例えば、犬又は猫のように４足歩行の動物の身体メカニズム及びその動作を模した「ペット型」ロボット、或いは、２足直立歩行を行う人間等の身体メカニズム及びその動作をモデルにしてデザインされた「人間型」又は「人間形」ロボット（Ｈｕｍａｎｏｉｄ　Ｒｏｂｏｔ）等のロボット装置は、既に実用化されつつある。
【０００４】
これらのロボット装置は、産業用ロボットと比較して、例えばエンターテインメント性を重視した様々な動作等を行うことができるため、エンターテインメントロボットと呼称される場合もある。また、そのようなロボット装置には、ＣＣＤ（Ｃｈａｒｇｅ　　Ｃｏｕｐｌｅｄ　Ｄｅｖｉｃｅ）カメラ及びマイクロホン等の各種外部センサが搭載され、これら外部センサの出力に基づいて外部状況を認識して、外部からの情報及び内部の状態に応じて自律的に動作するものがある。
【０００５】
【発明が解決しようとする課題】
ところで、かかるエンターテインメント型のロボット装置において、対話中にその相手となる人間の顔や、移動中に視界内に入る人間の顔を検出して、その人間の顔を見ながら対話や動作を行うことができれば、人間が普段行う場合と同様に、その自然性から考えて最も望ましく、エンターテインメントロボット装置としてのエンターテインメント性をより一層向上させ得るものと考えられる。
【０００６】
従来、動画のような複雑な画像シーンの中から色や動きを使うことなく画像信号に基づく濃淡パターンのみを使って、人間の顔を検出する方法が数多く提案されている。
【０００７】
これらの顔検出方法としては、固有顔、ニューラル・ネットワーク及びサポートベクタマシン（ＳＶＭ：Ｓｕｐｏｒｔ　Ｖｅｃｔｏｒ　Ｍａｃｈｉｎｅ）等のパターン認識の手法を利用して、予め顔パターンを学習させて識別器を生成する方法が挙げられる。
【０００８】
しかし、このパターン識別器を生成する方法によると、ロボット装置は、膨大なデータ量である顔画像データに対して学習によるパターン識別を行うにあたって、環境の変化や自己の姿勢及び表情の変化に対してロバスト性を示すが、その分当該パターン識別に要する演算量が増加するため、演算処理に要する時間が膨大になるという問題点があった。
【０００９】
実際に、撮像画像の中から人間の顔画像を検出するプロセス（以下、これを顔検出タスクという。）においては、当該撮像画像の中から顔画像を切り出しながら識別を行うため、撮像画像全体を様々なスケールでスキャンすることとなる。このため一回一回のパターン識別に要する演算処理を極力少なくすることが極めて重要となる。
【００１０】
例えば、サポートベクタマシンによるパターン認識を用いた顔検出タスクの場合、撮像画像の中から４００（＝２０×２０）画素程度の切り出し画像から得られる４００次元のベクトルに対して、数百通りのサポートベクタ（４００次元）との内積演算が必要となる。これを大きさ（Ｗ、Ｈ）からなる全画面中で行うとなると、当該内積演算を（Ｗ−２０＋１）×（Ｈ−２０＋１）回繰り返さなくてはならないため、膨大な量の演算処理となる。
【００１１】
また、顔検出タスクをロボット装置に利用する場合には、動画の中から十分に早く顔画像を検出しなければ、リアルタイム性が要求されるロボットの行動としてフィードバックすることが極めて困難となる。更に、ロボット装置の内部のＣＰＵは、顔検出タスク以外にも常時実行しているタスクが多くあるため、これらのタスクに対して演算能力を費やしている分、当該顔検出タスクに対してフルに演算能力を費やすことは極めて困難である。
【００１２】
本発明は、このような従来の実情に鑑みて提案されたものであり、演算量を低減してリアルタイム性を格段と向上し得る顔検出装置及びその方法、ロボット装置、並びにプログラム及び記録媒体を提供することを目的とする。
【００１３】
【課題を解決するための手段】
上述した目的を達成するために、本発明に係る顔検出装置は、入力画像から対象物の顔領域を抽出する顔検出装置において、撮像手段による撮像結果として得られるフレーム画像を入力画像とし、この入力画像と平均的な顔画像を示す所定サイズのテンプレートとの相関をとった相関値の集合であるマッチング結果を生成するマッチング結果生成手段と、上記マッチング結果から相関値の局所最大値を求めこの局所最大値に基づき顔候補を抽出する顔候補抽出手段と、上記顔候補として抽出された点に相当する入力画像領域から上記顔領域を識別する識別手段とを有することを特徴とする。
【００１４】
本発明においては、平均顔テンプレートと同一のサイズの顔が入力画像に存在する場合にテンプレートマッチングの相関値が最も大きくなることを利用し、マッチング結果の相関値の局所最大値に基づいて顔候補を抽出することにより、マッチング結果から単に閾値を設けて顔候補を検出する場合に比して、顔検出性能を維持しつつ顔候補の数を飛躍的に低減することができ、これにより、後段の識別手段における顔識別の演算量を削減し、入力画像から極めて高速に顔領域を検出することができる。
【００１５】
また、上記顔候補抽出手段は、上記マッチング結果を、例えばテンプレートサイズ以下の大きさの複数の領域に分割しこの各分割領域毎に少なくとも相関値の最大値を顔候補として抽出することができる。
【００１６】
更に、異なるサイズのテンプレートから上記マッチング結果生成手段に入力する上記テンプレートのサイズを決定するテンプレートサイズ決定手段を有し、上記テンプレートサイズ決定手段は、予め検出された顔領域と同一サイズのテンプレートを選択するか、又は入力画像における上記対象物との距離情報が入力され、この距離情報に基づきテンプレートを選択することにより、効率よくテンプレートサイズを決定することができる。
【００１７】
更にまた、上記マッチング結果生成手段は、異なるサイズのテンプレートが順次入力され各テンプレートに対応するマッチング結果を生成することができ、これにより、入力画像に含まれる任意の大きさの顔を検出することができる。
【００１８】
また、上記顔候補抽出手段は、上記各分割領域の相関値の最大値のうち、所定の閾値以上のみを顔候補として抽出することにより、更に顔候補を絞り込むことができる。
【００１９】
更に、相関値の集合である上記マッチング結果は、上記入力画像をスキャンして所定画素ずつずらしながら移動させた上記テンプレートと上記入力画像との相関値の集合であって、上記テンプレートの移動に伴い配列された２次元配列とすることができる。
【００２０】
更にまた、上記顔候補抽出手段は、上記各分割領域の相関値の最大値を有する点を第１の顔候補とし、この第１の顔候補近傍の点を第２の顔候補として上記第１及び第２の顔候補を顔候補として抽出することにより、顔検出性能を更に向上することができる。
【００２１】
また、上記顔候補抽出手段は、上記第１の顔候補近傍の点のうち、該第１の顔候補近傍の点に対応する上記テンプレートサイズの入力画像領域における肌色領域の占有率が所定の閾値以上である点か、又は、予め学習した顔色情報を有し、上記第１の顔候補近傍の点のうち、該第１の顔候補近傍の点に対応する上記テンプレートサイズの入力画像領域における顔色領域の占有率が所定の閾値以上である点を第２の顔候補として抽出することができ、これにより、顔検出性能を高めつつ顔候補を絞り込むことができる。
【００２２】
本発明に係る顔検出方法は、入力画像から対象物の顔領域を抽出する顔検出方法において、撮像手段による撮像結果として得られるフレーム画像を入力画像とし、この入力画像と平均的な顔画像を示す所定サイズのテンプレートとの相関をとった相関値の集合であるマッチング結果を生成するマッチング結果生成工程と、上記マッチング結果から相関値の局所最大値を求めこの局所最大値に基づき顔候補を抽出する顔候補抽出工程と、上記顔候補として抽出された点に相当する入力画像領域から上記顔領域を識別する識別工程とを有することを特徴とする。
【００２３】
本発明に係るロボット装置は、供給された入力情報に基づいて動作を行う自律型のロボット装置において、撮像手段と、上記撮像手段による撮像結果として得られるフレーム画像を入力画像とし該入力画像から人物の顔領域を抽出する顔検出手段を有し、上記顔検出手段は、上記入力画像と平均的な顔画像を示す所定サイズのテンプレートとの相関をとった相関値の集合であるマッチング結果を生成するマッチング結果生成手段と、上記マッチング結果から相関値の局所最大値を求めこの局所最大値に基づき顔候補を抽出する顔候補抽出手段と、上記顔候補として抽出された点に相当する入力画像領域から上記顔領域を識別する識別手段とを具備することを特徴とする。
【００２４】
また、制御全体を司る制御手段を具え、上記制御手段は、上記マッチング結果生成手段、上記顔候補抽出手段、及び上記識別手段による各処理のいずれかを選択的に実行することができる。
【００２５】
更に、上記制御手段と接続され、当該制御手段の制御に応じて情報を読み書きするメモリ手段と、上記メモリ手段に接続され、上記マッチング結果生成手段、上記顔候補抽出手段、及び上記識別手段による各処理のうち上記制御手段の処理対象を除く処理を実行する演算処理手段とを具え、上記制御手段又は上記演算処理手段は、上記メモリ手段を介して該当する上記処理を切り換えて実行することができる。
【００２６】
本発明に係るプログラム及びこれを記録した記録媒体は、入力画像から人物の顔領域を抽出する動作をコンピュータに実行させるためのプログラム及びこれを記録した記録媒体において、撮像手段による撮像結果として得られるフレーム画像を入力画像とし、この入力画像と平均的な顔画像を示す所定サイズのテンプレートとの相関をとった相関値の集合であるマッチング結果を生成するマッチング結果生成工程と、上記マッチング結果から相関値の局所最大値を求めこの局所最大値に基づき顔候補を抽出する顔候補抽出工程と、上記顔候補として抽出された点に相当する入力画像領域から上記顔領域を識別する識別工程とを有することを特徴とする。
【００２７】
【発明の実施の形態】
（１）本願発明の概要
本発明に係る顔検出装置及び顔検出方法は、入力画像（撮像画像）から人物の顔（顔領域）を検出するものである。この顔検出タスクは、大きく分けて２つの工程を有する。即ち、入力画像に対してテンプレートマッチングを行って顔の候補を抽出する第１の工程と、この顔候補からＳＶＭ等により顔であるか否かの判定する第２の工程とからなる。
【００２８】
ロボット装置等のＣＰＵ及びメモリ等のリソースが限られたシステムにおいて、リアルタイムに顔検出を行うためには、顔検出タスクにおける演算量を削減すること必要である。しかしながら、上述した如く、第２の工程で行われるＳＶＭ等を使用する顔判定は膨大な演算量を要する。従って、顔検出タスクにおいて、顔検出性能を保持しつつ顔検出に要する演算量を削減するためには、第１の工程において顔候補を絞り込むことが有効となる。
【００２９】
ここで、第１の工程において、テンプレートマッチングを行って顔候補を抽出する際、例えばマッチングの相関値に対して閾値を設けて顔候補を抽出する方法がある。この方法では、顔候補の見逃しを軽減しようとした場合、相関値の閾値を上げる方法又はマッチングの際の間引きを減らす方法等をとることができるものの、例えば相関値の閾値を下げると多くの顔候補が抽出されるため、計算量が増大してしまい、ロボット装置等のリソースの限られた環境においては好ましくない場合がある。一方、閾値を上げると、顔候補の数を減らすことができるものの、本来顔である画像も候補から取り除いてしまい、顔画像を見逃してしまう場合がある。
【００３０】
そこで、本願発明は、本来顔である画像を顔候補から見逃すことなく、演算量の増大を防止する、即ち、顔候補の数を有効に絞り込むことができる方法を提案するものである。具体的には、第１の工程では、入力画像と、所定サイズの平均的な顔画像を示すテンプレートとの相関をとった相関値の集合であるマッチング結果を生成し、このマッチング結果における相関値の局所最大値に基づき顔領域を抽出する。顔領域は、マッチング結果を例えばテンプレートと同一サイズの複数の領域に分割して各分割領域毎に少なくとも相関値の最大値を顔候補として抽出する。そして、第２の工程では、上記顔候補として抽出された点に相当する入力画像領域からＳＶＭ等により顔領域（顔画像）を識別（判定）するものである。
【００３１】
（２）ロボットの構成
以下、本発明を適用した具体的な実施の形態について、図面を参照しながら詳細に説明する。この実施の形態は、本発明を、顔検出装置を搭載した２足歩行ロボット装置に適用したものである。先ず、顔検出手段を有するロボット装置の一例として、２足歩行の人間型ロボット装置について説明する。
【００３２】
図１及び図２には、人間型ロボット装置２００を前方及び後方の各々から眺望した様子を示している。更に、図３には、この人間型ロボット装置２００が具備する関節自由度構成を模式的に示している。
【００３３】
図１及び図２に示すように、人間型ロボット装置２００は、脚式移動を行う左右２足の脚部ユニット２０１Ｒ，２０１Ｌと、胴体部ユニット２０２と、左右の腕部ユニット２０３Ｒ，２０３Ｌと、頭部ユニット２０４とで構成される。
【００３４】
左右各々の脚部ユニット２０１Ｒ，２０１Ｌは、大腿部２０５Ｒ，２０５Ｌと、膝関節２０６Ｒ，２０６Ｌと、脛部２０７Ｒ，２０７Ｌと、足首２０８Ｒ，２０８Ｌと、足平２０９Ｒ，２０９Ｌとで構成され、股関節２１０Ｒ，２１０Ｌによって胴体部ユニット２０２の略最下端にて連結されている。また、左右各々の腕部ユニット２０３Ｒ，２０３Ｌは、上腕２１１Ｒ，２１１Ｌと、肘関節２１２Ｒ，２１２Ｌと、前腕２１３Ｒ，２１３Ｌとで構成され、肩関節２１４Ｒ，２１４Ｌによって胴体部ユニット２０２上方の左右各側縁にて連結されている。また、頭部ユニット２０４は、首関節２５５によって胴体部ユニット２０２の略最上端中央に連結されている。
【００３５】
頭部ユニット２０４と体幹部ユニット２０２との間の首間接は、図３に示すように、頭部ユニット２０４を支持する首関節ヨー軸３０２と、首関節ピッチ軸３０３と、首関節ロール軸３０４という３自由度を有している。
【００３６】
また、腕の関節は、肩関節ピッチ軸３０８と、肩関節ロール軸３０９と、上腕ヨー軸３１０と、肘関節ピッチ軸３１１と、前腕ヨー軸３１２と、手首関節ピッチ軸３１３と、手首関節ロール輪３１４と、手部３１５とで構成される。手部３１５は、実際には、複数本の指を含む多関節・多自由度構造体である。ただし、手部３１５の動作は人間型ロボット装置２００の姿勢制御や歩行制御に対する寄与や影響が少ないので、本明細書ではゼロ自由度と仮定する。従って、各腕部は７自由度を有するとする。
【００３７】
また、胴体部ユニットは、胴体ピッチ軸３０５と、胴体ロール軸３０６と、胴体ヨー軸３０７という３自由度を有する。
【００３８】
また、下肢を構成する各々の脚部ユニットは、股関節ヨー軸３１６と、股関節ピッチ軸３１７と、股関節ロール軸３１８と、膝関節ピッチ軸３１９と、足首関節ピッチ軸３２０と、足首関節ロール軸３２１と、足部３２２とで構成される。本明細書中では、股関節ピッチ軸３１７と股関節ロール軸３１８との交点は、人間型ロボット装置２００の股関節位置を定義する。人体の足部３２２は、実際には多関節・多自由度の足底を含んだ構造体であるが、人間型ロボット装置２００の足底は、ゼロ自由度とする。従って、各脚部は、６自由度で構成される。
【００３９】
以上を総括すれば、人間型ロボット装置２００全体としては、合計で３＋７×２＋３＋６×２＝３２自由度を有することになる。ただし、エンターテインメント向けの人間型ロボット装置２００が必ずしも３２自由度に限定される訳ではない。設計・制作上の制約条件や要求仕様等に応じて、自由度すなわち関節数を適宜増減することができることはいうまでもない。
【００４０】
上述したような人間型ロボット装置２００がもつ各自由度は、実際にはアクチュエータを用いて実装される。外観上で余分な膨らみを排してヒトの自然体形状に近似させること、２足歩行という不安定な構造体に対して姿勢制御を行うこと等の要請から、アクチュエータは小型且つ軽量であることが好ましい。
【００４１】
図４には、人間型ロボット装置２００の制御システム構成を模式的に示している。同図に示すように、人間型ロボット装置２００は、ヒトの四肢を表現した各機構ユニット３３０，３４０，３５０Ｒ／Ｌ，３６０Ｒ／Ｌと、各機構ユニット間の協調動作を実現するための適応制御を行う制御ユニット３８０とで構成される（ただし、Ｒ及びＬの各々は、右及び左の各々を示す接尾辞である。以下同様）。
【００４２】
人間型ロボット装置２００全体の動作は、制御ユニット３８０によって統括的に制御される。制御ユニット３８０は、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）やメモリ等の主要回路コンポーネント（図示しない）で構成される主制御部３８１と、電源回路や人間型ロボット装置２００の各構成要素とのデータやコマンドの授受を行うインターフェイス（何れも図示しない）等を含んだ周辺回路３８２とで構成される。この制御ユニット３８０の設置場所は、特に限定されない。図４では胴体部ユニット３４０に搭載されているが、頭部ユニット３３０に搭載してもよい。或いは、人間型ロボット装置２００外に制御ユニット３８０を配備して、人間型ロボット装置２００の機体とは有線若しくは無線で交信するようにしてもよい。
【００４３】
図４に示した人間型ロボット装置２００内の各関節自由度は、それぞれに対応するアクチュエータによって実現される。すなわち、頭部ユニット３３０には、首関節ヨー軸３０２、首関節ピッチ３０３、首関節ロール軸３０４の各々を表現する首関節ヨー軸アクチュエータＡ_２、首関節ピッチ軸アクチュエータＡ_３、首関節ロール軸アクチュエータＡ_４が配設されている。
【００４４】
また、頭部ユニット３３０には、外部の状況を撮像するためのＣＣＤ（ＣｈａｒｇｅＣｏｕｐｌｅｄ　Ｄｅｖｉｃｅ）カメラが設けられているほか、前方に位置する物体までの距離を測定するための距離センサ、外部音を集音するためのマイク、音声を出力するためのスピーカ、使用者からの「撫でる」や「叩く」といった物理的な働きかけにより受けた圧力を検出するためのタッチセンサ等が配設されている。
【００４５】
また、胴体部ユニット３４０には、胴体ピッチ軸３０５、胴体ロール軸３０６、胴体ヨー軸３０７の各々を表現する胴体ピッチ軸アクチュエータＡ_５、胴体ロール軸アクチュエータＡ_６、胴体ヨー軸アクチュエータＡ_７が配設されている。また、胴体部ユニット３４０には、この人間型ロボット装置２００の起動電源となるバッテリを備えている。このバッテリは、充放電可能な電池によって構成されている。
【００４６】
また、腕部ユニット３５０Ｒ／Ｌは、上腕ユニット３５１Ｒ／Ｌと、肘関節ユニット３５２Ｒ／Ｌと、前腕ユニット３５３Ｒ／Ｌに細分化されるが、肩関節ピッチ軸３０８、肩関節ロール軸３０９、上腕ヨー軸３１０、肘関節ピッチ軸３１１、前腕ヨー軸３１２、手首関節ピッチ軸３１３、手首関節ロール軸３１４の各々表現する肩関節ピッチ軸アクチュエータＡ_８、肩関節ロール軸アクチュエータＡ_９、上腕ヨー軸アクチュエータＡ_１０、肘関節ピッチ軸アクチュエータＡ_１１、肘関節ロール軸アクチュエータＡ_１２、手首関節ピッチ軸アクチュエータＡ_１３、手首関節ロール軸アクチュエータＡ_１４が配備されている。
【００４７】
また、脚部ユニット３６０Ｒ／Ｌは、大腿部ユニット３６１Ｒ／Ｌと、膝ユニット３６２Ｒ／Ｌと、脛部ユニット３６３Ｒ／Ｌに細分化されるが、股関節ヨー軸３１６、股関節ピッチ軸３１７、股関節ロール軸３１８、膝関節ピッチ軸３１９、足首関節ピッチ軸３２０、足首関節ロール軸３２１の各々を表現する股関節ヨー軸アクチュエータＡ_１６、股関節ピッチ軸アクチュエータＡ_１７、股関節ロール軸アクチュエータＡ_１８、膝関節ピッチ軸アクチュエータＡ_１９、足首関節ピッチ軸アクチュエータＡ_２０、足首関節ロール軸アクチュエータＡ_２１が配備されている。各関節に用いられるアクチュエータＡ_２，Ａ_３・・・は、より好ましくは、ギア直結型で且つサーボ制御系をワンチップ化してモータ・ユニット内に搭載したタイプの小型ＡＣサーボ・アクチュエータで構成することができる。
【００４８】
頭部ユニット３３０、胴体部ユニット３４０、腕部ユニット３５０、各脚部ユニット３６０等の各機構ユニット毎に、アクチュエータ駆動制御部の副制御部３３５，３４５，３５５Ｒ／Ｌ，３６５Ｒ／Ｌが配備されている。更に、各脚部３６０Ｒ，Ｌの足底が着床したか否かを検出する接地確認センサ３９１及び３９２を装着するとともに、胴体部ユニット３４０内には、姿勢を計測する姿勢センサ３９３を装備している。
【００４９】
接地確認センサ３９１及び３９２は、例えば足底に設置された近接センサ又はマイクロ・スイッチ等で構成される。また、姿勢センサ３９３は、例えば、加速度センサとジャイロ・センサの組み合わせによって構成される。
【００５０】
接地確認センサ３９１及び３９２の出力によって、歩行・走行等の動作期間中において、左右の各脚部が現在立脚又は遊脚何れの状態であるかを判別することができる。また、姿勢センサ３９３の出力により、胴体部分の傾きや姿勢を検出することができる。
【００５１】
主制御部３８１は、各センサ３９１〜３９３の出力に応答して制御目標をダイナミックに補正することができる。より具体的には、副制御部３３５，３４５，３５５Ｒ／Ｌ，３６５Ｒ／Ｌの各々に対して適応的な制御を行い、人間型ロボット装置２００の腕部、胴体、及び脚部が協調して駆動する全身運動パターンを実現できる。
【００５２】
人間型ロボット装置２００の機体上での全身運動は、足部運動、ＺＭＰ（ＺｅｒｏＭｏｍｅｎｔ　Ｐｏｉｎｔ）軌道、胴体運動、上肢運動、腰部高さ等を設定するとともに、これらの設定内容にしたがった動作を指示するコマンドを各副制御部３３５，３４５，３５５Ｒ／Ｌ，３６５Ｒ／Ｌに転送する。そして、各々の副制御部３３５，３４５，・・・等では、主制御部３８１からの受信コマンドを解釈して、各アクチュエータＡ_２，Ａ_３・・・等に対して駆動制御信号を出力する。ここでいう「ＺＭＰ」とは、歩行中の床反力によるモーメントがゼロとなる床面上の点のことであり、また、「ＺＭＰ軌道」とは、例えば人間型ロボット装置２００の歩行動作期間中にＺＭＰが動く軌跡を意味する。
【００５３】
歩行時には、重力と歩行運動に伴って生じる加速度によって、歩行系から路面には重力と慣性力、並びにこれらのモーメントが作用する。いわゆる「ダランベールの原理」によると、それらは路面から歩行系への反作用としての床反力、床反力モーメントとバランスする。力学的推論の帰結として、足底接地点と路面の形成する支持多角形の辺上或いはその内側にピッチ及びロール軸モーメントがゼロとなる点、すなわち「ＺＭＰ（Ｚｅｒｏ　Ｍｏｍｅｎｔ　Ｐｏｉｎｔ）」が存在する。
【００５４】
脚式移動ロボットの姿勢安定制御や歩行時の転倒防止に関する提案の多くは、このＺＭＰを歩行の安定度判別の規範として用いたものである。ＺＭＰ規範に基づく２足歩行パターン生成は、足底着地点を予め設定することができ、路面形状に応じた足先の運動学的拘束条件を考慮しやすい等の利点がある。また、ＺＭＰを安定度判別規範とすることは、力ではなく軌道を運動制御上の目標値として扱うことを意味するので、技術的に実現可能性が高まる。なお、ＺＭＰの概念並びにＺＭＰを歩行ロボットの安定度判別規範に適用する点については、Ｍｉｏｍｉｒ　Ｖｕｋｏｂｒａｔｏｖｉｃ著“ＬＥＧＧＥＤ　ＬＯＣＯＭＯＴＩＯＮ　ＲＯＢＯＴＳ”（加藤一郎外著『歩行ロボットと人工の足』（日刊工業新聞社））に記載されている。
【００５５】
一般には、４足歩行よりもヒューマノイドのような２足歩行のロボットの方が、重心位置が高く、且つ、歩行時のＺＭＰ安定領域が狭い。従って、このような路面状態の変化に伴う姿勢変動の問題は、２足歩行ロボットにおいてとりわけ重要となる。
【００５６】
以上のように、人間型ロボット装置２００は、各々の副制御部３３５，３４５，・・・等が、主制御部３８１からの受信コマンドを解釈して、各アクチュエータＡ_２，Ａ_３・・・に対して駆動制御信号を出力し、各ユニットの駆動を制御している。これにより、人間型ロボット装置２００は、目標の姿勢に安定して遷移し、安定した姿勢で歩行できる。
【００５７】
また、人間型ロボット装置２００における制御ユニット３８０では、上述したような姿勢制御のほかに、加速度センサ、タッチセンサ、接地確認センサ等の各種センサ、及びＣＣＤカメラからの画像情報、マイクからの音声情報等を統括して処理している。制御ユニット３８０では、図示しないが加速度センサ、ジャイロ・センサ、タッチセンサ、距離センサ、マイク、スピーカ等の各種センサ、各アクチュエータ、ＣＣＤカメラ及びバッテリが各々対応するハブを介して主制御部３８１と接続されている。
【００５８】
主制御部３８１は、上述の各センサから供給されるセンサデータや画像データ及び音声データを順次取り込み、これらをそれぞれ内部インターフェイスを介してＤＲＡＭ内の所定位置に順次格納する。また、主制御部３８１は、バッテリから供給されるバッテリ残量を表すバッテリ残量データを順次取り込み、これをＤＲＡＭ内の所定位置に格納する。ＤＲＡＭに格納された各センサデータ、画像データ、音声データ及びバッテリ残量データは、主制御部３８１がこの人間型ロボット装置２００の動作制御を行う際に利用される。
【００５９】
主制御部３８１は、人間型ロボット装置２００の電源が投入された初期時、制御プログラムを読み出し、これをＤＲＡＭに格納する。また、主制御部３８１は、上述のように主制御部３８１よりＤＲＡＭに順次格納される各センサデータ、画像データ、音声データ及びバッテリ残量データに基づいて自己及び周囲の状況や、使用者からの指示及び働きかけの有無等を判断する。更に、主制御部３８１は、この判断結果及びＤＲＡＭに格納した制御プログラムに基づいて自己の状況に応じて行動を決定するとともに、当該決定結果に基づいて必要なアクチュエータを駆動させることにより人間型ロボット装置２００に、いわゆる「身振り」、「手振り」といった行動をとらせる。
【００６０】
従って、人間型ロボット装置２００は、制御プログラムに基づいて自己及び周囲の状況を判断し、使用者からの指示及び働きかけに応じて自律的に行動できる。また、人間型ロボット装置２００は、ＣＣＤカメラにおいて撮像された画像から抽出した文字の発音のしかた（読み方）を、抽出された文字から推定される読み方と集音マイクにおいて集音された音声とをマッチングして決定する。従って、人間型ロボット装置２００の音声認識の精度が向上し、新規単語が音声認識用辞書に登録できる。
【００６１】
（２）適用例
以下、本発明の実施の形態の説明に先立ち、理解を容易とするために、上述した２足歩行ロボット装置において、入力画像に対してテンプレートマッチングを行って顔候補を抽出し、この顔候補から顔を判定する顔検出装置を搭載した他の例について説明する（特願２００２−０７４９０７号参照、以下、適用例という。）。
【００６２】
以下に説明する第１及び第２の適用例においても、本願発明と同様にテンプレートマッチング等によりおおまかに顔候補を抽出した後、ＳＶＭ等により顔候補の中から顔であるか否かを判定することにより、この後段における演算量を削減してロボット装置の顔検出のリアルタイム性を向上させるものである。
【００６３】
（２−１）第１の適用例
（２−１−１）ロボットの内部構成
第１の適用例においては、胴体部ユニット２０４の胴体下部を形成する腰ベースの背面側には、図４及び図５に示すように、当該ロボット装置２００全体の動作制御を司る主制御部３８１、電源回路及び通信回路等の周辺回路３８１、バッテリ４５（図５）等がボックスに収納されてなる制御ユニット４２が配設されている。
【００６４】
そしてこの制御ユニット４２は、各構成ユニット（胴体部ユニット２０２、頭部ユニット２０４、各腕部ユニット２０３Ｒ、２０３Ｌ及び各脚部ユニット２０９Ｒ、２０９Ｌ）内にそれぞれ配設された各副制御部３３５、３５０Ｒ、３５０Ｌ、３６５Ｒ、３６５Ｌと接続されており、これら副制御部３３５、３５０Ｒ、３５０Ｌ、３６５Ｒ、３６５Ｌに対して必要な電源電圧を供給したり、これら副制御部３３５、３５０Ｒ、３５０Ｌ、３６５Ｒ、３６５Ｌと通信を行ったりすることができるようになされている。
【００６５】
また、各副制御部３３５、３５０Ｒ、３５０Ｌ、３６５Ｒ、３６５Ｌは、それぞれ対応する構成ユニット内の各アクチュエータＡ_２〜Ａ_２１と接続されており、当該構成ユニット内の各アクチュエータＡ_２〜Ａ_２１を主制御部３８１から与えられる各種制御コマンドに基づいて指定された状態に駆動し得るようになされている。
【００６６】
更に、頭部ユニット２０４には、図５に示すように、このロボット装置２００の「目」として機能するＣＣＤ（Ｃｈａｒｇｅ　Ｃｏｕｐｌｅｄ　Ｄｅｖｉｃｅ　）カメラ５０及び「耳」として機能するマイクロホン５１及びタッチセンサ５２等からなる外部センサ部５３と、「口」として機能するスピーカ５４と等がそれぞれ所定位置に配設され、制御ユニット３８０内には、バッテリセンサ５５及び加速度センサ５６等からなる内部センサ部５７が配設されている。
【００６７】
そして外部センサ部５３のＣＣＤカメラ５０は、周囲の状況を撮像し、得られた画像信号Ｓ１Ａを主制御部３８１に送出してフレーム単位で内部メモリ３８１Ａに順次記憶する一方、マイクロホン５１は、ユーザから音声入力として与えられる「歩け」、「座れ」又は「ボールを追いかけろ」等の各種命令音声を集音し、かくして得られた音声信号Ｓ１Ｂを主制御部３８１に送出するようになされている。
【００６８】
更に、タッチセンサは、頭部ユニット３の上部に設けられており、ユーザからの「撫でる」や「叩く」といった物理的な働きかけにより受けた圧力を検出し、検出結果を圧力検出信号Ｓ１Ｃとして主制御部３８１に送出する。
【００６９】
更にまた、内部センサ部５７のバッテリセンサ５５は、バッテリ４５のエネルギ残量を所定周期で検出し、検出結果をバッテリ残量検出信号Ｓ２Ａとして主制御部３８１に送出する。一方、加速度センサ５６は、３軸方向（ｘ軸、ｙ軸及びｚ軸）の加速度を所定周期で検出し、検出結果を加速度検出信号Ｓ２Ｂとして主制御部３８１に送出する。
【００７０】
主制御部３８１は、外部センサ部５３のＣＣＤカメラ５０、マイクロホン５１及びタッチセンサ５２等からそれぞれ供給される画像信号Ｓ１Ａ、音声信号Ｓ１Ｂ及び圧力検出信号Ｓ１Ｃ等（以下、これらをまとめて外部センサ信号Ｓ１と呼ぶ）と、内部センサ部５７のバッテリセンサ５５及び加速度センサ等からそれぞれ供給されるバッテリ残量検出信号Ｓ２Ａ及び加速度検出信号Ｓ２Ｂ等（以下、これらをまとめて内部センサ信号Ｓ２という。）に基づいて、ロボット装置２００の周囲及び内部の状況や、ユーザからの指令、ユーザからの働きかけの有無等を判断する。
【００７１】
そして主制御部３８１は、この判断結果と、予め内部メモリ３８１Ａに格納されている制御プログラムと、そのとき装填されている外部メモリ５８に格納されている各種制御パラメータとに基づいて続く行動を決定し、決定結果に基づく制御コマンドを対応する副制御部３３５、３５０Ｒ、３５０Ｌ、３６５Ｒ、３６５Ｌに送出する。この結果、この制御コマンドに基づき、その副制御部３３５、３５０Ｒ、３５０Ｌ、３６５Ｒ、３６５Ｌの制御のもとに、対応するアクチュエータＡ_２〜Ａ_２１が駆動され、かくして頭部ユニット２０４を上下左右に揺動させたり、腕部ユニット２０３Ｒ、２０３Ｌを上にあげたり、歩行する等の行動がロボット装置２００により発現されることとなる。
【００７２】
この際、主制御部３８１は、必要に応じて所定の音声信号Ｓ３をスピーカ５４に与えることにより当該音声信号Ｓ３に基づく音声を外部に出力させたり、外見上の「目」として機能する頭部ユニット２０４の所定位置に設けられたＬＥＤに駆動信号を出力することによりこれを点滅させる。
【００７３】
このようにしてこのロボット装置２００においては、周囲及び内部の状況や、ユーザからの指令及び働きかけの有無等に基づいて自律的に行動することができるようになされている。
【００７４】
（２−１−２）顔検出タスク機能に関する主制御部３８１の処理
次に、このロボット装置２００に搭載された顔検出タスク機能について説明する。このロボット装置２００には、ＣＣＤカメラ５０を介して内部メモリ３８１Ａに記憶されるフレーム画像について、当該フレーム画像の中から人間の顔画像を検出する顔検出タスク機能が搭載されている。そしてこの顔検出タスク機能は、主制御部３８１における各種処理により実現されている。
【００７５】
ここで、かかる顔検出タスク機能に関する主制御部３８１の処理内容を機能的に分類すると、図６に示すように、入力画像スケール変換部６０、ウィンドウ切出部６１、テンプレートマッチング部６２、前処理部６３、パターン識別部６４及び重なり判定部６５に分けることができる。
【００７６】
入力画像スケール変換部６０は、ＣＣＤカメラ５０（図５）からの画像信号Ｓ１Ａに基づくフレーム画像を内部メモリ３８１Ａから読み出して、当該フレーム画像を縮小率が相異なる複数のスケール画像に変換する。この適用例の場合、２５３４４（＝１７６×１４４）画素からなるフレーム画像に対して、これを０．８倍ずつ順次縮小して５段階（１．０倍、０．８倍、０．６４倍、０．５１倍、０．４１倍）のスケール画像（以下、これを第１〜第５のスケール画像と呼ぶ）に変換する。
【００７７】
続くウィンドウ切出部６１は、第１〜第５のスケール画像のうち、まず第１のスケール画像に対して、画像左上を起点として順に画像右下まで、適当な画素（例えば２画素）分を右側又は下側にずらしながらスキャンするようにして、４００（＝２０×２０）画素の矩形領域（以下、この領域をウィンドウ画像と呼ぶ）を順次切り出す。
【００７８】
その際、ウィンドウ切出部６１は、第１のスケール画像から切り出した複数のウィンドウ画像のうち先頭のウィンドウ画像を後段のテンプレートマッチング部６２に送出する。
【００７９】
テンプレートマッチング部６２は、ウィンドウ切出部６１から得られた先頭のウィンドウ画像について、正規化相関法や誤差二乗法等の演算処理を実行してピーク値をもつ関数曲線に変換した後、当該関数曲線に対して認識性能が落ちない程度に十分に低い閾値を設定して、当該閾値を基準として当該ウィンドウ画像が顔画像か否かを判断する。
【００８０】
本適用例の場合、テンプレートマッチング部６２では、例えば１００人程度の人物の平均からなる平均的な顔画像をテンプレートとして、かかる顔画像か否かの判断基準となる閾値を設定するようになされている。これにより当該ウィンドウ画像について、テンプレートとなる平均的な顔画像との大まかなマッチングをとり得るようになされている。
【００８１】
このようにしてテンプレートマッチング部６２は、ウィンドウ切出部６１から得られたウィンドウ画像について、テンプレートによるマッチングをとり、顔画像であると判断された場合には、当該ウィンドウ画像をスコア画像として後段の前処理部６３に送出する一方、顔画像でないと判断された場合には、当該ウィンドウ画像をそのまま後段の重なり判定部６５に送出する。
【００８２】
この時点で顔画像であると判断されたウィンドウ画像（スコア画像）には、実際には顔画像以外の判断誤りの画像が大量に含まれるが、日常のシーンの中では顔に類似した背景画像が多く存在することはあまりないため、ほとんどのウィンドウ画像は顔画像ではないと判断されることとなり極めて有効である。
【００８３】
実際に上述した正規化相関法や誤差二乗法等の演算処理は、後段の前処理部及びパターン識別部における演算処理と比較すると、演算量が１０分の１から１００分の１程度で済むと共に、実験上この段階で顔画像以外の画像を８０〔％〕以上はふるい落とすことができることが確認されたため、主制御部３８１全体としては大幅な演算量の削減につながることがわかる。
【００８４】
前処理部６３は、テンプレートマッチング部６２から得られたスコア画像について、矩形領域でなる当該スコア画像から人間の顔画像とは無関係である背景部分に相当する４隅の領域を除去するために、当該４隅の領域を切り取ったマスクを用いて、４００（＝２０×２０）画素あるスコア画像から３６０画素分を抽出する。
【００８５】
そして前処理部６３は、撮像時の照明により濃淡で表される被写体の傾き条件を解消すべく、当該抽出した３６０画素分のスコア画像のうち顔画像として最適な部位を基準とする平面を形成するように、例えば平均二乗誤差（ＲＳＭ：ＲｏｏｔＭｅａｎ　Ｓｑｕａｒｅ）等による算出方法を用いて当該３６０画素の濃淡値に補正をかける。
【００８６】
続いて前処理部６３は、当該３６０画素分のスコア画像のコントラストを強調した結果をヒストグラム平滑化処理を行うことにより、ＣＣＤカメラ５０のゲインや照明の強弱によらずに検出できるようにする。
【００８７】
次いで、前処理部６３は、後述するガボア・フィルタリング（Ｇａｂｏｒ　Ｆｉｌｔｅｒｉｎｇ）処理を行うことにより、当該３６０画素分のスコア画像をベクトル変換し、得られたベクトル群を更に１本のパターンベクトルに変換する。
【００８８】
以下において、ガボア・フィルタリング処理について説明する。まず人間の視覚細胞には、ある特定の方位に対して選択性を持つ細胞が存在することが既に判っている。これは、垂直の線に対して反応する細胞と、水平の線に反応する細胞で構成される。ガボア・フィルタリングは、これと同様に、方位選択性を持つ複数のフィルタで構成される空間フィルタである。
【００８９】
ガボア・フィルタは、ガボア関数によって空間表現される。ガボア関数ｇ（ｘ，ｙ）は、下記式（１）に示すように、コサイン成分からなるキャリアｓ（ｘ，ｙ）と、２次元ガウス分析状のエンベローブＷ_ｒ（ｘ，ｙ）とで構成される。
【００９０】
【数１】

【００９１】
キャリアｓ（ｘ，ｙ）は、複数関数を用いて、下記式（２）のように表現される。ここで、座標値（ｕ_０，ｖ_０）は空間周波数を表し、またＰはコサイン成分の位相を表す。
【００９２】
【数２】

【００９３】
ここで、上記式（２）に示すキャリアは、下記式（３）に示すように、実数成分Ｒｅ（ｓ（ｘ，ｙ）と虚数成分Ｉｍ（ｓ（ｘ，ｙ））に分離することができる。
【００９４】
【数３】

【００９５】
一方、２次元ガウス分布からなるエンベローブは、下記式（４）のように表現される。
【００９６】
【数４】

【００９７】
ここで、座標軸（ｘ_０，ｙ_０）はこの関数のピークであり、定数ａ及びｂはガウス分布のスケール・パラメータである。また、添え字ｒは、下記式（５）に示すような回転操作を意味する。
【００９８】
【数５】

【００９９】
従って、上記式（２）及び（４）より、ガボア・フィルタは、下記式（６）に示すような空間関数として表現される。
【０１００】
【数６】

【０１０１】
本適用例に係る前処理部は、８種類の方向と３通りの周波数を採用して、合計２４個のガボア・フィルタを用いて顔抽出処理を行う。
【０１０２】
ガボア・フィルタのレスポンスは、Ｇ_ｉをｉ番目のガボア・フィルタとし、ｉ番目のガボアの結果（Ｇａｂｏｒ　Ｊｅｔ）をＪ_ｉとし、入力イメージをＩとすると、下記式（７）で表される。この式（７）の演算は、実際には高速フーリエ変換を用いて高速化することができる。
【０１０３】
【数７】

【０１０４】
作成したガボア・フィルタの性能を調べるためには、フィルタリングして得られた画素を再構築することによって行う。再構築されたイメージＨは、下記式（８）のように表される。
【０１０５】
【数８】

【０１０６】
そして、入力画像Ｉと再構築された画像ＨとのエラーＥは、下記式（９）のように表される。
【０１０７】
【数９】

【０１０８】
このエラーＥを最小にするような最適なａを求めることにより再構築することができる。なお、ガボア・フィルタリングに関しては、認識タスクに応じてフィルタの種類を変更するようにしてもよい。
【０１０９】
低周波でのフィルタリングでは、フィルタリング後のイメージすべてをベクトルとして持っているのは冗長である。そこで、ダウンサンプリングして、ベクトルの次元を落とすようにしてもよい。ダウンサンプリングされた２４種類のベクトルを一列に並べ、長いベクトル（上述したパターンベクトル）にする。
【０１１０】
パターン識別部６４は、外部から供給される学習用のデータすなわち教師データを用いて、暫定的な識別関数を得た後、当該識別関数を前処理部６３からパターンベクトルとして得られた３６０画素分のスコア画像に試して顔の検出を行う。そして、検出に成功したものを顔データとして出力する。また検出に失敗したものを非顔データとして学習データに追加して、更に学習をし直す。
【０１１１】
本適用例では、パターン識別部６４における顔認識に関して、パターン認識の分野で最も学習汎化能力が高いとされるサポートベクタマシン（Ｓｕｐｐｏｒｔ　Ｖｅｃｔｏｒ
Ｍａｃｈｉｎｅ：ＳＶＭ）を用いて該当する顔か否かの識別を行う。
【０１１２】
サポートベクタマシン自体に関しては、例えばＢ．ｓｈｏｌｋｏｐｆ外著の報告（Ｂ．Ｓｈｏｌｋｏｐｆ、Ｃ．Ｂｕｒｇｅｓ、Ａ．Ｓｍｏｌａ、”Ａｄｖａｎｃｅ　ｉｎ　Ｋｅｒｎｅｌ　Ｓｕｐｐｏｒｔ　Ｖｅｃｔｏｒ　ｅａｒｎｉｎｇ”、Ｔｈｅ　ＭＩＴ　Ｐｒｅｓｓ、１９９９．）を挙げることができる。本願出願人が行った予備実験の結果からは、サポートベクタマシンによる顔認識方法は、主成分分析（ＰＣＡ）やニューラル・ネットワークを用いる手法に比べ、良好な結果を示すことが判っている。
【０１１３】
サポートベクタマシンは、識別関数に線形識別器（パーセプトロン）を用いた学習機械であり、カーネル関数を使うことで非線形空間に拡張することができる。また識別関数の学習では、クラス間分離のマージンを最大にとるように行われ、その解は２次数理計画法を解くことで得られるため、グローバル解に到達できることを理論的に保証することができる。
【０１１４】
通常、パターン認識の問題は、テスト・サンプルｘ＝（ｘ１、ｘ２、・・・、ｘｎ）に対して、下記式（１０）で与えられる識別関数ｆ（ｘ）を求めることである。
【０１１５】
【数１０】

【０１１６】
ここで、サポートベクタマシンの学習用の教師ラベルを下記式（１１）のようにおく。
【０１１７】
【数１１】

【０１１８】
すると、サポートベクタマシンにおける顔パターンの認識を下記式（１２）に示す制約条件の下での重み因子ｗの２乗の最小化する問題としてとらえることができる。
【０１１９】
【数１２】

【０１２０】
このような制約のついた問題は、ラグランジュの未定定数法を用いて解くことができる。すなわち、下記式（１３）に示すラグランジュをまず導入し、次いで、下記式（１４）に示すように、ｂ、ｗの各々について偏微分する。
【０１２１】
【数１３】

【０１２２】
【数１４】

【０１２３】
この結果、サポートベクタマシンにおける顔パターンの識別を下記式（１５）に示す２次計画問題としてとらえることができる。
【０１２４】
【数１５】

【０１２５】
特徴空間の次元数が、訓練サンプルの数よりも少ない場合は、スクラッチ変数ξ≧０を導入して、制約条件を下記式（１６）のように変更する。最適化については、下記式（１７）の目的関数を最小化する。
【０１２６】
【数１６】

【０１２７】
【数１７】

【０１２８】
上記式（１７）において、Ｃは、制約条件をどこまで緩めるかを指定する係数であり、実験的に値を決定する必要がある。ラグランジュ定数ａに関する問題は下記式（１８）のように変更される。
【０１２９】
【数１８】

【０１３０】
しかし、上記式（１８）のままでは、非線型の問題を解くことはできない。そこで、本適用例では、カーネル関数Ｋ（ｘ，ｘ３）を導入して、一旦、高次元の空間に写像して（カーネル・トリック）、その空間で線形分離することにしている。従って、元の空間では非線型分離していることと同等となる。
【０１３１】
カーネル関数は、ある写像Φを用いて下記式（１９）のように表される。
【０１３２】
【数１９】

【０１３３】
また、上記（１０）式に示した識別関数も、下記式（２０）のように表すことができる。
【０１３４】
【数２０】

【０１３５】
また学習に関しても、下記式（２１）に示す２次計画問題としてとらえることができる。
【０１３６】
【数２１】

【０１３７】
カーネルとしては、下記式（２２）に示すガウシアン・カーネル（ＲＢＦ：Ｒａｄｉｕｓ　Ｂａｓｉｃ　Ｆｕｎｃｔｉｏｎ）等を用いることができる。
【０１３８】
【数２２】

【０１３９】
このようにパターン識別部６４は、前処理部６３から与えられたスコア画像に基づくパターンベクトルについて、当該スコア画像内に顔データが存在するか否かを判断し、存在する場合のみ当該スコア画像の画像領域における左上位置（座標）及びその大きさ（縦横の画素数）と、当該スコア画像の切出し元となるスケール画像のフレーム画像に対する縮小率（すなわち上述の５段階のうちの該当する段階）とをリスト化し、これをリストデータとして内部メモリ３８１Ａに格納する。
【０１４０】
この後、パターン識別部６４は、ウィンドウ切出部６１に対して、第１のスケール画像のうち先頭のウィンドウ画像の顔検出が終了した旨を通知することにより、当該ウィンドウ切出部６１から第１のスケール画像のうち次にスキャンされたウィンドウ画像をテンプレートマッチング部６２に送出させる。
【０１４１】
そしてテンプレートマッチング部６２は、当該ウィンドウ画像がテンプレートにマッチングした場合のみスコア画像として前処理部６３に送出する。前処理部６３は、当該スコア画像をパターンベクトルに変換してパターン識別部６４に送出する。パターン識別部６４は、パターンベクトルから識別結果として得られた顔データに基づいてリストデータを生成して内部メモリ３８１Ａに格納する。
【０１４２】
このようにウィンドウ切出部６１おいて第１のスケール画像から切り出した全てのウィンドウ画像について、スキャン順にテンプレートマッチング部６２、前処理部６３及びパターン識別部６４の各処理を行うことにより、当該第１のスケール画像から撮像結果に存在する顔画像を含むスコア画像を複数検出することができる。
【０１４３】
この後、パターン識別部６４は、入力画像スケール変換部６０に対して、第１のスケール画像の顔検出が終了した旨を通知することにより、当該入力画像スケール変換部６０から第２のスケール画像をウィンドウ切出部６１に送出させる。
【０１４４】
そして第２のスケール画像についても、上述した第１のスケール画像と同様の処理を行って、当該第２のスケール画像から撮像結果に存在する顔画像を含むスコア画像を複数検出した後、第３〜第５のスケール画像についても同様の処理を順次行う。
【０１４５】
かくしてパターン識別部６４は、撮像画像であるフレーム画像を５段階に縮小した第１〜第５のスケール画像について、当該撮像画像内に存在する顔画像を含むスコア画像をそれぞれ複数検出した後、その結果得られたリストデータをそれぞれ内部メモリ３８１Ａに格納する。この場合、元のフレーム画像内での顔画像のサイズによっては、全くスコア画像が得られない場合もあるが、少なくとも１以上（２又は３以上でもよい）のスケール画像でスコア画像が得られれば、顔検出処理を続行することとする。
【０１４６】
ここで、各スケール画像において顔画像を含む複数のスコア画像は、ウィンドウ切出部６１におけるスキャンが２画素ずつすらして行われたため、実際に顔がある領域とその近傍領域とで高い相関性があり、隣接するスコア画像同士で相互に重なり合う画像領域を含むこととなる。
【０１４７】
そこで続く重なり判定部６５は、内部メモリ３８１Ａに格納されている第１〜第５のスケール画像ごとに複数のリストデータをそれぞれ読み出して、当該各リストデータに含まれるスコア画像同士を比較して、相互に重なり合う領域を含むか否かを判定する。
【０１４８】
その際、重なり判定部６５は、図７に示すように、２つのスコア画像Ｐ１、Ｐ２の左上角部の位置（座標）がそれぞれ（Ｘ_Ａ，Ｙ_Ａ）、（Ｘ_Ｂ，Ｙ_Ｂ）であり、その大きさ（縦横の画素数）がそれぞれＨ_Ａ×Ｌ_Ａ、Ｈ_Ｂ×Ｌ_Ｂであるとき、ｄＸ（＝Ｘ_Ｂ−Ｘ_Ａ）、ｄＹ（＝Ｙ_Ｂ−Ｙ_Ａ）として、下記式（２３）を満たすか否かによって、当該スコア画像Ｐ１、Ｐ２同士が重なり合うか否かを判定することができる。
【０１４９】
【数２３】

【０１５０】
重なり判定部６５は、当該判定結果に基づいて、スコア画像同士で重なり合う領域を除去することにより、各スケール画像において、最終的に複数のスコア画像を互いに重なることなく寄せ集めた単一の画像領域として得ることができ、当該画像領域を顔決定データとして新たに内部メモリ３８１Ａに格納する。
【０１５１】
また重なり判定部６５は、テンプレートマッチング部６２において顔画像でないと判断された場合には、そのまま何もすることなく、内部メモリ３８１Ａの格納も行わない。
【０１５２】
（２−１−３）第１の適用例における動作及び効果
以上の構成において、このロボット装置２００では、ＣＣＤカメラ５０により撮像したフレーム画像を縮小率が相異なる複数のスケール画像に変換した後、当該各スケール画像の中からそれぞれ所定サイズのウィンドウ画像を所定画素ずつずらすようにスキャンしながら１枚ずつ切り出す。
【０１５３】
このウィンドウ画像について、平均的な顔画像を表すテンプレートを用いてマッチングをとって大まかに顔画像であるか否かを判断するようにして、明らかに顔画像でないウィンドウ画像を除去することにより、後段の顔検出処理に要する演算量及び時間をその分減少させることができる。
【０１５４】
続いてテンプレートマッチングで顔画像であると判断されたウィンドウ画像（すなわちスコア画像）について、当該スコア画像の矩形領域の４隅部分を除去した後、濃淡補正及び続くコントラスト強調の平滑化を行い、更に１本のパターンベクトルに変換する。
【０１５５】
そして当該パターンベクトルについて、元のスコア画像内での顔検出を行って顔データ又は非顔データを判断し、顔データが存在するスコア画像の画像領域の位置（座標）及びその大きさ（画素数）と、当該スコア画像の切出し元となるスケール画像のフレーム画像に対する縮小率とをリスト化したリストデータを生成する。
【０１５６】
このように各スケール画像毎にそれぞれ全てのスコア画像についてリストデータを生成した後、当該各リストデータに含まれるスコア画像同士を比較して、相互に重なり合う領域を除去した顔決定データを求めることにより、元のフレーム画像から顔画像を検出することができる。
【０１５７】
このような顔検出タスク処理のうち特にテンプレートマッチング処理は、比較的構成が簡易な演算器にもたやすく実装できる上に、画像圧縮等で利用されるブロックマッチングの手法と類似する処理であることからＣＰＵを用いた高速処理を行うハードウェアが数多く存在する。従ってテンプレートマッチング処理に関してはさらなる高速化が可能である。
【０１５８】
以上の構成によれば、このロボット装置２００において、ＣＣＤカメラ５０により撮像したフレーム画像について顔画像を検出する顔検出タスク処理の際、当該フレーム画像を相異なる縮小率で縮小した各スケール画像の中からそれぞれ所定サイズのウィンドウ画像を所定画素ずつずらすようにスキャンしながら１枚ずつ切り出した後、平均的な顔画像を表すテンプレートを用いてマッチングをとって大まかに顔画像であるか否かを判断するようにして、明らかに顔画像でないウィンドウ画像を除去するようにしたことにより、当該テンプレートマッチングで顔画像であると判断されたスコア画像に対する種々の顔検出処理に要する演算量及び時間をその分減少させることができ、ロボット装置２００全体の制御を司る主制御部３８１の処理負担を軽減させることができ、かくしてリアルタイム性を格段と向上し得るロボット装置２００を実現できる。
【０１５９】
（２−２）第２の適用例
（２−２−１）ロボットの内部構成
第２の適用例において、胴体部ユニット２０２（図１）の胴体下部を形成する腰ベースの背面側には、図５との対応部分に同一符号を付した図８に示すように、上述した主制御部３８１に加えて、この主制御部３８１と接続されたバスを介して内部メモリ３８１Ａ、ＤＭＡ（Ｄｉｒｅｃｔ　Ｍｅｍｏｒｙ　Ａｃｃｅｓｓ）コントローラ７０及び演算処理部（ＤＳＰ（Ｄｉｇｉｔａｌ　Ｓｉｇｎａｌ　Ｐｒｏｃｅｓｓｏｒ）及び画像処理用エンジン等）７１が相互に接続されており、それ以外は第１の適用例と同様に構成されている。
【０１６０】
この場合、主制御部３８１のみが内部メモリ３８１Ａの管理を行うのではなく、ＤＭＡコントローラ７０が主制御部３８１又は演算処理部７１にデータを転送する。一方、当該主制御部３８１又は演算処理部７１から得られた演算結果をＤＭＡコントローラ７０を介して内部メモリ３８１Ａに格納するようになされている。
【０１６１】
このようにデータ転送をＤＭＡコントローラ７０が行うと共に、データ演算を主制御部３８１又は演算処理部７１に振り分けて互いに独立に行わせることにより、主制御部３８１にかかる負担を低減させ得るようになされている。
【０１６２】
（２−２−２）顔検出タスク機能に関する主制御部３８１の処理
第２の適用例の場合、顔検出タスク機能は、ＤＭＡコントローラ７０を介した主制御部３８１及び演算処理部７１における各種処理により実現されている。
【０１６３】
ここで、かかる顔検出タスク機能に関する主制御部３８１及び演算処理部７１の各処理内容を機能的に分類すると、図９に示すように、入力画像スケール変換部８０、ウィンドウ切出部８１、テンプレートマッチング部８２、重なり判定部８３、スケール変換及び切出部８４、前処理部８５及びパターン識別部８６に分けることができる。
【０１６４】
このうち入力画像スケール変換部８０、ウィンドウ切出部８１及びテンプレートマッチング部８２は、演算処理部７１が処理すると共に、重なり判定部８３、スケール変換及び切出部８４、前処理部８５及びパターン識別部８６は、主制御部３８１が処理するようになされている。演算処理部７１と主制御部３８１との処理の切り換えはＤＭＡコントローラ７０が行う。
【０１６５】
先ず、入力画像スケール変換部８０は、ＣＣＤカメラ５０（図８）からの画像信号Ｓ１Ａに基づくフレーム画像を内部メモリ３８１Ａから読み出して、当該フレーム画像を縮小率が相異なる複数のスケール画像（上述した第１〜第５のスケール画像）に変換する。
【０１６６】
続くウィンドウ切出部８１は、第１〜第５のスケール画像のうち、まず第１のスケール画像に対して、画像左上を起点として順に画像右下まで、適当な画素（例えば２画素）分を飛ばしながらスキャンするようにして、４００（＝２０×２０）画素の矩形領域でなるウィンドウ画像を順次切り出す。
【０１６７】
その際、ウィンドウ切出部８１は、第１のスケール画像から切り出した複数のウィンドウ画像のうち先頭のウィンドウ画像を後段のテンプレートマッチング部８２に送出する。
【０１６８】
テンプレートマッチング部８２は、ウィンドウ切出部８１から得られた先頭のウィンドウ画像について、平均的な顔画像をテンプレートとし、大まかなマッチングをとって、当該テンプレートとの相関度（テンプレートのうちマッチングする画素数の割合）を相関度データとして当該ウィンドウ画像と共にＤＭＡコントローラ７０を介して内部メモリ３８１Ａに格納する。
【０１６９】
具体的には、図１０に示すように、任意のスケール画像Ｆ１から切り出したウィンドウ画像Ｗ１について、平均的な顔画像であるテンプレートＴ１をとり、当該テンプレートＴ１との相関度をとった結果として、当該相関度を表す画像Ｒ１が検出され、これを相関度データとして内部メモリ３８１Ａに格納するようになされている。
【０１７０】
この第２の適用例の場合には、ウィンドウ画像について顔画像であると判断されたスコア画像以外をふるい落とす第１の適用例とは異なり、高速処理を実現すべく、当該ウィンドウ画像をスコア画像か否かにかかわりなく後段の重なり判定部８３に送出する。
【０１７１】
この後、テンプレートマッチング部８２は、ウィンドウ切出部８１に対して、第１のスケール画像のうち先頭のウィンドウ画像のテンプレートマッチングが終了した旨を通知することにより、当該ウィンドウ切出部８１から第１のスケール画像のうち次にスキャンされたウィンドウ画像をテンプレートマッチング部８２に送出させる。
【０１７２】
このようにテンプレートマッチング部８２は、ウィンドウ切出部８１において第１のスケール画像から切り出した全てのウィンドウ画像について、スキャン順にそれぞれ相関度データを生成する。
【０１７３】
この後、テンプレートマッチング部８２は、入力画像スケール変換部８０に対して、第１のスケール画像の顔検出が終了した旨を通知することにより、当該入力画像スケール変換部８０から第２のスケール画像をウィンドウ切出部８１に送出させる。
【０１７４】
そして第２のスケール画像についても、上述した第１のスケール画像と同様の処理を行って、当該第２のスケール画像から全てのウィンドウ画像に対応する相関度データを生成した後、第３〜第５のスケール画像についても同様の処理を順次行う。
【０１７５】
かくしてテンプレートマッチング部８２は、撮像画像であるフレーム画像を５段階に縮小した第１〜第５のスケール画像について、それぞれ切出した複数のウィンドウ画像に対応する相関度データを生成して、当該ウィンドウ画像と共にＤＭＡコントローラ７０を介して内部メモリ３８１Ａに格納する。
【０１７６】
以上が演算処理部７１における処理内容であり、ＤＭＡコントローラ７０の制御に応じて、主制御部３８１に切り換えられ、以下は当該主制御部３８１における処理内容となる。
【０１７７】
先ず、重なり判定部８３は、図１１の重なり判定処理手順Ｒ１１をステップＳＰ０から実行し、第１〜第５のスケール画像毎に、当該各スケール画像内のウィンドウ画像を切り出す際のスキャン順と同じ順番で、ウィンドウ画像をこれに付随する相関度データと共にＤＭＡコントローラ７０を介して内部メモリ３８１Ａから読み出す（ステップＳＰ１）。
【０１７８】
続いて重なり判定部８３は、該当するスケール画像における複数のウィンドウ画像について、スキャン順にウィンドウ画像同士を上述した式（２３）を用いて比較して、相互に重なり合う領域を含むか否かを判定する（ステップＳＰ２）。
【０１７９】
そして重なり判定部８３は、相互に重なり合う領域を含む場合のみ、当該ウィンドウ画像同士をそれぞれ対応する相関度の大小で比較して、相関度の大きいほうをスコア画像として、内部メモリ３８１Ａ内に設定する候補リストに追加して格納する（ステップＳＰ３）。
【０１８０】
そして重なり判定部８３は、該当するスケール画像内の全てのウィンドウ画像について、相関度を比較した場合の大きい方をスコア画像として候補リストに追加する（ステップＳＰ４）。一方、小さい方は候補リストに加えることなく又は既に候補リストに存在する場合には削除する（ステップＳＰ５）ようにして順次候補リストの書き換えを行う。
【０１８１】
やがて重なり判定部８３は、該当するスケール画像内の全てのウィンドウ画像について、比較的相関度が高い複数のスコア画像を候補リストとして決定した後、他のスケール画像についても比較的相関度が高い複数のスコア画像を候補リストとして決定する。なお、候補リストとして追加される比較的相関度が高いスコア画像の数は、各スケール画像毎にそれぞれ１０以下に絞られることが実験結として得ることができた。
【０１８２】
このようにして重なり判定部８３では、第１〜第５のスケール画像について、スコア画像を相関度の最も高いピーク値のみならず、そのピーク値の近傍領域をも対象として候補リストに加えるようにしたことにより、テンプレートマッチング部８２でのマッチング結果と後段のパターン識別部８６におけるサポートベクタマシンの処理結果とのずれをその分低下させることができる。
【０１８３】
実際上、かかる重なり判定部８３における候補リストの作成処理は、数回の加算及び乗算によって実行可能であるため、後段の前処理部８５及びパターン識別部８６による演算処理と比較すると、極めて高速に短時間に処理することができる。そして候補リストとして残ったスコア画像について前処理部８５及びパターン識別部８６による処理を行えば、当該前処理部８５及びパターン識別部８６においても処理負担を軽減させることができる。
【０１８４】
この後、スケール変換及び切出部８４は、重なり判定部８３において各スケール画像毎に候補リストとして内部メモリ３８１Ａに格納された複数のスコア画像について、当該内部メモリ３８１ＡからＤＭＡコントローラ７０を介して第１のスケール画像内でかつ候補リストの決定順に１つずつスコア画像を読み出して、前処理部８５に送出する。
【０１８５】
前処理部８５は、スケール変換及び切出部８４から得られたスコア画像について、矩形領域でなる当該スコア画像（４００画素）から人間の顔画像とは無関係である背景部分に相当する４隅の領域を除去するために、当該４隅の領域を切り取ったマスクを用いて３６０画素分を抽出する。
【０１８６】
そして前処理部８５は、撮像時の照明により濃淡で表される被写体の傾き条件を解消すべく、当該抽出した３６０画素分のスコア画像のうち顔画像として最適な部位を基準とする平面を形成するように画素単位で濃淡値に補正をかける。
【０１８７】
続いて前処理部８５は、当該３６０画素分のスコア画像のコントラストを強調した結果をヒストグラム平滑化処理を行うことにより、ＣＣＤカメラ５０のゲインや照明の強弱によらずに検出できるようにする。
【０１８８】
次いで、前処理部８５は、上述したガボア・フィルタリング処理を行うことにより、当該３６０画素分のスコア画像をベクトル変換し、得られたベクトル群を更に１本のパターンベクトルに変換する。
【０１８９】
パターン識別部８６は、上述したサポートベクタマシンを用いて、前処理部８５からパターンベクトルとして得られた３６０画素分のスコア画像に試して顔か否かの識別を学習しながら行う。
【０１９０】
このようにパターン識別部８６は、前処理部８５から与えられたスコア画像に基づくパターンベクトルについて、当該スコア画像内での顔データに相当する画像領域の位置（座標）及びその大きさ（画素数）と、当該スコア画像の切出し元となるスケール画像のフレーム画像に対する縮小率（すなわち上述の５段階のうちの該当する段階）とをリスト化し、これをリストデータとして内部メモリ３８１Ａに格納する。
【０１９１】
この後、パターン識別部８６は、スケール変換及び切出部８４に対して、第１のスケール画像のうち最初のスコア画像の顔検出が終了した旨を通知することにより、当該スケール変換及び切出部８４から第１のスケール画像のうち次のスコア画像を前処理部８５に送出する。前処理部８５は、当該スコア画像をパターンベクトルに変換してパターン識別部８６に送出する。パターン識別部８６は、パターンベクトルから得られた顔データに基づいてリストデータを生成して内部メモリ３８１Ａに格納する。
【０１９２】
このようにスケール変換及び切出部８４において第１のスケール画像内で候補リストにある全てのスコア画像について、順番に前処理部８５及びパターン識別部８６の各処理を行うことにより、当該第１のスケール画像から撮像結果に存在する顔画像を検出することができる。
【０１９３】
この後、パターン識別部８６は、スケール変換及び切出部８４に対して、第１のスケール画像の顔検出が終了した旨を通知することにより、当該第２のスケール画像についても、上述した第１のスケール画像と同様の処理を行って、当該第２のスケール画像から撮像結果に存在する顔画像を検出した後、第３〜第５のスケール画像についても同様の処理を順次行う。
【０１９４】
かくしてパターン識別部８６は、撮像画像であるフレーム画像を５段階に縮小した第１〜第５のスケール画像について、当該撮像画像内に存在する顔画像をそれぞれ検出した後、その結果得られた顔データに基づいてそれぞれリストデータを生成して内部メモリ３８１Ａに格納する。
【０１９５】
（２−２−３）第２の適用例における動作及び効果
以上の構成において、このロボット装置２００では、ＣＣＤカメラ５０により撮像したフレーム画像を縮小率が相異なる複数のスケール画像に変換した後、当該各スケール画像の中からそれぞれ所定サイズのウィンドウ画像を所定画素ずつずらすようにスキャンしながら１枚ずつ切り出す。
【０１９６】
このウィンドウ画像について、平均的な顔画像を表すテンプレートを用いてマッチングをとって当該テンプレートとの相関度を表す相関度データを生成する。このように各スケール画像毎にそれぞれ全てのウィンドウ画像についてスキャン順にそれぞれ相関度データを生成する。
【０１９７】
ここまでを演算処理部７１において実行すると共にこれ以降の処理を主制御部３８１において実効することにより、その分当該主制御部３８１の処理負担を軽減することができる。
【０１９８】
続いて各スケール画像における複数のウィンドウ画像について、スキャン順にウィンドウ画像同士を比較して、相互に重なり合う領域を含む場合には、更に相関度を比較して大きい方のみをスコア画像として残るように候補リストに追加する。この結果、各スケール画像毎に、比較的相関度が高い複数のスコア画像を候補リストとして決定することができ、後段の顔識別処理の際に識別精度を一層向上させることができる。
【０１９９】
続いて各スケール画像毎に候補リストとして内部メモリ３８１Ａに格納された複数のスコア画像について、当該スコア画像の矩形領域の４隅部分を除去した後、濃淡補正及び続くコントラスト強調の平滑化を行い、更に１本のパターンベクトルに変換する。
【０２００】
そして当該パターンベクトルについて、元のスコア画像内での顔検出を行って顔データ又は非顔データを判断し、顔データが存在するスコア画像の画像領域の位置（座標）及びその大きさ（画素数）と、当該スコア画像の切出し元となるスケール画像のフレーム画像に対する縮小率とをリスト化したリストデータを生成する。
【０２０１】
このように各スケール画像毎にそれぞれ全てのスコア画像についてリストデータを生成することにより、元のフレーム画像から顔画像を検出することができる。
【０２０２】
かくして顔検出タスク処理に関して、ロボット装置２００の全体の制御を司る主制御部３８１に加えて、演算処理部７１を設け、内部メモリ３８１Ａに対するデータの読み書きをＤＭＡコントローラ７０の制御を介して主制御部３８１及び演算処理部７１の双方が行い得るようにしたことにより、当該主制御部３８１及び演算処理部７１がそれぞれ処理を分配することができる分、主制御部３８１にかかる演算量及び演算時間の負担を格段と低減させることができる。
【０２０３】
以上の構成によれば、このロボット装置２００において、ＣＣＤカメラ５０により撮像したフレーム画像について顔画像を検出する顔検出タスク処理の際、当該フレーム画像を相異なる縮小率で縮小した各スケール画像の中からそれぞれ所定サイズのウィンドウ画像を所定画素ずつずらすようにスキャンしながら１枚ずつ切り出した後、平均的な顔画像を表すテンプレートを用いてマッチングをとって当該テンプレートとの相関度を表す相関度データをそれぞれ生成し、当該ウィンドウ画像同士で相互に重なり合う領域を含む場合には相関度を比較して大きい方のみを候補リストに残るようにしたことにより、当該候補リストにある比較的相関度が高い複数のスコア画像に対する種々の顔検出処理に要する演算量及び時間をその分減少させることができるのみならず、顔識別処理の際の識別精度を一層向上させることができる。
【０２０４】
これに加えて顔検出タスク処理に関して、ロボット装置２００の全体の制御を司る主制御部３８１と演算処理部７１とで処理を分担して行うようにしたことにより、当該主制御部３８１にかかる演算量及び演算時間の負担を格段と低減させることができ、かくしてリアルタイム性を格段と向上し得るロボット装置２００を実現できる。
【０２０５】
（３）実施の形態
次に、本発明の実施の形態について説明する。上述したように、テンプレートマッチングを行って顔候補を抽出し（第１の工程）、この顔候補の中からＳＶＭ等により顔領域を判定して（第２の工程）顔領域を検出する方法において、第１の工程においては、単純に正規化相関値の代償により顔候補を決定しているため、顔候補の見逃しを軽減しようとした場合、閾値を上げる方法又は間引きを減らす方法をとることができるものの、閾値を下げると演算量が増大してしまい、ロボット装置等のリソースの限られた環境においては好ましくない場合がある。一方、閾値を上げると、第２の工程において顔判定するための候補画像が減るため、演算量を減らすことができるものの、本来顔である画像も候補画像から取り除いてしまい、顔画像を見逃してしまう場合がある。
【０２０６】
そこで、本願発明者等は、テンプレートと同一サイズの顔領域（顔画像）が入力画像内に存在する場合、この顔画像とテンプレートとの相関をとれば、テンプレートサイズ近傍では最も相関値が大きくなることに着目し、顔領域の候補を絞り込む際に、局所的な絞り込みを行うアルゴリズムを使用することにより、本来顔である画像を見逃すことなく顔候補画像を低減して後段の第２の工程にて顔判定する計算量を低減する方法を見出した。具体的には、入力画像と所定サイズの平均顔のテンプレートとの正規化相関をとった相関値の集合であるマッチング結果における相関値の局所最大値に基づき候補となる顔領域を抽出するものである。以下、この顔検出方法について詳細に説明する。
【０２０７】
本実施の形態における顔検出装置は、上述の第１及び第２の適用例と同様の２足歩行のロボット装置等のＣＰＵ及びメモリ等のリソースが限られたシステムにおいて、性能を保持しつつ顔検出に要する計算量を削減する場合に好適である。本実施の形態においても、上述の第１及び第２の適用例と同様、図１乃至図４に示すロボット装置２００に搭載されるものとし、その内部構成は、上述の適用例と同様とし、その詳細な説明は省略する。
【０２０８】
（３−１）顔検出タスク機能に関する主制御部３８１の処理
本実施の形態におけるロボット装置２００に搭載された顔検出タスク機能について説明する。上述した如く、ロボット装置２００には、ＣＣＤカメラ５０を介して内部メモリ３８１Ａに記憶されるフレーム画像について、当該フレーム画像の中から人間の顔画像を検出する顔検出タスク機能が搭載されている。そしてこの顔検出タスク機能は、主制御部３８１における各種処理により実現されている。
【０２０９】
ここで、かかる顔検出タスク機能に関する主制御部３８１の処理内容を機能的に分類すると、図１２に示すように、テンプレートサイズ及び入力画像スケール変換決定部９０、ウィンドウ切り出し部９１、テンプレートマッチング部９２、スケール変換及び切り出し部９３、前処理部９４、及びパターン識別部９５に分けることができる。
【０２１０】
入力画像スケール変換決定部９０は、第１の適用例と同様に、ＣＣＤカメラ５０（図５）からの画像信号Ｓ１Ａに基づくフレーム画像を内部メモリ３８１Ａから読み出して、当該フレーム画像を縮小率が相異なる複数のスケール画像（第１〜第５のスケール画像）に変換すると共に、後段のテンプレートマッチング部９２で使用するテンプレートサイズ、即ち、顔画像のサイズを選択する（以下、これを第１のテンプレートサイズという）。
【０２１１】
続くウィンドウ切出部９１は、第１〜第５のスケール画像のうち、まず第１のスケール画像に対して、画像左上を起点として順に画像右下まで、適当な画素（例えば２画素）分を飛ばしながらスキャンするようにして、４００（＝２０×２０）画素の矩形領域でなるウィンドウ画像を順次切り出す。
【０２１２】
その際、ウィンドウ切出部９１は、第１のスケール画像から切り出した複数のウィンドウ画像のうち先頭のウィンドウ画像を、スケール変換決定部９１で選択した第１のテンプレートサイズのテンプレートと共に後段のテンプレートマッチング部９２に送出する。
【０２１３】
テンプレートマッチング部９２は、ウィンドウ切出部９１から得られた先頭のウィンドウ画像について、第１のテンプレートサイズを有する平均的な顔画像をテンプレートとして大まかなマッチングをとり、当該テンプレートとの相関値（テンプレートのうちマッチングする画素数の割合）の集合をマッチング結果として当該ウィンドウ画像と共にＤＭＡコントローラ７０を介して内部メモリ３８１Ａに格納する。
【０２１４】
即ち、図１３（ａ）に示すように、任意のスケール画像から切り出した、例えば高さ（ｙ軸方向の辺の長さ）ｈｅｉ＿ｓ×幅（ｘ軸方向の辺の長さ）ｗｅｉｄ＿ｓのウィンドウ画像（スケール変換後の入力画像）Ｗ２について、図１３（ｂ）に示すように、例えば高さｈｅｉ＿ｔ×幅ｗｉｄ＿ｓである第１のテンプレートサイズを有する平均的な顔画像であるテンプレートＴ２を使用し、ウィンドウ画像Ｗ２をスキャンし、所定画素（例えば１画素）ずつずらしながら移動させたテンプレートＴ２と上記入力画像との相関値の集合であるマッチング結果を求める。このマッチング結果は、テンプレートＴ２の移動に伴い相関値が２次元に配列されたものであり、図１４に示すように、当該相関値を表す高さｈｅｉ＿ｒ×幅ｗｉｄ＿ｒのテンプレートマッチング結果画像Ｒ２が得られる。ここで、テンプレートレートマッチング結果画像Ｒ２の高さｈｅｒ＿ｒは、ｈｅｉ＿ｓ−（ｈｅｉ＿ｔ＋１）であり、画像Ｒ２の幅ｗｉｄ＿ｓは、ｗｉｄ＿ｓ−（ｗｉｄ＿ｔ＋１）となる。
【０２１５】
次に、このテンプレートレートマッチング結果画像Ｒ２を所定のサイズ、例えば第１のテンプレートサイズと同一の大きさに分割し、各第１のテンプレートサイズに仕切られた分割領域毎に相関値の最大値を有する点（位置）を求め、これら各分割領域から得られた最大値を示す点のうち、所定の閾値以上のものを顔候補として抽出する。
【０２１６】
即ち、平均顔のテンプレートを使用して正規化相関をしようした場合、必ずしも任意のパターンより、顔画像の方が相関値が高くなるという保証はないものの、テンプレートと同一のサイズの顔画像が存在する場合は、テンプレートサイズ近傍の大きさでは相関値が最大値をとることから、相関値が分割領域内で最大値となり、且つ所定の閾値以上の点を顔候補として抽出することにより、単にテンプレートマッチングの結果、相関値が所定の閾値以上であるものを顔候補として抽出する場合に比して、顔候補をより有効に絞り込むことができる。
【０２１７】
次に、この顔候補として抽出された点を顔候補（第１の顔候補）とすると共に、この顔候補近傍の点も顔候補（第２の顔候補）として抽出する。後段のパターン認識部９５において、上述したＳＶＭを使用して顔判定する場合、ＳＶＭは１画素ずれるのみで顔検出ができなくなるほど敏感であるため、本実施の形態においては、顔候補の近傍も検索範囲（第２の顔候補）とすることにより、顔候補の近傍を重点的に検索して顔検出性能を向上することができる。ここで第１の顔候補として抽出された点の上下左右に隣接する８点全てを検索範囲とすると、後段の計算量が増加するため、所定の条件を満たす場合にのみ、検索範囲に指定することにより、後段の計算量の増加を抑えつつ検出性能を向上させることができる。
【０２１８】
即ち、第１の顔候補として抽出された点に隣接する点において、これらの点に対応するテンプレートサイズの入力画像領域において、肌色領域の占有率（肌色占有率）が所定の閾値以上である場合、又は、予め検出された（又は予め学習された）顔色情報を有し、この顔色領域の占有率（顔色占有率）が所定の閾値以上である場合にのみ、その点を検索範囲に設定することができる。ここで、肌色占有率は、例えば、肌色カラーテーブルを使用し、この肌色と比較することにより求めることができる。
【０２１９】
この後、テンプレートマッチング部９２は、ウィンドウ切出部９１に対して、第１のスケール画像のうち先頭のウィンドウ画像のテンプレートマッチングが終了した旨を通知することにより、当該ウィンドウ切出部９１から第１のスケール画像のうち次にスキャンされたウィンドウ画像をテンプレートマッチング部９２に送出させる。
【０２２０】
このようにテンプレートマッチング部９２は、ウィンドウ切出部９１において第１のスケール画像から切り出した全てのウィンドウ画像について、スキャン順にそれぞれ顔候補を検出する。
【０２２１】
以上が演算処理部７１における処理内容であり、ＤＭＡコントローラ７０の制御に応じて、主制御部３８１に切り換えられ、以下は当該主制御部３８１における処理内容となる。
【０２２２】
スケール変換及び切出部９３は、各スケール画像毎に候補リストとして内部メモリ３８１Ａに格納された複数のスコア画像について、当該内部メモリ３８１ＡからＤＭＡコントローラ７０を介して第１のスケール画像内でかつ候補リストの決定順に１つずつスコア画像を読み出して、前処理部９４に送出する。
【０２２３】
前処理部９４は、スケール変換及び切出部９３から得られたスコア画像について、矩形領域でなる当該スコア画像（４００画素）から人間の顔画像とは無関係である背景部分に相当する４隅の領域を除去するために、当該４隅の領域を切り取ったマスクを用いて３６０画素分を抽出する。
【０２２４】
そして前処理部９４は、撮像時の照明により濃淡で表される被写体の傾き条件を解消すべく、当該抽出した３６０画素分のスコア画像のうち顔画像として最適な部位を基準とする平面を形成するように画素単位で濃淡値に補正をかける。
【０２２５】
続いて前処理部９４は、当該３６０画素分のスコア画像のコントラストを強調した結果をヒストグラム平滑化処理を行うことにより、ＣＣＤカメラ５０のゲインや照明の強弱によらずに検出できるようにする。
【０２２６】
次いで、前処理部９４は、上述したガボア・フィルタリング処理を行うことにより、当該３６０画素分のスコア画像をベクトル変換し、得られたベクトル群を更に１本のパターンベクトルに変換する。
【０２２７】
パターン識別部（識別手段）９５は、上述したサポートベクタマシンを用いて、前処理部９４からパターンベクトルとして得られた３６０画素分のスコア画像に試して顔か否かの識別を学習しながら行う。
【０２２８】
このようにパターン識別部９５は、前処理部９４から与えられたスコア画像に基づくパターンベクトルについて、当該スコア画像内での顔データに相当する画像領域の位置（座標）及びその大きさ（画素数）と、当該スコア画像の切出し元となるスケール画像のフレーム画像に対する縮小率（すなわち上述の５段階のうちの該当する段階）と、テンプレートのサイズとをリスト化し、これをリストデータとして内部メモリ３８１Ａに格納する。
【０２２９】
この後、パターン識別部９５は、スケール変換及び切出部９３に対して、第１のスケール画像のうち最初のスコア画像の顔検出が終了した旨を通知することにより、当該スケール変換及び切出部９３から第１のスケール画像のうち次のスコア画像を前処理部９４に送出する。前処理部９５は、当該スコア画像をパターンベクトルに変換してパターン識別部９５に送出する。パターン識別部９５は、パターンベクトルから得られた顔データに基づいてリストデータを生成して内部メモリ３８１Ａに格納する。
【０２３０】
このようにスケール変換及び切出部９３において、第１のスケール画像内で候補リストにある全てのスコア画像について、順番に前処理部９４及びパターン識別部９５の各処理を行うことにより、当該第１のスケール画像から撮像結果に存在する顔画像を検出することができる。
【０２３１】
即ち、テンプレートマッチング部９２は、入力画像スケール変換部９０に対して、第１のスケール画像及び第１のテンプレートサイズのテンプレートを使用した顔検出が終了した旨を通知することにより、当該入力画像スケール変換部９０から、第１のスケール画像及び第２のテンプレートサイズのテンプレートをウィンドウ切り出し部９１に送出させる。第２のテンプレートサイズを使用した場合も、上述した第１のテンプレートサイズのテンプレートを使用した場合と同様の処理を行って全てのテンプレートサイズに対応する顔候補を検出した後、テンプレートマッチング部９２は、入力画像スケール変換部９０の第１のスケール画像に対して全てのテンプレートサイズを使用した顔検出が終了した旨を通知することにより、当該入力画像スケール変換決定部９０から第２のスケール画像をウィンドウ切出部９１に送出させる。
【０２３２】
そして第２のスケール画像についても、上述した第１のスケール画像と同様の処理を行って、当該第２のスケール画像から全てのウィンドウ画像に対応する顔候補を検出した後、第３〜第５のスケール画像についても同様の処理を順次行う。
【０２３３】
かくしてテンプレートマッチング部９２は、撮像画像であるフレーム画像を５段階に縮小した第１〜第５のスケール画像及び複数のテンプレートサイズのテンプレートについて、それぞれ切出した複数のウィンドウ画像に対応する顔候補を抽出して、当該ウィンドウ画像と共にＤＭＡコントローラ７０を介して内部メモリ３８１Ａに格納する。
【０２３４】
ここで、本実施の形態においては、任意の大きさのテンプレートを使用することができるが、使用するテンプレートサイズを切り替えて、テンプレートサイズを選択することにより、入力画像に対して準備できる全てのテンプレートサイズに対して演算をする場合に比して、演算量を減らして高効率化することができる。例えば、一度顔が検出された場合に、次に顔検出する際はそのテンプレートサイズを使用することができる。また、例えば、ロボット装置に設けられた距離センサを使用し、この距離センサからの距離情報に基づき入力画像に含まれる対象物との間の距離を認識することにより、対象物の顔領域の大きさを予測してテンプレートサイズを選択する対象距離切り替え手段を設ける等することができ、目的に応じてテンプレートサイズを切り替えることができる。
【０２３５】
（３−２）実施の形態における動作
以上の構成において、このロボット装置２００では、ＣＣＤカメラ５０により撮像したフレーム画像を縮小率が相異なる複数のスケール画像に変換した後、当該各スケール画像の中からそれぞれ所定サイズのウィンドウ画像を所定画素ずつずらすようにスキャンしながら１枚ずつ切り出す。
【０２３６】
このウィンドウ画像について、平均的な顔画像を表すテンプレートを用いてマッチングをとって当該テンプレートとの相関値の集合であるマッチング結果画像を生成する。このように各スケール画像毎にそれぞれ全てのウィンドウ画像についてスキャン順にそれぞれマッチング結果画像を生成する。以下、マッチング結果画像から顔候補を検出する工程について詳細に説明する。
【０２３７】
図１５は、テンプレートマッチング部９２において、テンプレートマッチング結果画像Ｒ２から顔候補となる画素を検出する各処理工程を示すフローチャートである。図１５に示すように、先ず、テンプレートマッチング結果画像Ｒ２が入力されると、マッチング結果画像Ｒ２をテンプレートサイズに分割し、その分割領域の１つ、例えば０≦ｘ≦ｗｉｄ＿ｔ−１、０≦ｙ≦ｈｅｉ＿ｔ−１において、最も相関値が高い点（座標）を抽出する（ステップＳＰ１１）。以下、マッチング結果画像Ｒ２をテンプレートサイズに分割した領域を分割領域ｒｎ、分割領域ｒｎにおいて、相関値が最も大きい点（座標）をｌｏｃａｌ＿ｍａｘ（ｘ，ｙ）という。ここでは、この各分割領域内において最も相関値が高い画素を抽出するが、本実施の形態においては、マッチング結果画像において分割された分割領域を左から右へ一行ずつ順に処理を行う場合について説明する。
【０２３８】
次に、ｌｏｃａｌ＿ｍａｘ（ｘ，ｙ）が所定の閾値（ｔｈ１）より大きいか否かを判定し（ステップＳＰ１２）、大きい場合は、顔候補として追加する（ステップＳＰ１３）。上述した如く、入力画像スケール変換決定部９０においては、スケールと共に入力画像に含まれると想定される顔の大きさのテンプレートサイズを選択するが、テンプレートサイズは異なる大きさの複数種類あり、複数種類ある各テンプレートサイズ毎にマッチング結果画像Ｒ２を算出して顔候補を抽出すると、同一の点が抽出される場合がある。従って、ステップＳＰ１３において、顔候補として同一の点がある場合、即ち、異なるテンプレートサイズで顔候補を抽出した際に既に抽出されている場合はこの点は追加しない。
【０２３９】
次に、顔候補として抽出された点に対応するテンプレートサイズの入力画像領域において、この画像領域内に含まれる肌色画素の占有率を求める。本実施の形態においては、肌色画素の占有率を求める際に、肌色カラーテーブル１００を参照する。そして、この肌色画素占有率が所定の閾値（ｔｈ２）より大きいか否かを判定する（ステップＳＰ１４）。大きい場合は、このｌｏｃａｌ＿ｍａｘ（ｘ，ｙ）の周辺、例えば上下左右の８近傍点を顔候補として追加する（ステップＳＰ１５）。ここで、ステップＳＰ１３と同様に、既にこれらの８近傍点が既に顔候補として抽出されている場合は、候補に追加しない。
【０２４０】
ステップＳＰ１２でｌｏｃａｌ＿ｍａｘ（ｘ，ｙ）が閾値ｔｈ１未満だった場合、ステップＳＰ１４でｌｏｃａｌ＿ｍａｘ（ｘ，ｙ）に相当する入力画像における肌色画素占有率が閾値ｔｈ２未満であった場合、及びステップＳＰ１５で顔候補の追加が終了した後は、いずれもステップＳＰ１６に進み、次の顔候補を抽出するために次の分割領域に移り、処理を進める。
【０２４１】
先ず、マッチング結果画像Ｒ２において、ｘ方向にテンプレートサイズ分、即ち、ｗｉｄ＿ｔだけずれた隣の分割領域に移る（ステップＳＰ１６）。次に、ｗｉｄ＿ｔだけずれたｘ座標（ｘ＋ｗｉｄ＿ｔ）の分割領域において、そのｘ座標がマッチング結果画像の幅（ｘ方向の辺）ｗｉｄ＿ｒより大きい場合は、分割領域がマッチング結果画像に含まれないことを示し、次の行に移り、０≦ｘ≦ｗｉｄ＿ｔ−１であって、ｙ方向にテンプレートサイズ分、即ち、ｈｅｉ＿ｔだけずれた隣の分割領域に移る（ステップＳＰ１８）。次に、分割領域のｙ座標がマッチング結果画像の高さ（ｙ方向の辺）ｈｅｉ＿ｒより大きいか否かを判定し（ステップＳＰ１９）、大きい場合は、マッチング結果画像における全ての分割領域の相関値の最大値を求めたことを示し、処理を終了する。
【０２４２】
一方、ステップＳＰ１７及びステップＳＰ１８において、分割領域がマッチング結果画像に含まれると判定された場合は、再びステップＳＰ１１に戻り、その分割領域内で最も高い相関値を有する点を抽出する。
【０２４３】
本実施の形態においては、マッチング結果画像Ｒ２をテンプレートサイズに区切た分割領域における相関値の最大値を求めているため、ステップＳＰ１６において、隣接する分割領域に移る場合は、ｘ方向にｗｉｄ＿ｔだけずれるものとしたが、マッチング結果画像Ｒ２は、テンプレートサイズ以下のサイズであれば、任意の大きさに分割することができる。その際、分割する画像の大きさの幅（ｘ方向の辺）ｗｉｄ＿ｓｔｅｐ、高さ（ｙ方向）ｈｅｉ＿ｓｔｅｐとすると、ステップＳＰ１６及びステップＳＰ１８において、夫々ｘ方向にｗｉｄ＿ｓｔｅｐ、又はｙ方向にｈｅｉ＿ｓｔｅｐ移動することにより、次の分割領域に進むことができる。
【０２４４】
こうして、テンプレートマッチング部９２において、顔候補が抽出される。ここまでを演算処理部７１において実行すると共にこれ以降の処理を主制御部３８１において実効することにより、その分当該主制御部３８１の処理負担を軽減することができる。以降のスケール変換及び切り出し部９３、前処理部９４及びパターン認識部９５を経て顔データを抽出する処理は、上述の第２の適用例と同様である。
【０２４５】
（３−３）本実施の形態の効果
図１６は、テンプレートマッチング部９２において、ウィンドウ画像Ｗ２から顔候補として検出された点を示す図である。図１６において、白で示す点が、図１４に示すマッチング結果画像Ｒ２から顔候補として抽出された点である。比較として、図１７は、マッチング結果画像Ｒ２において、閾値以上である点を全て顔候補として抽出した例を示す図である。図１７に示す図と比較すると、本実施の形態において、テンプレートマッチング部９２にて顔候補として抽出される点が飛躍的に少なくなっているのがわかる。これにより、後段の処理における計算量を飛躍的に削減することができる。
【０２４６】
本実施の形態においては、ウィンドウ画像について、平均的な顔画像を表すテンプレートを用いてマッチングをとって大まかに顔画像であるか否かを判断する際に、テンプレートマッチング結果画像を所定のサイズに仕切り、相関値の最大値を顔候補として抽出して明らかに顔画像でないウィンドウ画像を除去することにより、本来顔である領域を見逃すことなく、後段の顔検出処理に要する演算量及び時間を減少させることができ、かくしてリアルタイム性を格段と向上た顔検出装置及びこれを搭載したロボット装置を実現することができる。
【０２４７】
また、相関値が最大となる点と共にその周囲においても顔検索範囲とすることにより、顔検出精度を向上することができる。更に、所定の閾値以上の肌色占有率又は顔の色占有率を有する場合のみ、顔検索範囲として設定することにより、顔検出精度を保ちつつ顔候補を減らして後段の演算量を減らすことができる。更にまた、テンプレートのサイズを適宜切り替えることにより、更に演算量を減らすことができる。
【０２４８】
（４）他の実施の形態
なお上述の実施の形態においては、本発明を図１のように構成された２足歩行型のロボット装置２００に適用するようにした場合について述べたが、本発明はこれに限らず、この他種々のロボット装置及びロボット装置以外のこの他種々の装置に広く適用することができる。例えば、ロボット装置は、４足歩行であってもよく、更に、移動手段は、脚式移動方式に限定されない。
【０２４９】
また、ＣＣＤカメラ（撮像手段）５０による撮像結果として得られるフレーム画像を複数の異なる縮小率でなるスケール画像に縮小変換するスケール変換手段として、図１２に示す主制御部３８１の処理機能のうちの入力画像スケール変換部９０を適用するようにした場合について述べたが、本発明はこれに限らず、要は、フレーム画像を複数の異なる縮小率で縮小変換した複数のスケール画像を得ることができれば、この他種々の構成のものを適用するようにしてもよい。
【０２５０】
更に、１枚のフレーウ画像を０．８倍ずつ縮小変換した５種類のスケール画像を適用するようにしたが、主制御部３８１の処理能力の範囲内であれば、２〜４種類でも６種類以上でも良く、また縮小率も自由に設定するようにしてもよい。この場合、複数のスケール画像を基にして顔検出処理を行えば、撮像結果であるフレーム画像に実際に写っている顔のサイズにかかわりなく（すなわちロボット装置２００と人間との距離）、当該フレーム画像から直接顔画像を検出する場合よりも極めて高い確率で顔画像を検出することができる。
【０２５１】
また、ＣＣＤカメラ（撮像手段）５０による撮像結果として得られるフレーム画像内を、４００画素のウィンドウ画像を１画素ずつずらすようにスキャンしながら順次抽出する抽出手段を、図１２に示す主制御部３８１の処理機能のうちのウィンドウ切出部９１により構成するようにした場合について述べたが、本発明はこれに限らず、要は、フレーム画像又はスケール画像から所定サイズのウィンドウ画像を所定画素ずつずらすようにスキャンしながら順次抽出する得ることができれば、この他種々の構成のものを適用するようにしてもよい。
【０２５２】
この場合、スキャンの順番は矩形画像の左上位置から右下位置まで以外にも当該矩形画像全体を走査することができればどのように設定してもよい。またスキャン時にずらす画素数は１画素でも３画素以上でも良く、ウィンドウ画像のサイズも４００画素以外に種々の縦横比の所望の画素数に設定するようにしてもよい。
【０２５３】
更に、ウィンドウ画像から当該顔画像を識別する識別手段として、図１２に示す主制御部３８１の処理機能のうちの前処理部９５及びパターン識別部９６を適用するようにした場合について述べたが、本発明はこれに限らず、顔画像を検出することができればこの他種々の方法を用いた構成のものに広く適用することができる。
【０２５４】
更にまた、抽出されたウィンドウ画像を、平均的な顔画像を表すテンプレートを用いてマッチングをとり、当該テンプレートとの相関度を検出する相関度検出手段として、図１２に示す主制御部３８１の処理機能のうちのテンプレートマッチング部９２を適用するようにした場合について述べたが、本発明はこれに限らず、要は、当該テンプレートを基準として相関度を検出することができれば、この他種々の方法を用いた構成のものに広く適用することができる。
【０２５５】
更に、ロボット装置２００の制御全体を司る制御手段として主制御部３８１を適用すると共に、当該主制御部３８１と接続され、その制御に応じて情報を読み書きするメモリ手段として内部メモリ３８１Ａを適用し、更に図１２に示す主制御部３８１の処理機能のうち、入力画像スケール変換部９０、ウィンドウ切出部９１、テンプレートマッチング部９２による各処理を実行する演算処理手段として演算処理部７１（ＤＭＡコントローラ７０）を適用するようにした場合について述べたが、本発明はこれに限らず、主制御部３８１と演算処理部７１とで顔検出処理の処理機能（９０〜９５）を選択的に分担して内部メモリ３８１Ａを介して切り換えて実行することができれば、その際の分担内容は、主制御部３８１にかかる演算量及び演算時間の負担を低減させるように自由に設定することができる。また、主制御部３８１のみで処理を行ってもよい。
【０２５６】
【発明の効果】
以上詳細に説明したように本発明に係る顔検出装置は、入力画像から対象物の顔領域を抽出する顔検出装置において、撮像手段による撮像結果として得られるフレーム画像を入力画像とし、この入力画像と平均的な顔画像を示す所定サイズのテンプレートとの相関をとった相関値の集合であるマッチング結果を生成するマッチング結果生成手段と、上記マッチング結果から相関値の局所最大値を求めこの局所最大値に基づき顔候補を抽出する顔候補抽出手段と、上記顔候補として抽出された点に相当する入力画像領域から顔画像を識別する識別手段とを有するので、顔候補抽出手段により極めて効率よく且つ正確に顔候補を絞り込むことができるため、後段の顔識別手段における演算量を低減して、リアルタイムの顔検出を可能とする。
【０２５７】
本発明に係るロボット装置は、供給された入力情報に基づいて動作を行う自律型のロボット装置において、撮像手段と、上記撮像手段による撮像結果として得られるフレーム画像を入力画像とし該入力画像から人物の顔領域を抽出する顔検出手段を有し、上記顔検出手段は、上記入力画像と平均的な顔画像を示す所定サイズのテンプレートとの相関をとった相関値の集合であるマッチング結果を生成するマッチング結果生成手段と、上記マッチング結果から相関値の局所最大値を求めこの局所最大値に基づき顔候補を抽出する顔候補抽出手段と、上記顔候補として抽出された点に相当する入力画像領域から顔画像を識別する識別手段とを具備するので、顔候補抽出手段により極めて効率よく且つ正確に顔候補を絞り込み、顔検出の演算量を低減して、リアルタイムの顔検出を可能とすることにより、ロボット装置のエンタテーメント性を向上することができる。
【図面の簡単な説明】
【図１】本発明の実施の形態における人間型ロボット装置を前方からみた外観を示す斜視図である。
【図２】本発明の実施の形態における人間型ロボット装置を後方からみた外観を示す斜視図である。
【図３】本発明の実施の形態における人間型ロボット装置の自由度構成モデルを示す模式図である。
【図４】本発明の実施の形態における人間型ロボット装置の制御システム構成を示す模式図である。
【図５】第１の適用例におけるロボット装置の内部構成の説明に供するブロック図である。
【図６】第１の適用例におけるロボット装置の顔検出手段を示すブロック図である。
【図７】重なり判定処理の説明に供する略線的な平面図である。
【図８】第２の適用例におけるロボット装置の内部構成の説明に供するブロック図である。
【図９】第２の適用例におけるロボット装置の顔検出手段を示すブロック図である。
【図１０】テンプレートマッチングによる相関性の検出の説明に供する模式図である。
【図１１】重なり判定処理手順をその工程順に示すフローチャートである。
【図１２】本発明の実施の形態におけるロボット装置の顔検出タスク機能を示すブロック図である。
【図１３】（ａ）及び（ｂ）は、夫々入力画像（ウィンドウ画像）及びテンプレートを示す模式図である。
【図１４】入力画像（ウィンドウ画像）とテンプレートとから求めた相関値の集合であるマッチング結果画像を示す図である。
【図１５】本発明の実施の形態における顔検出手段のテンプレートマッチング部における処理工程を示すフローチャートである。
【図１６】本発明の実施の形態における顔検出タスク機能のテンプレートマッチング部がマッチング結果画像から顔候補を抽出した結果を示す図である。
【図１７】マッチング結果画像において、所定の閾値以上のものを顔候補として抽出した結果を示す図である。
【符号の説明】
１　ロボット装置、　３８１　メイン制御部、　３８１Ａ　内部メモリ、　５０　ＣＣＤカメラ、　６０、８０、９０　入力画像スケール変換部、　６１、８１、９１　ウィンドウ切出部、　６２、８２、９２　テンプレートマッチング部、　６３、８５、９４　前処理部、　６４、８６、９５　パターン識別部、　６５、８３　重なり判定部、　７０　ＤＭＡコントローラ、　７１　演算処理部、８４、９３　スケール変換及び切出部、　ＲＴ１　重なり判定処理手順、　Ｓ１Ａ　画像信号[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention provides a face detection device and method for detecting a face of a target object from an input image, a robot device equipped with the face detection device for improving entertainment, and the like, and causes a computer to execute a face detection operation. And a recording medium on which the program is recorded.
[0002]
[Prior art]
A mechanical device that performs a motion similar to the motion of a human (living organism) using an electric or magnetic action is called a “robot”. Robots have begun to spread in Japan since the late 1960s, and most of them have been industrial robots (Industrial) such as manipulators and transfer robots for the purpose of automation and unmanned production work in factories.
Robot).
[0003]
In recent years, practical robots have been developed to support life as a human partner, that is, to support human activities in various situations in a living environment and other daily lives. Unlike an industrial robot, such a practical robot has the ability to learn a human being having different personalities individually or a method of adapting to various environments in various aspects of a human living environment. For example, a "pet-type" robot that simulates the body mechanism and operation of a four-legged animal such as a dog or a cat, or a body mechanism and operation of a human or the like that walks upright on two legs is designed as a model. Robotic devices such as “humanoid” or “humanoid” robots are already being put to practical use.
[0004]
These robot devices can perform, for example, various operations with an emphasis on entertainment, as compared to industrial robots, and are therefore sometimes referred to as entertainment robots. In addition, such a robot apparatus is equipped with various external sensors such as a CCD (Charge Coupled Device) camera and a microphone, and recognizes an external situation based on the output of these external sensors, and outputs information from the outside and internal information. Some operate autonomously depending on the state.
[0005]
[Problems to be solved by the invention]
By the way, in such an entertainment-type robot apparatus, it is necessary to detect the face of a human being during the dialogue or the face of a human who enters the field of view while moving, and perform the dialogue or operation while looking at the human face. If it is possible, it is considered that it is the most desirable from the viewpoint of naturalness as in the case where humans usually perform, and it is considered that the entertainment property as an entertainment robot device can be further improved.
[0006]
2. Description of the Related Art Conventionally, there have been proposed many methods for detecting a human face using only a light and shade pattern based on an image signal without using a color or a motion from a complicated image scene such as a moving image.
[0007]
As these face detection methods, there is a method of generating a discriminator by learning a face pattern in advance using a pattern recognition technique such as a unique face, a neural network, and a support vector machine (SVM). No.
[0008]
However, according to the method of generating the pattern classifier, the robot apparatus performs pattern recognition by learning on a huge amount of face image data. However, there is a problem that the amount of calculation required for the pattern identification increases and the time required for calculation processing becomes enormous.
[0009]
In the process of actually detecting a human face image from a captured image (hereinafter, referred to as a face detection task), identification is performed while cutting out the face image from the captured image. You will scan at various scales. For this reason, it is extremely important to minimize the arithmetic processing required for one pattern identification at a time.
[0010]
For example, in the case of a face detection task using pattern recognition by a support vector machine, several hundred kinds of support are provided for a 400-dimensional vector obtained from a clipped image of about 400 (= 20 × 20) pixels from a captured image. An inner product operation with a vector (400 dimensions) is required. If this is performed in the entire screen having the size (W, H), the inner product calculation must be repeated (W−20 + 1) × (H−20 + 1) times, resulting in a huge amount of calculation processing. .
[0011]
In addition, when a face detection task is used for a robot apparatus, it is extremely difficult to provide feedback as a behavior of a robot that requires real-time properties unless a face image is detected sufficiently quickly from a moving image. In addition, the CPU inside the robot apparatus has many tasks that are always executed in addition to the face detection task. It is extremely difficult to spend computing power.
[0012]
The present invention has been proposed in view of such a conventional situation, and provides a face detection apparatus and method, a robot apparatus, and a program and a recording medium which can reduce the amount of computation and can significantly improve real-time performance. The purpose is to provide.
[0013]
[Means for Solving the Problems]
In order to achieve the above object, a face detection device according to the present invention, in a face detection device that extracts a face region of a target object from an input image, uses a frame image obtained as an imaging result by an imaging unit as an input image, A matching result generating means for generating a matching result that is a set of correlation values obtained by correlating the input image with a template of a predetermined size representing an average face image; and obtaining a local maximum value of the correlation value from the matching result. The image processing apparatus includes a face candidate extracting unit that extracts a face candidate based on a local maximum value, and an identifying unit that identifies the face region from an input image region corresponding to a point extracted as the face candidate.
[0014]
In the present invention, the fact that the correlation value of template matching becomes largest when a face having the same size as the average face template exists in the input image is used, and a face candidate is determined based on the local maximum value of the correlation value of the matching result. , The number of face candidates can be dramatically reduced while maintaining face detection performance, as compared to a case where a threshold is simply set from the matching result to detect face candidates. Thus, the amount of calculation for face identification in the identification means can be reduced, and the face area can be detected from the input image at extremely high speed.
[0015]
Further, the face candidate extracting means can divide the matching result into a plurality of regions having a size equal to or smaller than a template size, for example, and extract at least a maximum correlation value as a face candidate for each of the divided regions.
[0016]
Further, there is provided template size determining means for determining the size of the template to be input to the matching result generating means from templates of different sizes, wherein the template size determining means selects a template having the same size as the face area detected in advance. Alternatively, information on the distance to the object in the input image is input, and a template is selected based on this distance information, so that the template size can be determined efficiently.
[0017]
Furthermore, the matching result generating means can generate a matching result corresponding to each template by sequentially inputting templates of different sizes, thereby detecting a face of an arbitrary size included in the input image. Can be.
[0018]
In addition, the face candidate extracting unit can further narrow down the face candidates by extracting only the predetermined threshold or more from the maximum values of the correlation values of the respective divided areas as the face candidates.
[0019]
Further, the matching result, which is a set of correlation values, is a set of correlation values between the template and the input image, which are obtained by scanning the input image and shifting the input image by a predetermined pixel. It can be an arranged two-dimensional array.
[0020]
Further, the face candidate extracting means sets a point having the maximum value of the correlation value of each of the divided areas as a first face candidate, and sets a point near the first face candidate as a second face candidate as the first face candidate. By extracting the second face candidate as the face candidate, the face detection performance can be further improved.
[0021]
Further, the face candidate extracting means may be configured such that, of the points near the first face candidate, the occupancy of the skin color area in the input image area of the template size corresponding to the point near the first face candidate is a predetermined threshold. The face color in the input image area of the template size corresponding to the above points or having the face color information previously learned and corresponding to the points near the first face candidate among the points near the first face candidate. A point where the area occupancy is equal to or more than a predetermined threshold value can be extracted as a second face candidate, and thereby, face candidates can be narrowed down while improving face detection performance.
[0022]
A face detection method according to the present invention is a face detection method for extracting a face region of a target object from an input image, wherein a frame image obtained as a result of imaging by an imaging unit is used as an input image, and the input image and an average face image are used. A matching result generation step of generating a matching result that is a set of correlation values obtained by correlating with a template of a predetermined size shown in the table, and obtaining a local maximum value of the correlation value from the matching result to extract a face candidate based on the local maximum value And an identification step of identifying the face area from an input image area corresponding to a point extracted as the face candidate.
[0023]
A robot apparatus according to the present invention is an autonomous robot apparatus that operates based on supplied input information, wherein an imaging unit and a frame image obtained as an imaging result by the imaging unit are used as an input image and a person is extracted from the input image. Face detection means for extracting a face area of the input image. The face detection means generates a matching result which is a set of correlation values obtained by correlating the input image with a template of a predetermined size representing an average face image. Matching result generating means for obtaining a local maximum value of a correlation value from the matching result, extracting face candidates based on the local maximum value, and an input image area corresponding to a point extracted as the face candidate And identifying means for identifying the face region from the above.
[0024]
In addition, a control unit that controls the entire control is provided, and the control unit can selectively execute any one of the processes by the matching result generation unit, the face candidate extraction unit, and the identification unit.
[0025]
A memory unit connected to the control unit for reading and writing information according to the control of the control unit; and a memory unit connected to the memory unit, the matching result generation unit, the face candidate extraction unit, and the identification unit. Arithmetic processing means for executing processing of the processing excluding the processing target of the control means, wherein the control means or the arithmetic processing means can switch and execute the corresponding processing via the memory means. .
[0026]
A program according to the present invention and a recording medium on which the program is recorded can be obtained as an imaging result by an imaging unit in a program for causing a computer to execute an operation of extracting a face area of a person from an input image and a recording medium on which the program is recorded. A frame image as an input image, a matching result generating step of generating a matching result which is a set of correlation values obtained by correlating the input image with a template of a predetermined size representing an average face image, A face candidate extracting step of obtaining a local maximum value of the value and extracting a face candidate based on the local maximum value, and an identifying step of identifying the face area from an input image area corresponding to a point extracted as the face candidate. It is characterized by the following.
[0027]
BEST MODE FOR CARRYING OUT THE INVENTION
(1) Outline of the present invention
A face detection device and a face detection method according to the present invention detect a person's face (face region) from an input image (captured image). This face detection task has two main steps. That is, it includes a first step of extracting a face candidate by performing template matching on an input image, and a second step of determining whether or not the face candidate is a face by SVM or the like.
[0028]
In a system with limited resources such as a CPU and a memory, such as a robot device, in order to perform face detection in real time, it is necessary to reduce the amount of calculation in a face detection task. However, as described above, the face determination using SVM or the like performed in the second step requires an enormous amount of calculation. Therefore, in the face detection task, it is effective to narrow down face candidates in the first step in order to reduce the calculation amount required for face detection while maintaining face detection performance.
[0029]
Here, in the first step, when extracting face candidates by performing template matching, for example, there is a method of extracting a face candidate by setting a threshold for a correlation value of matching. In this method, in order to reduce overlooking of face candidates, a method of increasing the threshold value of the correlation value or a method of reducing thinning out during matching can be adopted. Since candidates are extracted, the amount of calculation increases, which may not be preferable in an environment where resources such as a robot device are limited. On the other hand, when the threshold value is increased, the number of face candidates can be reduced, but an image that is originally a face is also removed from the candidates, and a face image may be missed.
[0030]
Therefore, the present invention proposes a method that can prevent an increase in the amount of calculation, that is, effectively narrow down the number of face candidates without missing an image that is originally a face from face candidates. Specifically, in the first step, a matching result, which is a set of correlation values obtained by correlating the input image with a template representing an average face image of a predetermined size, is generated, and the correlation value in the matching result is generated. Is extracted based on the local maximum value of. For the face area, the matching result is divided into, for example, a plurality of areas having the same size as the template, and at least the maximum correlation value is extracted as a face candidate for each of the divided areas. In the second step, a face area (face image) is identified (determined) by an SVM or the like from the input image area corresponding to the point extracted as the face candidate.
[0031]
(2) Robot configuration
Hereinafter, specific embodiments to which the present invention is applied will be described in detail with reference to the drawings. In this embodiment, the present invention is applied to a bipedal walking robot device equipped with a face detection device. First, a biped walking humanoid robot device will be described as an example of a robot device having face detection means.
[0032]
1 and 2 show the humanoid robot device 200 viewed from the front and the rear, respectively. Further, FIG. 3 schematically shows the configuration of the degrees of freedom of the joints included in the humanoid robot device 200.
[0033]
As shown in FIG. 1 and FIG. 2, the humanoid robot device 200 includes two left and

right leg units

201R and 201L performing legged movement, a body unit 202, and left and

right arm units

203R and 203L. And a head unit 204.
[0034]
Each of the left and

right leg units

201R, 201L includes a

thigh

205R, 205L, a knee joint 206R, 206L, a

shin

207R, 207L, an

ankle

208R, 208L, and a

foot

209R, 209L. The body unit 202 is connected at substantially the lowermost end by the

body

210R and 210L. The left and

right arm units

203R and 203L each include an

upper arm

211R and 211L, an elbow joint 212R and 212L, and a

forearm

213R and 213L. Each of the left and right sides above the body unit 202 is formed by

shoulder joints

214R and 214L. They are connected at the edge. The head unit 204 is connected to the center of the uppermost end of the body unit 202 by a neck joint 255.
[0035]
As shown in FIG. 3, the neck joint between the head unit 204 and the trunk unit 202 includes a neck joint yaw axis 302 supporting the head unit 204, a neck joint pitch axis 303, and a neck joint roll axis 304. It has three degrees of freedom.
[0036]
The arm joints include a shoulder joint pitch axis 308, a shoulder joint roll axis 309, an upper arm yaw axis 310, an elbow joint pitch axis 311, a forearm yaw axis 312, a wrist joint pitch axis 313, and a wrist joint roll. It is composed of a wheel 314 and a hand 315. The hand part 315 is actually a multi-joint and multi-degree-of-freedom structure including a plurality of fingers. However, since the movement of the hand 315 has little contribution or influence on the posture control and walking control of the humanoid robot device 200, it is assumed that the degree of freedom is zero in this specification. Therefore, each arm has seven degrees of freedom.
[0037]
The body unit has three degrees of freedom, that is, a body pitch axis 305, a body roll axis 306, and a body yaw axis 307.
[0038]
Also, each leg unit constituting the lower limb includes a hip joint yaw axis 316, a hip joint pitch axis 317, a hip joint roll axis 318, a knee joint pitch axis 319, an ankle joint pitch axis 320, and an ankle joint roll axis 321. And a foot 322. In this specification, the intersection of the hip joint pitch axis 317 and the hip joint roll axis 318 defines the hip joint position of the humanoid robot device 200. The foot 322 of the human body is actually a structure including a sole with multiple joints and multiple degrees of freedom, but the sole of the humanoid robot device 200 has zero degrees of freedom. Therefore, each leg has six degrees of freedom.
[0039]
Summarizing the above, the entire humanoid robot device 200 has a total of 3 + 7 × 2 + 3 + 6 × 2 = 32 degrees of freedom. However, the humanoid robot device 200 for entertainment is not necessarily limited to 32 degrees of freedom. Needless to say, the degree of freedom, that is, the number of joints, can be appropriately increased or decreased according to design / production constraints and required specifications.
[0040]
Each degree of freedom of the humanoid robot device 200 as described above is actually implemented using an actuator. Actuators are required to be small and lightweight due to demands such as eliminating extra bulges in appearance and approximating the human body shape, and controlling the posture of unstable structures such as bipedal walking. preferable.
[0041]
FIG. 4 schematically shows a control system configuration of the humanoid robot device 200. As shown in the figure, the humanoid robot device 200 has each of the

mechanism units

330, 340, 350R / L, and 360R / L representing human limbs, and adaptive control for realizing a cooperative operation between the mechanism units. (However, each of R and L is a suffix indicating each of right and left. The same applies hereinafter).
[0042]
The operation of the entire humanoid robot device 200 is totally controlled by the control unit 380. The control unit 380 includes a main control unit 381 composed of main circuit components (not shown) such as a CPU (Central Processing Unit) and a memory, and data and commands of a power supply circuit and each component of the humanoid robot device 200. It includes a peripheral circuit 382 including an interface (not shown) for performing transmission and reception. The installation place of the control unit 380 is not particularly limited. In FIG. 4, it is mounted on the body unit 340, but may be mounted on the head unit 330. Alternatively, the control unit 380 may be provided outside the humanoid robot device 200 to communicate with the body of the humanoid robot device 200 by wire or wirelessly.
[0043]
Each joint degree of freedom in the humanoid robot device 200 shown in FIG. 4 is realized by the corresponding actuator. That is, the head unit 330 includes a neck joint yaw axis actuator A representing each of the neck joint yaw axis 302, the neck joint pitch 303, and the neck joint roll axis 304. ₂ , Neck joint pitch axis actuator A ₃ , Neck joint roll axis actuator A ₄ Are arranged.
[0044]
In addition, the head unit 330 is provided with a CCD (Charge Coupled Device) camera for imaging an external situation, a distance sensor for measuring a distance to an object located in front, and a collecting external sound. A microphone for sounding, a speaker for outputting sound, a touch sensor for detecting pressure received by a physical action such as “stroke” and “hit” from the user, and the like are provided.
[0045]
The body unit 340 includes a body pitch axis actuator A representing each of a body pitch axis 305, a body roll axis 306, and a body yaw axis 307. ₅ , Fuselage roll axis actuator A ₆ , Fuselage yaw axis actuator A ₇ Are arranged. The torso unit 340 includes a battery serving as a power source for activating the humanoid robot device 200. This battery is constituted by a chargeable / dischargeable battery.
[0046]
The arm unit 350R / L is subdivided into an upper arm unit 351R / L, an elbow joint unit 352R / L, and a forearm unit 353R / L. The shoulder joint pitch axis 308, the shoulder roll axis 309, and the upper arm Shoulder joint pitch axis actuator A expressing each of yaw axis 310, elbow joint pitch axis 311, forearm yaw axis 312, wrist joint pitch axis 313, and wrist joint roll axis 314 ₈ , Shoulder joint roll axis actuator A ₉ , Upper arm yaw axis actuator A ₁₀ , Elbow joint pitch axis actuator A ₁₁ , Elbow joint roll axis actuator A ₁₂ , Wrist joint pitch axis actuator A ₁₃ , Wrist joint roll axis actuator A ₁₄ Is deployed.
[0047]
The leg unit 360R / L is subdivided into a thigh unit 361R / L, a knee unit 362R / L, and a shin unit 363R / L, but the hip joint yaw axis 316, the hip joint pitch axis 317, and the hip joint Hip joint yaw axis actuator A representing each of roll axis 318, knee joint pitch axis 319, ankle joint pitch axis 320, and ankle joint roll axis 321 ₁₆ , Hip joint pitch axis actuator A ₁₇ , Hip roll axis actuator A ₁₈ , Knee joint pitch axis actuator A ₁₉ , Ankle joint pitch axis actuator A ₂₀ , Ankle joint roll axis actuator A ₂₁ Is deployed. Actuator A used for each joint ₂ , A ₃ ... can more preferably be constituted by a small AC servo actuator of the type directly connected to a gear and of a type in which the servo control system is integrated into one chip and mounted in a motor unit.
[0048]
The

sub-control units

335, 345, 355R / L and 365R / L of the actuator drive control unit are provided for each mechanism unit such as the head unit 330, the body unit 340, the arm unit 350, and each leg unit 360. ing. Further, grounding

confirmation sensors

391 and 392 for detecting whether the soles of the legs 360R and L have landed are mounted, and a posture sensor 393 for measuring the posture is provided in the body unit 340. ing.
[0049]
Each of the ground

contact confirmation sensors

391 and 392 includes, for example, a proximity sensor or a micro switch installed on the sole. The posture sensor 393 is configured by, for example, a combination of an acceleration sensor and a gyro sensor.
[0050]
Based on the outputs of the ground

contact confirmation sensors

391 and 392, it is possible to determine whether each of the left and right legs is in a standing or idle state during an operation such as walking or running. Also, the output of the posture sensor 393 can detect the inclination and posture of the body part.
[0051]
The main control unit 381 can dynamically correct the control target in response to the output of each of the sensors 391 to 393. More specifically, adaptive control is performed on each of the

sub-control units

335, 345, 355R / L, and 365R / L, and the arms, the body, and the legs of the humanoid robot device 200 cooperate. A driving whole body movement pattern can be realized.
[0052]
The whole-body motion of the humanoid robot device 200 on the body sets foot motion, ZMP (ZeroMoment Point) trajectory, torso motion, upper limb motion, waist height, and the like, and instructs motion according to these settings. The command to be transmitted is transferred to each of the

sub-control units

335, 345, 355R / L, 365R / L. Each of the sub-control units 335, 345,... Interprets the command received from the main control unit 381 and ₂ , A ₃ .. Output a drive control signal. Here, “ZMP” refers to a point on the floor at which the moment due to floor reaction force during walking becomes zero, and “ZMP trajectory” refers to, for example, the walking motion period of the humanoid robot device 200. Means the trajectory in which the ZMP moves.
[0053]
At the time of walking, gravity, inertia force, and these moments act on the road surface from the walking system due to gravity and acceleration generated by the walking motion. According to the so-called "Dalambert principle", they balance the floor reaction force and the floor reaction force moment as a reaction from the road surface to the walking system. As a result of the mechanical inference, there is a point where the pitch and roll axis moments are zero on or inside the sides of the supporting polygon formed by the sole and the road surface, that is, “ZMP (Zero Moment Point)”.
[0054]
Many proposals relating to posture stability control of a legged mobile robot and prevention of falling during walking use this ZMP as a criterion for determining walking stability. The bipedal walking pattern generation based on the ZMP standard has an advantage that a sole landing point can be set in advance, and the kinematic constraint condition of the toe according to the road surface shape can be easily considered. In addition, using ZMP as a stability determination criterion means that a trajectory, not a force, is treated as a target value in motion control, so that technical feasibility is increased. The concept of ZMP and the application of ZMP to the stability discrimination standard of walking robots are described in "LEGGED LOCOMMOTION ROBOTS" by Miomir Vukobravicic (Ichiro Kato, "Walking Robots and Artificial Feet" (Nikkan Kogyo Shimbun)). It is described in.
[0055]
In general, a bipedal walking robot such as a humanoid has a higher center of gravity and a narrower ZMP stable area during walking than a quadrupedal walking. Therefore, the problem of the posture change due to the change of the road surface condition is particularly important in a bipedal walking robot.
[0056]
As described above, in the humanoid robot device 200, each of the sub-control units 335, 345, and so on interprets the command received from the main control unit ₂ , A ₃ , A drive control signal is output to control the drive of each unit. Accordingly, the humanoid robot device 200 stably transitions to the target posture and can walk in a stable posture.
[0057]
In the control unit 380 of the humanoid robot device 200, in addition to the posture control described above, various sensors such as an acceleration sensor, a touch sensor, and a ground check sensor, image information from a CCD camera, and audio information from a microphone Etc. are managed. In the control unit 380, although not shown, various sensors such as an acceleration sensor, a gyro sensor, a touch sensor, a distance sensor, a microphone, and a speaker, actuators, a CCD camera, and a battery are connected to the main control unit 381 via corresponding hubs. Have been.
[0058]
The main control unit 381 sequentially fetches sensor data, image data, and audio data supplied from each of the above-described sensors, and sequentially stores them at predetermined positions in the DRAM via the respective internal interfaces. Further, the main control unit 381 sequentially takes in the battery remaining amount data indicating the battery remaining amount supplied from the battery, and stores the data at a predetermined position in the DRAM. The sensor data, image data, audio data, and remaining battery data stored in the DRAM are used when the main control unit 381 controls the operation of the humanoid robot device 200.
[0059]
The main control unit 381 reads the control program at the initial stage when the power of the humanoid robot device 200 is turned on, and stores the control program in the DRAM. In addition, the main control unit 381, based on the sensor data, image data, audio data, and battery remaining amount data sequentially stored in the DRAM from the main control unit 381, as described above, and from the user, And the presence or absence of an action is determined. Further, the main control unit 381 determines an action according to its own situation based on the determination result and the control program stored in the DRAM, and drives the necessary actuator based on the determination result, thereby making the humanoid robot The device 200 is caused to take actions such as so-called “gesture” and “hand gesture”.
[0060]
Therefore, the humanoid robot device 200 can determine its own and surroundings based on the control program, and can act autonomously according to instructions and actions from the user. Further, the humanoid robot device 200 determines how to pronounce (read) the characters extracted from the image captured by the CCD camera, based on the reading estimated from the extracted characters and the sound collected by the sound collection microphone. Determine by matching. Accordingly, the accuracy of voice recognition of the humanoid robot device 200 is improved, and new words can be registered in the voice recognition dictionary.
[0061]
(2) Application example
Hereinafter, prior to the description of embodiments of the present invention, in order to facilitate understanding, in the above-described bipedal walking robot apparatus, face matching is performed by performing template matching on an input image, and face candidates are extracted from the face candidates. Another example in which a face detection device for determining a face is mounted will be described (see Japanese Patent Application No. 2002-074907, hereinafter, referred to as an application example).
[0062]
In the first and second application examples described below, similarly to the present invention, after roughly extracting a face candidate by template matching or the like, it is determined whether the face is a face from the face candidates by SVM or the like. Thus, the amount of calculation in the subsequent stage is reduced, and the real-time property of face detection of the robot device is improved.
[0063]
(2-1) First application example
(2-1-1) Internal configuration of robot
In the first application example, as shown in FIGS. 4 and 5, a main control unit 381 that controls the operation of the entire robot device 200 is provided on the back side of the waist base that forms the lower part of the torso of the torso unit 204. And a control unit 42 in which a peripheral circuit 381 such as a power supply circuit and a communication circuit and a battery 45 (FIG. 5) are housed in a box.
[0064]
The control unit 42 includes sub-control units 335 provided in each of the constituent units (the body unit 202, the head unit 204, the

arm units

203R and 203L, and the

leg units

209R and 209L). 350R, 350L, 365R, 365L to supply necessary power supply voltage to these

sub-control units

335, 350R, 350L, 365R, 365L, and to connect these

sub-control units

335, 350R, 350L, 365R, 365L can be communicated with.
[0065]
Further, each of the

sub-control units

335, 350R, 350L, 365R, 365L is provided with each actuator A in the corresponding constituent unit. ₂ ~ A ₂₁ Is connected to each actuator A in the constituent unit. ₂ ~ A ₂₁ Can be driven to a designated state based on various control commands given from the main control unit 381.
[0066]
Further, as shown in FIG. 5, the head unit 204 includes a CCD (Charge Coupled Device) camera 50 functioning as an "eye" of the robot apparatus 200, a microphone 51 functioning as an "ear", a touch sensor 52, and the like. An external sensor unit 53 and a speaker 54 functioning as a “mouth” are disposed at predetermined positions, respectively, and an internal sensor unit 57 including a battery sensor 55 and an acceleration sensor 56 is disposed in the control unit 380. Have been.
[0067]
Then, the CCD camera 50 of the external sensor unit 53 captures an image of the surroundings, sends out the obtained image signal S1A to the main control unit 381, and sequentially stores the image signal S1A in the internal memory 381A in frame units. , Collects various command voices such as “walk”, “sit” or “follow the ball” given as a voice input, and sends the voice signal S1B thus obtained to the main control unit 381.
[0068]
Further, the touch sensor is provided on the upper part of the head unit 3 and detects pressure received by a physical action such as “stroke” or “hit” from the user, and the detection result is mainly used as a pressure detection signal S1C. It is sent to the control unit 381.
[0069]
Furthermore, the battery sensor 55 of the internal sensor unit 57 detects the remaining energy of the battery 45 at a predetermined cycle, and sends the detection result to the main control unit 381 as a remaining battery detection signal S2A. On the other hand, the acceleration sensor 56 detects accelerations in three axial directions (x-axis, y-axis, and z-axis) at a predetermined cycle, and sends a detection result to the main control unit 381 as an acceleration detection signal S2B.
[0070]
The main control unit 381 includes an image signal S1A, a sound signal S1B, a pressure detection signal S1C, and the like supplied from the CCD camera 50, the microphone 51, the touch sensor 52, and the like of the external sensor unit 53 (hereinafter, these are collectively referred to as an external sensor signal). S1) and a remaining battery level detection signal S2A, an acceleration detection signal S2B, and the like supplied from the battery sensor 55, the acceleration sensor, and the like of the internal sensor unit 57 (hereinafter, these are collectively referred to as an internal sensor signal S2). Based on this, the surrounding and internal states of the robot device 200, a command from a user, the presence or absence of a user's action, and the like are determined.
[0071]
Then, the main control unit 381 determines a subsequent action based on the determination result, the control program previously stored in the internal memory 381A, and various control parameters stored in the external memory 58 loaded at that time. Then, a control command based on the determination result is transmitted to the corresponding

sub-control units

335, 350R, 350L, 365R, 365L. As a result, based on the control command, the corresponding actuator A is controlled under the control of the

sub-control units

335, 350R, 350L, 365R, 365L. ₂ ~ A ₂₁ Is driven, and the robot unit 200 exerts actions such as swinging the head unit 204 up and down, left and right, raising the

arm units

203R and 203L, and walking.
[0072]
At this time, the main control unit 381 outputs a sound based on the sound signal S3 to the outside by giving a predetermined sound signal S3 to the speaker 54 as necessary, or a head functioning as an external “eye”. By outputting a drive signal to an LED provided at a predetermined position of the unit 204, the LED is turned on and off.
[0073]
In this way, in the robot device 200, it is possible to autonomously act based on the surrounding and internal conditions, the instruction from the user, the presence or absence of the action, and the like.
[0074]
(2-1-2) Processing of Main Control Unit 381 Related to Face Detection Task Function
Next, the face detection task function mounted on the robot device 200 will be described. The robot apparatus 200 has a face detection task function for detecting a human face image from a frame image stored in the internal memory 381A via the CCD camera 50. The face detection task function is realized by various processes in the main control unit 381.
[0075]
Here, when the processing contents of the main control unit 381 relating to the face detection task function are functionally classified, as shown in FIG. 6, the input image scale conversion unit 60, the window cutout unit 61, the template matching unit 62, the preprocessing It can be divided into a unit 63, a pattern identification unit 64, and an overlap determination unit 65.
[0076]
The input image scale conversion unit 60 reads a frame image based on the image signal S1A from the CCD camera 50 (FIG. 5) from the internal memory 381A, and converts the frame image into a plurality of scale images having different reduction rates. In the case of this application example, a frame image composed of 25344 (= 176 × 144) pixels is sequentially reduced by 0.8 times, and is reduced in 5 steps (1.0 times, 0.8 times, 0.64 times). , 0.51 times, 0.41 times) (hereinafter referred to as first to fifth scale images).
[0077]
Subsequently, the window cutout unit 61 firstly assigns an appropriate pixel (for example, two pixels) to the first scale image from the first to fifth scale images, starting from the upper left of the image to the lower right of the image. A rectangular area of 400 (= 20 × 20) pixels (hereinafter, this area is referred to as a window image) is sequentially cut out while scanning while shifting to the right or downward.
[0078]
At that time, the window cutout unit 61 sends the first window image among the plurality of window images cut out from the first scale image to the template matching unit 62 in the subsequent stage.
[0079]
The template matching unit 62 performs an arithmetic process such as a normalized correlation method or an error square method on the leading window image obtained from the window cutout unit 61 to convert the window image into a function curve having a peak value. A sufficiently low threshold value is set for the curve so that the recognition performance does not deteriorate, and it is determined whether or not the window image is a face image based on the threshold value.
[0080]
In the case of this application example, the template matching unit 62 sets a threshold value as a criterion for determining whether or not the face image is an average face image made up of, for example, about 100 people as a template. I have. As a result, the window image can be roughly matched with an average face image serving as a template.
[0081]
In this way, the template matching unit 62 performs matching using the template on the window image obtained from the window cutout unit 61, and when it is determined that the face image is a face image, uses the window image as a score image in the subsequent stage. On the other hand, when it is determined that the image is not a face image, the window image is transmitted to the subsequent overlap determination unit 65 as it is.
[0082]
At this point, the window image (score image) determined to be a face image actually contains a large number of misjudged images other than the face image, but in everyday scenes, the background image similar to the face is similar to the face image. Is rarely present, so that it is determined that most window images are not face images, which is extremely effective.
[0083]
Actually, the arithmetic processing such as the normalized correlation method and the error square method described above requires only about one tenth to one hundredth of the amount of operation when compared with the arithmetic processing in the preprocessing unit and the pattern identification unit in the subsequent stage. In this experiment, it has been confirmed at this stage that 80% or more of the image other than the face image can be eliminated, and it can be seen that the main control unit 381 as a whole can greatly reduce the amount of calculation.
[0084]
The preprocessing unit 63 removes, from the score image obtained from the template matching unit 62, four corner regions corresponding to a background portion irrelevant to a human face image from the rectangular score image, Using a mask obtained by cutting out the four corners, 360 pixels are extracted from a score image having 400 (= 20 × 20) pixels.
[0085]
Then, the pre-processing unit 63 forms a plane based on a part that is optimal as a face image among the extracted 360-pixel score images in order to eliminate the tilt condition of the subject represented by shading by illumination at the time of imaging. In such a way, the gray value of the 360 pixels is corrected using a calculation method based on, for example, a root mean square error (RSM).
[0086]
Subsequently, the preprocessing unit 63 performs a histogram smoothing process on the result of enhancing the contrast of the score image for the 360 pixels so that the result can be detected regardless of the gain of the CCD camera 50 and the intensity of illumination.
[0087]
Next, the pre-processing unit 63 performs a Gabor Filtering process, which will be described later, to perform vector conversion on the score image for the 360 pixels, and further converts the obtained vector group into one pattern vector. .
[0088]
Hereinafter, the Gabor filtering processing will be described. First, it has been already known that human visual cells include cells having selectivity for a specific direction. It consists of cells that respond to vertical lines and cells that respond to horizontal lines. Similarly, Gabor filtering is a spatial filter composed of a plurality of filters having azimuth selectivity.
[0089]
The Gabor filter is spatially represented by a Gabor function. The Gabor function g (x, y) is represented by a carrier s (x, y) composed of a cosine component and a two-dimensional Gaussian envelope W _r (X, y).
[0090]
(Equation 1)

[0091]
The carrier s (x, y) is expressed by the following equation (2) using a plurality of functions. Here, coordinate values (u ₀ , V ₀ ) Represents the spatial frequency, and P represents the phase of the cosine component.
[0092]
(Equation 2)

[0093]
Here, the carrier shown in the above formula (2) can be separated into a real component Re (s (x, y) and an imaginary component Im (s (x, y)) as shown in the following formula (3). it can.
[0094]
[Equation 3]

[0095]
On the other hand, an envelope having a two-dimensional Gaussian distribution is expressed as in the following equation (4).
[0096]
(Equation 4)

[0097]
Here, the coordinate axes (x ₀ , Y ₀ ) Is the peak of this function and the constants a and b are the Gaussian scale parameters. The subscript r means a rotation operation as shown in the following equation (5).
[0098]
(Equation 5)

[0099]
Therefore, from the above equations (2) and (4), the Gabor filter is expressed as a spatial function as shown in the following equation (6).
[0100]
(Equation 6)

[0101]
The preprocessing unit according to this application example performs face extraction processing using eight kinds of directions and three kinds of frequencies and using a total of 24 Gabor filters.
[0102]
The response of the Gabor filter is G _i Is the i-th Gabor filter, and the result (Gabor Jet) of the i-th Gabor is J _i And if the input image is I, it is represented by the following equation (7). Actually, the operation of the equation (7) can be speeded up by using the fast Fourier transform.
[0103]
(Equation 7)

[0104]
The performance of the created Gabor filter is examined by reconstructing the pixels obtained by filtering. The reconstructed image H is represented by the following equation (8).
[0105]
(Equation 8)

[0106]
Then, an error E between the input image I and the reconstructed image H is expressed by the following equation (9).
[0107]
(Equation 9)

[0108]
Reconstruction can be performed by finding an optimum a that minimizes the error E. As for Gabor filtering, the type of filter may be changed according to the recognition task.
[0109]
In low frequency filtering, it is redundant to have all the filtered images as vectors. Therefore, the dimension of the vector may be reduced by downsampling. The 24 types of down-sampled vectors are arranged in a line to form a long vector (the pattern vector described above).
[0110]
The pattern identification unit 64 obtains a provisional identification function using learning data supplied from outside, that is, teacher data, and then uses the identification function for 360 pixels obtained as a pattern vector from the preprocessing unit 63. The face is detected by experimenting with the score image. Then, the detection result is output as face data. Further, the detection failure is added to the learning data as non-face data, and learning is performed again.
[0111]
In this application example, with regard to face recognition in the pattern identification unit 64, a support vector machine (Support Vector Machine) which is considered to have the highest learning generalization ability in the field of pattern recognition.
Machine / SVM) is used to determine whether the face is applicable.
[0112]
Regarding the support vector machine itself, for example, A report written by Sholkoff et al. (B. Sholkof, C. Burges, A. Smola, "Advanced in Kernel Support Vector learning", The MIT Press, 1999.) can be mentioned. From the results of preliminary experiments conducted by the present applicant, it has been found that the face recognition method using the support vector machine shows better results than methods using principal component analysis (PCA) or a neural network.
[0113]
The support vector machine is a learning machine using a linear discriminator (perceptron) as a discriminant function, and can be extended to a non-linear space by using a kernel function. In addition, the learning of the discriminant function is performed so as to maximize the margin of separation between classes, and the solution is obtained by solving a quadratic mathematical programming. Therefore, it is theoretically guaranteed that a global solution can be reached. it can.
[0114]
Usually, the problem of pattern recognition is to find a discriminant function f (x) given by the following equation (10) for a test sample x = (x1, x2,..., Xn).
[0115]
(Equation 10)

[0116]
Here, a teacher label for learning the support vector machine is set as in the following expression (11).
[0117]
[Equation 11]

[0118]
Then, recognition of the face pattern in the support vector machine can be regarded as a problem of minimizing the square of the weighting factor w under the constraint shown in the following equation (12).
[0119]
(Equation 12)

[0120]
Problems with such constraints can be solved using Lagrange's undetermined constant method. That is, the Lagrange shown in the following equation (13) is first introduced, and then, as shown in the following equation (14), partial differentiation is performed on each of b and w.
[0121]
(Equation 13)

[0122]
[Equation 14]

[0123]
As a result, identification of the face pattern in the support vector machine can be regarded as a secondary programming problem represented by the following equation (15).
[0124]
[Equation 15]

[0125]
If the number of dimensions of the feature space is smaller than the number of training samples, a scratch variable ξ ≧ 0 is introduced and the constraint condition is changed as in the following equation (16). For the optimization, the objective function of the following equation (17) is minimized.
[0126]
(Equation 16)

[0127]
[Equation 17]

[0128]
In the above equation (17), C is a coefficient that specifies how much the constraint condition is relaxed, and it is necessary to experimentally determine the value. The problem related to the Lagrangian constant a is changed as in the following equation (18).
[0129]
(Equation 18)

[0130]
However, if the above equation (18) is not used, the nonlinear problem cannot be solved. Therefore, in this application example, the kernel function K (x, x3) is introduced, and is temporarily mapped to a high-dimensional space (kernel trick), and linearly separated in that space. Therefore, in the original space, it is equivalent to nonlinear separation.
[0131]
The kernel function is represented by the following equation (19) using a certain mapping Φ.
[0132]
[Equation 19]

[0133]
The discriminant function shown in the above equation (10) can also be expressed as the following equation (20).
[0134]
(Equation 20)

[0135]
Also, learning can be considered as a secondary programming problem represented by the following equation (21).
[0136]
(Equation 21)

[0137]
As the kernel, a Gaussian kernel (RBF: Radius Basic Function) shown in the following equation (22) can be used.
[0138]
(Equation 22)

[0139]
As described above, the pattern identification unit 64 determines whether or not face data exists in the score image with respect to the pattern vector based on the score image provided from the preprocessing unit 63. The upper left position (coordinates) and its size (the number of pixels in the vertical and horizontal directions) in the image area, the reduction rate of the scale image from which the score image is cut out with respect to the frame image (that is, the corresponding step among the above five steps) Is stored in the internal memory 381A as list data.
[0140]
Thereafter, the pattern identification unit 64 notifies the window cutout unit 61 that the face detection of the first window image in the first scale image has been completed, so that the window cutout unit 61 The template matching unit 62 sends the next scanned window image of the one scale image.
[0141]
Then, only when the window image matches the template, the template matching unit 62 sends the window image to the preprocessing unit 63 as a score image. The preprocessing unit 63 converts the score image into a pattern vector and sends it to the pattern identification unit 64. The pattern identification unit 64 generates list data based on face data obtained as an identification result from the pattern vector, and stores the list data in the internal memory 381A.
[0142]
As described above, by performing the processing of the template matching unit 62, the preprocessing unit 63, and the pattern identification unit 64 on all the window images cut out from the first scale image in the window cutout unit 61 in the scanning order, A plurality of score images including a face image existing in the imaging result can be detected from one scale image.
[0143]
Thereafter, the pattern identification unit 64 notifies the input image scale conversion unit 60 that the face detection of the first scale image has been completed, so that the input image scale conversion unit 60 outputs the second scale image. To the window cutout unit 61.
[0144]
The second scale image is also subjected to the same processing as the above-described first scale image, and after detecting a plurality of score images including face images present in the imaging result from the second scale image, the third scale image is obtained. Similar processing is sequentially performed for the fifth to fifth scale images.
[0145]
Thus, for the first to fifth scale images obtained by reducing the frame image, which is the captured image, in five stages, the pattern identification unit 64 detects a plurality of score images including the face images existing in the captured image, and then detects the score image. The resulting list data is stored in the internal memory 381A. In this case, depending on the size of the face image in the original frame image, a score image may not be obtained at all, but if a score image is obtained with at least one or more (or two or three or more) scale images, , The face detection process is continued.
[0146]
Here, in the multiple score images including the face image in each scale image, since the scan in the window cutout unit 61 was performed even every two pixels, a high correlation was found between the region where the face was actually located and the neighboring region. Therefore, adjacent score images include image regions that overlap each other.
[0147]
The subsequent overlap determination unit 65 reads a plurality of list data for each of the first to fifth scale images stored in the internal memory 381A, compares score data included in each of the list data, A determination is made as to whether or not regions that overlap each other are included.
[0148]
At this time, as shown in FIG. 7, the overlap determination unit 65 determines that the position (coordinate) of the upper left corner of the two score images P1 and P2 is (X _A , Y _A ), (X _B , Y _B ), And the size (the number of pixels in the vertical and horizontal directions) is H _A × L _A , H _B × L _B , DX (= X _B -X _A ), DY (= Y _B -Y _A ), It can be determined whether or not the score images P1 and P2 overlap each other, based on whether or not the following expression (23) is satisfied.
[0149]
(Equation 23)

[0150]
The overlap judging section 65 removes the area where the score images overlap each other based on the judgment result, so that in each scale image, a single image area where a plurality of score images are finally assembled without overlapping each other The image area is newly stored in the internal memory 381A as face determination data.
[0151]
When the template matching unit 62 determines that the image is not a face image, the overlap determination unit 65 does nothing and does not store the data in the internal memory 381A.
[0152]
(2-1-3) Operation and Effect in First Application Example
In the above configuration, the robot apparatus 200 converts a frame image captured by the CCD camera 50 into a plurality of scale images having different reduction ratios, and then converts a window image of a predetermined size from each of the scale images to a predetermined pixel. Cut out one sheet at a time while scanning so that each is shifted.
[0153]
This window image is subjected to matching using a template representing an average face image to roughly determine whether or not the image is a face image. The amount of calculation and the time required for the face detection process can be reduced accordingly.
[0154]
Subsequently, with respect to the window image (that is, the score image) determined to be a face image by the template matching, four corner portions of the rectangular area of the score image are removed, and then, gradation correction and subsequent smoothing of contrast enhancement are performed. Conversion into one pattern vector.
[0155]
Then, face data or non-face data is determined for the pattern vector by performing face detection within the original score image, and the position (coordinates) and size (number of pixels) of the image area of the score image in which the face data exists are determined. ) And the reduction ratio of the scale image from which the score image is cut out to the frame image are generated.
[0156]
After generating the list data for all the score images for each scale image in this manner, by comparing the score images included in each of the list data, and obtaining the face determination data from which the mutually overlapping regions have been removed. The face image can be detected from the original frame image.
[0157]
Among these face detection task processes, the template matching process is particularly similar to the block matching method used in image compression and the like, in addition to being easily mountable on an arithmetic unit having a relatively simple configuration. There are many types of hardware that perform high-speed processing using a CPU. Therefore, the speed of the template matching process can be further increased.
[0158]
According to the configuration described above, in the robot apparatus 200, in the face detection task processing for detecting a face image from a frame image captured by the CCD camera 50, the scale image obtained by reducing the frame image at a different reduction rate is used. , Cut out one by one while scanning each window image of a predetermined size so as to be shifted by a predetermined pixel, and then use a template representing an average face image to perform matching and determine roughly whether the image is a face image By removing the window image that is clearly not a face image, the amount of calculation and the time required for various face detection processes on the score image determined to be a face image by the template matching are reduced by that amount. The processing of the main control unit 381 which can be reduced and controls the entire robot apparatus 200 It is possible to reduce collateral, thus possible to realize a robot apparatus 200 capable of significantly enhancing real-time.
[0159]
(2-2) Second application example
(2-2-1) Internal configuration of robot
In the second application example, as shown in FIG. 8, the same reference numerals as those in FIG. 5 are attached to the back side of the waist base forming the lower body of the body unit 202 (FIG. 1). In addition to the main control unit 381, an internal memory 381A, a DMA (Direct Memory Access) controller 70, an arithmetic processing unit (DSP (Digital Signal Processor), an image processing engine, etc.) via a bus connected to the main control unit 381. ) 71 are connected to each other, and the rest is configured similarly to the first application example.
[0160]
In this case, the main controller 381 does not manage the internal memory 381A, but the DMA controller 70 transfers data to the main controller 381 or the arithmetic processing unit 71. On the other hand, the operation result obtained from the main control unit 381 or the operation processing unit 71 is stored in the internal memory 381A via the DMA controller 70.
[0161]
In this way, the data transfer is performed by the DMA controller 70 and the data calculation is distributed to the main control unit 381 or the calculation processing unit 71 so as to be performed independently of each other, so that the burden on the main control unit 381 can be reduced. ing.
[0162]
(2-2-2) Processing of Main Control Unit 381 Related to Face Detection Task Function
In the case of the second application example, the face detection task function is realized by various processes in the main control unit 381 and the arithmetic processing unit 71 via the DMA controller 70.
[0163]
Here, when the processing contents of the main control unit 381 and the arithmetic processing unit 71 relating to the face detection task function are functionally classified, as shown in FIG. 9, the input image scale conversion unit 80, the window cutout unit 81, the template It can be divided into a matching section 82, an overlap determination section 83, a scale conversion and cutout section 84, a preprocessing section 85, and a pattern identification section 86.
[0164]
The input image scale conversion unit 80, the window cutout unit 81, and the template matching unit 82 are processed by the arithmetic processing unit 71, and the overlap determination unit 83, the scale conversion and cutout unit 84, the preprocessing unit 85, and the pattern identification The unit 86 is configured to be processed by the main control unit 381. The switching of the processing between the arithmetic processing unit 71 and the main control unit 381 is performed by the DMA controller 70.
[0165]
First, the input image scale conversion unit 80 reads a frame image based on the image signal S1A from the CCD camera 50 (FIG. 8) from the internal memory 381A, and converts the frame image into a plurality of scale images having different reduction rates (described above). (1st to 5th scale images).
[0166]
Subsequently, the window cutout unit 81 sequentially selects appropriate pixels (for example, two pixels) from the first to fifth scale images with respect to the first scale image starting from the upper left of the image to the lower right of the image. By scanning while skipping, a window image composed of a rectangular area of 400 (= 20 × 20) pixels is sequentially cut out.
[0167]
At this time, the window cutout unit 81 sends the first window image among the plurality of window images cut out from the first scale image to the template matching unit 82 in the subsequent stage.
[0168]
The template matching unit 82 uses the average face image as a template for the top window image obtained from the window cutout unit 81, performs rough matching, and determines the degree of correlation with the template (the matching pixel in the template). Is stored in the internal memory 381A via the DMA controller 70 together with the window image as correlation degree data.
[0169]
Specifically, as shown in FIG. 10, for a window image W1 cut out from an arbitrary scale image F1, a template T1, which is an average face image, is taken, and as a result of obtaining a degree of correlation with the template T1, An image R1 representing the correlation is detected and stored in the internal memory 381A as correlation data.
[0170]
In the case of the second application example, unlike the first application example in which the window image is screened except for the score image determined to be a face image, the window image is converted to a score image in order to realize high-speed processing. Regardless of whether it is or not, it is sent to the overlap determination unit 83 in the subsequent stage.
[0171]
Thereafter, the template matching unit 82 notifies the window extracting unit 81 that the template matching of the first window image in the first scale image has been completed, so that the window extracting unit 81 outputs The next matching window image of the one scale image is sent to the template matching unit 82.
[0172]
As described above, the template matching unit 82 generates correlation data for each of the window images cut out from the first scale image by the window cutout unit 81 in the order of scanning.
[0173]
Thereafter, the template matching unit 82 notifies the input image scale conversion unit 80 that the face detection of the first scale image has been completed, so that the second scale image To the window cutout unit 81.
[0174]
Then, the second scale image is subjected to the same processing as that of the above-described first scale image to generate correlation degree data corresponding to all window images from the second scale image, and then perform the third to Similar processing is sequentially performed on the 5th scale image.
[0175]
In this way, the template matching unit 82 generates correlation data corresponding to a plurality of window images that are respectively extracted from the first to fifth scale images obtained by reducing the frame image, which is the captured image, in five stages, and generates the window image. At the same time, the data is stored in the internal memory 381A via the DMA controller 70.
[0176]
The above is the processing content of the arithmetic processing unit 71, which is switched to the main control unit 381 under the control of the DMA controller 70, and the following is the processing content of the main control unit 381.
[0177]
First, the overlap determination unit 83 executes the overlap determination processing procedure R11 of FIG. 11 from step SP0, and the same as the scan order when cutting out the window image in each of the first to fifth scale images, for each of the first to fifth scale images. In order, the window image is read out from the internal memory 381A via the DMA controller 70 together with the associated degree of correlation data (step SP1).
[0178]
Subsequently, the overlap determination unit 83 compares the window images of the plurality of window images in the corresponding scale image using the above-described equation (23) in scan order, and determines whether or not the window images include mutually overlapping regions. (Step SP2).
[0179]
Then, the overlap determination unit 83 compares the window images with each other in a corresponding magnitude of the degree of correlation only in a case where the areas overlap each other, and sets the higher degree of correlation as a score image in the internal memory 381A. It is stored in addition to the candidate list (step SP3).
[0180]
Then, the overlap determination unit 83 adds, for all the window images in the corresponding scale image, the larger one obtained by comparing the degrees of correlation to the candidate list as a score image (step SP4). On the other hand, the candidate list is sequentially rewritten without adding it to the candidate list or deleting it if it already exists in the candidate list (step SP5).
[0181]
Eventually, the overlap determination unit 83 determines, as a candidate list, a plurality of score images having a relatively high degree of correlation for all window images in the corresponding scale image, and then determines a plurality of score images having a relatively high degree of correlation for the other scale images. Is determined as a candidate list. It has been experimentally obtained that the number of score images having a relatively high degree of correlation added as a candidate list is reduced to 10 or less for each scale image.
[0182]
In this manner, the overlap determination unit 83 adds the score image to the candidate list for the first to fifth scale images not only for the peak value having the highest degree of correlation but also for the area near the peak value. Accordingly, the difference between the matching result in the template matching unit 82 and the processing result of the support vector machine in the subsequent pattern identification unit 86 can be reduced accordingly.
[0183]
In practice, the process of creating the candidate list in the overlap determination unit 83 can be executed by adding and multiplying several times, so that the process is extremely fast compared with the calculation process by the pre-processing unit 85 and the pattern identification unit 86 at the subsequent stage. It can be processed in a short time. If the preprocessing unit 85 and the pattern identification unit 86 perform the processing on the score images remaining as the candidate list, the processing load on the preprocessing unit 85 and the pattern identification unit 86 can also be reduced.
[0184]
Thereafter, the scale conversion and cut-out unit 84 outputs a plurality of score images stored in the internal memory 381A as a candidate list for each scale image in the overlap determination unit 83 from the internal memory 381A via the DMA controller 70. The score images are read one by one within the scale image and in the order in which the candidate list is determined, and are sent to the preprocessing unit 85.
[0185]
The pre-processing unit 85 converts the score image obtained from the scale conversion and cut-out unit 84 into four corners corresponding to a background portion irrelevant to a human face image from the score image (400 pixels) formed of a rectangular area. In order to remove the area, 360 pixels are extracted using a mask obtained by cutting out the four corner areas.
[0186]
Then, the pre-processing unit 85 forms a plane based on an optimal part as a face image among the extracted 360-pixel score images in order to eliminate the inclination condition of the subject represented by shading by illumination at the time of imaging. Is corrected for each pixel in such a manner as to perform the correction.
[0187]
Subsequently, the preprocessing unit 85 performs a histogram smoothing process on the result of enhancing the contrast of the 360-pixel score image so that the result can be detected regardless of the gain of the CCD camera 50 or the intensity of illumination.
[0188]
Next, the preprocessing unit 85 performs the above-described Gabor filtering processing to perform vector conversion on the score image for the 360 pixels, and further converts the obtained vector group into one pattern vector.
[0189]
Using the support vector machine described above, the pattern identification section 86 performs a trial on a 360-pixel score image obtained as a pattern vector from the preprocessing section 85 and learns whether or not the face is a face.
[0190]
As described above, for the pattern vector based on the score image provided from the pre-processing unit 85, the pattern identifying unit 86 determines the position (coordinate) and size (number of pixels) of the image area corresponding to the face data in the score image. ) And a reduction ratio of the scale image from which the score image is to be cut out to the frame image (that is, the corresponding stage of the above five stages) is listed, and this is stored in the internal memory 381A as list data.
[0191]
Thereafter, the pattern identification unit 86 notifies the scale conversion and cutout unit 84 that the face detection of the first score image in the first scale image has been completed, thereby performing the scale conversion and cutout. The unit 84 sends the next score image of the first scale images to the preprocessing unit 85. The preprocessing unit 85 converts the score image into a pattern vector and sends the pattern vector to the pattern identification unit 86. The pattern identification unit 86 generates list data based on the face data obtained from the pattern vector and stores the list data in the internal memory 381A.
[0192]
As described above, the scale conversion and cutout unit 84 sequentially performs the processes of the pre-processing unit 85 and the pattern identification unit 86 on all the score images in the candidate list in the first scale image, thereby obtaining the first scale image. The face image existing in the imaging result can be detected from the scale image.
[0193]
After that, the pattern identification unit 86 notifies the scale conversion and cutout unit 84 that the face detection of the first scale image has been completed, so that the second scale image also After performing the same processing as that of the first scale image and detecting the face image present in the imaging result from the second scale image, the same processing is sequentially performed for the third to fifth scale images.
[0194]
Thus, for the first to fifth scale images obtained by reducing the frame image, which is the captured image, in five stages, the pattern identification unit 86 detects the face images present in the captured image, and then obtains the resulting face image. The list data is generated based on the data and stored in the internal memory 381A.
[0195]
(2-2-3) Operation and Effect in Second Application Example
In the above configuration, the robot apparatus 200 converts a frame image captured by the CCD camera 50 into a plurality of scale images having different reduction ratios, and then converts a window image of a predetermined size from each of the scale images to a predetermined pixel. Cut out one sheet at a time while scanning so that each is shifted.
[0196]
The window image is matched using a template representing an average face image to generate correlation data representing the degree of correlation with the template. In this way, the correlation degree data is generated for all the window images for each scale image in the order of scanning.
[0197]
By executing the processing up to this point in the arithmetic processing unit 71 and executing the subsequent processing in the main control unit 381, the processing load on the main control unit 381 can be reduced accordingly.
[0198]
Subsequently, for a plurality of window images in each scale image, the window images are compared in the order of scanning, and if the window images include regions that overlap with each other, the degree of correlation is further compared so that only the larger one remains as a score image. Add to list. As a result, a plurality of score images having a relatively high degree of correlation can be determined as a candidate list for each scale image, and the identification accuracy can be further improved at the time of face identification processing at a later stage.
[0199]
Subsequently, for a plurality of score images stored in the internal memory 381A as a candidate list for each scale image, after removing four corners of a rectangular area of the score image, the gradation correction and the subsequent smoothing of contrast enhancement are performed. Further, it is converted into one pattern vector.
[0200]
Then, face data or non-face data is determined for the pattern vector by performing face detection within the original score image, and the position (coordinates) and size (number of pixels) of the image area of the score image in which the face data exists are determined. ) And the reduction ratio of the scale image from which the score image is cut out to the frame image are generated.
[0201]
By generating list data for all score images for each scale image in this way, a face image can be detected from an original frame image.
[0202]
Thus, regarding the face detection task processing, an arithmetic processing unit 71 is provided in addition to the main control unit 381 which controls the overall operation of the robot apparatus 200, and reading and writing of data to and from the internal memory 381 A is performed through control of the DMA controller 70. Since both the main control unit 381 and the calculation processing unit 71 can perform the processing, the processing amount can be distributed by the main control unit 381 and the calculation processing unit 71 respectively. The burden can be significantly reduced.
[0203]
According to the configuration described above, in the robot apparatus 200, in the face detection task processing for detecting a face image from a frame image captured by the CCD camera 50, the scale image obtained by reducing the frame image at a different reduction rate is used. , A window image of a predetermined size is cut out one by one while scanning so as to be shifted by a predetermined pixel, and matching is performed using a template representing an average face image, and correlation degree data representing a correlation degree with the template is obtained. Are generated, and if the window images include mutually overlapping regions, the correlation is compared and only the larger one remains in the candidate list, so that the correlation in the candidate list is relatively high. Reduce the amount of calculation and time required for various face detection processes for a plurality of score images DOO can not only, the identification accuracy in the face identification processing can be further improved.
[0204]
In addition, regarding the face detection task processing, the main control unit 381 and the arithmetic processing unit 71 that perform overall control of the robot apparatus 200 share the processing. The load on the amount and calculation time can be significantly reduced, and thus the robot apparatus 200 that can significantly improve the real-time property can be realized.
[0205]
(3) Embodiment
Next, an embodiment of the present invention will be described. As described above, in the method of extracting a face candidate by performing template matching (first step), determining a face area from the face candidates by SVM or the like (second step), and detecting the face area In the first step, since face candidates are determined simply at the price of the normalized correlation value, in order to reduce oversight of face candidates, a method of increasing a threshold value or a method of reducing thinning may be adopted. Although it is possible, when the threshold is lowered, the amount of calculation increases, which may not be preferable in an environment where resources are limited, such as a robot device. On the other hand, when the threshold value is increased, the number of candidate images for face determination in the second step is reduced, so that the amount of calculation can be reduced. However, the image that is originally a face is also removed from the candidate image, and the face image is overlooked. May be lost.
[0206]
Then, the present inventors, when a face area (face image) having the same size as the template exists in the input image, if the face image is correlated with the template, the correlation value becomes the largest near the template size. Focusing on this, when narrowing down face area candidates, by using an algorithm that performs local narrowing down, the face candidate image is reduced without overlooking the image that is originally the face, and the second step in the subsequent stage is performed. A method for reducing the amount of calculation for determining the face. More specifically, a candidate face area is extracted based on a local maximum value of a correlation value in a matching result which is a set of correlation values obtained by performing a normalized correlation between an input image and an average face template of a predetermined size. is there. Hereinafter, this face detection method will be described in detail.
[0207]
The face detection device according to the present embodiment is a face detection device while maintaining performance in a system with limited resources such as a CPU and a memory such as a bipedal walking robot device similar to the first and second application examples. This is suitable for reducing the amount of calculation required for detection. Also in this embodiment, similarly to the above-described first and second application examples, it is assumed that it is mounted on the robot device 200 shown in FIGS. 1 to 4, and the internal configuration is the same as the above-described application example. Detailed description is omitted.
[0208]
(3-1) Processing of the main control unit 381 related to the face detection task function
The face detection task function mounted on the robot device 200 according to the present embodiment will be described. As described above, the robot apparatus 200 has a face detection task function for detecting a human face image from a frame image stored in the internal memory 381A via the CCD camera 50. The face detection task function is realized by various processes in the main control unit 381.
[0209]
Here, when the processing content of the main control unit 381 relating to the face detection task function is functionally classified, as shown in FIG. 12, a template size and input image scale conversion determination unit 90, a window cutout unit 91, a template matching unit 92 , A scale conversion and cutout unit 93, a preprocessing unit 94, and a pattern identification unit 95.
[0210]
As in the first application example, the input image scale conversion determining unit 90 reads a frame image based on the image signal S1A from the CCD camera 50 (FIG. 5) from the internal memory 381A, and compares the frame image with the reduction rate. In addition to converting the image into a plurality of different scale images (first to fifth scale images), a template size used in the template matching unit 92 in the subsequent stage, that is, a size of a face image is selected (hereinafter, this is referred to as a first template image). Size).
[0211]
Subsequently, the window cutout unit 91 sequentially assigns an appropriate pixel (for example, two pixels) to the first scale image from the first to fifth scale images, starting from the upper left of the image to the lower right of the image. By scanning while skipping, a window image composed of a rectangular area of 400 (= 20 × 20) pixels is sequentially cut out.
[0212]
At this time, the window cutout unit 91 determines the first window image of the plurality of window images cut out from the first scale image together with the template of the first template size selected by the scale conversion determination unit 91 in the subsequent template matching. It is sent to the unit 92.
[0213]
The template matching section 92 performs rough matching on the first window image obtained from the window cutout section 91 using an average face image having a first template size as a template, and obtains a correlation value (template) with the template. Of the matching pixels) is stored in the internal memory 381A via the DMA controller 70 together with the window image as a matching result.
[0214]
That is, as shown in FIG. 13A, a window image of, for example, height (length of side in y-axis direction) hei_s × width (length of side in x-axis direction) weid_s cut out from an arbitrary scale image For the (input image after scale conversion) W2, as shown in FIG. 13B, a template T2 which is an average face image having a first template size of, for example, height hei_t × width wid_s is used. The window image W2 is scanned, and a matching result which is a set of correlation values between the template T2 shifted and shifted by a predetermined pixel (for example, one pixel) and the input image is obtained. The matching result is obtained by arranging the correlation values two-dimensionally with the movement of the template T2. As shown in FIG. 14, a template matching result image R2 having a height hei_r × width wid_r representing the correlation value is obtained. Can be Here, the height her_r of the template rate matching result image R2 is hei_s- (hei_t + 1), and the width wid_s of the image R2 is wid_s- (wid_t + 1).
[0215]
Next, the template rate matching result image R2 is divided into a predetermined size, for example, the same size as the first template size, and the maximum value of the correlation value is calculated for each divided region divided into the first template size. The points (positions) are obtained, and among the points indicating the maximum values obtained from each of the divided regions, those having a predetermined threshold or more are extracted as face candidates.
[0216]
That is, when the normalized correlation is performed using the average face template, there is no guarantee that the face image has a higher correlation value than the arbitrary pattern, but there is a face image having the same size as the template. In this case, since the correlation value has the maximum value in the vicinity of the template size, the correlation value becomes the maximum value in the divided region and the point which is equal to or larger than the predetermined threshold is extracted as a face candidate. As a result of the matching, the face candidates can be narrowed down more effectively as compared with the case where those whose correlation values are equal to or greater than a predetermined threshold are extracted as face candidates.
[0219]
Next, the points extracted as the face candidates are set as face candidates (first face candidates), and points near the face candidates are also extracted as face candidates (second face candidates). When the face is determined using the above-described SVM in the pattern recognition unit 95 at the subsequent stage, the SVM is so sensitive that the face cannot be detected because it is shifted by only one pixel. By setting the search range (the second face candidate), the vicinity of the face candidate can be focused on and the face detection performance can be improved. Here, if all eight points adjacent to the point extracted as the first face candidate at the top, bottom, left, and right are set as the search range, the amount of calculation at the subsequent stage increases, so that the search range is designated only when a predetermined condition is satisfied. As a result, it is possible to improve the detection performance while suppressing an increase in the amount of calculation in the subsequent stage.
[0218]
That is, when the occupation ratio (skin color occupation ratio) of the skin color region in the input image region of the template size corresponding to these points at the points adjacent to the points extracted as the first face candidates is equal to or larger than a predetermined threshold value Alternatively, the point is set as a search range only when face color information that has been detected (or learned in advance) is present and the occupancy rate (face color occupancy rate) of this face color area is equal to or greater than a predetermined threshold value. be able to. Here, the skin color occupancy can be determined, for example, by using a skin color color table and comparing with the skin color.
[0219]
Thereafter, the template matching unit 92 notifies the window cutout unit 91 that the template matching of the first window image in the first scale image has been completed, so that the window cutout unit 91 performs The next scanned window image of the one scale image is sent to the template matching unit 92.
[0220]
As described above, the template matching unit 92 detects face candidates in the order of scanning for all window images cut out from the first scale image in the window cutout unit 91.
[0221]
The above is the processing content of the arithmetic processing unit 71, which is switched to the main control unit 381 under the control of the DMA controller 70, and the following is the processing content of the main control unit 381.
[0222]
The scale conversion and cut-out unit 93 outputs a plurality of score images stored in the internal memory 381A as a candidate list for each scale image from the internal memory 381A via the DMA controller 70 in the first scale image and the candidate images. The score images are read one by one in the order in which the list is determined, and are sent to the preprocessing unit 94.
[0223]
The pre-processing unit 94 converts the score image obtained from the scale conversion and cut-out unit 93 into four corners corresponding to a background portion irrelevant to a human face image from the rectangular score image (400 pixels). In order to remove the area, 360 pixels are extracted using a mask obtained by cutting out the four corner areas.
[0224]
Then, the pre-processing unit 94 forms a plane based on a part optimal as a face image among the extracted 360-pixel score images in order to eliminate a tilt condition of a subject represented by shading by illumination at the time of imaging. Correction is applied to the gray value in pixel units.
[0225]
Subsequently, the preprocessing unit 94 performs a histogram smoothing process on the result of enhancing the contrast of the score image for the 360 pixels so that the result can be detected regardless of the gain of the CCD camera 50 or the intensity of illumination.
[0226]
Next, the pre-processing unit 94 performs the above-described Gabor filtering processing to perform vector conversion on the score image for the 360 pixels, and further converts the obtained vector group into one pattern vector.
[0227]
The pattern identification unit (identification means) 95 learns a face image or not by learning a 360-pixel score image obtained as a pattern vector from the preprocessing unit 94 using the support vector machine described above. .
[0228]
As described above, for the pattern vector based on the score image provided from the pre-processing unit 94, the pattern identification unit 95 determines the position (coordinates) of the image area corresponding to the face data in the score image and the size (number of pixels) ), The reduction ratio of the scale image from which the score image is cut out to the frame image (that is, the corresponding stage of the above five stages), and the size of the template are listed, and these are listed as list data and stored in the internal memory 381A. To be stored.
[0229]
Thereafter, the pattern identification unit 95 notifies the scale conversion and cutout unit 93 that the face detection of the first score image in the first scale image has been completed, thereby performing the scale conversion and cutout. The section 93 sends the next score image of the first scale images to the preprocessing section 94. The preprocessing unit 95 converts the score image into a pattern vector and sends the pattern vector to the pattern identification unit 95. The pattern identification unit 95 generates list data based on the face data obtained from the pattern vector and stores the list data in the internal memory 381A.
[0230]
As described above, the scale conversion and cutout unit 93 sequentially performs the processes of the pre-processing unit 94 and the pattern identification unit 95 on all score images in the candidate list in the first scale image, thereby A face image existing in the imaging result can be detected from one scale image.
[0231]
That is, the template matching unit 92 notifies the input image scale conversion unit 90 that the face detection using the first scale image and the template of the first template size has been completed, so that the input image scale The conversion unit 90 causes the window cutout unit 91 to send the template of the first scale image and the template of the second template size. Also in the case where the second template size is used, after performing the same processing as in the case of using the template of the first template size described above to detect face candidates corresponding to all template sizes, the template matching unit 92 By notifying that the face detection using all template sizes has been completed for the first scale image of the input image scale conversion unit 90, the second scale image is determined by the input image scale conversion determination unit 90. It is sent to the window cutout unit 91.
[0232]
Then, the same processing as that of the above-described first scale image is performed on the second scale image to detect face candidates corresponding to all window images from the second scale image. The same processing is sequentially performed on the scale image of.
[0233]
Thus, the template matching unit 92 extracts face candidates corresponding to a plurality of window images respectively cut out from the first to fifth scale images obtained by reducing the frame image as the captured image in five stages and templates having a plurality of template sizes. Then, it is stored in the internal memory 381A via the DMA controller 70 together with the window image.
[0234]
Here, in the present embodiment, a template of any size can be used. However, by switching the template size to be used and selecting the template size, all the templates that can be prepared for the input image can be used. Compared to the case where the operation is performed on the size, the amount of operation can be reduced and the efficiency can be improved. For example, once a face is detected, the next time a face is detected, the template size can be used. Further, for example, by using a distance sensor provided in the robot device and recognizing a distance between the target object included in the input image based on distance information from the distance sensor, the size of the face area of the target object can be improved. For example, it is possible to provide a target distance switching unit for selecting the template size by estimating the template size, and the template size can be switched according to the purpose.
[0235]
(3-2) Operation in the embodiment
In the above configuration, the robot apparatus 200 converts a frame image captured by the CCD camera 50 into a plurality of scale images having different reduction ratios, and then converts a window image of a predetermined size from each of the scale images to a predetermined pixel. Cut out one sheet at a time while scanning so that each is shifted.
[0236]
For this window image, matching is performed using a template representing an average face image, and a matching result image that is a set of correlation values with the template is generated. In this way, matching result images are generated for all the window images for each scale image in the order of scanning. Hereinafter, a process of detecting a face candidate from a matching result image will be described in detail.
[0237]
FIG. 15 is a flowchart showing each processing step in which the template matching unit 92 detects a pixel serving as a face candidate from the template matching result image R2. As shown in FIG. 15, first, when the template matching result image R2 is input, the matching result image R2 is divided into template sizes, and one of the divided regions, for example, 0 ≦ x ≦ wid_t−1, 0 ≦ y In ≦ hei_t−1, a point (coordinate) having the highest correlation value is extracted (step SP11). Hereinafter, a region obtained by dividing the matching result image R2 into the template size is referred to as a divided region rn, and a point (coordinate) having the largest correlation value in the divided region rn is referred to as local_max (x, y). Here, the pixel having the highest correlation value is extracted in each of the divided regions. In the present embodiment, a case where the divided regions divided in the matching result image are sequentially processed line by line from left to right will be described. I do.
[0238]
Next, it is determined whether or not local_max (x, y) is larger than a predetermined threshold (th1) (step SP12). If it is larger, it is added as a face candidate (step SP13). As described above, the input image scale conversion determination unit 90 selects a template size of a face size assumed to be included in the input image together with the scale. When the matching result image R2 is calculated for each template size and face candidates are extracted, the same point may be extracted. Therefore, in step SP13, if there is the same point as the face candidate, that is, if the face candidate has already been extracted when the face candidate is extracted with a different template size, this point is not added.
[0239]
Next, in the input image area of the template size corresponding to the point extracted as the face candidate, the occupancy of the skin color pixels included in this image area is obtained. In the present embodiment, the skin color table 100 is referred to when calculating the occupancy of the skin color pixels. Then, it is determined whether or not the skin color pixel occupancy is greater than a predetermined threshold (th2) (step SP14). If it is larger, the surroundings of this local_max (x, y), for example, eight points in the vertical, horizontal, and right directions are added as face candidates (step SP15). Here, as in step SP13, if these eight neighboring points have already been extracted as face candidates, they are not added to the candidates.
[0240]
If local_max (x, y) is less than the threshold th1 in step SP12, if the skin color pixel occupancy in the input image corresponding to local_max (x, y) is less than the threshold th2 in step SP14, and if face- After the addition of the candidates is completed, the process proceeds to step SP16, moves to the next divided area to extract the next face candidate, and proceeds.
[0241]
First, in the matching result image R2, the process moves to an adjacent divided region shifted by the template size in the x direction, that is, by wid_t (step SP16). Next, in the divided region of x coordinate (x + wid_t) shifted by wid_t, when the x coordinate is larger than the width (side in the x direction) wid_r of the matching result image, it is determined that the divided region is not included in the matching result image. Then, the processing moves to the next line, and the processing moves to the next divided area where 0 ≦ x ≦ wid_t−1, which is shifted by the template size in the y direction, that is, hei_t (step SP18). Next, it is determined whether or not the y coordinate of the divided area is larger than the height (side in the y direction) hei_r of the matching result image (step SP19). If it is larger, the correlation value of all the divided areas in the matching result image is determined. Is determined, and the process is terminated.
[0242]
On the other hand, when it is determined in step SP17 and step SP18 that the divided region is included in the matching result image, the process returns to step SP11 again, and a point having the highest correlation value in the divided region is extracted.
[0243]
In the present embodiment, since the maximum value of the correlation value in the divided region obtained by dividing the matching result image R2 into the template size is obtained, when shifting to the adjacent divided region in step SP16, the position is shifted by wid_t in the x direction. However, the matching result image R2 can be divided into any size as long as the size is equal to or smaller than the template size. At this time, assuming that the width (side in the x direction) width_step and the height (y direction) hei_step of the size of the image to be divided are wi_step in the x direction or hei_step in the y direction in step SP16 and step SP18, respectively. Thereby, it is possible to proceed to the next divided area.
[0244]
Thus, in the template matching unit 92, face candidates are extracted. By executing the processing up to this point in the arithmetic processing unit 71 and executing the subsequent processing in the main control unit 381, the processing load on the main control unit 381 can be reduced accordingly. Subsequent processing of extracting face data via the scale conversion and cutout unit 93, the preprocessing unit 94, and the pattern recognition unit 95 is the same as in the above-described second application example.
[0245]
(3-3) Effects of the present embodiment
FIG. 16 is a diagram showing points detected by the template matching unit 92 from the window image W2 as face candidates. In FIG. 16, the points shown in white are points extracted as face candidates from the matching result image R2 shown in FIG. As a comparison, FIG. 17 is a diagram illustrating an example in which all points that are equal to or larger than a threshold are extracted as face candidates in the matching result image R2. Compared to the diagram shown in FIG. 17, it can be seen that, in the present embodiment, the number of points extracted as face candidates by template matching section 92 is dramatically reduced. As a result, the amount of calculation in the subsequent processing can be drastically reduced.
[0246]
In the present embodiment, when a window image is roughly determined to be a face image by performing matching using a template representing an average face image, the template matching result image is set to a predetermined size. By extracting the maximum value of the partition and the correlation value as a face candidate and removing window images that are not clearly face images, the amount of calculation and time required for the subsequent face detection processing are reduced without overlooking the area that is originally a face. Thus, it is possible to realize a face detection device with much improved real-time properties and a robot device equipped with the same.
[0247]
In addition, the face detection accuracy can be improved by setting the face search range around the point where the correlation value becomes the maximum as well as around the point. Furthermore, only when the skin color occupancy or the face color occupancy is equal to or greater than a predetermined threshold, by setting the face search range, the number of face candidates can be reduced while maintaining face detection accuracy, and the amount of calculation in the subsequent stage can be reduced. . Furthermore, the amount of calculation can be further reduced by appropriately switching the size of the template.
[0248]
(4) Other embodiments
Note that, in the above-described embodiment, a case has been described in which the present invention is applied to the bipedal walking robot device 200 configured as shown in FIG. 1, but the present invention is not limited to this. The present invention can be widely applied to various robot devices and other various devices other than the robot device. For example, the robot device may be a four-legged walker, and the moving means is not limited to the leg type moving system.
[0249]
Further, as a scale conversion unit for reducing and converting a frame image obtained as a result of imaging by the CCD camera (imaging unit) 50 into a plurality of scale images having different reduction ratios, among the processing functions of the main control unit 381 shown in FIG. The case where the input image scale conversion unit 90 is applied has been described. However, the present invention is not limited to this, and the point is that if a plurality of scale images obtained by reducing a frame image at a plurality of different reduction ratios can be obtained. Alternatively, various other configurations may be applied.
[0250]
Further, five kinds of scale images obtained by reducing and converting one frame image by 0.8 times are applied. However, as long as the processing capacity of the main control unit 381 is within the range of the processing capability, six kinds of two to four kinds are used. The above may be used, and the reduction ratio may be freely set. In this case, if the face detection processing is performed based on a plurality of scale images, the frame detection process can be performed regardless of the size of the face actually shown in the frame image as the imaging result (that is, the distance between the robot device 200 and the human). A face image can be detected with a much higher probability than when a face image is directly detected from an image.
[0251]
Further, a main control unit 381 shown in FIG. 12 extracts a frame image obtained as a result of imaging by the CCD camera (imaging unit) 50 while sequentially scanning a 400-pixel window image so as to be shifted by one pixel. Although the description has been given of the case where the processing function is configured by the window cutout unit 91 among the processing functions of the present invention, the present invention is not limited to this. If it is possible to sequentially extract while scanning as described above, other various configurations may be applied.
[0252]
In this case, the scanning order may be set in any manner as long as the entire rectangular image can be scanned other than from the upper left position to the lower right position of the rectangular image. The number of pixels to be shifted during scanning may be one pixel or three or more pixels, and the size of the window image may be set to a desired number of pixels having various aspect ratios other than 400 pixels.
[0253]
Further, a case has been described in which the preprocessing unit 95 and the pattern identification unit 96 of the processing functions of the main control unit 381 shown in FIG. 12 are applied as identification means for identifying the face image from the window image. The present invention is not limited to this, and can be widely applied to configurations using various other methods as long as a face image can be detected.
[0254]
Further, the extracted window image is subjected to matching using a template representing an average face image, and as a correlation detection means for detecting the correlation with the template, the processing of the main control unit 381 shown in FIG. The case where the template matching unit 92 of the functions is applied has been described. However, the present invention is not limited to this. In other words, if the correlation degree can be detected based on the template, various other methods may be used. Can be widely applied.
[0255]
Further, the main control unit 381 is applied as control means for controlling the entire control of the robot apparatus 200, and the internal memory 381A is applied as a memory means connected to the main control unit 381 and for reading and writing information according to the control. Further, among the processing functions of the main control unit 381 shown in FIG. 12, an arithmetic processing unit 71 (DMA controller 70) as an arithmetic processing unit for executing each processing by the input image scale conversion unit 90, the window cutout unit 91, and the template matching unit 92. ) Was applied, but the present invention is not limited to this, and the main control unit 381 and the arithmetic processing unit 71 selectively share the processing functions (90 to 95) of the face detection processing. If the processing can be switched and executed via the internal memory 381A, the content of the assignment at that time depends on the calculation amount and the calculation amount of the main control unit 381. It can be freely set to reduce the burden between. Further, the processing may be performed only by the main control unit 381.
[0256]
【The invention's effect】
As described in detail above, a face detection apparatus according to the present invention is a face detection apparatus for extracting a face area of a target object from an input image, wherein a frame image obtained as an imaging result by an imaging unit is used as an input image. And a matching result generating means for generating a matching result which is a set of correlation values obtained by correlating with a template of a predetermined size indicating an average face image, and obtaining a local maximum value of the correlation value from the matching result. A face candidate extracting unit that extracts a face candidate based on the value, and an identifying unit that identifies a face image from an input image area corresponding to the point extracted as the face candidate. Since face candidates can be accurately narrowed down, the amount of calculation in the subsequent face identification means is reduced, and real-time face detection becomes possible.
[0257]
A robot apparatus according to the present invention is an autonomous robot apparatus that operates based on supplied input information, wherein an imaging unit, a frame image obtained as an imaging result by the imaging unit is used as an input image, and a person is extracted from the input image. Face detection means for extracting a face area of the input image. The face detection means generates a matching result which is a set of correlation values obtained by correlating the input image with a template of a predetermined size representing an average face image. Matching result generating means for obtaining a local maximum value of a correlation value from the matching result, extracting face candidates based on the local maximum value, and an input image area corresponding to a point extracted as the face candidate And an identification means for identifying a face image from a face candidate. And, by allowing real-time face detection, it is possible to improve the entertainment property of the robot apparatus.
[Brief description of the drawings]
FIG. 1 is a perspective view showing the appearance of a humanoid robot device according to an embodiment of the present invention as viewed from the front.
FIG. 2 is a perspective view showing the appearance of the humanoid robot device according to the embodiment of the present invention as viewed from behind.
FIG. 3 is a schematic diagram illustrating a degree of freedom configuration model of the humanoid robot device according to the embodiment of the present invention.
FIG. 4 is a schematic diagram showing a control system configuration of the humanoid robot device according to the embodiment of the present invention.
FIG. 5 is a block diagram for explaining an internal configuration of a robot device according to a first application example.
FIG. 6 is a block diagram illustrating a face detection unit of the robot device according to the first application example.
FIG. 7 is a schematic plan view for explaining an overlap determination process;
FIG. 8 is a block diagram for explaining an internal configuration of a robot device according to a second application example.
FIG. 9 is a block diagram illustrating a face detection unit of a robot device according to a second application example.
FIG. 10 is a schematic diagram for explaining detection of correlation by template matching.
FIG. 11 is a flowchart illustrating an overlap determination processing procedure in the order of steps.
FIG. 12 is a block diagram illustrating a face detection task function of the robot device according to the embodiment of the present invention.
FIGS. 13A and 13B are schematic diagrams showing an input image (window image) and a template, respectively.
FIG. 14 is a diagram showing a matching result image which is a set of correlation values obtained from an input image (window image) and a template.
FIG. 15 is a flowchart illustrating processing steps in a template matching unit of the face detection unit according to the embodiment of the present invention.
FIG. 16 is a diagram showing a result of extracting a face candidate from a matching result image by the template matching unit of the face detection task function according to the embodiment of the present invention.
FIG. 17 is a diagram illustrating a result of extracting a matching result image that is equal to or larger than a predetermined threshold value as a face candidate.
[Explanation of symbols]
1 robot apparatus, 381 main control section, 381A internal memory, 50 CCD camera, 60, 80, 90 input image scale conversion section, 61, 81, 91 window cutout section, 62, 82, 92 template matching section, 63, 85 , 94 preprocessing section, 64, 86, 95 pattern identification section, 65, 83 overlap determination section, 70 DMA controller, 71 arithmetic processing section, 84, 93 scale conversion and cutout section, RT1 overlap determination processing procedure, S1A image signal

Claims

In a face detection apparatus for extracting a face region of an object from an input image, a frame image obtained as a result of imaging by an imaging unit is used as an input image, and a correlation between the input image and a template of a predetermined size representing an average face image is determined. A matching result generating means for generating a matching result that is a set of taken correlation values;
A face candidate extraction unit for obtaining a local maximum value of the correlation value from the matching result and extracting a face candidate based on the local maximum value;
Identification means for identifying the face area from the input image area corresponding to the face candidate in the matching result.

2. The face detection apparatus according to claim 1, wherein the face candidate extracting unit divides the matching result into a plurality of regions, and extracts at least a maximum value of a correlation value as a face candidate for each of the divided regions.

3. The face detection device according to claim 2, wherein the size of the divided area is equal to or smaller than the size of the template.

2. The face detecting apparatus according to claim 1, further comprising template size determining means for determining the size of the template to be input to the matching result generating means from templates of different sizes.

5. The face detecting apparatus according to claim 4, wherein said template size determining means selects a template having the same size as a previously detected face area.

5. The face detecting apparatus according to claim 4, wherein the template size determining means receives distance information from the object in the input image and selects the template based on the distance information.

5. The face detecting apparatus according to claim 4, wherein the matching result generating means generates the matching result corresponding to each template by sequentially inputting the templates of different sizes.

3. The face detecting apparatus according to claim 2, wherein the face candidate extracting unit extracts only those having a predetermined threshold or more from the maximum values of the correlation values of the divided areas as face candidates.

The matching result, which is a set of correlation values, is a set of correlation values between the template and the input image that have been scanned and moved while being shifted by a predetermined number of pixels, and are arranged with the movement of the template. 3. The face detection device according to claim 2, wherein the face detection device has a two-dimensional array.

The face candidate extracting means sets a point having the maximum value of the correlation value of each of the divided areas as a first face candidate, and points near the first face candidate as second face candidates as the first and second face candidates. The face detection device according to claim 9, wherein the face candidate is extracted as a face candidate.

The face candidate extracting means may be configured such that, of the points near the first face candidate, the occupancy of the skin color region in the input image region of the template size corresponding to the point near the first face candidate is equal to or greater than a predetermined threshold. 11. The face detecting apparatus according to claim 10, wherein a certain point is extracted as a second face candidate.

The face candidate extracting means has face color information learned in advance, and among the points near the first face candidate, a face color area in the input image area of the template size corresponding to a point near the first face candidate. 11. The face detection device according to claim 10, wherein a point whose occupancy is equal to or more than a predetermined threshold is extracted as a second face candidate.

Window image extraction means for scanning the frame image and sequentially extracting a window image of a predetermined size while shifting the pixel by a predetermined pixel, wherein the matching result generating means generates a matching result using the window image as an input image The face detection device according to claim 1, wherein:

A scale conversion unit provided in a preceding stage of the window image extraction unit to reduce the frame image to a scale image having a plurality of different reduction ratios,
14. The face detecting apparatus according to claim 13, wherein said window image extracting means scans each of said scale images and sequentially extracts said window images while shifting said window images by predetermined pixels.

In a face detection method for extracting a face area of an object from an input image,
Matching result generation for generating a matching result as a set of correlation values obtained by correlating a frame image obtained as a result of imaging by an imaging unit with a template of a predetermined size representing an average face image as an input image Process and
A face candidate extraction step of finding a local maximum value of the correlation value from the matching result and extracting a face candidate based on the local maximum value;
An identification step of identifying the face area from an input image area corresponding to the face candidate in the matching result.

16. The face detecting method according to claim 15, wherein in the face candidate extracting step, the matching result is divided into a plurality of regions, and at least a maximum value of a correlation value is extracted as a face candidate for each of the divided regions.

17. The face detection method according to claim 16, wherein the size of the divided area is equal to or smaller than the size of the template.

16. The face detecting method according to claim 15, further comprising a template size determining step of determining a size of the template used in the matching result generating step from templates of different sizes.

19. The face detecting method according to claim 18, wherein in the template size determining step, a template having the same size as a face area detected in advance is selected.

19. The face detection method according to claim 18, wherein in the template size determining step, information on a distance from the target in the input image is input, and the template is selected based on the distance information.

17. The face detection method according to claim 16, wherein in the matching result generating step, the templates of different sizes are sequentially input to generate the matching results corresponding to each template.

17. The face detecting method according to claim 16, wherein, in the face candidate extracting step, only those having a predetermined threshold or more among the maximum correlation values of the divided areas are extracted as face candidates.

The matching result, which is a set of correlation values, is a set of correlation values between the template and the input image, which are obtained by scanning the input image and moving it by a predetermined number of pixels, and are arranged in accordance with the movement of the template. 17. The face detecting method according to claim 16, wherein the face is a two-dimensional array.

In the face candidate extraction step, a point having the maximum value of the correlation value of each of the divided areas is set as a first face candidate, and a point near the first face candidate is set as a second face candidate. The face detection method according to claim 23, wherein the face candidate is extracted as a face candidate.

In the face candidate extracting step, among the points near the first face candidate, the occupancy of the skin color region in the input image region of the template size corresponding to the point near the first face candidate is equal to or greater than a predetermined threshold. The face detection method according to claim 24, wherein a certain point is extracted as a second face candidate.

In the face candidate extracting step, the face color information in the input image area of the template size corresponding to the point near the first face candidate among the points near the first face candidate, having the face color information learned in advance. 25. The face detection method according to claim 24, wherein a point whose occupancy is equal to or greater than a predetermined threshold is extracted as a second face candidate.

A window image extracting step of scanning the frame image and sequentially extracting a window image of a predetermined size while shifting the pixel by a predetermined pixel, wherein in the matching result generating step, a matching result is generated using the window image as an input image. 16. The face detection method according to claim 15, wherein:

Before extracting the window image, the frame image has a scale conversion step of reducing the scale image into a scale image having a plurality of different reduction ratios,
28. The face detecting method according to claim 27, wherein in the window image extracting step, the scale images are scanned and the window images are sequentially extracted while being shifted by a predetermined pixel.

In an autonomous robot device that operates based on the supplied input information,
Imaging means;
Having a frame image obtained as an imaging result by the imaging means as an input image and a face detection means for extracting a face area of a person from the input image,
The face detecting means includes: a matching result generating means for generating a matching result which is a set of correlation values obtained by correlating the input image with a template of a predetermined size representing an average face image; and a correlation value based on the matching result. A face candidate extracting means for obtaining a local maximum value of the face candidate and extracting a face candidate based on the local maximum value, and an identifying means for identifying the face area from an input image area corresponding to a point extracted as the face candidate. A robot device characterized by the above-mentioned.

Equipped with control means for overall control,
30. The robot apparatus according to claim 29, wherein the control unit selectively executes any one of the processes by the matching result generation unit, the face candidate extraction unit, and the identification unit.

A memory means connected to the control means for reading and writing information according to the control of the control means;
An arithmetic processing unit connected to the memory unit and performing processing of the matching result generation unit, the face candidate extraction unit, and processing of the identification unit excluding a processing target of the control unit,
31. The robot device according to claim 30, wherein the control unit or the arithmetic processing unit switches and executes the corresponding process via the memory unit.

In a program for causing a computer to perform an operation of extracting a human face region from an input image,
Matching result generation for generating a matching result as a set of correlation values obtained by correlating a frame image obtained as a result of imaging by an imaging unit with a template of a predetermined size representing an average face image as an input image Process and
A face candidate extraction step of finding a local maximum value of the correlation value from the matching result and extracting a face candidate based on the local maximum value;
A step of identifying the face area from an input image area corresponding to a point extracted as the face candidate.

In a computer-readable recording medium recording a program for causing a computer to perform an operation of extracting a person's face region from an input image,
Matching result generation for generating a matching result as a set of correlation values obtained by correlating a frame image obtained as a result of imaging by an imaging unit with a template of a predetermined size representing an average face image as an input image Process and
A face candidate extraction step of finding a local maximum value of the correlation value from the matching result and extracting a face candidate based on the local maximum value;
An identification step of identifying the face area from an input image area corresponding to a point extracted as the face candidate.