JP4239635B2 - Robot device, operation control method thereof, and program

Info

Publication number: JP4239635B2
Application number: JP2003079146A
Authority: JP
Inventors: 務澤田; 秀樹下村
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-03-20
Filing date: 2003-03-20
Publication date: 2009-03-18
Anticipated expiration: 2023-03-20
Also published as: JP2004283959A

Abstract

PROBLEM TO BE SOLVED: To enable a robot to track an object robustly even after tracking has been suspended by an interrupting action of the robot, or when recognition of the object is unstable because of changes in the external environment.

SOLUTION: The tracking system of the robot device comprises: a hierarchical image recognition unit 15 with different recognition levels, consisting of a skin-color recognition unit 11, a face recognition unit 12, and an individual recognition unit 13; a voice direction recognition unit 14 that recognizes the direction of a voice; a recognition integration unit 21 having a control unit that starts tracking a person's face based on the result of the individual recognition unit 13 and, when the individual recognition unit 13 fails, continues tracking based on the result of another recognition unit such as the face recognition unit 12; and a prediction unit 31 that, when no result can be obtained from the other recognition units either, continues tracking toward the predicted direction of the object based on the recognition results obtained immediately beforehand.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、人間等の２足歩行型の身体メカニズムや動作を模した自律的な動作を行うロボット装置、その動作制御方法及びプログラムに関し、特に、対象物を認識し、少なくとも目、首又は体により該対象物をトラッキングするロボット装置、その動作制御方法及びプログラムに関する。
【０００２】
【従来の技術】
近年、外観形状が人間又は動物に模して形成された例えばエンターテインメント型のロボット装置が提供されている。このようなロボット装置は、外部からの情報（例えば、周囲環境の情報等）や内部の状態（例えば、感情状態等）等に応じて首関節や脚等を自律的に動作させることで、人間又は動物のような動作を表出させている。このようなロボット装置は、撮像手段や音声入力手段等を有し、例えば撮像した画像内から人物の顔を抽出し、この顔をトラッキングしながら特定の個人を識別するものもあり、よりエンターテインメント性のあるものとなっている。
【０００３】
こうしたロボット装置に搭載される顔識別のアプリケーションでは、ある与えられた１枚のシーンの中の人物を識別する問題に加え、次のような問題を解決する必要がある。
（１）ロボット装置自身が移動するため、環境の変化やその多様性を許容しなくてはならない
（２）人間とロボット装置の位置関係も変化するため、インタラクション中には人間を視野に入れ続ける必要がある
（３）数多くのシーン画像から人物の識別に使える画像を選び出し、総合的に判断しなくてはならない
（４）ある時間内に応答しなくてはならない
【０００４】
このような条件下においても、ロボット装置自体の自律性を損なうことなく、限られたリソースにおいても顔識別可能なロボット装置が下記特許文献１に記載されている。この特許文献１に記載のロボット装置においては、撮像手段により撮像された画像内で変化する顔を追跡する（トラッキングする）顔追跡手段と、顔追跡手段による顔の追跡情報に基づいて、撮像手段により撮像された画像内の顔の顔データを検出する顔データ検出手段と、顔データ検出手段が検出した顔データに基づいて、特定顔を識別（個人認識）する顔識別手段とを備える。これにより、画像内で変化する顔を追跡しながら当該顔の顔データを検出して、検出した顔データに基づいて、特定顔を識別するものであり、撮像された画像内から顔領域を検出し、検出した顔を追跡し、その間に検出した顔が誰であるかを人物識別する。具体的には、入力画像中から顔領域を検出する処理はサポートベクターマシン（ＳＶＭ）等による顔又は非顔の識別を行う。また、検出した顔の追跡は、肌色領域を追跡することにより行う。更に、人物識別（個人識別）は、入力顔の目、鼻等の位置合わせ（モーフィング）を行い、複数の登録顔との差分から同一人物か否かの判定を行う。
【０００５】
【特許文献１】
特開２００２−１５７５９６号公報
【０００６】
【発明が解決しようとする課題】
ところで、上述の従来のロボット装置において、画像内の顔の移動の追跡（トラッキング）には肌色認識結果を用いて行われている。このような従来のトラッキング動作は、通常、対象物の認識に失敗した時点で終了されてしまう。しかし、ロボット装置にとって、物体やユーザ等の対象物を首関節を回転させる等の体幹部の一部を使用して追跡する動作（トラッキング）は重要な技術である。例えば動くボールを追いかける際にトラッキングできなければボールを追いかけることが困難となる。また、興味があるものを頭部等の体幹部の一部で追う動作等は生物らしさを感じさせる動作であり、エンターテイメント用等の人間又は動物を模したようなロボット装置には必要な動作である。更に、ユーザとのインタラクションをしている際に、ユーザの動きをトラッキング（追尾）することは、より自然なインタラクションのために重要である。
【０００７】
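However, the image-based recognition of the outside world that a robot device needs for tracking is often unstable: it is sensitive to lighting conditions and to the angle of a person's face, and even a small change in either can cause recognition to fail. When an object such as a ball moves a long way, it passes through unevenly lit areas, which makes recognition harder. Furthermore, a robot device capable of autonomous operation continually selects the action to express on the basis of its internal state and external stimuli, so when another action with a higher priority than the tracking action arises, the robot may interrupt tracking and allow that other action to be expressed. For example, while talking with person A the robot may be called by another person B, turn around, hold a short conversation with person B, and then try to continue the conversation with person A; in such cases tracking is stopped once and must then be started again. Although the position of person A can in principle be stored, recognition may fail and tracking may not resume if person A moves even slightly. Thus tracking often fails because of changes in the external environment or, in an autonomous robot device, because of other interrupting actions, and continuing to track an object is extremely difficult.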
【０００８】
本発明は、このような従来の実情に鑑みて提案されたものであり、トラッキング動作の最中に他の割り込み動作を行ってトラッキングを中断した後及び外部環境の変化等による対象物認識の不安定さに対してもロバストなトラッキングを行うことができるロボット装置、その動作制御方法及びプログラムを提供することを目的とする。
【０００９】
【課題を解決するための手段】
上述した目的を達成するために、本発明に係るロボット装置は、少なくとも機体の一部を使用して対象物をトラッキングするロボット装置において、異なる種類のものを認識する複数の認識手段又は同一種類のものを認識しその認識レベルに応じて２以上に階層化された複数の認識手段と、上記複数の認識手段からの認識結果を統合する認識統合手段と、上記対象物をトラッキングする動作を生成する動作生成手段と、上記認識統合結果に基づき上記動作生成手段を制御するトラッキング制御手段とを有し、上記トラッキング制御手段は、上記複数の認識手段のうち所定の認識手段の認識結果に基づき上記対象物のトラッキングを開始し、該所定の認識手段の認識結果が得られなくなった場合に該所定の認識手段とは異なる他の認識手段の認識結果に基づき上記トラッキングを継続するよう制御し、上記トラッキングは、該トラッキング動作より優先度が高い動作により中断され、上記トラッキングが終了又は中断される直前の上記対象物の方向を記憶する対象物記憶手段を有し、上記トラッキング制御手段は、上記優先度が高い動作が終了した際、上記記憶した方向にて上記複数の認識手段の何れかにおいて上記対象物を認識した場合に再びトラッキングを開始するよう制御する。
【００１０】
本発明においては、複数の認識手段を有し、所定の認識手段の認識結果に基づきトラッキングを開始した場合、該所定の認識手段による認識が失敗しても他の認識手段による認識結果に基づきトラッキングを継続することができ、複数の認識手段を複合的に使用することで、１つの認識手段によってトラッキングを行う従来の方法に比して、極めてロバストなトラッキングを行うことができる。
【００１１】
また、上記他の認識手段からの認識結果が得られなくなった場合に、直前までの該他の認識手段の認識結果に基づき上記対象物の予測方向を求める予測手段を有し、上記トラッキング制御手段は、上記予測方向に基づき上記トラッキングを継続するよう制御することにより、他の認識手段による認識が失敗しても、他の認識手段の認識が失敗する直前までの認識結果から対象物の方向を予測しトラッキングを継続することができ、更にロバストなトラッキングを行うことができる。
【００１２】
更に、上記他の認識手段は、上記所定の認識手段と同一の種類であって下層の認識手段、又は該所定の認識手段とは異なる種類のものを認識する認識手段とすることにより、認識レベルが高い認識手段で失敗しても認識レベルを下げれば認識できる場合や、例えば画像が取得できない場合は音声認識を使用することにより、更にロバストなトラッキングを行うことができる。
【００１３】
更にまた、上記トラッキング制御手段は、上記所定の認識手段の認識結果が所定の時間得られない場合に該トラッキングを終了又は中断することができ、トラッキングスタート時の認識手段による認識が所定期間得られないときは、トラッキングを終了又は中断することで、対象物を間違えてトラッキングしてしまうことを防止する。
【００１４】
また、上記トラッキング制御手段は、上記所定の認識手段の認識結果が得られなくなった直前の上記対象物の方向と、上記他の認識手段の認識結果により得られる上記対象物の方向との差が所定の値を超えた場合に上記トラッキングを終了又は中断することで、他の認識手段の認識結果が得られている場合であっても所定条件を満たさない場合は、トラッキングを終了又は中断して対象物を間違えてトラッキングしてしまうことを防止する。
【００１５】
更に、上記所定の認識手段が同一種類の下層の認識手段を有する場合、上記認識統合手段は、上記下層の認識手段の認識結果に基づき上記対象物のトラッキングを継続するよう制御することができ、認識レベルを下げて、認識が難しい条件下においてもトラッキングを継続することができる。
【００１６】
更にまた、上記トラッキングが終了又は中断される直前までの上記対象物の動きを検出する対象物動き検出手段を有し、上記トラッキング制御手段は、上記優先度が高い動作が終了した際、当該動き検出結果に基づき上記認識手段の何れかにおいて上記対象物を認識した場合に再びトラッキングを開始するよう制御することができ、対象物がトラッキングを中断又は終了した時点から移動していたとしても、認識レベルが異なる又は認識の種別が異なる複数の認識手段のうちいずれかで対象物が認識又は対象物の位置を予測し、トラッキングを再開することができる。
【００１７】
本発明に係るロボット装置の行動制御方法は、少なくとも機体の一部を使用して対象物をトラッキングするロボット装置の動作制御方法において、異なる種類のものを認識する複数の認識手段又は同一種類のものを認識しその認識レベルに応じて２以上に階層化された複数の認識手段により上記対象物を認識する認識工程と、上記複数の認識手段からの認識結果を統合する認識統合工程と、上記対象物をトラッキングする動作を生成する動作生成工程と、上記認識統合結果に基づき上記トラッキングする動作を制御するトラッキング制御工程とを有し、上記トラッキング制御工程では、上記複数の認識手段のうち所定の認識手段の認識結果に基づき上記対象物のトラッキングを開始し、該所定の認識手段の認識結果が得られなくなった場合に該所定の認識手段とは異なる他の認識手段の認識結果に基づき上記トラッキングを継続するよう制御され、上記トラッキングは、該トラッキング動作より優先度が高い動作により中断され、上記トラッキングが終了又は中断される直前の上記対象物の方向を記憶する対象物記憶工程を有し、上記トラッキング制御工程では、上記優先度が高い動作が終了した際、上記記憶した方向にて上記複数の認識手段の何れかにおいて上記対象物を認識した場合に再びトラッキングを開始するよう制御される。
【００１８】
また、本発明に係るプログラムは、上述したロボット装置の行動制御をコンピュータに実行させるものである。
【００１９】
【発明の実施の形態】
以下、本発明の一構成例として示す２足歩行の人間型のロボット装置について、図面を参照して詳細に説明する。この人間型のロボット装置は、住環境その他の日常生活上の様々な場面における人的活動を支援する実用ロボットであり、内部状態（怒り、悲しみ、喜び、楽しみ等）に応じて自律的に行動できるほか、人間が行う基本的な動作を表出できるエンターテインメントロボットである。
【００２０】
（Ａ）ロボット装置の構成
図１は、本実施の形態におけるロボット装置の概観を示す斜視図である。図１に示すように、ロボット装置１は、体幹部ユニット２の所定の位置に頭部ユニット３が連結されると共に、左右２つの腕部ユニット４Ｒ／Ｌと、左右２つの脚部ユニット５Ｒ／Ｌが連結されて構成されている（但し、Ｒ及びＬの各々は、右及び左の各々を示す接尾辞である。以下において同じ。）。
【００２１】
このロボット装置１が具備する関節自由度構成を図２に模式的に示す。頭部ユニット３を支持する首関節は、首関節ヨー軸１０１と、首関節ピッチ軸１０２と、首関節ロール軸１０３という３自由度を有している。
【００２２】
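Each arm unit 4R/L constituting the upper limbs consists of a shoulder joint pitch axis 107, a shoulder joint roll axis 108, an upper arm yaw axis 109, an elbow joint pitch axis 110, a forearm yaw axis 111, a wrist joint pitch axis 112, a wrist joint roll axis 113, and a hand 114. The hand 114 is in reality a multi-joint, multi-degree-of-freedom structure including several fingers. However, because the motion of the hand 114 contributes little to, and has little effect on, the posture control and walking control of the robot device 1, it is assumed in this specification to have zero degrees of freedom. Each arm therefore has seven degrees of freedom.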
【００２３】
また、体幹部ユニット２は、体幹ピッチ軸１０４と、体幹ロール軸１０５と、体幹ヨー軸１０６という３自由度を有する。
【００２４】
また、下肢を構成する各々の脚部ユニット５Ｒ／Ｌは、股関節ヨー軸１１５と、股関節ピッチ軸１１６と、股関節ロール軸１１７と、膝関節ピッチ軸１１８と、足首関節ピッチ軸１１９と、足首関節ロール軸１２０と、足部１２１とで構成される。本明細書中では、股関節ピッチ軸１１６と股関節ロール軸１１７の交点は、ロボット装置１の股関節位置を定義する。人体の足部１２１は、実際には多関節・多自由度の足底を含んだ構造体であるが、ロボット装置１の足底は、ゼロ自由度とする。したがって、各脚部は、６自由度で構成される。
【００２５】
以上を総括すれば、ロボット装置１全体としては、合計で３＋７×２＋３＋６×２＝３２自由度を有することになる。ただし、エンターテインメント向けのロボット装置１が必ずしも３２自由度に限定されるわけではない。設計・制作上の制約条件や要求仕様等に応じて、自由度すなわち関節数を適宜増減することができることはいうまでもない。
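The degree-of-freedom total stated above can be checked with a few lines of Python. The dictionary keys are illustrative names chosen here, not identifiers from the specification; the per-unit counts are the ones given in the preceding paragraphs.

```python
# Degrees of freedom per unit, as stated in the description.
DOF = {"neck": 3, "arm": 7, "trunk": 3, "leg": 6}

# One neck and one trunk, two arms and two legs.
total = DOF["neck"] + 2 * DOF["arm"] + DOF["trunk"] + 2 * DOF["leg"]
print(total)
```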
【００２６】
上述したようなロボット装置１がもつ各自由度は、実際にはアクチュエータを用いて実装される。外観上で余分な膨らみを排してヒトの自然体形状に近似させること、２足歩行という不安定構造体に対して姿勢制御を行うこと等の要請から、アクチュエータは小型且つ軽量であることが好ましい。
【００２７】
なお、以下では、説明の便宜上、足部１２１の説明において、足部１２１の裏面の路面（床面）に当接する部分を含んで構成される面をＸ−Ｙ平面とし、該Ｘ−Ｙ平面内において、ロボット装置の前後方向をＸ軸とし、ロボット装置の左右方向をＹ軸とし、これらに直交する方向をＺ軸として説明する。
【００２８】
このようなロボット装置は、ロボット装置全体の動作を制御する制御システムを例えば体幹部ユニット２等に備える。図３は、ロボット装置１の制御システム構成を示す模式図である。図３に示すように、制御システムは、ユーザ入力等に動的に反応して情緒判断や感情表現を司る思考制御モジュール２００と、アクチュエータ３５０の駆動等ロボット装置１の全身協調運動を制御する運動制御モジュール３００とで構成される。
【００２９】
思考制御モジュール２００は、情緒判断や感情表現に関する演算処理を実行するＣＰＵ（Central Processing Unit）２１１や、ＲＡＭ（Random Access Memory）２１２、ＲＯＭ（Read Only Memory）２１３及び外部記憶装置（ハード・ディスク・ドライブ等）２１４等で構成され、モジュール内で自己完結した処理を行うことができる、独立駆動型の情報処理装置である。
【００３０】
この思考制御モジュール２００は、画像入力装置２５１から入力される画像データや音声入力装置２５２から入力される音声データ等、外界からの刺激等に従って、ロボット装置１の現在の感情や意思を決定する。ここで、画像入力装置２５１は、例えばＣＣＤ（Charge Coupled Device）カメラを複数備えており、また、音声入力装置２５２は、例えばマイクロホンを複数備えている。
【００３１】
また、思考制御モジュール２００は、意思決定に基づいた動作又は行動シーケンス、すなわち四肢の運動を実行するように、運動制御モジュール３００に対して指令を発行する。
【００３２】
一方の運動制御モジュール３００は、ロボット装置１の全身協調運動を制御するＣＰＵ３１１や、ＲＡＭ３１２、ＲＯＭ３１３及び外部記憶装置（ハード・ディスク・ドライブ等）３１４等で構成され、モジュール内で自己完結した処理を行うことができる独立駆動型の情報処理装置である。また、外部記憶装置３１４には、例えば、オフラインで算出された歩行パターンや目標とするＺＭＰ軌道、その他の行動計画を蓄積することができる。
【００３３】
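Connected to the motion control module 300 via a bus interface (I/F) 301 are various devices: the actuators 350 that realize the joint degrees of freedom distributed over the whole body of the robot device 1 shown in FIG. 2; a posture sensor 351 that measures the posture and inclination of the trunk unit 2; ground contact sensors 352 and 353 that detect whether the left and right soles have left or touched the floor; load sensors of the present embodiment, described later, provided on the soles of the feet 121; and a power supply control device 354 that manages a power source such as a battery. The posture sensor 351 is, for example, a combination of an acceleration sensor and a gyro sensor, and the ground contact sensors 352 and 353 are proximity sensors, microswitches, or the like.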
【００３４】
思考制御モジュール２００と運動制御モジュール３００は、共通のプラットフォーム上で構築され、両者間はバス・インターフェース２０１，３０１を介して相互接続されている。
【００３５】
運動制御モジュール３００では、思考制御モジュール２００から指示された行動を体現すべく、各アクチュエータ３５０による全身協調運動を制御する。すなわち、ＣＰＵ３１１は、思考制御モジュール２００から指示された行動に応じた動作パターンを外部記憶装置３１４から取り出し、又は、内部的に動作パターンを生成する。そして、ＣＰＵ３１１は、指定された動作パターンに従って、足部運動、ＺＭＰ軌道、体幹運動、上肢運動、腰部水平位置及び高さ等を設定するとともに、これらの設定内容に従った動作を指示する指令値を各アクチュエータ３５０に転送する。
【００３６】
また、ＣＰＵ３１１は、姿勢センサ３５１の出力信号によりロボット装置１の体幹部ユニット２の姿勢や傾きを検出するとともに、各接地確認センサ３５２，３５３の出力信号により各脚部ユニット５Ｒ／Ｌが遊脚又は立脚のいずれの状態であるかを検出することによって、ロボット装置１の全身協調運動を適応的に制御することができる。更に、ＣＰＵ３１１は、ＺＭＰ位置が常にＺＭＰ安定領域の中心に向かうように、ロボット装置１の姿勢や動作を制御する。
【００３７】
また、運動制御モジュール３００は、思考制御モジュール２００において決定された意思通りの行動がどの程度発現されたか、すなわち処理の状況を、思考制御モジュール２００に返すようになっている。このようにしてロボット装置１は、制御プログラムに基づいて自己及び周囲の状況を判断し、自律的に行動することができる。
【００３８】
（Ｂ）ロボット装置の制御システム
次に、このようなロボット装置の制御システムについて説明する。本ロボット装置は、複数の認識器を有し、これらを利用することで、従来に比してロバストなトラッキングを行うことができるものであり、ここでは先ずそのトラッキングシステム及びトラッキング方法について説明し、ロボット装置全体の制御システムについては後述する。
【００３９】
（１）トラッキングシステム
図４は、本実施の形態のロボット装置の制御システムのうち、トラッキングシステムを構成する要部を示すブロック図である。本実施の形態におけるトラッキングシステムは、例えば画像認識や音声認識等の異なる種類のものを認識する複数の認識器を有し、更には、画像認識や音声認識は、異なる認識レベルの認識を行うことができるよう階層化されたものであって、これら種別又は認識レベルが異なる複数の認識器の認識結果を必要に応じて組み合わせることで、ロボット装置がトラッキング最中に、トラッキングとは異なる割り込み動作を行って、トラッキング動作を中断（一旦停止）してしまった場合や、環境変化により認識器の認識が失敗した場合であってもロバストなトラッキングを行うことができるトラッキングシステムである。
【００４０】
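As shown in FIG. 4, the tracking system 10 has: an image input device 251 that captures images; a voice input device 252 equipped with microphones and the like; an image recognition unit 15 that receives images from the image input device 251 and consists of a skin-color recognition unit 11, a face recognition unit 12, and an individual recognition unit 13, which perform skin-color recognition, face recognition, and individual recognition, respectively; a voice direction recognition unit 14 that recognizes the direction of the voice supplied from the voice input device 252; a recognition integration unit 21 that integrates the results of these recognizers 14 and 15; a prediction unit 31 that predicts the position of the object from the integration result of the recognition integration unit 21; a neck control command generation unit 22 that, based on either the integration result of the recognition integration unit 21 or the prediction of the prediction unit 31, controls for example the rotation angle of the robot device's neck in order to track the object; and an output unit 23 that outputs the control information from the neck control command generation unit 22.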
【００４１】
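Skin-color recognition, face recognition, and individual recognition exist as recognizers of different recognition levels for recognizing a human face. In the example of FIG. 4, the highest-level and most difficult recognizer is the individual recognition unit 13, which identifies who the target person is. Next is the face recognition unit 12, which recognizes whether something is a face of a person or the like, and the lowest level is the skin-color recognition unit 11, which recognizes skin-color regions and is the easiest. Objects to be tracked include, for example, persons and non-person objects such as balls; here, the case of recognizing and tracking a person is taken as the example.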
【００４２】
例えば、肌色認識部１１は、肌色領域を検出して認識統合部２１に認識結果を送る。次に、顔認識部１２は、対象物が人間の顔であるか否かを認識し、顔であると認識された場合には、その顔領域の画像を個人認識部１３に送ると共に、認識統合部２１に認識結果を送る。個人認識部１３は、入力される顔画像が誰であるか、即ち個人を特定すると共に、この結果を認識統合部２１へ送る。
【００４３】
対象物が人物の顔である場合の画像認識としては、このような肌色認識部１１から個人認識部１３までのカスケードな方法が考えられる。また個人の認識方法としては、例えば、個人顔が登録された登録顔データベース等を有し、登録顔と入力顔画像とを比較する等することで、入力顔が誰であるかを認識することができる。
【００４４】
また、対象物がボール等の物体である場合は、物体の色を認識する色認識部の上位に、物体の形状を認識する形状認識部、形状認識された物体の模様を認識する模様認識部等を設けることで、対象物が例えば人間の顔でない場合にも適用できることはいうまでもない。
【００４５】
また、音声方向認識部１４は、自分に対してどの方向から音声が聞こえたかを認識し、その認識方向を認識統合部２１へ供給するものであり、音声方向認識部１４の認識方法の一具体例は後述する。なお、ここでは対象物としての人間の声の方向を認識する音声方向認識手段としたが、後述するように、音声入力装置２５２からの入力音声に対して異なる認識レベルの認識器を複数設けてもよく、又は人間に関わらず、対象物が人物でないような場合、対象物が発する音の方向等を認識するものとしてもよい。
【００４６】
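The recognition integration unit 21 is constantly supplied with recognition results from the individual recognition unit 13, the face recognition unit 12, the skin-color recognition unit 11, and the voice direction recognition unit 14, and integrates them. Integration here means combining information such as "a face and skin color were recognized in the same region of the image, although it is not clear whose face it is." That is, the recognition units 14 and 15 each send, as their recognition result, information on whether recognition succeeded and, if it succeeded, the recognition information; when recognition information is received, the direction of the object is estimated from a predetermined one of those recognition results or from one or more of them.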
【００４７】
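In this robot device, the recognition integration unit 21 belongs to the middleware layer 50 shown in FIG. 22, described later. In practice, recognition results from the recognition units, for example that the individual recognition unit 13 has succeeded, are passed to the application layer 51 above it, and the robot device has a tracking control unit (not shown) that receives a command to start the tracking operation from the application layer 51 and controls the start of tracking. That is, a control signal to start tracking is input to the tracking control unit, which is provided inside or outside the recognition integration unit 21, and the tracking control unit controls a tracking motion generation unit, such as the neck control command generation unit described later, so that tracking starts.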
【００４８】
トラッキング開始の指令を受取ったトラッキング制御手段は、トラッキング対象物の認識するために、複数の認識部の認識結果のうち、いずれか１以上の所定の認識部を選択し、その認識結果に基づきトラッキングを行うよう制御することができる。例えば、上記所定の認識部として、例えば、トラッキング開始指令が入力された際に、人物の顔を認識する画像認識における最上位の個人認識部１３が成功している場合には、これを使用してトラッキングをスタートすることができ、この個人認識結果により、対象物となる個人顔の例えば重心等が入力画像の例えば中心にくるよう首制御コマンド生成部２２を制御する。
【００４９】
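When the individual recognition unit 13 fails at individual recognition, control is performed so that tracking continues using the recognition result of one of the other recognition units: the face recognition unit 12, the skin-color recognition unit 11, or the voice direction recognition unit 14. For example, the direction (position) of the person's face is estimated using the result of the face recognition unit 12, which is one level below the individual recognition unit 13. That is, if the person cannot be identified as an individual but the face recognition unit 12 succeeds and a face is recognized, that face is treated as belonging to the same individual, the individual is regarded as still being tracked, and the neck control command generation unit 22 is controlled so that the face region comes to the center of the input image. If the face recognition unit 12 also fails, the result of the skin-color recognition unit 11 is used, for example, and if the skin-color recognition unit 11 fails as well, the result of the voice direction recognition unit 14 is used and the neck control command generation unit 22 is controlled so that the front of the robot device faces the voice direction. Which recognition result the recognition integration unit 21 uses preferentially may be set in advance or chosen by the robot device as appropriate. For example, the result of the recognition unit whose reported position is closest to the position (direction) of the object immediately before the individual recognition unit 13 failed may be used.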
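The fallback order described above can be sketched in Python as follows. This is a minimal illustration, not the specification's implementation: the names `Result`, `PRIORITY`, and `select_source` are hypothetical, and the fixed priority order is only one of the options the text allows (it also permits choosing the recognizer closest to the last known direction).

```python
from dataclasses import dataclass
from typing import Optional

# Assumed fixed priority: individual > face > skin color > voice direction.
PRIORITY = ("individual", "face", "skin_color", "voice_direction")

@dataclass
class Result:
    ok: bool                           # whether this recognizer succeeded
    direction: Optional[float] = None  # bearing of the target in degrees, if known

def select_source(results: dict) -> Optional[str]:
    """Return the name of the highest-priority recognizer that succeeded, or None."""
    for name in PRIORITY:
        r = results.get(name)
        if r is not None and r.ok:
            return name
    return None
```

When `select_source` returns `None`, every recognizer has failed, which is the case handled by the prediction unit 31.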
【００５０】
予測部３１は、認識統合部２１の認識統合結果が供給され、各認識部の不安定さにより一時的に認識対象が認識できなくなった場合（認識に失敗した場合）、対象物の位置を予測するものであり、例えばいずれの認識部からの認識結果も失敗したような場合に、失敗する直前までの認識結果に基づき現在の対象物の位置（方向）を予測する。
【００５１】
そして、予測部３１は、例えば認識統合部２１から認識統合結果が常に供給され、上述のトラッキング制御部等により、対象物を認識できなくなった場合に、対象物の位置の予測を開始するよう指示される等、認識部１４、１５の認識の回復を一定時間待つなどの制御が行われる。又は、対象物が認識できなくなった場合に、認識統合部２１からその直前までの認識結果が供給され、対象物の位置を予測するよう指示されてもよい。
【００５２】
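The prediction unit 31 predicts the direction of the object from the recognition results obtained immediately before the object ceased to be recognized, and sends the predicted direction to the neck control command generation unit 22. As described above, the image-based recognition of the outside world that the robot device needs for tracking is often unstable: it is sensitive to lighting conditions and to the angle of a person's face, and when these change even slightly the image recognition unit 15 may fail at its various recognitions. When an object such as a ball moves a long way, it passes through unevenly lit areas, which makes recognition harder. Furthermore, a robot device capable of autonomous operation continually selects the action to express on the basis of its internal state and external stimuli, so when another action with a higher priority than the tracking action arises, the robot may interrupt tracking and allow that other action to be expressed. For example, while talking with person A the robot may be called by another person B, turn around, hold a short conversation with person B, and then try to continue the conversation with person A; in such cases tracking is stopped once and must then be started again. Although the position of person A can in principle be stored, tracking may not resume because of the instability of recognition if person A moves even slightly.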
【００５３】
このような場合においても、例えば対象物が動体であった場合は、直前の動き量から、現在の位置（方向）を予測して予測方向を求める。また、認識に失敗する直前の所定期間、対象物が静止していたと判断できるような場合は、直前の対象物の方向を予測位置とする。
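The two prediction cases described here (a stationary target and a moving target) can be sketched as below. This is an assumption-laden illustration: the specification does not state the prediction formula, so constant-velocity extrapolation from the last two samples is used, and `predict_direction`, `history`, and `still_eps` are hypothetical names.

```python
def predict_direction(history, now, still_eps=0.5):
    """Predict the target bearing (degrees) at time `now`.

    history: list of (time_s, direction_deg) samples, oldest first,
             ending just before recognition failed.
    still_eps: angular speed (deg/s) below which the target counts as stationary.
    """
    t_last, d_last = history[-1]
    if len(history) < 2:
        return d_last                      # no motion information: keep last direction
    t_prev, d_prev = history[-2]
    dt = t_last - t_prev
    if dt <= 0:
        return d_last
    velocity = (d_last - d_prev) / dt      # deg/s just before the failure
    if abs(velocity) < still_eps:
        return d_last                      # stationary: last direction is the prediction
    return d_last + velocity * (now - t_last)
```

For a target that moved from 10 degrees to 14 degrees over one second, the prediction half a second after the failure is 16 degrees.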
【００５４】
そして、首制御コマンド生成部２２は、認識統合部２１又は予測部３１からの制御情報に基づき首制御コマンドを生成し、これを出力部２３を介して出力する。具体的には、上述の図２に示す頭部ユニット３を支持する首関節ヨー軸１０１、首関節ピッチ軸１０２、首関節ロール軸１０３からなる首関節を回転させる回転角度を算出し、出力部２３を介して出力することで、該当のモータ等を制御し、対象物の動きに合わせてロボット装置の頭部ユニットを回転させる等し、ロボット装置にトラッキングを行わせる。
【００５５】
なお、ロボット装置１は、実際には、後述する図２３に示すアプリケーション・レイヤ５１は、行動が記述された複数の行動記述モジュールからなる行動記述部（図示せず）と、これらの行動を選択する行動選択部（図示せず）とを有し、行動選択部は、外部の情報である認識統合結果を行動選択基準の一つとして捉え、複数の行動記述モジュールのうちトラッキングを行う行動を選択するようなされており、行動記述部から出力される行動が物体のトラッキングを伴うものである場合、対象物を指定して、ロボット装置の首、目、足等のロボット装置の少なくとも一部を使ってトラッキングするよう認識統合部２１に指令を出力し、これを受けた認識統合部２１が、例えば首制御コマンド生成部２２へ所定の首制御コマンドを生成する指令を出力する。
【００５６】
また、ここでは、認識統合部２１が首制御コマンド生成部２２にロボット装置の首関節を回転させてトラッキングを行う例をとって説明するが、例えば対象物がロボット装置の首関節の回転角度の限界より大きく移動したりする場合、首関節のみでなく、例えば図２に示す体幹部ユニット２の体幹ピッチ軸１０４、体幹ロール軸１０５、又は体幹ヨー軸１０６を制御したり、脚部ユニットを制御してロボット自身が移動しながらトラッキングを行ってもよい。
【００５７】
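The prediction unit 31 has been described as predicting the direction of the object when all recognition units have failed, but part of the processing of the recognition integration unit 21 described above may instead be performed by the prediction unit 31. That is, when the higher-level individual recognition unit 13 fails, the processing for continuing tracking using the results of the lower-level face recognition unit 12 or the voice direction recognition unit 14 may be performed by the prediction unit 31.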
【００５８】
以下、本実施の形態について更に詳細に説明する。本トラッキングシステムは、対象物の認識に失敗した場合に対象物の位置（方向）を予測する予測機能、対象物の認識に失敗したときに認識レベル又は認識器の種類を切替えトラッキングを続行する機能、及び対象物の認識に失敗している期間（時間）が所定の閾値以上となった場合はトラッキングを中断する機能等を有する。以下、図５乃至図９を参照してこれらの機能について詳細に説明するが、理解を容易にするため、いずれの場合においても、トラッキングのスタート時には、個人認識部１３の個人認識結果を使用しているものとし、同一の構成要素には同一の符号を付してその詳細な説明は省略する。
【００５９】
なお、上述したように、本実施の形態は、複数の認識部を有し、いずれか所定の認識部の認識結果を使用してトラッキングをスタートさせた場合に、該所定の認識部の認識に失敗した場合であっても、他の認識部の認識結果を使用したり、失敗する直前の認識結果から対象物の現在の位置（方向）を予測して、ロバストなトラッキングを行うことを可能とするものであり、個人認識部１３以外の認識部の認識結果をトラッキングのスタートの際に使用してもよい。
【００６０】
（１−１）トラッキング対象物の予測
先ず、複数種及び／又は複数に階層化された認識部のいずれの認識も失敗した際でもトラッキングを継続する機能について説明する。図５は、トラッキングシステムを示すブロック図であり、ここでは、認識器として例えば異なる認識レベルに階層化された認識器１５を有する場合とする。先ず、人物が目の前に現れたことから、行動選択部が対話行動を選択し、人間と対話を開始したとする。その際、例えばその人物（対象物）の個人認識に成功し、上述の行動選択部及び認識統合部２１にその個人認識結果が送られて、該個人に対するトラッキングが開始される。トラッキングがスタートすると、首制御コマンド生成部２２に認識結果が送られ、この認識結果を基に、首制御コマンド生成部２２が人物の顔が例えばロボット装置の視野の中心にくるように首を制御することでトラッキングを実施する。
【００６１】
また、上述したように、首の回転角を制御してトラッキングを実施するだけでなく、人物が大きく動いた場合には足を使って体を回転させて、その人物を追尾するようにしたり、上半身のみ回転させるようにしたりしてもよいことはもちろんである。
【００６２】
そして、例えばトラッキング中に、顔の角度が変わったりあるいは照明状況が変化したことにより、個人認識ができなくなるか、別の個人として認識されたとする。即ち、同一個人の認識に失敗した場合、認識不可を知らせる情報が予測部３１に供給される。予測部３１は、認識不可になるタイミングの直前の人物の位置（方向）の情報を有し、直前のタイミングにおける人物が静止していた場合は、直前の人物の位置を予測方向とする。
【００６３】
又は、トラッキング対象の人物が動いていた場合は、直前までの情報から動き量を検出し、現在の位置を予測する。この場合、画像認識においては、認識不可となるまでの動きベクトルを求め、これに基づき、認識可能であった時刻から現在の時刻までの間に動いた位置を予測すればよい。なお、後述する音声認識においても、音声の認識方向が時間と共に変化している場合は、その変化量から現在の方向を予測すればよい。
【００６４】
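Note, however, that the prediction unit 31 continues tracking only temporarily using the results of recognition units other than the individual recognition unit 13; once the individual recognition unit 13 succeeds again, its result is used. If the individual recognition unit 13 still has not succeeded after a predetermined time or more has elapsed since the results of other recognition units began to be substituted, or if the target is still recognized as a different individual after that time, tracking is ended or interrupted.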
【００６５】
（１−２）異なる認識レベルの認識器を有する場合
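Next, the use of the recognizers is described for the case of image recognizers with different recognition levels, for example the skin-color recognition unit 11, the face recognition unit 12, and the individual recognition unit 13. As described above, the recognizer 15 is layered according to recognition level. That is, as shown in FIG. 6, it is divided into multiple layers, from the lowest level such as the skin-color recognition unit 11, through the face recognition unit 12, up to the highest level such as the individual recognition unit 13, and the recognition integration unit 21 controls operation so that a lower-level result is used when a higher-level recognition unit fails.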
【００６６】
個人顔が認識されてトラッキングがスタートした場合、認識統合部２１には、常に個人認識部１３からの認識結果だけではなく、肌色認識部１１、顔認識部１２の認識結果も供給されている。なお、肌色認識部１１及び顔認識部１２は、それぞれ上位の認識器に自身の結果を送らないものとするが、例えば個人認識部１３は、顔認識部１２の顔認識結果を使用して個人認識する等、上位の認識器は下位の認識器の結果を使用してもよい。
【００６７】
ここで、ある時刻Ｔ１において、個人認識に失敗したとする。認識統合部２１には、各認識器から認識の成否、成功した場合は認識情報が認識結果として送られており、認識統合部２１は、時刻Ｔ１において、個人認識失敗の情報を得るとする。
【００６８】
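In that case, if the face recognition unit 12 below the individual recognition unit 13 succeeds and a face is recognized at the same position, the prediction unit 31 predicts that the person is still there and tracks the face region (regardless of whose face it is). If the face recognition unit 12 also fails but a skin-color region is recognized at the same position, that region is tracked. As described above, continuing tracking with the results of other recognizers in this way may be done by the recognition integration unit 21, or by the prediction unit 31 as shown in FIG. 6.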
【００６９】
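However, when the individual recognition unit 13 fails and the result of a lower-layer recognition unit (another recognition unit) is used, if the position (direction) of the object that had been recognized by the individual recognition unit 13 and the direction (position) of the object recognized by the lower-layer recognition unit are separated by a predetermined amount or more, the wrong object is likely being recognized, so tracking is ended or interrupted.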
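The separation check can be sketched as a comparison of two bearings. The function names and the 20-degree default threshold are assumptions made for illustration; the specification only says "a predetermined amount".

```python
def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def may_substitute(last_individual_deg, fallback_deg, max_gap_deg=20.0):
    """True if the lower-level result is close enough to the last individual-recognition
    direction to be treated as the same target; False means end or interrupt tracking."""
    return angular_gap(last_individual_deg, fallback_deg) < max_gap_deg
```

The modulo arithmetic makes bearings on either side of the 0/360 boundary compare correctly, so 350 degrees and 10 degrees are 20 degrees apart rather than 340.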
【００７０】
また、本実施の形態は、トラッキングのスタートを、人物の顔を認識する認識器の最上位の個人認識部１３の認識結果を使用して行うものとしているが、トラッキングのスタート指令が出された場合に、認識統合部２１に例えば肌色認識部１１のみしか認識に成功していない場合、肌色認識部１１の認識結果を基にトラッキングを行う。そして、トラッキングの途中で顔認識部１２、個人認識部１３の認識が成功した場合、上述とは逆に認識レベルを上げてもよい。
【００７１】
（１−３）異なる認識種別の認識器を有する場合
次に、図６に示したように階層化された画像認識部とは異なる種別の音声認識部１４を備えるトラッキングシステムにおける認識器の使用方法の一例について説明する。例えば、図６に示すような階層化された複数の画像認識部の認識が全て失敗した場合、即ち、例えば照明条件の変化等により入力画像が得られなくなったりする場合がある。このような場合、異なる種別の認識器を備えることによりトラッキングを続行することができる。
【００７２】
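As shown in FIG. 7, suppose that at a time T3 individual recognition, face recognition, and skin-color recognition of the person being tracked all become impossible, and that a voice is input from the voice input device 252. Even though all image recognition has failed, if a result is available from the voice direction recognition unit 14, it can be interpreted (predicted) that the person is still present in that voice direction. Thus, even when all image recognition has failed, the prediction unit 31 uses the result of the voice direction recognition unit 14, takes that direction as the predicted direction of the object, and continues tracking for a certain time, which further increases the robustness of tracking. Naturally, if the person being tracked still cannot be recognized by the individual recognition unit 13 after tracking by the prediction unit 31 has continued for a certain time or longer, tracking is interrupted or ended. As described above, this prevents the robot from continuing to track when the person has actually left, or from continuing to track the wrong person. Also, if only the voice direction recognition unit 14 has succeeded when the tracking start command is supplied, the result of the voice direction recognition unit 14 may be used to start tracking.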
【００７３】
また、ここでは、音声方向を認識する音声方向認識部１４のみを記載したが、例えば音声認識部においても、画像認識部のように、階層化された構造としてもよい。即ち、音声方向認識部１４の上層に、音声を発している人物が誰であるかを特定する話者認識部を設け、話者認識部の更に上層に、人物がどのような内容を話しているかを認識する音声内容認識部等を設けてもよい。話者認識は、例えば登録話者データベースを用意し、入力音声と登録音声データベースのデータとを比較することで話者を特定することができる。また、音声内容認識部においては、例えば、複数の単語又は短文が登録されたデータベースを用意し、ロボット装置の呼びかけに対し、所定のパターンの返答があるか否かをそのデータベースのデータと比較し、ユーザとの対話ができているか否か等を判定する等の方法がある。
【００７４】
（１−４）割り込み動作が入った後のトラッキング
次に、ある行動を実行中に別の行動が割り込むことを行動選択実行部が許容している場合について説明する。後述するように、本ロボット装置は、内部状態及び外部刺激に応じて自律的に行動を選択して発現することができ、従って、トラッキング動作よりも実行優先度が高いような動作が存在する場合は、トラッキング動作を中断し、他の割り込み動作を行うようにすることもできる。
【００７５】
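For example, while the robot device is talking with person A, that is, while tracking person A, it may be called from behind by another person B, turn around, hold some conversation with person B, and then return to the original conversation partner A. In such a case, as shown in FIG. 8, tracking of person A is interrupted by the interrupting action (another action) 32 of talking with person B. To resume the conversation with the original partner A, that is, to track person A again, the recognition integration unit 21 or the prediction unit 31 has storage means (not shown) that stores the position (direction) of person A at the time of interruption. When tracking is to resume, the recognition integration unit 21 and/or the prediction unit 31 is notified that the interrupting action 32 has ended and is instructed to resume tracking. Tracking of the person is then restored by the same method as described above.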
【００７６】
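That is, if person A was stationary until tracking was interrupted, the person is predicted to still be at that position; if movement of person A had been detected, the current position is predicted from the time of interruption and the detected motion vector. If any recognition unit recognizes the object in the predicted direction, tracking can be resumed. As described above, tracking may be left unresumed when the predicted direction and the position (direction) of the recognized object differ greatly. In addition, if individual recognition does not succeed within a predetermined time after tracking is resumed, tracking is ended.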
【００７７】
（１−５）トラッキングの終了条件
また、本トラッキングシステムは、複数の認識器の認識結果を使用しロバストなトラッキングを行うものであるが、認識レベルが低い認識器の認識結果を使用し続けたり、認識結果を基に認識結果が得られていない場合にも対象物方向を予測し続けたりすることで、間違った対象物をトラッキングすることを防止するため、一定条件でトラッキングを終了又は中断する機能を有している。
【００７８】
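For example, as shown in FIG. 9, when tracking has started from the recognition of the individual recognition unit 13 and the individual recognition unit 13 then fails, the results of other recognition units such as the face recognition unit 12, the skin-color recognition unit 11, and the voice direction recognition unit 14 are used. When no recognition result at all can be obtained, the prediction unit 31, as described above, predicts the person's position from the person's movement immediately before the last successful recognition unit failed, and tracking is performed by moving the neck toward the predicted direction. If, during that tracking, the person, a face, or skin color reappears near the predicted position (direction), or a voice direction is recognized near the predicted direction, tracking resumes with that as the target. However, if the state in which the face cannot be recognized as that of the individual for whom tracking was first started persists beyond a certain time, tracking is stopped even if a face or skin color is visible. In other words, tracking based only on the results of recognizers below the level used at the start of tracking, and the (pseudo-tracking) operation of appearing to track the object by prediction from recognition results, are not continued beyond a predetermined time.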
【００７９】
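For this purpose, a timer 33 is provided, for example between the recognition integration unit 21 and the prediction unit 31, and tracking is ended when a predetermined time has elapsed. As described above, this prevents the robot, when the individual goes unrecognized for a long time, from mistakenly continuing a dialogue with a completely different person or from continuing to track a completely different object as if it were the face.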
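The role of the timer 33 can be sketched as a small class that measures how long individual recognition has been failing. `FallbackTimer`, `limit_s`, and `update` are hypothetical names, and the caller is assumed to supply a monotonic timestamp in seconds; the specification does not describe the timer's interface.

```python
class FallbackTimer:
    """Tracks how long tracking has run without a successful individual recognition."""

    def __init__(self, limit_s):
        self.limit_s = limit_s      # the "predetermined time" of the text
        self.lost_since = None      # time at which individual recognition first failed

    def update(self, individual_ok, now):
        """Return True while tracking may continue, False once it should end."""
        if individual_ok:
            self.lost_since = None  # recognition recovered: reset the timer
            return True
        if self.lost_since is None:
            self.lost_since = now   # first failure: start timing
        return (now - self.lost_since) < self.limit_s
```

Calling `update` once per recognition cycle keeps the fallback and prediction logic independent of the stop condition: lower-level results may be used freely while `update` returns `True`.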
【００８０】
このように構成された本実施の形態のトラッキングシステムにおいては、認識部をその認識レベルが異なる階層構造とし、これを利用し上位認識系が認識に失敗しても下位の認識結果を用いたり、画像認識や音声認識等の種別が異なる認識器の認識結果を用いたりし、様々な認識器の認識結果を複合的に組み合わせて使用することで、割り込み動作によりトラッキングが中断された場合でも、また照明等の変化により認識系の不安定さに対してもロバストなトラッキングを行うことができる。
【００８１】
次に、上述のトラッキングシステムに好適な音源方向認識方法及び動体検出方法の一具体例について説明する。
【００８２】
（２−１）具体例：音源方向認識方法
以下に、音声方向認識部１４の音声方向認識方法（推定方法）の一具体例について詳細に説明する。ここで、上述した図２に示した関節自由度は、より生物感を高めるために可動範囲が制限されているものとする。このため、図２の首関節ヨー軸１０１の可動範囲外から音声が入力された場合には、首や体幹を協調させて回転し、音源方向に振り向く動作を行うことが必要となる。
【００８３】
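The robot device 1 in this specific example therefore turns toward the sound source as shown in FIGS. 10(a) to 10(f). That is, when the robot device 1 is facing the right side of the figure as in FIG. 10(a) and a voice is input from behind, it rotates its neck and uses its legs to rotate its trunk, turning toward the sound source as in FIGS. 10(b) to 10(f).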
【００８４】
このような音源方向への振り向き動作の一例について、図１１のフローチャートを用いて説明する。先ずステップＳ１において、図４に示す音声入力装置２５２の有するマイクロホンに所定の閾値以上の振幅の音が入力されることにより、音イベントが発生したことが検出される。
【００８５】
次にステップＳ２において、入力された音イベントの音源方向の推定が行われる。ここで、上述したように、音声入力装置２５２には、複数のマイクロホンが備えられており、ロボット装置１は、この複数のマイクロホンを用いて音源方向を推定することができる。具体的には、例えば「大賀、山崎、金田『音響システムとディジタル処理』（電子情報通信学会）ｐ１９７」に記載されているように、音源方向と複数のマイクロホンで受音した信号の時間差とに一対一の関係があることを利用して音源方向を推定することができる。
【００８６】
すなわち、図１２に示すように、θＳ方向から到来する平面波を、距離ｄだけ離れて設置された２つのマイクロホンＭ１，Ｍ２で受音する場合、各マイクロホンＭ１，Ｍ２の受音信号ｘ１(ｔ)とｘ２(ｔ)との間には、以下の式（１）、（２）に示すような関係が成立する。ここで、式（１）、（２）において、ｃは音速であり、τＳは２つのマイクロホンＭ１，Ｍ２で受音した信号の時間差である。
【００８７】
【数１】

【００８８】
従って、受音信号ｘ１(ｔ)とｘ２(ｔ)との間の時間差τＳが分かれば、以下の式（３）により、音波の到来方向、すなわち音源方向を求めることができる。
【００８９】
【数２】

【００９０】
ここで、時間差τＳは、以下の式（４）に示すような、受音信号ｘ１(ｔ)とｘ２(ｔ)との間の相互相関関数φ１２(τ)から求めることができる。ここで、式（４）において、Ｅ[・]は期待値である。
【００９１】
【数３】

【００９２】
上述した式（１）と式（４）とから、相互相関関数φ１２(τ)は、以下の式（５）のように表される。ここで、式（５）において、φ１１(τ)は受音信号ｘ１(ｔ)の自己相関関数である。
【００９３】
【数４】

【００９４】
この自己相関関数φ１１(τ)は、τ＝０で最大値をとることが知られているため、式（５）より相互相関関数φ１２(τ)は、τ＝τＳで最大値をとる。したがって、相互相関関数φ１２(τ)を計算して、最大値を与えるτを求めればτＳが得られ、それを上述した式（３）に代入することにより、音波の到来方向、すなわち音源方向を求めることができる。
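The formula images for equations (1) to (5) are not reproduced in this text. One set of expressions consistent with the surrounding description is given below as an assumed reconstruction, not as the original drawings; in particular, which microphone leads and the sign convention of the angle θ_S depend on FIG. 12 and are assumptions here.

```latex
\begin{align}
x_2(t) &= x_1(t - \tau_S) \tag{1}\\
\tau_S &= \frac{d \sin\theta_S}{c} \tag{2}\\
\theta_S &= \sin^{-1}\!\left(\frac{c\,\tau_S}{d}\right) \tag{3}\\
\phi_{12}(\tau) &= E\left[x_1(t)\,x_2(t+\tau)\right] \tag{4}\\
\phi_{12}(\tau) &= E\left[x_1(t)\,x_1(t+\tau-\tau_S)\right] = \phi_{11}(\tau-\tau_S) \tag{5}
\end{align}
```

With these forms, substituting (1) into (4) gives (5) directly, and because φ_11 peaks at zero lag, φ_12 peaks at τ = τ_S, as the text states.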
【００９５】
なお、上述した音源方向の推定手法は一例であり、この例に限定されるものではない。
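The cross-correlation procedure described above can be sketched in plain Python. This is an illustration under stated assumptions: the expectation is replaced by a finite sum over samples, the sign convention is that a positive lag means the second microphone receives the sound later, and `estimate_direction`, `fs` (sampling rate in Hz), and the default sound speed of 343 m/s are choices made here rather than values from the specification.

```python
import math

def estimate_direction(x1, x2, fs, d, c=343.0):
    """Estimate the arrival angle (degrees) from two equal-length microphone signals.

    x1, x2: sample sequences from microphones M1 and M2
    fs: sampling rate in Hz, d: microphone spacing in metres, c: speed of sound in m/s
    """
    n = len(x1)
    max_lag = int(math.floor(d / c * fs))   # only physically possible delays
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Finite-sum estimate of phi12(lag) = E[x1(t) * x2(t + lag)]
        val = sum(x1[t] * x2[t + lag]
                  for t in range(max(0, -lag), min(n, n - lag)))
        if val > best_val:
            best_lag, best_val = lag, val
    tau = best_lag / fs                      # delay that maximises the correlation
    s = max(-1.0, min(1.0, c * tau / d))     # clamp before asin to absorb rounding
    return math.degrees(math.asin(s))
```

Restricting the search to lags no larger than d/c avoids spurious peaks at delays that no real source direction could produce.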
【００９６】
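Returning to FIG. 11, in step S3 the difference between the direction the robot device 1 is currently facing and the direction of the sound source is calculated, giving the angle of the sound source relative to the orientation of the trunk.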
【００９７】
続いてステップＳ４では、図２に示した首関節ヨー軸１０１の可動範囲と、脚部を使って体幹を回転させる際に、一度の回転動作で回転できる最大角度などの制約に基づき、ステップＳ３で計算された相対角度分だけ頭部を回転させるのに必要な首関節と体幹の回転角度を決定する。ここで、音源方向によっては、首関節のみの回転角度が決定される。なお、ロボット装置１は、図２に示したように体幹ヨー軸１０６を有しているが、簡単のため、本具体例ではこの体幹ヨー軸１０６を利用しないものとして説明する。しかし、首、腰、足の接地方向を利用し、全身を協調させて音源方向を振り向くことができることは勿論である。
【００９８】
具体的に図１３を用いて説明する。図１３（ａ）は、ロボット装置１の首の可動範囲を±Ｙ度とし、音源Ｓの方向の相対角度がロボット装置１の正面方向に対してＸ度方向である場合の例である。この場合、ロボット装置１が音源Ｓの方向に振り向くためには、図１３（ｂ）に示すように、最低でもＸ−Ｙ度だけ体幹全体を脚部を使って回転させると共に、首関節ヨー軸１０１をＹ度だけ音源Ｓの方向に回転させる必要がある。
【００９９】
次にステップＳ５では、ステップＳ４で得られた角度を回転させるのに必要な各関節の制御情報を立案、実行し、音源方向に振り向く。
【０１００】
続いてステップＳ６において、音源方向に正対する必要があるか否かが判別される。ステップＳ６において、例えば、音イベントが単なる物音などの雑音の場合には、正対する必要がないとしてステップＳ７に進んで、元々向いていた方向に体幹及び首の向きを戻し、一連の動作を終了する。一方、ステップＳ６において、例えば、画像入力装置２５１（図３）の情報からロボット装置１が予め学習して知っている人の顔を発見し、その人に呼び掛けられたと判断した場合などには、その向きに正対する制御を行なうためにステップＳ８に進む。
【０１０１】
ここで、人間の顔を検出する手段としては、例えば「E.Osuna, R.Freund and F.Girosi:“Training support vector machines:an application to face detection”, CVPR'97, 1997」に記載されているような手法で実現することが可能である。また、特定の人の顔を認識する手段としては、例えば「B.Moghaddam and A.Pentland:“Probabilistic Visual Learning for Object Representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.19, No.7, July 1997」に記載されているような手法で実現することが可能である。
【０１０２】
ステップＳ８では、正対するために必要な体幹及び首の回転角度が計算される。例えば上述した図１３（ｂ）に示すように、現在のロボット装置１の姿勢において首関節ヨー軸１０１がＹ度回転している場合、すなわち体幹に対して頭部がＹ度回転している場合には、図１３（ｃ）に示すように、体幹をＹ度回転させると同時に首関節ヨー軸１０１を−Ｙ度回転させることによって、対象オブジェクトを注視したまま首の捻れを解消し、自然な動作で音源Ｓの方向に正対することが可能となる。
【０１０３】
最後にステップＳ９では、ステップＳ８で計算した動作を実行し、音源方向に正対する。
【０１０４】
ロボット装置１は、以上のようにして音源方向を推定し、全身を協調させて自然な動作により音源方向を振り向くことができる。
【０１０５】
また、ロボット装置１は、音イベントの内容によっては、対象オブジェクトを注視したまま首の捻れを解消し、自然な動作で音源方向に正対する。特に、人間から呼びかけられたような場合には、その呼びかけを行った者の方向に顔を向けて正対することにより、人間との親密性を高めることができる。
【０１０６】
なお、以上の動作は、上述した図３に示す思考制御モジュール２００から指示され、運動制御モジュール３００によって各アクチュエータ３５０が制御されることにより実現される。
【０１０７】
ここで、体幹の向きに対する音源方向の相対角度を求め、実際にその方向に振り向いた場合に、対象オブジェクトを認識できないような状況が考えられる。すなわち例えば、音源方向の推定誤差により、振り向いた方向の視野角内に対象オブジェクトがない場合や、音源方向が正しくても対象オブジェクトまでの距離が遠い場合には、対象オブジェクトを認識することができない。
【０１０８】
そこで、本具体例におけるロボット装置１は、以下のようにしてこのような問題を解消することができる。この振り向き動作の一例を図１４のフローチャートを用いて説明する。
【０１０９】
図１４に示すように、先ず、ステップＳ１０において、図３に示す音声入力装置２５２の有するマイクロホンに所定の閾値以上の振幅の音が入力されることにより、音イベントが発生したことが検出される。
【０１１０】
次にステップＳ１１において、入力された音イベントの音源方向の推定が行われる。
【０１１１】
続いてステップＳ１２では、現在ロボット装置１が向いている方向と音源の方向との差が計算され、体幹の向きに対する音源方向の相対角度が求められる。
【０１１２】
続いてステップＳ１３では、図２に示した首関節ヨー軸１０１の可動範囲と、脚部を使って体幹を回転させる際に、一度の回転動作で回転できる最大角度などの制約に基づき、ステップＳ１２で計算された相対角度分だけ頭部を回転させるのに必要な首関節と体幹の回転角度を決定する。この際、可動範囲の限界まで首を回転させるのではなく、振り向いた後にさらに首を左右に振れるように、多少の余裕を持たせるようにする。
【０１１３】
すなわち、図１５（ａ）に示すように、ロボット装置１の首の可動範囲を±Ｙ度とし、音源Ｓの方向の相対角度がロボット装置１の正面方向に対してＸ度方向である場合、Ｚ度分だけ余裕を持たせ、図１５（ｂ）のように体幹全体を脚部によりＸ−(Ｙ−Ｚ)度だけ回転させると共に、首関節ヨー軸１０１をＹ−Ｚ度回転させる。これにより、音源Ｓの方向に振り向いた後にさらに首を左右に振ることが可能とされる。
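The split of a turn between trunk and neck can be sketched as follows. The function name and signature are hypothetical; setting `margin_deg` to 0 reproduces the earlier case without a margin, and the sketch assumes the margin Z does not exceed the neck limit Y.

```python
def plan_turn(x_deg, neck_limit_deg, margin_deg=0.0):
    """Split a turn of x_deg toward the sound source into (trunk_deg, neck_deg).

    neck_limit_deg: neck yaw range +/-Y; margin_deg: reserve Z kept for looking around.
    """
    usable = neck_limit_deg - margin_deg          # Y - Z
    sign = 1.0 if x_deg >= 0 else -1.0
    neck = sign * min(abs(x_deg), usable)         # neck covers as much as it may
    trunk = x_deg - neck                          # X - (Y - Z) when the neck saturates
    return trunk, neck
```

For X = 150, Y = 60, and Z = 10, the trunk turns 100 degrees and the neck 50 degrees, leaving 10 degrees of neck travel for a subsequent search.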
【０１１４】
図１４に戻ってステップＳ１４では、ステップＳ１３で得られた角度を回転させるのに必要な各関節の制御情報を立案、実行し、音源方向に振り向く。
【０１１５】
ステップＳ１５では、音源方向に対象オブジェクトを認識できたか否かが判別される。ステップＳ１５において、例えば上述したような顔検出や顔認識により、予め学習して知っている人の顔が見つかった場合などには、音源方向に対象オブジェクトを認識することができたとして、ステップＳ１６に進む。
【０１１６】
ステップＳ１６では、認識された対象オブジェクトをトラッキングの対象とし、以後オブジェクトが動くのに合わせて首や体幹の向きを変化させて、対象オブジェクトのトラッキングを行うようにし、一連の動作を終了する。
【０１１７】
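If, on the other hand, the target object could not be recognized in step S15, processing proceeds to step S17, where it is determined whether the sound event was a voice. Specifically, voice and non-voice are statistically modeled by, for example, the HMM (Hidden Markov Model) method, and whether the sound event was a voice can be determined by comparing their likelihoods. If it is determined in step S17 that the sound was not a voice, the sound event is judged to have come from something that need not be recognized, such as a door closing or an incidental noise, and the series of operations ends. If it is determined in step S17 that the sound was a voice, processing proceeds to step S18.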
【０１１８】
ステップＳ１８では、音源が近いか否かが判別される。具体的には、例えば文献「F.Asano, H.Asoh and T.Matsui, “Sound Source Localization and Separation in Near Field”, IEICE Trans. Fundamental, Vol.E83-A, No.11, 2000」に記載されているような手法で音源までの推定距離を計算することにより、大まかに推定することが可能である。ステップＳ１８において、音源までの距離が画像入力装置２５１や対象オブジェクト認識手段の性能によって、対象オブジェクトを認識することが困難なほど遠い場合には、続くステップＳ１９で対象オブジェクトを認識可能な距離までロボット装置１自体を音源方向に歩行させ、対象オブジェクトの認識精度を確保する。また、ステップＳ１８において、音源までの距離が近い場合には、歩行を行なわずにステップＳ２１に進む。
【０１１９】
ステップＳ２０では、対象オブジェクトが認識できるか否かが再度判別される。ステップＳ２０において対象オブジェクトを認識できた場合には、ステップＳ１６に戻ってトラッキング処理に移行し、一連の動作を終了する。ステップＳ２０において対象オブジェクトを認識できなかった場合には、ステップＳ２１に進む。
【０１２０】
ステップＳ２１では、音源方向の推定に誤差があることを想定して、首関節ピッチ軸１０２及び首関節ヨー軸１０１を回転させることによって頭部を上下左右に振る。
【０１２１】
続いてステップＳ２２では、ステップＳ２１で頭部を上下左右に振った結果、対象オブジェクトを認識できたか否かが判別される。ステップＳ２２において対象オブジェクトを認識できた場合には、ステップＳ１６に戻ってトラッキング処理に移行し、一連の動作を終了する。ステップＳ２２において対象オブジェクトを認識できなかった場合には、音源方向推定が大きく誤った場合が想定されるため、ステップＳ２３においてその旨を出力し、一連の動作を終了する。具体的には、対象オブジェクトが人間である場合には、「どこにいるのか分かりません。もう一度喋ってもらえますか？」といった音声を出力し、音声を再入力してもらえるように依頼することによって、再びこの一連の処理を実行することが可能となる。
【０１２２】
以上のように、ロボット装置１が音源方向に近づいたり、顔を左右に振ることによって、音源方向の推定誤差により振り向いた方向の視野角内に対象オブジェクトがない場合や、音源方向が正しくても対象オブジェクトまでの距離が遠い場合であっても、対象オブジェクトを認識することが可能となる。特に、推定された音源方向に振り向いた後にも、さらに頭部を左右に振れるように首の回転角度が設定されるため、自然な動作により対象オブジェクトをトラッキングすることができる。
【０１２３】
なお、上述の例では音源の距離推定を行い音源方向に近づいた後に顔を振る動作を行うものとして説明したが、これに限定されるものではなく、例えば、音源までの距離の推定精度が音源方向の推定精度と比較して著しく低いような場合には、音源方向に近づく前に顔を振る動作を行うようにしても構わない。
【０１２４】
また、上述の例では、対象オブジェクトを認識可能な距離までロボット装置１自体を音源方向に歩行させて、対象オブジェクトを認識できたか否かを再度確認するものとして説明したが、これに限定されるものではなく、例えば５０ｃｍなど、所定の距離だけ音源方向に近づいた後に、対象オブジェクトを認識できたか否かを再度確認するようにしても構わない。
【０１２５】
さらに、上述の例では、対象オブジェクトを認識する手段として、顔検出、顔認識を用いるものとして説明したが、これに限定されるものではなく、特定の色や形状のものを認識するようにしても構わない。
【０１２６】
（２−２）具体例：動体検出方法
次に、認識統合部２１において、画像認識結果から動体を検出をする方法であって、特に、撮像した画像内の動体を簡易な手法により検出し、トラッキング可能とする動体検出方法の一具体例について説明する。上述したように、本具体例におけるロボット装置１は、図３に示すＣＣＤカメラ等の画像入力装置２５１により撮像した画像データ内の人物の顔の肌色領域、顔（個人顔も含む）領域を認識し、更に認識した肌色又は個人顔領域が動体である場合は、それらの認識結果を基に、例えば図４に示す認識統合部２１に設けられた動体検出部（図示せず）において、動体を検出し、この検出結果に基づき首制御コマンド生成部２２や、脚部ユニット等を制御し、検出した動体の方向を向く、或いはトラッキングするなどといった行動を行うことができる。
【０１２７】
ここで、認識統合部２１の動体検出部がフレーム間の差分画像を生成した場合、動体の動きが停止した時点で差分値が０となる。例えば、図１６に示すように、それぞれ時刻ｔ１〜ｔ４における人間を撮像した画像データＰ１〜Ｐ４について差分画像データＤ１〜Ｄ３を生成した場合、時刻ｔ３及びｔ４間で顔が静止していると、差分画像データＤ３から顔の差分データが消失してしまう。つまり、差分画像データから動体が消失したということは、動体がその場から消失したのではなく、消えた場所に動体が存在するということを意味している。
【０１２８】
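The robot device 1 in this specific example therefore detects the point at which this difference becomes 0 and either faces the centroid position of the immediately preceding difference image by pointing the image input device 251, such as a CCD camera, in that direction, or approaches that direction by controlling the leg units. That is, as shown in the flowchart of FIG. 17, first, in step S21, a moving body is detected by calculating the centroid position of the difference image data, and in step S22 it is determined whether the detected moving body has disappeared from the difference image data. If the moving body has not disappeared in step S22 (No), processing returns to step S21. If the moving body has disappeared in step S22 (Yes), processing proceeds to step S23, and the robot faces, or approaches, the direction in which it disappeared, that is, the centroid position in the immediately preceding difference image.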
【０１２９】
なお、検出した動体がロボット装置１の視覚範囲から外れた場合にも差分画像から動体が消失するが、この場合にも上述のステップＳ２３において最後に検出された重心位置の方向を向くことで、ほぼ動体の方向を向くことができる。
【０１３０】
このように、本具体例におけるロボット装置１は、視覚範囲内で動体が静止したことにより差分画像データから消失するタイミングを検出し、その重心位置の方向を向くようにすることで、例えば人間等の動体の気配を感じてその方向を向くという自律的なインタラクションを実現できる。また、予測部３１は、動体が視覚範囲から外れたことにより差分画像データから消失するのを検出し、最後に検出された重心位置の方向を向くようにすることで、ほぼ動体の方向を向くことができる。
【０１３１】
また、ロボット装置１は、差分画像データから動体が消失した場合のみならず、所定の時間間隔毎、或いは動体の重心位置が視覚範囲から外れそうになる毎に検出された重心方向を向き、動体をトラッキングするようにしても構わない。すなわち、図１８のフローチャートに示すように、先ずステップＳ３０において、差分画像データの重心位置を計算することで動体を検出し、ステップＳ３１において、所定の時間間隔毎、或いは動体が視覚範囲から外れそうになる毎に検出された重心位置の方向を向く。
【０１３２】
ここで、ロボット装置１は、前述のように差分画像データから動体が消失した場合の他、ステップＳ３１におけるロボット装置１の動きが大きい場合には、動き補償によって自己の動きと動体の動きとを区別することができなくなり、動体を見失ってしまう。そこでステップＳ３２において、動体を見失ったか否かが判別される。ステップＳ３２において動体を見失っていない場合（No）にはステップＳ３０に戻る。一方、ステップＳ３２において動体を見失った場合（Yes）にはステップＳ３３に進み、最後に検出された重心位置の方向を向く。
【０１３３】
このように、本具体例におけるロボット装置１は、所定の時間間隔毎、或いは動体が視覚範囲から外れそうになる毎に検出された重心方向を向き、動体を見失った場合に最後に検出された重心位置の方向を向くようにすることで、頭部ユニットに設けられた画像入力装置２５１によって撮像した画像内の動体を簡易な手法により検出し、トラッキングすることが可能となる。
【０１３４】
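The recognition system 71 of the middleware layer 50 shown in FIG. 22, described later, includes the recognition units 15 and the recognition integration unit 21 shown in FIG. 4, and the recognition integration unit 21 includes a motion detection signal processing module 68. The processing described above is realized by configuring this motion detection signal processing module 68 from the difference image generation module 110 and the centroid calculation module 111 shown in FIG. 19.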
【０１３５】
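That is, as shown in FIG. 19, the virtual robot 43 of the robotic server object 42 reads from the DRAM 11 the frame-by-frame image data captured by the CCD camera 22 and sends it to the difference image generation module 110 of the recognition integration unit 21 included in the recognition system 71 of the middleware layer 50.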
【０１３６】
差分画像生成モジュール１１０は、画像データを入力する毎に時間軸上で隣接する前フレームの画像データとの差分をとって差分画像データを生成し、この差分画像データを重心計算モジュール１１１に与える。例えば、上述した画像データＰ２と画像データＰ３との差分画像データＤ２を生成する場合、位置（ｉ，ｊ）における差分画像データＤ２の輝度値Ｄ２（ｉ，ｊ）は、位置（ｉ，ｊ）における画像データＰ３の輝度値Ｐ３（ｉ，ｊ）から同位置における画像データＰ２の輝度値Ｐ２（ｉ，ｊ）を減算することで得られる。差分画像生成モジュール１１０は、全画素について同様の計算を行って差分画像データＤ２を生成し、この差分画像データＤ２を重心計算モジュール１１１に与える。
【０１３７】
そして、重心計算モジュール１１１は、差分画像データのうち、輝度値が閾値Ｔｈよりも大きい部分についての重心位置Ｇ（ｘ，ｙ）を計算する。ここで、ｘ、ｙは、それぞれ以下の（６）式、（７）式を用いて計算される。
【０１３８】
【数５】

【０１３９】
これにより、図２０に示すように、例えば上述した画像データＰ２と画像データＰ３との差分画像データＤ２から、重心位置Ｇ２が求められる。
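The difference-image and centroid computation can be sketched as below. The formula images for equations (6) and (7) are not reproduced in this text, so the sketch assumes the centroid is the unweighted mean of the coordinates of the pixels above the threshold, and it thresholds the absolute value of the frame difference; both are assumptions, and `difference_centroid` is a hypothetical name.

```python
def difference_centroid(prev, curr, th):
    """Return the centroid (x, y) of pixels whose frame difference exceeds th, or None.

    prev, curr: luminance images as lists of rows (same size); th: threshold Th.
    None means no moving body appears in the difference image.
    """
    xs = ys = count = 0
    for j, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for i, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(c - p) > th:        # assumed: magnitude of D(i, j) compared with Th
                xs += i
                ys += j
                count += 1
    if count == 0:
        return None
    return xs / count, ys / count
```

A return value of `None` after a run of valid centroids corresponds to the moment the moving body disappears from the difference image, at which point the last valid centroid is the direction to face.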
【０１４０】
重心計算モジュール１１１は、求めた重心位置のデータをアプリケーション・レイヤ５１の行動モデルライブラリ９０に送出する。
【０１４１】
行動モデルライブラリ９０は、必要に応じて情動のパラメータ値や欲求のパラメータ値を参照しながら続く行動を決定し、決定結果を行動切換モジュール９１に与える。例えば、差分画像データから動体が消失した場合には、直前に検出された重心位置を向く、或いは近付く行動を決定し、決定結果を行動切換モジュール９１に与える。また、所定の時間間隔毎に動体をトラッキングする場合には、その時間間隔毎に検出された重心位置を向く、或いは近付く行動を決定し、決定結果を行動切換モジュール９１に与える。そして、行動切換モジュール９１は、当該決定結果に基づく行動コマンドをミドル・ウェア・レイヤ５０の出力系８０におけるトラッキング用信号処理モジュール７３に送出する。
【０１４２】
トラッキング用信号処理モジュール７３は、行動コマンドが与えられると当該行動コマンドに基づいて、その行動を行うために対応するアクチュエータ２８１〜２８ｎに与えるべきサーボ指令値を生成し、このデータをロボティック・サーバ・オブジェクト４２のバーチャル・ロボット４３及び信号処理回路１４（図２）を順次介して対応するアクチュエータ２８１〜２８ｎに順次送出する。
【０１４３】
この結果、例えば、差分画像データから動体が消失した場合には、行動モデルライブラリ９０によって、直前に検出された重心位置を向く、或いは近付く行動が決定され、行動切換モジュール９１によって、その行動を行わせるための行動コマンドが生成される。また、所定の時間間隔毎に動体をトラッキングする場合には、行動モデルライブラリ９０によって、その時間間隔毎に検出された重心位置を向く、或いは近付く行動が決定され、行動切換モジュール９１によって、その行動を行わせるための行動コマンドが生成される。
【０１４４】
そして、この行動コマンドが、上述した図４に示す首制御コマンド生成部２２等のトラッキング用信号処理モジュール７３に与えられると、当該トラッキング用信号処理モジュール７３は、その行動コマンドに基づくサーボ指令値を出力部を介して対応するアクチュエータ２８１〜２８ｎに送出し、これによりロボット装置１が動体に興味を示して頭部をその方向に向けたり、動体の方向に近付いたりする行動が発現される。
【０１４５】
なお、本発明は上述した実施の形態のみに限定されるものではなく、本発明の要旨を逸脱しない範囲において種々の変更が可能であることは勿論である。上述の実施の形態においては、人物の顔のトラッキングについて説明したが、他の対象物に対しても広く適用できる。例えば、上述した如く、ボール等の例では、ボールの色、形状（形及び大きさ）、模様等を認識する認識部を備え、いずれかが認識できていればトラッキングを継続する、いずれも認識できない時間が長く続けばトラッキングを中止する等の方法をとることができる。
【０１４６】
（３）制御プログラムのソフトウェア構成
以上のようなロボット装置１は、自己及び周囲の状況や、使用者からの指示及び働きかけに応じて自律的に行動し得るようになされている。次に、このようなロボット装置の制御プログラムのソフトウェア構成について詳細に説明する。図２１は、本実施の形態におけるロボット装置のソフトウェア構成を示すブロック図である。図２１において、デバイス・ドライバ・レイヤ４０は、この制御プログラムの最下位層に位置し、複数のデバイス・ドライバからなるデバイス・ドライバ・セット４１から構成されている。この場合、各デバイス・ドライバは、ＣＣＤカメラ等の画像入力装置２５１（図３）やタイマ等の通常のコンピュータで用いられるハードウェアに直接アクセスすることを許されたオブジェクトであり、対応するハードウェアからの割り込みを受けて処理を行う。
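[0147]
The robotic server object 42 is located in the lowest layer of the device driver layer 40 and consists of: a virtual robot 43, a software group that provides an interface for accessing hardware such as the various sensors and actuators described above; a power manager 44, a software group that manages power switching and the like; a device driver manager 45, a software group that manages various other device drivers; and a designed robot 46, a software group that manages the mechanism of the robot apparatus 1.
[0148]
The manager object 47 consists of an object manager 48 and a service manager 49. The object manager 48 is a software group that manages the startup and termination of each software group included in the robotic server object 42, the middleware layer 50, and the application layer 51. The service manager 49 is a software group that manages the connection of objects based on the inter-object connection information described in a connection file stored, for example, on a memory card.
[0149]
The middleware layer 50 is located above the robotic server object 42 and consists of a software group that provides the basic functions of the robot apparatus 1, such as image processing and audio processing. The application layer 51 is located above the middleware layer 50 and consists of a software group that determines the behavior of the robot apparatus 1 based on the results processed by the software groups constituting the middleware layer 50.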
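[0150]
The specific software configurations of the middleware layer 50 and the application layer 51 are shown in FIG. 22.
[0151]
As shown in FIG. 22, the middleware layer 50 consists of a recognition system 71 and an output system 80. The recognition system 71 has signal processing modules 60 to 69 for noise detection, temperature detection, brightness detection, scale recognition, distance detection, posture detection, contact detection, operation input detection, motion detection, and color recognition, together with an input semantics converter module 70. The output system 80 has an output semantics converter module 79 together with signal processing modules 72 to 78 for posture management, tracking, motion playback, walking, fall recovery, LED lighting, and sound playback.
[0152]
Each of the signal processing modules 60 to 69 of the recognition system 71 takes in the corresponding data among the sensor data, image data, and audio data read from DRAM by the virtual robot 43 of the robotic server object 42, performs predetermined processing based on that data, and gives the processing result to the input semantics converter module 70. Here, the virtual robot 43 is configured, for example, as a part that exchanges or converts signals according to a predetermined communication protocol.
[0153]
Based on the processing results given by the signal processing modules 60 to 69, the input semantics converter module 70 recognizes its own and surrounding conditions, such as "noisy", "hot", "bright", "heard the do-mi-so scale", "detected an obstacle", "detected a fall", "was scolded", "was praised", "detected a moving object", or "detected a ball", as well as commands and actions from the user, and outputs the recognition results to the application layer 51 (FIG. 21).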
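[0154]
As shown in FIG. 23, the application layer 51 consists of five modules: the behavior model library 90, the behavior switching module 91, the learning module 92, the emotion model 93, and the instinct model 94.
[0155]
As shown in FIG. 24, the behavior model library 90 is provided with independent behavior models 90_1 to 90_n, each corresponding to one of several preselected condition items such as "when the remaining battery level is low", "when recovering from a fall", "when avoiding an obstacle", "when expressing an emotion", and "when a ball is detected".
[0156]
When a recognition result is given from the input semantics converter module 70, or when a certain time has elapsed since the last recognition result was given, each of these behavior models 90_1 to 90_n determines the subsequent behavior while referring, as necessary, to the corresponding emotion parameter value held in the emotion model 93 and the corresponding desire parameter value held in the instinct model 94, both described later, and outputs the determination result to the behavior switching module 91.
[0157]
In this specific example, each behavior model 90_1 to 90_n determines the next behavior with an algorithm called a finite probability automaton. As shown in FIG. 25, the transition from one node (state) NODE_0 to NODE_n to any other node NODE_0 to NODE_n is determined probabilistically based on the transition probabilities P_1 to P_n set for the arcs ARC_1 to ARC_n that connect the nodes.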
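[0158]
Specifically, each behavior model 90_1 to 90_n has a state transition table 270, as shown in FIG. 26, for each of the nodes NODE_0 to NODE_n that form that behavior model.
[0159]
In this state transition table 270, the input events (recognition results) that serve as transition conditions at the node are listed in priority order in the "input event name" column, and further conditions on each transition condition are described in the corresponding rows of the "data name" and "data range" columns.
[0160]
Accordingly, at the node NODE_100 represented by the state transition table 270 of FIG. 26, the conditions for transitioning to another node are that, when the recognition result "ball detected (BALL)" is given, the "size (SIZE)" of the ball given together with that recognition result is in the range "0 to 1000", or that, when the recognition result "obstacle detected (OBSTACLE)" is given, the "distance (DISTANCE)" to the obstacle given together with that recognition result is in the range "0 to 100".
[0161]
At this node NODE_100, a transition to another node is also possible when no recognition result is input. Among the emotion and desire parameter values held in the emotion model 93 and the instinct model 94, which the behavior models 90_1 to 90_n refer to periodically, a transition can be made when the parameter value of any of "joy (JOY)", "surprise (SURPRISE)", or "sadness (SADNESS)" held in the emotion model 93 is in the range "50 to 100".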
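[0162]
In the state transition table 270, the names of the nodes to which a transition can be made from the node are listed in the "transition destination node" row of the "transition probability to other nodes" section. The transition probability to each node to which a transition is possible when all the conditions described in the "input event name", "data name", and "data range" columns are met is described at the corresponding position in the "transition probability to other nodes" section, and the behavior to be output when transitioning to that node is described in the "output behavior" row of the same section. The sum of the probabilities in each row of the "transition probability to other nodes" section is 100 [%].
[0163]
Accordingly, at the node NODE_100 represented by the state transition table 270 of FIG. 26, when, for example, a recognition result is given that a ball has been detected (BALL) and that the ball's "SIZE" is in the range "0 to 1000", a transition to "node NODE_120 (node120)" can be made with a probability of "30 [%]", and the behavior "ACTION1" is output at that time.
[0164]
Each behavior model 90_1 to 90_n is constructed by connecting a number of nodes NODE_0 to NODE_n, each described by such a state transition table 270. When a recognition result is given from the input semantics converter module 70, for example, the behavior model probabilistically determines the next behavior using the state transition table of the corresponding node and outputs the determination result to the behavior switching module 91.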
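A probabilistic transition driven by such a table can be sketched as below. This is only an illustration of the finite probability automaton idea under stated assumptions: the table layout, the function name `step`, the 70 % self-transition, and the OBSTACLE row's destination and action are hypothetical; only the BALL/SIZE range, the OBSTACLE/DISTANCE range, and the 30 % transition to node120 with ACTION1 come from the description.

```python
import random

# Hypothetical table for one node, loosely modeled on the description of FIG. 26.
# Each row: (event name, data name, (low, high), [(next node, probability %, action)])
NODE_100 = [
    ("BALL", "SIZE", (0, 1000),
     [("node120", 30, "ACTION1"), ("node100", 70, None)]),
    ("OBSTACLE", "DISTANCE", (0, 100),
     [("node1000", 100, "MOVE_BACK")]),
]


def step(table, event, data, rng=random):
    """Return (next_node, action) for an input event, or None if no row matches."""
    for name, data_name, (low, high), transitions in table:
        # A missing data value maps to low - 1, which always fails the range check.
        if name == event and low <= data.get(data_name, low - 1) <= high:
            nodes = [(node, action) for node, _, action in transitions]
            weights = [prob for _, prob, _ in transitions]
            assert sum(weights) == 100  # each row's probabilities sum to 100 %
            return rng.choices(nodes, weights=weights, k=1)[0]
    return None
```

Rows are scanned in order, which mirrors the statement that input events are listed in priority order. Passing a seeded `random.Random` instance as `rng` makes the behavior reproducible in tests.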
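[0165]
The behavior switching module 91 shown in FIG. 24 selects, from among the behaviors output by the behavior models 90_1 to 90_n of the behavior model library 90, the behavior output by the behavior model with the highest predetermined priority, and sends a command to execute that behavior (hereinafter called a behavior command) to the output semantics converter module 79 of the middleware layer 50. In this embodiment, the behavior models shown lower in FIG. 24 are given higher priority.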
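The priority-based selection described here can be sketched as follows. The function name, the model names, and the dictionary representation are hypothetical; the sketch only illustrates choosing the output of the highest-priority model that proposed a behavior.

```python
def select_behavior(proposals, priority):
    """Pick the proposed behavior from the highest-priority model.

    proposals: dict mapping model name -> proposed behavior (or None)
    priority:  list of model names, highest priority first
    Returns (model name, behavior), or None if no model proposed anything.
    """
    for model in priority:
        behavior = proposals.get(model)
        if behavior is not None:
            return model, behavior
    return None


# Hypothetical model names; the ordering stands in for the fixed priority.
priority = ["fall_recovery", "obstacle_avoidance", "ball_tracking"]
proposals = {"ball_tracking": "chase_ball", "obstacle_avoidance": "step_back"}
chosen = select_behavior(proposals, priority)
```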
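[0166]
Based on the behavior completion information given from the output semantics converter module 79 after a behavior is completed, the behavior switching module 91 notifies the learning module 92, the emotion model 93, and the instinct model 94 that the behavior has been completed.
[0167]
Meanwhile, the learning module 92 receives, from among the recognition results given by the input semantics converter module 70, the recognition results of teaching received as actions from the user, such as "was scolded" or "was praised". Based on this recognition result and the notification from the behavior switching module 91, the learning module 92 changes the corresponding transition probabilities of the corresponding behavior models 90_1 to 90_n in the behavior model library 90 so as to lower the probability that the behavior occurs when "scolded" and raise it when "praised".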
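[0168]
The emotion model 93, on the other hand, holds, for each of a total of six emotions, "joy (JOY)", "sadness (SADNESS)", "anger (ANGER)", "surprise (SURPRISE)", "disgust (DISGUST)", and "fear (FEAR)", a parameter representing the strength of that emotion. The emotion model 93 periodically updates the parameter values of these emotions based on specific recognition results such as "was scolded" and "was praised" given by the input semantics converter module 70, the elapsed time, notifications from the behavior switching module 91, and so on.
[0169]
Specifically, let ΔE[t] be the amount of change in an emotion at that time, calculated by a predetermined formula based on the recognition result given by the input semantics converter module 70, the behavior of the robot apparatus 1 at that time, the time elapsed since the previous update, and so on; let E[t] be the current parameter value of the emotion; and let ke be a coefficient representing the sensitivity of the emotion. The emotion model 93 calculates the parameter value E[t+1] of the emotion in the next cycle by equation (8) below and updates the parameter value by replacing the current value E[t] with the result. The emotion model 93 updates the parameter values of all emotions in the same way.
[0170]
[Equation 6]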

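[0171]
How much each recognition result and each notification from the output semantics converter module 79 affects the amount of change ΔE[t] in the parameter value of each emotion is determined in advance. For example, a recognition result such as "was hit" has a large effect on ΔE[t] for "anger", and a recognition result such as "was stroked" has a large effect on ΔE[t] for "joy".
[0172]
Here, the notification from the output semantics converter module 79 is so-called behavior feedback information (behavior completion information), that is, information on the result of a behavior having been expressed, and the emotion model 93 also changes emotions based on such information. For example, a behavior such as "barking" lowers the anger level. The notification from the output semantics converter module 79 is also input to the learning module 92 described above, and the learning module 92 changes the corresponding transition probabilities of the behavior models 90_1 to 90_n based on that notification.
[0173]
The feedback of the behavior result may instead be provided by the output of the behavior switching module 91 (behavior to which emotion has been added).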
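[0174]
The instinct model 94, meanwhile, holds, for each of four mutually independent desires, "exercise", "affection", "appetite", and "curiosity", a parameter representing the strength of that desire. The instinct model 94 periodically updates the parameter values of these desires based on the recognition results given by the input semantics converter module 70, the elapsed time, notifications from the behavior switching module 91, and so on.
[0175]
Specifically, for "exercise", "affection", and "curiosity", let ΔI[k] be the amount of change in the desire at that time, calculated by a predetermined formula based on the recognition results, the elapsed time, notifications from the output semantics converter module 79, and so on; let I[k] be the current parameter value of the desire; and let ki be a coefficient representing the sensitivity of the desire. At a predetermined cycle, the instinct model 94 calculates the parameter value I[k+1] of the desire in the next cycle using equation (9) below and updates the parameter value by replacing the current value I[k] with the result. The instinct model 94 updates the parameter values of each desire except "appetite" in the same way.
[0176]
[Equation 7]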

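[0177]
How much the recognition results and the notifications from the output semantics converter module 79 affect the amount of change ΔI[k] in the parameter value of each desire is determined in advance. For example, a notification from the output semantics converter module 79 has a large effect on ΔI[k] for "fatigue".
[0178]
In the present embodiment, the parameter values of each emotion and each desire (instinct) are regulated so that they vary within the range from 0 to 100, and the values of the coefficients ke and ki are set individually for each emotion and each desire.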
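The periodic update of the emotion and desire parameters can be sketched as below. Equations (8) and (9) are not reproduced in this text, so the sketch assumes the additive form E[t+1] = E[t] + ke × ΔE[t] (and likewise I[k+1] = I[k] + ki × ΔI[k]), which is consistent with the symbol definitions above but remains an assumption. The function name and the sample values are hypothetical; the 0 to 100 limit comes from the description.

```python
def update_parameter(current, delta, sensitivity, low=0.0, high=100.0):
    """Return the next-cycle value, assuming next = current + sensitivity * delta.

    The result is limited to [low, high], matching the stated 0 to 100 range.
    """
    return max(low, min(high, current + sensitivity * delta))


# Hypothetical current values E[t], sensitivities ke, and changes delta E[t].
emotions = {"JOY": 40.0, "ANGER": 95.0}
ke = {"JOY": 0.5, "ANGER": 1.0}
delta_e = {"JOY": 20.0, "ANGER": 10.0}

# Replace each current value with the next-cycle value, one emotion at a time.
emotions = {name: update_parameter(value, delta_e[name], ke[name])
            for name, value in emotions.items()}
```

Keeping one sensitivity per emotion or desire reflects the statement that ke and ki are set individually.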
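[0179]
Meanwhile, as shown in FIG. 22, the output semantics converter module 79 of the middleware layer 50 gives abstract behavior commands such as "move forward", "be happy", "cry out", or "tracking (chase the ball)", which are given from the behavior switching module 91 of the application layer 51 as described above, to the corresponding signal processing modules 72 to 78 of the output system 80.
[0180]
When given a behavior command, these signal processing modules 72 to 78 generate, based on that command, the servo command values to be given to the corresponding actuators in order to perform the behavior, the audio data of the sound to be output from the speaker, and/or the drive data to be given to the LEDs of the light-emitting unit, and send these data to the corresponding actuators, speaker, or light-emitting unit via the virtual robot 43 of the robotic server object 42.
[0181]
In this way, the robot apparatus 1 is able, based on the control program, to act autonomously according to its own (internal) and surrounding (external) conditions and to instructions and actions from the user.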
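[0182]
[Effect of the invention]
As described in detail above, the robot apparatus according to the present invention is a robot apparatus that tracks an object using at least a part of its body, and has: a plurality of recognition means that recognize different types of things and/or a plurality of recognition means of the same type that are hierarchized into two or more layers according to recognition level; recognition integration means for integrating the recognition results from the plurality of recognition means; motion generation means for generating a motion for tracking the object; and tracking control means for controlling the motion generation means based on the recognition integration result. The tracking control means starts tracking the object based on the recognition result of a predetermined recognition means among the plurality of recognition means, and, when the recognition result of the predetermined recognition means can no longer be obtained, controls the tracking to continue based on the recognition result of another recognition means different from the predetermined recognition means. Therefore, when tracking is started based on the recognition result of the predetermined recognition means, tracking can be continued based on the recognition results of the other recognition means even if recognition by the predetermined recognition means fails. Even when tracking is interrupted by an interrupting action or the like, or when recognition is unstable because of changes in lighting conditions or the like, extremely robust tracking can be performed by using the plurality of recognition means in combination.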
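[Brief description of the drawings]
[FIG. 1] A perspective view showing the external configuration of the robot apparatus according to an embodiment of the present invention.
[FIG. 2] A diagram schematically showing a degree-of-freedom configuration model of the robot apparatus.
[FIG. 3] A diagram schematically showing the control system configuration of the robot apparatus.
[FIG. 4] A block diagram schematically showing the part of the control system of the robot apparatus according to the embodiment that constitutes the tracking system.
[FIG. 5] A block diagram for explaining the function of the prediction unit in the tracking system.
[FIG. 6] A block diagram for explaining the function of the hierarchized recognition units with different recognition levels in the tracking system.
[FIG. 7] A block diagram for explaining the function of the recognition units of different types in the tracking system.
[FIG. 8] A block diagram for explaining the function of the prediction unit in the tracking system when an interrupting action occurs.
[FIG. 9] A block diagram for explaining the tracking termination function in the tracking system.
[FIG. 10] A diagram showing a specific example of the voice direction recognition unit of the tracking system, explaining the turning motion of the robot apparatus.
[FIG. 11] A diagram showing a specific example of the voice direction recognition unit of the tracking system, being a flowchart explaining an example of the turning motion of the robot apparatus.
[FIG. 12] A diagram showing a specific example of the voice direction recognition unit of the tracking system, explaining a method of estimating the sound source direction.
[FIG. 13] A diagram showing a specific example of the voice direction recognition unit of the tracking system, explaining the turning motion of the robot apparatus, in which (a) shows the state before turning, (b) shows the state after turning, and (c) shows the robot apparatus directly facing the target object.
[FIG. 14] A diagram showing a specific example of the voice direction recognition unit of the tracking system, being a flowchart explaining another example of the turning motion of the robot apparatus.
[FIG. 15] A diagram showing a specific example of the voice direction recognition unit of the tracking system, explaining another example of the turning motion of the robot apparatus, in which (a) shows the state before turning and (b) shows the state after turning.
[FIG. 16] A diagram showing a specific example of the moving object detection method of the tracking system, explaining an example in which the moving object disappears from the difference image data.
[FIG. 17] A diagram showing a specific example of the moving object detection method of the tracking system, being a flowchart explaining the procedure for turning toward the direction in which the moving object disappeared from the difference image data.
[FIG. 18] A diagram showing a specific example of the moving object detection method of the tracking system, being a flowchart explaining the procedure for tracking a moving object.
[FIG. 19] A diagram showing a specific example of the moving object detection method of the tracking system, being a block diagram showing the software configuration of the part of the robot apparatus related to moving object detection.
[FIG. 20] A diagram showing a specific example of the moving object detection method of the tracking system, explaining an example of obtaining the centroid position of the difference image data.
[FIG. 21] A block diagram showing the software configuration of the robot apparatus according to the embodiment of the present invention.
[FIG. 22] A block diagram showing the configuration of the middleware layer in the software configuration of the robot apparatus according to the embodiment of the present invention.
[FIG. 23] A block diagram showing the configuration of the application layer in the software configuration of the robot apparatus according to the embodiment of the present invention.
[FIG. 24] A block diagram showing the configuration of the behavior model library of the application layer according to the embodiment of the present invention.
[FIG. 25] A diagram explaining the finite probability automaton that serves as the information for determining the behavior of the robot apparatus according to the embodiment of the present invention.
[FIG. 26] A diagram showing the state transition table provided for each node of the finite probability automaton.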
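[Explanation of reference numerals]
1 robot apparatus, 10 tracking system, 11 skin color recognition unit, 12 face recognition unit, 13 personal recognition unit, 15 recognition unit, 21 recognition integration unit, 22 neck control command generation unit, 23 output unit, 31 prediction unit, 32 behavior, 33 timer
42 robotic server object, 43 virtual robot, 50 middleware layer, 51 application layer, 68 motion detection signal processing module, 70 input semantics converter module, 71 recognition system, 73 tracking signal processing module, 79 output semantics converter module, 80 output system, 83 emotion model, 84 instinct model, 90 behavior model library, 91 behavior switching module, 110 difference image generation module, 111 centroid calculation module, 200 thought control module, 251 image input device, 252 audio input device, 253 audio output device, 300 motion control module, 350 actuator
[0001]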
BACKGROUND OF THE INVENTION
The present invention relates to a robot apparatus that operates autonomously in imitation of the body mechanism and motion of a bipedal walker such as a human, and to an operation control method and program thereof. In particular, it relates to a robot apparatus that recognizes an object and tracks the object with at least its eyes, neck, or body, and to an operation control method and program thereof.
[0002]
[Prior art]
In recent years, entertainment robot apparatuses whose external shape imitates that of a human or an animal have been provided. Such a robot apparatus expresses human-like or animal-like motion by autonomously operating its neck joint, legs, and so on according to information from outside (for example, information on the surrounding environment) and its internal state (for example, its emotional state). Some such robot apparatuses have imaging means, voice input means, and the like, and, for example, extract a person's face from a captured image and identify a specific individual while tracking the face, which makes them more entertaining.
[0003]
A face identification application installed in such a robot apparatus must solve the following problems in addition to the problem of identifying a person in a given scene.
(1) Since the robot apparatus itself moves, it must tolerate changes and variety in the environment.
(2) Since the positional relationship between the person and the robot apparatus also changes, the person must be kept in view during interaction.
(3) Images usable for identifying a person must be selected from many scene images, and a comprehensive judgment must be made.
(4) A response must be given within a certain time.
[0004]
Patent Document 1 below describes a robot apparatus that, even under such conditions, can identify faces with limited resources without impairing its own autonomy. The robot apparatus described in Patent Document 1 includes face tracking means for tracking a face that moves within the images captured by the imaging means, face data detection means for detecting the face data of a face in the captured images based on the face tracking information from the face tracking means, and face identification means for identifying a specific face (personal recognition) based on the face data detected by the face data detection means. The apparatus thus detects face data while tracking a face that moves within the image and identifies a specific face based on the detected face data. That is, it detects a face area in the captured image, tracks the detected face, and identifies the person with the detected face in the meantime. Specifically, the face area is detected in the input image by classifying face and non-face using a support vector machine (SVM) or the like, and the detected face is tracked by following the skin color region. For personal identification, the eyes, nose, and so on of the input face are aligned (morphing), and whether the face belongs to the same person is determined from the differences from a plurality of registered faces.
[0005]
[Patent Document 1]
JP 2002-157596 A
[0006]
[Problems to be solved by the invention]
In the conventional robot apparatus described above, the movement of a face in the image is tracked using the skin color recognition result. Such a conventional tracking operation usually ends when recognition of the object fails. For a robot apparatus, however, tracking an object or a target such as the user with part of its body, for example by rotating the neck joint, is an important technique. For example, if tracking is not possible when chasing a moving ball, chasing the ball becomes difficult. In addition, following something interesting with part of the body, such as the head, is a behavior that makes the robot seem like a living thing to people. Further, tracking the user's movement during interaction is important for more natural interaction.
[0007]
However, recognition of the outside world from images, which is necessary for tracking by a robot apparatus, is often unstable and is sensitive to lighting conditions and to the angle of a person's face. Also, when an object such as a ball moves a long way, it moves under non-uniform lighting conditions, which makes recognition difficult. Furthermore, a robot apparatus capable of autonomous operation selects its behavior based on its internal state and external stimuli, and when another behavior with a higher priority than the tracking operation arises, the tracking operation may be interrupted to allow the other behavior to be expressed. For example, when the robot apparatus is called by another person B during a conversation with a person A, turns around, and talks briefly with person B, it may be desirable to stop tracking person A and then resume tracking afterwards. In such a case, it is possible in principle to store the position of person A, but if person A moves even slightly, recognition may fail and tracking may not be resumed. Thus, for an autonomous robot apparatus, it is extremely difficult to keep tracking an object through changes in the external environment or other interrupting operations.
[0008]
The present invention has been proposed in view of these conventional circumstances. Its object is to provide a robot apparatus, an operation control method thereof, and a program capable of tracking that is robust against unstable recognition of the object, whether after tracking has been interrupted by an interrupting action performed during the tracking operation or because of changes in the external environment.
[0009]
[Means for Solving the Problems]
In order to achieve the above object, a robot apparatus according to the present invention is a robot apparatus that tracks an object using at least a part of its body, and has: a plurality of recognition means that recognize different types of things, or a plurality of recognition means that recognize the same type of thing and are hierarchized into two or more layers according to recognition level; recognition integration means for integrating the recognition results from the plurality of recognition means; motion generation means for generating a motion for tracking the object; tracking control means for controlling the motion generation means based on the recognition integration result; and object storage means for storing the direction of the object immediately before the tracking is ended or interrupted by a motion having a higher priority than the tracking motion. The tracking control means starts tracking the object based on the recognition result of a predetermined recognition means among the plurality of recognition means and, when the recognition result of the predetermined recognition means can no longer be obtained, controls the tracking to continue based on the recognition result of another recognition means different from the predetermined recognition means. When the higher-priority motion is completed, the tracking control means controls the tracking to start again if the object is recognized by any of the plurality of recognition means in the stored direction.
[0010]
In the present invention, there are a plurality of recognition means, and when tracking is started based on the recognition result of the predetermined recognition means, tracking can be continued based on the recognition results of the other recognition means even if recognition by the predetermined recognition means fails. By using the plurality of recognition means in combination, tracking that is extremely robust compared with conventional tracking by a single recognition means can be performed.
[0011]
The tracking control means may also have prediction means for obtaining, when no recognition result can be obtained from the other recognition means, a predicted direction of the object based on the recognition results of the other recognition means up to immediately before, and may control the tracking based on this predicted direction. Even if recognition by the other recognition means also fails, the direction of the object can be predicted from the recognition results obtained up to just before the failure and tracking can be continued, which makes the tracking more robust.
[0012]
Further, the other recognition means may be a recognition means of the same type as the predetermined recognition means but of a lower layer, or a recognition means of a different type from the predetermined recognition means. Even if the predetermined recognition means fails, the object can be recognized by lowering the recognition level, and, for example, when an image cannot be acquired, more robust tracking can be performed by using voice recognition.
[0013]
Furthermore, the tracking control means can end or interrupt the tracking when the recognition result of the predetermined recognition means is not obtained for a predetermined time. When recognition by the recognition means used at the start of tracking is not obtained for a predetermined period, tracking is ended or interrupted so that the wrong object is not tracked.
[0014]
In addition, the tracking control means may end or interrupt the tracking when the difference between the direction of the object immediately before the recognition result of the predetermined recognition means was lost and the direction of the object obtained from the recognition result of the other recognition means exceeds a predetermined value. Thus, even if a recognition result of another recognition means is obtained, tracking is ended or interrupted when the predetermined condition is not satisfied, which prevents the wrong object from being tracked.
[0015]
Furthermore, when the predetermined recognition means has a lower-layer recognition means of the same type, the recognition integration means can control the tracking of the object to continue based on the recognition result of the lower-layer recognition means, so that tracking can be continued even under conditions where recognition is difficult by lowering the recognition level.
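The fallback behavior described in the preceding paragraphs can be sketched as follows. This is a simplified illustration, not the claimed implementation: the function name, the recognizer names, and the representation of a direction as a single angle in degrees are hypothetical, angle wrap-around is ignored, and the time-based termination is left out.

```python
def choose_direction(results, last_direction, max_jump, predicted=None):
    """Pick the direction to track from recognizers ordered high to low level.

    results: list of (recognizer name, direction in degrees or None),
             ordered from the highest recognition level to the lowest.
    last_direction: direction of the object just before recognition was lost.
    max_jump: largest accepted difference from last_direction for a fallback.
    predicted: direction from a predictor, used when nothing is recognized.
    Returns (source, direction), or None to end or interrupt tracking.
    """
    for index, (name, direction) in enumerate(results):
        if direction is None:
            continue
        # A fallback result far from the last known direction is rejected,
        # since it is likely to belong to a different object.
        if index > 0 and abs(direction - last_direction) > max_jump:
            return None
        return name, direction
    if predicted is not None:
        return "prediction", predicted
    return None


results = [("personal", None), ("face", 12.0), ("skin_color", 15.0)]
target = choose_direction(results, last_direction=10.0, max_jump=20.0)
```

The first entry plays the role of the predetermined recognition means, so its result is accepted without the direction check.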
[0016]
Furthermore, the robot apparatus may have object motion detection means for detecting the movement of the object up to immediately before the tracking is ended or interrupted, and the tracking control means can control the tracking to start again when, after the higher-priority motion is completed, the object is recognized by any of the plurality of recognition means based on the motion detection result. Even if the object has moved since the tracking was interrupted or ended, the object can be recognized, or its position predicted, by any of the plurality of recognition means with different recognition levels or types, and tracking can be resumed.
[0017]
An operation control method for a robot apparatus according to the present invention is an operation control method for a robot apparatus that tracks an object using at least a part of its body, and has: a recognition step of recognizing the object with a plurality of recognition means that recognize different types of things, or with a plurality of recognition means that recognize the same type of thing and are hierarchized into two or more layers according to recognition level; a recognition integration step of integrating the recognition results from the plurality of recognition means; a tracking control step of controlling the motion of tracking the object based on the recognition integration result; and an object storage step of storing the direction of the object immediately before the tracking is ended or interrupted by a motion having a higher priority than the tracking motion. In the tracking control step, tracking of the object is started based on the recognition result of a predetermined recognition means among the plurality of recognition means, and, when the recognition result of the predetermined recognition means can no longer be obtained, the tracking is controlled to continue based on the recognition result of another recognition means different from the predetermined recognition means. When the higher-priority motion is completed, the tracking is controlled to start again if the object is recognized by any of the plurality of recognition means in the stored direction.
[0018]
A program according to the present invention causes a computer to execute the behavior control of the robot apparatus described above.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a bipedal humanoid robot apparatus, shown as one configuration example of the present invention, is described in detail with reference to the drawings. This humanoid robot apparatus is a practical robot that supports human activities in various situations in daily life, such as the living environment. It is also an entertainment robot that can act according to its internal state (anger, sadness, joy, fun, etc.) and express the basic motions that humans perform.
[0020]
(A) Configuration of robot device
FIG. 1 is a perspective view showing an overview of the robot apparatus according to the present embodiment. As shown in FIG. 1, the robot apparatus 1 is configured by connecting, at predetermined positions of the trunk unit 2, a head unit 3, two arm units 4R/L (right and left), and two leg units 5R/L (right and left), where R and L are suffixes indicating right and left, respectively; the same applies hereinafter.
[0021]
The joint degree-of-freedom configuration of the robot apparatus 1 is schematically shown in FIG. The neck joint that supports the head unit 3 has three degrees of freedom: a neck joint yaw axis 101, a neck joint pitch axis 102, and a neck joint roll axis 103.
[0022]
Each arm unit 4R/L constituting the upper limbs consists of a shoulder joint pitch axis 107, a shoulder joint roll axis 108, an upper arm yaw axis 109, an elbow joint pitch axis 110, a forearm yaw axis 111, a wrist joint pitch axis 112, a wrist joint roll axis 113, and a hand portion 114. The hand portion 114 is actually a multi-joint, multi-degree-of-freedom structure including a plurality of fingers. However, since the operation of the hand portion 114 contributes little to the posture control or walking control of the robot apparatus 1, it is assumed in this specification to have zero degrees of freedom. Each arm unit therefore has seven degrees of freedom.
[0023]
The trunk unit 2 has three degrees of freedom: a trunk pitch axis 104, a trunk roll axis 105, and a trunk yaw axis 106.
[0024]
Each leg unit 5R/L constituting the lower limbs consists of a hip joint yaw axis 115, a hip joint pitch axis 116, a hip joint roll axis 117, a knee joint pitch axis 118, an ankle joint pitch axis 119, an ankle joint roll axis 120, and a foot 121. In this specification, the intersection of the hip joint pitch axis 116 and the hip joint roll axis 117 defines the hip joint position of the robot apparatus 1. The foot of a human body is actually a structure including a multi-joint, multi-degree-of-freedom sole, but the foot 121 of the robot apparatus 1 has zero degrees of freedom. Each leg unit therefore has six degrees of freedom.
[0025]
In summary, the robot apparatus 1 as a whole has a total of 3 + 7 × 2 + 3 + 6 × 2 = 32 degrees of freedom. However, the robot device 1 for entertainment is not necessarily limited to 32 degrees of freedom. Needless to say, the degree of freedom, that is, the number of joints, can be increased or decreased as appropriate in accordance with design / production constraints or required specifications.
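The total can be checked with a few lines of arithmetic. The dictionary keys below are hypothetical labels for the neck, arm, trunk, and leg counts given above.

```python
# Degrees of freedom per unit, as stated in the description.
DOF = {"neck": 3, "arm": 7, "trunk": 3, "leg": 6}

# One neck and one trunk; arms and legs come in left/right pairs.
total = DOF["neck"] + 2 * DOF["arm"] + DOF["trunk"] + 2 * DOF["leg"]
```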
[0026]
Each degree of freedom of the robot apparatus 1 described above is actually implemented using an actuator. The actuators are preferably small and light, in view of requirements such as eliminating extra bulges in the appearance to approximate the natural human form and performing posture control on an unstable structure that walks on two legs.
[0027]
In the following description, for convenience of description, in the description of the foot 121, a surface configured to include a portion in contact with the road surface (floor surface) on the back surface of the foot 121 is referred to as an XY plane. In the following description, it is assumed that the front-rear direction of the robot apparatus is the X-axis, the left-right direction of the robot apparatus is the Y-axis, and the direction orthogonal to these is the Z-axis.
[0028]
Such a robot apparatus includes, for example in the trunk unit 2, a control system that controls the operation of the entire robot apparatus. FIG. 3 is a schematic diagram showing the control system configuration of the robot apparatus 1. As shown in FIG. 3, the control system consists of a thought control module 200, which responds dynamically to user input and the like and handles emotion judgment and emotional expression, and a motion control module 300, which controls the whole-body cooperative motion of the robot apparatus 1, such as driving the actuators 350.
[0029]
The thought control module 200 is an independently driven information processing apparatus that consists of a CPU (Central Processing Unit) 211, a RAM (Random Access Memory) 212, a ROM (Read Only Memory) 213, an external storage device (hard disk drive or the like) 214, and so on, and that can perform self-contained processing within the module.
[0030]
This thought control module 200 determines the current emotion and intention of the robot apparatus 1 according to stimuli from the outside such as image data input from the image input apparatus 251 and audio data input from the audio input apparatus 252. Here, the image input device 251 includes a plurality of CCD (Charge Coupled Device) cameras, for example, and the sound input device 252 includes a plurality of microphones, for example.
[0031]
In addition, the thought control module 200 issues commands to the motion control module 300 to execute a motion or action sequence based on its decisions, that is, movement of the limbs.
[0032]
The motion control module 300, on the other hand, is an independently driven information processing apparatus that consists of a CPU 311 that controls the whole-body cooperative motion of the robot apparatus 1, a RAM 312, a ROM 313, an external storage device (hard disk drive or the like) 314, and so on, and that can perform self-contained processing within the module. The external storage device 314 can store, for example, walking patterns calculated offline, target ZMP trajectories, and other action plans.
[0033]
Various devices are connected to the motion control module 300 via a bus interface (I/F) 301: the actuators 350 that realize the degrees of freedom of the joints distributed throughout the body of the robot apparatus 1 shown in FIG. 2, a posture sensor 351 that measures the posture and inclination of the trunk unit 2, ground contact confirmation sensors 352 and 353 that detect whether the left and right soles have left or touched the floor, a load sensor (described later) provided on the sole of the foot 121, and a power supply control device 354 that manages power sources such as the battery. Here, the posture sensor 351 is configured, for example, by a combination of an acceleration sensor and a gyro sensor, and the ground contact confirmation sensors 352 and 353 are configured by proximity sensors, microswitches, or the like.
[0034]
The thought control module 200 and the motion control module 300 are constructed on a common platform, and are interconnected via bus interfaces 201 and 301.
[0035]
The motion control module 300 controls the whole-body cooperative motion of the actuators 350 in order to embody the action instructed by the thought control module 200. That is, the CPU 311 retrieves a motion pattern corresponding to the action instructed by the thought control module 200 from the external storage device 314, or generates a motion pattern internally. The CPU 311 then sets the foot movement, ZMP trajectory, trunk movement, upper limb movement, horizontal position and height of the waist, and so on according to the specified motion pattern, and transfers command values instructing motion according to these settings to each actuator 350.
[0036]
Further, the CPU 311 detects the posture and inclination of the trunk unit 2 of the robot apparatus 1 from the output signal of the posture sensor 351, and detects whether each leg unit 5R/L is in the swing phase or the stance phase from the output signals of the ground contact confirmation sensors 352 and 353, and can thereby adaptively control the whole-body cooperative motion of the robot apparatus 1. Further, the CPU 311 controls the posture and motion of the robot apparatus 1 so that the ZMP position always moves toward the center of the ZMP stable region.
[0037]
In addition, the motion control module 300 returns to the thought control module 200 the degree to which the intended behavior determined by the thought control module 200 has been expressed, that is, the processing status. In this way, the robot apparatus 1 can determine its own and surrounding conditions based on the control program and act autonomously.
[0038]
(B) Control system of the robot apparatus
Next, the control system of such a robot apparatus is described. This robot apparatus has a plurality of recognizers, and by using them it can perform tracking that is more robust than conventional tracking. The tracking system and tracking method are described first, and the control system of the entire robot apparatus is described later.
[0039]
(1) Tracking system
FIG. 4 is a block diagram illustrating a main part of the tracking system in the robot apparatus control system according to the present embodiment. The tracking system in the present embodiment has a plurality of recognizers that recognize different types such as image recognition and voice recognition, for example, and image recognition and voice recognition perform recognition of different recognition levels. In this way, by combining the recognition results of multiple recognizers with different types or recognition levels as needed, the robotic device can perform an interrupt operation different from tracking during tracking. This tracking system can perform robust tracking even when the tracking operation is interrupted (temporarily stopped) or when recognition by the recognizer fails due to environmental changes.
[0040]
As shown in FIG. 4, the tracking system 10 includes: an image input device 251 that captures images; a voice input device 252 including a microphone and the like; an image recognition unit 15 comprising a skin color recognition unit 11, a face recognition unit 12, and a personal recognition unit 13, which are supplied with images from the image input device 251 and perform skin color recognition, face recognition, and personal recognition, respectively; a voice direction recognition unit 14 that recognizes the direction of the voice supplied from the voice input device 252; a recognition integration unit 21 that integrates the recognition results of these recognizers 14 and 15; a prediction unit 31 that predicts the position of the object based on the integration result of the recognition integration unit 21; a neck control command generation unit 22 that, in order to track the object, controls for example the rotation angle of the neck of the robot apparatus based on the integration result of the recognition integration unit 21 or the prediction of the prediction unit 31; and an output unit 23 that outputs control information from the neck control command generation unit 22.
[0041]
Skin color recognition, face recognition, and personal recognition are recognizers with different recognition levels for recognizing a human face. In the example shown in FIG. 4, the personal recognition unit 13, which identifies who the target person is and performs the most difficult recognition, is at the top. Below it is the face recognition unit 12, which recognizes whether or not an object is a human face, and at the bottom is the skin color recognition unit 11, which recognizes skin color regions and performs the easiest recognition. The tracking target may be, for example, a person or an object other than a person, such as a ball. Here, an example in which a person is recognized and tracked will be described.
[0042]
For example, the skin color recognition unit 11 detects a skin color region and sends the recognition result to the recognition integration unit 21. Next, the face recognition unit 12 recognizes whether or not the object is a human face; if it is recognized as a face, the face recognition unit 12 sends an image of the face area to the personal recognition unit 13 and sends its recognition result to the recognition integration unit 21. The personal recognition unit 13 identifies whose face the input face image is, that is, the individual, and sends the result to the recognition integration unit 21.
[0043]
As image recognition when the object is a human face, a cascade from the skin color recognition unit 11 up to the personal recognition unit 13 can be considered. As a personal recognition method, for example, a registered face database in which individual faces are registered can be used, and the input face image can be compared with the registered faces to recognize whose face it is.
[0044]
When the target is an object such as a ball, a shape recognition unit that recognizes the shape of the object and a pattern recognition unit that recognizes the pattern of the object whose shape has been recognized can be arranged above a color recognition unit that recognizes the color of the object. Needless to say, the above can thus be applied in the same way even when the object is not a human face.
[0045]
Further, the voice direction recognition unit 14 recognizes from which direction, relative to the robot apparatus itself, a voice is heard, and supplies the recognized direction to the recognition integration unit 21. A specific example of the recognition method of the voice direction recognition unit 14 will be described later. Here, voice direction recognition means for recognizing the direction of the voice of the person who is the object is used. However, as will be described later, a plurality of recognizers having different recognition levels may be provided for the input voice from the voice input device 252. Also, when the object is not a person, the direction of the sound emitted from the object may be recognized.
[0046]
The recognition integration unit 21 always receives the recognition results from the personal recognition unit 13, the face recognition unit 12, the skin color recognition unit 11, and the voice direction recognition unit 14, and integrates them. Integration here means integrating information, for example that a face and skin color are recognized for the same area on the image although it is not known who the person is. That is, the recognition units 14 and 15 each send information indicating whether or not recognition was successful and, when recognition was successful, send the recognition information as a recognition result; when recognition is successful and recognition information is sent, the recognition integration unit 21 estimates the direction of the object from a predetermined recognition result, or from one or more of the recognition results.
[0047]
In this robot apparatus, the recognition integration unit 21 belongs to the middleware layer 50 shown in FIG. 23, which will be described later. In practice, the robot apparatus has a tracking control unit (not shown) that receives a tracking operation start command from the higher-level application layer 51, issued for example when recognition by the personal recognition unit 13 has succeeded, receives the recognition result from each recognition unit, and performs control to start tracking. That is, a control signal for starting tracking is input to the tracking control unit provided inside or outside the recognition integration unit 21, and a tracking operation generation unit, such as the neck control command generation unit described later, is controlled to start tracking.
[0048]
Upon receiving the tracking start command, the tracking control means can select one or more predetermined recognition results from among the recognition results of the plurality of recognition units in order to recognize the tracking target, and can control tracking based on those recognition results. For example, if the highest-level personal recognition unit 13 in image recognition of a person's face has succeeded when the tracking start command is input, tracking can be started using its result, and the neck control command generation unit 22 is controlled based on the personal recognition result so that, for example, the center of gravity of the target individual's face is at the center of the input image.
[0049]
If the personal recognition unit 13 fails in personal recognition, control is performed so that tracking is continued using the recognition results of the other recognition means, namely the face recognition unit 12, the skin color recognition unit 11, and the voice direction recognition unit 14. For example, the direction (position) of the face of the target person is predicted using the recognition result of the face recognition unit 12, which is subordinate to the personal recognition unit 13. That is, even though the individual cannot be identified, if recognition by the face recognition unit 12 succeeds and a face is recognized, that face is assumed to belong to the same individual, the individual is regarded as still being tracked, and the neck control command generation unit 22 is controlled so that the face area is at the center of the input image. Further, when the face recognition unit 12 also fails, the result of the skin color recognition unit 11 is used, for example, and when the skin color recognition unit 11 also fails, the recognition result of the voice direction recognition unit 14 is used and the neck control command generation unit 22 is controlled so that the front of the robot apparatus faces the voice direction. Which of the recognition results is used preferentially may be set in advance in the recognition integration unit 21, or may be selected by the robot apparatus as appropriate. For example, the recognition result of the recognition unit that is nearest to the position (direction) of the target object just before recognition by the personal recognition unit 13 failed may be used.
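The fallback order described in this paragraph can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the `Recognition` record, the dictionary keys, and the function name `select_tracking_source` are hypothetical names introduced only for illustration, and the fixed priority order (personal, face, skin color, voice direction) is just one of the options the text allows.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Recognition:
    """Result reported by one recognition unit (hypothetical structure)."""
    success: bool
    direction_deg: Optional[float] = None  # target direction relative to the robot front


# Assumed fixed priority: highest recognition level first, voice direction last.
PRIORITY = ("personal", "face", "skin_color", "voice_direction")


def select_tracking_source(results: Dict[str, Recognition]) -> Optional[Tuple[str, float]]:
    """Return (unit name, direction) of the highest-priority successful recognizer.

    Returns None when every recognizer has failed, which is the case in which
    the prediction unit would have to take over.
    """
    for name in PRIORITY:
        result = results.get(name)
        if result is not None and result.success and result.direction_deg is not None:
            return name, result.direction_deg
    return None
```

With personal recognition failed and face recognition successful, the sketch returns the face direction; with all entries failed it returns `None`.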
[0050]
The prediction unit 31 is supplied with the recognition integration result of the recognition integration unit 21, and predicts the position of the object when the object temporarily cannot be recognized because of the instability of the recognition units (when recognition fails). For example, when no recognition result is obtained from any of the recognition units, the current position (direction) of the object is predicted based on the recognition result immediately before the failure.
[0051]
The prediction unit 31 is, for example, always supplied with the recognition integration result from the recognition integration unit 21, and when the object can no longer be recognized, it is instructed by the tracking control unit or the like to start predicting the position of the object, and performs control such as waiting a certain period of time for recognition by the recognition units 14 and 15 to recover. Alternatively, when the object can no longer be recognized, the recognition results up to immediately before may be supplied from the recognition integration unit 21 together with an instruction to predict the position of the object.
[0052]
The prediction unit 31 predicts the direction of the object from the recognition result immediately before the object could no longer be recognized, and sends the predicted direction to the neck control command generation unit 22. As described above, the image-based recognition of the outside world that the robot apparatus needs for tracking is often unstable; it is sensitive to lighting conditions and to the angle of a person's face, and when these change the image recognition unit 15 may fail in its various recognitions. Also, if an object such as a ball moves greatly, it moves through non-uniform illumination conditions, making recognition difficult. Furthermore, a robot apparatus capable of autonomous operation selects its actions based on its internal state and external stimuli, and when another action with higher priority than the tracking operation arises, it may interrupt the tracking operation and allow the other action to be expressed. For example, while conversing with a person A, the robot apparatus may be called by another person B, turn around and talk with person B for a short time, and then need to resume tracking of the original person A after having stopped it. In such a case, the original position of person A can in principle be stored, but if person A moves even a little, tracking may not be resumed because of the instability of recognition.
[0053]
Even in such a case, for example, when the target object is a moving object, the predicted position is obtained by predicting the current position (direction) from the immediately preceding motion amount. If it can be determined that the object is stationary for a predetermined period immediately before the recognition fails, the direction of the immediately preceding object is set as the predicted position.
[0054]
Then, the neck control command generation unit 22 generates a neck control command based on the control information from the recognition integration unit 21 or the prediction unit 31, and outputs it through the output unit 23. Specifically, rotation angles for the neck joint that supports the head unit 3 shown in FIG. 2, which comprises the neck joint yaw axis 101, the neck joint pitch axis 102, and the neck joint roll axis 103, are output through the output unit 23; the corresponding motors and the like are thereby controlled, and the head unit of the robot apparatus is rotated in accordance with the movement of the object, causing the robot apparatus to perform tracking.
[0055]
Note that in the robot apparatus 1, the application layer 51 shown in FIG. 23, described later, actually has a behavior description unit (not shown) comprising a plurality of behavior description modules in which behaviors are described, and a behavior selection unit (not shown) that selects among these behaviors. The behavior selection unit uses the recognition integration result as external information, as one of its behavior selection criteria, and selects a behavior from the plurality of behavior description modules. If the behavior output from the behavior description unit involves tracking of an object, a command that specifies the target and instructs that tracking be performed using at least a part of the robot apparatus, such as its neck, eyes, or legs, is output to the recognition integration unit 21. On receiving this command, the recognition integration unit 21 causes, for example, the neck control command generation unit 22 to output a predetermined neck control command.
[0056]
Here, the case where the recognition integration unit 21 causes the neck control command generation unit 22 to rotate the neck joint of the robot apparatus to perform tracking is described as an example. However, when the object moves beyond the limit of the rotation angle of the neck joint, tracking may be performed not only with the neck joint but also by controlling the trunk pitch axis 104, the trunk roll axis 105, or the trunk yaw axis 106 of the trunk unit 2 shown in FIG. 2, or by controlling the leg units so that the robot apparatus tracks the object while moving.
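As a rough illustration of handing rotation over from the neck to the trunk once the neck limit is reached, the following Python sketch splits a desired yaw angle into a neck part and a trunk part. The function name `split_yaw` and the 60 degree default limit are assumptions made for this example; the embodiment does not state a numerical limit or this particular allocation rule.

```python
from typing import Tuple


def split_yaw(target_yaw_deg: float, neck_limit_deg: float = 60.0) -> Tuple[float, float]:
    """Split a desired yaw into (neck yaw, trunk yaw).

    The neck takes as much of the rotation as its assumed limit allows,
    and whatever remains is assigned to the trunk (or legs).
    """
    neck = max(-neck_limit_deg, min(neck_limit_deg, target_yaw_deg))
    trunk = target_yaw_deg - neck
    return neck, trunk
```

For a target inside the neck range the trunk part is zero, so the trunk only moves when the neck alone cannot reach the target.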
[0057]
The prediction unit 31 predicts the direction of the object when all the recognition units fail, but the prediction unit 31 may also perform part of the processing of the recognition integration unit 21 described above. That is, the prediction unit 31 may perform the processing of continuing tracking using the recognition result of the lower-level face recognition unit 12 or the recognition result of the voice direction recognition unit 14 when the higher-level personal recognition unit 13 fails.
[0058]
Hereinafter, this embodiment will be described in more detail. This tracking system has a prediction function that predicts the position (direction) of an object when recognition of the object fails, a function that continues tracking by switching the recognition level or the type of recognizer when recognition of the object fails, and a function of interrupting tracking when the period (time) during which recognition of the object has failed exceeds a predetermined threshold. Hereinafter, these functions will be described in detail with reference to FIG. 5 to FIG. 9. To facilitate understanding, it is assumed in every case that the personal recognition result of the personal recognition unit 13 is used at the start of tracking. The same components are denoted by the same reference numerals, and detailed description thereof is omitted.
[0059]
As described above, the present embodiment has a plurality of recognition units, and when tracking is started using the recognition result of a predetermined one of them, robust tracking can be performed even if that recognition unit fails, by using the recognition result of another recognition unit or by predicting the current position (direction) of the object from the recognition result immediately before the failure. Therefore, the recognition result of a recognition unit other than the personal recognition unit 13 may be used when starting tracking.
[0060]
(1-1) Prediction of tracking object
First, the function of continuing tracking even when the recognition units, which are of multiple types and/or hierarchized into multiple levels, fail will be described. FIG. 5 is a block diagram showing a tracking system. Here, the recognizer is assumed to be, for example, the recognizer 15 hierarchized by recognition level. First, suppose that a person appears in front of the robot apparatus, so that the action selection unit selects a dialogue action and the robot apparatus starts a dialogue with the person. At that time, for example, personal recognition of the person (object) succeeds, the result of the personal recognition is sent to the action selection unit and the recognition integration unit 21 described above, and tracking of that individual is started. When tracking starts, the recognition result is sent to the neck control command generation unit 22, and based on this recognition result the neck control command generation unit 22 performs tracking by controlling the neck so that, for example, the person's face is at the center of the field of view of the robot apparatus.
[0061]
Also, as described above, tracking is not limited to controlling the rotation angle of the neck; when the person moves greatly, the robot apparatus may of course track the person by rotating its whole body using its legs, or by rotating only its upper body.
[0062]
Then, suppose for example that during tracking, personal recognition becomes impossible because of a change in the angle of the face or in lighting conditions, or that the face is recognized as another individual. When recognition of the same person fails in this way, information notifying that recognition is impossible is supplied to the prediction unit 31. The prediction unit 31 holds information on the position (direction) of the person immediately before recognition became impossible, and if the person was stationary at that time, it sets that immediately preceding position as the predicted direction.
[0063]
Alternatively, if the person to be tracked is moving, the amount of movement is detected from the information immediately before and the current position is predicted. In this case, in the image recognition, a motion vector until the recognition becomes impossible is obtained, and based on this, the position moved from the time when the recognition was possible to the current time may be predicted. In the voice recognition described later, if the voice recognition direction changes with time, the current direction may be predicted from the amount of change.
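The two cases above (stationary target, moving target) can be sketched in Python as a simple constant-velocity extrapolation of the last observed directions. This is only one possible reading: the function name `predict_direction`, the one-dimensional direction in degrees, the use of only the last two observations, and the `still_threshold_deg_s` value are all assumptions for illustration and do not come from the embodiment.

```python
from typing import List, Optional, Tuple


def predict_direction(
    history: List[Tuple[float, float]],
    now_s: float,
    still_threshold_deg_s: float = 1.0,
) -> Optional[float]:
    """Predict the current target direction after recognition has failed.

    history holds (timestamp in seconds, direction in degrees) pairs observed
    while recognition was still succeeding, oldest first.
    """
    if not history:
        return None
    t_last, d_last = history[-1]
    if len(history) < 2:
        return d_last
    t_prev, d_prev = history[-2]
    dt = t_last - t_prev
    if dt <= 0:
        return d_last
    velocity = (d_last - d_prev) / dt  # one-dimensional "motion vector"
    if abs(velocity) < still_threshold_deg_s:
        # Target judged stationary: keep the immediately preceding direction.
        return d_last
    # Target judged moving: extrapolate from the last observation to now.
    return d_last + velocity * (now_s - t_last)
```

A real system would likely smooth the velocity over more than two samples and bound the extrapolation time; both are omitted here for brevity.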
[0064]
However, the prediction unit 31 uses the recognition results of recognition units other than the personal recognition unit 13 only to continue tracking temporarily; when recognition by the personal recognition unit 13 succeeds again, the recognition result of the personal recognition unit 13 is used. Further, if, after the personal recognition unit 13 has failed and the recognition result of another recognition unit has been substituted, personal recognition still does not succeed after a predetermined time or longer has elapsed, or another individual is recognized after the predetermined time or longer has elapsed, tracking is terminated or interrupted.
[0065]
(1-2) When having recognizers with different recognition levels
Next, a method of using a recognizer comprising image recognizers with different recognition levels, for example, the skin color recognition unit 11, the face recognition unit 12, and the personal recognition unit 13, will be described. As described above, the recognizer 15 is hierarchized according to recognition level. That is, as shown in FIG. 6, a plurality of levels are formed according to recognition level, from the lowest, the skin color recognition unit 11, up through the face recognition unit 12 to the highest, the personal recognition unit 13. When recognition by a higher-level recognition unit fails, the recognition integration unit 21 performs control so that a lower-level recognition result is used.
[0066]
When tracking is started after an individual's face has been recognized, not only the recognition result of the personal recognition unit 13 but also those of the skin color recognition unit 11 and the face recognition unit 12 are always supplied to the recognition integration unit 21. Although the skin color recognition unit 11 and the face recognition unit 12 are described here as not sending their own results to the higher-level recognizers, a higher-level recognizer may use the result of a lower-level one; for example, the personal recognition unit 13 may recognize the individual using the face recognition result of the face recognition unit 12.
[0067]
Here, it is assumed that personal recognition has failed at a certain time T1. The recognition integration unit 21 is notified of the success or failure of recognition from each recognizer, and when the recognition is successful, the recognition information is sent as a recognition result, and the recognition integration unit 21 obtains information of personal recognition failure at time T1.
[0068]
At that time, if the face recognition unit 12 below the personal recognition unit 13 has succeeded in recognition and a face is recognized at the same position, the prediction unit 31 predicts that the person is still there and tracks the face region (regardless of whose face it is). If the face recognition unit 12 has also failed and a skin color region is recognized at the same position, that region is tracked. As described above, continuing tracking using the recognition results of such other recognizers may be performed by the recognition integration unit 21 or, as illustrated, by the prediction unit 31.
[0069]
However, when the personal recognition unit 13 fails and the recognition result of a lower recognition unit (another recognition unit) is used, if the position (direction) of the object recognized by the personal recognition unit 13 and the position (direction) of the object recognized by the lower recognition unit differ by a predetermined amount or more, tracking is terminated or interrupted because there is a high possibility that the wrong object has been recognized.
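The consistency check in this paragraph can be sketched as a comparison of two directions against a threshold. The following Python example is only illustrative: the function names, the use of a single yaw angle in degrees, and the 15 degree default threshold are assumptions, since the embodiment speaks only of "a predetermined amount".

```python
def angular_difference_deg(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, handling wrap-around at 360."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def accept_fallback(
    last_personal_dir_deg: float,
    fallback_dir_deg: float,
    max_deviation_deg: float = 15.0,
) -> bool:
    """Return True if the lower-level result is close enough to the last personal result.

    A False result corresponds to the case where tracking should be
    terminated or interrupted because a wrong object is likely.
    """
    return angular_difference_deg(last_personal_dir_deg, fallback_dir_deg) <= max_deviation_deg
```

The wrap-around handling matters when directions are expressed over a full turn, for example 350 degrees and 10 degrees, which are only 20 degrees apart.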
[0070]
In the present embodiment, tracking is started using the recognition result of the personal recognition unit 13, which is at the top of the recognizers that recognize a person's face. However, if, for example, only the skin color recognition unit 11 has succeeded in recognition when a tracking start command is issued, the recognition integration unit 21 may perform tracking based on the recognition result of the skin color recognition unit 11, and if recognition by the face recognition unit 12 or the personal recognition unit 13 succeeds during tracking, the recognition level may be raised, the reverse of the case described above.
[0071]
(1-3) When having recognizers of different recognition types
Next, an example of how recognizers are used in a tracking system that includes, in addition to the hierarchical image recognition unit illustrated in FIG. 6, the voice direction recognition unit 14, which is a recognizer of a different type, will be described. There may be cases where all of the hierarchized image recognition units shown in FIG. 6 fail, for example because no usable input image can be obtained owing to a change in illumination conditions. In such cases, tracking can be continued by providing recognizers of a different type.
[0072]
As shown in FIG. 7, suppose that at a certain time T3, personal recognition, face recognition, and skin color recognition of the person being tracked have all become impossible, and a voice is input from the voice input device 252. In this case, even though all image recognition is impossible, if the recognition result of the voice direction recognition unit 14 is obtained, it can be interpreted (predicted) that the person is still present in the voice direction. In this way, even when image recognition fails, the prediction unit 31 uses the recognition result of the voice direction recognition unit 14, sets that direction as the predicted direction of the object, and continues tracking for a certain period, which improves the robustness of tracking. Of course, if the person being tracked still cannot be recognized by the personal recognition unit 13 after tracking by the prediction unit 31 has continued for a certain time or longer, tracking is interrupted or terminated. As described above, this is to avoid continuing to track when the person has actually gone, or tracking the wrong person. Also, if the voice direction recognition unit 14 has succeeded in recognition when the tracking start command is supplied, its recognition result may be used to start tracking.
[0073]
Although only the voice direction recognition unit 14, which recognizes the voice direction, is described here, the voice recognition units may also have a layered structure like the image recognition units. For example, a speaker recognition unit that identifies who is speaking may be provided in the layer above the voice direction recognition unit 14, and a speech content recognition unit or the like that recognizes what that person is saying may be provided in the layer above the speaker recognition unit. For speaker recognition, for example, a registered speaker database is prepared, and the speaker can be identified by comparing the input speech with the data in the registered speaker database. For the speech content recognition unit, for example, a database in which a plurality of words or short sentences are registered is prepared, and one possible method is to compare the input with the data in the database to check whether there is a response of a predetermined pattern to a call from the robot apparatus, and thereby determine whether a dialogue with the user is possible.
[0074]
(1-4) Tracking after interrupt operation
Next, a case will be described in which the action selection execution unit allows another action to interrupt during execution of a certain action. As will be described later, this robot apparatus can autonomously select and express actions according to its internal state and external stimuli; therefore, when there is an action with higher execution priority than the tracking action, it can interrupt the tracking operation and perform that action as an interrupt operation.
[0075]
For example, while the robot apparatus is talking with a person A, that is, while tracking person A, it may be called by another person B, for example from behind, have a conversation with person B, and then return to the original conversation partner A. In such a case, as shown in FIG. 8, tracking of person A is interrupted by an interrupt operation (another action) 32 of interacting with person B. The recognition integration unit 21 or the prediction unit 31 has storage means (not shown) that stores the position (direction) of person A at the time of interruption so that the conversation with the original partner A can be resumed, that is, so that person A can be tracked again. When tracking is to be resumed, the recognition integration unit 21 and/or the prediction unit 31 is notified that the interrupt operation 32 has ended and is instructed to resume tracking. Tracking of the person is then resumed by the same method as described above.
[0076]
That is, if person A was stationary before tracking was interrupted, it is predicted that person A is still at that position, and if movement of person A had been detected, the current position is predicted from the motion vector detected up to the time of interruption. If an object can be recognized by any recognition unit in the predicted direction, tracking can be resumed. At this time, as described above, if the position (direction) of the recognized object differs greatly from the predicted direction, tracking can be kept from resuming. In addition, if the individual cannot be recognized even after a predetermined time has elapsed following the resumption of tracking, tracking is terminated.
[0077]
(1-5) Tracking termination conditions
This tracking system performs robust tracking using the recognition results of multiple recognizers. However, so that it does not track the wrong object by continuing to use the recognition results of recognizers with low recognition levels, or by continuing to predict the object direction even though no recognition result is obtained, it has a function of terminating or interrupting tracking under certain conditions.
[0078]
For example, as shown in FIG. 9, when tracking has been started by recognition of the personal recognition unit 13 and the personal recognition unit 13 then fails, the recognition results of other recognition units, such as the face recognition unit 12, the skin color recognition unit 11, and the voice direction recognition unit 14, are used. When none of these recognition results can be obtained, as described above, the prediction unit 31 predicts the position of the person from the person's movement immediately before the last successful recognition unit failed, and tracking is performed by turning the neck in the predicted direction. During this time, if the individual, a face, or a skin color region reappears near the predicted position (direction), or if a voice direction is recognized near the predicted direction, tracking is resumed with that as the tracking target. However, if the state in which the face cannot be recognized as the face of the individual for whom tracking was first started continues, tracking is stopped even if a face or skin color is visible. That is, tracking based only on the recognition results of recognizers at a lower level than at the start of tracking, or tracking of the target by prediction based on recognition results (pseudo tracking), is not continued for a predetermined time or longer.
[0079]
For this reason, for example, a timer 33 is provided between the recognition integration unit 21 and the prediction unit 31, and tracking is terminated when a predetermined time elapses. As described above, this is to prevent cases where, with the individual unrecognized for a long time, the robot apparatus mistakes a completely different person for the target and continues the dialogue action, or keeps tracking a completely different face.
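The role of the timer 33 can be illustrated with a small Python sketch that measures how long tracking has continued without successful personal recognition. The class name `PseudoTrackingTimer`, the explicit passing of the current time, and the timeout values used in the usage note are hypothetical; the embodiment only states that tracking ends after "a predetermined time".

```python
from typing import Optional


class PseudoTrackingTimer:
    """Limits how long tracking may continue without personal recognition."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.lost_since: Optional[float] = None  # time at which personal recognition was lost

    def update(self, now_s: float, personal_recognized: bool) -> bool:
        """Return True while tracking may continue, False once it should be terminated."""
        if personal_recognized:
            # Personal recognition recovered: reset the timer.
            self.lost_since = None
            return True
        if self.lost_since is None:
            self.lost_since = now_s
        return (now_s - self.lost_since) < self.timeout_s
```

With a timeout of 5 seconds, for instance, tracking by lower-level results or by prediction is allowed from the moment personal recognition is lost until 5 seconds later, and is allowed again as soon as the individual is re-identified.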
[0080]
In the tracking system of the present embodiment configured as described above, the recognition units have a hierarchical structure with different recognition levels, so that even if a higher-level recognizer fails, a lower-level recognition result can be used; in addition, the recognition results of recognizers of different types, such as image recognition and voice recognition, are used. By using the recognition results of these various recognizers in combination, robust tracking can be performed even when tracking is interrupted by an interrupt operation, and in the face of the instability of the recognition system caused by changes in illumination and the like.
[0081]
Next, a specific example of a sound source direction recognition method and a moving object detection method suitable for the above-described tracking system will be described.
[0082]
(2-1) Specific example: Sound source direction recognition method
Hereinafter, a specific example of the voice direction recognition method (estimation method) of the voice direction recognition unit 14 will be described in detail. Here, it is assumed that the movable range of the joint degrees of freedom shown in FIG. 2 described above is limited in order to make the robot apparatus appear more lifelike. For this reason, when sound is input from outside the movable range of the neck joint yaw axis 101 in FIG. 2, it is necessary to rotate the neck and trunk in cooperation with each other and turn toward the direction of the sound source.
[0083]
Therefore, the robot apparatus 1 in this specific example turns toward the sound source direction as shown in FIGS. 10 (a) to 10 (f). That is, when a voice is input from behind while the robot apparatus 1 is facing the right side of the figure as shown in FIG. 10 (a), it rotates its neck and rotates its trunk using its legs as shown in FIGS. 10 (b) to 10 (f), and turns toward the sound source.
[0084]
An example of such a turning operation in the sound source direction will be described with reference to the flowchart of FIG. First, in step S1, it is detected that a sound event has occurred when a sound having an amplitude equal to or greater than a predetermined threshold is input to a microphone included in the sound input device 252 shown in FIG.
[0085]
Next, in step S2, the sound source direction of the input sound event is estimated. As described above, the voice input device 252 includes a plurality of microphones, and the robot device 1 can estimate the sound source direction using these microphones. Specifically, as described for example in Oga, Yamazaki and Kaneda, “Acoustic System and Digital Processing” (Institute of Electronics, Information and Communication Engineers), p. 197, the sound source direction can be estimated using the one-to-one relationship between the sound source direction and the time difference between the signals received by the plurality of microphones.
[0086]
That is, as shown in FIG. 12, when a plane wave arriving from the θS direction is received by two microphones M1 and M2 placed a distance d apart, the received signals x1(t) and x2(t) of the microphones M1 and M2 satisfy the relationships shown in the following equations (1) and (2). Here, in equations (1) and (2), c is the speed of sound, and τS is the time difference between the signals received by the two microphones M1 and M2.
[0087]
[Expression 1]
$$x_2(t) = x_1(t - \tau_S) \qquad (1)$$
$$\tau_S = \frac{d \sin\theta_S}{c} \qquad (2)$$
[0088]
Therefore, if the time difference τS between the received sound signals x1 (t) and x2 (t) is known, the arrival direction of the sound wave, that is, the sound source direction can be obtained by the following equation (3).
[0089]
[Expression 2]
$$\theta_S = \sin^{-1}\!\left(\frac{c\,\tau_S}{d}\right) \qquad (3)$$
[0090]
Here, the time difference τS can be obtained from a cross-correlation function φ12 (τ) between the received sound signals x1 (t) and x2 (t) as shown in the following equation (4). Here, in the equation (4), E [•] is an expected value.
[0091]
[Expression 3]
$$\varphi_{12}(\tau) = E\left[x_1(t)\,x_2(t + \tau)\right] \qquad (4)$$
[0092]
From the above equations (1) and (4), the cross-correlation function φ12 (τ) is expressed as the following equation (5). Here, in Expression (5), φ11 (τ) is an autocorrelation function of the received sound signal x1 (t).
[0093]
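[Expression 4]
$$\varphi_{12}(\tau) = E\left[x_1(t)\,x_1(t + \tau - \tau_S)\right] = \varphi_{11}(\tau - \tau_S) \qquad (5)$$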
[0094]
Since the autocorrelation function φ11(τ) is known to take its maximum value at τ = 0, it follows from equation (5) that the cross-correlation function φ12(τ) takes its maximum value at τ = τS. Therefore, τS can be obtained by calculating the cross-correlation function φ12(τ) and finding the τ that gives its maximum value, and the arrival direction of the sound wave, that is, the sound source direction, can be obtained by substituting τS into equation (3) above.
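The estimation procedure described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `estimate_direction`, the sampling rate `fs`, and the default speed of sound of 343 m/s are assumptions, the expectation in the cross-correlation is replaced by a finite sum over samples, and the far-field relation τS = d·sin θS / c is assumed for converting the time difference into an angle.

```python
import math


def estimate_direction(x1, x2, fs, d, c=343.0):
    """Return (theta_s, tau_s) estimated from two microphone signals.

    x1, x2 : equal-length sample lists from microphones M1 and M2
    fs     : sampling rate in Hz (assumed parameter)
    d      : microphone spacing in metres
    c      : speed of sound in m/s (assumed default)
    """
    n = len(x1)
    # Physically possible delays satisfy |tau| <= d / c.
    max_lag = int(math.floor(d / c * fs))
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Finite-sum estimate of phi_12(tau) = E[x1(t) * x2(t + tau)].
        acc = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                acc += x1[t] * x2[u]
        if acc > best_val:
            best_lag, best_val = lag, acc
    tau_s = best_lag / fs
    # Clamp before asin to guard against rounding outside [-1, 1].
    s = max(-1.0, min(1.0, c * tau_s / d))
    return math.asin(s), tau_s
```

The search is restricted to lags that are physically possible for the given microphone spacing, which also keeps the brute-force correlation cheap.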
[0095]
Note that the above-described method for estimating the sound source direction is an example, and the present invention is not limited to this example.
[0096]
Returning to FIG. 5, in step S3, the difference between the direction in which the robot apparatus 1 is currently facing and the direction of the sound source is calculated, and the relative angle of the sound source direction with respect to the direction of the trunk is obtained.
[0097]
Subsequently, in step S4, the rotation angles of the neck joint and the trunk needed to rotate the head by the relative angle calculated in step S3 are determined, based on constraints such as the movable range of the neck joint yaw axis 101 shown in FIG. 2 and the maximum angle through which the trunk can be rotated in one turning motion using the legs. Depending on the sound source direction, only the rotation angle of the neck joint is determined. Although the robot apparatus 1 has the trunk yaw axis 106 as shown in FIG. 2, for simplicity it is assumed in this specific example that the trunk yaw axis 106 is not used. Needless to say, the robot can also turn toward the sound source by coordinating the whole body, using the neck, the waist, and the ground contact direction of the legs.
[0098]
This will be specifically described with reference to FIG. FIG. 13(a) shows an example in which the movable range of the neck of the robot apparatus 1 is ±Y degrees and the sound source S lies at a relative angle of X degrees with respect to the front direction of the robot apparatus 1. In this case, in order for the robot apparatus 1 to turn toward the sound source S, it must rotate the entire trunk by at least (X - Y) degrees using the legs and rotate the neck joint yaw axis 101 by Y degrees toward the sound source S, as shown in FIG. 13(b).
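The division of the turn between trunk and neck can be illustrated with a short sketch. The function name `split_turn` and the sign convention (positive and negative angles for the two turning directions) are assumptions for illustration, and the limit on how far the trunk can turn in one leg motion is ignored here.

```python
def split_turn(x_deg, neck_limit_deg):
    """Return (trunk_deg, neck_deg) whose sum equals x_deg.

    x_deg          : relative angle X of the sound source
    neck_limit_deg : neck movable range Y (the neck can turn +/-Y)
    """
    if abs(x_deg) <= neck_limit_deg:
        # The neck alone can reach the sound source.
        return 0.0, float(x_deg)
    sign = 1.0 if x_deg > 0 else -1.0
    neck = sign * neck_limit_deg   # neck turns by Y
    trunk = x_deg - neck           # trunk covers the remaining X - Y
    return trunk, neck
```

For example, with X = 120 and Y = 90 the trunk turns 30 degrees and the neck 90 degrees, while for X = 45 only the neck moves.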
[0099]
Next, in step S5, control information for each joint necessary for rotating by the angles obtained in step S4 is planned and executed, and the robot turns toward the sound source.
[0100]
Subsequently, in step S6, it is determined whether or not the robot needs to face the sound source squarely. For example, when the sound event is merely noise such as the sound of an object, it is judged in step S6 that facing it is unnecessary, and the process proceeds to step S7, where the trunk and the neck are returned to their original directions and the series of operations ends. On the other hand, when it is determined in step S6 that, for example, the robot apparatus 1 has found the face of a known person from the information of the image input apparatus 251 (FIG. 3) and that this person has called it, the process proceeds to step S8 in order to perform control for facing that direction squarely.
[0101]
Here, a means for detecting a human face can be realized, for example, by the method described in E. Osuna, R. Freund and F. Girosi, “Training support vector machines: an application to face detection”, CVPR '97, 1997. A means for recognizing the face of a specific person can be realized, for example, by the method described in B. Moghaddam and A. Pentland, “Probabilistic Visual Learning for Object Representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, July 1997.
[0102]
In step S8, the rotation angles of the trunk and the neck needed to face the sound source squarely are calculated. For example, as shown in FIG. 13(b) described above, when the neck joint yaw axis 101 is rotated by Y degrees in the current posture of the robot apparatus 1, that is, when the head is rotated by Y degrees with respect to the trunk, rotating the trunk by Y degrees while simultaneously rotating the neck joint yaw axis 101 by -Y degrees, as shown in FIG. 13(c), eliminates the twist of the neck while keeping the target object in view, so that the robot can face the direction of the sound source S squarely with a natural motion.
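The counter-rotation of step S8 can be sketched as a sequence of intermediate poses. The function name `untwist_steps` and the linear interpolation in equal steps are assumptions made for illustration; the point is only that the trunk angle plus the neck angle, that is, the gaze direction, stays constant at every step.

```python
def untwist_steps(trunk_deg, neck_deg, steps=4):
    """Return a list of (trunk_deg, neck_deg) poses that remove the neck twist.

    The trunk gains what the neck gives up, so trunk + neck (the gaze
    direction in the world frame) is the same in every pose.
    """
    poses = []
    for k in range(1, steps + 1):
        moved = neck_deg * k / steps
        poses.append((trunk_deg + moved, neck_deg - moved))
    return poses
```

With a trunk heading of 30 degrees and a neck yaw of 90 degrees, the final pose is a trunk heading of 120 degrees with the neck at 0 degrees.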
[0103]
Finally, in step S9, the operation calculated in step S8 is executed to face the sound source direction.
[0104]
As described above, the robot apparatus 1 can estimate the direction of the sound source and can turn toward the sound source with a natural motion by coordinating the whole body.
[0105]
Further, depending on the content of the sound event, the robot apparatus 1 eliminates the twist of the neck while keeping its gaze on the target object, and faces the sound source squarely with a natural motion. In particular, when called by a person, the robot can enhance its familiarity with that person by turning its body to face the caller squarely.
[0106]
Note that the above operation is realized by instructing from the thought control module 200 shown in FIG. 3 described above and controlling each actuator 350 by the motion control module 300.
[0107]
Here, when the relative angle of the sound source direction with respect to the direction of the trunk is obtained and the robot actually turns in that direction, it may be unable to recognize the target object. For example, the target object cannot be recognized if it is not within the viewing angle after the turn because of an error in estimating the sound source direction, or if the sound source direction is correct but the target object is too far away.
[0108]
Therefore, the robot apparatus 1 in this specific example can solve such a problem as follows. An example of this turning operation will be described with reference to the flowchart of FIG.
[0109]
As shown in FIG. 14, first, in step S10, it is detected that a sound event has occurred when a sound having an amplitude equal to or greater than a predetermined threshold is input to the microphone of the voice input device 252 shown in FIG. .
[0110]
Next, in step S11, the sound source direction of the input sound event is estimated.
[0111]
Subsequently, in step S12, the difference between the direction in which the robot apparatus 1 is currently facing and the direction of the sound source is calculated, and the relative angle of the sound source direction with respect to the direction of the trunk is obtained.
[0112]
Subsequently, in step S13, the rotation angles of the neck joint and the trunk needed to rotate the head by the relative angle calculated in step S12 are determined, based on constraints such as the movable range of the neck joint yaw axis 101 shown in FIG. 2 and the maximum angle through which the trunk can be rotated in one turning motion using the legs. At this time, the neck is not rotated to the limit of its movable range; a slight margin is left so that the head can be swung further to the left and right after turning.
[0113]
That is, as shown in FIG. 15(a), when the movable range of the neck of the robot apparatus 1 is ±Y degrees and the sound source S lies at a relative angle of X degrees with respect to the front direction of the robot apparatus 1, the entire trunk is rotated by X - (Y - Z) degrees using the legs and the neck joint yaw axis 101 is rotated by (Y - Z) degrees, as shown in FIG. 15(b), leaving a margin of Z degrees at the neck. As a result, the head can be swung further to the left and right after turning in the direction of the sound source S.
[0114]
Returning to FIG. 14, in step S14, control information for each joint necessary for rotating by the angles obtained in step S13 is planned and executed, and the robot turns toward the sound source.
[0115]
In step S15, it is determined whether or not the target object has been recognized in the sound source direction. For example, when the face of a person learned in advance is found by face detection or face recognition as described above, it is determined that the target object has been recognized in the sound source direction, and the process proceeds to step S16.
[0116]
In step S16, the recognized target object is set as a tracking target, and the direction of the neck and trunk is changed as the object moves thereafter, so that the target object is tracked, and the series of operations ends.
[0117]
On the other hand, if the target object cannot be recognized in step S15, the process proceeds to step S17 to determine whether or not the sound event is a voice. Specifically, this can be determined, for example, by statistically modeling speech and non-speech by the HMM (Hidden Markov Model) method and comparing their likelihoods. If it is determined in step S17 that the sound is not a voice, the sound event is judged to derive from something that does not need to be recognized, such as a door closing or noise, and the series of operations ends. If it is determined in step S17 that the sound is a voice, the process proceeds to step S18.
[0118]
In step S18, it is determined whether or not the sound source is close. Specifically, the distance to the sound source can be roughly estimated by a method such as that described in F. Asano, H. Asoh and T. Matsui, “Sound Source Localization and Separation in Near Field”, IEICE Trans. Fundamental, Vol. E83-A, No. 11, 2000. If it is determined in step S18 that the sound source is so far away that the performance of the image input device 251 and the target object recognition means makes it difficult to recognize the target object, then in the subsequent step S19 the robot apparatus 1 itself walks in the direction of the sound source until it reaches a distance at which the target object can be recognized, thereby ensuring the recognition accuracy of the target object. If the distance to the sound source is short in step S18, the process proceeds to step S21 without walking.
[0119]
In step S20, it is determined again whether or not the target object can be recognized. If the target object can be recognized in step S20, the process returns to step S16 to shift to the tracking process, and the series of operations ends. If the target object cannot be recognized in step S20, the process proceeds to step S21.
[0120]
In step S21, assuming that there is an error in the estimation of the sound source direction, the head is shaken up and down and left and right by rotating the neck joint pitch axis 102 and the neck joint yaw axis 101.
[0121]
Subsequently, in step S22, it is determined whether or not the target object has been recognized as a result of shaking the head vertically and horizontally in step S21. If the target object can be recognized in step S22, the process returns to step S16 to shift to the tracking process, and the series of operations ends. If the target object cannot be recognized in step S22, the estimate of the sound source direction is presumed to be largely incorrect, so the robot outputs a message to that effect in step S23, and the series of operations ends. Specifically, if the target object is a human being, the robot outputs a voice message such as “I cannot tell where you are. Could you speak to me again?” to request that the voice be input again, so that the series of processes can be executed again.
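The branching of steps S15 to S23 can be summarized in a small sketch. The function `search_after_turn` and its callback parameters are hypothetical stand-ins for the recognition, voice discrimination, distance estimation, walking, head-shaking, and voice output functions of the robot; only the order of the decisions follows the flowchart described above.

```python
def search_after_turn(recognized, is_voice, is_far,
                      walk_toward_source, shake_head, ask_again):
    """Return "track", "ignore", or "ask_again" following steps S15 to S23.

    All arguments are callables supplied by the caller (hypothetical hooks).
    """
    if recognized():              # S15: target seen right after turning
        return "track"            # S16
    if not is_voice():            # S17: e.g. a door or other noise
        return "ignore"
    if is_far():                  # S18: sound source too distant
        walk_toward_source()      # S19: approach the sound source
        if recognized():          # S20: check again after approaching
            return "track"        # S16
    shake_head()                  # S21: compensate for direction error
    if recognized():              # S22
        return "track"            # S16
    ask_again()                   # S23: request the voice input again
    return "ask_again"
```

Passing the robot's capabilities as callables keeps the decision order separate from how each capability is realized.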
[0122]
As described above, by approaching the sound source or swinging its face to the left and right, the robot apparatus 1 can recognize the target object even when the target object is not within the viewing angle after the turn because of an error in estimating the sound source direction, or when the sound source direction is correct but the target object is far away. In particular, because the rotation angle of the neck is set so that the head can still be swung further to the left and right after turning in the estimated sound source direction, the target object can be tracked with a natural movement.
[0123]
In the above example, the distance to the sound source is estimated and the face is swung after the robot approaches the sound source. However, the present invention is not limited to this. For example, if the accuracy of estimating the distance to the sound source is significantly lower than the accuracy of estimating its direction, the face may be swung before the robot approaches the sound source.
[0124]
In the above example, the robot apparatus 1 itself walks in the sound source direction to a distance at which the target object can be recognized, and then checks again whether the target object has been recognized. However, the present invention is not limited to this. For example, the robot may check again whether the target object has been recognized after approaching the sound source by a predetermined distance such as 50 cm.
[0125]
Furthermore, in the above-described example, face detection and face recognition are used as the means for recognizing the target object. However, the present invention is not limited to this, and a specific color or shape may be recognized instead.
[0126]
(2-2) Specific example: Moving object detection method
Next, a specific example will be described of a method by which the recognition integration unit 21 detects a moving object from the image recognition results, in particular a moving object detection method that can detect and track a moving object in a captured image by a simple technique. As described above, the robot apparatus 1 in this specific example recognizes the skin color region and the face region (including the face of a specific individual) of a person in image data captured by the image input device 251, such as a CCD camera, shown in FIG. When the recognized skin color or face region belongs to a moving object, the moving object is detected by a moving object detection unit (not shown) provided in the recognition integration unit 21, and the neck control command generation unit 22 and the leg units are controlled on the basis of the detection result, so that the robot apparatus 1 can face the direction of the detected moving object or track it.
[0127]
Here, when the moving object detection unit of the recognition integration unit 21 detects a moving object by generating a difference image between frames, the difference value becomes 0 when the moving object stops moving. For example, as shown in FIG. 16, when the difference image data D1 to D3 are generated from the image data P1 to P4 obtained by imaging a human at times t1 to t4, respectively, the difference data of the face disappears from the difference image data D3 if the face is stationary between times t3 and t4. That is, the disappearance of the moving object from the difference image data means not that the moving object has left the scene, but that it is present at the location where it disappeared.
[0128]
Therefore, the robot device 1 in this specific example detects the point in time at which the difference becomes 0, and either turns the image input device 251, such as a CCD camera, toward the center of gravity position in the immediately preceding difference image, or controls the leg units so as to approach that center of gravity position. That is, as shown in the flowchart of FIG. 17, first, in step S21, the moving object is detected by calculating the center of gravity position of the difference image data, and in step S22, it is determined whether or not the detected moving object has disappeared from the difference image data. If the moving object has not disappeared in step S22 (No), the process returns to step S21. On the other hand, if the moving object has disappeared in step S22 (Yes), the process proceeds to step S23, and the robot turns toward or approaches the direction of disappearance, that is, the direction of the center of gravity position in the immediately preceding difference image.
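The loop of FIG. 17 can be sketched as follows. The function name `follow_centroids` and the convention of representing an empty difference image by `None` are assumptions for illustration; an actual system would issue a motion command where this sketch merely records the target.

```python
def follow_centroids(centroids):
    """Return the centers of gravity to turn toward, one per disappearance.

    centroids : sequence of (x, y) tuples, or None for a frame in which the
                moving object is absent from the difference image
    """
    last = None
    targets = []
    for g in centroids:
        if g is not None:
            last = g                  # S21: moving object detected
        elif last is not None:        # S22: it has just disappeared
            targets.append(last)      # S23: face the last center of gravity
            last = None               # react only once per disappearance
    return targets
```

For the sequence (1, 2), (3, 4), None, None, (5, 6), None, the sketch selects (3, 4) and then (5, 6) as the positions to face.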
[0129]
Note that even when the detected moving object is out of the visual range of the robot apparatus 1, the moving object disappears from the difference image, but in this case as well, by facing the direction of the center of gravity detected last in step S23 described above, It can face the direction of the moving body.
[0130]
As described above, when the moving object comes to rest within the visual range, the robot device 1 in this specific example detects the timing at which the moving object disappears from the difference image data and turns toward the center of gravity detected immediately before. This realizes an autonomous interaction in which, for example, the robot senses the presence of the moving object and turns toward it. In addition, the predicting unit 31 detects that the moving object has disappeared from the difference image data because it left the visual range, and the robot turns toward the center of gravity detected last, so that it can face approximately toward the moving object.
[0131]
The robot apparatus 1 may track the moving object not only when it disappears from the difference image data, but also by turning toward the detected center of gravity at predetermined time intervals or each time the center of gravity of the moving object is about to leave the visual range. That is, as shown in the flowchart of FIG. 18, first, in step S30, the moving object is detected by calculating the center of gravity of the difference image data, and in step S31, the robot turns toward the detected center of gravity at predetermined time intervals or each time the moving object is about to leave the visual range.
[0132]
Here, in addition to the case where the moving object disappears from the difference image data as described above, the robot apparatus 1 also loses the moving object when its own movement in step S31 is so large that motion compensation can no longer distinguish its own movement from that of the moving object. Therefore, in step S32, it is determined whether or not the moving object has been lost. If the moving object has not been lost in step S32 (No), the process returns to step S30. On the other hand, if the moving object has been lost in step S32 (Yes), the process proceeds to step S33, and the robot turns toward the center of gravity detected last.
[0133]
As described above, the robot apparatus 1 in this specific example turns toward the detected center of gravity at predetermined time intervals or each time the moving object is about to leave the visual range and, when the moving object is lost, turns toward the center of gravity detected last. It can thereby detect and track a moving object in images captured by the image input device 251 provided in the head unit by a simple method.
[0134]
In the above processing, the recognition system 71 of the middleware layer 50 shown in FIG. 22 described later includes the plurality of recognition units 15 and the recognition integration unit 21 shown in FIG. A motion detection signal processing module 68. This motion detection signal processing module 68 is realized by the difference image generation module 110 and the centroid calculation module 111 shown in FIG.
[0135]
That is, as shown in FIG. 19, the virtual robot 43 of the robotic server object 42 reads frame-by-frame image data captured by the CCD camera 22 from the DRAM 11, and sends this image data to the difference image generation module 110 of the recognition integration unit 21 included in the recognition system 71 of the middleware layer 50.
[0136]
Each time image data is input, the difference image generation module 110 generates difference image data by taking the difference from the image data of the preceding frame adjacent on the time axis, and gives this difference image data to the centroid calculation module 111. For example, when the difference image data D2 between the image data P2 and the image data P3 described above is generated, the luminance value D2(i, j) of the difference image data D2 at the position (i, j) is obtained by subtracting the luminance value P2(i, j) of the image data P2 at that position from the luminance value P3(i, j) of the image data P3 at the same position. The difference image generation module 110 generates the difference image data D2 by performing the same calculation for all the pixels, and gives the difference image data D2 to the centroid calculation module 111.
[0137]
Then, the center-of-gravity calculation module 111 calculates the center-of-gravity position G (x, y) for the portion of the difference image data whose luminance value is greater than the threshold value Th. Here, x and y are calculated using the following equations (6) and (7), respectively.
[0138]
[Expression 5]

[0139]
Thereby, as shown in FIG. 20, for example, the gravity center position G2 is obtained from the difference image data D2 between the image data P2 and the image data P3 described above.
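As a rough illustration of the two modules described above, the following sketch computes a difference image and the center of gravity of its above-threshold pixels. Several points are assumptions rather than the method of the embodiment: images are nested lists indexed as `image[row][column]`, the threshold is applied to the absolute value of the signed difference, and the center of gravity is taken as the unweighted mean of the coordinates of the selected pixels. The function names are hypothetical.

```python
def difference_image(prev, curr):
    """Return curr - prev pixel by pixel, e.g. D2(i, j) = P3(i, j) - P2(i, j)."""
    return [[c - p for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def centroid(diff, th):
    """Return (x, y) of the center of gravity of pixels with |value| > th.

    Returns None when no pixel exceeds the threshold, that is, when the
    moving object is absent from the difference image.
    """
    sx = sy = n = 0
    for y, row in enumerate(diff):
        for x, v in enumerate(row):
            if abs(v) > th:       # assumed: threshold on the magnitude
                sx += x
                sy += y
                n += 1
    if n == 0:
        return None
    return (sx / n, sy / n)
```

Returning `None` for an empty difference image gives downstream logic a direct way to detect that the moving object has stopped or left the visual range.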
[0140]
The center-of-gravity calculation module 111 sends the obtained center-of-gravity position data to the behavior model library 90 of the application layer 51.
[0141]
The behavior model library 90 determines the subsequent behavior while referring as necessary to the emotion parameter values and the desire parameter values, and gives the determination result to the behavior switching module 91. For example, when the moving object disappears from the difference image data, it determines a behavior of facing or approaching the center of gravity position detected immediately before, and gives the determination result to the behavior switching module 91. When tracking a moving object at predetermined time intervals, it determines a behavior of facing or approaching the center of gravity position detected at each time interval, and gives the determination result to the behavior switching module 91. The behavior switching module 91 then sends a behavior command based on the determination result to the tracking signal processing module 73 in the output system 80 of the middleware layer 50.
[0142]
When a behavior command is given, the tracking signal processing module 73 generates, on the basis of the behavior command, servo command values to be given to the corresponding actuators 28₁ to 28ₙ in order to perform the behavior, and sends this data sequentially to the corresponding actuators 28₁ to 28ₙ via the virtual robot 43 of the robotic server object 42 and the signal processing circuit 14 (FIG. 2).
[0143]
As a result, for example, when the moving object disappears from the difference image data, the behavior model library 90 determines a behavior of facing or approaching the center of gravity position detected immediately before, and the behavior switching module 91 generates a behavior command for causing the robot to perform that behavior. When tracking a moving object at predetermined time intervals, the behavior model library 90 determines a behavior of facing or approaching the center of gravity position detected at each time interval, and the behavior switching module 91 generates a behavior command for causing the robot to perform that behavior.
[0144]
When this behavior command is given to the tracking signal processing module 73, such as the neck control command generation unit 22 shown in FIG. 4 described above, the tracking signal processing module 73 sends servo command values based on the behavior command to the corresponding actuators 28₁ to 28ₙ via the output unit. As a result, the robot device 1 exhibits behavior in which it shows interest in the moving object, turning its head toward the moving object or approaching it.
[0145]
It should be noted that the present invention is not limited to the above-described embodiments, and various modifications can be made without departing from the scope of the present invention. In the above-described embodiment, the tracking of a person's face has been described. However, the present invention can be widely applied to other objects. For example, as described above, in the case of a ball or the like, recognition units that recognize the color, shape (shape and size), pattern, and so on of the ball can be provided, tracking can be continued as long as any of them is recognized, and tracking can be stopped if the period in which none can be recognized continues for a long time.
[0146]
(3) Control program software configuration
The robot apparatus 1 as described above can act autonomously according to its own situation and that of its surroundings, and according to instructions and actions from the user. Next, the software configuration of the control program for such a robot apparatus will be described in detail. FIG. 21 is a block diagram showing the software configuration of the robot apparatus according to the present embodiment. In FIG. 21, a device driver layer 40 is located in the lowest layer of this control program and consists of a device driver set 41 made up of a plurality of device drivers. Each device driver is an object that is allowed to directly access hardware used in an ordinary computer, such as the image input device 251 (FIG. 3), for example a CCD camera, or a timer, and performs processing in response to an interrupt from the corresponding hardware.
[0147]
The robotic server object 42 is located in the lowest layer of the device driver layer 40, and consists of: a virtual robot 43, which is a software group that provides an interface for accessing hardware such as the various sensors and actuators described above; a power manager 44, which is a software group that manages power supply switching and the like; a device driver manager 45, which is a software group that manages various other device drivers; and a designed robot 46, which is a software group that manages the mechanism of the robot apparatus 1.
[0148]
The manager object 47 includes an object manager 48 and a service manager 49. The object manager 48 is a software group that manages the activation and termination of the software groups included in the robotic server object 42, the middleware layer 50, and the application layer 51. The service manager 49 is a software group that manages the connections between objects on the basis of inter-object connection information described in a connection file stored, for example, in a memory card.
[0149]
The middleware layer 50 is located in an upper layer of the robotic server object 42, and is composed of a software group that provides basic functions of the robot apparatus 1 such as image processing and sound processing. The application layer 51 is located in an upper layer of the middleware layer 50, and determines the behavior of the robot apparatus 1 based on the processing result processed by each software group constituting the middleware layer 50. It is composed of software groups.
[0150]
The specific software configurations of the middleware layer 50 and the application layer 51 are shown in FIG.
[0151]
As shown in FIG. 22, the middleware layer 50 comprises: a recognition system 71 having signal processing modules 60 to 69 for noise detection, temperature detection, brightness detection, scale recognition, distance detection, posture detection, contact detection, operation input detection, motion detection, and color recognition, together with an input semantic converter module 70; and an output system 80 having an output semantic converter module 79 together with signal processing modules 72 to 78 for posture management, tracking, motion reproduction, walking, fall recovery, LED lighting, and sound reproduction.
[0152]
Each of the signal processing modules 60 to 69 of the recognition system 71 takes in the corresponding data among the sensor data, image data, and audio data read from the DRAM by the virtual robot 43 of the robotic server object 42, and Based on the above, predetermined processing is performed, and the processing result is given to the input semantic converter module 70. Here, for example, the virtual robot 43 is configured as a part for transmitting / receiving or converting signals according to a predetermined communication protocol.
[0153]
Based on the processing results given from these signal processing modules 60 to 69, the input semantic converter module 70 recognizes the situation of the robot itself and its surroundings, such as “noisy”, “hot”, “bright”, “heard the do-mi-so scale”, “detected an obstacle”, “detected a fall”, “was struck”, “was praised”, “detected a moving object”, or “detected a ball”, as well as commands and actions from the user, and outputs the recognition results to the application layer 51 (FIG. 21).
[0154]
As shown in FIG. 23, the application layer 51 includes five modules: a behavior model library 90, a behavior switching module 91, a learning module 92, an emotion model 93, and an instinct model 94.
[0155]
In the behavior model library 90, as shown in FIG. 24, independent behavior models 90₁ to 90ₙ are provided in correspondence with several preselected condition items, such as “when the remaining battery level is low”, “when recovering from a fall”, “when avoiding an obstacle”, “when expressing emotion”, and “when a ball is detected”.
[0156]
Each of these behavior models 90₁ to 90ₙ determines the next action when a recognition result is given from the input semantic converter module 70, or when a certain time has elapsed since the last recognition result was given, referring as necessary to the corresponding emotion parameter value held in the emotion model 93 described later and the corresponding desire parameter value held in the instinct model 94, and outputs the determination result to the behavior switching module 91.
[0157]
In this specific example, each of the behavior models 90₁ to 90ₙ uses, as a method for determining the next action, an algorithm called a finite probability automaton. This algorithm probabilistically determines to which of the nodes NODE₀ to NODEₙ a transition is made from one node (state) among NODE₀ to NODEₙ, on the basis of the transition probabilities P₁ to Pₙ set for the arcs ARC₁ to ARCₙ connecting the nodes NODE₀ to NODEₙ, as shown in FIG.
[0158]
Specifically, each behavior model 90_1 to 90_n has a state transition table 270, as shown in FIG. 26, for each of the nodes NODE_0 to NODE_n that form that behavior model.
[0159]
In this state transition table 270, the input events (recognition results) used as transition conditions at the node NODE_0 to NODE_n are listed in priority order in the “input event name” column, and further conditions on each transition condition are described in the corresponding rows of the “data name” and “data range” columns.
[0160]
Therefore, at the node NODE_100 represented by the state transition table 270 of FIG. 26, the conditions for transitioning to another node are that, when the recognition result “ball detected (BALL)” is given, the “size (SIZE)” of the ball given together with the recognition result is in the range of “0 to 1000”, or that, when the recognition result “obstacle detected (OBSTACLE)” is given, the “distance (DISTANCE)” to the obstacle given together with the recognition result is in the range of “0 to 100”.
[0161]
Further, at this node NODE_100, even when no recognition result is input, a transition to another node is possible when, among the emotion and desire parameter values held in the emotion model 93 and the instinct model 94 and periodically referred to by the behavior models 90_1 to 90_n, any of the parameter values of “joy (JOY)”, “surprise (SURPRISE)”, or “sadness (SADNESS)” held in the emotion model 93 is in the range of “50 to 100”.
[0162]
In the state transition table 270, the names of the nodes to which a transition can be made from the node NODE_0 to NODE_n are listed in the “transition destination node” row of the “transition probability to other nodes” column. The probability of transitioning to each of the other nodes NODE_0 to NODE_n when all the conditions described in the “input event name”, “data name”, and “data range” columns are met is described at the corresponding place in the “transition probability to other nodes” column, and the action to be output when transitioning to that node NODE_0 to NODE_n is described in the “output action” row of the “transition probability to other nodes” column. The sum of the probabilities in each row of the “transition probability to other nodes” column is 100 [%].
[0163]
Therefore, at the node NODE_100 represented by the state transition table 270 of FIG. 26, when, for example, the recognition result “ball detected (BALL)” is given with the “size (SIZE)” of the ball in the range of “0 to 1000”, a transition to “node NODE_120 (node 120)” can be made with a probability of “30 [%]”, and the action “ACTION 1” is output at that time.
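As a rough illustration of how one row of such a state transition table might be evaluated, the following Python sketch checks the input event and data range and then samples a destination. Only the BALL / SIZE “0 to 1000” condition, the OBSTACLE / DISTANCE “0 to 100” condition, and the 30 [%] transition to NODE_120 with “ACTION 1” come from the text. The `Row` class, the `step` function, the remaining 70 [%], the “ACTION 0” label, and the destinations of the OBSTACLE row are hypothetical placeholders.

```python
import random
from dataclasses import dataclass

@dataclass
class Row:
    event: str          # "input event name" column
    data_name: str      # "data name" column
    low: float          # lower bound of the "data range" column
    high: float         # upper bound of the "data range" column
    destinations: dict  # node name -> (probability in %, output action)

def step(rows, event, data, rng=random):
    """Return (destination node, output action), or None if no row matches."""
    for row in rows:  # rows are assumed to be listed in priority order
        value = data.get(row.data_name)
        if row.event == event and value is not None and row.low <= value <= row.high:
            nodes = list(row.destinations)
            weights = [row.destinations[n][0] for n in nodes]
            node = rng.choices(nodes, weights=weights, k=1)[0]
            return node, row.destinations[node][1]
    return None

# Table for NODE_100; entries other than the 30 % / ACTION 1 arc are invented.
NODE_100 = [
    Row("BALL", "SIZE", 0, 1000,
        {"NODE_120": (30.0, "ACTION 1"), "NODE_100": (70.0, "ACTION 0")}),
    Row("OBSTACLE", "DISTANCE", 0, 100,
        {"NODE_100": (100.0, "ACTION 0")}),
]

result = step(NODE_100, "BALL", {"SIZE": 500}, random.Random(0))
```

A ball whose size falls outside the range, for example `step(NODE_100, "BALL", {"SIZE": 2000})`, matches no row and yields `None`, meaning no transition occurs for that input.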
[0164]
Each behavior model 90_1 to 90_n is configured as a number of such nodes NODE_0 to NODE_n, each described by a state transition table 270, connected together. When a recognition result is given from the input semantic converter module 70, for example, the behavior model determines the next action probabilistically using the state transition table of the corresponding node NODE_0 to NODE_n and outputs the determination result to the behavior switching module 91.
[0165]
The behavior switching module 91 shown in FIG. 24 selects, from among the behaviors output by the behavior models 90_1 to 90_n of the behavior model library 90, the behavior output by the behavior model 90_1 to 90_n with the highest predetermined priority, and sends that behavior to the output semantic converter module 79 of the middleware layer 50. In this embodiment, the lower a behavior model 90_1 to 90_n is shown in FIG. 24, the higher its priority is set.
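The priority-based selection can be sketched in a few lines of Python. The function name `select_behavior`, the tuple layout, the numeric priorities, and the model and action names are assumptions for illustration; the specification states only that the output of the behavior model with the highest predetermined priority is chosen.

```python
def select_behavior(outputs):
    """Choose the action from the highest-priority model that produced one.

    outputs: list of (priority, model_name, action) tuples, where a larger
    priority number means higher priority and action is None when the model
    produced no output in this cycle.
    """
    candidates = [o for o in outputs if o[2] is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda o: o[0])[2]

# Hypothetical outputs from three behavior models in one cycle
outputs = [
    (1, "express_emotion", "ACTION_A"),
    (3, "low_battery", None),        # highest priority, but nothing to do
    (2, "ball_detected", "ACTION_B"),
]
chosen = select_behavior(outputs)
```

Here the low-battery model has the highest priority but produced no output, so the ball-detection model's action is selected.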
[0166]
Further, the behavior switching module 91 notifies the learning module 92, the emotion model 93, and the instinct model 94 that the behavior is completed based on the behavior completion information given from the output semantic converter module 79 after the behavior is completed.
[0167]
Meanwhile, among the recognition results given from the input semantic converter module 70, the learning module 92 receives the recognition results of teaching received as actions from the user, such as “hit” or “stroked”. Based on such a recognition result and the notification from the behavior switching module 91, the learning module 92 changes the corresponding transition probabilities of the corresponding behavior models 90_1 to 90_n in the behavior model library 90 so as to decrease the probability of a behavior when the robot is “hit (scolded)” and increase it when the robot is “stroked (praised)”.
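One possible way to adjust a transition probability while keeping each row at 100 [%] is sketched below in Python. The specification does not give the update rule, so the fixed step size, the renormalization, and the names `reinforce`, `"praised"`, and `"scolded"` are all assumptions made for illustration.

```python
def reinforce(probabilities, node, feedback, step=5.0):
    """Raise or lower one transition probability and renormalize to 100 [%].

    probabilities: {node_name: probability in percent} for one table row.
    feedback: "praised" raises the probability of `node`, "scolded" lowers it.
    The step size and renormalization scheme are illustrative assumptions.
    """
    delta = {"praised": step, "scolded": -step}.get(feedback, 0.0)
    updated = dict(probabilities)
    updated[node] = max(0.0, updated[node] + delta)
    total = sum(updated.values())
    if total == 0.0:
        return dict(probabilities)  # avoid dividing by zero; keep the old row
    return {n: 100.0 * p / total for n, p in updated.items()}

row = {"NODE_120": 30.0, "NODE_100": 70.0}
after_praise = reinforce(row, "NODE_120", "praised")
after_scold = reinforce(row, "NODE_120", "scolded")
```

Renormalizing after each change preserves the property that the probabilities in a row sum to 100 [%].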
[0168]
Meanwhile, the emotion model 93 holds, for each of a total of six emotions, “joy (JOY)”, “sadness (SADNESS)”, “anger (ANGER)”, “surprise (SURPRISE)”, “disgust (DISGUST)”, and “fear (FEAR)”, a parameter representing the strength of that emotion. The emotion model 93 periodically updates the parameter values of these emotions based on specific recognition results such as “hit” and “stroked” given from the input semantic converter module 70, the elapsed time, notifications from the behavior switching module 91, and the like.
[0169]
Specifically, let ΔE[t] be the amount of change in an emotion at that time, calculated by a predetermined arithmetic expression based on the recognition result given from the input semantic converter module 70, the behavior of the robot apparatus 1 at that time, the elapsed time since the last update, and the like; let E[t] be the current parameter value of the emotion; and let ke be the coefficient representing the sensitivity of the emotion. The emotion model 93 then calculates the parameter value E[t+1] of the emotion in the next cycle by the following equation (8) and updates the parameter value of the emotion by replacing the current parameter value E[t] with the result. The emotion model 93 updates the parameter values of all emotions in the same manner.
[0170]
[Formula 6]
$$E[t+1] = E[t] + k_e \times \Delta E[t] \qquad (8)$$
[0171]
How much each recognition result and each notification from the output semantic converter module 79 affects the change amount ΔE[t] of the parameter value of each emotion is determined in advance. For example, the recognition result “hit” has a large influence on the change amount ΔE[t] of the parameter value of the emotion “anger”, and the recognition result “stroked” has a large influence on the change amount ΔE[t] of the parameter value of the emotion “joy”.
[0172]
Here, the notification from the output semantic converter module 79 is so-called behavior feedback information (behavior completion information), that is, information on the result of expressing a behavior, and the emotion model 93 also changes the emotions based on such information. For example, a behavior such as “barking” lowers the level of the emotion of anger. The notification from the output semantic converter module 79 is also input to the learning module 92 described above, and the learning module 92 changes the corresponding transition probabilities of the behavior models 90_1 to 90_n based on the notification.
[0173]
Note that the feedback of the behavior result may instead be provided by the output of the behavior switching module 91 (the behavior to which emotion has been added).
[0174]
Meanwhile, the instinct model 94 holds, for each of four mutually independent desires, “exercise”, “affection”, “appetite”, and “curiosity”, a parameter representing the strength of that desire. The instinct model 94 periodically updates the parameter values of these desires based on the recognition result given from the input semantic converter module 70, the elapsed time, the notification from the behavior switching module 91, and the like.
[0175]
Specifically, for “exercise”, “affection”, and “curiosity”, let ΔI[k] be the change amount of the desire at that time, calculated by a predetermined arithmetic expression based on the recognition result, the elapsed time, the notification from the output semantic converter module 79, and the like; let I[k] be the current parameter value of the desire; and let ki be the coefficient representing the sensitivity of the desire. The instinct model 94 then calculates the parameter value I[k+1] of the desire in the next cycle by the following equation (9) at a predetermined cycle and updates the parameter value of the desire by replacing the current parameter value I[k] with the result. The instinct model 94 updates the parameter values of all desires except “appetite” in the same manner.
[0176]
[Expression 7]
$$I[k+1] = I[k] + k_i \times \Delta I[k] \qquad (9)$$
[0177]
How much each recognition result and each notification from the output semantic converter module 79 affects the change amount ΔI[k] of the parameter value of each desire is determined in advance. For example, the notification from the output semantic converter module 79 has a large influence on the change amount ΔI[k] of the parameter value of “fatigue”.
[0178]
In the present embodiment, the parameter values of each emotion and each desire (instinct) are regulated so as to vary within the range of 0 to 100, and the values of the coefficients ke and ki are set individually for each emotion and each desire.
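The update of an emotion or desire parameter, with the value held to the range of 0 to 100, can be sketched in Python as follows. The update form (current value plus sensitivity coefficient times change amount) follows the definitions of E[t], ΔE[t], ke, I[k], ΔI[k], and ki given above. The clamping by `min`/`max`, the function name `update_parameter`, and all numeric values are assumptions, since the specification does not say how the range is enforced or what the coefficients are.

```python
def update_parameter(current, delta, sensitivity, low=0.0, high=100.0):
    """Return the next-cycle value: current + sensitivity * delta, kept in [low, high]."""
    return min(high, max(low, current + sensitivity * delta))

# Hypothetical emotion state, sensitivity coefficients (ke) and change amounts
emotions = {"JOY": 50.0, "ANGER": 95.0, "SADNESS": 5.0}
ke = {"JOY": 0.5, "ANGER": 1.0, "SADNESS": 1.0}
delta_e = {"JOY": 10.0, "ANGER": 20.0, "SADNESS": -20.0}

# Every emotion is updated in the same manner, one value per cycle
emotions = {
    name: update_parameter(value, delta_e[name], ke[name])
    for name, value in emotions.items()
}
```

With these invented values, JOY moves from 50 to 55, while ANGER and SADNESS are held at the upper and lower limits of the range respectively. The same function applies to the desire parameters with ki and ΔI[k].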
[0179]
Meanwhile, as shown in FIG. 22, the output semantic converter module 79 of the middleware layer 50 gives behavior commands such as “move forward”, “be pleased”, “cry”, or “tracking (chase the ball)”, given from the behavior switching module 91 of the application layer 51 as described above, to the corresponding signal processing modules 72 to 78 of the output system 80.
[0180]
When a behavior command is given, these signal processing modules 72 to 78 generate, based on the behavior command, the servo command values to be given to the corresponding actuators to perform the behavior, the sound data of the sound to be output from the speaker, and/or the drive data to be given to the LEDs of the light emitting unit, and sequentially send these data to the corresponding actuators, speaker, or light emitting unit via the virtual robot 43 of the robotic server object 42.
[0181]
In this way, based on the control program, the robot apparatus 1 can act autonomously in accordance with its own (internal) situation, the surrounding (external) situation, and instructions and actions from the user.
[0182]
[Effects of the invention]
As described above in detail, the robot apparatus according to the present invention is a robot apparatus that tracks an object using at least a part of its body, and includes: a plurality of recognition means that recognize different kinds of things and/or a plurality of recognition means that recognize the same kind of thing and are hierarchized into two or more levels according to recognition level; recognition integration means for integrating the recognition results from the plurality of recognition means; action generating means for generating an action for tracking the object; and tracking control means for controlling the action generating means based on the recognition integration result. The tracking control means starts tracking the object based on the recognition result of a predetermined recognition means among the plurality of recognition means and, when the recognition result of the predetermined recognition means cannot be obtained, performs control so as to continue the tracking based on the recognition result of another recognition means different from the predetermined recognition means. Therefore, once tracking has been started based on the recognition result of the predetermined recognition means, tracking continues based on the recognition results of the other recognition means even if recognition by the predetermined recognition means fails. By using a plurality of recognition means in combination, extremely robust tracking can be performed even when tracking is interrupted by an interrupting action or the like, or when recognition becomes unstable because of changes in lighting conditions or the like.
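The fallback behavior summarized above can be illustrated with a minimal Python sketch. The recognizer names, the representation of a direction as a single angle, and the function `track_step` are assumptions for illustration. The sketch starts tracking only on a result from the predetermined (first-listed) recognizer, continues on any other recognizer's result, and otherwise keeps the last known direction as a simple stand-in for the predicted direction. Termination after a timeout is omitted for brevity.

```python
def track_step(results, order, tracking, last_direction):
    """One control cycle. Returns (tracking, direction).

    results: {recognizer_name: direction in degrees, or None if not recognized}
    order:   recognizer names, with the predetermined recognizer first
    """
    primary = order[0]
    if not tracking:
        # Tracking starts only from the predetermined recognizer's result
        if results.get(primary) is None:
            return False, last_direction
        return True, results[primary]
    # While tracking, fall back through the other recognizers in order
    for name in order:
        direction = results.get(name)
        if direction is not None:
            return True, direction
    # No recognizer succeeded: keep heading toward the last known direction
    return True, last_direction

order = ["individual", "face", "skin_color", "voice_direction"]
state = track_step({"individual": 5.0}, order, False, None)          # start
state = track_step({"skin_color": 7.0}, order, state[0], state[1])   # fall back
state = track_step({}, order, state[0], state[1])                    # predicted
```

In this run, tracking starts on the individual recognition result, continues on the skin color result when individual recognition fails, and finally holds the last direction when no result is available.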
[Brief description of the drawings]
FIG. 1 is a perspective view showing an external configuration of a robot apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram schematically illustrating a degree-of-freedom configuration model of the robot apparatus.
FIG. 3 is a diagram schematically showing a control system configuration of the robot apparatus.
FIG. 4 is a block diagram schematically showing a part constituting the tracking system in the control system of the robot apparatus according to the embodiment of the present invention.
FIG. 5 is a block diagram for explaining a function of a prediction unit in the tracking system.
FIG. 6 is a block diagram for explaining a function of a recognition unit hierarchized with different recognition levels in the tracking system.
FIG. 7 is a block diagram for explaining functions of different types of recognition units in the tracking system.
FIG. 8 is a block diagram for explaining a function of a prediction unit when an interrupting action occurs in the tracking system.
FIG. 9 is a block diagram for explaining a tracking end function in the tracking system.
FIG. 10 is a diagram showing a specific example of a voice direction recognition unit of the tracking system, and is a diagram for explaining a turning operation of the robot apparatus.
FIG. 11 is a diagram illustrating a specific example of a voice direction recognition unit of the tracking system, and is a flowchart for explaining an example of a turning operation of the robot apparatus.
FIG. 12 is a diagram illustrating a specific example of a voice direction recognition unit of the tracking system, and is a diagram illustrating a method for estimating a sound source direction.
FIG. 13 is a diagram showing a specific example of a voice direction recognition unit of the tracking system and explaining a turning operation of the robot apparatus, in which (a) shows the state before turning, (b) shows the state after turning, and (c) shows the robot apparatus facing the object.
FIG. 14 is a diagram illustrating a specific example of a voice direction recognition unit of the tracking system, and is a flowchart for explaining another example of the turning operation of the robot apparatus.
FIG. 15 is a diagram showing a specific example of the voice direction recognition unit of the tracking system and explaining another example of the turning operation of the robot apparatus, in which (a) shows the state before turning and (b) shows the state after turning.
FIG. 16 is a diagram illustrating a specific example of a moving object detection method of the tracking system, and illustrates an example in which a moving object disappears from difference image data.
FIG. 17 is a diagram illustrating a specific example of the moving object detection method of the tracking system, and is a flowchart for explaining a procedure in a case where the moving object is directed from the difference image data.
FIG. 18 is a diagram illustrating a specific example of a moving object detection method of the tracking system, and is a flowchart illustrating a procedure for tracking a moving object.
FIG. 19 is a diagram showing a specific example of the moving object detection method of the tracking system, and is a block diagram showing a software configuration of a part related to the moving object detection of the robot apparatus.
FIG. 20 is a diagram showing a specific example of a moving object detection method of the tracking system, and is a diagram for explaining an example of obtaining a gravity center position of difference image data.
FIG. 21 is a block diagram showing a software configuration of the robot apparatus according to the embodiment of the present invention.
FIG. 22 is a block diagram showing the configuration of a middleware layer in the software configuration of the robot apparatus according to the embodiment of the present invention.
FIG. 23 is a block diagram showing a configuration of an application layer in the software configuration of the robot apparatus according to the embodiment of the present invention.
FIG. 24 is a block diagram showing the configuration of the behavior model library of the application layer in the embodiment of the present invention.
FIG. 25 is a diagram for explaining a finite probability automaton serving as information for determining the behavior of the robot apparatus according to the embodiment of the present invention.
FIG. 26 is a diagram showing a state transition table prepared for each node of a finite probability automaton.
[Explanation of symbols]
1 robot apparatus, 10 tracking system, 11 skin color recognition unit, 12 face recognition unit, 13 individual recognition unit, 15 recognition unit, 21 recognition integration unit, 22 neck control command generation unit, 23 output unit, 31 prediction unit, 32 action, 33 timer,
42 robotic server object, 43 virtual robot, 50 middleware layer, 51 application layer, 68 motion detection signal processing module, 70 input semantic converter module, 71 recognition system, 73 tracking signal processing module, 79 output semantic converter module, 80 output system, 90 behavior model library, 91 behavior switching module, 93 emotion model, 94 instinct model, 110 difference image generation module, 111 center-of-gravity calculation module, 200 thought control module, 251 image input device, 252 voice input device, 253 voice output device, 300 motion control module, 350 actuator

Claims

A robot apparatus that tracks an object using at least a part of its body, comprising:
a plurality of recognition means that recognize different kinds of things, or a plurality of recognition means that recognize the same kind of thing and are hierarchized into two or more levels according to recognition level;
recognition integration means for integrating the recognition results from the plurality of recognition means;
action generating means for generating an action for tracking the object; and
tracking control means for controlling the action generating means based on the recognition integration result,
wherein the tracking control means starts tracking the object based on the recognition result of a predetermined recognition means among the plurality of recognition means and, when the recognition result of the predetermined recognition means cannot be obtained, performs control so as to continue the tracking based on the recognition result of another recognition means different from the predetermined recognition means,
the tracking is interrupted by an operation having a higher priority than the tracking operation,
the robot apparatus further comprises object storage means for storing the direction of the object immediately before the tracking is terminated or interrupted, and
when the higher-priority operation has been completed, the tracking control means performs control so as to start tracking again when the object is recognized by any of the plurality of recognition means in the stored direction.

The robot apparatus according to claim 1, further comprising prediction means for obtaining, when the recognition result from the other recognition means cannot be obtained, a predicted direction of the object based on the recognition results of the other recognition means obtained up to immediately before,
wherein the tracking control means performs control so as to continue the tracking based on the predicted direction.

The robot apparatus according to claim 1, wherein the other recognition means is a recognition means of the same kind as the predetermined recognition means at a lower level, or a recognition means that recognizes a different kind of thing from the predetermined recognition means.

The robot apparatus according to claim 1, wherein the tracking control means terminates or interrupts the tracking when the recognition result of the predetermined recognition means has not been obtained for a predetermined time.

The robot apparatus according to claim 1, wherein the tracking control means terminates or interrupts the tracking when the difference between the direction of the object immediately before the recognition result of the predetermined recognition means could no longer be obtained and the direction of the object obtained from the recognition result of the other recognition means exceeds a predetermined value.

The robot apparatus according to claim 1, wherein the plurality of recognition means that recognize different kinds of things include at least image recognition means and voice recognition means, and
when recognition results are obtained from both the image recognition means and the voice recognition means, the tracking control means performs control so as to continue the tracking based on the recognition result of the image recognition means.

The robot apparatus according to claim 2, wherein the prediction means detects the movement of the object immediately before the recognition result of the other recognition means could no longer be obtained and obtains the predicted direction based on the movement detection result.

The robot apparatus according to claim 2, wherein the prediction means stores the direction of the object detected based on the recognition result of the other recognition means and uses, as the predicted direction, the direction of the object immediately before the recognition result of the other recognition means could no longer be obtained.

The robot apparatus according to claim 1, further comprising object movement detecting means for detecting the movement of the object up to immediately before the tracking is terminated or interrupted,
wherein, when the higher-priority operation has been completed, the tracking control means performs control so as to start tracking again when the object is recognized by any of the plurality of recognition means based on the movement detection result.

The robot apparatus according to claim 1, wherein the plurality of recognition means that recognize the same kind of thing and are hierarchized into two or more levels according to recognition level are at least one of image recognition means for recognizing a person's face as the object, voice recognition means for recognizing a person's voice as the object, and object recognition means for recognizing a thing other than a person as the object,
the image recognition means is hierarchized into at least two of skin color recognition means for recognizing a skin color region in an image region as the object, face recognition means for determining whether or not the object is a face, and individual recognition means for identifying the face of the object as a specific individual,
the voice recognition means is hierarchized into at least two of voice direction recognition means for estimating the direction of a voice, speaker recognition means for estimating the speaker who produced the voice, and voice content recognition means for recognizing the content of the voice, and
the object recognition means is hierarchized into at least two of color recognition means for recognizing the color of the object in the image region, shape recognition means for recognizing the shape of the object, and pattern recognition means for recognizing the pattern of the object.

The robot apparatus according to claim 1, wherein the robot apparatus has an external shape modeled on an animal and tracks the object with at least its neck or torso.

An operation control method for a robot apparatus that tracks an object using at least a part of its body, comprising:
a recognition step of recognizing the object by a plurality of recognition means that recognize different kinds of things, or by a plurality of recognition means that recognize the same kind of thing and are hierarchized into two or more levels according to recognition level;
a recognition integration step of integrating the recognition results from the plurality of recognition means;
an action generating step of generating an action for tracking the object; and
a tracking control step of controlling the tracking operation based on the recognition integration result,
wherein, in the tracking control step, tracking of the object is started based on the recognition result of a predetermined recognition means among the plurality of recognition means and, when the recognition result of the predetermined recognition means cannot be obtained, control is performed so as to continue the tracking based on the recognition result of another recognition means different from the predetermined recognition means,
the tracking is interrupted by an operation having a higher priority than the tracking operation,
the method further comprises an object storage step of storing the direction of the object immediately before the tracking is terminated or interrupted, and
in the tracking control step, when the higher-priority operation has been completed, control is performed so as to start tracking again when the object is recognized by any of the plurality of recognition means in the stored direction.

A program for causing a robot apparatus to execute an operation of tracking an object using at least a part of its body, the program comprising:
a recognition step of recognizing the object by a plurality of recognition means that recognize different kinds of things, or by a plurality of recognition means that recognize the same kind of thing and are hierarchized into two or more levels according to recognition level;
a recognition integration step of integrating the recognition results from the plurality of recognition means;
an action generating step of generating an action for tracking the object; and
a tracking control step of controlling the tracking operation based on the recognition integration result,
wherein, in the tracking control step, tracking of the object is started based on the recognition result of a predetermined recognition means among the plurality of recognition means and, when the recognition result of the predetermined recognition means cannot be obtained, control is performed so as to continue the tracking based on the recognition result of another recognition means different from the predetermined recognition means,
the tracking is interrupted by an operation having a higher priority than the tracking operation,
the program further comprises an object storage step of storing the direction of the object immediately before the tracking is terminated or interrupted, and
in the tracking control step, when the higher-priority operation has been completed, control is performed so as to start tracking again when the object is recognized by any of the plurality of recognition means in the stored direction.