JP2004283927A

JP2004283927A - Robot control device, and method, recording medium and program

Info

Publication number: JP2004283927A
Application number: JP2003076993A
Authority: JP
Inventors: Hiroaki Ogawa; 浩明小川
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-03-20
Filing date: 2003-03-20
Publication date: 2004-10-14

Abstract

PROBLEM TO BE SOLVED: To improve the recognition precision of a voice recognition device installed on a robot by having the device recognize the voice of a recognition target, such as a user, only after confirming that the user is present. SOLUTION: A voice recognizing part 101A outputs a voice signal detected by microphones 82-1 to 82-N to a behavior deciding mechanism 103. A direction recognizing part 101B detects the direction of the sound source from the voice signal detected by the microphones 82-1 to 82-N and outputs it to the behavior deciding mechanism 103. The behavior deciding mechanism 103 controls an attitude transition mechanism 104 so that the robot turns to look in the direction of the sound source. After the robot has turned, a picture image recognizing part 101D, on detecting the face of the user in a picture image signal input by CCD cameras 81L and 81R, reports the detection to a control part 101a. The control part 101a then controls the voice recognizing part 101A to start a voice recognizing process. COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、ロボット制御装置および方法、記録媒体、並びにプログラムに関し、特に、音声認識処理を実行させる際、ユーザの存在を確認した上で音声認識処理を実行させることにより、ノイズによる音声の誤検出に基づいたエラーを抑制できるようにしたロボット制御装置および方法、記録媒体、並びにプログラムに関する。
【０００２】
【従来の技術】
近年においては、玩具等として、音声認識装置などの認識機能を備えたロボット（本明細書においては、ぬいぐるみ状のものを含む）が製品化されている。例えば、音声認識装置を備えたロボットでは、ユーザが発した音声を音声認識し、その音声認識結果に基づいて、ある仕草をしたり、合成音を出力する等の行動を自律的に行うようになされている。
【０００３】
音声認識装置を備えたロボットが、ユーザが発した音声を音声認識する場合、音声を発したユーザが、ロボットから遠く離れすぎているときには、ロボットに装着されているマイクロホンにより取得されるユーザの発した音声波形の信号値は減衰し、相対的に雑音レベルが高くなる。つまり、マイクロホンにより取得されたユーザの音声信号のＳ／Ｎ比（ＳｉｇｎａｌｔｏＮｏｉｓｅｒａｔｉｏ）は低くなる。また、一般に、ユーザ（発話者）とロボット（に装着されているマイクロホン）の距離が大きくなるほど、音声信号の波形は、残響特性の影響を強く受ける。従って、ユーザとロボットの距離が離れすぎているときには、ロボットの音声認識装置の認識精度は悪くなる。
【０００４】
反対に、ユーザとロボットの距離が近すぎるときには、ロボットに装着されているマイクロホンにより取得されるユーザの発した音声波形の信号値は、マイクロホンの検出可能な範囲を超えてしまう。従って、マイクロホンにより取得された音声波形は、飽和したものとなり、本来の音声波形より歪んだ波形となる。ユーザとロボットの距離が近すぎる場合には、ロボットの音声認識装置は、このような歪んだ波形を音声認識することとなるので、音声認識の精度は悪くなる。
【０００５】
そこで、音声認識結果とともに、周囲雑音の影響を検知する周囲雑音検知、入力音声のパワーが特定の閾値条件を満たす状況を検知するパワー不足検知、パワー過多検知などの状況検知を行い、音声認識結果と状況検知の結果を利用して、ロボットにおける音声認識精度劣化の問題に対処する方法が提案されている（例えば、非特許文献１参照）。
【０００６】
【非特許文献１】
岩沢，大中，藤田，「状況検知を利用したロボット用音声認識インタフェースの一手法とその評価」，人工知能学会研究会資料，社団法人人工知能学会，平成１４年１１月，ｐ．３３−３８
【０００７】
さらに、ロボット自体の動作音は、ノイズ源がマイクに近いため、音声認識の精度に大きな悪影響を及ぼす。例えば、両手を持つロボットがマイクロホンの近くに手を移動して、指などを動作させるとマイクロホンには非常に大きなノイズが入力されてしまう。また、２足歩行するロボットが固い床面の上を歩行すると足が床面に接地する音が大きくなり、マイクロホンに大きなノイズが入力されてしまう場合があった。
【０００８】
【発明が解決しようとする課題】
しかしながら、非特許文献１に示される方法では、ロボット自体の発生するノイズに関しては考慮されていない。そのため、例えば、ユーザがロボットに何も話しかけていないにもかかわらず、ロボットがロボット自身の発生するノイズを音声として検出してしまい、誤った音声認識結果を獲得し、誤った動作を行う場合があった。このため、ユーザが何もロボットに話しかけていないにもかかわらず、ロボットが不可解な動作を行ったり、不可解な音声を発生したりする恐れがあった。さらに、周囲にユーザが存在しない場合でも、あたかも周辺にユーザが存在しているかのように、動作してしまう恐れがあった。
【０００９】
本発明は、このような状況に鑑みてなされたものであり、例えば、ロボットに装着された、音声認識装置が、ユーザなど認識対象の音声を認識する場合において、ユーザの存在を確認してから音声を認識することにより、ロボットの音声認識装置の認識精度を向上させるものである。
【００１０】
【課題を解決するための手段】
本発明のロボット制御装置は、音声を検出する音声検出手段と、音声検出手段により検出された音声を認識する音声認識手段と、音声の音源の方向を検出する方向検出手段と、方向検出手段により検出された方向に対してユーザを検出するユーザ検出手段と、ユーザ検出手段によりユーザが検出された場合、音声認識手段による音声の認識を開始させるように制御する音声認識制御手段とを備えることを特徴とする。
【００１１】
映像を撮像する撮像手段と、方向検出手段により検出された方向を撮像するように、撮像手段を制御する撮像制御手段とをさらに設けるようにさせることができ、撮像制御手段により制御され、撮像手段が方向検出手段により検出された方向を撮像する場合、ユーザ検出手段には、撮像手段により撮像された映像に、ユーザの顔が撮像されているか否かにより、ユーザを検出させるようにすることができる。
【００１２】
前記撮像制御手段により、撮像手段が方向検出手段により検出された方向を撮像するように制御された場合、撮像手段により撮像された映像に、ユーザの顔が撮像されているか否かの頻度に基づいて、方向毎の信頼度を検出する信頼度検出手段をさらに設けるようにさせることができ、撮像制御手段には、方向検出手段により検出された方向において、信頼度検出手段により検出された、ユーザの顔が検出される信頼度に基づいて、方向を撮像するように撮像手段を制御させるようにすることができる。
【００１３】
本発明のロボット制御方法は、音声を検出する音声検出ステップと、音声検出ステップの処理で検出された音声を認識する音声認識ステップと、音声の音源の方向を検出する方向検出ステップと、方向検出ステップの処理で検出された方向に対してユーザを検出するユーザ検出ステップと、ユーザ検出ステップの処理でユーザが検出された場合、音声認識ステップの処理での音声の認識を開始させるように制御する音声認識制御ステップとを含むことを特徴とする。
【００１４】
本発明の記録媒体のプログラムは、音声を検出する音声検出ステップと、音声検出ステップの処理で検出された音声を認識する音声認識ステップと、音声の音源の方向を検出する方向検出ステップと、方向検出ステップの処理で検出された方向に対してユーザを検出するユーザ検出ステップと、ユーザ検出ステップの処理でユーザが検出された場合、音声認識ステップの処理での音声の認識を開始させるように制御する音声認識制御ステップとを含むことを特徴とする。
【００１５】
本発明のプログラムは、音声を検出する音声検出ステップと、音声検出ステップの処理で検出された音声を認識する音声認識ステップと、音声の音源の方向を検出する方向検出ステップと、方向検出ステップの処理で検出された方向に対してユーザを検出するユーザ検出ステップと、ユーザ検出ステップの処理でユーザが検出された場合、音声認識ステップの処理での音声の認識を開始させるように制御する音声認識制御ステップとを含む処理をコンピュータに実行させることを特徴とする。
【００１６】
本発明のロボット制御装置および方法、並びにプログラムにおいては、音声が検出され、検出された音声が認識され、音声の音源の方向が検出され、検出された方向に対してユーザが検出され、ユーザが検出された場合、音声の認識が開始させられる。
【００１７】
【発明の実施の形態】
以下に、本発明の実施例を説明するが、その前に、特許請求の範囲に記載の発明の各手段と以下の実施例との対応関係を明らかにするために、各手段の後の括弧内に、対応する実施例（但し、一例）を付加して、本発明の特徴を記述すると、次のようになる。
【００１８】
即ち、本発明のロボット制御装置は、音声を検出する音声検出手段（例えば、図７のマイクロホン８２−１乃至８２−Ｎ）と、音声検出手段により検出された音声を認識する音声認識手段（例えば、図７の音声認識部１０１Ａ）と、音声の音源の方向を検出する方向検出手段（例えば、図７の方向認識部１０１Ｂ）と、方向検出手段により検出された方向に対してユーザを検出するユーザ検出手段（例えば、図７の画像認識部１０１Ｄ）と、ユーザ検出手段によりユーザが検出された場合、音声認識手段による音声の認識を開始させるように制御する音声認識制御手段（例えば、図７の制御部１０１ａ）とを備えることを特徴とする。
【００１９】
なお、勿論この記載は、各手段を上記したものに限定することを意味するものではない。
【００２０】
図１は、本発明を適用した２足歩行タイプのロボット１の一実施の形態の構成を示す外装の概観斜視図である。ロボット１は、住環境その他の日常生活上の様々な場面における人的活動を支援する実用ロボットであり、内部状態（怒り、悲しみ、喜び、楽しみ等）に応じて行動できるほか、人間が行う基本的な動作を表出することができる。
【００２１】
図１で示されるように、ロボット１は、体幹部外装ユニット２の所定の位置に頭部外装ユニット３が連結されると共に、左右２つの腕部外装ユニット４Ｒ／Ｌ（Ｒｉｇｈｔ／Ｌｅｆｔ：右腕／左腕）と、左右２つの脚部外装ユニット５Ｒ／Ｌが連結されて構成されている。
【００２２】
次に、図２を参照して、ロボット１の内部の構成について説明する。尚、図２は、図１で示した外装部分に対して、それらの内部の構成を示すものである。
【００２３】
図２は、ロボット１の正面方向の内部の斜視図であり、図３は、ロボット１の背面方向からの内部の斜視図である。また、図４は、ロボット１の軸構成について説明するための斜視図である。
【００２４】
ロボット１は、胴体部ユニット１１の上部に頭部ユニット１２が配設されるとともに、胴体部ユニット１１の上部左右に、同様の構成を有する腕部ユニット１３Ａおよび１３Ｂが所定位置にそれぞれ取り付けられ、かつ、胴体部ユニット１１の下部左右に、同様の構成を有する脚部ユニット１４Ａおよび１４Ｂが所定位置にそれぞれ取り付けられることにより構成されている。頭部ユニット１２には、タッチセンサ５１、および、表示部５５が設けられている。
【００２５】
胴体部ユニット１１においては、体幹上部を形成するフレーム２１および体幹下部を形成する腰ベース２２が、腰関節機構２３を介して連結することにより構成されており、体幹下部の腰ベース２２に固定された腰関節機構２３のアクチュエータＡ１、および、アクチュエータＡ２をそれぞれ駆動することによって、体幹上部を、図４に示す直交するロール軸２４およびピッチ軸２５の回りに、それぞれ独立に回転させることができるようになされている。
【００２６】
また頭部ユニット１２は、フレーム２１の上端に固定された肩ベース２６の上面中央部に首関節機構２７を介して取り付けられており、首関節機構２７のアクチュエータＡ３およびＡ４をそれぞれ駆動することによって、図４に示す直交するピッチ軸２８およびヨー軸２９の回りに、それぞれ独立に回転させることができるようになされている。
【００２７】
更に、腕部ユニット１３Ａおよび１３Ｂは、肩関節機構３０を介して肩ベース２６の左右にそれぞれ取り付けられており、対応する肩関節機構３０のアクチュエータＡ５およびＡ６をそれぞれ駆動することによって、図４に示す、直交するピッチ軸３１およびロール軸３２の回りに、それぞれを独立に回転させることができるようになされている。
【００２８】
腕部ユニット１３Ａおよび１３Ｂは、上腕部を形成するアクチュエータＡ７の出力軸に、肘関節機構３３を介して、前腕部を形成するアクチュエータＡ８が連結され、前腕部の先端に手部３４が取り付けられることにより構成されている。
【００２９】
そして腕部ユニット１３Ａおよび１３Ｂでは、アクチュエータＡ７を駆動することによって、前腕部を図４に示すヨー軸３５に対して回転させることができ、アクチュエータＡ８を駆動することによって、前腕部を図４に示すピッチ軸３６に対して回転させることができるようになされている。
【００３０】
脚部ユニット１４Ａおよび１４Ｂは、股関節機構３７を介して、体幹下部の腰ベース２２にそれぞれ取り付けられており、対応する股関節機構３７のアクチュエータＡ９乃至Ａ１１をそれぞれ駆動することによって、図４に示す、互いに直交するヨー軸３８、ロール軸３９、およびピッチ軸４０に対して、それぞれ独立に回転させることができるようになされている。
【００３１】
脚部ユニット１４Ａおよび１４Ｂは、大腿部を形成するフレーム４１の下端が、膝関節機構４２を介して、下腿部を形成するフレーム４３に連結されるとともに、フレーム４３の下端が、足首関節機構４４を介して、足部４５に連結されることにより構成されている。
【００３２】
これにより脚部ユニット１４Ａおよび１４Ｂにおいては、膝関節機構４２を形成するアクチュエータＡ１２を駆動することによって、図４に示すピッチ軸４６に対して、下腿部を回転させることができ、また足首関節機構４４のアクチュエータＡ１３およびＡ１４をそれぞれ駆動することによって、図４に示す直交するピッチ軸４７およびロール軸４８に対して、足部４５をそれぞれ独立に回転させることができるようになされている。
【００３３】
また、胴体部ユニット１１の体幹下部を形成する腰ベース２２の背面側には、後述するメイン制御部６１や周辺回路６２（いずれも図５）などを内蔵したボックスである、制御ユニット５２が配設されている。
【００３４】
図５は、ロボット１のアクチュエータとその制御系等の構成例を示している。
【００３５】
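The control unit 52 houses the main control unit 61, which governs operation control of the entire robot 1, a peripheral circuit 62 including a power supply circuit and a communication circuit, a battery 74 (FIG. 6), and the like.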
【００３６】
そして、制御ユニット５２は、各構成ユニット（胴体部ユニット１１、頭部ユニット１２、腕部ユニット１３Ａおよび１３Ｂ、並びに、脚部ユニット１４Ａおよび１４Ｂ）内にそれぞれ配設されたサブ制御部６３Ａ乃至６３Ｄと接続されており、サブ制御部６３Ａ乃至６３Ｄに対して必要な電源電圧を供給したり、サブ制御部６３Ａ乃至６３Ｄと通信を行う。
【００３７】
また、サブ制御部６３Ａ乃至６３Ｄは、対応する構成ユニット内のアクチュエータＡ１乃至Ａ１４と、それぞれ接続されており、メイン制御部６１から供給された各種制御コマンドに基づいて、構成ユニット内のアクチュエータＡ１乃至Ａ１４を、指定された状態に駆動させるように制御する。
【００３８】
図６は、ロボット１の電気的な内部構成例を示すブロック図である。
【００３９】
頭部ユニット１２には、ロボット１の「目」として機能するＣＣＤ（ＣｈａｒｇｅＣｏｕｐｌｅｄＤｅｖｉｃｅ）カメラ８１Ｌおよび８１Ｒ、「耳」として機能するマイクロホン８２−１乃至８２−Ｎ、並びにタッチセンサ５１などからなる外部センサ部７１、および、「口」として機能するスピーカ７２などがそれぞれ所定位置に配設され、制御ユニット５２内には、バッテリセンサ９１および加速度センサ９２などからなる内部センサ部７３が配設されている。また、この他に、ロボット１の状態やユーザからの応答を表示する表示部５５が配設されている。
【００４０】
そして、外部センサ部７１のＣＣＤカメラ８１Ｌおよび８１Ｒは、周囲の状況を撮像し、得られた画像信号Ｓ１Ａを、メイン制御部６１に送出する。マイクロホン８２−１乃至８２−Ｎは、ユーザから音声入力として与えられる「歩け」、「とまれ」または「右手を挙げろ」等の各種命令音声（音声コマンド）を集音し、得られた音声信号Ｓ１Ｂを、メイン制御部６１にそれぞれ送出する。なお、以下において、Ｎ個のマイクロホン８２−１乃至８２−Ｎを特に区別する必要がない場合には、マイクロホン８２と称する。
【００４１】
また、タッチセンサ５１は、例えば、図２および図３に示されるように頭部ユニット１２の上部に設けられており、ユーザからの「撫でる」や「叩く」といった物理的な働きかけにより受けた圧力を検出し、その検出結果を、圧力検出信号Ｓ１Ｃとしてメイン制御部６１に送出する。
【００４２】
内部センサ部７３のバッテリセンサ９１は、バッテリ７４のエネルギ残量を所定の周期で検出し、検出結果をバッテリ残量検出信号Ｓ２Ａとして、メイン制御部６１に送出する。加速度センサ９２は、ロボット１の移動について、３軸方向（ｘ軸、ｙ軸およびｚ軸）の加速度を、所定の周期で検出し、その検出結果を、加速度検出信号Ｓ２Ｂとして、メイン制御部６１に送出する。
【００４３】
外部メモリ７５は、プログラムやデータ、および制御パラメータなどを記憶しており、そのプログラムやデータを必要に応じてメイン制御部６１に内蔵されるメモリ６１Ａに供給する。また、外部メモリ７５は、データ等をメモリ６１Ａから受け取り、記憶する。なお、外部メモリ７５は、ロボット１から着脱可能となされている。
【００４４】
メイン制御部６１は、メモリ６１Ａを内蔵している。メモリ６１Ａは、プログラムやデータを記憶しており、メイン制御部６１は、メモリ６１Ａに記憶されたプログラムを実行することで、各種の処理を行う。即ち、メイン制御部６１は、外部センサ部７１のＣＣＤカメラ８１Ｌおよび８１Ｒ、マイクロホン８２、およびタッチセンサ５１からそれぞれ供給される、画像信号Ｓ１Ａ、音声信号Ｓ１Ｂ、および圧力検出信号Ｓ１Ｃ（以下、これらをまとめて外部センサ信号Ｓ１と称する）と、内部センサ部７３のバッテリセンサ９１および加速度センサ等からそれぞれ供給される、バッテリ残量検出信号Ｓ２Ａおよび加速度検出信号Ｓ２Ｂ（以下、これらをまとめて内部センサ信号Ｓ２と称する）に基づいて、ロボット１の周囲および内部の状況や、ユーザからの指令、または、ユーザからの働きかけの有無などを判断する。
【００４５】
そして、メイン制御部６１は、ロボット１の周囲および内部の状況や、ユーザからの指令、または、ユーザからの働きかけの有無の判断結果と、内部メモリ６１Ａに予め格納されている制御プログラム、あるいは、そのとき装填されている外部メモリ７５に格納されている各種制御パラメータなどに基づいて、ロボット１の行動を決定し、その決定結果に基づく制御コマンドを生成して、対応するサブ制御部６３Ａ乃至６３Ｄに送出する。サブ制御部６３Ａ乃至６３Ｄは、メイン制御部６１から供給された制御コマンドに基づいて、アクチュエータＡ１乃至Ａ１４のうち、対応するものの駆動を制御する。これにより、ロボット１は、例えば、頭部ユニット１２を上下左右に揺動かさせたり、腕部ユニット１３Ａ、あるいは、腕部ユニット１３Ｂを上に挙げたり、脚部ユニット１４Ａと１４Ｂを交互に駆動させて、歩行するなどの行動を行う。
【００４６】
また、メイン制御部６１は、必要に応じて、所定の音声信号Ｓ３をスピーカ７２に与えることにより、音声信号Ｓ３に基づく音声を外部に出力させると共に、例えば、音声を検出したときに、表示信号Ｓ４に基づいて「だーれ」などのユーザへの応答を表示部５５に表示する。更に、メイン制御部６１は、外見上の「目」として機能する、頭部ユニット１２の所定位置に設けられた、図示しないＬＥＤに対して駆動信号を出力することにより、ＬＥＤを点滅させて、表示部５５として機能させる。
【００４７】
このようにして、ロボット１は、周囲および内部の状況（状態）や、ユーザからの指令および働きかけの有無などに基づいて、自律的に行動する。
【００４８】
図７は、図６のメイン制御部６１の機能的構成例を示している。なお、図７に示す機能的構成は、メイン制御部６１が、メモリ６１Ａに記憶された制御プログラムを実行することで実現されるようになっている。
【００４９】
メイン制御部６１は、特定の外部状態を認識する状態認識情報処理部１０１、状態認識情報処理部１０１の認識結果等に基づいて更新される、ロボット１の感情、本能、あるいは、成長の状態などのモデルを記憶するモデル記憶部１０２、状態認識情報処理部１０１の認識結果等に基づいて、ロボット１の行動を決定する行動決定機構部１０３、行動決定機構部１０３の決定結果に基づいて、実際にロボット１に行動を起こさせる姿勢遷移機構部１０４、合成音を生成する音声合成部１０５から構成されている。
【００５０】
状態認識情報処理部１０１には、マイクロホン８２や、ＣＣＤカメラ８１Ｌおよび８１Ｒ、タッチセンサ５１等から音声信号、画像信号、圧力検出信号等が、ロボット１の電源が投入されている間、常時入力される。そして、状態認識情報処理部１０１は、マイクロホン８２や、ＣＣＤカメラ８１Ｌおよび８１Ｒ、タッチセンサ５１等から与えられる音声信号、画像信号、圧力検出信号等に基づいて、特定の外部状態や、ユーザからの特定の働きかけ、ユーザからの指示等を認識し、その認識結果を表す状態認識情報を、モデル記憶部１０２および行動決定機構部１０３に常時出力する。
【００５１】
状態認識情報処理部１０１は、音声認識部１０１Ａ、方向認識部１０１Ｂ、圧力処理部１０１Ｃ、および画像認識部１０１Ｄを有している。
【００５２】
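The voice recognition unit 101A checks the audio signal S1B supplied from each of the microphones 82-1 to 82-N for the presence of voice and, when voice is detected, notifies the action determination mechanism unit 103 of the detection. When voice has been detected, the control unit 101a waits until the robot 1 has turned toward the sound source. It then starts voice recognition processing if the signal supplied from the image recognition unit 101D indicates that the user's face has been detected, and otherwise keeps voice recognition processing from running. In other words, the control unit 101a controls the voice recognition unit 101A so that it executes voice recognition processing only when the image recognition unit 101D has recognized the user's face. As a result, recognition errors caused by noise that arises while no user is present are reduced.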
【００５３】
また、音声認識部１０１Ａは、制御部１０１ａにより音声認識処理を実行できる状態に制御されているとき音声認識を行う。そして、音声認識部１０１Ａは、例えば、「歩け」、「止まれ」、「右手を挙げろ」等の指令、その他の音声認識結果を、状態認識情報として、モデル記憶部１０２および行動決定機構部１０３に通知する。
【００５４】
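The direction recognition unit 101B recognizes (detects) the direction of the sound source from the power differences and phase differences among the audio signals S1B supplied from the microphones 82-1 to 82-N, and supplies the recognition result to the action determination mechanism unit 103.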
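As a rough illustration of the phase-difference idea, the sketch below estimates a bearing from the arrival-time difference between two microphones. It is only a sketch under stated assumptions: the specification does not give the algorithm, so the two-microphone far-field geometry, the brute-force cross-correlation, the names `best_lag` and `bearing_deg`, and the speed-of-sound constant are all choices made here for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def best_lag(left, right, max_lag):
    """Return the sample lag that maximises the cross-correlation.

    A positive result means the right channel lags the left one,
    i.e. the sound reached the left microphone first.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_deg(lag_samples, sample_rate_hz, mic_spacing_m):
    """Convert a lag into a bearing; positive angles point toward the right microphone."""
    delay_s = lag_samples / sample_rate_hz
    # Far-field model: path difference = spacing * sin(angle).
    # The ratio is clamped so that noisy lags cannot leave the domain of asin.
    ratio = max(-1.0, min(1.0, -SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

With more than two microphones, as in the N-microphone arrangement described here, pairwise estimates of this kind would have to be combined; that step is not shown.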
【００５５】
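The pressure processing unit 101C processes the pressure detection signal S1C supplied from the touch sensor 51. When, as a result of that processing, it detects, for example, a pressure at or above a predetermined threshold that lasts only a short time, it recognizes that the robot was "hit (scolded)"; when it detects a pressure below the predetermined threshold that lasts a long time, it recognizes that the robot was "stroked (praised)". It notifies the model storage unit 102 and the action determination mechanism unit 103 of the recognition result as state recognition information.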
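The two-way decision described above can be sketched as a small function. The threshold and duration values, the function name `classify_touch`, and the normalised pressure scale are hypothetical; the specification only says "predetermined threshold", "short time", and "long time".

```python
def classify_touch(pressure, duration_s, pressure_threshold=0.5, short_duration_s=0.3):
    """Classify one touch event; returns "hit", "stroked", or None.

    pressure_threshold and short_duration_s are illustrative values,
    not taken from the specification.
    """
    is_strong = pressure >= pressure_threshold
    is_short = duration_s < short_duration_s
    if is_strong and is_short:
        return "hit"        # interpreted as being scolded
    if not is_strong and not is_short:
        return "stroked"    # interpreted as being praised
    return None             # combinations the text does not classify
```

The other two combinations (strong and long, weak and short) are left unclassified because the text does not say how they are handled.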
【００５６】
また、画像認識部１０１Ｄは、ＣＣＤカメラ８１Ｌおよび８１Ｒから与えられる画像信号Ｓ１Ａを用いて、画像認識処理を行う。そして、画像認識部１０１Ｄは、その処理の結果、例えば、「赤い丸いもの」や、「地面に対して垂直なかつ所定高さ以上の平面」等を検出したときには、「ボールがある」や、「壁がある」、または、人間の顔を検出した等の画像認識結果を、状態認識情報として、モデル記憶部１０２および行動決定機構部１０３に通知する。
【００５７】
ここで、ユーザは、一般に、ロボット１の正面方向から話しかけることが多いと予想されるため、周囲の状況を撮像するＣＣＤカメラ８１Ｌおよび８１Ｒは、その撮像方向が、ロボット１の正面方向になるように、頭部ユニット１２（図２）に設置されているものとする。
【００５８】
ＣＣＤカメラ８１Ｌ、および、８１Ｒは、方向認識部１０１Ｂにより認識された方向の情報に基づいて、姿勢遷移機構部１０４により検出された方向に、頭部ユニット１２が動かされることによって、ＣＣＤカメラ８１Ｌおよび８１Ｒにおいて、ユーザを撮像することができるようにすることが可能である。
【００５９】
モデル記憶部１０２は、ロボット１の感情、本能、成長の状態を表現する感情モデル、本能モデル、成長モデルをそれぞれ記憶、管理している。
【００６０】
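Here, the emotion model represents the state (degree) of each emotion, such as "joy", "sadness", "anger", and "fun", by a value in a predetermined range (for example, -1.0 to 1.0), and changes that value based on the state recognition information from the state recognition information processing unit 101, the passage of time, and the like. The instinct model represents the state (degree) of each instinctive desire, such as "appetite", "desire for sleep", and "desire for exercise", by a value in a predetermined range, and changes that value in the same way. The growth model represents the state (degree) of growth, such as "childhood", "adolescence", "middle age", and "old age", by a value in a predetermined range, and likewise changes that value based on the state recognition information, the passage of time, and the like.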
【００６１】
モデル記憶部１０２は、上述のようにして感情モデル、本能モデル、成長モデルの値で表される感情、本能、成長の状態を、状態情報として、行動決定機構部１０３に送出する。
【００６２】
なお、モデル記憶部１０２には、状態認識情報処理部１０１から状態認識情報が供給される他、行動決定機構部１０３から、ロボット１の現在または過去の行動、具体的には、例えば、「長時間歩いた」などの行動の内容を示す行動情報が供給されるようになっており、モデル記憶部１０２は、同一の状態認識情報が与えられても、行動情報が示すロボット１の行動に応じて、異なる状態情報を生成するようになっている。
【００６３】
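That is, for example, when the robot 1 greets the user and the user then strokes its head, the action information indicating that it greeted the user and the state recognition information indicating that its head was stroked are given to the model storage unit 102. In this case, the model storage unit 102 increases the value of the emotion model representing "joy".
【００６４】
On the other hand, when the robot 1 has its head stroked while it is performing some task, the action information indicating that a task is in progress and the state recognition information indicating that its head was stroked are given to the model storage unit 102. In this case, the model storage unit 102 does not change the value of the emotion model representing "joy".
【００６５】
In this way, the model storage unit 102 sets the value of the emotion model by referring not only to the state recognition information but also to the action information indicating the current or past actions of the robot 1. This avoids unnatural changes of emotion, such as increasing the value of the emotion model representing "joy" when the user strokes the head as a prank while the robot is executing some task.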
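A minimal sketch of such a context-dependent update follows. The class name `EmotionModel`, the action labels, and the step size are hypothetical; only the range of -1.0 to 1.0 and the rule that stroking raises "joy" unless a task is in progress come from the text.

```python
def clamp(value, low=-1.0, high=1.0):
    """Keep a model value inside the range given as an example in the text."""
    return max(low, min(high, value))


class EmotionModel:
    """Toy emotion model; the emotion names and step size are illustrative."""

    def __init__(self):
        self.values = {"joy": 0.0, "sadness": 0.0, "anger": 0.0, "fun": 0.0}

    def update(self, state_recognition, current_action, step=0.2):
        # The same recognition result ("stroked") changes "joy" only when
        # the robot is not busy with a task.
        if state_recognition == "stroked" and current_action != "working":
            self.values["joy"] = clamp(self.values["joy"] + step)
```

For example, `update("stroked", "greeting")` raises "joy" by one step, while `update("stroked", "working")` leaves it unchanged.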
【００６６】
なお、モデル記憶部１０２は、本能モデルおよび成長モデルについても、感情モデルにおける場合と同様に、状態認識情報および行動情報の両方に基づいて、その値を増減させるようになっている。また、モデル記憶部１０２は、感情モデル、本能モデル、成長モデルそれぞれの値を、他のモデルの値にも基づいて増減させるようになっている。
【００６７】
行動決定機構部１０３は、状態認識情報処理部１０１からの状態認識情報や、モデル記憶部１０２からの状態情報、時間経過等に基づいて、次の行動を決定し、決定された行動の内容が、例えば、「ダンスをする」というような音声認識処理や画像認識処理を必要としない場合、その行動の内容を、行動指令情報として、姿勢遷移機構部１０４に送出する。
【００６８】
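That is, the action determination mechanism unit 103 manages, as an action model defining the actions of the robot 1, a finite automaton whose states correspond to the actions the robot 1 can take. It makes the state of this finite automaton transition based on the state recognition information from the state recognition information processing unit 101, the values of the emotion model, instinct model, or growth model in the model storage unit 102, the passage of time, and the like, and determines the action corresponding to the state after the transition as the action to take next.
【００６９】
Here, the action determination mechanism unit 103 makes the state transition when it detects a predetermined trigger. That is, it makes the state transition when, for example, the time spent executing the action corresponding to the current state reaches a predetermined time, when specific state recognition information is received, or when the value of an emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 102 falls to or below, or rises to or above, a predetermined threshold.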
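The trigger-driven automaton can be sketched as a transition table plus a function that turns the three kinds of trigger into a name. All state names, trigger names, and thresholds below are hypothetical, since the specification does not enumerate them.

```python
# Hypothetical states and triggers; the specification does not list the actual ones.
TRANSITIONS = {
    ("idle", "voice_detected"): "turning",
    ("turning", "face_detected"): "listening",
    ("turning", "timeout"): "idle",
    ("listening", "timeout"): "idle",
    ("idle", "joy_high"): "dancing",
    ("dancing", "timeout"): "idle",
}


def detect_trigger(elapsed_s, time_limit_s, recognition=None, joy=0.0, joy_threshold=0.8):
    """Map the three trigger kinds named in the text to a trigger name, or None."""
    if recognition is not None:          # specific state recognition information
        return recognition
    if elapsed_s >= time_limit_s:        # current action has run long enough
        return "timeout"
    if joy >= joy_threshold:             # a model value crossed its threshold
        return "joy_high"
    return None


def next_state(state, trigger):
    """Return the next state; stay in place when no transition is defined."""
    return TRANSITIONS.get((state, trigger), state)
```

Because model values feed into the trigger, the same recognition input can lead to different destinations depending on the emotion, instinct, or growth state.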
【００７０】
なお、行動決定機構部１０３は、上述したように、状態認識情報処理部１０１からの状態認識情報だけでなく、モデル記憶部１０２における感情モデルや、本能モデル、成長モデルの値等にも基づいて、行動モデルにおけるステートを遷移させることから、同一の状態認識情報が入力されても、感情モデルや、本能モデル、成長モデルの値（状態情報）によっては、ステートの遷移先は異なるものとなる。
【００７１】
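When the voice recognition unit 101A of the state recognition information processing unit 101 outputs state recognition information indicating that an audio signal has been detected, the action determination mechanism unit 103 instructs the posture transition mechanism unit 104 to turn the robot 1 toward the sound source. Then, with the robot 1 facing the sound source, when the image recognition unit 101D detects the user's face image (judged, for example, from skin-colored regions of the image signal) and sends state recognition information to that effect to the control unit 101a of the voice recognition unit 101A, the control unit 101a controls the voice recognition unit 101A so that it starts voice recognition processing.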
【００７２】
そして、状態認識情報処理部１０１から供給されている状態認識情報（例えば、音声認識部１０１Ａにより認識されたコマンドの情報）を取得し、上述したような、例えば、「ユーザと会話する」や「ユーザに手を振る」などの、行動決定機構部１０３自身が決定した動作を行う（その行動の内容を、行動指令情報として、姿勢遷移機構部１０４に送出する）。
【００７３】
なお、行動決定機構部１０３では、上述したように、ロボット１の頭部や手足等を動作させる行動指令情報の他、ロボット１に発話を行わせる行動指令情報も生成される。ロボット１に発話を行わせる行動指令情報は、音声合成部１０５に供給されるようになっており、音声合成部１０５に供給される行動指令情報には、音声合成部１０５に生成させる合成音に対応するテキスト等が含まれる。そして、音声合成部１０５は、行動決定機構部１０３から行動指令情報を受信すると、その行動指令情報に含まれるテキストに基づき、合成音を生成し、スピーカ７２に供給して出力させる。
【００７４】
また、行動決定機構１０３では、発話に対応する、または、発話をしない場合に発話の代わりとなる言葉を、表示部５５にプロンプトとしてテキスト表示させる。例えば、音声を検出して振り向いたときに、「誰？」とか「なぁに？」といったテキストを表示部５５にプロンプトとして表示したり、または、スピーカ７２より発生することができる。
【００７５】
姿勢遷移機構部１０４は、上述したように、行動決定機構部１０３から供給される行動指令情報に基づいて、ロボット１の姿勢を、現在の姿勢から次の姿勢に遷移させるための姿勢遷移情報を生成し、これをサブ制御部６３Ａ乃至６３Ｄに送出する。
【００７６】
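FIG. 8 is a functional block diagram showing the functions of the voice recognition unit 101A of the state recognition information processing unit 101.
【００７７】
While the control unit 101a has not put the voice recognition unit 101A into a state in which it executes voice recognition processing (the Disable state), the control unit 101a, on detecting a signal indicating that an audio signal has been detected, outputs that signal to the action determination mechanism unit 103. That is, in the Disable state the voice recognition unit 101A does not execute voice recognition processing on the input audio signal.
【００７８】
When a signal indicating that an image (the user's face) has been detected is input from the image recognition unit 101D, the control unit 101a judges that voice recognition processing can be executed (the Enable state), switches to that state, and outputs the audio that was input from the microphones 82 and converted into a digital signal by an A/D converter (not shown) to the feature extraction unit 121.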
【００７９】
特徴抽出部１２１は、入力された音声信号の特徴量を演算する。
【００８０】
認識処理制御部１２２は、複数の言語モデル（語彙と文法）に対応する認識処理を並列に処理することができるように構成されており、１つの言語モデルに対応する認識処理を行なうモジュールとして、それぞれ認識処理部１３１−１乃至１３１−４が設けられている。
【００８１】
認識処理制御部１２２においては、新たな言語モデルに対応した認識処理部を追加したり、不要になった認識処理部を削除することができる。また、各認識処理部に対して、認識処理を停止させたり、開始させたりすることができる。すなわち、複数の認識処理部を同時に駆動したり、認識処理部を切り替えたりすることによって、複数の言語モデルを同時に駆動したり、言語モデルを切り替えることができる。
【００８２】
認識処理部１３１−１乃至１３１−４には、特徴抽出部１２１により演算された特徴量に基づいて、音声のマッチングを行うマッチング部１４１−１乃至１４１−４が設けられており、また、語彙に関する情報が蓄積された辞書データベース１４２−１乃至１４２−４、文法に関する情報が蓄積された文法データベース１４３−１乃至１４３−４が設けられている。さらに音響に関する情報が蓄積された音響モデルデータベース１３２が、マッチング部１４１−１乃至１４１−４と接続されている。
【００８３】
なお、以下の説明において、認識処理部１３１−１乃至１３１−４のそれぞれを、個々に区別する必要がない場合、まとめて認識処理部１３１と称する。他の部分についても同様とする。また、図８の例においては、認識処理部は、認識処理部１３１−１乃至１３１−４の４つが示されているが、認識処理部は、必要に応じて、３つ以下、または５つ以上設けられることもある。
【００８４】
音響モデルデータベース１３２により、同じ音響モデルをすべての認識処理部１３１が共有して利用することができるように構成されており、これによって消費するメモリや音響モデルにおいて発生するスコア計算のための処理などを効率的に共有することが可能となる。
【００８５】
音響モデルデータベース１３２は、音声認識する音声の言語における個々の音素や音節などの音響的な特徴を表す音響モデルを記憶している。音響モデルとしては、例えば、ＨＭＭ（ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）が用いられる。辞書データベース１４２−１乃至１４２−４は、認識対象の各単語（語彙）について、その発音に関する情報（音韻情報）が記述された単語辞書を記憶している。文法データベース１４３−１乃至１４３−４は、辞書データベース１４２−１乃至１４２−４の単語辞書に登録されている各単語が、どのように連鎖する（つながる）かを記述した文法規則（言語モデル）を記憶している。文法規則としては、例えば、文脈自由文法（ＣＦＧ）に基づく記述や、統計的な単語連鎖確率（Ｎ−ｇｒａｍ）などが用いられる。
【００８６】
辞書データベース１４２−１乃至１４２−４にはそれぞれ異なる語彙に関する情報が蓄積されており、文法データベース１４３−１乃至１４３−４にもそれぞれ異なる文法に関する情報が蓄積されている。この辞書データベース１４２と文法データベース１４３の組み合わせにより言語モデルが決定される。
【００８７】
次に、図９のフローチャートを参照して、音声コマンドによる動作の処理について説明する。
【００８８】
ステップＳ１において、音声認識部１０１Ａは、マイクロホン８２−１乃至８２−Ｎを介して、音声が入力されたか否か（音声が検出されたか否か）を判定し、音声が入力されていないと判定された場合、ステップＳ１の処理を繰り返す。すなわち、音声が検出されたと判定されるまで（音声が入力されたと判定されるまで）、ステップＳ１の処理が繰り返される。
【００８９】
ステップＳ１において、音声が検出された、すなわち、例えば、ユーザがロボット１に対して何か音声によるコマンドを入力しようと声をかけたとみなされた場合、その処理は、ステップＳ２に進む。
【００９０】
ステップＳ２において、方向認識部１０１Ｂは、マイクロホン８２−１乃至８２−Ｎを介して入力された音声に基づいて、音源の方向を検出して認識する。すなわち、方向認識部１０１Ｂは、マイクロホン８２−１乃至８２−Ｎから供給される音声信号Ｓ１Ｂのパワー差や位相差から音源の方向を検出して認識し、認識結果を行動決定機構部１０３に供給する。
【００９１】
ステップＳ３において、音源の方向への振り向き動作の処理が実行される。
【００９２】
ここで、図１０のフローチャートを参照して、振り向き動作の処理について説明する。
【００９３】
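In step S21, the action determination mechanism unit 103 calculates, based on the sound source direction supplied from the direction recognition unit 101B of the state recognition information processing unit 101, the difference between the direction the robot 1 is currently facing and the direction of the sound source, and obtains the angle of the sound source direction relative to the orientation of the trunk.
【００９４】
In step S22, the action determination mechanism unit 103 determines the rotation angles of the neck joint mechanism 27 and of the trunk (the vertical axis of the body of the robot 1, rotated using the hip joint mechanisms 37) needed to rotate the head by the relative angle calculated in step S21, subject to constraints such as the movable range of the yaw axis 29 of the neck joint mechanism 27 shown in FIG. 4 and the maximum angle through which the trunk can be turned in a single rotating motion using the legs. Depending on the sound source direction, the action determination mechanism unit 103 determines a rotation angle for the neck joint mechanism 27 only. Although the robot 1 has the yaw axis 38 of the hip joint mechanism 37 as shown in FIG. 4, for simplicity this embodiment is described as not using it. Needless to say, however, the robot can turn toward the sound source by coordinating its whole body, using the neck, the waist, and the direction in which the feet contact the ground.
【００９５】
This is explained concretely with reference to FIG. 11. FIG. 11A shows an example in which the movable range of the neck of the robot 1 is ±Y degrees and the sound source S lies at a relative angle of X degrees from the front of the robot 1. In this case, for the robot 1 to turn toward the sound source S, it must, as shown in FIG. 11B, rotate the whole trunk by at least X−Y degrees using the legs and rotate the yaw axis 29 of the neck joint mechanism 27 by Y degrees toward the sound source S.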
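The split between neck and trunk in the FIG. 11 example can be written as a short calculation. This is a sketch under simplifying assumptions: the function name `split_turn` is hypothetical, angles are signed degrees relative to the trunk, and the limit on how far the trunk can turn in one stepping motion is ignored.

```python
def split_turn(relative_deg, neck_limit_deg):
    """Return (trunk_deg, neck_deg) for turning the head by relative_deg.

    The neck takes as much of the turn as its movable range (+/- neck_limit_deg)
    allows; the trunk covers the remainder, i.e. X - Y in the FIG. 11 example.
    """
    neck_deg = max(-neck_limit_deg, min(neck_limit_deg, relative_deg))
    trunk_deg = relative_deg - neck_deg
    return trunk_deg, neck_deg
```

With X = 120 and Y = 90 this gives a trunk rotation of 30 degrees and a neck rotation of 90 degrees; when X is within the neck range, the trunk rotation is zero and only the neck moves.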
【００９６】
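In step S23, the action determination mechanism unit 103 supplies the posture transition mechanism unit 104 with the control information for each joint needed to rotate through the angles obtained in step S22, and based on this information the posture transition mechanism unit 104 drives the relevant actuators to turn the robot 1 toward the sound source.
【００９７】
In step S24, the action determination mechanism unit 103 calculates the rotation angles of the trunk and neck needed to face the sound source S squarely. For example, as shown in FIG. 11B described above, when the yaw axis 29 of the neck joint mechanism 27 is rotated by Y degrees in the current posture of the robot 1, that is, when the head is rotated by Y degrees relative to the trunk, rotating the trunk by Y degrees while simultaneously rotating the yaw axis 29 of the neck joint mechanism 27 by −Y degrees, as shown in FIG. 11C, removes the twist in the neck while the robot keeps gazing at the target object, so that it comes to face the sound source S squarely with a natural motion.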
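The reason the gaze stays fixed during step S24 is that the trunk and neck rotations cancel at every instant. The sketch below makes that explicit by generating intermediate poses; the function name `square_up_frames` and the linear interpolation in equal steps are assumptions made for illustration.

```python
def square_up_frames(neck_deg, steps=4):
    """Return (trunk_delta_deg, neck_deg) pairs for untwisting the neck.

    At every frame the trunk has turned by the same amount the neck has
    turned back, so trunk_delta + neck stays equal to the starting neck angle
    and the head keeps pointing at the sound source.
    """
    frames = []
    for k in range(steps + 1):
        t = k / steps
        frames.append((neck_deg * t, neck_deg * (1.0 - t)))
    return frames
```

Starting from a neck angle of Y degrees, the first frame is (0, Y) and the last is (Y, 0), which corresponds to the transition from FIG. 11B to FIG. 11C.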
【００９８】
ステップＳ２５において、姿勢遷移機構部１０４は、ステップＳ２４で計算した動作をロボット１に実行させ、音源方向に正対させると共に、行動決定機構部１０３は、例えば、「だーれ」などのテキストからなるプロンプトを表示部５５に表示させる。
【００９９】
ロボット装置１は、以上のようにして音源方向を認識し（推定し）、全身を協調させて自然な動作により音源方向を振り向くことができる。
【０１００】
例えば、ロボット１は、図１２Ａ乃至Ｆで示されるようにして音源方向に振り向く。すなわち、図１２Ａのようにロボット１が図中右側を向いていたときに背後から音声が入力されると、図１２Ｂ乃至Ｆのように、首を回転させると共に脚部を使って体幹を回転させ、最終的に、図１２Ｆで示されるように、図中左方向の音源方向に振り向く。また、このとき、行動決定機構部１０３は、表示部５５を制御して、例えば、「なーに？」といった表示をさせることにより、ユーザに対して応答していることを表現させてもよい。結果として、ユーザが音声によるコマンドを行った際、ユーザは、ロボット１がユーザが発した音声に反応し、応答していることを認識することが可能となる。
【０１０１】
ここで、図９のフローチャートの説明に戻る。
【０１０２】
ステップＳ４において、状態認識情報処理部１０１の画像認識部１０１Ｄは、ＣＣＤカメラ８１Ｌ，８１Ｒのそれぞれより入力される画像情報に基づいて、ユーザの顔の検出処理を実行する。人間の顔を検出する手法は、例えば、画像信号の肌色領域などから判断されるユーザの顔画像などを検出するといった方法でもよい。また、人間の顔を検出する手法としては、例えば「Ｅ．Ｏｓｕｎａ，Ｒ．ＦｒｅｕｎｄａｎｄＦ．Ｇｉｒｏｓｉ：“Ｔｒａｉｎｉｎｇｓｕｐｐｏｒｔｖｅｃｔｏｒｍａｃｈｉｎｅｓ：ａｎａｐｐｌｉｃａｔｉｏｎｔｏｆａｃｅｄｅｔｅｃｔｉｏｎ”，ＣＶＰＲ’９７，１９９７」に記載されているような手法で実現することも可能である。
【０１０３】
ステップＳ５において、画像認識部１０１Ｄは、顔が検出されたか否かを判定し、顔が検出されたと判定された場合、その処理は、ステップＳ６に進む。
【０１０４】
ステップＳ６において、画像認識部１０１Ｄは、顔を検出した旨を示す信号を音声認識部１０１Ａに送出すると共に、この情報に基づいて、音声認識部１０１Ａの制御部１０１ａは、音声認識部１０１Ａにおいて、音声認識処理ができる状態、すなわち、音声認識部１０１ＡをＥｎａｂｌｅの状態に制御し、この処理以降においてマイクロホン８２−１乃至８２−Ｎより供給される音声信号に基づいて、音声認識処理を実行することができる状態にする。
【０１０５】
ステップＳ７において、音声認識処理が実行される。
【０１０６】
ここで、図１３のフローチャートを参照して、音声認識処理について説明する。
【０１０７】
ステップＳ４１において、特徴抽出部１２１は、デジタル信号としての音声信号を、適当な時間間隔で周波数分析行うなどして、スペクトルや、その他の音声の音響的特徴を表すパラメータに変換し、特徴量として抽出する。
【０１０８】
ステップＳ４２において、認識処理制御部１２２は、駆動させる認識処理部を選択する。
【０１０９】
例えば、ロボット１が、ユーザとの雑談、歌唱、および踊りを実行している場合を想定する。このとき、ロボット１では、雑談用、歌唱用、および踊り用のアプリケーションが起動している。また、ロボット１は、ユーザとの雑談用、歌唱用、および踊り用に、それぞれ１つずつ言語モデルを有しており、それぞれの言語モデルに対応した認識処理部が駆動されるものとする。さらに、全ての動作に共通に利用される言語モデルを１つ有しており、この言語モデルに対応した認識処理部が駆動されているものとする。なお、全ての動作に共通に利用される言語モデルとは、例えば「止まれ」などのように、重要度が大きいコマンドなどを認識するための言語モデルである。
【０１１０】
このとき、ロボット１は、現在実行中のアプリケーションに基づいて、全ての動作に共通に利用される言語モデルをもつ認識処理部、ユーザとの雑談用の言語モデルをもつ認識処理部、歌唱用の言語モデルをもつ認識処理部、および踊り用の言語モデルをもつ認識処理部を駆動する。ここでは、認識処理部１３１−１が全ての動作に共通に利用される言語モデルをもち、認識処理部１３１−２が雑談用の言語モデルをもち、認識処理部１３１−３が歌唱用の言語モデルをもち、認識処理部１３１−４が踊り用の言語モデルをもつものとする。
【０１１１】
従って、認識処理制御部１２２は、上記の認識処理部１３１−１乃至１３１−４を、駆動すべき認識処理部として選択する。すなわち、全部で４つの認識処理部１３１−１乃至１３１−４が認識処理制御部１２２で動作していることになり、１つのアプリケーションに対応する認識処理部はそれぞれ２つずつとなる。
【０１１２】
このように、認識処理制御部１２２は、実行中のアプリケーションに対応する言語モデルをもつ認識処理部を選択して駆動させる。
【０１１３】
その後、処理はステップＳ４３に進む。なお、ステップＳ４３乃至ステップＳ４６の処理（以下、ステップＳ４３乃至ステップＳ４６の処理を単語系列認識処理とも称する）は、認識処理部１３１−１乃至１３１−４により、並列に実行される。
【０１１４】
ステップＳ４３において、認識処理部１３１−１乃至１３１−４は、特徴抽出部１２１から出力された音声の特徴量を音響モデルデータベース１３２とマッチングし、音素、音節を判定する。
【０１１５】
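In step S44, the recognition processing units 131-1 to 131-4 match the phonemes and syllables against the dictionary databases 142-1 to 142-4 and the grammar databases 143-1 to 143-4, and compute an acoustic score and a language score.
【０１１６】
That is, the recognition processing units 131-1 to 131-4 compare the acoustic pattern of the input features with the standard acoustic pattern corresponding to each word in the dictionary database 142, and compute the acoustic evaluation value as the acoustic score. When, for example, a bigram is used as the grammar, the recognition processing units 131-1 to 131-4 quantify the linguistic likelihood of each word from its chain probability with the immediately preceding word, based on the grammar database 143, and compute this as the language score.
【０１１７】
In step S45, the recognition processing units 131-1 to 131-4 combine the acoustic score and the language score to determine the word sequence with the highest overall evaluation, and in step S46 they output the determined word sequence to the action determination mechanism unit 103 and the model storage unit 102.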
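How an acoustic score and a bigram language score might be combined can be sketched as follows. The specification does not say how the scores are represented or weighted, so the use of log-probabilities, the `language_weight` parameter, the sentence-start token `<s>`, and the penalty for unseen word pairs are all assumptions.

```python
def language_score(words, bigram_logprob, unseen_logprob=-10.0):
    """Sum log P(word | previous word) over the sequence (bigram model)."""
    score, previous = 0.0, "<s>"
    for word in words:
        score += bigram_logprob.get((previous, word), unseen_logprob)
        previous = word
    return score


def best_word_sequence(hypotheses, bigram_logprob, language_weight=1.0):
    """Pick the hypothesis with the highest combined score.

    hypotheses: list of (words, acoustic_logscore) pairs.
    """
    def total(hypothesis):
        words, acoustic = hypothesis
        return acoustic + language_weight * language_score(words, bigram_logprob)

    return max(hypotheses, key=total)[0]
```

In this toy form, a hypothesis that scores slightly worse acoustically can still win if the language model considers its word chain far more likely.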
【０１１８】
例えば、ユーザが、「今日はいい天気ですね。」と発声したとき、「今日」、「は」、「いい」、「天気」、「ですね」のような単語の系列が認識結果として得られることになる。
【０１１９】
このようにして、入力された音声から単語系列が認識される。
【０１２０】
ここで、図９のフローチャートの説明に戻る。
【０１２１】
ステップＳ８において、行動決定部１０３は、状態認識情報処理部１０１の音声認識部１０１Ａより供給される単語系列からなる音声によるコマンドに基づいて、行動を決定して姿勢遷移機構部１０４に出力し、姿勢遷移機構部１０４は、決定された行動に対応する動作を各種のアクチュエータを制御してロボット１を行動させる。
【０１２２】
ステップＳ９において、状態認識情報処理部１０１の音声認識部１０１Ａの制御部１０１ａは、音声によるコマンドの入力があるか否かを判定し、音声によるコマンドの入力があると判定された場合、その処理は、ステップＳ７に戻る。すなわち、音声によるコマンドが入力されつづける限り、ステップＳ７乃至Ｓ９の処理が繰り返され、ユーザとの対話型の処理が継続されることになる。
【０１２３】
ステップＳ９において、音声によるコマンドの入力がない、すなわち、例えば、所定の時間、音声によるコマンドの入力が無く、ユーザからの要求がなくなったと判定された場合、ステップＳ１０において、状態認識情報処理部１０１の音声認識部１０１Ａの制御部１０１ａは、音声認識処理ができない状態、すなわち、音声認識部１０１ＡをＤｉｓａｂｌｅの状態に制御し、この処理以降においてマイクロホン８２−１乃至８２−Ｎより供給される音声信号に基づいて、音声認識処理が実行できない状態にする。
【０１２４】
ステップＳ１１において、元の方向への振り向き動作処理が実行され、その処理は、ステップＳ１に戻る。尚、この元の方向への振り向き動作処理は、図１０のフローチャートを参照して説明した、図９のステップＳ３の処理における音源方向への振り向き動作の処理における、音源方向を元の方向に置き換えたこと以外は、同様の処理であるので、その説明は省略する。
【０１２５】
また、ステップＳ５において、ユーザの顔が検出されなかった場合、ステップＳ６乃至Ｓ１０の処理はスキップされて、その処理は、ステップＳ１１に進む。
【０１２６】
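That is, a voice command is normally issued while the user is within a distance at which the robot 1 can see the user with the CCD cameras 81. Therefore, when an audio signal is detected in step S1 (when an audio signal is first detected), the robot 1 is turned toward the sound source. Then, only when the user's face is detected in the direction of the sound source in step S5, voice recognition processing is executed on the audio signals subsequently detected by the microphones 82 (the audio signals detected the second time and later). Audio signals input at any other time are regarded as noise (not as audio signals input as commands) and voice recognition processing is not executed on them, so misrecognition caused by noise arising in an environment where no user is present can be suppressed.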
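The overall flow of FIG. 9 can be summarised in a short sketch. `StubRobot` is a hypothetical stand-in that only records what it is asked to do, and `voice_command_cycle` is an illustrative name; neither is part of the specification. The step labels in the comments refer to steps S2 to S11.

```python
class StubRobot:
    """Hypothetical stand-in for the robot; records the requested actions."""

    def __init__(self, face_present, sound_direction_deg=90.0):
        self.face_present = face_present
        self.sound_direction_deg = sound_direction_deg
        self.recognizer_enabled = False
        self.log = []

    def locate_sound(self):
        return self.sound_direction_deg

    def turn_to(self, direction_deg):
        self.log.append(("turn", direction_deg))

    def turn_back(self):
        self.log.append(("turn_back",))

    def face_detected(self):
        return self.face_present

    def recognize(self, audio):
        # Recognition must never run while the recognizer is disabled.
        assert self.recognizer_enabled
        return audio.strip().lower()

    def execute(self, command):
        self.log.append(("execute", command))


def voice_command_cycle(robot, later_utterances):
    """Run steps S2 to S11 once, after step S1 has detected a first sound."""
    direction = robot.locate_sound()            # S2: find the sound source
    robot.turn_to(direction)                    # S3: turn toward it
    if robot.face_detected():                   # S4, S5: is a user there?
        robot.recognizer_enabled = True         # S6: Enable
        for audio in later_utterances:          # S7 to S9: recognise and act
            robot.execute(robot.recognize(audio))
        robot.recognizer_enabled = False        # S10: Disable
    robot.turn_back()                           # S11: return to original heading
```

Note that the first sound only triggers the turn; recognition is applied to the utterances that arrive after a face has been confirmed.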
【０１２７】
また、検出された音声に基づいて検出された音源方向に、振り向く動作は、ユーザの顔を検出することができるように動作すればよいのであって、ロボット１の頭部のみを振り向かせても、または、ロボット１の本体全体を振り向かせてもどちらでもよい。さらに、このとき、音源方向に指向性の高いマイクロホンが向けられるようにしてもよい。このようにすることで、音声によるコマンドを発しているユーザに対してロボット１が反応していることを示すことが可能になると共に、音源に対して音声認識に必要な音声信号を高い精度で取得することが可能となり、結果として音声認識処理におけるノイズなどによる誤認識を抑制することが可能となる。
【０１２８】
尚、以上の音声コマンドによる動作の処理の説明においては、ロボット１が、この音声コマンドによる動作の処理とは別の動作を行っていてもよく、その場合、ステップＳ１の処理において、音源の方向を検出する際、これまで行っていた動作が中断され、さらに、ステップＳ１１において、元の方向への振り向き動作が完了した後、音声コマンドによる動作の処理がなされるまでに実行されていた動作が再開されることになる。
【０１２９】
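In the processing above, the direction of the sound source is detected from the first detected audio signal and the robot turns toward it. Depending on the environment around the robot 1, however, a voice may seem to come from a direction different from that of its source. That is, the voice is reflected by ceilings, walls, and the like around the robot 1, and if the direction is detected from this reflected voice, a direction different from the true source is likely to be falsely detected as the sound source. As a result, however often the robot 1 turns toward the falsely detected direction, it cannot detect the face of the user issuing the voice command, and it may keep repeating needless turning motions.
【０１３０】
To cope with situations in which the voice is reflected in this way, the frequency with which a face can be detected may be stored for each direction, and a reliability may be obtained for each sound source direction according to the frequency with which the face could not be detected (or the frequency with which it could). When a direction with a low reliability of face detection is detected as the sound source direction, the robot may be made to skip the turning motion at a predetermined rate (that is, the audio signal may be ignored even though it was detected).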
【０１３１】
図１４は、ユーザの顔の検出ができる頻度を記憶しておき、検出した頻度から方向毎の信頼度を演算し、その信頼度に応じて振り向き動作をしないようにしたロボット１のメイン制御部６１の他の構成を示すブロック図である。
【０１３２】
図１４のメイン制御部６１は、基本的には、図７のメイン制御部６１の構成と同様であるが、行動決定機構部１０３が、行動メモリ１０３ａ、および、信頼度演算部１０３ｂを備えており、行動メモリ１０３ａに記憶された情報に基づいて信頼度演算部１０３ｂが方向毎の信頼度を演算し、その信頼度に応じて姿勢遷移機構部１０４を制御する点が異なる。
【０１３３】
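The action memory 103a is a memory that stores the actions determined by the action determination mechanism unit 103. When the robot turns toward a sound source, it updates and stores, for each direction, the frequency of turning motions and the frequency with which the user's face was detected.
【０１３４】
The reliability calculation unit 103b obtains and stores, as a percentage, the reliability that the user's face is detected for each direction, based on the information stored in the action memory 103a. Because the information stored in the action memory 103a is updated every time a motion is made, the reliability held by the reliability calculation unit 103b is also updated successively with each action.
【０１３５】
The action determination mechanism unit 103 of FIG. 14 controls the turning motion based on this reliability. For example, if the number of turning motions to the right is TR and the face was detected in FR of them, the reliability of face detection for the right direction calculated by the reliability calculation unit 103b is 100×FR/TR (%). On receiving direction information from the direction recognition unit 101B, the action determination mechanism unit 103 generates a random number from 1 to 100 and compares it with the face detection reliability stored in the reliability calculation unit 103b. When the random number is lower than the reliability, it has the robot turn in that direction; otherwise it controls the posture transition mechanism unit 104 so that no turning motion is made. The default value of the reliability is 100%.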
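The reliability formula and the random decision can be sketched as two small functions. The names `face_reliability` and `should_turn` are hypothetical. One detail is an assumption made here: the text turns when the random number is lower than the reliability, whereas the sketch also turns when the two are equal, so that the default reliability of 100% always produces a turn and a reliability of R% produces a turn with probability R%.

```python
import random


def face_reliability(turn_count, face_count):
    """Return 100 * FR / TR in percent, or the default 100 when TR is zero."""
    if turn_count == 0:
        return 100.0
    return 100.0 * face_count / turn_count


def should_turn(reliability_pct, rng=random):
    """Draw an integer from 1 to 100 and turn if it does not exceed the reliability.

    Treating the equal case as a turn is an assumption of this sketch.
    rng only needs a randint(a, b) method, which makes the decision testable.
    """
    return rng.randint(1, 100) <= reliability_pct
```

For a direction where 3 of 10 turns found a face, `face_reliability(10, 3)` is 30.0, so the robot would turn toward that direction for roughly 30% of the sounds detected there.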
【０１３６】
次に、図１５のフローチャートを参照して、図１４のメイン制御部６１を用いたロボット１の音声コマンドによる動作の処理を説明する。尚、図１５のステップＳ６１，Ｓ６２、ステップＳ６５乃至Ｓ７３の処理は、図９のフローチャートを参照して説明したステップＳ１乃至Ｓ１１の処理と同様であるので、その説明は省略する。
【０１３７】
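In step S63, the action determination mechanism unit 103 reads out, from the reliability calculation unit 103b, the face detection reliability corresponding to the detected sound source direction. On the first pass the reliability is 100%; thereafter it takes a value that reflects the recorded frequencies.
【０１３８】
In step S64, the action determination mechanism unit 103 generates a random number from 1 to 100 and determines, by comparing it with the reliability, whether to execute the turning motion. More specifically, it compares the generated random number with the reliability read out from the reliability calculation unit 103b; when the random number is lower than the reliability, it decides to execute the turning motion, and when the random number is higher than the reliability, it decides not to.
【０１３９】
When the action determination mechanism unit 103 determines in step S64 that, for example, the generated random number is lower than the reliability, that is, that the turning motion is to be executed, the processing proceeds to step S65.
【０１４０】
On the other hand, when it is determined in step S64 that the generated random number is higher than the reliability, the action determination mechanism unit 103 determines that the turning motion is not to be executed, and the processing returns to step S61.
【０１４１】
In step S74, the action determination mechanism unit 103 updates, based on the result of determining whether a face was detected, the frequency with which a face was detected and the frequency with which turning motions toward the sound source were executed, and stores them in the action memory 103a. The reliability calculation unit 103b then obtains the reliability from the updated frequencies and updates it.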
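The per-direction bookkeeping of steps S63 and S74 can be sketched as a small class. The class name `ActionMemory`, its method names, and the use of string labels such as "right" as direction keys are hypothetical; the specification does not say how directions are quantised.

```python
class ActionMemory:
    """Per-direction counters, loosely modelled on the action memory 103a."""

    def __init__(self):
        self.turns = {}   # direction -> number of turning motions (TR)
        self.faces = {}   # direction -> number of turns that found a face (FR)

    def record_turn(self, direction, face_found):
        """Step S74: update both frequencies after a turning motion."""
        self.turns[direction] = self.turns.get(direction, 0) + 1
        if face_found:
            self.faces[direction] = self.faces.get(direction, 0) + 1

    def reliability(self, direction):
        """Step S63: read the reliability in percent; 100 before any data."""
        turns = self.turns.get(direction, 0)
        if turns == 0:
            return 100.0
        return 100.0 * self.faces.get(direction, 0) / turns
```

Each direction is tracked independently, so repeated failures toward a reflecting wall lower the reliability of that direction only.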
【０１４２】
以上の処理により、振り向き動作毎に顔が検出される信頼度が更新されるので、例えば、天井や壁などにより音声が反響しやすい環境で、誤検出されやすい音源の方向に対しては、顔が検出される頻度に応じて振り向き動作を抑制することが可能となり、結果として、誤検出を起こし易い、無駄な方向への振り向き動作を抑制しつつ、精度の高い音声認識処理を実現させることが可能となる。
【０１４３】
また、ロボット１から見た方向は、例えば、ロボット１が進行方向を変化させながら歩行しているような場合、加速度検出信号Ｓ２Ｂなどを用いて方向毎の信頼度もその変化している進行方向に合わせて変化させるようにしてもよいし、前後左右といった方向ではなく、東西南北といった絶対方向をコンパスを用いて設定し、その方向ごとに信頼度を設定するようにしてもよい。
【０１４４】
上述した一連の処理は、ハードウェアにより実行させることもできるが、ソフトウェアにより実行させることもできる。一連の処理をソフトウェアにより実行させる場合には、そのソフトウェアを構成するプログラムが、専用のハードウェアに組み込まれているコンピュータ、または、各種のプログラムをインストールすることで、各種の機能を実行させることが可能な、例えば汎用のパーソナルコンピュータなどに記録媒体からインストールされる。
【０１４５】
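FIG. 16 shows the configuration of an embodiment of a personal computer used when the electrical internal configuration of the robot 1 of FIG. 6 is realized by software. The CPU 201 of the personal computer controls the overall operation of the personal computer. When the user inputs a command from the input unit 206, which consists of a keyboard, a mouse, and the like, via the bus 204 and the input/output interface 205, the CPU 201 executes the program stored in the ROM (Read Only Memory) 202 in response. Alternatively, the CPU 201 loads into the RAM (Random Access Memory) 203 and executes a program that was read from the magnetic disk 221, optical disc 222, magneto-optical disc 223, or semiconductor memory 224 connected to the drive 210 and installed in the storage unit 208. The series of processing described above is thereby realized by software. Furthermore, the CPU 201 controls the communication unit 209 to communicate with the outside and exchange data.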
【０１４６】
プログラムが記録されている記録媒体は、図１６に示すように、コンピュータとは別に、ユーザにプログラムを提供するために配布される、プログラムが記録されている磁気ディスク２２１（フレキシブルディスクを含む）、光ディスク２２２（ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｃ−ＲｅａｄＯｎｌｙＭｅｍｏｒｙ），ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｋ）を含む）、光磁気ディスク２２３（ＭＤ（Ｍｉｎｉ−Ｄｉｓｃ）を含む）、もしくは半導体メモリ２２４などよりなるパッケージメディアにより構成されるだけでなく、コンピュータに予め組み込まれた状態でユーザに提供される、プログラムが記録されているＲＯＭ２０２や、記憶部２０８に含まれるハードディスクなどで構成される。
【０１４７】
尚、本明細書において、記録媒体に記録されるプログラムを記述するステップは、記載された順序に沿って時系列的に行われる処理は、もちろん、必ずしも時系列的に処理されなくとも、並列的あるいは個別に実行される処理を含むものである。
【０１４８】
【発明の効果】
本発明によれば、音声認識処理における誤認識を抑制させることが可能となる。
【図面の簡単な説明】
【図１】本発明を適用したロボットの外装の外観斜視図を示す図である。
【図２】図１のロボットの内部の構成を示す斜視図である。
【図３】図２のロボットの内部の構成を示す、背後側の斜視図である。
【図４】図２のロボットの軸について説明するための略線図である。
【図５】図２のロボットの制御に関する部分を主に説明するためのブロック図である。
【図６】図１のロボットの制御の内部構成を示すブロック図である。
【図７】図６のメイン制御部の構成を示すブロック図である。
【図８】図７の音声認識部の構成を示すブロック図である。
【図９】ロボットの音声コマンドによる動作の処理を説明するフローチャートである。
【図１０】図９の振り向き動作の処理を説明するフローチャートである。
【図１１】振り向き動作を説明する図である。
【図１２】振り向き動作を説明する図である。
【図１３】ロボットの音声認識処理を説明するフローチャートである。
【図１４】図６のメイン制御部のその他の構成を示すブロック図である。
【図１５】図１４のメイン制御部の構成を用いたロボットの音声コマンドによる動作の処理を説明するフローチャートである。
【図１６】記録媒体を説明する図である。
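[Explanation of reference signs]
1 robot, 61 main control unit, 55 display unit, 63 sub-control unit, 71 external sensor unit, 72 speaker, 81L, 81R CCD cameras, 82 microphone, 101 state recognition information processing unit, 101A voice recognition unit, 101a control unit, 101B direction recognition unit, 101C pressure processing unit, 101D image recognition unit, 102 model storage unit, 103 action determination mechanism unit, 103a action memory, 104 posture transition mechanism unit, 105 voice synthesis unit
[0001]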
[Technical field of the invention]
The present invention relates to a robot control device and method, a recording medium, and a program, and more particularly to a robot control device and method, a recording medium, and a program that, by executing voice recognition processing only after confirming the presence of a user, can suppress errors caused by falsely detecting noise as voice.
[0002]
[Prior art]
In recent years, robots (in this specification, including stuffed-toy-like ones) having recognition functions such as a voice recognition device have been commercialized as toys and the like. For example, a robot equipped with a voice recognition device recognizes a voice uttered by a user and, based on the voice recognition result, autonomously performs an action such as making a certain gesture or outputting a synthesized sound.
[0003]
When a robot equipped with a voice recognition device recognizes a voice uttered by a user and the user is too far from the robot, the signal value of the user's voice waveform acquired by the microphone mounted on the robot is attenuated, and the noise level becomes relatively high. That is, the S/N ratio (signal-to-noise ratio) of the user's voice signal acquired by the microphone becomes low. In general, as the distance between the user (speaker) and the robot (the microphone mounted on it) increases, the waveform of the voice signal is more strongly affected by reverberation characteristics. Therefore, when the user is too far from the robot, the recognition accuracy of the robot's voice recognition device deteriorates.
[0004]
Conversely, when the user is too close to the robot, the signal value of the user's voice waveform acquired by the microphone mounted on the robot exceeds the detectable range of the microphone. The voice waveform acquired by the microphone is therefore saturated and distorted relative to the original voice waveform. Because the robot's voice recognition device then performs voice recognition on this distorted waveform, the accuracy of voice recognition deteriorates.
[0005]
Therefore, a method has been proposed that, along with the voice recognition result, performs situation detection, such as ambient noise detection, which detects the influence of ambient noise, and power shortage detection and power excess detection, which detect situations in which the power of the input voice satisfies specific threshold conditions, and uses the voice recognition result together with the situation detection result to address the deterioration of voice recognition accuracy in robots (see, for example, Non-Patent Document 1).
[0006]
[Non-patent document 1]
Iwasawa, Onaka, Fujita, "A Method of Speech Recognition Interface for Robots Using Situation Detection and Its Evaluation," Research Papers of the Japanese Society for Artificial Intelligence, Japan Society for Artificial Intelligence, November 2002, p. 33-38
[0007]
Furthermore, the operation sound of the robot itself has a great adverse effect on the accuracy of voice recognition because the noise source is close to the microphone. For example, when a robot having both hands moves a hand near a microphone and moves a finger or the like, very large noise is input to the microphone. Also, when a bipedal walking robot walks on a hard floor surface, the sound of the feet touching the floor surface becomes loud, and large noise may be input to the microphone.
[0008]
[Problems to be solved by the invention]
However, the method described in Non-Patent Document 1 does not consider noise generated by the robot itself. Therefore, for example, even though the user has said nothing to the robot, the robot may detect its own noise as voice, obtain an incorrect voice recognition result, and perform an incorrect operation. Consequently, the robot might act or speak inexplicably even though the user has said nothing to it. Furthermore, even when no user is nearby, the robot might behave as if one were.
[0009]
The present invention has been made in view of such a situation. When, for example, a voice recognition device mounted on a robot recognizes the voice of a recognition target such as a user, the invention improves the recognition accuracy of the device by recognizing the voice only after confirming the presence of the user.
[0010]
[Means for Solving the Problems]
The robot control device according to the present invention is characterized by including: voice detection means for detecting a voice; voice recognition means for recognizing the voice detected by the voice detection means; direction detection means for detecting the direction of the sound source of the voice; user detection means for detecting a user in the direction detected by the direction detection means; and voice recognition control means for performing control so that voice recognition by the voice recognition means is started when a user is detected by the user detection means.
[0011]
The device may further include imaging means for capturing an image and imaging control means for controlling the imaging means so as to capture an image in the direction detected by the direction detection means. When the imaging means is controlled by the imaging control means to capture an image in the direction detected by the direction detection means, the user detection means can detect the user based on whether or not the user appears in the image captured by the imaging means.
[0012]
The device may further include reliability detection means for detecting a reliability for each direction, based on the frequency with which the user's face is captured in the image taken by the imaging means when the imaging control means controls the imaging means to capture an image in the direction detected by the direction detection means. The imaging control means can then control the imaging means to image the direction detected by the direction detection means based on the reliability, detected by the reliability detection means, with which the user's face is detected in that direction.
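The per-direction reliability described above can be illustrated with a minimal sketch. The specification does not give a formula, so the following assumes (hypothetically) that reliability is the fraction of past looks in a direction bin that found a face, that directions are bucketed into 30-degree bins, and that the robot looks only when reliability is at or above a chosen minimum. The class name, method names, and default values are all illustrative.

```python
from collections import defaultdict


class DirectionReliability:
    """Tracks how often a face was found after looking toward each direction bin."""

    def __init__(self, bin_deg=30):
        self.bin_deg = bin_deg          # assumed bin width in degrees
        self.looks = defaultdict(int)   # number of looks per bin
        self.hits = defaultdict(int)    # number of looks that found a face

    def _bin(self, direction_deg):
        # Normalise to [0, 360) and map to a bin index.
        return int(direction_deg % 360) // self.bin_deg

    def record(self, direction_deg, face_found):
        b = self._bin(direction_deg)
        self.looks[b] += 1
        if face_found:
            self.hits[b] += 1

    def reliability(self, direction_deg):
        b = self._bin(direction_deg)
        if self.looks[b] == 0:
            return 1.0  # assumption: an unexplored direction is worth looking at
        return self.hits[b] / self.looks[b]

    def should_look(self, direction_deg, min_reliability=0.2):
        # Assumed policy: skip directions where faces are rarely found.
        return self.reliability(direction_deg) >= min_reliability
```

With such a table, a direction that repeatedly produces sound but never a face (for example, a television) would gradually stop triggering a look in that direction.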
[0013]
A robot control method according to the present invention is characterized by including: a voice detection step of detecting a voice; a voice recognition step of recognizing the voice detected in the processing of the voice detection step; a direction detection step of detecting the direction of the sound source of the voice; a user detection step of detecting a user in the direction detected in the processing of the direction detection step; and a voice recognition control step of performing control so that voice recognition in the processing of the voice recognition step is started when a user is detected in the processing of the user detection step.
[0014]
The program of the recording medium of the present invention is characterized by including: a voice detection step of detecting a voice; a voice recognition step of recognizing the voice detected in the processing of the voice detection step; a direction detection step of detecting the direction of the sound source of the voice; a user detection step of detecting a user in the direction detected in the processing of the direction detection step; and a voice recognition control step of performing control so that voice recognition in the processing of the voice recognition step is started when a user is detected in the processing of the user detection step.
[0015]
The program of the present invention causes a computer to execute processing including: a voice detection step of detecting a voice; a voice recognition step of recognizing the voice detected in the processing of the voice detection step; a direction detection step of detecting the direction of the sound source of the voice; a user detection step of detecting a user in the direction detected in the processing of the direction detection step; and a voice recognition control step of performing control so that voice recognition in the processing of the voice recognition step is started when a user is detected in the processing of the user detection step.
[0016]
In the robot control device and method and the program according to the present invention, a voice is detected, the direction of the sound source of the voice is detected, and a user is detected in the detected direction. When a user is detected, recognition of the detected voice is started.
[0017]
[Best Mode for Carrying Out the Invention]
Embodiments of the present invention are described below. First, to clarify the correspondence between each means of the invention described in the claims and the embodiments that follow, the features of the present invention are described with the corresponding embodiment (merely an example) added in parentheses after each means, as follows.
[0018]
That is, the robot control device of the present invention includes: voice detection means (for example, the microphones 82-1 to 82-N in FIG. 7) for detecting a voice; voice recognition means (for example, the voice recognition unit 101A in FIG. 7) for recognizing the voice detected by the voice detection means; direction detection means (for example, the direction recognition unit 101B in FIG. 7) for detecting the direction of the sound source of the voice; user detection means (for example, the image recognition unit 101D in FIG. 7) for detecting a user in the direction detected by the direction detection means; and voice recognition control means (for example, the control unit 101a in FIG. 7) for performing control so that voice recognition by the voice recognition means is started when a user is detected by the user detection means.
[0019]
Of course, this description does not mean that each means is limited to those described above.
[0020]
FIG. 1 is a schematic perspective view of the exterior of an embodiment of a bipedal walking robot 1 to which the present invention is applied. The robot 1 is a practical robot that supports human activities in various situations in the living environment and other aspects of everyday life. It can act according to its internal state (anger, sadness, joy, enjoyment, etc.) and can express the basic actions performed by humans.
[0021]
As shown in FIG. 1, the robot 1 has a head exterior unit 3 connected to a predetermined position of a trunk exterior unit 2, to which two left and right arm exterior units 4R/L (Right/Left: right arm/left arm) and two left and right leg exterior units 5R/L are also connected.
[0022]
Next, an internal configuration of the robot 1 will be described with reference to FIG. FIG. 2 shows the internal structure of the exterior parts shown in FIG.
[0023]
FIG. 2 is a perspective view of the inside of the robot 1 in the front direction, and FIG. 3 is a perspective view of the inside of the robot 1 from the back direction. FIG. 4 is a perspective view for describing a shaft configuration of the robot 1.
[0024]
In the robot 1, the head unit 12 is disposed above the torso unit 11, and the arm units 13A and 13B having the same configuration are attached to predetermined positions on the left and right of the upper part of the torso unit 11, respectively. Further, leg units 14A and 14B having the same configuration are attached to predetermined positions on the lower left and right of the body unit 11, respectively. The head unit 12 is provided with a touch sensor 51 and a display unit 55.
[0025]
In the torso unit 11, the frame 21 forming the upper trunk and the waist base 22 forming the lower trunk are connected to each other via a waist joint mechanism 23. By driving the actuators A1 and A2 of the waist joint mechanism 23, which is fixed to the waist base 22 of the lower trunk, the upper trunk can be rotated independently around the mutually orthogonal roll axis 24 and pitch axis 25 shown in FIG. 4.
[0026]
The head unit 12 is attached, via a neck joint mechanism 27, to the center of the upper surface of a shoulder base 26 fixed to the upper end of the frame 21. By driving the actuators A3 and A4 of the neck joint mechanism 27, the head unit 12 can be rotated independently around the mutually orthogonal pitch axis 28 and yaw axis 29 shown in FIG. 4.
[0027]
Further, the arm units 13A and 13B are attached to the left and right of the shoulder base 26 via the shoulder joint mechanism 30, respectively. By driving the actuators A5 and A6 of the corresponding shoulder joint mechanism 30, each arm unit can be rotated independently around the mutually orthogonal pitch axis 31 and roll axis 32 shown in the figure.
[0028]
The arm units 13A and 13B are each configured by connecting the actuator A8 forming the forearm to the output shaft of the actuator A7 forming the upper arm via the elbow joint mechanism 33, and attaching the hand 34 to the tip of the forearm.
[0029]
In the arm units 13A and 13B, the forearm can be rotated about the yaw axis 35 shown in FIG. 4 by driving the actuator A7, and can likewise be rotated about the pitch axis 36 shown in the same figure.
[0030]
The leg units 14A and 14B are each attached to the waist base 22 of the lower trunk via the hip joint mechanism 37. By driving the actuators A9 to A11 of the corresponding hip joint mechanism 37, each leg unit can be rotated independently around the mutually orthogonal yaw axis 38, roll axis 39, and pitch axis 40 shown in the figure.
[0031]
The leg units 14A and 14B are each configured by connecting the lower end of the frame 41 forming the thigh to the frame 43 forming the lower leg via the knee joint mechanism 42, and connecting the lower end of the frame 43 to a foot 45 via the ankle joint mechanism 44.
[0032]
Thus, in the leg units 14A and 14B, by driving the actuator A12 forming the knee joint mechanism 42, the lower leg can be rotated with respect to the pitch axis 46 shown in FIG. By driving the actuators A13 and A14 of the mechanism 44, respectively, the feet 45 can be independently rotated with respect to the orthogonal pitch axis 47 and roll axis 48 shown in FIG.
[0033]
On the back side of the waist base 22, which forms the lower trunk of the torso unit 11, a control unit 52 is arranged. The control unit 52 is a box containing a main control unit 61 and peripheral circuits 62 (both shown in FIG. 5), which are described later.
[0034]
FIG. 5 shows a configuration example of an actuator of the robot 1 and a control system thereof.
[0035]
The control unit 52 contains a main control unit 61 for controlling the operation of the entire robot 1, peripheral circuits 62 such as a power supply circuit and a communication circuit, a battery 74 (FIG. 6), and the like.
[0036]
The control unit 52 is connected to sub-control units 63A to 63D provided in the respective constituent units (the torso unit 11, the head unit 12, the arm units 13A and 13B, and the leg units 14A and 14B). It supplies the necessary power supply voltage to the sub-control units 63A to 63D and communicates with them.
[0037]
The sub-control units 63A to 63D are each connected to the actuators A1 to A14 in the corresponding constituent units and, based on the various control commands supplied from the main control unit 61, drive the actuators A1 to A14 in those units to the designated states.
[0038]
FIG. 6 is a block diagram illustrating an example of an electrical internal configuration of the robot 1.
[0039]
In the head unit 12, an external sensor unit 71, which includes CCD (Charge Coupled Device) cameras 81L and 81R functioning as the "eyes" of the robot 1, microphones 82-1 to 82-N functioning as its "ears", the touch sensor 51, and the like, and a speaker 72 functioning as its "mouth" are each provided at predetermined positions. An internal sensor unit 73 including a battery sensor 91 and an acceleration sensor 92 is provided in the control unit 52. In addition, a display unit 55 that displays the state of the robot 1 and responses to the user is provided.
[0040]
The CCD cameras 81L and 81R of the external sensor unit 71 image the surroundings and send the obtained image signal S1A to the main control unit 61. The microphones 82-1 to 82-N collect various command voices (voice commands) such as "walk", "stop", or "raise your right hand" given as voice input by the user, and send the obtained voice signal S1B to the main control unit 61. In the following, the microphones 82-1 to 82-N are referred to simply as the microphone 82 unless it is necessary to distinguish them.
[0041]
The touch sensor 51 is provided, for example, on the upper part of the head unit 12 as shown in FIGS. 2 and 3. It detects pressure applied by a physical action from the user such as "stroke" or "hit", and sends the detection result to the main control unit 61 as a pressure detection signal S1C.
[0042]
The battery sensor 91 of the internal sensor unit 73 detects the remaining energy of the battery 74 at a predetermined cycle, and sends the detection result to the main control unit 61 as a remaining battery detection signal S2A. The acceleration sensor 92 detects the acceleration of the movement of the robot 1 in three axial directions (x-axis, y-axis, and z-axis) at a predetermined cycle, and sends the detection result to the main control unit 61 as an acceleration detection signal S2B.
[0043]
The external memory 75 stores programs and data, control parameters, and the like, and supplies the programs and data to the memory 61A incorporated in the main control unit 61 as necessary. The external memory 75 receives data and the like from the memory 61A and stores them. Note that the external memory 75 is detachable from the robot 1.
[0044]
The main control unit 61 has a built-in memory 61A. The memory 61A stores programs and data, and the main control unit 61 performs various processes by executing the programs stored in the memory 61A. That is, the main control unit 61 determines the situation around and inside the robot 1, commands from the user, the presence or absence of actions by the user, and the like, based on the image signal S1A, the audio signal S1B, and the pressure detection signal S1C supplied from the CCD cameras 81L and 81R, the microphone 82, and the touch sensor 51 of the external sensor unit 71, respectively (hereinafter collectively referred to as the external sensor signal S1), and on the remaining battery detection signal S2A and the acceleration detection signal S2B supplied from the battery sensor 91 and the acceleration sensor 92 of the internal sensor unit 73, respectively (hereinafter collectively referred to as the internal sensor signal S2).
[0045]
Then, the main control unit 61 determines the action of the robot 1 based on the result of determining the situation around and inside the robot 1, commands from the user, or the presence or absence of actions by the user, on the control program stored in advance in the internal memory 61A, or on the various control parameters and the like stored in the external memory 75 loaded at that time. It generates a control command based on the determined action and sends it to the corresponding sub-control units 63A to 63D. The sub-control units 63A to 63D control the driving of the corresponding actuators among A1 to A14 based on the control command supplied from the main control unit 61. The robot 1 thereby performs actions such as swinging the head unit 12 up, down, left, and right, raising the arm unit 13A or the arm unit 13B, and walking by alternately driving the leg units 14A and 14B.
[0046]
In addition, the main control unit 61 outputs a sound based on the sound signal S3 to the outside by giving a predetermined sound signal S3 to the speaker 72 as necessary and, for example, when a voice is detected, displays a response to the user such as "Who?" on the display unit 55 based on a display signal S4. Further, the main control unit 61 outputs a drive signal to an LED (not shown) that is provided at a predetermined position of the head unit 12 and functions as an "eye" in appearance, thereby blinking the LED and causing it to function as the display unit 55.
[0047]
In this way, the robot 1 autonomously behaves based on surrounding and internal situations (states), commands from the user and presence / absence of action.
[0048]
FIG. 7 shows an example of a functional configuration of the main control unit 61 of FIG. The functional configuration shown in FIG. 7 is realized by the main control unit 61 executing a control program stored in the memory 61A.
[0049]
The main control unit 61 includes: a state recognition information processing unit 101 that recognizes a specific external state; a model storage unit 102 that stores models of the emotion, instinct, and growth state of the robot 1, which are updated based on the recognition result of the state recognition information processing unit 101 and the like; an action determination mechanism unit 103 that determines the action of the robot 1 based on the recognition result of the state recognition information processing unit 101 and the like; a posture transition mechanism unit 104 that actually causes the robot 1 to take an action based on the determination result of the action determination mechanism unit 103; and a voice synthesis unit 105 that generates a synthesized sound.
[0050]
A voice signal, an image signal, a pressure detection signal, and the like from the microphone 82, the CCD cameras 81L and 81R, the touch sensor 51, and the like are constantly input to the state recognition information processing unit 101 while the power of the robot 1 is turned on. Based on these signals, the state recognition information processing unit 101 recognizes a specific external state, a specific action from the user, an instruction from the user, and the like, and constantly outputs state recognition information representing the recognition result to the model storage unit 102 and the action determination mechanism unit 103.
[0051]
The state recognition information processing unit 101 includes a voice recognition unit 101A, a direction recognition unit 101B, a pressure processing unit 101C, and an image recognition unit 101D.
[0052]
The voice recognition unit 101A detects the presence or absence of a voice in the voice signal S1B provided from each of the microphones 82-1 to 82-N and, when a voice is detected, notifies the action determination mechanism unit 103 of the detection. When a voice is detected, the control unit 101a performs control so that the voice recognition process is started once the robot 1 has turned toward the sound source and the user's face has been detected, as indicated by a signal supplied from the image recognition unit 101D, and so that the voice recognition process is not executed at other times. That is, the voice recognition unit 101A is controlled by the control unit 101a so as to execute the voice recognition process only when the image recognition unit 101D recognizes the user's face, and not otherwise. As a result, the voice recognition unit 101A is controlled so as to reduce recognition errors caused by noise that occurs when no user is present.
[0053]
The voice recognition unit 101A performs voice recognition when the control unit 101a has placed it in a state in which the voice recognition process can be executed. The voice recognition unit 101A then notifies the model storage unit 102 and the action determination mechanism unit 103 of commands such as "walk", "stop", and "raise your right hand" and other voice recognition results as state recognition information.
[0054]
The direction recognition unit 101B recognizes (detects) the direction of the sound source from the power difference and the phase difference of the audio signals S1B supplied from the microphones 82-1 to 82-N, and supplies the recognition result to the action determination mechanism unit 103.
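As an illustration of how a direction can be derived from the phase (arrival-time) difference between microphones, the following is a minimal two-microphone sketch. It is not the method of the direction recognition unit 101B, which the specification does not detail: it ignores the power difference, assumes a far-field source and only two microphones, and estimates the delay by brute-force cross-correlation. The function names, the microphone spacing, and the sign convention (positive angles point toward the microphone that hears the sound first, here the "left" one) are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def best_lag(left, right, max_lag):
    """Return the lag in samples by which `right` trails `left`.

    The lag is found by maximising the cross-correlation of the two signals.
    """
    best, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def source_angle_deg(left, right, sample_rate, mic_spacing_m):
    """Estimate the source angle from the inter-microphone delay (far-field model)."""
    # The physically possible delay is bounded by the microphone spacing.
    max_lag = int(math.ceil(mic_spacing_m / SPEED_OF_SOUND * sample_rate))
    delay_s = best_lag(left, right, max_lag) / sample_rate
    # sin(angle) = c * delay / spacing, clamped to guard against rounding.
    s = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND / mic_spacing_m))
    return math.degrees(math.asin(s))
```

An estimate of this kind is what the action determination mechanism unit 103 would need in order to decide which way to turn the head.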
[0055]
The pressure processing unit 101C processes the pressure detection signal S1C provided from the touch sensor 51. When, as a result of the processing, the pressure processing unit 101C detects a pressure that is equal to or greater than a predetermined threshold and of short duration, it recognizes that the robot has been "hit (scolded)"; when it detects a pressure that is less than the predetermined threshold and of long duration, it recognizes that the robot has been "stroked (praised)". It notifies the model storage unit 102 and the action determination mechanism unit 103 of the recognition result as state recognition information.
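The hit/stroke rule above can be written as a small classifier. This is a sketch only: the specification gives no numeric values, so the pressure threshold, the boundary between "short" and "long", and the function name are hypothetical, and combinations the text does not mention are returned as unrecognized.

```python
def classify_touch(pressure, duration_s, pressure_threshold=0.5, short_duration_s=0.3):
    """Classify a touch as "hit", "stroke", or None (not recognized).

    pressure_threshold and short_duration_s are illustrative values,
    not taken from the specification.
    """
    strong = pressure >= pressure_threshold
    brief = duration_s < short_duration_s
    if strong and brief:
        return "hit"      # strong, short contact
    if not strong and not brief:
        return "stroke"   # gentle, sustained contact
    return None           # combinations the text does not cover
```

For example, `classify_touch(0.9, 0.1)` yields `"hit"` and `classify_touch(0.2, 1.5)` yields `"stroke"` under these assumed thresholds.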
[0056]
The image recognition unit 101D performs image recognition processing using the image signal S1A provided from the CCD cameras 81L and 81R. When, as a result of the processing, the image recognition unit 101D detects, for example, a "red round object" or a "plane that is perpendicular to the ground and at least a predetermined height", it notifies the model storage unit 102 and the action determination mechanism unit 103 of image recognition results such as "there is a ball" or "there is a wall", or of the detection of a human face, as state recognition information.
[0057]
Here, since the user is generally expected to speak to the robot 1 from the front in most cases, the CCD cameras 81L and 81R, which image the surroundings, are assumed to be installed in the head unit 12 (FIG. 2) so that their imaging direction faces the front of the robot 1.
[0058]
The posture transition mechanism unit 104 moves the head unit 12 toward the direction recognized by the direction recognition unit 101B, so that the CCD cameras 81L and 81R can image the user.
[0059]
The model storage unit 102 stores and manages an emotion model, an instinct model, and a growth model expressing the emotion, instinct, and growth state of the robot 1, respectively.
[0060]
Here, the emotion model expresses, for example, the state (degree) of emotions such as "joy", "sadness", "anger", and "fun" as values in a predetermined range (for example, −1.0 to 1.0), and changes those values based on the state recognition information from the state recognition information processing unit 101, the passage of time, and the like. The instinct model expresses the state (degree) of instinctive desires such as "appetite", "desire for sleep", and "desire for exercise" as values in a predetermined range, and changes those values based on the state recognition information from the state recognition information processing unit 101, the passage of time, and the like. The growth model expresses, for example, growth states (degrees) such as "childhood", "adolescence", "maturity", and "old age" as values in a predetermined range, and changes those values based on the state recognition information from the state recognition information processing unit 101, the passage of time, and the like.
[0061]
The model storage unit 102 sends the emotion, instinct, and growth state represented by the values of the emotion model, instinct model, and growth model as described above to the behavior determination mechanism unit 103 as state information.
[0062]
In addition to the state recognition information from the state recognition information processing unit 101, the model storage unit 102 is supplied by the action determination mechanism unit 103 with behavior information indicating the current or past behavior of the robot 1, specifically, for example, the content of a behavior such as "walked for a long time". Even when the same state recognition information is given, the model storage unit 102 generates different state information according to the behavior of the robot 1 indicated by the behavior information.
[0063]
That is, for example, when the robot 1 greets the user and is then stroked on the head, behavior information indicating that the robot 1 greeted the user and state recognition information indicating that the head was stroked are given to the model storage unit 102. In this case, the model storage unit 102 increases the value of the emotion model representing "joy".
[0064]
On the other hand, when the robot 1 is stroked on the head while performing any work, the behavior information indicating that the robot 1 is performing the work and state recognition information indicating that the robot has been stroked on the head are given to the model storage unit 102. In this case, the model storage unit 102 does not change the value of the emotion model representing “joy”.
[0065]
As described above, the model storage unit 102 sets the value of the emotion model while referring not only to the state recognition information but also to the behavior information indicating the current or past behavior of the robot 1. This avoids unnatural changes in emotion, such as increasing the value of the emotion model representing "joy" when the user mischievously strokes the robot's head while it is performing some task.
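The context-dependent update described in the preceding paragraphs can be sketched as follows. The value range of −1.0 to 1.0 and the greeting/working example come from the text; the step size, the string labels for recognition and behavior, and the class and method names are illustrative assumptions.

```python
def clamp(value, low=-1.0, high=1.0):
    """Keep a model value inside the predetermined range."""
    return max(low, min(high, value))


class EmotionModel:
    """Holds emotion values and updates them from recognition plus behavior."""

    def __init__(self):
        self.values = {"joy": 0.0, "sadness": 0.0, "anger": 0.0, "fun": 0.0}

    def update(self, recognition, behavior, step=0.2):
        # The same recognition result has a different effect depending on
        # what the robot was doing (labels and step size are assumptions).
        if recognition == "stroked_on_head" and behavior == "greeting_user":
            self.values["joy"] = clamp(self.values["joy"] + step)
        # Stroked while performing a task: "joy" is deliberately left unchanged.
        return self.values["joy"]
```

Calling `update("stroked_on_head", "greeting_user")` raises "joy", while `update("stroked_on_head", "working")` leaves it as it was, mirroring the two cases in the text.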
[0066]
Note that the model storage unit 102 also increases and decreases the values of the instinct model and the growth model based on both the state recognition information and the behavior information, as in the case of the emotion model. Further, the model storage unit 102 increases or decreases the values of the emotion model, the instinct model, and the growth model based on the values of other models.
[0067]
The action determining mechanism unit 103 determines the next action based on the state recognition information from the state recognition information processing unit 101, the state information from the model storage unit 102, the passage of time, and the like. When the determined action does not require voice recognition processing or image recognition processing, as with "dance", for example, the content of the action is sent to the posture transition mechanism unit 104 as action command information.
[0068]
In other words, the action determining mechanism unit 103 manages, as a behavior model that defines the behavior of the robot 1, a finite state automaton in which the actions that the robot 1 can take are associated with states. It transitions the state of this finite state automaton based on the state recognition information from the state recognition information processing unit 101, the values of the emotion model, the instinct model, or the growth model in the model storage unit 102, the passage of time, and the like, and determines the action corresponding to the state after the transition as the next action to be taken.
[0069]
Here, the action determining mechanism unit 103 changes the state when it detects a predetermined trigger. That is, the action determining mechanism unit 103 changes the state, for example, when the time during which the action corresponding to the current state has been executed reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 102 is equal to or less than a predetermined threshold.
[0070]
Note that, as described above, the behavior determining mechanism unit 103 performs the processing based on not only the state recognition information from the state recognition information processing unit 101 but also the values of an emotion model, an instinct model, a growth model, and the like in the model storage unit 102. Since the state in the behavior model is changed, even if the same state recognition information is input, the destination of the state is different depending on the value of the emotion model, the instinct model, and the growth model (state information).
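A finite state automaton of this kind, in which the destination of a transition depends on both the trigger and the current model values, might look like the sketch below. The state names, the trigger name, and the rule that chooses the destination are hypothetical examples, not taken from the specification.

```python
class BehaviorModel:
    """Finite state automaton whose transitions consult the state information."""

    def __init__(self, initial="idle"):
        self.state = initial
        # (current state, trigger) -> function(state_info) -> next state
        self.transitions = {}

    def add(self, state, trigger, chooser):
        self.transitions[(state, trigger)] = chooser

    def fire(self, trigger, state_info):
        """Apply a trigger; stay in the current state if no rule matches."""
        chooser = self.transitions.get((self.state, trigger))
        if chooser is not None:
            self.state = chooser(state_info)
        return self.state


def build_example_model():
    # Hypothetical rule: the same recognition result ("ball_seen") leads to
    # different states depending on the "joy" value of the emotion model.
    model = BehaviorModel()
    model.add("idle", "ball_seen",
              lambda info: "chase_ball" if info["joy"] > 0.0 else "ignore_ball")
    return model
```

Here `build_example_model().fire("ball_seen", {"joy": 0.5})` moves to `"chase_ball"`, whereas the same trigger with `{"joy": -0.5}` moves to `"ignore_ball"`.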
[0071]
When the voice recognition unit 101A of the state recognition information processing unit 101 outputs state recognition information indicating that a voice signal has been detected, the action determination mechanism unit 103 causes the posture transition mechanism unit 104 to turn the robot 1 toward the sound source. Then, with the robot 1 turned toward the sound source, when the image recognition unit 101D of the state recognition information processing unit 101 detects a user's face image, determined from a skin-color region or the like of the image signal, and notifies the control unit 101a of the voice recognition unit 101A of the detection, the control unit 101a controls the voice recognition unit 101A to start the voice recognition process.
[0072]
Then, the action determination mechanism unit 103 acquires the state recognition information (for example, information on the command recognized by the voice recognition unit 101A) supplied from the state recognition information processing unit 101 and, as described above, causes the robot 1 to perform an action that it has itself determined, such as "conversing with the user" or "waving a hand to the user" (the content of the action is sent to the posture transition mechanism unit 104 as action command information).
[0073]
As described above, the action determining mechanism unit 103 generates action command information for causing the robot 1 to speak, in addition to action command information for operating the head, limbs, and the like of the robot 1. The action command information that causes the robot 1 to speak is supplied to the voice synthesis unit 105, and it includes text and the like corresponding to the synthesized sound to be generated by the voice synthesis unit 105. Upon receiving the action command information from the action determination mechanism unit 103, the voice synthesis unit 105 generates a synthesized sound based on the text included in the action command information, and supplies the synthesized sound to the speaker 72 for output.
[0074]
In addition, the action determination mechanism unit 103 causes the display unit 55 to display, as a prompt, text corresponding to an utterance or, when no utterance is made, words that take the place of the utterance. For example, when the robot turns around after a voice is detected, text such as "Who?" or "What?" can be displayed as a prompt on the display unit 55 or output from the speaker 72.
[0075]
As described above, the posture transition mechanism unit 104 generates posture transition information for transitioning the posture of the robot 1 from the current posture to the next posture based on the action command information supplied from the action determination mechanism unit 103, and sends it to the sub-control units 63A to 63D.
[0076]
FIG. 8 is a functional block diagram illustrating the functions of the voice recognition unit 101A of the state recognition information processing unit 101.
[0077]
When the voice recognition unit 101A is not in a state in which it can execute the voice recognition process (the Disable state) and a signal indicating that a voice signal has been detected is received, the control unit 101a outputs a signal indicating the detection of the voice signal to the action determination mechanism unit 103. That is, in the Disable state, the voice recognition unit 101A does not execute the voice recognition process on the input voice signal.
[0078]
When a signal indicating that the user's face image has been detected is input from the image recognition unit 101D, the control unit 101a places the voice recognition unit 101A in a state in which it can execute the voice recognition process (the Enable state) and causes it to execute the process. The voice input to the microphone 82 is converted to a digital signal by an AD converter (not shown) and output to the feature extraction unit 121.
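The Enable/Disable gating performed by the control unit 101a can be summarized in a short sketch. The class and method names are hypothetical, the recognizer is passed in as a plain function, and the `on_face_lost` handler that returns the gate to the Disable state is an assumption: the passage above does not say when recognition is disabled again.

```python
class RecognitionGate:
    """Runs the recognizer only after a user's face has been detected."""

    def __init__(self, recognizer):
        self.enabled = False          # starts in the Disable state
        self.recognizer = recognizer  # function: audio -> recognition result
        self.notifications = []       # signals sent to the action decision side

    def on_voice_detected(self, audio):
        if not self.enabled:
            # Disable state: report the detection only, so that the robot
            # can turn toward the sound source; do not recognize anything.
            self.notifications.append("voice_detected")
            return None
        return self.recognizer(audio)

    def on_face_detected(self):
        self.enabled = True           # Enable state

    def on_face_lost(self):
        self.enabled = False          # assumption: not described in the text
```

Under this design, noise picked up while nobody is in view produces at most a turn of the head, never a recognized command.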
[0079]
The feature extracting unit 121 calculates a feature amount of the input audio signal.
[0080]
The recognition processing control unit 122 is configured to be able to perform recognition processing corresponding to a plurality of language models (vocabulary and grammar) in parallel, and is provided with recognition processing units 131-1 to 131-4, each of which is a module that performs recognition processing corresponding to one language model.
[0081]
In the recognition processing control unit 122, a recognition processing unit corresponding to a new language model can be added, or an unnecessary recognition processing unit can be deleted. In addition, the recognition processing can be stopped or started for each recognition processing unit. That is, by simultaneously driving a plurality of recognition processing units or switching the recognition processing units, it is possible to simultaneously drive a plurality of language models or switch language models.
[0082]
The recognition processing units 131-1 to 131-4 are provided, respectively, with matching units 141-1 to 141-4 that perform voice matching based on the feature amount calculated by the feature extraction unit 121, dictionary databases 142-1 to 142-4 in which information on vocabulary is stored, and grammar databases 143-1 to 143-4 in which information on grammar is stored. Further, an acoustic model database 132 in which information on acoustics is stored is connected to the matching units 141-1 to 141-4.
[0083]
In the following description, when it is not necessary to distinguish the recognition processing units 131-1 to 131-4 from one another, they are collectively referred to as the recognition processing unit 131. The same applies to other parts. In the example of FIG. 8, four recognition processing units 131-1 to 131-4 are shown, but three or fewer, or five or more, recognition processing units may be provided as necessary.
[0084]
The acoustic model database 132 is configured so that the same acoustic model can be shared and used by all the recognition processing units 131. This allows the memory consumed by the acoustic model and the processing for calculating scores against the acoustic model to be shared efficiently.
[0085]
The acoustic model database 132 stores acoustic models representing acoustic features such as individual phonemes and syllables in the language of the speech to be recognized. As the acoustic model, for example, an HMM (Hidden Markov Model) is used. The dictionary databases 142-1 to 142-4 store word dictionaries in which information regarding pronunciation (phonological information) is described for each word (vocabulary item) to be recognized. The grammar databases 143-1 to 143-4 store grammar rules (language models) that describe how the words registered in the word dictionaries of the dictionary databases 142-1 to 142-4 are chained (connected). As the grammar rule, for example, a description based on a context-free grammar (CFG), a statistical word chain probability (N-gram), or the like is used.
[0086]
Information about different vocabulary is stored in each of the dictionary databases 142-1 to 142-4, and information about different grammars is also stored in the grammar databases 143-1 to 143-4. A language model is determined by a combination of the dictionary database 142 and the grammar database 143.
[0087]
Next, processing of an operation by a voice command will be described with reference to the flowchart of FIG.
[0088]
In step S1, the voice recognition unit 101A determines whether or not a voice has been input (whether or not a voice has been detected) via the microphones 82-1 to 82-N. If it determines that no voice has been input, the process of step S1 is repeated. That is, the process of step S1 is repeated until it is determined that a voice has been detected (that a voice has been input).
[0089]
If it is determined in step S1 that a voice has been detected, that is, for example, it is determined that the user has called for an input of a voice command to the robot 1, the process proceeds to step S2.
[0090]
In step S2, the direction recognition unit 101B detects and recognizes the direction of the sound source based on the sound input through the microphones 82-1 to 82-N. That is, the direction recognition unit 101B detects and recognizes the direction of the sound source from the power difference and the phase difference of the audio signals S1B supplied from the microphones 82-1 to 82-N, and supplies the recognition result to the action determination mechanism unit 103.
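The phase-difference part of step S2 can be illustrated with a minimal sketch. It assumes only two microphones, a far-field source, and a cross-correlation search for the arrival-time difference; the specification does not state the actual algorithm, and the names `best_lag` and `bearing_deg`, the speed-of-sound constant, and the sign convention are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def best_lag(left, right, max_lag):
    """Lag (in samples) by which `right` trails `left`, found by cross-correlation."""
    def corr(lag):
        return sum(left[i] * right[i + lag]
                   for i in range(len(left))
                   if 0 <= i + lag < len(right))
    return max(range(-max_lag, max_lag + 1), key=corr)


def bearing_deg(left, right, sample_rate, mic_spacing):
    """Estimate the source bearing in degrees from two microphone signals.

    Positive angles mean the sound reached the left microphone first
    (hypothetical convention). 0 degrees is straight ahead.
    """
    # No physical delay can exceed the travel time between the microphones.
    max_lag = int(mic_spacing / SPEED_OF_SOUND * sample_rate) + 1
    delay = best_lag(left, right, max_lag) / sample_rate
    # Far-field model: delay * c = spacing * sin(angle); clamp against rounding.
    s = max(-1.0, min(1.0, delay * SPEED_OF_SOUND / mic_spacing))
    return math.degrees(math.asin(s))
```

A real array of N microphones would combine several such pairs, and the power difference mentioned in the text could be used to resolve front/back ambiguity; neither is attempted here.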
[0091]
In step S3, a process of turning toward the sound source is performed.
[0092]
Here, the turning operation process will be described with reference to the flowchart in FIG.
[0093]
In step S21, the action determination mechanism unit 103 calculates the difference between the current direction of the robot 1 and the direction of the sound source, based on the information on the direction of the sound source supplied from the direction recognition unit 101B of the state recognition information processing unit 101, and thereby obtains the relative angle of the sound source direction with respect to the trunk direction.
[0094]
In step S22, the action determination mechanism unit 103 determines the rotation angles of the neck joint and of the trunk (the vertical axis of the main body of the robot 1 rotated using the hip joint mechanism 37) that are necessary for rotating the head by the relative angle calculated in step S21. It does so based on the movable range of the yaw axis 29 of the neck joint mechanism 27 shown in FIG. 4 and on the maximum angle through which the trunk can be rotated in one rotation operation using the legs. Depending on the sound source direction, the action determination mechanism unit 103 may determine a rotation angle for the neck joint mechanism 27 only. Although the robot 1 has the yaw axis 38 of the hip joint mechanism 37 as shown in FIG. 4, for simplicity the present embodiment is described on the assumption that the yaw axis 38 of the hip joint mechanism 37 is not used. Needless to say, however, the robot can turn toward the sound source direction by coordinating the whole body, using the neck, the waist, and the grounding direction of the feet.
[0095]
This will be specifically described with reference to FIG. FIG. 11A shows an example in which the movable range of the neck of the robot 1 is ±Y degrees and the relative angle of the direction of the sound source S is X degrees with respect to the front direction of the robot 1. In this case, for the robot 1 to turn toward the sound source S, as shown in FIG. 11B, the entire trunk needs to be rotated by at least X−Y degrees using the legs, and the yaw axis 29 of the neck joint mechanism 27 needs to be rotated by Y degrees toward the sound source S.
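The split between trunk and neck rotation in FIG. 11A and FIG. 11B can be sketched as follows. This is a minimal illustration, assuming angles in degrees with the sign giving the turning direction; the function name `split_rotation` is hypothetical, and the per-step limit on leg rotation mentioned in step S22 is deliberately left out.

```python
def split_rotation(relative_angle, neck_limit):
    """Split a turn of `relative_angle` degrees into (trunk, neck) rotations.

    The neck takes as much of the turn as its movable range of +/- neck_limit
    allows; the trunk, rotated with the legs, takes the remainder (X - Y).
    """
    neck = max(-neck_limit, min(neck_limit, relative_angle))
    trunk = relative_angle - neck
    return trunk, neck


# Sound source at X = 120 degrees, neck range Y = 45 degrees:
# the trunk turns 75 degrees and the neck 45 degrees.
trunk, neck = split_rotation(120, 45)
```

When the sound source lies within the neck's movable range, the trunk component is zero, which corresponds to the case in which only the neck joint mechanism 27 is rotated.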
[0096]
In step S23, the action determination mechanism unit 103 supplies the control information of each joint necessary for rotating by the angles obtained in step S22 to the posture transition mechanism unit 104. Based on this information, the posture transition mechanism unit 104 turns the robot 1 toward the sound source by driving various actuators.
[0097]
In step S24, the action determination mechanism unit 103 calculates the rotation angles of the trunk and the neck required to directly face the sound source S. For example, as shown in FIG. 11B described above, suppose that the yaw axis 29 of the neck joint mechanism 27 has been rotated by Y degrees in the current posture of the robot 1, that is, the head is rotated by Y degrees with respect to the trunk. In this case, as shown in FIG. 11C, rotating the trunk by Y degrees while simultaneously rotating the yaw axis 29 of the neck joint mechanism 27 by −Y degrees eliminates the twist of the neck while the robot keeps gazing at the target, so that the robot can directly face the sound source S with a natural motion.
[0098]
In step S25, the posture transition mechanism unit 104 causes the robot 1 to execute the operation calculated in step S24 so as to directly face the sound source, and the action determination mechanism unit 103 causes the display unit 55 to display, as a prompt, text such as “Who?”, for example.
[0099]
The robot apparatus 1 recognizes (estimates) the sound source direction as described above, and can turn around the sound source direction by a natural motion by coordinating the whole body.
[0100]
For example, the robot 1 turns toward the sound source as shown in FIGS. 12A to 12F. That is, when a voice is input from behind while the robot 1 is facing the right side in the figure as shown in FIG. 12A, the robot 1 rotates its neck and rotates its trunk using the legs as shown in FIGS. 12B to 12F, and finally, as shown in FIG. 12F, faces the sound source direction on the left side in the figure. At this time, the action determination mechanism unit 103 may control the display unit 55 to display, for example, “What?” to indicate that the robot 1 is responding. As a result, a user who issues a command by voice can see that the robot 1 has reacted to the uttered voice and is responding.
[0101]
Here, the description returns to the flowchart of FIG.
[0102]
In step S4, the image recognition unit 101D of the state recognition information processing unit 101 performs a user face detection process based on image information input from each of the CCD cameras 81L and 81R. A human face may be detected, for example, by identifying the user's face image from a skin-color region or the like of the image signal. It is also possible to use, for example, the technique described in E. Osuna, R. Freund and F. Girosi, "Training support vector machines: an application to face detection", CVPR'97, 1997.
[0103]
In step S5, the image recognition unit 101D determines whether a face has been detected. If it is determined that a face has been detected, the process proceeds to step S6.
[0104]
In step S6, the image recognition unit 101D sends a signal indicating that a face has been detected to the voice recognition unit 101A. Based on this signal, the control unit 101a of the voice recognition unit 101A places the voice recognition unit 101A in a state in which the voice recognition process can be executed, that is, the Enable state, so that from this point on the voice recognition process can be executed on the voice signals supplied from the microphones 82-1 to 82-N.
[0105]
In step S7, a voice recognition process is performed.
[0106]
Here, the speech recognition processing will be described with reference to the flowchart in FIG.
[0107]
In step S41, the feature extraction unit 121 performs frequency analysis or the like on the digital audio signal at appropriate time intervals, thereby converting it into parameters representing the spectrum or other acoustic characteristics of the audio, and extracts these parameters as feature amounts.
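As a rough illustration of "frequency analysis at appropriate time intervals", the sketch below cuts the signal into overlapping frames and computes a power spectrum for each frame. The frame length, hop size, Hann window, and the use of a plain power spectrum are assumptions; the specification does not say which spectral parameters the feature extraction unit 121 actually uses.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames taken every `hop` samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame, computed with a direct DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def extract_features(signal, frame_len=64, hop=32):
    """One feature vector (power spectrum) per frame."""
    return [power_spectrum(f) for f in split_frames(signal, frame_len, hop)]
```

The direct DFT is used only to keep the sketch dependency-free; a practical implementation would use an FFT and would usually reduce the spectrum further before matching.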
[0108]
In step S42, the recognition processing control unit 122 selects a recognition processing unit to be driven.
[0109]
For example, assume that the robot 1 is performing chat, singing, and dancing with the user. At this time, in the robot 1, applications for chat, singing, and dancing are running. The robot 1 has one language model for each of chat, singing, and dancing with the user, and the recognition processing unit corresponding to each language model is driven. Further, it is assumed that one language model is used in common for all operations, and the recognition processing unit corresponding to this language model is driven. The language model commonly used for all operations is a language model for recognizing a command having a high degree of importance, such as "stop".
[0110]
At this time, based on the applications currently being executed, the robot 1 drives a recognition processing unit having the language model commonly used for all operations, a recognition processing unit having a language model for chatting with the user, a recognition processing unit having a language model for singing, and a recognition processing unit having a language model for dancing. Here, it is assumed that the recognition processing unit 131-1 has the language model commonly used for all operations, the recognition processing unit 131-2 has the chat language model, the recognition processing unit 131-3 has the singing language model, and the recognition processing unit 131-4 has the dancing language model.
[0111]
Therefore, the recognition processing control unit 122 selects the recognition processing units 131-1 to 131-4 as the recognition processing units to be driven. In other words, a total of four recognition processing units 131-1 to 131-4 are operated by the recognition processing control unit 122, and two recognition processing units correspond to each application: the common one and the one specific to that application.
[0112]
As described above, the recognition processing control unit 122 selects and drives the recognition processing unit having the language model corresponding to the application being executed.
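The selection in step S42 can be pictured with a small sketch in which each recognition processing unit is keyed by the application whose language model it holds, and a unit named "common" is always driven. The class names, the `running` flag, and the "common" key are hypothetical; the specification describes the behavior but not a data structure.

```python
from dataclasses import dataclass


@dataclass
class RecognitionUnit:
    name: str              # application served, or "common" for all operations
    vocabulary: frozenset  # stand-in for the dictionary and grammar databases
    running: bool = False


class RecognitionController:
    """Illustrative stand-in for the recognition processing control unit 122."""

    def __init__(self):
        self.units = {}

    def add(self, unit):
        # A unit for a new language model can be added at any time.
        self.units[unit.name] = unit

    def remove(self, name):
        # An unnecessary unit can be deleted.
        self.units.pop(name, None)

    def select(self, active_apps):
        """Drive the common unit plus the unit of every running application."""
        wanted = {"common"} | set(active_apps)
        for name, unit in self.units.items():
            unit.running = name in wanted
        return sorted(n for n, u in self.units.items() if u.running)
```

With units registered for "common", "chat", "singing", and "dancing", calling `select(["chat", "singing", "dancing"])` drives all four, matching the example in the text, while `select(["chat"])` drives only the common and chat units.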
[0113]
Thereafter, the process proceeds to step S43. The processing of steps S43 to S46 (hereinafter, the processing of steps S43 to S46 is also referred to as word sequence recognition processing) is executed in parallel by the recognition processing units 131-1 to 131-4.
[0114]
In step S43, the recognition processing units 131-1 to 131-4 match the feature amount of the voice output from the feature extraction unit 121 with the acoustic model database 132, and determine phonemes and syllables.
[0115]
In step S44, the recognition processing units 131-1 to 131-4 match the phonemes and syllables against the dictionary databases 142-1 to 142-4 and the grammar databases 143-1 to 143-4, and calculate the acoustic score and the language score.
[0116]
That is, the recognition processing units 131-1 to 131-4 compare the acoustic pattern of the input feature amount with the acoustic standard pattern corresponding to each word included in the dictionary database 142, and calculate an acoustic evaluation value as the acoustic score. When a bigram is used as the grammar, for example, the recognition processing units 131-1 to 131-4 calculate, as the language score, a numerical value representing the linguistic likelihood of each word from its chain probability with the immediately preceding word, based on the grammar database 143.
[0117]
In step S45, the recognition processing units 131-1 to 131-4 determine the word string with the highest evaluation by integrating the acoustic score and the language score. The process then proceeds to step S46, in which the word string is output to the model storage unit 102.
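The integration of the two scores in steps S44 and S45 might look like the following sketch, which assumes log-domain scores, a bigram table keyed by word pairs, and a simple weighted sum. The function names, the `<s>` start symbol, the probability floor for unseen pairs, and the weight are assumptions made for illustration, not details given in the specification.

```python
import math


def language_score(words, bigram, floor=1e-6):
    """Sum of log chain probabilities of each word given the preceding word."""
    score = 0.0
    prev = "<s>"  # assumed sentence-start symbol
    for word in words:
        # Unseen word pairs get a small floor probability instead of zero.
        score += math.log(bigram.get((prev, word), floor))
        prev = word
    return score


def best_word_string(candidates, bigram, lm_weight=1.0):
    """Pick the candidate whose combined acoustic and language score is highest.

    `candidates` is a list of (word tuple, acoustic log score) pairs.
    """
    return max(candidates,
               key=lambda c: c[1] + lm_weight * language_score(c[0], bigram))[0]
```

For instance, a candidate "top" with a slightly better acoustic score can still lose to "stop" when the bigram table makes "stop" far more likely at the start of an utterance.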
[0118]
For example, when the user utters “Today is good weather”, a word sequence such as “today”, “ha”, “good”, “weather”, and “is” is obtained as the recognition result.
[0119]
In this way, a word sequence is recognized from the input speech.
[0120]
Here, the description returns to the flowchart of FIG.
[0121]
In step S8, the action determining unit 103 determines an action based on a voice command composed of a word sequence supplied from the voice recognition unit 101A of the state recognition information processing unit 101, and outputs the determined action to the posture transition mechanism unit 104. The posture transition mechanism unit 104 controls the various actuators to perform an action corresponding to the determined action, and causes the robot 1 to act.
[0122]
In step S9, the control unit 101a of the voice recognition unit 101A of the state recognition information processing unit 101 determines whether or not there is an input of a voice command. If it determines that there is, the process returns to step S7. That is, as long as voice commands continue to be input, the processes of steps S7 to S9 are repeated, and the interaction with the user continues.
[0123]
If it is determined in step S9 that there is no voice command input, for example, when no voice command has been input for a predetermined period and there is no request from the user, then in step S10 the control unit 101a of the voice recognition unit 101A of the state recognition information processing unit 101 places the voice recognition unit 101A in a state in which the voice recognition process cannot be executed, that is, the Disable state. From this point on, the voice recognition process is not executed on the voice signals supplied from the microphones 82-1 to 82-N.
[0124]
In step S11, a process of turning back to the original direction is performed, and the process returns to step S1. This process is the same as the process of turning toward the sound source in step S3 of FIG. 9, described with reference to the flowchart in FIG. 10, except that the sound source direction is replaced with the original direction, and a description thereof is omitted.
[0125]
If no user face is detected in step S5, steps S6 to S10 are skipped, and the process proceeds to step S11.
[0126]
That is, a user who inputs a command by voice is normally at a distance at which the user can be visually recognized by the CCD camera 81 of the robot 1. Therefore, when a voice signal is detected in the process of step S1 (when a voice signal is detected for the first time), the robot 1 turns toward the sound source. Only when the user's face is detected in the direction of the sound source in the process of step S5 is the voice recognition process executed, using the voice signals detected by the microphone 82 at subsequent timings (the voice signals detected from the second time onward). Voice signals input at other timings are regarded as noise (not voice signals input as commands), and the voice recognition process is not performed on them. This makes it possible to suppress erroneous recognition due to noise generated in an environment where no user is present.
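The gating described above can be condensed into a small state machine: the first sound only triggers a turn and a face check, and recognition runs only on sounds that arrive while the Enable state holds. The class and its callbacks are hypothetical stand-ins for the control unit 101a, the image recognition unit 101D, and the recognition pipeline; turning motions and timing are not modeled.

```python
class CommandListener:
    """Sketch of the Enable/Disable gating of voice recognition."""

    def __init__(self, face_in_direction, recognize):
        self.face_in_direction = face_in_direction  # callable: direction -> bool
        self.recognize = recognize                  # callable: audio -> result
        self.enabled = False                        # Disable state at start

    def on_sound(self, audio, direction):
        if not self.enabled:
            # First sound: turn toward it and look for a face (steps S1 to S6).
            # The triggering sound itself is never passed to recognition.
            self.enabled = self.face_in_direction(direction)
            return None
        # Enable state: later sounds are treated as commands (step S7).
        return self.recognize(audio)

    def on_silence_timeout(self):
        # No command for a predetermined period: back to Disable (step S10).
        self.enabled = False
```

A door slam from a direction with no face leaves the listener disabled, so nothing is recognized; a call from the user's direction enables it, and only the following utterance is recognized as a command.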
[0127]
The operation of turning toward the sound source direction detected from the detected sound only needs to make it possible to detect the user's face: only the head of the robot 1 may be turned, or the entire body of the robot 1 may be turned. At this time, a highly directional microphone may also be pointed in the sound source direction. Doing so indicates to the user issuing the voice command that the robot 1 is reacting, and allows the voice signal required for voice recognition to be acquired from the sound source with high accuracy. As a result, erroneous recognition due to noise or the like in the voice recognition process can be suppressed.
[0128]
In the above description of the processing of the operation by voice command, the robot 1 may have been performing another operation unrelated to that processing. In that case, when a voice is detected in the process of step S1 and the direction of the sound source is detected, the operation in progress is interrupted. After the turning operation back to the original direction is completed in step S11, the operation that was in progress before the processing of the operation by voice command is resumed.
[0129]
In the above processing, the direction of the sound source is detected from the first detected voice signal, and the turning operation toward that direction is executed. However, the sound may be heard from a direction different from the actual direction of the sound source. That is, sound reverberates off a ceiling, walls, or the like around the robot 1, and if the direction is detected from the reverberated sound, it becomes more likely that a direction different from the actual direction of the sound source is erroneously detected. In that case, however many times the robot 1 turns in the erroneously detected direction, the face of the user issuing the voice command cannot be detected, and unnecessary turning operations may be repeated.
[0130]
Therefore, to cope with reverberation as described above, the frequency at which a face could be detected may be stored for each direction, and the reliability of each sound source direction may be determined according to the frequency at which a face could not be detected (or the frequency at which the user's face could be detected). When a direction with low reliability of face detection is detected as the sound source direction, the turning operation may be skipped at a predetermined rate (that is, the voice signal may be ignored even though it was detected).
[0131]
FIG. 14 is a block diagram showing another configuration of an embodiment of the main control unit 61 of the robot 1, which stores the frequency at which the user's face was detected, calculates the reliability for each direction from that frequency, and skips the turning operation according to the reliability.
[0132]
The main control unit 61 in FIG. 14 is basically the same in configuration as the main control unit 61 in FIG. 7. It differs in that the action determination mechanism unit 103 includes an action memory 103a and a reliability calculation unit 103b, the reliability calculation unit 103b calculates the reliability for each direction based on the information stored in the action memory 103a, and the posture transition mechanism unit 104 is controlled according to the reliability.
[0133]
The action memory 103a is a memory that stores the actions determined by the action determination mechanism unit 103. When the robot 1 turns toward a sound source, the action memory 103a updates and stores, for each direction, the frequency of turning operations and the frequency at which the user's face was detected.
[0134]
The reliability calculation unit 103b calculates and stores, as a percentage, the reliability with which the user's face is detected for each direction based on the information stored in the behavior memory 103a. Since the information stored in the behavior memory 103a is updated every time an operation is performed, the reliability calculation unit 103b also sequentially updates the reliability in accordance with each behavior.
[0135]
The action determination mechanism unit 103 in FIG. 14 controls the turning operation based on the reliability. For example, when the frequency of turning to the right is TR and the frequency of detecting a face there is FR, the reliability calculation unit 103b calculates the reliability of detecting a face to the right as 100 × FR / TR (%). Based on the direction information input from the direction recognition unit 101B, the action determination mechanism unit 103 generates a random number from 1 to 100 and compares it with the face detection reliability stored in the reliability calculation unit 103b for that direction. When the random number is lower than the reliability, it controls the posture transition mechanism unit 104 to perform the turning operation in that direction; otherwise, it controls the posture transition mechanism unit 104 not to perform the turning operation. The default value of the reliability is 100%.
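The reliability formula and the random-number comparison can be written out as a short sketch. The function names are hypothetical. The text does not say what happens when the random number equals the reliability; the sketch assumes the robot turns in that case, so that the default reliability of 100% always produces a turn.

```python
import random


def reliability_percent(face_count, turn_count):
    """100 x FR / TR (%), with the default of 100% before any turn is recorded."""
    if turn_count == 0:
        return 100.0
    return 100.0 * face_count / turn_count


def should_turn(reliability, rng=random):
    """Decide whether to turn by comparing a random number from 1 to 100.

    Assumption: equality counts as "turn", so 100% reliability always turns.
    """
    return rng.randint(1, 100) <= reliability
```

With FR = 3 and TR = 4 the reliability is 75%, so the robot turns toward that direction in roughly three out of four detections and ignores the rest.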
[0136]
Next, with reference to the flowchart of FIG. 15, the processing of the operation of the robot 1 using the voice command using the main control unit 61 of FIG. 14 will be described. Note that the processing in steps S61 and S62 and steps S65 to S73 in FIG. 15 is the same as the processing in steps S1 to S11 described with reference to the flowchart in FIG. 9, and a description thereof will be omitted.
[0137]
In step S63, the action determining mechanism unit 103 reads the reliability stored in the reliability calculation unit 103b for detecting a face corresponding to the direction of the detected sound source. In the case of the first process, the reliability is 100%, and thereafter, the value is in accordance with the frequency.
[0138]
In step S64, the action determination mechanism unit 103 generates a random number from 1 to 100 and determines whether or not to perform the turning operation by comparing the generated random number with the reliability. More specifically, the action determination mechanism unit 103 compares the random number with the reliability read from the reliability calculation unit 103b. When the random number is lower than the read reliability, it determines that the turning operation is to be executed; when the random number is higher than the reliability, it determines that the turning operation is not to be executed.
[0139]
In step S64, for example, when the action determining mechanism unit 103 determines that the generated random number is lower than the reliability, that is, when it determines that the turning motion is to be executed, the process proceeds to step S65.
[0140]
On the other hand, when it is determined in step S64 that the generated random number is higher than the reliability, the action determining mechanism unit 103 determines that the turning motion is not performed, and the process returns to step S61.
[0141]
In step S74, the action determination mechanism unit 103 updates, based on the result of determining whether or not a face was detected, the frequency at which a face was detected and the frequency at which the turning operation toward the sound source direction was executed, and stores them in the action memory 103a. The reliability calculation unit 103b then recalculates the reliability from the updated frequencies and updates it.
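The bookkeeping of step S74 can be sketched as a per-direction counter. The class below is a hypothetical stand-in for the action memory 103a combined with the reliability calculation unit 103b; direction labels such as "right" are illustrative, and how finely directions are divided is not stated in the specification.

```python
from collections import defaultdict


class DirectionMemory:
    """Per-direction counts of turning operations and face detections."""

    def __init__(self):
        self.turns = defaultdict(int)
        self.faces = defaultdict(int)

    def record(self, direction, face_detected):
        # Called once per completed turning operation (step S74).
        self.turns[direction] += 1
        if face_detected:
            self.faces[direction] += 1

    def reliability(self, direction):
        """Face detection reliability in percent; 100 before any turn."""
        turns = self.turns.get(direction, 0)
        if turns == 0:
            return 100.0
        return 100.0 * self.faces.get(direction, 0) / turns
```

A direction that keeps producing turns without a face, such as one dominated by echoes from a wall, sees its reliability fall toward 0%, while untried directions keep the default of 100%.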
[0142]
With the above processing, the reliability of face detection is updated at each turning operation. For example, in an environment where sound readily reverberates off a ceiling or walls, turning toward a sound source direction that is likely to be erroneously detected can be suppressed according to the frequency at which a face is detected there. As a result, highly accurate voice recognition can be realized while suppressing useless turning operations in directions prone to erroneous detection.
[0143]
The directions are defined as seen from the robot 1. For example, when the robot 1 walks while changing its traveling direction, the reliability assigned to each direction may be shifted accordingly using the acceleration detection signal S2B or the like. Alternatively, absolute directions such as north, south, east, and west may be set using a compass instead of directions such as front, rear, left, and right, and the reliability may be set for each of those directions.
[0144]
The series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, the program constituting the software is installed from a recording medium into a computer built into dedicated hardware, or into, for example, a general-purpose personal computer capable of executing various functions when various programs are installed.
[0145]
FIG. 16 shows a configuration of an embodiment of a personal computer in a case where the electrical internal configuration of the robot 1 of FIG. 6 is realized by software. The CPU 201 of the personal computer controls the entire operation of the personal computer. When a user inputs a command from an input unit 206 including a keyboard, a mouse, and the like via a bus 204 and an input/output interface 205, the CPU 201 executes a program stored in a ROM (Read Only Memory) 202 in response to the command. Alternatively, the CPU 201 loads into a RAM (Random Access Memory) 203 a program that has been read from the magnetic disk 221, the optical disk 222, the magneto-optical disk 223, or the semiconductor memory 224 connected to the drive 210 and installed in the storage unit 208, and executes it. The functions described above are thereby realized by software. Further, the CPU 201 controls the communication unit 209 to communicate with the outside and to transmit and receive data.
[0146]
As shown in FIG. 16, the recording medium on which the program is recorded is constituted not only by package media that are distributed separately from the computer to provide the program to the user, such as the magnetic disk 221 (including a flexible disk), the optical disk 222 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), the magneto-optical disk 223 (including an MD (Mini-Disc)), or the semiconductor memory 224 on which the program is recorded, but also by the ROM 202 storing the program and the hard disk included in the storage unit 208, which are provided to the user already incorporated in the computer.
[0147]
In this specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the order described, but also processing that is executed in parallel or individually and not necessarily in time series.
[0148]
[Effect of the invention]
According to the present invention, it is possible to suppress erroneous recognition in the voice recognition processing.
[Brief description of the drawings]
FIG. 1 is a perspective view showing an external appearance of an exterior of a robot to which the present invention is applied.
FIG. 2 is a perspective view showing an internal configuration of the robot shown in FIG.
FIG. 3 is a rear perspective view showing the internal configuration of the robot shown in FIG. 2;
FIG. 4 is a schematic diagram for explaining axes of the robot in FIG. 2;
FIG. 5 is a block diagram for mainly explaining a portion related to control of the robot in FIG. 2;
FIG. 6 is a block diagram showing an internal configuration of control of the robot shown in FIG. 1;
FIG. 7 is a block diagram illustrating a configuration of a main control unit in FIG. 6;
FIG. 8 is a block diagram illustrating a configuration of a speech recognition unit in FIG. 7;
FIG. 9 is a flowchart illustrating processing of an operation by a voice command of the robot.
FIG. 10 is a flowchart illustrating processing of a turning operation in FIG. 9;
FIG. 11 is a diagram illustrating a turning operation.
FIG. 12 is a diagram illustrating a turning operation.
FIG. 13 is a flowchart illustrating a voice recognition process of the robot.
FIG. 14 is a block diagram illustrating another configuration of the main control unit in FIG. 6;
FIG. 15 is a flowchart illustrating processing of an operation by a voice command of the robot using the configuration of the main control unit in FIG.
FIG. 16 is a diagram illustrating a recording medium.
[Explanation of symbols]
1 robot, 61 main control unit, 55 display unit, 63 sub control unit, 71 external sensor unit, 72 speaker, 81L, 81R CCD camera, 82 microphone, 101 state recognition information processing unit, 101A voice recognition unit, 101a control unit, 101B direction recognition section, 101C pressure processing section, 101D image recognition section, 102 model storage section, 103 action determination mechanism section, 103a action memory, 104 attitude transition mechanism section, 105 voice synthesis section

Claims

A robot control device comprising:
voice detection means for detecting voice;
voice recognition means for recognizing the voice detected by the voice detection means;
direction detection means for detecting the direction of the sound source of the voice;
user detection means for detecting a user in the direction detected by the direction detection means; and
voice recognition control means for controlling the voice recognition means so as to start voice recognition when a user is detected by the user detection means.
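
For illustration only, the control flow recited in this claim can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names VoiceEvent, control_voice_recognition, detect_user and recognize are hypothetical, and the direction is assumed to be expressed in degrees relative to the front of the robot.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VoiceEvent:
    """Hypothetical container for one detected sound."""
    samples: bytes        # raw audio picked up by the microphones
    direction_deg: float  # assumed convention: degrees relative to the robot's front


def control_voice_recognition(
    event: Optional[VoiceEvent],
    detect_user: Callable[[float], bool],
    recognize: Callable[[bytes], str],
) -> Optional[str]:
    """Start recognition only when a user is found in the sound-source direction."""
    if event is None:
        # Nothing was detected by the voice detection stage.
        return None
    if not detect_user(event.direction_deg):
        # No user in that direction: treat the sound as noise and
        # never start the recognizer.
        return None
    return recognize(event.samples)
```

Passing the user detector and the recognizer in as callables keeps the gating logic separate from any particular sensor or recognition engine; for example, detect_user could turn the robot toward the source and look for a face.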

The robot control device according to claim 1, further comprising:
imaging means for capturing an image; and
imaging control means for controlling the imaging means so as to capture an image in the direction detected by the direction detection means,
wherein, when the imaging means, under the control of the imaging control means, captures an image in the direction detected by the direction detection means, the user detection means detects the user based on whether or not the user's face appears in the image captured by the imaging means.

The robot control device according to claim 2, further comprising reliability detection means for detecting a reliability for each direction based on the frequency with which the user's face is or is not found in the image captured by the imaging means when the imaging means, under the control of the imaging control means, captures an image in the direction detected by the direction detection means,
wherein the imaging control means controls the imaging means so as to capture an image in the direction detected by the direction detection means, based on the reliability of detection of the user's face detected by the reliability detection means.
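
As one possible reading of the per-direction reliability recited in this claim, the Python sketch below counts how often a face was found after looking in each direction. Everything here is an assumption made for illustration: the class name DirectionReliability, the 30-degree direction bins, the prior value used for directions not yet observed, and the threshold policy in should_look are not taken from the specification.

```python
from collections import defaultdict


class DirectionReliability:
    """Tracks, per direction bin, how often a face was found after looking there."""

    def __init__(self, bin_deg: float = 30.0, prior: float = 0.5):
        self.bin_deg = bin_deg   # assumed angular width of one direction bin
        self.prior = prior       # assumed reliability for unobserved directions
        self._trials = defaultdict(int)
        self._hits = defaultdict(int)

    def _bin(self, direction_deg: float) -> int:
        # Normalize to [0, 360) and map to a bin index.
        return int((direction_deg % 360) // self.bin_deg)

    def record(self, direction_deg: float, face_found: bool) -> None:
        """Store the outcome of one look in the given direction."""
        b = self._bin(direction_deg)
        self._trials[b] += 1
        if face_found:
            self._hits[b] += 1

    def reliability(self, direction_deg: float) -> float:
        """Fraction of looks in this direction bin that found a face."""
        b = self._bin(direction_deg)
        n = self._trials[b]
        return self.prior if n == 0 else self._hits[b] / n

    def should_look(self, direction_deg: float, threshold: float = 0.2) -> bool:
        # Assumed policy: skip directions where faces are rarely found.
        return self.reliability(direction_deg) >= threshold
```

Under this assumed policy, a direction that repeatedly produces sound without a face, such as one facing a television, would drop below the threshold and the robot would stop turning toward it.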

A robot control method comprising:
a voice detection step of detecting voice;
a voice recognition step of recognizing the voice detected in the voice detection step;
a direction detection step of detecting the direction of the sound source of the voice;
a user detection step of detecting a user in the direction detected in the direction detection step; and
a voice recognition control step of performing control so as to start voice recognition in the voice recognition step when a user is detected in the user detection step.

A recording medium on which a computer-readable program is recorded, the program comprising:
a voice detection step of detecting voice;
a voice recognition step of recognizing the voice detected in the voice detection step;
a direction detection step of detecting the direction of the sound source of the voice;
a user detection step of detecting a user in the direction detected in the direction detection step; and
a voice recognition control step of performing control so as to start voice recognition in the voice recognition step when a user is detected in the user detection step.

A program for causing a computer to execute processing comprising:
a voice detection step of detecting voice;
a voice recognition step of recognizing the voice detected in the voice detection step;
a direction detection step of detecting the direction of the sound source of the voice;
a user detection step of detecting a user in the direction detected in the direction detection step; and
a voice recognition control step of performing control so as to start voice recognition in the voice recognition step when a user is detected in the user detection step.