JP2004170756A

JP2004170756A - Unit and method for robot control, recording medium, and program

Info

Publication number: JP2004170756A
Application number: JP2002337808A
Authority: JP
Inventors: Hiroaki Ogawa; 浩明小川
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-11-21
Filing date: 2002-11-21
Publication date: 2004-06-17

Abstract

PROBLEM TO BE SOLVED: To store a word registered during normal dialogue in association with a motion of a robot, so that an unknown word and a robot motion are linked and the word can be used to drive the robot. SOLUTION: When a speech recognition result matches template 1, which registers a pose name, in step S121, the cluster ID and the category of the word are stored in association with each other in step S122, and the cluster ID and the actuator control angles are stored in association with each other in step S123. When the speech recognition result matches template 2, which commands a pose, in step S124, the actuator control angle information is extracted based on the cluster ID in step S125 and the actuators are controlled in step S126. When the speech recognition result matches template 3, which commands storage of a character name or a user name, in step S127, the cluster ID and the category are stored in association with each other in step S128; if template 3 does not match, predetermined response processing is carried out in step S129. The invention is applicable to a robot. COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、ロボット制御装置および方法、記録媒体、並びにプログラムに関し、特に、音声認識、音声出力、および、駆動が可能なロボットを制御する場合に用いて好適な、ロボット制御装置および方法、記録媒体、並びにプログラムに関する。
【０００２】
【従来の技術】
対話システムにおいて、何かの名前を音声で登録するという場面は、多く発生する。例えば、ユーザが自分の名前を登録したり、対話システムに名前をつけたり、地名や店名を入力したりするという場面である。
【０００３】
従来、このような音声登録を簡単に実現する方法としては、何かのコマンドによって登録モードに移行して、登録が終了したら通常の対話モードに戻るというものがある。この場合、例えば、「ユーザ名登録」という音声コマンドによって登録モードに移行して、その後でユーザが名前を発声したらそれが登録され、その後、通常モードに戻る処理が行われる。
【０００４】
例えば、音声認識可能なロボットに、名前を付けてほしいことを表す行動を起こさせることによって、ロボット名登録モードに移行することをユーザに通知するようにし、ロボットの行動が制御された後に入力された音声から、最適音素列を検出し、最適音素列を、名前として登録するようにした技術がある（例えば、特許文献１参照）。
【０００５】
【特許文献１】
特開２００２−１２０１７７号公報
【０００６】
【発明が解決しようとする課題】
しかしながら、このような音声登録の方法では、コマンドによるモード切換えをしなければならず、対話としては不自然であり、ユーザにとっては煩わしいという課題がある。また、名付ける対象が複数存在する場合、コマンドの数が増えるため、いっそう煩わしくなる。
【０００７】
例えば、上述したように、ロボットの行動により、登録モードへの移行をユーザに通知するようにした場合、名付ける対象によってロボットの動作を変更し、ユーザが、ロボットの動作と登録内容の関係を把握しておかなければならないなど、音声登録が非常に煩わしいものとなってしまう。
【０００８】
更に、登録モード中に、ユーザが名前以外の単語（例えば、「こんにちは」）を話してしまった場合、名前以外の単語も名前として登録されてしまう。また、例えば、「太郎」という名前だけではなく、「私の名前は太郎です。」といったように、ユーザが名前以外の言葉を付加して話した場合、全体（「私の名前は太郎です。」）が名前として登録されてしまう。
【０００９】
また、登録された単語は、登録された単語を用いた音声出力処理や、登録後に同じ単語がユーザにより発声された場合の認識処理に用いられるのみであった。
【００１０】
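The present invention has been made in view of these circumstances. It makes it possible to store a word registered during normal dialogue in association with a motion of the robot and to use that word for the robot's motion.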
【００１１】
【課題を解決するための手段】
本発明の第１のロボット制御装置は、連続する入力音声を認識する認識手段と、認識手段により認識された認識結果に、未知語が含まれていると判定された場合、未知語に対応する単語を獲得する獲得手段と、獲得手段により獲得された単語を、ロボットの動作を制御する情報に関連付けて登録する登録手段とを備えることを特徴とする。
【００１２】
認識手段により認識された認識結果が特定のパターンにマッチするか否かを判定するパターン判定手段を更に備えさせるようにすることができ、パターン判定手段により、認識結果が特定のパターンにマッチしていると判定された場合、登録手段には、単語を、ロボットの動作を制御する情報に関連付けて登録させるようにすることができる。
【００１３】
ロボットの状態を検知する検知手段を更に備えさせるようにすることができ、検知手段には、認識結果が特定のパターンにマッチしていると判定された時点でのロボットの状態を検知させるようにすることができ、登録手段には、単語と、検知手段により検知されたロボットの状態となるようにロボットの動作を制御する情報とを関連付けて登録させるようにすることができる。
【００１４】
ロボットの駆動を制御する制御手段と、認識手段により認識された認識結果が特定のパターンにマッチするか否かを判定するパターン判定手段とを更に備えさせるようにすることができ、パターン判定手段により、認識結果が特定のパターンにマッチしていると判定された場合、制御手段には、登録手段により、単語に関連付けられて登録されたロボットの動作を制御する情報に基づいて、ロボットの駆動を制御させるようにすることができる。
【００１５】
獲得手段により獲得された単語を、複数のカテゴリに分類して記憶する記憶手段を更に備えさせるようにすることができ、登録手段には、記憶手段において、所定のカテゴリで記憶された単語を、ロボットの動作を制御する情報に関連付けて登録させるようにすることができる。
【００１６】
本発明の第１のロボット制御方法は、連続する入力音声を認識する認識ステップと、認識ステップの処理により認識された認識結果に、未知語が含まれているか否かを判定する判定ステップと、判定ステップの処理により、認識結果に、未知語が含まれていると判定された場合、未知語に対応する単語を獲得する獲得ステップと、獲得ステップの処理により獲得された単語を、ロボットの動作を制御する情報に関連付けて登録する登録ステップとを含むことを特徴とする。
【００１７】
本発明の第１の記録媒体に記録されているプログラムは、連続する入力音声を認識する認識ステップと、認識ステップの処理により認識された認識結果に、未知語が含まれているか否かを判定する判定ステップと、判定ステップの処理により、認識結果に、未知語が含まれていると判定された場合、未知語に対応する単語を獲得する獲得ステップと、獲得ステップの処理により獲得された単語を、ロボットの動作を制御する情報に関連付けて登録する登録ステップとを含むことを特徴とする。
【００１８】
本発明の第１のプログラムは、連続する入力音声を認識する認識ステップと、認識ステップの処理により認識された認識結果に、未知語が含まれているか否かを判定する判定ステップと、判定ステップの処理により、認識結果に、未知語が含まれていると判定された場合、未知語に対応する単語を獲得する獲得ステップと、獲得ステップの処理により獲得された単語を、ロボットの動作を制御する情報に関連付けて登録する登録ステップとを含むことを特徴とする。
【００１９】
本発明の第１のロボット制御装置および方法、並びに、プログラムにおいては、連続する入力音声が認識され、認識結果に未知語が含まれているか否かが判定され、未知語が含まれていると判定された場合、未知語に対応する単語が獲得されて、獲得された単語が、ロボットの動作を制御する情報に関連付けられて登録される。
【００２０】
本発明の第２のロボット制御装置は、ロボットの駆動を制御する制御手段と、ロボットの状態を示す情報を、対応する単語に関連付けて登録する登録手段と、制御手段により駆動が制御されたロボットの状態を検知する検知手段と、音声を合成する音声合成手段と、音声合成手段により合成された音声を出力する出力手段とを備え、検知手段により検知されたロボットの状態が、登録手段により登録されているロボットの状態を示す情報に合致した場合、音声合成手段は、登録手段により登録されているロボットの状態を示す情報に関連付けられた単語を含む音声を合成することを特徴とする。
【００２１】
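The apparatus may further comprise input means for receiving commands from a user, and the detection means may detect the state of the robot when the input means has received no operation input for a predetermined time.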
【００２２】
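The apparatus may further comprise recognition means for recognizing continuous input speech and pattern determination means for determining whether the recognition result of the recognition means matches a specific pattern. When the pattern determination means determines that the recognition result matches the specific pattern, the detection means may detect the state of the robot.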
【００２３】
認識手段により認識された認識結果に、未知語が含まれていると判定された場合、未知語に対応する単語を獲得する獲得手段を更に備えさせるようにすることができ、登録手段には、獲得手段により獲得された未知語に対応する単語とロボットの状態を示す情報を関連付けて記憶させるようにすることができる。
【００２４】
本発明の第２のロボット制御方法は、ロボットの状態を検知する検知ステップと、検知ステップの処理により検知されたロボットの状態が、登録情報に登録されている登録情報に合致しているか否かを判断する判断ステップと、判断ステップの処理により、ロボットの状態が、登録情報に登録されているロボットの状態を示す情報に合致していると判断された場合、ロボットの状態を示す情報に関連付けられた単語を含む音声を合成する音声合成ステップとを含むことを特徴とする。
【００２５】
本発明の第２の記録媒体に記録されているプログラムは、ロボットの状態を検知する検知ステップと、検知ステップの処理により検知されたロボットの状態が、登録情報に登録されている登録情報に合致しているか否かを判断する判断ステップと、判断ステップの処理により、ロボットの状態が、登録情報に登録されているロボットの状態を示す情報に合致していると判断された場合、ロボットの状態を示す情報に関連付けられた単語を含む音声を合成する音声合成ステップとを含むことを特徴とする。
【００２６】
本発明の第２のプログラムは、ロボットの状態を検知する検知ステップと、検知ステップの処理により検知されたロボットの状態が、登録情報に登録されている登録情報に合致しているか否かを判断する判断ステップと、判断ステップの処理により、ロボットの状態が、登録情報に登録されているロボットの状態を示す情報に合致していると判断された場合、ロボットの状態を示す情報に関連付けられた単語を含む音声を合成する音声合成ステップとを含むことを特徴とする。
【００２７】
本発明の第２のロボット制御装置および方法、並びに、プログラムにおいては、ロボットの状態が検知され、検知されたロボットの状態が、登録情報に登録されている登録情報に合致しているか否かが判断され、ロボットの状態が、登録情報に登録されているロボットの状態を示す情報に合致していると判断された場合、ロボットの状態を示す情報に関連付けられた単語を含む音声が合成されて出力される。
【００２８】
【発明の実施の形態】
以下、図を参照して、本発明の実施の形態について説明する。
【００２９】
図１は、本発明を適用した２足歩行型のロボット１の正面方向の斜視図であり、図２は、ロボット１の背面方向からの斜視図である。また、図３は、ロボット１の軸構成について説明するための図である。
【００３０】
ロボット１は、胴体部ユニット１１の上部に頭部ユニット１２が配設されるとともに、胴体部ユニット１１の上部左右に、同様の構成を有する腕部ユニット１３Ａ、および、腕部ユニット１３Ｂが所定位置にそれぞれ取り付けられ、かつ、胴体部ユニット１１の下部左右に、同様の構成を有する脚部ユニット１４Ａ、および、脚部ユニット１４Ｂが所定位置にそれぞれ取り付けられることにより構成されている。頭部ユニット１２には、タッチセンサ５１が設けられている。
【００３１】
胴体部ユニット１１においては、体幹上部を形成するフレーム２１および体幹下部を形成する腰ベース２２が、腰関節機構２３を介して連結することにより構成されており、体幹下部の腰ベース２２に固定された腰関節機構２３のアクチュエータＡ１、および、アクチュエータＡ２をそれぞれ駆動することによって、体幹上部を、図３に示す直交するロール軸２４およびピッチ軸２５の回りに、それぞれ独立に回転させることができるようになされている。
【００３２】
また頭部ユニット１２は、フレーム２１の上端に固定された肩ベース２６の上面中央部に首関節機構２７を介して取り付けられており、首関節機構２７のアクチュエータＡ３、および、アクチュエータＡ４をそれぞれ駆動することによって、図３に示す直交するピッチ軸２８およびヨー軸２９の回りに、それぞれ独立に回転させることができるようになされている。
【００３３】
更に、腕部ユニット１３Ａ、および、腕部ユニット１３Ｂは、肩関節機構３０を介して肩ベース２６の左右にそれぞれ取り付けられており、対応する肩関節機構３０のアクチュエータＡ５、および、アクチュエータＡ６をそれぞれ駆動することによって、図３に示す、直交するピッチ軸３１およびロール軸３２の回りに、それぞれを独立に回転させることができるようになされている。
【００３４】
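In each of the arm units 13A and 13B, an actuator A8 forming the forearm is coupled, via an elbow joint mechanism 33, to the output shaft of an actuator A7 forming the upper arm, and a hand 34 is attached to the tip of the forearm.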
【００３５】
In the arm units 13A and 13B, driving the actuator A7 rotates the forearm about the yaw axis 35 shown in FIG. 3, and driving the actuator A8 rotates the forearm about the pitch axis 36 shown in FIG. 3.
【００３６】
脚部ユニット１４Ａ、および、脚部ユニット１４Ｂは、股関節機構３７を介して、体幹下部の腰ベース２２にそれぞれ取り付けられており、対応する股関節機構３７のアクチュエータＡ９乃至Ａ１１をそれぞれ駆動することによって、図３に示す、互いに直交するヨー軸３８、ロール軸３９、およびピッチ軸４０に対して、それぞれ独立に回転させることができるようになされている。
【００３７】
脚部ユニット１４Ａ、および、脚部ユニット１４Ｂは、大腿部を形成するフレーム４１の下端が、膝関節機構４２を介して、下腿部を形成するフレーム４３に連結されるとともに、フレーム４３の下端が、足首関節機構４４を介して、足部４５に連結されることにより構成されている。
【００３８】
これにより脚部ユニット１４Ａ、および、脚部ユニット１４Ｂにおいては、膝関節機構４２を形成するアクチュエータＡ１２を駆動することによって、図３に示すピッチ軸４６に対して、下腿部を回転させることができ、また足首関節機構４４のアクチュエータＡ１３、および、アクチュエータＡ１４をそれぞれ駆動することによって、図３に示す直交するピッチ軸４７およびロール軸４８に対して、足部４５をそれぞれ独立に回転させることができるようになされている。
【００３９】
また、胴体部ユニット１１の体幹下部を形成する腰ベース２２の背面側には、後述するメイン制御部６１や周辺回路６２（いずれも図４）などを内蔵したボックスである、制御ユニット５２が配設されている。
【００４０】
図４は、ロボット１のアクチュエータとその制御系等について説明する図である。
【００４１】
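The control unit 52 houses a main control section 61 that governs the motion control of the entire robot 1, peripheral circuits 62 such as a power supply circuit and a communication circuit, a battery 74 (FIG. 5), and the like.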
【００４２】
そしてこの制御ユニット５２は、各構成ユニット（胴体部ユニット１１、頭部ユニット１２、腕部ユニット１３Ａおよび腕部ユニット１３Ｂ、並びに、脚部ユニット１４Ａおよび脚部ユニット１４Ｂ）内にそれぞれ配設されたサブ制御部６３Ａ乃至サブ制御部６３Ｄと接続されており、サブ制御部６３Ａ乃至サブ制御部６３Ｄに対して必要な電源電圧を供給したり、サブ制御部６３Ａ乃至サブ制御部６３Ｄと通信を行う。
【００４３】
また、サブ制御部６３Ａ乃至サブ制御部６３Ｄは、対応する構成ユニット内のアクチュエータＡ１乃至アクチュエータＡ１４と、それぞれ接続されており、メイン制御部６１から供給された各種制御コマンドに基づいて、構成ユニット内のアクチュエータＡ１乃至アクチュエータＡ１４を、指定された状態に駆動させるように制御する。
【００４４】
図５は、ロボット１の内部構成を示すブロック図である。
【００４５】
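In the head unit 12, an external sensor section 71, consisting of a CCD (Charge Coupled Device) camera 81 that functions as the "eyes" of the robot 1, a microphone 82 that functions as its "ears", the touch sensor 51, and the like, and a speaker 72 that functions as its "mouth" are each arranged at predetermined positions. In the control unit 52, an internal sensor section 73 consisting of a battery sensor 91, an acceleration sensor 92, and the like is arranged.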
【００４６】
そして、外部センサ部７１のＣＣＤカメラ８１は、周囲の状況を撮像し、得られた画像信号Ｓ１Ａを、メイン制御部６１に送出する。マイクロホン８２は、ユーザから音声入力として与えられる「歩け」、「とまれ」または「右手を挙げろ」等の各種命令音声を集音し、得られた音声信号Ｓ１Ｂを、メイン制御部６１に送出する。
【００４７】
また、タッチセンサ５１は、例えば、図１および図２に示されるように頭部ユニット１２の上部に設けられており、ユーザからの「撫でる」や「叩く」といった物理的な働きかけにより受けた圧力を検出し、検出結果を、圧力検出信号Ｓ１Ｃとしてメイン制御部６１に送出する。
【００４８】
内部センサ部７３のバッテリセンサ９１は、バッテリ７４のエネルギ残量を所定の周期で検出し、検出結果をバッテリ残量検出信号Ｓ２Ａとして、メイン制御部６１に送出する。加速度センサ９２は、ロボット１の移動について、３軸方向（ｘ軸、ｙ軸およびｚ軸）の加速度を、所定の周期で検出し、検出結果を、加速度検出信号Ｓ２Ｂとして、メイン制御部６１に送出する。
【００４９】
メイン制御部６１は、外部センサ部７１のＣＣＤカメラ８１、マイクロホン８２およびタッチセンサ５１からそれぞれ供給される、画像信号Ｓ１Ａ、音声信号Ｓ１Ｂおよび圧力検出信号Ｓ１Ｃ（以下、これらをまとめて外部センサ信号Ｓ１と称する）と、内部センサ部７３のバッテリセンサ９１および加速度センサ等からそれぞれ供給される、バッテリ残量検出信号Ｓ２Ａおよび加速度検出信号Ｓ２Ｂ（以下、これらをまとめて内部センサ信号Ｓ２と称する）に基づいて、ロボット１の周囲および内部の状況や、ユーザからの指令、または、ユーザからの働きかけの有無などを判断する。
【００５０】
そして、メイン制御部６１は、ロボット１の周囲および内部の状況や、ユーザからの指令、または、ユーザからの働きかけの有無の判断結果と、内部メモリ６１Ａに予め格納されている制御プログラム、あるいは、そのとき装填されている外部メモリ７５に格納されている各種制御パラメータなどに基づいて、ロボット１の行動を決定し、決定結果に基づく制御コマンドを生成して、対応するサブ制御部６３Ａ乃至サブ制御部６３Ｄに送出する。サブ制御部６３Ａ乃至サブ制御部６３Ｄは、供給された制御コマンドに基づいて、アクチュエータＡ１乃至アクチュエータＡ１４のうち、対応するものの駆動を制御するので、ロボット１は、例えば、頭部ユニット１２を上下左右に揺動させたり、腕部ユニット１３Ａ、あるいは、腕部ユニット１３Ｂを上に挙げたり、脚部ユニット１４Ａおよび脚部ユニット１４Ｂを交互に駆動させて、歩行するなどの行動を行うことが可能となる。
【００５１】
また、メイン制御部６１は、必要に応じて、所定の音声信号Ｓ３をスピーカ７２に与えることにより、音声信号Ｓ３に基づく音声を外部に出力させる。更に、メイン制御部６１は、外見上の「目」として機能する、頭部ユニット１２の所定位置に設けられた、図示しないＬＥＤに対して駆動信号を出力することにより、ＬＥＤを点滅させる。
【００５２】
このようにして、ロボット１においては、周囲および内部の状況や、ユーザからの指令および働きかけの有無などに基づいて、自律的に行動することができるようになされている。
【００５３】
次に、図６は、図５のメイン制御部６１の機能的構成例を示している。なお、図６に示す機能的構成は、メイン制御部６１が、メモリ６１Ａに記憶された制御プログラムを実行することで実現されるようになっている。
【００５４】
メイン制御部６１は、特定の外部状態を認識するセンサ入力処理部１０１、センサ入力処理部１０１の認識結果を累積して、ロボット１の感情、本能、あるいは、成長の状態などのモデルを記憶するモデル記憶部１０２、音声認識結果と行動内容のテーブルを記憶するテーブル記憶部１０４、センサ入力処理部１０１の認識結果や、テーブル記憶部１０４に記憶されているテーブル等に基づいて、ロボット１の行動を決定する行動決定機構部１０３、行動決定機構部１０３の決定結果に基づいて、実際にロボット１に行動を起こさせる姿勢遷移機構部１０５、並びに合成音を生成する音声合成部１０６から構成されている。
【００５５】
センサ入力処理部１０１は、マイクロホン８２や、ＣＣＤカメラ８１、タッチセンサ５１等から与えられる音声信号、画像信号、圧力検出信号等に基づいて、特定の外部状態や、ユーザからの特定の働きかけ、ユーザからの指示等を認識し、その認識結果を表す状態認識情報を、モデル記憶部１０２および行動決定機構部１０３に通知する。
【００５６】
すなわち、センサ入力処理部１０１は、音声認識部１０１Ａを有しており、音声認識部１０１Ａは、マイクロホン８２から与えられる音声信号について音声認識を行う。そして、音声認識部１０１Ａは、例えば、「歩け」、「止まれ」、「右手を挙げろ」等の指令、その他の音声認識結果を、状態認識情報として、モデル記憶部１０２および行動決定機構部１０３に通知する。
【００５７】
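The speech recognition section 101A can also newly recognize an unknown word (OOV: Out Of Vocabulary) and, as necessary, supplies the ID associated with the recognized unknown word to the action determination mechanism section 103. Details of unknown-word recognition are described later.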
【００５８】
また、センサ入力処理部１０１は、画像認識部１０１Ｂを有しており、画像認識部１０１Ｂは、ＣＣＤカメラ８１から与えられる画像信号を用いて、画像認識処理を行う。そして、画像認識部１０１Ｂは、その処理の結果、例えば、「赤い丸いもの」や、「地面に対して垂直なかつ所定高さ以上の平面」等を検出したときには、「ボールがある」や、「壁がある」等の画像認識結果を、状態認識情報として、モデル記憶部１０２および行動決定機構部１０３に通知する。
【００５９】
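Furthermore, the sensor input processing section 101 has a pressure processing section 101C, which processes the pressure detection signal supplied from the touch sensor 51. When the pressure processing section 101C detects pressure that is at or above a predetermined threshold and of short duration, it recognizes that the robot has been "hit (scolded)"; when it detects pressure that is below the threshold and of long duration, it recognizes that the robot has been "stroked (praised)". It notifies the model storage section 102 and the action determination mechanism section 103 of the recognition result as state recognition information.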
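The classification rule of the pressure processing section 101C can be sketched as follows. This is only an illustration: the function name, the numeric threshold, and the boundary between "short" and "long" are assumptions, since the text says only "predetermined threshold", "short", and "long".

```python
from typing import Optional


def classify_touch(pressure: float, duration_s: float,
                   pressure_threshold: float = 5.0,   # assumed value
                   short_duration_s: float = 0.5      # assumed value
                   ) -> Optional[str]:
    """Return "hit", "stroked", or None for a touch-sensor reading."""
    strong = pressure >= pressure_threshold
    short = duration_s <= short_duration_s
    if strong and short:
        return "hit"        # recognized as hit (scolded)
    if not strong and not short:
        return "stroked"    # recognized as stroked (praised)
    # The text does not say how the remaining combinations are treated.
    return None
```

Readings that fit neither rule return `None` here because the description leaves those cases open.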
【００６０】
モデル記憶部１０２は、ロボット１の感情、本能、成長の状態を表現する感情モデル、本能モデル、成長モデルをそれぞれ記憶、管理している。
【００６１】
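Here, the emotion model represents the state (degree) of emotions such as "joy", "sadness", "anger", and "fun" by values within a predetermined range (for example, -1.0 to 1.0), and changes those values based on the state recognition information from the sensor input processing section 101, the passage of time, and so on. The instinct model represents the state (degree) of instinctive desires such as "appetite", "desire for sleep", and "desire for exercise" by values within a predetermined range, and changes them in the same way. The growth model represents the state (degree) of growth, such as "childhood", "adolescence", "middle age", and "old age", by values within a predetermined range, and changes them in the same way.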
【００６２】
モデル記憶部１０２は、上述のようにして感情モデル、本能モデル、成長モデルの値で表される感情、本能、成長の状態を、状態情報として、行動決定機構部１０３に送出する。
【００６３】
なお、モデル記憶部１０２には、センサ入力処理部１０１から状態認識情報が供給される他、行動決定機構部１０３から、ロボット１の現在または過去の行動、具体的には、例えば、「長時間歩いた」などの行動の内容を示す行動情報が供給されるようになっており、モデル記憶部１０２は、同一の状態認識情報が与えられても、行動情報が示すロボット１の行動に応じて、異なる状態情報を生成するようになっている。
【００６４】
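That is, for example, when the robot 1 greets the user and the user strokes its head, action information indicating that it greeted the user and state recognition information indicating that its head was stroked are given to the model storage section 102. In this case, the model storage section 102 increases the value of the emotion model representing "joy".
【００６５】
On the other hand, when the robot 1 has its head stroked while executing some task, action information indicating that it is executing a task and state recognition information indicating that its head was stroked are given to the model storage section 102. In this case, the model storage section 102 does not change the value of the emotion model representing "joy".
【００６６】
In this way, the model storage section 102 sets the values of the emotion model while referring not only to the state recognition information but also to action information indicating the current or past actions of the robot 1. This avoids unnatural changes in emotion, such as increasing the value of the emotion model representing "joy" when the user strokes the head as a prank while the robot is executing some task.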
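A minimal sketch of this behavior, assuming a single "joy" value kept in the range -1.0 to 1.0. The class name, the string labels for state recognition information and action information, and the step size are all hypothetical; the description does not specify how large each change is.

```python
def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Keep a model value inside the predetermined range."""
    return max(low, min(high, value))


class EmotionModel:
    def __init__(self) -> None:
        self.joy = 0.0

    def update(self, state_recognition: str, action_info: str,
               step: float = 0.25) -> float:
        """Update "joy" from one recognition result and the current action.

        The same recognition result ("stroked") has a different effect
        depending on the action information supplied with it.
        """
        if state_recognition == "stroked" and action_info == "greeted_user":
            self.joy = clamp(self.joy + step)
        # "stroked" while "executing_task": value is left unchanged.
        return self.joy
```

The point of the sketch is only that the update takes both inputs; a fuller model would also handle the instinct and growth models and the passage of time.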
【００６７】
なお、モデル記憶部１０２は、本能モデルおよび成長モデルについても、感情モデルにおける場合と同様に、状態認識情報および行動情報の両方に基づいて、その値を増減させるようになっている。また、モデル記憶部１０２は、感情モデル、本能モデル、成長モデルそれぞれの値を、他のモデルの値にも基づいて増減させるようになっている。
【００６８】
行動決定機構部１０３は、センサ入力処理部１０１からの状態認識情報や、モデル記憶部１０２からの状態情報、時間経過等に基づいて、必要に応じて、テーブル記憶部１０４に記憶されたテーブルを参照して、次の行動を決定し、決定された行動の内容を、行動指令情報として、姿勢遷移機構部１０５に送出する。
【００６９】
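When the action determination mechanism section 103 receives from the speech recognition section 101A of the sensor input processing section 101 a speech input that matches a predetermined first rule, such as "これは、<OOV>ポーズだよ" ("This is the <OOV> pose", where <OOV> is an unknown word), it stores in the table storage section 104, in association with each other, the signals indicating the states of the actuators A1 to A14 supplied from the sub-control sections 63A to 63D and the unknown word denoting the pose obtained by speech recognition.
【００７０】
When the action determination mechanism section 103 then receives from the speech recognition section 101A a speech input that matches a predetermined second rule, such as "<OOV>して" ("Do <OOV>"), and that <OOV> is stored in the table storage section 104, it reads out the actuator control information corresponding to the <OOV> from the table storage section 104 and supplies it to the posture transition mechanism section 105.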
【００７１】
すなわち、行動決定機構部１０３は、ロボット１がとり得る行動をステート（状態：ｓｔａｔｅ）に対応させた有限オートマトンを、ロボット１の行動を規定する行動モデルとして管理しており、この行動モデルとしての有限オートマトンにおけるステートを、センサ入力処理部１０１からの状態認識情報や、モデル記憶部１０２における感情モデル、本能モデル、または成長モデルの値、時間経過等に基づいて遷移させ、遷移後のステートに対応する行動を、次にとるべき行動として決定する。
【００７２】
ここで、行動決定機構部１０３は、所定のトリガ（ｔｒｉｇｇｅｒ）があったことを検出すると、ステートを遷移させる。即ち、行動決定機構部１０３は、例えば、現在のステートに対応する行動を実行している時間が所定時間に達したときや、特定の状態認識情報を受信したとき、モデル記憶部１０２から供給される状態情報が示す感情や、本能、成長の状態の値が所定の閾値以下または以上になったとき等に、ステートを遷移させる。
【００７３】
なお、行動決定機構部１０３は、上述したように、センサ入力処理部１０１からの状態認識情報だけでなく、モデル記憶部１０２における感情モデルや、本能モデル、成長モデルの値等にも基づいて、行動モデルにおけるステートを遷移させることから、同一の状態認識情報が入力されても、感情モデルや、本能モデル、成長モデルの値（状態情報）によっては、ステートの遷移先は異なるものとなる。
【００７４】
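As described above, the action determination mechanism section 103 generates not only action command information that moves the head, limbs, and so on of the robot 1, but also action command information that makes the robot 1 speak. The latter is supplied to the speech synthesis section 106 and includes text or the like corresponding to the synthesized sound to be generated. On receiving action command information from the action determination mechanism section 103, the speech synthesis section 106 generates synthesized sound based on the text it contains and supplies it to the speaker 72 for output. As a result, the speaker 72 outputs, for example, greetings to the user such as "こんにちは" ("Hello"), various requests to the user, responses to the user's calls such as "何ですか？" ("What is it?"), and other speech.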
【００７５】
姿勢遷移機構部１０５は、行動決定機構部１０３から供給される行動指令情報に基づいて、ロボット１の姿勢を、現在の姿勢から次の姿勢に遷移させるための姿勢遷移情報を生成し、これをサブ制御部６３Ａ乃至６３Ｄに送出する。
【００７６】
図７は、センサ入力処理部１０１の音声認識部１０１Ａの機能を示す機能ブロック図である。
【００７７】
音声認識処理部１２１には、ユーザからの発話に基づく音声信号が入力されるようになっており、音声認識処理部１２１は、入力された音声信号を認識し、その音声認識の結果としてのテキスト、その他付随する情報を、対話制御部１２３および単語獲得部１２４に、必要に応じて出力する。音声認識処理部１２１の詳細については、図１０を用いて後述する。
【００７８】
単語獲得部１２４は、音声認識処理部１２１が有する認識用辞書に登録されていない単語（未知語）について、音響的特徴を自動的に記憶し、それ以降、その単語の音声を認識できるようにする。
【００７９】
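That is, the word acquisition section 124 obtains the pronunciation corresponding to the unknown-word portion of the input speech using a phoneme typewriter and classifies it into one of several clusters. Each cluster has an ID and a representative phoneme sequence and is managed by its ID. The state of the clusters is described with reference to FIG. 8.
【００８０】
For example, suppose there are three speech inputs, "あか" (aka), "あお" (ao), and "ガッツ" (gattsu). In this case, the word acquisition section 124 classifies the three inputs into three corresponding clusters, the "あか" cluster 141, the "あお" cluster 142, and the "ガッツ" cluster 143, and attaches to each cluster a representative phoneme sequence ("a/k/a", "a/o", and "g/a/t/t/u" in the example of FIG. 8) and an ID ("1", "2", and "3" in the example of FIG. 8).
【００８１】
If the speech "あか" is input again, the corresponding cluster already exists, so the word acquisition section 124 classifies the input into the "あか" cluster 141 and does not generate a new cluster. In contrast, if the speech "くろ" (kuro) is input, no corresponding cluster exists, so the word acquisition section 124 newly generates a cluster 144 corresponding to "くろ" and attaches to it a representative phoneme sequence ("k/u/r/o" in the example of FIG. 8) and an ID ("4" in the example of FIG. 8).
【００８２】
Therefore, whether an input is a word that has not yet been acquired can be determined by whether a new cluster was generated. Details of this word acquisition processing are disclosed in Japanese Patent Application No. 2001-97843, previously proposed by the present applicant.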
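The cluster bookkeeping in the FIG. 8 example can be sketched as below. The actual clustering method is the one in the cited earlier application and is not described here, so this sketch substitutes a simple stand-in: an input joins an existing cluster when the edit distance between its phoneme sequence and the cluster's representative sequence is within `max_distance` (an assumed parameter; 0 means exact match). The class and method names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


class WordAcquirer:
    def __init__(self, max_distance: int = 0) -> None:
        self.max_distance = max_distance
        self.clusters: Dict[int, Tuple[str, ...]] = {}  # ID -> representative

    def acquire(self, phonemes: Sequence[str]) -> Tuple[int, bool]:
        """Return (cluster ID, True if a new cluster was generated)."""
        seq = tuple(phonemes)
        for cluster_id, representative in self.clusters.items():
            if edit_distance(seq, representative) <= self.max_distance:
                return cluster_id, False
        cluster_id = len(self.clusters) + 1
        self.clusters[cluster_id] = seq
        return cluster_id, True
```

The second element of the returned pair corresponds to the test in the text: a new cluster means the word had not been acquired before.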
【００８３】
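The associative memory section 122 stores information such as the category of a registered word (unknown word), for example whether it is a user name or a character name. In the example of FIG. 9, cluster IDs and category names are stored in association with each other: cluster IDs "1" and "4" are associated with the category "user name", cluster ID "2" with the category "character name", and cluster ID "3" with the category "pose name".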
【００８４】
対話制御部１２３は、音声認識処理部１２１の出力からユーザの発話の内容を理解し、その理解の結果に基づいて、単語（未知語）の登録を制御する。また、対話制御部１２３は、連想記憶部１２２に記憶されている登録済みの単語の情報に基づいて、登録済みの単語を認識できるように、それ以降の対話を制御する。
【００８５】
このようにして、音声認識部１０１Ａにおいて認識された音声は、行動決定機構部１０３に供給される。
【００８６】
ここでは、単語獲得部１２４が、音韻タイプライタによって得られた発音から、クラスタを生成し、それ以降の未知語の入力時には、クラスタとのマッチングが行われるものとして説明しているが、例えば、クラスタを生成することなく、未知語として、音韻系列そのもの（例えば、“ｋ／ｕ／ｒ／ｏ”など）に、ＩＤを付加し、新たに未知語が入力された場合、音韻系列で比較して、入力された未知語が、すでにＩＤが付加された未知語のうちのいずれかに一致するか否かを判断するようにしても良い。
【００８７】
図１０は、音声認識処理部１２１の構成例を示している。
【００８８】
ユーザの発話は、マイクロホン８２に入力され、マイクロホン８２では、その発話が、電気信号としての音声信号に変換される。この音声信号は、ＡＤ（ＡｎａｌｏｇＤｉｇｉｔａｌ）変換部１７１に供給される。ＡＤ変換部１７１は、マイクロホン８２からのアナログ信号である音声信号をサンプリングして、量子化し、ディジタル信号である音声データに変換する。この音声データは、特徴量抽出部１７２に供給される。
【００８９】
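For each suitable frame, the feature extraction section 172 extracts feature parameters such as spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs from the speech data supplied by the AD conversion section 171, and supplies them to the matching section 173 and the phoneme typewriter section 174.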
【００９０】
マッチング部１７３は、特徴量抽出部１７２からの特徴パラメータに基づき、音響モデルデータベース１８１、辞書データベース１８２、および言語モデルデータベース１８３を必要に応じて参照しながら、マイクロホン８２に入力された音声（入力音声）に最も近い単語列を求める。
【００９１】
音響モデルデータベース１８１は、音声認識する音声の言語における個々の音韻や音節などの音響的な特徴を表す音響モデルを記憶している。音響モデルとしては、例えば、ＨＭＭ（ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）などを用いることができる。辞書データベース１８２は、認識対象の各単語（語句）について、その発音に関する情報が記述された単語辞書や、音韻や音節の連鎖関係を記述したモデルを記憶している。
【００９２】
なお、ここにおける単語とは、認識処理において１つのまとまりとして扱った方が都合の良い単位のことであり、言語学的な単語とは必ずしも一致しない。例えば、「タロウ君」は、それ全体を１単語として扱ってもよいし、「タロウ」、「君」という２単語として扱ってもよい。更に、もっと大きな単位である「こんにちはタロウ君」等を１単語として扱ってもよい。
【００９３】
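A phoneme here is whatever is convenient to treat acoustically as a single unit for processing, and it does not necessarily coincide with a phoneme in the phonetic sense. For example, the "とう" part of "東京" (Tokyo) may be represented by the three phoneme symbols "t/o/u", or as "t/o:" using the symbol "o:" for a long "o". It may also be represented as "t/o/o". In addition, symbols representing silence may be provided, and these may be further subdivided, with a symbol for each, into "silence before an utterance", "a short silent interval between utterances", "silence after an utterance", and "silence at a 'っ' (geminate consonant)".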
【００９４】
言語モデルデータベース１８３は、辞書データベース１８２の単語辞書に登録されている各単語がどのように連鎖する（接続する）かに関する情報を記述している。
【００９５】
音韻タイプライタ部１７４は、特徴量抽出部１７２から供給された特徴パラメータに基づいて、入力された音声に対応する音韻系列を取得する。音韻タイプライタ部１７４は、例えば、「私の名前は太郎です。」という音声から”ｗ／ａ／ｔ／ａ／ｓｈ／ｉ／ｎ／ｏ／ｎ／ａ／ｍ／ａ／ｅ／ｗ／ａ／ｔ／ａ／ｒ／ｏ：／ｄ／ｅ／ｓ／ｕ”という音韻系列を取得する。この音韻タイプライタには、既存のものを用いることができる。
【００９６】
なお、音韻タイプライタ部１７４に代わって、任意の音声に対して音韻系列を取得できる他の構成を用いるようにしてもよい。例えば、日本語の音節（あ・い・う・・・か・き・・・・ん）を単位とする音声認識や、音韻よりも大きく、単語よりは小さな単位であるサブワードを単位とする音声認識等を用いることも可能である。
【００９７】
制御部１７５は、ＡＤ変換部１７１、特徴量抽出部１７２、マッチング部１７３、音韻タイプライタ部１７４の動作を制御する。
【００９８】
次に、図１１のフローチャートを参照して、ロボット１が音声入力を受けた場合の処理について説明する。
【００９９】
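In step S1, the speech recognition processing section 121 of the speech recognition section 101A determines whether speech input has been received from the microphone 82. If it is determined in step S1 that no speech input has been received, the processing of step S1 is repeated until speech input is determined to have been received.
【０１００】
If it is determined in step S1 that speech input has been received, the speech recognition processing described later with reference to FIG. 12 is executed in step S2.
【０１０１】
In step S3, the control section 175 of the speech recognition processing section 121 determines whether the word string recognized in step S2 contains an unknown word.
【０１０２】
If it is determined in step S3 that an unknown word is contained, then in step S4 the control section 175 controls the word acquisition section 124 so that the word acquisition processing described later with reference to FIG. 18 is executed.
【０１０３】
In step S5, the dialogue control section 123 executes the template matching processing described later with reference to FIG. 19, using the word acquired in step S4, and the processing ends.
【０１０４】
If it is determined in step S3 that no unknown word is contained, then in step S6 the control section 175 of the speech recognition processing section 121 outputs the recognized speech to the action determination mechanism section 103 and, as necessary, to the model storage section 102. The action determination mechanism section 103 executes predetermined response processing based on the supplied speech recognition result.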
【０１０５】
具体的には、例えば、認識された音声が、「前に進め」であった場合、行動決定機構部１０３は、ロボット１の行動を規定する行動モデルに合致した、ロボット１の行動を決定し、決定された行動の内容を、行動指令情報として、姿勢遷移機構部１０５に送出する。姿勢遷移機構部１０５は、アクチュエータＡ１乃至アクチュエータＡ１４のうち、必要なものの制御情報を生成し、サブ制御部６３Ａ乃至サブ制御部６３Ｄに供給して、アクチュエータＡ１乃至アクチュエータＡ１４のうち対応するものを駆動させ、ロボット１に「前に進む」行動を実行させる。
【０１０６】
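Alternatively, when speech could not be recognized correctly, for example because it did not match the grammar, the action determination mechanism section 103 generates action command information for making the robot 1 say "何ですか？" ("What is it?") and supplies it to the speech synthesis section 106, so that the speech "何ですか？" is output from the speaker 72. At the same time as this speech output, the action determination mechanism section 103 may also generate action command information for tilting the head unit 12 sideways (a gesture of cocking the head) and send it to the posture transition mechanism section 105. The posture transition mechanism section 105 then generates control information that makes the actuators A3 and A4 tilt the neck and supplies it to the sub-control section 63B, which drives the actuators A3 and A4 so that the robot 1 takes a "cocking the head" pose.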
【０１０７】
次に、図１２のフローチャートを参照して、図１１のステップＳ２において実行される音声認識処理について説明する。
【０１０８】
ステップＳ２１において、ＡＤ変換部１７１は、マイクロホン８２より供給されたアナログの音声信号を、ディジタル信号である音声データに変換し、特徴量抽出部１７２に供給する。
【０１０９】
特徴量抽出部１７２は、ステップＳ２２において、ＡＤ変換部１７１からの音声データを受信し、ステップＳ２３において、適当なフレームごとに、例えば、スペクトル、パワー、それらの時間変化量等の特徴パラメータを抽出し、マッチング部１７３に供給する。
【０１１０】
ステップＳ２４において、マッチング部１７３は、辞書データベース１８２に格納されている単語モデルのうちのいくつかを連結する。
【０１１１】
ステップＳ２５において、図１７を用いて後述する単語列生成処理が実行される。なお、この単語列を構成する単語には、辞書データベース１８２に登録されている既知語だけでなく、登録されていない未知語を表すシンボルである“＜ＯＯＶ＞”も含まれている。
【０１１２】
ステップＳ２６において、音韻タイプライタ部１７４は、ステップＳ２４およびステップＳ２５の処理とは独立して、ステップＳ２３の処理で抽出された特徴パラメータに対して、音韻を単位とする認識を行い、音韻系列を出力する。例えば、「私の名前は太郎（未知語）です。」という音声が入力された場合、音韻タイプライタ部１７４は、”ｗ／ａ／ｔ／ａ／ｓｈ／ｉ／ｎ／ｏ／ｎ／ａ／ｍ／ａ／ｅ／ｗ／ａ／ｔ／ａ／ｒ／ｏ：／ｄ／ｅ／ｓ／ｕ”という音韻系列を出力する。
【０１１３】
ステップＳ２７において、マッチング部１７３は、ステップＳ２５において生成された単語列ごとに、音響スコアを計算する。＜ＯＯＶ＞（未知語）を含まない単語列に対する音響スコアの計算方法には、既存の方法、すなわち各単語列（単語モデルを連結したもの）に対して音声の特徴パラメータを入力することで尤度を計算するという方法を用いる。一方、既存の方法では＜ＯＯＶ＞に相当する音声区間の音響スコアを求めることができない（＜ＯＯＶ＞に対応する単語モデルは事前には存在しないため）ので、＜ＯＯＶ＞を含む単語列に対する音響スコアの計算においては、その音声区間については、音韻タイプライタの認識結果の中から、同区間の音響スコアを取り出し、その値に補正をかけたものを、＜ＯＯＶ＞の音響スコアとして採用する方法を用いる。マッチング部１７３は、更に、＜ＯＯＶ＞の音響スコアと、他の既知語部分の音響スコアとを統合し、それをその単語列の音響スコアとする。
【０１１４】
ステップＳ２８において、マッチング部１７３は、音響スコアの高い単語列を上位ｍ個（ｍ≦ｎ）残し、候補単語列とする。ステップＳ２９において、マッチング部１７３は、言語モデルデータベース１８３を参照して、候補単語列ごとに、言語スコアを計算する。言語スコアは、認識結果の候補である単語列が言葉としてどれだけふさわしいかを表す。ここで、この言語スコアを計算する方法を詳細に説明する。
【０１１５】
本発明の音声認識処理部１２１は、未知語も認識することができるため、言語モデルは未知語に対応している必要がある。例として、未知語に対応した文法または有限状態オートマトン（ＦＳＡ：ＦｉｎｉｔｅＳｔａｔｅＡｕｔｏｍａｔｏｎ）を用いた場合と、同じく未知語に対応したｔｒｉ−ｇｒａｍ（統計言語モデルの１つである）を用いた場合とについて説明する。
【０１１６】
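An example of a grammar is described with reference to FIG. 13. This grammar is written in BNF (Backus Naur Form). In FIG. 13, "$A" denotes a variable and "A|B" means "A or B". "[A]" means "A is optional", and "{A}" means "A is repeated zero or more times".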
【０１１７】
＜ＯＯＶ＞は未知語を表すシンボルであり、文法中に＜ＯＯＶ＞を記述しておくことで、未知語を含む単語列に対しても対処することができる。”＄ＡＣＴＩＯＮ”には、例えば、「起立」、「着席」、「お辞儀」、「挨拶」等の、名称と動作内容の対応が予め設定されている場合の、動作に対応する単語が定義されている。
【０１１８】
この文法では、「＜先頭＞／こんにちは／＜終端＞」（“／”は単語間の区切り）、「＜先頭＞／さようなら／＜終端＞」、「＜先頭＞／私／の／名前／は／＜ＯＯＶ＞／です／＜終端＞」のように、データベースに記憶されている文法に当てはまる単語列は受理される（この文法で解析される）が、「＜先頭＞／君／の／＜ＯＯＶ＞／名前／＜終端＞」といった、データベースに記憶されている文法に当てはまらない単語列は受理されない（この文法で解析されない）。なお、「＜先頭＞」と「＜終端＞」はそれぞれ発話前と後の無音を表す特殊なシンボルである。
【０１１９】
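A parser is used to compute the language score with this grammar. The parser divides word strings into those that the grammar accepts and those that it does not. For example, an accepted word string is given a language score of 1, and a word string that is not accepted is given a language score of 0.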
【０１２０】
したがって、例えば、「＜先頭＞／私／の／名前／は／＜ＯＯＶ＞（ｔ／ａ／ｒ／ｏ：）／です／＜終端＞」と、「＜先頭＞／私／の／名前／は／＜ＯＯＶ＞（ｊ／ｉ／ｒ／ｏ：）／です／＜終端＞」という２つの単語列があった場合、いずれも「＜先頭＞／私／の／名前／は／＜ＯＯＶ＞／です／＜終端＞」に置き換えられた上で言語スコアが計算されて、ともに言語スコア１（受理）が出力される。
【０１２１】
また、単語列の文法が受理できるか否かの判定は、事前に文法を等価（近似でも良い）な有限状態オートマトン（以下、ＦＳＡと称する）に変換しておき、各単語列がそのＦＳＡで受理できるか否かを判定することによっても実現できる。
【０１２２】
図１３の文法を等価なＦＳＡに変換した例を、図１４に示す。ＦＳＡは、状態（ノード）とパス（アーク）とからなる有向グラフである。図１４に示されるように、Ｓ１は開始状態、Ｓ２０は終了状態である。また、”＄ＡＣＴＩＯＮ”には、図１３と同様に、実際には動作に対応する単語が登録されている。
【０１２３】
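A word is assigned to each path, and a transition from one state to the next consumes that word. A path labeled "ε", however, is a special transition that consumes no word (hereinafter, an ε-transition). For example, for "<先頭>/私/は/<OOV>/です/<終端>" (where <先頭> and <終端> are the start and end symbols), the transition from the initial state S1 to state S2 consumes <先頭> and the transition from state S2 to state S3 consumes "私", but the transition from state S3 to state S5 is an ε-transition and consumes no word. That is, it is possible to skip from state S3 to state S5 and then move on to the next state S6.
【０１２４】
Whether a given word string is accepted by this FSA is determined by whether the final state S20 can be reached starting from the initial state S1.
【０１２５】
That is, for example, for "<先頭>/私/の/名前/は/<OOV>/です/<終端>", the transition from the initial state S1 to state S2 consumes the word "<先頭>". Next, the transition from state S2 to state S3 consumes the word "私". Likewise, transitions are made in turn from state S3 to state S4, from state S4 to state S5, from state S5 to state S6, and from state S6 to state S7, consuming "の", "名前", "は", and "<OOV>" one after another. The transition from state S7 to state S19 then consumes "です", and the transition from state S19 to state S20 consumes "<終端>", so the final state S20 is reached. Therefore, "<先頭>/私/の/名前/は/<OOV>/です/<終端>" is accepted by the FSA.
【０１２６】
In contrast, for "<先頭>/君/の/<OOV>/名前/<終端>", transitions can be made from state S1 to state S2, from state S2 to state S8, and from state S8 to state S9, consuming "<先頭>", "君", and "の", but no further transition is possible, so the final state S20 cannot be reached. Therefore, "<先頭>/君/の/<OOV>/名前/<終端>" is not accepted by the FSA (rejected).
【０１２７】
For both "<先頭>/さようなら/<終端>" and "<先頭>/こんにちは/<終端>", the transition from state S1 to state S2 consumes the word "<先頭>", the transition from state S2 to state S19 consumes the word "さようなら" or "こんにちは", and the transition from state S19 to state S20 consumes "<終端>".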
【０１２８】
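For "<先頭>/これは/<OOV>/ポーズ/だよ/<終端>", or a similar word string such as "<先頭>/これは/<OOV>/です/<終端>", the transition from state S1 to state S2 consumes the word "<先頭>", the transition from state S2 to state S13 consumes the word "これは", and the transition from state S13 to state S14 consumes the word "<OOV>". The transition from state S14 to state S15 then either consumes the word "ポーズ" or is an ε-transition, and the transition from state S15 to state S19 either consumes the word "だよ" or "です" or is an ε-transition. Finally, the transition from state S19 to state S20 consumes "<終端>".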
【０１２９】
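For "<先頭>/<OOV>(character name)/$ACTION/して/<終端>", or a similar word string such as "<先頭>/<OOV>(character name)/<OOV>(pose name)/ポーズ/して/<終端>", the transition from state S1 to state S2 consumes the word "<先頭>" and the transition from state S2 to state S16 consumes the word "<OOV>" (the unknown-word character name). Then either the transition from state S16 to state S18 consumes a word "$ACTION" denoting a predetermined motion, or the transition from state S16 to state S17 consumes the word "<OOV>" (the unknown-word pose name), after which the transition from state S17 to state S18 is either an ε-transition or consumes the word "ポーズ". The transition from state S18 to state S19 then consumes the word "して", and finally the transition from state S19 to state S20 consumes "<終端>".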
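The acceptance test described for the FSA can be sketched as follows. FIG. 14 itself is not reproduced here, so the transition list is a partial reconstruction containing only a few of the transitions spelled out in the text; it is an assumption, not the complete automaton. The function returns the language score (1 for accepted, 0 for rejected) and handles ε-transitions through an ε-closure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

EPS = "ε"
START, FINAL = 1, 20

# (source state, word, destination state); partial, taken from the text.
TRANSITIONS: List[Tuple[int, str, int]] = [
    (1, "<先頭>", 2),
    (2, "私", 3), (3, "の", 4), (4, "名前", 5), (3, EPS, 5),
    (5, "は", 6), (6, "<OOV>", 7), (7, "です", 19),
    (2, "君", 8), (8, "の", 9),
    (2, "こんにちは", 19), (2, "さようなら", 19),
    (19, "<終端>", 20),
]


def build(transitions: Iterable[Tuple[int, str, int]]) -> Dict[Tuple[int, str], Set[int]]:
    table: Dict[Tuple[int, str], Set[int]] = defaultdict(set)
    for src, word, dst in transitions:
        table[(src, word)].add(dst)
    return table


def eps_closure(states: Set[int], table) -> Set[int]:
    """All states reachable from `states` without consuming a word."""
    stack, seen = list(states), set(states)
    while stack:
        state = stack.pop()
        for nxt in table.get((state, EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def language_score(words: List[str]) -> int:
    """Return 1 if the word string reaches the final state, else 0."""
    table = build(TRANSITIONS)
    current = eps_closure({START}, table)
    for word in words:
        moved: Set[int] = set()
        for state in current:
            moved |= table.get((state, word), set())
        current = eps_closure(moved, table)
        if not current:
            return 0
    return 1 if FINAL in current else 0
```

Phoneme annotations such as "<OOV>(t/a/r/o:)" are assumed to have been reduced to the bare symbol "<OOV>" before scoring, as the description of the grammar-based score states.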
【０１３０】
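Next, an example of computing the language score when a tri-gram, one kind of statistical language model, is used as the language model is described with reference to FIG. 15. A statistical language model is a language model that obtains the generation probability of a word string and uses it as the language score. For example, the language score of "<先頭>/私/の/名前/は/<OOV>/です/<終端>" in the language model shown in FIG. 15 is expressed, as shown in the second line, by the generation probability of that word string. This is further expressed, as shown in the third to sixth lines, as a product of conditional probabilities. Here, for example, "P(の|<先頭> 私)" denotes the probability that "の" appears given that the word immediately before "の" is "私" and the word immediately before "私" is "<先頭>".
【０１３１】
Furthermore, in a tri-gram, the expressions shown in the third to sixth lines of FIG. 15 are approximated, as shown in the seventh to ninth lines, by conditional probabilities over three consecutive words. These probability values are obtained by referring to a tri-gram database such as the one shown in FIG. 16, which is obtained in advance by analyzing a large amount of text.
【０１３２】
The example of FIG. 16 shows the probability P(w3|w1 w2) for three consecutive words w1, w2, w3. For example, when the three words w1, w2, w3 are "<先頭>", "私", and "の", the probability is 0.12; when they are "私", "の", and "名前", the probability is 0.01; and when they are "<OOV>", "です", and "<終端>", the probability is 0.87.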
【０１３３】
もちろん、「Ｐ（Ｗ）」および「Ｐ（ｗ２｜ｗ１）」についても、同様に、予め求めておく。
【０１３４】
このようにして、言語モデル中の＜ＯＯＶ＞について、エントリ処理をしておくことで、＜ＯＯＶ＞を含む単語列に対して、言語スコアを計算することができる。したがって、認識結果に＜ＯＯＶ＞というシンボルを出力することができる。
【０１３５】
また、他の種類の言語モデルを用いる場合も、＜ＯＯＶ＞についてのエントリ処理をすることによって、同様に＜ＯＯＶ＞を含む単語列に対して、言語スコアを計算することができる。
【０１３６】
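Furthermore, even when a language model with no <OOV> entry is used, the language score can be computed by using a mechanism that maps <OOV> to a suitable word in the language model. For example, even with a tri-gram database in which "P(<OOV>|私 は)" does not exist, the language score can be computed by accessing the database with "P(太郎|私 は)" and treating the probability recorded there as the value of "P(<OOV>|私 は)".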
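The tri-gram score and the <OOV> mapping can be sketched together as below. Only the three probabilities quoted from FIG. 16 come from the text; every other number, the proxy word table, and all names are placeholders chosen for illustration. The score is P(w1) · P(w2|w1) · Π P(wi|wi-2 wi-1), and a missing tri-gram that involves <OOV> is looked up again with <OOV> replaced by a proxy word.

```python
from typing import Dict, List, Tuple

OOV = "<OOV>"
OOV_PROXY = "太郎"  # proxy word used when no <OOV> entry exists

UNIGRAM: Dict[str, float] = {"<先頭>": 1.0}                 # placeholder
BIGRAM: Dict[Tuple[str, str], float] = {("<先頭>", "私"): 0.1}  # placeholder
TRIGRAM: Dict[Tuple[str, str, str], float] = {
    ("<先頭>", "私", "の"): 0.12,       # quoted from FIG. 16
    ("私", "の", "名前"): 0.01,         # quoted from FIG. 16
    (OOV, "です", "<終端>"): 0.87,      # quoted from FIG. 16
    ("の", "名前", "は"): 0.5,          # placeholder
    ("名前", "は", "太郎"): 0.2,        # placeholder; no <OOV> entry here
    ("は", OOV, "です"): 0.3,           # placeholder
}


def trigram_prob(w1: str, w2: str, w3: str) -> float:
    """P(w3 | w1 w2), falling back to the proxy word for <OOV>."""
    key = (w1, w2, w3)
    if key in TRIGRAM:
        return TRIGRAM[key]
    proxy_key = tuple(OOV_PROXY if w == OOV else w for w in key)
    return TRIGRAM.get(proxy_key, 0.0)


def language_score(words: List[str]) -> float:
    """Tri-gram approximation of the generation probability of `words`."""
    if len(words) < 2:
        raise ValueError("need at least two words")
    score = UNIGRAM.get(words[0], 0.0) * BIGRAM.get((words[0], words[1]), 0.0)
    for i in range(2, len(words)):
        score *= trigram_prob(words[i - 2], words[i - 1], words[i])
    return score
```

A practical implementation would normally work with log probabilities and smoothing; the plain product is kept here to mirror the formula in the text.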
【０１３７】
図１２の音声認識処理についての説明に戻る。ステップＳ３０において、マッチング部１７３は、音響スコアと言語スコアを統合する。ステップＳ３１において、マッチング部１７３は、ステップＳ３０において求められた音響スコアと言語スコアの両スコアを統合したスコアに基づいて、最もよいスコアをもつ候補単語列を選択して、認識結果として出力する。
【０１３８】
なお、言語モデルとして、有限状態オートマトンを使用している場合は、ステップＳ３０の統合処理を、言語スコアが０の場合は単語列を消去し、言語スコアが０以外の場合はそのまま残すという処理にしてもよい。
【０１３９】
次に、図１７のフローチャートを参照して、図１２のステップＳ２５において実行される、単語列生成処理について説明する。
【０１４０】
ステップＳ６１において、マッチング部１７３は、入力音声のある区間について、辞書データベース１８２に登録されている既知語とマッチングさせた結果の音響スコアと、音韻タイプライタ部１７４により得られた結果（今の場合、”ｗ／ａ／ｔ／ａ／ｓｈ／ｉ／ｎ／ｏ／ｎ／ａ／ｍ／ａ／ｅ／ｗ／ａ／ｔ／ａ／ｒ／ｏ：／ｄ／ｅ／ｓ／ｕ”の中の一部区間）の音響スコアとの、両方の場合の音響スコアを計算する。音響スコアは、音声認識結果の候補である単語列と入力音声とが音としてどれだけ近いかを表す。
【０１４１】
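Next, the acoustic score obtained by matching a partial section of the input speech against the known words registered in the dictionary database 182 is compared with the acoustic score of the result from the phoneme typewriter section 174. However, matching against known words is performed word by word while matching in the phoneme typewriter section 174 is performed phoneme by phoneme, so the scales differ and the scores are difficult to compare as they are (in general, the phoneme-level acoustic score takes the larger value).
【０１４２】
Therefore, so that the scores can be compared on the same scale, in step S62 the matching section 173 corrects the acoustic score of the result obtained by the phoneme typewriter section 174.
【０１４３】
In step S62, for example, the acoustic score from the phoneme typewriter section 174 is multiplied by a coefficient, or a constant value or a value proportional to the frame length is subtracted from it. Of course, since this processing is relative, it can instead be applied to the acoustic score obtained by matching against known words. Details of this processing are disclosed, for example, in "OOV-Detection in Large Vocabulary System Using Automatically Defined Word-Fragments as Fillers", EUROSPEECH99 Volume 1, Page 49-52.
【０１４４】
In step S63, the matching section 173 compares the two acoustic scores and determines whether the acoustic score of the result recognized by the phoneme typewriter section 174 is higher (better). If it is higher, in step S64 the matching section 173 estimates that the section is an unknown word <OOV>.
【０１４５】
If it is determined in step S63 that the acoustic score of the result recognized by the phoneme typewriter section 174 is lower than the acoustic score of the known-word match, in step S65 the matching section 173 estimates that the section is a known word.
【０１４６】
That is, for example, for the section corresponding to "たろう" (taro), the acoustic score of "t/a/r/o:" output by the phoneme typewriter section 174 is compared with the acoustic score obtained by matching against known words. If the acoustic score of "t/a/r/o:" is higher, "<OOV>(t/a/r/o:)" is output as the word corresponding to that speech section; if the acoustic score of the known word is higher, that known word is output as the word corresponding to the speech section.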
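Steps S62 to S65 can be sketched as a score correction followed by a comparison. The function names, the treatment of scores as log-likelihood-like numbers where larger is better, and all numeric values are assumptions; the text gives only the kinds of correction (a coefficient, a constant, a frame-proportional term).

```python
from typing import Sequence


def corrected_typewriter_score(score: float, n_frames: int,
                               scale: float = 1.0,
                               offset: float = 0.0,
                               per_frame_penalty: float = 0.0) -> float:
    """Step S62: bring the phoneme typewriter score onto the word-level scale."""
    return scale * score - offset - per_frame_penalty * n_frames


def label_segment(known_word: str, known_score: float,
                  phonemes: Sequence[str], typewriter_score: float,
                  n_frames: int, **correction: float) -> str:
    """Steps S63 to S65: pick <OOV>(...) or the known word for one section."""
    corrected = corrected_typewriter_score(typewriter_score, n_frames, **correction)
    if corrected > known_score:
        return "<OOV>(" + "/".join(phonemes) + ")"
    # The text does not define the tie case; it is treated as a known word here.
    return known_word
```

With a small penalty the section is labeled as an unknown word, and with a larger penalty the known word wins, which shows how the correction controls the balance between the two hypotheses.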
【０１４７】
ステップＳ６４、または、ステップＳ６５の処理の終了後、ステップＳ６６において、マッチング部１７３は、音響スコアが高くなると推測される単語列（いくつかの単語モデルを連結したもの）を優先的にｎ個を生成して、処理は、図１２のステップＳ２６に戻る。
【０１４８】
このような処理により、図１２のステップＳ２７の処理において、音響スコアの計算に用いられる単語列が生成される。
【０１４９】
次に、図１８のフローチャートを参照して、図１１のステップＳ４において実行される単語獲得処理について説明する。
【０１５０】
ステップＳ９１において、単語獲得部１２４は、音声認識処理部１２１から未知語（＜ＯＯＶ＞）の特徴パラメータを抽出する。
【０１５１】
ステップＳ９２において、単語獲得部１２４は、未知語が既獲得のクラスタに属するか否かを判定する。ステップＳ９２において、未知語が既獲得のクラスタに属すると判定された場合、新しいクラスタを生成する必要がないので、処理は、ステップＳ９４に進む。
【０１５２】
ステップＳ９２において、未知語が既獲得のクラスタに属さないと判定された場合、ステップＳ９３において、単語獲得部１２４は、その未知語に対応する、新しいクラスタを生成する。
【０１５３】
ステップＳ９２において、未知語が既獲得のクラスタに属すると判定された場合、または、ステップＳ９３の処理の終了後、ステップＳ９４において、単語獲得部１２４は、未知語の属するクラスタのＩＤを音声認識処理部１２１に出力し、処理は、図１１のステップＳ５に進む。
【０１５４】
このような処理により、続く図１１のステップＳ５において、テンプレートとマッチングされる未知語の単語が獲得される。
【０１５５】
図１８においては、未知語の特徴量パラメータから、未知語が既獲得のクラスタに属するか否かを判定し、既獲得のクラスタに属さない場合は、新たなクラスタを生成するものとして説明しているが、クラスタを生成することなく、例えば、音韻系列にＩＤを対応付けるものとし、すでに獲得されている未知語であるか否かは、音韻系列の比較によって行うようにしてもよい。
【０１５６】
次に、図１９のフローチャートを参照して、図１１のステップＳ５において実行される、テンプレートマッチング処理について説明する。
【０１５７】
ステップＳ１２１において、音声認識部１０１Ａの対話制御部１２３は、供給された音声認識結果が、未知語をポーズ名として登録するテンプレート１にマッチしているか否か、すなわち、認識結果の単語列が何かのポーズ名の登録を意味するものか否かを判定する。
【０１５８】
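Template 1 is described with reference to FIG. 20. In FIG. 20 and in FIGS. 22, 23, and 24 described later, "/A/" means "if the character string A is contained" and "A|B" means "A or B". "." denotes any single character, "A+" means one or more repetitions of A, and "(.)+" denotes an arbitrary character string.
【０１５９】
When the recognized word string is, for example, "<先頭>/これは/<OOV>(g/a/t/t/u:)/ポーズ/です/<終端>", the character string generated from this result, "これは<OOV>ポーズです", matches the regular expression "/これは<OOV>(category: pose name)/" of template 1. Template 1 thus indicates that the corresponding operations are executed: registering the cluster ID corresponding to the <OOV> (category: pose name) as a pose name, and storing it in the table together with the actuator control angles at that time.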
【０１６０】
具体的には、例えば、ユーザが、ロボット１の両腕を上に挙げさせた状態にし、「これはガッツ（ポーズ）だよ」と発声した場合、未知語＜ガッツ（ポーズ）＞が獲得されて、テンプレート１に対応していることが検出される。
【０１６１】
ステップＳ１２１において、テンプレート１にマッチしていると判断された場合、ステップＳ１２２において、連想記憶部１２２は、図９を用いて説明したように、単語のクラスタＩＤとカテゴリを、対応させて記憶する。
【０１６２】
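In step S123, the speech recognition section 101A supplies the cluster ID to the action determination mechanism section 103. The action determination mechanism section 103 also obtains the current states (angles) of the actuators A1 to A14 from the sub-control sections 63A to 63D. It then associates the supplied cluster ID with the actuator control angle information, stores a table such as the one shown in FIG. 21 in the table storage section 104, and the processing ends.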
【０１６３】
具体的には、例えば、ユーザが、ロボット１の両腕を上に挙げさせた状態にし、「これはガッツ（ポーズ）だよ」と発声した場合、＜ガッツ（ポーズ）＞に対応付けられたクラスタＩＤが、ロボット１の両腕を上に挙げさせた状態のアクチュエータＡ１乃至アクチュエータＡ１４の状態（角度）と対応付けられて、テーブル記憶部１０４に記憶される。
【０１６４】
ステップＳ１２１において、テンプレート１にマッチしていないと判断された場合、ステップＳ１２４において、対話制御部１２３は、供給された音声認識結果が、未知語に対応付けられたポーズをロボット１に指令するテンプレート２にマッチしているか否かを判断する。
【０１６５】
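Template 2 is described with reference to FIG. 22. When the recognized word string is, for example, "<先頭>/<OOV>(b/a/n/z/a/i)/して/<終端>", the character string generated from this result, "<OOV>して", matches the regular expression "/<OOV>(category: pose name)+して/" of template 2, so the corresponding operation is executed: referring to the table storage section 104 on the basis of the cluster ID corresponding to the <OOV> (category: pose name) and performing the action corresponding to the <OOV>.
【０１６６】
If it is determined in step S124 that template 2 is matched, then in step S125 the speech recognition section 101A supplies the cluster ID of the recognized unknown word to the action determination mechanism section 103. The action determination mechanism section 103 refers to the table storage section 104, extracts the actuator control angle information from a table such as the one described with reference to FIG. 21 on the basis of the supplied cluster ID, and supplies it to the posture transition mechanism section 105.
【０１６７】
In step S126, the posture transition mechanism section 105, based on the actuator control angle information, has the corresponding ones of the sub-control sections 63A to 63D control the necessary ones of the actuators A1 to A14, and the processing ends.
【０１６８】
Specifically, for example, when the user says "バンザイして" ("Do banzai") to the robot 1, the word acquisition section 124 acquires the unknown word <バンザイ>, and the dialogue control section 123 detects that the utterance corresponds to template 2. The action determination mechanism section 103 then refers to the table storage section 104, obtains the actuator control information corresponding to <バンザイ>, and supplies it to the corresponding ones of the sub-control sections 63A to 63D, so that the robot 1 can be made to take the pose corresponding to banzai.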
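The table of FIG. 21, which associates a cluster ID with actuator control angles, can be sketched as a small lookup class. The class and method names, the use of degrees, and the sample angles are hypothetical; the figure itself is not reproduced here.

```python
from typing import Dict, List, Optional

NUM_ACTUATORS = 14  # actuators A1 to A14


class PoseTable:
    def __init__(self) -> None:
        self._angles: Dict[int, List[float]] = {}

    def register(self, cluster_id: int, angles: List[float]) -> None:
        """Template 1: store the current control angles under the cluster ID."""
        if len(angles) != NUM_ACTUATORS:
            raise ValueError("expected one angle per actuator")
        self._angles[cluster_id] = list(angles)

    def control_angles(self, cluster_id: int) -> Optional[List[float]]:
        """Template 2: return the stored angles, or None if not registered."""
        angles = self._angles.get(cluster_id)
        return list(angles) if angles is not None else None
```

For example, after the user raises both arms of the robot and names the pose, `register(3, current_angles)` would record it, and a later "<OOV>して" whose cluster ID is 3 would retrieve the same angles with `control_angles(3)`.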
【０１６９】
ステップＳ１２４において、テンプレート２にマッチしていないと判断された場合、ステップＳ１２７において、対話制御部１２３は、供給された音声認識結果が、キャラクタ名やユーザ名の記憶をロボット１に指令するテンプレート３にマッチしているか否かを判断する。
【０１７０】
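Template 3 is described with reference to FIG. 23. For example, when the recognition result is the word string "<先頭>/君/の/名前/は/<OOV>(t/a/r/o:)/だよ/<終端>", the character string generated from this result, "君の名前は<OOV>だよ", matches the regular expression "/君(.)+は<OOV>/", so the operation "register the cluster ID corresponding to the <OOV> as a character name" is executed. When the recognition result is the word string "<先頭>/私/の/名前/は/<OOV>(t/a/r/o:)/です/<終端>", the character string generated from this result, "私の名前は<OOV>です", matches the regular expression "/(私|僕)(.)+は<OOV>/", so the operation "register the cluster ID corresponding to the <OOV> as a user name" is executed.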
【０１７１】
ステップＳ１２７において、テンプレート３にマッチしていると判断された場合、ステップＳ１２８において、対話制御部１２３は、連想記憶部１２２に、音声認識結果である単語のクラスタＩＤと、ユーザ名またはキャラクタ名のカテゴリを対応させて記憶させて処理が終了される。
【０１７２】
なお、ロボット１の利用方法によっては、登録する単語が１種類しかない（例えば、「ユーザ名」のみ）場合もあり、その場合は、テンプレートと連想記憶部１２２は簡略化することができる。例えば、テンプレートの内容を「認識結果に＜ＯＯＶ＞が含まれていたら、そのＩＤを記憶する」として、連想記憶部１２２にそのクラスタＩＤのみを記憶させることができる。
【０１７３】
対話制御部１２３は、このようにして連想記憶部１２２に登録された情報を、以後の対話の判断処理に反映させる。例えば、ロボット１の側で、「ユーザの発話の中に、「キャラクタ名」が含まれているかどうかを判定し、含まれている場合は『呼びかけられた』と判断して、それに応じた返事をする」という処理や、「ロボット１がユーザの名前をしゃべる」という処理が必要になった場合、対話制御部１２３は、連想記憶部１２２に記録されている情報を参照することで、ロボット１に相当する単語（カテゴリ名が「キャラクタ名」であるエントリ）やユーザ名に相当する単語（カテゴリ名が「ユーザ名」であるエントリ）を得ることができる。
【０１７４】
ステップＳ１２７において、テンプレート３にマッチしていないと判断された場合、ステップＳ１２９において、対話制御部１２３は、入力音声に対応する所定の応答処理を実行する。すなわち、この場合には、未知語の登録処理は行われず、例えば、「歩け」という命令に対して、歩く動作を行うなどの、ユーザからの入力音声に対応する所定の処理が実行されて、処理が終了される。
【０１７５】
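In this way, the acquired speech data is checked against the regular expressions to determine whether it matches one of the predetermined templates, and when it is determined to match one of them, the motion of the robot 1 is controlled so that the action corresponding to that template is executed.
【０１７６】
The contents of templates 1 to 3 described with reference to FIGS. 20, 22, and 23 are not limited to the above. For example, "/<OOV>(category: pose name)+ポーズ+して/" may be added to the regular expressions of template 2, or the regular expression of template 2 may be changed to "/<OOV>(category: pose name)(.)+して/".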
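The template check of steps S121, S124, and S127 can be sketched with Python's `re` module. The patterns follow the regular expressions quoted in the text, but the action labels and the helper that turns a word string into a character string are assumptions, and the category check on the <OOV> (for example, that it is a pose name in template 2) is omitted for brevity.

```python
import re
from typing import List

# Templates are tried in the order of the flowchart: 1, then 2, then 3.
TEMPLATES = [
    ("register_pose_name", re.compile(r"これは<OOV>")),          # template 1
    ("do_pose", re.compile(r"<OOV>して")),                       # template 2
    ("register_character_name", re.compile(r"君.+は<OOV>")),     # template 3
    ("register_user_name", re.compile(r"(私|僕).+は<OOV>")),     # template 3
]


def to_text(words: List[str]) -> str:
    """Drop the start/end symbols and phoneme annotations, then join."""
    kept = []
    for word in words:
        if word in ("<先頭>", "<終端>"):
            continue
        kept.append("<OOV>" if word.startswith("<OOV>") else word)
    return "".join(kept)


def match_template(words: List[str]) -> str:
    """Return the action label of the first matching template."""
    text = to_text(words)
    for action, pattern in TEMPLATES:
        if pattern.search(text):
            return action
    return "default_response"  # step S129
```

`pattern.search` is used rather than a full match because the templates mean "if the character string is contained".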
【０１７７】
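When a grammar is used as the language model, a description equivalent to a phoneme typewriter can also be built into the grammar. An example of such a grammar is described with reference to FIG. 24. In the grammar shown in FIG. 24, the variable "$PHONEME" in the first line joins all the phonemes with "|", meaning "or", so it denotes any one of the phoneme symbols. The variable "$OOV" denotes zero or more repetitions of the variable "$PHONEME". That is, it means "any phoneme symbols concatenated zero or more times" and is equivalent to a phoneme typewriter. Therefore, the variable "$OOV" between "は" and "です" in the third line can accept any pronunciation.
【０１７８】
In a recognition result obtained with the grammar shown in FIG. 24, the part corresponding to the variable "$OOV" is output as multiple symbols. For example, the recognition result for "私の名前は太郎です" ("My name is Taro") is "<先頭>/私/の/名前/は/t/a/r/o:/です/<終端>". If this result is converted to "<先頭>/私/の/名前/は/<OOV>(t/a/r/o:)/です/<終端>", the processing from step S3 of FIG. 11 onward can be executed in the same way as when the phoneme typewriter is used.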
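The conversion from a run of phoneme symbols to a single "<OOV>(...)" symbol can be sketched as follows. The phoneme inventory below is an assumed, incomplete set used only for illustration; the real set is whatever "$PHONEME" lists in FIG. 24.

```python
from typing import List, Set

# Assumed phoneme inventory (illustrative, not taken from FIG. 24).
PHONEMES: Set[str] = {
    "a", "i", "u", "e", "o", "a:", "i:", "u:", "e:", "o:",
    "k", "s", "sh", "t", "ch", "ts", "n", "h", "f", "m", "y", "r", "w",
    "g", "z", "j", "d", "b", "p", "N", "q",
}


def collapse_phonemes(symbols: List[str], phonemes: Set[str] = PHONEMES) -> List[str]:
    """Replace each maximal run of phoneme symbols with one <OOV>(...) symbol."""
    out: List[str] = []
    run: List[str] = []
    for symbol in symbols:
        if symbol in phonemes:
            run.append(symbol)
            continue
        if run:
            out.append("<OOV>(" + "/".join(run) + ")")
            run = []
        out.append(symbol)
    if run:  # a run that reaches the end of the input
        out.append("<OOV>(" + "/".join(run) + ")")
    return out
```

After this step the word string has the same shape as the output produced with the separate phoneme typewriter, so the later unknown-word check and template matching need no changes.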
【０１７９】
以上においては、未知語に関連する情報として、カテゴリを登録するようにしたが、その他の情報を登録するようにしてもよい。
【０１８０】
また、ロボット１は、ユーザの指令に基づいた処理を実行するのみならず、ユーザの操作を受けない状態においても、自動的に、動作を行うことができる。その動作は、予め設定されているものであっても良いが、例えば、上述した処理により、未知語と対応付けられて記憶された動作を、ランダムに実行するようにし、ロボット１が、自分自身の動作を認識して、ポーズに対応する発話を自動的に行うようにしてもよい。
【０１８１】
図２５のフローチャートを参照して、ロボット１が実行する、動作認識発話処理について説明する。
【０１８２】
ステップＳ１５１において、音声認識部１０１Ａの音声認識処理部１２１は、マイクロホン８２から、音声の入力を受けたか否かを判断する。ステップＳ１５１において、音声の入力を受けていないと判断された場合、音声の入力を受けたと判断されるまで、ステップＳ１５１の処理が繰り返される。
【０１８３】
ステップＳ１５２において、図１２を用いて説明した音声認識処理が実行される。
【０１８４】
ステップＳ１５３において、音声認識部１０１Ａの対話制御部１２３は、供給された音声認識結果が、「そのポーズは何？」であったか否かを判断する。
【０１８５】
ステップＳ１５３において、「そのポーズは何？」という音声を認識しなかったと判断された場合、ステップＳ１５４において、行動決定機構部１０３は、図示しない内部のタイマが、ユーザからの指令の入力を受けない状態で、所定の時間が経過したことを通知したか否かを判断する。ステップＳ１５４において、タイマが所定経過時間を通知していないと判断された場合、処理は、ステップＳ１５１に戻り、それ以降の処理が繰り返される。
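[0186]
If it is determined in step S153 that the utterance "What is that pose?" was recognized, or if it is determined in step S154 that the timer has reported the predetermined elapsed time, then in step S155 the action determination mechanism unit 103 recognizes the control angles of the actuators A1 to A14 based on the signals supplied from the sub-control units 63A to 63D.
[0187]
In step S156, the action determination mechanism unit 103 refers to the table stored in the table storage unit 104, which shows the correspondence between pose names and the control angles of the actuators A1 to A14, and searches for the pose name corresponding to the control angles recognized in step S155.
[0188]
In step S157, the action determination mechanism unit 103 determines whether the table stored in the table storage unit 104 contains a pose name corresponding to the control angles of the actuators A1 to A14 recognized in step S155. If it is determined in step S157 that there is no corresponding pose name, the processing returns to step S151 and the subsequent processing is repeated.
[0189]
If it is determined in step S157 that there is a corresponding pose name, then in step S158 the action determination mechanism unit 103 refers to the table storage unit 104 to extract the corresponding unknown word whose category is pose name, and controls the voice synthesis unit 106 to synthesize speech such as "I am doing the <OOV> pose now," which is output from the speaker 72, that is, uttered. The processing then returns to step S151 and the subsequent processing is repeated.
[0190]
Through the processing described above, the robot 1 can not only respond to the user's utterances but also speak automatically, by recognizing its own motion and using the pose names that it stores in a table in association with the control angles of the actuators A1 to A14.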
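Steps S155 to S158 amount to a reverse lookup from measured control angles to a stored pose name. The sketch below illustrates that idea under stated assumptions: the table contents, the pose names, and the function names are hypothetical, and the per-actuator tolerance is an assumption added here because the description does not say how closely the angles must agree.

```python
# Hypothetical table: pose name -> control angles (degrees) for A1..A14.
POSE_TABLE = {
    "banzai": (0, 0, 0, 0, 170, 10, 0, 0, 0, 0, 0, 0, 0, 0),
    "ojigi": (0, 40, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
}


def find_pose_name(angles, table=POSE_TABLE, tolerance_deg=2.0):
    """Return the first pose whose stored angles all lie within the tolerance."""
    for name, stored in table.items():
        if len(stored) == len(angles) and all(
            abs(a - s) <= tolerance_deg for a, s in zip(angles, stored)
        ):
            return name
    return None  # corresponds to the "no pose name" branch of step S157


def utterance_for(angles):
    """Build the sentence to synthesize, or None if no pose matches."""
    name = find_pose_name(angles)
    return None if name is None else f"I am doing the {name} pose now"


measured = (0, 1, 0, 0, 169, 11, 0, 0, 0, 0, 0, 0, 0, 0)
print(utterance_for(measured))  # I am doing the banzai pose now
```

An exact-equality comparison would also fit the description; a tolerance is used here only because measured joint angles rarely reproduce stored values exactly.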
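[0191]
The processing described above concerns the case in which the control angles of the actuators A1 to A14 in a stationary pose are stored in association with an acquired unknown word. Needless to say, however, the present invention is applicable not only to stationary poses but also to the case in which control information for the actuators A1 to A14 while the robot 1 performs a predetermined motion (for example, the control angles of each actuator expressed as a time series) is stored in association with an acquired unknown word.
[0192]
The above description concerns controlling the operation (driving and utterance) of a robot 1 that has actually drivable parts corresponding to limbs and the like, but the present invention is also applicable to, for example, a (virtual) robot 1 displayed on a display.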
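[0193]
FIG. 26 shows a configuration example of a personal computer 201 that executes the processing described above. The personal computer 201 incorporates a CPU (Central Processing Unit) 211. An input/output interface 215 is connected to the CPU 211 via a bus 214. A ROM (Read Only Memory) 212 and a RAM (Random Access Memory) 213 are connected to the bus 214.
[0194]
Connected to the input/output interface 215 are an input unit 217 made up of input devices operated by the user, such as a mouse, a keyboard, a microphone, and an AD converter, and an output unit 216 made up of output devices such as a display, a speaker, and a DA converter. Also connected to the input/output interface 215 are a storage unit 218, such as a hard disk drive, that stores programs and various data, and a communication unit 219 that communicates data via a network typified by the Internet.
[0195]
A drive 220 that reads and writes data on recording media such as a magnetic disk 231, an optical disk 232, a magneto-optical disk 233, and a semiconductor memory 234 is connected to the input/output interface 215 as necessary.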
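[0196]
For example, the personal computer 201 can pick up the speech uttered by the user with the microphone of the input unit 217, and executes voice recognition processing through the processing of the CPU 211 in the same manner as described above. Under the control of the CPU 211, it can also display, for example, a character having the appearance of the robot 1 on the display of the output unit 216. Based on the recognition result for the speech picked up by the microphone of the input unit 217, or on an operation the user inputs with the mouse of the input unit 217, the CPU 211 controls the movement (on-screen movement) of the character displayed on the display of the output unit 216 (that is, it controls what the display shows), or generates voice data for a reply to the user's call and outputs it from the speaker of the output unit 216.
[0197]
When the user utters "This is the <happy> pose," the CPU 211 recognizes the unknown word "happy", associates a unique ID with it as an unknown word in the pose-name category, and can record the movement of the character displayed on the display of the output unit 216 (that is, the image data displayed on the display) together with the corresponding ID in, for example, the RAM 213. If the user subsequently commands "Do the <happy> pose," the display of the output unit 216 can be changed, based on the information recorded in the RAM 213, to the display corresponding to the unknown word <happy>. Alternatively, through the processing of the CPU 211, the display of the output unit 216 can be changed automatically to the display corresponding to the unknown word <happy>, and the utterance "This is the <happy> pose" can be output from the speaker of the output unit 216.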
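[0198]
The series of processes described above can also be executed by software. The software is installed from a recording medium onto a computer in which the programs constituting the software are built into dedicated hardware, or onto, for example, a general-purpose personal computer that can execute various functions when various programs are installed.
[0199]
As shown in FIG. 26, this recording medium is constituted by package media that are distributed separately from the computer in order to provide the program to the user and on which the program is recorded, such as the magnetic disk 231 (including a flexible disk), the optical disk 232 (including a CD-ROM (Compact Disk-Read Only Memory) and a DVD (Digital Versatile Disk)), the magneto-optical disk 233 (including an MD (Mini-Disk) (trademark)), or the semiconductor memory 234.
[0200]
For example, a robot control processing program that causes the personal computer 201 to operate as a robot control device to which the present invention is applied is supplied to the personal computer 201 stored on the magnetic disk 231 (including a floppy disk), the optical disk 232 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), the magneto-optical disk 233 (including an MD (Mini Disc)), or the semiconductor memory 234, is read by the drive 220, and is installed on the hard disk drive built into the storage unit 218. The program installed in the storage unit 218 is loaded from the storage unit 218 into the RAM 213 and executed in response to a command from the CPU 211 corresponding to a command from the user input to the input unit 217.
[0201]
In this specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the described order but also processing that is executed in parallel or individually and not necessarily in time series.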
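[0202]
[Effects of the Invention]
As described above, according to the present invention, speech can be recognized. In particular, recognized speech can be registered in association with information for controlling the operation of the robot.
According to another aspect of the present invention, the state of the robot can be detected, and speech including a word corresponding to the state of the robot can be synthesized based on the detection result.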
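[Brief Description of the Drawings]
FIG. 1 is a perspective view showing the external configuration of a robot to which the present invention is applied.
FIG. 2 is a rear perspective view showing the external configuration of the robot of FIG. 1.
FIG. 3 is a schematic diagram for explaining the robot of FIG. 1.
FIG. 4 is a block diagram showing the internal configuration of the robot of FIG. 1.
FIG. 5 is a block diagram mainly for explaining the parts related to control of the robot of FIG. 1.
FIG. 6 is a block diagram showing the configuration of the main control unit of FIG. 5.
FIG. 7 is a block diagram showing the configuration of the voice recognition unit of FIG. 6.
FIG. 8 is a diagram explaining the state of clusters.
FIG. 9 is a diagram explaining word registration.
FIG. 10 is a block diagram showing the configuration of the voice recognition processing unit of FIG. 7.
FIG. 11 is a flowchart explaining the processing executed by a robot to which the present invention is applied.
FIG. 12 is a flowchart explaining the voice recognition processing.
FIG. 13 is a diagram showing an example of a grammar used in the language model database.
FIG. 14 is a diagram showing an example of a language model based on a finite state automaton.
FIG. 15 is a diagram showing an example of language score calculation using a tri-gram.
FIG. 16 is a diagram showing an example of a tri-gram database.
FIG. 17 is a flowchart explaining the word string generation processing.
FIG. 18 is a flowchart explaining the word acquisition processing.
FIG. 19 is a flowchart explaining the template matching processing.
FIG. 20 is a diagram explaining a template.
FIG. 21 is a diagram explaining the actuator control angles registered in association with cluster IDs.
FIG. 22 is a diagram explaining a template.
FIG. 23 is a diagram explaining a template.
FIG. 24 is a diagram showing an example of a grammar incorporating the phoneme typewriter.
FIG. 25 is a flowchart explaining the motion recognition utterance processing.
FIG. 26 is a block diagram showing the configuration of a computer to which the present invention is applied.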
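[Explanation of Reference Numerals]
1 robot, 61 main control unit, 63 sub-control unit, 72 speaker, 82 microphone, 101 sensor input processing unit, 101A voice recognition unit, 103 action determination mechanism unit, 104 table storage unit, 105 posture transition mechanism unit, 106 voice synthesis unit, 121 voice recognition processing unit, 122 associative memory unit, 123 dialogue control unit, 124 word acquisition unit, 172 feature extraction unit, 173 matching unit, 174 phoneme typewriter unit, 175 control unit, 181 acoustic model database, 182 dictionary database, 183 language model database
[0001]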
[Technical Field of the Invention]
The present invention relates to a robot control device and method, a recording medium, and a program, and in particular to a robot control device and method, a recording medium, and a program suitable for controlling a robot capable of voice recognition, voice output, and driving.
[0002]
[Prior art]
In a dialogue system, there are many situations in which the name of something is registered by voice. For example, a user registers his or her own name, gives the dialogue system a name, or inputs a place name or a shop name.
[0003]
Conventionally, a simple way to implement such voice registration is to shift to a registration mode in response to some command and return to the normal dialogue mode when registration is complete. In this case, for example, the voice command "user name registration" shifts the system to the registration mode; when the user then utters a name, it is registered, after which the system returns to the normal mode.
[0004]
For example, there is a technique in which a robot capable of voice recognition is made to take an action indicating that it is to be given a name, thereby notifying the user of the switch to the robot name registration mode; an optimal phoneme sequence is then detected from the voice input after the robot's action has been controlled, and that phoneme sequence is registered as the name (for example, see Patent Document 1).
[0005]
[Patent Document 1]
JP-A-2002-120177
[0006]
[Problems to be solved by the invention]
However, in such a voice registration method, the mode must be switched by a command, which is unnatural as dialogue and burdensome for the user. Moreover, when there are a plurality of naming targets, the number of commands increases, which is even more burdensome.
[0007]
For example, when the user is notified of the transition to the registration mode by an action of the robot as described above, the robot's action must differ for each naming target and the user must learn the relationship between each action and the content being registered, so voice registration becomes very burdensome.
[0008]
In addition, if the user says something other than a name (for example, "Hello") during the registration mode, that word is also registered as a name. Likewise, if the user says, for example, "My name is Taro" instead of just the name "Taro", the entire utterance ("My name is Taro") is registered as the name.
[0009]
Furthermore, a registered word is used only for voice output processing that uses the word and for recognition processing when the user utters the same word after registration.
[0010]
The present invention has been made in view of such circumstances, and makes it possible to store a word registered during normal dialogue in association with a motion of a robot so that the word can be used for the motion of the robot.
[0011]
[Means for Solving the Problems]
A first robot control device according to the present invention includes: recognition means for recognizing continuous input speech; acquisition means for acquiring, when it is determined that the recognition result produced by the recognition means contains an unknown word, a word corresponding to that unknown word; and registration means for registering the word acquired by the acquisition means in association with information for controlling the operation of the robot.
[0012]
The device may further include pattern determination means for determining whether the recognition result produced by the recognition means matches a specific pattern. When the pattern determination means determines that the recognition result matches the specific pattern, the registration means can register the word in association with information for controlling the operation of the robot.
[0013]
The device may further include detection means for detecting the state of the robot. The detection means can detect the state of the robot at the time the recognition result is determined to match the specific pattern, and the registration means can register the word in association with information for controlling the operation of the robot so that the robot assumes the state detected by the detection means.
[0014]
The device may further include control means for controlling the driving of the robot and pattern determination means for determining whether the recognition result produced by the recognition means matches a specific pattern. When the pattern determination means determines that the recognition result matches the specific pattern, the control means can control the driving of the robot based on the information for controlling the operation of the robot that the registration means registered in association with the word.
[0015]
The device may further include storage means for classifying the words acquired by the acquisition means into a plurality of categories and storing them. The registration means can register a word stored in a predetermined category in the storage means in association with information for controlling the operation of the robot.
[0016]
A first robot control method according to the present invention includes: a recognition step of recognizing continuous input speech; a determination step of determining whether the recognition result produced by the recognition step contains an unknown word; an acquisition step of acquiring, when the determination step determines that the recognition result contains an unknown word, a word corresponding to that unknown word; and a registration step of registering the word acquired in the acquisition step in association with information for controlling the operation of the robot.
[0017]
The program recorded on a first recording medium of the present invention includes: a recognition step of recognizing continuous input speech; a determination step of determining whether the recognition result produced by the recognition step contains an unknown word; an acquisition step of acquiring, when the determination step determines that the recognition result contains an unknown word, a word corresponding to that unknown word; and a registration step of registering the word acquired in the acquisition step in association with information for controlling the operation of the robot.
[0018]
A first program according to the present invention includes: a recognition step of recognizing continuous input speech; a determination step of determining whether the recognition result produced by the recognition step contains an unknown word; an acquisition step of acquiring, when the determination step determines that the recognition result contains an unknown word, a word corresponding to that unknown word; and a registration step of registering the word acquired in the acquisition step in association with information for controlling the operation of the robot.
[0019]
In the first robot control device and method and the first program according to the present invention, continuous input speech is recognized, and it is determined whether the recognition result contains an unknown word. If it is determined that an unknown word is contained, a word corresponding to the unknown word is acquired, and the acquired word is registered in association with information for controlling the operation of the robot.
[0020]
A second robot control device according to the present invention includes: control means for controlling the driving of the robot; registration means for registering information indicating a state of the robot in association with a corresponding word; detection means for detecting the state of the robot whose driving is controlled by the control means; voice synthesis means for synthesizing speech; and output means for outputting the speech synthesized by the voice synthesis means. When the state of the robot detected by the detection means matches information indicating a state of the robot registered by the registration means, the voice synthesis means synthesizes speech including the word associated with that information.
[0021]
The device may further include input means for receiving a command from the user, and the detection means can detect the state of the robot when the input means has received no operation input for a predetermined time.
[0022]
The device may further include recognition means for recognizing continuous input speech and pattern determination means for determining whether the recognition result produced by the recognition means matches a specific pattern. When it is determined that the recognition result matches the specific pattern, the detection means can detect the state of the robot.
[0023]
The device may further include acquisition means for acquiring, when it is determined that the recognition result produced by the recognition means contains an unknown word, a word corresponding to that unknown word. The word corresponding to the unknown word acquired by the acquisition means and information indicating the state of the robot can be stored in association with each other.
[0024]
A second robot control method according to the present invention includes: a detection step of detecting a state of the robot; a determination step of determining whether the state of the robot detected in the detection step matches information indicating a state of the robot registered in registration information; and a voice synthesis step of synthesizing, when the determination step determines that the detected state matches the registered information, speech including the word associated with that information.
[0025]
The program recorded on a second recording medium of the present invention includes: a detection step of detecting a state of the robot; a determination step of determining whether the state of the robot detected in the detection step matches information indicating a state of the robot registered in registration information; and a voice synthesis step of synthesizing, when the determination step determines that the detected state matches the registered information, speech including the word associated with that information.
[0026]
A second program according to the present invention includes: a detection step of detecting a state of the robot; a determination step of determining whether the state of the robot detected in the detection step matches information indicating a state of the robot registered in registration information; and a voice synthesis step of synthesizing, when the determination step determines that the detected state matches the registered information, speech including the word associated with that information.
[0027]
In the second robot control device and method and the second program according to the present invention, the state of the robot is detected, and it is determined whether the detected state matches information indicating a state of the robot registered in registration information. If it is determined that the detected state matches the registered information, speech including the word associated with that information is synthesized and output.
[0028]
[Best Mode for Carrying Out the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0029]
FIG. 1 is a front perspective view of a bipedal walking robot 1 to which the present invention is applied, and FIG. 2 is a perspective view of the robot 1 as viewed from the rear. FIG. 3 is a diagram for explaining the axis configuration of the robot 1.
[0030]
In the robot 1, a head unit 12 is disposed on top of a body unit 11; arm units 13A and 13B of identical configuration are attached at predetermined positions on the upper left and right of the body unit 11; and leg units 14A and 14B of identical configuration are attached at predetermined positions on the lower left and right of the body unit 11. The head unit 12 is provided with a touch sensor 51.
[0031]
In the body unit 11, a frame 21 forming the upper trunk and a waist base 22 forming the lower trunk are connected via a waist joint mechanism 23. By driving the actuators A1 and A2 of the waist joint mechanism 23, which is fixed to the waist base 22 of the lower trunk, the upper trunk can be rotated independently about the mutually orthogonal roll axis 24 and pitch axis 25 shown in FIG. 3.
[0032]
The head unit 12 is attached, via a neck joint mechanism 27, to the center of the upper surface of a shoulder base 26 fixed to the upper end of the frame 21. By driving the actuators A3 and A4 of the neck joint mechanism 27, the head unit 12 can be rotated independently about the mutually orthogonal pitch axis 28 and yaw axis 29 shown in FIG. 3.
[0033]
The arm units 13A and 13B are attached to the left and right of the shoulder base 26, respectively, via shoulder joint mechanisms 30. By driving the actuators A5 and A6 of the corresponding shoulder joint mechanism 30, each arm unit can be rotated independently about the mutually orthogonal pitch axis 31 and roll axis 32 shown in FIG. 3.
[0034]
In this case, in each of the arm units 13A and 13B, the actuator A8 forming the forearm is connected via the elbow joint mechanism 44 to the output shaft of the actuator A7 forming the upper arm, and a hand portion 34 is attached to the tip of the forearm.
[0035]
In the arm units 13A and 13B, driving the actuator A7 rotates the forearm about the yaw axis 35 shown in FIG. 3, and driving the actuator A8 rotates the forearm about the pitch axis 36 shown in FIG. 3.
[0036]
The leg units 14A and 14B are attached to the waist base 22 of the lower trunk via hip joint mechanisms 37. By driving the actuators A9 to A11 of the corresponding hip joint mechanism 37, each leg unit can be rotated independently about the mutually orthogonal yaw axis 38, roll axis 39, and pitch axis 40.
[0037]
In the leg units 14A and 14B, the lower end of a frame 41 forming the thigh is connected via a knee joint mechanism 42 to a frame 43 forming the lower leg, and the lower end of the frame 43 is connected to a foot 45 via an ankle joint mechanism 44.
[0038]
Thus, in the leg units 14A and 14B, driving the actuator A12 forming the knee joint mechanism 42 rotates the lower leg about the pitch axis 46 shown in FIG. 3, and driving the actuators A13 and A14 of the ankle joint mechanism 44 rotates the foot 45 independently about the mutually orthogonal pitch axis 47 and roll axis 48 shown in FIG. 3.
[0039]
A control unit 52, which is a box housing the main control unit 61 and the peripheral circuits 62 (both shown in FIG. 4) described below, is mounted on the back of the waist base 22 that forms the lower trunk of the body unit 11.
[0040]
FIG. 4 is a diagram illustrating an actuator of the robot 1 and a control system thereof.
[0041]
The control unit 52 contains a main control unit 61 for controlling the operation of the entire robot 1, peripheral circuits 62 such as a power supply circuit and a communication circuit, and a battery 74 (FIG. 5).
[0042]
The control unit 52 is connected to sub-control units 63A to 63D disposed in the respective constituent units (the body unit 11, the head unit 12, the arm units 13A and 13B, and the leg units 14A and 14B). It supplies the necessary power supply voltage to the sub-control units 63A to 63D and communicates with them.
[0043]
The sub-control units 63A to 63D are each connected to the actuators A1 to A14 in the corresponding constituent units, and drive those actuators to the states specified by the various control commands supplied from the main control unit 61.
[0044]
FIG. 5 is a block diagram showing the internal configuration of the robot 1.
[0045]
In the head unit 12, an external sensor unit 71, which includes a CCD (Charge Coupled Device) camera 81 functioning as the "eyes" of the robot 1, a microphone 82 functioning as its "ears", and the touch sensor 51, and a speaker 72 functioning as its "mouth" are provided at predetermined positions. An internal sensor unit 73 including a battery sensor 91 and an acceleration sensor 92 is provided in the control unit 52.
[0046]
Then, the CCD camera 81 of the external sensor unit 71 captures an image of the surroundings, and sends out the obtained image signal S1A to the main control unit 61. The microphone 82 collects various command sounds such as “walk”, “stop” or “raise your right hand” given as a voice input from the user, and sends out the obtained voice signal S1B to the main control unit 61.
[0047]
The touch sensor 51 is provided, for example, on the upper part of the head unit 12 as shown in FIGS. 1 and 2. It detects pressure applied by a physical action from the user such as "stroking" or "hitting", and sends the detection result to the main control unit 61 as a pressure detection signal S1C.
[0048]
The battery sensor 91 of the internal sensor unit 73 detects the remaining energy of the battery 74 at a predetermined cycle and sends the detection result to the main control unit 61 as a remaining battery level detection signal S2A. The acceleration sensor 92 detects the acceleration of the movement of the robot 1 in three axial directions (x-axis, y-axis, and z-axis) at a predetermined cycle and sends the detection result to the main control unit 61 as an acceleration detection signal S2B.
[0049]
The main control unit 61 determines the situation around and inside the robot 1, commands from the user, the presence or absence of actions from the user, and so on. It does so based on the image signal S1A, the audio signal S1B, and the pressure detection signal S1C (hereinafter collectively referred to as the external sensor signal S1) supplied from the CCD camera 81, the microphone 82, and the touch sensor 51 of the external sensor unit 71, respectively, and on the remaining battery level detection signal S2A and the acceleration detection signal S2B (hereinafter collectively referred to as the internal sensor signal S2) supplied from the battery sensor 91 and the acceleration sensor 92 of the internal sensor unit 73, respectively.
[0050]
The main control unit 61 then decides the action of the robot 1 based on these determination results, a control program stored in advance in the internal memory 61A, and various control parameters stored in the external memory 75 loaded at that time. It generates control commands based on the decision and sends them to the corresponding sub-control units 63A to 63D. The sub-control units 63A to 63D control the driving of the corresponding actuators among the actuators A1 to A14 based on the supplied control commands. As a result, the robot 1 can perform actions such as swinging the arm unit 13A, raising the arm unit 13B, and walking by alternately driving the leg unit 14A and the leg unit 14B.
[0051]
Further, the main control unit 61 outputs a sound based on the sound signal S3 to the outside by giving a predetermined sound signal S3 to the speaker 72 as necessary. Further, the main controller 61 blinks the LED by outputting a drive signal to an LED (not shown) provided at a predetermined position of the head unit 12 and functioning as an apparent "eye".
[0052]
In this way, the robot 1 is capable of acting autonomously based on the surrounding and internal conditions, the presence / absence of a command from the user, and the presence or absence of an action.
[0053]
Next, FIG. 6 shows an example of the functional configuration of the main control unit 61 of FIG. 5. Note that the functional configuration shown in FIG. 6 is realized by the main control unit 61 executing a control program stored in the memory 61A.
[0054]
The main control unit 61 includes: a sensor input processing unit 101 that recognizes specific external states; a model storage unit 102 that accumulates the recognition results of the sensor input processing unit 101 and stores models of the emotion, instinct, and growth state of the robot 1; a table storage unit 104 that stores a table of voice recognition results and action contents; an action determination mechanism unit 103 that determines the action of the robot 1 based on the recognition results of the sensor input processing unit 101 and the table stored in the table storage unit 104; a posture transition mechanism unit 105 that actually causes the robot 1 to perform an action based on the determination result of the action determination mechanism unit 103; and a voice synthesis unit 106 that generates synthesized speech.
[0055]
The sensor input processing unit 101 recognizes specific external states, specific actions from the user, commands from the user, and the like based on the audio signal, image signal, pressure detection signal, and other signals supplied from the microphone 82, the CCD camera 81, the touch sensor 51, and so on, and notifies the model storage unit 102 and the action determination mechanism unit 103 of state recognition information representing the recognition results.
[0056]
That is, the sensor input processing unit 101 has a voice recognition unit 101A, which performs voice recognition on the audio signal supplied from the microphone 82. The voice recognition unit 101A then notifies the model storage unit 102 and the action determination mechanism unit 103 of commands such as "walk", "stop", and "raise your right hand" and other voice recognition results as state recognition information.
[0057]
Further, the voice recognition unit 101A can newly recognize unknown words (OOV: Out Of Vocabulary) and, as necessary, supplies the ID associated with a recognized unknown word to the action determination mechanism unit 103. Details of unknown-word recognition will be described later.
[0058]
Further, the sensor input processing unit 101 has an image recognition unit 101B, which performs image recognition processing using the image signal supplied from the CCD camera 81. When, as a result of the processing, the image recognition unit 101B detects, for example, "a red round object" or "a plane that is perpendicular to the ground and at least a predetermined height", it notifies the model storage unit 102 and the action determination mechanism unit 103 of an image recognition result such as "there is a ball" or "there is a wall" as state recognition information.
[0059]
Further, the sensor input processing unit 101 has a pressure processing unit 101C, which processes the pressure detection signal supplied from the touch sensor 51. When, as a result of the processing, the pressure processing unit 101C detects a pressure that is at or above a predetermined threshold and of short duration, it recognizes that the robot has been "hit"; when it detects a pressure of long duration, it recognizes that the robot has been "stroked (praised)". It then notifies the model storage unit 102 and the action determination mechanism unit 103 of the recognition result as state recognition information.
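The threshold-and-duration rule used by the pressure processing unit 101C can be sketched as follows. All numeric values and names here are hypothetical, chosen only for illustration. The text states the pressure condition only for the "hit" case, so treating a "stroke" as a long press below the threshold is an assumption of this sketch.

```python
def classify_touch(pressure, duration_s,
                   pressure_threshold=0.5, duration_threshold_s=0.3):
    """Classify a touch as "hit", "stroke", or None (unclassified).

    pressure_threshold and duration_threshold_s are illustrative values;
    the description only speaks of predetermined thresholds.
    """
    strong = pressure >= pressure_threshold
    brief = duration_s < duration_threshold_s
    if strong and brief:
        return "hit"
    if not strong and not brief:  # assumption: strokes are weak and long
        return "stroke"
    return None


print(classify_touch(0.9, 0.1))  # hit
print(classify_touch(0.2, 1.5))  # stroke
```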
[0060]
The model storage unit 102 stores and manages an emotion model, an instinct model, and a growth model expressing the emotion, instinct, and growth state of the robot 1, respectively.
[0061]
Here, the emotion model represents, for example, the state (degree) of emotions such as "joy", "sadness", "anger", and "fun" by values in a predetermined range (for example, −1.0 to 1.0), and changes those values based on the state recognition information from the sensor input processing unit 101, the passage of time, and the like. The instinct model represents the state (degree) of instinctive desires such as "appetite", "desire for sleep", and "desire for exercise" by values in a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 101, the passage of time, and the like. The growth model represents, for example, the state (degree) of growth such as "childhood", "adolescence", "maturity", and "old age" by values in a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 101, the passage of time, and the like.
[0062]
The model storage unit 102 sends the emotion, instinct, and growth state represented by the values of the emotion model, instinct model, and growth model as described above to the behavior determination mechanism unit 103 as state information.
[0063]
In addition to the state recognition information supplied from the sensor input processing unit 101, the model storage unit 102 is supplied with behavior information indicating the current or past behavior of the robot 1, specifically, for example, the content of a behavior such as "walked". Even when it is given the same state recognition information, the model storage unit 102 generates different state information depending on the behavior of the robot 1 indicated by the behavior information.
[0064]
That is, for example, when the robot 1 greets the user and the user strokes its head, behavior information indicating that the robot 1 greeted the user and state recognition information indicating that its head was stroked are given to the model storage unit 102. In this case, the model storage unit 102 increases the value of the emotion model representing "joy".
[0065]
On the other hand, when the robot 1 is stroked on the head while performing some task, behavior information indicating that the robot 1 is performing the task and state recognition information indicating that its head was stroked are given to the model storage unit 102. In this case, the model storage unit 102 does not change the value of the emotion model representing "joy".
[0066]
As described above, the model storage unit 102 sets the value of the emotion model by referring not only to the state recognition information but also to the behavior information indicating the current or past behavior of the robot 1. This avoids unnatural changes in emotion, such as increasing the value of the emotion model representing "joy" when the user mischievously strokes the robot's head while it is performing some task.
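The behavior-dependent update described above can be sketched as follows. The class name, the string labels, and the step size of 0.2 are hypothetical choices for this example; only the value range of −1.0 to 1.0 and the two head-stroking cases come from the description.

```python
class ModelStorage:
    """Hypothetical stand-in for the emotion part of the model storage unit 102."""

    LOW, HIGH = -1.0, 1.0  # example range given in the description

    def __init__(self):
        self.emotion = {"joy": 0.0, "sadness": 0.0, "anger": 0.0, "fun": 0.0}

    def _adjust(self, name, delta):
        """Add delta to one emotion value, clamped to [LOW, HIGH]."""
        value = self.emotion[name] + delta
        self.emotion[name] = max(self.LOW, min(self.HIGH, value))

    def update(self, state_recognition, behavior):
        """Update emotions from state recognition and behavior information."""
        if state_recognition == "head stroked" and behavior == "greeting user":
            self._adjust("joy", 0.2)  # illustrative step size
        # Stroked while performing a task: "joy" is deliberately left unchanged.


storage = ModelStorage()
storage.update("head stroked", "performing task")
print(storage.emotion["joy"])  # 0.0
storage.update("head stroked", "greeting user")
print(storage.emotion["joy"])  # 0.2
```

Passing the behavior alongside the recognition result is what lets the same stimulus produce different state information.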
[0067]
Note that the model storage unit 102 also increases and decreases the values of the instinct model and the growth model based on both the state recognition information and the behavior information, as in the case of the emotion model. Further, the model storage unit 102 increases or decreases the values of the emotion model, the instinct model, and the growth model based on the values of other models.
[0068]
The action determination mechanism unit 103 determines the next action based on the state recognition information from the sensor input processing unit 101, the state information from the model storage unit 102, the passage of time, and the like, referring as necessary to the table stored in the table storage unit 104, and sends the content of the determined action to the posture transition mechanism unit 105 as action command information.
[0069]
In addition, when the action determination mechanism unit 103 receives, from the voice recognition unit 101A of the sensor input processing unit 101, a voice input that matches a predetermined first rule such as, for example, “This is a <OOV (unknown word)> pose”, it stores the signals indicating the states of the actuators A1 to A14 supplied from the sub-control units 63A to 63D in the table storage unit 104 in association with the unknown word indicating the pose obtained as a result of the voice recognition.
[0070]
Then, when the action determining mechanism unit 103 receives, from the voice recognition unit 101A of the sensor input processing unit 101, a voice input that conforms to a predetermined second rule, such as “<OOV>”, and that <OOV> is stored in the table storage unit 104, it reads the actuator control information corresponding to <OOV> from the table storage unit 104 and supplies it to the posture transition mechanism unit 105.
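The two rules above amount to a store and a lookup keyed by the unknown word. A minimal sketch follows, assuming the unknown word is identified by an integer ID and the actuator states are a list of 14 angles; the class `PoseTable` and its method names are hypothetical stand-ins for the table storage unit 104.

```python
class PoseTable:
    """Hypothetical stand-in for the table storage unit 104."""

    def __init__(self):
        self._angles_by_word = {}

    def register(self, word_id, actuator_angles):
        # First rule: store the current actuator states under the
        # identifier of the unknown word that names the pose.
        self._angles_by_word[word_id] = list(actuator_angles)

    def recall(self, word_id):
        # Second rule: return the stored control information, or None if
        # the word has never been associated with a pose.
        angles = self._angles_by_word.get(word_id)
        return None if angles is None else list(angles)


table = PoseTable()
current_angles = [0.0] * 14   # one value per actuator A1..A14
current_angles[2] = 15.0      # arbitrary example angle
table.register(3, current_angles)
```

Returning `None` for an unregistered word leaves the caller free to fall back to an ordinary response.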
[0071]
That is, the behavior determining mechanism unit 103 manages, as a behavior model that defines the behavior of the robot 1, a finite state automaton in which the behaviors the robot 1 can take correspond to states. It changes the state in the finite state automaton based on the state recognition information from the sensor input processing unit 101, the values of the emotion model, instinct model, or growth model in the model storage unit 102, the elapsed time, and the like, and determines the behavior corresponding to the state after the transition as the next behavior to be taken.
[0072]
Here, the action determining mechanism unit 103 changes the state when it detects a predetermined trigger. That is, for example, it changes the state when the time during which the action corresponding to the current state has been executed reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 102 is equal to or less than a predetermined threshold.
[0073]
Note that, as described above, the action determination mechanism unit 103 changes the state in the action model based not only on the state recognition information from the sensor input processing unit 101 but also on the values of the emotion model, instinct model, and growth model in the model storage unit 102. Therefore, even if the same state recognition information is input, the destination of the state transition differs depending on the values (state information) of the emotion model, instinct model, and growth model.
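The trigger-driven transitions described above can be illustrated with a small function. This is a sketch only: the state names, event labels, timeout, and threshold are all hypothetical, and the real behavior model is not given in the source.

```python
def next_state(current, elapsed, event, state_info,
               timeout=5.0, threshold=30.0):
    """Pick the next behavior state; return `current` if nothing triggers.

    State names, the timeout, and the threshold are hypothetical.
    """
    if current == "walking":
        if event == "obstacle_detected":
            return "stopping"      # trigger: specific state recognition
        if elapsed >= timeout:
            return "resting"       # trigger: the action ran long enough
    if current == "resting" and event == "called_by_user":
        # Same input, different destination depending on the emotion value.
        if state_info.get("joy", 0.0) <= threshold:
            return "ignoring"
        return "approaching"
    return current
```

The last branch shows the point of the paragraph above: the same event leads to different states depending on the state information.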
[0074]
As described above, the action determining mechanism 103 generates, in addition to action command information for operating the head, limbs, and the like of the robot 1, action command information for causing the robot 1 to speak. The action command information for causing the robot 1 to speak is supplied to the speech synthesis unit 106 and includes, among other things, text corresponding to the synthesized sound to be generated by the speech synthesis unit 106. Upon receiving the action command information from the action determining mechanism unit 103, the speech synthesis unit 106 generates a synthesized sound based on the text included in the action command information and supplies it to the speaker 18 for output. As a result, the speaker 18 outputs, for example, a greeting to the user such as “Hello”, various requests to the user, or a response to the user's call such as “What?”.
[0075]
The posture transition mechanism unit 105 generates posture transition information for transitioning the posture of the robot 1 from the current posture to the next posture based on the behavior command information supplied from the behavior determination mechanism unit 103, and sends it to the sub-control units 63A to 63D.
[0076]
FIG. 7 is a functional block diagram illustrating functions of the voice recognition unit 101A of the sensor input processing unit 101.
[0077]
A voice signal based on an utterance from a user is input to the voice recognition processing unit 121. The voice recognition processing unit 121 recognizes the input voice signal and outputs text and other accompanying information as the result of the voice recognition to the dialog control unit 123 and the word acquisition unit 124 as necessary. Details of the voice recognition processing unit 121 will be described later with reference to FIG.
[0078]
The word acquiring unit 124 automatically stores acoustic features of words (unknown words) that are not registered in the recognition dictionary of the speech recognition processing unit 121, so that speech containing those words can be recognized thereafter.
[0079]
That is, the word acquisition unit 124 obtains the pronunciation corresponding to the unknown word portion of the input voice using a phoneme typewriter and classifies it into one of several clusters. Each cluster has an ID and a representative phoneme sequence, and is managed by the ID. The state of the clusters at this time will be described with reference to FIG.
[0080]
For example, assume that there are three input voices, “aka” (red), “ao” (blue), and “gattsu” (guts). In this case, the word acquiring unit 124 classifies the three utterances into three clusters, the “aka” cluster 141, the “ao” cluster 142, and the “gattsu” cluster 143, and assigns each cluster a representative phoneme sequence (“a/k/a”, “a/o”, and “g/a/t/t/u” in the example of FIG. 8) and an ID (“1”, “2”, and “3” in the example of FIG. 8).
[0081]
Here, when the voice “aka” is input again, the corresponding cluster already exists, so the word acquiring unit 124 classifies the input voice into the “aka” cluster 141 and does not generate a new cluster. On the other hand, when the voice “kuro” (black) is input, there is no corresponding cluster, so the word acquiring unit 124 newly generates a cluster 144 corresponding to “kuro” and assigns it a representative phoneme sequence (“k/u/r/o” in the example of FIG. 8) and an ID (“4” in the example of FIG. 8).
[0082]
Therefore, whether or not the input speech is an unacquired word can be determined based on whether or not a new cluster has been generated. The details of such a word acquisition process are disclosed in Japanese Patent Application No. 2001-97842 previously proposed by the present applicant.
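The cluster bookkeeping described above can be sketched as follows. The class `WordAcquirer` and its method are hypothetical, and exact equality of phoneme sequences is used here as a stand-in for the actual clustering criterion, which the source defers to the separate application.

```python
class WordAcquirer:
    """Toy model of cluster management in the word acquisition unit 124."""

    def __init__(self):
        self.clusters = {}   # cluster ID -> representative phoneme sequence
        self._next_id = 1

    def acquire(self, phonemes):
        """Return (cluster_id, is_new) for a phoneme sequence like 'a/k/a'."""
        for cluster_id, representative in self.clusters.items():
            if representative == phonemes:   # assumption: exact match
                return cluster_id, False
        cluster_id = self._next_id
        self._next_id += 1
        self.clusters[cluster_id] = phonemes
        return cluster_id, True


acquirer = WordAcquirer()
results = [acquirer.acquire(p) for p in
           ["a/k/a", "a/o", "g/a/t/t/u", "a/k/a", "k/u/r/o"]]
print(results)
# [(1, True), (2, True), (3, True), (1, False), (4, True)]
```

The `is_new` flag corresponds to the test for an unacquired word: it is true exactly when a new cluster had to be generated.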
[0083]
The associative storage unit 122 stores information such as the category of a registered word (unknown word), for example, whether it is a user name or a character name. In the example of FIG. 9, cluster IDs and category names are stored in association with each other: cluster ID “1” and cluster ID “4” correspond to the category “user name”, cluster ID “2” corresponds to the category “character name”, and cluster ID “3” corresponds to the category “pose name”.
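The association described for FIG. 9 is a mapping from cluster ID to category. A minimal sketch, assuming a plain dictionary as the storage and a hypothetical helper `ids_in_category`:

```python
# Cluster ID -> category, using the example values given for FIG. 9.
associative_memory = {
    1: "user name",
    2: "character name",
    3: "pose name",
    4: "user name",
}


def ids_in_category(memory, category):
    """Return the sorted cluster IDs registered under a category."""
    return sorted(cid for cid, cat in memory.items() if cat == category)


print(ids_in_category(associative_memory, "user name"))  # [1, 4]
```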
[0084]
The dialog control unit 123 understands the content of the user's utterance from the output of the speech recognition processing unit 121, and controls the registration of a word (unknown word) based on the result of the understanding. Further, the dialog control unit 123 controls the subsequent dialog based on the information of the registered words stored in the associative storage unit 122 so that the registered words can be recognized.
[0085]
Thus, the speech recognized by the speech recognition unit 101A is supplied to the action determination mechanism unit 103.
[0086]
The description above assumes that the word acquisition unit 124 generates clusters from the pronunciations obtained by the phoneme typewriter and matches subsequently input unknown words against those clusters. Alternatively, without generating clusters, an ID may be attached to the phoneme sequence itself (for example, “k/u/r/o”) as an unknown word, and when a new unknown word is input, the phoneme sequences may be compared to determine whether the input unknown word matches any unknown word to which an ID has already been attached.
[0087]
FIG. 10 illustrates a configuration example of the speech recognition processing unit 121.
[0088]
The utterance of the user is input to the microphone 82, where the utterance is converted into an audio signal as an electric signal. This audio signal is supplied to an AD (Analog Digital) converter 171. The AD converter 171 samples and quantizes the audio signal that is an analog signal from the microphone 82, and converts it into audio data that is a digital signal. This audio data is supplied to the feature amount extraction unit 172.
[0089]
The feature amount extraction unit 172 extracts, for each appropriate frame, feature parameters such as the spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs from the audio data supplied by the AD conversion unit 171, and supplies them to the matching unit 173 and the phoneme typewriter unit 174.
[0090]
Based on the feature parameters from the feature amount extraction unit 172, and referring as needed to the acoustic model database 181, the dictionary database 182, and the language model database 183, the matching unit 173 finds the word string closest to the speech (input speech) input to the microphone 82.
[0091]
The acoustic model database 181 stores acoustic models representing acoustic features such as individual phonemes and syllables in the language of the speech to be recognized. As the acoustic model, for example, HMM (Hidden Markov Model) or the like can be used. The dictionary database 182 stores, for each word (phrase) to be recognized, a word dictionary in which information about the pronunciation is described, and a model in which a chain relation between phonemes and syllables is described.
[0092]
Note that a word here is a unit that is convenient to treat as one unit in the recognition processing, and does not always match a linguistic word. For example, “Taro-kun” may be treated as one word, or as two words, “Taro” and “kun”. A larger unit such as “Hello Taro” may also be treated as one word.
[0093]
Likewise, a phoneme here is a unit that is convenient to process acoustically as one unit, and does not always match a phoneme in the phonetic or phonological sense. For example, the “tou” portion of “Tokyo” may be represented by three phonetic symbols, “t/o/u”, or as “t/o:” using the symbol “o:”, which is the long sound of “o”. Alternatively, it can be expressed as “t/o/o”. In addition, a symbol representing silence may be prepared, and silence may be further subdivided into “silence before the utterance”, “short silent section between utterances”, “silence of the uttered word”, and “silence of the ‘tsu’ part”, with a symbol prepared for each.
[0094]
The language model database 183 describes information on how words registered in the word dictionary of the dictionary database 182 are linked (connected).
[0095]
The phoneme typewriter unit 174 acquires a phoneme sequence corresponding to the input speech based on the feature parameters supplied from the feature amount extraction unit 172. For example, from the voice “Watashi no namae wa Taro desu” (“My name is Taro”), the phoneme typewriter unit 174 obtains the phoneme sequence “w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u”. An existing phoneme typewriter can be used.
[0096]
Note that, instead of the phoneme typewriter unit 174, another configuration that can acquire a phoneme sequence for arbitrary speech may be used. For example, speech recognition in units of Japanese syllables (a, i, u, ..., ka, ..., n) or speech recognition in units of subwords, which are larger than phonemes but smaller than words, can also be used.
[0097]
The control unit 175 controls the operations of the AD conversion unit 171, the feature amount extraction unit 172, the matching unit 173, and the phoneme typewriter unit 174.
[0098]
Next, a process when the robot 1 receives a voice input will be described with reference to a flowchart of FIG.
[0099]
In step S1, the voice recognition processing unit 121 of the voice recognition unit 101A determines whether a voice has been input from the microphone 82. If it is determined in step S1 that no voice input has been received, the process of step S1 is repeated until it is determined that a voice input has been received.
[0100]
In step S2, a speech recognition process described later with reference to FIG. 12 is executed.
[0101]
In step S3, the control unit 175 of the speech recognition processing unit 121 determines whether the word string recognized in step S2 includes an unknown word.
[0102]
If it is determined in step S3 that an unknown word is included, in step S4, the control unit 175 controls the word obtaining unit 124 to execute a word obtaining process described later with reference to FIG.
[0103]
In step S5, the dialog control unit 123 executes a template matching process, which will be described later with reference to FIG. 19, using the word obtained in the process of step S4, and the process ends.
[0104]
If it is determined in step S3 that an unknown word is not included, in step S5, the control unit 175 of the voice recognition processing unit 121 outputs the recognized voice to the action determination mechanism unit 103 and, if necessary, to the model storage unit 102. The action determining mechanism unit 103 executes a predetermined response process based on the supplied speech recognition result.
[0105]
Specifically, for example, when the recognized voice is “forward”, the action determining mechanism unit 103 determines the action of the robot 1 in accordance with the action model that defines the action of the robot 1, and sends the content of the determined action to the posture transition mechanism unit 105 as action command information. The posture transition mechanism unit 105 generates the necessary control information for the actuators A1 to A14, supplies it to the sub-control units 63A to 63D, drives the corresponding actuators among the actuators A1 to A14, and causes the robot 1 to execute the “forward” action.
[0106]
Alternatively, if the speech cannot be recognized correctly, for example because it does not match the grammar, the action determining mechanism unit 103 generates action command information for causing the robot 1 to utter “What?”, supplies it to the speech synthesis unit 106, and causes the speaker 72 to output the sound “What?”. In addition, the action determining mechanism unit 103 may generate action command information for tilting the head unit 12 to the side at the same time as the voice “What?” is output, and send it to the posture transition mechanism unit 105. In that case, the posture transition mechanism unit 105 generates control information for causing the actuator A3 and the actuator A4 to tilt the head, supplies it to the sub-control unit 63B, and drives the actuator A3 and the actuator A4 so that the robot 1 takes a pose of tilting its head quizzically.
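The branching in the flow described above (unknown word present or not) can be summarized in a short sketch. Everything here is an assumption made for illustration: the function names, the token list format, and the stub return values stand in for the word acquisition process, the template matching process, and the ordinary response process.

```python
def handle_utterance(words, acquire_word, match_templates, respond):
    """Route one recognition result (a list of word tokens).

    The three callables are hypothetical stand-ins for the word
    acquisition, template matching, and response processes.
    """
    if "<OOV>" in words:                           # unknown word present?
        cluster_id = acquire_word(words)           # word acquisition
        return match_templates(words, cluster_id)  # template matching
    return respond(words)                          # ordinary response


def acquire_word(words):
    return 3  # pretend the unknown word was assigned cluster ID 3


def match_templates(words, cluster_id):
    return ("template", cluster_id)


def respond(words):
    return ("response", " ".join(words))


with_oov = handle_utterance(["<head>", "<OOV>", "<end>"],
                            acquire_word, match_templates, respond)
without_oov = handle_utterance(["<head>", "forward", "<end>"],
                               acquire_word, match_templates, respond)
```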
[0107]
Next, the speech recognition processing executed in step S2 in FIG. 11 will be described with reference to the flowchart in FIG.
[0108]
In step S21, the AD conversion unit 171 converts the analog audio signal supplied from the microphone 82 into audio data that is a digital signal, and supplies the digital audio data to the feature extraction unit 172.
[0109]
The feature amount extraction unit 172 receives the audio data from the AD conversion unit 171 in step S22, and extracts, in step S23, for each appropriate frame, for example, a feature parameter such as a spectrum, power, and a time variation thereof. Then, the data is supplied to the matching unit 173.
[0110]
In step S24, the matching unit 173 connects some of the word models stored in the dictionary database 182.
[0111]
In step S25, a word string generation process described later with reference to FIG. 17 is executed. Note that the words constituting this word string include not only known words registered in the dictionary database 182 but also “<OOV>” which is a symbol representing an unregistered unknown word.
[0112]
In step S26, the phoneme typewriter unit 174 performs recognition in units of phonemes on the feature parameters extracted in step S23, independently of the processes of steps S24 and S25, and outputs a phoneme sequence. For example, when the voice “Watashi no namae wa Taro (unknown word) desu” (“My name is Taro”) is input, the phoneme typewriter unit 174 outputs “w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u”.
[0113]
In step S27, the matching unit 173 calculates an acoustic score for each word string generated in step S25. For a word string that does not include <OOV> (an unknown word), the existing method is used, that is, the likelihood is calculated by inputting the speech feature parameters to each word string (a concatenation of word models). With the existing method, however, the acoustic score of the speech section corresponding to <OOV> cannot be obtained, because no word model corresponding to <OOV> exists in advance. For that speech section, therefore, the acoustic score of the same section is taken from the recognition result of the phoneme typewriter, and the corrected value is adopted as the acoustic score of <OOV>. The matching unit 173 then integrates the acoustic score of <OOV> with the acoustic scores of the other, known-word parts and uses the result as the acoustic score of the word string.
[0114]
In step S28, the matching unit 173 keeps the top m (m ≦ n) word strings with the highest acoustic scores as candidate word strings. In step S29, the matching unit 173 calculates a language score for each candidate word string with reference to the language model database 183. The language score indicates how appropriate, as language, a word string that is a candidate for the recognition result is. The method of calculating the language score is described in detail below.
[0115]
Since the speech recognition processing unit 121 of the present invention can also recognize unknown words, the language model needs to support unknown words. As examples, a case using a grammar or finite state automaton (FSA) that supports unknown words and a case using a tri-gram (one type of statistical language model) that also supports unknown words will be described.
[0116]
An example of the grammar will be described with reference to FIG. This grammar is described in BNF (Backus Naur Form). In FIG. 13, “$A” represents a variable, and “A | B” represents “A or B”. “[A]” means “A can be omitted”, and “{A}” means “A is repeated 0 or more times”.
[0117]
<OOV> is a symbol representing an unknown word. Describing <OOV> in the grammar makes it possible to handle word strings that include unknown words. “@ACTION” defines the words corresponding to motions whose names and motion contents have been associated in advance, such as “stand up”, “sit down”, “bow”, and “greet”.
[0118]
With this grammar, word strings that conform to the grammar stored in the database, such as “<head>/konnichiwa/<end>” (hello; “/” is the separator between words), “<head>/sayonara/<end>” (goodbye), and “<head>/watashi/no/namae/wa/<OOV>/desu/<end>” (my name is <OOV>), are accepted (can be parsed with this grammar), whereas word strings that do not conform to it, such as “<head>/kimi/no/<OOV>/namae/<end>”, are not accepted (cannot be parsed with this grammar). Note that “<head>” and “<end>” are special symbols representing the silence before and after the utterance, respectively.
[0119]
A parser (analyzer) is used to calculate a language score using this grammar. The parser classifies word strings into those the grammar accepts and those it does not. That is, for example, an acceptable word string is given a language score of 1, and an unacceptable word string is given a language score of 0.
[0120]
Therefore, for example, when there are two word strings, “<head>/watashi/no/namae/wa/<OOV>(t/a/r/o:)/desu/<end>” and “<head>/watashi/no/namae/wa/<OOV>(j/i/r/o:)/desu/<end>”, both are replaced with “<head>/watashi/no/namae/wa/<OOV>/desu/<end>” before the language score is calculated, and both are given a language score of 1 (accepted).
[0121]
Whether the grammar accepts a word string can also be determined by converting the grammar in advance into an equivalent (or approximate) finite state automaton (hereinafter referred to as FSA) and determining whether the FSA accepts each word string.
[0122]
FIG. 14 shows an example in which the grammar of FIG. 13 is converted into an equivalent FSA. FSA is a directed graph composed of states (nodes) and paths (arcs). As shown in FIG. 14, S1 is a start state, and S20 is an end state. In addition, as in FIG. 13, a word corresponding to an actual operation is registered in “@ACTION”.
[0123]
A word is assigned to each path, and when a transition is made from one state to the next, the path consumes that word. However, a path to which “ε” is assigned is a special transition that does not consume a word (hereinafter referred to as an ε transition). For example, for “<head>/watashi/wa/<OOV>/desu/<end>”, the transition from the initial state S1 to the state S2 consumes “<head>” and the transition from the state S2 to the state S3 consumes “watashi” (I), but the transition from the state S3 to the state S5 is an ε transition and consumes no word. That is, it is possible to skip from the state S3 to the state S5 and then transition to the next state S6.
[0124]
Whether or not a predetermined word string can be accepted by this FSA is determined by whether or not it can start from the initial state S1 and reach the end state S20.
[0125]
That is, for example, for “<head>/watashi/no/namae/wa/<OOV>/desu/<end>” (my name is <OOV>), the state transitions from the initial state S1 to the state S2, consuming the word “<head>”. Next, the state transitions from the state S2 to the state S3, consuming the word “watashi”. Similarly, the state then transitions in order from the state S3 to the state S4, from the state S4 to the state S5, from the state S5 to the state S6, and from the state S6 to the state S7, consuming “no”, “namae”, “wa”, and “<OOV>” in turn. The state further transitions from the state S7 to the state S19, consuming “desu”, and from the state S19 to the state S20, consuming “<end>”, and thus reaches the end state S20. Therefore, “<head>/watashi/no/namae/wa/<OOV>/desu/<end>” is accepted by the FSA.
[0126]
However, “<head>/kimi/no/<OOV>/namae/<end>” transitions from the state S1 to the state S2, from the state S2 to the state S8, and from the state S8 to the state S9, consuming “<head>”, “kimi” (you), and “no”, but no further transition is possible, so the end state S20 cannot be reached. Therefore, “<head>/kimi/no/<OOV>/namae/<end>” is not accepted by the FSA.
[0127]
Further, “<head>/sayonara/<end>” (goodbye) and “<head>/konnichiwa/<end>” (hello) both transition from the state S1 to the state S2, consuming the word “<head>”, then from the state S2 to the state S19, consuming the word “sayonara” or “konnichiwa”, and finally from the state S19 to the state S20, consuming “<end>”.
[0128]
For “<head>/kore wa/<OOV>/pose/dayo/<end>” (this is the <OOV> pose) or a similar word string such as “<head>/kore wa/<OOV>/desu/<end>”, the state transitions from the state S1 to the state S2, consuming the word “<head>”; from the state S2 to the state S13, consuming the word “kore wa”; and from the state S13 to the state S14, consuming “<OOV>”. Then either the word “pose” is consumed or an ε transition is made, and either the word “dayo” or “desu” is consumed in the transition from the state S14 to the state S19 or an ε transition is made from the state S14 to the state S19. Finally, the state transitions from the state S19 to the state S20, consuming “<end>”.
[0129]
Then, for “<head>/<OOV> (character name)/@ACTION/shite/<end>” or a similar word string such as “<head>/<OOV> (character name)/<OOV> (pose name)/pose/shite/<end>”, the state transitions from the state S1 to the state S2, consuming the word “<head>”, and from the state S2 to the state S16, consuming “<OOV>” (the unknown word that is the character name). Next, either the state transitions from the state S16 to the state S18, consuming the word “@ACTION” indicating a predetermined operation, or the state transitions from the state S16 to the state S17, consuming “<OOV>” (the unknown word that is the pose name), and then from the state S17 to the state S18, where the word “pose” is consumed. The state then transitions from the state S18 to the state S19, consuming the word “shite”, and finally from the state S19 to the state S20, consuming “<end>”.
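The acceptance test described in the preceding paragraphs can be sketched in a few lines. This is a partial reconstruction from the text only, not a transcription of FIG. 14: it contains just the name-registration and greeting paths, the word labels are written in romanized Japanese (watashi, no, namae, wa, desu, kimi, konnichiwa, sayonara) as a choice made for this example, and the names `TRANSITIONS`, `closure`, and `language_score` are hypothetical.

```python
import re

EPS = None  # label used for epsilon transitions

# (state, word) -> next state; a subset of the transitions named in the text.
TRANSITIONS = {
    ("S1", "<head>"): "S2",
    ("S2", "watashi"): "S3",
    ("S3", "no"): "S4",
    ("S4", "namae"): "S5",
    ("S3", EPS): "S5",
    ("S5", "wa"): "S6",
    ("S6", "<OOV>"): "S7",
    ("S7", "desu"): "S19",
    ("S2", "konnichiwa"): "S19",
    ("S2", "sayonara"): "S19",
    ("S2", "kimi"): "S8",
    ("S8", "no"): "S9",
    ("S19", "<end>"): "S20",
}


def closure(states):
    """Add every state reachable through epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        target = TRANSITIONS.get((stack.pop(), EPS))
        if target is not None and target not in seen:
            seen.add(target)
            stack.append(target)
    return seen


def language_score(word_string, start="S1", end="S20"):
    """Return 1 if the automaton accepts the word string, otherwise 0."""
    # Drop the pronunciation attached to <OOV>, e.g. "<OOV>(t/a/r/o:)".
    words = re.sub(r"\([^)]*\)", "", word_string).split("/")
    states = closure({start})
    for word in words:
        moved = {TRANSITIONS[(s, word)] for s in states
                 if (s, word) in TRANSITIONS}
        states = closure(moved)
        if not states:
            return 0
    return 1 if end in states else 0
```

Stripping the pronunciation before matching reflects the replacement step described for the parser: word strings that differ only in the phonemes of the unknown word receive the same language score.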
[0130]
Further, an example of calculating a language score when a tri-gram, one type of statistical language model, is used as the language model will be described with reference to FIG. A statistical language model obtains the generation probability of a word string and uses it as the language score. That is, for example, the language score of “<head>/watashi/no/namae/wa/<OOV>/desu/<end>” (my name is <OOV>) is represented by the generation probability of that word string. This is further expressed as a product of conditional probabilities, as shown in the third to sixth lines. Note that, for example, “P(no | <head> watashi)” is the probability that “no” appears under the condition that the word immediately before “no” is “watashi” and the word immediately before “watashi” is “<head>”.
[0131]
Further, in the tri-gram, the expressions shown in the third to sixth lines in FIG. 15 are approximated by conditional probabilities of three consecutive words as shown in the seventh to ninth lines. These probability values are obtained by referring to a tri-gram database as shown in FIG. This tri-gram database is obtained by analyzing a large amount of texts in advance.
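For reference, the decomposition described above can be written in the standard form below. This is the textbook chain rule and tri-gram approximation for a word string w_1 ... w_N, in notation chosen here; it is not a transcription of FIG. 15.

```latex
\begin{align}
P(w_1 w_2 \cdots w_N)
  &= P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{N} P(w_i \mid w_1 \cdots w_{i-1}) \\
  &\approx P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{N} P(w_i \mid w_{i-2}\, w_{i-1})
\end{align}
```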
[0132]
In the example of FIG. 16, the probabilities P(w3 | w1 w2) of three consecutive words w1, w2, and w3 are shown. For example, if the three words w1, w2, and w3 are “<head>”, “watashi” (I), and “no”, respectively, the probability value is 0.12; if they are “watashi”, “no”, and “namae” (name), the probability value is 0.01; and if they are “<OOV>”, “desu”, and “<end>”, the probability value is 0.87.
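A lookup-and-multiply sketch of the tri-gram score follows. Only the three probability values quoted above come from the source; the word labels are romanized here, the function name `trigram_score` and the `fallback` probability for unseen triples are hypothetical, and the factors for the first two words are omitted for brevity.

```python
import math

# Values quoted in the text for the example of FIG. 16.
TRIGRAM = {
    ("<head>", "watashi", "no"): 0.12,
    ("watashi", "no", "namae"): 0.01,
    ("<OOV>", "desu", "<end>"): 0.87,
}


def trigram_score(words, trigram, fallback=1e-4):
    """Product of P(w3 | w1 w2) over every consecutive triple.

    `fallback` is a hypothetical probability for triples missing from the
    table. Logs are summed to avoid underflow on long word strings.
    """
    log_p = 0.0
    for triple in zip(words, words[1:], words[2:]):
        log_p += math.log(trigram.get(triple, fallback))
    return math.exp(log_p)


score = trigram_score(["<head>", "watashi", "no", "namae"], TRIGRAM)
print(score)  # approximately 0.0012, i.e. 0.12 * 0.01
```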
[0133]
Of course, “P (W)” and “P (w2 | w1)” are similarly obtained in advance.
[0134]
In this way, by performing entry processing on <OOV> in the language model, a language score can be calculated for a word string including <OOV>. Therefore, a symbol <OOV> can be output as a recognition result.
[0135]
Also, when another type of language model is used, a language score can be similarly calculated for a word string including <OOV> by performing entry processing for <OOV>.
[0136]
Further, even when a language model that has no <OOV> entry is used, a language score can be calculated by using a mechanism that maps <OOV> to an appropriate word in the language model. For example, even when a tri-gram database in which “P(<OOV> | I)” does not exist is used, the language score can be calculated by accessing the database with “P(Taro | I)” and regarding the probability described there as the value of “P(<OOV> | I)”.
[0137]
Returning to the description of the voice recognition processing in FIG. In step S30, the matching unit 173 integrates the acoustic score and the language score. In step S31, the matching unit 173 selects a candidate word string having the best score based on a score obtained by integrating both the acoustic score and the language score obtained in step S30, and outputs it as a recognition result.
[0138]
When a finite state automaton is used as the language model, the integration processing in step S30 may instead delete a word string when its language score is 0 and leave it as it is when its language score is not 0.
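The integration and selection in steps S30 and S31 can be sketched as below. The source does not say how the two scores are combined, so the weighted sum, the log-domain scores, and the names `best_word_string` and `lm_weight` are assumptions; the `fsa=True` branch follows the filtering variant described above.

```python
def best_word_string(candidates, lm_weight=1.0, fsa=False):
    """Pick the best candidate from (words, acoustic, language) tuples.

    Assumes log-domain acoustic scores. With `fsa=True` the language score
    is 0 or 1 and only filters candidates; otherwise a weighted sum is
    used. The weighting scheme is an assumption, not taken from the source.
    """
    best, best_score = None, float("-inf")
    for words, acoustic, language in candidates:
        if fsa:
            if language == 0:
                continue          # rejected by the automaton: delete
            total = acoustic      # accepted: keep the word string as is
        else:
            total = acoustic + lm_weight * language
        if total > best_score:
            best, best_score = words, total
    return best
```

With `fsa=True`, a candidate with the better acoustic score still loses if the automaton rejects it.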
[0139]
Next, the word string generation processing executed in step S25 in FIG. 12 will be described with reference to the flowchart in FIG.
[0140]
In step S61, the matching unit 173 calculates, for a certain section of the input voice, both the acoustic score obtained by matching with the known words registered in the dictionary database 182 and the acoustic score of the result obtained by the phoneme typewriter unit 174 (in this case, a partial section of “w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u”). The acoustic score indicates how close, as sounds, a word string that is a candidate for the speech recognition result is to the input speech.
[0141]
The acoustic score obtained by matching a partial section of the input voice with the known words registered in the dictionary database 182 is then compared with the acoustic score obtained by the phoneme typewriter unit 174. However, matching with known words is performed in units of words, whereas matching in the phoneme typewriter unit 174 is performed in units of phonemes, so the scales differ and a direct comparison is difficult (in general, the acoustic score obtained in units of phonemes is larger).
[0142]
Therefore, in order to enable comparison on a common scale, the matching unit 173 corrects the acoustic score of the result obtained by the phoneme typewriter unit 174 in step S62.
[0143]
In step S62, for example, the acoustic score from the phoneme typewriter unit 174 is multiplied by a coefficient, or a constant value or a value proportional to the frame length is subtracted from it. Of course, since this processing is relative, it can instead be applied to the acoustic score obtained as a result of matching with a known word. The details of this process are disclosed, for example, in the document “OOV-Detection in Large Vocabulary System Using Automatically Defined Word-Fragments as Fillers”, EUROSPEECH99, Volume 1, pages 49-52.
[0144]
In step S63, the matching unit 173 compares the two acoustic scores and determines whether the acoustic score of the result recognized by the phoneme typewriter unit 174 is higher (better). If it is higher, the matching unit 173 estimates in step S64 that the section is an unknown word <OOV>.
[0145]
In step S63, when it is determined that the acoustic score of the result recognized by the phoneme typewriter unit 174 is lower than the acoustic score of the result matched with the known word, the matching unit 173 estimates in step S65 that the section is a known word.
[0146]
That is, for example, for the section corresponding to “Taro”, the acoustic score of “t/a/r/o:” output from the phoneme typewriter unit 174 is compared with the acoustic score obtained when matching is performed with a known word. If the acoustic score of “t/a/r/o:” is higher, “<OOV>(t/a/r/o:)” is output as the word corresponding to the speech section; if the acoustic score of the known word is higher, the known word is output as the word corresponding to the speech section.
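The correction in step S62 and the comparison in steps S63 to S65 can be sketched as follows. This is only an illustrative sketch: the names `SectionScores`, `corrected_phoneme_score`, and `label_section`, the parameter names, and the use of log-likelihood-style scores are assumptions made for the example and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SectionScores:
    """Scores for one partial section of the input speech (hypothetical container)."""
    known_word: str            # best-matching word from the dictionary
    known_score: float         # acoustic score of the known-word match
    phonemes: Sequence[str]    # phoneme sequence from the phoneme typewriter
    phoneme_score: float       # raw acoustic score from the phoneme typewriter
    num_frames: int            # length of the section in frames


def corrected_phoneme_score(raw, num_frames, coeff=1.0, offset=0.0, per_frame=0.0):
    """Step S62: multiply by a coefficient, then subtract a constant and a
    value proportional to the frame length (all three values are assumed)."""
    return raw * coeff - offset - per_frame * num_frames


def label_section(section, **correction):
    """Steps S63 to S65: output <OOV>(...) if the corrected phoneme score wins."""
    score = corrected_phoneme_score(
        section.phoneme_score, section.num_frames, **correction
    )
    if score > section.known_score:                         # S63 -> S64
        return "<OOV>(" + "/".join(section.phonemes) + ")"
    return section.known_word                               # S63 -> S65
```

With a small frame penalty the section is labeled as an unknown word, while a larger penalty lets the known word win, which is why the correction values need to be tuned so that the two scores are comparable.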
[0147]
After the processing of step S64 or step S65 is completed, in step S66, the matching unit 173 preferentially generates n word strings (each a concatenation of several word models) estimated to have high acoustic scores, and the process then returns to step S26 in FIG.
[0148]
Through such processing, a word string used for calculating an acoustic score is generated in the processing of step S27 in FIG.
[0149]
Next, the word acquisition processing executed in step S4 in FIG. 11 will be described with reference to the flowchart in FIG. 18.
[0150]
In step S91, the word acquisition unit 124 extracts the feature parameters of the unknown word (<OOV>) from the speech recognition processing unit 121.
[0151]
In step S92, the word acquiring unit 124 determines whether or not the unknown word belongs to the already acquired cluster. If it is determined in step S92 that the unknown word belongs to the already acquired cluster, there is no need to generate a new cluster, and the process proceeds to step S94.
[0152]
If it is determined in step S92 that the unknown word does not belong to the acquired cluster, in step S93, the word acquisition unit 124 generates a new cluster corresponding to the unknown word.
[0153]
If it is determined in step S92 that the unknown word belongs to an already acquired cluster, or after the processing of step S93 ends, in step S94, the word acquisition unit 124 determines the ID of the cluster to which the unknown word belongs by a voice recognition process, and the process proceeds to step S5 in FIG. 11.
[0154]
By such processing, in the subsequent step S5 in FIG. 11, an unknown word to be matched with the template is obtained.
[0155]
In FIG. 18, it is determined whether or not an unknown word belongs to an already-acquired cluster from a feature amount parameter of an unknown word. If the unknown word does not belong to an already-acquired cluster, a new cluster is generated. However, without generating a cluster, for example, an ID may be associated with a phoneme sequence, and whether or not an unknown word has already been obtained may be determined by comparing phoneme sequences.
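The alternative mentioned above, in which an ID is associated with a phoneme sequence instead of a cluster, can be sketched as follows. The class name `UnknownWordRegistry`, the method name `acquire`, and the use of exact equality to compare phoneme sequences are assumptions made for this example; the specification does not state how the sequences are compared.

```python
class UnknownWordRegistry:
    """Assigns an ID to each distinct phoneme sequence (hypothetical helper)."""

    def __init__(self):
        self._ids = {}  # phoneme sequence (as a tuple) -> ID

    def acquire(self, phonemes):
        """Return the ID for an unknown word, creating one on first sight."""
        key = tuple(phonemes)
        if key not in self._ids:                  # corresponds to S92: not yet acquired
            self._ids[key] = len(self._ids) + 1   # corresponds to S93: new entry
        return self._ids[key]                     # corresponds to S94: ID of the word
```

A repeated utterance of the same phoneme sequence returns the ID assigned the first time, so later template matching can refer to the word by its ID alone.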
[0156]
Next, the template matching process executed in step S5 in FIG. 11 will be described with reference to the flowchart in FIG. 19.
[0157]
In step S121, the dialogue control unit 123 of the voice recognition unit 101A determines whether the supplied speech recognition result matches template 1 for registering the unknown word as a pose name, that is, whether the word string of the recognition result means that a pose name is to be registered.
[0158]
Template 1 will be described with reference to FIG. 20. Note that in FIG. 20 and in FIGS. 22, 23, and 24 described later, “/A/” means “if the character string A is included”, and “A|B” means “A or B”. In addition, “.” represents “an arbitrary character”, “A+” represents “one or more repetitions of A”, and “(.)+” represents “an arbitrary character string”.
[0159]
If the word string of the recognition result is, for example, “<head>/this is/<OOV>(g/a/t/t/u:)/pose/is/<end>”, the character string “This is <OOV> pose” generated from this recognition result matches the regular expression “/this is <OOV>(category; pose name)/” of template 1. Therefore, the corresponding operations, namely “register the cluster ID corresponding to <OOV>(category; pose name) as a pose name” and “store it in a table together with the actuator control angles at that time”, are executed.
[0160]
Specifically, for example, when the user raises both arms of the robot 1 and utters “This is a guts (pose)”, the unknown word <guts (pose)> is acquired, and it is detected that the utterance corresponds to template 1.
[0161]
When it is determined in step S121 that the recognition result matches template 1, in step S122, the associative storage unit 122 stores the cluster ID of the word and the category in association with each other, as described with reference to FIG.
[0162]
In step S123, the voice recognition unit 101A supplies the cluster ID to the action determining mechanism unit 103. The action determining mechanism unit 103 also acquires the current states (angles) of the actuators A1 to A14 from the sub-control units 63A to 63D. The action determining mechanism unit 103 then stores the supplied cluster ID and the actuator control angle information in association with each other in the table storage unit 104, as a table such as that shown in FIG. 21, and the process ends.
[0163]
Specifically, for example, when the user raises both arms of the robot 1 and utters “This is a guts (pose)”, the cluster ID associated with <guts (pose)> is stored in the table storage unit 104 in association with the states (angles) of the actuators A1 to A14 at the time when both arms of the robot 1 are raised.
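The registration performed in steps S122 and S123 can be pictured with the following sketch. The class `PoseRegistry`, its two dictionaries (stand-ins for the associative storage unit 122 and the table storage unit 104), and the representation of the control angles as a tuple of 14 numbers are assumptions made for illustration only.

```python
NUM_ACTUATORS = 14  # actuators A1 to A14


class PoseRegistry:
    """Hypothetical stand-in for the associative store and the pose table."""

    def __init__(self):
        self.categories = {}   # cluster ID -> category (cf. associative storage unit 122)
        self.pose_table = {}   # cluster ID -> control angles (cf. table storage unit 104)

    def register_pose(self, cluster_id, current_angles):
        """Store the category (S122) and the current control angles (S123)."""
        if len(current_angles) != NUM_ACTUATORS:
            raise ValueError("expected one angle per actuator A1 to A14")
        self.categories[cluster_id] = "pose name"
        self.pose_table[cluster_id] = tuple(current_angles)
```

Keeping the category and the angles in separate tables keyed by the same cluster ID mirrors the description above, where the word category and the actuator information are stored by different units.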
[0164]
If it is determined in step S121 that the recognition result does not match template 1, in step S124, the dialogue control unit 123 determines whether the supplied speech recognition result matches template 2, which instructs the robot 1 to take the pose associated with the unknown word.
[0165]
Template 2 will be described with reference to FIG. 22. If the word string of the recognition result is, for example, “<head>/<OOV>(b/a/n/z/a/i)/then/<end>”, the character string generated from this recognition result matches the regular expression “/<OOV>(category; pose name) ++/” of template 2. Therefore, the corresponding operation “execute the action corresponding to <OOV>” is executed with reference to the table storage unit 104, based on the cluster ID corresponding to <OOV>(category; pose name).
[0166]
If it is determined in step S124 that the recognition result matches template 2, in step S125, the voice recognition unit 101A supplies the cluster ID of the recognized unknown word to the action determining mechanism unit 103. The action determining mechanism unit 103 refers to the table storage unit 104, extracts the actuator control angle information from the table described with reference to FIG. 21 based on the supplied cluster ID, and supplies it to the posture transition mechanism unit 105.
[0167]
In step S126, based on the actuator control angle information, the posture transition mechanism unit 105 causes the corresponding ones of the sub-control units 63A to 63D to control the necessary ones of the actuators A1 to A14. Then, the process ends.
[0168]
Specifically, for example, when the user utters “Banzai” to the robot 1, the unknown word <Banzai> is acquired by the word acquisition unit 124, and the dialogue control unit 123 detects that the utterance corresponds to template 2. Then, the action determining mechanism unit 103 refers to the table storage unit 104, acquires the actuator control information corresponding to <Banzai>, and the control information is supplied to the corresponding ones of the sub-control units 63A to 63D. Thus, the robot 1 can be made to take the pose corresponding to Banzai.
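The lookup in step S125 can be sketched as a simple table read. The function name `pose_command`, the actuator labels built as strings, and the dictionary form of the table are assumptions made for this example; how the angles are actually passed to the sub-control units 63A to 63D is not modeled here.

```python
ACTUATORS = ["A%d" % i for i in range(1, 15)]  # "A1" .. "A14"


def pose_command(pose_table, cluster_id):
    """Return a mapping from actuator label to target angle for a registered
    pose, or None if no pose is registered under this cluster ID."""
    angles = pose_table.get(cluster_id)
    if angles is None:
        return None
    return dict(zip(ACTUATORS, angles))
```

Returning None for an unregistered cluster ID leaves it to the caller to decide what to do when the unknown word has no pose associated with it.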
[0169]
If it is determined in step S124 that the recognition result does not match template 2, in step S127, the dialogue control unit 123 determines whether the supplied speech recognition result matches template 3, which instructs the robot 1 to store a character name or a user name.
[0170]
Template 3 will be described with reference to FIG. 23. For example, if the recognition result is the word string “<head>/you/no/name/was/<OOV>(t/a/r/o:)/dayo/<end>”, the character string “Your name is <OOV>” generated from this recognition result matches the regular expression “/Kimi(.)+<OOV>/”, so the operation “register the cluster ID corresponding to <OOV> as a character name” is executed. Also, if the recognition result is the word string “<head>/me//name/is/<OOV>(t/a/r/o:)/is/<end>”, the character string “My name is <OOV>” generated from this recognition result matches the regular expression “/(I|I)(.)+<OOV>/”, so the operation “register the cluster ID corresponding to <OOV> as a user name” is executed.
[0171]
If it is determined in step S127 that the recognition result matches template 3, in step S128, the dialogue control unit 123 stores, in the associative storage unit 122, the cluster ID of the word in the speech recognition result and the category (user name or character name) in association with each other, and the process ends.
[0172]
In some cases, depending on how the robot 1 is used, there is only one type of word to be registered (for example, only “user name”). In that case, the template and the associative storage unit 122 can be simplified. For example, the content of the template may be set as “if the recognition result includes <OOV>, store the ID”, and the associative storage unit 122 may store only the cluster ID.
[0173]
The dialogue control unit 123 reflects the information registered in the associative storage unit 122 in subsequent dialogue determination processing. For example, when the robot 1 needs to perform a process such as “determine whether the character name is included in the user's utterance and, if so, execute a corresponding process” or “the robot 1 speaks the user's name”, the dialogue control unit 123 can refer to the information recorded in the associative storage unit 122 and obtain the word corresponding to the character name (the entry whose category name is “character name”) or the word corresponding to the user name (the entry whose category name is “user name”).
[0174]
If it is determined in step S127 that the recognition result does not match template 3, in step S129, the dialogue control unit 123 executes a predetermined response process corresponding to the input voice. That is, in this case, the unknown word is not registered; instead, a predetermined process corresponding to the input voice from the user is executed, such as performing a walking operation in response to the command “walk”, and the process ends.
[0175]
In this way, by comparing the acquired voice data with the regular expressions, it is determined whether or not the voice data matches a predetermined template. When it is determined that the voice data matches any one of the templates, the operation of the robot 1 is controlled so as to execute the action corresponding to that template.
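The order of the checks in steps S121, S124, S127, and S129 can be sketched with ordinary regular expressions. The English patterns below are simplified stand-ins invented for this example (the actual templates are those of FIGS. 20, 22, and 23), and the function name `match_template`, the returned action labels, and the `categories` dictionary are likewise assumptions.

```python
import re

# Hypothetical English stand-ins for templates 1 to 3.
TEMPLATE_1 = re.compile(r"this is <OOV>")        # register a pose name
TEMPLATE_2 = re.compile(r"^<OOV>")               # command a registered pose
TEMPLATE_3_CHARACTER = re.compile(r"your.+<OOV>")  # register a character name
TEMPLATE_3_USER = re.compile(r"my.+<OOV>")         # register a user name


def match_template(text, cluster_id, categories):
    """Return an action label for a recognition result in which the unknown
    word has been replaced by the token <OOV>."""
    if TEMPLATE_1.search(text):                                   # S121
        return "register_pose"
    if TEMPLATE_2.search(text) and categories.get(cluster_id) == "pose name":  # S124
        return "execute_pose"
    if TEMPLATE_3_CHARACTER.search(text):                         # S127
        return "register_character_name"
    if TEMPLATE_3_USER.search(text):                              # S127
        return "register_user_name"
    return "default_response"                                     # S129
```

The category check on template 2 reflects the “(category; pose name)” condition: an unknown word triggers a pose only if it was previously registered as a pose name.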
[0176]
Note that the contents of templates 1 to 3 described with reference to FIGS. 20, 22, and 23 are not limited to those described above. For example, the regular expression of template 2 may be changed to “/<OOV>(category; pose name) + pose/” or to “/<OOV>(category; pose name)(.)+/”.
[0177]
When a grammar is used as the language model, a description equivalent to a phoneme typewriter can be incorporated into the grammar. An example of such a grammar will be described with reference to FIG. 24. In the grammar shown in FIG. 24, all the phonemes on the first line are joined by “|”, meaning “or”, so the variable “$PHONEME” stands for any one of the phoneme symbols. The variable “$OOV” indicates that the variable “$PHONEME” is repeated 0 or more times. That is, it means “any phoneme symbols connected 0 or more times” and corresponds to a phoneme typewriter. Therefore, the variable “$OOV” between “ha” and “is” on the third line can accept an arbitrary pronunciation.
[0178]
In the recognition result obtained when the grammar shown in FIG. 24 is used, the portion corresponding to the variable “$OOV” is output as a plurality of symbols. For example, the recognition result of “my name is Taro” is “<head>/me/name/is/t/a/r/o:/is/<end>”. When this result is converted into “<head>/me/name/is/<OOV>(t/a/r/o:)/is”, the processing from step S3 in FIG. 11 onward can be performed in the same manner as when the phoneme typewriter is used.
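The conversion of a run of phoneme symbols into a single <OOV> token can be sketched as follows. The function name `wrap_oov` and the assumption that phoneme symbols can be told apart from word tokens by membership in a known phoneme set are made for this example only; the specification does not describe how the conversion is implemented.

```python
def wrap_oov(tokens, phonemes):
    """Group each consecutive run of phoneme symbols into one <OOV>(...) token.

    tokens:   recognition result as a list of symbols
    phonemes: set of phoneme symbols (assumed not to collide with word tokens)
    """
    out, run = [], []
    for tok in tokens:
        if tok in phonemes:
            run.append(tok)        # still inside the portion matched by $OOV
            continue
        if run:                    # the phoneme run has just ended
            out.append("<OOV>(" + "/".join(run) + ")")
            run = []
        out.append(tok)
    if run:                        # a run that reaches the end of the input
        out.append("<OOV>(" + "/".join(run) + ")")
    return out
```

For the example above, the eight symbols between the second and last tokens collapse so that the four phonemes become one <OOV> token, which the later template matching can treat like the output of the phoneme typewriter.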
[0179]
In the above description, a category is registered as information relating to an unknown word, but other information may be registered.
[0180]
Further, the robot 1 can not only execute processing based on a user's command, but can also operate autonomously in a state where no operation is performed by the user. The operation may be set in advance; alternatively, for example, an operation stored in association with an unknown word by the processing described above may be executed at random, or the robot 1 may recognize its own pose and automatically utter speech corresponding to that pose.
[0181]
The motion recognition utterance process executed by the robot 1 will be described with reference to the flowchart in FIG. 25.
[0182]
In step S151, the voice recognition processing unit 121 of the voice recognition unit 101A determines whether a voice has been input from the microphone 82. If it is determined in step S151 that no voice input has been received, the process of step S151 is repeated until it is determined that a voice input has been received.
[0183]
In step S152, the voice recognition processing described with reference to FIG. 12 is performed.
[0184]
In step S153, the dialogue control unit 123 of the voice recognition unit 101A determines whether or not the supplied speech recognition result is “What is the pose?”.
[0185]
If it is determined in step S153 that the voice “What is the pose?” has not been recognized, in step S154, the action determining mechanism unit 103 determines whether an internal timer (not shown) has notified it that a predetermined time has elapsed without a command input from the user. If it is determined in step S154 that the timer has not given notification that the predetermined time has elapsed, the process returns to step S151, and the subsequent processing is repeated.
[0186]
If it is determined in step S153 that the voice “What is the pose?” has been recognized, or if it is determined in step S154 that the timer has given notification that the predetermined time has elapsed, then in step S155, the action determining mechanism unit 103 recognizes the control angles of the actuators A1 to A14 based on the signals supplied from the sub-control units 63A to 63D.
[0187]
In step S156, the action determining mechanism unit 103 refers to the table, stored in the table storage unit 104, that indicates the correspondence between pose names and the control angles of the actuators A1 to A14, and searches for a pose name corresponding to the control angles of the actuators A1 to A14 recognized in step S155.
[0188]
In step S157, the action determining mechanism unit 103 determines whether or not the table stored in the table storage unit 104 has a pose name corresponding to the control angles of the actuators A1 to A14 recognized in step S155. If it is determined in step S157 that there is no corresponding pose name, the process returns to step S151, and the subsequent processes are repeated.
[0189]
When it is determined in step S157 that there is a corresponding pose name, in step S158, the action determining mechanism unit 103 refers to the table storage unit 104, extracts the corresponding unknown word whose category is a pose name, and controls the voice synthesizer 106 to synthesize a voice saying, for example, “I'm in a <OOV> pose” and output it from the speaker 72, that is, to make the robot speak. The process then returns to step S151, and the subsequent processing is repeated.
[0190]
With the above-described processing, the robot 1 not only responds to the user's utterances but also recognizes its own state, and can automatically speak using the pose names stored in the table in association with the control angles of the actuators A1 to A14.
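Steps S155 to S158 can be sketched as a table search followed by sentence construction. The function names `find_pose` and `pose_utterance`, the `words` dictionary that maps a cluster ID to a speakable form of the unknown word, and in particular the angle tolerance are assumptions made for this example; the specification only states that a pose name corresponding to the recognized control angles is searched for.

```python
def find_pose(pose_table, angles, tolerance=2.0):
    """Return the cluster ID whose stored angles all lie within `tolerance`
    of the current angles, or None if no stored pose corresponds (S156, S157)."""
    for cluster_id, stored in pose_table.items():
        if len(stored) == len(angles) and all(
            abs(s - a) <= tolerance for s, a in zip(stored, angles)
        ):
            return cluster_id
    return None


def pose_utterance(pose_table, words, angles, tolerance=2.0):
    """Build the sentence to be synthesized (S158), or None if nothing matches."""
    cluster_id = find_pose(pose_table, angles, tolerance)
    if cluster_id is None:
        return None
    return "I'm in a " + words[cluster_id] + " pose"
```

Some tolerance is likely to be needed in practice because measured angles rarely equal the stored values exactly, but the appropriate value would depend on the actuators and is not given in the specification.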
[0191]
The processing described above concerns the case where the control angles of the actuators A1 to A14 in a stationary pose are stored in association with the acquired unknown word. Needless to say, the present invention can also be applied to the case where control information of the actuators A1 to A14 for performing a predetermined operation (for example, the control angle of each actuator expressed as a time series) is stored in association with an acquired unknown word.
[0192]
Further, the above description concerns control of the operation (driving and utterance) of the robot 1 having actually drivable parts corresponding to limbs and the like; however, the present invention is also applicable, for example, to control of a (virtual) robot 1 displayed on a display.
[0193]
FIG. 26 illustrates a configuration example of a personal computer 201 that executes the above-described processing. The personal computer 201 has a built-in CPU (Central Processing Unit) 211. An input/output interface 215 is connected to the CPU 211 via a bus 214. A ROM (Read Only Memory) 212 and a RAM (Random Access Memory) 213 are also connected to the bus 214.
[0194]
Connected to the input/output interface 215 are an input unit 217 configured by input devices operated by the user, such as a mouse, a keyboard, a microphone, and an AD converter, and an output unit 216 configured by output devices such as a display, a speaker, and a DA converter. Further, the input/output interface 215 is connected to a storage unit 218 including a hard disk drive for storing programs and various data, and to a communication unit 219 for communicating data via a network represented by the Internet.
[0195]
A drive 220 that reads and writes data from and to a recording medium such as a magnetic disk 231, an optical disk 232, a magneto-optical disk 233, or a semiconductor memory 234 is connected to the input / output interface 215 as necessary.
[0196]
For example, the personal computer 201 can collect the voice uttered by the user with the microphone of the input unit 217 and execute the speech recognition process by the processing of the CPU 211 in the same manner as described above. Further, under the control of the CPU 211, for example, a character having the appearance of the robot 1 can be displayed on the display of the output unit 216. Based on the result of recognizing the sound collected by the microphone of the input unit 217, or based on an operation performed by the user with the mouse of the input unit 217, the CPU 211 controls the movement on the display of the character displayed on the display of the output unit 216 (that is, it controls the display of the output unit 216), generates voice data in response to the user's call, and outputs it from the speaker of the output unit 216.
[0197]
Then, when the user utters “This is a <happy> pose”, the CPU 211 recognizes the unknown word “happy”, associates a unique ID with it as an unknown word of the pose name category, and can record the movement of the character displayed on the display of the output unit 216 (that is, the image data displayed on the display), for example, in the RAM 213 together with the corresponding ID. Then, when the user instructs the pose, the display of the output unit 216 is changed to the display corresponding to the unknown word “happy” based on the information recorded in the RAM 213. In addition, by the processing of the CPU 211, the display of the output unit 216 can be automatically changed to the display corresponding to the unknown word <happy>, and a voice saying “This is a <happy> pose” can be output from the speaker of the output unit 216.
[0198]
Further, the series of processes described above can also be executed by software. The program constituting the software is installed, for example from a recording medium, onto a computer built into dedicated hardware, or onto a general-purpose personal computer capable of executing various functions by installing various programs.
[0199]
As shown in FIG. 26, this recording medium is constituted by a package medium on which the program is recorded and which is distributed separately from the computer in order to provide the program to the user, such as the magnetic disk 231 (including a flexible disk), the optical disk 232 (including a CD-ROM (Compact Disk-Read Only Memory) and a DVD (Digital Versatile Disk)), the magneto-optical disk 233 (including an MD (Mini-Disk) (trademark)), or the semiconductor memory 234.
[0200]
For example, the robot control processing program that causes the personal computer 201 to operate as a robot control device to which the present invention is applied is supplied to the personal computer 201 in a state of being stored on the magnetic disk 231 (including a floppy disk), the optical disk 232 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), the magneto-optical disk 233 (including an MD (Mini Disc)), or the semiconductor memory 234. The program is read out by the drive 220 and installed on the hard disk drive built into the storage unit 218. The voice processing program installed in the storage unit 218 is loaded from the storage unit 218 into the RAM 213 and executed in accordance with a command from the CPU 211 corresponding to a command from the user input to the input unit 217.
[0201]
In this specification, the steps describing the program recorded on the recording medium include not only processes performed in chronological order in the order described, but also processes that are not necessarily performed in chronological order and are executed individually.
[0202]
[Effect of the invention]
Thus, according to the present invention, speech can be recognized. In particular, the recognized voice and information for controlling the operation of the robot can be registered in association with each other.
Further, according to another aspect of the present invention, in addition to detecting the state of the robot, a voice including a word corresponding to the state of the robot can be synthesized based on the detection result of the state of the robot.
[Brief description of the drawings]
FIG. 1 is a perspective view showing an external configuration of a robot to which the present invention is applied.
FIG. 2 is a rear perspective view showing the external configuration of the robot shown in FIG. 1;
FIG. 3 is a schematic diagram for explaining the robot of FIG. 1;
FIG. 4 is a block diagram showing an internal configuration of the robot shown in FIG. 1;
FIG. 5 is a block diagram for mainly explaining a portion related to control of the robot in FIG. 1;
FIG. 6 is a block diagram illustrating a configuration of a main control unit in FIG. 5;
FIG. 7 is a block diagram illustrating a configuration of a voice recognition unit in FIG. 6;
FIG. 8 is a diagram illustrating a state of a cluster.
FIG. 9 is a diagram for explaining word registration.
FIG. 10 is a block diagram illustrating a configuration of a voice recognition processing unit in FIG. 7;
FIG. 11 is a flowchart illustrating processing executed by a robot to which the present invention is applied.
FIG. 12 is a flowchart illustrating a speech recognition process.
FIG. 13 is a diagram illustrating an example of a grammar used in a language model database.
FIG. 14 is a diagram illustrating an example of a language model using a finite state automaton.
FIG. 15 is a diagram illustrating an example of calculating a language score using tri-gram.
FIG. 16 is a diagram illustrating an example of a tri-gram database.
FIG. 17 is a flowchart illustrating a word string generation process.
FIG. 18 is a flowchart illustrating a word acquisition process.
FIG. 19 is a flowchart illustrating a template matching process.
FIG. 20 is a diagram illustrating a template.
FIG. 21 is a diagram illustrating an actuator control angle registered in association with a cluster ID.
FIG. 22 is a diagram illustrating a template.
FIG. 23 is a diagram illustrating a template.
FIG. 24 is a diagram showing an example of a grammar incorporating a phoneme typewriter.
FIG. 25 is a flowchart illustrating a motion recognition utterance process.
FIG. 26 is a block diagram illustrating a configuration of a computer to which the present invention has been applied.
[Explanation of symbols]
1 robot, 61 main control unit, 63 sub-control unit, 72 speaker, 82 microphone, 101 sensor input processing unit, 101A voice recognition unit, 103 action determining mechanism unit, 104 table storage unit, 105 posture transition mechanism unit, 106 voice synthesizer, 121 speech recognition processing unit, 122 associative storage unit, 123 dialogue control unit, 124 word acquisition unit, 172 feature extraction unit, 173 matching unit, 174 phoneme typewriter unit, 175 control unit, 181 acoustic model database, 182 dictionary database, 183 language model database

Claims

In a robot controller for controlling the operation of the robot,
A recognition means for recognizing a continuous input voice;
Acquisition means for acquiring a word corresponding to the unknown word, when the recognition result recognized by the recognition means is determined to include an unknown word,
A registration unit for registering the word acquired by the acquisition unit in association with information for controlling the operation of the robot.

The robot control device according to claim 1, further comprising a pattern determination means for determining whether the recognition result recognized by the recognition means matches a specific pattern,
wherein, when the pattern determination unit determines that the recognition result matches the specific pattern, the registration unit registers the word in association with information for controlling the operation of the robot.

The robot control device according to claim 2, further comprising a detecting means for detecting the state of the robot,
wherein the detecting means detects the state of the robot at the time when the recognition result is determined to match the specific pattern, and
the registration unit registers the word in association with information for controlling the operation of the robot so that the robot assumes the state detected by the detection unit.

The robot control device according to claim 1, further comprising control means for controlling the driving of the robot, and
pattern determining means for determining whether the recognition result recognized by the recognition means matches a specific pattern,
wherein, when the pattern determination unit determines that the recognition result matches the specific pattern, the control unit controls the driving of the robot based on the information for controlling the operation of the robot registered in association with the word by the registration unit.

A storage unit that stores the words acquired by the acquisition unit in a plurality of categories;
The robot control device according to claim 1, wherein the registration unit registers the word stored in a predetermined category in the storage unit in association with information for controlling an operation of the robot.

In a robot control method of a robot control device that controls the operation of a robot,
A recognition step of recognizing a continuous input voice;
A determination step of determining whether an unknown word is included in the recognition result recognized by the processing of the recognition step,
An acquiring step of acquiring a word corresponding to the unknown word when the recognition result includes the unknown word, by the processing of the determining step;
A registration step of registering the word acquired by the processing of the acquisition step in association with information for controlling the operation of the robot.

A program that causes a computer to execute processing for controlling the operation of the robot,
A recognition step of recognizing a continuous input voice;
A determination step of determining whether an unknown word is included in the recognition result recognized by the processing of the recognition step,
An acquiring step of acquiring a word corresponding to the unknown word when the recognition result includes the unknown word, by the processing of the determining step;
A registration step of registering the word acquired by the processing of the acquisition step in association with information for controlling the operation of the robot. A recording medium on which a computer-readable program including these steps is recorded.

A program that causes a computer to execute processing for controlling the operation of the robot,
A recognition step of recognizing a continuous input voice;
A determination step of determining whether an unknown word is included in the recognition result recognized by the processing of the recognition step,
An acquiring step of acquiring a word corresponding to the unknown word when the recognition result includes the unknown word, by the processing of the determining step;
A registration step of registering the word acquired by the processing of the acquisition step in association with information for controlling the operation of the robot.

In a robot controller for controlling the operation of the robot,
Control means for controlling the driving of the robot,
Registration means for registering information indicating the state of the robot in association with a corresponding word,
Detecting means for detecting a state of the robot whose driving is controlled by the control means;
Voice synthesis means for synthesizing voice;
Output means for outputting the voice synthesized by the voice synthesis means,
wherein, when the state of the robot detected by the detecting unit matches the information indicating the state of the robot registered by the registering unit, the voice synthesizing unit synthesizes a voice including the word associated with that information indicating the state of the robot.

Further comprising input means for receiving a command from a user,
The robot control device according to claim 9, wherein the detection unit detects a state of the robot when no operation input is received by the input unit for a predetermined time.

A recognition means for recognizing a continuous input voice;
Pattern determining means for determining whether or not the recognition result by the recognition means matches a specific pattern,
The robot control device according to claim 9, wherein when the determination unit determines that the recognition result matches the specific pattern, the detection unit detects a state of the robot.

The recognition result recognized by the recognition means, when it is determined that an unknown word is included, further comprising an obtaining means for obtaining a word corresponding to the unknown word,
The robot control device according to claim 9, wherein the registration unit associates and stores a word corresponding to the unknown word acquired by the acquisition unit and information indicating a state of the robot.

Using registration information in which information indicating a state of the robot and a corresponding word are registered in association with each other, a robot control method of a robot control device that controls an operation of the robot,
A detecting step of detecting a state of the robot,
A determining step of determining whether the state of the robot detected by the processing of the detecting step matches the information indicating the state of the robot registered in the registration information;
A voice synthesizing step of synthesizing, when it is determined that the state of the robot matches the information indicating the state of the robot registered in the registration information, a voice including the word associated with that information.

A program that causes a computer to execute a process of controlling the operation of the robot by using registration information in which information indicating a state of the robot and a corresponding word are registered in association with each other,
A detecting step of detecting a state of the robot,
A determining step of determining whether the state of the robot detected by the processing of the detecting step matches the information indicating the state of the robot registered in the registration information;
A voice synthesizing step of synthesizing, when it is determined that the state of the robot matches the information indicating the state of the robot registered in the registration information, a voice including the word associated with that information. A recording medium on which a computer-readable program including these steps is recorded.

A program that causes a computer to execute a process of controlling the operation of the robot by using registration information in which information indicating a state of the robot and a corresponding word are registered in association with each other,
A detecting step of detecting a state of the robot,
A determining step of determining whether the state of the robot detected by the processing of the detecting step matches the information indicating the state of the robot registered in the registration information;
A voice synthesizing step of synthesizing, when it is determined that the state of the robot matches the information indicating the state of the robot registered in the registration information, a voice including the word associated with that information.