JP4587009B2

JP4587009B2 - Robot control apparatus, robot control method, and recording medium

Info

Publication number: JP4587009B2
Application number: JP2000310492A
Authority: JP
Inventors: 和夫石井; 順広井; 渡小野木; 崇豊田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-10-11
Filing date: 2000-10-11
Publication date: 2010-11-24
Anticipated expiration: 2020-10-11
Also published as: JP2002116794A

Description

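[0001]
[Technical field of the invention]
The present invention relates to a robot control apparatus, a robot control method, and a recording medium, and in particular to a robot control apparatus, robot control method, and recording medium suitable for use in a robot that acts on the basis of speech recognition results produced by a speech recognition device.
[0002]
[Prior art]
In recent years, robots (in this specification, including stuffed-toy robots) that recognize speech uttered by a user and, based on the recognition result, perform actions such as making a gesture or outputting a synthesized sound have been commercialized, for example as toys.
[0003]
[Problems to be solved by the invention]
Such robots accept speech input at all times. As a result, actuator noise generated while the robot operates, pulse-like landing noise generated when the robot walks, and touch noise generated when the user touches the robot have sometimes been falsely detected as speech uttered by the user.
[0004]
The present invention has been made in view of this situation. By performing speech recognition based on the state and actions of the robot, it distinguishes speech uttered by the user from noise generated by the robot's own motion and the like, thereby preventing misrecognition.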
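[Means for solving the problems]
The robot control apparatus of the present invention comprises: speech input means for receiving input of speech data; state recognition information generation means for generating state recognition information indicating the result of recognizing the state of the robot; posture transition information generation means for generating posture transition information indicating a transition of the robot's posture; storage means for storing data corresponding to a noise component generated while the robot is walking; and recognition means for determining, on the basis of the state recognition information generated by the state recognition information generation means or the posture transition information generated by the posture transition information generation means, the section of the speech data input by the speech input means in which speech recognition is to be performed, and for recognizing the speech data of the determined section. When the recognition means determines from the posture transition information that the robot is walking, it recognizes the speech data after filtering it using the data corresponding to the noise component stored in the storage means.
[0006]
The posture transition information can include information indicating which of the robot's plurality of drive units performs a driving operation, and the recognition means can be made not to recognize the speech data when the drive unit being driven is located close to the speech input means.
[0007]
When the robot is walking, the recognition means can be made to recognize the speech data excluding the frames that contain a noise component generated by the walking motion.
[0009]
The posture transition information can include information indicating which of the robot's plurality of drive units performs a driving operation, and the recognition means can be made to perform speech recognition that takes into account, on the basis of the posture transition information, the noise generated by driving that drive unit.
[0010]
The state recognition information can include information indicating whether the robot is being touched by the user, and the recognition means can be made to perform speech recognition that takes into account, on the basis of the state recognition information, the noise generated because the user is touching the robot.
[0011]
The apparatus can further comprise storage means for storing predetermined thresholds corresponding to noise generated by the state of the robot or by a posture transition, and estimation means for estimating the environmental sound while the recognition means is not performing speech recognition. The recognition means can then be made to determine the start of the section in which speech recognition is performed, on the basis of the state recognition information generated by the state recognition information generation means or the posture transition information generated by the posture transition information generation means, using the threshold stored in the storage means and the environmental sound estimated by the estimation means.
[0012]
The apparatus can further comprise storage means for storing predetermined thresholds corresponding to noise generated by the state of the robot or by a posture transition. The recognition means can then be made to determine the start of the section in which speech recognition is performed, on the basis of the state recognition information generated by the state recognition information generation means or the posture transition information generated by the posture transition information generation means, using the threshold stored in the storage means.
[0013]
The apparatus can further comprise setting means for setting, on the basis of the speech data input by the speech input means, a threshold corresponding to noise generated by the state of the robot or by a posture transition. The recognition means can then be made to determine the start of the section in which speech recognition is performed in the speech data input by the speech input means, using the threshold set by the setting means.
[0014]
The robot control method of the present invention includes: a speech input step of receiving input of speech data; a state recognition information generation step of generating state recognition information indicating the result of recognizing the state of the robot; a posture transition information generation step of generating posture transition information indicating a transition of the robot's posture; and a recognition step of determining, on the basis of the state recognition information generated in the state recognition information generation step or the posture transition information generated in the posture transition information generation step, the section of the speech data input in the speech input step in which speech recognition is to be performed, and recognizing the speech data of the determined section. When the recognition step determines from the posture transition information that the robot is walking, the speech data is recognized after being filtered using data, stored in storage means, that corresponds to a noise component generated while the robot is walking.
[0015]
The program recorded on the recording medium of the present invention includes: a speech input step of receiving input of speech data; a state recognition information generation step of generating state recognition information indicating the result of recognizing the state of the robot; a posture transition information generation step of generating posture transition information indicating a transition of the robot's posture; and a recognition step of determining, on the basis of the state recognition information generated in the state recognition information generation step or the posture transition information generated in the posture transition information generation step, the section of the speech data input in the speech input step in which speech recognition is to be performed, and recognizing the speech data of the determined section. When the recognition step determines from the posture transition information that the robot is walking, the speech data is recognized after being filtered using data, stored in storage means, that corresponds to a noise component generated while the robot is walking.
[0016]
In the robot control apparatus, the robot control method, and the program recorded on the recording medium of the present invention, speech data is input; state recognition information indicating the result of recognizing the state of the robot is generated; posture transition information indicating a transition of the robot's posture is generated; and data corresponding to a noise component generated while the robot is walking is stored. On the basis of the generated state recognition information or the generated posture transition information, the section of the input speech data in which speech recognition is to be performed is determined, and the speech data of that section is recognized. When it is determined from the posture transition information that the robot is walking, the speech data is recognized after being filtered using the stored data corresponding to the noise component.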
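[0017]
[Embodiments of the invention]
Embodiments of the present invention are described below with reference to the drawings.
[0018]
FIG. 1 shows an example of the external configuration of an embodiment of a robot to which the present invention is applied, and FIG. 2 shows an example of its electrical configuration.
[0019]
In this embodiment, the robot has the shape of a four-legged animal such as a dog. Leg units 3A, 3B, 3C, and 3D are connected to the front and rear, left and right, of a body unit 2, and a head unit 4 and a tail unit 5 are connected to the front end and the rear end of the body unit 2, respectively.
[0020]
The tail unit 5 extends from a base portion 5B provided on the upper surface of the body unit 2 so that it can bend or swing with two degrees of freedom.
[0021]
The body unit 2 houses a controller 10 that controls the entire robot, a battery 11 that is the robot's power source, an internal sensor unit 14 consisting of a battery sensor 12 and a heat sensor 13, and other components.
[0022]
The head unit 4 is provided, at predetermined positions, with a microphone 15 corresponding to the "ears", a CCD (Charge Coupled Device) camera 16 corresponding to the "eyes", a touch sensor 17 corresponding to the sense of touch, a speaker 18 corresponding to the "mouth", and so on. A lower jaw portion 4A, corresponding to the lower jaw of the mouth, is attached to the head unit 4 so as to be movable with one degree of freedom; moving the lower jaw portion 4A opens and closes the robot's mouth.
[0023]
As shown in FIG. 2, actuators 3AA₁ to 3AA_K, 3BA₁ to 3BA_K, 3CA₁ to 3CA_K, 3DA₁ to 3DA_K, 4A₁ to 4A_L, 5A₁, and 5A₂ are provided at the joints of the leg units 3A to 3D, at the connections between each of the leg units 3A to 3D and the body unit 2, at the connection between the head unit 4 and the body unit 2, at the connection between the head unit 4 and the lower jaw portion 4A, and at the connection between the tail unit 5 and the body unit 2, respectively.
[0024]
In addition, sensors 3A₁ to 3D₁ are provided on the ground-contact portions (the parts corresponding to the soles of the feet) of the leg units 3A to 3D. Each sensor detects whether its leg unit is in contact with the ground (for example, whether it is touching the floor) and outputs the result to the controller 10.
[0025]
The microphone 15 in the head unit 4 picks up surrounding sound, including utterances from the user, and sends the resulting audio signal to the controller 10. The CCD camera 16 captures images of the surroundings and sends the resulting image signal to the controller 10.
[0026]
The touch sensor 17 is provided, for example, on the upper part of the head unit 4. It detects the pressure applied by physical actions from the user such as "stroking" or "hitting" and sends the detection result to the controller 10 as a pressure detection signal.
[0027]
The battery sensor 12 in the body unit 2 detects the remaining charge of the battery 11 and sends the detection result to the controller 10 as a remaining battery level detection signal. The heat sensor 13 detects the heat inside the robot and sends the detection result to the controller 10 as a heat detection signal.
[0028]
The controller 10 incorporates a CPU (Central Processing Unit) 10A, a memory 10B, and the like, and performs various kinds of processing by having the CPU 10A execute a control program stored in the memory 10B.
[0029]
That is, based on the audio signal, image signal, pressure detection signal, remaining battery level detection signal, and heat detection signal supplied from the microphone 15, the CCD camera 16, the touch sensor 17, the sensors 3A₁ to 3D₁, the battery sensor 12, and the heat sensor 13, the controller 10 determines the surrounding situation and whether there is a command from the user, an action by the user toward the robot, and so on.
[0030]
Further, the controller 10 determines the next action on the basis of these determinations and, according to that decision, drives the necessary ones among the actuators 3AA₁ to 3AA_K, 3BA₁ to 3BA_K, 3CA₁ to 3CA_K, 3DA₁ to 3DA_K, 4A₁ to 4A_L, 5A₁, and 5A₂. This makes the head unit 4 swing up, down, left, and right and opens and closes the lower jaw portion 4A. It also moves the tail unit 5 and drives the leg units 3A to 3D to make the robot walk, among other actions.
[0031]
The controller 10 also generates synthesized sound as needed and supplies it to the speaker 18 for output, and turns on, turns off, or blinks LEDs (Light Emitting Diodes), not shown, provided at the positions of the robot's "eyes".
[0032]
In this way, the robot acts autonomously on the basis of the surrounding situation and the like.
[0033]
Next, FIG. 3 shows an example of the functional configuration of the controller 10 in FIG. 2. The functional configuration shown in FIG. 3 is realized by the CPU 10A executing the control program stored in the memory 10B.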
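[0034]
The controller 10 comprises: a sensor input processing unit 31 that recognizes specific external states; a model storage unit 32 that accumulates the recognition results of the sensor input processing unit 31 and expresses the states of emotion, instinct, and growth; an action determination mechanism unit 33 that determines the next action on the basis of the recognition results of the sensor input processing unit 31 and the like; a posture transition mechanism unit 34 that actually makes the robot act on the basis of the decision of the action determination mechanism unit 33; a control mechanism unit 35 that drives and controls the actuators 3AA₁ to 5A₂; a speech synthesis unit 36 that generates synthesized sound; and an output control unit 37 that controls the output of the synthesized sound generated by the speech synthesis unit 36.
[0035]
The sensor input processing unit 31 recognizes specific external states, specific actions by the user toward the robot, instructions from the user, and so on, based on the audio signal, image signal, pressure detection signal, and the like supplied from the microphone 15, the CCD camera 16, the touch sensor 17, or the sensors 3A₁ to 3D₁, and notifies the model storage unit 32 and the action determination mechanism unit 33 of state recognition information representing the recognition result.
[0036]
Specifically, the sensor input processing unit 31 has a speech recognition unit 31A, which performs speech recognition on the audio signal supplied from the microphone 15. The speech recognition unit 31A then notifies the model storage unit 32 and the action determination mechanism unit 33 of the recognition result, for example a command such as "walk", "lie down", or "chase the ball", as state recognition information.
[0037]
The speech recognition unit 31A executes its speech recognition processing while monitoring the state of the robot, based on the state recognition information input from the pressure processing unit 31C and the posture transition information of the robot input from the posture transition mechanism unit 34.
[0038]
The sensor input processing unit 31 also has an image recognition unit 31B, which performs image recognition processing using the image signal supplied from the CCD camera 16. When the image recognition unit 31B detects, for example, "a red, round object" or "a plane perpendicular to the ground and at least a predetermined height", it notifies the model storage unit 32 and the action determination mechanism unit 33 of an image recognition result such as "there is a ball" or "there is a wall" as state recognition information.
[0039]
Further, the sensor input processing unit 31 has a pressure processing unit 31C, which processes the pressure detection signals supplied from the touch sensor 17 and the sensors 3A₁ to 3D₁.
[0040]
When the pressure processing unit 31C detects, from the touch sensor 17, a pressure that is at or above a predetermined threshold and of short duration, it recognizes that the robot was "hit (scolded)"; when it detects a pressure that is below the threshold and of long duration, it recognizes that the robot was "stroked (praised)". It notifies the model storage unit 32, the action determination mechanism unit 33, and the speech recognition unit 31A of the recognition result as state recognition information. Further, when the pressure processing unit 31C detects, from the signals input from the sensors 3A₁ to 3D₁, that none of the leg units 3A to 3D is in contact with the floor or the like, it recognizes that the robot has been picked up by the user and likewise notifies the model storage unit 32, the action determination mechanism unit 33, and the speech recognition unit 31A of that result as state recognition information.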
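The recognition rules of the pressure processing unit 31C described in [0040] can be sketched as follows. This is only a minimal illustration: the function names, the numeric threshold, and the boundary between "short" and "long" duration are hypothetical, since the specification gives no concrete values.

```python
from typing import Optional, Sequence

# Hypothetical values; the specification only says "predetermined threshold",
# "short time", and "long time".
PRESSURE_THRESHOLD = 0.5
SHORT_DURATION_S = 0.3


def classify_touch(pressure: float, duration_s: float) -> Optional[str]:
    """Classify a touch-sensor event as 'hit', 'stroked', or None."""
    if pressure >= PRESSURE_THRESHOLD and duration_s < SHORT_DURATION_S:
        return "hit"      # strong and brief: scolded
    if pressure < PRESSURE_THRESHOLD and duration_s >= SHORT_DURATION_S:
        return "stroked"  # weak and sustained: praised
    return None           # combinations the text does not cover


def is_picked_up(foot_contacts: Sequence[bool]) -> bool:
    """True when none of the four foot sensors reports ground contact."""
    return not any(foot_contacts)
```

Both results would be passed on as state recognition information, including to the speech recognition unit 31A.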
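[0041]
The model storage unit 32 stores and manages an emotion model, an instinct model, and a growth model, which express the states of the robot's emotion, instinct, and growth, respectively.
[0042]
The emotion model represents the state (degree) of emotions such as "joy", "sadness", "anger", and "fun" by values within a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and so on. The instinct model represents the state (degree) of instinctive desires such as "appetite", "desire for sleep", and "desire for exercise" by values within a predetermined range, and changes those values in the same way. The growth model represents the state (degree) of growth, such as "childhood", "adolescence", "middle age", and "old age", by values within a predetermined range, and likewise changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and so on.
[0043]
The model storage unit 32 sends the states of emotion, instinct, and growth represented by the values of the emotion model, instinct model, and growth model to the action determination mechanism unit 33 as state information.
[0044]
In addition to the state recognition information supplied from the sensor input processing unit 31, the model storage unit 32 is supplied by the action determination mechanism unit 33 with action information indicating the robot's current or past actions, for example "walked for a long time". Even when the same state recognition information is given, the model storage unit 32 generates different state information depending on the action indicated by the action information.
[0045]
For example, when the robot greets the user and the user strokes its head, action information indicating that the robot greeted the user and state recognition information indicating that its head was stroked are given to the model storage unit 32. In this case, the model storage unit 32 increases the value of the emotion model representing "joy".
[0046]
On the other hand, when the robot's head is stroked while it is carrying out some task, action information indicating that a task is in progress and state recognition information indicating that its head was stroked are given to the model storage unit 32. In this case, the model storage unit 32 does not change the value of the emotion model representing "joy".
[0047]
In this way, the model storage unit 32 sets the values of the emotion model by referring not only to the state recognition information but also to the action information indicating the robot's current or past actions.
This avoids unnatural changes of emotion, such as increasing the value of the emotion model representing "joy" when the user strokes the robot's head as a prank while it is carrying out some task.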
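The context-dependent update described in [0044] to [0047] might look like the following sketch. The string labels, the step size, and the value range are assumptions made for illustration; the specification only states that "joy" increases when the robot is stroked after greeting the user and is left unchanged when it is stroked during a task.

```python
def update_joy(joy: int, state_recognition: str, action_info: str,
               step: int = 10, lo: int = 0, hi: int = 100) -> int:
    """Return the new 'joy' value of the emotion model.

    The labels 'stroked' and 'greeting', the step, and the [lo, hi] range
    are hypothetical.
    """
    if state_recognition == "stroked" and action_info == "greeting":
        joy += step  # stroked after greeting the user: joy increases
    # Stroked while carrying out a task (or anything else): value unchanged.
    return max(lo, min(hi, joy))
```

The same pattern would apply to the instinct and growth models, which the specification says are also updated from both kinds of information.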
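[0048]
For the instinct model and the growth model as well, the model storage unit 32 increases or decreases the values based on both the state recognition information and the action information, as it does for the emotion model. The model storage unit 32 also increases or decreases the values of each of the emotion, instinct, and growth models based on the values of the other models.
[0049]
The action determination mechanism unit 33 determines the next action based on the state recognition information from the sensor input processing unit 31, the state information from the model storage unit 32, the passage of time, and so on, and sends the content of the determined action to the posture transition mechanism unit 34 as action command information.
[0050]
Specifically, the action determination mechanism unit 33 manages, as an action model that defines the robot's actions, a finite automaton in which the actions the robot can take are associated with states. It transitions the state of this finite automaton based on the state recognition information from the sensor input processing unit 31, the values of the emotion model, instinct model, or growth model in the model storage unit 32, the passage of time, and so on, and determines the action corresponding to the state after the transition as the next action to take.
[0051]
The action determination mechanism unit 33 transitions the state when it detects a predetermined trigger. For example, it transitions the state when the time spent executing the action corresponding to the current state reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 32 falls to or below, or rises to or above, a predetermined threshold.
[0052]
As described above, the action determination mechanism unit 33 transitions the state of the action model based not only on the state recognition information from the sensor input processing unit 31 but also on the values of the emotion model, instinct model, and growth model in the model storage unit 32. Therefore, even when the same state recognition information is input, the destination of the state transition differs depending on the values of those models (the state information).
[0053]
As a result, when the state information indicates "not angry" and "not hungry" and the state recognition information indicates that "a palm has been held out in front of the eyes", the action determination mechanism unit 33 generates action command information that makes the robot "shake hands" in response to the palm, and sends it to the posture transition mechanism unit 34.
[0054]
When the state information indicates "not angry" and "hungry" and the state recognition information indicates that "a palm has been held out in front of the eyes", the action determination mechanism unit 33 generates action command information that makes the robot "lick the palm" in response, and sends it to the posture transition mechanism unit 34.
[0055]
When the state information indicates "angry" and the state recognition information indicates that "a palm has been held out in front of the eyes", the action determination mechanism unit 33 generates action command information that makes the robot "turn its head away", regardless of whether the state information indicates "hungry" or "not hungry", and sends it to the posture transition mechanism unit 34.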
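The three examples in [0053] to [0055] amount to a small decision rule in which the same state recognition information leads to different actions depending on the state information. A minimal sketch, with hypothetical labels and with the finite automaton reduced to a single transition, is:

```python
from typing import Optional


def decide_action(angry: bool, hungry: bool, recognition: str) -> Optional[str]:
    """Pick the next action when a palm is held out in front of the robot.

    The boolean inputs stand in for the state information from the model
    storage unit; the string labels are hypothetical.
    """
    if recognition != "palm_offered":
        return None            # other triggers are outside this sketch
    if angry:
        return "turn_away"     # [0055]: hunger is ignored when angry
    if hungry:
        return "lick_palm"     # [0054]: not angry, hungry
    return "shake_hands"       # [0053]: not angry, not hungry
```

In the described apparatus the returned action would be sent to the posture transition mechanism unit 34 as action command information.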
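[0056]
The action determination mechanism unit 33 can also determine parameters of the action corresponding to the destination state, such as walking speed or the magnitude and speed of limb movements, based on the emotion, instinct, and growth states indicated by the state information supplied from the model storage unit 32. In that case, action command information including those parameters is sent to the posture transition mechanism unit 34.
[0057]
As described above, the action determination mechanism unit 33 generates not only action command information that moves the robot's head, limbs, and so on, but also action command information that makes the robot speak. The latter is supplied to the speech synthesis unit 36 and includes, for example, text corresponding to the synthesized sound that the speech synthesis unit 36 is to generate. When the speech synthesis unit 36 receives action command information from the action determination mechanism unit 33, it generates synthesized sound based on the text contained in that information and supplies it to the speaker 18 via the output control unit 37 for output. The speaker 18 thereby outputs, for example, the robot's cries, various requests to the user such as "I'm hungry", responses to the user's calls such as "What?", and other speech.
[0058]
The posture transition mechanism unit 34 generates, based on the action command information supplied from the action determination mechanism unit 33, posture transition information for moving the robot from its current posture to the next posture, and sends it to the control mechanism unit 35 and the speech recognition unit 31A.
[0059]
The postures to which the robot can move next from its current posture are determined by the physical form of the robot, such as the shape and weight of the body and limbs and the way the parts are joined, and by the mechanisms of the actuators 3AA₁ to 5A₁ and 5A₂, such as the directions and angles in which the joints bend.
[0060]
The next posture may be one that can be reached directly from the current posture or one that cannot. For example, a four-legged robot lying sprawled with its limbs thrown out can move directly to a lying-down posture but cannot move directly to a standing posture; it needs a two-stage motion in which it first draws its limbs in close to the body to lie down and then stands up. There are also postures that cannot be executed safely. For example, a four-legged robot standing on all four legs easily falls over if it tries to raise both front legs in a "banzai" pose.
[0061]
For this reason, the posture transition mechanism unit 34 registers in advance the postures that can be reached directly. When the action command information supplied from the action determination mechanism unit 33 indicates a posture that can be reached directly, the unit sends that action command information to the control mechanism unit 35 as posture transition information without change. When the action command information indicates a posture that cannot be reached directly, the posture transition mechanism unit 34 generates posture transition information that first moves the robot to another reachable posture and then to the target posture, and sends it to the control mechanism unit 35. This prevents the robot from trying to force a posture it cannot reach and from falling over.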
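One way to realize the behaviour in [0060] and [0061] is to register the directly reachable postures as a graph and search it for a route. The sketch below uses breadth-first search, which is a technique chosen here for illustration and not one named in the specification; the posture names and the contents of the table are also hypothetical, apart from the sprawled, lying, standing sequence taken from the text.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical table of directly reachable postures.
DIRECT: Dict[str, List[str]] = {
    "sprawled": ["lying"],
    "lying": ["sprawled", "standing"],
    "standing": ["lying", "walking"],
    "walking": ["standing"],
}


def plan_posture_path(current: str, target: str,
                      direct: Dict[str, List[str]] = DIRECT) -> Optional[List[str]]:
    """Return a shortest sequence of registered postures, or None if unreachable."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in direct.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # e.g. an unsafe posture that was never registered
```

With this table, a request to stand while sprawled yields the two-stage route through the lying posture, and an unregistered posture such as the unsafe "banzai" pose is rejected.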
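[0062]
The posture transition information is also output to the speech recognition unit 31A. When the robot changes its posture, one or more of the actuators 3AA₁ to 5A₂ operate. The posture transition mechanism unit 34 therefore outputs the posture transition information to the speech recognition unit 31A so that the speech recognition unit 31A does not recognize the operating sounds of these actuators as the user's speech.
[0063]
The control mechanism unit 35 generates control signals for driving the actuators 3AA₁ to 5A₂ in accordance with the posture transition information from the posture transition mechanism unit 34 and sends them to the actuators 3AA₁ to 5A₂. The actuators 3AA₁ to 5A₂ are thereby driven according to the control signals, and the robot acts autonomously.
[0064]
The output control unit 37 is supplied with the digital data of the synthesized sound from the speech synthesis unit 36. The output control unit 37 D/A-converts the digital data into an analog audio signal and supplies it to the speaker 18 for output.
[0065]
Next, FIG. 4 shows an example of the configuration of the speech recognition unit 31A in FIG. 3.
[0066]
The audio signal from the microphone 15 is supplied to an AD (Analog Digital) conversion unit 41. The AD conversion unit 41 samples and quantizes the analog audio signal from the microphone 15 and converts it into speech data, which is a digital signal. This speech data is supplied to a feature extraction unit 42 and a speech section detection unit 47.
[0067]
The feature extraction unit 42 receives the state recognition information and the posture transition information and, for each suitable frame of the input speech data, performs, for example, MFCC (Mel Frequency Cepstrum Coefficient) analysis in accordance with the state of the robot, and outputs the analysis result to a matching unit 43 as feature parameters (a feature vector). The feature extraction unit 42 can instead extract other feature parameters, such as linear prediction coefficients, cepstrum coefficients, line spectrum pairs, or the power in each predetermined frequency band (filter bank output).
[0068]
Using the feature parameters from the feature extraction unit 42, and referring as necessary to an acoustic model storage unit 44, a dictionary storage unit 45, and a grammar storage unit 46, the matching unit 43 recognizes the speech input to the microphone 15 (the input speech) based on, for example, the continuous-distribution HMM (Hidden Markov Model) method.
[0069]
The acoustic model storage unit 44 stores acoustic models representing the acoustic features of the individual phonemes, syllables, and so on of the language of the speech to be recognized. Because speech recognition here is based on the continuous-distribution HMM method, HMMs (Hidden Markov Models) are used as the acoustic models. The dictionary storage unit 45 stores a word dictionary describing pronunciation information (phonological information) for each word to be recognized. The grammar storage unit 46 stores grammar rules describing how the words registered in the word dictionary of the dictionary storage unit 45 are chained (connected). As the grammar rules, rules based on, for example, a context-free grammar (CFG) or statistical word-chain probabilities (N-gram) can be used.
[0070]
The matching unit 43 constructs an acoustic model of each word (a word model) by connecting the acoustic models stored in the acoustic model storage unit 44 with reference to the word dictionary of the dictionary storage unit 45. The matching unit 43 further connects several word models with reference to the grammar rules stored in the grammar storage unit 46 and, using the word models connected in this way, recognizes the speech input to the microphone 15 from the feature parameters by the continuous-distribution HMM method. That is, the matching unit 43 finds the sequence of word models with the highest score (likelihood) of observing the time series of feature parameters output by the feature extraction unit 42, and outputs the phonological information (reading) of the word string corresponding to that sequence as the speech recognition result.
[0071]
More specifically, for the word string corresponding to the connected word models, the matching unit 43 accumulates the occurrence probabilities of the individual feature parameters, takes the accumulated value as the score, and outputs the phonological information of the word string with the highest score as the speech recognition result.
[0072]
The recognition result for the speech input to the microphone 15, output in this way, is output to the model storage unit 32 and the action determination mechanism unit 33 as state recognition information.
[0073]
The speech section detection unit 47 calculates the speech input level (power) of the speech data from the AD conversion unit 41 for each frame, using the same frames as the MFCC analysis in the feature extraction unit 42. The speech section detection unit 47 also receives the state recognition information and the posture transition information and detects the speech section in which the user's speech is being input by comparing the speech input level of each frame with a predetermined threshold. A speech section is thus a section in which no predetermined noise-generating motion (for example, moving the head unit 4) is being performed and which consists of frames whose speech input level is at or above the predetermined threshold. The speech section detection unit 47 supplies the detected speech section to the feature extraction unit 42 and the matching unit 43, which process only the speech section.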
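[0074]
Next, the speech recognition processing is described with reference to the flowcharts of FIG. 5 and FIG. 6.
[0075]
In step S1, the speech section detection unit 47 estimates the environmental sound level from the speech data input via the AD conversion unit 41.
[0076]
Various kinds of noise are input to the microphone 15 even when the user is not speaking to the robot, and recognizing that noise as the user's utterance causes malfunctions. The environmental sound level therefore needs to be estimated while the user's utterance is not being recognized (the speech recognition OFF state).
[0077]
As shown in FIG. 7, the speech input level of the speech data input via the microphone 15 and the AD conversion unit 41 is not constant even in the speech recognition OFF state. The environmental sound level is therefore calculated at predetermined short intervals by the following equations (1) and (2), where ENV is the environmental sound level and P is the current speech input level.
ENV = a × ENV + b × P ... (1)
a + b = 1.0 ... (2)
[0078]
Here, a is set to a value relatively close to 1, such as 0.9, and b is set to a value such as 0.1, so that momentary high-power noise (for example, a door slamming shut) does not greatly affect the overall environmental sound level.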
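Equations (1) and (2) describe an exponentially weighted running average of the input level. A minimal sketch follows; the function name is an assumption, and a = 0.9 is the example value given in the text.

```python
def update_env(env: float, p: float, a: float = 0.9) -> float:
    """Apply equation (1): ENV = a * ENV + b * P, with b = 1 - a from equation (2)."""
    b = 1.0 - a
    return a * env + b * p
```

With a = 0.9, a single loud frame moves the estimate by only one tenth of the difference between P and ENV, which is why a slamming door barely affects it, whereas a sustained change in the background level is tracked within a few dozen updates.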
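[0079]
The estimation of the environmental sound level continues, using a predetermined threshold T1, until the speech input level exceeds ENV + T1 (that is, until it is determined in step S8 or step S13, described later, that the speech input level has exceeded ENV + T1).
[0080]
In step S2, the speech section detection unit 47 determines whether it has received posture transition information or state recognition information from the posture transition mechanism unit 34 or the pressure processing unit 31C. If it is determined in step S2 that neither has been received, the processing proceeds to step S8.
[0081]
If it is determined in step S2 that posture transition information or state recognition information has been received, then in step S3 the speech section detection unit 47 determines, based on the input information, whether the actuator that is operating is close to the microphone 15. If it is determined in step S3 that the operating actuator is not close to the microphone 15, the processing proceeds to step S6.
[0082]
If it is determined in step S3 that the operating actuator is close to the microphone 15, then in step S4 the speech section detection unit 47 cancels the speech recognition processing. That is, the period during which an actuator close to the microphone 15 is operating is not treated as a speech section.
[0083]
In the robot described with reference to FIG. 1 and FIG. 2, the microphone 15 is provided in the head unit 4. The noise that accompanies the operation of the actuators 4A₁ to 4A_L provided in the head unit 4 (for example, at the connection between the head unit 4 and the body unit 2, or at the lower jaw portion 4A) therefore has a relatively large input power, so the reliability of speech recognition results obtained during such operation is very low. Not treating the period during which an actuator close to the microphone 15 is operating as a speech section makes it possible to prevent malfunctions.
[0084]
In step S5, the speech section detection unit 47 determines, based on the input information, whether the motion has ended. If it is determined in step S5 that the motion has not ended, the processing returns to step S4 and the subsequent processing is repeated. If it is determined in step S5 that the motion has ended, the processing returns to step S1 and the subsequent processing is repeated.
[0085]
If it is determined in step S3 that the operating actuator is not close to the microphone 15, then in step S6 the speech section detection unit 47 determines, based on the input information, whether the posture transition information indicates the start of a walking motion. If it is determined in step S6 that it does, the processing proceeds to step S13.
[0086]
If it is determined in step S6 that the posture transition information does not indicate the start of a walking motion, then either no actuator near the microphone 15 is operating and a motion other than walking is being performed, or the robot is being picked up or stroked by the user. In step S7, the speech section detection unit 47 changes the threshold T1 (FIG. 7), which is used to determine that speech such as the user's utterance, rather than environmental sound (noise), is being input, to a value that corresponds to the posture transition information or the state recognition information.
[0087]
For example, when the state recognition information input from the pressure processing unit 31C indicates that the robot is being picked up by the user, the speech data input from the microphone 15 contains touch noise and rubbing sounds produced because the user is touching the surface of the robot's housing. When the posture transition information input from the posture transition mechanism unit 34 indicates that an actuator located relatively far from the microphone 15 (for example, the actuator 5A₁ of the tail unit 5) is operating, the speech data input from the microphone 15 contains the operating sound of that actuator. The operating sound contained in the speech data differs depending on the type of actuator and its distance from the microphone 15.
[0088]
The speech section detection unit 47 prepares in advance, for example, a threshold T1a for when the robot is being picked up by the user, a threshold T1b for when the leg unit 3A or 3B is operating, a threshold T1c for when the leg unit 3C or 3D is operating, and a threshold T1d for when the tail unit 5 is operating, and is set so that it detects the speech section using the threshold appropriate to the input posture transition information or state recognition information.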
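The branching in steps S2 to S7 can be summarized as a lookup that either cancels recognition, selects the walking branch, or picks a state-specific start threshold. In the sketch below the numeric thresholds, the part labels, and the precedence given to "picked up" over a moving part are all assumptions; the specification only names the thresholds T1a to T1d without giving values.

```python
from typing import Optional, Tuple

T1_DEFAULT = 6.0  # hypothetical default T1
T1_BY_CONDITION = {
    "picked_up": 12.0,   # T1a
    "legs_3A_3B": 10.0,  # T1b
    "legs_3C_3D": 8.0,   # T1c
    "tail": 7.0,         # T1d
}
NEAR_MIC_PARTS = {"head", "jaw"}  # actuators 4A_1..4A_L sit next to microphone 15


def detection_mode(moving_part: Optional[str] = None, picked_up: bool = False,
                   walking: bool = False) -> Tuple[str, Optional[float]]:
    """Return (mode, T1) where mode is 'cancel', 'walking', or 'normal'."""
    if moving_part in NEAR_MIC_PARTS:
        return ("cancel", None)            # steps S3/S4: no speech section
    if walking:
        return ("walking", T1_DEFAULT)     # step S6: go to the walking branch
    if picked_up:
        return ("normal", T1_BY_CONDITION["picked_up"])
    return ("normal", T1_BY_CONDITION.get(moving_part, T1_DEFAULT))  # step S7


def exceeds_start_threshold(p: float, env: float, t1: float) -> bool:
    """Step S8 / S13 comparison: is the input level above T1 + ENV?"""
    return p > t1 + env
```

A louder expected self-noise maps to a larger T1, so the same input level that starts a speech section while the robot is idle may be ignored while it is being held.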
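[0089]
If it is determined in step S2 that no posture transition information or state recognition information has been received, or after the processing of step S7 ends, then in step S8 the speech section detection unit 47 determines whether the speech input level has exceeded the threshold (T1 + ENV). If it is determined in step S8 that the speech input level has not exceeded the threshold (T1 + ENV), the processing returns to step S1 and the subsequent processing is repeated.
[0090]
If it is determined in step S8 that the speech input level has exceeded the threshold (T1 + ENV), then in step S9 the speech section detection unit 47 stops estimating the environmental sound level and starts the speech recognition start count using an internal counter (timer), not shown.
[0091]
In step S10, the speech section detection unit 47 determines whether the speech recognition start count has exceeded a predetermined value (for example, the value indicated by CNT_ON in FIG. 7). If it is determined in step S10 that the count has not exceeded the predetermined value, the processing of step S10 is repeated until it is determined that the count has exceeded it.
[0092]
If it is determined in step S10 that the speech recognition start count has exceeded the predetermined value, then in step S11 the speech section detection unit 47 outputs the start of the speech section to the feature extraction unit 42 and the matching unit 43. The feature extraction unit 42 and the matching unit 43 execute the speech recognition processing described with reference to FIG. 4.
[0093]
In step S12, the speech section detection unit 47 determines whether the speech input level has fallen to or below the threshold (T2 + ENV).
[0094]
Various kinds of noise are input to the microphone 15 even after the user has finished speaking to the robot, and recognizing that noise as the user's utterance causes malfunctions. Therefore, when the speech input level falls below a certain value, speech recognition processing needs to be stopped (the speech recognition OFF state).
[0095]
As shown in FIG. 8, the speech section detection unit 47 can determine whether the user's utterance has ended by determining whether the speech input level of the speech data input via the microphone 15 and the AD conversion unit 41 has fallen below the sum (T2 + ENV) of a predetermined threshold T2 and the environmental sound level ENV at the time the speech recognition processing started.
[0096]
If it is determined in step S12 that the speech input level has not fallen to or below the threshold (T2 + ENV), the processing returns to step S11 and the subsequent processing is repeated. If it is determined in step S12 that the speech input level has fallen to or below the threshold (T2 + ENV), the processing proceeds to step S18.
[0097]
If it is determined in step S6 that the posture transition information indicates the start of a walking motion, then in step S13 the speech section detection unit 47 determines whether the speech input level has exceeded the threshold (T1 + ENV). If it is determined in step S13 that the speech input level has not exceeded the threshold (T1 + ENV), the processing returns to step S1 and the subsequent processing is repeated.
[0098]
If it is determined in step S13 that the speech input level has exceeded the threshold (T1 + ENV), then in step S14 the speech section detection unit 47 stops estimating the environmental sound level and starts the speech recognition start count using the internal counter (timer), not shown.
[0099]
In step S15, the speech section detection unit 47 determines whether the speech recognition start count has exceeded a predetermined value (for example, the value indicated by CNT_ON in FIG. 7). If it is determined in step S15 that the count has not exceeded the predetermined value, the processing of step S15 is repeated until it is determined that the count has exceeded it.
[0100]
If it is determined in step S15 that the speech recognition start count has exceeded the predetermined value, then in step S16 the speech section detection unit 47 outputs the start of the speech section to the feature extraction unit 42 and the matching unit 43. The feature extraction unit 42 and the matching unit 43 execute speech recognition processing dedicated to the walking motion.
[0101]
As shown in FIG. 9, while the robot is walking, the input speech data contains, in addition to the user's utterance, landing noise produced when the leg units 3A to 3D touch the floor or the like. To remove this landing noise from the speech data, the feature extraction unit 42, having been notified by the posture transition information that the robot is walking, monitors, for example, the change (ΔP) in the speech input level and detects pulse-like noise from the magnitude of ΔP. (As shown in FIG. 9, pulse-like speech data is characterized by a level that rises sharply and, after the peak, falls sharply.) The feature extraction unit 42 then skips feature extraction for that frame. Because landing noise occurs over a very short time, excluding the frames detected as landing noise from the speech recognition processing does not hinder recognition of the user's utterance.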
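The pulse detection in [0101] can be illustrated with a frame-level sketch that flags a frame whose level jumps sharply above its predecessor and drops sharply again in the next frame. Treating a pulse as a single-frame peak, and the function names and the delta parameter, are assumptions made for this example; the specification only says that ΔP is monitored.

```python
from typing import List, Sequence, Set, TypeVar

T = TypeVar("T")


def pulse_frames(levels: Sequence[float], delta: float) -> Set[int]:
    """Indices of frames that rise by more than delta and then fall by more than delta."""
    flagged = set()
    for i in range(1, len(levels) - 1):
        rise = levels[i] - levels[i - 1]
        fall = levels[i] - levels[i + 1]
        if rise > delta and fall > delta:
            flagged.add(i)
    return flagged


def drop_pulse_frames(frames: Sequence[T], levels: Sequence[float],
                      delta: float) -> List[T]:
    """Return the frames to pass on to feature extraction, without the pulse frames."""
    bad = pulse_frames(levels, delta)
    return [f for i, f in enumerate(frames) if i not in bad]
```

A sustained rise, such as the onset of speech, is not flagged because the level does not fall back in the following frame.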
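[0102]
Alternatively, the feature extraction unit 42 may store in advance the standard sound component of the landing noise produced during walking, filter the input speech data with that component while the robot is walking, and perform feature extraction on the speech data from which the landing noise has been removed.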
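The specification does not say how the stored noise component is applied. One simple possibility, shown here purely as an assumption, is a spectral-subtraction-style filter that subtracts a stored per-band noise power template from the filter-bank output of each frame and clips the result at a floor.

```python
from typing import List, Sequence


def subtract_noise(band_powers: Sequence[float],
                   noise_template: Sequence[float],
                   floor: float = 0.0) -> List[float]:
    """Subtract a stored per-band noise template from one frame of band powers.

    noise_template stands in for the 'standard sound component' of the
    landing noise; both its form and this subtraction rule are hypothetical.
    """
    if len(band_powers) != len(noise_template):
        raise ValueError("frame and template must have the same number of bands")
    return [max(floor, p - n) for p, n in zip(band_powers, noise_template)]
```

The filtered band powers would then be passed to feature extraction in place of the raw ones while the posture transition information indicates walking.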
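[0103]
In step S17, the speech section detection unit 47 determines, by the same processing as in step S12, whether the speech input level has fallen to or below the threshold (T2 + ENV). If it is determined in step S17 that the speech input level has not fallen to or below the threshold (T2 + ENV), the processing returns to step S16 and the subsequent processing is repeated.
[0104]
If it is determined in step S12 or in step S17 that the speech input level has fallen to or below the threshold (T2 + ENV), then in step S18 the speech section detection unit 47 starts the speech recognition end count using the internal counter (timer), not shown.
[0105]
In step S19, the speech section detection unit 47 determines whether the speech recognition end count has exceeded a predetermined value (for example, the value indicated by CNT_OFF in FIG. 8). If it is determined in step S19 that the count has not exceeded the predetermined value, the processing of step S19 is repeated until it is determined that the count has exceeded it.
[0106]
If it is determined in step S19 that the speech recognition end count has exceeded the predetermined value, then in step S20 the speech section detection unit 47 outputs a signal indicating that the speech section has ended to the feature extraction unit 42 and the matching unit 43. The feature extraction unit 42 and the matching unit 43 end the speech recognition processing, the processing returns to step S1, and the subsequent processing is repeated.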
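Steps S8 to S12 and S18 to S20 form a small state machine with a start count and an end count. The sketch below runs it over a list of per-frame input levels. The frame-based counters, the state names, and the parameter names are assumptions; as in the flowchart, ENV is updated by equation (1) only while recognition is off and is frozen once the start threshold is exceeded.

```python
from typing import List, Sequence, Tuple


def detect_sections(levels: Sequence[float], t1: float, t2: float,
                    cnt_on: int, cnt_off: int,
                    a: float = 0.9, env0: float = 0.0) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs of detected speech sections."""
    env, state, count, start = env0, "off", 0, 0
    sections = []
    for i, p in enumerate(levels):
        if state == "off":
            if p > t1 + env:                  # S8: level exceeds T1 + ENV
                state, count = "start_count", 0   # S9: freeze ENV, start counting
            else:
                env = a * env + (1.0 - a) * p     # S1: keep estimating ENV
        elif state == "start_count":
            count += 1
            if count > cnt_on:                # S10: count exceeded CNT_ON
                state, start = "on", i        # S11: speech section starts
        elif state == "on":
            if p <= t2 + env:                 # S12: level at or below T2 + ENV
                state, count = "end_count", 0     # S18: start the end count
        elif state == "end_count":
            count += 1
            if count > cnt_off:               # S19: count exceeded CNT_OFF
                sections.append((start, i))   # S20: speech section ends
                state = "off"
    return sections
```

The two counters act as hold-off periods, so the section boundaries are reported a fixed number of frames after each threshold crossing rather than at the crossing itself.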
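[0107]
In step S7 of FIG. 5, the speech section detection unit 47 was described as changing the threshold T1, which is used to determine that speech such as the user's utterance rather than environmental sound (noise) is being input, to a value corresponding to the posture transition information or the state recognition information. Alternatively, a threshold T3 that does not depend on the environmental sound level may be used as the threshold for detecting the speech section. In that case, step S8 determines not whether the speech input level has exceeded the threshold (T1 + ENV) but whether it has exceeded the threshold T3, which is independent of the environmental sound level ENV.
[0108]
The threshold T3 may be prepared in advance according to the posture transition information or the state recognition information, or it may be set each time by providing a predetermined learning period when a motion starts or the state changes and obtaining the average noise component during that period.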
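The second option in [0108], setting T3 from a learning period, might be sketched as the average level over the learning frames plus a margin. The margin and the function name are assumptions; the specification only states that an average noise component is obtained.

```python
from typing import Sequence


def learn_t3(learning_levels: Sequence[float], margin: float = 3.0) -> float:
    """Set T3 from the input levels observed during the learning period.

    The margin added on top of the average noise level is hypothetical.
    """
    if not learning_levels:
        raise ValueError("the learning period must contain at least one frame")
    return sum(learning_levels) / len(learning_levels) + margin
```

The start test of step S8 would then compare the input level with this T3 directly, without adding ENV.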
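[0109]
The present invention has been described above as applied to an entertainment robot (a robot as a pseudo-pet), but it is not limited to this and can be widely applied to various robots, such as industrial robots. The present invention can also be applied not only to robots in the real world but also to virtual robots displayed on a display device such as a liquid crystal display.
[0110]
Further, in this embodiment the series of processes described above is performed by having the CPU 10A (FIG. 2) execute a program, but the series of processes can also be performed by dedicated hardware.
[0111]
Besides being stored in the memory 10B (FIG. 2) in advance, the program can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto Optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as so-called packaged software and installed in the robot (the memory 10B).
[0112]
The program can also be transferred wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or transferred by wire via a network such as a LAN (Local Area Network) or the Internet, and installed in the memory 10B.
[0113]
In this case, when the program is upgraded, the upgraded program can easily be installed in the memory 10B.
[0114]
In this specification, the processing steps describing the program that causes the CPU 10A to perform various kinds of processing do not necessarily have to be processed in time series in the order described in the flowcharts; they also include processing executed in parallel or individually (for example, parallel processing or object-based processing).
[0115]
The program may be processed by a single CPU or may be processed in a distributed manner by a plurality of CPUs.
[0116]
[Effects of the invention]
According to the robot control apparatus, the robot control method, and the program recorded on the recording medium of the present invention, speech recognition is performed based on the state and actions of the robot, so that speech uttered by the user can be distinguished from noise generated by the robot's motion and the like, and misrecognition can be prevented.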
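[Brief description of the drawings]
[FIG. 1] A perspective view showing an example of the external configuration of an embodiment of a robot to which the present invention is applied.
[FIG. 2] A block diagram showing an example of the internal configuration of the robot.
[FIG. 3] A block diagram showing an example of the functional configuration of the controller.
[FIG. 4] A block diagram showing an example of the configuration of the speech recognition unit.
[FIG. 5] A flowchart for explaining the speech recognition processing.
[FIG. 6] A flowchart for explaining the speech recognition processing.
[FIG. 7] A diagram for explaining the start of a speech recognition section.
[FIG. 8] A diagram for explaining the end of a speech recognition section.
[FIG. 9] A diagram for explaining the landing noise of the leg units during walking.
[Explanation of reference signs]
4 head unit, 4A lower jaw portion, 10 controller, 10A CPU, 10B memory, 15 microphone, 16 CCD camera, 17 touch sensor, 18 speaker, 31 sensor input processing unit, 31A speech recognition unit, 31B image recognition unit, 31C pressure processing unit, 32 model storage unit, 33 action determination mechanism unit, 34 posture transition mechanism unit, 35 control mechanism unit, 36 speech synthesis unit, 37 output control unit, 41 AD conversion unit, 42 feature extraction unit, 43 matching unit, 44 acoustic model storage unit, 45 dictionary storage unit, 46 grammar storage unit, 47 speech section detection unit
[0001]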
[Technical field of the invention]
The present invention relates to a robot control apparatus, a robot control method, and a recording medium, and in particular to a robot control apparatus, robot control method, and recording medium suitable for use in a robot that acts on the basis of speech recognition results produced by a speech recognition device.
[0002]
[Prior art]
In recent years, robots (in this specification, including stuffed-toy robots) that recognize speech uttered by a user and, based on the recognition result, perform actions such as making a gesture or outputting a synthesized sound have been commercialized, for example as toys.
[0003]
[Problems to be solved by the invention]
Such robots accept speech input at all times. As a result, actuator noise generated while the robot operates, pulse-like landing noise generated when the robot walks, and touch noise generated when the user touches the robot have sometimes been falsely detected as speech uttered by the user.
[0004]
The present invention has been made in view of this situation. By performing speech recognition based on the state and actions of the robot, it distinguishes speech uttered by the user from noise generated by the robot's own motion and the like, thereby preventing misrecognition.
[0005]
[Means for Solving the Problems]
The robot control apparatus of the present invention comprises: speech input means for receiving input of speech data; state recognition information generation means for generating state recognition information indicating the result of recognizing the state of the robot; posture transition information generation means for generating posture transition information indicating a transition of the robot's posture; storage means for storing data corresponding to a noise component generated while the robot is walking; and recognition means for determining, on the basis of the state recognition information generated by the state recognition information generation means or the posture transition information generated by the posture transition information generation means, the section of the speech data input by the speech input means in which speech recognition is to be performed, and for recognizing the speech data of the determined section. When the recognition means determines from the posture transition information that the robot is walking, it recognizes the speech data after filtering it using the data corresponding to the noise component stored in the storage means.
[0006]
The posture transition information can include information indicating which of the robot's plurality of drive units performs a driving operation, and the recognition means can be made not to recognize the speech data when the drive unit being driven is located close to the speech input means.
[0007]
When the robot is walking, the recognition means can be made to recognize the speech data excluding the frames that contain a noise component generated by the walking motion.
[0009]
The posture transition information can include information indicating which of the robot's plurality of drive units performs a driving operation, and the recognition means can be made to perform speech recognition that takes into account, on the basis of the posture transition information, the noise generated by driving that drive unit.
[0010]
The state recognition information can include information indicating whether the robot is being touched by the user, and the recognition means can be made to perform speech recognition that takes into account, on the basis of the state recognition information, the noise generated because the user is touching the robot.
[0011]
The apparatus can further comprise storage means for storing predetermined thresholds corresponding to noise generated by the state of the robot or by a posture transition, and estimation means for estimating the environmental sound while the recognition means is not performing speech recognition. The recognition means can then be made to determine the start of the section in which speech recognition is performed, on the basis of the state recognition information generated by the state recognition information generation means or the posture transition information generated by the posture transition information generation means, using the threshold stored in the storage means and the environmental sound estimated by the estimation means.
[0012]
The apparatus can further comprise storage means for storing predetermined thresholds corresponding to noise generated by the state of the robot or by a posture transition. The recognition means can then be made to determine the start of the section in which speech recognition is performed, on the basis of the state recognition information generated by the state recognition information generation means or the posture transition information generated by the posture transition information generation means, using the threshold stored in the storage means.
[0013]
The apparatus can further comprise setting means for setting, on the basis of the speech data input by the speech input means, a threshold corresponding to noise generated by the state of the robot or by a posture transition. The recognition means can then be made to determine the start of the section in which speech recognition is performed in the speech data input by the speech input means, using the threshold set by the setting means.
[0014]
The robot control method of the present invention includes: a speech input step of receiving input of speech data; a state recognition information generation step of generating state recognition information indicating the result of recognizing the state of the robot; a posture transition information generation step of generating posture transition information indicating a transition of the robot's posture; and a recognition step of determining, on the basis of the state recognition information generated in the state recognition information generation step or the posture transition information generated in the posture transition information generation step, the section of the speech data input in the speech input step in which speech recognition is to be performed, and recognizing the speech data of the determined section. When the recognition step determines from the posture transition information that the robot is walking, the speech data is recognized after being filtered using data, stored in storage means, that corresponds to a noise component generated while the robot is walking.
[0015]
The program recorded on the recording medium of the present invention includes: a speech input step of receiving input of speech data; a state recognition information generation step of generating state recognition information indicating the result of recognizing the state of the robot; a posture transition information generation step of generating posture transition information indicating a transition of the robot's posture; and a recognition step of determining, on the basis of the state recognition information generated in the state recognition information generation step or the posture transition information generated in the posture transition information generation step, the section of the speech data input in the speech input step in which speech recognition is to be performed, and recognizing the speech data of the determined section. When the recognition step determines from the posture transition information that the robot is walking, the speech data is recognized after being filtered using data, stored in storage means, that corresponds to a noise component generated while the robot is walking.
[0016]
In the robot control apparatus, the robot control method, and the program recorded on the recording medium of the present invention, speech data is input; state recognition information indicating the result of recognizing the state of the robot is generated; posture transition information indicating a transition of the robot's posture is generated; and data corresponding to a noise component generated while the robot is walking is stored. On the basis of the generated state recognition information or the generated posture transition information, the section of the input speech data in which speech recognition is to be performed is determined, and the speech data of that section is recognized. When it is determined from the posture transition information that the robot is walking, the speech data is recognized after being filtered using the stored data corresponding to the noise component.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0018]
FIG. 1 shows an external configuration example of an embodiment of a robot to which the present invention is applied, and FIG. 2 shows an electrical configuration example thereof.
[0019]
In the present embodiment, the robot has a shape of a four-legged animal such as a dog, for example, and leg units 3A, 3B, 3C, 3D are connected to the front and rear, left and right of the body unit 2, respectively. In addition, the head unit 4 and the tail unit 5 are connected to the front end portion and the rear end portion of the body unit 2, respectively.
[0020]
The tail unit 5 is pulled out from a base portion 5B provided on the upper surface of the body unit 2 so as to be curved or swingable with two degrees of freedom.
[0021]
The body unit 2 houses a controller 10 that controls the entire robot, a battery 11 that serves as a power source for the robot, an internal sensor unit 14 that includes a battery sensor 12 and a heat sensor 13, and the like.
[0022]
The head unit 4 is provided, at predetermined positions, with a microphone 15 corresponding to the "ears", a CCD (Charge Coupled Device) camera 16 corresponding to the "eyes", a touch sensor 17 corresponding to the sense of touch, a speaker 18 corresponding to the "mouth", and so on. A lower jaw portion 4A, corresponding to the lower jaw of the mouth, is attached to the head unit 4 so as to be movable with one degree of freedom; moving the lower jaw portion 4A opens and closes the robot's mouth.
[0023]
As shown in FIG. 2, actuators 3AA₁ to 3AA_K, 3BA₁ to 3BA_K, 3CA₁ to 3CA_K, 3DA₁ to 3DA_K, 4A₁ to 4A_L, 5A₁, and 5A₂ are provided at the joints of the leg units 3A to 3D, at the connections between each of the leg units 3A to 3D and the body unit 2, at the connection between the head unit 4 and the body unit 2, at the connection between the head unit 4 and the lower jaw portion 4A, and at the connection between the tail unit 5 and the body unit 2, respectively.
[0024]
In addition, sensors 3A₁ to 3D₁ are provided on the ground-contact portions (the parts corresponding to the soles of the feet) of the leg units 3A to 3D. Each sensor detects whether its leg unit is in contact with the ground (for example, whether it is touching the floor) and outputs the result to the controller 10.
[0025]
The microphone 15 in the head unit 4 picks up surrounding sound, including utterances from the user, and sends the resulting audio signal to the controller 10. The CCD camera 16 captures images of the surroundings and sends the resulting image signal to the controller 10.
[0026]
The touch sensor 17 is provided, for example, on the upper part of the head unit 4. It detects the pressure applied by physical actions from the user such as "stroking" or "hitting" and sends the detection result to the controller 10 as a pressure detection signal.
[0027]
The battery sensor 12 in the body unit 2 detects the remaining amount of the battery 11 and sends the detection result to the controller 10 as a battery remaining amount detection signal. The heat sensor 13 detects the heat inside the robot and sends the detection result to the controller 10 as a heat detection signal.
[0028]
The controller 10 includes a CPU (Central Processing Unit) 10A, a memory 10B, and the like. The CPU 10A executes various processes by executing a control program stored in the memory 10B.
[0029]
That is, the controller 10 determines the surrounding situation, commands from the user, whether there has been any action from the user, and the like, based on the sound signal, image signal, and pressure detection signals given from the microphone 15, the CCD camera 16, the touch sensor 17, and the sensors 3A₁ to 3D₁, and on the battery remaining amount detection signal and heat detection signal given from the battery sensor 12 and the heat sensor 13.
[0030]
Further, the controller 10 determines a subsequent action based on these determination results and the like, and, based on that decision, drives whichever of the actuators 3AA₁ to 3AA_K, 3BA₁ to 3BA_K, 3CA₁ to 3CA_K, 3DA₁ to 3DA_K, 4A₁ to 4A_L, 5A₁, and 5A₂ are needed. As a result, the head unit 4 is swung up, down, left, and right, and the lower jaw portion 4A is opened and closed. Furthermore, the tail unit 5 is moved, or the leg units 3A to 3D are driven so that the robot performs actions such as walking.
[0031]
Further, the controller 10 generates a synthesized sound as necessary and supplies it to the speaker 18 for output, or turns on, turns off, or blinks an LED (Light Emitting Diode) (not shown) provided at the position of the robot's “eyes”.
[0032]
As described above, the robot takes an autonomous action based on the surrounding situation and the like.
[0033]
Next, FIG. 3 shows a functional configuration example of the controller 10 of FIG. The functional configuration shown in FIG. 3 is realized by the CPU 10A executing the control program stored in the memory 10B.
[0034]
The controller 10 is composed of: a sensor input processing unit 31 that recognizes specific external states; a model storage unit 32 that accumulates the recognition results of the sensor input processing unit 31 and expresses emotion, instinct, and growth states; an action determining mechanism unit 33 that determines the subsequent action based on the recognition results of the sensor input processing unit 31 and the like; a posture transition mechanism unit 34 that actually causes the robot to act based on the determination result of the action determining mechanism unit 33; a control mechanism unit 35 that drives and controls the actuators 3AA₁ to 5A₂; a speech synthesis unit 36 that generates synthesized sound; and an output control unit 37 that controls the output of the synthesized sound generated by the speech synthesis unit 36.
[0035]
The sensor input processing unit 31 recognizes specific external states, specific actions from the user, instructions from the user, and the like, based on the audio signal, image signal, pressure detection signal, and the like given from the microphone 15, the CCD camera 16, the touch sensor 17, the sensors 3A₁ to 3D₁, and so on. It notifies the model storage unit 32 and the behavior determination mechanism unit 33 of state recognition information representing the recognition result.
[0036]
That is, the sensor input processing unit 31 includes a voice recognition unit 31A, and the voice recognition unit 31A performs voice recognition on the voice signal provided from the microphone 15. The voice recognition unit 31A then notifies the model storage unit 32 and the action determination mechanism unit 33 of commands obtained as the voice recognition result, such as “walk”, “turn down”, and “follow the ball”, as state recognition information.
[0037]
The voice recognition unit 31A executes its processing while monitoring the state of the robot, based on the state recognition information input from the pressure processing unit 31C and the posture transition information of the robot input from the posture transition mechanism unit 34.
[0038]
Further, the sensor input processing unit 31 includes an image recognition unit 31B, and the image recognition unit 31B performs image recognition processing using the image signal given from the CCD camera 16. When the image recognition unit 31B detects, for example, “a red round object” or “a plane perpendicular to the ground and higher than a predetermined height” as a result of the processing, it notifies the model storage unit 32 and the action determination mechanism unit 33 of an image recognition result such as “there is a ball” or “there is a wall” as state recognition information.
[0039]
Further, the sensor input processing unit 31 includes a pressure processing unit 31C, and the pressure processing unit 31C processes the pressure detection signals given from the touch sensor 17 and the sensors 3A₁ to 3D₁.
[0040]
As a result of the processing, when the pressure processing unit 31C detects from the touch sensor 17 a pressure that is equal to or higher than a predetermined threshold and of short duration, it recognizes that the robot has been “struck”. When it detects a pressure that is below the predetermined threshold and of long duration, it recognizes that the robot has been “stroked (praised)”. It notifies the model storage unit 32, the action determination mechanism unit 33, and the voice recognition unit 31A of the recognition result as state recognition information. Further, when the pressure processing unit 31C detects, based on the signals input from the sensors 3A₁ to 3D₁, that none of the leg units 3A to 3D is grounded on the floor or the like, it recognizes that the robot is being held up by the user and notifies the model storage unit 32, the action determination mechanism unit 33, and the voice recognition unit 31A of state recognition information to that effect.
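The classification performed by the pressure processing unit 31C can be sketched as follows. This is only an illustrative sketch: the numeric threshold, the boundary between a “short” and a “long” contact, the `Touch` type, and the function names are all assumptions, since the description gives only “predetermined” values.

```python
from dataclasses import dataclass

# Hypothetical values; the description only says "predetermined threshold",
# "short time" and "long time".
PRESSURE_THRESHOLD = 0.5  # normalized pressure
SHORT_DURATION_S = 0.3    # assumed boundary between short and long contact


@dataclass
class Touch:
    pressure: float    # pressure reported by touch sensor 17
    duration_s: float  # how long the contact lasted, in seconds


def classify_touch(touch):
    """Return a state recognition label for one contact on touch sensor 17."""
    if touch.pressure >= PRESSURE_THRESHOLD and touch.duration_s < SHORT_DURATION_S:
        return "struck"
    if touch.pressure < PRESSURE_THRESHOLD and touch.duration_s >= SHORT_DURATION_S:
        return "stroked"
    return "unknown"  # combinations the description does not cover


def is_held_up(sole_grounded):
    """True when none of the sole sensors 3A1..3D1 reports ground contact."""
    return not any(sole_grounded)
```

For example, `classify_touch(Touch(0.9, 0.1))` yields the “struck” label, and `is_held_up([False, False, False, False])` reports that the robot is off the floor.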
[0041]
The model storage unit 32 stores and manages an emotion model, an instinct model, and a growth model that express the emotion, instinct, and growth state of the robot.
[0042]
Here, the emotion model represents, for example, emotional states (degrees) such as “joyfulness”, “sadness”, and “anger” by values in a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and the like. The instinct model represents, for example, the states (degrees) of instinctive desires such as “appetite”, “sleep desire”, and “exercise desire” by values in a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and the like. The growth model represents, for example, growth states (degrees) such as “childhood”, “adolescence”, “mature age”, and “old age” by values in a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and the like.
[0043]
The model storage unit 32 sends the emotion, instinct, and growth states represented by the values of the emotion model, instinct model, and growth model as described above to the behavior determination mechanism unit 33 as state information.
[0044]
In addition to the state recognition information supplied from the sensor input processing unit 31, the model storage unit 32 is supplied by the behavior determination mechanism unit 33 with behavior information indicating the content of the current or past behavior of the robot, for example, “walked for a long time”. Even when the same state recognition information is given, the model storage unit 32 generates different state information depending on the behavior of the robot indicated by the behavior information.
[0045]
That is, for example, when the robot greets the user and is stroked on the head by the user, behavior information indicating that the robot greeted the user and state recognition information indicating that its head was stroked are given to the model storage unit 32. In this case, the value of the emotion model representing “joyfulness” is increased in the model storage unit 32.
[0046]
On the other hand, when the robot is stroked on the head while performing some work, behavior information indicating that the work is being performed and state recognition information indicating that the head has been stroked are given to the model storage unit 32. In this case, the value of the emotion model representing “joyfulness” is not changed in the model storage unit 32.
[0047]
As described above, the model storage unit 32 sets the value of the emotion model while referring not only to the state recognition information but also to the behavior information indicating the current or past behavior of the robot. This avoids an unnatural emotional change in which, for example, the value of the emotion model expressing “joyfulness” is increased when the user strokes the robot's head while the robot is performing some task.
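The dependence of the emotion model on both inputs can be illustrated with a small sketch. The value range, the step size, and the string labels below are hypothetical; the description only states that the value lies in a predetermined range and is increased in one case and left unchanged in the other.

```python
# Hypothetical range and step size for the "joyfulness" value.
JOY_MIN, JOY_MAX = 0, 100
JOY_STEP = 10


def update_joyfulness(joy, state_recognition, behavior):
    """Return the new "joyfulness" value of the emotion model.

    state_recognition: label from the sensor input processing unit 31
    behavior: label of the current or past behavior (behavior information)
    """
    if state_recognition == "head_stroked" and behavior == "greeting_user":
        joy += JOY_STEP
    # Stroked while performing some task: the value is deliberately unchanged.
    return max(JOY_MIN, min(JOY_MAX, joy))
```

The same “head_stroked” input therefore raises the value after a greeting but leaves it untouched while the robot is working.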
[0048]
Note that the model storage unit 32 also increases or decreases the values of the instinct model and the growth model based on both the state recognition information and the behavior information, as in the emotion model. The model storage unit 32 increases or decreases the values of the emotion model, the instinct model, and the growth model based on the values of other models.
[0049]
The action determination mechanism unit 33 determines the next action based on the state recognition information from the sensor input processing unit 31, the state information from the model storage unit 32, the passage of time, and the like, and sends the content of the determined action to the posture transition mechanism unit 34 as action command information.
[0050]
That is, the behavior determination mechanism unit 33 manages, as a behavior model that defines the behavior of the robot, a finite automaton in which the actions the robot can take correspond to states. It transitions the state in this automaton based on the state recognition information from the sensor input processing unit 31, the values of the emotion model, the instinct model, or the growth model in the model storage unit 32, the passage of time, and the like, and determines the action corresponding to the state after the transition as the next action to be taken.
[0051]
Here, the behavior determination mechanism unit 33 transitions the state when it detects that a predetermined trigger has occurred. That is, the behavior determination mechanism unit 33 transitions the state when, for example, the time during which the behavior corresponding to the current state has been executed reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 32 falls below or rises above a predetermined threshold.
[0052]
Note that, as described above, the behavior determination mechanism unit 33 is based not only on the state recognition information from the sensor input processing unit 31 but also on the emotion model, instinct model, growth model value, and the like in the model storage unit 32. Since the state in the behavior model is transitioned, even if the same state recognition information is input, the state transition destination differs depending on the value (state information) of the emotion model, instinct model, and growth model.
[0053]
As a result, for example, when the state information indicates “not angry” and “not hungry” and the state recognition information indicates that “a palm has been presented in front of the eyes”, the behavior determination mechanism unit 33 generates action command information for taking the action of “hand” in response to the palm being presented in front of the eyes, and sends it to the posture transition mechanism unit 34.
[0054]
In addition, for example, when the state information indicates “not angry” and “hungry” and the state recognition information indicates that “a palm has been presented in front of the eyes”, the behavior determination mechanism unit 33 generates action command information for performing an action such as “flipping the palm” in response to the palm being presented in front of the eyes, and sends it to the posture transition mechanism unit 34.
[0055]
In addition, for example, when the state information indicates “angry” and the state recognition information indicates that “a palm has been presented in front of the eyes”, the behavior determination mechanism unit 33 generates action command information for performing an action such as “looking sideways”, regardless of whether the state information indicates “hungry” or “not hungry”, and sends it to the posture transition mechanism unit 34.
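The three palm examples above amount to a small decision rule in which the same state recognition information leads to different actions depending on the state information. A minimal sketch, with hypothetical string labels standing in for the action command information:

```python
from typing import Optional


def decide_action(angry: bool, hungry: bool, state_recognition: str) -> Optional[str]:
    """Pick the next action for the palm examples in the description.

    The returned labels are hypothetical names for action command information.
    """
    if state_recognition != "palm_presented":
        return None  # other inputs are outside this sketch
    if angry:
        return "look_sideways"  # chosen whether or not the robot is hungry
    if hungry:
        return "flip_palm"
    return "hand"
```

A full behavior model would hold such rules as transitions of the finite automaton described earlier; this sketch covers only the single input discussed in the examples.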
[0056]
Note that the behavior determination mechanism unit 33 can generate, as parameters of the behavior corresponding to the transition destination state, for example, the walking speed and the magnitude and speed of movement when moving the limbs, based on the emotion, instinct, and growth states indicated by the state information supplied from the model storage unit 32. In this case, action command information including these parameters is sent to the posture transition mechanism unit 34.
[0057]
In addition, as described above, the behavior determination mechanism unit 33 generates action command information for causing the robot to speak, in addition to the action command information for operating the head, limbs, and the like of the robot. The action command information for causing the robot to speak is supplied to the voice synthesis unit 36, and it includes text corresponding to the synthesized sound to be generated by the voice synthesis unit 36. When the voice synthesis unit 36 receives the action command information from the behavior determination mechanism unit 33, it generates a synthesized sound based on the text included in the action command information and supplies it to the speaker 18 via the output control unit 37 for output. As a result, for example, the robot's cries, various requests to the user such as “I am hungry”, responses to the user's calls such as “what?”, and other audio are output from the speaker 18.
[0058]
The posture transition mechanism unit 34 generates posture transition information for changing the posture of the robot from the current posture to the next posture based on the action command information supplied from the behavior determination mechanism unit 33, and sends it to the control mechanism unit 35 and the voice recognition unit 31A.
[0059]
Here, the postures to which the robot can transition next from its current posture are determined by, for example, the physical shape of the robot, such as the shape and weight of the torso, hands, and feet and the connection state of each part, and by the mechanisms of the actuators 3AA₁ to 5A₁ and 5A₂, such as the directions and angles in which the joints bend.
[0060]
Further, the next postures include postures to which the robot can transition directly from the current posture and postures to which it cannot. For example, a four-legged robot can transition directly from lying sprawled with its limbs thrown out to a lying-down posture, but it cannot transition directly to a standing posture. A two-step movement is required: the robot first pulls its limbs in close to the torso to take the lying-down posture, and then stands up. There are also postures that cannot be executed safely. For example, a four-legged robot can easily fall over if it tries to raise both front legs in a “banzai” pose from its standing posture on four legs.
[0061]
Therefore, the posture transition mechanism unit 34 registers in advance the postures to which a direct transition is possible. If the action command information supplied from the behavior determination mechanism unit 33 indicates a posture to which a direct transition is possible, the action command information is sent as it is to the control mechanism unit 35 as posture transition information. On the other hand, when the action command information indicates a posture to which a direct transition is not possible, the posture transition mechanism unit 34 generates posture transition information that makes the robot transition to the target posture after temporarily transitioning to another posture to which a transition is possible, and sends it to the control mechanism unit 35. This avoids situations in which the robot tries to force a posture to which it cannot transition, or falls over.
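One way to picture the registered direct transitions is as a directed graph that is searched for a chain of intermediate postures. The sketch below uses a breadth-first search, which is an assumption made for illustration; the description does not state how the intermediate postures are chosen, and the posture names and edges are hypothetical.

```python
from collections import deque

# Hypothetical registry: each posture maps to the postures that can be
# reached from it by a direct transition.
DIRECT_TRANSITIONS = {
    "sprawled": ["lying_down"],
    "lying_down": ["sprawled", "sitting", "standing"],
    "sitting": ["lying_down", "standing"],
    "standing": ["lying_down", "sitting", "walking"],
    "walking": ["standing"],
}


def plan_posture_transition(current, target):
    """Return the shortest chain of registered postures, or None if unreachable."""
    queue = deque([[current]])
    visited = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in DIRECT_TRANSITIONS.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # e.g. an unsafe posture that is never registered
```

With this registry, a request to stand up from the sprawled posture is expanded into the two-step movement through the lying-down posture, and an unregistered posture such as a “banzai” pose is simply rejected.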
[0062]
The posture transition information is also output to the voice recognition unit 31A. When the robot changes its posture, one or more of the actuators 3AA₁ to 5A₂ operate. Therefore, the posture transition mechanism unit 34 outputs the posture transition information to the voice recognition unit 31A so that the voice recognition unit 31A does not recognize the operating sound of these actuators as the voice of the user.
[0063]
The control mechanism unit 35 generates control signals for driving the actuators 3AA₁ to 5A₂ according to the posture transition information from the posture transition mechanism unit 34, and sends them to the actuators 3AA₁ to 5A₂. As a result, the actuators 3AA₁ to 5A₂ are driven according to the control signals, and the robot acts autonomously.
[0064]
The output control unit 37 is supplied with the digital data of the synthesized sound from the voice synthesis unit 36. The output control unit 37 D/A-converts the digital data into an analog voice signal and supplies it to the speaker 18 for output.
[0065]
Next, FIG. 4 shows a configuration example of the voice recognition unit 31A of FIG.
[0066]
The audio signal from the microphone 15 is supplied to an AD (Analog Digital) conversion unit 41. In the AD conversion unit 41, the audio signal that is an analog signal from the microphone 15 is sampled and quantized, and AD converted into audio data that is a digital signal. This voice data is supplied to the feature extraction unit 42 and the voice section detection unit 47.
[0067]
The feature extraction unit 42 receives state recognition information and posture transition information for each appropriate frame of the input speech data, and performs, for example, MFCC (Mel Frequency Cepstrum Coefficient) analysis in accordance with the state of the robot. The analysis result is output to the matching unit 43 as a feature parameter (feature vector). In addition, the feature extraction unit 42 can extract, for example, linear prediction coefficients, cepstrum coefficients, line spectrum pairs, power for each predetermined frequency band (filter bank output), and the like as feature parameters.
[0068]
The matching unit 43 uses the feature parameters from the feature extraction unit 42 and, referring to the acoustic model storage unit 44, the dictionary storage unit 45, and the grammar storage unit 46 as necessary, recognizes the voice input to the microphone 15 (the input speech) based on, for example, the continuous distribution HMM (Hidden Markov Model) method.
[0069]
That is, the acoustic model storage unit 44 stores an acoustic model representing acoustic features such as individual phonemes and syllables in the speech language for speech recognition. Here, since speech recognition is performed based on the continuous distribution HMM method, an HMM (Hidden Markov Model) is used as the acoustic model. The dictionary storage unit 45 stores a word dictionary in which information about pronunciation (phoneme information) is described for each word to be recognized. The grammar storage unit 46 stores grammar rules that describe how each word registered in the word dictionary of the dictionary storage unit 45 is linked (connected). Here, as the grammar rule, for example, a rule based on context-free grammar (CFG), statistical word chain probability (N-gram), or the like can be used.
[0070]
The matching unit 43 refers to the word dictionary in the dictionary storage unit 45 and connects the acoustic models stored in the acoustic model storage unit 44 to configure an acoustic model of each word (word model). Further, the matching unit 43 connects several word models by referring to the grammar rules stored in the grammar storage unit 46 and, using the word models connected in this way, recognizes the voice input to the microphone 15 by the continuous distribution HMM method based on the feature parameters. That is, the matching unit 43 detects the sequence of word models with the highest score (likelihood) of observing the time-series feature parameters output from the feature extraction unit 42, and outputs the phonological information (reading) of the word string corresponding to that sequence of word models as the speech recognition result.
[0071]
More specifically, the matching unit 43 accumulates the appearance probabilities of the feature parameters for the word strings corresponding to the connected word models, uses the accumulated value as a score, and uses the phoneme of the word string that has the highest score. Information is output as a speech recognition result.
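The score accumulation can be pictured with a deliberately simplified sketch. It assumes that the per-frame appearance probabilities of each candidate word string are already available, which skips the actual HMM state alignment, and it accumulates them as a sum of log-probabilities so that long utterances do not underflow. The function name and the input layout are hypothetical.

```python
import math


def best_word_string(frame_probs):
    """Accumulate per-frame probabilities and return the best candidate.

    frame_probs maps each candidate word string to the appearance
    probabilities (each > 0) of the observed feature parameters, one per frame.
    """
    scores = {
        words: sum(math.log(p) for p in probs)
        for words, probs in frame_probs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

A real matcher would obtain these probabilities from the connected word models while searching over state sequences; only the accumulate-and-compare step is shown here.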
[0072]
The recognition result for the speech input to the microphone 15, obtained as described above, is output to the model storage unit 32 and the action determination mechanism unit 33 as state recognition information.
[0073]
Note that the speech section detection unit 47 calculates the speech input level (power) of the speech data from the AD conversion unit 41 for each frame, using the same frames as those on which the feature extraction unit 42 performs MFCC analysis. Further, the speech section detection unit 47 receives the state recognition information and the posture transition information, and detects the speech section in which the user's voice is being input by comparing the speech input level of each frame with a predetermined threshold. That is, a speech section is a section during which no predetermined operation that acts as a noise source (for example, moving the head unit 4) is being performed and which consists of frames whose speech input level is equal to or higher than the predetermined threshold. The speech section detection unit 47 supplies the detected speech section to the feature extraction unit 42 and the matching unit 43, and the feature extraction unit 42 and the matching unit 43 perform processing only on the speech section.
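The per-frame level calculation and the two conditions for a speech section can be sketched as below. Mean squared amplitude is used as the power measure and frames do not overlap; both are assumptions, as are the function names.

```python
def frame_powers(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_frames(powers, threshold, noisy_motion):
    """Mark frames that may belong to a speech section.

    A frame qualifies only if its level reaches the threshold AND no
    noise-source operation (e.g. moving the head unit 4) is in progress.
    noisy_motion holds one boolean per frame.
    """
    return [p >= threshold and not noisy for p, noisy in zip(powers, noisy_motion)]
```

The second function makes explicit that a loud frame is still rejected when the posture transition information says a noisy operation is under way.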
[0074]
Next, the speech recognition process will be described with reference to the flowcharts of FIGS.
[0075]
In step S1, the voice segment detection unit 47 estimates the environmental sound level based on the voice data input via the AD conversion unit 41.
[0076]
Various noises are input to the microphone 15 even when the user is not speaking to the robot, but recognizing the noise as the user's speech causes a malfunction. Therefore, it is necessary to estimate the environmental sound level in a state where the user's utterance is not recognized (speech recognition OFF state).
[0077]
As shown in FIG. 7, the voice input level of the voice data input via the microphone 15 and the AD conversion unit 41 is not constant even in the voice recognition OFF state. Therefore, the environmental sound level is calculated every predetermined short time by the following equations (1) and (2), where ENV is the environmental sound level and P is the current audio input level.
ENV = a × ENV + b × P (1)
a + b = 1.0 (2)
[0078]
Here, a is set to a number relatively close to 1, such as 0.9, and b is set to a value such as 0.1, so that a sudden noise with large power (for example, the sound of a door slamming shut) does not significantly affect the overall environmental sound level.
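Equations (1) and (2) describe an exponentially weighted running average. A minimal sketch, using the example value a = 0.9 from the text (the function name is hypothetical):

```python
def update_env(env, power, a=0.9):
    """One update of the environmental sound level per equations (1) and (2).

    env:   current environmental sound level ENV
    power: current audio input level P
    a:     smoothing weight; b is derived so that a + b = 1.0
    """
    b = 1.0 - a
    return a * env + b * power
```

With a = 0.9, a single burst at level 110 against a background of 10 moves the estimate only to about 20, which is the damping effect described above, while a sustained change in the background is tracked over a few dozen updates.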
[0079]
The estimation of the environmental sound level continues until the audio input level exceeds ENV + T1, where T1 is a predetermined threshold (that is, until it is determined in step S8 or step S13, described later, that the audio input level exceeds ENV + T1).
[0080]
In step S2, the speech section detection unit 47 determines whether or not posture transition information or state recognition information has been input from the posture transition mechanism unit 34 or the pressure processing unit 31C. If it is determined in step S2 that no posture transition information or state recognition information has been input, the process proceeds to step S8.
[0081]
If it is determined in step S2 that posture transition information or state recognition information has been input, in step S3 the voice section detection unit 47 determines, based on the input information, whether or not the actuator that operates is close to the microphone 15. If it is determined in step S3 that the actuator that operates is not close to the microphone 15, the process proceeds to step S6.
[0082]
In step S3, when it is determined that the actuator that performs the operation is close to the microphone 15, in step S4, the speech section detection unit 47 cancels the speech recognition processing. That is, during the operation of the actuator close to the microphone 15, the voice section is not set.
[0083]
In the robot described with reference to FIGS. 1 and 2, the microphone 15 is provided in the head unit 4. The noise generated by the operation of the actuators 4A₁ to 4A_L, which are provided at the connecting part between the head unit 4 and the body unit 2, the connecting part between the head unit 4 and the lower jaw portion 4A, and the like, therefore has a relatively large voice input power, and the reliability of speech recognition during such operation is extremely low. Accordingly, malfunctions can be prevented by not setting a voice section while an actuator close to the microphone 15 is operating.
[0084]
In step S5, the voice section detecting unit 47 determines whether or not the operation is finished based on the input information. If it is determined in step S5 that the operation has not been completed, the process returns to step S4, and the subsequent processes are repeated. If it is determined in step S5 that the operation has been completed, the process returns to step S1, and the subsequent processes are repeated.
[0085]
If it is determined in step S3 that the actuator that operates is not close to the microphone 15, in step S6 the voice section detection unit 47 determines, based on the input information, whether or not the posture transition information indicates the start of a walking motion. If it is determined in step S6 that the start of a walking motion is indicated, the process proceeds to step S13.
[0086]
If it is determined in step S6 that the posture transition information does not indicate the start of a walking motion, the robot is either performing a motion other than walking that does not operate an actuator near the microphone 15, or is being picked up or stroked by the user. In step S7, the voice section detection unit 47 changes the value of the threshold T1 (FIG. 7), which is used to determine that the input is a user's utterance rather than environmental sound (noise), according to the posture transition information or the state recognition information.
[0087]
For example, when the state recognition information input from the pressure processing unit 31C indicates that the robot has been picked up by the user, the voice data input from the microphone 15 contains touch noise and rubbing sounds generated by the user touching the housing surface of the robot. In addition, when the posture transition information input from the posture transition mechanism unit 34 indicates that an actuator provided at a position relatively far from the microphone 15 (for example, the actuator 5A₁ of the tail unit 5) is operating, the voice data input from the microphone 15 contains the operating sound of that actuator. The operating sound contained in the voice data varies depending on the type of the actuator and its distance from the microphone 15.
[0088]
The voice section detection unit 47 stores in advance, for example, a threshold T1a for when the robot is held up by the user, a threshold T1b for when the leg unit 3A or 3B is operating, a threshold T1c for when the leg unit 3C or 3D is operating, and a threshold T1d for when the tail unit 5 is operating, and sets the appropriate threshold according to the input posture transition information or state recognition information in order to detect the voice section.
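The branching of steps S2 to S7 can be summarized in a small lookup sketch. All numeric thresholds, the condition labels, and the returned mode names are hypothetical; the description does not give values for T1, T1a to T1d, or their relative order.

```python
# Hypothetical thresholds for the conditions named in the description.
T1_DEFAULT = 10.0
T1_BY_CONDITION = {
    "held_up": 25.0,        # T1a: robot held up by the user
    "leg_3A_or_3B": 20.0,   # T1b
    "leg_3C_or_3D": 15.0,   # T1c
    "tail_5": 12.0,         # T1d
}
NEAR_MICROPHONE = {"head_4", "lower_jaw_4A"}  # driven by actuators 4A1..4AL


def choose_detection_mode(condition):
    """Map the reported robot condition to (mode, threshold T1).

    condition is None when no posture transition or state recognition
    information has been input (step S2 leading directly to step S8).
    """
    if condition is None:
        return ("detect", T1_DEFAULT)
    if condition in NEAR_MICROPHONE:
        return ("cancel", None)                 # step S4: no voice section is set
    if condition == "walking":
        return ("detect_walking", T1_DEFAULT)   # step S6 leading to step S13
    return ("detect", T1_BY_CONDITION.get(condition, T1_DEFAULT))  # step S7
```

Keeping the thresholds in a table makes it straightforward to add a separate value for each additional noise source.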
[0089]
If it is determined in step S2 that no posture transition information or state recognition information has been input, or after the process of step S7 is completed, in step S8 the voice segment detection unit 47 determines whether or not the voice input level has exceeded the threshold value (T1 + ENV). If it is determined in step S8 that the voice input level does not exceed the threshold value (T1 + ENV), the process returns to step S1 and the subsequent processes are repeated.
[0090]
If it is determined in step S8 that the sound input level exceeds the threshold value (T1 + ENV), in step S9 the sound section detection unit 47 stops estimating the environmental sound level and starts a voice recognition start count using an internal counter or timer (not shown).
[0091]
In step S10, the voice section detection unit 47 determines whether or not the voice recognition start count has exceeded a predetermined value (for example, a value indicated by CNT_ON in FIG. 7). When it is determined in step S10 that the voice recognition start count does not exceed the predetermined value, the process of step S10 is repeated until it is determined that the voice recognition start count exceeds the predetermined value.
[0092]
If it is determined in step S10 that the speech recognition start count has exceeded a predetermined value, the speech segment detection unit 47 outputs the start of the speech segment to the feature extraction unit 42 and the matching unit 43 in step S11. The feature extraction unit 42 and the matching unit 43 execute the speech recognition process described with reference to FIG.
[0093]
In step S12, the voice section detection unit 47 determines whether or not the voice input level is equal to or lower than a threshold value (T2 + ENV).
[0094]
Various noises are input to the microphone 15 even after the user has finished speaking to the robot, but recognizing the noise as the user's speech causes a malfunction. Therefore, when the voice input level falls below a certain value, it is necessary not to perform voice recognition processing (to turn voice recognition OFF).
[0095]
As shown in FIG. 8, the speech section detection unit 47 can determine whether or not the user's utterance has ended by determining whether the voice input level of the voice data input via the microphone 15 and the AD conversion unit 41 has fallen to or below the sum (T2 + ENV) of a predetermined threshold T2 and the environmental sound level ENV at the time the voice recognition process was started.
[0096]
If it is determined in step S12 that the audio input level is not less than or equal to the threshold value (T2 + ENV), the process returns to step S11, and the subsequent processes are repeated. If it is determined in step S12 that the audio input level has become equal to or less than the threshold value (T2 + ENV), the process proceeds to step S18.
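Steps S8 to S12 can be condensed into a sketch that scans a list of per-frame input levels. It assumes one count per frame for the voice recognition start count and a fixed ENV (the estimate is frozen once the level exceeds T1 + ENV); the function name and the frame-based counting are assumptions.

```python
def detect_voice_section(levels, env, t1, t2, cnt_on):
    """Return (start, end) frame indices of the first voice section, or None.

    levels: per-frame voice input levels
    env:    environmental sound level ENV, held fixed during detection
    t1, t2: start and end thresholds T1 and T2
    cnt_on: frames to wait after the level first exceeds ENV + T1
    """
    n = len(levels)
    i = 0
    while i < n and levels[i] <= env + t1:   # step S8: wait for the onset
        i += 1
    if i == n:
        return None
    start = i + cnt_on                       # steps S9 and S10: start count
    if start >= n:
        return None
    end = start
    while end < n and levels[end] > env + t2:  # steps S11 and S12
        end += 1
    return (start, end)  # end is the first frame at or below ENV + T2
```

For levels `[1, 1, 9, 9, 9, 9, 2, 1]` with ENV = 1, T1 = 5, T2 = 3 and a count of one frame, the section runs from frame 3 up to, but not including, frame 6.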
[0097]
If it is determined in step S6 that the posture transition information indicates the start of a walking motion, in step S13 the speech segment detection unit 47 determines whether or not the speech input level has exceeded the threshold value (T1 + ENV). If it is determined in step S13 that the speech input level does not exceed the threshold value (T1 + ENV), the process returns to step S1 and the subsequent processes are repeated.
[0098]
If it is determined in step S13 that the audio input level exceeds the threshold value (T1 + ENV), in step S14 the audio section detection unit 47 stops estimating the environmental sound level and starts a voice recognition start count using an internal counter or timer (not shown).
[0099]
In step S15, the voice section detection unit 47 determines whether or not the voice recognition start count has exceeded a predetermined value (for example, the value indicated by CNT_ON in FIG. 7). If it is determined in step S15 that the voice recognition start count does not exceed the predetermined value, the process of step S15 is repeated until the count is determined to exceed the predetermined value.
[0100]
If it is determined in step S15 that the voice recognition start count has exceeded the predetermined value, in step S16, the voice section detection unit 47 notifies the feature extraction unit 42 and the matching unit 43 that the voice section has started. The feature extraction unit 42 and the matching unit 43 then execute a voice recognition process dedicated to the walking motion.
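Steps S13 to S16 can be illustrated with a minimal Python sketch. It is not the implementation of the voice section detection unit 47. It assumes that the counter advances once per input frame and that levels are plain numbers. The function name `find_section_start` and all numeric values are hypothetical.

```python
from typing import Optional, Sequence


def find_section_start(levels: Sequence[float], t1: float, env: float,
                       cnt_on: int) -> Optional[int]:
    """Return the frame index at which the start of the voice section is reported.

    Assumption: the start counter ticks once per input frame.
    """
    threshold = t1 + env
    for i, level in enumerate(levels):
        if level > threshold:              # step S13: level exceeded T1 + ENV
            count = 0                      # step S14: begin the start count
            for j in range(i + 1, len(levels)):
                count += 1
                if count > cnt_on:         # step S15: count exceeded CNT_ON
                    return j               # step S16: report the section start
            return None                    # input ended before the count elapsed
    return None                            # level never exceeded the threshold


# Hypothetical levels: the threshold T1 + ENV = 40 is first exceeded at frame 2.
start = find_section_start([30, 31, 50, 52, 51, 49, 48], t1=10, env=30, cnt_on=2)
```

As in the description above, the sketch does not re-check the level while the count is running. It only waits for the count to exceed CNT_ON.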
[0101]
As shown in FIG. 9, during the walking motion of the robot, the input voice data contains, in addition to the user's utterance, ground noise generated, for example, when the leg units 3A to 3D touch the floor. To remove this ground noise from the voice data, the feature extraction unit 42, having been notified by the posture transition information that a walking motion is in progress, monitors, for example, the increase and decrease (ΔP) of the voice input level and detects pulse-like noise based on the magnitude of ΔP (as shown in FIG. 9, pulse-like sound data is characterized by a sound level that rises rapidly and falls rapidly after the peak). The feature extraction unit 42 then does not perform feature extraction on that frame. Because ground noise occurs over a very short time, excluding the frames detected as ground noise from the voice recognition process does not impair recognition of the user's utterance.
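The ΔP-based detection can be sketched as follows. This is an illustration under stated assumptions rather than the method of the feature extraction unit 42: a frame is treated as ground noise when its level rises by more than a limit relative to the previous frame and falls by more than the same limit in the next frame. The limit `delta_max`, the function names, and the sample levels are hypothetical.

```python
from typing import List, Sequence


def ground_noise_frames(levels: Sequence[float], delta_max: float) -> List[int]:
    """Return indices of frames that look like a short pulse.

    Assumption: a pulse is a frame whose level rises by more than delta_max
    from the previous frame and falls by more than delta_max in the next one.
    """
    flagged = []
    for i in range(1, len(levels) - 1):
        rise = levels[i] - levels[i - 1]   # ΔP on the way up
        fall = levels[i] - levels[i + 1]   # ΔP on the way down
        if rise > delta_max and fall > delta_max:
            flagged.append(i)
    return flagged


def frames_for_feature_extraction(levels: Sequence[float],
                                  delta_max: float) -> List[int]:
    """Return the frame indices that remain after pulse frames are skipped."""
    skip = set(ground_noise_frames(levels, delta_max))
    return [i for i in range(len(levels)) if i not in skip]


# Hypothetical levels with a single footstep pulse at frame 3.
levels = [40, 42, 41, 75, 43, 44, 46]
pulses = ground_noise_frames(levels, delta_max=20)
kept = frames_for_feature_extraction(levels, delta_max=20)
```

Only the flagged frame is dropped, which matches the observation that ground noise is short enough for its removal not to affect recognition of the utterance.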
[0102]
Alternatively, the feature extraction unit 42 may store in advance a standard sound component of the ground noise produced during a walking motion, filter the input voice data using that component while the robot is walking, and perform feature extraction on the voice data after the noise has been removed.
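The description does not specify the filter. One common possibility is spectral subtraction, shown in the sketch below as a swapped-in technique rather than the patented one. It assumes that each frame is represented by per-band magnitudes and that the stored ground-noise component uses the same bands. The function names and values are hypothetical.

```python
from typing import List, Sequence


def subtract_noise(frame: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Subtract a stored noise magnitude from each band, flooring at zero."""
    if len(frame) != len(noise):
        raise ValueError("frame and noise profile must have the same number of bands")
    return [max(f - n, 0.0) for f, n in zip(frame, noise)]


def filter_if_walking(frame: Sequence[float], noise: Sequence[float],
                      walking: bool) -> List[float]:
    """Apply the stored noise profile only while a walking motion is reported."""
    return subtract_noise(frame, noise) if walking else list(frame)


# Hypothetical band magnitudes for one frame and for the stored ground noise.
frame = [5.0, 3.0, 1.0, 0.5]
stored_noise = [1.0, 2.0, 1.5, 0.5]
cleaned = filter_if_walking(frame, stored_noise, walking=True)
untouched = filter_if_walking(frame, stored_noise, walking=False)
```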
[0103]
In step S17, the voice section detection unit 47 determines whether or not the voice input level is equal to or lower than the threshold value (T2 + ENV) by the same process as in step S12. If it is determined in step S17 that the voice input level is not equal to or lower than the threshold value (T2 + ENV), the process returns to step S16, and the subsequent processes are repeated.
[0104]
If it is determined in step S12 or in step S17 that the voice input level is equal to or lower than the threshold value (T2 + ENV), in step S18, the voice section detection unit 47 starts a voice recognition end count using a built-in counter (timer) (not shown).
[0105]
In step S19, the voice section detection unit 47 determines whether or not the voice recognition end count has exceeded a predetermined value (for example, the value indicated by CNT_OFF in FIG. 8). If it is determined in step S19 that the voice recognition end count does not exceed the predetermined value, the process of step S19 is repeated until the count is determined to exceed the predetermined value.
[0106]
If it is determined in step S19 that the voice recognition end count has exceeded the predetermined value, in step S20, the voice section detection unit 47 outputs a signal indicating that the voice section has ended to the feature extraction unit 42 and the matching unit 43. The feature extraction unit 42 and the matching unit 43 end the voice recognition process, the process returns to step S1, and the subsequent processes are repeated.
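The end-of-section handling in steps S17 to S20 can be sketched as a small frame-by-frame detector. This is an illustrative sketch, not the implementation of the voice section detection unit 47. It assumes that the end counter advances once per frame. The class name `SectionEndDetector` and the numeric values are hypothetical.

```python
from typing import Optional


class SectionEndDetector:
    """Report the end of a voice section after the level drops and CNT_OFF elapses."""

    def __init__(self, t2: float, env: float, cnt_off: int) -> None:
        self.threshold = t2 + env          # T2 + ENV, fixed when recognition started
        self.cnt_off = cnt_off
        self.count: Optional[int] = None   # None until the level first drops

    def push(self, level: float) -> bool:
        """Feed one frame level; return True once the section has ended."""
        if self.count is None:
            if level <= self.threshold:    # steps S12 / S17: level at or below T2 + ENV
                self.count = 0             # step S18: start the end count
            return False
        self.count += 1                    # step S19: keep counting (one tick per frame)
        return self.count > self.cnt_off   # step S20 when the count exceeds CNT_OFF


# Hypothetical levels: the level drops to the threshold (36) at frame 3.
detector = SectionEndDetector(t2=6, env=30, cnt_off=3)
results = [detector.push(x) for x in [55, 52, 50, 35, 34, 33, 33, 32]]
end_frame = results.index(True)
```

In this sketch a caller would stop feeding frames, or create a new detector, once `push` returns True, corresponding to the return to step S1.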
[0107]
In the above description, in step S7 of FIG. 5, the voice section detection unit 47 changes the value of the threshold T1, which is used to determine that the user's utterance rather than an environmental sound (noise) has been input, according to the posture transition information or the state recognition information. However, a threshold T3 that is not affected by the environmental sound level may instead be used as the threshold for detecting the voice section. That is, in step S8, instead of determining whether or not the voice input level exceeds the threshold value (T1 + ENV), it is determined whether or not the level exceeds the threshold T3, which is independent of the environmental sound level ENV.
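The difference between the two start conditions can be shown in a short sketch. The function names and values are hypothetical, and the choice to represent "no T3 configured" with `None` is an assumption made only for illustration.

```python
from typing import Optional


def start_threshold(env: float, t1: float, t3: Optional[float] = None) -> float:
    """Return T3 when an absolute threshold is configured, otherwise T1 + ENV."""
    return t3 if t3 is not None else t1 + env


def utterance_started(level: float, env: float, t1: float,
                      t3: Optional[float] = None) -> bool:
    """Step S8: compare the input level with the selected start threshold."""
    return level > start_threshold(env, t1, t3)


# Hypothetical values: the same level passes T1 + ENV = 40 but not T3 = 45.
relative = utterance_started(42.0, env=30.0, t1=10.0)
absolute = utterance_started(42.0, env=30.0, t1=10.0, t3=45.0)
```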
[0108]
Further, the threshold T3 may be prepared in advance for each kind of posture transition information or state recognition information, or it may be set each time by providing a predetermined learning interval at the start of a motion or a state change and acquiring the average noise component during that interval.
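Setting T3 from a learning interval can be sketched as below. The description says only that an average noise component is acquired. Adding a fixed margin on top of the average is an assumption of this sketch, as are the function name `learn_t3` and the sample values.

```python
from statistics import mean
from typing import Sequence


def learn_t3(learning_levels: Sequence[float], margin: float) -> float:
    """Set T3 from the average level observed during the learning interval.

    Assumption: T3 is the average noise level plus a fixed margin.
    """
    if not learning_levels:
        raise ValueError("the learning interval must contain at least one frame")
    return mean(learning_levels) + margin


# Hypothetical noise levels captured just after a motion starts.
t3 = learn_t3([30.0, 34.0, 32.0], margin=8.0)
```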
[0109]
The case where the present invention is applied to an entertainment robot (a robot as a pseudo pet) has been described above. However, the present invention is not limited to this and can be widely applied to various robots such as industrial robots. Further, the present invention can be applied not only to a robot in the real world but also to a virtual robot displayed on a display device such as a liquid crystal display.
[0110]
Furthermore, in the present embodiment, the series of processes described above is performed by causing the CPU 10A (FIG. 2) to execute a program. However, the series of processes can also be performed by dedicated hardware.
[0111]
The program can be stored in advance in the memory 10B (FIG. 2), or it can be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto-Optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as so-called packaged software and installed in the robot (memory 10B).
[0112]
The program can also be transferred from a download site, either wirelessly via an artificial satellite for digital satellite broadcasting or by wire via a network such as a LAN (Local Area Network) or the Internet, and installed in the memory 10B.
[0113]
In this case, when the program is upgraded, the upgraded program can be easily installed in the memory 10B.
[0114]
In this specification, the processing steps that describe the program for causing the CPU 10A to perform various kinds of processing need not be processed in time series in the order shown in the flowcharts. They also include processing executed in parallel or individually (for example, parallel processing or object-based processing).
[0115]
The program may be processed by only one CPU, or may be distributedly processed by a plurality of CPUs.
[0116]
[Effect of the invention]
According to the robot control apparatus, the robot control method, and the program recorded on the recording medium of the present invention, voice recognition is performed based on the state and actions of the robot, so that the voice uttered by the user can be distinguished from noise generated by the robot's motion and the like, and erroneous recognition can be prevented.
[Brief description of the drawings]
FIG. 1 is a perspective view showing an external configuration example of an embodiment of a robot to which the present invention is applied.
FIG. 2 is a block diagram illustrating an internal configuration example of a robot.
FIG. 3 is a block diagram illustrating a functional configuration example of a controller.
FIG. 4 is a block diagram illustrating a configuration example of a voice recognition unit.
FIG. 5 is a flowchart for explaining voice recognition processing.
FIG. 6 is a flowchart for explaining voice recognition processing.
FIG. 7 is a diagram for explaining the start of a voice recognition section.
FIG. 8 is a diagram for explaining the end of a voice recognition section.
FIG. 9 is a diagram for explaining ground noise of a leg unit during a walking motion.
[Explanation of symbols]
4 head unit, 4A lower jaw, 10 controller, 10A CPU, 10B memory, 15 microphone, 16 CCD camera, 17 touch sensor, 18 speaker, 31 sensor input processing unit, 31A voice recognition unit, 31B image recognition unit, 31C pressure processing unit, 32 model storage unit, 33 action determination mechanism unit, 34 posture transition mechanism unit, 35 control mechanism unit, 36 speech synthesis unit, 37 output control unit, 41 AD conversion unit, 42 feature extraction unit, 43 matching unit, 44 acoustic model storage unit, 45 dictionary storage unit, 46 grammar storage unit, 47 voice section detection unit

Claims

A robot control apparatus that controls a robot that acts based on at least a voice recognition result, comprising:
voice input means for receiving input of voice data;
state recognition information generating means for generating state recognition information indicating a recognition result of the state of the robot;
posture transition information generating means for generating posture transition information indicating a transition of the posture of the robot;
storage means for storing data corresponding to a noise component generated when the robot is performing a walking motion; and
recognition means for determining, based on the state recognition information generated by the state recognition information generating means or the posture transition information generated by the posture transition information generating means, a section in which voice recognition is to be performed in the voice data input by the voice input means, and for recognizing the voice data of the determined section,
wherein, when the recognition means determines, based on the posture transition information, that the robot is performing the walking motion, the recognition means recognizes the voice data after filtering the voice data using the data corresponding to the noise component stored in the storage means.

The robot control apparatus according to claim 1, wherein the posture transition information includes information indicating which of a plurality of drive units of the robot is to be driven, and
the recognition means does not recognize the voice data when the drive unit to be driven is located close to the voice input means.

The robot control apparatus, wherein, when the robot is performing the walking motion, the recognition means recognizes the voice data excluding frames that contain the noise component generated by the walking motion.

The robot control apparatus according to claim 1, wherein the posture transition information includes information indicating which of a plurality of drive units of the robot is to be driven, and
the recognition means performs voice recognition, based on the posture transition information, in consideration of noise generated by driving of the drive unit.

The robot control apparatus according to claim 1, wherein the state recognition information includes information indicating whether or not the robot is being touched by a user, and
the recognition means performs voice recognition, based on the state recognition information, in consideration of noise generated when the user touches the robot.

The robot control apparatus according to claim 1, further comprising storage means for storing a predetermined threshold value corresponding to noise generated by a transition of the state or posture of the robot, and
estimation means for estimating an environmental sound when the recognition means is not performing voice recognition,
wherein the recognition means determines the start of the section in which voice recognition is performed, based on the state recognition information generated by the state recognition information generating means or the posture transition information generated by the posture transition information generating means, using the threshold value stored in the storage means and the environmental sound estimated by the estimation means.

The robot control apparatus according to claim 1, further comprising storage means for storing a predetermined threshold value corresponding to noise generated by a transition of the state or posture of the robot,
wherein the recognition means determines the start of the section in which voice recognition is performed, based on the state recognition information generated by the state recognition information generating means or the posture transition information generated by the posture transition information generating means, using the threshold value stored in the storage means.

The robot control apparatus, further comprising setting means for setting, based on the voice data input by the voice input means, a threshold value corresponding to noise generated by a transition of the state or posture of the robot,
wherein the recognition means determines the start of the section in which voice recognition is performed in the voice data input by the voice input means, using the threshold value set by the setting means.

A robot control method for a robot control apparatus that controls a robot that acts based on at least a voice recognition result, the method comprising:
a voice input step of receiving input of voice data;
a state recognition information generating step of generating state recognition information indicating a recognition result of the state of the robot;
a posture transition information generating step of generating posture transition information indicating a transition of the posture of the robot; and
a recognition step of determining, based on the state recognition information generated by the processing of the state recognition information generating step or the posture transition information generated by the processing of the posture transition information generating step, a section in which voice recognition is to be performed in the voice data input by the processing of the voice input step, and recognizing the voice data of the determined section,
wherein, when it is determined in the processing of the recognition step, based on the posture transition information, that the robot is performing a walking motion, the voice data is recognized after being filtered using data that is stored in storage means and corresponds to a noise component generated when the robot is performing the walking motion.

A recording medium on which is recorded a computer-readable program for a robot control apparatus that controls a robot that acts based on at least a voice recognition result, the program comprising:
a voice input step of receiving input of voice data;
a state recognition information generating step of generating state recognition information indicating a recognition result of the state of the robot;
a posture transition information generating step of generating posture transition information indicating a transition of the posture of the robot; and
a recognition step of determining, based on the state recognition information generated by the processing of the state recognition information generating step or the posture transition information generated by the processing of the posture transition information generating step, a section in which voice recognition is to be performed in the voice data input by the processing of the voice input step, and recognizing the voice data of the determined section,
wherein, when it is determined in the processing of the recognition step, based on the posture transition information, that the robot is performing a walking motion, the voice data is recognized after being filtered using data that is stored in storage means and corresponds to a noise component generated when the robot is performing the walking motion.