JP2002116794A

JP2002116794A - Robot controller and method for robot control and recording medium

Info

Publication number: JP2002116794A
Application number: JP2000310492A
Authority: JP
Inventors: Kazuo Ishii; 和夫石井; Jun Hiroi; 順広井; Wataru Onoki; 渡小野木; Takashi Toyoda; 崇豊田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-10-11
Filing date: 2000-10-11
Publication date: 2002-04-19
Anticipated expiration: 2020-10-11
Also published as: JP4587009B2

Abstract

PROBLEM TO BE SOLVED: To prevent voice misrecognition caused by noise that the robot itself generates. SOLUTION: A voice signal input to a microphone is converted into a digital signal by an AD conversion part 41 and output to a voice section detection part 47 and a feature extraction part 42. Based on state recognition information and attitude transition information, the voice section detection part 47 judges that the signal is not a voice section while an actuator near the microphone is operating and cancels voice recognition processing; it also varies the threshold used to determine the start of a voice section according to the robot's operation and state. While the robot is walking, the feature extraction part 42 extracts features from voice data from which the pulse-like ground-contact noise generated during walking has been removed by frame removal, filtering, or the like.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] [Technical field of the invention] The present invention relates to a robot control device, a robot control method, and a recording medium, and more particularly to a robot control device, a robot control method, and a recording medium suitable for a robot that acts on the basis of voice recognition results from a voice recognition device.

[0002] [Description of the related art] In recent years, robots (in this specification, including stuffed-toy types) have been commercialized, for example as toys, that recognize speech uttered by a user and, based on the voice recognition result, perform actions such as making a certain gesture or outputting synthesized sound.

[0003] [Problems to be solved by the invention] Such a robot is configured to accept voice input at all times. However, actuator noise generated while the robot is operating, the ground-contact pulse sound generated when the robot walks, or touch noise generated when the user touches the robot has in some cases been erroneously detected as speech uttered by the user.

[0004] The present invention has been made in view of this situation. By performing voice recognition based on the state and behavior of the robot, it distinguishes speech uttered by the user from noise generated by the robot's own operation and the like, and thereby prevents misrecognition.

[0005] [Means for solving the problems] The robot control device of the present invention comprises: voice input means for receiving input of voice data; first generation means for generating first information indicating the state of the robot; second generation means for generating second information indicating the behavior of the robot; and recognition means for recognizing the voice data input by the voice input means on the basis of the first information generated by the first generation means or the second information generated by the second generation means.

[0006] The second information may include information indicating which of the robot's plural drive units performs a driving operation. When the drive unit being driven is located close to the voice input means, the recognition means may be kept from recognizing the voice data.
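
The gating described in this paragraph can be sketched as follows. This is only an illustrative sketch: the actuator IDs in `NEAR_MIC_ACTUATORS` and the function name `recognition_enabled` are hypothetical and do not come from the specification, which does not say which drive units count as close to the microphone.

```python
# Hypothetical IDs of actuators assumed to sit close to the microphone
# (for example, actuators in the head unit). Not taken from the source.
NEAR_MIC_ACTUATORS = frozenset({"4A_1", "4A_2"})


def recognition_enabled(driven_actuators):
    """Return False while any actuator near the microphone is being driven."""
    return NEAR_MIC_ACTUATORS.isdisjoint(driven_actuators)
```

With these assumed IDs, driving only a leg actuator leaves recognition enabled, while driving a head actuator suspends it.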

[0007] The second information may include information indicating whether the robot is walking. When the robot is walking, the recognition means may be made to recognize voice data from which frames containing noise components generated by the walking motion have been removed.
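
A minimal sketch of this frame removal follows. It assumes that the frames containing walking noise have already been identified by index (for instance from foot-contact timing); how they are identified is not specified here, so `noisy_frame_indices` and the function name are hypothetical.

```python
def frames_for_recognition(frames, is_walking, noisy_frame_indices):
    """Return the frames to pass to recognition.

    While walking, frames whose index is listed in noisy_frame_indices
    (an assumed input) are dropped; otherwise all frames are kept.
    """
    if not is_walking:
        return list(frames)
    noisy = set(noisy_frame_indices)
    return [frame for i, frame in enumerate(frames) if i not in noisy]
```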

[0008] Storage means for storing data corresponding to the noise component generated while the robot is walking may further be provided. The second information may include information indicating whether the robot is walking. When the robot is walking, the recognition means may be made to recognize the voice data after it has been filtered using the data corresponding to the noise component stored in the storage means.
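
The paragraph does not name a specific filter. As one possible reading, the sketch below uses simple spectral subtraction, a technique chosen here for illustration and not stated in the source: a stored noise magnitude spectrum is subtracted bin by bin from the magnitude spectrum of each frame. The function name and the `floor` parameter are assumptions.

```python
def subtract_noise(frame_spectrum, stored_noise_spectrum, floor=0.0):
    """Subtract a stored noise magnitude spectrum from a frame, bin by bin.

    Results are clamped at `floor` so that no bin becomes negative.
    """
    return [
        max(magnitude - noise, floor)
        for magnitude, noise in zip(frame_spectrum, stored_noise_spectrum)
    ]
```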

[0009] The second information may include information indicating which of the robot's plural drive units performs a driving operation, and the recognition means may be made to perform voice recognition that, based on the second information, takes into account the noise generated when the drive unit is driven.

[0010] The first information may include information indicating whether the robot is being touched by the user, and the recognition means may be made to perform voice recognition that, based on the first information, takes into account the noise generated because the user is touching the robot.

[0011] Storage means for storing predetermined thresholds corresponding to the noise generated by the state or behavior of the robot, and estimation means for estimating the environmental sound while the recognition means is not performing voice recognition, may further be provided. Based on the first information generated by the first generation means or the second information generated by the second generation means, the recognition means may be made to determine the start of the section in which voice recognition is performed, using the threshold stored in the storage means and the environmental sound estimated by the estimation means.
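
The start-of-section decision in this paragraph might be sketched as below. The state names, threshold values, the averaging used for the environmental sound estimate, and the rule that a section starts when frame power exceeds the estimate plus the threshold are all assumptions made for illustration; the paragraph only says that both quantities are used.

```python
# Hypothetical per-state thresholds, in the same unit as frame power.
STATE_THRESHOLDS = {"idle": 2.0, "walking": 6.0, "touched": 4.0}


def estimate_ambient(powers_outside_recognition):
    """Average frame power observed while recognition is not running."""
    return sum(powers_outside_recognition) / len(powers_outside_recognition)


def find_section_start(frame_powers, state, ambient):
    """Index of the first frame whose power exceeds ambient + threshold.

    Returns None if no frame qualifies.
    """
    limit = ambient + STATE_THRESHOLDS[state]
    for i, power in enumerate(frame_powers):
        if power > limit:
            return i
    return None
```

Under these assumptions, the same input starts a section later while the robot is walking, because the walking threshold is higher.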

[0012] Storage means for storing predetermined thresholds corresponding to the noise generated by the state or behavior of the robot may further be provided. Based on the first information generated by the first generation means or the second information generated by the second generation means, the recognition means may be made to determine the start of the section in which voice recognition is performed, using the threshold stored in the storage means.

[0013] Setting means for setting, based on the voice data input by the voice input means, a threshold corresponding to the noise generated by the state or behavior of the robot may further be provided, and the recognition means may be made to determine the start of the section in which voice recognition is performed, using the threshold set by the setting means.

[0014] The robot control method of the present invention comprises: a voice input step of receiving input of voice data; a first generation step of generating first information indicating the state of the robot; a second generation step of generating second information indicating the behavior of the robot; and a recognition step of recognizing the voice data input in the voice input step on the basis of the first information generated in the first generation step or the second information generated in the second generation step.

[0015] The program recorded on the recording medium of the present invention comprises: a voice input step of receiving input of voice data; a first generation step of generating first information indicating the state of the robot; a second generation step of generating second information indicating the behavior of the robot; and a recognition step of recognizing the voice data input in the voice input step on the basis of the first information generated in the first generation step or the second information generated in the second generation step.

[0016] In the robot control device, the robot control method, and the program recorded on the recording medium of the present invention, voice data is input, first information indicating the state of the robot is generated, second information indicating the behavior of the robot is generated, and the input voice data is recognized on the basis of the generated first information or the generated second information.

[0017] [Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings.

[0018] FIG. 1 shows an example of the external configuration of one embodiment of a robot to which the present invention is applied, and FIG. 2 shows an example of its electrical configuration.

[0019] In this embodiment, the robot has the shape of a four-legged animal such as a dog. Leg units 3A, 3B, 3C, and 3D are connected to the front, rear, left, and right of a body unit 2, and a head unit 4 and a tail unit 5 are connected to the front end and the rear end of the body unit 2, respectively.

[0020] The tail unit 5 extends from a base portion 5B provided on the upper surface of the body unit 2 so that it can bend or swing with two degrees of freedom.

[0021] The body unit 2 houses a controller 10 that controls the entire robot, a battery 11 that serves as the robot's power source, an internal sensor unit 14 consisting of a battery sensor 12 and a heat sensor 13, and the like.

[0022] In the head unit 4, a microphone 15 corresponding to the "ears", a CCD (Charge Coupled Device) camera 16 corresponding to the "eyes", a touch sensor 17 corresponding to the sense of touch, a speaker 18 corresponding to the "mouth", and the like are each disposed at predetermined positions. A lower jaw portion 4A corresponding to the lower jaw of the mouth is movably attached to the head unit 4 with one degree of freedom, and moving the lower jaw portion 4A opens and closes the robot's mouth.

[0023] As shown in FIG. 2, actuators 3AA_1 to 3AA_K, 3BA_1 to 3BA_K, 3CA_1 to 3CA_K, 3DA_1 to 3DA_K, 4A_1 to 4A_L, 5A_1, and 5A_2 are disposed at the joints of the leg units 3A to 3D, the joints between each of the leg units 3A to 3D and the body unit 2, the joint between the head unit 4 and the body unit 2, the joint between the head unit 4 and the lower jaw portion 4A, the joint between the tail unit 5 and the body unit 2, and so on.

[0024] Sensors 3A_1 to 3D_1 are provided on the ground-contact portions (the portions corresponding to the soles of the feet) of the leg units 3A to 3D. Each sensor detects whether the corresponding leg unit is in contact with the ground (for example, whether it is touching the floor) and outputs the result to the controller 10.

[0025] The microphone 15 in the head unit 4 picks up surrounding sound, including utterances from the user, and sends the resulting voice signal to the controller 10. The CCD camera 16 captures images of the surroundings and sends the resulting image signal to the controller 10.

[0026] The touch sensor 17 is provided, for example, on the upper part of the head unit 4. It detects the pressure applied by a physical action from the user, such as stroking or hitting, and sends the detection result to the controller 10 as a pressure detection signal.

[0027] The battery sensor 12 in the body unit 2 detects the remaining charge of the battery 11 and sends the detection result to the controller 10 as a remaining battery level detection signal. The heat sensor 13 detects heat inside the robot and sends the detection result to the controller 10 as a heat detection signal.

[0028] The controller 10 incorporates a CPU (Central Processing Unit) 10A, a memory 10B, and the like, and performs various kinds of processing by having the CPU 10A execute a control program stored in the memory 10B.

[0029] That is, based on the voice signal, image signal, pressure detection signal, remaining battery level detection signal, and heat detection signal supplied from the microphone 15, the CCD camera 16, the touch sensor 17, the sensors 3A_1 to 3D_1, the battery sensor 12, and the heat sensor 13, the controller 10 determines the surrounding conditions and whether there has been a command or other approach from the user.

[0030] The controller 10 then determines the next action based on this determination and other factors, and, based on that decision, drives whichever of the actuators 3AA_1 to 3AA_K, 3BA_1 to 3BA_K, 3CA_1 to 3CA_K, 3DA_1 to 3DA_K, 4A_1 to 4A_L, 5A_1, and 5A_2 are required. This makes the head unit 4 swing up, down, left, and right, or opens and closes the lower jaw portion 4A. It also moves the tail unit 5, or drives the leg units 3A to 3D to make the robot perform actions such as walking.

[0031] As required, the controller 10 also generates synthesized sound and supplies it to the speaker 18 for output, or turns on, turns off, or blinks LEDs (Light Emitting Diodes), not shown, provided at the positions of the robot's "eyes".

[0032] In this way, the robot acts autonomously based on the surrounding conditions and the like.

[0033] Next, FIG. 3 shows an example of the functional configuration of the controller 10 of FIG. 2. The functional configuration shown in FIG. 3 is realized by the CPU 10A executing the control program stored in the memory 10B.

[0034] The controller 10 consists of: a sensor input processing unit 31 that recognizes specific external states; a model storage unit 32 that accumulates the recognition results of the sensor input processing unit 31 and expresses the states of emotion, instinct, and growth; an action determination mechanism unit 33 that determines the next action based on, among other things, the recognition results of the sensor input processing unit 31; a posture transition mechanism unit 34 that actually makes the robot act based on the decision of the action determination mechanism unit 33; a control mechanism unit 35 that drives and controls the actuators 3AA_1 to 5A_2; a voice synthesis unit 36 that generates synthesized sound; and an output control unit 37 that controls output of the synthesized sound produced by the voice synthesis unit 36.

[0035] Based on the voice signal, image signal, pressure detection signal, and other signals supplied from the microphone 15, the CCD camera 16, the touch sensor 17, the sensors 3A_1 to 3D_1, and the like, the sensor input processing unit 31 recognizes specific external states, specific approaches from the user, instructions from the user, and so on, and notifies the model storage unit 32 and the action determination mechanism unit 33 of state recognition information representing the recognition result.

[0036] That is, the sensor input processing unit 31 has a voice recognition unit 31A, which performs voice recognition on the voice signal supplied from the microphone 15. The voice recognition unit 31A then notifies the model storage unit 32 and the action determination mechanism unit 33 of the voice recognition result, for example a command such as "walk", "lie down", or "chase the ball", as state recognition information.

[0037] The voice recognition unit 31A is also configured to execute its voice recognition processing while monitoring the state of the robot, based on the state recognition information input from the pressure processing unit 31C and the robot's posture transition information input from the posture transition mechanism unit 34.

[0038] The sensor input processing unit 31 also has an image recognition unit 31B, which performs image recognition processing using the image signal supplied from the CCD camera 16. When this processing detects, for example, "a red, round object" or "a plane perpendicular to the ground and at least a predetermined height", the image recognition unit 31B notifies the model storage unit 32 and the action determination mechanism unit 33 of an image recognition result such as "there is a ball" or "there is a wall" as state recognition information.

[0039] The sensor input processing unit 31 further has a pressure processing unit 31C, which processes the pressure detection signals supplied from the touch sensor 17 and the sensors 3A_1 to 3D_1.

[0040] When, as a result of this processing, the pressure processing unit 31C detects from the touch sensor 17 a pressure that is at or above a predetermined threshold and of short duration, it recognizes that the robot has been "hit (scolded)"; when it detects a pressure that is below the predetermined threshold and of long duration, it recognizes that the robot has been "stroked (praised)". It notifies the model storage unit 32, the action determination mechanism unit 33, and the voice recognition unit 31A of the recognition result as state recognition information. In addition, when the pressure processing unit 31C detects, based on the signals input from the sensors 3A_1 to 3D_1, that none of the leg units 3A to 3D is in contact with the floor or the like, it recognizes that the robot is being picked up by the user and notifies the model storage unit 32, the action determination mechanism unit 33, and the voice recognition unit 31A of that recognition result as state recognition information.
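
The two recognition rules in this paragraph can be sketched as follows. The numeric values of `pressure_threshold` and `short_s` and the function names are hypothetical; the specification only speaks of a "predetermined threshold" and of short or long duration.

```python
def classify_touch(pressure, duration_s, pressure_threshold=0.5, short_s=0.3):
    """Classify a touch-sensor event as "hit", "stroked", or None.

    The default threshold and duration limit are illustrative assumptions.
    """
    if pressure >= pressure_threshold and duration_s < short_s:
        return "hit"  # strong and brief: hit (scolded)
    if pressure < pressure_threshold and duration_s >= short_s:
        return "stroked"  # weak and sustained: stroked (praised)
    return None  # combinations the paragraph does not describe


def is_picked_up(grounded_flags):
    """True when none of the four foot sensors reports ground contact."""
    return not any(grounded_flags)
```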

[0041] The model storage unit 32 stores and manages an emotion model, an instinct model, and a growth model, which express the robot's states of emotion, instinct, and growth, respectively.

[0042] The emotion model represents the state (degree) of emotions such as "joy", "sadness", "anger", and "fun", each by a value within a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and so on. The instinct model represents the state (degree) of instinctive desires such as "appetite", "desire for sleep", and "desire for exercise", each by a value within a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and so on. The growth model represents the state (degree) of growth, such as "childhood", "adolescence", "middle age", and "old age", each by a value within a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 31, the passage of time, and so on.

[0043] The model storage unit 32 sends the states of emotion, instinct, and growth represented by the values of the emotion model, instinct model, and growth model described above to the action determination mechanism unit 33 as state information.

[0044] In addition to the state recognition information from the sensor input processing unit 31, the model storage unit 32 is supplied by the action determination mechanism unit 33 with action information indicating the robot's current or past actions, specifically the content of an action such as "walked for a long time". Even when the same state recognition information is given, the model storage unit 32 generates different state information depending on the robot's action indicated by the action information.

[0045] For example, when the robot greets the user and the user strokes its head, action information indicating that the robot greeted the user and state recognition information indicating that its head was stroked are given to the model storage unit 32. In this case, the model storage unit 32 increases the value of the emotion model representing "joy".

[0046] On the other hand, when the robot's head is stroked while it is carrying out some task, action information indicating that a task is being carried out and state recognition information indicating that its head was stroked are given to the model storage unit 32. In this case, the model storage unit 32 does not change the value of the emotion model representing "joy".

[0047] In this way, the model storage unit 32 sets the values of the emotion model by referring not only to the state recognition information but also to the action information indicating the robot's current or past actions. This avoids unnatural changes in emotion, such as the value of the emotion model representing "joy" increasing when the user strokes the robot's head as a prank while it is carrying out some task.
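
The dependence of the "joy" value on both kinds of information can be sketched as below. The label strings, the step size, and the 0 to 100 range are hypothetical; the specification only says that values lie in a predetermined range.

```python
def update_joy(joy, recognition, action, step=10, lo=0, hi=100):
    """Return the new "joy" value given state recognition and action info.

    Stroking after a greeting raises joy; stroking during a task leaves it
    unchanged. Step size and value range are illustrative assumptions.
    """
    if recognition == "stroked" and action == "greeted_user":
        joy += step
    # For any other combination, including "executing_task", joy is unchanged.
    return max(lo, min(hi, joy))
```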

[0048] As with the emotion model, the model storage unit 32 also increases or decreases the values of the instinct model and the growth model based on both the state recognition information and the action information. The model storage unit 32 also increases or decreases the values of each of the emotion model, the instinct model, and the growth model based on the values of the other models.

[0049] The action determination mechanism unit 33 determines the next action based on the state recognition information from the sensor input processing unit 31, the state information from the model storage unit 32, the passage of time, and so on, and sends the content of the determined action to the posture transition mechanism unit 34 as action command information.

[0050] That is, the action determination mechanism unit 33 manages, as a behavior model that defines the robot's behavior, a finite automaton in which the actions the robot can take are associated with states. It makes the state of this finite automaton transition based on the state recognition information from the sensor input processing unit 31, the values of the emotion model, instinct model, or growth model in the model storage unit 32, the passage of time, and so on, and determines the action corresponding to the state after the transition as the action to be taken next.

[0051] The action determination mechanism unit 33 makes the state transition when it detects that a predetermined trigger has occurred. That is, it makes the state transition when, for example, the time spent executing the action corresponding to the current state reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 32 falls to or below, or rises to or above, a predetermined threshold.
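
A finite automaton driven by triggers, as described here, can be sketched with a transition table. The state names and trigger names below are hypothetical examples and are not taken from the specification.

```python
# Hypothetical (state, trigger) -> next state table.
TRANSITIONS = {
    ("sitting", "heard_walk_command"): "walking",
    ("walking", "time_limit_reached"): "sitting",
    ("walking", "wall_detected"): "sitting",
    ("sitting", "sleepiness_above_threshold"): "sleeping",
}


def next_state(state, trigger):
    """Return the state after the trigger, or stay put if no rule matches."""
    return TRANSITIONS.get((state, trigger), state)
```

The action associated with the returned state would then be the next action to take.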

[0052] As described above, the action determination mechanism unit 33 makes the state of the behavior model transition based not only on the state recognition information from the sensor input processing unit 31 but also on the values of the emotion model, instinct model, and growth model in the model storage unit 32. Consequently, even when the same state recognition information is input, the destination of the state transition differs depending on the values of the emotion model, instinct model, and growth model (the state information).

[0053] As a result, when, for example, the state information indicates that the robot is "not angry" and "not hungry", and the state recognition information indicates that "a palm has been held out in front of its eyes", the action determination mechanism unit 33 generates action command information that makes the robot "give its paw" in response to the palm being held out, and sends it to the posture transition mechanism unit 34.

[0054] When, for example, the state information indicates that the robot is "not angry" and "hungry", and the state recognition information indicates that "a palm has been held out in front of its eyes", the action determination mechanism unit 33 generates action command information that makes the robot "lick the palm" in response to the palm being held out, and sends it to the posture transition mechanism unit 34.

[0055] When, for example, the state information indicates that the robot is "angry", and the state recognition information indicates that "a palm has been held out in front of its eyes", the action determination mechanism unit 33 generates action command information that makes the robot "turn its head away in a huff", regardless of whether the state information indicates that it is "hungry" or "not hungry", and sends it to the posture transition mechanism unit 34.
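
The three palm examples can be collected into one small decision function. The function name, argument names, and returned labels are hypothetical; only the mapping from (angry, hungry) to the resulting action follows the text.

```python
def decide_action(angry, hungry, recognition):
    """Choose an action for the palm example, given the state information.

    Returns None for recognition results this sketch does not cover.
    """
    if recognition != "palm_presented":
        return None
    if angry:
        return "turn_away"  # chosen regardless of hunger
    return "lick_palm" if hungry else "give_paw"
```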

【００５６】Note that the action determination mechanism unit 33 can also determine, on the basis of the emotion, instinct, and growth states indicated by the state information supplied from the model storage unit 32, parameters of the action corresponding to the transition-destination state, such as the walking speed or the magnitude and speed of limb movements. In that case, action command information containing those parameters is sent to the posture transition mechanism unit 34.

【００５７】In addition to the action command information that moves the robot's head, limbs, and so on, as described above, the action determination mechanism unit 33 also generates action command information that makes the robot speak. The action command information that makes the robot speak is supplied to the speech synthesis unit 36, and it contains, for example, text corresponding to the synthesized sound that the speech synthesis unit 36 is to generate. On receiving action command information from the action determination mechanism unit 33, the speech synthesis unit 36 generates a synthesized sound based on the text contained in that information and supplies it to the speaker 18 through the output control unit 37 for output. As a result, the speaker 18 outputs, for example, the robot's cry, requests to the user such as "I'm hungry", responses to the user's call such as "What?", and other sounds.

【００５８】Based on the action command information supplied from the action determination mechanism unit 33, the posture transition mechanism unit 34 generates posture transition information for moving the robot from its current posture to the next posture, and sends it to the control mechanism unit 35 and the voice recognition unit 31A.

【００５９】Here, the postures to which the robot can move next from its current posture are determined by the physical shape of the robot, such as the shape and weight of the body, arms, and legs and the way the parts are connected, and by the mechanisms of the actuators 3AA₁ through 5A₁ and 5A₂, such as the directions and angles in which the joints bend.

【００６０】The next posture may be one that can be reached directly from the current posture or one that cannot. For example, a four-legged robot lying sprawled with its limbs thrown out can move directly to a lying-down (prone) posture, but it cannot move directly to a standing posture; it needs a two-stage motion in which it first pulls its limbs in close to the body to take the prone posture and then stands up. There are also postures that cannot be executed safely. For example, a four-legged robot standing on all four legs will easily fall over if it tries to raise both front legs in a "banzai" pose.

【００６１】For this reason, the posture transition mechanism unit 34 registers in advance the postures that can be reached directly. When the action command information supplied from the action determination mechanism unit 33 indicates a directly reachable posture, the unit sends that action command information as is to the control mechanism unit 35 as posture transition information. When the action command information indicates a posture that cannot be reached directly, the posture transition mechanism unit 34 generates posture transition information that first moves the robot to another reachable posture and then to the target posture, and sends it to the control mechanism unit 35. This prevents the robot from trying to force a posture it cannot reach and from falling over.
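
The routing through intermediate postures described in paragraphs 0060 and 0061 can be illustrated as a path search over a table of registered direct transitions. This is a minimal sketch under stated assumptions: the posture names and the table contents are hypothetical, and breadth-first search is chosen here for illustration; the text does not say which search method the posture transition mechanism unit 34 uses.

```python
from collections import deque

# Hypothetical table of directly reachable postures (registered in advance).
DIRECT = {
    "sprawled": ["lying_down"],
    "lying_down": ["sprawled", "standing"],
    "standing": ["lying_down"],
}

def plan_transition(current, target, direct=DIRECT):
    """Return the shortest posture sequence from current to target, or None."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in direct.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target is not registered as reachable (e.g. an unsafe pose)
```

With this table, a request to go from "sprawled" to "standing" is expanded into the two-stage sequence via "lying_down", and an unregistered posture yields no plan at all.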

【００６２】The posture transition information is also output to the voice recognition unit 31A. When the robot changes its posture, one or more of the actuators 3AA₁ through 5A₂ operate. The posture transition mechanism unit 34 therefore outputs the posture transition information to the voice recognition unit 31A so that the voice recognition unit 31A does not mistake the operating sounds of these actuators for the user's voice.

【００６３】The control mechanism unit 35 generates control signals for driving the actuators 3AA₁ through 5A₂ in accordance with the posture transition information from the posture transition mechanism unit 34, and sends them to the actuators 3AA₁ through 5A₂. The actuators 3AA₁ through 5A₂ are driven according to the control signals, and the robot thus acts autonomously.

【００６４】The output control unit 37 is supplied with the digital data of the synthesized sound from the speech synthesis unit 36. The output control unit 37 D/A-converts the digital data into an analog audio signal and supplies it to the speaker 18 for output.

【００６５】Next, FIG. 4 shows a configuration example of the voice recognition unit 31A of FIG. 3.

【００６６】The audio signal from the microphone 15 is supplied to the AD (Analog Digital) conversion unit 41. In the AD conversion unit 41, the analog audio signal from the microphone 15 is sampled and quantized, and thereby A/D-converted into voice data, which is a digital signal. This voice data is supplied to the feature extraction unit 42 and the voice section detection unit 47.

【００６７】For each suitable frame of the input voice data, the feature extraction unit 42 receives the state recognition information and the posture transition information and, in a manner matched to the state of the robot, performs, for example, MFCC (Mel Frequency Cepstrum Coefficient) analysis. It outputs the analysis result to the matching unit 43 as feature parameters (a feature vector). The feature extraction unit 42 can also extract other feature parameters, such as linear prediction coefficients, cepstrum coefficients, line spectrum pairs, or the power in each predetermined frequency band (filter bank output).

【００６８】Using the feature parameters from the feature extraction unit 42, and referring as necessary to the acoustic model storage unit 44, the dictionary storage unit 45, and the grammar storage unit 46, the matching unit 43 recognizes the voice input to the microphone 15 (the input voice) based on, for example, the continuous-density HMM (Hidden Markov Model) method.

【００６９】That is, the acoustic model storage unit 44 stores acoustic models that represent the acoustic features of individual phonemes, syllables, and the like in the language of the voice to be recognized. Because voice recognition here is based on the continuous-density HMM method, an HMM (Hidden Markov Model) is used as the acoustic model. The dictionary storage unit 45 stores a word dictionary that describes, for each word to be recognized, information about its pronunciation (phonological information). The grammar storage unit 46 stores grammar rules that describe how the words registered in the word dictionary of the dictionary storage unit 45 chain together (connect). The grammar rules can be based on, for example, a context-free grammar (CFG) or statistical word chain probabilities (N-gram).

【００７０】The matching unit 43 builds an acoustic model of each word (a word model) by referring to the word dictionary in the dictionary storage unit 45 and connecting the acoustic models stored in the acoustic model storage unit 44. The matching unit 43 further connects several word models by referring to the grammar rules stored in the grammar storage unit 46 and, using the word models connected in this way, recognizes the voice input to the microphone 15 from the feature parameters by the continuous-density HMM method. That is, the matching unit 43 detects the sequence of word models with the highest score (likelihood) of observing the time series of feature parameters output by the feature extraction unit 42, and outputs the phonological information (reading) of the word string corresponding to that sequence as the voice recognition result.

【００７１】More specifically, for the word string corresponding to each set of connected word models, the matching unit 43 accumulates the occurrence probabilities of the individual feature parameters, takes the accumulated value as the score, and outputs the phonological information of the word string with the highest score as the voice recognition result.
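
The score accumulation in paragraph 0071 can be illustrated with a deliberately simplified sketch. It assumes that the per-frame occurrence probabilities for each candidate word string are already available and strictly positive, and it accumulates them as a sum of logarithms (equivalent to multiplying the probabilities); the HMM state alignment that a real recognizer performs is omitted. The function name and the input layout are hypothetical.

```python
import math

def best_word_string(frame_probs):
    """Pick the candidate with the highest accumulated score.

    frame_probs maps each candidate word string to a list of per-frame
    occurrence probabilities (assumed > 0). Returns (best, scores).
    """
    scores = {
        words: sum(math.log(p) for p in probs)
        for words, probs in frame_probs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Working in the log domain is a common way to avoid numerical underflow when many small probabilities are accumulated; the text itself does not specify the arithmetic.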

【００７２】The recognition result for the voice input to the microphone 15, output as described above, is supplied as state recognition information to the model storage unit 32 and the action determination mechanism unit 33.

【００７３】The voice section detection unit 47 calculates the voice input level (power) of the voice data from the AD conversion unit 41 for each frame, using the same frames as the MFCC analysis in the feature extraction unit 42. The voice section detection unit 47 also receives the state recognition information and the posture transition information, and detects the voice section, in which the user's voice is being input, by comparing the voice input level of each frame with a predetermined threshold. In other words, a voice section is a section in which no predetermined noise-generating motion (for example, moving the head unit 4) is being performed and which consists of frames whose voice input level is at or above the predetermined threshold. The voice section detection unit 47 supplies the detected voice section to the feature extraction unit 42 and the matching unit 43, which process only the voice section.
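
The per-frame test in paragraph 0073 can be sketched as follows. The text only says that a voice input level (power) is computed per frame; taking it as the mean squared sample value is an assumption, as are the function names and the boolean flag standing in for the state recognition and posture transition inputs.

```python
def frame_power(samples):
    """Mean squared amplitude of one frame (assumed definition of 'power')."""
    return sum(s * s for s in samples) / len(samples)

def is_voice_frame(samples, threshold, noisy_motion_active):
    """A frame belongs to a voice section only if no noise-generating motion
    is in progress and its power is at or above the threshold."""
    return (not noisy_motion_active) and frame_power(samples) >= threshold
```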

【００７４】Next, the voice recognition processing is described with reference to the flowcharts of FIGS. 5 and 6.

【００７５】In step S1, the voice section detection unit 47 estimates the environmental sound level from the voice data input through the AD conversion unit 41.

【００７６】Various noises enter the microphone 15 even when the user is not speaking to the robot, and recognizing such noise as a user utterance causes malfunctions. The environmental sound level therefore needs to be estimated while user utterances are not being recognized (the voice recognition OFF state).

【００７７】As shown in FIG. 7, the voice input level of the voice data input through the microphone 15 and the AD conversion unit 41 is not constant even in the voice recognition OFF state. Therefore, with the environmental sound level denoted ENV and the current voice input level denoted P, the environmental sound level is calculated at predetermined short intervals using the following equations (1) and (2).

ENV = a × ENV + b × P ... (1)
a + b = 1.0 ... (2)

【００７８】Here, a is set to a value relatively close to 1, such as 0.9, and b is set to a value such as 0.1, so that a momentary high-power noise (for example, a door slamming shut) does not greatly affect the overall environmental sound level.
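
Equations (1) and (2) describe an exponential moving average of the voice input level. A minimal sketch, assuming one update per frame and using the example value a = 0.9 from the text (the function name is hypothetical):

```python
def update_env(env, p, a=0.9):
    """Apply equation (1) once, with b fixed by equation (2): b = 1.0 - a."""
    b = 1.0 - a
    return a * env + b * p
```

For example, with ENV at 10 and a single loud frame of level 100, one update moves ENV only to about 19, which shows why a brief noise such as a slamming door has limited influence on the estimate.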

【００７９】Estimation of the environmental sound level continues until the voice input level exceeds ENV + T1, where T1 is a predetermined threshold (that is, until it is determined in step S8 or step S13, described later, that the voice input level has exceeded ENV + T1).

【００８０】In step S2, the voice section detection unit 47 determines whether it has received posture transition information from the posture transition mechanism unit 34 or state recognition information from the pressure processing unit 31C. If it is determined in step S2 that neither posture transition information nor state recognition information has been received, the process proceeds to step S8.

【００８１】If it is determined in step S2 that posture transition information or state recognition information has been received, then in step S3 the voice section detection unit 47 determines, based on the input information, whether the actuator that is to operate is close to the microphone 15. If it is determined in step S3 that the actuator is not close to the microphone 15, the process proceeds to step S6.

【００８２】If it is determined in step S3 that the actuator that is to operate is close to the microphone 15, then in step S4 the voice section detection unit 47 cancels the voice recognition processing. That is, no voice section is declared while an actuator close to the microphone 15 is operating.

【００８３】In the robot described with reference to FIGS. 1 and 2, the microphone 15 is provided in the head unit 4. The noise that accompanies the operation of the actuators 4A₁ to 4A_L provided in the head unit 4 (for example, at the joint between the head unit 4 and the body unit 2, or at the lower jaw 4A) therefore has a relatively large input power, so the reliability of a voice recognition result obtained during such operation is extremely low. Accordingly, malfunctions can be prevented by not declaring a voice section while an actuator close to the microphone 15 is operating.

【００８４】In step S5, the voice section detection unit 47 determines, based on the input information, whether the motion has ended. If it is determined in step S5 that the motion has not ended, the process returns to step S4 and the subsequent processing is repeated. If it is determined in step S5 that the motion has ended, the process returns to step S1 and the subsequent processing is repeated.

【００８５】If it is determined in step S3 that the actuator that is to operate is not close to the microphone 15, then in step S6 the voice section detection unit 47 determines, based on the input information, whether the posture transition information indicates the start of a walking motion. If it is determined in step S6 that the start of a walking motion is indicated, the process proceeds to step S13.

【００８６】If it is determined in step S6 that the posture transition information does not indicate the start of a walking motion, the robot is in a state in which no actuator near the microphone 15 is operating and either a motion other than walking is being performed or the robot is being picked up or stroked by the user. In step S7, the voice section detection unit 47 changes the threshold T1 (FIG. 7), which is used to judge that something other than environmental sound (noise), such as a user utterance, is being input, to a value that corresponds to the posture transition information or the state recognition information.
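
The branching in steps S2, S3, and S6 can be summarized as a small routing function. This is a minimal sketch of the flowchart logic as described in the text; the event representation (a dict with the keys "near_mic" and "walking") and the function name are assumptions made for illustration.

```python
def route_motion_event(event):
    """Return the label of the next flowchart step for one S2 check.

    event is None when neither posture transition information nor state
    recognition information was received; otherwise it is a dict with the
    hypothetical boolean keys 'near_mic' and 'walking'.
    """
    if event is None:
        return "S8"   # S2: nothing received, ordinary start check
    if event.get("near_mic"):
        return "S4"   # S3: actuator near the microphone, cancel recognition
    if event.get("walking"):
        return "S13"  # S6: walking starts, walking-specific branch
    return "S7"       # otherwise adjust threshold T1 for the motion or state
```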

【００８７】For example, when the state recognition information input from the pressure processing unit 31C indicates that the robot is being picked up by the user, the voice data input from the microphone 15 contains touch noise and rubbing sounds caused by the user touching the surface of the robot's housing. When the posture transition information input from the posture transition mechanism unit 34 indicates that an actuator located relatively far from the microphone 15 (for example, the actuator 5A₁ of the tail unit 5) is operating, the voice data input from the microphone 15 contains the operating sound of that actuator. The operating sound contained in the voice data differs depending on the type of actuator and its distance from the microphone 15.

【００８８】The voice section detection unit 47 prepares in advance, for example, a threshold T1a for when the robot is being picked up by the user, a threshold T1b for when leg unit 3A or 3B is operating, a threshold T1c for when leg unit 3C or 3D is operating, and a threshold T1d for when the tail unit 5 is operating, and is set to detect the voice section using the appropriate threshold according to the input posture transition information or state recognition information.
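
The threshold selection in paragraph 0088 is essentially a table lookup. In the sketch below, the condition keys and all numeric values are placeholders invented for illustration; the text names the thresholds T1a to T1d but gives no values for them.

```python
T1_DEFAULT = 6.0  # placeholder value for the ordinary threshold T1

# Placeholder values; the text only says these thresholds are prepared in advance.
T1_BY_CONDITION = {
    "held_by_user": 12.0,  # T1a
    "leg_3A_or_3B": 9.0,   # T1b
    "leg_3C_or_3D": 8.0,   # T1c
    "tail_5": 7.0,         # T1d
}

def select_t1(condition):
    """Return the start threshold for the reported motion or state."""
    return T1_BY_CONDITION.get(condition, T1_DEFAULT)
```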

【００８９】If it is determined in step S2 that neither posture transition information nor state recognition information has been received, or after the processing of step S7 has ended, the voice section detection unit 47 determines in step S8 whether the voice input level has exceeded the threshold (T1 + ENV). If it is determined in step S8 that the voice input level has not exceeded the threshold (T1 + ENV), the process returns to step S1 and the subsequent processing is repeated.

【００９０】If it is determined in step S8 that the voice input level has exceeded the threshold (T1 + ENV), then in step S9 the voice section detection unit 47 stops estimating the environmental sound level and starts a voice recognition start count using an internal counter (timer), not shown.

【００９１】In step S10, the voice section detection unit 47 determines whether the voice recognition start count has exceeded a predetermined value (for example, the value indicated by CNT_ON in FIG. 7). If it is determined in step S10 that the count has not exceeded the predetermined value, the processing of step S10 is repeated until it is determined that the count has exceeded the predetermined value.

【００９２】If it is determined in step S10 that the voice recognition start count has exceeded the predetermined value, then in step S11 the voice section detection unit 47 notifies the feature extraction unit 42 and the matching unit 43 of the start of the voice section. The feature extraction unit 42 and the matching unit 43 execute the voice recognition processing described with reference to FIG. 4.

【００９３】In step S12, the voice section detection unit 47 determines whether the voice input level has fallen to or below the threshold (T2 + ENV).

【００９４】Various noises enter the microphone 15 even after the user has finished speaking to the robot, and recognizing such noise as a user utterance causes malfunctions. Therefore, when the voice input level falls below a certain value, voice recognition processing must be stopped (the voice recognition OFF state must be entered).

【００９５】As shown in FIG. 8, the voice section detection unit 47 can determine whether the user's utterance has ended by determining whether the voice input level of the voice data input through the microphone 15 and the AD conversion unit 41 has fallen below the sum (T2 + ENV) of a predetermined threshold T2 and the environmental sound level ENV at the time the voice recognition processing was started.

【００９６】If it is determined in step S12 that the voice input level has not fallen to or below the threshold (T2 + ENV), the process returns to step S11 and the subsequent processing is repeated. If it is determined in step S12 that the voice input level has fallen to or below the threshold (T2 + ENV), the process proceeds to step S18.

【００９７】If it is determined in step S6 that the posture transition information indicates the start of a walking motion, then in step S13 the voice section detection unit 47 determines whether the voice input level has exceeded the threshold (T1 + ENV). If it is determined in step S13 that the voice input level has not exceeded the threshold (T1 + ENV), the process returns to step S1 and the subsequent processing is repeated.

【００９８】If it is determined in step S13 that the voice input level has exceeded the threshold (T1 + ENV), then in step S14 the voice section detection unit 47 stops estimating the environmental sound level and starts a voice recognition start count using an internal counter (timer), not shown.

【００９９】In step S15, the voice section detection unit 47 determines whether the voice recognition start count has exceeded a predetermined value (for example, the value indicated by CNT_ON in FIG. 7). If it is determined in step S15 that the count has not exceeded the predetermined value, the processing of step S15 is repeated until it is determined that the count has exceeded the predetermined value.

【０１００】If it is determined in step S15 that the voice recognition start count has exceeded the predetermined value, then in step S16 the voice section detection unit 47 notifies the feature extraction unit 42 and the matching unit 43 of the start of the voice section. The feature extraction unit 42 and the matching unit 43 execute voice recognition processing dedicated to the walking motion.

【０１０１】As shown in FIG. 9, while the robot is walking, the input voice data contains, in addition to the user's utterance, the grounding noise produced when the leg units 3A to 3D touch down on, for example, the floor. To remove this grounding noise from the voice data, the feature extraction unit 42, having been notified through the posture transition information that a walking motion is in progress, monitors, for example, the change (ΔP) in the voice input level and detects pulse-like noise from the magnitude of ΔP. (As shown in FIG. 9, pulse-like voice data is characterized by a level that rises sharply and, after the peak, falls sharply.) The feature extraction unit 42 then skips feature extraction for such a frame. Because grounding noise occurs over a very short time, excluding the frames detected as grounding noise from the voice recognition processing does not hinder recognition of the user's utterance.
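
The frame removal in paragraph 0101 can be sketched as a test on the level difference between neighboring frames. The sketch assumes that a grounding pulse occupies a single frame and that a fixed bound on ΔP (the hypothetical parameter delta) separates pulses from speech; neither assumption is stated in the text, which only says that ΔP is monitored.

```python
def drop_impulse_frames(levels, delta):
    """Return the indices of frames to keep for feature extraction.

    A frame is treated as pulse-like when its level rises by more than
    delta from the previous frame and falls by more than delta to the next.
    """
    keep = []
    for i, p in enumerate(levels):
        sharp_rise = i > 0 and p - levels[i - 1] > delta
        sharp_fall = i + 1 < len(levels) and p - levels[i + 1] > delta
        if not (sharp_rise and sharp_fall):
            keep.append(i)
    return keep
```

A sustained rise in level, as at the onset of an utterance, is kept because it is not followed by an immediate sharp fall.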

【０１０２】Alternatively, a standard sound component of the grounding noise produced during walking may be stored in the feature extraction unit 42 in advance; during a walking motion, the input voice data is filtered using that component, and feature extraction is performed on the voice data from which the grounding noise has been removed.

【０１０３】In step S17, the voice section detection unit 47 determines, by the same processing as in step S12, whether the voice input level has fallen to or below the threshold (T2 + ENV). If it is determined in step S17 that the voice input level has not fallen to or below the threshold (T2 + ENV), the process returns to step S16 and the subsequent processing is repeated.

【０１０４】If it is determined in step S12 or in step S17 that the voice input level has fallen to or below the threshold (T2 + ENV), then in step S18 the voice section detection unit 47 starts a voice recognition end count using an internal counter (timer), not shown.

【０１０５】In step S19, the voice section detection unit 47 determines whether the voice recognition end count has exceeded a predetermined value (for example, the value indicated by CNT_OFF in FIG. 8). If it is determined in step S19 that the count has not exceeded the predetermined value, the processing of step S19 is repeated until it is determined that the count has exceeded the predetermined value.

【０１０６】If it is determined in step S19 that the voice recognition end count has exceeded the predetermined value, then in step S20 the voice section detection unit 47 outputs a signal indicating that the voice section has ended to the feature extraction unit 42 and the matching unit 43. The feature extraction unit 42 and the matching unit 43 end the voice recognition processing, the process returns to step S1, and the subsequent processing is repeated.
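
The start and end detection of steps S1 and S8 to S20 (without the motion-dependent branches) can be sketched as a four-state machine over per-frame voice input levels. This is an illustrative sketch under several assumptions: the counters advance once per frame, ENV is updated once per frame in the OFF state with a = 0.9, and the names of the states and parameters are hypothetical.

```python
def detect_sections(levels, t1, t2, cnt_on, cnt_off, a=0.9, env=0.0):
    """Return (start, end) frame-index pairs of detected voice sections."""
    sections = []
    state, count, start = "OFF", 0, None
    for i, p in enumerate(levels):
        if state == "OFF":
            if p > env + t1:
                state, count = "COUNT_ON", 0       # S9: freeze ENV, start count
            else:
                env = a * env + (1.0 - a) * p      # S1: keep estimating ENV
        elif state == "COUNT_ON":
            count += 1
            if count > cnt_on:                     # S10
                state, start = "ON", i             # S11: section start
        elif state == "ON":
            if p <= env + t2:                      # S12
                state, count = "COUNT_OFF", 0      # S18: start end count
        elif state == "COUNT_OFF":
            count += 1
            if count > cnt_off:                    # S19
                sections.append((start, i))        # S20: section end
                state = "OFF"
    return sections
```

Because ENV is frozen while a section is being counted in or is active, the end test in the ON state uses the environmental sound level from the time recognition started, as described for FIG. 8.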

[0107] In the description of step S7 of FIG. 5, the voice section detection unit 47 changes the threshold T1, which is used to determine that speech such as a user's utterance, rather than environmental sound (noise), is being input, to a value corresponding to the posture transition information or the state recognition information. Alternatively, a threshold T3 that does not depend on the environmental sound level may be used as the threshold for detecting a voice section. In that case, step S8 determines not whether the voice input level has exceeded the threshold (T1 + ENV), but whether it has exceeded the threshold T3, which is independent of the environmental sound level ENV.
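The two alternatives for the start decision in step S8 can be put side by side in a short Python sketch. The function name `exceeds_start_threshold` and its argument layout are hypothetical. Only the two comparisons themselves, level > T1 + ENV and level > T3, come from the description.

```python
from typing import Optional


def exceeds_start_threshold(level: float, env: float, t1: float,
                            t3: Optional[float] = None) -> bool:
    """Decide whether the voice input level marks the start of a voice section.

    If t3 is given, the environment-independent threshold T3 is used and
    env is ignored; otherwise the comparison is against T1 + ENV.
    """
    if t3 is not None:
        return level > t3        # variant: threshold T3, independent of ENV
    return level > t1 + env      # default: threshold (T1 + ENV)
```

With T3, a change in the estimated environmental sound level does not move the start threshold, which is the difference this paragraph describes.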

[0108] The threshold T3 may be prepared in advance according to the posture transition information or the state recognition information. Alternatively, it may be set each time an operation starts or the state changes, by providing a predetermined learning interval and acquiring the average noise component.
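The two ways of obtaining T3 might be sketched in Python as follows. The description states only that an average noise component is acquired during a learning interval. How T3 is derived from that average is not specified, so the additive `margin`, the table `PRESET_T3`, its keys, and all numeric values are assumptions for illustration.

```python
from statistics import fmean
from typing import Sequence

# Hypothetical preset thresholds keyed by operation or state; values are illustrative.
PRESET_T3 = {"idle": 20.0, "walking": 40.0, "head_moving": 55.0}


def preset_t3(operation: str) -> float:
    """Look up a threshold T3 prepared in advance for the given operation or state."""
    return PRESET_T3[operation]


def learn_t3(learning_levels: Sequence[float], margin: float = 6.0) -> float:
    """Set T3 from levels observed during a learning interval.

    The average noise level is taken over the interval; adding a fixed
    margin on top of it is an assumption, not something the source states.
    """
    if not learning_levels:
        raise ValueError("learning interval contains no samples")
    return fmean(learning_levels) + margin
```

A caller could use `learn_t3` right after an operation starts or the state changes, and fall back to `preset_t3` when no learning interval is available.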

[0109] The present invention has been described above as applied to an entertainment robot (a robot serving as a pseudo pet). However, the present invention is not limited to this and can be widely applied to various robots, such as industrial robots. The present invention is also applicable not only to robots in the real world but also to virtual robots displayed on a display device such as a liquid crystal display.

[0110] In the present embodiment, the series of processes described above is performed by causing the CPU 10A (FIG. 2) to execute a program, but the series of processes can also be performed by dedicated hardware.

[0111] The program may be stored in advance in the memory 10B (FIG. 2). It may also be stored (recorded), temporarily or permanently, on a removable recording medium such as a floppy (registered trademark) disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto-Optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as so-called packaged software and installed in the robot (memory 10B).

[0112] The program can also be transferred from a download site wirelessly via an artificial satellite for digital satellite broadcasting, or by wire via a network such as a LAN (Local Area Network) or the Internet, and installed in the memory 10B.

[0113] In this case, when the program is upgraded, for example, the upgraded program can easily be installed in the memory 10B.

[0114] In this specification, the processing steps describing the program that causes the CPU 10A to perform various kinds of processing need not be executed in time series in the order shown in the flowcharts. They also include processing executed in parallel or individually (for example, parallel processing or object-based processing).

[0115] The program may be processed by a single CPU, or may be processed in a distributed manner by a plurality of CPUs.

【０１１６】[0116]

[Effects of the Invention]

According to the robot control device, the robot control method, and the program recorded on the recording medium of the present invention, voice data is received as input, first information indicating the state of the robot is generated, second information indicating the behavior of the robot is generated, and the input voice data is recognized based on the generated first information or the generated second information. Because voice recognition is thus performed based on the state and behavior of the robot, speech uttered by the user can be distinguished from noise generated by, for example, the robot's own operation, and erroneous recognition can be prevented.

[Brief description of the drawings]

FIG. 1 is a perspective view showing an example of the external configuration of an embodiment of a robot to which the present invention is applied.

FIG. 2 is a block diagram showing an example of the internal configuration of the robot.

FIG. 3 is a block diagram showing an example of the functional configuration of the controller.

FIG. 4 is a block diagram showing a configuration example of the voice recognition unit.

FIG. 5 is a flowchart for explaining the voice recognition processing.

FIG. 6 is a flowchart for explaining the voice recognition processing.

FIG. 7 is a diagram for explaining the start of a voice recognition section.

FIG. 8 is a diagram for explaining the end of a voice recognition section.

FIG. 9 is a diagram for explaining the grounding noise of a leg unit during a walking operation.

[Explanation of symbols]

4 head unit, 4A lower jaw, 10 controller, 10A CPU, 10B memory, 15 microphone, 16 CCD camera, 17 touch sensor, 18 speaker, 31 sensor input processing unit, 31A voice recognition unit, 31B image recognition unit, 31C pressure processing unit, 32 model storage unit, 33 action determination mechanism unit, 34 posture transition mechanism unit, 35 control mechanism unit, 36 speech synthesis unit, 37 output control unit, 41 AD conversion unit, 42 feature extraction unit, 43 matching unit, 44 acoustic model storage unit, 45 dictionary storage unit, 46 grammar storage unit, 47 voice section detection unit

Continued from the front page

(51) Int. Cl.7, identification symbol, FI, theme code (reference): G10L 15/20 G10L 3/02 301D 21/02
(72) Inventor: Wataru Onoki, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo
(72) Inventor: Takashi Toyoda, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo
F-terms (reference): 2C150 CA02 DA05 DA24 DA26 DA27 DA28 DF03 DF04 ED42 ED52 EF07 EF16 EF23 EF29; 3F059 AA00 BB06 DD08 DD18 FA03 FC15; 3F060 AA00 CA14; 5D015 DD03 EE05 KK01 LL10

Claims

[Claims]

1. A robot control device for controlling a robot that acts based on at least a voice recognition result, comprising: voice input means for receiving input of voice data; first generating means for generating first information indicating a state of the robot; second generating means for generating second information indicating a behavior of the robot; and recognition means for recognizing the voice data input by the voice input means based on the first information generated by the first generating means or the second information generated by the second generating means.

2. The robot control device according to claim 1, wherein the second information includes information indicating which of a plurality of drive units of the robot performs a driving operation, and the recognition means does not recognize the voice data when the position of the drive unit being driven is close to the voice input means.

3. The robot control device according to claim 1, wherein the second information includes information indicating whether the robot is performing a walking operation, and, when the robot is performing the walking operation, the recognition means recognizes the voice data excluding frames that include a noise component generated by the walking operation.

4. The robot control device according to claim 1, further comprising storage means for storing data corresponding to a noise component generated when the robot performs a walking operation, wherein the second information includes information indicating whether the robot is performing the walking operation, and, when the robot is performing the walking operation, the recognition means filters the voice data using the data corresponding to the noise component stored in the storage means and then recognizes the voice data.

5. The robot control device according to claim 1, wherein the second information includes information indicating which of a plurality of drive units of the robot performs a driving operation, and the recognition means performs the voice recognition, based on the second information, taking into account noise generated by the driving of the drive unit.

6. The robot control device according to claim 1, wherein the first information includes information indicating whether the robot is being touched by a user, and the recognition means performs the voice recognition, based on the first information, taking into account noise generated by the user touching the robot.

7. The robot control device according to claim 1, further comprising: storage means for storing a predetermined threshold corresponding to noise generated by a state or behavior of the robot; and estimation means for estimating an environmental sound while the recognition means is not performing voice recognition, wherein the recognition means determines the start of a section in which voice recognition is performed, based on the first information generated by the first generating means or the second information generated by the second generating means, using the threshold stored in the storage means and the environmental sound estimated by the estimation means.

8. The robot control device according to claim 1, further comprising storage means for storing a predetermined threshold corresponding to noise generated by a state or behavior of the robot, wherein the recognition means determines the start of a section in which voice recognition is performed, based on the first information generated by the first generating means or the second information generated by the second generating means, using the threshold stored in the storage means.

9. The robot control device according to claim 1, further comprising setting means for setting, based on the voice data input by the voice input means, a threshold corresponding to noise generated by a state or behavior of the robot, wherein the recognition means determines the start of a section in which voice recognition is performed using the threshold set by the setting means.

10. A robot control method for a robot control device that controls a robot acting based on at least a voice recognition result, comprising: a voice input step of receiving input of voice data; a first generation step of generating first information indicating a state of the robot; a second generation step of generating second information indicating a behavior of the robot; and a recognition step of recognizing the voice data input in the voice input step based on the first information generated in the first generation step or the second information generated in the second generation step.

11. A recording medium on which a computer-readable program is recorded, the program being for a robot control device that controls a robot acting based on at least a voice recognition result and comprising: a voice input step of receiving input of voice data; a first generation step of generating first information indicating a state of the robot; a second generation step of generating second information indicating a behavior of the robot; and a recognition step of recognizing the voice data input in the voice input step based on the first information generated in the first generation step or the second information generated in the second generation step.