JP2002116795A

JP2002116795A - Voice processor and method for voice processing and recording medium

Info

Publication number: JP2002116795A
Application number: JP2000310493A
Authority: JP
Inventors: Kazuo Ishii; 和夫石井; Jun Hiroi; 順広井; Wataru Onoki; 渡小野木; Takashi Toyoda; 崇豊田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-10-11
Filing date: 2000-10-11
Publication date: 2002-04-19
Anticipated expiration: 2020-10-11
Also published as: JP4656354B2

Abstract

PROBLEM TO BE SOLVED: To prevent voice from being misrecognized. SOLUTION: When voice input to a microphone begins at time P2, a voice section detection part estimates an environmental sound level from the input voice data until the voice input level exceeds a specific value. Once the voice input level exceeds the specific value, the voice section detection part stops estimating the environmental sound level and starts detecting a voice section. When an analog voice signal is output from an output control part through a loudspeaker at time P4, the voice section detection part cancels the detection of the voice section.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a robot control device, a robot control method, and a recording medium, and more particularly to a robot control device, a robot control method, and a recording medium suitable for use in, for example, a robot that acts on the basis of voice recognition results from a voice recognition device.

[0002]

[Description of the Related Art] In recent years, robots (including stuffed-toy types, in this specification) have been commercialized, for example as toys, that recognize a voice uttered by a user and, based on the voice recognition result, act by making a gesture, outputting a synthesized sound, or the like.

[0003]

[Problems to Be Solved by the Invention] Such a robot is designed to accept voice input at all times. However, if the robot speaks while voice is being input, or if voice is input while the robot is speaking, the voice uttered by the robot itself may be erroneously detected as input voice.

[0004] The present invention has been made in view of this situation, and makes it possible to prevent erroneous recognition of voice.

[0005]

[Means for Solving the Problems] A voice processing device according to the present invention comprises: voice input means for receiving input of voice data; recognition means for recognizing the voice data input by the voice input means; voice output means for outputting voice; and recognition control means for performing control so as to interrupt the recognition of the voice data by the recognition means when voice is output by the voice output means while the recognition means is recognizing the voice data.

[0006] The recognition means can perform recognition based on the input level of the voice data.
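By way of illustration only, level-based detection of the kind summarized in the abstract could be sketched as follows. The RMS level measure, the `margin` factor, the exponential smoothing, and the names `frame_level` and `VoiceSectionDetector` are assumptions made for this sketch and are not taken from the specification.

```python
import math


def frame_level(samples):
    """Return the RMS level of one non-empty frame of voice data."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class VoiceSectionDetector:
    """Hypothetical detector: tracks the ambient level until the input exceeds it."""

    def __init__(self, margin=3.0, smoothing=0.9):
        self.margin = margin        # assumed: voice must exceed ambient by this factor
        self.smoothing = smoothing  # assumed: weight kept by the previous estimate
        self.ambient = None         # estimated environmental sound level
        self.in_voice = False       # True while a voice section is being detected

    def process(self, samples):
        """Process one frame and report whether it belongs to a voice section."""
        level = frame_level(samples)
        if self.ambient is None:
            self.ambient = level    # first frame seeds the estimate
            return False
        if level > self.ambient * self.margin:
            # Input level exceeds the specific value: stop estimating, start detecting.
            self.in_voice = True
        else:
            self.in_voice = False
            self.ambient = (self.smoothing * self.ambient
                            + (1.0 - self.smoothing) * level)
        return self.in_voice
```

In this sketch the ambient estimate is frozen for as long as the input stays above the threshold, so a long utterance does not raise the threshold against itself.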

[0007] If voice data is being input by the voice input means when the output of voice by the voice output means ends, the recognition means can be made not to recognize that voice data.

[0008] A voice processing method according to the present invention includes: a voice input step of receiving input of voice data; a recognition step of recognizing the voice data input by the processing of the voice input step; a voice output step of outputting voice; and a recognition control step of performing control so as to interrupt the recognition of the voice data by the processing of the recognition step when voice is output by the processing of the voice output step during that recognition.

[0009] A program recorded on a recording medium according to the present invention includes: a voice input step of receiving input of voice data; a recognition step of recognizing the voice data input by the processing of the voice input step; a voice output step of outputting voice; and a recognition control step of performing control so as to interrupt the recognition of the voice data by the processing of the recognition step when voice is output by the processing of the voice output step during that recognition.

[0010] In the voice processing device, the voice processing method, and the program recorded on the recording medium of the present invention, when voice is output during the recognition of voice data, control is performed so that the recognition is interrupted.
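The control described above might be sketched as follows. This is a minimal illustration under stated assumptions: the class `RecognitionController`, its method names, and the frame buffer are hypothetical and do not appear in the specification.

```python
class RecognitionController:
    """Hypothetical recognition control: suppresses recognition around voice output."""

    def __init__(self):
        self.speaking = False            # True while voice is being output
        self.recognizing = False         # True while input is being recognized
        self.skip_current_input = False  # True if input was present when output ended
        self.frames = []                 # voice data collected for recognition

    def output_started(self):
        """Voice output begins: interrupt any recognition in progress."""
        self.speaking = True
        self.recognizing = False
        self.frames.clear()

    def output_ended(self, input_present):
        """Voice output ends: input already under way is not to be recognized."""
        self.speaking = False
        self.skip_current_input = input_present

    def input_frame(self, frame):
        """Accept one frame of voice data unless recognition is suppressed."""
        if self.speaking or self.skip_current_input:
            return False
        self.recognizing = True
        self.frames.append(frame)
        return True

    def input_ended(self):
        """Input falls silent: hand over the collected frames and reset."""
        frames, self.frames = self.frames, []
        self.recognizing = False
        self.skip_current_input = False
        return frames
```

Discarding the buffered frames on `output_started` is one possible reading of "interrupt"; resuming from the buffered data would be a different design and is not shown.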

[0011]

[Embodiments of the Invention] FIG. 1 shows an example of the external configuration of an embodiment of a robot to which the present invention is applied, and FIG. 2 shows an example of its electrical configuration.

[0012] In the present embodiment, the robot has, for example, the shape of a four-legged animal such as a dog. Leg units 3A, 3B, 3C, and 3D are connected to a body unit 2 at its front and rear, left and right, and a head unit 4 and a tail unit 5 are connected to the front end and the rear end of the body unit 2, respectively.

[0013] The tail unit 5 extends from a base portion 5B provided on the upper surface of the body unit 2 so that it can bend or swing freely with two degrees of freedom.

[0014] The body unit 2 houses a controller 10 that controls the entire robot, a battery 11 that serves as the power source of the robot, and an internal sensor unit 14 consisting of a battery sensor 12 and a heat sensor 13, among other components.

[0015] In the head unit 4, a microphone 15 corresponding to "ears", a CCD (Charge Coupled Device) camera 16 corresponding to "eyes", a touch sensor 17 corresponding to the sense of touch, a speaker 18 corresponding to a "mouth", and the like are arranged at predetermined positions. A lower jaw 4A corresponding to the lower jaw of the mouth is movably attached to the head unit 4 with one degree of freedom, and the opening and closing of the robot's mouth is realized by moving the lower jaw 4A.

[0016] As shown in FIG. 2, actuators 3AA_1 to 3AA_K, 3BA_1 to 3BA_K, 3CA_1 to 3CA_K, 3DA_1 to 3DA_K, 4A_1 to 4A_L, 5A_1, and 5A_2 are provided at the joints of the leg units 3A to 3D, at the connections between each of the leg units 3A to 3D and the body unit 2, at the connection between the head unit 4 and the body unit 2, at the connection between the head unit 4 and the lower jaw 4A, at the connection between the tail unit 5 and the body unit 2, and so on.

[0017] The microphone 15 in the head unit 4 collects surrounding voice (sound), including utterances from the user, and sends the resulting voice signal to the controller 10. The CCD camera 16 captures images of the surroundings and sends the resulting image signal to the controller 10.

[0018] The touch sensor 17 is provided, for example, on the top of the head unit 4. It detects pressure applied by a physical action from the user, such as "stroking" or "hitting", and sends the detection result to the controller 10 as a pressure detection signal.

[0019] The battery sensor 12 in the body unit 2 detects the remaining charge of the battery 11 and sends the detection result to the controller 10 as a remaining battery level detection signal. The heat sensor 13 detects heat inside the robot and sends the detection result to the controller 10 as a heat detection signal.

[0020] The controller 10 incorporates a CPU (Central Processing Unit) 10A, a memory 10B, and the like, and performs various kinds of processing by having the CPU 10A execute a control program stored in the memory 10B.

[0021] That is, based on the voice signal, image signal, pressure detection signal, remaining battery level detection signal, and heat detection signal supplied from the microphone 15, the CCD camera 16, the touch sensor 17, the battery sensor 12, and the heat sensor 13, the controller 10 determines the surrounding conditions and whether there has been a command from the user, an approach from the user, or the like.

[0022] Furthermore, the controller 10 decides the subsequent action based on this determination and the like, and, based on that decision, drives the necessary ones of the actuators 3AA_1 to 3AA_K, 3BA_1 to 3BA_K, 3CA_1 to 3CA_K, 3DA_1 to 3DA_K, 4A_1 to 4A_L, 5A_1, and 5A_2. This causes the head unit 4 to swing up, down, left, and right, or the lower jaw 4A to open and close. It also causes the robot to perform actions such as moving the tail unit 5 or walking by driving the leg units 3A to 3D.

[0023] In addition, the controller 10 generates, as necessary, a synthesized sound or an echo-back voice as described later and supplies it to the speaker 18 for output, and turns on, turns off, or blinks LEDs (Light Emitting Diodes), not shown, provided at the positions of the robot's "eyes".

[0024] In this way, the robot acts autonomously based on the surrounding conditions and the like.

[0025] Next, FIG. 3 shows an example of the functional configuration of the controller 10 of FIG. 2. The functional configuration shown in FIG. 3 is realized by the CPU 10A executing the control program stored in the memory 10B.

[0026] The controller 10 comprises: a sensor input processing unit 31 that recognizes specific external states; a model storage unit 32 that accumulates the recognition results of the sensor input processing unit 31 and expresses the states of emotion, instinct, and growth; an action determining mechanism unit 33 that decides the subsequent action based on the recognition results of the sensor input processing unit 31 and the like; a posture transition mechanism unit 34 that actually causes the robot to act based on the decision of the action determining mechanism unit 33; a control mechanism unit 35 that drives and controls the actuators 3AA_1 to 5A_1 and 5A_2; a voice synthesis unit 36 that generates a synthesized sound; and an output control unit 37 that controls the output of the synthesized sound generated by the voice synthesis unit 36.

[0027] The sensor input processing unit 31 recognizes specific external states, specific approaches from the user, instructions from the user, and the like based on the voice signal, image signal, pressure detection signal, and so on supplied from the microphone 15, the CCD camera 16, or the touch sensor 17, and notifies the model storage unit 32 and the action determining mechanism unit 33 of state recognition information representing the recognition result.

[0028] That is, the sensor input processing unit 31 has a voice recognition unit 31A, which performs voice recognition on the voice signal supplied from the microphone 15. The voice recognition unit 31A then notifies the model storage unit 32 and the action determining mechanism unit 33 of the voice recognition result, for example a command such as "walk", "lie down", or "chase the ball", as state recognition information.

[0029] The sensor input processing unit 31 also has an image recognition unit 31B, which performs image recognition processing using the image signal supplied from the CCD camera 16. When, as a result of that processing, the image recognition unit 31B detects, for example, "a red round object" or "a plane perpendicular to the ground and at least a predetermined height", it notifies the model storage unit 32 and the action determining mechanism unit 33 of an image recognition result such as "there is a ball" or "there is a wall" as state recognition information.

[0030] Furthermore, the sensor input processing unit 31 has a pressure processing unit 31C, which processes the pressure detection signal supplied from the touch sensor 17. When, as a result of that processing, the pressure processing unit 31C detects pressure from the touch sensor 17 that is at or above a predetermined threshold and of short duration, it recognizes that the robot has been "hit (scolded)". When it detects pressure that is below the predetermined threshold and of long duration, it recognizes that the robot has been "stroked (praised)". It notifies the model storage unit 32 and the action determining mechanism unit 33 of the recognition result as state recognition information.
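The two-condition rule above can be illustrated by a short sketch. The numeric values of `PRESSURE_THRESHOLD` and `SHORT_DURATION_S`, the use of a single boundary between "short" and "long", and the function name `classify_touch` are all assumptions; the specification gives no concrete figures.

```python
from typing import Optional

PRESSURE_THRESHOLD = 0.5  # assumed normalized pressure threshold
SHORT_DURATION_S = 0.3    # assumed boundary between "short" and "long" contact


def classify_touch(pressure: float, duration_s: float) -> Optional[str]:
    """Map a pressure reading and its duration to a recognition result."""
    if pressure >= PRESSURE_THRESHOLD and duration_s < SHORT_DURATION_S:
        return "hit (scolded)"
    if pressure < PRESSURE_THRESHOLD and duration_s >= SHORT_DURATION_S:
        return "stroked (praised)"
    return None  # combinations the text does not classify
```

Strong, long contact and weak, short contact are left unclassified here because the text describes only the two cases handled above.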

[0031] The model storage unit 32 stores and manages an emotion model, an instinct model, and a growth model that express the robot's states of emotion, instinct, and growth, respectively.

[0032] Here, the emotion model represents the state (degree) of each emotion, such as "joy", "sadness", "anger", and "fun", by a value in a predetermined range, and changes that value based on the state recognition information from the sensor input processing unit 31, the passage of time, and the like. The instinct model represents the state (degree) of each instinctive desire, such as "appetite", "desire for sleep", and "desire for exercise", by a value in a predetermined range, and changes that value based on the state recognition information from the sensor input processing unit 31, the passage of time, and the like. The growth model represents the state (degree) of growth, such as "childhood", "adolescence", "middle age", and "old age", by a value in a predetermined range, and changes that value based on the state recognition information from the sensor input processing unit 31, the passage of time, and the like.

[0033] The model storage unit 32 sends the states of emotion, instinct, and growth represented by the values of the emotion model, instinct model, and growth model described above to the action determining mechanism unit 33 as state information.

[0034] In addition to the state recognition information supplied from the sensor input processing unit 31, the model storage unit 32 is supplied, from the action determining mechanism unit 33, with behavior information indicating the robot's current or past behavior, specifically the content of behavior such as "walked for a long time". Even when the same state recognition information is given, the model storage unit 32 generates different state information depending on the robot's behavior indicated by the behavior information.

[0035] That is, for example, when the robot greets the user and the user then strokes its head, behavior information indicating that the robot greeted the user and state recognition information indicating that its head was stroked are given to the model storage unit 32. In this case, the model storage unit 32 increases the value of the emotion model representing "joy".

[0036] On the other hand, when the robot's head is stroked while it is carrying out some task, behavior information indicating that the task is in progress and state recognition information indicating that its head was stroked are given to the model storage unit 32. In this case, the model storage unit 32 does not change the value of the emotion model representing "joy".
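The dependence of the "joy" value on both inputs can be sketched as follows. The value range of 0 to 100, the increment of 10, the string labels, and the names `JOY_DELTAS` and `EmotionModel` are assumptions chosen for illustration only.

```python
# Assumed lookup: (state recognition information, behavior information) -> change in "joy".
JOY_DELTAS = {
    ("head stroked", "greeted the user"): 10.0,
    ("head stroked", "executing a task"): 0.0,
}


class EmotionModel:
    """Hypothetical emotion model holding one value per emotion in a fixed range."""

    def __init__(self, low=0.0, high=100.0):
        self.low = low
        self.high = high
        self.values = {"joy": 50.0, "sadness": 50.0, "anger": 50.0, "fun": 50.0}

    def update(self, state_recognition, behavior):
        """Adjust "joy" using both the recognized state and the robot's behavior."""
        delta = JOY_DELTAS.get((state_recognition, behavior), 0.0)
        joy = self.values["joy"] + delta
        self.values["joy"] = min(self.high, max(self.low, joy))  # keep within range
        return self.values["joy"]
```

The same state recognition information ("head stroked") thus yields different state information depending on the behavior information supplied with it.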

[0037] In this way, the model storage unit 32 sets the values of the emotion model by referring not only to the state recognition information but also to the behavior information indicating the robot's current or past behavior. This avoids unnatural changes in emotion, such as an increase in the value of the emotion model representing "joy" when the user strokes the robot's head as a prank while it is carrying out some task.

[0038] The model storage unit 32 also increases and decreases the values of the instinct model and the growth model based on both the state recognition information and the behavior information, as with the emotion model. In addition, the model storage unit 32 increases and decreases the value of each of the emotion model, the instinct model, and the growth model based on the values of the other models as well.

[0039] The action determining mechanism unit 33 decides the next action based on the state recognition information from the sensor input processing unit 31, the state information from the model storage unit 32, the passage of time, and the like, and sends the content of the decided action to the posture transition mechanism unit 34 as action command information.

[0040] That is, the action determining mechanism unit 33 manages, as an action model that defines the robot's actions, a finite automaton in which the actions the robot can take are associated with states. It transitions the state of this finite automaton based on the state recognition information from the sensor input processing unit 31, the values of the emotion model, instinct model, or growth model in the model storage unit 32, the passage of time, and the like, and decides the action corresponding to the state after the transition as the next action to take.

[0041] Here, the action determining mechanism unit 33 transitions the state when it detects that a predetermined trigger has occurred. That is, the action determining mechanism unit 33 transitions the state when, for example, the time spent executing the action corresponding to the current state reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 32 falls to or below, or rises to or above, a predetermined threshold.

[0042] As described above, the action determining mechanism unit 33 transitions the state in the action model based not only on the state recognition information from the sensor input processing unit 31 but also on the values of the emotion model, instinct model, and growth model in the model storage unit 32. Consequently, even when the same state recognition information is input, the destination of the state transition differs depending on the values of the emotion model, instinct model, and growth model (the state information).

[0043] As a result, when, for example, the state information indicates "not angry" and "not hungry" and the state recognition information indicates "a palm has been held out in front of the eyes", the action determining mechanism unit 33 generates action command information that causes the robot to "give its paw" in response to the palm being held out, and sends it to the posture transition mechanism unit 34.

[0044] When, for example, the state information indicates "not angry" and "hungry" and the state recognition information indicates "a palm has been held out in front of the eyes", the action determining mechanism unit 33 generates action command information that causes the robot to "lick the palm" in response to the palm being held out, and sends it to the posture transition mechanism unit 34.

[0045] When, for example, the state information indicates "angry" and the state recognition information indicates "a palm has been held out in front of the eyes", the action determining mechanism unit 33 generates action command information that causes the robot to "turn its head away", regardless of whether the state information indicates "hungry" or "not hungry", and sends it to the posture transition mechanism unit 34.
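The three examples above amount to a transition rule whose outcome depends on both the trigger and the state information. A minimal sketch follows, in which the `StateInfo` type, the string labels, and the function `next_state` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateInfo:
    """Assumed summary of the state information from the model storage unit."""
    angry: bool
    hungry: bool


def next_state(current: str, recognition: str, info: StateInfo) -> str:
    """Return the destination state (action) for one trigger."""
    if recognition == "palm held out":
        if info.angry:
            return "turn head away"  # hunger is irrelevant when angry
        return "lick the palm" if info.hungry else "give paw"
    return current                   # no matching trigger: stay in the same state
```

For example, `next_state("idle", "palm held out", StateInfo(angry=False, hungry=True))` selects the "lick the palm" action, while the same trigger with `angry=True` selects "turn head away".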

[0046] The action determining mechanism unit 33 can also be made to determine, based on the states of emotion, instinct, and growth indicated by the state information supplied from the model storage unit 32, parameters of the action corresponding to the destination state, such as the walking speed or the magnitude and speed of limb movements. In that case, action command information including those parameters is sent to the posture transition mechanism unit 34.

[0047] In addition to the action command information for moving the robot's head, limbs, and so on described above, the action determining mechanism unit 33 also generates action command information that causes the robot to speak. Action command information that causes the robot to speak is supplied to the voice synthesis unit 37, and includes text or the like corresponding to the synthesized sound that the voice synthesis unit 37 is to generate. On receiving action command information from the action determining mechanism unit 33, the voice synthesis unit 37 generates a synthesized sound based on the text included in that information and supplies it to the speaker 18 via the output control unit 38 for output. As a result, the speaker 18 outputs, for example, the robot's cries, various requests to the user such as "I'm hungry", responses to the user's calls such as "What?", and other voice.

[0048] Based on the action command information supplied from the action determining mechanism unit 33, the posture transition mechanism unit 34 generates posture transition information for transitioning the robot's posture from the current posture to the next posture, and sends it to the control mechanism unit 35 and the voice recognition unit 31A.

[0049] Here, the postures to which a transition from the current posture is possible are determined by the physical form of the robot, such as the shape and weight of the torso, hands, and feet and the way the parts are joined, and by the mechanisms of the actuators 3AA_1 to 5A_1 and 5A_2, such as the directions and angles in which the joints bend.

[0050] The next posture may be one that can be reached directly from the current posture or one that cannot. For example, a four-legged robot lying sprawled with its limbs thrown out can transition directly to a prone posture, but it cannot transition directly to a standing posture. It must first pull its limbs in close to the torso to assume the prone posture and then stand up, a two-stage motion. There are also postures that cannot be executed safely. For example, a four-legged robot standing on its four legs will easily fall over if it tries to raise both front legs in a "banzai" pose.

[0051] For this reason, the posture transition mechanism unit 34 registers in advance the postures to which a direct transition is possible. When the action command information supplied from the action determining mechanism unit 33 indicates a posture that can be reached directly, the posture transition mechanism unit 34 sends that action command information as-is to the control mechanism unit 35 as posture transition information. When the action command information indicates a posture that cannot be reached directly, the posture transition mechanism unit 34 generates posture transition information that first transitions to another reachable posture and then to the target posture, and sends it to the control mechanism unit 35. This prevents the robot from attempting to force a posture that cannot be reached and from falling over.
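One way to picture the registered direct transitions is as a directed graph searched for a route to the target posture. The sketch below uses breadth-first search, which is an assumption of this illustration; the specification does not state how the intermediate postures are chosen. The posture names and the table `DIRECT_TRANSITIONS` are likewise hypothetical.

```python
from collections import deque

# Assumed registry of postures reachable directly from each posture.
DIRECT_TRANSITIONS = {
    "sprawled": ["prone"],
    "prone": ["sprawled", "standing"],
    "standing": ["prone"],
}


def plan_posture_path(current, target):
    """Return the shortest list of postures from current to target, or None."""
    queue = deque([[current]])
    visited = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in DIRECT_TRANSITIONS.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # unregistered or unreachable posture: do not attempt it
```

With this table, a request to stand while sprawled yields the two-stage route through the prone posture, and a posture that is not registered at all is rejected rather than forced.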

[0052] In accordance with the posture transition information from the posture transition mechanism unit 34, the control mechanism unit 35 generates control signals for driving the actuators 3AA_1 to 5A_2 and sends them to the actuators 3AA_1 to 5A_2. The actuators 3AA_1 to 5A_2 are thereby driven in accordance with the control signals, and the robot acts autonomously.

[0053] The echo back unit 36 monitors the voice signal that is supplied from the microphone 15 and recognized by the voice recognition unit 31A, and generates and outputs voice that repeats that voice signal (hereinafter referred to as echo-back voice where appropriate). This echo-back voice is supplied to the speaker 18 via the output control unit 38 and output.

【００５４】出力制御部３８には、音声合成部３７から
の合成音のディジタルデータ、および、エコーバック部
３６からのエコーバック音声のディジタルデータが供給
されるようになっており、出力制御部３８は、それらの
ディジタルデータを、アナログの音声信号にＤ／Ａ変換
し、スピーカ１８に供給して出力させる。また、出力制
御部３８は、音声合成部３７からの合成音と、エコーバ
ック部３６からのエコーバック音声の、スピーカ１８へ
の出力が競合した場合に、その競合を調整する。即ち、
エコーバック部３６からのエコーバック音声の出力は、
行動決定機構部３３の制御にしたがって音声合成部３７
が行う合成音の出力とは独立に行われるようになってお
り、エコーバック音声の出力と合成音の出力とは競合す
る場合がある。そこで、出力制御部３８は、その競合の
調整を行う。The output control unit 38 is supplied with the digital data of the synthesized sound from the voice synthesis unit 37 and the digital data of the echo-back voice from the echo back unit 36. The output control unit 38 D/A-converts these digital data into an analog audio signal and supplies it to the speaker 18 for output. When output of the synthesized sound from the voice synthesis unit 37 and output of the echo-back voice from the echo back unit 36 to the speaker 18 conflict, the output control unit 38 arbitrates between them. That is, the echo back unit 36 outputs the echo-back voice independently of the synthesized sound that the voice synthesis unit 37 outputs under the control of the action determination mechanism unit 33, so the two outputs may conflict. The output control unit 38 therefore arbitrates the conflict.

【００５５】次に、図４は、図３の音声認識部３１Ａの
構成例を示している。Next, FIG. 4 shows a configuration example of the voice recognition unit 31A of FIG. 3.

【００５６】マイク１５からの音声信号は、ＡＤ（Anal
og Digital）変換部４１に供給される。ＡＤ変換部４１
では、マイク１５からのアナログ信号である音声信号が
サンプリング、量子化され、ディジタル信号である音声
データにＡＤ変換される。この音声データは、特徴抽出
部４２および音声区間検出部４７に供給される。The audio signal from the microphone 15 is supplied to the AD (Analog Digital) conversion unit 41. The AD conversion unit 41 samples and quantizes the audio signal, which is an analog signal from the microphone 15, and A/D-converts it into audio data, which is a digital signal. This audio data is supplied to the feature extraction unit 42 and the voice section detection unit 47.

【００５７】特徴抽出部４２は、入力される音声データ
について、適当なフレームごとに、例えば、ＭＦＣＣ
（Mel Frequency Cepstrum Coefficient）分析を行い、
その分析結果を、特徴パラメータ（特徴ベクトル）とし
て、マッチング部４３に出力する。なお、特徴抽出部４
２では、その他、例えば、線形予測係数、ケプストラム
係数、線スペクトル対、所定の周波数帯域ごとのパワー
（フィルタバンクの出力）等を、特徴パラメータとして
抽出することが可能である。The feature extraction unit 42 performs, for example, MFCC (Mel Frequency Cepstrum Coefficient) analysis on the input audio data for each appropriate frame, and outputs the analysis result to the matching unit 43 as feature parameters (a feature vector). The feature extraction unit 42 can also extract other quantities as feature parameters, for example linear prediction coefficients, cepstrum coefficients, line spectrum pairs, and the power in each predetermined frequency band (the output of a filter bank).

【００５８】マッチング部４３は、特徴抽出部４２から
の特徴パラメータを用いて、音響モデル記憶部４４、辞
書記憶部４５、および文法記憶部４６を必要に応じて参
照しながら、マイク１５に入力された音声（入力音声）
を、例えば、連続分布ＨＭＭ（Hidden Markov Model）
法に基づいて音声認識する。Using the feature parameters from the feature extraction unit 42, and referring as necessary to the acoustic model storage unit 44, the dictionary storage unit 45, and the grammar storage unit 46, the matching unit 43 recognizes the voice input to the microphone 15 (the input voice) based on, for example, the continuous distribution HMM (Hidden Markov Model) method.

【００５９】即ち、音響モデル記憶部４４は、音声認識
する音声の言語における個々の音素や音節などの音響的
な特徴を表す音響モデルを記憶している。ここでは、連
続分布ＨＭＭ法に基づいて音声認識を行うので、音響モ
デルとしては、ＨＭＭ（Hidden Markov Model）が用い
られる。辞書記憶部４５は、認識対象の各単語につい
て、その発音に関する情報（音韻情報）が記述された単
語辞書を記憶している。文法記憶部４６は、辞書記憶部
４５の単語辞書に登録されている各単語が、どのように
連鎖する（つながる）かを記述した文法規則を記憶して
いる。ここで、文法規則としては、例えば、文脈自由文
法（ＣＦＧ）や、統計的な単語連鎖確率（Ｎ−ｇｒａ
ｍ）などに基づく規則を用いることができる。That is, the acoustic model storage unit 44 stores acoustic models representing the acoustic features of individual phonemes, syllables, and the like in the language of the voice to be recognized. Because voice recognition here is based on the continuous distribution HMM method, HMMs (Hidden Markov Models) are used as the acoustic models. The dictionary storage unit 45 stores a word dictionary describing information on the pronunciation (phonological information) of each word to be recognized. The grammar storage unit 46 stores grammar rules describing how the words registered in the word dictionary of the dictionary storage unit 45 chain together (connect). Rules based on, for example, a context-free grammar (CFG) or statistical word chain probabilities (N-gram) can be used as the grammar rules.

【００６０】マッチング部４３は、辞書記憶部４５の単
語辞書を参照することにより、音響モデル記憶部４４に
記憶されている音響モデルを接続することで、単語の音
響モデル（単語モデル）を構成する。さらに、マッチン
グ部４３は、幾つかの単語モデルを、文法記憶部４６に
記憶された文法規則を参照することにより接続し、その
ようにして接続された単語モデルを用いて、特徴パラメ
ータに基づき、連続分布ＨＭＭ法によって、マイク１５
に入力された音声を認識する。即ち、マッチング部４３
は、特徴抽出部４２が出力する時系列の特徴パラメータ
が観測されるスコア（尤度）が最も高い単語モデルの系
列を検出し、その単語モデルの系列に対応する単語列の
音韻情報（読み）を、音声の認識結果として出力する。The matching unit 43 refers to the word dictionary in the dictionary storage unit 45 and connects the acoustic models stored in the acoustic model storage unit 44 to form acoustic models of words (word models). The matching unit 43 further connects several word models by referring to the grammar rules stored in the grammar storage unit 46 and, using the word models connected in this way, recognizes the voice input to the microphone 15 by the continuous distribution HMM method based on the feature parameters. That is, the matching unit 43 detects the sequence of word models with the highest score (likelihood) of observing the time-series feature parameters output by the feature extraction unit 42, and outputs the phonological information (reading) of the word string corresponding to that sequence as the voice recognition result.

【００６１】より具体的には、マッチング部４３は、接
続された単語モデルに対応する単語列について、各特徴
パラメータの出現確率を累積し、その累積値をスコアと
して、そのスコアを最も高くする単語列の音韻情報を、
音声認識結果として出力する。More specifically, for the word string corresponding to each set of connected word models, the matching unit 43 accumulates the appearance probabilities of the individual feature parameters, takes the accumulated value as the score, and outputs the phonological information of the word string with the highest score as the voice recognition result.
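
The scoring step can be illustrated with a small sketch. It assumes the per-frame appearance probabilities for each candidate word string are already available (in a real recognizer they would come from the HMM output distributions), and it accumulates them as log-probabilities, which is a common numerical choice rather than something the text specifies. The function name and the candidate data are hypothetical.

```python
import math

def best_word_sequence(frame_probs):
    """frame_probs maps each candidate word string to the per-frame
    appearance probabilities of the observed feature parameters."""
    scores = {}
    for words, probs in frame_probs.items():
        # Sum log-probabilities instead of multiplying, to avoid underflow.
        scores[words] = sum(math.log(p) for p in probs)
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical candidates with made-up probabilities.
candidates = {
    "ohayou": [0.5, 0.4, 0.6],
    "oyasumi": [0.2, 0.3, 0.1],
}
winner, score = best_word_sequence(candidates)
```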

【００６２】以上のようにして出力される、マイク１５
に入力された音声の認識結果は、状態認識情報として、
モデル記憶部３２および行動決定機構部３３に出力され
る。The recognition result of the voice input to the microphone 15, output as described above, is sent as state recognition information to the model storage unit 32 and the action determination mechanism unit 33.

【００６３】なお、音声区間検出部４７は、ＡＤ変換部
４１からの音声データについて、特徴抽出部４２がＭＦ
ＣＣ分析を行うのと同様のフレームごとに、音声入力レ
ベル（パワー）を算出している。さらに、音声区間検出
部４７は、各フレームの音声入力レベルを所定の閾値と
比較することにより、その閾値以上のパワーを有するフ
レームで構成される区間を、ユーザの音声が入力されて
いる音声区間として検出する。すなわち、音声区間と
は、所定の閾値以上の音声入力レベルを有するフレーム
で構成される区間を示す。そして、音声区間検出部４７
は、検出した音声区間を、特徴抽出部４２とマッチング
部４３に供給しており、特徴抽出部４２とマッチング部
４３は、音声区間のみを対象に処理を行う。The voice section detection unit 47 calculates the voice input level (power) of the audio data from the AD conversion unit 41 for each frame, using the same frames as the MFCC analysis in the feature extraction unit 42. It then compares the voice input level of each frame with a predetermined threshold and detects a section made up of frames whose power is at or above the threshold as a voice section in which the user's voice is being input. In other words, a voice section is a section made up of frames whose voice input level is at or above the predetermined threshold. The voice section detection unit 47 supplies the detected voice sections to the feature extraction unit 42 and the matching unit 43, which process only the voice sections.
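
A minimal sketch of this frame-power thresholding follows. The frame length, the use of mean squared amplitude as "power", and the function names are assumptions made for illustration.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def voice_sections(powers, threshold):
    """Return (start, end) frame index pairs, end exclusive,
    covering runs of frames whose power is at or above threshold."""
    sections, start = [], None
    for i, p in enumerate(powers):
        if p >= threshold and start is None:
            start = i                      # a voice section begins
        elif p < threshold and start is not None:
            sections.append((start, i))    # the section ends
            start = None
    if start is not None:
        sections.append((start, len(powers)))
    return sections
```

Downstream stages would then be handed only the frames inside the returned index ranges.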

【００６４】次に、図５は、図３のエコーバック部３６
の構成例を示している。Next, FIG. 5 shows a configuration example of the echo back unit 36 of FIG. 3.

【００６５】マイク１５からの音声信号は、ＡＤ変換部
５１に供給される。ＡＤ変換部５１では、マイク１５か
らのアナログ信号である音声信号がサンプリング、量子
化され、ディジタル信号である音声データにＡ／Ｄ変換
される。この音声データは、韻律分析部５２および音声
区間検出部５６に供給される。The audio signal from the microphone 15 is supplied to the AD converter 51. The AD converter 51 samples and quantizes the audio signal as an analog signal from the microphone 15 and A / D converts the signal into audio data as a digital signal. The audio data is supplied to the prosody analysis unit 52 and the audio section detection unit 56.

【００６６】韻律分析部５２は、そこに入力される音声
データを、適当なフレームごとに音響分析することによ
り、例えば、ピッチ周波数やパワー等といった音声デー
タの韻律情報を抽出する。この韻律情報は、音生成部５
３に供給される。The prosody analysis unit 52 acoustically analyzes the audio data input to it for each appropriate frame and thereby extracts prosody information of the audio data, such as the pitch frequency and power. This prosody information is supplied to the sound generation unit 53.

【００６７】音生成部５３は、韻律分析部５２からの韻
律情報に基づいて、韻律を制御したエコーバック音声を
生成する。The sound generation unit 53 generates an echo back voice whose prosody is controlled based on the prosody information from the prosody analysis unit 52.

【００６８】即ち、音生成部４３は、韻律分析部４２か
らの韻律情報と同一の韻律を有する、音韻のない音声
（以下、適宜、無音韻音声という）を、例えば、サイン
(sin)波を重畳することにより生成し、エコーバック音
声として、出力部４４に供給する。That is, the sound generation unit 43 generates a voice without phonemes (hereinafter referred to as phoneme-free voice where appropriate) that has the same prosody as the prosody information from the prosody analysis unit 42, for example by superimposing sine waves, and supplies it to the output unit 44 as the echo-back voice.
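
As a rough illustration of generating a phoneme-free voice by superimposing sine waves, the sketch below builds a waveform whose pitch and loudness follow per-frame prosody values. The sample rate, frame length, number of harmonics, and the mapping from power to amplitude are all assumptions; the cited references describe the actual method.

```python
import math

def synthesize(pitches_hz, powers, frame_len=160, rate=16000, harmonics=3):
    """Return samples whose pitch and loudness follow per-frame prosody."""
    out, phase = [], 0.0
    for f0, power in zip(pitches_hz, powers):
        amp = math.sqrt(power)  # amplitude from frame power (assumed mapping)
        for _ in range(frame_len):
            # Advance the phase sample by sample so it stays continuous
            # across frame boundaries.
            phase += 2.0 * math.pi * f0 / rate
            # Superimpose a few harmonics of the fundamental.
            s = sum(math.sin(k * phase) / k for k in range(1, harmonics + 1))
            out.append(amp * s)
    return out
```

Because only pitch and power are used, the result carries the intonation of the input but no phonemic content.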

【００６９】なお、韻律情報としての、例えば、ピッチ
周波数とパワーから音声データを生成する方法について
は、例えば、鈴木、石井、竹内、「非分節音による反響
的な模倣とその心理的影響」、情報処理学会論文誌、vo
l.41,No.5,pp1328-1337,May,2000や、特開2000-181896
号公報等に、その詳細が記載されている。Methods of generating audio data from prosody information such as pitch frequency and power are described in detail in, for example, Suzuki, Ishii, and Takeuchi, "Echoic mimicry using non-segmental sounds and its psychological effects," Transactions of the Information Processing Society of Japan, vol. 41, No. 5, pp. 1328-1337, May 2000, and in Japanese Unexamined Patent Application Publication No. 2000-181896.

【００７０】出力部４４は、音生成部４３からのエコー
バック音声のデータを、メモリ４５に記憶させるととも
に、出力制御部３８（図３）に出力する。The output unit 44 stores the echo back sound data from the sound generation unit 43 in the memory 45 and outputs the data to the output control unit 38 (FIG. 3).

【００７１】音声区間検出部５６は、ＡＤ変換部５１か
らの音声データについて、図４の音声区間検出部４７に
おける場合と同様の処理を行うことにより、音声区間を
検出し、韻律分析部５２と音生成部５３に供給する。こ
れにより、韻律分析部５２と音生成部５３では、音声区
間のみを対象に処理が行われる。The voice section detection unit 56 detects voice sections by processing the audio data from the AD conversion unit 51 in the same way as the voice section detection unit 47 of FIG. 4, and supplies them to the prosody analysis unit 52 and the sound generation unit 53. The prosody analysis unit 52 and the sound generation unit 53 therefore process only the voice sections.

【００７２】なお、図５のＡＤ変換部５１または音声区
間検出部５６と、図４のＡＤ変換部４１または音声区間
検出部４７とは、それぞれ兼用することが可能である。The AD conversion unit 51 or the voice section detection unit 56 of FIG. 5 can be shared with the AD conversion unit 41 or the voice section detection unit 47 of FIG. 4, respectively.

【００７３】次に、図６は、図３の音声合成部３７の構
成例を示している。FIG. 6 shows an example of the configuration of the speech synthesizer 37 shown in FIG.

【００７４】テキスト生成部６１には、行動決定機構部
３３が出力する、音声合成の対象とするテキストを含む
行動指令情報が供給されるようになっており、テキスト
生成部６１は、辞書記憶部６３や生成用文法記憶部６４
を参照しながら、その行動指令情報に含まれるテキスト
を解析する。The text generation unit 61 is supplied with action command information output by the action determination mechanism unit 33 that contains the text to be speech-synthesized. The text generation unit 61 analyzes the text contained in that action command information while referring to the dictionary storage unit 63 and the generation grammar storage unit 64.

【００７５】即ち、辞書記憶部６３には、各単語の品詞
情報や、読み、アクセント等の情報が記述された単語辞
書が記憶されており、また、生成用文法記憶部６４に
は、辞書記憶部６３の単語辞書に記述された単語につい
て、単語連鎖に関する制約等の生成用文法規則が記憶さ
れている。そして、テキスト生成部６１は、この単語辞
書および生成用文法規則に基づいて、そこに入力される
テキストの形態素解析や構文解析等の解析を行い、後段
の規則合成部６２で行われる規則音声合成に必要な情報
を抽出する。ここで、規則音声合成に必要な情報として
は、例えば、ポーズの位置や、アクセントおよびイント
ネーションを制御するための情報その他の韻律情報や、
各単語の発音等の音韻情報などがある。That is, the dictionary storage unit 63 stores a word dictionary describing, for each word, part-of-speech information and information such as its reading and accent, and the generation grammar storage unit 64 stores generation grammar rules, such as constraints on word chaining, for the words described in the word dictionary of the dictionary storage unit 63. Based on this word dictionary and these generation grammar rules, the text generation unit 61 performs analyses such as morphological analysis and syntactic analysis of the input text and extracts the information needed for the rule-based speech synthesis performed by the rule synthesis unit 62 in the subsequent stage. The information needed for rule-based speech synthesis includes, for example, prosody information such as pause positions and information for controlling accent and intonation, and phonological information such as the pronunciation of each word.

【００７６】テキスト生成部６１で得られた情報は、規
則合成部６２に供給され、規則合成部６２では、音素片
記憶部６５を参照しながら、テキスト生成部５１に入力
されたテキストに対応する合成音の音声データ（ディジ
タルデータ）が生成される。The information obtained by the text generator 61 is supplied to the rule synthesizer 62, which refers to the phoneme segment storage 65 and corresponds to the text input to the text generator 51. Voice data (digital data) of the synthesized sound is generated.

【００７７】即ち、音素片記憶部６５には、例えば、Ｃ
Ｖ(Consonant, Vowel)や、ＶＣＶ、ＣＶＣ等の形で音素
片データが記憶されており、規則合成部６２は、テキス
ト生成部６１からの情報に基づいて、必要な音素片デー
タを接続し、さらに、音素片データの波形を加工するこ
とによって、ポーズ、アクセント、イントネーション等
を適切に付加し、これにより、テキスト生成部６１に入
力されたテキストに対応する合成音の音声データを生成
する。That is, the phoneme segment storage unit 65 stores phoneme segment data in forms such as CV (Consonant, Vowel), VCV, and CVC. Based on the information from the text generation unit 61, the rule synthesis unit 62 connects the necessary phoneme segment data and processes the waveform of the phoneme segment data to add pauses, accents, intonation, and the like appropriately, thereby generating the audio data of the synthesized sound corresponding to the text input to the text generation unit 61.

【００７８】以上のようにして生成された音声データ
は、出力制御部３８（図３）を介して、スピーカ１８に
供給され、これにより、スピーカ１８からは、テキスト
生成部６１に入力されたテキストに対応する合成音が出
力される。The audio data generated as described above is supplied to the speaker 18 via the output control unit 38 (FIG. 3), and the speaker 18 thereby outputs the synthesized sound corresponding to the text input to the text generation unit 61.

【００７９】なお、図３の行動決定機構部３３では、上
述したように、行動モデルに基づいて、次の行動が決定
されるが、合成音として出力するテキストの内容は、ロ
ボットの行動と対応付けておくことが可能である。As described above, the action determination mechanism unit 33 of FIG. 3 determines the next action based on the action model. The content of the text output as synthesized sound can be associated with the robot's actions.

【００８０】即ち、例えば、ロボットが、座った状態か
ら、立った状態になる行動には、テキスト「よっこいし
ょ」などを対応付けておくことが可能である。この場
合、ロボットが、座っている姿勢から、立つ姿勢に移行
するときに、その姿勢の移行に同期して、合成音「よっ
こいしょ」を出力することが可能となる。That is, for example, the action of the robot moving from a sitting state to a standing state can be associated with a text such as "yokkoisho" (roughly, "heave-ho"). In this case, when the robot moves from the sitting posture to the standing posture, it can output the synthesized sound "yokkoisho" in synchronization with the change of posture.

【００８１】次に、以上の実施の形態の動作について説
明する。Next, the operation of the above embodiment will be described.

【００８２】まず、ペットロボットのコントローラ１０
の出力制御部３８が、音声のディジタルデータをアナロ
グの音声信号にＤ／Ａ変換し、スピーカ１８に供給して
出力させている場合を考える。First, consider the case where the output control unit 38 of the controller 10 of the pet robot is D/A-converting digital audio data into an analog audio signal and supplying it to the speaker 18 for output.

【００８３】図７に示されるように、出力制御部３８
は、時刻Ｐ₁から、エコーバック部３６または音声合成
部３７から供給されるディジタルデータをＤ／Ａ変換
し、時間Ｔ₁の期間、スピーカ１８に供給して出力させ
ている（図７（Ｄ））。この時刻Ｐ₁から時間Ｔ₁の期間
には、マイク１５に、ユーザの発話を含む周囲の音声
（音）が入力されていない（音声区間検出部４７で音声
区間が検出されない）。As shown in FIG. 7, from time P₁ the output control unit 38 D/A-converts the digital data supplied from the echo back unit 36 or the voice synthesis unit 37 and supplies it to the speaker 18 for output for a period of time T₁ (FIG. 7(D)). During the period T₁ from time P₁, no ambient sound, including user speech, is input to the microphone 15 (the voice section detection unit 47 detects no voice section).

【００８４】そして、時刻Ｐ₂からマイク１５に音声が
入力され始めると、音声区間検出部４７は、ＡＤ変換部
４１を介して入力される音声データを基に、環境音レベ
ルを推定する。すなわち、マイク１５には、ユーザがロ
ボットに対して発話していない場合においても、様々な
ノイズが音声入力されるが、そのノイズをユーザの発話
として音声認識することは誤動作の原因になる。従っ
て、ユーザの発話を音声認識していない状態（音声認識
ＯＦＦ状態）において、環境音レベルを推定する必要が
ある。When sound begins to be input to the microphone 15 at time P₂, the voice section detection unit 47 estimates the environmental sound level based on the audio data input via the AD conversion unit 41. Various noises are input to the microphone 15 even when the user is not speaking to the robot, and recognizing such noise as user speech causes malfunctions. The environmental sound level therefore needs to be estimated while the user's speech is not being recognized (the voice recognition OFF state).

【００８５】図８に示されるように、マイク１５および
ＡＤ変換部４１を介して入力される音声データの音声入
力レベルは、音声認識ＯＦＦ状態においても一定ではな
い。そこで、環境音レベルをＥＮＶ、現在の音声入力レ
ベルをＰとして、次の式（１）および式（２）により、
所定の短い時間毎に、環境音レベルを算出する。ＥＮＶ＝ａ×ＥＮＶ＋ｂ×Ｐ・・・（１）ａ＋ｂ＝１．０・・・（２）As shown in FIG. 8, the voice input level of the audio data input via the microphone 15 and the AD conversion unit 41 is not constant even in the voice recognition OFF state. Therefore, with ENV denoting the environmental sound level and P the current voice input level, the environmental sound level is calculated at predetermined short intervals using the following equations (1) and (2).
ENV = a × ENV + b × P ... (1)
a + b = 1.0 ... (2)

【００８６】ここで、変数ａは、０．９など、１に比較
的近い数字に設定され、変数ｂは、０．１などに設定さ
れることにより、瞬間的にパワーの大きなノイズ（例え
ば、ドアがばたんと閉まる音など）が、環境音全体に大
きな影響を与えないようになされている。Here, the variable a is set to a value relatively close to 1, such as 0.9, and the variable b to a value such as 0.1, so that a momentary noise with large power (for example, a door slamming shut) does not greatly affect the overall environmental sound level.
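
Equations (1) and (2) amount to an exponential moving average of the input level. The sketch below applies the update with a = 0.9 and b = 0.1 to a short, made-up sequence of levels that contains one loud spike; the function name and the numbers are illustrative only.

```python
def update_env(env, p, a=0.9, b=0.1):
    """One step of equation (1); a + b must equal 1.0 (equation (2))."""
    assert abs(a + b - 1.0) < 1e-9
    return a * env + b * p

env = 10.0
# Made-up input levels: steady background with one spike (a door slamming).
for level in [10.0, 10.0, 100.0, 10.0]:
    env = update_env(env, level)
```

With these values the spike of 100 lifts the estimate only from 10 to 19, and the next quiet frame brings it back down to 18.1, which is the damping effect the choice of a and b is meant to achieve.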

【００８７】環境音レベルの推定は、予め決められた閾
値Ｌ１を基に、音声入力レベルが、ＥＮＶ＋Ｌ１を越え
るまで継続される。The estimation of the environmental sound level is continued based on the predetermined threshold L1 until the sound input level exceeds ENV + L1.

【００８８】音声区間検出部４７は、音声入力レベルが
ＥＮＶ＋Ｌ１（図７の例の場合、時刻Ｐ₃）を越える
と、環境レベルの推定を止め、音声区間の検出を開始す
るとともに（図７（Ｂ））、その内部に有する図示せぬ
カウンタ（タイマ）を用いて、音声認識開始カウントを
開始する。When the voice input level exceeds ENV + L1 (at time P₃ in the example of FIG. 7), the voice section detection unit 47 stops estimating the environmental sound level and starts detecting a voice section (FIG. 7(B)), and also starts a voice recognition start count using an internal counter (timer), not shown.

【００８９】音声区間検出部４７は、音声認識開始カウ
ントが所定の値（例えば、図８のＣＮＴ＿ＯＮで示され
る値）を超えたとき、音声区間の検出の開始を、特徴抽
出部４２およびマッチング部４３に出力する。特徴抽出
部４２およびマッチング部４３は、音声認識処理を実行
する（図７（Ｃ））。When the voice recognition start count exceeds a predetermined value (for example, the value indicated by CNT_ON in FIG. 8), the voice section detection unit 47 notifies the feature extraction unit 42 and the matching unit 43 that detection of a voice section has started. The feature extraction unit 42 and the matching unit 43 then execute the voice recognition processing (FIG. 7(C)).
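
The sequence of estimating ENV, crossing ENV + L1, and counting up to CNT_ON can be sketched as a small loop over per-frame input levels. The function name is hypothetical, and the sketch assumes that the counter simply keeps running once it has started; the text does not say what happens if the level drops again during the count.

```python
def detect_start(levels, l1, cnt_on, env=0.0, a=0.9, b=0.1):
    """Return the frame index at which recognition starts, or None."""
    count = None  # None while the environmental level is still being estimated
    for i, p in enumerate(levels):
        if count is None:
            if p > env + l1:
                count = 0  # stop estimating, start the recognition-start count
            else:
                env = a * env + b * p  # equation (1)
        else:
            count += 1
            if count > cnt_on:
                return i  # notify the feature extraction and matching stages
    return None
```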

【００９０】そして、時刻Ｐ₄において、出力制御部３
８からアナログの音声信号がスピーカ１８を介して出力
されると（図７（Ｄ））、音声区間検出部４７は、音声
区間の検出をキャンセル（中止）する（図７（Ｂ））。Then, when an analog audio signal is output from the output control unit 38 via the speaker 18 at time P₄ (FIG. 7(D)), the voice section detection unit 47 cancels (aborts) the detection of the voice section (FIG. 7(B)).

【００９１】すなわち、通常、時刻Ｐ₃からマージンＭ
を戻った時刻Ｐ₅から時間ｔ₁の期間が音声区間として検
出されるが、途中、出力制御部３８からアナログの音声
信号がスピーカ１８を介して出力されるので、音声区間
検出部４７は、音声区間の検出を、時刻Ｐ₅から時間ｔ₂
が経過したところでキャンセルする。それにともなっ
て、特徴抽出部４２およびマッチング部４３は、通常、
時刻Ｐ₅から時間ｔ₃（＝ｔ₁）の期間を音声認識する
が、音声区間検出部４７から、音声区間が入力されなく
なるので、時刻Ｐ₅から時間ｔ₄（＝ｔ₂）までの期間を
音声認識することになる。That is, the period of time t₁ from time P₅, which is the margin M before time P₃, would normally be detected as the voice section. Because an analog audio signal is output from the output control unit 38 via the speaker 18 partway through, however, the voice section detection unit 47 cancels the detection of the voice section when time t₂ has elapsed from time P₅. Accordingly, the feature extraction unit 42 and the matching unit 43, which would normally recognize the period of time t₃ (= t₁) from time P₅, no longer receive the voice section from the voice section detection unit 47 and so recognize only the period of time t₄ (= t₂) from time P₅.

【００９２】出力制御部３８は、時刻Ｐ₄から時間Ｔ₂の
期間、アナログの音声信号をスピーカ１８に供給して出
力させる（図７（Ｄ））。The output control unit 38 supplies the analog audio signal to the speaker 18 for output for a period of time T₂ from time P₄ (FIG. 7(D)).

【００９３】このように、ロボットが発話（音声を出
力）している場合には、マイク１５より音声入力があっ
たとしても、それを認識しないようにする。これによ
り、ロボットの発話自体を音声として誤って入力してし
まうことがなくなり、音声の誤認識を防止することがで
きる。As described above, when the robot is uttering (outputting voice), even if a voice is input from the microphone 15, it is not recognized. Thereby, it is possible to prevent the utterance of the robot itself from being erroneously input as voice, and to prevent erroneous recognition of voice.

【００９４】なお、時刻Ｐ₄までの認識結果は、破棄し
てもよいし、あるいは、音声認識のスコアが所定のスレ
ッショルド以上の信頼度がある場合は、そこまでの結果
を採用してもよい。The recognition result up to time P₄ may be discarded, or, if the voice recognition score indicates a reliability at or above a predetermined threshold, the result up to that point may be adopted.

【００９５】次に、音声入力の途中でペットロボットが
発話し、一旦、音声認識をキャンセルし、ペットロボッ
トの発話が終了したとき（出力制御部３８からスピーカ
１８を介して、アナログの音声信号の出力が終了され、
音声認識が再開されたとき）、音声入力がまだ継続して
いる場合を考える。Next, consider the case where the pet robot speaks partway through a voice input, voice recognition is cancelled once, and the voice input is still continuing when the pet robot finishes speaking (that is, when output of the analog audio signal from the output control unit 38 via the speaker 18 has ended and voice recognition resumes).

【００９６】図９に示されるように、時刻Ｐ₁におい
て、音声区間検出部４７は、音声区間の検出を開始す
る。特徴抽出部４２およびマッチング部４３は、時刻Ｐ
₁＋ＣＮＴ＿ＯＮから音声認識を開始する。As shown in FIG. 9, the voice section detection unit 47 starts detecting a voice section at time P₁. The feature extraction unit 42 and the matching unit 43 start voice recognition from time P₁ + CNT_ON.

【００９７】時刻Ｐ₂において、出力制御部３８からア
ナログの音声信号がスピーカ１８を介して出力されると
（図９（Ｄ））、音声区間検出部４７は、音声区間の検
出をキャンセルする（図９（Ｂ））。それにともなっ
て、特徴抽出部４２およびマッチング部４３は、音声認
識をキャンセルする（図９（Ｃ））。出力制御部３８
は、時刻Ｐ₂から時間Ｔの期間、アナログの音声信号を
スピーカ１８に供給して出力させる（図９（Ｄ））。When an analog audio signal is output from the output control unit 38 via the speaker 18 at time P₂ (FIG. 9(D)), the voice section detection unit 47 cancels the detection of the voice section (FIG. 9(B)). Accordingly, the feature extraction unit 42 and the matching unit 43 cancel voice recognition (FIG. 9(C)). The output control unit 38 supplies the analog audio signal to the speaker 18 for output for a period of time T from time P₂ (FIG. 9(D)).

【００９８】アナログの音声信号の出力が終了した時刻
Ｐ₃において、マイク１５に音声がまだ入力されている
場合、音声区間検出部４７は、再び、環境音レベルを推
定する。すなわち、マイク１５には、ロボットが発話を
終了した後も、様々なノイズまたは音声が音声入力され
る場合があるので、そのノイズを音声認識することは誤
動作の原因になる。従って、音声入力レベルが、一定の
値を上回った場合、音声認識処理を行わないように（音
声認識ＯＦＦ状態に）する必要がある。If sound is still being input to the microphone 15 at time P₃, when output of the analog audio signal ends, the voice section detection unit 47 estimates the environmental sound level again. Various noises or voices may be input to the microphone 15 even after the robot has finished speaking, and recognizing that noise causes malfunctions. Therefore, while the voice input level exceeds a certain value, voice recognition processing must not be performed (the unit must be kept in the voice recognition OFF state).

【００９９】図１０に示されるように、マイク１５およ
びＡＤ変換部４１を介して入力される音声データの音声
入力レベルが、所定の閾値Ｌ２と、音声認識処理が開始
された時点においての環境音レベルＥＮＶとの和（Ｌ２
＋ＥＮＶ）を下回るか否かを判断することにより、音声
区間検出部４７は、音声入力が終了したか否かを判断す
ることができる。As shown in FIG. 10, the voice section detection unit 47 can determine whether the voice input has ended by checking whether the voice input level of the audio data input via the microphone 15 and the AD conversion unit 41 has fallen below the sum (L2 + ENV) of a predetermined threshold L2 and the environmental sound level ENV at the time the voice recognition processing was started.

【０１００】また、所定の時間が経過しても、音声認識
処理が開始された時点においての環境音レベルＥＮＶと
の和（Ｌ２＋ＥＮＶ）を下回らなかった場合、環境レベ
ルが高くなった（周囲がうるさくなった）とみなされ、
新たな環境音レベルとして更新される。If the voice input level has not fallen below the sum (L2 + ENV) with the environmental sound level ENV at the time the voice recognition processing was started even after a predetermined time has elapsed, the environmental level is deemed to have risen (the surroundings have become noisier), and the current level is adopted as the new environmental sound level.
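
The check performed after the robot finishes speaking can be sketched as follows. The function name and the frame-count timeout are hypothetical, and the sketch assumes that, on timeout, the most recent input level is adopted as the new ENV; the text only says that the current environmental sound level becomes the new one.

```python
def wait_for_quiet(levels, env, l2, timeout_frames):
    """Return (env, frames_waited). env is replaced only if the input
    never drops to L2 + ENV or below within the timeout."""
    for i, p in enumerate(levels):
        if p <= env + l2:
            return env, i        # input has ended; keep the old estimate
        if i + 1 >= timeout_frames:
            return p, i + 1      # surroundings got louder; adopt the new level
    return env, len(levels)      # ran out of data before either condition
```

After either outcome, control would return to ordinary environmental level estimation and start detection.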

【０１０１】そして、音声区間検出部４７は、時刻Ｐ₄
から時間ｔ₅の期間、再び、音声区間の検出を開始す
る。特徴抽出部４２およびマッチング部４３は、マージ
ンＭを考慮して時刻Ｐ₅から時間ｔ₅（＝ｔ₆）の期間、
音声認識を開始する。The voice section detection unit 47 then starts detecting a voice section again, for a period of time t₅ from time P₄. Taking the margin M into account, the feature extraction unit 42 and the matching unit 43 start voice recognition for a period of time t₅ (= t₆) from time P₅.

【０１０２】次に、図１１のフローチャートを参照し
て、音声認識処理について説明する。Next, the voice recognition processing will be described with reference to the flowchart of FIG. 11.

【０１０３】ステップＳ１において、音声区間検出部４
７は、ＡＤ変換部４１を介して入力された音声データを
基に、環境音レベルを推定する。ステップＳ２におい
て、音声区間検出部４７は、音声入力レベルが、閾値
（Ｌ１＋ＥＮＶ）を越えたか否かを判定し、音声入力レ
ベルが、閾値（Ｌ１＋ＥＮＶ）を越えていないと判定し
た場合、ステップＳ１に戻り、上述した処理を繰り返
す。In step S1, the voice section detection unit 47 estimates the environmental sound level based on the audio data input via the AD conversion unit 41. In step S2, the voice section detection unit 47 determines whether the voice input level has exceeded the threshold (L1 + ENV). If it determines that the voice input level has not exceeded the threshold (L1 + ENV), the processing returns to step S1 and the above processing is repeated.

【０１０４】ステップＳ２において、音声入力レベルが
閾値（Ｌ１＋ＥＮＶ）を越えたと判定されると、ステッ
プＳ３に進み、音声区間検出部４７は、環境音レベルの
推定を止め、その内部に有する図示しないカウンタ（タ
イマ）を用いて、音声認識開始カウントを開始する。If it is determined in step S2 that the voice input level has exceeded the threshold value (L1 + ENV), the flow advances to step S3, where the voice section detection unit 47 stops estimating the environmental sound level, and a counter (not shown) provided therein. Using a (timer), a speech recognition start count is started.

【０１０５】ステップＳ４において、音声区間検出部４
７は、音声認識開始カウントが所定の値（例えば、図８
のＣＮＴ＿ＯＮで示される値）を超えたか否かを判定
し、音声認識開始カウントが所定の値を超えたと判定さ
れるまで待機する。そして、音声認識開始カウントが所
定の値を超えたと判定されると、ステップＳ５に進み、
音声区間検出部４７は、音声区間の開始を特徴抽出部４
２およびマッチング部４３に出力する。特徴抽出部４２
およびマッチング部４３は、図４を用いて説明した音声
認識処理を実行する。In step S4, the voice section detection unit 47 determines whether the voice recognition start count has exceeded a predetermined value (for example, the value indicated by CNT_ON in FIG. 8) and waits until it has. When the count is determined to have exceeded the predetermined value, the processing proceeds to step S5, where the voice section detection unit 47 notifies the feature extraction unit 42 and the matching unit 43 of the start of the voice section. The feature extraction unit 42 and the matching unit 43 execute the voice recognition processing described with reference to FIG. 4.

【０１０６】ステップＳ６において、音声合成部３７
は、音声のディジタルデータを出力制御部３８に出力し
たか否かを判定し、音声を出力していないと判定した場
合、ステップＳ７に進む。In step S6, the voice synthesis unit 37 determines whether it has output digital audio data to the output control unit 38. If it determines that no audio has been output, the processing proceeds to step S7.

【０１０７】ステップＳ７において、音声区間検出部４
７は、音声認識処理が終了したか否かを判定し、音声認
識処理が終了していないと判定した場合、ステップＳ５
に戻り、上述した処理を繰り返す。そして、ステップＳ
７において、音声認識処理が終了したと判定された場
合、ステップＳ１に戻り、上述した処理を繰り返す。In step S7, the voice section detection unit 47 determines whether the voice recognition processing has finished. If it determines that the processing has not finished, the processing returns to step S5 and the above processing is repeated. If it is determined in step S7 that the voice recognition processing has finished, the processing returns to step S1 and the above processing is repeated.

【０１０８】ステップＳ６において、音声のディジタル
データが出力制御部３８に出力されたと判定された場
合、ステップＳ８に進み、音声区間検出部４７は、音声
認識処理をキャンセル（中止）する。ステップＳ９にお
いて、音声合成部３７は、音声のディジタルデータの出
力を終了したか否かを判定し、音声の出力が終了するま
で待機する。If it is determined in step S6 that the digital voice data has been output to the output control unit 38, the process proceeds to step S8, and the voice section detection unit 47 cancels (stops) the voice recognition processing. In step S9, the voice synthesizer 37 determines whether or not the output of the voice digital data has been completed, and waits until the output of the voice has been completed.

【０１０９】ステップＳ９において、音声の出力が終了
したと判定されると、ステップＳ１０に進み、音声区間
検出部４７は、ＡＤ変換部４１を介して入力された音声
データを基に、環境音レベルを推定する。If it is determined in step S9 that the output of the voice has been completed, the process proceeds to step S10, where the voice section detection unit 47 determines the environmental sound level based on the voice data input via the AD conversion unit 41. Is estimated.

【０１１０】ステップＳ１１において、音声区間検出部
４７は、環境音レベルが元の環境音レベルになったか否
か、すなわち、閾値（Ｌ２＋ＥＮＶ）以下になったか否
かを判定し、元の環境音レベルになったと判定した場
合、ステップＳ１に戻り、上述した処理を繰り返す。In step S11, the voice section detecting section 47 determines whether or not the environmental sound level has become the original environmental sound level, that is, whether or not the environmental sound level has become equal to or less than the threshold value (L2 + ENV). If it is determined that the condition has been satisfied, the process returns to step S1, and the above-described processing is repeated.

【０１１１】ステップＳ１１において、環境音レベルが
元の環境音レベルではないと判定された場合、ステップ
Ｓ１２に進み、音声区間検出部７２は、所定の時間（例
えば、２０秒）が経過したか否かを判定する。ステップ
Ｓ１２において、所定の時間が経過していないと判定し
た場合、ステップＳ１１に戻り、上述した処理を繰り返
す。If it is determined in step S11 that the environmental sound level is not the original environmental sound level, the process proceeds to step S12, where the voice section detecting unit 72 determines whether a predetermined time (for example, 20 seconds) has elapsed. Is determined. If it is determined in step S12 that the predetermined time has not elapsed, the process returns to step S11, and the above-described processing is repeated.

【０１１２】ステップＳ１２において、所定の時間が経
過したと判定された場合、ステップＳ１３に進み、音声
区間検出部７２は、環境レベルが高くなった（周囲がう
るさくなった）と判断し、現在の環境音レベルを新たな
環境音レベルとして更新した後、ステップＳ１に戻り、
上述した処理を繰り返す。If it is determined in step S12 that the predetermined time has elapsed, the processing proceeds to step S13, where the voice section detection unit 72 judges that the environmental level has risen (the surroundings have become noisier) and adopts the current environmental sound level as the new environmental sound level. The processing then returns to step S1 and the above processing is repeated.

【０１１３】以上、本発明を、エンターテイメント用の
ロボット（疑似ペットとしてのロボット）に適用した場
合について説明したが、本発明は、これに限らず、例え
ば、産業用のロボット等の各種のロボットに広く適用す
ることが可能である。また、本発明は、現実世界のロボ
ットだけでなく、例えば、液晶ディスプレイ等の表示装
置に表示される仮想的なロボットにも適用可能である。The case where the present invention is applied to a robot for entertainment (robot as a pseudo pet) has been described above. However, the present invention is not limited to this, and may be applied to various robots such as industrial robots. It can be widely applied. In addition, the present invention is applicable not only to a robot in the real world but also to a virtual robot displayed on a display device such as a liquid crystal display.

【０１１４】また、以上においては、ロボット以外に、
例えば、対話システムなどにも適用可能である。Besides robots, the present invention can also be applied to, for example, dialogue systems.

【０１１５】図１２は、本発明を適用した対話システム
の構成例を示すブロック図である。なお、図中、図３に
おける場合と対応する部分には同一の符号を付してあ
り、その説明は適宜省略する。FIG. 12 is a block diagram showing a configuration example of a dialogue system to which the present invention is applied. In the figure, parts corresponding to those in FIG. 3 are denoted by the same reference numerals, and description thereof will be omitted as appropriate.

【０１１６】マイク１５は、ユーザの発話である音声を
入力し、その音声信号を音声認識部３１Ａに出力する。
音声認識部３１Ａは、マイク１５から与えられる音声信
号について音声認識を行う。The microphone 15 receives the voice uttered by the user and outputs the voice signal to the voice recognition unit 31A. The voice recognition unit 31A performs voice recognition on the voice signal supplied from the microphone 15.

[0117] The dialogue management unit 71 selects a predetermined phrase (text) based on the result of the voice recognition by the voice recognition unit 31A, and outputs the selected phrase to the voice synthesis unit 37. Based on the input phrase, the voice synthesis unit 37 generates voice data (digital data) of the corresponding synthesized sound and supplies it to the speaker 18 via the output control unit 38 for output.

[0118] More specifically, for example, when the user inputs (asks) "What time is it now?" through the microphone 15, the voice recognition unit 31A performs voice recognition on the voice signal. Based on the recognition result, the dialogue management unit 71 selects a predetermined phrase (for example, "It is 12 o'clock") from among a plurality of phrases (texts) prepared in advance, and outputs the selected phrase to the voice synthesis unit 37. The voice synthesis unit 37 generates voice data of the corresponding synthesized sound based on the input phrase and outputs it to the output control unit 38. The output control unit 38 converts the input voice data into an analog voice signal and supplies it to the speaker 18 for output. As a result, the speaker 18 outputs the voice "It is 12 o'clock".
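The flow from recognition result to spoken reply can be sketched as a chain of small functions. This Python is a hedged illustration, not the system of FIG. 12: the function names, the fallback reply, and the use of UTF-8 bytes as a stand-in for digital voice data are all assumptions made for the example, and the recognition stage itself is taken as already done.

```python
from datetime import datetime


def select_text(recognized, now):
    """Stand-in for the dialogue management unit 71: pick a prepared reply text."""
    if recognized == "What time is it now?":
        return f"It is {now.hour % 12 or 12} o'clock"
    return "I did not understand"  # assumed fallback, not part of the specification


def synthesize(text):
    """Stand-in for the voice synthesis unit 37: text -> digital voice data (bytes here)."""
    return text.encode("utf-8")


def output(voice_data):
    """Stand-in for the output control unit 38 and speaker 18: return what would be played."""
    return voice_data.decode("utf-8")


def respond(recognized, now):
    """Chain the stages that follow recognition: selection, synthesis, output."""
    return output(synthesize(select_text(recognized, now)))
```

For instance, `respond("What time is it now?", datetime(2000, 10, 11, 12, 0))` passes through all three stages and yields the reply text for noon.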

[0119] In this way, by recognizing the voice of the user's question and replying with appropriate words, the system gives the user the feeling of actually conversing with it.

[0120] In the present embodiment, the above-described series of processing is performed by causing the CPU 10A (FIG. 2) to execute a program, but the series of processing can also be performed by dedicated hardware.

[0121] The program may be stored in advance in the memory 10B (FIG. 2), or it may be stored (recorded) temporarily or permanently on a removable recording medium such as a floppy (registered trademark) disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto-optical) disk, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium can be provided as so-called packaged software and installed in the robot (memory 10B).

[0122] The program can also be transferred from a download site wirelessly via an artificial satellite for digital satellite broadcasting, or by wire via a network such as a LAN (Local Area Network) or the Internet, and installed in the memory 10B.

[0123] In this case, when the program is upgraded, for example, the upgraded program can easily be installed in the memory 10B.

[0124] In this specification, the processing steps describing the program that causes the CPU 10A to perform various kinds of processing need not necessarily be processed in time series in the order described in the flowchart. They also include processing executed in parallel or individually (for example, parallel processing or object-based processing).

[0125] The program may be processed by a single CPU, or may be processed in a distributed manner by a plurality of CPUs.

[0126]

[Effects of the Invention] According to the voice processing apparatus, the voice processing method, and the program recorded on the recording medium of the present invention, when voice is output while voice data is being recognized, the recognition is interrupted, so that erroneous recognition of voice can be prevented.
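The interruption behavior summarized above can be illustrated with a small state holder. The following Python is a sketch under stated assumptions rather than the claimed apparatus: the class `RecognitionController`, its method names, and the use of a plain list as the buffer of collected voice data are hypothetical.

```python
class RecognitionController:
    """Hypothetical recognition control: abandon recognition while the device itself speaks."""

    def __init__(self):
        self.recognizing = False  # True while voice data is being collected for recognition
        self.speaking = False     # True while the device's own voice is being output
        self.buffer = []          # voice data collected for the current recognition

    def on_voice_input(self, frame):
        """Collect input voice data unless the device's own voice is being output."""
        if self.speaking:
            return  # the microphone may be picking up the device's own output; ignore it
        self.recognizing = True
        self.buffer.append(frame)

    def on_output_start(self):
        """Voice output began: interrupt any recognition in progress and drop its data."""
        self.speaking = True
        self.recognizing = False
        self.buffer.clear()

    def on_output_end(self):
        """Voice output finished: input may be collected again."""
        self.speaking = False
```

Discarding the buffer at the start of output is one simple way to keep the device's own voice from being matched against the recognition vocabulary; other designs, such as pausing and resuming, are equally conceivable.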

[Brief description of the drawings]

FIG. 1 is a perspective view showing an example of the external configuration of an embodiment of a robot to which the present invention is applied.

FIG. 2 is a block diagram showing an example of the internal configuration of the robot.

FIG. 3 is a block diagram showing an example of the functional configuration of the controller.

FIG. 4 is a block diagram showing a configuration example of the voice recognition unit.

FIG. 5 is a block diagram showing a configuration example of the echo back unit.

FIG. 6 is a block diagram showing a configuration example of the voice synthesis unit.

FIG. 7 is a diagram for explaining voice recognition.

FIG. 8 is a diagram for explaining estimation of the environmental level.

FIG. 9 is a diagram for explaining voice recognition.

FIG. 10 is a diagram for explaining estimation of the environmental level.

FIG. 11 is a flowchart for explaining the voice recognition processing.

FIG. 12 is a block diagram for explaining a dialogue system to which the present invention is applied.

[Explanation of symbols]

4 head unit, 4A lower jaw, 10 controller, 10A CPU, 10B memory, 15 microphone, 16 CCD camera, 17 touch sensor, 18 speaker, 31 sensor input processing unit, 31A voice recognition unit, 31B image recognition unit, 31C pressure processing unit, 32 model storage unit, 33 action determination mechanism unit, 34 posture transition mechanism unit, 35 control mechanism unit, 36 echo back unit, 37 voice synthesis unit, 38 output control unit, 41 AD conversion unit, 42 feature extraction unit, 43 matching unit, 44 acoustic model storage unit, 45 dictionary storage unit, 46 grammar storage unit, 47 voice section detection unit, 51 AD conversion unit, 52 prosody analysis unit, 53 sound generation unit, 54 output unit, 55 memory, 56 voice section detection unit, 61 text generation unit, 62 rule synthesis unit, 63 dictionary storage unit, 64 generation grammar storage unit, 65 phoneme segment storage unit, 71 dialogue management unit

Continued from the front page

(51) Int. Cl.⁷, identification symbol, FI, theme code (reference): B25J 5/00 B25J 13/00 Z 13/00 G10L 3/00 571K G10L 13/00 R 15/00 551H 551F

(72) Inventor: Wataru Onoki, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo

(72) Inventor: Takashi Toyoda, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo

F-terms (reference): 2C150 AA05 CA02 DA05 DC15 DF03 DF04 DF33 DK02 EB01 EC02 ED11 ED42 ED47 ED56 EF13 EF16 EF23 EF28 EF33 EG12 EH07 FA03 FA04 FB33 FB43; 3F059 AA00 BA00 BB06 DA05 DC00 FC00; 3F060 AA00 BA10 CA14; 5D015 AA01 BB01 KK02 KK04 LL10; 5D045 AB11 AB30

Claims

[Claims]

1. A voice processing apparatus comprising: voice input means for receiving input of voice data; recognition means for recognizing the voice data input by the voice input means; voice output means for outputting voice; and recognition control means for controlling the recognition means to interrupt the recognition of the voice data when the voice output means outputs the voice while the recognition means is recognizing the voice data.

2. The voice processing apparatus according to claim 1, wherein the recognition means performs recognition based on an input level of the voice data.

3. The voice processing apparatus according to claim 1, wherein, when the output of the voice by the voice output means is completed and the voice data is being input by the voice input means, the recognition means does not recognize the voice data.

4. A voice processing method comprising: a voice input step of receiving input of voice data; a recognition step of recognizing the voice data input by the processing of the voice input step; a voice output step of outputting voice; and a recognition control step of performing control so as to interrupt the recognition of the voice data by the processing of the recognition step when the voice is output by the processing of the voice output step during the recognition of the voice data by the processing of the recognition step.

5. A recording medium on which a computer-readable program is recorded, the program comprising: a voice input step of receiving input of voice data; a recognition step of recognizing the voice data input by the processing of the voice input step; a voice output step of outputting voice; and a recognition control step of performing control so as to interrupt the recognition of the voice data by the processing of the recognition step when the voice is output by the processing of the voice output step during the recognition of the voice data by the processing of the recognition step.