JP2001215993A

JP2001215993A - Device and method for interactive processing and recording medium

Info

Publication number: JP2001215993A
Application number: JP2000022225A
Authority: JP
Inventors: Koji Asano; 康治浅野; Seiichi Aoyanagi; 誠一青柳; Miyuki Tanaka; 幸田中; Jun Yokono; 順横野; Toshio Oe; 敏生大江
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2000-01-31
Filing date: 2000-01-31
Publication date: 2001-08-10

Abstract

PROBLEM TO BE SOLVED: To conduct dialogue that is rich in variation depending on the emotional state of a user. SOLUTION: A voice recognition section 2 recognizes the user's voice and extracts prosodic information from it. An interactive control section 3 extracts conceptual information for the words and phrases included in the voice recognition result obtained by section 2. An image inputting section 6 photographs the user's face and outputs face image information. A physiological information inputting section 7 detects physiological information such as the user's pulse rate. A user feeling information updating section 8 then estimates the user's feeling based on the prosodic, conceptual, face image, and physiological information. Section 3 and a sentence generating section 4 generate an output sentence based on the estimated feeling and output it to the user.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a dialogue processing apparatus, a dialogue processing method, and a recording medium, and more particularly to a dialogue processing apparatus, a dialogue processing method, and a recording medium that make it possible, for example, to conduct a dialogue that takes the user's emotions into account.

[0002]

[Description of the Related Art] In a so-called dialogue system, when the user provides an input, a response sentence corresponding to the semantic content of that input is generated and output.

[0003]

[Problems to Be Solved by the Invention] Consequently, in a conventional dialogue system, much the same response sentence is output whenever the semantic content of the input is the same, whatever the user's emotional state, and as a result much the same dialogue takes place.

[0004] The present invention has been made in view of this situation, and makes it possible to conduct dialogue that is rich in variation according to the user's emotional state.

[0005]

[Means for Solving the Problems] The dialogue processing apparatus of the present invention comprises: concept extraction means for extracting the concept of a word or phrase input by a user; emotion estimation means for estimating the user's emotion based on the concept of the word or phrase input by the user and outputting emotion information representing that emotion; and output sentence generation means for generating, based on the emotion information, an output sentence to be output to the user.

[0006] The emotion estimation means can be made to estimate the user's emotion based also on the output sentence.

[0007] The emotion estimation means can also be made to estimate the user's emotion based also on an image obtained by imaging the user.

[0008] Furthermore, the emotion estimation means can be made to estimate the user's emotion based also on the user's physiological phenomena.

[0009] The dialogue processing apparatus of the present invention may further include acoustic processing means for processing an acoustic signal obtained from outside. In that case, the emotion estimation means can be made to estimate the user's emotion based also on the processing result of the acoustic processing means.

[0010] The dialogue processing apparatus of the present invention may further include speech recognition means for recognizing the user's speech. In that case, the concept extraction means can be made to extract the concepts of words and phrases contained in the speech recognition result of the user's speech.

[0011] The emotion estimation means can be made to estimate the user's emotion based also on the prosodic information of the user's speech.

[0012] The output sentence generation means can be made to change the wording of the output sentence based on the emotion information.

[0013] The output sentence generation means can be made to change the number of output sentences based on the emotion information.

[0014] The output sentence can be a back-channel response (aizuchi), that is, a short interjection that signals the listener is following.

[0015] The dialogue processing apparatus of the present invention may further include storage means for storing the emotion information. In that case, the output sentence generation means can be made to generate the output sentence based on the emotion information stored in the storage means.

[0016] The dialogue processing apparatus of the present invention may further include output sentence output means for outputting the output sentence.

[0017] The output sentence output means can be made to output the output sentence as synthesized speech.

[0018] The output sentence output means can also be made to control the prosody of the synthesized speech based on the emotion information.

[0019] The dialogue processing method of the present invention comprises: a concept extraction step of extracting the concept of a word or phrase input by a user; an emotion estimation step of estimating the user's emotion based on the concept of the word or phrase input by the user and outputting emotion information representing that emotion; and an output sentence generation step of generating, based on the emotion information, an output sentence to be output to the user.

[0020] The recording medium of the present invention has recorded on it a program comprising: a concept extraction step of extracting the concept of a word or phrase input by a user; an emotion estimation step of estimating the user's emotion based on the concept of the word or phrase input by the user and outputting emotion information representing that emotion; and an output sentence generation step of generating, based on the emotion information, an output sentence to be output to the user.

[0021] In the dialogue processing apparatus, dialogue processing method, and recording medium of the present invention, the concept of a word or phrase input by the user is extracted, and the user's emotion is estimated based on that concept. An output sentence to be output to the user is then generated based on the resulting emotion information.

[0022]

[Description of the Preferred Embodiments] FIG. 1 shows a configuration example of one embodiment of a dialogue system to which the present invention is applied. (Here, a system means a logical collection of a plurality of devices, regardless of whether the constituent devices are in the same housing.)

[0023] The voice input unit 1 consists of, for example, a microphone and an amplifier. It converts the user's voice into a voice signal (an electrical signal), amplifies it as necessary, and supplies the voice signal to the speech recognition unit 2.

[0024] The speech recognition unit 2 performs acoustic processing on the voice signal from the voice input unit 1 and recognizes the user's speech based on the result of that processing. The speech recognition result is supplied to the dialogue management unit 3. The speech recognition unit 2 also supplies the prosodic information of the user's speech, obtained through the acoustic processing of the voice signal, to the user emotion information updating unit 8.

[0025] The dialogue management unit 3 generates the content of an output sentence to be output to the user, for example as a response to the speech recognition result from the speech recognition unit 2, taking into account the emotion information representing the user's emotions that is held (stored) in the user emotion information recording unit 9. It supplies content information representing that content to the sentence generation unit 4. The dialogue management unit 3 also extracts the concepts of words and phrases contained in the speech recognition result from the speech recognition unit 2, and of words and phrases contained in the output sentence corresponding to the content information it has generated, and supplies concept information representing those concepts to the user emotion information updating unit 8.

[0026] The sentence generation unit 4 generates an output sentence, for example as text, corresponding to the content information from the dialogue management unit 3, while taking into account the emotion information held in the user emotion information recording unit 9. It then generates a synthesized-speech voice signal corresponding to that output sentence and supplies it to the voice output unit 5.

[0027] The voice output unit 5 consists of, for example, an amplifier and a speaker. It amplifies the voice signal from the sentence generation unit 4 as necessary and outputs it from the speaker.

[0028] The image input unit 6 consists of, for example, a lens, a CCD (Charge Coupled Device), and an A/D converter. It images the user's face or the like and supplies face image information, which is the digital data (image data) of the resulting face image, to the user emotion information updating unit 8.

[0029] The physiological information input unit 7 consists of, for example, a pulse meter and sensors that measure perspiration and heat. It senses physiological information such as the user's pulse, amount of perspiration, and heat, and supplies the resulting physiological information to the user emotion information updating unit 8.

[0030] The user emotion information updating unit 8 estimates the user's emotional state based on the prosodic information of the user's speech from the speech recognition unit 2, the concept information from the dialogue management unit 3 for words and phrases contained in the speech recognition result and the like, the face image information from the image input unit 6, and the physiological information from the physiological information input unit 7. The user emotion information updating unit 8 then updates the emotion information held in the user emotion information recording unit 9 with the emotion information obtained from that estimation.

[0031] The user emotion information recording unit 9 holds emotion information that expresses the user's emotional states, such as joy, anger, surprise, and sadness, as numerical values within a predetermined range.
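The bounded emotion values held by the user emotion information recording unit 9 can be illustrated with a minimal sketch. The range 0.0 to 1.0, the additive update, and the names `EmotionRecord` and `update` are assumptions made for illustration; the source only states that each emotion is a number within a predetermined range.

```python
from dataclasses import dataclass, field

# Emotion names taken from the examples in the text.
EMOTIONS = ("joy", "anger", "surprise", "sadness")


@dataclass
class EmotionRecord:
    """Holds one bounded score per emotion (the 0..1 range is an assumption)."""
    low: float = 0.0
    high: float = 1.0
    values: dict = field(default_factory=lambda: {e: 0.0 for e in EMOTIONS})

    def update(self, emotion: str, delta: float) -> float:
        # Clamp so the stored value never leaves the predetermined range.
        new = min(self.high, max(self.low, self.values[emotion] + delta))
        self.values[emotion] = new
        return new


record = EmotionRecord()
record.update("anger", 0.7)
record.update("anger", 0.7)  # second update is clamped at the upper bound
```

Clamping at write time keeps every reader of the record (dialogue management, sentence generation) free of range checks.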

[0032] Next, the basic processing flow of the dialogue system of FIG. 1 is described with reference to the flowchart of FIG. 2.

[0033] When the user speaks, the voice input unit 1, in step S1, performs voice input processing on the spoken voice and outputs the resulting voice signal to the speech recognition unit 2. That is, the voice input unit 1 converts the user's voice into a voice signal (an electrical signal), amplifies that signal as necessary, and supplies it to the speech recognition unit 2.

[0034] In step S2, the speech recognition unit 2 recognizes the user's speech based on the voice signal from the voice input unit 1 and supplies the speech recognition result to the dialogue management unit 3. The speech recognition unit 2 also extracts the prosodic information of the user's speech from the voice signal from the voice input unit 1 and supplies it to the user emotion information updating unit 8.

[0035] The process then proceeds to step S3, in which preparations are made for updating the emotion information held in the user emotion information recording unit 9.

[0036] That is, in step S3, the dialogue management unit 3 performs dialogue management processing for emotion information updating, which obtains the concept information described above, used to update the emotion information, from the speech recognition result of the user's speech from the speech recognition unit 2 and the like. It supplies that concept information to the user emotion information updating unit 8. Also in step S3, the image input unit 6 performs image input processing, which images the user's face to obtain face image information, and supplies the face image information to the user emotion information updating unit 8. Also in step S3, the physiological information input unit 7 performs physiological information input processing, which obtains the user's physiological information, and supplies it to the user emotion information updating unit 8.

[0037] In step S4, the user emotion information updating unit 8 estimates the user's emotional state based on the prosodic information of the user's speech from the speech recognition unit 2, the concept information from the dialogue management unit 3, the face image information from the image input unit 6, and the physiological information from the physiological information input unit 7. Also in step S4, the user emotion information updating unit 8 updates the emotion information held in the user emotion information recording unit 9 with the emotion information obtained from that estimation.

[0038] Then, in step S5, the dialogue management unit 3 performs dialogue management processing for sentence generation, which generates content information representing the content of an output sentence to be output to the user, for example as a response to the speech recognition result from the speech recognition unit 2, taking into account the emotion information representing the user's emotions that is held (stored) in the user emotion information recording unit 9. It supplies that content information to the sentence generation unit 4.

[0039] In step S6, the sentence generation unit 4 generates a text output sentence corresponding to the content information from the dialogue management unit 3 (performs sentence generation processing) while taking into account the emotion information held in the user emotion information recording unit 9. It then generates a synthesized-speech voice signal corresponding to that output sentence and supplies it to the voice output unit 5.

[0040] In step S7, the voice output unit 5 performs voice output processing, which amplifies the voice signal from the sentence generation unit 4 and outputs it from the speaker, and the processing ends.
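The flow of steps S1 to S7 can be sketched as one function per step. Everything below is a hypothetical stand-in: text replaces the voice signal, a fixed prosody value replaces acoustic processing, and the word lexicon, the update rule, and the response templates are invented for illustration, not taken from the source.

```python
def recognize(signal):
    # S2 (stub): return the recognition result plus prosodic information.
    return {"text": signal, "prosody": {"power": 0.5}}


def extract_concepts(text):
    # S3 (stub): keep only words that map to an emotion concept.
    lexicon = {"happy": "joy", "angry": "anger"}
    return [lexicon[w] for w in text.lower().split() if w in lexicon]


def update_emotion(store, concepts, prosody):
    # S4: raise each detected emotion, scaled by prosodic power (assumed rule).
    for c in concepts:
        store[c] = min(1.0, store.get(c, 0.0) + 0.5 * (1 + prosody["power"]))
    return store


def plan_content(text, store):
    # S5: choose response content with the stored emotion in mind.
    mood = max(store, key=store.get) if store else "neutral"
    return {"act": "respond", "mood": mood}


def generate_sentence(content):
    # S6: turn content information into an output sentence.
    templates = {
        "joy": "That is great!",
        "anger": "I am sorry to hear that.",
        "neutral": "I see.",
    }
    return templates[content["mood"]]


def dialogue_turn(utterance, store):
    signal = utterance  # S1: text stands in for the voice signal here
    result = recognize(signal)
    concepts = extract_concepts(result["text"])
    update_emotion(store, concepts, result["prosody"])
    content = plan_content(result["text"], store)
    return generate_sentence(content)  # S7 would synthesize and play this
```

Calling `dialogue_turn("I am angry", {})` walks through every step once; the same input with a different stored emotion state could yield a different sentence, which is the behavior the flowchart is meant to enable.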

[0041] In the case described above, the dialogue system outputs synthesized speech (hereinafter also called an utterance of the dialogue system, as appropriate) when triggered by some utterance from the user, so the synthesized speech is a response to the user's utterance. The dialogue system can, however, also be made to speak in response to triggers other than a user utterance.

[0042] That is, the dialogue system can, for example, speak at predetermined time intervals. It can also speak when the image input unit 6 obtains a face image of the user (either simply when a face image is obtained, or when a face image with a particular expression is obtained), or when the physiological information input unit 7 obtains predetermined physiological information. Furthermore, it can speak when, for example, the emotion information held in the user emotion information recording unit 9 rises to or above, or falls to or below, a predetermined value. In these cases, the dialogue proceeds with the dialogue system addressing the user and the user replying.
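The non-speech triggers listed above can be sketched as a single check. The function name `should_speak`, the parameter names, and every threshold (60 seconds, 110 bpm, 0.2 and 0.8) are hypothetical; the source names the trigger types but gives no values or priority order.

```python
def should_speak(elapsed_s, face_detected, pulse_bpm, emotion_value,
                 interval_s=60.0, pulse_limit=110, low=0.2, high=0.8):
    """Return the name of the first trigger that fires, or None.

    The order of the checks is an arbitrary choice for this sketch.
    """
    if elapsed_s >= interval_s:
        return "timer"          # predetermined time interval has passed
    if face_detected:
        return "face"           # a face image (or expression) was obtained
    if pulse_bpm >= pulse_limit:
        return "physiology"     # predetermined physiological information
    if emotion_value >= high or emotion_value <= low:
        return "emotion"        # stored emotion at/above or at/below a bound
    return None
```

Returning the trigger name rather than a bare boolean lets the dialogue management side pick an opening utterance suited to why the system is speaking.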

[0043] Next, FIG. 3 shows a configuration example of the speech recognition unit 2 of FIG. 1.

[0044] The voice signal from the voice input unit 1 is supplied to an AD (Analog Digital) conversion unit 11, which converts the voice signal from an analog signal to a digital signal and supplies the resulting voice data to a feature extraction unit 12. The feature extraction unit 12 performs acoustic processing on the voice data from the AD conversion unit 11 frame by frame, using frames of suitable length, to extract feature parameters such as the spectrum, linear prediction coefficients, cepstrum coefficients, line spectrum pairs, and MFCCs (Mel Frequency Cepstrum Coefficients), and supplies them to a matching unit 13.

[0045] The feature extraction unit 12 also supplies prosodic information obtained through the acoustic processing of the voice data, such as speaking rate, pitch frequency, and power, to the user emotion information updating unit 8. The number of morae per frame, for example, can be used as the speaking rate.
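Two of the prosodic features named above, power and pitch frequency, can be computed per frame as in the sketch below. The mean-square power and the autocorrelation pitch search are common textbook choices assumed here, not the method the source specifies; the search band of 80 to 400 Hz and the test tone are also assumptions.

```python
import math


def frame_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def pitch_hz(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Pitch estimate: the lag with the largest autocorrelation in [fmin, fmax]."""
    lo = int(sample_rate / fmax)   # shortest period considered, in samples
    hi = int(sample_rate / fmin)   # longest period considered, in samples

    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

    best_lag = max(range(lo, hi + 1), key=autocorr)
    return sample_rate / best_lag


# Synthetic 200 Hz tone standing in for one voiced frame of speech.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
power = frame_power(frame)
pitch = pitch_hz(frame, sr)
```

Real speech would need voicing detection and smoothing across frames; this sketch only shows where the two numbers come from.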

[0046] The matching unit 13 recognizes the user's speech (the input speech) based on the feature parameters supplied from the feature extraction unit 12, referring as necessary to an acoustic model database 14, a dictionary database 15, and a grammar database 16.

[0047] That is, the acoustic model database 14 stores acoustic models representing the acoustic features of individual phonemes, syllables, and so on in the language of the speech to be recognized. HMMs (Hidden Markov Models), for example, can be used as the acoustic models. The dictionary database 15 stores a word dictionary describing pronunciation information for each word to be recognized. The grammar database 16 stores grammar rules describing how the words registered in the word dictionary of the dictionary database 15 chain together (connect). Rules based on, for example, a context-free grammar (CFG), HPSG (Head-driven Phrase Structure Grammar), or statistical word chain probabilities (N-grams) can be used as the grammar rules.
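As an illustration of the statistical word chain probability (N-gram) option mentioned above, the sketch below estimates bigram probabilities by maximum likelihood from a toy corpus. The corpus, the sentence boundary markers, and the function name `bigram_model` are assumptions; a practical recognizer would also apply smoothing, which is omitted here.

```python
from collections import Counter


def bigram_model(sentences):
    """Return prob(prev, word) = count(prev, word) / count(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]   # sentence boundary markers
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Unseen history words get probability 0.0 in this unsmoothed sketch.
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    return prob


corpus = [
    ["record", "the", "news"],
    ["record", "the", "movie"],
    ["play", "the", "news"],
]
prob = bigram_model(corpus)
```

Here `prob("the", "news")` is higher than `prob("the", "movie")`, so a matching stage using these rules would prefer the more frequent word chain when the acoustics are ambiguous.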

[0048] The matching unit 13 constructs an acoustic model of each word (a word model) by connecting the acoustic models stored in the acoustic model database 14 with reference to the word dictionary of the dictionary database 15. It then connects several word models with reference to the grammar rules stored in the grammar database 16 and, using the word models connected in this way, recognizes the user's speech from the feature parameters by, for example, the HMM method.
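Building a word model by connecting phoneme-level acoustic models can be sketched as a simple concatenation. The phoneme inventory, the three states per phoneme, and the two dictionary entries are invented for illustration; real HMMs would also carry transition and emission parameters, which are left out.

```python
# Hypothetical phoneme-level models: each phoneme is a list of HMM state names.
ACOUSTIC_MODELS = {p: [f"{p}_{i}" for i in range(3)] for p in "aiueokrs"}

# Hypothetical word dictionary: word -> pronunciation as a phoneme sequence.
PRONUNCIATIONS = {
    "aki": ["a", "k", "i"],
    "sora": ["s", "o", "r", "a"],
}


def word_model(word):
    """Concatenate the phoneme models listed in the word's pronunciation."""
    states = []
    for phoneme in PRONUNCIATIONS[word]:
        states.extend(ACOUSTIC_MODELS[phoneme])
    return states
```

For example, `word_model("aki")` yields nine states, three for each of the phonemes a, k, and i, in pronunciation order.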

[0049] The phonemic information obtained as the speech recognition result of the matching unit 13 is output to the dialogue management unit 3 as, for example, text or a word graph.

[0050] Next, FIG. 4 shows a configuration example of the dialogue management unit 3 of FIG. 1.

[0051] The user's speech recognition result output by the speech recognition unit 2 is supplied to a language processing unit 21. The language processing unit 21 processes the speech recognition result while referring as necessary to a thesaurus database 23, a language processing database 24, and a history database 25, and supplies the meaning and concepts expressed by the speech recognition result to a dialogue processing unit 22.

[0052] That is, the thesaurus database 23 stores a thesaurus in which words are classified into a hierarchical structure according to their concepts. By referring to this thesaurus, the language processing unit 21 recognizes the concepts of the words contained in the speech recognition result.

[0053] As the thesaurus, for example, the Bunrui Goi Hyo (a classified vocabulary list) published by the National Institute for Japanese Language can be used.
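A hierarchical thesaurus lookup of the kind described above can be sketched with a child-to-parent map. The entries below are invented and are not taken from the Bunrui Goi Hyo; `concept_path` is a hypothetical helper name.

```python
# Toy thesaurus: each entry maps a word or concept to its parent concept.
PARENT = {
    "emotion": None,          # root of this small hierarchy
    "joy": "emotion",
    "anger": "emotion",
    "glad": "joy",
    "delighted": "joy",
    "furious": "anger",
}


def concept_path(word):
    """Return the chain from the word up to the root, or [] if unknown."""
    path = []
    node = word if word in PARENT else None
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path
```

With this layout, checking whether a word belongs to a concept reduces to testing membership in its path, for example `"joy" in concept_path("glad")`.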

[0054] The language processing database 24 stores a word dictionary describing the notation of each word, the necessary part-of-speech information, and so on, together with syntactic/semantic rules that describe constraints on word chaining and the like based on the information for each word in that dictionary. The language processing unit 21 performs morphological analysis of the input speech recognition result based on the word dictionary and the syntactic/semantic rules. Based on the morphological analysis result, it then parses the speech recognition result and interprets its semantic content. The language processing unit 21 outputs to the dialogue processing unit 22 the concepts of the words making up the speech recognition result and the result of interpreting its semantic content (hereinafter collectively called the language processing result, as appropriate).

[0055] Here, the language processing unit 21 can perform the parsing and the interpretation of semantic content using, for example, a regular grammar, a context-free grammar, HPSG, or statistical word chain probabilities.

[0056] The language processing unit 21 also refers to the history database 25 as necessary during processing. That is, the history database 25 stores the history of the dialogue between the user and the dialogue system (the dialogue history), for example as pairs of the speech recognition result of a user utterance and the response the dialogue system output to that utterance, or as pairs of an output of the dialogue system and the speech recognition result of the user's utterance in reply. By referring to the dialogue history, the language processing unit 21 analyzes omissions of the subject and the like, anaphoric expressions, and so on in the speech recognition result, and thereby recognizes, for example, what a pronoun contained in the user's speech recognition result specifically refers to.
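The use of dialogue history to resolve a pronoun can be illustrated with a deliberately naive sketch. The most-recent-mention heuristic, the fixed candidate word set, the handling of only the pronoun "it", and all names here are assumptions; the source does not describe how anaphora are actually analyzed.

```python
from collections import deque


class DialogueHistory:
    """Stores recent turns as (speaker, text) pairs."""

    def __init__(self, max_turns=20):
        self.turns = deque(maxlen=max_turns)

    def add(self, speaker, text):
        self.turns.append((speaker, text))

    def last_mention(self, candidates):
        """Most recently mentioned candidate word, newest turn first."""
        for _, text in reversed(self.turns):
            hits = [w for w in text.lower().split() if w in candidates]
            if hits:
                return hits[-1]
        return None


def resolve_pronoun(utterance, history, candidates):
    """Replace the pronoun 'it' with the most recent candidate, if any."""
    referent = history.last_mention(candidates)
    words = utterance.lower().split()
    if referent is None:
        return " ".join(words)
    return " ".join(f"the {referent}" if w == "it" else w for w in words)


history = DialogueHistory()
history.add("user", "I watched the movie yesterday")
history.add("system", "Did you like the movie")
resolved = resolve_pronoun("I loved it", history, {"movie", "news"})
```

Because both user and system turns are stored, a pronoun can be resolved against something the system itself said, which matches the two kinds of pairs the history database is described as holding.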

[0057] The information stored in the thesaurus database 23 and the language processing database 24 is basically not updated, so it can be regarded as static information. In contrast, the dialogue history stored in the history database 25 is updated by the dialogue processing unit 22, described later, whenever the user speaks or the dialogue system produces some output to the user, so it can be regarded as dynamic information.

[0058] As described above, the language processing unit 21 extracts the concept of each word (vocabulary item) making up the speech recognition result by referring to the thesaurus database 23. When a concept expresses an emotion, the language processing unit 21 supplies it to the user emotion information updating unit 8 as concept information. That is, when the speech recognition result contains a word that belongs, in the thesaurus, to a concept expressing an emotion, such as "joy", "anger", "surprise", "sadness", "distress", "embarrassment", or "enjoyment", the language processing unit 21 supplies concept information representing that concept to the user emotion information updating unit 8.

[0059] In addition to the concept information of words contained in the speech recognition result, the language processing unit 21 also extracts, as necessary, the concept information of words contained in the dialogue system's outputs stored in the dialogue history, and supplies it to the user emotion information updating unit 8.

[0060] That is, the user emotion information updating unit 8 estimates the user's emotional state as described above, and in that estimation the concept information of words contained in the dialogue system's outputs can be useful, as well as that of words contained in the speech recognition result. Specifically, if the dialogue system makes an utterance that ridicules the user, for example, the user can be expected to become angry. For this reason, the language processing unit 21 also extracts, by referring to the thesaurus, the concept information of words contained in the dialogue system's outputs stored in the dialogue history, and supplies it to the user emotion information updating unit 8 together with the concept information of words contained in the speech recognition result.
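Collecting emotion concepts from both the user's words and the system's own output can be sketched as follows. The word-to-concept table, the tagging of each item with its source, and the name `concept_info` are assumptions for illustration; only the idea of filtering for emotion concepts on both sides comes from the text.

```python
# Concepts treated as emotions (a subset of the examples in the text).
EMOTION_CONCEPTS = {"joy", "anger", "surprise", "sadness"}

# Hypothetical flat stand-in for the thesaurus lookup.
WORD_TO_CONCEPT = {
    "glad": "joy",
    "stupid": "anger",
    "wow": "surprise",
    "lonely": "sadness",
    "tv": "appliance",   # a non-emotion concept, filtered out below
}


def concept_info(user_words, system_words=()):
    """Return (source, word, concept) for every emotion-bearing word."""
    info = []
    for source, words in (("user", user_words), ("system", system_words)):
        for w in words:
            concept = WORD_TO_CONCEPT.get(w)
            if concept in EMOTION_CONCEPTS:
                info.append((source, w, concept))
    return info
```

Keeping the source tag lets the estimator treat the two cases differently: an anger word in the system's output suggests the user may become angry, while the same word from the user is more direct evidence.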

【００６１】 The dialogue processing unit 22 generates the content of an output sentence to be presented to the user, such as a response to the user's speech recognition result. It does so based on the language processing result from the language processing unit 21 and the emotion information representing the user's emotional state held in the user emotion information recording unit 9, while referring to the history database 25 and the scenario database 26. It then supplies content information representing that content to the sentence generation unit 4.

【００６２】 That is, the scenario database 26 stores, for each task (topic), scenarios that serve as rules for patterns of dialogue with the user. The dialogue processing unit 22 basically selects, from the scenarios stored in the scenario database 26, the one to be used for the dialogue with the user based on the language processing result from the language processing unit 21, and generates content information according to that scenario.

【００６３】 Specifically, for a goal-oriented task such as programming a video recording, a scenario such as the following is stored.

【００６４】
(action(Question(date,start_time,end_time,channel)))
(date ???)        # date
(start_time ???)  # start time
(end_time ???)    # end time
(channel ???)     # channel
... (1)

【００６５】 According to scenario (1), when the language processing result from the language processing unit 21 represents a request for recording, the dialogue processing unit 22 generates content information for asking, in this order, the date of the recording, the time at which the recording starts, the time at which it ends, and the channel to be recorded.
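The slot-by-slot questioning that scenario (1) describes can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the names `SLOT_ORDER`, `new_frame`, `next_question`, and `run_dialogue` are hypothetical, and `None` is assumed to play the role of the `???` placeholder.

```python
from typing import Dict, List, Optional

# Slot order as listed in scenario (1).
SLOT_ORDER: List[str] = ["date", "start_time", "end_time", "channel"]


def new_frame() -> Dict[str, Optional[str]]:
    """Create an empty frame; None plays the role of the '???' placeholder."""
    return {name: None for name in SLOT_ORDER}


def next_question(frame: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the first slot that is still unfilled, or None when the frame is complete."""
    for name in SLOT_ORDER:
        if frame[name] is None:
            return name
    return None


def run_dialogue(replies: Dict[str, str]) -> List[str]:
    """Ask for each slot in scenario order; `replies` stands in for the user's answers."""
    frame = new_frame()
    asked: List[str] = []
    while True:
        slot = next_question(frame)
        if slot is None:
            return asked
        asked.append(slot)
        frame[slot] = replies[slot]
```

Calling `run_dialogue` with a reply for every slot returns the slot names in the order in which they were asked.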

【００６６】 As a scenario for conducting a purposeless dialogue (so-called chat), for example, the following is stored.

【００６７】
If X exist then speak (Y)      # X: keyword, Y: response sentence
(お金 何が欲しいの)             # (X Y): keyword "money", response "What do you want?"
(食べたい お腹がすいているの)   # keyword "want to eat", response "Are you hungry?"
... (2)

【００６８】 According to scenario (2), if the language processing result from the language processing unit 21 contains the keyword "お金" ("money"), the dialogue processing unit 22 generates content information for asking the question "What do you want?". If the language processing result contains the keyword "食べたい" ("want to eat"), the dialogue processing unit 22 generates content information for asking the question "Are you hungry?".
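The keyword rule of scenario (2) amounts to a lookup over (X, Y) pairs. The sketch below is hypothetical: it uses English stand-ins for the Japanese keywords, matches by simple substring search (the patent instead matches against the language processing result), and the names `RULES` and `respond` are not from the source.

```python
from typing import List, Optional, Tuple

# (X, Y) pairs: keyword X and response Y, paraphrasing scenario (2) in English.
RULES: List[Tuple[str, str]] = [
    ("money", "What do you want?"),
    ("want to eat", "Are you hungry?"),
]


def respond(utterance: str) -> Optional[str]:
    """Return the response Y of the first rule whose keyword X occurs in the utterance."""
    text = utterance.lower()
    for keyword, response in RULES:
        if keyword in text:
            return response
    return None  # no rule applies
```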

【００６９】 The dialogue processing unit 22 also decides which scenario to use based not only on the language processing result from the language processing unit 21 but also on the emotion information held in the user emotion information recording unit 9. For example, suppose the language processing result indicates that the user has greeted the system. If the emotion information indicates that "fun" or "happiness" is at a normal level, or that "anger" or "irritation" is high, the dialogue processing unit 22 decides to use a scenario that simply returns the greeting by saying "Hello" to the user. If, in the same situation, the emotion information indicates that "fun" or "happiness" is high, the dialogue processing unit 22 decides to use a scenario that asks the user, "Did something good happen?"
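The greeting example can be sketched as a small decision function. The threshold `HIGH`, the emotion key names, and the rule that irritation takes precedence when both groups are high are assumptions made for illustration; the source does not specify them.

```python
from typing import Dict

HIGH = 0.7  # hypothetical threshold for a "high" emotion level on a 0-to-1 scale


def choose_greeting_scenario(emotion: Dict[str, float]) -> str:
    """Pick the greeting response from emotion levels in the range 0 to 1."""
    irritated = max(emotion.get("anger", 0.0), emotion.get("irritation", 0.0)) >= HIGH
    happy = max(emotion.get("fun", 0.0), emotion.get("happiness", 0.0)) >= HIGH
    if happy and not irritated:
        # High "fun" or "happiness": ask about it.
        return "Did something good happen?"
    # Normal level, or high "anger"/"irritation": simply return the greeting.
    return "Hello"
```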

【００７０】 In addition to scenarios, the scenario database 26 also stores general knowledge for conducting a dialogue with the user. For example, the scenario database 26 stores, as general knowledge, information instructing that a greeting be returned when the language processing result from the language processing unit 21 indicates that the user has greeted the system. The scenario database 26 also stores, as general knowledge, subjects (topics) to be used during chat, for example.

【００７１】 Furthermore, the dialogue processing unit 22 stores, in the history database 25 as the dialogue history, the language processing result from the language processing unit 21, the content information it has generated, information about the scenario used to generate that content information, and the like.

【００７２】 The dialogue processing unit 22 also refers to the dialogue history as necessary, which allows it to cope with cases such as one in which an error in the speech recognition result, or in the understanding of its meaning, is discovered later.

【００７３】 Next, FIG. 5 shows a configuration example of the sentence generation unit 4 of FIG. 1.

【００７４】 The text sentence generation unit 31 is supplied with content information from the dialogue management unit 3. Referring as necessary to the dictionary database 34 and the generation grammar database 35, the text sentence generation unit 31 generates a text output sentence that corresponds to (conforms to) the content information.

【００７５】 That is, the dictionary database 34 stores a word dictionary describing, for each word, its part of speech, reading, accent, and similar information. The generation grammar database 35 stores templates of example output sentences, together with generation grammar rules such as the inflection rules for the words needed to generate an output sentence and constraints on word order. The text sentence generation unit 31 selects the template corresponding to the content information and then selects the necessary words from the word dictionary. It then refers to the generation grammar rules and fits the words into the template while changing word endings and the like as appropriate, thereby generating an output sentence corresponding to the content information.

【００７６】 The text sentence generation unit 31 is also supplied with the emotion information held in the user emotion information recording unit 9, and changes the wording of the output sentence based on that emotion information. That is, the generation grammar database 35 stores templates that have the same content but differ in wording, and the text sentence generation unit 31 selects, from among such templates of the same content, one with a particular wording based on the emotion information. The text sentence generation unit 31 likewise selects the words to be fitted into the template, choosing those with a particular wording based on the emotion information. It also changes word endings and the like based on the emotion information.

【００７７】 As a result, for example, when the emotion information indicates that the level of "anger" or "sadness" is high, the text sentence generation unit 31 generates an output sentence with relatively polite wording. When the emotion information indicates that the level of "fun" or "joy" is high, the text sentence generation unit 31 generates an output sentence with casual wording.
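Selecting among same-content templates by emotion might look like the sketch below. The template strings, the register names, and the threshold `HIGH` are hypothetical stand-ins for the contents of the generation grammar database 35, which the source does not list.

```python
from typing import Dict

# Hypothetical templates with the same content but different wording.
TEMPLATES: Dict[str, str] = {
    "polite": "Could you please tell me the {slot}?",
    "casual": "So, what's the {slot}?",
    "neutral": "What is the {slot}?",
}
HIGH = 0.7  # hypothetical threshold on a 0-to-1 emotion scale


def select_register(emotion: Dict[str, float]) -> str:
    """Polite wording for high anger/sadness, casual wording for high fun/joy."""
    if max(emotion.get("anger", 0.0), emotion.get("sadness", 0.0)) >= HIGH:
        return "polite"
    if max(emotion.get("fun", 0.0), emotion.get("joy", 0.0)) >= HIGH:
        return "casual"
    return "neutral"


def generate(slot: str, emotion: Dict[str, float]) -> str:
    """Fill the selected template with the requested slot name."""
    return TEMPLATES[select_register(emotion)].format(slot=slot)
```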

【００７８】 As a method of generating the output sentence, besides the method using templates, a method based on case structure, for example, can also be adopted.

【００７９】 After generating the output sentence, the text sentence generation unit 31 performs morphological analysis, syntactic analysis, and the like on it, and extracts the information needed for the rule-based speech synthesis performed by the rule synthesis unit 32 in the subsequent stage. The information needed for rule-based speech synthesis includes, for example, prosodic information such as pause positions and information for controlling accent and intonation, and phonological information such as the pronunciation of each word.

【００８０】 The information obtained by the text sentence generation unit 31 is supplied to the rule synthesis unit 32. Using the phoneme segment database 36, the rule synthesis unit 32 generates speech data (digital data) of synthesized speech corresponding to the output sentence generated by the text sentence generation unit 31.

【００８１】 That is, the phoneme segment database 36 stores phoneme segment data in forms such as CV (Consonant, Vowel), VCV, and CVC. Based on the information from the text sentence generation unit 31, the rule synthesis unit 32 concatenates the necessary phoneme segment data and then adds pauses, accent, intonation, and the like as appropriate, thereby generating speech data of synthesized speech corresponding to the output sentence generated by the text sentence generation unit 31.

【００８２】 The rule synthesis unit 32 is also supplied with the emotion information held in the user emotion information recording unit 9. Based on the emotion information, it controls the prosodic information added to the concatenated phoneme segment data, such as pauses, accent, and intonation, as well as speaking rate and pitch frequency. As a result, for example, when the emotion information indicates that the user is excited, the rule synthesis unit 32 generates speech data of synthesized speech with a slow, calm tone. When the emotion information indicates that the user seems to be enjoying themselves, the rule synthesis unit 32 generates speech data of synthesized speech with a similarly cheerful tone.

【００８３】 The relationship between emotion and speech is described in detail in, for example, Maekawa, "Transmission of paralinguistic information by speech: from the standpoint of linguistics," Proceedings of the Acoustical Society of Japan 1997 Autumn Meeting, 1-3-10, pp. 381-384, September 1997.

【００８４】 The speech data of the synthesized speech obtained by the rule synthesis unit 32 is supplied to the DA (Digital Analog) conversion unit 33, where it is converted into a speech signal in analog form. This speech signal is supplied to the voice output unit 5, which outputs synthesized speech corresponding to the output sentence generated by the text sentence generation unit 31.

【００８５】 Next, FIG. 6 shows a configuration example of the user emotion information updating unit 8 of FIG. 1.

【００８６】 The prosodic information output by the voice recognition unit 2 is supplied to the prosody information processing unit 41, the concept information output by the dialogue management unit 3 to the concept information processing unit 42, the face image information output by the image input unit 6 to the image information processing unit 43, and the physiological information output by the physiological information input unit 7 to the physiological information processing unit 44.

【００８７】 The prosody information processing unit 41 estimates the user's emotion by processing the prosodic information supplied to it, and outputs emotion information as the estimation result to the update processing unit 45.

【００８８】 As a method of estimating the user's emotion from the prosodic information of the user's speech, the method described in, for example, Japanese Unexamined Patent Application Publication No. H10-55194 can be used.

【００８９】 The concept information processing unit 42 estimates the user's emotion by processing the concept information supplied to it, and outputs emotion information as the estimation result to the update processing unit 45. That is, based on the concept information, the concept information processing unit 42 counts how often words belonging to the concepts that represent each emotion, such as "joy" and "anger," have appeared in the dialogue between the user and the dialogue system. It then estimates the user's emotion based on those frequencies of appearance and outputs emotion information as the estimation result to the update processing unit 45.
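A minimal sketch of the frequency-based estimate follows. The thesaurus fragment `THESAURUS` is hypothetical, and turning counts into relative frequencies is an assumption; the source only states that the estimate is based on the frequency of appearance.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical thesaurus fragment: word -> emotion concept it belongs to.
THESAURUS: Dict[str, str] = {
    "great": "joy",
    "happy": "joy",
    "stupid": "anger",
    "annoying": "anger",
}


def estimate_from_concepts(words: Iterable[str]) -> Dict[str, float]:
    """Count emotion-concept words and return each emotion's relative frequency."""
    counts = Counter(THESAURUS[w] for w in words if w in THESAURUS)
    total = sum(counts.values())
    if total == 0:
        return {}  # no emotion-bearing words observed
    return {emotion: n / total for emotion, n in counts.items()}
```

The word list would contain words from both the user's recognized speech and the system's own output, as the preceding paragraphs describe.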

【００９０】 The image information processing unit 43 estimates the user's emotion by processing the face image information supplied to it, and outputs emotion information as the estimation result to the update processing unit 45.

【００９１】 FIG. 7 shows a configuration example of the image information processing unit 43 of FIG. 6.

【００９２】 The face image information is supplied to the feature extraction unit 51, which extracts feature quantities from it. That is, the feature extraction unit 51 applies, for example, a wavelet transform to the face image information, obtains a feature vector whose components are coefficients representing spatial frequency components, and supplies it to the vector quantization unit 52.

【００９３】 The vector quantization unit 52 vector-quantizes the feature vector from the feature extraction unit 51 according to a codebook stored in the codebook database 54, thereby obtaining a one-dimensional symbol (sequence).

【００９４】 That is, the codebook database 54 stores codebooks obtained by learning with face images in each emotional state, such as happy, angry, surprised, and sad states. Here, in order to increase quantization accuracy, a separate codebook is created and stored for each emotion, for example a codebook for joy and a codebook for anger.

【００９５】 The vector quantization unit 52 then vector-quantizes the feature vector from the feature extraction unit 51 according to each emotion's codebook stored in the codebook database 54, obtains symbols (the codes assigned to the code vectors of the codebook), and outputs them to the matching unit 53. The matching unit 53 is therefore supplied with a symbol for each emotion as the vector quantization result.
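Quantizing the same feature vectors against one codebook per emotion can be sketched as below. Nearest-neighbor search with squared Euclidean distance is the usual form of vector quantization but is an assumption here, since the source does not name the distance measure; the function names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def quantize(vector: Vector, codebook: List[Vector]) -> int:
    """Return the index (symbol) of the code vector nearest to `vector`."""
    def dist2(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: dist2(vector, codebook[i]))


def quantize_per_emotion(
    frames: List[Vector], codebooks: Dict[str, List[Vector]]
) -> Dict[str, List[int]]:
    """Quantize every frame with each emotion's codebook, giving one symbol sequence per emotion."""
    return {
        emotion: [quantize(frame, codebook) for frame in frames]
        for emotion, codebook in codebooks.items()
    }
```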

【００９６】 Using the symbols from the vector quantization unit 52 and referring to the HMM database 55, the matching unit 53 performs matching to determine which emotional state, for example happy, angry, surprised, or sad, the face in the face image information belongs to.

【００９７】 That is, the HMM database 55 stores a model (HMM) of the face for each emotion, obtained by learning with face images in each emotional state, such as happy, angry, surprised, and sad states.

【００９８】 The matching unit 53 then uses the Viterbi method to find the model with the highest probability of observing the symbol sequence obtained from the vector quantization unit 52. The matching unit 53 estimates the emotion corresponding to that model to be the user's emotion, and outputs emotion information as the estimation result to the update processing unit 45.

【００９９】 In the matching unit 53, the probability of observing the symbol sequence obtained from the vector quantization unit 52 is calculated separately for each emotion. For example, the probability of observing the symbol sequence obtained by vector quantization with the codebook for joy is calculated using the HMM trained with images of happy faces (the HMM for joy). Likewise, the probability of observing the symbol sequence obtained by vector quantization with the codebook for anger is calculated using the HMM trained with images of angry faces (the HMM for anger).
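The per-emotion matching can be sketched with a discrete HMM scored by the Viterbi method. The `HMM` container and function names are hypothetical, the models are assumed to be already trained, and each symbol sequence is assumed to be non-empty.

```python
import math
from typing import Dict, List, NamedTuple


class HMM(NamedTuple):
    start: List[float]        # start[i]: probability of starting in state i
    trans: List[List[float]]  # trans[i][j]: probability of moving from state i to j
    emit: List[List[float]]   # emit[i][k]: probability of state i emitting symbol k


def _log(p: float) -> float:
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log_score(model: HMM, symbols: List[int]) -> float:
    """Log probability of the most likely state path for a non-empty symbol sequence."""
    n = len(model.start)
    score = [_log(model.start[i]) + _log(model.emit[i][symbols[0]]) for i in range(n)]
    for s in symbols[1:]:
        score = [
            max(score[i] + _log(model.trans[i][j]) for i in range(n))
            + _log(model.emit[j][s])
            for j in range(n)
        ]
    return max(score)


def match_emotion(models: Dict[str, HMM], sequences: Dict[str, List[int]]) -> str:
    """Score each emotion's own symbol sequence with that emotion's HMM; return the best emotion."""
    return max(models, key=lambda e: viterbi_log_score(models[e], sequences[e]))
```

Note that `sequences` is keyed by emotion because each emotion's symbol sequence comes from that emotion's own codebook.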

【０１００】 The method of estimating emotion from face image information as described above is described in detail in, for example, Sakaguchi, Ohya, and Kishino, "Facial expression recognition from moving face images using hidden Markov models," Journal of the Institute of Television Engineers of Japan, Vol. 49, No. 8, pp. 1060-1067, August 1995.

【０１０１】 As another method of estimating emotion from face image information, the method described in, for example, Sakaguchi and Morishima, "Real-time recognition of basic facial expressions based on spatial frequency information," Proceedings of the 2nd Intelligent Information Media Symposium, pp. 75-82, December 1996, can also be adopted.

【０１０２】 Returning to FIG. 6, the physiological information processing unit 44 estimates the user's emotion by processing the physiological information supplied to it, and outputs emotion information as the estimation result to the update processing unit 45. One method of estimating the user's emotion from physiological information is, for example, to determine statistically in advance a function that represents the correlation between each emotion and physiological information such as pulse rate and amount of perspiration, and to perform the estimation using that function.
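One possible shape for such a pre-fitted function is a clamped linear model, sketched below. The linear form, the coefficient values in `COEFFS`, and the emotion names are purely illustrative assumptions; the source says only that the function is obtained statistically in advance.

```python
from typing import Dict, Tuple


def clamp01(x: float) -> float:
    """Limit a value to the range 0 to 1."""
    return max(0.0, min(1.0, x))


# Hypothetical coefficients (bias, weight per bpm, weight per unit of perspiration),
# standing in for values that would be fitted statistically in advance.
COEFFS: Dict[str, Tuple[float, float, float]] = {
    "excitement": (-1.0, 0.015, 0.5),
    "calm": (2.0, -0.015, -0.5),
}


def estimate_from_physiology(pulse_bpm: float, perspiration: float) -> Dict[str, float]:
    """Map pulse rate and perspiration to an emotion level per emotion."""
    return {
        emotion: clamp01(bias + w_pulse * pulse_bpm + w_sweat * perspiration)
        for emotion, (bias, w_pulse, w_sweat) in COEFFS.items()
    }
```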

【０１０３】 The update processing unit 45 uses the emotion information from the prosody information processing unit 41, the concept information processing unit 42, the image information processing unit 43, and the physiological information processing unit 44 together to obtain a final update value for the emotion information held in the user emotion information recording unit 9, and updates the emotion information in the user emotion information recording unit 9 with that value. That is, the update processing unit 45 computes, for example, a weighted sum of the emotion information for each emotion from the prosody information processing unit 41, the concept information processing unit 42, the image information processing unit 43, and the physiological information processing unit 44, and normalizes it to calculate the final emotion information for each emotion. The update processing unit 45 then updates the emotion information in the user emotion information recording unit 9 with this final emotion information.

【０１０４】 FIG. 8 shows the emotion information held by the user emotion information recording unit 9. The emotion information for each emotion expresses the degree of that emotion as a real number in the range of, for example, 0 to 1; the larger the value, the stronger the emotion (and the smaller the value, the weaker the emotion). The update processing unit 45 updates this value for each emotion.
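The weighted sum and normalization performed by the update processing unit 45 can be sketched as follows. Dividing by the total weight is one assumed way to normalize so that the result stays in the range 0 to 1; the source names, weights, and the function name `fuse` are hypothetical.

```python
from typing import Dict, Mapping


def fuse(
    estimates: Mapping[str, Mapping[str, float]],
    weights: Mapping[str, float],
) -> Dict[str, float]:
    """Weighted sum per emotion over all sources, divided by the total weight."""
    total_weight = sum(weights[source] for source in estimates)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    emotions = sorted({e for est in estimates.values() for e in est})
    return {
        e: sum(weights[src] * est.get(e, 0.0) for src, est in estimates.items())
        / total_weight
        for e in emotions
    }
```

An emotion that a source does not report is treated as 0 for that source, which is also an assumption.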

【０１０５】 Next, the processing of the user emotion information updating unit 8 of FIG. 6 (the emotion information update processing) will be described with reference to the flowchart of FIG. 9.

【０１０６】 First, in step S11, the prosody information processing unit 41, the concept information processing unit 42, the image information processing unit 43, and the physiological information processing unit 44 each estimate the user's emotion as described above and output emotion information as the estimation result to the update processing unit 45.

【０１０７】 In step S12, the update processing unit 45 uses the emotion information from the prosody information processing unit 41, the concept information processing unit 42, the image information processing unit 43, and the physiological information processing unit 44 together to obtain the final update value for the emotion information held in the user emotion information recording unit 9. The processing then proceeds to step S13, where the emotion information in the user emotion information recording unit 9 is updated with that value, and the processing ends.

【０１０８】 The series of processes described above can be performed by dedicated hardware or by software. When the series of processes is performed by software, the program constituting the software is installed on a general-purpose computer or the like.

【０１０９】 FIG. 10 shows a configuration example of an embodiment of a computer on which the program for executing the series of processes described above is installed.

【０１１０】 The program can be recorded in advance on the hard disk 105 or in the ROM 103, which serve as recording media built into the computer.

【０１１１】 Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium 111 such as a floppy disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto-optical) disc, a DVD (Digital Versatile Disc), a magnetic disk, or a semiconductor memory. Such a removable recording medium 111 can be provided as so-called packaged software.

【０１１２】 In addition to being installed on the computer from the removable recording medium 111 as described above, the program can be transferred to the computer wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or transferred to the computer by wire via a network such as a LAN (Local Area Network) or the Internet. The computer can receive the program transferred in this way at the communication unit 108 and install it on the built-in hard disk 105.

【０１１３】 The computer incorporates a CPU (Central Processing Unit) 102. An input/output interface 110 is connected to the CPU 102 via a bus 101. When a command is input via the input/output interface 110 by the user operating an input unit 107 composed of a keyboard, a mouse, and the like, the CPU 102 executes the program stored in the ROM (Read Only Memory) 103 accordingly. Alternatively, the CPU 102 loads into the RAM (Random Access Memory) 104 and executes one of the following: a program stored on the hard disk 105; a program transferred from a satellite or a network, received by the communication unit 108, and installed on the hard disk 105; or a program read from the removable recording medium 111 mounted in the drive 109 and installed on the hard disk 105. The CPU 102 thereby performs the processing according to the flowchart described above or the processing carried out by the configurations of the block diagrams described above. Then, as necessary, the CPU 102, for example, outputs the processing result via the input/output interface 110 from an output unit 106 composed of an LCD (Liquid Crystal Display), a speaker, and the like, transmits it from the communication unit 108, or records it on the hard disk 105.

【０１１４】 In this specification, the processing steps describing the program that causes the computer to perform the various kinds of processing need not necessarily be processed in time series in the order described in the flowchart. They also include processing executed in parallel or individually (for example, parallel processing or object-based processing).

【０１１５】 The program may be processed by a single computer or processed in a distributed manner by a plurality of computers. The program may also be transferred to a remote computer and executed there.

【０１１６】 As described above, the user's emotion is estimated based at least on the concepts of the words and phrases included in the user's speech recognition result, so the user's emotion can be estimated with relatively good accuracy. Because the user's emotion is additionally estimated from the prosodic information, the face image information, and the physiological information, it can be estimated with still better accuracy. Furthermore, because the output sentence is generated based on the result of that emotion estimation, output sentences rich in variation can be provided to the user according to the user's emotional state.

【０１１７】なお、本実施の形態では、音声入力部１に
入力された音（音声）について、音声認識を行うように
したが、音声入力部１に入力された音については、音声
認識を行わずに、例えば、その音が、机を叩いている音
であるとか、ユーザの息づかいであるといったことを検
出し、その検出結果に基づいて、ユーザの感情を推定す
ることも可能である。即ち、例えば、机を叩いているこ
とが連続して検出された場合には、ユーザが怒っている
ことを推定することができる。また、例えば、息づかい
が荒いことが検出された場合には、ユーザが興奮してい
ることを推定することができる。そして、この場合、そ
のような推定結果に基づいて、「怒り」や「興奮」を表
す感情情報の値を大きくするような、アドホック(ad ho
c)な更新ルールを適用することができる。In the present embodiment, speech recognition is performed on the sound (voice) input to the voice input unit 1. However, instead of performing speech recognition on that sound, it is also possible to detect, for example, that the sound is the sound of a desk being struck or the sound of the user's breathing, and to estimate the user's emotion based on the detection result. That is, for example, when desk striking is detected repeatedly in succession, it can be inferred that the user is angry. Likewise, when rough breathing is detected, it can be inferred that the user is excited. In these cases, an ad hoc update rule that increases the value of the emotion information representing "anger" or "excitement" can be applied based on the inference.
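The ad hoc update rules described in this paragraph could be sketched roughly as follows. This is only an illustrative sketch and not the implementation of the embodiment: the event labels (`desk_hit`, `rough_breathing`), the number of consecutive detections treated as "in succession", the increment size, and the 0-to-1 value range are all assumptions introduced here, since the source gives no concrete values.

```python
from dataclasses import dataclass, field

# Hypothetical parameters; the source does not specify any of these.
CONSECUTIVE_HITS_FOR_ANGER = 3  # desk hits in a row that count as "in succession"
STEP = 0.2                      # amount one rule firing raises an emotion value


@dataclass
class EmotionInfo:
    """Emotion information as named values, assumed here to lie in [0, 1]."""
    values: dict = field(
        default_factory=lambda: {"anger": 0.0, "excitement": 0.0}
    )

    def increase(self, name: str, amount: float) -> None:
        # Raise one emotion value, clamping at the assumed upper bound of 1.0.
        self.values[name] = min(1.0, self.values[name] + amount)


def apply_ad_hoc_rules(events, emotion: EmotionInfo) -> EmotionInfo:
    """Update emotion information from a sequence of non-speech sound events."""
    run = 0  # length of the current run of consecutive desk hits
    for event in events:
        if event == "desk_hit":
            run += 1
            if run >= CONSECUTIVE_HITS_FOR_ANGER:
                emotion.increase("anger", STEP)
        else:
            run = 0  # any other event interrupts the run of desk hits
            if event == "rough_breathing":
                emotion.increase("excitement", STEP)
    return emotion
```

For example, `apply_ad_hoc_rules(["desk_hit"] * 3, EmotionInfo())` raises only the "anger" value, whereas a single desk hit followed by `rough_breathing` raises only "excitement". Keeping the rules as simple threshold checks mirrors the "ad hoc" character the text describes; a real system would presumably tune or replace them.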

【０１１８】さらに、対話管理部３においては、感情状
態に応じて、出力文の生成回数を制御することにより、
ユーザに対する発話の回数を変化させることが可能であ
る。具合的には、例えば、ユーザが楽しそうな状態にあ
る場合には、例えば、相づちの回数を増やしたり、その
他、対話システムからの発話回数を増やして、積極的
に、ユーザとの対話を行うようにすることが可能であ
る。また、例えば、ユーザが悲しそうな状態にある場合
には、対話システムからの発話回数を減らして、ユーザ
に煩わしさを感じさせないようにすることが可能であ
る。Further, by controlling the number of output sentences generated in accordance with the emotional state, the dialogue management unit 3 can change the number of utterances made to the user. Specifically, for example, when the user appears to be enjoying the conversation, the dialogue system can increase the number of back-channel responses (aizuchi), or otherwise increase the number of its utterances, so as to engage actively in the dialogue with the user. Conversely, when the user appears to be sad, the dialogue system can reduce the number of its utterances so that the user does not find it bothersome.
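One way to picture this control of the number of utterances is the sketch below. It is a minimal illustration under stated assumptions, not the method of the embodiment: the emotion labels `joy` and `sadness`, the 0.5 threshold, the base count of 2, and the function names are all hypothetical.

```python
def utterance_budget(emotion: dict, base: int = 2) -> int:
    """Return how many utterances (including back-channel responses) to make.

    `emotion` maps assumed emotion labels to values in [0, 1].
    """
    joy = emotion.get("joy", 0.0)
    sadness = emotion.get("sadness", 0.0)
    if joy > sadness and joy >= 0.5:
        # User seems to be enjoying the dialogue: speak more actively.
        return base + 2
    if sadness > joy and sadness >= 0.5:
        # User seems sad: speak less so as not to be bothersome.
        return max(1, base - 1)
    return base


def select_outputs(candidates, emotion):
    """Keep only as many candidate output sentences as the budget allows."""
    return candidates[: utterance_budget(emotion)]
```

Here the dialogue manager is assumed to have already produced an ordered list of candidate output sentences; the emotional state then only decides how many of them are actually uttered.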

【０１１９】また、本実施の形態では、ユーザからの音
声を音声認識し、その音声認識結果に対する応答として
の発話を行うようにしたが、その他、例えば、ユーザが
キーボードを操作することにより入力される文に対し
て、応答を行うようにすることも可能である。Further, in the present embodiment, the user's voice is recognized and an utterance is made in response to the speech recognition result. However, it is also possible, for example, to respond to a sentence that the user inputs by operating a keyboard.

【０１２０】さらに、本実施の形態では、ユーザに対す
る応答等を、合成音で出力するようにしたが、その他、
例えば、テキスト等で表示するようにすることも可能で
ある。Further, in the present embodiment, responses and the like to the user are output as synthesized sound, but they may instead be displayed, for example, as text.

【０１２１】また、本発明は、例えば、ディスプレイに
表示される仮想的なキャラクタや、あるいは実体のある
ロボット等とユーザとの間のユーザインタフェースとし
て用いることが可能である。この場合、ユーザに対する
応答等として、上述したように合成音を出力する他、仮
想的なキャラクタの表示状態を変えたり、ロボットに所
定の動作を行わせることで、マルチモーダルなインタフ
ェースを実現することができる。Further, the present invention can be used, for example, as a user interface between a user and a virtual character displayed on a display, or between a user and a physical robot or the like. In this case, a multimodal interface can be realized by not only outputting synthesized sound as described above as the response to the user, but also changing the display state of the virtual character or causing the robot to perform a predetermined action.

【０１２２】[0122]

【発明の効果】本発明の対話処理装置および対話処理方
法、並びに記録媒体によれば、ユーザから入力された語
句の概念が抽出され、その概念に基づいて、ユーザの感
情が推定される。そして、その結果得られる感情情報に
基づいて、ユーザに出力する出力文が生成される。従っ
て、ユーザの感情の状態によって、例えば、バリエーシ
ョンに富んだ対話を行うことが可能となる。[Effects of the Invention] According to the dialogue processing apparatus, the dialogue processing method, and the recording medium of the present invention, the concept of a phrase input by the user is extracted, and the user's emotion is estimated based on that concept. An output sentence to be output to the user is then generated based on the resulting emotion information. Accordingly, it becomes possible, for example, to conduct a dialogue rich in variation according to the user's emotional state.

[Brief description of the drawings]

【図１】本発明を適用した対話システムの一実施の形態
の構成例を示すブロック図である。FIG. 1 is a block diagram illustrating a configuration example of an embodiment of a dialogue system to which the present invention has been applied.

【図２】図１の対話システムの処理を説明するためのフ
ローチャートである。FIG. 2 is a flowchart illustrating the processing of the dialogue system of FIG. 1.

【図３】図１の音声認識部２の構成例を示すブロック図
である。FIG. 3 is a block diagram illustrating a configuration example of the speech recognition unit 2 of FIG. 1.

【図４】図１の対話管理部３の構成例を示すブロック図
である。FIG. 4 is a block diagram illustrating a configuration example of the dialogue management unit 3 of FIG. 1.

【図５】図１の文生成部４の構成例を示すブロック図で
ある。FIG. 5 is a block diagram illustrating a configuration example of the sentence generation unit 4 of FIG. 1.

【図６】図１のユーザ感情情報更新部８の構成例を示す
ブロック図である。FIG. 6 is a block diagram illustrating a configuration example of the user emotion information updating unit 8 of FIG. 1.

【図７】図６の画像情報処理部４３の構成例を示すブロ
ック図である。FIG. 7 is a block diagram illustrating a configuration example of the image information processing unit 43 of FIG. 6.

【図８】感情情報を示す図である。FIG. 8 is a diagram showing emotion information.

【図９】図６のユーザ感情情報更新部８の処理を説明す
るためのフローチャートである。FIG. 9 is a flowchart illustrating the processing of the user emotion information updating unit 8 of FIG. 6.

【図１０】本発明を適用したコンピュータの一実施の形
態の構成例を示すブロック図である。FIG. 10 is a block diagram illustrating a configuration example of an embodiment of a computer to which the present invention is applied.

[Explanation of symbols]

１音声入力部，２音声認識部，３対話管理
部，４文生成部，５音声出力部，６画像入力
部，７生理情報入力部，８ユーザ感情情報更新
部，９ユーザ感情情報記録部，１１ＡＤ変換
部，１２特徴抽出部，１３マッチング部，１
４音響モデルデータベース，１５辞書データベー
ス，１６文法データベース，２１言語処理部，
２２対話処理部，２３シソーラスデータベー
ス，２４言語処理用データベース，２５履歴デ
ータベース，２６シナリオデータベース，３１
テキスト文生成部，３２規則合成部，３３ＤＡ
変換部，３４辞書データベース，３５生成用文
法データベース，３６音素片データベース，４１
韻律情報処理部，４２概念情報処理部，４３画
像情報処理部，４４生理情報処理部，５１特徴抽
出部，５２ベクトル量子化部，５３マッチング
部，５４コードブックデータベース，５５ＨＭ
Ｍデータベース，１０１バス，１０２ CPU，
１０３ ROM，１０４ RAM，１０５ハードディス
ク，１０６出力部，１０７入力部，１０８
通信部，１０９ドライブ，１１０入出力インタ
フェース，１１１リムーバブル記録媒体 1 voice input unit, 2 speech recognition unit, 3 dialogue management unit, 4 sentence generation unit, 5 voice output unit, 6 image input unit, 7 physiological information input unit, 8 user emotion information updating unit, 9 user emotion information recording unit, 11 AD conversion unit, 12 feature extraction unit, 13 matching unit, 14 acoustic model database, 15 dictionary database, 16 grammar database, 21 language processing unit, 22 dialogue processing unit, 23 thesaurus database, 24 language processing database, 25 history database, 26 scenario database, 31 text sentence generation unit, 32 rule synthesis unit, 33 DA conversion unit, 34 dictionary database, 35 generation grammar database, 36 phoneme segment database, 41 prosody information processing unit, 42 concept information processing unit, 43 image information processing unit, 44 physiological information processing unit, 51 feature extraction unit, 52 vector quantization unit, 53 matching unit, 54 codebook database, 55 HMM database, 101 bus, 102 CPU, 103 ROM, 104 RAM, 105 hard disk, 106 output unit, 107 input unit, 108 communication unit, 109 drive, 110 input/output interface, 111 removable recording medium

フロントページの続き (72)発明者田中幸東京都品川区北品川６丁目７番35号ソニー株式会社内 (72)発明者横野順東京都品川区北品川６丁目７番35号ソニー株式会社内 (72)発明者大江敏生東京都品川区北品川６丁目７番35号ソニー株式会社内Ｆターム(参考） 5D015 AA06 LL07 LL10 5D045 AB01 AB07 AB30 9A001 DZ11 FF03 HH17 HH18 HH33 Continued from the front page: (72) Inventor Miyuki Tanaka, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo; (72) Inventor Jun Yokono, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo; (72) Inventor Toshio Oe, c/o Sony Corporation, 6-7-35 Kita-Shinagawa, Shinagawa-ku, Tokyo; F-terms (reference): 5D015 AA06 LL07 LL10; 5D045 AB01 AB07 AB30; 9A001 DZ11 FF03 HH17 HH18 HH33

Claims

[Claims]

1. A dialogue processing apparatus for performing a dialogue with a user, comprising: concept extraction means for extracting a concept of a phrase input by the user; emotion estimation means for estimating the user's emotion based on the concept of the phrase input by the user and outputting emotion information representing the emotion; and output sentence generation means for generating, based on the emotion information, an output sentence to be output to the user.

2. The dialogue processing apparatus according to claim 1, wherein the emotion estimation means estimates the emotion of the user based on the output sentence.

3. The dialogue processing apparatus according to claim 1, wherein the emotion estimation means estimates the emotion of the user based on an image obtained by imaging the user.

4. The dialogue processing apparatus according to claim 1, wherein the emotion estimation means estimates the emotion of the user based on a physiological phenomenon of the user.

5. The dialogue processing apparatus according to claim 1, further comprising audio processing means for processing an audio signal obtained from outside, wherein the emotion estimation means estimates the emotion of the user based on a processing result of the audio processing means.

6. The dialogue processing apparatus according to claim 1, further comprising speech recognition means for recognizing the user's voice, wherein the concept extraction means extracts a concept of a phrase included in a speech recognition result of the user's voice.

7. The dialogue processing apparatus according to claim 6, wherein the emotion estimation means estimates the emotion of the user based on prosodic information of the user's voice.

8. The dialogue processing apparatus according to claim 1, wherein the output sentence generation means changes the expression of the output sentence based on the emotion information.

9. The dialogue processing apparatus according to claim 1, wherein the output sentence generation means changes the number of output sentences based on the emotion information.

10. The dialogue processing apparatus according to claim 9, wherein the output sentence is a back-channel response (aizuchi).

11. The dialogue processing apparatus according to claim 1, further comprising storage means for storing the emotion information, wherein the output sentence generation means generates the output sentence based on the emotion information stored in the storage means.

12. The dialogue processing apparatus according to claim 1, further comprising output sentence output means for outputting the output sentence.

13. The dialogue processing apparatus according to claim 12, wherein the output sentence output means outputs the output sentence as a synthesized sound.

14. The dialogue processing apparatus according to claim 13, wherein the output sentence output means controls the prosody of the synthesized sound based on the emotion information.

15. A dialogue processing method for performing a dialogue with a user, comprising: a concept extraction step of extracting a concept of a phrase input by the user; an emotion estimation step of estimating the user's emotion based on the concept of the phrase input by the user and outputting emotion information representing the emotion; and an output sentence generation step of generating, based on the emotion information, an output sentence to be output to the user.

16. A recording medium on which is recorded a program for causing a computer to perform dialogue processing for conducting a dialogue with a user, the program comprising: a concept extraction step of extracting a concept of a phrase input by the user; an emotion estimation step of estimating the user's emotion based on the concept of the phrase input by the user and outputting emotion information representing the emotion; and an output sentence generation step of generating, based on the emotion information, an output sentence to be output to the user.