JP2006313287A

JP2006313287A - Speech dialogue apparatus

Info

Publication number: JP2006313287A
Application number: JP2005136681A
Authority: JP
Inventors: Takayuki Yamaguchi; 隆幸山口; Shigeo Onoki; 重夫大野木; Masaaki Ichihara; 雅明市原
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2005-05-09
Filing date: 2005-05-09
Publication date: 2006-11-16

Abstract

PROBLEM TO BE SOLVED: To provide a speech dialogue apparatus capable of conducting a smooth dialogue with a user.

SOLUTION: The speech dialogue apparatus, which exchanges information with the user by speech, has an intention estimation function that estimates the user's intention contained in the user's utterance based on both linguistic information obtained by speech recognition processing of the utterance and non-linguistic information capable of representing the user's emotional or physiological state.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech dialogue apparatus that exchanges information with a user by voice.

Conventionally, a voice output device is known that comprises: input means, such as a touch panel or keyboard, for inputting position data of a destination; storage means storing road information such as map data and intersection data; a plurality of voice output means for outputting voice; current position detection means, such as GPS or a gyro sensor, for detecting the current position of the vehicle; route determination means for determining the route to the destination entered through the input means, based on the current position detected by the current position detection means and the road information stored in the storage means; and voice output control means that, when the route determination means determines that the vehicle should proceed to the right, controls the output so that the instruction voice from the voice output means is heard from the right, so that the driver can perceive the instruction intuitively (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent No. 2602158

A typical speech recognition apparatus performs speech recognition processing on a user's utterance using a given speech recognition engine and recognizes the user's intention based on the resulting recognition result (linguistic information). For example, if the user utters "Toyotashi-eki" (Toyotashi Station) when setting a destination in a navigation system, the apparatus recognizes, based on the recognition result (linguistic information) "Toyotashi-eki", that the user intends to go to Toyotashi Station. The user can thus convey the intention (the destination) to the speech recognition apparatus (the navigation system) without manual input.

In recent years, as the recognition performance of speech recognition apparatuses has improved, their ability to converse with users has also grown. Rather than simple dialogue applications in which an already decided destination is entered directly by voice, as in the preceding example, applications are now in use that settle the user's intention over several exchanges: for example, the user states a wish concerning the destination by voice, the system outputs a proposal in response, and the user replies by accepting or rejecting it. In such applications it is important for the system to grasp the user's intention quickly in order to keep the dialogue smooth, but this is difficult to achieve with the linguistic information alone.

Accordingly, an object of the present invention is to provide a speech dialogue apparatus capable of realizing a smoother dialogue with a user.

To solve the above problem, one aspect of the present invention provides a speech dialogue apparatus that exchanges information with a user by voice, characterized by having an intention estimation function that estimates the user's intention contained in the words uttered by the user based on both linguistic information obtained by speech recognition processing of the utterance and non-linguistic information capable of representing the user's emotion or physiological state.

In this aspect, the non-linguistic information may include image information obtained by image processing of an image of the user captured by a camera, and prosodic information obtained by analyzing the user's uttered voice. The non-linguistic information may also include physiological information of the user based on electrocardiogram or pulse measurements.

According to the present invention, the intention estimation function estimates the user's intention contained in the words uttered by the user based on both linguistic information obtained by speech recognition processing of the utterance and non-linguistic information capable of representing the user's emotion or physiological state, which yields a speech dialogue apparatus capable of realizing a smoother dialogue with the user.

The best mode for carrying out the present invention is described below with reference to the drawings.

FIG. 1 is a system configuration diagram showing an embodiment of a speech dialogue system incorporating a speech dialogue apparatus according to the present invention. The speech dialogue apparatus 10 comprises a dialogue control ECU 20, an in-vehicle microphone 30 that picks up sound (speech) in the vehicle cabin, an amplifier 42, a speaker 40, and a display 50. The amplifier 42 and the speaker 40 may be shared with the in-vehicle audio system.

The dialogue control ECU 20 performs overall control of the dialogue with the user. One direction of the dialogue, the response from the system side (dialogue control ECU 20) to the user, is realized by acoustic and/or visual output through the speaker 40 and/or the display 50. The other direction, the communication of intention from the user to the system, is realized in principle by voice input through the in-vehicle microphone 30.

As its basic configuration, the dialogue control ECU 20 comprises a CPU, a memory, and an A/D (analog-to-digital) converter connected via a bus. The memory stores the programs and data that realize the functions of the dialogue control ECU 20 described below.

Analog speech input to the in-vehicle microphone 30 undergoes predetermined processing such as amplification and noise removal in a microphone amplifier, is converted into a digital speech signal by the A/D converter, and is sent to the dialogue control ECU 20. The dialogue control ECU 20 extracts features from the speech signal and then obtains a recognition result by matching against given acoustic and language models. The present invention is not limited to any particular speech recognition method and is applicable to speech recognition processing using any software (speech recognition engine) on any hardware configuration.

An occupant state estimation ECU 70 is connected to the dialogue control ECU 20 via a suitable bus such as CAN (controller area network). An in-vehicle camera 60 and a biosensor 80 are connected to the occupant state estimation ECU 70, which acquires and generates non-linguistic information capable of representing the user's emotion or physiological state.

The in-vehicle camera 60 is, for example, a CCD camera and is mounted at a position from which it can image the users in the cabin (that is, the occupants, including the driver). The occupant state estimation ECU 70 receives and processes the image data captured by the in-vehicle camera 60 with a built-in image processor, thereby acquiring image information representing the user's facial expressions, gestures, mannerisms, and the like. Alternatively, the in-vehicle camera 60 may be a thermographic (infrared) camera. In that case, the occupant state estimation ECU 70 acquires image information representing the heat distribution over a predetermined region such as the user's face.

The biosensor 80 includes a sensor that measures the electrocardiogram or pulse. Specifically, the biosensor 80 is an electrocardiograph or heart rate monitor (for example, a heart rate sensor fitted to a wristband such as that of a wristwatch), a pulse meter that measures the pulse rate, or a sphygmomanometer that measures blood pressure. The biosensor 80 may be fitted to an item carried by the user, such as a wristwatch, or may be embedded in the steering wheel operated by the driver.

The occupant state estimation ECU 70 also analyzes the speech data input to the in-vehicle microphone 30, as described above, to acquire prosodic information representing the prosodic features of the user's uttered voice (pitch, intensity, loudness, duration, and the like).
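As an illustration only, the following Python sketch shows one way a few of the prosodic features named above (duration, intensity, pitch) could be computed from digitized samples. The patent does not specify any analysis method; the function name `prosodic_features`, the autocorrelation pitch estimate, and the 75 to 400 Hz search range are assumptions made for this sketch, and the input is assumed to be a non-empty list of samples.

```python
import math


def prosodic_features(samples, sample_rate, f0_min=75.0, f0_max=400.0):
    """Return duration (s), RMS intensity, and an autocorrelation pitch estimate (Hz).

    Hypothetical helper: the method and the pitch search range are assumptions.
    """
    n = len(samples)
    duration = n / sample_rate
    rms = math.sqrt(sum(s * s for s in samples) / n)

    # Search only the lags that correspond to the assumed pitch range.
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return {"duration_s": duration, "rms": rms, "pitch_hz": sample_rate / best_lag}


# Synthetic 200 Hz tone standing in for a voiced segment of an utterance.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE) for i in range(800)]
features = prosodic_features(tone, SAMPLE_RATE)
```

In a real system the features would be compared with the speaker's usual values rather than used as absolute numbers, since the description relies on changes relative to the normal state.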

FIG. 2 shows a characteristic operation flow realized by the speech dialogue apparatus of this embodiment.

As shown in FIG. 2, the speech dialogue apparatus of this embodiment performs intention estimation based on linguistic information, using the speech recognition result for the speech data input to the in-vehicle microphone 30 (step 100); performs emotion estimation based on non-linguistic information, using the speech recognition result (prosodic information and the like) or the occupant state detection result (step 110); and integrates these estimation results to perform the final intention estimation (step 120). That is, the speech dialogue apparatus of this embodiment has an intention estimation function that estimates the user's true intention contained in the words uttered by the user, based on both non-linguistic information representing the user's emotion and its changes and linguistic information representing the speech recognition result.

Specifically, in step 110, the dialogue control ECU 20 estimates whether the user is in a tense or excited state or in a calm state by analyzing the electrocardiogram and pulse data, that is, the physiological information from the biosensor 80. The dialogue control ECU 20 also estimates the user's emotion (for example, calm, anger, joy, or sadness) based on the prosodic information and image information. In doing so, the dialogue control ECU 20 uses a user database 22 that stores information specific to each user. The user database 22 holds reference physiological information for each user, classified according to the user's mental state (tense or excited, or calm) at the time it was recorded. Likewise, reference prosodic information and image information for each user are stored in the user database 22, classified according to the user's emotion at the time. These reference data may be acquired in advance or learned through actual use of the application. The dialogue control ECU 20 can thus estimate the user's mental state or emotion with high accuracy by matching the currently detected physiological and prosodic information against the data in the user database 22.
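The matching against per-user reference data can be pictured with a small Python sketch. It is not the patent's algorithm: the nearest-reference rule, the two-element feature vector (heart rate and a variability figure), the numeric values, and the names `USER_DATABASE`, `estimate_mental_state`, and `learn_reference` are all hypothetical and chosen only to show the idea of state-labelled reference data that can be preset or updated through use.

```python
import math

# Hypothetical reference data: user -> mental state -> (heart rate in bpm, variability).
USER_DATABASE = {
    "driver_a": {
        "calm": (65.0, 4.0),
        "tense": (85.0, 8.0),
        "excited": (105.0, 12.0),
    },
}


def estimate_mental_state(user_id, measurement, database=USER_DATABASE):
    """Return the stored state whose reference vector is closest to the measurement."""
    references = database[user_id]
    return min(references, key=lambda state: math.dist(references[state], measurement))


def learn_reference(user_id, state, measurement, database=USER_DATABASE, rate=0.2):
    """Move a stored reference toward a newly labelled measurement (assumed update rule)."""
    old = database[user_id][state]
    database[user_id][state] = tuple(o + rate * (m - o) for o, m in zip(old, measurement))
```

For example, `estimate_mental_state("driver_a", (68.0, 5.0))` selects the "calm" reference because it is the nearest stored vector for that user.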

When the final intention estimation is performed in step 120, the estimation results from step 100 and step 110 must of course be synchronized with each other. That is, the linguistic information for a given utterance is associated with the non-linguistic information obtained at the time of, or shortly before or after, that utterance.
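A minimal sketch of this association, assuming timestamped observations: the `Observation` record, the function `observations_for_utterance`, and the one-second margin around the utterance are illustrative assumptions, since the description does not state how wide the "before and after" window is.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    timestamp: float  # seconds since the start of the session
    kind: str         # e.g. "emotion" or "mental_state"
    value: str


def observations_for_utterance(start, end, observations, margin=1.0):
    """Select observations taken during the utterance or within `margin` seconds of it."""
    return [o for o in observations if start - margin <= o.timestamp <= end + margin]


log = [
    Observation(2.0, "emotion", "calm"),
    Observation(9.6, "emotion", "anger"),
    Observation(10.4, "mental_state", "tense"),
    Observation(15.0, "emotion", "calm"),
]
# Utterance spoken from t = 10.0 s to t = 11.0 s.
matched = observations_for_utterance(10.0, 11.0, log)
```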

FIG. 3 is a table showing an example of a method of integrating non-linguistic and linguistic information to grasp the user's intention (true intention). The example in FIG. 3 concerns the case where the user utters "nande" (なんで, "why").

The utterance "nande" can generally carry several meanings, such as a question about what the other party said, a reproach (pressing the other party sharply while blaming them), or doubt, so the user's intention cannot be estimated accurately from linguistic information alone.

In this embodiment, by contrast, as shown in FIG. 3, the user's intention contained in the utterance "nande" (a question, a reproach, or doubt) is estimated and determined based on the user's emotion estimated from prosodic and image information and the user's mental state estimated from physiological information, which makes accurate intention estimation possible.

For example, in FIG. 3, when the estimated emotion of the user who uttered "nande" is "calm", the word is estimated to have been used in the sense of a question. When the estimated emotion is "anger", the word is estimated to have been used in the sense of a reproach. Likewise, when the estimated emotion is "joy" or "sadness", the word is estimated to have been used in the sense of doubt.

The user's emotion may be estimated as "calm" based on the prosodic and image information when, for example, the prosody is the same as usual and/or there is no particular change in facial expression or gestures. When the prosody changes markedly and the facial expression or gestures show features of anger (for example, eyes slanting upward or a pouting mouth), the user's emotion may be estimated as "anger". Similarly, when the prosody changes markedly and the facial expression or gestures show features of joy (for example, smiling or laughing, or clapping the hands), the user's emotion may be estimated as "joy". When the prosody changes markedly and the facial expression or gestures show features of sadness (for example, hanging the head or wiping away tears), the user's emotion may be estimated as "sadness".

Also in FIG. 3, when the estimated mental state of the user who uttered "nande" is "calm", the word is estimated to have been used in the sense of a question. When the estimated mental state is "tension", the word is estimated to have been used in the sense of a reproach, and when it is "tension (close to excitement)", the word is estimated to have been used in the sense of doubt. The user's mental state may also be estimated more simply, without the pattern matching described above, from whether a large change has appeared in the electrocardiogram or pulse.
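The simplified estimate mentioned last could look like the following sketch. The 15% relative-change threshold and the name `mental_state_from_pulse` are assumptions for illustration; the description only says that a "large change" in the electrocardiogram or pulse is checked and gives no numeric criterion.

```python
def mental_state_from_pulse(baseline_bpm, current_bpm, threshold=0.15):
    """Label the state "tense" when the pulse deviates strongly from the user's baseline.

    The threshold is a hypothetical value, not taken from the patent.
    """
    relative_change = abs(current_bpm - baseline_bpm) / baseline_bpm
    return "tense" if relative_change > threshold else "calm"
```

With a baseline of 65 bpm, a reading of 68 bpm stays "calm" while 90 bpm is labelled "tense" under this assumed threshold.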

FIG. 4 is a table showing another example. The example in FIG. 4 concerns the case where the user utters "ii yo" (いいよ, roughly "that's fine").

Similarly, the utterance "ii yo" can generally carry several meanings, such as acceptance of what the other party said or resignation (giving up, in the sense of "whatever"), so the user's intention cannot be estimated accurately from linguistic information alone.

In this embodiment, by contrast, as shown in FIG. 4, the user's intention contained in the utterance "ii yo" (acceptance or resignation) is identified based on the user's emotion estimated from prosodic and image information and the user's mental state estimated from physiological information, which makes accurate intention estimation possible.

For example, in FIG. 4, when the estimated emotion of the user who uttered "ii yo" is "calm" or "joy", the word is estimated to have been used in the sense of acceptance. When the estimated emotion is "anger" or "sadness", the word is estimated to have been used in the sense of resignation.

Similarly, in FIG. 4, when the estimated mental state of the user who uttered "ii yo" is "calm", the word is estimated to have been used in the sense of acceptance. When the estimated mental state is "tension", the word is estimated to have been used in the sense of resignation.

Correspondences between the user's emotion or mental state and the user's intention, such as those shown in FIG. 3 and FIG. 4, are created for each word that can have several meanings and are stored in advance in a memory accessible to the dialogue control ECU 20.
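The stored correspondences can be sketched as a lookup table keyed by the ambiguous word. The entries below follow the relationships described for FIG. 3 and FIG. 4; the dictionary layout, the English labels, and the names `INTENT_TABLES` and `candidate_intents` are assumptions for illustration rather than the format actually used in the apparatus.

```python
# Hypothetical in-memory form of the FIG. 3 and FIG. 4 correspondences.
INTENT_TABLES = {
    "なんで": {
        "emotion": {"calm": "question", "anger": "reproach",
                    "joy": "doubt", "sadness": "doubt"},
        "mental_state": {"calm": "question", "tense": "reproach",
                         "near_excited": "doubt"},
    },
    "いいよ": {
        "emotion": {"calm": "acceptance", "joy": "acceptance",
                    "anger": "resignation", "sadness": "resignation"},
        "mental_state": {"calm": "acceptance", "tense": "resignation"},
    },
}


def candidate_intents(word, emotion, mental_state, tables=INTENT_TABLES):
    """Look up the intent suggested by each non-linguistic cue for an ambiguous word."""
    entry = tables.get(word)
    if entry is None:
        return None  # the word is not registered as ambiguous
    return {
        "from_emotion": entry["emotion"].get(emotion),
        "from_mental_state": entry["mental_state"].get(mental_state),
    }
```

The two cues are returned separately here because the description does not say how a disagreement between them is resolved.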

Suppose, for example, that the system asks the user through the speaker 40, "Shall I tell you about parking availability around your destination?", and the user replies "ii yo". By taking the user's emotion and mental state into account as described above, the system can accurately determine whether the reply means "yes, please" or "no need". As a result, the system no longer has to trouble the user by asking again, its next output (action) in response to the reply can be made appropriate, and the dialogue with the user becomes smoother. For example, only when "ii yo" means "yes, please" does the dialogue control ECU 20 immediately obtain the parking availability around the destination and show it on the display 50 while outputting the speech "Here is the parking availability around your destination."
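The parking example reduces to a small decision on the estimated intent, sketched below. The function name `respond_to_parking_offer`, the intent labels, and the return convention (speech text plus a flag for the display) are hypothetical; the source only states that the availability is fetched, announced, and displayed when the reply is an acceptance.

```python
def respond_to_parking_offer(intent):
    """Return (speech, show_on_display) for the user's reply to the parking offer."""
    if intent == "acceptance":
        # Reply meant "yes, please": announce and display the availability.
        return ("Here is the parking availability around your destination.", True)
    # Resignation or an unrecognized intent: nothing is fetched or displayed.
    return (None, False)
```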

As described above, according to this embodiment, even when the words uttered by the user have several meanings and the user's intention cannot be uniquely determined from the linguistic information, the user's intention can be estimated with high accuracy by estimating the user's emotion and mental state and taking them into account. Research by the psychologist Albert Mehrabian is known to indicate that only about 7% of human communication is conveyed by linguistic information and that vocal prosody, facial expression, and the like play an important role. In line with that finding, this embodiment considers vocal prosody, facial expression, and the like in addition to linguistic information, and can thereby grasp more accurately a user intention that is difficult to recognize from linguistic information alone.

Preferred embodiments of the present invention have been described in detail above. However, the present invention is not limited to these embodiments, and various modifications and substitutions can be made to them without departing from the scope of the present invention.

For example, in the embodiment described above, both the user's emotion and the user's mental state are estimated in order to improve the accuracy of intention estimation, but a configuration that estimates only one of them is also possible. Depending on the estimation accuracy at the time, one estimation result may also be given priority and the other used as an auxiliary.
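One possible reading of this variation is sketched below: each estimate carries a confidence value, the more reliable one is used first, and the other serves only as a fallback. The confidence scores, the fallback rule, and the name `integrate_estimates` are assumptions; the description does not define how accuracy is measured or how the auxiliary result is applied.

```python
def integrate_estimates(emotion_estimate, mental_estimate):
    """Combine two optional (intent, confidence) estimates into a final intent.

    Each argument is a tuple (intent or None, confidence in [0, 1]) or None
    when that estimator is absent. The rule used here is an assumed one.
    """
    available = [e for e in (emotion_estimate, mental_estimate) if e is not None]
    if not available:
        return None
    if len(available) == 1:
        return available[0][0]
    primary, secondary = sorted(available, key=lambda e: e[1], reverse=True)
    # The less reliable estimate is consulted only if the primary yields no intent.
    return primary[0] if primary[0] is not None else secondary[0]
```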

In the above description, prosodic information is given as an example of non-linguistic information input through the in-vehicle microphone 30, but the present invention is not limited to this. For example, the user's emotion or mental state may be detected from the breathing patterns picked up by the in-vehicle microphone 30 (such as sighing or gasping).

FIG. 1 is a system configuration diagram showing an embodiment of a speech dialogue system incorporating a speech dialogue apparatus according to the present invention.
FIG. 2 is a diagram showing a characteristic operation flow realized by the speech dialogue apparatus of this embodiment.
FIG. 3 is a table showing an example of a method of integrating non-linguistic and linguistic information to estimate the user's intention (true intention).
FIG. 4 is a table showing another example of a method of integrating non-linguistic and linguistic information to estimate the user's intention (true intention).

Explanation of symbols

10 Speech dialogue apparatus
20 Dialogue control ECU
22 User database
30 In-vehicle microphone
40 Speaker
42 Amplifier
50 Display
60 In-vehicle camera
70 Occupant state estimation ECU
80 Biosensor

Claims

A speech dialogue apparatus that exchanges information with a user by voice, characterized by having an intention estimation function that estimates the user's intention contained in the words uttered by the user based on both linguistic information obtained by speech recognition processing of the utterance and non-linguistic information capable of representing the user's emotion or physiological state.

The speech dialogue apparatus according to claim 1, wherein the non-linguistic information includes image information obtained by image processing of an image of the user captured by a camera, and prosodic information obtained by analyzing the user's uttered voice.

The speech dialogue apparatus according to claim 1, wherein the non-linguistic information includes physiological information of the user based on electrocardiogram or pulse measurements.