JP2005234331A

JP2005234331A - Voice interaction device

Info

Publication number: JP2005234331A
Application number: JP2004044798A
Authority: JP
Inventors: Hisayuki Nagashima; 久幸長島
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2004-02-20
Filing date: 2004-02-20
Publication date: 2005-09-02
Anticipated expiration: 2024-02-20
Also published as: JP4437047B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice interaction device that improves user convenience by accurately estimating the user's level of understanding and performing appropriate dialogue control.

SOLUTION: An understanding level calculation unit 28 obtains an estimation parameter s1, calculated from the keyword appearance time (the time from when voice input is requested until a keyword is uttered), and an estimation parameter s2, calculated from the ratio of the number of sounds in an important keyword to the total number of sounds from the start to the end of the user's utterance. The user's understanding level R for the dialogue is then calculated from the estimation parameters s1 and s2. A dialogue control unit 29 performs dialogue control based on the understanding level R. When R is 0, it judges that the user's understanding is low and outputs a message urging re-input. When R is 1, it judges that the understanding is normal, confirms the input contents, and advances to the next step. When R is 2, it judges that the understanding is high and advances to the next state without confirming the input contents.

COPYRIGHT: (C)2005,JPO&NCIPI
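The three-way control described in the abstract can be sketched as a simple dispatch on R. This is only an illustrative sketch: the names `Understanding` and `next_action` and the returned action strings are hypothetical and do not appear in the source, which specifies only the behavior for R = 0, 1, and 2.

```python
from enum import IntEnum


class Understanding(IntEnum):
    """Understanding level R as described in the abstract (names are hypothetical)."""
    LOW = 0
    NORMAL = 1
    HIGH = 2


def next_action(r: int) -> str:
    """Map understanding level R to a dialogue action label."""
    level = Understanding(r)  # raises ValueError for values outside 0..2
    if level is Understanding.LOW:
        return "reprompt"            # output a message urging re-input
    if level is Understanding.NORMAL:
        return "confirm_then_next"   # confirm the input contents, then advance
    return "next"                    # advance without confirming the input
```

A caller would compute R elsewhere and pass it in, for example `next_action(1)`.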

Description

The present invention relates to a voice interaction device for use in a processing system that executes processing based on dialogue with a user.

Conventionally, some voice interaction devices used for dialogue with a user comprise, for example, input request means for outputting a signal requesting voice input; recognition means for recognizing the input voice; measuring means for measuring the time from when voice input is requested until voice input is detected, and the duration of the voice input (utterance time); and output means for outputting a voice response signal corresponding to the voice recognition result. Such a device variably controls the time from when voice input is detected until the voice response signal is output, the response time of the voice response signal, or the expression format of the voice response signal, based on the time from the request for voice input until its detection and on the duration of the voice input. This allows the device to give each user an appropriate response based on that user's reaction time and voice input time (see, for example, Patent Document 1).

Meanwhile, among devices that likewise recognize and respond to a user's voice, some measure not only the time from when voice input is requested until voice input is detected and the duration of the voice input (utterance time), but also the number of sounds uttered by the user. They estimate the user's level of understanding (proficiency) from these measurements and, based on the result, control the scenario and utterance content of the voice guidance given by the response voice signal, as well as the speech rate. This type of device can likewise give each user an appropriate response (see, for example, Patent Document 2).
Patent Document 1: Japanese Examined Patent Publication No. 5-18118
Patent Document 2: JP 2000-194386 A

In the conventional device, however, although the output timing of the device's voice response signal can be changed by using the time until the user starts speaking, the device only determines whether the user has spoken from the frequency and strength of the input voice signal and does not judge the content of the speech. As a result, it may fail to respond correctly to the user's utterance. Even if the user responds immediately to a voice input request, it is impossible to judge whether the user is really speaking with an understanding of the dialogue unless the device distinguishes between a meaningful word, such as a destination, and a meaningless word such as "aa" ("あ〜") or "ee" ("え〜"), which in Japanese is uttered as a preface to meaningful words. In other words, an immediate response consisting of a meaningless word does not show that the user understands the dialogue, so outputting a voice response signal based on this time cannot give the user an appropriate response.

In addition, when the number of input sounds or words is measured and compared with the number of sounds or words uttered by a standard user in order to estimate the user's level of understanding (proficiency), it is difficult to handle the variety of users' utterances. Specifically, even words that carry the same level of meaning as a destination, such as "Kashiwa" (かしわ) and "Tokyo Amusement Land" (とうきょうあみゅーずめんとらんど), may yield different understanding levels when compared with a standard number of sounds or words, simply because their sound or word counts differ. Likewise, even when the actual understanding is the same, a user who speaks politely by appending a word such as "desu" (です) may be judged to have a low level of understanding. Consequently, outputting a voice response signal based on the estimated understanding level may fail to give the user an appropriate response.

The present invention has been made in view of the above problems, and its object is to provide a voice interaction device that accurately estimates a user's level of understanding, performs appropriate dialogue control, and thereby improves user convenience.

To solve the above problems, the voice interaction device according to the invention of claim 1 comprises: voice input means (for example, the microphone 1 in the embodiment described later) for inputting voice uttered by a user; voice recognition means (for example, the voice recognition unit 22 in the embodiment described later) for performing recognition processing on the input voice; keyword determination means (for example, the keyword determination unit 23 in the embodiment described later) for extracting a predetermined keyword from the recognized voice of the user; input sound count measuring means (for example, the recognized word counting unit 26, and the processing of steps S11 to S12 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the number of sounds in the recognized voice of the user; understanding level determination means (for example, the processing of steps S13 to S15 and the processing of step S23 executed by the understanding level calculation unit 28 in the embodiment described later) for determining the user's level of understanding in the dialogue based on the ratio of the number of sounds in the keyword to the total number of sounds input by the user's utterance; and dialogue control means (for example, the dialogue control unit 29 in the embodiment described later) for controlling the dialogue response according to the user's level of understanding.

In a voice interaction device having the above configuration, which includes voice input means for inputting voice uttered by a user and voice recognition means for performing recognition processing on the input voice, the understanding level determination means determines the user's level of understanding in the dialogue based on the ratio of the number of sounds in important keywords to the total number of sounds in the voice input by the user's utterance. For example, the larger the proportion of sounds belonging to meaningful words relative to the total number of sounds uttered, the higher the user's understanding of the dialogue is taken to be. The dialogue control means can therefore execute appropriate dialogue control according to the estimated understanding level.
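As a rough illustration of the sound-count ratio, the following sketch counts sounds in a hiragana string and divides the keyword's count by the utterance's count. The counting rule (each kana is one sound, except small combining kana such as "ょ") is an assumption made for this example, as are the function names; the source does not state how sounds are counted.

```python
# Assumption: small kana merge with the preceding kana into one sound (e.g. "きょ").
_SMALL_KANA = set("ゃゅょぁぃぅぇぉ")


def count_sounds(kana: str) -> int:
    """Count sounds in a hiragana string, skipping small combining kana."""
    return sum(1 for ch in kana if ch not in _SMALL_KANA)


def keyword_sound_ratio(utterance: str, keyword: str) -> float:
    """Ratio of keyword sounds to all sounds in the utterance.

    Returns 0.0 when the utterance is empty or does not contain the keyword.
    """
    total = count_sounds(utterance)
    if total == 0 or keyword not in utterance:
        return 0.0
    return count_sounds(keyword) / total
```

Under this counting rule, "かしわ" and "とうきょうあみゅーずめんとらんど" each give a ratio of 1.0 when uttered alone, despite their different lengths, while "えーとかしわです" with the keyword "かしわ" gives 3/8 = 0.375.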

The voice interaction device according to the invention of claim 2 comprises: voice input means (for example, the microphone 1 in the embodiment described later) for inputting voice uttered by a user; voice recognition means (for example, the voice recognition unit 22 in the embodiment described later) for performing recognition processing on the input voice; keyword determination means (for example, the keyword determination unit 23 in the embodiment described later) for extracting a predetermined keyword from the recognized voice of the user; keyword appearance time measuring means (for example, the time-keyword combining unit 27, and the processing of steps S1 to S3 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the keyword appearance time from when the user is requested to speak until the user utters the keyword; understanding level determination means (for example, the processing of steps S4 to S6 and the processing of step S23 executed by the understanding level calculation unit 28 in the embodiment described later) for determining the user's level of understanding in the dialogue based on the length of the keyword appearance time; and dialogue control means (for example, the dialogue control unit 29 in the embodiment described later) for controlling the dialogue response according to the user's level of understanding.

In a voice interaction device having the above configuration, which includes voice input means for inputting voice uttered by a user and voice recognition means for performing recognition processing on the input voice, the understanding level determination means determines the user's level of understanding in the dialogue based on the length of the keyword appearance time from when voice input is requested until an important keyword is uttered. For example, the shorter the preface of meaningless words and the sooner a meaningful word is uttered, the higher the user's understanding of the dialogue is taken to be. The dialogue control means can therefore execute appropriate dialogue control according to the estimated understanding level.
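The keyword appearance time can be sketched as follows, assuming the recognizer yields words with start and end timestamps on the same clock as the input request. The `RecognizedWord` type, the function name, and the choice to measure up to the start of the keyword (rather than its end) are assumptions made for this example; the source does not fix these details in this passage.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecognizedWord:
    """A recognized word with timestamps in seconds (hypothetical structure)."""
    text: str
    start: float
    end: float


def keyword_appearance_time(
    prompt_time: float,
    words: Sequence[RecognizedWord],
    keywords: set[str],
) -> Optional[float]:
    """Time from the input request to the start of the first keyword.

    Returns None when no keyword is uttered after the request.
    """
    for word in words:
        if word.text in keywords and word.start >= prompt_time:
            return word.start - prompt_time
    return None
```

For an utterance in which the filler "えーと" occupies 0.5 s to 1.5 s and the keyword "かしわ" starts at 2.0 s after a request at 0.0 s, the function returns 2.0; the filler does not shorten the measured time.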

The voice interaction device according to the invention of claim 3 comprises: voice input means (for example, the microphone 1 in the embodiment described later) for inputting voice uttered by a user; voice recognition means (for example, the voice recognition unit 22 in the embodiment described later) for performing recognition processing on the input voice; keyword determination means (for example, the keyword determination unit 23 in the embodiment described later) for extracting a predetermined keyword from the recognized voice of the user; total utterance time measuring means (for example, the time-keyword combining unit 27, and the processing of step S31 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the total utterance time from when the user starts speaking until the user finishes; keyword utterance duration measuring means (for example, the time-keyword combining unit 27, and the processing of step S32 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the keyword utterance duration that the user required to utter the keyword itself; understanding level determination means (for example, the processing of steps S33 to S35 and the processing of step S23 executed by the understanding level calculation unit 28 in the embodiment described later) for determining the user's level of understanding in the dialogue based on the ratio of the keyword utterance duration to the total utterance time; and dialogue control means (for example, the dialogue control unit 29 in the embodiment described later) for controlling the dialogue response according to the user's level of understanding.

In a voice interaction device having the above configuration, which includes voice input means for inputting voice uttered by a user and voice recognition means for performing recognition processing on the input voice, the understanding level determination means determines the user's level of understanding in the dialogue based on the ratio of the keyword utterance duration, during which an important keyword was being uttered, to the total utterance time from the start to the end of voice input. For example, the larger the proportion of time spent uttering meaningful words relative to the total utterance time, the higher the user's understanding of the dialogue is taken to be. The dialogue control means can therefore execute dialogue control appropriately according to the estimated understanding level.
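The duration ratio can be written out as a short function. This is a minimal sketch assuming all four timestamps are in seconds on a common clock and that a single keyword interval is measured; the function name and the validation rules are choices made for this example.

```python
def keyword_duration_ratio(
    utterance_start: float,
    utterance_end: float,
    keyword_start: float,
    keyword_end: float,
) -> float:
    """Ratio of the keyword utterance duration to the total utterance time."""
    total = utterance_end - utterance_start
    if total <= 0:
        raise ValueError("utterance must have positive duration")
    if not (utterance_start <= keyword_start <= keyword_end <= utterance_end):
        raise ValueError("keyword interval must lie inside the utterance")
    return (keyword_end - keyword_start) / total
```

For a 4-second utterance in which the keyword occupies 1 second, the ratio is 0.25; an utterance consisting only of the keyword gives 1.0.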

The voice interaction device according to the invention of claim 4 comprises: voice input means (for example, the microphone 1 in the embodiment described later) for inputting voice uttered by a user; voice recognition means (for example, the voice recognition unit 22 in the embodiment described later) for performing recognition processing on the input voice; keyword determination means (for example, the keyword determination unit 23 in the embodiment described later) for extracting a predetermined keyword from the recognized voice of the user; keyword reference utterance time storage means (for example, the keyword database 25 in the embodiment described later) that stores, for each keyword, the standard time required to utter that keyword as a keyword reference utterance time; keyword utterance duration measuring means (for example, the time-keyword combining unit 27, and the processing of step S42 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the keyword utterance duration that the user required to utter the keyword itself; understanding level determination means (for example, the processing of steps S43 to S45 and the processing of step S23 executed by the understanding level calculation unit 28 in the embodiment described later) for determining the user's level of understanding in the dialogue based on the ratio of the keyword utterance duration to the keyword reference utterance time; and dialogue control means (for example, the dialogue control unit 29 in the embodiment described later) for controlling the dialogue response according to the user's level of understanding.

In a voice interaction device having the above configuration, which includes voice input means for inputting voice uttered by a user and voice recognition means for performing recognition processing on the input voice, the understanding level determination means determines the user's level of understanding in the dialogue based on the ratio of the keyword utterance duration, which the user required to utter the keyword itself, to the keyword reference utterance time, which is the standard time required to utter the predetermined keyword. For example, when the user speaks without confidence, the keyword utterance duration required to utter an important keyword is long; when the user speaks with confidence, that duration is short and the user's understanding of the dialogue is taken to be high. The dialogue control means can therefore execute dialogue control appropriately according to the estimated understanding level.
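The comparison against a stored reference time can be sketched with a small lookup table. The table contents, the names, and the units below are hypothetical values chosen for illustration; the source states only that one standard utterance time is stored per keyword.

```python
from typing import Mapping

# Hypothetical reference durations in seconds, one per keyword.
REFERENCE_DURATION_S: Mapping[str, float] = {
    "かしわ": 0.5,
    "とうきょうあみゅーずめんとらんど": 2.0,
}


def duration_vs_reference(
    keyword: str,
    spoken_duration_s: float,
    reference: Mapping[str, float] = REFERENCE_DURATION_S,
) -> float:
    """Ratio of the measured keyword duration to its stored reference time.

    A value above 1.0 means the keyword was spoken more slowly than the
    reference; a KeyError is raised if the keyword has no stored reference.
    """
    ref = reference[keyword]
    if ref <= 0:
        raise ValueError("reference duration must be positive")
    return spoken_duration_s / ref
```

With the hypothetical table above, uttering "かしわ" over 1.0 s gives a ratio of 2.0, which under the reasoning in the text would suggest a hesitant utterance.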

The voice interaction device according to the invention of claim 5 comprises, in addition to voice input means (for example, the microphone 1 in the embodiment described later) for inputting voice uttered by a user, voice recognition means (for example, the voice recognition unit 22 in the embodiment described later) for performing recognition processing on the input voice, and keyword determination means (for example, the keyword determination unit 23 in the embodiment described later) for extracting a predetermined keyword from the recognized voice of the user, at least two of the following sets of means: input sound count measuring means (for example, the recognized word counting unit 26, and the processing of steps S11 to S12 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the number of sounds in the recognized voice of the user; keyword appearance time measuring means (for example, the time-keyword combining unit 27, and the processing of steps S1 to S3 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the keyword appearance time from when the user is requested to speak until the user utters the keyword; total utterance time measuring means (for example, the time-keyword combining unit 27, and the processing of step S31 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the total utterance time from when the user starts speaking until the user finishes, together with keyword utterance duration measuring means (for example, the time-keyword combining unit 27, and the processing of step S32 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the keyword utterance duration that the user required to utter the keyword itself; and keyword reference utterance time storage means (for example, the keyword database 25 in the embodiment described later) that stores, for each keyword, the standard time required to utter that keyword as a keyword reference utterance time, together with keyword utterance duration measuring means (for example, the time-keyword combining unit 27, and the processing of step S42 executed by the understanding level calculation unit 28, in the embodiment described later) for measuring the keyword utterance duration that the user required to utter the keyword itself. The device further comprises understanding level determination means (for example, a combination of any two or more of the processing of steps S13 to S15, the processing of steps S4 to S6, the processing of steps S33 to S35, and the processing of steps S43 to S45, together with the processing of step S23, executed by the understanding level calculation unit 28 in the embodiment described later) for determining the user's level of understanding in the dialogue based on the result of combining at least two of the following: the ratio of the number of sounds in the keyword to the total number of sounds input by the user's utterance, the length of the keyword appearance time, the ratio of the keyword utterance duration to the total utterance time, and the ratio of the keyword utterance duration to the keyword reference utterance time; and dialogue control means (for example, the dialogue control unit 29 in the embodiment described later) for controlling the dialogue response according to the user's level of understanding.

In a voice interaction device having the above configuration, which includes voice input means for inputting voice uttered by a user and voice recognition means for performing recognition processing on the input voice, the understanding level determination means determines the user's level of understanding in the dialogue based on the result of combining at least two of the following: the ratio of the number of sounds in important keywords to the total number of sounds in the input voice; the length of the keyword appearance time from when voice input is requested until an important keyword is uttered; the ratio of the keyword utterance duration, during which an important keyword was being uttered, to the total utterance time from the start to the end of voice input; and the ratio of the keyword utterance duration, which the user required to utter the keyword itself, to the keyword reference utterance time, which is the standard time required to utter the predetermined keyword. The dialogue control means can therefore execute dialogue control appropriately according to the estimated understanding level.
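One possible way to combine several indicators is sketched below. The combining rule is purely an assumption for illustration: this passage does not say how the indicators are combined, so the sketch supposes that each indicator has already been mapped to a level 0, 1, or 2 and takes the most cautious (lowest) level. The function name is likewise hypothetical.

```python
from typing import Sequence


def combine_levels(levels: Sequence[int]) -> int:
    """Combine per-indicator understanding levels (each 0, 1, or 2).

    Assumed rule, not taken from the source: use the minimum, so that
    confirmation is skipped only when every indicator reports level 2.
    """
    if len(levels) < 2:
        raise ValueError("at least two indicator levels are required")
    if any(level not in (0, 1, 2) for level in levels):
        raise ValueError("each level must be 0, 1, or 2")
    return min(levels)
```

Other rules, such as a weighted sum of the underlying parameters followed by thresholding, would fit the text equally well; the minimum is used here only because it is easy to reason about.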

The voice interaction device according to the invention of claim 6 is the voice interaction device according to any one of claims 1 to 5, further comprising a driving environment determination unit (for example, the vehicle state detection device in the embodiment described later) for determining the driving environment of the vehicle in which the device is mounted, wherein the understanding level determination means changes the threshold value used to determine the user's level of understanding in the dialogue according to the driving environment of the vehicle determined by the driving environment determination unit.

In a voice interaction device having the above configuration, the understanding level determination means changes the threshold value used to determine the user's level of understanding in the dialogue according to the driving environment of the vehicle determined by the driving environment determination unit. For example, while the user is driving, even an utterance that appears to have been input with confidence may contain a wrong word because the user is distracted by driving. Changing the threshold in this way makes it possible to accurately estimate changes in the user's understanding that accompany changes in the vehicle's driving environment.
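The threshold switching can be illustrated with two threshold sets selected by driving state. All numeric thresholds, the two-state model (driving or stopped), and the names below are hypothetical; the source states only that the threshold is changed according to the driving environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Lower bounds on an indicator ratio for levels 1 and 2 (hypothetical)."""
    normal: float  # ratio at or above this gives level 1
    high: float    # ratio at or above this gives level 2


# Hypothetical values: stricter thresholds while the vehicle is being driven.
STOPPED = Thresholds(normal=0.3, high=0.6)
DRIVING = Thresholds(normal=0.4, high=0.8)


def understanding_level(ratio: float, driving: bool) -> int:
    """Classify a keyword ratio into level 0, 1, or 2 for the current state."""
    t = DRIVING if driving else STOPPED
    if ratio >= t.high:
        return 2
    if ratio >= t.normal:
        return 1
    return 0
```

With these example values, a ratio of 0.7 yields level 2 when stopped but level 1 when driving, so the same utterance would be confirmed before the dialogue advances.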

According to the voice interaction device of claim 1, appropriate dialogue control can be executed according to the user's level of understanding in the dialogue, which is estimated by comparing the total number of sounds in the voice input by the user's utterance with the number of sounds in important keywords.
Accordingly, dialogue control with concise responses is executed for users with a high level of understanding, while dialogue control with detailed and careful responses is executed for users with a low level of understanding. This realizes a voice interaction device that executes dialogue control appropriate to the user's level and improves user convenience. In addition, tolerating the input of meaningless words characteristic of free utterance removes as much pressure as possible from the user when speaking. In particular, because the understanding level is estimated by comparing the total number of sounds in the input voice with the number of sounds in important keywords, the understanding level can be calculated accurately, and appropriate dialogue control executed, without being affected by words whose sound counts vary widely or by polite words attached to the end of an utterance.

According to the voice interaction device of claim 2, appropriate dialogue control can be executed according to the understanding level estimated from the time until an important keyword is uttered.
Accordingly, as with claim 1, it is possible to realize a voice interaction device that executes dialogue control appropriate to the user's level and improves user convenience. In addition, tolerating the input of meaningless words characteristic of free utterance removes as much pressure as possible from the user when speaking. In particular, because the understanding level is estimated not simply from whether the user has spoken but from the time until an important keyword is uttered, the understanding level can be calculated accurately whenever the important keyword is uttered, and appropriate dialogue control can be executed.

According to the voice interaction device of claim 3, dialogue control can be executed appropriately according to the understanding level estimated from the time during which an important keyword was being uttered.
Accordingly, as with claim 1, it is possible to realize a voice interaction device that executes dialogue control appropriate to the user's level and improves user convenience. In addition, tolerating the input of meaningless words characteristic of free utterance removes as much pressure as possible from the user when speaking. In particular, because the understanding level is estimated by comparing the total utterance time from the start to the end of voice input with the keyword utterance duration during which the keyword was being uttered, any word can be handled. Even for words whose sound count is difficult to identify, the understanding level can be calculated accurately from the proportion of meaningful words in the uttered speech, and appropriate dialogue control can be executed.

請求項４に記載の音声対話装置によれば、利用者が重要なキーワード自体を発話するのに要した時間により推定された理解度に応じて、適切に対話制御を実行することができる。
従って、請求項１と同様に、利用者のレベルに従って適切な対話制御を実行し、利用者の利便性を向上させる音声対話装置を実現することができるという効果が得られる。また、自由発音特有の意味を持たない言葉の入力を許容することで、利用者の発話に対するプレッシャーを極力排除することができるという効果が得られる。また、特に所定のキーワードの発話に標準的に要するキーワード基準発話時間と利用者が重要なキーワード自体を発話するのに要したキーワード発話継続時間との比較により理解度を推定することで、発話全体を検査するまでもなく、重要なキーワードのみを確認するだけで、正確に理解度を算出して適切な対話制御を実行することができるという効果が得られる。 According to the voice dialogue apparatus according to the fourth aspect, the dialogue control can be appropriately executed according to the degree of understanding estimated by the time required for the user to speak the important keyword itself.
Therefore, as with the first aspect, a voice dialogue apparatus can be realized that executes dialogue control suited to the user's level and improves convenience for the user. In addition, because meaningless filler words characteristic of free utterance are accepted as input, the pressure the user feels when speaking is reduced as far as possible. In particular, because the degree of understanding is estimated by comparing the keyword reference utterance time normally required to utter a given keyword with the keyword utterance duration the user actually took to utter the important keyword itself, there is no need to examine the whole utterance: checking only the important keyword is enough to calculate the degree of understanding accurately and execute appropriate dialogue control.

請求項５に記載の音声対話装置によれば、重要なキーワードに関する組み合わされた情報により推定された理解度に応じて、適切に対話制御を実行することができる。
従って、対話制御における理解度の判定精度を向上させ、入力された音声の内容や状態に影響を受けずに正確に理解度を算出し、適切な対話制御を実行することができるという効果が得られる。 According to the spoken dialogue apparatus of the fifth aspect, the dialogue control can be appropriately executed according to the degree of understanding estimated by the combined information regarding the important keyword.
Therefore, the accuracy of understanding-level determination in dialogue control is improved, the degree of understanding is calculated accurately without being affected by the content or state of the input speech, and appropriate dialogue control can be executed.

請求項６に記載の音声対話装置によれば、車両の走行環境の変化に伴う利用者の理解度の変化を正確に推定することができる。
従って、車両を運転することで負担がかかっている利用者についても、その時の利用者の状態に応じて正確に理解度を算出し、適切な対話制御を実行することができるという効果が得られる。また、利用者の利便性を向上させ、車両搭載に適した音声対話装置を実現することができるという効果が得られる。 According to the voice interactive apparatus according to the sixth aspect, it is possible to accurately estimate the change in the understanding level of the user accompanying the change in the traveling environment of the vehicle.
Therefore, even for a user who is under load from driving the vehicle, the degree of understanding can be calculated accurately according to the user's state at that time, and appropriate dialogue control can be executed. Furthermore, user convenience is improved and a voice dialogue apparatus suited to in-vehicle use can be realized.

以下、図面を参照して本発明の実施例について説明する。 Embodiments of the present invention will be described below with reference to the drawings.

まず、第１の実施例について説明する。第１の実施例では、利用者に音声入力（発話）を要求してから重要なキーワードが発話されるまでのキーワード出現時間と、利用者が単一の発話を開始してから終了するまでに入力された総音数（音素の数）に占める重要なキーワードの音数の割合とにより、対話における利用者の理解度を推定する場合を示す。 First, the first embodiment will be described. In the first embodiment, the keyword appearance time from the time when the user is requested to input voice (utterance) to the time when an important keyword is uttered, and the time from when the user starts a single utterance to the end. The case where the degree of understanding of the user in the dialogue is estimated based on the ratio of the number of important keyword sounds to the total number of input sounds (number of phonemes) is shown.

（装置構成）
図１は、本発明の第１の実施例の音声対話装置の全体構成を示すブロック図である。
図１において、本実施例の音声対話装置は、利用者の音声を入力するためのマイク１を備えており、マイク１から入力された利用者の音声は信号処理部２へ入力される。
信号処理部２は、音声認識を実行して入力された音声を認識語に変換したり、該認識語から利用者の対話における理解度を算出し、理解度に基づいて対話の制御を行うと共に、対話の制御に基づいて応答文の生成を実行する処理部であって、信号処理部２において生成された応答文は音声合成部３とディスプレイ４へ入力される。また、音声合成部３は、信号処理部２において生成された応答文をスピーカ５へ出力する。一方、ディスプレイ４は、信号処理部２において生成された応答文を画面に表示する。 (Device configuration)
FIG. 1 is a block diagram showing the overall configuration of the voice interactive apparatus according to the first embodiment of the present invention.
In FIG. 1, the voice interaction apparatus according to the present embodiment includes a microphone 1 for inputting a user's voice, and the user's voice input from the microphone 1 is input to the signal processing unit 2.
The signal processing unit 2 is a processing unit that performs speech recognition to convert the input speech into recognized words, calculates the user's understanding level in the dialogue from the recognized words, controls the dialogue based on that understanding level, and generates a response sentence based on the dialogue control. The response sentence generated by the signal processing unit 2 is input to the speech synthesis unit 3 and the display 4. The speech synthesis unit 3 outputs the response sentence to the speaker 5, while the display 4 displays it on the screen.

（信号処理部の詳細）
次に、図面を参照して本実施例の音声対話装置の信号処理部２の詳細について説明する。図２は、本実施例の音声対話装置の信号処理部２の構成を示すブロック図である。
図２において、マイク１から入力された音声は、まず発話区間検出部２１に入力され、発話区間検出部２１において、音声対話装置の発話に基づいて利用者の発話区間（開始時刻と終了時刻）の検出が行われる。次に、入力された音声は、音声認識部２２へ入力され、音声認識部２２において音声認識が実行されることにより認識語（テキスト）に変換される。
また、利用者の音声と発話区間の情報、及び認識語は、時刻認識語結合部２４へ入力され、時刻認識語結合部２４は、それぞれの認識語に対して認識語が発話された時刻情報を結合して、認識語とそれに対応する時刻情報を、後述する時刻キーワード結合部２７へ出力する。 (Details of signal processor)
Next, details of the signal processing unit 2 of the voice interactive apparatus according to the present embodiment will be described with reference to the drawings. FIG. 2 is a block diagram illustrating a configuration of the signal processing unit 2 of the voice interactive apparatus according to the present embodiment.
In FIG. 2, the voice input from the microphone 1 is first input to the utterance section detection unit 21, which detects the user's utterance section (start time and end time) with reference to the utterance of the voice dialogue apparatus. The input voice is then passed to the speech recognition unit 22, where speech recognition converts it into recognized words (text).
The user's voice, the utterance section information, and the recognized words are input to the time recognition word combining unit 24, which attaches to each recognized word the time at which it was uttered and outputs the recognized words with their corresponding time information to the time keyword combining unit 27 described later.

一方、音声認識部２２の出力する音声及びその認識語は、キーワード判定部２３へ入力され、キーワード判定部２３は、対話において意味のある言葉であるキーワードが記憶されたキーワードデータベース２５を参照して、入力された認識語から所定のキーワードを抽出すると共に、認識語のキーワード部分にタグを付与して、キーワードを同定済みの認識語を認識語カウント部２６へ出力する。
これに対し、認識語カウント部２６は、キーワードの音素の数（音数）と、キーワードも含めた全認識語の音素の数（総音数）をカウントし、認識語及びキーワードの音素の数に関する情報を、キーワードが同定された認識語と共に時刻キーワード結合部２７へ出力する。 On the other hand, the speech output from the speech recognition unit 22 and its recognition word are input to the keyword determination unit 23, and the keyword determination unit 23 refers to the keyword database 25 in which keywords that are meaningful words in dialogue are stored. A predetermined keyword is extracted from the input recognition word, a tag is assigned to the keyword portion of the recognition word, and the recognized word having the identified keyword is output to the recognition word counting unit 26.
The recognized word counting unit 26, in turn, counts the number of phonemes in the keyword (number of sounds) and the number of phonemes in all recognized words including the keyword (total number of sounds), and outputs this phoneme-count information for the recognized words and the keyword, together with the keyword-identified recognized words, to the time keyword combining unit 27.
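The counting performed by the recognized word counting unit 26 can be sketched as follows. This is only an illustrative sketch: the `Segment` type, the `count_sounds` function, and the per-segment split of 2/4/2 sounds are assumptions made for the example. The description itself states only that the example utterance has 8 sounds in total, of which the keyword accounts for 4.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One recognized word, tagged by the keyword determination step (hypothetical type)."""
    text: str
    phonemes: int      # number of sounds in this recognized word
    is_keyword: bool   # True if the word matched the keyword database


def count_sounds(segments: List[Segment]) -> Tuple[int, int]:
    """Return (total number of sounds p0, number of keyword sounds p1)."""
    total = sum(s.phonemes for s in segments)
    keyword = sum(s.phonemes for s in segments if s.is_keyword)
    return total, keyword


# Example utterance from the description; the 2/4/2 split is assumed.
utterance = [
    Segment("え〜〜っと、", 2, False),  # filler, no meaning
    Segment("まいはま", 4, True),       # important keyword
    Segment("です。", 2, False),        # polite ending, no meaning
]
p0, p1 = count_sounds(utterance)
```

With these assumed segments, `p0` and `p1` reproduce the totals given for the example utterance (8 and 4).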

また、時刻キーワード結合部２７は、時刻認識語結合部２４から入力された認識語及びそれに対応する時刻情報と、認識語カウント部２６から入力されたキーワードが同定された認識語とから、それぞれのキーワードに対してキーワードが発話された時刻情報を結合して、キーワードが同定された認識語とそれに対応する時刻情報を理解度計算部２８へ出力する。 The time keyword combining unit 27 takes the recognized words and their corresponding time information input from the time recognition word combining unit 24 and the keyword-identified recognized words input from the recognized word counting unit 26, attaches to each keyword the time at which it was uttered, and outputs the keyword-identified recognized words and their corresponding time information to the understanding level calculation unit 28.

また、理解度計算部２８は、入力された認識語及びキーワードの音素の数や、認識語とそれに対応する時刻情報、更には認識語のキーワードの位置に関する情報を利用して、対話における利用者の理解度Ｒを推定する処理部であって、理解度Ｒを、例えば以下に示す３つの状態のいずれかとして算出し対話制御部２９へ出力する。ここで、理解度Ｒについて説明すると、理解度Ｒ＝０は、「利用者の理解度が低く再入力を要求する必要がある。」場合を表し、理解度Ｒ＝１は、「利用者の通常の理解度であり、入力内容を確認して次のステップに進む。」場合を表す。また、理解度Ｒ＝２は、「利用者の理解度が高く、すぐに次のステップに進む。」場合を表す。なお、理解度計算部２８における理解度Ｒの算出方法については、詳細を後述する。 The understanding level calculation unit 28 is a processing unit that estimates the user's understanding level R in the dialogue using the phoneme counts of the input recognized words and the keyword, the recognized words and their corresponding time information, and information on the position of the keyword within the recognized words. It calculates the understanding level R as, for example, one of the three states below and outputs it to the dialogue control unit 29. Understanding level R = 0 represents the case where "the user's understanding level is low and re-input must be requested"; R = 1 represents the case where "the user's understanding level is normal, and the input content is confirmed before proceeding to the next step"; and R = 2 represents the case where "the user's understanding level is high, and the process proceeds immediately to the next step". The method of calculating the understanding level R in the understanding level calculation unit 28 is described in detail later.

一方、対話制御部２９は、理解度計算部２８が算出する理解度Ｒに基づいて対話の流れを制御する処理部であって、例えば上述の理解度Ｒの３つの状態に対して、理解度Ｒ＝０の場合、再度入力を促すメッセージを出力する。また、理解度Ｒ＝１の場合、入力から得られた認識語（テキスト）を確認してから次のステップへ進む。更に、理解度Ｒ＝２の場合、入力から得られた認識語（テキスト）を確認せずに次のステップへ進む。なお、対話制御部２９が実行する理解度Ｒに基づく対話進行フローについても、詳細は後述する。
また、応答文生成部３０は、対話制御部２９の制御に合わせて、必要な応答文を生成して出力する処理部である。 On the other hand, the dialogue control unit 29 is a processing unit that controls the flow of dialogue based on the understanding level R calculated by the understanding level calculation unit 28. For example, for the three states of the above understanding level R, the understanding level If R = 0, a message prompting input again is output. When the understanding level R = 1, the recognition word (text) obtained from the input is confirmed, and then the process proceeds to the next step. Further, when the understanding level R = 2, the process proceeds to the next step without confirming the recognized word (text) obtained from the input. Details of the dialogue progress flow based on the understanding level R executed by the dialogue control unit 29 will be described later.
The response sentence generation unit 30 is a processing unit that generates and outputs a necessary response sentence in accordance with the control of the dialogue control unit 29.

なお、キーワードデータベース２５は、ハードディスク装置や光磁気ディスク装置、フラッシュメモリ等の不揮発性のメモリや、ＣＤ−ＲＯＭ等の読み出しのみが可能な記録媒体、ＲＡＭ（Random Access Memory）のような揮発性のメモリ、あるいはこれらの組み合わせによるコンピュータ読み取り、書き込み可能な記録媒体より構成されるものとする。 The keyword database 25 is assumed to be configured from a non-volatile memory such as a hard disk device, a magneto-optical disk device, or a flash memory, a read-only recording medium such as a CD-ROM, a volatile memory such as a RAM (Random Access Memory), or a computer-readable and writable recording medium formed from a combination of these.

また、発話区間検出部２１と、音声認識部２２と、キーワード判定部２３と、時刻認識語結合部２４と、認識語カウント部２６と、時刻キーワード結合部２７と、理解度計算部２８と、対話制御部２９と、応答文生成部３０は、専用のハードウェアにより実現されるものであってもよく、また、メモリおよびＣＰＵ（中央演算装置）により構成され、上記の各部の機能を実現するためのプログラムをメモリにロードして実行することによりその機能を実現させるものであってもよい。 Further, the utterance section detection unit 21, the speech recognition unit 22, the keyword determination unit 23, the time recognition word combination unit 24, the recognition word count unit 26, the time keyword combination unit 27, the understanding level calculation unit 28, The dialogue control unit 29 and the response sentence generation unit 30 may be realized by dedicated hardware, and are configured by a memory and a CPU (central processing unit) to realize the functions of the above-described units. The function may be realized by loading the program for loading into the memory and executing the program.

（理解度推定パラメータ）
次に、本実施例において理解度計算部２８が理解度Ｒを算出するための理解度推定パラメータについて説明する。
図３は、発話例と理解度推定パラメータを算出するための要素との関係を示す図である。なお、図３は、横軸を時刻、縦軸を音声のパワーとして示した図であって、音声対話装置の音声合成部３による発話と利用者の発話の両方を示している。 (Understanding level estimation parameter)
Next, an understanding level estimation parameter for the understanding level calculation unit 28 to calculate the understanding level R in the present embodiment will be described.
FIG. 3 is a diagram illustrating a relationship between an utterance example and an element for calculating an understanding level estimation parameter. Note that FIG. 3 is a diagram in which the horizontal axis represents time and the vertical axis represents voice power, and shows both the speech by the speech synthesizer 3 and the user's speech in the speech dialogue apparatus.

図３において、時刻ｔ０は、音声対話装置の音声合成部３による発話が終了した時刻である。また、時刻ｔ１は、利用者によって重要なキーワードの発話が開始された時刻である。また、時間ｄは、音声対話装置の音声合成部３による発話を終了することにより利用者に音声入力（発話）を要求してから重要なキーワードが発話されるまでのキーワード出現時間を示している。更に、時間Ｔ２は、利用者が単一の発話を開始してから終了するまでの総発話時間を示している。一方、時間Ｔｋは、利用者が重要なキーワード自体を発話するのに要したキーワード発話継続時間を示している。なお、これらの情報が、時刻キーワード結合部２７の出力するキーワードが同定された認識語とそれに対応する時刻情報に相当する。 In FIG. 3, time t0 is the time when the speech by the speech synthesizer 3 of the voice interactive apparatus is finished. The time t1 is a time when an important keyword is uttered by the user. The time d indicates the keyword appearance time from when the user requests voice input (utterance) by ending the utterance by the voice synthesizing unit 3 of the voice interactive apparatus until an important keyword is uttered. . Further, the time T2 indicates the total utterance time from when the user starts a single utterance to when it ends. On the other hand, the time Tk indicates a keyword utterance duration required for the user to utter an important keyword itself. Note that these pieces of information correspond to the recognized word in which the keyword output from the time keyword combining unit 27 is identified and the corresponding time information.

一方、図３において、二重丸印は入力音声の音素を示しおり、認識語カウント部２６の出力する認識語及びキーワードの音素の数に関する情報に相当する。例えば、利用者の発話した「え〜〜っと、まいはまです。」という音声に基づく認識語は、８個の音素（音数＝８）から構成されており、重要なキーワードである「まいはま」の部分は、４個の音素（音数＝４）から構成されている。なお、太い下線により示した「え〜〜っと、」と「です。」の部分は、発話の中で意味を持たない言葉である。 In FIG. 3, the double circles indicate phonemes of the input speech and correspond to the phoneme-count information for the recognized words and the keyword output by the recognized word counting unit 26. For example, the recognized words based on the user's utterance 「え〜〜っと、まいはまです。」 ("Umm, it's Maihama.") consist of eight phonemes (number of sounds = 8), and the important keyword 「まいはま」 ("Maihama") consists of four phonemes (number of sounds = 4). The portions 「え〜〜っと、」 ("Umm,") and 「です。」 ("it is."), marked with a thick underline, are words that carry no meaning in the utterance.

一方、上述の各要素に対して、理解度推定パラメータｓ１は、キーワード出現時間ｄを変数に持つ関数として示される。また、理解度推定パラメータｓ２は、利用者が単一の発話を開始してから終了するまでに入力された総音数に占める重要なキーワードの音数の割合により示される。例えば、図３に示す発話例では、ｓ２＝４／８＝０．５である。 On the other hand, for each of the above elements, the understanding level estimation parameter s1 is shown as a function having the keyword appearance time d as a variable. The understanding level estimation parameter s2 is indicated by the ratio of the number of sounds of important keywords to the total number of sounds input from the start to the end of the single utterance by the user. For example, in the utterance example shown in FIG. 3, s2 = 4/8 = 0.5.

（理解度推定パラメータｓ１の算出手順）
次に、図面を参照して、理解度計算部２８における理解度推定パラメータｓ１の算出手順について説明する。図４は、理解度計算部２８における理解度推定パラメータｓ１の算出手順を示すフローチャートである。
図４において、まず理解度計算部２８は、音声対話装置の音声合成部３による発話が終了した時刻ｔ０を取得する（ステップＳ１）。
次に、利用者によって重要なキーワードの発話が開始された時刻ｔ１を取得する（ステップＳ２）。 (Calculation procedure of understanding level estimation parameter s1)
Next, the calculation procedure of the understanding level estimation parameter s1 in the understanding level calculation unit 28 will be described with reference to the drawings. FIG. 4 is a flowchart showing a calculation procedure of the understanding level estimation parameter s1 in the understanding level calculation unit 28.
In FIG. 4, the understanding level calculation unit 28 first acquires the time t0 at which the utterance by the speech synthesis unit 3 of the voice dialogue apparatus ended (step S1).
Next, a time t1 at which the utterance of an important keyword is started by the user is acquired (step S2).

そして、利用者によって重要なキーワードの発話が開始された時刻ｔ１から、音声合成部３による発話が終了した時刻ｔ０を減算して、利用者に音声入力（発話）を要求してから重要なキーワードが発話されるまでのキーワード出現時間ｄ（ｄ＝ｔ１−ｔ０）を算出する。（ステップＳ３）。
また、キーワード出現時間ｄを算出することができたら、キーワード出現時間ｄが所定時間Ｔｄより短いか否かを判定する（ステップＳ４）。 Then, the time t0 at which the utterance by the speech synthesis unit 3 ended is subtracted from the time t1 at which the user began uttering the important keyword, giving the keyword appearance time d (d = t1 - t0) from the request for voice input (utterance) until the important keyword is uttered (step S3).
Once the keyword appearance time d has been calculated, it is determined whether d is shorter than a predetermined time Td (step S4).

もし、ステップＳ４において、キーワード出現時間ｄが所定時間Ｔｄより短い（ｄ＜Ｔｄ）場合（ステップＳ４のＹＥＳ）、対話における利用者の理解度は高いと推定して、理解度推定パラメータｓ１に「１」を設定（ｓ１＝１）する（ステップＳ５）。
また、ステップＳ４において、キーワード出現時間ｄが所定時間Ｔｄ以上である（ｄ≧Ｔｄ）場合（ステップＳ４のＮＯ）、対話における利用者の理解度は低いと推定して、理解度推定パラメータｓ１に「０」を設定（ｓ１＝０）する（ステップＳ６）。 If the keyword appearance time d is shorter than the predetermined time Td in step S4 (d < Td) (YES in step S4), the user's understanding level in the dialogue is estimated to be high, and the understanding level estimation parameter s1 is set to "1" (s1 = 1) (step S5).
If the keyword appearance time d is greater than or equal to the predetermined time Td in step S4 (d ≧ Td) (NO in step S4), the understanding level of the user in the dialogue is estimated to be low, and the understanding level estimation parameter s1 is set. “0” is set (s1 = 0) (step S6).
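Steps S1 to S6 above can be sketched as a single function. This is a minimal sketch: the function name `estimate_s1` and the example threshold of 2.0 seconds are assumptions, since the description does not give a concrete value for the predetermined time Td.

```python
def estimate_s1(t0: float, t1: float, td: float) -> int:
    """Understanding level estimation parameter s1 (steps S1 to S6).

    t0: time at which the apparatus finished its prompt (step S1)
    t1: time at which the user began uttering the important keyword (step S2)
    td: predetermined time Td; its value is not given in the description
    """
    d = t1 - t0                 # keyword appearance time (step S3)
    return 1 if d < td else 0   # step S4, then S5 (s1 = 1) or S6 (s1 = 0)


# Hypothetical values: the prompt ends at 10.0 s and the keyword starts at 11.2 s.
s1_fast = estimate_s1(t0=10.0, t1=11.2, td=2.0)
# Keyword starts exactly Td after the prompt: d >= Td, so s1 = 0.
s1_slow = estimate_s1(t0=10.0, t1=12.0, td=2.0)
```

Note that the comparison is strict: when d equals Td the NO branch of step S4 is taken and s1 is 0.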

（理解度推定パラメータｓ２の算出手順）
次に、図面を参照して、理解度計算部２８における理解度推定パラメータｓ２の算出手順について説明する。図５は、理解度計算部２８における理解度推定パラメータｓ２の算出手順を示すフローチャートである。
図５において、まず理解度計算部２８は、利用者が単一の発話を開始してから終了するまでに入力された総音数ｐ０を取得する（ステップＳ１１）。
次に、利用者によって発話された重要なキーワードの音数ｐ１を取得する（ステップＳ１２）。
そして、入力された総音数ｐ０に占めるキーワードの音数ｐ１の割合（ｐ１／ｐ０）が例えば０．５以上であるか否かを判定する（ステップＳ１３）。 (Calculation procedure of understanding level estimation parameter s2)
Next, a procedure for calculating the understanding level estimation parameter s2 in the understanding level calculation unit 28 will be described with reference to the drawings. FIG. 5 is a flowchart showing a calculation procedure of the understanding level estimation parameter s2 in the understanding level calculation unit 28.
In FIG. 5, the understanding level calculation unit 28 first acquires the total number of sounds p0 input from when the user starts a single utterance until it ends (step S11).
Next, the number of sounds p1 of an important keyword uttered by the user is acquired (step S12).
Then, it is determined whether or not the ratio (p1 / p0) of the keyword sound number p1 to the input total sound number p0 is 0.5 or more, for example (step S13).

もし、ステップＳ１３において、入力された総音数ｐ０に占めるキーワードの音数ｐ１の割合（ｐ１／ｐ０）が０．５以上である場合（ステップＳ１３のＹＥＳ）、対話における利用者の理解度は高いと推定して、理解度推定パラメータｓ２に「１」を設定（ｓ２＝１）する（ステップＳ１４）。
また、ステップＳ１３において、入力された総音数ｐ０に占めるキーワードの音数ｐ１の割合（ｐ１／ｐ０）が０．５未満である場合（ステップＳ１３のＮＯ）、対話における利用者の理解度は低いと推定して、理解度推定パラメータｓ２に「０」を設定（ｓ２＝０）する（ステップＳ１５）。 If the ratio (p1 / p0) of the keyword sound number p1 to the input total sound number p0 is 0.5 or more in step S13 (YES in step S13), the user's level of understanding in the dialogue is Assuming that it is high, “1” is set to the understanding level estimation parameter s2 (s2 = 1) (step S14).
In addition, in step S13, when the ratio (p1 / p0) of the keyword sound number p1 to the input total sound number p0 is less than 0.5 (NO in step S13), the user's level of understanding in the dialogue is It is estimated that the value is low, and “0” is set to the understanding level estimation parameter s2 (s2 = 0) (step S15).
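Steps S11 to S15 above reduce to a ratio test, sketched below. The function name `estimate_s2` and the guard against an empty utterance are assumptions added for the example; the threshold of 0.5 is the value the description gives as an example.

```python
def estimate_s2(p0: int, p1: int, threshold: float = 0.5) -> int:
    """Understanding level estimation parameter s2 (steps S11 to S15).

    p0: total number of sounds in the single utterance (step S11)
    p1: number of sounds in the important keyword (step S12)
    threshold: 0.5 is the example value used in step S13
    """
    if p0 <= 0:
        # Assumption: an utterance with no sounds cannot be evaluated.
        raise ValueError("total number of sounds p0 must be positive")
    return 1 if p1 / p0 >= threshold else 0  # step S13, then S14 or S15


# The utterance of FIG. 3: 8 sounds in total, 4 of them in the keyword.
s2_example = estimate_s2(p0=8, p1=4)
```

For the FIG. 3 utterance the ratio is 4/8 = 0.5, which meets the threshold, so `s2_example` is 1.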

（理解度Ｒの算出手順）
次に、本実施例における理解度計算部２８の理解度Ｒの算出手順について説明する。理解度計算部２８において、理解度Ｒは上述の理解度推定パラメータｓ１、ｓ２を用いて算出される。具体的に説明すると、本実施例では、理解度推定パラメータｓ１、ｓ２を組み合わせて、理解度Ｒを下記（１）式により算出する。従って、本実施例において、理解度ＲはＲ＝０、Ｒ＝１、Ｒ＝２の３つの状態のいずれかとして算出される。 (Calculation procedure of understanding level R)
Next, the calculation procedure of the understanding level R of the understanding level calculation unit 28 in the present embodiment will be described. In the understanding level calculation unit 28, the understanding level R is calculated using the above-described understanding level estimation parameters s1 and s2. More specifically, in this embodiment, the understanding level R is calculated by the following equation (1) by combining the understanding level estimation parameters s1 and s2. Therefore, in this embodiment, the understanding level R is calculated as one of three states of R = 0, R = 1, and R = 2.

Ｒ＝ｓ１＋ｓ２・・・（１） R = s1 + s2 (1)
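Equation (1) and the three resulting states can be sketched as follows. The `Understanding` enum and the function name `understanding_level` are names introduced here for illustration only; the description defines just the values R = 0, 1, 2 and their meanings.

```python
from enum import IntEnum


class Understanding(IntEnum):
    """The three states of understanding level R (names are hypothetical)."""
    LOW = 0     # request re-input
    NORMAL = 1  # confirm the input, then proceed
    HIGH = 2    # proceed without confirmation


def understanding_level(s1: int, s2: int) -> Understanding:
    """Equation (1): R = s1 + s2, where each parameter is 0 or 1."""
    if s1 not in (0, 1) or s2 not in (0, 1):
        raise ValueError("s1 and s2 must each be 0 or 1")
    return Understanding(s1 + s2)


r = understanding_level(s1=1, s2=0)
```

Because R is a plain sum, R = 1 does not distinguish between a quick but wordy answer (s1 = 1, s2 = 0) and a slow but concise one (s1 = 0, s2 = 1); both lead to the confirmation branch.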

（対話進行フロー）
次に、図面を参照して本実施例の音声対話装置の理解度Ｒに基づく対話進行フローについて説明する。図６は、本実施例の音声対話装置の理解度Ｒに基づく対話進行フローを示すフローチャートである。
図６において、まず信号処理部２は、対話の中で音声合成部３を介してスピーカ５から音声入力を促すメッセージを出力し、利用者に音声入力を求める（ステップＳ２１）。 (Dialogue progress flow)
Next, a dialogue progress flow based on the understanding level R of the voice dialogue apparatus of this embodiment will be described with reference to the drawings. FIG. 6 is a flowchart showing a dialogue progress flow based on the understanding level R of the voice dialogue apparatus of the present embodiment.
In FIG. 6, the signal processing unit 2 first outputs, during the dialogue, a message prompting voice input from the speaker 5 via the speech synthesis unit 3, asking the user for voice input (step S21).

これに対し、マイク１から音声が入力されると（ステップＳ２２）、信号処理部２は、理解度計算部２８において、理解度Ｒを算出する（ステップＳ２３）。
そして、信号処理部２は、対話制御部２９において、算出された理解度Ｒに基づく対話制御を実行し、対話の流れを決定する（ステップＳ２４）。
具体的には、理解度Ｒ＝０の場合（ステップＳ２４：Ｒ＝０）、利用者の理解度は低いと推定され、再入力を要求する必要があるので、信号処理部２は、ステップＳ２１へ戻り、再度入力を促すメッセージを出力する。 On the other hand, when a sound is input from the microphone 1 (step S22), the signal processing unit 2 calculates an understanding level R in the understanding level calculation unit 28 (step S23).
Then, the signal processing unit 2 executes dialogue control based on the calculated understanding level R in the dialogue control unit 29, and determines the flow of dialogue (step S24).
Specifically, when the understanding level R = 0 (step S24: R = 0), the user's understanding level is estimated to be low and re-input must be requested, so the signal processing unit 2 returns to step S21 and again outputs a message prompting input.

一方、理解度Ｒ＝１の場合（ステップＳ２４：Ｒ＝１）、利用者の理解度は通常と推定され、入力内容を確認して次のステップに進めば良いので、信号処理部２は、入力された認識語（テキスト）の確認メッセージを出力し（ステップＳ２５）、入力から得られた認識語（テキスト）に対する利用者の確認音声入力を待って（ステップＳ２６）、音声認識した認識語（テキスト）が正しいと利用者により確認されたか否かを判定する（ステップＳ２７）。 On the other hand, when the understanding level R = 1 (step S24: R = 1), the user's understanding level is estimated to be normal, and it suffices to confirm the input content before proceeding to the next step. The signal processing unit 2 therefore outputs a confirmation message for the recognized word (text) (step S25), waits for the user's confirmation voice input regarding the recognized word (text) obtained from the input (step S26), and determines whether the user has confirmed that the recognized word (text) is correct (step S27).

その結果、音声認識した認識語（テキスト）が正しいと利用者によって確認された場合（ステップＳ２７のＹＥＳ）、次のステップへ進む。
また、音声認識した認識語（テキスト）が正しいと利用者によって確認されなかった場合（ステップＳ２７のＮＯ）、ステップＳ２１へ戻り、再度入力を促すメッセージを出力する。
更に、理解度Ｒ＝２の場合（ステップＳ２４：Ｒ＝２）、利用者の理解度は高いと推定され、すぐに次のステップに進めば良いので、信号処理部２は、入力から得られた認識語（テキスト）を確認せずに次のステップへ進む。 As a result, if the user confirms that the recognized word (text) recognized by voice is correct (YES in step S27), the process proceeds to the next step.
If the user does not confirm that the recognized word (text) is correct (NO in step S27), the process returns to step S21, and a message prompting input is output again.
Furthermore, when the understanding level R = 2 (step S24: R = 2), the user's understanding level is estimated to be high and the process can proceed immediately, so the signal processing unit 2 proceeds to the next step without confirming the recognized word (text) obtained from the input.
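The flow of FIG. 6 (steps S21 to S27) can be sketched as a loop over three callbacks. This is a sketch under stated assumptions: the callback names and the `max_attempts` limit are not part of the description, which loops back to step S21 without stating any bound.

```python
from typing import Callable, Optional


def run_dialogue_step(
    prompt_user: Callable[[], str],            # S21-S22: prompt, return recognized text
    get_understanding: Callable[[str], int],   # S23: understanding level R for the text
    confirm_with_user: Callable[[str], bool],  # S25-S27: True if the user confirms
    max_attempts: int = 3,                     # assumption: the description gives no limit
) -> Optional[str]:
    """Return the accepted recognized text, or None if every attempt failed."""
    for _ in range(max_attempts):
        text = prompt_user()
        r = get_understanding(text)            # S24 branches on R
        if r == 0:
            continue                           # low: ask for input again (back to S21)
        if r == 1:
            if confirm_with_user(text):
                return text                    # normal: confirmed, proceed
            continue                           # not confirmed: back to S21
        if r == 2:
            return text                        # high: proceed without confirmation
        raise ValueError(f"unexpected understanding level: {r}")
    return None


# Hypothetical session: the first answer is judged R = 0, the second R = 2.
answers = iter(["uh", "Maihama"])
levels = {"uh": 0, "Maihama": 2}
accepted = run_dialogue_step(lambda: next(answers), levels.get, lambda t: True)
```

Passing the three stages as callbacks keeps the branching on R separate from speech recognition and synthesis, which the description assigns to other units.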

なお、本実施例では、信号処理部２が入力音数計測手段と、理解度判定手段と、キーワード出現時間計測手段とを備えている。具体的には、認識語カウント部２６の他、理解度計算部２８が実行するステップＳ１１からステップＳ１２の処理が入力音数計測手段に相当する。また、理解度計算部２８が実行するステップＳ４からステップＳ６の処理と、ステップＳ１３からステップＳ１５の処理と、ステップＳ２３の処理が理解度判定手段に相当する。また、時刻キーワード結合部２７の他、理解度計算部２８が実行するステップＳ１からステップＳ３の処理がキーワード出現時間計測手段に相当する。 In the present embodiment, the signal processing unit 2 includes an input sound number measurement unit, an understanding level determination unit, and a keyword appearance time measurement unit. Specifically, the processing from step S11 to step S12 executed by the understanding level calculation unit 28 in addition to the recognized word counting unit 26 corresponds to the input sound number measurement means. Further, the processing from step S4 to step S6, the processing from step S13 to step S15, and the processing from step S23 executed by the understanding level calculation unit 28 correspond to the understanding level determination means. In addition to the time keyword combining unit 27, the processing from step S1 to step S3 executed by the understanding level calculating unit 28 corresponds to the keyword appearance time measuring means.

以上説明したように、本実施例の音声対話装置によれば、マイク１から入力された音声について、理解度計算部２８が、利用者に音声入力（発話）を要求してから重要なキーワードが発話されるまでのキーワード出現時間ｄの関数で表される理解度推定パラメータｓ１と、利用者が単一の発話を開始してから終了するまでに入力された総音数ｐ０に占める重要なキーワードの音数ｐ１の割合（ｐ１／ｐ０）から算出される理解度推定パラメータｓ２とを求め、理解度推定パラメータｓ１、ｓ２から対話における利用者の理解度を理解度Ｒとして算出する。そして、対話制御部２９が、算出された理解度Ｒに基づいて対話制御を実行し、例えば理解度Ｒ＝０の場合、利用者の理解度は低いと判定し、再度入力を促すメッセージを出力する。また、理解度Ｒ＝１の場合、利用者の理解度は通常と判定し、入力内容を確認して次のステップに進む。更に、理解度Ｒ＝２の場合、利用者の理解度は高いと判定し、入力から得られた認識語を確認せずに次のステップへ進む。 As described above, according to the voice interactive apparatus of the present embodiment, an important keyword is obtained after the comprehension calculator 28 requests voice input (utterance) from the user for the voice input from the microphone 1. Important keywords that account for the understanding level estimation parameter s1 expressed as a function of the keyword appearance time d until the utterance and the total number of sounds p0 input from the start to the end of the single utterance by the user. The comprehension level estimation parameter s2 calculated from the ratio of the number of sounds p1 (p1 / p0) is obtained, and the comprehension level of the user in the dialog is calculated as the understanding level R from the understanding level estimation parameters s1 and s2. Then, the dialogue control unit 29 executes dialogue control based on the calculated understanding level R. For example, when the understanding level R = 0, the dialogue control unit 29 determines that the understanding level of the user is low and outputs a message prompting input again. To do. If the understanding level R = 1, it is determined that the user's level of understanding is normal, the input content is confirmed, and the process proceeds to the next step. Further, when the understanding level R = 2, it is determined that the user has a high level of understanding level, and the process proceeds to the next step without confirming the recognized word obtained from the input.

従って、理解度が高い利用者には簡潔な応答による対話制御を実行し、一方理解度が低い利用者には詳細かつ丁寧な応答による対話制御を実行することで、利用者のレベルに従って適切な対話制御を実行し、利用者の利便性を向上させる音声対話装置を実現することができるという効果が得られる。また、自由発音特有の意味を持たない言葉の入力を許容することで、利用者の発話に対するプレッシャーを極力排除することができるという効果が得られる。また、特に単純に利用者が発話したか否かではなく、重要なキーワードが発話されるまでの時間により理解度を推定することで、重要なキーワードがいつ発話されても正確に理解度を算出すると共に、更に入力された音声の総音数と重要なキーワードの音数との比較により理解度を推定することで、音数の取りうる幅が大きな言葉や、語尾に付く丁寧語の影響を受けることなく正確に理解度を算出し、適切な対話制御を実行することができるという効果が得られる。 Therefore, by executing dialogue control with concise responses for users with a high understanding level and dialogue control with detailed, careful responses for users with a low understanding level, a voice dialogue apparatus can be realized that executes dialogue control suited to the user's level and improves convenience for the user. In addition, because meaningless filler words characteristic of free utterance are accepted as input, the pressure the user feels when speaking is reduced as far as possible. In particular, because the understanding level is estimated not simply from whether the user has spoken but from the time that elapses until an important keyword is uttered, it is calculated accurately no matter when the important keyword is uttered. Furthermore, because the understanding level is also estimated by comparing the total number of sounds in the input speech with the number of sounds in the important keyword, it is calculated accurately without being affected by words whose number of sounds can vary widely or by polite expressions attached to the end of a sentence, and appropriate dialogue control can be executed.

次に、第２の実施例について説明する。第２の実施例では、利用者に音声入力（発話）を要求してから重要なキーワードが発話されるまでのキーワード出現時間と、利用者が単一の発話を開始してから終了するまでの総発話時間に占める利用者が重要なキーワード自体を発話するのに要したキーワード発話継続時間の割合とにより、対話における利用者の理解度を推定する場合を示す。 Next, a second embodiment will be described. In the second embodiment, the keyword appearance time from the time when the user requests voice input (speech) to the time when an important keyword is uttered, and the time from when the user starts a single utterance until it ends. The case where the user's comprehension level in the dialogue is estimated by the ratio of the keyword utterance duration required for the user to utter the important keyword itself in the total utterance time is shown.

（装置構成、及び信号処理部の詳細）
本実施例における装置構成、及び信号処理部の詳細は、第１の実施例と同一なので、ここでは説明を省略する。 (Details of device configuration and signal processing unit)
Since the details of the apparatus configuration and the signal processing unit in this embodiment are the same as those in the first embodiment, description thereof is omitted here.

（理解度推定パラメータ）
次に、本実施例において理解度計算部２８が理解度Ｒを算出するための理解度推定パラメータについて説明する。
具体的には、図３に示す理解度推定パラメータを算出するための各要素に対して、理解度推定パラメータｓ１は、第１の実施例と同様に、キーワード出現時間ｄを変数に持つ関数として示される。一方、理解度推定パラメータｓ３は、利用者が単一の発話を開始してから終了するまでの総発話時間Ｔ２に占める利用者が重要なキーワード自体を発話するのに要したキーワード発話継続時間Ｔｋの割合、すなわち”Ｔｋ／Ｔ２”を変数に持つ関数として示される。 (Understanding level estimation parameter)
Next, an understanding level estimation parameter for the understanding level calculation unit 28 to calculate the understanding level R in the present embodiment will be described.
Specifically, for the elements used to calculate the understanding level estimation parameters shown in FIG. 3, the understanding level estimation parameter s1 is expressed, as in the first embodiment, as a function of the keyword appearance time d. The understanding level estimation parameter s3, on the other hand, is expressed as a function of the ratio of the keyword utterance duration Tk, which the user took to utter the important keyword itself, to the total utterance time T2 from the start to the end of a single utterance by the user, that is, a function of "Tk / T2".

(Calculation procedure of understanding level estimation parameter s3)
Next, the procedure by which the understanding level calculation unit 28 calculates the understanding level estimation parameter s3 will be described with reference to the drawings. FIG. 7 is a flowchart showing the calculation procedure of the understanding level estimation parameter s3 in the understanding level calculation unit 28.
In FIG. 7, the understanding level calculation unit 28 first acquires the total utterance time T2 from when the user starts a single utterance until it ends (step S31).
Next, it acquires the keyword utterance duration Tk that the user took to utter the important keyword itself (step S32).
It then determines whether the ratio (Tk / T2) of the keyword utterance duration Tk to the user's total utterance time T2 is, for example, 0.5 or more (step S33).

If, in step S33, the ratio (Tk / T2) of the keyword utterance duration Tk to the user's total utterance time T2 is 0.5 or more (YES in step S33), the user's understanding level in the dialogue is estimated to be high, and the understanding level estimation parameter s3 is set to "1" (s3 = 1) (step S34).
If, in step S33, the ratio (Tk / T2) of the keyword utterance duration Tk to the user's total utterance time T2 is less than 0.5 (NO in step S33), the user's understanding level in the dialogue is estimated to be low, and the understanding level estimation parameter s3 is set to "0" (s3 = 0) (step S35).
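The decision in steps S31 to S35 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the understanding level calculation unit 28: the function name `estimate_s3`, its parameter names, and the error handling for a non-positive T2 are assumptions introduced here, while the 0.5 threshold is the example value given in the text.

```python
def estimate_s3(total_utterance_time: float,
                keyword_duration: float,
                threshold: float = 0.5) -> int:
    """Return s3 = 1 if Tk / T2 >= threshold (understanding estimated high), else 0.

    total_utterance_time: T2, time from start to end of a single utterance (step S31).
    keyword_duration:     Tk, time spent uttering the important keyword (step S32).
    threshold:            example value 0.5 from the description (step S33).
    """
    if total_utterance_time <= 0:
        # Hypothetical guard: the description does not say how an empty utterance is handled.
        raise ValueError("total utterance time T2 must be positive")
    ratio = keyword_duration / total_utterance_time  # Tk / T2
    return 1 if ratio >= threshold else 0            # steps S34 / S35
```

For example, an utterance lasting 2.0 s in which the keyword occupies 1.2 s gives Tk / T2 = 0.6 and therefore s3 = 1, whereas the same keyword embedded in a 4.0 s utterance gives 0.3 and s3 = 0.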

(Calculation procedure of understanding level R)
Next, the procedure by which the understanding level calculation unit 28 calculates the understanding level R in this embodiment will be described. In the understanding level calculation unit 28, the understanding level R is calculated using the understanding level estimation parameters s1 and s3 described above. More specifically, in this embodiment, the understanding level estimation parameters s1 and s3 are combined and the understanding level R is calculated by equation (2) below. Accordingly, in this embodiment as well, the understanding level R is calculated as one of three states: R = 0, R = 1, or R = 2.

R = s1 + s3 ... (2)

(Dialogue progress flow)
Once the understanding level R has been calculated, the voice dialogue apparatus of this embodiment also executes dialogue control according to the dialogue progress flow based on the understanding level R that was described with reference to FIG. 6 in the first embodiment.
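As a rough illustration of how equation (2) feeds the dialogue progress flow, the sketch below sums the two binary parameters and maps the result to the three behaviors that this description attributes to R = 0, 1, and 2 (prompt for re-input, confirm the input and proceed, proceed without confirming). The names `DialogueAction`, `understanding_level`, and `select_action` are hypothetical and do not appear in the patent; the actual flow of FIG. 6 may contain steps not modeled here.

```python
from enum import Enum


class DialogueAction(Enum):
    """Hypothetical labels for the three behaviors described for R = 0, 1, 2."""
    REQUEST_REINPUT = "request re-input"
    CONFIRM_THEN_PROCEED = "confirm input, then proceed"
    PROCEED_WITHOUT_CONFIRMING = "proceed without confirming"


def understanding_level(s1: int, s3: int) -> int:
    """Equation (2): R = s1 + s3, where each parameter is 0 or 1."""
    for name, value in (("s1", s1), ("s3", s3)):
        if value not in (0, 1):
            raise ValueError(f"{name} must be 0 or 1, got {value}")
    return s1 + s3


def select_action(r: int) -> DialogueAction:
    """Map the understanding level R to a dialogue behavior."""
    actions = {
        0: DialogueAction.REQUEST_REINPUT,             # understanding judged low
        1: DialogueAction.CONFIRM_THEN_PROCEED,        # understanding judged normal
        2: DialogueAction.PROCEED_WITHOUT_CONFIRMING,  # understanding judged high
    }
    if r not in actions:
        raise ValueError(f"R must be 0, 1 or 2, got {r}")
    return actions[r]
```

A user who utters the keyword promptly (s1 = 1) but buries it in a long utterance (s3 = 0) would thus get R = 1 and be asked to confirm the recognized input before the dialogue advances.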

In this embodiment, the signal processing unit 2 includes total utterance time measuring means, keyword utterance duration measuring means, understanding level determination means, and keyword appearance time measuring means. Specifically, the time keyword combining unit 27, together with the processing of step S31 executed by the understanding level calculation unit 28, corresponds to the total utterance time measuring means. The time keyword combining unit 27, together with the processing of step S32 executed by the understanding level calculation unit 28, corresponds to the keyword utterance duration measuring means. The processing of steps S33 to S35 executed by the understanding level calculation unit 28, and the processing of steps S13 to S15 and of step S23 described in the first embodiment, correspond to the understanding level determination means. The time keyword combining unit 27 described in the first embodiment, together with the processing of steps S1 to S3 executed by the understanding level calculation unit 28, corresponds to the keyword appearance time measuring means.

As described above, according to the voice dialogue apparatus of this embodiment, for the voice input from the microphone 1, the understanding level calculation unit 28 obtains two parameters: the understanding level estimation parameter s1, expressed as a function of the keyword appearance time d from when the user is requested to provide voice input (an utterance) until an important keyword is uttered, and the understanding level estimation parameter s3, calculated from the ratio (Tk / T2) of the keyword utterance duration Tk that the user took to utter the important keyword itself to the total utterance time T2 from the start to the end of a single utterance. From the understanding level estimation parameters s1 and s3 it calculates the user's level of understanding in the dialogue as the understanding level R. The dialogue control unit 29 then executes dialogue control based on the calculated understanding level R. For example, when R = 0, it determines that the user's understanding level is low and outputs a message prompting the user to input again. When R = 1, it determines that the user's understanding level is normal, confirms the input content, and proceeds to the next step. When R = 2, it determines that the user's understanding level is high and proceeds to the next step without confirming the recognized words obtained from the input.

Therefore, as in the first embodiment, dialogue control with concise responses is executed for users with a high understanding level, while dialogue control with detailed and careful responses is executed for users with a low understanding level. This makes it possible to realize a voice dialogue apparatus that executes dialogue control appropriate to the user's level and improves convenience for the user. In addition, allowing the input of the meaningless words that are characteristic of free speech removes as much of the pressure on the user's utterance as possible. Furthermore, the understanding level is estimated not simply from whether the user has spoken but from the time until an important keyword is uttered, so it can be calculated accurately whenever the important keyword is uttered. The understanding level is also estimated by comparing the total utterance time, from the start to the end of voice input, with the keyword utterance duration during which the keyword was being uttered. As a result, the apparatus can handle any words, including words whose number of sounds is difficult to identify, accurately calculating the understanding level from the proportion of meaningful words in the uttered speech and executing appropriate dialogue control.

Next, a third embodiment will be described. The third embodiment shows a case in which the user's understanding level in the dialogue is estimated from two quantities: the keyword appearance time, from when the user is requested to provide voice input (an utterance) until an important keyword is uttered, and the ratio of the keyword utterance duration, the time the user took to utter the keyword itself, to the keyword reference utterance time, the standard time required to utter a predetermined keyword.

(Details of device configuration and signal processing unit)
The device configuration and the details of the signal processing unit in this embodiment are the same as in the first embodiment, so their description is omitted here.
In this embodiment, however, the keyword database 25 stores, for each keyword, a keyword reference utterance time, which is the standard time required to utter that keyword. The keyword determination unit 23 extracts a predetermined keyword from the input recognized words, attaches a tag to the keyword portion of the recognized words, and outputs the recognized words with the keyword identified, together with the keyword reference utterance time of that keyword, to the recognition word count unit 26. The keyword reference utterance time is then output to the understanding level calculation unit 28 via the recognition word count unit 26 and the time keyword combining unit 27.

(Understanding level estimation parameters)
Next, the understanding level estimation parameters that the understanding level calculation unit 28 uses to calculate the understanding level R in this embodiment will be described.
Specifically, with respect to the elements for calculating the understanding level estimation parameters shown in FIG. 3, the understanding level estimation parameter s1 is expressed, as in the first embodiment, as a function of the keyword appearance time d. The understanding level estimation parameter s4, on the other hand, is expressed as a function of the ratio of the keyword utterance duration Tk (the time the user took to utter the keyword itself) to the keyword reference utterance time Ds (the standard time required to utter the predetermined keyword), that is, "Tk / Ds".

(Calculation procedure of understanding level estimation parameter s4)
Next, the procedure by which the understanding level calculation unit 28 calculates the understanding level estimation parameter s4 will be described with reference to the drawings. FIG. 8 is a flowchart showing the calculation procedure of the understanding level estimation parameter s4 in the understanding level calculation unit 28.
In FIG. 8, the understanding level calculation unit 28 first acquires the keyword reference utterance time Ds, which is the standard time required to utter the input important keyword (step S41).
Next, it acquires the keyword utterance duration Tk that the user took to utter this important keyword itself (step S42).
It then determines whether the ratio (Tk / Ds) of the keyword utterance duration Tk to the keyword reference utterance time Ds is, for example, 1.0 or less (step S43).

If, in step S43, the ratio (Tk / Ds) of the keyword utterance duration Tk to the keyword reference utterance time Ds is 1.0 or less (YES in step S43), the user's understanding level in the dialogue is estimated to be high, and the understanding level estimation parameter s4 is set to "1" (s4 = 1) (step S44).
If, in step S43, the ratio (Tk / Ds) of the keyword utterance duration Tk to the keyword reference utterance time Ds is greater than 1.0 (NO in step S43), the user's understanding level in the dialogue is estimated to be low, and the understanding level estimation parameter s4 is set to "0" (s4 = 0) (step S45).
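Steps S41 to S45 can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary `KEYWORD_REFERENCE_TIMES` stands in for the per-keyword reference utterance times held in the keyword database 25, and its keywords and durations are invented purely for illustration; the function name `estimate_s4` is likewise hypothetical. Only the 1.0 threshold comes from the text, where it is given as an example value.

```python
# Hypothetical stand-in for keyword database 25: keyword -> reference utterance time Ds (seconds).
# The entries and values are invented for illustration only.
KEYWORD_REFERENCE_TIMES = {
    "air conditioner": 0.9,
    "radio": 0.5,
}


def estimate_s4(reference_time: float,
                keyword_duration: float,
                threshold: float = 1.0) -> int:
    """Return s4 = 1 if Tk / Ds <= threshold (understanding estimated high), else 0.

    reference_time:   Ds, standard time needed to utter the keyword (step S41).
    keyword_duration: Tk, time the user actually took to utter it (step S42).
    threshold:        example value 1.0 from the description (step S43).
    """
    if reference_time <= 0:
        # Hypothetical guard: a reference time must be a positive duration.
        raise ValueError("keyword reference utterance time Ds must be positive")
    ratio = keyword_duration / reference_time  # Tk / Ds
    return 1 if ratio <= threshold else 0      # steps S44 / S45
```

With the illustrative table above, a user who takes 0.5 s to say "radio" gets Tk / Ds = 1.0 and s4 = 1, while a drawn-out 0.8 s rendition gives a ratio above 1.0 and s4 = 0.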

(Calculation procedure of understanding level R)
Next, the procedure by which the understanding level calculation unit 28 calculates the understanding level R in this embodiment will be described. In the understanding level calculation unit 28, the understanding level R is calculated using the understanding level estimation parameters s1 and s4 described above. More specifically, in this embodiment, the understanding level estimation parameters s1 and s4 are combined and the understanding level R is calculated by equation (3) below. Accordingly, in this embodiment as well, the understanding level R is calculated as one of three states: R = 0, R = 1, or R = 2.

R = s1 + s4 ... (3)

(Dialogue progress flow)
Once the understanding level R has been calculated, the voice dialogue apparatus of this embodiment also executes dialogue control according to the dialogue progress flow based on the understanding level R that was described with reference to FIG. 6 in the first embodiment.

In this embodiment, the signal processing unit 2 includes keyword utterance duration measuring means, understanding level determination means, and keyword appearance time measuring means. Specifically, the time keyword combining unit 27, together with the processing of step S42 executed by the understanding level calculation unit 28, corresponds to the keyword utterance duration measuring means. The processing of steps S43 to S45 executed by the understanding level calculation unit 28, and the processing of steps S13 to S15 and of step S23 described in the first embodiment, correspond to the understanding level determination means. The time keyword combining unit 27 described in the first embodiment, together with the processing of steps S1 to S3 executed by the understanding level calculation unit 28, corresponds to the keyword appearance time measuring means.

As described above, according to the voice dialogue apparatus of this embodiment, for the voice input from the microphone 1, the understanding level calculation unit 28 obtains two parameters: the understanding level estimation parameter s1, expressed as a function of the keyword appearance time d from when the user is requested to provide voice input (an utterance) until an important keyword is uttered, and the understanding level estimation parameter s4, calculated from the ratio (Tk / Ds) of the keyword utterance duration Tk, the time the user took to utter the keyword itself, to the keyword reference utterance time Ds, the standard time required to utter the predetermined keyword. From the understanding level estimation parameters s1 and s4 it calculates the user's level of understanding in the dialogue as the understanding level R. The dialogue control unit 29 then executes dialogue control based on the calculated understanding level R. For example, when R = 0, it determines that the user's understanding level is low and outputs a message prompting the user to input again. When R = 1, it determines that the user's understanding level is normal, confirms the input content, and proceeds to the next step. When R = 2, it determines that the user's understanding level is high and proceeds to the next step without confirming the recognized words obtained from the input.

Therefore, as in the first embodiment, dialogue control with concise responses is executed for users with a high understanding level, while dialogue control with detailed and careful responses is executed for users with a low understanding level. This makes it possible to realize a voice dialogue apparatus that executes dialogue control appropriate to the user's level and improves convenience for the user. In addition, allowing the input of the meaningless words that are characteristic of free speech removes as much of the pressure on the user's utterance as possible. Furthermore, the understanding level is estimated not simply from whether the user has spoken but from the time until an important keyword is uttered, so it can be calculated accurately whenever the important keyword is uttered. The understanding level is also estimated by comparing the keyword reference utterance time, the standard time required to utter a predetermined keyword, with the keyword utterance duration that the user took to utter the important keyword itself. As a result, the understanding level can be calculated accurately and appropriate dialogue control executed by checking only the important keyword, without examining the entire utterance.

In the first to third embodiments described above, the thresholds used to determine the understanding level estimation parameters s1 to s4, for example in step S4, step S13, step S33, or step S43, were treated as fixed. However, the electronic device control apparatus described in the first to third embodiments is particularly useful when mounted on a vehicle such as an automobile. As one example, when the electronic device control apparatus is mounted on an automobile, the thresholds for determining the understanding level estimation parameters s1 to s4 may be varied according to the traveling environment of the vehicle.

Specifically, a vehicle state detection device is connected to, for example, the signal processing unit 2. Using a navigation device equipped with GPS (Global Positioning System), a speed sensor that detects the traveling speed of the vehicle, and sensors provided on the steering, brakes, and the like, this device detects the traveling situation and driving state of the vehicle, including its traveling position, its traveling speed, and the user's driving operations.

The understanding level calculation unit 28 of the signal processing unit 2 then acquires the vehicle's position information, traveling speed, driving state, and so on as needed. Suppose that the user can be estimated to be under load from the traveling environment determined by the vehicle state detection device, that is, from the traveling situation and driving state of the vehicle, including its traveling position, its traveling speed, and the user's driving operations. In that case, the thresholds for determining the understanding level estimation parameters s1 to s4 are changed so that the user's understanding level is estimated to be lower in each determination, even when the measured quantity is unchanged. The measured quantities are the keyword appearance time d, the ratio (p1 / p0) of the number of sounds p1 of the important keyword to the total number of sounds p0 uttered by the user, the ratio (Tk / T2) of the keyword utterance duration Tk to the total utterance time T2, and the ratio (Tk / Ds) of the keyword utterance duration Tk to the keyword reference utterance time Ds.

As a result, while the user is driving the vehicle, even an utterance that could be presumed to have been input with confidence is treated as one in which the user may have been distracted by driving and may have misspoken. Instead of "proceeding immediately to the next step without confirming the input content", the apparatus "requests re-input" or "confirms the input content and proceeds to the next step", so that dialogue control appropriate to the user's state at that moment can be executed.
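One way to picture the threshold variation is sketched below for the two ratio-based parameters s3 and s4. The description only states that the thresholds are changed so that understanding is estimated lower when the user is under load; it does not say by how much or how the load is determined. The `Thresholds` container, the boolean `under_load` flag, and the `margin` of 0.2 are therefore all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Example thresholds from the description for the ratio-based parameters."""
    s3_min_ratio: float = 0.5  # s3 = 1 when Tk / T2 >= this value
    s4_max_ratio: float = 1.0  # s4 = 1 when Tk / Ds <= this value


def adjust_for_driving_load(base: Thresholds,
                            under_load: bool,
                            margin: float = 0.2) -> Thresholds:
    """Return stricter thresholds when the driver is estimated to be under load.

    `under_load` and `margin` are hypothetical: the description does not
    specify how load is judged or how far the thresholds are shifted.
    """
    if not under_load:
        return base
    return Thresholds(
        s3_min_ratio=base.s3_min_ratio + margin,  # harder to reach s3 = 1
        s4_max_ratio=base.s4_max_ratio - margin,  # harder to reach s4 = 1
    )


def estimate_s3(ratio_tk_t2: float, thresholds: Thresholds) -> int:
    """s3 decision using the currently active threshold."""
    return 1 if ratio_tk_t2 >= thresholds.s3_min_ratio else 0
```

Under these assumed numbers, a Tk / T2 ratio of 0.6 yields s3 = 1 with the base thresholds but s3 = 0 once the load adjustment is applied, which lowers R and steers the dialogue toward confirmation or re-input.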

In the first embodiment described above, the understanding level R is obtained from the combination of the understanding level estimation parameters s1 and s2; in the second embodiment, from the combination of s1 and s3; and in the third embodiment, from the combination of s1 and s4. In each embodiment, the understanding level R is calculated as one of three states, R = 0, R = 1, or R = 2, and dialogue control is executed accordingly. However, the combination of understanding level estimation parameters used to obtain the understanding level R is not limited to these. The understanding level R may also be calculated as one of the three states R = 0, R = 1, or R = 2 from the combination of s2 and s3, the combination of s2 and s4, or the combination of s3 and s4, and the same effects as in the first to third embodiments are obtained with these combinations.

Furthermore, the number of understanding level estimation parameters combined to obtain the understanding level R is not limited to two. A single understanding level estimation parameter may be associated with the understanding level R, so that R is calculated as one of two states, R = 0 or R = 1, and dialogue control is executed accordingly. Alternatively, three or four understanding level estimation parameters may be associated with the understanding level R, so that R is calculated as one of four states (R = 0 to 3) or one of five states (R = 0 to 4). In these cases, the processing of the dialogue progress flow based on the understanding level R, described with reference to FIG. 6 in the embodiments above, is changed to match the number of states of the understanding level R.
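The generalization to an arbitrary number of parameters amounts to summing n binary values, which gives n + 1 possible states for R. The following sketch illustrates that arithmetic; the function names are hypothetical, and the patent does not prescribe how the dialogue flow should be restructured for the additional states.

```python
from typing import Sequence


def combined_understanding_level(params: Sequence[int]) -> int:
    """Sum any number of binary understanding level estimation parameters into R."""
    if not params:
        raise ValueError("at least one estimation parameter is required")
    if any(p not in (0, 1) for p in params):
        raise ValueError("each estimation parameter must be 0 or 1")
    return sum(params)


def number_of_states(parameter_count: int) -> int:
    """n binary parameters give R in 0..n, i.e. n + 1 states."""
    if parameter_count < 1:
        raise ValueError("parameter_count must be at least 1")
    return parameter_count + 1
```

With all four parameters s1 to s4 in use, R ranges over 0 to 4 (five states), matching the count given in the text; with a single parameter it collapses to the two states R = 0 and R = 1.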

FIG. 1 is a block diagram showing the overall configuration of the voice dialogue apparatus of the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the signal processing unit of the voice dialogue apparatus of the same embodiment.
FIG. 3 is a diagram showing the relationship between an utterance example and the elements for calculating the understanding level estimation parameters.
FIG. 4 is a flowchart showing the calculation procedure of the understanding level estimation parameter s1 in the understanding level calculation unit.
FIG. 5 is a flowchart showing the calculation procedure of the understanding level estimation parameter s2 in the understanding level calculation unit.
FIG. 6 is a flowchart showing the dialogue progress flow based on the understanding level R in the voice dialogue apparatus of the same embodiment.
FIG. 7 is a flowchart showing the calculation procedure of the understanding level estimation parameter s3 in the understanding level calculation unit in the second embodiment of the present invention.
FIG. 8 is a flowchart showing the calculation procedure of the understanding level estimation parameter s4 in the understanding level calculation unit in the third embodiment of the present invention.

Explanation of symbols

1 Microphone (voice input means)
22 Voice recognition unit (voice recognition means)
23 Keyword determination unit (keyword determination means)
25 Keyword database (keyword reference utterance time storage means)
26 Recognition word count unit (input sound number measuring means)
27 Time keyword combining unit (keyword appearance time measuring means, total utterance time measuring means, keyword utterance duration measuring means)
28 Understanding level calculation unit (keyword appearance time measuring means, total utterance time measuring means, keyword utterance duration measuring means, understanding level determination means)
29 Dialogue control unit (dialogue control means)
S1 to S3 Keyword appearance time measuring means
S11 to S12 Input sound number measuring means
S31 Total utterance time measuring means
S32, S42 Keyword utterance duration measuring means
S4 to S6, S13 to S15, S33 to S35, S43 to S45 Understanding level determination means

Claims

Voice input means for inputting voice spoken by the user;
Speech recognition means for performing recognition processing of input speech;
Keyword determination means for extracting a predetermined keyword from the recognized voice of the user;
Input sound number measuring means for measuring the sound number of the recognized user's voice;
Understanding level determination means for determining the level of understanding of the user in the dialogue based on the ratio of the number of sounds of the keyword to the total number of sounds input by the user's utterance;
A spoken dialogue apparatus comprising dialogue control means for controlling dialogue response according to the degree of understanding of the user.

Voice input means for inputting voice spoken by the user;
Speech recognition means for performing recognition processing of input speech;
Keyword determination means for extracting a predetermined keyword from the recognized voice of the user;
Keyword appearance time measuring means for measuring a keyword appearance time from when the user is requested to speak until the user utters the keyword;
Understanding level determination means for determining the level of understanding of the user in the dialogue based on the length of the keyword appearance time;
A spoken dialogue apparatus comprising dialogue control means for controlling dialogue response according to the degree of understanding of the user.

Voice input means for inputting voice spoken by the user;
Speech recognition means for performing recognition processing of input speech;
Keyword determination means for extracting a predetermined keyword from the recognized voice of the user;
Total utterance time measuring means for measuring a total utterance time from the start to the end of the user's utterance;
Keyword utterance duration measuring means for measuring a keyword utterance duration required for the user to utter the keyword itself;
Understanding level determination means for determining the level of understanding of the user in the dialogue based on the ratio of the keyword utterance duration to the total utterance time;
A spoken dialogue apparatus comprising dialogue control means for controlling dialogue response according to the degree of understanding of the user.

Voice input means for inputting voice spoken by the user;
Speech recognition means for performing recognition processing of input speech;
Keyword determination means for extracting a predetermined keyword from the recognized voice of the user;
Keyword reference utterance time storage means for storing, for each keyword, a keyword reference utterance time that is a standard time required to utter the keyword;
Keyword utterance duration measuring means for measuring a keyword utterance duration required for the user to utter the keyword itself;
Understanding level determination means for determining the level of understanding of the user in dialogue based on the ratio of the keyword utterance duration to the keyword reference utterance time;
A spoken dialogue apparatus comprising dialogue control means for controlling dialogue response according to the degree of understanding of the user.
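
A minimal sketch of the stored reference times and the resulting ratio is given below. The table contents, the keyword strings, and the function name are invented for illustration. The claim does not state how the ratio is interpreted, so no interpretation is encoded here.

```python
# Hypothetical reference table: standard time in seconds to utter each keyword.
REFERENCE_UTTERANCE_S = {
    "restaurant": 0.8,
    "gas station": 1.0,
}

def reference_ratio(keyword, measured_duration_s, table=REFERENCE_UTTERANCE_S):
    """Return measured keyword duration divided by its stored reference time.

    Raises KeyError if the keyword has no stored reference time.
    """
    if measured_duration_s < 0:
        raise ValueError("measured duration must be non-negative")
    return measured_duration_s / table[keyword]
```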

Voice input means for inputting voice spoken by the user;
Speech recognition means for performing recognition processing of input speech;
Keyword determination means for extracting a predetermined keyword from the recognized voice of the user;
At least two of the following four sets of means:
input sound number measuring means for measuring the number of sounds in the recognized voice of the user;
keyword appearance time measuring means for measuring a keyword appearance time from when the user is requested to speak until the user utters the keyword;
total utterance time measuring means for measuring a total utterance time from the start to the end of the user's utterance, and keyword utterance duration measuring means for measuring a keyword utterance duration required for the user to utter the keyword itself;
keyword reference utterance time storage means for storing, for each keyword, a keyword reference utterance time, which is the standard time required to utter that keyword, and keyword utterance duration measuring means for measuring a keyword utterance duration required for the user to utter the keyword itself;
Understanding level determination means for determining the level of understanding of the user in the dialogue based on a combination of at least two of: the ratio of the number of sounds of the keyword to the total number of sounds input by the user's utterance, the length of the keyword appearance time, the ratio of the keyword utterance duration to the total utterance time, and the ratio of the keyword utterance duration to the keyword reference utterance time;
A spoken dialogue apparatus comprising dialogue control means for controlling dialogue response according to the degree of understanding of the user.
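
The combination and the subsequent dialogue control can be sketched as below. The three actions follow the abstract (level 0: urge re-input, level 1: confirm the input and proceed, level 2: proceed without confirmation). The combination rule, which takes the lowest of the individual levels, is an assumption made only for this sketch and is not the rule stated in the patent. All names are hypothetical.

```python
from enum import Enum

class Action(Enum):
    REPROMPT = "ask the user to input again"
    CONFIRM = "confirm the input, then proceed"
    PROCEED = "proceed without confirmation"

def combine_levels(levels):
    """Combine two or more per-measure levels (each 0, 1, or 2).

    Assumption: the conservative choice of the lowest level is used.
    """
    if len(levels) < 2:
        raise ValueError("at least two measures are required")
    if any(level not in (0, 1, 2) for level in levels):
        raise ValueError("each level must be 0, 1, or 2")
    return min(levels)

def choose_action(level):
    """Select the dialogue response for an understanding level."""
    return {0: Action.REPROMPT, 1: Action.CONFIRM, 2: Action.PROCEED}[level]
```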

The spoken dialogue apparatus according to any one of claims 1 to 5, further comprising a travel environment determination unit for determining the travel environment of the vehicle in which the apparatus is mounted,
wherein the understanding level determination means changes a threshold value for determining the level of understanding of the user in the dialogue according to the travel environment of the vehicle determined by the travel environment determination unit.
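
One way to picture the threshold change is a per-environment scale factor applied to base time thresholds, as in the sketch below. The environment labels, the scale factors, and the choice to relax thresholds in more demanding driving conditions are all assumptions for illustration. The claim states only that the threshold changes with the travel environment.

```python
# Hypothetical scale factors per travel environment.
ENVIRONMENT_SCALE = {
    "stopped": 1.0,   # no driving load: use the base thresholds
    "highway": 1.25,  # moderate load: allow somewhat slower responses
    "city": 1.5,      # high load: allow the slowest responses
}

def adjusted_thresholds(base_fast_s, base_slow_s, environment):
    """Scale the base time thresholds (seconds) for the given environment.

    Raises KeyError if the environment label is not in the table.
    """
    scale = ENVIRONMENT_SCALE[environment]
    return base_fast_s * scale, base_slow_s * scale
```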