JP2003228393A

JP2003228393A - Device and method for voice interaction, voice interaction program and recording medium therefor

Info

Publication number: JP2003228393A
Application number: JP2002024799A
Authority: JP
Inventors: Yoshihito Yasuda; 宜仁安田; Kouji Dousaka; 浩二堂坂; Kiyoaki Aikawa; 清明相川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-01-31
Filing date: 2002-01-31
Publication date: 2003-08-15

Abstract

PROBLEM TO BE SOLVED: To enable smooth interaction with a user by accurately grasping the content of the user's request without restricting the user's utterances.

SOLUTION: The device is provided with a voice recognizing means 203 which receives a user's voice signal and recognizes it with two different dialogue grammars to obtain two recognition results, a language understanding means 140 which analyzes each of the two recognition results and generates two kinds of understanding states, each consisting of item names, values, and likelihoods of the values, corresponding to the two recognition results, and a recognition grammar selecting means 150 which determines which of the two kinds of understanding states to adopt by comparing the reliabilities of the two recognition results.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice interaction device and method that carry out a predetermined operation, such as schedule management, weather information guidance, or car navigation, by exchanging spoken dialogue with a person, and also to a program therefor and a recording medium on which the program is recorded.

[0002]

[Prior Art] So-called voice interaction devices, which accomplish a predetermined operation through dialogue with a user, are known. Such a device has a voice recognition unit that converts speech uttered by a person into a character string, uses this result together with the dialogue history so far to control the content of the device's next utterance, and outputs that utterance as speech through a voice output unit.

[0003] A voice interaction device normally repeats a cycle in which it establishes the user's request through conversational exchanges and then performs an operation according to the established request. The exchanges used to establish the request fall into two types: "system-initiative dialogue" and "user-initiative dialogue". In system-initiative dialogue, the range of what the user may say in response to a confirmation or question from the device is strongly restricted. For example, when the device confirms or asks for a prefecture name, it accepts nothing but a prefecture name as the user's utterance. In user-initiative dialogue, no such restriction is placed on the user's answer, and free utterances are allowed.

[0004] In system-initiative dialogue, the user has little freedom in what to say, so the recognition grammar need only cover a narrow vocabulary and a limited set of phrasings. In user-initiative dialogue, free utterances must be allowed, so a recognition grammar covering a wide vocabulary has to be prepared.

[0005] Recognition with a system-initiative grammar can be expected to be more accurate because of the strong restrictions, but the user cannot speak freely. Recognition with a user-initiative grammar lets the user speak freely, but recognition accuracy drops.

[0006] One conventional technique for solving this problem is the mixed-initiative approach. It switches between the two styles for each dialogue scene so as to exploit both the high recognition accuracy of system-initiative dialogue and the freedom of utterance of user-initiative dialogue. Dialogue example 1 below is an example of a mixed-initiative dialogue.

<Dialogue example 1>
Device utterance 1: "How may I help you?" (free question)
User utterance 2: "Tell me the temperature."
Device utterance 3: "For when would you like the weather forecast?" (restricted question)
User utterance 4: "Tomorrow."

[0007] In this conventional technique, however, the dialogue initiative had to be judged and the grammar to be used had to be selected at the time the device speaks, that is, before the user utterance to be recognized is made.

[0008]

[Problem to Be Solved by the Invention] The user does not necessarily speak in accordance with the initiative the device intended. In particular, even when the device assumes system initiative and prepares only a limited grammar, the user may try to convey more information to the device. As a result, the user's intention is not conveyed accurately.

[0009] For example, consider dialogue example 2 below.

<Dialogue example 2>
Device utterance 1: "How may I help you?"
User utterance 2: "Tell me the temperature."
Device utterance 3: "For when would you like the weather forecast?"
User utterance 4: "Tell me the temperature in Kanagawa."

[0010] Thus, even when the device, in device utterance 3, intends the user to speak about the "time" attribute, the user may in fact speak about attributes such as "place" or "request type", as in user utterance 4. If a system-initiative grammar is in use in such a case, the user's intention regarding "place" and "request type" is not conveyed.

[0011] Even with conventional techniques it was possible to accept free user utterances by always using the user-initiative grammar, but recognition accuracy was then low.

[0012] An object of the present invention is therefore to provide a voice interaction device and method that do not restrict the range of what the user may say for each dialogue scene, as the conventional mixed-initiative approach does, and that nevertheless grasp the user's intention more accurately than conventional methods.

[0013]

[Means for Solving the Problem] The main feature of the present invention is that it provides: voice recognition means that receives the user's voice signal and recognizes it with each of two different dialogue grammars to obtain two sets of recognition results; language understanding means that analyzes each of the two sets of recognition results and creates two kinds of understanding states, one for each set, each consisting of item names, values, and the likelihoods of the values; and recognition grammar selection means that determines which of the two kinds of understanding states to adopt. Action determination means takes the selected understanding state as input and determines the device's next action, and voice generation and output means outputs, as speech, the confirmation or response content corresponding to the determined action. This is repeated each time the user speaks.

[0014] Of the two grammars of the voice recognition means, one consists only of items having the same attribute as the item that the device asked about or confirmed immediately before recognition, plus the general vocabulary needed to advance the dialogue. The other grammar also includes, in its recognition vocabulary, words whose attributes differ from that of the item the device asked about or confirmed.

[0015] In a first implementation, the recognition grammar selection means compares the confidences of the two sets of recognition results to determine which of the two kinds of understanding states to adopt. In a second implementation, the determination is based on some or all of the following, in addition to the confidences of the two sets of recognition results obtained by the voice recognition unit: the confidences of the immediately preceding two sets of recognition results, the elapsed time of the dialogue, the number of dialogue exchanges, and the length, power, and prosodic information of the user's speech.

[0016]

[Embodiments of the Invention] An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of the voice interaction device according to the present invention. The voice interaction device 100 comprises a voice input unit 110, a voice information extraction unit 120, a voice recognition unit 130, a language understanding unit 140, a recognition grammar selection unit 150, an action determination unit 160, a voice generation and output unit 170, and a task specification database 180. The voice recognition unit 130 comprises a recognition unit 131 having a system-initiative dialogue grammar (the system-initiative voice recognition unit) and a recognition unit 132 having a user-initiative dialogue grammar (the user-initiative voice recognition unit). In practice, the voice interaction device 100 is built using a computer.

[0017] FIG. 2 is a flowchart showing an example of the processing procedure of the voice interaction device 100. The operation of each unit in FIG. 1 is described below with reference to FIG. 2.

[0018] First, the voice input unit 110 receives the speech uttered by the user, converts it into an electrical signal, converts that into a digital signal, and outputs the result (step 201). An ordinary microphone or telephone combined with an A/D converter, for example, can be used as the voice input unit 110.

[0019] Next, the voice information extraction unit 120 receives the voice signal (digital voice signal) output by the voice input unit 110 and outputs information such as the length, power, and prosody of the user's utterance (step 202).
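As a rough illustration of step 202, the following Python sketch computes two of the quantities named above, utterance length and power, from a digitized signal. The function name, the use of mean squared amplitude as "power", and the sample-rate parameter are assumptions made for this example; the specification does not say how these values are computed, and prosody extraction is left out.

```python
from typing import Dict, Sequence


def extract_voice_info(samples: Sequence[float], sample_rate_hz: int) -> Dict[str, float]:
    """Return the length (seconds) and mean power of one utterance.

    Hypothetical helper: "power" is taken here to be the mean squared
    amplitude, which is only one possible definition.
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    n = len(samples)
    length_s = n / sample_rate_hz
    power = sum(s * s for s in samples) / n if n else 0.0
    return {"length_s": length_s, "power": power}


# One second of a synthetic signal sampled at 16 kHz.
info = extract_voice_info([0.5, -0.5] * 8000, sample_rate_hz=16000)
```

In a real device these values would be passed on to the recognition grammar selection unit 150 together with the recognizer confidences.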

[0020] Meanwhile, the voice recognition unit 130 receives the voice signal (digital voice signal) from the voice input unit 110 and outputs the recognized character string and the confidence (score) of the recognition result (step 203). Voice recognition processing of this kind is well known, so its details are omitted.

[0021] The voice recognition unit 130 comprises the recognition unit 131, which has the system-initiative dialogue grammar, and the recognition unit 132, which has the user-initiative dialogue grammar. Each receives the voice signal from the voice input unit 110 and outputs a recognized character string and the confidence (score) of that result. In other words, for each voice signal from the voice input unit 110, the voice recognition unit 130 outputs two sets of recognition results (two character strings and their confidences), one from the system-initiative dialogue grammar and one from the user-initiative dialogue grammar.

[0022] For example, assume dialogue example 3 below.

<Dialogue example 3>
Device utterance 1: "For when would you like the weather forecast?"
User utterance 2: "Tomorrow."
Device utterance 3: "Please specify the prefecture."
User utterance 4: "Western Kanagawa Prefecture, tomorrow."

[0023] After device utterance 1, the system-initiative dialogue grammar consists only of dates and days of the week. After device utterance 3, it consists only of prefecture names. The user-initiative dialogue grammar, by contrast, does not restrict its vocabulary according to the dialogue scene: it includes prefecture names and weather types in its recognition vocabulary even after device utterance 1, and includes dates and days of the week even after device utterance 3.
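The difference between the two grammars can be sketched as two vocabulary sets. The Python example below is only an illustration: the attribute names, the word lists, and the function names are made up for this sketch, and a real recognition grammar would also constrain phrasing, not just vocabulary.

```python
from typing import Dict, Set

# Hypothetical vocabulary, grouped by attribute.
ATTRIBUTE_VOCAB: Dict[str, Set[str]] = {
    "time": {"today", "tomorrow", "Monday"},
    "place": {"Kanagawa", "Ehime"},
    "information type": {"weather", "temperature"},
}
# General words needed to keep the dialogue moving.
GENERAL_VOCAB: Set[str] = {"yes", "no"}


def system_initiative_vocab(asked_attribute: str) -> Set[str]:
    """Words of the attribute the device just asked about, plus general words."""
    return ATTRIBUTE_VOCAB[asked_attribute] | GENERAL_VOCAB


def user_initiative_vocab() -> Set[str]:
    """Words of every attribute, regardless of what the device asked."""
    vocab = set(GENERAL_VOCAB)
    for words in ATTRIBUTE_VOCAB.values():
        vocab |= words
    return vocab


after_time_question = system_initiative_vocab("time")
unrestricted = user_initiative_vocab()
```

With these sets, "Kanagawa" is out of vocabulary for the system-initiative grammar after a question about time, but is always covered by the user-initiative grammar.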

[0024] The voice recognition unit 130 is of a kind whose vocabulary and grammar can be swapped on instruction from the action determination unit 160, described later.

[0025] Next, the language understanding unit 140 receives the recognized character strings and confidences (scores) from the voice recognition unit 130 and outputs understanding states (step 204). Here, an understanding state is represented by item names, values, and the likelihoods of the values. Item names are, for example, "place", "time", and "information type"; values are the values of those items, such as "Kanagawa" and "tomorrow". The likelihood of a value is expressed numerically, and the confidence obtained from the voice recognition unit 130, for example, can be used. In a weather information guidance system, for instance, the recognition result "Please tell me tomorrow's temperature in Kanagawa" yields an understanding state like that in FIG. 3.

[0026] The language understanding unit 140 receives the two sets of recognition results from the system-initiative voice recognition unit 131 and the user-initiative voice recognition unit 132 of the voice recognition unit 130, processes each, and outputs two kinds of understanding states. Suppose, for example, that the user's utterance "Ashita no, eetto" ("Tomorrow's, um...") is recognized as "Ashita no Ehime no" ("Tomorrow's Ehime's") by the user-initiative dialogue grammar and as "Ashita" ("Tomorrow") by the system-initiative dialogue grammar. In this case, the language understanding unit 140 creates and outputs two kinds of understanding states, as in FIG. 4 and FIG. 5.
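One way to picture an understanding state is as a mapping from item name to a value and its likelihood. The sketch below is a hypothetical Python rendering: the `Slot` class, the keyword lexicon, and the confidence numbers are assumptions for illustration, and the actual contents of FIG. 4 and FIG. 5 are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Slot:
    value: str
    likelihood: float  # here simply the recognizer confidence


# Hypothetical word-to-item lexicon for a weather task.
LEXICON: Dict[str, str] = {
    "tomorrow": "time",
    "Ehime": "place",
    "Kanagawa": "place",
    "temperature": "information type",
}


def understand(words: List[str], confidence: float) -> Dict[str, Slot]:
    """Build an understanding state by simple keyword spotting (an assumption)."""
    state: Dict[str, Slot] = {}
    for word in words:
        item = LEXICON.get(word)
        if item is not None:
            state[item] = Slot(word, confidence)
    return state


# The "Tomorrow's, um..." example, with made-up confidence values.
user_state = understand(["tomorrow", "Ehime"], confidence=0.4)
system_state = understand(["tomorrow"], confidence=0.8)
```

The user-initiative result carries a spurious "place" item, which is exactly the kind of error the later selection step is meant to filter out.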

[0027] As another example, when the user's utterance is an affirmative expression such as "yes" or "that's right", or a negative expression such as "no" or "that's wrong", the item name "affirmation or negation" is used.

[0028] Next, the recognition grammar selection unit 150 receives the two kinds of understanding states from the language understanding unit 140. Based on the confidences of the two sets of recognition results from the voice recognition unit 130, the length, power, and prosody of the user's speech from the voice information extraction unit 120, and the duration and number of exchanges in the dialogue so far held in the task specification database 180, it decides whether to adopt system initiative or user initiative, and selects and outputs the corresponding understanding state (step 205).

[0029] Conventional voice interaction devices have no such recognition grammar selection unit 150, so the recognition grammar had to be limited to a single one before the user spoke. As a result, either user utterances outside the range allowed by the prepared grammar could not be accepted, or the prepared grammar was broader than necessary and accuracy suffered.

[0030] Two embodiments of the recognition grammar selection unit 150 are described here. The first embodiment decides between system initiative and user initiative on the basis of the confidences of the two sets of recognition results from the system-initiative voice recognition unit 131 and the user-initiative voice recognition unit 132 of the voice recognition unit 130. Let CMs and CMu be the normalized confidences (scores) output by the system-initiative voice recognition unit 131 and the user-initiative voice recognition unit 132, respectively. If CMs > CMu, system initiative is adopted; if CMu > CMs, user initiative is adopted. Of the two kinds of understanding states from the language understanding unit 140, the corresponding one is then output. FIG. 6 shows a flowchart of this embodiment.
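The comparison in the first embodiment is small enough to write out directly. The Python sketch below assumes the two confidences have already been normalized to a common scale; the function name is hypothetical, and the handling of a tie (CMs equal to CMu), which the text does not specify, is an assumption made here in favor of system initiative.

```python
from typing import Any, Tuple


def select_by_confidence(cm_s: float, cm_u: float,
                         state_s: Any, state_u: Any) -> Tuple[str, Any]:
    """Pick the understanding state whose recognizer was more confident.

    cm_s / cm_u: normalized confidences of the system-initiative and
    user-initiative recognizers. Ties fall back to system initiative,
    which is an assumption of this sketch, not something the text states.
    """
    if cm_u > cm_s:
        return "user", state_u
    return "system", state_s


choice, adopted = select_by_confidence(0.8, 0.4, {"time": "tomorrow"},
                                       {"time": "tomorrow", "place": "Ehime"})
```

The understanding states are passed through untouched, so any representation of them can be used with this function.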

[0031] The second embodiment decides between system initiative and user initiative on the basis of, in addition to the confidences of the two sets of recognition results from the system-initiative voice recognition unit 131 and the user-initiative voice recognition unit 132: the length, power, and prosody of the user's speech obtained by the voice information extraction unit 120; and information stored, for example, in the task specification database 180, such as the past output confidences of the two recognition units 131 and 132, the device's immediately preceding utterance, the duration of the exchanges in the dialogue so far, and the number of exchanges in the dialogue so far. This embodiment is described in detail below.

[0032] A machine learning technique can be used for the second embodiment. The details of the algorithm do not matter, provided that, given some or all of the features (the understanding states described above, the confidence of the immediately preceding recognition, the confidences of past recognitions, the device's immediately preceding utterance, the elapsed time of the dialogue so far, the number of exchanges so far, and the length, power, and prosody of the speech), it can perform the binary classification into system initiative or user initiative.

[0033] As one concrete example, the case is described in which the features are the immediately preceding confidence of the user-initiative voice recognition unit, the elapsed time of the dialogue, the number of dialogue exchanges, and the length of the user's speech. To prepare training data, user utterances made in response to device utterances are collected. Each user utterance is manually labeled according to whether the dialogue can proceed with the system-initiative grammar or not (that is, whether the user-initiative grammar is required). For example:

Device utterance 1: "For when would you like the weather forecast?"
User utterance 2: "Tomorrow." ... system-initiative grammar
Device utterance 3: "Please specify the prefecture."
User utterance 4: "Western Kanagawa Prefecture, tomorrow." ... user-initiative grammar

[0034] Next, the correspondence between the feature values at the time of each user utterance and the label assigned above (system-initiative grammar or user-initiative grammar) is written out, for example as in FIG. 7. Building a decision tree from this information yields rules such as the following:

(1) If the number of exchanges is greater than 5: system initiative.
(2) If the number of exchanges is 5 or fewer and the confidence is greater than 0.7: user initiative.
(3) If the number of exchanges is 5 or fewer, the confidence is less than 0.7, the elapsed time is 10 seconds or more, and the user's speech is longer than 2 seconds: system initiative.
(4) If the number of exchanges is 5 or fewer, the confidence is less than 0.7, the elapsed time is 10 seconds or more, and the user's speech is 2 seconds or shorter: user initiative.
(5) If the number of exchanges is 5 or fewer, the confidence is less than 0.7, and the elapsed time is less than 10 seconds: user initiative.
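Rules (1) to (5) translate directly into nested conditions. The Python sketch below encodes them as written; the function and parameter names are hypothetical, and a confidence of exactly 0.7, which the rules leave unspecified, is treated here as falling on the "less than 0.7" side as an assumption.

```python
def select_initiative(exchanges: int, confidence: float,
                      elapsed_s: float, voice_length_s: float) -> str:
    """Apply the example decision-tree rules (1) to (5).

    confidence is the immediately preceding confidence of the
    user-initiative recognizer. A value of exactly 0.7 is handled like
    the "less than 0.7" branch, which is an assumption of this sketch.
    """
    if exchanges > 5:                      # rule (1)
        return "system"
    if confidence > 0.7:                   # rule (2)
        return "user"
    if elapsed_s >= 10:
        if voice_length_s > 2:             # rule (3)
            return "system"
        return "user"                      # rule (4)
    return "user"                          # rule (5)


decision = select_initiative(exchanges=3, confidence=0.5,
                             elapsed_s=12.0, voice_length_s=3.0)
```

The thresholds are the ones quoted in the example rules; a tree trained on different data would of course produce different splits.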

[0035] Rules created in this way are prepared in the recognition grammar selection unit 150 and applied each time the user speaks to decide between system initiative and user initiative. Of the two kinds of understanding states from the language understanding unit 140, the corresponding one is then selected and output. FIG. 8 shows a processing flowchart corresponding to the rules above.

[0036] The action determination unit 160 receives the understanding state output by the recognition grammar selection unit 150 and determines the device's next action (step 206). It also updates the understanding state held so far in the task specification database 180 and determines the content of the device's utterance to the user.

[0037] Updating the understanding state consists of two parts: merging item names and values, and updating confirmation flags. If the input understanding state does not contain the item name "affirmation or negation", then for each item name in the understanding state held so far, the corresponding part of the input understanding state is merged in unless that item's confirmation flag is set. If the input understanding state does contain the item name "affirmation or negation", and the previous output of the action determination unit was a confirmation, then on affirmation the confirmation flag of the previously confirmed item is set, and on negation the value of the previously confirmed item is erased.

[0038] The task specification database 180 is also compared with the current understanding state. If the current understanding state fills all the items required by some request type that the device can accept, and all of them have been confirmed, the request is regarded as understood, and the response content described in the task specification database 180 is output. If the current understanding state contains unconfirmed items, and confirming them would leave all the information required by some request type the device can handle confirmed, the device outputs a confirmation marker and a list of those items. If the current understanding state does not fill the items required by any request type the device can accept, an item name is chosen according to a predetermined order, and an information-request marker and the item name are output.
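The three-way decision above (respond, confirm, or ask) can be sketched as a lookup against a small task table. In the Python example below, the table contents, the asking order, the action labels, and the function name are all hypothetical; FIG. 9 is not reproduced, so this is only one plausible shape for the task specification database.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical task specification: request types, required items, responses.
TASK_SPEC = [
    {"request": "weather forecast", "required": ["place", "time"],
     "response": "forecast_response"},
]
ASK_ORDER = ["place", "time"]  # assumed predetermined asking order

Action = Tuple[str, Union[str, List[str]]]


def decide_action(state: Dict[str, dict]) -> Action:
    """Choose the next device action from the current understanding state."""
    def filled(task: dict) -> bool:
        return all(item in state for item in task["required"])

    # Every required item filled and confirmed: answer the request.
    for task in TASK_SPEC:
        if filled(task) and all(state[i]["confirmed"] for i in task["required"]):
            return ("respond", task["response"])
    # Filled but not fully confirmed: ask the user to confirm the rest.
    for task in TASK_SPEC:
        if filled(task):
            return ("confirm", [i for i in task["required"]
                                if not state[i]["confirmed"]])
    # Otherwise request the first missing item in the predetermined order.
    for item in ASK_ORDER:
        if item not in state:
            return ("ask", item)
    return ("ask", ASK_ORDER[0])


first_action = decide_action({})
```

The returned action would then be handed to the voice generation and output unit to be turned into a spoken question, confirmation, or answer.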

[0039] FIG. 9 shows an example of the task specification database 180. The task specification database 180 describes the request types the system can accept, the items required for each request type, and the response content for each request type. Although omitted from FIG. 9, the task specification database 180 also holds the various information required by the second embodiment of the recognition grammar selection unit 150.

[0040] Finally, the voice generation and output unit 170 receives the utterance content, confirmation, or information request output by the action determination unit 160 and outputs the device's speech (step 207). If the utterance content contains a marker for content retrieval, the retrieval is performed first and its result is substituted into the response content. The voice generation and output unit 170 can be realized by combining an existing template-based language generator, an existing relational database search engine, and an existing speech synthesizer.

[0041] An embodiment of the present invention has been described above. Needless to say, some or all of the processing functions of the units of the device shown in FIG. 1 can be implemented as a computer program and the program executed on a computer to realize the present invention, or the processing procedure shown in FIG. 2 can be implemented as a computer program and executed by a computer. A program for realizing the processing functions on a computer, or for causing a computer to execute the processing procedure, can be recorded on a computer-readable recording medium, such as an FD, MO, ROM, memory card, CD, DVD, or removable disk, for storage or distribution, and can also be distributed over a network such as the Internet.

[0042]

[Effects of the Invention] According to the present invention, the grammar matching the initiative is selected after the user has spoken. The user's request can therefore be grasped more accurately without restricting the user's utterances, and the burden on the user can be reduced.

【００４３】[0043]

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the voice interaction device according to the present invention.

FIG. 2 is a processing flowchart of an embodiment of the voice interaction method according to the present invention.

FIG. 3 is an example of an understanding state.

FIG. 4 is an example of an understanding state obtained by processing the recognition output of the user-initiative dialogue grammar.

FIG. 5 is an example of an understanding state obtained by processing the recognition output of the system-initiative dialogue grammar.

FIG. 6 is a processing flowchart of the first embodiment of the recognition grammar selection unit.

FIG. 7 is an example of the training data used in the second embodiment of the recognition grammar selection unit.

FIG. 8 is a processing flowchart of the second embodiment of the recognition grammar selection unit.

FIG. 9 is an example of the task specification database.

[Explanation of symbols]

100 Voice interaction device
110 Voice input unit
120 Voice information extraction unit
130 Voice recognition unit
131 System-initiative voice recognition unit
132 User-initiative voice recognition unit
140 Language understanding unit
150 Recognition grammar selection unit
160 Action determination unit
170 Voice generation / output unit
180 Task specification database

Continuation of front page
(72) Inventor: Kiyoaki Aikawa, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
F-terms (reference): 2F029 AA02 AA07 AC02 AC18 AC20; 5D015 AA05 HH15 LL02 LL06 LL09

Claims

[Claims]

1. A voice interaction device that receives a user's voice as input, performs processing according to the user's request, and outputs the processing result by voice, the device comprising: a voice input unit that converts the voice into an electrical voice signal; a voice recognition unit that receives the voice signal, performs voice recognition using two different dialogue grammars, and obtains two sets of recognition results; a language understanding unit that analyzes each of the two sets of recognition results and generates two kinds of understanding states, each consisting of item names, values, and likelihoods of the values, corresponding to the two sets of recognition results; a recognition grammar selection unit that determines which of the two kinds of understanding states is to be adopted; an action determination unit that takes the determined understanding state as input and determines the next action of the device; and a voice generation / output unit that outputs, as voice, confirmation content or response content corresponding to the determined action.

2. The voice interaction device according to claim 1, wherein one of the two grammars of the voice recognition unit is a grammar consisting only of, when the device has asked a question or made a confirmation before recognition, vocabulary for the item having the same attribute as that question or confirmation and general vocabulary required for the dialogue to proceed, and the other grammar is a grammar whose recognition vocabulary also includes vocabulary having attributes different from the attribute of the item that the device asked about or confirmed.

3. The voice interaction device according to claim 1 or 2, wherein the recognition grammar selection unit compares the reliabilities of the two sets of recognition results and determines which of the two kinds of understanding states is to be adopted.

4. The voice interaction device according to claim 1, wherein the recognition grammar selection unit determines which of the two kinds of understanding states is to be adopted based on, in addition to the reliabilities of the two sets of recognition results obtained by the voice recognition unit, some or all of the following: the reliabilities of the immediately preceding two sets of recognition results, the elapsed time of the dialogue, the number of dialogue exchanges, the length of the user's voice, its power, and prosodic information.

5. A voice interaction method that receives a user's voice as input, performs processing according to the user's request, and outputs the processing result by voice, the method comprising: a voice input step of converting the voice into an electrical voice signal; a voice recognition step of receiving the voice signal, performing voice recognition using two different dialogue grammars, and obtaining two sets of recognition results; a language understanding step of analyzing each of the two sets of recognition results and generating two kinds of understanding states, each consisting of item names, values, and likelihoods of the values, corresponding to the two sets of recognition results; a recognition grammar selection step of determining which of the two kinds of understanding states is to be adopted; an action determination step of taking the determined understanding state as input and determining the next action; and a voice generation / output step of outputting, as voice, confirmation content or response content corresponding to the determined action.

6. The voice interaction method according to claim 5, wherein, of the two grammars used in the voice recognition step, one is a grammar consisting only of, when the device has asked a question or made a confirmation before recognition, vocabulary for the item having the same attribute as that question or confirmation and general vocabulary required for the dialogue to proceed, and the other is a grammar whose recognition vocabulary also includes vocabulary having attributes different from the attribute of the item that the device asked about or confirmed.

7. The voice interaction method according to claim 5, wherein, in the recognition grammar selection step, the reliabilities of the two sets of recognition results are compared to determine which of the two kinds of understanding states is to be adopted.

8. The voice interaction method according to claim 5, wherein, in the recognition grammar selection step, which of the two kinds of understanding states is to be adopted is determined based on, in addition to the reliabilities of the two sets of recognition results obtained in the voice recognition step, some or all of the following: the reliabilities of the immediately preceding two sets of recognition results, the elapsed time of the dialogue, the number of dialogue exchanges, the length of the user's voice, its power, and prosodic information.

9. A voice interaction program for executing the voice interaction method according to claim 5, 6, 7 or 8 on a computer.

10. A recording medium recording a voice interaction program for executing the voice interaction method according to claim 5, 6, 7 or 8 on a computer.