JP2006039484A

JP2006039484A - Reliability degree calculation processing method for dialog comprehension result

Info

Publication number: JP2006039484A
Application number: JP2004223451A
Authority: JP
Inventors: Ryuichiro Higashinaka; 竜一郎東中; Katsuto Sudo; 克仁須藤; Mikio Nakano; 幹生中野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-07-30
Filing date: 2004-07-30
Publication date: 2006-02-09
Anticipated expiration: 2024-07-30
Also published as: JP4313267B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy with which a spoken dialogue system estimates the reliability of its understanding result for the content of a user's information request. SOLUTION: Using a prototype spoken dialogue system, a dialogue record 10 is created and stored in a storage device. The record contains system utterance information, user utterance information, the system's understanding result of the user information request content after each user utterance, and information on whether that result is correct. Discourse information on the exchange between the system and the user is retrieved from the stored dialogue record 10, and a discourse feature extraction unit 30 extracts discourse features from it. A reliability measure creation unit 50 uses discriminative learning to create a reliability measure for evaluating the reliability of the understanding result, using the discourse features extracted from the discourse information in addition to the acoustic and linguistic features of the user's speech. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to spoken dialogue technology in which a computer converses with a person by voice. In particular, it relates to a method for calculating the reliability of a dialogue understanding result, in which a spoken dialogue system calculates the reliability of its understanding result of the user information request content using discourse information about the exchange between the user and the system, in addition to the acoustic and linguistic features of the user's speech.

A spoken dialogue system is a system that, each time the user inputs an information request by voice, uses the history of the exchange between the user and the system to update the understanding result of the user information request content that it holds, and then responds to the user by voice according to the updated understanding result.

FIG. 2 shows the general configuration of a spoken dialogue system. Reference numeral 1 denotes an utterance understanding unit, and 2 denotes an utterance generation unit. In the utterance understanding unit 1, 100 is a speech recognition unit, 101 is a language understanding unit, and 102 is a discourse understanding unit. In the utterance generation unit 2, 200 is a content generation unit, 201 is a surface generation unit, and 202 is a speech generation unit. Further, 103 is a context understanding rule/knowledge database (DB) that stores context understanding rules and knowledge; 104 is a syntactic/semantic understanding rule/knowledge DB that stores syntactic and semantic understanding rules and knowledge; 105 is a dialogue state DB that stores the dialogue state; 106 is a dialogue state response generation rule/knowledge DB that stores dialogue control response generation rules and knowledge; 107 is a database that stores the data the user wants to know; and 108 is a generation rule template DB that stores generation rule templates.

The user's speech used to input an information request to the utterance understanding unit 1 of the spoken dialogue system is called a user utterance, and the information request content conveyed to the system by a single user utterance is called a dialogue act. Deriving a dialogue act from a user utterance is called language understanding.

In a spoken dialogue, the speech recognition unit 100 of the utterance understanding unit 1 first receives the user's speech and outputs a word sequence as the recognition result. The language understanding unit 101 takes this word sequence as input, performs natural language processing such as keyword extraction, syntactic analysis, and semantic analysis using the rules and knowledge in the syntactic/semantic understanding rule/knowledge DB 104, and outputs a dialogue act, that is, the understanding result of the utterance intention and utterance content.

The discourse understanding unit 102 takes the dialogue act and the current dialogue state as input and updates the dialogue state in the dialogue state DB 105, using the context understanding rules and knowledge in the context understanding rule/knowledge DB 103. The dialogue state is the collection of dialogue-related information that the system holds internally. Its main element is the understanding result of the user information request content; it also holds the user utterance history, the system utterance history, and so on.

Based on the updated dialogue state, the content generation unit 200 of the utterance generation unit 2 generates a dialogue act representing the content of the system utterance, using the dialogue state response generation rules and knowledge in the dialogue control response generation rule/knowledge DB 106 and the information in the database 107. From this dialogue act, the surface generation unit 201 outputs an utterance character string using the generation rule templates in the generation rule template DB 108, and the speech generation unit 202 generates the system utterance.

FIG. 3 shows an example dialogue together with the spoken dialogue system's understanding result of the information request content after each user utterance. In this example, the understanding result is represented as a frame. A frame is a data structure composed of attribute-value pairs called slots, and many systems use it to represent the understanding result of the user's information request. In FIG. 3, the box enclosing "place, date, info" as a whole is the frame, and the individual elements "place", "date", and "info" are the slots. The system comes to understand the user's information request as user utterances fill each slot with an appropriate value. A slot may be filled with, for example, a specific word (keyword) contained in the speech recognition result.

As shown in FIG. 3(A), suppose that no values have yet been filled in for the slots "place", "date", and "info", and that, in response to the first system utterance (S1), "How may I help you?", the user returns "Please tell me tomorrow's weather" as the first user utterance (U1).

If the spoken dialogue system misrecognizes "tomorrow" in the first user utterance (U1) as "today", the understanding result after U1 is as shown in FIG. 3(B): the slot "date" holds "today" and the slot "info" holds "weather".

Next, suppose that in response to the second system utterance (S2), "Where do you want the weather for?", the user returns "Tokyo" as the second user utterance (U2). The understanding result after U2 is then as shown in FIG. 3(C): the slot "place" holds "Tokyo", the slot "date" holds "today", and the slot "info" holds "weather".

Next, suppose that in response to the third system utterance (S3), "Today's weather in Tokyo, correct?", the user returns "No, tomorrow" as the third user utterance (U3). The understanding result after U3 is then as shown in FIG. 3(D): the value "today" of the slot "date" in the understanding result of FIG. 3(C) has been changed to "tomorrow". The spoken dialogue system then returns the fourth system utterance (S4): "Here is the forecast. Tomorrow's weather in Tokyo will be sunny."
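The slot updates in the FIG. 3 example can be sketched as follows. This is only an illustration, not the patent's implementation: the `Frame` class, the `replay` helper, and the representation of each recognized dialogue act as a plain dictionary are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

SLOTS = ("place", "date", "info")  # slot names taken from FIG. 3


@dataclass
class Frame:
    """Understanding result: one optional value per slot (hypothetical class)."""
    slots: Dict[str, Optional[str]] = field(
        default_factory=lambda: {name: None for name in SLOTS})

    def updated(self, new_values: Dict[str, str]) -> "Frame":
        """Return a copy in which the given slots are filled or overwritten."""
        merged = dict(self.slots)
        merged.update(new_values)
        return Frame(merged)


def replay(dialogue_acts: List[Dict[str, str]]) -> List[Frame]:
    """Apply each recognized dialogue act in turn and keep every intermediate frame."""
    frames = [Frame()]  # FIG. 3(A): all slots empty
    for act in dialogue_acts:
        frames.append(frames[-1].updated(act))
    return frames


# Slot values as the system recognized them; U1 contains the misrecognition
# of "tomorrow" as "today" that U3 later corrects.
history = replay([
    {"date": "today", "info": "weather"},  # after U1, FIG. 3(B)
    {"place": "Tokyo"},                    # after U2, FIG. 3(C)
    {"date": "tomorrow"},                  # after U3, FIG. 3(D)
])
```

Keeping every intermediate frame, rather than only the latest one, is a deliberate choice here: the history of values per slot is the kind of information a discourse-based reliability estimate can draw on.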

To give the user correct information, a spoken dialogue system must understand the user's information request correctly. However, because the output of a speech recognizer is imperfect, the information request content that the system holds as a result of the dialogue may contain errors. Techniques have therefore been studied that calculate the reliability of the understanding result of the user's information request and detect errors in it. Such techniques make the dialogue proceed more smoothly, for example by allowing the system to request correction of an erroneous item or to confirm it with priority.

FIG. 4 shows an example of assigning reliabilities to the user information request content using the reliabilities of the words contained in the speech recognition result of the user utterance. In FIG. 4, each word in a user utterance is assigned a reliability, and when a word fills a slot, the reliability of that word becomes the reliability of the slot value. That is, for each element of the understanding result of the user's information request, a reliability indicating whether that element is correct is calculated and assigned to the corresponding slot.

For example, if the spoken dialogue system misrecognizes the word "tomorrow" in the first user utterance (U1) as "today", the understanding result after U1 is as shown in FIG. 4(A): the slot "date" is filled with the word "today" and is assigned the reliability "c1". If "No, tomorrow." is then input as the third user utterance (U3), the understanding result after U3 is as shown in FIG. 4(C): the value of the slot "date" is changed from the word "today" to the word "tomorrow", and the slot is assigned "c9", the reliability of the word "tomorrow" in U3.

Depending on its value, this reliability is used to reject the understanding result of the user's information request or to confirm it through a system utterance such as "Do you mean today's weather?".
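A minimal sketch of how a per-slot reliability might drive rejection or confirmation is shown below. The two thresholds, the three action names, and the assumption that reliabilities lie in [0, 1] are all hypothetical; the source only says that the reliability value determines whether a result is rejected or confirmed.

```python
from typing import Dict, Tuple

# Hypothetical thresholds; the source does not specify any values.
REJECT_BELOW = 0.3
CONFIRM_BELOW = 0.7


def choose_action(reliability: float) -> str:
    """Map a slot reliability to an action: reject, confirm, or accept."""
    if reliability < REJECT_BELOW:
        return "reject"    # discard the slot value
    if reliability < CONFIRM_BELOW:
        return "confirm"   # e.g. ask "Do you mean today's weather?"
    return "accept"        # use the value without confirming it


def plan(frame: Dict[str, Tuple[str, float]]) -> Dict[str, str]:
    """Choose an action for every (value, reliability) pair in a frame."""
    return {slot: choose_action(rel) for slot, (_value, rel) in frame.items()}


actions = plan({"date": ("today", 0.45), "info": ("weather", 0.9)})
```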

Conventional methods for calculating the reliability of the understanding result of the user's information request include: using the acoustic score output by the speech recognizer as the reliability; using the recognition accuracy of the speech recognizer measured in advance; and constructing a measure that derives the reliability from the acoustic and linguistic features of the speech recognition result and then using that measure.

Non-Patent Document 1 below proposes assigning to each element of the understanding result of the user's information request, as its reliability, the probability measured in advance that the element is correct. For example, when the user makes an utterance containing the keyword "Tokyo", the probability that "Tokyo" was heard correctly is determined beforehand from a previously created dialogue record, and that value is used as the reliability.
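The idea of a reliability measured in advance per keyword can be illustrated as below. This is a sketch of the general idea only, not the procedure of Non-Patent Document 1: the record format (pairs of spoken and recognized keywords) and the function name `keyword_accuracy` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def keyword_accuracy(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """For each spoken keyword, return the fraction of times it was heard correctly.

    records: (spoken keyword, recognized keyword) pairs from a labeled
    dialogue record (hypothetical format).
    """
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for spoken, recognized in records:
        total[spoken] += 1
        if recognized == spoken:
            correct[spoken] += 1
    return {keyword: correct[keyword] / total[keyword] for keyword in total}


prior = keyword_accuracy([
    ("Tokyo", "Tokyo"), ("Tokyo", "Kyoto"), ("Tokyo", "Tokyo"),
    ("Tokyo", "Tokyo"), ("Osaka", "Osaka"),
])
```

A value obtained this way is fixed per keyword and does not change as the dialogue unfolds.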

Non-Patent Document 2 does not deal directly with the understanding result of the user's information request. Instead, for each element of a user utterance (a word or a sentence) that can make up such an understanding result, it proposes a method for obtaining a reliability with high accuracy from the acoustic and linguistic features output by the speech recognizer.

Specifically, a number of user utterances are first collected in advance, and the speech recognition results and acoustic and linguistic features that the speech recognizer outputs for them are obtained. FIG. 5 shows examples of acoustic and linguistic features for words in a user utterance. The "phoneme reliability" in FIG. 5 is the frame purity, that is, the average over frames of the score difference from the maximum-likelihood HMM state in each acoustic frame (N-best purity applied to the phoneme in each acoustic frame). FIG. 6 shows examples of acoustic and linguistic features for an utterance as a whole. Next, each output of the speech recognizer (word or sentence) is labeled as correct or incorrect. Finally, an expression (hereinafter called a reliability measure) is created that determines from the acoustic and linguistic features whether each element of the speech recognition result is "correct" or "incorrect".

By using a reliability measure created in this way, a speech recognition system can classify words in future, unseen user utterances as correct or incorrect. The classification accuracy can also be used as the reliability.

Non-Patent Document 3 is similar to Non-Patent Document 2 in that it uses the speech recognition results and the acoustic and linguistic features output by the speech recognizer, but it additionally creates a separate reliability measure for each type of content of the system utterance immediately preceding the target user utterance. It resembles the present invention in that it uses discourse information, namely the immediately preceding system utterance. The present invention, however, expresses discourse information numerically, and in this respect it differs from the prior art.
Non-Patent Document 1: "Efficient Spoken Dialogue Control Depending on the Speech Recognition Rate and System's Database," Kohji Dohsaka, Norihito Yasuda and Kiyoaki Aikawa, In Proc. Eurospeech, 2003, pp.657-660
Non-Patent Document 2: "Recognition confidence scoring and its use in speech understanding systems," Timothy J. Hazen, Stephanie Seneff and Joseph Polifroni, Computer Speech and Language, 2002, vol.16, pp.49-67
Non-Patent Document 3: "Estimating Semantic Confidence for Spoken Dialogue systems," Sameer S. Pradhan and Wayne H. Ward, In Proc. ICASSP, 2002, pp.233-236

Although a dialogue consists of exchanges between the system and the user, conventional methods estimate the reliability of the understanding result of the user information request content mainly from only the acoustic and linguistic features output by speech recognition, and their estimation accuracy is therefore problematic.

An object of the present invention is to solve the above problem of the prior art and to improve the accuracy of estimating the reliability of the understanding result of the user information request content.

To solve the above problem, the present invention applies to a computer-based spoken dialogue system that, each time the user inputs an information request by voice, uses the history of the exchange between the user and the system to update the understanding result of the user information request content that it holds, and responds to the user by voice according to the updated understanding result. In such a system, the reliability of the understanding result of the user information request content is calculated by the following method.

First, using a prototype spoken dialogue system, a dialogue record is created and saved in a storage device. The record contains system utterance information, user utterance information, the system's understanding result of the user information request content after each user utterance, and information on whether that result is correct.

Next, discourse information on the exchange between the user and the spoken dialogue system is retrieved from the dialogue record saved in the storage device, and discourse features are extracted from it. These features describe properties such as the number of appearances, the appearance ratio, or the changes of the values of the understanding result of the user information request content in the discourse information.

The discourse features include, for example, at least one of the following:
- the appearance ratio of the current value of the understanding result of the user information request content
- the appearance ratio of the value that appears most often in the understanding result
- the number of distinct values that have appeared in the understanding result
- the number of times the current value of the understanding result has been negated
- the number of times the current value of the understanding result has been overwritten by another value
- a value indicating how far back the understanding result has held the same value as the current value
- a value indicating how far back the understanding result has held a value different from the current value
- the number of times the user conveyed the same value to the spoken dialogue system after the system confirmed the value of the understanding result
- a value indicating how many times the same value as the current value of the understanding result has appeared in the user utterances so far
- a value indicating how many times a value different from the current value of the understanding result has appeared in the user utterances so far
- a value indicating how many times the same value as the current value of the understanding result has appeared in the system utterances so far
- a value indicating how many times a value different from the current value of the understanding result has appeared in the system utterances so far
Next, a reliability measure for evaluating the reliability of the understanding result of the user information request content is created by discriminative learning, based on the understanding result read from the dialogue record saved in the storage device, the information on whether it is correct, the acoustic and linguistic features of the user's speech, and the discourse features extracted from the discourse information.
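Several of the discourse features listed above can be computed from a per-slot history, as in the following sketch. The input format is an assumption: `slot_history` holds the slot's value after each user utterance, and `user_mentions` and `system_mentions` hold the values for that slot found in the user and system utterances so far. The feature names, and the exact counting conventions (for instance, computing ratios over filled entries only), are likewise illustrative choices rather than definitions from the source.

```python
from collections import Counter
from typing import Dict, List, Optional


def discourse_features(slot_history: List[Optional[str]],
                       user_mentions: List[str],
                       system_mentions: List[str]) -> Dict[str, float]:
    """Compute a subset of the listed discourse features for one slot."""
    filled = [v for v in slot_history if v is not None]
    if not filled or slot_history[-1] is None:
        raise ValueError("the slot must currently hold a value")
    current = slot_history[-1]
    counts = Counter(filled)

    # How far back the understanding result has held the current value.
    same_run = 0
    for value in reversed(slot_history):
        if value != current:
            break
        same_run += 1

    # Times the current value was held and then replaced by another value.
    overwritten = sum(1 for prev, nxt in zip(slot_history, slot_history[1:])
                      if prev == current and nxt != current)

    return {
        "current_ratio": counts[current] / len(filled),
        "max_ratio": counts.most_common(1)[0][1] / len(filled),
        "n_distinct": len(counts),
        "same_run": same_run,
        "overwritten": overwritten,
        "user_same": sum(m == current for m in user_mentions),
        "user_diff": sum(m != current for m in user_mentions),
        "system_same": sum(m == current for m in system_mentions),
        "system_diff": sum(m != current for m in system_mentions),
    }


# The "date" slot in the FIG. 3 dialogue, after U3.
features = discourse_features(
    slot_history=["today", "today", "tomorrow"],
    user_mentions=["today", "tomorrow"],   # as recognized in U1 and U3
    system_mentions=["today"],             # S3 confirmed "today"
)
```

Because each feature is a number, the result can be concatenated with the acoustic and linguistic features into a single feature vector for discriminative learning.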

This reliability measure takes as input the system utterance information, the user utterance information (including the recognition result of the user's speech and its acoustic and linguistic features), the information on the system's understanding result of the user information request content after each user utterance, and the discourse features extracted from the discourse information on the exchange between the user and the spoken dialogue system. It outputs the reliability of the understanding result of the user information request content and thus corresponds to a calculation formula. Concretely, it is realized by, for example, a program packaged as a function together with a set of data. Here, it is called the reliability evaluation means.

A spoken dialogue system in actual operation uses the created reliability evaluation means to calculate the reliability of the understanding result of the user information request content, based on the system utterance information, the user utterance information (including the recognition result of the user's speech and its acoustic and linguistic features), the information on the system's understanding result after each user utterance, and the discourse features extracted from the discourse information on the exchange between the user and the spoken dialogue system.

The present invention makes it possible to calculate, with high accuracy, the reliability of the understanding result of the user information request content in a spoken dialogue system. As a result, decisions such as rejecting understanding results with low reliability, or skipping confirmation of those with high reliability, can be made accurately, and a smoother dialogue between a person and the system can be realized.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows the configuration of an apparatus that implements the reliability measure calculation method of the present invention. The reliability measure calculation apparatus shown in FIG. 1 comprises: a dialogue record 10 in which system utterances and user utterances are recorded; a discourse memory 20 in which the understanding result of the information request content for each utterance in the dialogue record 10 is recorded; a discourse feature extraction unit 30 that extracts discourse features based on the understanding results in the discourse memory 20; a data memory 40 in which the data used to create the reliability measure, described later, is accumulated for each utterance; and a reliability measure creation unit 50 that creates the reliability measure.

In the present invention, as in Non-Patent Document 2, the reliability of the understanding result of the information request content is obtained by constructing a reliability measure. FIG. 7 shows the flow of constructing the reliability measure.

[Step S1]: The dialogue record 10 is recorded using a prototype system configured as shown in FIG. 2. The dialogue record 10 contains the system utterances and user utterances (user speech, speech recognition results, and acoustic and linguistic features) from the recording session, the understanding result of the information request content for each user utterance, and labels indicating whether each understanding result is correct.

[Step S2]: The utterance number i is initialized to i = 0.

[Step S3]: The utterance number i is incremented.

[Steps S4, S5]: The i-th utterance is retrieved from the dialogue record 10. If the dialogue record 10 has no i-th utterance, the process proceeds to step S11.

[Steps S6, S7]: If the i-th utterance is a system utterance, its content is added to the discourse memory 20 and the process returns to step S3.

[Steps S8, S9]: If the i-th utterance is a user utterance, the utterance content and the understanding result of the information request content immediately after the user utterance are first added to the discourse memory 20, and discourse features are then extracted based on the understanding results in the discourse memory 20.

[Step S10]: Next, the following are obtained:
(a) the understanding result of the information request content immediately after the user utterance,
(b) the correct answer for that understanding result,
(c) the acoustic and linguistic features of the user's speech, and
(d) the discourse features obtained from the discourse memory 20 by the discourse feature extraction unit 30.
The utterance number i and items (a), (b), (c), and (d) are added to the data memory 40 as one record, and the process returns to step S3.
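The loop of steps S2 to S10 can be sketched as below. This is an illustration under assumed data formats, not the patent's implementation: each utterance in the dialogue record is taken to be a dictionary with hypothetical keys (`speaker`, `content`, `understanding`, `correct`, `acoustic_linguistic`), and the discourse feature extractor is passed in as a function.

```python
from typing import Any, Callable, Dict, List

Utterance = Dict[str, Any]


def build_training_data(
    dialogue_record: List[Utterance],
    extract_discourse_features: Callable[[List[Utterance]], Dict[str, float]],
) -> List[Dict[str, Any]]:
    """Walk the dialogue record and collect one training record per user utterance."""
    discourse_memory: List[Utterance] = []   # corresponds to discourse memory 20
    data_memory: List[Dict[str, Any]] = []   # corresponds to data memory 40

    # S2-S5: visit utterances i = 1, 2, ... until the record is exhausted.
    for i, utterance in enumerate(dialogue_record, start=1):
        if utterance["speaker"] == "system":
            # S6, S7: store the system utterance content and move on.
            discourse_memory.append(
                {"speaker": "system", "content": utterance["content"]})
            continue

        # S8: store the user utterance and the understanding result after it.
        discourse_memory.append({
            "speaker": "user",
            "content": utterance["content"],
            "understanding": utterance["understanding"],
        })
        # S9: extract discourse features from everything seen so far.
        discourse = extract_discourse_features(discourse_memory)

        # S10: bundle i with items (a)-(d) and append to the data memory.
        data_memory.append({
            "i": i,
            "understanding": utterance["understanding"],              # (a)
            "correct": utterance["correct"],                          # (b)
            "acoustic_linguistic": utterance["acoustic_linguistic"],  # (c)
            "discourse": discourse,                                   # (d)
        })
    return data_memory  # input to step S11


record = [
    {"speaker": "system", "content": "How may I help you?"},
    {"speaker": "user", "content": "today weather",
     "understanding": {"date": "today", "info": "weather"},
     "correct": {"date": False, "info": True},
     "acoustic_linguistic": {"acoustic_score": -12.5}},
]
data = build_training_data(record, lambda memory: {"n_turns": float(len(memory))})
```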

[Step S11]: The reliability measure creation unit 50 creates the reliability measure from the data contained in the data memory 40. Reliability measures are commonly created with discriminative learning methods, many of which are listed in Non-Patent Document 2. Discriminative learning methods include the linear discriminant method, Support Vector Machines (SVM), and decision tree learning. Here, the linear discriminant method is described as one example of discriminative learning.

FIG. 8 shows a conceptual diagram of linear discriminant analysis by the linear discriminant method. The linear discriminant method, also known as discriminant analysis, finds a straight line that divides the samples into two groups. FIG. 8 shows two kinds of samples (group A and group B), each having feature 1 and feature 2, being divided into two by a linear discriminant line. SVM and decision tree learning are likewise techniques for finding a line (or plane, or hyperplane) that divides a set of samples into two (or more) parts, so a reliability measure can be created in the same way with discriminative learning based on SVM, decision tree learning, and the like.
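One standard way to find such a separating direction is Fisher's linear discriminant, sketched below for the two-feature case of FIG. 8. Naming Fisher's criterion is an assumption on my part; the source only says "linear discriminant method". The restriction to two features keeps the example free of external libraries, and the sample points are made up.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean(points: List[Point]) -> Point:
    """Component-wise mean of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def scatter(points: List[Point], m: Point) -> Tuple[float, float, float]:
    """Entries (sxx, sxy, syy) of the symmetric 2x2 scatter matrix around m."""
    sxx = sum((x - m[0]) ** 2 for x, _ in points)
    sxy = sum((x - m[0]) * (y - m[1]) for x, y in points)
    syy = sum((y - m[1]) ** 2 for _, y in points)
    return sxx, sxy, syy


def fisher_direction(group_a: List[Point], group_b: List[Point]) -> Point:
    """Direction p = Sw^{-1} (mean_A - mean_B), Sw being the within-class scatter."""
    ma, mb = mean(group_a), mean(group_b)
    axx, axy, ayy = scatter(group_a, ma)
    bxx, bxy, byy = scatter(group_b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Closed-form inverse of [[sxx, sxy], [sxy, syy]] applied to (dx, dy).
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)


def project(p: Point, f: Point) -> float:
    """Scalar projection of a feature vector onto the direction p."""
    return p[0] * f[0] + p[1] * f[1]


group_a = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
group_b = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
p = fisher_direction(group_a, group_b)
```

Projecting every sample onto `p` reduces the two-group separation to a comparison of scalar values, and the separating line of FIG. 8 is perpendicular to `p`.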

In the method described in Non-Patent Document 2, a conversion vector $\vec{p}$ that converts the feature vector $\vec{f}$ of a sample into a scalar value $r$ is first obtained by the following equation. The initial value of $\vec{p}$ can be obtained by the linear discriminant method.

$$r = \vec{p}^{\,T} \cdot \vec{f}$$

For all samples $(x_1, \ldots, x_n)$, $r_i$ is actually computed, and the probability distribution (mean and variance) of $r_{Ai}$ for the group A samples and of $r_{Bj}$ for the group B samples is obtained. Whether $r$, the result of converting an unknown sample with the conversion vector $\vec{p}$, belongs to group A or group B is then determined by the following equation.

$$c = \log \frac{p(r \mid A)\,P(A)}{p(r \mid B)\,P(B)} - t$$

The ratio of the probability of group A given the value $r$, $p(r \mid A)P(A)$, to the probability of group B, $p(r \mid B)P(B)$, is taken, and the result is adjusted by the threshold $t$. The ratio expresses how A-like $r$ is. Increasing $t$ decreases $c$, so that the same value of $r$ is more readily assigned to group B. $c$ is the value that indicates the reliability: a positive value means group A and a negative value means group B. The conversion vector $\vec{p}$ and the threshold $t$ are adjusted so as to minimize the classification error between groups A and B on known samples (see Non-Patent Document 2).
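As a minimal sketch of the two formulas above, the following Python computes r and c. It assumes that p(r|A) and p(r|B) are Gaussians defined by the mean and variance estimated for each group; the function names and all numeric values are hypothetical and are not taken from the patent or from Non-Patent Document 2.

```python
import math


def project(p, f):
    """Scalar r = p^T f for a conversion vector p and a feature vector f."""
    return sum(pi * fi for pi, fi in zip(p, f))


def log_gaussian(r, mean, var):
    """Log density of a Gaussian (the assumed form of p(r|A) and p(r|B))."""
    return -0.5 * math.log(2 * math.pi * var) - (r - mean) ** 2 / (2 * var)


def confidence(r, group_a, group_b, prior_a, prior_b, t=0.0):
    """c = log[p(r|A)P(A) / (p(r|B)P(B))] - t; positive means group A."""
    log_a = log_gaussian(r, *group_a) + math.log(prior_a)
    log_b = log_gaussian(r, *group_b) + math.log(prior_b)
    return log_a - log_b - t


# Hypothetical (mean, variance) of r estimated for each group.
GROUP_A = (1.0, 1.0)
GROUP_B = (-1.0, 1.0)

r = project([1.0, -1.0], [2.0, 1.0])
c = confidence(r, GROUP_A, GROUP_B, prior_a=0.5, prior_b=0.5)
```

Working in log densities avoids taking the logarithm of a density that has underflowed to zero for values of r far from both group means.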

Using the reliability measure created by the above steps, the reliability of the understanding result of the information request content immediately after a user utterance is obtained from the acoustic and linguistic features of the user's speech and from the discourse features.

An embodiment of the present invention is described below using a weather information guidance system as an example.

[Dialogue experiment]
In this embodiment, a spoken dialogue system with the configuration of FIG. 2 was actually built, and dialogue records were created through a dialogue experiment. Data were collected from users who had never used a spoken dialogue system before. Eighteen subjects each conducted 16 dialogues with the system, for 288 recorded dialogues in total. The experiment was run in six sessions of three subjects each, and the dialogue records from the sessions are called SET001, SET002, SET003, SET004, SET005, and SET006. All user utterances in the dialogue records were transcribed, and the user information request content after each user utterance was labeled as correct or incorrect.

[Discourse features]
The understanding result of the information request content is represented as a frame, and a reliability measure was created for estimating the reliability of the slots that make up the frame. Slot values are assumed to be filled by keywords contained in user utterances, and the reliability measure estimates the reliability of a filled slot from features of its value. A frame contains multiple slots; in the case of the weather information system, the discourse feature extraction unit 30 computes the discourse features described below for each of the slots [place, date, info].

The acoustic and linguistic features are the same as in conventional methods, namely the acoustic and linguistic features of words in the user utterance shown in FIG. 5 or the acoustic and linguistic features of utterances shown in FIG. 6. In addition, the following are introduced as discourse features. FIG. 9 summarizes the discourse features D1 to D12 for slot values described below.

[D1]: Slot purity in context
For the understanding results passed through up to the current one, count how many times each value has occupied the slot; the feature is the fraction of those occurrences accounted for by the current value. For example, if the value changes Tokyo → Osaka → Kyoto → Osaka, the current value Osaka accounts for 2 of the 4 values taken, so the feature is 1/2. In the example of the date slot in FIG. 2, the slot value "tomorrow" gives 1/3.

[D2]: Top slot purity in context
For the understanding results passed through up to the current one, count how many times each value has occupied the slot; the feature is the fraction held by the most frequent value. For example, if the value changes Tokyo → Osaka → Kyoto → Osaka, Tokyo accounts for 1/4, Osaka for 1/2, and Kyoto for 1/4, so the maximum is Osaka's 1/2 and the feature is 1/2. In the example of the date slot in FIG. 2, "today" accounts for 2/3 and "tomorrow" for 1/3, so the feature is 2/3.

[D3]: Slot variety
The number of distinct values the slot has held so far on the understanding path. For Tokyo → Osaka → Kyoto → Osaka, there are three distinct values [Tokyo, Osaka, Kyoto], so the feature is 3. In the example of the date slot in FIG. 2, there are two distinct values, "today" and "tomorrow", so the feature is 2.
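The three features D1 to D3 can be sketched as follows, assuming the slot history is available as a list of the values the slot has held, oldest first, with the current value last. The function names are hypothetical, and the order of the date history is an assumption, since the text gives only the resulting fractions.

```python
from collections import Counter
from fractions import Fraction


def slot_purity(history):
    """D1: share of the slot history taken by the current (last) value."""
    return Fraction(Counter(history)[history[-1]], len(history))


def top_slot_purity(history):
    """D2: share of the slot history taken by the most frequent value."""
    return Fraction(max(Counter(history).values()), len(history))


def slot_variety(history):
    """D3: number of distinct values the slot has held."""
    return len(set(history))


# The Tokyo -> Osaka -> Kyoto -> Osaka example from the text.
place = ["Tokyo", "Osaka", "Kyoto", "Osaka"]
# Assumed order for the date slot example ("today" 2/3, "tomorrow" 1/3).
date = ["today", "today", "tomorrow"]

print(slot_purity(place), top_slot_purity(place), slot_variety(place))
print(slot_purity(date), top_slot_purity(date), slot_variety(date))
```

Exact fractions are used so that the values can be compared directly with the 1/2, 1/3, and 2/3 quoted in the text.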

[D4]: Deny count
The number of times the current slot value has been cancelled in the interaction with the user so far. For example, if the slot value changes Tokyo → none → Kyoto → Tokyo, the current value Tokyo was once replaced by "none" (cancelled), so the feature is 1.

[D5]: Overwrite count
The number of times the current slot value has been overwritten by another value in the interaction with the user so far. For example, if the slot value changes Tokyo → Osaka → Kyoto → Tokyo, the current value Tokyo was once overwritten by "Osaka", so the feature is 1.

[D6]: Continue count
The number of consecutive preceding steps on the understanding path at which the slot held the same value as the current one. For example, if the slot value changes none → Tokyo → Tokyo → Tokyo, "Tokyo" occurs twice immediately before the current value, so the feature is 2.

[D7]: Different value count
The number of consecutive preceding steps on the understanding path at which the slot held a value different from the current one. For example, if the slot value changes Tokyo → Osaka → Kyoto → Tokyo, values other than "Tokyo" occur twice immediately before the current value, so the feature is 2.
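The history-based counts D4 to D7 can be sketched in the same way. This is only an illustration under stated assumptions: an empty slot is represented by None, the current value is the last element of the list, and an empty slot is treated as "different" for D7. The function names are hypothetical.

```python
def deny_count(history):
    """D4: times the current value was followed by an empty slot (None)."""
    current = history[-1]
    return sum(1 for prev, nxt in zip(history, history[1:])
               if prev == current and nxt is None)


def overwrite_count(history):
    """D5: times the current value was replaced by another non-empty value."""
    current = history[-1]
    return sum(1 for prev, nxt in zip(history, history[1:])
               if prev == current and nxt is not None and nxt != current)


def _run_before_current(history, same):
    """Length of the run just before the current value that matches `same`."""
    current = history[-1]
    count = 0
    for value in reversed(history[:-1]):
        if (value == current) != same:
            break
        count += 1
    return count


def continue_count(history):
    """D6: consecutive preceding steps holding the same value as now."""
    return _run_before_current(history, same=True)


def different_value_count(history):
    """D7: consecutive preceding steps holding a value different from now."""
    return _run_before_current(history, same=False)


print(deny_count(["Tokyo", None, "Kyoto", "Tokyo"]))
print(overwrite_count(["Tokyo", "Osaka", "Kyoto", "Tokyo"]))
print(continue_count([None, "Tokyo", "Tokyo", "Tokyo"]))
print(different_value_count(["Tokyo", "Osaka", "Kyoto", "Tokyo"]))
```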

[D8]: Same key pair count
When the system asks for confirmation of a slot value, a user who neither affirms nor denies it but simply conveys exactly the same slot value to the system is behaving redundantly, which is considered unnatural as an exchange. For example, the exchange System: "The weather in Tokyo, correct?" → User: "Tokyo." is expected to occur rarely. Therefore, for the slot value the system currently holds, the feature counts the number of times in the past that the system asked for confirmation of that value and the user conveyed the same value to the system.
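One way D8 might be counted is sketched below. The representation of a turn as a pair (value the system asked to confirm, value understood from the user's reply) is an assumption made for illustration, as are the function name and the sample dialogue; the patent does not specify a data layout.

```python
def same_key_pair_count(turns, current):
    """D8: times the system confirmed `current` and the user's reply
    conveyed that same value again."""
    return sum(1 for confirmed, replied in turns
               if confirmed == current and replied == current)


# Hypothetical dialogue: (system-confirmed value, value in the user's reply).
turns = [
    ("Tokyo", "Tokyo"),   # confirmation echoed by the user
    ("Osaka", "Tokyo"),   # user corrects the system
    ("Tokyo", None),      # user replies without naming a place
    ("Tokyo", "Tokyo"),   # confirmation echoed again
]

print(same_key_pair_count(turns, "Tokyo"))
```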

[D9]: Same key count user
The number of occurrences in the user utterances so far of the same value as the current slot value. For example, if the system's current slot value is "Tokyo", the feature counts how many times "Tokyo" appears in the history of user utterances as understood by the system so far.

[D10]: Different key count user
The number of occurrences in the user utterances so far of values different from the current slot value. For example, if the system's current slot value is "Tokyo", the feature counts how many times values other than "Tokyo" appear in the history of user utterances as understood by the system so far.

[D11]: Same key count system
The number of occurrences in the system utterances so far of the same value as the current slot value. For example, if the system's current slot value is "Tokyo", the feature counts how many times "Tokyo" appears in the history of utterances the system has made so far, for confirmation and the like.

[D12]: Different key count system
The number of occurrences in the system utterances so far of values different from the current slot value. For example, if the system's current slot value is "Tokyo", the feature counts how many times values other than "Tokyo" appear in the history of utterances the system has made so far, for confirmation and the like.
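Features D9 to D12 apply the same two counts to the user utterance history and to the system utterance history. The sketch below assumes each history has already been reduced to the list of values mentioned for one slot; the function names and the sample histories are hypothetical.

```python
def same_key_count(mentions, current):
    """Occurrences in `mentions` of the current slot value (D9, D11)."""
    return sum(1 for value in mentions if value == current)


def different_key_count(mentions, current):
    """Occurrences in `mentions` of any other value (D10, D12)."""
    return sum(1 for value in mentions if value != current)


def utterance_history_features(user_mentions, system_mentions, current):
    """Collect D9 to D12 for one slot into a dictionary."""
    return {
        "same_key_count_user": same_key_count(user_mentions, current),
        "different_key_count_user": different_key_count(user_mentions, current),
        "same_key_count_system": same_key_count(system_mentions, current),
        "different_key_count_system": different_key_count(system_mentions, current),
    }


# Hypothetical histories for the place slot; the current value is "Tokyo".
user_mentions = ["Tokyo", "Osaka", "Tokyo"]
system_mentions = ["Osaka", "Tokyo"]
features = utterance_history_features(user_mentions, system_mentions, "Tokyo")
print(features)
```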

[Evaluation results]
First, slot value data of the following kinds (1) and (2) were excluded from the dialogue records.

(1) The features of the present invention assume that a slot has taken two or more distinct values. Slot value data were therefore excluded when the slot acquired a value for the first time from an empty state, and when repeated misrecognition caused the speech recognizer to output only one particular value. When a slot acquires a value for the first time, there are not enough discourse elements concerning that value, so its reliability can reasonably be estimated from the acoustic and linguistic features. When the speech recognizer outputs only one particular value because of repeated misrecognition, it might be possible to derive the reliability from how little the value varies, but this case is hard to distinguish from one in which the recognizer keeps recognizing the correct word, so estimation from the acoustic and linguistic features is considered more appropriate.

(2) Confirmed slots. During a dialogue, the system may ask the user to confirm whether the slot value it holds is correct. If the user approves the confirmation, the slot value can be regarded as correct. Accordingly, when the system regards the value of a slot as confirmed, its reliability is largely settled, and such slots were excluded.

The slot reliability measure was created from 777 slot values, that is, all slot values in the data except those matching conditions (1) and (2) above. One of the six sets was used as test data and the remaining five as training data; the reliability measure was trained and tested in this way for every set, and its performance was examined.
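The evaluation scheme above, in which each of the six sets is held out once as test data, can be sketched as follows. The set names come from the text; the function name is hypothetical, and training and testing of the reliability measure itself are left out.

```python
SET_NAMES = ["SET001", "SET002", "SET003", "SET004", "SET005", "SET006"]


def leave_one_set_out(set_names):
    """Yield (training sets, test set) pairs, holding out each set once."""
    for held_out in set_names:
        training = [name for name in set_names if name != held_out]
        yield training, held_out


folds = list(leave_one_set_out(SET_NAMES))
for training, test in folds:
    # A reliability measure would be trained on `training` and scored on `test`.
    print(test, training)
```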

The creation of the reliability measure according to the present invention is described below in comparison with the conventional method. FIG. 10 shows an example of the data used to create the reliability measure in the conventional method. The first line, from "purity" to "num_of_words", is a list of acoustic and linguistic feature names, and the second and subsequent lines are data. The first item on the second line, "slot-level-label", indicates that the line is slot data, and the second item ("1" or "0") is the label: correct ("1") or incorrect ("0").

FIG. 11 shows an example of the data used to create the reliability measure in the present invention. The data shown in FIG. 11 are stored in the data memory 40. The first line, from "slot_purity_in_context" to "num_of_words", is a list of discourse feature names and acoustic and linguistic feature names, and the second and subsequent lines are data. The first item on the second line, "slot-level-label", indicates that the line is slot data, and the second item ("1" or "0") is the label: correct ("1") or incorrect ("0").

In FIG. 10 and FIG. 11, the acoustic and linguistic features in the first-line list are those shown in FIG. 5 and FIG. 6; in FIG. 11, the discourse features in the first-line list are those shown in FIG. 9.

FIG. 12 compares the accuracy of a reliability measure created by the conventional method from acoustic and linguistic features alone with that of a reliability measure created by the present invention from both acoustic and linguistic features and discourse features. In FIG. 12, False Acceptance Rate is the rate at which incorrect results are classified as correct, and False Rejection Rate is the rate at which correct results are classified as incorrect. Both are indices used to evaluate reliability measures and are given by the following formulas.

$$\text{False Acceptance Rate} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}$$

$$\text{False Rejection Rate} = \frac{\text{False Negatives}}{\text{True Positives} + \text{False Negatives}}$$

In these formulas, a False Positive is a sample wrongly classified as correct, and a True Negative is a sample rightly classified as incorrect. A True Positive is a sample rightly classified as correct, and a False Negative is a sample wrongly classified as incorrect.
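The two rates can be computed from labeled data as in the following sketch, which uses the 1 = correct, 0 = incorrect labeling described for FIG. 10 and FIG. 11. The function names and the sample labels and predictions are hypothetical and are not results from the experiment.

```python
def confusion_counts(labels, predictions):
    """Return (TP, FP, TN, FN) for 1 = correct and 0 = incorrect."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    tn = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 0)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    return tp, fp, tn, fn


def false_acceptance_rate(fp, tn):
    """Share of incorrect samples that were classified as correct."""
    return fp / (fp + tn)


def false_rejection_rate(fn, tp):
    """Share of correct samples that were classified as incorrect."""
    return fn / (tp + fn)


# Hypothetical ground-truth labels and classifier decisions.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 1, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(labels, predictions)
far = false_acceptance_rate(fp, tn)
frr = false_rejection_rate(fn, tp)
```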

As FIG. 12 shows, both the False Acceptance Rate and the False Rejection Rate of the present invention are lower than those of the conventional method.

FIG. 13 shows the correct/incorrect classification performance of the present invention compared with the conventional method. Over all samples, there were 72 samples for which only the present invention was correct and 15 for which only the conventional method was correct. Testing whether this difference is statistically significant with the McNemar test described in Reference 1 below gave p = 4.26e-10 (< 0.01); the difference is statistically significant, which shows that the present invention is effective.
[Reference 1]
"Some Statistical Issues in the Comparison of Speech Recognition Algorithms," L. Gillick and S. Cox, ICASSP 89, v. 1, pp. 532-535.
FIG. 14 shows the False Acceptance Rate - False Rejection Rate curves of the reliability measures for slots. Such curves are commonly drawn for classifiers that judge samples as correct or incorrect. The graph plots False Acceptance Rate on the vertical axis and False Rejection Rate on the horizontal axis.

Because the curve of the present invention always lies below that of the conventional method, the reliability measure created by the present invention can be said to outperform the one created by the conventional method.
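The significance test reported for FIG. 13 can be checked from the two discordant counts alone (72 and 15). The sketch below assumes the exact, two-sided binomial form of the McNemar test; the text does not state which variant was used, and the function name is hypothetical. Under this assumption the p-value should come out on the order of 1e-10, in line with the reported figure.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant sample favors either
    method with probability 1/2, so the smaller count follows a
    Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 72 samples only the invention got right, 15 only the conventional method.
p = mcnemar_exact_p(72, 15)
print(f"p = {p:.3g}")
```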

FIG. 1 is a diagram showing the device configuration that implements the reliability measure calculation method of the present invention.
FIG. 2 is a diagram showing the general configuration of a spoken dialogue system.
FIG. 3 is a diagram showing a dialogue example and the spoken dialogue system's understanding result of the information request content for each user utterance.
FIG. 4 is a diagram showing an example of assigning reliability to the user information request content using the reliability of the words contained in the speech recognition result of a user utterance.
FIG. 5 is a diagram showing examples of acoustic and linguistic features of words.
FIG. 6 is a diagram showing examples of acoustic and linguistic features of utterances.
FIG. 7 is a diagram showing the flow of reliability measure construction.
FIG. 8 is a conceptual diagram of linear discriminant analysis.
FIG. 9 is a diagram showing the discourse features for slot values.
FIG. 10 is a diagram showing an example of the data used to create the reliability measure in the conventional method.
FIG. 11 is a diagram showing an example of the data used to create the reliability measure in the proposed method.
FIG. 12 is a diagram showing the accuracy comparison of the reliability measures.
FIG. 13 is a diagram showing the correct/incorrect classification performance of the present invention compared with the conventional method.
FIG. 14 is a diagram showing the False Acceptance Rate - False Rejection Rate curves.

Explanation of symbols

1 Utterance understanding unit
2 Utterance generation unit
10 Dialogue record
20 Discourse memory
30 Discourse feature extraction unit
40 Data memory
50 Reliability measure creation unit
100 Speech recognition unit
101 Language understanding unit
102 Discourse understanding unit
103 Context understanding rule/knowledge DB
104 Syntactic and semantic understanding rule/knowledge DB
105 Dialogue state DB
106 Dialogue control and response generation rule/knowledge DB
107 Database
108 Generation rule template DB
200 Content generation unit
201 Surface generation unit
202 Speech generation unit

Claims

A method for calculating the reliability of the understanding result of user information request content in a computer-based spoken dialogue system which, each time a user inputs an information request by voice, updates the understanding result of the user information request content that it holds and responds to the user by voice according to the updated understanding result, the method comprising:
a process of using an existing spoken dialogue system to create a dialogue record that includes system utterance information, user utterance information, and, for each user utterance, the spoken dialogue system's understanding result of the user information request content and information on its correctness, and storing the dialogue record in a storage device;
a process of retrieving, from the dialogue record stored in the storage device, discourse information regarding the interaction between the user and the spoken dialogue system, and extracting discourse features concerning characteristics of the understanding result of the user information request content in the discourse information;
a process of creating, by discriminative learning, reliability evaluation means for evaluating the reliability of the understanding result of the user information request content, using at least the understanding result of the user information request content read from the dialogue record stored in the storage device, the information on its correctness, the acoustic and linguistic features of the user's speech, and the discourse features extracted from the discourse information; and
a process of calculating, using the reliability evaluation means, the reliability of the understanding result of the user information request content on the basis of information on system utterances in the spoken dialogue system, information on user utterances including the speech recognition results of the user's speech and their acoustic and linguistic features, information on the spoken dialogue system's understanding result of the user information request content for each user utterance, and the discourse features extracted from the discourse information regarding the interaction between the user and the spoken dialogue system.

The method for calculating the reliability of a dialogue understanding result according to claim 1, wherein, in the process of extracting the discourse features, at least one of the following is extracted as a discourse feature:
the proportion of occurrences of the current value of the understanding result of the user information request content;
the proportion of occurrences of the most frequent value in the understanding result of the user information request content;
the number of distinct values that have appeared in the understanding result of the user information request content;
the number of times the current value of the understanding result of the user information request content has been cancelled;
the number of times the current value of the understanding result of the user information request content has been overwritten by another value;
a value indicating how far back the understanding result of the user information request content has held the same value as the current value;
a value indicating how far back the understanding result of the user information request content has held a value different from the current value;
the number of times the spoken dialogue system has asked for confirmation of the value of the understanding result of the user information request content and the user has conveyed the same value to the spoken dialogue system;
a value indicating how many times the same value as the current value of the understanding result of the user information request content has appeared in the user utterances so far;
a value indicating how many times a value different from the current value of the understanding result of the user information request content has appeared in the user utterances so far;
a value indicating how many times the same value as the current value of the understanding result of the user information request content has appeared in the system utterances so far; and
a value indicating how many times a value different from the current value of the understanding result of the user information request content has appeared in the system utterances so far.

The method for calculating the reliability of a dialogue understanding result according to claim 1 or 2, wherein, in the process of creating the reliability evaluation means, the linear discriminant method, a support vector machine, or decision tree learning is used as the discriminative learning.