JP2006030282A

JP2006030282A - Interaction understanding device

Info

Publication number: JP2006030282A
Application number: JP2004204788A
Authority: JP
Inventors: Takeshi Ono; 健大野; Minoru Togashi; 実冨樫; Keiko Katsuragawa; 景子桂川; Yukihiro Ito; 幸宏伊東; Tatsuhiro Konishi; 達裕小西; Michihiko Kai; 充彦甲斐; Toshihiko Ito; 敏彦伊藤
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2004-07-12
Filing date: 2004-07-12
Publication date: 2006-02-02
Anticipated expiration: 2024-07-12
Also published as: JP4610249B2

Abstract

PROBLEM TO BE SOLVED: To provide an interaction understanding device that has practical recognition capability and language understanding capability applicable to a spoken dialogue system, so that a user can obtain required information in a short time, by greatly reducing the influence of background noise on utterances in particular.

SOLUTION: Context information of the conversation is taken into account to select more likely words, and the integration operation for language recognition is carried out using numeric parameters corresponding to the type of noise. This greatly reduces the influence of background noise and allows language recognition with high precision.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to the control of equipment through a spoken dialogue system, and in particular to a technique for improving recognition accuracy in a dialogue understanding device that is required to perform control without occupying the operator's hands or eyes.

In conventional spoken dialogue systems, the sound source is naturally produced speech, and in vehicles and similar settings the speech is affected by noise while driving. As a result, the system may fail to understand the user's utterance correctly and may respond in a way the user did not intend. The dialogue between the system and the user then stops proceeding smoothly, which can annoy the user. As a countermeasure, various studies have sought to improve speech recognition accuracy, and their results have been reported (see, for example, Non-Patent Document 1 and Non-Patent Document 2).
Non-Patent Document 1: Kai, Ishimaru, Ito, Konishi, Ito, "Characteristic analysis of correction utterances in a destination setting task and its application to detection", Proceedings of the Acoustical Society of Japan National Meeting, 2-1-8, 2001, pp. 63-64.
Non-Patent Document 2: Komatani, Kawahara, "How to use the reliability of speech recognition results in a spoken dialogue system", Proceedings of the Acoustical Society of Japan National Meeting, 3-5-2, 2000, pp. 73-74.

Non-Patent Document 1 is a study of misrecognition in speech recognition, and Non-Patent Document 2 is a study of dialogue control that uses the reliability of speech recognition results. The methods adopted in both studies are based on acoustically recognizing the input speech signal sequentially, word by word, and do not perform language recognition that includes context information, as humans do. There was therefore a limit to how far the influence of the speaker's utterance conditions and of background noise on both the sending and receiving sides could be reduced.

An object of the present invention is to provide a dialogue understanding device that goes beyond the performance limits described above, has recognition capability sufficient for practical use, and has language understanding capability applicable to a spoken dialogue system that lets the user obtain required information in a short time.

One way to achieve this object is to understand speech information interactively from the flow of context in the speaker's utterances. Compared with conventional methods that focus only on improving the clarity or intelligibility of the uttered speech, this approach can be expected to yield more accurate recognition results. The present invention is therefore based on performing language understanding and response generation by combining the reliability of speech recognition with speech information processing that uses context information. That is, rather than relying only on conventional speech recognition reliability, the invention also uses results learned from the utterance type and the dialogue history (recognition history), so that the language understanding carried out is more plausible within the dialogue. In addition, during this language understanding, calculations use parameters corresponding to the background noise of the utterance, so that the influence of background noise and the like is reduced more appropriately.

The dialogue understanding device according to the present invention adopts the approach described above to improve recognition accuracy. That is, the dialogue understanding device according to claim 1 of the present invention classifies utterances included in a dialogue hierarchically, in order of the breadth each utterance covers, into a plurality of categories and into classes formed by subdividing the categories; calculates a class score, which gives the likelihood that a word of a given class was uttered, and a word score, which gives the likelihood of each word included in the utterance; and, when a plurality of utterances occur in the dialogue, understands the dialogue content by performing an integration operation on the plurality of class scores and word scores calculated for each utterance. The device is characterized in that the parameters used in the integration operation are statistically estimated from utterance data mixed with a plurality of types of noise, and in that the dialogue content is understood using the parameters corresponding to the detected type of noise.

The dialogue understanding device according to claim 2 of the present invention is the dialogue understanding device according to claim 1, comprising: voice input means composed of a microphone and a voice amplifier; speech recognition means for digitizing the output of the voice input means and performing speech recognition; reliability generation means for calculating the reliability of the results recognized by the speech recognition means; language understanding means composed of a class score generation unit that, using the results obtained by the speech recognition means and the reliability generation means, classifies utterances into the hierarchical structure made up of the preset plurality of categories and the classes subdividing them and obtains the likelihood of the utterances classified into the classes, a category understanding unit that obtains each category from the results thus obtained, a word score generation unit that obtains the likelihood of the recognized words, and an understanding content generation unit that generates the understanding content from the results of these processing units; storage means for storing the past recognition history used in the processing of the language understanding means; response generation means for creating response information from the results obtained from the language understanding means; and output means for outputting the response information. When a new utterance occurs, the language understanding means generates a new class score and word score using values obtained by a parameter operation on the class scores and word scores of the past recognition history together with the new reliability from the latest recognition result, updates the recognition history with the new class score and word score, and understands the dialogue content using the category understanding derived from the new class score and the new word score. The parameters are statistically estimated from utterance data mixed with a plurality of types of noise, and the new class score and word score are generated using the parameters corresponding to the detected type of noise.

The dialogue understanding device according to claim 3 of the present invention is the dialogue understanding device according to claim 1 or 2, characterized in that the parameters are statistically estimated from utterance data classified by noise level, and in that the parameters are changed on the basis of the noise level during the utterance as detected by noise detection means.

The dialogue understanding device according to claim 4 of the present invention is the dialogue understanding device according to claim 1 or 2, mounted on a vehicle, characterized in that the parameters are statistically estimated from utterance data classified by the traveling speed of the vehicle, and in that the parameters are changed on the basis of speed information detected by traveling speed detection means.

The dialogue understanding device according to claim 5 of the present invention is the dialogue understanding device according to claim 1 or 2, mounted on a vehicle, characterized in that the parameters are statistically estimated from utterance data classified by the route on which the vehicle travels, and in that the parameters are changed on the basis of route information detected by traveling route detection means.

According to the present invention, the device does not merely perform speech recognition of words. It further classifies the recognized words into categories and classes and selects more likely words by taking their relationship to the context into account, so recognition accuracy can be improved efficiently. In addition, because the dialogue content is understood through calculations that use parameters corresponding to the background noise of the utterance, good recognition accuracy can be obtained even in noisy environments, for example with voice input to a vehicle navigation system.

Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings.

(First embodiment)
FIG. 1 shows the basic configuration of the dialogue understanding device of a first embodiment to which the present invention is applied. In the dialogue understanding device shown in FIG. 1, an analog voice input signal from the user is converted into a digital signal by the voice input unit 101. The voice input unit 101 is composed of a microphone, an input amplifier, and an A/D converter. The digitized speech signal is input to the speech recognition unit 102. The speech recognition unit 102 matches the speech signal input from the user against recognition target sentences stored in the speech recognition unit 102, and outputs a plurality of recognition result candidate sentences together with their likelihoods (described in detail later). Using this output, the reliability generation unit 103 takes the plurality of recognition result candidate sentences produced for a single utterance from the user and outputs reliabilities indicating the plausibility of the words contained in the candidate sentences and of the classes into which those words are classified.

Here, a class is defined as shown in FIG. 2 according to the expression format of the utterance. For example, when the utterance is in an expression format that indicates a destination, it is classified hierarchically as shown in FIG. 2. The categories are arranged in order from the one covering the widest range to the one covering the narrowest, and the words contained in each category are grouped by content into classes. In the example of FIG. 2, each word is classified into one of three categories, upper (PR), middle (HR), and lower (LM), and within each category into one of several classes. In FIG. 2, the upper category has only the single class "prefecture", while the lower category has the three classes "interchange", "municipality", and "station".
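As a rough illustration, the category and class hierarchy described for FIG. 2 could be held in a small lookup table. This is only a sketch: the identifier names are hypothetical, and the middle-category classes (expressway, railway line) follow the destination format described later in this embodiment rather than this paragraph.

```python
# Category/class hierarchy for destination expressions, as described for FIG. 2.
# All identifier names here are illustrative assumptions, not from the patent.
CATEGORY_ORDER = ["PR", "HR", "LM"]  # upper, middle, lower (widest to narrowest)

CLASSES_BY_CATEGORY = {
    "PR": ["prefecture"],
    "HR": ["expressway", "railway_line"],
    "LM": ["interchange", "municipality", "station"],
}


def category_of(class_name: str) -> str:
    """Return the category code that contains the given class."""
    for category, classes in CLASSES_BY_CATEGORY.items():
        if class_name in classes:
            return category
    raise KeyError(f"unknown class: {class_name}")
```

With this table, `category_of("station")` resolves to the lower category `"LM"`, which is the kind of lookup the category understanding step would need.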

The reliability of each word is obtained as follows. First, from the word recognition results, the candidate word strings (for example, sentences made up of several words) ranked first through Nth in descending order of likelihood (hereinafter called the N-best candidates) are obtained together with the log likelihood for each word. Here, the likelihood is a value defined by the posterior probability that the speech signal sequence uttered by the user is W when the speech signal sequence obtained from the recognition result is Y. It is the maximum, over hypotheses W, of the ratio of the product of "the prior probability that the speech signal sequence Y is observed for hypothesis W" and "the probability that the speech signal sequence W is uttered" to the probability that the speech signal sequence Y is observed.
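Written as a formula, the definition above amounts to Bayes' rule maximized over hypotheses. The P(·) notation is an assumption introduced here for illustration; the text itself gives the definition only in words.

```latex
% Assumed notation: P(Y | W) is the probability of observing Y given hypothesis W,
% P(W) the probability that W is uttered, P(Y) the probability of observing Y.
\[
  \text{likelihood}
  \;=\; \max_{W} P(W \mid Y)
  \;=\; \max_{W} \frac{P(Y \mid W)\, P(W)}{P(Y)}
\]
```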

From these, the reliability Conf(w) of a word w included in the first-ranked candidate is obtained from equation (1).

In equation (1), the probability p_i that the word w is included in the i-th of the N-best candidates is obtained from equation (2), where L_i is the log likelihood of each N-best candidate.

The reliability in class units is obtained in the same way as in word units: for the class C_w of each word w included in the first-ranked candidate, the reliability Conf(C_w) is obtained from equation (3).

Here, as in the word-unit case, p_i is obtained from equation (4).
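Equations (1) through (4) are not reproduced in this text. One form that is consistent with the definitions above, in the style of N-best posterior reliability measures, is sketched below. The scaling coefficient α and the indicator functions δ are assumptions introduced for illustration and are not taken from this document.

```latex
% Assumed reconstruction: alpha and the delta indicators are illustrative only.
\begin{align*}
  \mathrm{Conf}(w)   &= \sum_{i=1}^{N} \delta_{w,i}\, p_i,
  & \delta_{w,i} &=
    \begin{cases}
      1 & \text{if candidate } i \text{ contains the word } w \\
      0 & \text{otherwise}
    \end{cases} \\
  \mathrm{Conf}(C_w) &= \sum_{i=1}^{N} \delta_{C_w,i}\, p_i,
  & \delta_{C_w,i} &=
    \begin{cases}
      1 & \text{if candidate } i \text{ contains a word of class } C_w \\
      0 & \text{otherwise}
    \end{cases} \\
  p_i &= \frac{e^{\alpha L_i}}{\sum_{j=1}^{N} e^{\alpha L_j}}
\end{align*}
```

Under this form, a word that appears in every N-best candidate receives a reliability of 1, and a word that appears only in low-likelihood candidates receives a value close to 0.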

The recognition data obtained in this way (recognition result candidate sentences, likelihoods, and reliabilities) is input to the language understanding unit 104.
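The following minimal sketch shows how word and class reliabilities might be computed from an N-best list before being passed to the language understanding unit. It assumes a softmax over scaled log likelihoods as the candidate probability; the function name, the data layout, and the coefficient `alpha` are hypothetical and not taken from this document.

```python
import math
from collections import defaultdict


def nbest_confidences(nbest, alpha=1.0):
    """Compute word and class reliabilities for the first-ranked candidate.

    nbest: list of (log_likelihood, [(word, class_name), ...]), best first.
    alpha: assumed scaling coefficient applied to the log likelihoods.
    Returns (word_conf, class_conf) as plain dicts.
    """
    if not nbest:
        return {}, {}
    # Softmax over scaled log likelihoods, shifted by the maximum for stability.
    scaled = [alpha * log_likelihood for log_likelihood, _ in nbest]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    top_words = {word for word, _ in nbest[0][1]}
    top_classes = {cls for _, cls in nbest[0][1]}
    word_conf = defaultdict(float)
    class_conf = defaultdict(float)
    for p, (_, items) in zip(probs, nbest):
        # Each word or class counts at most once per candidate.
        for word in {w for w, _ in items} & top_words:
            word_conf[word] += p
        for cls in {c for _, c in items} & top_classes:
            class_conf[cls] += p
    return dict(word_conf), dict(class_conf)
```

For two equally likely candidates that share the word "Shizuoka" but differ in the second word, the shared word gets reliability 1.0 and the second word of the first candidate gets 0.5.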

The dialogue understanding device of this embodiment also includes a noise detection unit 113, which detects, from the analog voice input signal supplied to the voice input unit 101, the type and degree (noise level) of the noise mixed into sections other than the user's utterance. The output of the noise detection unit 113 is also input to the language understanding unit 104.

The language understanding unit 104 is composed of a class score generation unit 105, a category understanding unit 106, a word score generation unit 107, and an understanding content generation unit 108. It generates an understanding result from the words input over a plurality of utterances by the user and from the reliabilities of the classes to which those words belong.

The class score generation unit 105 calculates, from the class reliabilities of the words input over a plurality of utterances by the user, a score indicating which class was uttered. The category understanding unit 106 outputs, from the class scores input over a plurality of utterances, the understanding result for the categories into which the classes are grouped, that is, which category was uttered. The word score generation unit 107 calculates, from the reliabilities of the words input over a plurality of utterances, a score indicating which word was uttered. The understanding content generation unit 108 generates the understanding content from the category understanding result (the output of the category understanding unit 106) and the word scores (the output of the word score generation unit 107).

In the process by which the language understanding unit 104 generates the understanding content, parameter operations using various numerical parameters are performed, as described in detail later. In the dialogue understanding device of this embodiment, the numerical parameters used in these operations are changed according to the type and degree of the noise detected by the noise detection unit 113. That is, the device includes a parameter set holding unit 114 that holds a plurality of numerical parameters classified by noise type and degree, and the numerical parameters corresponding to the detected noise type and degree are read from the parameter set holding unit 114 and used in the operations. The numerical parameters held in the parameter set holding unit 114 are estimated in advance as parameters suited to each noise type and degree, by collecting a large amount of utterance data mixed with a plurality of types of noise and processing it statistically.
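A minimal sketch of the parameter set holding unit might look like the following. The parameter names, the two-band split of the noise level, and the 70 dB threshold are all hypothetical; the document states only that numerical parameters are held per noise type and degree and selected according to the detected noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSet:
    # Hypothetical parameter names; the document says only "numerical parameters".
    history_weight: float
    new_evidence_weight: float


class ParameterSetStore:
    """Holds parameter sets keyed by (noise_type, noise_level_band)."""

    def __init__(self, table, default):
        self._table = dict(table)
        self._default = default

    def lookup(self, noise_type: str, noise_level_db: float) -> ParameterSet:
        # Assumed two-band split; a real system could use finer bands.
        band = "high" if noise_level_db >= 70.0 else "low"
        # Fall back to the default set for unseen noise conditions.
        return self._table.get((noise_type, band), self._default)
```

A caller would construct the store once from offline-estimated values and call `lookup` with the noise detection output before each score update. Falling back to a default set is a design choice made here so that an unseen noise condition does not stop the dialogue.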

The output of the language understanding unit 104 obtained in this way is input to the response generation unit 109, which generates a response sentence from the understanding content. The response sentence is synthesized as a digital signal by the speech synthesis unit 110 and is output as speech through a D/A converter and an output amplifier (not shown) built into the speech synthesis unit 110. The response sentence generated by the response generation unit 109 is also sent through the GUI display unit 111 to a display device (not shown), where it is displayed as text. The recognition history 112 is a storage device, such as a hard disk drive, that stores past recognition states as history data.

Next, the operation of the dialogue understanding device of this embodiment configured as above is described concretely, taking as an example the case where the destination of a vehicle is set through dialogue with the user.

First, the format of the destination expressions handled by the dialogue understanding device of this embodiment is described. Here, an interchange, a station, or a municipality name can be set as the destination, and a prefecture, an expressway, or a railway line can be attached to each. As noted above, FIG. 2 shows these expression formats as a hierarchy. In this embodiment, a destination can be uttered as a combination of partial utterances at three levels (upper, middle, and lower), and each of these three levels is called a category. In the upper category PR, a prefecture can be uttered; in the middle category HR, an expressway or a railway line; and in the lower category LM, an interchange, a municipality, or a station.

The dialogue understanding device of this embodiment allows destinations to be set interactively through more flexible utterances. For example, the user can utter all categories at once, as in "Hamamatsu-Nishi Interchange on the Tomei Expressway in Shizuoka Prefecture". The user can also split the input across several utterances, for example saying "Shizuoka Prefecture" in the first utterance and "Hamamatsu-Nishi Interchange on the Tomei Expressway" in the second.

When the user makes several utterances, the device also allows refinement utterances, which add more detailed information to earlier utterances. For example, the user can say "In Shizuoka Prefecture" in the first utterance and "Hamamatsu City" in the second. The device likewise allows utterances that correct a response from the dialogue understanding device. For example, when the first utterance "Hamamatsu City in Shizuoka Prefecture" receives the incorrect first response "Is it Hamamatsu-Nishi Interchange in Shizuoka Prefecture?", the user can say "No, it is Hamamatsu City." in the second utterance.

When the user makes several utterances and the response from the dialogue understanding device is a question, the device also allows an utterance that answers it. For example, when the first response is "Which interchange in Shizuoka Prefecture?", the user can say "Hamamatsu-Nishi Interchange." in the second utterance.
When the response from the dialogue understanding device is a prompt to re-enter the input, the device allows an utterance that responds to it. For example, when the first response is "Please speak again.", the user can repeat the first utterance as the second utterance.
The recognition target words in this embodiment are, for example, those illustrated in FIG. 3. An example dialogue in this embodiment is shown in FIG. 4, where U denotes an utterance by the user, S denotes a response from the dialogue understanding device, and the numbers give the order of the utterances.

The operation of the dialogue understanding device of this embodiment is now described with a concrete example, using the flowchart of FIG. 5.

First, when processing starts in step S301, it is determined whether the user has turned on a voice input switch (speech switch, not shown) to indicate the start of an utterance. If step S302 detects that the voice input switch has been turned on, processing proceeds to the step that starts capturing the speech signal and noise (step S303). If the switch operation is not detected, processing waits at step S302 until it is detected.

In step S303, the user makes an utterance contained in the recognition target sentences (for example, the words illustrated in FIG. 3). In response, the voice input unit 101 converts the microphone signal into a digital signal with the A/D converter and outputs it to the speech recognition unit 102. At the same time, the noise detection unit 113 detects the type and degree of the noise mixed into sections other than the user's utterance and outputs them to the language understanding unit 104. Until the speech switch is operated, the speech recognition unit 102 continuously calculates the average power of the digital signal. After the speech switch is operated, it judges that the user has spoken when the instantaneous power of the digital signal exceeds the average power by a predetermined value or more, and starts capturing the speech signal.

The speech recognition unit 102 then compares the stored recognition target sentences with the digitized speech signal and calculates likelihoods (step S304), thereby setting a plurality of candidates. While step S304 is being executed, the capture of the speech signal in step S303 continues in parallel.

After that, once the instantaneous power of the digitized speech signal has remained at or below a predetermined value for a predetermined time or longer, the dialogue understanding device judges that the user's utterance has ended and terminates the speech signal input processing (step S305). The speech recognition unit 102 then outputs the top N candidates, obtained by arranging the recognition result candidate sentences in order of likelihood, together with the likelihood data. FIG. 6 shows an example of the output of the speech recognition unit 102. In FIG. 6, the parts marked XXX indicate the likelihood calculated for each word.
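The start and end conditions of steps S303 to S305 can be sketched as a simple power-based endpoint detector. The function below is an illustration under assumed inputs (a list of per-frame power values and explicit thresholds); the names and the frame-based formulation are hypothetical, since the document specifies only the conditions and not an implementation.

```python
def detect_utterance(frame_powers, background_power, start_margin,
                     end_level, end_frames):
    """Return (start, end) frame indices of the utterance, or None.

    frame_powers: instantaneous power per frame after the speech switch is on.
    background_power: average power measured before the switch was operated.
    start_margin: a frame counts as speech once it exceeds the background
        power by at least this amount.
    end_level: power at or below which a frame counts as quiet.
    end_frames: number of consecutive quiet frames that ends the utterance.
    The returned end index is exclusive (the first frame of the closing quiet run).
    """
    start = None
    quiet_run = 0
    for i, power in enumerate(frame_powers):
        if start is None:
            # Start condition: instantaneous power rises above the average.
            if power - background_power >= start_margin:
                start = i
            continue
        if power <= end_level:
            quiet_run += 1
            # End condition: power stays low for the required duration.
            if quiet_run >= end_frames:
                return start, i - end_frames + 1
        else:
            quiet_run = 0
    # Speech started but the input ran out before the end condition was met.
    return (start, len(frame_powers)) if start is not None else None
```

Because the end condition requires a sustained quiet run, short pauses inside an utterance such as "Shizuoka Prefecture ... Hamamatsu City" do not cut the capture short.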

Next, on the basis of the recognition result consisting of a plurality of candidates ranked by acoustic likelihood (the N-best candidates), reliabilities are computed for two kinds of items, words and classes, as a measure based on posterior probability derived from the acoustic likelihood and the frequency of appearance among the N-best candidates (step S306). This computation is executed in the reliability generation unit 103, and an example of the result is shown in FIG. 7. In FIG. 7, the table on the left is the speech recognition unit output shown in FIG. 6. In the table on the right, the word reliability indicates the possibility that a given word was uttered, and the class reliability indicates the possibility that a word of a given class was uttered. This computation is described in detail in Non-Patent Document 2 (Komatani, Kawahara, "How to use the reliability of speech recognition results in a spoken dialogue system", Proceedings of the Acoustical Society of Japan National Meeting, 3-5-2, 2000, pp. 73-74).

The reliability of the uttered words is obtained in this way, and the likely words are estimated. In the dialogue understanding device of this embodiment, the accuracy of word estimation is further improved through dialogue with the user. To this end, the class score generation unit 105 computes class scores in the next step S307. Before this computation, the user's utterance type is determined.

There are two utterance types. The first serves to add new information to previous information; refinement and answer utterances belong to this type. The second serves to correct previous information; correction and re-input utterances belong to this type. Which of the two applies is determined, for example as shown in FIG. 8, from the conditions listed in the determination criteria column. Other determination methods also exist. For example, partial rephrasing, which is common in place-name input, can be detected by word spotting based on DP matching. This is described in detail in Kakutani, Kitaoka, and Nakagawa, "Analysis and detection of correction utterances at the time of misrecognition in car navigation place-name input," Information Processing Society of Japan Research Report, Spoken Language Information Processing 37-11, 2001.

After the utterance type has been determined, the class score generation unit 105 generates class scores in step S307. A class score is a value indicating the likelihood of a class over the course of the dialogue, that is, over the user's multiple utterances. Retaining previously understood information while adding new information allows the score to be generated more appropriately. Class scores are generated with a different formula for each utterance type, so step S307 in FIG. 5 is split into two branches as shown in FIG. 9. In step S315 it is determined, from the conditions in the determination criteria column of FIG. 8, whether the utterance is of the refinement/answer type. If so, it is processed in step S316; otherwise it is of the correction/re-input type and is processed in step S317. In either case, processing then proceeds to step S308.

For the refinement/answer utterance type, that is, in step S316 of FIG. 9, the class score is obtained by Equation (5) below.

Score(c) = Score(c) × weight_{s,n} + Conf(c)   (Equation 5)

Here Score is the class score: the left-hand side of Equation (5) is the newly obtained class score, and Score(c) on the right-hand side is the past class score (read from the recognition history 112). Conf is the class reliability obtained from the latest recognition result, weight_{s,n} is a weighting parameter taking a value from 0.0 to 1.0, and c is the class whose score is generated. The past class score is weighted by weight_{s,n}, which lowers it at a fixed rate, in keeping with the policy that "information becomes less reliable as it gets older." For the refinement/answer utterance type, the past class score read from the recognition history 112 is replaced by the class score newly obtained from Equation (5) and written back to the recognition history 112.
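Equation (5) can be sketched as a small update over dictionaries of class scores. The function and argument names are illustrative, the weight 0.9 in the usage below is a hypothetical value rather than one given in the specification, and treating a class that is absent from either input as 0.0 is an assumption.

```python
def update_class_scores_refinement(old_scores, new_rel, weight_sn):
    """Equation (5): Score(c) = Score(c) * weight_{s,n} + Conf(c).

    old_scores -- past class scores read from the recognition history
    new_rel    -- class reliabilities (Conf) from the latest recognition result
    weight_sn  -- decay weight in the range 0.0 to 1.0
    """
    classes = set(old_scores) | set(new_rel)
    return {
        c: old_scores.get(c, 0.0) * weight_sn + new_rel.get(c, 0.0)
        for c in classes
    }
```

For instance, with a hypothetical weight of 0.9, a class that scored 1.0 earlier and is not mentioned in the latest utterance decays to 0.9, while a class mentioned for the first time simply takes its new reliability.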

The parameter weight_{s,n} used to weight the past class score can be determined experimentally from utterance data mixed with noise. A feature of this embodiment is that the language understanding unit 104 reads from the parameter set holding unit 114, out of multiple numerical parameters classified by noise type and degree, the numerical parameter corresponding to the type and degree of noise actually detected, and uses it. Here n denotes the type or degree of the mixed-in noise. For example, if weight_{s,n} is determined experimentally from utterance data classified by noise level [n = level 0, level 1, level 2, level 3, ...], one weight_{s,n} is obtained for each noise level. When a new class score is computed with Equation (5), the dialogue understanding device of this embodiment selects and uses the weight_{s,n} that matches the noise level actually detected by the noise detection unit 113.
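The noise-dependent parameter selection can be pictured as a table lookup keyed by noise level. The sketch below is an assumption about one possible layout of the parameter set holding unit 114: the table contents are invented placeholder values, and falling back to the nearest defined level is a design choice not stated in the specification.

```python
# Hypothetical parameter sets, one per noise level n (values are placeholders).
PARAMETER_SETS = {
    0: {"weight_s": 0.95, "weight_t": 0.95},
    1: {"weight_s": 0.90, "weight_t": 0.85},
    2: {"weight_s": 0.80, "weight_t": 0.70},
}

def select_parameters(noise_level, parameter_sets=PARAMETER_SETS):
    """Return the parameter set for the detected noise level.

    If the detected level has no entry, the nearest defined level is used
    (an assumption made for this sketch)."""
    nearest = min(parameter_sets, key=lambda level: abs(level - noise_level))
    return parameter_sets[nearest]
```

The same lookup would supply the remaining noise-dependent parameters if they were stored in the same table.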

FIG. 10 shows how class scores are generated for the refinement/answer utterance type. In this example the user has previously said, for example, "JR Hamamatsu Station in Shizuoka Prefecture," so that the old class scores of "prefecture" and "railway line" are 1.00 and the old class score of "station" is 0.84. In reply to the device's question "Which JR station in Shizuoka Prefecture?", the user then answers, for example, "Hamamatsu Station," and a new class reliability of 0.81 is obtained for "station." New class scores are generated from Equation (5): 0.90 for "prefecture," 0.90 for "railway line," and 1.65 for "station." The old class scores are replaced by these values and written to the recognition history 112.

For the correction/re-input utterance type, that is, in step S317 of FIG. 9, the class score is obtained by Equation (6) below.

Score(ca) = Score(ca) × weight_{t,n} − Conf(cb) + Conf(ca)   (Equation 6)

Here Score is the class score: the left-hand side of Equation (6) is the newly obtained class score, and Score(ca) on the right-hand side is the past class score (read from the recognition history 112). Conf is the class reliability obtained from the latest recognition result, and weight_{t,n} is a weighting parameter taking a value from 0.0 to 1.0. ca is the class whose score is generated, and cb denotes all classes that belong to the same category as ca but differ from it. Compared with Equation (5), Equation (6) differs in that it subtracts the reliabilities of the other classes in the same category. This makes the score easier to correct when the class was misrecognized. For the correction/re-input utterance type, the past class score read from the recognition history 112 is replaced by the class score newly obtained from Equation (6) and written back to the recognition history 112.
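Equation (6) can be sketched in the same way as a dictionary update. The sketch assumes that Conf(cb) means the sum of the reliabilities of all rival classes in the same category, and that a mapping from class to category is available; the names `category_of` and `update_class_scores_correction` are hypothetical.

```python
def update_class_scores_correction(old_scores, new_rel, category_of, weight_tn):
    """Equation (6): Score(ca) = Score(ca) * weight_{t,n} - Conf(cb) + Conf(ca),
    where cb ranges over the other classes in the same category as ca.

    category_of -- dict mapping every class name to its category (PR/HR/LM)
    """
    updated = {}
    for ca in set(old_scores) | set(new_rel):
        # Sum the latest reliabilities of rival classes in ca's category.
        rivals = sum(
            rel for cb, rel in new_rel.items()
            if cb != ca and category_of[cb] == category_of[ca]
        )
        updated[ca] = (
            old_scores.get(ca, 0.0) * weight_tn - rivals + new_rel.get(ca, 0.0)
        )
    return updated
```

If "interchange" was understood earlier but the latest utterance strongly supports "station" in the same lower category, the subtraction pushes the "interchange" score down quickly instead of letting it decay slowly.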

The parameter weight_{t,n} used to weight the past class score can likewise be determined experimentally from utterance data mixed with noise. As before, the feature of this embodiment is that the language understanding unit 104 reads from the parameter set holding unit 114, out of multiple numerical parameters classified by noise type and degree, the one corresponding to the type and degree of noise actually detected. When a new class score is computed with Equation (6), the device selects and uses the weight_{t,n} that matches the noise level actually detected by the noise detection unit 113, just as it does for weight_{s,n} in Equation (5).

FIG. 11 shows how class scores are generated for the correction/re-input utterance type. In this example the user has previously said, for example, "Shizuoka Prefecture, Hamamatsu Station," giving an old class score of 0.39 for "prefecture" and 0.63 for "station." Because these class scores are too low to identify the categories, the dialogue understanding device responds "Please speak again." The user then repeats "Shizuoka Prefecture, Hamamatsu Station," and new class reliabilities of 0.54 for "prefecture" and 0.52 for "station" are obtained. New class scores are generated from Equation (6): 0.89 for "prefecture" and 0.86 for "station." The old class scores are replaced by these values and written to the recognition history 112.

Processing then moves to the category understanding step S308, in which the category understanding unit 106 computes category scores for both the past class scores (read from the recognition history 112) and the class reliabilities of the latest recognition result. FIG. 12 illustrates this processing. As can be seen from the figures in the respective columns of the parts labeled a and B in FIG. 12, a category score is the sum of all class scores, or reliabilities, belonging to the same category. Each category score is judged against a threshold, and for each of the three categories PR (upper), HR (middle), and LM (lower) the logical OR of the judgment results is taken. The result indicates the combination of categories uttered so far. FIG. 13 shows the category understanding that follows when the class scores are obtained as in FIG. 12: each category is judged from the old and new scores, and the category understanding is obtained as the result.
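The category understanding of step S308 can be sketched as "sum per category, compare with a threshold, then OR the old and new judgments." The threshold value, the use of a greater-than-or-equal comparison, and all names below are assumptions made for illustration.

```python
def understood_categories(old_scores, new_rel, category_of, threshold):
    """Return the set of categories judged to have been uttered so far.

    old_scores  -- past class scores from the recognition history
    new_rel     -- class reliabilities from the latest recognition result
    category_of -- dict mapping class name to category ("PR", "HR", "LM")
    """
    def judge(values):
        totals = {}
        for cls, value in values.items():
            cat = category_of[cls]
            totals[cat] = totals.get(cat, 0.0) + value  # category score
        return {cat for cat, total in totals.items() if total >= threshold}

    # Logical OR of the two judgments, expressed as a set union.
    return judge(old_scores) | judge(new_rel)
```

A category therefore counts as uttered if either the accumulated history or the latest recognition result alone clears the threshold.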

Next, word scores are generated in step S309. This step is executed by the word score generation unit 107, which generates scores under separate policies for two groups of words: 1) past words (already present in the recognition history 112), and 2) newly appearing words (words in the latest recognition result). For group 2), all words contained in the N-Best candidates of the latest recognition result are covered. Word scores are generated in the order 1) then 2) each time the language understanding unit 104 acquires a new recognition result.

For the words of group 1), which exist in the recognition history, a new word score is generated by raising or lowering the existing word score according to how recent the word is, the content of the device's response, and the user's utterance type (refinement, correction, answer, re-input). The following five policies are used.

Policy 1: On the assumption that old information becomes less reliable, the scores of all words in the recognition history 112 are lowered every time a new recognition result is input.

Policy 2: If a word A in the recognition history 112 and a word B newly obtained as a recognition result are in a refinement relationship, the score of word A is raised.

Policy 3: If a word A in the recognition history 112 and a word B newly obtained as a recognition result are in a correction relationship, the score of word A is lowered.

Policy 4: If the recognition result contains an affirmative word ("yes," "yeah," etc.), the scores of the words contained in the response are raised.

Policy 5: If the recognition result contains a negative word ("no," "that's wrong," etc.), the scores of the words contained in the response are lowered.

The scores of words in the recognition history 112 are generated by Equation (7) below.

Score(Wd) = Score(Wd) − p1n + p2n × Conf(Ws) − p3n × Conf(Wt) + i × (p4n × Conf(yes) − p5n × Conf(no) − p6n × Conf(rej))   (Equation 7)

Here Score is the score of a word in the recognition history 112; the right-hand side uses the value before the update and the left-hand side is the value after the update. Wd is the word in the recognition history 112 for which the score is computed. The term corresponding to Policy 1 is the one that lowers the word score using the parameter p1n (the −p1n term on the right-hand side). The next two terms on the right-hand side correspond to Policy 2 and Policy 3: p2n and p3n are weighting parameters, Conf is the reliability obtained from the latest recognition result, Ws denotes all words in the latest recognition result that are in a refinement relationship with Wd, and Wt denotes all words in the latest recognition result that are in a correction relationship with Wd. The final term on the right-hand side corresponds to Policy 4 and Policy 5: p4n, p5n, and p6n are parameters, and i is 1 if the word is contained in the previous response from the dialogue understanding device and 0 otherwise. (yes) denotes an affirmative word contained in the latest recognition result, (no) a negative word contained in the latest recognition result, and (rej) a sentence-final negative word contained in the latest recognition result.
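Equation (7) translates almost directly into code. The sketch below assumes that Conf(Ws) and Conf(Wt) stand for the sums of the reliabilities of all matching words, and that the parameters p1n to p6n for the current noise level are passed in as a dictionary with keys "p1" to "p6"; both are assumptions, as are the function and argument names.

```python
def update_history_word_score(score, p, rel_refine, rel_correct,
                              in_prev_response, rel_yes, rel_no, rel_rej):
    """Equation (7) for one word Wd already in the recognition history.

    score            -- Score(Wd) before the update
    p                -- dict with keys "p1".."p6" (parameters for noise level n)
    rel_refine       -- reliabilities of words Ws refining Wd
    rel_correct      -- reliabilities of words Wt correcting Wd
    in_prev_response -- True if Wd appeared in the device's previous response
    rel_yes/no/rej   -- reliabilities of affirmative, negative, and
                        sentence-final negative words in the latest result
    """
    i = 1 if in_prev_response else 0
    return (
        score
        - p["p1"]                              # Policy 1: age decay
        + p["p2"] * sum(rel_refine)            # Policy 2: refinement
        - p["p3"] * sum(rel_correct)           # Policy 3: correction
        + i * (p["p4"] * rel_yes               # Policy 4: affirmation
               - p["p5"] * rel_no              # Policy 5: negation
               - p["p6"] * rel_rej)
    )
```

Because of the factor i, an affirmative or negative reply only affects words that the device actually said in its previous response.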

The parameters p1n to p6n used to generate word scores with Equation (7) can be determined experimentally from utterance data mixed with noise. As before, the feature of this embodiment is that the language understanding unit 104 reads from the parameter set holding unit 114, out of multiple numerical parameters classified by noise type and degree, those corresponding to the type and degree of noise actually detected. When a word score is computed with Equation (7), the device selects and uses the p1n to p6n that match the noise level actually detected by the noise detection unit 113, just as it does for weight_{s,n} in Equation (5) and weight_{t,n} in Equation (6).

For the words of group 2), that is, words in the latest recognition result that are not yet registered in the recognition history 112 (newly appearing words), a word score is generated by raising or lowering the speech recognition reliability according to the response content, the user's utterance type (refinement, correction, answer, re-input), the N-Best rank, and the utterance length (the number of words uttered). The following four policies are used.

Policy 6: If a word A in the recognition result and a word B contained in the response are in a refinement relationship, the score of word A is raised.

Policy 7: If the device's response was a question (for example, "Which interchange?") and the content of the recognition result is an answer to it, the scores of the words in the recognition result are raised.

Policy 8: Because the top-ranked recognition results contain many correct words, the scores of words appearing in the top ranks are raised.

Policy 9: Because long utterances are easy to recognize and short utterances are hard to recognize, the score of a word is lowered when the result covers one category and raised when it covers two or more categories.

The scores of words in the latest recognition result that are not yet registered in the recognition history 112 are generated by Equation (8) below.

Score(Wd) = Conf(Wd) + p7n × Score(Ws) + p8n × Conf(Wa) + Conf(Wd) × (p9n + p10n × len2 − p11n × len1)   (Equation 8)

Here Score is the score of a word in the recognition history, and Conf is the reliability obtained from the latest recognition result. Wd is the word in the recognition result for which the score is computed. The term corresponding to Policy 6 is the one that raises the word score using the parameter p7n; Ws denotes all words that are in a refinement relationship with Wd, the word contained in the recognition result. The next term on the right-hand side corresponds to Policy 7: p8n is a weighting parameter, and Wa is a word contained in the recognition result when that result is an answer to a question. The final term on the right-hand side corresponds to Policies 8 and 9: p9n is the weighting parameter for Policy 8, reflecting how high the word ranks in the N-Best list, and p10n and p11n are the weighting parameters for Policy 9. len2 takes the value 1 when the number of recognized categories is two or more, and len1 takes the value 1 when the number of recognized categories is one.
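Equation (8) can be sketched in the same style. The sketch assumes that Score(Ws) stands for the sum of the scores of all words in a refinement relationship with Wd, that len1 and len2 are 0 when their condition does not hold, and that the parameters are supplied as a dictionary with keys "p7" to "p11"; these are illustrative assumptions.

```python
def new_word_score(rel_wd, p, refine_scores, rel_answer, n_categories):
    """Equation (8) for a word Wd that is new to the recognition history.

    rel_wd        -- Conf(Wd), reliability of the word itself
    p             -- dict with keys "p7".."p11" (parameters for noise level n)
    refine_scores -- scores of words Ws in a refinement relationship with Wd
    rel_answer    -- Conf(Wa); 0.0 if the result is not an answer to a question
    n_categories  -- number of categories covered by the recognition result
    """
    len2 = 1 if n_categories >= 2 else 0   # longer utterance: raise score
    len1 = 1 if n_categories == 1 else 0   # short utterance: lower score
    return (
        rel_wd
        + p["p7"] * sum(refine_scores)                              # Policy 6
        + p["p8"] * rel_answer                                      # Policy 7
        + rel_wd * (p["p9"] + p["p10"] * len2 - p["p11"] * len1)    # Policies 8, 9
    )
```

The last term scales with the word's own reliability, so the rank and length adjustments matter more for words the recognizer was already fairly sure about.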

The parameters p7n to p11n used to generate word scores with Equation (8) can be determined experimentally from utterance data mixed with noise. As before, the feature of this embodiment is that the language understanding unit 104 reads from the parameter set holding unit 114, out of multiple numerical parameters classified by noise type and degree, those corresponding to the type and degree of noise actually detected. When a word score is computed with Equation (8), the device selects and uses the p7n to p11n that match the noise level actually detected by the noise detection unit 113, just as it does for weight_{s,n} in Equation (5), weight_{t,n} in Equation (6), and p1n to p6n in Equation (7).

The word scores updated in 1), and the words added in 2) together with their scores, are written to the recognition history 112 as an integrated recognition history. FIG. 14 shows an example of the integrated recognition result using actual prefecture names, railway names, and so on. Some names appear more than once in the figure (Atsugi, Tanashi, etc.); these are, for example, names of stations that belong to more than one line.

From the category understanding result obtained above and the integrated recognition result, multiple candidates are generated as plausible combinations. That is, based on the information obtained above, several likely candidates are generated for what the device has understood (step S310). Step S310 is executed by the understanding content generation unit 108. Specifically, the understanding content generation unit 108 recognizes, for example from the result shown in FIG. 13, that the three categories PR, HR, and LM have been uttered, and extracts as candidates, from the recognition result illustrated in FIG. 14, the combinations across these categories that actually exist. It then selects the candidate with the largest sum of category scores. FIG. 15 shows the result. In the example of FIG. 15, the selected understanding result is (PR category = Aichi, score = 1.47), (HR category = Nagoya Railroad, score = 1.17), (LM category = Toyohashi, score = 0.62).
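The candidate selection of step S310 can be sketched as an exhaustive search over per-category words, restricted to combinations that actually exist. The set of valid combinations stands in for whatever geographic knowledge the device holds; its representation, the function name, and every score other than the three quoted from FIG. 15 are hypothetical.

```python
from itertools import product

def best_understanding(word_scores, valid_combinations):
    """Pick the existing combination with the largest sum of word scores.

    word_scores        -- {category: {word: score}} for the uttered categories
    valid_combinations -- set of word tuples, ordered like word_scores' keys,
                          that exist in reality (assumed knowledge base)
    """
    categories = list(word_scores)
    best, best_total = None, float("-inf")
    for combo in product(*(word_scores[c] for c in categories)):
        if combo not in valid_combinations:
            continue  # e.g. a station that is not on that railway line
        total = sum(word_scores[c][w] for c, w in zip(categories, combo))
        if total > best_total:
            best, best_total = combo, total
    return best, best_total

scores = {
    "PR": {"Aichi": 1.47, "Shizuoka": 0.50},       # 0.50 is hypothetical
    "HR": {"Nagoya Railroad": 1.17, "JR": 0.30},   # 0.30 is hypothetical
    "LM": {"Toyohashi": 0.62, "Hamamatsu": 0.90},  # 0.90 is hypothetical
}
valid = {("Aichi", "Nagoya Railroad", "Toyohashi"), ("Shizuoka", "JR", "Hamamatsu")}
choice, total = best_understanding(scores, valid)
```

Note that "Hamamatsu" has the highest individual lower-category score in this made-up data, yet the Aichi combination wins because the sum over all three categories is larger.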

This completes the description of the processing, carried out in the steps of the language understanding unit 104, up to the generation of the understanding content. The resulting output information is input to the response generation unit 109, which in step S311 generates a response sentence by means of response flags that reflect the understanding result. FIG. 16 shows the types of response flags processed by the response generation unit 109, and FIG. 17 shows what each bit in FIG. 16 (bits a to F) represents. When the understanding result contains a word for a category, the corresponding flag is set; here, for example, the flag value (bit pattern) reflects the score evaluated on a four-level scale. Scores from highest to lowest are graded evaluation 1 through evaluation 4, and the corresponding flags are 1000, 0100, 0010, and 0001.
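The four-level flag encoding can be sketched as a threshold ladder. The specification does not give the grade boundaries, so the thresholds below are hypothetical values chosen only for illustration, as is the function name.

```python
def score_flag(score, thresholds=(1.0, 0.6, 0.3)):
    """Map a score to a 4-bit one-hot flag string.

    Evaluation 1 (best) -> "1000", evaluation 2 -> "0100",
    evaluation 3 -> "0010", evaluation 4 (worst) -> "0001".
    thresholds are hypothetical lower bounds for evaluations 1 to 3.
    """
    rank = len(thresholds)  # default: evaluation 4
    for index, lower_bound in enumerate(thresholds):
        if score >= lower_bound:
            rank = index
            break
    return "".join("1" if bit == rank else "0" for bit in range(4))
```

With these placeholder thresholds, scores of 1.47 and 1.17 both map to "1000" and a score of 0.62 maps to "0100".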

Using these response flags, the response generation unit 109 responds in accordance with the following dialogue policies.

・応答方針１：了承（相槌）
下位カテゴリがなく、上位カテゴリまたは中位カテゴリのスコア評価が評価１の場合、対話をスムーズに進めるための応答を行う。・ Response policy 1: Approval (consideration)
When there is no lower category and the score evaluation of the upper category or the middle category is evaluation 1, a response for smoothly proceeding with the dialogue is performed.

〈例〉使用者の発話 …「静岡県」
対話理解装置の応答…「はい」
・応答方針２：復唱
スコア評価が評価２の場合や、使用者の発話の文頭に否定後が来た場合は確認の意味も込めて復唱を行う。 <Example>User's utterance "Shizuoka Prefecture"
Response of the dialogue understanding device… Yes
-Response policy 2: Repetition When the score evaluation is evaluation 2 or when a negative word comes after the beginning of the user's utterance, a recitation is performed with the meaning of confirmation.

〈例〉使用者の発話 …「静岡県」
対話理解装置の応答…「静岡県」
・応答方針３：最終確認
下位カテゴリが発話され、信頼できる（スコア評価が評価１か評価２）場合は、最終確認を行う。 <Example>User's utterance "Shizuoka Prefecture"
Response of dialogue understanding device ... "Shizuoka"
Response policy 3: Final confirmation If the lower category is spoken and reliable (score evaluation is evaluation 1 or evaluation 2), final confirmation is performed.

〈例〉使用者の発話 …「浜松インターから乗ります」
対話理解装置の応答…「浜松インターを設定してよろしいですか」
・応答方針４：目的地設定
前応答に下位カテゴリがあり、肯定発話が信頼できる（スコア評価が評価１か評価２）場合は、目的地に設定する。 <Example>User's utterance… “Ride from Hamamatsu Inter”
Response of dialogue understanding device… “Are you sure you want to set up Hamamatsu Inter”
Response policy 4: Destination setting If there is a lower category in the previous response and the affirmative utterance is reliable (score evaluation is evaluation 1 or evaluation 2), it is set as the destination.

〈例〉対話理解装置の応答…「浜松インターを設定してよろしいですか」
使用者の発話 …「はい」
対話理解装置の応答…「目的地に設定しました」
・応答方針５：分からない情報のみ尋ねる
使用者に対して分からない情報のみを尋ねる。 <Example> Response of the dialogue understanding device ... "Are you sure you want to set up Hamamatsu Inter?"
User's utterance “Yes”
Response of dialogue understanding device… “Set as destination”
・ Response policy 5: Ask only unknown information Ask the user about only unknown information.

〈例〉使用者の発話…「静岡県の東名自動車道です」（「東名」のスコア評価が低いとき）
対話理解装置の応答…「静岡県の何自動車道ですか？」
・応答方針６：自信のない情報は応答しない
上位カテゴリ（ＰＲ）と中位カテゴリ（ＨＲ）の組み合わせで、どちらか一方だけ信頼できない（スコア評価が評価４）場合、スコアの高いものだけ応答することで対話を進める。 <Example>User's utterance: "It is Tomei Expressway in Shizuoka Prefecture" (when the score evaluation of "Tomei" is low)
Response of dialogue understanding device… “How many expressways in Shizuoka?”
Response policy 6: Do not respond to unconfident information If only one of the combinations of the higher category (PR) and middle category (HR) is unreliable (score rating is 4), respond only to those with higher scores To promote dialogue.

〈例〉使用者の発話…「静岡県の東名自動車道」（「静岡県」のスコア評価が低いとき）
対話理解装置の応答…「東名自動車道の」
・応答方針７：上のカテゴリも尋ねる
別情報の付加情報が少なく、スコア評価が悪いときに、上のカテゴリも聞くことによって認識率の向上を図る。 <Example>User's utterance ... "Tomei Expressway in Shizuoka Prefecture" (when score evaluation of "Shizuoka Prefecture" is low)
Response of dialogue understanding device ... "Tomei Expressway"
-Response policy 7: Ask also the upper category When there is little additional information of other information and the score evaluation is bad, the recognition rate is improved by listening to the upper category.

〈例〉使用者の発話…「浜松インターから乗る」（「浜松」のスコア評価が低いとき）
対話理解装置の応答…「何県のインターですが」
・応答方針８：次の発話を促す
上位カテゴリにつづいて肯定発話がきて、信頼できる場合（スコア評価が評価１か評価２の場合）、次の発話を促す。 <Example>User's utterance: "Riding from Hamamatsu interchange" (when score evaluation of "Hamamatsu" is low)
Response of the dialogue understanding device ... "How many prefectures are you?"
Response policy 8: Encourage the next utterance If an affirmative utterance comes after the upper category and it is reliable (when the score evaluation is evaluation 1 or evaluation 2), the next utterance is prompted.

〈例〉対話理解装置の応答…「東名自動車道」
使用者の発話 …「はい」
対話理解装置の応答…「東名自動車道のどこですか」
・応答方針９：別の候補を返す
否定発話が信頼できる場合（スコア評価が評価１か評価２の場合）、前回の応答に用いていない別候補を返す。 <Example> Response of the dialogue understanding device ... "Tomei Expressway"
User's utterance “Yes”
Response of dialogue understanding device… “Where is Tomei Expressway”
Response policy 9: Return another candidate When the negative utterance is reliable (when the score evaluation is evaluation 1 or evaluation 2), another candidate not used in the previous response is returned.

〈例〉対話理解装置の応答…「浜松インターを設定しますか」
使用者の発話 …「いいえ」
対話理解装置の応答…「浜松西インターを設定しますか」
・応答方針１０：前応答の繰り返し
肯定発話や否定発話が信頼できない場合（スコア評価が評価４の場合）、前の応答を繰り返す。 <Example> Response from the device for understanding the dialogue ... "Do you want to set up Hamamatsu Inter"
User's utterance “No”
Response of dialogue understanding device… “Do you want to set up Hamamatsu Nishi Inter”
Response policy 10: Repeating previous response When a positive utterance or negative utterance is unreliable (when the score evaluation is evaluation 4), the previous response is repeated.

〈例〉対話理解装置の応答…「浜松インターを設定してよろしいですか」
使用者の発話 …「はい」（「はい」のスコア評価が低いとき）
対話理解装置の応答…「浜松インターを設定してよろしいですか」
・応答方針１１：聞き返し
全ての情報に対して信頼できない場合（スコア評価が評価４の場合）、全ての情報を聞き返す。 <Example> Response of the dialogue understanding device ... "Are you sure you want to set up Hamamatsu Inter?"
User's utterance… “Yes” (when “Yes” score is low)
Response of dialogue understanding device… “Are you sure you want to set up Hamamatsu Inter”
Response policy 11: Listening back If all information is unreliable (score evaluation is 4), all information is returned.

<Example> User's utterance: "Shizuoka Prefecture" (when the score evaluation of "Shizuoka Prefecture" is low)
Response of the dialogue understanding device: "Please speak again."
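The conditions of response policies 8 to 11 can be read as a small decision rule. The following is only an illustrative sketch: the function name, the utterance labels ("affirmative", "negative", "content"), and the use of a single evaluation value per utterance are assumptions made for the example, and cases the text does not specify (such as evaluation 3) simply return None.

```python
from typing import Optional

RELIABLE = {1, 2}    # "evaluation 1 or evaluation 2" in policies 8 and 9
UNRELIABLE = {4}     # "evaluation 4" in policies 10 and 11


def choose_policy(utterance_kind: str, evaluation: int,
                  follows_upper_category: bool = False) -> Optional[int]:
    """Return a response policy number (8 to 11), or None if no rule applies.

    utterance_kind is a hypothetical label: "affirmative", "negative",
    or "content" (an utterance that carries destination information).
    """
    if (utterance_kind == "affirmative" and follows_upper_category
            and evaluation in RELIABLE):
        return 8   # prompt the next utterance
    if utterance_kind == "negative" and evaluation in RELIABLE:
        return 9   # return another candidate
    if utterance_kind in ("affirmative", "negative") and evaluation in UNRELIABLE:
        return 10  # repeat the previous response
    if utterance_kind == "content" and evaluation in UNRELIABLE:
        return 11  # ask the user to speak again
    return None    # not covered by policies 8 to 11 (assumption)
```

For instance, `choose_policy("negative", 2)` yields policy 9, while a low-scoring "Yes" (`choose_policy("affirmative", 4)`) yields policy 10.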
To carry out the above dialogue policies, the response generation unit 109 checks the response flag against the flag table of FIG. 16 and returns a response using the first response pattern whose flag matches.
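A first-match lookup of this kind might be sketched as follows. The contents of the flag table of FIG. 16 are not reproduced in this text, so the table entries, the "*" wildcard, and the function names below are hypothetical placeholders; only the idea of scanning the table in order and returning the first matching response pattern comes from the description.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for the flag table of FIG. 16:
# (flag pattern, response pattern). "*" is an assumed wildcard that
# matches either bit; spaces only group the bits for readability.
FLAG_TABLE: List[Tuple[str, str]] = [
    ("0 ****** **** **** **** **** **** **** *", "Please speak again."),
    ("1 111000 1000 1000 0100 0000 0000 0000 0",
     "Are you sure you want to set {LM} of {HR} of {PR}?"),
    ("1 ****** **** **** **** **** **** **** *", "Which prefecture is it?"),
]


def flag_matches(pattern: str, flag: str) -> bool:
    """Compare bit by bit, ignoring spaces; '*' in the pattern matches any bit."""
    p, f = pattern.replace(" ", ""), flag.replace(" ", "")
    return len(p) == len(f) and all(pc in ("*", fc) for pc, fc in zip(p, f))


def select_response_pattern(flag: str) -> Optional[str]:
    """Return the response pattern of the first table row whose flag matches."""
    for pattern, response in FLAG_TABLE:
        if flag_matches(pattern, flag):
            return response
    return None
```

Because the scan stops at the first match, more specific rows have to be placed before more general ones in such a table.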

Specifically, when, for example, the response flag generated by the response generation unit 109 from the understanding result is "1 111000 1000 1000 0100 0000 0000 0000 0", reference to the flag table of FIG. 16 selects the following response pattern.

"Are you sure you want to set "LM category word" "LM category class" of "HR category word" "HR category class" of "PR category word" "PR category class"?"
As a result, the response sentence "Are you sure you want to set Toyohashi Station on the Nagoya Railway in Aichi Prefecture?" is generated.
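Filling the selected response pattern with the understood words can be illustrated with a short sketch. The `Slot` type, the split of each item into a category word and a category class (for example "Toyohashi" and "Station"), and the English connecting words are assumptions made for this example rather than details taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    word: str  # category word, e.g. "Toyohashi"
    cls: str   # category class, e.g. "Station"

    def text(self) -> str:
        return f"{self.word} {self.cls}"


def confirmation_sentence(pr: Slot, hr: Slot, lm: Slot) -> str:
    """Fill the confirmation pattern with the PR, HR and LM slots."""
    return (f"Are you sure you want to set {lm.text()} "
            f"on the {hr.text()} in {pr.text()}?")


sentence = confirmation_sentence(
    pr=Slot("Aichi", "Prefecture"),
    hr=Slot("Nagoya", "Railway"),
    lm=Slot("Toyohashi", "Station"),
)
print(sentence)
```

With these slot values the sketch prints a confirmation sentence with the same structure as the example response above.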

In step S312, the response sentence generated as described above is synthesized as a digital signal by the voice synthesis unit 110 and is output as audio through a D/A converter and an output amplifier (neither shown) built into the voice synthesis unit 110. The response sentence is also sent via the GUI display unit 111 to a display device (not shown) and displayed on it as text.

At this stage, it is checked whether all input processing has been completed (step S313); once it has, the process proceeds to step S314 and all input processing ends. That is, it is determined whether the word of the lower category (LM) has been fixed. If it has not (no in step S313), processing continues; if it has (yes in step S313), the process proceeds to step S314 and all input processing ends. In this example, the response "Are you sure you want to set Toyohashi Station on the Nagoya Railway in Aichi Prefecture?" has just been given; when the user next utters "Yes", the device responds "Set as the destination" and the processing ends.

As described above, the dialogue understanding device of this embodiment classifies the utterances included in a dialogue hierarchically, in order of the breadth each utterance covers, into a plurality of categories and into classes formed by subdividing those categories. It calculates a class score, which gives the likelihood of which class of word was uttered, and a word score, which gives the likelihood of each word included in the utterance. When a dialogue contains a plurality of utterances, the device understands the dialogue content by an integrated calculation over the class scores and word scores calculated for each utterance. The speech information can therefore be understood interactively from the flow of context in the utterances, and a more accurate recognition result can be obtained than with conventional methods that focus only on improving the clarity or intelligibility of the uttered speech.

Furthermore, the parameters used in the integrated calculation are statistically estimated from utterance data mixed with a plurality of types of noise, and the dialogue content is understood using the parameters that correspond to the detected noise level. The influence of background noise on the utterance can therefore be greatly reduced, and good recognition accuracy can be obtained even in a noisy environment, for example with voice input in a vehicle navigation system.

(Second Embodiment)
Next, a dialogue understanding device according to a second embodiment to which the present invention is applied will be described. The dialogue understanding device of this embodiment is an example of a dialogue understanding device mounted on a vehicle. As its basic configuration in FIG. 18 shows, it includes a traveling speed detection unit 115 in place of the noise detection unit 113 of the dialogue understanding device of the first embodiment. The parameters used in the integrated calculation for understanding the dialogue content are statistically estimated from utterance data classified by vehicle traveling speed, and the parameters are changed on the basis of the speed information detected by the traveling speed detection unit 115. The rest of the configuration and the basic processing flow are the same as in the first embodiment, so the description of the parts shared with the first embodiment is omitted below and only the parts characteristic of this embodiment are described.

In the dialogue understanding device of this embodiment, vehicle speed pulses detected by a wheel speed sensor of the vehicle are sent to the traveling speed detection unit 115, which measures the interval between the pulses to detect the traveling speed of the vehicle on which the device is mounted. Speed information indicating this traveling speed is then input to the language understanding unit 104.
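As a rough illustration of deriving speed from the pulse interval: if each pulse corresponds to a fixed travel distance, the speed is that distance divided by the measured interval. The number of pulses per wheel revolution and the tire circumference used below are assumed example values, not figures from the specification.

```python
def speed_kmh(pulse_interval_s: float,
              pulses_per_rev: int = 4,             # assumed sensor resolution
              tire_circumference_m: float = 1.9    # assumed tire size
              ) -> float:
    """Estimate vehicle speed in km/h from the time between two speed pulses."""
    if pulse_interval_s <= 0:
        raise ValueError("pulse interval must be positive")
    distance_per_pulse_m = tire_circumference_m / pulses_per_rev
    return distance_per_pulse_m / pulse_interval_s * 3.6  # m/s to km/h
```

A shorter pulse interval means a higher speed; halving the interval doubles the estimate.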

In the process of generating an understanding result from the words input through a plurality of utterances by the user and the reliabilities of the classes to which those words belong, the language understanding unit 104 performs parameter calculations using various numerical parameters, as in the first embodiment. In the dialogue understanding device of this embodiment, however, the numerical parameters used in these calculations are changed according to the traveling speed of the vehicle detected by the traveling speed detection unit 115. That is, the parameter set holding unit 114 holds a plurality of numerical parameters classified by vehicle traveling speed, and the numerical parameters corresponding to the actually detected traveling speed are read out from among them and used in the calculation. The numerical parameters held in the parameter set holding unit 114 are estimated in advance as parameters corresponding to the traveling speed of the vehicle, by collecting a large amount of utterance data mixed with a plurality of types of noise that accompany vehicle travel and processing the data statistically.

Specifically, as in the first embodiment described above, the dialogue understanding device of this embodiment obtains the class score using Equation (5) or Equation (6) and the word score using Equation (7) or Equation (8). The parameters weight s,n, weight t,n and p1n to p11n used in these calculations can be obtained experimentally from utterance data mixed with noise. The feature of this embodiment is that the language understanding unit 104 reads out from the parameter set holding unit 114, and uses, the numerical parameters that correspond to the actually detected traveling speed of the vehicle, chosen from among a plurality of numerical parameters classified by traveling speed. Here, n denotes the vehicle traveling speed at which the mixed noise was observed. When the numerical parameters are obtained experimentally from utterance data classified by traveling speed [n = speed 0, speed 1, speed 2, speed 3, ...], a plurality of numerical parameters is obtained, one set for each traveling speed. When obtaining the class score with Equation (5) or (6) and the word score with Equation (7) or (8), the dialogue understanding device of this embodiment selects and uses the numerical parameters (weight s,n, weight t,n and p1n to p11n) that correspond to the traveling speed actually detected by the traveling speed detection unit 115.
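Selecting a parameter set by speed class n could look like the sketch below. The speed boundaries, the container type, and all numeric values are invented placeholders for illustration; the specification only states that one set of weight s,n, weight t,n and p1n to p11n is held per traveling speed class and that the set matching the detected speed is read out.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ParameterSet:
    weight_s: float
    weight_t: float
    p: Tuple[float, ...]  # p1 .. p11


# Hypothetical boundaries (km/h) between speed classes n = 0, 1, 2, 3.
SPEED_BOUNDS_KMH: List[float] = [10.0, 40.0, 80.0]

# Placeholder values only; real values would be estimated from speech data.
PARAMETER_SETS: List[ParameterSet] = [
    ParameterSet(0.9, 0.1, (1.0,) * 11),  # n = 0
    ParameterSet(0.8, 0.2, (0.9,) * 11),  # n = 1
    ParameterSet(0.7, 0.3, (0.8,) * 11),  # n = 2
    ParameterSet(0.6, 0.4, (0.7,) * 11),  # n = 3
]


def speed_class(speed_kmh: float) -> int:
    """Map a detected speed to its class index n."""
    return bisect_right(SPEED_BOUNDS_KMH, speed_kmh)


def select_parameters(speed_kmh: float) -> ParameterSet:
    """Read out the parameter set for the detected traveling speed."""
    return PARAMETER_SETS[speed_class(speed_kmh)]
```

Keeping the boundaries in a sorted list lets the class index be found with a binary search, so adding a finer speed classification only requires extending the two lists.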

As described above, in the dialogue understanding device of this embodiment, the parameters used in the integrated calculation for understanding the utterance content interactively are statistically estimated, for each vehicle traveling speed, from utterance data mixed with a plurality of types of noise, and the dialogue content is understood using the parameters that correspond to the actually detected traveling speed of the vehicle. The influence of background noise on the utterance can therefore be greatly reduced, and good recognition accuracy can be obtained even in a noisy environment, for example with voice input in a vehicle navigation system.

(Third Embodiment)
Next, a dialogue understanding device according to a third embodiment to which the present invention is applied will be described. The dialogue understanding device of this embodiment is an example of a dialogue understanding device realized as one function of a vehicle navigation system. As its basic configuration in FIG. 19 shows, it includes a traveling route detection unit 116 in place of the noise detection unit 113 of the dialogue understanding device of the first embodiment. The parameters used in the integrated calculation for understanding the dialogue content are statistically estimated from utterance data classified by the travel route of the vehicle, and the parameters are changed on the basis of the route information detected by the traveling route detection unit 116. The rest of the configuration and the basic processing flow are the same as in the first embodiment, so the description of the parts shared with the first embodiment is omitted below and only the parts characteristic of this embodiment are described.

In the dialogue understanding device of this embodiment, the traveling route detection unit 116 acquires travel route information from the navigation device, detects the route on which the vehicle carrying the device is currently traveling, and inputs that information to the language understanding unit 104. The travel route information acquired from the navigation device describes, for example, the road surface condition, the presence of a tunnel, or the presence of an elevated section. Such information is stored in the navigation device; the traveling route detection unit 116 acquires the information on the route currently being traveled from the navigation device and outputs it to the language understanding unit 104.

In the process of generating an understanding result from the words input through a plurality of utterances by the user and the reliabilities of the classes to which those words belong, the language understanding unit 104 performs parameter calculations using various numerical parameters, as in the first embodiment. In the dialogue understanding device of this embodiment, however, the numerical parameters used in these calculations are changed according to the route information of the travel route detected by the traveling route detection unit 116. That is, the parameter set holding unit 114 holds a plurality of numerical parameters classified by travel route, and the numerical parameters corresponding to the route information of the actually detected travel route are read out from among them and used in the calculation. The numerical parameters held in the parameter set holding unit 114 are estimated in advance as parameters corresponding to the travel route of the vehicle, by collecting a large amount of utterance data mixed with a plurality of types of noise that accompany vehicle travel and processing the data statistically.

Specifically, as in the first embodiment described above, the dialogue understanding device of this embodiment obtains the class score using Equation (5) or Equation (6) and the word score using Equation (7) or Equation (8). The parameters weight s,n, weight t,n and p1n to p11n used in these calculations can be obtained experimentally from utterance data mixed with noise. The feature of this embodiment is that the language understanding unit 104 reads out from the parameter set holding unit 114, and uses, the numerical parameters that correspond to the route information of the actually detected travel route, chosen from among a plurality of numerical parameters classified by travel route. Here, n denotes the travel route on which the mixed noise was observed. When the numerical parameters are obtained experimentally from utterance data classified by travel route [n = route 0, route 1, route 2, route 3, ...], a plurality of numerical parameters is obtained, one set for each kind of route information. When obtaining the class score with Equation (5) or (6) and the word score with Equation (7) or (8), the dialogue understanding device of this embodiment selects and uses the numerical parameters (weight s,n, weight t,n and p1n to p11n) that correspond to the route information of the travel route actually detected by the traveling route detection unit 116.
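Because route information is categorical (road surface, tunnel, elevated section), the route class n can be looked up from a table keyed by those attributes. The sketch below is one possible arrangement; the attribute names, the class assignments, the fallback to class 0 for unlisted combinations, and the numeric values are all assumptions made for illustration.

```python
from typing import Dict, NamedTuple


class RouteInfo(NamedTuple):
    surface: str      # hypothetical label, e.g. "smooth" or "rough"
    in_tunnel: bool
    elevated: bool


# Hypothetical mapping from route attributes to route class n.
ROUTE_CLASSES: Dict[RouteInfo, int] = {
    RouteInfo("smooth", False, False): 0,
    RouteInfo("rough", False, False): 1,
    RouteInfo("smooth", True, False): 2,
    RouteInfo("smooth", False, True): 3,
}
DEFAULT_ROUTE_CLASS = 0  # assumed fallback for combinations not in the table

# Placeholder parameter values per route class; real values would be
# estimated statistically from speech data recorded on each kind of route.
PARAMETERS_BY_ROUTE: Dict[int, Dict[str, float]] = {
    0: {"weight_s": 0.9, "weight_t": 0.1},
    1: {"weight_s": 0.8, "weight_t": 0.2},
    2: {"weight_s": 0.7, "weight_t": 0.3},
    3: {"weight_s": 0.6, "weight_t": 0.4},
}


def route_class(info: RouteInfo) -> int:
    """Map route information from the navigation device to a class index n."""
    return ROUTE_CLASSES.get(info, DEFAULT_ROUTE_CLASS)


def select_parameters(info: RouteInfo) -> Dict[str, float]:
    """Read out the parameter set for the currently detected route."""
    return PARAMETERS_BY_ROUTE[route_class(info)]
```

The fallback keeps the lookup total, at the cost of using parameters that were not estimated for the unlisted condition.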

As described above, in the dialogue understanding device of this embodiment, the parameters used in the integrated calculation for understanding the utterance content interactively are statistically estimated, for each travel route of the vehicle, from utterance data mixed with a plurality of types of noise, and the dialogue content is understood using the parameters that correspond to the route information of the actually detected travel route. The influence of background noise on the utterance can therefore be greatly reduced, and good recognition accuracy can be obtained even in a noisy environment, for example with voice input in a vehicle navigation system.

FIG. 1 is a block diagram showing the basic configuration of the dialogue understanding device of the first embodiment to which the present invention is applied.
FIG. 2 is a diagram explaining the hierarchical classification of uttered words.
FIG. 3 is a correspondence diagram showing the relationship between recognition target words and utterance types.
FIG. 4 is a diagram showing an example of a dialogue between the dialogue understanding device and a user.
FIG. 5 is a flowchart showing the operation of the dialogue understanding device.
FIG. 6 is a diagram showing the relationship between the recognition result candidate sentences output by the voice recognition unit and their likelihoods.
FIG. 7 is a diagram explaining a method of obtaining the reliability from the relationship between recognition result candidate sentences and likelihoods.
FIG. 8 is a correspondence diagram showing the relationship between utterance types and the material used to determine the utterance type.
FIG. 9 is a flowchart showing how processing is selected according to the utterance type.
FIG. 10 is a diagram showing the process of generating the updated class score for the refinement/answer utterance type.
FIG. 11 is a diagram showing the process of generating the updated class score for the correction/re-input utterance type.
FIG. 12 is a diagram explaining the procedure for calculating a score for each category.
FIG. 13 is a diagram explaining the category understanding processing.
FIG. 14 is a diagram showing an example of integrated recognition results.
FIG. 15 is a diagram comparing the final scores in language understanding item by item.
FIG. 16 is a correspondence diagram showing the relationship between response flags and response patterns.
FIG. 17 is a correspondence diagram showing the relationship between response flags and their contents.
FIG. 18 is a block diagram showing the basic configuration of the dialogue understanding device of the second embodiment to which the present invention is applied.
FIG. 19 is a block diagram showing the basic configuration of the dialogue understanding device of the third embodiment to which the present invention is applied.

Explanation of symbols

101 Voice input unit
102 Voice recognition unit
103 Reliability generation unit
104 Language understanding unit
105 Class score generation unit
106 Category understanding unit
107 Word score generation unit
108 Understanding content generation unit
109 Response generation unit
110 Voice synthesis unit
111 GUI display unit
112 Recognition history
113 Noise detection unit
114 Parameter set holding unit
115 Traveling speed detection unit
116 Traveling route detection unit

Claims

A dialogue understanding device that classifies utterances included in a dialogue hierarchically, in order of the breadth each utterance covers, into a plurality of categories and classes formed by subdividing the categories; calculates a class score giving the likelihood of which class of word was uttered and a word score giving the likelihood of a word included in the utterance; and, when a plurality of utterances occur in the dialogue, understands the dialogue content by an integrated calculation over the plurality of class scores and the plurality of word scores calculated for each utterance, wherein
the parameters used in the integrated calculation are statistically estimated from utterance data mixed with a plurality of types of noise, and the dialogue content is understood using parameters corresponding to the type of detected noise.

The dialogue understanding device according to claim 1, comprising:
voice input means comprising a microphone and a voice amplifier;
voice recognition means for digitizing the output of the voice input means and performing voice recognition;
reliability generation means for calculating the reliability of the result recognized by the voice recognition means;
language understanding means comprising: a class score generation unit that, using the results obtained by the voice recognition means and the reliability generation means, determines, for a hierarchical structure composed of a plurality of preset categories and classes obtained by subdividing the categories, the likelihood of which class was uttered; a category understanding unit that determines each category from the results thus obtained; a word score generation unit that determines the likelihood of a recognized word; and an understanding content generation unit that generates understanding content from the results processed by these units;
storage means for storing a past recognition history used in the processing of the language understanding means;
response generation means for creating response information from the result obtained by the language understanding means; and
output means for outputting the response information, wherein
when a new utterance occurs, the language understanding means generates a new class score and a new word score using values obtained by parameter calculation on the class scores and word scores of the past recognition history and the reliability based on the latest recognition result, updates the recognition history with the new class score and word score, and understands the dialogue content using the category understanding obtained from the new class score together with the new word score, and
the parameters are statistically estimated from utterance data mixed with a plurality of types of noise, and the new class score and word score are generated using parameters corresponding to the type of detected noise.

The dialogue understanding device according to claim 1 or 2, wherein the parameters are statistically estimated from utterance data classified by noise level, and the parameters are changed on the basis of the noise level during the utterance detected by noise detection means.

The dialogue understanding device according to claim 1 or 2, wherein the device is a dialogue understanding device mounted on a vehicle, and
the parameters are statistically estimated from utterance data classified by traveling speed of the vehicle and are changed on the basis of speed information detected by traveling speed detection means.

The dialogue understanding device according to claim 1 or 2, wherein the device is a dialogue understanding device mounted on a vehicle, and
the parameters are statistically estimated from utterance data classified by travel route of the vehicle and are changed on the basis of route information detected by traveling route detection means.