JP2004251998A

JP2004251998A - Conversation understanding device

Info

Publication number: JP2004251998A
Application number: JP2003040053A
Authority: JP
Inventors: Yukihiro Ito; 幸宏伊東; Michihiko Kai; 充彦甲斐; Toshihiko Ito; 敏彦伊藤; Takeshi Ono; 健大野; Daisuke Saito; 大介斎藤; Minoru Togashi; 実富樫
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2003-02-18
Filing date: 2003-02-18
Publication date: 2004-09-09
Anticipated expiration: 2023-02-18
Also published as: JP4293340B2

Abstract

PROBLEM TO BE SOLVED: Conventional speech systems recognize an input speech signal basically by sequential acoustic recognition in word units. Because the sound source is naturally produced speech and is easily affected by ambient noise, the user's spoken words may not be understood correctly, which hinders the conversation. The aim is to understand spoken words more accurately by solving this problem.
SOLUTION: A conversation understanding device according to an embodiment of the present invention performs speech recognition on words and then classifies the recognized words into categories and classes, enabling interaction that handles the detailing, answer, correction, and re-input utterance types. It also takes the past recognition history into account and selects the more likely word in light of its relation to the context.
COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は音声対話システムによる機器の制御に関するもので、特に操作者の「手」あるいは「目」を煩わせることなく制御を行うことが要求される対話理解装置（音声情報入出力装置）に係る。
【０００２】
【従来の技術】
【特許文献１】特開平８−２７８７９３号公報
【非特許文献１】甲斐、石丸、伊藤、小西、伊東「目的地設定タスクにおける訂正発話の特徴分析と検出への応用」日本音響学会全国大会論文集２−１−８，ｐｐ．６３−６４，２００１
【非特許文献２】駒谷、河原、「音声対話システムにおける音声認識結果の信頼度の利用法」日本音響学会全国大会論文集３−５−２，ｐｐ．７３−７４，２０００
従来の音声対話システムでは音源が自然発生の音声であること、また車両等においては走行中の騒音の影響があること等のため、使用者の発話を正しく理解することが出来ず、このため使用者の意図とは異なる応答をする場合が生じていた。その結果、システムと使用者との間の対話が円滑に進まなくなり、使用者に不快感を与えることがあった。この対策として、例えば上記「非特許文献１」あるいは「非特許文献２」等が報告されているが、前者は音声認識における誤認識に対する研究であり、後者は音声認識結果に信頼度を利用した対話制御に関する研究である。これらの研究において採用されている手法は、入力された音声信号を単語単位で逐次音響的に認識することを基本とするもので、人間が実行しているような文脈情報を含めた言語認識を行っていない。このため話者の発声条件、送話側および受話側両方における背景雑音等の影響を軽減するには限界があった。
【０００３】
また、特許文献１においては、構文解析結果から候補を逐次決定し、この候補に対応する標準パターンとの尤度（尤度の定義については後述）と、テキストデータベースから算出した当該候補に対応する生起尤度／生起順序の尤度との和を候補とすることにより、小記憶容量で照合速度の高速化を図る方法も開示されている。しかし、この方法においては、認識結果から候補を選定し、その候補の尤度を求めているのみで、使用者が必要としている情報を有する最終応答であるか否かについては不問の状態にある。
【０００４】
【発明が解決しようとする課題】
上記のように、従来の方法においては入力された発話の音声信号は、逐次認識しその認識精度を向上することに重点がおかれ、使用者の必要としている内容に至っているかについては検討されていなかった。また、文章形での認識に付いても行われているが、これは予め用意されたテキストデータベースとの比較で認識が行われるものであり、使用者の要求に沿った結果であるか否かは認識過程に入る余地はなかった。本発明は、以上述べた性能上の限界を超え、実用に耐えられる認識能力を有し、使用者の必要とする情報を短時間で取得可能とする対話制御システムに適用可能な言語理解能力を有する、使用者／システム間対話理解装置を提供することを目的としている。
【課題を解決するための手段】
上記目的を達成する方法の一つとして話者の発話内容における文脈の流れから対話的に音声情報を理解する手法が考えられる。この方法によれば、単に発話音声の明瞭度あるいは了解度向上に着目した従来の方法よりも良好な結果が期待される。本発明は、この方法実現のための具体的なアルゴリズムを開示し、これに基づく具体的な装置を提供するものである。
【０００５】
本発明においては、文脈情報を利用した音声情報処理に音声認識の信頼度を組み合わせて言語理解や応答生成を行うことを基本とした。すなわち、単に従来の音声認識の信頼度を利用するのみではなく、発話の種類や対話履歴（認識履歴）の情報も利用して学習させた結果を利用することで、対話的により尤もらしい言語理解を実行させるようにした。
【０００６】
このため、請求項１においては、対話に含まれる発話をその発話が包括する広さの順に階層的に複数のカテゴリーおよび該カテゴリーを細分化して構成されるクラスに分類し、どのクラスの単語が発話されたかその確からしさを与えるクラススコアと、発話に含まれる単語とから対話内容を理解し、発話内容にさらに詳細な情報を追加する詳細化・回答の対話を行い、かつ該詳細化・回答の発話タイプに対して、前記クラススコアを新たに生成するために使用する過去の認識履歴のクラススコアを更新する際に、該過去の認識履歴のクラススコアが小さくなるように重み付けを付加した値に、最新の認識結果による新たな信頼度を加算して得られる新しいクラススコアで前記認識履歴の更新を行う対話理解装置について規定した。
【０００７】
請求項２においては、マイクロホンと音声増幅器とで構成された音声入力手段と、該音声入力手段の出力をデジタル化して音声認識を行う音声認識手段と、該音声認識手段で認識された結果の信頼度を算出する信頼度生成手段と、前記音声認識手段と前記信頼度生成手段とにより得られた結果を用いて予め設定された前記複数のカテゴリー及び該カテゴリーを細分化した前記クラスからなる階層構造に分類し、前記クラスに分類された発話の確からしさを求めるクラススコア生成部と、これにより得られた結果から前記各カテゴリーを求めるカテゴリー理解部と、認識された単語の確からしさを求める単語スコア生成部と、上記各処理部で処理された結果として理解内容を生成する理解内容生成部とからなる言語理解手段と、前記言語理解手段における処理を実行するために使用される過去の認識履歴を記憶する記憶手段と、前記言語理解手段から得られた結果から応答情報を作成する応答生成手段と、前記応答情報を出力するための出力手段とを有する請求項１に記載の対話理解装置であって、前記詳細化・回答の発話タイプに対して、前記クラススコアを新たに生成するために使用する過去の認識履歴のクラススコアを更新する際に、該過去の認識履歴のクラススコアが小さくなるように重み付けを付加した値に、最新の認識結果による新たな信頼度を加算して得られる新しいクラススコアで前記認識履歴の更新を行う対話理解装置について規定している。
【０００８】
請求項３においては、請求項１または請求項２に記載の対話理解装置において、過去の認識履歴のクラススコアの更新を次式により実行する対話理解装置について規定している。
【０００９】
Ｓｃｏｒｅ（ｃ）＝Ｓｃｏｒｅ（ｃ）＊ｗｅｉｇｈｔ_ｓ＋Ｃｏｎｆ（ｃ）
ただし、Ｓｃｏｒｅ：過去の認識履歴のクラススコア
Ｃｏｎｆ：最新の認識結果に対するクラス信頼度
ｗｅｉｇｈｔ_ｓ：重み（０．０＜ｗｅｉｇｈｔ_ｓ＜１．０）
ｃ：スコアを生成するクラス
請求項４においては、対話に含まれる発話をその発話が包括する広さの順に階層的に複数のカテゴリーおよび該カテゴリーを細分化して構成されるクラスに分類し、どのクラスの単語が発話されたかその確からしさを与えるクラススコアと、発話に含まれる単語とから対話内容を理解し、システムから誤った応答があり、それを訂正する処理、すなわち、訂正・再入力を行い、かつ該訂正・再入力の発話タイプに対して、前記クラススコアを新たに生成するために使用する過去の認識履歴のクラススコアを更新する際に、該過去の認識履歴のスコアが小さくなるように重み付けを付加した値に、最新の認識結果による新たな信頼度を加算し、同一カテゴリーで、かつ異なるクラスの信頼度全てを減算することにより得られるスコアで認識履歴の更新を行う対話理解装置について規定している。
【００１０】
請求項５においては、マイクロホンと音声増幅器とで構成された音声入力手段と、該音声入力手段の出力をデジタル化して音声認識を行う音声認識手段と、該音声認識手段で認識された結果の信頼度を算出する信頼度生成手段と、前記音声認識手段と前記信頼度生成手段とにより得られた結果を用いて予め設定された前記複数のカテゴリー及び該カテゴリーを細分化した前記クラスからなる階層構造に分類し、前記クラスに分類された発話の確からしさを求めるクラススコア生成部と、これにより得られた結果から前記各カテゴリーを求めるカテゴリー理解部と、認識された単語の確からしさを求める単語スコア生成部と、上記各処理部で処理された結果として理解内容を生成する理解内容生成部とからなる言語理解手段と、前記言語理解手段における処理を実行するために使用される過去の認識履歴を記憶する記憶手段と、前記言語理解手段から得られた結果から応答情報を作成する応答生成手段と、前記応答情報を出力するための出力手段とを有する請求項１または請求項４に記載の対話理解装置であって、前記訂正・再入力の発話タイプに対して、前記クラススコアを新たに生成するために使用する過去の認識履歴のクラススコアを更新する際に、該過去の認識履歴のスコアが小さくなるように重み付けを付加した値に、最新の認識結果による新たな信頼度を加算し、同一カテゴリーで、かつ異なるクラスの信頼度全てを減算することにより得られるスコアで認識履歴の更新を行う対話理解装置について規定している。
【００１１】
請求項６においては請求項４または請求項５記載の対話理解装置において、過去の認識履歴の更新を次式により実行する対話理解装置について規定している。
【００１２】
Ｓｃｏｒｅ（ｃａ）＝Ｓｃｏｒｅ（ｃａ）＊ｗｅｉｇｈｔ_ｔ−Ｃｏｎｆ（ｃｂ）＋Ｃｏｎｆ（ｃａ）
ただし、Ｓｃｏｒｅ：認識履歴のクラススコア
Ｃｏｎｆ：最新認識結果のクラス信頼度
ｗｅｉｇｈｔ_ｔ：重み（０．０＜ｗｅｉｇｈｔ_ｔ＜１．０）
ｃａ：スコアを生成するクラス
ｃｂ：ｃａと同じカテゴリーで異なるクラス
請求項７においては、詳細化・回答の発話タイプか、訂正・再入力の発話タイプかを判断する判断手段を有し、この判断結果に基づいて請求項３に記載の演算式を用いるか、請求項６に記載の演算式を用いるかを決定すること対話理解装置について規定している。
【００１３】
【発明の効果】
本発明によれば、以上述べたように、単に単語の音声認識を行うのみならず、認識した単語をさらにカテゴリーとクラスとに分類し、文脈との関連を考慮して、より尤らしい語の選定を行う手法を採用することにより効率良く認識精度をさらに向上することが出来た。例えば、車両用ナビゲーションシステムにおける音声入力のように、雑音の大きな環境下で用いるときには特に有効である。
【発明の実施の形態】
以下、本発明による実施の形態を図により説明する。
図１は本発明による対話理解装置の基本構成を示すもので、入力されたアナログ音声入力信号は音声入力部１０１でデジタル信号に変換される。ここで、音声入力部１０１はマイクロホン、入力増幅器、Ａ／Ｄコンバータから構成されている。このデジタル化された音声信号は音声認識部１０２に入力され、使用者から入力される音声信号と、音声信号認識部１０２内に記憶してある認識対象文とのマッチング処理を行い、複数の認識結果候補文およびそれらの尤度（詳細は後述）を出力する。これら出力情報は信頼度生成部１０３において、使用者からの単一の発話に伴って入力される上記複数の認識結果候補文から、この認識結果候補文に含まれる単語と、これら単語の分類を示すクラスの尤もらしさを示す信頼度を出力する。
ここで、クラスとは図２に示すように目的地を示す表現形式を階層構造的に分類する。ここでカテゴリーは包括する範囲が広いほうから狭いほうに順次配列され、クラスは各カテゴリーに含まれる単語を内容別に分類したものである。図２の例では例えば、各単語は上位（ＰＲ）、中位（ＨＲ）、下位（ＬＭ）の３カテゴリーに分類され、さらに各カテゴリーにおいてそれぞれ複数のクラスに分類される。例えば図２の場合、上位カテゴリーでは「県」の１クラスのみであるが、下位カテゴリーでは「インターチェンジ」、「市区町村」、「駅」の３クラスを有している。
【００１４】
単語単位での信頼度は以下のようにして求められる。すなわち、まず、単語の認識結果から得られた候補単語列（例えば複数の単語で形成された文章）の第１位から第Ｎ位までの尤度の高い順に配列した単語列（以下Ｎ−ｂｅｓｔ候補と称する）と、それぞれの単語に対する対数尤度を求める。ここで、尤度とは認識結果から得られる音声信号列がＹである時、使用者が発話した音声信号列がＷである事後確率で定義される値で、「音声信号列に関する仮説Ｗに対し、音声信号列Ｙが観測される事前確率」と「音声信号列Ｗが発話される確率」との積と、音声信号列Ｙが観測される確率との比のうち最大確率である。
【００１５】
これにより第１位候補に含まれる単語ｗの信頼度Ｃｏｎｆ（ｗ）を以下の（数１）式から求める。
【００１６】
【数１】

（数１）式において単語ｗがＮ−ｂｅｓｔ候補の中でｉ番目の候補に含まれている確からしさｐ_ｉは下記の（数２）式から求められる。ここで、Ｌ_ｉはＮ−ｂｅｓｔ候補それぞれに対する対数尤度である。
【００１７】
【数２】

また、クラス単位での信頼度は上記単語単位の場合と同様に、第１位候補に含まれる各単語ｗのクラスＣ_ｗにより、信頼度Ｃｏｎｆ（Ｃ_ｗ）を以下の（数３）式から求められる。
【００１８】
【数３】

ここで、上記単語単位の場合と同様、ｐ_ｉは下記の（数４）式から求められる。
【００１９】
【数４】

以上のようにして得られた認識データ（認識結果候補文、尤度及び信頼度）は言語理解部１０４に入力される。この言語理解部１０４はクラススコア生成部１０５、カテゴリー理解部１０６、単語スコア生成部１０７および理解内容生成部１０８の各部で構成されており、使用者からの複数回にわたる発話に伴って入力される単語と、その属するクラスの信頼度とから理解結果を生成する機能を有する。ここで、クラススコア生成部１０５は、使用者からの複数回にわたる発話に伴って入力される単語のクラス信頼度からどのクラスが発話されたかを示すスコアを計算するものであり、カテゴリー理解部１０６は使用者からの複数回にわたる発話に伴って入力されるクラススコアからクラスの分類を示すカテゴリーの理解結果、すなわち、どのカテゴリーが発話されたかを出力するものである。また、単語スコア生成部１０７は、使用者からの複数回にわたる発話に伴って入力される単語の信頼度から、どの単語が発話されたかを示すスコアを計算し、理解内容生成部１０８は、上記で得られたカテゴリー理解結果（１０６出力）および単語スコア（１０７出力）から理解内容を生成する機能を有する。
【００２０】
以上のようにして得られた言語理解部１０４の出力情報は応答生成部１０９に入力され、上記言語理解部１０４で得られた理解内容から応答文を生成する。この応答文は音声合成部１１０でデジタル信号として合成され、図示しないが音声合成部１１０内蔵のＤ／Ａコンバータ、出力増幅器を経て音声出力として出力する。一方、この出力応答文はＧＵＩ表示部１１１を経て図示しないが表示装置上に表示する。なお、認識履歴１１２は過去の認識状況を履歴データとして記憶しておく例えばハードディスク記憶装置等の記憶装置である。
【００２１】
次に上記装置構成の作用について説明する。
まず本発明の実施の形態で扱う目的地の表現形式を説明する。インターチェンジ、駅、市区町村名を目的地に設定することができ、各々には県、自動車道、鉄道路線を付加することができる。前記のように図２はこれら表現形式を階層構造的に表示したものである。すなわち、本実施の形態では、目的地を上位、中位、下位３段階の部分発話の組み合わせにより発話することができ、これを本実施の形態ではこの３段階の各々をカテゴリーと呼ぶ。上位カテゴリーＰＲでは、県（都道府県）を発話することができ、中位カテゴリーＨＲでは自動車道、または鉄道路線を発話することができ、下位カテゴリーＬＭではインターチェンジ、市区町村、駅を発話することができる。
【００２２】
本発明の実施の形態においては、対話形式での目的地設定をより柔軟な発話によって行うことを目的としている。すなわち、使用者は例えば、「静岡県の東名自動車道の浜松西インターチェンジ」と言うように、一度ですべてのカテゴリーを発話することもできる。また第一の発話で「静岡県」と発話し、第二の発話で「東名高速の浜松西インターチェンジ」と発話するように複数回に分けて発話することも可能である。
また使用者が複数回の発話を行うとき、過去の発話に対してより詳細な情報を追加していく詳細化発話を可能とするものである。例えば、第一の発話で、「静岡県の」と発話し、第二の発話で「浜松市」と発話することが可能である。また使用者が複数回の発話を行うとき、システムの応答結果を訂正する発話を可能とするものである。例えば、第一の発話「静岡県の浜松市」に対して、第一の応答「静岡県の浜松西インターチェンジですか」と誤った応答がなされたとき、第二の発話で「いいえ浜松市です。」と発話することが可能である。
また使用者が複数回の発話を行うとき、システムからの応答が質問であったときに、それに回答する発話も可能とするものである。例えば、第一の応答が「静岡県の何インターチェンジですか」であったとき、第二の発話で「浜松西インターチェンジです」と発話することが可能である。
また使用者が複数回の発話を行うとき、システムからの応答が再入力を促す発話であったときに、それに回答する発話を可能とするものである。例えば、第一の応答「もう一度発話してください」であったとき、第二の発話で第一の発話と同様の発話を行うことが可能である。
本実施の形態における認識対象語は図３に例示するようなものである。本実施の形態における対話例は図４に示すようなものである。図４中、Ｕは使用者の発話であり、Ｓはシステムからの応答であり、数字は発話順である。
【００２３】
次に、本発明の実施の形態における動作を図５のフローチャートを用いて説明する。
ステップ３０１で処理を開始し、まず、使用者が発話開始を指示するために、図示しないが音声入力スイッチ（発話スイッチ）がオン状態に操作されたこと検出（ステップ３０２）した場合、音声信号の取り込み開始のステップ（ステップ３０３）に移行する。ここで、音声入力スイッチのオン状態への操作が検出されない場合は、この操作が検出されるまでステップ３０２で待ち状態となる。
ステップ３０３では、使用者は認識対象文に含まれる発話を行う（例えば図３に例示した語等）。図１における音声入力部１０１は、マイクロホンからの信号をＡ／Ｄコンバータでデジタル信号に変換し、音声認識部１０２に出力する。音声認識部１０２は発話スイッチの操作がなされるまでは、前記デジタル信号の平均パワーの演算を継続している。前記発話スイッチが操作された後、前記平均パワーにくらべてデジタル信号の瞬時パワーが所定値以上に大きくなった時、使用者が発話したと判断し、音声信号の取り込みが開始される。
取り込まれた音声信号は、図１における音声認識部１０２において、記憶してある認識対象文と入力されたデジタル化された音声信号とを比較し、尤度を演算する（ステップ３０４）ことにより、複数の候補を設定する。なお本ステップ３０４を実行する間も、並列処理により上記の音声信号取り込みは継続されている。
デジタル化された音声信号の瞬時パワーが所定時間以上所定値以下の状態が継続した時、システム側では使用者の発話が終了したと判断し、音声信号の入力処理を終了する（ステップ３０５）。これにより、図１における音声認識部１０２は複数の認識結果候補文を尤度順にならべた上位Ｎ候補を、尤度データとともに出力する。図６にこの出力結果の例を示す。図６において、ＸＸＸと記されている部分は、各単語に対する算出された尤度を示している。
前記のＮ−Ｂｅｓｔ候補と呼ばれる音響的な尤度で順位付けられた複数の候補からなる認識結果をもとに、単語とクラスの２種類の信頼度について音響的な尤度とＮ−Ｂｅｓｔ候補中の出現頻度から、事後確率に基づく尺度として信頼度が演算される（ステップ３０６）。この演算は図１における信頼度生成部１０３において実行されるもので、演算結果の例を図７に示す。図７において、左側の表は図６で示した音声認識部出力であり、右側の表の単語信頼度は、ある単語が発話された可能性を示し、クラス信頼度はあるクラスの単語が発話された可能性を示す。なお、本演算に関しては前記「従来の技術」の項で述べた「非特許文献２」駒谷他、”音声対話システムにおける音声認識結果の信頼度の利用法”、日本音響学会講演論文集、３−５−２、ｐｐ７３−７４、２０００に詳述されている。
【００２４】
以上のようにして発話された単語の信頼度を求めて尤らしい単語の推定が行われるが、本発明においては、システムと使用者との間での対話により単語推定の精度をさらに向上させている。このため、図１におけるクラススコア生成部１０５においてクラススコアが演算されるが（ステップ３０７）、このクラススコア演算に先立ち、使用者の発話タイプの判定が行われる。すなわち、第一の発話タイプは、以前の情報に新しい情報を追加する働きがある発話タイプである。例えば、詳細化および回答の処理がこれに相当する。また第二の発話タイプは、以前の情報を訂正する働きがある。例えば、訂正および再入力の処理がこれに相当する。このいずれの発話タイプであるかの判定は図８に示すように、判定材料の欄に記載されている判定材料の状況に対して発話タイプが判定される。また、これ以外の判定方法も存在する。例えば、地名入力でよく用いられる部分的な言い直し発生をＤＰマッチングによるワードスポッティング法を用いて検出する方法があり、これに関しては、角谷、北岡、中川”カーナビの地名入力における誤認識時の訂正発話の分析と検出、情報処理学会研究報告、音声言語情報処理３７−１１、２００１に詳述されている。
発話タイプが判定された後に、クラススコア生成部１０５においてクラススコアが生成される。クラススコアは、対話中すなわち使用者の複数回の発話中におけるクラスの尤もらしさを示す値である。この場合、以前に理解した情報を残しつつ、新しい情報を付加することで、より適切にスコアを生成することができる。このクラススコアの生成は前記の発話タイプ別に異なる生成式を用いて行われる。したがって、図５におけるステップ３０７は図９に示すように２分割された処理が行われることになる。すなわち図８の判定材料の欄に記載の状況によりステップ３１５で詳細化、回答の発話タイプに該当するか否かを判定し、該当する場合はステップ３１６で処理し、該当しないで訂正、再入力の発話タイプの場合はステップ３１７で処理された後いずれの場合も処理はステップ３０８に移行する。
【００２５】
詳細化、回答の発話タイプにおける場合、すなわち図９におけるステップ３１６の場合のクラススコアは（数５）式で求められる。
Ｓｃｏｒｅ（ｃ）＝Ｓｃｏｒｅ（ｃ）＊ｗｅｉｇｈｔ_ｓ＋Ｃｏｎｆ（ｃ）（数５）
但し、Ｓｃｏｒｅはクラススコアであり、（数５）式の左辺が新たに求められたクラススコアであり、（数５）式の右辺が過去の（認識履歴１１２から読み出した）クラススコアに対する処理である。Ｃｏｎｆは最新の認識結果から得られたクラス信頼度である。ｗｅｉｇｈｔ_ｓは０．０〜１．０の値を採る重みである。ｃはスコアを生成するクラスである。重みｗｅｉｇｈｔ_ｓにより一定の割合で更新前のクラススコアを下げているのは、”情報が古くなるごとに信頼性が低下する”という方針を適用しているからである。また、ｗｅｉｇｈｔ_ｓは、実際の発話データを用いて実験的に求めることができる。更新されたクラススコアは認識履歴１１２に書き込まれる。
【００２６】
詳細化・回答発話タイプのクラス生成の様子を図１０に示す。使用者は、過去の発話（旧クラススコア１．００）で「県」「鉄道路線」の発話を行っており、最新の発話（新クラス信頼度欄が０．８１）で「駅」を発話している。この場合のクラススコア生成は（数５）式に基づいて行われる。
【００２７】
訂正・再入力の発話タイプの場合、すなわち図９におけるステップ３１７の場合におけるクラススコアは（数６）式で求められる。
Ｓｃｏｒｅ（ｃａ）＝Ｓｃｏｒｅ（ｃａ）＊ｗｅｉｇｈｔ_ｔ−Ｃｏｎｆ（ｃｂ）＋Ｃｏｎｆ（ｃａ）（数６）
但し、Ｓｃｏｒｅはクラススコアであり、（数６）式の左辺が新たに得られたクラススコアであり、（数６）式の右辺が過去の（認識履歴１１２から読み出した）クラススコアである。Ｃｏｎｆは最新の認識結果から得られたクラス信頼度である。ｗｅｉｇｈｔ_ｔは０．０〜１．０の値を採る重みである。ｃａはスコアを生成するクラスであり、ｃｂはｃａと同じカテゴリーで異なる全てのクラスである。（数５）式と比較し、同カテゴリー、異クラスの信頼度を減算していることである。これによりクラスを間違えた場合にスコアが修正され易くなる。更新されたクラススコアは認識履歴１１２に書き込まれる。
訂正・再入力発話タイプのクラス生成の様子を図１１に示す。使用者は、過去の発話で「県」クラスの発話を行っており、クラススコアの値が不十分でカテゴリーを特定できず、システム応答は「もう一度発話して下さい」を出力している。使用者は次に再度同じ「県」クラスの発話を行い更新後のクラススコアを得ている（例えば、「県」の発話に対しては旧クラススコアと新クラス信頼度の両方の欄にスコアが記載されている）。この場合のクラススコア生成は（数６）式に基づいて行われている。
【００２８】
続いて、カテゴリー理解処理のステップ３０８に移るが、この処理は図１におけるカテゴリー理解部１０６で、過去の（認識履歴から読み出した）クラススコアと最新の認識結果におけるクラス信頼度との両方に対してカテゴリースコアを計算することにより実行される。この処理の様子を図１２に示す。カテゴリースコアは、図１２のａで表示した部分およびＢで表示した部分におけるそれぞれの欄の数字から知れるように、同じカテゴリーに属する全てのクラススコアあるいは信頼度を加算したものである。それぞれのカテゴリースコアは閾値で判定され、ＰＲ（上位）、ＨＲ（中位）、ＬＭ（下位）の３カテゴリーに対して、判定結果の論理和を計算する。そこで得られた結果が、現在までに発話されたカテゴリーの組み合わせを示している。クラススコアが図１２であった場合、それに続くカテゴリー理解の様子を図１３に示す。すなわち、旧および新スコアから各カテゴリーに対して判定を行い、その結果としてカテゴリー理解が得られる。
【００２９】
次に、ステップ３０９の単語スコア生成が行われるが、このステップ３０９は図１における単語スコア生成部１０７で実行され、
１）過去の（認識履歴１１２中に既に存在する）単語、および
２）新たに出現した単語（最新の認識結果中の単語）
の２つに対して、各々別々の方針を用いてスコアを生成する。後者２）の場合の単語は、最新の認識結果のＮ−Ｂｅｓｔ候補に含まれる全単語が対象となる。スコア生成は、図１における言語理解部１０４が最新の認識率を獲得するたびに、１）→２）の順番で実行される。
【００３０】
上記１）の認識履歴中に存在する単語は、単語の新しさ、システムの応答内容とユーザ発話タイプ（詳細化、訂正、回答、再入力）から、既存の単語スコアを上下させて、新しい単語スコアを生成する。これには以下５種類の方針を使用する。
方針１：古い情報は、信頼性が低くなるという仮定のもとに、新しい認識結果が入力されるたびに、認識履歴中に存在する全ての単語のスコアを下げる。
方針２：認識履歴中の単語Ａと認識結果単語Ｂが詳細化の関係にあった場合、単語Ａのスコアを上げる。
方針３：認識履歴中の単語Ａと認識履歴中の単語Ｂが訂正の関係にあった場合、単語Ａのスコアを下げる。
方針４：認識結果に肯定（はい、うん等）が含まれていた場合、応答に含まれていた単語のスコアを上げる。
方針５：認識結果に否定語（いいえ、ちがう等）が含まれていた場合、応答に含まれていた単語のスコアを下げる。
認識履歴中の単語スコアの生成は、下記の（数７）式による。
【００３１】

但し、Ｓｃｏｒｅは認識履歴中の単語のスコアであり、右辺が更新前、左辺が更新後である。Ｗｄは計算対象となる認識履歴１１２中の単語である。方針１に対応する項としては、ｐ１があり単語のスコアを下げる項である。方針２と方針３に対応する項に関しては、ｐ２、ｐ３は重み付け、Ｃｏｎｆは最新の認識結果から得られる信頼度であり、Ｗｓは最新の認識結果に含まれ、Ｗｄと詳細化の関係にある全ての単語であり、Ｗｔは最新の認識結果に含まれＷｄとは訂正の関係にある全ての単語である。方針４、方針５に対応する項に関しては、ｉは前回のシステム応答に単語が含まれている場合はｉ＝１となり、含まれていない場合はｉ＝０となる。またｙｅｓは最新の認識結果に含まれる肯定語を示し、ｎｏは今回の認識結果に含まれる否定語を示し、ｒｅｊは今回の認識結果に含まれる文末否定語を示す。
【００３２】
前記２）における最新の認識履歴中の単語であって、認識履歴にまだ登録されていない単語、すなわち新たに出現した単語のスコアの生成は、応答内容とユーザ発話タイプ（詳細化、訂正、回答、再入力）、Ｎ−Ｂｅｓｔの順位、発話長（発話された単語の数）により、音声認識の信頼度を上下させて、単語スコアを生成する。これには以下４種類の方針を使用する。
方針６：認識結果の単語Ａと応答とに含まれる単語Ｂが詳細化の関係にある場合、単語Ａのスコアを上げる。
方針７：システム応答が質問（例、何インターチェンジですか？）であって、認識結果の内容が回答である場合、認識結果の単語のスコアを上げる。
方針８：認識結果の上位には正解単語が多く含まれているので、上位に含まれる単語のスコアを上げる。
方針９：発話長が長い発話（短い発話）は認識されやすい（認識されにくい）ため、１カテゴリーの結果はその単語のスコアを下げ、２カテゴリー以上の単語はそのスコアを上げる。
【００３３】
最新の認識履歴中の単語であって、認識履歴にまだ登録されていない単語のスコアの生成は、以下の（数８）式による。
【００３４】

但し、Ｓｃｏｒｅは認識履歴中の単語のスコアであり、Ｃｏｎｆは最新の認識結果から得られる信頼度である。Ｗｄは計算対象となる認識履歴中の単語である。方針６に対応する項に関しては、ｐ６が重み付けであり、Ｗｓは認識履歴に含まれるＷｄと詳細化の関係を持つ全ての単語である。方針７に対応する項に関しては、ｐ７は重み付け、認識結果が質問に対する回答である場合の認識結果に含まれる単語である。方針８に対応する項としてはｐ８がＮ−Ｂｅｓｔの順位の高さに応じた重み付けである。方針９に対応する項としてはｐ９、ｐ１０が重み付けであり、ｌｅｎ２は認識のカテゴリーが２以上であるときｌｅｎ２＝１になり、ｌｅｎ１は認識のカテゴリーが１であるときｌｅｎ１＝１になる値である。
上記１）で更新された単語のスコア、上記２）で追加された単語、およびそのスコアは統合された認識履歴として、認識履歴１１２に書き込まれる。統合された認識結果の例を実際の県名、鉄道名等を実例として図１４に示す。図中同名が複数存在する場合（厚木、田無等）があるが、これは複数路線に含まれる駅の名称などである。
【００３５】
上記により得られたカテゴリー理解結果、および前記統合された認識履歴とから、妥当な組み合わせとして複数個の候補を生成する。すなわち、上記により得られた情報を基に本装置が理解した内容として、尤らしい候補を複数個生成する（ステップ３１０）。この処理は図１における理解内容生成部１０８において実行される。図１３の結果から、ＰＲ、ＨＲおよびＬＭの３カテゴリーが発話されており、図１４から前記に該当し、実際に存在する組み合わせを抽出し候補とする。各カテゴリーのスコアの和が最大のものを選択する。その結果を図１５に示す。理解結果として、＜ＰＲカテゴリー＝愛知、スコア＝１．４７＞、＜ＨＲカテゴリー＝名古屋鉄道、スコア＝１．１７＞、＜ＬＭカテゴリー＝豊橋、スコア＝０．６２＞が選択されている。
【００３６】
以上、図１における言語理解部１０４の各ステップで処理された結果である理解内容から応答フラグを生成する（ステップ３１１）までの全処理過程を説明した。これにより得られた出力（応答）情報は図１の応答生成部１０９で実行される。この応答フラグの種類を図１６に示す。なお、図１６における各ビット（ａ乃至Ｆの各ビット）が示す内容を図１７に示す。前記理解結果から、カテゴリーに該当する単語が存在する場合、該当するフラグを立てるがこの場合スコアを４段階で評価した値（ビット数）のフラグを立てる。すなわち、スコアが最大から最小までを評価１から評価４とし、フラグは１０００、０１００、００１０、０００１とする。
【００３７】
応答生成部１０９は、上記の応答フラグを利用し、対話における以下の方針に沿った応答を行う。
応答方針１：了承（相槌）
下位カテゴリーがなく、上位カテゴリーまたは中位カテゴリーのスコア評価が評価１の場合、対話をスムーズに進めるための応答を行う。
例ユーザ発話 …「静岡県」
システム応答…「はい」
応答方針２：復唱
スコア評価が２の場合や、ユーザ発話の文頭に否定語が来た場合は確認の意味も込めて復唱を行う。
例ユーザ発話 …「静岡県」
システム応答…「静岡県」
応答方針３：最終確認
下位カテゴリーが発話され、信頼できる（スコア評価が１か２）場合は、最終確認を行う。
例ユーザ発話 …「浜松インターから乗ります」
システム応答…「浜松インターを設定してよろしいですか」
応答方針４：目的地設定
前応答に下位カテゴリーがあり、肯定発話が信頼できる（スコア評価が１か２）場合は、目的地に設定する。
例システム応答…「浜松インターを設定してよろしいですか」
ユーザ発話 …「はい」
システム応答…「目的地に設定しました」
応答方針５：分からない情報のみ尋ねる
ユーザに対して分からない情報のみを尋ねる。
例ユーザ発話…「静岡県の東名自動車道です」（下線部のスコア評価が低い時）
システム応答…「静岡県の何自動車道ですか？」
応答方針６：自信のない情報は応答しない
上位カテゴリー（ＰＲ）と中位カテゴリー（ＨＲ）の組み合わせで、どちらか一方だけ信頼できない（スコア評価が４）場合、スコアの高いものだけ応答することで対話を進める。
例ユーザ発話…「静岡県の東名自動車道」（下線部のスコア評価が低いとき）
システム応答…「東名自動車道の」
応答方針７：別情報の付加
情報が少なく、スコア評価が悪いときに、上のカテゴリーも聞くことによって認識率の向上を図る。
例ユーザ発話…「浜松インターから乗る」（下線部のスコア評価が低いとき）
システム応答…「何県のインターですか」
応答方針８：次の発話を促す
上位カテゴリーにつづいて肯定発話がきて、信頼できる場合（スコア評価が１か２の場合）次の発話を促す。
例システム応答…「東名自動車道」
ユーザ発話 …「はい」
システム応答…「東名自動車道のどこですか」
応答方針９：別の候補を返す
否定発話が信頼できる場合（スコア評価が１か２の場合）前回の応答に用いていない別候補を返す。
例システム応答…「浜松インターを設定しますか」
ユーザ発話 …「いいえ」
システム応答…「浜松西インターを設定しますか」
応答方針１０：前応答の繰り返し
肯定発話や否定発話が信頼できない場合（スコア評価が４の場合）
例システム応答…「浜松インターを設定してよろしいですか」
ユーザ発話 …「はい」（下線部のスコア評価が低いとき）
システム応答…「浜松インターを設定してよろしいですか」
応答方針１１：聞き返し
全ての情報に対して信頼できない場合（スコア評価が４の場合）
例ユーザ発話 …「静岡県」（下線部のスコア評価が低いとき）
システム応答…「もう一度発話してください」
応答生成部１０９は、上記の対話方針を実施するために、前記の応答フラグを、図１６のフラグテーブルと照らし合わせ、フラグが最初に一致した応答パターンで応答を返す。
応答生成部１０９が前記理解結果から生成した応答フラグは
”１１１１０００１０００１００００１０００００００００００００００”
であり、図１６のフラグテーブルとの参照の結果、図示しないが応答パターン
“ＰＲカテゴリー単語”、“ＰＲカテゴリークラス”の
“ＨＲカテゴリー単語”、“ＨＲカテゴリークラス”の
“ＬＭカテゴリー単語”、“ＬＭカテゴリークラス”を設定してよろしいですか。が選択され、その結果、
「愛知県の名古屋鉄道の豊橋駅を設定してよろしいですか」が応答文として生成される。
【００３８】
以上のようにして生成された応答はステップ３１２で実行されるもので、図１における音声合成部１１０を経由して音声信号として出力され、またＧＵＩ表示部１１１を経由してディスプレイ上に表示される。
この段階で、入力処理が全て完了したか否かの確認が行われる（ステップ３１３）。すなわち、下位カテゴリー（ＬＭ）の単語が確定している場合は（ステップ３１３でｙｅｓの場合）、ステップ３１４に移行し全ての入力処理を終了する。もし、下位カテゴリー（ＬＭ）の単語が確定していない場合（ステップ３１３でｎｏの場合）は処理を継続する。本例では、「愛知県の名古屋鉄道の豊橋駅を設定してよろしいですか」が応答されている段階であり、次に使用者が「はい」を発話することで、「目的地に設定しました」の応答を行ったのち処理を終了する。
【図面の簡単な説明】
【図１】本発明による対話理解装置の基本構成ブロック図。
【図２】発話された単語の階層構造的分類法を示す構成図。
【図３】認識対象語と発話タイプとの関係を示す対応図。
【図４】システム／使用者間での対話の例を示す発話・応答図。
【図５】システムの動作を示すフロー図。
【図６】音声認識部の出力としての認識結果候補文と尤度との関係を示す対象図。
【図７】認識結果候補文と尤度との関係から信頼度を求める対象図。
【図８】発話タイプと発話タイプ判定材料との対象図。
【図９】発話タイプによる処理の使い分けを示すフロー図。
【図１０】詳細化・回答発話タイプにおける更新後のクラススコア生成過程を示す旧クラススコアとの対象図。
【図１１】訂正・再入力発話タイプにおける更新後のクラススコア生成過程を示す旧クラススコアとの対象図。
【図１２】クラススコア演算手順を示す新旧スコア比較図。
【図１３】カテゴリー理解処理における新旧スコア演算過程を示す対象図。
【図１４】統合された認識結果の実例を示すスコア対象図。
【図１５】言語理解最終スコアの項目別比較図。
【図１６】応答フラグと応答パターン対象図。
【図１７】応答フラグとその内容対象図。
【符号の説明】
１０１：音声入力部１０２：音声認識部
１０３：信頼度生成部１０４：言語理解部
１０５：クラススコア生成部１０６：カテゴリ理解部
１０７：単語スコア生成部１０８：理解内容生成部
１０９：応答生成部１１０：音声合成部
１１１：ＧＵＩ表示部

[0001]
[Technical field of the invention]
The present invention relates to the control of equipment by a voice dialogue system, and in particular to a dialogue understanding device (a voice information input/output device) that is required to perform control without occupying the operator's hands or eyes.
[0002]
[Prior art]
[Patent Document 1] JP-A-8-278793
[Non-Patent Document 1] Kai, Ishimaru, Ito, Konishi, Ito, "Feature analysis of correction utterances in a destination setting task and its application to detection," Proceedings of the National Meeting of the Acoustical Society of Japan, 2-1-8, pp. 63-64, 2001
[Non-Patent Document 2] Komatani, Kawahara, "Use of the reliability of speech recognition results in a voice dialogue system," Proceedings of the National Meeting of the Acoustical Society of Japan, 3-5-2, pp. 73-74, 2000
In a conventional voice dialogue system, the sound source is naturally produced speech, and in a vehicle or the like there is also the influence of noise while driving. As a result, the system cannot always understand the user's utterance correctly and sometimes gives a response that differs from the user's intention. The dialogue between the system and the user then fails to proceed smoothly, which can make the user uncomfortable. As countermeasures, Non-Patent Document 1 and Non-Patent Document 2 above, for example, have been reported. The former is a study of misrecognition in speech recognition, and the latter is a study of dialogue control that uses the reliability of speech recognition results. The methods adopted in these studies are based on sequential acoustic recognition of the input speech signal word by word, and they do not perform language recognition that includes contextual information as humans do. For this reason, there was a limit to how far the influence of the speaker's utterance conditions and of background noise on both the transmitting side and the receiving side could be reduced.
[0003]
Patent Document 1 also discloses a method in which candidates are determined sequentially from the result of syntactic analysis, and the candidate is chosen by the sum of its likelihood against the corresponding standard pattern (likelihood is defined later) and the occurrence likelihood / occurrence-order likelihood for that candidate calculated from a text database, thereby speeding up matching with a small storage capacity. However, this method only selects candidates from the recognition result and obtains their likelihoods; it does not ask whether the final response carries the information the user needs.
[0004]
[Problems to be solved by the invention]
As described above, conventional methods concentrate on recognizing the input speech signal sequentially and improving that recognition accuracy; whether the result reaches the content the user needs has not been examined. Recognition in sentence form has also been performed, but it relies on comparison with a text database prepared in advance, so there was no room in the recognition process for considering whether the result meets the user's request. An object of the present invention is to provide a user/system dialogue understanding device that goes beyond the performance limits described above, has recognition capability sufficient for practical use, and has language understanding capability applicable to a dialogue control system that lets the user obtain the required information in a short time.
[Means for Solving the Problems]
As one of the methods for achieving the above object, a method of interactively understanding speech information from a flow of context in a speaker's utterance content is considered. According to this method, better results can be expected than in the conventional method that simply focuses on improving the clarity or intelligibility of the uttered voice. The present invention discloses a specific algorithm for realizing this method, and provides a specific device based on the algorithm.
[0005]
The present invention is based on performing language understanding and response generation by combining speech information processing that uses context information with the reliability of speech recognition. That is, rather than using only the conventional speech recognition reliability, it also uses results learned from the utterance type and the dialogue history (recognition history), so that language understanding that is more plausible within the dialogue is carried out.
[0006]
To this end, claim 1 defines a dialogue understanding device that classifies the utterances contained in a dialogue hierarchically into a plurality of categories, ordered by the breadth each utterance covers, and into classes formed by subdividing those categories; understands the dialogue content from a class score, which gives the likelihood of which class of word was uttered, and from the words contained in the utterance; conducts a detailing/answer dialogue that adds more detailed information to the utterance content; and, for the detailing/answer utterance type, when updating the class score of the past recognition history that is used to generate the new class score, updates the recognition history with a new class score obtained by weighting the past class score so that it becomes smaller and adding the new reliability from the latest recognition result.
[0007]
Claim 2 defines the dialogue understanding device of claim 1, comprising: voice input means composed of a microphone and a voice amplifier; speech recognition means that digitizes the output of the voice input means and performs speech recognition; reliability generation means that calculates the reliability of the result recognized by the speech recognition means; language understanding means composed of a class score generation unit that, using the results obtained by the speech recognition means and the reliability generation means, classifies utterances into the preset hierarchical structure of the plurality of categories and the classes subdividing them and obtains the likelihood of the utterance classified into each class, a category understanding unit that determines each category from the result, a word score generation unit that obtains the likelihood of each recognized word, and an understanding content generation unit that generates the understanding content from the results of these units; storage means that stores the past recognition history used for the processing in the language understanding means; response generation means that creates response information from the result obtained by the language understanding means; and output means that outputs the response information. For the detailing/answer utterance type, when updating the class score of the past recognition history that is used to generate the new class score, the device updates the recognition history with a new class score obtained by weighting the past class score so that it becomes smaller and adding the new reliability from the latest recognition result.
[0008]
Claim 3 defines the dialogue understanding device of claim 1 or claim 2 in which the class score of the past recognition history is updated by the following equation.
[0009]
Score(c) = Score(c) * weight_s + Conf(c)
where
Score: class score of the past recognition history
Conf: class reliability for the latest recognition result
weight_s: weight (0.0 < weight_s < 1.0)
c: class for which the score is generated
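The update in claim 3 can be sketched in a few lines of Python. This is only an illustration of the stated formula: the function name `update_class_score` and the example values (including a weight of 0.5) are assumptions, since the text says only that the weight lies strictly between 0.0 and 1.0.

```python
def update_class_score(old_score: float, conf: float, weight_s: float) -> float:
    """Detailing/answer update: Score(c) = Score(c) * weight_s + Conf(c)."""
    if not 0.0 < weight_s < 1.0:
        raise ValueError("weight_s must satisfy 0.0 < weight_s < 1.0")
    # The stored score decays by weight_s, then the newest class reliability is added.
    return old_score * weight_s + conf


# Illustrative values only (not taken from the patent's figures).
new_score = update_class_score(old_score=1.0, conf=0.25, weight_s=0.5)
print(new_score)
```

Because the weight is below 1.0, a class that is not confirmed again loses score on every turn, while a class that keeps being recognized accumulates it.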
Claim 4 defines a dialogue understanding device that classifies the utterances contained in a dialogue hierarchically into a plurality of categories, ordered by the breadth each utterance covers, and into classes formed by subdividing those categories; understands the dialogue content from a class score, which gives the likelihood of which class of word was uttered, and from the words contained in the utterance; performs correction/re-input, that is, processing that corrects an erroneous response from the system; and, for the correction/re-input utterance type, when updating the class score of the past recognition history that is used to generate the new class score, updates the recognition history with a score obtained by weighting the past score so that it becomes smaller, adding the new reliability from the latest recognition result, and subtracting all the reliabilities of the different classes in the same category.
[0010]
Claim 5 defines the dialogue understanding device of claim 1 or claim 4, comprising: voice input means composed of a microphone and a voice amplifier; speech recognition means that digitizes the output of the voice input means and performs speech recognition; reliability generation means that calculates the reliability of the result recognized by the speech recognition means; language understanding means composed of a class score generation unit that, using the results obtained by the speech recognition means and the reliability generation means, classifies utterances into the preset hierarchical structure of the plurality of categories and the classes subdividing them and obtains the likelihood of the utterance classified into each class, a category understanding unit that determines each category from the result, a word score generation unit that obtains the likelihood of each recognized word, and an understanding content generation unit that generates the understanding content from the results of these units; storage means that stores the past recognition history used for the processing in the language understanding means; response generation means that creates response information from the result obtained by the language understanding means; and output means that outputs the response information. For the correction/re-input utterance type, when updating the class score of the past recognition history that is used to generate the new class score, the device updates the recognition history with a score obtained by weighting the past score so that it becomes smaller, adding the new reliability from the latest recognition result, and subtracting all the reliabilities of the different classes in the same category.
[0011]
Claim 6 defines the dialogue understanding device of claim 4 or claim 5 in which the past recognition history is updated by the following equation.
[0012]
Score(ca) = Score(ca) * weight_t - Conf(cb) + Conf(ca)
where
Score: class score of the recognition history
Conf: class reliability of the latest recognition result
weight_t: weight (0.0 < weight_t < 1.0)
ca: class for which the score is generated
cb: a class in the same category as ca but different from ca
Claim 7 defines a dialogue understanding device that has determination means for determining whether an utterance is of the detailing/answer type or of the correction/re-input type, and that decides, based on this determination, whether to use the arithmetic expression of claim 3 or the arithmetic expression of claim 6.
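The choice between the two update formulas (claims 3, 6 and 7) might be sketched as follows. This is a minimal illustration under stated assumptions: the names `UtteranceType` and `update_history`, the dictionary layout, and the default weights of 0.5 are hypothetical, and the Conf(cb) term is taken as the sum over every other class in the same category, following the wording "subtracting all the reliabilities" in claims 4 and 5. How the utterance type itself is determined is not modelled here.

```python
from enum import Enum
from typing import Dict


class UtteranceType(Enum):
    DETAIL_OR_ANSWER = "detailing/answer"
    CORRECTION_OR_REINPUT = "correction/re-input"


def update_history(
    history: Dict[str, float],    # past class scores, Score(c)
    conf: Dict[str, float],       # class reliabilities of the latest result, Conf(c)
    category_of: Dict[str, str],  # class name -> category name (must cover every class)
    utype: UtteranceType,
    weight_s: float = 0.5,        # assumed value; the text only gives 0.0 < weight < 1.0
    weight_t: float = 0.5,        # assumed value
) -> Dict[str, float]:
    """Return the updated class scores for one new recognition result."""
    updated = {}
    for c in sorted(set(history) | set(conf)):
        old, new = history.get(c, 0.0), conf.get(c, 0.0)
        if utype is UtteranceType.DETAIL_OR_ANSWER:
            # Score(c) = Score(c) * weight_s + Conf(c)
            updated[c] = old * weight_s + new
        else:
            # Score(ca) = Score(ca) * weight_t - Conf(cb) + Conf(ca),
            # with Conf(cb) summed over the other classes of the same category.
            rivals = sum(v for cb, v in conf.items()
                         if cb != c and category_of[cb] == category_of[c])
            updated[c] = old * weight_t - rivals + new
    return updated
```

With the correction formula, a newly recognized class actively pushes down the scores of its siblings in the same category, so a class that was chosen by mistake is displaced faster than decay alone would allow.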
[0013]
[Effect of the invention]
According to the present invention, as described above, the device does not merely perform speech recognition of words. It further classifies the recognized words into categories and classes and selects the more plausible word in light of its relation to the context, which makes it possible to improve recognition accuracy further and efficiently. This is particularly effective in noisy environments, such as voice input in a vehicle navigation system.
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 shows the basic configuration of a dialogue understanding device according to the present invention. The input analog voice signal is converted into a digital signal by a voice input unit 101, which consists of a microphone, an input amplifier, and an A/D converter. The digitized voice signal is input to a speech recognition unit 102, which matches the voice signal input from the user against the recognition target sentences stored in the speech recognition unit 102 and outputs a plurality of recognition result candidate sentences together with their likelihoods (described in detail later). From this output, a reliability generation unit 103 takes the plurality of recognition result candidate sentences produced for a single user utterance and outputs reliabilities indicating the likelihood of the words contained in those candidate sentences and of the classes into which those words are classified.
Here, as shown in FIG. 2, classes classify the expression forms that indicate a destination in a hierarchical structure. The categories are arranged in order from the widest scope to the narrowest, and the classes group the words contained in each category by content. In the example of FIG. 2, each word is classified into one of three categories, upper (PR), middle (HR), and lower (LM), and within each category into one of several classes. For example, in FIG. 2 the upper category has only one class, "prefecture", whereas the lower category has three classes: "interchange", "municipality", and "station".
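The category/class hierarchy can be represented as a small lookup table. The sketch below is an assumption about layout, not the patent's data structure. The PR and LM classes are those named for FIG. 2, the HR classes (expressway, railway line) follow the description of the middle category elsewhere in the specification, and the identifiers are hypothetical.

```python
from typing import Dict, List

# Categories ordered from widest to narrowest scope.
CATEGORY_ORDER: List[str] = ["PR", "HR", "LM"]

# Classes belonging to each category.
CATEGORY_CLASSES: Dict[str, List[str]] = {
    "PR": ["prefecture"],
    "HR": ["expressway", "railway line"],
    "LM": ["interchange", "municipality", "station"],
}

# Reverse lookup: class name -> category name.
CLASS_TO_CATEGORY: Dict[str, str] = {
    cls: cat for cat, classes in CATEGORY_CLASSES.items() for cls in classes
}


def category_of(cls: str) -> str:
    """Return the category (PR, HR or LM) that a class belongs to."""
    return CLASS_TO_CATEGORY[cls]


def sibling_classes(cls: str) -> List[str]:
    """Return the other classes in the same category as cls."""
    return [c for c in CATEGORY_CLASSES[category_of(cls)] if c != cls]
```

The sibling lookup is the piece the correction/re-input update needs, since that update subtracts the reliabilities of the other classes in the same category.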
[0014]
The word-level reliability is obtained as follows. First, from the word recognition result, the candidate word strings (for example, sentences formed of several words) are arranged in descending order of likelihood from first place to N-th place (hereinafter referred to as the N-best candidates), and the log likelihood for each word is obtained. Here, likelihood is a value defined by the posterior probability that the speech signal sequence uttered by the user is W when the speech signal sequence obtained from the recognition result is Y. It is the maximum, over hypotheses W, of the ratio of the product of "the prior probability that the speech signal sequence Y is observed for the hypothesis W" and "the probability that the speech signal sequence W is uttered" to the probability that the speech signal sequence Y is observed.
[0015]
The reliability Conf(w) of a word w included in the first-place candidate is then obtained from Equation (1) below.
[0016]
(Equation 1)

In Equation (1), the probability p_i that the word w is included in the i-th of the N-best candidates is obtained from the following Equation (2), where L_i is the log likelihood of each N-best candidate.
[0017]
(Equation 2)

Likewise, the reliability in class units, that is, the confidence Conf(C_w) of the class C_w to which the word w belongs, is obtained from the following Equation (3) in the same way as for word units.
[0018]
(Equation 3)

Here, as in the case of word units, p_i is obtained from the following Equation (4).
[0019]
(Equation 4)
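As an illustration of how word and class reliabilities might be computed from N-best output, the following Python sketch assumes a commonly used form: p_i is the softmax of the candidate log likelihoods L_i, and the confidence of a word (or class) is the sum of p_i over the candidates that contain it. This form, the scaling factor alpha, and all function names are assumptions made for illustration; they are not taken from Equations (1) to (4), which are not reproduced in this text.

```python
import math


def candidate_posteriors(log_likelihoods, alpha=1.0):
    """Assumed form of p_i: softmax over the N-best log likelihoods L_i."""
    top = max(log_likelihoods)  # subtract the maximum for numerical stability
    weights = [math.exp(alpha * (l - top)) for l in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def word_confidence(nbest, word, alpha=1.0):
    """Sum of p_i over the candidates whose word list contains `word`."""
    probs = candidate_posteriors([ll for _, ll in nbest], alpha)
    return sum(p for (words, _), p in zip(nbest, probs) if word in words)


def class_confidence(nbest, class_of, cls, alpha=1.0):
    """Sum of p_i over the candidates containing any word of class `cls`."""
    probs = candidate_posteriors([ll for _, ll in nbest], alpha)
    return sum(p for (words, _), p in zip(nbest, probs)
               if any(class_of.get(w) == cls for w in words))
```

Here nbest is a list of (word list, log likelihood) pairs and class_of is a hypothetical word-to-class dictionary. Under these assumptions, a word that appears in every candidate receives a confidence of 1.0, and a word that appears only in one of two equally likely candidates receives 0.5.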

The recognition data obtained as described above (recognition result candidate sentences, likelihoods, and reliabilities) is input to the language understanding unit 104. The language understanding unit 104 comprises a class score generation unit 105, a category understanding unit 106, a word score generation unit 107, and an understanding content generation unit 108, and generates an understanding result from the reliabilities of the words input over a plurality of user utterances and of the classes to which those words belong. The class score generation unit 105 calculates, from the class reliabilities of the words input over a plurality of user utterances, a score indicating which class was uttered. The category understanding unit 106 outputs, from the class scores obtained over a plurality of user utterances, a category understanding result, that is, which categories (the groupings of classes) were uttered. The word score generation unit 107 calculates, from the reliabilities of the words input over a plurality of user utterances, a score indicating which word was uttered. The understanding content generation unit 108 generates the understanding contents from the category understanding result (output of unit 106) and the word scores (output of unit 107).
[0020]
The output of the language understanding unit 104 obtained as described above is input to the response generation unit 109, which generates a response sentence from the understanding contents. The response sentence is synthesized as a digital signal by the voice synthesizer 110 and output as voice through a D/A converter and an output amplifier (not shown) built into the voice synthesizer 110. The response sentence is also displayed on a display device (not shown) via the GUI display unit 111. The recognition history 112 is a storage device, for example a hard disk, that stores past recognition results as history data.
[0021]
Next, the operation of the above device configuration will be described.
First, the expression forms of a destination handled in the embodiment of the present invention will be described. Interchanges, stations, and municipalities can be set as destinations, and a prefecture, a motorway, or a railway line can be added to each. As described above, FIG. 2 shows these expressions in a hierarchical structure. That is, in the present embodiment, a destination can be uttered as a combination of partial utterances at three levels: upper, middle, and lower. Each of these three levels is called a category. In the upper category PR a prefecture can be spoken, in the middle category HR a motorway or railway line can be spoken, and in the lower category LM an interchange, a municipality, or a station can be spoken.
[0022]
The embodiment of the present invention aims to allow a destination to be set interactively through more flexible utterances. That is, the user can speak all categories at once, for example, "Hamamatsu-Nishi Interchange on the Tomei Expressway in Shizuoka Prefecture". The user can also speak in several turns, for example "Shizuoka Prefecture" in the first utterance and "Hamamatsu-Nishi Interchange on the Tomei Expressway" in the second.
Further, when the user makes a plurality of utterances, a refinement (detailing) utterance that adds more detailed information to past utterances is possible. For example, the user can say "Shizuoka" in the first utterance and "Hamamatsu City" in the second. A correction utterance that corrects the system's response is also possible. For example, when the system wrongly answers the first utterance "Hamamatsu City in Shizuoka Prefecture" with the first response "Is it Hamamatsu-Nishi Interchange in Shizuoka Prefecture?", the user can say "No, Hamamatsu City" in the second utterance.
Further, when the response from the system is a question, the user can make an utterance that answers it. For example, when the first response is "Which interchange in Shizuoka Prefecture?", the user can say "Hamamatsu-Nishi Interchange" in the second utterance.
Also, when the response from the system prompts re-input, the user can respond to it. For example, when the first response is "Please speak again", the user can repeat the first utterance in the second utterance.
The recognition target words in the present embodiment are as illustrated in FIG. 3. An example of a dialogue in the present embodiment is shown in FIG. 4, where U denotes a user utterance, S a system response, and the numbers the order of utterance.
[0023]
Next, the operation of the embodiment of the present invention will be described with reference to the flowchart of FIG. 5.
The process starts in step 301. When it is detected that the user has turned on a voice input switch (utterance switch, not shown) to signal the start of an utterance (step 302), the process proceeds to step 303, where capture of the voice signal begins. If no such operation is detected, the process waits at step 302 until it is detected.
In step 303, the user utters one of the recognition target sentences (for example, the words illustrated in FIG. 3). The voice input unit 101 in FIG. 1 converts the microphone signal into a digital signal with the A/D converter and outputs it to the voice recognition unit 102. Until the utterance switch is operated, the voice recognition unit 102 continuously calculates the average power of the digital signal. After the utterance switch is operated, when the instantaneous power of the digital signal exceeds the average power by a predetermined value or more, the unit judges that the user has started speaking and begins capturing the voice signal.
In the voice recognition unit 102 in FIG. 1, the captured voice signal is compared with the stored digitized voice signals, the likelihood is calculated, and a plurality of candidates are set (step 304). Capture of the voice signal continues in parallel during step 304.
When the instantaneous power of the digitized voice signal remains at or below a predetermined value for a predetermined time or longer, the system judges that the user's utterance has ended and terminates input of the voice signal (step 305). The voice recognition unit 102 in FIG. 1 then outputs the top N recognition result candidate sentences, arranged in order of likelihood, together with their likelihood data. FIG. 6 shows an example of this output. In FIG. 6, the portions marked XXX indicate the likelihood calculated for each word.
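The power-based start and end detection of steps 303 and 305 can be sketched as follows. This is a minimal illustration, not the device's implementation: the per-frame power list, the function name detect_utterance, and the parameters start_margin, end_level, and end_frames are hypothetical, and the start condition is assumed to be a difference from the average (noise) power rather than a ratio.

```python
def detect_utterance(frame_powers, noise_power, start_margin, end_level, end_frames):
    """Return (start, end) frame indices of the utterance, or None.

    start: first frame whose power is at least noise_power + start_margin.
    end:   first frame of a run of `end_frames` consecutive frames whose
           power stays at or below `end_level` (assumed end criterion).
    """
    start = None
    quiet = 0
    for i, power in enumerate(frame_powers):
        if start is None:
            if power >= noise_power + start_margin:
                start = i
            continue
        if power <= end_level:
            quiet += 1
            if quiet >= end_frames:
                return start, i - end_frames + 1
        else:
            quiet = 0  # a loud frame interrupts the silent run
    if start is None:
        return None
    return start, len(frame_powers)  # input ended before a closing silence
```

A short dip in power inside the utterance does not end it, because the silent run must last for end_frames consecutive frames.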
From the recognition result, which consists of a plurality of candidates ranked by acoustic likelihood (the N-best candidates), two kinds of reliability, word reliability and class reliability, are calculated as measures based on the posterior probability, using the acoustic likelihoods and the frequency of appearance among the N-best candidates (step 306). This calculation is performed by the reliability generation unit 103 in FIG. 1, and an example of the result is shown in FIG. 7. In FIG. 7, the table on the left is the output of the voice recognition unit shown in FIG. 6. In the table on the right, the word reliability indicates how likely it is that a given word was uttered, and the class reliability indicates how likely it is that a word of a given class was uttered. For this calculation, see Non-Patent Document 2 cited under "Prior Art" above: Komatani and Kawahara, "Utilization of reliability of speech recognition result in speech dialogue system", Proceedings of the Acoustical Society of Japan, 3-5-2, pp. 73-74, 2000.
[0024]
By obtaining the reliability of the spoken words as described above, the likely words are estimated. In the present invention, the accuracy of word estimation is further improved through dialogue between the system and the user. To this end, the class score generation unit 105 in FIG. 1 calculates a class score (step 307). Before the class score is calculated, the user's utterance type is determined. The first utterance type adds new information to previous information; refinement and answer belong to it. The second utterance type corrects previous information; correction and re-input belong to it. The utterance type is determined from the state of the items listed in the "determination material" column of FIG. 8. Other determination methods also exist. For example, partial rephrasing, which occurs frequently in place name input, can be detected by word spotting based on DP matching. For this, see Kakutani, Kitaoka, and Nakagawa, "Analysis and detection of correction utterances at misrecognition in place name input for car navigation systems", IPSJ Research Report, Spoken Language Information Processing 37-11, 2001.
After the utterance type is determined, the class score generation unit 105 generates a class score. The class score is a value indicating the likelihood of a class over the course of a dialogue, that is, over the user's multiple utterances. A more appropriate score can be generated by adding new information while retaining previously understood information. A different generation formula is used for each utterance type, so step 307 in FIG. 5 branches into two as shown in FIG. 9. In step 315, it is determined, according to the conditions listed in the "determination material" column of FIG. 8, whether the utterance type is refinement/answer. If it is, the processing of step 316 is performed; if the utterance type is correction/re-input, the processing of step 317 is performed. In either case the process then proceeds to step 308.
[0025]
The class score for the refinement/answer utterance type, that is, for step 316 in FIG. 9, is obtained by Equation (5).
Score(c) = Score(c) * weight_s + Conf(c)   (Equation 5)
Here, Score is the class score: the left side of Equation (5) is the newly obtained class score, and Score on the right side is the past class score (read from the recognition history 112). Conf is the class reliability obtained from the latest recognition result, weight_s is a weight taking a value from 0.0 to 1.0, and c is the class for which the score is generated. The weight weight_s lowers the pre-update class score at a fixed rate, applying the policy that information becomes less reliable as it becomes older. weight_s can be determined experimentally from actual utterance data. The updated class score is written to the recognition history 112.
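Equation (5) can be written directly as a short function. In the Python sketch below, the dictionary representation of the recognition history and the value weight_s = 0.8 are assumptions chosen for illustration; the text states only that the weight lies between 0.0 and 1.0 and is determined experimentally.

```python
def update_class_scores_refine(history, conf, weight_s=0.8):
    """Apply Equation (5) to every class:
    Score(c) = Score(c) * weight_s + Conf(c).

    history: past class scores keyed by class (missing classes count as 0.0)
    conf:    class reliabilities from the latest recognition result
    """
    classes = set(history) | set(conf)
    return {c: history.get(c, 0.0) * weight_s + conf.get(c, 0.0)
            for c in classes}
```

With past scores of 1.00 for "prefecture" and "railway line" and a new reliability of 0.81 for "station", the assumed weight of 0.8 decays the two older classes to 0.80 each while "station" enters the history at 0.81.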
[0026]
FIG. 10 shows how the class score is generated for the refinement/answer utterance type. In the example, the user has spoken "prefecture" and "railway line" in past utterances (old class score 1.00) and "station" in the latest utterance (0.81 in the new class reliability column). The class score in this case is generated based on Equation (5).
[0027]
The class score for the correction/re-input utterance type, that is, for step 317 in FIG. 9, is obtained by Equation (6).
Score(ca) = Score(ca) * weight_t - Conf(cb) + Conf(ca)   (Equation 6)
Here, Score is the class score: the left side of Equation (6) is the newly obtained class score, and Score on the right side is the past class score (read from the recognition history 112). Conf is the class reliability obtained from the latest recognition result, and weight_t is a weight taking a value from 0.0 to 1.0. ca is the class for which the score is generated, and cb denotes all the other classes in the same category as ca. That is, compared with Equation (5), the reliabilities of the different classes in the same category are subtracted. This makes it easier to correct the score when the class has been misrecognized. The updated class score is written to the recognition history 112.
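Equation (6) can be sketched in the same way. The Python fragment below assumes that Conf(cb) is summed over all other classes of the same category, as the explanation of cb suggests; the hierarchy dictionary, the function name, and weight_t = 0.8 are hypothetical.

```python
def update_class_scores_correct(history, conf, hierarchy, weight_t=0.8):
    """Apply Equation (6) to every class ca:
    Score(ca) = Score(ca) * weight_t - sum(Conf(cb)) + Conf(ca),
    where cb ranges over the other classes in the same category as ca.
    """
    new_scores = {}
    for classes in hierarchy.values():
        for ca in classes:
            rivals = sum(conf.get(cb, 0.0) for cb in classes if cb != ca)
            new_scores[ca] = (history.get(ca, 0.0) * weight_t
                              - rivals + conf.get(ca, 0.0))
    return new_scores
```

If "interchange" was wrongly understood earlier (past score 1.0) and the latest result strongly supports "city" (reliability 0.9), the subtraction pulls the "interchange" score down to about zero while "city" rises to about 0.8, so the corrected class overtakes the mistaken one after a single utterance.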
FIG. 11 shows how the class score is generated for the correction/re-input utterance type. In the example, the user spoke a word of the "prefecture" class in a past utterance, but the class score was too low to identify the category, so the system responded "Please speak again". The user then spoke again in the same "prefecture" class, and an updated class score was obtained (for the "prefecture" utterance, values appear in both the old class score column and the new class reliability column). The class score in this case is generated based on Equation (6).
[0028]
The process then proceeds to the category understanding process of step 308. This process is performed by the category understanding unit 106 in FIG. 1, which calculates a category score both for the past class scores (read from the recognition history) and for the class reliabilities in the latest recognition result. FIG. 12 shows this processing. As can be seen from the numbers in the portions marked A and B in FIG. 12, a category score is obtained by summing all the class scores, or class reliabilities, belonging to the same category. Each category score is compared with a threshold, and the logical OR of the judgment results is taken for each of the three categories PR (upper), HR (middle), and LM (lower). The result indicates the combination of categories spoken so far. For the class scores of FIG. 12, the resulting category understanding is shown in FIG. 13. That is, a judgment is made for each category from the old and new scores, and the categories are understood accordingly.
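The category understanding step can be illustrated as follows. The sketch assumes that the old class scores and the new class reliabilities are each summed per category, each sum is compared with a threshold, and the two judgments are combined by logical OR. The threshold value of 0.5 and all names are hypothetical.

```python
def category_scores(class_values, hierarchy):
    """Sum the class scores (or reliabilities) that belong to each category."""
    return {cat: sum(class_values.get(c, 0.0) for c in classes)
            for cat, classes in hierarchy.items()}


def understood_categories(old_class_scores, new_class_conf, hierarchy, threshold=0.5):
    """A category counts as uttered if its old OR its new score reaches the threshold."""
    old = category_scores(old_class_scores, hierarchy)
    new = category_scores(new_class_conf, hierarchy)
    return {cat: old[cat] >= threshold or new[cat] >= threshold
            for cat in hierarchy}
```

The returned dictionary marks, for PR, HR, and LM, whether the category is judged to have been spoken up to the current utterance.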
[0029]
Next, word scores are generated in step 309. This step is executed by the word score generation unit 107 in FIG. 1. Scores are generated under different policies for each of the following two kinds of words:
1) past words (already present in the recognition history 112), and
2) newly appearing words (words in the latest recognition result).
For 2), all the words included in the N-best candidates of the latest recognition result are targeted. Score generation is executed in the order 1) → 2) every time the language understanding unit 104 in FIG. 1 acquires the latest recognition result.
[0030]
For the words of 1), which exist in the recognition history, a score is generated by raising or lowering the existing word score according to the recency of the word, the content of the system response, and the user's utterance type (refinement, correction, answer, re-input). The following five policies are used.
Policy 1: On the assumption that old information becomes less reliable, the scores of all words in the recognition history are lowered every time a new recognition result is input.
Policy 2: If word A in the recognition history and word B in the recognition result are in a refinement relationship, the score of word A is raised.
Policy 3: If word A in the recognition history and word B in the recognition result are in a correction relationship, the score of word A is lowered.
Policy 4: If the recognition result includes an affirmation (yes, yeah, etc.), the score of a word included in the response is raised.
Policy 5: If the recognition result includes a negation (no, that's wrong, etc.), the score of a word included in the response is lowered.
The word scores in the recognition history are generated by the following Equation (7).
[0031]

Here, Score is the score of a word in the recognition history; the right side is the value before the update and the left side the value after. Wd is the word in the recognition history 112 whose score is calculated. The term corresponding to policy 1 is p1, which lowers the score of the word. In the terms corresponding to policies 2 and 3, p2 and p3 are weights, Conf is the reliability obtained from the latest recognition result, Ws denotes all the words included in the latest recognition result that have a refinement relationship with Wd, and Wt denotes all the words included in the latest recognition result that have a correction relationship with Wd. In the terms corresponding to policies 4 and 5, i = 1 when the word is included in the previous system response and i = 0 when it is not. In addition, yes denotes an affirmative word included in the latest recognition result, no denotes a post-negation word included in the current recognition result, and rej denotes a sentence-final negative word included in the current recognition result.
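Because Equation (7) itself is not reproduced in this text, the following Python sketch only illustrates how policies 1 to 5 might combine. It assumes an additive form in which p1 scales the old score down, p2 and p3 weight the summed reliabilities of refining and correcting words, and hypothetical weights p4 and p5 apply to affirmation and negation when the word appeared in the previous system response. The form, the weights p4 and p5, and every default value are assumptions, not the patented formula.

```python
def update_history_word_score(score, conf_refine, conf_correct, in_prev_response,
                              conf_yes=0.0, conf_no=0.0, conf_rej=0.0,
                              p1=0.9, p2=0.5, p3=0.5, p4=0.5, p5=0.5):
    """Assumed additive form of the history word score update.

    score:            existing score of Wd in the recognition history
    conf_refine:      reliabilities of words Ws that refine Wd (policy 2)
    conf_correct:     reliabilities of words Wt that correct Wd (policy 3)
    in_prev_response: True if Wd was in the previous system response (i = 1)
    conf_yes/no/rej:  reliabilities of affirmative / negative words
    """
    i = 1 if in_prev_response else 0
    return (score * p1                       # policy 1: older is less reliable
            + p2 * sum(conf_refine)          # policy 2: refinement raises
            - p3 * sum(conf_correct)         # policy 3: correction lowers
            + i * (p4 * conf_yes             # policy 4: affirmation raises
                   - p5 * (conf_no + conf_rej)))  # policy 5: negation lowers
```

Under these assumptions, a word that is neither refined, corrected, nor confirmed simply decays by the factor p1 at each turn.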
[0032]
For the words of 2), that is, words in the latest recognition result that are not yet registered in the recognition history (newly appearing words), a word score is generated by raising or lowering the speech recognition reliability according to the response content, the user's utterance type (refinement, correction, answer, re-input), the N-best rank, and the utterance length (the number of uttered words). The following four policies are used.
Policy 6: If word A in the recognition result and word B included in the response are in a refinement relationship, the score of word A is raised.
Policy 7: If the system response is a question (e.g., "Which interchange?") and the content of the recognition result is an answer, the score of the word in the recognition result is raised.
Policy 8: Since the top-ranked recognition results contain many correct words, the score of a word included in a top-ranked result is raised.
Policy 9: Long utterances are easy to recognize and short utterances are hard to recognize, so the score of a word is lowered when the result covers one category and raised when it covers two or more categories.
[0033]
The score of a word in the latest recognition result that is not yet registered in the recognition history is generated by the following Equation (8).
[0034]

Here, Score is the score of a word in the recognition history, and Conf is the reliability obtained from the latest recognition result. Wd is the word whose score is calculated. In the term corresponding to policy 6, p6 is a weight and Ws denotes all the words that have a refinement relationship with Wd and are included in the recognition history. In the term corresponding to policy 7, p7 applies to a word included in the recognition result when the recognition result is an answer to a question. In the term corresponding to policy 8, p8 is a weight that depends on the N-best rank. In the terms corresponding to policy 9, p9 and p10 are weights; len2 = 1 when the recognition result covers two or more categories, and len1 = 1 when it covers one category.
The scores of the words updated in 1) and the words and scores added in 2) are written to the recognition history 112 as an integrated recognition history. FIG. 14 shows an example of the integrated recognition result using actual prefecture names, railway line names, and so on. Some names (Atsugi, Tanashi, etc.) appear more than once in the figure because they are station names that exist on more than one line.
[0035]
From the category understanding result obtained as described above and the integrated recognition history, a plurality of candidates are generated as valid combinations. That is, a plurality of likely candidates are generated as the contents understood by the device (step 310). This process is executed by the understanding content generation unit 108 in FIG. 1. According to FIG. 13, the three categories PR, HR, and LM have been uttered, so the combinations in FIG. 14 that actually exist for these categories are extracted as candidates, and the one with the largest sum of per-category scores is selected. The result is shown in FIG. 15. As the understanding result, <PR category = Aichi, score = 1.47>, <HR category = Nagoya Railway, score = 1.17>, and <LM category = Toyohashi, score = 0.62> are selected.
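The selection of the combination with the largest summed score can be sketched as below. The data layout (a list of category-to-word dictionaries and a score table keyed by category and word) and the function name are assumptions; only the three scores for Aichi, Nagoya Railway, and Toyohashi come from the text, and the competing entry "StationX" with its score is a made-up placeholder.

```python
def best_combination(combinations, word_scores):
    """Return the combination whose per-category word scores sum highest."""
    def total(combo):
        return sum(word_scores.get((cat, word), 0.0)
                   for cat, word in combo.items())
    return max(combinations, key=total)


# Scores from the text, plus one hypothetical competitor ("StationX").
word_scores = {
    ("PR", "Aichi"): 1.47,
    ("HR", "Nagoya Railway"): 1.17,
    ("LM", "Toyohashi"): 0.62,
    ("LM", "StationX"): 0.20,
}
combinations = [
    {"PR": "Aichi", "HR": "Nagoya Railway", "LM": "StationX"},
    {"PR": "Aichi", "HR": "Nagoya Railway", "LM": "Toyohashi"},
]
selected = best_combination(combinations, word_scores)
```

Only combinations that actually exist (for example, a station that really lies on the given line in the given prefecture) would be passed in, which is how the consistency constraint described above is expressed in this sketch.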
[0036]
The above has described the processing in each step of the language understanding unit 104 in FIG. 1. Next, a response flag is generated from the understanding contents (step 311). This is executed by the response generation unit 109 in FIG. 1. FIG. 16 shows the types of response flag, and FIG. 17 shows the content of each bit field (A to F) in FIG. 16. Based on the understanding result, if a word corresponding to a category exists, the corresponding flag is set. The flag value encodes the score evaluated in four steps: scores are rated from 1 (highest) to 4 (lowest), and the flag is set to 1000, 0100, 0010, or 0001 accordingly.
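The four-step score evaluation can be illustrated with a small function. The threshold values below are hypothetical, since the text does not state where the boundaries between ratings lie, and the use of "0000" for a category with no word is likewise an assumption.

```python
# Rating 1 (best) to 4 (worst) mapped to the four-bit flags named in the text.
FLAGS = {1: "1000", 2: "0100", 3: "0010", 4: "0001"}


def score_rating(score, thresholds=(1.0, 0.6, 0.3)):
    """Rate a score from 1 to 4 using hypothetical descending thresholds."""
    for rating, limit in enumerate(thresholds, start=1):
        if score >= limit:
            return rating
    return 4


def score_flag(score, present=True, thresholds=(1.0, 0.6, 0.3)):
    """Return the four-bit flag; '0000' (assumed) when no word exists."""
    if not present:
        return "0000"
    return FLAGS[score_rating(score, thresholds)]
```

With these assumed thresholds, the scores 1.47, 1.17, and 0.62 of the understanding result would be rated 1, 1, and 2.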
[0037]
The response generation unit 109 responds in the dialogue according to the following policies, using the response flag described above.
Response policy 1: Backchannel (aizuchi)
When there is no lower category and the score evaluation of the upper or middle category is 1, a backchannel response is given to keep the dialogue flowing.
Example User utterance ... "Shizuoka"
System response ... "Yes"
Response policy 2: Repeat
When the score evaluation is 2, or when the user's utterance begins with a negation, the system repeats the content as confirmation.
Example User utterance ... "Shizuoka"
System response ... "Shizuoka"
Response policy 3: Final confirmation
If the lower category has been spoken and is reliable (score evaluation is 1 or 2), a final confirmation is made.
Example User utterance ... "Get on from Hamamatsu Interchange"
System response ... "Are you sure you want to set Hamamatsu IC?"
Response policy 4: Destination setting
If the previous response contained a lower category and the affirmative utterance is reliable (score evaluation is 1 or 2), the destination is set.
Example System response ... "Are you sure you want to set Hamamatsu IC?"
User utterance ... "Yes"
System response ... "Destination set"
Response policy 5: Ask only for unknown information
The user is asked only for the information that the system does not know.
Example User utterance ... "It is the Tomei motorway in Shizuoka Prefecture" (when the score of the underlined part is low)
System response ... "Which motorway in Shizuoka Prefecture?"
Response policy 6: Do not respond with unreliable information
If only one of the upper category (PR) and the middle category (HR) in a combination is unreliable (score evaluation is 4), the dialogue proceeds by responding only with the one that has the higher score.
Example User utterance ... "Tomei Expressway in Shizuoka Prefecture" (when the score of the underlined part is low)
System response ... "Tomei Expressway"
Response policy 7: When the utterance carries little additional information and the score evaluation is poor, the recognition rate is improved by asking about the upper category.
Example User utterance ... "Get on from Hamamatsu Interchange" (when the score of the underlined part is low)
System response ... "Which prefecture is the interchange in?"
Response policy 8: Prompt for the next utterance
If an affirmative utterance follows an upper category and is reliable (score evaluation is 1 or 2), the next utterance is prompted.
Example System response ... "Tomei Expressway"
User utterance ... "Yes"
System response ... "Where on the Tomei Expressway?"
Response policy 9: Return another candidate
When a negative utterance is reliable (score evaluation is 1 or 2), another candidate that was not used in the previous response is returned.
Example System response ... "Do you want to set Hamamatsu Interchange?"
User utterance ... "No"
System response ... "Do you want to set Hamamatsu-Nishi Interchange?"
Response policy 10: Repeat the previous response
When an affirmative or negative utterance is unreliable (score evaluation is 4), the previous response is repeated.
Example System response ... "Are you sure you want to set Hamamatsu IC?"
User utterance ... "Yes" (when the score of the underlined part is low)
System response ... "Are you sure you want to set Hamamatsu IC?"
Response policy 11: Ask the user to repeat
When all information is unreliable (score evaluation is 4), the user is asked to speak again.
Example User utterance ... "Shizuoka Prefecture" (when the score of the underlined part is low)
System response ... "Please speak again"
To implement the above dialogue policies, the response generation unit 109 compares the response flag with the flag table shown in FIG. 16 and returns the response pattern of the first entry that the flag matches.
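The first-match lookup against the flag table can be sketched as follows. The layout of the FIG. 16 table is not reproduced in this text, so the sketch assumes a hypothetical table of (pattern, response pattern) rows in which the character "x" in a pattern means "don't care"; the example rows and flags are invented for illustration.

```python
def first_matching_response(flag, table):
    """Return the response pattern of the first row whose pattern matches.

    A pattern matches when it has the same length as the flag and every
    position is either 'x' (assumed don't-care) or equal to the flag bit.
    """
    for pattern, response in table:
        if len(pattern) == len(flag) and all(
                p in ("x", f) for p, f in zip(pattern, flag)):
            return response
    return None


# Hypothetical five-bit table; rows are checked from top to bottom.
table = [
    ("1000x", "final confirmation"),
    ("xxxxx", "ask again"),
]
```

Because rows are tested in order, more specific patterns must be placed before catch-all rows such as the last one.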
In this example, the response flag generated by the response generation unit 109 from the understanding result is
"1 111000 1000 1000 0100 0000 0000 0000 0"
Referring to the flag table of FIG. 16, the response pattern
Are you sure you want to set "PR category word" "PR category class",
"HR category word" "HR category class",
"LM category word" "LM category class"?
is selected, so that
"Are you sure you want to set Toyohashi Station on the Nagoya Railway in Aichi Prefecture?" is generated as the response sentence.
[0038]
The response generated as described above is executed in step 312: it is output as a voice signal via the voice synthesizer 110 in FIG. 1 and displayed on the display via the GUI display unit 111.
At this stage, it is checked whether all input processing has been completed (step 313). When the word of the lower category (LM) has been determined (YES in step 313), the process moves to step 314 and all input processing ends. If the word of the lower category (LM) has not been determined (NO in step 313), processing continues. In this example, the response is "Are you sure you want to set Toyohashi Station on the Nagoya Railway in Aichi Prefecture?"; the user then says "Yes", the system responds "Destination set", and the process ends.
[Brief description of the drawings]
FIG. 1 is a block diagram of a basic configuration of a dialogue understanding device according to the present invention.
FIG. 2 is a block diagram showing a hierarchical classification method of uttered words.
FIG. 3 is a correspondence diagram showing a relationship between a recognition target word and an utterance type.
FIG. 4 is an utterance / response diagram showing an example of a dialogue between a system and a user.
FIG. 5 is a flowchart showing the operation of the system.
FIG. 6 is a target diagram showing a relationship between a recognition result candidate sentence as an output of a speech recognition unit and likelihood.
FIG. 7 is an object diagram for obtaining a reliability from a relationship between a recognition result candidate sentence and a likelihood.
FIG. 8 is an object diagram of an utterance type and an utterance type determination material.
FIG. 9 is a flowchart showing the proper use of processing depending on the utterance type.
FIG. 10 is a view showing an object class with an old class score, showing a process of generating a class score after updating in the detailing / answer utterance type.
FIG. 11 is a view showing a process of generating a class score after updating in a corrected / re-input utterance type and an object class with an old class score.
FIG. 12 is a comparison diagram of new and old scores showing a class score calculation procedure.
FIG. 13 is a target diagram showing a new and old score calculation process in the category understanding process.
FIG. 14 is a score target diagram showing an example of an integrated recognition result.
FIG. 15 is a diagram illustrating a comparison of language understanding final scores by item.
FIG. 16 is a diagram showing response flags and response patterns.
FIG. 17 is a diagram showing response flags and their contents.
[Explanation of symbols]
101: voice input unit 102: voice recognition unit
103: reliability generation unit 104: language understanding unit
105: Class score generation unit 106: Category understanding unit
107: word score generator 108: understanding content generator
109: response generator 110: speech synthesizer
111: GUI display section

Claims

A dialogue understanding device that classifies utterances included in a dialogue hierarchically into a plurality of categories, ordered by the breadth each covers, and into classes formed by subdividing the categories, and that understands the dialogue content from a class score giving the certainty of which class of word was uttered and from the words contained in the utterance,
wherein the device conducts a refinement/answer dialogue that adds more detailed information to the utterance content, and, when updating the class score of the past recognition history used to newly generate the class score for the refinement/answer utterance type, updates the recognition history with a new class score obtained by adding a new reliability based on the latest recognition result to a value weighted so that the class score of the past recognition history becomes smaller.

The dialogue understanding device according to claim 1, comprising:
voice input means composed of a microphone and an audio amplifier;
voice recognition means for digitizing the output of the voice input means and performing voice recognition, and reliability generation means for calculating the reliability of the result recognized by the voice recognition means;
language understanding means that uses the results obtained by the voice recognition means and the reliability generation means and comprises a class score generation unit that, based on a hierarchical structure consisting of the plurality of preset categories and the classes obtained by subdividing the categories, determines the likelihood of each class having been uttered, a category understanding unit that determines each category from the results obtained by the class score generation unit, a word score generation unit that determines the certainty of the recognized words, and an understanding content generation unit that generates the understanding contents from the results of the processing performed by these units;
storage means for storing the past recognition history used in the processing of the language understanding means;
response generation means for generating response information from the result obtained by the language understanding means; and
output means for outputting the response information,
wherein, when updating the class score of the past recognition history used to newly generate the class score for the refinement/answer utterance type, the recognition history is updated with a new class score obtained by adding a new reliability based on the latest recognition result to a value weighted so that the class score of the past recognition history becomes smaller.

3. The dialogue understanding apparatus according to claim 1, wherein the class score of the past recognition history is updated by the following equation.
Score(c) = Score(c) * weight_s + Conf(c)
where Score: class score of the past recognition history
Conf: class reliability of the latest recognition result
weight_s: weight (0.0 < weight_s < 1.0)
c: class for which the score is generated
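
The update of claim 3 can be illustrated with a minimal Python sketch. The function name `update_refinement`, the dictionary representation of the recognition history, and the treatment of classes absent from the latest result as having Conf(c) = 0 are assumptions made for illustration only; they are not specified in the claims.

```python
def update_refinement(history, conf, weight_s):
    """Sketch of Score(c) = Score(c) * weight_s + Conf(c).

    history  : dict mapping class name -> past class score (hypothetical layout)
    conf     : dict mapping class name -> reliability from the latest recognition
    weight_s : decay weight, required to satisfy 0.0 < weight_s < 1.0
    """
    if not 0.0 < weight_s < 1.0:
        raise ValueError("weight_s must lie strictly between 0.0 and 1.0")
    updated = {}
    # Assumption: a class missing from either dictionary contributes 0.0.
    for c in set(history) | set(conf):
        updated[c] = history.get(c, 0.0) * weight_s + conf.get(c, 0.0)
    return updated


# Illustrative values only; the class names are placeholders.
history = {"class_a": 0.8, "class_b": 0.4}
latest = {"class_a": 0.5, "class_c": 0.3}
new_history = update_refinement(history, latest, weight_s=0.5)
```

Because weight_s is less than 1.0, a class that stops appearing in the recognition results sees its score decay with each update, while a class that keeps being recognized accumulates score.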

A dialogue understanding device that classifies utterances included in a dialogue hierarchically into a plurality of categories, in order of the breadth covered by the utterance, and into classes obtained by subdividing those categories, and that understands the dialogue content from a class score, which gives the certainty as to which class of word was uttered, and from the words contained in the utterance,
the device conducting a correction/re-input dialogue, that is, processing in which an erroneous response from the system is corrected, characterized in that, when updating the class score of the past recognition history used to newly generate the class score for the correction/re-input utterance type, the recognition history is updated with a score obtained by adding a new reliability based on the latest recognition result to the score of the past recognition history, weighted so that it becomes smaller, and subtracting all the reliabilities of the different classes in the same category.

The dialogue understanding device according to claim 1 or 4, comprising:
voice input means composed of a microphone and a voice amplifier;
voice recognition means for digitizing the output of the voice input means and performing voice recognition, and reliability generation means for calculating the reliability of the result recognized by the voice recognition means;
language understanding means comprising: a class score generation unit that, using the results obtained by the voice recognition means and the reliability generation means, determines the likelihood of each class of the utterance within a hierarchical structure consisting of the plurality of preset categories and the classes obtained by subdividing those categories; a category understanding unit that determines each category from the results obtained by the class score generation unit; a word score generation unit that determines the certainty of each recognized word; and an understanding content generation unit that generates the understanding content from the results of the processing performed by the above units;
storage means for storing the past recognition history used in the processing of the language understanding means;
response generation means for generating response information from the result obtained by the language understanding means; and
output means for outputting the response information,
wherein, for the correction/re-input utterance type, when updating the class score of the past recognition history used to newly generate the class score, the recognition history is updated with a score obtained by adding a new reliability based on the latest recognition result to the score of the past recognition history, weighted so that it becomes smaller, and subtracting all the reliabilities of the different classes in the same category.

6. The dialogue understanding device according to claim 4, wherein the past recognition history is updated by the following equation.
Score(ca) = Score(ca) * weight_t - Conf(cb) + Conf(ca)
where Score: class score of the recognition history
Conf: class reliability of the latest recognition result
weight_t: weight (0.0 < weight_t < 1.0)
ca: class for which the score is generated
cb: a class in the same category as ca but different from ca
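
The update of claim 6 can be sketched in Python as follows. The function name `update_correction`, the `category_of` lookup table, and the class and category labels are hypothetical. Following the wording of claim 4 ("subtracting all the reliabilities"), the sketch assumes that the term Conf(cb) is summed over every class cb that shares a category with ca; this reading is an assumption, not something the equation itself states.

```python
def update_correction(history, conf, category_of, weight_t):
    """Sketch of Score(ca) = Score(ca) * weight_t - Conf(cb) + Conf(ca).

    history     : dict mapping class name -> past class score (hypothetical layout)
    conf        : dict mapping class name -> reliability from the latest recognition
    category_of : dict mapping every class name to its category (hypothetical)
    weight_t    : decay weight, required to satisfy 0.0 < weight_t < 1.0
    """
    if not 0.0 < weight_t < 1.0:
        raise ValueError("weight_t must lie strictly between 0.0 and 1.0")
    updated = {}
    for ca in set(history) | set(conf):
        # Assumption: subtract the reliabilities of ALL rival classes cb
        # that belong to the same category as ca.
        rivals = sum(
            r for cb, r in conf.items()
            if cb != ca and category_of[cb] == category_of[ca]
        )
        updated[ca] = history.get(ca, 0.0) * weight_t - rivals + conf.get(ca, 0.0)
    return updated


# Illustrative values only: c1 and c2 compete within category "A".
category_of = {"c1": "A", "c2": "A", "d1": "B"}
history = {"c1": 1.0, "c2": 0.2}
latest = {"c1": 0.1, "c2": 0.6, "d1": 0.9}
new_history = update_correction(history, latest, category_of, weight_t=0.5)
```

In this example the previously dominant class c1 is pushed down by the reliability of its rival c2, which is the intended effect when the user corrects an erroneous system response; the class d1 in another category is unaffected by the subtraction.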

A dialogue understanding device characterized in that a determining means, which determines whether the utterance type is the refinement/answer utterance type or the correction/re-input utterance type, is used to decide which arithmetic expression to use.
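
The selection between the two arithmetic expressions can be sketched as a simple dispatch. This is a minimal illustration only: the names `UtteranceType` and `update_history`, the default weights of 0.5, and the `category_of` table are all hypothetical, and the claims do not specify how the utterance type itself is determined, so it is taken here as an input.

```python
from enum import Enum


class UtteranceType(Enum):
    REFINEMENT_ANSWER = "refinement/answer"
    CORRECTION_REINPUT = "correction/re-input"


def update_history(utterance_type, history, conf, category_of,
                   weight_s=0.5, weight_t=0.5):
    """Pick the update expression according to the utterance type.

    The default weights are placeholders; the claims only require that
    each weight lies strictly between 0.0 and 1.0.
    """
    for w in (weight_s, weight_t):
        if not 0.0 < w < 1.0:
            raise ValueError("weights must lie strictly between 0.0 and 1.0")
    updated = {}
    for c in set(history) | set(conf):
        past = history.get(c, 0.0)
        latest = conf.get(c, 0.0)
        if utterance_type is UtteranceType.REFINEMENT_ANSWER:
            # Score(c) = Score(c) * weight_s + Conf(c)
            updated[c] = past * weight_s + latest
        elif utterance_type is UtteranceType.CORRECTION_REINPUT:
            # Score(ca) = Score(ca) * weight_t - Conf(cb) + Conf(ca),
            # assuming Conf(cb) is summed over all rivals in ca's category.
            rivals = sum(
                r for cb, r in conf.items()
                if cb != c and category_of[cb] == category_of[c]
            )
            updated[c] = past * weight_t - rivals + latest
        else:
            raise ValueError("unknown utterance type")
    return updated


category_of = {"a": "X", "b": "X"}
history = {"a": 1.0, "b": 0.0}
latest = {"a": 0.2, "b": 0.7}
refined = update_history(UtteranceType.REFINEMENT_ANSWER, history, latest, category_of)
corrected = update_history(UtteranceType.CORRECTION_REINPUT, history, latest, category_of)
```

With the same inputs, the refinement/answer branch leaves class "a" with a high score, whereas the correction/re-input branch suppresses it in favor of the rival class "b".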