JP3936266B2 - Speech recognition apparatus and program

Info

Publication number: JP3936266B2
Application number: JP2002265510A
Authority: JP
Inventors: Konstantin Markov; Satoshi Nakamura
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2002-09-11
Filing date: 2002-09-11
Publication date: 2007-06-27
Anticipated expiration: 2022-09-11
Also published as: JP2004163445A

Description

【０００１】
【発明の属する技術分野】
この発明は音声認識システムに関し、特に、ベイジアンネットワーク（ＢＮ）を採用したＨＭＭ（隠れマルコフモデル）に基づく音声認識システムに関する。
【０００２】
【従来の技術】
多年にわたり、ＨＭＭの音声認識への導入以来、各状態Ｑに対する観測の、条件付き分布Ｐ（ｙ｜Ｑ）は確率密度関数の混合によりモデル化されてきた（離散ＨＭＭはここでは考慮しない）。ガウスｐｄｆ（確率密度関数）およびラプラシアンｐｄｆがこの目的でよく使用されている。後には、ハイブリッドＨＭＭ/ＮＮ（隠れマルコフモデル／ニューラルネットワーク）システムが提案されたが、ここでは所与の入力観測に対しＨＭＭの状態の尤度を推定するのにニューラルネットワークが用いられている。
【０００３】
多くの場合、音声スペクトルから抽出された特徴がこれらの観測値を形成する。しかしながら、音声認識に関する研究により、これらの特徴のみを用いるだけでは高いシステム性能を達成するには不十分である事が示された。このため、多くの研究者が、ＨＭＭシステムに何か他の知識を表わす付加的な特徴を含めようとしてきた。
【０００４】
トクダら（後掲の非特許文献１）は、付加的なピッチ情報をモデル化するため、多元空間での確率分布を提案している。しかしほとんどの場合、付加的特徴の特性により種々の方策がとられている。この問題に対処すべき共通の、十分に柔軟性の高いフレームワークはこれまでには存在しなかった。
【０００５】
最近、ＨＭＭに対する別の選択肢として、ベイジアンネットワーク（ＢＮ）が研究者の関心を集めている。ＢＮは人工知能の研究分野では周知でありよく研究されている。しかし、音声認識においては、これらは比較的新しい研究課題である。
【０００６】
ベイジアンネットワークとは有向非巡回的グラフであり、そのノードは事象を表わす。ベイジアンネットワークでは、第一のノード（ノードＡ）から第二のノード（ノードＢ）への枝は、ＡがＢの原因である事を示す。各枝（ＡＢ）には、確率Ｐ（Ｂ｜Ａ）を示す「リンク」行列が割当てられ、この確率は、Ａの各値に対するＢの各値の確率を特定する。
【０００７】
ベイジアンネットワークは、多くの互いに異なる（離散的または連続的な）ランダムな変数の複合的な同時確率分布を、よく構造化され容易に表現可能な仕方でモデル化できる。特に、時間的音声特徴をモデル化するのに適しているのは、ディーンらの提案するダイナミックＢＮ（ＤＢＮ）である（後掲の非特許文献２）。
【０００８】
音声認識におけるＤＢＮの最初の報告のいくつかでは、ツバイクら、およびダウディらの報告でのように（後掲の非特許文献３、４）、これらは単独の語認識作業での語モデルとして用いられていた。これらの著作においては、ＤＢＮは、音声スペクトル情報に加えて、調音的特徴、サブ帯域相関、話し方のスタイル等の付加的な知識を容易に組み入れる事が可能な、ＨＭＭの一般化と考えられている。スティーブンソンらはＤＢＮのフレームワーク内で、音声学的な特徴をピッチ情報で容易に補う事ができる事を報告している。
【０００９】
ベイジアンネットワークの別の利点は、認識中に信頼性をもって推定するのが困難な付加的な特徴を、隠されたまま、すなわち観測不能な状態にしておける事である。
【００１０】
【非特許文献１】
Ｋ.トクダ、Ｔ.マスコ、Ｎ.ミヤザキ、Ｔ.コバヤシ、『ピッチパターンモデリングのための多元空間確率分布に基づく隠れマルコフモデル』、ＩＣＡＳＳＰ予稿集、ｐｐ．２２９−２３２，１９９９年。（K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, “Hidden Markov Models based on multi-space probability distribution for pitch pattern modeling.” In Proc. ICASSP, pp. 229-232, 1999.）
【非特許文献２】
Ｔ.ディーン、Ｋ.カナザワ、「確率的な時間的推論」、ＡＡＡＩ，ｐｐ．５２４−５２８，１９８８年。（T. Dean and K. Kanazawa, “Probabilistic temporal reasoning,” in AAAI, pp. 524-528, 1988.）
【非特許文献３】
G．ツヴァイク、Ｓ.ラッセル、『自動音声認識のためのベイジアンネットワークによる確率モデリング』、ＩＣＳＬＰ予稿集、ｐｐ．３０１０−３０１３，１９９８年。（G. Zweig and S. Russell, “Probabilistic modeling with Bayesian Networks for automatic speech recognition,” in Proc. ICSLP, pp. 3010-3013, 1998.）
【非特許文献４】
Ｋ.ダウディ、Ｄ.フォア、Ｃ.アントワーヌ、『確率的グラフモデルに基づく多帯域音声認識の新たなアプローチ』、ＩＣＳＬＰ予稿集，ｖｏｌ．１，ｐｐ．３２９−３３２，２０００年。（K. Daoudi, D. Fohr, and C. Antoine, “A new approach for multi-band speech recognition based on probabilistic graphical models,” in Proc. ICSLP, vol. 1, pp. 329-332, 2000.）
【非特許文献５】
Ｔ.スティーブンソン、Ｍ.マシュウ、Ｈ．ボラード、『ベイジアンネットワークに基づくＡＳＲでの補助情報のモデル化』、ユーロスピーチ予稿集、ｐｐ．２７６５−２７６８，２００１年。（T. Stephenson, M. Mathew, and H. Bourlard, “Modeling auxiliary information in Bayesian Network based ASR,” in Proc. Eurospeech, pp. 2765-2768, 2001.）
【非特許文献６】
Ｋ.ダウディ、Ｄ.フォア、Ｃ.アントワーヌ、「ベイジアンネットワークを用いた連続多帯域音声認識」、Proc. ASRU, 2001年。（K. Daoudi, D. Fohr, and C. Antoine, “Continuous multi-band speech recognition using Bayesian Networks,” in Proc. ASRU, 2001.）
【発明が解決しようとする課題】
ＢＮのこれらの魅力的な特性にも関わらず、音声認識への応用は依然として、小さな、単独の語認識作業に対するものに限られている。その理由は、ＢＮパラメータの学習および推論のための既存のアルゴリズムが、連続した音声認識（ＣＳＲ）および、特に語彙の多いＣＳＲ作業にはそれほど適していないからである。連続して発話された数字の認識を可能にするＤＢＮ語モデルの拡張が、ダウディらの非特許文献６に報告されてはいるものの、作業対象となる語彙がわずか数百に増加するだけでも、あまりにも計算量が多くなり過ぎる。
【００１１】
従って、この発明の目的の一つは、ＢＮを採用したものであって、かつＣＳＲに適した装置を提供する事である。
【００１２】
この発明の別の目的は、ＢＮを採用したものであって、かつ語彙量の多いＣＳＲ作業に適した装置を提供する事である。
【００１３】
この発明のさらに別の目的は、ＢＮを採用したものであって、かつＢＮのための煩瑣なパラメータ学習を必要としない装置を提供する事である。
【課題を解決するための手段】
この発明のある局面に従った音声認識装置は、音声認識のためのハイブリッド隠れマルコフ／ベイジアンネットワーク（ＨＭＭ／ＢＮ）モデルを記憶するための記憶装置を含む。隠れマルコフモデル（ＨＭＭ）は時間的な音声の特徴をモデリングするのに用いられ、ベイジアンネットワーク（ＢＮ）は状態確率モデルをあらわすのに用いられる。この装置はさらに、記憶装置に記憶されたＨＭＭ／ＢＮモデルを用いて入来する音声データをデコードするための音声デコーダを含む。
【００１４】
ＢＮの状態確率モデルは、変数Ｘを含んでもよく、条件付確率Ｐ（Ｙ｜Ｑ）は以下の式によって計算され、
【００１５】
【数３】

ただし、Ｙは一連の観測パラメータを表わし、Ｑは状態変数を表わし、ＸはＱとは独立してＹの値に影響を与える所定の要素を反映する変数を表わし、ｘはＸがとり得る値の一つであり、Ｎ（ｘ）はＸがとり得る値の数である。
【００１６】
好ましくは、変数Ｘは環境ノイズ、話者認識情報、または話者の母語を表わしてもよい。
【００１７】
状態確率モデルは変数ＮとＳとをさらに含んでもよく、条件付確率Ｐ（Ｙ｜Ｑ）は以下で計算され、
【００１８】
【数４】

ただし、Ｙは一連の観測パラメータを表わし、Ｑは状態変数を表わし、ＮおよびＳはＱとは独立してＹの値に影響を与える所定の要素を反映する変数を表わし、ｎおよびｓはＮおよびＳがそれぞれとり得る値の一つであり、Ｎ（ｎ，ｓ）はＮおよびＳがとり得る値の組合せの数である。
【００１９】
変数Ｎはノイズの種類を表わし、変数Ｓは入来する音声データの信号対雑音比を表わしてもよい。
【００２０】
この発明の別の局面は、上述の音声認識装置としてコンピュータを動作させる、コンピュータで実行可能な音声認識プログラムに関する。
【００２１】
この発明のさらに別の局面は、上述の音声認識装置としてコンピュータを動作させる、コンピュータで実行可能な音声認識プログラムを記憶する、コンピュータ読取可能な記憶媒体に関する。
【００２６】
【発明の実施の形態】
−ハイブリッドＨＭＭ／ＢＮモデリング−
多くの場合、音声認識でＢＮを使用するのは、ＨＭＭをダイナミックなＢＮとして表わそうとする考えに基づいている。このような表現を図１に示す。ここで、Ｑ_tは状態変数であり、Ｙ_tは時刻ｔ＝1，2，3，4，…での連続な観測変数である。枝は変数間の確率的依存を表わす。状態インスタンス間の枝はＨＭＭ遷移確率を表わし、状態インスタンスと観測インスタンスとの間の枝はＨＭＭ状態の条件付分布を表わす。以下の図において、四角で囲った変数は離散的であり、丸で囲った変数は連続である。ハッチングした丸／四角は観測可能な変数を示す。
【００２７】
Ｙ＝ｙ₁，…，ｙ_T、Ｍは本発明のモデルとしたとき、入力される観測シーケンスＰ（Ｙ｜Ｍ）の尤度を得る必要がある場合、ＨＭＭおよびＢＮの表現にＢＮ推論アルゴリズムを用いる必要がある。この作業の間に、ネットワークのサイズが入力シーケンスＴのサイズに合わせて調整され、その後、ネットワーク全体からＰ（Ｙ｜Ｍ）が推論される。
【００２８】
状態ノード間の枝を切断した場合を想定する。こうする事により、各時刻ｔに対応する、図２に示すように複数の独立したＢＮが得られる。時間遷移（切断された枝）が従来のＨＭＭにより支配されるとすれば、これらのＢＮを適切なＨＭＭ状態に割当てる事で時間指標をなくす事ができる。ＢＮは全て同じ共通の構造を有するので、それらを図３のように単一のＢＮとして表わす事ができる。図３では、変数Ｑは音声学的モデル内の全てのＨＭＭの状態指標（Ｓ_ij）の値をとり、状態確率分布Ｐ（Ｙ｜Ｑ＝Ｓ_ij）は枝によって表わされる。
【００２９】
こうして、ガウスの混合ではなく、状態分布モデルとしてＢＮを有するように従来のＨＭＭを修正した。ＨＭＭとＢＮとをこのように組合せる事で、ＨＭＭ／ＢＮモデルが階層的となる。ＢＮは最下層にあり、ＨＭＭは最上層にある。なお、状態変数Ｑ（図３）はＢＮに対しては観測可能となるが、上のＨＭＭレベルでは、依然として隠されたままである。
【００３０】
状態ＢＮは付加的な知識を表わす他のランダム変数に容易に拡張可能である。簡単な作業とは言いがたいデータからの学習によってではなく、変数間の関係に関する発明者らの知識に従って、拡張ＢＮのグラフィック構造を課する事ができる。この実施例は煩瑣なパラメータ学習とは無縁である。
【００３１】
拡張状態ＢＮの可能な構造のいくつかを図４に示す。たとえば、変数Ｘはこの図では環境ノイズの種類を表わす事ができ、他のＷおよびＺ変数は話者のＩＤ（識別情報）および話者の母語を表わす事ができる。
【００３２】
このＨＭＭ／ＢＮモデルで認識を行なう場合には、従来のＨＭＭと同様に、各状態Ｑ＝q_ijについてＰ（ｙ｜Ｑ）を計算する必要がある。ただし、「ｉ」はＨＭＭ指標であり、「ｊ」はｉ番目のＨＭＭの状態指標である。この値をＢＮ確率モデルから推論する事ができ、これを行うアルゴリズムとして、正確なものと近似的なものとを含め多数の推論アルゴリズムがある。単純なＢＮでは、図４（ａ）に示すように、「力任せ」法さえも適用可能である。このＢＮに対する同時確率モデルは、以下のようにチェーンルールで表わす事ができる。
【００３３】
【数５】

ＸおよびＱは独立した変数なので、「Ｐ（Ｘ｜Ｑ）＝Ｐ（Ｘ）」が成立し、従って上の式は以下のように書換える事ができる。
【００３４】
【数６】

従って、求める確率Ｐ（Ｙ｜Ｑ）はＸに対するマージナライゼーションにより、以下のようになる。
【００３５】
【数７】

実用上、多くの場合には全てのＸ＝ｘについてＰ（Ｘ）は同じであると仮定できるので、式（３）を次のように変形できる。
【００３６】
【数８】

ここで、Ｎ（ｘ）は変数Ｘがとり得る値の数である。
【００３７】
ＢＮパラメータのトレーニングは、従来のＨＭＭ状態パラメータのトレーニングとほぼ同様に、各状態について独立して行なう事ができる。トレーニングの詳細は図１０を参照して後述する。
【００３８】
−ノイズの多い音声認識システムにおけるＨＭＭ／ＢＮモデル−
音声がノイズを含む場合、音声の特徴ベクトルはその分布を変え、その変化はノイズの種類とともにＳＮＲ（信号対雑音比）の値にも依存する。このため、この依存性を図５に示すような種類の状態ＢＮで表わす事ができる。ここで、ＮとＳとはそれぞれ、ノイズの種類とＳＮＲ値とを示す、隠れた離散的変数である。この場合、状態の尤度は式（３）を導出したのと同じ方法で解析的に表現できる。
【００３９】
多くの場合、事前確率Ｐ（Ｎ）およびＰ（Ｓ）は、ノイズの各種類および各ＳＮＲ値について等しいと合理的に仮定できるので、以下のとおりとなる。
【００４０】
【数９】

語モデルおよびサブワードモデルも、従来のＨＭＭの場合と同様に作成される。デコードもまた、デコーダを変更する事なく、標準的なＨＭＭベースのシステムと同様に行なう事ができる。Ｎ（ｎ，s）はＮとＳのとり得る組合せの数を示す。
【００４１】
この実施例のＨＭＭ／ＢＮモデルの構造を図６に要約して示す。
【００４２】
−コンピュータでの実現例−
図７はこの実施例の全体図である。図７を参照して、このシステムは、トレーニングデータ１１０とＨＭＭ／ＢＮモデル１１２とに基づいてＨＭＭ／ＢＮハイブリッドモデル１１４をトレーニングするためのトレーニングシステム１００と、ＨＭＭ／ＢＮハイブリッドモデル１１４を記憶するための媒体１０４と、コンピュータシステムで実現され、音声データ１２０を媒体１０４に記憶されたＨＭＭ／ＢＮハイブリッドモデル１１４でデコード（音声認識１２２）し、認識された音声１２４を出力するための音声認識システム（音声デコーダ）１０２とを含む。
【００４３】
図８はこのコンピュータシステム３０を概略的に示し、図９はシステム３０をブロック図形式で示す。図８を参照して、このコンピュータシステム３０は、ＦＤ（フレキシブルディスク）ドライブ５２およびＣＤ−ＲＯＭ（コンパクトディスク読出専用メモリ）ドライブ５０を有するコンピュータ４０と、キーボード４６と、マウス４８と、モニタ４２とを含む。
【００４４】
図９を参照して、コンピュータ４０は、ＦＤドライブ５２およびＣＤ−ＲＯＭドライブ５０に加えて、ＣＰＵ（中央処理装置）５６と、ＣＰＵ５６、ＦＤドライブ５２およびＣＤ−ＲＯＭドライブ５０に接続されたバス６６と、ブートアッププログラムおよびＨＭＭデコードプログラム等を記憶する読出専用メモリ（ＲＯＭ）５８と、バス６６に接続され、プログラム命令、システムプログラム、およびＨＭＭ／ＢＮハイブリッドモデルデータを記憶するランダムアクセスメモリ（ＲＡＭ）６０とを含む。
【００４５】
ここでは示さないが、コンピュータ４０はさらにローカルエリアネットワーク（ＬＡＮ）への接続を提供するネットワークアダプタボードを含んでもよい。音声認識をリアルタイムで行なう場合には、コンピュータシステム３０はさらにマイクロフォンとオーディオキャプチャボードとを含むことになる。
【００４６】
コンピュータシステム３０に音声認識を行なわせるプログラムは、ＣＤ−ＲＯＭドライブ５０またはＦＤドライブ５２に挿入されるＣＤ−ＲＯＭ６２または図示しないＦＤに記憶され、さらにハードディスク５４に転送される。またはこれに代えて、プログラムは図示しないネットワークを通じてコンピュータ４０に送信されハードディスク５４に記憶されてもよい。プログラムは実行の際にＲＡＭ６０にロードされる。ＣＤ−ＲＯＭ６２、ＦＤ、またはネットワークを介してＲＡＭ６０にプログラムを直接ロードしてもよい。
【００４７】
プログラムは、コンピュータ４０にこの実施例の音声認識を行なわせるいくつかの命令を含む。この方法を行なわせるのに必要な基本的機能のいくつかはコンピュータ４０のオペレーティングシステム（ＯＳ）またはサードパーティのプログラム、もしくはコンピュータ４０にインストールされるＨＭＭツールキット等のモジュールにより提供されるので、このプログラムはこの実施例の方法を実現するのに必要な機能全てを必ずしも含まなくてよい。このプログラムは、命令のうち、所望の結果が得られるように制御されたやり方で適切な機能または「ツール」を呼出す事により音声認識プロセスを実行する命令のみを含んでいるだけでよい。コンピュータシステム３０の動作は周知であるので、ここでは繰り返さない。
【００４８】
音声認識システム１０２は従来のＨＭＭデコードプログラムで実現される。この実施例で新規な点は、ＨＭＭ／ＢＮハイブリッドモデル１１４が媒体１０４上に記憶されている事である。
【００４９】
−ＨＭＭ／ＢＮモデルでのトレーニングと認識−
ＨＭＭ／ＢＮモデルのトレーニングには、ＨＭＭ／ＮＮトレーニングと同じアプローチを採用する事ができる。これはビタビトレーニングアルゴリズムに基づくものである。全体の制御フローを図１０に示す。まず始めに、ＨＭＭと状態ベイジアンネットワークとのトポロジーを選択する。モデルを初期化し、特徴抽出ステップ１５０で入力音声１４０から抽出した特徴１４４に対し、状態分離ステップ１５２で、ビタビアライメントが基本ＨＭＭモデル１４２を用いた基本認識部を使用して行なわれる。
【００５０】
状態の分離は状態ベイジアンネットワークのためのトレーニングデータを生成するのに用いられ、つぎにこの状態ベイジアンネットワークはステップ１５４、１５６および１６０でトレーニングされる。このビタビトレーニング手法では、トレーニングの時間的な部分と静的な部分とが分離される。この処理では、ステップ１５８で終了条件が満たされるまで、ステップ１５４および１５６のＢＮトレーニングと、埋込まれたトレーニング処理であるステップ１６０の遷移確率の再推定とを交互に繰返す。
【００５１】
このＨＭＭ／ＢＮモデルで認識を行なう場合、従来のＨＭＭと同様に、通常のビタビデコードアルゴリズムが用いられる。ここで、各状態Ｑ＝q_ij（図６を参照）についてＰ（ｙ｜Ｑ）を計算する必要がある。ここでｉはＨＭＭ指標であり、ｊはｉ番目のＨＭＭの状態指標である。この値を、標準的推論アルゴリズムを用いてＢＮ確率モデルから推論する事ができる。
【００５２】
−Ａｕｒｏｒａ２タスクでの評価−
この実施例の音声認識システムをＡｕｒｏｒａ２タスクで評価する実験を行なった。これらの実験では、公式のＡｕｒｏｒａ２タスクで示唆されている評価用シナリオに忠実に従った。最も関心があったのは、ＨＭＭ／ＢＮシステムを複数条件でトレーニングされたＨＭＭシステムと比較する事であった。ＨＭＭ／ＢＮ状態の条件付き分布のトレーニングにあたっては、トレーニングデータをノイズの種類とＳＮＲ値とで分け、ＨＴＫ（ＨＭＭツールキット、ＨＭＭ処理のためのソフトウェアツールキット）を用いて各条件についてのパラメータを別個にトレーニングした。特徴ベクトル、語モデル、状態数、実験条件等の全ての他のシステムパラメータは同一にした。なお、ＨＭＭ／ＢＮシステムでは、何らかの適合やノイズに強い方法を用いたわけではない。二つのシステムでの主な機能的相違点は、ＨＭＭ／ＢＮシステムが音声の特徴およびノイズの隠れた依存性を探求する事である。
【００５３】
テストセットＡ（トレーニングデータと同じノイズ種類）とテストセットＢ（異なるノイズ）の認識結果を表１にまとめて示す。理解できるとおり、ＨＭＭ／ＢＮシステムの性能は閉じたノイズ条件（Ａセット）ではかなり高く、ずっと複雑なシステムでこの作業について得られる最新の結果に迫っている。
【００５４】
【表１】

この評価は、ノイズの種類とＳＮＲ値とを付加的なパラメータとして加え、これらの依存性を検討する事で、結果として得られるスペクトル特徴パラメータの誤差が３６．４％（＝（１２．７１−８．０８）＊１００／１２．７１）減少する事を示している。
【００５５】
Ｂセットの条件についていえば、性能の劣化が見られる。これは、音声スペクトル特徴の分布のミスマッチに加えて、このＨＭＭ／ＢＮシステムには新たなノイズの依存性に関する知識が得られないという事実で説明がつく。他方で、複数条件ＨＭＭシステムでは、状態ガウス混合は多ノイズおよびＳＮＲ条件から複雑な分布をあまりうまくモデル化できていない事が明らかである。しかしながら、このデータとモデル分布のミスマッチにはある種の平滑化の効果があり、これによって、見られないデータから一般化するというこのモデルの能力が高まる。
【００５６】
[ハイブリッドＨＭＭ／ＢＮモデルの応用]
明らかに、提案されたハイブリッドＨＭＭ／ＢＮモデルはノイズの多い音声認識システムのみでなく、観測や隠れた特徴が新たに得られる事で性能上の利益が得られる他の多くの場合に適用可能である。このアプローチは、種々の空間からの特徴を組合せ、その間の依存性を検討する事により、システムのモデル化能力を高めるという、より一般的なフレームワークという方がより正確である。とくに興味深いのは、ＢＮの隠れた変数の確率が推論できる可能性である。このように、ＨＭＭ／ＢＮシステムはこれらの付加的パラメータの認識に用いる事ができる。
【００５７】
例えば、ある付加的な隠れ変数Ｘが多言語システムでの言語を表わす場合、各フレームについてＰ（Ｘ｜Ｑ）を計算し、これらの確率を、入力された発話全体にわたり累積する事ができる。その場合、ｘ＝ａｒｇｍａｘ_x Ｐ（ｘ｜Ｑ_s）、Ｑsは最高の仮定状態シーケンス、となるｘは、その発話がなされた言語として最も確率の高い言語を示す。従って、多言語の音声認識に加えて、このようなシステムは言語認識をも行なう事ができる。なお、関数ｘ＝ａｒｇｍａｘ_x Ｐ（ｘ｜Ｑ_s）は、Ｐ（ｘ｜Ｑ_s）を最大にするｘを示す。
【００５８】
[この実施の形態の効果]
この実施の形態では、ＨＭＭとＢＮとを単一のモデル内で組合せ、ＨＭＭとＢＮとの両者の長所を活かしている。ハイブリッドＨＭＭ／ＢＮモデルにより、音声認識システムに他の情報を容易に加える事が可能になり、最小限の費用でその性能を高める事ができる。さらに、ＨＭＭ／ＢＮモデルは従来のＨＭＭと同様に、サブワードの音声ユニットを表わす事ができる。こうして、ＢＮフレームワークを語彙数の多い連続した音声認識に用いる事が可能になる。
【図面の簡単な説明】
【図１】ＨＭＭをＤＢＮとして表わす模式図である。
【図２】各時刻ｔでの複数ＢＮの模式図である。
【図３】通常のＢＮ構造を示す、状態ＢＮの模式図である。
【図４】（ａ）は離散的変数を一つ付加した状態ＢＮの模式図であり、（ｂ）はより複雑な構造の状態ＢＮの模式図である。
【図５】ノイズおよびＳＮＲ変数を有する状態ＢＮの模式図である。
【図６】この発明の一実施の形態に係るＨＭＭ／ＢＮハイブリッドモデル構造の模式図である。
【図７】この発明の一実施の形態の音声認識システムのブロック図である。
【図８】この実施の形態の音声認識システムのプログラムを実行するコンピュータシステムの外観図である。
【図９】図８のコンピュータシステムのブロック図である。
【図１０】この実施の形態のＨＭＭ／ＢＮモデルのトレーニングプログラムの制御の流れを示すフローチャートである。
【符号の説明】
１００トレーニングシステム、１０２音声認識システム、１０４記憶媒体、１１０トレーニングデータ、１１２ＨＭＭ／ＢＮモデル、１１４ＨＭＭ／ＢＮハイブリッドモデル、１２０音声データ、１２２音声認識、１２４出力

[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition system, and more particularly to a speech recognition system based on an HMM (Hidden Markov Model) employing a Bayesian network (BN).
[0002]
[Prior art]
For many years, ever since HMMs were introduced to speech recognition, the conditional distribution P(y | Q) of the observations for each state Q has been modeled by a mixture of probability density functions (discrete HMMs are not considered here). Gaussian and Laplacian probability density functions (pdfs) are often used for this purpose. Later, hybrid HMM/NN (hidden Markov model / neural network) systems were proposed, in which a neural network is used to estimate the likelihood of each HMM state for a given input observation.
[0003]
In many cases, features extracted from the speech spectrum form these observations. However, research on speech recognition has shown that using only these features is not sufficient to achieve high system performance. For this reason, many researchers have sought to include additional features that represent some other knowledge in the HMM system.
[0004]
Tokuda et al. (Non-Patent Document 1, listed below) propose a multi-space probability distribution in order to model additional pitch information. In most cases, however, different approaches are taken depending on the characteristics of the additional features. Until now there has been no common, sufficiently flexible framework for addressing this problem.
[0005]
Recently, Bayesian networks (BNs) have attracted researchers' attention as an alternative to HMMs. BNs are well known and well studied in the field of artificial intelligence. In speech recognition, however, they are a relatively new research topic.
[0006]
A Bayesian network is a directed acyclic graph whose nodes represent events. In a Bayesian network, the branch from the first node (node A) to the second node (node B) indicates that A is the cause of B. Each branch (AB) is assigned a “link” matrix indicating the probability P (B | A), which specifies the probability of each value of B for each value of A.
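As an illustration of the "link" matrix described above, the following minimal sketch builds a two-node network A → B in Python and computes the marginal P(B). The node values and all probabilities are hypothetical and chosen only for illustration; nothing here is taken from the patent itself.

```python
# Hypothetical two-node network A -> B; all numbers are illustrative only.
p_a = {"a0": 0.7, "a1": 0.3}  # prior P(A)

# "Link" matrix P(B | A): one row per value of A, one entry per value of B.
link_b_given_a = {
    "a0": {"b0": 0.9, "b1": 0.1},
    "a1": {"b0": 0.2, "b1": 0.8},
}


def marginal_b(prior_a, link):
    """Return P(B) = sum_a P(a) * P(B | a)."""
    p_b = {}
    for a, pa in prior_a.items():
        for b, pba in link[a].items():
            p_b[b] = p_b.get(b, 0.0) + pa * pba
    return p_b


p_b = marginal_b(p_a, link_b_given_a)
```

Each row of the link matrix sums to one, so the resulting P(B) is again a proper distribution.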
[0007]
A Bayesian network can model a complex joint probability distribution of many different (discrete or continuous) random variables in a well-structured and easily represented manner. In particular, the dynamic BN (DBN) proposed by Dean et al. (Non-Patent Document 2, listed below) is well suited to modeling temporal speech features.
[0008]
In some of the first reports on DBNs in speech recognition, such as those by Zweig et al. and Daoudi et al. (Non-Patent Documents 3 and 4, listed below), DBNs were used as word models in isolated-word recognition tasks. In these works, the DBN is regarded as a generalization of the HMM that can easily incorporate additional knowledge, such as articulatory features, sub-band correlation, and speaking style, in addition to speech spectrum information. Stephenson et al. (Non-Patent Document 5) have reported that, within the DBN framework, phonetic features can easily be supplemented with pitch information.
[0009]
Another advantage of Bayesian networks is that additional features that are difficult to estimate reliably during recognition can be left hidden, i.e., unobserved.
[0010]
[Non-Patent Document 1]
K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, “Hidden Markov Models based on multi-space probability distribution for pitch pattern modeling,” in Proc. ICASSP, pp. 229-232, 1999.
[Non-Patent Document 2]
T. Dean and K. Kanazawa, “Probabilistic temporal reasoning,” in AAAI, pp. 524-528, 1988.
[Non-Patent Document 3]
G. Zweig and S. Russell, “Probabilistic modeling with Bayesian Networks for automatic speech recognition,” in Proc. ICSLP, pp. 3010-3013, 1998.
[Non-Patent Document 4]
K. Daoudi, D. Fohr, and C. Antoine, “A new approach for multi-band speech recognition based on probabilistic graphical models,” in Proc. ICSLP, vol. 1, pp. 329-332, 2000.
[Non-Patent Document 5]
T. Stephenson, M. Mathew, and H. Bourlard, “Modeling auxiliary information in Bayesian Network based ASR,” in Proc. Eurospeech, pp. 2765-2768, 2001.
[Non-Patent Document 6]
K. Daoudi, D. Fohr, and C. Antoine, “Continuous multi-band speech recognition using Bayesian Networks,” in Proc. ASRU, 2001.
[Problems to be solved by the invention]
Despite these attractive properties of BNs, their application to speech recognition is still limited to small, isolated-word recognition tasks. The reason is that existing algorithms for BN parameter learning and inference are not well suited to continuous speech recognition (CSR), and to large-vocabulary CSR tasks in particular. Although Daoudi et al. (Non-Patent Document 6) report an extension of the DBN word model that allows recognition of continuously spoken digits, the computational cost becomes excessive even when the task vocabulary grows to just a few hundred words.
[0011]
Accordingly, one of the objects of the present invention is to provide a device that adopts BN and is suitable for CSR.
[0012]
Another object of the present invention is to provide an apparatus that employs a BN and is suitable for large-vocabulary CSR tasks.
[0013]
Still another object of the present invention is to provide an apparatus that employs BN and does not require cumbersome parameter learning for BN.
[Means for Solving the Problems]
A speech recognition apparatus according to one aspect of the present invention includes a storage device for storing a hybrid hidden Markov / Bayesian network (HMM/BN) model for speech recognition. The hidden Markov model (HMM) is used to model temporal speech features, and the Bayesian network (BN) is used to represent the state probability model. The apparatus further includes a speech decoder for decoding incoming speech data using the HMM/BN model stored in the storage device.
[0014]
The BN state probability model may include a variable X, and the conditional probability P (Y | Q) is calculated by the following equation:
[0015]
[Equation 3]
$$P(Y \mid Q) = \frac{1}{N(x)} \sum_{x} P(Y \mid x, Q)$$
where Y represents a sequence of observation parameters, Q represents the state variable, X represents a variable reflecting a predetermined factor that affects the value of Y independently of Q, x is one of the values that X can take, and N(x) is the number of values that X can take.
[0016]
Preferably, the variable X may represent environmental noise, speaker identification information, or the speaker's native language.
[0017]
The state probability model may further include variables N and S, and the conditional probability P (Y | Q) is calculated as follows:
[0018]
[Equation 4]
$$P(Y \mid Q) = \frac{1}{N(n, s)} \sum_{n, s} P(Y \mid n, s, Q)$$
where Y represents a sequence of observation parameters, Q represents the state variable, N and S represent variables reflecting predetermined factors that affect the value of Y independently of Q, n and s are each one of the values that N and S can take, respectively, and N(n, s) is the number of combinations of values that N and S can take.
[0019]
The variable N may represent the type of noise, and the variable S may represent the signal-to-noise ratio of the incoming speech data.
[0020]
Another aspect of the present invention relates to a computer-executable speech recognition program that causes a computer to operate as the speech recognition device described above.
[0021]
Yet another aspect of the present invention relates to a computer-readable storage medium that stores a computer-executable speech recognition program that causes a computer to operate as the speech recognition apparatus described above.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
-Hybrid HMM / BN modeling-
In many cases, the use of BNs in speech recognition is based on the idea of representing an HMM as a dynamic BN. Such a representation is shown in FIG. 1. Here, Q_t is the state variable and Y_t is the continuous observation variable at time t = 1, 2, 3, 4, and so on. A branch represents a probabilistic dependence between variables. The branches between state instances represent the HMM transition probabilities, and the branches between state instances and observation instances represent the conditional distributions of the HMM states. In the following figures, variables enclosed in squares are discrete and variables enclosed in circles are continuous. Hatched circles and squares indicate observable variables.
[0027]
Let Y = y_1, ..., y_T be the input observation sequence and M the model of the present invention. When the likelihood P(Y | M) of the input observation sequence has to be obtained, a BN inference algorithm must be applied to the BN representation of the HMM. In doing so, the size of the network is adjusted to the length T of the input sequence, and P(Y | M) is then inferred from the entire network.
[0028]
Suppose that the branches between the state nodes are cut. This yields a plurality of independent BNs, one for each time t, as shown in FIG. 2. If the time transitions (the cut branches) are governed by a conventional HMM, the time index can be eliminated by assigning these BNs to the appropriate HMM states. Since all the BNs share the same structure, they can be represented as a single BN, as shown in FIG. 3. In FIG. 3, the variable Q takes as its values the state indices (S_ij) of all the HMMs in the phonetic model, and the state probability distribution P(Y | Q = S_ij) is represented by the branch.
[0029]
In this way, the conventional HMM is modified to have a BN, rather than a Gaussian mixture, as its state distribution model. Combining the HMM and the BN in this way makes the HMM/BN model hierarchical: the BN is at the bottom layer and the HMM is at the top layer. Note that the state variable Q (FIG. 3) is observable for the BN but is still hidden at the upper HMM level.
[0030]
The state BN can easily be extended with other random variables representing additional knowledge. The graphical structure of the extended BN can be imposed according to the inventors' knowledge of the relationships between the variables, rather than learned from data, which is far from a simple task. This embodiment is therefore free of cumbersome parameter learning.
[0031]
Some possible structures of the extended state BN are shown in FIG. 4. For example, the variable X in this figure can represent the type of environmental noise, and the other variables W and Z can represent the speaker's ID (identification information) and the speaker's native language.
[0032]
When recognition is performed with this HMM/BN model, P(y | Q) must be calculated for each state Q = q_ij, as with a conventional HMM, where i is the HMM index and j is the state index of the i-th HMM. This value can be inferred from the BN probability model, and many inference algorithms, both exact and approximate, are available for doing so. For a simple BN such as the one shown in FIG. 4 (a), even the brute-force method is applicable. The joint probability model for this BN can be expressed by the chain rule as follows.
[0033]
[Equation 5]
$$P(Y, X, Q) = P(Y \mid X, Q)\, P(X \mid Q)\, P(Q)$$
Since X and Q are independent variables, “P (X | Q) = P (X)” is satisfied, and therefore the above equation can be rewritten as follows.
[0034]
[Equation 6]
$$P(Y, X, Q) = P(Y \mid X, Q)\, P(X)\, P(Q)$$
Accordingly, the desired probability P(Y | Q) is obtained by marginalization over X as follows.
[0035]
[Equation 7]
$$P(Y \mid Q) = \sum_{x} P(x)\, P(Y \mid x, Q) \tag{3}$$
In practice, in many cases it can be assumed that P (X) is the same for all X = x, so equation (3) can be modified as follows.
[0036]
[Equation 8]
$$P(Y \mid Q) = \frac{1}{N(x)} \sum_{x} P(Y \mid x, Q)$$
Here, N(x) is the number of values that the variable X can take.
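The marginalization over X described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the observation y is a scalar, each conditional P(y | x, q) is assumed to be a single one-dimensional Gaussian (the text does not fix the form of these distributions), and the names `gaussian_pdf`, `state_likelihood`, and `components_q` as well as all numbers are hypothetical.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a one-dimensional Gaussian N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def state_likelihood(y, components, prior=None):
    """P(y | Q=q) = sum_x P(x) * P(y | x, q).

    components maps each value x of X to the (mean, var) of P(y | x, q).
    With prior=None, P(x) is taken to be uniform, i.e. 1 / N(x).
    """
    n_x = len(components)
    total = 0.0
    for x, (mean, var) in components.items():
        p_x = 1.0 / n_x if prior is None else prior[x]
        total += p_x * gaussian_pdf(y, mean, var)
    return total


# Hypothetical state q whose hidden variable X takes two values.
components_q = {"clean": (0.0, 1.0), "noisy": (1.5, 2.0)}
uniform = state_likelihood(0.3, components_q)
explicit = state_likelihood(0.3, components_q, prior={"clean": 0.5, "noisy": 0.5})
```

With equal priors the two calls give the same value, which is the simplification from the prior-weighted sum to the plain average over the N(x) values of X.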
[0037]
The BN parameters can be trained independently for each state, in much the same way as conventional HMM state parameters. Details of the training are described later with reference to FIG. 10.
[0038]
-HMM / BN model in noisy speech recognition system-
When speech contains noise, the distribution of the speech feature vectors changes, and the change depends on the SNR (signal-to-noise ratio) value as well as on the type of noise. This dependency can therefore be represented by a state BN of the kind shown in FIG. 5. Here, N and S are hidden discrete variables indicating the type of noise and the SNR value, respectively. In this case, the likelihood of the state can be expressed analytically in the same way as Equation (3) was derived.
[0039]
In many cases, prior probabilities P (N) and P (S) can be reasonably assumed to be equal for each type of noise and each SNR value, and therefore:
[0040]
[Equation 9]
$$P(Y \mid Q) = \frac{1}{N(n, s)} \sum_{n, s} P(Y \mid n, s, Q)$$
Here, N(n, s) is the number of possible combinations of values of N and S.
The word models and sub-word models are built in the same manner as with conventional HMMs. Decoding can also be performed in the same way as in a standard HMM-based system, without changing the decoder.
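The state likelihood with hidden noise-type and SNR variables, under the equal-prior assumption stated above, could be computed along the following lines. This is a hedged sketch: it works in the log domain with a log-sum-exp for numerical stability (a common practice, not something the text prescribes), assumes scalar observations and one Gaussian per (n, s) combination, and uses hypothetical names (`log_state_likelihood`, `params_q`), noise labels, and parameter values.

```python
import math
from itertools import product


def log_gaussian(y, mean, var):
    """Log density of a one-dimensional Gaussian N(mean, var) at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def log_state_likelihood(y, params, noise_types, snr_values):
    """log P(y | Q=q) = log( (1 / N(n, s)) * sum_{n,s} P(y | n, s, q) ).

    params maps each (noise_type, snr) pair to the (mean, var) of P(y | n, s, q).
    """
    combos = list(product(noise_types, snr_values))
    logs = [log_gaussian(y, *params[(n, s)]) for n, s in combos]
    peak = max(logs)  # log-sum-exp: factor out the largest term
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return log_sum - math.log(len(combos))


# Hypothetical parameters for a single state q (illustrative values only).
noise_types = ["subway", "babble"]
snr_values = [20, 10]
params_q = {
    ("subway", 20): (0.0, 1.0),
    ("subway", 10): (0.5, 1.5),
    ("babble", 20): (0.2, 1.2),
    ("babble", 10): (0.8, 2.0),
}
log_like = log_state_likelihood(0.4, params_q, noise_types, snr_values)
```

Because the function returns an ordinary per-state log-likelihood, it can be handed to an unmodified HMM decoder in place of a Gaussian-mixture score.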
[0041]
The structure of the HMM/BN model of this embodiment is summarized in FIG. 6.
[0042]
-Example of implementation on a computer-
FIG. 7 is an overall view of this embodiment. Referring to FIG. 7, the system includes a training system 100 for training an HMM/BN hybrid model 114 based on training data 110 and an HMM/BN model 112; a medium 104 for storing the HMM/BN hybrid model 114; and a speech recognition system (speech decoder) 102, implemented on a computer system, which decodes speech data 120 with the HMM/BN hybrid model 114 stored on the medium 104 (speech recognition 122) and outputs recognized speech 124.
[0043]
FIG. 8 schematically illustrates the computer system 30, and FIG. 9 illustrates the system 30 in block diagram form. Referring to FIG. 8, the computer system 30 includes a computer 40 having an FD (flexible disk) drive 52 and a CD-ROM (compact disk read-only memory) drive 50, a keyboard 46, a mouse 48, and a monitor 42.
[0044]
Referring to FIG. 9, in addition to the FD drive 52 and the CD-ROM drive 50, the computer 40 includes a CPU (central processing unit) 56; a bus 66 connected to the CPU 56, the FD drive 52, and the CD-ROM drive 50; a read-only memory (ROM) 58 for storing a boot-up program, an HMM decoding program, and the like; and a random access memory (RAM) 60, connected to the bus 66, for storing program instructions, system programs, and HMM/BN hybrid model data.
[0045]
Although not shown here, the computer 40 may further include a network adapter board that provides a connection to a local area network (LAN). When speech recognition is performed in real time, the computer system 30 further includes a microphone and an audio capture board.
[0046]
A program for causing the computer system 30 to perform speech recognition is stored on a CD-ROM 62 or an FD (not shown) that is inserted into the CD-ROM drive 50 or the FD drive 52, respectively, and is then transferred to the hard disk 54. Alternatively, the program may be transmitted to the computer 40 through a network (not shown) and stored on the hard disk 54. The program is loaded into the RAM 60 at the time of execution. The program may also be loaded into the RAM 60 directly from the CD-ROM 62, the FD, or the network.
[0047]
The program includes several instructions that cause the computer 40 to perform the speech recognition of this embodiment. Because some of the basic functions needed to carry out this method are provided by the operating system (OS) of the computer 40, by third-party programs, or by modules such as an HMM toolkit installed on the computer 40, the program need not include every function necessary to implement the method of this embodiment. The program need only include those instructions that carry out the speech recognition process by calling the appropriate functions or "tools" in a controlled manner so as to achieve the desired result. The operation of the computer system 30 is well known and is not described further here.
[0048]
The speech recognition system 102 is realized by a conventional HMM decoding program. What is new in this embodiment is that the HMM / BN hybrid model 114 is stored on the medium 104.
[0049]
-Training and recognition with HMM / BN model-
The HMM/BN model can be trained with the same approach as HMM/NN training, which is based on the Viterbi training algorithm. The overall control flow is shown in FIG. 10. First, the topologies of the HMM and of the state Bayesian network are selected. The model is then initialized, and in state separation step 152 a Viterbi alignment is performed on the features 144, which are extracted from the input speech 140 in feature extraction step 150, by a basic recognizer that uses the basic HMM model 142.
[0050]
The state separation is used to generate training data for the state Bayesian network, which is then trained in steps 154, 156 and 160. In this Viterbi training method, the temporal part and the static part of the training are separated. The process alternates between the BN training in steps 154 and 156 and the re-estimation of the transition probabilities in step 160, which is an embedded training process, until the termination condition is satisfied in step 158.
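The temporal part of the training, re-estimating the transition probabilities, can be illustrated with a short Python sketch. It assumes one common approach, counting transitions along Viterbi-aligned state paths and normalizing each row; the text does not specify how step 160 is implemented, so the function `reestimate_transitions` and its uniform fallback for unseen states are illustrative assumptions.

```python
from collections import Counter


def reestimate_transitions(paths, n_states):
    """Estimate P(j | i) by counting i -> j transitions in aligned state paths."""
    counts = Counter()
    for path in paths:
        for i, j in zip(path, path[1:]):
            counts[(i, j)] += 1
    trans = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        if row_total == 0:
            # State never left in the data: fall back to a uniform row (assumption).
            trans.append([1.0 / n_states] * n_states)
        else:
            trans.append([counts[(i, j)] / row_total for j in range(n_states)])
    return trans


# Two hypothetical aligned utterances over a two-state left-to-right model.
aligned_paths = [[0, 0, 1, 1], [0, 1, 1, 1]]
transitions = reestimate_transitions(aligned_paths, n_states=2)
```

The static part (the state BN parameters) would be trained separately from the frames assigned to each state, and the two steps alternated until the termination condition holds.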
[0051]
When recognition is performed using this HMM / BN model, a normal Viterbi decoding algorithm is used as in the conventional HMM. Here, it is necessary to calculate P (y | Q) for each state Q = q _ij (see FIG. 6). Here, i is an HMM index, and j is a state index of the i-th HMM. This value can be inferred from the BN probability model using standard inference algorithms.
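To illustrate why the decoder itself needs no change, the sketch below implements a textbook log-domain Viterbi decoder whose only link to the acoustic model is a callable `log_obs(j, y)` returning log P(y | Q=j). In the hybrid model that value would be inferred from the state BN; here a plain unit-variance Gaussian stands in for it. All names and numbers are hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi(observations, log_init, log_trans, log_obs):
    """Return (best_log_score, best_state_path).

    log_init[j]     : log initial probability of state j
    log_trans[i][j] : log transition probability from state i to state j
    log_obs(j, y)   : log P(y | Q=j), supplied by the state probability model
    """
    n = len(log_init)
    score = [log_init[j] + log_obs(j, observations[0]) for j in range(n)]
    back = []
    for y in observations[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            score.append(prev[best_i] + log_trans[best_i][j] + log_obs(j, y))
        back.append(ptr)
    last = max(range(n), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the final frame
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


# Stand-in state model: unit-variance Gaussians with hypothetical means.
means = [0.0, 3.0]


def log_obs(j, y):
    return -0.5 * (math.log(2.0 * math.pi) + (y - means[j]) ** 2)


log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
best_score, best_path = viterbi([0.1, -0.2, 2.9, 3.1], log_init, log_trans, log_obs)
```

Swapping `log_obs` for a function that marginalizes over hidden BN variables changes the scores but not a single line of the decoder.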
[0052]
-Evaluation in Aurora 2 task-
Experiments were conducted to evaluate the speech recognition system of this embodiment on the Aurora 2 task. These experiments faithfully followed the evaluation scenario suggested for the official Aurora 2 task. The main point of interest was to compare the HMM/BN system with an HMM system trained on multiple conditions. To train the conditional distributions of the HMM/BN states, the training data were divided by noise type and SNR value, and the parameters for each condition were trained separately using HTK (the HMM Toolkit, a software toolkit for HMM processing). All other system parameters, such as the feature vectors, word models, number of states, and experimental conditions, were kept identical. Note that no adaptation or noise-robust method of any kind was used in the HMM/BN system. The main functional difference between the two systems is that the HMM/BN system explores the hidden dependency between the speech features and the noise.
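The per-condition training described above, in which the data are split by noise type and SNR value and each condition is trained separately, might look roughly like the following in Python. This sketch is not HTK and makes simplifying assumptions: scalar features, one Gaussian per (state, noise, SNR) bucket, maximum-likelihood estimates, and a hypothetical variance floor. The name `train_state_bn` and the sample data are illustrative only.

```python
from collections import defaultdict


def train_state_bn(frames, var_floor=1e-3):
    """Estimate a Gaussian P(y | n, s, q) for every (q, n, s) seen in training.

    frames is an iterable of (state, noise_type, snr, y) tuples; the state label
    would come from a Viterbi alignment and y is a scalar feature value.
    """
    buckets = defaultdict(list)
    for state, noise, snr, y in frames:
        buckets[(state, noise, snr)].append(y)
    params = {}
    for key, values in buckets.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        params[key] = (mean, max(var, var_floor))  # floor avoids zero variance
    return params


# Hypothetical aligned training frames.
training_frames = [
    (0, "subway", 20, 1.0),
    (0, "subway", 20, 3.0),
    (0, "babble", 10, 5.0),
]
bn_params = train_state_bn(training_frames)
```

Each bucket is estimated independently, which mirrors the statement that the parameters for each condition were trained separately.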
[0053]
Table 1 summarizes the recognition results for test set A (the same noise types as the training data) and test set B (different noise types). As can be seen, the performance of the HMM/BN system is quite high under the closed noise conditions (set A) and approaches the latest results obtained on this task with much more complex systems.
[0054]
[Table 1]

This evaluation shows that adding the noise type and the SNR value as additional parameters and taking their dependencies into account reduces the error obtained with the spectral feature parameters by 36.4% (= (12.71 − 8.08) × 100 / 12.71).
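The relative reduction quoted above can be checked with a one-line calculation. The helper name `relative_reduction` is hypothetical; the two error figures are the ones given in the text.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent: (baseline - improved) * 100 / baseline."""
    return (baseline - improved) * 100.0 / baseline


reduction = relative_reduction(12.71, 8.08)
print(f"{reduction:.1f}%")  # prints 36.4%
```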
[0055]
Under the conditions of set B, performance degrades. This is explained by the fact that, in addition to the mismatch in the distribution of the speech spectral features, the HMM/BN system has no knowledge of the dependencies on the new noise types. On the other hand, it is clear that in the multi-condition HMM system the state Gaussian mixtures do not model the complex distributions arising from the many noise and SNR conditions very well. However, this mismatch between the data and the model distributions has a kind of smoothing effect, which improves the model's ability to generalize to unseen data.
[0056]
[Application of hybrid HMM / BN model]
Clearly, the proposed hybrid HMM/BN model is applicable not only to noisy speech recognition systems but also to many other cases in which performance can benefit from newly available observations or hidden features. The approach is more accurately described as a more general framework that increases the modeling capability of a system by combining features from different spaces and taking the dependencies between them into account. Of particular interest is the possibility of inferring the probabilities of the BN's hidden variables. In this way, the HMM/BN system can be used to recognize these additional parameters.
[0057]
For example, if an additional hidden variable X represents the language in a multilingual system, P(X | Q) can be calculated for each frame and these probabilities accumulated over the entire input utterance. Then x = argmax_x P(x | Q_s), where Q_s is the best hypothesized state sequence, indicates the language in which the utterance was most probably spoken. Thus, in addition to multilingual speech recognition, such a system can also perform language recognition. Here, argmax_x P(x | Q_s) denotes the value of x that maximizes P(x | Q_s).
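One way to read the accumulation described above is sketched below. It rests on an explicit assumption that goes beyond the text: the per-frame quantity is taken to be the posterior P(x | y_t, q_t), proportional to P(x) P(y_t | x, q_t) with a uniform P(x), and the log posteriors are summed along the best state sequence. The Gaussian form, the function `identify_hidden_value`, the language labels, and all parameters are hypothetical.

```python
import math


def log_gaussian(y, mean, var):
    """Log density of a one-dimensional Gaussian N(mean, var) at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def identify_hidden_value(observations, state_path, params, values):
    """Accumulate per-frame log posteriors of X along the best state path.

    params maps (state, x) to the (mean, var) of P(y | x, state).
    Returns (most probable value of X, accumulated log posteriors).
    """
    totals = {x: 0.0 for x in values}
    for y, q in zip(observations, state_path):
        logs = {x: log_gaussian(y, *params[(q, x)]) for x in values}
        peak = max(logs.values())
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logs.values()))
        for x in values:
            totals[x] += logs[x] - log_norm  # log P(x | y, q) with uniform P(x)
    best = max(values, key=lambda x: totals[x])
    return best, totals


# Hypothetical two-language model with two states.
languages = ["ja", "en"]
lang_params = {
    (0, "ja"): (0.0, 1.0), (0, "en"): (2.0, 1.0),
    (1, "ja"): (1.0, 1.0), (1, "en"): (3.0, 1.0),
}
best_language, accumulated = identify_hidden_value(
    [0.1, 0.9, 1.1], [0, 1, 1], lang_params, languages
)
```

The state path passed in would be the one returned by the decoder, so language recognition comes as a by-product of ordinary recognition.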
[0058]
[Effect of this embodiment]
In this embodiment, the HMM and the BN are combined in a single model that exploits the advantages of both. The hybrid HMM / BN model makes it easy to add other information to the speech recognition system and to improve its performance at minimal cost. Furthermore, like a conventional HMM, the HMM / BN model can represent sub-word speech units. In this way, the BN framework can be used for large-vocabulary continuous speech recognition.
[Brief description of the drawings]
FIG. 1 is a schematic diagram showing an HMM as DBN.
FIG. 2 is a schematic diagram of a plurality of BNs at each time t.
FIG. 3 is a schematic diagram of a state BN showing a normal BN structure.
FIG. 4A is a schematic diagram of a state BN to which one discrete variable is added, and FIG. 4B is a schematic diagram of a state BN having a more complicated structure.
FIG. 5 is a schematic diagram of a state BN having noise and SNR variables.
FIG. 6 is a schematic diagram of an HMM / BN hybrid model structure according to an embodiment of the present invention.
FIG. 7 is a block diagram of a speech recognition system according to an embodiment of the present invention.
FIG. 8 is an external view of a computer system that executes a program of the speech recognition system of this embodiment.
FIG. 9 is a block diagram of the computer system of FIG. 8.
FIG. 10 is a flowchart showing the flow of control of the training program for the HMM / BN model of this embodiment.
[Explanation of symbols]
100 training system, 102 speech recognition system, 104 storage medium, 110 training data, 112 HMM / BN model, 114 HMM / BN hybrid model, 120 audio data, 122 speech recognition, 124 output

Claims

A speech recognition apparatus comprising: a storage device that stores a hybrid hidden Markov model / Bayesian network (HMM / BN) model for speech recognition, wherein the hidden Markov model (HMM) is used to model temporal speech features and the Bayesian network (BN) represents a state probability model; and a speech decoder that decodes incoming speech data using the HMM / BN model stored in the storage device.

The speech recognition apparatus according to claim 1, wherein the state probability model of the BN further includes a variable X, and the conditional probability P(Y | Q) is calculated as follows:

where Y represents a series of observation parameters, Q represents a state variable, X represents a variable reflecting a predetermined factor that affects the value of Y independently of Q, x represents a value that X can take, and N(x) is the number of values that X can take.

The speech recognition apparatus according to claim 2, wherein the variable X represents environmental noise.

The speech recognition apparatus according to claim 2, wherein the variable X represents speaker identification information.

The speech recognition apparatus according to claim 2, wherein the variable X represents a speaker's mother tongue.

The speech recognition apparatus according to claim 1, wherein the state probability model further includes variables N and S, and the conditional probability P(Y | Q) is calculated as follows:

where Y represents a series of observation parameters, Q represents a state variable, N and S represent variables reflecting predetermined factors that affect the value of Y independently of Q, n and s each represent one of the values that N and S can take, respectively, and N(n, s) is the number of combinations of values that N and S can take.

The speech recognition apparatus according to claim 6, wherein the variable N represents a type of noise.

The speech recognition apparatus according to claim 6 or 7, wherein the variable S represents a signal-to-noise ratio of incoming speech data.

A computer-executable speech recognition program for causing a computer to operate as the speech recognition apparatus according to claim 1.

A computer-readable storage medium storing a computer-executable speech recognition program for causing a computer to operate as the speech recognition apparatus according to claim 1.