JP3911246B2

JP3911246B2 - Speech recognition apparatus and computer program

Info

Publication number: JP3911246B2
Application number: JP2003056694A
Authority: JP
Inventors: 茂彦大西; 玄一郎菊井; 博史山本
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-03-04
Filing date: 2003-03-04
Publication date: 2007-05-09
Anticipated expiration: 2023-03-04
Also published as: JP2004264719A

Description

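[0001]
[Technical Field of the Invention]
The present invention relates to speech recognition technology, and in particular to a technique for improving the sentence recognition rate by making effective use of language models obtained from a large-scale corpus.
[0002]
[Prior Art]
In speech translation, an error in just one word during speech recognition often has a fatal effect on the translation result. Avoiding this problem requires improving the sentence recognition rate of speech recognition. One way to improve the sentence recognition rate is to impose strong constraints by modeling whole sentences globally, for example with a finite state automaton (FSA). This requires the following two things.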
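[0003]
(1) Construction of a model that broadly covers the linguistic expressions of the task
(2) Development of an efficient search method that uses such a model
Regarding the former, attempts have been made to rescue the portions that a global model cannot cover by combining it with a loosely constrained model such as an N-gram (see Non-Patent Document 1). However, these attempts have been limited to small-scale experiments.
[0004]
[Non-Patent Document 1]
Tsurumi et al., "A study of a speech recognition algorithm combining word N-grams and a network grammar," Proceedings of the Acoustical Society of Japan, 3-9-8, pp. 145-146, Acoustical Society of Japan 2002 Autumn Meeting
[Problems to Be Solved by the Invention]
When a corpus containing a large number of sentences (for example, on the order of 160,000 sentences) is used together with an N-gram to improve the sentence recognition rate, simply modeling that many sentences with an FSA clearly makes both the model size and the search space enormous. A technique is therefore needed that keeps the model size and the search space from growing excessively while combining a large-scale corpus with a loosely constrained model such as an N-gram.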
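[0005]
Accordingly, an object of the present invention is to provide a speech recognition apparatus that can improve the sentence recognition rate by combining a large-scale corpus with a loosely constrained model such as an N-gram, while keeping the model size and the search space from becoming enormous.
[0006]
Another object of the present invention is to provide a speech recognition apparatus that combines a large-scale corpus with a loosely constrained model such as an N-gram and that, using small models, achieves substantially the same improvement in sentence recognition rate as when the large-scale corpus itself is used.
[0007]
Still another object of the present invention is to provide a speech recognition apparatus that achieves substantially the same improvement in sentence recognition rate as when a large-scale corpus is used, by combining language models generated from small corpora, obtained by dividing the large-scale corpus according to a predetermined criterion, with a loosely constrained model such as an N-gram.
[0008]
A further object of the present invention is to provide a speech recognition apparatus that achieves substantially the same improvement in sentence recognition rate as when a large-scale corpus is used, by combining language models generated from small corpora, obtained by dividing the large-scale corpus according to a predetermined criterion, with a loosely constrained model such as an N-gram, and by selecting the appropriate ones from among the language models generated from the small corpora and performing speech recognition again.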
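[0009]
[Means for Solving the Problems]
A speech recognition apparatus according to a first aspect of the present invention includes: first speech recognition means for performing speech recognition on input speech based on a language model generated from a predetermined corpus and outputting a speech recognition result; estimation means for estimating, from the speech recognition result of the first speech recognition means, the number of predetermined speech units contained in the input speech; means for storing a plurality of language models for speech recognition, each generated from one of a plurality of subsets into which the sentences in the predetermined corpus are classified according to the number of the predetermined speech units each sentence contains; first language model selection means for selecting a predetermined plurality of the language models according to the number of speech units estimated by the estimation means; second speech recognition means for performing speech recognition on the input speech based on each of the language models selected by the first language model selection means and outputting a plurality of speech recognition results; and recognition result selection means for selecting and outputting, according to a predetermined selection criterion, one of the speech recognition result of the first speech recognition means and the speech recognition results of the second speech recognition means.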
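[0010]
Preferably, the language model generated from the predetermined corpus is an N-gram language model (where N is an integer of 1 or more).
[0011]
More preferably, the estimation means of the speech recognition apparatus includes means for estimating, from the speech recognition result of the first speech recognition means, the number of syllables contained in the input speech, and the first language model selection means includes second language model selection means for selecting a predetermined plurality of the language models according to the number of syllables estimated by the means for estimating.
[0012]
Preferably, the second language model selection means includes third language model selection means for selecting, according to the number of syllables estimated by the means for estimating, a predetermined plurality of the language models that includes the one corresponding to the estimated number of syllables.
[0013]
More preferably, the third language model selection means includes fourth language model selection means for selecting, according to the number of syllables estimated by the means for estimating, the language model corresponding to the estimated number of syllables and the language models corresponding to syllable counts in a predetermined range before, after, or both before and after the estimated number of syllables.
[0014]
More preferably, the fourth language model selection means includes fifth language model selection means for selecting, according to the number of syllables estimated by the means for estimating, the language model corresponding to the estimated number of syllables and the language models corresponding to syllable counts in ranges of equal width before and after the estimated number of syllables.
[0015]
More preferably, the recognition result selection means includes: distance calculation means for calculating a DP (dynamic programming) matching distance between each of the speech recognition results of the second speech recognition means and the speech recognition result of the first speech recognition means; means for selecting a predetermined number of the speech recognition results of the second speech recognition means in ascending order of the DP matching distance calculated by the distance calculation means; acoustic score adjustment means for adjusting, by a predetermined adjustment formula, the acoustic score of each of the speech recognition results selected by the means for selecting; and means for comparing the plurality of acoustic scores output by the acoustic score adjustment means with the acoustic score of the recognition result of the first speech recognition means, selecting the largest acoustic score, and outputting the speech recognition result corresponding to the selected acoustic score as the speech recognition result of the speech recognition apparatus.
[0016]
More preferably, the acoustic score adjustment means includes means for adjusting the log acoustic score of each speech recognition result of the second speech recognition means selected by the means for selecting by multiplying it by 1 + α (where α is a predetermined positive constant).
[0017]
A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any of the speech recognition apparatuses described above.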
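[0018]
[Embodiments of the Invention]
[First Embodiment]
<Outline>
First, an outline of the speech recognition method of this embodiment is given. In this embodiment, all sentences in the FSA training corpus are first divided into subsets according to the number of syllables each sentence contains, and a plurality of FSAs are generated, one for each subset. Speech recognition is then performed as follows.
(1) First, speech recognition is performed using an N-gram.
[0019]
(2) The number of syllables in the sentence obtained in (1) is determined.
[0020]
(3) Appropriate FSAs (in most cases more than one) are selected from the per-subset FSAs according to that value.
[0021]
(4) Using the selected FSAs, speech recognition is performed on the same input as in (1).
[0022]
(5) From the speech recognition results obtained in (1) and (4), the one sentence judged best according to a certain criterion is selected as the final speech recognition result.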
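Steps (1) to (5) can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not the implementation described in the source: the function name `two_pass_recognize`, the `Result` tuple, and all callables passed in are hypothetical, and silently skipping syllable counts for which no FSA exists is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]  # (recognized sentence, log acoustic score)


def two_pass_recognize(
    audio: bytes,
    ngram_recognize: Callable[[bytes], Result],
    fsa_recognizers: Dict[int, Callable[[bytes], Result]],
    count_syllables: Callable[[str], int],
    choose: Callable[[Result, List[Result]], Result],
    width: int = 5,
) -> Result:
    # (1) First pass with the loosely constrained N-gram model.
    baseline = ngram_recognize(audio)
    # (2) Estimate the syllable count from the first-pass sentence.
    ns = count_syllables(baseline[0])
    # (3) Pick the FSAs whose syllable count lies within ns +/- width.
    #     Assumption: counts with no FSA are simply skipped.
    selected = [fsa_recognizers[n]
                for n in range(ns - width, ns + width + 1)
                if n in fsa_recognizers]
    # (4) Second pass: re-recognize the same input with each selected FSA.
    candidates = [recognize(audio) for recognize in selected]
    # (5) Choose one final sentence by the supplied criterion.
    return choose(baseline, candidates)
```

Passing the recognizers and the selection criterion in as callables keeps the sketch independent of any particular recognition engine.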
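[0023]
<Configuration>
FIG. 1 shows a functional block diagram of a speech recognition apparatus 30 according to this embodiment. Referring to FIG. 1, the speech recognition apparatus 30 includes: a speech input unit 40 that receives a speech signal and converts it into a digital speech data sequence suitable for speech recognition; a multi-class composite 2-gram language model 42 that can accept only the sentences appearing in a large-scale training corpus prepared in advance (that is, only sentences contained in this training corpus become recognition results); a speech recognition unit 44 that performs speech recognition on the speech data sequence supplied from the speech input unit 40 using the language model 42 and outputs a word sequence representing the maximum-likelihood recognition result sentence; and a syllable count calculation unit 46 that calculates the number of syllables NS contained in the word sequence output by the speech recognition unit 44. The syllable count NS calculated by the syllable count calculation unit 46 is in effect an estimate of the number of syllables in the sentence represented by the input speech.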
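[0024]
The training corpus described above is the Basic Travel Expression Corpus (BTEC), and its contents are as follows. In the table below, a "closed sentence" is a test-set sentence that also appears in the training set.
[0025]
[Table 1]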

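The speech recognition apparatus 30 further includes: an FSA storage unit 48 that stores a plurality of FSAs 60, generated respectively for the subsets obtained by dividing the training corpus according to the number of syllables in each sentence; a selection unit 50 that selects, from the FSAs 60 in the FSA storage unit 48, the FSAs 52 corresponding to syllable counts within a predetermined width centered on the syllable count NS calculated by the syllable count calculation unit 46; a plurality of speech recognition units 54 that each use one of the FSAs 52 selected by the selection unit 50 to perform speech recognition on the same speech data sequence processed by the speech recognition unit 44 and output the resulting word sequences; and a selection unit 56 that selects, from the output of the speech recognition unit 44 and the outputs of the speech recognition units 54, the recognition result judged most appropriate according to a criterion described later, and outputs it as the final recognition result. Each of the speech recognition units 54 performs speech recognition with the same recognition engine as the speech recognition unit 44. Conversely, as long as the same speech recognition engine is used, any kind of recognizer may serve as the speech recognition units 44 and 54.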
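[0026]
Referring to FIG. 2, the selection unit 56 includes: a DP distance calculation unit 70 that calculates the DP matching distance between the phoneme sequence of each recognition result supplied from the speech recognition units 54 and the phoneme sequence of the recognition result supplied from the speech recognition unit 44; a DP distance comparison/selection unit 72 that, based on the DP matching distances calculated by the DP distance calculation unit 70, selects from the recognition results of the speech recognition units 54 the top three candidates whose phoneme sequences are closest to that of the recognition result of the speech recognition unit 44; a score adjustment unit 74 that adjusts the acoustic scores accompanying the top three recognition results selected by the DP distance comparison/selection unit 72 (the scores are obtained together with the recognition results during recognition) by multiplying them by (1 + α); and a score comparison/selection unit 76 that compares the adjusted acoustic scores of the top three recognition results output by the adjustment unit 74 with the acoustic score accompanying the recognition result of the speech recognition unit 44, and outputs the recognition result with the highest acoustic score as the final recognition result.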
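[0027]
The reason for dividing the corpus into subsets by the syllable count of each sentence and using the corresponding FSAs is as follows. As noted earlier, the training corpus contains a large number of sentences. Modeling the corpus as it is with a single FSA would make both the model size and the search space enormous. The FSA therefore has to be divided.
[0028]
When dividing the FSA, the FSAs must be selectable by a parameter that can be estimated accurately from the recognition result of the speech recognition unit 44 using the language model 42. In this embodiment, the syllable count NS of the input speech is adopted as that parameter. NS was adopted because it was found that, when only an N-gram is used as the language model, the number of syllables in the recognition result differs little from that of the correct sentence, whether or not the recognition result itself is correct. FIG. 3 shows the experimental result demonstrating this.
[0029]
FIG. 3 was obtained as follows. An N-gram (multi-class composite 2-gram) language model was created from the corpus used in this embodiment. For each test sentence, among the top 10 sentences obtained by speech recognition with that language model, the one whose syllable count differs most from that of the correct sentence was chosen, and the syllable count differences were plotted as a histogram. As FIG. 3 shows, test sentences for which the difference between the syllable count of the multi-class composite 2-gram recognition result and that of the correct sentence is within 5 account for 99.5 percent of the total. It is therefore quite feasible to identify (estimate) the syllable count of the correct sentence from the syllable count of the multi-class composite 2-gram recognition result.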
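[0030]
Accordingly, the FSAs 60 are obtained from the subsets produced by dividing the corpus according to the syllable count of each sentence. FIG. 4 shows the distribution of the number of training sentences in each set when a training corpus of about 100,000 distinct sentences is divided into subsets by syllable count. The horizontal axis shows the number of syllables, and the vertical axis shows the number of training sentences in the set for that syllable count. Referring to FIG. 4, the distribution can be approximated by a gamma distribution with a peak at 16 syllables. No set contains more than 12,000 sentences, which is one eighth or less of the original training corpus. In total there were 65 FSA models, one per set. The language likelihood was the same for every sentence (word sequence).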
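The partitioning step described above might look like the following sketch. It only groups sentences; building an FSA from each group is outside its scope. The names `partition_by_syllables` and `toy_count` are hypothetical, and the toy counter (one space-separated token per syllable) is an assumption made so the example is self-contained; the source does not say how syllables are counted.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def partition_by_syllables(
    sentences: Iterable[str],
    count_syllables: Callable[[str], int],
) -> Dict[int, List[str]]:
    """Group distinct sentences by syllable count; each group would train one FSA."""
    subsets: Dict[int, List[str]] = defaultdict(list)
    for sentence in dict.fromkeys(sentences):  # drop duplicates, keep first-seen order
        subsets[count_syllables(sentence)].append(sentence)
    return dict(subsets)


def toy_count(sentence: str) -> int:
    """Toy stand-in: treat each space-separated token as one syllable."""
    return len(sentence.split())


corpus = ["ko n ni chi wa", "a ri ga to u", "ha i", "ha i"]
subsets = partition_by_syllables(corpus, toy_count)
```

Here the duplicate sentence is stored once, which mirrors the source's count of distinct sentences.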
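[0031]
The language model 42 is trained on the training corpus described above. In a speech recognition experiment performed by the speech recognition unit 44 alone with this language model on a test set from the same domain as the training corpus (a set of sentences used in travel situations, like the training corpus), the word recognition rate was 88.7% and the sentence recognition rate was 65.7%. The word recognition rate is high, whereas the sentence recognition rate is low. Because BTEC consists largely of formulaic sentences, many test sentences are closed sentences, yet only 85.1% of those were recognized completely correctly. The remaining 14.9% could therefore potentially be recognized correctly by additionally using, as in this embodiment, closed FSAs that model the training-set sentences. If all of them were recognized correctly, the overall sentence recognition rate would improve by 7.25 points.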
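The 7.25-point figure can be related to the share of closed sentences in the test set. In the relation below, c denotes the fraction of test sentences that are closed; this symbol is introduced here for illustration. Because Table 1 is not reproduced in this text, the value of c is inferred from the two stated figures rather than read from the source.

```latex
\Delta_{\max} = c \times (100 - 85.1) = 14.9\,c \ \text{points},
\qquad
\Delta_{\max} = 7.25 \;\Longrightarrow\; c = \frac{7.25}{14.9} \approx 0.487
```

Under this reading, a little under half of the test sentences would be closed sentences.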
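[0032]
In this embodiment, the speech recognition units 54 are implemented in software, as described later, so a single CPU performs speech recognition repeatedly. If the hardware allows, however, the recognition processes may be run in parallel. For example, with multiple CPUs the speech recognition processes can be executed simultaneously.
[0033]
<Operation>
With the above configuration, one multi-class composite 2-gram, which serves as the baseline, and a group of 65 FSAs are available as language models. The operation of the speech recognition apparatus 30 shown in FIGS. 1 and 2 is described below.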
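[0034]
Referring to FIG. 1, the speech input unit 40 converts the speech input into digital data suitable for speech recognition. This digital data is supplied to the speech recognition unit 44 and the speech recognition units 54. First, the speech recognition unit 44 performs speech recognition, using the multi-class composite 2-gram language model 42 to obtain the maximum-likelihood recognition result sentence. This sentence is supplied to the syllable count calculation unit 46 and the selection unit 56. The acoustic score obtained during this recognition is also supplied to the selection unit 56. Here, the acoustic score is the value obtained by computing, for each phoneme (or syllable), the likelihood of the recognition result given the input waveform and multiplying these likelihoods together. Normally its logarithm is used. In this specification, "acoustic score" means the log acoustic score.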
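The definition of the log acoustic score can be written compactly. The symbols are introduced here for illustration and do not appear in the source: p_i is the likelihood obtained for the i-th phoneme (or syllable) of the recognition result, n is the number of such units, and S is the log acoustic score.

```latex
S = \log \prod_{i=1}^{n} p_i = \sum_{i=1}^{n} \log p_i
```

Working with the sum of logarithms rather than the product avoids numerical underflow when many small likelihoods are combined.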
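[0035]
The syllable count calculation unit 46 determines the syllable count NS of the supplied recognition result sentence and supplies NS to the selection unit 50.
[0036]
The selection unit 50 selects the FSAs 52 corresponding to the sentence sets whose syllable counts fall within NS ± 5. That is, the 11 FSAs 52 for syllable counts from NS - 5 to NS + 5 are selected.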
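The window selection can be sketched as a lookup over a table of models keyed by syllable count. The function name `select_fsas` and the placeholder model table are hypothetical. The source states that there are 65 FSAs but not which syllable counts they cover, so keys 1 to 65 are an assumption, as is returning fewer than 11 models when the window runs past the available counts.

```python
from typing import Dict, List, TypeVar

M = TypeVar("M")


def select_fsas(models: Dict[int, M], ns: int, width: int = 5) -> List[M]:
    """Return the models for syllable counts ns - width through ns + width."""
    return [models[n] for n in range(ns - width, ns + width + 1) if n in models]


# Placeholder table: 65 models, assumed here to cover syllable counts 1 to 65.
models = {n: "FSA-%d" % n for n in range(1, 66)}
selected = select_fsas(models, ns=16)      # models for counts 11 through 21
near_edge = select_fsas(models, ns=3)      # window is cut off below count 1
```

The `width` parameter corresponds to the value 5 used in this embodiment and could be reduced when resources are limited.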
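[0037]
The speech recognition units 54 use these 11 FSAs 52 to perform speech recognition separately on the digital data supplied from the speech input unit 40. This yields 11 recognition result sentences, one from each FSA 52 corresponding to the sentence sets with syllable counts from NS - 5 to NS + 5. These outputs are supplied to the selection unit 56, together with the acoustic scores obtained during recognition.
[0038]
Referring to FIG. 2, as a result of the above processing, the DP distance calculation unit 70 receives one recognition result from the speech recognition unit 44 and 11 recognition results from the speech recognition units 54. The DP distance calculation unit 70 calculates the DP matching distance between the phoneme sequence of each of the 11 FSA recognition results and the phoneme sequence of the recognition result of the speech recognition unit 44, and supplies the distances to the DP distance comparison/selection unit 72.
[0039]
The DP distance comparison/selection unit 72 selects the three recognition results with the smallest of these 11 DP matching distances and supplies them to the adjustment unit 74.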
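The distance calculation and top-three selection can be illustrated as follows. The source does not specify the cost scheme of its DP matching, so this sketch assumes a plain edit distance with unit costs for insertion, deletion, and substitution. The function names `dp_distance` and `closest` are hypothetical.

```python
from typing import List, Sequence


def dp_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two phoneme sequences (unit costs assumed)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def closest(baseline: Sequence[str],
            candidates: List[Sequence[str]],
            k: int = 3) -> List[int]:
    """Indices of the k candidates nearest to the baseline phoneme sequence."""
    order = sorted(range(len(candidates)),
                   key=lambda i: dp_distance(candidates[i], baseline))
    return order[:k]
```

Because Python's sort is stable, candidates at equal distance keep their original order; how the source breaks such ties is not stated.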
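[0040]
The adjustment unit 74 multiplies the acoustic scores accompanying the top three recognition results selected by the DP distance comparison/selection unit 72 by (1 + α), where α is a parameter for adjusting the acoustic score. α is positive, preferably from 0.05 to 0.25, more preferably from 0.1 to 0.25, and still more preferably from 0.15 to 0.25. In the experiments, the best result was obtained with α = 0.2. However, the appropriate value of α can vary with the recognition target, so α is not limited to this value. The adjustment unit 74 supplies the acoustic scores adjusted by multiplication by (1 + α) to the score comparison/selection unit 76.
[0041]
The score comparison/selection unit 76 compares the acoustic score of the recognition result of the speech recognition unit 44 with the three adjusted acoustic scores supplied from the adjustment unit 74, and selects and outputs the recognition result with the highest acoustic score as the final recognition result.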
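The adjustment and final comparison amount to a few lines. This is a sketch with hypothetical names (`pick_final`, `Result`). The source does not state the sign convention of its log scores, so the multiplication by (1 + α) is applied literally, and the returned tuple carries the adjusted score for an FSA candidate.

```python
from typing import List, Tuple

Result = Tuple[str, float]  # (sentence, log acoustic score)


def pick_final(baseline: Result, top_fsa: List[Result], alpha: float = 0.2) -> Result:
    """Scale each FSA score by (1 + alpha), then return the highest-scoring result."""
    adjusted = [(sentence, score * (1.0 + alpha)) for sentence, score in top_fsa]
    # The baseline score is compared unadjusted, as in the description.
    return max([baseline] + adjusted, key=lambda r: r[1])
```

The default `alpha` of 0.2 is the value reported as best in the experiments; the source notes that other recognition targets may call for a different value.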
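[0042]
FIGS. 5 and 6 show the results of an experiment in which α was varied from 0.05 to 0.25 in steps of 0.05 with the apparatus of this embodiment and the recognition result was selected as described above. FIG. 5 shows how the sentence recognition rate changes with α. FIG. 6 shows, for reference, the corresponding word recognition rate.
[0043]
As FIG. 5 shows, the highest sentence recognition rate in this embodiment was obtained at α = 0.20. The value shown at α = 0 in FIG. 5 is the sentence recognition rate when only the multi-class composite 2-gram language model 42 is used. As FIG. 5 shows, the sentence recognition rate at α = 0.20 is about 4.3 points higher than at α = 0. Calculated over closed sentences only, this corresponds to an improvement of about 8.8 points in the sentence recognition rate.
[0044]
As described above, in the speech recognition apparatus 30 of this embodiment, the corpus is divided into a plurality of sentence sets by syllable count, and an FSA is generated as a language model for each set. Then, according to the syllable count NS of the baseline recognition result obtained with the multi-class composite 2-gram language model, the FSAs corresponding to the sentence sets that stand in a specific relation to NS are selected, and speech recognition is performed again. From those recognition results, the three with the smallest DP matching distance to the phoneme sequence of the baseline result are selected, their acoustic scores are adjusted by the factor 1 + α, and they are compared with the baseline result; the one with the highest acoustic score is selected as the final recognition result. As a result, the sentence recognition rate improved considerably over the baseline, as described above.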
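[0045]
<Implementation by Computer>
The speech recognition apparatus of this embodiment can be implemented by a computer with speech processing capability. FIG. 7 shows the external appearance of the speech recognition apparatus 30 implemented by a computer. FIG. 8 is a hardware block diagram of the speech recognition apparatus 30.
[0046]
Referring to FIG. 7, the speech recognition apparatus 30 includes a computer 80 equipped with a CD-ROM (Compact Disc Read-Only Memory) drive 90 and an FD (Flexible Disk) drive 92, and a monitor 82, a microphone 84, a keyboard 86, and a mouse 88, all connected to the computer 80.
[0047]
Referring to FIG. 8, in addition to the CD-ROM drive 90 and the FD drive 92, the computer 80 includes a CPU (Central Processing Unit) 96, a ROM (Read-Only Memory) 98, a RAM (Random Access Memory) 100, a hard disk 94, and a sound board 108 connected to the microphone 84. These are all interconnected by a bus 106. A CD-ROM 102 is loaded into the CD-ROM drive 90, and an FD 104 is loaded into the FD drive 92.
[0048]
The computer program having the control structure described below is distributed recorded on a computer-readable recording medium such as the CD-ROM 102 or the FD 104. For example, the CD-ROM 102 is loaded into the CD-ROM drive 90 and the program is then copied from the CD-ROM 102 to the hard disk 94. At run time, the program is read from the hard disk 94 and loaded into the RAM 100; the CPU 96 reads and executes instructions from the address specified by a program counter (not shown) and writes the results to the RAM 100 or the hard disk 94. The CPU 96 then updates the program counter according to the execution result, reads the next instruction from the RAM 100 based on the program counter value, and executes it. The CPU 96 executes the computer program according to this operating principle. The FSA storage unit 48 is implemented by the hard disk 94, the ROM 98, the RAM 100, or the like.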
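[0049]
FIG. 9 shows the overall control structure of this computer program. Referring to FIG. 9, the program includes: step 110 of performing speech recognition on the speech input using the multi-class composite 2-gram language model 42; step 112 of calculating the syllable count NS of the word sequence obtained in step 110; step 114 of selecting, according to the syllable count NS calculated in step 112, the 11 FSAs obtained from the sentence sets with syllable counts from NS - 5 to NS + 5, performing speech recognition with the selected FSAs, and outputting 11 speech recognition results together with their acoustic scores; step 116 of selecting one speech recognition result, by the method described above, from the result of step 110 and the 11 results obtained in step 114; and step 118 of outputting the selected speech recognition result. When step 118 completes, processing for one speech input is complete.
[0050]
FIG. 10 shows details of the FSA recognition step 114 of FIG. 9. Referring to FIG. 10, step 114 includes: step 140 of calculating the minimum syllable count MIN (= NS - 5) of the sentence sets to be used; step 142 of calculating the maximum syllable count MAX (= NS + 5); step 144 of setting the loop variable I to MIN; and step 146 of determining whether I exceeds MAX and branching accordingly. If I > MAX at step 146, the routine ends. Otherwise control proceeds to step 148.
[0051]
The FSA recognition step 114 further includes: step 148 of performing recognition using the FSA obtained from the sentence set with syllable count I; step 150 of outputting the recognition result of step 148 together with its acoustic score; and step 152 of adding 1 to the loop variable I. After step 152, control returns to step 146.
[0052]
FIG. 11 shows details of the recognition result selection step 116 of FIG. 9. Referring to FIG. 11, step 116 includes: step 180 of calculating, for each of the 11 FSA recognition results, the DP matching distance between its phoneme sequence and the phoneme sequence of the recognition result obtained with the multi-class composite 2-gram language model; step 182 of selecting the three results with the smallest DP matching distances calculated in step 180; step 184 of adjusting the acoustic scores of the three recognition results selected in step 182 by multiplying them by (1 + α); and step 186 of selecting, as the final recognition result, the speech recognition result with the largest acoustic score among the adjusted acoustic scores calculated in step 184 and the acoustic score obtained with the speech recognition result of step 110 of FIG. 9.
[0053]
Those skilled in the art will readily understand that the functions of the speech recognition apparatus 30 described above can be realized by causing the computer 80 to execute a computer program having the above control structure.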
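[0054]
<Modifications>
In the embodiment above, a multi-class composite 2-gram language model is used as the language model 42. The present invention is not, however, limited to multi-class composite 2-grams. In general, any N-gram language model (where N is an integer of 1 or more) can be used as the language model 42.
[0055]
In the embodiment above, the FSAs are selected by the number of syllables in the recognition result of the speech recognition unit 44. The present invention is not, however, limited to using the syllable count for FSA selection. Any unit may be used as long as the number of such units in the original utterance can be estimated with reasonable accuracy from the N-gram recognition result. For example, the number of phonemes in the recognition result may be used instead of syllables, or a unit similar in concept to the syllable, such as the mora in Japanese, may be used. If, depending on the performance of the recognition system, counting in some unit such as words yields counts close to the correct ones, that unit may be used.
[0056]
In the embodiment above, the selection unit 50 selects, from the FSAs 60 in the FSA storage unit 48, a total of 11 FSAs: the one for the syllable count NS calculated by the syllable count calculation unit 46 and five on each side of it. The present invention is not limited to such an embodiment. For example, only the FSA for NS and those before it, or only the FSA for NS and those after it, may be used. Nor is the number of selected FSAs limited to 11. In this embodiment the syllable count is used for FSA selection, and the experiment shown in FIG. 3 indicated that using 11 FSAs, five on each side of the estimated syllable count NS, almost always includes the FSA corresponding to the syllable count of the correct sentence. However, if hardware resources are limited or processing speed is constrained, fewer FSAs may be selected. If the speech unit to be estimated is a phoneme or the like rather than a syllable, experimental results different from FIG. 3 are naturally expected, and the number of FSAs to select may change accordingly.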
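[0057]
In the embodiment above, speech recognition with the FSAs is performed repeatedly on the same CPU. As noted earlier, however, if multiple CPUs are available, running them in parallel speeds up processing. In that case, the processes must be synchronized before step 116 of FIG. 9 starts. Step 116 may be started once all FSA recognition processes have finished. Alternatively, step 180 of FIG. 11 may be performed for each FSA recognition process as it finishes, with step 182 and the subsequent steps started once all recognition processes have finished.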
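The second synchronization option, computing the DP distance for each result as it arrives and starting the selection only after all recognizers have finished, can be sketched with Python's standard `concurrent.futures` module. All names here (`recognize_in_parallel`, `Recognizer`, `Phonemes`) are hypothetical, and the recognizers and distance function are passed in as callables. A thread pool is used for simplicity; for CPU-bound recognizers written in pure Python, a process pool would be the usual way to occupy several CPUs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence, Tuple

Phonemes = Sequence[str]
Recognizer = Callable[[bytes], Tuple[Phonemes, float]]  # returns (phonemes, score)


def recognize_in_parallel(
    audio: bytes,
    recognizers: List[Recognizer],
    baseline: Phonemes,
    distance: Callable[[Phonemes, Phonemes], int],
) -> List[Tuple[int, Phonemes, float]]:
    """Run every FSA recognizer concurrently and score each result as it finishes."""
    scored = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(recognize, audio) for recognize in recognizers]
        for future in as_completed(futures):
            # Step 180 for this result, done as soon as it is available.
            phonemes, score = future.result()
            scored.append((distance(phonemes, baseline), phonemes, score))
    # Every recognizer has finished here, so step 182 onward can start.
    return sorted(scored, key=lambda item: item[0])
```

Leaving the `with` block is the synchronization point: the executor waits for all submitted tasks before the sorted list is returned.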
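[0058]
Furthermore, in the embodiment above, the DP matching distance and the acoustic score are used for selection by the selection unit 56. The present invention is not limited to such an embodiment. For example, a machine learning system such as an SVM (Support Vector Machine) trained in advance on the selection method may be used.
[0059]
In the description above, the computer program is distributed recorded on a recording medium. The form of distribution is not limited to this, however. For example, the program may be distributed by communication over a wired or wireless network. It need not be in a form directly executable by a computer; it may be distributed as a source program and compiled on the computer into executable form. Furthermore, the program may be run such that at any one time only the portion relevant to execution is fetched from a remote location over the network into the RAM 100 and executed, without the whole program ever being stored at once in the RAM 100 or on the hard disk 94. Needless to say, all of these cases constitute practice of the present invention.
[0060]
The embodiments disclosed here are merely examples, and the present invention is not limited to them. The scope of the present invention is indicated by each claim, taking the detailed description of the invention into account, and includes all modifications within the meaning and scope equivalent to the wording of the claims.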
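[Brief Description of the Drawings]
[FIG. 1] A block diagram of the speech recognition apparatus 30 according to an embodiment of the present invention.
[FIG. 2] A more detailed block diagram of the selection unit 56 of the speech recognition apparatus 30 shown in FIG. 1.
[FIG. 3] A graph showing an example distribution of the difference between the number of syllables in the N-gram speech recognition result and the number of syllables in the correct sentence.
[FIG. 4] A graph showing the distribution of syllable counts of the sentences in the training corpus.
[FIG. 5] A graph showing the improvement in sentence recognition rate achieved by an embodiment of the present invention.
[FIG. 6] A graph showing the word recognition rate achieved by an embodiment of the present invention.
[FIG. 7] A view showing the external appearance of the computer 80 that implements the speech recognition apparatus 30 and its peripheral devices.
[FIG. 8] A hardware block diagram of the computer 80.
[FIG. 9] A flowchart showing the control structure of the speech recognition program executed on the computer 80.
[FIG. 10] A flowchart showing the detailed control structure of the FSA recognition step 114 shown in FIG. 9.
[FIG. 11] A flowchart showing the detailed control structure of the recognition result selection step 116 shown in FIG. 9.
[Explanation of Reference Numerals]
30: speech recognition apparatus; 40: speech input unit; 42: multi-class composite 2-gram language model; 44: speech recognition unit; 46: syllable count calculation unit; 48: FSA storage unit; 50: selection unit; 52, 60: FSA; 54: speech recognition unit; 56: selection unit; 70: DP distance calculation unit; 72: DP distance comparison/selection unit; 74: adjustment unit; 76: score comparison/selection unit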
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition technique, and more particularly to a technique for improving a sentence recognition rate by effectively using a language model obtained from a large-scale corpus.
[0002]
[Prior art]
In speech translation, an error of just one word in speech recognition often has a fatal effect on the translation result. In order to avoid such problems, it is necessary to improve the sentence recognition rate in speech recognition. One technique for improving the sentence recognition rate is to give a strong constraint by global modeling of the entire sentence using a finite state machine (FSA) or the like. For this, the following two are necessary.
[0003]
(1) Construction of a model that widely covers the language expression of the task
(2) Development of efficient search methods using such models
With respect to the former, an attempt has been made to save a portion that could not be covered by a global model by using it together with a model with a loose constraint such as N-gram (see Non-Patent Document 1). However, it is only a small experiment.
[0004]
[Non-Patent Document 1]
Tsurumi et al., “Examination of speech recognition algorithm using word N-gram and network grammar”, Acoustical Society of Japan Proceedings 3-9-8, pp. 145-146, Acoustical Society of Japan 2002 Fall Presentation
[Problems to be solved by the invention]
In order to improve the sentence recognition rate, Sentence When a corpus including a sentence of a large scale) and an N-gram are used in combination, it is clear that if such a large number of sentences are simply modeled by FSA, the model size and the search space become enormous. Therefore, there is a demand for a technique that suppresses the model size and the search space from becoming too large while using a large-scale corpus and a model with loose constraints such as N-gram.
[0005]
Therefore, an object of the present invention is to improve the sentence recognition rate by using a large-scale corpus and a loosely modeled model such as N-gram, while suppressing the size of the model and the enlargement of the search space. It is to provide a recognition device.
[0006]
Another object of the present invention is to use a large-scale corpus and a loosely-constrained model such as N-gram, and to achieve a sentence recognition rate substantially the same as when a large-scale corpus is used using a small model. It is to provide a voice recognition device capable of improving.
[0007]
Still another object of the present invention is to use a language model generated from a small corpus obtained by dividing a large-scale corpus according to a predetermined standard, and a model with a loose constraint such as an N-gram, thereby substantially increasing the scale. The object is to provide a speech recognition device capable of improving the sentence recognition rate similar to the case of using a corpus.
[0008]
Another object of the present invention is to combine a language model generated from a small corpus obtained by dividing a large-scale corpus according to a predetermined criterion with a loosely-constrained model such as an N-gram, and a language generated from a small corpus. By providing a speech recognition device that can improve the sentence recognition rate in the same way as when using a large-scale corpus by selecting an appropriate model and recognizing it again. is there.
[0009]
[Means for Solving the Problems]
A speech recognition apparatus according to a first aspect of the present invention includes first speech recognition means for performing speech recognition of an input speech and outputting a speech recognition result based on a language model generated from a predetermined corpus, An estimation means for estimating the number of predetermined speech units included in the input speech from the speech recognition result of one speech recognition means, and the number of predetermined speech units included in each sentence in the predetermined corpus Means for storing a plurality of language models for speech recognition, each generated from a plurality of subsets classified according to the number of speech units, and the number of speech units estimated by the estimating means among the plurality of language models And a first language model selecting means for selecting a plurality of predetermined languages, and voice recognition of the input speech based on the plurality of language models selected by the first language model selecting means. And performing a predetermined selection among the second voice recognition means for outputting a plurality of voice recognition results, the voice recognition result of the first voice recognition means, and the voice recognition result of the second voice recognition means And a recognition result selecting means for selecting and outputting one according to the reference.
[0010]
Preferably, the language model generated from the predetermined corpus is an N-gram (where N is an integer equal to or greater than 1) language model.
[0011]
More preferably, the estimation means of the speech recognition apparatus includes means for estimating the number of syllables included in the input speech from the speech recognition result of the first speech recognition means, and the first language model selection means is The second language model selecting means for selecting a predetermined number of the plurality of language models according to the number of syllables estimated by the estimating means.
[0012]
Preferably, the second language model selecting means includes a predetermined one including a plurality of language models corresponding to the estimated number of syllables according to the number of syllables estimated by the estimating means. A third language model selection means for selecting a plurality of selected language models.
[0013]
More preferably, the third language model selecting means estimates the number corresponding to the estimated number of syllables according to the number of syllables estimated by the estimating means among the plurality of language models. A fourth language model selection means for selecting one corresponding to the number of syllables in a predetermined range before and / or after the number of syllables.
[0014]
More preferably, the fourth language model selecting means estimates the number corresponding to the estimated number of syllables according to the number of syllables estimated by the estimating means among the plurality of language models. A fifth language model selecting means for selecting one corresponding to the number of syllables in the same range before and after the number of syllables.
[0015]
More preferably, the recognition result selection means calculates a DP (Dynamic Programming) matching distance between each of the voice recognition results of the second voice recognition means and the voice recognition result of the first voice recognition means. A distance calculating means; a means for selecting a predetermined number in order from a smaller DP matching distance calculated by the distance calculating means among the voice recognition results of the second voice recognizing means; and a means for selecting Output by the acoustic score adjusting means and the acoustic score adjusting means for adjusting each of the acoustic scores of the speech recognition results of the predetermined number of second speech recognizing means selected in accordance with a predetermined adjustment formula Comparing the plurality of acoustic scores to the acoustic score of the recognition result of the first speech recognition means, and selecting the largest acoustic score, Means for outputting a speech recognition result corresponding to the selected acoustic score as a speech recognition result by the speech recognition apparatus.
[0016]
More preferably, the acoustic score adjusting unit multiplies the logarithmic acoustic score of the speech recognition result of the second speech recognition unit selected by the selecting unit by 1 + α (where α is a predetermined positive constant). Including means for making adjustments accordingly.
[0017]
The computer program according to the second aspect of the present invention, when executed by a computer, causes the computer to operate as one of the voice recognition devices described above.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
[First embodiment]
<Outline>
The outline of the speech recognition method according to this embodiment will be described in advance. In the present embodiment, first, all sentences in the FSA training corpus are divided into subsets according to the number of syllables included in each sentence, and a plurality of FSAs corresponding to the subsets are generated. Then, voice recognition processing is performed as follows. That is,
(1) First, speech recognition is performed using N-grams.
[0019]
(2) The number of syllables included in the sentence resulting from (1) above is obtained.
[0020]
(3) Select an appropriate one (mostly a plurality) of the FSA subsets according to the value.
[0021]
(4) Perform voice recognition for the same input as in (1) using the selected FSA.
[0022]
(5) Among the speech recognition results obtained in the above (1) and (4), one sentence that seems to be optimal according to a certain criterion is selected as a final speech recognition result.
[0023]
<Configuration>
FIG. 1 shows a functional block diagram of a speech recognition apparatus 30 according to the present embodiment. Referring to FIG. 1, the speech recognition apparatus 30 receives a speech signal and converts it into a digital speech data sequence suitable for speech recognition and a large-scale training corpus prepared in advance. For a multi-class composite 2-gram language model 42 that can accept only all the appearing sentences (that is, only a sentence included in this training corpus is recognized as a recognition result) and a speech data string given from the speech input unit 40 Then, speech recognition using the language model 42 is performed, and a speech recognition unit 44 for outputting a word string indicating the most likely recognition result sentence, and a syllable number NS included in the word string output by the speech recognition unit 44 And a syllable number calculation unit 46 for calculation. In this case, the syllable number NS calculated by the syllable number calculation unit 46 is actually an estimated value of the syllable number included in the sentence represented by the input speech.
[0024]
The training corpus described above is a travel conversation basic expression collection (BTEC), and its contents are as follows. However, the “closed sentence” in the table below refers to the sentence that appeared in the training set among the test set sentences.
[0025]
[Table 1]

The speech recognition apparatus 30 further divides the training corpus described above into subsets according to the number of syllables included in each sentence, and an FSA storage unit 48 that stores a plurality of FSAs 60 generated corresponding to the subsets, and an FSA storage unit Among the plurality of 48 FSAs 60, the selection unit 50 for selecting the FSA 52 corresponding to the syllable number having a predetermined width centered on the syllable number NS calculated by the syllable number calculation unit 46, and the plurality selected by the selection unit 50 Using the same FSA 52, a plurality of speech recognition units 54 for performing speech recognition on the same speech data string as that performed by the speech recognition unit 44 and outputting the resulting word sequence, Among the outputs and the outputs of the plurality of speech recognition units 54, the recognition result that seems to be most appropriate is selected according to the criteria described later and output as the final recognition result. Of and a selection section 56. Each of the plurality of speech recognition units 54 performs speech recognition by the same recognition engine as the speech recognition unit 44. Conversely, as long as the same speech recognition engine is used, any speech recognition unit 44 and speech recognition unit 54 may be used.
[0026]
With reference to FIG. 2, the selection unit 56 performs DP matching between each of the plurality of recognition results given from the plurality of speech recognition units 54 and the phoneme sequence between the recognition results given from the speech recognition unit 44. Based on the DP distance calculation unit 70 for calculating the distance and the DP matching distance calculated by the DP distance calculation unit 70, the recognition result of the speech recognition unit 44 among the plurality of recognition results from the plurality of speech recognition units 54. DP distance comparison / selection unit 72 for selecting the top three candidates that are close to the phoneme arrangement, and the acoustic score (recognition) associated with the top three recognition results selected by the DP distance comparison / selection unit 72 A score adjuster 74 for adjusting the value of (sometimes obtained simultaneously with the recognition result) by (1 + α), an adjusted acoustic score of the top three recognition results output by the adjuster 74, and a speech recognizer 4 Recognition compares the acoustic score associated with the result, and a score comparing and selecting section 76 for outputting as a final recognition result a recognition result corresponding to the acoustic score having the highest value.
[0027]
The reason why the corpus is divided into subsets based on the number of syllables included in each sentence as described above and the FSA corresponding to them is used is as follows. As mentioned above, the training corpus contains a large amount of sentences. If this corpus is modeled with one FSA as it is, both the model size and the search space become huge. Therefore, it is necessary to divide the FSA.
[0028]
When dividing the FSA, it is necessary to select the FSA along parameters that can be accurately estimated from the recognition result of the speech recognition unit 44 using the language model 42. In the present embodiment, the input speech syllable number NS is adopted as the parameter. The syllable number NS of the input speech was adopted when it was determined that only the N-gram was used for the language model, the syllable number NS included in the recognition result was not significantly different from the correct answer, regardless of whether the recognition result was correct or incorrect. This is because. FIG. 3 shows the experimental results showing this.
[0029]
FIG. 3 shows the results of creating an N-gram (multi-class composite 2-gram) language model using the corpus used in the present embodiment and performing speech recognition using the language model up to the top ten. The syllable number is the one with the most different syllable number from the correct sentence, and the difference between the syllable numbers is made into a histogram. As can be seen from FIG. 3, the test sentence in which the difference between the syllable number of the recognition result of the multi-class composite 2-gram and that of the correct sentence falls within 5 accounts for 99.5% of the total. Therefore, it can be seen that it is possible to identify (estimate) the number of syllables in the correct sentence from the number of syllables in the recognition result of the multiclass composite 2-gram.
[0030]
Therefore, the FSA 60 is obtained from each subset obtained by dividing the corpus into subsets by the number of syllables of each sentence. FIG. 4 shows the distribution of the number of training sentences included in each set when a different training corpus having about 100,000 sentences is divided into subsets according to the number of syllables of each sentence. The horizontal axis indicates the number of syllables, and the vertical axis indicates the number of learning sentences included in the set for each syllable number. Referring to FIG. 4, this distribution can be approximated by a gamma distribution having a peak at 16 syllables. The maximum number of sentences in the set is 12000 or less, which is 1/8 or less of the original training corpus. The total number of FSA models corresponding to each set was 65. The language likelihood for each sentence (word string) is the same.
[0031]
The language model 42 is a language model trained using the training corpus described above. As a result of a speech recognition experiment performed by the speech recognition unit 44 alone on a test set of the same domain as this training corpus (a set of sentences used in the same travel situation as the training corpus) using this language model, the word The recognition rate was 88.7%, and the sentence recognition rate was 65.7%. The word recognition rate is high, while the sentence recognition rate is low. BTEC has many closed sentences due to the nature of many fixed sentences, but only 85.1% of them are completely correct. Therefore, there is a possibility that the remaining 14.9% can be correctly answered by performing voice recognition together with a closed FSA in which a sentence of a training set is modeled as in this embodiment. Assuming that the correct answers can be obtained, the sentence recognition rate as a whole should be improved by 7.25 points.
[0032]
In this embodiment, as will be described later, the voice recognition unit 54 is realized by software. Therefore, a single CPU repeatedly performs voice recognition multiple times. However, if there is enough hardware, speech recognition processing may be performed in parallel. For example, if there are a plurality of CPUs, voice recognition processing can be executed simultaneously in parallel.
[0033]
<Operation>
With the above configuration, one multi-class composite 2-gram that is a baseline and 65 FSA groups can be used as a language model. Hereinafter, the operation of the speech recognition apparatus 30 described in FIGS. 1 and 2 will be described.
[0034]
Referring to FIG. 1, a voice input unit 40 converts voice input into digital data suitable for voice recognition. This digital data is given to the voice recognition unit 44 and the voice recognition unit 54. First, the voice recognition unit 44 performs voice recognition. At this time, the speech recognition unit 44 performs recognition processing using the multi-class composite 2-gram language model 42 to obtain a maximum likelihood recognition result sentence. This recognition result sentence is given to the syllable number calculation unit 46 and the selection unit 56. Along with this voice recognition, an acoustic score for voice recognition is also obtained and given to the selection unit 56. Here, the acoustic score is a value obtained by obtaining the likelihood between recognition results for an input waveform at the time of recognition for each phoneme (or syllable) and multiplying them. Usually, the logarithm is used. In this specification, an acoustic score refers to a logarithmic acoustic score.
[0035]
The syllable number calculation unit 46 obtains the syllable number NS of the given recognition result sentence. The syllable number calculation unit 46 gives the syllable number NS to the selection unit 50.
[0036]
The selection unit 50 selects the FSA 52 corresponding to the sentence set having the syllable number in the range of NS ± 5 around the given syllable number NS. That is, eleven FSAs 52 with syllable numbers ranging from NS-5 to NS + 5 centering on NS are selected.
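The selection of FSAs in the range NS ± 5 can be sketched as follows. This is only an illustration: the function name `select_fsa_keys`, the `width` parameter, and the assumption that the stored FSAs are indexed by syllable counts 1 to 65 are all hypothetical and not stated in this specification.

```python
def select_fsa_keys(ns, available_counts, width=5):
    """Return the syllable counts in [ns - width, ns + width] that have an FSA."""
    return [n for n in range(ns - width, ns + width + 1) if n in available_counts]

# Assumption for illustration: one FSA per syllable count from 1 to 65.
available = set(range(1, 66))

# For NS = 12 this yields the eleven counts 7..17.
keys = select_fsa_keys(12, available)
```

Near the ends of the assumed range fewer than eleven FSAs exist, so the sketch simply returns those that are available.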
[0037]
The voice recognition unit 54 performs voice recognition individually on the digital data given from the voice input unit 40 using the eleven FSAs 52. As a result, 11 recognition result sentences by the FSA 52 corresponding to the sentence set having the number of syllables from NS-5 to NS + 5 are obtained. This output is given to the selection unit 56. At the same time, an acoustic score for the voice recognition is also obtained and given to the selection unit 56.
[0038]
With reference to FIG. 2, as a result of the above processing, the DP distance calculation unit 70 is given one recognition result from the voice recognition unit 44 and eleven recognition results from the voice recognition unit 54. The DP distance calculation unit 70 calculates the DP matching distance between the phoneme sequence of each of the 11 recognition results obtained with the FSAs 52 and the phoneme sequence of the recognition result of the speech recognition unit 44, and supplies these DP matching distances to the DP distance comparison / selection unit 72.
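A DP matching distance between two phoneme sequences can be computed by dynamic programming, for example as an edit distance. The sketch below assumes unit costs for insertion, deletion, and substitution; this specification does not state the cost values, and the function name `dp_matching_distance` and the example phoneme sequences are hypothetical.

```python
def dp_matching_distance(a, b):
    """Edit distance between two phoneme sequences (unit costs are an assumption)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

# Hypothetical phoneme sequences for a baseline result and one FSA result.
baseline_phonemes = ["k", "o", "N", "n", "i", "ch", "i", "w", "a"]
fsa_phonemes = ["k", "o", "N", "b", "a", "N", "w", "a"]
distance = dp_matching_distance(fsa_phonemes, baseline_phonemes)
```
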
[0039]
The DP distance comparison / selection unit 72 selects, from the 11 DP matching distances, the three with the shortest distances, and provides the corresponding top three recognition results to the adjustment unit 74.
[0040]
The adjustment unit 74 multiplies by (1 + α) the acoustic score associated with each of the top three recognition results selected by the DP distance comparison / selection unit 72. This α is a parameter for adjusting the acoustic score. α is a positive value, preferably 0.05 to 0.25. In particular, α = 0.1 to 0.25 is preferable, and 0.15 to 0.25 is more preferable. In the experiments, the most favorable result was obtained when α = 0.2. However, the value of α is not limited to this value because it can vary depending on the recognition target. The adjustment unit 74 gives the acoustic scores adjusted by this multiplication by (1 + α) to the score comparison / selection unit 76.
[0041]
The score comparison / selection unit 76 compares the acoustic score of the recognition result of the speech recognition unit 44 with the three adjusted acoustic scores given from the adjustment unit 74, and selects and outputs the recognition result with the highest acoustic score as the final recognition result.
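The processing of the units 72, 74, and 76 can be summarized in a short sketch. The function name `select_final`, the tuple layout, and all numerical values are hypothetical; in particular, the example scores are made up and say nothing about the sign or scale of the scores a real recognizer outputs.

```python
def select_final(baseline, fsa_results, alpha=0.2, top_k=3):
    """Pick the final recognition result.

    baseline:    (sentence, acoustic_score) from the N-gram recognizer.
    fsa_results: list of (sentence, acoustic_score, dp_distance), one per FSA.
    """
    # Unit 72: keep the top_k FSA results with the smallest DP matching distance.
    closest = sorted(fsa_results, key=lambda r: r[2])[:top_k]
    # Unit 74: multiply their acoustic scores by (1 + alpha).
    adjusted = [(s, score * (1 + alpha)) for s, score, _ in closest]
    # Unit 76: compare with the unadjusted baseline score and take the highest.
    return max([baseline] + adjusted, key=lambda c: c[1])

# Made-up example values.
baseline = ("a", 100.0)
fsa_results = [("b", 90.0, 1), ("c", 95.0, 2), ("d", 200.0, 9), ("e", 80.0, 3)]
final = select_final(baseline, fsa_results)
```

In this example the result "d" is never considered, despite its high score, because its DP matching distance is not among the three smallest.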
[0042]
FIG. 5 and FIG. 6 show the experimental results of selecting the recognition result in accordance with the above-described mode by changing α in 0.05 increments from 0.05 to 0.25 with the apparatus of the present embodiment. FIG. 5 shows the transition of the sentence recognition rate for each value of α. FIG. 6 shows the transition of the word recognition rate at this time for reference.
[0043]
As shown in FIG. 5, in the present embodiment, the highest sentence recognition rate was obtained when α = 0.20. The value shown at α = 0 in FIG. 5 is the sentence recognition rate when only the multi-class composite 2-gram language model 42 is used. As can be seen from FIG. 5, when α = 0.20, the sentence recognition rate is about 4.3 points higher than when α = 0. Computed over closed sentences only, this corresponds to an improvement of about 8.8 points in the sentence recognition rate.
[0044]
As described above, according to the speech recognition apparatus 30 of the present embodiment, the corpus is divided into a plurality of sentence sets according to the number of syllables, and an FSA is generated as a language model for each set. Then, according to the syllable number NS of the baseline speech recognition result obtained with the multi-class composite 2-gram language model, the FSAs corresponding to the sentence sets having a specific relationship with the syllable number NS are selected, and speech recognition is performed again with them. Of the resulting recognition results, the three with the shortest DP matching distance from the phoneme sequence of the baseline speech recognition result are selected, their acoustic scores are adjusted by the factor 1 + α and compared with that of the baseline speech recognition result, and the result showing the highest acoustic score is selected as the final recognition result. As a result, the sentence recognition rate was significantly improved over the baseline speech recognition result, as described above.
[0045]
<Realization by computer>
The above-described voice recognition device according to the present embodiment can be realized by a computer having a voice processing function. FIG. 7 shows the appearance of a speech recognition device 30 realized by a computer. FIG. 8 is a hardware block diagram of the voice recognition device 30.
[0046]
Referring to FIG. 7, the speech recognition device 30 includes a computer 80 having a CD-ROM (Compact Disc Read-Only Memory) drive device 90 and an FD (Flexible Disk) drive device 92, and a monitor 82, a microphone 84, a keyboard 86, and a mouse 88, each connected to the computer 80.
[0047]
Referring to FIG. 8, in addition to the CD-ROM drive device 90 and the FD drive device 92 described above, the computer 80 includes a CPU (Central Processing Unit) 96, a ROM (Read-Only Memory) 98, a RAM (Random Access Memory) 100, a hard disk 94, and a sound board 108 connected to the microphone 84. These are all connected to each other by a bus 106. A CD-ROM 102 is mounted on the CD-ROM drive device 90, and an FD 104 is mounted on the FD drive device 92.
[0048]
A computer program having the control structure described below is recorded and distributed on a computer-readable recording medium such as the CD-ROM 102 or the FD 104. After the CD-ROM 102 is mounted on the CD-ROM drive device 90, the program is read from the CD-ROM 102 and copied to the hard disk 94. At the time of execution, this program is read from the hard disk 94 and loaded into the RAM 100; the CPU 96 reads and executes an instruction from the address designated by a program counter (not shown) and writes the execution result to the RAM 100 or the hard disk 94. The CPU 96 further rewrites the value of the program counter according to the execution result, and reads and executes the next instruction from the RAM 100 based on the value of the program counter. The CPU 96 executes the computer program according to this operating principle. The FSA storage unit 48 is realized by the hard disk 94, the ROM 98, the RAM 100, or the like.
[0049]
FIG. 9 shows the overall control structure of this computer program. Referring to FIG. 9, this computer program includes: step 110 of performing speech recognition on the speech input using the multi-class composite 2-gram language model 42; step 112 of calculating the syllable number NS of the word string obtained as a result of the speech recognition in step 110; step 114 of selecting, according to the syllable number NS calculated in step 112, the 11 FSAs obtained from the sentence sets with syllable numbers NS-5 to NS + 5, performing speech recognition using the selected FSAs, and outputting 11 speech recognition results together with their acoustic scores; step 116 of selecting one speech recognition result, by a predetermined method, from among the speech recognition result of step 110 and the 11 speech recognition results obtained in step 114; and step 118 of outputting the selected speech recognition result. When step 118 is completed, the processing for one voice input is completed.
[0050]
FIG. 10 shows details of the recognition step 114 by the FSA shown in FIG. 9. Referring to FIG. 10, step 114 includes step 140 of calculating the minimum value MIN (= NS-5) of the syllable numbers of the sentence sets to be used, step 142 of calculating the maximum value MAX (= NS + 5), step 144 of setting the iteration variable I of the following loop to the minimum value MIN, and step 146 of determining whether the iteration variable I exceeds the maximum value MAX and branching control according to whether I > MAX. If it is determined in step 146 that I > MAX, this routine ends. Otherwise, control proceeds to step 148.
[0051]
The recognition step 114 by the FSA further includes step 148 of performing recognition using the FSA obtained from the sentence set having the syllable number I, step 150 of outputting the recognition result of step 148 together with its acoustic score, and step 152 of adding one to the iteration variable I. After step 152, control returns to step 146.
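The control flow of steps 140 to 152 can be written as a simple loop. In this sketch the function name `recognize_with_fsas`, the `width` parameter, and the callback `recognize(audio, i)`, which stands in for recognition with the FSA of syllable number I, are hypothetical; the stand-in recognizer at the end exists only to make the example runnable.

```python
def recognize_with_fsas(audio, ns, recognize, width=5):
    """Run recognition once per FSA for syllable numbers NS-width .. NS+width."""
    minimum = ns - width   # step 140: MIN
    maximum = ns + width   # step 142: MAX
    i = minimum            # step 144: initialize the iteration variable I
    results = []
    while i <= maximum:    # step 146: leave the loop when I > MAX
        sentence, score = recognize(audio, i)  # step 148: recognize with FSA I
        results.append((i, sentence, score))   # step 150: output result and score
        i += 1                                 # step 152: add one to I
    return results

# Stand-in recognizer for illustration only; a real one would decode the audio.
def fake_recognize(audio, i):
    return (f"sentence-{i}", float(i))

results = recognize_with_fsas(None, 10, fake_recognize)
```
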
[0052]
FIG. 11 shows details of the recognition result selection step 116 shown in FIG. 9. Referring to FIG. 11, step 116 includes: step 180 of calculating, for each of the 11 recognition results by the FSAs, the DP matching distance between its phoneme sequence and the phoneme sequence of the recognition result by the multi-class composite 2-gram language model; step 182 of selecting the three results with the smallest DP matching distances calculated in step 180; step 184 of calculating adjusted acoustic scores by multiplying the acoustic scores of the three recognition results selected in step 182 by (1 + α); and step 186 of selecting, as the final recognition result, the speech recognition result having the maximum acoustic score among the adjusted acoustic scores calculated in step 184 and the acoustic score obtained with the speech recognition performed in step 110 of FIG. 9.
[0053]
Those skilled in the art will readily understand that the functions of the speech recognition apparatus 30 described above can be realized by causing the computer 80 to execute the computer program having the control structure described above.
[0054]
<Modification>
In the embodiment described above, a multi-class composite 2-gram language model is used as the language model 42. However, the present invention is not limited to using multi-class composite 2-grams. In general, an N-gram language model (where N is an integer equal to or greater than 1) can be used as the language model 42.
[0055]
In the embodiment described above, the FSA is selected based on the number of syllables included in the recognition result by the speech recognition unit 44. However, the present invention is not limited to using the syllable number for FSA selection. The size of the unit that can be used is not particularly limited, as long as the true number of units can be estimated relatively accurately from the recognition result by the N-gram. For example, instead of syllables, the number of phonemes included in the recognition result may be used. Units similar in concept to syllables, such as morae, may also be used. Depending on the performance of the recognition system, if, for example, a count in word units is close to the correct value, the number of words may be used.
[0056]
Further, in the above-described embodiment, the selection unit 50 selects, from among the FSAs 60 included in the FSA storage unit 48, a total of 11 FSAs centered on the syllable number NS calculated by the syllable number calculation unit 46. However, the present invention is not limited to such an embodiment. For example, only the FSAs before, or only the FSAs after, the syllable number NS calculated by the syllable number calculation unit 46 may be used. Further, the number of FSAs to be selected is not limited to eleven. In this embodiment, the number of syllables is used to select the FSAs, and the experimental results shown in FIG. 3 indicate that the corresponding FSA can be selected in this way. However, when hardware resources are limited or processing speed is constrained, for example, a smaller number of FSAs may be selected. Further, when the estimated speech unit is not a syllable but a phoneme or the like, an experimental result different from FIG. 3 is naturally expected, and the number of FSAs to be selected may change accordingly.
[0057]
In the embodiment described above, voice recognition using the FSAs is performed repeatedly on the same CPU. However, as described above, if a plurality of CPUs can be used, the processing can be sped up by operating them in parallel. In that case, the processes must be synchronized before the processing of step 116 in FIG. 9 starts. The processing of step 116 in FIG. 9 may be started when all of the recognition processing by the FSAs is completed; alternatively, the processing of step 180 shown in FIG. 11 may be performed as each recognition process finishes, and the processing from step 182 onward may be started when all the recognition processing is completed.
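The parallel variant, including the synchronization before step 116, can be sketched with a worker pool. This is an illustration under stated assumptions: the function name `recognize_parallel`, the callback `recognize(audio, n)`, and the use of threads as a stand-in for multiple CPUs are all hypothetical choices, not part of this specification.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_parallel(audio, syllable_counts, recognize, max_workers=4):
    """Run one recognition per selected FSA concurrently and wait for all."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order; list() blocks until every
        # recognition has finished, which acts as the synchronization point
        # before the selection of step 116 begins.
        return list(pool.map(lambda n: recognize(audio, n), syllable_counts))

# Stand-in recognizer for illustration only.
def fake_recognize(audio, n):
    return (f"sentence-{n}", float(n))

parallel_results = recognize_parallel(None, range(5, 16), fake_recognize)
```

A real system would assign the recognitions to separate CPUs or processes; the thread pool here only illustrates the wait-for-all structure.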
[0058]
Furthermore, in the above-described embodiment, the DP matching distance and the acoustic score are used for selection by the selection unit 56. However, the present invention is not limited to such an embodiment. For example, a machine learning system such as SVM (Support Vector Machine) that has previously learned the selection method may be used.
[0059]
In the above description, the computer program is recorded and distributed on a recording medium. However, the distribution form of the computer program is not limited to recording on a recording medium. For example, it may be distributed by communication over a wired or wireless network. Further, the format is not limited to one that a computer can execute immediately; the program may be distributed in the form of a source program and compiled on the computer into an executable format. Furthermore, there may be a mode of operation in which, when the computer program is executed, only the part of the program needed for execution is temporarily stored in the RAM 100 from a remote location via the network and executed, so that the entire program is never stored in the RAM 100 or the hard disk 94 at one time. It goes without saying that either case corresponds to implementation of the present invention.
[0060]
The embodiment disclosed herein is merely an example, and the present invention is not limited to the above-described embodiment. The scope of the present invention is indicated by each claim of the claims, taking into account the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech recognition apparatus 30 according to an embodiment of the present invention.
FIG. 2 is a more detailed block diagram of a selection unit 56 in the speech recognition apparatus 30 shown in FIG. 1.
FIG. 3 is a graph showing a distribution example of a difference between the number of syllables included in an N-gram speech recognition result and the number of syllables of a correct sentence.
FIG. 4 is a graph showing a distribution of syllable numbers of sentences in a training corpus.
FIG. 5 is a graph showing an improvement in sentence recognition rate according to an embodiment of the present invention.
FIG. 6 is a graph showing a word recognition rate according to an embodiment of the present invention.
FIG. 7 is a diagram showing the external appearance of a computer 80 and its peripheral devices that implement the speech recognition device 30.
FIG. 8 is a hardware block diagram of a computer 80.
FIG. 9 is a flowchart showing a control structure of a speech recognition program executed by the computer 80.
FIG. 10 is a flowchart showing a detailed control structure of the recognition step 114 by FSA shown in FIG. 9.
FIG. 11 is a flowchart showing a detailed control structure of recognition result selection step 116 shown in FIG. 9.
[Explanation of symbols]
30 voice recognition device, 40 voice input unit, 42 multi-class composite 2-gram language model, 44 speech recognition unit, 46 syllable number calculation unit, 48 FSA storage unit, 50 selection unit, 52, 60 FSA, 54 speech recognition unit, 56 selection unit, 70 DP distance calculation unit, 72 DP distance comparison / selection unit, 74 adjustment unit, 76 score comparison / selection unit

Claims

A speech recognition device including:
First speech recognition means for performing speech recognition of input speech based on an N-gram language model (where N is a positive integer) generated from a predetermined corpus, and outputting a speech recognition result;
Estimating means for estimating the number of predetermined speech units included in the input speech from the speech recognition result of the first speech recognition means;
Means for storing a plurality of language models for speech recognition, each comprising a finite state machine and each generated from one of a plurality of subsets into which the sentences in the predetermined corpus are classified according to the number of the predetermined speech units included in each sentence;
First language model selecting means for selecting, from among the plurality of language models and according to the number of speech units estimated by the estimating means, a predetermined plurality of language models including the one corresponding to the estimated number of speech units and those corresponding to numbers of speech units in a predetermined range before, after, or both before and after the estimated number of speech units;
Second speech recognition means for performing speech recognition of the input speech based on each of the plurality of language models selected by the first language model selecting means, and outputting a plurality of speech recognition results; and
Recognition result selection means for selecting and outputting, according to a predetermined selection criterion, one of the speech recognition result of the first speech recognition means and the speech recognition results of the second speech recognition means.

The speech recognition device according to claim 1, wherein the estimating means includes means for estimating the number of syllables contained in the input speech from the speech recognition result of the first speech recognition means, and
the first language model selecting means includes second language model selecting means for selecting, from among the plurality of language models and according to the number of syllables estimated by the estimating means, a predetermined plurality of language models including the one corresponding to the estimated number of syllables and those corresponding to numbers of syllables in a predetermined range before and / or after the estimated number of syllables.

The speech recognition device according to claim 1, wherein the first language model selecting means includes second language model selecting means for selecting, from among the plurality of language models and according to the number of speech units estimated by the estimating means, the language model corresponding to the estimated number of speech units and the language models corresponding respectively to the numbers of speech units in ranges of equal size before and after the estimated number of speech units.

The speech recognition device according to any one of claims 1 to 3, wherein the recognition result selection means includes:
Distance calculating means for calculating a DP matching distance between each of the speech recognition results of the second speech recognition means and the speech recognition result of the first speech recognition means;
Means for selecting a predetermined number of the speech recognition results of the second speech recognition means in ascending order of the DP matching distance calculated by the distance calculating means;
Acoustic score adjusting means for adjusting, by a predetermined adjustment formula, the acoustic score of each of the predetermined number of speech recognition results of the second speech recognition means selected by the means for selecting; and
Means for comparing the plurality of acoustic scores output by the acoustic score adjusting means with the acoustic score of the recognition result of the first speech recognition means, selecting the largest acoustic score, and outputting the speech recognition result corresponding to the selected acoustic score as the speech recognition result of the speech recognition device.

The speech recognition device according to claim 4, wherein the acoustic score adjusting means includes means for adjusting the logarithmic acoustic score of each speech recognition result of the second speech recognition means selected by the means for selecting by multiplying it by 1 + α (where α is a predetermined positive constant).

A computer program that, when executed by a computer, causes the computer to operate as the speech recognition device according to any one of claims 1 to 5.