JP4042176B2

JP4042176B2 - Speech recognition method

Info

Publication number: JP4042176B2
Application number: JP05616297A
Authority: JP
Inventors: 芳春阿部
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1997-03-11
Filing date: 1997-03-11
Publication date: 2008-02-06
Anticipated expiration: 2017-03-11
Also published as: JPH10254496A

Abstract

PROBLEM TO BE SOLVED: To reduce the search volume and improve the accuracy of speech recognition by determining, through search processing, the likelihood of the input speech for model sequences built from adapted models, and recognizing the speech based on this likelihood. SOLUTION: A model calculating means 7 receives the input for one frame, applies the parameters of the models permitted by the syntax information for that frame, and calculates the likelihood of the input feature parameter 3. Next, a partial model sequence selection means 11 selects partial model sequences as the information used for adaptation by an adapting means 13. After adaptation has been completed for all the selected models, the adapting means 13 takes the parameters obtained by adaptation as corrected parameters 15 and replaces the parameters in a storage means 6 with them. Finally, a backward searching means 1002 performs a backward search and a recognition result 1003 is obtained.

Description

【０００１】
【発明の属する技術分野】
この発明は、音声認識の精度改善と、探索量の削減に関する。
【０００２】
【従来の技術】
従来、音声認識の探索量の削減の手法として、ビーム探索が行われている。
日本音響学会平成元年度春季研究発表会講演論文集Ｉ（平成元年３月）、５〜６頁「ＤＰビームサーチの閾値を入力音声の途中で変更する方法の検討」には、フレーム同期型のＤＰマッチングにおいて、ビーム探索の閾値を入力音声の途中で変化させることで、探索量を減少させる方法が提案されている。
また、特開平６−２８２２９５号公報には、観測可能な特徴量を入力とする制御関数を用いてビーム探索の探索範囲を適応的に変化させることで、探索量を減少させる方法が開示されている。ここで、ビーム探索の閾値の制御関数には、ニューラルネットおよび重回帰分析を用いている。
【０００３】
一方、日本音響学会平成８年度秋季研究発表会講演論文集Ｉ（平成８年９月）、１１７〜１１８頁「音声認識のためのＮ−ｂｅｓｔに基づく話者適応化」には、教師なし話者適応化（即時適応化）の方法として、Ｎベストビタビ認識の結果から推定されたＮベストのモデル系列に従って、音韻ＨＭＭを連結し、その連結した音韻ＨＭＭの入力音声に対する尤度が最大となるように、音韻ＨＭＭのパラメータを推定して、認識対象話者に適応化する方法が提案されている。
この方法での適応化は、次のステップからなる。
(1)適応化前の音韻ＨＭＭを用いて入力音声のＮベスト認識を行い、入力音声に対するＮ個のモデル系列を求める。
(2)各モデル系列ごとに、音韻ＨＭＭのパラメータをその話者に適応化する。
(3)適応化後に、最大尤度を示したモデル系列を選択する。
(4)その適応化された音韻ＨＭＭのパラメータを用いて現在のＨＭＭを更新する。
上記、ステップ（２）〜（４）を繰り返す。
従って、上記方法は、入力音声の途中でモデルを変更することができない。
【０００４】
図１４は従来のビーム探索を用いる音声認識方式の機能ブロック図である。
音声区間切出手段１によって、入力音声１００１から切り出された音声区間の各フレームについて、分析手段２による音声分析を行い、特徴パラメータの時系列３を得る。
そして、モデル記憶手段５１からパラメータ５ａを、また構文情報格納手段４から入力音声に対応するモデルの系列を規定する構文情報をそれぞれ参照し、特徴パラメータの時系列３に対する最適なモデル系列を認識結果１００３として、以下のようにして得る。
なお、１０は入力音声の第１フレームから途中までのフレームに対応する部分モデル系列の仮説を格納する部分モデル系列格納手段である。
【０００５】
最初のフレーム番号を１、最後のフレーム番号をＴとする。まず、最初に、部分モデル系列の初期値を部分モデル系列格納手段１０に格納する。次に入力音声のフレーム番号ｉを１とおく。
モデル演算手段７は部分モデル系列格納手段１０から部分モデル系列の仮説（Ｈとする）をとり出す。
つぎに、構文情報格納手段４の構文情報から、部分モデル系列Ｈに連結可能なモデル（音韻モデルｋ、複数通りのときもある）を選択し、音韻モデルｋに対応するフレーム番号iの特徴パラメータの尤度f(k,i)を計算する。
さらに、音韻モデルｋを連結した１フレーム分成長した部分モデル系列の仮説を作成し、ビーム探索用の中間スタック１００４に格納する。１フレーム分成長した部分モデル系列の仮説の累積尤度は成長前の種の部分モデル系列の累積尤度に音韻モデルkの尤度を加えたものである。
ビーム探索手段９はフレーム番号ｉについて、中間スタック１００４内の部分モデル系列の仮説の累積尤度を相互に比較し、例えば、累積尤度の最大の仮説の尤度を上限とし、この上限からビーム幅８だけ引いた値を下限として、この範囲の累積尤度を有する部分モデル系列の仮説を部分モデル系列格納手段１０に格納する。
【０００６】
この場合、中間スタックからの仮説の選び方としては、例えば、累積尤度の大きい方からＮb個の部分モデル系列を選ぶこともできる。ただし、Ｎbはビーム内に残す仮説の数の最大の数を表す。
以上の処理を入力音声の第１フレームから最後のフレームまで行うことによって、部分モデル系列格納手段１０には、入力音声の全フレームに対応するモデル系列の仮説がその累積尤度とともに得られる。
その後、後向き探索手段１００２は、例えば最適な累積尤度の仮説を選ぶことによって、認識結果１００３を得る。
【０００７】
【発明が解決しようとする課題】
入力音声の途中でビーム探索の閾値を変更する従来のビーム探索は、探索量を削減することができるが、認識に用いるモデルのパラメータは一定であり、このようなパラメータが一定のモデルで認識を行うため、認識精度の向上は得られない。
また、従来の教師なし適応化は、一定のパラメータのモデルでＮベスト認識を行ってＮ個のモデル系列を求めた後に、認識結果からモデルのパラメータの入力音声への適応化を行う。
このため、精度のより高い認識結果を得るためには、適応化された音響モデルにより、再度の認識処理が必要であるという問題があった。
【０００８】
この発明が解決しようとする課題は、ビーム探索を用いる音声認識において、入力音声の途中で音韻モデルおよび音韻境界のモデルを含むモデルを入力音声に適応化することで、認識の精度を向上させることである。
また、入力音声の途中でモデルの適応化を行うとともに、入力音声の途中で得られるモデルの精度改善の結果として、入力音声の途中でビーム探索の幅を絞ることで、探索量を削減することである。
【０００９】
【課題を解決するための手段】
この発明に係る音声認識方式は、入力音声を複数のフレームに分割し、当該分割されたフレーム間をモデル間の接続点とするモデルの系列からなるモデル系列に対する入力音声の尤度を探索処理により求め、この尤度に基づき音声認識を行う音声認識方式において、上記探索処理としてビーム探索を用いるものであって、上記入力音声の各フレームで、そのフレームでビーム内に残る上記入力音声の途中までのフレームに対応する部分モデル系列から選択される上記モデルのパラメータを上記入力音声の途中までのフレームに対応する認識結果に基づいて適応化して、上記モデルのパラメータをフレームごとに置き換えるようにしたものである。
【００１７】
【発明の実施の形態】
実施の形態１．
この実施形態は、モデルとして混合連続分布の音韻モデルを用いる場合を示す。
図１は、この実施形態における音声認識方式の機能ブロック図である。入力音声信号１００１は音声切出手段１により、例えば１０msの一定の分析周期で、例えば２５．６msの信号区間であるフレームに分割される。
分析手段２は、これをフレームごとに特徴パラメータ３に変換する。フレーム番号ｔの特徴パラメータをXtと記す。図２はこれ以降の動作を示すフローチャートである。
ステップ２１ではモデルの初期化を行う。すなわち適応化前のモデルである初期モデルを初期モデル記憶手段５からモデル記憶手段６にコピーする。また、フレームの番号ｔを１に設定する。
次に、ｔ＝１番から最終のｔ＝Ｔ番のフレームについて、フレーム番号ｔを１づつ増加しながら、フレームごとに以下の処理を行う。
【００１８】
構文情報格納手段４に格納された構文情報は、部分モデル系列のあとに接続可能なモデルを決めるための情報であり、状態をあらわすノードと、遷移をあらわす枝とから表わされる。これは例えば図３に示すようなグラフとして表現される。またこの構文情報は、構文情報格納手段４内においては図４に示すような表として格納されている。すなわち、ある部分モデル系列の現在の構文的な状態をあらわす番号から、次に接続可能なモデルと、そのモデルを選択したときに拡張された部分モデル系列の次の状態番号が、表として与えられている。図３に対応するグラフの構文状態の遷移表は図４のようになる。
【００１９】
モデル演算手段７は、１フレーム分の入力を行い（ステップ２２）、フレームごとに、構文情報に従ったモデルのパラメータを適用し（ステップ２３）、入力される特徴パラメータ３の尤度を計算する（ステップ２４）。モデルのパラメータは、音韻モデルｋについて、Ｍ混合のガウス分布の平均、分散、分岐係数μm(k), Σm(k), λm(k) (ｍ=1, 2, ..., M) からなる。
現在の構文状態がｐのとき、構文情報から自己ループを含めて後続の遷移可能なすべての枝を検知し、このすべての枝について、その枝のモデルと遷移先の構文状態の組み合わせ＜k,q＞∈｛＜k1,q1＞,＜k2,q2＞, ..., ＜kn,qn＞｝に対するモデルｋの特徴パラメータｘtの尤度f(t,k)を、混合分布の各分布の尤度Ｎ(ｘt,μm(k),Σm(k))の加重和として次式で計算する。
【００２０】
【数１】

【００２１】
ステップ２５では、次のようにして、１フレーム前の部分モデル系列を１フレーム分拡張し、新しい部分モデル系列を生成する。種となる一フレーム前の部分モデル系列がＳ1,Ｓ2,...のとき、部分モデル系列を一つ選択し、Ｓとする。Ｓは構文状態δ(S)と、累積尤度α(S)と、最終モデルｋ(S)とを情報として保持している。Ｓの構文状態がｐのとき、つぎの演算を行い、構文状態、選択されるモデルの組み合わせに応じて、新しい部分モデル系列の仮説Ｕ1,Ｕ2,...を生成する。
例えば、選択されるモデルがｋで、次の構文状態がｑのとき、これに対応して生成される新しい部分モデル系列をＵとすると、Ｕの構文状態δ(U)はδ(U)=ｑ、Ｕの累積尤度α(U)はα(U)=α(S)＋f(t,k)、Ｕの最終モデルはｋ(U)＝ｋである。
【００２２】
ビーム探索手段９は、モデル演算手段７で生成された部分モデル系列Ｕ1,Ｕ2,...について、それらの累積尤度と、制御手段１０００より与えられるビーム幅８とで決まる、ビーム幅範囲の中に入らない仮説を破棄することで、ビーム幅の中に入る仮説だけを残し、部分モデル系列として出力し、部分モデル系列格納手段１０に格納する（ステップ２６）。
ビーム幅８に基づくビーム幅範囲の設定は、Ｕ1,Ｕ2,...の累積尤度の中の最大値αmaxを上限として、αmaxからビーム幅８を減じた値を下限αminとすることで行う。
枝刈りは、Ｕ1,Ｕ2,...の中から、その累積尤度α(U1),α(U2),... が、αminからαmaxの間にある仮説を残し、それ以外を破棄することで行う。
【００２３】
次に部分モデル系列選択手段１１は、適応化手段１３における適応化に用いる情報としての部分モデル系列を選択する（ステップ２７）。例えば、部分モデル系列格納手段１０の中の部分モデル系列で、累積尤度の大きい部分モデル系列から順番に探索し、異なるモデルを選択した部分モデル系列を最大でＮ個選択する。
【００２４】
次に適応化手段１３は、部分モデル系列選択手段１１が現在のフレームで選択した部分モデル系列Ｕ1,Ｕ2,...（最大でＮ個）の、選択されたモデルｋ ∈ ｋ(U1),ｋ(U2), ... （最大でＮ個）について、適応化係数１２に従って、パラメータの適応化を行う（ステップ２８）。
この実施形態においては、モデルのパラメータは、音韻モデルkについて、Ｍ混合のガウス分布の平均、分散、分岐係数μm(k),Σm(k),λm(k) (m=1, 2, ..., M)からなる。
適応化の対象は、Ｍ混合の各分布（正規密度関数）の尤度に対する分岐係数λm(k)と、Ｍ混合の各分布の平均μm(k)であり、従って、補正前のパラメータ１４は、モデルkについてλm(k)と、μm(k)であり、その適応化は、次式で行う。
【００２５】
【数２】

【００２６】
なお、ｗは適応化係数１２で０≦ｗ＜１。
分散の適応化は理論上は次式で可能であるが、適応化の対象となるパラメータ数を削減するため、この実施形態では行わない。
【００２７】
【数３】

【００２８】
全ての選択されたモデルについて、上記の適応化が終了した後、適応化手段１３は適応化の結果得られたパラメータを補正後パラメータ１５としてモデル記憶手段６のパラメータを補正後のパラメータ１５に置き換える（ステップ２９、３０）。
そして、後向き探索手段１００２による後向き探索を行い、認識結果１００３を得る（ステップ３１）。
なお、制御部１０００はモデル記憶手段６の初期化から、入力のフレームに同期したモデル演算手段７の処理、ビーム探索手段９、適応化手段１３の各処理の制御を行う。
【００２９】
以上のように、ｔ番目のフレームでの入力フレームの尤度計算に用いるモデルのパラメータは、一つ前のフレームで適応化処理により補正されたパラメータを用いている。これにより、次第に適応化が進んでいく。すなわち、認識結果が出たあとではなく、認識処理中に適応化が進められるものである。
また、構文情報を備えるビーム探索の過程の中で、構文情報で規定される部分モデル系列から、尤度の高い部分モデル系列のモデルを適応化の対象のモデルとして選択しているため、いわば過去の履歴で補正されたフレームごとの認識結果によるモデルの適応化が実現されることになっている。
このため、従来のビーム探索のビーム幅の制御による、探索量の減少の効果とともに、従来は得られなかった認識精度の改善の効果が期待できる。
また、部分モデル系列選択手段１１において、累積尤度の大きい部分モデル系列から順番に探索し、異なるモデルを選択した部分モデル系列を最大でＮ個選択するようにしたので、安定した適応化が行える。
【００３０】
実施形態２．
次に、モデルとしてセミ連続分布の音韻モデルを用いる実施形態を示す。この場合のブロック図は図１と同じであり、フローチャートは図２と同じである。モデルが異なるため、モデル演算と適応化部の動作が異なるが、それ以外は同じであり、説明を省略する。
【００３１】
モデル演算手段７は、フレームごとに、構文情報４に従ったモデルのパラメータを適用し、入力の特徴パラメータ３の尤度を計算する。
この実施形態のモデルのパラメータは、すべての音韻について共通のＭ個のコードブックのガウス分布の平均、分散μm,Σm (m=1,2,...,M)と、音韻モデルｋについての分岐係数λm(k)からなる。
現在の構文状態がｐのとき、構文情報から自己ループを含めて後続の遷移可能なすべての枝を検知し、このすべての枝について、その枝のモデルと遷移先の構文状態の組み合わせ＜k,q＞∈｛＜k1,q1＞,＜k2,q2＞, ..., ＜kn,qn＞｝に対するモデルｋの特徴パラメータｘtの尤度ｆ(t,k)を、混合分布の各分布の尤度N(ｘt,μm,Σm)の加重和として次式で計算する。
【００３２】
【数４】

【００３３】
種となる一フレーム前の部分モデル系列がＳ1,Ｓ2,...のとき、部分モデル系列を一つ選択し、Ｓとする。Ｓは構文状態δ(S)と、累積尤度α(S)と、最終モデルｋ(S)とを情報として保持している。Ｓの構文状態がｐのとき、つぎの演算を行い、構文状態、選択されるモデルの組み合わせに応じて、新しい部分モデル系列の仮説Ｕ1,Ｕ2,...を生成する。
例えば、選択されるモデルがｋで、次の構文状態がｑのとき、これに対応して生成される新しい部分モデル系列をＵとすると、Ｕの構文状態δ(U)はδ(U)=ｑ、Ｕの累積尤度α(U)はα(U)=α(S)＋f(t,k)、Ｕの最終モデルはｋ(U)＝ｋである。
【００３４】
適応化手段１３は、部分モデル系列選択手段１１が現在のフレームで選択した部分モデル系列Ｕ1,Ｕ2,...（最大でＮ個）の、選択されたモデルｋ∈ ｋ(U1), ｋ(U2), ...（最大でＮ個）について、適応化係数１２に従ってパラメータの適応化を行う。
この実施形態のモデルkのパラメータは、すべての音韻について共通のＭ個のコードブック（いずれも正規密度関数で、平均、分散は、μm,Σm (m=1,2,...,M)）である。適応化対象は音韻モデルｋについての分岐係数λm(k)である。従って、補正前のパラメータ１４は、モデルｋについてλm(k)であり、その適応化は次式で行う。
【００３５】
【数５】

【００３６】
なお、Ｎ(ｘt,μm,Σm)が第ｍ番目のコードブックの尤度（正規密度関数の値）である。
λm＝０なる分岐係数は、適応化してもλm＝０のままである。
この実施形態では、したがって、λm＝０なる係数についての適応化のための演算を省略することで、精度に影響を与えずに、演算量を削減することができる。
すべてのモデルについて、上記の適応化が終了した後、適応化手段１３は、適応化の結果得られたパラメータを補正後パラメータ１５としてモデル記憶手段６のパラメータを補正後のパラメータ１５に置き換える。
【００３７】
以上のように、実施形態１と同様、ｔ番目のフレームでの入力フレームの尤度計算に用いるモデルのパラメータは、一つ前のフレームで適応化処理により補正されたパラメータを用いている。これにより、次第に適応化が進んでいく。すなわち、認識結果がでたあとではなく、認識処理中に適応化が進められるものである。
また、構文情報を備えるビーム探索の過程の中で、構文情報で規定される部分モデル系列から、尤度の高い部分モデル系列のモデルを、適応化の対象のモデルとして選択しているため、いわば過去の履歴で補正されたフレームごとの認識結果によるモデルの適応化が実現されることになっている。
このため、従来のビーム探索のビーム幅の制御による、探索量の減少の効果とともに、従来は得られなかった認識精度の改善の効果が期待できる。
この実施形態では、セミ連続分布を用いたため、分岐係数の適応化だけで精度が改善される。計算、適応化が容易である。
【００３８】
実施形態３．
次に、音韻のモデルについて、フレームごとに適応化を行うもので、モデル系列の尤度に応じた適応化係数による適応化をする実施形態を示す。
【００３９】
この場合のブロック図は図１と同じであり、フローチャートは図２と同じである。
音韻のモデルは、実施形態２と同様のセミ連続分布モデルである。
この実施形態では音韻モデルとしてセミ連続分布モデルについて説明したが、混合連続分布モデルでも、同様な効果が期待できる。
適応化手段１３の動作が異なる以外は実施形態２と同様であり、説明を省略する。
【００４０】
適応化手段１３は、部分モデル系列選択手段１１が現在のフレームで選択した部分モデル系列Ｕ1,Ｕ2,...（最大でN個）の、選択されたモデルｋ∈ ｋ(U1), ｋ(U2), ...（最大でＮ個）について、適応化係数１２に従って、選択された部分系列の尤度に応じて、パラメータの適応化を行う。
モデルｋについて、適応化係数ｗ(k)の適応化を行う。
ここで、モデルｋの適応化係数ｗ(k)は、
【００４１】
【数６】

【００４２】
式中、Ｕ(k)は選択されたモデルｋを選択するにあたって用いられた部分モデル系列である。
この実施形態のモデルkのパラメータは、すべての音韻について共通のＭ個のコードブック（正規密度関数、平均、分散μm,Σm (m=1,2,...,M)）である。適応化対象は、音韻モデルｋについての分岐係数λm(k)である。従って、補正前のパラメータ１４は、モデルｋについてλm(k)であり、その適応化は、次式で行う。
【００４３】
【数７】

【００４４】
なお、Ｎ(ｘt,μm,Σm)が第ｍ番目のコードブックの尤度（正規密度関数の値）である。
λm＝０なる分岐係数は、適応化してもλm＝０のままである。
この実施形態では、したがって、λm＝０なる係数についての適応化のための演算を省略することで、精度に影響を与えずに、演算量を削減することができる。
すべてのモデルについて、上記の適応化が終了した後、適応化手段１３は、適応化の結果得られたパラメータを補正後パラメータ１５としてモデル記憶手段６のパラメータを補正後のパラメータ１５に置き換える。
【００４５】
以上のように、実施形態１と同様、ｔ番目のフレームでの入力フレームの尤度計算に用いるモデルのパラメータは、一つ前のフレームで適応化処理により補正されたパラメータを用いている。
これは、構文情報を備えるビーム探索の過程の中で、構文情報で規定される部分モデル系列から、尤度の高い部分モデル系列のモデルを、適応化の対象のモデルとして選択しているため、いわば過去の履歴で補正されたフレームごとの認識結果によるモデルの適応化が実現されることになっている。
このため、従来のビーム探索のビーム幅の制御による、探索量の減少の効果とともに、従来は得られなかった認識精度の改善の効果が期待できる。
この実施形態ではセミ連続分布を用いたため、分岐係数の適応化だけで精度が改善される。計算、適応化が容易である。また、部分系列の尤度を考慮するため、誤った方向の適応化を防止することが期待できる。
【００４６】
実施形態４．
次に音韻境界のモデルについて、フレームごとに適応化を行うものを示す。
音韻境界のモデルは、音韻間の遷移に対応したモデル間の遷移を制御するためのモデルであり、次の尤度比が１より大きいときに音韻間の遷移が可能である。＜尤度比＞＝＜音韻境界である第１の確率密度＞／＜音韻境界でない第２の確率密度＞
この実施形態では、第１の確率密度(Pr(Bt｜境界))および第２の確率密度(Pr(Bt｜非境界))は、コードブックの確率密度関数の次の多項式で与えられる。但し、Ｂtはｔ番目及びその前後のフレームから作成した特徴量である。
【００４７】
【数８】

【００４８】
この実施形態での部分モデル系列選択手段１１は、部分モデル系列格納手段１０の部分モデル系列の中から、音韻境界の遷移が起こった部分モデル系列（即ち、自己ループに対応しないもの）を尤度の大きい方から、最大でＮ個選択する。これにより、特別な計算をすることなく選択が行える。
また、この実施形態での適応化手段１３は、部分モデル系列選択手段１１が現在のフレームで選択した部分モデル系列Ｕ1,Ｕ2,...（最大でＮ個）の、選択されたモデルｋ∈ ｋ(U1), ｋ(U2), ...（最大でN個）について、適応化係数１２に従って、パラメータの適応化を行う。
この実施形態の音韻境界モデルｋのパラメータは、コードブックの尤度に対する分子多項式係数Ｐm(k)であり、従って、補正前のパラメータ１４はモデルｋについてＰm(k)であり、その適応化は次式で行う。
【００４９】
【数９】

【００５０】
なお、ＭＢは音韻境界モデル用のコードブック（正規密度関数）の数、Ｎ(Ｂｔ,μm,Σm)は正規密度関数、μm,Σmはそれぞれ正規密度関数の平均および分散である。ｗは適応化係数である。Ｐm＝０なる多項式係数は、適応化してもＰm＝０のままである。
この実施形態では、したがって、Ｐm＝０なる係数についての適応化のための演算を省略することで、精度に影響を与えずに演算量を削減することができる。すべてのモデルについて、上記の適応化が終了した後、適応化手段１３は、適応化の結果得られたパラメータを補正後パラメータ１５としてモデル記憶手段６のパラメータを補正後のパラメータ１５に置き換える。
【００５１】
実施形態５．
次に、フレームごとのモデルの適応化処理とともに、ビーム探索の幅を、フレームに同期して、漸減させる例を示す。
図５にビーム探索の幅の変化を模式的に示す。フレームごとのモデルの適応化処理によって、尤度が高くなることが期待され、ビーム内における正解の仮説の順位が向上する。このため、ビーム幅をフレームごとに漸減させることで、探索量が削減される。ビーム幅８の更新は次式で行う。但し、θはビーム幅である。θ ← θ＊（１−ｗ）＋＜ビーム幅推定値＞＊ｗ
【００５２】
ここで、＜ビーム幅推定値＞は、数多くの例について認識実験を行い、最終入力フレームにおいて正解の部分モデル系列の尤度と、そのときの尤度が最大の部分モデル系列の尤度との差として求めた。
ビーム幅の初期値は、＜ビーム幅推定値＞に比べ、大きな値に設定する。
上の式でｗはビーム幅をフレームごとに更新するときの度合いを決める適応化係数である。
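The update θ ← θ＊（１−ｗ）＋＜ビーム幅推定値＞＊ｗ above can be sketched in Python as follows. This is a minimal illustration only: the function names and the numeric values are assumptions chosen for the example and do not come from the patent.

```python
def update_beam_width(theta, estimate, w):
    """One frame-synchronous update: theta <- theta * (1 - w) + estimate * w."""
    return theta * (1.0 - w) + estimate * w


def beam_width_schedule(theta0, estimate, w, num_frames):
    """Beam width used at each frame, starting from a wide initial value theta0."""
    widths = []
    theta = theta0
    for _ in range(num_frames):
        widths.append(theta)
        theta = update_beam_width(theta, estimate, w)
    return widths


# Illustrative values: the initial width is set well above the estimate.
schedule = beam_width_schedule(theta0=100.0, estimate=20.0, w=0.5, num_frames=3)
```

With 0 < w < 1 the width moves geometrically from its initial value toward the estimate, so a smaller w narrows the beam more slowly.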
【００５３】
適応化係数をどのように設定するのが妥当かを実験的に決めるため評価実験を行った。音節の３連鎖の制約を構文情報とする。出力の仮説はグラフ構造になっている。グラフ構造の複雑さの減少の程度でフレームごとの適応化の効果を調べた。
図６は、音韻モデル（セミ連続分布モデル）の１フレームごとの適応化を行う実施形態２の適応化のため選択する仮説数Ｎと適応化係数ｗとの組み合わせ条件について、
(1)正解のモデル系列の尤度と最大の尤度を示したモデル系列の尤度との差(Δ)
(2)出力グラフのノード数
(3)出力グラフのエッジ数（枝の数）
に基づいて作成した実験結果を示す。
【００５４】
それぞれの数値は、(1)については、フレームごとの適応化なしの場合を０として、それに対するΔの増加分を、また、(2)と(3)については、フレームごとの適応化なしの場合を１として、それに対する比を、様々な不特定話者の入力音声を認識したときについて平均した数値を示す。
なお、評価に用いた入力音声は、次の２０フレーズである。
【００５５】
（話者）：（フレーズの音韻記述）
ecl0009 ：kaizjoowa dociradesuka
ecl0009 ：kikaisiNkookaikaNnara tookjootawaano maedesu
ecl0009 ：tookjootawaano maedesuka
etl1003 ：tookjootawaano maedesu
etl1003 ：tookjootawaano mae
fuj0003 ：koNdono hujujasumini
fuj0003 ：cukubani cuite osiete kudasai
fuj0003 ：cukubawa
fuj0003 ：zjeeaarude kuru baaiwa
kdd1005 ：koNdo
kdd1005 ：oNseekeNkjuukaiga aruNde soreo kikini ikitaiNdesukeredo
mac0003 ：kikaisiNkookaikaNdesu
mat1003 ：koNdo oNseekeNkjuukaiga aruNde
mat1003 ：tookjootawaadesu
mit0003 ：kanazawano rjokooaNnaisjodesjooka
mit0003 ：sinainiwa cjuuooni keNrokueNga arimasu
nec1011 ：kaNkoopuraNzukurio otasukesimasu
nec1011 ：dokoka mite mitai tokorowa arimasuka
nec1011 ：rakuhokuhoomeNto
【００５６】
また図６の結果をグラフにして表現したものを図７、図８、図９に示す。それぞれＸ軸を適応化係数ｗ、Ｙ軸を適応化のため選択する仮説数Ｎとしたものであり、Ｚ軸として図７は上記(1)のΔ、図８は上記(2)のノード数、図９は上記(3)のエッジ数をとったものである。なお、ＸＹ平面上にはＺ軸の等高線を示している。
図６〜図９から、ｗ＝0.005かつＮ＝1〜50、また、ｗ＝0.01かつＮ＝1〜50、さらに、ｗ＝0.02かつＮ＝1〜200、ｗ＝0.05かつＮ＝50〜100にすれば、Δが減少かつノード数とエッジ数が減少することがわかる。
Δの減少は音声認識の精度の向上を示し、またノード数とエッジ数の減少は、音声認識の精度の向上によって、正解以外のモデル系列の生成が抑制されたことを示していると考えられる。
【００５７】
図１０は、音韻境界のモデル（セミ連続分布モデル）の１フレームごとの適応化を行う実施形態４の適応化のため選択する仮説数Ｎと適応化係数ｗとの組み合わせ条件について、
(1)正解のモデル系列の尤度と最大の尤度を示したモデル系列の尤度との差(Δ)
(2)出力グラフのノード数
(3)出力グラフのエッジ数（枝の数）
に基づいて作成した実験結果を示す。
【００５８】
それぞれの数値は、(1)については、フレームごとの適応化なしの場合を０として、それに対するΔの増加分を、また、(2)と(3)については、フレームごとの適応化なしの場合を１として、それに対する比を、様々な不特定話者の入力音声を認識したときについて平均した数値を示す。評価に用いた入力音声は、上記の２０フレーズである。
また図１０の結果をグラフにして表現したものを図１１、図１２、図１３に示す。それぞれＸ軸を適応化係数ｗ、Ｙ軸を適応化のため選択する仮説数Ｎとしたものであり、Ｚ軸として図１１は上記(1)のΔ、図１２は上記(2)のノード数、図１３は上記(3)のエッジ数をとったものである。なお、ＸＹ平面上にはＺ軸の等高線を示している。
【００５９】
図１０〜図１３から、音韻境界モデルの適応化係数ｗと適応化する境界の種類数Nの適切な範囲としては、ｗ＝0.1かつＮ＝100〜500、また、ｗ＝0.2かつＮ＝100、さらに、ｗ＝0.3かつＮ＝50〜500、ｗ＝0.4かつＮ＝50〜500、ｗ＝0.5かつＮ＝1〜500などで、Δが減少かつノード数とエッジ数が減少することがわかる。Δの減少は、音声認識の精度の向上を示し、また、ノード数とエッジ数の減少は、音声認識の精度の向上によって、正解以外のモデル系列の生成が抑制されたことを示していると考えられる。
【００６０】
【発明の効果】
以上に説明したように、この発明によれば、入力音声を複数のフレームに分割し、当該分割されたフレーム間をモデル間の接続点とするモデルの系列からなるモデル系列に対する入力音声の尤度を探索処理により求め、この尤度に基づき音声認識を行う音声認識方式において、上記探索処理としてビーム探索を用いるものであって、上記入力音声の各フレームで、そのフレームでビーム内に残る上記入力音声の途中までのフレームに対応する部分モデル系列から選択される上記モデルのパラメータを上記入力音声の途中までのフレームに対応する認識結果に基づいて適応化して、上記モデルのパラメータをフレームごとに置き換えるようにしたので、過去の履歴で補正されたフレームごとの認識結果によるモデルの適応化が実現されることになり、探索量の減少の効果とともに認識精度の改善の効果がある。
【図面の簡単な説明】
【図１】この発明の実施形態における音声認識方式の機能ブロック図である。
【図２】この発明の実施形態における音声認識動作のフローチャートである。
【図３】この発明の実施形態における構文制御情報の模式図である。
【図４】この発明の実施形態における構文制御情報の構成の説明図である。
【図５】この発明の実施形態におけるビーム探索の幅の変化を示す説明図である。
【図６】この発明の実施形態における評価結果の説明図である。
【図７】この発明の実施形態における評価結果をグラフ化して示す説明図である。
【図８】この発明の実施形態における評価結果をグラフ化して示す説明図である。
【図９】この発明の実施形態における評価結果をグラフ化して示す説明図である。
【図１０】この発明の実施形態における評価結果の説明図である。
【図１１】この発明の実施形態における評価結果をグラフ化して示す説明図である。
【図１２】この発明の実施形態における評価結果をグラフ化して示す説明図である。
【図１３】この発明の実施形態における評価結果をグラフ化して示す説明図である。
【図１４】従来の音声認識方式の機能ブロック図である。
【符号の説明】
１音声区間切出手段
２分析手段
３特徴パラメータ
４構文情報格納手段
５初期モデル記憶手段
６モデル記憶手段
７モデル演算手段
８ビーム幅
９ビーム探索手段
１０部分モデル系列格納手段
１１部分モデル系列選択手段
１２適応化係数
１３適応化手段
１４補正前パラメータ
１５補正後パラメータ
１０００制御手段
１００１入力音声
１００２後向き探索手段
１００３認識結果
１００４中間スタック
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to improving the accuracy of speech recognition and reducing the amount of search.
[0002]
[Prior art]
Conventionally, beam search has been used as a technique for reducing the amount of search in speech recognition.
Proceedings I of the Acoustical Society of Japan 1989 Spring Meeting (March 1989), pages 5-6, “Examination of a method for changing the DP beam search threshold in the middle of input speech,” proposes a method of reducing the search amount in frame-synchronous DP matching by changing the beam search threshold in the middle of the input speech.
Japanese Patent Application Laid-Open No. 6-282295 discloses a method of reducing the search amount by adaptively changing the search range of the beam search using a control function that takes observable feature quantities as input. A neural network and multiple regression analysis are used for the control function of the beam search threshold.
[0003]
On the other hand, Proceedings I of the Acoustical Society of Japan 1996 Autumn Meeting (September 1996), pages 117-118, “N-best-based speaker adaptation for speech recognition,” proposes a method of unsupervised speaker adaptation (instantaneous adaptation) in which phoneme HMMs are concatenated according to the N-best model sequences estimated from the result of N-best Viterbi recognition, and the parameters of the phoneme HMMs are estimated so as to maximize the likelihood of the concatenated phoneme HMMs for the input speech, thereby adapting them to the speaker being recognized.
Adaptation in this method consists of the following steps.
(1) N-best recognition of the input speech is performed using the phoneme HMMs before adaptation, and N model sequences for the input speech are obtained.
(2) For each model sequence, the phoneme HMM parameters are adapted to the speaker.
(3) After adaptation, the model sequence showing the maximum likelihood is selected.
(4) The current HMMs are updated using the parameters of the adapted phoneme HMMs.
Steps (2) to (4) are repeated.
Therefore, the above method cannot change the models in the middle of the input speech.
[0004]
FIG. 14 is a functional block diagram of a speech recognition method using a conventional beam search.
For each frame of the speech segment extracted from the input speech 1001 by the speech segment extraction means 1, the analysis means 2 performs speech analysis to obtain the time series 3 of feature parameters.
Then, referring to the parameters 5a from the model storage means 51 and, from the syntax information storage means 4, to the syntax information defining the model sequences that correspond to the input speech, the optimum model sequence for the time series 3 of feature parameters is obtained as the recognition result 1003 as follows.
Reference numeral 10 denotes a partial model sequence storage means for storing hypotheses of partial model sequences corresponding to the frames from the first frame to an intermediate frame of the input speech.
[0005]
The first frame number is 1, and the last frame number is T. First, the initial value of the partial model sequence is stored in the partial model sequence storage means 10. Next, the frame number i of the input speech is set to 1.
The model calculation means 7 takes out a partial model sequence hypothesis (denoted H) from the partial model sequence storage means 10.
Next, from the syntax information in the syntax information storage means 4, a model that can be connected to the partial model sequence H (phoneme model k; there may be several) is selected, and the likelihood f(k, i) of the feature parameter of frame number i for the phoneme model k is calculated.
Further, a hypothesis of the partial model sequence grown by one frame by connecting the phoneme model k is created and stored in the intermediate stack 1004 for beam search. The cumulative likelihood of the grown hypothesis is the cumulative likelihood of the seed partial model sequence before growth plus the likelihood of the phoneme model k.
For frame number i, the beam search means 9 compares the cumulative likelihoods of the partial model sequence hypotheses in the intermediate stack 1004 with one another. For example, the likelihood of the hypothesis with the largest cumulative likelihood is taken as the upper limit, the value obtained by subtracting the beam width 8 from this upper limit is taken as the lower limit, and the hypotheses whose cumulative likelihood falls within this range are stored in the partial model sequence storage means 10.
[0006]
In this case, the hypotheses may instead be selected from the intermediate stack by, for example, taking the Nb partial model sequences with the largest cumulative likelihoods, where Nb is the maximum number of hypotheses kept in the beam.
By performing the above processing from the first frame to the last frame of the input speech, the partial model sequence storage means 10 comes to hold model sequence hypotheses corresponding to all frames of the input speech, together with their cumulative likelihoods.
Thereafter, the backward search means 1002 obtains the recognition result 1003 by, for example, selecting the hypothesis with the best cumulative likelihood.
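The Nb-best alternative described above for choosing hypotheses from the intermediate stack can be sketched as follows. This is only an illustration: the tuple layout, the function name `keep_top_nb`, and the scores are assumptions made for the example.

```python
import heapq


def keep_top_nb(stack, nb):
    """Return at most nb hypotheses, largest cumulative likelihood first.

    stack: list of (cumulative_likelihood, model_sequence) tuples.
    """
    return heapq.nlargest(nb, stack, key=lambda hyp: hyp[0])


# Illustrative intermediate stack (the scores are made up).
stack = [
    (-12.0, ("k", "a")),
    (-3.5, ("k", "o")),
    (-7.1, ("t", "o")),
    (-20.4, ("s", "a")),
]
survivors = keep_top_nb(stack, nb=2)
```

Unlike pruning by a likelihood margin, this rule bounds the number of surviving hypotheses per frame regardless of how closely their scores are spaced.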
[0007]
[Problems to be solved by the invention]
The conventional beam search that changes the beam search threshold in the middle of the input speech can reduce the search amount, but the parameters of the models used for recognition are fixed; because recognition is performed with models whose parameters are fixed, no improvement in recognition accuracy is obtained.
In conventional unsupervised adaptation, N-best recognition is first performed with fixed-parameter models to obtain N model sequences, and only then are the model parameters adapted to the input speech from the recognition result.
Consequently, obtaining a more accurate recognition result requires a second recognition pass with the adapted acoustic models.
[0008]
The problem to be solved by the present invention is to improve recognition accuracy in speech recognition using beam search by adapting models, including phoneme models and phoneme boundary models, to the input speech in the middle of that input speech.
A further object is to reduce the search amount by adapting the models in the middle of the input speech and, as a result of the improved model accuracy obtained partway through the input, narrowing the beam search width in the middle of the input speech.
[0009]
[Means for Solving the Problems]
In the speech recognition method according to the present invention, the input speech is divided into a plurality of frames, the likelihood of the input speech for a model sequence, consisting of a sequence of models whose connection points lie between the divided frames, is obtained by search processing, and speech recognition is performed based on this likelihood. Beam search is used as the search processing. In each frame of the input speech, the parameters of the models selected from the partial model sequences that remain in the beam at that frame, which correspond to the frames of the input speech up to that point, are adapted based on the recognition result corresponding to those frames, and the parameters of the models are replaced frame by frame.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
Embodiment 1.
This embodiment shows a case where a continuous mixture-distribution phoneme model is used as the model.
FIG. 1 is a functional block diagram of the speech recognition method in this embodiment. The input speech signal 1001 is divided by the speech extraction means 1 into frames, each a signal section of, for example, 25.6 ms, at a constant analysis period of, for example, 10 ms.
The analysis means 2 converts each frame into a feature parameter 3. The feature parameter of frame number t is denoted Xt. FIG. 2 is a flowchart showing the subsequent operation.
In step 21, the model is initialized: the initial model, which is the model before adaptation, is copied from the initial model storage means 5 to the model storage means 6, and the frame number t is set to 1.
Next, for the frames from t = 1 to the final t = T, the following processing is performed for each frame while incrementing the frame number t by 1.
[0018]
The syntax information stored in the syntax information storage means 4 determines which models can be connected after a partial model sequence, and is represented by nodes representing states and branches representing transitions. It can be expressed, for example, as a graph such as the one shown in FIG. 3. Within the syntax information storage means 4 it is stored as a table such as the one shown in FIG. 4: for the number representing the current syntactic state of a partial model sequence, the table gives the models that can be connected next and the next state number of the partial model sequence extended by selecting each such model. FIG. 4 is the syntax state transition table for the graph of FIG. 3.
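A transition table of this kind might be held as a mapping from the current syntax state to the permitted (model, next state) pairs. The sketch below is hypothetical: the states, phoneme labels, and the name `SYNTAX_TABLE` are invented for illustration and do not reproduce FIG. 3 or FIG. 4.

```python
# Hypothetical table: syntax state -> list of (model, next_state) branches.
SYNTAX_TABLE = {
    0: [("k", 1)],
    1: [("k", 1), ("a", 2)],  # ("k", 1) is a self-loop: remain in phoneme k
    2: [("a", 2)],
}


def successors(state):
    """Branches that may follow a partial model sequence in the given state."""
    return SYNTAX_TABLE.get(state, [])


branches = successors(1)
```

Looking up a state returns every branch that can extend a hypothesis in that state, self-loops included, which is the information the model calculation needs at each frame.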
[0019]
The model calculation means 7 takes in one frame of input (step 22), applies, for each frame, the parameters of the models permitted by the syntax information (step 23), and calculates the likelihood of the input feature parameter 3 (step 24). For a phoneme model k, the model parameters consist of the means, variances, and branching coefficients (the mixture weights) μm(k), Σm(k), λm(k) (m = 1, 2, ..., M) of an M-component Gaussian mixture.
When the current syntax state is p, all branches that can be taken next, including self-loops, are found from the syntax information. For every such branch, that is, for each combination <k, q> ∈ {<k1, q1>, <k2, q2>, ..., <kn, qn>} of branch model and destination syntax state, the likelihood f(t, k) of the feature parameter xt under model k is calculated by the following equation as the weighted sum of the likelihoods N(xt, μm(k), Σm(k)) of the individual distributions of the mixture.
[0020]
[Expression 1]
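The equation itself is not reproduced in this text. Going only by the description above (a weighted sum of the component likelihoods), one plausible reading is f(t, k) = Σm λm(k) N(xt, μm(k), Σm(k)), sketched below. The diagonal covariance and the function names are assumptions. Because the cumulative likelihood is later formed by addition, scores may in practice be kept in the log domain; that detail is not stated here, so the sketch returns the plain weighted sum.

```python
import math


def gaussian_density(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x (assumed form)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def mixture_likelihood(x, weights, means, variances):
    """Weighted sum over components: sum_m lambda_m * N(x; mu_m, Sigma_m)."""
    return sum(
        lam * gaussian_density(x, mu, var)
        for lam, mu, var in zip(weights, means, variances)
    )


# Two identical unit-variance components centred on the observation.
f_tk = mixture_likelihood([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```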

[0021]
In step 25, the partial model sequences from one frame earlier are extended by one frame as follows to generate new partial model sequences. When the seed partial model sequences from one frame earlier are S1, S2, ..., one of them is selected and denoted S. S holds as information its syntax state δ(S), cumulative likelihood α(S), and final model k(S). When the syntax state of S is p, the following operation is performed to generate new partial model sequence hypotheses U1, U2, ... according to the combinations of syntax state and selected model.
For example, when the selected model is k and the next syntax state is q, and the new partial model sequence generated for this combination is denoted U, the syntax state of U is δ(U) = q, the cumulative likelihood of U is α(U) = α(S) + f(t, k), and the final model of U is k(U) = k.
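The extension rule δ(U) = q, α(U) = α(S) + f(t, k), k(U) = k can be sketched as below. The `Hypothesis` class, the `extend` function, and the example scores are illustrative assumptions; only the three update rules come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    state: int       # syntax state, delta(S)
    score: float     # cumulative likelihood, alpha(S)
    last_model: str  # final model, k(S)


def extend(seed, branches, frame_score):
    """Grow one seed hypothesis by one frame.

    branches: (model, next_state) pairs permitted from seed.state.
    frame_score: callable mapping a model k to f(t, k) for the current frame.
    """
    return [Hypothesis(q, seed.score + frame_score(k), k) for k, q in branches]


seed = Hypothesis(state=1, score=-4.0, last_model="k")
scores = {"k": -1.0, "a": -2.5}  # made-up per-frame scores f(t, k)
grown = extend(seed, [("k", 1), ("a", 2)], scores.get)
```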
[0022]
For the partial model sequences U1, U2, ... generated by the model calculation means 7, the beam search means 9 discards the hypotheses that fall outside the beam width range determined by their cumulative likelihoods and the beam width 8 given by the control means 1000, keeps only the hypotheses that fall within the beam width, outputs them as partial model sequences, and stores them in the partial model sequence storage means 10 (step 26).
The beam width range based on the beam width 8 is set by taking the maximum value αmax of the cumulative likelihoods of U1, U2, ... as the upper limit and the value obtained by subtracting the beam width 8 from αmax as the lower limit αmin.
Pruning is performed by keeping, from among U1, U2, ..., the hypotheses whose cumulative likelihoods α(U1), α(U2), ... lie between αmin and αmax and discarding the rest.
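The pruning rule just described can be sketched as follows. The function name, the tuple layout, and the choice to treat the lower limit as inclusive are assumptions made for the example.

```python
def prune(hypotheses, beam_width):
    """Keep hypotheses whose score lies in [alpha_max - beam_width, alpha_max].

    hypotheses: list of (cumulative_likelihood, label) tuples.
    """
    if not hypotheses:
        return []
    alpha_max = max(score for score, _ in hypotheses)
    alpha_min = alpha_max - beam_width
    return [hyp for hyp in hypotheses if hyp[0] >= alpha_min]


kept = prune([(-3.0, "U1"), (-5.0, "U2"), (-9.5, "U3")], beam_width=4.0)
```

Because the range is anchored to the best score in the current frame, the number of survivors varies from frame to frame with how tightly the scores cluster.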
[0023]
Next, the partial model sequence selection means 11 selects the partial model sequences to be used as information for adaptation in the adaptation means 13 (step 27). For example, the partial model sequences in the partial model sequence storage means 10 are examined in descending order of cumulative likelihood, and at most N partial model sequences that selected mutually different models are chosen.
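One way to read this selection rule is sketched below: walk the beam in descending score order and keep the first hypothesis seen for each distinct final model, stopping at N. The function name and data layout are assumptions for illustration.

```python
def select_for_adaptation(beam, n):
    """Pick at most n hypotheses with mutually different final models.

    beam: list of (cumulative_likelihood, final_model) tuples.
    """
    selected = []
    seen_models = set()
    for score, model in sorted(beam, key=lambda hyp: hyp[0], reverse=True):
        if len(selected) >= n:
            break
        if model in seen_models:
            continue  # a better-scoring hypothesis already chose this model
        seen_models.add(model)
        selected.append((score, model))
    return selected


beam = [(-2.5, "a"), (-2.0, "a"), (-4.0, "k"), (-3.0, "o")]
chosen = select_for_adaptation(beam, n=2)
```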
[0024]
Next, for the selected models k ∈ k(U1), k(U2), ... (at most N) of the partial model sequences U1, U2, ... (at most N) selected in the current frame by the partial model sequence selection means 11, the adaptation means 13 adapts the parameters according to the adaptation coefficient 12 (step 28).
In this embodiment, the model parameters for a phoneme model k consist of the means, variances, and branching coefficients μm(k), Σm(k), λm(k) (m = 1, 2, ..., M) of an M-component Gaussian mixture.
The targets of adaptation are the branching coefficients λm(k) applied to the likelihood of each distribution (normal density function) of the M-component mixture and the means μm(k) of each distribution. The parameters 14 before correction are therefore λm(k) and μm(k) for model k, and they are adapted by the following equation.
[0025]
[Expression 2]

[0026]
Note that w is the adaptation coefficient 12, with 0 ≤ w < 1.
In theory, the adaptation of the variance is possible by the following equation, but is not performed in this embodiment in order to reduce the number of parameters to be adapted.
[0027]
[Expression 3]

[0028]
After the above adaptation has been completed for all the selected models, the adaptation means 13 takes the parameters obtained by adaptation as the corrected parameters 15 and replaces the parameters in the model storage means 6 with the corrected parameters 15 (steps 29 and 30).
Then, a backward search is performed by the backward search means 1002, and the recognition result 1003 is obtained (step 31).
The control means 1000 controls everything from the initialization of the model storage means 6 to the frame-synchronous processing of the model calculation means 7, the beam search means 9, and the adaptation means 13.
[0029]
As described above, the model parameters used for the likelihood calculation of the input frame at the t-th frame are the parameters corrected by the adaptation processing at the previous frame. Adaptation therefore advances gradually: it proceeds during the recognition processing, not after the recognition result has been output.
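The ordering described here, where frame t is scored with the parameters adapted at frame t − 1, can be illustrated with a skeleton of the per-frame loop. The function `recognize` and the toy callables are hypothetical stand-ins; the real search and adaptation steps are far richer, and the final backward search is omitted.

```python
def recognize(frames, params, search_step, adapt_step):
    """Frame-synchronous loop: search with current params, then adapt them."""
    beam = None
    for x in frames:
        beam = search_step(x, params, beam)   # uses params adapted at t - 1
        params = adapt_step(x, params, beam)  # produces params for t + 1
    return beam, params


used = []  # records which parameters the search saw at each frame


def toy_search(x, params, beam):
    used.append(params)
    return (beam or []) + [x]


def toy_adapt(x, params, beam):
    return 0.5 * params + 0.5 * x


final_beam, final_params = recognize([2.0, 4.0], 0.0, toy_search, toy_adapt)
```

In the toy run the second frame is scored with the value produced by adapting on the first frame, which is the property the paragraph above emphasizes.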
In addition, in the course of a beam search equipped with syntax information, the models of high-likelihood partial model sequences are selected as the models to be adapted from among the partial model sequences permitted by the syntax information. The models are thus adapted using frame-by-frame recognition results that have, so to speak, been corrected by the past history.
For this reason, in addition to the reduction in search amount obtained by controlling the beam width in a conventional beam search, an improvement in recognition accuracy that could not be obtained conventionally can be expected.
Further, because the partial model sequence selection means 11 examines the partial model sequences in descending order of cumulative likelihood and selects at most N partial model sequences that selected mutually different models, stable adaptation can be performed.
[0030]
Embodiment 2.
Next, an embodiment in which a semi-continuous-distribution phoneme model is used as the model will be described. The block diagram in this case is the same as FIG. 1, and the flowchart is the same as FIG. 2. Because the model is different, the model calculation and the operation of the adaptation means differ; everything else is the same and its description is omitted.
[0031]
The model calculation means 7 applies, for each frame, the parameters of the models permitted by the syntax information 4 and calculates the likelihood of the input feature parameter 3.
The model parameters of this embodiment consist of the means and variances μm, Σm (m = 1, 2, ..., M) of the Gaussian distributions of M codebooks common to all phonemes, and the branching coefficients λm(k) for each phoneme model k.
When the current syntax state is p, all branches that can be taken next, including self-loops, are found from the syntax information. For every such branch, that is, for each combination <k, q> ∈ {<k1, q1>, <k2, q2>, ..., <kn, qn>} of branch model and destination syntax state, the likelihood f(t, k) of the feature parameter xt under model k is calculated by the following equation as the weighted sum of the likelihoods N(xt, μm, Σm) of the individual distributions of the mixture.
[0032]
[Expression 4]

[0033]
When the seed partial model sequences from one frame earlier are S1, S2, ..., one of them is selected and denoted S. S holds as information its syntax state δ(S), cumulative likelihood α(S), and final model k(S). When the syntax state of S is p, the following operation is performed to generate new partial model sequence hypotheses U1, U2, ... according to the combinations of syntax state and selected model.
For example, when the selected model is k and the next syntax state is q, and the new partial model sequence generated for this combination is denoted U, the syntax state of U is δ(U) = q, the cumulative likelihood of U is α(U) = α(S) + f(t, k), and the final model of U is k(U) = k.
[0034]
The adaptation means 13 adapts the parameters, according to the adaptation coefficient 12, of the selected models k ∈ {k(U1), k(U2), ...} (at most N) of the partial model sequences U1, U2, ... (at most N) selected by the partial model series selection means 11 in the current frame.
The parameters of the model k in this embodiment are the M codebooks common to all phonemes (each a normal density function with mean μm and variance Σm, m = 1, 2, ..., M) and the branch coefficients λm(k). The target of adaptation is the branch coefficient λm(k) of the phoneme model k. Therefore, the parameter 14 before correction is λm(k) for the model k, and it is adapted by the following equation.
[0035]
[Equation 5]

[0036]
N (xt, μm, Σm) is the likelihood (value of the normal density function) of the mth codebook.
A branch coefficient with λm = 0 remains 0 after adaptation.
In this embodiment, therefore, the amount of calculation can be reduced without affecting accuracy by omitting the adaptation calculation for coefficients with λm = 0.
After the above adaptation has been completed for all selected models, the adaptation means 13 takes the parameters obtained by the adaptation as the corrected parameters 15 and replaces the parameters stored in the model storage means 6 with the corrected parameters 15.
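The adaptation equation itself is not reproduced in this text. As a rough illustration only, the Python sketch below assumes an interpolation update in which each branch coefficient moves toward its posterior share of the current frame with weight w; this assumed form is consistent with the statement that a zero coefficient stays zero, but it is not confirmed by the source. The function name `adapt_branch_coefficients` is hypothetical.

```python
def adapt_branch_coefficients(weights, likelihoods, w):
    """Return adapted branch coefficients for one model and one frame.

    weights     : current branch coefficients lambda_m(k)
    likelihoods : codebook likelihoods N(x_t, mu_m, Sigma_m) for the frame
    w           : adaptation coefficient (0 <= w <= 1)

    Assumed update: lambda_m <- (1 - w) * lambda_m + w * posterior_m, where
    posterior_m = lambda_m * N_m / sum_j(lambda_j * N_j).
    """
    denom = sum(l * n for l, n in zip(weights, likelihoods) if l > 0.0)
    if denom <= 0.0:
        # No codebook explains the frame; leave the coefficients unchanged.
        return list(weights)
    adapted = []
    for l, n in zip(weights, likelihoods):
        if l == 0.0:
            # A zero coefficient stays zero, so its update is skipped.
            adapted.append(0.0)
        else:
            adapted.append((1.0 - w) * l + w * (l * n / denom))
    return adapted
```

Under this assumed form the coefficients keep summing to 1 when they start out normalized, and the skipped zero entries are where the calculation is saved.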
[0037]
As described above, as in the first embodiment, the parameters of the model used for calculating the likelihood of the input frame in the t-th frame are parameters corrected by the adaptation process in the previous frame. As a result, adaptation gradually proceeds. That is, the adaptation is advanced during the recognition process, not after the recognition result is obtained.
In addition, in the course of the beam search with syntactic information, the models of partial model sequences having high likelihood are selected as the models to be adapted from among the partial model sequences permitted by the syntactic information. Model adaptation is thus realized on the basis of the per-frame recognition results corrected by the past history.
For this reason, in addition to the effect of reducing the search amount by controlling the beam width as in the conventional beam search, an improvement in recognition accuracy that could not be obtained conventionally can be expected.
Because a semi-continuous distribution is used in this embodiment, accuracy is improved by adapting only the branch coefficients, which keeps both the calculation and the adaptation simple.
[0038]
Embodiment 3.
Next, an embodiment is described in which the phoneme model is adapted for each frame, and adaptation is performed using an adaptation coefficient corresponding to the likelihood of the model sequence.
[0039]
The block diagram in this case is the same as FIG. 1, and the flowchart is the same as FIG. 2.
The phoneme model is a semi-continuous distribution model, as in the second embodiment.
Although a semi-continuous distribution model is described as the phoneme model in this embodiment, a similar effect can be expected with a mixture continuous distribution model.
Except for the operation of the adaptation means 13, this embodiment is the same as the second embodiment, and the description of the common parts is omitted.
[0040]
The adaptation means 13 adapts the parameters of the selected models k ∈ {k(U1), k(U2), ...} (at most N) of the partial model sequences U1, U2, ... (at most N) selected by the partial model series selection means 11 in the current frame, using the adaptation coefficient 12 weighted according to the likelihood of the selected partial model sequence.
For the model k, adaptation is performed with the adaptation coefficient w(k).
Here, the adaptation coefficient w(k) of the model k is given by the following equation.
[0041]
[Formula 6]

[0042]
In the equation, U (k) is a partial model sequence used in selecting the selected model k.
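The formula for w(k) is not reproduced in this text, so its exact form is unknown. Purely as a hypothetical illustration of a likelihood-dependent adaptation coefficient, the Python sketch below scales a base coefficient by how close the cumulative log-likelihood of U(k) is to the best selected sequence. The function name, the exponential scaling and the log-domain scores are all assumptions, not the patent's formula.

```python
import math


def adaptation_coefficients(selected, base_w):
    """Map each selected model k to a hypothetical coefficient w(k).

    selected : list of (model, cumulative log-likelihood of U(k)) pairs
    base_w   : base adaptation coefficient applied to the best sequence

    Assumed weighting: w(k) = base_w * exp(alpha(U(k)) - alpha_max), so the
    best sequence receives base_w and less likely sequences receive less.
    """
    if not selected:
        return {}
    best = max(score for _, score in selected)
    coeffs = {}
    for model, score in selected:
        wk = base_w * math.exp(score - best)
        # If a model ends several selected sequences, keep its largest weight.
        coeffs[model] = max(coeffs.get(model, 0.0), wk)
    return coeffs
```

Whatever the actual formula, the intent stated in the text is the same: models reached through less likely partial sequences are adapted more cautiously.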
The parameters of the model k in this embodiment are the M codebooks common to all phonemes (each a normal density function with mean μm and variance Σm, m = 1, 2, ..., M) and the branch coefficients λm(k). The target of adaptation is the branch coefficient λm(k) of the phoneme model k. Therefore, the parameter 14 before correction is λm(k) for the model k, and it is adapted by the following equation.
[0043]
[Expression 7]

[0044]
N (xt, μm, Σm) is the likelihood (value of the normal density function) of the mth codebook.
A branch coefficient with λm = 0 remains 0 after adaptation.
In this embodiment, therefore, the amount of calculation can be reduced without affecting accuracy by omitting the adaptation calculation for coefficients with λm = 0.
After the above adaptation has been completed for all selected models, the adaptation means 13 takes the parameters obtained by the adaptation as the corrected parameters 15 and replaces the parameters stored in the model storage means 6 with the corrected parameters 15.
[0045]
As described above, as in the first embodiment, the parameters of the model used for calculating the likelihood of the input frame in the t-th frame are parameters corrected by the adaptation process in the previous frame.
In the course of the beam search with syntax information, the models of partial model sequences having high likelihood are selected as the models to be adapted from among the partial model sequences permitted by the syntax information. In other words, model adaptation is realized on the basis of the per-frame recognition results corrected by the past history.
For this reason, in addition to the effect of reducing the search amount by controlling the beam width as in the conventional beam search, an improvement in recognition accuracy that could not be obtained conventionally can be expected.
Because a semi-continuous distribution is used in this embodiment, accuracy is improved by adapting only the branch coefficients, which keeps both the calculation and the adaptation simple. In addition, because the likelihood of the partial model sequence is taken into account, adaptation in the wrong direction can be expected to be prevented.
[0046]
Embodiment 4.
Next, an embodiment in which a phoneme boundary model is adapted for each frame is described.
The phoneme boundary model controls the transition between models that corresponds to a transition between phonemes. A transition between phonemes is allowed when the following likelihood ratio is larger than 1.
<likelihood ratio> = <first probability density, given a phoneme boundary> / <second probability density, given no phoneme boundary>
In this embodiment, the first probability density Pr(Bt | boundary) and the second probability density Pr(Bt | nonboundary) are given as probability density functions by the following polynomials. Here, Bt is a feature quantity created from the t-th frame and the frames before and after it.
[0047]
[Equation 8]

[0048]
The partial model series selection means 11 in this embodiment selects, from the partial model sequences held in the partial model series storage means 10, at most N sequences in descending order of likelihood among those in which a phoneme boundary transition has occurred (that is, those that do not correspond to a self-loop). As a result, the selection requires no special calculation.
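The likelihood-ratio test and the selection rule described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the polynomial densities are not reproduced in this text, so both densities are assumed to be linear combinations of shared codebook densities; the denominator coefficients `q_coeffs`, the one-dimensional feature, and the names `boundary_allowed`, `BoundaryHypothesis` and `select_for_boundary_adaptation` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryHypothesis:
    model: str       # final model of the partial model sequence
    score: float     # cumulative likelihood of the sequence
    self_loop: bool  # True when the last step stayed in the same model


def gaussian(x, mean, var):
    """Value of a one-dimensional normal density N(x, mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def boundary_allowed(b, codebook, p_coeffs, q_coeffs):
    """Allow a phoneme transition when Pr(B|boundary) / Pr(B|nonboundary) > 1."""
    dens = [gaussian(b, mu, var) for mu, var in codebook]
    numerator = sum(p * n for p, n in zip(p_coeffs, dens))
    denominator = sum(q * n for q, n in zip(q_coeffs, dens))
    # Comparing directly avoids dividing by a zero denominator.
    return numerator > denominator


def select_for_boundary_adaptation(hyps, n):
    """Pick at most n sequences with a boundary transition, best score first."""
    candidates = [h for h in hyps if not h.self_loop]
    return sorted(candidates, key=lambda h: h.score, reverse=True)[:n]
```

Because the beam search already records each sequence's score and whether its last step was a self-loop, the selection reduces to a filter and a sort, which matches the remark that no special calculation is needed.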
Further, the adaptation means 13 in this embodiment adapts the parameters, according to the adaptation coefficient 12, of the selected models k ∈ {k(U1), k(U2), ...} (at most N) of the partial model sequences U1, U2, ... (at most N) selected by the partial model series selection means 11 in the current frame.
The parameters of the phoneme boundary model k in this embodiment are the coefficients Pm(k) of the numerator polynomial over the codebook likelihoods. Therefore, the parameter 14 before correction is Pm(k) for the model k, and it is adapted by the following equation.
[0049]
[Equation 9]

[0050]
MB is the number of codebooks (normal density functions) for the phoneme boundary model, N(Bt, μm, Σm) is the normal density function, and μm and Σm are the mean and variance of the normal density function, respectively. w is the adaptation coefficient. A polynomial coefficient with Pm = 0 remains 0 after adaptation.
In this embodiment, therefore, the amount of calculation can be reduced without affecting accuracy by omitting the adaptation calculation for coefficients with Pm = 0. After the above adaptation has been completed for all selected models, the adaptation means 13 takes the parameters obtained by the adaptation as the corrected parameters 15 and replaces the parameters stored in the model storage means 6 with the corrected parameters 15.
[0051]
Embodiment 5.
Next, an example in which the beam search width is reduced in synchronization with the frame together with the model adaptation processing for each frame will be described.
FIG. 5 schematically shows the change in the beam search width. The model adaptation for each frame is expected to increase the likelihood and to raise the rank of the correct hypothesis within the beam. The search amount is therefore reduced by narrowing the beam width frame by frame. The beam width 8 is updated by the following equation, where θ is the beam width.
θ ← θ * (1 - w) + <beam width estimate> * w
[0052]
Here, the <beam width estimate> is obtained from recognition experiments on a number of examples, as the difference between the likelihood of the correct partial model sequence at the final input frame and the likelihood of the partial model sequence with the maximum likelihood at that time.
The initial value of the beam width is set larger than the <beam width estimate>.
In the above equation, w is an adaptation coefficient that determines the degree to which the beam width is updated in each frame.
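The frame-synchronous beam width update can be illustrated with a short Python sketch. The update rule is the one stated in the text; the function names and the numeric values in the usage note are hypothetical.

```python
def update_beam_width(theta, estimate, w):
    """One frame of the update: theta <- theta * (1 - w) + estimate * w."""
    return theta * (1.0 - w) + estimate * w


def beam_width_schedule(initial, estimate, w, frames):
    """Beam width at frame 0 and after each of the given number of frames."""
    widths = [initial]
    for _ in range(frames):
        widths.append(update_beam_width(widths[-1], estimate, w))
    return widths
```

For example, `beam_width_schedule(100.0, 50.0, 0.1, 3)` starts at 100.0 and moves one tenth of the remaining distance toward 50.0 in each frame. When the initial width exceeds the estimate, the width shrinks monotonically and never drops below the estimate.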
[0053]
An evaluation experiment was conducted to determine experimentally how the adaptation coefficient should be set. A three-syllable chain constraint was used as the syntactic information, and the output hypotheses form a graph structure. The effect of per-frame adaptation on reducing the complexity of this graph structure was investigated.
FIG. 6 shows the experimental results for Embodiment 2, in which the phoneme model (semi-continuous distribution model) is adapted for each frame, under each combination of the number N of hypotheses selected for adaptation and the adaptation coefficient w. The results are given in terms of the following:
(1) Difference (Δ) between the likelihood of the correct model sequence and the likelihood of the model sequence showing the maximum likelihood
(2) Number of nodes in the output graph
(3) Number of edges (branches) in the output graph
[0054]
For (1), each value is the increment of Δ relative to the case without per-frame adaptation, which is taken as 0. For (2) and (3), each value is the ratio to the case without per-frame adaptation, which is taken as 1. All values are averages over the input speech of various unspecified speakers.
The input speech used for the evaluation is the following 20 phrases.
[0055]
(Speaker): (Phonological description of phrase)
ecl0009: kaizjoowa dociradesuka
ecl0009: kikaisiNkookaikaNnara tookjootawaano maedesu
ecl0009: tookjootawaano maedesuka
etl1003: tookjootawaano maedesu
etl1003: tookjootawaano mae
fuj0003: koNdono hujujasumini
fuj0003: cukubani cuite osiete kudasai
fuj0003: cukubawa
fuj0003: zjeeaarude kuru baaiwa
kdd1005: koNdo
kdd1005: oNseekeNkjuukaiga aruNde soreo kikini ikitaiNdesukeredo
mac0003: kikaisiNkookaikaNdesu
mat1003: koNdo oNseekeNkjuukaiga aruNde
mat1003: tookjootawaadesu
mit0003: kanazawano rjokooaNnaisjodesjooka
mit0003: sinainiwa cjuuooni keNrokueNga arimasu
nec1011: kaNkoopuraNzukurio otasukesimasu
nec1011: dokoka mite mitai tokorowa arimasuka
nec1011: rakuhokuhoomeNto
[0056]
The results of FIG. 6 are plotted as graphs in FIG. 7, FIG. 8, and FIG. 9. The X-axis is the adaptation coefficient w, and the Y-axis is the number N of hypotheses selected for adaptation. The Z-axis shows Δ of (1) above in FIG. 7, the number of nodes of (2) in FIG. 8, and the number of edges of (3) in FIG. 9. Contour lines of the Z-axis values are drawn on the XY plane.
FIG. 6 to FIG. 9 show that Δ decreases and the numbers of nodes and edges decrease when w = 0.005 and N = 1 to 50, w = 0.01 and N = 1 to 50, w = 0.02 and N = 1 to 200, or w = 0.05 and N = 50 to 100.
A decrease in Δ indicates an improvement in the accuracy of speech recognition, and a decrease in the numbers of nodes and edges is considered to indicate that the generation of model sequences other than the correct one is suppressed as a result of that improvement.
[0057]
FIG. 10 shows the experimental results for Embodiment 4, in which the phoneme boundary model (semi-continuous distribution model) is adapted for each frame, under each combination of the number N of hypotheses selected for adaptation and the adaptation coefficient w. The results are given in terms of the following:
(1) Difference (Δ) between the likelihood of the correct model sequence and the likelihood of the model sequence showing the maximum likelihood
(2) Number of nodes in the output graph
(3) Number of edges (branches) in the output graph
[0058]
For (1), each value is the increment of Δ relative to the case without per-frame adaptation, which is taken as 0. For (2) and (3), each value is the ratio to the case without per-frame adaptation, which is taken as 1. All values are averages over the input speech of various unspecified speakers. The input speech used for the evaluation is the 20 phrases listed above.
The results of FIG. 10 are plotted as graphs in FIG. 11, FIG. 12, and FIG. 13. The X-axis is the adaptation coefficient w, and the Y-axis is the number N of hypotheses selected for adaptation. The Z-axis shows Δ of (1) above in FIG. 11, the number of nodes of (2) in FIG. 12, and the number of edges of (3) in FIG. 13. Contour lines of the Z-axis values are drawn on the XY plane.
[0059]
FIG. 10 to FIG. 13 show, as appropriate ranges of the adaptation coefficient w of the phoneme boundary model and the number N of boundary types to be adapted, that Δ decreases and the numbers of nodes and edges decrease when w = 0.1 and N = 100 to 500, w = 0.2 and N = 100, w = 0.3 and N = 50 to 500, w = 0.4 and N = 50 to 500, or w = 0.5 and N = 1 to 500. A decrease in Δ indicates an improvement in the accuracy of speech recognition, and a decrease in the numbers of nodes and edges is considered to indicate that the generation of model sequences other than the correct one is suppressed as a result of that improvement.
[0060]
[Effect of the invention]
As explained above, according to the present invention, in a speech recognition method in which input speech is divided into a plurality of frames, the likelihood of the input speech with respect to a model sequence consisting of a sequence of models whose connection points are the divided frames is obtained by search processing, and speech recognition is performed based on this likelihood, a beam search is used as the search processing, and, in each frame of the input speech, the parameters of a model selected from the partial model sequences remaining in the beam at that frame are adapted based on the recognition result corresponding to the frames up to that point in the input speech, and the parameters of the model are replaced for each frame. As a result, model adaptation based on the per-frame recognition results corrected by the past history is realized, which has the effect of reducing the search amount and improving the recognition accuracy.
[Brief description of the drawings]
FIG. 1 is a functional block diagram of a speech recognition method according to an embodiment of the present invention.
FIG. 2 is a flowchart of a voice recognition operation in the embodiment of the present invention.
FIG. 3 is a schematic diagram of syntax control information according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram of a structure of syntax control information in the embodiment of the present invention.
FIG. 5 is an explanatory diagram showing a change in the width of a beam search in the embodiment of the present invention.
FIG. 6 is an explanatory diagram of evaluation results in the embodiment of the present invention.
FIG. 7 is an explanatory diagram showing a graph of the evaluation result in the embodiment of the present invention.
FIG. 8 is an explanatory diagram showing a graph of the evaluation result in the embodiment of the present invention.
FIG. 9 is an explanatory diagram showing the evaluation results in the embodiment of the present invention in a graph.
FIG. 10 is an explanatory diagram of an evaluation result in the embodiment of the present invention.
FIG. 11 is a graph showing evaluation results in the embodiment of the present invention.
FIG. 12 is a graph showing evaluation results in the embodiment of the present invention.
FIG. 13 is a graph showing evaluation results in the embodiment of the present invention.
FIG. 14 is a functional block diagram of a conventional speech recognition method.
[Explanation of symbols]
1 Voice segment extraction means
2 Analytical means
3 Feature parameter
4 Syntax information storage means
5 Initial model storage means
6 Model storage means
7 Model calculation means
8 Beam width
9 Beam search means
10 Partial model series storage means
11 Partial model series selection means
12 Adaptation coefficient
13 Adaptation means
14 Pre-correction parameters
15 Parameter after correction
1000 Control means
1001 Input voice
1002 Backward search means
1003 Recognition result
1004 Intermediate stack

Claims

A speech recognition method in which input speech is divided into a plurality of frames, the likelihood of the input speech with respect to a model sequence consisting of a sequence of models whose connection points are the divided frames is obtained by search processing, and speech recognition is performed based on this likelihood, wherein
a beam search is used as the search processing, and
in each frame of the input speech, the parameters of a model selected from the partial model sequences that remain in the beam at that frame and correspond to the frames up to that point in the input speech are adapted based on the recognition result corresponding to the frames up to that point, and the parameters of the model are replaced for each frame.

The speech recognition method according to claim 1, wherein the parameters of the model are adapted using the N hypotheses having the highest likelihoods among the hypotheses in the beam partway through the input speech.

The speech recognition method according to claim 1, wherein the parameters of the model are adapted using the N hypotheses having the highest likelihoods among the hypotheses in the beam partway through the input speech, with weights according to those likelihoods.

The speech recognition method according to claim 1, wherein the model is a phoneme model.

The speech recognition method according to claim 1, wherein the model is a phoneme boundary model.

The speech recognition method according to claim 1, wherein the model is a phoneme model and a phoneme boundary model.

The speech recognition method according to claim 4, wherein the phoneme model is a semi-continuous distribution model, and only the branch coefficient of the phoneme model is adapted.

The speech recognition method according to claim 5, wherein the phoneme boundary model is a semi-continuous distribution model, and only the branch coefficient of the phoneme boundary model is adapted.

The speech recognition method according to claim 5, wherein, in adapting the parameters of the phoneme boundary model, hypotheses of partial model sequences having a transition between models are selected when the N hypotheses having the highest likelihoods are selected from the hypotheses in the beam partway through the input speech.

The speech recognition method according to claim 1, wherein the search width of the beam search is changed for each frame using an adaptation coefficient.