JP3567477B2

JP3567477B2 - Utterance deformed speech recognition device

Info

Publication number: JP3567477B2
Application number: JP05060594A
Authority: JP
Inventors: 鈴木　　忠
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1994-03-22
Filing date: 1994-03-22
Publication date: 2004-09-22
Anticipated expiration: 2019-09-22
Also published as: JPH07261780A

Description

【０００１】
【産業上の利用分野】
本発明は、環境騒音による発声変形が生じた音声を対象とする音声認識装置に関するものである。
【０００２】
【従来の技術】
騒音下音声認識を実現する上で、環境騒音による発声変形（ロンバード効果）は、雑音重畳による音声信号の品質劣化と並ぶ重要な問題となっている。ロンバード効果による音韻スペクトルの変形に対して、音韻や話者に依存しない補正手法がこれまでに提案されている。
【０００３】
特開平４−２９６７９９号公報に記載された音声認識装置や特開平５−６１９６号公報に記載された音声認識装置では、ロンバード効果により３００Ｈｚ〜１５００Ｈｚ内のホルマントが大きく変動することについて、入力音声に対するホルマント周波数分析と、環境騒音レベルもしくは入力音声のレベルによって規定される周波数変動量により、ケプストラムパラメータ上で補正する手法が提案されている。特開平４−２５７８９８号公報に記載されたロンバード音声認識方法においても前述の帯域におけるホルマント周波数の変動に着目して、標準パタンのスペクトルと入力パタンのスペクトルのマッチングの際に、１．５ｋＨｚ以下のズレをＤＰマッチングで補正する方法を提案している。
しかしながらこれらの手法は、ロンバード効果によるスペクトル変形の個人性や音韻依存性を考慮しておらず、また前記帯域以外の変動については具体的補正手法を示すに至っていない。そのため、語彙数の多い認識では十分な認識率が得られないという欠点があった。
【０００４】
これに対し近年、スペクトル変形の様態を表現する発声変形モデルを定義し、このモデルのパラメータを大量の発声変形音声データを用いて音韻ごとに学習、認識に用いる手法が、文献“高騒音下音声認識における発声変形対処法の検討”（鈴木、中島、日本音響学会講演論文集平成５年１０月ｐｐ．１４７−１４８）において提案されている。
図４はこの手法に基づく発声変形音声認識装置の構成図の一例である。図において、２は入力端１より入力された入力音声信号に対し音響分析を行い、入力音声特徴ベクトル時系列３を出力する音響分析手段、４は音韻ごとに学習された発声変形モデルを格納する発声変形モデルメモリ、５は発声変形がない音声データを学習データとして得られた発声変形なし音声標準モデルを格納する発声変形なし音声標準モデルメモリ、６は発声変形モデルメモリ４に格納されている発声変形モデルと、発声変形なし音声標準モデルメモリ５に記憶されている発声変形なし音声標準モデルとを入力として、音声認識処理を行う発声変形音声認識手段、７は発声変形音声認識手段６の出力である認識結果である。図５に発声変形音声認識手段６の構成図の一例を示す。８は前記発声変形なし音声標準モデルに対し前記発声変形モデルを用いて音韻スペクトルの変形を行うスペクトル変形手段、９はスペクトル変形手段８の出力であるところの変形音声標準モデルと発声変形なし音声標準モデルメモリ５に格納されている発声変形なし音声標準モデルとを合成し、混合型音声標準モデルを生成する音声モデル合成手段、１０は音声モデル合成手段９の出力である混合型音声標準モデルと、入力音声特徴ベクトル時系列３に対する尤度を演算する尤度演算手段、１１は尤度演算手段の出力である尤度データを用いて、照合処理を行い認識結果７を出力する照合手段である。
【０００５】
次に動作について、連続分布型音素片ＨＭＭによる離散単語認識の場合を例にとり説明を行う。発声変形なし音声標準モデルメモリ５には、発声変形のない音声データを用いて学習した音素片ＨＭＭが発声変形なし音声標準モデルとして格納されている。認識対象となる各単語音声は音素片ＨＭＭの連鎖で表現される。発声変形モデルは各音素片のスペクトル変形に対応して生成され、発声変形モデルメモリ４に格納されているものとする。
【０００６】
入力端１より入力された未知入力単語音声信号は、音響分析手段２における音響分析により各分析フレームごとに特徴ベクトルが抽出され、入力音声特徴ベクトル時系列３｛Ｘ（ｎ）｜ｎ＝１．．．Ｎ｝に変換される。ここでＸ（ｎ）は第ｎフレームの特徴ベクトル、Ｎはフレーム数である。
スペクトル変形手段８は、発声変形なし音声標準モデルメモリ５に格納されているところの音素片Ｌ（Ｌは音素片の種類を表すラベルとする）に対応する発声変形なし音素片ＨＭＭの平均ベクトルに対し、発声変形モデルメモリ４に格納されている発声変形モデルＬ _Ｔを用いてスペクトル変形処理を施す。平均ベクトル以外のパラメータは何等所作を加えない。この処理をすべての音素片について行う。
音声モデル合成手段９は、発声変形なし音声標準モデルメモリ５に格納されている発声変形なし音素片ＨＭＭと、これに対しスペクトル変形手段８でスペクトル変形処理を施されて得られた発声変形音素片ＨＭＭとを用い、２混合等確率の混合連続分布型音素片ＨＭＭを生成する。
尤度演算手段１０は前記入力音声特徴ベクトル時系列３の各特徴ベクトルＸ（ｎ）と、音響モデル合成手段９の出力であるところの混合連続分布型音素片ＨＭＭとの尤度演算を行い、尤度データを出力する。尤度データＰ（ｎ，Ｌ）は、ラベルＬの音素片ＨＭＭに対する入力音声特徴ベクトル時系列中の特徴ベクトルＸ（ｎ）の尤度を表し、すべてのＬについて１≦ｎ≦Ｎの範囲で求める。
照合手段１１は、尤度演算手段１０の出力である尤度データを用いて、認識語彙の単語音声を表す音素片の連鎖に従い、各単語に対する尤度をビタビ演算もしくはトレリス演算により求め、尤度が最大になる単語のカテゴリを認識結果として出力する。
【０００７】
【発明が解決しようとする課題】
従来の装置は以上のように構成されているため、発声変形モデルにより表現された一定の変形様態に従った変形音声標準モデルが生成されることになる。発声変形モデルは、前記文献における学習手順によれば、ある騒音環境下において発声された音声の、音素片ごとの平均的な変形様態を表現している。ところが実際には発声変形音声の変形の強度は、同一騒音環境下においても、アクセントの有無や声の大きさによって大きく変動している。そのため、発声変形モデルが表現する固定的なスペクトル変形処理を施した変形音声標準モデルでは、十分な認識性能が得られないという問題があった。
また、従来の発声変形モデルは、スペクトルの変形にのみ注目していたが、ロンバード効果による音声の変形は、発話時間の伸長としても現れる。現在、ＨＭＭを用いた音声認識方式においては、音韻の継続時間による尤度ペナルティを併用することで、認識性能の向上を実現している。これに対し、前述の発話時間の伸長は、音韻継続時間による尤度ペナルティの精度を劣化させ、認識性能の劣化につながっていた。
【０００８】
本発明は上記の問題を解決するためになされたもので、発声変形の強度を表すパラメータの関数として定義される発声変形モデルを従来の発声変形モデルから生成し、入力音声に対する尤度を最大にする発声変形の強度パラメータを求める機能を持たせることで、発声変形の強度の変動による認識性能の劣化を免れることを目的としている。
また、発声変形なし音声標準モデルに含まれる音韻継続時間パラメータに対し、ロンバード効果による変動を補償するように変更する機能を付加することで、発声変形音声の認識率の向上を図る。
【０００９】
【課題を解決するための手段】
この発明に係る発声変形音声認識装置は、
適応型発声変形モデル生成手段と、スペクトル変形手段と、発声変形音声認識手段と、適応型尤度演算手段と、照合手段と、を有する発声変形音声認識手段をさらに備え、
前記適応型発声変形モデル生成手段は、前記適応型尤度演算手段が求めた発声変形度パラメータを入力し、前記発声変形モデルメモリに格納されている発声変形モデルから前記発声変形度パラメータに従う適応型発声変形モデルを生成し、前記スペクトル変形手段は、前記発声変形なし音声標準モデルメモリに格納されている音声標準モデルに対し、前記適応型発声変形モデルに基づくスペクトル変形処理を施し、
前記適応型尤度演算手段は、前記入力音声特徴ベクトル時系列と前記スペクトル変形手段が出力した音声標準パタンとの尤度を最大にする前記発声変形度パラメータを求めるとともに、前記適応型発声変形モデル生成手段に入力して、前記発声変形度パラメータに基づく尤度を求め、
前記照合手段は、前記適応型尤度演算手段の出力を用いて照合処理を行い、認識結果を出力することを特徴とする。
【００１１】
また請求項３の発明における発声変形音声認識装置は、発声変形なし音声標準モデルメモリと音声認識手段との間に、発声変形なし音声標準モデルの継続時間パラメータを変更する継続時間パラメータ変更手段を入れたことを特徴とする。
【００１２】
【作用】
この発明において、適応型発声変形モデル生成手段は、適応型尤度演算手段が出力した発声変形度パラメータに従い、発声変形モデルメモリに格納されている発声変形モデルから適応型発声変形モデルを生成する。
本発明におけるスペクトル変形手段は、適応型発声変形モデル生成手段の出力であるところの適応型発声変形モデルに従い、発声変形なし音声標準モデルメモリに格納されている発声変形なし音声標準モデルに対しスペクトル変形処理を施し、変形音声標準モデルを生成する。
また適応型尤度演算手段は、入力音声特徴ベクトル時系列に対し、尤度を最大にする発声変形度パラメータを求め、そのパラメータに対応してスペクトル変形手段が生成した変形音声標準モデルに対する入力音声特徴ベクトルの尤度データを演算し、照合手段に出力する。
【００１３】
また他の発明によるマルチ発声変形モデル生成手段は、発声変形モデルメモリ上の発声変形モデルから、変形度メモリに格納されている発声変形度パラメータに則り、発声変形の強度の異なる発声変形モデルを生成する。
選択型尤度演算手段は、入力音声特徴ベクトルに対し、発声変形の強度が異なる発声変形モデルに基づきスペクトル変形手段で生成された変形音声標準モデルの中での尤度最大値を尤度データとして、照合手段に出力する。
【００１４】
また別の発明においては、継続時間パラメータ変更手段は、発声変形なし音声標準モデルメモリに格納されている発声変形なし音声標準モデルの音韻継続時間パラメータに対し、ロンバード効果による発話時間の伸長を補正するように変更を行い、発声変形音声認識手段へ送る。
【００１５】
【実施例】
実施例１．
図１は、請求項１の発明にかかわる発声変形音声認識装置に使われる発声変形音声認識手段の一実施例の構成を示すブロック図である。図において、４は発声変形モデルを格納する発声変形モデルメモリ、５は発声変形がない音声データから学習した発声変形なし音声標準モデルを格納する発声変形なし音声標準モデルメモリ、１２は発声変形モデルメモリ４に格納されている発声変形モデルから、入力される発声変形度パラメータに従う適応型発声変形モデルを生成する適応型発声変形モデル生成手段、８は入力される適応型声変形モデルを用いて、発声変形なし音声標準モデルメモリ５に格納されている発声変形なし音声標準モデルに対し、スペクトル変形処理を施すスペクトル変形手段、１４は適応型発声変形モデル生成手段１２に出力した発声変形度パラメータと、その値に対応してスペクトル変形手段から出力された発声変形モデルに対する入力音声特徴ベクトル時系列３の尤度とを用いて、入力音声特徴ベクトル時系列に対し最適な発声変形度パラメータによる尤度データ１５を照合手段に出力する適応型尤度演算手段、１１は尤度データ１５を用いて、照合処理を行い認識結果７を出力する照合手段である。
【００１６】
次に動作について、従来例の説明と同じく連続分布型音素片ＨＭＭによる離散単語認識の場合を例にとって説明する。発声変形なし音声標準モデルメモリ５には、発声変形のない音声データを用いて学習した音素片ＨＭＭが発声変形なし音声標準モデルとして格納されている。認識対象となる各単語音声は音素片ＨＭＭの連鎖で表現される。発声変形モデルは各音素片のスペクトル変形に対応して生成され、発声変形モデルメモリ４に格納されているものとする。従来例と重複する部分は説明を省略する。
【００１７】
適応型発声変形モデル生成手段１２は、発声変形モデルメモリ４に格納されている発声変形モデルから、後述する適応型尤度演算手段が決定した発声変形度パラメータ１３に従う発声変形の強度をもつ適応型発声変形モデルを生成する。
発声変形モデルは、従来例である前記文献と同じくロンバード効果によるスペクトルの変形について、以下の３つの要素で構成される。
（１）ホルマント周波数の移動を表す周波数軸非線形伸縮関数
（２）スペクトルの全体傾斜の変化を表すフィルタ
（３）ホルマントＱの変化を表すフィルタ
発声変形モデルメモリ４に格納されている発声変形モデルＬ _Ｔに対する適応型発声変形モデルは、発声変形度パラメータｗの関数として以下のように定義される。
Δ^Ｌ _ｔ（ｗ）＝ｗ・Δ^Ｌ _ｔ
δ^Ｌ _ｔ（ｗ）＝ｗ・δ^Ｌ _ｔ
ＱＬ_ｔ（ｗ）＝ｗ・Ｑ^Ｌ _ｔ
ここで、Δ^Ｌ _ｔは（１）から得られる周波数ｔにおける周波数シフト量、δ^Ｌ _ｔおよびＱ^Ｌ _ｔはそれぞれ（２）（３）の各フィルタに対応する対数スペクトルでｔは周波数を表している。ｗは０以上の値をとり、ｗ＝０では変形なし、徐々にｗを大きくすることで変形度が増し、ｗ＝１では元の発声変形モデルと同じになる。ｗを１以上にすることでより強い変形も表現できる。
【００１８】
スペクトル変形手段８は、従来例と同じく入力された適応型発声変形モデルを用いて、発声変形なし音声標準パタンメモリ５に格納されているラベルＬの発声変形なし音素片ＨＭＭに対し、同じラベルＬに対応する適応型発声変形モデルによるスペクトル変形処理を施し、発声変形音素片ＨＭＭとして出力する。スペクトル変形処理の対象は、音素片ＨＭＭの場合、各状態における平均ベクトルとなる。
適応型尤度演算手段１４は、適応型発声変形モデル生成手段１２に対し出力する発声変形度パラメータ１３の値の変更と、それに対応してスペクトル変形手段８が出力した発声変形音素片ＨＭＭ（ラベルＬ）に対する入力音声特徴ベクトル時系列３の特徴ベクトルＸ（ｎ）の尤度演算を繰り返し、最も大きい尤度を尤度データＰ（ｎ，Ｌ）として出力する。これをすべてのＬ、１≦ｎ≦Ｎについて行う。これにより発声変形の強さの変動の影響を受けない尤度が得られる。
照合手段１１は、従来例と同じように、尤度データ１５を用いて各単語に対する尤度をビタビ演算もしくはトレリス演算により求め、尤度最大となる単語のカテゴリを認識結果として出力する。
【００１９】
実施例２．
図２は、請求項２の発明に係る発声変形音声認識装置に使われる発声変形音声認識手段の一実施例の構成を示すブロック図である。図において、１６は各発声変形モデルについて設定される複数個の相異なる発声変形度パラメータを記憶する変形度メモリ、１７は発声変形モデルメモリ４に記憶されている各発声変形モデルを入力として、変形度メモリ１６に格納されている発声変形度パラメータを用いて発声変形の強さの相異なる複数の発声変形モデルを生成するマルチ発声変形モデル生成手段、８はマルチ発声変形モデル生成手段１７の出力であるところの発声変形モデルを用いて、発声変形なし音声標準モデルメモリ５に格納されている発声変形なし音声標準モデルに対しスペクトル変形処理を施すスペクトル変形手段、１８はスペクトル変形手段８の出力であるところの変形音声標準モデルに対する入力音声特徴ベクトル時系列３の尤度を求め、同一の発声変形なし音声標準モデルから生成された変形音声標準モデルの中での最大尤度を尤度データとして照合手段に出力する選択型尤度演算手段、１１は選択型尤度演算手段１８の出力であるところの尤度データを用いて、照合処理を行い認識結果７を出力する照合手段である。
【００２０】
次に動作について、従来例と同じく連続分布型音素片ＨＭＭによる離散単語認識の場合を例にとって説明する。発声変形なし音声標準モデルメモリ５には、発声変形のない音声データを用いて学習した音素片ＨＭＭが発声変形なし音声標準モデルとして格納されている。認識対象となる各単語音声は音素片ＨＭＭの連鎖で表現される。発声変形モデルは各音素片のスペクトル変形に対応して生成され、発声変形モデルメモリ４に格納されているものとする。従来例と重複する部分は説明を省略する。
【００２１】
変形度メモリ１６には、ラベルＬの音素片における発声変形の強さの変動の分布を近似する複数個の発声変形度パラメータ｛ｗ_Ｌ（ｋ）｜ｋ＝１．．．Ｋ_Ｌ｝（Ｋ_ＬはラベルＬの音素片に対する発声変形度パラメータの数）が、すべてのラベルについて記憶されている。
マルチ発声変形モデル生成手段１７は、発声変形モデルメモリ４に記憶されている、各音素片に対応する発声変形モデルに対し、変形度メモリ１６に格納されている該音素片に対応する複数個の発声変形度パラメータに従い、前述の実施例１における適応型発声変形モデル生成手段における適応型発声変形モデルの定義に則り発声変形度パラメータの個数と等しい数の発声変形モデルを生成する。
スペクトル変形手段８は、発声変形なし音声標準モデルメモリ５に記憶されているラベルＬの発声変形なし音素片ＨＭＭに対し、マルチ発声変形モデル生成手段１７の出力であるところのラベルＬに対応する複数個の発声変形モデルによる、スペクトル変形処理を施し、発声変形音素片ＨＭＭとして出力する。これをすべてのＬについて行う。
選択型尤度演算手段１８は、スペクトル変形手段の出力であるところのラベルＬに対応する複数個の発声変形音素片ＨＭＭに対する、入力音声特徴ベクトル時系列３の特徴ベクトルＸ（ｎ）の尤度を求め、その中で最大の尤度を尤度データＰ（ｎ，Ｌ）として出力する。これをすべてのＬ、１≦ｎ≦Ｎについて行う。
照合手段は、従来例と同じように、尤度データ１５を用いて各単語に対する尤度をビタビ演算もしくはトレリス演算により求め、尤度最大となる単語のカテゴリを認識結果として出力する。
【００２２】
実施例３．
図３は、請求項３の発明に係る発声変形音声認識装置の位置実施例の構成を示すブロック図である。図において、１９は発声変形なし音声標準モデルメモリ５に格納されている発声変形なし音声標準モデルを入力とし、該発声変形なし音声標準モデルの音韻継続時間パラメータに対し変更を加えて、発声変形音声認識手段へ出力する、継続時間パラメータ変更手段である。その他の構成要素は、前述の従来例におけるものと全く同一であるので説明を省略する。
【００２３】
次に動作について、継続時間制御付き連続分布型音素片ＨＭＭによる離散単語認識の場合を例にとって説明する。従来例と重複する部分は説明を省略する。
発声変形なし音声標準モデルメモリ５には、発声変形がない音声データから生成した発声変形なし音声標準モデルが格納されている。各単語音声の発声変形音声標準モデルは、連続分布型音素片ＨＭＭの連鎖で表されている。また各音素片について継続時間の平均と分散が求められており、認識時には継続時間によるペナルティを含めた尤度計算が行われる。
継続時間パラメータ変更手段１９は、ロンバード効果による各音素片の継続時間の変化についての情報として、発声変形音声における音素片継続時間の平均の伸び率と分散の増大率を多数話者について調査した得た平均値を保持しており、これに従い、発声変形なし音声標準モデルメモリ５に記憶されている発声変形なし音声標準モデルの音素片継続時間パラメータを変更し、出力する。
これにより、継続時間によるペナルティを用いた照合方式において、ロンバード効果による発話時間の伸長による認識精度の劣化が抑えられる。
この継続時間補正手法は、音素片への適用に限定されるものではなく、半音素、音素、音節、ＣＶＣ、ＶＣＶ、単語など如何なる音声単位であってもかまわない。
【００２４】
【発明の効果】
この発明は、以上説明したように構成されているので、以下に記載されるような効果を奏する。
【００２５】
請求項１の発明においては、適応型尤度演算手段が設定した発声変形度パラメータに従って適応型発声変形モデルが生成され、この適応型発声変形モデルに基づくスペクトル変形を発声変形なし音声標準モデルに施し、得られた変形音声標準モデルに対する入力音声特徴ベクトル時系列との尤度に従って発声変形度パラメータが更新されているので、入力音声における発声変形の強さの変動の影響を受けにくい発声変形音声認識装置を得ることができる。
【００２７】
また、請求項２の発明においては、発声変形なし音声標準モデルにおける音韻継続時間に関するパラメータに対し、ロンバード効果による発話時間の伸長に適合した補正を施しているため、音韻継続時間によるペナルティを用いる音声認識装置においてロンバード効果による発話時間伸長による認識精度劣化が生じ難くなっている。
【図面の簡単な説明】
【図１】この発明の実施例１を示すブロック図である。
【図２】この発明の実施例２を示すブロック図である。
【図３】この発明の実施例３を示すブロック図である。
【図４】従来の音声認識装置の全体構成を示すブロック図である。
【図５】従来の音声認識装置の構成する機能の一つである発声変形音声認識手段の構成を示すブロック図である。
【符号の説明】
１入力端
２音響分析手段
３入力音声特徴ベクトル時系列
４発声変形モデルメモリ
５発声変形なし音声標準モデルメモリ
６発声変形音声認識手段
７認識結果
８スペクトル変形手段
９音声モデル合成手段
１０尤度演算手段
１１照合手段
１２適応型発声変形モデル生成手段
１３発声変形度パラメータ
１４適応型尤度演算手段
１５尤度データ
１６変形度メモリ
１７マルチ発声変形モデル生成手段
１８選択型尤度演算手段
１９継続時間パラメータ変更手段

[0001]
[Industrial applications]
The present invention relates to a speech recognition device for speech that has undergone utterance deformation caused by environmental noise.
[0002]
[Prior art]
In realizing speech recognition under noise, utterance deformation due to environmental noise (the Lombard effect) is as important a problem as the degradation of speech signal quality caused by superimposed noise. Correction methods that do not depend on the phoneme or the speaker have been proposed for the deformation of the phoneme spectrum due to the Lombard effect.
[0003]
The speech recognition devices described in JP-A-4-296799 and JP-A-5-6196 address the large shift, caused by the Lombard effect, of formants within 300 Hz to 1500 Hz. They propose correcting the cepstrum parameters using a formant frequency analysis of the input speech and a frequency shift amount determined by the environmental noise level or the input speech level. The Lombard speech recognition method described in JP-A-4-257898 likewise focuses on the formant frequency shift in this band, and proposes correcting displacements at or below 1.5 kHz by DP matching when the spectrum of the standard pattern is matched against the spectrum of the input pattern.
However, these methods do not take into account the speaker dependence or phoneme dependence of the spectrum deformation due to the Lombard effect, and they do not give a specific correction method for shifts outside the above band. As a result, they have the drawback that a sufficient recognition rate cannot be obtained in large-vocabulary recognition.
[0004]
In contrast, a method has recently been proposed that defines an utterance deformation model expressing the manner of spectrum deformation, learns the parameters of this model for each phoneme from a large amount of utterance-deformed speech data, and uses them for recognition. It is described in "A Study of Methods for Coping with Utterance Deformation in Speech Recognition under High Noise" (Suzuki, Nakajima, Proceedings of the Acoustical Society of Japan, October 1993, pp. 147-148).
FIG. 4 is an example configuration of an utterance-deformed speech recognition device based on this method. In the figure, 2 is acoustic analysis means that performs acoustic analysis on the input speech signal received at input terminal 1 and outputs the input speech feature vector time series 3; 4 is an utterance deformation model memory that stores the utterance deformation models learned for each phoneme; 5 is a speech standard model memory without utterance deformation that stores the speech standard models without utterance deformation obtained by training on speech data without utterance deformation; 6 is utterance-deformed speech recognition means that performs speech recognition using the utterance deformation models stored in the utterance deformation model memory 4 and the speech standard models without utterance deformation stored in the speech standard model memory without utterance deformation 5; and 7 is the recognition result output by the utterance-deformed speech recognition means 6. FIG. 5 shows an example configuration of the utterance-deformed speech recognition means 6. In FIG. 5, 8 is spectrum deformation means that deforms the phoneme spectrum of the speech standard model without utterance deformation using the utterance deformation model; 9 is speech model synthesis means that combines the deformed speech standard model output by the spectrum deformation means 8 with the speech standard model without utterance deformation stored in the speech standard model memory without utterance deformation 5 to generate a mixed speech standard model; 10 is likelihood calculation means that calculates the likelihood of the input speech feature vector time series 3 with respect to the mixed speech standard model output by the speech model synthesis means 9; and 11 is matching means that performs matching using the likelihood data output by the likelihood calculation means and outputs the recognition result 7.
[0005]
Next, the operation will be described, taking discrete word recognition with continuous-distribution phoneme-segment HMMs as an example. The speech standard model memory without utterance deformation 5 stores phoneme-segment HMMs trained on speech data without utterance deformation as the speech standard models without utterance deformation. Each word to be recognized is represented by a chain of phoneme-segment HMMs. An utterance deformation model is assumed to have been generated for the spectrum deformation of each phoneme segment and stored in the utterance deformation model memory 4.
[0006]
The unknown input word speech signal received at input terminal 1 undergoes acoustic analysis in the acoustic analysis means 2, which extracts a feature vector for each analysis frame and converts the signal into the input speech feature vector time series 3, {X(n) | n = 1, ..., N}. Here, X(n) is the feature vector of the n-th frame, and N is the number of frames.
The spectrum deformation means 8 applies spectrum deformation processing to the mean vectors of the phoneme-segment HMM without utterance deformation that corresponds to phoneme segment L (where L is a label denoting the type of phoneme segment) stored in the speech standard model memory without utterance deformation 5, using the utterance deformation model L_T stored in the utterance deformation model memory 4. Parameters other than the mean vectors are left unchanged. This processing is performed for all phoneme segments.
The speech model synthesis means 9 uses the phoneme-segment HMM without utterance deformation stored in the speech standard model memory without utterance deformation 5, together with the utterance-deformed phoneme-segment HMM obtained from it by the spectrum deformation means 8, to generate a mixed continuous-distribution phoneme-segment HMM with two equally weighted mixture components.
The likelihood calculation means 10 calculates the likelihood between each feature vector X(n) of the input speech feature vector time series 3 and the mixed continuous-distribution phoneme-segment HMMs output by the speech model synthesis means 9, and outputs likelihood data. The likelihood data P(n, L) represents the likelihood of the feature vector X(n) in the input speech feature vector time series with respect to the phoneme-segment HMM of label L, and is obtained for every L over the range 1 ≤ n ≤ N.
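The two-component, equal-probability mixture described above can be illustrated with a small sketch. It assumes diagonal-covariance Gaussian output densities and a single state per phoneme segment; neither detail is stated in the text, and the function names `log_gauss_diag` and `log_mixture_equal` are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_equal(x, mean_plain, mean_deformed, var):
    """Log-likelihood under a two-component mixture with weights 0.5 each.

    Only the mean differs between the components, because the text leaves
    every parameter other than the mean vector unchanged.
    """
    a = log_gauss_diag(x, mean_plain, var)
    b = log_gauss_diag(x, mean_deformed, var)
    hi = max(a, b)
    # log(0.5 * e^a + 0.5 * e^b), computed in a numerically stable way
    return hi + math.log(0.5 * math.exp(a - hi) + 0.5 * math.exp(b - hi))
```

When the deformed mean equals the original mean, the mixture reduces to the single Gaussian, which is a convenient sanity check.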
The matching means 11 uses the likelihood data output by the likelihood calculation means 10 to obtain the likelihood of each word by Viterbi calculation or trellis calculation, following the chain of phoneme segments that represents each word in the recognition vocabulary, and outputs the category of the word with the maximum likelihood as the recognition result.
[0007]
[Problems to be solved by the invention]
Because the conventional device is configured as described above, it generates a deformed speech standard model that follows the single, fixed manner of deformation expressed by the utterance deformation model. According to the learning procedure in the above literature, the utterance deformation model expresses the average manner of deformation, for each phoneme segment, of speech uttered in a given noise environment. In practice, however, the strength of the deformation varies greatly even within the same noise environment, depending on whether the segment is accented and on how loud the voice is. Consequently, a deformed speech standard model produced by the fixed spectrum deformation processing expressed by the utterance deformation model cannot achieve sufficient recognition performance.
Further, the conventional utterance deformation model pays attention only to spectrum deformation, but voice deformation due to the Lombard effect also appears as an increase in utterance time. At present, in the speech recognition method using the HMM, the recognition performance is improved by using a likelihood penalty based on the duration of a phoneme. On the other hand, the extension of the utterance time described above deteriorates the accuracy of the likelihood penalty due to the phoneme duration, leading to deterioration of the recognition performance.
[0008]
The present invention has been made to solve the above problem, and generates an utterance deformation model defined as a function of a parameter representing the intensity of utterance deformation from a conventional utterance deformation model to maximize the likelihood for input speech. An object of the present invention is to avoid a deterioration in recognition performance due to a variation in the intensity of the utterance deformation by providing a function for calculating an intensity parameter of the utterance deformation.
In addition, by adding a function that changes the phoneme duration parameters included in the speech standard model without utterance deformation so as to compensate for the variation due to the Lombard effect, the invention aims to improve the recognition rate for utterance-deformed speech.
[0009]
[Means for Solving the Problems]
The utterance-modified voice recognition device according to the present invention includes:
Adaptive utterance deformation model generation means, spectrum deformation means, utterance deformation speech recognition means, adaptive likelihood calculation means, and matching means, further comprising utterance deformation speech recognition means,
The adaptive utterance deformation model generation means receives the utterance deformation degree parameter obtained by the adaptive likelihood calculation means and generates, from the utterance deformation model stored in the utterance deformation model memory, an adaptive utterance deformation model according to the utterance deformation degree parameter; the spectrum deformation means applies spectrum deformation processing based on the adaptive utterance deformation model to the speech standard model stored in the speech standard model memory without utterance deformation,
The adaptive likelihood calculation means obtains the utterance deformation degree parameter that maximizes the likelihood between the input speech feature vector time series and the speech standard pattern output by the spectrum deformation means, inputs that parameter to the adaptive utterance deformation model generation means, and obtains the likelihood based on the utterance deformation degree parameter,
The matching means performs a matching process using the output of the adaptive likelihood calculation means, and outputs a recognition result.
[0011]
The utterance-deformed speech recognition device according to the third aspect of the present invention is characterized in that duration parameter changing means, which changes the duration parameters of the speech standard model without utterance deformation, is inserted between the speech standard model memory without utterance deformation and the speech recognition means.
[0012]
[Action]
In the present invention, the adaptive utterance deformation model generation means generates an adaptive utterance deformation model from the utterance deformation model stored in the utterance deformation model memory according to the utterance deformation degree parameter output from the adaptive likelihood calculation means.
The spectrum deformation means of the present invention applies spectrum deformation processing to the speech standard model without utterance deformation stored in the speech standard model memory without utterance deformation, according to the adaptive utterance deformation model output by the adaptive utterance deformation model generation means, and generates a deformed speech standard model.
The adaptive likelihood calculation means obtains the utterance deformation degree parameter that maximizes the likelihood for the input speech feature vector time series, calculates the likelihood data of the input speech feature vector with respect to the deformed speech standard model that the spectrum deformation means generates for that parameter, and outputs the likelihood data to the matching means.
[0013]
The multi-utterance deformation model generation means according to another invention generates utterance deformation models of different deformation strengths from the utterance deformation model in the utterance deformation model memory, in accordance with the utterance deformation degree parameters stored in the deformation degree memory.
The selective likelihood calculation means outputs to the matching means, as the likelihood data for an input speech feature vector, the maximum likelihood among the deformed speech standard models that the spectrum deformation means generates from the utterance deformation models of different deformation strengths.
[0014]
In still another invention, the duration parameter changing means changes the phoneme duration parameters of the speech standard model without utterance deformation stored in the speech standard model memory without utterance deformation so as to correct for the lengthening of utterance time due to the Lombard effect, and sends the result to the utterance-deformed speech recognition means.
[0015]
[Example]
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of an embodiment of the utterance-deformed speech recognition means used in the utterance-deformed speech recognition device according to the first aspect of the present invention. In the figure, 4 is an utterance deformation model memory that stores the utterance deformation models; 5 is a speech standard model memory without utterance deformation that stores the speech standard models without utterance deformation trained on speech data without utterance deformation; 12 is adaptive utterance deformation model generation means that generates, from the utterance deformation model stored in the utterance deformation model memory 4, an adaptive utterance deformation model according to the input utterance deformation degree parameter; 8 is spectrum deformation means that applies spectrum deformation processing to the speech standard model without utterance deformation stored in the speech standard model memory without utterance deformation 5, using the input adaptive utterance deformation model; 14 is adaptive likelihood calculation means that uses the utterance deformation degree parameter it outputs to the adaptive utterance deformation model generation means 12, together with the likelihood of the input speech feature vector time series 3 with respect to the deformed speech standard model that the spectrum deformation means outputs for that value, to output to the matching means the likelihood data 15 obtained with the utterance deformation degree parameter that is optimal for the input speech feature vector time series; and 11 is matching means that performs matching using the likelihood data 15 and outputs the recognition result 7.
[0016]
Next, the operation will be described, again taking discrete word recognition with continuous-distribution phoneme-segment HMMs as an example, as in the description of the conventional example. The speech standard model memory without utterance deformation 5 stores phoneme-segment HMMs trained on speech data without utterance deformation as the speech standard models without utterance deformation. Each word to be recognized is represented by a chain of phoneme-segment HMMs. An utterance deformation model is assumed to have been generated for the spectrum deformation of each phoneme segment and stored in the utterance deformation model memory 4. Description of the parts that overlap with the conventional example is omitted.
[0017]
The adaptive utterance deformation model generation means 12 generates, from the utterance deformation model stored in the utterance deformation model memory 4, an adaptive utterance deformation model whose deformation strength follows the utterance deformation degree parameter 13 determined by the adaptive likelihood calculation means described later.
The utterance deformation model is composed of the following three elements with respect to spectrum deformation due to the Lombard effect, similarly to the above-mentioned literature as a conventional example.
(1) A nonlinear frequency-axis warping function representing the shift of the formant frequencies
(2) A filter representing the change in the overall tilt of the spectrum
(3) A filter representing the change in formant Q
The adaptive utterance deformation model for the utterance deformation model L_T stored in the utterance deformation model memory 4 is defined as a function of the utterance deformation degree parameter w as follows.
$\Delta^{L}_{t}(w) = w \cdot \Delta^{L}_{t}$
$\delta^{L}_{t}(w) = w \cdot \delta^{L}_{t}$
$Q^{L}_{t}(w) = w \cdot Q^{L}_{t}$
Here, $\Delta^{L}_{t}$ is the frequency shift amount at frequency t obtained from (1), and $\delta^{L}_{t}$ and $Q^{L}_{t}$ are the log spectra corresponding to the filters of (2) and (3), respectively, where t denotes frequency. w takes a value of 0 or more: w = 0 gives no deformation, the degree of deformation increases as w is gradually increased, and w = 1 reproduces the original utterance deformation model. Setting w to 1 or more expresses still stronger deformation.
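The scaling by w defined above can be sketched as follows. The container `DeformationModel`, its field names, and the function `adapt` are hypothetical; the text does not say how the three components are stored, so each is assumed here to be a list sampled over frequency bins t.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeformationModel:
    """One utterance deformation model (hypothetical container)."""
    freq_shift: List[float]  # frequency shift per frequency bin t, from (1)
    tilt_log: List[float]    # log spectrum of the spectral-tilt filter, (2)
    q_log: List[float]       # log spectrum of the formant-Q filter, (3)


def adapt(model: DeformationModel, w: float) -> DeformationModel:
    """Scale all three components by the deformation degree parameter w."""
    if w < 0:
        raise ValueError("w must be 0 or more")
    return DeformationModel(
        freq_shift=[w * v for v in model.freq_shift],
        tilt_log=[w * v for v in model.tilt_log],
        q_log=[w * v for v in model.q_log],
    )
```

With this definition, `adapt(model, 0.0)` yields an all-zero model (no deformation) and `adapt(model, 1.0)` returns a copy equal to the stored model, matching the behaviour described for w = 0 and w = 1.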
[0018]
As in the conventional example, the spectrum deformation means 8 uses the input adaptive utterance deformation model to apply spectrum deformation processing to the phoneme-segment HMM without utterance deformation of label L stored in the speech standard model memory without utterance deformation 5, using the adaptive utterance deformation model that corresponds to the same label L, and outputs the result as an utterance-deformed phoneme-segment HMM. For a phoneme-segment HMM, the target of the spectrum deformation processing is the mean vector of each state.
The adaptive likelihood calculation means 14 repeatedly changes the value of the utterance deformation degree parameter 13 that it outputs to the adaptive utterance deformation model generation means 12 and calculates the likelihood of the feature vector X(n) of the input speech feature vector time series 3 with respect to the utterance-deformed phoneme-segment HMM (label L) that the spectrum deformation means 8 outputs in response, and it outputs the largest likelihood as the likelihood data P(n, L). This is performed for all L and 1 ≤ n ≤ N. The resulting likelihood is therefore not affected by variation in the strength of the utterance deformation.
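The repeated search over the deformation degree parameter can be sketched as below. Two simplifications are assumptions, not part of the text: the spectrum deformation is replaced by a plain additive shift of the mean vector (`mean + w * shift`) instead of frequency warping plus two filters, and w is searched over a fixed grid, since the text does not specify how the value is varied. All function and parameter names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def adaptive_log_likelihood(x, mean, var, shift, w_grid):
    """Return (best log-likelihood, best w) for one frame and one label.

    `shift` stands in for the effect of the full deformation model on the
    mean vector; this is a simplification made for illustration only.
    """
    best_ll, best_w = -math.inf, None
    for w in w_grid:
        deformed_mean = [m + w * s for m, s in zip(mean, shift)]
        ll = log_gauss_diag(x, deformed_mean, var)
        if ll > best_ll:
            best_ll, best_w = ll, w
    return best_ll, best_w
```

Calling this for every frame n and every label L would fill the table P(n, L); the selected w may differ from frame to frame, which is what lets the likelihood follow changes in deformation strength.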
The matching unit 11 obtains the likelihood of each word by the Viterbi operation or the trellis operation using the likelihood data 15 and outputs the category of the word having the maximum likelihood as a recognition result, as in the conventional example.
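As an illustration of the matching step, the following sketch scores each word by a Viterbi pass over its chain of phoneme segments and picks the best word. It assumes one state per phoneme segment, a strict left-to-right topology, and no transition probabilities or duration penalties; the text does not give these details, and the names `viterbi_word_score` and `recognize` are hypothetical.

```python
import math


def viterbi_word_score(P, chain):
    """Best log-likelihood of aligning all frames to a chain of labels.

    P[n][label] is the frame-level log-likelihood P(n, L); `chain` is the
    list of phoneme-segment labels that spells one word.
    """
    if not P or not chain:
        return -math.inf
    prev = [-math.inf] * len(chain)
    prev[0] = P[0][chain[0]]          # the path must start in the first segment
    for n in range(1, len(P)):
        cur = [-math.inf] * len(chain)
        for j, label in enumerate(chain):
            stay = prev[j]
            move = prev[j - 1] if j > 0 else -math.inf
            best = max(stay, move)
            if best > -math.inf:
                cur[j] = best + P[n][label]
        prev = cur
    return prev[-1]                   # the path must end in the last segment


def recognize(P, lexicon):
    """Return the word whose chain gives the highest Viterbi score."""
    return max(lexicon, key=lambda word: viterbi_word_score(P, lexicon[word]))
```

A trellis (forward) calculation would replace the `max` over predecessors with a log-sum; the sketch shows only the Viterbi variant.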
[0019]
Embodiment 2.
FIG. 2 is a block diagram showing the configuration of an embodiment of the utterance-deformed speech recognition means used in the utterance-deformed speech recognition device according to the second aspect of the present invention. In the figure, 16 is a deformation degree memory that stores a plurality of mutually different utterance deformation degree parameters set for each utterance deformation model; 17 is multi-utterance deformation model generation means that takes each utterance deformation model stored in the utterance deformation model memory 4 as input and uses the utterance deformation degree parameters stored in the deformation degree memory 16 to generate a plurality of utterance deformation models of different deformation strengths; 8 is spectrum deformation means that applies spectrum deformation processing to the speech standard model without utterance deformation stored in the speech standard model memory without utterance deformation 5, using the utterance deformation models output by the multi-utterance deformation model generation means 17; 18 is selective likelihood calculation means that obtains the likelihood of the input speech feature vector time series 3 with respect to the deformed speech standard models output by the spectrum deformation means 8 and outputs to the matching means, as likelihood data, the maximum likelihood among the deformed speech standard models generated from the same speech standard model without utterance deformation; and 11 is matching means that performs matching using the likelihood data output by the selective likelihood calculation means 18 and outputs the recognition result 7.
[0020]
Next, the operation will be described, again taking discrete word recognition with continuous-distribution phoneme-segment HMMs as an example, as in the conventional example. The speech standard model memory without utterance deformation 5 stores phoneme-segment HMMs trained on speech data without utterance deformation as the speech standard models without utterance deformation. Each word to be recognized is represented by a chain of phoneme-segment HMMs. An utterance deformation model is assumed to have been generated for the spectrum deformation of each phoneme segment and stored in the utterance deformation model memory 4. Description of the parts that overlap with the conventional example is omitted.
[0021]
The deformation degree memory 16 stores a plurality of utterance deformation degree parameters {w _L (k) | k = 1. . . K _L} is _(K _L is the number of utterance variation degree parameters for phoneme label L), are stored for all labels.
The multi-utterance deformation model generation means 17 takes the utterance deformation model for each phoneme segment stored in the utterance deformation model memory 4 and, using the plurality of utterance deformation degree parameters stored for that phoneme segment in the deformation degree memory 16, generates as many utterance deformation models as there are utterance deformation degree parameters. The generation follows the definition of the adaptive utterance deformation model used by the adaptive utterance deformation model generation means in Embodiment 1.
The spectrum deformation means 8 applies spectrum deformation processing to the phoneme-segment HMM of label L stored in the speech standard model memory without utterance deformation 5, using each of the plurality of utterance deformation models for label L output by the multi-utterance deformation model generation means 17, and outputs the results as utterance-deformed phoneme-segment HMMs. This is performed for all L.
The selection-type likelihood calculation means 18 calculates the likelihood of each feature vector X(n) of the input speech feature vector time series 3 with respect to the plurality of utterance-deformed phoneme-segment HMMs for label L output by the spectrum deformation means 8, and outputs the maximum of these likelihoods as likelihood data P(n, L). This is performed for all L and for 1 ≦ n ≦ N.
As in the conventional example, the matching means 11 uses the likelihood data 15 to obtain the likelihood of each word by Viterbi or trellis computation, and outputs the category of the word with the maximum likelihood as the recognition result.
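The selection step described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from this specification: each phoneme-segment model is reduced to a single diagonal-covariance Gaussian, and the utterance deformation model is treated as a mean-shift vector scaled by the degree parameter w_L(k). The function names and numerical values are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def selective_log_likelihood(x, base_mean, base_var, deformation, degrees):
    """Return the maximum log-likelihood over all deformation degrees and its index.

    ASSUMPTION (illustrative only): the deformation model is a mean-shift
    vector, and degree w scales that shift before it is added to the mean
    of the model without utterance deformation.
    """
    scores = []
    for w in degrees:
        deformed_mean = [m + w * d for m, d in zip(base_mean, deformation)]
        scores.append(log_gauss_diag(x, deformed_mean, base_var))
    k_best = max(range(len(scores)), key=lambda k: scores[k])
    return scores[k_best], k_best

# Hypothetical values for one label L and one frame n.
base_mean = [0.0, 0.0]
base_var = [1.0, 1.0]
deformation = [1.0, -1.0]
degrees = [0.0, 0.5, 1.0]      # stands in for {w_L(k)}
x = [1.0, -1.0]                # stands in for X(n)

score, k_best = selective_log_likelihood(x, base_mean, base_var, deformation, degrees)
```

In this sketch the returned score plays the role of the logarithm of P(n, L), and the index identifies which deformation degree best explains the frame.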
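The matching by Viterbi computation can be sketched as follows. This is a simplified illustration under stated assumptions, not the matching means of the specification: each phoneme-segment label is treated as a single state, transition probabilities are ignored, and the table of log-likelihoods stands in for the logarithm of the likelihood data P(n, L). The names `viterbi_word_score`, `recognize`, and the example lexicon are hypothetical.

```python
NEG_INF = float("-inf")

def viterbi_word_score(loglik, labels):
    """Best-path log-likelihood of a left-to-right chain of labels.

    loglik[n][L] is the log-likelihood of frame n for label L.
    ASSUMPTION: one state per label, and only self-loop or advance-by-one
    transitions, with no transition probabilities.
    """
    n_states = len(labels)
    score = [NEG_INF] * n_states
    score[0] = loglik[0][labels[0]]          # the path must start in the first label
    for n in range(1, len(loglik)):
        new = [NEG_INF] * n_states
        for s, lab in enumerate(labels):
            stay = score[s]
            move = score[s - 1] if s > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[s] = best + loglik[n][lab]
        score = new
    return score[-1]                          # the path must end in the last label

def recognize(loglik, lexicon):
    """Return the word whose label chain has the highest Viterbi score."""
    return max(lexicon, key=lambda w: viterbi_word_score(loglik, lexicon[w]))

# Hypothetical three-frame input and two-word lexicon.
loglik = [
    {"a": -1.0, "b": -5.0},
    {"a": -1.0, "b": -5.0},
    {"a": -5.0, "b": -1.0},
]
lexicon = {"ab": ["a", "b"], "ba": ["b", "a"]}
result = recognize(loglik, lexicon)
```

A trellis (forward) computation would replace the `max` over predecessor scores with a log-sum; the rest of the structure would stay the same.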
[0022]
Embodiment 3.
FIG. 3 is a block diagram showing the configuration of an embodiment of the utterance-deformed speech recognition device according to the third aspect of the present invention. In the figure, reference numeral 19 denotes duration parameter changing means that takes as input the speech standard model without utterance deformation stored in the speech standard model memory without utterance deformation 5, changes the phoneme-segment duration parameters of that model, and outputs the result to the utterance-deformed speech recognition means. The other components are exactly the same as in the conventional example described above, and their description is omitted.
[0023]
Next, the operation will be described taking as an example discrete word recognition using continuous-distribution phoneme-segment HMMs with duration control. Description of the parts that are the same as in the conventional example is omitted.
The speech standard model memory without utterance deformation 5 stores the speech standard model without utterance deformation generated from speech data without utterance deformation. The speech standard model of each word is represented by a chain of continuous-distribution phoneme-segment HMMs. The mean and variance of the duration are obtained for each phoneme segment, and at recognition time the likelihood is calculated with a penalty based on duration.
The duration parameter changing means 19 holds information on how the duration of each phoneme segment changes under the Lombard effect, obtained by examining, over a large number of speakers, the growth rate of the mean and the growth rate of the variance of phoneme-segment durations in utterance-deformed speech. In accordance with the average of these values, it changes the phoneme-segment duration parameters of the speech standard model without utterance deformation stored in the speech standard model memory without utterance deformation 5 and outputs the result.
As a result, in the matching method using the penalty based on the duration, deterioration in recognition accuracy due to extension of the speech time due to the Lombard effect can be suppressed.
This duration correction method is not limited to phoneme segments and may be applied to any speech unit, such as a half phoneme, phoneme, syllable, CVC, VCV, or word.
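The duration correction described above can be illustrated with a short sketch. The specification does not give the form of the duration penalty or any growth-rate values, so the following assumes a Gaussian log-density penalty over duration and uses hypothetical growth rates; the function names are likewise hypothetical.

```python
import math

def change_duration_params(mean, var, mean_growth, var_growth):
    """Scale the clean-speech duration mean and variance by average growth rates.

    ASSUMPTION: the growth rates are multiplicative factors averaged over
    many speakers of utterance-deformed speech.
    """
    return mean * mean_growth, var * var_growth

def duration_log_penalty(duration, mean, var):
    """Gaussian log-density of an observed duration (assumed penalty form)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (duration - mean) ** 2 / var)

# Hypothetical duration statistics for one speech unit, in frames.
clean_mean, clean_var = 8.0, 4.0
lombard_mean, lombard_var = change_duration_params(clean_mean, clean_var, 1.25, 1.5)

observed = 10.0   # a lengthened unit, as might occur under the Lombard effect
before = duration_log_penalty(observed, clean_mean, clean_var)
after = duration_log_penalty(observed, lombard_mean, lombard_var)
```

With these illustrative numbers the lengthened unit is penalized less after the parameters are changed, which is the behavior the correction aims for.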
[0024]
[Effects of the invention]
Since the present invention is configured as described above, it has the following effects.
[0025]
According to the first aspect of the present invention, an adaptive utterance deformation model is generated according to the utterance deformation degree parameter set by the adaptive likelihood calculation means, spectrum deformation based on that adaptive utterance deformation model is applied to the speech standard model without utterance deformation, and the utterance deformation degree parameter is updated according to the likelihood between the resulting deformed speech standard model and the input speech feature vector time series. An utterance-deformed speech recognition device that is less susceptible to fluctuations in the strength of utterance deformation in the input speech can therefore be obtained.
[0027]
According to the second aspect of the present invention, the parameters relating to phoneme duration in the speech standard model without utterance deformation are corrected to suit the lengthening of utterance time caused by the Lombard effect. A speech recognition device that uses a penalty based on phoneme duration therefore suffers little degradation in recognition accuracy from that lengthening.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of the present invention.
FIG. 2 is a block diagram showing a second embodiment of the present invention.
FIG. 3 is a block diagram showing a third embodiment of the present invention.
FIG. 4 is a block diagram showing an overall configuration of a conventional speech recognition device.
FIG. 5 is a block diagram showing the configuration of the utterance-deformed speech recognition means, which is one of the functions of the conventional speech recognition device.
[Explanation of symbols]
1 Input terminal
2 Acoustic analysis means
3 Input speech feature vector time series
4 Utterance deformation model memory
5 Speech standard model memory without utterance deformation
6 Utterance-deformed speech recognition means
7 Recognition result
8 Spectrum deformation means
9 Speech model synthesis means
10 Likelihood calculation means
11 Matching means
12 Adaptive utterance deformation model generation means
13 Utterance deformation degree parameter
14 Adaptive likelihood calculation means
15 Likelihood data
16 Deformation degree memory
17 Multi-utterance deformation model generation means
18 Selection-type likelihood calculation means
19 Duration parameter changing means

Claims

Acoustic analysis means for performing acoustic analysis on an input speech signal and outputting an input speech feature vector time series,
An utterance deformation model memory for storing an utterance deformation model expressing the manner of deformation of the phoneme spectrum that arises in speech uttered in a noisy environment,
A speech standard model memory without utterance deformation for storing a speech standard model learned from speech data without utterance deformation, and
Utterance-deformed speech recognition means for performing recognition processing on the input speech feature vector time series output from the acoustic analysis means, using the utterance deformation model and the speech standard model, and outputting a recognition result, in a speech recognition device comprising the foregoing,
The utterance-deformed speech recognition means comprises adaptive utterance deformation model generation means, spectrum deformation means, adaptive likelihood calculation means, and matching means,
The adaptive utterance deformation model generation means receives the utterance deformation degree parameter obtained by the adaptive likelihood calculation means and generates, from the utterance deformation model stored in the utterance deformation model memory, an adaptive utterance deformation model according to that utterance deformation degree parameter,
The spectrum deformation means performs spectrum deformation processing based on the adaptive utterance deformation model on the speech standard model stored in the speech standard model memory without utterance deformation,
The adaptive likelihood calculation means obtains the utterance deformation degree parameter that maximizes the likelihood between the input speech feature vector time series and the speech standard pattern output by the spectrum deformation means, inputs it to the adaptive utterance deformation model generation means, and determines the likelihood based on that utterance deformation degree parameter, and
The matching means performs matching processing using the output of the adaptive likelihood calculation means and outputs a recognition result, the foregoing characterizing the utterance-deformed speech recognition device.

Acoustic analysis means for performing acoustic analysis on an input speech signal and outputting an input speech feature vector time series,
An utterance deformation model memory for storing an utterance deformation model expressing the manner of deformation of the phoneme spectrum that arises in speech uttered in a noisy environment,
A speech standard model memory without utterance deformation for storing a speech standard model learned from speech data without utterance deformation, and
Utterance-deformed speech recognition means for performing recognition processing on the input speech feature vector time series output from the acoustic analysis means, using the utterance deformation model and the speech standard model, and outputting a recognition result, in a speech recognition device comprising the foregoing,
An utterance-deformed speech recognition device characterized in that duration parameter changing means for changing a duration parameter of the speech standard model without utterance deformation is inserted between the speech standard model memory without utterance deformation and the utterance-deformed speech recognition means.