JP2004309959A

JP2004309959A - Device and method for speech recognition

Info

Publication number: JP2004309959A
Application number: JP2003106394A
Authority: JP
Inventors: Satoru Suzuki (鈴木 哲); Takeo Kanamori (金森 丈郎)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-04-10
Filing date: 2003-04-10
Publication date: 2004-11-04

Abstract

PROBLEM TO BE SOLVED: In a speech recognition device that applies noise suppression to the input signal to obtain a high recognition rate even for low-S/N speech, the speech after suppression is sometimes distorted further, depending on the precision with which the noise suppression coefficient is estimated, and this distortion adversely affects speech recognition.

SOLUTION: A speech B6 that has been noise-suppressed by a noise suppressing means 14 is recognized by a feature quantity extracting means 16, a likelihood calculating means 107, and a maximum likelihood search means 109. A suppression quantity decision coefficient calculating means 21 calculates a suppression quantity decision coefficient from recognition information B10 such as the calculated likelihood, feature quantity vector, and phoneme series, and a suppression quantity estimating means 13 also uses this coefficient to control the amount of noise suppressed by the noise suppressing means 14.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
低Ｓ／Ｎの音声に対しても高い認識率を得るために入力信号に対して雑音抑圧手法を用いる音声認識装置に関する。
【０００２】
【従来の技術】
従来の一般的な音声認識装置を図６に示す（例えば、特許文献１参照）。マイクなどの音声入力手段１０１から入力された入力信号Ｂ１を、特徴量抽出手段１６において、音響分析を行うことにより、例えば、ＬＰＣ（ＬｉｎｅａｒＰｒｅｄｉｃｔｉｖｅＣｏｅｆｆｉｃｉｅｎｔ）ケプストラムやＭＦＣＣ（メル周波数ケプストラム係数）などの認識計算に最適な特徴量ベクトルＢ７に変換する。尤度算出手段１７では、特徴量抽出手段１６で求めた特徴量ベクトルＢ７と音響モデル保存手段１８が保存している音響モデルより、複数の音素片からなるサブワードの単位で、特徴量間の平均値や分散値より算出する距離や、ＨＭＭ（ｈｉｄｄｅｎＭａｌｃｏｖＭｏｄｅｌ）による遷移確率を尤度Ｂ８とする。最尤探索手段１９は、ＤＰマッチングやビタビサーチなどの認識用ルールを保存している文法辞書保存手段２０の文法や単語辞書に従って、尤度算出手段１７で得られた尤度列より、最尤となるパスを探索し、これを認識結果Ｂ９として、結果表示提示手段１０２に表示提示する。
【０００３】
音声認識処理内での雑音に関するアルゴリズムとして用いられているのは、乗法性歪みに有効なＣＭＥ／ＣＭＳ（ＣｅｐｓｔｒａｌＭｅａｎＥｑｕａｌｉｚａｔｉｏｎ／Ｓｕｂｔｒａｃｔｉｏｎ）や、加法性乗法性双方の歪みに有効なＪａｃｏｂｉａｎ行列近似法がある。前者は、音声認識の特徴量として多く利用されているケプストラムをそのまま利用し加減算のみでよいため、非常に簡便で効果の高い方法とされている。しかし、窓関数を利用して抽出している特徴量であるために、残響時間の長い反射や残響の影響までは除去することが不可能である。また、後者はＨＭＭ合成法を簡略化する方法として提案されたものの、ＳＳ（ＳｐｅｃｔｒａｌＳｕｂｔｒａｃｔｉｏｎ）と比較すると計算量が膨大となるなどの欠点がある。そのため、より耐雑音性能を高めるため、特に加法性の雑音の対策として、スペクトラム領域でのＳＳ法などの雑音抑圧処理が合わせて実施されている（ＷＯ９８／３９９４６）。
【０００４】
図７に従来の一般的な雑音抑圧方法を用いた音声認識装置を示す（例えば、特許文献２を参照）。スペクトル変換手段１１において、図６同様に入力された入力信号Ｂ１を、スペクトル表現に変換する。このスペクトル表現の入力信号Ｂ２から、Ｓ／Ｎ推定手段１２において、Ｓ／Ｎ推定を行う。抑圧量推定手段１３では、雑音抑圧係数保存手段１５内のテーブルから、推定Ｓ／Ｎに基づいて読み出された係数Ｂ４および入力音声に対する学習推定により、抑圧量を推定する。雑音抑圧手段１４では、抑圧量推定手段１３で推定された抑圧係数Ｂ５に基づいて雑音抑圧演算を行い、雑音抑圧された入力信号Ｂ６として音声認識ユニットに出力する。ここで雑音抑圧アルゴリズムとして用いられるものは，前述のＳＳやＷｉｅｎｅｒＦｉｌｔｅｒ等が挙げられる。いずれも、雑音抑圧量を調整するため、推定したノイズスペクトルに倍率変数であるαを掛け合わせるのが一般的である。ＳＳの機能は、（数１）で表される。
【０００５】
【数１】

【０００６】
なお、（数１）において、雑音抑制後の周波数ｗにおけるパワーＳ（ｗ）、観測入力信号の周波数ｗにおけるパワーＸ（ｗ）、推定ノイズ区間の周波数スペクトルパワーＮ（ｗ）、αは倍率変数、βは必要以上引きすぎが起こらないように設定されるフロアリング定数である。また、（数１）におけるｍａｘとは、カンマで区切られた２つの値のうち大きい値をとる関数である。
【０００７】
入力されるノイズおよび音声の関係などによって変動する入力信号に対して、このＳＳを用いた雑音抑圧法では、いかに推定ノイズスペクトルを精度よく求めることができるか。およびαやβの設定値をいかに制御するかによって、必要以上に引きすぎることなく、つまり抑圧後の音声信号が認識に悪影響を与えてしまうことをなくすかが、掛かっているのである。
【０００８】
この課題を解決するため、雑音抑圧後の消し残りパワースペクトルを算出して雑音ＨＭＭを学習し、これをクリーンな音声データと合成して音響モデルを作成し直す方法が提案されている（例えば、特許文献３）。また、雑音区間の推定に関して、非定常部分では小さく、定常部分では大きくフィルタを更新する方法（例えば、特許文献４）、Ｓ／Ｎに応じて推定した雑音特徴ベクトルと合成した音声特徴ベクトルを用いて認識する方法（例えば、特許文献５）、子音区間では抑制量低減する方法（例えば、特許文献６）、定常雑音時のみ学習する方法（例えば、特許文献７）、認識結果が正しいときのみ学習を行う方法（例えば、特許文献８）、学習するインパルス応答を定常非定常で切り替える方法（例えば、特許文献９）も提案されている。
【０００９】
しかしいずれの方法も、学習をいかに精度よく行うかということに着目した方法であり、雑音抑圧の性能つまり歪みを生じさせずに、Ｓ／Ｎを改善することができているかについては、関知していない。一方、この点に着目しているのは、音声の歪みを低減するための騒音目標値を設置して雑音抑圧量を制御する方法（例えば、特許文献１０）である。
【００１０】
【特許文献１】
特開２００１−１００７８２号公報
【特許文献２】
特開２０００−３３０５９７号公報
【特許文献３】
特開平１０−９７２７８号公報
【特許文献４】
特開平６−９５６９３号公報
【特許文献５】
特開平６−２８９８９１号公報
【特許文献６】
特開平８−２２１０９４号公報
【特許文献７】
特開平１０−３３４２８６号公報
【特許文献８】
特開平１１−３２７５９３号公報
【特許文献９】
特開平１０−８３９９４６号公報
【特許文献１０】
特開２００２−１４０１００号公報
【００１１】
【発明が解決しようとする課題】
しかしこの方法は、雑音性能の向上にしか目を向けておらず、認識性能と関連があるのかどうかを直接調べているわけではない。
【００１２】
そこで、本発明は、雑音抑圧された音声に対して音声認識した結果得られた、尤度、特徴量ベクトル、音素列などの認識情報を直接もしくは間接的に反映させた抑圧量判定係数を算出し、この抑圧量判定係数を指標として利用することにより、音声の歪みを低減するための雑音抑圧量を制御する。雑音抑圧制御の性能の向上により、より歪みの少ない雑音抑圧が可能となることから、認識率が向上することを目的とする。
【００１３】
【課題を解決するための手段】
この課題を解決するために、本発明の音声認識装置は、入力音声に対して雑音抑圧を行うためのスペクトル変換手段と、前記スペクトル変換手段が出力する信号よりＳ／Ｎを推定するＳ／Ｎ推定手段と、雑音抑圧演算に用いられる雑音抑圧係数をテーブルとして保持している雑音抑圧係数保持手段と、前記雑音抑圧係数保持手段が保持しているテーブルや前記Ｓ／Ｎ推定手段が推定したＳ／Ｎより雑音抑圧量を推定し、前記雑音抑圧係数を算出する抑圧量推定手段と、前記スペクトル変換手段が出力する信号から前記抑圧量推定手段が算出した雑音抑圧係数に基づいて雑音抑圧演算を行う雑音抑圧手段と、前記雑音抑圧手段より出力される信号から認識尤度を求めるための特徴量ベクトルを抽出する特徴量抽出手段と、学習データである音響モデルを保存する音響モデル保存手段と、時系列に前記音響モデル保存手段が保持する音響モデルと前記特徴量抽出手段が出力する特徴量ベクトルを比較して尤度を算出する尤度算出手段と、音声認識用の単語辞書または文法等の認識用ルールを保存する文法辞書保存手段と、前記認識用ルールに従って前記尤度より最尤パスを求める最尤探索手段と、前記最尤探索手段の結果を認識結果として表示提示する結果表示提示手段と、前記尤度算出手段および前記最尤探索手段が出力する認識結果より、抑圧量の判定を行う指標となる抑圧量判定係数を求める抑圧量判定係数算出手段とを備え、前記抑圧量推定手段は、前記抑圧量判定係数算出手段が算出する抑圧量判定係数も利用して雑音抑圧量を推定し、雑音抑圧係数を算出する。
【００１４】
この構成により、音声認識した結果得られた、尤度、特徴量ベクトル、音素列などの認識情報を直接もしくは間接的に反映させた抑圧量判定係数を算出し、この抑圧量判定係数を指標として利用することにより、音声の歪みを低減するように雑音抑圧量を制御することができ、その結果より良好な音声認識結果を得ることが可能となる。
【００１５】
また、本発明の音声認識装置は、入力音声に対して雑音抑圧を行うためのスペクトル変換手段と、前記スペクトル変換手段が出力する信号よりＳ／Ｎを推定するＳ／Ｎ推定手段と、雑音抑圧演算に用いられる雑音抑圧係数をテーブルとして保持している雑音抑圧係数保持手段と、前記雑音抑圧係数保持手段が保持しているテーブルや前記Ｓ／Ｎ推定手段が推定したＳ／Ｎより雑音抑圧量を推定し、前記雑音抑圧係数を算出する抑圧量推定手段と、前記スペクトル変換手段が出力する信号から前記抑圧量推定手段が算出した雑音抑圧係数に基づいて雑音抑圧演算を行う雑音抑圧手段と、前記雑音抑圧手段より出力される信号から認識尤度を求めるための特徴量ベクトルを抽出する特徴量抽出手段と、学習データである音響モデルを保存する音響モデル保存手段と、時系列に前記音響モデル保存手段が保持する音響モデルと前記特徴量抽出手段が出力する特徴量ベクトルを比較して尤度を算出する尤度算出手段と、音声認識用の単語辞書または文法等の認識用ルールを保存する文法辞書保存手段と、前記認識用ルールに従って前記尤度より最尤パスを求める最尤探索手段と、前記最尤探索手段の結果を認識結果として表示提示する結果表示提示手段と、結果表示提示手段が表示提示する認識結果に対する話者の判断を入力する話者入力情報取得手段と、前記尤度算出手段および前記最尤探索手段より出力される認識結果、および、前記話者入力情報取得手段より出力される前記話者の判断より、抑圧量の判定を行う指標となる抑圧量判定係数を求める抑圧量判定係数算出手段とを備え、前記抑圧量推定手段は、前記抑圧量判定係数算出手段が算出する抑圧量判定係数も利用して雑音抑圧量を推定し、雑音抑圧係数を算出する。
【００１６】
この構成により、話者の入力した情報や音声認識した結果得られた、尤度、特徴量ベクトル、音素列などの認識情報を直接もしくは間接的に反映させた抑圧量判定係数を算出し、この抑圧量判定係数を指標として利用することにより、音声の歪みを低減するように雑音抑圧量を制御することができ、その結果より良好な音声認識結果を得ることが可能となる。
【００１７】
また、本発明の音声認識方法は、入力された音声に対して、雑音を抑圧させる雑音抑圧係数を算出し、前記雑音抑圧係数を用いて、前記入力された音声の雑音を抑圧し、雑音を抑制された音声に対して音声認識を行う音声認識方法であって、前記音声認識の結果得られる尤度、特徴量レベル、音素列などの認識情報を、前記抑圧量判定係数を算出するために用いることを特徴とし、音声認識した結果得られた、尤度、特徴量ベクトル、音素列などの認識情報を直接もしくは間接的に反映させた抑圧量判定係数を算出し、この抑圧量判定係数を指標として利用することにより、音声の歪みを低減するように雑音抑圧量を制御することができ、その結果より良好な音声認識結果を得ることが可能となる。
【００１８】
また、本発明の音声認識方法は、入力された音声に対して、雑音を抑圧させる雑音抑圧係数を算出し、前記雑音抑圧係数を用いて、前記入力された音声の雑音を抑圧し、雑音を抑制された音声に対して音声認識を行う音声認識方法であって、前記音声認識の結果得られる尤度、特徴量レベル、音素列などの認識情報と、話者に前記音声認識の結果に対する前記話者の判断を用いて抑圧量判定係数を算出することを特徴とし、話者の入力した情報や音声認識した結果得られた、尤度、特徴量ベクトル、音素列などの認識情報を直接もしくは間接的に反映させた抑圧量判定係数を算出し、この抑圧量判定係数を指標として利用することにより、音声の歪みを低減するように雑音抑圧量を制御することができ、その結果より良好な音声認識結果を得ることが可能となる。
【００１９】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を用いて説明する。
【００２０】
（実施の形態１）
図１は本発明の音声認識装置の構成を示すブロック図である。図１において、図６、図７に示す従来の音声認識装置と同様に機能するものについては、同一の符号を付して説明を省略する。図１において、１０７は尤度算出手段、１０９は最尤探索手段、２０は文法辞書保存手段、２１は抑圧量判定係数算出手段である。
【００２１】
以上のように構成された音声認識装置について、以下、その動作を述べる。
【００２２】
マイクなどの音声入力手段１０１から入力された入力信号Ｂ１を、スペクトル変換手段１１において、スペクトル表現に変換する。このスペクトル表現の入力信号Ｂ２から、Ｓ／Ｎ推定手段１２において、Ｓ／Ｎ推定を行い、推定Ｓ／Ｎ（Ｂ３）を出力する。抑圧量推定手段１３では、雑音抑圧係数保存手段１５内のテーブルから、推定Ｓ／Ｎ（Ｂ３）に基づいて読み出された雑音抑圧係数Ｂ４および入力音声に対する学習推定により、抑圧量を推定する。雑音抑圧手段１４では、抑圧量推定手段１３により推定された雑音抑圧係数Ｂ５に基づいて雑音抑圧演算を行い、雑音抑圧された入力信号Ｂ６として音声認識ユニットに出力する。ここで雑音抑圧アルゴリズムとして用いられるものは，前述のＳＳやＷｉｅｎｅｒＦｉｌｔｅｒの他、ＶＱ（ＶｅｃｔｏｒＱｕａｎｔｉｚａｔｉｏｎ）を用いた雑音除去等広範囲の雑音除去法に用いることができる。
【００２３】
雑音抑圧ユニットから送られる雑音抑圧後の信号Ｂ６に対して、特徴量抽出手段１６において、音響分析を行うことにより、例えば、ＬＰＣケプストラムやＭＦＣＣ（メル周波数ケプストラム係数）などの認識計算に最適な特徴量ベクトルＢ７に変換する。尤度算出手段１０７では、特徴量抽出手段１６で求めた特徴量ベクトルＢ７と音響モデル保存手段１８が保存している音響モデルより、複数の音素片からなるサブワードの単位で、特徴量間の平均値や分散値より算出する距離や、ＨＭＭ（ｈｉｄｄｅｎＭａｌｃｏｖＭｏｄｅｌ）による遷移確率を尤度Ｂ８とする。最尤探索手段１０９は、ＤＰマッチングやビタビサーチなどを文法辞書保存手段２０が保存している文法や単語辞書に従って、尤度算出手段１０７で得られた尤度列より、最尤となるパスを探索し、これを認識結果Ｂ９とする。
【００２４】
さらに、雑音抑圧された音声に対して音声認識した結果得られた尤度Ｂ８、特徴量ベクトルＢ７、音素列などの認識情報Ｂ１０をもとに、抑圧量判定係数算出手段２１において抑圧量判定係数Ｂ１１を算出し、この抑圧量判定係数Ｂ１１を利用して抑圧量推定手段１３において次回の雑音抑圧量を制御する。
【００２５】
次に、認識情報Ｂ１０から、抑圧量判定係数Ｂ１１を求める方法について述べる。認識の結果得られる、区間情報あるいは最尤音素片情報をもとに、Ｚ＝Ｖ：母音、子音、ノイズ等のグループに分けたとき、認識結果に基づいて信号Ｘのフレームで音素片ｉの尤度をｄ（Ｘ＝Ｚ｜ｉ＝Ｚ）として表すものとする。また、ｄを雑音抑圧前、ｄ＾を雑音抑圧後の尤度とする。また、Ｘ＝Ｚとは、認識の結果グループごとに、グループに属する音素片の尤度の合計を求めて、グループ間で比較し、最尤条件となるグループをその区間であるとみなすことによって、決定されるものとする。
【００２６】
このとき、雑音抑圧を行うことにより、全般的にＳ／Ｎが改善されるため、母音とノイズ区間での尤度関係をみると、（数２）、（数３）が成り立つ。
【００２７】
【数２】

【００２８】
【数３】

【００２９】
母音区間でのノイズモデルの尤度は、雑音抑圧前後で改善し、さらにノイズ区間での母音素片での尤度は、雑音抑圧前後で低下する。（数２）、（数３）の関係式を用いて、（数２）、または、（数３）、または（数２）と（数３）で得られる積などの値を抑圧量判定係数として用いることができる。
【００３０】
また、抑圧量判定係数として、尤度をそのまま用いること、音声・ノイズ区間の情報や、母音・子音区間情報を用いること、あるいは、特徴量例えばケプストラムの差分などを用いることもできる。特に、高域低域全域など周波数ごとに求めた特徴量を利用すること、あるいは雑音抑圧前後での特徴量の差分を利用することもできる。そして、抑圧量推定手段１３では、図２、図３、図４のいずれかの手段に従って、抑圧量推定を行うことができる。
【００３１】
図２の方法は、雑音抑圧（図２のステップ２０１）によって得られた音声を認識（ステップ２０２）することによって、抑圧量判定係数が正の場合（ステップ２０３）には、その区間の抑圧が仮説どおり実行されているとして、抑圧係数推定が正しい方向に働いていると判断し、そのフレームでの推定値を現在のまま維持あるいは更新する（ステップ２０４）。図３では、逆に、抑圧量判定係数が負（γ＞０）の値をとる場合（図３のステップ３０３）には、抑圧係数の推定が全くうまくいっていないものとして、抑圧係数の学習または推定をリセットする（ステップ３０４）。図４は、図２と図３の操作を同時に行い、さらに、値が正の場合には、再度雑音抑圧（ステップ４０５）と認識（ステップ４０６）を行い抑圧量判定を行うことで、より正しい方向に進めることが可能である。図２、図３、図４の点線は次回の抑圧時に係数が反映されることを示している。
【００３２】
なお、上記の制御方法以外にも、例えばスペクトラムサブトラクションのパラメータ制御αの値を制御するにあたり、従来のＳ／Ｎのほかに、認識結果から得られるより正確な音声区間情報や、抑圧量判定係数などに応じて定めることなどができる。
【００３３】
また、上記２つの入力信号を用いて結果情報を求める前に、雑音抑圧前後で相互相関係数をとることによって、フレーミングを調整したうえで、尤度を求めてもよい。
【００３４】
また、上記音素片ごとのグルーピングに関して、母音１（／ａ／，／ｏ／）、母音２（／ｅ／，／ｕ／，／ｗ／）、子音１（／ｉ／，／ｊ／）、子音２（／ｋ／，／ｐ／，／ｃ／，／ｔ／）、子音３（／ｍ／，／ｎ／）、その他の子音、学習時のノイズ、などと音素のもつ特徴がもともと近い音素をまとめるようにグルーピングを定めて、雑音抑圧係数を求めてもよい。
【００３５】
また、（数２）、あるいは（数３）で求める区間については、１フレーム単位、音素区間単位、音声信号中の母音区間またはノイズ区間単位で、計算してもよい。また、その際、それぞれの和、平均などを抑圧係数としてもよい。
【００３６】
このように音声認識結果を加工して雑音抑圧量判定係数とすることで、認識結果向上と同期して、雑音抑圧性能を向上させることが可能となる。
【００３７】
（実施の形態２）
図５は本実施の形態における音声認識装置の構成を示すブロック図である。図５において、図１の音声認識装置と同様に機能するものについては同一の符号を付して、説明を省略する。図５において、２２は話者が情報を入力することが可能な話者入力情報取得手段である。
【００３８】
以上のように構成された音声認識装置について、以下、その動作を述べる。
【００３９】
本実施の形態では、実施の形態１で説明した内容に加えて、話者入力情報取得手段２２により、話者が認識結果に対する表示を見てあるいは聞くことにより、正解かどうかを入力することができる。あるいは、話者入力情報取得手段２２により、雑音抑圧された音声を提示し、その音質がどうか判断を求める質問を投げかけた回答を取得することにより、雑音抑圧効果としての判断、あるいは認識結果の判断を話者に行わせることによって、その性能の確実性を増すことが可能である。この情報も利用して、雑音抑圧判定係数の算出を行い、雑音抑圧係数に反映させることで、より精度の高い雑音抑圧係数推定とその制御を行うことが可能となる。
【００４０】
【発明の効果】
以上のように本発明によれば、雑音抑圧制御の性能の向上により、より歪みの少ない雑音抑圧が可能となることから、認識率が向上するという顕著な効果が得られる。
【図面の簡単な説明】
【図１】本発明の実施の形態１における音声認識装置の構成を示すブロック図
【図２】抑圧量推定手段の動作を示すフローチャート
【図３】抑圧量推定手段の動作を示すフローチャート
【図４】抑圧量推定手段の動作を示すフローチャート
【図５】本発明の実施の形態２における音声認識装置の構成を示すブロック図
【図６】従来の音声認識装置を示すブロック図
【図７】従来の雑音抑圧を行う音声認識装置を示すブロック図
【符号の説明】
１１スペクトル変換手段
１２Ｓ／Ｎ推定手段
１３抑圧量推定手段
１４雑音抑圧手段
１５雑音抑圧係数保存手段
１６特徴量抽出手段
１８音響モデル保存手段
２０文法辞書保存手段
２１抑圧量判定係数算出手段
２２話者入力情報取得手段
１０１音声入力手段
１０２結果表示提示手段
１０７尤度算出手段
１０９最尤探索手段

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech recognition apparatus that applies a noise suppression method to an input signal in order to obtain a high recognition rate even for speech with a low S/N.
[0002]
[Prior art]
FIG. 6 shows a conventional general speech recognition device (for example, see Patent Document 1). An input signal B1 input from a voice input means 101 such as a microphone is subjected to acoustic analysis in the feature amount extraction means 16 and converted into a feature amount vector B7 suited to the recognition calculation, such as an LPC (Linear Predictive Coefficient) cepstrum or MFCC (mel-frequency cepstrum coefficients). From the feature amount vector B7 obtained by the feature amount extraction means 16 and the acoustic model stored in the acoustic model storage means 18, the likelihood calculation means 17 obtains a likelihood B8 in units of sub-words composed of a plurality of phoneme segments, either as a distance calculated from the means and variances of the feature amounts or as a transition probability given by an HMM (Hidden Markov Model). The maximum likelihood search means 19 searches the likelihood sequence obtained by the likelihood calculation means 17 for the maximum likelihood path, using DP matching, Viterbi search, or the like in accordance with the grammar and word dictionary of the grammar dictionary storage means 20, which stores the recognition rules, and displays and presents this path on the result display/presentation means 102 as a recognition result B9.
[0003]
Algorithms used against noise within the speech recognition process include CME/CMS (Cepstral Mean Equalization/Subtraction), which is effective against multiplicative distortion, and the Jacobian matrix approximation method, which is effective against both additive and multiplicative distortion. The former is regarded as a very simple and highly effective method, because it uses the cepstrum, which is widely used as a feature amount for speech recognition, as it is and requires only addition and subtraction. However, because the feature amount is extracted using a window function, it cannot remove the influence of reflections or reverberation with a long reverberation time. The latter was proposed as a way of simplifying the HMM composition method, but has drawbacks such as a far larger amount of calculation than SS (Spectral Subtraction). Therefore, to further improve noise robustness, noise suppression processing such as the SS method in the spectral domain is performed in combination, particularly as a countermeasure against additive noise (WO98/39946).
[0004]
FIG. 7 shows a speech recognition device using a conventional general noise suppression method (for example, see Patent Document 2). The spectrum conversion means 11 converts the input signal B1, input as in FIG. 6, into a spectral representation. The S/N estimation means 12 estimates the S/N from this spectral-representation input signal B2. The suppression amount estimation means 13 estimates the suppression amount from a coefficient B4, read from the table in the noise suppression coefficient storage means 15 on the basis of the estimated S/N, and from learning-based estimation on the input speech. The noise suppression means 14 performs a noise suppression operation based on the suppression coefficient B5 estimated by the suppression amount estimation means 13, and outputs the noise-suppressed input signal B6 to the speech recognition unit. The noise suppression algorithm used here may be, for example, the above-mentioned SS or a Wiener filter. In either case, the estimated noise spectrum is generally multiplied by a scaling variable α in order to adjust the amount of noise suppression. The function of SS is expressed by (Equation 1).
[0005]
(Equation 1)

[0006]
In (Equation 1), S(w) is the power at frequency w after noise suppression, X(w) is the power of the observed input signal at frequency w, N(w) is the frequency spectrum power of the estimated noise section, α is a scaling variable, and β is a flooring constant set so that no more than necessary is subtracted. Further, max in (Equation 1) is a function that takes the larger of two values separated by a comma.
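The image for (Equation 1) is not reproduced in this text. As a minimal sketch only, the following Python function applies one commonly used form of spectral subtraction that is consistent with the definitions above, S(w) = max(X(w) − αN(w), βX(w)). The exact floor term in the original equation is an assumption here (some formulations floor against βN(w) instead), and the function name and default values are hypothetical.

```python
def spectral_subtract(x_power, n_power, alpha=1.0, beta=0.01):
    """Subtract a scaled noise estimate from each frequency bin, with a floor.

    x_power: power of the observed input signal per frequency bin, X(w)
    n_power: estimated noise power per frequency bin, N(w)
    alpha:   scaling variable applied to the noise estimate
    beta:    flooring constant that limits over-subtraction

    Assumed form (not confirmed by the source):
        S(w) = max(X(w) - alpha * N(w), beta * X(w))
    """
    if len(x_power) != len(n_power):
        raise ValueError("x_power and n_power must have the same length")
    return [max(x - alpha * n, beta * x) for x, n in zip(x_power, n_power)]


# Example: the second bin would go negative, so the floor takes over.
suppressed = spectral_subtract([10.0, 2.0], [4.0, 4.0], alpha=1.0, beta=0.1)
```

A larger α removes more noise but pushes more bins onto the floor, which is the over-subtraction distortion that the following paragraphs are concerned with.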
[0007]
For an input signal that varies with, among other things, the relationship between the incoming noise and speech, noise suppression using SS depends on how accurately the noise spectrum can be estimated and on how the settings of α and β are controlled. These two factors determine whether over-subtraction, that is, a suppressed speech signal that adversely affects recognition, can be avoided.
[0008]
To solve this problem, a method has been proposed in which the residual power spectrum left after noise suppression is calculated to train a noise HMM, which is then combined with clean speech data to rebuild the acoustic model (for example, Patent Document 3). Regarding estimation of the noise section, the following have also been proposed: a method of updating the filter by a small amount in non-stationary parts and by a large amount in stationary parts (for example, Patent Document 4); a method of recognizing with a speech feature vector combined with a noise feature vector estimated according to the S/N (for example, Patent Document 5); a method of reducing the suppression amount in consonant sections (for example, Patent Document 6); a method of learning only during stationary noise (for example, Patent Document 7); a method of learning only when the recognition result is correct (for example, Patent Document 8); and a method of switching the impulse response to be learned between stationary and non-stationary conditions (for example, Patent Document 9).
[0009]
However, all of these methods focus on how accurately learning can be performed; none is concerned with the performance of the noise suppression itself, that is, whether the S/N is improved without introducing distortion. A method that does focus on this point is one that controls the noise suppression amount by setting a noise target value for reducing speech distortion (for example, Patent Document 10).
[0010]
[Patent Document 1]
JP 2001-100782 A
[Patent Document 2]
JP 2000-330597 A
[Patent Document 3]
JP H10-97278 A
[Patent Document 4]
JP H6-95693 A
[Patent Document 5]
JP H6-289891 A
[Patent Document 6]
JP H8-221094 A
[Patent Document 7]
JP H10-334286 A
[Patent Document 8]
JP H11-327593 A
[Patent Document 9]
JP H10-839946 A
[Patent Document 10]
JP 2002-140100 A
[0011]
[Problems to be solved by the invention]
However, this method only focuses on improving the noise performance, and does not directly examine whether it is related to the recognition performance.
[0012]
Therefore, the present invention calculates a suppression amount determination coefficient that directly or indirectly reflects recognition information such as a likelihood, a feature amount vector, and a phoneme sequence obtained as a result of speech recognition of a noise-suppressed speech. Then, by using the suppression amount determination coefficient as an index, the amount of noise suppression for reducing voice distortion is controlled. An object of the present invention is to improve the recognition rate because noise suppression with less distortion can be performed by improving the performance of noise suppression control.
[0013]
[Means for Solving the Problems]
To solve this problem, a speech recognition apparatus according to the present invention comprises: spectrum conversion means for performing noise suppression on input speech; S/N estimation means for estimating an S/N from the signal output by the spectrum conversion means; noise suppression coefficient holding means for holding, as a table, noise suppression coefficients used in the noise suppression operation; suppression amount estimation means for estimating a noise suppression amount from the table held by the noise suppression coefficient holding means and the S/N estimated by the S/N estimation means, and calculating the noise suppression coefficient; noise suppression means for performing the noise suppression operation on the signal output by the spectrum conversion means, based on the noise suppression coefficient calculated by the suppression amount estimation means; feature amount extraction means for extracting, from the signal output by the noise suppression means, a feature amount vector used to obtain a recognition likelihood; acoustic model storage means for storing an acoustic model, which is learning data; likelihood calculation means for calculating a likelihood by comparing, in time series, the acoustic model held by the acoustic model storage means with the feature amount vector output by the feature amount extraction means; grammar dictionary storage means for storing recognition rules such as a word dictionary or grammar for speech recognition; maximum likelihood search means for obtaining a maximum likelihood path from the likelihood in accordance with the recognition rules; result display/presentation means for displaying and presenting the result of the maximum likelihood search means as a recognition result; and suppression amount determination coefficient calculation means for obtaining, from the recognition results output by the likelihood calculation means and the maximum likelihood search means, a suppression amount determination coefficient that serves as an index for judging the suppression amount. The suppression amount estimation means also uses the suppression amount determination coefficient calculated by the suppression amount determination coefficient calculation means to estimate the noise suppression amount and calculate the noise suppression coefficient.
[0014]
With this configuration, a suppression amount determination coefficient that directly or indirectly reflects recognition information such as a likelihood, a feature amount vector, and a phoneme string obtained as a result of speech recognition is calculated, and the suppression amount determination coefficient is used as an index. By using this, it is possible to control the amount of noise suppression so as to reduce the distortion of the voice, and it is possible to obtain a better voice recognition result as a result.
[0015]
Also, a speech recognition apparatus according to the present invention comprises: spectrum conversion means for performing noise suppression on input speech; S/N estimation means for estimating an S/N from the signal output by the spectrum conversion means; noise suppression coefficient holding means for holding, as a table, noise suppression coefficients used in the noise suppression operation; suppression amount estimation means for estimating a noise suppression amount from the table held by the noise suppression coefficient holding means and the S/N estimated by the S/N estimation means, and calculating the noise suppression coefficient; noise suppression means for performing the noise suppression operation on the signal output by the spectrum conversion means, based on the noise suppression coefficient calculated by the suppression amount estimation means; feature amount extraction means for extracting, from the signal output by the noise suppression means, a feature amount vector used to obtain a recognition likelihood; acoustic model storage means for storing an acoustic model, which is learning data; likelihood calculation means for calculating a likelihood by comparing, in time series, the acoustic model held by the acoustic model storage means with the feature amount vector output by the feature amount extraction means; grammar dictionary storage means for storing recognition rules such as a word dictionary or grammar for speech recognition; maximum likelihood search means for obtaining a maximum likelihood path from the likelihood in accordance with the recognition rules; result display/presentation means for displaying and presenting the result of the maximum likelihood search means as a recognition result; speaker input information acquisition means for inputting the speaker's judgment on the recognition result displayed and presented by the result display/presentation means; and suppression amount determination coefficient calculation means for obtaining, from the recognition results output by the likelihood calculation means and the maximum likelihood search means and from the speaker's judgment output by the speaker input information acquisition means, a suppression amount determination coefficient that serves as an index for judging the suppression amount. The suppression amount estimation means also uses the suppression amount determination coefficient calculated by the suppression amount determination coefficient calculation means to estimate the noise suppression amount and calculate the noise suppression coefficient.
[0016]
With this configuration, a suppression amount determination coefficient is calculated that directly or indirectly reflects the information input by the speaker and the recognition information, such as the likelihood, feature amount vector, and phoneme sequence, obtained as a result of speech recognition. By using this suppression amount determination coefficient as an index, the noise suppression amount can be controlled so as to reduce speech distortion, and a better speech recognition result can be obtained as a result.
[0017]
Further, a speech recognition method according to the present invention calculates a noise suppression coefficient for suppressing noise in input speech, suppresses the noise of the input speech using the noise suppression coefficient, and performs speech recognition on the noise-suppressed speech, and is characterized in that recognition information obtained as a result of the speech recognition, such as a likelihood, feature amount level, and phoneme sequence, is used to calculate the suppression amount determination coefficient. A suppression amount determination coefficient that directly or indirectly reflects recognition information such as the likelihood, feature amount vector, and phoneme sequence obtained as a result of speech recognition is calculated and used as an index, so that the noise suppression amount can be controlled so as to reduce speech distortion, and a better speech recognition result can be obtained as a result.
[0018]
Further, a speech recognition method according to the present invention calculates a noise suppression coefficient for suppressing noise in input speech, suppresses the noise of the input speech using the noise suppression coefficient, and performs speech recognition on the noise-suppressed speech, and is characterized in that the suppression amount determination coefficient is calculated using recognition information obtained as a result of the speech recognition, such as a likelihood, feature amount level, and phoneme sequence, together with the speaker's judgment on the result of the speech recognition. A suppression amount determination coefficient that directly or indirectly reflects the information input by the speaker and recognition information such as the likelihood, feature amount vector, and phoneme sequence obtained as a result of speech recognition is calculated and used as an index, so that the noise suppression amount can be controlled so as to reduce speech distortion, and a better speech recognition result can be obtained as a result.
[0019]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0020]
(Embodiment 1)
FIG. 1 is a block diagram showing the configuration of the speech recognition device of the present invention. In FIG. 1, components that function in the same manner as the conventional voice recognition devices shown in FIGS. 6 and 7 are denoted by the same reference numerals, and description thereof is omitted. In FIG. 1, reference numeral 107 denotes a likelihood calculation unit, 109 denotes a maximum likelihood search unit, 20 denotes a grammar dictionary storage unit, and 21 denotes a suppression amount determination coefficient calculation unit.
[0021]
The operation of the speech recognition device configured as described above will be described below.
[0022]
An input signal B1 input from a voice input means 101 such as a microphone is converted into a spectral representation by the spectrum conversion means 11. The S/N estimation means 12 estimates the S/N from this spectral-representation input signal B2 and outputs an estimated S/N (B3). The suppression amount estimation means 13 estimates the suppression amount from the noise suppression coefficient B4, read from the table in the noise suppression coefficient storage means 15 on the basis of the estimated S/N (B3), and from learning-based estimation on the input speech. The noise suppression means 14 performs a noise suppression operation based on the noise suppression coefficient B5 estimated by the suppression amount estimation means 13, and outputs the noise-suppressed input signal B6 to the speech recognition unit. Besides the above-mentioned SS and Wiener filter, a wide range of noise removal methods, such as noise removal using VQ (Vector Quantization), can be used as the noise suppression algorithm here.
[0023]
The feature amount extraction means 16 performs acoustic analysis on the noise-suppressed signal B6 sent from the noise suppression unit and converts it into a feature amount vector B7 suited to the recognition calculation, such as an LPC cepstrum or MFCC (mel-frequency cepstrum coefficients). From the feature amount vector B7 obtained by the feature amount extraction means 16 and the acoustic model stored in the acoustic model storage means 18, the likelihood calculation means 107 obtains a likelihood B8 in units of sub-words composed of a plurality of phoneme segments, either as a distance calculated from the means and variances of the feature amounts or as a transition probability given by an HMM (Hidden Markov Model). The maximum likelihood search means 109 searches the likelihood sequence obtained by the likelihood calculation means 107 for the maximum likelihood path, using DP matching, Viterbi search, or the like in accordance with the grammar and word dictionary stored in the grammar dictionary storage means 20, and takes this path as the recognition result B9.
[0024]
Further, the suppression amount determination coefficient calculation means 21 calculates a suppression amount determination coefficient B11 from recognition information B10, such as the likelihood B8, the feature amount vector B7, and the phoneme sequence, obtained by recognizing the noise-suppressed speech, and the suppression amount estimation means 13 uses this suppression amount determination coefficient B11 to control the next noise suppression amount.
[0025]
Next, a method for obtaining the suppression amount determination coefficient B11 from the recognition information B10 will be described. When frames are divided into groups Z (for example, Z = V: vowel, consonant, noise, and so on) based on the section information or maximum-likelihood phoneme segment information obtained from recognition, the likelihood of phoneme segment i in a frame of signal X is written d(X = Z | i = Z) based on the recognition result. Let d denote the likelihood before noise suppression and d^ the likelihood after noise suppression. Here, X = Z is determined as follows: for each group, the likelihoods of the phoneme segments belonging to the group are summed; the sums are compared between groups; and the group satisfying the maximum likelihood condition is taken to be the group of that section.
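The group decision for X = Z described above can be sketched as follows. This is an illustrative sketch only: the function name, the group names, and the example scores are hypothetical, and it assumes non-negative likelihood scores for which a larger value means a better match (a distance-type likelihood would need the comparison reversed).

```python
def classify_frame(segment_likelihoods, groups):
    """Pick the group whose phoneme segments have the largest summed likelihood.

    segment_likelihoods: dict mapping phoneme segment label -> likelihood
                         for one frame (assumed non-negative, larger = better)
    groups:              dict mapping group name -> list of segment labels
    Returns (best_group, totals), where totals holds the per-group sums.
    """
    totals = {
        name: sum(segment_likelihoods.get(seg, 0.0) for seg in members)
        for name, members in groups.items()
    }
    best_group = max(totals, key=totals.get)
    return best_group, totals


# Hypothetical grouping and per-frame scores.
groups = {"vowel": ["a", "o"], "consonant": ["k", "t"], "noise": ["sil"]}
frame = {"a": 0.5, "o": 0.2, "k": 0.1, "t": 0.1, "sil": 0.3}
best_group, totals = classify_frame(frame, groups)
```

Summing within a group before comparing means that a frame is labelled by the group as a whole rather than by its single best-scoring segment.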
[0026]
At this time, by performing noise suppression, the S / N is generally improved. Therefore, looking at the likelihood relationship between a vowel and a noise section, (Equation 2) and (Equation 3) hold.
[0027]
(Equation 2)

[0028]
(Equation 3)

[0029]
The likelihood of the noise model in a vowel section improves from before to after noise suppression, and the likelihood of vowel segments in a noise section decreases from before to after noise suppression. Based on these relations, the value of (Equation 2), the value of (Equation 3), or a value such as the product of (Equation 2) and (Equation 3) can be used as the suppression amount determination coefficient.
[0030]
As the suppression amount determination coefficient, it is also possible to use the likelihood as it is, information on speech and noise sections, information on vowel and consonant sections, or a feature amount such as a cepstrum difference. In particular, feature amounts obtained per frequency range (high band, low band, full band, and so on) or the difference between feature amounts before and after noise suppression can be used. The suppression amount estimation means 13 can then estimate the suppression amount according to any of the procedures shown in FIG. 2, FIG. 3, and FIG. 4.
[0031]
In the method of FIG. 2, the speech obtained by noise suppression (step 201 in FIG. 2) is recognized (step 202). When the suppression amount determination coefficient is positive (step 203), suppression in that section is regarded as having been carried out as hypothesized, the suppression coefficient estimation is judged to be working in the right direction, and the estimate for that frame is either kept as it is or updated (step 204). In FIG. 3, conversely, when the suppression amount determination coefficient takes a negative value (γ > 0) (step 303 in FIG. 3), the estimation of the suppression coefficient is regarded as having failed entirely, and the learning or estimation of the suppression coefficient is reset (step 304). In FIG. 4, the operations of FIG. 2 and FIG. 3 are performed together and, when the value is positive, noise suppression (step 405) and recognition (step 406) are performed again and the suppression amount is judged again, which allows the estimate to move further in the right direction. The dotted lines in FIGS. 2, 3, and 4 indicate that the coefficient is reflected in the next suppression.
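The accept-or-reset logic of FIG. 2 and FIG. 3 can be sketched as a small state holder. This is a simplified illustration under stated assumptions, not the patented implementation: the class and attribute names are hypothetical, a single scalar stands in for the suppression coefficient, a coefficient of exactly zero is assumed to leave the estimate unchanged, and the repeated suppression and recognition pass of FIG. 4 (steps 405 and 406) is omitted.

```python
from dataclasses import dataclass


@dataclass
class SuppressionEstimator:
    """Holds the suppression coefficient that will be used for the next frame."""

    initial: float = 1.0  # value restored when learning is reset
    current: float = 1.0  # value applied at the next suppression

    def step(self, candidate: float, determination_coefficient: float) -> float:
        """Update the estimate from one recognition pass.

        candidate: suppression coefficient newly estimated for this frame
        determination_coefficient: suppression amount determination coefficient
        """
        if determination_coefficient > 0:
            # Estimation is judged to be heading the right way: accept it.
            self.current = candidate
        elif determination_coefficient < 0:
            # Estimation is judged to have failed: reset the learning.
            self.current = self.initial
        # Assumption: a coefficient of exactly zero keeps the current value.
        return self.current


estimator = SuppressionEstimator(initial=1.0, current=1.0)
estimator.step(candidate=1.5, determination_coefficient=0.2)   # accepted
estimator.step(candidate=2.0, determination_coefficient=-0.1)  # reset
```

Because the returned value is only applied at the next call, the sketch mirrors the dotted lines in the figures, where the coefficient takes effect at the next suppression.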
[0032]
In addition to the above control method, when controlling, for example, the value of the parameter α of spectral subtraction, the value can be determined according to the situation by using, in addition to the conventional S/N, the more accurate speech section information obtained from the recognition result and the suppression amount determination coefficient.
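As a rough illustration of this kind of control, the sketch below applies power-domain spectral subtraction with a subtraction factor α chosen from the S/N, a speech-section flag, and the suppression amount determination coefficient. The selection rule, the thresholds, and the names `choose_alpha` and `spectral_subtract` are hypothetical; the specification does not give concrete values.

```python
def choose_alpha(snr_db, is_speech, gamma):
    """Pick the subtraction factor alpha (hypothetical rule and thresholds)."""
    # Assumed rule: subtract more aggressively when the S/N is low.
    alpha = 2.0 if snr_db < 10.0 else 1.0
    if is_speech:
        # Assumed rule: suppress less inside a recognized speech section.
        alpha *= 0.5
    if gamma < 0:
        # Assumed rule: back off when the determination coefficient is negative.
        alpha *= 0.5
    return alpha


def spectral_subtract(power, noise_power, alpha, floor=0.0):
    """Subtract alpha times the noise power from each bin, clipping at `floor`."""
    return [max(p - alpha * n, floor) for p, n in zip(power, noise_power)]


alpha = choose_alpha(snr_db=5.0, is_speech=False, gamma=1.0)
cleaned = spectral_subtract([4.0, 1.0], [1.0, 1.0], alpha)
```

The clipping at `floor` keeps the estimated power from going negative, which is a common safeguard in spectral subtraction.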
[0033]
Before obtaining result information from the two input signals, the framing may be aligned by taking the cross-correlation between the signals before and after noise suppression, and the likelihood may then be obtained.
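One way such an alignment could be done is to search for the sample lag that maximizes the cross-correlation between the two signals. The sketch below uses an unnormalized cross-correlation over a small lag range; the function name `best_lag` and the search range `max_lag` are assumptions for illustration.

```python
def best_lag(before, after, max_lag):
    """Return the lag of `after` relative to `before` with the largest cross-correlation."""

    def correlation(lag):
        total = 0.0
        for i, x in enumerate(before):
            j = i + lag
            if 0 <= j < len(after):
                total += x * after[j]
        return total

    # Search symmetric lags; ties resolve to the smallest lag.
    return max(range(-max_lag, max_lag + 1), key=correlation)


before = [0, 0, 1, 2, 1, 0, 0, 0]
after = [0, 0, 0, 0, 1, 2, 1, 0]  # same pulse, delayed by two samples
lag = best_lag(before, after, max_lag=3)
```

The resulting lag could then be used to shift the frame boundaries of one signal before the likelihoods are compared.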
[0034]
In addition, regarding the grouping of phoneme segments, the groups may be determined so that phonemes with similar characteristics are put together, for example vowel 1 (/a/, /o/), vowel 2 (/e/, /u/, /w/), consonant 1 (/i/, /j/), consonant 2 (/k/, /p/, /c/, /t/), consonant 3 (/m/, /n/), other consonants, and noise during learning, and the noise suppression coefficient may then be obtained.
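The grouping listed above could be held as a simple lookup table, for instance as sketched below. The group names and members are taken from the text; the function name `group_of`, the `"noise"` label, and the fallback to "other consonants" for any unlisted phoneme are assumptions.

```python
PHONEME_GROUPS = {
    "vowel 1": ["a", "o"],
    "vowel 2": ["e", "u", "w"],
    "consonant 1": ["i", "j"],
    "consonant 2": ["k", "p", "c", "t"],
    "consonant 3": ["m", "n"],
}

# Invert the table once so that each phoneme maps directly to its group.
GROUP_BY_PHONEME = {
    phoneme: group
    for group, members in PHONEME_GROUPS.items()
    for phoneme in members
}


def group_of(label):
    """Return the group for a phoneme label, or for the hypothetical "noise" label."""
    if label == "noise":
        return "noise"
    # Assumption: anything not listed falls into "other consonants".
    return GROUP_BY_PHONEME.get(label, "other consonants")


groups = [group_of(p) for p in ["a", "k", "s", "noise"]]
```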
[0035]
In addition, the section over which (Equation 2) or (Equation 3) is evaluated may be one frame, one phoneme, or a vowel or noise section in the speech signal. In that case, the sum, the average, or the like over the section may be used as the suppression coefficient.
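The per-section sum or average mentioned above might be computed along the following lines. This is a sketch under the assumption that a per-frame value and a per-frame section label (for example a phoneme or "noise") are already available; the function name `aggregate_by_section` and its arguments are hypothetical.

```python
def aggregate_by_section(frame_values, frame_labels, mode="average"):
    """Combine per-frame values over runs of consecutive frames sharing a label."""
    sections = []
    for value, label in zip(frame_values, frame_labels):
        if sections and sections[-1][0] == label:
            # Same section as the previous frame: extend it.
            sections[-1][1].append(value)
        else:
            # Label changed: start a new section.
            sections.append((label, [value]))

    result = []
    for label, values in sections:
        total = sum(values)
        combined = total if mode == "sum" else total / len(values)
        result.append((label, combined))
    return result


averages = aggregate_by_section([1.0, 3.0, 2.0], ["a", "a", "noise"])
sums = aggregate_by_section([1.0, 3.0, 2.0], ["a", "a", "noise"], mode="sum")
```

Passing one distinct label per frame reduces this to the frame-by-frame case.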
[0036]
By processing the speech recognition result as described above and using the result as the noise suppression amount determination coefficient, it is possible to improve the noise suppression performance in synchronization with the improvement of the recognition result.
[0037]
(Embodiment 2)
FIG. 5 is a block diagram showing the configuration of the speech recognition device according to the present embodiment. In FIG. 5, components that function in the same way as in the speech recognition device of Embodiment 1 are given the same reference numerals, and their description is omitted. In FIG. 5, reference numeral 22 denotes a speaker input information acquisition means that allows a speaker to input information.
[0038]
The operation of the speech recognition device configured as described above will be described below.
[0039]
In the present embodiment, in addition to the contents described in Embodiment 1, the speaker input information acquisition means 22 allows the speaker to input the correct answer after seeing or hearing the presented recognition result. Alternatively, the noise-suppressed speech is presented to the speaker through the speaker input information acquisition means 22, and the speaker's response to a question about its sound quality is acquired. This makes the judgment of the noise suppression effect and the judgment of the recognition result more reliable. By using this information to calculate the suppression amount determination coefficient and reflecting it in the noise suppression coefficient, more accurate estimation and control of the noise suppression coefficient become possible.
[0040]
[Effect of the invention]
As described above, according to the present invention, noise suppression with less distortion can be achieved by improving the performance of noise suppression control, so that a remarkable effect of improving the recognition rate can be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing an operation of the suppression amount estimating means.
FIG. 3 is a flowchart showing an operation of the suppression amount estimating means.
FIG. 4 is a flowchart showing an operation of the suppression amount estimating means.
FIG. 5 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 6 is a block diagram showing a conventional speech recognition apparatus.
Block diagram showing a speech recognition device that performs noise suppression.
Reference Signs List
11 spectrum conversion means
12 S/N estimation means
13 suppression amount estimation means
14 noise suppression means
15 noise suppression coefficient storage means
16 feature amount extraction means
18 acoustic model storage means
20 grammar dictionary storage means
21 suppression amount determination coefficient calculation means
22 speaker input information acquisition means
101 voice input means
102 result display/presentation means
107 likelihood calculation means
109 maximum likelihood search means

Claims

A speech recognition apparatus comprising:
spectrum converting means for performing noise suppression on input speech;
S/N estimating means for estimating the S/N from the signal output by the spectrum converting means;
noise suppression coefficient holding means for holding, as a table, noise suppression coefficients used for the noise suppression calculation;
suppression amount estimating means for estimating a noise suppression amount from the table held by the noise suppression coefficient holding means and the S/N estimated by the S/N estimating means, and calculating a noise suppression coefficient;
noise suppression means for performing a noise suppression operation on the signal output by the spectrum converting means, based on the noise suppression coefficient calculated by the suppression amount estimating means;
feature amount extracting means for extracting, from the signal output by the noise suppression means, a feature amount vector for obtaining a recognition likelihood;
acoustic model storage means for storing an acoustic model as learning data;
likelihood calculating means for calculating a likelihood by comparing the acoustic model held by the acoustic model storage means with the feature amount vector output by the feature amount extracting means;
grammar dictionary storage means for storing recognition rules such as a word dictionary and grammar for speech recognition;
maximum likelihood search means for obtaining a maximum likelihood path from the likelihood in accordance with the recognition rules;
result display/presentation means for displaying and presenting the result of the maximum likelihood search means as a recognition result; and
suppression amount determination coefficient calculation means for obtaining, from the recognition results output by the likelihood calculating means and the maximum likelihood search means, a suppression amount determination coefficient serving as an index for determining the suppression amount,
wherein the suppression amount estimating means estimates the noise suppression amount by also using the suppression amount determination coefficient calculated by the suppression amount determination coefficient calculation means, and calculates the noise suppression coefficient.

A speech recognition apparatus comprising:
spectrum converting means for performing noise suppression on input speech;
S/N estimating means for estimating the S/N from the signal output by the spectrum converting means;
noise suppression coefficient holding means for holding, as a table, noise suppression coefficients used for the noise suppression calculation;
suppression amount estimating means for estimating a noise suppression amount from the table held by the noise suppression coefficient holding means and the S/N estimated by the S/N estimating means, and calculating a noise suppression coefficient;
noise suppression means for performing a noise suppression operation on the signal output by the spectrum converting means, based on the noise suppression coefficient calculated by the suppression amount estimating means;
feature amount extracting means for extracting, from the signal output by the noise suppression means, a feature amount vector for obtaining a recognition likelihood;
acoustic model storage means for storing an acoustic model as learning data;
likelihood calculating means for calculating a likelihood by comparing the acoustic model held by the acoustic model storage means with the feature amount vector output by the feature amount extracting means;
grammar dictionary storage means for storing recognition rules such as a word dictionary and grammar for speech recognition;
maximum likelihood search means for obtaining a maximum likelihood path from the likelihood in accordance with the recognition rules;
result display/presentation means for displaying and presenting the result of the maximum likelihood search means as a recognition result;
speaker input information acquiring means for inputting the speaker's judgment on the recognition result displayed and presented by the result display/presentation means; and
suppression amount determination coefficient calculation means for obtaining, from the recognition results output by the likelihood calculating means and the maximum likelihood search means and from the speaker's judgment output by the speaker input information acquiring means, a suppression amount determination coefficient serving as an index for determining the suppression amount,
wherein the suppression amount estimating means estimates the noise suppression amount by also using the suppression amount determination coefficient calculated by the suppression amount determination coefficient calculation means, and calculates the noise suppression coefficient.

A speech recognition method in which a noise suppression coefficient for suppressing noise is calculated for input speech, the noise of the input speech is suppressed using the noise suppression coefficient, and speech recognition is performed on the noise-suppressed speech,
wherein recognition information such as a likelihood, a feature amount level, and a phoneme sequence obtained as a result of the speech recognition is used to calculate a suppression amount determination coefficient.

A speech recognition method in which a noise suppression coefficient for suppressing noise is calculated for input speech, the noise of the input speech is suppressed using the noise suppression coefficient, and speech recognition is performed on the noise-suppressed speech,
wherein a suppression amount determination coefficient is calculated using recognition information such as a likelihood, a feature amount level, and a phoneme sequence obtained as a result of the speech recognition, together with the speaker's judgment on the result of the speech recognition.