JP3900691B2

JP3900691B2 - Noise suppression apparatus and speech recognition system using the apparatus

Info

Publication number: JP3900691B2
Application number: JP19317798A
Authority: JP
Inventors: 勇立野
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 1998-07-08
Filing date: 1998-07-08
Publication date: 2007-04-04
Anticipated expiration: 2018-07-08
Also published as: JP2000029500A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識等の音声信号処理の前処理として用いる雑音抑圧に関し、特に、認識対象となる音声信号と雑音信号とが混在した入力信号から雑音成分を除去する技術に関する。
【０００２】
【従来の技術及び発明が解決しようとする課題】
従来より、例えばカーナビゲーションシステムにおける目的地の設定などを音声によって入力できるようにする場合などに有効な音声認識装置が提案され、また実現されている。このような音声認識装置においては、入力音声を予め記憶されている複数の比較対象パターン候補と比較し、一致度合の高いものを認識結果とするのであるが、現在の認識技術ではその認識結果が完全に正確なものとは限らない。これは、静かな環境下にあってもそうであるため、周囲に雑音が発生するような環境下ではなおさらである。特に、上述したカーナビゲーションシステムなどの実際の使用環境を考慮すると、雑音がないことは想定しにくい。したがって、認識率の向上を実現する上では、音声認識装置への入力の前処理として、認識に必要な音声信号と雑音信号とが混在した入力信号から雑音成分を除去する雑音抑圧を行なうことが望ましい。
【０００３】
このような雑音抑圧を行なってから音声認識を行なうシステム構成として、例えば図６（ａ）のような音声認識システム２００が考えられている。つまり、音声用マイク２０１からは雑音が混入した音声信号が入力される。一方、雑音用マイク２０２からは雑音のみの雑音信号が入力される。音声用マイク２０１及び雑音用マイク２０２からの入力信号は雑音抑圧装置２０３へ入力され、雑音抑圧装置２０３で雑音抑圧された音声信号が音声認識装置２０４へ転送される。また、この場合、利用者がＰＴＴ（Push-To-Talk）スイッチ２０５を押しながらマイク２０１を介して音声を入力するようにされている。そして、雑音抑圧装置２０３での雑音抑圧は次のように行われる。
【０００４】
つまり、図６（ｂ）に示すように、ＰＴＴスイッチ２０５が押されると音声区間であるとして、雑音抑圧装置２０３は音声用マイク２０１及び雑音用マイク２０２からの入力信号を取り込む。しかし、音声用マイク２０１からの入力信号は「音声信号＋雑音信号」となる。したがって、雑音用マイク２０２から入力された「雑音信号」を、音声用マイク２０１からの「音声信号＋雑音信号」から差し引けば、雑音信号の抑圧された音声信号を抽出することができるというものである。なお、雑音の混入した音声信号から雑音信号を差し引く際には、それぞれの信号をフーリエ変換した周波数スペクトルやその周波数スペクトルの振幅である振幅スペクトルあるいは振幅スペクトルを２乗したパワースペクトルの形式で差し引くことが考えられる。
【０００５】
しかしながら、図６として上述したような手法では、雑音用マイク２０２が音声信号をひろわないように、音声用マイク２０１と雑音用マイク２０２とを所定距離だけ離した場所に設置することが必要となり、システム全体が煩雑となる。また、２本のマイクを設置するため、マイクを設置する場所によっては、発生する雑音の種類が異なる可能性がある。すなわち、音声用マイク２０１から音声信号と共に入力される雑音信号と雑音用マイク２０２から入力される雑音信号とが同一である保障はない。そのため、音声信号と雑音信号とが混在した入力信号から雑音成分のみを適切に除去することができない可能性があった。
【０００６】
ここで雑音の種類が異なると、適切な雑音成分の除去ができないことを説明する。図５は、種類の異なる雑音の周波数スペクトルを例示する説明図であるが、図５（ａ）では、１キロＨｚ、２キロＨｚ、４キロＨｚ付近の周波数成分のレベル変化率が大きくなっており、図５（ｂ）では、図５（ａ）と比べて、１キロＨｚ、２キロＨｚ、４キロＨｚ付近の周波数成分のレベル変化率がさらに大きくなっている。また、図５（ｃ）では、１．５キロＨｚ、３キロＨｚ、６キロＨｚ付近の周波数成分のレベル変化率が大きくなっており、図５（ｄ）では、０〜６キロＨｚの全ての周波数成分のレベル変化率が大きくなっている。なお、レベル変化率とはスペクトル波形における傾きの絶対値をいい、レベル変化率の大きな部分は、図中ではグラフの縦軸方向に突出した部分として示される。
【０００７】
このように種類の異なる雑音を周波数スペクトルとして見た場合、レベル変化率やそのレベル変化率の大きくなる周波数などが異なってくる。従って、音声信号と雑音信号とが混在する入力信号から雑音成分を差し引く際、種類の異なる雑音のスペクトルを差し引くと、却って音声信号のスペクトルに歪みを生じさせることになる。つまり、図６（ａ）に示すようなシステムでは、音声用マイク２０１から入力される雑音と雑音用マイク２０２から入力される雑音が同じ種類のものでなければ、適切な雑音抑圧ができないのである。
【０００８】
ところで、従来、１本のマイクを使用したシステムもあったが、この場合は、ＰＴＴスイッチのオン・オフを検出して雑音区間、音声区間を区別し、雑音区間において取り込んだ「雑音信号」を、音声区間において取り込んだ「音声信号＋雑音信号」から差し引いて、雑音信号の抑圧された音声信号を抽出する。しかし、この手法も、音声区間において混入した雑音を直接検知しているのではなく、音声区間の開始以前の雑音区間にて取り込んだ雑音信号を基に音声区間における雑音を推定し、雑音の混入した音声信号から、推定された雑音信号を差し引いているに過ぎない。
【０００９】
従って、例えば自動車内というように周囲の環境が時々刻々変化し、それに伴って発生する雑音の種類も変化するような環境下では、雑音区間において入力された雑音が音声区間において音声信号に混入した雑音と同じ種類のものである保障がなく、この場合も、適切に雑音信号を差し引くことができない可能性があった。
【００１０】
本発明は、上述した問題点を解決するためになされたものであり、音声信号と雑音信号とが混在した入力信号から雑音成分を適切に除去し、音声認識における認識率の向上に寄与することを目的とする。
【００１１】
【課題を解決するための手段及び発明の効果】
本発明の雑音抑圧装置では、例えばマイクロフォンなどを介して入力された入力信号を、フレーム分割手段が分割しフレーム信号として切り出し、スペクトル算出手段が、そのフレーム信号からスペクトルを算出する。ここでスペクトル算出手段が算出するスペクトルは、一例として、フレーム信号のフーリエ変換にて定義される周波数スペクトルであることが考えられる。但し、フーリエ変換にて定義されるものには限られず、例えばフーリエ級数展開、Ｚ変換、離散的フーリエ変換（ＤＦＴ）にて定義されるスペクトルを用いてもよい。また、上述した周波数スペクトルの振幅成分である振幅スペクトルを用いてもよいし、その振幅スペクトルを２乗して得たパワースペクトルを用いてもよい。
【００１２】
上述した入力信号は、利用者からの音声信号が入力される場合は、雑音の混入した音声信号であるし、利用者からの音声信号が入力されない場合は、雑音のみの信号である。ここで特に、本発明の雑音抑圧装置では、雑音スペクトル推定手段が、入力信号に基づいて算出したスペクトルに現れる雑音成分の特徴に基づいて、そのスペクトルに含まれる雑音のスペクトルを雑音スペクトルとして推定する。なお、雑音スペクトルは、繰り返し算出されるスペクトルにそれぞれ対応させて推定してもよいし、所定数のスペクトル毎に、それらスペクトルに共通するものとして推定してもよい。例えば繰り返し算出されるスペクトルに同じ種類の雑音のスペクトルが含まれている場合は、雑音のスペクトルの周波数成分がそれらスペクトルに共通して現出することが考えられるため、複数のスペクトルに基づいて雑音スペクトルを推定すれば、より正確な雑音スペクトルを推定できる可能性が高くなる。逆に、雑音の種類が時々刻々変化するような環境下では、相対的に早いタイミングで繰り返し雑音スペクトルを推定することが望ましい。
【００１３】
そして、本発明の雑音抑圧装置は、さらに、雑音信号に基づいて算出された雑音のスペクトルである予測雑音スペクトルを記憶する予測雑音スペクトル記憶手段と、予測雑音スペクトル記憶手段に記憶された予測雑音スペクトルの中から、雑音スペクトル推定手段によって推定された雑音スペクトルとの類似度合の高いものを特定する予測雑音スペクトル特定手段と、予測雑音スペクトル特定手段によって特定された予測雑音スペクトルを、スペクトル算出手段によって算出されたスペクトルから減算する減算手段とを備える。
このような方法以外に、例えば、入力信号に基づいて算出したスペクトルから雑音スペクトルを推定し、推定した雑音スペクトルを減算する構成にすることも考えられるが、音声信号のスペクトルから雑音スペクトルを推定する場合、真の雑音のスペクトルを推定することは困難である。そこで、雑音信号に基づいて算出された雑音のスペクトルを予測雑音スペクトルとして予測雑音スペクトル記憶手段に記憶しておき、減算手段は、この予測雑音スペクトルを減算するようにするのである。
このとき、予測雑音スペクトルを複数種類記憶しておき、音声信号のスペクトルに重畳している雑音のスペクトルに近いものを予測雑音スペクトル特定手段が特定する。予測雑音スペクトル特定手段は、予測雑音スペクトル記憶手段に記憶された予測雑音スペクトルの中で、上述した雑音スペクトル推定手段によって推定された雑音スペクトルとの類似度合が高いものを特定する。例えば、図４（ｂ）に示す雑音スペクトルとの類似度合が最も高い図４（ｃ）に示す予測雑音スペクトルが特定されるという具合である。
つまり、予測雑音スペクトルとして、音声信号に混入する可能性の高い雑音のスペクトルを記憶しておけば、音声信号と雑音信号の混在する入力信号から雑音成分のみを除去できる。その結果、音声認識率を飛躍的に向上させることができる。
【００１５】
ここで、音声区間の入力信号に基づいて算出したスペクトルに現れる雑音成分の特徴に基づき、雑音スペクトルを推定する具体的な手法を説明する。
例えば請求項２に示すように、雑音スペクトル推定手段は、入力信号に基づいて算出したスペクトルのレベル変化率が所定の閾値以上となる周波数を検出し、当該検出した周波数におけるスペクトル成分に基づいて雑音スペクトルを推定するよう構成することが考えられる。
【００１６】
これは、図５に例示したように雑音のスペクトルには特定の周波数成分にレベル変化率の大きな部分が現れる可能性が高いという事実に着目したものである。このような特定周波数成分のレベル変化率が大きな雑音のスペクトルが重畳した音声信号のスペクトルには、図３（ｂ）に示すように、特定の周波数成分にレベル変化率の大きな部分が現出する。そして、図３（ａ）に示すような雑音の混入していない理想的な音声信号のスペクトルと比較すると、このレベル変化率の大きな部分を差し引けば、理想的な音声信号のスペクトルに近づけられることが分かる。
【００１７】
従って、以下のようにして雑音スペクトルを推定することができる。
例えばスペクトル算出手段が周波数スペクトルを算出する場合、その周波数スペクトルは、時間関数であるフレーム信号のフーリエ変換にて定義され、周波数ｆの関数として表される。そのため、周波数ｆで微分することによってスペクトルのレベル変化率を求め、この変化率が所定の閾値以上となる周波数を検出し、そして、当該周波数におけるスペクトル成分に基づいて雑音スペクトルを推定する。ここでスペクトル成分に基づいて推定するとは、例えばそのスペクトル成分そのものを有するスペクトルを雑音スペクトルとして推定することも考えられるし、あるいは、検出された周波数以外の周波数におけるスペクトル成分を用いてそのスペクトル成分を逓倍補正し、その補正したスペクトル成分を有するスペクトルを雑音スペクトルとして推定することも考えられる。例えば図４（ａ）に示すような雑音の混入した音声信号のスペクトルがある場合にレベル変化率が所定の閾値を越える周波数が、１ｋＨｚ、２ｋＨｚ、４ｋＨｚ付近の周波数である場合には、１，２，４ｋＨｚ付近のスペクトル成分に基づくスペクトル成分を有する例えば図４（ｂ）に示すようなスペクトルを雑音スペクトルとして推定するという具合である。
【００１８】
このように、雑音のスペクトルは、特定の周波数成分のレベルの変化率が大きくなることが多く、雑音の混入した音声信号のスペクトルにレベルの変化率が大きくなる周波数成分を現出させるという点に着目すれば、音声信号に混入した雑音のスペクトルを推定することができる。
【００１９】
なお、上述したように、周波数成分のレベルの変化率だけによって、混入した雑音のスペクトルを推定することもできるが、上述した閾値の設定は困難である場合も考えられる。つまり、雑音のスペクトルのレベル変化率が音声のスペクトルのレベル変化率とかけ離れていればよいが、雑音のスペクトルと音声のスペクトルとのレベル変化率の差が小さい場合、それを判定するための閾値の設定は難しくなる。
【００２０】
そこで、請求項３に示すように、雑音スペクトル推定手段は、入力信号に基づいて算出したスペクトルのレベル変化率が第１の閾値以上となる周波数を検出すると共に、当該周波数の２ⁿ 倍（ｎは整数）の近傍の周波数で、当該周波数におけるレベル変化率が第１の閾値よりも小さな第２の閾値以上となっているものを検出し、レベル変化率が第１又は第２の閾値以上となっている周波数におけるスペクトル成分に基づいて雑音スペクトルを推定するよう構成することが考えられる。
【００２１】
これは、雑音のスペクトルのレベル変化率と周波数との関係に着目したものである。すなわち、雑音のスペクトル中にレベル変化率が大きくなる周波数があると、その周波数を２ⁿ 倍した周波数でもレベル変化率が大きくなることが多いという関係に着目したものである。例えば図４に示すように、１ｋＨｚ、その倍の２ｋＨｚ、さらにその倍の４ｋＨｚ付近でレベル変化率が大きくなるという具合である。この前提に立てば、相対的にレベル変化率の大きな周波数があった場合、その周波数の２ⁿ 倍の周波数でレベル変化率が大きくなっていれば、その２ⁿ 倍の周波数におけるレベル変化は、雑音のスペクトルに起因するものとみなしてよい。そこで、最初に第１の閾値以上となる周波数を検出し、次に第１の閾値では判定できない雑音のスペクトルに起因するレベル変化を、上述した周波数間の関係を用い、第１の閾値よりも小さな第２の閾値で判定する。そして、このようにして検出された周波数におけるスペクトル成分に基づいて雑音スペクトルを推定する。ここでスペクトル成分に基づいて雑音スペクトルを推定するというのは、請求項２と同様である。この場合、周波数間の関係を用いることによって、閾値の設定が簡単になると共に、より正確に雑音のスペクトルを推定することができる。
【００２２】
なお、ここで「ｎは整数」としたが、例えば図４に示す例では、０ｋＨｚ＜（周波数×２ⁿ ）＜６ｋＨｚとなるようなｎについて考えればよい。すなわち、周波数×２ⁿ が考慮すべき周波数帯域に入るようなｎに限定される。但し、ｎは負の整数であることも考えられる。つまり、最初に検出された周波数の１／２倍の周波数、１／４倍の周波数・・・も考慮するのである。
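The two-threshold rule of paragraphs 【００２０】 to 【００２２】 can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the level change rates are already available as one value per uniformly spaced frequency bin starting at 0 Hz (so doubling a frequency doubles the bin index), and the names `detect_noise_bins`, `t1`, `t2`, and `tol_bins` (how many bins count as "near" a 2ⁿ multiple) are hypothetical.

```python
def detect_noise_bins(rates, t1, t2, tol_bins=1):
    """Return the sorted bin indices flagged as noise-related.

    rates    -- level change rate per bin (bin 0 is 0 Hz, uniform spacing)
    t1       -- first (larger) threshold
    t2       -- second (smaller) threshold, applied only near 2**n multiples
    tol_bins -- assumed width of "near", in bins (not specified in the text)
    """
    n_bins = len(rates)
    # Step 1: seed bins whose rate reaches the first threshold.
    seeds = [k for k in range(1, n_bins) if rates[k] >= t1]
    flagged = set(seeds)
    for seed in seeds:
        # Step 2: list the 2**n multiples of the seed (n positive and
        # negative) that stay inside the analysed band.
        targets = []
        k = seed * 2.0
        while k < n_bins:
            targets.append(k)
            k *= 2.0
        k = seed / 2.0
        while k >= 1.0:
            targets.append(k)
            k /= 2.0
        # Step 3: near each multiple, the smaller threshold is enough.
        for target in targets:
            centre = round(target)
            first = max(centre - tol_bins, 1)
            last = min(centre + tol_bins, n_bins - 1)
            for j in range(first, last + 1):
                if rates[j] >= t2:
                    flagged.add(j)
    return sorted(flagged)


# 13 bins at 500 Hz spacing cover 0 to 6 kHz; bin 2 is 1 kHz.
rates = [0.0] * 13
rates[2] = 10.0   # strong change near 1 kHz (passes t1)
rates[4] = 4.0    # weaker change near 2 kHz (passes only t2)
rates[8] = 4.0    # weaker change near 4 kHz (passes only t2)
rates[6] = 4.0    # 3 kHz is not a 2**n multiple of 1 kHz
noise_bins = detect_noise_bins(rates, t1=8.0, t2=3.0)
```

In this example the 3 kHz bin is left out even though its rate equals that of the 2 kHz bin, which is the point of tying the second threshold to the 2ⁿ relationship.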
【００２７】
ところで、予測雑音スペクトルとして、音声信号に混入する可能性の高い雑音のスペクトルを記憶するため、入力信号として雑音のみの雑音信号が入力された場合に、この雑音信号に基づいてスペクトルを算出し、算出したスペクトルを予測雑音スペクトルとして記憶しておくようにすることが考えられる。例えば、本装置が自動車内に設置されることを前提とすれば、請求項４に示すような構成を採用することが考えられる。すなわち、請求項１〜３の構成に加え、さらに、車両状態を検出する車両状態検出手段と、入力信号に音声が含まれている音声区間と音声が含まれていない雑音区間とを判定する判定手段と、判定手段によって判定された雑音区間の入力信号に基づいて算出したスペクトルを予測雑音スペクトルとし、車両状態検出手段によって検出される各車両状態に対応させて記憶する予測雑音スペクトル記憶制御手段を備える構成とすることが考えられる。
【００２８】
車両状態検出手段は、例えば車載オーディオ機器の音量、車速、窓の開閉状態、道路状態、車両の振動状態といった車両状態を検出する。そして、入力信号に音声が含まれている音声区間であるか音声が含まれていない雑音区間であるかは、判定手段によって判定され、予測雑音スペクトル記憶制御手段が、雑音区間における入力信号に基づいて算出されたスペクトルである予測雑音スペクトルを、上述した車両状態に対応させて記憶する。
【００２９】
本発明では、実際に雑音区間において入力された入力信号からスペクトルを算出し、予測雑音スペクトルとして記憶するのであるが、ここで特に、車両状態に対応させて記憶することを特徴としている。この技術思想の前提となるのは、車両状態が変われば発生する雑音の種類が変わるという認識である。すなわち、自動車内を考えた場合、各車両状態に対応して異なる種類の雑音が、音声信号に混入すると考えられるため、各車両状態に対応させて予測雑音スペクトルを記憶しておけば、それら予測雑音スペクトルは、音声信号に混入する可能性のある雑音のスペクトルとなるのである。
【００３０】
なお、予測雑音スペクトル記憶手段は、ある車両状態に対応する予測雑音スペクトルを一度記憶した後は、その車両状態となった場合であっても予測雑音スペクトルを記憶しないように構成することもできるし、同じように窓を開けた状態であっても周囲の環境が街中であるのと郊外であるのとでは雑音の種類も変わってくることが考えられるため、ある車両状態に対応する予測雑音スペクトルを一度記憶した後であっても、記憶した時から所定時間が経過している場合には、算出された予測雑音スペクトルを改めて記憶するように構成してもよい。前者のような構成とすれば、各車両状態に対応する予測雑音スペクトルを一度記憶すれば、その後は記憶処理が実行されないため、処理負荷軽減の点で有効であるし、一方、後者のような構成とすれば、所定時間が経過した後に予測雑音スペクトルが更新されるため、比較的現在時点に近い過去の雑音信号に基づいて算出された予測雑音スペクトルが記憶される。従って、音声信号に混入する雑音のスペクトルに類似した予測雑音スペクトルが記憶される可能性が高くなり、雑音成分の除去が効果的に行われる可能性がある。
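The two storage policies described in paragraph 【００３０】 (store once per vehicle state, or store again after a predetermined time has elapsed) can be sketched with a small keyed store. Everything here is an illustrative assumption: the class name `PredictedNoiseStore`, the method `offer`, the use of a tuple as the vehicle state key, and the caller-supplied timestamp `now_s` are not taken from the patent.

```python
class PredictedNoiseStore:
    """Predicted noise spectra keyed by vehicle state (hypothetical sketch)."""

    def __init__(self, max_age_s=None):
        # max_age_s=None  -> store once per state and never refresh
        # max_age_s=value -> refresh once the stored entry is that old
        self.max_age_s = max_age_s
        self._entries = {}  # state -> (timestamp in seconds, spectrum)

    def offer(self, state, spectrum, now_s):
        """Store the spectrum if the state is new or stale; report whether stored."""
        entry = self._entries.get(state)
        if entry is not None:
            stored_at = entry[0]
            stale = (self.max_age_s is not None
                     and now_s - stored_at >= self.max_age_s)
            if not stale:
                return False
        self._entries[state] = (now_s, list(spectrum))
        return True

    def spectra(self):
        """Return a mapping of state -> stored spectrum."""
        return {state: spec for state, (_, spec) in self._entries.items()}


state = ("audio_low", "speed_40", "smooth", "paved", "window_open")
store = PredictedNoiseStore(max_age_s=60)
first = store.offer(state, [1.0, 2.0], now_s=0)     # new state: stored
second = store.offer(state, [3.0, 4.0], now_s=30)   # still fresh: skipped
third = store.offer(state, [3.0, 4.0], now_s=60)    # stale: refreshed
```

Passing `max_age_s=None` gives the first policy, which avoids repeated storage work; a finite value gives the second, which keeps the stored spectrum closer to the current environment.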
【００３１】
また、上述した判定手段は、入力信号に音声が含まれている音声区間であるか音声が含まれていない雑音区間であるかを判定するのであるが、これは入力信号のパワーに基づいて判定することが考えられる。また、音声を入力させる期間を発声者自身が指定するために設けられた入力期間指定手段によって指定された入力期間を音声区間として判定するようにしてもよい。この入力期間指定手段としては、例えばＰＴＴ（Push-To-Talk）スイッチなどが考えられる。つまり、利用者がＰＴＴスイッチを押しながら音声を入力すると、そのＰＴＴスイッチが押されている間に入力された音声を処理対象として受け付けるのである。
【００３２】
なお、これまでは雑音抑圧装置としての構成及びその作用効果について説明したが、上述した雑音抑圧装置と、該雑音抑圧装置からの出力を、予め記憶されている複数の比較対象パターン候補と比較して一致度合の高いものを認識結果とする音声認識装置と、を備えることを特徴とする音声認識システムとして実現することもできる。
【００３３】
これら音声認識システムとして実現した場合の効果については、雑音抑圧装置として実現した場合と同様であるので、ここでは省略する。
また、このような音声認識システムは、種々の適用先が考えられるが、例えばいわゆるカーナビゲーションシステム用として用いることが考えられる。この場合には、例えば経路設定のための目的地などが音声にて入力できれば非常に便利である。また、ナビゲーションシステムだけでなく、例えば音声認識システムを車載空調システム用として用いることも考えられる。この場合には、空調システムにおける空調状態関連指示を利用者が音声にて入力するために用いることとなる。
【００３４】
【発明の実施の形態】
図１は本発明の実施形態の音声認識システムの概略構成を示すブロック図である。本音声認識システムは、車載用であり、マイク３０を介して入力された音声に対して雑音抑圧を行なう雑音抑圧装置１０と、その雑音抑圧装置１０からの出力を、予め記憶されている複数の比較対象パターン候補と比較して一致度合の高いものを認識結果とする音声認識装置２０とを備えている。また、雑音抑圧装置１０には、利用者が音声を入力する場合に押下するＰＴＴ（Push-To-Talk）スイッチ４０が接続されている。さらに、車両状態を検出するためのオーディオ機器５１、速度センサ５２、加速度センサ５３、ナビゲーション装置５４及び窓開閉装置５５が接続されている。
【００３５】
図１に示すように、雑音抑圧装置１０は、音声入力部１１と、フレーム分割部１２と、フーリエ変換部１３と、雑音スペクトル推定部１４と、音声制御部１５と、減算部１６と、逆フーリエ変換部１７と、雑音記憶部１８とを備えている。以下各ブロックでの処理内容について説明する。
【００３６】
音声入力部１１は、マイク３０を介して入力されたアナログ音声信号を例えば１０ＫＨｚのサンプリング周波数でデジタル信号に変換し、フレーム分割部１２へ出力する。フレーム分割部１２は、音声入力部１１からの入力信号の区切りを判断し、例えば「とうきょうと」、「ちよだく」というような単語毎のフレームに切り出し、フーリエ変換部１３へ出力する。フーリエ変換部１３では、フレーム毎の時間関数の入力信号に対してフーリエ変換を行い、入力信号の周波数スペクトルを求める。この周波数スペクトルは、雑音スペクトル推定部１４及び減算部１６へ出力される。なお、以下、周波数スペクトルを単にスペクトルと記述する。
【００３７】
雑音スペクトル推定部１４には上述したＰＴＴスイッチ４０からの音声入力検出信号が入力されるようになっており、この音声入力検出信号を受け取ると、雑音スペクトル推定部１４は、フーリエ変換部１３からのスペクトルに基づき、そのスペクトルに含まれる雑音のスペクトルを推定する。そして、推定した雑音スペクトルを音声制御部１５へ出力する。
【００３８】
ここで雑音スペクトル推定部１４における雑音スペクトルの具体的な推定方法を説明する。なお、最初に、実際に測定した雑音信号に基づいて算出された雑音のスペクトルを示す図５を参照し、雑音のスペクトルの特徴を説明する。
図５に示すように、雑音の種類によって雑音のスペクトルの周波数成分は変わるのであるが、特に特定の周波数成分にレベル変化率の大きな部分が現出することが多い。レベル変化率とはスペクトル波形の傾きの絶対値であり、このレベル変化率が大きな部分は、図で言えば、スペクトル波形がグラフ縦軸方向に大きく突出した部分である。例えば図５（ａ）及び（ｂ）では、１ｋ，２ｋ，４ｋＨｚ付近でレベル変化率が大きくなっており、図５（ｃ）では、１．５ｋ，３ｋ，６ｋＨｚ付近でレベル変化率が大きくなっており、図５（ｄ）では、０〜６ｋＨｚの全体でレベル変化率が大きくなっている。
【００３９】
従って、雑音の混入した音声信号のスペクトルにもこのような雑音スペクトルの特徴が現れることが多い。例えば、図３（ａ）には、雑音の混入していない理想的な音声信号のスペクトルを示し、一方、図３（ｂ）には、雑音の混入した音声信号のスペクトルを示した。図３（ａ）と図３（ｂ）を比較すると分かるように、雑音の混入した音声信号のスペクトルには、特定の周波数成分にレベル変化率の大きな部分が現出している。
【００４０】
また、雑音のスペクトルにレベル変化率が大きくなる周波数があると、その周波数を２ⁿ 倍した周波数でもレベル変化率が大きくなることが多い。例えば図５（ａ）及び（ｂ）では、１ｋ，２ｋ，４ｋＨｚ付近でレベルの変化率が大きくなっており、図５（ｃ）では、１．５ｋ，３ｋ，６ｋ付近でレベル変化率が大きくなっている。
【００４１】
そこで、本実施形態では、雑音の混入した音声信号のスペクトルのレベル変化率と周波数とに基づいて雑音スペクトルを推定している。
具体的には、フーリエ変換部１３から出力される周波数ｆの関数であるスペクトルを周波数ｆで微分し、レベル変化率が第１の閾値以上となる周波数ｆ１を検出する。さらに、周波数ｆ１の２ⁿ 倍の周波数で、レベル変化率が第２の閾値以上となる周波数ｆ２を検出する。そして、レベル変化率が第１又は第２の閾値以上となっている周波数ｆ１，ｆ２におけるスペクトル成分を抽出し、検出された周波数ｆ１，ｆ２以外の周波数におけるスペクトル成分を用いて周波数ｆ１，ｆ２におけるスペクトル成分を補正し、その補正したスペクトル成分を有する雑音スペクトルを推定する。例えば、図４（ａ）に示した雑音の混入した音声信号のスペクトルから図４（ｂ）に示した雑音スペクトルが推定される。
【００４２】
なお、この雑音スペクトル推定部１４は、音声が入力されたことを示す音声入力検出信号を受け取っていない期間は、雑音スペクトルの推定処理を中止する。本実施形態においては、ＰＴＴ（Push-To-Talk）スイッチ４０が押されている場合にはこの音声入力検出信号が出力される。つまり、本音声認識システムでは、利用者がＰＴＴスイッチ４０を押しながらマイク３０を介して音声を入力するという使用方法である。そのため、ＰＴＴスイッチ４０が押されているということは利用者が音声を入力しようとする意志をもって操作したことであるので、その場合、実際には音声入力があるかないかを判断することなく、音声入力がされる期間（音声区間）であると捉えて処理しているのである。
【００４３】
ＰＴＴスイッチ４０の押下による音声入力検出信号が出力されない場合は、音声が入力されない期間（雑音区間）であると捉えて、フーリエ変換部１３からのスペクトルをそのまま音声制御部１５へ出力する。
次に音声制御部１５について説明する。音声制御部１５には、オーディオ機器５１からの音量、速度センサ５２からの車速、加速度センサ５３からの車両の振動状態、ナビゲーション装置５４からの道路状態（トンネル、砂利道など）、窓開閉装置５５からの窓の開閉状態が入力されるようになっており、音声制御部１５は、これら５つのデータに基づき車両状態を特定すると共に、以下説明する処理を行う。この音声制御部１５にも音声が入力されたことを示す音声入力検出信号が入力され、音声入力検出信号が入力されている場合（音声区間）と、音声入力検出信号が入力されていない場合（雑音区間）とで処理を変える。そこで、音声制御部１５における処理を以下分説する。
【００４４】
最初に、音声入力検出信号が入力されていない場合である雑音区間の処理を説明する。
この場合、上述したように音量、車速、加速度、道路状態、窓の開閉状態という５つのパラメータから定まる車両状態に対応する雑音のスペクトルが雑音記憶部１８に記憶されているか否かを判断し、記憶されていない場合には雑音スペクトル推定部１４から出力されたスペクトルを雑音記憶部１８に記憶する。これによって、雑音区間の入力信号に基づいてフーリエ変換部１３にて算出されたスペクトルが、各車両状態に対応して記憶されることになる。なお、雑音区間の入力信号に基づいてフーリエ変換部１３にて算出されたスペクトルを以下「予測雑音スペクトル」という。
【００４５】
続けて、音声入力検出信号が入力されている場合である音声区間の処理を説明する。
このとき、上述したように雑音スペクトル推定部１４からは、音声区間の入力信号に基づいてフーリエ変換部１３にて算出されたスペクトルに含まれる雑音のスペクトルを推定した雑音スペクトルが出力される。ここで音声制御部１５は、雑音スペクトル推定部１４から出力された雑音スペクトルと、各車両状態に対応させて雑音記憶部１８に記憶されている各予測雑音スペクトルとの類似度合を計算し、類似度合が所定値を越える予測雑音スペクトルを発見すると、その予測雑音スペクトルを減算部１６へ出力する。例えば、図４（ｂ）に示すような雑音スペクトルに基づいて、複数の予測雑音スペクトルの中から図４（ｃ）に示すような予測雑音スペクトルを出力するという具合である。
【００４６】
減算部１６では、フーリエ変換部１３から出力された雑音の混入した音声信号のスペクトルから、音声制御部１５から出力された予測雑音スペクトルを減算する。そして、逆フーリエ変換部１７では、減算部１６からの出力に対して逆フーリエ変換を施して時間関数の信号を求める。逆フーリエ変換部１７は、この信号を音声認識装置２０へ出力する。
【００４７】
このようにして、フレーム分割部１２での切り出し単位であるフレーム毎に得られる雑音の抑圧された時間関数の信号が順次音声認識装置２０へ送られる。
次に、この音声認識装置２０について説明する。
音声認識装置２０は、雑音抑圧装置１０からの出力を用いて一般的な分析手法である線形予測分析を行いパラメータを計算する。そして、予め計算しておいた認識対象語彙の標準パターン（特徴パラメータ系列）と、計算されたパラメータとの間で類似度計算を行なう。これらは周知のＤＰマッチング法、ＨＭＭ（隠れマルコフモデル）あるいはニューラルネットなどによって、この時系列データをいくつかの区間に分け、各区間が辞書データとして格納されたどの単語に対応しているかを求める。そして、各認識対象語彙のうち類似度が所定値を越える語彙を認識結果として図示しない各種アクチュエータ等の制御部へ出力する。一方、各認識対象語彙のうち類似度が所定値を越える語彙がない場合には、雑音抑圧装置１０の音声制御部１５へ認識不可であることを通知する。
【００４８】
音声認識装置２０から認識不可である旨の通知を受けると、雑音抑圧装置１０の音声制御部１５は、再び雑音スペクトル推定部１４から出力された雑音スペクトルに類似する別の予測雑音スペクトルを発見すべく、各車両状態に対応させて雑音記憶部１８に記憶されている予測雑音スペクトルとの類似度合を計算する。そして、類似度合が所定値を越える予測雑音スペクトルが発見されると、その予測雑音スペクトルを減算部１６へ出力する。一方、類似度合が所定値を越える予測雑音スペクトルが発見されない場合は、オーディオ機器５１及びナビゲーション装置５４へ利用者からの再入力を促すための指示信号を出力する。この指示信号に基づく指示は、スピーカ６０から音声として出力されると共に、ナビゲーション装置５４のモニタに文字として表示される。
【００４９】
以上、図１に基づいて各機能ブロックの説明をしたが、さらに上述した各処理の流れを明確にするため、次に本音声認識システムでの処理を図２のフローチャートに基づいて説明する。
まず最初のステップＳ１００において、入力処理を行う。この処理は、図１に示した音声入力部１１及びフレーム分割部１２の処理に相当するものである。すなわち、マイク３０を介して入力されたアナログ音声信号を例えば１０ＫＨｚのサンプリング周波数でデジタル信号に変換し、変換されたデジタル信号を例えば単語毎のフレームとして順次切り出す。
【００５０】
Ｓ１１０では、フーリエ変換を行う。この処理は、図１中に示したフーリエ変換部１３の処理に相当する。ここでは、フレーム毎の入力信号に対してフーリエ変換を行い、入力信号の周波数スペクトルを求め、その周波数スペクトルの振幅成分を２乗して入力信号のパワースペクトルを算出する。
【００５１】
Ｓ１２０では、車両状態を取得する。本実施形態で車両状態とは、図１中に示したオーディオ機器５１からの音量、速度センサ５２からの車速、加速度センサ５３からの車両の振動状態、ナビゲーション装置５４からの道路状態及び窓開閉装置５５からの窓の開閉状態という５つのパラメータによって定まる状態をいう。従って、ここではこれら５つのパラメータを取得し、この５つのパラメータから車両状態を特定する。なお、この処理は、図１中の音声制御部１５の処理に相当する。
【００５２】
Ｓ１３０では、ＰＴＴスイッチ４０がオンであるか否かを判断する。ここでＰＴＴスイッチ４０がオンである場合（Ｓ１３０：ＹＥＳ）、すなわち音声が入力される期間（音声区間）である場合には、Ｓ１６０へ移行する。一方、ＰＴＴスイッチ４０がオンでない場合（Ｓ１３０：ＮＯ）、すなわち雑音のみが入力される期間（雑音区間）である場合には、Ｓ１４０へ移行する。
ＰＴＴスイッチ４０がオフである場合に移行するＳ１４０からの処理は、Ｓ１１０でフーリエ変換されたフレーム信号が雑音のみの雑音信号である場合に相当する。Ｓ１４０では、Ｓ１２０にて特定された車両状態に対応して予測雑音スペクトルが既に雑音記憶部１８に記憶されているか否かを判断する。ここで予測雑音スペクトルが既に記憶されていると判断されると（Ｓ１４０：ＹＥＳ）、Ｓ１５０の処理を実行せずに、本雑音抑圧処理を終了する。一方、予測雑音スペクトルがまだ記憶されていないと判断されると（Ｓ１４０：ＮＯ）、Ｓ１５０にて、Ｓ１１０で算出されたスペクトルを予測雑音スペクトルとして雑音記憶部１８に記憶し、本雑音抑圧処理を終了する。なお、Ｓ１４０及びＳ１５０の処理は音声制御部１５の処理に相当する。
【００５３】
ＰＴＴスイッチ４０がオンである場合に移行するＳ１６０からの処理は、Ｓ１１０でフーリエ変換されたフレーム信号が雑音の混入した音声信号である場合に相当する。
Ｓ１６０では、Ｓ１１０にて算出されたスペクトルに含まれる雑音スペクトルを推定する。この処理は、図１中に示した雑音スペクトル推定部１４の処理に相当する。
【００５４】
続くＳ１７０では、Ｓ１６０にて算出された雑音スペクトルに類似する予測雑音スペクトルが雑音記憶部１８に記憶されているか否かを判断するものである。ここでは、雑音記憶部１８に記憶されている予測雑音スペクトルとＳ１６０にて算出された雑音スペクトルとの類似度合を順次算出しつつ、類似度合が所定値以上であるか否かを判断する。ここで類似度合が所定値以上の予測雑音スペクトルがあった場合（Ｓ１７０：ＹＥＳ）、Ｓ１８０へ移行する。一方、類似度合が所定値以上の予測雑音スペクトルがなかった場合（Ｓ１７０：ＮＯ）、Ｓ２１０へ移行する。
【００５５】
Ｓ１８０では、減算処理を行う。この処理は、図１中の減算部１６の処理に相当する。ここでは、Ｓ１１０にて算出されたスペクトルから、Ｓ１７０にて読み出された予測雑音スペクトルを減算する。その後、逆フーリエ変換が行われ、時間関数の信号が音声認識装置２０へ出力される。
【００５６】
Ｓ１９０では、音声認識できたか否かを判断する。この処理は、図１中の音声認識装置２０における処理である。ここでは、上述したような周知の方法によって、予め記憶されている各認識対象語彙のうち類似度が所定値を越える語彙があるか否かを判断する。ここで類似度が所定値を越える語彙があると判断された場合（Ｓ１９０：ＹＥＳ）、Ｓ２００にて、その語彙を認識結果として図示しない制御部へ出力する。一方、類似度が所定値を越える語彙がないと判断された場合（Ｓ１９０：ＮＯ）、Ｓ２１０へ移行する。
【００５７】
Ｓ１７０及びＳ１９０で否定判断された場合に移行するＳ２１０では、雑音記憶部１８に記憶されている予測雑音スペクトルについてＳ１６０にて算出された雑音スペクトルとの類似度合をすべて算出し類似判定したか否かを判断する。ここで雑音記憶部１８に記憶されている予測雑音スペクトルについてすべて類似判定している場合（Ｓ２１０：ＹＥＳ）、Ｓ２２０にて利用者に再入力を促す。この処理は、音声制御部１５が、オーディオ機器５１及びナビゲーション装置５４へ利用者からの再入力を促すための指示信号を出力するものである。その後、本雑音抑圧処理を終了する。一方、雑音記憶部１８に記憶されている予測雑音スペクトルについて類似判定をしていないものがある場合（Ｓ２１０：ＮＯ）、Ｓ１７０からの処理を繰り返す。
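The loop formed by steps S170 to S220 can be summarised in a few lines. This is a hedged sketch under several assumptions: the similarity measure and the recognizer are passed in as callables because the text does not fix them, the inverse Fourier transform before recognition is omitted, negative results of the subtraction are clamped to zero, and all names (`suppress_and_recognize`, `min_similarity`, `recognize`) are hypothetical.

```python
def suppress_and_recognize(noisy, estimated, stored, similarity,
                           min_similarity, recognize):
    """Try each sufficiently similar stored spectrum until recognition succeeds.

    Returns the recognized word, or None when every stored spectrum has been
    tried without success (the caller would then prompt for re-entry, S220).
    """
    for predicted in stored:                       # S210: visit each candidate once
        if similarity(estimated, predicted) < min_similarity:
            continue                               # S170: not similar enough
        cleaned = [max(s - p, 0.0)                 # S180: subtraction (clamped)
                   for s, p in zip(noisy, predicted)]
        word = recognize(cleaned)                  # S190: recognition attempt
        if word is not None:
            return word                            # S200: output the result
    return None


def negative_l1(a, b):
    """Toy similarity for the example: larger means more alike."""
    return -sum(abs(x - y) for x, y in zip(a, b))


def toy_recognizer(spectrum):
    """Stand-in for the recognizer: accepts exactly one cleaned spectrum."""
    return "tokyo" if spectrum == [1, 1] else None


result = suppress_and_recognize(
    noisy=[1, 6], estimated=[0, 4], stored=[[5, 0], [0, 5]],
    similarity=negative_l1, min_similarity=-3, recognize=toy_recognizer)
```

Returning `None` instead of prompting directly keeps the sketch free of any assumption about how the re-entry request reaches the speaker or the monitor.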
【００５８】
次に、本実施形態の音声認識システムの発揮する効果を説明する。なお、ここでの説明に対する理解を容易にするため、最初に従来の問題点について簡単に説明しておく。
従来より、雑音の混入した音声信号に基づいて算出したスペクトルから雑音のスペクトルを差し引いて、音声認識装置での認識率を向上させることが行われていたが、このときの雑音のスペクトルは、上述したように別のマイクロフォンを介して入力された雑音信号に基づいて算出されたり、または、音声信号が入力される以前に入力された過去の雑音信号に基づいて算出されたりしていたため、音声信号に混入した雑音と同じ種類の雑音のスペクトルである保障がなかった。従って、音声信号に混入した雑音とは異なる種類の雑音スペクトルを差し引いてしまうことがあり、音声信号に混入した雑音を適切に抑圧できず、音声認識率の低下につながっていた。
【００５９】
そこで、本実施形態の音声認識システムでは、ＰＴＴスイッチ４０がオフとなっている期間（雑音区間）に算出された雑音のスペクトルを予測雑音スペクトルとして雑音記憶部１８に記憶する際、取得した車両状態（図２中のＳ１２０）に対応させて複数種類記憶しておく（図２中のＳ１４０及びＳ１５０）。そして、ＰＴＴスイッチ４０がオンとなっている期間（音声区間）に算出された音声信号のスペクトルから、そのスペクトルに重畳した雑音スペクトルを推定し（図２中のＳ１６０）、この推定した雑音スペクトルに類似する予測雑音スペクトルが雑音記憶部１８に記憶されているか否かを判断する（図２中のＳ１７０）。ここで、雑音記憶部１８に記憶された予測雑音スペクトルに雑音スペクトルと類似するものがあれば、その予測雑音スペクトルを、雑音の混入した音声信号のスペクトルから減算し（図２中のＳ１８０）、音声認識を行う（図２中のＳ１９０）。
【００６０】
つまり、雑音区間にて算出された雑音のスペクトルである予測雑音スペクトルを従来のように一律に記憶しておくのではなく、車両状態が変われば発生する雑音の種類が変わるという前提に立ち、車両状態に対応させて記憶しておくのである。これによって、複数の種類の雑音のスペクトルが予測雑音スペクトルとして記憶されることになる。そして、音声区間にて算出された雑音の混入した音声のスペクトルに含まれる雑音のスペクトルを推定し、この雑音スペクトルに類似する予測雑音スペクトルを差し引く。従って、全く種類の異なる雑音のスペクトルを差し引いてしまう可能性がなくなり、音声信号と雑音信号とが混在した入力信号から雑音成分のみを適切に除去できる可能性が高くなる。結果として、音声認識装置２０における音声認識率の向上に寄与することができる。
【００６１】
なお、本実施形態においては、フレーム分割部１２における切り出し機能が「フレーム分割手段」に相当する。また、雑音スペクトル推定部１４において、音声入力検出信号の入力があると雑音の推定処理を始めたり、音声制御部１５において、音声入力検出信号の入力があると予測雑音スペクトルの検索処理を行い、音声入力検出信号の入力がないと雑音のスペクトルを予測雑音スペクトルとして記憶する処理を実行しているが、これが「判定手段」による音声区間と雑音区間の判定結果に基づく処理内容の変更に相当する。そして、フーリエ変換部１３が「スペクトル算出手段」に相当し、雑音スペクトル推定部１４が「雑音スペクトル推定手段」に相当する。また、減算部１６が「減算手段」に相当し、雑音記憶部１８が「予測雑音スペクトル記憶手段」に相当し、オーディオ機器５１、速度センサ５２、加速度センサ５３、ナビゲーション装置５４及び窓開閉装置５５が「車両状態検出手段」に相当し、音声制御部１５が「予測雑音スペクトル記憶制御手段」に相当し、ＰＴＴスイッチ４０が「入力期間指定手段」に相当する。
【００６２】
以上、本発明はこのような実施形態に何等限定されるものではなく、本発明の主旨を逸脱しない範囲において種々なる形態で実施し得る。
（１）例えば、上記実施形態においては、ＰＴＴスイッチ４０のオン・オフを判定し、ＰＴＴスイッチがオフである期間（雑音区間）の雑音のみの入力信号に基づいて算出された雑音のスペクトルを予測雑音スペクトルとして雑音記憶部１８に記憶する構成であった。このとき、車両状態に対応する予測雑音スペクトルを一度記憶した後は、その車両状態となった場合であっても予測雑音スペクトルを記憶しないようになっていた（図２中のＳ１４０：ＹＥＳ）。すなわち、各車両状態に対応する予測雑音スペクトルを一度記憶すれば、その後は記憶処理が実行されないため、処理負荷軽減の点で有効である。
【００６３】
これに対して、ある車両状態に対応する予測雑音スペクトルを一度記憶した後であっても、記憶した時から所定時間が経過している場合には、既に記憶されている予測雑音スペクトルを更新するように構成してもよい。なぜなら、同じように窓を開けた状態であっても周囲の環境が街中であるのと郊外であるのとでは雑音の種類も変わってくることが考えられ、なるべく現時点に近い過去に記憶された雑音のスペクトルを予測雑音スペクトルとした方がよいからである。このような構成とすれば、所定時間が経過した後に予測雑音スペクトルが更新されるため、より音声信号に混入した雑音に近い予測雑音スペクトルが記憶される可能性が高くなり、雑音成分の除去が効果的に行われる可能性が高くなる。
【００６４】
また、発生が予想される雑音が分かっている場合には、それら雑音のスペクトルを予測雑音スペクトルとして予め雑音記憶部１８に記憶しておいてもよい。この場合、ＰＴＴスイッチ４０によって音声区間と雑音区間を区別し、雑音区間において算出した雑音のスペクトルを記憶する必要がなくなるため、処理負荷軽減という点で有利である。
【００６５】
（２）また、上記実施形態では、雑音スペクトル推定部１４にて音声信号に混入した雑音のスペクトルを雑音スペクトルとして推定し（図２中のＳ１６０）、雑音記憶部１８に記憶された予測雑音スペクトルの中でこの雑音スペクトルに類似するものを減算部１６にて減算するようにしていたが（図２中のＳ１８０）、雑音スペクトル推定部１４にて推定した雑音スペクトルそのものを減算するようにしてもよい。この場合は、ＰＴＴスイッチ４０によって音声区間と雑音区間を区別し、雑音区間において算出した雑音のスペクトルを記憶する必要がなくなると共に、予測雑音スペクトルを記憶するための雑音記憶部も必要なくなるため、処理負荷軽減及び装置構成の簡略化を実現することができる。
【００６６】
（３）さらにまた、雑音スペクトル推定部１４における雑音スペクトルの推定方法について言えば、上記実施形態では、雑音の混入した音声のスペクトルのレベル変化率及び周波数に基づいて雑音を推定していたが、レベル変化率のみに基づいて雑音スペクトルを推定することもできる。
【００６７】
（４）また、上記実施形態では、フレーム信号をフーリエ変換した周波数スペクトルを用いて処理を行っていたが、フーリエ変換して得た周波数スペクトルの振幅成分である振幅スペクトルや、その振幅成分を２乗したパワースペクトルを用いて処理を行う構成とすることも考えられる。
【００６８】
（５）さらにまた、上記実施形態においては、音声を入力させる期間を発声者自身が指定するために設けられたＰＴＴスイッチ４０を用い、利用者がＰＴＴスイッチ４０を押しながら音声を入力すると、そのＰＴＴスイッチ４０が押されている間を音声区間とみなすようにしたが、実際の入力信号に基づいて音声区間と雑音区間を判定するようにしてもよい。例えば、入力信号のパワーに基づいて判定することが考えられる。
【図面の簡単な説明】
【図１】実施形態の音声認識システムの概略構成を示すブロック図である。
【図２】実施形態の音声認識システムで実行される雑音抑圧処理を示すフローチャートである。
【図３】（ａ）は雑音の混入していない音声信号のスペクトルを例示し、（ｂ）は雑音の混入した音声信号のスペクトルを例示した説明図である。
【図４】（ａ）は雑音の混入した音声信号のスペクトルを例示し、（ｂ）は（ａ）のスペクトルに基づいて推定された雑音スペクトルを例示し、（ｃ）は予め記憶された予測雑音スペクトルの中で（ｂ）のスペクトルに類似するものを例示した説明図である。
【図５】種類の異なる雑音のスペクトルを例示した説明図である。
【図６】従来の音声認識システムを例示する説明図である。
【符号の説明】
１０…雑音抑圧装置　１１…音声入力部
１２…フレーム分割部　１３…フーリエ変換部
１４…雑音スペクトル推定部　１５…音声制御部
１６…減算部　１７…逆フーリエ変換部
１８…雑音記憶部　２０…音声認識装置
３０…マイク　４０…ＰＴＴスイッチ
５１…オーディオ機器　５２…速度センサ
５３…加速度センサ　５４…ナビゲーション装置
５５…窓開閉装置　６０…スピーカ
２００…音声認識システム　２０１…音声用マイク
２０２…雑音用マイク　２０３…雑音抑圧装置
２０４…音声認識装置　２０５…ＰＴＴスイッチ

[0001]
[Technical field to which the invention belongs]
The present invention relates to noise suppression used as preprocessing for speech signal processing such as speech recognition, and more particularly to a technique for removing a noise component from an input signal in which a speech signal to be recognized and a noise signal are mixed.
[0002]
[Prior art and problems to be solved by the invention]
Conventionally, speech recognition apparatuses have been proposed and put into practice that are useful, for example, for allowing a destination in a car navigation system to be set by voice. Such an apparatus compares the input speech with a plurality of comparison target pattern candidates stored in advance and takes the candidate with a high degree of coincidence as the recognition result; with current recognition technology, however, the recognition result is not always completely accurate. This is true even in a quiet environment, and all the more so in an environment where noise is generated in the surroundings. In particular, considering actual usage environments such as the car navigation system described above, it is difficult to assume that there is no noise. Therefore, to improve the recognition rate, it is desirable, as preprocessing for input to the speech recognition apparatus, to perform noise suppression that removes the noise component from an input signal in which the speech signal needed for recognition and a noise signal are mixed.
[0003]
As a system configuration that performs speech recognition after such noise suppression, a speech recognition system 200 as shown in FIG. 6(a), for example, has been considered. A speech signal mixed with noise is input from the voice microphone 201, while a noise signal containing only noise is input from the noise microphone 202. The input signals from the voice microphone 201 and the noise microphone 202 are input to the noise suppression device 203, and the speech signal whose noise has been suppressed by the noise suppression device 203 is transferred to the speech recognition device 204. In this configuration, the user inputs speech via the microphone 201 while pressing a PTT (Push-To-Talk) switch 205. Noise suppression in the noise suppression device 203 is performed as follows.
[0004]
That is, as shown in FIG. 6(b), when the PTT switch 205 is pressed, the noise suppression device 203 treats the period as a speech section and captures the input signals from the voice microphone 201 and the noise microphone 202. The input signal from the voice microphone 201, however, is “speech signal + noise signal”. Therefore, subtracting the “noise signal” input from the noise microphone 202 from the “speech signal + noise signal” from the voice microphone 201 extracts a speech signal in which the noise signal is suppressed. When the noise signal is subtracted from the noise-mixed speech signal, the subtraction may be performed on the frequency spectrum obtained by Fourier transforming each signal, on the amplitude spectrum, which is the amplitude of that frequency spectrum, or on the power spectrum obtained by squaring the amplitude spectrum.
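The subtraction described for the two-microphone system can be illustrated bin by bin. This is a minimal sketch that assumes both signals have already been converted to amplitude or power spectra of equal length; the function name `subtract_spectrum` and the `floor` parameter (which clamps negative differences, a common practical safeguard that the text does not mention) are assumptions.

```python
def subtract_spectrum(noisy, noise, floor=0.0):
    """Subtract a noise amplitude or power spectrum from a noisy one, bin by bin.

    Differences below `floor` are clamped to `floor` (an assumption; the
    source text only says the spectra are subtracted).
    """
    if len(noisy) != len(noise):
        raise ValueError("spectra must have the same number of bins")
    return [max(s - n, floor) for s, n in zip(noisy, noise)]


# Toy 4-bin spectra: voice microphone (speech + noise) and noise microphone.
voice_mic = [4.0, 9.0, 2.0, 7.0]
noise_mic = [1.0, 5.0, 3.0, 0.5]
cleaned = subtract_spectrum(voice_mic, noise_mic)
```

The third bin shows why a mismatch between the two noise signals matters: the noise microphone reports more energy there than the voice microphone contains, so the plain difference would be negative.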
[0005]
However, the method described above with reference to FIG. 6 requires the voice microphone 201 and the noise microphone 202 to be installed a predetermined distance apart so that the noise microphone 202 does not pick up the speech signal, which complicates the system as a whole. In addition, because two microphones are installed, the type of noise generated may differ depending on where each microphone is placed. That is, there is no guarantee that the noise signal input from the voice microphone 201 together with the speech signal is the same as the noise signal input from the noise microphone 202. Consequently, it may not be possible to appropriately remove only the noise component from the input signal in which the speech signal and the noise signal are mixed.
[0006]
Here, the reason why the noise component cannot be appropriately removed when the types of noise differ is explained. FIG. 5 is an explanatory diagram illustrating frequency spectra of different types of noise. In FIG. 5(a), the level change rate of the frequency components near 1 kHz, 2 kHz, and 4 kHz is large, and in FIG. 5(b), the level change rate of the frequency components near 1 kHz, 2 kHz, and 4 kHz is even larger than in FIG. 5(a). In FIG. 5(c), the level change rate of the frequency components near 1.5 kHz, 3 kHz, and 6 kHz is large, and in FIG. 5(d), the level change rate of all frequency components from 0 to 6 kHz is large. The level change rate refers to the absolute value of the slope of the spectrum waveform, and a portion with a large level change rate appears in the figures as a portion protruding in the vertical-axis direction of the graph.
[0007]
Thus, when different types of noise are viewed as frequency spectra, the level change rate and the frequencies at which the level change rate becomes large differ. Therefore, when a noise component is subtracted from an input signal in which a speech signal and a noise signal are mixed, subtracting the spectrum of a different type of noise instead distorts the spectrum of the speech signal. That is, in a system such as that shown in FIG. 6(a), appropriate noise suppression cannot be performed unless the noise input from the voice microphone 201 and the noise input from the noise microphone 202 are of the same type.
[0008]
Incidentally, there have also been conventional systems that use a single microphone. In that case, the on/off state of the PTT switch is detected to distinguish the noise section from the speech section, and the “noise signal” captured in the noise section is subtracted from the “speech signal + noise signal” captured in the speech section to extract a speech signal in which the noise signal is suppressed. However, this method does not directly detect the noise mixed in during the speech section either; it merely estimates the noise in the speech section from the noise signal captured in the noise section before the speech section starts, and subtracts the estimated noise signal from the noise-mixed speech signal.
[0009]
Therefore, in an environment where the surroundings change from moment to moment and the type of noise generated changes accordingly, such as inside an automobile, there is no guarantee that the noise input in the noise section is of the same type as the noise mixed into the speech signal in the speech section, and in this case too the noise signal may not be subtracted appropriately.
[0010]
The present invention has been made to solve the above problems, and its object is to appropriately remove the noise component from an input signal in which a speech signal and a noise signal are mixed, thereby contributing to an improved recognition rate in speech recognition.
[0011]
[Means for Solving the Problems and Effects of the Invention]
In the noise suppression apparatus of the present invention, for example, an input signal input via a microphone or the like is divided by a frame dividing unit and cut out as a frame signal, and a spectrum calculating unit calculates a spectrum from the frame signal. Here, as an example, the spectrum calculated by the spectrum calculating means may be a frequency spectrum defined by Fourier transform of a frame signal. However, it is not limited to those defined by Fourier transform, and for example, a spectrum defined by Fourier series expansion, Z transform, or discrete Fourier transform (DFT) may be used. Further, an amplitude spectrum that is an amplitude component of the frequency spectrum described above may be used, or a power spectrum obtained by squaring the amplitude spectrum may be used.
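The chain from frame signal to frequency, amplitude, and power spectrum can be sketched as follows. This is an illustration only: it uses a naive discrete Fourier transform (one of the alternatives the paragraph mentions) and fixed-length, non-overlapping frames, which is an assumption made for simplicity. The names `split_frames`, `dft`, `amplitude_spectrum`, and `power_spectrum` are hypothetical.

```python
import cmath


def split_frames(signal, frame_len):
    """Cut the signal into consecutive fixed-length frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def dft(frame):
    """Naive discrete Fourier transform: the complex frequency spectrum."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def amplitude_spectrum(frame):
    """Amplitude component of the frequency spectrum."""
    return [abs(c) for c in dft(frame)]


def power_spectrum(frame):
    """Square of the amplitude spectrum."""
    return [a * a for a in amplitude_spectrum(frame)]


# A period-4 cosine sampled 9 times; the last sample does not fill a frame.
signal = [1, 0, -1, 0, 1, 0, -1, 0, 5]
frames = split_frames(signal, 4)
amplitude = amplitude_spectrum(frames[0])
power = power_spectrum(frames[0])
```

The naive transform costs O(n²) per frame; a real system would normally use a fast Fourier transform, but the resulting spectra are the same.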
[0012]
The input signal described above is a speech signal mixed with noise when a speech signal from the user is input, and is a noise-only signal when no speech signal from the user is input. In particular, in the noise suppression apparatus of the present invention, the noise spectrum estimation means estimates, as the noise spectrum, the spectrum of the noise contained in the spectrum calculated from the input signal, based on the characteristics of the noise component appearing in that spectrum. The noise spectrum may be estimated for each of the repeatedly calculated spectra, or may be estimated once per predetermined number of spectra as being common to those spectra. For example, when the repeatedly calculated spectra contain the spectrum of the same type of noise, the frequency components of the noise spectrum can be expected to appear in common in those spectra, so estimating the noise spectrum from a plurality of spectra makes it more likely that a more accurate noise spectrum is obtained. Conversely, in an environment where the type of noise changes from moment to moment, it is desirable to estimate the noise spectrum repeatedly at relatively short intervals.
[0013]
The noise suppression apparatus of the present invention further includes: predicted noise spectrum storage means that stores predicted noise spectra, each being a spectrum of noise calculated from a noise signal; predicted noise spectrum specifying means that specifies, from among the predicted noise spectra stored in the predicted noise spectrum storage means, one having a high degree of similarity to the noise spectrum estimated by the noise spectrum estimation means; and subtracting means that subtracts the predicted noise spectrum specified by the predicted noise spectrum specifying means from the spectrum calculated by the spectrum calculating means.
Apart from this approach, a configuration is also conceivable in which a noise spectrum is estimated from the spectrum calculated from the input signal and the estimated noise spectrum itself is subtracted; however, when the noise spectrum is estimated from the spectrum of a speech signal, it is difficult to estimate the true noise spectrum. Therefore, the spectrum of noise calculated from a noise signal is stored as a predicted noise spectrum in the predicted noise spectrum storage means, and the subtracting means subtracts this predicted noise spectrum.
In doing so, a plurality of types of predicted noise spectra are stored, and the predicted noise spectrum specifying means specifies the one that is close to the noise spectrum superimposed on the spectrum of the speech signal. Specifically, it specifies, from among the predicted noise spectra stored in the predicted noise spectrum storage means, one having a high degree of similarity to the noise spectrum estimated by the noise spectrum estimation means described above. For example, the predicted noise spectrum shown in FIG. 4(c), which has the highest degree of similarity to the noise spectrum shown in FIG. 4(b), is specified.
That is, if spectra of noise that is highly likely to be mixed into the speech signal are stored as predicted noise spectra, only the noise component can be removed from the input signal in which the speech signal and the noise signal are mixed. As a result, the speech recognition rate can be dramatically improved.
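The specify-then-subtract step can be sketched as follows. The text does not say how the degree of similarity is computed, so cosine similarity is used here purely as an assumed stand-in; the function names, the dictionary keys, and the clamping of negative differences to zero are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Assumed similarity measure: cosine of the angle between two spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(estimated, stored):
    """Return (name, spectrum) of the stored entry closest to the estimate."""
    return max(stored.items(),
               key=lambda item: cosine_similarity(estimated, item[1]))


def suppress(noisy, estimated, stored):
    """Subtract the best-matching predicted noise spectrum from the noisy one."""
    name, predicted = most_similar(estimated, stored)
    cleaned = [max(s - p, 0.0) for s, p in zip(noisy, predicted)]
    return name, cleaned


stored = {
    "window_open": [0, 5, 0, 5],   # predicted noise spectra recorded earlier
    "audio_on": [5, 0, 5, 0],
}
noisy = [2, 7, 2, 6]       # spectrum of the speech signal mixed with noise
estimated = [0, 4, 0, 3]   # rough noise spectrum estimated from `noisy`
chosen, cleaned = suppress(noisy, estimated, stored)
```

Note that the rough estimate is used only to choose among the stored spectra; the spectrum that is actually subtracted is the stored one, which was measured from a noise-only signal.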
[0015]
Here, a specific method for estimating the noise spectrum based on the characteristics of the noise component appearing in the spectrum calculated based on the input signal of the speech section will be described.
For example, as set out in claim 2, the noise spectrum estimation means may be configured to detect a frequency at which the level change rate of the spectrum calculated from the input signal is equal to or higher than a predetermined threshold, and to estimate the noise spectrum based on the spectral component at the detected frequency.
[0016]
This focuses on the fact that, as illustrated in FIG. 5, a noise spectrum is likely to have portions with a large level change rate at specific frequency components. In the spectrum of a speech signal on which such a noise spectrum is superimposed, portions with a large level change rate appear at specific frequency components, as shown in FIG. 3(b). Comparison with the spectrum of an ideal speech signal containing no noise, as shown in FIG. 3(a), shows that subtracting these portions with a large level change rate brings the spectrum closer to that of the ideal speech signal.
[0017]
Therefore, the noise spectrum can be estimated as follows.
For example, when the spectrum calculation means calculates a frequency spectrum, that spectrum is defined by the Fourier transform of the frame signal, which is a function of time, and is expressed as a function of the frequency f. The level change rate of the spectrum is therefore obtained by differentiating with respect to the frequency f; a frequency at which the change rate is equal to or higher than a predetermined threshold is detected, and the noise spectrum is estimated based on the spectrum component at that frequency. Here, estimation based on a spectral component may mean, for example, taking a spectrum that has the spectral component itself as the noise spectrum, or correcting the spectral component using spectral components at frequencies other than the detected frequency and taking a spectrum that has the corrected spectral component as the noise spectrum. For example, given the spectrum of a speech signal mixed with noise as shown in FIG. 4A, the frequencies at which the level change rate is equal to or higher than the predetermined threshold are 1 kHz, 2 kHz, and 4 kHz. A spectrum having spectral components based on those in the vicinity of 1, 2, and 4 kHz, as shown in FIG. 4B for example, is then estimated as the noise spectrum.
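The single-threshold estimation described above can be sketched as follows. This is only an illustrative sketch, not the method as implemented in the embodiment: the function names, the representation of the spectrum as a list of levels at evenly spaced frequencies, and the finite-difference approximation of the derivative (the larger of the two one-sided differences at each bin) are all assumptions. It implements the simpler of the two options, in which the detected spectral components themselves form the estimated noise spectrum.

```python
def level_change_rate(spectrum, df):
    """Approximate |d(level)/df| at each bin (assumption: the larger of the
    left-hand and right-hand finite differences is used as the slope)."""
    n = len(spectrum)
    rates = []
    for i in range(n):
        left = abs(spectrum[i] - spectrum[i - 1]) if i > 0 else 0.0
        right = abs(spectrum[i + 1] - spectrum[i]) if i < n - 1 else 0.0
        rates.append(max(left, right) / df)
    return rates


def estimate_noise_spectrum(spectrum, df, threshold):
    """Keep the components whose level change rate is at or above the
    threshold and set every other component to zero."""
    rates = level_change_rate(spectrum, df)
    return [s if r >= threshold else 0.0 for s, r in zip(spectrum, rates)]


# A flat spectrum with one sharp peak: the peak and its flanks are detected.
noisy = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
noise = estimate_noise_spectrum(noisy, df=1.0, threshold=5.0)
```

The variant that corrects the detected components using neighbouring, undetected components is not shown, since the source does not specify the correction.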
[0018]
As described above, in a noise spectrum the level change rate often becomes large at specific frequency components. By paying attention to the frequency components at which a large level change rate appears in the spectrum of a speech signal mixed with noise, the spectrum of the noise mixed in the audio signal can be estimated.
[0019]
As described above, the spectrum of the mixed noise can be estimated from the level change rate of the frequency components alone, but setting the threshold described above may be difficult. That is, there is no problem as long as the level change rate of the noise spectrum differs greatly from that of the voice spectrum, but when the difference between the two is small, it is difficult to set a threshold that separates them.
[0020]
Therefore, as set forth in claim 3, the noise spectrum estimation means may detect a frequency at which the level change rate of the spectrum calculated based on the input signal is equal to or higher than a first threshold, further detect a frequency in the vicinity of 2ⁿ times that frequency (n is an integer) at which the level change rate is equal to or higher than a second threshold smaller than the first threshold, and estimate the noise spectrum based on the spectrum components at the frequencies at which the level change rate is equal to or higher than the first or second threshold.
[0021]
This focuses on the relationship between the level change rate of a noise spectrum and frequency. That is, if a noise spectrum has a frequency with a large level change rate, the level change rate is often also large at 2ⁿ times that frequency. For example, as shown in FIG. 4, the level change rate is large in the vicinity of 1 kHz, of 2 kHz, which is twice that, and of 4 kHz, which is twice that again. On this assumption, if there is a frequency with a relatively large level change rate and the level change rate is also large at 2ⁿ times that frequency, the level change at the 2ⁿ-times frequency can likely be attributed to the noise spectrum. Therefore, a frequency at which the level change rate is equal to or higher than the first threshold is detected first, and then a level change caused by the noise spectrum that cannot be identified with the first threshold is identified with the smaller second threshold by using the above relationship between frequencies. The noise spectrum is then estimated based on the spectrum components at the frequencies detected in this way. The estimation of the noise spectrum based on the spectrum components is the same as described for claim 2. In this case, using the relationship between frequencies makes the thresholds easier to set and allows the noise spectrum to be estimated more accurately.
[0022]
Here, n is “an integer”, but in the example shown in FIG. 4, for instance, only values of n satisfying 0 kHz < (frequency × 2ⁿ) < 6 kHz need be considered. That is, n is limited so that frequency × 2ⁿ falls within the frequency band under consideration. Note that n may be a negative integer: frequencies that are 1/2 times, 1/4 times, and so on, the frequency detected first are also considered.
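The two-threshold detection can be illustrated with the following sketch. The function name, the bin-based representation (bin i corresponds to frequency i × df), and the `tolerance_bins` parameter used to model "in the vicinity of" are assumptions made for illustration; the source does not state how wide the vicinity is.

```python
def detect_noise_bins(rates, df, first_threshold, second_threshold,
                      f_max, tolerance_bins=1):
    """Return the sorted bin indices attributed to noise.

    rates[i] is the level change rate at frequency i * df. Bins at or above
    first_threshold are detected directly; bins near 2**n times such a
    frequency (n a positive or negative integer, frequency below f_max)
    are detected if they reach the smaller second_threshold.
    """
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    seeds = sorted(i for i, r in enumerate(rates) if r >= first_threshold)
    detected = set(seeds)
    for i in seeds:
        f1 = i * df
        if f1 <= 0:
            continue
        candidates = []
        f = f1 * 2
        while f < f_max:          # positive n: 2x, 4x, ...
            candidates.append(f)
            f *= 2
        f = f1 / 2
        while f >= df:            # negative n: 1/2x, 1/4x, ...
            candidates.append(f)
            f /= 2
        for f2 in candidates:
            centre = round(f2 / df)
            for j in range(centre - tolerance_bins, centre + tolerance_bins + 1):
                if 0 <= j < len(rates) and rates[j] >= second_threshold:
                    detected.add(j)
    return sorted(detected)


# 0.5 kHz per bin; strong change at 1 kHz, weaker at 2, 3 and 4 kHz.
rates = [0.0] * 12
rates[2], rates[4], rates[6], rates[8] = 10.0, 5.0, 5.0, 5.0
bins = detect_noise_bins(rates, 0.5, 8.0, 4.0, 6.0)
```

In this example the change at 3 kHz is not attributed to noise, because 3 kHz is not 2ⁿ times the 1 kHz frequency detected with the first threshold.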
[0027]
  Incidentally, in order to store a noise spectrum that is highly likely to be mixed into the speech signal as a predicted noise spectrum, a spectrum may be calculated from a noise-only signal when such a signal is input as the input signal, and the calculated spectrum stored as a predicted noise spectrum. For example, assuming that the device is installed in an automobile, the configuration set forth in claim 4 may be adopted. That is, in addition to the configuration of claims 1 to 3, the device may comprise vehicle state detection means for detecting the vehicle state, determination means for determining whether the input signal is in a voice section that includes voice or a noise section that does not, and predicted noise spectrum storage control means for taking the spectrum calculated based on the input signal in a noise section determined by the determination means as a predicted noise spectrum and storing it in the predicted noise spectrum storage means in correspondence with each vehicle state detected by the vehicle state detection means.
[0028]
The vehicle state detection means detects a vehicle state such as the volume of a vehicle-mounted audio device, the vehicle speed, the open/closed state of the windows, the road condition, and the vibration state of the vehicle. The determination means determines whether the input signal is in a voice section that includes voice or a noise section that does not, and the predicted noise spectrum storage control means stores the predicted noise spectrum, which is the spectrum calculated based on the input signal in the noise section, in association with the vehicle state described above.
[0029]
In the present invention, a spectrum is calculated from an input signal actually input in a noise section and stored as a predicted noise spectrum. In particular, the present invention is characterized in that the spectrum is stored in correspondence with the vehicle state. The premise of this technical idea is the recognition that the type of noise generated changes when the vehicle state changes. In other words, inside an automobile, different types of noise are considered to be mixed into the audio signal depending on the vehicle state. Therefore, if a predicted noise spectrum is stored for each vehicle state, those predicted noise spectra are spectra of noise that may be mixed into the audio signal.
[0030]
The predicted noise spectrum storage control means may be configured so that, once a predicted noise spectrum corresponding to a certain vehicle state has been stored, no predicted noise spectrum is stored again even when that vehicle state recurs. Alternatively, because the type of noise may differ even with the windows open in the same way, depending on whether the surroundings are urban or suburban, it may be configured so that, even after a predicted noise spectrum corresponding to a certain vehicle state has been stored, the newly calculated predicted noise spectrum is stored again if a predetermined time has elapsed since the previous storage. With the former configuration, once the predicted noise spectrum for each vehicle state has been stored, the storage process is no longer executed, which is effective in reducing the processing load. With the latter configuration, the predicted noise spectrum is updated after the predetermined time has elapsed, so that a predicted noise spectrum calculated from a noise signal relatively close to the current point in time is stored. A predicted noise spectrum similar to the noise spectrum mixed into the speech signal is therefore more likely to be stored, and noise components are more likely to be removed effectively.
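The two storage policies can be sketched with a small class such as the one below. The class name `PredictedNoiseStore`, the `max_age` parameter, and the representation of a vehicle state as any hashable key are hypothetical choices made for illustration. With `max_age=None` the sketch follows the former policy (store once, never again); with a numeric `max_age` it follows the latter (store again once the predetermined time has elapsed).

```python
class PredictedNoiseStore:
    """Keeps one predicted noise spectrum per vehicle state."""

    def __init__(self, max_age=None):
        self.max_age = max_age      # predetermined time, or None for store-once
        self._entries = {}          # vehicle_state -> (stored_at, spectrum)

    def offer(self, vehicle_state, spectrum, now):
        """Store the spectrum if the policy allows it; return True if stored."""
        entry = self._entries.get(vehicle_state)
        if entry is not None:
            stored_at, _ = entry
            if self.max_age is None or now - stored_at < self.max_age:
                return False        # already stored and still fresh enough
        self._entries[vehicle_state] = (now, list(spectrum))
        return True

    def spectra(self):
        """Return all stored predicted noise spectra."""
        return [spectrum for _, spectrum in self._entries.values()]


store = PredictedNoiseStore(max_age=60.0)
store.offer(("window_open", "audio_off"), [0.2, 0.9, 0.1], now=0.0)
```

A time stamp is passed in explicitly as `now` so that the behaviour does not depend on a real clock; in a real device it would come from whatever timer is available.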
[0031]
The determination means described above determines whether the input signal is in a voice section that includes voice or a noise section that does not, and this determination can be made based on the power of the input signal. Alternatively, an input period specified by input period specifying means, which is provided so that the speaker can specify the period during which voice is input, may be determined to be the voice section. A PTT (Push-To-Talk) switch is one example of such input period specifying means. That is, the user inputs voice while pressing the PTT switch, and the voice input while the PTT switch is pressed is accepted as the processing target.
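Both ways of determining the section can be sketched as follows. The function names and the use of mean squared amplitude as "power" are assumptions for illustration; the source does not specify how the power is computed or what threshold is used.

```python
def frame_power(samples):
    """Mean squared amplitude of a non-empty frame of samples."""
    return sum(x * x for x in samples) / len(samples)


def is_voice_section(samples, power_threshold, ptt_pressed=None):
    """Decide whether a frame belongs to a voice section.

    If the state of a PTT switch is supplied, it alone decides; otherwise
    the frame power is compared with the threshold.
    """
    if ptt_pressed is not None:
        return ptt_pressed
    return frame_power(samples) >= power_threshold


quiet = is_voice_section([0.1, -0.1, 0.1, -0.1], power_threshold=0.5)
loud = is_voice_section([1.0, -1.0, 1.0, -1.0], power_threshold=0.5)
```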
[0032]
Up to this point, the configuration and effects of the noise suppression device have been described. A speech recognition system can also be realized that comprises the above-described noise suppression device and a speech recognition device that compares the output from the noise suppression device with a plurality of comparison target pattern candidates stored in advance and takes the candidate with a high degree of coincidence as the recognition result.
[0033]
Since the effects obtained when the invention is realized as such a speech recognition system are the same as when it is realized as a noise suppression device, their description is omitted here.
Such a voice recognition system can be applied to various uses; for example, it can be used in a so-called car navigation system. In that case it is very convenient if, for example, a destination for route setting can be input by voice. Besides navigation systems, the voice recognition system may also be used, for example, in an in-vehicle air conditioning system. In that case, it is used so that the user can input instructions relating to the air conditioning state by voice.
[0034]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 is a block diagram showing a schematic configuration of a speech recognition system according to an embodiment of the present invention. This speech recognition system is for in-vehicle use and comprises a noise suppression device 10 that performs noise suppression on speech input via a microphone 30, and a speech recognition device 20 that compares the output from the noise suppression device 10 with a plurality of comparison target pattern candidates stored in advance and takes the candidate with a high degree of coincidence as the recognition result. A PTT (Push-To-Talk) switch 40, which the user in the vehicle presses when inputting voice, is connected to the noise suppression device 10. Furthermore, an audio device 51, a speed sensor 52, an acceleration sensor 53, a navigation device 54, and a window opening / closing device 55 are connected for detecting the vehicle state.
[0035]
As shown in FIG. 1, the noise suppression device 10 includes a voice input unit 11, a frame division unit 12, a Fourier transform unit 13, a noise spectrum estimation unit 14, a voice control unit 15, a subtraction unit 16, an inverse Fourier transform unit 17, and a noise storage unit 18. The processing in each block is described below.
[0036]
The voice input unit 11 converts an analog audio signal input via the microphone 30 into a digital signal at a sampling frequency of, for example, 10 kHz, and outputs the digital signal to the frame division unit 12. The frame division unit 12 determines the breaks in the input signal from the voice input unit 11, cuts it into frames, for example one for each word such as “Tokyo” and “Chiyodaku”, and outputs them to the Fourier transform unit 13. The Fourier transform unit 13 performs a Fourier transform on the time-domain input signal of each frame to obtain the frequency spectrum of the input signal. This frequency spectrum is output to the noise spectrum estimation unit 14 and the subtraction unit 16. Hereinafter, the frequency spectrum is simply referred to as the spectrum.
[0037]
The noise spectrum estimation unit 14 receives the voice input detection signal from the PTT switch 40 described above. While this voice input detection signal is being received, the noise spectrum estimation unit 14 estimates the spectrum of the noise contained in the spectrum from the Fourier transform unit 13, based on that spectrum. The estimated noise spectrum is then output to the voice control unit 15.
[0038]
Here, a specific estimation method of the noise spectrum in the noise spectrum estimation unit 14 will be described. First, the characteristics of the noise spectrum will be described with reference to FIG. 5 showing the noise spectrum calculated based on the actually measured noise signal.
As shown in FIG. 5, although the frequency component of the noise spectrum varies depending on the type of noise, a portion having a large level change rate often appears especially in a specific frequency component. The level change rate is an absolute value of the slope of the spectrum waveform, and a portion where the level change rate is large is a portion where the spectrum waveform protrudes greatly in the vertical axis direction of the graph. For example, in FIGS. 5 (a) and 5 (b), the level change rate is large near 1k, 2k, 4kHz, and in FIG. 5 (c), the level change rate is large near 1.5k, 3k, 6kHz. In FIG. 5D, the level change rate is large over the entire range of 0 to 6 kHz.
[0039]
Therefore, such noise spectrum features often appear in the spectrum of a speech signal mixed with noise. For example, FIG. 3A shows the spectrum of an ideal audio signal that is not mixed with noise, while FIG. 3B shows the spectrum of an audio signal that is mixed with noise. As can be seen by comparing FIG. 3 (a) and FIG. 3 (b), a portion of a specific frequency component having a large level change rate appears in the spectrum of the speech signal mixed with noise.
[0040]
Also, if a noise spectrum has a frequency with a large level change rate, the level change rate is in many cases also large at 2ⁿ times that frequency. For example, in FIGS. 5 (a) and 5 (b), the level change rate is large near 1, 2, and 4 kHz, and in FIG. 5 (c), the level change rate is large near 1.5, 3, and 6 kHz.
[0041]
Therefore, in the present embodiment, the noise spectrum is estimated based on the level change rate and the frequency of the spectrum of the speech signal mixed with noise.
Specifically, the spectrum output from the Fourier transform unit 13, which is a function of the frequency f, is differentiated with respect to the frequency f, and a frequency f1 at which the level change rate is equal to or higher than the first threshold is detected. Furthermore, a frequency f2 that is 2ⁿ times the frequency f1 and at which the level change rate is equal to or higher than the second threshold is detected. Then, the spectrum components at the frequencies f1 and f2, at which the level change rate is equal to or higher than the first or second threshold, are extracted, these components are corrected using the spectrum components at frequencies other than the detected frequencies f1 and f2, and a noise spectrum having the corrected spectral components is estimated. For example, the noise spectrum shown in FIG. 4B is estimated from the spectrum of the speech signal mixed with noise shown in FIG.
[0042]
The noise spectrum estimation unit 14 stops the noise spectrum estimation process during any period in which it does not receive the voice input detection signal indicating that voice is being input. In the present embodiment, this voice input detection signal is output while the PTT (Push-To-Talk) switch 40 is pressed. That is, in this voice recognition system, the user inputs voice through the microphone 30 while pressing the PTT switch 40. The fact that the PTT switch 40 is pressed therefore means that the user has operated it with the intention of inputting voice, and in this case processing is performed by regarding the period as a voice input period (voice section), without actually determining whether voice is being input.
[0043]
When the voice input detection signal is not output by pressing the PTT switch 40, it is assumed that the voice is not input (noise period), and the spectrum from the Fourier transform unit 13 is output to the voice control unit 15 as it is.
Next, the voice control unit 15 will be described. The voice control unit 15 receives the volume from the audio device 51, the vehicle speed from the speed sensor 52, the vibration state of the vehicle from the acceleration sensor 53, the road state (tunnel, gravel road, etc.) from the navigation device 54, and the open/closed state of the windows from the window opening / closing device 55. The voice control unit 15 specifies the vehicle state based on these five items of data and performs the processing described below. The voice input detection signal indicating that voice is being input is also input to the voice control unit 15, and the processing differs depending on whether the voice input detection signal is input (voice section) or not (noise section). The processing in the voice control unit 15 is therefore described below for each case.
[0044]
First, the processing in the noise section, that is, the case where no voice input detection signal is input, will be described.
In this case, as described above, it is determined whether or not the noise spectrum corresponding to the vehicle state determined from the five parameters such as volume, vehicle speed, acceleration, road state, and window open / close state is stored in the noise storage unit 18. If not stored, the spectrum output from the noise spectrum estimation unit 14 is stored in the noise storage unit 18. As a result, the spectrum calculated by the Fourier transform unit 13 based on the input signal in the noise section is stored corresponding to each vehicle state. The spectrum calculated by the Fourier transform unit 13 based on the input signal in the noise section is hereinafter referred to as “predicted noise spectrum”.
[0045]
Next, a description will be given of the processing of a speech section that is a case where a speech input detection signal is input.
At this time, as described above, the noise spectrum estimation unit 14 outputs a noise spectrum obtained by estimating the noise contained in the spectrum calculated by the Fourier transform unit 13 based on the input signal in the speech section. The voice control unit 15 calculates the degree of similarity between the noise spectrum output from the noise spectrum estimation unit 14 and each predicted noise spectrum stored in the noise storage unit 18 in correspondence with each vehicle state. When a predicted noise spectrum whose degree of similarity exceeds a predetermined value is found, that predicted noise spectrum is output to the subtracting unit 16. For example, based on a noise spectrum as shown in FIG. 4B, a predicted noise spectrum as shown in FIG. 4C is selected from the plurality of predicted noise spectra and output.
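The search for a similar predicted noise spectrum might look like the following sketch. The source does not say how the degree of similarity is computed; cosine similarity is used here purely as an assumed stand-in, and the function names and the `skip` parameter (indices already tried) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_predicted_noise(estimated, stored, min_similarity, skip=()):
    """Return (index, spectrum) of the first stored predicted noise spectrum
    whose similarity to the estimate exceeds min_similarity, or None."""
    for index, candidate in enumerate(stored):
        if index in skip:
            continue
        if cosine_similarity(estimated, candidate) > min_similarity:
            return index, candidate
    return None


stored = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0]]
match = select_predicted_noise([0.0, 1.0, 0.0, 1.0], stored, 0.9)
```

Here the second stored spectrum has the same shape as the estimate and is selected even though its overall level differs, which is a property of the assumed cosine measure rather than something stated in the source.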
[0046]
The subtracting unit 16 subtracts the predicted noise spectrum output from the voice control unit 15 from the spectrum of the speech signal mixed with noise output from the Fourier transform unit 13. The inverse Fourier transform unit 17 then performs an inverse Fourier transform on the output from the subtracting unit 16 to obtain a time-domain signal, and outputs this signal to the speech recognition device 20.
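The subtraction itself can be sketched in a few lines. The function name is hypothetical, and clamping negative results to a floor is an assumption added here (a common practice in spectral subtraction); the source only states that the predicted noise spectrum is subtracted.

```python
def subtract_noise(noisy_spectrum, predicted_noise, floor=0.0):
    """Subtract the predicted noise spectrum bin by bin.

    Results below `floor` are clamped to it (assumption: the source does not
    say how negative differences are handled).
    """
    if len(noisy_spectrum) != len(predicted_noise):
        raise ValueError("spectra must have the same number of bins")
    return [max(s - n, floor) for s, n in zip(noisy_spectrum, predicted_noise)]


cleaned = subtract_noise([5.0, 2.0, 9.0], [1.0, 3.0, 4.0])
```

The inverse Fourier transform that follows is not shown; it would additionally need the phase of the noisy spectrum, which this magnitude-only sketch does not carry.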
[0047]
In this manner, the noise-suppressed time function signal obtained for each frame, which is a cut-out unit in the frame dividing unit 12, is sequentially sent to the speech recognition apparatus 20.
Next, the voice recognition device 20 will be described.
The speech recognition apparatus 20 performs linear prediction analysis, a commonly used analysis method, on the output from the noise suppression apparatus 10 and calculates parameters. It then calculates the similarity between the calculated parameters and the standard patterns (feature parameter series) of the recognition target vocabulary calculated in advance. The time-series data is divided into several sections by a known method such as DP matching, an HMM (Hidden Markov Model), or a neural network, and it is determined which of the words stored as dictionary data each section corresponds to. A vocabulary item in the recognition target vocabulary whose similarity exceeds a predetermined value is output as the recognition result to a control unit for various actuators and the like (not shown). On the other hand, when no vocabulary item in the recognition target vocabulary has a similarity exceeding the predetermined value, the voice control unit 15 of the noise suppression apparatus 10 is notified that recognition is impossible.
[0048]
When the notification that recognition is impossible is received from the speech recognition device 20, the voice control unit 15 of the noise suppression device 10 again calculates the degree of similarity with the predicted noise spectra stored in the noise storage unit 18 in correspondence with each vehicle state, in order to find another predicted noise spectrum similar to the noise spectrum output from the noise spectrum estimation unit 14. When a predicted noise spectrum with a degree of similarity exceeding the predetermined value is found, that predicted noise spectrum is output to the subtraction unit 16. On the other hand, if no predicted noise spectrum with a degree of similarity exceeding the predetermined value is found, an instruction signal prompting the user to input again is output to the audio device 51 and the navigation device 54. The instruction based on this signal is output as sound from the speaker 60 and displayed as text on the monitor of the navigation device 54.
[0049]
The functional blocks have been described with reference to FIG. 1. In order to further clarify the flow of each process described above, the process in the voice recognition system will be described with reference to the flowchart of FIG.
First, in the first step S100, input processing is performed. This processing corresponds to the processing of the voice input unit 11 and the frame dividing unit 12 shown in FIG. That is, the analog audio signal input via the microphone 30 is converted into a digital signal at a sampling frequency of, for example, 10 kHz, and the converted digital signal is sequentially cut out into frames, for example one for each word.
[0050]
In S110, Fourier transform is performed. This process corresponds to the process of the Fourier transform unit 13 shown in FIG. Here, Fourier transform is performed on the input signal for each frame to obtain the frequency spectrum of the input signal, and the power spectrum of the input signal is calculated by squaring the amplitude component of the frequency spectrum.
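The calculation in S110 can be illustrated with a direct discrete Fourier transform. This is a sketch under assumptions: the function name is hypothetical, a plain O(N²) DFT is used instead of an FFT for brevity, and only the bins from 0 up to half the sampling frequency are returned.

```python
import cmath


def power_spectrum(frame):
    """Return |X[k]|**2 for k = 0 .. N//2, where X is the DFT of the frame."""
    n = len(frame)
    result = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        result.append(abs(acc) ** 2)   # square of the amplitude component
    return result


# One cycle of a cosine over four samples: all power lands in bin 1.
spectrum = power_spectrum([1.0, 0.0, -1.0, 0.0])
```

With a sampling frequency of 10 kHz and a frame of N samples, bin k corresponds to a frequency of k × 10000 / N Hz.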
[0051]
In S120, the vehicle state is acquired. In the present embodiment, the vehicle state is a state determined by five parameters: the volume from the audio device 51, the vehicle speed from the speed sensor 52, the vibration state of the vehicle from the acceleration sensor 53, the road state from the navigation device 54, and the open/closed state of the windows from the window opening / closing device 55, these devices being shown in FIG. These five parameters are therefore acquired here, and the vehicle state is specified from them. This process corresponds to the process of the voice control unit 15 in FIG.
[0052]
In S130, it is determined whether or not the PTT switch 40 is on. Here, if the PTT switch 40 is on (S130: YES), that is, if it is a period during which a voice is input (voice section), the process proceeds to S160. On the other hand, when the PTT switch 40 is not on (S130: NO), that is, when it is a period during which only noise is input (noise section), the process proceeds to S140.
The processing from S140 onward, to which the process proceeds when the PTT switch 40 is off, corresponds to the case where the frame signal Fourier-transformed in S110 is a noise-only signal. In S140, it is determined whether a predicted noise spectrum corresponding to the vehicle state specified in S120 is already stored in the noise storage unit 18. If it is determined that the predicted noise spectrum is already stored (S140: YES), the present noise suppression process is terminated without executing S150. On the other hand, if it is determined that the predicted noise spectrum has not yet been stored (S140: NO), the spectrum calculated in S110 is stored in the noise storage unit 18 as the predicted noise spectrum in S150, and the present noise suppression process is then terminated. Note that the processing of S140 and S150 corresponds to the processing of the voice control unit 15.
[0053]
The processing from S160, which is shifted to when the PTT switch 40 is on, corresponds to the case where the frame signal Fourier-transformed in S110 is an audio signal mixed with noise.
In S160, the noise spectrum included in the spectrum calculated in S110 is estimated. This process corresponds to the process of the noise spectrum estimation unit 14 shown in FIG.
[0054]
In subsequent S170, it is determined whether or not a predicted noise spectrum similar to the noise spectrum calculated in S160 is stored in the noise storage unit 18. Here, it is determined whether or not the similarity is a predetermined value or more while sequentially calculating the similarity between the predicted noise spectrum stored in the noise storage unit 18 and the noise spectrum calculated in S160. If there is a predicted noise spectrum with a similarity degree equal to or greater than a predetermined value (S170: YES), the process proceeds to S180. On the other hand, when there is no predicted noise spectrum with a similarity degree equal to or greater than a predetermined value (S170: NO), the process proceeds to S210.
[0055]
In S180, a subtraction process is performed. This process corresponds to the process of the subtracting unit 16 in FIG. Here, the predicted noise spectrum read in S170 is subtracted from the spectrum calculated in S110. Thereafter, inverse Fourier transform is performed, and a signal of a time function is output to the speech recognition device 20.
[0056]
In S190, it is determined whether or not voice recognition is possible. This process is a process in the speech recognition apparatus 20 in FIG. Here, it is determined by a known method as described above whether there is a vocabulary whose similarity exceeds a predetermined value among the vocabulary to be recognized that are stored in advance. If it is determined that there is a vocabulary whose similarity exceeds a predetermined value (S190: YES), the vocabulary is output as a recognition result to a control unit (not shown) in S200. On the other hand, when it is determined that there is no vocabulary whose similarity exceeds a predetermined value (S190: NO), the process proceeds to S210.
[0057]
In S210, to which the process proceeds when a negative determination is made in S170 or S190, it is determined whether the similarity to the noise spectrum estimated in S160 has been calculated and judged for all the predicted noise spectra stored in the noise storage unit 18. If the similarity determination has been performed for all the predicted noise spectra stored in the noise storage unit 18 (S210: YES), the user is prompted to input again in S220. In this process, the voice control unit 15 outputs an instruction signal prompting the user to input again to the audio device 51 and the navigation device 54. Thereafter, the noise suppression process is terminated. On the other hand, when there remain predicted noise spectra stored in the noise storage unit 18 for which the similarity determination has not yet been performed (S210: NO), the processing from S170 is repeated.
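The flow from S130 to S220 can be summarized in the following sketch. It is an illustration only: the function name, the use of a plain dict keyed by vehicle state as the noise storage, and the caller-supplied `estimate_noise`, `similarity`, and `recognize` callables are all assumptions, and the inverse Fourier transform between subtraction and recognition is omitted.

```python
def process_frame(spectrum, vehicle_state, ptt_on, store,
                  estimate_noise, similarity, min_similarity, recognize):
    """Process one frame; return ("noise", None), ("recognized", word)
    or ("reinput", None)."""
    if not ptt_on:                                   # S130: NO (noise section)
        if vehicle_state not in store:               # S140
            store[vehicle_state] = list(spectrum)    # S150
        return ("noise", None)
    estimated = estimate_noise(spectrum)             # S160
    for candidate in store.values():                 # S170, repeated via S210
        if similarity(estimated, candidate) < min_similarity:
            continue
        cleaned = [max(s - n, 0.0)                   # S180 (clamping assumed)
                   for s, n in zip(spectrum, candidate)]
        word = recognize(cleaned)                    # S190
        if word is not None:
            return ("recognized", word)              # S200
    return ("reinput", None)                         # S220: all candidates tried


# Toy stand-ins for the estimation, similarity, and recognition steps.
def keep_peak(s):
    return [v if v == max(s) else 0.0 for v in s]


def same_peak(a, b):
    return 1.0 if a.index(max(a)) == b.index(max(b)) else 0.0


def toy_recognizer(c):
    return "tokyo" if c == [1.0, 1.0, 1.0] else None


store = {}
first = process_frame([1.0, 5.0, 1.0], "window_open", False, store,
                      keep_peak, same_peak, 0.5, toy_recognizer)
second = process_frame([2.0, 6.0, 2.0], "window_open", True, store,
                       keep_peak, same_peak, 0.5, toy_recognizer)
```

The loop over the stored spectra plays the role of the S170/S210 repetition: a candidate that is not similar enough, or that does not lead to a recognition result, is skipped and the next one is tried.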
[0058]
Next, the effects of the voice recognition system of this embodiment will be described. To make the description easier to follow, the conventional problems are first briefly reviewed.
Conventionally, a noise spectrum has been subtracted from the spectrum calculated from a voice signal mixed with noise in order to improve the recognition rate of the voice recognition device. However, because that noise spectrum was calculated from a noise signal input through a separate microphone, as described above, or from a past noise signal input before the voice signal, there was no guarantee that it was the spectrum of the same kind of noise as that mixed into the voice signal. A noise spectrum of a different type from the noise mixed into the voice signal could therefore be subtracted, so that the noise mixed into the voice signal was not appropriately suppressed, leading to a lower voice recognition rate.
[0059]
Therefore, in the speech recognition system of the present embodiment, when the noise spectrum calculated during the period in which the PTT switch 40 is off (noise section) is stored in the noise storage unit 18 as a predicted noise spectrum, a plurality of types are stored in correspondence with the acquired vehicle state (S120 in FIG. 2) (S140 and S150 in FIG. 2). Then, from the spectrum of the voice signal calculated during the period in which the PTT switch 40 is on (voice section), the noise spectrum superimposed on that spectrum is estimated (S160 in FIG. 2), and it is determined whether a predicted noise spectrum similar to it is stored in the noise storage unit 18 (S170 in FIG. 2). If a predicted noise spectrum stored in the noise storage unit 18 is similar to the noise spectrum, that predicted noise spectrum is subtracted from the spectrum of the speech signal mixed with noise (S180 in FIG. 2), and voice recognition is performed (S190 in FIG. 2).
[0060]
In other words, the predicted noise spectrum, which is the noise spectrum calculated in the noise section, is not stored uniformly as in the past but is stored in correspondence with each vehicle state, on the assumption that the type of noise generated changes when the vehicle state changes. As a result, a plurality of types of noise spectra are stored as predicted noise spectra. Then, the noise spectrum contained in the spectrum of the speech mixed with noise calculated in the speech section is estimated, and a predicted noise spectrum similar to that noise spectrum is subtracted. Therefore, there is no possibility of subtracting a noise spectrum of a completely different type, and it is highly likely that only the noise component can be appropriately removed from the input signal in which the audio signal and the noise signal are mixed. This contributes to improving the voice recognition rate of the voice recognition device 20.
[0061]
In the present embodiment, the cut-out function of the frame dividing unit 12 corresponds to the “frame dividing means”. The change of processing whereby the noise spectrum estimation unit 14 starts noise estimation when a voice input detection signal is input, and whereby the voice control unit 15 searches for a predicted noise spectrum when a voice input detection signal is input and stores the noise spectrum as a predicted noise spectrum when it is not, corresponds to changing the processing content based on the determination of the speech section and the noise section by the “determination means”. The Fourier transform unit 13 corresponds to the “spectrum calculation means”, and the noise spectrum estimation unit 14 corresponds to the “noise spectrum estimation means”. Further, the subtracting unit 16 corresponds to the “subtracting means”, the noise storage unit 18 to the “predicted noise spectrum storage means”, the audio device 51, the speed sensor 52, the acceleration sensor 53, the navigation device 54, and the window opening / closing device 55 to the “vehicle state detection means”, the voice control unit 15 to the “predicted noise spectrum storage control means”, and the PTT switch 40 to the “input period designation means”.
[0062]
The present invention is not limited to the embodiment described above, and can be implemented in various forms without departing from the gist of the present invention.
(1) For example, in the above-described embodiment, it is determined whether the PTT switch 40 is on or off, and the noise spectrum calculated from the noise-only input signal during the period in which the PTT switch 40 is off (the noise section) is stored in the noise storage unit 18 as a predicted noise spectrum. Once a predicted noise spectrum corresponding to a given vehicle state has been stored, no predicted noise spectrum is stored again even when that vehicle state occurs again (S140: YES in FIG. 2). That is, once the predicted noise spectrum corresponding to each vehicle state has been stored, the storage process is not executed thereafter, which is effective in reducing the processing load.
[0063]
On the other hand, the device may be configured so that, even after the predicted noise spectrum corresponding to a certain vehicle state has been stored, the stored predicted noise spectrum is updated once a predetermined time has elapsed since it was stored. This is because, even with the window open in the same way, the type of noise may differ depending on whether the surroundings are urban or suburban, so it is preferable to use a more recent noise spectrum as the predicted noise spectrum. With such a configuration, since the predicted noise spectrum is updated after the predetermined time has elapsed, it is more likely that a predicted noise spectrum close to the noise actually mixed in the speech signal is stored, and the noise component is more likely to be removed effectively.
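The two storage policies discussed in (1), storing once per vehicle state and updating after a predetermined time, can be sketched together as follows. This is a minimal sketch under stated assumptions: the class name `PredictedNoiseStore`, the `refresh_after_s` parameter, and the use of seconds as the time unit are hypothetical. Passing `refresh_after_s=None` reproduces the store-once behavior of the embodiment.

```python
class PredictedNoiseStore:
    """Keeps one predicted noise spectrum per vehicle state (hypothetical helper)."""

    def __init__(self, refresh_after_s=None):
        # None means "never update once stored"; a number enables timed updates.
        self.refresh_after_s = refresh_after_s
        self._entries = {}  # vehicle state -> (spectrum, time stored in seconds)

    def maybe_store(self, state, spectrum, now_s):
        """Store the spectrum if none exists for `state` or the stored one is stale.

        Returns True when the spectrum was stored, False when it was skipped.
        """
        entry = self._entries.get(state)
        if entry is not None:
            if self.refresh_after_s is None:
                return False  # already stored, timed updates disabled
            if now_s - entry[1] < self.refresh_after_s:
                return False  # stored recently enough, keep the existing one
        self._entries[state] = (list(spectrum), now_s)
        return True

    def get(self, state):
        """Return the stored spectrum for `state`, or None if nothing is stored."""
        entry = self._entries.get(state)
        return None if entry is None else entry[0]


store = PredictedNoiseStore(refresh_after_s=600)
first = store.maybe_store("window_open", [0.1, 0.8], now_s=0)
second = store.maybe_store("window_open", [0.3, 0.5], now_s=100)   # too early
third = store.maybe_store("window_open", [0.3, 0.5], now_s=600)    # updated
```

Skipping the store while the entry is still fresh keeps the processing-load benefit of the original embodiment between updates.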
[0064]
In addition, when noise that is expected to be generated is known, the noise spectrum may be stored in the noise storage unit 18 in advance as a predicted noise spectrum. In this case, it is not necessary to distinguish between the speech section and the noise section by the PTT switch 40 and store the noise spectrum calculated in the noise section, which is advantageous in terms of reducing the processing load.
[0065]
(2) In the above embodiment, the noise spectrum estimation unit 14 estimates the spectrum of the noise mixed in the speech signal as the noise spectrum (S160 in FIG. 2), and the subtraction unit 16 subtracts, from among the predicted noise spectra stored in the noise storage unit 18, the one similar to that noise spectrum (S180 in FIG. 2). Alternatively, the noise spectrum estimated by the noise spectrum estimation unit 14 may itself be subtracted. In this case, it is not necessary to distinguish the speech section from the noise section by the PTT switch 40 and store the noise spectrum calculated in the noise section, and the noise storage unit for storing predicted noise spectra is also unnecessary. The processing load can therefore be reduced and the device configuration simplified.
[0066]
(3) Furthermore, regarding the noise spectrum estimation method used in the noise spectrum estimation unit 14, the above embodiment estimates the noise based on both the level change rate of the spectrum of the noise-mixed speech and the frequency. However, the noise spectrum may also be estimated based on the level change rate alone.
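Estimation based on the level change rate alone can be sketched as follows. This is a rough illustration under stated assumptions: the level change rate (the absolute slope of the spectrum waveform) is approximated by the larger absolute difference between a bin and its neighboring bins, and the noise estimate simply keeps the bins whose rate reaches the threshold and zeroes the others. The function names, the threshold, and the sample values are hypothetical.

```python
def level_change_rate(spectrum):
    """Approximate the absolute slope at each bin.

    For each bin, take the larger absolute difference to its left or right
    neighbor (a discrete stand-in for the slope, assumed for illustration).
    """
    n = len(spectrum)
    rates = []
    for i in range(n):
        left = abs(spectrum[i] - spectrum[i - 1]) if i > 0 else 0.0
        right = abs(spectrum[i + 1] - spectrum[i]) if i < n - 1 else 0.0
        rates.append(max(left, right))
    return rates


def estimate_noise_spectrum(spectrum, threshold):
    """Keep bins whose level change rate is at least `threshold`, zero the rest."""
    rates = level_change_rate(spectrum)
    return [s if r >= threshold else 0.0 for s, r in zip(spectrum, rates)]


# A flat spectrum with one sharp peak at bin 2 (hypothetical values).
noisy_spectrum = [1.0, 1.0, 5.0, 1.0, 1.0, 1.2]
estimated = estimate_noise_spectrum(noisy_spectrum, threshold=2.0)
```

With this approximation the bins on either side of a sharp peak are also retained, because the slope leading into the peak is steep there as well.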
[0067]
(4) In the above embodiment, processing is performed using the frequency spectrum obtained by Fourier transforming the frame signal. However, processing may also be performed using the amplitude spectrum, which is the amplitude component of the frequency spectrum obtained by the Fourier transform, or the power spectrum, which is the square of that amplitude component.
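The relationship between the three representations can be sketched as follows. The naive discrete Fourier transform below is used only to keep the example self-contained; the function names and the test frame (one cosine cycle over eight samples) are assumptions made for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def amplitude_spectrum(spectrum):
    """Amplitude component |X[k]| of each complex frequency bin."""
    return [abs(c) for c in spectrum]


def power_spectrum(spectrum):
    """Square of the amplitude component, |X[k]|^2."""
    return [abs(c) ** 2 for c in spectrum]


# One full cosine cycle across an 8-sample frame (hypothetical input).
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spectrum = dft(frame)
amplitude = amplitude_spectrum(spectrum)
power = power_spectrum(spectrum)
```

For this frame the energy is concentrated in bins 1 and 7, where the amplitude is approximately 4 and the power approximately 16.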
[0068]
(5) Furthermore, in the above-described embodiment, the PTT switch 40 is provided so that the speaker can specify the period for inputting voice; the user inputs voice while pressing the PTT switch 40, and the period during which the PTT switch 40 is pressed is regarded as the speech section. Alternatively, the speech section and the noise section may be determined based on the actual input signal. For example, the determination may be made based on the power of the input signal.
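A power-based determination of this kind can be sketched as follows. This is a minimal sketch under stated assumptions: the mean-square value of each frame is used as its power, and a single fixed threshold separates the two sections. The function names, the threshold, and the sample frames are hypothetical.

```python
def frame_power(frame):
    """Mean-square power of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def classify_frames(frames, threshold):
    """Label each frame "speech" if its power reaches `threshold`, else "noise"."""
    return ["speech" if frame_power(f) >= threshold else "noise" for f in frames]


frames = [
    [0.01, -0.02, 0.01, 0.0],   # low-level background noise
    [0.5, -0.6, 0.4, -0.5],     # louder frame, treated as speech
]
labels = classify_frames(frames, threshold=0.01)
```

A fixed threshold is the simplest choice; in a vehicle, where the background level varies with the vehicle state, a practical threshold would likely need to adapt.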
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a schematic configuration of a speech recognition system according to an embodiment.
FIG. 2 is a flowchart showing noise suppression processing executed in the speech recognition system of the embodiment.
FIG. 3A is an explanatory diagram illustrating a spectrum of a speech signal not mixed with noise, and FIG. 3B is an explanatory diagram illustrating a spectrum of a speech signal mixed with noise.
FIG. 4A is an explanatory diagram illustrating a spectrum of a speech signal mixed with noise, FIG. 4B is an explanatory diagram illustrating a noise spectrum estimated based on the spectrum of FIG. 4A, and FIG. 4C is an explanatory diagram illustrating the predicted noise spectrum, among those stored in advance, that is similar to the spectrum of FIG. 4B.
FIG. 5 is an explanatory diagram illustrating spectrums of different types of noise.
FIG. 6 is an explanatory diagram illustrating a conventional voice recognition system.
[Explanation of symbols]
10 ... Noise suppression device
11 ... Voice input unit
12 ... Frame division unit
13 ... Fourier transform unit
14 ... Noise spectrum estimation unit
15 ... Voice control unit
16 ... Subtraction unit
17 ... Inverse Fourier transform unit
18 ... Noise storage unit
20 ... Voice recognition device
30 ... Microphone
40 ... PTT switch
51 ... Audio device
52 ... Speed sensor
53 ... Acceleration sensor
54 ... Navigation device
55 ... Window opening/closing device
60 ... Speaker
200 ... Voice recognition system
201 ... Voice microphone
202 ... Noise microphone
203 ... Noise suppression device
204 ... Voice recognition device
205 ... PTT switch

Claims

A noise suppression device comprising:
frame dividing means for dividing an input signal including noise and cutting it out as frame signals;
spectrum calculating means for calculating a spectrum from the frame signal cut out by the frame dividing means;
noise spectrum estimating means for estimating, from the spectrum calculated by the spectrum calculating means, the spectrum of the noise as a noise spectrum based on the characteristics of the noise component appearing in that spectrum;
predicted noise spectrum storage means for storing a predicted noise spectrum which is a spectrum of noise;
predicted noise spectrum specifying means for specifying, from among the predicted noise spectra stored in the predicted noise spectrum storage means, one having a high degree of similarity to the noise spectrum estimated by the noise spectrum estimating means; and
subtracting means for subtracting the predicted noise spectrum specified by the predicted noise spectrum specifying means from the spectrum calculated by the spectrum calculating means.

The noise suppression device according to claim 1,
wherein the noise spectrum estimating means is configured to detect a frequency at which the level change rate of the spectrum calculated based on the input signal is equal to or greater than a predetermined threshold, and to estimate the noise spectrum based on the spectral component at the detected frequency.

The noise suppression device according to claim 1,
wherein the noise spectrum estimating means is configured to detect a frequency at which the level change rate of the spectrum calculated based on the input signal is equal to or higher than a first threshold, to detect a frequency that is in the vicinity of 2n times (n is an integer) the detected frequency and at which the level change rate is equal to or higher than a second threshold smaller than the first threshold, and to estimate the noise spectrum based on the spectral components at the frequencies at which the level change rate is equal to or higher than the first or second threshold.

The noise suppression device according to any one of claims 1 to 3, used in a vehicle,
further comprising:
vehicle state detection means for detecting a vehicle state;
determining means for determining a speech section in which speech is included in the input signal and a noise section in which speech is not included; and
predicted noise spectrum storage control means for storing, as the predicted noise spectrum, the spectrum calculated based on the input signal of the noise section determined by the determining means, in association with each vehicle state detected by the vehicle state detection means.

The noise suppression device according to claim 4,
wherein the predicted noise spectrum storage control means is configured to store the calculated predicted noise spectrum when a predicted noise spectrum corresponding to the vehicle state detected by the vehicle state detection means is not stored in the predicted noise spectrum storage means.

The noise suppression device according to claim 4,
wherein the predicted noise spectrum storage control means is configured to update the predicted noise spectrum when a predetermined time has elapsed since it was stored, even if the predicted noise spectrum corresponding to the vehicle state detected by the vehicle state detection means is already stored in the predicted noise spectrum storage means.

The noise suppression device according to any one of claims 4 to 6,
wherein the determining means is configured to determine the speech section and the noise section based on the power of the input signal.

The noise suppression device according to any one of claims 4 to 6,
further comprising input period specifying means provided for the speaker to specify a period for inputting voice,
wherein the determining means is configured to determine the input period specified by the input period specifying means as the speech section.

A speech recognition system comprising:
the noise suppression device according to any one of claims 1 to 8; and
a speech recognition device that compares the output from the noise suppression device with a plurality of comparison target pattern candidates stored in advance and takes the candidate with a high degree of matching as the recognition result.