JP2004102161A

JP2004102161A - Device, method, and program for voice detection

Info

Publication number: JP2004102161A
Application number: JP2002267108A
Authority: JP
Inventors: Masafumi Miyabe; 宮部　雅史
Original assignee: Asahi Kasei Microsystems Co Ltd; Asahi Kasei Microdevices Corp
Current assignee: Asahi Kasei Microsystems Co Ltd; Asahi Kasei Microdevices Corp
Priority date: 2002-09-12
Filing date: 2002-09-12
Publication date: 2004-04-02

Abstract

PROBLEM TO BE SOLVED: To suppress the deterioration in utterance-detection accuracy caused by the length of an utterance.
SOLUTION: A speaker signal is passed through high-pass filters 1 and 2a, which limit it to the band around 2,000 Hz where human formant frequencies are mainly distributed, and is then input to a short-time sound pressure averaging unit 4a. The same speaker signal is also passed through the high-pass filter 1 and a low-pass filter 2b, which limit it to the band where low-frequency noise is distributed, and is then input to a long-time sound pressure averaging unit 4b.
COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
本発明は音声検出装置、音声検出方法および音声検出プログラムに関し、特に、話者の発話の検出を精度よく行う場合に適用して好適なものである。
【０００２】
【従来の技術】
従来の音声検出装置では、話者の発話および無発話を判定するため、話者側から送られた話者信号の音圧を測り、予め設定したしきい値と比較する方法があった。
また、雑音環境下での話者の発話および無発話の判定精度を向上させるため、例えば、特許文献１に開示されているように、時定数の異なる時間平均器を設け、短時間の平均音圧より長時間の平均音圧を差し引いてから、予め設定したしきい値と比較する方法もあった。
【０００３】
図３は、従来の音声検出装置の概略構成を示すブロック図である。
図３において、従来の音声検出装置には、話者信号を絶対値化する絶対値化回路１１、短時間の平均音圧を算出する短時間音圧平均器１２ａ、長時間の平均音圧を算出する長時間音圧平均器１２ｂ、短時間音圧平均器１２ａの出力から長時間音圧平均器１２ｂの出力を減算する減算器１３、減算器１３からの出力をしきい値と比較する比較器１４が設けられている。
【０００４】
ここで、話者信号には、話者の発話による音声信号および雑音環境下での雑音信号が含まれている。
そして、話者信号が絶対値化回路１１に入力されると、絶対値化回路１１にて話者信号が絶対値化され、短時間音圧平均器１２ａおよび長時間音圧平均器１２ｂに出力される。
【０００５】
そして、絶対値化された話者信号の短時間の平均音圧が短時間音圧平均器１２ａにて算出されるとともに、絶対値化された話者信号の長時間の平均音圧が長時間音圧平均器１２ｂにて算出され、これらの平均音圧が減算器１３にて互いに減算される。
ここで、雑音はほぼ定常的に発生するのに対し、話者の発話は間欠的に行われる。
【０００６】
このため、雑音の平均音圧は、平均時間に依存することなくほぼ一定値を維持し、短時間音圧平均器１２ａおよび長時間音圧平均器１２ｂで捉えられる雑音の平均音圧はほぼ一致するため、短時間音圧平均器１２ａの出力値から長時間音圧平均器１２ｂの出力値を減算することで、雑音の影響を打ち消すことができる。一方、音声信号の平均音圧は、平均時間が長いと、発話時のレベルと無発話時のレベルが平均化されるが、平均時間が短いと、話者の発話または無発話に応じて変動するようになり、話者の発話時または無発話時のレベルを抽出することが可能となる。
【０００７】
この結果、短時間音圧平均器１２ａでは、発話時と無発話時とで平均音圧を大きく異ならせることが可能となり、話者の発話時または無発話時のレベルを抽出することが可能となるとともに、長時間音圧平均器１２ｂでは、発話または無発話にかかわらず、平均音圧をほぼ一定に維持することが可能となり、雑音の平均音圧を抽出することが可能となる。
【０００８】
そして、減算器１３からの出力は比較器１４に入力され、減算器１３からの出力がしきい値と比較される。
そして、減算器１３からの出力がしきい値より大きい時は発話状態と判定することが可能となるとともに、減算器１３からの出力がしきい値より小さい時は無発話状態と判定することが可能となる。
【０００９】
【特許文献１】
特開平１−１２７２８号公報
【００１０】
【発明が解決しようとする課題】
しかしながら、図３の短時間音圧平均器１２ａおよび長時間音圧平均器１２ｂでは、話者信号の平均音圧を算出するための平均時間こそ異なるが、話者信号の平均音圧を算出するための周波数帯域は同一である。
このため、発話が長時間に及ぶ場合、長時間音圧平均器１２ｂで算出される定常音圧が音声の影響をより強く受けて、正しい値を維持できなくなり、発話検出の精度が劣化するという問題があった。
【００１１】
そこで、本発明の目的は、発話時間の影響による発話検出の精度の劣化を抑制することが可能な音声検出装置、音声検出方法および音声検出プログラムを提供することである。
【００１２】
【課題を解決するための手段】
上述した課題を解決するために、請求項１記載の音声検出装置によれば、音声信号および雑音信号を含む話者信号を入力し、前記音声信号のレベルと予め設定された基準となるレベルとを比較して、その比較結果に基づいて発話状態か無発話状態かを判定する音声検出装置において、前記音声信号が分布する周波数帯域に前記話者信号を帯域制限する音声帯域制限手段と、前記音声帯域制限手段の出力信号のレベルを時間で平均する第１の時間平均手段と、前記雑音信号が分布する周波数帯域に前記話者信号を帯域制限する雑音帯域制限手段と、前記雑音帯域制限手段の出力信号のレベルを前記第１の時間平均手段の平均する時間よりも長い時間で平均する第２の時間平均手段と、前記第１の時間平均手段の出力信号から前記第２の時間平均手段の出力信号を減算する減算手段と、前記減算手段の出力信号のレベルと予め設定された基準となるレベルとを比較する比較器とを備えることを特徴とする。
【００１３】
これにより、第１の時間平均手段に入力される雑音信号を抑圧して、音声信号を効率よく抽出することが可能となるとともに、第２の時間平均手段に入力される音声信号を抑圧して、雑音信号を効率よく抽出することが可能となる。
このため、発話が長時間に及んだ場合においても、雑音レベル検出時の音声信号の影響を抑圧することが可能となり、第２の時間平均手段で算出される雑音レベルの精度を向上させることが可能となることから、発話時間の影響による発話検出の精度の劣化を抑制することが可能となる。
【００１４】
また、請求項２記載の音声検出装置によれば、前記第２の時間平均手段の後段に、前記第２の時間平均手段の出力信号のレベルに対して重み付けを行い周波数特性の補正をする重み補正手段をさらに備えたことを特徴とする。
これにより、第２の時間平均手段の出力信号のレベルが第１の時間平均手段に入力される信号の帯域に合致するように、第２の時間平均手段の出力信号のレベルを補正することが可能となる。
【００１５】
このため、第１の時間平均手段に入力される信号の帯域と、第２の時間平均手段に入力される信号の帯域とが異なる場合においても、第１の時間平均手段の出力信号から第２の時間平均手段の出力信号を減算することで、第１の時間平均手段の出力信号に残存する雑音成分を除去することができ、音声検出の精度を向上させることが可能となる。
【００１６】
また、請求項３記載の音声検出装置によれば、前記音声帯域制限手段および前記雑音帯域制限手段は、前記話者信号を入力しフィルタリングする第１のハイパスフィルタと、前記第１のハイパスフィルタの出力信号をフィルタリングする第２のハイパスフィルタおよびローパスフィルタとを備え、前記話者信号は、前記第１のハイパスフィルタを経て前記第２のハイパスフィルタに入力され前記音声信号が分布する周波数帯域に帯域制限され、前記第２のハイパスフィルタの出力信号は、前記第１の時間平均手段に入力され、前記話者信号は、前記第１のハイパスフィルタを経て前記ローパスフィルタに入力され前記雑音信号が分布する周波数帯域に帯域制限され、前記ローパスフィルタの出力信号は、前記第２の時間平均手段に入力されることを特徴とする。
【００１７】
これにより、自動車の車室内などに分布する低域雑音帯域と、音声のホルマント周波数の中心が集中する音声帯域とを分離することが可能となり、自動車の車室内などの特定環境下における音声検出の精度を容易に向上させることが可能となる。
また、請求項４記載の音声検出方法によれば、話者信号から音声信号を抽出する周波数帯域と、前記話者信号から雑音を抽出する周波数帯域とを異ならせることにより、前記話者信号に含まれる音声を検出することを特徴とする。
【００１８】
これにより、話者信号に含まれる音声信号と雑音信号とを効率よく分離することが可能となり、音声信号に雑音信号が重畳されている場合においても、音声レベル検出時の雑音レベルの影響を軽減することが可能となることから、音声検出の精度を向上させることが可能となる。
また、請求項５記載の音声検出プログラムによれば、音声信号が分布する第１の周波数帯域に話者信号を帯域制限するステップと、雑音信号が分布する第２の周波数帯域に前記話者信号を帯域制限するステップと、前記第１の周波数帯域に帯域制限された話者信号と、前記前記第２の周波数帯域に帯域制限された話者信号との比較結果に基づいて、前記話者信号に含まれる音声を検出するステップとを備えることを特徴とする。
【００１９】
これにより、話者信号に含まれる音声信号と雑音信号とを分離して、音声信号に雑音信号が重畳されている場合においても、音声信号および雑音信号を効率よく抽出することが可能となる。
このため、発話が長時間に及んだ場合においても、雑音レベル検出時の音声信号の影響を抑圧することが可能となり、発話時間の影響による発話検出の精度の劣化を抑制することが可能となる。
【００２０】
【発明の実施の形態】
以下、本発明の実施形態に係る音声検出装置について図面を参照しながら説明する。
図１は、本発明の一実施形態に係る音声検出装置の概略構成を示すブロック図である。
【００２１】
図１において、音声検出装置には、ハイパスフィルタ１、２ａ、ローパスフィルタ２ｂ、ハイパスフィルタ１およびローパスフィルタ２ａで帯域制限された話者信号を絶対値化する絶対値化回路３ａ、ハイパスフィルタ１およびローパスフィルタ２ｂで帯域制限された話者信号を絶対値化する絶対値化回路３ｂ、短時間の平均音圧を算出する短時間音圧平均器４ａ、長時間の平均音圧を算出する長時間音圧平均器４ｂ、長時間音圧平均器４ｂから出力される信号レベルを補正する重み調整アンプ５、短時間音圧平均器４ａの出力から重み調整アンプ５の出力を減算する減算器６および減算器６からの出力をしきい値と比較する比較器７が設けられている。
【００２２】
ここで、話者信号には、話者の発話による音声信号および雑音環境下での雑音信号が含まれている。
また、ハイパスフィルタ１は、電話回線で伝送されない帯域の信号を遮断するためのもので、電話回線の帯域は通常３００Ｈｚ以上になっているため、ハイパスフィルタ１の通過帯域は、この電話回線の帯域に対応して設定することができる。
【００２３】
また、ハイパスフィルタ２ａは、人間のホルマント周波数が主に分布する帯域の信号を抽出するとともに、自動車の車内などで発生する低域雑音を遮断するためのもので、人間のホルマント周波数の中心は２０００Ｈｚ近辺に集中するため、ハイパスフィルタ２ａの遮断周波数を２０００Ｈｚ近辺に設定することにより、低域雑音の影響を低減させつつ、人間の音声を精度よく抽出することができる。
【００２４】
また、ローパスフィルタ２ｂは、ハイパスフィルタ２ａで帯域制限された信号と相関の弱い帯域の信号を抽出するもので、例えば、自動車の車内などで発生する低域雑音を抽出するとともに、人間のホルマント周波数が主に分布する帯域の信号を遮断する。
すなわち、人間のホルマント周波数の中心は２０００Ｈｚ近辺に集中し、自動車の車内などで発生する雑音はそれよりも低域に集中するため、ローパスフィルタ２ｂの遮断周波数を２０００Ｈｚ近辺に設定することにより、人間の音声の影響を低減させつつ、雑音音圧を精度よく抽出することができる。
【００２５】
そして、ハイパスフィルタ１で３００Ｈｚ以下の低域成分がカットされた話者信号は、ハイパスフィルタ２ａおよびローパスフィルタ２ｂにそれぞれ入力される。
そして、ハイパスフィルタ２ａでは、人間のホルマント周波数が主に分布する２０００Ｈｚ近辺の帯域の信号が抽出されるとともに、自動車の車内などで発生する低域雑音がカットされ、絶対値化回路３ａに入力される。
【００２６】
そして、ハイパスフィルタ２ａで抽出された信号が絶対値化回路３ａに入力されると、その信号が絶対値化された後、短時間音圧平均器４ａに出力される。
そして、絶対値化回路３ａで絶対値化された信号の短時間の平均音圧が算出され、減算器６に出力される。
一方、ローパスフィルタ２ｂでは、自動車の車内などで発生する低域雑音が抽出されるとともに、人間のホルマント周波数が主に分布する２０００Ｈｚ近辺の帯域の信号がカットされ、絶対値化回路３ｂに入力される。
【００２７】
そして、ローパスフィルタ２ｂで抽出された信号が絶対値化回路３ｂに入力されると、その信号が絶対値化された後、長時間音圧平均器４ｂに出力される。
そして、絶対値化回路３ｂで絶対値化された信号の長時間の平均音圧が算出され、重み調整アンプ５に出力される。
ここで、長時間音圧平均器４ｂが重み調整アンプ５に入力されると、重み調整アンプ５は、長時間音圧平均器４ｂから出力される帯域の平均音圧が短時間音圧平均器４ａから出力される信号の帯域に合致するように、長時間音圧平均器４ｂから出力される平均音圧を補正する。
【００２８】
そして、短時間音圧平均器４ａで算出された平均音圧が減算器６に入力されるとともに、長時間音圧平均器４ｂで算出された平均音圧が重み調整アンプ５を介して減算器６に入力されると、これらの値が減算されて、比較器７に入力される。
ここで、重み調整アンプ５にて長時間音圧平均器４ｂから出力される平均音圧の周波数特性を補正することにより、異なる帯域に帯域制限された信号同士を減算することで、ハイパスフィルタ２ａで除去しきれずに残存する雑音成分を短時間音圧平均器４ａの出力信号から除去することができ、音声検出の精度を向上させることが可能となる。
【００２９】
そして、減算器６からの出力が比較器７に入力されると、減算器６からの出力がしきい値と比較される。
そして、減算器６からの出力がしきい値より大きい時は発話状態と判定することが可能となるとともに、減算器６からの出力がしきい値より小さい時は無発話状態と判定することが可能となる。
【００３０】
図２は、本発明の一実施形態に係る自動車内での音声および雑音の周波数分布を示す図である。
図２において、人間の音声のスペクトル分布Ａ１は通常３００〜３５００Ｈｚに分布し、人間の音声Ａ３のホルマント周波数の中心は２０００Ｈｚ近辺に存在する。
【００３１】
一方、車内雑音のスペクトル分布Ｂ１のピークは、人間の音声のスペクトル分布Ａ１のピークよりも低域に存在し、音声Ａ３のホルマント周波数の中心が集中する２０００Ｈｚ近辺では、車内雑音のスペクトル分布Ｂ１の裾に対応する。
このため、人間の音声のスペクトル分布Ｂ１が２０００Ｈｚ近辺以上に帯域制限されるように、フィルタ特性Ｃ１を設定するとともに、車内雑音のスペクトル分布Ｂ１が２０００Ｈｚ近辺以下に帯域制限されるように、フィルタ特性Ｃ２を設定する。
【００３２】
そして、ハイパスフィルタ１、２ａを組み合わせてフィルタ特性Ｃ１を実現するとともに、ハイパスフィルタ１とローパスフィルタ２ｂを組み合わせてフィルタ特性Ｃ２を実現することにより、帯域制限された人間の音声スペクトル分布Ａ２の平均音圧を短時間音圧平均器４ａで算出することが可能となるとともに、帯域制限された車内雑音のスペクトル分布Ｂ２の平均音圧を長時間音圧平均器４ｂで算出することが可能となる。
【００３３】
このため、話者信号に含まれる音声信号と雑音信号とを分離して、音声信号に雑音信号が重畳されている場合においても、音声信号および雑音信号を効率よく抽出することが可能となる。
このため、発話が長時間に及んだ場合においても、雑音レベル検出時の音声信号の影響を抑圧することが可能となり、発話時間の影響による発話検出の精度の劣化を抑制することが可能となる。
【００３４】
ここで、短時間音圧平均器４ａに入力される信号と長時間音圧平均器４ｂに入力される信号とを異なる帯域に帯域制限し、短時間音圧平均器４ａから出力される信号に残存する雑音を除去するために、短時間音圧平均器４ａで算出される平均音圧から長時間音圧平均器４ｂで算出される平均音圧を差し引くと、差し引かれる雑音の帯域が異なるようになる。
【００３５】
一方、自動車内の定常雑音は、雑音のレベルの大小にかかわらず、スペクトルパターンはほぼ一定の形状を維持する。
例えば、低速走行の場合、自動車のエンジン音は小さくなり、そのエンジン音に比例してタイヤと路面間で生じるロードノイズや風切り音なども小さくなる。逆に、高速走行の場合、自動車のエンジン音は大きくなり、そのエンジン音に比例してタイヤと路面間で生じるロードノイズや風切り音なども大きくなる。
【００３６】
この結果、長時間音圧平均器４ｂで算出された平均音圧に重み補正値を乗じることにより、長時間音圧平均器４ｂで捉えられた帯域の雑音を短時間音圧平均器４ａで捉えられる帯域の雑音に変換することができる。
このため、短時間音圧平均器４ａの平均音圧から、重み補正値が乗じられた長時間音圧平均器４ｂの平均音圧を差し引くことにより、短時間音圧平均器４ａに入力される信号と長時間音圧平均器４ｂに入力される信号とが異なる帯域に帯域制限された場合においても、短時間音圧平均器４ａの平均音圧に残存する雑音を除去することが可能となる。
【００３７】
すなわち、雑音のレベルの大小にかかわらず、スペクトルパターンはほぼ一定の形状を維持する場合、図２（ｂ）に示すように、長時間音圧平均器４ｂで捉えることができる音圧の積分量Ｄと音声検出を正確に行うために得なければならない音圧量Ｅとの比はほぼ一定になる。
ここで、話者信号に含まれる音声信号から雑音を正確に除去した値を比較器７に入力するためには、以下の（１）式を減算器６で算出する必要がある。
【００３８】
（減算器出力）＝（短時間音圧平均器４ａの平均音圧）−Ｅ　　・・・（１）
ただし、Ｅは、図２（ｂ）の音声検出を正確に行うために得なければならない音圧量である。
一方、雑音のレベルの大小にかかわらず、スペクトルパターンはほぼ一定の形状を維持する場合、音声検出を正確に行うために得なければならない音圧量Ｅは、長時間音圧平均器４ｂで捉えることができる音圧の積分量Ｄを用いて以下の（２）式で算出することができる。
【００３９】
Ｅ＝（重み補正値）×Ｄ　　　　　　　　　　　　　　　　　・・・（２）
従って、（２）式を（１）式に代入することにより、減算器出力は、以下の（３）式を用いて算出することができる。
（減算器出力）＝（短時間音圧平均器４ａの平均音圧）−（重み補正値）×（長時間音圧平均器４ｂの平均音圧）・・・（３）
よって、短時間音圧平均器４ａの平均音圧を減算器６に入力するとともに、重み調整アンプ５を介し長時間音圧平均器４ｂの平均音圧を減算器６に入力することにより、話者信号に含まれる音声信号から雑音を正確に除去した値を算出することができ、話者信号に含まれる音声信号と雑音とが異なる帯域に帯域制限された場合においても、話者信号に含まれる音声信号から雑音を正確に除去した値を比較器７に入力することが可能となる。
【００４０】
なお、重み補正値は適当な値に設定することができ、例えば、無音時または雑音だけが入力された時は、減算器６からの出力がしきい値を越えないような値に設定してもよいし、音声＋雑音が入力された時に音声がある場合は、減算器６からの出力がしきい値を越えるような値に設定してもよい。
また、エンジン音やロードノイズなどは車種によってそれぞれ固有のものであるため、スペクトルパターンは車種によってそれぞれ固有のもとなるが、車種が変わった場合には、重み調整アンプ５で設定される重み補正を変更することができるので、どんな形状のスペクトルパターンであっても対応することができる。
【００４１】
【発明の効果】
以上説明したように、本発明によれば、話者信号の帯域制限を行うことにより、話者信号に含まれる音声信号と雑音信号とを分離することができ、雑音レベル検出時の音声信号の影響を抑圧することが可能となるため、発話時間の影響による発話検出の精度の劣化を抑制することが可能となる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る音声検出装置の概略構成を示すブロック図である。
【図２】本発明の一実施形態に係る自動車内での音声および雑音の周波数分布を示す図である。
【図３】従来の音声検出装置の概略構成を示すブロック図である。
【符号の説明】
１、２ａ　ハイパスフィルタ
２ｂ　ローパスフィルタ
３ａ、３ｂ　絶対値化回路
４ａ　短時間音圧平均器
４ｂ　長時間音圧平均器
５　重み調整アンプ
６　減算器
７　比較器
Ａ１　人間の音声のスペクトル分布
Ａ２　帯域制限後の人間の音声スペクトル分布
Ａ３　人間の音声
Ｂ１　車内雑音のスペクトル分布
Ｂ２　帯域制限後の車内雑音のスペクトル分布
Ｃ１、Ｃ２　フィルタ特性
Ｄ　長時間音圧平均器で捉えることができる音圧の積分量
Ｅ　音声検出を正確に行うために得なければならない音圧量
[0001]
[Technical field of the invention]
The present invention relates to a voice detection device, a voice detection method, and a voice detection program, and is particularly suited to accurately detecting a speaker's utterances.
[0002]
[Prior art]
One conventional method of determining whether a speaker is speaking measures the sound pressure of the speaker signal sent from the speaker side and compares it with a preset threshold value.
Another method, disclosed for example in Patent Document 1, improves the accuracy of this determination in a noisy environment: time averagers with different time constants are provided, the long-time average sound pressure is subtracted from the short-time average sound pressure, and the result is compared with a preset threshold value.
[0003]
FIG. 3 is a block diagram showing a schematic configuration of a conventional voice detection device.
In FIG. 3, the conventional voice detection device includes an absolute value conversion circuit 11 that converts the speaker signal into an absolute value, a short-time sound pressure averager 12a that calculates a short-time average sound pressure, a long-time sound pressure averager 12b that calculates a long-time average sound pressure, a subtractor 13 that subtracts the output of the long-time sound pressure averager 12b from the output of the short-time sound pressure averager 12a, and a comparator 14 that compares the output of the subtractor 13 with a threshold value.
[0004]
Here, the speaker signal includes a voice signal produced by the speaker's utterances and a noise signal from the noisy environment.
When the speaker signal is input to the absolute value conversion circuit 11, the circuit converts it into an absolute value and outputs the result to the short-time sound pressure averager 12a and the long-time sound pressure averager 12b.
[0005]
The short-time sound pressure averager 12a calculates the short-time average sound pressure of the absolute-valued speaker signal, the long-time sound pressure averager 12b calculates its long-time average sound pressure, and the subtractor 13 takes the difference between the two averages.
Here, the noise is almost stationary, whereas the speaker speaks intermittently.
[0006]
For this reason, the average sound pressure of the noise stays nearly constant regardless of the averaging time, and the noise levels captured by the short-time sound pressure averager 12a and the long-time sound pressure averager 12b are nearly equal. Subtracting the output of the long-time sound pressure averager 12b from the output of the short-time sound pressure averager 12a therefore cancels the influence of the noise. The average sound pressure of the voice signal behaves differently: with a long averaging time, the levels during speech and non-speech are averaged together, whereas with a short averaging time the average follows the speaker's speech and non-speech, so the level during each can be extracted.
[0007]
As a result, the short-time sound pressure averager 12a yields markedly different average sound pressures during speech and non-speech, so the level during each can be extracted, while the long-time sound pressure averager 12b holds a nearly constant average sound pressure regardless of speech or non-speech, so the average sound pressure of the noise can be extracted.
[0008]
Then, the output from the subtractor 13 is input to the comparator 14, and the output from the subtractor 13 is compared with a threshold.
When the output of the subtractor 13 is larger than the threshold value, the state is determined to be speech; when it is smaller than the threshold value, the state is determined to be non-speech.
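The conventional scheme of FIG. 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document does not say how the averagers are built, so one-pole exponential averages are assumed here, and the function names, smoothing constants, and threshold are hypothetical.

```python
def exp_average(samples, alpha):
    """One-pole running average; a smaller alpha means a longer averaging time."""
    avg, out = 0.0, []
    for x in samples:
        avg += alpha * (x - avg)
        out.append(avg)
    return out


def conventional_detect(signal, threshold, alpha_short=0.05, alpha_long=0.0005):
    """Return one speech / non-speech flag per sample (assumed structure of FIG. 3)."""
    rectified = [abs(x) for x in signal]           # absolute value conversion circuit 11
    short = exp_average(rectified, alpha_short)    # short-time averager 12a
    long_ = exp_average(rectified, alpha_long)     # long-time averager 12b
    # Subtractor 13 followed by comparator 14.
    return [s - l > threshold for s, l in zip(short, long_)]
```

Both averagers see the same full-band signal here, which is the property the following paragraphs identify as the weakness of this arrangement.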
[0009]
[Patent Document 1]
JP-A-1-12728
[0010]
[Problems to be solved by the invention]
However, although the short-time sound pressure averager 12a and the long-time sound pressure averager 12b in FIG. 3 use different averaging times, they calculate the average sound pressure of the speaker signal over the same frequency band.
Consequently, when an utterance lasts a long time, the steady-state sound pressure calculated by the long-time sound pressure averager 12b is more strongly affected by the voice and can no longer hold the correct value, and the accuracy of utterance detection deteriorates.
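The drift described above can be reproduced numerically. The sketch below is an assumption-laden illustration, not the measurement used in the document: it feeds an already rectified level (steady noise followed by one uninterrupted utterance) into two exponential averagers that share the same band, using hypothetical levels, smoothing constants, and threshold.

```python
def exp_average(levels, alpha):
    """One-pole running average; a smaller alpha means a longer averaging time."""
    avg, out = 0.0, []
    for x in levels:
        avg += alpha * (x - avg)
        out.append(avg)
    return out


NOISE_LEVEL, SPEECH_LEVEL, THRESHOLD = 0.1, 1.0, 0.3   # hypothetical values
ONSET = 20000                                          # sample index where speech starts

# Rectified level: steady noise, then one uninterrupted 6000-sample utterance.
levels = [NOISE_LEVEL] * ONSET + [SPEECH_LEVEL] * 6000
short = exp_average(levels, alpha=0.05)      # follows the utterance within tens of samples
long_ = exp_average(levels, alpha=0.0005)    # intended to hold the noise floor
speaking = [s - l > THRESHOLD for s, l in zip(short, long_)]

# Offset from the onset at which the detector wrongly falls back to "non-speech".
dropout = next(i for i in range(100, 6000) if not speaking[ONSET + i])
```

With these constants the long-time average climbs toward the speech level, and the difference sinks below the threshold after roughly 2,200 samples even though the speaker is still talking.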
[0011]
Therefore, an object of the present invention is to provide a voice detection device, a voice detection method, and a voice detection program capable of suppressing deterioration in accuracy of utterance detection due to the influence of utterance time.
[0012]
[Means for Solving the Problems]
To solve the above problem, the voice detection device according to claim 1 receives a speaker signal including a voice signal and a noise signal, compares the level of the voice signal with a preset reference level, and determines from the comparison result whether the state is speech or non-speech. The device comprises: voice band limiting means for band-limiting the speaker signal to the frequency band in which the voice signal is distributed; first time averaging means for averaging the level of the output signal of the voice band limiting means over time; noise band limiting means for band-limiting the speaker signal to the frequency band in which the noise signal is distributed; second time averaging means for averaging the level of the output signal of the noise band limiting means over a time longer than the averaging time of the first time averaging means; subtracting means for subtracting the output signal of the second time averaging means from the output signal of the first time averaging means; and a comparator for comparing the level of the output signal of the subtracting means with a preset reference level.
[0013]
This makes it possible to suppress the noise signal input to the first time averaging means so that the voice signal is extracted efficiently, and to suppress the voice signal input to the second time averaging means so that the noise signal is extracted efficiently.
Therefore, even when an utterance lasts a long time, the influence of the voice signal on noise-level detection is suppressed and the noise level calculated by the second time averaging means becomes more accurate, so that deterioration in utterance-detection accuracy due to the utterance time can be suppressed.
[0014]
The voice detection device according to claim 2 further comprises, downstream of the second time averaging means, weight correction means that weights the level of the output signal of the second time averaging means to correct its frequency characteristic.
This makes it possible to correct the level of the output signal of the second time averaging means so that it corresponds to the band of the signal input to the first time averaging means.
[0015]
Therefore, even when the band of the signal input to the first time averaging means differs from the band of the signal input to the second time averaging means, subtracting the output signal of the second time averaging means from the output signal of the first time averaging means removes the noise component remaining in the output signal of the first time averaging means, and the accuracy of voice detection can be improved.
[0016]
In the voice detection device according to claim 3, the voice band limiting means and the noise band limiting means comprise a first high-pass filter that receives and filters the speaker signal, and a second high-pass filter and a low-pass filter that each filter the output signal of the first high-pass filter. The speaker signal passes through the first high-pass filter into the second high-pass filter, where it is band-limited to the frequency band in which the voice signal is distributed, and the output signal of the second high-pass filter is input to the first time averaging means. The speaker signal also passes through the first high-pass filter into the low-pass filter, where it is band-limited to the frequency band in which the noise signal is distributed, and the output signal of the low-pass filter is input to the second time averaging means.
[0017]
This makes it possible to separate the low-frequency noise band found in, for example, the cabin of an automobile from the voice band in which the centers of the voice formant frequencies are concentrated, so the accuracy of voice detection in a specific environment such as an automobile cabin can easily be improved.
In the voice detection method according to claim 4, voice contained in a speaker signal is detected by making the frequency band used to extract the voice signal from the speaker signal different from the frequency band used to extract noise from the speaker signal.
[0018]
This makes it possible to separate the voice signal and the noise signal contained in the speaker signal efficiently and, even when a noise signal is superimposed on the voice signal, to reduce the influence of the noise level on voice-level detection, so the accuracy of voice detection can be improved.
The voice detection program according to claim 5 comprises a step of band-limiting a speaker signal to a first frequency band in which a voice signal is distributed, a step of band-limiting the speaker signal to a second frequency band in which a noise signal is distributed, and a step of detecting voice contained in the speaker signal based on a comparison between the speaker signal band-limited to the first frequency band and the speaker signal band-limited to the second frequency band.
[0019]
This makes it possible to separate the voice signal and the noise signal contained in the speaker signal, and to extract both efficiently even when the noise signal is superimposed on the voice signal.
Therefore, even when an utterance lasts a long time, the influence of the voice signal on noise-level detection can be suppressed, and deterioration in utterance-detection accuracy due to the utterance time can be suppressed.
[0020]
[Embodiments of the invention]
Hereinafter, a voice detection device according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram illustrating a schematic configuration of a voice detection device according to an embodiment of the present invention.
[0021]
In FIG. 1, the voice detection device includes high-pass filters 1 and 2a, a low-pass filter 2b, an absolute value conversion circuit 3a that converts the speaker signal band-limited by the high-pass filters 1 and 2a into an absolute value, an absolute value conversion circuit 3b that converts the speaker signal band-limited by the high-pass filter 1 and the low-pass filter 2b into an absolute value, a short-time sound pressure averager 4a that calculates a short-time average sound pressure, a long-time sound pressure averager 4b that calculates a long-time average sound pressure, a weight adjustment amplifier 5 that corrects the signal level output from the long-time sound pressure averager 4b, a subtractor 6 that subtracts the output of the weight adjustment amplifier 5 from the output of the short-time sound pressure averager 4a, and a comparator 7 that compares the output of the subtractor 6 with a threshold value.
[0022]
Here, the speaker signal includes a voice signal produced by the speaker's utterances and a noise signal from the noisy environment.
The high-pass filter 1 blocks signals in bands that are not transmitted over a telephone line. Since the telephone-line band normally starts at 300 Hz, the pass band of the high-pass filter 1 can be set to match it.
[0023]
The high-pass filter 2a extracts signals in the band where human formant frequencies are mainly distributed and blocks the low-frequency noise generated in, for example, an automobile cabin. Because the centers of human formant frequencies are concentrated around 2000 Hz, setting the cutoff frequency of the high-pass filter 2a to around 2000 Hz makes it possible to extract the human voice accurately while reducing the influence of low-frequency noise.
[0024]
The low-pass filter 2b extracts signals in a band that is only weakly correlated with the signal band-limited by the high-pass filter 2a. For example, it extracts the low-frequency noise generated in an automobile cabin and blocks signals in the band where human formant frequencies are mainly distributed.
That is, because the centers of human formant frequencies are concentrated around 2000 Hz and the noise generated in an automobile cabin is concentrated at lower frequencies, setting the cutoff frequency of the low-pass filter 2b to around 2000 Hz makes it possible to extract the noise sound pressure accurately while reducing the influence of the human voice.
[0025]
The speaker signal from which the low-frequency component of 300 Hz or less has been cut by the high-pass filter 1 is input to the high-pass filter 2a and the low-pass filter 2b, respectively.
The high-pass filter 2a extracts the signal in the band around 2000 Hz where human formant frequencies are mainly distributed, cuts the low-frequency noise generated in an automobile cabin or the like, and passes the result to the absolute value conversion circuit 3a.
[0026]
When the signal extracted by the high-pass filter 2a is input to the absolute value conversion circuit 3a, the signal is converted to an absolute value and then output to the short-time sound pressure averager 4a.
The short-time sound pressure averager 4a then calculates the short-time average sound pressure of the signal converted into an absolute value by the absolute value conversion circuit 3a and outputs it to the subtractor 6.
Meanwhile, the low-pass filter 2b extracts the low-frequency noise generated in an automobile cabin or the like, cuts the signal in the band around 2000 Hz where human formant frequencies are mainly distributed, and passes the result to the absolute value conversion circuit 3b.
[0027]
When the signal extracted by the low-pass filter 2b is input to the absolute value conversion circuit 3b, the signal is converted to an absolute value and then output to the long-time sound pressure averager 4b.
The long-time sound pressure averager 4b then calculates the long-time average sound pressure of the signal converted into an absolute value by the absolute value conversion circuit 3b and outputs it to the weight adjustment amplifier 5.
When the output of the long-time sound pressure averager 4b is input to the weight adjustment amplifier 5, the amplifier corrects that average sound pressure so that the level measured in the band of the long-time sound pressure averager 4b corresponds to the band of the signal output from the short-time sound pressure averager 4a.
[0028]
The average sound pressure calculated by the short-time sound pressure averager 4a is input to the subtractor 6, and the average sound pressure calculated by the long-time sound pressure averager 4b is input to the subtractor 6 via the weight adjustment amplifier 5; the subtractor 6 takes the difference of these values and passes it to the comparator 7.
Because the weight adjustment amplifier 5 corrects the frequency characteristic of the average sound pressure output from the long-time sound pressure averager 4b, subtracting signals that were band-limited to different bands still removes, from the output signal of the short-time sound pressure averager 4a, the residual noise component that the high-pass filter 2a could not remove, and the accuracy of voice detection can be improved.
[0029]
When the output from the subtractor 6 is input to the comparator 7, the output from the subtractor 6 is compared with a threshold.
When the output of the subtractor 6 is larger than the threshold value, the state is determined to be speech; when it is smaller than the threshold value, the state is determined to be non-speech.
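The signal path of FIG. 1 can be sketched as follows. This is a minimal sketch under assumptions that are not in the document: an 8 kHz sample rate, second-order Butterworth sections in the widely used audio-EQ-cookbook biquad form for filters 1, 2a, and 2b, and exponential averages for the averagers 4a and 4b. All names and default constants are hypothetical.

```python
import math

FS = 8000.0  # assumed sample rate in Hz; the document does not specify one


def biquad(samples, kind, cutoff_hz, fs=FS):
    """Second-order Butterworth low-pass ("low") or high-pass ("high") section."""
    w0 = 2.0 * math.pi * cutoff_hz / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)          # Q = 1/sqrt(2)
    if kind == "low":
        b0, b1, b2 = (1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2
    else:
        b0, b1, b2 = (1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2
    a0, a1, a2 = 1 + alpha, -2 * cos_w0, 1 - alpha
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out


def mean_abs(samples, alpha):
    """Rectify, then take a one-pole running average (smaller alpha = longer time)."""
    avg, out = 0.0, []
    for x in samples:
        avg += alpha * (abs(x) - avg)
        out.append(avg)
    return out


def detect(signal, weight, threshold, alpha_short=0.02, alpha_long=0.0005):
    """Return one speech / non-speech flag per sample (assumed structure of FIG. 1)."""
    pre = biquad(signal, "high", 300.0)       # high-pass filter 1
    voice = biquad(pre, "high", 2000.0)       # high-pass filter 2a
    noise = biquad(pre, "low", 2000.0)        # low-pass filter 2b
    short = mean_abs(voice, alpha_short)      # circuit 3a and averager 4a
    long_ = mean_abs(noise, alpha_long)       # circuit 3b and averager 4b
    # Weight adjustment amplifier 5, subtractor 6, comparator 7.
    return [s - weight * l > threshold for s, l in zip(short, long_)]
```

Because the long-time averager only sees the band below about 2000 Hz, a sustained utterance concentrated above that band raises the noise estimate far less than it would in the shared-band arrangement of FIG. 3.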
[0030]
FIG. 2 is a diagram illustrating a frequency distribution of voice and noise in a vehicle according to an embodiment of the present invention.
In FIG. 2, the spectral distribution A1 of the human voice normally spans 300 to 3500 Hz, and the center of the formant frequencies of the human voice A3 lies near 2000 Hz.
[0031]
On the other hand, the peak of the in-vehicle noise spectral distribution B1 lies at lower frequencies than the peak of the human voice spectral distribution A1, and the region around 2000 Hz, where the center of the formant frequencies of the voice A3 is concentrated, corresponds to the tail of the in-vehicle noise spectral distribution B1.
Therefore, the filter characteristic C1 is set so that the spectral distribution A1 of the human voice is band-limited to around 2000 Hz and above, and the filter characteristic C2 is set so that the spectral distribution B1 of the in-vehicle noise is band-limited to around 2000 Hz and below.
[0032]
Combining the high-pass filters 1 and 2a realizes the filter characteristic C1, and combining the high-pass filter 1 with the low-pass filter 2b realizes the filter characteristic C2. The short-time sound pressure averager 4a can then calculate the average sound pressure of the band-limited human voice spectral distribution A2, and the long-time sound pressure averager 4b can calculate the average sound pressure of the band-limited in-vehicle noise spectral distribution B2.
[0033]
Therefore, the voice signal and the noise signal contained in the speaker signal can be separated, and both can be extracted efficiently even when the noise signal is superimposed on the voice signal.
As a result, even when an utterance lasts a long time, the influence of the voice signal on noise-level detection can be suppressed, and deterioration in utterance-detection accuracy due to the utterance time can be suppressed.
[0034]
Here, when the signals input to the short-time sound pressure averager 4a and the long-time sound pressure averager 4b are band-limited to different bands, and the average sound pressure calculated by the long-time sound pressure averager 4b is subtracted from the average sound pressure calculated by the short-time sound pressure averager 4a to remove the noise remaining in the output of the short-time sound pressure averager 4a, the noise being subtracted comes from a different band than the noise being removed.
[0035]
However, the spectral pattern of stationary noise in an automobile keeps a nearly constant shape regardless of the noise level.
For example, at low speed the engine sound is quieter, and the road noise generated between the tires and the road surface and the wind noise become quieter in proportion to it. Conversely, at high speed the engine sound is louder, and the road noise and wind noise become louder in proportion to it.
[0036]
As a result, multiplying the average sound pressure calculated by the long-time sound pressure averager 4b by a weight correction value converts the noise captured in the band of the long-time sound pressure averager 4b into the noise captured in the band of the short-time sound pressure averager 4a.
Therefore, subtracting the weighted average sound pressure of the long-time sound pressure averager 4b from the average sound pressure of the short-time sound pressure averager 4a removes the noise remaining in the average sound pressure of the short-time sound pressure averager 4a, even when the signals input to the two averagers are band-limited to different bands.
[0037]
That is, when the spectral pattern keeps a nearly constant shape regardless of the noise level, the ratio between the integrated sound pressure D that the long-time sound pressure averager 4b can capture and the sound pressure amount E that must be obtained for accurate voice detection is nearly constant, as shown in FIG. 2(b).
To input to the comparator 7 a value in which the noise has been accurately removed from the voice signal contained in the speaker signal, the subtractor 6 must calculate the following equation (1).
[0038]
(Subtractor output) = (average sound pressure of short-time sound pressure averager 4a) − E (1)
Here, E is a sound pressure amount that must be obtained in order to accurately perform the sound detection in FIG.
On the other hand, when the spectrum pattern keeps a substantially constant shape regardless of the noise level, the sound pressure amount E that must be obtained for accurate voice detection can be calculated by the following equation (2), using the integrated amount D of sound pressure that can be captured by the long-time sound pressure averager 4b.
[0039]
E = (weight correction value) × D (2)
Therefore, by substituting equation (2) into equation (1), the subtractor output can be calculated using the following equation (3).
(Subtractor output) = (average sound pressure of short-time sound pressure averager 4a) − (weight correction value) × (average sound pressure of long-time sound pressure averager 4b) (3)
Accordingly, by inputting the average sound pressure of the short-time sound pressure averager 4a to the subtractor 6, and inputting the average sound pressure of the long-time sound pressure averager 4b to the subtractor 6 via the weight adjustment amplifier 5, a value obtained by accurately removing noise from the voice signal included in the speaker signal can be calculated. Even when the voice signal and the noise included in the speaker signal are band-limited to different bands, this noise-free value can be input to the comparator 7.
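The signal path through the absolute value conversion, the two averagers, the weight adjustment amplifier 5, the subtractor 6 and the comparator 7 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `detect_voice`, its parameters, and the use of exponential moving averages as the short-time and long-time averagers are not taken from the original, which does not specify how the averagers are built. The two inputs are assumed to be already band-limited by the filters 1, 2a and 2b.

```python
def detect_voice(voice_band, noise_band, weight, threshold,
                 short_alpha=0.5, long_alpha=0.01):
    """Return one True/False voice decision per sample pair.

    voice_band / noise_band: samples already band-limited to the voice band
    and to the low-frequency noise band (filtering is assumed done upstream).
    short_alpha / long_alpha: smoothing factors of the assumed exponential
    averagers; a larger factor means a shorter averaging time.
    """
    short_avg = None  # state of the short-time averager (4a)
    long_avg = None   # state of the long-time averager (4b)
    decisions = []
    for v, n in zip(voice_band, noise_band):
        level_v, level_n = abs(v), abs(n)  # absolute value conversion (3a, 3b)
        if short_avg is None:
            # Start both averagers from the first sample to avoid a start-up transient.
            short_avg, long_avg = level_v, level_n
        else:
            short_avg += short_alpha * (level_v - short_avg)
            long_avg += long_alpha * (level_n - long_avg)
        # Equation (3): weight adjustment (5) followed by subtraction (6).
        subtractor_output = short_avg - weight * long_avg
        # Comparator (7): voice is reported when the threshold is exceeded.
        decisions.append(subtractor_output > threshold)
    return decisions
```

With steady noise in both bands and a weight equal to the ratio of the two noise levels, the subtractor output stays near zero and rises only while the voice-band level increases. Because the short-time averager decays gradually, the decision stays true for a few samples after the voice ends.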
[0040]
The weight correction value can be set to any appropriate value. For example, it may be set so that the output from the subtractor 6 does not exceed the threshold value when there is no sound or when only noise is input. Alternatively, it may be set so that the output from the subtractor 6 exceeds the threshold value whenever a voice is present in a voice-plus-noise input.
Further, since the engine sound, road noise, and the like differ from one vehicle type to another, the spectrum pattern is also unique to each vehicle type. When the vehicle type changes, however, the weight correction value set in the weight adjustment amplifier 5 can be changed, so that a spectrum pattern of any shape can be handled.
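One possible way to choose the weight correction value from a noise-only recording is sketched below. The function `calibrate_weight` and its arguments are hypothetical and are not described in the original. The sketch picks the smallest non-negative weight for which the subtractor output of equation (3) does not exceed the threshold value on any noise-only frame, which corresponds to the first setting condition mentioned above.

```python
def calibrate_weight(short_avgs, long_avgs, threshold):
    """Smallest weight w >= 0 with short - w * long <= threshold on every frame.

    short_avgs / long_avgs: outputs of the short-time and long-time averagers
    recorded while only noise is present (hypothetical calibration data).
    """
    candidates = [
        (s - threshold) / l
        for s, l in zip(short_avgs, long_avgs)
        if l > 0.0  # frames with no noise-band energy give no constraint here
    ]
    if not candidates:
        raise ValueError("need at least one frame with a non-zero noise-band level")
    # The largest per-frame requirement satisfies all frames at once.
    return max(0.0, max(candidates))
```

Such a calibration could be repeated for each vehicle type, since the spectrum pattern, and therefore the suitable weight, differs between vehicle types.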
[0041]
[Effect of the invention]
As described above, according to the present invention, band-limiting the speaker signal makes it possible to separate the voice signal and the noise signal included in the speaker signal, and to suppress the influence of the voice signal when the noise level is detected. It is therefore possible to suppress the deterioration in the accuracy of utterance detection caused by the influence of the utterance time.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a schematic configuration of a voice detection device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a frequency distribution of voice and noise in a vehicle according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating a schematic configuration of a conventional voice detection device.
[Explanation of symbols]
1, 2a High-pass filter
2b Low-pass filter
3a, 3b Absolute value conversion circuit
4a Short-time sound pressure averager
4b Long-time sound pressure averager
5 Weight adjustment amplifier
6 Subtractor
7 Comparator
A1 Human voice spectrum distribution
A2 Human voice spectrum distribution after band limitation
A3 Human voice
B1 In-vehicle noise spectrum distribution
B2 In-vehicle noise spectrum distribution after band limitation
C1, C2 Filter characteristics
D Integrated amount of sound pressure that can be captured by the long-time sound pressure averager
E Sound pressure amount that must be obtained for accurate voice detection

Claims

1. A voice detection device that receives a speaker signal including a voice signal and a noise signal, compares the level of the voice signal with a predetermined reference level, and determines, based on the comparison result, whether the state is a voice state or a non-voice state, the voice detection device comprising:
voice band limiting means for band-limiting the speaker signal to a frequency band in which the voice signal is distributed;
first time averaging means for time-averaging the level of the output signal of the voice band limiting means;
noise band limiting means for band-limiting the speaker signal to a frequency band in which the noise signal is distributed;
second time averaging means for averaging the level of the output signal of the noise band limiting means over a time longer than the averaging time of the first time averaging means;
subtraction means for subtracting the output signal of the second time averaging means from the output signal of the first time averaging means; and
a comparator for comparing the level of the output signal of the subtraction means with a preset reference level.

2. The voice detection device according to claim 1, further comprising weight correcting means, provided at a stage subsequent to the second time averaging means, for weighting the output signal level of the second time averaging means and thereby correcting a frequency characteristic.

3. The voice detection device according to claim 1, wherein the voice band limiting means and the noise band limiting means comprise:
a first high-pass filter that receives and filters the speaker signal; and
a second high-pass filter and a low-pass filter that each filter an output signal of the first high-pass filter,
wherein the speaker signal is input to the second high-pass filter via the first high-pass filter and band-limited to the frequency band in which the voice signal is distributed,
an output signal of the second high-pass filter is input to the first time averaging means,
the speaker signal is input to the low-pass filter via the first high-pass filter and band-limited to the frequency band in which the noise signal is distributed, and
an output signal of the low-pass filter is input to the second time averaging means.

4. A voice detection method comprising detecting a voice included in a speaker signal by making the frequency band used for extracting a voice signal from the speaker signal different from the frequency band used for extracting noise from the speaker signal.

5. A voice detection program that causes a computer to execute:
a step of band-limiting a speaker signal to a first frequency band in which a voice signal is distributed;
a step of band-limiting the speaker signal to a second frequency band in which a noise signal is distributed; and
a step of detecting a voice included in the speaker signal based on a comparison result between the speaker signal band-limited to the first frequency band and the speaker signal band-limited to the second frequency band.