JP2004309536A

JP2004309536A - Speech processing unit

Info

Publication number: JP2004309536A
Application number: JP2003098870A
Authority: JP
Inventors: Kenji Narumi; 健司鳴海; Katsuhide Kumagai; 勝秀熊谷; Koji Okuda; 幸治奥田
Original assignee: Tokai Rika Co Ltd
Current assignee: Tokai Rika Co Ltd
Priority date: 2003-04-02
Filing date: 2003-04-02
Publication date: 2004-11-04

Abstract

PROBLEM TO BE SOLVED: To obtain a speech processing unit that reliably estimates whether a speaker is speaking, based on speech signals input from microphones.

SOLUTION: The speech processing unit 10 receives a 1st speech signal and a 2nd speech signal from a 1st microphone 12A and a 2nd microphone 12B arranged nearly symmetrically with respect to a specified speaker. It calculates voice power as the power of the sum signal of the 1st and 2nd speech signals, and noise power as the power of the difference signal generated by subtracting the 2nd speech signal from the 1st speech signal. A speaking/non-speaking state estimation part 32 estimates whether the speaker is speaking according to whether the voice power is larger or smaller than the noise power, which is not influenced by the speaker's speaking.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば、音声認識装置や通話装置に適用される音声処理装置に関する。
【０００２】
【従来の技術】
例えば、自動車等の車両に搭載されるナビゲーション装置やオーディオ装置には、マイクロホンから乗員の音声を入力して認識し、該認識内容に基づいて各種処理を行なわせる音声認識装置を備えたものが知られている。また、車両には、運転者が運転中にマイクロホン等（例えば、携帯電話の本体）を持つことなく通話するためのハンズフリー通話装置を備えたものがある。
【０００３】
また、このような音声認識装置や通話装置には、２つのマイクロホンを備えて乗員（発話者）が発した声以外の音声を除去するノイズキャンセル機能を実現したものがある（例えば、特許文献１参照）。具体的には、一方のマイクロホンより入力した音声信号から、他方のマイクロホンより入力しノイズキャンセルフィルタによって処理されたノイズ信号を差し引くことで、発話者の声に対応する音声信号（出力信号）を抽出して出力するようになっている。ノイズキャンセルフィルタとしては、適応フィルタが用いられ、上記差分信号である出力信号のパワーが最小となるようにフィルタ係数が更新される構成とされている。
【０００４】
しかしながら、ノイズキャンセルフィルタがフィルタ係数を常時更新する構成では、発話者の声がノイズに対し大きい場合に、該発話者の声に対応する信号の一部も除去（キャンセル）され、出力信号に大きな歪が生じてしまう。このため、上記音声認識装置や通話装置では、発話者が発話する際には、該発話者が発話スイッチを押圧操作することで、ノイズキャンセルフィルタのフィルタ係数を固定して発話者の声に対応する信号の一部が除去されることを防止し、このようにして得られた出力信号を受け付けるようになっている。これにより、出力信号の歪みが防止される。
【０００５】
さらに、上記のような音声認識装置や通話装置には、上記構成とは別の構成によってノイズキャンセル機能を実現したものが知られている（例えば、特許文献２参照）。この特許文献２記載の車両用音声入力装置は、車両の各座席にそれぞれ対応して複数のマイクロホンを備えており、各マイクロホンから入力した音声信号レベルの過去所定時間における移動平均値に基づいて各閾値（ノイズに対応するパワー）を設定し、何れかのマイクロホンにおける入力信号レベルが対応する閾値を超えると何れかの乗員の発話による音声が入力されたと判断し、かつ、特定の乗員（座席）に対応するマイクロホンからの入力信号レベルが他の乗員に対応するマイクロホンからの入力信号レベルよりも大きい場合に、該特定の乗員が声を発したと判断して出力信号を出力するようになっている。
【０００６】
そして、この出力信号は、特定の乗員（座席）に対応するマイクロホンからの入力信号から、他の乗員に対応するマイクロホンからの入力信号を差し引いてノイズ除去された信号とされている。すなわち、この構成では、適応フィルタであるノイズキャンセルフィルタを備えず、基本的に、他の乗員に対応したマイクロホン（特定の乗員に対応するマイクロホンから十分に離間して配置されたもの）からの入力をノイズ信号としている。これにより、この構成では、発話スイッチを操作しなくても、特定の乗員による発話の有無を判断し、該特定の乗員が発話したと判断した場合には、該特定の乗員に対応したマイクロホンからの入力信号からノイズを除去して得た出力信号を音声認識装置に出力することができる。
【０００７】
【特許文献１】
特開２０００−１４８２００号公報
【特許文献２】
特開平１１−６５５８６号公報
【０００８】
【発明が解決しようとする課題】
しかしながら、上記前者の構成では、発話者は発話の度に発話スイッチを操作しなければならず、操作が煩雑であるという問題があった。
【０００９】
一方、後者の構成では、複数のマイクロホンが異なる乗員に対応して配置されるため、換言すれば、特定の乗員に対応するマイクロホンが１つでありノイズを分離できないため、発話の有無の判断基準である閾値が音声の大小やノイズの大小によって変動しやすく、発話有無の判断精度が悪いという問題があった。したがって、例えば、特定の乗員に対応するマイクロホンの近傍でノイズが発生する（該マイクロホンに大きな入力を与えるノイズが生じる）と、該ノイズは他のマイクロホンへの小さな入力信号によっては除去できず、発話の有無の判断を誤る可能性が高かった。特に、発話有無の判断の際に上記閾値（移動平均値に基づくパワー）と比較される入力信号レベルが瞬時値であるため、この問題が顕著となる。
【００１０】
本発明は、上記事実を考慮して、発話者による発話の有無を、マイクロホンから入力される音声信号に基づいて確実に推定することができる音声処理装置を得ることが目的である。
【００１１】
【課題を解決するための手段】
上記目的を達成するために請求項１記載の発明に係る音声処理装置は、特定の発話者に対し略対称に配置された第１のマイクロホン及び第２のマイクロホンからそれぞれ入力される第１音声信号及び第２音声信号を処理する音声処理装置であって、前記第１音声信号と第２音声信号とを加算した加算信号のパワーと、前記第１音声信号から第２音声信号を差し引いた差分信号のパワーとを比較して前記発話者による発話の有無を推定する発話状態推定手段を備えた、ことを特徴としている。
【００１２】
請求項１記載の音声処理装置では、第１のマイクロホンから第１音声信号が入力されると共に、第２のマイクロホンから第２音声信号が入力される。これらの第１及び第２音声信号は、第１及び第２のマイクロホンが発話者に対し略対称に配置されていることにより、発話者が発した声については位相、大きさとも略同等の信号であり、発話者が発した声以外の音声すなわちノイズについては発生源との相対位置に応じて異なるものとなる。
【００１３】
第１及び第２音声信号が入力されると、発話状態推定手段は、第１音声信号と第２音声信号とを加算した加算信号のパワー（以下、音声パワーという）と、第１音声信号から第２音声信号を差し引いた差分信号のパワー（以下、ノイズパワーという）とを比較して、発話者による発話の有無を推定（判定、判断）する。なお、音声パワーとノイズパワーとを直接的に比較せず、これらの一方または双方を適宜処理したパワー等を比較対象としても良いことは言うまでもない。
【００１４】
具体的には、音声パワーは、発話者が声を発していないときには、共にノイズのみである第１及び第２音声信号の加算信号のパワーであるから、共にノイズのみである第１及び第２音声信号の差分信号のパワーであるノイズパワーに対する比が小さい。一方、発話者が声を発したときには、上記加算信号のパワーである音声パワーは、該発話者の声に対応する信号のパワーを含むため、上記差分信号のパワーであり発話者の声に対応する信号のパワーを含まないノイズパワーに対する比が十分に大きい。そして、ノイズパワーは、発話者の発話の有無により位相及び大きさに殆ど差を生じないので、例えば、このノイズパワーに適当な係数を乗じた閾値と音声パワーとの比較によって、発話者による発話の有無が推定される。
【００１５】
そして、それぞれ特定の発話者に対応して配置された第１及び第２マイクロホンからの第１及び第２音声信号、すなわち空間的な音の情報が含まれた信号に基づいて、発話の有無により影響を受け難いノイズパワーを分離してこれを基準に上記発話の有無を推定するため、ノイズの発生源の位置に依らず、上記発話の有無を確実に推定することができる。しかも、音声パワーとノイズパワーとを比較して発話の有無を推定するため、瞬時値を用いて発話の有無を推定する場合と比較して、誤推定の可能性が著しく低減される。
【００１６】
これにより、従来の如く発話スイッチを操作しなくても、発話の有無に基づいた制御を行なうことが可能となる。したがって、例えば、発話有りと推定された場合にのみ音声入力装置（音声認識装置や通話装置等）へ音声信号を出力したり、発話有りと推定された場合にノイズキャンセルフィルタのフィルタ係数を変更したり固定したりする等の制御が可能となる。
【００１７】
このように、請求項１記載の音声処理装置では、発話者による発話の有無を、マイクロホンから入力される音声信号に基づいて確実に推定することができる。
【００１８】
また、上記目的を達成するために請求項２記載の発明に係る音声処理装置は、特定の発話者に対し略対称に配置された第１のマイクロホン及び第２のマイクロホンからそれぞれ入力される第１音声信号及び第２音声信号を処理する音声処理装置であって、前記第１音声信号と第２音声信号とを加算した加算信号のパワーを計算して音声パワーを得る音声パワー計算手段と、前記第１音声信号から第２音声信号を差し引いた差分信号のパワーを計算してノイズパワーを得るノイズパワー計算手段と、前記音声パワーとノイズパワーとの差分と、該ノイズパワーとを比較して発話の有無を推定する発話状態推定手段と、を備えている。
【００１９】
請求項２記載の音声処理装置では、第１のマイクロホンから第１音声信号が入力されると共に、第２のマイクロホンから第２音声信号が入力される。これらの第１及び第２音声信号は、第１及び第２のマイクロホンが発話者に対し略対称に配置されていることにより、発話者が発した声については位相、大きさとも略同等の信号であり、発話者が発した声以外の音声すなわちノイズについては発生源との相対位置に応じて異なるものとなる。
【００２０】
第１及び第２音声信号が入力されると、音声パワー計算手段が第１音声信号と第２音声信号とを加算した加算信号のパワーを計算して音声パワーを得ると共に、ノイズパワー計算手段が第１音声信号から第２音声信号を差し引いた差分信号のパワーを計算してノイズパワーを得る。そして、発話状態推定手段は、音声パワーとノイズパワーとの差分（以下、擬似発話パワーという）と、ノイズパワーとを比較することで、発話者による発話の有無を推定（判定、判断）する。
【００２１】
具体的には、音声パワーとノイズパワーとの差分である擬似発話パワーは、全音声信号のパワーからノイズ信号のパワーを差し引いたものに略相当し、発話者が声を発していないときには該発話者の声に対応する信号のパワーを含まないためにノイズパワーに対する比が小さく、発話者が声を発したときには該発話者の声に対応する信号のパワーを含むためノイズパワーに対する比が十分に大きい。そして、ノイズパワーは、発話者の発話の有無により位相及び大きさに殆ど差を生じないので、例えば、このノイズパワーに適当な係数を乗じた閾値と擬似発話パワーとの比較によって、発話者による発話の有無が推定される。
【００２２】
そして、それぞれ特定の発話者に対応して配置された第１及び第２マイクロホンからの第１及び第２音声信号、すなわち空間的な音の情報が含まれた信号に基づいて、発話の有無により影響を受け難いノイズパワーを分離してこれを基準に上記発話の有無を推定するため、ノイズの発生源の位置に依らず、上記発話の有無を確実に推定することができる。しかも、擬似発話パワーとノイズパワーとを比較して発話の有無を推定するため、瞬時値を用いて発話の有無を推定する場合と比較して、誤推定の可能性が著しく低減される。
【００２３】
これにより、従来のように発話スイッチを操作しなくても、発話の有無に基づいた制御を行なうことが可能となる。したがって、例えば、発話有りと推定された場合にのみ音声入力装置（音声認識装置や通話装置等）へ音声信号を出力したり、発話有りと推定された場合にノイズキャンセルフィルタのフィルタ係数を変更したり固定したりすることが可能となる。
【００２４】
このように、請求項２記載の音声処理装置では、発話者による発話の有無を、マイクロホンから入力される音声信号に基づいて確実に推定することができる。
【００２５】
なお、請求項１及び請求項２におけるパワーは、所定時間（所定移動時間）における信号の２乗平均値に限られず、例えば、所定時間における信号の平均値を含むフィルタ処理結果、パワースペクトル等が含まれる。
【００２６】
【発明の実施の形態】
本発明の実施の形態に係る音声処理装置１０について、図１に基づいて説明する。図１には、音声処理装置１０の全体構成が概略のブロック図にて示されている。この図に示される如く、音声処理装置１０は、マイクアレイ１２と電気的に接続されており、マイクアレイ１２から入力した音声信号を処理する構成とされている。
【００２７】
マイクアレイ１２は、第１マイクロホン１２Ａと第２マイクロホン１２Ｂとで構成されており、第１マイクロホン１２Ａと第２マイクロホン１２Ｂとは、特定の発話者の正面で該発話者に対し略対称に配置されている。すなわち、特定の発話者が発した音声は、第１マイクロホン１２Ａと第２マイクロホン１２Ｂとに略同等の位相、大きさで入力されるようになっている。
【００２８】
これらの第１マイクロホン１２Ａ、第２マイクロホン１２Ｂは、それぞれＡ／Ｄ変換器１４、１６を介して音声処理装置１０に電気的に接続されており、音声処理装置１０には、Ａ／Ｄ変換器１４、１６によってデジタル化された音声信号が入力される構成である。なお、以下単に「接続」というときは、電気的な接続を意味するものとする。
【００２９】
音声処理装置１０は、加算部１８、減算部２０を備えている。加算部１８は、Ａ／Ｄ変換器１４、１６とそれぞれ接続されており、Ａ／Ｄ変換器１４から入力した第１音声信号とＡ／Ｄ変換器１６から入力した第２音声信号とを加算して出力する構成である。一方、減算部２０は、Ａ／Ｄ変換器１４から入力した第１音声信号とＡ／Ｄ変換器１６から入力した第２音声信号との差分（絶対値）を計算して出力する構成である。
【００３０】
また、加算部１８は、バンドパスフィルタ（以下、ＢＰＦという）２２と接続されており、減算部２０は、ＢＰＦ２４と接続されている。ＢＰＦ２２、２４は、それぞれ加算部１８、減算部２０から入力した信号から人の音声の周波数帯域以外の信号成分をカットし、人の音声の周波数帯域の音声信号を出力する構成である。
【００３１】
さらに、ＢＰＦ２２は、音声パワー計算手段としての音声パワー演算部２６に接続されており、ＢＰＦ２４は、ノイズパワー計算手段としてのノイズパワー演算部２８に接続されている。音声パワー演算部２６は、所定時間におけるＢＰＦ２２の出力信号のパワーを計算して音声パワーとして出力するようになっており、ノイズパワー演算部２８は、上記所定時間におけるＢＰＦ２４の出力信号のパワーを計算してノイズパワーとして出力するようになっている。これらのパワーは、例えば、上記所定時間における平均値や２乗平均値、パワースペクトル値とすることができるが、本実施の形態では２乗平均値としており、常時更新されるようになっている。
【００３２】
さらにまた、音声パワー演算部２６及びノイズパワー演算部２８は、それぞれ減算部３０に接続されている。減算部３０は、音声パワー演算部２６から入力した音声パワーとノイズパワー演算部２８から入力したノイズパワーとの差分を計算して、擬似発話パワーとして出力する構成である。この減算部３０は、発話状態推定手段としての発話有無推定部３２に接続されている。この発話有無推定部３２には、ノイズパワー演算部２８も接続されている。
【００３３】
そして、発話有無推定部３２は、減算部３０から入力した擬似発話パワーとノイズパワー演算部２８から入力したノイズパワーとを比較して、発話者による発話の有無を推定（判断、判定）する構成とされている。本実施の形態では、発話有無推定部３２は、擬似発話パワーの大きさがノイズパワーの大きさの３倍を越えるときに、発話者による発話が為されたと推定し、発話有信号Ｔを出力するようになっている。
【００３４】
この発話有無推定部３２は、必要に応じて、音声入力装置３４、及び音声出力装置を構成するノイズキャンセルフィルタ３６と接続されるようになっている。音声入力装置３４は、例えば、ナビゲーション装置等を構成する音声認識装置や移動体電話装置等の通信装置である。また、ノイズキャンセルフィルタ３６は、上記音声入力装置３４に出力する音声信号からノイズを除去する適応フィルタであり、フィルタ係数を変更可能とされている。そして、発話有無推定部３２は、上記発話有信号Ｔを、音声入力装置３４に対してはトリガ情報として出力し、ノイズキャンセルフィルタ３６に対しては制御信号として出力する構成である。
【００３５】
なお、上記ＢＰＦ２２、２４は、上記した周波数帯域の範囲内で、音声パワー演算部２６への入力である加算信号と、ノイズパワー演算部２８への入力である差分信号とのゲイン差が出やすい帯域（それぞれの帯域が異なっても良い）の信号を通過させる設計とされており、擬似発話パワー（音声パワー）とノイズパワーとを検出しやすいように構成されている。
【００３６】
次に、本実施の形態の作用を説明する。
【００３７】
図１に示される如く、マイクアレイ１２に、その正面から発話者の音声Ｓが到達すると共に、その正面以外からノイズＮが到達した場合、第１マイクロホン１２Ａには音声Ｓ及びノイズＮ１が入力され、第２マイクロホン１２Ｂには音声Ｓ及びノイズＮ２が入力される。
【００３８】
そして、第１マイクロホン１２Ａから入力した第１音声信号［Ｓ＋Ｎ１］は、Ａ／Ｄ変換器１４によってデジタル変換された後、音声処理装置１０の加算部１８、減算部２０にそれぞれ出力される。また、第２マイクロホン１２Ｂから入力した第２音声信号［Ｓ＋Ｎ２］は、Ａ／Ｄ変換器１６によってデジタル変換された後、加算部１８、減算部２０にそれぞれ出力される。
【００３９】
加算部１８では、第１音声信号と第２音声信号とを加算して得た加算信号［Ｓ＋Ｎ’］（＝２Ｓ＋Ｎ１＋Ｎ２）を出力する。このように、加算信号は、発話者の声に対応した信号成分［Ｓ］とノイズに対応したノイズ成分［Ｎ’］とを含んでいる。一方、減算部２０では、第１音声信号と第２音声信号との差分を計算して得た差分信号［Ｎ’’］（＝Ｎ１−Ｎ２）を出力する。このように、差分信号は、発話者の声に対応した信号成分［Ｓ］を含まず、上記ノイズ成分［Ｎ’］に似たノイズ成分［Ｎ’’］のみを含んでいる。
【００４０】
加算信号は、ＢＰＦ２２で所定の周波数帯域以外の帯域成分がカットされ、音声パワー演算部２６に入力される。音声パワー演算部２６では、所定時間における加算信号の２乗平均を計算して得た音声パワー［Ｐ（Ｓ＋Ｎ’）］を減算部３０に出力する。
【００４１】
一方、差分信号は、ＢＰＦ２４で所定の周波数帯域以外の帯域成分がカットされ、ノイズパワー演算部２８に入力される。ノイズパワー演算部２８では、所定時間における差分信号の２乗平均を計算して得たノイズパワー［Ｐ（Ｎ’’）］を減算部３０、発話有無推定部３２に出力する。
【００４２】
音声パワー及びノイズパワーが入力された減算部３０では、該音声パワーからノイズパワーを差し引いて得た擬似発話パワー［Ｐ（Ｓ’）］（＝Ｐ（Ｓ＋Ｎ’）−Ｐ（Ｎ’’））を発話有無推定部３２に出力する。そして、発話有無推定部３２では、擬似発話パワー［Ｐ（Ｓ’）］とノイズパワー［Ｐ（Ｎ’’）］とを比較し、Ｐ（Ｓ’）＞３×Ｐ（Ｎ’’）が成立するときに、発話者による発話が為されたと推定し、発話有信号Ｔを出力する。
【００４３】
他方、発話がない場合には、音声Ｓがマイクアレイ１２から音声処理装置１０に入力されないので、加算部１８の出力すなわち音声パワー演算部２６への入力は、加算信号［Ｎ’］（＝Ｎ１＋Ｎ２）となり、音声パワー演算部２６の出力は、音声パワー［Ｐ（Ｎ’）］となる。また、減算部２０の出力すなわちノイズパワー演算部２８への入力は、上記発話がある場合と同じ差分信号［Ｎ’’］であり、ノイズパワー演算部２８の出力は、ノイズパワー［Ｐ（Ｎ’’）］である。
【００４４】
これにより、減算部３０の出力は、擬似発話パワー［Ｐ（Ｓ’’）］（＝Ｐ（Ｎ’）−Ｐ（Ｎ’’））であり、この擬似発話パワーがノイズパワーの３倍を越えることがなく、発話無しが推定される。
【００４５】
このように、音声処理装置１０では、発話の有無により影響を受けないノイズパワーを基準として、発話による音声Ｓに対応する成分を含み得る擬似発話パワーの大小を判断するため、発話者による発話の有無を確実に推定することができる。そして、それぞれ特定の発話者に対応して配置された第１マイクロホン１２Ａ、第２マイクロホン１２Ｂからの第１及び第２音声信号、すなわち空間的な音の情報が含まれた信号に基づき上記発話の有無を推定するため、ノイズの発生源の位置に依らず、上記発話の有無を一層確実に推定することができる。さらに、発話の有無を推定するために上記ノイズパワーと擬似発話パワーとを比較するため、瞬時値を用いて発話の有無を推定する場合と比較して、誤推定の可能性が著しく低減される。
【００４６】
そして、この推定結果である発話有信号Ｔを、トリガ情報として音声入力装置３４へ出力するため、該音声入力装置３４では、発話者が発話スイッチを操作することなく該発話者による発話の有無を検知することが可能となる。同様に、ノイズキャンセルフィルタ３６を有する音声出力装置においても、発話者が発話スイッチを操作することなく該発話者による発話の有無を検知することが可能となり、この発話有無に基づいてノイズキャンセルフィルタ３６のフィルタ係数を変更する等の制御を行なうことができる。
【００４７】
このように、本実施の形態に係る音声処理装置１０では、発話者による発話の有無を、マイクアレイ１２から入力される音声信号に基づいて確実に推定することができる。
【００４８】
次に、音声処理装置１０を、自動車等の車両に搭載され音声入力装置３４の一形態である音声認識ナビゲーションシステム５０に適用した例について図２に基づいて説明する。
【００４９】
図２に示される如く、音声認識ナビゲーションシステム５０は、車両乗員（本実施の形態では、運転者である発話者）の発する音声を制御コマンドとして制御されるナビゲーション装置であり、相互に接続された音声処理部５２と、音声出力部５４と、音声認識ナビゲーション装置５６とを含み構成されている。
【００５０】
音声処理部５２は、ＢＰＦ２２、２４がそれぞれ音声出力部５４の加算部１８、減算部２０に接続されている点を除き、音声処理装置１０と全く同様に構成されている。すなわち、加算部１８、減算部２０は、音声処理部５２と音声出力部５４とで共用されている。
【００５１】
音声出力部５４は、第１マイクロホン１２Ａと第２マイクロホン１２Ｂとから成るマイクアレイ１２と、Ａ／Ｄ変換器１４、１６と、加算部１８と、減算部２０とを備えており、これらは、上記実施の形態と全く同様に接続されている。第１マイクロホン１２Ａ、第２マイクロホン１２Ｂは、車室内における運転席前方のインストルメントパネルやルーフ前端近傍に、上記運転者に対し左右対称に所定の間隔（例えば、１００ｍｍ）で配置されている。
【００５２】
そして、加算部１８は、上記の通り音声処理部５２のＢＰＦ２２に接続されると共に、遅延処理部５７を介して減算器５８に接続されている。一方、減算部２０は、上記の通り音声処理部５２のＢＰＦ２４に接続されると共に、ノイズキャンセルフィルタ３６の信号入力部に接続されている。
【００５３】
ノイズキャンセルフィルタ３６は、上記の通り適応フィルタであり、その出力部が減算器５８に接続されている。減算器５８は、遅延処理部５７によってノイズキャンセルフィルタ３６における信号処理時間に対応して遅延された加算部１８の加算信号と、ノイズキャンセルフィルタ３６から入力したノイズ信号（上記差分信号を処理した信号）との差分を計算し、該計算より得た音声信号である擬似発話信号を出力する構成である。
【００５４】
この減算器５８の出力部は、音声認識ナビゲーション装置５６に接続されると共に、ノイズキャンセルフィルタ３６の制御入力部に接続されている。すなわち、ノイズキャンセルフィルタ３６は、音声信号（擬似発話信号）がフィードバックされるようになっており、通常はこの音声信号（のパワー）が最小となる上記ノイズ信号を出力するように、そのフィルタ係数を更新する構成である。
【００５５】
また、このノイズキャンセルフィルタ３６の制御入力部は、音声処理部５２の発話有無推定部３２と接続されており、上記発話有信号Ｔが制御信号として入力されるようになっている。そして、ノイズキャンセルフィルタ３６は、発話有信号Ｔが入力されている間はフィルタ係数の更新を停止し、発話有信号Ｔの入力直前のフィルタ係数を維持する構成とされている。
【００５６】
さらに、音声処理部５２の発話有無推定部３２は、音声認識ナビゲーション装置５６に接続されており、上記発話有信号Ｔをトリガ情報として出力するようになっている。この音声認識ナビゲーション装置５６は、発話有信号Ｔが入力されると、音声出力部５４の減算器５８から入力される音声信号を発話者による発話信号として受け付け、該発話信号に基づいて音声認識処理を行ない、該認識結果を制御コマンドとして使用するようになっている。すなわち、発話者の発した音声に基づいて、地図画面のスクロールやズーム、各種検索等（階層的な仮想制御パネルの切り換え）を行なうようになっている。
【００５７】
以上説明した音声認識ナビゲーションシステム５０では、マイクアレイ１２に、その正面から発話者の音声Ｓが到達すると共に、その正面以外からノイズＮが到達した場合、第１マイクロホン１２Ａには音声Ｓ及びノイズＮ１が入力され、第２マイクロホン１２Ｂには音声Ｓ及びノイズＮ２が入力される。
【００５８】
すると、上記音声処理装置１０の場合と同様に、加算部１８は加算信号［Ｓ＋Ｎ’］を出力し、減算部２０は差分信号［Ｎ’’］を出力する。差分信号［Ｎ’’］が入力されたノイズキャンセルフィルタ３６は、これを処理してノイズ信号［Ｎ’’’］を出力する。一方、加算信号［Ｓ＋Ｎ’］が入力された遅延処理部５７は、該加算信号を、差分信号［Ｎ’’］のノイズキャンセルフィルタ３６による処理時間に対応して遅延させて出力する。
【００５９】
すると、音声出力部５４では、減算器５８が、遅延処理部５７から入力した加算信号［Ｓ＋Ｎ’］から、ノイズキャンセルフィルタ３６から入力したノイズ信号［Ｎ’’’］を差し引いて、ノイズ成分を除去した音声信号である擬似発話信号［Ｓ’］を出力する。このとき、ノイズキャンセルフィルタ３６は、擬似発話信号を最小とするように、換言すれば、時間と共に変化するノイズ成分を極力除去するように、フィルタ係数を更新している。
【００６０】
一方、上記加算信号と差分信号とが入力された音声処理部５２は、発話有無推定部３２で擬似発話パワー［Ｐ（Ｓ’）］とノイズパワー［Ｐ（Ｎ’’）］を比較し、Ｐ（Ｓ’）＞３×Ｐ（Ｎ’’）が成立するときに、この発話有無推定部３２では、発話者による発話が為されたと推定し、発話有信号Ｔを出力する。
【００６１】
そして、音声出力部５４のノイズキャンセルフィルタ３６は、発話有信号Ｔが入力されると、そのフィルタ係数を該入力直前のフィルタ係数に固定する。これにより、音声処理部５２において発話有りを推定した後においては、減算器５８からの出力は、上記発話有信号Ｔ出力直前に生じていたノイズに対応したノイズ成分を極力除去する機能を維持しつつ、ノイズキャンセルフィルタ３６の擬似発話信号［Ｓ’］を最小化する動作によって発話者の声に対応した信号成分［Ｓ］の一部が除去される不具合が解消される。
【００６２】
すなわち、減算器５８からは、発話者の声に対応した信号成分［Ｓ］を主成分とし、歪みの少ない良好な擬似発話信号［Ｓ’］が出力される。この擬似発話信号は音声認識ナビゲーション装置５６に入力され、このとき音声認識ナビゲーション装置５６は、発話有信号Ｔが入力されているため、この擬似発話信号を発話者による発話信号として受け付ける。さらに、音声認識ナビゲーション装置５６は、該発話信号に基づいて音声認識処理を行ない、該認識結果を制御コマンドとして使用する。
【００６３】
以上説明したように、音声認識ナビゲーションシステム５０では、音声処理部５２（発話有無推定部３２）がマイクアレイ１２からの入力情報に基づいて発話者による発話の有無を確実に推定するため、発話者が発話前に発話スイッチを操作する必要がない。このため、発話者による発話が有る（と推定された）場合に、発話スイッチを操作することなく、ノイズキャンセルフィルタ３６のフィルタ係数を固定する制御が可能となり、発話者の声に対応する信号の一部が除去されて擬似発話信号に大きな歪が生じることが防止される。すなわち、音声認識ナビゲーションシステム５０では、歪みの少ない良好な擬似発話信号を音声認識ナビゲーション装置５６に入力させることができる。
【００６４】
また、この音声認識ナビゲーション装置５６にも発話有信号Ｔがトリガ情報として入力されるため、音声認識ナビゲーション装置５６は、発話者による発話がない場合に入力される擬似発話信号によって音声認識を行なうことがなく、誤作動等の恐れがない。そして、この機能が発話スイッチを操作することなく実現されている。
【００６５】
なお、上記の実施の形態では、説明を簡単にして理解を容易にするために、主要構成要素のみを説明したが、各加算部、加算器または減算部、減算器の前に適宜減衰部や増幅部を設けて擬似発話信号の抽出精度（ノイズ等の除去精度）を向上させることが可能であることは言うまでもない。また、車両に搭載される音声認識ナビゲーションシステム５０では、例えば、車載オーディオ装置の音声を除去するための適応フィルタ等を上記構成に付加して設けても良い。
【００６６】
また、上記の実施の形態では、音声処理装置１０、音声処理部５２が擬似発話パワーとノイズパワーとを比較して発話者による発話の有無を推定する構成としたが、本発明はこれに限定されず、例えば、音声パワーとノイズパワーとの比較によって発話者による発話の有無を推定する構成としても良い。
【００６７】
さらに、上記実施の形態では、音声処理装置１０を音声認識ナビゲーションシステム５０に適用した例を示したが、本発明はこれに限定されず、例えば、音声認識機能を有するオーディオ装置や空調装置等の各種装置、車両に搭載されるハンズフリー通話システム、トランシーバー装置、所謂テレビ会議システムや単なる電話装置等の各種通信装置に適用することが可能である。また、音声処理装置１０を、車両等に搭載されナビゲーション装置やオーディオ装置、空調装置、ハンズフリー通話システム、その他車載機器等を統合制御するための音声認識装置等に適用しても良い。
【００６８】
【発明の効果】
以上説明したように本発明に係る音声処理装置は、発話者による発話の有無を、マイクロホンから入力される音声信号に基づいて確実に推定することができるという優れた効果を有する。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る音声処理装置の概略構成を示すブロック図である。
【図２】本発明の実施の形態に係る音声処理装置が音声認識ナビゲーションシステムに適用された例を示すブロック図である。
【符号の説明】
１０音声処理装置
１２Ａ第１マイクロホン（第１のマイクロホン）
１２Ｂ第２マイクロホン（第２のマイクロホン）
２６音声パワー演算部（音声パワー計算手段）
２８ノイズパワー演算部（ノイズパワー計算手段）
３２発話有無推定部（発話状態推定手段）
[0001]
[Technical field of the invention]
The present invention relates to an audio processing device applied to, for example, a speech recognition device or a communication device.
[0002]
[Prior art]
For example, some navigation devices and audio devices mounted on vehicles such as automobiles are known to include a speech recognition device that receives and recognizes an occupant's voice from a microphone and performs various processes based on what was recognized. Some vehicles also have a hands-free communication device that lets the driver talk while driving without holding a microphone or the like (for example, the body of a mobile phone).
[0003]
Some of these speech recognition devices and communication devices have two microphones and implement a noise canceling function that removes sound other than the voice of the occupant (speaker) (see, for example, Patent Document 1). Specifically, a noise signal that is input from one microphone and processed by a noise canceling filter is subtracted from the audio signal input from the other microphone, so that the audio signal (output signal) corresponding to the speaker's voice is extracted and output. An adaptive filter is used as the noise canceling filter, and its filter coefficients are updated so that the power of the output signal, which is this difference signal, is minimized.
[0004]
However, in a configuration where the noise canceling filter constantly updates its filter coefficients, when the speaker's voice is loud relative to the noise, part of the signal corresponding to the speaker's voice is also removed (canceled), and large distortion arises in the output signal. For this reason, in the speech recognition devices and communication devices described above, the speaker presses an utterance switch when speaking. This fixes the filter coefficients of the noise canceling filter and prevents part of the signal corresponding to the speaker's voice from being removed, and the output signal obtained in this way is the one that is accepted. Distortion of the output signal is thereby prevented.
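The prior-art behaviour described in the two paragraphs above can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the adaptation rule, so a normalized LMS (NLMS) update is assumed here, and the names `AdaptiveNoiseCanceller`, `taps`, `mu`, `eps`, and `frozen` are hypothetical. The `frozen` flag plays the role of the utterance switch.

```python
class AdaptiveNoiseCanceller:
    """Two-input noise canceller. NLMS is an assumed adaptation rule."""

    def __init__(self, taps=8, mu=0.5, eps=1e-9):
        self.w = [0.0] * taps      # filter coefficients
        self.buf = [0.0] * taps    # most recent reference (noise) samples
        self.mu = mu               # step size (assumed value)
        self.eps = eps             # guards against division by zero

    def step(self, primary, reference, frozen=False):
        # Shift the newest reference sample into the delay line.
        self.buf = [reference] + self.buf[:-1]
        noise_estimate = sum(w * x for w, x in zip(self.w, self.buf))
        output = primary - noise_estimate   # the difference (output) signal
        if not frozen:
            # Adapt the coefficients so that the output power shrinks.
            norm = sum(x * x for x in self.buf) + self.eps
            gain = self.mu * output / norm
            self.w = [w + gain * x for w, x in zip(self.w, self.buf)]
        return output
```

Calling `step(..., frozen=True)` while the speaker is talking leaves the coefficients untouched, so the filter keeps removing the noise it has already learned without starting to cancel the voice.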
[0005]
Speech recognition devices and communication devices that implement a noise canceling function with a different configuration are also known (see, for example, Patent Document 2). The vehicle audio input device described in Patent Document 2 has a plurality of microphones, one for each seat of the vehicle, and sets a threshold (a power corresponding to noise) for each microphone based on the moving average of the audio signal level input from that microphone over a past predetermined time. When the input signal level at any microphone exceeds the corresponding threshold, the device determines that speech from some occupant has been input. When, in addition, the input signal level from the microphone corresponding to a specific occupant (seat) is higher than the input signal levels from the microphones corresponding to the other occupants, the device determines that the specific occupant has spoken and outputs an output signal.
[0006]
This output signal is obtained by subtracting the input signals from the microphones corresponding to the other occupants from the input signal from the microphone corresponding to the specific occupant (seat), thereby removing noise. That is, this configuration has no noise canceling filter in the form of an adaptive filter and basically treats the input from the microphones corresponding to the other occupants (microphones placed sufficiently far from the microphone corresponding to the specific occupant) as the noise signal. As a result, this configuration can determine whether the specific occupant has spoken without an utterance switch being operated, and when it determines that the specific occupant has spoken, it can output to the speech recognition device an output signal obtained by removing noise from the input signal of the microphone corresponding to that occupant.
[0007]
[Patent Document 1]
JP 2000-148200 A
[Patent Document 2]
JP H11-65586 A
[0008]
[Problems to be solved by the invention]
However, the former configuration has the problem that the speaker must operate the utterance switch every time he or she speaks, which is cumbersome.
[0009]
In the latter configuration, on the other hand, the microphones are arranged for different occupants; in other words, there is only one microphone for a specific occupant, and noise cannot be separated. The threshold used as the criterion for determining the presence or absence of speech therefore tends to fluctuate with the loudness of the voice and of the noise, and the determination is inaccurate. For example, when noise occurs near the microphone corresponding to a specific occupant (noise that gives that microphone a large input), the noise cannot be removed using the small input signals at the other microphones, and the presence or absence of speech is likely to be misjudged. This problem is especially pronounced because the input signal level that is compared with the threshold (a power based on the moving average) is an instantaneous value.
[0010]
In view of the above facts, an object of the present invention is to obtain an audio processing device that can reliably estimate the presence or absence of an utterance by a speaker based on audio signals input from microphones.
[0011]
[Means for Solving the Problems]
To achieve the above object, the audio processing device according to the invention of claim 1 processes a first audio signal and a second audio signal input respectively from a first microphone and a second microphone arranged substantially symmetrically with respect to a specific speaker, and is characterized by comprising utterance state estimating means that estimates the presence or absence of an utterance by the speaker by comparing the power of a sum signal, obtained by adding the first audio signal and the second audio signal, with the power of a difference signal, obtained by subtracting the second audio signal from the first audio signal.
[0012]
In the audio processing device of claim 1, the first audio signal is input from the first microphone and the second audio signal is input from the second microphone. Because the first and second microphones are arranged substantially symmetrically with respect to the speaker, the first and second audio signals are substantially equal in both phase and magnitude for the voice uttered by the speaker. For sound other than the speaker's voice, that is, noise, they differ according to the position of each microphone relative to the noise source.
[0013]
When the first and second audio signals are input, the utterance state estimating means compares the power of the sum signal obtained by adding the first and second audio signals (hereinafter, voice power) with the power of the difference signal obtained by subtracting the second audio signal from the first audio signal (hereinafter, noise power), and estimates (judges, determines) the presence or absence of an utterance by the speaker. Needless to say, instead of comparing the voice power and the noise power directly, powers or other quantities obtained by suitably processing one or both of them may be compared.
[0014]
Specifically, when the speaker is not speaking, the voice power is the power of the sum of the first and second audio signals, both of which contain only noise, so its ratio to the noise power, which is the power of the difference of those same noise-only signals, is small. When the speaker speaks, on the other hand, the voice power, being the power of the sum signal, includes the power of the signal corresponding to the speaker's voice, so its ratio to the noise power, which is the power of the difference signal and does not include the power of the signal corresponding to the speaker's voice, is sufficiently large. Because the noise power hardly changes in phase or magnitude with the presence or absence of the speaker's utterance, the presence or absence of the utterance can be estimated by, for example, comparing the voice power with a threshold obtained by multiplying the noise power by a suitable coefficient.
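The comparison just described can be sketched as below. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the mean-square value is used as the power, and `coeff=4.0` is an arbitrary illustrative choice, since the text only calls for a suitable coefficient.

```python
def mean_square(samples):
    """Mean-square value of a block of samples (one possible 'power')."""
    return sum(x * x for x in samples) / len(samples)


def speech_present(first, second, coeff=4.0):
    """Compare sum-signal power with a threshold derived from difference-signal power.

    coeff is an assumed value; the text does not specify one here.
    """
    voice_power = mean_square([a + b for a, b in zip(first, second)])
    noise_power = mean_square([a - b for a, b in zip(first, second)])
    return voice_power > coeff * noise_power
```

When both inputs carry the same speech plus different noise, the speech doubles in the sum and cancels in the difference, so the ratio is large; with noise alone the two powers are of similar size and the test fails.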
[0015]
Moreover, the noise power, which is hardly affected by the presence or absence of an utterance, is separated out on the basis of the first and second audio signals from the first and second microphones, each arranged for the specific speaker, that is, on the basis of signals that contain spatial sound information, and the presence or absence of the utterance is estimated with this noise power as the reference. The presence or absence of the utterance can therefore be estimated reliably regardless of the position of the noise source. In addition, because the presence or absence of the utterance is estimated by comparing the voice power with the noise power, the possibility of erroneous estimation is markedly lower than when an instantaneous value is used.
[0016]
As a result, control based on the presence or absence of an utterance becomes possible without operating an utterance switch as in the related art. For example, an audio signal can be output to a voice input device (a speech recognition device, a communication device, or the like) only when an utterance is estimated to be present, or the filter coefficients of a noise canceling filter can be changed or fixed when an utterance is estimated to be present.
[0017]
In this way, the audio processing device of claim 1 can reliably estimate the presence or absence of an utterance by the speaker based on the audio signals input from the microphones.
[0018]
To achieve the above object, the audio processing device according to the invention of claim 2 processes a first audio signal and a second audio signal input respectively from a first microphone and a second microphone arranged substantially symmetrically with respect to a specific speaker, and comprises: voice power calculating means that obtains voice power by calculating the power of a sum signal obtained by adding the first audio signal and the second audio signal; noise power calculating means that obtains noise power by calculating the power of a difference signal obtained by subtracting the second audio signal from the first audio signal; and utterance state estimating means that estimates the presence or absence of an utterance by comparing the difference between the voice power and the noise power with the noise power.
[0019]
In the audio processing device of claim 2, the first audio signal is input from the first microphone and the second audio signal is input from the second microphone. Because the first and second microphones are arranged substantially symmetrically with respect to the speaker, the first and second audio signals are substantially equal in both phase and magnitude for the voice uttered by the speaker. For sound other than the speaker's voice, that is, noise, they differ according to the position of each microphone relative to the noise source.
[0020]
When the first and second audio signals are input, the voice power calculating means calculates the power of the sum signal obtained by adding the first audio signal and the second audio signal to obtain the voice power, and the noise power calculating means calculates the power of the difference signal obtained by subtracting the second audio signal from the first audio signal to obtain the noise power. The utterance state estimating means then estimates (judges, determines) the presence or absence of an utterance by the speaker by comparing the difference between the voice power and the noise power (hereinafter, pseudo utterance power) with the noise power.
[0021]
Specifically, the pseudo utterance power, being the difference between the voice power and the noise power, roughly corresponds to the power of the whole audio signal minus the power of the noise signal. When the speaker is not speaking, it does not include the power of the signal corresponding to the speaker's voice, so its ratio to the noise power is small; when the speaker speaks, it includes that power, so its ratio to the noise power is sufficiently large. Because the noise power hardly changes in phase or magnitude with the presence or absence of the speaker's utterance, the presence or absence of the utterance can be estimated by, for example, comparing the pseudo utterance power with a threshold obtained by multiplying the noise power by a suitable coefficient.
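Written with symbols introduced here purely for illustration ($P_V$ for voice power, $P_N$ for noise power, $P_U$ for pseudo utterance power, and $k$ for the coefficient), the comparison in this paragraph is algebraically the same as comparing the voice power against a larger multiple of the noise power:

```latex
P_U = P_V - P_N, \qquad
P_U > k\,P_N \;\Longleftrightarrow\; P_V > (k+1)\,P_N .
```

For instance, a coefficient of $k = 3$ on the pseudo utterance power corresponds to requiring $P_V > 4\,P_N$.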
[0022]
Moreover, the noise power, which is hardly affected by the presence or absence of an utterance, is separated out on the basis of the first and second audio signals from the first and second microphones, each arranged for the specific speaker, that is, on the basis of signals that contain spatial sound information, and the presence or absence of the utterance is estimated with this noise power as the reference. The presence or absence of the utterance can therefore be estimated reliably regardless of the position of the noise source. In addition, because the presence or absence of the utterance is estimated by comparing the pseudo utterance power with the noise power, the possibility of erroneous estimation is markedly lower than when an instantaneous value is used.
[0023]
As a result, control based on the presence or absence of an utterance becomes possible without operating an utterance switch as in the related art. For example, an audio signal can be output to a voice input device (a speech recognition device, a communication device, or the like) only when an utterance is estimated to be present, or the filter coefficients of a noise canceling filter can be changed or fixed when an utterance is estimated to be present.
[0024]
In this way, the audio processing device of claim 2 can reliably estimate the presence or absence of an utterance by the speaker based on the audio signals input from the microphones.
[0025]
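Note that the power in claims 1 and 2 is not limited to the mean-square value of the signal over a predetermined time (a predetermined moving time window); it also includes, for example, the result of filter processing, including the average value of the signal over the predetermined time, a power spectrum, and the like.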
[0026]
[Embodiments of the invention]
An audio processing device 10 according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a schematic block diagram showing the overall configuration of the audio processing device 10. As shown in this figure, the audio processing device 10 is electrically connected to a microphone array 12 and is configured to process the audio signals input from the microphone array 12.
[0027]
The microphone array 12 consists of a first microphone 12A and a second microphone 12B, which are arranged in front of a specific speaker and substantially symmetrically with respect to that speaker. That is, the voice uttered by the specific speaker is input to the first microphone 12A and the second microphone 12B with substantially the same phase and magnitude.
[0028]
The first microphone 12A and the second microphone 12B are electrically connected to the audio processing device 10 via A/D converters 14 and 16, respectively, so that the audio signals digitized by the A/D converters 14 and 16 are input to the audio processing device 10. Hereinafter, the term "connection" by itself means an electrical connection.
[0029]
The audio processing device 10 includes an adding unit 18 and a subtracting unit 20. The adding unit 18 is connected to both A/D converters 14 and 16; it adds the first audio signal input from the A/D converter 14 and the second audio signal input from the A/D converter 16 and outputs the result. The subtracting unit 20, on the other hand, calculates and outputs the difference (absolute value) between the first audio signal input from the A/D converter 14 and the second audio signal input from the A/D converter 16.
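As a small illustration of what the two units output, the sketch below builds two sample sequences that share the same speech but carry different noise. The function name `add_and_subtract` and all sample values are made up for this example.

```python
def add_and_subtract(first, second):
    """Return (sum signal, absolute difference signal) for two sample sequences."""
    sum_signal = [a + b for a, b in zip(first, second)]
    diff_signal = [abs(a - b) for a, b in zip(first, second)]
    return sum_signal, diff_signal


# Speech reaches both microphones identically; the noise does not.
speech = [0.5, -0.5, 0.25]
noise_1 = [0.125, 0.0, -0.125]
noise_2 = [0.0, 0.125, 0.125]
first = [s + n for s, n in zip(speech, noise_1)]
second = [s + n for s, n in zip(speech, noise_2)]
sum_signal, diff_signal = add_and_subtract(first, second)
```

Here `diff_signal` depends only on the two noise sequences, because the common speech component cancels, while `sum_signal` contains the speech at twice its amplitude. Taking the absolute value does not affect a mean-square power computed from the difference.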
[0030]
The adding unit 18 is connected to a band-pass filter (hereinafter, BPF) 22, and the subtracting unit 20 is connected to a BPF 24. The BPFs 22 and 24 cut signal components outside the frequency band of the human voice from the signals input from the adding unit 18 and the subtracting unit 20, respectively, and output audio signals within the frequency band of the human voice.
[0031]
Further, the BPF 22 is connected to a voice power calculation unit 26 serving as the voice power calculating means, and the BPF 24 is connected to a noise power calculation unit 28 serving as the noise power calculating means. The voice power calculation unit 26 calculates the power of the output signal of the BPF 22 over a predetermined time and outputs it as the voice power, and the noise power calculation unit 28 calculates the power of the output signal of the BPF 24 over the same predetermined time and outputs it as the noise power. These powers can be, for example, the average value, the mean-square value, or the power spectrum value over the predetermined time; in the present embodiment the mean-square value is used, and it is updated continuously.
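A continuously updated mean-square value (２乗平均値 in the original) over a fixed time window could be kept as in the sketch below. The class name `MovingMeanSquare` and the window length are assumptions for illustration; the text does not state how long the predetermined time is.

```python
from collections import deque


class MovingMeanSquare:
    """Mean-square power over the most recent `window` samples."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)
        self.total = 0.0   # running sum of squares inside the window

    def update(self, x):
        if len(self.samples) == self.samples.maxlen:
            oldest = self.samples[0]
            self.total -= oldest * oldest   # remove the sample about to drop out
        self.samples.append(x)              # deque discards the oldest automatically
        self.total += x * x
        return self.total / len(self.samples)
```

One instance fed with the BPF 22 output would give the voice power, and a second instance fed with the BPF 24 output would give the noise power over the same window.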
[0032]
Furthermore, the audio power calculation unit 26 and the noise power calculation unit 28 are connected to the subtraction unit 30, respectively. The subtraction unit 30 is configured to calculate a difference between the voice power input from the voice power calculation unit 26 and the noise power input from the noise power calculation unit 28 and output the difference as pseudo speech power. The subtraction unit 30 is connected to an utterance presence / absence estimation unit 32 as utterance state estimation means. The utterance presence / absence estimation unit 32 is also connected to a noise power calculation unit 28.
[0033]
The utterance presence / absence estimation unit 32 compares the pseudo utterance power input from the subtraction unit 30 with the noise power input from the noise power calculation unit 28, and thereby estimates (determines) the presence or absence of an utterance by the speaker. In the present embodiment, the utterance presence / absence estimation unit 32 estimates that the speaker has made an utterance when the pseudo utterance power exceeds three times the noise power, and outputs the utterance presence signal T.
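The decision rule of this embodiment reduces to a single comparison. In the sketch below the function name and the `factor` parameter are hypothetical; the default of 3 is the multiple given for the present embodiment.

```python
def utterance_present(voice_power: float, noise_power: float, factor: float = 3.0) -> bool:
    """Return True when the pseudo utterance power exceeds `factor` times the noise power."""
    pseudo_utterance_power = voice_power - noise_power
    return pseudo_utterance_power > factor * noise_power


# Hypothetical power values for illustration.
speaking = utterance_present(voice_power=2.2, noise_power=0.2)   # 2.0 > 0.6
silent = utterance_present(voice_power=0.25, noise_power=0.2)    # 0.05 > 0.6 is false
boundary = utterance_present(voice_power=4.0, noise_power=1.0)   # 3.0 does not exceed 3.0
```

Because the threshold scales with the measured noise power, the same rule applies in quiet and noisy surroundings without retuning.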
[0034]
The utterance presence / absence estimation unit 32 is connected to a voice input device 34 and a noise canceling filter 36 constituting a voice output device as necessary. The voice input device 34 is, for example, a communication device such as a voice recognition device or a mobile telephone device that forms a navigation device or the like. The noise canceling filter 36 is an adaptive filter that removes noise from the audio signal output to the audio input device 34, and is capable of changing a filter coefficient. The utterance presence / absence estimating unit 32 is configured to output the utterance presence signal T to the voice input device 34 as trigger information and to the noise cancellation filter 36 as a control signal.
[0035]
The BPFs 22 and 24 are designed to pass, within the frequency band described above, signals in bands (which may differ between the two filters) in which a gain difference readily arises between the addition signal input to the audio power calculation unit 26 and the difference signal input to the noise power calculation unit 28. This makes the pseudo speech power (voice power) and the noise power easy to detect.
[0036]
Next, the operation of the present embodiment will be described.
[0037]
As shown in FIG. 1, when the voice S of the speaker reaches the microphone array 12 from the front and the noise N arrives from a position other than the front, the voice S and the noise N1 are input to the first microphone 12A. The voice S and the noise N2 are input to the second microphone 12B.
[0038]
Then, the first audio signal [S + N1] input from the first microphone 12A is digitally converted by the A / D converter 14, and then output to the adding unit 18 and the subtracting unit 20 of the audio processing device 10, respectively. The second audio signal [S + N2] input from the second microphone 12B is digitally converted by the A / D converter 16 and then output to the adding unit 18 and the subtracting unit 20, respectively.
[0039]
The adder 18 outputs an addition signal [S + N '] (= 2S + N1 + N2) obtained by adding the first audio signal and the second audio signal. As described above, the addition signal includes the signal component [S] corresponding to the voice of the speaker and the noise component [N ′] corresponding to the noise. On the other hand, the subtractor 20 outputs a difference signal [N ″] (= N1−N2) obtained by calculating a difference between the first audio signal and the second audio signal. As described above, the difference signal does not include the signal component [S] corresponding to the speaker's voice, but includes only the noise component [N ″] similar to the noise component [N ′].
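The two outputs can be written out explicitly. The symbols x_1 and x_2 for the two microphone signals are introduced here only for illustration, and the derivation assumes that the voice S reaches both microphones with the same amplitude and phase, as the symmetrical arrangement is intended to ensure.

```latex
\begin{align}
x_1 &= S + N_1, \qquad x_2 = S + N_2 \\
x_1 + x_2 &= 2S + (N_1 + N_2) = 2S + N' \\
x_1 - x_2 &= N_1 - N_2 = N''
\end{align}
```

The voice term cancels exactly in the difference, while the noise terms generally do not, because N1 and N2 differ when the noise arrives from a direction other than the front.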
[0040]
In the addition signal, the band components other than the predetermined frequency band are cut by the BPF 22 and input to the audio power calculation unit 26. The audio power calculator 26 outputs the audio power [P (S + N ′)] obtained by calculating the mean square of the added signal for a predetermined time to the subtractor 30.
[0041]
On the other hand, the BPF 24 cuts the band components outside the predetermined frequency band from the difference signal and inputs the result to the noise power calculator 28. The noise power calculation unit 28 outputs the noise power [P (N ″)], obtained by calculating the mean square of the difference signal over the predetermined time, to the subtraction unit 30 and the utterance presence / absence estimation unit 32.
[0042]
The subtraction unit 30, to which the voice power and the noise power are input, outputs the pseudo utterance power [P (S ′)] (= P (S + N ′) − P (N ″)), obtained by subtracting the noise power from the voice power, to the utterance presence / absence estimation unit 32. The utterance presence / absence estimation unit 32 then compares the pseudo utterance power [P (S ′)] with the noise power [P (N ″)] and, when P (S ′) > 3 × P (N ″) holds, estimates that the speaker has made an utterance and outputs the utterance presence signal T.
[0043]
On the other hand, when there is no utterance, the voice S is not input from the microphone array 12 to the voice processing device 10, so the output of the adder 18, that is, the input to the voice power calculator 26, is the addition signal [N ′] (= N1 + N2), and the output of the audio power calculator 26 is the audio power [P (N ′)]. The output of the subtraction unit 20, that is, the input to the noise power calculation unit 28, is the same difference signal [N ″] as when an utterance is present, and the output of the noise power calculation unit 28 is the noise power [P (N ″)].
[0044]
As a result, the output of the subtraction unit 30 is the pseudo speech power [P (S ″)] (= P (N ′) − P (N ″)). Because this pseudo speech power does not exceed three times the noise power, it is estimated that there is no utterance.
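The two cases, with and without an utterance, can be checked with a small simulation. This is a sketch under stated assumptions: the noise at the two microphones is modelled as independent Gaussian noise, the voice as a sine tone that is identical at both microphones, the band-pass filters are omitted, and the power is the mean square over the whole signal. All names and values are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def mean_square(samples: Sequence[float]) -> float:
    return sum(v * v for v in samples) / len(samples)


def estimate_utterance(
    first: Sequence[float], second: Sequence[float], factor: float = 3.0
) -> Tuple[bool, float, float]:
    """Return (utterance present?, pseudo utterance power, noise power)."""
    added = [a + b for a, b in zip(first, second)]
    diff = [a - b for a, b in zip(first, second)]
    voice_power = mean_square(added)
    noise_power = mean_square(diff)
    pseudo_power = voice_power - noise_power
    return pseudo_power > factor * noise_power, pseudo_power, noise_power


rng = random.Random(0)  # fixed seed so the example is repeatable
count = 4000
noise1: List[float] = [rng.gauss(0.0, 0.3) for _ in range(count)]
noise2: List[float] = [rng.gauss(0.0, 0.3) for _ in range(count)]
voice: List[float] = [math.sin(2.0 * math.pi * 200.0 * i / 8000.0) for i in range(count)]

# With an utterance: both microphones receive the same voice plus their own noise.
with_utterance = estimate_utterance(
    [s + n for s, n in zip(voice, noise1)],
    [s + n for s, n in zip(voice, noise2)],
)

# Without an utterance: only the noise reaches the microphones.
without_utterance = estimate_utterance(noise1, noise2)
```

With independent noise the powers of the sum and difference signals are nearly equal when nobody speaks, so the pseudo utterance power stays close to zero and well below the threshold.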
[0045]
As described above, the voice processing device 10 judges the magnitude of the pseudo utterance power, which can include the component corresponding to the voice S produced by an utterance, against the noise power, which is not affected by the presence or absence of an utterance, and can therefore reliably estimate the presence or absence of an utterance. Moreover, the presence or absence of an utterance is estimated from the first and second audio signals from the first microphone 12A and the second microphone 12B, which are arranged in correspondence with the specific speaker, that is, from signals that include spatial information about the sound. The estimate is therefore more reliable regardless of the position of the noise source. Furthermore, because the noise power is compared with the pseudo utterance power to estimate the presence or absence of an utterance, the possibility of erroneous estimation is significantly lower than when the presence or absence of an utterance is estimated from instantaneous values.
[0046]
Since the utterance presence signal T, which is the estimation result, is output to the voice input device 34 as trigger information, the voice input device 34 can detect the presence or absence of an utterance without the speaker operating an utterance switch. Similarly, an audio output device having the noise canceling filter 36 can detect the presence or absence of an utterance without the speaker operating an utterance switch, and can perform control such as changing the filter coefficient accordingly.
[0047]
As described above, in the voice processing device 10 according to the present embodiment, it is possible to reliably estimate the presence or absence of the utterance by the speaker based on the voice signal input from the microphone array 12.
[0048]
Next, an example in which the voice processing device 10 is applied to a voice recognition navigation system 50 that is mounted on a vehicle such as an automobile and is an embodiment of the voice input device 34 will be described with reference to FIG.
[0049]
As shown in FIG. 2, the voice recognition navigation system 50 is a navigation device controlled by using the voice of a vehicle occupant (in the present embodiment, the speaker, who is the driver) as control commands. It includes a voice processing unit 52, a voice output unit 54, and a voice recognition navigation device 56, which are connected to one another.
[0050]
The audio processing unit 52 has the same configuration as the audio processing device 10 except that the BPFs 22 and 24 are connected to the addition unit 18 and the subtraction unit 20 of the audio output unit 54, respectively. That is, the addition unit 18 and the subtraction unit 20 are shared by the audio processing unit 52 and the audio output unit 54.
[0051]
The audio output unit 54 includes a microphone array 12 consisting of a first microphone 12A and a second microphone 12B, A / D converters 14 and 16, an adding unit 18, and a subtracting unit 20, which are connected in exactly the same way as in the above embodiment. The first microphone 12A and the second microphone 12B are arranged at a predetermined interval (for example, 100 mm), symmetrically with respect to the driver, near the instrument panel in front of the driver's seat or the front end of the roof in the vehicle cabin.
[0052]
The adder 18 is connected to the BPF 22 of the audio processor 52 as described above, and is also connected to the subtractor 58 via the delay processor 57. On the other hand, the subtraction unit 20 is connected to the BPF 24 of the audio processing unit 52 and the signal input unit of the noise cancellation filter 36 as described above.
[0053]
The noise canceling filter 36 is an adaptive filter as described above, and its output is connected to the subtractor 58. The subtractor 58 calculates the difference between the addition signal of the adder 18, delayed by the delay processor 57 in accordance with the signal processing time of the noise cancel filter 36, and the noise signal input from the noise cancel filter 36 (a signal obtained by processing the difference signal), and outputs the resulting voice signal as a pseudo speech signal.
[0054]
The output of the subtractor 58 is connected to the speech recognition navigation device 56 and to the control input of the noise cancellation filter 36. That is, the audio signal (pseudo utterance signal) is fed back to the noise cancellation filter 36. Normally, the filter coefficient of the noise cancellation filter 36 is updated so that the filter outputs the noise signal that minimizes the power of this audio signal.
[0055]
The control input unit of the noise cancellation filter 36 is connected to the utterance presence / absence estimation unit 32 of the audio processing unit 52, and the utterance presence signal T is input as a control signal. The noise canceling filter 36 is configured to stop updating the filter coefficient while the speech presence signal T is being input, and maintain the filter coefficient immediately before the speech presence signal T is input.
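The freezing of the filter coefficient can be sketched with a least-mean-squares (LMS) adaptive filter. The specification says only that the noise cancellation filter is an adaptive filter whose coefficient is updated to minimize the output power; the LMS update rule, the tap count, the step size, and every name below are assumptions made for illustration. The primary input is taken to be the addition signal after the delay has already been applied.

```python
import random
from typing import List


class FreezableLmsCanceller:
    """LMS noise canceller whose coefficient update can be suspended."""

    def __init__(self, taps: int = 4, step_size: float = 0.05) -> None:
        self.weights: List[float] = [0.0] * taps
        self._history: List[float] = [0.0] * taps
        self._step = step_size

    def process(self, primary: float, reference: float, utterance_present: bool) -> float:
        """Subtract the estimated noise from `primary`; adapt unless an utterance is present."""
        self._history = [reference] + self._history[:-1]
        noise_estimate = sum(w * x for w, x in zip(self.weights, self._history))
        output = primary - noise_estimate
        if not utterance_present:
            # Update the coefficients so that the output power is driven down.
            self.weights = [
                w + self._step * output * x for w, x in zip(self.weights, self._history)
            ]
        return output


rng = random.Random(1)
canceller = FreezableLmsCanceller()

# Noise only: the filter adapts until the output is almost cancelled.
outputs = []
for _ in range(2000):
    reference = rng.gauss(0.0, 1.0)
    outputs.append(canceller.process(0.8 * reference, reference, utterance_present=False))
residual_power = sum(v * v for v in outputs[-200:]) / 200

# Utterance present: the coefficients in use just before the signal arrived are kept.
weights_before = list(canceller.weights)
for _ in range(50):
    reference = rng.gauss(0.0, 1.0)
    canceller.process(0.8 * reference + 1.0, reference, utterance_present=True)
weights_after = list(canceller.weights)
```

Holding the coefficients keeps the noise estimate that was learned before the utterance, so the filter does not start cancelling the voice component in order to minimize the output.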
[0056]
Further, the speech presence / absence estimation unit 32 of the speech processing unit 52 is connected to the speech recognition navigation device 56 and outputs the utterance presence signal T to it as trigger information. When the utterance presence signal T is input, the voice recognition navigation device 56 accepts the voice signal input from the subtractor 58 of the voice output unit 54 as the voice signal of the speaker, performs voice recognition processing on it, and uses the recognition result as a control command. That is, scrolling and zooming of the map screen, various searches, and the like (switching of the hierarchical virtual control panel) are performed based on the voice uttered by the speaker.
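The use of the utterance presence signal T as trigger information amounts to gating the signal passed on for recognition. The sketch below is illustrative only: it assumes the pseudo utterance signal is split into frames and that one presence flag is available per frame, and the function name and data are hypothetical.

```python
from typing import List, Sequence


def frames_for_recognition(
    frames: Sequence[Sequence[float]], presence_flags: Sequence[bool]
) -> List[List[float]]:
    """Keep only the frames received while the utterance presence signal is set."""
    if len(frames) != len(presence_flags):
        raise ValueError("one presence flag is required per frame")
    return [list(frame) for frame, present in zip(frames, presence_flags) if present]


# Hypothetical frames of the pseudo utterance signal and the matching flags.
frames = [[0.0, 0.1], [0.4, -0.3], [0.5, 0.2], [0.0, 0.0]]
flags = [False, True, True, False]
accepted = frames_for_recognition(frames, flags)
```

Frames that arrive while the flag is clear are never handed to the recognizer, so residual noise cannot be misread as a command.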
[0057]
In the speech recognition navigation system 50 described above, when the speaker's voice S reaches the microphone array 12 from the front and the noise N arrives from a direction other than the front, the voice S and the noise N1 are input to the first microphone 12A, and the voice S and the noise N2 are input to the second microphone 12B.
[0058]
Then, as in the case of the audio processing device 10, the adding unit 18 outputs the addition signal [S + N ′], and the subtracting unit 20 outputs the difference signal [N ″]. The noise cancellation filter 36 to which the difference signal [N ″] has been input processes this and outputs a noise signal [N ′ ″]. On the other hand, the delay processing unit 57 to which the addition signal [S + N ′] is input delays the addition signal according to the processing time of the difference signal [N ″] by the noise cancellation filter 36 and outputs the result.
[0059]
Then, in the audio output unit 54, the subtracter 58 subtracts the noise signal [N ′ ″] input from the noise cancellation filter 36 from the addition signal [S + N ′] input from the delay processing unit 57 to reduce the noise component. A pseudo speech signal [S '], which is a removed speech signal, is output. At this time, the noise cancellation filter 36 updates the filter coefficient so as to minimize the pseudo speech signal, in other words, to remove as much as possible a noise component that changes with time.
[0060]
On the other hand, the speech processing unit 52, to which the addition signal and the difference signal are input, compares the pseudo speech power [P (S ′)] with the noise power [P (N ″)] in the speech presence / absence estimation unit 32. When P (S ′) > 3 × P (N ″) holds, the utterance presence / absence estimating unit 32 estimates that the speaker has made an utterance, and outputs an utterance presence signal T.
[0061]
When the utterance presence signal T is input, the noise cancellation filter 36 of the audio output unit 54 fixes the filter coefficient to the value in use immediately before the input. Thus, after the speech processing unit 52 estimates that an utterance is present, the output from the subtracter 58 retains the function of removing as much as possible the noise component corresponding to the noise occurring immediately before the utterance presence signal T was output, while avoiding the problem that part of the signal component [S] corresponding to the speaker's voice is removed by the noise canceling filter 36 acting to minimize the pseudo utterance signal [S ′].
[0062]
That is, the subtractor 58 outputs a good pseudo-utterance signal [S ′] having a signal component [S] corresponding to the speaker's voice as a main component and having little distortion. The pseudo speech signal is input to the speech recognition navigation device 56. At this time, the speech recognition navigation device 56 receives the pseudo speech signal as the speech signal of the speaker since the speech presence signal T is input. Further, the speech recognition navigation device 56 performs a speech recognition process based on the utterance signal, and uses the recognition result as a control command.
[0063]
As described above, in the voice recognition navigation system 50, the voice processing unit 52 (the utterance presence / absence estimation unit 32) reliably estimates the presence or absence of an utterance by the speaker based on the input from the microphone array 12, so the speaker does not need to operate an utterance switch before speaking. Consequently, when an utterance by the speaker is present (estimated), the filter coefficient of the noise cancel filter 36 can be fixed without operating an utterance switch, which prevents the large distortion of the pseudo speech signal that would result from part of the signal corresponding to the speaker's voice being removed. That is, in the voice recognition navigation system 50, a good pseudo utterance signal with little distortion can be input to the voice recognition navigation device 56.
[0064]
Further, since the speech presence signal T is also input as trigger information to the speech recognition navigation device 56, the speech recognition navigation device 56 does not perform speech recognition on a pseudo speech signal input while the speaker is not speaking, so there is no risk of malfunction. This function, too, is realized without operating an utterance switch.
[0065]
In the above-described embodiment, only the main components have been described for the sake of simplicity and easy understanding. Needless to say, however, the accuracy of extracting the pseudo speech signal (the accuracy of removing noise and the like) can be improved by providing an attenuator or an amplifier for each adding unit (adder) or subtracting unit (subtractor). Further, in the voice recognition navigation system 50 mounted on the vehicle, for example, an adaptive filter or the like for removing the sound of the in-vehicle audio device may be added to the above configuration.
[0066]
Further, in the above embodiment, the speech processing device 10 and the speech processing unit 52 are configured to compare the pseudo speech power and the noise power to estimate the presence or absence of speech by the speaker, but the present invention is not limited to this. Instead, for example, a configuration may be adopted in which the presence or absence of the utterance by the speaker is estimated by comparing the voice power and the noise power.
[0067]
Further, in the above-described embodiment, an example in which the voice processing device 10 is applied to the voice recognition navigation system 50 has been described. However, the present invention is not limited to this and can be applied to various devices, including communication devices such as a hands-free communication system mounted on a vehicle, a transceiver device, a so-called video conference system, and a simple telephone device. Further, the voice processing device 10 may be applied to a navigation device, an audio device, an air conditioner, a hands-free communication system, a voice recognition device for integrated control of in-vehicle devices, and the like mounted on a vehicle or the like.
[0068]
[Effect of the invention]
As described above, the speech processing device according to the present invention has an excellent effect that the presence or absence of speech by a speaker can be reliably estimated based on a speech signal input from a microphone.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a schematic configuration of an audio processing device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example in which the speech processing device according to the embodiment of the present invention is applied to a speech recognition navigation system.
[Explanation of symbols]
10 Audio processing device
12A First microphone
12B Second microphone
26 Audio power calculation unit (audio power calculation means)
28 Noise power calculation unit (noise power calculation means)
32 Speech presence / absence estimation unit (speech state estimation means)

Claims

An audio processing device for processing a first audio signal and a second audio signal respectively input from a first microphone and a second microphone arranged substantially symmetrically with respect to a specific speaker,
Utterance state estimating means is provided that compares the power of the addition signal, obtained by adding the first audio signal and the second audio signal, with the power of the difference signal, obtained by subtracting the second audio signal from the first audio signal, and thereby estimates the presence or absence of an utterance by the speaker,
An audio processing device characterized by the above.

An audio processing device for processing a first audio signal and a second audio signal respectively input from a first microphone and a second microphone arranged substantially symmetrically with respect to a specific speaker,
Voice power calculation means for calculating the power of the added signal obtained by adding the first voice signal and the second voice signal to obtain voice power;
Noise power calculating means for calculating the power of the difference signal obtained by subtracting the second audio signal from the first audio signal to obtain noise power;
Utterance state estimating means for comparing the difference between the voice power and the noise power with the noise power to estimate the presence or absence of an utterance;
An audio processing device comprising: