JP2004289762A

JP2004289762A - Method of processing sound signal, and system and program therefor

Info

Publication number: JP2004289762A
Application number: JP2003119116A
Authority: JP
Inventors: Ko Amada; 皇天田; Hiroshi Kanazawa; 博史金澤; Hitoshi Nagata; 仁史永田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-01-29
Filing date: 2003-04-23
Publication date: 2004-10-14
Anticipated expiration: 2023-04-23
Also published as: JP4247037B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound signal processor capable of suppressing noise and emphasizing the target sound component in real noise environments that include sudden noise. SOLUTION: A cross-correlation coefficient calculation unit 102 calculates the cross-correlation coefficient between the multi-channel input sound signals output from a plurality of spatially separated microphones 101-1 to 101-M. A signal integration unit 103 integrates the multi-channel input sound signals into one channel. A gain control unit 104 adjusts the magnitude of the resulting integrated sound signal according to the cross-correlation coefficient, thereby generating an output sound signal 105 in which the target sound component is emphasized. COPYRIGHT: (C)2005, JPO & NCIPI

Description

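[0001]
[Technical field of the invention]
The present invention relates to a sound signal processing method, apparatus, and program for processing input sound signals obtained from a plurality of microphones. More specifically, it relates to a technique that emphasizes and outputs a target sound signal from the input sound signals, as one of the noise suppression techniques used in, for example, hands-free telephony and speech recognition.
[0002]
[Prior art]
In the field of sound signal processing, noise countermeasures have become an important issue as speech recognition and mobile phones have come into practical use. Noise suppression techniques include spectral subtraction, which is used with a single microphone and assumes, for example, that the noise is stationary, and microphone array processing, which uses a plurality of microphones. Among microphone array techniques, the adaptive microphone array is promising in terms of cost because it achieves high noise suppression even with a small number of microphones. An adaptive microphone array suppresses noise by automatically steering a null (a direction of low sensitivity) toward the noise direction, and is also called an adaptive beamformer (adaptive BF).
[0003]
An adaptive beamformer is effective against strongly directional noise, but its suppression performance is insufficient for other kinds of noise, for example (1) high-level diffuse noise such as the noise generated while driving a car, (2) noise whose acoustic transfer path changes quickly, such as sound radiated from a car moving at high speed, and (3) noise of very short duration such as sudden noise. Such noise is entirely ordinary in real environments and must be dealt with.
[0004]
Non-Patent Document 1 describes a technique that suppresses noise by filtering based on the coherence function between two channels of input sound signals from a plurality of microphones.
Non-Patent Document 2, in order to deal with highly correlated noise, discloses a technique that estimates the inter-channel cross spectrum of the noise in intervals without target sound and subtracts it from the cross spectrum of the noisy target sound in intervals with target sound.
[0005]
Non-Patent Document 3 describes a method that determines the presence of a target signal by thresholding the coherence function, in order to perform signal detection using, for example, the cross-correlation between multi-channel signals.
Non-Patent Document 4 discloses a method that detects target sound by thresholding the cross-correlation coefficient between the multi-channel sound signals output from a plurality of microphones.
Non-Patent Document 5 describes a method that integrates sound signals of two or more channels into one channel using an adaptive beamformer.
Non-Patent Document 6 discloses a method that uses a weighting function to obtain the maximum-likelihood estimate of the generalized cross-correlation function between channels of multi-channel sound signals.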
[0006]
[Non-Patent Document 1]
"Using the coherence function for noise reduction", IEE Proceedings-I, Vol. 139, No. 3, 1992
[0007]
[Non-Patent Document 2]
"Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator", IEEE Trans. on Speech and Audio Processing, Vol. 5, No. 5, 1997
[0008]
[Non-Patent Document 3]
"Knowing the Wheat from the Weeds in Noisy Speech", H. Agaiby and T. J. Moir, Proc. of EUROSPEECH '97, vol. 3, pp. 111-112, 1997
[0009]
[Non-Patent Document 4] "A study on target sound detection using two directional microphones" (in Japanese), Nagata et al., Journal of the Institute of Electronics, Information and Communication Engineers, Vol. J83-A, No. 2 (2000)
[0010]
[Non-Patent Document 5]
"The adaptive filter theory", Haykin, Prentice Hall
[0011]
[Non-Patent Document 6]
"The Generalized Correlation Method for Estimation of Time Delay", C. H. Knapp and G. C. Carter, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-24, No. 4, pp. 320-327, 1976
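[0012]
[Problems to be solved by the invention]
The technique described in Non-Patent Document 1 is effective against noise that can be assumed to be uncorrelated between channels, such as the diffuse noise of (1). However, it cannot suppress the sudden noise of (3), or directional noise of the kind a beamformer can suppress, because such noise is highly correlated between channels. The technique described in Non-Patent Document 2 can suppress noise that is highly correlated between channels, but it is effective only when the noise is directional and can be assumed to be stationary. In such a noise environment, a method that steers a directivity null toward the noise source, as a beamformer does, copes better.
[0013]
An object of the present invention is to provide a sound signal processing method, apparatus, and program capable of suppressing noise and emphasizing the target sound component under real-environment noise that includes sudden noise.
[0014]
Another object of the present invention is to detect with high accuracy whether or not target sound is arriving.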
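[0015]
[Means for solving the problems]
To solve the above problems, according to a first aspect of the present invention, the cross-correlation coefficient between the multi-channel input sound signals output from a plurality of spatially separated microphones is obtained. The magnitude of the integrated sound signal obtained by integrating the input sound signals into one channel is adjusted according to the cross-correlation coefficient, which generates an output sound signal in which the target sound component is emphasized.
[0016]
In a second aspect of the present invention, the multi-channel input sound signals output from the microphones are frequency-analyzed to generate multi-channel spectrum information, and the cross-correlation coefficient between the multi-channel spectrum information is obtained. The magnitude of the integrated spectrum signal obtained by integrating the spectrum information into one channel is adjusted according to the cross-correlation coefficient, which yields a spectrum signal in which the target sound component is emphasized.
[0017]
In a third aspect of the present invention, the multi-channel input sound signals output from the microphones are frequency-analyzed to generate multi-channel spectrum information, and the power spectrum of each channel and the cross spectrum between channels of the input sound signals are obtained from this spectrum information. The coherence function between the spectrum information of the channels is then obtained from the power spectra and the cross spectrum. Next, the power spectra and the cross spectrum are modified using the coherence function, and a cross-correlation coefficient between channels of the input sound signals, weighted on the basis of the modified power spectra and cross spectrum, is obtained.
[0018]
In a fourth aspect of the present invention, the multi-channel input sound signals output from the microphones are frequency-analyzed to generate multi-channel spectrum information, and the power spectrum of each channel and the cross spectrum between channels of the input sound signals are obtained from this spectrum information. The coherence function between the spectrum information of the channels is obtained from the power spectra and the cross spectrum, and power information relating to the signal power between channels of the input sound signals is obtained from the spectrum information. Next, the power spectra and the cross spectrum are modified using the coherence function and the power information, and a cross-correlation coefficient between channels of the input sound signals, weighted on the basis of the modified power spectra and cross spectrum, is obtained.
[0019]
In the third or fourth aspect, whether or not target sound is arriving at the microphones may be determined by thresholding the cross-correlation coefficient with a predetermined threshold. The spectrum information may be integrated into one channel to obtain an integrated spectrum signal, and the magnitude of this integrated spectrum signal may be adjusted according to the cross-correlation coefficient. Each frequency component of the integrated spectrum signal may be weighted according to the coherence function. At least one of the phase and the amplitude of the multi-channel spectrum information may be corrected, according to the cross-correlation coefficient, so that it matches between channels.
[0020]
In the third and fourth aspects, the plurality of microphones may include at least one omnidirectional microphone and at least one directional microphone, or may include at least two directional microphones whose directivity axes point in different directions. In the latter case, the at least two directional microphones are preferably arranged so that their directivity axes do not lie in the same plane and the angles between the directivity axes and the arrival direction of the target sound are equal.
[0021]
According to yet another aspect of the present invention, the following programs for executing the above sound signal processing on a computer, or storage media storing the programs, are provided.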
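[0022]
(1) A program that causes a computer to perform: a process of obtaining the cross-correlation coefficient between multi-channel input sound signals output from a plurality of spatially separated microphones; a process of integrating the input sound signals into one channel and outputting an integrated sound signal; and a process of generating an output sound signal by adjusting the magnitude of the integrated sound signal according to the cross-correlation coefficient.
[0023]
(2) A program that causes a computer to perform: a process of frequency-analyzing multi-channel input sound signals output from a plurality of spatially separated microphones to generate multi-channel spectrum information; a process of obtaining the cross-correlation coefficient between the multi-channel spectrum information; an integration process of integrating the spectrum information into one channel to generate an integrated spectrum signal; and a process of adjusting the magnitude of the integrated spectrum signal according to the cross-correlation coefficient.
[0024]
(3) A program that causes a computer to perform: a process of frequency-analyzing multi-channel input sound signals output from a plurality of spatially separated microphones to generate multi-channel spectrum information; a process of obtaining, from the spectrum information, the power spectrum of each channel and the cross spectrum between channels of the input sound signals; a process of obtaining the coherence function between the multi-channel spectrum information from the power spectra and the cross spectrum; a process of modifying the power spectra and the cross spectrum using the coherence function; and a process of obtaining a cross-correlation coefficient between channels of the input sound signals, weighted on the basis of the modified power spectra and cross spectrum.
[0025]
(4) A program that causes a computer to perform: a process of frequency-analyzing multi-channel input sound signals output from a plurality of spatially separated microphones to generate multi-channel spectrum information; a process of obtaining, from the spectrum information, the power spectrum of each channel and the cross spectrum between channels of the input sound signals; a process of obtaining the coherence function between the multi-channel spectrum information from the power spectra and the cross spectrum; a process of obtaining, on the basis of the spectrum information, power information relating to the signal power between channels of the input sound signals; a process of modifying the power spectra and the cross spectrum using the coherence function and the power information; and a process of obtaining a cross-correlation coefficient between channels of the input sound signals, weighted on the basis of the modified power spectra and cross spectrum.
[0026]
(5) A program that causes a computer to perform: a process of frequency-analyzing multi-channel input sound signals, output from a plurality of spatially separated microphones in response to sound input to the microphones, to generate multi-channel spectrum information; a process of calculating, from the spectrum information, the power spectrum of each channel and the cross spectrum between channels of the input sound signals; a process of calculating the inter-channel coherence function of the multi-channel spectrum information from the power spectra and the cross spectrum; a process of generating, for a group of virtual arrival directions consisting of a plurality of virtual arrival directions of sound, correction coefficients for correcting sound arriving from each virtual arrival direction so that it matches across the channels; a process of correcting the power spectra and the cross spectrum on the basis of the correction coefficients to generate corrected power spectra and a corrected cross spectrum; a process of calculating, on the basis of the corrected power spectra and the corrected cross spectrum, power information relating to the signal power between channels of the input sound signals; a process of weighting the corrected power spectra and the corrected cross spectrum on the basis of the coherence function and the power information and calculating, for each virtual arrival direction, the cross-correlation coefficient between channels of the input sound signals corresponding to the group of virtual arrival directions; and a process of detecting, on the basis of the cross-correlation coefficients, the source direction of the sound input to the microphones and outputting the value of the cross-correlation coefficient in the detected source direction as a source correlation coefficient.
[0027]
(6) A program that causes a computer to perform: a process of frequency-analyzing multi-channel input sound signals output from a plurality of spatially separated microphones to generate multi-channel spectrum information; a process of calculating, from the spectrum information, the power spectrum of each channel and the cross spectrum between channels of the input sound signals; a process of calculating the inter-channel coherence function of the multi-channel spectrum information from the power spectra and the cross spectrum; a process of integrating the plurality of spectrum information into one channel to generate an integrated spectrum signal; a process of calculating the power spectrum of the integrated spectrum signal; and a process of weighting the cross spectrum on the basis of the coherence function and normalizing the weighted cross spectrum on the basis of the power spectrum of the integrated signal to calculate a gain coefficient.
[0028]
(7) A program that causes a computer to perform: a process of frequency-analyzing multi-channel input sound signals output from a plurality of spatially separated microphones to generate multi-channel spectrum information; a process of calculating, from the spectrum information, the power spectrum of each channel and the cross spectrum between channels of the input sound signals; a process of calculating the coherence function between the channels from the cross spectrum between the channels and the power spectrum of each channel; a process of generating, for a group of virtual arrival directions consisting of a plurality of virtual arrival directions of sound, correction coefficients for correcting sound arriving from each virtual arrival direction so that it matches across the channels; a process of correcting the power spectra and the cross spectrum on the basis of the correction coefficients to generate corrected power spectra and a corrected cross spectrum; a process of calculating, on the basis of the corrected power spectra and the corrected cross spectrum, power information relating to the signal power between channels of the input sound signals; a process of calculating, on the basis of the corrected power spectra and the corrected cross spectrum, the power spectrum of the integrated spectrum information obtained by correcting the multi-channel spectrum information with the correction coefficients and then integrating it; a process of weighting the corrected cross spectrum on the basis of the coherence function and the power information and further normalizing it on the basis of the virtual integrated power spectrum, thereby obtaining a gain coefficient for each virtual arrival direction; and a process of detecting, on the basis of the gain coefficients, the source direction of the sound input to the microphones and outputting the value of the gain coefficient corresponding to the detected source direction as a source gain coefficient.
[0029]
(8) A program that causes a computer to execute: a process of frequency-analyzing multi-channel input sound signals output from a plurality of spatially separated microphones to generate multi-channel spectrum information; a process of calculating, with the multi-channel spectrum information as input, a first modified cross-correlation coefficient between channels of the multi-channel input sound signals; a process of adaptively correcting the inter-channel difference of the multi-channel spectrum information on the basis of the first modified cross-correlation coefficient to generate corrected spectrum information; and a process of calculating a second modified cross-correlation coefficient from the corrected spectrum information, wherein the calculation of the first and second modified cross-correlation coefficients includes (a) a process of calculating, from the spectrum information, the power spectrum of each channel and the cross spectrum between channels of the input sound signals, (b) a process of calculating the inter-channel coherence function of the multi-channel spectrum information from the power spectra and the cross spectrum, (c) a process of calculating, from the power spectra, power information relating to the signal power between channels of the input sound signals, and (d) a process of weighting the power spectra and the cross spectrum on the basis of the coherence function and the power information to calculate the cross-correlation coefficient between channels of the input sound signals, and outputting it as the first or second modified cross-correlation coefficient.
[0030]
(9) A program that causes a computer to perform: a process of frequency-analyzing multi-channel input sound signals output from a plurality of spatially separated microphones to generate multi-channel first spectrum information; a process of calculating a first modified gain coefficient from the first spectrum information; a process of adaptively correcting the inter-channel difference of the first spectrum information on the basis of the first gain coefficient to generate second spectrum information; and a process of calculating a second modified gain coefficient from the second spectrum information, wherein the calculation of the first and second modified gain coefficients includes (a) a process of calculating, from the first or second spectrum information, the power spectrum of each channel and the cross spectrum between channels of the input sound signals, (b) a process of calculating the inter-channel coherence function of the multi-channel spectrum information from the power spectra and the cross spectrum, (c) a process of calculating, from the power spectra, power information relating to the signal power between channels of the input sound signals, (d) a process of integrating the plurality of spectrum information into one channel to generate an integrated spectrum signal, (e) a process of calculating the power spectrum of the integrated spectrum signal, and (f) a process of weighting the cross spectrum on the basis of the coherence function and the power information and normalizing the weighted cross spectrum on the basis of the power spectrum of the integrated spectrum signal to calculate the first or second gain coefficient.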
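[0031]
[Embodiments of the invention]
Embodiments of the present invention are described below with reference to the drawings. The sound signal processing in each embodiment can be implemented as software (including firmware) running on a computer, or in hardware.
[0032]
(First Embodiment)
FIG. 1 shows the configuration of the signal processing apparatus according to the first embodiment of the present invention. A plurality of microphones 101-1 to 101-M pick up an acoustic signal containing target sound, such as a speaker's input speech, and output input sound signals of a plurality (M) of channels. Here, the target sound is the component of the input sound that is ultimately to be extracted as output sound after noise suppression. The input sound signals from the microphones 101-1 to 101-M are converted into digital signals by A/D converters (not shown) and then input to the cross-correlation coefficient calculation unit 102 and the signal integration unit 103.
[0033]
The cross-correlation coefficient calculation unit 102 calculates the cross-correlation coefficient between the M-channel input sound signals. The signal integration unit 103 integrates the M-channel input sound signals into one channel. The signal output from the signal integration unit 103 is called the integrated sound signal. The integrated sound signal is input to the gain control unit 104, whose gain is controlled according to the cross-correlation coefficient, and its magnitude is adjusted there. The gain control unit 104 thus outputs an output sound signal 105 in which the target sound component is emphasized.
[0034]
In general, the cross-correlation coefficient calculated for multi-channel observed signals has long been used in sonar and radar processing as a measure for detecting a target signal in noise. This embodiment proposes using it in sound signal processing not only to detect target sound but also to emphasize it. This method suppresses noise effectively even in environments with noise that is uncorrelated between channels.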
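[0035]
The cross-correlation coefficient in this embodiment is the value ρ calculated by the following equation when the input sound signals are the two channels x(n) and y(n).
[0036]
[Equation 1]

[0037]
Here, an overlined value denotes an expected value or a time average (the same applies hereafter).
[0038]
When the input sound signals have M channels (that is, not limited to two channels), the cross-correlation coefficient ρ is calculated, for example, by the following equation.
[0039]
[Equation 2]

[0040]
Here, x_p(n) and x_q(n) are the input sound signals of the p-th and q-th channels, and K = M(M-1)/2.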
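[0041]
Conventionally, the inter-channel cross-correlation of multi-channel signals has been used for signal detection. For example, a method that determines the presence of a target signal by thresholding the coherence function is disclosed in Non-Patent Document 3: "Knowing the Wheat from the Weeds in Noisy Speech", H. Agaiby and T. J. Moir, Proc. of EUROSPEECH '97, vol. 3, pp. 111-112, 1997.
[0042]
The cross-correlation coefficient is also used for speech detection. A method that detects target sound by thresholding this value is disclosed, for example, in Non-Patent Document 4: "A study on target sound detection using two directional microphones" (in Japanese), Nagata et al., Journal of the Institute of Electronics, Information and Communication Engineers, Vol. J83-A, No. 2 (2000). This embodiment is characterized in that the cross-correlation is used to emphasize target sound rather than to detect it by thresholding.
[0043]
The cross-correlation coefficient ρ takes a value close to 1 when target sound is present in the input sound and a value close to 0 when only noise is present. To use it for speech enhancement, it therefore suffices to control the gain applied to the integrated sound signal according to the magnitude of the cross-correlation coefficient. That is, for the multi-channel input sound signals obtained from the microphones 101-1 to 101-M, the cross-correlation coefficient calculation unit 102 calculates the inter-channel cross-correlation coefficient according to equation (1-1) or (1-2). The gain of the gain control unit 104 is controlled on the basis of this cross-correlation coefficient, and the gain control unit 104 adjusts the amplitude of the integrated sound signal from the signal integration unit 103 to generate the output sound signal 105.
[0044]
The cross-correlation coefficient ρ ranges from -1 to +1. The gain control unit 104 therefore either uses the absolute value of the cross-correlation coefficient or sets the coefficient to 0 when it is negative. Gain control in the gain control unit 104 is performed, for example, by multiplying the amplitude of the integrated sound signal by the cross-correlation coefficient obtained in this way. In this case, the relationship between the cross-correlation coefficient and the gain may be the proportional relationship shown by straight line (A) in FIG. 2, or a relationship such as polygonal line (B) or curve (C) in FIG. 2.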
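The gain rule of this embodiment can be illustrated with a minimal Python sketch. The equation images are not reproduced in this text, so the sketch assumes the usual normalized zero-lag form ρ = mean(x·y) / sqrt(mean(x²)·mean(y²)) for the two-channel coefficient, uses simple addition for the integration, and applies the proportional mapping (straight line (A) in FIG. 2). The function names and the `mode` parameter are illustrative, not taken from the patent.

```python
import math


def cross_correlation_coefficient(x, y):
    """Normalized zero-lag cross-correlation of two equal-length blocks (assumed form)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0.0 else 0.0


def gain_from_rho(rho, mode="abs"):
    """Map rho in [-1, 1] to a gain in [0, 1]: absolute value, or negative values set to 0."""
    if mode == "abs":
        return abs(rho)
    return max(rho, 0.0)


def enhance(x, y, mode="abs"):
    """Integrate two channels by simple addition and scale by the rho-based gain."""
    g = gain_from_rho(cross_correlation_coefficient(x, y), mode)
    return [g * (a + b) for a, b in zip(x, y)]
```

With identical channels the gain is 1 and the sum passes unchanged; with uncorrelated channels the gain approaches 0. The non-linear mappings (B) and (C) of FIG. 2 could replace `gain_from_rho` without changing the rest.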
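[0045]
Next, the processing flow of this embodiment is described with reference to FIG. 3.
First, sound signals are input from the microphones 101-1 to 101-M (step S11). Taking the case of two microphones as an example, the two microphones 101-1 and 101-2 are placed about 10 cm apart, as shown in FIG. 4, so that the target sound source is equidistant from both. Each of the microphones 101-1 and 101-2 may be directional or omnidirectional. The sampling frequency of the A/D converters that digitize the input sound signals is, for example, 11 kHz, but other frequencies may be used.
[0046]
Next, the cross-correlation coefficient ρ is calculated by equation (1-1) or (1-2). To account for the time variation of ρ, the coefficient is obtained at suitable intervals, for example every N = 128 samples. If the time average is taken over, for example, L points before and L points after the time of interest (2L points in total) and equation (1-1) is applied, ρ is given by the following equation.
[0047]
[Equation 3]

[0048]
Here, k is the index of the cross-correlation coefficient, and one value of ρ is obtained for every N samples of the input sound signal waveform.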
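The frame-by-frame evaluation described above can be sketched as follows. The exact index range of the missing equation is not recoverable from this text, so the sketch assumes a window of `half_window` samples on each side of sample k·N, truncated at the signal boundaries. `hop` corresponds to N and `half_window` to L; both names are illustrative.

```python
import math


def rho_block(x, y):
    """Normalized zero-lag cross-correlation of one block (assumed form of equation (1-1))."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0.0 else 0.0


def framewise_rho(x, y, hop=128, half_window=128):
    """Return one coefficient rho(k) per `hop` samples, averaged over about 2*half_window points."""
    rhos = []
    for center in range(0, len(x), hop):
        lo = max(0, center - half_window)        # truncate the window at the start
        hi = min(len(x), center + half_window)   # and at the end of the signal
        rhos.append(rho_block(x[lo:hi], y[lo:hi]))
    return rhos
```

For a 512-sample signal with `hop=128`, this yields four coefficients, one per 128 input samples, as stated in paragraph [0048].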
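[0049]
When equation (1-2) is used, the correlation coefficient ρ is obtained in the same way by the following equation.
[0050]
[Equation 4]

[0051]
Here, K = M(M-1)/2.
[0052]
Next, the signal integration unit 103 integrates the multi-channel input sound signals into one channel. The processing of the signal integration unit 103 may be, for example, simple addition, or, as shown in FIG. 5, processing by an adaptive beamformer 106 that operates in the time domain and has a noise suppression function. If the signal integration unit 103 performs simple addition, the integrated sound signal z(n) is obtained by the following equation.
[0053]
[Equation 5]

[0054]
When an adaptive beamformer 106 is used in the signal integration unit 103 as shown in FIG. 5, for example the well-known two-channel Griffiths-Jim beamformer with an LMS adaptive filter, the integrated sound signal z(n) is obtained by the following equations.
[0055]
[Equation 6]

[0056]
Here, U(n) is a vector of T values of the difference between the input sound signals x and y, W(n) = [w1(n), w2(n), ..., wT(n)] is the vector of LMS adaptive filter coefficients after n updates, d(n) is the sum signal of the input sound signals x and y, and (·) denotes the inner product. D is a delay, for which T/2 is used, for example. μ is the step size, for which 0.1 may be used, for example. Extension to M channels is straightforward: a method of obtaining a sound signal integrated into one channel using M-1 adaptive beamformers is described in detail in, for example, Non-Patent Document 5: "The adaptive filter theory", Haykin, Prentice Hall, so a detailed description is omitted here.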
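The symbols defined above match the textbook two-channel Griffiths-Jim structure, sketched below in Python. Because the equation image is missing, the update form is an assumption: z(n) = d(n-D) - W(n)·U(n) followed by W(n+1) = W(n) + μ·z(n)·U(n). The function name and the zero-padding before the first sample are illustrative choices.

```python
def griffiths_jim(x, y, taps=8, mu=0.1):
    """Two-channel Griffiths-Jim beamformer with an LMS filter (textbook form, assumed)."""
    delay = taps // 2                        # D = T/2, as suggested in the text
    w = [0.0] * taps                         # adaptive filter coefficients W(n)
    diff = [a - b for a, b in zip(x, y)]     # blocking path: the target cancels here
    summ = [a + b for a, b in zip(x, y)]     # fixed path: sum signal d(n)
    out = []
    for n in range(len(x)):
        # U(n): the T most recent difference samples (zeros before the signal starts)
        u = [diff[n - i] if n - i >= 0 else 0.0 for i in range(taps)]
        d = summ[n - delay] if n - delay >= 0 else 0.0
        z = d - sum(wi * ui for wi, ui in zip(w, u))    # output z(n)
        w = [wi + mu * z * ui for wi, ui in zip(w, u)]  # LMS coefficient update
        out.append(z)
    return out
```

When both channels carry the same signal, the difference path is zero and the output is simply the delayed sum, which is the intended behavior for a target that is equidistant from both microphones. A fixed step size such as 0.1 is stable only for suitably scaled inputs; a normalized LMS update is a common alternative.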
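[0057]
Finally, the integrated sound signal z(n) is multiplied by a gain based on the cross-correlation coefficient ρ to adjust its magnitude, and the result is output as the output sound signal 105. The processing of steps S11 to S14 is repeated each time a frame of the digitized sound signal is input in step S11.
[0058]
According to this embodiment, the magnitude of the integrated sound signal, obtained by integrating the multi-channel input sound signals into one channel, is adjusted according to the cross-correlation coefficient between the input sound signals of the channels. This yields an output sound signal in which weakly correlated noise is suppressed and the highly correlated target sound component is emphasized.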
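[0059]
(Second Embodiment)
FIG. 6 shows the configuration of the sound signal processing apparatus according to the second embodiment of the present invention. This embodiment implements, in the frequency domain, sound signal processing equivalent to the time-domain processing described in the first embodiment. In FIG. 6, the input sound signals from the microphones 101-1 to 101-M are converted into digital signals by A/D converters (not shown), and the frequency analysis unit 201 then analyzes their frequency components and generates spectrum information representing the frequency spectrum. The frequency analysis unit 201 is implemented, for example, by the well-known FFT (fast Fourier transform), by a DFT (discrete Fourier transform), or by a band-pass filter bank in which a plurality of band-pass filters with different passbands are arranged in parallel. The spectrum information output from the frequency analysis unit 201 is input to the cross-correlation coefficient calculation unit 202 and the signal integration unit 203.
[0060]
The cross-correlation coefficient calculation unit 202 calculates the cross-correlation coefficient between the M-channel spectrum information, that is, the frequency-domain cross-correlation coefficient. In other words, in this embodiment the inter-channel cross-correlation coefficient of the M-channel input sound signals is obtained from the spectrum information. The signal integration unit 203 integrates the M-channel spectrum information into one channel. As in the first embodiment, the processing of the signal integration unit 203 may be, for example, simple addition, or processing by a Griffiths-Jim adaptive beamformer using an adaptive filter that operates in the frequency domain. The signal output from the signal integration unit 203 is called the integrated spectrum signal.
[0061]
The integrated spectrum signal output from the signal integration unit 203 is input to the gain control unit 204, whose gain is controlled according to the cross-correlation coefficient, and its magnitude is adjusted there. The gain control unit 204 thus outputs a spectrum signal 205 in which the target sound component is emphasized. As in the first embodiment, the frequency-domain cross-correlation coefficient obtained by the cross-correlation coefficient calculation unit 202 takes a value close to 1 when target sound is present and a value close to 0 when only noise is present. To use it for emphasizing target sound, it therefore suffices to control the gain applied to the integrated spectrum signal according to the magnitude of the cross-correlation coefficient.
[0062]
The spectrum signal 205 in which the target sound component is emphasized is, as necessary, subjected by the inverse transform unit 206 to the inverse of the transform of the frequency analysis unit 201, that is, a transform from the frequency domain to the time domain, which generates an output sound signal 207 in which the target sound component is emphasized. When the frequency analysis unit 201 is, for example, an FFT, the inverse transform unit 206 is implemented by the inverse FFT.
[0063]
When the input sound signals are the two channels x(n) and y(n), the cross-correlation coefficient calculation unit 202 calculates ρ given by the following equation as the frequency-domain cross-correlation coefficient.
[0064]
[Equation 7]

[0065]
Here, Wxy(f) is the cross spectrum between the input sound signals x(n) and y(n), Wxx(f) and Wyy(f) are the power spectra of the input sound signals x(n) and y(n), and L is the number of frequency components in the discrete Fourier transform (DFT).
[0066]
As is well known, letting X(f) be the discrete Fourier transform of x(n) and Y(f) be the discrete Fourier transform of y(n), the cross spectrum and the power spectra can be calculated as
[Equation 8]

Here, an overlined value denotes a time average and * denotes the complex conjugate. The DFT length can be, for example, 256 points, in which case L = 256. An equivalent result is obtained by setting L = 128 and taking the real part of the resulting complex cross-correlation coefficient.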
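The following Python sketch illustrates the frequency-domain coefficient for a single frame. It assumes the conventional definitions Wxy(f) = X(f)·Y*(f), Wxx(f) = |X(f)|², Wyy(f) = |Y(f)|² (without the time averaging of the original) and the form ρ = Σ Wxy(f) / sqrt(Σ Wxx(f) · Σ Wyy(f)); the equation images are not reproduced here, so these forms are assumptions. A naive DFT is used only to keep the example self-contained.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform, sufficient for a short illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def spectral_rho(x, y):
    """Frequency-domain cross-correlation coefficient over all L DFT bins (assumed form)."""
    X, Y = dft(x), dft(y)
    w_xy = [a * b.conjugate() for a, b in zip(X, Y)]   # cross spectrum
    w_xx = [abs(a) ** 2 for a in X]                    # power spectrum of x
    w_yy = [abs(b) ** 2 for b in Y]                    # power spectrum of y
    den = math.sqrt(sum(w_xx) * sum(w_yy))
    # For real inputs the imaginary parts cancel over the full spectrum.
    return sum(w_xy).real / den if den > 0.0 else 0.0
```

By Parseval's relation, summing over all L bins gives the same value as the time-domain coefficient of the first embodiment, which is the sense in which the two embodiments are equivalent.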
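[0067]
When the input sound signals have M channels (that is, not limited to two channels), the cross-correlation coefficient ρ is likewise calculated, for example, by the following equation.
[Equation 9]

[0068]
Here, Wij(f) is the cross spectrum between the input sound signals xi(n) and xj(n), and Wii(f) and Wjj(f) are the power spectra of the input sound signals xi(n) and xj(n).
[0069]
In this way, the multi-channel input sound signals obtained from the microphones 101-1 to 101-M are converted into spectrum information by the frequency analysis unit 201, and the cross-correlation coefficient calculation unit 202 then calculates the inter-channel cross-correlation coefficient ρ according to equation (2-1) or (2-2).
[0070]
Meanwhile, the signal integration unit 203 integrates the multi-channel spectrum information obtained by the frequency analysis unit 201 into one channel to obtain the integrated spectrum signal Z(f). When simple addition is used in the signal integration unit 203, the integrated spectrum signal Z(f) is obtained as
[Equation 10]

[0071]
When an adaptive beamformer is used, for example the well-known two-channel Griffiths-Jim beamformer, the integrated spectrum signal Z(f) is obtained by the following equations.
[Equation 11]

[0072]
Here, k is the frame number, U is the difference spectrum between channels, D is the sum spectrum, Z is the output spectrum, W is the complex filter coefficient, μ is the step size, and (*) denotes the complex conjugate.
[0073]
Next, the gain of the gain control unit 204 is controlled on the basis of the cross-correlation coefficient ρ, and the gain control unit 204 adjusts the magnitude (amplitude) of the integrated spectrum signal from the signal integration unit 203 to generate the spectrum signal 205 in which the target sound component is emphasized. Gain control in the gain control unit 204 can be performed, for example, by multiplying the amplitude of the integrated spectrum signal by the cross-correlation coefficient ρ, or, as in the first embodiment, by using a function such as those shown in FIG. 2 (A), (B), and (C). The cross-correlation coefficient ρ may be negative, in which case its absolute value or 0 can be used for gain control.
[0074]
FIG. 7 shows the processing flow of this embodiment. The flow is basically the same as in the first embodiment except that a frequency analysis step S22 is added after the sound signal input step S21. That is, after frequency analysis (for example, FFT) in step S22, calculation of the cross-correlation coefficient (step S23), integration of the spectrum information (step S24), and gain control of the integrated spectrum signal by the correlation coefficient (step S25) are performed in turn to generate a spectrum signal in which the target sound component is emphasized. Finally, as necessary, an inverse transform (for example, inverse FFT) is performed in step S26 to obtain an output sound signal in which the target sound component is emphasized. The processing of steps S21 to S26 is repeated each time a frame of the digitized sound signal is input in step S21.
[0075]
According to this embodiment, a spectrum signal or output sound signal is obtained in which weakly correlated noise is suppressed and the highly correlated target sound is emphasized. In addition, because the correlation coefficient calculation and the signal integration are performed in the frequency domain, the amount of computation can be reduced compared with the first embodiment, in which they are performed in the time domain.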
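[0076]
(Third Embodiment)
FIG. 8 shows the configuration of the sound signal processing apparatus according to the third embodiment of the present invention. This embodiment provides a method of calculating the activity of the target signal (the signal of the target sound) using a weighted cross-correlation coefficient. The target signal activity calculated in this way is useful, for example, for detecting and emphasizing target sound.
[0077]
In this embodiment, as in the second embodiment, the multi-channel input sound signals from the microphones 101-1 to 101-M are first converted by the frequency analysis unit 201 into frequency-domain signals, that is, spectrum information containing a plurality of frequency components, and then input to the target signal activity calculation unit 300. The target signal activity calculation unit 300 has a cross/power spectrum calculation unit 301, a coherence function calculation unit 302, a power information calculation unit 303, a modified spectrum calculation unit 304, and a weighted cross-correlation coefficient calculation unit 305.
[0078]
The cross/power spectrum calculation unit 301 calculates the power spectrum of each channel and the cross spectrum between channels from the multi-channel frequency components. The coherence function calculation unit 302 calculates the coherence function from the power spectra and the cross spectrum. The power information calculation unit 303 calculates, from the power spectra, power information relating to the signal power between channels of the input sound signals. The modified spectrum calculation unit 304 modifies the power spectra and the cross spectrum using the coherence function and the power information. The weighted cross-correlation coefficient calculation unit 305 calculates, as the target signal activity, a cross-correlation coefficient weighted according to the spectra modified by the modified spectrum calculation unit 304.
[0079]
Next, the processing flow of this embodiment is described with reference to FIG. 9. The steps from the sound signal input step S31 to the frequency analysis step S32 are the same as in the second embodiment: the multi-channel input sound signals are converted frame by frame into frequency-domain signals (spectrum information).
[0080]
Next, the power spectrum of each channel and the cross spectrum between channels are calculated from the spectrum information obtained by the frequency analysis (step S33). The coherence function and the power information are then calculated using the power spectra and the inter-channel cross spectrum (steps S34 to S35). Next, modified spectra are calculated on the basis of the coherence function and the power information (step S36). A weighted cross-correlation coefficient is calculated on the basis of the modified spectra and output as the target signal activity (step S37). The processing of steps S31 to S37 is repeated each time a frame of the digitized sound signal is input in step S31.
[0081]
This embodiment is characterized in that the cross-correlation coefficient is modified to improve its robustness to noise. The ordinary cross-correlation coefficient performs well at detecting target sound when the noise is uncorrelated between channels, but it is poor at distinguishing the arrival of noise that is correlated between channels from the arrival of target sound. According to this embodiment, the ability to distinguish target sound from noise is greatly improved even when correlated noise arrives.
[0082]
Annoying large-amplitude noise is usually highly correlated between channels, so the method of this embodiment is well suited to suppressing it. The output target signal activity is a measure of whether target sound is present in the input sound, and it is an essential element of the speech detection and speech enhancement in the following embodiments.
[0083]
Next, the specific calculation methods in the cross/power spectrum calculation unit 301, the coherence function calculation unit 302, the power information calculation unit 303, the modified spectrum calculation unit 304, and the weighted cross-correlation coefficient calculation unit 305 are described. First, the cross/power spectrum calculation unit 301 calculates the cross spectrum between channels and the power spectrum of each channel according to equation (2-2). Next, when the input sound signals are the two channels x and y, the coherence function calculation unit 302 calculates the coherence function γ(f) according to the following equation.
[Equation 12]

[0084]
Here, Wxy(f) is the cross spectrum between the two channels, and Wxx(f) and Wyy(f) are the power spectra of the respective channels.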
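A minimal Python sketch of the coherence calculation follows. The equation image is missing; since the text later calls γ²(f) the squared coherence, the sketch assumes the magnitude form γ(f) = |Wxy(f)| / sqrt(Wxx(f)·Wyy(f)), with the spectra averaged over several frames. The function name and the frame-list input layout are illustrative.

```python
import math


def coherence(x_frames, y_frames):
    """Per-bin magnitude coherence from lists of per-frame complex spectra (assumed form)."""
    n_frames = len(x_frames)
    n_bins = len(x_frames[0])
    gamma = []
    for f in range(n_bins):
        # Time averages over frames of the cross spectrum and the two power spectra
        w_xy = sum(X[f] * Y[f].conjugate() for X, Y in zip(x_frames, y_frames)) / n_frames
        w_xx = sum(abs(X[f]) ** 2 for X in x_frames) / n_frames
        w_yy = sum(abs(Y[f]) ** 2 for Y in y_frames) / n_frames
        den = math.sqrt(w_xx * w_yy)
        gamma.append(abs(w_xy) / den if den > 0.0 else 0.0)
    return gamma
```

The averaging over frames matters: with a single frame the ratio is 1 in every bin regardless of the signals, so the coherence only becomes informative once several frames are combined.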
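[0085]
When the input sound signals have M channels (that is, not limited to two channels), the coherence function γij(f) between the i-th and j-th channels is likewise calculated according to the following equation.
[Equation 13]

[0086]
Here, Wij(f) is the cross spectrum between the i-th and j-th channels, and Wii(f) and Wjj(f) are the power spectra of the i-th and j-th channels.
[0087]
The total coherence function γm(f) for M channels is calculated, for example, by the following equation.
[Equation 14]

[0088]
When the input sound signals are the two channels x and y, the power information calculation unit 303 calculates the power information p(f) according to the following equation.
[Equation 15]

[0089]
Here, min[a, b] selects the smaller of a and b, and max[a, b] selects the larger of a and b.
[0090]
When the input sound signals have M channels (that is, not limited to two channels), the power information pij(f) between the i-th and j-th channels is calculated according to the following equation.
[Equation 16]

[0091]
The sensitivity of the power information p(f) and pij(f) calculated in this way to the actual inter-channel power ratio can also be adjusted using a suitable function, as in the following equations.
[Equation 17]

[0092]
Here, pow{a, b} is the power function giving a raised to the b-th power. When β = 1, equations (3-6) and (3-7) are the same as equations (3-4) and (3-5), respectively. Setting β to a value greater than 1 increases the sensitivity to the power ratio.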
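The power information can be sketched in Python as follows. The equation images are missing, so the sketch assumes the ratio form p(f) = min[Wxx(f), Wyy(f)] / max[Wxx(f), Wyy(f)], which is what the min/max definitions above suggest, raised to the power β as in equations (3-6) and (3-7). The value returned for a bin in which both powers are zero is an arbitrary choice made here.

```python
def power_information(w_xx, w_yy, beta=1.0):
    """Per-bin inter-channel power ratio in [0, 1], sharpened by the exponent beta (assumed form)."""
    p = []
    for a, b in zip(w_xx, w_yy):
        hi = max(a, b)
        # Assumption: an empty bin carries no evidence of target sound, so return 0.
        ratio = min(a, b) / hi if hi > 0.0 else 0.0
        p.append(ratio ** beta)
    return p
```

For example, channel powers of 4 and 2 in one bin give p = 0.5 with β = 1 and p = 0.25 with β = 2, while equal powers always give 1. This is how a larger β penalizes sound that arrives with unequal power at the two microphones.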
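[0093]
When the input sound signals have two channels, the modified spectrum calculation unit 304 calculates a cross spectrum and power spectra that are modified, relative to the power spectrum of each channel and the cross spectrum between channels, using the squared coherence function γ^2(f), which is the square of the previously calculated coherence function γ(f), and the power information p(f). The weighted cross-correlation coefficient calculation unit 305 then calculates the weighted cross-correlation coefficient ρ (the target signal activity), weighted according to the modified cross spectrum and power spectra.
[0094]
The calculations in the modified spectrum calculation unit 304 and the weighted cross-correlation coefficient calculation unit 305 are given by the following equations.
[Equation 18]

[0095]
Here, Ψa(f) and Ψb(f) are the weighting functions used in the denominator and the numerator, respectively, of the cross-correlation coefficient formula (3-10). Wxy(f)Ψb(f) is the modified cross spectrum, and Wxx(f)Ψa(f) and Wyy(f)Ψa(f) are the modified power spectra.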
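The weighted coefficient can be sketched in Python under explicit assumptions, since equations (3-8) to (3-10) are not reproduced in this text. The sketch takes Ψa(f) to be the maximum-likelihood weight of Knapp and Carter, γ²(f) / (|Wxy(f)|·(1 - γ²(f))), takes Ψb(f) = p(f)·Ψa(f), and forms ρ = Re Σ Wxy(f)Ψb(f) / sqrt(Σ Wxx(f)Ψa(f) · Σ Wyy(f)Ψa(f)). These forms are guesses consistent with the description, not the patent's confirmed formulas; the small constant `eps` is added here only to avoid division by zero.

```python
import math


def weighted_rho(w_xx, w_yy, w_xy, p, eps=1e-12):
    """Weighted cross-correlation coefficient (target signal activity), assumed form."""
    num = 0.0
    den_x = 0.0
    den_y = 0.0
    for a, b, c, pw in zip(w_xx, w_yy, w_xy, p):
        gamma2 = min(abs(c) ** 2 / (a * b + eps), 1.0)      # squared coherence
        psi_a = gamma2 / (abs(c) * (1.0 - gamma2) + eps)    # ML-type weight (assumption)
        psi_b = pw * psi_a                                  # extra power-ratio weight (assumption)
        num += (c * psi_b).real                             # modified cross spectrum
        den_x += a * psi_a                                  # modified power spectra
        den_y += b * psi_a
    den = math.sqrt(den_x * den_y)
    return num / den if den > 0.0 else 0.0
```

With this form, bins with low coherence contribute little to either sum, and bins with unequal channel powers are further reduced in the numerator only, which lowers ρ for sound arriving from non-target directions.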
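[0096]
Besides the weighting functions of equations (3-8) and (3-9), which use the coherence function, it is also possible to use the simple cross-spectrum whitening weight 1/|Wxy(f)|, as in
[Equation 19]

but in terms of performance it is preferable to use the weighting functions of equation (3-8) or (3-9).
[0097]
When the input sound signals have M channels (that is, not limited to two channels), a cross spectrum and power spectra are likewise calculated that are modified, relative to the power spectrum of each channel and the cross spectrum between channels, using the squared coherence function γij^2(f), which is the square of the previously calculated coherence function γij(f) between the i-th and j-th channels, and the power information pij(f).
[0098]
The weighted cross-correlation coefficient calculation unit 305 then calculates the weighted cross-correlation coefficient ρ (the target signal activity), weighted according to the modified cross spectrum and power spectra. In this case, the calculations in the modified spectrum calculation unit 304 and the weighted cross-correlation coefficient calculation unit 305 are given by the following equations.
[Equation 20]

[0099]
Here, Ψaij(f) and Ψbij(f) are the weighting functions used in the denominator and the numerator, respectively, of the cross-correlation coefficient formula (3-13), and i and j are channel numbers. pij(f) is the power information of equation (3-5) or (3-7), and K = M(M-1)/2.
[0100]
Ψa(f) is known as the weighting function used in the maximum-likelihood estimation of the generalized cross-correlation function, and it is effective in suppressing the influence of noise that is uncorrelated between channels. It is described in detail in, for example, Non-Patent Document 6: "The Generalized Correlation Method for Estimation of Time Delay", C. H. Knapp and G. C. Carter, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-24, No. 4, pp. 320-327 (1976). Note that Non-Patent Document 6 discloses a method of obtaining the cross-correlation function and does not address the cross-correlation coefficient.
In contrast, this embodiment differs significantly in that, for the weighted cross-correlation coefficient, it uses Ψb(f), which is obtained by further modifying the above weighting function Ψa(f) with a weight based on the inter-channel power ratio given by equation (3-6) or (3-7).
[0101]
The above processing effectively suppresses not only noise that is uncorrelated between channels but also correlated noise arriving from directions other than the target direction, so the resulting weighted cross-correlation coefficient accurately reflects the degree to which the target signal is present. The value of the weighted cross-correlation coefficient can therefore be used as the target signal activity. This target signal activity can serve as a key component for improving performance in various applications such as speech detection and speech enhancement.
[0102]
In measuring the target signal activity in this embodiment, the activity may be output separately for each band. For example, DFT points 1 to 128 are divided into 8 equal-width bands on the frequency axis, that is, 128/8 = 16 points each, and 8 target signal activities are output. The way of dividing may be changed as needed. The same applies to the following embodiments.
[0103]
In the above description, the target signal activity is calculated using both the coherence function and the power information, but calculating it using only the coherence function, without the power information, is also effective to some extent. In that case, the power information p(f) or pij(f) calculated by equations (3-4) to (3-7) is simply set to 1.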
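[0104]
(Fourth Embodiment)
FIG. 10 shows the configuration of the sound signal processing apparatus according to the fourth embodiment of the present invention. This embodiment applies the third embodiment to speech detection: the target sound component is detected from the input sound signals by thresholding the target signal activity.
[0105]
The input sound signals from the microphones 101-1 to 101-M are converted by the frequency analysis unit 201 into frequency-domain signals, that is, spectrum information containing the frequency components of the channels, and then input to the target signal activity calculation unit 300. The configuration of the target signal activity calculation unit 300 is as described in the third embodiment.
[0106]
The target signal activity signal 306 output from the target signal activity calculation unit 300 is input to the detection processing unit 401, where it is thresholded to produce a target sound detection status signal 402 indicating whether target sound is present in the input sound signals. Specifically, the detection processing unit 401 outputs "1" as the target sound detection status signal 402 when it determines that a target sound component is present in the input sound signals, and "0" when it determines that none is present.
[0107]
The processing flow of this embodiment is described with reference to FIG. 11. First, the input sound signals input in step S41 are frequency-analyzed (step S42), and the target signal activity is calculated from the resulting spectrum information by the procedure described in the third embodiment (step S43). Finally, the target signal activity is thresholded with a threshold determined in advance according to the purpose, to detect whether a target sound component is present in the input sound signals (step S44). The processing of steps S41 to S44 is repeated each time a frame of the digitized sound signal is input in step S41.
[0108]
Next, the thresholding procedure in the detection processing unit 401 is described with reference to FIG. 12. An example is shown in which the detection threshold is set from the bias and variance of the target signal activity in intervals without target sound.
First, initialization is performed (step S400), and then sound signal input (step S401), frequency analysis (step S402), and target signal activity calculation (step S403) are performed in turn for each frame.
[0109]
Letting ρ(k) be the target signal activity of the k-th frame, the bias and variance of ρ(k) in intervals without target sound (called silent intervals) are estimated. A provisional decision as to whether the frame is in a silent interval is made by comparing |ρ(k) - b(k-1)| with κ (step S404). Here, b(k) is the estimate of the bias of ρ(k), and κ is the decision threshold.
[0110]
If |ρ(k) - b(k-1)| < κ, the frame is judged likely to be silent, and the estimates of the bias b(k) and the variance v(k) are updated using first-order low-pass filters as in the following equations (step S405).
[Equation 21]

[0111]
If instead |ρ(k) - b(k-1)| > κ, target sound is judged likely to be present, and the estimates of the bias b(k) and the variance v(k) are not updated, as in the following equations (step S406).
[Equation 22]

[0112]
Next, the detection threshold h(k) is set by the following equation (step S407).
[Equation 23]

[0113]
Here, ξ is a constant for setting the detection threshold h(k). As a result, if h(k) < ρ(k), the target signal is taken to be present and "1" is output as the target sound detection status signal; otherwise "0" is output (step S408).
Examples of the values of κ, η, η', and ξ required for initialization are as shown in the box of the initialization step S400.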
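The adaptive thresholding of steps S404 to S408 can be sketched in Python. Equations 21 to 23 and the initial values in FIG. 12 are not reproduced in this text, so the sketch assumes first-order recursions b(k) = η·b(k-1) + (1-η)·ρ(k) and v(k) = η'·v(k-1) + (1-η')·(ρ(k) - b(k))², and a threshold h(k) = b(k) + ξ·sqrt(v(k)). All default constants, and the initial values `b0` and `v0`, are hypothetical.

```python
import math


def detect(rhos, kappa=0.3, eta=0.95, eta_v=0.95, xi=3.0, b0=0.0, v0=0.01):
    """Return a 0/1 detection status per frame from target signal activities (assumed recursions)."""
    b, v = b0, v0
    status = []
    for rho in rhos:
        if abs(rho - b) < kappa:
            # Probably a silent frame: track bias and variance with first-order low-pass filters
            b_new = eta * b + (1.0 - eta) * rho
            v = eta_v * v + (1.0 - eta_v) * (rho - b_new) ** 2
            b = b_new
        # Otherwise target sound is probably present: keep b and v unchanged
        h = b + xi * math.sqrt(v)            # detection threshold h(k)
        status.append(1 if h < rho else 0)
    return status
```

Freezing the estimates while the activity deviates strongly from the bias keeps the threshold tied to the noise-only statistics, so a sustained target does not raise the threshold against itself.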
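[0114]
FIG. 13 shows a specific example of the detection processing. From the curve ρ shown in FIG. 13 (A), the time series of the detection status signal shown in FIG. 13 (B) is output. As described in the third embodiment, the target signal activity calculation suppresses both noise that is uncorrelated between channels and noise that is correlated but arrives from a direction different from that of the target sound, and it responds accurately to the target sound alone. High detection performance is therefore achieved when the calculated target signal activity is used as a speech detection parameter as in this embodiment.
[0115]
(Fifth Embodiment)
FIG. 14 shows the configuration of the sound signal processing apparatus according to the fifth embodiment of the present invention. This embodiment applies the third embodiment to speech enhancement. The input sound signals from the microphones 101-1 to 101-M are converted by the frequency analysis unit 201 into frequency-domain signals, that is, spectrum information containing the frequency components of the channels, and then input to the target signal activity calculation unit 300. The configuration of the target signal activity calculation unit 300 is as described in the third embodiment.
[0116]
As in the second embodiment, the spectrum information from the frequency analysis unit 201 is also input to the signal integration unit 203, where it is integrated into one channel to generate an integrated spectrum signal. The integrated spectrum signal output from the signal integration unit 203 is input to the gain control unit 501, whose gain is controlled according to the target signal activity signal (cross-correlation coefficient) 306 output from the target signal activity calculation unit 300, and its magnitude is adjusted there. The gain control unit 501 thus outputs a spectrum signal 502 in which the target sound component is emphasized.
[0117]
The spectrum signal 502 in which the target sound component is emphasized is, as necessary, subjected by the inverse transform unit 503 to the inverse of the transform of the frequency analysis unit 201, that is, a transform from the frequency domain to the time domain, which generates an output sound signal 504 in which the target sound component is emphasized. When the frequency analysis unit 201 is, for example, an FFT, the inverse transform unit 503 is implemented by the inverse FFT.
The sound signal processing apparatus according to this embodiment thus has a configuration in which the cross-correlation coefficient calculation unit 202 of the second embodiment shown in FIG. 6 is replaced by the target signal activity calculation unit 300, which calculates the weighted cross-correlation coefficient.
[0118]
Next, the processing flow of this embodiment is described with reference to FIG. 15. The processing from step S51 to step S53 is the same as that from step S41 to step S43 shown in FIG. 11 and described in the fourth embodiment. After the frequency analysis of step S52, in parallel with the target signal activity calculation of step S53, the multi-channel spectrum information is integrated into one channel to generate an integrated spectrum signal (step S54).
Next, gain control according to the target signal activity obtained in step S53 is applied to the integrated spectrum signal to adjust its amplitude, which generates a spectrum signal in which the target sound component is emphasized (step S55). Finally, as necessary, an inverse transform (for example, inverse FFT) is performed in step S56 to obtain an output sound signal in which the target sound component is emphasized. The processing of steps S51 to S56 is repeated each time a frame of the digitized sound signal is input in step S51.
[0119]
According to this embodiment, as described in the third embodiment, the target signal activity reflects with high accuracy whether target sound is present in the input sound. Using it for speech enhancement that emphasizes the target sound therefore achieves very high performance in a variety of noise environments.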
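[0120]
In the third embodiment it was noted that the target signal activity may be obtained separately for a plurality of frequency bands. In the gain control of this embodiment, the target signal activities for such frequency bands can be used to control the gain for each band. That is, the gain of the integrated signal is controlled for each band used in calculating the target signal activity. For example, when an L-point DFT is used to calculate the spectrum information and the number of bands is B, the target signal activity is calculated from N = L/2/B points per band as follows.
[0121]
[Equation 24]

[0122]
Here, ρ(b) is the target signal activity for band number b, and the range of frequency components used in the calculation for band b is denoted by s(b) and e(b). These values are chosen, for example, as follows.
[Equation 25]

[0123]
This follows from the general regularity of the component numbering in the DFT, in which frequency component numbers f from 2 to L/2 correspond to positive frequencies and f from L/2+1 to L correspond to negative frequencies. Here, f = 1 corresponds to the DC component, which may be taken to be 0 for ordinary waveform signals, so it is excluded from the above formula. The component f = L/2 is the upper limit of the usable frequencies and its magnitude is likewise close to 0, so it is also excluded. Including these components in the calculation causes no problem, of course.
[0124]
Using the target signal activity ρ(b) obtained in this way, gain control of the integrated signal can be performed as follows.
[Equation 26]

[0125]
As described earlier, the absolute value of the target signal activity ρ(b) may be used as in the above equation, or the real part of ρ(b) may be taken and set to 0 when negative, as follows.
[0126]
[Equation 27]

[0127]
With the above method, gain control for emphasizing the target sound component can be performed for each band. When noise is concentrated in a particular band, for example, only that band can then be suppressed, which improves the performance of target sound emphasis.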
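Band-wise gain control can be sketched in Python as follows. The equation images and the exact band boundaries s(b) and e(b) are not reproduced in this text, so the sketch simply assumes equal-width bands over the supplied bins and scales every bin of band b by either |ρ(b)| or max(Re ρ(b), 0). The function and parameter names are illustrative.

```python
def apply_band_gain(spectrum, band_rho, use_abs=True):
    """Scale each equal-width band of an integrated spectrum by a gain derived from rho(b)."""
    n_bands = len(band_rho)
    width = len(spectrum) // n_bands          # bins per band (assumes an even split)
    out = list(spectrum)                      # bins beyond n_bands * width stay unchanged
    for b, rho in enumerate(band_rho):
        # Either the absolute value, or the real part with negative values set to 0
        gain = abs(rho) if use_abs else max(rho.real, 0.0)
        for f in range(b * width, (b + 1) * width):
            out[f] = gain * spectrum[f]
    return out
```

For a four-bin spectrum and activities [0.5, -0.25], the first two bins are halved in both modes, while the last two are scaled by 0.25 with the absolute value and set to zero with the clipped real part.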
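[0128]
(Sixth Embodiment)
FIG. 16 shows the configuration of the sound signal processing apparatus according to the sixth embodiment of the present invention. This embodiment adds to the fifth embodiment a coherence filter calculation unit 601, which performs a filter operation based on the coherence and the power information.
[0129]
Next, the processing flow of this embodiment is described with reference to FIG. 17. The processing from step S61 to step S64 is the same as that from step S51 to step S54 of the fifth embodiment. In this embodiment, a filter operation using the coherence function and the power information generated in the course of the target signal activity calculation of step S63 is applied to the integrated spectrum signal obtained in step S64.
[0130]
Gain control according to the target signal activity obtained in step S63 is then applied to the integrated spectrum signal that has undergone the coherence filter operation, to adjust its amplitude and generate a spectrum signal in which the target sound component is emphasized (step S65). Finally, as necessary, an inverse transform (for example, inverse FFT) is performed in step S66 to obtain an output sound signal in which the target sound component is emphasized. The processing of steps S61 to S66 is repeated each time a frame of the digitized sound signal is input in step S61.
[0131]
Next, the coherence filter calculation unit 601 is described in detail. The coherence filter calculation unit 601 filters the spectrum information of interest using the coherence function calculated by the target signal activity calculation unit 300. The coherence function is calculated using equation (3-1) or (3-2). It is more effective still to modify the coherence function as in the following equations, according to the power information of any of equations (3-4) to (3-7) obtained internally by the target signal activity calculation unit 300.
[0132]
The modified coherence function γ(f) when the input sound signals are the two channels x(f) and y(f) is given by the following equation.
[Equation 28]

[0133]
The modified coherence function γ(f) for M channels (that is, not limited to two channels) is given by the following equation.
[Equation 29]

[0134]
Here, as in the third embodiment, i and j are channel numbers, Wij(f) is the cross spectrum between the i-th and j-th channels, and Wii(f) and Wjj(f) are the power spectra of the i-th and j-th channels.
[0135]
The filter operation using the modified coherence function γ(f) of equation (6-1) or (6-2) is performed according to the following equation.
[Equation 30]

[0136]
Here, ZO(f) is the output of the filter operation, and Z(f) is the integrated spectrum signal obtained by the signal integration unit 203.
[0137]
The filter operation may also be performed after modifying the coherence function γ(f) with a suitable function, for example as in the following equation.
[Equation 31]

[0138]
Here, pow(a, b) is the power function giving a raised to the b-th power, and α = 2, for example, may be used. In this case, the value of the coherence function γ(f) is emphasized more than in equation (6-3) (which corresponds to α = 1) and the amount of noise suppression increases, but the distortion of the target speech also increases, so α should be set to suit the situation.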
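The coherence filter can be sketched in a few lines of Python. Equations 30 and 31 are not reproduced in this text, so the sketch assumes the multiplicative form ZO(f) = pow(γ(f), α)·Z(f), which reduces to the plain coherence filter when α = 1. The function name is illustrative.

```python
def coherence_filter(spectrum, gamma, alpha=1.0):
    """Weight each bin of the integrated spectrum by gamma(f) ** alpha (assumed form)."""
    return [(g ** alpha) * z for z, g in zip(spectrum, gamma)]
```

Since γ(f) lies between 0 and 1, raising it to α = 2 leaves fully coherent bins untouched and attenuates a bin with γ = 0.5 by a factor of 0.25 instead of 0.5, which illustrates the trade-off between stronger noise suppression and greater distortion of the target described above.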
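[0139]
According to this embodiment, when target sound is emphasized using the target signal activity, the spectrum is additionally weighted according to the coherence function, which further improves speech enhancement performance against noise that is uncorrelated between channels.
[0140]
(Microphone Arrangement)
Next, preferred arrangements of the microphones described so far are explained. The sound signal processing apparatus assumes that identical components of the target sound are incident on the plurality of microphones, while noise components differing in at least one of phase and amplitude are incident on them. To realize such sound reception conditions, the microphones 101-1 to 101-M are preferably arranged as described below.
[0141]
In the third embodiment, information on the inter-channel power ratio is used in calculating the weighted cross-correlation coefficient, and high performance is obtained when the microphones 101-1 to 101-M are arranged so that the power is equal between channels for target sound and different between channels for noise. Even when omnidirectional microphones are used for all of the microphones 101-1 to 101-M, a certain level of performance is obtained. This is because conditions such as reflections differ with the sound reception position, so the power of the arriving sound may differ even for omnidirectional microphones.
[0142]
However, to obtain high performance stably, at least one of the microphones 101-1 to 101-M should be a directional microphone. This creates a sensitivity difference between channels for directions other than the arrival direction of the target sound and improves noise suppression performance.
[0143]
The case where the number of microphones M is two, that is, the two-channel case, is described here, but extension to three or more channels is straightforward. Two cases are described: one of the two microphones is an omnidirectional microphone 701 and the other is a directional microphone 702, as shown in FIG. 18; and both microphones 711 and 712 are directional microphones, as shown in FIG. 19. Each has its own characteristic uses. An ordinary unidirectional (cardioid) microphone is assumed as the directional microphone. Using a microphone with sharper directivity may improve performance further, but the arrangement is the same as for a unidirectional microphone.
[0144]
When an omnidirectional microphone 701 and a directional microphone 702 are used as shown in FIG. 18, the directional microphone 702 is oriented so that the peak of its directivity (the direction of maximum sensitivity) points toward the target sound. A suitable distance between the microphones 701 and 702 is, for example, about 5 cm to 20 cm. In this arrangement, the sensitivity of the omnidirectional microphone 701 and the sensitivity of the directional microphone 702 in its peak direction are preferably adjusted to be about the same.
[0145]
With this arrangement, the sensitivity difference between channels, that is, between the microphones 701 and 702, is very large in the low-sensitivity direction of the directional microphone 702, for example the direction 180° opposite to the high-sensitivity direction as shown in FIG. 18, so sound arriving from the low-sensitivity direction is suppressed very strongly. At first glance this may seem merely to reproduce the original directivity of the directional microphone, but because the sensitivity to the inter-channel power ratio can be adjusted by the value of β in equation (3-6) or (3-7), the directivity can be made sharper than the original directivity of the directional microphone 702.
[0146]
That is, setting β = 2, for example, means that the square of the actual power ratio is used as the weight in calculating the target signal activity. The actual power ratio is 1 in the target sound direction and 1 or less in other directions, so squaring it further reduces the weight of components other than the target sound. The sensitivity in directions such as the lateral directions between the low-sensitivity direction and the target sound direction can therefore be reduced further.
[0147]
When directional microphones 711 and 712 are used for both microphones as shown in FIG. 19, the arrangements shown in FIG. 19 (A1) to (A4), for example, are effective. In these arrangements the directivity axes of the two microphones 711 and 712 lie in the same plane, and the orientation of the directivity axes as seen from above in the figure is preferably within a range of about θ = -90° to 90°. When θ > 0, the directivity axes open outward from the midpoint between the two microphones 711 and 712. Similar performance is obtained with θ < 0, in which case the directivity axes point toward the midpoint.
[0148]
FIG. 19 (B1) to (B4) show another preferred arrangement of the two directional microphones 711 and 712, in which the directivity axes do not lie in the same plane. For precision, FIG. 20 shows the orientation of the directivity axes in the arrangements of FIG. 19 (B1) to (B4) in terms of the azimuth angle θ and the elevation angle φ. If the orientation of the directivity axis of the R-channel microphone 712 is (θ, φ), the orientation of the directivity axis of the L-channel microphone 711 is preferably (-θ, -φ). That is, the positions and axis directions of the two microphones have 180° rotational symmetry. With M microphones, an arrangement with 360°/M rotational symmetry is preferable. The ranges of θ and φ are preferably 10° < θ < 80° and 10° < φ < 80°. After the directivity axes are set as above, rotating the positions of the two microphones 711 and 712 about the arrival direction of the target sound gives exactly the same characteristics, so the arrangement may be rotated as needed.
[0149]
With the arrangements of FIG. 19 (A1) to (A4), the final directivity obtained by the sound signal processing described above has maximum sensitivity in the arrival direction of the target sound, and it has local maxima in the directions equidistant from the directional microphones 711 and 712, that is, the directions perpendicular to the straight line connecting the two microphones 711 and 712. It therefore has some sensitivity to sound arriving from directly above or directly below.
[0150]
With the arrangements of FIG. 19 (B1) to (B4), in contrast, the directions in which the phases of the two directional microphones 711 and 712 match are, as in FIG. 19 (A1) to (A4), the directions equidistant from the microphones 711 and 712, that is, the directions contained in the plane perpendicular to the straight line connecting the two microphones 711 and 712 (plane a in FIG. 21). The arrival directions for which the sensitivities of the two microphones 711 and 712 match, on the other hand, are contained in the plane (plane b in FIG. 21) perpendicular to the difference vector (vector C in FIG. 21) of the two vectors representing the axis directions of the microphones 711 and 712 when these vectors are translated onto a single plane.
[0151]
The target signal activity in this embodiment takes a large value when both the phase and the amplitude match between channels, so a large local maximum of the directivity arises only in the directions where plane a and plane b of FIG. 21 intersect, that is, the front direction (the arrival direction of the target sound indicated by the arrow in FIG. 20 or FIG. 21) and the direction 180° opposite to it. The low-sensitivity directions of the directional microphones 711 and 712 face the direction opposite to the front, so the level of sound incident from that direction is low. A directivity with a main lobe peaking essentially only in the front direction is therefore obtained, and this arrangement is effective when sound arriving from directly above or directly below is also to be suppressed.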
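[0152]
(Eighth Embodiment)
FIG. 22 shows the configuration of the sound signal processing apparatus according to the eighth embodiment of the present invention. In this embodiment, a spectrum correction unit 800 is inserted between the frequency analysis unit 201 and the target signal activity calculation unit 300 of the third embodiment. As shown in FIG. 23, the spectrum correction unit 800 has an adaptive filter 801 and a correction filter 802.
[0153]
As described above, the sound signal processing apparatus according to the embodiments of the present invention assumes that identical components of the target sound are incident on the microphones 101-1 to 101-M. Processing accuracy may therefore deteriorate if the sensitivities of the microphones 101-1 to 101-M change because of aging, depletion of the battery used for bias setting, or the like. Processing accuracy may also deteriorate if the arrival direction of the target sound deviates from the assumed direction.
[0154]
In this embodiment, to correct for sensitivity differences among the microphones 101-1 to 101-M and for deviation of the arrival direction of the target sound, and thereby obtain the intended performance, the spectrum correction unit 800 applies to the spectrum information obtained by the frequency analysis unit 201 a correction based on the spectrum information and on the target signal activity obtained by the target signal activity calculation unit 300.
[0155]
Next, the processing in the spectrum correction unit 800 is described in detail with reference to FIG. 23. The case of two-channel input sound signals is described here, but extension to M channels is similar. The spectrum is corrected by identifying the inter-channel difference with the adaptive filter 801 and applying the correction filter 802 to the spectrum of one channel to compensate for the difference identified by the adaptive filter 801. When the adaptive filter 801 identifies the difference, the speed of the filter update may be controlled according to the target signal activity signal 306.
[0156]
For example, a frequency-domain LMS adaptive filter can be used as the adaptive filter 801. In this case, the frequency-domain LMS adaptive filter is calculated as follows.
[Equation 32]

[0157]
Here, k is the frame number, X is the spectrum of the first channel, Y is the spectrum of the second channel, E is the error spectrum, W is the complex filter coefficient, μ is the step size, and (*) denotes the complex conjugate.
[0158]
In this case, the correction filter 802 operates on the spectrum X(k, f) of the first channel as X'(k, f) = W(k, f)X(k, f), where X'(k, f) is the corrected spectrum of the first channel. This operation is already performed in equation (8-1) of the adaptive filter 801, so instead of providing a separate correction filter 802 it is sufficient simply to take the signal W(k, f)X(k, f) from the adaptive filter 801.
[0159]
The target signal activity ρ(k) can also be used to control the speed of the filter update when the adaptive filter 801 identifies the difference. In that case, the update equation (8-2) of the adaptive filter 801 is modified, for example, as in the following equation.
[0160]
[Equation 33]

[0161]
Here, 0.5, for example, can be used as the threshold h. With this scheme the inter-channel difference is estimated only when the magnitude of ρ(k) exceeds the threshold, so the filter is updated only when target sound is likely to be arriving, and there is no risk of adapting to noise. Besides controlling whether adaptation is updated or halted with such a threshold, the size of the update can also be made proportional to ρ(k), as in the following equation.
[Equation 34]

[0162]
When the inter-channel difference is estimated using equation (8-3), and, for example, the sensitivities differ greatly from the start, the value of ρ(k) does not exceed the threshold, so the adaptive filter 801 is never updated and the difference may not be obtained at all. However, assuming as described above that the microphone sensitivities change through aging, depletion of the battery used for bias setting, or the like, the sensitivity difference rarely becomes large suddenly, so this drawback is seldom a problem. Using this embodiment as a correction method when obtaining the target signal activity in the sound signal processing described in, for example, the third to sixth embodiments enables operation that is unaffected by sensitivity differences between channels.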
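The activity-controlled frequency-domain LMS update can be sketched in Python. Equations (8-1) to (8-4) are not reproduced in this text, so the sketch assumes the standard per-bin form E(k, f) = Y(k, f) - W(k, f)·X(k, f) and W(k+1, f) = W(k, f) + μ·X*(k, f)·E(k, f), with the update either gated by the threshold h or scaled by ρ(k). The function and parameter names are illustrative.

```python
def lms_update(W, X, Y, rho, mu=0.1, threshold=0.5, proportional=False):
    """One frame of per-bin LMS correction; returns (new coefficients, corrected channel-1 spectrum)."""
    if proportional:
        step = mu * max(rho, 0.0)                       # update size proportional to rho(k)
    else:
        step = mu if abs(rho) > threshold else 0.0      # update only when target sound is likely
    W_next, X_corr = [], []
    for w, x, y in zip(W, X, Y):
        xc = w * x                     # corrected spectrum X'(k, f) = W(k, f) X(k, f)
        e = y - xc                     # error spectrum E(k, f)
        W_next.append(w + step * x.conjugate() * e)
        X_corr.append(xc)
    return W_next, X_corr
```

As a usage example, if the second channel is consistently twice the first and ρ(k) stays above the threshold, repeated calls drive W toward 2, so the corrected first channel matches the second. With ρ(k) below the threshold the coefficients stay fixed, which is the behavior that prevents adaptation to noise. A fixed μ assumes suitably scaled spectra.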
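[0163]
(Ninth Embodiment)
FIG. 24 shows the configuration of the sound signal processing apparatus according to the ninth embodiment of the present invention. As in the eighth embodiment, a spectrum correction unit 900 is provided, and a correction filter learning instruction unit 910 is added.
[0164]
The sensitivity correction of the eighth embodiment is effective when the sensitivities of the microphones 101-1 to 101-M do not differ greatly. In the ninth embodiment, for cases where the amplitude or phase of the target sound cannot be assumed to be the same at each microphone, a learning mode is provided in which a correction filter separate from that of the eighth embodiment is trained to correct the inter-channel difference.
[0165]
To correct sensitivity drift due to aging after learning, or phase differences due to small shifts in the position of the target speaker, the automatic correction described in the eighth embodiment is performed after the correction by the filter trained in the learning mode. This embodiment is configured to allow both kinds of correction.
[0166]
The sound processing method of this embodiment can then be used even when the target sound direction differs from the assumed direction, or when the microphones are arranged so that the distances between the microphones 101-1 to 101-M and the target sound source differ. The learning mode may be started by a user instruction as a trigger, or the apparatus may enter the learning mode automatically, for example after start-up.
[0167]
The correction filter learning instruction unit 910 outputs a signal indicating whether or not the apparatus is in the learning mode, for example "1" in the learning mode and "0" otherwise. The learning mode may be ended automatically by the apparatus or by a user instruction. In the learning mode, a test sound is generated from the position of the target sound to be input. The user may speak, or a test sound generator such as a loudspeaker may be placed at the target sound position. The test sound may be selected according to the intended use. If the purpose is speech input, speech or white noise is preferable.
[0168]
As shown in FIG. 25, when a user instruction is input through the switch 911, the correction filter learning instruction unit 910 measures the time elapsed since the instruction with the timer 912, so that the learning mode lasts for a fixed period, and outputs the correction filter learning instruction signal S. As the correction filter learning instruction signal S, the timer 912 outputs, for example, "1" from the time of the instruction input through the switch 911 until a predetermined time has elapsed, and "0" at other times. A timer is a function provided in most microprocessors, so that function may be used as the timer 912. The learning mode may be ended automatically by the apparatus using the timer 912 in this way, or by a user instruction.
[0169]
The spectrum correction unit 900 performs learning for a period of fixed length, for example 3 seconds, according to the instruction from the correction filter learning instruction unit 910. This period is called the learning mode. In the learning mode, a test sound is generated from the position of the target sound to be input. The user may speak, or a test sound generator such as a loudspeaker may be placed at the target sound position. The test sound may be selected according to the intended use. If the purpose is speech input, speech or white noise is preferable. After the learning mode ends, the sound signal processing described in the embodiments up to the eighth is performed.
[0170]
The configuration of the spectrum correction unit 900 differs slightly from that of the spectrum correction unit 800 of the eighth embodiment shown in FIG. 23. As shown in FIG. 26, in addition to a correction filter 902 corresponding to the correction filter 802 of FIG. 23, another correction filter 901 is added in the stage preceding the correction filter 902. The correction filter 902 works in the same way as described in the eighth embodiment, that is, it corrects small inter-channel deviations.
[0171]
The added correction filter 901, in contrast, corrects large inter-channel differences. The correction filter 901 is fixed except in the learning mode. When the correction filter learning instruction signal S from the correction filter learning instruction unit 910 is "1", the adaptive filter 904 trains the correction filter 901, and when the signal S is "0", it trains the correction filter 902.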
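[0172]
For example, training of the correction filter 902 using LMS is performed according to the following equations.
[Equation 35]

[0173]
Training of the correction filter 901, on the other hand, is performed according to the following equations.
[Equation 36]

[0174]
Here, k is the frame number, X is the spectrum of the first channel, Y is the spectrum of the second channel, X1 is the spectrum obtained by applying the correction filter 901 to X, W0 is the filter coefficient of the correction filter 902, E0 is the error spectrum in training the correction filter 902, μ0 is the step size in training the correction filter 902, W1 is the filter coefficient of the correction filter 901, E1 is the error spectrum in training the correction filter 901, μ1 is the step size in training the correction filter 901, and (*) denotes the complex conjugate. A value of 0.1, for example, is used for the step sizes μ0 and μ1.
[0175]
When the correction filter 902 is trained by Equations (9-1) and (9-2), the adaptation speed may be controlled using the target signal activity as in the eighth embodiment. Filtering by the correction filter 901 is performed according to
[Equation 37]

and filtering by the correction filter 902 is performed according to
[Equation 38]

where X'(k, f) is the first-channel spectrum output by the spectrum correction unit 900.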
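The two-stage correction can be sketched for a single frequency bin as follows. The equation images for (9-1) to (9-6) are not reproduced in this text, so the sketch assumes the textbook complex LMS form (error E = Y - W·X, update W ← W + μ·X*·E). The names `lms_step` and `process_bin` are illustrative and do not come from the patent.

```python
def lms_step(w, x, y, mu=0.1):
    """One complex LMS update for a single frequency bin.

    Assumed standard form: e = y - w*x, w_new = w + mu * conj(x) * e.
    """
    e = y - w * x
    return w + mu * x.conjugate() * e, e


def process_bin(w1, w0, x, y, learning):
    """Process one frame of one bin through correction filters 901 (w1) and 902 (w0).

    learning=True corresponds to S = "1": only filter 901 is trained.
    learning=False corresponds to S = "0": 901 is fixed and applied,
    then 902 is trained and applied to the output of 901.
    Returns (w1, w0, x_corrected).
    """
    if learning:
        w1, _ = lms_step(w1, x, y)
        return w1, w0, w1 * x
    x1 = w1 * x                      # output of correction filter 901
    w0, _ = lms_step(w0, x1, y)
    return w1, w0, w0 * x1           # corresponds to X'(k, f)


# Toy check: channel 2 is channel 1 scaled by 0.5 and rotated by 90 degrees.
w1, w0 = 1 + 0j, 1 + 0j              # initial value (1, 0), i.e. 1 + j0
x, y = 1 + 0j, 0.5j
for _ in range(200):                 # learning mode
    w1, w0, _ = process_bin(w1, w0, x, y, learning=True)
for _ in range(200):                 # normal operation
    w1, w0, x_out = process_bin(w1, w0, x, y, learning=False)
```

In this toy case the large channel difference is absorbed by filter 901 during the learning mode, so filter 902 stays close to its initial value and only has to follow small residual deviations afterwards.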
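[0176]
Next, the processing flow of this embodiment is described with reference to FIG. 27.
First, as initialization, initial values are set for the coefficients of the correction filters 901 and 902 (step S90). Calling the correction filter 901 "correction filter 1" and the correction filter 902 "correction filter 0", setting the initial coefficients of correction filters 1 and 0 to (1, 0) at all frequencies f is convenient, because audio signals can then be input even without learning. Here (1, 0) denotes the complex number 1 + j0. Even if the initial coefficients are set to (0, 0) at all frequencies f, the filters come to operate once learning has progressed, so the choice of initial values makes no essential difference.
[0177]
Next, it is checked whether the correction filter learning instruction signal S is "1" or "0" (step S91). If S = "1", correction filter 1 is trained according to Equations (9-3) and (9-4) (step S93). If S = "0", filtering by correction filter 1 is performed according to Equation (9-5) (step S94), correction filter 0 is then trained according to Equations (9-1) and (9-2) and applied (steps S93 to S94), and the target signal activity is then measured (step S96). The processing from step S91 to step S96 is repeated each time a frame of the digitized audio signal is input in step S91.
[0178]
According to this embodiment, processing such as calculation of the target signal activity, detection of the target sound, and enhancement of the target sound can be performed effectively even when, for example, the microphones 101-1 to 101-M are arranged at different distances from the target sound source.
[0179]
When the apparatus is used under the driving noise observed inside an automobile, the noise is highly diffuse, so its amplitude differs little between channels even when the microphones are placed at different positions or orientations. When the microphones are arranged at different distances from the target sound position, the spectrum correction of this embodiment corrects the channels so that the target sound has the same amplitude and phase in each. The noise components, which had equal amplitude, acquire different amplitudes through the correction. This makes noise sections easier to distinguish in the target signal activity and improves the accuracy of the activity measurement. Thus, arranging the microphones at unequal distances from the target sound can improve performance under diffuse noise.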
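[0180]
(Tenth Embodiment)
FIG. 28 shows the configuration of an audio signal processing apparatus according to a tenth embodiment of the present invention. This embodiment relates to a technique for estimating the direction of arrival of a sound source based on the modified cross-correlation coefficient. Direction-of-arrival estimation is important in various audio processing applications such as speech enhancement and identification of noise sources. In particular, the method based on the modified cross-correlation coefficient according to this embodiment imposes fewer constraints on the noise source signal and the propagation conditions than null-steering methods such as adaptive beamformers, and has the advantage of being usable in a wide range of noise environments.
[0181]
As shown in FIG. 28, the audio signal processing apparatus according to this embodiment comprises a frequency analysis unit 201 that frequency-analyzes the M-channel input audio signals from the microphones 101-1 to 101-M and converts them into spectrum information, that is, frequency components, and a sound source direction estimation unit 1000 that estimates the sound source direction from the spectrum information. The processing of the frequency analysis unit 201 is as described in the second embodiment (FIG. 6).
[0182]
The sound source direction estimation unit 1000 has a cross/power spectrum calculation unit 1001, a coherence function calculation unit 1002, a correction coefficient generation unit 1003, a cross/power spectrum correction unit 1004, a power information calculation unit 1005, a virtual direction correlation coefficient calculation unit 1006, and a sound source direction detection unit 1007. Each component of the sound source direction estimation unit 1000 is described below.
[0183]
The cross/power spectrum calculation unit 1001 calculates the power spectrum of each channel and the cross spectrum between channels from the spectrum information obtained by the frequency analysis unit 201.
[0184]
The coherence function calculation unit 1002 calculates the coherence function between channels of the input audio signal from the cross spectrum and the power spectrum of each channel obtained by the cross/power spectrum calculation unit 1001.
[0185]
The correction coefficient generation unit 1003 defines virtual directions, which are hypothetical directions of arrival of a signal, within a preset range of arrival directions. It then generates correction coefficients for correcting the spectrum information so that, assuming a signal arrives from the virtual direction, the corresponding signal component in the spectrum information of the input audio signal matches between channels.
[0186]
The cross/power spectrum correction unit 1004 corrects the cross spectrum and the power spectrum using the generated correction coefficients to produce a corrected cross spectrum and a corrected power spectrum.
[0187]
The power information calculation unit 1005 calculates power information, which is the per-frequency signal power ratio between channels of the input audio signal, based on the corrected cross spectrum and the corrected power spectrum.
[0188]
The virtual direction correlation coefficient calculation unit 1006 weights the corrected power spectrum and the corrected cross spectrum based on the coherence function and the power information, and calculates, for each virtual direction, the cross-correlation coefficient corresponding to a preset set of virtual directions.
[0189]
The sound source direction detection unit 1007 detects and outputs the sound source direction based on the cross-correlation coefficients for each virtual direction calculated by the virtual direction correlation coefficient calculation unit 1006. At the same time, it outputs the value of the cross-correlation coefficient in the detected sound source direction as the sound source correlation coefficient, and the correction coefficient corresponding to the sound source direction as the sound source direction correction coefficient.
[0190]
Next, the processing of each unit is described in more detail. For the calculations in the cross/power spectrum calculation unit 1001, the coherence function calculation unit 1002, and the power information calculation unit 1005, Equations (3-8), (3-9), and (3-10) are used when the number of input channels M is 2, and Equations (3-12), (3-13), and (3-14) are used when M is 3 or more.
[0191]
The correction coefficient generation unit 1003 sets in advance the range from which signals arrive, for example as shown in FIG. 29. The direction of arrival is expressed as a pair (θ, φ) of the azimuth θ, the horizontal angle, and the elevation φ, the vertical angle, and the directions on grid points within the arrival range, for example, are taken as virtual directions. In FIG. 29, the arrival range is −40° to 40° in both azimuth and elevation, the grid points are spaced every 5° in both, and the directions on all grid points form the set of virtual directions. The grid spacing is 5° in FIG. 29 for ease of drawing, but in practice a finer spacing of 2° or less is desirable.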
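As a minimal sketch of the grid described above, the set of virtual directions can be enumerated as follows. The function name `virtual_directions` and its parameters are illustrative; only the range (−40° to 40°) and the spacings (5° in FIG. 29, 2° or less in practice) come from the text.

```python
def virtual_directions(lo=-40.0, hi=40.0, step=5.0):
    """Return all (azimuth, elevation) grid points in degrees, both axes lo..hi."""
    n = int(round((hi - lo) / step)) + 1
    axis = [lo + i * step for i in range(n)]
    return [(theta, phi) for theta in axis for phi in axis]


grid = virtual_directions()          # FIG. 29 example: 17 x 17 directions
fine = virtual_directions(step=2.0)  # finer spacing: 41 x 41 directions
```

Moving from the 5° grid to a 2° grid raises the number of virtual directions from 289 to 1681, which is why the per-direction cost of the later steps matters.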
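[0192]
A virtual direction on a grid point is written dh,g = (θh, φg), where h is the index of the grid point in azimuth and g is the index in elevation. The correction coefficient generation unit 1003 generates the correction coefficient corresponding to a virtual direction according to the following equation.
[0193]
[Equation 39]

[0194]
Here, i is the channel number, Hi(f, θ, φ) is the correction coefficient of the i-th channel for the direction (θ, φ), τi(θ, φ) is the propagation delay of a signal arriving from the direction (θ, φ) at the i-th microphone relative to the signal received at the reference microphone, Di(θ, φ) is the sensitivity directivity of the i-th microphone in the direction (θ, φ), f is the frequency index, F is the sampling frequency, and L is the number of FFT points. The reference microphone is, for example, the first microphone.
[0195]
For a microphone arrangement such as that shown in FIG. 30, when the direction of the arriving sound is d = (θ, φ) and the reference position is taken as the origin of coordinates, the propagation delay relative to the origin can be calculated as follows, using the relation between polar and Cartesian coordinates.
[0196]
[Equation 40]

[0197]
Here, · denotes the inner product and c is the speed of sound. When the position of microphone i is Ai = (xi, yi, zi), the delay is given by the following equation.
[0198]
[Equation 41]

[0199]
Di(θ, φ) is a characteristic specific to the microphone and is obtained from product information or by measurement. The directivity of the microphone sensitivity can be measured, for example, by measuring the output while varying the angle of incidence of sound on the microphone. Since this is a common procedure, it is not described here.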
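The delay and correction coefficient computation can be sketched as below. Equations (10-1) to (10-3) are not reproduced in this text, so several points are assumptions: the axis convention (x to the right, y toward the front, z up), the sign of the delay, the speed of sound of 340 m/s, and the form Hi = exp(j2π(fF/L)τi)/Di, which simply undoes the delay phase and the directivity gain. All function names are illustrative.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, assumed value


def unit_vector(theta_deg, phi_deg):
    """Unit vector pointing toward the source (assumed axes: x right, y front, z up)."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    return (math.cos(p) * math.sin(t), math.cos(p) * math.cos(t), math.sin(p))


def delay(mic_pos, theta_deg, phi_deg, c=SPEED_OF_SOUND):
    """Propagation delay (s) at mic_pos relative to the origin for a plane wave.

    A microphone displaced toward the source receives the wave earlier,
    so its delay is negative under this sign convention.
    """
    u = unit_vector(theta_deg, phi_deg)
    return -sum(a * b for a, b in zip(mic_pos, u)) / c


def correction(f_bin, mic_pos, theta_deg, phi_deg, fs, fft_len, directivity=1.0):
    """Assumed form of Hi: undo the delay phase and the directivity gain."""
    freq_hz = f_bin * fs / fft_len
    tau = delay(mic_pos, theta_deg, phi_deg)
    return cmath.exp(2j * math.pi * freq_hz * tau) / directivity
```

Under these assumptions, a channel whose spectrum carries the phase factor exp(−j2πντ) of a delay τ is restored to the reference phase when multiplied by the returned coefficient. As the text notes, the coefficients depend only on the geometry and directivity, so they can be computed once per grid point and stored in a table.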
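[0200]
The correction coefficients generated by the correction coefficient generation unit 1003 do not change unless the sound source direction search range or the directivity of the microphones 101-1 to 101-M changes. They are therefore stored in a table after being generated once, and the coefficient values are read out by looking up the table with the grid point index.
[0201]
The cross/power spectrum correction unit 1004 multiplies the cross spectrum and the power spectrum of the corresponding channels by the correction coefficients generated by the correction coefficient generation unit 1003 to obtain the corrected cross spectrum and the corrected power spectrum. The calculation is performed as in the following equation.
[0202]
[Equation 42]

[0203]
Here W' is the corrected spectrum, * denotes the complex conjugate, and i and j are channel numbers; i ≠ j gives a cross spectrum and i = j gives a power spectrum.
[0204]
The correction of Equation (10-4) is equivalent to correcting the spectrum information Xi(f) by Hi(f, θ, φ) and then calculating the cross/power spectrum. It follows from the relation below, in which the overline denotes time averaging and Hi is taken not to vary with time.
[Equation 43]

[0205]
The power information calculation unit 1005 obtains the power ratio between channels from the power spectrum corrected by the cross/power spectrum correction unit 1004. The power ratio is calculated by using, in Equation (3-7), the corrected power spectrum given by the following equation in place of the original power spectrum Wii(f).
[0206]
[Equation 44]

[0207]
The virtual direction correlation coefficient calculation unit 1006 calculates the cross-correlation coefficient for the virtual direction (θ, φ) using the corrected cross/power spectrum and the power information. The cross-correlation coefficient is calculated by replacing the original cross/power spectrum and power information in Equations (3-11), (3-12), and (3-13) with their corrected counterparts, as in the following equations.
[0208]
[Equation 45]

[0209]
Here, K is
[Equation 46]

and the limits L1 and L2 of the frequency index f in the sum are set to the indices corresponding to the band of the target sound. For example, if the target sound band is defined as 260 Hz to 4 kHz, L1 = 6 and L2 = 92 are appropriate for an FFT length of 256 and 11 kHz sampling.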
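The band limits in the example can be reproduced with a short calculation. This sketch assumes that "11 kHz" stands for the common 11.025 kHz rate and that bin indices are rounded down; both are assumptions, chosen because they reproduce L1 = 6 and L2 = 92 (with exactly 11000 Hz the upper index would come out as 93). The function name `band_to_bins` is illustrative.

```python
import math


def band_to_bins(f_lo_hz, f_hi_hz, fft_len, fs_hz):
    """Map a frequency band to FFT bin indices (L1, L2), rounding down."""
    def to_bin(freq_hz):
        return math.floor(freq_hz * fft_len / fs_hz)
    return to_bin(f_lo_hz), to_bin(f_hi_hz)


L1, L2 = band_to_bins(260, 4000, 256, 11025)
```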
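[0210]
Using Equations (10-6) to (10-10) with θ = θhg and φ = φhg, the virtual direction correlation coefficient is obtained for each virtual direction d(θhg, φhg) (h = 1 to Nh, g = 1 to Ng) in the set arrival range.
[0211]
The sound source direction detection unit 1007 detects the peak of the per-direction correlation coefficients calculated by the virtual direction correlation coefficient calculation unit 1006 and outputs it as the sound source direction. The estimate can be stabilized by time-averaging the virtual direction correlation coefficients, for example as in the following equation.
[0212]
[Equation 47]

[0213]
Here, ρ'k is the averaged virtual direction correlation coefficient in the processing of the k-th frame, ρk is the virtual direction correlation coefficient obtained in the processing of the k-th frame, and η is a learning constant, for example η = 0.05. The peak is detected by finding the maximum of ρ'k(θ, φ).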
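The averaging and peak search can be sketched as follows. Equation (10-11) is not reproduced in this text, so the first-order recursive form ρ'k = (1 − η)ρ'k−1 + ηρk is an assumption consistent with the description of η as a learning constant. The dictionary-based map and the names `smooth` and `detect_direction` are illustrative.

```python
def smooth(prev, current, eta=0.05):
    """Assumed recursive average: rho'_k = (1 - eta) * rho'_{k-1} + eta * rho_k."""
    return {d: (1.0 - eta) * prev.get(d, 0.0) + eta * r for d, r in current.items()}


def detect_direction(rho_avg):
    """Return (direction, peak value) of the smoothed correlation map."""
    direction = max(rho_avg, key=rho_avg.get)
    return direction, rho_avg[direction]


# Toy map with three virtual directions; the strongest correlation is at (10, 0).
frame = {(0, 0): 0.2, (10, 0): 0.9, (20, 0): 0.4}
rho_avg = {}
for _ in range(100):
    rho_avg = smooth(rho_avg, frame)
source, peak = detect_direction(rho_avg)
```

The returned peak value plays the role of the sound source correlation coefficient, and the returned direction is the key with which the stored correction coefficient would be looked up.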
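[0214]
In addition to the sound source direction, the sound source direction detection unit 1007 outputs the sound source correlation coefficient, which is the peak value in the sound source direction, and the sound source direction correction coefficient, which is the correction coefficient corresponding to the sound source direction. For this purpose, the correction coefficient is retrieved from the correction coefficient table inside the correction coefficient generation unit 1003 using the grid point index of the sound source direction.
[0215]
Next, the processing flow of this embodiment is described with reference to FIG. 31.
First, as initialization, the range of sound source directions is set (step S100). Then generation of the correction coefficients (step S101), input of the audio signals from the microphones 101-1 to 101-M (step S102), frequency analysis (step S103), calculation of the cross spectrum and the power spectrum (step S104), and calculation of the coherence function (step S105) are performed in sequence. Next, spectrum correction (step S106), calculation of the power information (step S107), and calculation of the virtual direction cross-correlation coefficient (step S108) are repeated for all virtual directions, and finally the sound source direction is detected (step S109). Steps S102 to S109 are repeated each time a frame of the digitized audio signal is input in step S102.
[0216]
(Eleventh Embodiment)
The speech enhancement processing of the present invention assumes that the target sound arrives from the front of the microphone array, so performance may degrade when the direction of the target sound deviates from this assumption. The correction based on adaptive processing described in the eighth embodiment can cope with some deviation in the target sound direction, but when the direction deviates greatly, adaptive processing alone is insufficient. In this embodiment, the direction of the target sound is therefore tracked using the result of the sound source direction estimation described in the tenth embodiment, which improves the stability of speech enhancement when the target sound deviates from the assumed direction.
[0217]
FIG. 32 shows the configuration of the audio signal processing apparatus according to this embodiment. In this embodiment, the sound source direction is estimated by the processing described in the tenth embodiment, the input spectrum information is corrected using the correction coefficient corresponding to the sound source direction, the corrected spectrum information is integrated, and gain control is applied to the integrated spectrum information to enhance speech.
[0218]
To realize this processing, the audio signal processing apparatus according to this embodiment has the sound source direction estimation unit 1000 described in the tenth embodiment, a spectrum information correction unit 1100 that corrects the multi-channel spectrum information from the frequency analysis unit 201 based on the sound source direction correction coefficient, a signal integration unit 1101 that integrates the corrected spectrum information, a coherence filter operation unit 1102 that filters the integrated spectrum information based on the coherence function, and a gain control unit 1103 that suppresses noise by further applying gain control, based on the sound source correlation coefficient, to the coherence-filtered spectrum information.
[0219]
The frequency analysis unit 201 and the sound source direction estimation unit 1000 are as described in the tenth embodiment. The spectrum information correction unit 1100 corrects the multi-channel spectrum information using the sound source direction correction coefficient output from the sound source direction estimation unit 1000. This correction serves to maximize the correlation coefficient for sound arriving from the sound source direction. With the sound source direction denoted (θo, φo), the sound source correlation coefficient ρ(θo, φo), and the sound source direction correction coefficient Hi(k, θo, φo), the spectrum information is corrected according to
[Equation 48]

where i is the channel number, X'i(k) is the corrected spectrum information, and Xi(k) is the spectrum information before correction.
[0220]
Thereafter, the signal integration unit 1101 integrates the corrected spectrum information X'i(k) into one channel, and the coherence filter operation and gain control are applied to this integrated spectrum information. As stated above, ρ(θo, φo) is used as the gain for gain control. The subsequent processing is the same as in the tenth embodiment and is not described here.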
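The chain of correction, integration, coherence filtering, and gain control can be sketched for one frequency bin as below. The sketch assumes integration by simple averaging, a multiplicative coherence weight, and a gain directly proportional to ρ; the patent allows other integration and gain functions. The name `enhance_bin` is illustrative.

```python
def enhance_bin(x, h, rho, coherence=1.0):
    """Correct each channel, average, apply the coherence weight and gain for one bin.

    x: per-channel spectrum values, h: per-channel correction coefficients
    for the detected direction, rho: source correlation coefficient used as gain.
    """
    corrected = [hi * xi for hi, xi in zip(h, x)]     # X'i = Hi * Xi
    z = sum(corrected) / len(corrected)               # integration by averaging
    return rho * coherence * z


# Two channels that differ by a 90-degree phase shift for the target sound.
aligned = enhance_bin([1 + 0j, 1j], [1 + 0j, -1j], rho=0.8)
unaligned = enhance_bin([1 + 0j, 1j], [1 + 0j, 1 + 0j], rho=0.8)
```

With the matching correction coefficients the two channels add coherently, whereas without correction part of the target sound cancels in the average, which illustrates why tracking the direction stabilizes the enhancement.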
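[0221]
Next, the processing flow of this embodiment is described with reference to FIG. 33.
First, as initialization, the sound source direction range is set and the correction coefficients are generated as described in the tenth embodiment (step S200). Then input of the audio signals from the microphones 101-1 to 101-M (step S201), frequency analysis (step S202), estimation of the sound source direction (step S203), correction of the spectrum information (step S204), integration of the spectrum information (step S205), the coherence function operation (step S206), and gain control (step S207) are repeated each time a frame of the digitized audio signal is input in step S201.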
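[0222]
(Twelfth Embodiment)
Next, a twelfth embodiment of the present invention is described. In the calculation of the modified cross-correlation coefficient described so far, the geometric mean of the powers of the input spectrum information is used to normalize the cross-correlation, as shown in Equation (3-13). This embodiment describes the case where the power of the integrated spectrum information, obtained by integrating the input spectrum information, is used instead of the geometric mean.
[0223]
When multi-channel signals are integrated by a beamformer or the like, directional noise and similar components may already be suppressed by the beamformer. In such cases, gain control by the cross-correlation coefficient or the modified cross-correlation coefficient should be applied more lightly to allow for the suppression already achieved. The gain coefficient described in this embodiment takes the suppressed portion into account, so the gain control can be made appropriate.
[0224]
As shown in FIG. 34, the audio signal processing apparatus according to this embodiment comprises a frequency analysis unit 201 that frequency-analyzes the multi-channel input audio signals output from a plurality of spatially separated microphones 101-1 to 101-M to generate multi-channel spectrum information, and a modified gain coefficient calculation unit 2000A that calculates, from the spectrum information, a gain coefficient corresponding to the activity of the target sound.
[0225]
The modified gain coefficient calculation unit 2000A comprises a cross/power spectrum calculation unit 2001, a coherence function calculation unit 2002, a power information calculation unit 2003, a signal integration unit 2004, an integrated signal power spectrum calculation unit 2005, and a gain coefficient calculation unit 2006.
[0226]
The cross/power spectrum calculation unit 2001 calculates the power spectrum of each channel of the input audio signal and the cross spectrum between channels from the spectrum information.
[0227]
The coherence function calculation unit 2002 calculates the coherence function from the cross spectra between channels and the power spectrum of each channel.
[0228]
The power information calculation unit 2003 calculates power information on the signal power between channels of the input audio signal from the multi-channel power spectra.
[0229]
The signal integration unit 2004 integrates the spectrum information of the channels to generate one-channel integrated spectrum information.
[0230]
The integrated signal power spectrum calculation unit 2005 calculates the power spectrum of the integrated spectrum information.
[0231]
The gain coefficient calculation unit 2006 weights the cross spectrum based on the coherence function and the power information, and calculates the gain coefficient obtained by further normalizing the weighted cross spectrum by the integrated signal power spectrum.
[0232]
The frequency analysis unit 201, the cross/power spectrum calculation unit 2001, the coherence function calculation unit 2002, the power information calculation unit 2003, and the signal integration unit 2004 are the same as in the tenth embodiment and are not described again.
[0233]
The integrated signal power spectrum calculation unit 2005 calculates the power spectrum of the integrated spectrum information. For example, if the integrated spectrum information is Z(f) and the integration is the arithmetic mean of two channels, Z(f) = {X1(f) + X2(f)}/2, the power spectrum of Z(f) is obtained by
[Equation 49]

The same applies even when Z(f) is an integrated signal obtained from a beamformer with different coefficients.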
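For the two-channel average, the power spectrum of the integrated signal can be expanded in terms of quantities that are already available. The following is a worked derivation rather than a reproduction of Equation (12-1), whose image is not included in this text. It assumes the convention that W_ij(f) is the time average of X_i(f)X_j*(f), with the overline denoting time averaging.

```latex
\begin{aligned}
W_{ZZ}(f) &= \overline{|Z(f)|^{2}}
           = \tfrac{1}{4}\,\overline{\bigl(X_1(f)+X_2(f)\bigr)\bigl(X_1(f)+X_2(f)\bigr)^{*}} \\
          &= \tfrac{1}{4}\Bigl\{\overline{|X_1(f)|^{2}} + \overline{|X_2(f)|^{2}}
             + \overline{X_1(f)X_2^{*}(f)} + \overline{X_1^{*}(f)X_2(f)}\Bigr\} \\
          &= \tfrac{1}{4}\Bigl\{W_{11}(f) + W_{22}(f) + 2\,\mathrm{Re}\,W_{12}(f)\Bigr\}
\end{aligned}
```

Under this convention the integrated power needs no additional averaging once the power spectra and the cross spectrum are known.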
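[0234]
The gain coefficient σ calculated by the gain coefficient calculation unit 2006 is used for gain control in place of the cross-correlation coefficient, and for M = 2 it can be calculated by the following equations.
[Equation 50]

[0235]
Equations (12-2) and (12-3) are the same as Equations (3-12) and (3-13), respectively. Because the power Wzz excludes the noise that has already been suppressed, the gain coefficient σ obtained by the above calculation is less likely to underestimate the gain, which may improve performance. The gain coefficient calculation unit 2006 outputs σ as the modified gain coefficient, meaning a gain coefficient weighted by the power ratio and the coherence function.
[0236]
Next, the processing flow of this embodiment is described with reference to FIG. 35. After input of the audio signals from the microphones 101-1 to 101-M (step S301) and frequency analysis (step S302), the modified gain coefficient calculation unit 2000A performs calculation of the cross spectrum and the power spectrum (step S303), calculation of the power information (step S304), calculation of the coherence function (step S305), signal integration, that is, integration of the spectrum information (step S306), calculation of the power spectrum of the integrated spectrum information (step S307), and calculation of the modified gain coefficient (step S308). These steps are repeated each time a frame of the digitized audio signal is input in step S301.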
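[0237]
(Thirteenth Embodiment)
FIG. 36 shows the configuration of an audio signal processing apparatus according to a thirteenth embodiment of the present invention. In this embodiment, the power information pij(f) in Equation (12-3) is all set to 1 so that the power information is not used, and the power information calculation unit 2003 shown in FIG. 34 is removed from the modified gain coefficient calculation unit 2000B.
[0238]
(Fourteenth Embodiment)
Next, as a fourteenth embodiment of the present invention, a speech enhancement apparatus is described that suppresses noise and enhances the target speech based on the gain coefficient obtained in the twelfth embodiment.
[0239]
As shown in FIG. 37, the audio signal processing apparatus according to this embodiment has a gain control unit 2101 and a coherence filter operation unit 2102 in addition to the frequency analysis unit 201, which frequency-analyzes the multi-channel input audio signals output from a plurality of spatially separated microphones 101-1 to 101-M to generate M-channel spectrum information, and the modified gain coefficient calculation unit 2000A shown in FIG. 34, which calculates from the spectrum information a gain coefficient corresponding to the activity of the target sound.
[0240]
The gain control unit 2101 controls the gain of the integrated spectrum information obtained by the signal integration unit 2004 in the modified gain coefficient calculation unit 2000A, based on the gain coefficient calculated by the modified gain coefficient calculation unit 2000A. The coherence filter operation unit 2102 filters the spectrum information output from the gain control unit 2101 based on the coherence function obtained by the coherence function calculation unit 2002 in the modified gain coefficient calculation unit 2000A.
[0241]
Next, the processing flow of this embodiment is described with reference to FIG. 38.
After input of the audio signals from the microphones 101-1 to 101-M (step S401) and frequency analysis (step S402), the modified gain coefficient calculation unit 2000A performs calculation of the cross spectrum and the power spectrum (step S403), calculation of the power information (step S404), calculation of the coherence function (step S405), integration of the spectrum information (step S406), calculation of the power spectrum of the integrated spectrum information (step S407), and calculation of the gain coefficient (step S408). Gain control based on the calculated gain coefficient (step S409) and the coherence filter operation (step S410) are then performed. Steps S401 to S410 are repeated each time a frame of the digitized audio signal is input in step S401.
[0242]
(Fifteenth Embodiment)
FIG. 39 shows the configuration of an audio signal processing apparatus according to a fifteenth embodiment of the present invention. In this embodiment, the power information pij(f) in Equation (10-6) is set to 1 so that the power information is not used, and the power information calculation unit 2003 shown in FIG. 37 is removed from the modified gain coefficient calculation unit 2000B.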
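[0243]
(Sixteenth Embodiment)
Next, a sixteenth embodiment of the present invention is described, in which the sound source direction is estimated using the gain coefficient described in the twelfth embodiment. As shown in FIG. 40, the audio signal processing apparatus according to this embodiment comprises a frequency analysis unit 201 that frequency-analyzes the M-channel input audio signals from the microphones 101-1 to 101-M and converts them into spectrum information, that is, frequency components, and a sound source direction estimation unit 3000 that estimates the sound source direction from the spectrum information. The processing of the frequency analysis unit 201 is as described in the second embodiment (FIG. 6).
[0244]
The sound source direction estimation unit 3000 has a cross/power spectrum calculation unit 3001, a coherence function calculation unit 3002, a correction coefficient generation unit 3003, a cross/power spectrum correction unit 3004, a power information calculation unit 3005, a virtual integrated power spectrum calculation unit 3006, a virtual direction gain coefficient calculation unit 3007, and a sound source direction detection unit 3008. Each unit is described below.
[0245]
The cross/power spectrum calculation unit 3001 calculates the power spectrum of each channel of the input audio signal and the cross spectrum between channels from the spectrum information obtained by the frequency analysis unit 201.
[0246]
The coherence function calculation unit 3002 calculates the coherence function between channels of the input audio signal from the cross spectra between channels and the power spectrum of each channel.
[0247]
The correction coefficient generation unit 3003 generates, for a set of virtual directions consisting of a plurality of virtual directions, coefficients for correcting a signal arriving from a virtual direction, that is, a hypothetical direction of arrival, so that it matches between channels.
[0248]
The cross/power spectrum correction unit 3004 corrects the cross spectrum and the power spectrum based on the correction coefficients generated by the correction coefficient generation unit 3003 to produce a corrected cross spectrum and a corrected power spectrum.
[0249]
The power information calculation unit 3005 calculates power information on the signal power between channels of the input audio signal based on the corrected cross spectrum and the corrected power spectrum.
[0250]
The virtual integrated power spectrum calculation unit 3006 calculates, from the corrected cross spectrum and the corrected power spectrum obtained by the cross/power spectrum correction unit 3004, the power spectrum of the integrated spectrum information that would be obtained by correcting the multi-channel spectrum information from the frequency analysis unit 201 with the correction coefficients from the correction coefficient generation unit 3003 and then integrating it.
[0251]
The virtual direction gain coefficient calculation unit 3007 weights the corrected cross spectrum obtained by the cross/power spectrum correction unit 3004 based on the coherence function and the power information, normalizes it by the virtual integrated power spectrum, and obtains the gain coefficients corresponding to the set of virtual directions.
[0252]
The sound source direction detection unit 3008 detects and outputs the sound source direction based on the gain coefficients for each virtual direction calculated by the virtual direction gain coefficient calculation unit 3007. At the same time, it outputs the value of the gain coefficient for the detected sound source direction as the sound source gain coefficient, and the correction coefficient corresponding to the sound source direction as the sound source direction correction coefficient.
[0253]
The processing of the frequency analysis unit 201, the cross/power spectrum calculation unit 3001, the coherence function calculation unit 3002, the correction coefficient generation unit 3003, the cross/power spectrum correction unit 3004, and the power information calculation unit 3005 is identical to the correlation-coefficient-based sound source direction estimation of the tenth embodiment, so a detailed description is omitted.
[0254]
In the gain coefficient calculation of the twelfth to fourteenth embodiments, the denominator of the expression for the gain coefficient σ is obtained by integrating the multi-channel spectrum information and calculating its power spectrum. In this embodiment, by contrast, integration is not performed at the spectrum information stage. Instead, the power spectrum and the cross spectrum are corrected and the power of the integrated signal is obtained directly. This is advantageous in computation and memory compared with actually integrating the signals and then calculating the power. If the power were calculated after integrating the spectrum information, time averaging for power spectrum estimation would be required for every virtual direction, which this embodiment avoids.
[0255]
First, assume that the spectrum information of each channel is multiplied by the correction coefficients generated by the correction coefficient generation unit 3003 and the signals are then integrated, with the integration taken here to be the arithmetic mean. The integrated signal Z(f) is then expressed as
[Equation 51]

Other integration methods may of course be used.
[0256]
The power spectrum of the integrated signal Z(f) is then
[Equation 52]

In Equation (16-2), subscripts are omitted, and the overline denotes time averaging. Therefore, once the cross spectrum and the power spectrum have been calculated, the denominator of the gain coefficient σ(θ, φ) for a virtual direction (θ, φ) is obtained simply by multiplying by the correction coefficients according to Equation (16-2).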
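The saving described above can be checked numerically. This sketch assumes the two-channel arithmetic mean Z = (H1·X1 + H2·X2)/2 and the convention W12 = mean(X1·X2*); Equations (16-1) and (16-2) are not reproduced in this text, so the expansion used here is derived from that assumption. The function names and toy values are illustrative.

```python
def cross_power(frames_a, frames_b):
    """Time-averaged spectrum W_ab = mean(A * conj(B)) for one bin."""
    return sum(a * b.conjugate() for a, b in zip(frames_a, frames_b)) / len(frames_a)


def virtual_integrated_power(w11, w22, w12, h1, h2):
    """Power of Z = (H1*X1 + H2*X2)/2 from precomputed spectra, without forming Z."""
    cross = (h1 * h2.conjugate() * w12).real
    return (abs(h1) ** 2 * w11.real + abs(h2) ** 2 * w22.real + 2.0 * cross) / 4.0


# Three frames of one bin for two channels (arbitrary toy values).
x1 = [1 + 2j, -0.5 + 1j, 0.3 - 0.7j]
x2 = [0.8 - 1j, 1.5 + 0.2j, -1 + 0.4j]
h1, h2 = 1 + 0j, 0.6 - 0.8j            # correction coefficients for one direction

w11, w22, w12 = cross_power(x1, x1), cross_power(x2, x2), cross_power(x1, x2)
fast = virtual_integrated_power(w11, w22, w12, h1, h2)

# Reference: integrate first, then average the power.
z = [(h1 * a + h2 * b) / 2 for a, b in zip(x1, x2)]
direct = cross_power(z, z).real
```

The three averaged spectra are computed once per frame, and each additional virtual direction then costs only a few multiplications instead of a separate time average.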
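[0257]
In the virtual direction gain coefficient calculation unit 3007, the corrected cross spectrum for the virtual direction obtained by the cross/power spectrum correction unit 3004,
[Equation 53]

is first weighted based on the coherence function γ^2(f) and the corrected power information pij(f, θ, φ). The unit then weights the virtual integrated signal power Wzz(f, θ, φ) obtained by the virtual integrated power spectrum calculation unit 3006 based on the coherence function γ^2(f), and obtains the virtual direction gain coefficient σ(θ, φ), the gain coefficient corresponding to the virtual direction, by the earlier Equation (2-3).
[0258]
The processing of the sound source direction detection unit 3008 may be the same as that of the sound source direction detection unit 1007 in the tenth embodiment. The gain coefficient σ(θo, φo) corresponding to the sound source direction detected by the sound source direction detection unit 3008 is called the sound source direction gain coefficient. As in the tenth embodiment, the sound source direction detection unit 3008 also outputs, in addition to the sound source direction (θo, φo), the correction coefficient Hi(θo, φo) for the sound source direction as the sound source direction correction coefficient. The sound source direction can thus be estimated based on the gain coefficient.
[0259]
Next, the processing flow of this embodiment is described with reference to FIG. 41.
First, as initialization, the range of sound source directions is set (step S500). Then generation of the correction coefficients (step S501), input of the audio signals from the microphones 101-1 to 101-M (step S502), frequency analysis (step S503), calculation of the cross spectrum and the power spectrum (step S504), and calculation of the coherence function (step S505) are performed in sequence. Next, spectrum correction (step S506), calculation of the power information (step S507), calculation of the virtual integrated power spectrum (step S508), and calculation of the virtual direction gain coefficient (step S509) are repeated for all virtual directions, and finally the sound source direction is detected (step S510). Steps S502 to S510 are repeated each time a frame of the digitized audio signal is input in step S502.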
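[0260]
(Seventeenth Embodiment)
Next, as a seventeenth embodiment of the present invention, processing is described that uses the sound source direction estimated by the gain-coefficient-based estimation of the sixteenth embodiment to track the target sound even when it moves, so that speech enhancement is performed stably.
[0261]
As shown in FIG. 42, the audio signal processing apparatus of this embodiment has the frequency analysis unit 201, the sound source direction estimation unit 3000, a spectrum information correction unit 3100 that corrects the multi-channel spectrum information from the frequency analysis unit 201 based on the sound source direction correction coefficient, a signal integration unit 3101 that integrates the corrected spectrum information, a coherence filter operation unit 3102 that filters the integrated spectrum information based on the coherence function, and a gain control unit 3103 that suppresses noise by further applying gain control, based on the sound source gain coefficient, to the filtered spectrum information.
[0262]
The frequency analysis unit 201, the sound source direction estimation unit 3000, and the spectrum information correction unit 3100 are the same as in the sixteenth embodiment, and the coherence filter operation unit 3102 is the same as in the eleventh embodiment.
[0263]
The signal integration unit 3101 integrates the corrected spectrum information using the same integration formula as the one assumed when the virtual integrated power spectrum calculation unit 3006, shown in FIG. 40 within the sound source direction estimation unit 3000, calculates the virtual integrated power spectrum. That is, if the virtual integrated power spectrum calculation unit 3006 assumes the arithmetic mean of two channels, the signal integration unit 3101 also uses the arithmetic mean to integrate the spectrum information. In this case, let the sound source direction obtained by the sound source direction estimation unit 3000 be (θo, φo) and the corresponding correction coefficients be H1(f, θo, φo) and H2(f, θo, φo). The integrated signal Z(f, θo, φo) corrected for the sound source direction is then given by the following equation.
[0264]
[Equation 54]

[0265]
X1(f) and X2(f) are the spectrum information of each channel obtained by the frequency analysis unit.
[0266]
The gain control unit 3103 controls the amplitude of the integrated signal Z(f, θo, φo), corrected according to Equation (16-1), using the gain coefficient σ(θo, φo) corresponding to the sound source direction estimated by the sound source direction estimation unit 3000. As the control method, simple proportional control or any of the methods described in the first embodiment may be used.
[0267]
Next, the processing flow of this embodiment is described with reference to FIG. 43.
First, as initialization, the range of sound source directions is set and the correction coefficients are generated (step S600). Then input of the audio signals from the microphones 101-1 to 101-M (step S601), frequency analysis (step S602), estimation of the sound source direction (step S603), correction of the spectrum information (step S604), integration of the spectrum information (step S605), the coherence filter operation (step S606), and gain control (step S607) are repeated each time a frame of the digitized audio signal is input in step S601.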
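[0268]
(Eighteenth Embodiment)
Next, as an eighteenth embodiment of the present invention, an audio signal processing apparatus is described that uses an adaptive filter to correct the differences between channels of the input audio signal adaptively, reducing the influence of reflections as well as of small deviations of the target sound direction from the assumed direction. The tracking-type stabilization based on sound source direction estimation described in the eleventh and seventeenth embodiments is effective against shifts of the target sound but less effective against inter-channel signal deviations caused by reflections and the like. Reflection conditions often differ with the receiving position and therefore cause deviations between channels. This embodiment therefore uses a stabilization method based on an adaptive filter.
[0269]
Stabilization using an adaptive filter has already been described in the eighth embodiment. There, before the target signal activity based on the correlation coefficient is obtained, the adaptive filter is controlled by the correlation coefficient to correct the differences between channels. Because calculating the correlation coefficient involves a time delay, that approach is effective against disturbances that change more slowly than this delay, such as sensitivity changes due to microphone bias voltage variation or aging. This embodiment, by contrast, is effective when the inter-channel deviation of the input audio signal changes relatively quickly, for example when reflected waves are present or the target sound moves frequently.
[0270]
The audio signal processing apparatus according to this embodiment comprises a plurality of spatially separated microphones (not shown) and a frequency analysis unit that frequency-analyzes the multi-channel input audio signals from the microphones to generate multi-channel spectrum information. In addition, as shown in FIG. 44, it comprises a stabilized target signal activity estimation unit 4000 that takes the multi-channel spectrum information from the frequency analysis unit as input and estimates the target signal activity.
[0271]
The stabilized target signal activity estimation unit 4000 comprises a first modified cross-correlation coefficient calculation unit 4001 that calculates a first modified cross-correlation coefficient, which is a modified cross-correlation coefficient between channels of the input audio signal, an adaptive spectrum correction unit 4002 that adaptively corrects the differences between the multi-channel spectrum information based on the first modified cross-correlation coefficient to obtain corrected spectrum information, and a second modified cross-correlation coefficient calculation unit 4003 that calculates a second modified cross-correlation coefficient from the corrected spectrum information. The frequency analysis unit and the first and second modified cross-correlation coefficient calculation units 4001 and 4003 perform the same processing as already described.
[0272]
As shown in FIG. 46, the adaptive spectrum correction unit 4002 identifies the transfer function between the spectrum information of the channels obtained by the frequency analysis unit using an adaptive filter 4103 and corrects the difference. The adaptive filter 4103 is controlled based on the modified cross-correlation coefficient output from the first modified cross-correlation coefficient calculation unit 4001 and is updated only while the target sound is arriving. This avoids adaptation to noise, so that only the transfer function for the target sound is estimated.
[0273]
The calculation of the first modified cross-correlation coefficient involves a time delay caused by the time averaging used to obtain the cross spectrum and the power spectrum. The correlation coefficient output from the first modified cross-correlation coefficient calculation unit 4001 is therefore based on input data that lags the present by that delay. To synchronize the spectrum information input to the adaptive filter 4103 with the correlation coefficient, the spectrum information is delayed by delay circuits 4101 and 4102 by the same amount as the correlation coefficient calculation.
[0274]
The time delay is T/2, where T is the length of time required for averaging the cross/power spectrum. In frames, with Ta the number of averaged frames, the delay is Ta/2 frames when Ta is even and (Ta−1)/2 frames when Ta is odd. An odd Ta is preferable.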
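The delay rule and a frame delay line can be sketched as follows. The rule itself (Ta/2 for even Ta, (Ta−1)/2 for odd Ta) comes from the text and reduces to integer division; the `FrameDelay` class is an illustrative stand-in for the delay circuits 4101 and 4102, not the patent's implementation.

```python
from collections import deque


def delay_frames(ta):
    """Delay in frames for an averaging window of ta frames: ta/2 or (ta-1)/2."""
    return ta // 2


class FrameDelay:
    """Delays a stream of frames by n frames, emitting None until filled."""

    def __init__(self, n):
        self.buf = deque([None] * n)

    def push(self, frame):
        self.buf.append(frame)
        return self.buf.popleft()


line = FrameDelay(delay_frames(5))     # Ta = 5 gives a 2-frame delay
out = [line.push(k) for k in range(5)]
```

With an odd Ta the averaging window is centered exactly on the delayed frame, which is one plausible reading of why the text prefers odd values.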
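[0275]
The operation using the adaptive filter 4103 is performed, for example, with a frequency-domain LMS adaptive filter as already described in the eighth embodiment, and the identified filter W(f) is applied to the spectrum information of the channel used as the reference signal to correct it. The second modified cross-correlation coefficient calculation unit 4003 calculates and outputs the second modified cross-correlation coefficient from the spectrum information corrected by the adaptive spectrum correction unit 4002.
[0276]
Next, the processing flow of this embodiment is described with reference to FIG. 45. First, the first modified cross-correlation coefficient, which is a modified cross-correlation coefficient between channels of the input audio signal, is calculated (step S701). Based on it, adaptive spectrum correction is performed by correcting the difference in transfer function between the spectrum information of the channels (step S702). Finally, the second modified cross-correlation coefficient is calculated from the corrected spectrum information and output as the target signal activity (step S703).
[0277]
In this embodiment, the modified cross-correlation coefficient is calculated twice, taking the time delay into account, so that the adaptation control and the filter update use synchronized data. This makes it possible to correct inter-channel differences adaptively and accurately while suppressing the influence of noise, even when conditions change quickly.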
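[0278]
(Nineteenth Embodiment)
The eighteenth embodiment described adaptive stabilization of the modified cross-correlation coefficient calculation. The same processing can be applied to the calculation of the gain coefficient described in the twelfth embodiment in place of the modified cross-correlation coefficient.
[0279]
The audio signal processing apparatus according to this embodiment comprises a plurality of spatially separated microphones (not shown) and a frequency analysis unit that frequency-analyzes the multi-channel input audio signals from the microphones to generate multi-channel spectrum information. In addition, as shown in FIG. 47, it comprises a stabilized target signal activity estimation unit 5000 that takes the multi-channel spectrum information from the frequency analysis unit as input and estimates the target signal activity.
[0280]
The stabilized target signal activity estimation unit 5000 comprises a first modified gain coefficient calculation unit 5001 that calculates a first modified gain coefficient, a value corresponding to the activity of the target sound, from the multi-channel spectrum information, an adaptive spectrum correction unit 5002 that adaptively corrects the differences between the multi-channel spectrum information based on the first modified gain coefficient to obtain corrected spectrum information, and a second modified gain coefficient calculation unit 5003 that calculates a second modified gain coefficient from the corrected spectrum information. The first and second modified gain coefficient calculation units 5001 and 5003 perform the same processing as described in the twelfth embodiment.
[0281]
In the first, second, fourth, sixth, eleventh, fourteenth, and seventeenth embodiments, speech enhancement is performed using the calculated correlation coefficient or gain coefficient. In these embodiments as well, it is desirable, as explained with FIG. 46, to take into account the time delay of the correlation coefficient or gain coefficient calculation and to delay the input spectrum information so that it is synchronized with the coefficient. As explained with FIG. 46, the number of delay frames is chosen to be half the number of frames used for time averaging in estimating the cross spectrum and the power spectrum. Since the introduction of such delay processing is self-evident, it is omitted from the description of those embodiments.
[0282]
The present invention is not limited to the above embodiments as they are. At the implementation stage, the components may be modified without departing from the gist of the invention. Various inventions can be formed by appropriately combining the components disclosed in the embodiments. For example, some components may be deleted from all the components shown in an embodiment. Components from different embodiments may also be combined as appropriate.
[0283]
[Effects of the Invention]
As described above, the present invention makes it possible to suppress noise under real-environment noise including sudden noise and diffuse noise, to detect with high accuracy whether the target speech is arriving in a noisy environment, and to perform audio signal processing suitable as preprocessing for hands-free communication and speech recognition.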
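[Brief Description of the Drawings]
[FIG. 1] Block diagram showing the configuration of an audio signal processing apparatus according to a first embodiment of the present invention
[FIG. 2] Diagram showing various functions used for gain control of the integrated audio signal in the same embodiment
[FIG. 3] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 4] Diagram showing an example of microphone arrangement in the same embodiment
[FIG. 5] Block diagram showing the configuration of an audio signal processing apparatus according to the same embodiment that uses an adaptive beamformer in the signal integration unit
[FIG. 6] Block diagram showing the configuration of an audio signal processing apparatus according to a second embodiment of the present invention
[FIG. 7] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 8] Block diagram showing the configuration of an audio signal processing apparatus according to a third embodiment of the present invention
[FIG. 9] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 10] Block diagram showing the configuration of an audio signal processing apparatus according to a fourth embodiment of the present invention
[FIG. 11] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 12] Flowchart showing the detection processing procedure in the same embodiment
[FIG. 13] Diagram showing a specific example of the detection processing in the same embodiment
[FIG. 14] Block diagram showing the configuration of an audio signal processing apparatus according to a fifth embodiment of the present invention
[FIG. 15] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 16] Block diagram showing the configuration of an audio signal processing apparatus according to a sixth embodiment of the present invention
[FIG. 17] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 18] Diagram showing an example of microphone arrangement according to a seventh embodiment of the present invention
[FIG. 19] Diagram showing other examples of microphone arrangement according to the same embodiment
[FIG. 20] Diagram expressing the direction of arrival for the arrangements of FIG. 19 (B1) to (B4) using azimuth and elevation
[FIG. 21] Diagram showing, for the arrangements of FIG. 19 (B1) to (B4), the relation between the arrival directions in which the phases of the two microphones coincide and those in which their sensitivities coincide
[FIG. 22] Block diagram showing the configuration of an audio signal processing apparatus according to an eighth embodiment of the present invention
[FIG. 23] Block diagram showing the configuration of the spectrum correction unit in the same embodiment
[FIG. 24] Block diagram showing the configuration of an audio signal processing apparatus according to a ninth embodiment of the present invention
[FIG. 25] Block diagram showing the configuration of the correction filter learning instruction unit in the same embodiment
[FIG. 26] Block diagram showing the configuration of the spectrum correction unit in the same embodiment
[FIG. 27] Flowchart showing the processing procedure of the spectrum correction unit in the same embodiment
[FIG. 28] Block diagram showing the configuration of an audio signal processing apparatus according to a tenth embodiment of the present invention
[FIG. 29] Diagram explaining the setting of virtual points for direction-of-arrival estimation in the same embodiment
[FIG. 30] Diagram explaining the calculation of the propagation delay in the same embodiment
[FIG. 31] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 32] Block diagram showing the configuration of an audio signal processing apparatus according to an eleventh embodiment of the present invention
[FIG. 33] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 34] Block diagram showing the configuration of an audio signal processing apparatus according to a twelfth embodiment of the present invention
[FIG. 35] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 36] Block diagram showing the configuration of an audio signal processing apparatus according to a thirteenth embodiment of the present invention
[FIG. 37] Block diagram showing the configuration of an audio signal processing apparatus according to a fourteenth embodiment of the present invention
[FIG. 38] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 39] Block diagram showing the configuration of an audio signal processing apparatus according to a fifteenth embodiment of the present invention
[FIG. 40] Block diagram showing the configuration of an audio signal processing apparatus according to a sixteenth embodiment of the present invention
[FIG. 41] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 42] Block diagram showing the configuration of an audio signal processing apparatus according to a seventeenth embodiment of the present invention
[FIG. 43] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 44] Block diagram showing the configuration of an audio signal processing apparatus according to an eighteenth embodiment of the present invention
[FIG. 45] Flowchart showing the audio signal processing procedure in the same embodiment
[FIG. 46] Block diagram showing the configuration of the adaptive spectrum correction unit in the same embodiment
[FIG. 47] Block diagram showing the configuration of an audio signal processing apparatus according to a nineteenth embodiment of the present invention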
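[Explanation of Reference Numerals]
101-1 to 101-M … microphones
102 … cross-correlation coefficient calculation unit
103 … signal integration unit
104 … gain control unit (adjustment unit)
106 … adaptive beamformer
201 … frequency analysis unit
202 … cross-correlation coefficient calculation unit
203 … signal integration unit
204 … gain control unit (adjustment unit)
300 … target signal activity calculation unit
301 … cross/power spectrum calculation unit
302 … coherence function calculation unit
303 … power information calculation unit
304 … modified spectrum calculation unit
305 … weighted cross-correlation coefficient calculation unit
401 … detection processing unit (determination unit)
501 … gain control unit (adjustment unit)
601 … coherence filter operation unit
701 … omnidirectional microphone
702, 711, 712 … directional microphones
800, 900 … spectrum correction units
801, 904 … adaptive filters
802, 901, 902 … correction filters
910 … correction filter learning instruction unit
[0001]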
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an audio signal processing method, apparatus, and program for processing input audio signals obtained by a plurality of microphones. More specifically, the present invention relates to a technique for emphasizing and outputting a target audio signal from an input audio signal as one of noise suppression techniques used in, for example, hands-free communication and voice recognition.
[0002]
[Prior art]
In the field of audio signal processing, noise countermeasures have become an important issue with the practical use of speech recognition and mobile phones. Noise suppression techniques include spectral subtraction, which is used with a single microphone and assumes, for example, that the noise is stationary, and microphone array processing, which uses a plurality of microphones. Among microphone array techniques, the adaptive microphone array is promising in terms of cost because it achieves high noise suppression even with a small number of microphones. The adaptive microphone array suppresses noise by automatically steering a null, that is, a direction of low sound reception sensitivity, toward the noise, and is also called an adaptive beamformer (adaptive BF).
[0003]
The adaptive beamformer is effective against highly directional noise, but its suppression performance is insufficient for other kinds of noise, such as (1) high-level diffuse noise like the noise generated while driving a car, (2) noise whose acoustic transfer path changes rapidly, like sound radiated from a vehicle moving at high speed, and (3) noise of very short duration, like sudden noise. Such noise is very common in real environments and must be dealt with.
[0004]
Non-Patent Document 1 discloses a technique for suppressing noise by performing filtering based on a coherence function between two channels of input audio signals from a plurality of microphones.
Non-Patent Document 2, on the other hand, discloses a technique for dealing with highly correlated noise: the cross spectrum of the noise between channels is estimated in sections where the target sound is absent, and in sections where the target sound is present the noise cross spectrum is subtracted from the cross spectrum of the noisy target sound.
[0005]
Non-Patent Document 3 discloses a method of signal detection using, for example, the cross-correlation between signals of a plurality of channels, in which the presence of a target signal is determined by thresholding the coherence function.
Non-Patent Document 4 discloses a method of detecting a target sound by performing threshold processing on a cross-correlation coefficient between audio signals of a plurality of channels output from a plurality of microphones.
Non-Patent Document 5 describes a method of integrating audio signals of two or more channels into one channel using an adaptive beamformer.
Non-Patent Document 6 discloses a method of maximum likelihood estimation of the generalized cross-correlation function between channels of a multi-channel audio signal using a weighting function.
[0006]
[Non-Patent Document 1]
"Using the coherence function for noise reduction", IEEE Proceedings-I Vol. 139, no. 3, 1992
[0007]
[Non-Patent Document 2]
"Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator", IEEE Trans. on Speech and Audio Processing, Vol. 5, No. 5, 1997
[0008]
[Non-Patent Document 3]
"Knowing the Wheat from the Weeds in Noisy Speech", H. Agaiby and T. J. Moir, Proc. of EUROSPEECH '97, vol. 3, pp. 111-112, 1997
[0009]
[Non-Patent Document 4]
"Study on Target Sound Detection Using Two Directional Microphones", Nagata et al., Journal of the Institute of Electronics, Information and Communication Engineers, Vol. J83-A No. 2 (2000)
[0010]
[Non-Patent Document 5]
"Adaptive Filter Theory", S. Haykin, Prentice Hall
[0011]
[Non-Patent Document 6]
"The Generalized Correlation Method for Estimation of Time Delay", C. H. Knapp and G. C. Carter, IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-24, No. 4, pp. 320-327, 1976
[0012]
[Problems to be solved by the invention]
The technique described in Non-Patent Document 1 is effective for noise that can be assumed to be uncorrelated between channels, such as the diffuse noise of (1). However, it cannot suppress the sudden noise of (3) or directional noise of the kind a beamformer can suppress, because such noise is highly correlated between channels. The technique described in Non-Patent Document 2 can suppress such highly correlated noise, but only when the noise is directional and can be assumed to be stationary. In that kind of noise environment, a method that steers a directional null toward the noise source, such as a beamformer, copes better.
[0013]
An object of the present invention is to provide an audio signal processing method, apparatus, and program capable of suppressing noise under real-environment noise including sudden noise and enhancing the target sound component.
[0014]
Another object of the present invention is to perform detection of whether or not a target sound has arrived with high accuracy.
[0015]
[Means for Solving the Problems]
In order to solve the above-described problems, according to a first aspect of the present invention, a cross-correlation coefficient between input audio signals of a plurality of channels output from a plurality of microphones spatially separated is obtained. By adjusting the magnitude of the integrated audio signal obtained by integrating the input audio signal into one channel according to the cross-correlation coefficient, an output audio signal in which the target sound component is emphasized is generated.
[0016]
According to a second aspect of the present invention, a plurality of channels of spectrum information is generated by frequency-analyzing a plurality of channels of input audio signals output from each microphone, and a cross-correlation coefficient between the plurality of channels of spectrum information is obtained. By adjusting the magnitude of the integrated spectral signal obtained by integrating the spectral information into one channel according to the cross-correlation coefficient, a spectral signal in which the component of the target sound is emphasized is obtained.
[0017]
According to a third aspect of the present invention, spectrum information of a plurality of channels is generated by frequency-analyzing the input audio signals of a plurality of channels output from the microphones, and the power spectrum of each channel of the input audio signal and the cross spectrum between channels are obtained from the spectrum information. A coherence function between the spectrum information of the channels is then obtained from the power spectrum and the cross spectrum. Next, the power spectrum and the cross spectrum are corrected using the coherence function, and a weighted cross-correlation coefficient between channels of the input audio signal is obtained based on the corrected power spectrum and cross spectrum.
[0018]
According to a fourth aspect of the present invention, spectrum information of a plurality of channels is generated by frequency-analyzing the input audio signals of a plurality of channels output from the microphones, and the power spectrum of each channel of the input audio signal and the cross spectrum between channels are obtained from the spectrum information. A coherence function between the spectrum information of the channels is then obtained from the power spectrum and the cross spectrum, and power information on the signal power between channels of the input audio signal is obtained from the spectrum information. Next, the power spectrum and the cross spectrum are corrected using the coherence function and the power information, and a weighted cross-correlation coefficient between channels of the input audio signal is obtained based on the corrected power spectrum and cross spectrum.
[0019]
In the third or fourth aspect, whether the target sound has arrived at the microphones may be determined by thresholding the cross-correlation coefficient with a predetermined threshold. The spectrum information may be integrated into one channel to obtain an integrated spectrum signal, and the magnitude of the integrated spectrum signal may be adjusted according to the cross-correlation coefficient. Each frequency component of the integrated spectrum signal may be weighted according to the coherence function. At least one of the phase and the amplitude of the spectrum information of the plurality of channels may be corrected according to the cross-correlation coefficient so that it matches between channels.
[0020]
In the third and fourth aspects, the plurality of microphones may include at least one omnidirectional microphone and at least one directional microphone, or at least two directional microphones with different directivity axes. In the latter case, the at least two directional microphones are preferably arranged so that their directivity axes do not lie in the same plane and so that the angles between each directivity axis and the arrival direction of the target sound are equal.
[0021]
Further, according to another aspect of the present invention, there is provided a program as described below for executing the above-described audio signal processing by a computer, or a storage medium storing the program.
[0022]
(1) A program for causing a computer to perform: a process of obtaining a cross-correlation coefficient between input audio signals of a plurality of channels output from a plurality of spatially separated microphones; a process of integrating the input audio signals into one channel to generate an integrated audio signal; and a process of generating an output audio signal by adjusting the magnitude of the integrated audio signal in accordance with the cross-correlation coefficient.
[0023]
(2) A program for causing a computer to perform: a process of generating spectrum information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of spatially separated microphones; a process of obtaining a cross-correlation coefficient between the spectrum information of the plurality of channels; a process of integrating the spectrum information into one channel to generate an integrated spectrum signal; and a process of adjusting the magnitude of the integrated spectrum signal according to the cross-correlation coefficient.
[0024]
(3) A program for causing a computer to perform: a process of generating spectrum information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of spatially separated microphones; a process of obtaining, from the spectrum information, a power spectrum for each channel of the input audio signal and a cross spectrum between channels; a process of obtaining a coherence function between the spectrum information of the plurality of channels from the power spectrum and the cross spectrum; a process of correcting the power spectrum and the cross spectrum using the coherence function; and a process of obtaining a weighted cross-correlation coefficient between channels of the input audio signal based on the corrected power spectrum and cross spectrum.
[0025]
(4) A program for causing a computer to perform: a process of generating spectrum information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of spatially separated microphones; a process of obtaining, from the spectrum information, a power spectrum for each channel of the input audio signal and a cross spectrum between channels; a process of obtaining a coherence function between the spectrum information of the plurality of channels from the power spectrum and the cross spectrum; a process of obtaining power information on the signal power between channels of the input audio signal based on the spectrum information; a process of correcting the power spectrum and the cross spectrum using the coherence function and the power information; and a process of obtaining a weighted cross-correlation coefficient between channels of the input audio signal based on the corrected power spectrum and cross spectrum.
[0026]
(5) A program for causing a computer to perform: a process of generating spectral information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones, arranged spatially apart from each other, in response to voice input to the microphones; a process of calculating, from the spectral information, a power spectrum for each channel of the input audio signal and a cross spectrum between channels; a process of calculating a coherence function between channels of the spectral information of the plurality of channels from the power spectrum and the cross spectrum; a process of generating, for a virtual arrival direction group consisting of a plurality of virtual arrival directions of voice, a correction coefficient for correcting the voice arriving from each virtual arrival direction so that it matches among the plurality of channels; a process of correcting the power spectrum and the cross spectrum based on the correction coefficient to generate a corrected power spectrum and a corrected cross spectrum; a process of calculating power information regarding signal power between channels of the input audio signal based on the corrected power spectrum and the corrected cross spectrum; a process of weighting the corrected power spectrum and the corrected cross spectrum based on the coherence function and the power information and calculating, for each virtual arrival direction of the virtual arrival direction group, a cross-correlation coefficient between channels of the input audio signal; and a process of detecting the sound source direction of the voice input to the microphones based on the cross-correlation coefficient and outputting the value of the cross-correlation coefficient in the detected sound source direction as a sound source correlation coefficient.
[0027]
(6) A program for causing a computer to perform: a process of frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones arranged spatially apart from each other to generate spectral information of a plurality of channels; a process of calculating, from the spectral information, a power spectrum for each channel of the input audio signal and a cross spectrum between channels; a process of calculating a coherence function between channels of the spectral information of the plurality of channels from the power spectrum and the cross spectrum; a process of integrating the spectral information of the plurality of channels into one channel to generate an integrated spectrum signal; a process of calculating a power spectrum of the integrated spectrum signal; and a process of weighting the cross spectrum based on the coherence function and normalizing the weighted cross spectrum based on the power spectrum of the integrated spectrum signal to calculate a gain coefficient.
[0028]
(7) A program for causing a computer to perform: a process of frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones arranged spatially apart from each other to generate spectral information of a plurality of channels; a process of calculating, from the spectral information, a power spectrum for each channel of the input audio signal and a cross spectrum between channels; a process of calculating a coherence function between the plurality of channels from the cross spectrum between the channels and the power spectrum of each channel; a process of generating, for a virtual arrival direction group consisting of a plurality of virtual arrival directions of voice, a correction coefficient for correcting the voice arriving from each virtual arrival direction so that it matches among the plurality of channels; a process of correcting the power spectrum and the cross spectrum based on the correction coefficient to generate a corrected power spectrum and a corrected cross spectrum; a process of calculating power information regarding signal power between channels of the input audio signal based on the corrected power spectrum and the corrected cross spectrum; a process of calculating, based on the corrected power spectrum and the corrected cross spectrum, a power spectrum of the integrated spectrum information that would be obtained by correcting the spectral information of the plurality of channels with the correction coefficient and integrating it; a process of weighting the corrected cross spectrum based on the coherence function and the power information and normalizing it based on the power spectrum of the virtually integrated spectrum information to obtain a gain coefficient corresponding to each virtual arrival direction; and a process of detecting the sound source direction of the voice input to the microphones based on the gain coefficient and outputting the value of the gain coefficient corresponding to the detected sound source direction as a sound source gain coefficient.
[0029]
(8) A program for causing a computer to execute: a process of frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones arranged spatially apart from each other to generate spectral information of a plurality of channels; a process of calculating, with the spectral information of the plurality of channels as input, a first modified cross-correlation coefficient between channels of the input audio signals of the plurality of channels; a process of generating corrected spectrum information by adaptively correcting the difference between channels of the spectral information of the plurality of channels based on the first modified cross-correlation coefficient; and a process of calculating a second modified cross-correlation coefficient from the corrected spectrum information, wherein the calculation of the first and second modified cross-correlation coefficients includes: (a) a process of calculating a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectral information or the corrected spectrum information; (b) a process of calculating a coherence function between channels of the spectral information of the plurality of channels from the power spectrum and the cross spectrum; (c) a process of calculating power information relating to signal power between channels of the input audio signal from the power spectrum; and (d) a process of weighting the power spectrum and the cross spectrum based on the coherence function and the power information, calculating a cross-correlation coefficient between channels of the input audio signal, and outputting it as the first or second modified cross-correlation coefficient.
[0030]
(9) A program for causing a computer to perform: a process of frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones arranged spatially apart from each other to generate first spectrum information of a plurality of channels; a process of calculating a first modified gain coefficient based on the first spectrum information; a process of adaptively correcting the difference between channels of the first spectrum information based on the first modified gain coefficient to generate second spectrum information; and a process of calculating a second modified gain coefficient from the second spectrum information, wherein the calculation of the first and second modified gain coefficients includes: (a) a process of calculating, from the first or second spectrum information, a power spectrum for each channel of the input audio signal and a cross spectrum between channels; (b) a process of calculating a coherence function from the power spectrum and the cross spectrum; (c) a process of calculating power information relating to signal power between channels of the input audio signal from the power spectrum; (d) a process of integrating the first or second spectrum information into one channel to generate an integrated spectrum signal; (e) a process of calculating a power spectrum of the integrated spectrum signal; and (f) a process of weighting the cross spectrum based on the coherence function and the power information and normalizing the weighted cross spectrum based on the power spectrum of the integrated spectrum signal to calculate the first or second modified gain coefficient.
[0031]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. The audio signal processing in each embodiment described below can be implemented as software (including firmware) executed on a computer, and can also be implemented by hardware.
[0032]
(First embodiment)
FIG. 1 shows the configuration of an audio signal processing device according to the first embodiment of the present invention. The microphones 101-1 to 101-M pick up an acoustic signal that includes a target sound, such as a speaker's voice, and output M input audio signals. Here, the target sound is the component that is ultimately to be extracted from the input sound as the output sound by suppressing noise. The input audio signals from the microphones 101-1 to 101-M are converted into digital signals by an A/D converter (not shown) and then input to the cross-correlation coefficient calculation unit 102 and the signal integration unit 103.
[0033]
The cross-correlation coefficient calculation unit 102 calculates a cross-correlation coefficient between the input audio signals of the M channels. The signal integration unit 103 integrates the input audio signals of the M channels into one channel; the signal it outputs is called the integrated audio signal. The integrated audio signal is input to the gain control unit 104, whose gain is controlled according to the cross-correlation coefficient, and its magnitude is adjusted. As a result, the output audio signal 105, in which the target sound component is emphasized, is output from the gain control unit 104.
[0034]
A cross-correlation coefficient calculated from observation signals of a plurality of channels has long been used in sonar and radar processing as a measure for detecting a target signal in noise. This embodiment proposes using it in audio signal processing not only to detect the target sound but also to enhance it. With this method, noise can be suppressed effectively in environments where the noise is uncorrelated between channels.
[0035]
The cross-correlation coefficient in the present embodiment is a value ρ calculated by the following equation when the input audio signal has two channels of x (n) and y (n).
[0036]
(Equation 1)

[0037]
Here, an overlined value indicates an expected value or a time average value (the same applies hereinafter).
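The image for Equation (1-1) is not reproduced in this text. As a minimal sketch, assuming the usual normalized form (the time average of x(n)y(n) divided by the square root of the product of the time averages of x(n)² and y(n)²), the two-channel coefficient could be computed as follows. The function name and the zero-energy guard are illustrative assumptions, not part of the original description.

```python
import math


def cross_correlation_coefficient(x, y):
    """Normalized zero-lag cross-correlation of two equal-length sequences.

    Assumed form: mean(x*y) / sqrt(mean(x*x) * mean(y*y)).
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    mean_xx = sum(a * a for a in x) / n
    mean_yy = sum(b * b for b in y) / n
    denom = math.sqrt(mean_xx * mean_yy)
    # Guard (assumption): return 0 when either channel carries no energy.
    return mean_xy / denom if denom > 0.0 else 0.0
```

With this form, identical channels give a value of 1, sign-inverted channels give -1, and channels with no common component give 0.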
[0038]
When the input audio signal has M channels (not limited to two channels), the cross-correlation coefficient ρ is calculated by the following equation, for example.
[0039]
(Equation 2)

[0040]
Here, x_p(n) and x_q(n) are the input audio signals of the p-th and q-th channels, respectively, and K = M(M-1)/2.
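The image for Equation (1-2) is likewise not reproduced here. One reading consistent with K = M(M-1)/2 being the number of distinct channel pairs is an average of the pairwise coefficients over all pairs p < q. The sketch below assumes that form; the exact expression in the original may differ, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pair_coefficient(x, y):
    """Normalized zero-lag cross-correlation of one channel pair."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0.0 else 0.0


def multichannel_coefficient(channels):
    """Average of the pairwise coefficients over all K = M(M-1)/2 pairs (assumed form)."""
    pairs = list(combinations(range(len(channels)), 2))
    if not pairs:
        raise ValueError("at least two channels are required")
    k = len(pairs)  # K = M(M-1)/2
    return sum(pair_coefficient(channels[p], channels[q]) for p, q in pairs) / k
```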
[0041]
Conventionally, the cross-correlation between channels of multichannel signals has been used for signal detection. For example, a method of determining the presence of a target signal by thresholding a coherence function is described in Non-Patent Document 3: "Knowing the Wheat from the Weeds in Noisy Speech", H. Agaiby and T. J. Moir, Proc. of EUROSPEECH '97, vol. 3, pp. 111-112, 1997.
[0042]
The cross-correlation coefficient is also used for voice detection, and a method of thresholding this value to detect a target sound is described in, for example, Non-Patent Document 4: "Study on Target Sound Detection Using Two Directional Microphones", Nagata et al., IEICE Journal, Vol. J83-A, No. 2 (2000). The present embodiment is characterized in that the cross-correlation is used to enhance the target sound rather than to detect it by threshold processing.
[0043]
The cross-correlation coefficient ρ takes a value close to 1 when the target sound is present in the input audio and a value close to 0 when only noise is present. To use it for enhancing the target sound, it therefore suffices to control the gain applied to the audio signal according to the magnitude of ρ. That is, for the input audio signals of the plurality of channels obtained from the microphones 101-1 to 101-M, the cross-correlation coefficient calculation unit 102 calculates the cross-correlation coefficient between channels according to equation (1-1) or (1-2). The gain of the gain control unit 104 is controlled based on this cross-correlation coefficient, and the gain control unit 104 adjusts the amplitude of the integrated audio signal from the signal integration unit 103 to generate the output audio signal 105.
[0044]
The cross-correlation coefficient ρ ranges from -1 to +1. The gain control unit 104 therefore either takes the absolute value of the cross-correlation coefficient before use or sets it to 0 when it is negative. Gain control in the gain control unit 104 is performed by, for example, multiplying the amplitude of the integrated audio signal by the cross-correlation coefficient thus obtained. In this case, the relationship between the cross-correlation coefficient and the gain may be a proportional relationship such as the straight line (A) shown in FIG. 2, or a relationship such as the broken line (B) or the curve (C) in FIG. 2.
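The mapping from ρ to a gain described above can be sketched as follows. The two ways of handling negative values come from the text; the exact shapes of lines (B) and (C) in FIG. 2 are not given here, so the knee at 0.5 and the squared curve are purely hypothetical stand-ins, as are the function and parameter names.

```python
def gain_from_correlation(rho, negative="clip", curve="linear"):
    """Map a cross-correlation coefficient in [-1, 1] to a gain in [0, 1]."""
    # Negative coefficients: take the absolute value, or set them to 0.
    r = abs(rho) if negative == "abs" else max(rho, 0.0)
    r = min(r, 1.0)
    if curve == "linear":   # proportional relationship, like line (A)
        return r
    if curve == "broken":   # hypothetical broken line: 0 below 0.5, then a ramp to 1
        return 0.0 if r < 0.5 else (r - 0.5) / 0.5
    if curve == "power":    # hypothetical smooth curve: square of the coefficient
        return r * r
    raise ValueError("unknown curve: " + curve)


def apply_gain(z, rho, **options):
    """Scale the integrated signal z by the gain derived from rho."""
    g = gain_from_correlation(rho, **options)
    return [g * sample for sample in z]
```

For example, `apply_gain(z, 0.5)` halves the amplitude of the integrated signal, while a coefficient of -0.4 yields a gain of 0 with clipping and 0.4 with the absolute-value option.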
[0045]
Next, the flow of processing in this embodiment will be described with reference to FIG.
First, audio signals are input from the microphones 101-1 to 101-M (step S11). Taking the case of two microphones as an example, the two microphones 101-1 and 101-2 are placed about 10 cm apart, as shown in FIG. 4, and arranged so that the target sound source is equidistant from both. Each of the microphones 101-1 and 101-2 may be directional or omnidirectional. The sampling frequency of the A/D converter that digitizes the input audio signals is, for example, 11 kHz, but another frequency may be used.
[0046]
Next, the cross-correlation coefficient ρ is calculated by Expression (1-1) or Expression (1-2). At this time, in consideration of the time change of the cross-correlation coefficient ρ, the cross-correlation coefficient ρ is determined at an appropriate time interval, for example, every N = 128 points, and the time average is calculated, for example, at L points before and after the target time point. When Equation (1-1) is applied to the waveform at a total of 2L points, the equation for calculating the cross-correlation coefficient ρ is as follows.
[0047]
(Equation 3)

[0048]
Here, k is the number of the cross-correlation coefficient, and one value of ρ is obtained for every N samples of the input voice signal waveform.
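The framewise calculation described above (one value of ρ every N samples, averaged over L samples on either side of the target point) can be sketched as follows. The equation image is not reproduced in this text, so the window placement, the truncation at the signal boundaries, and the default for `half_window` are assumptions; the source gives N = 128 only as an example and does not state L.

```python
import math


def framewise_coefficients(x, y, hop=128, half_window=128):
    """Return rho(k) for window centres k*hop, each averaged over up to 2*half_window samples."""
    n = min(len(x), len(y))
    coeffs = []
    for centre in range(0, n, hop):
        lo = max(0, centre - half_window)   # truncate at the start of the signal
        hi = min(n, centre + half_window)   # truncate at the end of the signal
        sxy = sum(x[i] * y[i] for i in range(lo, hi))
        sxx = sum(x[i] * x[i] for i in range(lo, hi))
        syy = sum(y[i] * y[i] for i in range(lo, hi))
        den = math.sqrt(sxx * syy)
        coeffs.append(sxy / den if den > 0.0 else 0.0)
    return coeffs
```

Because the normalization uses the same window for numerator and denominator, the common 1/(2L) factor of the time averages cancels and plain sums can be used.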
[0049]
Similarly, when the equation (1-2) is used, the correlation coefficient ρ is obtained by the following equation.
[0050]
(Equation 4)

[0051]
Here, K = M (M-1) / 2.
[0052]
Next, the signal integration unit 103 integrates the input audio signals of the plurality of channels into one channel. The processing of the signal integration unit 103 may be, for example, simple addition, or may be processing by an adaptive beamformer 106 that has a noise suppression function and operates in the time domain, as shown in FIG. 5. When the signal integration unit 103 performs simple addition, the integrated audio signal z(n) is obtained as in the following equation.
[0053]
(Equation 5)

[0054]
When an adaptive beamformer 106, for example a two-channel Griffiths-Jim beamformer using the well-known LMS adaptive filter, is used as the signal integration unit 103 as shown in FIG. 5, the integrated audio signal z(n) is obtained as in the following equations.
[0055]
(Equation 6)

[0056]
Here, U(n) is a vector of T values of the difference between the input audio signals x and y, W(n) = [w1(n), w2(n), ..., wT(n)] is the coefficient vector of the LMS adaptive filter after n updates, d(n) is the sum signal of the input audio signals x and y, and (•) denotes the inner product. D is the delay amount; for example, T/2 is used. μ is the step size; for example, 0.1 may be used. Extension to M channels is straightforward, and a method of obtaining an audio signal integrated into one channel using M-1 adaptive beamformers is described in, for example, Non-Patent Document 5: "Adaptive Filter Theory", S. Haykin, Prentice Hall; a detailed description is omitted here.
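The equation image for the beamformer is not reproduced in this text. As a sketch, the standard two-channel Griffiths-Jim structure with a plain LMS update, z(n) = d(n-D) - W(n)•U(n) and W(n+1) = W(n) + μ z(n) U(n), matches the symbols defined above; whether the original uses exactly this (for instance, an unnormalized update) is an assumption. The function name and buffer layout are illustrative.

```python
def griffiths_jim(x, y, taps=8, mu=0.1):
    """Two-channel Griffiths-Jim beamformer with an LMS noise canceller (sketch)."""
    delay = taps // 2                  # D = T/2, as suggested in the text
    w = [0.0] * taps                   # W(n): adaptive filter coefficients
    u_hist = [0.0] * taps              # U(n): last T difference samples, newest first
    d_hist = [0.0] * (delay + 1)       # recent sum samples, newest first
    out = []
    for xn, yn in zip(x, y):
        u_hist = [xn - yn] + u_hist[:-1]
        d_hist = [xn + yn] + d_hist[:-1]
        # z(n) = d(n - D) - W(n) . U(n)
        z = d_hist[delay] - sum(wi * ui for wi, ui in zip(w, u_hist))
        # LMS update: W(n+1) = W(n) + mu * z(n) * U(n)
        w = [wi + mu * z * ui for wi, ui in zip(w, u_hist)]
        out.append(z)
    return out
```

When the target source is equidistant from both microphones, the difference signal contains no target component, so the canceller removes only what leaks through the difference path. A plain LMS update with a fixed μ can diverge for large input amplitudes; scaling the input or normalizing the step size is a common remedy.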
[0057]
Finally, the output audio signal 105 is output by adjusting the magnitude of the integrated audio signal z (n) by multiplying the integrated audio signal z (n) by a gain based on the cross-correlation coefficient ρ. The processes of steps S11 to S14 are repeated each time a digitized audio signal is input in units of frames in step S11.
[0058]
As described above, according to the present embodiment, the magnitude of the integrated audio signal, in which the input audio signals of a plurality of channels are integrated into one channel, is adjusted according to the cross-correlation coefficient between the input audio signals of the channels. This suppresses noise with low correlation and yields an output audio signal in which the target sound component, which has high correlation, is emphasized.
[0059]
(Second embodiment)
FIG. 6 shows the configuration of an audio signal processing device according to the second embodiment of the present invention. In the present embodiment, processing equivalent to the time-domain audio signal processing described in the first embodiment is realized in the frequency domain. In FIG. 6, the input audio signals from the microphones 101-1 to 101-M are converted into digital signals by an A/D converter (not shown) and then frequency-analyzed by the frequency analysis unit 201 to generate spectral information representing their frequency components. The frequency analysis unit 201 is realized by, for example, the well-known FFT (Fast Fourier Transform) or DFT (Discrete Fourier Transform), or by a filter bank in which a plurality of band filters with different pass bands are arranged in parallel. The spectral information output from the frequency analysis unit 201 is input to the cross-correlation coefficient calculation unit 202 and the signal integration unit 203.
[0060]
The cross-correlation coefficient calculation unit 202 calculates a cross-correlation coefficient between the spectral information of the M channels, that is, a cross-correlation coefficient in the frequency domain. In other words, in the present embodiment the cross-correlation coefficient between the channels of the M-channel input audio signal is obtained from the spectral information. The signal integration unit 203 integrates the spectral information of the M channels into one channel. As described for the first embodiment, the processing of the signal integration unit 203 may be, for example, simple addition, or may be processing by a Griffiths-Jim adaptive beamformer using an adaptive filter that operates in the frequency domain. The signal output from the signal integration unit 203 is called the integrated spectrum signal.
[0061]
The integrated spectrum signal output from signal integration section 203 is input to gain control section 204 whose gain is controlled in accordance with the cross-correlation coefficient, and its magnitude is adjusted. As a result, the spectrum signal 205 in which the target sound component is emphasized is output from the gain control unit 204. As in the first embodiment, the cross-correlation coefficient in the frequency domain obtained by the cross-correlation coefficient calculation unit 202 also takes a value close to 1 when the target sound is present, and a value close to 0 when only the noise is present. Therefore, to use for emphasizing the target sound, the gain given to the integrated spectrum signal may be controlled according to the magnitude of the cross-correlation coefficient.
[0062]
The spectrum signal 205, in which the target sound component has been emphasized, is subjected as necessary to the inverse of the transform performed by the frequency analysis unit 201, that is, a conversion from the frequency domain to the time domain, by the inverse transform unit 206. This generates the output audio signal 207 in which the target sound component is emphasized. When the frequency analysis unit 201 is, for example, an FFT, the inverse transform unit 206 is implemented by the corresponding inverse FFT.
[0063]
The cross-correlation coefficient calculation unit 202 calculates ρ represented by the following equation as a cross-correlation coefficient in the frequency domain when the input audio signal has two channels of x (n) and y (n).
[0064]
(Equation 7)

[0065]
Here, Wxy(f) is the cross spectrum between the input audio signals x(n) and y(n), Wxx(f) and Wyy(f) are the power spectra of x(n) and y(n), respectively, and L is the number of frequency components in the discrete Fourier transform (DFT).
[0066]
As is well known, with X(f) denoting the discrete Fourier transform of x(n) and Y(f) that of y(n), the cross spectrum and the power spectra can be calculated as follows.
(Equation 8)

Here, an overlined value is a time average and * denotes the complex conjugate. For example, a DFT length of 256 points can be used, in which case L = 256. Alternatively, L = 128 may be used; the resulting cross-correlation coefficient is then complex, and an equivalent result is obtained by taking its real part.
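The frequency-domain calculation can be sketched as follows, assuming the conventional definitions Wxy(f) = avg[X(f)Y*(f)], Wxx(f) = avg[|X(f)|²], Wyy(f) = avg[|Y(f)|²] and a coefficient equal to the sum of Wxy(f) over all bins divided by the square root of the product of the summed power spectra. The equation images are not reproduced in this text, so this form, the naive DFT helper, and the function names are assumptions.

```python
import cmath
import math


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def averaged_spectra(frames_x, frames_y):
    """Average X(f)Y*(f), |X(f)|^2 and |Y(f)|^2 over the given frames."""
    count, size = len(frames_x), len(frames_x[0])
    wxy, wxx, wyy = [0j] * size, [0.0] * size, [0.0] * size
    for fx, fy in zip(frames_x, frames_y):
        big_x, big_y = dft(fx), dft(fy)
        for f in range(size):
            wxy[f] += big_x[f] * big_y[f].conjugate() / count
            wxx[f] += abs(big_x[f]) ** 2 / count
            wyy[f] += abs(big_y[f]) ** 2 / count
    return wxy, wxx, wyy


def spectral_coefficient(wxy, wxx, wyy):
    """Real part of sum(Wxy) normalized by sqrt(sum(Wxx) * sum(Wyy))."""
    den = math.sqrt(sum(wxx) * sum(wyy))
    return sum(wxy).real / den if den > 0.0 else 0.0
```

Summed over all DFT bins of a single frame, this value equals the time-domain coefficient of that frame by Parseval's relation, which is one way to check an implementation.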
[0067]
Similarly, when the input audio signal has M channels (not limited to two channels), the cross-correlation coefficient ρ is calculated by, for example, the following equation.
(Equation 9)

[0068]
Here, Wij(f) is the cross spectrum between the input audio signals xi(n) and xj(n), and Wii(f) and Wjj(f) are the power spectra of xi(n) and xj(n), respectively.
[0069]
After the frequency analysis unit 201 converts the input audio signals of the plurality of channels obtained from the microphones 101-1 to 101-M into spectral information, the cross-correlation coefficient calculation unit 202 calculates the cross-correlation coefficient ρ between the channels according to equation (2-1) or (2-2).
[0070]
Meanwhile, the spectral information of the plurality of channels obtained by the frequency analysis unit 201 is integrated into one channel by the signal integration unit 203 to obtain the integrated spectrum signal Z(f). When simple addition is used in the signal integration unit 203, the integrated spectrum signal Z(f) is obtained as in the following equation.
(Equation 10)

[0071]
When an adaptive beamformer is used, for example the well-known two-channel Griffiths-Jim beamformer, the integrated spectrum signal Z(f) is obtained as in the following equations.
(Equation 11)

[0072]
Here, k is a frame number, U is a difference spectrum between channels, D is an addition spectrum, Z is an output spectrum, W is a complex filter coefficient, μ is a step size, and (*) is a complex conjugate.
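The equation image for the frequency-domain beamformer is not reproduced in this text. A per-bin complex LMS version of the Griffiths-Jim structure, Z = D - W*U with the update W ← W + μ U Z*, is consistent with the symbols listed above; this specific form is an assumption, and the function name is hypothetical.

```python
def gj_frequency_step(x_spec, y_spec, w, mu=0.1):
    """Process one frame; returns (output spectrum Z, updated coefficients W)."""
    z_out, w_next = [], []
    for xf, yf, wf in zip(x_spec, y_spec, w):
        u = xf - yf                      # difference spectrum U between channels
        d = xf + yf                      # addition spectrum D
        z = d - wf.conjugate() * u       # output spectrum Z
        # Complex LMS update with step size mu (unnormalized: an assumption).
        w_next.append(wf + mu * u * z.conjugate())
        z_out.append(z)
    return z_out, w_next
```

Each frequency bin is adapted independently with a single complex coefficient. If both channels carry the same spectrum, U is zero and the output is simply the sum spectrum; a stationary component that appears in both U and D is progressively cancelled from Z as W converges.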
[0073]
Next, the gain of the gain control unit 204 is controlled based on the cross-correlation coefficient ρ, and the gain control unit 204 adjusts the magnitude (amplitude) of the integrated spectrum signal from the signal integration unit 203 to generate the spectrum signal 205 in which the target sound component is emphasized. Gain control in the gain control unit 204 can be performed by, for example, multiplying the amplitude of the integrated spectrum signal by the cross-correlation coefficient ρ; as in the first embodiment, a function such as those shown in FIG. 2 may also be used. The cross-correlation coefficient ρ may be negative, in which case its absolute value may be taken, or it may be set to 0, before it is used for gain control.
[0074]
FIG. 7 shows the processing flow in the present embodiment. The flow is basically the same as that of the first embodiment except that a frequency analysis step S22 is added after the audio signal input step S21. That is, after frequency analysis (for example, FFT) in step S22, calculation of the cross-correlation coefficient (step S23), integration of the spectral information (step S24), and gain control of the integrated spectrum signal using the cross-correlation coefficient (step S25) are performed in sequence to generate a spectrum signal in which the target sound component is emphasized. Finally, if necessary, an inverse transform (for example, inverse FFT) is performed in step S26 to obtain an output audio signal in which the target sound component is emphasized. The processing of steps S21 to S26 is repeated each time a digitized audio signal is input in units of frames in step S21.
[0075]
As described above, according to the present embodiment, noise with low correlation is suppressed, and a spectrum signal or an output audio signal in which the target sound component, which has high correlation, is emphasized can be obtained. Performing the processing in the frequency domain has the advantage that the amount of calculation can be reduced compared with the first embodiment, in which the calculation of the cross-correlation coefficient and the signal integration are performed in the time domain.
[0076]
(Third embodiment)
FIG. 8 shows a configuration of an audio signal processing device according to the third embodiment of the present invention. The present embodiment provides a method of calculating the activity of a target signal (target sound signal) using a weighted cross-correlation coefficient. The target signal activity calculated in this manner is effectively used, for example, for detection of a target sound and enhancement of the target sound.
[0077]
In the present embodiment, as in the second embodiment, the input audio signals of a plurality of channels from the microphones 101-1 to 101-M are first converted by the frequency analysis unit 201 into frequency-domain signals, that is, spectral information consisting of a plurality of frequency components, and then input to the target signal activity calculation unit 300. The target signal activity calculation unit 300 includes a cross power spectrum calculation unit 301, a coherence function calculation unit 302, a power information calculation unit 303, a modified spectrum calculation unit 304, and a weighted cross-correlation coefficient calculation unit 305.
[0078]
The cross power spectrum calculation unit 301 calculates the power spectrum of each channel and the cross spectrum between channels from the frequency components of the plurality of channels. The coherence function calculation unit 302 calculates a coherence function from the power spectrum and the cross spectrum. The power information calculation unit 303 calculates power information on the signal power between channels of the input audio signal from the power spectrum. The modified spectrum calculation unit 304 corrects the power spectrum and the cross spectrum using the coherence function and the power information. The weighted cross-correlation coefficient calculation unit 305 calculates, as the target signal activity, a cross-correlation coefficient weighted according to the spectra corrected by the modified spectrum calculation unit 304.
[0079]
Next, the flow of processing in this embodiment will be described with reference to FIG. The steps from the audio signal input step S31 to the frequency analysis step S32 are the same as in the second embodiment, and the input audio signals of a plurality of channels are converted into frequency domain signals (spectral information) in frame units.
[0080]
Next, the power spectrum of each channel and the cross spectrum between channels are calculated from the spectral information obtained by the frequency analysis (step S33). A coherence function and power information are then calculated from the power spectrum and the cross spectrum between channels (steps S34 to S35), and spectra corrected based on the coherence function and the power information are calculated (step S36). A weighted cross-correlation coefficient is calculated from the corrected spectra and output as the target signal activity (step S37). The processing of steps S31 to S37 is repeated each time a digitized audio signal is input in units of frames in step S31.
[0081]
The present embodiment is characterized in that the cross-correlation coefficient is modified so as to improve its robustness to noise. The ordinary cross-correlation coefficient performs well in target sound detection when the noise is uncorrelated between channels, but its ability to distinguish the arrival of noise that is correlated between channels from the arrival of the target sound is low. According to the present embodiment, the ability to distinguish the target sound from noise can be greatly improved even when correlated noise arrives.
[0082]
Harsh, large-amplitude noise usually has a high correlation between channels, so the method described in the present embodiment is well suited to suppressing it. The target signal activity that is output is a measure of whether the target sound is present in the input audio, and it is an essential element for the voice detection and voice enhancement in the embodiments that follow.
[0083]
Next, specific calculation methods in the cross power spectrum calculation unit 301, the coherence function calculation unit 302, the power information calculation unit 303, the modified spectrum calculation unit 304, and the weighted cross-correlation coefficient calculation unit 305 will be described. First, the cross power spectrum calculation unit 301 calculates the cross spectrum between channels and the power spectrum of each channel according to equation (2-2). Next, when the input audio signal has two channels x and y, the coherence function calculation unit 302 calculates the coherence function γ(f) according to the following equation.
(Equation 12)

[0084]
Here, Wxy (f) is a cross spectrum between two channels, and Wxx (f) and Wyy (f) are power spectra of each channel.
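The equation image is not reproduced in this text. Since the description later squares γ(f) to obtain the squared coherence function, the sketch below assumes the magnitude coherence γ(f) = |Wxy(f)| / sqrt(Wxx(f) Wyy(f)); the function name and the zero-power guard are illustrative assumptions.

```python
import math


def coherence(wxy, wxx, wyy):
    """Magnitude coherence per frequency bin from averaged cross and power spectra."""
    gamma = []
    for cross, px, py in zip(wxy, wxx, wyy):
        den = math.sqrt(px * py)
        # Guard (assumption): bins without power in either channel get coherence 0.
        gamma.append(abs(cross) / den if den > 0.0 else 0.0)
    return gamma
```

The value lies between 0 and 1 only when Wxy, Wxx and Wyy are averages over several frames; computed from a single frame without averaging, it is identically 1 and carries no information.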
[0085]
When the input audio signal has M channels (not limited to two channels), the coherence function γij (f) between the i-th channel and the j-th channel is similarly calculated according to the following equation.
(Equation 13)

[0086]
Here, Wij (f) is a cross spectrum between the i-th channel and the j-th channel, and Wii (f) and Wjj (f) are power spectra of the i-th channel and the j-th channel.
[0087]
The total coherence function γm (f) in the case of the M channel is calculated, for example, by the following equation.
(Equation 14)

[0088]
The power information calculation unit 303 calculates power information p (f) according to the following equation when the input audio signal has two channels of x and y.
(Equation 15)

[0089]
Here, min [a, b] means selecting the smaller one of a and b, and max [a, b] means selecting the larger one of a and b.
[0090]
On the other hand, when the input audio signal has M channels (not limited to two channels), pij (f) of the power information between the i-th channel and the j-th channel is calculated according to the following equation.
(Equation 16)

[0091]
For the power information p(f) and pij(f) calculated in this way, the sensitivity to the actual power ratio between channels can also be adjusted by applying an appropriate function, as in the following equations.
(Equation 17)

[0092]
Here, pow{a, b} is the power function representing a raised to the power of b. When β = 1, equations (3-6) and (3-7) are the same as equations (3-4) and (3-5), respectively. The sensitivity can be increased by adjusting β.
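The equation images for the power information are not reproduced in this text. Given the min[a, b] and max[a, b] definitions, a natural reading is the ratio of the smaller to the larger channel power, p(f) = min[Wxx(f), Wyy(f)] / max[Wxx(f), Wyy(f)], optionally raised to the power β. The sketch below assumes that form; the function name is hypothetical.

```python
def power_information(wxx, wyy, beta=1.0):
    """Per-bin ratio of the smaller to the larger channel power, raised to beta."""
    info = []
    for px, py in zip(wxx, wyy):
        larger = max(px, py)
        ratio = min(px, py) / larger if larger > 0.0 else 0.0
        info.append(pow(ratio, beta))   # beta = 1 leaves the plain ratio unchanged
    return info
```

Under this assumption the ratio lies between 0 and 1, so an exponent β greater than 1 pushes unequal channel powers further toward 0, while equal powers always give 1.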
[0093]
When the input audio signal has two channels, the modified spectrum calculation unit 304 calculates the corrected cross spectrum and power spectrum from the power spectrum of each channel and the cross spectrum between channels, using the squared coherence function γ²(f), obtained by squaring the previously calculated coherence function γ(f), and the power information p(f). The weighted cross-correlation coefficient calculation unit 305 then calculates a weighted cross-correlation coefficient ρ (the target signal activity), weighted according to the corrected cross spectrum and power spectrum.
[0094]
The calculations in the modified spectrum calculator 304 and the weighted cross-correlation coefficient calculator 305 are represented by the following equations.
(Equation 18)

[0095]
Here, Ψa (f) and Ψb (f) are weighting functions used for the denominator and the numerator, respectively, of the equation (3-10) for calculating the cross-correlation coefficient. Wxy (f) Ψb (f) is the corrected cross spectrum, and Wxx (f) Ψa (f) and Wyy (f) Ψa (f) are the corrected power spectra.
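The role of the two weighting functions can be sketched as follows. The sketch assumes that the weighted cross-correlation coefficient is a normalized sum over frequency, with the corrected cross spectrum Wxy (f) Ψb (f) in the numerator and the corrected power spectra Wxx (f) Ψa (f) and Wyy (f) Ψa (f) in the denominator. The exact form of equation (3-10) is not reproduced in this text, so this is an assumed structure, and the weights are simply passed in as arguments.

```python
import math

def weighted_cross_correlation(w_xy, w_xx, w_yy, psi_a, psi_b, eps=1e-12):
    """Assumed form:
        rho = sum_f Wxy(f) Psi_b(f)
              / sqrt( sum_f Wxx(f) Psi_a(f) * sum_f Wyy(f) Psi_a(f) )

    w_xy: cross spectrum (complex), w_xx / w_yy: power spectra (real),
    psi_a / psi_b: real, non-negative weights per frequency bin.
    """
    num = sum(w * p for w, p in zip(w_xy, psi_b))
    den_x = sum(w * p for w, p in zip(w_xx, psi_a))
    den_y = sum(w * p for w, p in zip(w_yy, psi_a))
    return num / (math.sqrt(den_x * den_y) + eps)
```

With unit weights and identical channels the result is close to 1. Reducing Ψb (f) in bins where the coherence or the power ratio is low lowers the numerator and therefore the resulting activity.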
[0096]
In addition to the weighting function of the equation (3-8) or (3-9) using the coherence function, it is also possible to use the weight 1 / | Wxy (f) | or the following weight.
[Equation 19]

In terms of performance, however, it is desirable to use the weight function of Expression (3-8) or (3-9).
[0097]
On the other hand, when the input audio signal has M channels (not limited to two channels), the corrected cross spectrum and power spectra are similarly calculated by applying, to the power spectrum of each channel and the cross spectrum between the channels, the squared coherence function γij² (f), which is the square of the coherence function γij (f) between the i-th channel and the j-th channel, and the power information pij (f).
[0098]
Further, the weighted cross-correlation coefficient calculation unit 305 calculates a weighted cross-correlation coefficient ρ (target signal activity) weighted according to the corrected cross spectrum and power spectrum. In this case, the calculations performed by the modified spectrum calculator 304 and the weighted cross-correlation coefficient calculator 305 are represented by the following equations.
(Equation 20)

[0099]
Here, Ψaij (f) and Ψbij (f) are weighting functions used for the denominator and the numerator of the equation (3-13) for calculating the cross-correlation coefficient, and i and j represent channel numbers. pij (f) is the power information of the equation (3-5) or (3-7). In addition, K = M (M-1) / 2.
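The constant K can be read as the number of unordered channel pairs (i, j) among M channels, assuming the intended expression is M (M-1) / 2. The short sketch below enumerates those pairs; the function name is hypothetical.

```python
from itertools import combinations

def channel_pairs(m):
    """Return all unordered channel pairs (i, j) with 1 <= i < j <= m."""
    return list(combinations(range(1, m + 1), 2))

# For M = 4 channels there are 4 * 3 / 2 = 6 pairs.
pairs = channel_pairs(4)
```

An M-channel calculation would then sum or average the per-pair quantities over this list.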
[0100]
Ψa (f) is known as a weight function used for maximum likelihood estimation of a generalized cross-correlation function, and has the effect of suppressing the influence of noise that is uncorrelated between channels. In this regard, see, for example, Non-Patent Document 6: C. H. Knapp and G. C. Carter, “The Generalized Correlation Method for Estimation of Time Delay,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, no. 4, pp. 320-327 (1976). Non-Patent Document 6 discloses a method for obtaining a cross-correlation function, and does not mention a cross-correlation coefficient.
In contrast, the present embodiment differs greatly in that the weighted cross-correlation coefficient uses Ψb (f), which is obtained by further weighting the above-described weighting function Ψa (f) according to the inter-channel power ratio of equation (3-6) or (3-7).
[0101]
In the above processing, not only noise that is uncorrelated between channels but also correlated noise arriving from directions other than the target direction is effectively suppressed, so the obtained weighted cross-correlation coefficient accurately reflects the degree to which the target signal is present. Therefore, the value of the weighted cross-correlation coefficient can be used as the target signal activity. This target signal activity can be used as a key component to improve performance in various applications such as voice detection and voice enhancement.
[0102]
In the measurement of the target signal activity in the present embodiment, the activity may be calculated and output separately for each band. For example, points 1 to 128 of the DFT are divided at equal intervals on the frequency axis into eight bands of 128/8 = 16 points each, and eight target signal activities are output. The method of division can be changed as needed. The same applies to the following embodiments.
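The equal-width band division in this example can be sketched as follows. The function name and the 1-based inclusive index convention are assumptions chosen to match the "points 1 to 128" wording.

```python
def split_into_bands(n_points=128, n_bands=8):
    """Split DFT points 1..n_points into n_bands equal-width bands.

    Returns a list of (start, end) pairs, 1-based and inclusive.
    Assumes n_points is divisible by n_bands.
    """
    width = n_points // n_bands
    return [(b * width + 1, (b + 1) * width) for b in range(n_bands)]

bands = split_into_bands()
```

One target signal activity would then be computed from the bins of each (start, end) range, giving eight values per frame in this example.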
[0103]
In the above description, the target signal activity is calculated using both the coherence function and the power information. However, a certain degree of effect is obtained even if the target signal activity is calculated using only the coherence function without using the power information. In that case, the power information p (f) or pij (f) calculated by the equations (3-4) to (3-7) may be set to 1.
[0104]
(Fourth embodiment)
FIG. 10 shows a configuration of an audio signal processing device according to the fourth embodiment of the present invention. In the present embodiment, the third embodiment is applied to voice detection, and threshold processing is performed on the target signal activity to detect a target sound component from an input voice signal.
[0105]
The input audio signals from the plurality of microphones 101-1 to 101-M are converted by the frequency analysis unit 201 into frequency domain signals, that is, spectrum information consisting of the frequency components of a plurality of channels, and are then input to the target signal activity calculation unit 300. The configuration of the target signal activity calculator 300 is as described in the third embodiment.
[0106]
The target signal activity signal 306 output from the target signal activity calculation unit 300 is input to the detection processing unit 401, where threshold processing is performed, and a target sound detection status signal 402 indicating whether or not the target sound is present in the input audio signal is output. Specifically, the detection processing unit 401 outputs “1” as the target sound detection status signal 402 when it determines that the component of the target sound is present in the input audio signal, and outputs “0” when it determines that it is not present.
[0107]
The flow of processing in the present embodiment will be described with reference to FIG. 11. First, the input audio signal input in step S41 is subjected to frequency analysis (step S42), and the target signal activity is calculated from the obtained spectrum information by the procedure described in the third embodiment (step S43). Finally, threshold processing is performed on the target signal activity using a threshold predetermined according to the purpose, thereby detecting whether or not the target sound component exists in the input audio signal (step S44). The processes in steps S41 to S44 are repeated each time a digitized audio signal is input in units of frames in step S41.
[0108]
Next, a procedure of threshold processing in the detection processing unit 401 will be described with reference to FIG. Here, an example will be described in which a threshold value for detection is set from the bias and variance of the target signal activity in a section where there is no target sound.
First, initialization is performed (step S400), and then input of an audio signal (step S401), frequency analysis (step S402), and calculation of target signal activity (step S403) are sequentially performed for each frame.
[0109]
Assuming that the target signal activity of the k-th frame is ρ (k), the bias and variance of ρ (k) in a section having no target sound (referred to as a silent section) are estimated. A provisional determination as to whether or not the section is a silent section is made by comparing | ρ (k) −b (k−1) | with κ (step S404). Here, b (k) is an estimated value of the bias of ρ (k), and κ is a threshold for determination.
[0110]
Here, if | ρ (k) −b (k−1) | <κ, it is determined that there is a high possibility of silence, and the estimated values of the bias b (k) and the variance v (k) are updated as follows (step S405).
(Equation 21)

[0111]
On the other hand, when | ρ (k) −b (k−1) |> κ, it is determined that there is a high possibility that the target sound exists, and the estimated values of the bias b (k) and the variance v (k) are not updated (step S406).
(Equation 22)

[0112]
Next, a threshold value h (k) for detection is set by the following equation (step S407).
[Equation 23]

[0113]
Here, ξ is a constant for setting the detection threshold h (k). If h (k) <ρ (k), “1” is output as the target sound detection status signal; otherwise, “0” is output (step S408).
Examples of the values of κ, η, η ′, and ξ required for the initial setting are shown in the box for the initial setting step S400.
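The threshold procedure of steps S400 to S408 can be sketched as below. The update and threshold equations are not reproduced in this text, so the sketch assumes exponential smoothing with forgetting factors η and η′ for the bias and variance, and a threshold of the form h (k) = b (k) + ξ·sqrt (v (k)). These forms, the class name, the treatment of ρ (k) as a real value, and all default parameter values are illustrative assumptions rather than the values given in the drawing.

```python
import math

class TargetSoundDetector:
    """Frame-by-frame detector with an adaptive threshold (assumed forms)."""

    def __init__(self, kappa=0.2, eta=0.9, eta_v=0.9, xi=3.0, b0=0.0, v0=0.01):
        self.kappa = kappa   # provisional silence threshold (kappa)
        self.eta = eta       # smoothing factor for the bias (eta), assumed
        self.eta_v = eta_v   # smoothing factor for the variance (eta'), assumed
        self.xi = xi         # threshold constant (xi)
        self.b = b0          # bias estimate b(k)
        self.v = v0          # variance estimate v(k)

    def step(self, rho):
        """Process one frame; rho is the (real-valued) target signal activity."""
        if abs(rho - self.b) < self.kappa:
            # Probably silence: update bias and variance (step S405).
            self.b = self.eta * self.b + (1.0 - self.eta) * rho
            self.v = self.eta_v * self.v + (1.0 - self.eta_v) * (rho - self.b) ** 2
        # Otherwise keep the previous estimates (step S406).
        h = self.b + self.xi * math.sqrt(self.v)   # threshold (step S407), assumed form
        return 1 if h < rho else 0                 # detection status (step S408)
```

Feeding the detector a sequence of low activities followed by one high value yields “0” for the low frames and “1” for the high frame, while the bias and variance estimates are frozen during the high frame.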
[0114]
FIG. 13 shows a specific example of the detection processing. The time series of the detection status signal shown in FIG. 13B is output from the curve ρ shown in FIG. 13. As described in the third embodiment, the calculation of the target signal activity suppresses both noise having no correlation between the channels and noise that is correlated but arrives from a direction different from that of the target sound, so the activity responds accurately to the target sound. Therefore, when the calculated target signal activity is used as a parameter for voice detection as in the present embodiment, high detection performance can be achieved.
[0115]
(Fifth embodiment)
FIG. 14 shows a configuration of an audio signal processing device according to the fifth embodiment of the present invention. This embodiment is obtained by applying the third embodiment to speech enhancement. The input audio signals from the plurality of microphones 101-1 to 101-M are converted by the frequency analysis unit 201 into frequency domain signals, that is, spectrum information consisting of the frequency components of a plurality of channels, and are then input to the target signal activity calculation unit 300. The configuration of the target signal activity calculator 300 is as described in the third embodiment.
[0116]
On the other hand, as in the second embodiment, the spectrum information from the frequency analysis unit 201 is also input to the signal integration unit 203, where the channels are integrated into one channel to generate an integrated spectrum signal. The integrated spectrum signal output from the signal integration unit 203 is input to the gain control unit 501, whose gain is controlled in accordance with the target signal activity signal (cross-correlation coefficient) 306 output from the target signal activity calculation unit 300, and its magnitude is adjusted. As a result, a spectrum signal 502 in which the target sound component is emphasized is output from the gain control unit 501.
[0117]
The spectrum signal 502 in which the target sound component is emphasized is subjected, as necessary, to the inverse of the conversion performed by the frequency analysis unit 201, that is, conversion from the frequency domain to the time domain, by the inverse conversion unit 503, and an output audio signal 504 in which the component of the target sound is emphasized is generated. The inverse conversion unit 503 is realized by an inverse FFT when the frequency analysis unit 201 is, for example, an FFT.
As described above, the audio signal processing apparatus according to the present embodiment has a configuration in which the cross-correlation coefficient calculator 202 in the second embodiment shown in FIG. 6 is replaced with the target signal activity calculator 300, which calculates the weighted cross-correlation coefficient.
[0118]
Next, the flow of processing in this embodiment will be described with reference to FIG. 11. First, the processing from step S51 to step S53 is the same as the processing from step S41 to step S43 shown in FIG. 11 and described in the fourth embodiment. After the frequency analysis in step S52, in parallel with the calculation of the target signal activity in step S53, the spectrum information of the plurality of channels is integrated into one channel to generate an integrated spectrum signal (step S54).
Next, gain control is performed on the integrated spectrum signal in accordance with the target signal activity obtained in step S53 to adjust its amplitude, thereby generating a spectrum signal in which the target sound component is emphasized (step S55). Finally, if necessary, inverse conversion (for example, inverse FFT) is performed in step S56 to obtain an output audio signal in which the target sound component is emphasized. The processes of steps S51 to S56 are repeated each time a digitized audio signal is input in units of frames in step S51.
[0119]
According to the present embodiment, as described in the third embodiment, the target signal activity accurately reflects whether or not the target sound is present in the input sound. By using the target signal activity to emphasize the target sound, very high-performance processing can be realized in various noise environments.
[0120]
In the third embodiment, it has been described that the target signal activity may be obtained separately for a plurality of frequency bands. In the gain control process of the present embodiment, it is also possible to control the gain for each band using the target signal activity obtained for each of such frequency bands. That is, when, for example, an L-point DFT is used to calculate the spectrum information and the number of band divisions is B, the target signal activity of each band is calculated from N = L / 2 / B points as follows.
[0121]
(Equation 24)

[0122]
Here, ρ (b) is the target signal activity for the band number b, and the range of the frequency components used in the calculation for the band b runs from s (b) to e (b). These values are, for example, as follows.
(Equation 25)

[0123]
These values are derived from the general numbering convention of the DFT, in which the frequency component numbers f = 2 to L / 2 correspond to positive frequencies and f = L / 2 + 1 to L correspond to negative frequencies. Here, f = 1 corresponds to the direct current component, which may be set to 0 for a general waveform signal, and is therefore excluded from the above calculation formula. The component of f = L / 2 is at the upper limit of the usable frequency and its magnitude is also close to 0, so it is also excluded. Of course, there is no problem even if these components are included in the calculation.
[0124]
Using the target signal activity ρ (b) obtained in this way, gain control for the integrated signal can be performed as follows.
(Equation 26)

[0125]
The absolute value of the target signal activity ρ (b) may be used as in the above equation. Alternatively, the real part of ρ (b) may be taken and set to 0 when it is negative, as in the following equation.
[0126]
[Equation 27]

[0127]
With the above method, gain control when emphasizing the component of the target sound can be performed for each band. This makes it possible to suppress only a certain band when noise is present in a certain band, so that the performance of target sound component emphasis can be improved.
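The per-band gain control described above can be sketched as follows. The sketch assumes that the gain for band b is either |ρ (b)| or the real part of ρ (b) clipped at 0, and that this gain multiplies every bin of the integrated spectrum in that band. The function names and the 0-based inclusive bin ranges are assumptions for illustration, and bins outside every band (such as the direct current component) are left unchanged here.

```python
def band_gain(rho_b, use_real_part=False):
    """Gain for one band from its target signal activity rho_b."""
    if use_real_part:
        return max(rho_b.real, 0.0)   # real part, clipped to 0 when negative
    return abs(rho_b)                 # absolute value

def apply_band_gains(z, rho, bands, use_real_part=False):
    """Scale the integrated spectrum z band by band.

    z: list of complex bins; rho: one activity per band;
    bands: list of (start, end) bin indices, 0-based and inclusive.
    """
    out = list(z)
    for rho_b, (start, end) in zip(rho, bands):
        g = band_gain(rho_b, use_real_part)
        for f in range(start, end + 1):
            out[f] = g * z[f]
    return out
```

A band with low or negative activity is attenuated while the other bands pass with their own gains, which is how noise confined to one band can be suppressed without affecting the rest.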
[0128]
(Sixth embodiment)
FIG. 16 shows a configuration of an audio signal processing device according to the sixth embodiment of the present invention. This embodiment has a configuration in which a coherence filter operation unit 601 that performs a filter operation based on coherence and power information is added to the fifth embodiment.
[0129]
Next, the flow of processing in this embodiment will be described with reference to FIG. First, the processing from step S61 to step S64 is the same as the processing from step S51 to step S54 shown in FIG. 11 of the fifth embodiment. In the present embodiment, a filter operation is performed on the integrated spectrum signal obtained in step S54 using a coherence function and power information generated on the assumption of target signal activity calculation in step S64.
[0130]
Gain control according to the target signal activity obtained in step S63 is performed on the integrated spectrum signal that has undergone the coherence filter operation to adjust its amplitude, thereby generating a spectrum signal in which the component of the target sound is emphasized (step S65). Finally, if necessary, inverse transformation (for example, inverse FFT) is performed in step S66 to obtain an output audio signal in which the target sound component is emphasized. The processing of steps S61 to S66 is repeated each time a digitized audio signal is input in units of frames in step S61.
[0131]
Next, the coherence filter operation unit 601 will be described in detail. The coherence filter calculator 601 filters the target spectrum information using the coherence function calculated by the target signal activity calculator 300. The coherence function is calculated using equation (3-1) or equation (3-2). It is even more effective to modify the coherence function, as in the following equations, according to any of the power information of equations (3-4) to (3-7) obtained internally by the target signal activity calculator 300.
[0132]
The modified coherence function γ (f) when the input audio signal has two channels of x (f) and y (f) is expressed by the following equation.
[Equation 28]

[0133]
On the other hand, the modified coherence function γ (f) for the M channel (not limited to two channels) is shown by the following equation.
(Equation 29)

[0134]
Here, i and j are channel numbers, Wij (f) is a cross spectrum between the i-th channel and the j-th channel, and Wii (f) and Wjj (f) are the power spectra of the i-th channel and the j-th channel.
[0135]
The filter operation using the modified coherence function γ (f) shown in Expression (6-1) or Expression (6-2) is performed according to the following expression.
[Equation 30]

[0136]
Here, ZO (f) is an output of the filter operation, and Z (f) is an integrated spectrum signal obtained by the signal integration unit 203.
[0137]
At this time, the filter operation may be performed after correcting the coherence function γ (f) using an appropriate function, for example, as in the following equation.
[Equation 31]

[0138]
Here, pow (a, b) is an exponential function representing a raised to the power b, and for example, α = 2 may be used. In this case, the value of the coherence function γ (f) is emphasized and the noise suppression amount is increased as compared with the equation (6-3) (corresponding to α = 1), but the distortion of the target voice also increases. The value should therefore be set according to the situation.
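The filter operation with the exponent α can be sketched as below, assuming that the output is ZO (f) = pow (γ (f), α) · Z (f), so that α = 1 reduces to plain multiplication by the coherence function. The function name is hypothetical, and γ (f) is taken as a real value between 0 and 1.

```python
def coherence_filter(z, gamma, alpha=1.0):
    """Weight each bin of the integrated spectrum z by gamma(f) ** alpha.

    z: list of complex bins; gamma: list of real coherence values in [0, 1].
    """
    return [(g ** alpha) * zf for zf, g in zip(z, gamma)]
```

For a bin with γ (f) = 0.5, the gain is 0.5 with α = 1 and 0.25 with α = 2, which illustrates the trade-off between stronger suppression and greater distortion.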
[0139]
As described above, according to the present embodiment, when the target sound is emphasized using the target signal activity, the spectrum is also weighted according to the coherence function, which further improves the voice emphasis performance against noise that is uncorrelated between channels.
[0140]
(About placement of microphone)
Next, a preferred method of arranging the microphones described above will be described. The audio signal processing device assumes that, for the target sound, the same component is incident on the plurality of microphones, and that, for noise, components differing in at least one of phase and amplitude are incident. In order to realize such a sound receiving condition of the microphones, it is desirable to arrange the microphones 101-1 to 101-M as described below.
[0141]
In the third embodiment, information on the power ratio between channels is used in the process of calculating the weighted cross-correlation coefficient. High performance is obtained when the microphones 101-1 to 101-M are arranged so that the received power differs between them. Even when omnidirectional microphones are used for all of the microphones 101-1 to 101-M, some performance can be obtained. This is because conditions such as reflection vary depending on the sound receiving position, so the power of the incoming sound may differ even between omnidirectional microphones.
[0142]
However, in order to stably exhibit high performance, it is better to use at least one of the microphones 101-1 to 101-M as a directional microphone. This makes it possible to create a sensitivity difference between channels in directions other than the arrival direction of the target sound, thereby improving noise suppression performance.
[0143]
Here, a case where the number M of microphones is two, that is, two channels, will be described. Two cases are described: as shown in FIG. 18, the case where one of the two microphones is an omnidirectional microphone 701 and the other is a directional microphone 702, and, as shown in FIG. 19, the case where both microphones 711 and 712 are directional microphones. Either can be chosen according to the application. A normal unidirectional microphone is assumed as the directional microphone. If a microphone with sharper directivity than unidirectional is used, the performance may be higher, but the arrangement method is the same as that for a unidirectional microphone.
[0144]
As shown in FIG. 18, when the omnidirectional microphone 701 and the directional microphone 702 are used, the directional microphone 702 is set so that the peak of its directivity (the maximum sensitivity direction) points in the direction of the target sound. An appropriate distance between the microphones 701 and 702 is, for example, about 5 cm to 20 cm. In this arrangement, it is desirable to adjust the sensitivity of the omnidirectional microphone 701 and the sensitivity of the directional microphone 702 in its peak direction to the same level.
[0145]
With such an arrangement, the sensitivity difference between the channels, that is, between the microphones 701 and 702, is very large in the direction of low sensitivity of the directional microphone 702, for example, in the direction 180° opposite to the direction of high sensitivity as shown in FIG., so the amount of suppression of the incoming sound from the direction of low sensitivity becomes very large. At first glance, this seems merely to reflect the original directivity of the directional microphone. However, because the sensitivity to the power ratio between channels can be adjusted by the value of β in equation (3-6) or (3-7), the directivity can be made sharper than the original directivity of the directional microphone 702.
[0146]
That is, for example, by setting β = 2, the weight of the square of the actual power ratio is used for calculating the target signal activity. Although the actual power ratio is 1 in the direction of the target sound, it is 1 or less in directions other than the arrival direction of the target sound. Therefore, by squaring this, the weight for components other than the target sound is further reduced. Therefore, the sensitivity in the horizontal direction between the low sensitivity direction and the target sound direction can be further reduced.
[0147]
On the other hand, when the directional microphones 711 and 712 are used as the two microphones as shown in FIG. 19, the arrangements shown in FIGS. 19 (A1) to (A4), for example, are effective. In these arrangements, the directivity axes of the two microphones 711 and 712 lie in the same plane, and the direction θ of the directivity axis when viewed from above in the drawing is desirably within the range of −90° to 90°. When θ> 0, the directivity axes open outward from the midpoint of the two microphones 711 and 712. The same performance is obtained when θ <0, in which case the directivity axes point toward the midpoint.
[0148]
FIGS. 19 (B1) to (B4) are examples of another preferred arrangement of the two directional microphones 711 and 712. In these arrangements, the directivity axes are not included in the same plane. For the sake of accuracy, FIG. 20 shows a diagram in which the direction of the directivity axis in the arrangement of FIGS. 19 (B1) to (B4) is represented by an azimuth angle θ and an elevation angle φ. Here, assuming that the direction of the directivity axis of the R-channel microphone 712 is (θ, φ), the direction of the directivity axis of the L-channel microphone 711 is desirably (−θ, −φ). That is, the positions and the axial directions of the two microphones are 180° rotationally symmetric. If the number of microphones is M, it is desirable that the microphones be arranged so as to have a rotational symmetry of 360° / M. It is desirable that the ranges of θ and φ be 10° <θ <80° and 10° <φ <80°. After the directions of the directivity axes are set as described above, the two microphones 711 and 712 may be used rotated about the arrival direction of the target sound, since exactly the same characteristics are obtained.
[0149]
In the case of the arrangements of FIGS. 19 (A1) to (A4), the final directivity obtained through the above-described audio signal processing is maximum in the direction of arrival of the target sound. However, the sensitivity is maximum in all directions equidistant from the directional microphones 711 and 712, that is, in directions perpendicular to the straight line connecting the two microphones 711 and 712, so the arrangement also has a certain degree of sensitivity to sound coming from directly above or directly below.
[0150]
On the other hand, in the arrangement of FIGS. 19 (B1) to (B4), the directions in which the phases of the two directional microphones 711 and 712 coincide are the same as in the case of FIGS. 19 (A1) to (A4): the directions equidistant from the microphones 711 and 712, that is, directions included in a plane (plane a in FIG. 21) perpendicular to the straight line connecting the two microphones 711 and 712. The directions of arrival at which the sensitivities of the two microphones 711 and 712 coincide, on the other hand, are included in a plane (plane b in FIG. 21) perpendicular to the difference vector (vector C in FIG. 21) between the two vectors representing the directions of the axes of the microphones 711 and 712, when the two vectors are translated onto one plane.
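The statement about the equal-sensitivity plane can be checked numerically under a simple model. The sketch below assumes that both microphones are identical and that each one's sensitivity depends only on the angle between the arrival direction d and its axis vector u. Equal sensitivity then means d·u_R = d·u_L, that is, d·(u_R − u_L) = 0, which is the plane perpendicular to the difference vector. The coordinate convention (x toward the front, y along the microphone baseline, z upward) and the function names are hypothetical.

```python
import math

def axis_vector(theta_deg, phi_deg):
    """Unit vector for azimuth theta and elevation phi (assumed convention:
    x = front / target direction, y = along the baseline, z = up)."""
    t = math.radians(theta_deg)
    p = math.radians(phi_deg)
    return (math.cos(p) * math.cos(t), math.cos(p) * math.sin(t), math.sin(p))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def difference_vector(theta_deg, phi_deg):
    """Difference between the R-channel axis (theta, phi) and the
    L-channel axis (-theta, -phi)."""
    u_r = axis_vector(theta_deg, phi_deg)
    u_l = axis_vector(-theta_deg, -phi_deg)
    return tuple(r - l for r, l in zip(u_r, u_l))

front = (1.0, 0.0, 0.0)
c = difference_vector(30.0, 20.0)
front_in_equal_sensitivity_plane = abs(dot(front, c)) < 1e-12
```

Under these assumptions the front direction lies in the equal-sensitivity plane, whereas the upward direction (0, 0, 1) does not, which is consistent with this arrangement suppressing sound from directly above or below.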
[0151]
The target signal activity in the present embodiment has a large value when the phase and the amplitude are the same between the channels. Therefore, a large directivity peak is obtained only in the directions along which the plane a and the plane b intersect in FIG. 21, that is, the front direction (the direction of arrival of the target sound indicated by the arrow in FIG. 20 or FIG. 21) and the direction 180° opposite to it. As for the direction opposite to the front, the low sensitivity direction of the directional microphones 711 and 712 faces that way, so the level of the incident sound from that direction is low. Therefore, it is possible to obtain directivity having a maximum main lobe substantially only in the front direction, and this arrangement is effective when it is desired to suppress incoming sound from directly above or directly below.
[0152]
(Eighth embodiment)
FIG. 22 shows a configuration of an audio signal processing device according to the eighth embodiment of the present invention. This embodiment has a configuration in which a spectrum correction unit 800 is inserted between the frequency analysis unit 201 and the target signal activity calculation unit 300 in the third embodiment. As shown in FIG. 23, the spectrum correction unit 800 has an adaptive filter 801 and a correction filter 802.
[0153]
As described above, the audio signal processing device according to the embodiment of the present invention assumes that the same component of the target sound enters the plurality of microphones 101-1 to 101-M. Therefore, when the sensitivity of the microphones 101-1 to 101-M changes due to aging or consumption of a bias setting battery, processing accuracy may be reduced. Even when the arrival direction of the target sound deviates from the expected direction, the processing accuracy may be reduced.
[0154]
In the present embodiment, in order to correct the difference in sensitivity among the microphones 101-1 to 101-M and the deviation in the arrival direction of the target sound, and thereby obtain the original performance, the spectrum correction unit 800 corrects the spectrum information obtained by the frequency analysis unit 201 based on that spectrum information and the target signal activity obtained by the target signal activity calculator 300.
[0155]
Next, details of the processing in the spectrum correction unit 800 will be described with reference to FIG. Here, the case where the input audio signal has two channels will be described, but the extension to the M channel is the same. The correction of the spectrum is performed by identifying the difference between the channels by the adaptive filter 801 and correcting the difference identified by the adaptive filter 801 using the correction filter 802 for the spectrum of one channel. When the difference is identified by the adaptive filter 801, the update speed of the filter may be controlled according to the target signal activity signal 306.
[0156]
As the adaptive filter 801, for example, an LMS adaptive filter in the frequency domain can be used. In this case, the calculation of the frequency domain LMS adaptive filter is performed as follows.
(Equation 32)

[0157]
Here, k is a frame number, X is a spectrum of the first channel, Y is a spectrum of the second channel, E is an error spectrum, W is a complex filter coefficient, μ is a step size, and (*) is a complex conjugate.
[0158]
In this case, the correction filter 802 calculates X ′ (k, f) = W (k, f) X (k, f) for the spectrum X (k, f) of the first channel, where X ′ (k, f) is the spectrum of the first channel after the correction. Since this calculation has already been performed in the equation (8-1) of the adaptive filter 801, it is not necessary to prepare a separate correction filter 802; W (k, f) X (k, f) may simply be taken out of the adaptive filter 801.
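One update of a frequency-domain LMS filter of this kind can be sketched as follows. The equations are not reproduced in this text, so the sketch assumes the standard unnormalized form E (k, f) = Y (k, f) − W (k, f) X (k, f) and W (k+1, f) = W (k, f) + μ X* (k, f) E (k, f). The function name and the default step size are hypothetical.

```python
def lms_step(w, x, y, mu=0.1):
    """One per-bin complex LMS update (assumed standard form).

    w: filter coefficients W(k, f); x: first-channel spectrum X(k, f);
    y: second-channel spectrum Y(k, f); mu: step size.
    Returns (w_next, corrected, error), where corrected = W(k, f) X(k, f)
    is the corrected first-channel spectrum X'(k, f).
    """
    corrected = [wf * xf for wf, xf in zip(w, x)]
    error = [yf - cf for yf, cf in zip(y, corrected)]
    w_next = [wf + mu * xf.conjugate() * ef for wf, xf, ef in zip(w, x, error)]
    return w_next, corrected, error
```

If the second channel is consistently half the amplitude of the first, repeated updates drive W toward 0.5 in that bin, and the corrected spectrum returned by the same call approaches the second channel's spectrum.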
[0159]
It is also possible to control the filter update speed at the time of the difference identification by the adaptive filter 801 using the target signal activity ρ (k). In this case, for example, the update equation (8-2) of the adaptive filter 801 is modified as in the following equation.
[0160]
[Equation 33]

[0161]
Here, for example, 0.5 can be used as the threshold value h. This means that the difference between the channels is identified only when the magnitude of ρ (k) is larger than the threshold value. The filter is therefore updated only when the target sound is likely to be arriving, and there is no concern that the filter will adapt to noise. In addition to the adaptive update / stop control using such a threshold, it is also possible to make the size of the update proportional to ρ (k) as in the following equation.
(Equation 34)

[0162]
When the difference between the channels is estimated using Expression (8-3), if, for example, the sensitivities differ greatly from the beginning, the value of ρ (k) may never exceed the threshold value, in which case no update is made and the difference is not identified. However, as described above, when the change in microphone sensitivity is assumed to be due to aging, consumption of a bias-setting battery, and the like, the sensitivity difference does not increase suddenly, so this is not a significant problem. This embodiment can be used, for example, as a correction method when obtaining the target signal activity in the audio signal processing described in the third to sixth embodiments, enabling operation that is not affected by the difference in sensitivity between channels.
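The two ways of controlling the update speed described above (on / off control with the threshold h, and an update proportional to ρ (k)) can be sketched as a choice of effective step size. The function name and signature are hypothetical, and the magnitude of ρ (k) is used in both variants as an assumption.

```python
def effective_step_size(mu, rho, h=0.5, proportional=False):
    """Step size for the adaptive filter update in the current frame.

    mu: base step size; rho: target signal activity of the frame;
    h: threshold for on / off control; proportional: scale by |rho| instead.
    """
    magnitude = abs(rho)
    if proportional:
        return mu * magnitude            # update size proportional to the activity
    return mu if magnitude > h else 0.0  # update only when the activity exceeds h
```

With the threshold variant, a frame with activity 0.3 leaves the filter unchanged, which also illustrates the case noted above where a large initial sensitivity difference keeps ρ (k) below h and prevents any update.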
[0163]
(Ninth embodiment)
FIG. 24 shows the configuration of the audio signal processing device according to the ninth embodiment of the present invention. As in the eighth embodiment, a spectrum correction unit 900 is provided, and a correction filter learning instruction unit 910 is added.
[0164]
The sensitivity correction described in the eighth embodiment is effective when the sensitivities of the microphones 101-1 to 101-M are not significantly different. In the ninth embodiment, for cases where the amplitude or phase of the target sound cannot be assumed to be the same at each microphone, a learning mode is provided in which a correction filter separate from that of the eighth embodiment is learned to correct the difference between the channels.
[0165]
To correct a sensitivity drift due to aging after learning, or a phase difference due to a small shift of the target speaker position, the correction by the filter learned in the learning mode is applied first, and the automatic correction described in the eighth embodiment is then applied on top of it. The present embodiment is configured to perform these two corrections.
[0166]
The audio signal processing method of the present embodiment can therefore be used even when the target sound direction differs from the assumed direction, or when the microphones 101-1 to 101-M are arranged at different distances from the target sound source. The learning mode may be started by a user's instruction as a trigger, or the apparatus may enter the learning mode automatically after it is started.
[0167]
The correction filter learning instruction unit 910 outputs a signal indicating whether or not the mode is the learning mode. For example, “1” is output in the learning mode, and “0” is output in the non-learning mode. The end of the learning mode may be automatically performed by the device side or may be instructed by the user. In the learning mode, a test sound is generated from the position of the target sound to be input. The user may speak, or a test sound generating device such as a speaker may be used at the target sound position. The test sound may be selected according to the purpose of use. For voice input, it is desirable to use voice or white noise.
[0168]
As shown in FIG. 25, when a user's instruction is input through a switch 911, the correction filter learning instruction unit 910 uses a timer 912 to measure the time elapsed since the instruction, so that a fixed period is treated as the learning mode, and outputs a correction filter learning instruction signal S. The timer 912 outputs, for example, “1” as the correction filter learning instruction signal S from the moment the instruction is input through the switch 911 until a predetermined time has elapsed, and outputs “0” at all other times. A timer function is built into most microprocessors and can be used as the timer 912. The learning mode may be ended automatically by the apparatus using the timer 912 as described above, or may be ended by a user's instruction.
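The behaviour of the switch 911 and timer 912 can be illustrated with a small sketch. The class name `LearningInstruction`, its method names, and the injectable `clock` argument are hypothetical and introduced only for this example; the default duration of 3 seconds is the example learning period mentioned for the spectrum correction unit 900.

```python
import time


class LearningInstruction:
    """Emit S = 1 for a fixed period after the switch is pressed, else S = 0."""

    def __init__(self, duration_s=3.0, clock=time.monotonic):
        self.duration_s = duration_s   # length of the learning mode in seconds
        self.clock = clock             # injectable clock, convenient for testing
        self.started_at = None         # time of the last switch press, if any

    def press_switch(self):
        """Model a user instruction arriving through the switch."""
        self.started_at = self.clock()

    def signal(self):
        """Return the correction filter learning instruction signal S (1 or 0)."""
        if self.started_at is None:
            return 0
        elapsed = self.clock() - self.started_at
        return 1 if elapsed < self.duration_s else 0
```

Calling `signal()` once per frame yields the value of S that selects which correction filter is learned in that frame.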
[0169]
The spectrum correction unit 900 performs learning over a period of a fixed time length, for example, 3 seconds according to an instruction from the correction filter learning instruction unit 910. This period is called a learning mode. In the learning mode, a test sound is generated from the position of the target sound to be input. The user may speak, or a test sound generating device such as a speaker may be used at the target sound position. The test sound may be selected according to the purpose of use. For voice input, it is desirable to use voice or white noise. After the end of the learning mode, the audio signal processing as described up to the eighth embodiment is continuously performed.
[0170]
The configuration of the spectrum correction unit 900 differs slightly from that of the spectrum correction unit 800 of the eighth embodiment shown in FIG. 23: in addition to the correction filter 902, which corresponds to the correction filter 802 in FIG. 23, another correction filter 901 is added before the correction filter 902. The correction filter 902 performs the same operation as that described in the eighth embodiment. That is, it corrects a small shift between channels.
[0171]
On the other hand, the added correction filter 901 corrects a large difference between channels. The correction filter 901 is fixed except in the learning mode. When the correction filter learning instruction signal S from the correction filter learning instruction unit 910 is “1”, the adaptive filter 904 makes the correction filter 901 learn, and when the signal S is “0”, it makes the correction filter 902 learn.
[0172]
For example, learning of the correction filter 902 using the LMS is performed by the following equation.
(Equation 35)

[0173]
On the other hand, learning of the correction filter 901 is performed by the following equation.
[Equation 36]

[0174]
Here, k is the frame number, X is the spectrum of the first channel, Y is the spectrum of the second channel, X1 is the spectrum obtained by passing X through the correction filter 901, W0 is the filter coefficient of the correction filter 902, E0 is the error spectrum during learning of the correction filter 902, μ0 is the step size during learning of the correction filter 902, W1 is the filter coefficient of the correction filter 901, E1 is the error spectrum during learning of the correction filter 901, μ1 is the step size during learning of the correction filter 901, and (*) denotes the complex conjugate. For example, 0.1 is used for the step sizes μ0 and μ1.
[0175]
When learning the correction filter 902 of the equations (9-1) and (9-2), the adaptation speed may be controlled by using the target signal activity as in the eighth embodiment. Filtering of the correction filter 901 is performed by:
(Equation 37)

The filtering of the correction filter 902 is performed by
[Equation 38]

Here, X ′ (k, f) is the spectrum of the first channel which is the output of the spectrum correction unit 900.
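The two-stage learning and filtering can be sketched as below. Because Equations 35 to 38 are not reproduced in this text, the update rules are an assumption: a standard per-bin complex LMS that is consistent with the symbol definitions above (E1 = Y − W1·X, E0 = Y − W0·X1, updates using the complex conjugate of the filter input). The class name `TwoStageCorrection` and its `process` method are hypothetical, and returning the filtered spectrum in learning mode as well is a choice made for this sketch.

```python
import numpy as np


class TwoStageCorrection:
    """Per-frequency-bin correction filters 901 (w1) and 902 (w0)."""

    def __init__(self, n_bins, mu0=0.1, mu1=0.1):
        # Initial value (1, 0), i.e. 1 + j0, at every frequency.
        self.w1 = np.ones(n_bins, dtype=complex)   # filter 901: large differences
        self.w0 = np.ones(n_bins, dtype=complex)   # filter 902: small shifts
        self.mu0 = mu0
        self.mu1 = mu1

    def process(self, x, y, s):
        """Process one frame; x, y are channel spectra, s is the signal S."""
        if s == 1:
            # Learning mode: adapt filter 901 only (assumed LMS form).
            e1 = y - self.w1 * x
            self.w1 += self.mu1 * e1 * np.conj(x)
            x1 = self.w1 * x
        else:
            # Normal mode: filter 901 is fixed, filter 902 adapts (assumed LMS form).
            x1 = self.w1 * x
            e0 = y - self.w0 * x1
            self.w0 += self.mu0 * e0 * np.conj(x1)
        # Output X'(k, f): first-channel spectrum after both filters.
        return self.w0 * x1
```

Note that plain LMS with a fixed step size converges at a rate that depends on the input power; a normalized variant may be preferable in practice, but that is outside what the text describes.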
[0176]
Next, a processing flow of the present embodiment will be described with reference to FIG.
First, initial values of the coefficients of the correction filters 901 and 902 are set as initial settings (step S90). Let the correction filter 901 be called correction filter 1 and the correction filter 902 be called correction filter 0. If the initial values of the coefficients of the correction filters 1 and 0 are set to (1, 0) at all frequencies (f), audio signals pass through even before any learning has been performed, which makes the device easy to handle. Here, (1, 0) represents the complex number 1 + j0. However, even when the initial values of the coefficients of the correction filters 1 and 0 are set to (0, 0) at all frequencies (f), the device operates correctly as long as learning is performed, so there is no essential difference.
[0177]
Next, it is checked whether the correction filter learning instruction signal S is “1” or “0” (step S91). If S = “1”, learning of the correction filter 1 is performed according to equations (9-3) and (9-4) (step S93). On the other hand, if S = “0”, filtering by the correction filter 1 is performed according to equation (9-5) (step S94), learning of the correction filter 0 is then performed according to equations (9-1) and (9-2), and filtering by the correction filter 0 follows (steps S93 to S94). Thereafter, the target signal activity is measured (step S96). The processing from step S91 to step S96 is repeated each time a digitized audio signal is input in units of frames in step S91.
[0178]
According to the present embodiment, processing such as calculation of the target signal activity, detection of the target sound, and enhancement of the target sound can be performed effectively even when, for example, the microphones 101-1 to 101-M are arranged at different distances from the position of the target sound source.
[0179]
When the device is used in the running noise of a car, the noise is highly diffuse, so its amplitude differs little between channels even when the microphones are placed at different positions or in different orientations. When the microphones are arranged at different distances from the target sound position, the spectrum correction of the present embodiment corrects the target sound so that it has the same amplitude and the same phase in every channel. The noise components, which originally had the same amplitude, acquire different amplitudes through the correction, so noise sections become easier to distinguish in the target signal activity and the accuracy of the activity measurement improves. In this way, performance under diffuse noise can be improved by deliberately not placing the microphones at the same distance from the target sound.
[0180]
(Tenth embodiment)
FIG. 28 shows the configuration of the audio signal processing device according to the tenth embodiment of the present invention. The present embodiment relates to a technique for estimating the direction of arrival of a sound source based on a modified cross-correlation coefficient. Estimating the direction of arrival of a sound source is important in various speech processing applications, such as speech enhancement and noise source identification. In particular, the method based on the modified cross-correlation coefficient according to the present embodiment places fewer restrictions on the signal and propagation conditions of a noise source than methods based on null steering, such as an adaptive beamformer, and therefore has the advantage that it can be used in a wide range of noise environments.
[0181]
As shown in the figure, the audio signal processing apparatus according to the present embodiment comprises a frequency analysis unit 201, which performs frequency analysis on the input audio signals of a plurality of (M) channels from the microphones 101-1 to 101-M and converts them into spectrum information as frequency components, and a sound source direction estimation unit 1000, which estimates the sound source direction from the spectrum information. The processing of the frequency analysis unit 201 is as described in the second embodiment (FIG. 6).
[0182]
The sound source direction estimation unit 1000 includes a cross power spectrum calculation unit 1001, a coherence function calculation unit 1002, a correction coefficient generation unit 1003, a cross power spectrum correction unit 1004, a power information calculation unit 1005, a virtual direction correlation coefficient calculation unit 1006, and a sound source direction detection unit 1007. Hereinafter, each component of the sound source direction estimation unit 1000 will be described.
[0183]
The cross power spectrum calculation unit 1001 calculates a power spectrum of each channel and a cross spectrum between channels from the spectrum information obtained by the frequency analysis unit 201.
[0184]
The coherence function calculator 1002 calculates a coherence function between channels of the input audio signal from the cross spectrum obtained by the cross power spectrum calculator 1001 and the power spectrum of each channel.
[0185]
The correction coefficient generation unit 1003 sets a virtual direction, which is a hypothetical arrival direction of the signal, within a predetermined range of arrival directions, and generates a correction coefficient for correcting the spectrum information so that, on the assumption that the signal has arrived from this virtual direction, the signal components in the spectrum information match between the channels.
[0186]
The cross power spectrum correction unit 1004 corrects the cross spectrum and the power spectrum using the generated correction coefficient, and generates a corrected cross spectrum and a corrected power spectrum.
[0187]
The power information calculation unit 1005 calculates power information, which is a signal power ratio for each frequency between channels of the input audio signal, based on the corrected cross spectrum and the corrected power spectrum.
[0188]
The virtual direction correlation coefficient calculation unit 1006 weights the corrected power spectrum and the corrected cross spectrum based on the coherence function and the power information, and calculates a cross-correlation coefficient for each virtual direction in the set of virtual directions defined in advance.
[0189]
The sound source direction detection unit 1007 detects and outputs the sound source direction based on the cross-correlation coefficient for each virtual direction calculated by the virtual direction correlation coefficient calculation unit 1006. At the same time, it outputs the value of the cross-correlation coefficient in the detected sound source direction as a sound source correlation coefficient, and the correction coefficient corresponding to the sound source direction as a sound source direction correction coefficient.
[0190]
Next, the processing of each unit will be described in more detail. For the calculations in the cross power spectrum calculation unit 1001, the coherence function calculation unit 1002, and the power information calculation unit 1005, equations (3-8), (3-9), and (3-10) are used when, for example, the number M of channels of the input audio signal is two, and equations (3-12), (3-13), and (3-14) are used for three or more channels.
[0191]
The correction coefficient generation unit 1003 sets in advance the range from which a signal can arrive, for example as shown in the figure. The arrival direction is represented by a pair (θ, φ) of an azimuth angle θ, which is the horizontal angle, and an elevation angle φ, which is the vertical angle. For example, the directions at the lattice points within the arrival range are taken as the virtual directions. In the case of FIG. 29, the arrival range is -40 ° to 40 ° for both the azimuth and the elevation, and the lattice points are spaced every 5 ° in both the azimuth and the elevation. The directions at all the lattice points are set as virtual directions. In FIG. 29, the interval between lattice points is set to 5 ° for the sake of drawing.
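The lattice of virtual directions described for FIG. 29 can be generated as in the following sketch. The function name `virtual_directions` and the flattened (azimuth, elevation) layout of the result are choices made for this example only.

```python
import numpy as np


def virtual_directions(lo=-40.0, hi=40.0, step=5.0):
    """Return the axis angles and all (azimuth, elevation) lattice points in degrees."""
    # Half a step is added to the upper bound so that `hi` itself is included.
    angles = np.arange(lo, hi + step / 2.0, step)
    theta, phi = np.meshgrid(angles, angles, indexing="ij")
    grid = np.stack([theta.ravel(), phi.ravel()], axis=1)
    return angles, grid
```

With the range and spacing quoted above, each axis has 17 angles and the lattice has 17 × 17 = 289 virtual directions, which is the number of correlation coefficients evaluated per frame.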
[0192]
The virtual direction on the lattice point is represented by dh, g = (θh, φg). Here, h is a number related to the azimuth of the lattice point, and g is a number of the elevation angle. The correction coefficient generator 1003 generates a correction coefficient corresponding to the virtual direction according to the following equation.
[0193]
[Equation 39]

[0194]
Here, i is the channel number, Hi (f, θ, φ) is the correction coefficient of the i-th channel for the (θ, φ) direction, τi (θ, φ) is the propagation delay of the signal received at the i-th microphone relative to the signal received at the reference microphone when a signal arrives from the (θ, φ) direction, Di (θ, φ) is the sensitivity (directivity) of the i-th microphone in the (θ, φ) direction, f is the frequency number, F is the sampling frequency, and L is the number of FFT points. The reference microphone is, for example, the first microphone.
[0195]
For example, when the direction of an incoming sound is d = (θ, φ) in a microphone arrangement as shown in FIG. 30 and the reference position is at the origin of the coordinates, the time delay with respect to the origin can be calculated as follows using the rectangular coordinates.
[0196]
(Equation 40)

[0197]
Here, * is the inner product and c is the speed of sound. When the position of the microphone i is Ai = (xi, yi, zi), the following expression is obtained.
[0198]
(Equation 41)

[0199]
Di (θ, φ) is a characteristic inherent to the microphone, and is obtained from product information or obtained by measurement. The measurement of the directivity of the microphone sensitivity may be performed by, for example, measuring the output while changing the incident angle of the sound to the microphone, and a general method may be used.
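A minimal sketch of how such correction coefficients might be computed is given below. Equations 39 to 41 are not reproduced in this text, so several details are assumptions: the axis convention used to turn (θ, φ) into a unit vector, the sign of the delay, the speed of sound c = 340 m/s, and the form Hi = exp(j·2π·f·F/L·τi) / Di, chosen so that multiplying each channel by Hi aligns a plane wave from (θ, φ) with the reference microphone. All function names are hypothetical.

```python
import numpy as np


def direction_vector(theta_deg, phi_deg):
    """Unit vector towards the source (assumed axis convention)."""
    th, ph = np.radians(theta_deg), np.radians(phi_deg)
    return np.array([np.cos(ph) * np.sin(th), np.cos(ph) * np.cos(th), np.sin(ph)])


def relative_delays(mic_pos, theta_deg, phi_deg, c=340.0):
    """Delay of each microphone relative to the first (reference) microphone."""
    d = direction_vector(theta_deg, phi_deg)
    # A microphone displaced towards the source receives the wavefront earlier,
    # so its delay with respect to the origin is negative.
    tau = -(mic_pos @ d) / c
    return tau - tau[0]


def correction_coefficients(mic_pos, theta_deg, phi_deg, n_fft, fs, directivity=None):
    """Correction coefficients H[i, f] for one virtual direction (assumed form)."""
    tau = relative_delays(mic_pos, theta_deg, phi_deg)
    # Angular frequency of each FFT bin: 2*pi*f*F/L.
    omega = 2.0 * np.pi * np.arange(n_fft // 2 + 1) * fs / n_fft
    if directivity is None:
        gain = np.ones(len(mic_pos))            # omnidirectional microphones
    else:
        gain = np.asarray(directivity, dtype=float)
    # Undo the delay and the sensitivity of each channel.
    return np.exp(1j * omega[None, :] * tau[:, None]) / gain[:, None]
```

Since the coefficients depend only on the geometry, the search grid, and the microphone directivity, they can be computed once per lattice point and stored in a table.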
[0200]
The correction coefficients generated by the correction coefficient generation unit 1003 do not change unless the range of the sound source direction search or the directivity of the microphones 101-1 to 101-M changes. They are therefore generated once and stored in a table in advance, and the coefficient values are then read by looking up the table with the lattice point number.
[0201]
Cross power spectrum correction section 1004 multiplies the correction coefficient generated by correction coefficient generation section 1003 by the cross spectrum and power spectrum of the corresponding channel to obtain a corrected cross spectrum and corrected power spectrum. The calculation is performed as follows.
[0202]
(Equation 42)

[0203]
Here, W 'is a spectrum after correction, * is a complex conjugate, i and j are channel numbers. When i ≠ j, it means a cross spectrum, and when i = j, it means a power spectrum.
[0204]
The correction of equation (10-4) is equivalent to correcting the spectrum information Xi (f) with Hi (f, θ, φ) and then calculating the cross power spectrum. Under the assumption that Hi does not change with time during the averaging process, this follows from the relation below.
[Equation 43]
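The equivalence stated above can be checked numerically. The sketch below assumes that the correction of equation (10-4) has the form W'ij = Hi · Hj* · Wij, which is what correcting Xi by Hi before averaging implies when Hi is constant over the averaging window; the function names and array layouts are hypothetical.

```python
import numpy as np


def cross_power(x):
    """Average cross/power spectra W[i, j, f] from spectra x[frame, channel, bin]."""
    return np.einsum("kif,kjf->ijf", x, np.conj(x)) / x.shape[0]


def correct_cross_power(w, h):
    """Apply W'[i, j, f] = H[i, f] * conj(H[j, f]) * W[i, j, f] (assumed form)."""
    return h[:, None, :] * np.conj(h)[None, :, :] * w
```

Correcting the averaged spectra in this way avoids re-filtering and re-averaging the input spectra for every virtual direction, since only one multiplication per channel pair and frequency is needed.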

[0205]
The power information calculation unit 1005 calculates the power ratio between channels from the power spectrum corrected by the cross power spectrum correction unit 1004. In the calculation of the power ratio, a corrected value as in the following expression is used instead of the original power spectrum Wii (f) in Expression (3-7).
[0206]
[Equation 44]

[0207]
The virtual direction cross-correlation coefficient calculation unit 1006 calculates a cross-correlation coefficient for the virtual direction (θ, φ) using the corrected cross power spectrum and power information. In the calculation of the cross-correlation coefficient, the original cross power spectrum and power information in equations (3-11), (3-12), and (3-13) are simply replaced with their corrected counterparts, as in the following equations.
[0208]
[Equation 45]

[0209]
Where K is
[Equation 46]

The limits L1 and L2 of the frequency number f in the sum are set to the numbers corresponding to the band of the target sound. For example, if the band of the target sound is taken to be 260 Hz to 4 kHz, the FFT length is 256, and the sampling frequency is 11 kHz, then L1 = 6 and L2 = 92 are suitable.
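The band limits can be converted to frequency numbers as in the sketch below. Two assumptions are made here because the text does not state them: "11 kHz" is read as the common rate of 11,025 Hz, and the bin index is rounded down. These choices reproduce the quoted values L1 = 6 and L2 = 92; with exactly 11,000 Hz the upper limit would come out as 93 instead.

```python
import math


def band_to_bins(f_lo_hz, f_hi_hz, n_fft, fs_hz):
    """Map a frequency band to FFT bin numbers (L1, L2), rounding down."""
    l1 = math.floor(f_lo_hz * n_fft / fs_hz)
    l2 = math.floor(f_hi_hz * n_fft / fs_hz)
    return l1, l2


# Example values from the text: 260 Hz to 4 kHz, FFT length 256.
L1, L2 = band_to_bins(260.0, 4000.0, 256, 11025.0)
```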
[0210]
Using equations (10-6) to (10-10) with θ = θh and φ = φg, a virtual direction correlation coefficient is obtained for each virtual direction d (θh, φg) in the set arrival range (h = 1 to Nh, g = 1 to Ng).
[0211]
The sound source direction detection unit 1007 detects the peak from the correlation coefficient for each virtual direction calculated by the virtual direction cross-correlation coefficient calculation unit 1006 and outputs the peak as the sound source direction. At this time, for example, stabilization can be achieved by temporal averaging of the virtual direction correlation coefficient as in the following equation.
[0212]
[Equation 47]

[0213]
Here, ρ′k is the virtual direction correlation coefficient averaged in the processing of the kth frame, ρk is the virtual direction correlation coefficient obtained in the processing of the kth frame, and η is a learning constant, for which a value such as η = 0.05 is used. The peak can be detected by finding the maximum value of ρ′k (θ, φ).
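The temporal averaging and peak detection can be sketched as follows. Equation 47 is not reproduced in this text, so the recursive form ρ′k = (1 − η)·ρ′k−1 + η·ρk is an assumption consistent with η being described as a learning constant; the function names are hypothetical.

```python
import numpy as np


def smooth_correlation(rho_avg, rho_new, eta=0.05):
    """Recursive average of the virtual direction correlation map (assumed form)."""
    return (1.0 - eta) * rho_avg + eta * rho_new


def detect_source_direction(rho_avg, thetas, phis):
    """Return (theta, phi, peak value) at the maximum of the averaged map.

    rho_avg is indexed as [h, g], with h the azimuth and g the elevation number.
    """
    h, g = np.unravel_index(np.argmax(rho_avg), rho_avg.shape)
    return thetas[h], phis[g], rho_avg[h, g]
```

The returned lattice indices can also be used to look up the sound source direction correction coefficient in the precomputed table.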
[0214]
The sound source direction detection unit 1007 outputs, in addition to the sound source direction, a sound source correlation coefficient that is a peak value in the sound source direction, and a sound source direction correction coefficient that is a correction coefficient corresponding to the sound source direction. For this purpose, a correction coefficient is extracted from the correction coefficient table inside the correction coefficient generation unit 1003 based on the number of the lattice point in the sound source direction.
[0215]
Next, the flow of processing in this embodiment will be described with reference to FIG.
First, the range of the sound source direction is set as an initial setting (step S100). Next, generation of the correction coefficients (step S101), input of audio signals from the microphones 101-1 to 101-M (step S102), frequency analysis (step S103), calculation of the cross spectrum and power spectrum (step S104), and calculation of the coherence function (step S105) are performed in sequence. Then the spectrum correction (step S106), the calculation of the power information (step S107), and the calculation of the virtual direction cross-correlation coefficient (step S108) are repeated for all the virtual directions, and finally the sound source direction is detected (step S109). The processes in steps S102 to S109 are repeated each time a digitized audio signal is input in units of frames in step S102.
[0216]
(Eleventh embodiment)
The voice enhancement processing of the present invention assumes that the target sound arrives from the front of the microphone array, so performance may degrade if the direction of the target sound deviates from this assumption. The correction based on the adaptive processing described in the eighth embodiment can cope with a direction deviation of the target sound to some extent, but when the direction deviates greatly, the adaptive processing alone is not sufficient. Therefore, in the present embodiment, the direction of the target sound is tracked using the result of the sound source direction estimation processing described in the tenth embodiment, which improves the stability of the voice enhancement processing when the target sound deviates from the assumed direction.
[0217]
FIG. 32 shows the configuration of the audio signal processing device according to the present embodiment. This embodiment estimates the sound source direction by the sound source direction estimation processing described in the tenth embodiment, corrects the input spectrum information using the correction coefficient corresponding to the sound source direction, integrates the corrected spectrum information, and performs gain control on the integrated spectrum information to enhance the voice.
[0218]
To realize such processing, the audio signal processing apparatus according to the present embodiment includes the frequency analysis unit 201 and the sound source direction estimation unit 1000 described in the tenth embodiment, a spectrum information correction unit 1100 that corrects the spectrum information of the plurality of channels from the frequency analysis unit 201 based on the sound source direction correction coefficient, a signal integration unit 1101 that integrates the corrected spectrum information, a coherence filter operation unit 1102 that filters the integrated spectrum information based on the coherence function, and a gain control unit 1103 that suppresses noise by performing gain control based on the sound source correlation coefficient.
[0219]
The frequency analysis unit 201 and the sound source direction estimation unit 1000 are as described in the tenth embodiment. The spectrum information correction unit 1100 corrects the spectrum information of the plurality of channels using the sound source direction correction coefficient output from the sound source direction estimation unit 1000. This correction of the spectrum information has the effect of maximizing the correlation coefficient for the sound arriving from the sound source direction. If the sound source direction is (θo, φo), the sound source correlation coefficient is ρ (θo, φo), and the sound source direction correction coefficient is Hi (k, θo, φo), the spectrum information is corrected according to the following equation.
[Equation 48]

Here, i is the channel number, X'i (k) is the spectrum information after correction, and Xi (k) is the spectrum information before correction.
[0220]
Thereafter, the signal integration unit 1101 integrates the corrected spectrum information X′i (k) into spectrum information of one channel, and the coherence filter operation and the gain control are performed on the integrated spectrum information. As the gain for gain control, ρ (θo, φo) is used as described above. Subsequent processes are the same as in the tenth embodiment, and a description thereof will be omitted.
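The correction, integration, and gain control steps can be sketched as below. Equation 48 is not reproduced in this text; the sketch assumes the correction is the per-channel product X′i = Hi · Xi, uses the arithmetic mean as the integration, and clips the gain to the range 0 to 1. The clipping and the function name `enhance_frame` are assumptions, and the coherence filter stage is omitted for brevity.

```python
import numpy as np


def enhance_frame(x, h_src, rho_src):
    """Correct, integrate, and gain-control one frame.

    x       : spectra of the input channels, shape (channels, bins)
    h_src   : sound source direction correction coefficients, same shape
    rho_src : sound source correlation coefficient rho(theta_o, phi_o)
    """
    x_corr = h_src * x                 # assumed form of the spectrum correction
    z = x_corr.mean(axis=0)            # integrate into one channel by averaging
    gain = float(np.clip(rho_src, 0.0, 1.0))   # assumption: gain kept in [0, 1]
    return gain * z
```

When the target sound dominates, ρ (θo, φo) is close to 1 and the integrated signal passes almost unchanged; in noise-dominated frames the small correlation coefficient attenuates the output.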
[0221]
Next, the flow of processing in this embodiment will be described with reference to FIG.
First, a sound source direction range is set as an initial setting, and a correction coefficient is generated as described in the tenth embodiment (step S200). Next, input of audio signals from the microphones 101-1 to 101-M (Step S201), frequency analysis (Step S202), estimation of sound source direction (Step S203), correction of spectrum information (Step S204), and spectrum information The integration (step S205), the calculation of the coherence function (step S206), and the processing of the gain control (step S207) are repeated each time a digitized audio signal is input in frame units in step S201.
[0222]
(Twelfth embodiment)
Next, a twelfth embodiment of the present invention will be described. In the above-described calculation of the corrected cross-correlation coefficient, as shown in Expression (3-13), the geometric mean of the power of the input spectrum information is used in normalizing the cross-correlation. In this embodiment, a case will be described in which the power of integrated spectral information obtained by integrating input spectral information is used instead of the geometric mean.
[0223]
When integrating signals of a plurality of channels using a beamformer or the like, there is a case where directional noise or the like is suppressed by the function of the beamformer. In such a case, in the gain control based on the cross-correlation or the modified cross-correlation coefficient, it is better to lightly control the gain in consideration of the already suppressed amount. When the gain coefficient described in the present embodiment is used, the gain control can be optimized in consideration of the suppressed amount.
[0224]
As shown in the figure, the audio signal processing device according to the present embodiment includes a frequency analysis unit 201 that performs frequency analysis on input audio signals of a plurality of channels output from a plurality of spatially separated microphones 101-1 to 101-M and generates spectrum information of the plurality of channels, and a modified gain coefficient calculation unit 2000A that calculates, from the spectrum information of the plurality of channels, a gain coefficient whose value corresponds to the activity of the target sound.
[0225]
The modified gain coefficient calculation unit 2000A includes a cross power spectrum calculation unit 2001, a coherence function calculation unit 2002, a power information calculation unit 2003, a signal integration unit 2004, an integrated signal power spectrum calculation unit 2005, and a gain coefficient calculation unit 2006.
[0226]
The cross power spectrum calculation unit 2001 calculates a power spectrum for each channel of an input voice signal and a cross spectrum between channels from the spectrum information.
[0227]
The coherence function calculator 2002 calculates a coherence function from a cross spectrum between a plurality of channels and a power spectrum of each channel.
[0228]
Power information calculation section 2003 calculates power information relating to the signal power between channels of the input audio signal from the power spectra of a plurality of channels.
[0229]
The signal integration unit 2004 integrates a plurality of pieces of spectrum information to generate one channel of integrated spectrum information.
[0230]
Integrated signal power spectrum calculation section 2005 calculates the power spectrum of the integrated spectrum information.
[0231]
The gain coefficient calculator 2006 weights the cross spectrum based on the coherence function and the power information, and calculates a gain coefficient obtained by normalizing the weighted cross spectrum based on the integrated signal power spectrum.
[0232]
The frequency analysis unit 201, the cross power spectrum calculation unit 2001, the coherence function calculation unit 2002, the power information calculation unit 2003, and the signal integration unit 2004 are the same as those in the tenth embodiment, and a description thereof will be omitted.
[0233]
The integrated signal power spectrum calculation unit 2005 calculates the power spectrum of the integrated spectrum information. For example, if the integrated spectrum information is Z (f) and the integration is the arithmetic mean Z (f) = {X1 (f) + X2 (f)} / 2, the power spectrum of Z (f) is obtained by the following equation.
[Equation 49]

The same applies when Z (f) is an integrated signal obtained from a beamformer with other coefficients.
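For the arithmetic-mean case, the power spectrum of the integrated signal can be written in terms of quantities that are already available, since |X1 + X2|² = |X1|² + |X2|² + 2·Re(X1·X2*). The sketch below uses this identity; whether Equation 49 is written in exactly this form is an assumption, because the equation is not reproduced in this text, and the function name is hypothetical.

```python
import numpy as np


def integrated_power_spectrum(w11, w22, w12):
    """Power spectrum Wzz of Z = (X1 + X2) / 2 from channel spectra.

    w11, w22 : power spectra of channels 1 and 2
    w12      : cross spectrum between channels 1 and 2
    """
    return (w11 + w22 + 2.0 * np.real(w12)) / 4.0
```

Computing Wzz from W11, W22, and W12 avoids a separate averaging pass over the integrated signal.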
[0234]
The gain coefficient σ calculated by the gain coefficient calculation unit 2006 is a coefficient used for gain control instead of the cross-correlation coefficient, and can be calculated by the following equation when M = 2.
[Equation 50]

[0235]
Equations (12-2) and (12-3) are the same as equations (3-12) and (3-13), respectively. In the gain coefficient σ obtained by the above calculation, the noise that has already been suppressed is excluded from the power Wzz, so the gain is less likely to be calculated too small, and performance may improve. The gain coefficient σ output by the gain coefficient calculation unit 2006 is called a modified gain coefficient in the sense that it is weighted by the power ratio and the coherence function.
[0236]
Next, the flow of processing in this embodiment will be described with reference to FIG. After input of audio signals from the microphones 101-1 to 101-M (step S301) and frequency analysis (step S302), the modified gain coefficient calculation unit 2000A performs calculation of the cross spectrum and the power spectrum (step S303), calculation of the power information (step S304), calculation of the coherence function (step S305), signal integration (integration of spectrum information) (step S306), calculation of the power spectrum of the integrated spectrum information (integrated signal) (step S307), and calculation of the modified gain coefficient (step S308). These steps are repeated each time a digitized audio signal is input in units of frames in step S301.
[0237]
(Thirteenth embodiment)
FIG. 36 shows the configuration of the audio signal processing device according to the thirteenth embodiment of the present invention. This embodiment is an example in which all the power information pij (f) in equation (12-3) is set to 1, so that the power information is not used. Accordingly, in the modified gain coefficient calculation unit 2000B, the power information calculation unit 2003 of the twelfth embodiment has been removed.
[0238]
(Fourteenth embodiment)
Next, as a fourteenth embodiment of the present invention, a speech enhancement processing device that suppresses noise based on the gain coefficient obtained in the twelfth embodiment and enhances a target speech will be described.
[0239]
As shown in the figure, the audio signal processing device according to the present embodiment includes the frequency analysis unit 201, which performs frequency analysis on input audio signals of a plurality of channels output from a plurality of spatially separated microphones 101-1 to 101-M and generates spectrum information of M channels, and the modified gain coefficient calculation unit 2000A illustrated in FIG. 34, which calculates a gain coefficient corresponding to the activity of the target sound from the spectrum information. In addition to these, it includes a gain control unit 2101 and a coherence filter operation unit 2102.
[0240]
The gain control unit 2101 performs gain control on the integrated spectrum information obtained by the signal integration unit 2004 in the modified gain coefficient calculation unit 2000A, based on the gain coefficient calculated by the modified gain coefficient calculation unit 2000A. The coherence filter operation unit 2102 filters the spectrum information output from the gain control unit 2101 based on the coherence function obtained by the coherence function calculation unit 2002 in the modified gain coefficient calculation unit 2000A.
[0241]
Next, the flow of processing in this embodiment will be described with reference to FIG.
After input of audio signals from the microphones 101-1 to 101-M (step S401) and frequency analysis (step S402), the modified gain coefficient calculation unit 2000A performs calculation of the cross spectrum and the power spectrum (step S403), calculation of the power information (step S404), calculation of the coherence function (step S405), integration of the spectrum information (step S406), calculation of the power spectrum of the integrated spectrum information (step S407), and calculation of the gain coefficient (step S408). Next, gain control based on the calculated gain coefficient (step S409) and the coherence filter operation (step S410) are performed. The above steps S401 to S410 are repeated each time a digitized audio signal is input in units of frames in step S401.
[0242]
(Fifteenth embodiment)
FIG. 39 shows the configuration of the audio signal processing device according to the fifteenth embodiment of the present invention. This embodiment is an example in which the power information pij (f) in equation (10-6) is set to 1 so that the power information is not used. The modified gain coefficient calculator 2000B is obtained by removing the power information calculation section 2003 from the configuration shown in FIG.
[0243]
(Sixteenth embodiment)
Next, a sixteenth embodiment of the present invention, which estimates the sound source direction using the gain coefficient described in the twelfth embodiment, will be described. As shown in FIG. 40, the audio signal processing device according to the present embodiment comprises a frequency analysis unit 201 that frequency-analyzes the input audio signals of a plurality of (M) channels from the microphones 101-1 to 101-M and converts them into spectrum information as frequency components, and a sound source direction estimating unit 3000 that estimates the sound source direction from the spectrum information. The processing of the frequency analysis unit 201 is as described in the second embodiment (FIG. 6).
[0244]
The sound source direction estimation unit 3000 includes a cross power spectrum calculation unit 3001, a coherence function calculation unit 3002, a correction coefficient generation unit 3003, a cross power spectrum correction unit 3004, a power information calculation unit 3005, a virtual integrated power spectrum calculation unit 3006, a virtual direction gain coefficient calculator 3007, and a sound source direction detector 3008. Hereinafter, each unit of the sound source direction estimating unit 3000 will be described.
[0245]
The cross power spectrum calculation section 3001 calculates a power spectrum for each channel of an input audio signal of each channel and a cross spectrum between channels from the spectrum information obtained by the frequency analysis section 201.
[0246]
The coherence function calculation unit 3002 calculates a coherence function between the plurality of channels of the input audio signal from the cross spectrum between the plurality of channels and the power spectrum of each channel.
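As a rough illustration of this step, the following sketch estimates the magnitude-squared coherence of one frequency bin from time-averaged cross and power spectra, using the commonly used definition γ² = |W12|² / (W11 · W22). The equation actually used in this document is not reproduced in this section, so this definition, the function name `coherence_sq`, and the plain frame averaging are assumptions made only for illustration.

```python
def coherence_sq(x1_frames, x2_frames, eps=1e-12):
    """Magnitude-squared coherence for one frequency bin (illustrative).

    x1_frames, x2_frames: complex spectrum values of the same bin over
    successive frames (channel 1 and channel 2).
    eps: small constant (assumed) that guards against division by zero.
    """
    n = len(x1_frames)
    # Time-averaged power spectrum of each channel.
    w11 = sum(abs(a) ** 2 for a in x1_frames) / n
    w22 = sum(abs(b) ** 2 for b in x2_frames) / n
    # Time-averaged cross spectrum between the two channels.
    w12 = sum(a * b.conjugate() for a, b in zip(x1_frames, x2_frames)) / n
    return abs(w12) ** 2 / (w11 * w22 + eps)


# Channel 2 is channel 1 times a fixed complex factor: fully coherent.
x1 = [1 + 1j, 2 - 1j, -0.5 + 0.3j]
x2 = [0.5j * a for a in x1]
print(coherence_sq(x1, x2))
```

With a fixed inter-channel transfer factor the value is close to 1, while signals whose relative phase changes from frame to frame give a value closer to 0.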
[0247]
The correction coefficient generation unit 3003 generates, in correspondence with a virtual direction group composed of a plurality of virtual directions, correction coefficients that correct a signal arriving from a virtual direction, that is, an assumed arrival direction of the signal, so that it matches between channels.
[0248]
The cross power spectrum correction unit 3004 corrects the cross spectrum and the power spectrum based on the correction coefficient generated by the correction coefficient generation unit 3003, and generates a corrected cross spectrum and a corrected power spectrum.
[0249]
The power information calculation unit 3005 calculates power information on the signal power between channels of the input audio signal based on the corrected cross spectrum and the corrected power spectrum.
[0250]
The virtual integrated power spectrum calculation section 3006 calculates, based on the corrected cross spectrum and the corrected power spectrum obtained by the cross power spectrum correcting unit 3004, the power spectrum of the integrated spectrum information that would be obtained by correcting the spectrum information of the plurality of channels obtained by the frequency analysis section 201 with the correction coefficients generated by the correction coefficient generation section 3003 and then integrating it.
[0251]
The virtual direction gain coefficient calculator 3007 weights the corrected cross spectrum obtained by the cross power spectrum correction unit 3004 based on the coherence function and the power information, and further normalizes it based on the virtual integrated power spectrum, thereby obtaining a gain coefficient corresponding to one set of virtual directions.
[0252]
The sound source direction detection unit 3008 detects and outputs the sound source direction based on the gain coefficient for each virtual direction calculated by the virtual direction gain coefficient calculation unit 3007. At the same time, it outputs the value of the gain coefficient corresponding to the detected sound source direction as a sound source direction gain coefficient, and the correction coefficient corresponding to the sound source direction as a sound source direction correction coefficient.
[0253]
Here, the processes of the frequency analysis unit 201, the cross power spectrum calculation unit 3001, the coherence function calculation unit 3002, the correction coefficient generation unit 3003, the cross power spectrum correction unit 3004, and the power information calculation unit 3005 are the same as in the sound source direction estimation based on the correlation coefficient described in the tenth embodiment, so a detailed description is omitted.
[0254]
In the calculation of the gain coefficient in the twelfth and fourteenth embodiments, when calculating the value of the denominator of the equation of the gain coefficient σ, the power spectrum is obtained by integrating spectral information of a plurality of channels. On the other hand, in the present embodiment, the integration at the stage of the spectrum information is not performed, and the power spectrum and the cross spectrum are corrected to directly obtain the power of the integrated signal. This is more advantageous in terms of calculation amount and storage area than actually obtaining the power after integrating the signals. That is, if power is obtained after integrating spectrum information, time averaging for power spectrum estimation is required for each virtual direction. According to the present embodiment, this can be avoided.
[0255]
First, assume that the spectrum information of each channel is multiplied by the correction coefficient generated by the correction coefficient generation unit 3003 and then integrated, and that the integration is performed here by averaging. The integrated signal Z (f) at this time can be expressed as
(Equation 51)

Of course, other integration methods may be used.
[0256]
At this time, the power spectrum of the integrated signal Z (f) is given by
(Equation 52)

Here, in Equation (16-2), the suffixes are omitted. The overline represents the time average. Therefore, once the cross spectrum and the power spectrum have been obtained, the denominator of the gain coefficient σ (θ, φ) corresponding to the virtual direction (θ, φ) can be obtained simply by multiplying by the correction coefficients according to equation (16-2).
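The computational advantage described above can be checked numerically. The sketch below assumes two channels and integration by simple averaging, Z = (H1·X1 + H2·X2) / 2, and compares the power obtained by integrating first with the power obtained only from the time-averaged spectra W11, W22 and W12. The function names and the two-channel expansion are assumptions for illustration; the exact forms of Equations (16-1) and (16-2) are not reproduced in this section.

```python
import random


def averaged_spectra(x1, x2):
    """Time-averaged power spectra and cross spectrum for one bin."""
    n = len(x1)
    w11 = sum(abs(a) ** 2 for a in x1) / n
    w22 = sum(abs(b) ** 2 for b in x2) / n
    w12 = sum(a * b.conjugate() for a, b in zip(x1, x2)) / n
    return w11, w22, w12


def power_after_integration(x1, x2, h1, h2):
    """Integrate the corrected channels first, then average |Z|^2."""
    z = [(h1 * a + h2 * b) / 2 for a, b in zip(x1, x2)]
    return sum(abs(v) ** 2 for v in z) / len(z)


def power_from_spectra(w11, w22, w12, h1, h2):
    """Same power, computed only from the averaged spectra."""
    cross = (h1 * h2.conjugate() * w12).real
    return (abs(h1) ** 2 * w11 + abs(h2) ** 2 * w22 + 2 * cross) / 4


random.seed(0)
x1 = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(32)]
x2 = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(32)]
w11, w22, w12 = averaged_spectra(x1, x2)  # computed once per frame

# The averaged spectra are reused for every candidate direction.
for h1, h2 in [(1 + 0j, 1 + 0j), (1 + 0j, 1j), (0.5 - 0.2j, 1.3 + 0.4j)]:
    direct = power_after_integration(x1, x2, h1, h2)
    indirect = power_from_spectra(w11, w22, w12, h1, h2)
    print(abs(direct - indirect) < 1e-9)
```

In this sketch the time averaging is done once in `averaged_spectra`, and each virtual direction then costs only a few multiplications, which is the saving the paragraph above refers to.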
[0257]
In the virtual direction gain coefficient calculation unit 3007, first, the corrected cross spectrum corresponding to the virtual direction obtained by the cross power spectrum correction unit 3004,
(Equation 53)

is weighted based on the coherence function γ² (f) and the corrected power information pij (f, θ, φ). Further, the virtual direction gain coefficient calculation unit 3007 weights the virtual integrated signal power Wzz (f, θ, φ) obtained by the virtual integrated power spectrum calculation unit 3006 based on the coherence function γ² (f), and obtains a virtual direction gain coefficient σ (θ, φ), which is a gain coefficient corresponding to the virtual direction, by the above equation (2-3).
[0258]
The processing of the sound source direction detection unit 3008 may be the same as that of the sound source direction estimation unit 10007 in the tenth embodiment. In this case, the gain coefficient σ (θo, φo) corresponding to the sound source direction detected by the sound source direction detection unit 3008 is referred to as a sound source direction gain coefficient. Further, as in the tenth embodiment, the sound source direction detection unit 3008 outputs, in addition to the sound source direction (θo, φo), the corresponding correction coefficient Hi (θo, φo) as a sound source direction correction coefficient. As described above, the sound source direction can be estimated based on the gain coefficient.
[0259]
Next, the flow of processing in this embodiment will be described with reference to FIG.
First, a range of the sound source direction is set as an initial setting (step S500). Next, generation of correction coefficients (step S501), input of audio signals from the microphones 101-1 to 101-M (step S502), frequency analysis (step S503), calculation of the cross spectrum and power spectrum (step S504), and calculation of the coherence function (step S505) are performed sequentially. Next, the spectrum correction (step S506), the calculation of the power information (step S507), the calculation of the virtual integrated power spectrum (step S508), and the calculation of the virtual direction gain coefficient (step S509) are repeatedly performed for all the virtual directions. Then, the sound source direction is detected (step S510). The processing of steps S502 to S510 is repeated each time a digitized audio signal is input in units of frames in step S502.
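The loop over virtual directions and the final detection step can be sketched as follows. This is a minimal sketch that assumes the detection simply selects the virtual direction with the largest gain coefficient; the detection rule of the tenth embodiment is not reproduced in this section, and `detect_source_direction` and `gain_of` are hypothetical names, with `gain_of` standing in for steps S506 to S509.

```python
def detect_source_direction(directions, gain_of):
    """Return (direction, gain) with the largest virtual direction gain.

    directions: iterable of candidate (theta, phi) pairs.
    gain_of:    callable mapping (theta, phi) to a gain coefficient;
                a hypothetical stand-in for steps S506 to S509.
    """
    best_dir, best_gain = None, float("-inf")
    for theta, phi in directions:
        g = gain_of(theta, phi)
        if g > best_gain:
            best_dir, best_gain = (theta, phi), g
    if best_dir is None:
        raise ValueError("no virtual directions were given")
    return best_dir, best_gain


# Toy gain that peaks at theta = 30 degrees (illustrative values only).
candidates = [(theta, 0) for theta in range(-90, 91, 10)]
toy_gain = lambda theta, phi: 1 - abs(theta - 30) / 180
print(detect_source_direction(candidates, toy_gain))
```

The returned gain plays the role of the sound source direction gain coefficient, and the correction coefficients generated for the returned direction would serve as the sound source direction correction coefficients.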
[0260]
(Seventeenth embodiment)
Next, as a seventeenth embodiment of the present invention, a process will be described that uses the sound source direction estimated by the gain-coefficient-based sound source direction estimation described in the sixteenth embodiment to track the direction of the target sound, so that voice enhancement can be performed stably even when the target sound moves.
[0261]
As shown in FIG. 42, the audio signal processing apparatus according to the present embodiment includes the frequency analysis unit 201, the sound source direction estimation unit 3000, a spectrum information correcting unit 3100 that corrects the spectrum information of the plurality of channels from the frequency analysis unit 201 based on the sound source direction correction coefficient, a signal integrating unit 3101 that integrates the corrected spectrum information, a coherence filter calculating unit 3102 that filters the integrated spectrum information based on a coherence function, and a gain control unit 3103 that further performs gain control on the filtered spectrum information based on the sound source direction gain coefficient to suppress noise.
[0262]
Here, the frequency analysis unit 201, the sound source direction estimation unit 3000, and the spectrum information correction unit 3100 are the same as in the sixteenth embodiment, and the coherence filter operation unit 3102 is the same as in the eleventh embodiment.
[0263]
The signal integration unit 3101 integrates the corrected spectrum information using the same integration method as that assumed in the calculation of the virtual integrated signal power spectrum by the virtual integrated signal power spectrum calculation unit 3006 in the sound source direction estimating unit 3000 shown in FIG. That is, if the virtual integrated signal power spectrum calculation unit 3006 assumes averaging of two channels, the signal integration unit 3101 also uses averaging to integrate the spectrum information. In this case, let the sound source direction obtained by the sound source direction estimating unit 3000 be (θo, φo) and the corresponding correction coefficients be H1 (f, θo, φo) and H2 (f, θo, φo). The integrated signal Z (f, θo, φo) corrected according to the sound source direction is then expressed by the following equation.
[0264]
(Equation 54)

[0265]
X1 (f) and X2 (f) are spectrum information of each channel obtained by the frequency analysis unit.
[0266]
The gain control unit 3103 controls the gain of the integrated signal Z (f, θo, φo) based on the gain coefficient σ (θo, φo) corresponding to the sound source direction estimated by the sound source direction estimation unit 3000. As the control method, in addition to simple proportional control, the method described in the first embodiment may be used.
[0267]
Next, the flow of processing in this embodiment will be described with reference to FIG.
First, a range of the sound source direction is set as an initial setting, and a correction coefficient is generated (step S600). Next, input of audio signals from the microphones 101-1 to 101-M (step S601), frequency analysis (step S602), estimation of sound source direction (step S603), correction of spectrum information (step S604), integration of spectrum information (Step S605), the coherence filter operation (Step S606) and the gain control (Step S607) are repeated each time the digitized audio signal is input in frame units in Step S601.
[0268]
(Eighteenth Embodiment)
Next, as an eighteenth embodiment of the present invention, a sound signal processing device will be described in which the difference between the channels of the input audio signal is adaptively corrected using an adaptive filter, so that noise is reduced even when the direction of the target sound deviates slightly from the expected one. The tracking-type stabilization method based on the sound source direction estimation described in the eleventh and seventeenth embodiments is effective against a shift of the target sound, but has little effect on a signal mismatch between channels caused by reflection or the like. Since the state of reflection often differs depending on the sound receiving position, reflection causes a mismatch between channels. Therefore, in the present embodiment, a stabilization method using an adaptive filter is used.
[0269]
The stabilization method using the adaptive filter has already been described in the eighth embodiment. In the eighth embodiment, before obtaining the target signal activity based on the correlation coefficient, correction between channels is performed by controlling an adaptive filter using the correlation coefficient. In this case, since there is a time delay in obtaining the correlation coefficient, this is effective for disturbance factors that change more slowly than this delay, that is, sensitivity changes due to microphone bias voltage changes and aging. On the other hand, the present embodiment is effective when the state of the shift between the channels of the input audio signal changes relatively quickly, such as when there is a reflected wave or when the target sound frequently moves.
[0270]
The audio signal processing apparatus according to the present embodiment includes a plurality of spatially separated microphones and a frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels input from the microphones to generate spectrum information of the plurality of channels. In addition to these, as shown in FIG. 44, it includes a stabilized target signal activity estimating unit 4000 that estimates the target signal activity using the spectrum information of the plurality of channels from the frequency analysis unit as an input.
[0271]
The stabilized target signal activity estimating unit 4000 includes a first corrected cross-correlation coefficient calculation unit 4001 that calculates a first corrected cross-correlation coefficient, which is a corrected cross-correlation coefficient between the channels of the input audio signal; an adaptive spectrum correction unit 4002 that adaptively corrects the difference between the spectrum information of the plurality of channels based on the first corrected cross-correlation coefficient to obtain corrected spectrum information; and a second corrected cross-correlation coefficient calculation unit 4003 that calculates a second corrected cross-correlation coefficient from the corrected spectrum information. The frequency analysis unit and the first and second corrected cross-correlation coefficient calculation units 4001 and 4003 perform the same processing as described above.
[0272]
As shown in FIG. 46, the adaptive spectrum correction unit 4002 uses an adaptive filter 4103 to identify the transfer function between the spectrum information of the channels obtained by the frequency analysis unit, and corrects the difference. The adaptive filter 4103 is controlled based on the corrected cross-correlation coefficient output from the first corrected cross-correlation coefficient calculation unit 4001 so that it is updated only while the target sound is arriving. This avoids adaptation to noise, so that only the transfer function for the target sound is estimated.
[0273]
Since the calculation of the first corrected cross-correlation coefficient involves a time delay caused by the time averaging used to obtain the cross spectrum and the power spectrum, the correlation coefficient output from the first corrected cross-correlation coefficient calculation unit 4001 is based on input data that is older than the present time by this delay. Therefore, to synchronize the spectrum information input to the adaptive filter 4103 with the correlation coefficient, spectrum information delayed by the delay circuits 4101 and 4102 by the same amount as the delay of the correlation coefficient calculation is used.
[0274]
The value of the time delay is T / 2, where T is the time length required for averaging the cross power spectrum. In terms of the number of frames, when the averaged frame number is Ta and Ta is an even number, the delay is Ta / 2 frames, but when Ta is an odd number, the delay can be calculated as (Ta-1) / 2. Ta is preferably an odd number.
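The frame delay rule above reduces to integer division, since Ta / 2 for even Ta and (Ta - 1) / 2 for odd Ta both equal the floor of Ta / 2. The following sketch shows this together with a simple frame delay line of the kind the delay circuits could be modeled by; `delay_frames` and `FrameDelay` are hypothetical names introduced only for illustration.

```python
from collections import deque


def delay_frames(ta: int) -> int:
    """Delay in frames for an averaging window of ta frames."""
    if ta <= 0:
        raise ValueError("ta must be a positive frame count")
    # Ta / 2 for even Ta and (Ta - 1) / 2 for odd Ta both equal Ta // 2.
    return ta // 2


class FrameDelay:
    """Delay line that returns the frame pushed n_frames calls earlier."""

    def __init__(self, n_frames, fill=None):
        # Pre-fill so that the first n_frames outputs are the fill value.
        self._buf = deque([fill] * n_frames)

    def push(self, frame):
        self._buf.append(frame)
        return self._buf.popleft()


delay = FrameDelay(delay_frames(5))        # 5-frame average: 2-frame delay
print([delay.push(k) for k in range(5)])   # [None, None, 0, 1, 2]
```

Passing each channel's spectrum frame through its own `FrameDelay` of this length keeps the adaptive filter input aligned with the correlation coefficient computed from the averaging window.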
[0275]
The adaptive filter 4103 is implemented using, for example, an LMS adaptive filter in the frequency domain as described in the eighth embodiment, and the correction is performed by multiplying the spectrum information of the channel used as the reference signal by the identified filter W (f). The second corrected cross-correlation coefficient calculation section 4003 calculates and outputs a second corrected cross-correlation coefficient from the spectrum information corrected by the adaptive spectrum correction section 4002.
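A gated per-bin update of this kind might look like the sketch below. It uses a normalized LMS step rather than the plain LMS filter named in the text, because the normalized form is easier to keep stable in a short example. The threshold, the step size `mu`, and the function names are all assumptions, and the update of the eighth embodiment is not reproduced in this section.

```python
import random


def update_filter(w, x_ref, x_des, corr, threshold=0.5, mu=0.5, eps=1e-12):
    """One gated normalized-LMS step for a single frequency bin.

    w:     current complex filter coefficient W(f)
    x_ref: delayed spectrum value of the reference channel
    x_des: delayed spectrum value of the other channel
    corr:  first corrected cross-correlation coefficient for the frame
    threshold, mu, eps: illustrative values, not taken from the document
    """
    if corr < threshold:
        return w  # target sound judged absent: keep the filter frozen
    err = x_des - w * x_ref
    return w + mu * err * x_ref.conjugate() / (abs(x_ref) ** 2 + eps)


def correct_reference(w, x_ref):
    """Corrected spectrum: reference channel multiplied by W(f)."""
    return w * x_ref


random.seed(1)
true_w = 0.8 - 0.3j          # unknown inter-channel transfer factor
w = 0j
for _ in range(60):
    x_ref = complex(random.gauss(0, 1), random.gauss(0, 1))
    w = update_filter(w, x_ref, true_w * x_ref, corr=0.9)
print(abs(w - true_w) < 1e-6)
```

When the correlation coefficient is below the threshold the coefficient is returned unchanged, which corresponds to updating the filter only while the target sound is arriving.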
[0276]
Next, the flow of processing in this embodiment will be described with reference to FIG. 45. First, a first corrected cross-correlation coefficient, which is a corrected cross-correlation coefficient between the channels of the input audio signal, is calculated (step S701). Based on this coefficient, adaptive spectrum correction is performed to correct the transfer function difference between the spectrum information of the channels (step S702). Finally, a second corrected cross-correlation coefficient is calculated from the corrected spectrum information and output as the target signal activity (step S703).
[0277]
In the present embodiment, the corrected cross-correlation coefficient calculation is performed twice, taking the time delay into account, so that the adaptive filter is controlled and updated using synchronized data. This makes it possible to adaptively and accurately correct the difference between channels while suppressing the influence of noise even when the situation changes quickly.
[0278]
(Nineteenth Embodiment)
In the eighteenth embodiment, the case of adaptively stabilizing the corrected cross-correlation coefficient calculation has been described. However, similar processing can be performed by using the calculation of the gain coefficient described in the twelfth embodiment instead of the corrected cross-correlation coefficient.
[0279]
The audio signal processing apparatus according to the present embodiment includes a plurality of spatially separated microphones and a frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels input from the microphones to generate spectrum information of the plurality of channels. In addition to these, as shown in FIG. 47, it includes a stabilized target signal activity estimating unit 5000 that estimates the target signal activity using the spectrum information of the plurality of channels from the frequency analysis unit as an input.
[0280]
The stabilized target signal activity estimating unit 5000 includes a first corrected gain coefficient calculating section 5001 that calculates, from the spectrum information of the plurality of channels, a first corrected gain coefficient, which is a value corresponding to the activity of the target sound; an adaptive spectrum correction unit 5002 that adaptively corrects the difference between the spectrum information of the plurality of channels based on the first corrected gain coefficient to obtain corrected spectrum information; and a second corrected gain coefficient calculator 5003 that calculates a second corrected gain coefficient from the corrected spectrum information. The first and second corrected gain coefficient calculators 5001 and 5003 perform the same processing as that described in the twelfth embodiment.
[0281]
By the way, in each of the first, second, fourth, sixth, eleventh, fourteenth, and seventeenth embodiments, the speech enhancement processing is performed using the calculation result of the correlation coefficient or the gain coefficient. In each of the first, second, fourth, sixth, eleventh, fourteenth, and seventeenth embodiments, the time delay due to the calculation of the correlation coefficient or the gain coefficient is also considered in the same manner as described with reference to FIG. It is desirable that the input spectral information at the time of calculating the correlation coefficient or the gain coefficient is processed with a delay so that the correlation coefficient or the gain coefficient and the input spectral information are synchronized. In this case, the number of delay frames is selected to be a half of the number of time-averaged frames for estimating the cross spectrum and the power spectrum, as described with reference to FIG. Since the introduction of such a delay process is self-evident, it is omitted in the description of the first, second, fourth, sixth, eleventh, fourteenth, and seventeenth embodiments.
[0282]
Note that the present invention is not limited to the above-described embodiment as it is, and can be embodied by modifying constituent elements in an implementation stage without departing from the scope of the invention. Various inventions can be formed by appropriately combining a plurality of constituent elements disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiment. Further, components of different embodiments may be appropriately combined.
[0283]
[Effect of the Invention]
As described above, according to the present invention, noise can be suppressed in a real noise environment that includes sudden noise and diffuse noise, and whether or not a target voice has arrived can be detected accurately in a noisy environment. This makes it possible to perform audio signal processing suitable as preprocessing for hands-free communication and voice recognition.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an audio signal processing device according to a first embodiment of the present invention.
FIG. 2 is an exemplary view showing various functions used for gain control for an integrated audio signal in the embodiment.
FIG. 3 is a flowchart showing an audio signal processing procedure according to the embodiment;
FIG. 4 is a view showing an example of arrangement of microphones in the embodiment.
FIG. 5 is a block diagram showing a configuration of an audio signal processing device using an adaptive beamformer in a signal integration unit according to the embodiment.
FIG. 6 is a block diagram illustrating a configuration of an audio signal processing device according to a second embodiment of the present invention.
FIG. 7 is a flowchart showing an audio signal processing procedure according to the embodiment;
FIG. 8 is a block diagram showing a configuration of an audio signal processing device according to a third embodiment of the present invention.
FIG. 9 is a flowchart showing an audio signal processing procedure according to the embodiment;
FIG. 10 is a block diagram showing a configuration of an audio signal processing device according to a fourth embodiment of the present invention.
FIG. 11 is a flowchart showing the audio signal processing procedure in the embodiment.
FIG. 12 is a flowchart showing a detection processing procedure according to the embodiment;
FIG. 13 is a view showing a specific example of a detection process according to the embodiment;
FIG. 14 is a block diagram showing a configuration of an audio signal processing device according to a fifth embodiment of the present invention.
FIG. 15 is a flowchart showing a sound signal processing procedure in the embodiment.
FIG. 16 is a block diagram showing a configuration of an audio signal processing device according to a sixth embodiment of the present invention.
FIG. 17 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 18 is a diagram showing an arrangement example of microphones according to a seventh embodiment of the present invention.
FIG. 19 is an exemplary view showing another arrangement example of the microphone according to the embodiment;
FIG. 20 is a diagram showing arrival directions in the arrangements of FIGS. 19 (B1) to (B4) using azimuths and elevation angles.
FIG. 21 is a diagram showing a relationship between an arrival direction in which the phases of two microphones match and an arrival direction in which the sensitivities of the two microphones match in the arrangement of FIGS. 19 (B1) to (B4).
FIG. 22 is a block diagram illustrating a configuration of an audio signal processing device according to an eighth embodiment of the present invention.
FIG. 23 is a block diagram showing a configuration of a spectrum correction unit in the embodiment.
FIG. 24 is a block diagram showing a configuration of an audio signal processing device according to a ninth embodiment of the present invention.
FIG. 25 is a block diagram showing a configuration of a correction filter learning instructing unit in the embodiment.
FIG. 26 is a block diagram showing a configuration of a spectrum correction unit in the embodiment.
FIG. 27 is a flowchart showing a processing procedure of a spectrum correction unit in the embodiment.
FIG. 28 is a block diagram showing a configuration of an audio signal processing device according to a tenth embodiment of the present invention.
FIG. 29 is an exemplary view for explaining setting of virtual points at the time of arrival direction estimation in the embodiment.
FIG. 30 is an exemplary view for explaining a method of calculating a propagation delay in the embodiment.
FIG. 31 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 32 is a block diagram showing a configuration of an audio signal processing device according to an eleventh embodiment of the present invention.
FIG. 33 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 34 is a block diagram showing a configuration of an audio signal processing device according to a twelfth embodiment of the present invention.
FIG. 35 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 36 is a block diagram showing a configuration of an audio signal processing device according to a thirteenth embodiment of the present invention.
FIG. 37 is a block diagram showing a configuration of an audio signal processing device according to a fourteenth embodiment of the present invention.
FIG. 38 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 39 is a block diagram showing a configuration of an audio signal processing device according to a fifteenth embodiment of the present invention.
FIG. 40 is a block diagram showing a configuration of an audio signal processing device according to a sixteenth embodiment of the present invention.
FIG. 41 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 42 is a block diagram showing a configuration of an audio signal processing device according to a seventeenth embodiment of the present invention.
FIG. 43 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 44 is a block diagram showing a configuration of an audio signal processing device according to an eighteenth embodiment of the present invention.
FIG. 45 is a flowchart showing an audio signal processing procedure in the embodiment.
FIG. 46 is a block diagram showing a configuration of an adaptive spectrum correction unit in the embodiment.
FIG. 47 is a block diagram showing a configuration of an audio signal processing device according to a nineteenth embodiment of the present invention.
[Explanation of symbols]
101-1 to 101-M ... microphone
102: cross-correlation coefficient calculator
103 ... Signal integration unit
104: gain control unit (adjustment unit)
106 ... Adaptive beamformer
201: Frequency analysis unit
202: Cross-correlation coefficient calculator
203 ... Signal integration unit
204: gain control unit (adjustment unit)
300: target signal activity calculator
301: Cross power spectrum calculator
302: Coherence function calculator
303: Power information calculation unit
304: Corrected spectrum calculator
305 ... weighted cross-correlation coefficient calculator
401: detection processing unit (determination unit)
501: gain control unit (adjustment unit)
601 ... Coherence filter operation unit
701 ... omnidirectional microphone
702, 711, 712 ... directional microphone
800, 900: spectrum correction unit
801,904 ... Adaptive filter
802,901,902 ... correction filter
910: correction filter learning unit

Claims

Determining a cross-correlation coefficient between the channels of the input audio signals of a plurality of channels output from a plurality of microphones spatially separated;
An integrating step of integrating the input audio signal into one channel and outputting an integrated audio signal;
Generating an output audio signal by adjusting the magnitude of the integrated audio signal according to the cross-correlation coefficient.

2. The audio signal processing method according to claim 1, wherein the integrating step integrates the input audio signal into one channel using an adaptive beamformer operating in a time domain.

Generating spectrum information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones spatially separated;
Determining a cross-correlation coefficient between the channels of the spectral information of the plurality of channels;
An integrating step of integrating the spectral information into one channel to generate an integrated spectral signal;
Adjusting the magnitude of the integrated spectrum signal according to the cross-correlation coefficient.

4. The audio signal processing method according to claim 3, wherein the integrating step integrates the input audio signal into one channel using an adaptive beamformer operating in a frequency domain.

Generating spectrum information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones spatially separated;
Obtaining a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information;
Correcting the power spectrum and the cross spectrum using the coherence function by weighting each frequency using a weight function calculated from the power spectrum and the cross spectrum,
Obtaining a cross-correlation coefficient between the channels of the input audio signal, which is weighted based on the corrected power spectrum and cross spectrum.

Generating spectrum information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones spatially separated;
Obtaining a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information;
Obtaining a coherence function between the channels of the spectral information of the plurality of channels from the power spectrum and the cross spectrum,
Correcting the power spectrum and the cross spectrum using the coherence function;
Obtaining a cross-correlation coefficient between the channels of the input audio signal, which is weighted based on the corrected power spectrum and cross spectrum.
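
The chain in this claim (power and cross spectra, coherence function, corrected spectra, weighted cross-correlation coefficient) can be sketched for two channels as follows. This is an illustration under stated assumptions, not the claimed formula: the spectra are averaged over several frames, the coherence is the magnitude-squared coherence, the correction multiplies every spectrum by the coherence of its bin, and the coefficient is the real part of the summed weighted cross spectrum normalized by the weighted powers. All function names are hypothetical.

```python
import math

def averaged_spectra(frames_x, frames_y):
    """Frame-averaged power spectra and cross spectrum of two channels."""
    n = len(frames_x)
    bins = range(len(frames_x[0]))
    pxx = [sum(abs(f[k]) ** 2 for f in frames_x) / n for k in bins]
    pyy = [sum(abs(f[k]) ** 2 for f in frames_y) / n for k in bins]
    pxy = [sum(fx[k] * fy[k].conjugate()
               for fx, fy in zip(frames_x, frames_y)) / n for k in bins]
    return pxx, pyy, pxy

def coherence(pxx, pyy, pxy):
    """Magnitude-squared coherence per frequency bin, in [0, 1]."""
    return [abs(c) ** 2 / (a * b) if a * b > 0.0 else 0.0
            for a, b, c in zip(pxx, pyy, pxy)]

def weighted_correlation(pxx, pyy, pxy, weights):
    """Cross-correlation coefficient computed from weighted spectra."""
    num = sum(w * c.real for w, c in zip(weights, pxy))
    den = math.sqrt(sum(w * a for w, a in zip(weights, pxx))
                    * sum(w * b for w, b in zip(weights, pyy)))
    return num / den if den > 0.0 else 0.0
```

Passing the coherence as `weights` suppresses bins whose inter-channel phase is unstable across frames, so such bins no longer pull the coefficient down when the target sound is present in the remaining bins.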

Generating spectrum information of a plurality of channels by frequency-analyzing input audio signals of a plurality of channels output from a plurality of spatially separated microphones;
Obtaining a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information;
Obtaining a coherence function between the channels of the spectral information of the plurality of channels from the power spectrum and the cross spectrum,
Obtaining power information on signal power between channels of the input audio signal based on the power spectrum;
Correcting the power spectrum and the cross spectrum using the coherence function and the power information;
Obtaining a cross-correlation coefficient between the channels of the input audio signal, which is weighted based on the corrected power spectrum and cross spectrum.
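
The claim does not define the "power information" in this section, so the following sketch substitutes a hypothetical measure purely for illustration: a per-bin balance between the two channel powers that equals 1 when the powers match and falls toward 0 as they diverge. The product of this balance and the coherence is then used as the per-bin weight for the spectrum correction. Both the measure and the way it is combined are assumptions.

```python
import math

def power_balance(pxx, pyy):
    """Hypothetical power information: 1.0 when both channels have equal power."""
    return [2.0 * math.sqrt(a * b) / (a + b) if a + b > 0.0 else 0.0
            for a, b in zip(pxx, pyy)]

def combined_weights(coh, balance):
    """Per-bin weight from coherence and power information (assumed product)."""
    return [c * b for c, b in zip(coh, balance)]
```

For example, channel powers of 4 and 1 in one bin give a balance of 0.8, so that bin contributes less to the weighted cross-correlation coefficient than a bin with matched powers.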

8. The audio signal processing method according to claim 5, further comprising performing threshold processing on the cross-correlation coefficient using a predetermined threshold to determine whether or not a target sound has arrived at the microphones.
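
The threshold processing can be sketched as a per-frame comparison. The threshold value 0.5 and the function name are assumptions for illustration; the claim only requires a predetermined threshold.

```python
def detect_target(coefficients, threshold=0.5):
    """Flag each frame whose cross-correlation coefficient reaches the threshold."""
    return [rho >= threshold for rho in coefficients]
```

Applied to a sequence of per-frame coefficients, this yields one target-present flag per frame.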

9. The audio signal processing method according to claim 5, further comprising: integrating the spectrum information into one channel to obtain an integrated spectrum signal; and adjusting the magnitude of the integrated spectrum signal according to the cross-correlation coefficient.

The audio signal processing method according to claim 3, wherein the frequency analysis is performed by at least one of a fast Fourier transform process and a bandpass filter bank process.

The audio signal processing method according to claim 6, further comprising weighting each frequency component of the integrated spectrum signal according to at least one of the coherence function and the power information.

12. The audio signal processing method according to claim 6, further comprising a step of correcting at least one of a phase and an amplitude of the spectrum information of the plurality of channels according to the cross-correlation coefficient so that the channels match.

A cross-correlation coefficient calculation unit that calculates a cross-correlation coefficient between channels of a plurality of channels of input audio signals output from a plurality of microphones spatially separated,
An integration unit that integrates the input audio signal into one channel and outputs an integrated audio signal;
An audio signal processing apparatus comprising: an adjustment unit configured to adjust an intensity of the integrated audio signal according to the cross-correlation coefficient to generate an output audio signal.

14. The audio signal processing device according to claim 13, wherein the integration unit includes an adaptive beamformer operating in a time domain.

A frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels output from a plurality of microphones that are spatially separated to generate spectral information of a plurality of channels,
A cross-correlation coefficient calculation unit that calculates a cross-correlation coefficient between the channels of the spectral information of the plurality of channels,
An integration unit that integrates the spectrum information into one channel to generate an integrated spectrum signal;
An adjusting unit that adjusts the magnitude of the integrated spectrum signal according to the cross-correlation coefficient.

16. The audio signal processing device according to claim 15, wherein the integration unit integrates the input audio signal into one channel using an adaptive beamformer operating in a frequency domain.

A frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels output from a plurality of microphones that are spatially separated to generate spectral information of a plurality of channels,
A spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information,
A coherence function calculation unit that calculates a coherence function between channels of the plurality of channels of spectrum information from the power spectrum and the cross spectrum,
A correction spectrum calculation unit that corrects the power spectrum and the cross spectrum using the coherence function,
An audio signal processing device comprising: a weighted cross-correlation coefficient calculation unit that calculates a weighted cross-correlation coefficient between channels of the input audio signal based on the corrected power spectrum and cross spectrum.

A frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels output from a plurality of microphones that are spatially separated to generate spectral information of a plurality of channels,
A spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information,
A coherence function calculation unit that calculates a coherence function between channels of the plurality of channels of spectrum information from the power spectrum and the cross spectrum,
A power information calculation unit that calculates power information related to signal power between channels of the input audio signal based on the power spectrum,
A correction spectrum calculator that corrects the power spectrum and cross spectrum using the coherence function and power information,
An audio signal processing device comprising: a weighted cross-correlation coefficient calculation unit that calculates a weighted cross-correlation coefficient between channels of the input audio signal based on the corrected power spectrum and cross spectrum.

19. The audio signal processing device according to claim 17, further comprising a determination unit configured to perform threshold processing on the cross-correlation coefficient using a predetermined threshold to determine whether a target sound has arrived at the microphones.

20. The audio signal processing device according to claim 17, further comprising: an integration unit that integrates the spectrum information into one channel to obtain an integrated spectrum signal; and an adjustment unit that adjusts the magnitude of the integrated spectrum signal according to the cross-correlation coefficient.

21. The audio signal processing device according to claim 15, wherein the frequency analysis unit is at least one of a fast Fourier transformer and a bandpass filter bank.

22. The audio signal processing device according to claim 17, further comprising: means for weighting each frequency component of the integrated spectrum signal according to at least one of the coherence function and the power information.

23. The audio signal processing device according to claim 15, further comprising: means for correcting at least one of a phase and an amplitude of the spectrum information of the plurality of channels according to the cross-correlation coefficient so that the channels match.

24. The audio signal processing device according to claim 15, wherein the plurality of microphones includes at least one omnidirectional microphone and at least one directional microphone.

25. The audio signal processing device according to claim 15, wherein the plurality of microphones includes at least two directional microphones having different directivity axes.

26. The audio signal processing device according to claim 25, wherein the at least two directional microphones are arranged such that their directivity axes do not lie in the same plane and such that the angle between each directivity axis and the arrival direction of the target sound is the same.

A process of determining a cross-correlation coefficient between channels of a plurality of channels of input audio signals output from a plurality of microphones spatially separated;
A process of integrating the input audio signal into one channel and outputting an integrated audio signal;
A program for causing a computer to perform a process of generating an output audio signal by adjusting a magnitude of the integrated audio signal according to the cross-correlation coefficient.

A process of frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones spatially separated to generate spectral information of a plurality of channels,
A process of obtaining a cross-correlation coefficient between the channels of the spectral information of the plurality of channels;
An integration process of integrating the spectrum information into one channel to generate an integrated spectrum signal;
Adjusting a magnitude of the integrated spectrum signal in accordance with the cross-correlation coefficient.

A process of frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones spatially separated to generate spectral information of a plurality of channels,
A process for obtaining a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information;
A process for obtaining a coherence function between channels of the plurality of channels of spectral information from the power spectrum and the cross spectrum,
Correcting the power spectrum and the cross spectrum using the coherence function,
A process of obtaining a weighted cross-correlation coefficient between channels of the input audio signal based on the corrected power spectrum and cross spectrum.

A process of frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones spatially separated to generate spectral information of a plurality of channels,
A process for obtaining a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information;
A process for obtaining a coherence function between channels of the plurality of channels of spectral information from the power spectrum and the cross spectrum,
A process for obtaining power information about signal power between channels of the input audio signal based on the power spectrum;
Correcting the power spectrum and the cross spectrum using the coherence function and power information,
A process of obtaining a weighted cross-correlation coefficient between channels of the input audio signal based on the corrected power spectrum and cross spectrum.

A frequency analysis unit that frequency-analyzes input audio signals of a plurality of channels, output from a plurality of spatially separated microphones in response to sounds input to the microphones, to generate spectrum information of a plurality of channels,
A spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information,
A coherence function calculation unit that calculates a coherence function between channels of the plurality of channels of spectrum information from the power spectrum and the cross spectrum,
A correction coefficient generation unit that generates, for a virtual arrival direction group including a plurality of virtual arrival directions of the voice, a correction coefficient for correcting a voice arriving from each virtual arrival direction so that it matches between the plurality of channels,
A spectrum correction unit that corrects the power spectrum and the cross spectrum based on the correction coefficient, and generates a corrected power spectrum and a corrected cross spectrum,
A power information calculation unit that calculates power information about signal power between channels of the input audio signal based on the corrected power spectrum and the corrected cross spectrum,
A correlation coefficient calculation unit that weights the corrected power spectrum and the corrected cross spectrum based on the coherence function and the power information, and calculates, for each virtual arrival direction in the virtual arrival direction group, a cross-correlation coefficient between channels of the input audio signal,
An audio signal processing device comprising: a sound source direction detection unit that detects a sound source direction of a sound input to the microphones based on the cross-correlation coefficients, and outputs the value of the cross-correlation coefficient in the detected sound source direction as a sound source correlation coefficient.
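
The direction search in this claim can be sketched for two channels by treating each virtual arrival direction as a candidate inter-channel delay. This is a simplified illustration under several assumptions: the correction coefficient is a pure phase shift, so the power spectra are left unchanged; the second channel is modeled as a delayed copy of the first; and mapping a delay to an angle (for example via microphone spacing and the speed of sound) is left out. The function name and arguments are hypothetical.

```python
import cmath
import math

def scan_directions(pxx, pyy, pxy, freqs_hz, candidate_delays_s, weights):
    """Return (delay, coefficient) for the candidate with the largest coefficient."""
    den = math.sqrt(sum(w * a for w, a in zip(weights, pxx))
                    * sum(w * b for w, b in zip(weights, pyy)))
    best = None
    for tau in candidate_delays_s:
        # Phase-correct the cross spectrum as if sound arrived with delay tau.
        corrected = [c * cmath.exp(-2j * math.pi * f * tau)
                     for c, f in zip(pxy, freqs_hz)]
        num = sum(w * c.real for w, c in zip(weights, corrected))
        rho = num / den if den > 0.0 else 0.0
        if best is None or rho > best[1]:
            best = (tau, rho)
    return best
```

The second element of the returned pair plays the role of the sound source correlation coefficient: it is the coefficient evaluated at the detected direction, and can be reused for gain control or threshold processing.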

A spectrum information correction unit for correcting the spectrum information of the plurality of channels based on the sound source direction,
An integration unit that integrates the corrected spectrum information of the plurality of channels into one channel to generate an integrated spectrum signal;
A coherence filter operation unit that filters the integrated spectrum signal using the coherence function,
32. The audio signal processing device according to claim 31, further comprising: an adjusting unit configured to adjust the magnitude of the filtered integrated spectrum signal based on the sound source correlation coefficient to generate an output audio signal.

A frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels output from a plurality of microphones that are spatially separated to generate spectral information of a plurality of channels,
A spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information,
A coherence function calculation unit that calculates a coherence function between channels of the plurality of channels of spectrum information from the power spectrum and the cross spectrum,
A power information calculation unit that calculates power information related to signal power between channels of the input audio signal from the power spectrum,
An integration unit that integrates the plurality of pieces of spectrum information into one channel to generate an integrated spectrum signal;
An integrated signal power spectrum calculation unit for calculating a power spectrum of the integrated spectrum signal,
An audio signal processing device comprising: a gain coefficient calculation unit that weights the cross spectrum based on the coherence function and the power information, and further normalizes the weighted cross spectrum based on the power spectrum of the integrated spectrum signal to calculate a gain coefficient.

A frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels output from a plurality of microphones that are spatially separated to generate spectral information of a plurality of channels,
A spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information,
A coherence function calculation unit that calculates a coherence function between channels of the plurality of channels of spectrum information from the power spectrum and the cross spectrum,
An integration unit that integrates the plurality of pieces of spectrum information into one channel to generate an integrated spectrum signal;
An integrated signal power spectrum calculation unit for calculating an integrated signal power spectrum, which is the power spectrum of the integrated spectrum signal,
An audio signal processing apparatus comprising: a gain coefficient calculator that weights the cross spectrum based on the coherence function, and further normalizes the weighted cross spectrum based on the integrated signal power spectrum to calculate a gain coefficient.
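
The gain coefficient of this claim, a coherence-weighted cross spectrum normalized by the integrated signal power spectrum, can be sketched as follows. The specific form is an assumption: the weighted real part of the cross spectrum is summed over frequency, divided by the equally weighted sum of the integrated signal power spectrum, and clipped to the range 0 to 1. The function name is hypothetical.

```python
def gain_coefficient(pxy, pzz, coh):
    """Coherence-weighted cross spectrum normalized by integrated signal power."""
    num = sum(w * c.real for w, c in zip(coh, pxy))
    den = sum(w * p for w, p in zip(coh, pzz))
    gain = num / den if den > 0.0 else 0.0
    return min(1.0, max(0.0, gain))  # assumption: clip to [0, 1]
```

If the integrated signal is the average of two channels, a bin carrying the same signal on both channels has equal cross spectrum and integrated power, which drives the gain toward 1, while a bin of uncorrelated noise has a near-zero cross spectrum and pulls the gain down.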

The audio signal processing device according to claim 34, further comprising: an adjustment unit that adjusts the magnitude of the integrated spectrum signal based on the gain coefficient; and a coherence filter operation unit that generates an output audio signal by filtering the integrated spectrum signal output from the adjustment unit based on the coherence function.

A frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels output from a plurality of microphones that are spatially separated to generate spectral information of a plurality of channels,
A spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information,
A coherence function calculation unit that calculates a coherence function between the plurality of channels from the cross spectrum between the plurality of channels and a power spectrum of each channel,
A correction coefficient generation unit that generates, for a virtual arrival direction group including a plurality of virtual arrival directions of the voice, a correction coefficient for correcting a voice arriving from each virtual arrival direction so that it matches between the plurality of channels,
A spectrum correction unit that corrects the power spectrum and the cross spectrum based on the correction coefficient, and generates a corrected power spectrum and a corrected cross spectrum,
A power information calculation unit that calculates power information about signal power between channels of the input audio signal based on the corrected power spectrum and the corrected cross spectrum,
A virtual integrated power spectrum calculation unit that calculates a power spectrum for integrated spectrum information obtained by integrating the spectrum information of the plurality of channels after correcting with the correction coefficient based on the corrected power spectrum and the corrected cross spectrum,
A gain coefficient calculation unit that weights the corrected cross spectrum based on the coherence function and the power information, and further normalizes the weighted corrected cross spectrum based on the virtual integrated power spectrum, to obtain a gain coefficient corresponding to each virtual arrival direction,
An audio signal processing device comprising: a sound source direction detection unit that detects a sound source direction of a sound input to the microphones based on the gain coefficients, and outputs the value of the gain coefficient corresponding to the detected sound source direction as a sound source gain coefficient.

A spectrum information correction unit that corrects the spectrum information of the plurality of channels based on the correction coefficient,
A signal integration unit that integrates the corrected plurality of channels of spectrum information into one channel to generate an integrated spectrum signal;
A coherence filter operation unit that filters the integrated spectrum signal using the coherence function,
37. The audio signal processing device according to claim 36, further comprising: an adjusting unit that adjusts the magnitude of the filtered integrated spectrum signal based on the sound source correlation coefficient.

A frequency analysis unit that frequency-analyzes the input audio signals of a plurality of channels output from a plurality of microphones that are spatially separated to generate spectral information of a plurality of channels,
A first modified cross-correlation coefficient calculation unit that calculates a first modified cross-correlation coefficient between channels of the input audio signals of the plurality of channels by using the spectral information of the plurality of channels as an input;
An adaptive spectrum correction unit that adaptively corrects a difference between channels of the spectrum information of the plurality of channels based on the first modified cross-correlation coefficient to generate corrected spectrum information;
A second modified cross-correlation coefficient calculation unit that calculates a second modified cross-correlation coefficient from the corrected spectrum information,
An audio signal processing device comprising the foregoing units, wherein each of the first and second modified cross-correlation coefficient calculation units includes: (a) a spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the spectrum information; (b) a coherence function calculation unit that calculates a coherence function between channels of the plurality of channels of spectrum information from the power spectrum and the cross spectrum; (c) a power information calculation unit that calculates power information on signal power between channels of the input audio signal from the power spectrum; and (d) a correlation coefficient calculation unit that calculates a cross-correlation coefficient between channels of the input audio signal by weighting the power spectrum and the cross spectrum based on the coherence function and the power information, and outputs it as the modified cross-correlation coefficient.

A frequency analysis unit for frequency-analyzing input audio signals of a plurality of channels output from a plurality of microphones spatially separated to generate first spectrum information of a plurality of channels;
A first modified gain coefficient calculation unit that calculates a first modified gain coefficient from the first spectrum information;
An adaptive spectrum correction unit that adaptively corrects a difference between channels of the first spectrum information based on the first modified gain coefficient to generate second spectrum information;
A second modified gain coefficient calculation unit that calculates a second modified gain coefficient from the second spectrum information,
An audio signal processing device comprising the foregoing units, wherein each of the first and second modified gain coefficient calculation units includes: (a) a spectrum calculation unit that calculates a power spectrum for each channel of the input audio signal and a cross spectrum between channels from the first or second spectrum information; (b) a coherence function calculation unit that calculates a coherence function between channels of the spectrum information of the plurality of channels from the power spectrum and the cross spectrum; (c) a power information calculation unit that calculates power information on signal power between channels of the input audio signal from the power spectrum; (d) an integration unit that integrates the spectrum information of the plurality of channels into one channel to generate an integrated spectrum signal; (e) an integrated signal power spectrum calculation unit that calculates a power spectrum of the integrated spectrum signal; and (f) a gain coefficient calculation unit that weights the cross spectrum based on the coherence function and the power information, and further normalizes the weighted cross spectrum based on the power spectrum of the integrated spectrum signal to calculate the first or second modified gain coefficient.