JP2004325127A

JP2004325127A - Sound source detection method, sound source separation method, and apparatus for executing them

Info

Publication number: JP2004325127A
Application number: JP2003117135A
Authority: JP
Inventors: Tomohiro Nakatani; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-04-22
Filing date: 2003-04-22
Publication date: 2004-11-18

Abstract

PROBLEM TO BE SOLVED: To provide a sound source detection method and a sound source separation method that improve detection and separation performance when a sound source is detected or separated from acoustic signals containing reverberation, by using the difference between the statistical properties of the inter-channel information of an acoustic signal when no reverberation exists and when reverberation exists, and to provide an apparatus for carrying out the methods.
SOLUTION: The methods receive acoustic signals observed by a plurality of sensors and apply to them inter-signal information extraction processing, direction signal extraction processing, instantaneous direction power calculation processing, and sound source position extraction processing, and optionally successive direction power calculation processing. The apparatus carries out these methods.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【産業上の利用分野】
この発明は、音源検出方法、音源分離方法、およびこれらを実施する装置に関し、特に、残響が存在しない場合の音響信号のチャンネル間情報が有する統計的性質と、残響が存在する場合の音響信号のチャンネル間情報が有する統計的性質との間の差違を利用することにより、残響を含む音響信号を用いて音源を検出する場合と音源を分離する場合において音源検出性能および音源分離性能を改善する音源検出方法、音源分離方法、およびこれらを実施する装置に関する。
【０００２】
【従来の技術】
音声信号は、日常の生活環境の中で収音されると、多くの場合、他の音が同時に発生していて、観測される音はその音（妨害音、と称す）が重畳した混合音になる。音声信号は、また、無響室などの特別な環境を除いて、通常は残響が重畳された信号となる。以上のことから、本来の音声信号の性質を抽出することが困難になると共に、音声自体の明瞭度が低下する。ここで、音源分離処理により音声信号に重畳した妨害音および残響を取り除くことで、音声本来の性質を抽出し易くすると共に、音声の明瞭度を向上させることができる。この音源分離処理はこれを他の様々の音響信号処理装置の要素技術として用いることで、その装置全体の性能向上につながる技術である。音源分離処理技術を要素技術に使用して装置全体の性能を向上することができる音響信号処理装置として、以下の様な装置を列挙することができる。
【０００３】
ａ．音源分離を前処理技術として用いる音声認識装置。
ｂ．音源分離により音声の明瞭度が向上するＴＶ会議装置などの通信装置。ｃ．講演の録音に含まれる妨害音および残響を除去することで、録音された音声の明瞭度が向上する再生装置。
ｄ．妨害音および残響を除去することで、聞き取り易さが向上する補聴器。
ｅ．人が歌ったり、楽器で演奏したり、或いはスピーカで演奏された音楽の残響およびそれ以外の妨害音を除去して、楽曲を検索したり、採譜したりする音楽情報処理装置。
ｆ．人が発した音声に反応して機械にコマンドをわたす機械制御インターフェース、および機械と人間との間の対話装置。
【０００４】
音源検出方法の従来例を図６を参照して具体的に説明する（非特許文献１参照）。
チャンネル間情報抽出処理は、複数のセンサを用いて収音した観測信号の各時系列（チャンネル、と称す）において信号波形の振幅が０を横切る時刻（ゼロ交差点）を抽出すると共に、異なるチャンネル同士で対応するゼロ交差点の時間差を抽出する。次に、ヒストグラム計算処理は、各ゼロ交差点の時間差に対応する音源位置を推定し、同一位置から来るゼロ交差点の数を音源位置の範囲毎にカウントして、時間差ヒストグラムを計算する。このために、ヒストグラム計算処理は、各音源位置から来る音のチャンネル間時間差がどの様な値を有するかという情報を予め所有しており、この情報を利用してゼロ交差点に対応する音源の位置を推定する。最後に、音源位置抽出処理が、時間差ヒストグラムで最大値を与える音源方向の範囲を抽出し、これが音源位置であると判定する。音源が複数ある場合は、時間差ヒストグムに複数のピークが現れるので、それぞれを音源位置として検出する。
【０００５】
次に、音源分離方法の従来例を図７を参照して説明する（非特許文献２参照）。
チャンネル間情報抽出処理は、各観測信号を周波数領域信号に変換した後、各周波数毎の信号の位相差と強度差を抽出する。これと音源検出処理の如き前処理を施して得られる音源位置情報とを音源抽出フィルタより成る音源抽出処理部が受け取る。音源抽出処理部は目的音源の位置に関する観測信号が持つべき属性と実際に観測されたチャンネル間情報とが一致する周波数を特定し、それ以外の周波数の信号を削除した後、周波数領域信号を時間領域信号に戻すことで分離音信号を出力する。
【０００６】
【非特許文献１】
黄捷、大西昇，杉江昇，「時間差ヒストグラムを用いた複数音源定位システム」，日本ロボット学会誌１，９，１，ｐｐ．２９−３８，１９９１．０２．
【非特許文献２】
Ａｏｋｉ，Ｍ．，Ｏｋａｍｏｔｏ，Ｍ．，Ａｏｋｉ，Ｓ．，Ｍａｔｓｕｉ，Ｈ．，Ｓａｋｕｒａｉ，Ｔ．，ａｎｄ
Ｋａｎｅｄａ，Ｙ．，“Ｓｏｕｎｄｓｏｕｒｃｅｓｅｇｒｅｇａｔｉｏｎｂａｓｅｄｏｎｅｓｔｉｍａｔｉｎｇｉｎｃｉｄｅｎｔａｎｇｌｅｏｆｅａｃｈｆｒｅｑｕｅｎｃｙｃｏｍｐｏｎｅｎｔｏｆｉｎｐｕｔｓｉｇｎａｌｓａｃｑｕｉｒｅｄｂｙｍｕｌｔｉｐｌｅｍｉｃｒｏｐｈｏｎｅｓ”，Ｊ．Ａｃｏｕｓｔ．Ｓｏｃ．Ｊｐｎ．（Ｅ），２２，２，ｐｐ．１４９−１５７，２００１．
【０００７】
【発明が解決しようとする課題】
上述した音源検出方法の従来例および音源分離方法の従来例においては、残響を含んだ音響信号を入力信号として受け取ると、残響の影響で各音響信号の属性を正確に抽出することができなくなるので、正しくチャンネル間情報を抽出することができなくなる。このために、音源検出方法の従来例は、正しい音源位置を検出することができなくなったり、或いは、存在しない音源位置を検出する誤りを起こし易くなる。また、音源分離方法は、音源方向の周波数を誤って特定することに起因して、必要な周波数を抽出し損なったり、或いは、余分な周波数に含まれる別の音源の音を混在させるという様な誤りを生起する。その結果、分離される音響信号の品質が劣化するに到る。
【０００８】
以上の通り、従来、用いられてきた音源検出方法および音源分離方法は、残響を含む環境においては正しく音源検出をすることができなかったり、分離する音響信号の品質が劣化するという問題を内包するものであった。
この発明は、残響が存在しない場合の音響信号のチャンネル間情報が有する統計的性質と、残響が存在する場合の音響信号のチャンネル間情報が有する統計的性質との間の差違を利用することにより、残響を含む音響信号を用いて音源を検出する場合と音源を分離する場合において音源検出性能および音源分離性能を改善する音源検出方法、音源分離方法、およびこれらを実施する装置を提供するものである。
【０００９】
【課題を解決するための手段】
複数のセンサで観測された音響信号入力を受信し、この受信した音響信号入力に対して信号間情報抽出処理と、方向信号抽出処理と、瞬時方向パワー計算処理と、音源位置抽出処理とを施し、或いは継時方向パワー計算処理を更に施す音源検出方法を構成した。
先の音源検出方法において、
【数１】

の統計値の計算を実施してこれらを方向信号抽出処理に対して予め設定しておく音源検出方法を構成した。
【００１０】
先の音源検出方法において、更に、最大値方向検出処理、方向信号抽出範囲決定処理、音源抑制処理の内の何れか一つ以上の処理を組み合わせて施す音源検出方法を構成した。
ここで，複数のセンサで観測された音響信号入力を受信し、この受信した音響信号入力に対して信号間情報抽出処理と、方向信号抽出処理と、瞬時方向パワー計算処理と、音源位置抽出処理とを施し、或いは継時方向パワー計算処理を更に施す音源検出処理を実施し、方向信号抽出処理により抽出した音響信号を分離音として出力する音源分離方法を構成した。
【００１１】
先の音源分離方法において、
【数２】

の統計値の計算を実施しこれらを方向信号抽出処理に対して予め設定しておく音源分離方法を構成した。
【００１２】
先の音源分離方法において、音源検出処理に、更に、最大値方向検出処理、方向信号抽出範囲決定処理、音源抑制処理の内の何れか一つ以上の処理を組み合わせて施す音源分離方法を構成した。
ここで、複数チャンネルの観測信号を入力してチャンネル間時間差信号および強度差信号を求める信号間情報抽出処理部１を具備し、妨害音および残響の存在しない条件における、チャンネル間の信号の強度差の平均値および標準偏差値、チャンネル間の信号の時間差の平均値および標準偏差値の統計値を予め計算して設定記憶しておき、或る方向範囲内の方向信号を抽出する方向信号抽出処理部２を具備し、方向範囲に含まれると判定した周波数のパワーの総和を瞬時方向パワーとして求める瞬時方向パワー計算処理部４を具備し、方向パワーを入力して音源位置を出力する音源位置抽出処理部７を具備し、或いは、更に、方向範囲別に瞬時方向パワーを加算したパワーの総和を計算して継時方向パワーを求める継時方向パワー計算処理部６を具備する音源検出装置を構成した。
【００１３】
先の音源検出装置において、求められた瞬時方向パワーの最大値を与える方向範囲を選択し、それ以外の方向範囲のパワーの総和を０に置き換える最大値方向検出処理部５と、継時方向パワー計算処理部６により求められた継時方向パワーを受け取ると共に音源位置抽出処理部７により検出された音源方向を受け取り、対応する方向信号抽出範囲を決定して方向信号抽出処理部２に出力する方向信号抽出範囲決定処理部３と、音源位置抽出処理部７により検出された音源方向を入力すると共に方向信号抽出範囲決定処理部３により決定された方向信号抽出範囲を入力し、消去されるべき方向範囲を計算し、信号間情報抽出処理部１に出力する音源抑制処理部８の３処理部の内の何れか一つ以上の処理部を、更に、組み合わせて具備する音源検出装置を構成した。
【００１４】
そして、複数チャンネルの観測信号を入力してチャンネル間時間差信号および強度差信号を求める信号間情報抽出処理部１を具備し、妨害音および残響の存在しない条件における、チャンネル間の信号の強度差の平均値および標準偏差値、チャンネル間の信号の時間差の平均値および標準偏差値の統計値を予め計算して設定記憶しておき、或る方向範囲内の方向信号を抽出する方向信号抽出処理部２を具備し、方向範囲に含まれると判定した周波数のパワーの総和を瞬時方向パワーとして求める瞬時方向パワー計算処理部４を具備し、方向パワーを入力して音源位置を出力する音源位置抽出処理部７を具備し、或いは、更に、方向範囲別に瞬時方向パワーを加算したパワーの総和を計算して継時方向パワーを求める継時方向パワー計算処理部６を具備し、方向信号抽出処理部２に音響信号を分離音として出力する分離音出力部を形成した音源分離装置を構成した。
【００１５】
先の音源分離装置において、求められた瞬時方向パワーの最大値を与える方向範囲を選択し、それ以外の方向範囲のパワーの総和を０に置き換える最大値方向検出処理部５と、継時方向パワー計算処理部６により求められた継時方向パワーを受け取ると共に音源位置抽出処理部７により検出された音源方向を受け取り、対応する方向信号抽出範囲を決定して方向信号抽出処理部２に出力する方向信号抽出範囲決定処理部３と、音源位置抽出処理部７により検出された音源方向を入力すると共に方向信号抽出範囲決定処理部３により決定された方向信号抽出範囲を入力し、消去されるべき方向範囲を計算し、信号間情報抽出処理部１に出力する音源抑制処理部８の３処理部の内の何れか一つ以上の処理部を、更に、組み合わせて具備し、方向信号抽出処理部２に音響信号を分離音として出力する分離音出力部を形成した音源分離装置を構成した。
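[0016]
[Embodiments of the Invention]
Embodiments of the present invention are described below.
The invention exploits the way the statistical properties of the inter-channel information of an acoustic signal change when reverberation is present, compared with the case without reverberation. That is, by using the difference between the statistical properties of the inter-channel information when no reverberation exists and when reverberation exists, the sound source detection method and sound source separation method improve detection and separation performance when a source is detected or separated from acoustic signals that contain reverberation.
[0017]
Before this is described, (1) the basic properties of inter-channel information are explained, followed by (2) its statistical properties under reverberation. Finally, (3) the principle of sound source detection and separation using those statistical properties is explained.
((1) Basic properties of inter-channel information)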
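First, consider the frequency-domain signals obtained by frequency-transforming the acoustic signals $x_i(t)$ that arrive from some source direction and are observed by a plurality of sensors (i is the index of the sensor that produced the input). The difference between the times the signal at each frequency f needs to travel from the source position to sensors i and j is called the inter-channel time difference $\Delta t_{ij}(f)$ between channel i and channel j. Because measuring the time difference directly from the observed signals is difficult, in practice it is estimated from the phase difference $\Delta\phi_{ij,k}(f)$ of the signals. These values are computed as follows.
$\phi_{i,k}(f) = \angle X_{i,k}(f)$ ... (1)
$\Delta\phi_{ij,k}(f) = \phi_{i,k}(f) - \phi_{j,k}(f)$ ... (2)
$\Delta t_{ij,k}(f) = (\Delta\phi_{ij,k}(f) + 2\pi n) / (2\pi f)$ ... (3)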
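Equations (1) to (3) can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `interchannel_time_difference`, the synthetic test values, and the use of a single complex number per channel and frequency are assumptions made here.

```python
import cmath
import math

def interchannel_time_difference(X_i, X_j, f, n=0):
    """Time difference in seconds at frequency f (Hz) for wrap count n."""
    phi_i = cmath.phase(X_i)   # equation (1) for channel i
    phi_j = cmath.phase(X_j)   # equation (1) for channel j
    dphi = phi_i - phi_j       # equation (2)
    return (dphi + 2 * math.pi * n) / (2 * math.pi * f)   # equation (3)

# Synthetic pair (assumed values): the phase difference corresponds to 0.2 ms at 500 Hz.
f = 500.0
X_i = cmath.exp(1j * 2 * math.pi * f * 0.0002)
X_j = 1 + 0j
dt = interchannel_time_difference(X_i, X_j, f)
```

At this low frequency the phase difference stays inside one period, so `n=0` recovers the time difference directly.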
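[0018]
Here, $x_{i,k}(t)$ denotes the signal obtained by applying a time window to $x_i(t)$ and splitting it into frames, with k the frame index, and $X_{i,k}(f)$ denotes the value at frequency f of the frequency signal obtained by applying the discrete-time Fourier transform to it. $\angle X_{i,k}(f)$ is its phase ($-\pi < \angle X_{i,k}(f) < \pi$). Since the phase difference is a function with period 2π, when $|2\pi f \Delta t_{ij}(f)|$ exceeds π the correspondence with the time difference is kept by adding 2π times some integer n to the phase difference, as in equation (3). n is an integer satisfying the following constraint.
$\Delta t_{ij,\min}(f) \le \Delta t_{ij,k}(f) \le \Delta t_{ij,\max}(f)$ ... (4)
Here $\Delta t_{ij,\min}(f)$ and $\Delta t_{ij,\max}(f)$ are the minimum and maximum time differences assumed to be possible at frequency f between the signals of channel i and channel j.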
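The role of the integer n can be made concrete by listing every time difference that is consistent with one observed phase difference and with an assumed physical range. The sketch below is illustrative only: the function name, the inclusive bounds, and the ±1 ms range are assumptions, not values from the source.

```python
import math

def time_difference_candidates(dphi, f, dt_min, dt_max):
    """All dt = (dphi + 2*pi*n) / (2*pi*f), n integer, with dt_min <= dt <= dt_max."""
    two_pi_f = 2 * math.pi * f
    n_lo = math.ceil((two_pi_f * dt_min - dphi) / (2 * math.pi))
    n_hi = math.floor((two_pi_f * dt_max - dphi) / (2 * math.pi))
    return [(dphi + 2 * math.pi * n) / two_pi_f for n in range(n_lo, n_hi + 1)]

# Same phase difference (0.5 rad) at a low and a high frequency, range assumed to be +/- 1 ms.
low = time_difference_candidates(0.5, 200.0, -0.001, 0.001)
high = time_difference_candidates(0.5, 3000.0, -0.001, 0.001)
```

At 200 Hz only one candidate survives, while at 3000 Hz several do, which is the ambiguity the text describes for high frequencies.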
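[0019]
The inter-channel intensity difference $\Delta I_{ij,k}(f)$ is defined as follows.

Observed signals reaching the sensors from different directions are known to have different time differences and different intensity differences, and this property can be used to estimate the direction from which an observed signal arrives. To do so, the values that the time difference and intensity difference take for sound arriving from each direction must be determined in advance, before the actual measurement. These values can be obtained either 1) analytically from the physical arrangement of the sensors or 2) by actually measuring acoustic signals and extracting the values. Method 2) is described here.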
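[0020]
First, a sound source is placed in each assumed source direction d and its signal is measured with the sensors. The source signal may be white noise, which contains all frequencies in the band, a TSP (time-stretched pulse) signal as used for transfer-function measurement, or speech itself. Next, the intensity difference $\Delta I_{ij,k}(f)$ and time difference $\Delta t_{ij,k}(f)$ are computed for every frame of the measured signal. Even in an environment without reverberation, these values spread over a certain range because of observation error and the acoustic characteristics of the sensors themselves. These properties can be expressed statistically, using
(Equation 3)

as follows.
(Equation 4)

Here E(·) denotes the function that computes the mean. In this calculation, $\Delta t_{ij,k}(f)$ is not uniquely determined at frequencies where $|2\pi f \Delta t_{ij}(f)|$ exceeds π, so its value must be fixed in another way. One option, given that the inter-channel time difference does not vary greatly with frequency, is to take the time difference at $f_{\max}$, the highest frequency at which the calculation is still unique, as the estimate for all higher frequencies.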
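A small sketch of the per-direction, per-frequency statistics follows. The exact definitions are in equation images that are not reproduced in this text, so the population standard deviation used here is an assumption, as are the function name and the sample values.

```python
import statistics

def channel_statistics(values_per_frame):
    """Mean and standard deviation over frames of one inter-channel quantity."""
    mean = statistics.fmean(values_per_frame)
    # Assumption: population standard deviation around the frame mean.
    std = statistics.pstdev(values_per_frame, mu=mean)
    return mean, std

# Hypothetical intensity differences (dB) at one frequency for a source at one direction d.
frames_dI = [5.8, 6.1, 6.0, 6.3, 5.8]
mean_dI, std_dI = channel_statistics(frames_dI)
```

The same helper would be applied to the time differences, once per assumed direction and frequency.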
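[0021]
The mean and standard deviation above can also be computed more simply by measuring transfer functions. First, a measurement loudspeaker is placed at each assumed source position and the impulse response from that position to each sensor is measured. The impulse response can be measured with a TSP signal, an M-sequence signal, or any other known method. Once the impulse responses are available, convolving a source signal with the impulse response from an assumed source position simulates, on a computer, the signal the sensors would observe from a source placed there. The mean and standard deviation of the inter-channel information can then be obtained from this simulated signal by the same procedure as above.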
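The simulation step amounts to one linear convolution per sensor. The sketch below uses a direct-form convolution on short made-up sequences; the impulse responses `h_left` and `h_right` are hypothetical and far shorter than a measured room response.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for m, h in enumerate(impulse_response):
            out[n + m] += s * h
    return out

source = [1.0, 0.5, -0.25]
h_left = [0.0, 1.0, 0.0, 0.3]         # hypothetical: 1-sample delay plus a weak echo
h_right = [0.0, 0.0, 0.8, 0.0, 0.2]   # hypothetical: 2-sample delay, attenuated, weak echo
left = convolve(source, h_left)
right = convolve(source, h_right)
```

For measured responses thousands of samples long, an FFT-based convolution would be the practical choice; the direct form is used here only to keep the example self-contained.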
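[0022]
The method above yields the mean and standard deviation of the inter-channel information only for the discrete positions at which a source was placed. Values for unmeasured source positions can be approximated from the measured values by, for example, linear interpolation. This gives inter-channel information for a continuous range of positions. The description below assumes that processing uses values for continuous positions obtained in this way.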
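Linear interpolation between measured directions might look like the sketch below. The function name, the clamping outside the measured span, and the sample values are assumptions made for illustration.

```python
def interpolate_statistic(measured, direction):
    """Linearly interpolate a statistic between the two nearest measured directions.

    measured: list of (direction_deg, value) pairs sorted by direction.
    Directions outside the measured span are clamped to the nearest end (assumption).
    """
    if direction <= measured[0][0]:
        return measured[0][1]
    for (d0, v0), (d1, v1) in zip(measured, measured[1:]):
        if d0 <= direction <= d1:
            w = (direction - d0) / (d1 - d0)
            return v0 + w * (v1 - v0)
    return measured[-1][1]

# Hypothetical mean time differences (s) measured at -30, 0, and 30 degrees.
measured_dt = [(-30.0, -0.00025), (0.0, 0.0), (30.0, 0.00025)]
value = interpolate_statistic(measured_dt, 12.0)
```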
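[0023]
With the statistics obtained as above, it is possible to estimate, for a given observed signal, whether the sound at each frequency arrives from a given direction range D. That is, the signal at a frequency is judged to arrive from direction D when the inter-channel phase difference $\Delta\phi_{ij}(f)$ and intensity difference $\Delta I_{ij}(f)$ observed at that frequency satisfy one or both of the following constraints.

where
(Equation 5)

$C_1$ and $C_2$ are positive constants that specify how wide a range is judged to be sound arriving from D, and take a value such as 1. n denotes an arbitrary integer.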
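Because the constraint itself and the quantities it uses are given in equation images that are not reproduced here, the following is only a plausible reading: a frequency is accepted when each cue lies within a constant times the stored standard deviation of the stored mean. The pairing of `c1` with the intensity difference and `c2` with the time difference, the dictionary keys, and the requirement that both cues pass are assumptions.

```python
import math

def in_direction_range(dI, dphi, f, stats, c1=1.0, c2=1.0, n_range=range(-3, 4)):
    """Hedged membership test: both cues within c * sigma of the stored means."""
    intensity_ok = abs(dI - stats["mean_dI"]) <= c1 * stats["std_dI"]
    # Any wrap count n may bring the time difference into range.
    time_ok = any(
        abs((dphi + 2 * math.pi * n) / (2 * math.pi * f) - stats["mean_dt"])
        <= c2 * stats["std_dt"]
        for n in n_range
    )
    return intensity_ok and time_ok

# Hypothetical statistics for one direction range at 1 kHz.
stats = {"mean_dI": 6.0, "std_dI": 0.5, "mean_dt": 0.0003, "std_dt": 0.00005}
dphi = 2 * math.pi * 1000.0 * 0.00032
accepted = in_direction_range(6.3, dphi, 1000.0, stats)
rejected = in_direction_range(8.0, dphi, 1000.0, stats)
widened = in_direction_range(8.0, dphi, 1000.0, stats, c1=5.0)
```

The last call shows the effect of enlarging the constant: an intensity difference that fails at `c1=1` passes at `c1=5`.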
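[0024]
Extending this method allows it to estimate the direction of arrival of the observed signal. 1) First, the source direction is divided into several direction ranges $D_i$ (i = 1 to M). The ranges $D_i$ may overlap one another. 2) Next, the method above determines which range the component of the observed signal $X_k(f)$ at frequency f falls in. 3) The sum $P_k(D_i)$ of the power of the frequencies assigned to each direction range is computed.
$P_k(D_i) = \sum_{f:\, X_k(f) \in D_i} |X_k(f)|^2$ ... (15)
Finally, 4) the direction range with the largest total power, $D_{k,\max} = \arg\max_{D_i} P_k(D_i)$, is taken as the source direction.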
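Steps 3) and 4) can be sketched as follows, assuming step 2) has already produced one range index (or `None`) per frequency bin. Assigning each bin to at most one range is a simplification made here; the text allows overlapping ranges.

```python
def instantaneous_direction_power(spectrum, assigned_range, num_ranges):
    """Equation (15): sum |X_k(f)|^2 over the bins assigned to each direction range."""
    power = [0.0] * num_ranges
    for x, r in zip(spectrum, assigned_range):
        if r is not None:          # bins matching no range contribute nothing
            power[r] += abs(x) ** 2
    return power

# Hypothetical four-bin frame and per-bin range assignment.
spectrum = [1 + 1j, 2 + 0j, 0.5j, 3 + 4j]
assigned = [0, 2, None, 2]
power = instantaneous_direction_power(spectrum, assigned, 3)
best = max(range(len(power)), key=power.__getitem__)   # argmax over direction ranges
```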
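[0025]
In the following, the total power computed per direction range is called the direction power, a peak of the direction power is called a sound image, and the source direction estimated from a sound image is called the direction of the sound image. The direction power can also be computed from the inter-channel intensity difference alone or from the time difference alone, and either can likewise be used to estimate the source direction. The former is called the intensity-difference-based direction power and the latter the time-difference-based direction power. Further, the source direction obtained from the former is called the intensity-difference-based sound image direction, and that obtained from the latter the phase-difference-based sound image direction.
[0026]
When there are multiple sources, a source can be judged to exist in each direction range where the direction power has a local maximum whose value exceeds a threshold. This performs source detection and source position estimation at the same time.
The direction power can be accumulated over a single frame, as in equation (15), or the single-frame direction power can be further summed over multiple frames for each direction range, as in equation (16).
$P(D_i) = \sum_k P_k(D_i)$ ... (16)
[0027]
The former is called the instantaneous direction power and the latter the successive direction power. The instantaneous direction power can detect a source that exists only momentarily, but is also more likely to pick up sound images of noise, reverberation, and other non-target sound by mistake. Conversely, the successive direction power makes momentary sources harder to detect, but direction power originating from nonexistent sources becomes relatively weak and is therefore easy to remove by thresholding.
[0028]
This inter-channel information can also be used for source separation. First, a range D containing the direction of the source to be separated is defined. Next, the intensity difference $\Delta I_{ij,k}(f)$ and phase difference $\Delta\phi_{ij,k}(f)$ are computed from the frequency-transformed observed signal, and equation (10) is used to find the frequencies f that belong to D. Finally, the frequency components not belonging to D are replaced with 0 and the frequency signal is transformed back to a time signal, which separates only the source in the target direction.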
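[0029]
((2) Statistical properties of inter-channel information under reverberation)
In a reverberant environment, the superimposed reverberation changes the properties of the inter-channel information described above. As a result, source detection and separation cannot be performed properly with the mean time difference, mean intensity difference, and standard deviations measured without reverberation (hereinafter, the statistics of the inter-channel information).
If, on the other hand, the way the time difference and intensity difference change with reverberation can be predicted, source detection and separation can be expected to work properly even under reverberation. To this end, the invention uses the following statistical properties of the time difference and intensity difference under reverberation.
[0030]
1. As reverberation increases, the value of
(Equation 6)

for each kind of inter-channel information increases.
2. Compared with the case without reverberation, the mean inter-channel intensity difference $\overline{\Delta I}_{ij}(d, f)$ changes as reverberation increases. For example, in a room where reverberation arrives equally from all directions, the inter-channel intensity difference tends toward 0 as the influence of reverberation grows. Consequently, if the source position is estimated from the reverberation-free statistics of the intensity difference, the direction of the sound image is shifted.
3. The mean inter-channel time difference $\overline{\Delta t}_{ij}(d, f)$ tends to change in the same way. However, the shift in the source position estimated from the reverberation-free time-difference statistics is not as large as the shift based on the intensity difference.
[0031]
4. As noted above, the source direction estimated from the intensity difference and the sound image direction estimated from the phase difference shift differently, so combining the two cues produces mutual contradictions. The direction power peak itself then fails to form properly. This is a major cause of reduced detection accuracy.
5. Within one observed signal, comparing the shift of the estimated source direction across time intervals and frequency regions shows that the shift tends to be smaller in high-power intervals and regions than in low-power ones. This is because low-power intervals and regions tend to be more strongly affected by reverberation.
[0032]
6. A single source may form a direction power peak (sound image) in a direction other than the one corresponding to the target source signal (hereinafter, a pseudo sound image). This causes false detections. However, when the target source signal is removed from the observed signal, either by subtracting from the observed signal the source signal separated from the mixture by a source separation technique, or by constructing a signal in which only the frequencies judged to be sound from the target direction are removed, the pseudo sound image in many cases disappears together with the peak (sound image) in the direction of the target source signal.
7. Conversely, when a high-power source is present, another sound image may be masked by the strong sound and go undetected. This is called a hidden sound image. A hidden sound image can sometimes be detected by removing the high-power source.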
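[0033]
((3) Source detection and separation using the statistical properties under reverberation)
To perform source detection and separation properly in a reverberant environment, the invention uses properties 1 to 7 above and carries out the following processing.
First. In response to the shift of the sound image, the parameters of the source extraction process of equation (10) are changed and the criterion for judging arrival from the target direction is widened, which achieves more appropriate source detection and separation (direction signal extraction range determination processing).
Source detection
$C_1$ and $C_2$ in equations (11) to (14) are set to large values. In particular, the criterion is widened further for the intensity difference, whose deviation from the reference value grows more as reverberation increases. This broadens the direction power peaks, so that even if the sound image direction from the intensity difference and that from the time difference are shifted relative to each other, the peaks overlap more. The two cues therefore no longer contradict each other when combined, and a direction power peak forms near the correct source direction.
[0034]
Source separation
As with source detection above, using large values of $C_1$ and $C_2$ for separation can raise the separation accuracy of the target source under reverberation. However, simply widening the criterion extracts more of the target source components but also admits more unwanted components such as reverberation. A further technique is introduced to reduce this. When reverberation shifts the direction power, most frequency components are thought to be assigned to the direction range lying between the true direction of the target source and the direction of the intensity-difference-based sound image. The criterion $C_1$ therefore need not be widened equally above and below the mean; it only needs to be widened in the direction of the sound image shift, by the amount of the shift. This extracts only the sound in the required direction range without widening the range unnecessarily. The same approach can be used for the time-difference criterion.
[0035]
Second. When the successive direction power is computed, the power of every direction range other than the one giving the maximum of each frame's instantaneous direction power is replaced with 0 before summation. The direction range giving the maximum in each frame is expected to be relatively little affected by reverberation, whereas the other ranges are expected to be more strongly affected. Summing only the maxima of the instantaneous direction power therefore yields a successive direction power that gives priority to the more reliable information (maximum value direction detection processing).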
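The difference between plain accumulation over frames (equation (16)) and accumulation that keeps only each frame's strongest direction range can be sketched as below. The function name, the `max_only` flag, and the three-frame example are assumptions made for illustration.

```python
def successive_direction_power(frames, max_only=True):
    """Sum per-frame direction power; optionally keep only each frame's maximum range."""
    total = [0.0] * len(frames[0])
    for power in frames:
        if max_only:
            best = max(range(len(power)), key=power.__getitem__)
            total[best] += power[best]      # all other ranges count as 0
        else:
            for i, p in enumerate(power):
                total[i] += p
    return total

# Hypothetical instantaneous direction power for three frames and three ranges.
frames = [[4.0, 1.0, 0.5], [3.0, 1.5, 0.2], [0.5, 1.0, 2.0]]
plain = successive_direction_power(frames, max_only=False)
robust = successive_direction_power(frames, max_only=True)
```

In this example the middle range never wins a frame, so its accumulated power drops to zero under the maximum-only rule, while the two ranges that dominate at least one frame keep their peaks.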
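[0036]
Third. When multiple sources and sound images exist, pseudo sound images and hidden sound images may be present, and the sound image of the target sound may also be observed at a shifted position. An effective way to resolve these problems is to remove from the observed signal some of the source signals corresponding to detected sound images, as described below (sound source suppression processing).
[0037]
Removing pseudo sound images
When multiple sources are detected, pseudo sound images may be among them. To remove them, the sources corresponding to some of the detected sound images are removed from the observed signal, and any sound image that disappears together with them is judged to be a pseudo sound image.
Detecting hidden sound images
Conversely, if a sound image appears when the source corresponding to one sound image is removed from the observed signal, it is judged to be a hidden sound image and is detected as the sound image of a target source.
Accurate sound image position extraction
The source separation described under "First" requires extracting the peak position of the intensity-difference-based direction power (or of the phase-difference-based direction power) in order to set the criterion. In a mixture, however, interference from other sounds can make this peak position hard to find. In that case the sound image removal operation used to detect pseudo and hidden sound images is effective. That is, extracting the peak position of the intensity-difference-based direction power (or of the phase-difference-based direction power) after removing from the observed signal the sources corresponding to sound images other than the target source gives a more accurate value in a reverberant environment.
[0038]
Each of the processes described above for improving source detection and separation accuracy under reverberation is effective on its own, and they work synergistically in combination. They can be used selectively, incorporating only the processes needed for the purpose.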
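[0039]
[Examples]
An example of the invention is described with reference to the drawings.
Referring to FIG. 2, for simplicity the description assumes that the observed signal consists of two channels and that the microphones are placed at the ear positions of a robot whose shape and size resemble a human upper body. The source directions at which loudspeakers are arranged are assumed to span 95° to the left and right of the robot's front (0°), with the left at −95°, and are divided into 10° ranges (i = 1 to 19). The angular ranges $D_i$ shift in 10° steps: $D_1$ is −95° to −85°, $D_2$ is −85° to −75°, and so on, with $D_{10}$ corresponding to −5° to 5° and $D_{19}$ to 85° to 95°.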
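The 19 direction ranges of this example are easy to generate and index. The sketch below is illustrative; treating each range as half-open (lower bound included, upper bound excluded) is an assumption, since the text does not say which range owns a boundary angle.

```python
def direction_ranges(start=-95.0, width=10.0, count=19):
    """D_1 .. D_count as (low, high) pairs in degrees."""
    return [(start + i * width, start + (i + 1) * width) for i in range(count)]

def range_index(angle, ranges):
    """1-based index i of the first range D_i with low <= angle < high, or None."""
    for i, (low, high) in enumerate(ranges, start=1):
        if low <= angle < high:
            return i
    return None

ranges = direction_ranges()
front = range_index(0.0, ranges)    # the robot's front falls in D_10
```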
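[0040]
(Source detection method using maximum value direction detection processing)
In FIG. 1, the inter-signal information extraction processing unit 1 first receives the digital input signal $x_i(t)$ of channel i, applies a time window to split it into short (for example, about 30 ms) speech segments (frames) $x_{i,k}(t)$, and then converts them into frequency signals $X_{i,k}(f)$ using a technique such as the short-time discrete Fourier transform. Next, the inter-channel time difference $\Delta t_{ij,k}(f)$ and intensity difference $\Delta I_{ij,k}(f)$ at each frequency are computed from equations (3) and (5). Because the time difference is not necessarily unique at high frequencies, all time differences satisfying equation (4) are kept as candidate values. Alternatively, at high frequencies where the time difference is not unique, the source direction can be determined from the intensity difference alone, without applying the time-difference condition.
[0041]
In the direction signal extraction processing unit 2, the statistics
(Equation 7)

for conditions free of interfering sound and reverberation are calculated and stored in advance, before the actual direction signal extraction is performed.
The direction signal extraction processing unit 2 determines, for each direction range, the frequencies that satisfy equation (10), based on the criterion set by the direction signal extraction range determination processing unit 3. At a frequency with multiple time-difference candidates, the criterion is judged to be met if any one candidate satisfies equation (10).
The instantaneous direction power calculation processing unit 4 computes, for each frame and each direction range, the total power of the frequencies judged to belong to that range, according to equation (15), as the instantaneous direction power.
[0042]
The maximum value direction detection processing unit 5 selects the direction range giving the maximum of the instantaneous direction power computed for each frame and replaces the total power of all other direction ranges with 0. That is, when the successive direction power calculation processing unit 6 next computes the successive direction power, the power of every direction range other than the one giving the maximum of each frame's instantaneous direction power is replaced with 0 before summation. The direction range giving the maximum in each frame is expected to be relatively little affected by reverberation, whereas the other ranges are expected to be more strongly affected. Summing only the maxima of the instantaneous direction power therefore yields a successive direction power that gives priority to the more reliable information.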
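[0043]
The successive direction power calculation processing unit 6 computes the successive direction power by summing the instantaneous direction power for each direction range, according to equation (16), over the frames in a fixed period of, for example, about 2 seconds.
The sound source position extraction processing unit 7 detects a source in each direction that gives a peak of the successive direction power exceeding a threshold θ. Various ways of setting θ are conceivable; for example, it can be set adaptively according to the maximum of the successive direction power, $\max\{P(D_i)\}$, as follows.
$\theta = \max\{\theta_0,\ c_1 \max\{P(D_i)\}\}$ ... (17)
[0044]
Here $c_1$ is a constant that sets the threshold for detecting sources from the direction power in other directions relative to the maximum direction power, and may be about 0.1, for example. $\theta_0$ is the threshold for detecting the presence of sound when a single source is emitting, and is determined by the input level of the acoustic signal input device.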
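A minimal sketch of peak picking with the adaptive threshold of equation (17) follows. Treating a peak as a bin that is not smaller than its neighbours and strictly above θ is an assumption made here, as are the function name and the sample values.

```python
def detect_sources(power, theta_0, c_1=0.1):
    """Indices of local maxima of the successive direction power above theta."""
    theta = max(theta_0, c_1 * max(power))       # equation (17)
    found = []
    for i, p in enumerate(power):
        left = power[i - 1] if i > 0 else float("-inf")
        right = power[i + 1] if i + 1 < len(power) else float("-inf")
        if p > theta and p >= left and p >= right:
            found.append(i)
    return found

# Hypothetical successive direction power over seven direction ranges.
power = [0.1, 0.4, 9.0, 2.0, 0.3, 1.5, 0.2]
sources = detect_sources(power, theta_0=0.5)
strict = detect_sources(power, theta_0=2.0)
```

With `theta_0=0.5` the threshold is set by `c_1` times the largest peak (0.9), so the weaker peak at index 5 is still detected; raising `theta_0` to 2.0 removes it.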
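The above processing yields a source detection method in which the successive direction power is computed using only the power in the maximum direction of the instantaneous direction power. Because the successive direction power is computed from inter-channel information that is relatively unaffected by reverberation, source detection becomes more accurate.
[0045]
(Source detection method using sound source suppression processing)
The detection method described above, however, cannot prevent pseudo sound images from being detected or hidden sound images from being missed. The following sound source suppression technique is effective against this. It is implemented by adding sound source suppression processing to the preceding detection method.
The sound source suppression processing unit 8 receives the source directions detected by the sound source position extraction processing unit 7 and the direction signal extraction range determined by the direction signal extraction range determination processing unit 3, calculates the direction range to be removed, and outputs it to the inter-signal information extraction processing unit 1.
[0046]
First, an example of a method for excluding pseudo sound images is described. Suppose the procedure above has detected, in the sound source position extraction processing unit 7, a set of $n_0$ sound images $\{S_i^0\}$ (i = 1 to $n_0$). Next, one of these sound images, $S_j^0$, is provisionally suppressed from the observed signal in the sound source suppression processing unit 8. That is, the frequency components of the frequency-domain signal that satisfy equation (10) for the direction of $S_j^0$ are all replaced with 0 in every channel, and the result is transformed back to a time-domain signal. This signal is then used as a new input to the source position extraction processing described above, and the same processing is performed. If any sound image is no longer detected along with the suppressed one, the sound source position extraction processing unit 7 judges it to be a pseudo sound image. That is, if suppressing sound image $S_j^0$ results in the detected set $\{S_i^j\}$ (i = 1 to $n_j$), a sound image $S_i^0$ satisfying the following is called a pseudo sound image.
$\{ S_i^0 \mid S_i^0 \ne S_k^j \ \text{for}\ \forall k,\ i \ne j \}$ ... (18)
Furthermore, when this suppression operation is carried out individually for every sound image $S_j^0$ (j = 1 to $n_0$), any image judged to be a pseudo sound image at least once is judged to be a pseudo sound image.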
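The set logic of equation (18) can be sketched independently of the signal processing. In the sketch below, sound images are represented by their direction in degrees and compared by exact equality, which is an assumption; a practical system would need a matching tolerance. The re-detection results in `after` are made up.

```python
def pseudo_images(initial, after_suppression):
    """Images that vanish when a different image is suppressed.

    initial: set of initially detected image directions.
    after_suppression: dict mapping a suppressed image j to the set detected afterwards.
    """
    pseudo = set()
    for j, remaining in after_suppression.items():
        # An image other than j that is no longer detected is flagged; one hit is enough.
        pseudo |= {s for s in initial if s != j and s not in remaining}
    return pseudo

initial = {-60, 0, 30}
# Hypothetical outcome: suppressing the image at -60 also removes the image at 0.
after = {-60: {30}, 0: {-60, 30}, 30: {-60, 0}}
result = pseudo_images(initial, after)
```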
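[0047]
Next, an example of detecting hidden sound images is described. After the same suppression operation as above, the sound source position extraction processing unit 7 judges a sound image S satisfying the following condition to be a hidden sound image.
$\{ S \mid \{ S \ne S_i^0 \ \text{for}\ \forall i \} \cap \{ S = S_k^j \ \text{for}\ \exists k\ \forall j \} \}$ ... (19)
That is, among the sound images not contained in $\{S_i^0\}$, one that is always detected whichever single image of $\{S_i^0\}$ is suppressed is detected as a hidden sound image. Because this condition requires the hidden image to be detected no matter which single image is removed, it may be too strict for some purposes. In that case, instead of requiring equation (19) to hold for all j, it can be required to hold only for those j for which the power $P(S_j^0)$ of $S_j^0$ is at least a constant multiple $c_2 \max_l P(S_l^0)$ of the largest power in the set $\{S_l^0\}$. $c_2$ is a positive number smaller than 1, for example 0.1.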
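Equation (19) and its relaxed variant can be sketched as set operations. As before, images are directions compared by exact equality (an assumption), and the detection results and powers are made-up values.

```python
def hidden_images(initial, after_suppression, powers=None, c_2=0.1):
    """Images absent from `initial` that appear after every considered suppression.

    If `powers` is given, only suppressions of images whose power is at least
    c_2 times the largest power are considered (the relaxed condition).
    """
    considered = list(after_suppression)
    if powers is not None:
        floor = c_2 * max(powers.values())
        considered = [j for j in considered if powers[j] >= floor]
    if not considered:
        return set()
    common = set.intersection(*(set(after_suppression[j]) for j in considered))
    return common - set(initial)

initial = {-60, 30}
# Hypothetical outcome: an image at 0 appears only when the image at -60 is suppressed.
after = {-60: {30, 0}, 30: {-60}}
strict = hidden_images(initial, after)
relaxed = hidden_images(initial, after, powers={-60: 10.0, 30: 0.5})
```

Under the strict condition nothing is reported, because the image at 0 does not appear when the weak image at 30 is suppressed; the relaxed condition ignores that weak image and reports 0.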
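[0048]
Next, an example of a more accurate sound image position extraction method using the sound source suppression processing unit 8 is described. Consider one sound image $S_k$ in the set $\{S_i\}$ (i = 1 to n) detected by the detection method described above. First, the sound source suppression processing creates a signal in which all sound images in $\{S_i\}$ other than $S_k$ have been removed from the observed signal. Since most components of the non-target images have been removed from this signal, the sound source position extraction processing unit 7 can determine the position of the target image $S_k$ more accurately. Computing the intensity-difference-based sound image from this signal by the method described so far is expected to give a more accurate value. An accurate time-difference-based sound image can likewise be obtained by applying the time-difference-based sound image extraction described so far. The three detection techniques using sound source suppression described above have complementary effects and can be combined, but each is also effective on its own.
[0049]
(Source detection method using direction signal extraction range determination processing)
In a reverberant environment the statistical properties of the inter-signal intensity difference and time difference change, so a direction signal extraction range set for the reverberation-free condition cannot extract the acoustic signal from the target direction properly, and good detection performance is not necessarily achieved. Preventing this requires a more appropriate direction signal extraction range determination process.
[0050]
For this purpose, in its initial state the direction signal extraction range determination processing unit 3 sets direction ranges that assume the reverberation conditions mainly expected. For example, when the reverberation conditions are unknown, one approach is for the unit to set a relatively wide direction range so that long reverberation can be handled.
[0051]
After an actual acoustic signal has been input, the direction signal extraction range determination processing unit 3 receives the successive direction power computed by the successive direction power calculation processing unit 6 and the source direction detected by the sound source position extraction processing unit 7, determines the corresponding direction signal extraction range, and outputs it to the direction signal extraction processing unit 2 and the sound source suppression processing unit 8. This allows a more appropriate extraction range to be set. Suppose first that an initial estimate $D_{source}$ of the source position has been obtained by one of the detection methods described so far, and that the intensity-difference-based sound image direction $D_{image}$ has been found in a shifted direction. The inter-signal intensity difference at frequencies dominated by sound from each source is then expected to take values corresponding to the interval between $D_{source}$ and $D_{image}$. Exactly the same holds for the inter-signal time difference. To extract such a source signal properly, the direction signal extraction range determination processing unit 3 should therefore control the direction signal extraction processing unit 2 so that it passes only the frequencies whose inter-signal information falls within this range. This control can be realized by defining D in equations (11) and (14) as follows.
[0052]

Here $C_3$ is a constant term providing a margin on the source direction and takes a value of, for example, about 5°.
When a direction range shifted toward the sound image is used, $C_1$ and $C_2$ in equations (11) to (14), which determine how wide a range of sound is actually passed, are better set to smaller values than when an unshifted range is used. Because D then indicates a more appropriate range, widening the pass range much further beyond it would more often pick up non-target sound and reverberation. For example, $C_1 = 2$ and $C_2 = 2$ is one possible choice.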
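The defining equation for D is not reproduced in this text, so the following is only one plausible reading of the surrounding description: the range spans the interval between the source estimate and the shifted sound image, padded on both sides by the margin. The function name and the padding on both ends are assumptions.

```python
def extraction_range(d_source, d_image, c_3=5.0):
    """Direction range (degrees) spanning source estimate and shifted image, padded by c_3."""
    low = min(d_source, d_image) - c_3
    high = max(d_source, d_image) + c_3
    return low, high

# Hypothetical case: source estimated at 30 degrees, intensity-based image pulled to 18 degrees.
low, high = extraction_range(30.0, 18.0)
```

The range widens only on the side toward which the image has shifted, apart from the fixed margin, which matches the stated goal of not widening the criterion symmetrically.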
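[0053]
(Source separation)
An example of the source separation method is described. Whichever of the detection methods above is used, the direction signal extraction processing unit 2 judges, for the purpose of detection, which frequencies carry sound arriving from the same direction as the target sound. The direction signal extraction processing unit 2 replaces with 0 the frequencies judged by this process to come from a direction different from the target sound and leaves the others unchanged. Applying the short-time inverse Fourier transform to the resulting frequency signal gives a time signal in which only the target sound is separated, which accomplishes source separation.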
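The masking step can be illustrated on a single frame with a naive DFT. This is a sketch under strong simplifications: the set of bins judged to belong to the target direction (`keep_bins`) is supplied by hand rather than derived from inter-channel cues, there is one channel, and no windowing or overlap-add is applied.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def separate_frame(frame, keep_bins):
    """Zero every DFT bin not judged to come from the target direction, then invert."""
    spec = dft(frame)
    n = len(spec)
    # Keep each selected bin together with its mirror bin so the output stays real.
    masked = [spec[k] if (k in keep_bins or (n - k) % n in keep_bins) else 0j
              for k in range(n)]
    return idft(masked)

n = 16
target = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
interferer = [0.5 * math.cos(2 * math.pi * 5 * t / n) for t in range(n)]
mixture = [a + b for a, b in zip(target, interferer)]
separated = separate_frame(mixture, keep_bins={2})
```

Because the two synthetic sources occupy different bins, keeping bin 2 recovers the target exactly; real speech mixtures overlap in frequency, so the recovery is only approximate.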
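[0054]
[Effects of the Invention]
As described above, the performance of the invention was evaluated for source detection and source separation on mixtures of two and three sources. Each experiment used utterances from three speakers (two female, FKM and FSU; one male, MAU) in the ATR word database (5240 words, 12 kHz, 16 bit), 20 utterances each. Impulse responses measured in the anechoic room and the variable reverberation room of FIG. 2 were convolved with the monaural speech data to synthesize binaural speech containing reverberation. For two sources, FSU was placed at −60° (or −30°) and FKM at 30° (or 60°); for three sources, MAU was additionally placed at 0°.
[0055]
(Evaluation of source detection rate)
FIGS. 3 and 4 show source detection performance evaluated by recall and precision. FIG. 3 is for two sources and FIG. 4 for three; in both, (a) shows recall and (b) shows precision. With N the number of sources that should be detected, $\hat{N}$ the number detected correctly, and $\bar{N}$ the total number detected, recall is $\hat{N}/N$ and precision is $\hat{N}/\bar{N}$. In the experiments a source was counted as correctly detected when a source was found within ±10° of a placed source direction. Example 1, shown for comparison, is the result of source detection in FIG. 1 without direction signal extraction range determination processing, maximum value direction detection processing, or sound source suppression processing. Example 2 is the result of source detection in FIG. 1 with direction signal extraction range determination processing, maximum value direction detection processing, and sound source suppression processing.
FIGS. 3 and 4 show that recall is already 90% or higher in Example 1, that Example 2 improves it by a further 0 to 5% or so, and that precision improves substantially, by about 20% for two sources and about 10% for three.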
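The two scores can be computed as sketched below. The ±10° tolerance is from the text; matching each true source to at most one detection, and the greedy order of matching, are assumptions made here, as are the example directions.

```python
def recall_precision(true_dirs, detected_dirs, tolerance=10.0):
    """Recall = correct / expected and precision = correct / detected."""
    remaining = list(detected_dirs)
    correct = 0
    for t in true_dirs:
        # Assumption: each detection can confirm at most one true source.
        match = next((d for d in remaining if abs(d - t) <= tolerance), None)
        if match is not None:
            remaining.remove(match)
            correct += 1
    recall = correct / len(true_dirs)
    precision = correct / len(detected_dirs) if detected_dirs else 0.0
    return recall, precision

# Hypothetical trial: two sources placed, three detections, one of them spurious.
recall, precision = recall_precision([-60.0, 30.0], [-55.0, 30.0, 80.0])
```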
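[0056]
(Evaluation of separated sound)
FIG. 5 shows the quality of the separated sound evaluated by spectral distortion based on the LPC cepstral distance measure. FIG. 5(a) is for two sources and FIG. 5(b) for three. The reference spectral shape in the evaluation was single-source binaural speech with a reverberation time of 0 seconds. This evaluates the difference between the sound separated from a reverberant mixture and single-source binaural speech at 0 seconds of reverberation. Example 1 in FIG. 5 is the result of source separation in FIG. 1 without direction signal extraction range determination processing, maximum value direction detection processing, or sound source suppression processing. Example 2 is the result of source separation in FIG. 1 with these three processes. To compare separation performance alone, the source directions were taken as known. FIG. 5 shows that Example 2 improves quality over Example 1 by about 2 to 5 dB in each reverberation condition.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the example.
FIG. 2 is a diagram illustrating binaural recording in the variable reverberation room.
FIG. 3 is a diagram showing source detection results for two sources in Example 1 and Example 2.
FIG. 4 is a diagram showing source detection results for three sources in Example 1 and Example 2.
FIG. 5 is a diagram showing the spectral distortion of the separated sound in Example 1 and Example 2.
FIG. 6 is a diagram illustrating a conventional sound source detection method.
FIG. 7 is a diagram illustrating a conventional sound source separation method.
[Explanation of symbols]
1 inter-signal information extraction processing unit
2 direction signal extraction processing unit
3 direction signal extraction range determination processing unit
4 instantaneous direction power calculation processing unit
5 maximum value direction detection processing unit
6 successive direction power calculation processing unit
7 sound source position extraction processing unit
8 sound source suppression processing unit
[0001]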
[Industrial applications]
The present invention relates to a sound source detection method, a sound source separation method, and an apparatus for carrying them out. More particularly, it relates to methods and an apparatus that improve sound source detection and separation performance when a source is detected or separated from acoustic signals containing reverberation, by using the difference between the statistical properties of the inter-channel information of an acoustic signal when no reverberation exists and when reverberation exists.
[0002]
[Prior art]
When a speech signal is picked up in an everyday environment, other sounds are in many cases produced at the same time, and the observed sound is a mixture on which those sounds (referred to as interfering sounds) are superimposed. Except in special environments such as an anechoic room, the speech signal also normally has reverberation superimposed on it. As a result, the properties of the original speech signal become difficult to extract and the intelligibility of the speech itself decreases. Removing the interfering sound and reverberation superimposed on the speech signal by sound source separation makes the original properties of the speech easier to extract and improves its intelligibility. Used as a component technology in various other acoustic signal processing devices, sound source separation improves the performance of the device as a whole. Devices whose overall performance can be improved in this way include the following.
[0003]
a. A speech recognition device that uses sound source separation as a preprocessing technique.
b. A communication device, such as a video conferencing device, in which sound source separation improves speech intelligibility.
c. A playback device in which removing the interfering sound and reverberation contained in a lecture recording improves the intelligibility of the recorded speech.
d. A hearing aid that improves audibility by removing disturbing sounds and reverberation.
e. A music information processing apparatus for retrieving music or transcribing music by removing the reverberation and other disturbing sounds of music performed by a person singing, playing on a musical instrument, or playing on a speaker.
f. A machine control interface that passes commands to the machine in response to human voices, and a dialogue device between the machine and a human.
[0004]
A conventional example of a sound source detection method will be specifically described with reference to FIG. 6 (see Non-Patent Document 1).
The inter-channel information extraction process extracts, in each time series (referred to as a channel) of the observed signals picked up by a plurality of sensors, the times at which the amplitude of the signal waveform crosses zero (zero-crossing points), and extracts the time differences between corresponding zero-crossing points in different channels. Next, the histogram calculation process estimates the sound source position corresponding to the time difference of each zero-crossing point, counts the number of zero-crossing points coming from the same position for each range of sound source positions, and computes a time difference histogram. For this purpose, the histogram calculation process holds in advance information on what inter-channel time difference the sound from each source position has, and uses it to estimate the source position corresponding to each zero-crossing point. Finally, the sound source position extraction process extracts the range of source directions that gives the maximum of the time difference histogram and judges it to be the sound source position. When there are multiple sources, multiple peaks appear in the time difference histogram, and each is detected as a source position.
[0005]
Next, a conventional example of a sound source separation method will be described with reference to FIG. 7 (see Non-Patent Document 2).
The inter-channel information extraction process converts each observed signal into a frequency-domain signal and then extracts the phase difference and intensity difference of the signals at each frequency. A sound source extraction processing unit consisting of a sound source extraction filter receives this information together with sound source position information obtained by preprocessing such as sound source detection. The sound source extraction processing unit identifies the frequencies at which the attributes that an observed signal from the target source position should have match the inter-channel information actually observed, deletes the signals at all other frequencies, and converts the frequency-domain signal back into a time-domain signal to output the separated sound signal.
[0006]
[Non-patent document 1]
Jie Huang, Noboru Onishi, Noboru Sugie, "Multiple Sound Source Localization System Using Time Difference Histogram", Journal of the Robotics Society of Japan 1, 9, 1, pp. 29-38, 1991.02.
[Non-patent document 2]
Aoki, M., Okamoto, M., Aoki, S., Matsui, H., Sakurai, T., and Kaneda, Y., "Sound source segregation based on estimating incident angle of each frequency component of input signals acquired by multiple microphones", J. Acoust. Soc. Jpn. (E), 22, 2, pp. 149-157, 2001.
[0007]
[Problems to be solved by the invention]
In the conventional sound source detection and separation methods described above, when an acoustic signal containing reverberation is received as input, the reverberation prevents the attributes of each acoustic signal from being extracted accurately, so the inter-channel information cannot be extracted correctly. As a result, the conventional detection method tends to fail to detect the correct source position or to detect a source position that does not exist. The separation method, because it misidentifies the frequencies belonging to the source direction, fails to extract necessary frequencies or mixes in the sound of other sources contained in extraneous frequencies. The quality of the separated acoustic signal deteriorates as a result.
[0008]
As described above, the sound source detection and separation methods used conventionally have the problem that, in a reverberant environment, sources cannot be detected correctly and the quality of the separated acoustic signal deteriorates.
The present invention provides a sound source detection method, a sound source separation method, and an apparatus for carrying them out that improve sound source detection and separation performance when a source is detected or separated from acoustic signals containing reverberation, by using the difference between the statistical properties of the inter-channel information of an acoustic signal when no reverberation exists and when reverberation exists.
[0009]
[Means for Solving the Problems]
A sound source detection method is configured that receives acoustic signal inputs observed by a plurality of sensors and applies to them inter-signal information extraction processing, direction signal extraction processing, instantaneous direction power calculation processing, and sound source position extraction processing, or additionally successive direction power calculation processing.
In the above sound source detection method, a sound source detection method is configured in which the statistics
(Equation 1)

are calculated and set in advance for the direction signal extraction processing.
[0010]
In the above sound source detection method, a sound source detection method is configured that additionally applies any one or more of maximum value direction detection processing, direction signal extraction range determination processing, and sound source suppression processing in combination.
Further, a sound source separation method is configured that carries out sound source detection processing, which receives acoustic signal inputs observed by a plurality of sensors and applies to them inter-signal information extraction processing, direction signal extraction processing, instantaneous direction power calculation processing, and sound source position extraction processing, or additionally successive direction power calculation processing, and that outputs the acoustic signal extracted by the direction signal extraction processing as the separated sound.
[0011]
In the above sound source separation method, a sound source separation method is configured in which the statistics
(Equation 2)

are calculated and set in advance for the direction signal extraction processing.
[0012]
In the above sound source separation method, a sound source separation method is configured in which the sound source detection processing additionally includes any one or more of maximum value direction detection processing, direction signal extraction range determination processing, and sound source suppression processing in combination.
Further, a sound source detection apparatus is configured that comprises: an inter-signal information extraction processing unit 1 that receives observed signals of a plurality of channels and obtains inter-channel time difference signals and intensity difference signals; a direction signal extraction processing unit 2 in which the statistics, namely the mean and standard deviation of the inter-channel intensity difference and the mean and standard deviation of the inter-channel time difference under conditions free of interfering sound and reverberation, are calculated and stored in advance, and which extracts the direction signal within a given direction range; an instantaneous direction power calculation processing unit 4 that obtains, as the instantaneous direction power, the total power of the frequencies judged to belong to a direction range; and a sound source position extraction processing unit 7 that receives the direction power and outputs sound source positions; and that optionally further comprises a successive direction power calculation processing unit 6 that obtains the successive direction power by summing the instantaneous direction power for each direction range.
[0013]
In the above sound source detection apparatus, a sound source detection apparatus is configured that further comprises, in combination, any one or more of the following three processing units: a maximum value direction detection processing unit 5 that selects the direction range giving the maximum of the obtained instantaneous direction power and replaces the total power of the other direction ranges with 0; a direction signal extraction range determination processing unit 3 that receives the successive direction power obtained by the successive direction power calculation processing unit 6 and the source direction detected by the sound source position extraction processing unit 7, determines the corresponding direction signal extraction range, and outputs it to the direction signal extraction processing unit 2; and a sound source suppression processing unit 8 that receives the source direction detected by the sound source position extraction processing unit 7 and the direction signal extraction range determined by the direction signal extraction range determination processing unit 3, calculates the direction range to be removed, and outputs it to the inter-signal information extraction processing unit 1.
[0014]
Likewise, a sound source separation apparatus is configured that comprises: an inter-signal information extraction processing unit 1 that receives observation signals of a plurality of channels and obtains an inter-channel time difference signal and an inter-channel intensity difference signal; a direction signal extraction processing unit 2 that calculates and stores in advance the statistics, namely the average value and standard deviation of the inter-channel intensity difference and the average value and standard deviation of the inter-channel time difference, and extracts a direction signal within a given direction range; an instantaneous direction power calculation processing unit 4 that obtains, as the instantaneous direction power, the sum of the powers of the frequencies determined to be included in the direction range; and a sound source position extraction processing unit 7 that receives the direction power and outputs the sound source position. The apparatus may further comprise a successive direction power calculation processing unit 6 that adds up the instantaneous direction power of each direction range to obtain the successive direction power. In this apparatus the direction signal extraction processing unit 2 serves as a separated sound output unit that outputs the acoustic signal as the separated sound.
[0015]
The above sound source separation apparatus may further be provided with any one, or a combination of two or more, of the following three processing units: a maximum value direction detection processing unit 5 that selects the direction range giving the maximum value of the obtained instantaneous direction power and replaces the sum of powers in the other direction ranges with 0; a direction signal extraction range determination processing unit 3 that receives the successive direction power obtained by the successive direction power calculation processing unit 6 and the sound source direction detected by the sound source position extraction processing unit 7, determines the corresponding direction signal extraction range, and outputs it to the direction signal extraction processing unit 2; and a sound source suppression processing unit 8 that receives the sound source direction detected by the sound source position extraction processing unit 7 and the direction signal extraction range determined by the direction signal extraction range determination processing unit 3, calculates the direction range to be erased, and outputs it to the inter-signal information extraction processing unit 1. In this case as well, the direction signal extraction processing unit 2 serves as the separated sound output unit that outputs the acoustic signal as the separated sound.
[0016]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described.
The present invention exploits the fact that the statistical properties of the inter-channel information of an acoustic signal containing reverberation differ from those of a signal without reverberation. That is, by using the difference between the statistical properties of the inter-channel information when no reverberation exists and when reverberation exists, it provides a sound source detection method and a sound source separation method that improve detection and separation performance when sound sources are detected or separated using acoustic signals that include reverberation.
[0017]
Before describing this, first, (1) the basic properties of the inter-channel information will be described, and then (2) the statistical properties for reverberation. Finally, the principle of (3) sound source detection and sound source separation utilizing the statistical properties will be described.
((1) Basic properties of information between channels)
First, consider an acoustic signal $x_i(t)$ that arrives from a certain sound source direction and is observed by a plurality of sensors (i: an index indicating which sensor the input was obtained from), and the frequency-domain signal obtained by frequency-converting it. For the signal at each frequency f, the difference between the times at which it reaches sensors i and j from the sound source position is the inter-channel time difference $\Delta t_{ij}(f)$ between channels i and j. Because it is difficult to measure the time difference directly from the observed signal, in practice the time difference is estimated from the phase difference $\Delta\phi_{ij,k}(f)$ of the signals. These values are calculated as follows.
$\phi_{i,k}(f) = \angle X_{i,k}(f)$ ... (1)
$\Delta\phi_{ij,k}(f) = \phi_{i,k}(f) - \phi_{j,k}(f)$ ... (2)
$\Delta t_{ij,k}(f) = (\Delta\phi_{ij,k}(f) + 2\pi n) / (2\pi f)$ ... (3)
[0018]
Here, a time window is applied to the acoustic signal $x_i(t)$ to divide it into frames; the signal of frame number k is written $x_{i,k}(t)$, and the value at frequency f of the result of converting it to a frequency-domain signal by a discrete-time Fourier transform is written $X_{i,k}(f)$. $\angle X_{i,k}(f)$ is its phase ($-\pi < \angle X_{i,k}(f) < \pi$). Because the phase difference is periodic with period $2\pi$, when $|2\pi f \Delta t_{ij}(f)|$ exceeds $\pi$, an integer multiple $2\pi n$ must be added to the phase difference to obtain the correspondence with the time difference, as shown in Expression (3). n is an integer satisfying the following constraint.

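$\Delta t_{ij,\min}(f) \le (\Delta\phi_{ij,k}(f) + 2\pi n) / (2\pi f) \le \Delta t_{ij,\max}(f)$ ... (4)
Here, $\Delta t_{ij,\min}(f)$ and $\Delta t_{ij,\max}(f)$ are the minimum and maximum values of the time difference assumed to occur at frequency f between the signals of channel i and channel j.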
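The relationship between the wrapped phase difference and the candidate time differences can be illustrated with a minimal Python sketch. It assumes that the constraint on n simply bounds the time difference between `dt_min` and `dt_max`; the function name and parameters are hypothetical and are not taken from the specification.

```python
import cmath
import math

def time_difference_candidates(X_i, X_j, f, dt_min, dt_max):
    """Return every time difference (s) consistent with Expressions (1)-(3).

    X_i, X_j: complex spectral values of channels i and j at frequency f (Hz).
    dt_min, dt_max: assumed physical bounds on the time difference (s).
    """
    dphi = cmath.phase(X_i) - cmath.phase(X_j)        # Expression (2)
    two_pi = 2 * math.pi
    # Smallest and largest integer n that keep the time difference in bounds.
    n_lo = math.ceil((two_pi * f * dt_min - dphi) / two_pi)
    n_hi = math.floor((two_pi * f * dt_max - dphi) / two_pi)
    return [(dphi + two_pi * n) / (two_pi * f)        # Expression (3)
            for n in range(n_lo, n_hi + 1)]
```

At low frequencies the list holds a single value; at high frequencies several values of n fit within the bounds, which is the ambiguity the text describes.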
[0019]
In addition, the inter-channel intensity difference $\Delta I_{ij,k}(f)$ is defined as follows.

It is known that observation signals arriving at the sensors from different directions have different time differences and different intensity differences, and this property can be used to estimate the arrival direction of an observation signal. To do so, the values that the time difference and the intensity difference take for a sound arriving from each direction must be determined in advance, before the actual measurement. Two ways of obtaining these values are conceivable: 1) deriving them analytically from the physical arrangement of the sensors, and 2) measuring actual acoustic signals and extracting the values from them. Method 2) is described here.
[0020]
First, a sound source is placed in each assumed sound source direction d, and its signal is measured with the sensors. As the source signal, white noise containing all frequency bands, a time-stretched pulse (TSP) signal as used for transfer-function measurement, or speech itself can be used. Next, the intensity difference $\Delta I_{ij,k}(f)$ and the time difference $\Delta t_{ij,k}(f)$ are obtained from the measured signals. Even in an environment without reverberation, these values spread over a certain range because of observation error and the acoustic characteristics of the sensors themselves. Using
(Equation 3)

these properties can be expressed statistically as follows.
(Equation 4)

Here, E(·) denotes the operation of taking the average value. In this calculation, $\Delta t_{ij,k}(f)$ is not uniquely determined at frequencies where $|2\pi f \Delta t_{ij}(f)|$ exceeds $\pi$, so its value must be determined by another method. One such method relies on the fact that the inter-channel time difference does not vary greatly with frequency: the time difference at the maximum frequency $f_{\max}$ at which it is uniquely determined is used as the estimate for frequencies higher than $f_{\max}$.
[0021]
The average values and standard deviations described above can be calculated more easily by measuring transfer functions. First, a loudspeaker for measurement is placed at an assumed sound source position, and the impulse response from that position to each sensor is measured. The impulse response may be measured with a TSP signal, an M-sequence signal, or another known method. Once the impulse response has been obtained, convolving it on a computer with an assumed sound source signal simulates the signal that the sensor would observe from a sound source at that position. Using this simulated signal, the average value and standard deviation of the inter-channel information can be obtained by the same operations as described above.
[0022]
Further, in the above-described method, only the average value and the standard deviation value of the inter-channel information from the discrete positions where the sound sources are arranged can be obtained. On the other hand, the value relating to the sound source position that is not observed can be approximately determined from the observed value using a method such as linear interpolation. Thereby, it is possible to have inter-channel information regarding continuous positions. In the following, a description will be given on the assumption that the processing is performed using the values regarding the continuous positions obtained in this manner.
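The linear interpolation mentioned above can be sketched as follows. This is an illustrative sketch only: it assumes the statistic was measured at a sorted list of at least two directions, and the function name and argument layout are hypothetical.

```python
def interpolate_statistic(directions, values, d):
    """Linearly interpolate a per-direction statistic at direction d (degrees).

    directions: measured source directions, sorted ascending (two or more).
    values: the statistic measured at each direction, for example the mean
            intensity difference at one frequency.
    """
    if not directions[0] <= d <= directions[-1]:
        raise ValueError("d lies outside the measured range")
    measured = list(zip(directions, values))
    for (d0, v0), (d1, v1) in zip(measured, measured[1:]):
        if d0 <= d <= d1:
            w = (d - d0) / (d1 - d0)      # 0 at d0, 1 at d1
            return (1 - w) * v0 + w * v1
```

Applying the same interpolation to each stored mean and standard deviation, frequency by frequency, gives values for directions that were never measured.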
[0023]
Using the statistics obtained as described above, it is possible to estimate, when an observation signal is given, whether the sound at each frequency is a signal arriving from a certain direction range D. That is, when the inter-channel phase difference $\Delta\phi_{ij}(f)$ and intensity difference $\Delta I_{ij}(f)$ of the signal observed at that frequency satisfy one or both of the following constraints, the signal at that frequency can be judged to come from the direction range D.

where
(Equation 5)

holds. $C_1$ and $C_2$ are positive constants that specify the width of the range judged to be sound arriving from D, and take a value such as 1, for example. n denotes an arbitrary integer.
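The membership test can be sketched in Python as below. The exact form of Expressions (10) to (14) is not reproduced in this text, so the sketch assumes, purely for illustration, that each cue must lie within a constant multiple of the stored standard deviation around the stored mean, and it requires both cues to pass. The dictionary keys, the parameter `n_max`, and the function name are hypothetical.

```python
import math

def in_direction_range(d_int, d_phase, f, stats, c1=1.0, c2=1.0, n_max=3):
    """Judge whether one frequency bin is consistent with direction range D.

    d_int: observed inter-channel intensity difference at frequency f (Hz).
    d_phase: observed inter-channel phase difference (rad).
    stats: stored statistics for D at this frequency, with assumed keys
           'mean_int', 'std_int', 'mean_dt', 'std_dt'.
    """
    ok_int = abs(d_int - stats["mean_int"]) <= c1 * stats["std_int"]
    # The phase is known only modulo 2*pi, so accept any integer n that fits.
    ok_dt = any(
        abs((d_phase + 2 * math.pi * n) / (2 * math.pi * f) - stats["mean_dt"])
        <= c2 * stats["std_dt"]
        for n in range(-n_max, n_max + 1)
    )
    return ok_int and ok_dt
```

Returning `ok_int or ok_dt`, or only one of the two flags, would correspond to the case in which only one of the constraints is used.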
[0024]
Extending this method further, it can be used to estimate the direction of arrival of the observation signal. To this end:
1) First, the sound source directions are divided into several direction ranges $D_i$ (i = 1 to M). The ranges $D_i$ may overlap.
2) Next, the direction range to which each frequency f of the observation signal $X_k(f)$ belongs is determined by the method described above.
3) The sum $P_k(D_i)$ of the powers of the frequencies determined to belong to each direction range is obtained.
$P_k(D_i) = \sum_{f:\, X_k(f) \in D_i} |X_k(f)|^2$ ... (15)
4) Finally, the direction range in which the sum of the power is maximum, $D_{k,\max} = \arg\max_{D_i} P_k(D_i)$, is taken as the sound source direction.
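Steps 3) and 4) can be sketched as follows, assuming that the per-bin membership from step 2) is already available. The data layout (one set of range indices per frequency bin) and the function names are illustrative assumptions.

```python
def instantaneous_direction_power(spectrum, membership, num_ranges):
    """Expression (15): sum |X_k(f)|^2 over the bins assigned to each range.

    spectrum: complex values X_k(f) of one frame, one per frequency bin.
    membership: for each bin, the set of direction-range indices it was judged
                to belong to (ranges may overlap, so a bin can be in several).
    """
    power = [0.0] * num_ranges
    for X, ranges in zip(spectrum, membership):
        for i in ranges:
            power[i] += abs(X) ** 2
    return power

def strongest_direction(power):
    """Step 4): index of the direction range with maximum direction power."""
    return max(range(len(power)), key=power.__getitem__)
```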
[0025]
Hereinafter, the sum of the powers obtained for each direction range is referred to as direction power, the peak of the direction power is referred to as a sound image, and the direction of the sound source estimated from the sound image is referred to as the direction of the sound image. Further, when calculating the sound source direction power, it is also possible to use only the inter-channel intensity difference or to calculate using only the time difference, and similarly, it can be used for estimating the sound source direction. The former is called directional power based on the intensity difference, and the latter is called directional power based on the time difference. Furthermore, the direction of the sound source determined from the former is called the direction of the sound image based on the intensity difference, and the direction of the sound source determined from the latter is called the direction of the sound image based on the phase difference.
[0026]
When there are a plurality of sound sources, a sound source can be judged to exist in each direction range where the direction power takes a local maximum whose value exceeds a certain threshold. In this way, sound source detection and sound source position estimation can be performed simultaneously.
The direction power can be calculated either by adding the power within one frame as in Expression (15), or by adding the per-frame direction power over a plurality of frames as in Expression (16).
$P(D_i) = \sum_k P_k(D_i)$ ... (16)
[0027]
The former is called instantaneous directional power, and the latter is called successive directional power. The instantaneous directional power can detect a sound source that exists only at a certain moment, but also has a high possibility of erroneously extracting a sound image other than the target sound such as noise or reverberation. Conversely, the successive directional power makes it difficult to detect an instantaneous sound source, while the directional power derived from a non-existent sound source becomes relatively weak, so that it can be easily removed by threshold processing or the like.
[0028]
This inter-channel information can also be used for sound source separation. First, a range D containing the direction of the sound source to be separated is determined. Next, the intensity difference $\Delta I_{ij,k}(f)$ and the phase difference $\Delta\phi_{ij,k}(f)$ are obtained, and the frequencies f included in D are found using Expression (10). Finally, the frequency-domain components at frequencies not included in D are replaced with 0 and the result is converted back to a time signal, which separates only the sound source in the target direction.
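The masking step can be illustrated for a single frame with the sketch below. It uses a naive O(N^2) transform so that it needs only the standard library; `keep_bin` is a hypothetical predicate standing in for the Expression (10) test, and windowing and overlap-add between frames are omitted.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform; adequate for a short demo frame."""
    N = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / N) for t in range(N))
            for f in range(N)]

def idft(X):
    """Inverse of dft(); returns the real part of the reconstructed frame."""
    N = len(X)
    return [(sum(X[f] * cmath.exp(2j * cmath.pi * f * t / N)
                 for f in range(N)) / N).real for t in range(N)]

def separate_frame(frame, keep_bin):
    """Zero every bin for which keep_bin is False, then return to time domain.

    Bin f and its mirror N - f are treated alike so the output stays real.
    """
    X = dft(frame)
    N = len(X)
    masked = [X[f] if keep_bin(min(f, N - f)) else 0 for f in range(N)]
    return idft(masked)
```

In practice a fast Fourier transform would replace `dft` and `idft`; the masking logic itself is unchanged.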
[0029]
((2) Statistical properties of information between channels when there is reverberation)
In an environment with reverberation, the reverberation is superimposed on the signal, so the properties of the inter-channel information described above change. For this reason, sound source detection and sound source separation cannot be performed appropriately with the average time difference, the average intensity difference, and the standard deviations (hereinafter, the statistics of the inter-channel information) measured in an environment without reverberation.
On the other hand, if it is possible to predict how the time difference and the intensity difference change according to reverberation, it is expected that sound source detection and sound source separation can be appropriately performed even under reverberation. For this purpose, the present invention makes use of the following statistical properties of the values of the time difference and the intensity difference under reverberation.
[0030]
1. As the reverberation increases,
(Equation 6)

becomes large.
2. Compared with the case without reverberation, the average value $\overline{\Delta I}_{ij}(D, f)$ of the inter-channel intensity difference changes as the reverberation increases. For example, in a room where reverberation arrives evenly from all directions, the inter-channel intensity difference tends to approach zero as the effect of reverberation increases. Therefore, if the position of the sound source is estimated from the statistics of the inter-channel intensity difference obtained without reverberation, the direction of the sound image is shifted.
3. The average value $\overline{\Delta t}_{ij}(D, f)$ of the inter-channel time difference also tends to change. However, the shift in the sound source position estimated from the statistics of the inter-channel time difference obtained without reverberation is not as large as the shift based on the intensity difference.
[0031]
4. As described above, the sound image direction estimated from the intensity difference and the sound image direction estimated from the phase difference are shifted by different amounts, so when both cues are used in combination they become mutually inconsistent. For this reason, the direction power peak itself is not formed properly. This is a major cause of reduced sound source detection accuracy.
5. Within one observation signal, when the shift of the estimated sound source direction is compared across time sections and frequency regions, sections and regions with larger power tend to show a smaller shift than sections and regions with smaller power. This is because the influence of reverberation tends to be greater in sections and regions where the power is small.
[0032]
6. A single sound source may form a direction power peak (sound image) in a direction other than the one corresponding to the target sound source signal (hereinafter, a pseudo sound image). This causes erroneous sound source detection. On the other hand, when the target sound source signal is eliminated from the observed signal, either by subtracting from the observed signal the sound source signal separated from the mixed sound with the sound source separation method, or by deleting from the observed signal only the frequencies judged to be sound from the target direction, the pseudo sound image is in many cases deleted together with the peak (sound image) in the direction of the target sound source signal.
7. Conversely, when a strong sound source is present, other sound images may be hidden by the strong sound and go undetected. Such a sound image is called a hidden sound image. A hidden sound image can sometimes be detected by erasing the sound source with large power.
[0033]
((3) Sound source detection and separation method using statistical properties for reverberation)
The present invention executes the following processing in order to properly perform sound source detection and sound source separation even in an environment where reverberation exists, utilizing the above-described properties 1 to 7.
First. In response to the shift of the sound image, the parameters of the direction signal extraction of Expression (10) are changed so that the criterion for judging that a sound comes from the target direction is widened, thereby realizing more appropriate sound source detection and sound source separation (direction signal extraction range determination processing).
Sound source detection
$C_1$ and $C_2$ in Expressions (11) to (14) are set to large values. In particular, the criterion is widened more for the intensity difference, whose deviation from the reference value grows as the reverberation increases. As a result, the direction power peaks become broader, so that even if the direction of the sound image based on the intensity difference and the direction of the sound image based on the time difference deviate from each other, the peaks overlap more. Therefore, even when the two cues are combined, they do not conflict, and a direction power peak is formed near the correct sound source direction.
[0034]
Sound source separation
As in the sound source detection above, separation of the target sound source under reverberation can be improved by performing the separation with large values of $C_1$ and $C_2$. However, simply widening the separation criterion extracts more components of the target sound source but also admits more unwanted components such as reverberation. To reduce this, a further method is introduced. When the direction power is shifted by reverberation, most of the frequency components can be considered to lie in the direction range between the true direction of the target sound source and the direction of the sound image based on the intensity difference. Therefore, the criterion $C_1$ need not be widened on both sides of the average value; it is sufficient to widen it only in the direction in which the sound image shifts. As a result, only the sound in the necessary direction range is extracted, without widening the range in unnecessary directions. A similar method can be used for the criterion on the time difference.
[0035]
Second. When calculating the successive direction power, the power in the direction range other than the direction range giving the maximum value in the instantaneous direction power of each frame is replaced with 0, and then added. It is expected that the direction range giving the maximum value in each frame is relatively less affected by reverberation, while the other direction ranges are more strongly affected by reverberation. Therefore, by performing the operation of adding only the maximum value of the instantaneous directional power, it is possible to obtain the successive directional power in which more reliable information is preferentially considered (maximum value direction detection processing).
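The difference between plain accumulation by Expression (16) and accumulation restricted to the per-frame maximum can be sketched as follows. The function name and the `max_only` flag are hypothetical names introduced for illustration.

```python
def successive_direction_power(frame_powers, max_only=True):
    """Sum instantaneous direction power over frames, as in Expression (16).

    frame_powers: list of per-frame lists P_k(D_i).
    max_only: if True, keep in each frame only the direction range with the
              largest power and treat all other ranges as 0 before summing.
    """
    total = [0.0] * len(frame_powers[0])
    for p in frame_powers:
        if max_only:
            i_max = max(range(len(p)), key=p.__getitem__)
            total[i_max] += p[i_max]
        else:
            for i, v in enumerate(p):
                total[i] += v
    return total
```

With `max_only=True`, a direction range that never dominates a frame accumulates no power at all, which is what makes weak reverberant peaks easier to reject by thresholding.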
[0036]
Third. When a plurality of sound sources and sound images exist, a pseudo sound image and a hidden sound image may be included, and the sound image of the target sound may be observed with a shift. In order to solve these problems, it is effective to delete some sound source signals corresponding to the detected sound image from the observed signal by the following method (sound source suppression processing).
[0037]
Removal of pseudo sound image
When a plurality of sound sources are detected, pseudo sound images may be included. To eliminate them, the sound sources corresponding to some of the detected sound images are removed from the observation signal, and the sound images that disappear together with them are judged to be pseudo sound images.
Detection of hidden sound image
Conversely, if a sound image appears when the sound source corresponding to one sound image is deleted from the observation signal, it is judged to be a hidden sound image and is detected as a sound image of a target sound source.
Accurate sound image position extraction
In the sound source separation described under the first process above, the peak position of the direction power based on the intensity difference (or the peak position of the direction power based on the phase difference) must be extracted in order to determine the criterion. However, with mixed sound it may be difficult to obtain the peak position because of interference from other sounds. In this case, the sound image erasing operation used for detecting pseudo sound images and hidden sound images is effective. That is, if the sound sources corresponding to sound images other than the target sound source are deleted from the observation signal, and the peak position of the direction power based on the intensity difference (or based on the phase difference) is then extracted, more accurate values can be obtained in a reverberant environment.
[0038]
Note that each of the processes described above for improving the accuracy of sound source detection and sound source separation in a reverberant environment is effective when used alone, and they exhibit a synergistic effect when combined. They can be used selectively, for example by incorporating only the processing needed for the purpose.
[0039]
【Example】
An embodiment of the present invention will be described with reference to the drawings.
Referring to FIG. 2, for simplicity, assume that the observation signal consists of two channels, with the microphones placed at the ear positions of a robot whose shape and size are similar to the upper body of a human. The sound source directions, in which the loudspeakers are placed, are assumed to span 95° to the left and right (−95° being the left) of the front (0°) of the robot, and are handled in ranges of 10° (i = 1 to 19). The angle ranges $D_i$ shift by 10° each: $D_1$ is −95° to −85°, $D_2$ is −85° to −75°, and so on, so that $D_{10}$ corresponds to −5° to 5° and $D_{19}$ to 85° to 95°.
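The 19 direction ranges of this embodiment can be generated as in the following sketch. The function names are hypothetical, and an angle that lies on a shared boundary is assigned here to the lower-numbered range, which is an arbitrary choice.

```python
def direction_ranges(start=-95, stop=95, width=10):
    """Return [(low, high), ...] in degrees; list index 0 corresponds to D_1."""
    return [(lo, lo + width) for lo in range(start, stop, width)]

def range_index(angle, ranges):
    """1-based index i of the first range D_i containing angle, or None."""
    for i, (lo, hi) in enumerate(ranges, start=1):
        if lo <= angle <= hi:
            return i
    return None
```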
[0040]
(Sound source detection method using maximum value direction detection processing)
In FIG. 1, the inter-signal information extraction processing unit 1 first receives the digital input signal $x_i(t)$ of each channel i, applies a time window to divide it into short frames $x_{i,k}(t)$ (for example, about 30 ms), and obtains the frequency-domain signal $X_{i,k}(f)$ using a technique such as the short-time discrete Fourier transform. Next, the inter-channel time difference $\Delta t_{ij,k}(f)$ and intensity difference $\Delta I_{ij,k}(f)$ at each frequency are obtained from Expressions (3) and (5). Because the time difference is not always uniquely determined at high frequencies, all time differences satisfying Expression (4) are obtained as candidate values. Alternatively, in the high band where the time difference is not uniquely determined, the sound source direction may be determined using only the intensity difference information, without the condition on the time difference.
[0041]
In the direction signal extraction processing unit 2,
(Equation 7)

are calculated, set, and stored in advance, before the actual direction signal extraction processing is performed.
The direction signal extraction processing unit 2 determines a frequency that satisfies Expression (10) for each direction range based on the criterion determined by the direction signal extraction range determination processing unit 3. For any frequency having a plurality of time difference candidates, if any one of the candidates satisfies Expression (10), it is determined that the criterion is satisfied.
In each frame, the instantaneous direction power calculation processing unit 4 calculates the total sum of the powers of the frequencies determined to be included in each direction range as the instantaneous direction power for each direction range according to Expression (15).
[0042]
The maximum value direction detection processing unit 5 selects a direction range that gives the maximum value of the instantaneous direction power obtained in each frame, and replaces the sum of the powers in the other direction ranges with 0. That is, when calculating the successive direction power in the next successive direction power calculation processing unit 6, the power in the direction range other than the direction range that gives the maximum value in the instantaneous direction power of each frame is replaced with 0, and then added. It is expected that the direction range that gives the maximum value in each frame is relatively less affected by reverberation, while the other direction ranges are more strongly affected by reverberation. Therefore, by the operation of adding only the maximum value of the instantaneous directional power, it is possible to obtain the successive directional power in which more reliable information is preferentially considered.
[0043]
The successive direction power calculation processing unit 6 adds up the instantaneous direction power of each direction range according to Expression (16) over the frames in a fixed interval, for example about 2 seconds, to obtain the successive direction power.
The sound source position extraction processing unit 7 judges that a sound source exists in each direction in which the successive direction power has a peak exceeding a certain threshold θ, and detects it. Various methods of determining the threshold θ are conceivable. For example, the threshold can be determined adaptively according to the maximum value $\max\{P(D_i)\}$ of the successive direction power.
$\theta = \max\{\theta_0,\ c_1 \max\{P(D_i)\}\}$ ... (17)
[0044]
Here, $c_1$ is a constant that sets, relative to the direction power giving the maximum value, the threshold for detecting a sound source from the direction power in other directions; a value of about 0.1, for example, may be used. $\theta_0$ is a threshold for detecting the presence of sound when one sound source is emitting sound, and is determined according to the input level of the acoustic signal input device.
With the above processing, a sound source detection method can be constructed in which the successive direction power is calculated using only the maximum of the instantaneous direction power. Because the successive direction power is then calculated from inter-channel information that is relatively unaffected by reverberation, sound source detection can be performed more accurately.
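The adaptive threshold of Expression (17) and the peak test can be sketched as follows. A peak is treated here as a local maximum over neighbouring direction ranges, which is an assumption of this sketch; the function name is hypothetical.

```python
def detect_sources(power, theta0, c1=0.1):
    """Return indices of direction ranges judged to contain a sound source.

    power: successive direction power P(D_i) for each direction range.
    theta0: absolute floor tied to the input level of the recording device.
    c1: fraction of the strongest peak that a weaker peak must exceed.
    """
    theta = max(theta0, c1 * max(power))          # Expression (17)
    n = len(power)
    peaks = []
    for i, p in enumerate(power):
        left = power[i - 1] if i > 0 else float("-inf")
        right = power[i + 1] if i < n - 1 else float("-inf")
        # A peak is a local maximum of the direction power above the threshold.
        if p > theta and p >= left and p >= right:
            peaks.append(i)
    return peaks
```

A flat-topped peak spanning two adjacent ranges would be reported twice by this sketch; merging such neighbours is left out for brevity.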
[0045]
(Sound source detection method using sound source suppression processing)
By the way, according to the above-described sound source detection method, it is not possible to prevent a pseudo sound source from being detected and a hidden sound image from not being detected. In order to prevent this, it is effective to use the following sound source suppression processing technology. This is performed by adding a sound source suppression process to the above sound source detection method.
The sound source suppression processing unit 8 inputs the sound source direction detected by the sound source position extraction processing unit 7 and the direction signal extraction range determined by the direction signal extraction range determination processing unit 3, and calculates the direction range to be deleted. The signal is then output to the inter-signal information extraction processing unit 1.
[0046]
First, an embodiment of a method for excluding pseudo sound images will be described. Suppose that, by the procedure described above, a set of $n_0$ sound images $\{S_i^0\}$ (i = 1 to $n_0$) has been detected by the sound source position extraction processing unit 7. Next, one of the sound images, $S_j^0$, is temporarily suppressed from the observation signal by the sound source suppression processing unit 8. That is, in the frequency-domain signal, the frequency components that satisfy Expression (10) for the sound image $S_j^0$ are replaced with 0 in all channels, and the signal is then converted back to a time-domain signal. The same sound source position extraction processing is then performed with this signal as the new input. If, as a result, there is a sound image that can no longer be detected along with the suppressed sound image, the sound source position extraction processing unit 7 judges it to be a pseudo sound image. That is, letting $\{S_i^j\}$ (i = 1 to $n_j$) be the set of sound images detected when the sound image $S_j^0$ is suppressed, a sound image $S_i^0$ satisfying the following is called a pseudo sound image.
$\{S_i^0 \mid S_i^0 \ne S_k^j \ \text{for}\ \forall k,\ i \ne j\}$ ... (18)
Further, the above suppression operation is performed individually for every sound image in $\{S_j^0\}$ (j = 1 to $n_0$), and a sound image judged to be a pseudo sound image even once is determined to be a pseudo sound image.
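The pseudo sound image test can be sketched with plain set operations. The callback `detect_after_suppressing` is a hypothetical stand-in for suppressing one sound image and re-running the sound source position extraction, and sound images are compared here by exact direction value, which is a simplification.

```python
def find_pseudo_images(initial, detect_after_suppressing):
    """Return the sound images in `initial` judged to be pseudo sound images.

    initial: list of detected sound-image directions S_i^0.
    detect_after_suppressing: given one direction S_j^0, returns the set of
        directions detected after that image has been suppressed in the
        observation signal (hypothetical callback).
    """
    pseudo = set()
    for s_j in initial:
        remaining = detect_after_suppressing(s_j)
        # Images other than S_j^0 that vanished with it satisfy Expression (18).
        pseudo.update(s for s in initial if s != s_j and s not in remaining)
    return pseudo
```

For example, if suppressing the image at 20° also removes the image at 60°, the image at 60° is reported as a pseudo sound image.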
[0047]
Next, an embodiment for detecting a hidden sound image will be described. After the same operation as the sound source suppression operation above has been performed, the sound source position extraction processing unit 7 judges a sound image S that satisfies the following condition to be a hidden sound image.
$\{S \mid \{S \ne S_i^0 \ \text{for}\ \forall i\} \cap \{S = S_k^j \ \text{for}\ \exists k\ \forall j\}\}$ ... (19)
That is, among the sound images not included in $\{S_i^0\}$, a sound image that is always detected whichever single sound image in $\{S_i^0\}$ is suppressed is detected as a hidden sound image. However, this condition requires the hidden sound image to be detected no matter which sound image is erased, so it may be too strict depending on the purpose. In that case, instead of requiring Expression (19) to hold for all j, one possible method is to require it only for those $S_j^0$ whose power $P(S_j^0)$ is at least a constant multiple, $c_2 \max_l P(S_l^0)$, of the maximum power in the set of sound images $\{S_l^0\}$. $c_2$ is a positive number smaller than 1; for example, a value such as 0.1 can be considered.
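The hidden sound image condition, including the relaxed variant, can be sketched as follows. As before, `detect_after_suppressing` is a hypothetical callback that suppresses one sound image and returns the newly detected set, and the `powers` dictionary is an assumed data layout.

```python
def find_hidden_images(initial, powers, detect_after_suppressing, c2=None):
    """Return sound images that appear only after suppression (Expression 19).

    initial: list of detected sound-image directions S_j^0.
    powers: dict mapping each initial image to its power P(S_j^0).
    detect_after_suppressing: given S_j^0, returns the set of directions
        detected after suppressing it (hypothetical callback).
    c2: if given, only images with power >= c2 * max power are suppressed
        (the relaxed condition); if None, every initial image is suppressed.
    """
    strongest = max(powers[s] for s in initial)
    to_suppress = [s for s in initial
                   if c2 is None or powers[s] >= c2 * strongest]
    hidden = None
    for s_j in to_suppress:
        new_images = set(detect_after_suppressing(s_j)) - set(initial)
        # A hidden image must appear for every suppressed image considered.
        hidden = new_images if hidden is None else hidden & new_images
    return hidden or set()
```

With the strict condition, an image revealed only by suppressing the strongest source is rejected; passing `c2=0.1` restricts the test to the strong sources and accepts it.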
[0048]
Next, an embodiment of a more accurate sound image position extraction method using the sound source suppression processing unit 8 will be described. Consider one sound image $S_k$ included in the set of sound images $\{S_i\}$ (i = 1 to n) detected with the sound source detection method described above. First, the sound source suppression processing creates a signal in which all sound images in $\{S_i\}$ other than $S_k$ have been deleted from the observed signal. Since most components of the sound images other than the target sound image have been deleted from this signal, the sound source position extraction processing unit 7 can determine the position of $S_k$ more accurately. By obtaining the sound image based on the intensity difference from this signal in the same manner as described above, a more accurate value can be expected. Similarly, a sound image based on an accurate time difference can be obtained by performing the sound image extraction based on the time difference in the same manner. Note that the three sound source detection methods using the sound source suppression processing described above complement one another when combined, but each is also effective when used alone.
[0049]
(Sound source detection method using direction signal extraction range determination processing method)
In a reverberant environment, the statistical properties of the inter-signal intensity difference and time difference change. If a direction signal extraction range determined for the reverberation-free case is used, the acoustic signal in the target direction cannot be extracted properly, and good sound source detection performance cannot always be achieved. To prevent this, a more appropriate direction signal extraction range determination process is needed.
[0050]
Therefore, in the initial state, the direction signal extraction range determination processing unit 3 determines the direction range according to the reverberation conditions that are mainly expected. For example, if the reverberation conditions are not known, the direction signal extraction range determination processing unit 3 sets a relatively wide direction range so that it can cope with long reverberation.
[0051]
Further, after the actual acoustic signal is input, the direction signal extraction range determination processing unit 3 receives the successive direction power obtained by the successive direction power calculation processing unit 6 and the sound source direction detected by the sound source position extraction processing unit 7, determines the corresponding direction signal extraction range, and outputs it to the direction signal extraction processing unit 2 and the sound source suppression processing unit 8. In this way, a more appropriate direction signal extraction range can be determined. First, the sound source detection method described above yields an initial estimate D_source of the sound source position and a sound image direction D_image based on the intensity difference, and the two are found in directions shifted from each other. At the frequencies where the sound from each sound source is dominant, the inter-signal intensity difference is then expected to take values corresponding to the interval between D_source and D_image. The same can be said for the inter-signal time difference. Therefore, to extract such a sound source signal appropriately, the direction signal extraction range determination processing unit 3 only needs to control the direction signal extraction processing unit 2 so that it passes only the frequencies whose inter-signal information falls within this interval. This control can be realized by defining D in Expressions (11) and (14) as follows.
[0052]

Here, C₃ is a constant term that provides a margin around the sound source direction and takes a value of, for example, about 5°.
When a direction range shifted toward the sound image is used, it is better to take smaller values for C₁ and C₂ in Expressions (11) to (14), which determine the region that sound actually passes, than when an unshifted direction range is used. This is because D already indicates a more appropriate range, and widening the pass range further from there would often pick up sounds other than the target sound as well as reverberation. For example, C₁ = 2 and C₂ = 2 is one option.
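As a rough illustration of a direction range spanning D_source and D_image with a margin C₃, the following Python sketch assumes, as one plausible reading rather than the definition given in this description, that D is the interval between the two directions widened by C₃ on each side. The function names are hypothetical.

```python
from typing import Tuple


def extraction_range(d_source: float, d_image: float, c3: float = 5.0) -> Tuple[float, float]:
    """Assumed form: interval between the two directions, widened by c3 degrees on each side."""
    lower = min(d_source, d_image) - c3
    upper = max(d_source, d_image) + c3
    return lower, upper


def in_extraction_range(direction: float, direction_range: Tuple[float, float]) -> bool:
    """Check whether a direction (in degrees) lies inside the extraction range."""
    lower, upper = direction_range
    return lower <= direction <= upper


# Hypothetical values: source estimated at 30 degrees, sound image shifted to 40 degrees.
example_range = extraction_range(30.0, 40.0)
```

Under this assumption, a frequency whose inter-signal information corresponds to a direction inside `example_range` would be passed by the direction signal extraction processing, and all other frequencies would be rejected.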
[0053]
(Source separation)
An embodiment of the sound source separation method will be described. In each of the sound source detection methods described above, the direction signal extraction processing unit 2 determines, for the purpose of sound source detection, which specific frequencies contain sound arriving from the same direction as the target sound. Based on this direction determination, the direction signal extraction processing unit 2 replaces with 0 the frequencies determined to differ from the direction of the target sound, leaves the others unchanged, and applies a short-time inverse Fourier transform to the resulting frequency signal. This yields a time signal in which only the target sound is separated, and the sound source separation processing is thus achieved.
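The masking step can be sketched for a single analysis frame as follows. This is a minimal Python illustration, not the processing of the embodiment. It uses a naive discrete Fourier transform to stay self-contained, it assumes the per-frequency direction decision is already available as a Boolean list, and it omits windowing and overlap-add across frames. All function and variable names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_real(spectrum: Sequence[complex]) -> List[float]:
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def separate_frame(frame: Sequence[float], is_target_bin: Sequence[bool]) -> List[float]:
    """Set bins not attributed to the target direction to 0 and transform back to time."""
    spectrum = dft(frame)
    masked = [value if is_target_bin[k] else 0.0 for k, value in enumerate(spectrum)]
    return idft_real(masked)


# Toy mixture: the target occupies bin 2 and the interfering sound occupies bin 5.
N = 16
target = [math.cos(2 * math.pi * 2 * t / N) for t in range(N)]
interferer = [0.5 * math.cos(2 * math.pi * 5 * t / N) for t in range(N)]
mixture = [a + b for a, b in zip(target, interferer)]
keep = [k in (2, N - 2) for k in range(N)]  # keep the bin and its mirror image
separated = separate_frame(mixture, keep)
```

In this toy case each sound occupies its own frequency bin, so the masked frame reproduces the target exactly. With real speech the sounds overlap in frequency, and the mask only retains the bins in which the target is dominant.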
[0054]
[Effect of the Invention]
The performance of the present invention was evaluated for sound source detection and sound source separation on mixed sounds of two sound sources and of three sound sources. In each experiment, utterances of three speakers (two females, FKM and FSU, and one male, MAU) from the ATR word database (5240 words, 12 kHz, 16 bits) were used as speech data, 20 utterances per speaker. The impulse responses measured in the anechoic room and the variable reverberation room shown in FIG. 2 were convolved with the monaural speech data to synthesize binaural sound including reverberation. As for the sound source positions, FSU was placed at -60° (or -30°) and FKM at 30° (or 60°) in the case of two sound sources, and MAU was additionally placed at 0° in the case of three sound sources.
[0055]
(Evaluation of sound source detection rate)
FIGS. 3 and 4 show the results of evaluating the sound source detection performance in terms of recall and precision. FIG. 3 shows the case of two sound sources, and FIG. 4 shows the case of three sound sources. In each figure, (a) shows the recall and (b) shows the precision. When the number of sound sources to be detected is N, the number of correctly detected sound sources is N̂, and the total number of detected sound sources is N⁻, the recall is N̂/N and the precision is N̂/N⁻. In the experiment, a sound source was regarded as correctly detected when it was found within ±10° of the direction in which the sound source was placed. Example 1, shown as a comparative example of the present invention, is the result when sound source detection is performed without using the direction signal extraction range determination processing, the maximum value direction detection processing, and the sound source suppression processing in FIG. 1. Example 2 is the result when sound source detection is performed using the direction signal extraction range determination processing, the maximum value direction detection processing, and the sound source suppression processing in FIG. 1.
From FIGS. 3 and 4, it can be seen that, with the present invention, the recall is 90% or more in Example 1 and is further improved by about 0 to 5% in Example 2, while the precision is improved by about 20% for two sound sources and by about 10% for three sound sources.
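The recall and precision defined above can be computed as in the following minimal Python sketch. The ±10° tolerance comes from the text. The function name, the argument names, and the example directions are hypothetical, and a true source is counted once even if several detections fall within its tolerance.

```python
from typing import Sequence, Tuple


def detection_scores(
    true_dirs: Sequence[float],
    detected_dirs: Sequence[float],
    tol: float = 10.0,
) -> Tuple[float, float]:
    """Return (recall, precision) for detected directions given in degrees.

    N is len(true_dirs), the number of correct detections is the number of true
    sources with at least one detection within tol degrees, and the total number
    of detections is len(detected_dirs).
    """
    n_correct = sum(
        any(abs(true - det) <= tol for det in detected_dirs) for true in true_dirs
    )
    recall = n_correct / len(true_dirs) if true_dirs else 0.0
    precision = n_correct / len(detected_dirs) if detected_dirs else 0.0
    return recall, precision


# Hypothetical detection result for the three-source arrangement (-60, 30, 0 degrees).
recall, precision = detection_scores([-60.0, 30.0, 0.0], [-55.0, 28.0, 75.0])
```

In this made-up example two of the three sources are found and one of the three detections is spurious, so both scores are 2/3.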
[0056]
(Evaluation of separated sound)
FIG. 5 shows the results of evaluating the quality of the separated sounds using spectral distortion based on the LPC cepstrum distance measure. FIG. 5A shows the case of two sound sources, and FIG. 5B shows the case of three sound sources. In the evaluation, the binaural sound of a single sound source with a reverberation time of 0 seconds was used as the correct spectral shape. This makes it possible to evaluate the difference between the sound separated from a mixed sound including reverberation and the binaural sound of a single sound source with a reverberation time of 0 seconds. Example 1 is the result when sound source separation is performed without using the direction signal extraction range determination processing, the maximum value direction detection processing, and the sound source suppression processing in FIG. 1. Example 2 is the result when sound source separation is performed using the direction signal extraction range determination processing, the maximum value direction detection processing, and the sound source suppression processing in FIG. 1. To compare only the separation performance, the sound source direction was assumed to be known. From FIG. 5, it can be seen that, according to the present invention, the quality in Example 2 is improved by about 2 to 5 dB in each reverberation condition compared with Example 1.
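For reference, a commonly used form of the cepstral distance in dB between two frames is (10 / ln 10) · sqrt(2 · Σ (c_k − c'_k)²), where c_k and c'_k are the cepstral coefficients of the two frames. The Python sketch below implements this form. Whether the evaluation above used exactly this variant, and how many coefficients it used, is not stated in the text, so both are assumptions. The computation of LPC cepstra from the waveform is omitted, and the names are hypothetical.

```python
import math
from typing import Sequence


def cepstral_distance_db(ref_cepstrum: Sequence[float], test_cepstrum: Sequence[float]) -> float:
    """Cepstral distance in dB between two frames (assumed common form).

    Both inputs hold the cepstral coefficients c_1 .. c_p of one frame;
    the zeroth (gain) coefficient is excluded.
    """
    if len(ref_cepstrum) != len(test_cepstrum):
        raise ValueError("cepstra must have the same order")
    squared_sum = sum((r - t) ** 2 for r, t in zip(ref_cepstrum, test_cepstrum))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_sum)


# Hypothetical two-coefficient cepstra for a reference frame and a separated frame.
example_distance = cepstral_distance_db([0.1, 0.0], [0.0, 0.0])
```

Averaging this per-frame distance over all frames of an utterance gives a single distortion figure in dB, where a smaller value means a spectral shape closer to the reference.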
[Brief description of the drawings]
FIG. 1 illustrates an embodiment.
FIG. 2 is a diagram illustrating binaural recording in a variable reverberation room.
FIG. 3 is a diagram illustrating sound source detection results of two sound sources according to the first and second embodiments.
FIG. 4 is a diagram showing sound source detection results of three sound sources according to the first and second embodiments.
FIG. 5 is a diagram illustrating spectral distortion of a separated sound according to the first and second embodiments.
FIG. 6 is a diagram illustrating a conventional example of a sound source detection method.
FIG. 7 is a diagram illustrating a conventional example of a sound source separation method.
[Explanation of symbols]
1 Inter-signal information extraction processing unit
2 Direction signal extraction processing unit
3 Direction signal extraction range determination processing unit
4 Instantaneous direction power calculation processing unit
5 Maximum value direction detection processing unit
6 Successive direction power calculation processing unit
7 Sound source position extraction processing unit
8 Sound source suppression processing unit

Claims

A sound source detection method characterized by receiving acoustic signal inputs observed by a plurality of sensors and performing, on the received acoustic signal inputs, an inter-signal information extraction process, a direction signal extraction process, an instantaneous direction power calculation process, and a sound source position extraction process, or additionally a successive direction power calculation process.

The sound source detection method according to claim 1,
characterized in that the average value and the standard deviation of the intensity difference between signal channels and the average value and the standard deviation of the time difference between signal channels are calculated and set in advance for the direction signal extraction process.

The sound source detection method according to claim 1 or 2,
characterized by further performing, in combination, at least one of a maximum value direction detection process, a direction signal extraction range determination process, and a sound source suppression process.

A sound source separation method characterized by receiving acoustic signal inputs observed by a plurality of sensors, carrying out a sound source detection method that performs, on the received acoustic signal inputs, an inter-signal information extraction process, a direction signal extraction process, an instantaneous direction power calculation process, and a sound source position extraction process, or additionally a successive direction power calculation process, and outputting the acoustic signal extracted by the direction signal extraction process as a separated sound.

The sound source separation method according to claim 4,
characterized in that the average value and the standard deviation of the intensity difference between signal channels and the average value and the standard deviation of the time difference between signal channels are calculated and set in advance for the direction signal extraction process.

The sound source separation method according to claim 4 or 5,
characterized in that the sound source detection method further performs, in combination, at least one of a maximum value direction detection process, a direction signal extraction range determination process, and a sound source suppression process.

A sound source detection device comprising an inter-signal information extraction processing unit that receives observation signals of a plurality of channels and obtains an inter-channel time difference signal and an inter-channel intensity difference signal,
a direction signal extraction processing unit in which statistics, namely the average value and the standard deviation of the inter-channel signal intensity difference and the average value and the standard deviation of the inter-channel signal time difference under conditions with no interfering sound and no reverberation, are calculated, set, and stored in advance, and which extracts a direction signal within a certain direction range,
an instantaneous direction power calculation processing unit that obtains, as the instantaneous direction power, the sum of the powers of the frequencies determined to be included in the direction range, and
a sound source position extraction processing unit that receives a direction power and outputs a sound source position,
or additionally comprising a successive direction power calculation processing unit that obtains a successive direction power by accumulating the instantaneous direction power for each direction range.

The sound source detection device according to claim 7, further comprising, in combination, at least one of the following three processing units:
a maximum value direction detection processing unit that selects the direction range giving the maximum value of the obtained instantaneous direction power and replaces the sums of powers in the other direction ranges with 0;
a direction signal extraction range determination processing unit that receives the successive direction power calculated by the successive direction power calculation processing unit and the sound source direction detected by the sound source position extraction processing unit, determines the corresponding direction signal extraction range, and outputs it to the direction signal extraction processing unit; and
a sound source suppression processing unit that receives the sound source direction detected by the sound source position extraction processing unit and the direction signal extraction range determined by the direction signal extraction range determination processing unit, calculates the direction range to be deleted, and outputs it to the inter-signal information extraction processing unit.

A sound source separation device comprising an inter-signal information extraction processing unit 1 that receives observation signals of a plurality of channels and obtains an inter-channel time difference signal and an inter-channel intensity difference signal,
a direction signal extraction processing unit in which statistics, namely the average value and the standard deviation of the inter-channel signal intensity difference and the average value and the standard deviation of the inter-channel signal time difference under conditions with no interfering sound and no reverberation, are calculated, set, and stored in advance, and which extracts a direction signal within a certain direction range,
an instantaneous direction power calculation processing unit that obtains, as the instantaneous direction power, the sum of the powers of the frequencies determined to be included in the direction range, and
a sound source position extraction processing unit that receives the direction power and outputs a sound source position, or additionally a successive direction power calculation processing unit that obtains a successive direction power by accumulating the instantaneous direction power for each direction range,
wherein a separated sound output unit that outputs an acoustic signal as a separated sound is formed in the direction signal extraction processing unit.

The sound source separation device according to claim 9, further comprising, in combination, at least one of the following three processing units:
a maximum value direction detection processing unit that selects the direction range giving the maximum value of the obtained instantaneous direction power and replaces the sums of powers in the other direction ranges with 0;
a direction signal extraction range determination processing unit that receives the successive direction power calculated by the successive direction power calculation processing unit and the sound source direction detected by the sound source position extraction processing unit, determines the corresponding direction signal extraction range, and outputs it to the direction signal extraction processing unit; and a sound source suppression processing unit that receives the sound source direction detected by the sound source position extraction processing unit and the direction signal extraction range determined by the direction signal extraction range determination processing unit, calculates the direction range to be deleted, and outputs it to the inter-signal information extraction processing unit 1,
wherein a separated sound output unit that outputs an acoustic signal as a separated sound is formed in the direction signal extraction processing unit.