JP2014048399A

JP2014048399A - Sound signal analyzing device, method and program

Info

Publication number: JP2014048399A
Application number: JP2012190189A
Authority: JP
Inventors: Hirokazu Kameoka; 弘和亀岡; Misa Sato; 美沙佐藤; Takuma Ono; 拓磨小野; Junki Ono; 順貴小野; Shigeki Sagayama; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp; Research Organization of Information and Systems; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; Research Organization of Information and Systems; University of Tokyo NUC
Priority date: 2012-08-30
Filing date: 2012-08-30
Publication date: 2014-03-17
Anticipated expiration: 2032-08-30
Also published as: JP5911101B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound signal analyzing device, method, and program capable of accurately separating time-series data of sound signals output from a plurality of microphones into the sound source signal of each sound source even when the number of sound sources is unknown. SOLUTION: A parameter update section 23 assumes that one sound source is active at each time t for each frequency ω; models the prior distribution of the probability that each sound source k becomes active with a stick-breaking construction; models the prior distribution of an indicator representing the index of the arrival direction of each sound source k with a discrete probability distribution; models the prior distribution of the probability that each sound source k takes the arrival direction of each index with a Dirichlet distribution; and, on the basis of variational inference, repeatedly updates the parameters of a variational function approximating the posterior distribution of the parameters of the generative model of the sound source signals. A time-frequency component estimation section 26 estimates the time-frequency components of the sound source signal of each sound source k using the finally updated parameters.

Description

The present invention relates to an acoustic signal analysis apparatus, method, and program, and more particularly to an acoustic signal analysis apparatus, method, and program that separate time-series data of acoustic signals output from a plurality of microphones into the signals of the individual sound sources.

Blind source separation (BSS) is the technique of separating and extracting individual sound source components from microphone input signals when both the source components and the transfer characteristics from the sources to the microphones are unknown. Because BSS must estimate the source signals and their mixing process from the observed signals alone, it is usually formulated as an optimization problem: some assumption is made about the sources, and both unknowns are estimated under the criterion that this assumption yields. For example, when the number of observed signals is at least the number of sources, a well-known method is independent component analysis, which performs maximum likelihood estimation of the separation filter under the assumption that the source signal components follow a super-Gaussian distribution. However, to realize a BSS system that does not assume the number of sources, one must allow for the underdetermined setting in which there are more sources than microphones, and independent component analysis cannot be applied as is. Under underdetermined conditions the solution cannot be determined uniquely even if the mixing process is known, so an assumption about the sources that is stronger than independence is required.

Underdetermined BSS is a difficult ill-posed problem. When the number of sources is known, however, time-frequency-domain BSS approaches that exploit sparseness, the property that speech energy is dominant in only a small part of the time-frequency domain, have recently attracted attention as effective (Non-Patent Document 1). Sparseness of speech means that the time-frequency components of a speech signal are nearly zero in most regions. Consequently, even when several talkers speak simultaneously, it can often be assumed that their time-frequency components hardly overlap at any given time-frequency point. Under this assumption, the central issue of this approach is how to design a time-frequency mask that passes only the time-frequency components of the target speech signal. The approach must also solve the so-called permutation alignment problem of grouping the components separated at each frequency by source; in the prior art, permutation alignment was often performed as a post-processing stage after frequency-wise signal separation.

H. Sawada, S. Araki, and S. Makino, “Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment,” IEEE Trans. ASLP, vol. 19, no. 3, pp. 516-527, 2010.

BSS for speech signals is expected to find many applications, such as hands-free videoconferencing systems and systems that automatically produce meeting minutes. In real environments, however, it is difficult to anticipate the number of sources in advance: in a meeting, for example, the number of participants may change partway through, or a door may suddenly open or close. Most conventional BSS algorithms operate with an assumed number of sources and may fail to perform well when that number differs from the actual number of sources.

It is therefore desirable to realize a BSS system that operates while autonomously inferring the number of sources rather than assuming it. When the number of sources is known, the time-frequency-domain BSS approach that exploits sparseness, the property that speech energy is dominant in only a small part of the time-frequency domain, is effective; the problem is how to solve BSS when the number of sources is unknown.

The observed signal at each microphone is usually expressed as a convolutive mixture that includes time delays of the source signals. If a time-frequency decomposition (short-time Fourier transform, wavelet transform, etc.) is used whose time window is sufficiently long relative to the length of the impulse response from source to microphone, the convolutive mixture can be approximated by an instantaneous mixture. BSS based on this observation model is called frequency-domain BSS. Compared with BSS based on the time-domain convolutive mixture model, it allows algorithms with a low computational cost and can incorporate the speech sparseness assumption described above, but it must deal with the permutation alignment problem of grouping the components separated at each frequency by source. In conventional approaches, permutation alignment was often performed as post-processing after frequency-wise signal separation, yet the inter-frequency relationships and spectral structure of the sources that serve as cues for permutation alignment should also help improve the separation accuracy at each frequency. The problem, then, is how to solve permutation alignment and source separation simultaneously.
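As a rough illustration of the time-frequency decomposition mentioned above, the following sketch computes a short-time Fourier transform of multichannel signals with NumPy and arranges the result as an array indexed by microphone, frequency, and time. The function name `stft_multichannel`, the Hann window, and the frame and hop sizes are assumptions made for this example and are not taken from the patent.

```python
import numpy as np

def stft_multichannel(signals, frame_len, hop):
    """Short-time Fourier transform of multichannel signals.

    signals: real array of shape (M, N), one row per microphone.
    Returns a complex array of shape (M, W, T) with W = frame_len // 2 + 1
    frequency bins and T frames (illustrative layout, not from the patent).
    """
    num_samples = signals.shape[1]
    num_frames = 1 + (num_samples - frame_len) // hop
    window = np.hanning(frame_len)              # assumed analysis window
    starts = hop * np.arange(num_frames)
    # Slice every channel into overlapping frames: shape (M, T, frame_len).
    frames = np.stack([signals[:, s:s + frame_len] for s in starts], axis=1)
    spectrum = np.fft.rfft(frames * window, axis=-1)   # shape (M, T, W)
    return spectrum.transpose(0, 2, 1)                 # shape (M, W, T)
```

For the instantaneous-mixture approximation to be reasonable, `frame_len` would need to be long relative to the room impulse response, as the paragraph above notes.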

The present invention has been made in view of the above circumstances, and its object is to provide an acoustic signal analysis apparatus, method, and program that can accurately separate time-series data of acoustic signals output from a plurality of microphones into the sound source signal of each sound source even when the number of sources is unknown.

To achieve the above object, an acoustic signal analysis apparatus according to the present invention comprises the following means.
Time-frequency analysis means receives, as input, time-series data of acoustic signals output from M microphones m (M is an integer of 2 or more) and outputs a three-dimensional array y^ whose elements are the observed time-frequency components y_{m,ω,t} (m is the microphone index, t is the time index, and ω is the frequency index).
Initial setting means sets an initial value for each of the following: for each frequency ω, the parameters (mean vector m^_{k,ω} and covariance matrix Γ_{k,ω} of a complex normal distribution) of the posterior distribution of each vector a^_{k,ω} contained in a three-dimensional array a^ whose elements are the transfer frequency characteristics a_{m,k,ω} of the sound source signal from each of D sound sources k (D is an integer of 1 or more) to each microphone m; the parameters (mean μ_{ω,t} and variance σ²_{ω,t} of a complex normal distribution) of the posterior distribution of each element s_{ω,t} of a two-dimensional array s^ whose elements are the time-frequency components s_{ω,t} of the sound source that is active at each time t for each frequency ω; the parameters (parameters φ_{k,ω,t} of a discrete probability distribution) of the posterior distribution of each element z_{ω,t} of a two-dimensional array z^ whose elements are indicators z_{ω,t} giving the index of the sound source that becomes active at each time t for each frequency ω; the parameters (parameters γ_{k,0} and γ_{k,1} of a beta distribution) of the posterior distribution of each element v_k of a vector v^ whose elements are the probabilities v_k that each sound source k becomes active; the parameters (parameters ψ_{k,i} of a discrete probability distribution) of the posterior distribution of each element x_k of a vector x^ whose elements are indicators x_k giving the index i of the arrival direction of each sound source k; and the parameters (parameters ζ_{k,i} of a Dirichlet distribution) of the posterior distribution of each vector ρ^_k of a two-dimensional array ρ^ whose elements are the probabilities ρ_{k,i} that each sound source k has the arrival direction of each index i.
Parameter updating means takes as its objective function a function representing the divergence between the posterior distribution p(a^, s^, z^, v^, x^, ρ^ | y^) of the three-dimensional array a^, the two-dimensional array s^, the two-dimensional array z^, the vector v^, the vector x^, and the two-dimensional array ρ^ given the three-dimensional array y^, and a variational function q(a^, s^, z^, v^, x^, ρ^). So as to minimize this objective function on the basis of variational inference, it updates the mean vector m^_{k,ω} and the covariance matrix Γ_{k,ω} for each of all combinations of (k, m), the mean μ_{ω,t} and the variance σ²_{ω,t} for each of all combinations of (ω, t), the parameter φ_{k,ω,t} for each of all combinations of (k, ω, t), the parameters γ_{k,0}, γ_{k,1} and the parameter ψ_{k,i} for all k, and the parameter ζ_{k,i} for each of all combinations of (k, i).
End determination means causes the update by the parameter updating means to be repeated until a predetermined end condition is satisfied.
Sound source signal estimation means estimates, for each of all combinations of (k, ω, t), the time-frequency component s_{k,ω,t} of the sound source signal of sound source k on the basis of the parameter φ_{k,ω,t} and the mean μ_{ω,t}.

An acoustic signal analysis method according to the present invention comprises the following steps.
The time-frequency analysis means receives, as input, time-series data of acoustic signals output from M microphones m (M is an integer of 2 or more) and outputs a three-dimensional array y^ whose elements are the observed time-frequency components y_{m,ω,t} (m is the microphone index, t is the time index, and ω is the frequency index).
The initial setting means sets an initial value for each of the following: for each frequency ω, the parameters (mean vector m^_{k,ω} and covariance matrix Γ_{k,ω} of a complex normal distribution) of the posterior distribution of each vector a^_{k,ω} contained in a three-dimensional array a^ whose elements are the transfer frequency characteristics a_{m,k,ω} of the sound source signal from each of D sound sources k (D is an integer of 1 or more) to each microphone m; the parameters (mean μ_{ω,t} and variance σ²_{ω,t} of a complex normal distribution) of the posterior distribution of each element s_{ω,t} of a two-dimensional array s^ whose elements are the time-frequency components s_{ω,t} of the sound source that is active at each time t for each frequency ω; the parameters (parameters φ_{k,ω,t} of a discrete probability distribution) of the posterior distribution of each element z_{ω,t} of a two-dimensional array z^ whose elements are indicators z_{ω,t} giving the index of the sound source that becomes active at each time t for each frequency ω; the parameters (parameters γ_{k,0} and γ_{k,1} of a beta distribution) of the posterior distribution of each element v_k of a vector v^ whose elements are the probabilities v_k that each sound source k becomes active; the parameters (parameters ψ_{k,i} of a discrete probability distribution) of the posterior distribution of each element x_k of a vector x^ whose elements are indicators x_k giving the index i of the arrival direction of each sound source k; and the parameters (parameters ζ_{k,i} of a Dirichlet distribution) of the posterior distribution of each vector ρ^_k of a two-dimensional array ρ^ whose elements are the probabilities ρ_{k,i} that each sound source k has the arrival direction of each index i.
The parameter updating means takes as its objective function a function representing the divergence between the posterior distribution p(a^, s^, z^, v^, x^, ρ^ | y^) of the three-dimensional array a^, the two-dimensional array s^, the two-dimensional array z^, the vector v^, the vector x^, and the two-dimensional array ρ^ given the three-dimensional array y^, and a variational function q(a^, s^, z^, v^, x^, ρ^). So as to minimize this objective function on the basis of variational inference, it updates the mean vector m^_{k,ω} and the covariance matrix Γ_{k,ω} for each of all combinations of (k, m), the mean μ_{ω,t} and the variance σ²_{ω,t} for each of all combinations of (ω, t), the parameter φ_{k,ω,t} for each of all combinations of (k, ω, t), the parameters γ_{k,0}, γ_{k,1} and the parameter ψ_k for all k, and the parameter ζ_{k,i} for each of all combinations of (k, i).
The end determination means causes the update by the parameter updating means to be repeated until a predetermined end condition is satisfied.
The sound source signal estimation means estimates, for each of all combinations of (k, ω, t), the time-frequency component s_{k,ω,t} of the sound source signal of sound source k on the basis of the parameter φ_{k,ω,t} and the mean μ_{ω,t}.
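The final estimation step can be pictured as weighting the posterior mean μ_{ω,t} by the source responsibilities φ_{k,ω,t}. The text above states only that the estimate is based on these two quantities, so the product form below, the optional hard assignment, and the function name `estimate_sources` are assumptions of this sketch rather than the claimed procedure.

```python
import numpy as np

def estimate_sources(phi, mu, hard=False):
    """Illustrative per-source estimate from variational parameters.

    phi: (K, W, T) posterior probabilities that source k is active in bin (w, t).
    mu:  (W, T) posterior means of the active-source component.
    Returns an array of shape (K, W, T).
    """
    if hard:
        # Assumed variant: assign each bin entirely to its most probable source.
        mask = np.zeros_like(phi)
        winners = phi.argmax(axis=0)
        np.put_along_axis(mask, winners[None, :, :], 1.0, axis=0)
    else:
        # Assumed variant: use the responsibilities as a soft time-frequency mask.
        mask = phi
    return mask * mu[None, :, :]
```

With the soft variant, the per-source estimates sum over k to μ_{ω,t} whenever the responsibilities sum to one in each bin.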

The program according to the present invention is a program for causing a computer to function as each means of the above acoustic signal analysis apparatus.

As described above, the acoustic signal analysis apparatus, method, and program of the present invention use a generative model of the sound source signals in which one sound source is active at each time t for each frequency ω, the probability that each sound source k becomes active is generated from a beta distribution, the indicator giving the index of the arrival direction of each sound source k is generated from a discrete probability distribution, and the probability that each sound source k takes the arrival direction of each index is generated from a Dirichlet distribution. Each parameter is updated on the basis of variational inference so as to minimize an objective function based on the posterior distribution of the parameters of this model, and the posterior distribution of the time-frequency components of the sound source signal of each sound source k is estimated. This yields the effect that time-series data of acoustic signals output from a plurality of microphones can be accurately separated into the sound source signal of each sound source even when the number of sources is unknown.

It is a schematic diagram showing the configuration of the acoustic signal analysis apparatus according to an embodiment of the present invention.
It is a flowchart showing the content of the acoustic signal analysis processing routine in the acoustic signal analysis apparatus according to the embodiment of the present invention.
It is a diagram showing the relationship between the SDR and K or K* obtained for the conventional method and the proposed method.
It is a graph showing the inter-channel phase difference of the room impulse responses used to synthesize the observed signals.
It is a graph showing the inter-channel phase difference calculated from the estimated transfer frequency characteristics.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The object of the present invention is to solve three problems at once: (1) inference of the number of sources, (2) underdetermined frequency-domain BSS assuming speech sparseness, and (3) permutation alignment. The proposed method achieves this by using the variational Bayesian method to estimate, given the observed signals, the posterior probability of which source is likely to be active in each time-frequency bin, the posterior probability of how readily each source becomes active (that is, how many sources appear to be present), the posterior distribution of the time-frequency components of each source, the posterior distribution of the steering vector of each source, and the posterior probability of the direction from which each source arrived.

<Principle of the invention>
First, the principle of the present invention will be described.

<Observation model>
First, the observation model will be described. Consider observing sound source signals arriving from K signal sources (K is an integer of 1 or more) with M microphones (M is an integer of 2 or more). Let y_m(ω,t) denote the time-frequency component of the signal observed at the m-th microphone and s_k(ω,t) the time-frequency component of the k-th sound source signal, and define y^(ω,t) = (y_1(ω,t), ..., y_M(ω,t))^T ∈ C^M and s^(ω,t) = (s_1(ω,t), ..., s_K(ω,t))^T ∈ C^K, where 1 ≤ ω ≤ Ω and 1 ≤ t ≤ T are the indices corresponding to frequency and time, respectively. As stated earlier, in the time-frequency domain the observed signal y^(ω,t) can be approximated as an instantaneous mixture of s^(ω,t), as in equation (1). The "^" attached to a symbol indicates that the symbol is a matrix, a multidimensional array, or a vector.

The matrix a^(ω) = (a_{m,k}(ω))_{M×K} = (a_1(ω), ..., a_K(ω)) ∈ C^{M×K}, whose elements are the transfer frequency characteristics a_{m,k}(ω) from signal source k to microphone m, is called the mixing matrix and is assumed below to be time-invariant. n^(ω,t) represents components that cannot be expressed by a time-invariant transfer characteristic, such as background noise arriving from many directions and reverberation components longer than the frame length. If speech sparseness can be assumed, and the index of the source that is active (dominant) in each time-frequency bin (ω,t) is denoted by z_{ω,t} ∈ {1, ..., K}, equation (1) can be rewritten as equation (2).
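The images of equations (1) and (2) are not reproduced in this text. Forms consistent with the definitions given above would be the following; the bold-face notation, standing in for the "^" marker, and the exact typesetting are assumptions of this reconstruction.

```latex
\begin{align}
\mathbf{y}(\omega,t) &= \mathbf{a}(\omega)\,\mathbf{s}(\omega,t) + \mathbf{n}(\omega,t) \tag{1}\\
\mathbf{y}(\omega,t) &= \mathbf{a}_{z_{\omega,t}}(\omega)\,s(\omega,t) + \mathbf{n}(\omega,t) \tag{2}
\end{align}
```

In (2), only the column of the mixing matrix belonging to the active source z_{ω,t} contributes, which is what the sparseness assumption amounts to.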

This observation model assumes that all components other than that of the z_{ω,t}-th source are zero, so a single variable suffices to represent the source component in each time-frequency bin. For this reason the index k of s_k(ω,t) is omitted in equation (2). That is, s(ω,t) is not the component of one particular source but a variable representing the component of the single source that is active in each time-frequency bin. To save space, ω and t are written as subscripts from here on.

<Generative model>
<Observation signal generation process>
Based on the observation model described above, the process by which the observed signals are generated is described by a generative model.

First, assume that the noise component n^_{ω,t} follows a complex normal distribution with mean 0^ and covariance Σ^(n)_ω. Then, if a^_{1:K,ω} = {a_{1,ω}, ..., a_{K,ω}}, s_{ω,t}, and the identity of the source active in each time-frequency bin, namely z_{ω,t}, are known, it follows from equation (2) that y_{ω,t} is generated according to equation (3).
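The generation process just described can be sketched numerically as follows: for every bin, pick the transfer vector of the active source, scale it by the source component, and add complex Gaussian noise. This is a minimal sketch under stated assumptions: the noise covariance is taken to be isotropic (the text allows a general Σ^(n)_ω), indices are 0-based, and the function names are hypothetical.

```python
import numpy as np

def complex_normal(rng, shape, scale=1.0):
    """Circular complex Gaussian samples whose variance is scale**2."""
    real = rng.standard_normal(shape)
    imag = rng.standard_normal(shape)
    return scale * (real + 1j * imag) / np.sqrt(2.0)

def simulate_observations(a, z, s, noise_std, rng):
    """Draw y[m, w, t] = a[m, z[w, t], w] * s[w, t] + noise.

    a: (M, K, W) transfer frequency characteristics.
    z: (W, T) integer index (0-based) of the active source in each bin.
    s: (W, T) time-frequency component of the active source.
    noise_std: standard deviation of isotropic noise (an assumption).
    """
    num_mics, _, num_freqs = a.shape
    num_frames = z.shape[1]
    freq_idx = np.arange(num_freqs)[:, None]     # (W, 1), broadcasts against z
    a_active = a[:, z, freq_idx]                 # (M, W, T)
    noise = complex_normal(rng, (num_mics, num_freqs, num_frames), noise_std)
    return a_active * s[None, :, :] + noise
```

Setting `noise_std` to zero reduces the model to the noiseless sparse mixture, which is convenient for checking the indexing.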

<Infinite sound source mixture model>
The above assumed a generation process for the observed signals in which z_{ω,t} is known, but information about which source is active in each time-frequency bin normally cannot be observed. In this embodiment, therefore, the indicator z_{ω,t} giving the index of the active source is regarded as a latent variable, and its generation process is modeled. First, when the number of sources is K, z_{ω,t} is assumed to be generated by the process of equation (4), in which one index is chosen from the set of source indices {1, ..., K} according to a discrete distribution.

Here π^ = (π_1, ..., π_K) are probability values expressing how likely each index is to be chosen in the discrete distribution, with Σ_{k=1}^{K} π_k = 1. These probability values are further assumed to be generated from the Dirichlet distribution of equation (5).

Up to this point a finite number of sources K has been assumed. Taking the limit K → ∞, equations (3), (4), and (5) become a Dirichlet process mixture model, which can be expressed by equations (6) to (9).

GEM(α_0) is one concrete construction of the π_1, π_2, ... generated by a Dirichlet process (see T. S. Ferguson, “A Bayesian analysis of some nonparametric problems,” Annals of Statistics, vol. 1, no. 2, pp. 209-230, 1973). This construction is called the stick-breaking process (see J. Sethuraman, “A constructive definition of Dirichlet priors,” Statistica Sinica, vol. 4, pp. 639-650, 1994) and is given by equations (10) and (11).

The π_1, π_2, ... generated by the above process tend, on average, to decrease exponentially as k increases, which means that sources with a larger index k are less likely to become active. Consequently, when the parameters are inferred from the observed signals, the model tends to explain the observations with a mixture model that uses the minimum necessary number of source indices. The above generative model of y_{ω,t} is called the "infinite sound source mixture model".
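The images of equations (10) and (11) are not reproduced in this text. A minimal sketch of the standard stick-breaking construction that GEM(α_0) conventionally denotes is given below: each v_k is drawn from Beta(1, α_0) and π_k = v_k ∏_{l<k} (1 - v_l). The finite truncation level and the function name `stick_breaking_weights` are assumptions introduced so that the example can run.

```python
import numpy as np

def stick_breaking_weights(alpha0, truncation, rng):
    """Draw the first `truncation` weights of a stick-breaking process.

    alpha0: concentration parameter of GEM(alpha0).
    truncation: number of weights to generate (an assumption; the model
    itself has infinitely many).
    """
    # Fraction broken off the remaining stick at each step.
    v = rng.beta(1.0, alpha0, size=truncation)
    # Length of stick remaining before step k: prod_{l<k} (1 - v_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining
```

Smaller values of `alpha0` concentrate the weight on the first few indices, which corresponds to favoring explanations with few sources.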

<Mixture DOA model>
To incorporate a permutation alignment function into the above model, the generation process of equation (9) is modeled as follows. So far the transfer frequency characteristics a_{k,ω} of each source have been treated as if they were independent variables for each frequency ω. If, however, each source can be assumed to arrive as a plane wave from a single direction, then, for example when the number of microphones is 2, the relationship of the transfer frequency characteristics across ω is expressed explicitly as a function of the direction of arrival (DOA) θ, as in equation (12).

Here 0 ≤ θ < 2π, h_ω is the angular frequency (rad/s) corresponding to frequency index ω, D is the microphone spacing (m), and C is the speed of sound (m/s). In practice, a_{k,ω} is expected to deviate from this theoretical expression because of reverberation, the instantaneous-mixture approximation in the frequency domain, and similar effects. Therefore, when the arrival direction θ_k is known, a_{k,ω} is assumed to be generated from a complex normal distribution centered on the theoretical value given by equation (12) for θ = θ_k. Naturally, however, the arrival direction θ_k cannot actually be observed. If it is regarded as a latent variable, the generative model of a_{k,ω} becomes a mixture model with the DOA as its latent variable. If this is incorporated into the infinite sound source mixture model described above and the parameters of the whole generative model can be inferred given the observed signals, source separation and permutation alignment can potentially be solved simultaneously.
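Equation (12) itself is not reproduced in this text. A common plane-wave form for two microphones, with the first microphone as the phase reference, is sketched below to show how a single angle ties the transfer characteristics together across frequency. The sign of the phase term, the default speed of sound, and the function name `plane_wave_steering` are assumptions and may differ from the patent's equation.

```python
import numpy as np

def plane_wave_steering(theta, h, spacing, speed_of_sound=340.0):
    """Two-microphone steering vectors for a plane wave from angle theta.

    theta: direction of arrival in radians.
    h: angular frequencies (rad/s), one per frequency index.
    spacing: microphone spacing in meters.
    Returns an array of shape (2, W); row 0 is the reference microphone.
    """
    h = np.asarray(h, dtype=float)
    # Time difference of arrival between the two microphones.
    delay = spacing * np.cos(theta) / speed_of_sound
    reference = np.ones_like(h, dtype=complex)
    second = np.exp(-1j * h * delay)     # assumed sign convention
    return np.stack([reference, second], axis=0)
```

Because the inter-channel phase grows linearly with `h` for a fixed `theta`, components at different frequencies that share the same angle can be attributed to the same source.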

First, a set of I DOA candidate values Θ_1, ..., Θ_I (all constants) is prepared. For example, take the angles Θ_i = (i−1)π/I (i = 1, ..., I), which divide 180 degrees into I equal parts. Assuming that the DOA of each sound source is determined by selecting one of these candidate values, the process that generates θ_k can be described by the following equations (13) and (14).

Here, ρ_k^ = (ρ_{k,1}, ..., ρ_{k,I}). x_k ∈ {1, ..., I} is an indicator variable representing which DOA candidate value is assigned to the k-th sound source, and equation (13) states that it is generated from a discrete distribution with probabilities ρ_{k,1}, ..., ρ_{k,I}. This process determines the DOA of each sound source, and the transfer frequency characteristic a_k(ω) is then generated according to the following equation (15).

The Dirichlet distribution given by the following equation (16) is assumed as the prior distribution of ρ_k^.

The above generative model of a_k(ω), which is based on a mixture model, is called the "mixed DOA model".
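The generative story for one source's DOA can be sketched as follows: draw ρ_k from a Dirichlet prior, draw the indicator x_k from the resulting discrete distribution, and read off θ_k from the candidate grid. Equations (13), (14) and (16) are not shown here, so the symmetric Dirichlet with a single concentration `beta0`, the Gamma-based sampler, and the function name `sample_doa` are assumptions for illustration only.

```python
import math
import random

def sample_doa(I, beta0, rng):
    """Draw one source's DOA under a mixed-DOA-style prior (illustrative sketch)."""
    # Candidate grid: Theta_i = (i - 1) * pi / I for i = 1..I.
    candidates = [(i - 1) * math.pi / I for i in range(1, I + 1)]
    # rho_k ~ symmetric Dirichlet(beta0), built from normalised Gamma draws.
    g = [rng.gammavariate(beta0, 1.0) for _ in range(I)]
    total = sum(g)
    rho = [x / total for x in g]
    # x_k ~ Discrete(rho_k); indices are 1-based as in the text.
    x = rng.choices(range(1, I + 1), weights=rho, k=1)[0]
    # theta_k is the candidate selected by the indicator.
    return x, candidates[x - 1], rho

rng = random.Random(0)
x_k, theta_k, rho_k = sample_doa(I=180, beta0=0.5, rng=rng)
```

A small `beta0` concentrates ρ_k on a few candidates, which matches the later description of β_0 as controlling how sparse the DOA distribution is.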

<Variational inference algorithm>
Given the observed signal y^ = y^_{1:Ω,1:T}, we wish to obtain the posterior distribution p(a^, s^, z^, v^, x^, ρ^ | y^) of the parameters of the above generative model, namely a^ = a^_{1:∞,1:Ω}, s^ = s_{1:Ω,1:T}, z^ = z_{1:Ω,1:T}, v^ = v_{1:∞}, x^ = x_{1:∞}, and ρ^ = ρ^_{1:K}. This posterior is difficult to obtain analytically, but an approximate distribution can be obtained by iterative computation based on variational inference. For simplicity, Σ^(n)_{1:Ω}, Σ^(a)_{1:Ω}, α_0, and β_0 are treated below as experimentally determined constants. The goal of variational inference is to minimize, with respect to q, the Kullback-Leibler divergence of the following equation (18) between the posterior p(a^, s^, z^, v^, x^, ρ^ | y^) and a non-negative variational function q(a^, s^, z^, v^, x^, ρ^) that satisfies the following equation (17).

Here, ⟨f(x)⟩_{q(x)} denotes ∫q(x)f(x)dx. The basic idea of variational inference is to assume that q factorizes as in the following equation (19), and to obtain an approximation of p(a^, s^, z^, v^, x^, ρ^ | y^) by iteratively minimizing F[q] with respect to each of q(a^), q(s^), q(z^), q(v^), q(x^), and q(ρ^) in turn.
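For orientation, the constraint, objective, and factorization that equations (17) to (19) refer to can be written in the standard mean-field form below. This is the textbook form implied by the surrounding description, not a transcription of the patent's equations; the shorthand Θ for the full parameter set is an assumption, and integrals over the discrete variables z^ and x^ should be read as sums.

```latex
% Shorthand (assumption): \Theta = \{\mathbf{a}, \mathbf{s}, \mathbf{z}, \mathbf{v}, \mathbf{x}, \boldsymbol{\rho}\}
\begin{align}
  \int q(\Theta)\, d\Theta &= 1, \qquad q(\Theta) \ge 0, \\
  F[q] &= \left\langle \log \frac{q(\Theta)}{p(\Theta \mid \mathbf{y})} \right\rangle_{q(\Theta)}, \\
  q(\Theta) &= q(\mathbf{a})\, q(\mathbf{s})\, q(\mathbf{z})\, q(\mathbf{v})\, q(\mathbf{x})\, q(\boldsymbol{\rho}).
\end{align}
```

Under this factorization, each factor can be updated while the others are held fixed, which is what makes the iterative scheme tractable.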

In addition, a truncation approximation is applied to q(z), as shown in the following equation (20).

This approximation does not fix the complexity of the model (the number of sound sources); it only restricts the function space of q to a certain region. Consequently, to fit q as closely as possible to the target posterior p(a^, s^, z^, v^, x^, ρ^ | y^), the larger the truncation level D is, the better. The derivation is omitted, but each q that minimizes equation (18) under the constraint of equation (17) is obtained analytically in the forms given by the following equations (21) to (26).

In addition, m_{k,ω}^, Γ_{k,ω}, μ_{ω,t}, σ²_{ω,t}, φ_{ω,t}^ = (φ_{1,ω,t}, ..., φ_{D,ω,t}), γ_{k,0}, γ_{k,1}, ψ_k^ = (ψ_{k,1}, ..., ψ_{k,I}), and ζ_{k,1}, ..., ζ_{k,I} are given by the following equations (27) to (37), using the following definition.

Here, Ψ is the digamma function, and the superscript "*" attached to a symbol denotes the complex conjugate. In the update equations for γ_{k,0} and γ_{k,1} (equations (36) and (37)), the second term is multiplied by v_0. This corresponds to raising the prior distribution to the power v_0, and is a measure to keep the influence of the prior from becoming too small when the amount of data is large, for example because the audio is long. In the experiments described later, v_0 = 0.7 × Ω × T.

<System configuration>
Next, an embodiment of the present invention is described, taking as an example the case where the invention is applied to an acoustic signal analysis device that analyzes acoustic signals obtained from M microphones (M is an integer of 2 or more) and separates them into an unknown number K of sound source signals.

As shown in FIG. 1, the acoustic signal analysis device according to the embodiment of the present invention is implemented by a computer that includes a CPU, a RAM, and a ROM storing a program for executing the acoustic signal analysis processing routine described later. Functionally, it is configured as follows.

The acoustic signal analysis device 100 includes an input unit 10, a calculation unit 20, a storage unit 30, and an output unit 40.

The input unit 10 receives time-series data of the acoustic signals (multi-channel signals) output from the M microphones. The storage unit 30 stores the time-series data of the acoustic signals received by the input unit 10. The storage unit 30 also stores the results of each process described later, together with the constant parameters Σ^(n)_ω^, Σ^(a)_ω^, α_0, and β_0.

For example, let the number of microphones M be 2. The constant parameters are set as shown in the following equations (38) to (41).

Here, I^ is the M × M identity matrix. The number of angle divisions is I = 180, and v_0 = 0.7 × Ω × T. Σ^(n)_ω^ represents the allowable deviation of the sparse source model from the observed signal y_{ω,t}, and Σ^(a)_ω^ represents the allowable deviation of a_{k,ω} from the steering vector that corresponds to the direction of arrival of sound source k when that direction is known. α_0 represents the degree to which sound sources with large indices are unlikely to become active, and β_0 represents the degree to which the distribution ρ_{k,1}, ..., ρ_{k,I} over the direction of arrival x_k is sparse.

The calculation unit 20 includes a time-frequency analysis unit 21, an initial setting unit 22, a parameter update unit 23, a parameter adjustment unit 24, an end determination unit 25, a time-frequency component estimation unit 26, and a signal conversion unit 27. The parameter update unit 23 in turn includes an active sound source posterior distribution update unit 231, a time-frequency component posterior distribution update unit 232, a sound source indicator posterior probability update unit 233, a sound source arrival direction posterior probability update unit 234, a steering vector posterior probability update unit 235, and a sound source direction posterior probability update unit 236.

The time-frequency analysis unit 21 takes the observed acoustic signals, that is, the time-series signal of each microphone, as input and computes a three-dimensional array y^ whose (m, ω, t) element is the time-frequency component (observed time-frequency component) y_{m,ω,t}, where m = 1, ..., M, ω = 1, ..., Ω, and t = 1, ..., T are indices corresponding to microphone, frequency, and time, respectively. The computed time-frequency components y_{m,ω,t} are stored in the storage unit 30. More specifically, for each microphone m, the time-frequency analysis unit 21 takes the time-series data of that microphone's acoustic signal as input, computes the time-frequency components y_{m,ω,t} by time-frequency analysis using the short-time Fourier transform (STFT), and outputs the array (amplitude spectrogram) y^ = (y_{m,ω,t})_{M×Ω×T} that stores them. The time-frequency components y_{m,ω,t} may instead be computed using a wavelet transform.
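The time-frequency analysis step can be sketched with a naive STFT that builds the array y[m][ω][t] one microphone at a time. The Hann window, the direct O(N²) DFT, and the function names `stft` and `analyze` are assumptions chosen to keep the example dependency-free; a practical implementation would use an FFT, and the experiment described later uses 64 ms frames with a 16 ms shift at 16 kHz (1024 and 256 samples).

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Naive STFT of one channel: returns spec[omega][t] as complex numbers."""
    # Periodic Hann window to reduce spectral leakage (an assumed choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_frames = 1 + (len(signal) - frame_len) // hop
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    spec = [[0j] * n_frames for _ in range(n_bins)]
    for t in range(n_frames):
        frame = [signal[t * hop + n] * window[n] for n in range(frame_len)]
        for w in range(n_bins):
            # Direct DFT of the windowed frame at bin w.
            spec[w][t] = sum(
                frame[n] * cmath.exp(-2j * math.pi * w * n / frame_len)
                for n in range(frame_len)
            )
    return spec

def analyze(channels, frame_len, hop):
    """Stack per-microphone STFTs into y[m][omega][t]."""
    return [stft(ch, frame_len, hop) for ch in channels]

# Tiny example: two identical channels of a sinusoid centred on bin 4 of a 32-point frame.
sig = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
y = analyze([sig, sig], frame_len=32, hop=8)
```

The complex values are kept rather than magnitudes, since the inter-microphone phase is what carries the DOA information used by the rest of the method.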

The initial setting unit 22 sets the initial values of Ω, l_{i,ω}^, and the parameters of each variational posterior distribution q (hereinafter, variational parameters) as follows, and stores them in the storage unit 30.

First, Ω = Ω_0 is set so that the iterative computation starts from the frequency band h_{Ω0} = about 300 Hz to about 300 + 120 Hz, which is high enough for the DOA to appear in the phase yet lies within the range where aliasing does not occur.

For l_{i,ω}^, an initial value is set for each (i, ω) according to the following equation (42).

For the mean μ_{ω,t} of the active sound source signal, the average over microphones of the observed time-frequency components, Σ_m y_{m,ω,t}/M, is set as the initial value for each (ω, t).

φ_{k,ω,t} (the parameter representing the probability that sound source k is active at time-frequency point (ω, t)) is given for each (k, ω, t) according to the following equation (43), so that it is inversely proportional to the sound source index (a larger index receives a lower probability) and so that Σ_k φ_{k,ω,t} = 1.
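One initialization that satisfies both stated properties (inverse proportionality to the source index and normalization over k) is sketched below. Equation (43) is not shown in this text, so the specific form φ_k ∝ 1/k and the name `init_phi` are assumptions; the same vector would be copied to every (ω, t).

```python
def init_phi(D):
    """phi_k proportional to 1/k for k = 1..D, normalised to sum to one."""
    weights = [1.0 / k for k in range(1, D + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# With truncation level D = 4 the weights 1, 1/2, 1/3, 1/4 are normalised by 25/12.
phi = init_phi(4)
```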

For the other variational parameters m_{k,ω}^, Γ_{k,ω}, σ²_{ω,t}, γ_{k,0}, γ_{k,1}, ψ_k^ = (ψ_{k,1}, ..., ψ_{k,I}), and ζ_{k,1}, ..., ζ_{k,I}, suitable initial values may be set using random numbers.

For every combination of (ω, t), the time-frequency component posterior distribution update unit 232 updates the parameters μ_{ω,t} and σ²_{ω,t} of the posterior distribution of the time-frequency component of the active sound source according to the following equations (44) and (45), based on σ²_{ω,t}, φ_{k,ω,t}, m_{k,ω}^, Σ^(n)_ω^, y_{m,ω,t}, and Γ_{k,ω} stored in the storage unit 30, and stores the results in the storage unit 30.

For every combination of (k, ω), the steering vector posterior probability update unit 235 updates the parameters m_{k,ω}^ and Γ_{k,ω} of the posterior probability of the steering vector of each sound source according to the following equations (46) and (47), based on Γ_{k,ω}, σ²_{ω,t}, φ_{k,ω,t}, m_{k,ω}^, Σ^(n)_ω^, Σ^(a)_ω^, μ_{ω,t}, y_{m,ω,t}, ψ_k, and l_{i,ω}^ stored in the storage unit 30, and stores the results in the storage unit 30.

For every combination of (k, ω, t), the sound source indicator posterior probability update unit 233 updates the parameter φ_{k,ω,t} of the posterior probability of the active-source indicator (which sound source is active in each time-frequency bin) according to the following equation (48), based on Γ_{k,ω}, σ²_{ω,t}, φ_{k,ω,t}, m_{k,ω}^, Σ^(n)_ω^, μ_{ω,t}, y_{m,ω,t}, γ_{k,0}, and γ_{k,1} stored in the storage unit 30, and stores the result in the storage unit 30.

For every combination of (k, i), the sound source arrival direction posterior probability update unit 234 updates the parameter ψ_{k,i} of the posterior probability of the direction from which each sound source appears to have arrived according to the following equation (49), based on ζ_{k,i}, Σ^(a)_ω^, and l_{i,ω}^ stored in the storage unit 30, and stores the result in the storage unit 30.

For every combination of (k, i), the sound source direction posterior probability update unit 236 updates the parameter ζ_{k,i} of the posterior probability of the direction in which a sound source appears to lie according to the following equation (50), based on ψ_{k,i} and β_0 stored in the storage unit 30, and stores the result in the storage unit 30.

For each k, the active sound source posterior distribution update unit 231 updates the parameters γ_{k,0} and γ_{k,1} of the posterior distribution of how readily each sound source becomes active according to the following equations (51) and (52), based on φ_{k,ω,t}, α_0, and v_0 stored in the storage unit 30, and stores the results in the storage unit 30.

The parameter adjustment unit 24 performs the following adjustments.

First, for every combination of (k, ω), the scale of the steering vector is normalized by normalizing the scale of m_{k,ω}^ stored in the storage unit 30 according to the following equation (53).

Here, [·]_i denotes the i-th element of a vector.

In addition, the frequency band used in the iterative computation is widened according to the following equation (54) by ΔΩ, the number of frequency bins corresponding to about 120 Hz.
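The band schedule can be illustrated by converting the stated frequencies into STFT bin counts and widening the band once per iteration. The frame length of 1024 samples follows from the 64 ms frame at 16 kHz reported in the experiment; equation (54) is not shown, so the capped update Ω ← min(Ω + ΔΩ, Ω_max) and the names `hz_to_bin` and `expand_band` are assumptions.

```python
def hz_to_bin(freq_hz, sample_rate=16000, frame_len=1024):
    """Nearest STFT bin index for a frequency in Hz (bin spacing = sample_rate / frame_len)."""
    return round(freq_hz * frame_len / sample_rate)

def expand_band(omega, delta_bins, omega_max):
    """Widen the usable band by delta_bins, never exceeding the highest available bin."""
    return min(omega + delta_bins, omega_max)

omega0 = hz_to_bin(300.0 + 120.0)   # upper edge of the starting band
delta = hz_to_bin(120.0)            # bins corresponding to about 120 Hz
omega_max = 1024 // 2 + 1           # number of non-negative frequency bins
omega1 = expand_band(omega0, delta, omega_max)  # band after one adjustment
```

With these settings the bin spacing is 15.625 Hz, so the band grows by 8 bins per iteration until it covers the full spectrum.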

The sound source indices k are also rearranged so that the means μ_{ω,t} of the time-frequency components s_{ω,t}, taken at the frequencies ω and times t at which sound source k is active, are in descending order.

Furthermore, for example from the 20th iteration onward, the covariance matrix Σ^(a)_ω of the steering vector is reduced for each ω by multiplying it by (1 / number of iterations).

The end determination unit 25 determines whether a predetermined termination condition is satisfied. If the condition is not satisfied, the processes of the parameter update unit 23 and the parameter adjustment unit 24 are repeated. If the end determination unit 25 determines that the condition is satisfied, processing moves on to the time-frequency component estimation unit 26.

The termination condition may be that the number of iterations has reached a predetermined number. Alternatively, the condition may be that the rate of change of the parameters in a single update can be regarded as approximately 1, that is, the parameters have effectively stopped changing. The condition may also be that the rate of change of the value of the objective function F[q] in a single parameter update can be regarded as approximately 1.
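The termination conditions above can be combined in a small helper. Interpreting "rate of change approximately 1" as the ratio of successive objective values lying within a tolerance of 1 is an assumption, as are the tolerance value and the name `should_stop`.

```python
def should_stop(iteration, max_iterations, prev_objective=None, objective=None, tol=1e-4):
    """Return True when the iterative update should terminate."""
    # Primary condition from the text: a fixed iteration budget has been used up.
    if iteration >= max_iterations:
        return True
    # Optional condition: the ratio of successive objective values is close to 1.
    if prev_objective is not None and objective is not None and prev_objective != 0:
        return abs(objective / prev_objective - 1.0) < tol
    return False
```

Calling `should_stop(5, 100, 10.0, 9.0)` returns False because the objective is still changing by 10 percent, whereas passing nearly identical successive values stops early.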

For every combination of (k, ω, t), the time-frequency component estimation unit 26 computes the estimate ŝ_{k,ω,t} of the time-frequency component of each sound source k according to the following equation (55), based on φ_{k,ω,t} and μ_{ω,t} stored in the storage unit 30.
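The experimental section describes this estimate as the active-source mean μ_{ω,t} multiplied by the activity probability φ_{k,ω,t}, which amounts to a soft time-frequency mask. The sketch below implements that product with nested lists; the array layout and the name `estimate_sources` are assumptions.

```python
def estimate_sources(phi, mu):
    """Soft-mask estimate: s_hat[k][w][t] = phi[k][w][t] * mu[w][t].

    phi : phi[k][w][t], real activity probabilities per source
    mu  : mu[w][t], complex posterior mean of the active source component
    """
    return [
        [[phi_k[w][t] * mu[w][t] for t in range(len(mu[w]))] for w in range(len(mu))]
        for phi_k in phi
    ]

# Two sources, one frequency bin, two frames.
phi = [[[0.75, 0.5]], [[0.25, 0.5]]]
mu = [[2 + 0j, 1j]]
s_hat = estimate_sources(phi, mu)
```

Because the probabilities sum to one over k, the per-source estimates add back up to μ_{ω,t} in every bin.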

The signal conversion unit 27 converts the estimates ŝ_{k,ω,t} of the time-frequency components of each sound source k computed by the time-frequency component estimation unit 26 into the sound source signal of each sound source, and outputs the sound source signals through the output unit 40.

<Operation of the acoustic signal analysis device>
Next, the operation of the acoustic signal analysis device 100 according to the present embodiment is described. First, time-series data of the acoustic signal from each microphone is input to the acoustic signal analysis device 100 as the signal to be analyzed and is stored in the storage unit 30. The acoustic signal analysis device 100 then executes the acoustic signal analysis processing routine shown in FIG. 2.

First, in step S101, the acoustic signal within the frame at each time t is read from the storage unit 30 for each microphone, time-frequency analysis using the short-time Fourier transform is applied to it, and from the result a three-dimensional array y^ whose (m, ω, t) element is the observed time-frequency component y_{m,ω,t} is generated and stored in the storage unit 30.

Then, in step S102, the initial values of Ω, l_{i,ω}^, and the parameters of each variational posterior distribution q, namely μ_{ω,t}, φ_{k,ω,t}, m_{k,ω}^, Γ_{k,ω}, σ²_{ω,t}, γ_{k,0}, γ_{k,1}, ψ_k^ = (ψ_{k,1}, ..., ψ_{k,I}), and ζ_{k,1}, ..., ζ_{k,I}, are set and stored in the storage unit 30.

In the next step S103, the parameters μ_{ω,t} and σ²_{ω,t} of the posterior distribution of the time-frequency component of the active sound source are updated according to equations (44) and (45), based on the parameters σ²_{ω,t}, φ_{k,ω,t}, m_{k,ω}^, and Γ_{k,ω} set in step S102 or updated in steps S103, S104, and S105, and on y_{m,ω,t} obtained in step S101. The results are stored in the storage unit 30.

In step S104, the parameters m_{k,ω}^ and Γ_{k,ω} of the posterior probability of the steering vector of each sound source are updated according to equations (46) and (47), based on the parameters Γ_{k,ω}, σ²_{ω,t}, φ_{k,ω,t}, m_{k,ω}^, μ_{ω,t}, ψ_k, and l_{i,ω}^ set in step S102 or updated in steps S103, S104, S105, and S106, and on y_{m,ω,t} obtained in step S101. The results are stored in the storage unit 30.

In step S105, the parameter φ_{k,ω,t} of the posterior probability of the active-source indicator (which sound source is active in each time-frequency bin) is updated according to equation (48), based on the parameters Γ_{k,ω}, σ²_{ω,t}, φ_{k,ω,t}, m_{k,ω}^, μ_{ω,t}, γ_{k,0}, and γ_{k,1} set in step S102 or updated in steps S103, S104, S105, and S108, and on y_{m,ω,t} obtained in step S101. The result is stored in the storage unit 30.

In the next step S106, the parameter ψ_{k,i} of the posterior probability of the direction from which each sound source appears to have arrived is updated according to equation (49), based on the parameters ζ_{k,i} and l_{i,ω}^ set in step S102 or updated in step S107. The result is stored in the storage unit 30.

In the next step S107, the parameter ζ_{k,i} of the posterior probability of the direction in which a sound source appears to lie is updated according to equation (50), based on the parameter ψ_{k,i} set in step S102 or updated in step S106. The result is stored in the storage unit 30.

In step S108, the parameters γ_{k,0} and γ_{k,1} of the posterior distribution of how readily each sound source becomes active are updated according to equations (51) and (52), based on the parameter φ_{k,ω,t} set in step S102 or updated in step S105. The results are stored in the storage unit 30.

Then, in step S109, the parameter m_{k,ω}^ updated in step S104 is normalized, the parameter Ω is updated, and the sound source indices k are rearranged. From the 20th iteration onward, the parameter Σ^(a)_ω is also adjusted.

In the next step S110, it is determined, as the termination condition, whether the number of iterations has reached a predetermined number. If it has not, the termination condition is judged not to be satisfied, the routine returns to step S103, and steps S103 to S109 are repeated. If the number of iterations has reached the predetermined number, the termination condition is judged to be satisfied, and in step S111 the estimate ŝ_{k,ω,t} of the time-frequency component of each sound source k is computed according to equation (55), based on φ_{k,ω,t} and μ_{ω,t} as finally updated in steps S105 and S103.

Then, in step S112, the sound source signal of each sound source is computed from the estimates ŝ_{k,ω,t} of the time-frequency components of each sound source k computed in step S111 and is output through the output unit 40, and the acoustic signal analysis processing routine ends.
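The control flow of steps S103 to S110 can be summarized as a loop that applies the update steps in order, then the adjustment step, for a fixed number of iterations. The sketch shows only this skeleton: the update callables are placeholders, since the actual update equations (44) to (54) are not reproduced in this text, and the names `run_analysis`, `updates`, and `adjust` are hypothetical.

```python
def run_analysis(state, updates, adjust, max_iterations):
    """Skeleton of the iterative routine.

    state          : any object holding the variational parameters
    updates        : ordered callables state -> state, standing in for S103 to S108
    adjust         : callable (state, iteration) -> state, standing in for S109
    max_iterations : iteration budget checked in S110
    """
    for iteration in range(1, max_iterations + 1):
        for update in updates:            # S103 to S108, in the order given in the text
            state = update(state)
        state = adjust(state, iteration)  # S109: normalise, widen band, reorder sources
    return state

# Toy usage: the "state" is a list recording the order in which the steps ran.
steps = [lambda s, name=name: s + [name]
         for name in ("S103", "S104", "S105", "S106", "S107", "S108")]
trace = run_analysis([], steps, lambda s, it: s + ["S109"], max_iterations=2)
```

Passing the iteration counter to the adjustment step allows iteration-dependent behaviour, such as the shrinking of Σ^(a)_ω from the 20th iteration onward.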

<Experimental results>

Next, to demonstrate the usefulness of the method according to the present embodiment (hereinafter, the proposed method), the results of an experiment verifying its sound source separation performance are described. The observed signals were created by artificially mixing the speech signals of three speakers (two female, one male) (see A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, "ATR Japanese speech database as a tool of speech recognition and synthesis," Trans. Speech Communication, pp. 357−363, 1990) by convolving them with room impulse responses (reverberation time 0 ms) (see S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, "Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition," in Proc. LREC '00, 2000, pp. 965−968) and summing the results. The sampling frequency was 16 kHz.
The time-frequency components of the observed signals were computed by the short-time Fourier transform (frame length 64 ms, frame shift 16 ms). Σ^(n)_ω and Σ^(a)_ω were set to I^ and 10^{−1.5} × I^, respectively, and the number of angle divisions was I = 180. After running the iterative algorithm described above, the estimate μ_{ω,t} of the active source component multiplied by the probability φ_{k,ω,t} that sound source k is active in each time-frequency bin was taken as the estimated time-frequency component of sound source k. In this experiment, considering the possibility that ζ falls into a local optimum because of spatial aliasing, the algorithm was run in the early iterations using only the observations in a low band where spatial aliasing does not occur, and the band was gradually widened toward higher frequencies as the number of iterations increased.

The Signal-to-Distortion Ratio (SDR) was adopted as the evaluation measure for sound source separation performance (see E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. ASLP, pp. 1462−1469, 2006). First, to verify the effect of adaptively inferring the required number of sound sources, which is the first key point of the proposed method, we examined how the truncation level D affects separation performance. Because the conventional method requires the number of sound sources to be assumed, we also examined how its separation performance changes with the assumed number of sources K. The results are shown in FIG. 3. With the conventional method, separation performance was high when the assumed number of sources matched the actual number but markedly lower when it did not. With the proposed method, as noted earlier, a larger truncation level D lets q approximate the true posterior more closely, and it was confirmed that separation performance improves as D increases. This means that high separation performance can be obtained even when the number of sound sources is unknown, provided D is set sufficiently large in advance, which demonstrates the effect of the proposed method.
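To give a feel for the scale of SDR values, the sketch below computes a simplified energy ratio between a reference signal and the estimation error, in decibels. This is not the BSS Eval SDR of the cited reference, which first decomposes the estimate into target, interference, and artifact components; the simplification and the name `simple_sdr_db` are assumptions, and the function is undefined when the estimate equals the reference exactly.

```python
import math

def simple_sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy over error energy (not BSS Eval)."""
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / distortion)

# An estimate that is uniformly 10 percent too small has an error energy
# equal to 1 percent of the signal energy.
sdr = simple_sdr_db([1.0, 0.0, -1.0, 0.0], [0.9, 0.0, -0.9, 0.0])
```

In this example the ratio of signal energy to error energy is 100, which corresponds to about 20 dB.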

Next, to examine the effect of the mixed DOA model, we checked whether the direction of arrival of each sound source appeared to be estimated correctly. FIG. 4 plots, in a different color for each sound source, the inter-channel phase difference arg([m_{k,ω}^]_2 / [m_{k,ω}^]_1) of the room impulse responses used to synthesize the observed signals. FIG. 5 plots, in a different color for each sound source, the inter-channel phase difference arg([m_{k,ω}]_2 / [m_{k,ω}]_1) computed from the estimated transfer frequency characteristics m_{k,ω}^. Here, [·]_i denotes the i-th element of a vector. The figures show that the direction of arrival of each sound source is estimated largely correctly even in the presence of spatial aliasing.
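The quantity plotted in FIGS. 4 and 5 is the argument of the ratio between the second and first elements of a two-channel vector, which can be computed as below. The function name `interchannel_phase` and the example vector are assumptions for illustration.

```python
import cmath

def interchannel_phase(m):
    """Inter-channel phase difference arg([m]_2 / [m]_1) of a two-element complex vector."""
    return cmath.phase(m[1] / m[0])

# A vector whose second channel leads the first by 0.3 rad.
phase = interchannel_phase([1.0 + 0.0j, cmath.exp(0.3j)])
```

The result lies in the interval from −π to π, so at high frequencies the true phase difference wraps around; this wrapping is the spatial aliasing mentioned in the text.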

以上説明したように、本発明の実施の形態に係る音響信号解析装置によれば、各周波数ωに対する各時刻ｔにおいてアクティブとなる音源を１個とし、各音源ｋがアクティブになる確率ｖ_kの事前分布を棒折過程でモデル化し、各音源ｋの到来方向のインデックスを示すインジケータｘ_kの事前分布を離散確率分布でモデル化し、各音源ｋが各インデックスの到来方向となる確率ρ_k＾の事前分布をディレクレ分布でモデル化し、音源信号の生成モデルのパラメータの事後分布を近似する変関数ｑのパラメータを変分推論法に基づき繰り返し更新して、各音源ｋの音源信号の時間周波数成分の事後分布を推定することにより、音源数が不明であっても、複数のマイクロホンから出力された音響信号の時系列データから、音源毎の音源信号に精度よく分離することができる。 As described above, according to the acoustic signal analysis device according to the embodiment of the present invention, there is one sound source that is active at each time t for each frequency ω, and the probability v _k that each sound source k becomes active. The prior distribution is modeled by a bar folding process, the prior distribution of the indicator x _k indicating the arrival direction index of each sound source k is modeled by a discrete probability distribution, and the probability ρ _k ^ of each sound source k being the arrival direction of each index The prior distribution is modeled as a directory distribution, and the parameters of the variation function q that approximates the posterior distribution of the parameters of the sound source signal generation model are repeatedly updated based on the variational inference method, and the time-frequency component of the sound source signal of each sound source k is updated. By estimating the posterior distribution, even if the number of sound sources is unknown, the time-series data of acoustic signals output from multiple microphones can be used to accurately generate sound sources for each sound source. It is possible to Ku separation.

Furthermore, because the individual sound source signals mixed in the acoustic signals acquired by a plurality of microphones can be separated, applications such as hands-free video conferencing systems and systems for automatically generating meeting-minutes content are expected.

Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the acoustic signal analysis device described above contains a computer system, the term "computer system" also includes a homepage providing environment (or display environment) when a WWW system is used.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Computation unit
21 Time-frequency analysis unit
22 Initial setting unit
23 Parameter update unit
24 Parameter adjustment unit
25 End determination unit
26 Time-frequency component estimation unit
27 Signal conversion unit
30 Storage unit
40 Output unit
100 Acoustic signal analysis device
231 Active sound source posterior distribution update unit
232 Time-frequency component posterior distribution update unit
233 Sound source indicator posterior probability update unit
234 Sound source arrival direction posterior probability update unit
235 Steering vector posterior probability update unit
236 Sound source direction posterior probability update unit

Claims

An acoustic signal analysis device comprising:
time-frequency analysis means for receiving, as input, time-series data of acoustic signals output from M microphones (M is an integer of 2 or more), and outputting a three-dimensional array y^ whose elements are observed time-frequency components y_{m,ω,t} (m is a microphone index, t is a time index, and ω is a frequency index);
initial setting means for setting initial values of: the parameters (mean vector m_{k,ω}^ and covariance matrix Γ_{k,ω} of a complex normal distribution) of the posterior distribution of each vector a_{k,ω}^ contained in a three-dimensional array a^ whose elements are the transfer frequency characteristics a_{m,k,ω}, for each frequency ω, of the sound source signal from each of D sound sources k (D is an integer of 1 or more) to each microphone m; the parameters (mean μ_{ω,t} and variance σ²_{ω,t} of a complex normal distribution) of the posterior distribution of each element s_{ω,t} contained in a two-dimensional array s^ whose elements are the time-frequency components s_{ω,t} of the sound source that is active at each time t for each frequency ω; the parameters (parameters φ_{k,ω,t} of a discrete probability distribution) of the posterior distribution of each element z_{ω,t} of a two-dimensional array z^ whose elements are indicators z_{ω,t} giving the index of the sound source that is active at each time t for each frequency ω; the parameters (parameters γ_{k,0}, γ_{k,1} of a beta distribution) of the posterior distribution of each element v_k of a vector v^ whose elements are the probabilities v_k that each sound source k becomes active; the parameters (parameters ψ_{k,i} of a discrete probability distribution) of the posterior distribution of each element x_k of a vector x^ whose elements are indicators x_k giving the index i of the direction of arrival of each sound source k; and the parameters (parameters ζ_{k,i} of a Dirichlet distribution) of the posterior distribution of each vector ρ_k^ of a two-dimensional array ρ^ whose elements are the probabilities ρ_{k,i} that each sound source k has the direction of arrival of each index i;
parameter updating means for taking, as an objective function, a function representing the divergence between the posterior distribution p(a^, s^, z^, v^, x^, ρ^ | y^) of the three-dimensional array a^, the two-dimensional array s^, the two-dimensional array z^, the vector v^, the vector x^, and the two-dimensional array ρ^ given the three-dimensional array y^, and a variational function q(a^, s^, z^, v^, x^, ρ^), and for updating, based on variational inference so as to minimize the objective function, the mean vector m_{k,ω}^ and the covariance matrix Γ_{k,ω} for every combination of (k, ω), the mean μ_{ω,t} and the variance σ²_{ω,t} for every combination of (ω, t), the parameter φ_{k,ω,t} for every combination of (k, ω, t), the parameters γ_{k,0}, γ_{k,1} for every k, and the parameters ψ_{k,i} and ζ_{k,i} for every combination of (k, i);
end determination means for causing the update by the parameter updating means to be repeated until a predetermined end condition is satisfied; and
sound source signal estimation means for estimating, for every combination of (k, ω, t), the time-frequency component s_{k,ω,t} of the sound source signal of sound source k based on the parameter φ_{k,ω,t} and the mean μ_{ω,t}.

The acoustic signal analysis device according to claim 1, wherein
the variational function q(a^, s^, z^, v^, x^, ρ^) is set to the factorized form q(a^) q(s^) q(z^) q(v^) q(x^) q(ρ^),
the q(a^) that minimizes the objective function is the product of the complex normal distributions of the vectors a_{k,ω}^ over all combinations of (k, ω),
the q(s^) that minimizes the objective function is the product of the complex normal distributions of the elements s_{ω,t} over all combinations of (ω, t),
the q(z^) that minimizes the objective function is the product of the discrete probability distributions of the elements z_{ω,t} over all combinations of (ω, t),
the q(v^) that minimizes the objective function is the product of the beta distributions of the elements v_k over all k,
the q(x^) that minimizes the objective function is the product of the discrete probability distributions of the elements x_k over all k, and
the q(ρ^) that minimizes the objective function is the product of the Dirichlet distributions of the vectors ρ_k^ over all k.

The acoustic signal analysis device according to claim 2, wherein a truncation number D for the number of sound sources is determined in advance, and q(z_{ω,t} = k') = 0 is set for every k' equal to or greater than D + 1.

The acoustic signal analysis device according to claim 1, further comprising parameter adjustment means for normalizing the scale of the mean vector m_{k,ω}^ for every combination of (k, ω) updated by the parameter updating means, and for reordering the sound source indices k so that the mean μ_{ω,t} at the frequencies ω and times t at which sound source k is active is in descending order,
wherein the end determination means causes the update by the parameter updating means and the processing by the parameter adjustment means to be repeated until the predetermined end condition is satisfied.

An acoustic signal analysis method comprising:
receiving, by time-frequency analysis means, time-series data of acoustic signals output from M microphones (M is an integer of 2 or more) as input, and outputting a three-dimensional array y^ whose elements are observed time-frequency components y_{m,ω,t} (m is a microphone index, t is a time index, and ω is a frequency index);
setting, by initial setting means, initial values of: the parameters (mean vector m_{k,ω}^ and covariance matrix Γ_{k,ω} of a complex normal distribution) of the posterior distribution of each vector a_{k,ω}^ contained in a three-dimensional array a^ whose elements are the transfer frequency characteristics a_{m,k,ω}, for each frequency ω, of the sound source signal from each of D sound sources k (D is an integer of 1 or more) to each microphone m; the parameters (mean μ_{ω,t} and variance σ²_{ω,t} of a complex normal distribution) of the posterior distribution of each element s_{ω,t} contained in a two-dimensional array s^ whose elements are the time-frequency components s_{ω,t} of the sound source that is active at each time t for each frequency ω; the parameters (parameters φ_{k,ω,t} of a discrete probability distribution) of the posterior distribution of each element z_{ω,t} of a two-dimensional array z^ whose elements are indicators z_{ω,t} giving the index of the sound source that is active at each time t for each frequency ω; the parameters (parameters γ_{k,0}, γ_{k,1} of a beta distribution) of the posterior distribution of each element v_k of a vector v^ whose elements are the probabilities v_k that each sound source k becomes active; the parameters (parameters ψ_{k,i} of a discrete probability distribution) of the posterior distribution of each element x_k of a vector x^ whose elements are indicators x_k giving the index i of the direction of arrival of each sound source k; and the parameters (parameters ζ_{k,i} of a Dirichlet distribution) of the posterior distribution of each vector ρ_k^ of a two-dimensional array ρ^ whose elements are the probabilities ρ_{k,i} that each sound source k has the direction of arrival of each index i;
taking, by parameter updating means, as an objective function a function representing the divergence between the posterior distribution p(a^, s^, z^, v^, x^, ρ^ | y^) of the three-dimensional array a^, the two-dimensional array s^, the two-dimensional array z^, the vector v^, the vector x^, and the two-dimensional array ρ^ given the three-dimensional array y^, and a variational function q(a^, s^, z^, v^, x^, ρ^), and updating, based on variational inference so as to minimize the objective function, the mean vector m_{k,ω}^ and the covariance matrix Γ_{k,ω} for every combination of (k, ω), the mean μ_{ω,t} and the variance σ²_{ω,t} for every combination of (ω, t), the parameter φ_{k,ω,t} for every combination of (k, ω, t), the parameters γ_{k,0}, γ_{k,1} for every k, and the parameters ψ_{k,i} and ζ_{k,i} for every combination of (k, i);
causing, by end determination means, the update by the parameter updating means to be repeated until a predetermined end condition is satisfied; and
estimating, by sound source signal estimation means, for every combination of (k, ω, t), the time-frequency component s_{k,ω,t} of the sound source signal of sound source k based on the parameter φ_{k,ω,t} and the mean μ_{ω,t}.

A program for causing a computer to function as each means of the acoustic signal analysis device according to any one of claims 1 to 4.