JP2018142917A

JP2018142917A - Sound source localization device, method and program

Info

Publication number: JP2018142917A
Application number: JP2017037404A
Authority: JP
Inventors: Hirokazu Kameoka (亀岡弘和); Natsuki Ueno (植野夏樹)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-02-28
Filing date: 2017-02-28
Publication date: 2018-09-13
Anticipated expiration: 2037-02-28
Also published as: JP6623185B2

Abstract

PROBLEM TO BE SOLVED: To localize a plurality of sound sources simultaneously even when the number of sound sources is unknown.

SOLUTION: A sound source position estimation unit 25 estimates the position of each sound source k. To this end, the parameters representing the distributions of a variable N, a variable ρ, a variable Z, a variable V, a variable λ, a variable ζ, and a variable γ are estimated by variational inference so as to minimize an objective function. The objective function is a divergence that measures the difference between a variational function q(Θ) and the posterior distribution p(Θ|Y) of an unknown variable Θ given observation data Y. The unknown variable Θ includes the variables N and ρ representing the positions of the sound sources k, the variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency, the variable V for determining the probability π_k that each sound source k is dominant, the variable λ representing the variance of the observed time-frequency component of each frequency for each of a plurality of directions, and the variables ζ and γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise.

SELECTED DRAWING: Figure 3

Description

The present invention relates to a sound source localization device, method, and program, and more particularly to a sound source localization device, method, and program for estimating the position of a sound source from an acoustic signal.

Wave source localization has a wide range of applications, such as radar and sonar. In particular, being able to instantaneously localize and track a moving wave source with a small array is an important problem. Conventional approaches to the wave source localization problem include the Multiple Signal Classification (MUSIC) method, the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) method, and methods based on the wave-source-constrained partial differential equation.

The MUSIC and GCC-PHAT methods assume plane waves from the sound sources and use the time difference of arrival of each source between sensors as the cue for localization, so a larger array is generally advantageous. In addition, both rely on statistics of the signals received by the sensor array, such as autocorrelation and cross-correlation functions, so a sufficiently long observation window is required to localize a source with high accuracy. These methods are therefore not necessarily suited to wave source localization from a small array and an instantaneous observation alone. In contrast, the method based on the wave-source-constrained partial differential equation localizes the source using a spatio-temporal partial differential equation of the acoustic signal that holds at every time instant, so in theory it can localize a wave source from an instantaneous observation over a small region.

However, because this method is built on an equation that holds for a single wave source, it is fragile when the observed acoustic signal deviates from the partial differential equation, for example in the presence of noise or multiple point sound sources. To extend this framework to cases with random noise and multiple point sound sources, Non-Patent Document 1 proposes a probabilistic model of the array observation signals based on the wave-source-constrained partial differential equation, together with a wave source localization algorithm based on that model.

Non-Patent Document 1: 鈴木惇 (Suzuki) and 亀岡弘和 (Hirokazu Kameoka), "Probabilistic modeling of acoustic signals based on the wave-source-constrained difference equation and a multiple sound source localization algorithm" (波源拘束差分方程式に基づく音響信号の確率モデル化と複数音源定位アルゴリズム), Proceedings of the Acoustical Society of Japan, pp. 615-618, March 2016.

The method of Non-Patent Document 1 requires the number of sound sources to be assumed in advance, but in real environments the number of sound sources is often unknown. It is therefore desirable to localize sound sources without assuming their number, adapting the complexity of the model to the actual number of sound sources.

The present invention has been made in view of the above circumstances, and its object is to provide a sound source localization device, method, and program capable of localizing a plurality of sound sources simultaneously even when the number of sound sources is unknown.

To achieve the above object, a sound source localization device according to the present invention estimates the position of each of a plurality of sound sources from observation signals, input through a microphone array, in which the source signals from the plurality of sound sources are mixed. The device includes a spatial difference calculation unit, a time-frequency expansion unit, a parameter estimation unit, and a sound source position estimation unit.

The spatial difference calculation unit calculates, for each of a plurality of directions, the difference between the observation signals input by a pair of microphones of the microphone array arranged along that direction. The time-frequency expansion unit takes as input the observation signal input by a reference microphone of the microphone array and outputs the observed time-frequency component of each frequency; it also takes as input the differences of the observation signals calculated by the spatial difference calculation unit for each of the plurality of directions and outputs, for each of the plurality of directions, the observed time-frequency component of each frequency.

Based on the outputs of the time-frequency expansion unit, the parameter estimation unit takes as its objective function a function representing a divergence that expresses the difference between a variational function q(Θ) and the posterior distribution p(Θ|Y) of an unknown variable Θ given observation data Y. The observation data Y consists of the observed time-frequency components of each frequency of the reference microphone and the observed time-frequency components of each frequency for each of the plurality of directions. The unknown variable Θ includes a variable N and a variable ρ representing the positions of the plurality of sound sources k, a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency, a variable V for determining the probability π_k that each sound source k is dominant, a variable λ representing the variance of the observed time-frequency component of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise. The parameter estimation unit estimates, by variational inference, the parameters representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function.

The sound source position estimation unit estimates the position of each sound source k based on the estimated parameters representing the distribution of the variable N and the estimated parameters representing the distribution of the variable ρ.

A sound source localization method according to the present invention is performed in a sound source localization device that estimates the position of each of a plurality of sound sources from observation signals, input through a microphone array, in which the source signals from the plurality of sound sources are mixed.

In the method, a spatial difference calculation unit calculates, for each of a plurality of directions, the difference between the observation signals input by a pair of microphones of the microphone array arranged along that direction. A time-frequency expansion unit takes as input the observation signal input by a reference microphone of the microphone array and outputs the observed time-frequency component of each frequency; it also takes as input the differences of the observation signals calculated by the spatial difference calculation unit for each of the plurality of directions and outputs, for each of the plurality of directions, the observed time-frequency component of each frequency.

Based on the outputs of the time-frequency expansion unit, a parameter estimation unit takes as its objective function a function representing a divergence that expresses the difference between a variational function q(Θ) and the posterior distribution p(Θ|Y) of an unknown variable Θ given observation data Y. The observation data Y consists of the observed time-frequency components of each frequency of the reference microphone and the observed time-frequency components of each frequency for each of the plurality of directions. The unknown variable Θ includes a variable N and a variable ρ representing the positions of the plurality of sound sources k, a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency, a variable V for determining the probability π_k that each sound source k is dominant, a variable λ representing the variance of the observed time-frequency component of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise. The parameter estimation unit estimates, by variational inference, the parameters representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function.

A sound source position estimation unit then estimates the position of each sound source k based on the estimated parameters representing the distribution of the variable N and the estimated parameters representing the distribution of the variable ρ.

A program according to the present invention causes a computer to function as each unit of the sound source localization device described above.

As described above, the sound source localization device, method, and program of the present invention take as the objective function a function representing a divergence that expresses the difference between a variational function q(Θ) and the posterior distribution p(Θ|Y) of an unknown variable Θ given observation data Y, where Y consists of the observed time-frequency components of each frequency of the reference microphone and the observed time-frequency components of each frequency for each of the plurality of directions. The unknown variable Θ includes a variable N and a variable ρ representing the positions of the plurality of sound sources k, a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency, a variable V for determining the probability π_k that each sound source k is dominant, a variable λ representing the variance of the observed time-frequency component of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise. The parameters representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ are estimated by variational inference so as to minimize the objective function, and the position of each sound source k is then estimated. This provides the effect that a plurality of sound sources can be localized simultaneously even when the number of sound sources is unknown.

FIG. 1 is a diagram showing a spherical wave arriving at an observation point r from a point sound source.
FIG. 2 is a diagram showing an example of a microphone array arrangement.
FIG. 3 is a schematic diagram showing the configuration of a sound source localization device according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the sound source localization processing routine in the sound source localization device according to the embodiment of the present invention.
FIG. 5 is a diagram showing the sound source positions and microphone positions in the experiment.
FIG. 6 is a diagram showing the experimental results for a reverberation time of 0.5 s and an observation length of 2779 ms.
FIG. 7 is a diagram showing the experimental results for a reverberation time of 0.5 s and an observation length of 1665 ms.
FIG. 8 is a diagram showing the experimental results for a reverberation time of 0.2 s and an observation length of 2779 ms.
FIG. 9 is a diagram showing the experimental results for a reverberation time of 0.2 s and an observation length of 1665 ms.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The technique proposed in the present invention is a signal processing technique for estimating the position of a wave source from acoustic signals.

＜Overview of the embodiment of the present invention＞
The embodiment of the present invention is a technique that combines the advantages of the conventional methods described above and enables wave source localization of multiple sound sources from small-region, instantaneous observation.

In the embodiment of the present invention, wave source localization of each sound source is performed while a variational inference algorithm estimates which sound source is likely to be dominant at each time-frequency point. This rests on two elements: a probability distribution of the acoustic signal based on the time-frequency-domain representation of the wave-source-constrained partial differential equation, and an assumption of sparsity for all sound sources including noise (the assumption that, in the time-frequency representation of an acoustic signal in which multiple sound sources are mixed, at most one sound source is dominant at each time-frequency point).

Furthermore, using an idea inspired by the Dirichlet process mixture model, the embodiment of the present invention can localize sound sources without assuming their number, adapting the complexity of the model to the actual number of sound sources.

＜Principle of the embodiment of the present invention＞
Next, the principle of estimating the positions of sound sources will be described.

＜Sound source constraint partial differential equation＞
As shown in FIG. 1, let the position vector serving as the reference of the observation point be

and let the position vector of a single wave source be

Let g(t) be the wave source signal and c the speed of sound. Assuming spherical wave propagation from a single point source, the observed value at the observation point is expressed as

where

Let the unit vector pointing from the observation point toward the wave source be

Then, because

holds, the spatial derivative of

is

The time derivative of

is

Substituting Equation (1) and Equation (8) into Equation (7) therefore eliminates g and yields

which is an equation containing only the observed signal and its temporal and spatial derivatives. Here

is the distance from the observation point to the wave source. This equation is called the sound source constraint equation. As shown above, the sound source constraint equation is a partial differential equation that holds for an arbitrary source signal waveform and describes a unique relationship between the position of the sound source and the spatial field.
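The constraint can be checked numerically. The sketch below assumes the spherical-wave model f(r, t) = g(t - R/c)/R, with R the distance to the source, under which the sound source constraint takes the form grad f = n (f/R + (1/c) df/dt), where n is the unit vector from the observation point toward the source. This explicit form, the test waveform `g`, the speed of sound, and all function names are illustrative assumptions rather than the notation of the patent.

```python
import math

C = 340.0  # assumed speed of sound in m/s

def g(t):
    """Arbitrary smooth source waveform (hypothetical test signal)."""
    return math.sin(2 * math.pi * 500.0 * t) + 0.3 * math.cos(2 * math.pi * 1200.0 * t)

def field(r, t, src):
    """Spherical wave from a point source: f(r, t) = g(t - R/c) / R."""
    R = math.dist(r, src)
    return g(t - R / C) / R

def constraint_residual(r, t, src, h=1e-5, dt=1e-8):
    """Return grad f - n * (f/R + (1/c) df/dt), using central differences."""
    R = math.dist(r, src)
    n = [(s - x) / R for s, x in zip(src, r)]  # unit vector from r toward the source
    f0 = field(r, t, src)
    df_dt = (field(r, t + dt, src) - field(r, t - dt, src)) / (2 * dt)
    residual = []
    for i in range(3):
        r_plus, r_minus = list(r), list(r)
        r_plus[i] += h
        r_minus[i] -= h
        df_dxi = (field(r_plus, t, src) - field(r_minus, t, src)) / (2 * h)
        residual.append(df_dxi - n[i] * (f0 / R + df_dt / C))
    return residual

res = constraint_residual(r=(0.0, 0.0, 0.0), t=0.01, src=(1.0, 2.0, 0.5))
print(res)  # each component should be close to zero
```

The residual vanishes (up to finite-difference error) for any choice of `g`, which illustrates why the constraint can be used without knowing the source waveform.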

＜Probabilistic modeling of the acoustic signal based on the sound source constraint equation＞
To approximate the spatial derivative of the observed signal by spatial differences, the microphone array shown in FIG. 2 is assumed below. Any microphone arrangement is acceptable as long as it allows the spatial derivative of the observed signal to be approximated by spatial differences, so the following theory is not limited to the arrangement of FIG. 2. With the microphone array of FIG. 2, seven microphones make it possible to obtain, at each time t_j, the signal f_{0,j} at the reference point and its spatial differences in each direction

where j denotes the index of the sampling time.
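As an illustration of how the reference signal and its spatial differences might be formed, the sketch below assumes a seven-microphone layout with one reference microphone at the origin and one pair at distance d on either side of it along each axis, and uses a central difference per axis. The layout, the spacing `d`, the division by 2d, and the function names are assumptions for illustration; the actual arrangement is the one shown in FIG. 2.

```python
def spatial_differences(ref, pairs, d):
    """Approximate the spatial derivatives of the field at the reference microphone.

    ref   : samples of the reference microphone, f_0[j]
    pairs : dict mapping an axis name to (samples at +d, samples at -d) on that axis
    d     : distance from the reference microphone to each outer microphone (m)
    Returns (f_0, {axis: central-difference signal}).
    """
    diffs = {}
    for axis, (plus, minus) in pairs.items():
        diffs[axis] = [(p - m) / (2.0 * d) for p, m in zip(plus, minus)]
    return list(ref), diffs

def linear_field(x, y, z):
    """Toy field that is linear in space, so its gradient is known: (2, -1, 0.5)."""
    return 2.0 * x - y + 0.5 * z + 1.0

d = 0.02
ref = [linear_field(0, 0, 0)]
pairs = {
    "x": ([linear_field(d, 0, 0)], [linear_field(-d, 0, 0)]),
    "y": ([linear_field(0, d, 0)], [linear_field(0, -d, 0)]),
    "z": ([linear_field(0, 0, d)], [linear_field(0, 0, -d)]),
}
f0, diffs = spatial_differences(ref, pairs, d)
print(f0, diffs)
```

For a real acoustic field the difference is only an approximation of the derivative, and the approximation degrades as the spacing becomes comparable to the wavelength.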

Equation (9) can then be expressed as

for i = x, y, z, where n_x, n_y, and n_z are the x, y, and z components of n, respectively. Moving the left-hand side of Equation (10) to the right-hand side gives

Here, let f_{0,j}, f_{x,j}, f_{y,j}, and f_{z,j} be signals obtained by applying a window function. If the effect of the two end points of the extracted segment can be neglected, Equation (11) is expressed in the frequency domain as

where F_{0,m}, F_{x,m}, F_{y,m}, and F_{z,m} are the discrete Fourier transforms of f_{0,j}, f_{x,j}, f_{y,j}, and f_{z,j}, and m is the discrete frequency index.
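To make the frequency-domain relation concrete, the following sketch assumes that it has the form F_i = n_i (rho + j*omega/c) F_0 at each bin, with rho the reciprocal of the source distance and omega the angular frequency of the bin. This form is inferred from the spherical-wave model and is an assumption, as are all names below. Under it, the ratio F_i / F_0 at a single bin gives n_i from its imaginary part and rho from its real part. This is a noise-free illustration only; the estimator of the embodiment is probabilistic.

```python
import math

C = 340.0  # assumed speed of sound (m/s)

def estimate_from_bin(F0, Fx, Fy, Fz, omega):
    """Recover direction n and inverse distance rho from one frequency bin.

    Assumes F_i = n_i * (rho + 1j * omega / C) * F0 holds exactly (no noise).
    """
    ratios = [F / F0 for F in (Fx, Fy, Fz)]
    n = [r.imag * C / omega for r in ratios]      # Im(ratio) = n_i * omega / C
    norm = math.sqrt(sum(v * v for v in n))
    n = [v / norm for v in n]                     # re-normalise to a unit vector
    # Re(ratio) = n_i * rho; combine the three axes by projecting onto n
    rho = sum(r.real * v for r, v in zip(ratios, n))
    return n, rho

# Synthetic bin: source 2 m away (rho = 0.5) in direction (0.6, 0.0, 0.8), at 1 kHz
n_true, rho_true, omega = (0.6, 0.0, 0.8), 0.5, 2 * math.pi * 1000.0
F0 = complex(0.3, -1.1)
Fx, Fy, Fz = (ni * (rho_true + 1j * omega / C) * F0 for ni in n_true)
n_est, rho_est = estimate_from_bin(F0, Fx, Fy, Fz, omega)
print(n_est, rho_est)
```

With noise or several sources the ratio is no longer exact at every bin, which is the motivation for treating the deviation as a random error variable.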

In practice, the right-hand side of Equation (12) is not always exactly zero because of the error introduced by the difference approximation. The right-hand side of Equation (11) is therefore replaced, as in

by error variables

which are assumed to be mutually independent normal random variables with mean 0 (random variables following a complex normal distribution)

In addition, each frequency component of the observed signal at the observation point is taken to be a normal random variable with mean 0 and variance σ²_{0,m}. This is equivalent to assuming

Here, let the vector obtained by stacking

and the vector obtained by stacking

be

Equation (13) can then be written in the form

where

and

is given by

From Equations (14) and (16),

has mean

and variance-covariance matrix

equal to

and follows the complex normal distribution

Because

holds,

is expressed as

Therefore, from Equation (22), the probability density function of

is obtained as

From the above, given the observed spectrum and its spatial differences

the problem of localizing a single sound source reduces to the maximum likelihood estimation problem of solving

＜Localization algorithm for multiple sound sources exploiting the sparsity of sound sources＞
Many real-world acoustic signals, such as speech and musical sounds, have sparse time-frequency components. Consequently, even when multiple sound sources are present at the same time, it can often be assumed that at most one sound source is dominant at each time-frequency point. Based on this assumption of sparsity of the time-frequency components of the sound sources and on the probabilistic modeling of

described above, the probability distribution of the observed signal can be derived for the case where multiple sound sources exist and for the case where noise exists.

Let l be the time index of the extracted signal frame and k = 0, ..., K the sound source index, where k = 0 corresponds to noise and k ≠ 0 to point sound sources. Let the position of point sound source k be

Sparsity of the time-frequency components is assumed for all sound sources including noise: at frequency m and time l, only the z_{m,l}-th sound source has nonzero power, and the power of every other sound source is 0. Then, given z_{m,l}, the conditional probability density function of the time-frequency component of the observed signal and its spatial differences

(hereinafter, the observed signal) is given by

where

In addition,

is the variance-covariance matrix of the time-frequency components of the noise. It is assumed to be expressed as a normalized variance-covariance matrix model that depends only on frequency

multiplied by a noise power that also depends on time

that is, as the product

How to set

when diffuse noise is assumed is described later.

denotes all the unknown parameters

With the prior probability of

written as

the probability density function of the observed signal

(the likelihood function of

) can be written as

From the above, the problem of estimating the position of each sound source

when multiple sound sources and noise are present reduces, given the observed signal

to the maximum likelihood estimation problem of solving

The global solution of this optimization problem cannot be obtained analytically, but a stationary point can be searched for with the Expectation-Maximization (EM) algorithm using

as latent variables.
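The E-step of such an EM algorithm computes, for each time-frequency point, the posterior probability that each source index is the dominant one. The sketch below shows this step in generic mixture-model form, assuming the per-source log-likelihoods of the observation at one point are already available. The names `log_lik`, `prior`, and `responsibilities` are hypothetical, and the likelihood of the observed signal itself is not reproduced here.

```python
import math

def responsibilities(log_lik, prior):
    """Posterior over the dominant-source index z at one time-frequency point.

    log_lik[k] : log p(y | z = k) for k = 0 (noise) and k = 1..K (point sources)
    prior[k]   : prior probability that source k is dominant
    Uses the log-sum-exp trick so that very negative log-likelihoods do not underflow.
    """
    scores = [ll + math.log(p) for ll, p in zip(log_lik, prior)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

post = responsibilities(log_lik=[-1050.0, -1000.0, -1003.0], prior=[0.5, 0.25, 0.25])
print(post)  # most of the mass falls on index 1
```

In the M-step the source parameters would then be re-estimated with each time-frequency point weighted by these posteriors.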

＜Nonparametric Bayesian modeling＞
In real environments the number of sound sources is often unknown. The formulation above assumed that the number of sound sources K is known, but it is desirable to localize sound sources without assuming their number, adapting the complexity of the model to the actual number of sound sources. In the present embodiment, the generative model above is therefore extended to a Dirichlet process mixture model. Let k = 0 be the index of the noise and

the indices of the point sound sources, and let z_{m,l} be a random variable generated according to the countably infinite-dimensional discrete distribution

Here, the probability π₀ that noise is dominant at the point (m, l) (that is, z_{m,l} = 0) is a variable generated according to a beta distribution with hyperparameters

namely

and the probability π_k that point sound source k ≠ 0 is dominant (that is, z_{m,l} = k) is a variable determined according to a stick-breaking process

The expected value of π_k generated by the above process tends to decrease exponentially as k increases, so sound sources with larger k have a lower probability of being active. When the parameters are inferred from the observed signal, this has the effect of trying to explain the observed signal with a mixture model that uses the minimum necessary number of sound source indices. In the above generative model, the full set of unknown variables Θ is as follows.
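The decay of the expected weights can be seen in a small simulation. The sketch assumes the standard stick-breaking construction, in which weight k equals V_k times the product of (1 - V_j) over all j before k, with the noise weight taking the first piece of the stick and every ratio drawn from a Beta(1, alpha) distribution. The exact construction and hyperparameters of the embodiment are not reproduced here, so the truncation level `K`, the concentration `alpha`, and the function name are illustrative. The final ratio is fixed to 1 so that the truncated weights sum to one, which is a common truncation device.

```python
import random

def stick_breaking(v):
    """Turn break ratios v[0], v[1], ... into weights pi[k] = v[k] * prod_{j before k} (1 - v[j])."""
    weights, remaining = [], 1.0
    for vk in v:
        weights.append(vk * remaining)   # take a fraction vk of what is left of the stick
        remaining *= 1.0 - vk
    return weights

random.seed(0)
K, trials, alpha = 8, 20000, 2.0          # hypothetical truncation level and concentration
mean = [0.0] * (K + 1)
for _ in range(trials):
    v = [random.betavariate(1.0, alpha) for _ in range(K)] + [1.0]  # last ratio closes the stick
    for k, w in enumerate(stick_breaking(v)):
        mean[k] += w / trials
print([round(m, 3) for m in mean])
```

Apart from the last entry, which collects the remainder of the stick under truncation, the averaged weights shrink geometrically with k, so high-index sources are rarely active unless the data call for them.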

＜Variational inference algorithm＞
With the generative modeling above, we can write

Furthermore, assuming the remaining prior distributions p(V), p(N), p(ρ), p(λ), p(ζ), and p(γ) to be

the joint distribution of the observed signal

and the unknown variable Θ

can be written down explicitly. Here

denotes the von Mises-Fisher distribution,

the real normal distribution, and

the gamma distribution. The posterior distribution p(Θ|Y) of Θ is difficult to obtain analytically, but an approximate distribution can be obtained by iterative computation based on variational inference. Variational inference seeks a nonnegative variational function q(Θ) satisfying

such that the Kullback-Leibler (KL) divergence between it and the posterior distribution p(Θ|Y)

becomes small; Equation (55) above is taken as the objective function. Assuming that

can be approximated as

an approximate distribution of

is obtained by iteratively minimizing the objective function of Equation (55) with respect to

In addition, for

the following truncation approximation

is applied. This approximation does not mean that the complexity of the model (the number of sound sources) has been fixed; it means that the function space of q has been restricted to a certain range. Therefore, to approximate

as closely as possible, it suffices to increase

and thereby widen the range that

can take.
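For reference, the standard identities behind this procedure can be written as follows. The notation is the usual one for variational inference and is assumed here rather than copied from Equation (55); integrals over discrete components of Θ, such as Z, are to be read as sums. Because log p(Y) does not depend on q, minimizing the KL divergence over q is equivalent to maximizing the lower bound.

```latex
\begin{align}
\mathrm{KL}\bigl(q(\Theta)\,\|\,p(\Theta\mid Y)\bigr)
  &= \int q(\Theta)\,\log\frac{q(\Theta)}{p(\Theta\mid Y)}\,d\Theta \;\ge\; 0, \\
\log p(Y)
  &= \underbrace{\int q(\Theta)\,\log\frac{p(Y,\Theta)}{q(\Theta)}\,d\Theta}_{\text{lower bound}}
   \;+\; \mathrm{KL}\bigl(q(\Theta)\,\|\,p(\Theta\mid Y)\bigr).
\end{align}
```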

Setting the variation of Equation (55) with respect to each q to zero yields the following, which are called the variational posterior update equations.

Here

denotes the expected value of

with respect to

which is given by

when X is a continuous variable and by

when X is a discrete variable. In addition,

denotes the set of all elements of

other than X. The derivations are given in the next section; the individual variational posterior update equations take the following forms.
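In generic mean-field form, and assuming that q(Θ) factorizes over the variable groups, the stationarity condition for one factor reads as below. This is the textbook coordinate update, given as a hedged aid to reading the update equations and not as a reproduction of them; X stands for any one of N, ρ, Z, V, λ, ζ, γ, and the integral is to be read as a sum for discrete variables.

```latex
\begin{equation}
\log q(X) = \bigl\langle \log p(Y,\Theta) \bigr\rangle_{q(\Theta\setminus X)} + \mathrm{const},
\qquad
\bigl\langle h(\Theta) \bigr\rangle_{q(\Theta\setminus X)}
  = \int q(\Theta\setminus X)\,h(\Theta)\,d(\Theta\setminus X).
\end{equation}
```

Each update only needs the terms of log p(Y, Θ) that involve X, since all other terms are absorbed into the constant.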

＜Derivation of the variational update equations＞
＜Joint distribution＞
With the generative model established above, log p(Θ, Y) can be written explicitly as follows.

＜Variational posterior update equation for the sound source direction N＞
Among the terms of

those that involve N are

and, since

holds, the variational posterior update equation for N is

Hence the expected value

is as follows.

<Variational posterior distribution update formula for (the reciprocal of) sound source distance ρ>

The terms of log p(Θ, Y) that involve ρ are

and

so the variational posterior distribution update formula for ρ is

<Variational posterior distribution update formula for active sound source index Z>

The terms of log p(Θ, Y) that involve Z are

which gives

Here,

is, when k = 0,

and, when k ≠ 0,

Therefore, the variational posterior distribution update formula for Z is

<Variational posterior distribution update formula for stick-breaking proportion V>

The terms of log p(Θ, Y) that involve V are

and

so the variational posterior distribution update formula for V is

Here,

is the digamma function.
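As a minimal sketch of how stick-breaking proportions determine the probabilities π_k, the following assumes the conventional construction π_k = V_k ∏_{j<k} (1 − V_j). The function name `stick_breaking_weights` and the handling of the leftover mass are illustrative assumptions; the exact indexing used in this description (for example, how the noise index k = 0 is treated) is not reproduced here.

```python
def stick_breaking_weights(v):
    """Map stick-breaking proportions v[0..K-1] (each in [0, 1]) to weights.

    Returns K + 1 weights: v[k] * prod_{j<k}(1 - v[j]) for k < K, followed by
    the leftover stick mass, so that the weights always sum to 1.
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for vk in v:
        if not 0.0 <= vk <= 1.0:
            raise ValueError("each proportion must lie in [0, 1]")
        weights.append(vk * remaining)
        remaining *= 1.0 - vk
    weights.append(remaining)
    return weights


if __name__ == "__main__":
    # Constant proportions give geometrically decaying weights.
    print(stick_breaking_weights([0.5, 0.5, 0.5]))
```

With constant proportions the weights shrink geometrically with k, which matches the requirement in the claims that π_k decrease exponentially as k increases.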

<Variational posterior distribution update formula for the precision (reciprocal of the variance) λ of the error variable>

The terms of log p(Θ, Y) that involve λ are

and

so the variational posterior distribution update formula for λ is

Therefore, the expected value

is as follows.

<Variational posterior distribution update formula for (the reciprocal of) sound source power ζ>

The terms of log p(Θ, Y) that involve ζ are

and

so the variational posterior distribution update formula for ζ is

Therefore, the expected value is

<Variational posterior distribution update formula for (the reciprocal of) noise power γ>

The terms of log p(Θ, Y) that involve γ are

and

so the variational posterior distribution update formula for γ is

<Noise variance-covariance matrix W>
Here, an example of setting the noise covariance matrix W_m is described, assuming the arrangement of seven microphones shown in FIG. 2. Let the Fourier transform of

be

Since the relationship between

and

can be written as

if the variance-covariance matrix of

is denoted by

then the variance-covariance matrix of

is

Therefore, when, for example, spatially uncorrelated noise of equal power is assumed,

is an identity matrix, so the variance-covariance matrix

of

can simply be set to

A sound field in which, within a certain region, the energy density is uniform and the flow of energy is equally probable in all directions is called a diffuse sound field, and it is known to approximate the sound field of a reverberant environment well. In a diffuse sound field, the spatial correlation coefficient between two points depends only on the distance d between them and is given by

Therefore, when diffuse noise is assumed, for an array geometry such as that of FIG. 2, the variance-covariance matrix

of

is

Using this, the variance-covariance matrix

of

can simply be set to

In this case,

is a diagonal matrix.
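The two noise settings above can be illustrated with a short sketch. Everything specific in it is an assumption made for illustration: the microphone layout (one centre microphone plus one at ±0.03 m on each axis), the unscaled differences encoded in the matrix `A`, the speed of sound, and the spherically isotropic coherence sin(kd)/(kd), which is the textbook model for a diffuse field and may not match the expression used in this description. The covariance of the transformed noise is computed as A Γ Aᵀ, the usual rule for a linear transformation.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not stated in the text


def diffuse_coherence(freq_hz, distance_m, c=SPEED_OF_SOUND):
    """Assumed diffuse-field model: sin(kd)/(kd) with k = 2*pi*f/c."""
    x = 2.0 * math.pi * freq_hz * distance_m / c
    return 1.0 if x == 0.0 else math.sin(x) / x


def mic_positions(spacing=0.03):
    """Hypothetical layout: centre microphone plus one at +/-spacing per axis."""
    s = spacing
    return [(0.0, 0.0, 0.0), (s, 0.0, 0.0), (-s, 0.0, 0.0),
            (0.0, s, 0.0), (0.0, -s, 0.0), (0.0, 0.0, s), (0.0, 0.0, -s)]


# Rows: reference (centre) signal, then the x, y and z differences.
A = [
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 0, 0, 1, -1],
]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def coherence_matrix(freq_hz, positions):
    """Pairwise diffuse-field coherence between all microphones."""
    return [[diffuse_coherence(freq_hz, math.dist(p, q)) for q in positions]
            for p in positions]


def transform_covariance(a, gamma):
    """Return A * Gamma * A^T for real matrices given as nested lists."""
    n = len(gamma)
    ag = [[sum(row[k] * gamma[k][j] for k in range(n)) for j in range(n)]
          for row in a]
    return [[sum(r1[k] * r2[k] for k in range(n)) for r2 in a] for r1 in ag]


if __name__ == "__main__":
    print(transform_covariance(A, identity(7)))
    print(transform_covariance(A, coherence_matrix(1000.0, mic_positions())))
```

Under these assumptions both settings give a diagonal 4 × 4 matrix: the symmetry of the layout cancels the cross terms between the reference signal and the three differences.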

<Summary of the variational inference algorithm>
The variational posterior distribution update formula for each variable is given by

and the update formulas for the parameters of each distribution are as follows.

Here,

is, when k = 0,

and, when k ≠ 0,

The expected values that appear in the update formulas are given as follows.

<System configuration>
Next, an embodiment of the present invention will be described, taking as an example a case where the present invention is applied to a sound source localization apparatus that estimates the positions of a plurality of sound sources from acoustic signals input by a microphone array.

As shown in FIG. 3, the sound source localization apparatus 100 according to the embodiment of the present invention is configured by a computer including a CPU, a RAM, and a ROM that stores a program for executing a sound source localization processing routine. Functionally, it is configured as follows.

As shown in FIG. 3, the sound source localization apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 receives time-series data of acoustic signals (hereinafter, observation signals) in which sound source signals from a plurality of sound sources are mixed, output from each microphone of a microphone array such as that shown in FIG. 2.

The calculation unit 20 includes a spatial difference calculation unit 22, a time-frequency expansion unit 24, and a sound source position estimation unit 25.

From the observation signals output from each microphone of the microphone array, the spatial difference calculation unit 22 acquires the observation signal f_{0,j} at the reference-point microphone at each time t_j, and calculates the spatial differences f_{x,j}, f_{y,j}, f_{z,j} in the directions x, y, and z according to the following equation.
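As a minimal sketch of this step, the following assumes that each spatial difference is the plain sample-by-sample difference between the two microphones placed on opposite sides of the reference point along one axis, as the claims describe. The microphone labels ('c', 'x+', 'x-', and so on) and the function name are hypothetical, and any scaling by the microphone spacing is left out because it is not stated here.

```python
def spatial_differences(signals):
    """Compute the reference signal and per-axis spatial differences.

    signals: dict mapping a microphone label to an equal-length list of
    samples. The labels 'c', 'x+', 'x-', 'y+', 'y-', 'z+', 'z-' are
    hypothetical names for the centre microphone and the axis pairs.
    Returns (f0, fx, fy, fz), each a list with one value per time index j.
    """
    f0 = list(signals["c"])

    def diff(axis):
        plus, minus = signals[axis + "+"], signals[axis + "-"]
        if len(plus) != len(minus):
            raise ValueError("paired microphone signals must have equal length")
        # Unscaled difference between the two microphones of the pair.
        return [p - m for p, m in zip(plus, minus)]

    return f0, diff("x"), diff("y"), diff("z")


if __name__ == "__main__":
    demo = {
        "c": [1, 2, 3],
        "x+": [5, 5, 5], "x-": [1, 2, 3],
        "y+": [0, 0, 0], "y-": [1, 1, 1],
        "z+": [2, 2, 2], "z-": [2, 2, 2],
    }
    print(spatial_differences(demo))
```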

The time-frequency expansion unit 24 calculates the observed time-frequency component F_{0,m} of each frequency m from the observation signal f_{0,j} at each time t_j at the reference-point microphone obtained by the spatial difference calculation unit 22. It also calculates the observed time-frequency components F_{x,m}, F_{y,m}, F_{z,m} of each frequency m from the spatial differences f_{x,j}, f_{y,j}, f_{z,j} in the directions x, y, and z at each time t_j obtained by the spatial difference calculation unit 22. In the present embodiment, a time-frequency expansion such as the short-time Fourier transform or a wavelet transform is performed.
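A short-time Fourier transform of the kind mentioned above can be sketched as follows. The Hann window, the absence of padding, and the direct (non-FFT) evaluation of the transform are choices made here for brevity and are assumptions, not details taken from this description; a practical implementation would use an FFT routine.

```python
import cmath
import math


def stft(signal, frame_len, hop):
    """Short-time Fourier transform of a real signal.

    Returns a list of frames; each frame is a list of frame_len // 2 + 1
    complex bins. A periodic Hann window is applied and no padding is used,
    so trailing samples that do not fill a whole frame are dropped.
    """
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(seg[n] * cmath.exp(-2j * math.pi * m * n / frame_len)
                for n in range(frame_len))
            for m in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


if __name__ == "__main__":
    # A sinusoid that falls exactly on frequency bin 4 of a 32-sample frame.
    x = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(96)]
    spec = stft(x, frame_len=32, hop=16)
    peak = max(range(len(spec[0])), key=lambda m: abs(spec[0][m]))
    print(len(spec), peak)
```

The same function would be applied to f_{0,j} and to each of the spatial differences to obtain the four components per time-frequency point.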

The sound source position estimation unit 25 estimates the position of each sound source k based on the observation frequency component y consisting of the observed time-frequency components F_{x,m}, F_{y,m}, F_{z,m}, F_{0,m} of each frequency m acquired by the time-frequency expansion unit 24. The unknown variable Θ includes: the variables N and ρ representing the positions of the plurality of sound sources k; the variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency; the variable V for determining the probability π_k that each sound source k is dominant; the variable λ representing the variance of the observed time-frequency components of each frequency for each of the plurality of directions; and the variables ζ and γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise. Taking as the objective function a function representing the divergence that indicates the difference between the variational function q(Θ) and the posterior distribution p(Θ|Y) of Θ, given observation data Y consisting of the observed time-frequency components F_{x,m}, F_{y,m}, F_{z,m}, F_{0,m} of each frequency m, the unit estimates the parameters representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function based on variational inference, and thereby estimates the position of each sound source k.

Specifically, the sound source position estimation unit 25 includes a variable update unit 28 and a convergence determination unit 30.

The variable update unit 28 first sets initial values of the parameters representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ (hereinafter referred to as variational parameters).

The variable update unit 28 then updates the variational parameters according to Equations (132) to (146) above, based on the observation data Y and the current variational parameters.

The convergence determination unit 30 determines whether a predetermined convergence condition is satisfied; if it is not, the processing of the variable update unit 28 is repeated. When the convergence condition is satisfied, the convergence determination unit 30 outputs, via the output unit 90, the direction vector n^(k) and the sound source distance R^(k) of each sound source k as the estimated position of each sound source k, based on the finally obtained variational parameters.

The convergence condition may be that the number of iterations has reached a predetermined number. Alternatively, the condition may be that the rate of change of the parameters caused by a single update can be regarded as approximately 1, that is, that the parameters have essentially stopped changing.
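The two convergence conditions can be combined in a loop such as the following sketch. The function `run_until_converged`, its tolerance, and the toy update rule are hypothetical; in the apparatus the update step would apply the variational parameter update formulas, which are not reproduced here. A ratio of approximately 1 between successive parameter values is tested as a small relative change.

```python
def run_until_converged(update, params, max_iters=100, tol=1e-6):
    """Repeat params = update(params) until a convergence condition holds.

    Stops when every parameter changes by at most `tol` relative to its
    previous value (ratio new/old approximately 1), or after `max_iters`
    iterations. Returns (params, number_of_iterations_performed).
    """
    for iteration in range(1, max_iters + 1):
        new_params = update(params)
        converged = all(
            abs(new - old) <= tol * max(abs(old), 1e-12)
            for new, old in zip(new_params, params)
        )
        params = new_params
        if converged:
            return params, iteration
    return params, max_iters


def toy_update(params):
    """Stand-in update rule: one Babylonian step towards sqrt(2)."""
    (x,) = params
    return [0.5 * (x + 2.0 / x)]


if __name__ == "__main__":
    final, n_iters = run_until_converged(toy_update, [1.0])
    print(final, n_iters)
```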

<Operation of the sound source localization apparatus>
Next, the operation of the sound source localization apparatus 100 according to the present embodiment will be described.

When the input unit 10 receives time-series data of the observation signals output from each microphone of the microphone array, the sound source localization apparatus 100 executes the sound source localization processing routine shown in FIG. 4.

First, in step S120, from the time-series data of the observation signals input from each microphone of the microphone array, the observation signal f_{0,j} at the reference-point microphone is acquired at each time t_j, and the spatial differences f_{x,j}, f_{y,j}, f_{z,j} in the directions x, y, and z are calculated.

In step S121, the observed time-frequency component F_{0,m} of each frequency m is calculated from the observation signal f_{0,j} at each time t_j at the reference-point microphone obtained in step S120. In addition, the observed time-frequency components F_{x,m}, F_{y,m}, F_{z,m} of each frequency m are calculated from the spatial differences f_{x,j}, f_{y,j}, f_{z,j} in the directions x, y, and z at each time t_j.

In step S122, initial values of the variational parameters are set.

In step S124, the variational parameters are updated according to Equations (132) to (146) above, based on the observation data Y calculated in step S121 and either the initial values or the previously updated values of the variational parameters.

In step S125, it is determined whether the predetermined convergence condition is satisfied. If it is not satisfied, the process returns to step S124; if it is satisfied, the process proceeds to step S126.

In step S126, based on the variational parameters finally obtained in step S124, the direction vector n^(k) and the sound source distance R^(k) of each sound source k are output by the output unit 90 as the estimated position of each sound source k, and the sound source localization processing routine ends.
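If a Cartesian position is wanted from the output, a direction vector and a distance can be combined as in the following sketch. The helper names are hypothetical, the origin is assumed to be the reference point of the array, and the conversion from ρ relies only on the statement above that ρ is the reciprocal of the sound source distance.

```python
import math


def distance_from_inverse(rho):
    """Sound source distance R from its reciprocal rho (rho must be positive)."""
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return 1.0 / rho


def source_position(direction, distance, origin=(0.0, 0.0, 0.0)):
    """Return origin + distance * direction / |direction| as a tuple."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return tuple(o + distance * c / norm for o, c in zip(origin, direction))


if __name__ == "__main__":
    r = distance_from_inverse(0.5)
    print(source_position((0.0, 2.0, 0.0), r))
```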

<Experiment>
To verify the performance of the proposed method, numerical simulations of multiple sound source localization in a reverberant environment were performed. A two-dimensional model in the x and y directions was used. The room size was 6 m × 10 m × 4 m, and seven microphones were placed at its center at 0.03 m intervals as shown in FIG. 2. The wall reflection coefficients were 0.7308 and 0.4566 (corresponding to reverberation times of 0.5 s and 0.2 s, respectively, according to Sabine's reverberation formula). The sampling frequency of the microphones was 32 kHz, the frame length of the short-time Fourier transform was 64 ms (with an overlap of 32 ms), and the total length of the observation signal was 2779 ms and 1665 ms. There were three sound sources, located at (1, 0, 0) m, (-0.5, 0.87, 0) m, and (-0.5, -0.87, 0) m from the center of the room (see FIG. 5). In FIG. 5, the crosses indicate the sound source positions and the circle indicates the center of the microphone array. The sound source signals were taken from the speech-rate variation speech database (SRV-DB). As proposed methods, the method based on variational inference according to the present embodiment (VBEM), a method based on the EM algorithm assuming the correct number of sound sources (three) (EM1), and a method based on the EM algorithm assuming an incorrect number of sound sources (six) (EM2) were evaluated and compared with the conventional MUSIC method. In the EM algorithm, stationary noise was assumed and

was set.

In every method, the presence and direction of sound sources were estimated based on a detection threshold. A true sound source was counted as correct if some detected direction lay within ±τ of its true direction, and as a deletion error otherwise; a detected direction that did not belong to any true sound source direction was counted as an insertion error. The F-measure was calculated from these counts. For each τ, the highest F-measure obtained as the detection threshold was varied is plotted in FIGS. 6 to 9. FIG. 6 shows the localization accuracy of each method for a reverberation time of 0.5 s and an observation length of 2779 ms. FIG. 7 shows the localization accuracy for a reverberation time of 0.5 s and an observation length of 1665 ms. FIG. 8 shows the localization accuracy for a reverberation time of 0.2 s and an observation length of 2779 ms. FIG. 9 shows the localization accuracy for a reverberation time of 0.2 s and an observation length of 1665 ms. In many cases, the proposed method based on variational inference was confirmed to achieve more accurate localization than the other methods.
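The scoring rule can be sketched as follows. The function names are hypothetical, and the F-measure is assumed to be the usual harmonic mean of precision and recall, with precision = correct / (correct + insertions) and recall = correct / (correct + deletions); the exact definition used in the experiment is not stated. The three true directions in the usage example, approximately 0°, 120°, and -120°, follow from the source positions given above, while the detected directions are made-up illustrative values.

```python
def angular_error_deg(a, b):
    """Smallest absolute difference between two angles given in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def f_measure(true_dirs, detected_dirs, tau):
    """F-measure for direction detection with angular tolerance tau (degrees)."""
    # A true source is correct if some detection lies within tau of it.
    correct = sum(
        1 for t in true_dirs
        if any(angular_error_deg(t, d) <= tau for d in detected_dirs)
    )
    deletions = len(true_dirs) - correct
    # A detection that matches no true source is an insertion error.
    insertions = sum(
        1 for d in detected_dirs
        if all(angular_error_deg(t, d) > tau for t in true_dirs)
    )
    if correct == 0:
        return 0.0
    precision = correct / (correct + insertions)
    recall = correct / (correct + deletions)
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    true_dirs = [0.0, 120.0, -120.0]
    detected = [2.0, 118.0, 60.0]  # illustrative values, not experimental data
    print(f_measure(true_dirs, detected, tau=5.0))
```

In this example two sources are found, one is missed, and one detection is spurious, so precision and recall are both 2/3.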

As described above, the sound source localization apparatus according to the present embodiment can localize a plurality of sound sources simultaneously even when the number of sound sources is unknown. The observation data Y consists of the observed time-frequency component of each frequency of the reference microphone and the observed time-frequency component of each frequency for each of a plurality of directions. The unknown variable Θ includes the variables N and ρ representing the positions of the plurality of sound sources k, the variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency, the variable V for determining the probability π_k that each sound source k is dominant, the variable λ representing the variance of the observed time-frequency components of each frequency for each of the plurality of directions, and the variables ζ and γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise. Taking as the objective function a function representing the divergence that indicates the difference between the posterior distribution p(Θ|Y) of Θ given Y and the variational function q(Θ), the apparatus estimates the parameters representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function based on variational inference, and estimates the position of each sound source k.

Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the sound source localization apparatus described above contains a computer system; when a WWW system is used, the "computer system" is taken to include a homepage providing environment (or display environment) as well.

Although the present specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
22 Spatial difference calculation unit
24 Time-frequency expansion unit
25 Sound source position estimation unit
28 Variable update unit
30 Convergence determination unit
90 Output unit
100 Sound source localization apparatus

Claims

1. A sound source localization device that estimates the position of each of a plurality of sound sources from observation signals, input by a microphone array, in which sound source signals from the plurality of sound sources are mixed, the device comprising:
a spatial difference calculation unit that calculates, for each of a plurality of directions, a difference between the observation signals input by a pair of microphones of the microphone array arranged in that direction;
a time-frequency expansion unit that receives the observation signal input by a reference microphone of the microphone array and outputs an observed time-frequency component of each frequency, and that receives the difference between the observation signals calculated by the spatial difference calculation unit for each of the plurality of directions and outputs an observed time-frequency component of each frequency for each of the plurality of directions;
a parameter estimation unit that operates on the observed time-frequency component of each frequency of the reference microphone and the observed time-frequency component of each frequency for each of the plurality of directions output by the time-frequency expansion unit,
wherein an unknown variable Θ includes variables N and ρ representing the positions of a plurality of sound sources k, a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency, a variable V for determining a probability π_k that each sound source k is dominant, a variable λ representing the variance of the observed time-frequency components of each frequency for each of the plurality of directions, and variables ζ and γ representing the power of the time-frequency components of each frequency of each sound source k and of noise, and wherein the parameter estimation unit takes as an objective function a function representing a divergence that indicates the difference between a variational function q(Θ) and the posterior distribution p(Θ|Y) of Θ given observation data Y consisting of the observed time-frequency component of each frequency of the reference microphone and the observed time-frequency component of each frequency for each of the plurality of directions, and estimates parameters representing the distributions of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ so as to minimize the objective function based on variational inference; and
a sound source position estimation unit that estimates the position of each sound source k based on the estimated parameter representing the distribution of the variable N and the estimated parameter representing the distribution of the variable ρ, which represent the position of each sound source k.

2. The sound source localization device according to claim 1, wherein the probability π_k is determined so as to decrease exponentially as k increases.

3. The sound source localization device according to claim 1, wherein the variational function q(Θ) is approximated by q(N) q(ρ) q(Z) q(V) q(λ) q(ζ) q(γ),
a truncation number of sound sources K^* is determined in advance, and z_{ω,l} is an indicator of the index of the sound source that is dominant at each time l with respect to the frequency m, and
q(z_{ω,l} = k′) = 0 is set for k′ equal to or greater than K^* + 1.

4. A sound source localization method in a sound source localization device that estimates the position of each of a plurality of sound sources from observation signals, input by a microphone array, in which sound source signals from the plurality of sound sources are mixed, wherein:
a spatial difference calculation unit calculates, for each of a plurality of directions, a difference between the observation signals input by a pair of microphones of the microphone array arranged in that direction;
a time-frequency expansion unit receives the observation signal input by a reference microphone of the microphone array and outputs an observed time-frequency component of each frequency, and receives the difference between the observation signals calculated by the spatial difference calculation unit for each of the plurality of directions and outputs an observed time-frequency component of each frequency for each of the plurality of directions;
a parameter estimation unit operates on the observed time-frequency component of each frequency of the reference microphone and the observed time-frequency component of each frequency for each of the plurality of directions output by the time-frequency expansion unit,
wherein an unknown variable Θ includes variables N and ρ representing the positions of a plurality of sound sources k, a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency, a variable V for determining a probability π_k that each sound source k is dominant, a variable λ representing the variance of the observed time-frequency components of each frequency for each of the plurality of directions, and variables ζ and γ representing the power of the time-frequency components of each frequency of each sound source k and of noise, and wherein the parameter estimation unit takes as an objective function a function representing a divergence that indicates the difference between a variational function q(Θ) and the posterior distribution p(Θ|Y) of Θ given observation data Y consisting of the observed time-frequency component of each frequency of the reference microphone and the observed time-frequency component of each frequency for each of the plurality of directions, and estimates parameters representing the distributions of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ so as to minimize the objective function based on variational inference; and
a sound source position estimation unit estimates the position of each sound source k based on the estimated parameter representing the distribution of the variable N and the estimated parameter representing the distribution of the variable ρ, which represent the position of each sound source k.

5. The sound source localization method according to claim 4, wherein the probability π_k is determined so as to decrease exponentially as k increases.

6. The sound source localization method according to claim 4, wherein the variational function q(Θ) is approximated by q(N) q(ρ) q(Z) q(V) q(λ) q(ζ) q(γ),
a truncation number of sound sources K^* is determined in advance, and z_{ω,l} is an indicator of the index of the sound source that is dominant at each time l with respect to the frequency m, and
q(z_{ω,l} = k′) = 0 is set for k′ equal to or greater than K^* + 1.

7. A program for causing a computer to function as each unit of the sound source localization device according to any one of claims 1 to 3.