JP2012173592A

JP2012173592A - Sound source parameter estimation device and sound source separation device and method thereof and program therefor

Info

Publication number: JP2012173592A
Application number: JP2011036713A
Authority: JP
Inventors: Tomohiro Nakatani; 智広中谷; Akiko Araki; 章子荒木; Takuya Yoshioka; 拓也吉岡; Masakiyo Fujimoto; 雅清藤本
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-02-23
Filing date: 2011-02-23
Publication date: 2012-09-10
Anticipated expiration: 2031-02-23
Also published as: JP5438704B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound source parameter estimation device capable of estimating sound source model parameters together with the other sound source parameters even when no sound source model parameters are given in advance. SOLUTION: A sound source model parameter update unit updates the sound source model parameters, taking as inputs a sound source power feature amount, a sound source power parameter, a sound source occupancy, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in a sound source model storage unit. A sound source occupancy update unit updates the sound source occupancy of each sound source, taking as inputs a sound source position feature amount, the sound source power feature amount, the updated sound source power parameter and sound source position parameter of each sound source, the sound source model parameters, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in the sound source model storage unit.

Description

The present invention relates to a sound source parameter estimation device that estimates the sound source parameters of each sound source from observed signals in which acoustic signals generated simultaneously by a plurality of sound sources are mixed and picked up by a plurality of microphones, to a sound source separation device that separates each sound source based on those sound source parameters, and to methods and programs therefor.

FIG. 7 shows the functional configuration of a conventional sound source parameter estimation device 900 disclosed in Non-Patent Document 1. The sound source parameter estimation device 900 includes a sound source model storage unit 920, a feature extraction unit 910, a sound source power parameter update unit 930, a sound source position parameter update unit 940, and a sound source occupancy update unit 950.

The sound source model storage unit 920 stores a prior probability density function p(q^(l)) of the sound source power parameter, which represents the state of the entire sound source power time series of each of a plurality of sound source signals (the superscript (l) denotes the l-th sound source), and a model β_{q(l),n,k}(S) of the sound source power feature amount, which is the posterior probability density function at each time-frequency point (n,k) of each sound source signal given the sound source power parameter q^(l). The feature extraction unit 910 takes as input the observed signals x^(m)_{n,k}, obtained by converting the time-domain signals of a plurality of sound source signals picked up by a plurality of (m) microphones into time-frequency-domain signals, and extracts a sound source position feature amount A_{n,k} and a sound source power feature amount X_{n,k} at each time-frequency point.

For each of the N_s sound sources, the sound source power parameter update unit 930 takes as inputs the sound source power feature amount X_{n,k}, the sound source occupancy ^M^(l)_{n,k} (the posterior probability density function of the dominant sound source given the observed signal), the model β_{q(l),n,k}(S) of the sound source power feature amount, and the prior probability density function p(q^(l)) of the sound source power parameter, and updates the sound source power parameter ^q^(l). The sound source position parameter update unit 940 takes as inputs the sound source position feature amount A_{n,k} and the sound source occupancy ^M^(l)_{n,k}, and updates the sound source position parameter ^φ^(l) of each sound source.

The sound source occupancy update unit 950 updates the sound source occupancy ^M^(l)_{n,k} of each sound source based on the sound source position feature amount A_{n,k}, the sound source power feature amount X_{n,k}, the updated sound source power parameter ^q^(l) and sound source position parameter ^φ^(l) of each sound source, the model β_{q(l),n,k}(S) of the sound source power feature amount, and the prior probability density function p(q^(l)) of the sound source power parameter.

A sound source separation device that uses the sound source parameter estimation technique of the device 900 further includes a sound source separation unit. This unit takes as inputs the sound source power feature amount output by the sound source parameter estimation device 900, the updated sound source occupancy and sound source power parameter, and the model of the sound source power feature amount at each time-frequency point of each sound source signal, and obtains a separated signal for each of the plurality of sound sources by minimum square error estimation.

Tomohiro Nakatani, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, “Multichannel source separation based on source location cue with log-spectral shaping by hidden Markov source model,” Proc. of Interspeech-2010, pp. 2766-2769, Sep. 2010.
Tomohiro Nakatani, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, “Sound source separation based on DOA clustering and a log-spectral HMM of speech” (in Japanese), Proceedings of the Acoustical Society of Japan 2010 Autumn Meeting, pp. 577-580, Sep. 2010.

A conventional sound source parameter estimation device cannot estimate the sound source parameters unless the sound source model parameters are given in advance. The sound source model parameters are the parameters that control the behavior of the prior probability density function of the sound source power parameter and of the model of the sound source power feature amount stored in the sound source model storage unit.

The present invention has been made in view of this problem. Its object is to provide a sound source parameter estimation device and a sound source separation device, together with methods and programs therefor, that can estimate the sound source model parameters as part of the sound source parameters, together with the other sound source parameters, even when the sound source model parameters are not given in advance.

The sound source parameter estimation device of the present invention includes a sound source model storage unit, a feature extraction unit, a sound source power parameter update unit, a sound source position parameter update unit, a sound source model parameter update unit, and a sound source occupancy update unit. The sound source model storage unit stores a prior probability density function of the sound source power parameter, which represents the state of the entire sound source power time series of each of a plurality of sound source signals, and a model of the sound source power feature amount at each time-frequency point of each sound source signal given that sound source power parameter. The feature extraction unit takes as input the observed signals obtained by converting the time-domain signals of a plurality of sound source signals picked up by a plurality of microphones into time-frequency-domain signals, and extracts a sound source position feature amount and a sound source power feature amount at each time-frequency point. The sound source power parameter update unit updates the sound source power parameter of each sound source, taking as inputs the sound source power feature amount, the sound source occupancy (the posterior probability density function of the dominant sound source given the observed signal), the prior probability density function of the sound source power parameter, the model of the sound source power feature amount of each sound source signal, and the sound source model parameters, which control the behavior of the model of the sound source power feature amount and of the prior probability density function of the sound source power parameter. The sound source position parameter update unit takes the sound source position feature amount and the sound source occupancy as inputs and updates the sound source position parameter of each sound source. The sound source model parameter update unit updates the sound source model parameters, taking as inputs the sound source power feature amount, the sound source power parameter, the sound source occupancy, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in the sound source model storage unit. The sound source occupancy update unit updates the sound source occupancy of each sound source, taking as inputs the sound source position feature amount, the sound source power feature amount, the updated sound source power parameter and sound source position parameter of each sound source, the sound source model parameters, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in the sound source model storage unit.

According to the sound source parameter estimation device of the present invention, the sound source model parameters can be estimated as part of the sound source parameters, together with the other sound source parameters, even when they are not given in advance. Optimal sound source parameters can therefore be obtained even for diverse sound sources whose statistical properties cannot be supplied beforehand.

The sound source separation device of the present invention, which uses this sound source parameter estimation device, can output separated sound source signals with small error.

FIG. 1 is a diagram showing an example of the functional configuration of the sound source parameter estimation device 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the sound source parameter estimation device 100.
FIG. 3 is a diagram showing an example of the functional configuration of the sound source separation device 200 of the present invention.
FIG. 4 is a diagram showing the operation flow of the sound source separation device 200.
FIG. 5 is a diagram showing experimental results: (a) shows the result estimated from the speech before mixing, and (b) shows the result estimated from the mixed sound.
FIG. 6 is a diagram showing experimental results, namely the relationship between the cepstral distortion of the separated signals and the number of mixed sounds: (a) shows the result for mixtures under mixing condition 1, and (b) and (c) show the results for mixtures recorded in other reverberant environments (mixing conditions 2 and 3).
FIG. 7 is a diagram showing the functional configuration of the conventional sound source parameter estimation device 900.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings carry the same reference numerals, and their description is not repeated. Before the embodiments are described, the basic idea of the present invention is explained.

[Basic idea of the present invention]
The present invention is novel in two respects, both aimed at allowing the sound source parameters to be estimated even when the sound source model parameters are not given in advance. First, writing the entire time series of the sound source power feature amount of the l-th sound source signal as {S^(l)_{n,k}}, its joint probability density function is modeled with the sound source model parameter ψ^(l) included. Second, the device has a sound source model parameter update unit which, given the sound source power feature amount X_{n,k} of a signal in which a plurality of sounds are mixed, updates the sound source model parameters of each sound source based on the sound source occupancy of each sound source, the model of the sound source power feature amount, the prior probability density function of the sound source power parameter, and the sound source power parameter.

First, the symbols used in the description are explained. N_s sound source signals are superimposed in the observed signal, and they are picked up by N_m microphones. The signal picked up by the m-th microphone, converted into a frequency-domain signal by a short-time Fourier transform or the like, is denoted x^(m)_{n,k}. Here n is the time (frame) index and k is the frequency (bin) index; the time-frequency point corresponding to the n-th time and the k-th frequency is written as time-frequency point (n,k). For the position of the hat symbol ^ and the notation and position of subscripts and superscripts, the notation used in the equations is authoritative.

In the present invention, the joint probability density function of the entire sound source power time series {S^(l)_{n,k}} of the l-th sound source signal is modeled as shown in the following equations.

Here, q^(l) denotes the sound source power parameter representing the state of the entire sound source power time series of the l-th sound source. Below, the q^(l) of all sound sources are also written collectively as q = [q^(1), ..., q^(Ns)]. ψ^(l) denotes the entire set of sound source model parameters of the l-th sound source. The ψ^(l) of all sound sources are also written collectively as ψ = [ψ^(1), ..., ψ^(Ns)].

β_{q(l),n,k,ψ(l)}(S) is the model of the sound source power feature amount. It is the probability density function that the sound source power of the sound source signal at time-frequency point (n,k) is S^(l)_{n,k}, given the sound source power parameter q^(l) and the sound source model parameter ψ^(l) (equation (3)). When q^(l) takes continuous rather than discrete values, the summation in equation (1) is to be read as an integral over q^(l). Equation (2) introduces the assumption that, when the state of the sound source is known, the sound source powers S^(l)_{n,k} at different time-frequency points are mutually independent.
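The equation images for (1) to (3) are not reproduced in this text. As a reading aid only, the prose above is consistent with a factorization of the following form; this is a reconstruction from the description, and the exact layout and numbering of the original equations are assumptions.

```latex
\begin{align}
p\bigl(\{S^{(l)}_{n,k}\};\psi^{(l)}\bigr)
  &= \sum_{q^{(l)}} p\bigl(q^{(l)};\psi^{(l)}\bigr)\,
     p\bigl(\{S^{(l)}_{n,k}\}\mid q^{(l)};\psi^{(l)}\bigr), \\
p\bigl(\{S^{(l)}_{n,k}\}\mid q^{(l)};\psi^{(l)}\bigr)
  &= \prod_{n,k} \beta_{q^{(l)},n,k,\psi^{(l)}}\bigl(S^{(l)}_{n,k}\bigr), \\
\beta_{q^{(l)},n,k,\psi^{(l)}}(S)
  &= p\bigl(S^{(l)}_{n,k}=S \mid q^{(l)};\psi^{(l)}\bigr).
\end{align}
```

The first line marginalizes over the state q^(l) (a sum, or an integral for continuous states), the second expresses the conditional independence across time-frequency points, and the third defines the per-point model.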

As shown in equation (4), the present invention also assumes that the sound source power S^(l)_{n,k} of the sound source signal with the largest energy at each time-frequency point (n,k) (hereinafter called the dominant sound source signal) coincides with the sound source power of the observed signal.

For a non-dominant sound source l, the relation S^(l)_{n,k} ≤ X_{n,k} is assumed. It is then known that, under the condition that the state of each sound source signal is known, the posterior probability density function of the sound source power X_{n,k} of the observed signal can be expressed as follows (reference: S. J. Rennie, J. R. Hershey, and P. A. Olsen, “Hierarchical variational loopy belief propagation for multi-talker speech recognition,” Proc. ASRU-2009, pp. 176-181, 2009).
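The two assumptions above (the source that dominates a time-frequency point has power equal to the observed power, and every other source has power no greater than it) amount to modeling the observed power as the maximum of the source powers. The following is a minimal sketch of the resulting density, assuming purely for illustration that each source's power at one time-frequency point is an independent Gaussian; the function names and the Gaussian choice are hypothetical and are not taken from the patent.

```python
import math

def gauss_pdf(x, mean, std):
    """Gaussian density, standing in here for a per-source power model."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def gauss_cdf(x, mean, std):
    """Gaussian cumulative distribution, i.e. the probability that S <= x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def max_model_pdf(x, means, stds):
    """Density of X = max_l S_l for independent sources (illustrative assumption)."""
    total = 0.0
    for l in range(len(means)):
        # Case "source l dominates": its power equals the observed value x ...
        term = gauss_pdf(x, means[l], stds[l])
        for j in range(len(means)):
            if j != l:
                # ... and every other source has power no greater than x.
                term *= gauss_cdf(x, means[j], stds[j])
        total += term
    return total
```

Each term of the sum corresponds to one candidate dominant source, which is why the same source index can later be shared with the position model.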

The present invention further assumes that the above expression can be factorized as follows.

To estimate the sound source position parameter ^φ^(l) from the sound source position feature amount, the present invention also introduces a model p(A_{n,k};φ) of the sound source position feature amount. This model assumes that the energy of each sound source signal is sparsely distributed across time-frequency points, and that the feature at each time-frequency point is determined solely by the position of the sound source that is dominant at that point.

In general, writing the sound source position parameters φ^(l) of all sound sources collectively as φ = [φ^(1), ..., φ^(Ns)], the model p(A_{n,k};φ) of the sound source position feature amount, that is, the probability density function of the sound source position feature amount of the observed signal, can be expanded as a mixture distribution as shown in equation (8).

In equation (8), Z_{n,k} is a random variable representing the index of the dominant sound source at time-frequency point (n,k); Z_{n,k} = l indicates that the l-th sound source is the dominant one. p(Z_{n,k} = l) is the prior probability density function of the l-th sound source being dominant at time-frequency point (n,k). The following notation is also used in the description below.

γ_{φ(l),n,k}(A) denotes the probability density function of obtaining the sound source position feature amount A_{n,k} when the index of the dominant sound source at time-frequency point (n,k) is l. It is assumed to depend only on the sound source position parameter φ^(l) of the l-th sound source. Concrete definitions of γ_{φ(l),n,k}(A) and φ^(l) are given later.

Under equation (8), when γ_{φ(l),n,k}(A) is defined, the model p(A_{n,k};φ) of the sound source position feature amount is uniquely determined once the sound source position parameter φ^(l) and the prior probability density function p(Z_{n,k} = l) of the dominant sound source index are given. Conversely, when the sound source position feature amount A_{n,k} is observed, the sound source position parameter and the prior probability density function p(Z_{n,k} = l) of the dominant sound source index, or its posterior probability density function, can be estimated by a method such as maximum likelihood estimation.

Following the above definitions, the probability density function of the complete data is derived as shown in equation (10).

In equation (10), q is the sound source power parameter, ψ is the sound source model parameter, and φ is the sound source position parameter; these are the parameters to be estimated. In the present invention, the sound source power parameter q, the sound source model parameter ψ, and the sound source position parameter φ are estimated as the values that maximize the following log-likelihood function.

In equation (12), the random variable Z_{n,k} is treated as a hidden variable. A log-likelihood function that contains hidden variables can be maximized with, for example, the expectation-maximization algorithm. In the expectation-maximization algorithm, the posterior probability density function of the dominant sound source index given the observed signal, ^M^(l)_{n,k} = p(Z_{n,k} = l | A_{n,k}, X_{n,k}, ^q; ^φ, ^ψ), must be estimated at the same time, based on the estimate ^q of the sound source power parameter, the estimate ^φ of the sound source position parameter, and the estimate ^ψ of the sound source model parameter. In the present invention, the value of this function is called the sound source occupancy, and it is also counted among the sound source parameters.
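The occupancy update is the expectation step of such an algorithm: a posterior over the source index at each time-frequency point. A minimal sketch follows, assuming that per-source log-likelihoods of the position feature and of the power feature have already been evaluated with the current parameter estimates. How those terms are computed is defined by equations not reproduced here, so the three input arrays and the function name are hypothetical placeholders.

```python
import numpy as np

def update_occupancy(log_lik_position, log_lik_power, log_prior):
    """Posterior over the source index at each time-frequency point.

    All inputs have shape (n_sources, n_frames, n_bins); entry [l, n, k] is the
    log of the corresponding term for "source l dominates point (n, k)".
    Returns an array of the same shape that sums to 1 over the source axis.
    """
    log_post = log_prior + log_lik_position + log_lik_power
    # Subtract the per-point maximum before exponentiating, for numerical stability.
    log_post = log_post - log_post.max(axis=0, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=0, keepdims=True)
```

Because the same index variable is shared by both feature models, the two log-likelihoods simply add before normalization.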

With the approach described above, estimating the optimal sound source parameters while taking into account both the model β_{q(l),n,k,ψ(l)}(S) of the sound source power feature amount and the model p(A_{n,k};φ) of the sound source position feature amount reduces the estimation error of the sound source parameters. In addition, sharing the variable Z_{n,k}, which represents the index of the dominant sound source, between the model of the sound source position feature amount (equation (8)) and the model of the sound source power feature amount (equation (7)) keeps the computation of the parameter estimation simple while both features are taken into account.

As described above, the sound source parameter estimation method of the present invention models the joint probability density function of the entire sound source power time series {S^(l)_{n,k}} of the l-th sound source signal with the sound source model parameter ψ^(l) included. It also has a sound source model parameter update unit which, given the sound source power feature amount X_{n,k} of a signal in which a plurality of sounds are mixed, updates the sound source model parameters of each sound source based on the sound source occupancy of each sound source, the model of the sound source power feature amount, the prior probability density function of the sound source power parameter, and the sound source power parameter. As a result, the sound source parameters can be estimated even when the sound source model parameters are not given in advance.

FIG. 1 shows an example of the functional configuration of the sound source parameter estimation device 100 of the present invention, and FIG. 2 shows its operation flow. The sound source parameter estimation device 100 includes a feature extraction unit 910, a sound source model storage unit 90, sound source power parameter update units 60_1 to 60_Ns (one per sound source), the same number of sound source position parameter update units 940_1 to 940_Ns, a sound source occupancy update unit 70, and a sound source model parameter update unit 80. The function of each unit is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The sound source parameter estimation device 100 differs from the sound source parameter estimation device 900 described as prior art in that it includes the sound source model parameter update unit 80. As their reference numerals indicate, the feature extraction unit 910 and the sound source position parameter update units 940_1 to 940_Ns are the same as those of the sound source parameter estimation device 900.

The operation of each functional unit is described below.

[Feature extraction unit]
The feature extraction unit 910 takes as input the observed signals x^(m)_{n,k}, obtained by converting the time-domain signals of a plurality of sound source signals picked up by a plurality of (m) microphones into time-frequency-domain signals, and extracts the sound source position feature amount A_{n,k} and the sound source power feature amount X_{n,k} at each time-frequency point (n,k).

When, for example, the log power spectrum of the signal picked up by the first microphone is extracted as the sound source power feature amount, X_{n,k} is calculated as shown in equation (13). The sound source power feature amount X_{n,k} is input to the sound source occupancy update unit 70 and to the sound source power parameter update units 60_1 to 60_Ns.

The position of a sound source generally manifests itself in the phase differences and intensity ratios of the signals between different microphones at each time-frequency point. The sound source position feature amount A_{n,k} is therefore extracted either as a vector that collects the phase differences and intensity ratios for each microphone pair, or as the result of some further feature extraction applied to that vector. When, for example, the phase difference between the signals picked up by two microphones is extracted as the sound source position feature amount A_{n,k}, it is calculated as shown in equation (14).
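The two example features named above (the log power spectrum of the first microphone and the phase difference between two microphones) can be sketched as follows. This is an illustration under assumptions rather than a reproduction of equations (13) and (14), which are not shown in this text: the Hann window, the frame length, the hop size, the small constant `eps`, and the function names are all choices made here.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform of a 1-D signal; returns shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def extract_features(mics, eps=1e-12):
    """mics: array of shape (2, samples). Returns (X, A), each (frames, bins)."""
    x1, x2 = stft(mics[0]), stft(mics[1])
    # Power feature: log power spectrum of microphone 1 (eps avoids log(0)).
    X = np.log(np.abs(x1) ** 2 + eps)
    # Position feature: phase difference between the two microphones.
    A = np.angle(x1 * np.conj(x2))
    return X, A
```

With more than two microphones, the same phase-difference computation could be repeated per microphone pair and the results stacked into a vector, in line with the description above.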

[Sound source model storage unit]
The sound source model storage unit 90 stores the prior probability density function p(q^(l);ψ^(l)) of the sound source power parameter q^(l), which represents the state of the entire sound source power time series of each of a plurality of sound source signals, and the model β_{q(l),n,k,ψ(l)}(S) of the sound source power feature amount at each time-frequency point of each sound source signal given that sound source power parameter q^(l). S is a variable representing the sound source power feature amount X_{n,k}.

[Sound source occupancy update unit]
The sound source occupancy update unit 70 updates the sound source occupancy of each sound source, taking as inputs the sound source position feature amount A_{n,k}, the sound source power feature amount X_{n,k}, the updated sound source power parameter ^q^(l) and sound source position parameter ^φ^(l) of each sound source, the sound source model parameter ^ψ^(l), and the prior probability density function p(q^(l);ψ^(l)) of the sound source power parameter and the model β_{q(l),n,k,ψ(l)}(S) of the sound source power feature amount stored in the sound source model storage unit 90.

The sound source occupancy update unit 70 initializes the sound source occupancy ^M^(l)_{n,k}, for example with random numbers, so that Σ_l ^M^(l)_{n,k} = 1 (step S70). Alternatively, the same method as in the sound source parameter estimation device 900 described in the prior art may be used. Thereafter, the processes of the sound source power parameter update units 60_1 to 60_Ns, the sound source occupancy update unit 70, the sound source position parameter update units 940_1 to 940_Ns, and the sound source model parameter update unit 80 are repeated until convergence.
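The random initialization in step S70 can be sketched as follows. The array layout (sources, frames, bins) and the function name are assumptions made for illustration only.

```python
import numpy as np

def init_occupancy(num_sources, num_frames, num_bins, seed=0):
    """Random occupancy that sums to 1 over sources at every time-frequency point."""
    rng = np.random.default_rng(seed)
    # Small offset keeps every entry strictly positive before normalization.
    M = rng.random((num_sources, num_frames, num_bins)) + 1e-6
    return M / M.sum(axis=0, keepdims=True)

M = init_occupancy(num_sources=2, num_frames=10, num_bins=8)
```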

[Sound source power parameter update unit]
The sound source power parameter update units 60_1 to 60_Ns update the sound source power parameter ^q^(l) as shown in Equation (15) (M-step), taking as inputs the sound source power feature amount X_{n,k}, the sound source occupancy ^M^(l)_{n,k} initialized for each sound source l, the prior probability density function p(q^(l); ψ^(l)) of the sound source power parameter and the model β_{q(l),n,k,ψ(l)}(S) of the sound source power feature amount stored in the sound source model storage unit 90, and the sound source model parameter ^ψ^(l) (step S60).

[Sound source position parameter update unit]
The sound source position parameter update units 940_1 to 940_Ns update the sound source position parameter ^φ^(l) as shown in Equation (17) (M-step), taking as inputs the sound source occupancy ^M^(l)_{n,k} initialized for each sound source l and the sound source position feature amount A_{n,k} (step S940).

[Sound source model parameter update unit]
The sound source model parameter update unit 80 updates the sound source model parameter ^ψ^(l) as shown in Equation (15′), taking as inputs the sound source power feature amount X_{n,k}, the sound source power parameter ^q^(l), the sound source occupancy ^M^(l)_{n,k}, and the prior probability density function p(q^(l); ψ^(l)) of the sound source power parameter and the model β_{q(l),n,k,ψ(l)}(S) of the sound source power feature amount stored in the sound source model storage unit 90 (step S81).

The sound source occupancy update unit 70 then updates the sound source occupancy ^M^(l)_{n,k} as shown in Equation (18) (E-step), taking as inputs the sound source power parameter ^q^(l) and the sound source model parameter ^ψ^(l) updated for each sound source l, the sound source position feature amount A_{n,k}, the sound source power feature amount X_{n,k}, and the prior probability density function p(q^(l); ψ^(l)) of the sound source power parameter and the model β_{q(l),n,k,ψ(l)}(S) of the sound source power feature amount stored in the sound source model storage unit 90 (step S71).

Steps S60 through S71 are repeated until convergence is reached (as long as step S72 yields "no"). Embodiment 2, which uses more specific models of the sound source position feature amount and the sound source power feature amount, is described next.
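The repetition of steps S60 through S71 can be sketched as a generic loop. This is only a skeleton under stated assumptions: the four update callables are hypothetical stand-ins for Equations (15), (17), (15′) and (18), and the stopping rule (largest change in occupancy below `tol`) is an assumed convergence test, since the text does not specify the criterion used in step S72.

```python
import numpy as np

def run_em(M, psi, update_power, update_position, update_model,
           update_occupancy, tol=1e-6, max_iter=100):
    """Alternate the three M-steps and the E-step until the occupancy settles."""
    q = phi = None
    for _ in range(max_iter):
        q = update_power(M, psi)                # step S60 (M-step)
        phi = update_position(M)                # step S940 (M-step)
        psi = update_model(q, M)                # step S81 (M-step)
        M_new = update_occupancy(q, phi, psi)   # step S71 (E-step)
        # Assumed stand-in for the convergence check of step S72.
        converged = np.max(np.abs(M_new - M)) < tol
        M = M_new
        if converged:
            break
    return M, q, phi, psi
```

Passing the updates in as callables keeps the loop independent of the particular feature models chosen for each embodiment.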

First, the feature extraction unit 910 extracts the inter-microphone phase difference as the sound source position feature amount A_{n,k} according to Equation (14). The inter-microphone phase difference of the observation signal originating from each sound source l is assumed to follow a Gaussian distribution whose mean μ^(l)_k and variance σ^(l)_k differ for each frequency. Equation (9) can then be defined as follows.

Here, φ^(l)_k = [μ^(l)_k, σ^(l)_k] is the part of the sound source position parameter φ^(l) that relates only to frequency k, and φ^(l) = [φ^(l)_1, …, φ^(l)_Nk] collects φ^(l)_k over all frequencies k. N(·) denotes the probability density function of a Gaussian distribution.

Meanwhile, the feature extraction unit 910 extracts the log power spectrum of any one of the microphone signals as the sound source power feature amount X_{n,k} according to Equation (13). Furthermore, the sound source power parameter q^(l) is decomposed into a state sequence q^(l) = {q^(l)_0, q^(l)_1, …} representing the state at each time, and state transitions are assumed to occur at each time according to a first-order Markov process.

Here, the sound source power parameter q^(l)_0 represents the initial state of the hidden Markov model. In addition, the posterior probability density function of S^(l)_{n,k} at each time-frequency point (n,k) defined by Equation (3) is assumed to follow a Gaussian distribution that depends only on the state q^(l)_n at that time. Expressed as formulas, this gives the following.

Here, π^(l)_i = p(q^(l)_0 = i) is the prior probability that the initial state of the hidden Markov model is i; α^(l)_{i,j} = p(q^(l)_n = j | q^(l)_{n-1} = i) is the state transition probability of the hidden Markov model moving from state i to state j; and β_{i,n,k,ψ(l)}(S) = p(S^(l)_{n,k} = S | q^(l)_n = i; ψ^(l)) = N(S^(l)_{n,k}; μ^(l)_{i,k}, σ^(l)_{i,k}) is the output probability density function in state i of the hidden Markov model, where μ^(l)_{i,k} and σ^(l)_{i,k} are its mean and variance.

Under this definition, the sound source model parameter ^ψ^(l) consists of π^(l)_i, α^(l)_{i,j}, μ^(l)_{i,k}, and σ^(l)_{i,k} for all i, j, k, and l. In this embodiment, all or some of the sound source model parameters are estimated by the expectation-maximization algorithm as part of the sound source parameters.

Under the above assumptions, in M-step 1 of the expectation-maximization algorithm already described with reference to FIG. 2, the sound source power parameter update units 60_1 to 60_Ns update, for each sound source l, the state time series ^q^(l) = [^q^(l)_0, …, ^q^(l)_Ns] that satisfies Equation (22), using the Viterbi algorithm.
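The Viterbi search used in M-step 1 can be sketched generically as below. The sketch assumes the per-frame, per-state log scores `log_emit` have already been computed; how those scores are assembled from the occupancy and the feature models in Equation (22) is not shown. It also simplifies the model by drawing the state of the first frame directly from the initial distribution. All function and argument names are hypothetical.

```python
import numpy as np

def viterbi(log_pi, log_trans, log_emit):
    """Most likely state sequence of a hidden Markov model.

    log_pi:    (I,)   log initial-state probabilities
    log_trans: (I, I) log_trans[i, j] = log p(state j | previous state i)
    log_emit:  (N, I) per-frame log score of each state
    """
    N, I = log_emit.shape
    delta = log_pi + log_emit[0]
    back = np.zeros((N, I), dtype=int)
    for n in range(1, N):
        scores = delta[:, None] + log_trans   # scores[i, j]: best path ending i -> j
        back[n] = scores.argmax(axis=0)       # best predecessor of each state j
        delta = scores.max(axis=0) + log_emit[n]
    path = np.empty(N, dtype=int)
    path[-1] = delta.argmax()
    for n in range(N - 1, 0, -1):             # backtrack from the last frame
        path[n - 1] = back[n, path[n]]
    return path
```

Working in the log domain avoids numerical underflow on long observation sequences.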

In M-step 2, the sound source position parameter update units 940_1 to 940_Ns update, for each sound source l and at every frequency k, φ^(l)_k = [μ^(l)_k, σ^(l)_k] as follows.
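As an illustration of M-step 2, the sketch below applies the standard occupancy-weighted maximum-likelihood update of a Gaussian mean and variance at each frequency. This is an assumption about the general form of the update, not a transcription of the equations of this document; the names are hypothetical.

```python
import numpy as np

def update_position_params(A, M_l, eps=1e-12):
    """Occupancy-weighted Gaussian fit per frequency bin.

    A:   (frames, bins) inter-microphone phase differences
    M_l: (frames, bins) occupancy of source l, used as weights
    Returns (mu, var), each of shape (bins,).
    """
    w = M_l.sum(axis=0) + eps                      # total weight per bin
    mu = (M_l * A).sum(axis=0) / w                 # weighted mean
    var = (M_l * (A - mu) ** 2).sum(axis=0) / w    # weighted variance
    return mu, var
```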

In M-step 3, the sound source model parameter update unit 80 updates the sound source model parameter ^ψ^(l) for each sound source l. First, letting π^(l) and α^(l) denote the sets of π^(l)_i and α^(l)_{i,j} over i and j, π^(l) and α^(l) are updated as follows.

This update can be readily implemented with known methods for training hidden Markov models. On the other hand, letting μ^(l) and σ^(l) denote the sets of μ^(l)_{i,k} and σ^(l)_{i,k} over all i and k, the update of μ^(l) and σ^(l) is realized as follows.

One way to realize this update is an iterative maximization algorithm, typified by the quasi-Newton method and the conjugate gradient method; many commonly known algorithms can be applied. Of the terms in the above function, log(β_{q(l)n,n,k,ψ(l)}(X_{n,k})) is easy to handle analytically, but log(ρ_{q(l)n,n,k,ψ(l)}(X_{n,k})) involves an integral, which complicates its analytical treatment and makes the computation required by an iterative maximization algorithm relatively complex. One way to avoid this is to approximate log(ρ_{q(l)n,n,k,ψ(l)}(X_{n,k})) by a function that is comparatively easy to handle analytically. This is explained in some detail below.

First, let x = (X_{n,k} − μ^(l)_{i,k})(σ^(l)_{i,k})^{-1/2}, and rewrite f(x) = log(ρ_{q(l)n,n,k,ψ(l)}(X_{n,k})) in the following equivalent normalized form.

The first derivative df(x)/dx of the function f(x) can then be approximated by either of the following two functions.

Both of these functions are easy to handle analytically. Using them in the update of μ^(l)_{i,k} and σ^(l)_{i,k} in particular yields, for example, the following two advantages.

First, approximating the first derivative of f(x) by g₁(x) in the update of μ^(l)_{i,k} guarantees global convergence even for Newton's method, an iterative estimation method that converges quickly. Second, because g₂(x) has the same mathematical form as the first derivative of log(ρ_{q(l)n,n,k,ψ(l)}(X_{n,k})), approximating the first derivative of f(x) by g₂(x) in the update of μ^(l)_{i,k} and σ^(l)_{i,k} allows an iterative maximization algorithm to be applied without complicated computation.

Accordingly, when g₁(x) is used, for example, a single Newton update of μ^(l)_{i,k} can be realized as follows.

Here, ν^(l)_i denotes the set of times n that satisfy q^(l)_n = i, and g₁′(x) denotes the first derivative of g₁(x).

When g₂(x) is used, on the other hand, the update of σ^(l)_{i,k} can be realized as follows.

Here, κ^(l)_{i,k} takes the following value.

In the E-step performed by the sound source occupancy update unit 70, the sound source occupancy is updated as shown in Equation (32).

Another method of initializing the sound source model parameter ^ψ^(l) when no previously trained hidden Markov model is available is now described. The time series of the sound source power feature amount is modeled with a Gaussian mixture model. Among the resulting distribution parameters ψ′ of the Gaussian mixture, π^(l) and α^(l) are determined from the mixing ratios α′, and μ^(l)_{i,k} and σ^(l)_{i,k} are determined from the mean μ′_{i,k} and variance σ′_{i,k} of each Gaussian component. The Gaussian mixture model takes the following form.

To determine the distribution parameters of the Gaussian mixture given the sound source power feature amount X_{n,k}, a commonly known method such as the expectation-maximization algorithm can be applied. Based on the resulting distribution parameters, the sound source model parameter ^ψ^(l) can be determined as follows.

This makes it possible to initialize the sound source model parameters so that they approximately represent the distribution of the observed sound source power feature amount.
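One plausible reading of this initialization is sketched below. The mapping is an assumption: the initial-state probabilities are set to the mixing ratios, every row of the transition matrix is also set to the mixing ratios (so the initial model ignores temporal dependence), and the component means and variances are copied unchanged. The actual mapping used in this document may differ, and all names are hypothetical.

```python
import numpy as np

def hmm_init_from_gmm(weights, means, variances):
    """Build initial HMM parameters from a fitted Gaussian mixture (assumed mapping).

    weights:   (I,)       mixing ratios of the I components
    means:     (I, bins)  component means
    variances: (I, bins)  component variances
    """
    weights = np.asarray(weights, dtype=float)
    pi = weights / weights.sum()            # initial-state probabilities
    trans = np.tile(pi, (len(pi), 1))       # each row: transition probabilities = mixing ratios
    mu = np.array(means, dtype=float)       # copied, so later updates do not alias the inputs
    var = np.array(variances, dtype=float)
    return pi, trans, mu, var
```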

[Modification 1]
Next, a modified method of updating μ^(l)_{i,k}, one of the sound source model parameters ψ^(l) updated by the sound source model parameter update unit 80, is described. Suppose, for example, that the prior probability density function p(μ^(l)_{i,k}) = N(μ^(l)_{i,k}; ~μ^(l)_{i,k}, ~σ^(l)_{i,k}) of μ^(l)_{i,k} is given. Experiments have confirmed that it is effective to set the distribution parameters of this prior probability density function to the values of the sound source model parameters initialized by, for example, the method described above, that is, ~μ^(l)_{i,k} = ^μ^(l)_{i,k} and ~σ^(l)_{i,k} = ^σ^(l)_{i,k}.

With this prior probability density function, the update equation (29) for ^μ^(l)_{i,k} is modified as follows, based on the maximum a posteriori criterion.

Here, ρ is a control parameter (> 0) that adjusts the weight of the prior probability density function, and it can be set freely according to how much the prior probability density function is trusted.

[Sound source separation device]
FIG. 3 shows an example of the functional configuration of the sound source separation device 200 of the present invention, and FIG. 4 shows its operation flow. The sound source separation device 200 includes the sound source parameter estimation device 100 described above and a sound source separation unit 95.

The sound source separation unit 95 obtains the sound source separation signal ^S^(l)_{n,k} of each of the plurality of sound sources by least-squares error estimation, taking as inputs the sound source power feature amount X_{n,k} output by the sound source parameter estimation device 100, the updated sound source occupancy ^M^(l)_{n,k}, sound source power parameter ^q^(l)_n, and sound source model parameter ^ψ^(l), and the model β_{q(l),n,k,ψ(l)}(S) of the sound source power feature amount at each time-frequency point of each sound source signal.

Sound source separation is performed according to the following equation.

[Confirmation experiment]
A confirmation experiment was conducted to evaluate the sound source separation performance of the present invention. The experimental conditions were as follows. Thirty sets of observation signals were prepared, and the number of sound sources was N_s = 2 in all of them. Each observation signal was a mixture of utterances by two male speakers, by two female speakers, or by one female and one male speaker.

The sampling frequency was 16 kHz. The two microphone signals in each observation signal were synthesized by adding the signals on a computer so that the inter-microphone time differences for the two speakers' utterances were ±1.5 ms, respectively (mixing condition 1). The sound source model parameters were initialized by the initialization method described above for the case where no previously trained hidden Markov model is available, and the prior probability density function described in Modification 1 was used. Each hidden Markov model had four states.
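At 16 kHz, a 1.5 ms time difference corresponds to 1.5 × 10⁻³ × 16000 = 24 samples. The sketch below shows how such a two-microphone mixture could be synthesized with integer delays. The assignment of the positive and negative time differences to the two speakers, and all function names, are assumptions for illustration.

```python
import numpy as np

FS = 16000                          # sampling frequency in Hz
DELAY = int(round(1.5e-3 * FS))     # 1.5 ms expressed in samples (24)

def delay(x, d):
    """Delay x by d >= 0 samples, zero-padding the start and keeping the length."""
    return np.concatenate([np.zeros(d), x[:len(x) - d]])

def mix_two_speakers(s1, s2, d=DELAY):
    """Two-microphone mixture with inter-microphone time differences of +d and -d."""
    mic1 = s1 + delay(s2, d)   # speaker 2 reaches microphone 1 late
    mic2 = delay(s1, d) + s2   # speaker 1 reaches microphone 2 late
    return mic1, mic2
```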

The experimental results are shown in FIGS. 5 and 6. In FIG. 5, the vertical axis is the power (dB) of the mean of the output distribution of each state of each hidden Markov model for the sound source power feature amount, and the horizontal axis is frequency (kHz). FIG. 5(a) shows the result estimated from the speech before mixing, and FIG. 5(b) shows the result estimated from the mixture.

The parameters estimated from the mixture (FIG. 5(b)) closely resemble those estimated from the speech before mixing, which demonstrates the high reliability of the parameter estimation of the present invention.

FIG. 6 shows the relationship between the cepstral distortion of the separated signals and the number of mixed sounds. The vertical axis is cepstral distortion (dB), and the horizontal axis is the number of mixed sounds (the larger the number, the longer the observation signal). The thick solid line (□) shows the case where the sound source model parameters of the present invention were trained on each speech signal before mixing. The solid line (○) shows the case where the sound source model parameters of the present invention were estimated from the observation signal alone. The dash-dot line (△) shows the case where the method of Non-Patent Document 1 was used.

FIG. 6(a) shows the results for mixtures under mixing condition 1, and FIGS. 6(b) and 6(c) show the results for mixtures recorded in other, reverberant environments (mixing conditions 2 and 3). In all cases, the sound source separation method of the present invention achieves separation with smaller cepstral distortion than the method of Non-Patent Document 1. It can also be confirmed that, even without pre-training of the sound source model parameters, performance comparable to that with pre-training is achieved.

When the processing means of the above devices are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

A sound source parameter estimation device comprising:
a sound source model storage unit that stores a prior probability density function of a sound source power parameter representing the state of the entire sound source power time series of each of a plurality of sound source signals, and a model of a sound source power feature amount at each time-frequency point of each sound source signal given the sound source power parameter;
a feature extraction unit that receives an observation signal, obtained by converting time-domain signals of the plurality of sound source signals picked up by a plurality of microphones into time-frequency-domain signals, and extracts a sound source position feature amount and a sound source power feature amount at each time-frequency point;
a sound source power parameter update unit that updates the sound source power parameter of each sound source, taking as inputs the sound source power feature amount, a sound source occupancy that is the posterior probability density function of the exclusive sound source given the observation signal, the prior probability density function of the sound source power parameter, the model of the sound source power feature amount of each sound source signal, and a sound source model parameter that controls the behavior of each of the model of the sound source power feature amount and the prior probability density function;
a sound source position parameter update unit that updates a sound source position parameter of each sound source, taking as inputs the sound source position feature amount and the sound source occupancy;
a sound source model parameter update unit that updates the sound source model parameter, taking as inputs the sound source power feature amount, the sound source power parameter, the sound source occupancy, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in the sound source model storage unit; and
a sound source occupancy update unit that updates the sound source occupancy of each sound source, taking as inputs the sound source position feature amount, the sound source power feature amount, the updated sound source power parameter and sound source position parameter of each sound source, the sound source model parameter, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in the sound source model storage unit.

The sound source parameter estimation device according to claim 1, wherein
the time series of the sound source power feature amount follows a hidden Markov model, the posterior probability density function of the sound source power parameter under the condition that the state of the Markov model of each sound source signal is known is modeled by a Gaussian distribution whose covariance matrix is diagonal, and the mean of the Gaussian distribution and the diagonal elements of the covariance matrix are included in the sound source model parameter.

A sound source separation device comprising:
the sound source parameter estimation device according to claim 1 or 2; and
a sound source separation unit that obtains a sound source separation signal of each of the plurality of sound sources by least-squares error estimation, taking as inputs the sound source power feature amount output by the sound source parameter estimation device, the sound source occupancy, the sound source power parameter, and the sound source model parameter updated by the sound source parameter estimation device, and the model of the sound source power feature amount at each time-frequency point of each sound source signal.

A sound source parameter estimation method comprising:
a feature extraction step of receiving an observation signal, obtained by converting time-domain signals of a plurality of sound source signals picked up by a plurality of microphones into time-frequency-domain signals, and extracting a sound source position feature amount and a sound source power feature amount at each time-frequency point;
a sound source power parameter update step of updating a sound source power parameter of each sound source, taking as inputs the sound source power feature amount, a sound source occupancy that is the posterior probability density function of the exclusive sound source given the observation signal, a prior probability density function of the sound source power parameter and a model of the sound source power feature amount of each sound source signal stored in a sound source model storage unit, and a sound source model parameter that controls the behavior of each of the model of the sound source power feature amount and the prior probability density function;
a sound source position parameter update step of updating a sound source position parameter of each sound source, taking as inputs the sound source position feature amount and the sound source occupancy;
a sound source model parameter update step of updating the sound source model parameter, taking as inputs the sound source power feature amount, the sound source power parameter, the sound source occupancy, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in the sound source model storage unit; and
a sound source occupancy update step of updating the sound source occupancy of each sound source, taking as inputs the sound source position feature amount, the sound source power feature amount, the updated sound source power parameter and sound source position parameter of each sound source, the sound source model parameter, and the prior probability density function of the sound source power parameter and the model of the sound source power feature amount stored in the sound source model storage unit.

The sound source parameter estimation method according to claim 4, wherein
the time series of the sound source power feature amount follows a hidden Markov model, the posterior probability density function of the sound source power parameter under the condition that the state of the Markov model of each sound source signal is known is modeled by a Gaussian distribution whose covariance matrix is diagonal, and the mean of the Gaussian distribution and the diagonal elements of the covariance matrix are included in the sound source model parameter.

A sound source separation method comprising:
a sound source separation step of obtaining a sound source separation signal of each of the plurality of sound sources by least-squares error estimation, taking as inputs the sound source power feature amount extracted by the sound source parameter estimation method according to claim 4, the sound source occupancy, the sound source power parameter, and the sound source model parameter updated by the sound source parameter estimation method, and the model of the sound source power feature amount at each time-frequency point of each sound source signal.

A program for causing a computer to function as the sound source parameter estimation device or the sound source separation device according to any one of claims 1 to 3.