JP2011164467A

JP2011164467A - Model estimation device, sound source separation device, and method and program therefor

Info

Publication number: JP2011164467A
Application number: JP2010028985A
Authority: JP
Inventors: Akiko Araki; 章子荒木; Tomohiro Nakatani; 智広中谷; Hiroshi Sawada; 宏澤田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-02-12
Filing date: 2010-02-12
Publication date: 2011-08-25
Anticipated expiration: 2030-02-12
Also published as: JP5337072B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound source separation device that operates correctly even when the number of sound sources is unknown, and that achieves sound source separation without a permutation problem between frequency components.

SOLUTION: A frequency domain conversion unit converts the time-domain observation signal at each microphone into a frequency-domain observation signal spectrum, and a phase difference calculation unit calculates the phase difference between the observation signal spectra of the microphones. A model estimation unit sequentially fits the observation signal spectrum to a spectrum probability model representing the distribution of the spectrum, and the inter-microphone phase difference to a phase difference probability model representing the distribution of the phase difference, and uses a predetermined evaluation function that evaluates each probability model to calculate the model parameters of each probability model suited to signal extraction and the presence probability of each sound source. Effective sound sources are extracted using the model parameters of each probability model and the presence probability of each sound source, and the observation signal spectrum is separated for each effective sound source using a mask corresponding to that source.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to sound source separation, which estimates each original signal from acoustic data in which a plurality of signals are mixed. In particular, it relates to a model estimation device, a sound source separation device, and methods and programs therefor, belonging to blind sound source separation, which estimates each original signal solely from the mixed acoustic data, without using information about the original signals or about how they were mixed.

FIG. 9 shows a sound source separation device 10 configured on the basis of a conventional blind sound source separation technique (for example, Non-Patent Document 1). Suppose that, at a time t, the signals emitted from M sound sources and mixed together with noise are observed with two microphones #1 and #2, giving one time-domain observation signal per microphone.

First, the frequency domain conversion unit 110 converts these time-domain observation signals into the observation signal spectra X_{n,f,1} and X_{n,f,2} by a short-time Fourier transform. Here n is the index of the time frame over which the Fourier transform is taken, and f is the index of the frequency component. Hereinafter, unless otherwise noted, "observation signal" refers to the frequency-domain signal; a time-domain observation signal is explicitly identified as such.

The observation signal spectrum is assumed to follow the mixing model of equation (1), in which h_{f,L,m} is the frequency response from sound source m (m = 1, 2, ..., M) to microphone L (L = 1, 2), S_{n,f,m} is the frequency-domain representation of the signal of sound source m, n (= 0, ..., N_n − 1) is the time index, f is the frequency, f_s is the sampling frequency, and F is the number of sampling points.

To perform sound source separation, the sound sources are assumed to be sparse: each source signal S_{n,f,m} takes a large value only rarely, and at each time-frequency point (n, f) at most one source S_{n,f,m} takes a large value. This property is observed, for example, for mutually different speech signals. Under this assumption, equation (1) can be rewritten as equation (2), in which only the term of the dominant source remains. Here S_{n,f,m} is the source signal that is dominant at time-frequency point (n, f).

Next, the phase difference calculation unit 120 calculates the phase difference between the observation signal spectra at microphone #1 and microphone #2 (called the inter-microphone phase difference), A_{n,f} = arg[X_{n,f,1} / X_{n,f,2}]. This inter-microphone phase difference A_{n,f} is determined by the positional relationship between the sound source and the microphones; if the sources are located at different positions, A_{n,f} takes a value specific to each source.
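As an illustration of the processing in units 110 and 120, the following minimal Python sketch computes a short-time Fourier transform per microphone and then the inter-microphone phase difference A_{n,f}. The function names, the frame length, the hop size, and the periodic Hann window are assumptions made for this example; the source does not specify them. The phase is taken from X1 · conj(X2), which has the same argument as X1 / X2 wherever X2 is nonzero.

```python
import numpy as np

def stft(x, frame_len=64, hop=32):
    """Short-time Fourier transform; returns an array of shape (frames, bins)."""
    t = np.arange(frame_len)
    # Periodic Hann window (an assumed choice, not taken from the source).
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * t / frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)

def inter_mic_phase_difference(X1, X2):
    """A[n, f] = arg(X1[n, f] / X2[n, f]), computed without a division."""
    return np.angle(X1 * np.conj(X2))

# Usage: a tone that reaches microphone #2 with a phase lag of 0.5 rad.
t = np.arange(512)
x1 = np.cos(2.0 * np.pi * 8 * t / 64)
x2 = np.cos(2.0 * np.pi * 8 * t / 64 - 0.5)
A = inter_mic_phase_difference(stft(x1), stft(x2))
```

In this usage example the tone sits exactly on frequency bin 8, so every frame reports the same phase difference at that bin.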

Next, the phase difference classification unit 31 clusters the inter-microphone phase differences A_{n,f} at each frequency. From equation (2), which assumes sparseness, the phase difference μ_{n,f,m} corresponding to source m is obtained at the time-frequency points (n, f) where source m is dominant, and the phase difference μ_{n,f,m'} corresponding to source m' is obtained at the points where source m' is dominant. Clustering the phase differences A_{n,f} therefore forms one cluster per source component. In the conventional method, the number of clusters must be specified, so the number of sound sources M is read from the sound source number holding unit 32, and the phase difference classification unit 31 performs the clustering using, for example, the k-means method.

Because the clustering is performed separately at each frequency, the correspondence between a cluster index and the index of the source it represents differs from frequency to frequency. For example, at one frequency f the first cluster may correspond to source 1 and the second cluster to source 2, while at another frequency f' the first cluster corresponds to source 2 and the second cluster to source 1. This is called the permutation problem. To solve it, a permutation resolution unit 33 is provided, which aligns the cluster indices with the source indices over all frequencies so that clusters and sources correspond exactly one to one. This is done, for example, as follows. First, for each cluster obtained at each frequency, the mean A_f of the phase differences A_{n,f} in that cluster is computed. Next, the values A_f / 2πf, obtained by normalizing the mean A_f by the frequency f, are clustered, which groups together the frequency components belonging to the same source. This aligns the cluster indices and the source indices at all frequencies. In the end, the m-th cluster C_m contains only the components A_{n,f} corresponding to source m.
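The per-frequency clustering step can be sketched as follows: a plain one-dimensional k-means applied to the phase differences observed at a single frequency. The function name, the quantile-based initialization, and the fixed iteration count are assumptions for illustration, and the sketch ignores the circular nature of phase. Because it is run independently at each frequency, nothing ties cluster index j at one frequency to the same source at another frequency, which is the permutation problem described above.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=50):
    """Plain k-means on scalar values; returns (labels, centroids)."""
    values = np.asarray(values, dtype=float)
    # Deterministic start (assumed): spread the centroids over the quantiles.
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign each value to its nearest centroid.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its centroid
                centroids[j] = values[labels == j].mean()
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids

# Usage: phase differences at one frequency, produced by two sources.
phases = np.concatenate([
    -1.0 + np.linspace(-0.05, 0.05, 20),
    0.8 + np.linspace(-0.05, 0.05, 30),
])
labels, centroids = kmeans_1d(phases, k=2)
```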

Next, the sound source separation unit 40 refers to C_m and creates a mask M_{n,f,m} that takes the value 1 at the time-frequency points (n, f) forming the cluster of source m and the value 0 at all other time-frequency points. Such a mask is created for every source m. The mask M_{n,f,m} is then multiplied by one of the observation signals (here, X_{n,f,1}) to obtain the separated signal Y_{n,f,m}.

Y_{n,f,m} = X_{n,f,1} · M_{n,f,m} (3)

Finally, the time domain conversion unit 150 converts the obtained separated signals Y_{n,f,m} into time-domain signals.
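The conventional masking step of equation (3) can be sketched as below. The array layout (frames × frequencies × sources) and the function names are assumptions for illustration; `labels[n, f]` is assumed to hold the source index whose aligned cluster C_m contains the point (n, f).

```python
import numpy as np

def binary_masks(labels, n_sources):
    """M[n, f, m] = 1 where point (n, f) belongs to source m, else 0."""
    return (labels[..., None] == np.arange(n_sources)).astype(float)

def apply_masks(X1, masks):
    """Equation (3): Y[n, f, m] = X1[n, f] * M[n, f, m]."""
    return X1[..., None] * masks

# Usage on a 2 x 2 time-frequency grid with two sources.
labels = np.array([[0, 1], [1, 0]])
X1 = np.array([[1 + 1j, 2 + 0j], [3 + 0j, 4j]])
Y = apply_masks(X1, binary_masks(labels, 2))
```

Since every point is assigned to exactly one source, the separated spectra sum back to the observation.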

H. Sawada, S. Araki and S. Makino, "A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures", Proc. WASPAA2007, 2007, p.139-142

As described above, the conventional method gives rise to the permutation problem between frequencies, and solving it is indispensable. However, the clustering of A_f / 2πf that is commonly used in the permutation resolution unit 33 does not work well when the room is highly reverberant or when the microphone spacing is wide. When reverberation is strong, the inter-microphone phase difference becomes frequency dependent, so A_f / 2πf does not take a constant value across frequencies and clustering it becomes difficult. When the microphone spacing is wide, the actual inter-microphone phase difference exceeds ±2π, yet the arg operation in A_{n,f} = arg[X_{n,f,1} / X_{n,f,2}] confines A_{n,f} to the range −2π ≤ A_{n,f} ≤ 2π; here too A_f / 2πf does not take a constant value across frequencies and clustering it becomes difficult. In addition, the conventional method requires the number of sound sources M to be known, so it is difficult to apply when M is unknown.
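The effect of wide microphone spacing can be illustrated numerically. The sketch below assumes an idealized anechoic, far-field model in which the true phase difference is 2πf·d·sin(θ)/c for spacing d, arrival angle θ, and sound speed c. This model, the parameter values, and the function name are assumptions made for the example and do not appear in the source. NumPy's `angle` returns only the principal value of the phase, so once the true phase difference leaves that range, the normalized value A_f / 2πf stops being constant over frequency.

```python
import numpy as np

def normalized_phase_difference(freqs_hz, mic_distance_m, angle_rad, c=340.0):
    """Wrapped phase difference divided by 2*pi*f (assumed free-field model)."""
    delay = mic_distance_m * np.sin(angle_rad) / c       # seconds
    true_phase = 2.0 * np.pi * freqs_hz * delay          # unwrapped, in radians
    wrapped = np.angle(np.exp(1j * true_phase))          # what arg[...] returns
    return wrapped / (2.0 * np.pi * freqs_hz)

freqs = np.linspace(100.0, 4000.0, 40)
near = normalized_phase_difference(freqs, 0.04, np.pi / 6)  # 4 cm spacing
far = normalized_phase_difference(freqs, 0.30, np.pi / 6)   # 30 cm spacing
```

With the 4 cm spacing, `near` equals the delay at every frequency, so it clusters cleanly; with the 30 cm spacing, `far` varies with frequency because of wrapping.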

An object of the present invention is to provide a model estimation device that operates even when the number of sound sources is unknown and that enables good sound source separation without causing a permutation problem between frequency components, and a sound source separation device using the model estimation device.

The model estimation device of the present invention observes signals from a plurality of mixed sound sources with a plurality of microphones and extracts the signal of each mixed source. It comprises a frequency domain conversion unit, a phase difference calculation unit, and a model estimation unit. The frequency domain conversion unit converts the time-domain observation signal at each microphone into a frequency-domain observation signal spectrum. The phase difference calculation unit calculates the phase difference between the observation signal spectra of the microphones (the inter-microphone phase difference). The model estimation unit sequentially fits the observation signal spectrum to a spectrum probability model representing the distribution of the spectrum, and the inter-microphone phase difference to a phase difference probability model representing the distribution of the phase difference, and, using a predetermined evaluation function that evaluates each probability model, calculates the model parameters of each probability model suited to signal extraction and the presence probability of each sound source.

The sound source separation device of the present invention comprises the above model estimation device, a signal separation unit, and a time domain conversion unit. The signal separation unit extracts effective sound sources based on the presence probability of each sound source, creates a mask for each effective sound source using the posterior probabilities calculated from the model parameters of each probability model and the presence probability of each sound source, and uses the masks to generate separated signals in which the observation signal spectrum is separated per effective sound source. The time domain conversion unit converts the separated signal of each effective sound source into a time-domain signal.

According to the model estimation device of the present invention and the sound source separation device using it, good sound source separation can be achieved even when the number of sound sources is unknown, without causing a permutation problem between frequency components.

FIG. 1: Block diagram showing a configuration example of the model estimation device 100 of the present invention.
FIG. 2: Diagram showing an example of the processing flow of the model estimation device 100 of the present invention.
FIG. 3: Diagram showing how the frequency components of a signal are synchronized.
FIG. 4: Block diagram showing a configuration example of the sound source separation device 200 of the present invention.
FIG. 5: Diagram showing an example of the processing flow of the sound source separation device 200 of the present invention.
FIG. 6: Diagram showing an example of the masks obtained by the mask creation unit 142.
FIG. 7: Diagram showing, for m = 4 and 5 in FIG. 6, an example of the frequency characteristic of the phase difference parameter (mean value) and the time characteristic of the spectrum parameter.
FIG. 8: Diagram showing a performance comparison between the sound source separation device 200 of the present invention and the conventional sound source separation device 10.
FIG. 9: Block diagram showing a configuration example of the conventional sound source separation device 10.

Embodiments of the present invention are described in detail below.

FIG. 1 is a block diagram showing a configuration example of the model estimation device 100 of the present invention, and FIG. 2 shows an example of its processing flow. The model estimation device 100 observes, with a plurality of microphones, signals from a plurality of sound sources mixed together with noise, and extracts each of the mixed signals. It comprises a frequency domain conversion unit 110, a phase difference calculation unit 120, and a model estimation unit 130.

The frequency domain conversion unit 110 and the phase difference calculation unit 120 are the same as those of the conventional sound source separation device 10. That is, the time-domain observation signals obtained by observing, with two microphones #1 and #2, the mixture of the signals emitted from M sound sources at a time t are converted by the frequency domain conversion unit 110 into observation signal spectra by a short-time Fourier transform (S1). Here n is the index of the frame over which the Fourier transform is taken, and f is the index of the frequency component. The phase difference calculation unit 120 calculates the phase difference between the observation signal spectrum of microphone #1 and that of microphone #2 (hereinafter, the "inter-microphone phase difference"), A_{n,f} = arg[X_{n,f,1} / X_{n,f,2}] (S2).

Hereinafter, the observation signal spectrum of microphone #1 is written X_{n,f}, and this notation is used in the description.

The model estimation unit 130 sequentially fits the inter-microphone phase difference to a phase difference probability model representing the distribution of the phase difference, and the observation signal spectrum to a spectrum probability model representing the distribution of the spectrum, and, using a predetermined evaluation function that evaluates each probability model, calculates the model parameters and related quantities of each probability model suited to signal extraction (S3 to S5).

The phase difference probability model, which represents the distribution of the phase difference, and the spectrum probability model, which represents the distribution of the spectrum, are defined as follows.

When the positions of the sound sources are fixed and all sources lie in different directions as seen from the microphones, the inter-microphone phase difference A_{n,f} takes a value specific to each source m. The present invention therefore models the distribution of the inter-microphone phase difference A_{n,f} for source m as a normal distribution N with mean μ_{f,m} and variance σ²_{f,m}. This is called the phase difference probability model. The phase difference distribution is defined separately for each frequency f. Accordingly, the model parameters of the phase difference probability model can be written as

θ_A = {μ_{f,m}, σ²_{f,m}}
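Under the phase difference probability model, the log-likelihood of each observed phase difference for each source can be evaluated as in the following sketch. It uses the standard univariate normal density; the array layout (frames × frequencies × sources) and the function name are assumptions for illustration.

```python
import numpy as np

def phase_log_likelihood(A, mu, var):
    """log N(A[n, f]; mu[f, m], var[f, m]) for every (n, f, m).

    A:   (N_n, N_f) inter-microphone phase differences
    mu:  (N_f, M)   per-frequency, per-source means
    var: (N_f, M)   per-frequency, per-source variances
    """
    diff = A[:, :, None] - mu[None, :, :]
    return -0.5 * np.log(2.0 * np.pi * var)[None] - 0.5 * diff ** 2 / var[None]

# Usage: 2 frames, 3 frequencies, 2 candidate sources.
A = np.zeros((2, 3))
mu = np.zeros((3, 2))
var = np.ones((3, 2))
logp_A = phase_log_likelihood(A, mu, var)
```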

To model the observation signal spectrum X_{n,f}, the present invention assumes sparseness of the sound sources, as in equation (2). In addition, to simplify the description, the frequency response from source m to microphone 1 is taken to satisfy |h_{f,1,m}| = 1 and arg(h_{f,1,m}) = 0, so that equation (2) reduces to the observation X_{n,f} being the dominant source signal S_{n,f,m} itself. Based on these assumptions, the observation signal spectrum X_{n,f} is modeled by a complex normal distribution N_c with mean 0 and variance γ²_{n,f,m} (equation (6)). This is called the spectrum probability model.

M is the number of mixture components: if the number of sound sources is known, that number is used, and if it is unknown, a sufficiently large number (for example, M = 10) is used. The variance γ²_{n,f,m} represents the expected value E[|S_{n,f,m}|²] of the power of source m. Furthermore, γ_{n,f,m} is modeled using a spectral envelope ρ_{n,m}, which depends on time but not on frequency, and a spectral shape a_{n,f,m}, which depends on both time and frequency:

γ_{n,f,m} = a_{n,f,m} · ρ_{n,m} (7)

The spectral envelope ρ_{n,m} models the property that the onsets (the points where a high-power component of the signal begins) and the offsets (the points where a high-power component ends) of a signal's frequency components are synchronized across all frequencies. FIG. 3 illustrates this synchronization. Darker shading indicates higher power, and the figure shows that the high-power portions of the individual frequency components occur at almost the same times. In the present invention, the spectral shape a_{n,f,m} is replaced by the amplitude |X_{n,f}| of the observation signal spectrum, that is, a_{n,f,m} = |X_{n,f}|. Accordingly, the model parameters of the spectrum probability model can be written as

θ_X = {ρ²_{n,m}}
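Equation (7) with the substitution a_{n,f,m} = |X_{n,f}| can be sketched with array broadcasting as below, together with a zero-mean complex normal log-density. The density used is the standard circular complex normal, 1/(π·var) · exp(−|X|²/var); the source's own definition of N_c is not reproduced in this text, so that form, the variance floor, the array layout, and the function names are assumptions.

```python
import numpy as np

def spectral_std(X, rho):
    """Equation (7) with a = |X|: gamma[n, f, m] = |X[n, f]| * rho[n, m]."""
    return np.abs(X)[:, :, None] * rho[:, None, :]

def complex_normal_logpdf(X, var, floor=1e-12):
    """log N_c(X; 0, var) for the standard circular complex normal (assumed form)."""
    var = np.maximum(var, floor)  # guard against zero variance (an addition)
    return -np.log(np.pi * var) - np.abs(X) ** 2 / var

# Usage: one frame, one frequency, two candidate sources.
X = np.array([[3 + 4j]])
rho = np.array([[2.0, 0.5]])
gamma = spectral_std(X, rho)
logp = complex_normal_logpdf(np.array([1 + 0j]), 1.0)
```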

From the above, the model p_{n,f}(X_{n,f}, A_{n,f}; θ) of the observation data (the inter-microphone phase difference A_{n,f} and the observation signal spectrum X_{n,f}) is the mixture of the per-source models p_{n,f}(X_{n,f}, A_{n,f} | m; θ) weighted by α_m (equation (8)). Here α_m is the presence probability p(m; θ) of source m, with Σ_m α_m = 1; α_m is hereinafter called the mixture weight. Assuming that the inter-microphone phase difference A_{n,f} and the observation signal spectrum X_{n,f} are mutually independent, p_{n,f}(X_{n,f}, A_{n,f} | m; θ) is given by equation (9), which combines the phase difference likelihood and the spectrum likelihood. Here w_a and w_x are the weights on the phase difference likelihood and on the spectrum likelihood, respectively.

Using the phase difference probability model and the spectrum probability model defined above, the model estimation unit 130 sequentially fits the inter-microphone phase difference A_{n,f} to the phase difference probability model and the observation signal spectrum X_{n,f} to the spectrum probability model, and, using a predetermined evaluation function that evaluates each probability model, obtains the posterior probabilities (described later) and the parameter set suited to signal extraction, θ = {θ_A, θ_X, α_m} = {μ_{f,m}, σ²_{f,m}, ρ²_{n,m}, α_m}.

The model estimation unit 130 comprises a posterior probability calculation unit 131, a parameter update unit 132, and a parameter holding unit 133. Prior to the processing in the model estimation unit 130, an initial value θ⁰ of the parameter set θ = {μ_{f,m}, σ²_{f,m}, ρ²_{n,m}, α_m} is prepared in the parameter holding unit 133, and the initial value of the parameter update count index t, the number of mixture components M, and either the maximum number of parameter updates T or the convergence threshold Δ are set (S0). This preparation may be carried out at any time before the processing in the model estimation unit 130.

The posterior probability calculation unit 131 calculates, from the observation signal spectrum X_{n,f}, the inter-microphone phase difference A_{n,f}, and the current parameter set θ^t = {μ^t_{f,m}, (σ²_{f,m})^t, (ρ²_{n,m})^t, α^t_m} stored in the parameter holding unit 133, the posterior probability pm_{n,f}, that is, the probability that the inter-microphone phase difference A_{n,f} and the observation signal spectrum X_{n,f} at each time-frequency point (n, f) are due to the signal from source m (S3, equation (10)). The weights are set to, for example, w_a = 1.0 and w_x = 0.2.
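The posterior calculation of step S3 can be sketched as a weighted Bayes' rule evaluated in the log domain. This sketch assumes that the weights w_a and w_x act as exponents on the two likelihoods (equivalently, as multipliers on the log-likelihoods); the source's own equation is not reproduced in this text, so that reading, the array layout, and the function name are assumptions.

```python
import numpy as np

def posterior(log_alpha, logp_A, logp_X, w_a=1.0, w_x=0.2):
    """Posterior over sources m at every time-frequency point (n, f).

    log_alpha: (M,)          log mixture weights
    logp_A:    (N_n, N_f, M) phase difference log-likelihoods
    logp_X:    (N_n, N_f, M) spectrum log-likelihoods
    """
    score = log_alpha + w_a * logp_A + w_x * logp_X
    score = score - score.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(score)
    return p / p.sum(axis=-1, keepdims=True)

# Usage: source 0 explains the phase three times better than source 1.
log_alpha = np.log(np.array([0.5, 0.5]))
logp_A = np.zeros((2, 3, 2))
logp_A[..., 1] = -np.log(3.0)
logp_X = np.zeros((2, 3, 2))
post = posterior(log_alpha, logp_A, logp_X)
```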

The parameter update unit 132 comprises spectrum parameter update means 132a, phase difference parameter update means 132b, and mixture weight update means 132c, and updates the current parameter set θ^t to θ^{t+1} (S4).

The spectrum parameter update means 132a uses the posterior probability pm_{n,f} to update the model parameter (ρ²_{n,m})^t of the spectrum probability model according to equation (11) (S4-1), in which N_f is the number of frequency components.

The phase difference parameter update means 132b uses the posterior probability pm_{n,f} and the inter-microphone phase difference A_{n,f} to update the model parameters θ_A^t = {μ^t_{f,m}, (σ²_{f,m})^t} of the phase difference probability model according to equations (12) and (13) (S4-2).

The mixture weight update means 132c uses the posterior probability pm_{n,f} to update the mixture weight α^t_m according to equation (14) (S4-3), in which N_n is the number of time frames.
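Part of the update step can be sketched as follows. The source's update equations are not reproduced in this text, so the sketch uses the standard posterior-weighted forms for a Gaussian mixture: a weighted mean and weighted variance over time frames for the phase difference parameters, and the average posterior over all time-frequency points for the mixture weight. These forms, the variance floor, and the function name are assumptions; the spectral envelope update is deliberately omitted because its exact form cannot be recovered from the text.

```python
import numpy as np

def update_phase_and_weights(post, A, var_floor=1e-6):
    """Posterior-weighted updates (assumed standard EM forms).

    post: (N_n, N_f, M) posterior probabilities
    A:    (N_n, N_f)    inter-microphone phase differences
    Returns mu (N_f, M), var (N_f, M), alpha (M,).
    """
    norm = np.maximum(post.sum(axis=0), 1e-12)             # (N_f, M)
    mu = (post * A[:, :, None]).sum(axis=0) / norm
    var = (post * (A[:, :, None] - mu[None]) ** 2).sum(axis=0) / norm
    n_frames, n_freqs = post.shape[:2]
    alpha = post.sum(axis=(0, 1)) / (n_frames * n_freqs)
    return mu, np.maximum(var, var_floor), alpha

# Usage: 4 frames, 1 frequency, 2 sources with hard (one-hot) posteriors.
A = np.array([[0.0], [2.0], [10.0], [12.0]])
post = np.zeros((4, 1, 2))
post[:2, 0, 0] = 1.0
post[2:, 0, 1] = 1.0
mu, var, alpha = update_phase_and_weights(post, A)
```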

The basis for deriving the update equations (11) to (14) used in the parameter update unit 132 is as follows. The parameter updates are obtained by deriving an EM algorithm, in which the index m of the normal distributions is treated as a hidden variable. First, a cost function L(θ) for maximum likelihood estimation is defined, in which p(m | θ) is the mixture weight α_m and p_{n,f}(X_{n,f}, A_{n,f} | m; θ) is as given in equation (9); w_a and w_x are the weights on the phase difference likelihood and on the spectrum likelihood, respectively. The evaluation function (Q function) used in the EM algorithm is then defined.

This Q function gives a higher evaluation value the more the spectral envelopes whose onsets and offsets are synchronized are clustered into a single cluster. In other words, for each signal, the more closely the strengths of its frequency components are synchronized, the more suitable it is rated for signal extraction.

The updated parameter set θ^{t+1} = {μ^{t+1}_{f,m}, (σ²_{f,m})^{t+1}, (ρ²_{n,m})^{t+1}, α^{t+1}_m} is estimated as the one that maximizes this Q function. That is, equation (11) for the model parameter (ρ²_{n,m})^{t+1} of the spectrum probability model, equations (12) and (13) for the model parameters μ^{t+1}_{f,m} and (σ²_{f,m})^{t+1} of the phase difference probability model, and equation (14) for the mixture weight α^{t+1}_m are each derived by maximizing the Q function with respect to the corresponding parameter.

The parameter holding unit 133 stores the parameter set θ^{t+1} obtained by the update processing in the parameter update unit 132 and supplies it as the parameter set θ^t for the next round of processing in the posterior probability calculation unit 131 and the parameter update unit 132.

In the model estimation unit 130, the processing of the posterior probability calculation unit 131 and the parameter update unit 132 (including reading and writing the updated data in the parameter holding unit 133) is repeated until the preset maximum number of parameter updates T is reached, or until the change in each parameter value caused by an update falls below the convergence threshold Δ. The model estimation unit 130 then outputs the parameter set after the final iteration, θ^e = {μ^e_{f,m}, (σ^e_{f,m})², (ρ^e_{n,m})², α^e_m}, and the posterior probability pm^e_{n,f} at that point.
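The stopping rule described above can be sketched as a generic iteration driver. Here `em_step` is a hypothetical callable standing in for one pass through units 131 and 132, and the parameter set is assumed to be a dictionary of NumPy arrays; measuring the change as the largest absolute difference over all parameters is also an assumption, since the source does not define the measure.

```python
import numpy as np

def run_em(theta, em_step, max_updates, delta):
    """Repeat em_step until max_updates (T) is reached or the largest
    parameter change falls below delta. Returns (theta, updates_done)."""
    n_updates = 0
    while n_updates < max_updates:
        new_theta = em_step(theta)
        change = max(
            float(np.max(np.abs(new_theta[k] - theta[k]))) for k in theta
        )
        theta = new_theta
        n_updates += 1
        if change < delta:
            break
    return theta, n_updates

# Usage with a toy step that halves the single parameter on every update.
halve = lambda th: {"x": th["x"] / 2.0}
theta_e, used = run_em({"x": np.array([1.0])}, halve, max_updates=100, delta=0.1)
```

In the toy run the successive changes are 0.5, 0.25, 0.125, and 0.0625, so the loop stops on the fourth update.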

The sound source separation device 200 can be configured by adding a signal separation unit 140 and a time domain conversion unit 150 to the model estimation device 100 described in Embodiment 1, as shown in FIG. 4. FIG. 5 shows the processing flow.

The signal separation unit 140 comprises an effective sound source estimation unit 141, a mask creation unit 142, and a separated signal creation unit 143, and separates the signal of each sound source from the observation signal spectrum X_{n,f} (S6).

Of the M indices m, where M is the number of mixture components used in the calculation, the effective sound source estimation unit 141 extracts the indices of the sound sources that actually exist (hereinafter, "effective sound sources"). Specifically, when the number of sound sources is known and M equals that number, all indices m are output. When the number of sound sources is unknown, each m whose updated mixture weight α^e_m is sufficiently large (for example, α^e_m > ε, with ε = 10⁻⁶ or similar) is judged to be an effective sound source, and all such m are output.
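The selection rule of the effective sound source estimation unit 141 can be sketched as follows. The function name and the `known_count` flag are assumptions for illustration; the threshold default follows the example value ε = 10⁻⁶ given above.

```python
import numpy as np

def effective_sources(alpha, eps=1e-6, known_count=False):
    """Indices m judged to be effective sound sources.

    alpha: (M,) mixture weights after the final update.
    If the number of sources is known (and M equals it), keep every index;
    otherwise keep only indices whose weight exceeds eps.
    """
    alpha = np.asarray(alpha, dtype=float)
    if known_count:
        return np.arange(alpha.size)
    return np.flatnonzero(alpha > eps)

# Usage: four mixture components, two of which collapsed to near zero.
idx = effective_sources([0.6, 1e-9, 0.4, 0.0])
```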

The mask creation unit 142 creates a mask M_{n,f,m} for extracting the sound source corresponding to each index m output as an effective sound source. The mask M_{n,f,m} is obtained from the updated posterior probability pm^e_{n,f} as
M_{n,f,m} = pm^e_{n,f}   (17)

The separated signal creation unit 143 multiplies the observed signal spectrum X_{n,f} by the mask M_{n,f,m} to calculate the separated signal Y_{n,f,m}:
Y_{n,f,m} = X_{n,f} · M_{n,f,m}   (18)
Finally, the time domain conversion unit 150 converts the separated signal Y_{n,f,m} into a time domain signal y_m(t) for each sound source m and outputs it.
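Equations (17) and (18) amount to selecting the posterior slices of the effective sound sources and multiplying them element-wise with the observed spectrum. The sketch below assumes an array layout (frames × frequencies × mixture components) and a function name `separate` that are not from the source; the final inverse transform of the time domain conversion unit 150 is not shown.

```python
import numpy as np

def separate(X, posterior, active):
    """Apply posterior-probability masks to the observed spectrum.

    X         : complex array, shape (N, F), observed spectrum X_{n,f}
    posterior : real array, shape (N, F, M), updated posterior per component m
    active    : list of effective source indices
    Returns Y with shape (N, F, len(active)).
    """
    masks = posterior[:, :, active]   # eq. (17): the mask is the posterior itself
    return X[:, :, None] * masks      # eq. (18): Y = X * M for each source

# Synthetic example: only components 3 and 4 carry posterior mass.
rng = np.random.default_rng(0)
N, F, M = 4, 3, 8
X = rng.standard_normal((N, F)) + 1j * rng.standard_normal((N, F))
w = rng.random((N, F))
posterior = np.zeros((N, F, M))
posterior[:, :, 3] = w
posterior[:, :, 4] = 1.0 - w
Y = separate(X, posterior, [3, 4])
```

Because the posteriors of the two active components sum to one at every time-frequency point in this example, the separated spectra add back up to the observed spectrum.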

As described above, with the model estimation device 100 and the sound source separation device 200 described in the first and second embodiments, effective sound sources can be extracted even when the number of sound sources is unknown, and good sound source separation can be achieved without causing the permutation problem between frequency components. The reasons are explained below.

- Why effective sound sources can be extracted
Equation (6), which represents the spectrum model, shows that the likelihood becomes larger when a small number of clusters have large variances. That is, equation (6) has the effect of explaining the observed signal with as few clusters as possible. As a result, only the mixing weights α_m for the indices m that correspond to effective sound sources take large values, while the mixing weights α_{m'} for the other indices m' become arbitrarily close to 0. This is what allows the effective sound sources to be extracted.

- Why the permutation problem does not arise
Maximizing the first term of the evaluation function, equation (16), can be interpreted as separation by phase difference clustering at each frequency, and maximizing the second term can be interpreted as clustering of spectral envelopes whose onsets and offsets are synchronized. That is, equation (16) is structured so that maximizing the second term essentially prevents a per-frequency permutation problem from arising while the first term performs the separation.

In each of the above embodiments, the amplitude |X_{n,f}| of the observed signal spectrum is used in place of the spectral shape a_{n,f,m}. Alternatively, the spectral shape may be treated as a time-independent parameter a_{f,m}, included in the model parameters θ, and calculated by the spectrum parameter updating unit 132a. In this case, the spectrum parameter updating unit 132a calculates the following equations (19) to (21).

Here, equation (20) imposes the constraint Σ_f a_{f,m} = 1 in order to resolve the scaling ambiguity between a_{f,m} and ρ_{n,m}.

Each of the above embodiments uses two microphones, that is, the phase difference between microphone #1 and microphone #2, A_{n,f} = arg[X_{n,f,1} / X_{n,f,2}], serves as the phase difference between microphones. Two or more microphones can also be used in general. In that case, the phase difference between microphones can be modeled by taking the phase differences A_{jj'n,f} = arg[X_{n,f,j} / X_{n,f,j'}] of the observed signals at microphone #j and microphone #j' for all microphone pairs and stacking them into a column vector. Equation (4) is then extended to multiple microphones to model the distribution of the phase difference between microphones for sound source m, and the phase difference parameter updating unit 132b calculates the corresponding parameter update for this extended model.
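Building the stacked vector of pairwise phase differences can be sketched as below. The array layout, the function name `pairwise_phase_differences`, and the choice to enumerate unordered pairs with j < j' are assumptions; the text only says that all microphone pairs are used. The sketch covers only the feature vector, not the extended probability model or its update.

```python
import numpy as np
from itertools import combinations

def pairwise_phase_differences(X):
    """Stack A = arg[X_j / X_j'] over microphone pairs.

    X : complex array, shape (N, F, J), spectrum at each of J microphones
    Returns (A, pairs): A has shape (N, F, P) with P = J*(J-1)/2, and
    pairs lists the (j, j') index pair behind each entry of the last axis.
    """
    J = X.shape[2]
    pairs = list(combinations(range(J), 2))
    # angle(X_j * conj(X_j')) equals arg[X_j / X_j'] for nonzero X_j'
    # and avoids dividing by zero.
    A = np.stack(
        [np.angle(X[:, :, j] * np.conj(X[:, :, k])) for j, k in pairs],
        axis=2,
    )
    return A, pairs

# Three microphones with known phases, identical in every time-frequency bin.
phases = np.array([0.3, 0.1, -0.2])
X = np.broadcast_to(np.exp(1j * phases), (2, 2, 3))
A, pairs = pairwise_phase_differences(X)
```

For J = 2 this reduces to the single phase difference A_{n,f} used in the embodiments.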

<Effect of the invention>
To confirm the effect of the present invention, sound source separation experiments were performed with a conventional method and with the method of the present invention. The number of sound sources and the number of microphones were both 2. The sampling frequency was 8 kHz, and the microphone spacings were 4 cm and 20 cm. For the method of the invention, the number of mixtures was set to M = 8. As the conventional method, the phase differences between microphones were clustered using the k-means method. The number of sound sources (= number of clusters) given to the k-means method was set to k = 8, the same as the number of mixtures in the method of the invention.

FIG. 6 plots the mask M_{n,f,m} = pm^e_{n,f} obtained with the method of the present invention under the assumed number of mixtures M = 8, for each of m = 1 to 8. FIG. 6 shows that, with the method of the present invention, the masks for two of the signals have large power. This result, together with equation (14), shows that the effective sound sources can be extracted.

FIG. 7 shows, for m = 4 and m = 5 in FIG. 6, the frequency characteristic of the phase difference probability model parameter μ_{f,m} (FIG. 7(a)) and the time characteristic of the spectrum probability model parameter ρ_{n,m} (FIG. 7(b)). FIG. 7(a) shows that a parameter μ_{f,m} with a linear phase characteristic is obtained. FIG. 7(b) shows that the spectral envelope of the signal is captured by the spectrum parameter ρ_{n,m}.

FIG. 8 shows the sound source separation performance, measured as signal to interference ratio (SIR) and signal to distortion ratio (SDR), evaluated for 20 combinations of speech signals and averaged. In FIG. 8, "k-means" indicates the performance of the conventional method and "proposed" indicates the performance of the method of the present invention. The method of the present invention achieves higher separation performance than the conventional method.

When the above model estimation device and sound source separation device are realized by a computer, the processing functions performed by the assignment control unit are described by a program. Executing this program on a personal computer or mobile terminal, through the exchange of data between the CPU and the input means and various storage means, makes hardware and software cooperate so that the above processing functions are realized on the computer and the effects of the model estimation device and sound source separation device of the present invention are obtained. In this case, at least part of the processing functions may be realized in hardware. The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed. Other modifications can be made as appropriate without departing from the spirit of the present invention.

Claims

A model estimation device for observing, with a plurality of microphones, signals from a plurality of sound sources that are mixed together and extracting each of the mixed signals, the device comprising:
a frequency domain conversion unit that converts the time domain observation signal at each microphone into a frequency domain observed signal spectrum;
a phase difference calculation unit that calculates a phase difference between the observed signal spectra at the microphones (hereinafter referred to as "phase difference between microphones"); and
a model estimation unit that sequentially applies the observed signal spectrum to a spectrum probability model indicating a spectrum distribution and the phase difference between microphones to a phase difference probability model indicating a phase difference distribution, and calculates model parameters of each probability model suitable for signal extraction by using a predetermined evaluation function for evaluating each probability model.

The model estimation device according to claim 1,
wherein the evaluation function gives a higher evaluation of suitability for signal extraction as the strengths of the frequency components of each signal are more synchronized.

The model estimation device according to claim 1 or 2,
wherein the model estimation unit includes:
a posterior probability calculation unit that uses the observed signal spectrum, the phase difference between microphones, and the model parameters of the phase difference probability model, the model parameters of the spectrum probability model, and the existence probability of each sound source (hereinafter referred to as "mixing weight") stored in a parameter holding unit to calculate the probability that the observed signal spectrum and the phase difference between microphones at each time-frequency point are due to the signal from each sound source (hereinafter referred to as "posterior probability");
a parameter update unit including spectrum parameter updating means for updating the model parameters of the spectrum probability model using the posterior probability, phase difference parameter updating means for updating the model parameters of the phase difference probability model using the posterior probability, and mixing weight updating means for updating the mixing weight using the posterior probability; and
the parameter holding unit, which stores the model parameters and the mixing weight updated by the parameter update unit.

A sound source separation device comprising:
the model estimation device according to any one of claims 1 to 3;
a signal separation unit that extracts effective sound sources based on the updated mixing weights, creates a mask for each effective sound source using the corresponding updated posterior probability, and uses the masks to generate separated signals in which the observed signal spectrum is separated for each effective sound source; and
a time domain conversion unit that converts the separated signal for each effective sound source into a time domain signal.

A model estimation method for observing, with a plurality of microphones, signals from a plurality of sound sources that are mixed together and extracting each of the mixed signals, the method comprising:
a frequency domain conversion step of converting the time domain observation signal at each microphone into a frequency domain observed signal spectrum;
a phase difference calculation step of calculating a phase difference between the observed signal spectra at the microphones; and
a model estimation step of sequentially applying the observed signal spectrum to a spectrum probability model indicating a spectrum distribution and the phase difference between the observed signal spectra to a phase difference probability model indicating a phase difference distribution, and calculating model parameters of each probability model suitable for signal extraction by using an evaluation function for evaluating each probability model.

The model estimation method according to claim 5,
wherein the evaluation function gives a higher evaluation of suitability for signal extraction as the strengths of the frequency components of each signal are more synchronized.

The model estimation method according to claim 5 or 6,
wherein, in the model estimation step, the following steps are repeatedly executed a predetermined number of times or until the values of the model parameters and the mixing weight converge:
a posterior probability calculation step of using the observed signal spectrum, the phase difference between the observed signal spectra, the model parameters of the phase difference probability model, the model parameters of the spectrum probability model, and the existence probability of each sound source (hereinafter referred to as "mixing weight") to calculate the probability that the observed signal spectrum and the phase difference between the observed signal spectra at each time-frequency point are due to the signal from each sound source (hereinafter referred to as "posterior probability");
a parameter update step of performing a spectrum parameter update substep of updating the model parameters of the spectrum probability model using the posterior probability, a phase difference parameter update substep of updating the model parameters of the phase difference probability model using the posterior probability, and a mixing weight update substep of updating the mixing weight using the posterior probability; and
a parameter holding step of storing, in a parameter holding unit, the model parameters and the mixing weight updated in the parameter update step.

A sound source separation method that performs:
the model estimation method according to any one of claims 5 to 7;
a signal separation step of extracting effective sound sources based on the updated mixing weights, creating a mask for each effective sound source using the corresponding updated posterior probability, and using the masks to generate separated signals in which the observed signal spectrum is separated for each effective sound source; and
a time domain conversion step of converting the separated signal for each effective sound source into a time domain signal.

A program for causing a computer to execute the method according to claim 5.