JP2013097176A

JP2013097176A - Sound source separation device, sound source separation method, and program

Info

Publication number: JP2013097176A
Application number: JP2011240054A
Authority: JP
Inventors: Akiko Araki; 章子荒木; Tomohiro Nakatani; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-11-01
Filing date: 2011-11-01
Publication date: 2013-05-20
Anticipated expiration: 2031-11-01
Also published as: JP5726709B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound source separation technique that estimates the arrival time difference δ efficiently and eliminates the exhaustive search, thereby speeding up estimation. SOLUTION: An arrival time difference δ_{f,k}, a noise power spectrum σ_f^2, a sound source spectrum s_{f,t,k}, and a sound source existence probability p(k|θ) are estimated from a frequency-domain observation signal x_{f,t} = [x_{f,t,L}, x_{f,t,R}]^T. The arrival time difference is estimated in accordance with a formula, where: m_{f,t,k} is the posterior probability indicating the expected attribution of the observation signal x_{f,t} to sound source k; D is the distance between the two microphones; c is the propagation speed of the original signal; φ_f = sinc(2πfD/c); ξ_{f,t,k} = x_{f,t,R} − φ_f (x_{f,t,L} − s_{f,t,k}); ψ_{sk} and ψ_{ξk} are the phases of the spectra s_{f,t,k} and ξ_{f,t,k}, respectively; and δ_{f,k} is the arrival time difference. The ±π ambiguity contained in the estimated arrival time difference δ_{f,k} is then corrected.

Description

The present invention belongs to the technical field of signal processing. In particular, it relates to sound source separation: when a plurality of original signals are mixed together with noise and observed with two microphones, each original signal is estimated from the observed signals and separately extracted. More specifically, it belongs to blind sound source separation, which estimates each original signal solely from observation signals containing the mixed original signals and noise, without using any information about the original signals or about how they were mixed.

Non-Patent Document 1 is known as prior art for sound source separation. In Non-Patent Document 1, the arrival time difference δ_k, between the two microphones, of the original signal emitted from sound source k is estimated by equation (1).

Non-Patent Document 1: Yosuke Izumi, Nobutaka Ono, Shigeki Sagayama, "Underdetermined Blind Source Separation in Noisy and Reverberant Environments Based on a Sparse Mixture Model," Proceedings of the IEICE General Conference, March 2008.

However, because the prior art gives no analytical update formula for the arrival time difference δ_k, δ_k must be estimated by an exhaustive search, which is computationally expensive.

An object of the present invention is to provide a fast sound source separation technique that, by focusing on the properties of the arrival time difference δ, gives a method for estimating δ efficiently and eliminates the exhaustive search required in the prior art.

To solve the above problem, according to a first aspect of the present invention, each original signal is separately extracted from the observed signals in a situation where a plurality of original signals are mixed together with noise and observed with two microphones. Let k be the index of the sound source of an original signal. From the frequency-domain observation signal x_{f,t} = [x_{f,t,L}, x_{f,t,R}]^T, the arrival time difference δ_{f,k} of the original signal between the two microphones, the noise power spectrum σ_f^2, the original signal spectrum s_{f,t,k}, and the sound source existence probability p(k|θ) are estimated. A separated signal y_{f,t,k} is generated from the observation signal x_{f,t} and the estimated parameters θ = {δ_{f,k}, σ_f^2, s_{f,t,k}, p(k|θ)}. Let m_{f,t,k} be the posterior probability indicating the expected attribution of the observation signal x_{f,t} to sound source k, D the distance between the two microphones, c the propagation speed of the original signal, φ_f = sinc(2πfD/c), ξ_{f,t,k} = x_{f,t,R} − φ_f (x_{f,t,L} − s_{f,t,k}), and ψ_{sk} and ψ_{ξk} the phases of the spectra s_{f,t,k} and ξ_{f,t,k}, respectively. The arrival time difference δ_{f,k} is then estimated by a formula in these quantities (equation (13) of the embodiment described below), and the ±π ambiguity contained in the estimated δ_{f,k} is corrected.

The present invention provides a method for efficiently estimating the arrival time difference δ and thereby enables fast sound source separation.

FIG. 1 is a functional block diagram of the sound source separation device according to the first embodiment.
FIG. 2 shows the processing flow of the sound source separation device according to the first embodiment.
FIG. 3 is a functional block diagram of the E-step calculation unit.
FIG. 4 is a functional block diagram of the M-step calculation unit.
FIG. 5 shows the processing flow of the parameter estimation unit.
FIG. 6 shows the environment in which the simulation was performed.
FIG. 7A shows the simulation results for two sound sources, and FIG. 7B shows the simulation results for three sound sources.
FIG. 8 is a functional block diagram of the M-step calculation unit when the processing of the sound source spectrum estimation unit is performed before that of the time difference estimation unit.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. Unless otherwise noted, processing described for a single element of a vector or matrix is applied to all elements of that vector or matrix.

<Sound source separation device 1 according to the first embodiment>
FIG. 1 is a functional block diagram of the sound source separation device 1 according to this embodiment, and FIG. 2 shows its processing flow.

The sound source separation device 1 includes a frequency domain conversion unit 11, a parameter estimation unit 12, a separated signal generation unit 13, and a time domain conversion unit 14. The parameter estimation unit 12 in turn includes an E-step calculation unit 121 and an M-step calculation unit 122.

First, the observation signal is described. There are K signal sources (hereinafter also called "sound sources"); k (k = 1, 2, ..., K) is the sound source index, and the signal emitted from sound source k (hereinafter the "original signal") is s_k(t). In a situation where the original signals s_1(t), ..., s_K(t) are observed together with noise by two microphones L and R, let x_L(t) be the time-domain observation signal at microphone L, x_R(t) the time-domain observation signal at microphone R, and x(t) = [x_L(t), x_R(t)]^T the time-domain observation signal of the two microphones. "^T" denotes transposition. t denotes a frame number and the time corresponding to that frame number; with T the total number of time frames, t = 0, 1, ..., T−1. The frequency-domain observation signal is written x_{f,t} = [x_{f,t,L}, x_{f,t,R}]^T. Here f is one of the discrete points that divide the sampling frequency f_s into F equal parts, f ∈ {0, f_s/F, ..., (F−1)f_s/F}. Hereinafter, unless otherwise noted, "observation signal" means the frequency-domain observation signal x_{f,t}; time-domain observation signals are identified explicitly.

The observation signal is assumed to be expressed by equation (2), in which s_{f,t,k} is the spectrum of the original signal s_k(t) (hereinafter also the "sound source spectrum") and n_{f,t} = [n_{f,t,L}, n_{f,t,R}]^T is the additive noise at microphones L and R. b_{f,k} = [b_{f,k,L}, b_{f,k,R}]^T is the steering vector for sound source k (a vector that specifies the direction of sound source k, also called a direction vector); with δ_k the arrival time difference of the original signal s_k(t) between microphones L and R, it is given by equation (3) (see Non-Patent Document 1).

In this embodiment, the sound sources and the microphones are assumed to be fixed during the observation time of the original signals, and all K sound sources are assumed to be located at different positions. That is, the steering vector b_{f,k} is assumed to be independent of time t and to take a different value for each k. The purpose of sound source separation is to estimate all the sound source spectra s_{f,t,k} using only the observation signals x_{f,t}.
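The observation model can be illustrated with a short sketch. The equation images for (2) and (3) are not reproduced in this text, so the forms used here are assumptions inferred from the later discussion (which uses x_{f,t,L} = s_{f,t,k} and x_{f,t,R} = e^{j2πfδ} s_{f,t,k} in the noise-free single-source case): microphone L is taken as the reference, and the steering vector is [1, e^{j2πfδ}]^T. The function names are hypothetical.

```python
import cmath
import math

def steering_vector(f_hz, delta_s):
    """Assumed steering vector [b_L, b_R] with microphone L as the reference."""
    return [1.0 + 0.0j, cmath.exp(1j * 2.0 * math.pi * f_hz * delta_s)]

def observe(sources, deltas, f_hz, noise=(0j, 0j)):
    """x_{f,t} = sum_k b_{f,k} * s_{f,t,k} + n_{f,t} at one time-frequency point."""
    x = [noise[0], noise[1]]
    for s, delta in zip(sources, deltas):
        b = steering_vector(f_hz, delta)
        x[0] += b[0] * s
        x[1] += b[1] * s
    return x

# One active source and no noise: the inter-channel phase difference
# equals 2*pi*f*delta.
x = observe(sources=[0.5 + 0.2j], deltas=[1.0e-4], f_hz=1000.0)
phase_diff = cmath.phase(x[1] / x[0])
expected = 2.0 * math.pi * 1000.0 * 1.0e-4
print(phase_diff, expected)
```

With a single source and no noise, the two printed values agree, which is the relationship the estimation of δ relies on.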

The sound source separation device 1 according to this embodiment receives the time-domain observation signal x(t) = [x_L(t), x_R(t)]^T as input and outputs the time-domain separated signals (the estimated original signals) y_k(t). The processing of each unit is described below.

<Frequency domain conversion unit 11>
First, the frequency domain conversion unit 11 receives the time-domain observation signal x(t) = [x_L(t), x_R(t)]^T picked up by microphones L and R, converts it into the frequency-domain observation signal x_{f,t} = [x_{f,t,L}, x_{f,t,R}]^T by a short-time Fourier transform or the like (s1), and outputs the result to the parameter estimation unit 12 and the separated signal generation unit 13.
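As a rough illustration of step s1, the following sketch computes a short-time Fourier transform of one channel using only the standard library. The Hann window, the frame length, and the hop size are assumptions chosen for the example; the text only says "a short-time Fourier transform or the like".

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Return spectra[t][f] for Hann-windowed frames of a real signal."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Plain DFT of the windowed frame (O(N^2), fine for a small demo).
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * f * n / frame_len)
                for n in range(frame_len))
            for f in range(frame_len)
        ]
        spectra.append(spectrum)
    return spectra

fs = 8000  # assumed sampling frequency in Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(256)]
X_L = stft(tone, frame_len=64, hop=32)
peak_bin = max(range(32), key=lambda f: abs(X_L[0][f]))
print(len(X_L), peak_bin)
```

The same function would be applied to x_L(t) and x_R(t) separately to obtain x_{f,t,L} and x_{f,t,R}; a production implementation would use an FFT rather than the direct DFT shown here.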

<Parameter estimation unit 12>
Next, the parameter estimation unit 12 estimates, from the observation signal x_{f,t}, the parameters θ needed for sound source separation (s2). Here θ = {δ_{f,k}, σ_f^2, s_{f,t,k}, p(k|θ)}, where δ_{f,k} is the arrival time difference of the original signal between the two microphones, σ_f^2 is the noise power spectrum, s_{f,t,k} is the sound source spectrum, and p(k|θ) is the sound source existence probability (the contribution rate of sound source k in the mixed signal).

In this embodiment, the following two assumptions are used to estimate these parameters.

(Assumption 1) The noise n can be modeled as stationary noise following a normal distribution with mean 0 and covariance matrix σ_f^2 V_f. Here σ_f^2 is the noise power at frequency f, and V_f is given, for example in the case of diffuse noise, by equation (4) (see Non-Patent Document 1), where c is the propagation speed of the original signal (e.g., the speed of sound) and D is the distance between the two microphones.
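For the diffuse-noise case, a minimal sketch of V_f follows. The image of equation (4) is not reproduced in this text, so the matrix form below is an assumption: the usual two-microphone diffuse-field coherence matrix with off-diagonal element φ_f = sinc(2πfD/c), where sinc(x) = sin(x)/x. The function names and the default speed of 340 m/s are hypothetical.

```python
import math

def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def diffuse_noise_coherence(f_hz, mic_distance_m, speed_m_s=340.0):
    """Assumed V_f = [[1, phi_f], [phi_f, 1]] with phi_f = sinc(2*pi*f*D/c)."""
    phi = sinc(2.0 * math.pi * f_hz * mic_distance_m / speed_m_s)
    return [[1.0, phi], [phi, 1.0]]

V = diffuse_noise_coherence(f_hz=1000.0, mic_distance_m=0.04)
print(V)
```

Under this assumed form, φ_f approaches 1 at low frequencies or small spacings (the noise at the two microphones is almost fully correlated) and decays toward 0 as fD/c grows.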

(Assumption 2) The original signals are sparse, that is, signals in which the nonzero components are few or, in other words, in which most (or many) components are zero. Specifically, at any time-frequency point (f, t), at most one original signal is assumed to be dominant. Under (Assumption 2), equation (2) can be written as equation (5) (see Non-Patent Document 1).

Based on (Assumption 1) and (Assumption 2), the likelihood that x_{f,t} is observed when the original signal arrives from sound source k located in the direction corresponding to the arrival time difference δ_{f,k} is given by equation (6) (see Non-Patent Document 1). "^H" denotes the Hermitian transpose, and b_{f,k} is given by equation (3). The arrival time difference δ_{f,k}, the sound source spectrum s_{f,t,k}, and the noise power spectrum σ_f^2 are then obtained by maximum likelihood estimation as the parameters that maximize the log-likelihood function (see Non-Patent Document 1). In the log-likelihood function, p(k | δ_{f,k}, σ_f^2, s_{f,t,k}) denotes the sound source existence probability, that is, the contribution rate of sound source k in the mixed signal, and satisfies the accompanying equation (see Non-Patent Document 1).

Specifically, the expectation-maximization method (hereinafter also the "EM algorithm") is applied, and the parameters θ = {δ_{f,k}, σ_f^2, s_{f,t,k}, p(k|θ)} that maximize the Q function of equation (10) are obtained by repeating the E step and the M step described below (see Non-Patent Document 1). Here m_{f,t,k} is given by equation (11) below, p(x_{f,t} | k, θ') is given by equation (6) above, and θ' denotes the parameters obtained in the previous iteration.

This iterative calculation is performed by the parameter estimation unit 12, whose processing is described in detail below.

FIG. 3 is a functional block diagram of the E-step calculation unit 121, FIG. 4 is a functional block diagram of the M-step calculation unit 122, and FIG. 5 shows the processing flow of the parameter estimation unit 12.

First, the parameter estimation unit 12 initializes the parameters (s20). Among the parameters, σ_f^2, δ_{f,k}, and p(k|θ) are initialized first, for example as σ_f^2 = |x_{f,t=0,L}|^2, δ_{f,k} = (D/c) cos α_k, and p(k|θ) = 1/K, where c is the propagation speed of the original signal (e.g., the speed of sound), D is the microphone spacing, and α_k is the initial value of the direction of sound source k (a suitable value between −π/2 and π/2). Then s_{f,t,k} is initialized from the initialized δ_{f,k} and x_{f,t=0}, based on equation (3) and on equation (19) described later. The update count n is set to 0. The maximum number of updates N and the convergence threshold Δ are assumed to be set in advance by the designer or user of the device.
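The example initial values in step s20 can be sketched as follows. The function name, the argument layout, and the sample numbers are hypothetical; the initialization of s_{f,t,k} through equations (3) and (19) is left out because those equations are not reproduced in this text.

```python
import math

def initialize(alphas, mic_distance_m, speed_m_s, x_L_first_frame):
    """Initial delta_k, p(k|theta), and sigma_f^2 following the example values."""
    K = len(alphas)
    # delta_k = (D / c) * cos(alpha_k), shared by all frequencies at the start.
    delta = [(mic_distance_m / speed_m_s) * math.cos(a) for a in alphas]
    # Uniform source existence probability.
    prior = [1.0 / K] * K
    # sigma_f^2 = |x_{f,t=0,L}|^2, one value per frequency bin.
    sigma2 = [abs(x) ** 2 for x in x_L_first_frame]
    return delta, prior, sigma2

delta0, prior0, sigma2_0 = initialize(
    alphas=[0.2, 1.0], mic_distance_m=0.04, speed_m_s=340.0,
    x_L_first_frame=[0.3 + 0.4j, 1.0 + 0.0j])
print(delta0, prior0, sigma2_0)
```

Because cos is an even function, α_k and −α_k give the same initial δ_{f,k}, so initial directions that differ only in sign would start two sources at the same time difference.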

The E-step calculation unit 121 includes a posterior probability estimation unit 1211 and a Q-function calculation unit 1212 (see FIG. 3). The M-step calculation unit 122 includes a time difference estimation unit 1221, a sound source spectrum estimation unit 1222, a noise power estimation unit 1223, and a sound source existence probability estimation unit 1224; the time difference estimation unit 1221 further includes an arctangent calculation unit 12211 and a time correction unit 12212 (see FIG. 4). The processing in the posterior probability estimation unit 1211 and the Q-function calculation unit 1212 is collectively called the E step, and the processing in the time difference estimation unit 1221, the sound source spectrum estimation unit 1222, the noise power estimation unit 1223, and the sound source existence probability estimation unit 1224 is collectively called the M step.

(Posterior probability estimation unit 1211)
The posterior probability estimation unit 1211 of the E-step calculation unit 121 receives the observation signal x_{f,t} and the parameters θ' obtained in the previous iteration (in the first posterior probability estimation, where no previous iteration exists, the initialized parameters described above). Using these values, it obtains the posterior probability m_{f,t,k} = p(k | x_{f,t}, θ') by equation (11) (s22) and outputs it to the Q-function calculation unit 1212 and the M-step calculation unit 122.

Here p(x_{f,t} | k, θ') is given by equation (6). The posterior probability m_{f,t,k} is the posterior probability that the observation signal x_{f,t} belongs to sound source k.
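The E-step posterior can be sketched with Bayes' rule. Equations (6) and (11) are not reproduced in this text, so the likelihood below is an assumption: a two-channel circular complex Gaussian with mean b_{f,k} s_{f,t,k} and covariance σ_f^2 V_f, with the steering vector assumed to be [1, e^{j2πfδ}]^T. All function names and sample values are hypothetical.

```python
import cmath
import math

def quad_form(r, V):
    """r^H V^{-1} r for a complex 2-vector r and a real symmetric 2x2 V."""
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    y0 = (V[1][1] * r[0] - V[0][1] * r[1]) / det
    y1 = (-V[1][0] * r[0] + V[0][0] * r[1]) / det
    return (r[0].conjugate() * y0 + r[1].conjugate() * y1).real

def likelihood(x, s, delta, f_hz, sigma2, V):
    """Assumed p(x | k, theta): complex Gaussian around b * s."""
    b_R = cmath.exp(2j * math.pi * f_hz * delta)
    r = [x[0] - s, x[1] - b_R * s]
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    norm = (math.pi ** 2) * (sigma2 ** 2) * det
    return math.exp(-quad_form(r, V) / sigma2) / norm

def posterior(x, s_list, delta_list, prior, f_hz, sigma2, V):
    """m_k = p(k) p(x|k) / sum_k' p(k') p(x|k') at one time-frequency point."""
    joint = [p * likelihood(x, s, d, f_hz, sigma2, V)
             for p, s, d in zip(prior, s_list, delta_list)]
    total = sum(joint)
    return [j / total for j in joint]

f_hz = 1000.0
x = [1.0 + 0j, cmath.exp(2j * math.pi * f_hz * 1.0e-4)]  # generated by source 0
m = posterior(x, s_list=[1.0 + 0j, 1.0 + 0j], delta_list=[1.0e-4, -1.0e-4],
              prior=[0.5, 0.5], f_hz=f_hz, sigma2=0.1,
              V=[[1.0, 0.5], [0.5, 1.0]])
print(m)
```

In this example the observation matches source 0 exactly, so almost all of the posterior mass goes to k = 0. A practical implementation would work with log-likelihoods to avoid underflow when every likelihood is very small.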

(Arctangent calculation unit 12211)
The arctangent calculation unit 12211 of the time difference estimation unit 1221 in the M-step calculation unit 122 receives the observation signal x_{f,t}, the posterior probability m_{f,t,k}, and the parameters θ' obtained in the previous iteration (more precisely, the sound source spectrum s_{f,t,k} contained in θ'; in the first arctangent calculation, where no previous iteration exists, the initialized parameters described above). Using these values, it estimates the arrival time difference δ_{f,k} (more precisely, the value 2πfδ_{f,k}, obtained by multiplying δ_{f,k} by 2πf) by equation (13) (s25) and outputs it to the time correction unit 12212.

ψ_{sk} (the subscript sk stands for s_k) and ψ_{ξk} (the subscript ξk stands for ξ_k) denote the phases of s_{f,t,k} and ξ_{f,t,k}, respectively. The derivation of equation (13) is described later.

In the prior art, δ_k is estimated by a discrete exhaustive search based on equation (1), which is computationally expensive. In this embodiment, δ_{f,k} is computed from equation (13), so no exhaustive search is needed and the computational cost is reduced. This embodiment also differs from the prior art in that the arrival time difference δ_{f,k} is obtained for each frequency f based on equation (13).
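The search-free estimate can be sketched as a weighted circular mean of the phase difference ψ_{ξk} − ψ_{sk}. Equation (13) itself is not reproduced in this text, so the weighting m_{f,t,k} |ξ_{f,t,k}| |s_{f,t,k}| used below is an assumption, motivated by the von Mises correspondence described later in the document; the function names are hypothetical.

```python
import cmath
import math

def xi(x_L, x_R, s, phi):
    """xi_{f,t,k} = x_{f,t,R} - phi_f * (x_{f,t,L} - s_{f,t,k})."""
    return x_R - phi * (x_L - s)

def raw_phase_estimate(x_L, x_R, s, m, phi):
    """Return (theta_raw, sin_sum, cos_sum) for one (f, k) over all frames t.

    theta_raw approximates 2*pi*f*delta but lies only in [-pi/2, pi/2],
    because a plain arctangent is used.
    """
    sin_sum = 0.0
    cos_sum = 0.0
    for xl, xr, sk, mk in zip(x_L, x_R, s, m):
        z = xi(xl, xr, sk, phi)
        weight = mk * abs(z) * abs(sk)          # assumed weighting
        diff = cmath.phase(z) - cmath.phase(sk)  # psi_xi - psi_s
        sin_sum += weight * math.sin(diff)
        cos_sum += weight * math.cos(diff)
    if cos_sum == 0.0:
        return math.copysign(math.pi / 2.0, sin_sum), sin_sum, cos_sum
    return math.atan(sin_sum / cos_sum), sin_sum, cos_sum

# Noise-free frames generated with 2*pi*f*delta = 0.5 rad.
s_frames = [1.0 + 0j, 0.5j, -0.3 + 0.2j]
x_L_frames = list(s_frames)
x_R_frames = [cmath.exp(0.5j) * sk for sk in s_frames]
theta_raw, sin_sum, cos_sum = raw_phase_estimate(
    x_L_frames, x_R_frames, s_frames, m=[1.0, 0.8, 0.6], phi=0.3)
print(theta_raw)
```

The cost is one pass over the frames per (f, k), with no grid of candidate delays to evaluate.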

(Time correction unit 12212)
The time correction unit 12212 of the time difference estimation unit 1221 in the M-step calculation unit 122 receives the observation signal x_{f,t}, the posterior probability m_{f,t,k}, and the parameters θ' obtained in the previous iteration (in the first time correction, where no previous iteration exists, the initialized parameters described above). Using these values, it corrects the ±π ambiguity contained in the arrival time difference δ_{f,k} estimated by equation (13) (s26). This correction is needed because the left-hand side 2πfδ_{f,k} of equation (13) takes values from −π to π, whereas the arctangent on the right-hand side can only return values from −π/2 to π/2; without the correction, values of 2πfδ_{f,k} in the ranges −π to −π/2 and π/2 to π could not be obtained correctly. Writing the value obtained by equation (13) as 2πfδ'_{f,k}, the correction is performed using equations (17) and (18).

Here ξ_{f,t,k} and φ_f are given by equations (14) and (15), respectively, and ψ_{sk} (the subscript sk stands for s_k) and ψ_{ξk} (the subscript ξk stands for ξ_k) denote the phases of s_{f,t,k} and ξ_{f,t,k}, respectively. The derivation of equations (17) and (18) is described later.

The time correction unit 12212 then divides the corrected value 2πfδ_{f,k} by 2πf to obtain the arrival time difference δ_{f,k}, and outputs it to the sound source spectrum estimation unit 1222, the noise power estimation unit 1223, and the Q-function calculation unit 1212.
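The ±π correction can be sketched as a quadrant fix applied to the arctangent output. Equations (17) and (18) are not reproduced in this text, so the rule below is an assumption: it shifts the raw value by ±π whenever the cosine-weighted sum is negative, which makes the result agree with a four-quadrant arctangent. The inputs sin_sum and cos_sum stand for the numerator and denominator of the arctangent argument; the function names are hypothetical.

```python
import math

def correct_quadrant(theta_raw, sin_sum, cos_sum):
    """Extend an arctangent result from (-pi/2, pi/2) to (-pi, pi]."""
    if cos_sum < 0.0:
        # The true angle lies in the left half-plane; shift by +pi or -pi.
        return theta_raw + math.pi if sin_sum >= 0.0 else theta_raw - math.pi
    return theta_raw

def arrival_time_difference(sin_sum, cos_sum, f_hz):
    """delta = corrected(2*pi*f*delta') / (2*pi*f); assumes f > 0, cos_sum != 0."""
    theta_raw = math.atan(sin_sum / cos_sum)
    theta = correct_quadrant(theta_raw, sin_sum, cos_sum)
    return theta / (2.0 * math.pi * f_hz)

f_hz = 1000.0
angles = [-3.0, -2.0, -1.0, 0.5, 2.0, 3.0]
recovered = [arrival_time_difference(math.sin(a), math.cos(a), f_hz)
             * 2.0 * math.pi * f_hz for a in angles]
print(recovered)
```

In floating-point code the same effect is usually obtained in one step with `math.atan2(sin_sum, cos_sum)`, which also handles `cos_sum == 0`.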

(Sound source spectrum estimation unit 1222)
The sound source spectrum estimation unit 1222 of the M-step calculation unit 122 receives the arrival time difference δ_{f,k} and the observation signal x_{f,t}, estimates the sound source spectrum s_{f,t,k} from them by equation (19) (s27), and outputs it to the noise power estimation unit 1223 and the Q-function calculation unit 1212.

Here b_{f,k} and V_f are given by equations (3) and (4), respectively.
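As a hedged illustration of step s27: equation (19) is not reproduced in this text, but under a Gaussian noise model with covariance σ_f^2 V_f, maximizing the likelihood over s gives s = (b^H V^{-1} x) / (b^H V^{-1} b). The sketch below implements that expression, assuming the steering vector [1, e^{j2πfδ}]^T; whether it matches equation (19) term for term is an assumption, and the function name is hypothetical.

```python
import cmath
import math

def estimate_source_spectrum(x, delta, f_hz, V):
    """Weighted least-squares estimate s = (b^H V^-1 x) / (b^H V^-1 b)."""
    b = [1.0 + 0.0j, cmath.exp(2j * math.pi * f_hz * delta)]
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    # w = V^{-1} b for a real symmetric 2x2 V.
    w = [(V[1][1] * b[0] - V[0][1] * b[1]) / det,
         (-V[1][0] * b[0] + V[0][0] * b[1]) / det]
    # b^H V^{-1} x equals (V^{-1} b)^H x because V^{-1} is real symmetric.
    num = w[0].conjugate() * x[0] + w[1].conjugate() * x[1]
    den = (w[0].conjugate() * b[0] + w[1].conjugate() * b[1]).real
    return num / den

f_hz, delta = 1000.0, 1.0e-4
s_true = 0.7 - 0.2j
b_R = cmath.exp(2j * math.pi * f_hz * delta)
x = [s_true, b_R * s_true]                      # noise-free observation
s_hat = estimate_source_spectrum(x, delta, f_hz, V=[[1.0, 0.5], [0.5, 1.0]])
print(s_hat)
```

For a noise-free observation generated with the same δ, the estimate reproduces the source spectrum exactly (up to rounding).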

(Noise power estimation unit 1223)
The noise power estimation unit 1223 of the M-step calculation unit 122 receives the arrival time difference δ_{f,k}, the sound source spectrum s_{f,t,k}, the observation signal x_{f,t}, and the posterior probability m_{f,t,k}, estimates the noise power spectrum σ_f^2 from them by equation (20) (s28), and outputs it to the Q-function calculation unit 1212.

Here T is the total number of time frames.
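A possible form of the noise power update is sketched below. Equation (20) is not reproduced in this text; the expression used here, σ_f^2 = Σ_t Σ_k m_{f,t,k} (x − b s)^H V^{-1} (x − b s) / (2T), is what maximizing a posterior-weighted two-channel complex Gaussian log-likelihood gives when the posteriors sum to 1 over k, and is offered as an assumption. The steering vector [1, e^{j2πfδ}]^T and all names are likewise assumptions.

```python
import cmath
import math

def weighted_residual(x, s, delta, f_hz, V):
    """(x - b s)^H V^{-1} (x - b s) for one source at one time-frequency point."""
    b_R = cmath.exp(2j * math.pi * f_hz * delta)
    r0, r1 = x[0] - s, x[1] - b_R * s
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    y0 = (V[1][1] * r0 - V[0][1] * r1) / det
    y1 = (-V[1][0] * r0 + V[0][0] * r1) / det
    return (r0.conjugate() * y0 + r1.conjugate() * y1).real

def estimate_noise_power(x_frames, s_frames, m_frames, deltas, f_hz, V):
    """Assumed update: sum_t sum_k m * residual / (2 T); 2 = number of channels."""
    total = 0.0
    for x, s_t, m_t in zip(x_frames, s_frames, m_frames):
        for s, m, delta in zip(s_t, m_t, deltas):
            total += m * weighted_residual(x, s, delta, f_hz, V)
    return total / (2.0 * len(x_frames))

# Four identical frames: one source s = 1 with delta = 0, plus a fixed
# perturbation of magnitude 0.1 on each channel.
x_frames = [[1.1 + 0.0j, 1.0 + 0.1j]] * 4
s_frames = [[1.0 + 0.0j]] * 4
m_frames = [[1.0]] * 4
sigma2 = estimate_noise_power(x_frames, s_frames, m_frames, deltas=[0.0],
                              f_hz=1000.0, V=[[1.0, 0.0], [0.0, 1.0]])
print(sigma2)
```

With an identity V_f the result is the average per-channel residual power, here 0.01.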

(Sound source existence probability estimation unit 1224)
The sound source existence probability estimation unit 1224 of the M-step calculation unit 122 receives the posterior probability m_{f,t,k}, estimates the sound source existence probability by equation (21) (s29), and outputs it to the Q-function calculation unit 1212.

(Q-function calculation unit 1212)
The Q-function calculation unit 1212 of the E-step calculation unit 121 receives the observation signal x_{f,t}, the parameters θ = {δ_{f,k}, σ_f^2, s_{f,t,k}, p(k|θ)}, and the posterior probability m_{f,t,k}, and uses them to obtain the value of the Q function by equation (10) above (s30).

After the processing of s20 to s30, the parameter estimation unit 12 determines whether (Condition 1) the change in the value of the Q function, |Q(θ|θ^{n−1}) − Q(θ|θ^n)|, has become smaller than the predetermined convergence threshold Δ, or (Condition 2) the update count n has reached the predetermined maximum number of updates N (for example, N = 20) (s31).

When either (Condition 1) or (Condition 2) is satisfied, the parameter estimation unit 12 outputs the latest posterior probability m_{f,t,k} and arrival time difference δ_{f,k} obtained up to that point to the separated signal generation unit 13.

When neither (Condition 1) nor (Condition 2) is satisfied, the parameter estimation unit 12 repeats the E step and the M step (s31, s21). The parameters θ and the Q-function value Q(θ|θ^n) are stored in a storage unit (not shown) and used in the next iteration.
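The stopping logic of s31 can be sketched as a generic loop. The callables `e_step`, `m_step`, and `q_value` are hypothetical placeholders for the processing of units 1211, 1221 to 1224, and 1212; the toy functions at the bottom exist only to make the sketch runnable and have nothing to do with the actual update equations.

```python
def run_em(e_step, m_step, q_value, theta0, max_updates=20, tol=1e-6):
    """Repeat E and M steps until |Q_prev - Q| < tol or n >= max_updates."""
    theta, q_prev, n = theta0, None, 0
    while True:
        m = e_step(theta)            # posterior m_{f,t,k} from theta'
        theta = m_step(m, theta)     # updated parameters
        q = q_value(m, theta)        # Q-function value for this iteration
        n += 1
        converged = q_prev is not None and abs(q_prev - q) < tol  # Condition 1
        if converged or n >= max_updates:                         # Condition 2
            return theta, m, n
        q_prev = q                   # stored for the next iteration

# Toy stand-ins: theta is halved each round and Q = -theta^2.
theta, m, n = run_em(e_step=lambda th: None,
                     m_step=lambda m, th: th / 2.0,
                     q_value=lambda m, th: -th * th,
                     theta0=1.0)
print(theta, n)
```

In the toy run the change in Q falls below the tolerance before the cap of 20 updates is reached, so Condition 1 ends the loop; setting `tol=0.0` would leave only Condition 2.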

<Separated signal generation unit 13>
The separated signal generation unit 13 receives the posterior probability m_{f,t,k}, the arrival time difference δ_{f,k}, and the observation signal x_{f,t}, generates the separated signal y_{f,t,k} by equation (22) (s3), and outputs it to the time domain conversion unit 14.

The sound source spectrum s_{f,t,k} is an estimate of the spectrum of the original signal emitted by the sound source, but if this spectrum is simply converted into a time-domain signal, original signals from other sound sources may remain. Equation (22) makes it possible to extract and separate only the original signal emitted by sound source k.

<Time domain conversion unit 14>
The time domain conversion unit 14 receives the separated signal y_{f,t,k}, converts it into the time-domain separated signal y_k(t) by the time-domain conversion method that corresponds to the frequency-domain conversion used in the frequency domain conversion unit 11 (for example, an inverse short-time Fourier transform) (s4), and outputs it as the output of the sound source separation device 1.

<Key points of this embodiment>
The key points of this embodiment and the derivation of equations (13), (17), and (18) are described below.

This embodiment estimates the arrival time difference δ_{f,k} by exploiting two properties:
(Property 1) the arrival time difference δ_{f,k} is a value that affects the phase difference between the R channel and the L channel;
(Property 2) a phase difference is a quantity that takes periodic values.
(Property 1) is evident from equation (2) when the noise n_{f,t} is sufficiently small. (Property 2) means that the difference Θ between two phase quantities is not confined to the range −π ≤ Θ < π but carries a periodic ambiguity Θ ± 2πM (M is an arbitrary integer).
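(Property 2) can be made concrete with a small helper that maps any phase value back to the principal range −π ≤ Θ < π; values that differ by a multiple of 2π collapse to the same result. The function name is hypothetical.

```python
import math

def wrap_phase(theta):
    """Map any phase to the principal interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi

base = 0.3
shifted = [base + 2.0 * math.pi * M for M in (-5, -1, 0, 1, 3)]
wrapped = [wrap_phase(th) for th in shifted]
print(wrapped)
```

All five inputs print as approximately 0.3, which is why an observed phase difference determines 2πfδ_{f,k} only up to a multiple of 2π.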

（性質２）について、さらに以下にて説明を行なう。式（６） (Property 2) will be further described below. Formula (6)

の右辺における、ｅｘｐのカッコの中のベクトル及び行列ｘ，ｂ，Ｖを、それぞれの成分で表して整理すると、 The vectors in the parentheses of exp and the matrices x, b, and V on the right side of

となる（Ｃはδ_ｆ，ｋに依らない定数）。式（３２）は、位相や角度の分布のように周期的な値を取る変数に対する分布であるＶｏｎＭｉｓｅｓ分布 (C is a constant independent of δ _{f, k} ). Expression (32) is a Von Mises distribution that is a distribution for a variable that takes a periodic value such as a phase or angle distribution.

と同じ形をしていることが分かる（参考文献１参照）。
［参考文献１］Ｃ．Ｍ．ビショップ著，元田ら訳，“パターン認識と機械学習（上）”，シュプリンガー・ジャパン，２００６． (See Reference 1).
[Reference Document 1] C.I. M.M. Bishop, Motoda et al., “Pattern Recognition and Machine Learning (Part 1)”, Springer Japan, 2006.

ここで−π＜ｘ≦π、μは分布の平均（−π＜μ≦π）、κ＞０は分布の集中度パラメタ（正規分布での（１／分散）に相当）、Ｉ_０（ｘ）は０次の第１種ベッセル関数である。 Here, −π <x ≦ π, μ is the distribution average (−π <μ ≦ π), κ> 0 is the distribution concentration parameter (corresponding to (1 / dispersion) in the normal distribution), I ₀ (x ) Is a zeroth-order first-type Bessel function.

That is, the variable x in equation (33) corresponds to ψ_{ξk} − ψ_{sk} in equation (32), the mean μ in equation (33) corresponds to 2πfδ_{f,k} in equation (32), and the concentration κ in equation (33) corresponds to (|ξ_{f,t,k}||s_{f,t,k}|)/(σ_f²(1 − φ_f²)) in equation (32).
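For orientation, the correspondence just described can be written out explicitly. The first line below is the standard Von Mises density. The second line is the form of equation (32) implied by the stated correspondences; it is a reconstruction under the assumption that C absorbs every factor independent of δ_{f,k}, not a quotation of the original equation.

```latex
\begin{align}
p(x \mid \mu, \kappa) &= \frac{1}{2\pi I_0(\kappa)}
    \exp\{\kappa \cos(x - \mu)\}, \\
p(x_{f,t} \mid k, \theta) &= C \exp\left\{
    \frac{|\xi_{f,t,k}|\,|s_{f,t,k}|}{\sigma_f^{2}\,(1 - \phi_f^{2})}
    \cos\!\left(\psi_{\xi k} - \psi_{sk} - 2\pi f \delta_{f,k}\right)
  \right\}
\end{align}
```

Matching the two lines term by term gives x ↔ ψ_{ξk} − ψ_{sk}, μ ↔ 2πfδ_{f,k}, and κ ↔ |ξ_{f,t,k}||s_{f,t,k}|/(σ_f²(1 − φ_f²)).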

To make the explanation more intuitive, consider the case where the noise parameter is φ_f = 0. (This means that the noise is not the diffuse noise of equation (4); instead, Gaussian noise of variance σ_f² is added separately to each of the observations x_L and x_R.) In this case ξ_{f,t,k} = x_{f,t,R} by equation (14), so ψ_{ξk} in equation (32) becomes ψ_{x_{f,t,R}}, the phase of the observed signal x_{f,t,R} at microphone R. Similarly, s_{f,t,k} = x_{f,t,L} by equations (3) and (5) (with n_{f,t} = 0 in equation (5)), so ψ_{sk} in equation (32) becomes ψ_{x_{f,t,L}}, the phase of the observed signal x_{f,t,L} at microphone L. Furthermore, since ξ_{f,t,k} = x_{f,t,R} as noted above and x_{f,t,R} = e^{j2πfδ_{f,k}} · s_{f,t,k} by equations (3) and (5), it follows that |ξ_{f,t,k}| = |e^{j2πfδ_{f,k}} · s_{f,t,k}| = |s_{f,t,k}| in equation (32). Equation (32) simplifies accordingly, and interpreting the simplified expression as a Von Mises distribution leads to two observations:
(1) The distribution of the phase difference ψ_{x_{f,t,R}} − ψ_{x_{f,t,L}} between the R channel and the L channel, a variable that takes periodic values, has mean μ = 2πfδ_{f,k}.
(2) The concentration κ of the Von Mises distribution corresponds to the SN ratio |s_{f,t,k}|²/σ_f². That is, when the SN ratio is low (poor conditions), the concentration of the phase difference between the R channel and the L channel decreases (equivalently, its variance increases). This matches the phenomenon that measured phase differences scatter under low-SN-ratio conditions.
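The two observations can be checked numerically with a small simulation of the φ_f = 0 case. The sketch below is illustrative only: the function name, the unit-magnitude source, the frame count, and the use of the mean resultant length as an empirical stand-in for the concentration are all assumptions made here, not part of the embodiment.

```python
import cmath
import math
import random


def phase_difference_stats(f_hz, delta_s, snr_db, n_frames=4000, seed=0):
    """Simulate x_L = s + n_L and x_R = exp(j*2*pi*f*delta)*s + n_R with
    independent Gaussian noise, and summarise the R-L phase difference.

    Returns (circular mean of the phase difference, mean resultant length).
    """
    rng = random.Random(seed)
    sigma = 10.0 ** (-snr_db / 20.0)  # noise std for a source with |s| = 1
    steer = cmath.exp(2j * math.pi * f_hz * delta_s)
    resultant = 0j
    for _ in range(n_frames):
        s = cmath.exp(1j * rng.uniform(-math.pi, math.pi))  # random source phase
        n_l = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) / math.sqrt(2.0)
        n_r = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) / math.sqrt(2.0)
        x_l = s + n_l
        x_r = steer * s + n_r
        # Accumulate unit vectors so that the 2*pi wrap-around is harmless.
        resultant += cmath.exp(1j * (cmath.phase(x_r) - cmath.phase(x_l)))
    return cmath.phase(resultant), abs(resultant) / n_frames


if __name__ == "__main__":
    for snr in (25.0, 0.0):
        mean, conc = phase_difference_stats(1000.0, 1e-4, snr)
        print(f"SNR {snr:5.1f} dB: mean = {mean:.3f} rad, resultant length = {conc:.3f}")
```

At a high SN ratio the circular mean sits close to 2πfδ and the resultant length is close to 1; lowering the SN ratio spreads the phase differences and shortens the resultant length.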

This embodiment exploits Properties 1 and 2. As an implementation procedure, it focuses on the fact that the quantity related to the phase difference between the R channel and the L channel can be expressed by the Von Mises distribution, a distribution for variables that take periodic values. The signal arrival time difference parameter δ_{f,k} is then obtained by estimating the mean μ = 2πfδ_{f,k} of the distribution through maximum likelihood estimation of the Von Mises parameters.
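As a minimal sketch of maximum likelihood estimation of a Von Mises mean, the weighted log-likelihood Σ_t w_t cos(x_t − μ) is maximised by the direction of the weighted resultant vector. The function names and the generic weights w_t are assumptions for illustration; in the setting of this embodiment the weights would presumably combine the posterior probability with the concentration term, but equation (13) itself is not reproduced here.

```python
import math


def von_mises_mean_ml(angles, weights=None):
    """Return argmax over mu of sum_t w_t * cos(x_t - mu), in (-pi, pi]."""
    if weights is None:
        weights = [1.0] * len(angles)
    s = sum(w * math.sin(x) for x, w in zip(angles, weights))
    c = sum(w * math.cos(x) for x, w in zip(angles, weights))
    # atan2 resolves the quadrant directly, so no separate correction is needed.
    return math.atan2(s, c)


def arrival_time_difference(angles, weights, f_hz):
    """Convert the estimated mean mu = 2*pi*f*delta into delta (seconds)."""
    return von_mises_mean_ml(angles, weights) / (2.0 * math.pi * f_hz)


if __name__ == "__main__":
    # Two phases on either side of the +/-pi boundary: the arithmetic mean is
    # about -0.05 rad, whereas the circular estimate stays near pi.
    print(von_mises_mean_ml([3.0, -3.1]))
```

Working with sines and cosines rather than the raw angles is what makes the estimate insensitive to the 2πM ambiguity of Property 2.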

Substituting p(x_{f,t}|k,θ) of equation (31) into the Q function of equation (10) and setting the derivative of the Q function with respect to δ_{f,k} to zero yields equation (13).
The function F of equations (17) and (18) used in the time correction unit 12212 is the second derivative of the Q function. The time correction unit 12212 uses the sign of F(2πfδ'_{f,k}) to determine whether the solution δ'_{f,k} obtained from equation (13) gives a local maximum or a local minimum of the Q function, and applies the +π or −π correction described above when δ'_{f,k} gives a local minimum.
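The need for the correction can be seen in a simplified objective Q(μ) = Σ_t w_t cos(x_t − μ). A single-argument arctangent returns a stationary point in (−π/2, π/2), which may be a minimum; the sign of the second derivative tells the two cases apart. The sketch below assumes this simplified Q and hypothetical names S and C for the weighted sine and cosine sums; it is not the exact F of equations (17) and (18).

```python
import math


def corrected_phase(S, C):
    """Maximiser of Q(mu) = sum_t w_t*cos(x_t - mu), given
    S = sum_t w_t*sin(x_t) and C = sum_t w_t*cos(x_t)."""
    if C == 0.0:
        # Degenerate case: the arctangent argument is undefined.
        return math.copysign(math.pi / 2.0, S)
    mu = math.atan(S / C)  # stationary point, always in (-pi/2, pi/2)
    # Second derivative of Q at mu: -(C*cos(mu) + S*sin(mu)).
    second_derivative = -(C * math.cos(mu) + S * math.sin(mu))
    if second_derivative > 0.0:
        # mu is a minimum; the maximum lies pi away, kept inside (-pi, pi].
        mu += math.pi if mu <= 0.0 else -math.pi
    return mu


if __name__ == "__main__":
    for S, C in [(1.0, 1.0), (1.0, -1.0), (-1.0, -1.0), (-1.0, 1.0)]:
        print(S, C, corrected_phase(S, C), math.atan2(S, C))
```

For this simplified objective the corrected value coincides with the two-argument arctangent, which is one way to sanity-check an implementation of the correction step.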

<Effect>
In the conventional method, no analytical update formula was available in the time difference estimation unit, so an exhaustive search with a high computational cost was required. The problem addressed is therefore to reduce the computational cost of the time difference estimation unit and to provide a fast sound source separation means. This embodiment estimates the arrival time difference δ_{f,k} by focusing on the facts that δ_{f,k} is a value that affects the phase difference between microphone R and microphone L and that the phase difference takes periodic values. The exhaustive search previously required becomes unnecessary, which makes a fast sound source separation means possible.

<Simulation results>
An experiment was conducted to demonstrate the effect of the invention. In the room shown in FIG. 6, the number of sound sources was set to two (70 and 150 degrees) or three (30, 70, and 150 degrees). English speech was used for the sources, and the experiment was run with 10 different source combinations for each of the two-source and three-source cases.

As noise, Gaussian noise with mean 0 and covariance matrix σ_f² V_f (V_f is given by equation (4)) was added at an SN ratio of about 25 dB. The reverberation time of the room was 130 ms, the sampling frequency was 8 kHz, and the window length and shift length of the short-time Fourier transform were 64 ms and 16 ms, respectively.
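The analysis settings above translate into sample counts as follows. This is a small arithmetic sketch; the function name is hypothetical, and the one-sided bin count assumes a real-valued input signal and an FFT length equal to the window length, neither of which is stated in the text.

```python
def stft_frame_parameters(sample_rate_hz, window_ms, shift_ms):
    """Convert STFT window and shift lengths from milliseconds to samples."""
    window = round(sample_rate_hz * window_ms / 1000)
    shift = round(sample_rate_hz * shift_ms / 1000)
    bins = window // 2 + 1  # one-sided spectrum, assuming FFT length == window
    return window, shift, bins


if __name__ == "__main__":
    print(stft_frame_parameters(8000, 64, 16))
```

With 8 kHz sampling, a 64 ms window is 512 samples and a 16 ms shift is 128 samples, so consecutive frames overlap by 75%.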

In the conventional method, the arrival time difference δ_k was estimated by varying the source position Θ (see FIG. 6) from 0 to 180 degrees in 1-degree steps and searching exhaustively over the corresponding 181 values of δ_k.
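The cost difference can be illustrated by comparing a 181-point grid search with a closed-form estimate on the simplified objective Σ_t w_t cos(x_t − 2πfδ). Everything specific in this sketch is an assumption: the far-field mapping δ = D cos(Θ)/c, the microphone spacing of 0.04 m, the sound speed of 340 m/s, and the function names.

```python
import math


def candidate_delays(mic_distance_m, sound_speed_mps=340.0):
    """Delays for source angles 0..180 degrees in 1-degree steps
    (assumed far-field model: delta = D*cos(angle)/c)."""
    return [mic_distance_m * math.cos(math.radians(a)) / sound_speed_mps
            for a in range(181)]


def full_search_delay(angles, weights, f_hz, candidates):
    """Evaluate the objective at every candidate and keep the best one."""
    def objective(delta):
        return sum(w * math.cos(x - 2.0 * math.pi * f_hz * delta)
                   for x, w in zip(angles, weights))
    return max(candidates, key=objective)


def closed_form_delay(angles, weights, f_hz):
    """Single evaluation: direction of the weighted resultant vector."""
    s = sum(w * math.sin(x) for x, w in zip(angles, weights))
    c = sum(w * math.cos(x) for x, w in zip(angles, weights))
    return math.atan2(s, c) / (2.0 * math.pi * f_hz)


if __name__ == "__main__":
    cands = candidate_delays(0.04)
    phase = 2.0 * math.pi * 1000.0 * cands[70]  # noiseless source at 70 degrees
    obs, w = [phase] * 5, [1.0] * 5
    print(full_search_delay(obs, w, 1000.0, cands), closed_form_delay(obs, w, 1000.0))
```

The grid search evaluates the objective 181 times per update and is limited to the grid resolution, while the closed form needs one pass over the data.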

FIG. 7 shows the SIR (a signal-to-noise ratio in which the noise includes the other speakers' speech), the SDR (signal-to-distortion ratio), and the computation time on a personal computer (Intel Xeon (registered trademark) X5650, 2.67 GHz (6 cores) × 2 CPUs). FIG. 7A shows the two-source case (70 and 150 degrees), and FIG. 7B shows the three-source case (30, 70, and 150 degrees). Larger SIR and SDR values indicate better performance. The figures are averages over the 10 source combinations. As FIGS. 7A and 7B show, this embodiment achieves separation performance roughly equal to that of the conventional method in about 1/10 of the computation time.

<Other variations>
As described above, the key point of this embodiment is the method of estimating the arrival time difference. Other processing and parameter estimation methods are therefore not limited to those of the embodiment above, and other conventional techniques may be used.

In this embodiment, data is passed directly between the units, but each item of data may instead be read and written via a storage unit (not shown).

In this embodiment, the noise power estimation unit 1223 obtains the source spectrum s_{f,t,k} from the source spectrum estimation unit 1222. Alternatively, the noise power estimation unit 1223 may itself compute s_{f,t,k} from the arrival time difference δ_{f,k} and the observed signal x_{f,t} based on equation (19).

This embodiment assumes diffuse noise, but the noise may have other characteristics. In that case, sinc(2πfD/c) in equations (4) and (15) may be changed as appropriate to the noise characteristics.
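A minimal sketch of the diffuse-noise coherence term φ_f = sinc(2πfD/c) follows. It assumes the unnormalised convention sinc(x) = sin(x)/x, a covariance of the form σ_f² [[1, φ_f], [φ_f, 1]], a microphone spacing of 0.04 m, and a sound speed of 340 m/s; none of these specifics is confirmed by the text, and the function names are hypothetical.

```python
import math


def diffuse_coherence(f_hz, mic_distance_m, sound_speed_mps=340.0):
    """phi_f = sin(a)/a with a = 2*pi*f*D/c (assumed unnormalised sinc)."""
    a = 2.0 * math.pi * f_hz * mic_distance_m / sound_speed_mps
    return 1.0 if a == 0.0 else math.sin(a) / a


def noise_covariance(f_hz, sigma2, mic_distance_m, sound_speed_mps=340.0):
    """Assumed 2x2 noise covariance sigma_f^2 * [[1, phi_f], [phi_f, 1]]."""
    phi = diffuse_coherence(f_hz, mic_distance_m, sound_speed_mps)
    return [[sigma2, sigma2 * phi], [sigma2 * phi, sigma2]]


if __name__ == "__main__":
    for f in (0.0, 500.0, 2000.0, 4000.0):
        print(f, round(diffuse_coherence(f, 0.04), 4))
```

Under these assumptions the two channels are fully coherent at 0 Hz and decorrelate as frequency or spacing grows; replacing `diffuse_coherence` with another coherence function would correspond to the change of noise characteristics described above.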

The separated signal generation unit 13 may instead take the posterior probability m_{f,t,k} and the source spectrum s_{f,t,k} as inputs and generate the separated signal y_{f,t,k} by an alternative equation in place of equation (22).

In this case, the parameter estimation unit 12 outputs to the separated signal generation unit 13 the latest posterior probability m_{f,t,k} and source spectrum s_{f,t,k} held at the point when the change in the value of the Q function, |Q(θ|θ^{n−1}) − Q(θ|θ^{n})|, falls below the predetermined convergence threshold Δ, or when the update count n reaches or exceeds the predetermined maximum number of updates N (s31). With this configuration, the separated signal y_{f,t,k} can be generated in the same way as with equation (22) (see equation (19)).
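The stopping rule can be sketched as a generic iteration loop. The callable `step`, which performs one update and returns the new parameters together with the Q value, is a hypothetical placeholder for the E-step and M-step of the embodiment; only the two stopping conditions are taken from the text.

```python
def run_until_converged(step, theta, threshold, max_updates):
    """Iterate theta, q = step(theta) until |Q_prev - Q| < threshold
    or the update count reaches max_updates. Returns (theta, updates)."""
    q_prev = None
    n = 0
    while True:
        theta, q = step(theta)
        n += 1
        converged = q_prev is not None and abs(q_prev - q) < threshold
        if converged or n >= max_updates:
            return theta, n
        q_prev = q


if __name__ == "__main__":
    # Toy update: halve theta each time; Q is -(theta**2) at the new value.
    def toy_step(theta):
        new_theta = theta / 2.0
        return new_theta, -(new_theta ** 2)

    print(run_until_converged(toy_step, 1.0, 0.01, 100))
```

The values returned when the loop exits are the "latest" parameters in the sense used above, so whatever the loop returns is what would be passed on to signal generation.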

The order of computation in the M step is not limited to that of this embodiment. For example, either the time difference estimation unit 1221 or the source spectrum estimation unit 1222 may be computed first. FIG. 8 shows a functional block diagram for the case where the arrival time difference estimation (s25, s26) of the time difference estimation unit 1221 is performed after the source spectrum estimation (s27) of the source spectrum estimation unit 1222. The description below refers to FIG. 8.

The source spectrum estimation unit 1222 of the M-step calculation unit 122 takes as input the observed signal x_{f,t} and the parameters θ' obtained in the previous iteration (more precisely, the arrival time difference δ_{f,k} among θ'; when no such θ' exists, that is, in the first source spectrum computation, the initialized parameters described above). Using these values, it estimates the source spectrum s_{f,t,k} by equation (19) and outputs it to the noise power estimation unit 1223, the Q function calculation unit 1212, the time difference estimation unit 1221, and the time correction unit 12212.

The arctangent calculation unit 12211 of the time difference estimation unit 1221 in the M-step calculation unit 122 takes as input the observed signal x_{f,t}, the posterior probability m_{f,t,k}, and the source spectrum s_{f,t,k}. Using these values, it estimates the arrival time difference δ_{f,k} by equation (13) and outputs it to the time correction unit 12212.

The time correction unit 12212 of the time difference estimation unit 1221 in the M-step calculation unit 122 takes as input the observed signal x_{f,t}, the posterior probability m_{f,t,k}, the source spectrum s_{f,t,k}, and the parameters θ' obtained in the previous iteration (more precisely, the noise power spectrum σ_f² among θ'; when no such θ' exists, that is, in the first time correction, the initialized parameters described above). Using these values, it corrects, based on equation (16), the ±π ambiguity included in the arrival time difference δ_{f,k} estimated by equation (13). The time correction unit 12212 then divides the corrected value 2πfδ_{f,k} by 2πf to obtain the arrival time difference δ_{f,k}, and outputs it to the noise power estimation unit 1223 and the Q function calculation unit 1212.

With the above configuration, the same effect as in the first embodiment can be obtained. Similarly, the arrival time difference estimation (s25, s26) of the time difference estimation unit 1221 may be performed after the noise power estimation (s28) of the noise power estimation unit 1223, with the latter carried out based on the arrival time difference δ_{f,k} among the parameters θ' obtained in the previous iteration.

The present invention is not limited to the embodiments and variations described above. For example, the processes described above need not be executed sequentially in the order described; they may be executed in parallel or individually, depending on the processing capability of the device executing them or as required. Other changes may be made as appropriate without departing from the spirit of the invention.

<Program and recording medium>
The sound source separation device described above can also be implemented by a computer. In this case, a program for causing the computer to function as the target device (a device having the functional configuration shown in the figures of the embodiments), or a program for causing the computer to execute each step of its processing procedure (as shown in the embodiments), is loaded into the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or downloaded over a communication line, and the program is then executed.

Claims

A sound source separation device that, in a situation where a plurality of original signals are mixed with noise and observed with two microphones, separates and extracts each original signal from the observed signal, the device comprising:
parameter estimation means for estimating, with k denoting the sound source index of the original signal, from the frequency-domain observed signal x_{f,t} = [x_{f,t,L}, x_{f,t,R}]^T, the arrival time difference δ_{f,k} of the original signal at the two microphones, the noise power spectrum σ_f², the original signal spectrum s_{f,t,k}, and the sound source existence probability p(k|θ); and
separated signal generation means for generating a separated signal y_{f,t,k} from the observed signal x_{f,t} and the estimated parameters θ = {δ_{f,k}, σ_f², s_{f,t,k}, p(k|θ)},
wherein the parameter estimation means includes:
arctangent calculation means for estimating the arrival time difference δ_{f,k} by a formula defined in terms of the following quantities: m_{f,t,k}, the posterior probability representing the expected value of the attribution of the observed signal x_{f,t} to the sound source k; D, the distance between the two microphones; c, the speed of the original signal; φ_f = sinc(2πfD/c); ξ_{f,t,k} = [x_{f,t,R} − φ_f (x_{f,t,L} − s_{f,t,k})]; and ψ_{sk} and ψ_{ξk}, the phases of the spectra s_{f,t,k} and ξ_{f,t,k}, respectively; and
correction means for correcting the ±π ambiguity included in the estimated arrival time difference δ_{f,k}.

A sound source separation method that, in a situation where a plurality of original signals are mixed with noise and observed with two microphones, separates and extracts each original signal from the observed signal, the method comprising:
a parameter estimation step of estimating, with k denoting the sound source index of the original signal, from the frequency-domain observed signal x_{f,t} = [x_{f,t,L}, x_{f,t,R}]^T, the arrival time difference δ_{f,k} of the original signal at the two microphones, the noise power spectrum σ_f², the original signal spectrum s_{f,t,k}, and the sound source existence probability p(k|θ); and
a separated signal generation step of generating a separated signal y_{f,t,k} from the observed signal x_{f,t} and the estimated parameters θ = {δ_{f,k}, σ_f², s_{f,t,k}, p(k|θ)},
wherein the parameter estimation step includes:
an arctangent calculation step of estimating the arrival time difference δ_{f,k} by a formula defined in terms of the following quantities: m_{f,t,k}, the posterior probability representing the expected value of the attribution of the observed signal x_{f,t} to the sound source k; D, the distance between the two microphones; c, the speed of the original signal; φ_f = sinc(2πfD/c); ξ_{f,t,k} = [x_{f,t,R} − φ_f (x_{f,t,L} − s_{f,t,k})]; and ψ_{sk} and ψ_{ξk}, the phases of the spectra s_{f,t,k} and ξ_{f,t,k}, respectively; and
a correction step of correcting the ±π ambiguity included in the estimated arrival time difference δ_{f,k}.

A program for causing a computer to function as the sound source separation device according to claim 1.