JP6059112B2

JP6059112B2 - Sound source separation device, method and program thereof

Info

Publication number: JP6059112B2
Application number: JP2013171079A
Authority: JP
Inventors: 木下 慶介; 中谷 智広
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2017-01-11
Anticipated expiration: 2033-08-21
Also published as: JP2015040934A

Description

The present invention relates to a sound source separation device that accurately extracts each target signal when an input signal contains a plurality of target signals, and to a method and a program therefor.

When an acoustic signal is picked up in an environment where a plurality of target sound sources exist, a mixed signal in which the target signals overlap one another is often observed. If the target sound source of interest is a speech signal, the other source signals superimposed on it greatly reduce the intelligibility of the target speech. As a result, it becomes difficult to extract the properties of the original target speech signal (hereinafter, the target signal), and the recognition rate of an automatic speech recognition (hereinafter, speech recognition) system drops significantly. To prevent this drop in recognition rate, a method is needed that restores the intelligibility of the target signal by separating the plurality of target signals from one another.

This elemental technology for separating a plurality of target signals can be used in a variety of acoustic signal processing systems. Examples include hearing aids that extract a target signal from sound picked up in a real environment to make it easier to hear, video conferencing systems that improve speech intelligibility by extracting the target signal, speech recognition systems used in real environments, human-machine dialogue devices in machine control interfaces, and music information processing systems that search for or transcribe music.

FIG. 7 shows the functional configuration of a conventional sound source separation device 900, disclosed for example in Non-Patent Document 1, and its operation is briefly described. The sound source separation device 900 includes an all-microphone-common sound source presence posterior probability estimation unit 90 and a filtering unit 91.

The all-microphone-common sound source presence posterior probability estimation unit 90 takes as input multi-channel observation signals obtained by picking up, with a plurality of microphones, the source signals emitted from a plurality of sound sources. It computes a feature vector characterizing each time-frequency bin of the observation signals and classifies the feature vectors to compute a presence probability for each sound source. The filtering unit 91 recovers the source signals by multiplying the multi-channel observation signals picked up by the plurality of microphones by these presence probabilities.
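The filtering step of the conventional device can be pictured with a minimal sketch. The nested-list layout and the names `apply_presence_mask`, `observations`, and `presence` are assumptions made for illustration and do not come from the cited document; the point is only that one presence probability per time-frequency bin is shared by every microphone.

```python
def apply_presence_mask(observations, presence):
    """Multiply every channel's time-frequency bins by one shared mask.

    observations[m][t][f]: complex STFT value at microphone m (hypothetical layout).
    presence[t][f]: presence probability of ONE source, shared by all microphones.
    """
    return [
        [[presence[t][f] * bin_value for f, bin_value in enumerate(frame)]
         for t, frame in enumerate(channel)]
        for channel in observations
    ]


# Two microphones, one frame, two frequency bins.
obs = [[[1 + 1j, 2 + 0j]], [[0.5 + 0j, 4 + 2j]]]
mask = [[1.0, 0.25]]
separated = apply_presence_mask(obs, mask)
```

Because `mask` carries no microphone index, a source that is loud at one microphone and nearly inaudible at another is forced to share the same activity pattern at both.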

H. Sawada, S. Araki, and S. Makino, “Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment,” IEEE Trans. Audio, Speech and Lang. Process., vol. 19, pp. 516-527, March 2011.

However, when a plurality of microphones are widely distributed in space, the sound pressure of a given sound source observed at each microphone is not of the same level. In extreme cases, a given source may be practically unobservable at some microphone. In such a situation it is reasonable to assume a different sound source presence probability (activity pattern) at each microphone. The conventional method, however, cannot compute a sound source presence probability for each microphone separately, and therefore cannot perform efficient sound source separation in a distributed microphone array environment.

The present invention has been made in view of this problem, and its object is to provide a sound source separation device, and a method and a program therefor, that can perform sound source separation efficiently even in a distributed microphone array environment.

The sound source separation device of the present invention comprises a per-microphone sound source presence posterior probability estimation unit, a model parameter estimation unit, and an output sound estimation unit. The per-microphone sound source presence posterior probability estimation unit estimates, for each microphone, a sound source presence posterior probability for each sound source. To do so it uses multi-channel observation signals, obtained by picking up with a plurality of microphones the source signals emitted from a plurality of sound sources, together with an observation signal model which assumes that the sound pressure of the signal from each of the sound sources differs at each of the microphones. The model parameter estimation unit takes the multi-channel observation signals and the sound source presence posterior probabilities as input and estimates the model parameters of the observation signal. The output sound estimation unit takes the multi-channel observation signals, the sound source presence posterior probabilities, and the model parameters as input, and estimates and outputs, for each microphone, the signal arriving from each sound source.

According to the sound source separation device of the present invention, the signal arriving from each sound source (the source image) is estimated for each source using the sound source presence posterior probabilities estimated for each source at each of the plurality of microphones, so sound source separation can be performed efficiently even in a distributed microphone array environment. The specific effects confirmed in evaluation experiments are described later.

FIG. 1 is a diagram showing an example of the functional configuration of the sound source separation device 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the sound source separation device 100.
FIG. 3 is a diagram showing an example of the functional configuration of the sound source separation device 100′ of the present invention, which uses the EM algorithm and the Newton-Raphson method.
FIG. 4 is a diagram showing the operation flow of model parameter optimization.
FIG. 5 is a diagram showing the acoustic environment used in the evaluation experiment.
FIG. 6 is a diagram showing the results of the evaluation experiment.
FIG. 7 is a diagram showing an example of the functional configuration of the conventional sound source separation device 900.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated. Before the embodiments are described, the observation signal is modeled.

[Modeling of the observation signal]
When sound emitted from a plurality of point sources $(1, 2, \dots, N_i)$ is observed by the $m$-th of a plurality of microphones $(1, 2, \dots, N_m)$, the signal $x_{t,f}^{(i,m)}$ arriving from the $i$-th source is expressed in the time-frequency domain as follows. Here $t$ ($t = 1, \dots, N_t$) and $f$ ($f = 1, \dots, N_f$) are the time and frequency indices.
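The equation referred to here is not reproduced in this text. A form consistent with the definitions in the surrounding paragraphs (the error term is the difference between the source image and $\log|S H|^2$, and $s$ and $\beta$ are the log-power counterparts of $S$ and $H$) would be the following; it is offered as a reconstruction under those assumptions, not as the original Equation (1).

```latex
x_{t,f}^{(i,m)}
  = \log\left| S_{t,f}^{(i)} H_f^{(i,m)} \right|^2 + e_{t,f}^{(i,m)}
  = s_{t,f}^{(i)} + \beta_f^{(i,m)} + e_{t,f}^{(i,m)},
\qquad
s_{t,f}^{(i)} = \log\left| S_{t,f}^{(i)} \right|^2,\quad
\beta_f^{(i,m)} = \log\left| H_f^{(i,m)} \right|^2
```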

Here $S_{t,f}^{(i)}$ and $s_{t,f}^{(i)}$ are the clean speech signal from the $i$-th source in the short-time Fourier transform domain and in the log-power domain, respectively; both are parameters that do not depend on the microphone position. Similarly, $H_f^{(i,m)}$ and $\beta_f^{(i,m)}$ are the transfer function in the short-time Fourier transform domain and in the log-power-spectrum domain, respectively.

In the following description, the variable $\beta_f^{(i,m)}$ is called the microphone-position-dependent, time-invariant source gain. The signal $x_{t,f}^{(i,m)}$ arriving from the $i$-th source is called the source image. $e_{t,f}^{(i,m)}$ is an error term: it is the difference between $x_{t,f}^{(i,m)}$ and $\log|S_{t,f}^{(i)} H_f^{(i,m)}|^2$ and represents, for example, fluctuation of the transfer function. The error term $e_{t,f}^{(i,m)}$ is assumed to be a white signal with mean 0 and variance $\sigma_{t,f}^{(i,m)}$.

Under these definitions, the relationship between the clean speech signal $s_{t,f}^{(i)}$ from the $i$-th source and its source image $x_{t,f}^{(i,m)}$ can be modeled as a Gaussian probability density function, as follows.
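The density itself is not reproduced in this text. Given a zero-mean error term with variance $\sigma_{t,f}^{(i,m)}$ added to $s_{t,f}^{(i)} + \beta_f^{(i,m)}$, a Gaussian of the following form would be consistent with the description; treat it as a reconstruction, not as the original Equation (2).

```latex
p\left( x_{t,f}^{(i,m)} ;\, \theta^{(i)} \right)
  = \mathcal{N}\left( x_{t,f}^{(i,m)} ;\;
      s_{t,f}^{(i)} + \beta_f^{(i,m)},\;
      \sigma_{t,f}^{(i,m)} \right)
```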

Here $\theta^{(i)}$ denotes the set of model parameters, and $\mathcal{N}$ denotes the normal distribution.

Next, the observation signal $o_{t,f}^{(m)}$ picked up by the $m$-th microphone in an environment with a plurality of point sources is modeled using the LogMax approximation. Under this approximation, as shown in the following equation, the observation signal $o_{t,f}^{(m)}$ equals the value of the dominant source signal, that is, the one with the largest sound pressure among all point sources.
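The equation is not reproduced in this text. Read literally, the sentence above corresponds to the following relation, given here as a reconstruction under that reading:

```latex
o_{t,f}^{(m)} \approx \max_{i}\; x_{t,f}^{(i,m)}
```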

In this model, a non-dominant source may take any value that does not exceed the log-power spectrum of the observation signal. The LogMax approximation model above is formulated probabilistically as shown in the following equation.

Here $I_{t,f}^{(m)}$ denotes the index of the dominant source in each time-frequency bin of the observation signal of the $m$-th microphone, and $\delta(\cdot)$ denotes the Dirac delta function. In the following description, the variable $I_{t,f}^{(m)}$ is called the dominant source index (DSI), and its indices are omitted for simplicity.

Equation (3) states that the observation signal $o_{t,f}^{(m)}$ at the $m$-th microphone equals the dominant source image at that microphone. Note that a different speech activity pattern, that is, a different dominant source index DSI, is assigned to each microphone.
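How close the LogMax approximation is can be checked numerically. The sketch below assumes that source powers add at the microphone (uncorrelated sources), which is an assumption of this example and is not stated in the text. The function names are hypothetical, and indices are 0-based, unlike the 1-based source numbering used in the description.

```python
import math


def log_sum_power(log_powers):
    """Exact log of the summed powers, computed stably (log-sum-exp)."""
    peak = max(log_powers)
    return peak + math.log(sum(math.exp(x - peak) for x in log_powers))


def logmax(log_powers):
    """LogMax approximation: the mixture equals its loudest component."""
    return max(log_powers)


def dominant_source_index(log_powers):
    """Index of the dominant source in this bin (0-based in this sketch)."""
    return max(range(len(log_powers)), key=lambda i: log_powers[i])


# Source images of three sources at one microphone and one time-frequency bin.
x = [2.0, -3.0, -1.0]
exact = log_sum_power(x)
approx = logmax(x)
dsi = dominant_source_index(x)
```

The gap `exact - approx` is never negative and never exceeds the logarithm of the number of sources; it shrinks as the dominant source pulls further ahead of the others, which is the regime the model relies on.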

Using the probability model above, the joint probability of the observation signal $o_{t,f}^{(m)}$ and $I$ (the dominant source index DSI) is derived as follows.

Note that $\theta^{(i)}$ denotes the parameters of each source $i$, and $\theta$ denotes the parameters of all sources. That is, Equation (6) is the joint probability of the observation signal $o_{t,f}^{(m)}$ and the model parameters $\theta$ including $I$ (the dominant source index DSI). The embodiments below are described on the premise that the source image $x_{t,f}^{(i,m)}$ of each source and the probability model of the observation signal are modeled as described above. In the following description, the LogMax approximation model described above (Equation (4)) is referred to as the "LogMax observation model" or the "probability model of the observation signal".

[Concept of the invention]
The sound source separation method of the present invention makes it possible to estimate a different activity pattern for each of a plurality of microphones by focusing on the important parameters contained in the source image $x_{t,f}^{(i,m)}$ described above.

The important parameter that characterizes the sound source separation method of the present invention is the dominant source index DSI. Because the DSI indicates the activity pattern of each source at each microphone, estimating this parameter directly yields a different activity pattern for each microphone.

In addition to the dominant source index DSI, the method uses two parameters that implicitly support it: the microphone-position-dependent, time-invariant source gain $\beta_f^{(i,m)}$, which does not vary with time, and the microphone-independent source log-power spectrum $s_{t,f}^{(i)}$, which does vary with time (see Equation (1)).

The principle by which these parameters allow the activity pattern to be estimated is briefly as follows. If a given source arrives at the $m$-th microphone with a high SNR, the microphone-position-dependent, time-invariant source gain $\beta_f^{(i,m)}$, which is the parameter corresponding to the SNR, tends to take a relatively high value, and the source is observed as the dominant source under the LogMax observation model.

A signal that is explicitly observed as the dominant source in a given time-frequency bin makes it possible to estimate the log-power spectrum of that source. Conversely, if a given source arrives at the $m$-th microphone with a low SNR, the gain $\beta_f^{(i,m)}$ tends to take a relatively low value, and the source is non-dominant under the LogMax observation model. Under the LogMax observation model the spectrum of a non-dominant source is not explicitly observed, so the log-power spectrum of that source is not estimated from that microphone.

In this way, the present invention estimates the log-power spectrum of each source mainly from the observation signals of microphones with a high SNR, which are generally the microphones close to that source. As a result, a different activity pattern can be estimated for each microphone while information from the plurality of microphones is taken into account effectively.

In a specific embodiment, the activity pattern is estimated using the expectation-maximization method (EM algorithm) with the dominant source index DSI as a latent variable. In the E step (expectation), the posterior probability of the DSI is updated, which estimates which source is dominant in which time-frequency bin of which microphone. In the M step (update), the microphone-position-dependent, time-invariant source gain $\beta_f^{(i,m)}$, the microphone-independent source log-power spectrum $s_{t,f}^{(i)}$, and the variance $\sigma_{t,f}^{(i,m)}$ of the error term $e_{t,f}^{(i,m)}$ of each source are updated based on that posterior probability.

FIG. 1 shows an example of the functional configuration of the sound source separation device 100 of the present invention, and FIG. 2 shows its operation flow. The sound source separation device 100 comprises a per-microphone sound source presence posterior probability estimation unit 10, a model parameter estimation unit 20, and an output sound estimation unit 30. The function of each unit of the sound source separation device 100 is realized, for example, by loading a predetermined program into a computer made up of a ROM, a RAM, a CPU, and the like, and having the CPU execute that program.

The per-microphone sound source presence posterior probability estimation unit 10 estimates, for each microphone $m$, the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ for each source $i$ (step S10). To do so it uses the multi-channel observation signals $o_{t,f}^{(m)}$, obtained by picking up with a plurality of microphones the source signals emitted from a plurality of sound sources, together with an observation signal model which assumes that the sound pressure of the signal from each of the sources $i$ differs at each of the microphones. Here, the observation signal model is defined so that the signal $o_{t,f}^{(m)}$ observed by the $m$-th microphone equals the arriving signal with the largest sound pressure among the signals that arrive from each of the plurality of sources and are observed at that microphone (the LogMax observation model, Equation (4)).
The model of the arriving signal is a probability model in which the source image $x_{t,f}^{(i,m)}$ of the $i$-th source observed at the $m$-th microphone is defined by three quantities (Equation (1)): the microphone-independent source log-power spectrum $s_{t,f}^{(i)}$ of the $i$-th source; the microphone-position-dependent, time-invariant source gain $\beta_f^{(i,m)}$, which corresponds to the sound pressure of the signal arriving at the $m$-th microphone from the $i$-th source; and the error term $e_{t,f}^{(i,m)}$, which corresponds to the difference between the signal arriving at the $m$-th microphone from the $i$-th source and the signal from the $i$-th source observed at the $m$-th microphone.

The microphone-independent source log-power spectrum $s_{t,f}^{(i)}$ may also be called the clean speech signal from the source, which does not depend on the microphone. The microphone-position-dependent, time-invariant source gain $\beta_f^{(i,m)}$ is a value that changes with the positions of the source and the microphone, and may also be called the transfer function. Marks such as ^ are correctly written directly above the variable, as in the drawings and equations.

The model parameter estimation unit 20 takes as input the multi-channel observation signals $o_{t,f}^{(m)}$ and the sound source presence posterior probabilities $\hat{M}_{t,f}^{(i,m)}$ estimated by the per-microphone sound source presence posterior probability estimation unit 10, and estimates the model parameters $\hat{\theta}^{(i)}$ of the observation signal (step S20). The model parameters $\hat{\theta}^{(i)}$ are the microphone-independent source log-power spectrum $s_{t,f}^{(i)}$, the microphone-position-dependent, time-invariant source gain $\beta_f^{(i,m)}$, and the variance $\sigma_{t,f}^{(i,m)}$ of the error term $e_{t,f}^{(i,m)}$.

The output sound estimation unit 30 takes as input the multi-channel observation signals $o_{t,f}^{(m)}$, the sound source presence posterior probabilities $\hat{M}_{t,f}^{(i,m)}$ estimated by the per-microphone sound source presence posterior probability estimation unit 10, and the model parameters $\hat{\theta}^{(i)}$ estimated by the model parameter estimation unit 20, and estimates and outputs, for each microphone $m$, the source image $x_{t,f}^{(i,m)}$ of each source $i$ (step S30).

The sound source separation device 100 operating as described above estimates the source image $x_{t,f}^{(i,m)}$ of each source $i$ using the sound source presence posterior probabilities $\hat{M}_{t,f}^{(i,m)}$ estimated for each source $i$ at each of the microphones $m$, so sound source separation can be performed efficiently even in a distributed microphone array environment. The operation of the sound source separation device 100 is described in more detail below.

The sound source separation device 100 estimates the model parameters $\hat{\theta}^{(i)}$ effectively under the maximum a posteriori (MAP) criterion. In this embodiment, the dominant source index DSI is regarded as a latent variable, and the model parameters $\hat{\theta}^{(i)} = (s_{t,f}^{(i)}, \beta_f^{(i,m)}, \sigma_{t,f}^{(i,m)})$ are estimated. To perform MAP parameter estimation efficiently, this embodiment uses the EM algorithm to maximize the following auxiliary function iteratively.

Here $\theta$ denotes the prior estimate of the model parameters and $\hat{\theta}$ denotes their estimate. The term $p(x_{t,f}^{(i,m)}; \theta^{(i)})$ in Equation (7) can be computed from the prior estimate $\theta$ of the model parameters, as defined in Equation (2). The prior estimate $\theta$ is assumed to be given in advance. In other words, the auxiliary function $Q(\theta \mid \hat{\theta})$ is a weighted sum over all observation signals: the joint probability $p(o_{t,f}^{(m)}, I_{t,f}^{(m)} = i; \theta^{(i)})$ of the observation signal $o_{t,f}^{(m)}$ and the prior estimate of the model parameters including the dominant source index DSI is multiplied by a weight corresponding to the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$, and the products are added up. In the EM algorithm, the model parameters are updated so that the value of this auxiliary function increases.
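Equation (7) itself is not reproduced in this text. The verbal description, a posterior-weighted sum over all observations, matches the usual shape of an EM auxiliary function. The sketch below assumes, as is standard for EM but not stated in the text, that the weighted quantity is the logarithm of the joint probability; it is written with the full parameter set $\theta$ because the joint probability involves every source.

```latex
Q(\theta \mid \hat{\theta})
  = \sum_{m=1}^{N_m} \sum_{t=1}^{N_t} \sum_{f=1}^{N_f} \sum_{i=1}^{N_i}
    \hat{M}_{t,f}^{(i,m)} \,
    \log p\left( o_{t,f}^{(m)},\, I_{t,f}^{(m)} = i ;\, \theta \right)
```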

The sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ at each microphone $m$ is given by the following equation.

Equation (7) cannot be maximized analytically because of the complexity of its second term. In this embodiment, therefore, the auxiliary function is maximized efficiently using the Newton-Raphson method.
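As a reminder of what a Newton-Raphson maximization step looks like, here is a generic one-dimensional sketch. It is not the update used for the auxiliary function in this document, whose gradient and Hessian are not reproduced here; the toy objective and the names `newton_raphson_maximize`, `grad`, and `hess` are assumptions chosen only for illustration.

```python
def newton_raphson_maximize(grad, hess, x0, tol=1e-12, max_iter=50):
    """Find a stationary point of a 1-D function by Newton-Raphson steps.

    grad and hess return the first and second derivative at x. For a
    concave objective the stationary point found is the maximum.
    """
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Toy concave objective g(x) = log(x) - x, which is maximized at x = 1.
x_star = newton_raphson_maximize(
    grad=lambda x: 1.0 / x - 1.0,
    hess=lambda x: -1.0 / (x * x),
    x0=0.5,
)
```

Each step uses curvature as well as slope, so a handful of iterations is usually enough near the optimum; the cost is that second derivatives must be available.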

FIG. 3 shows an example of the functional configuration of a sound source separation device 100′ that uses the EM algorithm and the Newton-Raphson method. In addition to the configuration of the sound source separation device 100, the sound source separation device 100′ further includes a storage unit 40 and an iterative processing unit 50. The model parameter estimation unit 20 includes a microphone-position-dependent, time-invariant source gain estimation means 201 and a microphone-independent source log-power spectrum estimation means 202.

The parameter optimization procedure is carried out by the per-microphone sound source presence posterior probability estimation unit 10, the model parameter estimation unit 20, the storage unit 40, and the iterative processing unit 50. FIG. 4 shows the operation flow of the parameter optimization procedure.

The storage unit 40 stores the initial values $\theta$ of the model parameters $\hat{\theta}^{(i)} = (\hat{s}_{t,f}^{(i)}, \hat{\beta}_f^{(i,m)}, \hat{\sigma}_{t,f}^{(i,m)})$ and their updated values. Alternatively, the storage unit 40 may store only the updated model parameters $\hat{\theta}^{(i)}$, with the initial values $\theta$ held in advance as constants by each unit that needs them.

The per-microphone sound source presence posterior probability estimation unit 10 takes as input the observation signal $o_{t,f}^{(m)}$ of each of the plurality of microphones and the model parameters $\hat{\theta}^{(i)} = (\hat{s}_{t,f}^{(i)}, \hat{\beta}_f^{(i,m)}, \hat{\sigma}_{t,f}^{(i,m)})$ stored in the storage unit 40, and computes for each microphone, by Equation (8), the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ for each source $i$ (step S10). That is, the unit computes $\hat{M}_{t,f}^{(i,m)}$ from the joint probability of the observation signal $o_{t,f}^{(m)}$ and the model parameters $\hat{\theta}^{(i)}$ obtained when both are applied to the observation signal model. This processing corresponds to the E step of the EM algorithm.
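Equation (8) is not reproduced in this text, so the E step can only be sketched under assumptions. The sketch below assumes the usual max-model reading of the LogMax observation model with Gaussian source images: source i is dominant when its image equals the observation and every other source image lies at or below it. The function names and argument layout are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def normal_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def dsi_posterior(o, s, beta, var):
    """Posterior that each source is dominant at ONE microphone and ONE bin.

    o: observed log power; s[i], beta[i], var[i]: per-source log-power
    spectrum, gain toward this microphone, and error variance.
    """
    means = [s_i + b_i for s_i, b_i in zip(s, beta)]
    joint = []
    for i in range(len(means)):
        # Source i is observed exactly at o ...
        p = normal_pdf(o, means[i], var[i])
        # ... and every other source image lies at or below o.
        for j in range(len(means)):
            if j != i:
                p *= normal_cdf(o, means[j], var[j])
        joint.append(p)
    total = sum(joint)
    return [p / total for p in joint]


# The same two sources seen from two microphones with opposite gains.
s = [0.0, 0.0]
post_mic_a = dsi_posterior(o=-1.0, s=s, beta=[-1.0, -6.0], var=[1.0, 1.0])
post_mic_b = dsi_posterior(o=-1.0, s=s, beta=[-6.0, -1.0], var=[1.0, 1.0])
```

Under this assumed model the two microphones assign the same bin to different sources, because only the gains differ. A practical implementation would work in the log domain to avoid underflow when many sources are multiplied together.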

The microphone-position-dependent, time-invariant source gain estimation means 201 takes as input the observation signal $o_{t,f}^{(m)}$ of each of the plurality of microphones, the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ computed by the per-microphone sound source presence posterior probability estimation unit 10, and the microphone-independent source log-power spectrum $\hat{s}_{t,f}^{(i)}$ from the model parameters $\hat{\theta}^{(i)}$ stored in the storage unit 40. It computes the gain $\hat{\beta}_f^{(i,m)}$ and the variance $\sigma_{t,f}^{(i,m)}$ by the following equations and updates the values of those parameters stored in the storage unit 40 (step S201). In the following equations, $\hat{\kappa}_{t,f}^{(i,m)} = \hat{M}_{t,f}^{(i,m)}$ when the condition $o_{t,f}^{(m)} > (\hat{s}_{t,f}^{(i)} + \hat{\beta}_f^{(i,m)})$ is satisfied, and $\hat{\kappa}_{t,f}^{(i,m)} = 1$ otherwise.
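The case distinction for the weight can be written down directly, even though the update equations that consume it are not reproduced in this text. The function and argument names below are hypothetical.

```python
def kappa(o, s_hat, beta_hat, posterior):
    """Weight for one (source, microphone, time, frequency) entry.

    Returns the presence posterior when the observation exceeds the
    predicted source image s_hat + beta_hat, and 1 otherwise.
    """
    if o > s_hat + beta_hat:
        return posterior
    return 1.0


k_above = kappa(o=0.0, s_hat=-1.0, beta_hat=-1.0, posterior=0.3)
k_below = kappa(o=-3.0, s_hat=-1.0, beta_hat=-1.0, posterior=0.3)
```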

The microphone-independent source log-power spectrum estimation means 202 takes as input the observation signal $o_{t,f}^{(m)}$ of each microphone $m$, the model parameters $\hat{\theta}^{(i)}$ stored in the storage unit 40, and the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ computed by the per-microphone sound source presence posterior probability estimation unit 10. It computes, by the following equation, the clean speech signal $s_{t,f}^{(i)}$ from the $i$-th source, which is common to the plurality of microphones $m$, and updates the value of that parameter stored in the storage unit 40 (step S202). The processing of steps S201 and S202 (step S20) corresponds to the M step of the EM algorithm.

Note also that the update equations for $\hat{s}_{t,f}^{(i)}$ and $\hat{\beta}_{f}^{(i,m)}$ are similar. They differ in the averaging: $\hat{s}_{t,f}^{(i)}$ is calculated as an average over the microphone index, whereas $\hat{\beta}_{f}^{(i,m)}$ is calculated as an average over the time index.
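The update equations themselves are not reproduced in this text, so the following is only a minimal sketch of the two averaging directions described above, for one sound source i and one frequency bin f. It assumes an additive log-domain relation (observation ≈ source log power + gain) and ignores the variance and the normalization term; the function names and this simplified form are assumptions made for illustration, not the patent's equations (9) to (11).

```python
def kappa_weights(obs, s, beta, post):
    """Weight rule from the text: kappa = posterior if obs > s + beta, else 1.

    obs[t][m]: observed log power, s[t]: source log power,
    beta[m]: gain per microphone, post[t][m]: source presence posterior.
    """
    return [
        [post[t][m] if obs[t][m] > s[t] + beta[m] else 1.0
         for m in range(len(beta))]
        for t in range(len(s))
    ]


def update_beta(obs, s, kappa):
    """Gain per microphone: weighted average over the time index t (assumed form)."""
    n_frames, n_mics = len(obs), len(obs[0])
    beta = []
    for m in range(n_mics):
        num = sum(kappa[t][m] * (obs[t][m] - s[t]) for t in range(n_frames))
        den = sum(kappa[t][m] for t in range(n_frames))
        beta.append(num / max(den, 1e-12))  # guard against all-zero weights
    return beta


def update_s(obs, beta, kappa):
    """Source log power per frame: weighted average over the microphone index m (assumed form)."""
    n_frames, n_mics = len(obs), len(obs[0])
    s = []
    for t in range(n_frames):
        num = sum(kappa[t][m] * (obs[t][m] - beta[m]) for m in range(n_mics))
        den = sum(kappa[t][m] for m in range(n_mics))
        s.append(num / max(den, 1e-12))  # guard against all-zero weights
    return s
```

The two update functions are deliberately symmetric: only the index being summed over changes, which mirrors the remark that the gain is averaged over time and the source spectrum over microphones.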

The auxiliary function in equation (9) is the sum of the auxiliary function defined in equation (7) and the value calculated by equation (12) multiplied by the weight ρ. The reason is that if there is a sound source that never becomes dominant at a certain microphone (a sound source that is never explicitly observed under the LogMax observation model), the optimal solution for the microphone-position-dependent, source time-invariant gain $\hat{\beta}_{f}^{(i,m)}$ becomes infinitesimal and the entire estimation process becomes unstable. As described above, this problem can be avoided by defining the following normalization term (prior distribution) 203 for the microphone-independent source log power spectrum $\hat{s}_{t,f}^{(i)}$ and adding it to the auxiliary function with the weight ρ.
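As a compact restatement of the relationship described above, the combined auxiliary function can be written as follows. The symbols $\tilde{Q}$, $Q$, and $R$ are labels introduced here for illustration only: $Q$ stands for the auxiliary function of equation (7) and $R$ for the value calculated by equation (12), whose exact forms are not reproduced in this text.

```latex
\tilde{Q}(\theta) = Q(\theta) + \rho \, R\bigl(\hat{s}_{t,f}^{(i)}\bigr)
```

With $\rho = 0$ the expression reduces to the auxiliary function of equation (7) alone, which is the case in which the gain of a never-dominant source is left unconstrained.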

The normalization term 203 may be stored in the storage unit 40 in advance, or may be held as a constant inside the model parameter estimation unit 20 as shown in FIG. 3.

As described above, the model parameter estimation unit 20 updates the model parameters (the microphone-position-dependent, source time-invariant gain $\hat{\beta}_{f}^{(i,m)}$, the variance $\sigma_{t,f}^{(i,m)}$, and the microphone-independent source log power spectrum $\hat{s}_{t,f}^{(i)}$) so as to increase the auxiliary function of equation (7) (equations (9) to (11)). This auxiliary function is a weighted sum over all observed signals: for each observed signal, the joint probability $p(o_{t,f}^{(m)}, I_{t,f}^{(m)} = i; \theta^{(i)})$ of the observed signal $o_{t,f}^{(m)}$ and the model parameter estimate $\theta^{(i)}$ including the dominant source index (DSI), obtained by applying the observed signal $o_{t,f}^{(m)}$ and the current model parameter estimate $\theta^{(i)}$ to the observation model, is multiplied by a weight corresponding to the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$.

The iterative processing unit 50 repeats the E step and the M step until a predetermined criterion is satisfied (step S51). Two examples of the criterion are possible. In the first, the criterion is judged to be satisfied when the difference between the value of the Q function (auxiliary function) of equation (7) calculated from the model parameter $\hat{\theta}$ before the update and the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ for each sound source, and the value of the Q function calculated from the updated model parameter and the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ for each sound source, falls below a predetermined threshold. In the second, the criterion is judged to be satisfied when a predetermined number of iterations is reached. Repeating the processing in this way maximizes the auxiliary function.
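The two stopping rules described above can be sketched as a generic loop. This is an illustrative outline only: `e_step`, `m_step`, and `q_value` are hypothetical callables standing in for the posterior estimation, the parameter update, and the Q function of equation (7), and the default threshold and iteration cap are arbitrary values, not values taken from the patent.

```python
def run_em(e_step, m_step, q_value, params, threshold=1e-4, max_iters=100):
    """Alternate E and M steps until Q changes by less than `threshold`
    or `max_iters` iterations have been run. Returns (params, post, n_iters)."""
    post = e_step(params)              # E step with the initial parameters
    q_old = q_value(params, post)      # Q before the first update
    for n in range(1, max_iters + 1):
        params = m_step(params, post)  # M step: update model parameters
        post = e_step(params)          # E step: update posteriors
        q_new = q_value(params, post)  # Q after the update
        if abs(q_new - q_old) < threshold:
            return params, post, n     # first rule: Q difference below threshold
        q_old = q_new
    return params, post, max_iters     # second rule: iteration count reached
```

Returning the iteration count alongside the parameters makes it easy to tell which of the two rules ended the loop.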

When the predetermined criterion is satisfied, the output sound estimation unit 30 takes as input the observed signal $o_{t,f}^{(m)}$ of each of the plurality of microphones, the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ calculated by the microphone-specific sound source presence posterior probability estimation unit 10, and the model parameter $\hat{\theta}^{(i)}$ stored in the storage unit 40, and calculates and outputs the i-th sound source image $\hat{x}_{t,f}^{(i,m)}$ at the m-th microphone. Estimating the parameters with the EM algorithm makes it possible to obtain the sound source image $\hat{x}_{t,f}^{(i,m)}$ by least-squares error estimation. The estimated sound source image $\hat{x}_{t,f}^{(i,m)}$ is expressed by the following equation.

[Evaluation experiment]
An evaluation experiment was conducted to evaluate the performance of the sound source separation device 100 of the present invention. The experimental conditions were as follows.

FIG. 5 shows the acoustic environment used for the simulation. The room size is 10 m (W) × 5 m (D) × 5 m (H), and the reverberation time is 100 ms. This acoustic environment was simulated using the image method (Reference 1: J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65(4), pp. 943-950, 1979.).

Four acoustic environments were simulated. The first and second acoustic environments assume a situation in which three speakers sit at equal intervals on a circle with a radius of 80 cm and talk simultaneously. In the first acoustic environment, three microphones are arranged on a concentric circle with a radius of 10 cm; in the second acoustic environment, the same microphones are arranged on a concentric circle with a radius of 50 cm. In FIG. 5, the first and second acoustic environments correspond to the state in which one of the groups, namely the two speakers and their microphones, is absent.

The third and fourth acoustic environments assume a situation in which two groups, one of three speakers and one of two speakers, are conversing in the same room. In the third acoustic environment, five microphones are arranged on concentric circles with a radius of 10 cm; in the fourth acoustic environment, the same microphones are arranged on concentric circles with a radius of 50 cm.

Three sound sources were separated in the first and second acoustic environments, and five sound sources in the third and fourth. The conventional method used for comparison was the method of Non-Patent Document 1, which performs sound source separation with a soft mask under the assumption of a sound source activity pattern common to all microphones. In the conventional method, soft-mask processing was applied to the observed signal of the microphone closest to each sound source to obtain the separated signal.

In the method of the present invention, the processing result of the conventional method was used as the initial value of the EM algorithm. The processing result of the conventional method was also used to calculate the normalization term of equation (12). The weight ρ of the normalization term was set to ρ = 0.00001.

Cepstrum distance was used as the evaluation metric. The cepstrum distance was measured between the signal under comparison and each sound source image at the microphone closest to each sound source. The evaluation speech was selected at random from TIMIT (Reference 2: W. Fisher, G. R. Doddington, and K. M. Goudie-Marshall, “The DARPA speech recognition research database: specifications and status,” in Proc. DARPA Workshop on Speech Recognition, 1986, pp. 96-99.). A total of 20 different speech mixtures were prepared for each acoustic environment, and the results were calculated as the average over these mixtures.
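The text does not give the formula used for the cepstrum distance, so the sketch below uses one commonly cited definition in dB, $(10/\ln 10)\sqrt{2\sum_k (c_k - \hat{c}_k)^2}$, purely as an assumption. The function names are hypothetical, the inputs are assumed to be per-frame cepstral coefficient lists that the caller has already computed (typically excluding the zeroth coefficient), and frame averaging is likewise an assumed convention.

```python
import math


def cepstral_distance_db(ref, est):
    """Cepstrum distance in dB between two cepstral coefficient vectors
    (assumed definition; the patent text does not state its formula)."""
    if len(ref) != len(est):
        raise ValueError("coefficient vectors must have the same length")
    squared_error = sum((r - e) ** 2 for r, e in zip(ref, est))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_error)


def mean_cepstral_distance_db(ref_frames, est_frames):
    """Average the per-frame distance over all frames of one utterance."""
    if len(ref_frames) != len(est_frames) or not ref_frames:
        raise ValueError("need the same, non-zero number of frames")
    total = sum(cepstral_distance_db(r, e)
                for r, e in zip(ref_frames, est_frames))
    return total / len(ref_frames)
```

A result averaged over several mixtures, as in the experiment, would simply be the mean of `mean_cepstral_distance_db` over those mixtures.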

FIG. 6 shows the results of the evaluation experiment. The horizontal axis is the acoustic environment and the vertical axis is the cepstrum distance (dB). For each acoustic environment, the cepstrum distances of the observed signal, the conventional method, and the present invention are shown. The cepstrum distance of the observed signal was calculated using the observed signal of the microphone closest to each speaker, which corresponds to the result of microphone selection when the nearest microphone is known.

In the first acoustic environment, the conventional method also reduces the cepstrum distance, but the present invention reduces it further. This is thought to be because the method of the present invention performs optimal parameter estimation in the log power spectrum domain, which is similar to the cepstrum domain.

In the second to fourth acoustic environments, no performance improvement from the conventional method can be confirmed. The performance of the conventional method degrades under the cepstrum distance measure, and distortion is presumed to have increased owing to over-suppression and similar effects. The method of the present invention effectively reduced the cepstrum distance in all acoustic environments. It was thus confirmed that the sound source separation device 100 of the present invention separates sound sources efficiently even in a distributed microphone array environment.

When the processing means of the sound source separation device 100 described above are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer implements the processing means of each device on the computer.

The sound source separation device 100′ using the EM algorithm and the Newton-Raphson method has been described with the aim of performing maximum a posteriori parameter estimation efficiently, but the present invention is not limited to this embodiment. For example, the EM algorithm need not be used for maximum a posteriori parameter estimation. Use of an exhaustive search that examines all combinations also falls within the scope of the technical idea of the present invention.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a recording device of a server computer and transferring it from the server computer to other computers via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

A sound source separation device comprising:
a microphone-specific sound source presence posterior probability estimation unit that takes as input multi-channel observed signals obtained by picking up, with a plurality of microphones, sound source signals emitted from a plurality of sound sources, and that estimates, for each microphone, a sound source presence posterior probability for each sound source, using a model of the observed signal which assumes that the sound pressures of the signals from the plurality of sound sources observed at each of the plurality of microphones differ;
a model parameter estimation unit that takes as input the multi-channel observed signals and the sound source presence posterior probability, and estimates model parameters of the observed signal; and
an output sound estimation unit that takes as input the multi-channel observed signals, the sound source presence posterior probability, and the model parameters, and estimates and outputs, for each microphone, an arriving signal from each sound source.

The sound source separation device according to claim 1, wherein
the model of the observed signal is a model defined such that the signal $o_{t,f}^{(m)}$ observed by the m-th microphone (where t is a time index and f is a frequency index) is equal to the arriving signal with the maximum sound pressure among the arriving signals that arrive from each of the plurality of sound sources and are observed at the m-th microphone,
the model of the arriving signal is a probabilistic model in which the arriving signal $x_{t,f}^{(i,m)}$ from the i-th sound source observed by the m-th microphone is defined by
the clean speech signal $s_{t,f}^{(i)}$ of the i-th sound source,
a transfer function $\beta_{f}^{(i,m)}$ corresponding to the sound pressure of the signal arriving at the m-th microphone from the i-th sound source, and
an error term $e_{t,f}^{(i,m)}$ corresponding to the difference between the signal arriving at the m-th microphone from the i-th sound source and the signal from the i-th sound source observed at the m-th microphone, and
the model parameters are the clean speech signal $s_{t,f}^{(i)}$ of the sound source, the transfer function $\beta_{f}^{(i,m)}$, and the variance $\sigma_{t,f}^{(i,m)}$ of the error term $e_{t,f}^{(i,m)}$.

The sound source separation device according to claim 2, further comprising a storage unit and an iterative processing unit, wherein
the storage unit stores the model parameter $\hat{\theta}^{(i)}$ of the observed signal,
the microphone-specific sound source presence posterior probability estimation unit takes as input the observed signal $o_{t,f}^{(m)}$ of each microphone m and the model parameter $\hat{\theta}^{(i)}$ stored in the storage unit, and obtains the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ for each microphone m and each sound source i, based on the observed signal $o_{t,f}^{(m)}$ and the model parameter $\hat{\theta}^{(i)}$ of the observed signal when the observed signal $o_{t,f}^{(m)}$ of each microphone m and the model parameter $\hat{\theta}^{(i)}$ are applied to the model of the observed signal,
the model parameter estimation unit takes as input the observed signal $o_{t,f}^{(m)}$ of each microphone m, the model parameter $\hat{\theta}^{(i)}$ stored in the storage unit, and the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$, and updates the transfer function $\beta_{f}^{(i,m)}$, the variance $\sigma_{t,f}^{(i,m)}$ of the error term $e_{t,f}^{(i,m)}$, and the clean speech signal $s_{t,f}^{(i)}$ stored in the storage unit so as to increase a weighted sum, taken over all observed signals, of the joint probability of the observed signal $o_{t,f}^{(m)}$ and the model parameter $\hat{\theta}^{(i)}$ of the observed signal, obtained when the observed signal $o_{t,f}^{(m)}$ of each microphone m and the model parameter $\hat{\theta}^{(i)}$ are applied to the model of the observed signal, multiplied by a weight corresponding to the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$,
the iterative processing unit repeats the processing of the microphone-specific sound source presence posterior probability estimation unit and the model parameter estimation unit until a predetermined criterion is satisfied, and
the output sound estimation unit takes as input the multi-channel observed signals, the sound source presence posterior probability, and the parameter $\hat{\theta}^{(i)}$ stored in the storage unit, and obtains the arriving signal $x_{t,f}^{(i,m)}$ for each sound source i.

A sound source separation method comprising:
a microphone-specific sound source presence posterior probability estimation step of taking as input multi-channel observed signals obtained by picking up, with a plurality of microphones, sound source signals emitted from a plurality of sound sources, and estimating, for each microphone, a sound source presence posterior probability for each sound source, using a model of the observed signal which assumes that the sound pressures of the signals from the plurality of sound sources observed at each of the plurality of microphones differ;
a model parameter estimation step of taking as input the multi-channel observed signals and the sound source presence posterior probability, and estimating model parameters of the observed signal; and
an output sound estimation step of taking as input the multi-channel observed signals, the sound source presence posterior probability, and the model parameters, and estimating and outputting, for each microphone, an arriving signal from each sound source.

The sound source separation method according to claim 4, wherein
the model of the observed signal is a model defined such that the signal $o_{t,f}^{(m)}$ observed by the m-th microphone (where t is a time index and f is a frequency index) is equal to the arriving signal with the maximum sound pressure among the arriving signals that arrive from each of the plurality of sound sources and are observed at the m-th microphone,
the model of the arriving signal is a probabilistic model in which the arriving signal $x_{t,f}^{(i,m)}$ from the i-th sound source observed by the m-th microphone is defined by
the clean speech signal $s_{t,f}^{(i)}$ of the i-th sound source,
a transfer function $\beta_{f}^{(i,m)}$ corresponding to the sound pressure of the signal arriving at the m-th microphone from the i-th sound source, and
an error term $e_{t,f}^{(i,m)}$ corresponding to the difference between the signal arriving at the m-th microphone from the i-th sound source and the signal from the i-th sound source observed at the m-th microphone, and
the model parameters are the clean speech signal $s_{t,f}^{(i)}$ of the sound source, the transfer function $\beta_{f}^{(i,m)}$, and the variance $\sigma_{t,f}^{(i,m)}$ of the error term $e_{t,f}^{(i,m)}$.

The sound source separation method according to claim 5, further comprising an iterative processing step, wherein
the microphone-specific sound source presence posterior probability estimation step takes as input the observed signal $o_{t,f}^{(m)}$ of each microphone m and the model parameter $\hat{\theta}^{(i)}$ stored in a storage unit, and obtains the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$ for each microphone m and each sound source i, based on the observed signal $o_{t,f}^{(m)}$ and the model parameter $\hat{\theta}^{(i)}$ of the observed signal when the observed signal $o_{t,f}^{(m)}$ of each microphone m and the model parameter $\hat{\theta}^{(i)}$ are applied to the model of the observed signal,
the model parameter estimation step takes as input the observed signal $o_{t,f}^{(m)}$ of each microphone m, the model parameter $\hat{\theta}^{(i)}$ stored in the storage unit, and the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$, and updates the transfer function $\beta_{f}^{(i,m)}$, the variance $\sigma_{t,f}^{(i,m)}$ of the error term $e_{t,f}^{(i,m)}$, and the clean speech signal $s_{t,f}^{(i)}$ stored in the storage unit so as to increase a weighted sum, taken over all observed signals, of the joint probability of the observed signal $o_{t,f}^{(m)}$ and the model parameter $\hat{\theta}^{(i)}$ of the observed signal, obtained when the observed signal $o_{t,f}^{(m)}$ of each microphone m and the model parameter $\hat{\theta}^{(i)}$ are applied to the model of the observed signal, multiplied by a weight corresponding to the sound source presence posterior probability $\hat{M}_{t,f}^{(i,m)}$,
the iterative processing step repeats the microphone-specific sound source presence posterior probability estimation step and the model parameter estimation step until a predetermined criterion is satisfied, and
the output sound estimation step takes as input the multi-channel observed signals, the sound source presence posterior probability, and the parameter $\hat{\theta}^{(i)}$ stored in the storage unit, and obtains the arriving signal $x_{t,f}^{(i,m)}$ for each sound source i.

A program for causing a computer to execute the sound source separation method according to any one of claims 4 to 6.