JP2016156944A

JP2016156944A - Model estimation device, target sound enhancement device, model estimation method, and model estimation program

Info

Publication number: JP2016156944A
Application number: JP2015034398A
Authority: JP
Inventors: Nobutaka Ito (伊藤信貴); Akiko Araki (荒木章子); Tomohiro Nakatani (中谷智広)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-02-24
Filing date: 2015-02-24
Publication date: 2016-09-01
Anticipated expiration: 2035-02-24
Also published as: JP6290803B2

Abstract

PROBLEM TO BE SOLVED: To achieve more accurate sound source separation even when the reverberation time is long compared with the frame length.

SOLUTION: A model estimation device 10A comprises a storage unit that stores the parameters of a model of a reverberant mixed signal, including regression matrices that represent the characteristics of the reverberation of sounds output from a plurality of sound sources. The model estimation device 10A estimates a reverberation-free mixed signal by linear prediction, using an observation signal obtained by observing sound with a plurality of microphones and the regression matrices stored in the storage unit. From the estimated mixed signal, the model estimation device 10A calculates, for each time-frequency point, the posterior probability of each sound source to which that point may belong. The model estimation device 10A estimates the parameters from the observation signal, the estimated mixed signal, the posterior probabilities, and the parameters stored in the storage unit, and updates the stored parameters with the estimates. The model estimation device 10A repeats the signal estimation, clustering, and parameter estimation until a predetermined condition is met.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a model estimation device, a target sound enhancement device, a model estimation method, and a model estimation program.

Sound source separation is a conventional technique for target sound enhancement. It estimates each sound source signal from a mixture of several source signals captured by a plurality of microphones. Separation based on clustering and separation based on independent component analysis are particularly well known. The clustering-based approach is described below as the conventional technique. In the following, a vector A is written "vector A" and a scalar A is written simply "A". Unless otherwise noted, signals are represented in the time-frequency domain. The time frame index is t ∈ {1, 2, ..., T}, where T is the total number of frames, and the frequency bin index is f ∈ {1, 2, ..., F}, where F is the total number of frequency bins at or below the Nyquist frequency.

The time-frequency-domain representation of a signal is obtained by applying a time-frequency transform, such as the short-time Fourier transform, to its time-domain representation. Conversely, the time-domain representation is obtained by applying the inverse of that transform, such as the inverse short-time Fourier transform, to the time-frequency-domain representation.
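As a minimal sketch of the forward transform described above, the following NumPy code computes a short-time Fourier transform and returns T frames by F bins, where F counts the bins from 0 up to the Nyquist frequency. The function name `stft`, the Hann window, the frame length of 256 samples, and the hop of 128 samples are illustrative assumptions, not values taken from this document. Indices are 0-based here, whereas the text uses 1-based t and f.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Return a (T, F) array of complex STFT coefficients.

    T is the number of frames and F = frame_len // 2 + 1 is the number of
    frequency bins from 0 up to the Nyquist frequency.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames, one row per frame
    frames = np.stack([signal[t * hop : t * hop + frame_len]
                       for t in range(n_frames)])
    # Window each frame and take the one-sided FFT along the frame axis
    return np.fft.rfft(frames * window, axis=1)

fs = 16000                                   # sampling rate in Hz (assumed)
time = np.arange(fs) / fs                    # one second of samples
tone = np.sin(2 * np.pi * 1000.0 * time)     # 1 kHz test tone
Y = stft(tone)
peak_bin = int(np.argmax(np.abs(Y[0])))      # strongest bin in the first frame
```

With these settings each bin spans 16000 / 256 = 62.5 Hz, so the 1 kHz tone falls on bin 16. The inverse transform (overlap-add of inverse FFTs) is omitted from this sketch.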

Assume that signals from N sound sources (N is a natural number) are observed with M microphones (M is a natural number). The mixed signal observed by the m-th microphone (1 ≤ m ≤ M) is denoted y^(m)_tf, and the mixed signals observed by the M microphones are written collectively as the mixed signal vector y_tf, as in equation (1) below.

In equation (1), ·^T denotes the transpose of ·. When the reverberation time is sufficiently short compared with the frame length, the mixed signal vector y_tf can be modeled by equation (2) below.

In equation (2), c^(n)_tf denotes the n-th sound source signal. The vector h^(n)_f in equation (2) is defined by equation (3) below, in which h^(m,n)_f denotes the time-invariant transfer function from the n-th sound source to the m-th microphone.
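The equation images for (1) to (3) are not reproduced in this text. A reconstruction from the definitions given above, offered as an assumption rather than as the patent's exact typesetting, is:

```latex
\begin{align}
\mathbf{y}_{tf} &= \left[\, y^{(1)}_{tf},\ \ldots,\ y^{(M)}_{tf} \,\right]^{T} \tag{1}\\
\mathbf{y}_{tf} &= \sum_{n=1}^{N} \mathbf{h}^{(n)}_{f}\, c^{(n)}_{tf} \tag{2}\\
\mathbf{h}^{(n)}_{f} &= \left[\, h^{(1,n)}_{f},\ \ldots,\ h^{(M,n)}_{f} \,\right]^{T} \tag{3}
\end{align}
```

Equation (2) is the usual narrowband instantaneous-mixture form: each source contributes its signal scaled by one complex gain per microphone, which is valid only when the room response fits within a single frame.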

The vector h^(n)_f is called the steering vector and carries information about the position of the n-th sound source. For simplicity, the following assumes that there are M = 2 microphones, that reverberation and echoes are negligible, and that each source signal propagates as a plane wave. In this case, the vector h^(n)_f can be modeled by equation (4) below, in which "j" denotes the imaginary unit.

In equation (4), ω_f denotes the angular frequency corresponding to frequency bin f, d^(m,n) denotes the distance between the m-th microphone and the n-th sound source, and c denotes the speed of sound. The inter-microphone time difference of arrival δ^(n) of the n-th sound source is defined by equation (5) below.

Then the inter-microphone phase difference of the steering vector h^(n)_f, namely arg(h^(1,n)_f) − arg(h^(2,n)_f), where arg(·) denotes the argument (phase) of ·, and the inter-microphone time difference of arrival δ^(n) of the n-th sound source are related as shown in equation (6) below.

Clustering-based sound source separation assumes that the observed mixed signal vector y_tf consists of a single source component at each time-frequency point (hereinafter, "sparseness"; see, for example, Non-Patent Document 1). Sparseness is known to hold accurately when the effect of reverberation is small and the source signals are speech. Under the sparseness assumption, let d_tf denote the index of the source contained in the mixed signal vector y_tf at time-frequency point (t, f) (hereinafter, the "active" source). Equation (2) can then be rewritten as equation (7) below.

Under the sparseness assumption, the feature z_tf computed from the observed signal according to the definition in equation (8) below coincides with the inter-microphone time difference of arrival of the active d_tf-th source, as shown in equation (9) below.
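The relationship between the plane-wave steering vector, the sparseness assumption, and the feature can be checked numerically. The sketch below assumes one plausible form of the missing equations: a pure-delay steering vector exp(−j ω_f d^(m,n) / c), the sign convention δ^(n) = (d^(2,n) − d^(1,n)) / c, and a feature equal to the inter-microphone phase difference divided by ω_f. These forms, the distances, and the frequencies are illustrative assumptions and may differ in sign or normalization from the patent's equations (4) to (9).

```python
import numpy as np

c = 340.0                        # speed of sound in m/s
d = np.array([[1.00, 1.03],      # d[m, n]: distance from microphone m to source n, in m
              [1.02, 1.00]])
# Assumed sign convention for the time difference of arrival of each source
delta = (d[1] - d[0]) / c

freqs = np.array([250.0, 500.0, 1000.0])   # low enough that |omega * delta| < pi
omega = 2 * np.pi * freqs

def steering(n):
    """Plane-wave model: a pure delay from source n to each microphone; shape (F, 2)."""
    return np.exp(-1j * omega[:, None] * d[:, n][None, :] / c)

rng = np.random.default_rng(0)
active = np.array([0, 1, 0])     # index of the single active source in each bin
source = rng.standard_normal(3) + 1j * rng.standard_normal(3)

# Sparse mixture: each bin contains only its active source
y = np.stack([steering(n)[f] * source[f] for f, n in enumerate(active)])

# Feature: inter-microphone phase difference divided by angular frequency
z = np.angle(y[:, 0] * np.conj(y[:, 1])) / omega
```

The source amplitude and phase cancel in the product y1 · conj(y2), so z depends only on the geometry of the active source. At higher frequencies or larger microphone spacings the phase wraps around ±π and the simple division by ω_f no longer recovers the delay.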

Sound source separation can therefore be achieved by clustering z_tf. The clustering can be performed with techniques such as mixture model fitting or k-means clustering (see, for example, Non-Patent Document 2).
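As an illustration of the clustering step, the following sketch runs plain k-means on scalar features. The helper `kmeans_1d`, the quantile-based initialization, and the synthetic delays are hypothetical choices made for this example; the cited literature uses more elaborate clustering.

```python
import numpy as np

def kmeans_1d(z, n_clusters, n_iter=20):
    """Plain k-means on scalar features; returns (centroids, labels)."""
    # Initialise the centroids at evenly spaced quantiles of the data
    centroids = np.quantile(z, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every feature
        labels = np.argmin(np.abs(z[:, None] - centroids[None, :]), axis=1)
        # Update step: mean of the features assigned to each centroid
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = z[labels == k].mean()
    return centroids, labels

rng = np.random.default_rng(0)
true_delay = np.array([-1e-4, 2e-4])           # hypothetical delays in seconds
truth = rng.integers(0, 2, size=400)           # active source per time-frequency point
z = true_delay[truth] + 1e-5 * rng.standard_normal(400)

centroids, labels = kmeans_1d(z, n_clusters=2)
mask_source0 = labels == 0                     # binary mask selecting source 0
```

Applying `mask_source0` to the observed spectrogram and inverting the transform would give a separated signal for that source, which is the time-frequency masking idea behind clustering-based separation.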

Non-Patent Document 1: O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. SP, vol. 52, no. 7, pp. 1830-1847, Jul. 2004.
Non-Patent Document 2: S. Araki, H. Sawada, R. Mukai, and S. Makino, "Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors," Signal Processing, vol. 87, no. 8, pp. 1833-1847, Aug. 2007.
Non-Patent Document 3: Nobutaka Ito, Akiko Araki, Keisuke Kinoshita, and Tomohiro Nakatani, "Permutation-free clustering method for underdetermined blind source separation based on source location information," IEICE Transactions (in Japanese), vol. J97-A, no. 4, pp. 234-246, Apr. 2014.
Non-Patent Document 4: T. Yoshioka, T. Nakatani, M. Miyoshi, and H.G. Okuno, "Blind separation and dereverberation of speech mixtures by joint optimization," IEEE Trans. ASLP, vol. 19, no. 1, pp. 69-84, Jan. 2011.
Non-Patent Document 5: N.Q.K. Duong, E. Vincent, and R. Gribonval, "Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE Trans. ASLP, vol. 18, no. 7, pp. 1830-1840, Sep. 2010.
Non-Patent Document 6: A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
Non-Patent Document 7: H. Sawada, S. Araki, and S. Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Trans. ASLP, vol. 19, no. 3, pp. 516-527, Mar. 2011.

However, the conventional technique above presupposes that the reverberation time is sufficiently short compared with the frame length. In the many real environments where this does not hold (for example, conference rooms), sound source separation performance degrades.

An example of the embodiments disclosed in the present application has been made in view of the above, and aims to achieve more accurate sound source separation even when the reverberation time is long compared with the frame length.

In an example of the embodiments of the present application, a model estimation device includes a storage unit that stores the parameters of a model of a reverberant mixed signal, including regression matrices that represent the characteristics of the reverberation of sounds output from a plurality of sound sources. The model estimation device estimates a reverberation-free mixed signal by linear prediction, using observation signals obtained by observing sound with a plurality of microphones and the regression matrices stored in the storage unit. The model estimation device clusters the estimated mixed signal into one cluster per sound source according to the source to which each time-frequency point belongs, and calculates the posterior probability corresponding to each cluster from the parameters stored in the storage unit. The model estimation device estimates the parameters from the estimated mixed signal and the calculated posterior probabilities, and updates the parameters stored in the storage unit with the estimates. The model estimation device repeats the signal estimation, clustering, and parameter estimation until a predetermined condition is satisfied.

According to an example of the embodiments disclosed in the present application, more accurate sound source separation can be achieved even when, for example, the reverberation time is long compared with the frame length.

FIG. 1 is a diagram illustrating an example of the configuration of the model estimation device according to Embodiment 1.
FIG. 2 is a flowchart illustrating an example of the processing procedure of the model estimation device according to Embodiment 1.
FIG. 3 is a diagram illustrating an example of the configuration of the model estimation device according to Embodiment 2.
FIG. 4 is a diagram illustrating an example of the configuration of the model estimation device according to Embodiment 3.
FIG. 5 is a flowchart illustrating an example of the processing procedure of the model estimation device according to Embodiment 3.
FIG. 6 is a diagram illustrating an example of the configuration of the target sound enhancement device according to Embodiment 4.
FIG. 7 is a flowchart illustrating an example of the processing procedure of the target sound enhancement device according to Embodiment 4.
FIG. 8 is a diagram illustrating an example of the effect of Embodiment 4.
FIG. 9 is a diagram illustrating an example of the effect of Embodiment 4.
FIG. 10 is a diagram illustrating an example of a computer that implements the model estimation device and the target sound enhancement device by executing a program.

[Embodiment]
Embodiments of the model estimation device, target sound enhancement device, model estimation method, and model estimation program disclosed in the present application are described below. The following embodiments are merely examples and do not limit the technology disclosed in the present application. The embodiments may also be combined as appropriate to the extent that they do not contradict one another.

In the following embodiments, a vector A is written "vector A", a scalar A is written simply "A", and a set A is written "set A". A function f of vector A is written f(vector A). For a vector or scalar A, "~A" denotes the symbol A with "~" written directly above it, "^A" denotes the symbol A with "^" written directly above it, and "~^A" denotes the symbol A with "^" written directly above it and "~" written above that. For a vector or scalar A, A^T denotes the transpose of A. For a matrix A, A^-1 denotes its inverse, det A its determinant, and tr A its trace (the sum of its diagonal elements). For a matrix A, A^H denotes its Hermitian transpose and A^* its complex conjugate. For a set A, #A denotes the number of its elements. exp(·) is the exponential function and ln(·) is the logarithmic function.

[Embodiment 1]
For Embodiment 1, the theoretical background is described first, followed by one aspect of the embodiment.

<Theoretical Background of Embodiment 1>
In Embodiment 1, signals from N sound sources (N is a natural number) are observed with M microphones (M is a natural number) under reverberation. The reverberant mixed signal observed by the m-th microphone (1 ≤ m ≤ M) is denoted y^(m)_tf, and the mixed signals observed by the M microphones are written collectively as the mixed signal vector y_tf, as in equation (10) below.

The model estimation device of Embodiment 1 fits the reverberant mixed signal vector y_tf to a probability model representing the distribution of y_tf and estimates the parameters of that model. The following first describes the probability model representing the distribution of the mixed signal vector y_tf and then derives an algorithm for estimating its parameters. The theoretical background is given in turn for the modeling of the reverberant mixed signal vector and for the derivation of the parameter estimation algorithm.

(Modeling of the reverberant mixed signal vector in Embodiment 1)
Let the vector s^(n)_tf ∈ set C^M denote the vector of signals that would be observed by the M microphones if only the n-th sound source (1 ≤ n ≤ N) were present, with no reverberation and no other sources (hereinafter, the "reverberation-free microphone image of the n-th sound source"). The vector s^(n)_tf is an M-dimensional vector with complex elements. Let x_tf ∈ set C^M denote the vector of mixed signals that would be observed by the M microphones if there were no reverberation (hereinafter, the "reverberation-free mixed signal vector"). If the reverberation-free mixed signal vector x_tf is assumed to be sparse, it can be modeled by equation (11) below.

Conventional clustering-based sound source separation assumes that the reverberant mixed signal vector y_tf is sparse, whereas Embodiment 1 assumes that the reverberation-free mixed signal vector x_tf is sparse. This allows accurate modeling even under reverberation. Based on the model of x_tf in equation (11), the distribution of the reverberation-free mixed signal vector x_tf is modeled by the mixture distribution of equation (12) below.

In equation (12), p(vector s^(n)_tf | Θ) denotes the probability model representing the distribution of the vector s^(n)_tf, the reverberation-free microphone image of the n-th sound source. P(d_tf | Θ) is called the mixture weight and denotes the probability model of the active source index d_tf. Θ denotes the set of parameters of the probability model; the set Θ is defined later.
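The equation images for (11) and (12) are not reproduced in this text. A reconstruction consistent with the symbols defined above, offered as an assumption, is:

```latex
\begin{align}
\mathbf{x}_{tf} &= \mathbf{s}^{(d_{tf})}_{tf} \tag{11}\\
p(\mathbf{x}_{tf} \mid \Theta)
  &= \sum_{n=1}^{N} P(d_{tf} = n \mid \Theta)\;
     p\!\left(\mathbf{s}^{(n)}_{tf} = \mathbf{x}_{tf} \,\middle|\, \Theta\right) \tag{12}
\end{align}
```

Equation (11) states sparseness for the reverberation-free signal: at each time-frequency point the mixture equals the microphone image of the one active source. Marginalizing over the unknown index d_tf then gives the mixture form of equation (12).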

The reverberant mixed signal vector y_tf, on the other hand, can be modeled as in equation (13) below, by a multichannel autoregressive process driven by the reverberation-free mixed signal vector x_tf. This modeling of y_tf is described in detail in Reference 1: T. Yoshioka, T. Nakatani, M. Miyoshi, and H.G. Okuno, "Blind separation and dereverberation of speech mixtures by joint optimization," IEEE Trans. ASLP, vol. 19, no. 1, pp. 69-84, Jan. 2011.

In equation (13), k denotes the tap index, K denotes the number of taps, the matrix G_kf ∈ set C^(M×M) is an M-by-M regression matrix with complex elements, and the matrix G^H_kf is the Hermitian transpose of the regression matrix G_kf. Δ denotes a predetermined delay, preferably set to correspond to the time span over which the source signal is autocorrelated (about 20 to 30 ms for speech). Introducing the delay Δ prevents the autocorrelation of the source signal itself from being removed when dereverberation is performed with the estimated regression matrices G_kf. For convenience, the mixed signal vector is defined as y_tf = 0 (the zero vector) for t < 0. Expressing the model of equation (13) as a probability model, for convenience, yields equation (14) below, in which δ is the Dirac delta function.
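The following sketch illustrates, for a single frequency bin, a multichannel autoregressive model of this kind and its inversion by linear prediction. The exact lag indexing of equation (13) is not available in this text, so the code assumes that tap k (0-based) refers to the frame `delay + k` frames in the past; the function names and test dimensions are likewise hypothetical.

```python
import numpy as np

def reverberate(x, G, delay):
    """Generate y[t] = x[t] + sum_k G[k]^H y[t - delay - k] for one frequency bin.

    x: (T, M) reverberation-free mixture, G: (K, M, M) regression matrices.
    Frames before the start of the signal are treated as zero vectors.
    """
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        y[t] = x[t]
        for k in range(G.shape[0]):
            past = t - delay - k
            if past >= 0:
                y[t] += G[k].conj().T @ y[past]
    return y

def dereverberate(y, G, delay):
    """Linear prediction: subtract the part of y[t] predicted from past frames."""
    x_hat = y.copy()
    for t in range(y.shape[0]):
        for k in range(G.shape[0]):
            past = t - delay - k
            if past >= 0:
                x_hat[t] -= G[k].conj().T @ y[past]
    return x_hat

rng = np.random.default_rng(0)
T, M, K, delay = 50, 2, 3, 2
x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
G = 0.05 * (rng.standard_normal((K, M, M)) + 1j * rng.standard_normal((K, M, M)))
y = reverberate(x, G, delay)
x_hat = dereverberate(y, G, delay)
```

When the regression matrices are known exactly, the prediction residual `x_hat` equals `x`. In the method described here the matrices are unknown and are estimated jointly with the other parameters, so the residual is only an estimate of the reverberation-free mixture.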

Using the probability models of equations (12) and (14), the probability model representing the distribution of the reverberant mixed signal vector y_tf can be derived as in equations (15) and (16) below.

Note that the derivation of the probability model of equation (16), which represents the distribution of the reverberant mixed signal vector y_tf, makes no assumption about the specific form of either the probability model p(vector s^(n)_tf | Θ), which represents the distribution of each source's reverberation-free microphone image s^(n)_tf, or the probability model P(d_tf | Θ) of the active source index d_tf. In other words, whatever probability distributions are chosen for these two models, the probability model representing the distribution of the reverberant mixed signal vector y_tf is given by equation (16).

Equation (16) shows that specifying the probability model for the distribution of the reverberant mixed signal vector y_tf reduces to specifying the probability model P(d_tf | Θ) of the active source index d_tf and the probability model p(vector s^(n)_tf | Θ) representing the distribution of each source's reverberation-free microphone image s^(n)_tf. These models can be built from arbitrary probability distributions; the choices made in Embodiment 1 are described below.

The probability model p(vector s^(n)_tf | Θ), which represents the distribution of the reverberation-free microphone image s^(n)_tf of the n-th sound source, can be modeled, for example, by the time-varying Gaussian distribution of equation (17) below. This modeling is described in detail in Reference 2: N.Q.K. Duong, E. Vincent, and R. Gribonval, "Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE Trans. ASLP, vol. 18, no. 7, pp. 1830-1840, Sep. 2010.

In equation (17), φ^(n)_tf is a parameter that models the time-varying power spectrum of the vector s^(n)_tf, and the matrix B^(n)_f is a parameter that models the time-invariant spatial covariance matrix of the vector s^(n)_tf. The right-hand side of equation (17) is the probability density function of the complex Gaussian distribution given by equation (18) below, which is the density of a complex Gaussian random vector α with mean vector μ and covariance matrix Σ. In equation (18), π is the circle constant and det(πΣ) denotes the determinant of the matrix πΣ.
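A minimal sketch of this density in log form is given below. It uses the standard circular complex Gaussian density, 1/det(πΣ) · exp(−(α − μ)^H Σ^-1 (α − μ)), and assumes that equation (17) is the zero-mean case with covariance φ^(n)_tf B^(n)_f, as in the full-rank spatial covariance model of Reference 2. The function names are hypothetical.

```python
import numpy as np

def complex_gaussian_logpdf(a, mean, cov):
    """Log-density of a circular complex Gaussian.

    Computes -ln det(pi * cov) - (a - mean)^H cov^{-1} (a - mean).
    """
    diff = a - mean
    _, logdet = np.linalg.slogdet(np.pi * cov)
    # Solve a linear system instead of forming the explicit inverse
    quad = np.real(np.conj(diff) @ np.linalg.solve(cov, diff))
    return -logdet - quad

def source_image_logpdf(s, phi, B):
    """Assumed time-varying model: zero mean, covariance phi * B."""
    return complex_gaussian_logpdf(s, np.zeros_like(s), phi * B)

s = np.array([1 + 0j, 1j])
value = source_image_logpdf(s, 1.0, np.eye(2, dtype=complex))
```

Working with the log-density avoids numerical underflow when many time-frequency points are combined, and `slogdet` avoids overflow in the determinant for larger M.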

In Embodiment 1, the probability model P(d_tf | Θ) of the active source index d_tf is modeled by equation (19) below, using the frequency-dependent mixture weights α^(n)_f.

The specific form in Embodiment 1 of the probability model representing the distribution of the reverberant mixed signal vector y_tf is obtained as equation (20) below by substituting into the general equation (16) both equation (17), the specific form of the probability model p(vector s^(n)_tf | Θ) for the reverberation-free microphone image s^(n)_tf of the n-th sound source, and equation (19), the specific form of the probability model P(d_tf | Θ) of the active source index d_tf.
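To illustrate how such a mixture is evaluated at one time-frequency point, the sketch below combines mixture weights with zero-mean complex Gaussian components whose covariance is φ · B, and also returns the posterior probability of each source by Bayes' rule. It takes as input a dereverberated vector (the linear-prediction residual). The function names, argument layout, and the Gaussian form are assumptions for illustration; the patent's equation (20) is not reproduced in this text.

```python
import numpy as np

def log_gauss(x, cov):
    """Zero-mean circular complex Gaussian log-density."""
    _, logdet = np.linalg.slogdet(np.pi * cov)
    return -logdet - np.real(np.conj(x) @ np.linalg.solve(cov, x))

def mixture_loglik_and_posterior(x_hat, alpha, phi, B):
    """Evaluate the mixture at one time-frequency point.

    x_hat: (M,) dereverberated vector, alpha: (N,) mixture weights,
    phi: (N,) source powers, B: (N, M, M) spatial covariance matrices.
    Returns the log-likelihood and the (N,) posterior source probabilities.
    """
    log_terms = np.array([np.log(alpha[n]) + log_gauss(x_hat, phi[n] * B[n])
                          for n in range(len(alpha))])
    peak = log_terms.max()
    loglik = peak + np.log(np.sum(np.exp(log_terms - peak)))  # log-sum-exp
    return loglik, np.exp(log_terms - loglik)

B = np.stack([np.eye(2, dtype=complex), np.eye(2, dtype=complex)])
x_hat = np.array([1 + 1j, 0.5j])
loglik, posterior = mixture_loglik_and_posterior(
    x_hat, np.array([0.3, 0.7]), np.array([1.0, 1.0]), B)
```

When both components have the same covariance, as in this usage example, the data cannot distinguish the sources and the posterior equals the mixture weights.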

Here, the parameter set Θ is specifically defined by equation (21) below.

(Derivation of the parameter estimation algorithm of Embodiment 1)
Based on equation (16), the probability model of the reverberant mixed signal vector y_tf, the parameter set Θ can be estimated by, for example, the maximum likelihood method or the MAP (maximum a posteriori) estimation method.

The maximum likelihood method takes the likelihood p(Y | Θ) of the reverberant mixed signal vectors y_tf as the objective function and obtains the estimate of the parameter set, Θ = arg max_Θ {p(Y | Θ)}, by maximizing that likelihood. The set Y is defined as Y := {vector y_tf}_tf := {vector y_tf | ∀ t, f}.

The MAP estimation method, by contrast, takes the posterior probability p(Θ | Y) of the parameter set Θ as the objective function and obtains the estimate of the parameter set, Θ = arg max_Θ {p(Θ | Y)}, by maximizing that posterior. By Bayes' theorem, p(Θ | Y) = {p(Y | Θ) p(Θ)} / p(Y), and p(Y) does not depend on Θ, so the MAP estimate of the parameter set Θ can be rewritten as equation (22) below, in which p(Θ) denotes the prior probability of the parameter set Θ.

The likelihood p(Y | Θ) of the reverberant mixed signal vectors y_tf is expressed by equation (23) below, using the probability model of the distribution of y_tf that appears on the left-hand side of equation (15).
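The equation images for (22) and (23) are not reproduced in this text. The MAP step follows directly from Bayes' theorem as stated above; the factorization of the likelihood over time frames and frequency bins is an assumption based on the autoregressive structure of the model, in which each frame is conditioned on the preceding frames of the same bin:

```latex
\begin{align}
\hat{\Theta}
  &= \arg\max_{\Theta}\, p(\Theta \mid Y)
   = \arg\max_{\Theta}\, \frac{p(Y \mid \Theta)\, p(\Theta)}{p(Y)}
   = \arg\max_{\Theta}\, p(Y \mid \Theta)\, p(\Theta) \tag{22}\\
p(Y \mid \Theta)
  &= \prod_{f=1}^{F} \prod_{t=1}^{T}
     p\!\left(\mathbf{y}_{tf} \,\middle|\, \mathbf{y}_{t-1,f}, \ldots, \mathbf{y}_{1,f},\, \Theta\right) \tag{23}
\end{align}
```

In practice the logarithm of the objective, ln p(Y | Θ) + ln p(Θ), is maximized instead, which turns the products in (23) into sums and has the same maximizer.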

The prior probability p(Θ) of the parameter set Θ can be modeled with any probability model; for example, a uniform distribution can be used. With a uniform prior, the MAP estimate of Θ based on equation (22) coincides with the maximum likelihood estimate. Alternatively, a Dirichlet distribution such as equation (24) below is assumed as the prior distribution of the mixture weights.

A uniform prior distribution may then be assumed for the parameters other than the mixture weights. In this case, the prior distribution P(Θ) of the parameter set Θ is proportional to the prior distribution of the mixture weights given by equation (24). Here, ψ in equation (24) is a predetermined constant called a hyperparameter. ψ can be set to any positive number; for example, ψ = 600 may be used.

In the following, as an example of an algorithm for estimating the parameter set Θ by the MAP estimation method based on equation (22), an EM (Expectation-Maximization) algorithm is derived in which the set D := {d_tf} is regarded as hidden variables. The EM algorithm is described in detail in Reference 3: A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.

The EM algorithm repeats the E step and the M step defined below until a convergence condition is satisfied. In the E step, the Q function Q(Θ; Θ′) defined by equation (25) below is computed.

Here, ln P(Y, D|Θ) denotes the log-likelihood of the complete data set {Y, D}, P(D|Y, Θ′) denotes the posterior probability of the set D given the current estimate Θ′ of the parameter set Θ, and ⟨·⟩_{P(D|Y,Θ′)} denotes the expectation with respect to P(D|Y, Θ′).

In the M step, on the other hand, the parameter set Θ is updated by maximizing the Q function. At each iteration of the EM algorithm, the evaluation function p(Y|Θ)p(Θ) is guaranteed to be monotonically non-decreasing. To compute the concrete form of the Q function, p(Y, D|Θ) and p(D|Y, Θ′) are obtained as in equations (26) to (30) below.
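The alternation of E and M steps and the monotonic behavior of the evaluation function can be illustrated on a much simpler model than the one in this description. The following is a minimal sketch, assuming a one-dimensional Gaussian mixture with unit variances and a uniform prior (so the evaluation function reduces to the likelihood); the function names and the toy data are illustrative and do not come from the source.

```python
import math

def log_likelihood(data, weights, means):
    """ln p(Y | Theta) for a 1-D Gaussian mixture with unit variances."""
    total = 0.0
    for y in data:
        total += math.log(sum(
            w * math.exp(-0.5 * (y - m) ** 2) / math.sqrt(2.0 * math.pi)
            for w, m in zip(weights, means)))
    return total

def em_step(data, weights, means):
    """One EM iteration: E step (posteriors), then M step (weights, means)."""
    # E step: posterior probability of each hidden label given current parameters.
    gammas = []
    for y in data:
        joint = [w * math.exp(-0.5 * (y - m) ** 2) for w, m in zip(weights, means)]
        norm = sum(joint)
        gammas.append([j / norm for j in joint])
    # M step: closed-form maximizers of the Q function for this toy model.
    n_comp = len(weights)
    counts = [sum(g[n] for g in gammas) for n in range(n_comp)]
    new_weights = [c / len(data) for c in counts]
    new_means = [sum(g[n] * y for g, y in zip(gammas, data)) / counts[n]
                 for n in range(n_comp)]
    return new_weights, new_means

def run_em(data, weights, means, max_iter=50, tol=1e-9):
    """Repeat EM steps until the log-likelihood gain falls below tol."""
    history = [log_likelihood(data, weights, means)]
    for _ in range(max_iter):
        weights, means = em_step(data, weights, means)
        history.append(log_likelihood(data, weights, means))
        if history[-1] - history[-2] < tol:
            break
    return weights, means, history

data = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.0, 1.9]
weights, means, history = run_em(data, [0.5, 0.5], [-1.0, 1.0])
```

The list `history` records the log-likelihood after each iteration, which makes the non-decreasing property easy to inspect.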

Here, γ⁽ⁿ⁾_tf in equation (30) is defined by equations (31) to (33) below.

For simplicity, however, in equation (33) the current estimates of α⁽ⁿ⁾_f, the regression matrix G_kf, φ⁽ⁿ⁾_tf, and the matrix B⁽ⁿ⁾_f are written simply as α⁽ⁿ⁾_f, G_kf, φ⁽ⁿ⁾_tf, and B⁽ⁿ⁾_f, respectively.

Substituting equations (28) and (30) into equation (25) yields the concrete form of the Q function, as in equations (34) and (35) below.

The update formula for the mixture weight α⁽ⁿ⁾_f is obtained by the method of Lagrange multipliers, taking into account the constraint Σ_{n=1}^{N} α⁽ⁿ⁾_f = 1. The update formulas for φ⁽ⁿ⁾_tf and the matrix B⁽ⁿ⁾_f are obtained by setting to zero the partial derivatives of the Q function in equation (35) with respect to φ⁽ⁿ⁾_tf and with respect to (B⁽ⁿ⁾_f)^*, the complex conjugate of B⁽ⁿ⁾_f.
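The Lagrange multiplier step for the mixture weights can be sketched as follows. The sketch assumes that the Dirichlet prior of equation (24) is symmetric with density proportional to the product over n of (α⁽ⁿ⁾_f)^(ψ−1); under that assumption the constrained maximizer is (Σ_t γ⁽ⁿ⁾_tf + ψ − 1) / (T + N(ψ − 1)). The exact parameterization in the source may differ, and the function name is hypothetical.

```python
def map_mixture_weights(gammas, psi=1.0):
    """MAP mixture weights for one frequency bin.

    gammas[t][n] is the posterior probability that source n is active at
    frame t (each row sums to 1). psi is the assumed symmetric Dirichlet
    hyperparameter; psi = 1 gives the maximum likelihood update.
    The formula is only meaningful for psi >= 1.
    """
    num_frames = len(gammas)
    num_sources = len(gammas[0])
    counts = [sum(row[n] for row in gammas) for n in range(num_sources)]
    denom = num_frames + num_sources * (psi - 1.0)
    return [(c + psi - 1.0) / denom for c in counts]

gammas = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ml_weights = map_mixture_weights(gammas)            # uniform prior
map_weights = map_mixture_weights(gammas, psi=600)  # strong prior toward 1/N
```

With a large ψ the weights are pulled toward 1/N, which is the regularizing effect a value such as ψ = 600 would have under this assumed parameterization.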

For the update formula of the regression matrix G_kf, extracting from the Q function in equation (35) only the terms that depend on G^H_kf, the Hermitian transpose of G_kf, gives equations (36) and (37) below.

Setting the partial derivative of equation (37) with respect to the matrix ~G^H_f to zero and rearranging gives equation (41) below.

Applying the vec operator to both sides of equation (41) and using the Kronecker product formula for matrices A, B, and X given by equation (a) below yields equation (42) below.
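The Kronecker product formula referred to as equation (a) is presumably the standard identity vec(AXB) = (Bᵀ ⊗ A) vec(X), where vec stacks the columns of a matrix. The following sketch checks that identity numerically on small complex matrices; the helper names and the example matrices are illustrative only.

```python
def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in a]

def transpose(a):
    """Plain (non-conjugate) transpose."""
    return [list(col) for col in zip(*a)]

def vec(a):
    """Stack the columns of a into a single list."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]

def kron(a, b):
    """Kronecker product: block (i, j) of the result is a[i][j] * b."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

A = [[1, 2j], [3, 4]]             # 2 x 2
X = [[1, 0, 2], [1j, 1, 1]]       # 2 x 3
B = [[1, 2], [0, 1], [1j, 1]]     # 3 x 2

lhs = vec(matmul(matmul(A, X), B))
rhs = matvec(kron(transpose(B), A), vec(X))
```

The identity turns a matrix equation that is linear in X into an ordinary linear system in vec(X), which is the role it plays in passing from equation (41) to equation (42).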

In equation (42), vec[a₁ ... a_P] and the matrix ~G^H_f are defined by equations (43) and (44) below, respectively.

Therefore, vec[~G^H_f] is obtained from equation (42) as in equation (45) below.

<One aspect of Embodiment 1>

One aspect of Embodiment 1, based on the theoretical background described above, is described below. In this aspect, the number of sound sources N is assumed to be known.

(Configuration of Model Estimation Device According to Embodiment 1)

FIG. 1 is a diagram showing an example of the configuration of the model estimation device according to Embodiment 1. The model estimation device 10A according to Embodiment 1 includes a dereverberation processing unit 11A and a clustering unit 12A. The dereverberation processing unit 11A includes an initialization unit 11A-1, a covariance matrix update unit 11A-2, a regression matrix update unit 11A-3, and a dereverberation unit 11A-4. The covariance matrix update unit 11A-2, the regression matrix update unit 11A-3, and the mixture weight update unit 12A-2 are examples of a parameter estimation unit. The dereverberation unit 11A-4 is an example of a signal estimation unit. The posterior probability update unit 12A-1 is an example of a posterior probability calculation unit.

The initialization unit 11A-1 first computes initial values of the parameter set Θ. These can be computed, for example, as follows. First, an estimate ^d_tf of the active source index d_tf is computed using a conventional clustering-based source separation technique that does not include a reverberation model. Such a technique is described in detail in Reference 4: Nobutaka Ito, Akiko Araki, Keisuke Kinoshita, and Tomohiro Nakatani, "Permutation-free clustering method for underdetermined blind sound source separation based on sound source position information," IEICE Transactions, vol. J97-A, no. 4, pp. 234-246, Apr. 2014.

Next, the initialization unit 11A-1 initializes each parameter using the estimate ^d_tf according to equations (46) to (49) below. The set C⁽ⁿ⁾_f in equations (46) and (48) is defined by C⁽ⁿ⁾_f := {t | d_tf = n}, and #C⁽ⁿ⁾_f denotes the number of elements of C⁽ⁿ⁾_f. In equation (49), "tr[·]" denotes the trace of the matrix [·].
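The sets C⁽ⁿ⁾_f and their element counts can be built directly from the hard labels. The sketch below handles one frequency bin and assumes that equation (46) initializes the mixture weight as the fraction of frames assigned to source n, that is #C⁽ⁿ⁾_f divided by the number of frames; that normalization and the function names are assumptions for illustration.

```python
def frame_sets(labels_f, num_sources):
    """Return {n: list of frames t with label n} for one frequency bin.

    labels_f[t] is the estimated active source index (1..N) at frame t.
    """
    return {n: [t for t, d in enumerate(labels_f) if d == n]
            for n in range(1, num_sources + 1)}

def initial_weights(labels_f, num_sources):
    """Assumed form of the weight initialization: #C_f^(n) / T."""
    sets = frame_sets(labels_f, num_sources)
    total = len(labels_f)
    return {n: len(frames) / total for n, frames in sets.items()}

labels = [1, 2, 2, 1, 2]          # toy labels for five frames
sets = frame_sets(labels, 2)
weights = initial_weights(labels, 2)
```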

The covariance matrix update unit 11A-2 updates the parameters φ⁽ⁿ⁾_tf and B⁽ⁿ⁾_f of the covariance matrix φ⁽ⁿ⁾_tf B⁽ⁿ⁾_f of the reverberation-free microphone image vector s⁽ⁿ⁾_tf of each sound source n (n = 1, ..., N) according to equations (50) and (51) below, respectively.

The regression matrix update unit 11A-3 updates the regression matrix G_kf according to equations (52) and (53) below.

Here, the matrix ~G_f appearing on the left-hand side of equation (53) and the vector ~y_{t-Δ-1,f} appearing on its right-hand side are defined by equations (54) and (55) below.

The dereverberation unit 11A-4 updates the estimate ^x_tf of the mixed signal vector not including reverberation according to equation (56) below.
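Dereverberation by linear prediction can be sketched for a single frequency bin as follows. The sketch assumes that equation (56) subtracts from the current frame the delayed past frames filtered by the Hermitian-transposed regression matrices, ^x_tf = y_tf − Σ_k G^H_kf y_{t−Δ−k,f}. The indexing convention, the treatment of frames before the start of the signal as zero, and the function name are assumptions.

```python
def dereverberate(y, G, delay):
    """Linear-prediction dereverberation for one frequency bin.

    y[t]   : list of M complex microphone values at frame t
    G[k-1] : M x M regression matrix (list of rows) for tap k = 1..K
    delay  : prediction delay (assumed non-negative integer)
    Frames before the start of the signal are treated as zero.
    """
    num_ch = len(y[0])
    out = []
    for t in range(len(y)):
        x = list(y[t])
        for k, g in enumerate(G, start=1):
            past = t - delay - k
            if past < 0:
                continue
            for m in range(num_ch):
                # m-th entry of G^H y: sum_j conj(G[j][m]) * y_j
                x[m] -= sum(g[j][m].conjugate() * y[past][j]
                            for j in range(num_ch))
        out.append(x)
    return out

# Toy single-channel signal with exponentially decaying reverberation.
y = [[1.0], [0.5], [0.25], [0.125]]
x_hat = dereverberate(y, [[[0.5]]], delay=0)
```

In this toy case a one-tap predictor with coefficient 0.5 removes the decaying tail and leaves only the first frame.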

The clustering unit 12A includes a posterior probability update unit 12A-1 and a mixture weight update unit 12A-2. The posterior probability update unit 12A-1 updates the posterior probability γ⁽ⁿ⁾_tf that the n-th (n = 1, ..., N) source signal is active at the time-frequency point (t, f) according to equation (57) below, where γ⁽ⁿ⁾_tf := P(d_tf = n | y_tf, Θ).
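For a mixture model, a posterior of this kind is the weighted component likelihood normalized over all sources. The sketch below assumes equation (57) has that usual form and works in the log domain to avoid underflow; the per-component log-likelihoods are taken as given inputs, and the function name is hypothetical.

```python
import math

def posteriors(weights, log_liks):
    """Posterior probability of each source at one time-frequency point.

    weights[n]  : mixture weight of source n (must be positive)
    log_liks[n] : log-likelihood of the observation under source n
    """
    scores = [math.log(w) + ll for w, ll in zip(weights, log_liks)]
    peak = max(scores)                      # log-sum-exp stabilization
    unnorm = [math.exp(s - peak) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

gamma = posteriors([0.25, 0.75], [math.log(3.0), 0.0])
tiny = posteriors([0.5, 0.5], [-1000.0, -1001.0])
```

The second call shows why the log domain is used: exponentiating −1000 directly would underflow to zero, whereas subtracting the peak first keeps the ratio intact.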

The mixture weight update unit 12A-2 updates the mixture weight α⁽ⁿ⁾_f according to equation (58) below.

To improve performance, the mixed signal vector y_tf including reverberation may be whitened as shown below, as preprocessing before all other processing of the model estimation device 10A.

In Embodiment 1, the posterior probability update unit 12A-1 of the clustering unit 12A computes the posterior probability γ⁽ⁿ⁾_tf that the n-th (n = 1, ..., N) source signal is active at the time-frequency point (t, f) based on equation (57). However, this is not a limitation; γ⁽ⁿ⁾_tf may instead be computed with a conventional technique such as k-means clustering.

(Processing of Model Estimation Device According to Embodiment 1)

FIG. 2 is a flowchart showing an example of the processing procedure of the model estimation device according to Embodiment 1. The processing of the model estimation device 10A described below is repeated until a predetermined convergence criterion is satisfied. The criterion may be, for example, "a predetermined number of iterations has been reached, or the difference in parameter values before and after the update by at least one of the posterior probability update unit 12A-1 and the mixture weight update unit 12A-2 is less than a predetermined threshold."
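A convergence test of the kind described can be written as a small predicate. In this sketch the parameters are flattened into lists of real numbers and the "difference" is taken to be the largest absolute change; both choices, as well as the function name, are assumptions, since the source does not fix how the difference is measured.

```python
def has_converged(iteration, max_iterations, old_params, new_params, threshold):
    """Return True if the iteration limit is reached or the update is small.

    old_params, new_params: flat lists of parameter values before and
    after an update (for example posteriors or mixture weights).
    """
    if iteration >= max_iterations:
        return True
    largest_change = max(abs(a - b) for a, b in zip(old_params, new_params))
    return largest_change < threshold
```

For example, `has_converged(3, 100, [0.5, 0.5], [0.5004, 0.4996], 1e-3)` is true because the largest change is below the threshold, while the same call with `[0.6, 0.4]` as the new values is false.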

First, in step S11, the initialization unit 11A-1 computes the initial values of the parameter set Θ based on equations (46) to (49) and stores them in the main storage device of the model estimation device 10A. Next, in step S12, the dereverberation unit 11A-4 updates the estimate ^x_tf of the mixed signal vector not including reverberation according to equation (56), based on the regression matrices G_kf currently stored in the main storage device of the model estimation device 10A (the "dereverberation" process).

Next, in step S13, the posterior probability update unit 12A-1 computes the posterior probability γ⁽ⁿ⁾_tf that the n-th (n = 1, ..., N) source signal is active at the time-frequency point (t, f) according to equation (57) and stores it in the main storage device of the model estimation device 10A. Also in step S13, the mixture weight update unit 12A-2 computes the mixture weight α⁽ⁿ⁾_f according to equation (58) and stores it in the main storage device of the model estimation device 10A (the above constitutes the "clustering" process).

Next, the model estimation device 10A determines whether the convergence criterion is satisfied (step S14). If it is satisfied (step S14: Yes), the model estimation device 10A ends the processing. If it is not satisfied (step S14: No), the model estimation device 10A proceeds to step S15.

In step S15, the covariance matrix update unit 11A-2 computes the parameters φ⁽ⁿ⁾_tf and B⁽ⁿ⁾_f of the covariance matrix φ⁽ⁿ⁾_tf B⁽ⁿ⁾_f of the reverberation-free microphone image vector s⁽ⁿ⁾_tf of each sound source n (n = 1, ..., N) according to equations (50) and (51), respectively, and stores the updated values in the main storage device of the model estimation device 10A. Also in step S15, the regression matrix update unit 11A-3 computes the regression matrix G_kf according to equations (52) and (53), based on the parameters φ⁽ⁿ⁾_tf and B⁽ⁿ⁾_f computed by the covariance matrix update unit 11A-2, and stores the updated value in the main storage device of the model estimation device 10A.

Also in step S15, the posterior probability update unit 12A-1 computes the posterior probability γ⁽ⁿ⁾_tf according to equation (57), based on the parameter set Θ currently stored in the main storage device of the model estimation device 10A and the estimate ^x_tf of the mixed signal vector not including reverberation obtained in the most recently executed step S12, and stores the updated value in the main storage device. Also in step S15, the mixture weight update unit 12A-2 updates the mixture weight α⁽ⁿ⁾_f according to equation (58), based on the posterior probability γ⁽ⁿ⁾_tf computed by the posterior probability update unit 12A-1, and stores the updated value in the main storage device. When the processing of step S15 is complete, the model estimation device 10A returns to step S12.
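The control flow of steps S11 to S15 can be summarized as a loop. The sketch below only reproduces the order of the steps; the five callables stand in for the units described above, and their names, the `state` dictionary, and the stub implementations are placeholders rather than part of the source.

```python
def run_model_estimation(initialize, dereverberate, cluster, converged,
                         update_parameters):
    """Order of operations in the flowchart: S11, then S12-S13-S14(-S15)."""
    state = initialize()                    # S11: initial parameter values
    while True:
        state = dereverberate(state)        # S12: update dereverberated signal
        state = cluster(state)              # S13: posteriors and mixture weights
        if converged(state):                # S14: convergence check
            return state
        state = update_parameters(state)    # S15: parameter updates, back to S12

# Stub units that only record the order in which they are called.
trace = []

def step(name):
    def run(state):
        trace.append(name)
        return state
    return run

def init():
    trace.append("S11")
    return {"passes": 0}

def check(state):
    trace.append("S14")
    state["passes"] += 1
    return state["passes"] >= 2             # stop on the second check

run_model_estimation(init, step("S12"), step("S13"), check, step("S15"))
```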

[Embodiment 2]

Embodiment 2 is described below: first its theoretical background, then one aspect of the embodiment.

<Theoretical Background of Embodiment 2>

When frequency-dependent mixture weights as in equation (19) are used, as in Embodiment 1, the posterior probability serving as the evaluation function has a permutation ambiguity. That is, consider the case where the order of α⁽ⁿ⁾_f, φ⁽ⁿ⁾_tf, and the matrices B⁽ⁿ⁾_f in the parameter set Θ is rearranged by a permutation Π_f on {1, ..., N}, as in equation (62) below.

In this case, equation (63) below holds.

In other words, maximizing the posterior probability alone leads to the permutation problem: the index n in the estimated Θ may correspond to a different sound source at each frequency. The estimated Θ therefore cannot be used as-is for proper target sound enhancement. Accordingly, when a target sound enhancement device is configured based on Embodiment 1, a separate permutation resolution process is required that determines the permutation Π_f so that the index n corresponds to the same sound source regardless of frequency.

In contrast, the model estimation device of Embodiment 2 uses time-dependent mixture weights. As disclosed in Reference 4 above, this makes it possible to estimate the model by maximizing the posterior probability without causing the permutation problem.

The theoretical background of Embodiment 2 is described below, focusing on the differences from Embodiment 1.

(Modeling of the Mixed Signal Vector Including Reverberation in Embodiment 2)

In Embodiment 2, the probability model P(d_tf|Θ) of the active source index d_tf is modeled by equation (64) below, using time-dependent mixture weights α⁽ⁿ⁾_t instead of frequency-dependent mixture weights.

Accordingly, the concrete form of the probability model representing the distribution of the mixed signal vector y_tf including reverberation in Embodiment 2 (see equation (16) above) is obtained as equation (65) below.

Specifically, the parameter set Θ is expressed by equation (66) below.

(Derivation of Parameter Estimation Algorithm of Embodiment 2)

Embodiment 2 is the same as Embodiment 1 in that the posterior probability is maximized by the EM algorithm. In Embodiment 2, however, each iteration of the EM algorithm performs a P (permutation) step in addition to the E step and the M step. In the P step, the permutation is resolved by permuting the covariance matrices φ⁽ⁿ⁾_tf B⁽ⁿ⁾_f among the sound sources at each frequency bin f so that the posterior probability, which is the objective function, is maximized. That is, with Π_f denoting a permutation on {1, ..., N}, the processing of equations (67) to (69) below is performed.
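One way to picture the P step is an exhaustive search over permutations at a single frequency bin. The sketch below assumes that the quantity being maximized is Σ_t ln Σ_n α⁽ⁿ⁾_t p_t(Π(n)), where p_t(m) is the likelihood of frame t under the m-th covariance model; this objective, the brute-force search (feasible only for small N), and the function name are assumptions, not the content of equations (67) to (69).

```python
import itertools
import math

def best_permutation(weights, liks):
    """Pick the source-to-component assignment for one frequency bin.

    weights[t][n] : time-dependent mixture weight of source n at frame t
    liks[t][m]    : likelihood of frame t under covariance model m
    Returns a tuple perm where perm[n] is the model assigned to source n.
    """
    num_sources = len(weights[0])

    def objective(perm):
        return sum(
            math.log(sum(w[n] * l[perm[n]] for n in range(num_sources)))
            for w, l in zip(weights, liks))

    return max(itertools.permutations(range(num_sources)), key=objective)

# Source 0 dominates frame 0 and source 1 dominates frame 1, but the
# covariance models at this bin are stored in the opposite order.
weights = [[0.9, 0.1], [0.1, 0.9]]
swapped_liks = [[0.1, 1.0], [1.0, 0.1]]
perm = best_permutation(weights, swapped_liks)
```

Because the mixture weights are shared across frequencies, aligning each bin to them in this way makes the index n refer to the same source at every frequency.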

The update formulas for the E step and the M step are derived in the same way as in Embodiment 1, so their description is omitted.

<One aspect of Embodiment 2>

One aspect of Embodiment 2, based on the theoretical background described above, is described below. In this aspect, the number of sound sources N is assumed to be known. However, even when the true number of sound sources N₀ is not known, Embodiment 2 can be carried out in the same way as when it is known, by assuming that an upper bound on N₀ is known and setting the assumed number of sound sources N larger than that upper bound.

(Configuration of Model Estimation Device According to Embodiment 2)

FIG. 3 is a diagram showing an example of the configuration of the model estimation device according to Embodiment 2. The model estimation device 10B according to Embodiment 2 includes a dereverberation processing unit 11B and a clustering unit 12B. The dereverberation processing unit 11B includes an initialization unit 11B-1, a covariance matrix update unit 11B-2, a regression matrix update unit 11B-3, and a dereverberation unit 11B-4. The covariance matrix update unit 11B-2, the regression matrix update unit 11B-3, and the mixture weight update unit 12B-2 are examples of a parameter estimation unit. The dereverberation unit 11B-4 is an example of a signal estimation unit. The posterior probability update unit 12B-1 is an example of a posterior probability calculation unit.

The initialization unit 11B-1 first computes initial values of the parameter set Θ. These can be computed, for example, as follows. First, as in Embodiment 1, an estimate ^d_tf of the active source index d_tf is computed using a conventional clustering-based source separation technique that does not include a reverberation model. Next, the initialization unit 11B-1 initializes each parameter using the estimate ^d_tf according to equations (47) to (49) above and equation (70) below. The set ~C⁽ⁿ⁾_t in equation (70) is defined by ~C⁽ⁿ⁾_t := {f | d_tf = n}, and #~C⁽ⁿ⁾_t denotes the number of elements of ~C⁽ⁿ⁾_t.

The covariance matrix update unit 11B-2, the regression matrix update unit 11B-3, and the dereverberation unit 11B-4 are the same as the covariance matrix update unit 11A-2, the regression matrix update unit 11A-3, and the dereverberation unit 11A-4 of Embodiment 1, respectively.

The clustering unit 12B includes a posterior probability update unit 12B-1, a mixture weight update unit 12B-2, and a permutation resolution unit 12B-3. The posterior probability update unit 12B-1 updates the posterior probability γ⁽ⁿ⁾_tf that the n-th (n = 1, ..., N) source signal is active at the time-frequency point (t, f) according to equation (71) below, where γ⁽ⁿ⁾_tf := P(d_tf = n | y_tf, Θ).

The mixture weight update unit 12B-2 updates the mixture weight α⁽ⁿ⁾_t according to equation (72) below.

The permutation resolution unit 12B-3 resolves the permutation by permuting the covariance matrices φ⁽ⁿ⁾_tf B⁽ⁿ⁾_f among the sound sources at each frequency bin f so that the posterior probability, which is the objective function, is maximized. That is, with Π_f denoting a permutation on {1, ..., N}, the covariance matrices φ⁽ⁿ⁾_tf B⁽ⁿ⁾_f are permuted according to equations (73) to (75) below.

To improve performance, the mixed signal vector y_tf including reverberation may be whitened as shown in equations (59) to (61) above, as preprocessing before all other processing of the model estimation device 10B.

In Embodiment 2, the posterior probability update unit 12B-1 of the clustering unit 12B computes the posterior probability γ⁽ⁿ⁾_tf that the n-th (n = 1, ..., N) source signal is active at the time-frequency point (t, f) based on equation (71). However, this is not a limitation; γ⁽ⁿ⁾_tf may instead be computed with a conventional technique such as k-means clustering.

[Embodiment 3]

Embodiment 3 uses the model estimation device 10B of Embodiment 2 and additionally estimates the number of sound sources with the source number estimation technique described in Reference 4 above. Embodiment 3 assumes that the true number of sound sources N₀ is unknown but that an upper bound on it is known, and sets the assumed number of sound sources N larger than that upper bound.

(Configuration of Model Estimation Device According to Embodiment 3)

FIG. 4 is a diagram showing an example of the configuration of the model estimation device according to Embodiment 3. Compared with the model estimation device 10B according to Embodiment 2, the model estimation device 10C according to Embodiment 3 further includes a sound source number estimation unit 13.

Using the posterior probabilities γ⁽ⁿ⁾_tf, computed by the clustering unit 12B, that the n-th sound source is active, the sound source number estimation unit 13 determines which of the indices n = 1, ..., N correspond to true sound sources, namely n(1), ..., n(N₀), and outputs only the parameters for those indices. Specifically, the sound source number estimation unit 13 computes the sum of the posterior probabilities that the n-th sound source is active, for example according to equation (76) below.

The sound source number estimation unit 13 then clusters the sums ρ⁽ⁿ⁾ of the posterior probabilities into two clusters, finds the indices n = n(1), ..., n(^N₀) of the ρ⁽ⁿ⁾ belonging to the cluster with the larger sums, and regards them as the indices corresponding to true sound sources. For example, the sound source number estimation unit 13 clusters ρ⁽ⁿ⁾ by applying k-means clustering with two clusters.
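The source-counting procedure can be sketched as a sum over time-frequency points followed by one-dimensional two-means clustering. The sketch assumes that equation (76) is the plain sum of γ⁽ⁿ⁾_tf over all t and f, initializes the two centroids at the smallest and largest ρ⁽ⁿ⁾, and uses 0-based indices; these choices and the function names are illustrative.

```python
def sum_posteriors(gammas):
    """rho[n] = sum over t and f of gammas[t][f][n] (assumed form of the sum)."""
    num_sources = len(gammas[0][0])
    return [sum(frame[f][n] for frame in gammas for f in range(len(frame)))
            for n in range(num_sources)]

def select_true_sources(rho, num_iter=100):
    """Two-cluster k-means on the scalars rho; return indices in the larger cluster."""
    low, high = min(rho), max(rho)
    if low == high:                         # degenerate case: a single cluster
        return list(range(len(rho)))
    for _ in range(num_iter):
        # Assignment step: each rho goes to the nearer centroid.
        in_high = [abs(r - high) < abs(r - low) for r in rho]
        high_vals = [r for r, h in zip(rho, in_high) if h]
        low_vals = [r for r, h in zip(rho, in_high) if not h]
        # Update step: centroids become cluster means.
        new_high = sum(high_vals) / len(high_vals)
        new_low = sum(low_vals) / len(low_vals)
        if new_high == high and new_low == low:
            break
        low, high = new_low, new_high
    return [n for n, h in enumerate(in_high) if h]

rho = [120.0, 3.0, 98.5, 1.2, 0.4]          # toy sums for N = 5 assumed sources
active = select_true_sources(rho)
estimated_count = len(active)
```

Here two of the five assumed sources collect almost all of the posterior mass, so the estimated number of sources is 2.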

Finally, the sound source number estimation unit 13 outputs only the parameters given by equation (77) below, which correspond to the indices n = n(1), ..., n(^N₀) of the true sound sources. In equation (77), l = 1, ..., ^N₀.

(Processing of Model Estimation Device According to Embodiment 3)

FIG. 5 is a flowchart showing an example of the processing procedure of the model estimation device according to Embodiment 3. The processing of the model estimation device 10C described below is repeated until a predetermined convergence criterion, similar to that of Embodiment 1 or 2, is satisfied.

First, in step S21, the initialization unit 11B-1 computes the initial values of the parameter set Θ based on equations (47) to (49) and equation (70), and stores them in the main storage device of the model estimation device 10C. Next, in step S22, the dereverberation unit 11B-4 updates the estimate ^x_tf of the mixed signal vector not including reverberation according to equation (56), based on the regression matrices G_kf currently stored in the main storage device of the model estimation device 10C (the "dereverberation" process).

Next, in step S23, the posterior probability update unit 12B-1 computes the posterior probability γ⁽ⁿ⁾_tf that the n-th (n = 1, ..., N) source signal is active at the time-frequency point (t, f) according to equation (71) and stores it in the main storage device of the model estimation device 10C. Also in step S23, the mixture weight update unit 12B-2 computes the mixture weight α⁽ⁿ⁾_t according to equation (72) and stores it in the main storage device of the model estimation device 10C (the above constitutes the "clustering" process). Also in step S23, the permutation resolution unit 12B-3 permutes the covariance matrices φ⁽ⁿ⁾_tf B⁽ⁿ⁾_f according to equations (73) to (75), with Π_f denoting a permutation on {1, ..., N}.

Next, the model estimation device 10C determines whether the convergence condition is satisfied (step S24). If the condition is satisfied (Yes in step S24), the model estimation device 10C proceeds to step S26. If the condition is not satisfied (No in step S24), the model estimation device 10C proceeds to step S25.

The processing of step S25 is the same as that of step S15 of the first embodiment shown in FIG. 2. In step S26, the sound source number estimation unit 13 estimates the true number of sound sources using the posterior probability γ^(n)_tf that the n-th sound source is active, and outputs the estimation result.
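The control flow of steps S21 to S26 can be sketched as follows. This is only a skeleton under stated assumptions: the update formulas of equations (47) to (75) are not reproduced here, so every step is passed in as a caller-supplied function. All names (`estimate_model`, `init`, `dereverberate`, `cluster`, `update_params`, `count_sources`, `max_iter`, `tol`) are hypothetical. Two further assumptions: the flow returns to step S22 after step S25, and convergence is tested on the change of a scalar objective, whereas the text only speaks of a predetermined condition.

```python
from typing import Any, Callable, Tuple


def estimate_model(
    y: Any,
    init: Callable[[Any], Any],                              # step S21
    dereverberate: Callable[[Any, Any], Any],                # step S22
    cluster: Callable[[Any, Any], Tuple[Any, Any, float]],   # step S23
    update_params: Callable[[Any, Any, Any, Any], Any],      # step S25
    count_sources: Callable[[Any], int],                     # step S26
    max_iter: int = 100,
    tol: float = 1e-6,
) -> Tuple[Any, Any, int]:
    """Alternate dereverberation and clustering until the objective settles."""
    params = init(y)                      # S21: initial parameter set
    gamma = None
    previous = None
    for _ in range(max_iter):
        x_hat = dereverberate(y, params)  # S22: reverberation-free estimate
        # S23: posteriors, mixture weights and permutation alignment.
        # The step returns the posteriors, the updated parameters and a
        # scalar objective (the objective is an assumption of this sketch).
        gamma, params, objective = cluster(x_hat, params)
        # S24: assumed convergence test on the change of the objective.
        if previous is not None and abs(objective - previous) < tol:
            break
        previous = objective
        # S25: remaining parameter updates (defined in the first embodiment).
        params = update_params(y, x_hat, gamma, params)
    # S26: estimate the number of sources from the final posteriors.
    return params, gamma, count_sources(gamma)
```

Passing the steps as functions keeps the skeleton independent of which embodiment (10A, 10B, or 10C) supplies the actual update rules.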

[Embodiment 4]
The target sound enhancement device according to the fourth embodiment is a target sound enhancement device 100 that includes any one of the model estimation devices 10A to 10C according to the first to third embodiments.

(Configuration of the target sound enhancement device according to Embodiment 4)
FIG. 6 is a diagram illustrating an example of the configuration of the target sound enhancement device according to the fourth embodiment. The target sound enhancement device 100 according to the fourth embodiment includes a frequency domain conversion unit 20, a model estimation device 10A (or 10B or 10C), an enhanced sound calculation unit 30, and a time domain conversion unit 40.

The frequency domain conversion unit 20 converts the time-domain mixed signal vector ~y_τ including reverberation into the time-frequency-domain mixed signal vector y_tf including reverberation, using a time-frequency transform such as the short-time Fourier transform. Here, the mixed signal vector ~y_τ is defined by equation (78) below.

In equation (78) above, ~y^(m)_τ is the time-domain mixed signal including reverberation observed by the m-th (m = 1, ..., M) microphone, and τ is the sample index. The model estimation device 10A (or 10B or 10C) calculates the parameter set Θ and the posterior probability γ^(n)_tf that each sound source n is active.

The enhanced sound calculation unit 30 calculates and outputs, by equations (79) and (80) below, the estimate ^s^(n)_tf of the microphone image of each sound source not including reverberation in the time-frequency domain. As inputs it uses the time-frequency-domain mixed signal vector y_tf including reverberation, output from the frequency domain conversion unit 20, together with the parameter set Θ and the posterior probability γ^(n)_tf that each sound source n is active, both output from the model estimation device 10A (or 10B or 10C).
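As a rough illustration of this step, the sketch below removes reverberation by multichannel linear prediction and then uses the posterior as a soft mask, for a single frequency bin. Equations (79) and (80) are not reproduced here, so the forms x_hat[t] = y[t] - Σ_k G[k]^H y[t - delay - k] and s_hat[n][t] = gamma[t][n] * x_hat[t] are assumptions, as are the `delay` parameter and the function names `dereverberate` and `enhance`.

```python
from typing import List

Vec = List[complex]        # one M-channel vector at a single (t, f) point
Mat = List[List[complex]]  # M x M regression matrix for one tap k


def dereverberate(y: List[Vec], G: List[Mat], delay: int = 1) -> List[Vec]:
    """Assumed linear-prediction form for one frequency bin:
    x_hat[t] = y[t] - sum_k G[k]^H y[t - delay - k]."""
    M = len(y[0])
    x_hat = []
    for t in range(len(y)):
        x_t = list(y[t])
        for k, G_k in enumerate(G):
            past = t - delay - k
            if past < 0:
                continue  # no observation exists before the first frame
            for m in range(M):
                # (G_k^H y)[m] = sum_j conj(G_k[j][m]) * y[past][j]
                x_t[m] -= sum(G_k[j][m].conjugate() * y[past][j]
                              for j in range(M))
        x_hat.append(x_t)
    return x_hat


def enhance(x_hat: List[Vec], gamma: List[List[float]]) -> List[List[Vec]]:
    """Assumed soft masking: s_hat[n][t] = gamma[t][n] * x_hat[t]."""
    N = len(gamma[0])
    return [[[gamma[t][n] * v for v in x_hat[t]] for t in range(len(x_hat))]
            for n in range(N)]
```

Because the posteriors over the N sources sum to one at each time-frequency point, the masked outputs in this sketch add up to the dereverberated mixture.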

When the model estimation device 10A of Embodiment 1 is used in the target sound enhancement device 100, permutation alignment must be performed before the processing of equations (79) and (80) above, so that the index n of γ^(n)_tf corresponds to the same sound source regardless of frequency. This permutation alignment can be performed, for example, by the method described in Reference 5: H. Sawada, S. Araki, and S. Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Trans. ASLP, vol. 19, no. 3, pp. 516-527, Mar. 2011.

The time domain conversion unit 40 applies the inverse of the time-frequency transform, such as the inverse short-time Fourier transform, to the vector ^s^(n)_tf of estimates of the microphone image of each sound source not including reverberation in the time-frequency domain, output from the enhanced sound calculation unit 30, and thereby calculates the vector ~^s^(n)_τ of estimates of the microphone image of each sound source not including reverberation in the time domain. Here, the vector ~^s^(n)_τ is defined by equation (81) below, where ~^s^(m,n)_τ is the inverse short-time Fourier transform of the m-th element ^s^(m,n)_tf of the vector ^s^(n)_tf.

The example above has the enhanced sound calculation unit 30 perform dereverberation and sound source separation at the same time. To remove only the reverberation, the device may instead apply the inverse of the time-frequency transform, such as the inverse short-time Fourier transform, to the estimate ^x_tf of the mixed signal not including reverberation in the time-frequency domain, and thereby obtain the vector ~^x_τ of estimates of the mixed signal not including reverberation in the time domain. Here, the vector ~^x_τ is defined by equation (82) below, where ~^x^(m)_τ is the inverse short-time Fourier transform of the m-th element ^x^(m)_tf of the vector ^x_tf.

(Processing of the target sound enhancement device according to Embodiment 4)
FIG. 7 is a flowchart illustrating an example of a processing procedure of the target sound enhancement device according to the fourth embodiment. In the target sound enhancement device 100 according to the fourth embodiment, first, in step S31, the frequency domain conversion unit 20 converts the signal observed by each microphone into a signal in the time-frequency domain. Next, in step S32, the model estimation device 10A (or 10B or 10C) performs model estimation. Next, in step S33, the enhanced sound calculation unit 30 calculates an estimate of the enhanced sound. Next, in step S34, the time domain conversion unit 40 converts the enhanced sound estimated by the enhanced sound calculation unit 30 from the time-frequency domain to the time domain.

An example of the disclosed embodiments and its effects are described below, taking the fourth embodiment as the example. FIGS. 8 and 9 are diagrams for explaining an example of the effects of the fourth embodiment. An experiment was conducted to compare the performance of the target sound enhancement device 100 according to the fourth embodiment (hereinafter, the "proposed method") with that of a conventional clustering-based sound source separation method that does not include a reverberation model (for example, the method described in Reference 4; hereinafter, the "conventional method"). The model estimation device 10B according to the second embodiment was used as the model estimation device of the target sound enhancement device 100.

The mixed signals including reverberation observed by the microphones were generated by convolving speech waveforms not including reverberation with impulse responses measured in a laboratory (see, for example, Reference 5 above). FIG. 8 shows the positions of the microphones and sound sources when the impulse responses were measured. In both the proposed method and the conventional method, the mixed signal vector y_tf including reverberation was whitened as shown in equations (59) to (61) above before the parameter set Θ was estimated. The number N of sound sources was assumed to be known. The other experimental conditions were as shown in Table 1 below. The laboratory shown in FIG. 8 was a space measuring 4.45 m × 3.55 m × 2.50 m (height). Source 1, Source 2, Microphone 1, and Microphone 2 shown in FIG. 8 were all placed at a height of 1.2 m above the laboratory floor.

The performance of the proposed method and the conventional method was evaluated by the SIR (Signal-to-Interference Ratio) defined by equation (83) below.

Here, ~^s^(1,n,ν)_τ denotes the ν-th sound source component contained in ~^s^(1,n)_τ. T := 8 kHz × 8 s = 64000 is the total number of samples, and Σ_{ν≠n} denotes the sum over all values of ν other than n.
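The SIR computation can be sketched as below. Equation (83) itself is not reproduced here, so the exact form is an assumption: for each output n, the energy of the target component is divided by the energy of the summed interfering components (ν ≠ n) at microphone 1, the ratio is expressed in dB, and the results are averaged over n. The function name `sir_db` and the nested-list layout are hypothetical.

```python
import math
from typing import Sequence


def sir_db(components: Sequence[Sequence[Sequence[float]]]) -> float:
    """components[n][nu][tau] is the nu-th source component contained in the
    n-th separated output at microphone 1, as a time-domain sequence.

    Assumed form: mean over n of
    10*log10( sum_tau |c[n][n]|^2 / sum_tau |sum_{nu != n} c[n][nu]|^2 ).
    """
    N = len(components)
    ratios_db = []
    for n in range(N):
        T = len(components[n][n])
        # Energy of the target component in output n.
        target = sum(abs(components[n][n][tau]) ** 2 for tau in range(T))
        # Energy of the sum of all interfering components in output n.
        interference = sum(
            abs(sum(components[n][nu][tau] for nu in range(N) if nu != n)) ** 2
            for tau in range(T)
        )
        ratios_db.append(10.0 * math.log10(target / interference))
    return sum(ratios_db) / N
```

In the experiment described here, each sequence would hold T = 64000 samples.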

The following explains how ~^s^(1,n,ν)_τ is obtained. The observed mixed signal vector y_tf including reverberation can be decomposed as in equation (84) below, using the vector x^(ν)_tf of the microphone image of the ν-th sound source including reverberation.

Accordingly, the vector ^s^(n)_tf of estimates of the microphone image of the n-th sound source not including reverberation can be decomposed as in equations (85) and (86) below.

Here, in equation (86) above, ^s^(n,ν)_tf denotes the ν-th sound source component contained in ^s^(n)_tf. Therefore, ^s^(n,ν)_tf is obtained by equation (87) below, ~^s^(n,ν)_τ is obtained by applying the inverse short-time Fourier transform to ^s^(n,ν)_tf, and ~^s^(1,n,ν)_τ is obtained as the first element of ~^s^(n,ν)_τ.
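The decomposition relies on the processing being linear in the observation once the parameters and posteriors are held fixed. A minimal sketch of that idea follows. It assumes, without confirmation from equation (87), that each component is obtained by applying the same fixed processing to the individual source image x^(ν). The helper names `source_components` and `mix`, and the single-channel signal layout, are hypothetical.

```python
from typing import Callable, List, Sequence

Signal = List[complex]  # one channel, one frequency bin, indexed by frame t


def source_components(process: Callable[[Signal], Signal],
                      images: Sequence[Signal]) -> List[Signal]:
    """Apply the same fixed (linear) processing to each source image x^(nu)."""
    return [process(x_nu) for x_nu in images]


def mix(signals: Sequence[Signal]) -> Signal:
    """Frame-wise sum of equally long signals, as in y = sum_nu x^(nu)."""
    return [sum(values) for values in zip(*signals)]
```

For any linear `process`, `mix(source_components(process, images))` equals `process(mix(images))` up to rounding, which is why the target and interference parts of a separated output can be measured separately.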

FIG. 9 plots, for each reverberation time, the SIR averaged over eight trials with different combinations of speech waveforms. Under the condition with the shortest reverberation time (about 130 ms), the proposed method and the conventional method performed equally well. As FIG. 9 shows, however, the improvement of the proposed method over the conventional method tended to grow as the reverberation time increased. In particular, at a reverberation time of about 370 ms the improvement reached about 4 dB, the largest value observed in the experiment.

As described above, Embodiments 1 to 4 build on clustering-based sound source separation, which has advantages over separation based on independent component analysis, such as being applicable even when the number of sound sources is unknown, and alternately iterate dereverberation based on linear prediction and sound source separation based on clustering. Embodiments 1 to 4 can improve separation performance by applying clustering-based sound source separation to the mixed signal not including reverberation that is estimated by the linear-prediction-based dereverberation. Conversely, Embodiments 1 to 4 can improve dereverberation performance by using the improved separation result. By iterating dereverberation and sound source separation in this way, Embodiments 1 to 4 can achieve more accurate sound source separation even when the reverberation time is long compared with the frame length.

(Device configuration of the model estimation device and the target sound enhancement device)
The components of the model estimation devices 10A to 10C shown in FIGS. 1, 3, and 4 and of the target sound enhancement device 100 shown in FIG. 6 are functional concepts and need not be physically configured as illustrated. That is, the specific form in which the functions of the model estimation devices 10A to 10C and the target sound enhancement device 100 are distributed or integrated is not limited to the illustrated one; all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

All or any part of the processing performed in the model estimation devices 10A to 10C and the target sound enhancement device 100 may be realized by a processing device such as a CPU (Central Processing Unit) and a program analyzed and executed by that processing device. The processing performed in the model estimation devices 10A to 10C and the target sound enhancement device 100 may also be realized as hardware using wired logic.

Of the processes described in the embodiments, all or part of those described as being performed automatically may be performed manually. Conversely, all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters described above and shown in the drawings may be changed as appropriate unless otherwise specified.

(Program)
FIG. 10 is a diagram illustrating an example of a computer that realizes the model estimation device and the target sound enhancement device by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. In the computer 1000, these units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1051 and a keyboard 1052. The video adapter 1060 is connected to, for example, a display 1061.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program defining the processing of the model estimation devices 10A to 10C and the target sound enhancement device 100 is stored, for example, in the hard disk drive 1031 as the program module 1093, in which instructions to be executed by the computer 1000 are written. For example, the program module 1093 for executing information processing equivalent to the functional configurations of the model estimation devices 10A to 10C and the target sound enhancement device 100 is stored in the hard disk drive 1031.

The setting data used in the processing of the embodiments is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1031; they may be stored, for example, in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network) or a WAN (Wide Area Network)) and read by the CPU 1020 via the network interface 1070.

The above embodiments and their modifications are included in the technology disclosed in the present application, and likewise in the invention described in the claims and its equivalents.

10A, 10B, 10C Model estimation device
11A, 11B Dereverberation processing unit
11A-1, 11B-1 Initialization unit
11A-2, 11B-2 Covariance matrix update unit
11A-3, 11B-3 Regression matrix update unit
11A-4, 11B-4 Dereverberation unit
12A, 12B Clustering unit
12A-1, 12B-1 Posterior probability update unit
12A-2, 12B-2 Mixture weight update unit
12B-3 Permutation resolution unit
13 Sound source number estimation unit
20 Frequency domain conversion unit
30 Enhanced sound calculation unit
40 Time domain conversion unit
100 Target sound enhancement device
1000 Computer
1010 Memory
1020 CPU

Claims

A model estimation device comprising:
a storage unit that stores parameters of a model of a mixed signal including reverberation, the parameters including a regression matrix representing characteristics of the reverberation of sounds output from a plurality of sound sources;
a signal estimation unit that estimates a mixed signal not including the reverberation by linear prediction, using an observation signal obtained by observing the sounds with a plurality of microphones and the regression matrix stored in the storage unit;
a posterior probability calculation unit that calculates, from the mixed signal estimated by the signal estimation unit, a posterior probability for each cluster corresponding to the sound source to which each time-frequency point belongs; and
a parameter estimation unit that estimates the parameters from the observation signal, the mixed signal estimated by the signal estimation unit, the posterior probability calculated by the posterior probability calculation unit, and the parameters stored in the storage unit, and updates the parameters stored in the storage unit with the estimated parameters,
wherein the signal estimation unit, the posterior probability calculation unit, and the parameter estimation unit repeat their respective processes until a predetermined condition is satisfied.

The model estimation device according to claim 1, wherein
the model of the mixed signal including reverberation is a probabilistic model representing a distribution of the mixed signal including reverberation,
the probabilistic model is a mixture model expressed as a weighted sum of probabilistic models each representing, for one of the clusters, a distribution of the mixed signal including reverberation, and
the parameter estimation unit estimates the parameters using a predetermined evaluation function for evaluating the probabilistic model.

The model estimation device according to claim 2, wherein the predetermined evaluation function is the likelihood of the mixed signal including reverberation with respect to the parameters estimated by the parameter estimation unit, or the posterior probability of the parameters estimated by the parameter estimation unit.

The model estimation device according to claim 3, wherein
the parameters estimated by the parameter estimation unit include a mixture weight indicating the distribution, at each time-frequency point, of the plurality of sound sources included in the mixed signal including reverberation, and
the mixture weight is a mixture weight for each frequency of the mixed signal including reverberation or a mixture weight for each time of the mixed signal including reverberation.

The model estimation device according to claim 4, wherein the parameter estimation unit estimates, based on the posterior probabilities corresponding to the respective sound sources at each time-frequency point, which of the plurality of sound sources are included in the mixed signal including reverberation, and uses the parameters corresponding to the estimated sound sources as the estimated parameters.

A target sound enhancement device comprising an output unit that estimates and outputs an estimate, in the time-frequency domain, of an acoustic signal of each sound source not including reverberation, from the parameters and the posterior probabilities estimated by the model estimation device according to any one of claims 1 to 5 and from the mixed signal in the time-frequency domain including the reverberation of each sound source.

A model estimation method executed by a model estimation device,
the model estimation device including a storage unit that stores parameters of a model of a mixed signal including reverberation, the parameters including a regression matrix representing characteristics of the reverberation of sounds output from a plurality of sound sources,
the method comprising:
a signal estimation step of estimating a mixed signal not including the reverberation by linear prediction, using an observation signal obtained by observing the sounds with a plurality of microphones and the regression matrix stored in the storage unit;
a posterior probability calculation step of calculating, from the mixed signal estimated in the signal estimation step, a posterior probability for each cluster corresponding to the sound source to which each time-frequency point belongs; and
a parameter estimation step of estimating the parameters from the observation signal, the mixed signal estimated in the signal estimation step, the posterior probability calculated in the posterior probability calculation step, and the parameters stored in the storage unit, and updating the parameters stored in the storage unit with the estimated parameters,
wherein the signal estimation step, the posterior probability calculation step, and the parameter estimation step are repeated until a predetermined condition is satisfied.

A model estimation program for causing a computer to function as the model estimation apparatus according to claim 1.