JP2016045225A

JP2016045225A - Number of sound sources estimation device, number of sound sources estimation method, and number of sound sources estimation program

Info

Publication number: JP2016045225A
Application number: JP2014167025A
Authority: JP
Inventors: 信貴伊藤; Nobutaka Ito; 智広中谷; Tomohiro Nakatani; 章子荒木; Akiko Araki
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-08-19
Filing date: 2014-08-19
Publication date: 2016-04-04
Anticipated expiration: 2034-08-19
Also published as: JP6193823B2

Abstract

PROBLEM TO BE SOLVED: To appropriately estimate the number of sound sources even under conditions close to a real environment. SOLUTION: A sound source number estimation device B extracts, from observation signals obtained by observing with M microphones a mixed signal in which signals from N sound sources are mixed, a feature vector x_τω corresponding to an observation signal vector y_τω composed of the time-frequency components of the observation signals. The device B then applies the feature vector x_τω to a prescribed probability model, estimates the model parameters of the probability model using an evaluation function that gives a higher evaluation value the more closely the likelihood time series of each sound source is synchronized across frequency bins, and uses the model parameters to calculate the posterior probability, i.e. the conditional probability that the observation signal vector y_τω belongs to each of a set of clusters whose number is set larger than the N sound sources. The device B then calculates the sum of the posterior probabilities for each cluster and estimates the number of sound sources on the basis of these per-cluster sums. SELECTED DRAWING: Figure 3

Description

The present invention relates to a sound source number estimation device, a sound source number estimation method, and a sound source number estimation program.

Clustering-based sound source separation is a known sound source separation technique. For example, on the premise that the number of sound sources is known, clustering is performed using as many clusters as there are sound sources (see, for example, Non-Patent Document 1). Because the features of a single sound source tend to concentrate in the same region of the feature space within each frequency, sound source separation can be performed by such clustering.

The shape of each sound source's region in the feature space depends on frequency, but the signal of a single sound source tends to rise simultaneously, and to fall simultaneously, at all frequencies. That is, sound source activity is synchronized across frequencies. The clustering described in Non-Patent Document 1 models this synchrony of sound source activity appropriately and incorporates it into a global optimization, thereby achieving joint clustering of the sound source position features and the sound source activity.

However, the clustering described in Non-Patent Document 1 presupposes that the number of sound sources is known; when it is unknown, clustering with as many clusters as sound sources cannot be performed. Techniques for estimating the number of sound sources are also known. For example, there is a method that estimates the number of sound sources on the premise that "the number of sound sources ≦ the number of microphones, and there is no reverberation" (see, for example, Non-Patent Document 2).

[Non-Patent Document 1] Nobutaka Ito, Akiko Araki, Tomohiro Nakatani, "Clustering-based sound source separation without permutation problem based on time-varying mixture weights", The Institute of Electronics, Information and Communication Engineers, IEICE Technical Report, Vol. 113, No. 27, May 2013
[Non-Patent Document 2] Mati Wax, Thomas Kailath, "Detection of Signals by Information Theoretic Criteria", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-33, No. 2, April 1985

However, the conventional technique for estimating the number of sound sources described above presupposes that "the number of sound sources ≦ the number of microphones, and there is no reverberation", so it cannot estimate the number of sound sources appropriately under conditions close to a real environment. In other words, the conventional technique holds only under this idealized condition; in a situation such as "the number of sound sources > the number of microphones, or there is reverberation", the number of sound sources cannot be estimated appropriately.

To solve the problem described above and achieve the object, the sound source number estimation device of the present invention, where k is the cluster index, τ is the time frame index, and ω is the angular frequency, includes: a feature extraction unit that extracts, from observation signals obtained by observing with M microphones a mixed signal in which signals from N sound sources are mixed, a feature vector x_τω corresponding to an observation signal vector y_τω composed of the time-frequency components of the observation signals; a model estimation unit that applies the feature vector x_τω to a predetermined probability model, estimates the model parameters of the probability model using an evaluation function that gives a higher evaluation value the more closely the likelihood time series of each sound source is synchronized across frequency bins, and uses the model parameters to calculate the posterior probability, which is the conditional probability that the observation signal vector y_τω belongs to each of the clusters, the number of which is set larger than the N sound sources; and a sound source number estimation unit that estimates, as the number of sound sources, the number of clusters for which the posterior probability takes a significant value. The probability model is a mixture model expressed as a weighted sum of the distributions of the feature vector x_τω for the respective sound sources; the mixture weights of the probability model depend on the time frame τ and do not depend on the angular frequency ω; and the model parameters of the probability model are the mixture weights and the parameters of the distribution of the feature vector x_τω for each sound source.

According to the present invention, the number of sound sources can be estimated appropriately even under conditions close to a real environment.

FIG. 1 is a diagram illustrating the functional configuration of the model estimation device according to the first embodiment.
FIG. 2 is a diagram illustrating the processing flow of the model estimation device according to the first embodiment.
FIG. 3 is a diagram illustrating the functional configuration of the sound source number estimation device according to the second embodiment.
FIG. 4 is a diagram illustrating the processing flow of the sound source number estimation device according to the second embodiment.
FIG. 5 is a diagram showing an example of the posterior probabilities when the number of clusters is set larger than the actual number of sound sources.
FIG. 6 is a diagram showing experimental results.
FIG. 7 is a diagram showing the sound source position features plotted in the feature space when the number of clusters is set equal to the number of sound sources.
FIG. 8 is a diagram showing an example of the sound source position features plotted in the feature space when the number of clusters is set larger than the number of sound sources.
FIG. 9 is a diagram showing an example of the sound source position features plotted in the feature space when the number of clusters is set larger than the number of sound sources.
FIG. 10 is a diagram illustrating the functional configuration of the sound source number estimation device according to the third embodiment.
FIG. 11 is a diagram illustrating the processing flow of the sound source number estimation device according to the third embodiment.
FIG. 12 is a diagram showing a computer that executes the sound source number estimation method.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

[Points of the Invention]
The key technical point of the sound source number estimation device of the present invention is that the number of sound sources can be estimated under conditions where it is unknown. Details are given later, but in outline: the number of clusters is set larger than the number of sound sources, the clustering described in Non-Patent Document 1 is performed, the posterior probability of each cluster is calculated, and the number of clusters whose posterior probability takes a significant value is taken as the estimate of the number of sound sources.
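The counting rule above can be sketched as follows. This is a minimal illustration, assuming the posterior probabilities are already available as a nested list `gamma[k][t][f]`. The names `count_active_clusters` and `min_share`, and the rule that a cluster is "significant" when its share of the total posterior mass exceeds a threshold, are assumptions made for the example; this section does not specify how significance is decided.

```python
def count_active_clusters(gamma, min_share=0.05):
    """Count clusters whose summed posterior is a significant share of the total.

    gamma[k][t][f] is the posterior probability that time-frequency point
    (t, f) belongs to cluster k. min_share is a hypothetical threshold.
    """
    # Sum the posterior of each cluster over all frames and frequency bins.
    totals = [sum(sum(frame) for frame in cluster) for cluster in gamma]
    grand_total = sum(totals)
    if grand_total == 0:
        return 0
    # A cluster counts as a source if it holds at least min_share of the mass.
    return sum(1 for s in totals if s / grand_total >= min_share)


# Toy example: K = 4 clusters, 2 frames, 2 bins; the last two clusters are nearly empty.
gamma = [
    [[0.60, 0.50], [0.55, 0.45]],
    [[0.39, 0.49], [0.44, 0.54]],
    [[0.01, 0.00], [0.00, 0.01]],
    [[0.00, 0.01], [0.01, 0.00]],
]
estimated_sources = count_active_clusters(gamma)
```

In this toy input two clusters carry almost all of the posterior mass, so the sketch reports two sources even though four clusters were configured.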

[First Embodiment]
The first embodiment of the present invention is a model estimation device that observes signals from a plurality of sound sources with a plurality of microphones and estimates model parameters for each sound source.

An example of the functional configuration of the model estimation device A of the first embodiment is described with reference to FIG. 1. The model estimation device A includes a frequency domain conversion unit 1, a feature extraction unit 2, and a model estimation unit 3. The model estimation unit 3 includes a posterior probability calculation unit 31, a parameter update unit 32, and a parameter holding unit 33. The parameter update unit 32 includes mixture weight update means 321, correlation matrix update means 322, mean direction update means 323, density parameter update means 324, and permutation resolving means 325.

An operation example of the model estimation device A is described in procedural order with reference to FIG. 2. The time-domain mixed signal ~y_t observed by the M microphones is input to the frequency domain conversion unit 1. The time-domain mixed signal ~y_t is defined by equation (1).

Here, t denotes the time index, ·^T (superscript T) denotes the transpose of a vector ·, and ~y_mt denotes the time-domain mixed signal observed by the m-th microphone (1 ≦ m ≦ M).

The frequency domain conversion unit 1 generates the time-frequency-domain observation signal vector y_τω from the input time-domain mixed signal ~y_t, for example by a short-time Fourier transform, and outputs it (step S1). The time-frequency-domain observation signal vector y_τω is defined by equation (2).

Here, τ denotes the time frame index, ω denotes the angular frequency, and y_mτω is the time-frequency-domain representation of the mixed signal ~y_mt.
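Equations (1) and (2) are not reproduced in this text. Judging from the definitions of ~y_mt, y_mτω and the transpose given here, they presumably stack the M microphone signals into column vectors, as in the following reconstruction:

```latex
\begin{align}
\tilde{\mathbf{y}}_{t} &= \left[\tilde{y}_{1t}, \tilde{y}_{2t}, \ldots, \tilde{y}_{Mt}\right]^{T} \tag{1}\\
\mathbf{y}_{\tau\omega} &= \left[y_{1\tau\omega}, y_{2\tau\omega}, \ldots, y_{M\tau\omega}\right]^{T} \tag{2}
\end{align}
```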

The feature extraction unit 2 receives the time-frequency-domain observation signal vector y_τω output by the frequency domain conversion unit 1, and calculates and outputs the feature vector x_τω (step S2). The feature vector x_τω may be calculated by normalizing the observation signal vector y_τω, by whitening y_τω and then normalizing it, or by normalizing y_τω, whitening it, and then normalizing it again. For example, when the feature vector x_τω is calculated by normalizing the observation signal vector y_τω, it may be calculated by equation (3).
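Equation (3) is not reproduced in this text. The sketch below assumes it divides the observation vector by its Euclidean norm, x_τω = y_τω / ||y_τω||, which is the most plausible reading of "normalizing" here. The function name `normalize` and the sample values are illustrative only.

```python
import math


def normalize(y):
    """Scale a complex observation vector to unit Euclidean norm."""
    norm = math.sqrt(sum(abs(v) ** 2 for v in y))
    if norm == 0.0:
        # An all-zero bin carries no directional information; return it unchanged.
        return list(y)
    return [v / norm for v in y]


# One time-frequency point observed with M = 3 microphones (made-up values).
y = [3 + 4j, 0j, 12j]
x = normalize(y)
```

After this step only the direction of the vector remains, so the overall level of the source no longer affects the feature.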

Alternatively, when the observation signal vector y_τω is whitened and then normalized, the feature vector x_τω may be calculated as follows. First, the sample correlation matrix R_ω^y of the observation signal vector y_τω is calculated from y_τω by equation (4).

Here, T is the number of frames, and ·^H (superscript H) denotes the Hermitian transpose.

Next, the eigenvalues and eigenvectors of the sample correlation matrix R_ω^y are calculated. The eigenvalues of R_ω^y, arranged in descending order, are denoted σ_ω1, σ_ω2, …, σ_ωM. The relationship of equation (5) therefore holds.

Note that, because the sample correlation matrix R_ω^y is a Hermitian matrix, the eigenvalues σ_ω1, σ_ω2, …, σ_ωM are all real. The eigenvectors of R_ω^y that correspond to the eigenvalues σ_ω1, σ_ω2, …, σ_ωM and form an orthonormal system are denoted u_ω1, u_ω2, …, u_ωM. Note that such eigenvectors exist because R_ω^y is a Hermitian matrix.

Next, the matrix Σ_ω is obtained by equation (6), and the matrix U_ω is obtained by equation (7).

Next, using the matrices U_ω and Σ_ω, the vector y'_τω obtained by whitening the observation signal vector y_τω is calculated by equation (8).

Finally, the feature vector x_τω is calculated by normalizing the vector y'_τω by its norm, as in the following equation.
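The equations of this whitening procedure are not reproduced in this text. The following is a reconstruction from the surrounding definitions. Equations (4) to (7) follow directly from the wording. The whitening step (8) is written in the standard eigenvalue form, which is an assumption. The final normalization is left unnumbered because the text does not state its number.

```latex
\begin{align}
\mathbf{R}^{y}_{\omega} &= \frac{1}{T}\sum_{\tau=1}^{T}\mathbf{y}_{\tau\omega}\mathbf{y}_{\tau\omega}^{H} \tag{4}\\
\sigma_{\omega 1} &\ge \sigma_{\omega 2} \ge \cdots \ge \sigma_{\omega M} \tag{5}\\
\boldsymbol{\Sigma}_{\omega} &= \mathrm{diag}\left(\sigma_{\omega 1},\ldots,\sigma_{\omega M}\right) \tag{6}\\
\mathbf{U}_{\omega} &= \left[\mathbf{u}_{\omega 1},\ldots,\mathbf{u}_{\omega M}\right] \tag{7}\\
\mathbf{y}'_{\tau\omega} &= \boldsymbol{\Sigma}_{\omega}^{-1/2}\,\mathbf{U}_{\omega}^{H}\,\mathbf{y}_{\tau\omega} \tag{8}\\
\mathbf{x}_{\tau\omega} &= \frac{\mathbf{y}'_{\tau\omega}}{\left\|\mathbf{y}'_{\tau\omega}\right\|}
\end{align}
```

Under this form the sample correlation matrix of the whitened vectors is the identity, since Σ_ω^{-1/2} U_ω^H R_ω^y U_ω Σ_ω^{-1/2} = I, which is the defining property of whitening.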

The model estimation unit 3 fits the feature vector x_τω to a probability model representing the distribution of the feature vectors, and calculates model parameters of the probability model suited to clustering by using a predetermined evaluation function that evaluates the probability model. The model estimation unit 3 then uses the model parameters to calculate the posterior probability, i.e. the conditional probability that the observation signal vector y_τω belongs to each of the clusters, the number of which is set larger than the N sound sources.

When the positions of the sound sources are fixed, the feature vector x_τω in each frequency bin ideally takes a value specific to each sound source. In practice, however, there are fluctuations caused by noise, reverberation, modeling errors and the like, so the feature vectors x_τω form a cluster around a certain value for each sound source. In the present invention, therefore, the distribution of the feature vector x_τω for cluster k is modeled, for example, by a Watson distribution as follows.

Here, a_kω represents the center of the feature vector distribution for cluster k and is called the mean direction (mean orientation); κ_kω indicates how small the spread of the feature vector distribution for cluster k is and is called the density parameter (concentration parameter). M(a,b,x) is the Kummer function. For details of the Kummer function, see S. Sra and D. Karp, "The multivariate Watson distribution: maximum-likelihood estimation and other aspects", arXiv:1104.4422v2, 2012 (Reference 1). Note that the feature vector distribution is defined for each frequency bin.
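The Watson density itself is not reproduced in this text. As a sketch, the code below assumes the usual complex Watson form p(x; a, κ) = (M−1)! / (2 π^M M(1, M, κ)) · exp(κ |a^H x|²) for unit vectors x and a in C^M. The helper names `kummer_1` and `watson_log_pdf` are illustrative, and the Kummer function is evaluated by its plain power series, which is adequate for moderate κ but is not a tuned numerical routine.

```python
import math


def kummer_1(b, x, tol=1e-15, max_terms=10000):
    """Kummer function M(1, b, x) by its power series sum_n x^n / (b)_n."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= x / (b + n)  # ratio of consecutive series terms when a = 1
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def watson_log_pdf(x, a, kappa):
    """Log-density of the assumed complex Watson form at unit vector x."""
    m = len(x)
    # a^H x: inner product with the mean direction.
    inner = sum(ai.conjugate() * xi for ai, xi in zip(a, x))
    log_norm = (math.lgamma(m) - math.log(2.0) - m * math.log(math.pi)
                - math.log(kummer_1(m, kappa)))
    return log_norm + kappa * abs(inner) ** 2


# A point aligned with the mean direction is more likely than an orthogonal one.
aligned = watson_log_pdf([1, 0], [1, 0], 5.0)
orthogonal = watson_log_pdf([0, 1], [1, 0], 5.0)
```

Because the density depends on x only through |a^H x|², multiplying x by any unit-modulus complex number leaves it unchanged, which suits normalized observation vectors whose overall phase is arbitrary.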

Many sound source signals, speech included, have the property of common amplitude modulation: "the time series {|s_nτω|}_τ of the amplitude of the time-frequency transform of a sound source signal is similar across frequency bins" (see, for example, G. J. Brown, "Computational Auditory Scene Analysis: A Representational Approach", Ph.D. thesis, University of Sheffield, 1992). The present invention focuses on the fact that this property can be used as a cue for avoiding the permutation problem. Under the assumption that at most one sound source signal contributes to each time-frequency component of the mixed signal (the W-disjoint orthogonality (WDO) assumption), the property can be restated in a form that is easy to use in a clustering framework: "the time series {d(τ,ω)}_τ of the index of the cluster contributing to the observation signal is similar across frequency bins". In the present invention, this similarity of {d(τ,ω)}_τ across frequency bins is modeled as "the prior distribution P(d(τ,ω)=k) of d(τ,ω) depends on the frame τ (time-varying) and does not depend on the frequency bin (angular frequency ω) (frequency-independent)". Exploiting this commonality of amplitude modulation across frequency bins for each sound source signal makes it possible to cluster without causing permutation. This prior probability is denoted α_kτ, and it satisfies Σ_{k=1}^{K} α_kτ = 1. It is assumed here that the number of sound sources is unknown and that the model estimation device A is configured with a number of clusters exceeding the number of sound sources.

This prior probability may be assumed to change every time frame, or every block consisting of a plurality of time frames. When it is assumed to change every time frame, α_kτ is an independent variable, and a parameter to be estimated, for every cluster k and every time frame τ.

When the prior probability is instead assumed to change every block consisting of several time frames, let B be the total number of blocks, b = 1, 2, …, B the block number, J the number of time frames in each block, and j = 1, 2, …, J the frame number within a block. Then τ = (b−1)×J + j, and the values α_{k,(b−1)×J+j} (j = 1, 2, …, J) are all equal, so the mixture weight to be estimated is ~α_kb, defined by ~α_kb = α_{k,(b−1)×J+1}. Unless otherwise noted, the following describes the case in which the prior probability is assumed to change every time frame.
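The block indexing τ = (b−1)×J + j can be illustrated with a few small helpers. The function names are hypothetical, and indices are 1-based to match the text.

```python
def frame_to_block(tau, frames_per_block):
    """Map 1-based frame index tau to 1-based (block b, in-block index j)."""
    b = (tau - 1) // frames_per_block + 1
    j = (tau - 1) % frames_per_block + 1
    return b, j


def block_to_frame(b, j, frames_per_block):
    """Inverse map: tau = (b - 1) * J + j."""
    return (b - 1) * frames_per_block + j


def expand_block_weights(block_weights, frames_per_block):
    """Repeat each block's weight vector for every frame in that block."""
    return [list(w) for w in block_weights for _ in range(frames_per_block)]


# With J = 3 frames per block, frame 7 is the first frame of block 3.
b, j = frame_to_block(7, 3)
per_frame = expand_block_weights([[0.7, 0.3], [0.2, 0.8]], 2)
```

Sharing one weight vector across the J frames of a block reduces the number of mixture-weight parameters by a factor of J.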

From the above, the likelihood function of the feature vector x_τω is given by the mixture model of equation (11).

Here, Θ is the parameter set shown in equation (12).

Here, {α_kτ}_kτ is defined by equation (13).

Other similar notations are defined in the same way. Hereinafter, α_kτ is called the mixture weight. The Dirichlet distribution shown in equation (14) is used as the prior distribution of the mixture weight α_kτ.

Here, Γ is the gamma function and φ is called a hyperparameter. Setting φ sufficiently large suppresses fluctuations of the mixture weight α_kτ. The value of φ does not need fine tuning; for example, values such as φ = 1, 10, 100, or 1000 can be used.

A uniform prior distribution is assumed for the parameters other than the mixture weight α_kτ. Therefore, p(Θ) = Π_τ p({α_kτ}_k).
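Equations (11) to (14) are not reproduced in this text. A reconstruction consistent with the surrounding definitions is sketched below. It assumes a symmetric Dirichlet prior with a single hyperparameter φ shared by all K clusters, which matches the stated behavior that larger φ suppresses fluctuations of the weights. The exact typography of the original may differ.

```latex
\begin{align}
p(\mathbf{x}_{\tau\omega};\Theta) &= \sum_{k=1}^{K}\alpha_{k\tau}\,
  p(\mathbf{x}_{\tau\omega}\mid d(\tau,\omega)=k,\mathbf{a}_{k\omega},\kappa_{k\omega}) \tag{11}\\
\Theta &= \left\{\{\alpha_{k\tau}\}_{k\tau},\ \{\mathbf{a}_{k\omega}\}_{k\omega},\ \{\kappa_{k\omega}\}_{k\omega}\right\} \tag{12}\\
\{\alpha_{k\tau}\}_{k\tau} &= \left\{\alpha_{k\tau}\mid k=1,\ldots,K;\ \tau=1,\ldots,T\right\} \tag{13}\\
p(\{\alpha_{k\tau}\}_{k}) &= \frac{\Gamma(K\phi)}{\Gamma(\phi)^{K}}\prod_{k=1}^{K}\alpha_{k\tau}^{\phi-1} \tag{14}
\end{align}
```

With φ = 1 the prior in (14) is uniform over the simplex, and with large φ it concentrates around α_kτ = 1/K.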

The model estimation unit 3 fits the feature vector x_τω to the probability model defined above and, using a predetermined evaluation function that evaluates the probability model, obtains the posterior probabilities and a parameter set Θ suited to clustering.

The processing of each part of the model estimation unit 3 is described in detail below. As shown in FIG. 1, the model estimation unit 3 includes the posterior probability calculation unit 31, the parameter update unit 32, and the parameter holding unit 33. Before processing in the model estimation unit 3, initial values of the parameter set Θ are prepared in the parameter holding unit 33 (step S0). These initial values can be set, for example, as α_kτ = 1/K and κ_kω = 20, with a_kω chosen at random from {x_τω}_τω.

The posterior probability calculation unit 31 calculates, from the parameter set Θ stored in the parameter holding unit 33, the posterior probability γ_kτω, i.e. the conditional probability that d(τ,ω) = k given the feature vector x_τω, by equation (15) (step S31).
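Equation (15) is not reproduced in this text. For a mixture model the posterior follows from Bayes' rule, so the sketch below assumes γ_kτω = α_kτ p_k / Σ_l α_lτ p_l, where p_k is the cluster-k likelihood of x_τω. The function name `posterior` and the numeric values are illustrative.

```python
def posterior(alpha_t, likelihoods):
    """Posterior over clusters at one time-frequency point.

    alpha_t[k]     : mixture weight of cluster k in the current frame
    likelihoods[k] : p(x | cluster k) evaluated at this point
    """
    joint = [a * p for a, p in zip(alpha_t, likelihoods)]
    total = sum(joint)
    if total == 0.0:
        # Degenerate case: fall back to the prior weights.
        return list(alpha_t)
    return [j / total for j in joint]


# K = 3 clusters: the second cluster explains the point best.
gamma = posterior([0.5, 0.3, 0.2], [0.2, 1.0, 0.0])
```

In practice the likelihoods would usually be handled in the log domain to avoid underflow; the direct form is kept here for readability.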

As shown in FIG. 1, the parameter update unit 32 includes the mixture weight update means 321, the correlation matrix update means 322, the mean direction update means 323, the density parameter update means 324, and the permutation resolving means 325, and updates the current parameter set Θ to generate a new parameter set Θ' (step S32).

The mixture weight update means 321 updates the mixture weight α_kτ to a new value α'_kτ by calculating equation (16) using the posterior probability γ_kτω.

Here, F denotes the number of frequency bins. When φ = 1, α'_kτ is the average of the posterior probability γ_kτω over all frequency bins. As φ increases, α'_kτ approaches the constant 1/K.
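Equation (16) is not reproduced in this text. The sketch below assumes the usual MAP update under a symmetric Dirichlet prior, α'_kτ = (Σ_ω γ_kτω + φ − 1) / (F + K(φ − 1)). This form is an assumption, but it reproduces both properties stated above: the frequency average at φ = 1 and the limit 1/K for large φ. The function name is hypothetical.

```python
def update_mixture_weights(gamma_t, phi=1.0):
    """MAP update of the mixture weights for one frame.

    gamma_t[k][f] is the posterior of cluster k in frequency bin f.
    """
    K = len(gamma_t)
    F = len(gamma_t[0])
    denom = F + K * (phi - 1.0)
    return [(sum(row) + phi - 1.0) / denom for row in gamma_t]


# K = 2 clusters, F = 4 bins (made-up posteriors that sum to 1 per bin).
gamma_t = [[0.9, 0.7, 0.8, 0.6],
           [0.1, 0.3, 0.2, 0.4]]
plain = update_mixture_weights(gamma_t, phi=1.0)
smoothed = update_mixture_weights(gamma_t, phi=1000.0)
```

A larger φ pulls every frame's weights toward the uniform value, which is how the hyperparameter damps frame-to-frame fluctuation.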

The correlation matrix update means 322 updates the correlation matrix R_kω for each cluster k to a new value R'_kω by calculating equation (17) using the feature vector x_τω and the posterior probability γ_kτω.

The mean direction update means 323 updates the mean direction a_kω to a new value a'_kω, given by the normalized principal component vector of the correlation matrix R_kω.
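Equation (17) is not reproduced in this text. The sketch below assumes it is the posterior-weighted average of the outer products x_τω x_τω^H over frames, and it finds the principal component vector by power iteration. Power iteration is a technique substituted here for illustration and is not named in the text; any Hermitian eigen-solver would serve. The function names and data are hypothetical.

```python
import math


def weighted_correlation(xs, weights):
    """R = sum_t w_t x_t x_t^H / sum_t w_t for complex vectors x_t."""
    m = len(xs[0])
    total = sum(weights)
    R = [[0j] * m for _ in range(m)]
    for x, w in zip(xs, weights):
        for i in range(m):
            for j in range(m):
                R[i][j] += w * x[i] * x[j].conjugate() / total
    return R


def principal_eigenpair(R, iters=500):
    """Largest eigenvalue and unit eigenvector of a Hermitian PSD matrix."""
    m = len(R)
    v = [complex(1.0 / math.sqrt(m))] * m  # arbitrary unit starting vector
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    # The Rayleigh quotient v^H R v gives the eigenvalue estimate.
    Rv = [sum(R[i][j] * v[j] for j in range(m)) for i in range(m)]
    lam = sum(v[i].conjugate() * Rv[i] for i in range(m)).real
    return lam, v


# Three unit feature vectors; the posteriors favor the first direction.
xs = [[1, 0], [1, 0], [0, 1]]
weights = [0.9, 0.8, 0.1]
lam, a = principal_eigenpair(weighted_correlation(xs, weights))
```

Power iteration fails if the starting vector happens to be orthogonal to the principal eigenvector, so a production implementation would use a proper eigendecomposition. Because the feature vectors have unit norm, the trace of R is 1 and the largest eigenvalue lies between 0 and 1.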

The density parameter update means 324 updates the density parameter κ_kω to a new value κ'_kω by equation (18), using the largest eigenvalue λ_kω of the correlation matrix R_kω.

As shown in equations (19) to (21), the permutation resolving means 325 permutes, in each frequency bin, the mean directions a'_kω and the density parameters κ'_kω among the sound sources so that the posterior probability p(Θ'|{x_τω}_τω) is maximized, thereby resolving the permutation (step S325).

The above describes the processing when the mixture weight changes every time frame. When the mixture weight changes every block consisting of a plurality of time frames, the update equation (16) for the mixture weight α_kτ in the mixture weight update means 321 is modified by replacing the sum of the posterior probability γ_kτω within a time frame in the numerator with the sum of γ_kτω within the block, and by replacing F in the denominator with F×J. The correlation matrix update means 322, the mean direction update means 323, the density parameter update means 324, and the permutation resolving means 325 perform the same processing as in the case where the mixture weight changes every time frame.
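The block-wise variant can be sketched by applying the two substitutions described above to an assumed per-frame update of the form (Σ_ω γ_kτω + φ − 1) / (F + K(φ − 1)). That base form is an assumption, since equation (16) is not reproduced in this text. The function name and data are illustrative.

```python
def update_block_weights(gamma_block, phi=1.0):
    """Block-wise mixture weight update.

    gamma_block[k][j][f] is the posterior of cluster k for in-block
    frame j and frequency bin f. The numerator sums over the whole block
    and the denominator uses F * J in place of F.
    """
    K = len(gamma_block)
    J = len(gamma_block[0])
    F = len(gamma_block[0][0])
    denom = F * J + K * (phi - 1.0)
    return [
        (sum(sum(frame) for frame in cluster) + phi - 1.0) / denom
        for cluster in gamma_block
    ]


# K = 2 clusters, J = 2 frames per block, F = 2 bins (made-up posteriors).
gamma_block = [
    [[0.9, 0.7], [0.8, 0.6]],
    [[0.1, 0.3], [0.2, 0.4]],
]
block_weights = update_block_weights(gamma_block)
```

With φ = 1 the result is simply the average posterior over all F×J time-frequency points of the block.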

The grounds for deriving each update equation in the parameter update unit 32 are explained below. The parameter updates are based on the EM (Expectation-Maximization) algorithm, in which {d(τ,ω)}_τω is treated as a hidden variable.

First, the cost function L(Θ) for MAP (maximum a posteriori) estimation is given by equations (22) to (24).

Here, the {x_τω}_τω are assumed to be mutually independent, and constant terms that do not depend on Θ are ignored. This objective function is maximized under the constraint shown in equation (25).

The objective function L(Θ) takes a large value when there is no permutation problem, so maximizing L(Θ) avoids the permutation problem. Indeed, as the first term of equation (24) shows, L(Θ) becomes large when, for those k and τ at which the mixture weight α_kτ is large, the likelihood p(x_τω|d(τ,ω)=k,a_kω,κ_kω) for cluster k is also large. Maximizing L(Θ) therefore synchronizes the likelihood time series {p(x_τω|d(τ,ω)=k,a_kω,κ_kω)}_τ for cluster k across frequency bins. Combined with the property stated above, that "the time series {d(τ,ω)}_τ of the index of the sound source contributing to the observation signal is similar across frequency bins", this shows that L(Θ) takes a large value when there is no permutation problem. The evaluation function (Q function) used in the EM algorithm is given by equations (26) and (27).
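Equations (22) to (24) are not reproduced in this text. Under the stated assumptions (independent feature vectors, a symmetric Dirichlet prior with hyperparameter φ on the mixture weights, uniform priors on the other parameters, and constants dropped), the MAP objective would take roughly the following form. The grouping into lines is a guess and is left unnumbered.

```latex
\begin{align}
L(\Theta) &= \ln p\!\left(\Theta \mid \{\mathbf{x}_{\tau\omega}\}_{\tau\omega}\right) \\
 &= \ln p\!\left(\{\mathbf{x}_{\tau\omega}\}_{\tau\omega} \mid \Theta\right) + \ln p(\Theta) + \text{const.} \\
 &= \sum_{\tau}\sum_{\omega}\ln\sum_{k=1}^{K}\alpha_{k\tau}\,
    p(\mathbf{x}_{\tau\omega}\mid d(\tau,\omega)=k,\mathbf{a}_{k\omega},\kappa_{k\omega})
    + (\phi-1)\sum_{\tau}\sum_{k=1}^{K}\ln\alpha_{k\tau} + \text{const.}
\end{align}
```

In the first term of the last line the same weight α_kτ multiplies the cluster-k likelihood in every frequency bin of frame τ. The term is therefore large only when the bins agree on which cluster is active in that frame, which is the synchronization argument made above.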

The updated parameter set Θ' is defined by the following equation.

That is, Θ' is derived as the parameter set that maximizes the Q function under the constraint of equation (25). Specifically, equation (16), which gives the new value α'_kτ of the mixture weight α_kτ, is derived from equations (28) and (29) by the method of Lagrange multipliers.

Here, μ is the Lagrange multiplier.
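The Lagrange-multiplier step can be sketched generically as follows. This is only an illustration under assumptions that are not stated in this passage: that constraint (25) is the normalization Σ_k α_kτ = 1, and that the part of the Q function depending on α_kτ has the form Σ_k c_kτ log α_kτ, where c_kτ ≥ 0 is a hypothetical placeholder for whatever coefficients equations (28) and (29) collect.

```latex
\begin{align}
\mathcal{J} &= \sum_{k} c_{k\tau}\,\log\alpha_{k\tau}
              + \mu\Bigl(1-\sum_{k}\alpha_{k\tau}\Bigr), \\
\frac{\partial\mathcal{J}}{\partial\alpha_{k\tau}}
            &= \frac{c_{k\tau}}{\alpha_{k\tau}}-\mu = 0
              \;\Longrightarrow\; \alpha_{k\tau}=\frac{c_{k\tau}}{\mu}, \\
\sum_{k}\alpha_{k\tau}=1
            &\;\Longrightarrow\; \mu=\sum_{k'}c_{k'\tau},
              \qquad
              \alpha'_{k\tau}=\frac{c_{k\tau}}{\sum_{k'}c_{k'\tau}} .
\end{align}
```

Under these assumptions the multiplier μ simply acts as the normalizing constant of the updated weights.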

The update formula for the mean direction is obtained by solving the following problem.

By the Courant-Fischer theorem, the mean direction is therefore updated to the eigenvector (principal component vector) corresponding to the largest eigenvalue of R_kω.
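The eigenvector step can be illustrated with NumPy. This is a minimal sketch, assuming R_kω is available as a Hermitian matrix; the function name `principal_eigenvector` and the 2×2 example matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def principal_eigenvector(R):
    """Return (largest eigenvalue, unit-norm eigenvector) of a Hermitian matrix R."""
    # eigh is intended for Hermitian input and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    return eigvals[-1], eigvecs[:, -1]

# Hypothetical 2x2 Hermitian matrix standing in for a correlation matrix R_kω.
R = np.array([[2.0, 1.0j],
              [-1.0j, 2.0]])
lam, a = principal_eigenvector(R)
```

The returned `lam` plays the role of the largest eigenvalue λ_kω, and `a` is a candidate for the updated mean direction. An eigenvector is only defined up to a unit-modulus complex factor, so a comparison between iterations would have to ignore that factor.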

For the update formula (18) of the density parameter, equation (31) is first obtained from ∂Q/∂κ_kω = 0.

Here, the following holds.

λ_kω is the largest eigenvalue of the correlation matrix R_kω. The above equation can be solved approximately as shown in equation (32).

The parameter holding unit 33 stores the parameter set Θ' obtained by the update processing in the parameter update unit 32 (step S33). In the next round of processing by the posterior probability calculation unit 31, it supplies the stored parameter set Θ' as the parameter set. The parameter set Θ estimated with the model estimation device of the first embodiment can be used for sound source localization, for sound source separation when the number of sound sources is unknown, for simultaneous estimation of the number of sound sources and the separated sounds, and so on.

The processing from step S31 to step S33 is repeated until a preset maximum number of iterations max_iter is reached, or until the change in each parameter caused by the update in the parameter update unit 32 becomes smaller than the convergence threshold Δ (step S91). Concrete values can be, for example, max_iter = 100 and Δ = 10⁻¹⁰.
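The stopping rule can be sketched as a plain loop. This is a hedged illustration, not the patent's implementation: `em_step` is a hypothetical stand-in for one pass through steps S31 to S33, the parameters are flattened into one array, and "change in each parameter" is assumed to mean the largest absolute element-wise change.

```python
import numpy as np

def run_em(theta0, em_step, max_iter=100, delta=1e-10):
    """Iterate em_step until max_iter is reached or the update changes theta by less than delta."""
    theta = np.asarray(theta0, dtype=float)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        new_theta = em_step(theta)
        # Assumed convergence measure: largest absolute change of any parameter.
        change = np.max(np.abs(new_theta - theta))
        theta = new_theta
        if change < delta:
            break
    return theta, n_iter

# Toy update with fixed point 2.0, used only to exercise the loop.
theta, n_iter = run_em([0.0], lambda t: 0.5 * t + 1.0)
```

With the example values max_iter = 100 and Δ = 10⁻¹⁰ the toy update stops well before the iteration cap. A real EM step would recompute the posterior probabilities and then every parameter in turn.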

In step S91, when the processing in the model estimation unit 3 has reached the maximum number of iterations max_iter, or when the change in each parameter caused by the update has become smaller than the threshold Δ, the model estimation unit 3 outputs the post-iteration posterior probability γ^o_kτω.

[Second Embodiment]
The second embodiment of the present invention is configured as a sound source number estimation device that uses the model estimation device A of the first embodiment.

A functional configuration example of the sound source number estimation device B of the second embodiment is described with reference to FIG. 3. In addition to the units of the model estimation device A of the first embodiment, the sound source number estimation device B includes a sound source number estimation unit 4.

The sound source number estimation unit 4 estimates, as the number of sound sources, the number of clusters whose posterior probability takes a significant value. Specifically, it receives the posterior probability of each cluster computed by the posterior probability calculation unit 31 and uses it to calculate the total posterior probability of each cluster, for example with equation (33). The calculation is not limited to the full total; for example, a partial sum of the posterior probabilities restricted to a specific frequency range may be calculated instead.
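The per-cluster total and the band-limited partial sum can be sketched as array reductions. This is an illustration under assumptions: the posterior probabilities are held in an array `gamma` with axis order (cluster, time frame, frequency bin), and equation (33) is taken to sum over all time-frequency points. The function name and the random toy data are hypothetical.

```python
import numpy as np

def posterior_totals(gamma, freq_bins=slice(None)):
    """Sum posteriors over time frames and the selected frequency bins, per cluster."""
    gamma = np.asarray(gamma, dtype=float)
    return gamma[:, :, freq_bins].sum(axis=(1, 2))

# Toy posteriors: K=4 clusters, T=5 frames, F=6 bins, normalized over clusters.
rng = np.random.default_rng(0)
raw = rng.random((4, 5, 6))
gamma = raw / raw.sum(axis=0, keepdims=True)

totals = posterior_totals(gamma)                 # full sum per cluster
low_band = posterior_totals(gamma, slice(0, 3))  # partial sum over bins 0..2
```

Because the toy posteriors sum to one over clusters at every time-frequency point, the totals add up to the number of points included in the sum.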

The sound source number estimation unit 4 then clusters the per-cluster totals of posterior probability into two groups and estimates the number of clusters belonging to the group with the larger totals as the number of sound sources.

An operation example of the sound source number estimation device B is described in procedural order with reference to FIG. 4. The processing from step S0 to step S91 is the same as in the operation example of the model estimation device A of the first embodiment, so a detailed description is omitted.

The sound source number estimation unit 4 calculates the total posterior probability of each cluster (step S41). Specifically, it receives the posterior probability of each cluster computed by the posterior probability calculation unit 31 and uses it to calculate the total for each cluster.

The sound source number estimation unit 4 then clusters the per-cluster totals of posterior probability into two groups (step S42), for example by applying k-means clustering with two clusters to the totals.

Next, the sound source number estimation unit 4 estimates the number of clusters belonging to the group with the larger totals as the number of sound sources (step S43). For example, it takes the number of clusters assigned to the group with the larger centroid as the number of sound sources.
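Steps S42 and S43 can be sketched with a small one-dimensional 2-means routine. This is an illustrative sketch only: the initialization at the minimum and maximum, the handling of the all-equal case, the function names, and the toy totals are assumptions and are not taken from the patent.

```python
import numpy as np

def two_means_1d(values, max_iter=100):
    """Split scalar values into two groups; True marks the group with the larger centroid."""
    v = np.asarray(values, dtype=float)
    lo, hi = v.min(), v.max()          # assumed initialization of the two centroids
    if lo == hi:
        # Degenerate case (all totals equal): arbitrarily treat every cluster as significant.
        return np.ones(v.shape, dtype=bool)
    for _ in range(max_iter):
        high = np.abs(v - hi) < np.abs(v - lo)   # assign each value to its nearer centroid
        new_lo, new_hi = v[~high].mean(), v[high].mean()
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return high

def estimate_num_sources(totals):
    """Count the clusters that fall in the larger-centroid group."""
    return int(two_means_1d(totals).sum())

# Made-up totals for four clusters, one of which is clearly small.
n_sources = estimate_num_sources([310.0, 295.0, 12.0, 280.0])
```

For these made-up totals three values land in the larger-centroid group, so the estimate is 3. A library k-means could be used instead; the hand-written loop just keeps the sketch dependency-free.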

The sound source number estimation unit 4 is not limited to k-means clustering with two clusters as exemplified above. As simpler processing, it may apply a predetermined threshold to the per-cluster totals of posterior probability. Specifically, it may determine whether each cluster's total is at least the predetermined threshold and estimate the number of clusters whose totals are at least the threshold as the number of sound sources. Compared with this threshold method, k-means clustering with two clusters is expected to be more robust to changes in conditions such as the reverberation time.
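The simpler threshold variant amounts to a count. A minimal sketch, assuming the per-cluster totals are already available as a sequence; the function name, the toy totals, and the threshold value are made up for illustration.

```python
import numpy as np

def estimate_num_sources_threshold(totals, threshold):
    """Count clusters whose posterior total is at least the threshold."""
    return int(np.count_nonzero(np.asarray(totals, dtype=float) >= threshold))

# Made-up totals and threshold.
n_sources_thr = estimate_num_sources_threshold([310.0, 295.0, 12.0, 280.0], threshold=100.0)
```

The result depends directly on the chosen threshold, which is the sensitivity the text contrasts with the two-cluster k-means approach.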

The first embodiment was described on the premise that the number of sound sources is unknown and that the number of clusters is set larger than the number of sound sources. When more clusters than sound sources are set in this way and the parameter update processing and posterior probability calculation described in paragraphs 0040 to 0065 are performed, only as many clusters as there are sound sources acquire significant posterior probabilities, and the remaining clusters acquire small posterior probabilities. FIG. 5 shows an example of the posterior probabilities obtained by the clustering method described in the second embodiment with three sound sources and four clusters. In FIG. 5, the horizontal axis is time, the vertical axis is frequency, and brighter points indicate larger posterior probabilities.

In other words, the sound source activity takes significant values only for as many clusters as there are actual sound sources. In the example of FIG. 5, the three clusters "cluster1", "cluster2", and "cluster4" have significant posterior probabilities, whereas "cluster3" has only a small posterior probability and cannot be said to have a significant one.

Consequently, when the sound source number estimation unit 4 applies k-means clustering with two clusters as described above, the totals are split into one group containing the totals of the three clusters "cluster1", "cluster2", and "cluster4" and another group containing only the total of "cluster3". Since the group with the larger totals has three members ("cluster1", "cluster2", and "cluster4"), the number of sound sources is estimated to be 3.

[Experimental Results]
The results of an experiment in which the sound source number estimation device B performed sound source number estimation are described here.

FIG. 6 shows the accuracy obtained in the sound source number estimation experiment. As shown in FIG. 6, with φ = 600 an accuracy of 100% was obtained under all conditions. Accuracy tended to improve as φ was increased from 1 to 600. When φ was increased further from 600 to 1000, however, the accuracy for N = 3 sound sources and a reverberation time of 370 ms fell from 100% to 88%.

As described above, the number of sound sources can be estimated appropriately even under conditions in which this has conventionally been difficult, for example when the number of sound sources exceeds the number of microphones or when reverberation is present.

Next, the principle of the present invention is described. First, the clustering-based sound source separation method described in Non-Patent Document 1 is explained. This method assumes that the number of sound sources is known and performs clustering with as many clusters as there are sound sources. The technique of Non-Patent Document 1 is described with reference to FIG. 7. Each × mark in FIG. 7 represents a sound source position feature (for example, the phase difference, time difference, or amplitude ratio between microphones) at a time-frequency point. As shown in FIG. 7, the × marks of the same sound source are enclosed in a circle: within each frequency, the features of one sound source tend to concentrate in the same region of the feature space. The shape of that region depends on the frequency, but the signal of one sound source tends to rise, and to fall, simultaneously at all frequencies. In other words, sound source activity is synchronized across frequencies. By modeling this synchrony appropriately and incorporating it into a global optimization, this clustering achieves simultaneous clustering of the sound source position features and the sound source activity.

As described above, the technique of Non-Patent Document 1 performs clustering with as many clusters as there are sound sources. In light of conventional technical common sense, changing this to a configuration that uses more clusters than sound sources was not something that could easily be conceived. This is because, in clustering-based sound source separation, using more clusters than sound sources normally causes one sound source to split into several clusters, as shown in FIG. 8, so that the clusters no longer correspond to sound sources. FIG. 8 is an example with two sound sources and four clusters.

In the course of the research leading to the present invention, however, the following entirely new finding was obtained. When the clustering of Non-Patent Document 1 is changed to use more clusters than sound sources, one sound source does not split into several clusters, contrary to the technical common sense described above; instead, each sound source gathers into a single cluster. That is, because the prior-art clustering appropriately models the synchrony of sound source activity, the situation of FIG. 8 does not occur, and each sound source gathers into one cluster as in FIG. 9. FIG. 9 is an example with two sound sources and three clusters. As shown in FIG. 9, sound source 1 and sound source 2 each gather into one cluster, and the remaining cluster has a small posterior probability.

Based on this finding, the present invention changes the clustering of Non-Patent Document 1 to a configuration that uses more clusters than sound sources. In addition, when determining the number of sound sources from the clustering result, the present invention clusters the per-cluster totals of posterior probability into a large group and a small group and takes the number of members of the large group as the number of sound sources. This allows the number of sound sources to be estimated more robustly than with a fixed threshold. The method of the present invention makes it possible to perform sound source separation and to estimate the number of sound sources even when the number of sound sources is unknown. As the experimental results above show, the method was confirmed to estimate the number of sound sources very well.

The sound source number estimation method of the present invention can also estimate the number of sound sources appropriately when the number of sound sources exceeds the number of microphones or when reverberation is present. Indeed, the experiment showed that the method estimates the number of sound sources very well both when the number of sound sources exceeds the number of microphones and when the reverberation time is relatively long.

As described above, with k denoting the cluster index, τ the time frame index, and ω the angular frequency, the sound source number estimation device B extracts a feature vector x_τω from observed signals obtained by observing, with M microphones, a mixture of signals from N sound sources; the feature vector x_τω corresponds to the observed signal vector y_τω composed of the time-frequency components of the observed signals. The device then fits the feature vector x_τω to a predetermined probability model, estimates the model parameters using an evaluation function that gives a higher value the more closely the likelihood time series of each sound source is synchronized across frequency bins, and uses those parameters to compute the posterior probability, that is, the conditional probability that the observed signal vector y_τω belongs to each of the clusters, which are set to be more numerous than the N sound sources. The device then calculates the total posterior probability of each cluster and estimates the number of sound sources from these totals. In this way, the sound source number estimation device B can estimate the number of sound sources appropriately even under conditions close to a real environment.

The sound source number estimation device B also clusters the per-cluster totals of posterior probability into two groups and estimates the number of clusters belonging to the group with the larger totals as the number of sound sources. This allows it to estimate the number of sound sources more robustly than when a threshold is used.

Alternatively, the sound source number estimation device B may determine whether each cluster's total posterior probability is at least a predetermined threshold and estimate the number of clusters whose totals are at least the threshold as the number of sound sources. In this case the number of sound sources can be estimated more simply than with the two-group clustering described above.

[Third Embodiment]
The third embodiment of the present invention is configured as a sound source number estimation device C, in which a sound source separation unit 5 and a time domain conversion unit 6 are added to the configuration of the sound source number estimation device B of the second embodiment.

A functional configuration example of the sound source number estimation device C of the third embodiment is described with reference to FIG. 10. In addition to the units of the sound source number estimation device B of the second embodiment, the sound source number estimation device C includes a sound source separation unit 5 and a time domain conversion unit 6. The sound source separation unit 5 includes a mask creation unit 51 and a separated sound creation unit 52.

An operation example of the sound source number estimation device C is described in procedural order with reference to FIG. 11. The processing from step S0 to step S43 is the same as in the operation example of the sound source number estimation device B of the second embodiment, so a detailed description is omitted. The sound source number estimation unit 4 outputs the indices k(1), k(2), …, k(^N) of the clusters whose posterior probability takes a significant value and passes them to the sound source separation unit 5.

The sound source separation unit 5 estimates the time-frequency transform ^s_nτω of each separated sound (n is the sound source index), using the indices k(1), k(2), …, k(^N) of the clusters with significant posterior probability output by the sound source number estimation unit 4, the time-frequency transform y_τω of the mixed sound output by the frequency domain conversion unit 1, and the post-iteration posterior probability γ^o_kτω output by the posterior probability calculation unit 31.

The mask creation unit 51 obtains the masks m_nτω for the number of sound sources estimated by the sound source number estimation unit 4, using the indices k(1), k(2), …, k(^N) of the clusters with significant posterior probability output by the sound source number estimation unit 4 and the post-iteration posterior probability γ^o_kτω output by the posterior probability calculation unit 31 (step S51). First, from the post-iteration posterior probabilities γ^o_kτω, the mask creation unit 51 obtains the significant posterior probabilities γ^s_nτω as γ^s_nτω = γ^o_{k(n)τω}. Here, n = 1, 2, …, ^N, and the superscript s of γ stands for "significant". Next, using the significant post-iteration posterior probabilities γ^s_nτω, the mask creation unit 51 computes the estimate ^d^s(τ,ω) of the active sound source index d^s(τ,ω) by equation (34).

Next, the mask creation unit 51 computes the mask m_nτω by equation (35).

The mask creation unit 51 may instead obtain the mask m_nτω by equation (36).

The separated sound creation unit 52 multiplies the time-frequency transform y_1τω of the mixed sound by the mask m_nτω according to equation (37) to compute the time-frequency transform ^s_nτω of the separated sound. The frequency-domain observed signal is thereby separated for each sound source (step S52).
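Steps S51 and S52 can be sketched as follows. This is a hedged illustration under assumptions about equations that are not reproduced here: equation (34) is assumed to pick, at each time-frequency point, the index n with the largest γ^s_nτω; equation (35) is assumed to be the corresponding binary mask; and equation (37) is the element-wise product of the mask with y_1τω. The array layout (cluster, frame, bin), the function names, and the toy values are hypothetical.

```python
import numpy as np

def binary_masks(gamma, significant):
    """Binary masks for the significant clusters; gamma has shape (K, T, F)."""
    gamma_s = np.asarray(gamma)[list(significant)]     # gamma^s_n = gamma^o_{k(n)}
    d_hat = gamma_s.argmax(axis=0)                     # assumed form of the index estimate
    n_idx = np.arange(gamma_s.shape[0])[:, None, None]
    return (d_hat[None, :, :] == n_idx).astype(float)  # 1 where source n is judged active

def apply_masks(masks, y_ref):
    """Multiply each mask with the reference-channel mixture spectrum y_1."""
    return masks * np.asarray(y_ref)[None, :, :]

# Toy posteriors for K=3 clusters on a 2x2 time-frequency grid; cluster 1 is insignificant.
gamma = np.array([
    [[0.7, 0.1], [0.2, 0.6]],
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.2, 0.8], [0.7, 0.3]],
])
y_ref = np.array([[1 + 1j, 2 + 0j], [0 + 3j, 4 - 1j]])

masks = binary_masks(gamma, significant=[0, 2])
s_hat = apply_masks(masks, y_ref)
```

With binary masks each time-frequency point of the mixture is assigned to exactly one estimated source, so the separated spectra add back up to the reference-channel mixture.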

For each sound source n, the time domain conversion unit 6 converts the time-frequency-domain separated signal ^s_nτω into the time-domain separated signal ~^s_nt and outputs it (step S6).

As described above, the sound source number estimation device C can perform sound source separation after estimating the number of sound sources, even when the number of sound sources is unknown. It can also realize a sound source separation technique that does not suffer from the permutation problem and does not require two-stage processing. This makes it easy to realize, for example, online sound source separation for speech enhancement in a time-varying environment in which the sound source positions change over time.

[System Configuration, etc.]
The components of each illustrated device are functional concepts and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the one illustrated; all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the model estimation unit 3 and the sound source number estimation unit 4 may be integrated. Furthermore, all or any part of the processing functions performed in each device can be realized by a CPU and a program analyzed and executed by that CPU, or as hardware using wired logic.

Of the processes described in this embodiment, all or part of those described as being performed automatically can also be performed manually, and all or part of those described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
A program can also be created in which the processing executed by the sound source number estimation devices B and C of the above embodiments is written in a computer-executable language. In this case, the same effects as in the above embodiments are obtained when a computer executes the program. The same processing as in the above embodiments may also be realized by recording such a program on a computer-readable recording medium and having a computer read and execute the recorded program. An example of a computer that executes a sound source number estimation program implementing the same functions as the sound source number estimation devices B and C is described below.

FIG. 12 shows a computer that executes the sound source number estimation program. As shown in FIG. 12, the computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1041, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. A mouse 1110 and a keyboard 1120, for example, are connected to the serial port interface 1050. A display 1130, for example, is connected to the video adapter 1060.

As shown in FIG. 12, the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. The data described in the above embodiments are stored in, for example, the hard disk drive 1090 or the memory 1010.

The sound source number estimation program is stored in the hard disk drive 1090 as, for example, a program module describing the instructions to be executed by the computer 1000. Specifically, a program module describing each process executed by the sound source number estimation devices B and C of the above embodiments is stored in the hard disk drive 1090.

Data used for information processing by the sound source number estimation program is stored as program data in, for example, the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as necessary and executes the procedures described above.

Note that the program module 1093 and the program data 1094 of the sound source number estimation program need not be stored in the hard disk drive 1090. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 of the distributed data processing program may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

A  Model estimation device
B, C  Sound source number estimation device
1  Frequency domain conversion unit
2  Feature extraction unit
3  Model estimation unit
31  Posterior probability calculation unit
32  Parameter update unit
321  Mixture weight updating means
322  Correlation matrix updating means
323  Average direction updating means
324  Density parameter updating means
325  Permutation solving means
33  Parameter holding unit
4  Sound source number estimation unit
5  Sound source separation unit
51  Mask creation unit
52  Separated sound creation unit
6  Time domain conversion unit

Claims

A sound source number estimation device, where k is a cluster index, τ is a time frame index, and ω is an angular frequency, the device comprising:
a feature extraction unit that extracts, from observation signals obtained by observing with M microphones a mixed signal in which signals from N sound sources are mixed, a feature vector x_τω corresponding to an observation signal vector y_τω composed of the time-frequency components of the observation signals;
a model estimation unit that applies the feature vector x_τω to a predetermined probability model, estimates the model parameters of the probability model using an evaluation function that gives a higher evaluation value the more closely the time series of likelihood of each sound source is synchronized between frequency bins, and uses the model parameters to calculate a posterior probability, which is the conditional probability that the observation signal vector y_τω belongs to each of a number of clusters set larger than the N sound sources; and
a sound source number estimation unit that estimates, as the number of sound sources, the number of clusters for which the posterior probability takes a significant value,
wherein
the probability model is a mixture model represented by a weighted sum of the distributions of the feature vector x_τω for the respective sound sources,
the mixture weights of the probability model depend on the time frame τ and do not depend on the angular frequency ω, and
the model parameters of the probability model are the parameters of the distribution of the feature vector x_τω for each sound source and the mixture weights.
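
Written out, the mixture model described in this claim can be sketched as follows. The symbols K (the number of clusters) and θ_kω (the distribution parameters for cluster k at angular frequency ω) are introduced here for illustration only and are not named in the claim text.

```latex
p(x_{\tau\omega})
  = \sum_{k=1}^{K} \alpha_{k\tau}\, p(x_{\tau\omega} \mid \theta_{k\omega}),
\qquad
\alpha_{k\tau} \ge 0, \quad \sum_{k=1}^{K} \alpha_{k\tau} = 1 .
```

The mixture weight α_kτ carries a time index but no frequency index, which is what ties the cluster activity together across frequency bins.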

The sound source number estimation device according to claim 1,
wherein the sound source number estimation unit calculates the sum of the posterior probabilities for each cluster, groups the per-cluster sums into two groups, and estimates, as the number of sound sources, the number of clusters belonging to the group with the larger sums.
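
The two-group split in this claim can be illustrated with a minimal sketch. The claim does not say which clustering algorithm is used, so the one-dimensional 2-means procedure below is an assumption, as are the function name `count_sources_two_means` and the nested-list layout `post[k][t][f]` (posterior of cluster k at frame t and frequency bin f).

```python
def count_sources_two_means(post, n_iter=50):
    """Estimate the number of sources by splitting per-cluster posterior sums in two.

    post[k][t][f] is the posterior probability of cluster k at time frame t and
    frequency bin f (hypothetical layout, chosen for this sketch).
    Returns (estimated_count, indices_of_active_clusters).
    """
    # Sum the posterior over all time-frequency points, separately per cluster.
    sums = [sum(sum(row) for row in cluster) for cluster in post]
    lo, hi = min(sums), max(sums)
    if hi == lo:
        # All sums are identical, so no meaningful split exists.
        return len(sums), list(range(len(sums)))
    # 1-D 2-means (an assumed choice): centroids start at the extremes.
    for _ in range(n_iter):
        big = [s for s in sums if abs(s - hi) <= abs(s - lo)]
        small = [s for s in sums if abs(s - hi) > abs(s - lo)]
        new_hi, new_lo = sum(big) / len(big), sum(small) / len(small)
        if new_hi == hi and new_lo == lo:
            break
        hi, lo = new_hi, new_lo
    active = [k for k, s in enumerate(sums) if abs(s - hi) <= abs(s - lo)]
    return len(active), active
```

Because the centroids start at the smallest and largest sums, both groups stay non-empty whenever the sums are not all equal.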

The sound source number estimation device according to claim 1,
wherein the sound source number estimation unit calculates the sum of the posterior probabilities for each cluster, determines whether the sum for each cluster is equal to or greater than a predetermined threshold, and estimates, as the number of sound sources, the number of clusters whose sum is equal to or greater than the predetermined threshold.
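
The threshold rule in this claim is simple enough to sketch directly. The function name `count_sources_threshold`, the data layout `post[k][t][f]`, and any particular threshold value are assumptions for illustration; the claim only states that the threshold is predetermined.

```python
def count_sources_threshold(post, threshold):
    """Count clusters whose summed posterior reaches a predetermined threshold.

    post[k][t][f] is the posterior probability of cluster k at time frame t and
    frequency bin f (hypothetical layout). The threshold is on the raw sum, so
    it must be chosen relative to the number of time-frequency points.
    Returns (estimated_count, indices_of_active_clusters).
    """
    sums = [sum(sum(row) for row in cluster) for cluster in post]
    active = [k for k, s in enumerate(sums) if s >= threshold]
    return len(active), active
```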

The sound source number estimation device according to any one of claims 1 to 3,
wherein the distribution of the feature vector x_τω for cluster k is a Watson distribution with average direction a_kω and density parameter κ_kω, and
the parameters of the distribution of the feature vector x_τω for cluster k are the average direction a_kω and the density parameter κ_kω.
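
For reference, the standard complex Watson density on the unit sphere in C^M has the form below. This sketch assumes that the claim refers to the complex form (the feature vectors are derived from complex time-frequency components) and that both x_τω and a_kω have unit norm; ₁F₁ is Kummer's confluent hypergeometric function.

```latex
p_{\mathrm{W}}(x_{\tau\omega};\, a_{k\omega}, \kappa_{k\omega})
  = \frac{(M-1)!}{2\pi^{M}\; {}_1F_1(1;\, M;\, \kappa_{k\omega})}
    \exp\!\left( \kappa_{k\omega}\, \bigl| a_{k\omega}^{\mathsf{H}} x_{\tau\omega} \bigr|^{2} \right),
\qquad \| x_{\tau\omega} \| = \| a_{k\omega} \| = 1 .
```

With κ_kω = 0 the density reduces to the uniform density on the sphere, and larger κ_kω concentrates the mass around the axis a_kω.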

The sound source number estimation device according to any one of claims 1 to 4,
wherein the prior distribution of the mixture weights is a Dirichlet distribution over the mixture weights in which a hyperparameter φ that does not depend on the cluster k is used as the exponent of each mixture weight.
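
A symmetric Dirichlet prior of this kind can be sketched as follows, using the common parameterization in which the exponent is φ − 1. Whether the patent's own equation uses φ or φ − 1 as the exponent cannot be recovered from this text, so the exact exponent is an assumption, as is the symbol K for the number of clusters.

```latex
p(\alpha_{1\tau}, \dots, \alpha_{K\tau})
  = \frac{\Gamma(K\phi)}{\Gamma(\phi)^{K}} \prod_{k=1}^{K} \alpha_{k\tau}^{\,\phi - 1},
\qquad
\alpha_{k\tau} \ge 0, \quad \sum_{k=1}^{K} \alpha_{k\tau} = 1 .
```

Under this parameterization a value φ < 1 favors sparse weight vectors, in which the weights of unneeded clusters shrink toward zero.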

The sound source number estimation device according to any one of claims 1 to 5,
wherein the model estimation unit comprises:
a posterior probability calculation unit that calculates, given the feature vector x_τω, the conditional probability that the observation signal vector y_τω corresponding to the feature vector x_τω belongs to cluster k, based on the product of the distribution of the feature vector x_τω for cluster k and the mixture weight of cluster k at time frame τ;
mixture weight updating means for updating the mixture weights based on the conditional probability and the hyperparameter φ that does not depend on the cluster k;
correlation matrix updating means for calculating a correlation matrix R_kω for cluster k based on the conditional probability and the feature vector x_τω;
average direction updating means for updating the average direction a_kω to the normalized principal component vector of the correlation matrix R_kω;
density parameter updating means for updating the density parameter κ_kω based on the maximum eigenvalue of the correlation matrix R_kω; and
permutation solving means for reordering the average direction a_kω and the density parameter κ_kω among the sound sources for each frequency bin so that the evaluation function is maximized.
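
The first two means of this claim (posterior calculation and mixture weight update) can be sketched in plain Python under several assumptions: the distribution is the complex Watson density with its standard normalizer, the weight update is the MAP estimate under a symmetric Dirichlet prior with exponent φ − 1, and a small floor keeps the weights positive. All function names and the list layouts are hypothetical. The correlation matrix, average direction, density parameter, and permutation steps are not shown.

```python
import math

def kummer_1f1_1(m, kappa, n_terms=200):
    """Series for 1F1(1; m; kappa) = sum_n kappa^n / (m)_n, for moderate kappa >= 0."""
    total, term = 1.0, 1.0
    for n in range(n_terms):
        term *= kappa / (m + n)
        total += term
    return total

def log_watson(x, a, kappa):
    """Log of the complex Watson density for unit-norm x and a (lists of complex)."""
    m = len(x)
    inner = sum(ai.conjugate() * xi for ai, xi in zip(a, x))  # a^H x
    log_norm = (math.lgamma(m) - math.log(2.0) - m * math.log(math.pi)
                - math.log(kummer_1f1_1(m, kappa)))
    return log_norm + kappa * abs(inner) ** 2

def posteriors(x, a, kappa, alpha):
    """gamma[k][t][f] proportional to alpha[k][t] * Watson(x[t][f]; a[k][f], kappa[k][f])."""
    K, T, F = len(a), len(x), len(x[0])
    gamma = [[[0.0] * F for _ in range(T)] for _ in range(K)]
    for t in range(T):
        for f in range(F):
            logs = [math.log(alpha[k][t]) + log_watson(x[t][f], a[k][f], kappa[k][f])
                    for k in range(K)]
            top = max(logs)  # subtract the maximum for numerical stability
            w = [math.exp(v - top) for v in logs]
            z = sum(w)
            for k in range(K):
                gamma[k][t][f] = w[k] / z
    return gamma

def update_weights(gamma, phi, floor=1e-12):
    """Assumed MAP update: alpha[k][t] proportional to sum_f gamma[k][t][f] + phi - 1."""
    K, T = len(gamma), len(gamma[0])
    alpha = [[0.0] * T for _ in range(K)]
    for t in range(T):
        # The floor is a safeguard added in this sketch for phi < 1.
        raw = [max(sum(gamma[k][t]) + phi - 1.0, floor) for k in range(K)]
        z = sum(raw)
        for k in range(K):
            alpha[k][t] = raw[k] / z
    return alpha
```

Because the weights are indexed by frame only, `update_weights` pools the posteriors of all frequency bins in a frame, which is how frequency-independent weights couple the bins.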

The sound source number estimation device according to claim 6,
wherein γ_kτω is the conditional probability, α_kτ is the mixture weight, d(τ, ω) is the number of the cluster contributing to the observation signal vector y_τω, F is the number of frequency bins, ·^H denotes the Hermitian transpose, and λ_kω is the maximum eigenvalue of the correlation matrix R_kω,
the posterior probability calculation unit calculates the conditional probability by the following equation:
(equation not reproduced in this text)
the mixture weight updating means updates the mixture weight to the new value α′_kτ obtained by the following equation:
(equation not reproduced in this text)
the correlation matrix updating means updates the correlation matrix R_kω to the new value R′_kω obtained by the following equation:
(equation not reproduced in this text)
and the density parameter updating means updates the density parameter κ_kω to the new value κ′_kω obtained by the following equation:
(equation not reproduced in this text)
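
The equations of this claim did not survive in this text. As a hedged sketch only, the forms below are what the wording of claim 6 and the symbol definitions above suggest for a Watson mixture model with a symmetric Dirichlet prior. They are assumptions, not the patent's equations: p_W denotes the Watson density, K the number of clusters, and the exponent convention φ − 1 is the common one. The density parameter update, which depends on the maximum eigenvalue λ_kω, is not reconstructed here.

```latex
\begin{align}
\gamma_{k\tau\omega}
  &= P\bigl(d(\tau,\omega) = k \mid x_{\tau\omega}\bigr)
   = \frac{\alpha_{k\tau}\, p_{\mathrm{W}}(x_{\tau\omega};\, a_{k\omega}, \kappa_{k\omega})}
          {\sum_{l=1}^{K} \alpha_{l\tau}\, p_{\mathrm{W}}(x_{\tau\omega};\, a_{l\omega}, \kappa_{l\omega})}, \\
\alpha'_{k\tau}
  &= \frac{\sum_{\omega} \gamma_{k\tau\omega} + \phi - 1}{F + K(\phi - 1)}, \\
R'_{k\omega}
  &= \frac{\sum_{\tau} \gamma_{k\tau\omega}\, x_{\tau\omega} x_{\tau\omega}^{\mathsf{H}}}
          {\sum_{\tau} \gamma_{k\tau\omega}} .
\end{align}
```

With this normalization the eigenvalues of R′_kω lie between 0 and 1, and the eigenvector of the largest one, λ_kω, gives the updated average direction described in claim 6.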

The sound source number estimation device according to any one of claims 1 to 7,
wherein n is the index of a sound source and ^N is the number of sound sources estimated by the sound source number estimation unit,
the sound source number estimation unit outputs the cluster numbers k(1), k(2), ..., k(^N) for which the posterior probability takes a significant value, and
the device further comprises:
a mask creation unit that obtains masks m_nτω, as many as the number of sound sources estimated by the sound source number estimation unit, using the cluster numbers k(1), k(2), ..., k(^N) for which the posterior probability takes a significant value and the posterior probabilities, which are the conditional probabilities of belonging to the respective clusters; and
a separated sound creation unit that calculates separated sounds in the time-frequency domain from the observation signal vector y_τω using the masks m_nτω.
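
The mask creation and separation steps can be sketched as below. The claim does not state how the mask is formed from the posteriors, so this sketch assumes the simplest soft mask, m_nτω = γ_k(n)τω, and assumes that the mask is applied to the component of y_τω from a single reference microphone. The function names and the layouts `gamma[k][t][f]` and `y_ref[t][f]` are hypothetical.

```python
def make_masks(gamma, selected):
    """Soft masks (assumed form): mask n is the posterior of selected cluster k(n).

    gamma[k][t][f] is the posterior of cluster k; selected lists the cluster
    numbers k(1), ..., k(N_hat) reported by the source number estimator.
    """
    return [[list(row) for row in gamma[k]] for k in selected]

def apply_masks(masks, y_ref):
    """Multiply each mask with the reference-microphone coefficients y_ref[t][f].

    Returns separated[n][t][f], one time-frequency signal per estimated source.
    """
    return [
        [[m[t][f] * y_ref[t][f] for f in range(len(y_ref[t]))]
         for t in range(len(y_ref))]
        for m in masks
    ]
```

The separated time-frequency signals would then be passed to the time domain conversion unit 6.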

A sound source number estimation method, where k is a cluster index, τ is a time frame index, and ω is an angular frequency, the method comprising:
a step of extracting, from observation signals obtained by observing with M microphones a mixed signal in which signals from N sound sources are mixed, a feature vector x_τω corresponding to an observation signal vector y_τω composed of the time-frequency components of the observation signals;
a step of applying the feature vector x_τω to a predetermined probability model, estimating the model parameters of the probability model using an evaluation function that gives a higher evaluation value the more closely the time series of likelihood of each sound source is synchronized between frequency bins, and using the model parameters to calculate a posterior probability, which is the conditional probability that the observation signal vector y_τω belongs to each of a number of clusters set larger than the N sound sources; and
a step of estimating, as the number of sound sources, the number of clusters for which the posterior probability takes a significant value,
wherein
the probability model is a mixture model represented by a weighted sum of the distributions of the feature vector x_τω for the respective sound sources,
the mixture weights of the probability model depend on the time frame τ and do not depend on the angular frequency ω, and
the model parameters of the probability model are the parameters of the distribution of the feature vector x_τω for each sound source and the mixture weights.

A sound source number estimation program for causing a computer to function as the sound source number estimation device according to claim 1.