JP5881454B2

JP5881454B2 - Apparatus and method for estimating spectral shape feature quantity of signal for each sound source, apparatus, method and program for estimating spectral feature quantity of target signal

Info

Publication number: JP5881454B2
Application number: JP2012029791A
Authority: JP
Inventors: 中谷智広; 吉岡拓也; 荒木章子; マーク・デルクロア; 藤本雅清
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-02-14
Filing date: 2012-02-14
Publication date: 2016-03-09
Anticipated expiration: 2032-02-14
Also published as: JP2013167698A

Description

The present invention relates to a technique for estimating, for each sound source, a spectral shape feature of its signal, and to a technique for estimating a spectral feature of a target signal, from an observed signal in which acoustic signals emitted by a plurality of sound sources are mixed and picked up by a microphone.

FIG. 1 shows an example of the functional configuration of the conventional target signal spectral feature estimation device disclosed in Non-Patent Documents 1, 2, and 3. For each sound source contained in the observed signal (hereinafter distinguished by the index $m = 1, 2, \ldots$), the target signal spectral feature estimation device 900 includes a discrete state number model storage unit 901-m, a state-dependent spectrum model storage unit 902-m, and a discrete state number selection unit 903-m associated with that sound source. It further includes a feature extraction unit 904, a sound source occupancy update unit 905, and a target sound spectrum estimation unit 906. FIG. 1 illustrates the case of two sound sources for simplicity, but three or more sound sources can also be handled in general; in that case, a separate discrete state number model storage unit 901-m, state-dependent spectrum model storage unit 902-m, and discrete state number selection unit 903-m is provided for each sound source.

In the following description, the number of sound sources is $N(m)$. The sound source $m = 1$ is the target sound source, and the sound sources $m > 1$ are background sound sources. Let $n$ denote time and $k$ ($= 1, 2, \ldots, N(k)$) denote frequency. A variable that takes a value at a time-frequency point $(n, k)$ is written $x_{n,k}$, and the vector that collects its values over all frequencies (for example, a spectral feature) is written $x_n = [x_{n,1}, \ldots, x_{n,N(k)}]^T$. $T$ denotes the non-conjugate transpose of a matrix or vector.

The discrete state number model storage unit 901-m stores, for the sound source $m$, a prior probability function $p(q^{(m)})$ of the discrete state number $q^{(m)}$, which represents the discrete state of the spectral feature in a short-time frame. The discrete state number $q^{(m)}$ takes one of $N_q^{(m)}$ ($> 0$) values (a natural number from 1 to $N_q^{(m)}$).

The state-dependent spectrum model storage unit 902-m stores, for the sound source $m$, the conditional probability density function $p(s^{(m)} \mid q^{(m)})$ of the spectral feature $s^{(m)}$ of the sound source $m$ given the discrete state number $q^{(m)}$.

The feature extraction unit 904 takes as input the time-domain signal $x(t)$ picked up by the microphone and extracts the spectral feature $x_n$ of the observed signal for each short-time frame $n$.

The discrete state number selection unit 903-m receives the spectral feature $x_n$ of the observed signal, the prior probability function $p(q^{(m)})$ of the discrete state number and the conditional probability density function $p(s^{(m)} \mid q^{(m)})$ of the spectral feature for the sound source $m$, and the estimated sound source occupancy $\hat{M}_n^{(m)}$ ($= [\hat{M}_{n,1}^{(m)}, \ldots, \hat{M}_{n,N(k)}^{(m)}]^T$). It obtains the estimate $\hat{q}_n^{(m)}$ of the discrete state number of the sound source $m$ that best fits the observed sound.

The sound source occupancy update unit 905 receives the spectral feature $x_n$ of the observed sound, the conditional probability density functions $p(s^{(m)} \mid q^{(m)})$ of the spectral features for all sound sources $m$, and the set of discrete state number estimates $\{\hat{q}_n^{(1)}, \ldots, \hat{q}_n^{(N(m))}\}$ selected for the respective sound sources. It obtains the estimated sound source occupancy $\hat{M}_{n,k}^{(m)}$, which is the posterior probability that the sound source $m$ has the largest energy at each time-frequency point.

Further, the target sound spectrum estimation unit 906 takes as input the estimated sound source occupancy $\hat{M}_n^{(1)}$ of the target sound, the estimated discrete state number $\hat{q}_n^{(1)}$ of the target sound, the spectral feature $x_n$ of the observed signal output by the feature extraction unit 904, and the conditional probability density function $p(s^{(1)} \mid q^{(1)})$ of the spectral feature of the target sound, and obtains an estimate of the spectrum of the target sound.

So that the sound source occupancy update unit 905 and each discrete state number selection unit 903-m can efficiently obtain the sound source occupancy and the discrete state number that best fit the spectral feature of the observed signal, the conventional target signal spectral feature estimation device 900 relies on the following two assumptions.
<1>
At each time-frequency point $(n, k)$, the spectral feature $x_{n,k}$ of the observed signal equals the largest of the spectral features $s_{n,k}^{(m)}$ of the sound sources $m$. That is,
$x_{n,k} = \max\{s_{n,k}^{(1)}, \ldots, s_{n,k}^{(N(m))}\}$ (1)
<2>
Given the discrete state number $q^{(m)}$, the spectral feature $s_n^{(m)}$ of each sound source signal is independent across frequencies. That is, the conditional probability density function $p(s_n^{(m)} \mid q^{(m)})$ of the spectral feature can be factorized as follows, where $s_n^{(m)} = [s_{n,1}^{(m)}, \ldots, s_{n,N(k)}^{(m)}]^T$.
$p(s_n^{(m)} \mid q^{(m)}) = \prod_{k=1}^{N(k)} p(s_{n,k}^{(m)} \mid q^{(m)})$ (2)

As disclosed in Non-Patent Documents 1, 2, and 3, under these two assumptions the discrete state numbers and the sound source occupancy can be estimated by updating them alternately and repeatedly. In each iteration, the discrete state number can be updated separately (independently) for each sound source, and the sound source occupancy can be updated separately (independently) for each frequency. Because neither update has to consider combinations across sound sources or combinations across frequencies, each update can be performed at low computational cost.

Non-Patent Document 1: Tomohiro Nakatani, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, "Sound source separation based on DOA clustering and a log-spectral HMM of speech," Proceedings of the Acoustical Society of Japan 2010 Autumn Meeting, pp. 577-580, September 2010.
Non-Patent Document 2: Tomohiro Nakatani, Shoko Araki, Takuya Yoshioka, Masakiyo Fujimoto, "Multichannel sound source separation based on unsupervised joint learning of a source spectrum HMM and a source direction model," Proceedings of the Acoustical Society of Japan 2011 Spring Meeting, pp. 805-808, March 2011.
Non-Patent Document 3: Tomohiro Nakatani, Shoko Araki, Marc Delcroix, Takuya Yoshioka, Masakiyo Fujimoto, "An integrated speech recognition approach robust to nonstationary noise: statistical model-based speech enhancement based on a source direction GMM and a log-spectral GMM," Proceedings of the Acoustical Society of Japan 2011 Autumn Meeting, September 2011.

The conventional target signal spectral feature estimation device has the problem that, when assumption <2> above does not hold, the computational cost of obtaining the discrete state numbers and the sound source occupancy becomes enormous. To estimate the spectrum of the target signal at low computational cost, the models therefore had to conform to that assumption. For this reason, the spectrum of each signal was modeled by a Gaussian mixture model that represents the distribution of the log power spectrum with discrete states (or by a hidden Markov model that additionally models the time transitions of those discrete states). In particular, to satisfy assumption <2>, the covariance matrix of every Gaussian in the Gaussian mixture model (or hidden Markov model) had to be restricted to a diagonal matrix.

This restriction, however, limits the signal distributions that the spectrum model can represent, so the model cannot always represent the true distribution of a signal accurately. As a result, the spectra of the target sound and the background sounds cannot be estimated more accurately than the restricted distribution is able to represent them. More specifically, a Gaussian mixture model over mel-frequency cepstral coefficients (hereinafter MFCC), or a hidden Markov model over MFCC, is often used in speech recognition and is currently considered one of the best models of the spectral shape of a speech signal, yet the conventional method cannot use it as the spectrum model of a signal. The reason is that, in this model, the Gaussian associated with each discrete state defines a distribution over MFCC, and each MFCC coefficient depends on the whole spectral shape across frequencies. The distribution of the signal spectrum associated with each discrete state therefore cannot, in general, be independent across frequencies.
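The coupling across frequencies described above can be checked numerically. The following is a minimal sketch, assuming the cepstral coefficients are a full orthonormal DCT-II of the log-spectral vector (practical MFCC front ends also apply a mel filter bank and truncate the DCT). The function names `dct_matrix` and `spectrum_covariance` are hypothetical and do not come from the patent.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix D (n x n), so that c = D s and s = D^T c."""
    rows = []
    for h in range(n):
        scale = math.sqrt(1.0 / n) if h == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * h * (2 * k + 1) / (2.0 * n))
                     for k in range(n)])
    return rows


def spectrum_covariance(cepstral_variances):
    """Covariance of s = D^T c when c has a diagonal covariance.

    Assumption for illustration: cov(c) = diag(cepstral_variances),
    hence cov(s) = D^T diag(v) D.
    """
    n = len(cepstral_variances)
    d = dct_matrix(n)
    return [[sum(d[h][i] * cepstral_variances[h] * d[h][j] for h in range(n))
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    cov = spectrum_covariance([4.0, 1.0, 0.25, 0.1])
    for row in cov:
        print(["%.3f" % v for v in row])
```

Unless all cepstral variances are equal, the off-diagonal entries are nonzero, so a Gaussian that is diagonal in the cepstral domain is not independent across frequencies in the spectral domain.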

In view of this problem, an object of the present invention is to provide a technique for estimating, for each sound source, a spectral shape feature of its signal, and a technique for estimating a spectral feature of a target signal, from an observed signal in which acoustic signals emitted by a plurality of sound sources are mixed and picked up by a microphone, such that the spectrum of the target signal can be estimated efficiently even when, unlike the restrictive models of the prior art, a more general unrestricted model is used (a highly accurate model in which spectral values are correlated across frequencies).

The spectral shape feature estimation of the present invention estimates spectral shape features from an observed signal in which acoustic signals from a plurality of sound sources are mixed and picked up. Each spectral shape feature is a vector of continuous scalar values that represents the shape of the spectral feature of one acoustic signal. The estimation uses, for each sound source, a prior probability density function of the spectral shape feature (the spectral shape model) and a conditional probability density function of the spectral feature given the spectral shape feature (the spectrum observation model). Specifically, an optimization function whose latent variable is the dominant sound source number, which indicates the sound source of the acoustic signal having the largest energy at each time-frequency point, is maximized using the spectral shape models and the spectrum observation models. This yields estimates of the spectral shape feature of each sound source and of the posterior probability (the sound source occupancy) that each sound source is the one indicated by the dominant sound source number, given the spectral shape features of all sound sources. The spectral shape model is a probability density function that has correlation across frequencies. The optimization function is expressed as the product of the conditional probability density function of the spectral feature of the observed signal given the spectral shape features of all sound sources and the prior probability density functions of the spectral shape features defined for the individual sound sources.

The target signal spectral feature estimation technique of the present invention estimates the spectral feature of the target signal from the following: the spectral shape feature corresponding to the sound source of the target signal and the sound source occupancy of that sound source, both taken from the per-source spectral shape features and sound source occupancies estimated by the spectral shape feature estimation technique of the present invention; the spectrum observation model corresponding to the sound source of the target signal; and the spectral feature of the observed signal.

According to the present invention, as detailed in the description of the embodiments, a highly accurate model in which spectral values are correlated across frequencies can be introduced as the spectral shape model. The spectrum of the target signal can then be estimated at low computational cost and with higher accuracy than in the conventional example.

A diagram showing an example of the functional configuration of the conventional target signal spectral feature estimation device.
A diagram showing an example of the functional configuration of the target signal spectral feature estimation device.
A diagram showing an example of the functional configuration of the target signal spectral feature estimation device.

《Outline of the present technology》
An outline of the present technology is given first, followed by its details.
FIG. 2 shows a configuration example of the spectral shape feature estimation device / target signal spectral feature estimation device. For each sound source $m$ among the plurality of sound sources contained in the observed signal, the target signal spectral feature estimation device 100 includes a spectral shape model storage unit 101-m, a spectrum observation model storage unit 102-m, and a spectral shape estimation unit 103-m associated with that sound source. It further includes a feature extraction unit 104, a sound source occupancy update unit 105, and a target sound spectrum estimation unit 106. The spectral shape feature estimation device 100p differs from the target signal spectral feature estimation device 100 in that it does not include the target sound spectrum estimation unit 106. FIG. 2 illustrates the case of two sound sources for simplicity; when three or more sound sources are considered, a separate spectral shape model storage unit 101-m, spectrum observation model storage unit 102-m, and spectral shape estimation unit 103-m is provided for each sound source.

Rather than existing independently on its own, the spectral shape feature estimation device of the embodiments may in practice exist as a component of a device that estimates the spectral feature of the target signal using the obtained maximum-likelihood spectral shape features and the like (the target signal spectral feature estimation device of the embodiments). Put differently, the spectral shape feature estimation device need not be a component that is readily separable from the target signal spectral feature estimation device; it can also be regarded as the target signal spectral feature estimation device itself, viewed from the standpoint of one particular function. In short, in some cases it is practical to regard the spectral shape feature estimation device as the target signal spectral feature estimation device itself.
This is not intended, however, to exclude the spectral shape feature estimation device existing as a standalone component, or being a component that is readily separable from the target signal spectral feature estimation device. For example, if the purpose is to estimate the spectral shape feature of each sound source in itself, nothing prevents the spectral shape feature estimation device from being implemented as a standalone component.
Here, the target signal spectral feature estimation device / spectral shape feature estimation device is assumed to be implemented on a computer, such as a dedicated machine built from dedicated hardware or a general-purpose machine such as a personal computer, and the spectral shape feature estimation device is described as a component of the target signal spectral feature estimation device.
An example hardware configuration for implementing the target signal spectral feature estimation device / spectral shape feature estimation device as a standalone component on a computer is described later.

The spectral shape model storage unit 101-m stores, for the sound source $m$, the prior probability density function $p(c^{(m)})$ of the spectral shape feature $c^{(m)}$ for a short-time frame. This prior probability density function is hereinafter called the spectral shape model.

The spectrum observation model storage unit 102-m stores, for the sound source $m$, the conditional probability density function $p(s^{(m)} \mid c^{(m)})$ of the spectral feature $s^{(m)}$ given the spectral shape feature $c^{(m)}$. Because the spectral feature $s^{(m)}$ cannot be determined uniquely from a given spectral shape feature $c^{(m)}$, it is assumed to contain stochastic fluctuation. The conditional probability density function $p(s^{(m)} \mid c^{(m)})$ of the spectral feature $s^{(m)}$ models this fluctuation. This function is hereinafter called the spectrum observation model.

The feature extraction unit 104 takes as input the time-domain signal $x(t)$ picked up by the microphone and extracts the spectral feature $x_n$ of the observed signal in each short-time frame $n$.

The spectral shape estimation unit 103-m receives the spectral feature $x_n$ of the observed signal, the sound source occupancy $\hat{M}_n^{(m)}$ of the sound source $m$, the spectral shape model $p(c^{(m)})$ of the sound source $m$, and the spectrum observation model $p(s^{(m)} \mid c^{(m)})$ of the sound source $m$, and estimates the spectral shape feature $\hat{c}_n^{(m)}$ of the sound source $m$.

The sound source occupancy update unit 105 receives the spectral feature $x_n$ of the observed signal from the feature extraction unit 104, the spectral shape feature $\hat{c}_n^{(m)}$ of each sound source $m$ from the corresponding spectral shape estimation unit 103-m, and the spectrum observation model $p(s^{(m)} \mid c^{(m)})$ of each sound source $m$ from the corresponding spectrum observation model storage unit 102-m, and estimates the sound source occupancy $\hat{M}_n^{(m)}$ of each sound source $m$.
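As an illustration of what a sound source occupancy can look like at a single time-frequency point, the sketch below assumes that the observation equals the largest source feature and that each source feature is an independent Gaussian whose mean and standard deviation are implied by the current spectral shape estimate. The Gaussian form and the names `occupancy`, `means`, and `stds` are assumptions for illustration; the text here does not fix the form of the spectrum observation model.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, std):
    return 0.5 * math.erfc(-(x - mean) / (std * math.sqrt(2.0)))


def occupancy(x_k, means, stds):
    """Posterior probability that each source is dominant at one point.

    Source m is dominant when its feature equals x_k (density term) while
    every other source lies at or below x_k (cumulative terms).
    Assumes at least one score is nonzero.
    """
    scores = []
    for m in range(len(means)):
        score = normal_pdf(x_k, means[m], stds[m])
        for other in range(len(means)):
            if other != m:
                score *= normal_cdf(x_k, means[other], stds[other])
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    # Source 0 is expected near the observation, source 1 far below it.
    print(occupancy(10.0, [10.0, 2.0], [1.0, 1.0]))
```

Each point uses only scalar quantities for that frequency, which matches the statement that the occupancy can be updated separately for each frequency.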

The target sound spectrum estimation unit 106 receives the spectral feature $x_n$ of the observed signal from the feature extraction unit 104, the estimated spectral shape feature $\hat{c}_n^{(1)}$ of the target signal from the spectral shape estimation unit 103-1, the estimated sound source occupancy $\hat{M}_n^{(1)}$ of the target signal from the sound source occupancy update unit 105, and the conditional probability density function $p(s^{(1)} \mid c^{(1)})$ of the spectral feature of the target signal from the spectrum observation model storage unit 102-1, and obtains an estimate of the spectrum of the target signal.
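One way these four inputs could be combined at a single time-frequency point is a minimum mean square error estimate: with probability equal to the occupancy the target is dominant and equals the observation, and otherwise it lies below the observation. This is a sketch under a Gaussian observation-model assumption and is not necessarily the estimator used in the patent. The names `estimate_target` and `truncated_mean_below` are hypothetical.

```python
import math


def truncated_mean_below(x, mean, std):
    """Mean of a Gaussian restricted to values at or below x."""
    z = (x - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    if cdf == 0.0:
        # Far below the mean the truncated mean approaches x itself.
        return x
    return mean - std * pdf / cdf


def estimate_target(x_k, occ, mean, std):
    """Occupancy-weighted estimate of the target feature at one point.

    occ  : occupancy of the target source (probability it is dominant)
    mean : mean of the assumed Gaussian observation model for the target
    std  : its standard deviation
    """
    return occ * x_k + (1.0 - occ) * truncated_mean_below(x_k, mean, std)


if __name__ == "__main__":
    print(estimate_target(5.0, 0.9, 4.0, 1.0))
```

When the occupancy is 1 the estimate is the observation itself, and when it is 0 the estimate falls back on the model, bounded above by the observation.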

Furthermore, so that the spectral shape feature of each sound source $m$ can be obtained efficiently, the present technology relies on the following assumptions.
<1>
At each time-frequency point $(n, k)$, the spectral feature $x_{n,k}$ of the observed signal equals the largest of the spectral features $s_{n,k}^{(m)}$ of the sound sources $m$. That is, equation (3) holds. This assumption is hereinafter called the LogMax model.
$x_{n,k} = \max\{s_{n,k}^{(1)}, \ldots, s_{n,k}^{(N(m))}\}$ (3)
<2>
Given the spectral shape feature $c_n^{(m)}$ of each sound source, the spectral feature $s_n^{(m)}$ of each sound source signal is independent across frequencies. That is, the conditional probability density function $p(s_n^{(m)} \mid c_n^{(m)})$ of the spectral feature can be factorized as follows, where $s_n^{(m)} = [s_{n,1}^{(m)}, \ldots, s_{n,N(k)}^{(m)}]^T$.
$p(s_n^{(m)} \mid c_n^{(m)}) = \prod_{k=1}^{N(k)} p(s_{n,k}^{(m)} \mid c_n^{(m)})$ (4)

Of these, assumption <1> is the same as in the conventional example, while assumption <2> differs from it.
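A common rationale for the LogMax assumption, offered here as an assumption rather than something stated in the text, is that when the spectral feature is a log power and the source powers add approximately, the log of the sum is close to the largest log power. The sketch below compares the two; `log_sum_power` and `logmax` are hypothetical names.

```python
import math


def log_sum_power(log_powers):
    """Log of the summed powers, computed stably (assumes powers add)."""
    peak = max(log_powers)
    return peak + math.log(sum(math.exp(v - peak) for v in log_powers))


def logmax(log_powers):
    """LogMax approximation: keep only the dominant source."""
    return max(log_powers)


if __name__ == "__main__":
    for pair in ([0.0, -10.0], [0.0, -2.0], [0.0, 0.0]):
        gap = log_sum_power(pair) - logmax(pair)
        print(pair, "approximation gap:", round(gap, 5))
```

For two sources the gap never exceeds log 2, and it shrinks quickly as one source dominates, which is the situation the sound source occupancy is meant to capture.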

In the above configuration, the most important difference from the conventional example is the following. The conventional example represents the model of the spectral feature of each sound source as the combination of a discrete state model and a state-dependent spectrum model, whereas the present invention represents it as the combination of a spectral shape model and a spectrum observation model.

In the conventional example, the features related to the shape of the spectrum are represented together with the spectrum observation process in the state-dependent spectrum model, and that model has to assume that spectral values are independent across frequencies.

In the present invention, by contrast, only the spectrum observation model has to be assumed independent across frequencies, and no constraint is placed on the spectral shape model. A spectral shape model in which spectral values are correlated across frequencies can therefore be introduced. For example, a highly accurate spectral shape model such as a Gaussian mixture model over MFCC can be used.

Under the two assumptions above, the spectral shape features and the sound source occupancy can be estimated by updating them alternately and repeatedly. In each iteration, the spectral shape feature can be updated separately for each sound source, and the sound source occupancy can be updated separately for each frequency. Because neither update has to consider combinations of values across sound sources or combinations of values across frequencies, each update can be performed at low computational cost.

《Details of the present technology》
Let $c_n = H(s_n)$ be a feature transform function that extracts the spectral shape feature $c_n$ from the spectral feature $s_n$ of a signal. Here, $n$ is the short-time frame index. The spectral feature $s_n$ is defined as the vector $s_n = [s_{n,1}, s_{n,2}, \ldots, s_{n,N(k)}]^T$ whose elements are continuous scalar values $s_{n,k}$ corresponding to the individual frequencies. The spectral shape feature $c_n$ is defined as the vector $c_n = [c_{n,1}, c_{n,2}, \ldots, c_{n,N(h)}]^T$ whose elements are continuous scalar values $c_{n,h}$ corresponding to the individual dimensions of the shape parameters. $H(\cdot)$ is an arbitrary function that extracts features related to the shape of the spectrum. For example, if $s_n$ is the output of a log mel filter bank, as often used in speech recognition, and $c_n$ is its MFCC, then $H(s_n)$ corresponds to the discrete cosine transform $D(s_n)$ of $s_n$, that is, $H(s_n) = D(s_n)$. If instead $s_n$ is the log power spectrum and $c_n$ is the corresponding MFCC, then $H(s_n)$ corresponds to applying to $s_n$ the inverse of the logarithm, namely the exponential function (written $\exp(\cdot)$), then mel filter bank processing (written $\mathrm{mfb}(\cdot)$), and then the logarithm ($\log(\cdot)$) again. That is, $H(s_n) = \log(\mathrm{mfb}(\exp(s_n)))$.
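The two example transforms named above can be sketched as follows. The orthonormal DCT-II scaling and the tiny two-band filter bank are assumptions for illustration (a real mel filter bank uses triangular weights on a mel-spaced grid), and the function names `dct` and `log_filterbank` are hypothetical.

```python
import math


def dct(s):
    """H(s) = D(s): orthonormal DCT-II of a log filter bank output vector."""
    n = len(s)
    out = []
    for h in range(n):
        scale = math.sqrt(1.0 / n) if h == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            s[k] * math.cos(math.pi * h * (2 * k + 1) / (2.0 * n))
            for k in range(n)))
    return out


def log_filterbank(s, weights):
    """H(s) = log(mfb(exp(s))) for a log power spectrum s.

    weights: one row of non-negative filter weights per output band
             (a toy stand-in for a mel filter bank).
    """
    power = [math.exp(v) for v in s]
    return [math.log(sum(w * p for w, p in zip(row, power)))
            for row in weights]


if __name__ == "__main__":
    print(dct([1.0, 1.0, 1.0, 1.0]))
    toy_bank = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
    print(log_filterbank([0.0, 0.0, 0.0], toy_bank))
```

Both functions map a per-frequency vector to a shape vector in which every output depends on several input frequencies.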

The following discussion basically assumes that the feature extraction unit 104 extracts a plurality of short-time frames from the observed signal. To keep the explanation simple, however, the description below (for example, Embodiments 1 and 2) presents the processing as confined to a single short-time frame $n$; for a plurality of short-time frames, the same processing is applied to each frame individually. This does not necessarily hold in the other embodiments.

[Definition of the optimization function]
Let $s_n^{(m)}$ be the spectral feature of sound source $m$ in short-time frame $n$. The spectral features and spectral shape features of all sound sources are collected as follows, where $N(m)$ is the number of sound sources.
$S_n = \{s_n^{(1)}, \ldots, s_n^{(N(m))}\}$ (5)
$S_{n,k} = \{s_{n,k}^{(1)}, \ldots, s_{n,k}^{(N(m))}\}$ (6)
$C_n = \{c_n^{(1)}, \ldots, c_n^{(N(m))}\}$, where $c_n^{(m)} = H(s_n^{(m)})$ (7)

As a step toward estimating the target sound spectrum, the present technique obtains the spectral shape feature of each sound source $m$ by maximum a posteriori (MAP) estimation, as defined by Equation (8).

As described later, this MAP estimation is realized mainly by iterative processing performed by the spectral shape estimation units 103-m and the sound source occupancy estimation unit 105.

To perform the MAP estimation, the present technique assumes that two models are given in advance for each sound source $m$ (or, as described later, can be estimated from the observed signal whose target sound spectrum is to be estimated). The first is the spectral shape model $p(c_n^{(m)})$, that is, the prior probability density function of the spectral shape feature. The second is the spectrum observation model $p(s_n^{(m)} \mid c_n^{(m)})$, that is, the conditional probability density function of the spectral feature given the spectral shape feature.

No particular assumption is placed on the spectral shape model $p(c_n^{(m)})$, so any model that represents a probability density function of $c_n^{(m)}$ can be used. Typical examples are a normal distribution, a Gaussian mixture model, and a hidden Markov model over $c_n^{(m)}$.

The spectrum observation model $p(s_n^{(m)} \mid c_n^{(m)})$ is defined as the inverse process of the feature transform $H(s)$ described above. In general, however, $c_n = H(s_n)$ is a many-to-one transform, so its inverse is not uniquely determined and there is freedom in how to define it. One example is given here. Let $s = G(c)$ be a function that maps $c$ to a representative point in the set of $s$ satisfying $c = H(s)$, and call it the pseudo-inverse transform of $H(s)$. Further, call $e = s - G(H(s))$ the inverse transform error. Assuming that $e$ follows a normal distribution with mean 0 and covariance matrix $\Xi$, the spectrum observation model can be defined as follows, where $\mathrm{Nd}(x; \mu, \Sigma)$ denotes the probability density function of a normal distribution with mean $\mu$ and covariance matrix $\Sigma$.
$p(s_n^{(m)} \mid c_n^{(m)}) = \mathrm{Nd}(s_n^{(m)}; G(c_n^{(m)}), \Xi)$ (9)

Furthermore, based on Equation (4), this spectrum observation model $p(s_n^{(m)} \mid c_n^{(m)})$ is assumed to factorize into a product of conditional probability density functions defined at the individual frequencies. For the model above, this means that the covariance matrix is a diagonal matrix $\Xi = \mathrm{diag}\{\xi_k\}$. Writing $s_k = G_k(c)$ for the function that gives the $k$-th element of $s = [s_1, s_2, \ldots, s_{N(k)}]$ with $s = G(c)$, Equation (9) can be factorized as in Equation (10).
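Equation (10) itself is not reproduced in this text. A factorized form consistent with Equation (9), the diagonal covariance $\Xi = \mathrm{diag}\{\xi_k\}$, and the element-wise function $G_k(\cdot)$ would be the following sketch; the product limits are an assumption.

```latex
p(s_n^{(m)} \mid c_n^{(m)})
  = \prod_{k=1}^{N(k)} p(s_{n,k}^{(m)} \mid c_n^{(m)}),
\qquad
p(s_{n,k}^{(m)} \mid c_n^{(m)})
  = \mathrm{Nd}\!\left(s_{n,k}^{(m)};\; G_k(c_n^{(m)}),\; \xi_k\right)
```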

The above assumes that the expected value of $e$ is $E\{e\} = 0$, where $E\{\cdot\}$ denotes the expectation of a random variable. When this value is not 0, a model with zero expected inverse transform error can still be obtained by using $G'(c)$, defined as $G'(c) = G(c) - E\{e\}$, as the pseudo-inverse transform. The above also assumes that the spectrum observation model (that is, the inverse transform error) follows a normal distribution, but the same definition is possible with a general distribution other than the normal distribution.

Next, in order to treat the assumption of Equation (3) introduced in the present invention probabilistically, the following relations, Equations (11) and (12), are introduced, as in the conventional example.

Here, $d_n = [d_{n,1}, \ldots, d_{n,N(k)}]$ is the dominant sound source number, which indicates, at each time-frequency point, the number of the sound source with the largest energy. For example, $d_{n,k} = 1$ indicates that sound source $m = 1$ (that is, the target sound) is the most dominant source at the time-frequency point $(n, k)$. $\delta(\cdot)$ is the Dirac delta function. Equation (11) means that the spectral feature $x_{n,k}$ of the observed signal equals the spectral feature $s_{n,k}^{(m)}$, with $m = d_{n,k}$, of the sound source indicated by the dominant sound source number. Equation (12) means that the dominant sound source number equals the number of the sound source with the largest spectral feature value. This is clearly equivalent to Equation (3). The relationship among $x_{n,k}$, $d_{n,k}$ and $C_n$ can then be derived from Equations (4), (11) and (12) as follows.
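The deterministic relationship stated by Equations (11) and (12) can be illustrated with a small sketch: at each frequency the observed value is taken from the source with the largest spectral feature, and the dominant sound source number records which source that was. The function name `log_max_mix` and the toy values are assumptions made for the example.

```python
def log_max_mix(S_n):
    """Return (x_n, d_n) for source spectra S_n, where S_n[m][k] is the
    spectral feature of source m+1 at frequency k."""
    num_freqs = len(S_n[0])
    x_n, d_n = [], []
    for k in range(num_freqs):
        column = [source[k] for source in S_n]
        winner = max(range(len(column)), key=column.__getitem__)
        d_n.append(winner + 1)      # 1-based dominant sound source number
        x_n.append(column[winner])  # observed value equals the dominant source
    return x_n, d_n

# Two sources (m = 1 is the target), three frequencies.
S_n = [[3.0, 0.5, 2.0],
       [1.0, 2.5, 2.2]]
x_n, d_n = log_max_mix(S_n)
```

In the estimation problem only `x_n` is observed; `d_n` and the individual source spectra are unknown, which is why `d_n` is treated as a hidden variable.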

Equation (13) is the conditional joint probability density function, at each frequency, of the spectral feature of the observed signal and the dominant sound source number, given the spectral shape features of all sound sources. It is the product of the following two kinds of probability functions.
1. For the sound source that matches the dominant sound source number: the probability function, under that source's spectrum observation model, that its spectral feature takes the same value as the spectral feature of the observed signal (the first factor on the right-hand side).
2. For every other sound source: the probability function, under that source's spectrum observation model, that its spectral feature takes a value less than or equal to the spectral feature of the observed signal (each integral term in the second factor on the right-hand side).
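The two factors listed above can be sketched numerically for the Gaussian observation model of Equation (9). This is an illustration under stated assumptions, not the patent's Equation (13) verbatim: the names `normal_pdf`, `normal_cdf` and `joint_density` are hypothetical, and a single variance shared by all sources is assumed for brevity.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a scalar normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def normal_cdf(x, mean, var):
    """Probability that a scalar normal variable is <= x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def joint_density(x_k, d, means_k, var_k):
    """p(x_k, d | C) at one frequency.

    means_k[m] plays the role of G_k(c^(m)); d is a 0-based source index.
    """
    # Factor 1: the dominant source takes exactly the observed value.
    value = normal_pdf(x_k, means_k[d], var_k)
    # Factor 2: every other source lies at or below the observed value.
    for l, mean in enumerate(means_k):
        if l != d:
            value *= normal_cdf(x_k, mean, var_k)
    return value

p = joint_density(0.0, 0, [0.0, 0.0], 1.0)
```

Each call touches only one frequency, which reflects the per-frequency factorization assumed for the observation model.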

Finally, using Equation (13), the MAP function $p(x_n, C_n)$ in Equation (8) is defined as follows.

Here, $d_{n,k}$ is treated as a hidden variable. The first factor on the right-hand side of Equation (14) corresponds to the spectrum observation model and the LogMax model, and the second factor corresponds to the spectral shape model of each sound source. By finding, for all sound sources, the spectral shape features that maximize Equation (14), the present technique realizes an estimation that takes the spectral shape model, the spectrum observation model, and the LogMax model into account together.

In more detail, Equation (14) is the joint probability density function of the spectral feature $x_n$ of the observed signal and the spectral shape features $C_n$ of all sound sources, and its right-hand side is the product of two factors. The first factor is the conditional probability density function of the spectral feature of the observed signal given the spectral shape features of all sound sources, that is, $p(x_n \mid C_n)$. This is referred to below as the observed signal spectrum observation model. The second factor is the product of the spectral shape models, that is, of the prior probability density functions of the spectral shape features defined for each sound source: $\prod_l p(c_n^{(l)})$.
As the right-hand side of Equation (14) shows, the observed signal spectrum observation model has the following properties.
1. It factorizes into a product of conditional probability density functions, one per frequency: $p(x_n \mid C_n) = \prod_k p(x_{n,k} \mid C_n)$.
2. The conditional probability density function for each frequency has, as a latent variable, the dominant sound source number indicating which sound source is the most dominant at that frequency: $p(x_{n,k} \mid C_n) = \sum_d p(x_{n,k}, d_{n,k} \mid C_n)$.
3. The conditional probability density function for each frequency is defined as in Equation (13), based on the spectrum observation model defined for each sound source and on the LogMax model.
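The right-hand side of Equation (14) is not reproduced in this text. Combining the three properties above with the product of spectral shape models gives the following sketch of its structure; the summation and product limits are assumptions.

```latex
p(x_n, C_n)
  = \underbrace{\prod_{k=1}^{N(k)} \; \sum_{d_{n,k}=1}^{N(m)}
      p(x_{n,k}, d_{n,k} \mid C_n)}_{p(x_n \mid C_n)}
    \;\; \prod_{l=1}^{N(m)} p(c_n^{(l)})
```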

The present technique estimates the spectral shape feature of each sound source that maximizes Equation (14), and also estimates the sound source occupancy of each sound source, which is the posterior probability that the source matches the dominant sound source number. It then estimates the spectrum of the target signal based on the spectral shape feature and sound source occupancy estimated for the target signal and on the spectrum observation model of the target signal.

[Optimization algorithm]
Equation (14) can be maximized with a general nonlinear optimization algorithm such as the conjugate gradient method or a quasi-Newton method. Because such procedures can be derived in a standard way from each nonlinear optimization algorithm, their description is omitted here.

On the other hand, because Equation (14) is a function that contains hidden variables, it can also be maximized efficiently with the expectation-maximization (EM) algorithm. This optimization algorithm is described in detail below. The auxiliary function used in the EM algorithm is defined as follows.

Writing the estimate of the sound source occupancy of sound source $m$ as $\hat{M}_{n,k}^{(m)} = E\{p(d_{n,k} = m \mid x_{n,k}, C_n) \mid \hat{C}_n\}$, Equation (17) can be expanded further as follows.

Following the EM algorithm, Equation (18) is computed in the E-step and the $C_n$ that maximizes Equation (19) is found in the M-step. Repeating the E-step and the M-step yields the spectral shape features $C_n$ that maximize the optimization function above.

Note that, in the E-step, Equation (18) can be computed independently at each frequency, so updating the sound source occupancy does not require considering combinations of values across frequencies. In the M-step, Equation (19) can be maximized by decomposing it into one optimization function per sound source, as in Equation (20). The spectral shape feature can therefore be updated independently for each sound source according to Equation (20), without considering combinations of values across sound sources. As a result, the sound source occupancy and the spectral shape feature of each sound source $m$ can be updated efficiently within the iterative estimation of the EM algorithm.
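The per-frequency independence of the E-step can be sketched as follows. The occupancy is computed as the posterior of the dominant sound source number, obtained by normalizing the joint density described for Equation (13) over the sources. This is an illustration under assumptions: Equation (18) is not reproduced in this text, the Gaussian observation model of Equation (9) is used, one variance per frequency is shared by all sources, and the name `update_occupancy` is hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def normal_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def update_occupancy(x_n, means, variances):
    """E-step sketch: occupancy[m][k] = p(d_k = m | x_k, C).

    x_n[k]       : observed spectral feature at frequency k
    means[m][k]  : G_k(c^(m)), the observation-model mean of source m
    variances[k] : xi_k, shared by all sources in this sketch
    """
    num_sources = len(means)
    occupancy = [[0.0] * len(x_n) for _ in range(num_sources)]
    for k, x in enumerate(x_n):  # each frequency is handled on its own
        joint = []
        for m in range(num_sources):
            value = normal_pdf(x, means[m][k], variances[k])
            for l in range(num_sources):
                if l != m:
                    value *= normal_cdf(x, means[l][k], variances[k])
            joint.append(value)
        total = sum(joint)  # may underflow to 0 for extreme inputs; not handled here
        for m in range(num_sources):
            occupancy[m][k] = joint[m] / total
    return occupancy

M_hat = update_occupancy([0.0, 0.0], [[0.0, 0.0], [0.0, -5.0]], [1.0, 1.0])
```

The loop over `k` never reads values from other frequencies, so the cost grows linearly with the number of frequencies rather than with the number of joint configurations.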

[Estimation of the spectral feature of the target signal]
The target sound spectrum estimation unit 106 takes as input the spectral feature $x_{n,k}$ of the observed signal, the estimated sound source occupancy $\hat{M}_{n,k}^{(1)}$ of the target signal, the estimated spectral shape feature $\hat{c}_n^{(1)}$, and the spectrum observation model $p(s_n^{(1)} \mid c_n^{(1)})$, and obtains the estimate $\hat{s}_{n,k}^{(1)}$ of the target signal spectrum by minimum squared error estimation. The estimate is computed by Equation (21).
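Equation (21) is not reproduced in this text. One form that follows from the models above is the posterior mean: when the target is dominant its value equals the observation, and otherwise it is distributed as the observation model restricted to values at or below the observation. The sketch below implements that form for the Gaussian observation model; it is an assumption about the shape of the estimator, and the names `truncated_mean_below` and `mmse_target` are hypothetical.

```python
import math

def truncated_mean_below(x, mean, var):
    """Mean of a normal variable N(mean, var) restricted to values <= x."""
    sigma = math.sqrt(var)
    z = (x - mean) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # cdf can underflow to 0 when x lies far below the mean; not handled here.
    return mean - sigma * pdf / cdf

def mmse_target(x_k, occupancy_k, mean_k, var_k):
    """Posterior-mean estimate of the target's spectral feature at one frequency.

    occupancy_k : estimated probability that the target is dominant
    mean_k      : G_k(c_hat^(1)), the observation-model mean of the target
    var_k       : xi_k
    """
    masked = truncated_mean_below(x_k, mean_k, var_k)
    return occupancy_k * x_k + (1.0 - occupancy_k) * masked

s_hat_dominant = mmse_target(2.0, 1.0, 0.0, 1.0)
s_hat_masked = mmse_target(0.0, 0.0, 0.0, 1.0)
```

With occupancy 1 the estimate is the observation itself; with occupancy 0 it falls back to the model's expectation below the observed level, so the estimate never exceeds the observation.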

[Embodiment 1]
Based on the above, the present technique can estimate the target sound spectrum by the following procedure.
Step 1. For each short-time frame $n$, the feature extraction unit 104 extracts the spectral feature $x_n$ of the observed signal.
Step 2. The spectral shape estimation unit 103-m for each sound source $m$ initializes the estimate $\hat{c}_n^{(m)}$ of the spectral shape feature, for example as $\hat{c}_n^{(m)} = H(x_n)$ using the spectral feature $x_n$ of the observed signal and the feature transform function $H(\cdot)$.
Step 3. Repeat (a) and (b) until convergence.
(a) The sound source occupancy update unit 105 updates the estimate $\hat{M}_{n,k}^{(m)}$ of the sound source occupancy of each sound source $m$ by evaluating Equation (18) independently at each frequency (E-step).
(b) For each sound source $m$, the spectral shape estimation unit 103-m updates the estimate $\hat{c}_n^{(m)}$ of the spectral shape feature by finding the $c_n^{(m)}$ that maximizes Equation (20) (M-step). Because Equation (20) is in general a nonlinear function, the maximization is carried out with a general nonlinear optimization method such as the conjugate gradient method, a quasi-Newton method, or Newton's method.
Step 4. The target sound spectrum estimation unit 106 obtains the estimate $\hat{s}_{n,k}^{(1)}$ of the spectral feature of the target signal by Equation (21).

As an example, when the spectral shape feature is updated by Newton's method in Step 3(b) above, the update formula is as follows.

[Embodiment 2]
An embodiment more specific than Embodiment 1 is described next. Here the mel filter bank output is used as the spectral feature of the observed signal, the MFCC as the spectral shape feature, and a Gaussian mixture distribution over the MFCC as the spectral shape model.

The spectral shape feature $c_n^{(m)}$ of each sound source $m$ is modeled with a Gaussian mixture distribution as follows. Here $p(c_n^{(m)} \mid i)$ and $\alpha_i$ denote the Gaussian distribution for mixture index $i$ and its mixture weight, respectively. The mean $\mu_i^{(m)}$ and covariance matrix $\Sigma_i^{(m)}$ of each Gaussian are assumed to have been obtained in advance, for example from a database of acoustic signals. In this model, the mixture index $i$ is treated as a hidden variable.

The main difference between Embodiment 2 and Embodiment 1 is that the hidden variables of the EM algorithm now also include the mixture index $i$ of the Gaussian mixture distribution in the spectral shape model of each sound source $m$. The auxiliary function used in the EM algorithm is modified accordingly, as follows.

Further, writing $\hat{Z}_{n,i}^{(m)} = E\{p(i \mid c_n^{(m)}) \mid \hat{c}_n^{(m)}\}$, Equation (20) is modified as follows.
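The quantity $\hat{Z}_{n,i}^{(m)}$ is the posterior probability of mixture index $i$ given the current estimate $\hat{c}_n^{(m)}$. A minimal sketch of that computation for one source and one frame is given below, assuming diagonal covariance matrices for brevity (the text allows full covariance matrices $\Sigma_i^{(m)}$); the names `log_gauss_diag` and `responsibilities` are hypothetical.

```python
import math

def log_gauss_diag(c, mean, var):
    """Log density of a normal distribution with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
                      for x, m, v in zip(c, mean, var))

def responsibilities(c_hat, weights, means, variances):
    """Posterior p(i | c_hat) over mixture indices of a Gaussian mixture.

    weights[i]   : mixture weight alpha_i
    means[i]     : mean vector mu_i
    variances[i] : diagonal of Sigma_i (an assumption of this sketch)
    """
    logs = [math.log(a) + log_gauss_diag(c_hat, mu, var)
            for a, mu, var in zip(weights, means, variances)]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

Z_hat = responsibilities([0.0, 0.0],
                         [0.5, 0.5],
                         [[0.0, 0.0], [4.0, 4.0]],
                         [[1.0, 1.0], [1.0, 1.0]])
```

Because the computation uses only the model and estimate of a single source, it can be run separately for each source, in line with the per-source decomposition of the M-step.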

As a result, the target sound spectrum can be estimated by the following procedure.
Step 1. For each short-time frame $n$, the feature extraction unit 104 extracts the mel filter bank output $x_n$ of the observed signal.
Step 2. The spectral shape estimation unit 103-m for each sound source $m$ initializes the MFCC estimate $\hat{c}_n^{(m)}$, for example as $\hat{c}_n^{(m)} = H(x_n)$ with $H(x_n)$ being the discrete cosine transform.
Step 3. Repeat (a), (b) and (c) until convergence.
(a) The sound source occupancy update unit 105 updates the estimate $\hat{M}_{n,k}^{(m)}$ of the sound source occupancy of each sound source $m$ by evaluating Equation (18) independently at each frequency (E-step 1).
(b) For each sound source $m$, the spectral shape estimation unit 103-m updates $\hat{Z}_{n,i}^{(m)}$ based on Equation (28) (E-step 2).
(c) For each sound source $m$, the spectral shape estimation unit 103-m updates the MFCC estimate $\hat{c}_n^{(m)}$ by finding the MFCC $c_n^{(m)}$ that maximizes Equation (27) (M-step). Because Equation (27) is in general a nonlinear function, the maximization is carried out with a general nonlinear optimization method such as the conjugate gradient method, a quasi-Newton method, or Newton's method.
Step 4. The target sound spectrum estimation unit 106 obtains the estimate $\hat{s}_{n,k}^{(m)}$ of the filter bank output (that is, the spectral feature) of the target signal by Equation (21).

Next, a case is described in which Embodiment 2 additionally uses a more specific model for the spectrum observation model $p(s_{n,k}^{(m)} \mid c_n^{(m)})$ of each sound source $m$.

In this case, the pseudo-inverse transform $G(c)$ is modeled by the linear regression model defined below, where $A$ is an $N(k) \times N(h)$ matrix and $b$ is an $N(k) \times 1$ vector.
$G(c) = Ac + b$ (29)

The values of $A$ and $b$ are learned in advance from a database of acoustic signals, or learned from the observed signal. That is, given pairs of spectral features $x_n$ and corresponding spectral shape features $c_n = H(x_n)$ taken from the training database (or from the observed signal), $A$ and $b$ are determined as follows.

The variance $\xi_k$ of the inverse transform error $e$ is defined as the component, for each frequency $k$, of the mean squared regression error $E\{|x_n - (Ac_n + b)|^2\}$ obtained with the $A$ and $b$ from Equation (30). With these definitions, the spectrum observation model $p(s_{n,k}^{(m)} \mid c_n^{(m)})$ can be defined by Equation (10). Depending on the data available for estimating $A$ and $b$, a separate spectrum observation model may be prepared for each sound source, or a single model common to all sound sources may be prepared.

Compared with the algorithm of Embodiment 2, the algorithm in this case differs only in the definition of the pseudo-inverse transform $G(c)$.

As another example of how to define $G(c)$, $A$ may be set to the Moore-Penrose pseudo-inverse of the discrete cosine transform matrix, with $b = 0$. This selects, among the inverse transforms of the discrete cosine transform, the one for which the norm of the transformed feature is smallest. In this way, the present technique can adopt inverse transforms based on various criteria as the pseudo-inverse transform $G(c)$.
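The Moore-Penrose choice can be sketched without a linear algebra library when the DCT matrix has orthonormal rows, because the pseudo-inverse of such a matrix is simply its transpose. The example below assumes an orthonormal, truncated DCT-II; the normalization and the names `dct_matrix`, `H` and `G` are assumptions made for the illustration.

```python
import math

def dct_matrix(num_coeffs, num_bands):
    """Orthonormal DCT-II matrix D with num_coeffs rows and num_bands columns."""
    rows = []
    for h in range(num_coeffs):
        scale = math.sqrt((1.0 if h == 0 else 2.0) / num_bands)
        rows.append([scale * math.cos(math.pi * h * (k + 0.5) / num_bands)
                     for k in range(num_bands)])
    return rows

def H(s, num_coeffs):
    """Feature transform c = D s."""
    D = dct_matrix(num_coeffs, len(s))
    return [sum(d * x for d, x in zip(row, s)) for row in D]

def G(c, num_bands):
    """Pseudo-inverse transform s = A c with A = D^T and b = 0."""
    D = dct_matrix(len(c), num_bands)
    return [sum(D[h][k] * c[h] for h in range(len(c))) for k in range(num_bands)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

s = [0.2, 1.5, -0.3, 0.9, 2.0, 0.1]
c = H(s, 3)
s_min = G(c, 6)       # minimum-norm spectrum that maps to c
c_again = H(s_min, 3)
```

`s_min` reproduces the same coefficients as `s` but has a smaller norm, and the difference `s - s_min` corresponds to the inverse transform error $e$ of the observation model.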

[Modification of Embodiment 2]
This modification of Embodiment 2 uses the log power spectrum as the spectral feature of the observed signal and of each sound source. It is obtained simply by replacing the processing on the mel filter bank output in the algorithm above with processing on the log power spectrum. The changes relative to that algorithm are summarized below.
= 1 =
In Step 1, the feature extraction unit 104 extracts the log power spectrum, instead of the mel filter bank output, as the spectral feature $x_n$ of the observed signal.
= 2 =
In the initialization of Step 2, the feature transform function is $H(x_n) = \log(\mathrm{mfb}(\exp(x_n)))$, the function given earlier as an example for converting a log power spectrum to MFCC.
= 3 =
The spectrum observation model for each sound source $m$ used in Steps 3(a) and 3(c) is defined on the basis of an inverse transform $G(c)$ that obtains a log power spectrum from the MFCC.
= 4 =
In Step 4, the log power spectrum is estimated as the spectrum estimate of the target signal. Accordingly, the log power spectrum of the observed signal is used as the spectral feature $x_n$ of the observed signal, and, as in Steps 3(a) and 3(c), the spectrum observation model is the one defined on the basis of the inverse transform $G(c)$ that obtains a log power spectrum from the MFCC.

The final output of the above algorithm is the estimate $\hat{s}_{n,k}^{(m)}$ of the log power spectrum of the target signal. The estimate of the power spectrum of the target signal can therefore be obtained from it as $\exp(\hat{s}_{n,k}^{(m)})$. Using the phase information of the observed signal, an estimate of the target signal waveform can also be obtained, for example by the inverse discrete Fourier transform and overlap-add synthesis.

[Embodiment 3]
This embodiment modifies Embodiment 2 so that a hidden Markov model over the MFCC is used as the spectral shape model instead of a Gaussian mixture distribution over the MFCC. For simplicity, the description assumes batch processing, in which all estimation is performed after the entire observed signal has been given. Hidden Markov models can, however, be run sequentially with well-known techniques, and this embodiment can likewise be run sequentially using those techniques.

First, let $\{c_n^{(m)}\}$ denote the set of $c_n^{(m)}$ over all short-time frames $n$. Using a hidden Markov model, the spectral shape model $p(\{c_n^{(m)}\})$ of each sound source $m$ can be modeled as follows. Here $i_0$ is the initial state, $i_n$ for $n > 0$ is the state in short-time frame $n$, $\pi_{i_0}$ is the probability that the initial state is $i_0$, $\pi_{i,j}$ is the transition probability from state $i$ to state $j$, $p(c_n^{(m)} \mid i)$ is the output probability density function of the spectral shape feature in state $i$, and $\Sigma_I$ denotes the sum over all state sequences.

For a hidden Markov model, it is known that, given a sequence of observed features, the posterior probability $Z_{n,i} = p(i_n = i \mid \{c_n^{(m)}\})$ of being in each state $i$ at each short-time frame can be computed efficiently with the forward-backward algorithm, the Viterbi algorithm, or similar methods.
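The forward-backward computation of the state posteriors can be sketched as follows. This is a generic scaled forward-backward pass, not code from the patent: the name `state_posteriors` is hypothetical, the output densities are assumed to have been evaluated beforehand, and the initial distribution is applied at the first frame, which simplifies the separate initial state $i_0$ used in the text.

```python
def state_posteriors(initial, transition, likelihoods):
    """Return Z[n][i] = p(i_n = i | all observations).

    initial[i]        : probability of state i at the first frame
    transition[i][j]  : probability of moving from state i to state j
    likelihoods[n][i] : output density p(c_n | i) evaluated at the observed c_n
    """
    num_frames, num_states = len(likelihoods), len(initial)
    states = range(num_states)

    # Forward pass with per-frame normalization to avoid underflow.
    alpha, scales = [], []
    for n in range(num_frames):
        if n == 0:
            a = [initial[i] * likelihoods[0][i] for i in states]
        else:
            a = [likelihoods[n][j] *
                 sum(alpha[n - 1][i] * transition[i][j] for i in states)
                 for j in states]
        scale = sum(a)
        scales.append(scale)
        alpha.append([v / scale for v in a])

    # Backward pass using the same scale factors.
    beta = [[1.0] * num_states for _ in range(num_frames)]
    for n in range(num_frames - 2, -1, -1):
        for i in states:
            beta[n][i] = sum(transition[i][j] * likelihoods[n + 1][j] * beta[n + 1][j]
                             for j in states) / scales[n + 1]

    return [[alpha[n][i] * beta[n][i] for i in states] for n in range(num_frames)]

Z = state_posteriors([0.5, 0.5],
                     [[1.0, 0.0], [0.0, 1.0]],
                     [[0.9, 0.1], [0.9, 0.1]])
```

The cost is proportional to the number of frames times the square of the number of states, rather than to the number of possible state sequences.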

When a hidden Markov model is used as the spectral shape model, the auxiliary function used in the EM algorithm is modified as follows.

Further, writing $\hat{Z}_{n,i}^{(m)} = E\{p(i_n \mid \{c_n^{(m)}\}) \mid \{\hat{c}_n^{(m)}\}\}$, Equations (19) and (20) are modified as follows.

The main difference from the auxiliary function of Embodiment 2 is therefore only the way $\hat{Z}_{n,i}^{(m)}$ in Equation (35) is computed. $\hat{Z}_{n,i}^{(m)}$ is identical to the posterior probability $p(i_n \mid \{\hat{c}_n^{(m)}\})$ that a hidden Markov model whose outputs are the spectral shape feature estimates $\{\hat{c}_n^{(m)}\}$ updated in the M-step is in state $i$ at short-time frame $n$. As noted above, it can therefore be computed efficiently with the forward-backward algorithm, the Viterbi algorithm, or similar methods.

In the end, the algorithm differs from that of Embodiment 2 only in that the processing of Step 3(b) is modified as follows.
Step 3(b). For each sound source $m$, the spectral shape estimation unit 103-m computes $\hat{Z}_{n,i}^{(m)} = p(i_n \mid \{\hat{c}_n^{(m)}\})$ using the forward-backward algorithm or a similar method.

[Embodiment 4]
The embodiments so far assume that a spectral shape model is given in advance for every sound source, but this cannot always be expected in a real environment. For example, the spectral shape model of the target signal may be trainable in advance while that of the background noise is not. In that case, the spectral shape model of each sound source that has not been trained in advance must be learned using only information obtained from the observed signal. Embodiment 4 is described as a method for handling such situations.

In Embodiments 2 and 3, only the spectral shape feature quantity $c_n^{(m)}$ was estimated by the EM algorithm. In Embodiment 4, the parameters of the spectral shape model are also included among the parameters to be estimated: for a Gaussian mixture distribution, the mixture weights and the mean and covariance of each Gaussian; for a hidden Markov model, the initial state probabilities, the state transition probabilities, and the means and covariances of the output probability distributions. First, the problem of finding the spectral shape feature quantities is defined as the problem of maximizing the following optimization function. Here, $\theta$ is the set of all parameters to be estimated, and it includes the unknown parameters of the spectral shape model in addition to the spectral shape feature quantities $C_n$.

This section describes the algorithm for the case where the parameters of the spectral shape model are not given in advance for one or more sound sources. For simplicity, only the differences from the algorithms of Embodiments 2 and 3 are described. As with the modification that Embodiment 3 makes to Embodiment 2, only Step 3(b) changes. Specifically, in Embodiment 4, for a sound source $m$ whose spectral shape model is not given in advance, Step 3(b) consists of the following two processes.
Step 3(b)-1: First, the parameters of the spectral shape model are estimated using $\hat{c}_n^{(m)}$ as obtained from the initialization or as updated in the M-step. That is, the parameters are estimated so as to maximize the likelihood that $\hat{c}_n^{(m)}$ is generated as the spectral shape feature quantity. Any algorithm suited to the probability distribution adopted by the spectral shape model may be used. For example, for a Gaussian mixture distribution or a hidden Markov model, it is known that the parameters can be estimated efficiently with the EM algorithm.
Step 3(b)-2: After the parameters of the spectral shape model have been estimated, $\hat{Z}_{n,k}^{(m)}$ is obtained by the same procedure as in Embodiments 2 and 3.

With the above processing, Step 3(b) can be carried out even when the spectral shape model is not given in advance. The remaining steps of the target signal spectrum estimation can use the spectral shape model obtained above, so no other change to the algorithm is required.

Note that the parameter estimation described above is performed, given the estimates $\hat{c}_n^{(m)}$ of the spectral shape feature quantity, so as to increase the likelihood that the spectral shape model outputs those estimates. The spectral shape models usable in Embodiment 4 are therefore not limited to those listed above, and include any spectral shape model whose parameters can be estimated from a given output sequence.
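
The two-stage Step 3(b) of Embodiment 4 can be illustrated for the Gaussian mixture case. The sketch below is deliberately simplified and is an assumption throughout: it treats each feature as a scalar (the MFCC vectors of the description would need multivariate Gaussians), and the names `fit_gmm` and `responsibilities`, the initial means, and the toy data are hypothetical. `fit_gmm` corresponds to Step 3(b)-1 and `responsibilities` to Step 3(b)-2.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """Posterior probability of each mixture component for each sample."""
    post = []
    for x in data:
        joint = [w * gaussian(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        post.append([j / total for j in joint])
    return post


def fit_gmm(data, init_means, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture to the data by EM (maximum likelihood)."""
    k = len(init_means)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [1.0] * k
    for _ in range(n_iter):
        post = responsibilities(data, weights, means, variances)  # E-step
        for i in range(k):                                         # M-step
            occ = sum(p[i] for p in post)
            weights[i] = occ / len(data)
            means[i] = sum(p[i] * x for p, x in zip(post, data)) / occ
            variances[i] = max(
                sum(p[i] * (x - means[i]) ** 2 for p, x in zip(post, data)) / occ,
                var_floor,  # floor keeps a collapsing component numerically valid
            )
    return weights, means, variances


# Toy usage: current feature estimates for one source (illustrative values).
c_hat = [-2.1, -1.9, -2.0, 3.0, 3.1, 2.9]
weights, means, variances = fit_gmm(c_hat, init_means=[-1.0, 1.0])  # Step 3(b)-1
z_hat = responsibilities(c_hat, weights, means, variances)          # Step 3(b)-2
```

In the full algorithm this fit would be repeated inside every outer iteration, since the feature estimates change after each M-step.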

[Embodiment 5]
Like the conventional examples of Non-Patent Documents 1 to 3, the present technique iteratively estimates the sound source occupancy, updating it in the E-step of the EM algorithm. Therefore, when the observed signal is recorded simultaneously by multiple microphones, a configuration is possible in which, following the same method as the conventional examples, feature quantities based on the position information of each sound source are extracted from the observed signals and the sound source occupancy is updated in the E-step with the source position information taken into account. This embodiment describes that configuration.

This embodiment is described below with reference to FIG. 3. In the figure, the microphone number is written as a superscript, as in $x^{(1)}(t)$, to distinguish the microphone signals. For simplicity the case of two microphones is illustrated, but, as in the conventional examples, any number of microphones equal to or greater than two may be used. As in the previous embodiments, the feature extraction unit 104 extracts the spectral feature quantity $x_n$ of the observed signal from any one of the observed signals. Alternatively, it may extract a spectral feature quantity obtained after applying some processing that uses multiple microphones (for example, enhancing the target sound with another speech enhancement technique). The spectral shape model, the spectral observation model, the spectral shape estimation unit, and the target sound spectrum estimation unit for each sound source are the same as, or are constructed in the same way as, those of any of the embodiments described so far. In this embodiment, the description of these functional units is omitted unless specifically needed.

On the other hand, in this embodiment the spectral shape feature quantity estimation device 100p (that is, the target signal spectral feature quantity estimation device 100) additionally includes, as new functional units, a sound source position feature extraction unit 111 and a sound source position state estimation unit 112-m associated with each sound source $m$. The figure illustrates the case of two sound source position state estimation units 112-m, but in practice one unit 112-m is assumed to be provided per sound source. As in the previous embodiments, $m = 1$ denotes the sound source position state estimation unit 112-m corresponding to the target sound.

The sound source position feature extraction unit 111 extracts the sound source position feature quantity $a_n = [a_{n,1}, a_{n,2}, \ldots, a_{n,N(k)}]^T$ from the observed signals $x^{(m)}(t)$ of the multiple microphones. Here, $a_{n,k}$ is the sound source position feature quantity at the time-frequency point $(n, k)$. As in Non-Patent Documents 1 to 3, any quantity that tends to take different values depending on the source position may be used; options include the inter-microphone phase difference, the inter-microphone time difference, and the inter-microphone level difference. As an example, this specification describes the case where the following normalized complex spectrum is used as the sound source position feature quantity. Here, $X_{n,k} = [X_{n,k}^{(1)}, X_{n,k}^{(2)}, \ldots, X_{n,k}^{(N(m))}]^T$, where $X_{n,k}^{(m)}$ is the $k$-th frequency component of the short-time Fourier transform of $x^{(m)}(t)$ at short-time frame $n$, and $|\cdot|$ denotes the norm of a vector.
$$a_{n,k} = \frac{X_{n,k}}{|X_{n,k}|} \tag{37}$$
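
Equation (37) can be sketched for a single time-frequency point as follows. This is an illustrative sketch only: the function name `source_position_feature` is hypothetical, the Euclidean norm is assumed for $|\cdot|$, and the handling of an all-zero bin is an assumption because the description does not address it.

```python
import math


def source_position_feature(stft_bin):
    """Normalize the multichannel STFT vector of one time-frequency point.

    stft_bin holds one complex STFT coefficient per microphone.
    """
    norm = math.sqrt(sum(abs(v) ** 2 for v in stft_bin))  # assumed Euclidean norm
    if norm == 0.0:
        # Silent bin: returning zeros is an assumption, not taken from the patent.
        return [0j] * len(stft_bin)
    return [v / norm for v in stft_bin]


# Toy usage with two microphones (illustrative values).
a_nk = source_position_feature([3 + 0j, 4j])
```

The normalization removes the overall level of the bin while keeping the relative phase and level between microphones, which is what carries the position information.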

Next, each sound source position state estimation unit 112-m obtains the following value from the sound source position feature quantity $a_{n,k}$ and the estimate $\hat{\varphi}^{(m)}$ of the sound source position parameter. $p(a_{n,k}; \varphi^{(m)})$ denotes the probability density function of the sound source position feature quantity of sound source $m$ at frequency $k$; in what follows, $\hat{w}_{n,k}^{(m)}$ is called the estimate of the sound source position state of sound source $m$.
$$\hat{w}_{n,k}^{(m)} = p(a_{n,k}; \varphi^{(m)}) \tag{38}$$

Various probability density functions for the sound source position feature quantity are known, depending on the feature quantity used. For example, the probability density function for the case where the normalized complex spectrum above is used as the sound source position feature quantity is described in Reference 1, among others.
(Reference 1) Hiroshi Sawada, Shoko Araki, and Shoji Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516-527, 2011.

The estimate $\hat{\varphi}^{(m)}$ of the sound source position parameter can either be learned in advance, using as training data acoustic signals picked up by multiple microphones under the same recording conditions as the observed signal, or be learned directly from the observed signal. Various learning methods are known, depending on the sound source position feature quantity. For example, the learning method for the case where the normalized complex spectrum above is used is described in detail in Reference 1. In the target signal spectrum estimation of the present technique, it is also possible, in parallel with the iterative updates of the sound source occupancy estimate and the spectral shape feature quantity estimate, for each sound source position state estimation unit 112-m to receive the sound source occupancy estimate from the sound source occupancy update unit 105 at every iteration and update the estimate of the sound source position parameter. A method for iteratively updating the estimate of the sound source position parameter during target signal spectrum estimation is described in Non-Patent Document 2.

In addition, the sound source occupancy update unit 105 receives the spectral feature quantity $x_n$ of the observed signal, the estimate $\hat{c}_n^{(m)}$ of the spectral shape feature quantity and the spectral observation model $p(s_n^{(m)} \mid c_n^{(m)})$ for each sound source $m$, and also the estimate $\hat{w}_{n,k}^{(m)}$ of the sound source position state of each sound source $m$, and updates the estimate $\hat{M}_{n,k}^{(m)}$ of the sound source occupancy of each sound source. As in Non-Patent Document 2, this update can be realized as follows.

As a variant of Embodiment 2 in which the log power spectrum is used as the spectral feature quantity, the algorithm of this embodiment is summarized as follows.
Step 1. Extract features by processes (a) and (b).
(a) For each short-time frame $n$, the feature extraction unit 104 extracts the log power spectrum $x_n$ of the observed signal.
(b) The sound source position feature extraction unit 111 extracts the sound source position feature quantity of the observed signal by Equation (37).
Step 2. Initialize by processes (a) and (b).
(a) The spectral shape estimation unit 103-m corresponding to each sound source $m$ initializes the MFCC estimate $\hat{c}_n^{(m)}$. For example, let $H(x_n) = \log(\mathrm{mfb}(\exp(x_n)))$ and set $\hat{c}_n^{(m)} = H(x_n)$.
(b) For each sound source $m$, the sound source position state estimation unit 112-m obtains the estimate $\hat{w}_{n,k}^{(m)}$ of the sound source position state by Equation (38).
Step 3. Repeat (a) to (c) until convergence.
(a) The sound source occupancy update unit 105 updates the estimate $\hat{M}_{n,k}^{(m)}$ of the sound source occupancy of each sound source $m$ by evaluating Equation (39) independently at each frequency (E-step 1).
(b) For each sound source $m$, the spectral shape estimation unit 103-m updates $\hat{Z}_{n,i}^{(m)}$ based on Equation (28) (E-step 2).
(c) For each sound source $m$, the spectral shape estimation unit 103-m updates the MFCC estimate $\hat{c}_n^{(m)}$ by finding the MFCC $c_n^{(m)}$ that maximizes Equation (27). Because Equation (27) is in general a nonlinear function, the maximization is carried out with a general nonlinear optimization method such as the conjugate gradient method, a quasi-Newton method, or Newton's method (M-step).
Step 4. The target sound spectrum estimation unit 106 obtains the estimate $\hat{s}_{n,k}^{(m)}$ of the log power spectrum (= spectral feature quantity) by Equation (21).
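
The initialization in Step 2(a), $H(x_n) = \log(\mathrm{mfb}(\exp(x_n)))$, can be sketched as follows. This is an illustration only: the function name `init_shape_feature` and the toy weight matrix are hypothetical, and representing the operator mfb as a plain matrix of non-negative filter weights is an assumption about its form rather than something stated in this section.

```python
import math


def init_shape_feature(log_power, filterbank):
    """Compute H(x) = log(mfb(exp(x))) for one frame.

    log_power  : log power spectrum x_n, one value per frequency bin
    filterbank : assumed form of mfb, one row of non-negative weights per band
    """
    power = [math.exp(v) for v in log_power]          # back to the power domain
    bands = [sum(w * p for w, p in zip(row, power))   # apply each band filter
             for row in filterbank]
    return [math.log(b) for b in bands]               # return to the log domain


# Toy usage: four frequency bins pooled into two bands (illustrative values).
x_n = [math.log(1.0), math.log(2.0), math.log(3.0), math.log(4.0)]
toy_filterbank = [[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]]
c_init = init_shape_feature(x_n, toy_filterbank)
```

Starting every source from the same value derived from the observation is simple; the sources are then differentiated by the occupancy and model-dependent updates of Step 3.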

[Confirmation experiment]
A confirmation experiment was conducted to evaluate the target signal spectrum estimation technique. The experimental conditions were as follows. In a reverberant room, the speech of a speaker located in front of two microphones was recorded together with various surrounding background sounds, and these recordings were used as the observed signal. Using this observed signal, the target signal spectrum estimation methods of Non-Patent Document 3 (conventional example) and Embodiment 5 (present technique) were compared. Both the conventional example and the present technique used the log power spectrum as the spectral feature quantity of the observed signal; the present technique used MFCCs as the spectral shape feature quantity and a Gaussian mixture model as the spectral shape model. Both methods used the same model of the sound source position feature quantity as Non-Patent Document 3, with its parameters prepared by prior training.

The results of applying automatic speech recognition to the observed signal and to the estimated target signal spectra are as follows. The word accuracy was 69.4% when the observed signal was recognized as is, whereas it was 83.7% and 87.2% when speech recognition was applied to the target signal spectra estimated by the conventional example and by the present technique, respectively. The conventional example already improves the recognition rate substantially, and the present technique improves it further.

These results confirm that, by using a Gaussian mixture distribution over MFCCs as the spectral shape model, the present technique substantially improves the accuracy of target signal spectrum estimation compared with the conventional example.
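
To put the reported accuracies in perspective, they can be converted into word error rates and relative error reductions. This derived metric does not appear in the original description; the helper `relative_error_reduction` is a hypothetical name, and only the three accuracies reported above are used as inputs.

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative reduction (in percent) of the word error rate, from accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


observed, conventional, proposed = 69.4, 83.7, 87.2  # word accuracies reported above

conventional_vs_observed = relative_error_reduction(observed, conventional)
proposed_vs_observed = relative_error_reduction(observed, proposed)
proposed_vs_conventional = relative_error_reduction(conventional, proposed)
```

Under this measure, the word error rate falls from 30.6% to 16.3% with the conventional example and to 12.8% with the present technique, so the present technique removes roughly 21% of the errors that remain after the conventional method.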

<Hardware configuration example of the spectral shape feature quantity estimation device / target signal spectral feature quantity estimation device>
The spectral shape feature quantity estimation device / target signal spectral feature quantity estimation device of the above embodiments (hereinafter simply "estimation device") comprises an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a CPU (Central Processing Unit), which may include a cache memory, RAM (Random Access Memory) and ROM (Read Only Memory) as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the estimation device may also be provided with a device (drive) that can read and write a storage medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the estimation device stores the program for estimating the spectral shape feature quantity and related quantities, together with the data required by that program. (The program need not be stored in the external storage device; for example, it may be stored in ROM, which is a read-only storage device.) Data obtained by executing these programs is stored in RAM, the external storage device, or the like as appropriate. Hereinafter, a storage device that stores data, the addresses of its storage areas, and so on is simply called a "storage unit".

The storage unit of the estimation device stores a program for extracting the spectral feature quantity from the observed signal, a program for estimating the most likely spectral shape feature quantity based on the optimization function (specifically, a program for updating the spectral shape feature quantity and a program for updating the sound source occupancy), a program for estimating the spectral feature quantity of the target signal, and, where needed, a program for estimating the sound source position feature quantity and a program for estimating the sound source position state.

In the estimation device, each program stored in the storage unit and the data it requires are read into RAM as needed and interpreted, executed, and processed by the CPU. As a result, the CPU implements the prescribed functions (feature extraction unit, spectral shape estimation unit, sound source occupancy update unit, target sound spectrum estimation unit, sound source position feature extraction unit, and sound source position state estimation unit), and spectrum estimation is thereby realized. Note that the target sound spectrum estimation unit is not an essential component of the spectral shape feature quantity estimation device.

<Supplementary notes>
The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from its spirit. The processes described in the above embodiments need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

When the processing functions of the hardware entity (estimation device) described in the above embodiments are implemented by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer implements the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape may be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

Claims

A spectral shape feature quantity estimation device that estimates, from an observed signal in which acoustic signals from a plurality of sound sources are mixed and picked up, a spectral shape feature quantity representing the shape of the spectral feature quantity of each acoustic signal, the device comprising:
a storage unit that stores, for each of the sound sources, a prior probability density function of the spectral shape feature quantity (spectral shape model) and a conditional probability density function of the spectral feature quantity given the spectral shape feature quantity (spectral observation model); and
estimation means for estimating, from the spectral feature quantity of the observed signal, the prior probability density function of the spectral shape feature quantity (spectral shape model), and the conditional probability density function of the spectral feature quantity given the spectral shape feature quantity (spectral observation model), the spectral shape feature quantity of each sound source that maximizes an optimization function having, as a latent variable, an exclusive sound source number indicating the sound source of the acoustic signal with the greatest energy at each time-frequency point,
wherein the spectral shape model is a probability density function having correlation between frequencies, and
the optimization function is expressed as the product of the conditional probability density function of the spectral feature quantity of the observed signal given the spectral shape feature quantities of all the sound sources and the prior probability density functions of the spectral shape feature quantity defined for each of the sound sources.

The spectral shape feature quantity estimation device according to claim 1,
wherein the estimation means:
estimates, from the spectral feature quantity of the observed signal, the spectral shape feature quantity of each sound source, and the conditional probability density function of the spectral feature quantity (spectral observation model) for each sound source, the posterior probability (sound source occupancy) that each sound source is the sound source indicated by the exclusive sound source number; and
estimates the spectral shape feature quantity of each sound source from the estimated posterior probability (sound source occupancy), the spectral feature quantity of the observed signal, the prior probability density function of the spectral shape feature quantity (spectral shape model), and the conditional probability density function of the spectral feature quantity given the spectral shape feature quantity (spectral observation model).

The spectral shape feature quantity estimation device according to claim 2,
wherein the spectral shape feature quantity is a mel-frequency cepstral coefficient, and
the spectral shape model is a Gaussian mixture distribution over mel-frequency cepstral coefficients.

The spectral shape feature quantity estimation device according to claim 2 or 3, further comprising:
a sound source position feature extraction unit that extracts, from the observed signals obtained by each of a plurality of microphones, a sound source position feature quantity, which is a feature quantity related to the position of each sound source; and
a sound source position state estimation unit that obtains, for each sound source, an estimated value of a sound source position state represented by a probability density function of the sound source position feature quantity,
wherein the estimation means updates the sound source occupancy by also using the estimated value of the sound source position state of each sound source.
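One simple way to picture an occupancy update that also uses position information is to add a per-source position log-likelihood to the spectral log-likelihood before normalizing. This sketch assumes that the spectral and position features are conditionally independent given the dominant source, which the claims do not state; the function and argument names are hypothetical.

```python
import numpy as np

def update_occupancy(spec_log_lik, pos_log_lik):
    """Combine two (M, N) log-likelihood arrays and normalize over sources.

    spec_log_lik: spectral log-likelihood per source and time-frequency point.
    pos_log_lik: position-feature log-likelihood per source and point.
    """
    joint = spec_log_lik + pos_log_lik
    joint = joint - joint.max(axis=0, keepdims=True)  # numerical stability
    post = np.exp(joint)
    return post / post.sum(axis=0, keepdims=True)
```

When the spectral evidence does not distinguish the sources at some point, the occupancy there follows the position evidence alone.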

A target signal spectral feature quantity estimation device including a target sound spectrum estimation unit that estimates the spectral feature quantity of a target signal from: the spectral shape feature quantity corresponding to the sound source of the target signal, among the spectral shape feature quantities and sound source occupancies for each sound source estimated by the spectral shape feature quantity estimation device according to any one of claims 2 to 4; the sound source occupancy of the target signal; the spectral observation model corresponding to the target signal; and the spectral feature quantity of the observed signal.

A spectral shape feature quantity estimation method for estimating, from an observed signal in which acoustic signals from a plurality of sound sources are mixed, a spectral shape feature quantity representing the shape of the spectral feature quantity of each acoustic signal,
wherein a storage unit stores a prior probability density function (spectral shape model) of the spectral shape feature quantity corresponding to each of the sound sources and a conditional probability density function (spectral observation model) of the spectral feature quantity given the spectral shape feature quantity,
the method includes an estimation step of estimating, from the spectral feature quantity of the observed signal, the prior probability density function (spectral shape model) of the spectral shape feature quantity, and the conditional probability density function (spectral observation model) of the spectral feature quantity given the spectral shape feature quantity, the spectral shape feature quantity for each sound source that maximizes an optimization function having, as a latent variable, an exclusive sound source number representing the sound source of the acoustic signal having the maximum energy at each time-frequency point,
the spectral shape model is a probability density function having correlation between frequencies, and
the optimization function is expressed as the product of the conditional probability density function of the spectral feature quantity of the observed signal given the spectral shape feature quantities of all the sound sources and the prior probability density functions of the spectral shape feature quantities defined for each of the sound sources.

The spectral shape feature quantity estimation method according to claim 6,
wherein the estimation step
estimates, from the conditional probability density function (spectral observation model) of the spectral feature quantity for each of the sound sources, the spectral feature quantity of the observed signal, and the spectral feature quantity for each of the sound sources, the posterior probability (sound source occupancy) that each sound source is the sound source represented by the exclusive sound source number, and
estimates the spectral shape feature quantity for each sound source from the estimated posterior probability (sound source occupancy), the spectral feature quantity of the observed signal, the prior probability density function (spectral shape model) of the spectral shape feature quantity, and the conditional probability density function (spectral observation model) of the spectral feature quantity given the spectral shape feature quantity.

The spectral shape feature quantity estimation method according to claim 7,
wherein the spectral shape feature quantity is a mel-frequency cepstral coefficient, and
the spectral shape model is a Gaussian mixture distribution of mel-frequency cepstral coefficients.

The spectral shape feature quantity estimation method according to claim 7 or 8, further comprising:
a sound source position feature extraction step of extracting, from the observed signals obtained by each of a plurality of microphones, a sound source position feature quantity, which is a feature quantity related to the position of each sound source; and
a sound source position state estimation step of obtaining, for each sound source, an estimated value of a sound source position state represented by a probability density function of the sound source position feature quantity,
wherein the estimation step updates the sound source occupancy by also using the estimated value of the sound source position state of each sound source.

A target signal spectral feature quantity estimation method including a target sound spectrum estimation step of estimating the spectral feature quantity of a target signal from: the spectral shape feature quantity corresponding to the sound source of the target signal, among the spectral shape feature quantities and sound source occupancies for each sound source estimated by the spectral shape feature quantity estimation method according to claim 7; the sound source occupancy of the target signal; the spectral observation model corresponding to the target signal; and the spectral feature quantity of the observed signal.

A program for causing a computer to function as the spectral shape feature quantity estimation device according to any one of claims 1 to 4 or the target signal spectral feature quantity estimation device according to claim 5.