JP2016045221A

JP2016045221A - Signal analysis device, method, and program

Info

Publication number: JP2016045221A
Application number: JP2014166903A
Authority: JP
Inventors: 弘和亀岡; Hirokazu Kameoka; 卓哉樋口; Takuya Higuchi; 裕史竹田; Yuji Takeda
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2014-08-19
Filing date: 2014-08-19
Publication date: 2016-04-04
Anticipated expiration: 2034-08-19
Also published as: JP6195548B2

Abstract

PROBLEM TO BE SOLVED: To perform sound source separation, acoustic event detection, and dereverberation in an integrated manner with high accuracy. SOLUTION: A mixed sound time-frequency expansion unit 28 receives, as input, time-series data of a multi-channel observation signal in which sound source signals are mixed, and outputs multi-channel observation time-frequency components. A parameter update unit 32 updates each of a base power spectrum, an activation parameter, a time series of a spatial correlation matrix, and a state series, on the basis of the multi-channel observation time-frequency components, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, so that an objective function is maximized when the multi-channel observation time-frequency components are given. A convergence determination unit 34 causes the parameter update unit 32 to repeat the update until a predetermined convergence condition is satisfied. SELECTED DRAWING: Figure 1

Description

The present invention relates to a signal analysis apparatus, method, and program, and more particularly to a signal analysis apparatus, method, and program for estimating parameters.

Blind sound source separation is the problem of estimating sound source signals from a multi-channel observation signal in which a plurality of sound source signals are mixed, when neither the sound source signals nor the transfer characteristics from the sound sources to the microphones are known. In general, solving this problem requires setting an optimization criterion based on some assumption about the sound source signals and then solving the resulting optimization problem.

Non-negative matrix factorization (NMF) is known as an effective approach to blind sound source separation for a single-channel observation signal (Non-Patent Document 1, Non-Patent Document 2). This method decomposes the power spectrogram of the observation signal into the product of two non-negative matrices: a basis matrix composed of several power spectra, and an activation matrix composed of the powers that represent the time-varying volume of those power spectra. The important point is that each power spectrum obtained by the decomposition is considered to represent a principal element of the observation signal, that is, an individual sound source signal. One advantage of NMF is that a specific sound can be separated and extracted by learning its power spectra in advance from clean acoustic signals and then applying NMF to the mixed sound while estimating only the powers.
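The decomposition described above can be sketched in a few lines of NumPy. This is only an illustration, not the procedure of the cited documents: it assumes the Euclidean-distance multiplicative updates, and the function name `nmf`, its arguments, and the option `W_fixed` (pre-learned power spectra that are held fixed so that only the powers are estimated) are hypothetical.

```python
import numpy as np

def nmf(V, n_bases, n_iter=200, W_fixed=None, eps=1e-12, seed=0):
    """Factor a non-negative K x L power spectrogram V as W @ H.

    W (K x n_bases) holds the power spectra and H (n_bases x L) their
    time-varying powers. If W_fixed is given, only H is estimated.
    """
    rng = np.random.default_rng(seed)
    K, L = V.shape
    if W_fixed is None:
        W = rng.random((K, n_bases)) + eps
    else:
        W = np.array(W_fixed, dtype=float)  # copy of the pre-learned spectra
    H = rng.random((W.shape[1], L)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if W_fixed is None:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Passing `W_fixed` corresponds to the use case in which the spectra of a target sound are learned beforehand and only the activations are fitted to the mixture.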

In addition, several approaches are known that extend NMF to multi-channel acoustic signals so that sound source separation can also use spatial information about the sound source signals (Non-Patent Documents 3 and 4).

Non-Patent Document 1: D. D. Lee and H. S. Seung, "Learning the parts of objects with nonnegative matrix factorization," Nature, vol. 401, pp. 788-791, 1999.
Non-Patent Document 2: P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," Proc. WASPAA 2003, Oct. 2003, pp. 177-180.
Non-Patent Document 3: A. Ozerov and C. Fevotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Trans. Audio, Speech and Language Processing, vol. 18, no. 3, pp. 550-563, Mar. 2010.
Non-Patent Document 4: H. Sawada, H. Kameoka, S. Araki and N. Ueda, "Efficient algorithms for multichannel extensions of Itakura-Saito nonnegative matrix factorization," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2012, pp. 261-264, 2012.

However, the prior art has two problems. The first is that it assumes the power spectrum of each sound source signal can be expressed by a single base power spectrum. The power spectrum of an actual sound source signal is often time-varying, so a single base power spectrum is frequently insufficient. For example, a sound source may have different power spectra depending on its state, such as silence, onset, or sustain, and expressing all of these with one base power spectrum is considered inadequate.

The second is that reverberation is not taken into account. To estimate sound source signals from an observation signal under reverberation, the observation signal should be modeled so as to include reflected waves as well as the direct wave, but the methods of Non-Patent Documents 3 and 4 do not do so.

The present invention has been made to solve the above problems, and its object is to provide a signal analysis apparatus, method, and program capable of performing sound source separation, acoustic event detection, and dereverberation in an integrated manner and with high accuracy.

To achieve the above object, a signal analysis apparatus according to a first invention includes: a mixed sound time-frequency expansion unit that receives, as input, time-series data of a multi-channel observation signal in which sound source signals from I sound sources are mixed, and outputs multi-channel observation time-frequency components representing the observation time-frequency component y(ω_k, t_l) of each frequency ω_k at each time t_l; a parameter update unit that updates each of a base power spectrum, an activation parameter, a time series of a spatial correlation matrix, and a state series, on the basis of the multi-channel observation time-frequency components output by the mixed sound time-frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, so as to maximize an objective function representing the conditional probability, given the multi-channel observation time-frequency components, of the base power spectrum representing the power spectrum w_{i,k,q} of each sound source i at each frequency ω_k and in each state q, the activation parameter representing the power h_{i,l} of each sound source i at each time t_l, the time series of the spatial correlation matrix representing the spatial correlation matrix C_{i,k,n} of each sound source i at each time t_n and each frequency ω_k, and the state series representing the sound source state z_{i,l} of each sound source i at each time t_l; and a convergence determination unit that causes the parameter update unit to repeat the update until a predetermined convergence condition is satisfied.

A parameter estimation method according to a second invention is a parameter estimation method in a signal analysis apparatus including a mixed sound time-frequency expansion unit, a parameter update unit, and a convergence determination unit. The mixed sound time-frequency expansion unit receives, as input, time-series data of a multi-channel observation signal in which sound source signals from I sound sources are mixed, and outputs multi-channel observation time-frequency components representing the observation time-frequency component y(ω_k, t_l) of each frequency ω_k at each time t_l. The parameter update unit updates each of a base power spectrum, an activation parameter, a time series of a spatial correlation matrix, and a state series, on the basis of the multi-channel observation time-frequency components output by the mixed sound time-frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, so as to maximize an objective function representing the probability, when the multi-channel observation time-frequency components are given, of the base power spectrum representing the power spectrum w_{i,k,q} of the sound source signal of each sound source i for each frequency ω_k and each state q, the activation parameter representing the power h_{i,l} of the sound source signal of each sound source i at each time t_l, the time series of the spatial correlation matrix representing the spatial correlation matrix C_{i,k,n} based on the component a_{i,k,n} of the transfer frequency characteristic of each sound source i at each time t_n and each frequency ω_k, and the state series representing the state z_{i,l} of the sound source signal of each sound source i at each time t_l. The convergence determination unit causes the parameter update unit to repeat the update until a predetermined convergence condition is satisfied.

According to the first and second inventions, the mixed sound time-frequency expansion unit receives, as input, time-series data of a multi-channel observation signal in which sound source signals from I sound sources are mixed, and outputs multi-channel observation time-frequency components representing the observation time-frequency component y(ω_k, t_l) of each frequency ω_k at each time t_l. The parameter update unit updates each of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, on the basis of the multi-channel observation time-frequency components output by the mixed sound time-frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, so as to maximize an objective function representing the probability, when the multi-channel observation time-frequency components are given, of the base power spectrum representing the power spectrum w_{i,k,q} of the sound source signal of each sound source i for each frequency ω_k and each state q, the activation parameter representing the power h_{i,l} of the sound source signal of each sound source i at each time t_l, the time series of the spatial correlation matrix representing the spatial correlation matrix C_{i,k,n} based on the component a_{i,k,n} of the transfer frequency characteristic of each sound source i at each time t_n and each frequency ω_k, and the state series representing the state z_{i,l} of the sound source signal of each sound source i at each time t_l. The convergence determination unit causes the parameter update unit to repeat the update until a predetermined convergence condition is satisfied.

In this way, time-series data of a multi-channel observation signal in which sound source signals from I sound sources are mixed is taken as input and multi-channel observation time-frequency components are output; each of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series is updated, on the basis of the multi-channel observation time-frequency components, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, so as to maximize an objective function representing the probability of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series when the multi-channel observation time-frequency components are given; and the convergence determination unit causes the parameter update unit to repeat the update until a predetermined convergence condition is satisfied. This makes it possible to estimate parameters for performing sound source separation, acoustic event detection, and dereverberation in an integrated manner and with high accuracy.
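As an illustration of the first step of this flow, the role of the mixed sound time-frequency expansion unit could be played by a short-time Fourier transform. The sketch below is a minimal version under stated assumptions: the text does not specify the transform, so the Hann window, the frame length, the hop size, and the function name `stft_multichannel` are all hypothetical choices.

```python
import numpy as np

def stft_multichannel(x, frame_len=1024, hop=256):
    """Turn an (M, T) real multi-channel signal into (K, L, M) complex
    time-frequency components y(omega_k, t_l); requires T >= frame_len."""
    x = np.asarray(x, dtype=float)
    M, T = x.shape
    window = np.hanning(frame_len)
    n_frames = 1 + (T - frame_len) // hop
    starts = hop * np.arange(n_frames)
    # frames has shape (M, L, frame_len): one windowed segment per time index l.
    frames = np.stack([x[:, s:s + frame_len] for s in starts], axis=1)
    spec = np.fft.rfft(frames * window, axis=-1)  # (M, L, K)
    return spec.transpose(2, 1, 0)                # (K, L, M)
```

Each slice `Y[k, l, :]` of the result is then an M-dimensional vector of observations, one entry per microphone.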

In the first invention, the objective function may be the product of the probability of the multi-channel observation time-frequency components when the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series are given, and a prior probability representing the probability that the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series are output, or may be the logarithm of that product. In this case, the parameter update unit may update each of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, on the basis of the multi-channel observation time-frequency components output by the mixed sound time-frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, so as to maximize the product or the logarithm of the product.
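Written out, this objective has the usual maximum a posteriori form. The symbols below are shorthand introduced here for illustration only: Y stands for the multi-channel observation time-frequency components, and W, H, C, and Z for the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series.

```latex
\log p(Y \mid W, H, C, Z) + \log p(W, H, C, Z)
  = \log p(W, H, C, Z \mid Y) + \log p(Y)
```

Because \log p(Y) does not depend on the parameters, maximizing the left-hand side is equivalent to maximizing the conditional probability of the parameters given the observations.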

In the first invention, the objective function may be an auxiliary function that is expressed using the multi-channel observation time-frequency components, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, the state series, auxiliary variables R_{i,k,l,n} for all combinations of (i, k, l, n), and auxiliary variables U_{k,l} for all combinations of (k, l), and that is a lower-bound function of the logarithm of the product. In this case, so as to increase the auxiliary function, the parameter update unit may update the auxiliary variables R_{i,k,l,n} and U_{k,l} on the basis of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, and may update each of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series on the basis of the multi-channel observation time-frequency components output by the mixed sound time-frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, the state series, and the auxiliary variables R_{i,k,l,n} and U_{k,l}.

In the first invention, the auxiliary function may be a lower-bound function derived using Jensen's inequality, which exploits the concavity of the trace of the negative matrix inverse, and a tangent-line inequality, which exploits the convexity of the negative logarithmic function.
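For reference, generic forms of the two inequalities are sketched below. They are standard results rather than the patent's own equations; X, X_j, U, R_j, and y are placeholder symbols, with X, X_j, and U positive definite M x M matrices and the R_j any matrices that sum to the identity.

```latex
% Tangent-line (first-order) bound from the concavity of log det:
\log\det X \;\le\; \log\det U + \operatorname{tr}\!\left(U^{-1}X\right) - M,
\qquad \text{with equality when } U = X.

% Jensen-type bound for the inverse of a sum, for \sum_j R_j = I:
y^{\mathsf H}\Bigl(\sum_j X_j\Bigr)^{-1} y
\;\le\; \sum_j y^{\mathsf H} R_j^{\mathsf H} X_j^{-1} R_j\, y,
\qquad \text{with equality when } R_j = X_j\Bigl(\sum_{j'} X_{j'}\Bigr)^{-1}.
```

Negating both sides turns these into lower bounds on the negative log-determinant and negative quadratic terms of a Gaussian log-likelihood, which is presumably the role played by auxiliary variables such as U_{k,l} and R_{i,k,l,n}.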

In the first invention, the parameter update unit may update, among the base power spectra, only the power spectra w_{i,k,q} of the sound source signals of those sound sources i other than the sound sources i whose power spectra w_{i,k,q} for each frequency ω_k and each state q are known or have been estimated in advance.

The program of the present invention is a program for causing a computer to function as each unit constituting the above signal analysis apparatus.

As described above, according to the signal analysis apparatus, method, and program of the present invention, a multi-channel observation signal in which sound source signals from I sound sources are mixed is taken as input; each of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series is updated, on the basis of the multi-channel observation time-frequency components, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, so as to maximize an objective function representing the conditional probability, given the multi-channel observation time-frequency components, of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series of each sound source; and the update by the parameter update unit is repeated until a predetermined convergence condition is satisfied. As a result, sound source separation, acoustic event detection, and dereverberation can be performed in an integrated manner and with high accuracy.

Block diagram showing the functional configuration of the signal analysis apparatus according to the first embodiment of the present invention.
Flowchart showing the parameter estimation processing routine in the signal analysis apparatus according to the first embodiment of the present invention.
Diagram showing the spectrogram of a mixed sound.
Diagram showing the spectrogram of a stapler sound under anechoic conditions.
Diagram showing the spectrogram of the separated sound after dereverberation.
Diagram showing an acoustic event detection result.
Diagram showing the spectrogram of the sound separated by a conventional method.
Block diagram showing the functional configuration of the signal analysis apparatus according to the second embodiment of the present invention.
Flowchart showing the parameter estimation processing routine in the signal analysis apparatus according to the second embodiment of the present invention.
Block diagram showing the functional configuration of the signal analysis apparatus according to the third embodiment of the present invention.
Flowchart showing the parameter estimation processing routine in the signal analysis apparatus according to the third embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of the invention>
First, an outline of the present embodiment will be described. If the methods of Non-Patent Documents 3 and 4 were used as they are, accurate sound source separation that takes the state of the sound source signals and reverberation into account would require performing dereverberation and state estimation of the sound source signals separately from the conventional method. However, because the way reverberation is applied in a real environment differs for each sound source signal, the accuracy of dereverberation depends on the sound source separation performance. Likewise, because it is easier to estimate the state of a sound source signal from a signal that is not mixed with other sounds than from a mixture of several sound source signals, the accuracy of state estimation naturally also depends on the sound source separation performance.

For this reason, the three problems of sound source separation, acoustic event detection, and dereverberation depend on one another, and to improve their accuracy it is considered desirable to solve them simultaneously. In the present embodiment, therefore, the multi-channel observation signal is modeled in more detail than in Non-Patent Documents 3 and 4, using parameters related to the sound source signals, acoustic events, and reverberation, and sound source separation, acoustic event detection, and dereverberation are performed in an integrated manner under a unified optimization criterion based on that model, thereby improving their accuracy.

<Principle of this embodiment>
Next, the principle of the present embodiment will be described. The aims of the present embodiment are to construct a generative model of the acoustic signal from parameters related to the sound source signals, acoustic events, and reverberation; to realize an iterative algorithm with guaranteed convergence that obtains each parameter by optimizing a unified criterion; and, through this parameter estimation, to perform sound source separation, acoustic event detection, and dereverberation in an integrated manner and with high accuracy. Specifically, these are realized by items 1 to 5 below.

1. The power spectra of the sound source signals, the power of the sound source signals at each time, the states of the sound source signals, and the time series of the spatial correlation matrices of the sound source signals are updated alternately so as to increase one and the same criterion.
2. The criterion is the product (or the logarithm of the product) of a likelihood function, which represents the probability that the multi-channel observation signal is output once the parameters are fixed, and a prior probability, which represents the probability that each parameter is output.
3. A function that never exceeds the logarithm of the output probability of the multi-channel observation signal, that touches it, and that is expressed in terms of the parameters and auxiliary variables is used as the criterion, and the parameters and the auxiliary variables are updated alternately so as to increase this criterion.
4. The criterion is a lower-bound function constructed using Jensen's inequality, which exploits the concavity of the trace of the negative matrix inverse, and a tangent-line inequality, which exploits the convexity of the negative logarithmic function.
5. Only the parameters other than those that are known (or have been estimated) in advance are updated alternately.
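The overall shape of items 1 and 5, alternating updates that never decrease a shared criterion while skipping parameters that are already known, can be sketched generically. This is not the patent's update rule: the function `alternate_updates`, the dictionary-based interface, and the relative-change stopping test are assumptions made for illustration.

```python
def alternate_updates(params, updaters, objective, frozen=(), tol=1e-6, max_iter=100):
    """Cyclically apply per-parameter updates until the criterion stops changing.

    params:    dict mapping a parameter name to its current value
    updaters:  dict mapping a name to a function params -> new value
    objective: function params -> float, the criterion to be increased
    frozen:    names of parameters that are known in advance and left untouched
    """
    params = dict(params)  # work on a copy so the caller's dict is not modified
    prev = objective(params)
    history = [prev]
    for _ in range(max_iter):
        for name, update in updaters.items():
            if name in frozen:
                continue
            params[name] = update(params)
        cur = objective(params)
        history.append(cur)
        # Convergence test (assumed form): small relative change in the criterion.
        if abs(cur - prev) <= tol * max(1.0, abs(prev)):
            break
        prev = cur
    return params, history
```

When every individual update maximizes or at least does not decrease the criterion with the other parameters held fixed, the recorded `history` is non-decreasing, which is the sense in which such a scheme has guaranteed convergence of the criterion value.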

Next, a generation model of the multi-channel observation time-frequency components based on a multi-channel factorial HMM will be described. First, the generation process of the multi-channel observation signal is described. Consider the case where I sound source signals are observed by M microphones. Let y_m(t) be the multi-channel observation signal observed by the m-th microphone and s_i(t) be the i-th sound source signal; the multi-channel observation signal is then expressed in the time domain by equation (1) below. A symbol with "^" placed above or after it denotes tensor, matrix, or vector notation.

Here, a_{i,m}(t) represents the room impulse response between the i-th sound source and the m-th microphone. When the room impulse response is sufficiently shorter than the time window length used in the time-frequency expansion, the multi-channel observation signal can be expressed in the time-frequency domain, using the instantaneous mixing approximation, as in equation (2) below.
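The equation images for (1) and (2) are not reproduced in this text. Forms consistent with the surrounding definitions would be the following; the summation variable \tau and the stacking of the M microphone signals into a vector are assumptions, and \hat{\cdot} corresponds to the "^" notation.

```latex
% (1) time-domain convolutive mixture, m = 1, ..., M
y_m(t) = \sum_{i=1}^{I} \sum_{\tau} a_{i,m}(\tau)\, s_i(t - \tau)

% (2) instantaneous-mixing approximation in the time-frequency domain
\hat{y}(\omega_k, t_l) = \sum_{i=1}^{I} \hat{a}_i(\omega_k)\, s_i(\omega_k, t_l),
\qquad
\hat{y}(\omega_k, t_l) = \bigl(y_1(\omega_k, t_l), \ldots, y_M(\omega_k, t_l)\bigr)^{\mathsf T}
```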

Here, y_m(ω_k, t_l) is the time-frequency component, at frequency ω_k and time t_l, of the multi-channel observation signal observed by the m-th microphone, and s_i(ω_k, t_l) is the time-frequency component of the i-th sound source signal at frequency ω_k and time t_l. a^_i(ω_k) represents the transfer frequency characteristic at frequency ω_k for the i-th sound source signal, and k and l are the frequency and time indices in the time-frequency domain, respectively. When reverberation is present, however, the room impulse response is generally not sufficiently short relative to the time window length, and the instantaneous mixing approximation of equation (2) does not hold. The multi-channel observation signal is therefore approximated as a convolutive mixture in the time-frequency domain, as expressed by equation (3) below.
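A form of equation (3) consistent with the definitions that follow would be the convolution over the time index shown below. It is a reconstruction: the equation itself is not reproduced in this text, and the range of n is left unspecified because the excerpt does not state it unambiguously.

```latex
\hat{y}(\omega_k, t_l) = \sum_{i=1}^{I} \sum_{n} \hat{a}_i(\omega_k, t_n)\, s_i(\omega_k, t_{l-n})
```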

Here, a^_i(ω_k, t_n) is the component at time t_n of the transfer frequency characteristic at frequency ω_k for the i-th sound source signal, and n is the time index of the transfer frequency characteristic in the time-frequency domain. That is, a^_i(ω_k, t_n) represents how strongly the i-th sound source signal affects the observation t_n later in the time-frequency domain, in other words, how much reverberation is applied in the time-frequency domain. Hereinafter, ω_k, t_l, and t_n are denoted simply by k, l, and n, respectively.
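The time-frequency-domain convolutive mixture described here can be simulated directly. The sketch below assumes that the delay index n runs from 0 to N-1 and that source components before the first frame are zero; the function name `convolutive_mix` and the array layouts are hypothetical.

```python
import numpy as np

def convolutive_mix(a, s):
    """Mix sources per frequency bin with an N-tap filter along the frame axis.

    a: (I, K, N, M) complex, a[i, k, n] is the M-dimensional transfer vector
    s: (I, K, L) complex source time-frequency components, with N <= L
    returns y: (K, L, M) with y[k, l] = sum_i sum_n a[i, k, n] * s[i, k, l - n]
    """
    I, K, N, M = a.shape
    L = s.shape[2]
    y = np.zeros((K, L, M), dtype=complex)
    for n in range(N):
        # Frames l >= n receive the source component from n frames earlier.
        y[:, n:, :] += np.einsum('ikm,ikl->klm', a[:, :, n, :], s[:, :, :L - n])
    return y
```

With N = 1 the loop reduces to the instantaneous mixing case, so the same function covers both models.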

Next, based on equation (3), the generation process of the multi-channel observation signal is described probabilistically as equation (4) below. Assuming that the sound source signal s_{i,k,l} follows a complex normal distribution with mean 0 and variance σ_{i,k,l}^2, the multi-channel observation signal y^_{k,l} likewise follows a complex normal distribution, by equation (3), under the condition that a^_{1:I,k,1:N} and s_{i,k,l-N:l} are known.

Here, $\mathbf{C}_{i,k,n} = \mathbf{a}_{i,k,n}\mathbf{a}_{i,k,n}^{\mathsf{H}}$ is called the spatial correlation matrix.

Next, modeling of the source signal spectrum using an HMM is described. First, assume that each source signal has a single power spectrum. Applying the NMF of Non-Patent Documents 1 and 2, the expected value $\sigma_{i,k,l}^{2}$ of the power spectrum of the source signal $s_{i,k,l}$ can be decomposed as in equation (5) below.

Here, $w_{i,k}$ and $h_{i,l}$ are non-negative. $w_{i,1:K}$ represents the power spectrum of the $i$-th source signal, and $h_{i,l}$ represents how loudly the source sounds at time $l$, that is, its power. The generative process of the source signal is then expressed as equation (6) below.
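As a minimal sketch of this single-spectrum model, assume equation (5) is the rank-1 product $\sigma_{i,k,l}^{2} = w_{i,k} h_{i,l}$ and equation (6) draws $s_{i,k,l}$ from a zero-mean circular complex normal with that variance. The function names below are hypothetical.

```python
import random


def rank1_variance(w, h):
    """sigma2[k][l] = w[k] * h[l] for one source (assumed form of equation (5))."""
    return [[w_k * h_l for h_l in h] for w_k in w]


def sample_source(sigma2, rng):
    """Draw s[k][l] from a zero-mean circular complex normal with variance sigma2[k][l].

    Real and imaginary parts each receive half of the variance.
    """
    return [
        [complex(rng.gauss(0.0, (v / 2) ** 0.5), rng.gauss(0.0, (v / 2) ** 0.5)) for v in row]
        for row in sigma2
    ]


w = [1.0, 4.0]        # power spectrum over K = 2 frequency bins
h = [0.0, 2.0, 3.0]   # activation over L = 3 frames (frame 0 is silent)
sigma2 = rank1_variance(w, h)
s = sample_source(sigma2, random.Random(0))
```

A frame with zero activation yields exactly zero samples, which is the behavior the silent state introduced below is meant to approximate.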

In many cases, however, a source signal has several spectra depending on its state, such as silence, onset, and steady state, so modeling with a single spectrum is insufficient. A hidden variable $z_{i,l} \in \{1, \ldots, Q\}$ representing the state of the $i$-th source signal at time $l$ is therefore introduced, and the state sequence $z_{i,1}, \ldots, z_{i,L}$ is assumed to follow a Markov chain, as expressed by equation (7) below.

Here, $\mathrm{Categorical}(x; \mathbf{y}) = y_x$; $\boldsymbol{\rho}_q = (\rho_{q,1}, \ldots, \rho_{q,Q})$ gives the probabilities of transitioning from state $q$ to each of the states $1, \ldots, Q$; and $\boldsymbol{\rho} = (\rho_{q,q'})_{Q \times Q}$ is the transition matrix. $Q$ is the number of states of a sound source. Assuming that the power $h_{i,l}$ is generated from a gamma distribution whose hyperparameters differ according to the state $z_{i,l}$, equation (8) below is obtained.

Here, $\alpha_{1:Q}$ and $\beta_{1:Q}$ are the scale parameters and shape parameters of the gamma distribution, respectively.

Since the power $h_{i,l}$ should take small values in the silent state, the gamma distribution parameters for that state are set so that small values are highly probable. For the remaining states (corresponding to sounding states), the parameters may be set so that the distribution is close to uniform. Assuming that the power spectrum of the $i$-th source signal at time $l$ is also determined by $z_{i,l}$, the generative process of the source signal $s_{i,k,l}$, given $w_{i,k,1:Q}$, $h_{i,l}$, and $z_{i,l}$, is expressed as equation (9) below.
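The state-dependent part of the generative model can be sketched as below. Equations (7) to (9) are not reproduced in this text, so the sketch assumes a first-order Markov chain over states, a gamma draw for the power in each frame, and a variance $w_{i,k,z_{i,l}} h_{i,l}$. States are numbered from 0 here, the hyperparameter values are made up for illustration, and the arguments are named by their role in Python's `random.gammavariate` (shape first, then scale) rather than by the patent's symbols.

```python
import random


def sample_states(rho, first_state, num_frames, rng):
    """Sample z[0..L-1] from a Markov chain with transition matrix rho[q][q_next]."""
    z = [first_state]
    for _ in range(num_frames - 1):
        z.append(rng.choices(range(len(rho)), weights=rho[z[-1]])[0])
    return z


def sample_activations(z, shape, scale, rng):
    """Draw h[l] from a gamma distribution whose parameters depend on the state z[l]."""
    return [rng.gammavariate(shape[q], scale[q]) for q in z]


def state_dependent_variance(w, h, z):
    """sigma2[k][l] = w[z[l]][k] * h[l], with one spectrum w[q] per state (assumed)."""
    num_freqs = len(w[0])
    return [[w[z[l]][k] * h[l] for l in range(len(h))] for k in range(num_freqs)]


rng = random.Random(0)
rho = [[0.9, 0.1], [0.2, 0.8]]   # state 0 plays the role of silence (illustrative values)
shape = [1.0, 5.0]               # illustrative hyperparameters, not taken from the patent
scale = [0.01, 1.0]              # small mean power in the silent state
w = [[1.0, 1.0], [3.0, 0.5]]     # one power spectrum per state, K = 2 bins
z = sample_states(rho, 0, 8, rng)
h = sample_activations(z, shape, scale, rng)
sigma2 = state_dependent_variance(w, h, z)
alternating = sample_states([[0.0, 1.0], [1.0, 0.0]], 0, 4, rng)
```

Choosing a small mean for the silent state is one way to realize the requirement above that small powers be highly probable in that state.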

The final generative model of the multichannel observed signal, given $\mathbf{a}_{1:I,k,0:N}$, $w_{1:I,k,1:Q}$, $h_{1:I,l-N:l}$, and $z_{1:I,l-N:l}$, is expressed as equation (10) below, together with equations (7) and (8) above.

Defining the generative model in this way makes it possible to apply the parameter estimation algorithm described below.

Next, estimation of the model parameters is described. Parameter estimation is formulated as the problem of maximizing the conditional probability $P(\boldsymbol{\Theta} \mid \mathbf{Y})$ of the spectral generative model parameters $\boldsymbol{\Theta}$, given the multichannel observed time-frequency components $\mathbf{Y} = \mathbf{y}_{1:K,1:L}$. The parameters $\boldsymbol{\Theta}$ to be estimated are the basis power spectra of the source signals $\mathbf{W} = w_{1:I,1:K,1:Q}$, the activation parameters $\mathbf{H} = h_{1:I,1:L}$, the time series of spatial correlation matrices $\mathbf{C} = \mathbf{C}_{1:I,1:K,0:N}$, and the source state sequences $\mathbf{Z} = z_{1:I,1:L}$. Here, $\mathbf{W}$ and $\mathbf{H}$ are parameters of the source signals, $\mathbf{C}$ of the reverberation, and $\mathbf{Z}$ of the acoustic events, so estimating them corresponds to performing source separation, dereverberation, and acoustic event detection, respectively.

It is difficult to find directly the $\boldsymbol{\Theta}$ that maximizes the conditional probability $P(\boldsymbol{\Theta} \mid \mathbf{Y})$, but local optimization with respect to each variable can be iterated. Here, $P(\boldsymbol{\Theta} \mid \mathbf{Y})$ is expressed as equations (11) and (12) below.

Here,

denotes equality up to constant terms.

In the algorithm used in this embodiment, the parameters are estimated by iteratively maximizing $\log P(\boldsymbol{\Theta} \mid \mathbf{Y})$ with respect to each variable. This maximization can be carried out sequentially for each parameter using the auxiliary function method. Specifically, $\log P(\boldsymbol{\Theta} \mid \mathbf{Y})$ is given by equation (13) below.

Here, the objective function $\log P(\boldsymbol{\Theta} \mid \mathbf{Y})$ is denoted by

and applying Jensen's inequality, which exploits the concavity of the negative matrix-inverse function, together with the tangent-line inequality, which exploits the convexity of the negative logarithm, yields equation (14) below.

Here, $\mathbf{R}_{i,k,l,n}$ and $\mathbf{U}_{k,l}$ are Hermitian positive definite matrices satisfying $\sum_{i,n} \mathbf{R}_{i,k,l,n} = \mathbf{I}$, and the set of $\mathbf{R}$ and $\mathbf{U}$ is denoted by $\boldsymbol{\Lambda}$. $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Equality holds in equation (14) under the conditions given by equations (15) and (16) below.
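The two inequalities named above are easiest to see in their scalar form. The following is a sketch of scalar analogues only, not a reproduction of equation (14): the positive numbers $x_j$ stand in for the per-component covariance terms, the weights $r_j$ for the matrices $\mathbf{R}_{i,k,l,n}$, and $u$ for $\mathbf{U}_{k,l}$.

```latex
% Tangent-line inequality (convexity of -log): for x > 0 and u > 0,
\[
  -\log x \;\ge\; -\log u \;-\; \frac{x - u}{u},
  \qquad \text{with equality if and only if } x = u .
\]
% Jensen's inequality (concavity of t -> -1/t on t > 0): for x_j > 0 and
% weights r_j > 0 with \sum_j r_j = 1,
\[
  -\frac{|y|^{2}}{\sum_j x_j}
  \;=\; -\frac{|y|^{2}}{\sum_j r_j \,(x_j / r_j)}
  \;\ge\; -\sum_j \frac{r_j^{2}\,|y|^{2}}{x_j},
  \qquad \text{with equality if and only if } r_j = \frac{x_j}{\sum_{j'} x_{j'}} .
\]
```

In both cases the right-hand side is a lower bound that touches the left-hand side at a specific choice of the auxiliary variable, which is the property the auxiliary function method relies on.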

Here, for any $\boldsymbol{\Theta}$, when $\boldsymbol{\Lambda}$ is given by equations (15) and (16) above, the auxiliary function is equal to the objective function. Moreover, for any fixed $\boldsymbol{\Lambda}$, a $\boldsymbol{\Theta}$ that increases the auxiliary function necessarily increases the objective function, by equation (14) above. Therefore, by alternating the update of $\boldsymbol{\Lambda}$ by equations (15) and (16) with an update of $\boldsymbol{\Theta}$ that increases the auxiliary function, the objective function increases monotonically until a local optimum is reached.

Next, the parameter updates are described. With $\boldsymbol{\Lambda}$ fixed, the auxiliary function is maximized with respect to the basis power spectra $\mathbf{W}$ and the activation parameters $\mathbf{H}$ by equations (17) to (20) below.

Here, the $\mathbf{C}$ that maximizes the auxiliary function is obtained from the Riccati equation shown in equation (21) below.

For the update of $\mathbf{Z}$, the Viterbi algorithm is run with the basis power spectra $\mathbf{W}$, the activation parameters $\mathbf{H}$, and the time series of spatial correlation matrices $\mathbf{C}$ held fixed, so as to maximize

and the maximizing state sequence $\mathbf{Z}$ is thereby obtained.
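For reference, a generic log-domain Viterbi decoder for a single chain is sketched below. It is not the patent's exact procedure: the per-frame scores `log_emit[l][q]` are assumed to be computed elsewhere from the fixed W, H, and C, and how the patent handles several sources jointly is not specified in this text.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the state path maximizing the sum of initial, transition, and emission scores.

    log_init[q]:     log probability of starting in state q
    log_trans[p][q]: log probability of moving from state p to state q
    log_emit[l][q]:  log score of frame l under state q
    """
    num_states = len(log_init)
    delta = [log_init[q] + log_emit[0][q] for q in range(num_states)]
    backpointers = []
    for l in range(1, len(log_emit)):
        prev = delta
        # Best predecessor of each state q at frame l.
        best_prev = [
            max(range(num_states), key=lambda p: prev[p] + log_trans[p][q])
            for q in range(num_states)
        ]
        delta = [
            prev[best_prev[q]] + log_trans[best_prev[q]][q] + log_emit[l][q]
            for q in range(num_states)
        ]
        backpointers.append(best_prev)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda q: delta[q])
    path = [state]
    for best_prev in reversed(backpointers):
        state = best_prev[state]
        path.append(state)
    path.reverse()
    return path


log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
quiet = [math.log(0.9), math.log(0.1)]   # frame favors state 0
loud = [math.log(0.1), math.log(0.9)]    # frame favors state 1
path = viterbi(log_init, log_trans, [quiet, quiet, loud, loud])
```

In this toy input the decoder switches state once, between the second and third frames, because a single transition penalty costs less than two mismatched emissions.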

<Configuration of the Signal Analysis Device According to the First Embodiment of the Present Invention>
Next, the configuration of the signal analysis device according to the first embodiment of the present invention is described. As shown in FIG. 1, the signal analysis device 100 according to the first embodiment can be implemented as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the parameter estimation processing routine described later. Functionally, as shown in FIG. 1, the signal analysis device 100 comprises an input unit 10, a computation unit 20, and an output unit 90.

The input unit 10 receives pairs consisting of the time-series data of a clean acoustic signal, that is, a signal without reverberation and not mixed with other sound sources (hereinafter, a clean acoustic signal), and a label indicating which sound source the signal corresponds to, and stores them in the database 22. The input unit 10 also receives the time-series data of a mixed-sound acoustic signal output from the microphones (hereinafter, a mixed-sound acoustic signal), which is reverberant and in which the source signals from the plurality of sound sources for which clean acoustic signals were received are mixed.

The computation unit 20 comprises the database 22, a time-frequency expansion unit 24, a parameter estimation unit 26, a parameter update unit 32, a convergence determination unit 34, and an audio signal generation unit 40.

The database 22 stores the time-series data of the clean acoustic signals received by the input unit 10.

The time-frequency expansion unit 24 computes, from the time-series data of the clean acoustic signals stored in the database 22, the multichannel observed time-frequency components $\mathbf{Y}$, which represent the observed time-frequency component $y(\omega_k, t_l)$ at each frequency $\omega_k$ and each time $t_l$. In the first embodiment, a time-frequency expansion such as the short-time Fourier transform or the wavelet transform is used.

The parameter estimation unit 26 estimates the power spectra $w_{i,k,q}$ by applying multichannel NMF, a conventional technique, to the multichannel observed time-frequency components $\mathbf{Y}$ obtained by the time-frequency expansion unit 24, and thereby obtains the basis power spectra $\mathbf{W}$. Specifically, in the generative model of equation (10) above, the power spectra $w_{i,k,q}$ are estimated with the number of source signals set to $I = 1$ and the reverberation length of the transfer frequency characteristic set to $N = 0$, and the resulting power spectra are used as the power spectra $w_{i,k,q}$ corresponding to the clean acoustic signal being processed. The label of the clean acoustic signal being processed and the corresponding power spectra $w_{i,k,q}$ are held as a pair.

The mixed-sound time-frequency expansion unit 28, like the time-frequency expansion unit 24, computes the multichannel observed time-frequency components $\mathbf{Y}$, which represent the observed time-frequency component $y(\omega_k, t_l)$ at each frequency $\omega_k$ and each time $t_l$, from the time-series data of the mixed-sound acoustic signal received by the input unit 10. In the first embodiment, a time-frequency expansion such as the short-time Fourier transform or the wavelet transform is used.

The parameter update unit 32 performs its updates so as to maximize the objective function shown in equation (14) above. Its inputs are: the multichannel observed time-frequency components $\mathbf{Y}$ obtained by the mixed-sound time-frequency expansion unit 28; the basis power spectra $\mathbf{W}$ obtained by the parameter estimation unit 26; the activation parameters $\mathbf{H}$ (initial values or the previously updated values), which represent the power $h_{i,l}$ of the source signal of each source $i$ at each time $t_l$; the time series of spatial correlation matrices $\mathbf{C}$ (initial values or the previously updated values), which represents the spatial correlation matrices $\mathbf{C}_{i,k,n}$ obtained from the components $\mathbf{a}_{i,k,n}$ of the transfer frequency characteristic of each source $i$ at each time $t_n$ and each frequency $\omega_k$; and the state sequence (initial values or the previously updated values), which represents the state $z_{i,l}$ of the source signal of each source $i$ at each time $t_l$. Based on these, the unit updates $\boldsymbol{\Lambda}$, the set of the auxiliary variables $\mathbf{R}$ and $\mathbf{U}$; the activation parameters $\mathbf{H}$; the time series of spatial correlation matrices $\mathbf{C}$; and the states $z_{i,l}$ of the source signal of each source $i$.

Specifically, the auxiliary variables $\mathbf{R}$ and $\mathbf{U}$ (hereinafter, $\boldsymbol{\Lambda}$) are first updated by equations (15) and (16) above, based on the basis power spectra $\mathbf{W}$ obtained by the parameter estimation unit 26, the activation parameters $\mathbf{H}$ (initial or previously updated), and the time series of spatial correlation matrices $\mathbf{C}$ (initial or previously updated). Next, the activation parameters $\mathbf{H}$ are updated according to equations (18) to (20) above, based on the updated $\boldsymbol{\Lambda}$, the time series $\mathbf{C}$ (initial or previously updated), the multichannel observed time-frequency components $\mathbf{Y}$ obtained by the mixed-sound time-frequency expansion unit 28, the basis power spectra $\mathbf{W}$ obtained by the parameter estimation unit 26, and the activation parameters $\mathbf{H}$ (initial or previously updated).

Next, $\boldsymbol{\Lambda}$ is updated in the same manner as above, based on the basis power spectra $\mathbf{W}$ obtained by the parameter estimation unit 26, the updated activation parameters $\mathbf{H}$, and the time series of spatial correlation matrices $\mathbf{C}$ (initial or previously updated). The time series $\mathbf{C}$ is then updated according to equations (21) and (22) above, based on the updated $\boldsymbol{\Lambda}$, the time series $\mathbf{C}$ (initial or previously updated), the multichannel observed time-frequency components $\mathbf{Y}$ obtained by the mixed-sound time-frequency expansion unit 28, the basis power spectra $\mathbf{W}$ obtained by the parameter estimation unit 26, and the updated activation parameters $\mathbf{H}$.

Next, $\boldsymbol{\Lambda}$ is updated in the same manner as above, based on the basis power spectra $\mathbf{W}$ obtained by the parameter estimation unit 26, the updated activation parameters $\mathbf{H}$, and the updated time series of spatial correlation matrices $\mathbf{C}$. Then, based on the same $\mathbf{W}$, the updated $\mathbf{H}$, and the updated $\mathbf{C}$, the Viterbi algorithm is used to maximize

and the state sequence $\mathbf{Z}$ is updated to the maximizing sequence. The value of the objective function

computed from the updated parameters is then output to the convergence determination unit 34.

The convergence determination unit 34 determines that the convergence condition is satisfied when the difference between the objective function value obtained by the parameter update unit 32 and the previous objective function value is less than or equal to a predetermined threshold. The update processing by the parameter update unit 32 and the convergence determination by the convergence determination unit 34 are repeated until the convergence condition is satisfied.

The audio signal generation unit 40 computes the dereverberated separated sound $\mathbf{y}^{*}_{i,k,l}$ corresponding to the $i$-th acoustic signal according to the multichannel Wiener filter shown in equation (23) below, based on the basis power spectra $\mathbf{W}$ obtained by the parameter estimation unit 26, the multichannel observed time-frequency components $\mathbf{Y}$ obtained by the mixed-sound time-frequency expansion unit 28, and the activation parameters $\mathbf{H}$ and the time series of spatial correlation matrices $\mathbf{C}$ updated by the parameter update unit 32, and outputs it from the output unit 90. The audio signal generation unit 40 also detects the acoustic event corresponding to the label of the $i$-th source signal based on the state sequence $\mathbf{Z}$ updated by the parameter update unit 32, and outputs the result to the output unit 90.
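Equation (23) is not reproduced in this text. The sketch below therefore assumes the standard multichannel Wiener filter, in which the image of one covariance component is $p\,\mathbf{C}\,\mathbf{X}^{-1}\mathbf{y}$ with $\mathbf{X}$ the sum of all components $p\,\mathbf{C}$. It handles a single time-frequency bin with two channels so that the matrix inverse can be written out by hand; all helper names are hypothetical.

```python
def outer(a):
    """Rank-1 spatial correlation matrix a a^H for a 2-channel complex vector a."""
    return [[a[r] * a[c].conjugate() for c in range(2)] for r in range(2)]


def scale(matrix, gain):
    return [[gain * matrix[r][c] for c in range(2)] for r in range(2)]


def add(left, right):
    return [[left[r][c] + right[r][c] for c in range(2)] for r in range(2)]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def apply(matrix, vector):
    return [matrix[r][0] * vector[0] + matrix[r][1] * vector[1] for r in range(2)]


def wiener_separate(components, y):
    """Split observation y among covariance components (power, C).

    Each component contributes power * C to the observation covariance X.
    Returns one filtered 2-channel vector per component.
    """
    x = [[0j, 0j], [0j, 0j]]
    for power, c in components:
        x = add(x, scale(c, power))
    x_inv_y = apply(inverse_2x2(x), y)
    return [apply(scale(c, power), x_inv_y) for power, c in components]


components = [
    (2.0, outer([1 + 0j, 0.5j])),       # e.g. direct path of one source (illustrative)
    (1.0, outer([0.3 + 0j, 1 + 0j])),   # e.g. another source or a reverberant tap
]
y = [1 + 2j, -1 + 0.5j]
parts = wiener_separate(components, y)
total = [parts[0][m] + parts[1][m] for m in range(2)]
```

Under this assumption, keeping only the component for source $i$ at tap $n = 0$ gives a separated and dereverberated estimate, while the filtered parts of all components always add back up to the observation.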

<Operation of the Signal Analysis Device According to the First Embodiment of the Present Invention>
Next, the operation of the signal analysis device 100 according to the first embodiment of the present invention is described. First, the input unit 10 receives the pairs of clean acoustic signal time-series data and labels and stores them in the database 22. Then, when the input unit 10 receives the time-series data of the mixed-sound acoustic signal output from the microphones, the signal analysis device 100 executes the parameter estimation processing routine shown in FIG. 2.

First, in step S100, the pairs of clean acoustic signal time-series data and labels stored in the database are read.

Next, in step S102, the multichannel observed time-frequency components $\mathbf{Y}$ are obtained from the time-series data of the clean acoustic signals read in step S100.

Next, in step S104, the basis power spectra $\mathbf{W}$ are obtained from the multichannel observed time-frequency components $\mathbf{Y}$ obtained in step S102.

Next, in step S106, the multichannel observed time-frequency components $\mathbf{Y}$ are obtained from the time-series data of the mixed-sound acoustic signal received by the input unit 10.

In step S107, initial values are set for the activation parameters $\mathbf{H}$ and the time series of spatial correlation matrices $\mathbf{C}$.

In step S108, $\boldsymbol{\Lambda}$ is updated by equations (15) and (16) above, based on the basis power spectra $\mathbf{W}$ obtained in step S104, the activation parameters $\mathbf{H}$ initialized in step S107 or previously updated in step S110, and the time series of spatial correlation matrices $\mathbf{C}$ initialized in step S107 or previously updated in step S114.

Next, in step S110, the activation parameters $\mathbf{H}$ are updated based on the $\boldsymbol{\Lambda}$ updated in step S108, the time series $\mathbf{C}$ initialized in step S107 or previously updated in step S114, the multichannel observed time-frequency components $\mathbf{Y}$ obtained in step S106, the basis power spectra $\mathbf{W}$ obtained in step S104, and the activation parameters $\mathbf{H}$ initialized in step S107 or previously updated in step S110.

Next, in step S112, $\boldsymbol{\Lambda}$ is updated by equations (15) and (16) above, based on the basis power spectra $\mathbf{W}$ obtained in step S104, the activation parameters $\mathbf{H}$ updated in step S110, and the time series $\mathbf{C}$ initialized in step S107 or previously updated in step S114.

Next, in step S114, the time series of spatial correlation matrices $\mathbf{C}$ is updated according to equations (21) and (22) above, based on the $\boldsymbol{\Lambda}$ updated in step S112, the time series $\mathbf{C}$ initialized in step S107 or previously updated in step S114, the multichannel observed time-frequency components $\mathbf{Y}$ obtained in step S106, the basis power spectra $\mathbf{W}$ obtained in step S104, and the activation parameters $\mathbf{H}$ updated in step S110.

Next, in step S116, $\boldsymbol{\Lambda}$ is updated by equations (15) and (16) above, based on the basis power spectra $\mathbf{W}$ obtained in step S104, the activation parameters $\mathbf{H}$ updated in step S110, and the time series $\mathbf{C}$ updated in step S114.

Next, in step S118, the state sequence $\mathbf{Z}$ is updated according to the Viterbi algorithm, based on the basis power spectra $\mathbf{W}$ obtained in step S104, the activation parameters $\mathbf{H}$ obtained in step S110, and the time series $\mathbf{C}$ obtained in step S114.

Next, in step S120, it is determined whether the convergence condition is satisfied. If it is satisfied, the parameter estimation processing routine ends. If it is not, the processing returns to step S108, and steps S108 to S120 are repeated.
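The control flow of steps S108 to S120 can be sketched as a generic loop. The callables `update_step` and `objective` are placeholders for the actual updates of Λ, H, C, and Z and for the objective function, which are not reproduced here; the iteration cap is an added safeguard, not something stated in this section.

```python
def run_until_converged(update_step, objective, params, threshold=1e-6, max_iters=1000):
    """Repeat update_step until the objective changes by at most threshold.

    Returns the final parameters and the number of iterations performed.
    """
    previous = objective(params)
    for iteration in range(1, max_iters + 1):
        params = update_step(params)            # stands in for steps S108 to S118
        current = objective(params)
        if abs(current - previous) <= threshold:  # step S120
            return params, iteration
        previous = current
    return params, max_iters


# Toy stand-in: maximize -(x - 3)^2 by moving halfway toward 3 on each update.
final_x, iterations = run_until_converged(
    update_step=lambda x: x + 0.5 * (3.0 - x),
    objective=lambda x: -((x - 3.0) ** 2),
    params=0.0,
)
```

Because each update in the method is constructed not to decrease the objective, the absolute difference used here equals the increase in the objective.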

<Experimental Example>
A preliminary experiment is described that was conducted to verify that the integrated method of source separation, dereverberation, and acoustic event detection performed by the signal analysis device 100 of the first embodiment can carry out each of these tasks appropriately.

A multichannel reverberant mixture was created artificially by convolving impulse responses with a reverberation time of 600 ms, taken from the RWCP real-environment database (Non-Patent Document 5), with the stapler sound and the bell sound in the RWCP non-speech dry sources (Non-Patent Document 5: S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, “Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition,” in Proc. 2nd International Conference on Language Resources & Evaluation (LREC 2000), pp. 965-968, 2000.). FIG. 3 shows the spectrogram of the resulting mixture. The number of power spectrum states was set to $Q = 2$, and the hyperparameters $\alpha_{1:Q}$ and $\beta_{1:Q}$ were adjusted so that $q = 1$ corresponds to the silent state. The power spectra were learned in advance from the clean acoustic signals, and the signal analysis device 100 of this embodiment was then applied.

The spectrogram of the stapler sound under anechoic conditions (the original source signal) is shown in FIG. 4, the dereverberated separated sound obtained by this embodiment in FIG. 5, and the acoustic event detection result in FIG. 6. In FIG. 4, black indicates the state estimated at each time. The spectrogram of the separated sound obtained by the conventional method is shown in FIG. 7, where the method of Non-Patent Document 4 was used as the conventional method. With the conventional method the reverberation is estimated as part of the source signal, whereas with the method of this embodiment the reverberation is removed and the separated sound is closer to the original source signal shown in FIG. 4. The signal-to-interference ratio (SIR) (Non-Patent Document 6: E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, pp. 1462-1469, 2006.) was used as an objective measure of source separation performance. A higher SIR indicates better separation. The SIR averaged over the two sources was 39.67 dB for the separated sound of this embodiment and 35.94 dB for that of the conventional method, so the method of this embodiment achieves higher separation performance than the conventional method. These results indicate that the method of this embodiment performs source separation, dereverberation, and acoustic event detection in an integrated manner, and separates the sources more accurately than the conventional method.
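To make the reported quantity concrete, the sketch below computes a ratio of target energy to interference energy in decibels. It is a simplification: the evaluation method of Non-Patent Document 6 first decomposes an estimate into target, interference, and artifact parts by projection, whereas this sketch assumes those parts are already available. The function name `sir_db` is hypothetical.

```python
import math


def sir_db(target, interference):
    """10 * log10 of target energy over interference energy (both given as sample lists)."""
    target_energy = sum(abs(v) ** 2 for v in target)
    interference_energy = sum(abs(v) ** 2 for v in interference)
    return 10.0 * math.log10(target_energy / interference_energy)


# Interference at one tenth of the target amplitude corresponds to a 100-fold energy ratio.
value = sir_db([1.0, 1.0, 1.0, 1.0], [0.1, 0.1, 0.1, 0.1])
```

On this scale, the reported gap between 39.67 dB and 35.94 dB corresponds to roughly a 2.4-fold reduction in residual interference energy relative to the target.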

As described above, the signal analysis device according to the first embodiment of the present invention takes as input a multichannel observed signal in which the source signals from I sound sources are mixed. Based on the multichannel observed time-frequency components, the basis power spectra, the activation parameters, the time series of spatial correlation matrices, and the state sequence, it updates each of the basis power spectra, the activation parameters, the time series of spatial correlation matrices, and the state sequence so as to maximize an objective function representing the conditional probability of these parameters of each source given the multichannel observed time-frequency components. By repeating the updates by the parameter update unit until a predetermined convergence condition is satisfied, source separation, acoustic event detection, and dereverberation can be performed jointly and accurately.

In addition, source separation, acoustic event detection, and dereverberation can be performed jointly and accurately on the basis of a single unified optimization criterion.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the first embodiment describes the case in which the separated sound, from which the reverberation corresponding to each acoustic signal has been removed, is calculated according to a multi-channel Wiener filter; however, the invention is not limited to this. The separated sound may instead be obtained by attaching the phase of the multi-channel observation signal y^_{k,l} in the time frequency domain to the quantity given by an expression that is not reproduced in this text.
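As a rough illustration of the two options mentioned here, the sketch below applies a generic multi-channel Wiener filter to a single time frequency bin and, alternatively, attaches the observed phase to a magnitude estimate. It is not the filter of the first embodiment: the per-source covariance matrices `R`, the function names, and the magnitude input are assumptions made for illustration, and the reverberation terms of the model are not represented.

```python
import numpy as np

def multichannel_wiener(y, R):
    """Generic multi-channel Wiener filter for one time frequency bin.

    y : (M,) complex observation over M microphones.
    R : (I, M, M) Hermitian positive-definite covariance of each source image
        (hypothetical input; the patent derives its own quantities).
    Returns an (I, M) array whose row i estimates the image of source i.
    """
    R_mix = R.sum(axis=0)              # covariance of the mixture
    x = np.linalg.solve(R_mix, y)      # R_mix^{-1} y, computed once
    return R @ x                       # row i is R_i R_mix^{-1} y

def attach_observed_phase(magnitude, y):
    """Combine a nonnegative magnitude estimate with the phase of y."""
    return magnitude * np.exp(1j * np.angle(y))
```

Because the per-source gains sum to the identity matrix, the estimated source images in this sketch add back up to the observation vector.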

In the first embodiment, the convergence condition is judged to be satisfied when the difference between the current value and the previous value of the objective function is equal to or less than a predetermined threshold; however, the invention is not limited to this. For example, the convergence condition may be that the updates have been performed a predetermined number of times.
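The two convergence conditions described here can be sketched as a single check. The function name, argument names, and default threshold below are hypothetical and chosen only for illustration.

```python
def has_converged(curr_obj, prev_obj, iteration, tol=1e-4, max_iter=None):
    """Return True when either stopping rule is met.

    curr_obj, prev_obj : current and previous objective values
                         (prev_obj is None on the first iteration).
    iteration          : number of update rounds completed so far.
    tol                : threshold on the change in the objective (assumed value).
    max_iter           : optional fixed number of update rounds.
    """
    if max_iter is not None and iteration >= max_iter:
        return True                      # fixed-iteration-count rule
    if prev_obj is None:
        return False                     # nothing to compare against yet
    return abs(curr_obj - prev_obj) <= tol   # small-change rule
```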

Moreover, because the order in which the parameters are updated is arbitrary, the order is not limited to that of the first embodiment.

Next, a signal analysis apparatus according to a second embodiment will be described.

The second embodiment differs from the first embodiment in that some of the power spectra are learned in advance and then held fixed, while each of the remaining parameters, including the power spectra that were not learned, is estimated from the multi-channel observation signal. Components and operations that are the same as those of the signal analysis apparatus 100 according to the first embodiment are given the same reference numerals, and their description is omitted.

<Configuration of the Signal Analysis Apparatus According to the Second Embodiment of the Present Invention>
Next, the configuration of the signal analysis apparatus according to the second embodiment of the present invention will be described. As shown in FIG. 8, the signal analysis apparatus 200 according to the second embodiment of the present invention can be implemented by a computer that includes a CPU, a RAM, and a ROM storing a program for executing the parameter estimation processing routine described later, together with various data. Functionally, as shown in FIG. 8, the signal analysis apparatus 200 includes an input unit 210, a calculation unit 220, and an output unit 90.

The input unit 210 receives pairs consisting of the time-series data of a clean acoustic signal, that is, an acoustic signal with no reverberation and with no other sound sources mixed in (hereinafter, a clean acoustic signal), and a label indicating which sound source the acoustic signal corresponds to, and stores the pairs in the database 222. The input unit 210 also receives the time-series data of a mixed sound acoustic signal output from the microphone (hereinafter, a mixed sound acoustic signal). This signal is reverberant and contains sound source signals from a plurality of sound sources, including sound sources other than those for which a clean acoustic signal was received.

The calculation unit 220 includes a database 222, a time frequency expansion unit 24, a parameter estimation unit 226, a parameter update unit 232, a convergence determination unit 34, and an audio signal generation unit 40.

The parameter estimation unit 226 estimates the predetermined power spectra w_{i,k,q} by applying multi-channel NMF, a conventional technique, to the multi-channel observation time frequency component Y^ obtained by the time frequency expansion unit 24. Specifically, in the generative model shown in equation (10) above, the power spectra w_{i,k,q} are estimated with the number of sound source signals set to I = 1 and the reverberation time length in the transfer frequency characteristic set to N = 0, and the resulting power spectra are used as the power spectra w_{i,k,q} corresponding to the clean acoustic signal being processed.
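To give a feel for this kind of pre-learning, the sketch below fits spectral bases to the power spectrogram of one clean signal. It is a simplified stand-in, not the multi-channel NMF of equation (10): it uses single-channel NMF with the Itakura-Saito divergence and square-root multiplicative updates, and the function names, the number of bases, and the iteration count are assumptions made for illustration.

```python
import numpy as np

def is_divergence(V, V_hat):
    """Itakura-Saito divergence between two positive arrays."""
    ratio = V / V_hat
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def learn_power_spectra(V, n_bases, n_iter=100, seed=0, eps=1e-12):
    """Fit V (K x L, positive) as W @ H with W (K x n_bases), H (n_bases x L).

    The columns of W play the role of learned power spectra for one source;
    H holds their time-varying gains.
    """
    rng = np.random.default_rng(seed)
    K, L = V.shape
    W = rng.random((K, n_bases)) + eps
    H = rng.random((n_bases, L)) + eps
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # Square-root multiplicative update for the bases
        W *= np.sqrt(((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T))
        V_hat = W @ H + eps
        # Square-root multiplicative update for the gains
        H *= np.sqrt((W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat)))
    return W, H
```

In the second embodiment, spectra learned this way for the labelled sound sources would then be held fixed while the remaining parameters are estimated from the mixture.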

The parameter update unit 232 updates Λ^ (the set of R^ and U^), the base power spectrum W^, the activation parameter H^, the time series C^ of the spatial correlation matrix, and the state z_{i,l} of the sound source signal of each sound source i, so as to maximize the objective function shown in equation (14) above. These updates are based on: the multi-channel observation time frequency component Y^ obtained by the mixed sound time frequency expansion unit 28; the base power spectrum W^, which consists of the power spectra w_{i,k,q} of the known sound sources i obtained by the parameter estimation unit 226 and the power spectra w_{i,k,q} of the sound sources i other than the known sound sources; the activation parameter H^, at its initial or previously updated value, which represents the power h_{i,l} of the sound source signal of each sound source i at each time t_l; the time series C^ of the spatial correlation matrix, at its initial or previously updated value, which represents the spatial correlation matrices C^_{i,k,n} based on the components a_{i,k,n} of the transfer frequency characteristic of each sound source i at each time t_n and each frequency ω_k; and the state sequence Z^, at its initial or previously updated value, which represents the state z_{i,l} of the sound source signal of each sound source i at each time t_l.

Specifically, Λ^ is first updated according to equations (15) and (16) above, based on the base power spectrum W^ (consisting of the fixed power spectra w_{i,k,q} of the known sound sources i obtained by the parameter estimation unit 226 and the initial or previously updated power spectra w_{i,k,q} of the other sound sources i), the initial or previously updated activation parameter H^, and the initial or previously updated time series C^ of the spatial correlation matrix.
Next, the power spectra w_{i,k,q} of the other sound sources i are updated according to equation (17) above, based on the updated Λ^, the base power spectrum W^, the initial or previously updated activation parameter H^, and the initial or previously updated time series C^ of the spatial correlation matrix; the base power spectrum W^ is thereby partially updated. Next, Λ^ is updated according to equations (15) and (16) above, based on the partially updated base power spectrum W^, the initial or previously updated activation parameter H^, and the initial or previously updated time series C^. Next, the activation parameter H^ is updated according to equations (18) to (20) above, based on the updated Λ^, the initial or previously updated time series C^, the multi-channel observation time frequency component Y^ obtained by the mixed sound time frequency expansion unit 28, the partially updated base power spectrum W^, and the initial or previously updated activation parameter H^. Next, Λ^ is updated in the same manner, based on the partially updated base power spectrum W^, the updated activation parameter H^, and the initial or previously updated time series C^. Next, the time series C^ of the spatial correlation matrix is updated according to equations (21) and (22) above, based on the updated Λ^, the initial or previously updated time series C^, the multi-channel observation time frequency component Y^ obtained by the mixed sound time frequency expansion unit 28, the partially updated base power spectrum W^, and the updated activation parameter H^.
Next, Λ^ is updated in the same manner, based on the partially updated base power spectrum W^, the updated activation parameter H^, and the updated time series C^ of the spatial correlation matrix. Finally, the state sequence Z^ is updated according to the Viterbi algorithm, based on the partially updated base power spectrum W^, the updated activation parameter H^, and the updated time series C^ of the spatial correlation matrix.
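The partial update of W^ described above, in which the learned spectra of known sound sources stay fixed and only the other sources receive new values, can be sketched with a boolean mask. The array layout (source, frequency, state) and all names are assumptions made for illustration; the candidate values would come from equation (17), which is not reproduced here.

```python
import numpy as np

def partial_update(W, W_candidate, is_known_source):
    """Keep the spectra of known sources and take new values for the rest.

    W, W_candidate  : arrays of shape (I, K, Q) holding w_{i,k,q}
                      before and after a full update (assumed layout).
    is_known_source : length-I boolean sequence; True marks a source whose
                      spectra were learned in advance and must stay fixed.
    """
    keep = np.asarray(is_known_source, dtype=bool)[:, None, None]
    return np.where(keep, W, W_candidate)
```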

<Operation of the Signal Analysis Apparatus According to the Second Embodiment of the Present Invention>
Next, the operation of the signal analysis apparatus 200 according to the second embodiment of the present invention will be described. First, the input unit 210 receives pairs of clean acoustic signal time-series data and labels, and stores them in the database 222. Then, when the input unit 210 receives the time-series data of the mixed sound acoustic signal output from the microphone, the signal analysis apparatus 200 executes the parameter estimation processing routine shown in FIG. 9.

In step S200, the power spectra w_{i,k,q} are obtained based on the multi-channel observation time frequency component Y^ obtained in step S102.

In step S201, initial values are set for the activation parameter H^ and the time series C^ of the spatial correlation matrix. Initial values are also set for those power spectra w_{i,k,q} in the base power spectrum W^ that belong to sound sources i other than the known sound sources.

In step S202, Λ^ is updated based on: the base power spectrum W^, consisting of the fixed power spectra w_{i,k,q} of the known sound sources i obtained in step S200 and the power spectra w_{i,k,q} of the other sound sources i, which were either initialized in step S201 or obtained in the previous execution of step S203; the activation parameter H^ initialized in step S201 or obtained in the previous execution of step S206; and the time series C^ of the spatial correlation matrix initialized in step S201 or obtained in the previous execution of step S210.

Next, in step S203, the power spectra w_{i,k,q} of the sound sources i other than the known sound sources are updated, and the base power spectrum W^ is thereby updated, based on the Λ^ obtained in step S202 and on the same base power spectrum W^, activation parameter H^, and time series C^ of the spatial correlation matrix as used in step S202.

Next, in step S204, Λ^ is updated based on the base power spectrum W^ obtained in step S203, the activation parameter H^ initialized in step S201 or obtained in the previous execution of step S206, and the time series C^ of the spatial correlation matrix initialized in step S201 or obtained in the previous execution of step S210.

Next, in step S206, the activation parameter H^ is updated based on the Λ^ obtained in step S204, the base power spectrum W^ obtained in step S203, the activation parameter H^ initialized in step S201 or obtained in the previous execution of step S206, and the time series C^ of the spatial correlation matrix initialized in step S201 or obtained in the previous execution of step S210.

Next, in step S208, Λ^ is updated based on the base power spectrum W^ obtained in step S203, the activation parameter H^ obtained in step S206, and the time series C^ of the spatial correlation matrix initialized in step S201 or obtained in the previous execution of step S210.

Next, in step S210, the time series C^ of the spatial correlation matrix is updated according to equations (21) and (22) above, based on the Λ^ updated in step S208, the time series C^ initialized in step S201 or updated in the previous execution of step S210, the multi-channel observation time frequency component Y^ obtained in step S106, the base power spectrum W^ obtained in step S203, and the activation parameter H^ updated in step S206.

Next, in step S212, Λ^ is updated based on the base power spectrum W^ obtained in step S203, the activation parameter H^ obtained in step S206, and the time series C^ of the spatial correlation matrix obtained in step S210.

Next, in step S214, the state sequence Z^ is updated according to the Viterbi algorithm, based on the base power spectrum W^ obtained in step S203, the activation parameter H^ obtained in step S206, and the time series C^ of the spatial correlation matrix obtained in step S210.
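For readers unfamiliar with the Viterbi algorithm used in this step, the sketch below is a generic log-domain implementation for a single state sequence. How the initial, transition, and per-frame scores would be derived from W^, H^, and C^ is not shown; those inputs and all names are hypothetical.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for one source.

    log_init  : (Q,)   log prior of the state at the first frame.
    log_trans : (Q, Q) log transition scores, indexed [from, to].
    log_emit  : (L, Q) per-frame log score of each state.
    Returns an integer array of length L with the best state per frame.
    """
    L, Q = log_emit.shape
    delta = log_init + log_emit[0]            # best score ending in each state
    back = np.zeros((L, Q), dtype=int)        # best predecessor per state
    for l in range(1, L):
        scores = delta[:, None] + log_trans   # scores[p, q]: come from p, go to q
        back[l] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + log_emit[l]
    path = np.empty(L, dtype=int)
    path[-1] = int(np.argmax(delta))
    for l in range(L - 1, 0, -1):             # trace the best path backwards
        path[l - 1] = back[l, path[l]]
    return path
```

With strongly self-favoring transition scores, a single frame that weakly prefers another state does not cause a switch, which is the smoothing behavior that makes this algorithm suited to detecting when a sound event starts and ends.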

Next, in step S216, it is determined whether the convergence condition is satisfied. If the convergence condition is satisfied, the parameter estimation processing routine ends. If it is not satisfied, the process returns to step S202, and the processing of steps S202 to S216 is repeated.
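The control flow of this routine, a fixed sequence of update steps repeated until the convergence test passes, can be sketched as follows. The update steps and the objective are passed in as callables because their formulas are given by the numbered equations and are not reproduced here; the function name, the dictionary of parameters, and the default threshold are assumptions made for illustration.

```python
def run_parameter_estimation(params, steps, objective, tol=1e-6, max_iter=100):
    """Repeat a sequence of update steps until the objective stops changing.

    params    : dict of current parameter values (hypothetical container).
    steps     : list of callables, each mapping params -> params, applied in
                order once per round (for example the Λ, W, H, C, Z updates).
    objective : callable mapping params -> float.
    Returns the final params and the number of rounds performed.
    """
    prev = None
    rounds = 0
    for rounds in range(1, max_iter + 1):
        for step in steps:
            params = step(params)          # each step sees the latest values
        curr = objective(params)
        if prev is not None and abs(curr - prev) <= tol:
            break                          # convergence condition satisfied
        prev = curr
    return params, rounds
```

Passing the steps as a list also reflects the remark that the update order is arbitrary: a different order only requires reordering the list.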

As described above, the signal analysis apparatus according to the second embodiment of the present invention takes as input a multi-channel observation signal in which sound source signals from I sound sources are mixed. So as to maximize an objective function representing the conditional probability, given the multi-channel observation time frequency component, of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state sequence of each sound source, the apparatus updates each of the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state sequence, based on the multi-channel observation time frequency component and the current values of those four quantities. The updates by the parameter update unit are repeated until a predetermined convergence condition is satisfied. As a result, sound source separation, acoustic event detection, and dereverberation can be performed jointly and with high accuracy.

Next, a signal analysis apparatus according to a third embodiment will be described.

The third embodiment differs from the first embodiment in that no power spectra are learned in advance and all model parameters are estimated from the multi-channel observation signal. Components and operations that are the same as those of the signal analysis apparatus 100 according to the first embodiment are given the same reference numerals, and their description is omitted.

<Configuration of the Signal Analysis Apparatus According to the Third Embodiment of the Present Invention>
Next, the configuration of the signal analysis apparatus according to the third embodiment of the present invention will be described. As shown in FIG. 10, the signal analysis apparatus 300 according to the third embodiment of the present invention can be implemented by a computer that includes a CPU, a RAM, and a ROM storing a program for executing the parameter estimation processing routine described later, together with various data. Functionally, as shown in FIG. 10, the signal analysis apparatus 300 includes an input unit 310, a calculation unit 320, and an output unit 90.

The input unit 310 receives the time-series data of a mixed sound acoustic signal output from the microphone (hereinafter, a mixed sound acoustic signal), which is reverberant and in which sound source signals from a plurality of sound sources are mixed.

The calculation unit 320 includes a mixed sound time frequency expansion unit 28, a parameter update unit 332, a convergence determination unit 34, and an audio signal generation unit 40.

The parameter update unit 332 updates Λ^ (the set of R^ and U^), the base power spectrum W^, the activation parameter H^, the time series C^ of the spatial correlation matrix, and the state z_{i,l} of the sound source signal of each sound source i, so as to maximize the objective function shown in equation (14) above. These updates are based on: the multi-channel observation time frequency component Y^ obtained by the mixed sound time frequency expansion unit 28; the base power spectrum W^, which consists of the initial or previously updated power spectra w_{i,k,q} of each sound source i; the activation parameter H^, at its initial or previously updated value, which represents the power h_{i,l} of the sound source signal of each sound source i at each time t_l; the time series C^ of the spatial correlation matrix, at its initial or previously updated value, which represents the spatial correlation matrices C^_{i,k,n} based on the components a_{i,k,n} of the transfer frequency characteristic of each sound source i at each time t_n and each frequency ω_k; and the state sequence Z^, at its initial or previously updated value, which represents the state z_{i,l} of the sound source signal of each sound source i at each time t_l.

Specifically, Λ^ is first updated according to equations (15) and (16) above, based on the base power spectrum W^ consisting of the initial or previously updated power spectra w_{i,k,q} of the sound sources i, the initial or previously updated activation parameter H^, and the initial or previously updated time series C^ of the spatial correlation matrix. Next, the base power spectrum W^ is updated according to equation (17) above, based on the updated Λ^, the base power spectrum W^, the initial or previously updated activation parameter H^, and the initial or previously updated time series C^.
Next, Λ^ is updated according to equations (15) and (16) above, based on the updated base power spectrum W^, the initial or previously updated activation parameter H^, and the initial or previously updated time series C^. Next, the activation parameter H^ is updated according to equations (18) to (20) above, based on the updated Λ^, the initial or previously updated time series C^, the multi-channel observation time frequency component Y^ obtained by the mixed sound time frequency expansion unit 28, the updated base power spectrum W^, and the initial or previously updated activation parameter H^. Next, Λ^ is updated in the same manner, based on the updated base power spectrum W^, the updated activation parameter H^, and the initial or previously updated time series C^. Next, the time series C^ of the spatial correlation matrix is updated according to equations (21) and (22) above, based on the updated Λ^, the initial or previously updated time series C^, the multi-channel observation time frequency component Y^ obtained by the mixed sound time frequency expansion unit 28, the updated base power spectrum W^, and the updated activation parameter H^. Next, Λ^ is updated in the same manner, based on the updated base power spectrum W^, the updated activation parameter H^, and the updated time series C^. Finally, the state sequence Z^ is updated according to the Viterbi algorithm, based on the updated base power spectrum W^, the updated activation parameter H^, and the updated time series C^ of the spatial correlation matrix.

<Operation of the Signal Analysis Apparatus According to the Third Embodiment of the Present Invention>
Next, the operation of the signal analysis apparatus 300 according to the third embodiment of the present invention will be described. When the input unit 310 receives the time-series data of the mixed sound acoustic signal output from the microphone, the signal analysis apparatus 300 executes the parameter estimation processing routine shown in FIG. 11.

In step S300, initial values are set for the power spectrum matrix W^, the activation parameter H^, and the time series C^ of the spatial correlation matrix.
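One possible way to set such initial values is sketched below. The text does not say how the values are chosen, so everything here is an assumption: random positive numbers for W^ and H^, identity matrices for every spatial correlation matrix, an all-zero state sequence, and the array layouts and names.

```python
import numpy as np

def init_parameters(n_sources, n_freqs, n_states, n_frames, n_taps, n_mics, seed=0):
    """Return illustrative initial values for W, H, C and Z.

    W : (I, K, Q)       positive power spectra w_{i,k,q}
    H : (I, L)          positive activations h_{i,l}
    C : (I, K, N, M, M) one Hermitian matrix per source, frequency and lag
    Z : (I, L)          integer state index z_{i,l}
    """
    rng = np.random.default_rng(seed)
    W = rng.random((n_sources, n_freqs, n_states)) + 1e-3
    H = rng.random((n_sources, n_frames)) + 1e-3
    # Identity matrices are Hermitian and positive definite, a safe start.
    C = np.broadcast_to(
        np.eye(n_mics, dtype=complex),
        (n_sources, n_freqs, n_taps, n_mics, n_mics),
    ).copy()
    Z = np.zeros((n_sources, n_frames), dtype=int)
    return W, H, C, Z
```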

In step S301, Λ^ is updated based on the power spectrum matrix W^ initialized in step S300 or obtained in the previous execution of step S302, the activation parameter H^ initialized in step S300 or obtained in the previous execution of step S306, and the time series C^ of the spatial correlation matrix initialized in step S300 or obtained in the previous execution of step S310.

Next, in step S302, the base power spectrum W^ is updated based on the Λ^ obtained in step S301 and on the same power spectrum matrix W^, activation parameter H^, and time series C^ of the spatial correlation matrix as used in step S301.

Next, in step S304, Λ^ is updated based on the base power spectrum W^ obtained in step S302, the activation parameter H^ initialized in step S300 or obtained in the previous execution of step S306, and the time series C^ of the spatial correlation matrix initialized in step S300 or obtained in the previous execution of step S310.

Next, in step S306, the activation parameter H^ is updated based on the Λ^ obtained in step S304, the base power spectrum W^ obtained in step S302, the activation parameter H^ initialized in step S300 or obtained in the previous execution of step S306, and the time series C^ of the spatial correlation matrix initialized in step S300 or obtained in the previous execution of step S310.

Next, in step S308, Λ^ is updated based on the base power spectrum W^ obtained in step S302, the activation parameter H^ obtained in step S306, and the time series C^ of the spatial correlation matrix initialized in step S300 or obtained in the previous execution of step S310.

Next, in step S310, the time series C^ of the spatial correlation matrix is updated according to equations (21) and (22) above, based on the Λ^ updated in step S308, the time series C^ initialized in step S300 or updated in the previous execution of step S310, the multi-channel observation time frequency component Y^ obtained in step S106, the base power spectrum W^ obtained in step S302, and the activation parameter H^ updated in step S306.

Next, in step S312, Λ^ is updated based on the base power spectrum W^ obtained in step S302, the activation parameter H^ obtained in step S306, and the time series C^ of the spatial correlation matrix obtained in step S310.

Next, in step S314, the state sequence Z^ is updated according to the Viterbi algorithm, based on the base power spectrum W^ obtained in step S302, the activation parameter H^ obtained in step S306, and the time series C^ of the spatial correlation matrix obtained in step S310.

Next, in step S316, it is determined whether the convergence condition is satisfied. If the convergence condition is satisfied, the parameter estimation processing routine ends. If it is not satisfied, the process returns to step S301, and the processing of steps S301 to S316 is repeated.

以上説明したように、本発明の第３の実施の形態に係る信号解析装置によれば、Ｉ個の音源からの音源信号が混合された多チャンネル観測信号を入力として、多チャンネル観測時間周波数成分が与えられたもとでの、各音源の、基底パワースペクトルと、アクティベーションパラメータと、空間相関行列の時系列と、状態系列の条件付き確率を表す目的関数を最大化するように、多チャンネル観測時間周波数成分、基底パワースペクトル、アクティベーションパラメータ、空間相関行列の時系列、及び状態系列に基づいて、基底パワースペクトルの更新、アクティベーションパラメータの更新、空間相関行列の時系列の更新、及び状態系列の更新の各々を行い、予め定められた収束条件を満たすまで、パラメータ更新部による更新を繰り返し行うことにより、音源分離、音響イベント検出、及び残響除去を統合的に精度良く行うことができる As described above, according to the signal analyzing apparatus of the third embodiment of the present invention, the multi-channel observation time frequency component is input using the multi-channel observation signal mixed with the sound source signals from the I sound sources. Multi-channel observation time so that the objective function representing the conditional probability of the state sequence and the time series of the spatial power matrix and the state power spectrum of each sound source is maximized. Based on frequency components, base power spectrum, activation parameter, time series of spatial correlation matrix, and state series, update of base power spectrum, update of activation parameters, update of time series of spatial correlation matrix, and state series Perform each update and repeat the update by the parameter update unit until a predetermined convergence condition is satisfied The Ukoto, source separation, an acoustic event detected, and a dereverberation can be performed integrally precisely

In this specification, the embodiments have been described with the program installed in advance. However, the program can also be provided stored on a computer-readable recording medium, or provided via a network.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
22 Database
24 Time frequency expansion unit
26 Parameter estimation unit
28 Mixed sound time frequency expansion unit
32 Parameter update unit
34 Convergence determination unit
40 Speech signal generation unit
90 Output unit
100 Signal analysis apparatus
200 Signal analysis apparatus
210 Input unit
220 Calculation unit
222 Database
226 Parameter estimation unit
232 Parameter update unit
300 Signal analysis apparatus
310 Input unit
320 Calculation unit
332 Parameter update unit

Claims

A signal analysis device comprising:
a mixed sound time frequency expansion unit that receives, as input, time-series data of a multi-channel observation signal in which sound source signals from I sound sources are mixed, and outputs a multi-channel observation time frequency component representing the observation time frequency component y(ω_k, t_l) of each frequency ω_k at each time t_l;
a parameter update unit that, so as to maximize an objective function representing the conditional probability, given the multi-channel observation time frequency component, of a base power spectrum representing the power spectrum w_{i,k,q} at each frequency ω_k and in each state q of each sound source i, an activation parameter representing the power h_{i,l} at each time t_l of each sound source i, a time series of a spatial correlation matrix representing the spatial correlation matrix C_{i,k,n} at each time t_n and each frequency ω_k of each sound source i, and a state series representing the sound source state z_{i,l} at each time t_l of each sound source i, performs each of an update of the base power spectrum, an update of the activation parameter, an update of the time series of the spatial correlation matrix, and an update of the state series, based on the multi-channel observation time frequency component output by the mixed sound time frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series; and
a convergence determination unit that causes the update by the parameter update unit to be performed repeatedly until a predetermined convergence condition is satisfied.

The signal analysis device according to claim 1, wherein
the objective function is the product of the probability of the multi-channel observation time frequency component given the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, and a prior probability representing the probability that the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series are output, or is the logarithm of the product, and
the parameter update unit performs each of the update of the base power spectrum, the update of the activation parameter, the update of the time series of the spatial correlation matrix, and the update of the state series so as to maximize the product or the logarithm of the product, based on the multi-channel observation time frequency component output by the mixed sound time frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series.

The signal analysis device according to claim 2, wherein
the objective function is an auxiliary function that is expressed using the multi-channel observation time frequency component, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, the state series, auxiliary variables R_{i,k,l,n} for all combinations of (i, k, l, n), and auxiliary variables U_{k,l} for all combinations of (k, l), and that is a lower bound function of the logarithm of the product, and
the parameter update unit updates the auxiliary variables R_{i,k,l,n} and the auxiliary variables U_{k,l} so as to increase the auxiliary function, based on the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, and performs each of the update of the base power spectrum, the update of the activation parameter, the update of the time series of the spatial correlation matrix, and the update of the state series, based on the multi-channel observation time frequency component output by the mixed sound time frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, the state series, the auxiliary variables R_{i,k,l,n}, and the auxiliary variables U_{k,l}.

4. The signal analysis device according to claim 3, wherein the auxiliary function is a lower bound function defined using Jensen's inequality, which exploits the concavity of the trace with respect to the negative inverse matrix, and a tangent inequality, which exploits the convexity of the negative logarithmic function.

The signal analysis device according to claim 1, wherein the parameter update unit updates, among the base power spectrum, the power spectrum w_{i,k,q} of the sound source signal of each sound source i other than any sound source i whose power spectrum w_{i,k,q} at each frequency ω_k and in each state q is known or has been estimated in advance.

A parameter estimation method in a signal analysis device including a mixed sound time frequency expansion unit, a parameter update unit, and a convergence determination unit, wherein
the mixed sound time frequency expansion unit receives, as input, time-series data of a multi-channel observation signal in which sound source signals from I sound sources are mixed, and outputs a multi-channel observation time frequency component representing the observation time frequency component y(ω_k, t_l) of each frequency ω_k at each time t_l,
the parameter update unit, so as to maximize an objective function representing the conditional probability, given the multi-channel observation time frequency component, of a base power spectrum representing the power spectrum w_{i,k,q} at each frequency ω_k and in each state q of each sound source i, an activation parameter representing the power h_{i,l} at each time t_l of each sound source i, a time series of a spatial correlation matrix representing the spatial correlation matrix C_{i,k,n} at each time t_n and each frequency ω_k of each sound source i, and a state series representing the sound source state z_{i,l} at each time t_l of each sound source i, performs each of an update of the base power spectrum, an update of the activation parameter, an update of the time series of the spatial correlation matrix, and an update of the state series, based on the multi-channel observation time frequency component output by the mixed sound time frequency expansion unit, the base power spectrum, the activation parameter, the time series of the spatial correlation matrix, and the state series, and
the convergence determination unit causes the update by the parameter update unit to be performed repeatedly until a predetermined convergence condition is satisfied.

A program for causing a computer to function as each unit constituting the signal analysis device according to any one of claims 1 to 5.
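For reference, the two inequalities named in claim 4 can be written in a generic form as below. This is an illustrative sketch rather than the exact auxiliary function of the specification: the symbols S, X_j, A, R_j, U, and the dimension M are assumptions introduced here, and the generic index j stands in for the index combinations over which the claimed auxiliary variables R_{i,k,l,n} and U_{k,l} are defined.

```latex
% Tangent inequality (log det is concave), for M x M positive definite S and U:
\log\det S \;\le\; \log\det U + \operatorname{tr}\!\bigl(U^{-1} S\bigr) - M,
\qquad \text{with equality at } U = S.

% Jensen-type inequality on the trace of an inverse, for S = \sum_j X_j,
% X_j \succ 0, A \succeq 0, and any matrices R_j with \sum_j R_j = I:
\operatorname{tr}\!\bigl(A\,S^{-1}\bigr) \;\le\;
\sum_j \operatorname{tr}\!\bigl(A\,R_j^{\mathsf H} X_j^{-1} R_j\bigr),
\qquad \text{with equality at } R_j = X_j S^{-1}.
```

For a zero-mean complex Gaussian observation y with covariance S, the negative log-likelihood is log det S + y^H S^{-1} y up to a constant. Applying the first inequality to the log-determinant term and the second (with A = y y^H) to the quadratic term gives an upper bound on the negative log-likelihood, that is, a lower bound on the log-likelihood that is tight at the stated values of U and R_j.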