JP6142402B2

JP6142402B2 - Acoustic signal analyzing apparatus, method, and program

Info

Publication number: JP6142402B2
Application number: JP2013181701A
Authority: JP
Inventors: 弘和亀岡; 浩気池澤; 伸克北条; ミケルエスピ; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2013-09-02
Filing date: 2013-09-02
Publication date: 2017-06-07
Anticipated expiration: 2033-09-02
Also published as: JP2015049406A

Description

本発明は、音響信号解析装置、方法、及びプログラムに係り、特に、音響信号の時系列データから、原音声信号を強調するための音響信号解析装置、方法、及びプログラムに関する。 The present invention relates to an acoustic signal analysis device, method, and program, and more particularly, to an acoustic signal analysis device, method, and program for enhancing an original voice signal from time-series data of the acoustic signal.

音声強調技術は、例えば、ハンズフリー通話システムにおいて通信音声の了解性を向上させたり、自動会議録作成システムにおいて音声認識率を向上させたりするための重要な要素技術である。 The voice enhancement technique is an important element technique for improving the intelligibility of communication voice in a hands-free call system and improving the voice recognition rate in an automatic conference record creation system, for example.

雑音や残響環境の情報は事前に知り得ないことが多いため、観測信号のみから音声信号を獲得する問題を扱う必要があるが、音声信号に残響と雑音が重畳されて観測信号が得られるプロセスを順問題と捉えると、観測信号から音声信号を得る問題は逆問題と見なすことができる。 Since information on noise and reverberation environment is often not known in advance, it is necessary to deal with the problem of acquiring a speech signal from only the observation signal, but the process by which reverberation and noise are superimposed on the speech signal to obtain the observation signal Can be regarded as an inverse problem.

雑音や室内伝達系の情報が未知の場合、この逆問題には解が無数に存在しうるため、解を絞り込むための何らかの仮定が必要である。音声強調に関する従来アプローチでは、雑音に定常性、室内伝達系に時不変性などの制約を仮定し、その仮定の下で当該逆問題が定式化されるものが多かった。 If the information on the noise and the indoor transmission system is unknown, there can be countless solutions to this inverse problem, so some assumption is necessary to narrow down the solution. In conventional approaches related to speech enhancement, there are many cases where the inverse problem is formulated under the assumptions of constraints such as stationarity for noise and time invariance for indoor transmission systems.

例えば、古典的な雑音除去手法であるスペクトラルサブトラクション法では、雑音の定常性が仮定され、既知の非音声区間から得られる雑音のパワー（または振幅）スペクトルを観測信号のパワー（または振幅）スペクトルから減算することで音声信号のパワー（または振幅）スペクトルを推定する。また、逆フィルタリングに基づく残響除去法では、時不変かつ最小位相な室内伝達系が仮定され、音声に関する仮定やモデル（調波性、自己回帰モデル、自己相関関数コードブック等）を手がかりに最適な残響除去フィルタを推定する。 For example, in the spectral subtraction method, which is a classical noise removal technique, noise continuity is assumed, and the noise power (or amplitude) spectrum obtained from a known non-speech interval is derived from the power (or amplitude) spectrum of the observed signal. The power (or amplitude) spectrum of the audio signal is estimated by subtraction. In addition, the dereverberation method based on inverse filtering assumes a time-invariant and minimum-phase indoor transmission system, and is best suited to clues to assumptions and models related to speech (harmonics, autoregressive models, autocorrelation function codebooks, etc.) Estimate the dereverberation filter.

こうした従来手法は、仮定が成立する状況においては高い性能を発揮する一方で、実環境においては、雑音が例えば携帯電話の着信音及び背景音楽などのように非定常なものである場合や、僅かでも音源移動がある場合等では、雑音や室内伝達系に対する上述の仮定が成立しない場合があり、このような環境の下では高い性能を発揮できなくなるという問題があった。 While these conventional methods exhibit high performance in the situation where the assumption is established, in the real environment, the noise may be non-stationary, such as a ringtone of a mobile phone and background music, or slightly However, when there is a movement of the sound source, the above assumptions about noise and the indoor transmission system may not be satisfied, and there is a problem that high performance cannot be exhibited under such an environment.

こうした室内伝達系の変化に対する頑健な残響除去の手法として、音声スペクトログラムと室内インパルス応答のスペクトログラムの時間方向の畳み込みモデルに基づき、音声スペクトログラムのスパース性を手がかりに音声スペクトログラムを推定する方法が提案されている（非特許文献１）。 As a robust dereverberation method against changes in the indoor transmission system, a method has been proposed that estimates the speech spectrogram based on the temporal convolution model of the speech spectrogram and the room impulse response spectrogram based on the sparsity of the speech spectrogram. (Non-Patent Document 1).

しかし、この手法では残響のような乗法的な雑音の混入プロセスしか想定されておらず、加法的な雑音の混入プロセスは陽に想定されていない（雑音が存在しない状況を仮定している）ため、観測信号に雑音が含まれる場合には高い性能を発揮できないという問題があった。 However, in this method, only a multiplicative noise mixing process such as reverberation is assumed, and an additive noise mixing process is not assumed explicitly (assuming a situation where no noise exists). When the observation signal contains noise, there is a problem that high performance cannot be exhibited.

一方、非定常な雑音を抑圧するための手法として、スペクトログラムを非負値行列と見立て、これを二つの非負値行列に分解する非負値行列因子分解アプローチが近年注目されている（非特許文献２）。非負値行列因子分解を用いれば、観測スペクトログラムを構成する基底スペクトルと各時刻におけるそれらの結合係数を同時に得ることができる。これにより、例えばクリーンな音声サンプルからあらかじめ基底スペクトルを獲得（学習）しておき、雑音つき音声の観測スペクトログラムを基底行列と係数行列に分解する際、基底スペクトルセットの一部をあらかじめ学習した基底スペクトルのセットに固定しておくことで、音声とそれ以外（すなわち雑音）のスペクトログラムに分解する。 On the other hand, as a technique for suppressing non-stationary noise, a non-negative matrix factorization approach that considers a spectrogram as a non-negative matrix and decomposes it into two non-negative matrices has recently attracted attention (Non-Patent Document 2). . If non-negative matrix factorization is used, the base spectrum constituting the observation spectrogram and their coupling coefficients at each time can be obtained simultaneously. As a result, for example, when a base spectrum is acquired (learned) in advance from a clean speech sample, and the observation spectrogram of a noisy speech is decomposed into a base matrix and a coefficient matrix, a base spectrum obtained by learning a part of the base spectrum set in advance. It is decomposed into a spectrogram of speech and other (ie noise).

H. Kameoka, T. Nakatani, T. Yoshioka, “Robust Speech Dereverberation Based on Non-negativity and Sparse Nature of Speech Spectrograms,” in Proc. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2009), pp. 45−48, 2009.H. Kameoka, T. Nakatani, T. Yoshioka, “Robust Speech Dereverberation Based on Non-negativity and Sparse Nature of Speech Spectrograms,” in Proc. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2009), pp. 45 −48, 2009. P. Smaragdis, B. Raj and M. V. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Proc. ICA2007, pp. 414−421, 2007.P. Smaragdis, B. Raj and M. V. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixture,” in Proc. ICA2007, pp. 414-421, 2007.

しかし、上記の非特許文献２の手法では加法的な雑音の混入プロセスしか想定されておらず、残響のような乗法的な雑音の混入プロセスは陽に想定されていない（残響がない状況を仮定している）ため、観測信号に残響が含まれる場合には高い性能を発揮できないという問題があった。 However, in the above-mentioned method of Non-Patent Document 2, only an additive noise mixing process is assumed, and a multiplicative noise mixing process such as reverberation is not assumed explicitly (assuming a situation without reverberation). Therefore, there is a problem that high performance cannot be exhibited when reverberation is included in the observed signal.

本発明は、上記の事情を鑑みてなされたもので、原音声信号に雑音及び残響成分が重畳した音響信号の時系列データから原音声信号を精度よく分離して、原音声信号を強調させることができる音響信号解析装置、方法、及びプログラムを提供することを目的とする。 The present invention has been made in view of the above circumstances, and emphasizes an original audio signal by accurately separating the original audio signal from time-series data of an acoustic signal in which noise and a reverberation component are superimposed on the original audio signal. An object of the present invention is to provide an acoustic signal analyzing apparatus, method, and program capable of performing the above.

上記の目的を達成するために本発明に係る音響信号解析装置は、音響信号の時系列データを入力として、各時刻tにおける各周波数kの観測時間周波数成分ｙ_k[t]を出力する時間周波数解析部と、前記音響信号に含まれる原音声信号sの各時刻t及び各周波数kのパワースペクトル密度Φ^(s) _k[t]を、Ｍ個の基底スペクトルｍを表す非負値の各要素Ｂ^(s) _m,kからなる基底行列と、各時刻tにおける基底スペクトルＢ^(s) _m,kのゲインＧ^(s) _m[t]を非負値である各要素とする係数行列との積として表した場合の前記係数行列の各要素Ｇ^(s) _m[t]、各時刻τの室内インパルス応答の各周波数kのパワーＨ_k[τ]、前記音響信号に含まれる雑音信号nの各時刻t及び各周波数kのパワースペクトル密度Φ⁽ⁿ⁾ _k[t]を、Ｑ個の基底スペクトルqを表す非負値の各要素Ｂ⁽ⁿ⁾ _q,kからなる基底行列と、各時刻tにおける基底スペクトルＢ⁽ⁿ⁾ _q,kのゲインＧ⁽ⁿ⁾ _q[t]を非負値である各要素とする係数行列との積として表した場合の前記基底行列の各要素Ｂ⁽ⁿ⁾ _q,k、前記係数行列の各要素Ｇ⁽ⁿ⁾ _q[t]の各々の初期値を設定すると共に、前記基底行列の各要素Ｂ^(s) _m,kに、予め求められた値を設定するパラメータ初期値設定部と、（τ、m）の全ての組み合わせにおける、前記基底行列の要素Ｂ^(s) _m,k、前記係数行列の要素Ｇ^(s) _m[t]、及び前記パワーＨ_k[τ]と、前記パワースペクトル密度Φ⁽ⁿ⁾ _k[t]とに基づいて算出されるパワースペクトル密度系列モデルΦ^x _k[t]と、前記観測時間周波数成分ｙ_k[t]との距離が小さくなるように、各時刻t及び各周波数kの前記観測時間周波数成分ｙ_k[t]と、前記基底行列の各要素Ｂ^(s) _m,kと、前記係数行列の各要素Ｇ^(s) _m[t]と、前記基底行列の各要素Ｂ⁽ⁿ⁾ _q,kと、前記係数行列の各要素Ｇ⁽ⁿ⁾ _q[t]と、各時刻τの室内インパルス応答の各周波数kのパワーＨ_k[τ]と、各時刻t及び各周波数kのパワースペクトル密度系列モデルΦ^x _k[t]とに基づいて、前記係数行列の各要素Ｇ^(s) _m[t]と、前記基底行列の各要素Ｂ⁽ⁿ⁾ _q,kと、前記係数行列の各要素Ｇ⁽ⁿ⁾ _q[t]と、各時刻τの室内インパルス応答の各周波数kのパワーＨ_k[τ]とを更新するパラメータ更新部と、予め定められた終了条件を満たすまで、前記パラメータ更新部による更新を繰り返し行う終了判定部と、を含んで構成されている。 In order to achieve the above object, the acoustic signal analyzing apparatus according to the present invention receives time series data of an acoustic signal as an input and outputs an observation time frequency component y _k [t] of each frequency k at each time t. The analysis unit and the power spectrum density Φ ^(s) _k [t] at each time t and each frequency k of the original speech signal s included in the acoustic signal are each non-negative element B representing M base spectra m. ^(s) _m, and the base matrix of _k, as the product of the basis spectra B ^(s) _{m, k} of the gain G ^(s) _m [t] a coefficient matrix for each element is a non-negative value at each time t Each element G ^(s) _m [t] of the coefficient matrix when expressed, power H _k [τ] of each frequency k of the indoor impulse response at each time τ, and each time of the noise signal n included in the acoustic signal The power spectral density Φ ⁽ⁿ⁾ _k [t] of t and each frequency k is determined from non-negative elements B ⁽ⁿ⁾ _{q, k} representing the Q base spectra q. And a base matrix B ⁽ⁿ⁾ _{q, k} at each time t is expressed as a product of a coefficient matrix having each non-negative gain G ⁽ⁿ⁾ _q [t] Each element B ⁽ⁿ⁾ _{q, k} , and each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix are set to initial values, and each element B ^(s) _{m, k} of the base matrix is set to A parameter initial value setting unit for setting values obtained in advance and elements B ^(s) _{m, k of the} base matrix and elements G ^(s) _m [of the coefficient matrix in all combinations of (τ, m) t], the power H _k [τ], and the power spectral density series model Φ ^x _k [t] calculated based on the power spectral density Φ ⁽ⁿ⁾ _k [t], and the observation time frequency component the observation time frequency component y _k [t] of each time t and each frequency k, each element B ^(s) _{m, k of the} basis matrix, so that the distance to y _k [t] becomes small, Coefficient matrix An element ^{_{G (s) m [t]}} , the elements B ⁽ⁿ⁾ _q of the basis _matrix, and _k, and the elements G ⁽ⁿ⁾ _q [t] of the coefficient matrix, the room impulse response for each time τ Based on the power H _k [τ] at each frequency k and the power spectral density series model Φ ^x _k [t] at each time t and each frequency k, each element G ^(s) _m [t] of the coefficient matrix And each element B ⁽ⁿ⁾ _{q, k of the} basis matrix, each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, and power H _k [τ of each frequency k of the room impulse response at each time τ And a termination determination unit that repeatedly performs updating by the parameter updating unit until a predetermined termination condition is satisfied.

本発明に係る音響信号解析方法は、時間周波数解析部によって、音響信号の時系列データを入力として、各時刻tにおける各周波数kの観測時間周波数成分ｙ_k[t]を出力し、パラメータ初期値設定部によって、前記音響信号に含まれる原音声信号sの各時刻t及び各周波数kのパワースペクトル密度Φ^(s) _k[t]を、Ｍ個の基底スペクトルｍを表す非負値の各要素Ｂ^(s) _m,kからなる基底行列と、各時刻tにおける基底スペクトルＢ^(s) _m,kのゲインＧ^(s) _m[t]を非負値である各要素とする係数行列との積として表した場合の前記係数行列の各要素Ｇ^(s) _m[t]、各時刻τの室内インパルス応答の各周波数kのパワーＨ_k[τ]、前記音響信号に含まれる雑音信号nの各時刻t及び各周波数kのパワースペクトル密度Φ⁽ⁿ⁾ _k[t]を、Ｑ個の基底スペクトルqを表す非負値の各要素Ｂ⁽ⁿ⁾ _q,kからなる基底行列と、各時刻tにおける基底スペクトルＢ⁽ⁿ⁾ _q,kのゲインＧ⁽ⁿ⁾ _q[t]を非負値である各要素とする係数行列との積として表した場合の前記基底行列の各要素Ｂ⁽ⁿ⁾ _q,k、前記係数行列の各要素Ｇ⁽ⁿ⁾ _q[t]の各々の初期値を設定すると共に、前記基底行列の各要素Ｂ^(s) _m,kに、予め求められた値を設定し、パラメータ更新部によって、（τ、m）の全ての組み合わせにおける、前記基底行列の要素Ｂ^(s) _m,k、前記係数行列の要素Ｇ^(s) _m[t]、及び前記パワーＨ_k[τ]と、前記パワースペクトル密度Φ⁽ⁿ⁾ _k[t]とに基づいて算出されるパワースペクトル密度系列モデルΦ^x _k[t]と、前記観測時間周波数成分ｙ_k[t]との距離が小さくなるように、各時刻t及び各周波数kの前記観測時間周波数成分ｙ_k[t]と、前記基底行列の各要素Ｂ^(s) _m,kと、前記係数行列の各要素Ｇ^(s) _m[t]と、前記基底行列の各要素Ｂ⁽ⁿ⁾ _q,kと、前記係数行列の各要素Ｇ⁽ⁿ⁾ _q[t]と、各時刻τの室内インパルス応答の各周波数kのパワーＨ_k[τ]と、各時刻t及び各周波数kのパワースペクトル密度系列モデルΦ^x _k[t]とに基づいて、前記係数行列の各要素Ｇ^(s) _m[t]と、前記基底行列の各要素Ｂ⁽ⁿ⁾ _q,kと、前記係数行列の各要素Ｇ⁽ⁿ⁾ _q[t]と、各時刻τの室内インパルス応答の各周波数kのパワーＨ_k[τ]とを更新し、終了判定部によって、予め定められた終了条件を満たすまで、前記パラメータ更新部による更新を繰り返し行う。 In the acoustic signal analysis method according to the present invention, the time-frequency analysis unit inputs the time-series data of the acoustic signal, outputs the observation time frequency component y _k [t] of each frequency k at each time t, and sets the parameter initial value The setting unit converts the power spectrum density Φ ^(s) _k [t] at each time t and each frequency k of the original audio signal s included in the acoustic signal into each non-negative element B representing M base spectra m. ^(s) _m, and the base matrix of _k, as the product of the basis spectra B ^(s) _{m, k} of the gain G ^(s) _m [t] a coefficient matrix for each element is a non-negative value at each time t Each element G ^(s) _m [t] of the coefficient matrix when expressed, power H _k [τ] of each frequency k of the indoor impulse response at each time τ, and each time of the noise signal n included in the acoustic signal power spectral density Φ of t and each frequency k a ⁽ⁿ⁾ _k [t], the elements of the non-negative value representing the number Q of the base spectrum q ⁽ⁿ⁾ _q, and the base matrix of _k, as the product of the basis spectra B ⁽ⁿ⁾ _{q, k} of the gain G ⁽ⁿ⁾ _q [t] a coefficient matrix for each element is a non-negative value at each time t The initial value of each element B ⁽ⁿ⁾ _{q, k of} the base matrix and the element G ⁽ⁿ⁾ _q [t] of the coefficient matrix is set, and each element B ^{( s)} A value obtained in advance is set in _{m, k} , and the parameter updating unit performs element B ^(s) _{m, k of the} base matrix and elements of the coefficient matrix in all combinations of (τ, m). ^{_{G (s) m [t]}} , and with the power H _k [τ], the power spectral density Φ ⁽ⁿ⁾ _k [t] power spectral density series model is calculated based on the Φ ^x _k [t] and such that said distance between the observation time-frequency component y _k [t] is small, and the observation time-frequency component y _k [t] at each time t and each frequency k, each element of the basis matrix B ^(s) _m, and _k, before Each element G of the coefficient matrix ^(s) _m [t], wherein the elements B ⁽ⁿ⁾ _q basis _matrix, and _k, each element G of the coefficient matrix ⁽ⁿ⁾ _q [t], at each time τ Based on the power H _k [τ] of each frequency k of the indoor impulse response and the power spectral density series model Φ ^x _k [t] of each time t and each frequency k, each element G ^{(s) of the} coefficient matrix _m [t], each element B ⁽ⁿ⁾ _{q, k of the} basis matrix, each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, and power of each frequency k of the indoor impulse response at each time τ H _k [τ] is updated, and the updating by the parameter updating unit is repeated until the end determination unit satisfies a predetermined end condition.

本発明に係るプログラムは、上記の音響信号解析装置の各部としてコンピュータを機能させるためのプログラムである。 The program according to the present invention is a program for causing a computer to function as each part of the acoustic signal analysis apparatus.

以上説明したように、本発明の音響信号解析装置、方法、及びプログラムによれば、原音声信号sのパワースペクトル密度Φ^(s) _k[t]の基底行列及び係数行列と、室内インパルス応答のパワーＨ_k[τ]と、雑音信号nのパワースペクトル密度Φ⁽ⁿ⁾ _k[t]とに基づいて算出されるパワースペクトル密度系列モデルΦ^x _k[t]と、観測時間周波数成分ｙ_k[t]との距離が小さくなるように、パワースペクトル密度Φ^(s) _k[t]の係数行列と、パワースペクトル密度Φ⁽ⁿ⁾ _k[t]の基底行列及び係数行列と、室内インパルス応答のパワーＨ_k[τ]を更新することを繰り返すことにより、原音声信号に雑音及び残響成分が重畳した音響信号の時系列データから原音声信号を精度よく分離して、原音声信号を強調させることができる、という効果が得られる。 As described above, according to the acoustic signal analysis apparatus, method, and program of the present invention, the base matrix and coefficient matrix of the power spectral density Φ ^(s) _k [t] of the original audio signal s, and the room impulse response The power spectral density series model Φ ^x _k [t] calculated based on the power H _k [τ] and the power spectral density Φ ⁽ⁿ⁾ _k [t] of the noise signal n, and the observation time frequency component y _k [ as the distance between t] is small, the coefficient matrix of the power spectral density ^{_{Φ (s) k [t]}} , and the base matrix and the coefficient matrix of the power spectral density ^{_{Φ (n) k [t]}} , the room impulse response By repeatedly updating the power H _k [τ], the original voice signal is accurately separated from the time series data of the acoustic signal in which noise and reverberation components are superimposed on the original voice signal, and the original voice signal is emphasized. The effect of being able to be obtained.

マイクロホンの位置を変えて測定した室内インパルス応答の短時間フーリエ変換の振幅成分及び位相成分を示す図である。It is a figure which shows the amplitude component and phase component of short-time Fourier-transform of the indoor impulse response measured by changing the position of a microphone. 本発明の実施の形態に係る音響信号解析装置の構成を示す概略図である。It is the schematic which shows the structure of the acoustic signal analyzer which concerns on embodiment of this invention. 本発明の実施の形態に係る音響信号解析装置における音響信号解析処理ルーチンの内容を示すフローチャートである。It is a flowchart which shows the content of the acoustic signal analysis process routine in the acoustic signal analyzer which concerns on embodiment of this invention. （Ａ）メル周波数ケプストラム係数歪みの評価結果を示すグラフ、（Ｂ）バークスペクトル歪みスコアの評価結果を示すグラフ、及び（Ｃ）信号対干渉比の評価結果を示すグラフである。(A) A graph showing the evaluation result of the Mel frequency cepstrum coefficient distortion, (B) a graph showing the evaluation result of the Bark spectrum distortion score, and (C) a graph showing the evaluation result of the signal-to-interference ratio.

以下、図面を参照して本発明の実施の形態を詳細に説明する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

＜発明の原理＞ <Principle of the invention>

まず、本発明の原理について説明する。 First, the principle of the present invention will be described.

＜音声強調問題の定式化＞ <Formulation of speech enhancement problem>

室内伝達系が時不変の場合、原音声信号をｓ[t]、室内インパルス応答をｈ[t]、雑音信号をｎ[t] とすると、音響信号としての観測信号ｙ[t]は時間領域において、以下の（１）式のように表される。 When the indoor transmission system is time-invariant, if the original audio signal is s [t], the indoor impulse response is h [t], and the noise signal is n [t], the observation signal y [t] as an acoustic signal is the time domain Is expressed as the following equation (1).

ただし、tは離散時刻のインデックスである。右辺第1項Σ_τｓ[t-τ]ｈ[τ]は残響音声信号を表す。ここで、観測信号、原信号、室内インパルス応答、雑音信号の短時間フーリエ変換をそれぞれｙ_k[t]、ｓ_k[t]、ｈ_k[t]、ｎ_k[t]とする。ただし、k=1,...,Kは周波数のインデックス、t=1,...,Tはフレーム（時刻）のインデックスを表す。時間領域における信号同士の畳み込みは、特定の条件の下でそれぞれの信号の各周波数における狭帯域信号同士の畳み込みで近似できることが知られている。もしこの近似が成立するような短時間フーリエ変換の分析条件を選べば、ｙ_k[t]は（２）式のように表される。 Where t is an index of discrete time. The first term on the right side _Στ s [t−τ] h [τ] represents a reverberant speech signal. Here, y _k [t], s _k [t], h _k [t], and n _k [t] are the short-time Fourier transforms of the observed signal, the original signal, the room impulse response, and the noise signal, respectively. Here, k = 1,..., K represents a frequency index, and t = 1,..., T represents a frame (time) index. It is known that convolution of signals in the time domain can be approximated by convolution of narrowband signals at each frequency of each signal under specific conditions. If the short-time Fourier transform analysis conditions are selected such that this approximation holds, y _k [t] can be expressed as in equation (2).

以上では時不変な系を考えたが、実際には音源が移動することがあるため、室内伝達系の時不変性の仮定は必ずしも成り立たない。ここで、時変な室内伝達系を仮定することは、（２）式において室内インパルス応答ｈ_k[τ]が時刻tごとに変化しうることを許容することに相当する。すなわち、以下の（３）式のように表される。 In the above, a time-invariant system is considered, but since the sound source may actually move, the time-invariant assumption of the indoor transmission system does not necessarily hold. Here, assuming a time-varying indoor transmission system is equivalent to allowing the indoor impulse response h _k [τ] to change at each time t in the equation (2). That is, it is expressed as the following equation (3).

非定常雑音、時変残響環境下における音声強調問題は、（３）式をもとに観測時間周波数成分Ｙ={ｙ_k[t]}_{1≦k≦K,1≦t≦T}からＳ={ｓ_k[t]}_{1≦k≦K,1≦t≦T}を推定する問題として定式化される。（３）式では室内インパルス応答が時刻ごとに変化することを許容しているため、（２）式に比べて未知パラメータが圧倒的に多くなっており、同じ右辺を与えるｓとｈとｎの組み合わせは無数に存在する。そこで以下では、その無数の組み合わせの中から正しい解を得るためにどのような仮定を置くべきか、について説明する。 The speech enhancement problem under non-stationary noise and time-varying reverberation environment is based on the following equation (3): observed time frequency component Y = {y _k [t]} _{1 ≦ k ≦ K, 1 ≦ t ≦ T} to S = {s _k [t]} Formulated as a problem of estimating _{1 ≦ k ≦ K and 1 ≦ t ≦ T.} In equation (3), the indoor impulse response is allowed to change with time, so the number of unknown parameters is overwhelmingly larger than in equation (2), and s, h, and n give the same right side. There are countless combinations. Therefore, in the following, we will explain what assumptions should be made in order to obtain a correct solution from the myriad combinations.

＜室内伝達系の「半時変性」＞ <"Half-time degeneration" of indoor transmission system>

室内インパルス応答が時間的に変化する主要因の１つは話者の移動である。なぜなら、話者の移動に伴い、話者からマイクロホンまでの（壁からの反射音の到来経路を含む）あらゆる音響経路の距離が一斉に変化するからである。 One of the main factors that cause the room impulse response to change with time is the movement of the speaker. This is because as the speaker moves, the distances of all acoustic paths from the speaker to the microphone (including the arrival path of the reflected sound from the wall) change all at once.

ここで、単一の音響経路のみに着目すると、伝達系のインパルス応答はデルタ関数となるが、経路の距離に変化があった場合、当該インパルス応答のフーリエ変換の位相成分には音響経路の距離変化に伴う到来時間の変化、振幅成分には音響経路の距離変化に伴う強度の変化がそれぞれ反映されることになる。音の強度は伝播距離に対して反比例的であるため、伝播距離がある程度大きい場合には伝播距離に変化があっても振幅成分はほとんど変化しないことになる。 Here, focusing only on a single acoustic path, the impulse response of the transmission system is a delta function, but if there is a change in the distance of the path, the phase component of the Fourier transform of the impulse response is the distance of the acoustic path. The change in arrival time accompanying the change and the amplitude component reflect the change in intensity accompanying the change in the distance of the acoustic path. Since the intensity of sound is inversely proportional to the propagation distance, if the propagation distance is large to some extent, the amplitude component hardly changes even if the propagation distance changes.

一方、位相成分については、到来時間と周波数に比例した量だけ変化するため、到来時間の変化がたとえわずかであっても、特に高周波域においては著しく変化しうる。 On the other hand, since the phase component changes by an amount proportional to the arrival time and the frequency, even if the change of the arrival time is slight, it can change remarkably especially in the high frequency range.

よって、単一の音響経路の伝達系のインパルス応答の短時間フーリエ変換には、話者の移動に伴って振幅は変化しにくく位相は変化しやすい、という性質があることが予想される。 Therefore, it is expected that the short-time Fourier transform of the impulse response of the transmission system of a single acoustic path has the property that the amplitude does not easily change and the phase easily changes as the speaker moves.

実際の室内インパルス応答はあらゆる音響経路の伝達系のインパルス応答の重ね合わせとなるので、一つ一つのインパルス応答が上述のような傾向があるならば、その重ね合わせもおおよそ同様な傾向があるのではないかと考えられる。 Since the actual indoor impulse response is a superposition of the impulse responses of the transmission systems of all acoustic paths, if each impulse response tends to be as described above, the superposition tends to be roughly the same. It is thought that.

図１は、話者の代わりにマイクロホンの位置を数センチ変えて測定した室内インパルス応答の短時間フーリエ変換の振幅成分と位相成分を比較したものである。 FIG. 1 compares the amplitude component and the phase component of the short-time Fourier transform of the room impulse response measured by changing the position of the microphone by several centimeters instead of the speaker.

上記図１の上段は、室内インパルス応答の短時間フーリエ変換の振幅成分（Ａ）及び位相成分（Ｂ）を示したものであり、上記図１の下段は、マイクロホンの位置を数センチ変えて測定した室内インパルス応答の短時間フーリエ変換の振幅成分(Ｃ)と位相成分(Ｄ)を示したものである。なお、室内インパルス応答のデータはRWCP実環境音声・音響データベースのマイクロホンアレーデータベースに収録されている、残響時間0.3[s]の残響可変室（パネル）で測定されたものである。 The upper part of FIG. 1 shows the amplitude component (A) and phase component (B) of the short-time Fourier transform of the room impulse response, and the lower part of FIG. 1 shows the measurement by changing the position of the microphone by several centimeters. The amplitude component (C) and phase component (D) of the short-time Fourier transform of the indoor impulse response are shown. The room impulse response data was measured in a reverberation variable room (panel) with a reverberation time of 0.3 [s] recorded in the microphone array database of the RWCP real-world voice and sound database.

上記図１から、振幅成分はさほど変化していないのに対し位相成分は著しく変化していることが分かる。 From FIG. 1, it can be seen that the amplitude component does not change so much, while the phase component changes significantly.

以上より、発明者らは、話者の移動等、環境の軽微な変化については、室内インパルス応答の短時間フーリエ変換の振幅成分を時不変、位相成分を時変と仮定した特殊な系により扱えるのではないかと考えた。以後、このような系を「半時変系」と呼ぶこととする。 From the above, the inventors can handle minor changes in the environment, such as speaker movement, using a special system that assumes that the amplitude component of the short-time Fourier transform of the room impulse response is time-invariant and the phase component is time-variant. I thought that. Hereinafter, such a system is referred to as a “half-time varying system”.

＜観測信号の生成モデル＞ <Observation signal generation model>

以上の２つの仮定をベースに、観測信号の生成モデルを定式化する。（３）式に半時変性の仮定を導入すると、観測信号の短時間フーリエ変換は（４）式のように表される。 Based on the above two assumptions, an observation signal generation model is formulated. When the assumption of half-time variation is introduced into the expression (3), the short-time Fourier transform of the observation signal is expressed as the expression (4).

ただし、φ_k,t[τ]とＨ_k[t]は室内インパルス応答の短時間フーリエ変換の位相成分と振幅成分であり、Ｈ_k[t]=|ｈ_k,t[τ]|である。ここで、φ_k,t[τ]をk,n,mごとに独立に区間[0;2π)上の一様分布に従う確率変数とする。さらに、原音声信号、雑音については、それぞれ平均が０の複素正規分布（５）式及び（６）式に従う確率変数とする。 Where φ _{k, t} [τ] and H _k [t] are the phase component and amplitude component of the short-time Fourier transform of the room impulse response, and H _k [t] = | h _{k, t} [τ] | . Here, φ _{k, t} [τ] is a random variable that follows a uniform distribution on the interval [0; 2π) independently for each k, n, and m. Further, the original speech signal and noise are assumed to be random variables according to the complex normal distributions (5) and (6), each having an average of 0.

ただし、Φ^(s) _k[t]、Φ⁽ⁿ⁾ _k[t]はそれぞれ原音声信号および雑音信号の時刻t、k番目の周波数ビンにおけるパワースペクトル密度である。ここで、Ｎ_C(z;0,λ)は複素正規分布の確率密度関数を表し、（７）式によって与えられる。 Here, Φ ^(s) _k [t] and Φ ⁽ⁿ⁾ _k [t] are the power spectral density at the time t and the kth frequency bin of the original speech signal and the noise signal, respectively. Here, N _C (z; 0, λ) represents a probability density function of a complex normal distribution, and is given by equation (7).

証明は省略するが、原音声信号と雑音のガウス性の仮定と、正規分布の再生性より、観測信号の短時間フーリエ変換ｙ_k[t]はk、tごとに独立に、（８）式に示す複素正規分布に従うことが示される。 Although the proof is omitted, the short-time Fourier transform y _k [t] of the observed signal is independent for each k and t from the assumption of the Gaussian nature of the original speech signal and noise and the reproducibility of the normal distribution. It is shown to follow the complex normal distribution shown in.

次に、原音声信号のパワースペクトル密度系列Φ^(s) _k[t]に対し、非負値行列積表現を導入する。基底スペクトルをＭ個とし、m番目の基底スペクトルをＢ^(s) _m,kとする。また、時刻tにおける基底スペクトルＢ^(s) _m,kのゲインをＧ^(s) _m[t]とすると、原音声信号のパワースペクトル密度系列は（９）式及び（１０）式で示される。 Next, a non-negative matrix product representation is introduced for the power spectral density sequence Φ ^(s) _k [t] of the original speech signal. Assume that the number of base spectra is M, and the m-th base spectrum is B ^(s) _{m, k} . Also, ^assuming that the gain of the base spectrum B ^(s) _{m, k} at time t is G ^(s) _m [t], the power spectrum density sequence of the original speech signal is expressed by equations (9) and (10).

なお、基底スペクトルとそのゲインは非負であることに注意する．また、スケールの任意性を除くため、（１１）式に示す条件を満たすものとする。 Note that the base spectrum and its gain are non-negative. Moreover, in order to remove the arbitraryness of the scale, the condition shown in the equation (11) is assumed to be satisfied.

音声の基底スペクトルＢ^(s) _m,kは未知変数として観測信号から他の未知パラメータとしてブラインドで推定することもありえるが、ある程度の量のクリーン音声サンプルからあらかじめ学習しておくことを想定する。その意味で、本実施の形態に係る手法（以下、提案法と称する）はセミブラインドな音声強調手法である。基底の事前学習は、クリーン音声データのスペクトログラムに対してＮＭＦを適用することで行うことができる。 The speech base spectrum B ^(s) _{m, k} may be estimated blindly as an unknown variable from the observed signal as an unknown variable, but it is assumed that it is learned in advance from a certain amount of clean speech samples. In that sense, the method according to the present embodiment (hereinafter referred to as the proposed method) is a semi-blind speech enhancement method. Prior learning of the basis can be performed by applying NMF to the spectrogram of clean speech data.

＜パラメータ推定アルゴリズム＞ <Parameter estimation algorithm>

以上で立てた観測時間周波数成分Ｙ={ｙ_k[t]}_{1≦k≦K,1≦t≦T}の確率密度関数は、未知パラメータの尤度関数に対応する。（８）式より、観測時間周波数成分Ｙが与えられたもとでの未知パラメータの対数尤度は、定数項・符号を無視すれば、観測スペクトログラム|ｙ_k[t]|²と（１２）式で示されるパワースペクトル密度系列モデルとの距離を表す板倉斎藤距離（１３）式と等しくなる。 The probability density function of the observation time frequency component Y = {y _k [t]} _{1 ≦ k ≦ K and 1 ≦ t ≦ T established above} corresponds to the likelihood function of the unknown parameter. From equation (8), the log likelihood of the unknown parameter with the observation time frequency component Y given is the observation spectrogram | y _k [t] | ² and the equation (12) if the constant term / sign is ignored. It becomes equal to the Itakura Saito distance (13) expression showing the distance with the power spectrum density series model shown.

従って、提案法は観測スペクトログラム|ｙ_k[t]|²とパワースペクトル密度系列モデルΦ^x _k[t]との最適フィッティング問題に帰着する。ここで、雑音信号のパワースペクトル密度系列Φ⁽ⁿ⁾ _k[t]に関しても、（１４）式のようにモデル化される。 Therefore, the proposed method results in an optimal fitting problem between the observed spectrogram | y _k [t] | ² and the power spectral density sequence model Φ ^x _k [t]. Here, the power spectrum density sequence Φ ⁽ⁿ⁾ _k [t] of the noise signal is also modeled as in equation (14).

なお、板倉斎藤距離Ｉ(Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾)は、距離尺度を表すβ-divergenceにおいてβを０にした場合に得られる距離に相当する。 The Itakura Saito distance I (H, G ^(s) , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ ) corresponds to a distance obtained when β is set to 0 in β-divergence representing a distance scale.

（１３）式で示される板倉斎藤距離Ｉ(Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾)は直接最小化することが困難であるため、板倉斎藤距離Ｉ(Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾)を下回らず１点で接する上限関数を設定する。導出は省略するが、（１５）式に示す上限関数が板倉斎藤距離Ｉ(Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾)の上限関数となる。 The Itakura Saito distance I (H, G ^(s) , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ ) represented by the equation (13) is difficult to directly minimize, so the Itakura Saito distance I (H, G ^{( s)} , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ ) Set an upper limit function that touches at one point without falling below. Although the derivation is omitted, the upper limit function shown in the equation (15) is the upper limit function of the Itakura Saito distance I (H, G ^(s) , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ ).

すなわち、板倉斎藤距離Ｉ(Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾)の代わりに（１５）式に示された上限関数を小さくするようにＨ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾および補助変数ζ,ξ,ηを交互に更新することで目的関数Ｉ(Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾)を小さくしていくことができる。この際、Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾の各パラメータは非負値を保ったまま更新するようにする。なお、（１５）式におけるΦ^(x) _k[t]は（１６）式によって示される。 That is, instead of the Itakura Saito distance I (H, G ^(s) , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ ), H, G ^(s) , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ and auxiliary variables ζ, ξ, η are updated alternately to reduce the objective function I (H, G ^(s) , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ ) Can do. At this time, the parameters of H, G ^(s) , B ⁽ⁿ⁾ , and G ⁽ⁿ⁾ are updated while maintaining non-negative values. Note that Φ ^(x) _k [t] in equation (15) is expressed by equation (16).

そして、この上限関数Ｄ₀の更新則は（１７）式〜（２３）式で示される。 The update rule of the upper bound function D ₀ is represented by (17) to (23) below.

＜システム構成＞
<System configuration>

次に、原音声信号に雑音及び残響成分が重畳された観測信号を解析して、観測信号に含まれる原音声信号を推定する音響信号解析装置に、本発明を適用した場合を例にして、本発明の実施の形態を説明する。 Next, by analyzing an observation signal in which noise and reverberation components are superimposed on the original speech signal, and estimating the original speech signal included in the observation signal, the case where the present invention is applied as an example, An embodiment of the present invention will be described.

図２に示すように、本発明の実施の形態に係る音響信号解析装置は、ＣＰＵと、ＲＡＭと、後述する音響信号解析処理ルーチンを実行するためのプログラムを記憶したＲＯＭとを備えたコンピュータで構成され、機能的には次に示すように構成されている。 As shown in FIG. 2, the acoustic signal analysis apparatus according to the embodiment of the present invention is a computer that includes a CPU, a RAM, and a ROM that stores a program for executing an acoustic signal analysis processing routine to be described later. It is configured and functionally configured as follows.

音響信号解析装置１００は、入力部１０と、演算部２０と、記憶部３０と、出力部４０とを備えている。 The acoustic signal analysis device 100 includes an input unit 10, a calculation unit 20, a storage unit 30, and an output unit 40.

入力部１０により、原音声信号に雑音及び残響成分が重畳された観測信号の時系列データが入力される。記憶部３０は、入力部１０により入力された観測信号の時系列データを記憶する。また、記憶部３０は、後述する各処理での結果を記憶すると共に、本処理ルーチンで用いる各パラメータの初期値を記憶している。 The input unit 10 inputs time series data of an observation signal in which noise and reverberation components are superimposed on the original voice signal. The storage unit 30 stores time-series data of observation signals input by the input unit 10. In addition, the storage unit 30 stores the result of each process to be described later, and stores the initial value of each parameter used in this process routine.

演算部２０は、時間周波数解析部２１と、初期設定部２２と、補助変数更新部２３と、パラメータ更新部２４と、終了判定部２５と、信号変換部２６とを備えている。 The calculation unit 20 includes a time-frequency analysis unit 21, an initial setting unit 22, an auxiliary variable update unit 23, a parameter update unit 24, an end determination unit 25, and a signal conversion unit 26.

時間周波数解析部２１は、例えばマイクロホンの時系列信号としての観測された観測信号ｙ[t]を入力として、観測信号ｙ[t]の観測時間周波数成分Ｙ={ｙ_k[t]}（周波数k＝１，・・・，K、時刻t＝１，・・・，T）を計算する。また、計算した観測時間周波数成分Ｙを、記憶部３０に記憶しておく。より詳細には、時間周波数解析部２１は、例えばマイクロホンで観測された観測信号の時系列データを入力として、短時間フーリエ変換（Short-Time Fourier Transform；ＳＴＦＴ）を用いて時間周波数解析を行うことにより観測時間周波数成分Ｙを計算する。 The time frequency analysis unit 21 receives, for example, an observed observation signal y [t] as a time series signal of a microphone, and receives an observation time frequency component Y = {y _k [t]} (frequency) of the observation signal y [t]. k = 1,..., K, time t = 1,. The calculated observation time frequency component Y is stored in the storage unit 30. More specifically, the time-frequency analysis unit 21 performs time-frequency analysis using, for example, short-time Fourier transform (STFT) using time-series data of observation signals observed with a microphone as input. The observation time frequency component Y is calculated by

初期設定部２２は、後述する処理で用いる各パラメータＧ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]の各初期値及びＢ^(s) _m,k の値を設定する。なお、Ｂ^(s) _m,k以外のパラメータの初期値は、例えば乱数を用いて適当な値に設定すればよい。この場合、Ｇ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]の各パラメータの初期値は非負値となるように設定する。 The initial setting unit 22 uses initial values of each parameter G ^(s) _m [t], H _k [τ], B ⁽ⁿ⁾ _{q, k} , G ⁽ⁿ⁾ _q [t] and B used in processing to be described later. ^(s) Set the _{m and k} values. The initial values of parameters other than B ^(s) _{m, k} may be set to appropriate values using, for example, random numbers. In this case, the initial values of the parameters G ^(s) _m [t], H _k [τ], B ⁽ⁿ⁾ _{q, k} , G ⁽ⁿ⁾ _q [t] are set to be non-negative values.

そして、Ｂ^(s) _m,kの値には、一定量のクリーン音声サンプルから予め学習して得た値を設定する。なお、スケールの任意性を排除するため、Ｂ^(s) _m,kの値は上記（１１）式を満たすように予め学習されているものとする。 A value obtained by learning in advance from a certain amount of clean speech samples is set as the value of B ^(s) _{m, k} . It is assumed that the value of B ^(s) _{m, k} has been learned in advance so as to satisfy the above equation (11) in order to eliminate scale arbitraryness.

補助変数更新部２３は、（k、m、t、τ）の全ての組み合わせの各々について、記憶部３０に記憶されているＢ^(s) _m,k、Ｇ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]に基づいて、上記（２１）式に従って上限関数Ｄ₀を小さくするように補助変数ζ_k,m,t,τを更新し、記憶部３０に格納する。 The auxiliary variable updating unit 23 stores B ^(s) _{m, k} , G ^(s) _m [t], H stored in the storage unit 30 for each of all combinations of (k, m, t, τ). _{Based on k} [τ], B ⁽ⁿ⁾ _{q, k} , G ⁽ⁿ⁾ _q [t], the auxiliary variable ζk _{, m, t, τ is set} so as to reduce the upper limit function D ₀ according to the above equation (21). Is updated and stored in the storage unit 30.

また、補助変数更新部２３は、（k、q、t）の全ての組み合わせの各々について、記憶部３０に記憶されているＢ^(s) _m,k、Ｇ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]に基づいて、上記（２２）式に従って上限関数Ｄ₀を小さくするように補助変数ξ_k,q,tを更新し、記憶部３０に格納する。 The auxiliary variable updating unit 23 also stores B ^(s) _{m, k} , G ^(s) _m [t], H stored in the storage unit 30 for each of all combinations of (k, q, t). _{Based on k} [τ], B ⁽ⁿ⁾ _{q, k} , G ⁽ⁿ⁾ _q [t], the auxiliary variable ξk _{, q, t} is updated so as to reduce the upper limit function D ₀ according to the above equation (22). And stored in the storage unit 30.

更に、補助変数更新部２３は、各時刻t及び各周波数kについて、記憶部３０に記憶されているＢ^(s) _m,k、Ｇ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]に基づいて、上記（１６）式、上記（２３）式に従って上限関数Ｄ₀を小さくするように補助変数η_k[t]を更新し、記憶部３０に格納する。 Further, the auxiliary variable update unit 23 stores B ^(s) _{m, k} , G ^(s) _m [t], H _k [τ], B stored in the storage unit 30 for each time t and each frequency k. ^{(n) Based on} _{q, k} and G ⁽ⁿ⁾ _q [t], the auxiliary variable η _k [t] is updated so as to reduce the upper limit function D ₀ according to the above equations (16) and (23). And stored in the storage unit 30.

パラメータ更新部２４は、（m、t）の全ての組み合わせの各々について、記憶部３０に記憶されているｙ_k[t]、Ｂ^(s) _m,k、Ｈ_k[τ]、ζ_k,m,t,τ、η_k[t]に基づいて、上記（１７）式に従って、上限関数Ｄ₀を小さくするように、基底スペクトルＢ^(s) _m,kのゲイン係数Ｇ^(s) _m[t]を更新し、記憶部３０に格納する。 The parameter update unit 24 stores, for each combination of (m, t), y _k [t], B ^(s) _{m, k} , H _k [τ], ζ _{k, Based on m, t, τ} and η _k [t], the gain coefficient G ^(s) _m [of the base spectrum B ^(s) _{m, k} is set so as to reduce the upper limit function D ₀ according to the above equation (17). t] is updated and stored in the storage unit 30.

また、パラメータ更新部２４は、（q、k）の全ての組み合わせの各々について、記憶部３０に記憶されているｙ_k[t]、Ｇ⁽ⁿ⁾ _q[t]、ξ_k,q,t、η_k[t]に基づいて、上記（１８）式に従って、上限関数Ｄ₀を小さくするように、雑音信号nのパワースペクトル密度Φ⁽ⁿ⁾ _k[t]の基底行列の要素Ｂ⁽ⁿ⁾ _q,kを更新し、記憶部３０に格納する。 The parameter update unit 24 also stores y _k [t], G ⁽ⁿ⁾ _q [t], ξ _{k, q, t} stored in the storage unit 30 for each of all combinations of (q, k). , Η _k [t], the element B ⁽ⁿ ) of the basis matrix of the power spectral density Φ ⁽ⁿ⁾ _k [t] of the noise signal n so as to reduce the upper limit function D ₀ according to the above equation (18) ⁾ _{q and k} are updated and stored in the storage unit 30.

また、パラメータ更新部２４は、（q、t）の全ての組み合わせの各々について、記憶部３０に記憶されているｙ_k[t]、Ｂ⁽ⁿ⁾ _q,k、ξ_k,q,t、η_k[t]に基づいて、上記（１９）式に従って、上限関数Ｄ₀を小さくするように、基底スペクトルＢ⁽ⁿ⁾ _q,kのゲインＧ⁽ⁿ⁾ _q[t]を更新し、記憶部３０に格納する。 In addition, the parameter update unit 24 uses y _k [t], B ⁽ⁿ⁾ _{q, k} , ξ _{k, q, t} , stored in the storage unit 30 for each of all combinations of (q, t). Based on η _k [t], the gain G ⁽ⁿ⁾ _q [t] of the base spectrum B ⁽ⁿ⁾ _{q, k} is updated and stored so as to reduce the upper limit function D ₀ according to the above equation (19). Store in the unit 30.

更に、パラメータ更新部２４は、（k、τ）の全ての組み合わせの各々について、記憶部３０に記憶されているｙ_k[t]、Ｂ^(s) _m,k、Ｇ^(s) _m[t]、ζ_k,m,t,τ、η_k[t]に基づいて、上記（２０）式に従って、上限関数Ｄ₀を小さくするように、室内インパルス応答のパワーＨ_k[τ]を更新し、記憶部３０に格納する。 Further, the parameter update unit 24 stores y _k [t], B ^(s) _{m, k} , G ^(s) _m [t] stored in the storage unit 30 for each of all combinations of (k, τ). ], Ζ _{k, m, t, τ} and η _k [t], the power H _k [τ] of the indoor impulse response is updated so as to reduce the upper limit function D ₀ according to the above equation (20). And stored in the storage unit 30.

終了判定部２５は、予め定められた終了条件を満足するか否かを判定し、終了条件を満足していない場合には、補助変数更新部２３及びパラメータ更新部２４の各処理を繰り返す。終了判定部２５は、終了条件を満足したと判定した場合には、信号変換部２６による処理に移行する。信号変換部２６は、記憶部３０に記憶されているＢ^(s) _m,k 及びＧ^(s) _m[t]に基づいて、雑音信号を取り除くことにより推定される原音声信号（以下、推定原音声信号と称する）に変換し、出力部４０により、推定原音声信号を出力する。 The end determination unit 25 determines whether or not a predetermined end condition is satisfied. If the end condition is not satisfied, the processes of the auxiliary variable update unit 23 and the parameter update unit 24 are repeated. When it is determined that the end condition is satisfied, the end determination unit 25 proceeds to processing by the signal conversion unit 26. Based on B ^(s) _{m, k} and G ^(s) _m [t] stored in the storage unit 30, the signal conversion unit 26 estimates an original speech signal (hereinafter, estimated) by removing a noise signal. The output unit 40 outputs an estimated original voice signal.

なお、終了条件としては、繰り返し回数がＬ-1回目の上限関数（１５）式の値と、繰り返し回数がＬ回目の上限関数（１５）式の値との差が、予め定めた閾値よりも小さくなったことを用いればよい。あるいは、終了条件として、繰り返し回数が、予め定められた上限回数に到達したことを用いてもよい。 As an end condition, the difference between the value of the upper limit function (15) with the number of repetitions L-1 and the value of the upper limit function (15) with the number of repetitions L is less than a predetermined threshold. What has become smaller can be used. Alternatively, the termination condition may be that the number of repetitions has reached a predetermined upper limit number.

＜音響信号解析装置の作用＞ <Operation of acoustic signal analyzer>

次に、本実施の形態に係る音響信号解析装置１００の作用について説明する。まず、解析対象の信号として、原音声信号に雑音及び残響成分が重畳された観測信号の時系列データが音響信号解析装置１００に入力され、記憶部３０に格納される。そして、音響信号解析装置１００において、図３に示す音響信号解析処理ルーチンが実行される。 Next, the operation of the acoustic signal analysis apparatus 100 according to the present embodiment will be described. First, as a signal to be analyzed, time-series data of an observation signal in which noise and reverberation components are superimposed on an original speech signal is input to the acoustic signal analysis device 100 and stored in the storage unit 30. Then, in the acoustic signal analysis apparatus 100, an acoustic signal analysis processing routine shown in FIG. 3 is executed.

まず、ステップＳ１０１において、記憶部３０から、観測信号ｙ[t]を読み込み、当該観測信号ｙ[t]に対して、短時間フーリエ変換を用いた時間周波数分析を行い、観測信号ｙ[t]の観測時間周波数成分Ｙ={ｙ_k[t]}（k＝１，・・・，K、t＝１，・・・，T）を算出すると共に、得られた観測時間周波数成分Ｙを記憶部３０に記憶する。 First, in step S101, the observation signal y [t] is read from the storage unit 30, and the observation signal y [t] is subjected to time-frequency analysis using short-time Fourier transform for the observation signal y [t]. Observation time frequency component Y = {y _k [t]} (k = 1,..., K, t = 1,..., T) and the obtained observation time frequency component Y is stored. Store in unit 30.

そして、ステップＳ１０２において、乱数を用いて、Ｇ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]の各初期値を設定して、記憶部３０に記憶すると共に、一定量のクリーン音声サンプルから予め学習して得た各周波数kのＭ個の基底スペクトルｍからなる基底行列の各要素をＢ^(s) _m,kとして設定して、記憶部３０に記憶する。 In step S102, initial values of G ^(s) _m [t], H _k [τ], B ⁽ⁿ⁾ _{q, k} , G ⁽ⁿ⁾ _q [t] are set using random numbers. And each element of the base matrix consisting of M base spectra m of each frequency k obtained by learning in advance from a fixed amount of clean speech samples is set as B ^(s) _{m, k} And stored in the storage unit 30.

次にステップＳ１０３では、上記ステップＳ１０２で設定されたＢ^(s) _m,k、Ｇ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]、又は上記ステップＳ１０２で設定されたＢ^(s) _m,k及び後述するステップＳ１０４、Ｓ１０５、Ｓ１０６、Ｓ１０７で更新されたＧ^(s) _m[t]、Ｈ_k[τ]、Ｂ⁽ⁿ⁾ _q,k 、Ｇ⁽ⁿ⁾ _q[t]に基づいて、上記（２１）式に従って、補助変数ζ_k,m,t,τを各（k、m、t、τ）の組み合わせについて算出すると共に、上記（２２）式に従って、補助変数ξ_k,q,tを各（k、q、t）の組み合わせについて算出し、更に、上記（１６）式、（２３）式に従って、補助変数η_k[t]を各時刻t及び各周波数kについて算出して、記憶部３０に格納する。 In step S103, B ^(s) _{m, k} , G ^(s) _m [t], H _k [τ], B ⁽ⁿ⁾ _{q, k} , G ⁽ⁿ⁾ _q [ t], or B ^(s) _{m, k} set in step S102 and G ^(s) _m [t], H _k [τ], B ⁽ updated in steps S104, S105, S106, and S107 described later. ^{n) Based on} _{q, k} , G ⁽ⁿ⁾ _q [t], the auxiliary variable ζ _{k, m, t, τ} is calculated for each combination of (k, m, t, τ) according to the above equation (21). In addition, the auxiliary variables ξ _{k, q, t} are calculated for each combination (k, q, t) according to the above equation (22), and further, according to the above equations (16), (23) _k [t] is calculated for each time t and each frequency k and stored in the storage unit 30.

ステップＳ１０４では、上記ステップＳ１０１で生成された観測信号ｙ[t]の観測時間周波数成分Ｙと、上記ステップＳ１０２で設定されたＢ^(s) _m,kと、上記ステップＳ１０２で設定されたＨ_k[τ]、又は後述するステップＳ１０７で更新されたＨ_k[τ]と、上記ステップＳ１０３で更新された補助変数ζ_k,m,t,τ、η_k[t]に基づいて、上記（１７）式に従って、基底スペクトルＢ^(s) _m,kの非負値のゲイン係数Ｇ^(s) _m[t]を各（m、t）の組み合わせについて算出して、記憶部３０に格納する。 In step S104, the observation time frequency component Y of the observation signal y [t] generated in step S101, B ^(s) _{m, k} set in step S102, and H _k set in step S102. [τ] or H _k [τ] updated in step S107, which will be described later, and the auxiliary variables ζ _{k, m, t, τ} , η _k [t] updated in step S103, (17 ) _, A non-negative gain coefficient G ^(s) _m [t] of the base spectrum B ^(s) _{m, k} is calculated for each combination of (m, t) and stored in the storage unit 30.

ステップＳ１０５では、上記ステップＳ１０１で生成された観測信号ｙ[t]の観測時間周波数成分Ｙと、上記ステップＳ１０２で設定されたＧ⁽ⁿ⁾ _q[t]、又は後述するステップＳ１０６で更新されたＧ⁽ⁿ⁾ _q[t]と、上記ステップＳ１０３で更新された補助変数ξ_k,q,t、η_k[t]に基づいて、上記（１８）式に従って、雑音信号nのパワースペクトル密度Φ⁽ⁿ⁾ _k[t]の基底行列の要素Ｂ⁽ⁿ⁾ _q,kを各（q、k）の組み合わせについて算出して、記憶部３０に格納する。 In step S105, the observation time frequency component Y of the observation signal y [t] generated in step S101, G ⁽ⁿ⁾ _q [t] set in step S102, or updated in step S106 described later. Based on G ⁽ⁿ⁾ _q [t] and the auxiliary variables ξ _{k, q, t} and η _k [t] updated in step S103, the power spectral density Φ of the noise signal n is obtained according to the above equation (18). ⁽ⁿ⁾ Element B ⁽ⁿ⁾ _{q, k} of the basis matrix of _k [t] is calculated for each (q, k) combination and stored in the storage unit 30.

ステップＳ１０６では、上記ステップＳ１０１で生成された観測信号ｙ[t]の観測時間周波数成分Ｙと、上記ステップＳ１０２で設定されたＢ⁽ⁿ⁾ _q,k、又は上記ステップＳ１０５で更新されたＢ⁽ⁿ⁾ _q,kと、上記ステップＳ１０３で更新された補助変数ξ_k,q,t、η_k[t]に基づいて、上記（１９）式に従って、基底スペクトルＢ⁽ⁿ⁾ _q,kのゲインＧ⁽ⁿ⁾ _q[t]を各（q、t）の組み合わせについて算出して、記憶部３０に格納する。 In step S106, the observation time-frequency component Y of the generated observation signal y [t] in step S101, updated in step S102 is set in the B ⁽ⁿ⁾ _{q, k} or step S105, B ^{( n) Based on} _{q, k} and the auxiliary variables ξ _{k, q, t} and η _k [t] updated in step S103, the gain of the base spectrum B ⁽ⁿ⁾ _{q, k} according to the above equation (19) G ⁽ⁿ⁾ _q [t] is calculated for each (q, t) combination and stored in the storage unit 30.

そして、ステップＳ１０７では、上記ステップＳ１０１で生成された観測信号ｙ[t]の観測時間周波数成分Ｙと、上記ステップＳ１０２で設定されたＢ^(s) _m,kと、上記ステップＳ１０２で設定されたＧ^(s) _m[t]、又は上記ステップＳ１０４で更新されたＧ^(s) _m[t]と、上記ステップＳ１０３で更新された補助変数ζ_k,m,t,τ、η_k[t]に基づいて、上記（２０）式に従って、室内インパルス応答のパワーＨ_k[τ]を各（k、τ）の組み合わせについて算出して、記憶部３０に格納する。 In step S107, the observation time frequency component Y of the observation signal y [t] generated in step S101, B ^(s) _{m, k} set in step S102, and the step S102 are set. ^{_{G (s) m [t]}} , or the G ^(s) _m [t] updated in step S104, the auxiliary variables updated in step _{S103 ζ k, m, t,} τ, η k [t] Based on the above, the power H _k [τ] of the room impulse response is calculated for each combination (k, τ) according to the above equation (20) and stored in the storage unit 30.

次のステップＳ１０８では、上記ステップＳ１０１で生成された観測信号ｙ[t]の観測時間周波数成分Ｙと、上記ステップＳ１０２で設定されたＢ^(s) _m,kと、上記ステップＳ１０３で更新された補助変数ζ_k,m,t,τ、ξ_k,q,t、η_k[t]と、上記ステップＳ１０４で更新されたＧ^(s) _m[t]と、上記ステップＳ１０５で更新されたＢ⁽ⁿ⁾ _q,kと、上記ステップＳ１０６で更新されたＧ⁽ⁿ⁾ _q[t]と、上記ステップＳ１０７で更新されたＨ_k[τ]に基づいて、上記（１５）式に従って上限関数Ｄ₀の値を算出して、記憶部３０に記憶する。そして、前回のステップＳ１０８で算出した上限関数Ｄ₀の値を記憶部３０から読み込み、今回のステップＳ１０８で算出した上限関数Ｄ₀の値と、前回のステップＳ１０８で算出した上限関数Ｄ₀の値との差分が、予め記憶部３０に記憶されている予め定められた閾値よりも小さいか否かを判定し、差分が予め定められた閾値以上の場合には、終了条件を満足していないと判断して、上記ステップＳ１０３へ戻り、上記ステップＳ１０３〜ステップＳ１０７の処理を繰り返す。一方、差分が予め定められた閾値未満の場合には、終了条件を満足したと判断して、ステップＳ１０９で、上記ステップＳ１０２で設定されたＢ^(s) _m,kと、上記ステップＳ１０４で最終的に更新されたＧ^(s) _m[t]に基づいて推定原音声信号を算出すると共に、出力部４０より推定原音声信号を出力して音響信号解析処理ルーチンを終了する。 In the next step S108, the observation time frequency component Y of the observation signal y [t] generated in step S101, B ^(s) _{m, k} set in step S102, and updated in step S103. Auxiliary variables ζ _{k, m, t, τ} , ξ _{k, q, t} , η _k [t], G ^(s) _m [t] updated in step S104, and B updated in step S105 ^{Based on (n)} _{q, k} , G ⁽ⁿ⁾ _q [t] updated in step S106, and H _k [τ] updated in step S107, the upper limit function D is determined according to the above equation (15). A value of ₀ is calculated and stored in the storage unit 30. Then, read the value of the upper bound function D ₀ calculated in the previous step S108 from the storage unit 30, the value of the upper bound function D ₀ calculated in this step S108, the value of the upper bound function D ₀ calculated in the previous step S108 Is determined to be smaller than a predetermined threshold stored in the storage unit 30 in advance, and if the difference is equal to or larger than the predetermined threshold, the end condition is not satisfied. Determination is made, the process returns to step S103, and the processes of steps S103 to S107 are repeated. On the other hand, if the difference is less than the predetermined threshold value, it is determined that the end condition is satisfied, and B ^(s) _{m, k} set in step S102 and the final result in step S104 are determined in step S109. The estimated original speech signal is calculated based on the updated G ^(s) _m [t] and the estimated original speech signal is output from the output unit 40, and the acoustic signal analysis processing routine is terminated.

＜実施結果＞ <Results>

次に、本実施の形態に係る手法の有用性を示す目的で、残響室内（残響時間１．３秒）での移動音声信号に、ＰＨＳの着信音、背景雑音、当該ＰＨＳの着信音及び背景雑音を組み合わせた音、粒子落下音、ＢＧＭ音、他者の音声といった非定常雑音を重畳したものを観測信号として用いた場合の、提案法での評価実験の結果について説明する。 Next, for the purpose of showing the usefulness of the technique according to the present embodiment, the PHS ringtone, background noise, the PHS ringtone, and the background are added to the moving voice signal in the reverberation room (reverberation time 1.3 seconds). A description will be given of the results of an evaluation experiment using the proposed method when a non-stationary noise such as a noise combined sound, a particle falling sound, a BGM sound, or another person's voice is superimposed as an observation signal.

図４は、提案法を用いた場合の評価実験結果を示した図である。上記図４（Ａ）はメル周波数ケプストラム係数歪み（Mel-Frequency Cepstral Coefficients Distortion：ＭＦＣＣＤ）を評価基準とした実験結果、上記図４（Ｂ）はバークスペクトル歪みスコア（Bark Spectral Distortion Score：ＢＳＤＳ）を評価基準とした実験結果、上記図４（Ｃ）は信号対干渉比（Source to Interference Ratio：ＳＩＲ）を評価基準とした実験結果を示している。なお、ＭＦＣＣＤ及びＢＳＤＳは値が小さくなる程、音声の明瞭度が良くなることを示す値であり、一方、ＳＩＲは値が大きくなる程、音声の明瞭度が良くなることを示す値である。 FIG. 4 is a diagram showing an evaluation experiment result when the proposed method is used. FIG. 4A shows the experimental results based on Mel-Frequency Cepstral Coefficients Distortion (MFCCD) as an evaluation standard, and FIG. 4B shows the Bark Spectral Distortion Score (BSDS). FIG. 4C shows the experimental results using the signal-to-interference ratio (SIR) as the evaluation standard. Note that MFCCD and BSDS are values indicating that the sound clarity is improved as the values are reduced, while SIR is a value indicating that the sound clarity is improved as the values are increased.

上記図４から分かるように、移動音声信号にＢＧＭ又は他者の音声を重畳した場合のＭＦＣＣＤの評価結果を除いて、観測信号に含まれる音声の明瞭度に比べ提案法による推定原音声信号の明瞭度の方が向上するという結果を得られた。特に、移動音声信号にＰＨＳの着信音を重畳した場合のＳＩＲに関する評価結果では、４倍以上の改善が見られた。 As can be seen from FIG. 4 above, except for the evaluation result of the MFCCD when the voice of the BGM or another person is superimposed on the mobile voice signal, the estimated original voice signal of the proposed method is compared with the clarity of the voice included in the observed signal. The result that the clarity was improved was obtained. In particular, in the evaluation result regarding SIR when the PHS ringtone is superimposed on the mobile voice signal, an improvement of 4 times or more was observed.

以上説明したように、本発明の実施の形態に係る音響信号解析装置によれば、上記（１５）式で示される上限関数の値が小さくなるように、係数行列の各要素Ｇ^(s) _m[t]と、基底行列の各要素Ｂ⁽ⁿ⁾ _q,kと、係数行列の各要素Ｇ⁽ⁿ⁾ _q[t]と、各時刻τの室内インパルス応答の各周波数kのパワーＨ_k[τ]と、補助変数ζ_k,m,t,τと、補助変数ξ_k,q,tと、補助変数η_k[t]とを繰り返し更新することにより、音声信号に雑音及び残響成分が重畳した観測信号から精度よく音声信号を推定して、音声信号の明瞭度を向上させることができる。 As described above, according to the acoustic signal analysis device according to the embodiment of the present invention, each element G ^(s) _m of the coefficient matrix is set so that the value of the upper limit function expressed by the above equation (15) becomes small. [t], each element B ⁽ⁿ⁾ _{q, k of} the basis matrix, each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, and the power H _k [ By repeatedly updating τ], auxiliary variable ζ _{k, m, t, τ} , auxiliary variable ξ _{k, q, t} and auxiliary variable η _k [t], noise and reverberation components are superimposed on the audio signal. The speech signal can be accurately estimated from the observed signal and the clarity of the speech signal can be improved.

このように、本発明で提案する手法では、音源の移動等に関して、室内インパルス応答の短時間フーリエ変換の振幅成分を時不変、位相成分を時変とする観測信号の生成モデルを立てて、観測信号に含まれる原音声信号を高い精度で推定する Thus, in the method proposed in the present invention, with respect to the movement of the sound source, etc., an observation signal generation model in which the amplitude component of the short-time Fourier transform of the room impulse response is time-invariant and the phase component is time-variable is established and observed. Estimate the original speech signal included in the signal with high accuracy

なお、本発明は、上述した実施形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

例えば、上述の音響信号解析装置は、内部にコンピュータシステムを有しているが、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）も含むものとする。 For example, the above-described acoustic signal analysis apparatus has a computer system inside, but the “computer system” includes a homepage providing environment (or display environment) if a WWW system is used. .

また、本願明細書中において、プログラムが予めインストールされている実施形態として説明したが、当該プログラムを、コンピュータ読み取り可能な記録媒体に格納して提供することも可能である。 In the present specification, the embodiment has been described in which the program is installed in advance. However, the program can be provided by being stored in a computer-readable recording medium.

また、本実施の形態に係る手法では、パワースペクトル密度系列モデルΦ^x _k[t]と観測時間周波数成分ｙ_k[t]との距離の尺度として板倉斎藤距離Ｉ(Ｈ,Ｇ^(s),Ｂ⁽ⁿ⁾,Ｇ⁽ⁿ⁾)を用いたが、これに限らず、β-divergenceの一種であるGeneralized Kullback-Leibler距離やFrobeniusノルムといった尺度を用いるようにしてもよい。 Further, in the method according to the present embodiment, the power spectral density series model Φ ^x _k [t] and the observation time-frequency component y _k [t] Itakura Saito as a measure of the distance between the distance ^{I (H, G (s)} , B ⁽ⁿ⁾ , G ⁽ⁿ⁾ ) are used, but the present invention is not limited to this, and a scale such as a Generalized Kullback-Leibler distance or Frobenius norm, which is a kind of β-divergence, may be used.

１０入力部
２０演算部
２１時間周波数解析部
２２初期設定部
２３補助変数更新部
２４パラメータ更新部
２５終了判定部
２６信号変換部
３０記憶部
４０出力部
１００音響信号解析装置 DESCRIPTION OF SYMBOLS 10 Input part 20 Calculation part 21 Time frequency analysis part 22 Initial setting part 23 Auxiliary variable update part 24 Parameter update part 25 End determination part 26 Signal conversion part 30 Storage part 40 Output part 100 Acoustic signal analysis apparatus

Claims

A time-frequency analysis unit that outputs time-series data of an acoustic signal and outputs an observation time-frequency component y _k [t] of each frequency k at each time t;
The power spectrum density Φ ^(s) _k [t] at each time t and each frequency k of the original speech signal s included in the acoustic signal is represented by each non-negative element B ^(s) _m representing M base spectra _{m. , k} and the coefficient matrix with each non-negative element as a gain matrix G ^(s) _m [t] of the base spectrum B ^(s) _{m, k} at each time t Each element G ^(s) _m [t] of the coefficient matrix, power H _k [τ] of each frequency k of the indoor impulse response at each time τ, each time t and each frequency of the noise signal n included in the acoustic signal The power spectral density Φ ⁽ⁿ⁾ _k [t] of _k is represented by a base matrix composed of non-negative elements B ⁽ⁿ⁾ _{q, k} representing Q base spectra q and a base spectrum B ^{(n at} each time t. ⁾ Each element B ⁽ⁿ⁾ _{q, k of} the base matrix when the gain G ⁽ⁿ⁾ _q [t] of _{q, k} is expressed as a product with a coefficient matrix having each element as a non-negative value The key points of Sets the initial value of each of the ^{_{G (n) q [t]}} , the element B ^(s) _m the base _matrix, the _k, the parameter initial value setting unit for setting a pre-determined value,
The element B ^(s) _{m, k of the} basis matrix, the element G ^(s) _m [t] of the coefficient matrix, the power H _k [τ], and the power in all combinations of (τ, m) Each of the power spectral density series model Φ ^x _k [t] calculated on the basis of the spectral density Φ ⁽ⁿ⁾ _k [t] and the observation time frequency component y _k [t] are reduced in distance. The observation time frequency component y _k [t] at time t and each frequency k, each element B ^(s) _{m, k of the} basis matrix, each element G ^(s) _m [t] of the coefficient matrix, Each element B ⁽ⁿ⁾ _{q, k of the} base matrix, each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, and power H _k [τ] of each frequency k of the indoor impulse response at each time τ , Each element G ^(s) _m [t] of the coefficient matrix and each element B ^{(n of the} basis matrix ⁾ based on the power spectral density series model Φ ^x _k [t] at each time t and each frequency k ⁾ _{q, k} and each element G ^{(n) of the} coefficient matrix a parameter updating unit that updates _q [t] and the power H _k [τ] of each frequency k of the indoor impulse response at each time τ;
An end determination unit that repeatedly performs update by the parameter update unit until a predetermined end condition is satisfied;
An acoustic signal analyzing apparatus including:

The acoustic signal analysis apparatus according to claim 1, wherein β-divergence is used as a measure of the distance between the power spectral density series model Φ ^x _k [t] and the observation time frequency component y _k [t].

An auxiliary variable update unit,
A function representing the distance between the power spectral density series model Φ ^x _k [t] and the observation time frequency component y _k [t] is expressed as the observation time frequency component y _k [t] at each time t and each frequency k. Each element B ^(s) _{m, k of the} basis matrix, each element G ^(s) _m [t] of the coefficient matrix, each element B ⁽ⁿ⁾ _{q, k of the} basis matrix _, and the coefficient Each element G ⁽ⁿ⁾ _q [t] of the matrix, the power H _k [τ] of each frequency k of the indoor impulse response at each time τ, and the power spectrum density series model Φ ^x _{k at} each time t and each frequency _k [t] and auxiliary variables ζ _{k, m, t, τ} for all combinations of (k, m, t, τ) and auxiliary variables ξ _k, for all combinations of (k, q, t) _{q, t} and auxiliary variables η _k [t] for each time t and each frequency k, and the power spectral density series model Φ ^x _k [t] and the observation time frequency component y _k β-diver with [t] And auxiliary function, which is the upper limit function of ence,
The auxiliary variable updating unit reduces each auxiliary function to each element B ^(s) _{m, k of the} basis matrix, each element G ^(s) _m [t] of the coefficient matrix, and the basis matrix Element B ⁽ⁿ⁾ _{q, k} , each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, power H _k [τ] of each frequency k of the room impulse response at each time τ, and each time Based on t and the power spectral density series model Φ ^x _k [t] of each frequency k, auxiliary variables ζ _{k, m, t, τ} for all combinations of (k, m, t, τ), ( update auxiliary variables ξ _{k, q, t} for all combinations of k, q, t) and auxiliary variables η _k [t] for each time t and each frequency k,
The parameter update unit reduces the auxiliary function so that the observation time frequency component y _k [t] at each time t and each frequency k, and each element B ^(s) _{m, k of the} basis matrix, Each element G ^(s) _m [t] of the coefficient matrix, each element B ⁽ⁿ⁾ _{q, k of the} basis matrix, each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, and each time τ And the auxiliary variables ζ _{k, m, t, τ} for all combinations of the power H _k [τ] of each frequency k of the room impulse response of (k, m, t, τ) _, and (k, q, t ) based on the auxiliary variables xi] _{k, q} for all _{combinations,} and _t, an auxiliary variable eta _k [t] for each time t and each frequency k of the elements G ^(s of the coefficient ^matrix) _m [ t], each element B ⁽ⁿ⁾ _{q, k of the} basis matrix, each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, and power H _k of each frequency k of the room impulse response at each time τ The acoustic signal analyzing apparatus according to claim 2, wherein [τ] is updated.

The time-frequency analysis unit inputs time series data of the acoustic signal, and outputs an observation time frequency component y _k [t] of each frequency k at each time t,
The parameter initial value setting unit converts the power spectrum density Φ ^(s) _k [t] at each time t and each frequency k of the original audio signal s included in the acoustic signal to a non-negative value representing M base spectra m. A basis matrix composed of each element B ^(s) _{m, k} , a coefficient matrix having a gain G ^(s) _m [t] of the base spectrum B ^(s) _{m, k} at each time t as a non-negative element, Each coefficient G ^(s) _m [t], power H _k [τ] of each frequency k of the room impulse response at each time τ, and noise signal n included in the acoustic signal Power spectrum density Φ ⁽ⁿ⁾ _k [t] at each time t and each frequency k, a base matrix composed of non-negative elements B ⁽ⁿ⁾ _{q, k} representing Q base spectra q, and each time each of the basis matrix when expressed basal spectrum B ⁽ⁿ⁾ _q in _t, the gain of the _k G a ⁽ⁿ⁾ _q [t] as the product of the coefficient matrix with each element is non-negative Containing B ⁽ⁿ⁾ _{q, k,} sets the respective initial values of the elements G ⁽ⁿ⁾ _q [t] of the coefficient matrix, the elements B ^(s) _m the base _matrix, the _k, determined in advance Set the value
By the parameter updating unit, the element B ^(s) _{m, k of the} basis matrix, the element G ^(s) _m [t] of the coefficient matrix, and the power H _k [τ] in all combinations of (τ, m). ] And the power spectral density series model Φ ^x _k [t] calculated based on the power spectral density Φ ⁽ⁿ⁾ _k [t] and the observation time frequency component y _k [t] are small The observation time frequency component y _k [t] at each time t and each frequency k, each element B ^(s) _{m, k of the} basis matrix _, and each element G ^(s) _{m of the} coefficient matrix [t], each element B ⁽ⁿ⁾ _{q, k of the} basis matrix, each element G ⁽ⁿ⁾ _q [t] of the coefficient matrix, and power H of each frequency k of the room impulse response at each time τ _k [τ] and the power spectral density series model Φ ^x _k [t] at each time t and each frequency k, each element G ^(s) _m [t] of the coefficient matrix and the basis matrix each element B ⁽ⁿ⁾ _q, and _k Wherein each element G of the coefficient matrix ⁽ⁿ⁾ _q [t], and updates the power H _k [tau] of each frequency k of the room impulse response for each time tau,
An acoustic signal analysis method in which updating by the parameter updating unit is repeatedly performed by a termination determination unit until a predetermined termination condition is satisfied.

The program for functioning a computer as each part of the acoustic signal analyzer of any one of Claims 1-3.