JP6521886B2

JP6521886B2 - Signal analysis apparatus, method, and program

Info

Publication number: JP6521886B2
Application number: JP2016032396A
Authority: JP
Inventors: 弘和亀岡; 莉李
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2016-02-23
Filing date: 2016-02-23
Publication date: 2019-05-29
Anticipated expiration: 2036-02-23
Also published as: JP2017151222A

Description

The present invention relates to a signal analysis apparatus, method, and program, and in particular to a signal analysis apparatus, method, and program for estimating parameters.

The present invention addresses the problem of suppressing noise in speech signals. Noise mixed into a speech signal not only degrades the quality of voice communication but also lowers the performance of various kinds of speech processing, such as speech recognition and voice conversion. Various speech enhancement methods have been proposed to solve this problem.

Speech enhancement methods are broadly divided into unsupervised, supervised, and semi-supervised approaches.

The supervised approach assumes that samples of both the target speech and the target noise are available in advance, the semi-supervised approach assumes that only samples of the target speech are available in advance, and the unsupervised approach assumes that neither is available. Methods are also broadly classified according to whether the target of enhancement is the signal (or spectrum) or the features. Representative examples of the supervised feature enhancement approach include the Vector Taylor Series (VTS) method, Stereo Piecewise Linear Compensation for Environment (SPLICE), and methods using a Denoising Autoencoder (DAE).

The VTS method constructs a conversion function from noisy speech features to clean speech features by taking a first-order approximation, in the feature space, of the linear superposition process of speech and noise.

SPLICE models the joint probability density function of the features of noisy speech and clean speech with a Gaussian Mixture Model (GMM), and constructs a conversion function from noisy speech features to clean speech features using the GMM parameters learned from training samples.

The DAE method constructs a conversion function between input and output with a deep neural network that takes noisy speech features as input and clean speech features as output. Because these supervised speech enhancement approaches are based on discriminative models and discriminative criteria, they are extremely powerful in known noise environments but are not necessarily effective in unknown noise environments. Many methods have, however, been proposed to compensate for the mismatch when the speech or noise in the training data differs from that at test time.

On the other hand, a method based on Semi-supervised Non-negative Matrix Factorization (SSNMF) (Non-Patent Document 1), a representative example of the semi-supervised signal enhancement approach, has attracted attention in recent years as a powerful speech enhancement method for unknown noise environments. This method is based on the principle that the power spectra of speech and noise can be estimated by fitting the observed spectrum at each time with a non-negative combination of pre-learned speech basis spectra and noise basis spectra.

P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Proc. Independent Component Analysis and Signal Separation, pp. 414-421, 2007.

However, when the noise spectrum can be explained by the speech basis spectra, or vice versa, the estimated spectrum may not correspond to the actual speech spectrum. Resolving this ambiguity in the decomposition into speech and noise spectra therefore requires stronger constraints that the speech spectrum must satisfy. Moreover, even if the SSNMF method can enhance the signal, there is no guarantee that it enhances the features, so the enhancement does not necessarily translate directly into improved performance of feature-based speech processing such as speech recognition and voice conversion.

The present invention has been made in view of the above circumstances, and its object is to provide a signal analysis apparatus, method, and program capable of suppressing noise, enhancing the speech signal, and also enhancing the cepstral features.

To achieve the above object, a signal analysis apparatus according to the present invention includes a time-frequency expansion unit and a parameter estimation unit. The time-frequency expansion unit receives as input time-series data of an observed signal in which a speech signal and a noise signal are mixed, and outputs an observed spectrogram representing the observed time-frequency component at each time and each frequency. The parameter estimation unit estimates the activation parameters of the speech signal, and the basis spectra and activation parameters of the noise signal, so as to reduce a criterion, based on the observed spectrogram output by the time-frequency expansion unit, pre-learned basis spectra representing the power spectrum of the speech signal for each basis and each frequency, and parameters representing a pre-learned probability distribution of the cepstral features of the speech signal defined in the cepstral space. The criterion is expressed using (i) the distance between the observed time-frequency component at each time and each frequency and the sum of two time-frequency components at each time and each frequency, one obtained from the basis spectra of the speech signal and activation parameters representing the power of the speech signal at each time, the other obtained from the basis spectra and activation parameters of the noise signal, and (ii) a regularization term that, based on the probability distribution of the cepstral features of the speech signal, represents the likelihood of the cepstral features of the time-frequency components at each time and each frequency obtained from the basis spectra and activation parameters of the speech signal.

A signal analysis method according to the present invention is a signal analysis method in a signal analysis apparatus including a time-frequency expansion unit and a parameter estimation unit. The time-frequency expansion unit receives as input time-series data of an observed signal in which a speech signal and a noise signal are mixed, and outputs an observed spectrogram representing the observed time-frequency component at each time and each frequency. The parameter estimation unit estimates the activation parameters of the speech signal, and the basis spectra and activation parameters of the noise signal, so as to reduce a criterion, based on the observed spectrogram output by the time-frequency expansion unit, pre-learned basis spectra representing the power spectrum of the speech signal for each basis and each frequency, and parameters representing a pre-learned probability distribution of the cepstral features of the speech signal defined in the cepstral space. The criterion is expressed using (i) the distance between the observed time-frequency component at each time and each frequency and the sum of two time-frequency components at each time and each frequency, one obtained from the basis spectra of the speech signal and activation parameters representing the power of the speech signal at each time, the other obtained from the basis spectra and activation parameters of the noise signal, and (ii) a regularization term that, based on the probability distribution of the cepstral features of the speech signal, represents the likelihood of the cepstral features of the time-frequency components at each time and each frequency obtained from the basis spectra and activation parameters of the speech signal.

The program of the present invention is a program for causing a computer to function as each unit constituting the above signal analysis apparatus.

As described above, the signal analysis apparatus, method, and program of the present invention estimate the activation parameters of the speech signal, and the basis spectra and activation parameters of the noise signal, so as to reduce a criterion expressed using (i) the distance between the observed time-frequency component at each time and each frequency and the sum of the time-frequency component obtained from the basis spectra of the speech signal and activation parameters representing the power of the speech signal at each time and the time-frequency component obtained from the basis spectra and activation parameters of the noise signal, and (ii) a regularization term that, based on the probability distribution of the cepstral features of the speech signal, represents the likelihood of the cepstral features of the time-frequency components obtained from the basis spectra and activation parameters of the speech signal. This makes it possible to suppress noise, enhance the speech signal, and also enhance the cepstral features.

A block diagram showing the functional configuration of a signal analysis apparatus according to an embodiment of the present invention.
A flowchart showing the learning processing routine in the signal analysis apparatus according to the embodiment of the present invention.
A flowchart showing the parameter estimation processing routine in the signal analysis apparatus according to the embodiment of the present invention.
Four diagrams showing experimental results.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Overview of the Invention>
First, an overview of the present embodiment is given. To solve the problem of the SSNMF method described above, the present embodiment proposes a regularization method for SSNMF that uses prior information about the features as well as the spectrum, thereby providing more clues for estimating the speech spectrum and making distortion of the features less likely.

<Principle of the Present Embodiment>
Next, the principle of the present embodiment will be described.

<Formulation of the Problem>
Let Y_ω,t denote the amplitude spectrogram or power spectrogram of the observed signal (hereinafter, the observed spectrogram), where ω and t are the frequency and time indices. Assuming additivity of spectra, the speech spectrum X^(s)_ω,t and the noise spectrum X^(n)_ω,t at each time are modeled, using K_s basis spectra

and K_n basis spectra

respectively, as the non-negative combinations

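The model described above can be written compactly as follows. This is a sketch in assumed notation: the symbols H and U for basis spectra and activation parameters follow their use later in this description, but the 1-based basis index k and the layout are assumptions rather than the patent's own equations.

```latex
X^{(s)}_{\omega,t} = \sum_{k=1}^{K_s} H^{(s)}_{k,\omega}\, U^{(s)}_{k,t}, \qquad
X^{(n)}_{\omega,t} = \sum_{k=1}^{K_n} H^{(n)}_{k,\omega}\, U^{(n)}_{k,t}, \qquad
X_{\omega,t} = X^{(s)}_{\omega,t} + X^{(n)}_{\omega,t},
\\
H^{(s)}_{k,\omega},\; H^{(n)}_{k,\omega},\; U^{(s)}_{k,t},\; U^{(n)}_{k,t} \;\ge\; 0 .
```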

Using the basis spectra

pre-learned from clean-speech training samples, the SSNMF method fits to the observed spectrum Y_ω,t the model

and thereby estimates the speech and noise components contained in the observed spectrogram. From the estimates of the speech spectrum and noise spectrum obtained in this way, the speech signal can be recovered from the observed signal with, for example, a Wiener filter. In this approach, the pre-learned speech basis spectra serve as the clue for separating speech from noise. However, because the speech basis spectra may be able to explain the noise spectrum, or vice versa, X^(s)_ω,t and X^(n)_ω,t do not necessarily correspond to the actual speech and noise spectra even if the error between Y_ω,t and X_ω,t is made small. Resolving the ambiguity among pairs X^(s)_ω,t and X^(n)_ω,t that give the same X_ω,t therefore requires a stronger constraint that the speech spectrum must satisfy. Now, if X^(s)_ω,t does correspond to a speech spectrum, it should also be distributed, in the feature space, within the range that speech can actually take. The present embodiment therefore focuses on cepstral features, introduces a regularization term for X^(s)_ω,t based on a probability distribution defined in the cepstral space, and proposes a parameter optimization algorithm whose criterion is the sum of this term and the error criterion between Y_ω,t and X_ω,t.
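The plain SSNMF procedure described above can be sketched as follows. This is a minimal illustration, not the patent's algorithm: it uses the well-known multiplicative updates for the I-divergence, keeps the speech basis spectra fixed, and applies a Wiener-like mask. The function names, the random initialization, and the default values of `K_n` and `n_iter` are all assumptions made for the example.

```python
import numpy as np

def i_divergence(Y, X, eps=1e-12):
    """I-divergence (generalized KL) between non-negative arrays Y and X."""
    return float(np.sum(Y * np.log((Y + eps) / (X + eps)) - Y + X))

def ssnmf(Y, H_s, K_n=4, n_iter=100, eps=1e-12, seed=0):
    """Semi-supervised NMF with fixed speech bases (illustrative sketch).

    Y   : (n_freq, n_frames) non-negative observed spectrogram
    H_s : (n_freq, K_s) pre-learned speech basis spectra, kept fixed
    Returns the speech model X_s, the noise model X_n, and a masked
    version of Y as the enhanced speech spectrogram.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = Y.shape
    K_s = H_s.shape[1]
    # Random non-negative initial values (an assumption of this sketch).
    H_n = rng.random((n_freq, K_n)) + eps
    U_s = rng.random((K_s, n_frames)) + eps
    U_n = rng.random((K_n, n_frames)) + eps
    for _ in range(n_iter):
        # Update the speech activations with everything else fixed.
        R = Y / (H_s @ U_s + H_n @ U_n + eps)
        U_s *= (H_s.T @ R) / (H_s.sum(axis=0)[:, None] + eps)
        # Update the noise activations.
        R = Y / (H_s @ U_s + H_n @ U_n + eps)
        U_n *= (H_n.T @ R) / (H_n.sum(axis=0)[:, None] + eps)
        # Update the noise basis spectra; H_s is never updated.
        R = Y / (H_s @ U_s + H_n @ U_n + eps)
        H_n *= (R @ U_n.T) / (U_n.sum(axis=1)[None, :] + eps)
    X_s = H_s @ U_s
    X_n = H_n @ U_n
    # Wiener-like mask built from the two model spectra.
    enhanced = Y * X_s / (X_s + X_n + eps)
    return X_s, X_n, enhanced
```

Because nothing here constrains X_s to look like speech in the cepstral domain, this sketch exhibits exactly the ambiguity that the regularization term of the present embodiment is meant to reduce.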

The error between Y_ω,t and X_ω,t can be measured with the squared error, the I-divergence, the Itakura-Saito distance, and so on; here the I-divergence

is used, where all basis spectra are assumed to satisfy a constraint of the form

Next, for X^(s)_ω,t, consider a criterion of the form

Here,

is the Mel-Frequency Cepstrum Coefficients (MFCC) of X_0,t, ..., X_Ω-1,t, f_l,ω is the l-th mel filter bank coefficient, and

is the coefficient of the discrete cosine transform. Equation (5) represents the logarithm of the probability that

is generated from a Gaussian mixture distribution with parameters

where

denote the mean, variance, and weight of the m-th normal distribution. By learning the parameters θ of this Gaussian mixture distribution from the MFCC sequences of clean-speech training samples,

can be made a criterion that gives a high score when X^(s)_ω,t is distributed in the MFCC space as similarly as possible to the training samples. The aim of the proposed method is to minimize a criterion of the form

that takes into account the two criteria of equations (3) and (5), where λ is a regularization parameter. As described above, this optimization problem imposes a soft constraint on the spectral model through a cepstral distance criterion. The inventors have previously used this framework to propose a method that performs musical sound separation and timbre clustering under a single optimization criterion.
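The two ingredients of the criterion described above, an I-divergence fit term and a GMM log-likelihood of the MFCCs of the speech model, can be sketched numerically as follows. This is an illustration under stated assumptions, not the patent's equations (3), (5), and (7): the diagonal-covariance GMM, the unnormalized DCT-II matrix, the caller-supplied filter bank `F`, and the sign convention (the log-likelihood enters with a minus sign so that the sum is minimized) are all assumptions.

```python
import numpy as np

def dct_matrix(n_ceps, n_filters):
    """Unnormalized DCT-II coefficients c[n, l] (assumed form)."""
    n = np.arange(n_ceps)[:, None]
    l = np.arange(n_filters)[None, :]
    return np.cos(np.pi * n * (l + 0.5) / n_filters)

def mfcc_from_spectrum(X, F, C, eps=1e-12):
    """MFCC-like features: DCT of the log mel filter bank outputs.

    X: (n_freq, n_frames), F: (n_filters, n_freq), C: (n_ceps, n_filters).
    """
    return C @ np.log(F @ X + eps)

def gmm_log_likelihood(ceps, weights, means, variances):
    """Sum over frames of log sum_m w_m N(c_t; mu_m, diag(var_m)).

    ceps: (n_ceps, n_frames); weights: (M,); means, variances: (M, n_ceps).
    """
    diff = ceps.T[:, None, :] - means[None, :, :]            # (T, M, N)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_comp = (np.log(weights)[None, :] + log_norm[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None], axis=2))
    # Log-sum-exp over mixture components, computed stably.
    m = log_comp.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))))

def regularized_criterion(Y, X_s, X_n, F, C, gmm, lam, eps=1e-12):
    """I-divergence(Y, X_s + X_n) minus lam times the GMM log-likelihood."""
    X = X_s + X_n + eps
    div = np.sum(Y * np.log((Y + eps) / X) - Y + X)
    reg = -gmm_log_likelihood(mfcc_from_spectrum(X_s, F, C), *gmm)
    return float(div + lam * reg)
```

Evaluating `regularized_criterion` for two decompositions with the same sum X_s + X_n shows the intended effect: the fit term is identical, and only the regularization term distinguishes the more speech-like X_s.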

The method of the present embodiment aims to achieve signal enhancement and feature enhancement of speech simultaneously within this framework, and is called "cepstrum-regularized SSNMF".

<Parameter Estimation Algorithm>
The U^(s), H^(n), and U^(n) that minimize

cannot be obtained analytically, but an iterative algorithm that searches for a locally optimal solution of this optimization problem can be derived based on the auxiliary function method. To minimize an objective function F(θ) with the auxiliary function method, an auxiliary variable α is first introduced and an auxiliary function F+(θ, α) satisfying

is designed. Once such an auxiliary function has been designed, alternately repeating

and

yields a θ that locally minimizes the objective function F(θ). Below, an auxiliary function for

and the update equations based on it are derived.
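The reason the alternating updates never increase the objective can be made explicit. The following is the standard statement of the auxiliary function principle, written in assumed notation (the iteration index i is introduced here for illustration and is not from the source):

```latex
F(\theta) = \min_{\alpha} F^{+}(\theta, \alpha), \qquad
\alpha^{(i+1)} = \operatorname*{argmin}_{\alpha} F^{+}(\theta^{(i)}, \alpha), \qquad
\theta^{(i+1)} = \operatorname*{argmin}_{\theta} F^{+}(\theta, \alpha^{(i+1)}),
\\
F(\theta^{(i+1)}) \;\le\; F^{+}(\theta^{(i+1)}, \alpha^{(i+1)})
\;\le\; F^{+}(\theta^{(i)}, \alpha^{(i+1)}) \;=\; F(\theta^{(i)}) .
```

The first inequality holds because F is the minimum of F+ over α, the second because θ^(i+1) minimizes F+ for the current α, and the final equality because α^(i+1) attains the minimum at θ^(i).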

For

using the fact that the negative logarithm is a convex function, Jensen's inequality gives an upper bound function of the form

where =^c denotes equality with respect only to the terms that depend on the parameters.

Here, H and U are defined as

ζ_k,ω,t is a variable satisfying ζ_k,ω,t ≥ 0 and Σ_k ζ_k,ω,t = 1, and equality in equation (8) holds when

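For reference, the well-known Jensen bound on the logarithmic term of the I-divergence has the following form. It is given here as a sketch consistent with the stated conditions on ζ_k,ω,t, with the sum over k running over all speech and noise bases; that it coincides term by term with the patent's equations (8) and (13) is an assumption.

```latex
-\,Y_{\omega,t} \log \sum_{k} H_{k,\omega} U_{k,t}
\;\le\;
-\,Y_{\omega,t} \sum_{k} \zeta_{k,\omega,t}
\log \frac{H_{k,\omega} U_{k,t}}{\zeta_{k,\omega,t}},
\qquad
\zeta_{k,\omega,t} \ge 0,\quad \sum_{k} \zeta_{k,\omega,t} = 1,
\\
\text{with equality when}\quad
\zeta_{k,\omega,t} = \frac{H_{k,\omega} U_{k,t}}{\sum_{k'} H_{k',\omega} U_{k',t}} .
```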

Next, an upper bound function for

is designed. As with equation (8), using the fact that the negative logarithm is a convex function, Jensen's inequality gives an inequality of the form

Equality in equation (14) holds when

Next, an upper bound function for

is derived. Because a quadratic function is convex, Jensen's inequality gives an inequality of the form

where

Here β_l,n,m,t is an arbitrary positive constant satisfying Σ_l β_l,n,m,t = 1,

is a variable satisfying

and equality in equation (16) holds when

From equations (16) and (14),

follows, where

Noting that A_l,t is non-negative, consider next an upper bound function for (log G_l,t)². An upper bound for (log G_l,t)² is given by the inequality

where

and equality in equation (21) holds when

Furthermore, because f_l,ω H_k,ω U_k,t is non-negative and the reciprocal function is convex on the positive domain, Jensen's inequality gives

where ρ_l,k,ω,t is a variable satisfying ρ_l,k,ω,t > 0 and Σ_ω,k ρ_l,k,ω,t = 1, and equality in equation (26) holds when

Next, consider an upper bound for the term B_l,t log G_l,t. Because B_l,t is not necessarily non-negative, a different inequality is used depending on the sign of B_l,t. First, because the logarithm is a concave function, when B_l,t ≥ 0 an inequality of the form

is obtained, where φ_l,t is a positive variable and equality in equation (28) holds when

On the other hand, when B_l,t < 0, because the negative logarithm is convex, Jensen's inequality gives

where ν_k,l,ω,t is a variable satisfying ν_k,l,ω,t > 0 and Σ_k,ω ν_k,l,ω,t = 1, and equality in equation (30) holds when

In summary, the bound can be written as

where δ_x is an indicator function that is 1 when condition x is satisfied and 0 otherwise. From the above, for

the upper bound function

is obtained. The key point of deriving this inequality is that the H and U minimizing its right-hand side can be obtained analytically; combining it with

yields an auxiliary function whose update equations are given in closed form. From this auxiliary function, the update equations for each parameter

are obtained, where


Since β_l,n,m,t is an arbitrary positive constant, convergence of the algorithm is guaranteed however it is chosen, as long as β_l,n,m,t > 0 and Σ_l β_l,n,m,t = 1 are satisfied. Therefore, for example, convergence is still guaranteed if β_l,n,m,t is updated at each step of the iteration as

This update minimizes the auxiliary function with respect to β_l,n,m,t.

<Configuration of the Signal Analysis Apparatus According to the Embodiment of the Present Invention>
Next, the configuration of the signal analysis apparatus according to the embodiment of the present invention will be described. As shown in FIG. 1, the signal analysis apparatus 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and programs for executing the learning processing routine and the parameter estimation processing routine described later. Functionally, as shown in FIG. 1, the signal analysis apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 receives time-series data of a clean speech signal in which no other sound source is mixed (hereinafter, a clean speech signal). The input unit 10 also receives time-series data of an acoustic signal in which a speech signal and a noise signal are mixed (hereinafter, an observed signal).

The calculation unit 20 includes a time-frequency expansion unit 24, a feature extraction unit 26, a basis spectrum learning unit 28, a basis spectrum storage unit 30, a feature model learning unit 32, a feature model storage unit 34, a parameter estimation unit 36, and a speech signal generation unit 38.

The time-frequency expansion unit 24 calculates, from the time-series data of the clean speech signal, an amplitude spectrogram or power spectrogram representing the time-frequency component of each frequency at each time. In the first embodiment, a time-frequency expansion such as the short-time Fourier transform or the wavelet transform is performed.

The time-frequency expansion unit 24 also calculates, from the time-series data of the observed signal, the observed spectrogram Y, which is an amplitude spectrogram or power spectrogram representing the observed time-frequency component Y_ω,t of each frequency ω at each time t.
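A time-frequency expansion of the kind performed by the time-frequency expansion unit 24 can be sketched with a short-time Fourier transform. This is a minimal illustration only: the function name, the Hann window, and the frame length and hop size are assumptions, and the input is assumed to be at least one frame long.

```python
import numpy as np

def power_spectrogram(x, frame_len=512, hop=128):
    """Return Y[omega, t] = |STFT(x)|^2 using a Hann window.

    Trailing samples that do not fill a whole frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping windowed frames, one per column.
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # One-sided FFT along the frame axis gives frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=0)) ** 2
```

Taking the square root of the result gives the amplitude spectrogram, the other option mentioned in the description.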

The feature extraction unit 26 extracts the cepstral features at each time from the time-frequency components of each frequency at each time of the clean speech signal calculated by the time-frequency expansion unit 24.

Based on the time-frequency components of each frequency at each time of the clean speech signal calculated by the time-frequency expansion unit 24, the basis spectrum learning unit 28 uses conventional NMF to estimate the basis spectrum

representing the power spectrum of the clean speech signal for each basis k and each frequency ω.

The basis spectrum storage unit 30 stores the basis spectrum

estimated by the basis spectrum learning unit 28, which represents the power spectrum of the clean speech signal for each basis k and each frequency ω.

Based on the cepstral features at each time extracted by the feature extraction unit 26, the feature model learning unit 32 learns parameters representing the probability distribution of the cepstral features, defined in the cepstral space. Specifically, taking this probability distribution to be a Gaussian mixture distribution, it learns the parameters θ of the Gaussian mixture distribution.
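Learning the Gaussian mixture parameters from cepstral feature vectors can be sketched with a basic EM loop. The description does not state which learning procedure is used, so EM, the diagonal covariances, the initialization from randomly chosen frames, and the variance floor are all assumptions of this illustration.

```python
import numpy as np

def fit_diag_gmm(C, M=2, n_iter=50, seed=0, var_floor=1e-6):
    """Fit a diagonal-covariance GMM to C of shape (n_frames, n_ceps) by EM.

    Returns weights (M,), means (M, n_ceps), variances (M, n_ceps).
    """
    rng = np.random.default_rng(seed)
    T, N = C.shape
    # Initialize means from randomly chosen frames (assumed strategy).
    means = C[rng.choice(T, size=M, replace=False)].copy()
    variances = np.tile(C.var(axis=0) + var_floor, (M, 1))
    weights = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each frame.
        diff = C[:, None, :] - means[None, :, :]
        log_p = (np.log(weights)[None, :]
                 - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                 - 0.5 * np.sum(diff ** 2 / variances[None], axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        Nm = resp.sum(axis=0) + 1e-12
        weights = Nm / T
        means = (resp.T @ C) / Nm[:, None]
        variances = (resp.T @ C ** 2) / Nm[:, None] - means ** 2 + var_floor
    return weights, means, variances
```

With M=1 the loop reduces to the sample mean and variance of the training frames, which is a convenient sanity check before fitting a larger mixture.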

The feature model storage unit 34 stores the parameters θ of the Gaussian mixture distribution learned by the feature model learning unit 32.

Based on the observed spectrogram Y output by the time-frequency expansion unit 24, the basis spectrum of the speech signal for each basis and each frequency stored in the basis spectrum storage unit 30, and the parameters θ representing the probability distribution of the cepstral features stored in the feature model storage unit 34, the parameter estimation unit 36 estimates the basis spectra H^(s) and activation parameters U^(s) of the speech signal and the basis spectra H^(n) and activation parameters U^(n) of the noise signal so as to reduce the criterion shown in equation (7) above. This criterion is expressed using the distance

between the observed time-frequency components Y at each time and each frequency and the sum X of the time-frequency components X^(s) at each time and each frequency obtained from the basis spectra and activation parameters of the speech signal and the time-frequency components X^(n) at each time and each frequency obtained from the basis spectra and activation parameters of the noise signal, together with a regularization term representing the likelihood

of the cepstral features of the time-frequency components X^(s) at each time and each frequency, based on the probability distribution of the cepstral features of the speech signal.

Specifically, the parameter estimation unit 36 includes an initial value setting unit 40, an auxiliary variable update unit 42, a parameter update unit 44, and a convergence determination unit 46.

The initial value setting unit 40 sets, as the initial value of the basis spectra H^(s) of the speech signal, the basis spectrum of the speech signal for each basis and each frequency stored in the basis spectrum storage unit 30. The initial value setting unit 40 also sets initial values for the activation parameters U^(s) of the speech signal and for the basis spectra H^(n) and activation parameters U^(n) of the noise signal.

The auxiliary variable updating unit 42 updates the auxiliary variables according to equations (13), (15), (24), (27), (29), (31), and (37) above, based on the parameter θ of the probability distribution of the cepstrum feature stored in the feature amount model storage unit 34 and on the initial or most recently updated values of the basis spectrum H^(s) and activation parameter U^(s) of the speech signal and the basis spectrum H⁽ⁿ⁾ and activation parameter U⁽ⁿ⁾ of the noise signal. The auxiliary variables updated are ζ_k,ω,t for each basis k, each frequency ω, and each time t; α_m,t for each normal distribution m and each time t; ξ_l,t and φ_l,t for each mel filter bank coefficient l and each time t; ρ_l,k,ω,t and ν_k,l,ω,t for each mel filter bank coefficient l, each basis k, each frequency ω, and each time t; and β_l,m,n,t for each mel filter bank coefficient l, each normal distribution m, each mel-frequency cepstral coefficient n, and each time t.

The parameter updating unit 44 updates the parameters according to equations (33) to (36) above, based on the observation spectrogram Y output by the time-frequency expansion unit 24 and on the auxiliary variables updated by the auxiliary variable updating unit 42, namely ζ_k,ω,t for each basis k, each frequency ω, and each time t; α_m,t for each normal distribution m and each time t; ξ_l,t and φ_l,t for each mel filter bank coefficient l and each time t; ρ_l,k,ω,t and ν_k,l,ω,t for each mel filter bank coefficient l, each basis k, each frequency ω, and each time t; and β_l,m,n,t for each mel filter bank coefficient l, each normal distribution m, each mel-frequency cepstral coefficient n, and each time t. The parameters updated are the basis spectrum H^(s)_k,ω for each basis k and each frequency ω and the activation parameter U^(s)_k,t for each basis k and each time t of the speech signal, and the basis spectrum H⁽ⁿ⁾_k,ω for each basis k and each frequency ω and the activation parameter U⁽ⁿ⁾_k,t for each basis k and each time t of the noise signal.

The convergence determination unit 46 determines whether a convergence condition is satisfied, and causes the updating by the auxiliary variable updating unit 42 and the updating by the parameter updating unit 44 to be repeated until the convergence condition is satisfied.

As the convergence condition, for example, the number of iterations reaching an upper limit can be used. Alternatively, the convergence condition can be that the difference between the current value of the criterion of equation (7) above and its previous value is equal to or less than a predetermined threshold.
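The alternation between the two updating units and the two convergence conditions can be illustrated with a small driver loop. This is a minimal sketch, not the patented update equations: the function name `run_until_converged` and the callbacks `update_aux`, `update_params`, and `criterion` are hypothetical stand-ins for equations (13) to (37), (33) to (36), and (7) respectively.

```python
from typing import Any, Callable, Tuple


def run_until_converged(
    params: Any,
    update_aux: Callable[[Any], Any],
    update_params: Callable[[Any, Any], Any],
    criterion: Callable[[Any], float],
    max_iters: int = 100,
    tol: float = 1e-6,
) -> Tuple[Any, int]:
    """Alternate auxiliary-variable and parameter updates until convergence.

    Returns the final parameters and the number of iterations performed.
    """
    prev = criterion(params)
    for it in range(1, max_iters + 1):
        aux = update_aux(params)             # role of the auxiliary variable updating unit
        params = update_params(params, aux)  # role of the parameter updating unit
        cur = criterion(params)
        # Condition 2: change in the criterion is at most the threshold.
        if abs(prev - cur) <= tol:
            return params, it
        prev = cur
    # Condition 1: the upper limit on the number of iterations was reached.
    return params, max_iters
```

Both conditions are checked in the same loop, so whichever is met first ends the iteration.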

The speech signal generation unit 38 generates a speech signal with a Wiener filter, based on the basis spectrum H^(s)_k,ω of the speech signal for each basis k and each frequency ω and the basis spectrum H⁽ⁿ⁾_k,ω of the noise signal for each basis k and each frequency ω obtained by the parameter estimation unit 36, and outputs the speech signal from the output unit 90.
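A Wiener filter of this kind is commonly applied as a per-bin gain on the complex spectrogram of the observation. The following is a minimal sketch under stated assumptions: the estimated power models are formed as products of basis spectra and activations, arrays are laid out as H with shape (K, F) and U with shape (K, T), and the function name `wiener_enhance` and the small constant `eps` are illustrative choices, not taken from the specification.

```python
import numpy as np


def wiener_enhance(Y_complex, H_s, U_s, H_n, U_n, eps=1e-12):
    """Apply a Wiener-type gain to a complex spectrogram.

    Y_complex: (F, T) complex short-time spectrum of the observation.
    H_s, H_n:  (K_s, F) and (K_n, F) nonnegative basis spectra.
    U_s, U_n:  (K_s, T) and (K_n, T) nonnegative activations.
    """
    X_s = H_s.T @ U_s  # (F, T) modeled speech power
    X_n = H_n.T @ U_n  # (F, T) modeled noise power
    gain = X_s / (X_s + X_n + eps)  # gain lies in [0, 1)
    return gain * Y_complex
```

The time-domain signal would then be obtained by an inverse short-time Fourier transform of the returned array, which is omitted here.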

<Operation of the Signal Analysis Device According to the Embodiment of the Present Invention>
Next, the operation of the signal analysis device 100 according to the embodiment of the present invention will be described. First, when the input unit 10 receives time-series data of a clean speech signal, the signal analysis device 100 executes the learning processing routine shown in FIG. 2.

First, in step S100, the time-frequency component of each frequency at each time of the clean speech signal is calculated based on the time-series data of the clean speech signal received by the input unit 10.

Next, in step S102, the cepstrum feature at each time is extracted based on the time-frequency components of each frequency at each time of the clean speech signal obtained in step S100.
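Cepstrum features such as MFCCs are typically obtained by applying a mel filter bank to the power spectrum, taking the logarithm, and applying a discrete cosine transform. The sketch below illustrates that generic pipeline only. The number of filters (26), the HTK-style mel formula, the triangular filter shape, and the unnormalized DCT-II are all assumptions made here, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters, shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    fb = np.zeros((n_filters, n_bins))
    for l in range(n_filters):
        lo, mid, hi = edges[l], edges[l + 1], edges[l + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[l] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def mfcc(power_spec, fb, n_ceps=13, eps=1e-10):
    """power_spec: (F, T) power spectrogram. Returns (n_ceps, T) cepstral features."""
    n_filters = fb.shape[0]
    log_mel = np.log(fb @ power_spec + eps)        # (n_filters, T) log mel energies
    n = np.arange(n_ceps)[:, None]
    l = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (l + 0.5) / n_filters)  # unnormalized DCT-II matrix
    return dct @ log_mel
```

For example, `mfcc(P, mel_filterbank(26, 512, 16000))` maps a (257, T) power spectrogram P to a (13, T) feature sequence.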

Next, in step S104, based on the time-frequency components of each frequency at each time of the clean speech signal obtained in step S100, the basis spectrum representing the power spectrum at each basis k and each frequency ω of the clean speech signal is estimated by NMF, which is a conventional technique, and is stored in the basis spectrum storage unit 30.
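Conventional NMF can be illustrated with multiplicative updates. The text does not say which divergence is used in this step, so the sketch below assumes the generalized Kullback-Leibler (I-) divergence; the function names, the random initialization, and the array layout (H with shape (K, F), U with shape (K, T), model H.T @ U) are likewise illustrative assumptions.

```python
import numpy as np


def i_divergence(Y, X, eps=1e-12):
    """Generalized KL divergence between nonnegative arrays Y and X."""
    return float(np.sum(Y * np.log((Y + eps) / (X + eps)) - Y + X))


def nmf_kl(Y, n_bases, n_iters=100, seed=0, eps=1e-12):
    """Factorize a nonnegative (F, T) spectrogram as Y ~ H.T @ U.

    Returns H with shape (n_bases, F) and U with shape (n_bases, T).
    """
    rng = np.random.default_rng(seed)
    F, T = Y.shape
    H = rng.random((n_bases, F)) + eps
    U = rng.random((n_bases, T)) + eps
    for _ in range(n_iters):
        X = H.T @ U + eps
        # Update each basis spectrum H[k, w] with activations held fixed.
        H *= (U @ (Y / X).T) / (U.sum(axis=1, keepdims=True) + eps)
        X = H.T @ U + eps
        # Update each activation U[k, t] with basis spectra held fixed.
        U *= (H @ (Y / X)) / (H.sum(axis=1, keepdims=True) + eps)
    return H, U
```

In the learning routine, only the basis spectra H learned from clean speech would be kept; the activations are specific to the training utterances.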

In step S106, based on the cepstrum feature at each time extracted in step S102, the parameter θ of the mixture of normal distributions representing the probability distribution of the cepstrum feature is learned and stored in the feature amount model storage unit 34, and the learning processing routine ends.

Next, when the input unit 10 receives time-series data of an observation signal in which a speech signal and a noise signal are mixed, the signal analysis device 100 executes the parameter estimation processing routine shown in FIG. 3.

First, in step S120, the observation spectrogram Y is calculated based on the time-series data of the observation signal received by the input unit 10.
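One way to compute such a spectrogram is a short-time Fourier transform. The sketch below uses the frame length (32 ms) and frame shift (16 ms) at 16 kHz reported in the experimental example; the Hann window, the use of squared magnitude as Y, the absence of padding, and the function name `power_spectrogram` are assumptions made for illustration.

```python
import numpy as np


def power_spectrogram(x, sr=16000, frame_ms=32, shift_ms=16):
    """Return (power, complex_spec), each of shape (frame // 2 + 1, n_frames).

    Assumes len(x) is at least one frame long; trailing samples that do not
    fill a whole frame are dropped.
    """
    frame = int(sr * frame_ms / 1000)  # 512 samples at 16 kHz
    hop = int(sr * shift_ms / 1000)    # 256 samples at 16 kHz
    n_frames = 1 + (len(x) - frame) // hop
    window = np.hanning(frame)
    frames = np.stack(
        [x[i * hop : i * hop + frame] * window for i in range(n_frames)], axis=1
    )
    spec = np.fft.rfft(frames, axis=0)
    return np.abs(spec) ** 2, spec
```

With these settings, one second of audio yields 257 frequency bins and 61 frames.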

In step S122, the basis spectrum at each basis and each frequency of the speech signal stored in the basis spectrum storage unit 30 is set as the initial value of the basis spectrum H^(s) of the speech signal. Initial values are also set for the activation parameter U^(s) of the speech signal and for the basis spectrum H⁽ⁿ⁾ and activation parameter U⁽ⁿ⁾ of the noise signal.

In step S124, the auxiliary variables are updated according to equations (13), (15), (24), (27), (29), (31), and (37) above, based on the parameter θ of the probability distribution of the cepstrum feature stored in the feature amount model storage unit 34 and on the basis spectrum H^(s) and activation parameter U^(s) of the speech signal and the basis spectrum H⁽ⁿ⁾ and activation parameter U⁽ⁿ⁾ of the noise signal, whose values are either the initial values set in step S122 or the values last updated in step S126 described later. The auxiliary variables updated are ζ_k,ω,t for each basis k, each frequency ω, and each time t; α_m,t for each normal distribution m and each time t; ξ_l,t and φ_l,t for each mel filter bank coefficient l and each time t; ρ_l,k,ω,t and ν_k,l,ω,t for each mel filter bank coefficient l, each basis k, each frequency ω, and each time t; and β_l,m,n,t for each mel filter bank coefficient l, each normal distribution m, each mel-frequency cepstral coefficient n, and each time t.

Next, in step S126, the parameters are updated according to equations (33) to (36) above, based on the observation spectrogram Y obtained in step S120 and on the auxiliary variables updated in step S124, namely ζ_k,ω,t for each basis k, each frequency ω, and each time t; α_m,t for each normal distribution m and each time t; ξ_l,t and φ_l,t for each mel filter bank coefficient l and each time t; ρ_l,k,ω,t and ν_k,l,ω,t for each mel filter bank coefficient l, each basis k, each frequency ω, and each time t; and β_l,m,n,t for each mel filter bank coefficient l, each normal distribution m, each mel-frequency cepstral coefficient n, and each time t. The parameters updated are the basis spectrum H^(s)_k,ω for each basis k and each frequency ω and the activation parameter U^(s)_k,t for each basis k and each time t of the speech signal, and the basis spectrum H⁽ⁿ⁾_k,ω for each basis k and each frequency ω and the activation parameter U⁽ⁿ⁾_k,t for each basis k and each time t of the noise signal.

Next, in step S128, it is determined whether the convergence condition is satisfied. If it is satisfied, the process proceeds to step S130. If it is not satisfied, the process returns to step S124, and the processing of steps S124 to S128 is repeated.

In step S130, a speech signal is generated with a Wiener filter, based on the basis spectrum H^(s)_k,ω of the speech signal for each basis k and each frequency ω and the basis spectrum H⁽ⁿ⁾_k,ω of the noise signal for each basis k and each frequency ω as finally updated in step S126, the speech signal is output from the output unit 90, and the parameter estimation processing routine ends.

<Experimental Example>
An evaluation experiment was conducted to verify the noise suppression effect of the method of the embodiment described above, using speech data of 503 sentences from the ATR speech database and noise data from RWCP (four types: white noise, babble noise, museum noise, and background music noise). The conventional SSNMF method was used for comparison, and the improvements in signal-to-distortion ratio (SDR) and in cepstrum distortion between before and after processing were evaluated. The test data were created by superimposing each noise on clean speech at various SNRs. All acoustic signals in the test data were monaural signals with a sampling frequency of 16 kHz, and the observation spectrogram Y_ω,t was calculated by a short-time Fourier transform with a frame length of 32 ms and a frame shift of 16 ms. In training, H^(s)_k,ω and the GMM parameter θ of the MFCCs were learned from a total of 450 sentences spoken by 10 speakers (4 female and 6 male). The MFCC dimension was 13 and the number of GMM mixture components was 30. In testing, H^(s)_k,ω and θ obtained in training were fixed, and U^(s)_k,t, H⁽ⁿ⁾_k,ω, and U⁽ⁿ⁾_k,t were estimated with λ = 1. After estimation, the estimate of the speech signal was calculated by a Wiener filter using X^(s)_ω,t and X⁽ⁿ⁾_ω,t. The initial values for the proposed algorithm were obtained by conventional SSNMF.
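Superimposing noise on clean speech at a prescribed SNR amounts to rescaling the noise before adding it. The following sketch shows one common way to do this; the function name `mix_at_snr`, the use of mean power over the whole utterance, and the truncation of the noise to the speech length are assumptions, since the text does not describe how the mixtures were produced.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`.

    Assumes `noise` is at least as long as `speech`.
    Returns (mixture, scaled_noise).
    """
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the scale so that 10 * log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = scale * noise
    return speech + scaled, scaled
```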

The improvements in cepstrum distortion and SDR obtained by the proposed method and the conventional method under the above conditions are shown in FIGS. 4 to 7.

FIG. 4 shows the improvement in cepstrum distortion obtained by the proposed method and the conventional method. The upper part of FIG. 4 shows the case where the noise is white noise, and the lower part shows the case where the noise is babble noise. FIG. 5 also shows the improvement in cepstrum distortion obtained by the proposed method and the conventional method. The upper part of FIG. 5 shows the case where the noise is real-environment noise, and the lower part shows the case where the noise is background music noise. FIG. 6 shows the improvement in SDR obtained by the proposed method and the conventional method. The upper part of FIG. 6 shows the case where the noise is white noise, and the lower part shows the case where the noise is babble noise. FIG. 7 also shows the improvement in SDR obtained by the proposed method and the conventional method. The upper part of FIG. 7 shows the case where the noise is real-environment noise, and the lower part shows the case where the noise is background music noise. For both evaluation measures, the proposed method achieved a higher improvement than the conventional method in most cases.
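To make the notion of an "improvement value" concrete, the sketch below computes a simplified SDR and its change from before to after processing. This is an assumption-laden illustration: the text does not state which SDR definition was used in the experiment (toolkits such as BSS Eval use a more elaborate decomposition), and the function names are hypothetical.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over energy of the error."""
    err = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(err ** 2) + eps))


def sdr_improvement(clean, noisy, enhanced):
    """SDR after processing minus SDR before processing, in dB."""
    return sdr_db(clean, enhanced) - sdr_db(clean, noisy)
```

Under this definition, reducing the residual noise amplitude by a factor of ten corresponds to an improvement of 20 dB.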

As described above, the signal analysis device according to the embodiment of the present invention estimates the activation parameter of the speech signal and the basis spectrum and activation parameter of the noise signal so as to reduce a criterion expressed using two terms, and then generates the speech signal. The first term is the distance between the observed time-frequency component at each time and each frequency and the sum of the time-frequency component at each time and each frequency obtained from the basis spectrum of the speech signal estimated in advance and the activation parameter of the speech signal, and the time-frequency component at each time and each frequency obtained from the basis spectrum and activation parameter of the noise signal. The second term is a regularization term representing, based on the probability distribution of the cepstrum feature of the speech signal, the likelihood of the cepstrum feature of the time-frequency component at each time and each frequency obtained from the basis spectrum and activation parameter of the speech signal. This makes it possible to suppress noise and enhance the speech signal while also enhancing the cepstrum feature.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the processing of learning the basis spectrum of the speech signal and the parameter representing the probability distribution of the cepstrum feature of the speech signal, and the parameter estimation that estimates the basis spectra and activation parameters from the observation signal, may be performed by separate devices.

In addition, because the order in which the parameters are updated is arbitrary, it is not limited to the order in the embodiment described above.

Although the description above takes as an example the case where the basis spectrum of the speech signal is updated in the same way as the activation parameter of the speech signal and the basis spectrum and activation parameter of the noise signal, the invention is not limited to this. The basis spectrum of the speech signal may be left without updating and fixed to the basis spectrum of the speech signal estimated in advance.
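The variant in which the speech basis spectrum stays fixed can be illustrated with plain semi-supervised NMF. The sketch below is not the update rule of equations (33) to (36): it omits the cepstral regularization term entirely and assumes the generalized KL divergence with multiplicative updates. The function name `ssnmf_kl`, the random initialization, and the array layout are hypothetical.

```python
import numpy as np


def ssnmf_kl(Y, H_s, n_noise_bases, n_iters=100, seed=0, eps=1e-12):
    """Semi-supervised NMF with a fixed speech basis.

    Y:   (F, T) nonnegative observation spectrogram.
    H_s: (K_s, F) pre-trained speech basis spectra, never modified.
    Returns (U_s, H_n, U_n) with shapes (K_s, T), (K_n, F), (K_n, T).
    """
    rng = np.random.default_rng(seed)
    F, T = Y.shape
    K_s = H_s.shape[0]
    H_n = rng.random((n_noise_bases, F)) + eps
    U = rng.random((K_s + n_noise_bases, T)) + eps
    for _ in range(n_iters):
        H = np.vstack([H_s, H_n])  # stacked speech and noise bases
        X = H.T @ U + eps
        # Update all activations (speech and noise) with the bases held fixed.
        U *= (H @ (Y / X)) / (H.sum(axis=1, keepdims=True) + eps)
        X = H.T @ U + eps
        # Update only the noise basis; the speech basis H_s is left as given.
        U_n = U[K_s:]
        H_n *= (U_n @ (Y / X).T) / (U_n.sum(axis=1, keepdims=True) + eps)
    return U[:K_s], H_n, U[K_s:]
```

A factorization of this kind is also what the experimental example describes as the source of initial values for the proposed algorithm, although the exact SSNMF settings are not given in the text.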

Furthermore, although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium, or provided via a network.

10 Input unit
20 Operation unit
24 Time-frequency expansion unit
26 Feature amount extraction unit
28 Basis spectrum learning unit
30 Basis spectrum storage unit
32 Feature amount model learning unit
34 Feature amount model storage unit
36 Parameter estimation unit
38 Speech signal generation unit
40 Initial value setting unit
42 Auxiliary variable updating unit
44 Parameter updating unit
46 Convergence determination unit
90 Output unit
100 Signal analysis device

Claims

A signal analysis apparatus comprising:
a time-frequency expansion unit that receives, as input, time-series data of an observation signal in which a speech signal and a noise signal are mixed, and outputs an observation spectrogram representing an observed time-frequency component at each time and each frequency; and
a parameter estimation unit that, based on the observation spectrogram output by the time-frequency expansion unit, a basis spectrum learned in advance that represents a power spectrum at each basis and each frequency of the speech signal, and a parameter learned in advance that represents a probability distribution, defined in a cepstrum space, of a cepstrum feature of the speech signal,
estimates an activation parameter of the speech signal, and a basis spectrum and an activation parameter of the noise signal, so as to reduce a criterion expressed using: a distance between the observed time-frequency component at each time and each frequency and a sum of the time-frequency component at each time and each frequency obtained from the basis spectrum of the speech signal and the activation parameter representing power at each time of the speech signal, and the time-frequency component at each time and each frequency obtained from the basis spectrum and the activation parameter of the noise signal; and a regularization term representing, based on the probability distribution of the cepstrum feature of the speech signal, a likelihood of the cepstrum feature of the time-frequency component at each time and each frequency obtained from the basis spectrum and the activation parameter of the speech signal.

The signal analysis apparatus according to claim 1, wherein the criterion is expressed by the following equation.

where U^(s) represents the activation parameter of the speech signal, H⁽ⁿ⁾ represents the basis spectrum of the noise signal, U⁽ⁿ⁾ represents the activation parameter of the noise signal, Y represents the observation spectrogram, and X represents the sum of the time-frequency component X^(s) of each frequency obtained from the basis spectrum of the speech signal and the activation parameter U^(s), and the time-frequency component of each frequency obtained from the basis spectrum H⁽ⁿ⁾ and the activation parameter U⁽ⁿ⁾ of the noise signal,

represents the distance between the observation spectrogram Y and the sum X,

represents the likelihood of the cepstrum feature of the time-frequency component X^(s) of each frequency obtained from the basis spectrum of the speech signal and the activation parameter U^(s), and λ is a regularization parameter.

The signal analysis apparatus according to claim 1 or 2, wherein the parameter estimation unit comprises:
a parameter updating unit that updates the activation parameter of the speech signal, and the basis spectrum and the activation parameter of the noise signal, so as to reduce an auxiliary function that is an upper-bound function of the criterion; and
a convergence determination unit that causes the updating by the parameter updating unit to be repeated until a predetermined convergence condition is satisfied.

A signal analysis method in a signal analysis apparatus including a time-frequency expansion unit and a parameter estimation unit, wherein:
the time-frequency expansion unit receives, as input, time-series data of an observation signal in which a speech signal and a noise signal are mixed, and outputs an observation spectrogram representing an observed time-frequency component at each time and each frequency; and
the parameter estimation unit, based on the observation spectrogram output by the time-frequency expansion unit, a basis spectrum learned in advance that represents a power spectrum at each basis and each frequency of the speech signal, and a parameter learned in advance that represents a probability distribution, defined in a cepstrum space, of a cepstrum feature of the speech signal,
estimates an activation parameter of the speech signal, and a basis spectrum and an activation parameter of the noise signal, so as to reduce a criterion expressed using: a distance between the observed time-frequency component at each time and each frequency and a sum of the time-frequency component at each time and each frequency obtained from the basis spectrum of the speech signal and the activation parameter representing power at each time of the speech signal, and the time-frequency component at each time and each frequency obtained from the basis spectrum and the activation parameter of the noise signal; and a regularization term representing, based on the probability distribution of the cepstrum feature of the speech signal, a likelihood of the cepstrum feature of the time-frequency component at each time and each frequency obtained from the basis spectrum and the activation parameter of the speech signal.

The signal analysis method according to claim 4, wherein the criterion is expressed by the following equation.

where U^(s) represents the activation parameter of the speech signal, H⁽ⁿ⁾ represents the basis spectrum of the noise signal, U⁽ⁿ⁾ represents the activation parameter of the noise signal, Y represents the observation spectrogram, and X represents the sum of the time-frequency component X^(s) of each frequency obtained from the basis spectrum of the speech signal and the activation parameter U^(s), and the time-frequency component of each frequency obtained from the basis spectrum H⁽ⁿ⁾ and the activation parameter U⁽ⁿ⁾ of the noise signal,

represents the distance between the observation spectrogram Y and the sum X,

represents the likelihood of the cepstrum feature of the time-frequency component X^(s) of each frequency obtained from the basis spectrum of the speech signal and the activation parameter U^(s), and λ is a regularization parameter.

The signal analysis method according to claim 4 or 5, wherein, in the estimation by the parameter estimation unit:
a parameter updating unit updates the activation parameter of the speech signal, and the basis spectrum and the activation parameter of the noise signal, so as to reduce an auxiliary function that is an upper-bound function of the criterion; and
a convergence determination unit causes the updating by the parameter updating unit to be repeated until a predetermined convergence condition is satisfied.

A program for causing a computer to function as each unit of the signal analysis apparatus according to any one of claims 1 to 3.