JP2017151222A

JP2017151222A - Signal analysis device, method, and program

Info

Publication number: JP2017151222A
Application number: JP2016032396A
Authority: JP
Inventors: Hirokazu Kameoka (亀岡 弘和); Li Li (李 莉)
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2016-02-23
Filing date: 2016-02-23
Publication date: 2017-08-31
Anticipated expiration: 2036-02-23
Also published as: JP6521886B2

Abstract

PROBLEM TO BE SOLVED: To suppress noise, enhance a speech signal, and enhance cepstrum features. SOLUTION: A parameter estimation unit 36 estimates the activation parameters of a speech signal and the basis spectra and activation parameters of a noise signal so as to reduce a criterion expressed using (a) a distance between the observed time-frequency components and the sum of the time-frequency components calculated from pre-estimated basis spectra of the speech signal and the activation parameters of the speech signal and the time-frequency components calculated from the basis spectra and activation parameters of the noise signal, and (b) a regularization term that represents, based on a probability distribution of the cepstrum features of the speech signal, the likelihood of the cepstrum features of the time-frequency components calculated from the basis spectra and activation parameters of the speech signal. A speech signal generation unit 38 generates the speech signal from the basis spectra of the speech signal and the noise signal. SELECTED DRAWING: Figure 1

Description

The present invention relates to a signal analysis apparatus, method, and program, and more particularly to a signal analysis apparatus, method, and program for estimating parameters.

The present invention addresses the problem of suppressing noise in speech signals. Noise mixed into a speech signal not only degrades the quality of voice communication but also lowers the performance of various speech processing tasks such as speech recognition and voice conversion. Various speech enhancement methods have been proposed to solve this problem.

Speech enhancement methods are broadly divided into unsupervised, supervised, and semi-supervised approaches.

The supervised approach assumes that samples of both the target speech and the target noise are available in advance, the semi-supervised approach assumes that only samples of the target speech are available in advance, and the unsupervised approach assumes that neither is available. Methods are also broadly divided according to whether the object to be enhanced is a signal (or spectrum) or a feature. Representative supervised feature enhancement approaches include the Vector Taylor Series (VTS) method, Stereo Piecewise Linear Compensation for Environment (SPLICE), and methods using a Denoising Autoencoder (DAE).

The VTS method constructs a conversion function from noisy speech features to clean speech features by taking a first-order approximation, in the feature space, of the linear superposition of speech and noise.

SPLICE models the joint probability density function of noisy and clean speech features with a Gaussian mixture model (GMM), and constructs a conversion function from noisy speech features to clean speech features using GMM parameters learned from training samples.

The DAE method constructs a conversion function between input and output with a deep neural network that takes noisy speech features as input and outputs clean speech features. Because these supervised speech enhancement approaches are based on discriminative models and discriminative criteria, they are extremely powerful in known noise environments but are not necessarily effective in unknown noise environments. That said, many methods have been proposed to compensate for the mismatch when the speech or noise in the training data differs from that at test time.

On the other hand, a method based on semi-supervised non-negative matrix factorization (SSNMF) (Non-Patent Document 1), a representative semi-supervised signal enhancement approach, has recently attracted attention as a powerful speech enhancement method for unknown noise environments. It is based on the principle that the power spectra of speech and noise can be estimated by fitting the observed spectrum at each time with a non-negative combination of pre-trained speech basis spectra and noise basis spectra.

Non-Patent Document 1: P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Proc. Independent Component Analysis and Signal Separation, pp. 414-421, 2007.

However, when the noise spectrum can be explained by the speech basis spectra, or vice versa, the estimated spectrum may not correspond to the actual speech spectrum. To resolve this ambiguity in decomposing the speech and noise spectra, a stronger constraint that the speech spectrum must satisfy is needed. Moreover, even if the SSNMF method enhances the signal, there is no guarantee that it enhances the features, so the enhancement does not necessarily translate directly into improved performance of feature-based speech processing such as speech recognition and voice conversion.

The present invention has been made in view of the above circumstances, and its object is to provide a signal analysis apparatus, method, and program capable of suppressing noise, enhancing a speech signal, and enhancing cepstrum features.

To achieve the above object, a signal analysis apparatus according to the present invention includes a time-frequency expansion unit and a parameter estimation unit. The time-frequency expansion unit receives time-series data of an observed signal in which a speech signal and a noise signal are mixed, and outputs an observed spectrogram representing the observed time-frequency component at each time and each frequency. The parameter estimation unit estimates the activation parameters of the speech signal and the basis spectra and activation parameters of the noise signal, based on the observed spectrogram output by the time-frequency expansion unit, pre-trained basis spectra of the speech signal representing the power spectrum for each basis and each frequency, and pre-trained parameters representing a probability distribution, defined in the cepstrum space, of the cepstrum features of the speech signal. The estimation is performed so as to reduce a criterion expressed using (i) a distance between the observed time-frequency component at each time and each frequency and the sum of two time-frequency components, one obtained from the basis spectra of the speech signal and the activation parameters representing the power of the speech signal at each time, the other obtained from the basis spectra and activation parameters of the noise signal, and (ii) a regularization term that represents, based on the probability distribution of the cepstrum features of the speech signal, the likelihood of the cepstrum features of the time-frequency components obtained from the basis spectra and activation parameters of the speech signal.

A signal analysis method according to the present invention is a signal analysis method in a signal analysis apparatus including a time-frequency expansion unit and a parameter estimation unit. The time-frequency expansion unit receives time-series data of an observed signal in which a speech signal and a noise signal are mixed, and outputs an observed spectrogram representing the observed time-frequency component at each time and each frequency. The parameter estimation unit estimates the activation parameters of the speech signal and the basis spectra and activation parameters of the noise signal, based on the observed spectrogram output by the time-frequency expansion unit, pre-trained basis spectra of the speech signal representing the power spectrum for each basis and each frequency, and pre-trained parameters representing a probability distribution, defined in the cepstrum space, of the cepstrum features of the speech signal. The estimation is performed so as to reduce a criterion expressed using (i) a distance between the observed time-frequency component at each time and each frequency and the sum of two time-frequency components, one obtained from the basis spectra of the speech signal and the activation parameters representing the power of the speech signal at each time, the other obtained from the basis spectra and activation parameters of the noise signal, and (ii) a regularization term that represents, based on the probability distribution of the cepstrum features of the speech signal, the likelihood of the cepstrum features of the time-frequency components obtained from the basis spectra and activation parameters of the speech signal.

The program of the present invention is a program for causing a computer to function as each unit of the signal analysis apparatus described above.

As described above, according to the signal analysis apparatus, method, and program of the present invention, the activation parameters of the speech signal and the basis spectra and activation parameters of the noise signal are estimated so as to reduce a criterion expressed using (i) a distance between the observed time-frequency component at each time and each frequency and the sum of two time-frequency components, one obtained from the basis spectra of the speech signal and the activation parameters representing the power of the speech signal at each time, the other obtained from the basis spectra and activation parameters of the noise signal, and (ii) a regularization term that represents, based on the probability distribution of the cepstrum features of the speech signal, the likelihood of the cepstrum features of the time-frequency components obtained from the basis spectra and activation parameters of the speech signal. As a result, noise can be suppressed, the speech signal can be enhanced, and the cepstrum features can be enhanced.

A block diagram showing the functional configuration of a signal analysis apparatus according to an embodiment of the present invention.
A flowchart showing the learning processing routine in the signal analysis apparatus according to the embodiment of the present invention.
A flowchart showing the parameter estimation processing routine in the signal analysis apparatus according to the embodiment of the present invention.
A diagram showing experimental results.
A diagram showing experimental results.
A diagram showing experimental results.
A diagram showing experimental results.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of the invention>
First, an outline of the present embodiment will be described. To solve the above-described problem of the SSNMF method, the present embodiment proposes a regularization method for SSNMF that uses prior information about the features as well as the spectrum, thereby providing more clues for estimating the speech spectrum and making feature distortion less likely.

<Principle of this embodiment>
Next, the principle of this embodiment will be described.

<Formulation of the problem>
Let $Y_{\omega,t}$ denote the amplitude spectrogram or power spectrogram of the observed signal (hereinafter, the observed spectrogram), where $\omega$ and $t$ are the frequency and time indices. Assuming additivity of spectra, the speech spectrum $X^{(s)}_{\omega,t}$ and the noise spectrum $X^{(n)}_{\omega,t}$ at each time are assumed to be expressible as non-negative combinations of $K_s$ speech basis spectra and $K_n$ noise basis spectra, respectively.
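The model described in this passage can be written out as follows. This is a reconstruction from the surrounding definitions, not the original equations, which are not preserved in this text. The symbols $H$ and $U$ for the basis spectra and activation parameters follow the notation used later in the description; the summation ranges and the explicit sum $X_{\omega,t}$ are assumptions.

```latex
\begin{align}
X^{(s)}_{\omega,t} &= \sum_{k=1}^{K_s} H^{(s)}_{k,\omega}\, U^{(s)}_{k,t}, &
X^{(n)}_{\omega,t} &= \sum_{k=1}^{K_n} H^{(n)}_{k,\omega}\, U^{(n)}_{k,t}, \\
X_{\omega,t} &= X^{(s)}_{\omega,t} + X^{(n)}_{\omega,t}, &
H_{k,\omega} &\ge 0,\quad U_{k,t} \ge 0 .
\end{align}
```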

The SSNMF method estimates the speech and noise components contained in the observed spectrogram by fitting the model spectrum $X_{\omega,t}$, the sum of $X^{(s)}_{\omega,t}$ and $X^{(n)}_{\omega,t}$, to the observed spectrum $Y_{\omega,t}$, using speech basis spectra pre-trained from clean speech training samples. From the resulting estimates of the speech and noise spectra, the speech signal can be obtained from the observed signal with, for example, a Wiener filter. In this approach the pre-trained speech basis spectra serve as the clue for separating speech from noise. However, because the noise spectrum may be explainable by the speech basis spectra and vice versa, $X^{(s)}_{\omega,t}$ and $X^{(n)}_{\omega,t}$ do not necessarily correspond to the actual speech and noise spectra even if the error between $Y_{\omega,t}$ and $X_{\omega,t}$ is made small. A stronger constraint that the speech spectrum must satisfy is therefore needed to resolve the ambiguity among pairs $X^{(s)}_{\omega,t}$, $X^{(n)}_{\omega,t}$ that give the same $X_{\omega,t}$. If $X^{(s)}_{\omega,t}$ corresponds to a speech spectrum, it should also be distributed, in the feature space, within the range that speech can actually take. The present embodiment therefore focuses on cepstrum features, considers a regularization term for $X^{(s)}_{\omega,t}$ based on a probability distribution defined in the cepstrum space, and proposes a parameter optimization algorithm whose criterion is the sum of this term and the error criterion between $Y_{\omega,t}$ and $X_{\omega,t}$.
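The fitting step described here can be sketched in Python as follows. This is a minimal illustration, not the method of Non-Patent Document 1 or of this patent: it uses the standard multiplicative updates for the I-divergence with the speech bases held fixed, and the function names (`ssnmf_kl`, `wiener_mask`, `i_divergence`), the random initialization, and the default number of noise bases are all assumptions made for the example.

```python
import numpy as np

def i_divergence(Y, X, eps=1e-12):
    """I-divergence (generalized KL) between non-negative arrays Y and X."""
    return float(np.sum(Y * np.log((Y + eps) / (X + eps)) - Y + X))

def ssnmf_kl(Y, H_s, K_n=4, n_iter=200, eps=1e-12, seed=0):
    """Semi-supervised NMF: speech bases H_s are fixed, the rest is estimated.

    Y   : (F, T) non-negative observed spectrogram
    H_s : (F, K_s) pre-trained speech basis spectra (not updated)
    Returns the speech and noise spectrogram estimates, both (F, T).
    """
    rng = np.random.default_rng(seed)
    F, T = Y.shape
    K_s = H_s.shape[1]
    H_n = rng.random((F, K_n)) + eps          # noise bases, random init
    U = rng.random((K_s + K_n, T)) + eps      # activations of all bases
    for _ in range(n_iter):
        H = np.hstack([H_s, H_n])
        X = H @ U + eps
        # Multiplicative update of all activations
        U *= (H.T @ (Y / X)) / (H.sum(axis=0)[:, None] + eps)
        X = H @ U + eps
        # Multiplicative update of the noise bases only
        U_n = U[K_s:]
        H_n *= ((Y / X) @ U_n.T) / (U_n.sum(axis=1)[None, :] + eps)
    return H_s @ U[:K_s], H_n @ U[K_s:]

def wiener_mask(Y, X_s, X_n, eps=1e-12):
    """Wiener-like soft mask applied to the observed spectrogram."""
    return Y * X_s / (X_s + X_n + eps)
```

Only the noise bases and the activations are updated, which mirrors the semi-supervised setting in which clean speech samples are available but noise samples are not.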

The error between $Y_{\omega,t}$ and $X_{\omega,t}$ can be measured by the squared error, the I-divergence, the Itakura-Saito distance, and so on; the I-divergence is used here (Equation (3)). All basis spectra are assumed to satisfy a given constraint. Next, a criterion on $X^{(s)}_{\omega,t}$ is considered (Equation (5)). It is defined in terms of the mel-frequency cepstrum coefficients (MFCCs) of $X_{0,t}, \ldots, X_{\Omega-1,t}$, which are computed using the $l$-th mel filter bank coefficients $f_{l,\omega}$ and the coefficients of the discrete cosine transform. Equation (5) represents the logarithm of the probability that the MFCCs are generated from a Gaussian mixture model with parameters $\theta$, which consist of the mean, variance, and weight of the $m$-th normal distribution. By learning the parameters $\theta$ of this Gaussian mixture model from the MFCC sequences of clean speech training samples, the criterion of Equation (5) gives a high score when $X^{(s)}_{\omega,t}$ is distributed in the MFCC space as similarly as possible to the training samples. The proposed method aims to minimize a criterion that takes both Equation (3) and Equation (5) into account (Equation (7)), where $\lambda$ is a regularization parameter. This optimization problem thus imposes a soft constraint on the spectral model through a cepstral distance criterion. The inventors have previously used this framework to propose a method that performs musical sound separation and timbre clustering under a single optimization criterion.
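The two criteria described here can be evaluated numerically as in the following Python sketch. Several points are assumptions made for illustration: the Gaussian mixture model is taken to have diagonal covariances, the combined criterion is written as the I-divergence minus $\lambda$ times the log-likelihood, the DCT is the common DCT-II form, and `fbank` is any non-negative matrix standing in for a mel filter bank (constructing a true mel filter bank is outside this sketch). All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n_ceps, n_mel):
    """DCT-II coefficients mapping log filter bank energies to cepstrum coefficients."""
    l = np.arange(n_mel)
    n = np.arange(n_ceps)[:, None]
    return np.cos(np.pi * n * (l + 0.5) / n_mel)

def mfcc_from_spectrum(X, fbank, D, eps=1e-12):
    """X: (F, T) spectrum, fbank: (L, F) filter bank f_{l,w}, D: (N, L) DCT matrix."""
    G = fbank @ X + eps          # G_{l,t} = sum_w f_{l,w} X_{w,t}
    return D @ np.log(G)         # (N, T) cepstrum features

def gmm_loglik(C, weights, means, variances):
    """Log-likelihood of frames C (N, T) under a diagonal GMM, summed over frames.

    weights: (M,), means and variances: (M, N).
    """
    diff = C.T[:, None, :] - means[None, :, :]                  # (T, M, N)
    log_comp = -0.5 * np.sum(diff ** 2 / variances
                             + np.log(2 * np.pi * variances), axis=2)
    log_comp = log_comp + np.log(weights)                       # (T, M)
    mx = log_comp.max(axis=1, keepdims=True)                    # log-sum-exp
    return float(np.sum(mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))))

def objective(Y, X_s, X_n, fbank, D, gmm, lam, eps=1e-12):
    """I-divergence between Y and X_s + X_n, minus lam times the GMM log-likelihood."""
    X = X_s + X_n + eps
    idiv = np.sum(Y * np.log((Y + eps) / X) - Y + X)
    return float(idiv - lam * gmm_loglik(mfcc_from_spectrum(X_s, fbank, D), *gmm))
```

A larger `lam` puts more weight on keeping the cepstrum features of the speech estimate close to those seen in clean training speech.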

The method of the present embodiment aims to achieve speech signal enhancement and feature enhancement simultaneously within this framework, and is called "cepstrum-regularized SSNMF".

<Parameter estimation algorithm>
The $U^{(s)}$, $H^{(n)}$, and $U^{(n)}$ that minimize the criterion of Equation (7) cannot be obtained analytically, but an iterative algorithm that searches for a local optimum of this optimization problem can be derived based on the auxiliary function method. To minimize an objective function $F(\theta)$ with the auxiliary function method, an auxiliary variable $\alpha$ is first introduced and an auxiliary function $F^{+}(\theta,\alpha)$ satisfying $F(\theta) = \min_{\alpha} F^{+}(\theta,\alpha)$ is designed. Once such an auxiliary function has been designed, a $\theta$ that locally minimizes the objective function $F(\theta)$ can be obtained by alternately repeating $\alpha \leftarrow \arg\min_{\alpha} F^{+}(\theta,\alpha)$ and $\theta \leftarrow \arg\min_{\theta} F^{+}(\theta,\alpha)$. An auxiliary function for the criterion of Equation (7), and update rules based on it, are derived below.
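The alternating scheme of the auxiliary function method can be seen on a toy problem unrelated to speech. The Python sketch below minimizes $F(\theta) = \sum_i |\theta - y_i|$ using the bound $|r| \le r^2/(2a) + a/2$ for $a > 0$, with equality at $a = |r|$. The problem, the function name `mm_median`, and the small floor `eps` are all choices made for this illustration and do not appear in the patent.

```python
import numpy as np

def mm_median(y, n_iter=100, eps=1e-9):
    """Minimize F(theta) = sum_i |theta - y_i| with an auxiliary function.

    Auxiliary function: F+(theta, a) = sum_i (theta - y_i)**2 / (2 a_i) + a_i / 2,
    which upper-bounds F and touches it at a_i = |theta - y_i|.
    """
    y = np.asarray(y, dtype=float)
    theta = y.mean()
    history = [float(np.abs(theta - y).sum())]
    for _ in range(n_iter):
        # Auxiliary variable update: make the bound tight at the current theta
        a = np.maximum(np.abs(theta - y), eps)
        # Parameter update: closed-form minimizer of the quadratic bound
        theta = float(np.sum(y / a) / np.sum(1.0 / a))
        history.append(float(np.abs(theta - y).sum()))
    return theta, history
```

Because each parameter update minimizes an upper bound that is tight at the previous estimate, the objective values in `history` do not increase (up to the effect of the `eps` floor), which is the same argument that gives the convergence guarantee of the algorithm derived in this section.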

For the I-divergence term, an upper bound (Equation (8)) is obtained from Jensen's inequality, using the fact that the negative logarithm is a convex function. Here, $=^{c}$ denotes equality with respect only to the terms that depend on the parameters.

$H$ and $U$ here collect the basis spectra and activation parameters of both the speech signal and the noise signal. $\zeta_{k,\omega,t}$ is a variable satisfying $\zeta_{k,\omega,t} \ge 0$ and $\sum_k \zeta_{k,\omega,t} = 1$, and equality in Equation (8) holds when $\zeta_{k,\omega,t}$ is given by Equation (13).
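The type of bound used in this step can be illustrated with the generic Jensen inequality for the negative logarithm. The following is a standard form under assumed notation, with $x_k$ standing for a non-negative term such as $H_{k,\omega} U_{k,t}$; it is not a reproduction of Equation (8) or Equation (13), which are not preserved in this text.

```latex
\begin{align}
-\log \sum_k x_k
  = -\log \sum_k \zeta_k \frac{x_k}{\zeta_k}
  \;\le\; -\sum_k \zeta_k \log \frac{x_k}{\zeta_k},
  \qquad \zeta_k > 0,\quad \sum_k \zeta_k = 1,
\end{align}
with equality when
\begin{align}
\zeta_k = \frac{x_k}{\sum_{j} x_j}.
\end{align}
```

The right-hand side separates into one term per $k$, which is what makes closed-form updates possible once the bound is substituted into the objective.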

Next, an upper bound for the regularization term is designed. As with Equation (8), using the fact that the negative logarithm is a convex function, Jensen's inequality gives the inequality of Equation (14), in which equality holds when the auxiliary variable $\alpha_{m,t}$ is given by Equation (15). An upper bound is then derived for the quadratic term that appears in this bound. Because a quadratic function is convex, Jensen's inequality gives the inequality of Equation (16). Here, $\beta_{l,n,m,t}$ is an arbitrary positive constant satisfying $\sum_l \beta_{l,n,m,t} = 1$, and a further auxiliary variable subject to its own constraint is introduced; equality in Equation (16) holds for a specific value of that variable. Equations (16) and (14) together yield a bound expressed with coefficients $A_{l,t}$ and $B_{l,t}$.

Noting that $A_{l,t}$ is non-negative, an upper bound for $(\log G_{l,t})^2$ is considered next. It is given by the inequality of Equation (21), in which equality holds when the auxiliary variable $\xi_{l,t}$ is given by Equation (24). Furthermore, because $f_{l,\omega} H_{k,\omega} U_{k,t}$ is non-negative and the reciprocal function is convex on the positive domain, Jensen's inequality gives Equation (26). Here, $\rho_{l,k,\omega,t}$ is a variable satisfying $\rho_{l,k,\omega,t} > 0$ and $\sum_{\omega,k} \rho_{l,k,\omega,t} = 1$, and equality in Equation (26) holds when $\rho_{l,k,\omega,t}$ is given by Equation (27).

Next, an upper bound for the term $B_{l,t} \log G_{l,t}$ is considered. Since $B_{l,t}$ is not necessarily non-negative, different inequalities are used depending on its sign. First, because the logarithm is a concave function, the inequality of Equation (28) is obtained when $B_{l,t} \ge 0$. Here $\varphi_{l,t}$ is a positive variable, and equality in Equation (28) holds when $\varphi_{l,t}$ is given by Equation (29). When $B_{l,t} < 0$, on the other hand, because the negative logarithm is convex, Jensen's inequality gives Equation (30). Here, $\nu_{k,l,\omega,t}$ is a variable satisfying $\nu_{k,l,\omega,t} > 0$ and $\sum_{k,\omega} \nu_{k,l,\omega,t} = 1$, and equality in Equation (30) holds when $\nu_{k,l,\omega,t}$ is given by Equation (31). The two cases can be written together in a single expression, where $\delta_x$ is an indicator function that equals 1 when condition $x$ is satisfied and 0 otherwise.

An upper bound for the regularization term is obtained from the above. The key point of this inequality is that the $H$ and $U$ that minimize its right-hand side can be obtained analytically. Combined with the upper bound on the I-divergence term, it allows an auxiliary function that gives closed-form update rules to be designed. The update rule for each parameter is obtained from this auxiliary function.

Since $\beta_{l,n,m,t}$ is an arbitrary positive constant, the convergence of the algorithm is guaranteed however it is chosen, as long as $\beta_{l,n,m,t} > 0$ and $\sum_l \beta_{l,n,m,t} = 1$ are satisfied. Convergence is therefore still guaranteed if, for example, $\beta_{l,n,m,t}$ is updated at each step of the iteration according to Equation (37). This update minimizes the auxiliary function with respect to $\beta_{l,n,m,t}$.

<Configuration of the signal analysis apparatus according to the embodiment of the present invention>
Next, the configuration of the signal analysis apparatus according to the embodiment of the present invention will be described. As shown in FIG. 1, a signal analysis apparatus 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing a learning processing routine and a parameter estimation processing routine described later. Functionally, as shown in FIG. 1, the signal analysis apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 receives time-series data of a clean speech signal that is not mixed with other sound sources (hereinafter, clean speech signal). The input unit 10 also receives time-series data of an acoustic signal in which a speech signal and a noise signal are mixed (hereinafter, observed signal).

The calculation unit 20 includes a time-frequency expansion unit 24, a feature extraction unit 26, a basis spectrum learning unit 28, a basis spectrum storage unit 30, a feature model learning unit 32, a feature model storage unit 34, a parameter estimation unit 36, and a speech signal generation unit 38.

The time-frequency expansion unit 24 calculates, based on the time-series data of the clean speech signal, an amplitude spectrogram or power spectrogram representing the time-frequency component of each frequency at each time. In the first embodiment, a time-frequency expansion such as the short-time Fourier transform or the wavelet transform is performed.

The time-frequency expansion unit 24 also calculates, based on the time-series data of the observed signal, an observed spectrogram $Y$, which is an amplitude spectrogram or power spectrogram representing the observed time-frequency component $Y_{\omega,t}$ of each frequency $\omega$ at each time $t$.
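A time-frequency expansion of this kind can be sketched in Python with a short-time Fourier transform. The Hann window, the frame length, the hop size, and the function name `power_spectrogram` are assumptions chosen for the example; the description does not specify them and also allows other expansions such as the wavelet transform.

```python
import numpy as np

def power_spectrogram(x, frame_len=512, hop=128):
    """Power spectrogram of a 1-D signal via a short-time Fourier transform.

    Returns an array of shape (frame_len // 2 + 1, n_frames), indexed as
    (frequency, time), matching the role of Y_{omega,t}.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:                       # pad very short inputs
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    spec = np.fft.rfft(frames, axis=0)           # one-sided spectrum per frame
    return np.abs(spec) ** 2
```

Taking `np.abs(spec)` without squaring would give the amplitude spectrogram, the other option mentioned in the description.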

The feature extraction unit 26 extracts the cepstrum features at each time, based on the time-frequency component of each frequency at each time of the clean speech signal calculated by the time-frequency expansion unit 24.

The basis spectrum learning unit 28 estimates the basis spectra of the clean speech signal, which represent the power spectrum for each basis $k$ and each frequency $\omega$, using NMF, a conventional technique, based on the time-frequency component of each frequency at each time of the clean speech signal calculated by the time-frequency expansion unit 24.

The basis spectrum storage unit 30 stores the basis spectra of the clean speech signal, representing the power spectrum for each basis $k$ and each frequency $\omega$, estimated by the basis spectrum learning unit 28.

The feature model learning unit 32 learns parameters representing the probability distribution of the cepstrum features, defined in the cepstrum space, based on the cepstrum features at each time extracted by the feature extraction unit 26. Specifically, taking this probability distribution to be a Gaussian mixture model, it learns the parameters $\theta$ of the Gaussian mixture model.

The feature model storage unit 34 stores the parameters $\theta$ of the Gaussian mixture model learned by the feature model learning unit 32.

The parameter estimation unit 36 estimates the base spectrum H^(s) and the activation parameter U^(s) of the speech signal, and the base spectrum H^(n) and the activation parameter U^(n) of the noise signal, so as to reduce the criterion shown in equation (7) above. The estimation is based on the observation spectrogram Y output by the time-frequency expansion unit 24, the base spectrum at each basis and each frequency of the speech signal stored in the base spectrum storage unit 30, and the parameter θ representing the probability distribution of the cepstrum feature amount stored in the feature amount model storage unit 34. The criterion is expressed using (i) the distance between the observed time-frequency component Y at each time and each frequency and the sum X of the time-frequency component X^(s), obtained from the base spectrum and the activation parameter of the speech signal, and the time-frequency component X^(n), obtained from the base spectrum and the activation parameter of the noise signal, and (ii) a regularization term representing the likelihood of the cepstrum feature amount of X^(s) at each time and each frequency under the probability distribution of the cepstrum feature amount of the speech signal.
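As a rough illustration of how such a criterion is assembled, the following sketch builds X^(s) and X^(n) from base spectra and activation parameters and adds a weighted penalty to a distance. It is not the patent's equation (7): the I-divergence is used only as an example distance, and `penalty` is a hypothetical callable standing in for the regularization term (for instance, a negative log-likelihood of the cepstrum features of X^(s)). All function names are illustrative.

```python
import math

def model_spectrogram(H, U):
    """X[w][t] = sum_k H[k][w] * U[k][t] for base spectra H and activations U."""
    n_freq, n_time = len(H[0]), len(U[0])
    return [[sum(H[k][w] * U[k][t] for k in range(len(H)))
             for t in range(n_time)] for w in range(n_freq)]

def i_divergence(Y, X):
    """Generalized KL (I-) divergence; an assumed example distance, X must be > 0."""
    total = 0.0
    for y_row, x_row in zip(Y, X):
        for y, x in zip(y_row, x_row):
            total += (y * math.log(y / x) if y > 0 else 0.0) - y + x
    return total

def criterion(Y, H_s, U_s, H_n, U_n, penalty, lam=1.0):
    """Distance between Y and X = X_s + X_n, plus lam times a penalty on X_s."""
    X_s = model_spectrogram(H_s, U_s)
    X_n = model_spectrogram(H_n, U_n)
    X = [[a + b for a, b in zip(rs, rn)] for rs, rn in zip(X_s, X_n)]
    return i_divergence(Y, X) + lam * penalty(X_s)
```

Keeping the penalty as a separate callable makes it easy to compare the regularized criterion against the plain distance by passing a penalty that returns zero.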

Specifically, the parameter estimation unit 36 includes an initial value setting unit 40, an auxiliary variable update unit 42, a parameter update unit 44, and a convergence determination unit 46.

The initial value setting unit 40 sets the base spectrum at each basis and each frequency of the speech signal stored in the base spectrum storage unit 30 as the initial value of the base spectrum H^(s) of the speech signal. The initial value setting unit 40 also sets initial values for the activation parameter U^(s) of the speech signal and for the base spectrum H^(n) and the activation parameter U^(n) of the noise signal.

The auxiliary variable update unit 42 updates the auxiliary variables in accordance with equations (13), (15), (24), (27), (29), (31), and (37) above, based on the parameter θ of the probability distribution of the cepstrum feature amount stored in the feature amount model storage unit 34 and on the initial or most recently updated values of the base spectrum H^(s) and the activation parameter U^(s) of the speech signal and of the base spectrum H^(n) and the activation parameter U^(n) of the noise signal. The auxiliary variables are ζ_{k,ω,t} for each basis k, each frequency ω, and each time t; α_{m,t} for each normal distribution m and each time t; ξ_{l,t} and φ_{l,t} for each mel filter bank coefficient l and each time t; ρ_{l,k,ω,t} and ν_{k,l,ω,t} for each mel filter bank coefficient l, each basis k, each frequency ω, and each time t; and β_{l,m,n,t} for each mel filter bank coefficient l, each normal distribution m, each mel-frequency cepstrum coefficient n, and each time t.

The parameter update unit 44 updates, in accordance with equations (33) to (36) above, the base spectrum H^(s)_{k,ω} at each basis k and each frequency ω and the activation parameter U^(s)_{k,t} at each basis k and each time t of the speech signal, and the base spectrum H^(n)_{k,ω} at each basis k and each frequency ω and the activation parameter U^(n)_{k,t} at each basis k and each time t of the noise signal. The update is based on the observation spectrogram Y output by the time-frequency expansion unit 24 and on the auxiliary variables updated by the auxiliary variable update unit 42, namely ζ_{k,ω,t}, α_{m,t}, ξ_{l,t}, φ_{l,t}, ρ_{l,k,ω,t}, ν_{k,l,ω,t}, and β_{l,m,n,t}.

The convergence determination unit 46 determines whether a convergence condition is satisfied, and causes the update process in the auxiliary variable update unit 42 and the update process in the parameter update unit 44 to be repeated until the convergence condition is satisfied.

As the convergence condition, for example, the number of iterations reaching an upper limit can be used. Alternatively, the convergence condition can be that the difference between the current value of the criterion in equation (7) above and its previous value is equal to or less than a predetermined threshold.
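The alternation of the two update steps with these stopping rules can be sketched as follows. This is a generic skeleton, not the patent's implementation: `update_aux`, `update_params`, and `criterion` are hypothetical callables supplied by the caller, and `max_iters` and `tol` are illustrative names for the upper limit and the threshold.

```python
def run_until_converged(update_aux, update_params, criterion, state,
                        max_iters=100, tol=1e-6):
    """Alternate the two update steps until either stopping rule fires."""
    previous = criterion(state)
    for iteration in range(1, max_iters + 1):
        aux = update_aux(state)              # auxiliary variable update
        state = update_params(state, aux)    # parameter update
        current = criterion(state)
        # Stop when the criterion changed by no more than the threshold.
        if abs(previous - current) <= tol:
            return state, iteration
        previous = current
    # Otherwise stop because the iteration limit was reached.
    return state, max_iters
```

The function returns the iteration count alongside the final state so that a caller can tell which of the two conditions ended the loop.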

The speech signal generation unit 38 generates a speech signal by Wiener filtering, based on the base spectrum H^(s)_{k,ω} at each basis k and each frequency ω of the speech signal and the base spectrum H^(n)_{k,ω} at each basis k and each frequency ω of the noise signal obtained by the parameter estimation unit 36, and outputs the speech signal from the output unit 90.
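A minimal sketch of the Wiener filtering step is given below, assuming that the estimated power components X^(s) and X^(n) are available per time-frequency bin (as in the experimental example) and that the complex observed STFT is given. The name `wiener_filter` and the small constant `eps` are illustrative, and the inverse transform back to a waveform is not shown.

```python
def wiener_filter(Y_complex, X_s, X_n, eps=1e-12):
    """Scale each complex bin of the observation by X_s / (X_s + X_n)."""
    return [[y * (xs / (xs + xn + eps))
             for y, xs, xn in zip(y_row, s_row, n_row)]
            for y_row, s_row, n_row in zip(Y_complex, X_s, X_n)]
```

Because the gain lies between 0 and 1, bins dominated by the noise model are attenuated while the observed phase is left unchanged.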

<Operation of the Signal Analysis Device According to the Embodiment of the Present Invention>
Next, the operation of the signal analysis device 100 according to the embodiment of the present invention will be described. First, when the input unit 10 receives time-series data of a clean speech signal, the signal analysis device 100 executes the learning processing routine shown in FIG. 2.

First, in step S100, the time-frequency component at each frequency and each time of the clean speech signal is calculated from the time-series data of the clean speech signal received by the input unit 10.

Next, in step S102, the cepstrum feature amount at each time is extracted from the time-frequency component at each frequency and each time of the clean speech signal obtained in step S100.
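The feature extraction in step S102 can be pictured with the usual mel-frequency cepstrum pipeline: filter bank energies, logarithm, then a discrete cosine transform. The sketch below is only an assumption about that pipeline; `filterbank` is a hypothetical list of per-band weight vectors supplied by the caller, and the DCT-II is written without normalization.

```python
import math

def cepstrum_from_filterbank(power_frame, filterbank, n_ceps, floor=1e-10):
    """Filter bank energies -> log -> DCT-II, for one power spectrum frame."""
    energies = [sum(w * p for w, p in zip(weights, power_frame))
                for weights in filterbank]
    # Floor the energies so that the logarithm is always defined.
    log_e = [math.log(max(e, floor)) for e in energies]
    n_bands = len(log_e)
    return [sum(v * math.cos(math.pi * n * (l + 0.5) / n_bands)
                for l, v in enumerate(log_e))
            for n in range(n_ceps)]
```

Applying the function to every frame of the spectrogram gives one cepstrum vector per time t.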

Next, in step S104, the base spectrum representing the power spectrum at each basis k and each frequency ω of the clean speech signal is estimated by NMF, a conventional technique, from the time-frequency component at each frequency and each time of the clean speech signal obtained in step S100, and is stored in the base spectrum storage unit 30.
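For readers unfamiliar with NMF, the following sketch factorizes a nonnegative spectrogram with the well-known multiplicative updates for the generalized KL divergence. The choice of divergence, the random initialization, and the name `nmf_kl` are assumptions made for illustration; the text does not state which NMF variant is used in step S104.

```python
import random

def nmf_kl(Y, n_bases, n_iters=200, seed=0, eps=1e-12):
    """Factor a nonnegative spectrogram Y[w][t] as sum_k H[k][w] * U[k][t]."""
    rng = random.Random(seed)
    n_freq, n_time = len(Y), len(Y[0])
    H = [[rng.uniform(0.1, 1.0) for _ in range(n_freq)] for _ in range(n_bases)]
    U = [[rng.uniform(0.1, 1.0) for _ in range(n_time)] for _ in range(n_bases)]

    def ratio():
        # Element-wise Y / X under the current model X = sum_k H[k] U[k].
        return [[Y[w][t] / (sum(H[k][w] * U[k][t] for k in range(n_bases)) + eps)
                 for t in range(n_time)] for w in range(n_freq)]

    for _ in range(n_iters):
        R = ratio()
        for k in range(n_bases):  # multiplicative update of the base spectra
            u_sum = sum(U[k]) + eps
            H[k] = [H[k][w] * sum(R[w][t] * U[k][t] for t in range(n_time)) / u_sum
                    for w in range(n_freq)]
        R = ratio()
        for k in range(n_bases):  # multiplicative update of the activations
            h_sum = sum(H[k]) + eps
            U[k] = [U[k][t] * sum(R[w][t] * H[k][w] for w in range(n_freq)) / h_sum
                    for t in range(n_time)]
    return H, U
```

The updates keep every entry nonnegative as long as the initial values are positive, which is why no explicit projection step is needed.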

In step S106, the GMM parameter θ representing the probability distribution of the cepstrum feature amount is learned from the cepstrum feature amount at each time extracted in step S102 and is stored in the feature amount model storage unit 34, and the learning processing routine ends.
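Once θ has been learned, the likelihood of a cepstrum vector under the mixture can be evaluated as in the sketch below. Diagonal covariances are an assumption made here for brevity, and `weights`, `means`, and `variances` are hypothetical names for the components of θ; the log-sum-exp form avoids underflow when many components have very small densities.

```python
import math

def gmm_log_likelihood(c, weights, means, variances):
    """log sum_m w_m N(c; mu_m, diag(var_m)), computed with log-sum-exp."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        log_exp = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(c, mu, var))
        log_terms.append(math.log(w) + log_norm + log_exp)
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

A cepstrum vector close to one of the component means yields a higher value than one far from all of them, which is the behavior a likelihood-based regularizer relies on.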

Next, when the input unit 10 receives time-series data of an observation signal in which a speech signal and a noise signal are mixed, the signal analysis device 100 executes the parameter estimation processing routine shown in FIG. 3.

First, in step S120, the observation spectrogram Y is calculated from the time-series data of the observation signal received by the input unit 10.
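A dependency-free sketch of computing a power spectrogram by a short-time Fourier transform is shown below. The Hann window and the naive DFT are assumptions chosen for brevity, and the function name is illustrative. With the settings of the experimental example (16 kHz sampling, 32 ms frames, 16 ms shift), `frame_len` would be 512 samples and `hop` 256 samples.

```python
import cmath
import math

def power_spectrogram(signal, frame_len, hop):
    """Return Y[w][t] = |STFT|^2 using a Hann window and a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1          # keep non-negative frequencies only
    starts = range(0, len(signal) - frame_len + 1, hop)
    Y = [[0.0] * len(starts) for _ in range(n_bins)]
    for t, start in enumerate(starts):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        for w in range(n_bins):
            coeff = sum(x * cmath.exp(-2j * math.pi * w * n / frame_len)
                        for n, x in enumerate(frame))
            Y[w][t] = abs(coeff) ** 2
    return Y
```

In practice an FFT routine would replace the inner DFT loop; the naive form is used here only to keep the example self-contained.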

In step S122, the base spectrum at each basis and each frequency of the speech signal stored in the base spectrum storage unit 30 is set as the initial value of the base spectrum H^(s) of the speech signal. Initial values are also set for the activation parameter U^(s) of the speech signal and for the base spectrum H^(n) and the activation parameter U^(n) of the noise signal.

In step S124, the auxiliary variables are updated in accordance with equations (13), (15), (24), (27), (29), (31), and (37) above, based on the parameter θ of the probability distribution of the cepstrum feature amount stored in the feature amount model storage unit 34 and on the base spectrum H^(s) and the activation parameter U^(s) of the speech signal and the base spectrum H^(n) and the activation parameter U^(n) of the noise signal, each of which either has the initial value set in step S122 or was last updated in step S126 described below. The auxiliary variables are ζ_{k,ω,t} for each basis k, each frequency ω, and each time t; α_{m,t} for each normal distribution m and each time t; ξ_{l,t} and φ_{l,t} for each mel filter bank coefficient l and each time t; ρ_{l,k,ω,t} and ν_{k,l,ω,t} for each mel filter bank coefficient l, each basis k, each frequency ω, and each time t; and β_{l,m,n,t} for each mel filter bank coefficient l, each normal distribution m, each mel-frequency cepstrum coefficient n, and each time t.

Next, in step S126, the base spectrum H^(s)_{k,ω} at each basis k and each frequency ω and the activation parameter U^(s)_{k,t} at each basis k and each time t of the speech signal, and the base spectrum H^(n)_{k,ω} at each basis k and each frequency ω and the activation parameter U^(n)_{k,t} at each basis k and each time t of the noise signal, are updated in accordance with equations (33) to (36) above. The update is based on the observation spectrogram Y obtained in step S120 and on the auxiliary variables updated in step S124, namely ζ_{k,ω,t}, α_{m,t}, ξ_{l,t}, φ_{l,t}, ρ_{l,k,ω,t}, ν_{k,l,ω,t}, and β_{l,m,n,t}.

Next, in step S128, it is determined whether the convergence condition is satisfied. If the convergence condition is satisfied, the process proceeds to step S130. If it is not satisfied, the process returns to step S124, and the processing of steps S124 to S128 is repeated.

In step S130, a speech signal is generated by Wiener filtering, based on the base spectrum H^(s)_{k,ω} at each basis k and each frequency ω of the speech signal and the base spectrum H^(n)_{k,ω} at each basis k and each frequency ω of the noise signal as finally updated in step S126. The speech signal is output from the output unit 90, and the parameter estimation processing routine ends.

<Experimental Example>
An evaluation experiment was conducted to verify the noise suppression effect of the method of the embodiment described above, using the 503-sentence speech data of the ATR speech database and RWCP noise data (four types: white noise, babble noise, museum noise, and background music noise). The method was compared with the conventional SSNMF method, and the improvements in signal-to-distortion ratio (SDR) and in cepstrum distortion between before and after processing were evaluated. The test data were created by superimposing each type of noise on clean speech at various SNRs. All acoustic signals in the test data were monaural signals with a sampling frequency of 16 kHz, and the observation spectrogram Y_{ω,t} was computed by a short-time Fourier transform with a frame length of 32 ms and a frame shift of 16 ms. For training, a total of 450 sentences spoken by 10 speakers (4 female and 6 male) were used to learn H^(s)_{k,ω} and the GMM parameter θ of the MFCCs. The MFCC dimension was 13, and the number of GMM mixture components was 30. In testing, H^(s)_{k,ω} and θ obtained in training were fixed, and U^(s)_{k,t}, H^(n)_{k,ω}, and U^(n)_{k,t} were estimated with λ = 1. After estimation, the estimate of the speech signal was computed by a Wiener filter using X^(s)_{ω,t} and X^(n)_{ω,t}. The initial values for the proposed algorithm were obtained by conventional SSNMF.
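Creating test data at a prescribed SNR amounts to rescaling the noise before adding it to the clean speech. The sketch below shows one common way to do this, assuming the SNR is defined as the ratio of average powers over the whole signal and that both signals have the same length; the function name is illustrative and not taken from the experiment.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * x for x in noise]
    # Return the mixture and the scaled noise (useful for later evaluation).
    return [s + n for s, n in zip(speech, scaled)], scaled
```

Returning the scaled noise alongside the mixture makes it possible to check the achieved SNR or to compute distortion measures against the known components.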

The improvements in cepstrum distortion and SDR obtained by the proposed method and the conventional method under the above conditions are shown in FIGS. 4 to 7.

FIG. 4 shows the improvement in cepstrum distortion obtained by the proposed method and the conventional method; the upper part of FIG. 4 shows the case of white noise, and the lower part shows the case of babble noise. FIG. 5 also shows the improvement in cepstrum distortion; the upper part of FIG. 5 shows the case of real-environment noise, and the lower part shows the case of background music noise. FIG. 6 shows the improvement in SDR obtained by the proposed method and the conventional method; the upper part of FIG. 6 shows the case of white noise, and the lower part shows the case of babble noise. FIG. 7 also shows the improvement in SDR; the upper part of FIG. 7 shows the case of real-environment noise, and the lower part shows the case of background music noise. For both evaluation measures, the proposed method achieved a higher improvement than the conventional method in most cases.

As described above, the signal analysis device according to the embodiment of the present invention estimates the activation parameter of the speech signal and the base spectrum and the activation parameter of the noise signal so as to reduce a criterion, and then generates the speech signal. The criterion is expressed using (i) the distance between the observed time-frequency component at each time and each frequency and the sum of the time-frequency component obtained from the pre-estimated base spectrum and the activation parameter of the speech signal and the time-frequency component obtained from the base spectrum and the activation parameter of the noise signal, and (ii) a regularization term representing, based on the probability distribution of the cepstrum feature amount of the speech signal, the likelihood of the cepstrum feature amount of the time-frequency component obtained from the base spectrum and the activation parameter of the speech signal. This makes it possible to suppress noise and to enhance both the speech signal and its cepstrum feature amount.

Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the processing for learning the base spectrum of the speech signal and the parameters representing the probability distribution of the cepstrum feature amount of the speech signal, and the parameter estimation processing for estimating the base spectra and the activation parameters from the observation signal, may be performed by separate devices.

In addition, the order in which the parameters are updated is arbitrary and is not limited to the order described in the above embodiment.

In addition, the case where the base spectrum of the speech signal is updated in the same way as the activation parameter of the speech signal and the base spectrum and the activation parameter of the noise signal has been described as an example, but the present invention is not limited to this. The base spectrum of the speech signal may instead be fixed to the pre-estimated base spectrum of the speech signal without being updated.

In addition, although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium or provided via a network.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Calculation unit
24 Time-frequency expansion unit
26 Feature amount extraction unit
28 Base spectrum learning unit
30 Base spectrum storage unit
32 Feature amount model learning unit
34 Feature amount model storage unit
36 Parameter estimation unit
38 Speech signal generation unit
40 Initial value setting unit
42 Auxiliary variable update unit
44 Parameter update unit
46 Convergence determination unit
90 Output unit
100 Signal analysis device

Claims

A signal analysis device comprising:
a time-frequency expansion unit that receives, as input, time-series data of an observation signal in which a speech signal and a noise signal are mixed, and outputs an observation spectrogram representing the observed time-frequency component at each time and each frequency; and
a parameter estimation unit that, based on the observation spectrogram output by the time-frequency expansion unit, a pre-learned base spectrum representing the power spectrum at each basis and each frequency of the speech signal, and pre-learned parameters representing the probability distribution, defined in the cepstrum space, of the cepstrum feature amount of the speech signal, estimates the activation parameter of the speech signal and the base spectrum and the activation parameter of the noise signal so as to reduce a criterion expressed using (i) the distance between the observed time-frequency component at each time and each frequency and the sum of the time-frequency component at each time and each frequency obtained from the base spectrum of the speech signal and an activation parameter representing the power of the speech signal at each time and the time-frequency component at each time and each frequency obtained from the base spectrum and the activation parameter of the noise signal, and (ii) a regularization term representing, based on the probability distribution of the cepstrum feature amount of the speech signal, the likelihood of the cepstrum feature amount of the time-frequency component at each time and each frequency obtained from the base spectrum and the activation parameter of the speech signal.

The signal analysis device according to claim 1, wherein the criterion is represented by the following expression.

Here, U^(s) represents the activation parameter of the speech signal, H^(n) represents the base spectrum of the noise signal, and U^(n) represents the activation parameter of the noise signal. Y is the observation spectrogram, and X represents the sum of the time-frequency component X^(s) at each frequency obtained from the base spectrum of the speech signal and the activation parameter U^(s) and the time-frequency component at each frequency obtained from the base spectrum H^(n) and the activation parameter U^(n) of the noise signal. The distance term of the expression represents the distance between the observation spectrogram Y and the sum X, the regularization term represents the likelihood of the cepstrum feature amount of the time-frequency component X^(s) at each frequency obtained from the base spectrum of the speech signal and the activation parameter U^(s), and λ is a regularization parameter.

The signal analysis device according to claim 1, wherein the parameter estimation unit includes:
a parameter update unit that updates the activation parameter of the speech signal and the base spectrum and the activation parameter of the noise signal so as to reduce an auxiliary function that is an upper-bound function of the criterion; and
a convergence determination unit that causes the update by the parameter update unit to be repeated until a predetermined convergence condition is satisfied.

A signal analysis method in a signal analysis device including a time-frequency expansion unit and a parameter estimation unit, the method comprising:
receiving as input, by the time-frequency expansion unit, time-series data of an observation signal in which a speech signal and a noise signal are mixed, and outputting an observation spectrogram representing the observed time-frequency component at each time and each frequency; and
estimating, by the parameter estimation unit, based on the observation spectrogram output by the time-frequency expansion unit, a pre-learned base spectrum representing the power spectrum at each basis and each frequency of the speech signal, and pre-learned parameters representing the probability distribution, defined in the cepstrum space, of the cepstrum feature amount of the speech signal, the activation parameter of the speech signal and the base spectrum and the activation parameter of the noise signal so as to reduce a criterion expressed using (i) the distance between the observed time-frequency component at each time and each frequency and the sum of the time-frequency component at each time and each frequency obtained from the base spectrum of the speech signal and an activation parameter representing the power of the speech signal at each time and the time-frequency component at each time and each frequency obtained from the base spectrum and the activation parameter of the noise signal, and (ii) a regularization term representing, based on the probability distribution of the cepstrum feature amount of the speech signal, the likelihood of the cepstrum feature amount of the time-frequency component at each time and each frequency obtained from the base spectrum and the activation parameter of the speech signal.

The signal analysis method according to claim 4, wherein the criterion is represented by the following expression.

Here, U^(s) represents the activation parameter of the speech signal, H^(n) represents the base spectrum of the noise signal, and U^(n) represents the activation parameter of the noise signal. Y is the observation spectrogram, and X represents the sum of the time-frequency component X^(s) at each frequency obtained from the base spectrum of the speech signal and the activation parameter U^(s) and the time-frequency component at each frequency obtained from the base spectrum H^(n) and the activation parameter U^(n) of the noise signal. The distance term of the expression represents the distance between the observation spectrogram Y and the sum X, the regularization term represents the likelihood of the cepstrum feature amount of the time-frequency component X^(s) at each frequency obtained from the base spectrum of the speech signal and the activation parameter U^(s), and λ is a regularization parameter.

The signal analysis method according to claim 4, wherein the estimating by the parameter estimation unit includes:
updating, by a parameter update unit, the activation parameter of the speech signal and the base spectrum and the activation parameter of the noise signal so as to reduce an auxiliary function that is an upper-bound function of the criterion; and
repeating, by a convergence determination unit, the update by the parameter update unit until a predetermined convergence condition is satisfied.

A program for causing a computer to function as each unit of the signal analysis device according to any one of claims 1 to 3.