JP2013007975A

JP2013007975A - Noise suppression device, method and program

Info

Publication number: JP2013007975A
Application number: JP2011142230A
Authority: JP
Inventors: Masakiyo Fujimoto; 雅清藤本; Shinji Watabe; 晋治渡部; Tomohiro Nakatani; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-06-27
Filing date: 2011-06-27
Publication date: 2013-01-10
Anticipated expiration: 2031-06-27
Also published as: JP5713818B2

Abstract

PROBLEM TO BE SOLVED: To provide a noise suppression technique that uses the noise signal as learning data regardless of whether a speech signal is present, so that more of the changes and characteristics of the noise signal are reflected in its probability model and the noise signal is suppressed more accurately. SOLUTION: An acoustic feature of an acoustic signal is extracted. A noise signal is estimated by using a probability model of a speech signal containing no noise (hereinafter referred to as the "speech model") and the acoustic feature of the acoustic signal, and a probability model of the noise signal (hereinafter referred to as the "noise model") is learned without supervision using the estimated noise signal as learning data. The noise signal in the acoustic signal is suppressed by using the noise model.

Description

The present invention relates to a noise suppression technique for extracting a desired signal by suppressing a noise signal included in an input acoustic signal.

Conventional techniques are known that suppress a noise signal in an acoustic signal containing both the speech signal to be processed and signals other than the speech signal (hereinafter referred to as the "noise signal"), so that the speech signal is easier to hear. In particular, when automatic speech recognition is used in a real environment, the noise signal must be removed from the acoustic signal and only the desired speech signal extracted in order for recognition to work correctly. The use of automatic speech recognition in real environments is highly anticipated in the coming information society, and this is a problem that should be solved promptly.

Non-Patent Document 1 is known as prior art related to noise suppression. Non-Patent Document 1 generates a probability model of the acoustic signal from probability models of the speech signal and the noise signal estimated in advance, and expresses the difference between that probability model and the statistics of the entire acoustic signal by a Taylor expansion. The difference is estimated using the EM algorithm (hereinafter also referred to as the "expectation-maximization method"), and the probability model of the acoustic signal is optimized. Noise is then suppressed using the parameters of the optimized probability model of the acoustic signal and of the probability model of the speech signal.

Non-Patent Document 1: P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment-independent speech recognition," in Proceedings of ICASSP '96, May 1996, vol. II, pp. 733-736.

In the prior art, estimating the probability model of the noise signal requires learning data consisting only of the noise signal. However, the only signal normally observable during noise suppression is the acoustic signal in which the noise signal and the speech signal are mixed, and it is difficult to observe the noise signal by itself. For this reason, the prior art obtains noise-only learning data by estimating time intervals in which no speech signal is present and only the noise signal exists. With such a method, however, the noise signal in intervals where a speech signal is present cannot be used as learning data, so changes and characteristics of the noise signal occurring in those intervals cannot be reflected in the probability model of the noise signal. It is therefore difficult to estimate and represent the distribution of the noise signal accurately.

An object of the present invention is to provide a noise suppression technique that uses the noise signal as learning data regardless of whether a speech signal is present, can thereby reflect more of the changes and characteristics of the noise signal in the probability model of the noise signal, and can suppress the noise signal more accurately.

To solve the above problem, according to a first aspect of the present invention, a noise signal is suppressed from an acoustic signal containing the noise signal and a speech signal. An acoustic feature of the acoustic signal is extracted. The noise signal is estimated using a probability model of a speech signal containing no noise (hereinafter referred to as the "speech model") and the acoustic feature of the acoustic signal, and a probability model of the noise signal (hereinafter referred to as the "noise model") is learned without supervision using the estimated noise signal as learning data. The noise signal in the acoustic signal is suppressed using the noise model.

The noise suppression technique according to the present invention has the effect that a noise signal can be suppressed more accurately.

FIG. 1 is a functional block diagram of the noise suppression device 100.
FIG. 2 is a diagram showing the processing flow of the noise suppression device 100.
FIG. 3 is a diagram showing the processing flow of the acoustic feature extraction unit 104.
FIG. 4 is a functional block diagram of the noise model estimation unit 105.
FIG. 5 is a diagram showing the processing flow of the noise model estimation unit 105.
FIG. 6 is a functional block diagram of the noise model parameter estimation means 306.
FIG. 7 is a diagram showing the processing flow of the noise model parameter estimation means 306.
FIG. 8 is a functional block diagram of the noise suppression unit 106.
FIG. 9 is a diagram showing the processing flow of the noise suppression filter estimation means.
FIG. 10 is a diagram showing the processing flow of the noise suppression filter application means.
FIG. 11 is a diagram showing an example of noise model estimation according to the present invention.
FIG. 12 is a diagram showing an example of noise suppression according to the present invention.
FIG. 13 is a diagram showing the noise-suppressed signal obtained by the present invention for a speech signal contained in an acoustic signal.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. The symbols "^" and "⁻" used in the text should properly be written directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In the equations, these symbols are written in their proper positions. Unless otherwise noted, processing performed on individual elements of a vector or matrix is applied to all elements of that vector or matrix.

<Noise Suppression Device 100 according to First Embodiment>
The noise suppression device 100 according to the first embodiment is described with reference to FIGS. 1 and 2.

As shown in FIG. 1, the noise suppression device 100 includes an acoustic feature extraction unit 104, a GMM storage unit 107 that stores a silence GMM (Gaussian mixture model) and a clean speech GMM which together constitute the speech model, a noise model estimation unit 105, and a noise suppression unit 106. The noise suppression device 100 records, or receives as input, an acoustic signal o_τ in which a speech signal and a noise signal are mixed, and outputs a noise-suppressed signal ^s_τ obtained by suppressing the noise signal in the acoustic signal o_τ. Here, τ denotes the sample index of the discrete signal. An outline of the present embodiment follows.

As shown in FIG. 2, the acoustic feature extraction unit 104 extracts, from the acoustic signal, the complex spectrum and the log mel spectrum used for noise suppression (s104). The noise model estimation unit 105 estimates a noise GMM, which is the probability model of the noise signal (hereinafter referred to as the "noise model"), using the log mel spectrum together with the silence GMM and the clean speech GMM held in main memory by the GMM storage unit 107 (s105). The noise suppression unit 106 designs a noise suppression filter using the complex spectrum, the log mel spectrum, the silence GMM, the clean speech GMM, and the noise GMM, and suppresses the noise signal in the acoustic signal to obtain the noise-suppressed signal (s106). Each unit is described in detail below.

<Acoustic Feature Extraction Unit 104>
The acoustic feature extraction unit 104 extracts acoustic features of the acoustic signal (s104). The extracted acoustic features are those used when suppressing the noise signal in the acoustic signal, for example the complex spectrum and the log mel spectrum. The acoustic feature extraction unit 104 performs processing according to, for example, the flow shown in FIG. 3.

First, from the acoustic signal o_τ sampled at a certain frequency (for example, 16,000 Hz), acoustic signals of a fixed time length (frame width) are cut out as frames while the start point is moved along the time axis by a fixed time step (shift width) (s201). For example, an acoustic signal o_t = {o_{t,0}, o_{t,1}, ..., o_{t,n}, ..., o_{t,319}} of frame width Frame = 320 sample points (16,000 Hz × 20 ms) is cut out while the start point is moved by shift width Shift = 160 sample points (16,000 Hz × 10 ms) at a time. Here, t is the frame number and n denotes the n-th sample point within the frame. The frame-level acoustic signal is denoted o_t and expressed as follows.
o_t = {o_{t,0}, o_{t,1}, ..., o_{t,n}, ..., o_{t,Frame-1}}
When acoustic signals of multiple channels are input, frames may be cut out for each channel. When cutting out a frame, the signal may also be multiplied by a window function w_n such as a Hamming window.
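The framing step s201 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function names hamming_window and split_into_frames are hypothetical, the Hamming coefficients 0.54 and 0.46 are the textbook values rather than values taken from this text, and dropping trailing samples that do not fill a whole frame is an assumption.

```python
import math

def hamming_window(frame_len):
    """Textbook Hamming window: w_n = 0.54 - 0.46*cos(2*pi*n/(frame_len-1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def split_into_frames(signal, frame_len=320, shift=160, window=None):
    """Cut `signal` into overlapping frames o_t, optionally windowed.

    Only complete frames are returned; trailing samples that do not fill
    a whole frame are dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frame = signal[start:start + frame_len]
        if window is not None:
            frame = [x * w for x, w in zip(frame, window)]
        frames.append(list(frame))
        start += shift  # move the start point by the shift width
    return frames

# One second of a 440 Hz tone sampled at 16,000 Hz (test signal, not from the text).
signal = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
frames = split_into_frames(signal, 320, 160, hamming_window(320))
```

With Frame = 320 and Shift = 160, consecutive frames overlap by half, so each sample away from the signal edges contributes to two frames.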

Next, the acoustic feature extraction unit 104 applies an M-point fast Fourier transform to the acoustic signal o_t (where M must be a power of 2 and at least the frame width Frame, for example 512) to obtain the complex spectrum Spc_t = {Spc_{t,0}, ..., Spc_{t,m}, ..., Spc_{t,M-1}} (where m is the frequency bin number) (s202).

Next, the acoustic feature extraction unit 104 performs mel filter bank analysis on the absolute values of Spc_{t,m} (s203) and takes the logarithm of the filter bank outputs (s204). This processing yields a vector O_t = {O_{t,0}, ..., O_{t,r}, ..., O_{t,R-1}} whose elements are an R-dimensional (for example, R = 24) log mel spectrum (hereinafter this vector is simply called the "log mel spectrum"), where r is the element index of the vector. The outputs of the acoustic feature extraction unit 104 are thus the complex spectrum Spc_t and the log mel spectrum O_t. The complex spectrum Spc_t is input to the noise suppression unit 106, and the log mel spectrum O_t is input to the noise model estimation unit 105 and the noise suppression unit 106.
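Steps s202 to s204 can be illustrated with the following Python sketch. Several details are assumptions, because the text does not specify them: the mel scale formula (the common 2595·log10(1 + f/700) form), the triangular filter shape, the use of only the non-redundant bins 0 to M/2, and the small floor applied before the logarithm. A direct DFT is used in place of an FFT to keep the sketch dependency-free; it produces the same values, only more slowly. All function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def magnitude_spectrum(frame, n_fft):
    """|Spc_{t,m}| for bins m = 0..n_fft/2, by a direct DFT of the zero-padded frame."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    mags = []
    for m in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * m * n / n_fft)
                  for n, x in enumerate(padded))
        mags.append(abs(acc))
    return mags

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for r in range(n_filters):
        lo, mid, hi = edges_hz[r], edges_hz[r + 1], edges_hz[r + 2]
        weights = []
        for m in range(n_fft // 2 + 1):
            f = m * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def log_mel_spectrum(frame, n_fft=512, n_filters=24, sample_rate=16000, floor=1e-10):
    """Return O_t: the log of each mel filter output computed from |Spc_t|."""
    mags = magnitude_spectrum(frame, n_fft)
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    return [math.log(max(sum(w * a for w, a in zip(weights, mags)), floor))
            for weights in bank]
```

For a 320-sample frame and M = 512, the frame is zero-padded to 512 samples before the transform, which is why M must be at least the frame width.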

<GMM Storage Unit 107>
A storage unit (not shown) stores in advance a probability model of a speech signal containing no noise (hereinafter referred to as the "speech model"). For example, the GMM storage unit 107, which is part of the storage unit, stores a silence GMM and a clean speech GMM as the speech model. The silence GMM is a GMM trained on acoustic signals taken from the silent portions of speech signals containing no noise signal, and the clean speech GMM is a GMM trained on acoustic signals consisting only of speech, excluding silent portions, recorded in a noise-free environment.

The silence GMM and the clean speech GMM are given by the following equation.

In the above equation, j is an index that distinguishes the silence GMM from the clean speech GMM: j = 0 denotes the silence GMM and j = 1 the clean speech GMM. k is the index of a normal distribution contained in the silence GMM or the clean speech GMM, and K is the total number of normal distributions (for example, K = 128). S_t is the log mel spectrum of a speech signal containing no noise, and b_{S,j}(S_t) is the likelihood of the silence GMM or the clean speech GMM. The subscript S indicates a likelihood or parameter of an acoustic model (the silence GMM or the clean speech GMM) as distinct from the noise GMM described later and from the GMM of the acoustic signal containing the speech signal and the noise signal. w_{S,j,k}, μ_{S,j,k}, and Σ_{S,j,k} are the mixture weight, mean vector, and diagonal covariance matrix, respectively, of the silence GMM or the clean speech GMM. The function N(·) is the probability density function of the multivariate normal distribution given by the following equation.
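The two equations referred to in this passage are not reproduced in this text. Assuming the standard definitions of a Gaussian mixture likelihood and of an R-dimensional normal density, which match the symbols defined above, they would take the following form. The summation range over k is an assumption.

```latex
b_{S,j}(S_t) = \sum_{k=0}^{K-1} w_{S,j,k}\,
  \mathcal{N}\!\left(S_t;\ \mu_{S,j,k},\ \Sigma_{S,j,k}\right)

\mathcal{N}(x;\ \mu,\ \Sigma) =
  \frac{1}{(2\pi)^{R/2}\,\lvert\Sigma\rvert^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\mathsf{T}}\,\Sigma^{-1}\,(x-\mu)\right)
```

Because Σ is diagonal, the determinant is the product of the R variances and the quadratic form reduces to a sum of R scalar terms.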

Meanwhile, a GMM of the noise signal (hereinafter referred to as the "noise GMM") can be used as the noise model. The noise GMM is given by the following equation.

In the above equation, l is the index of a normal distribution contained in the noise GMM, and L is the total number of normal distributions (for example, L = 4). N_t is the log mel spectrum of the noise, b_N(N_t) is the likelihood of the noise GMM, and w_{N,l}, μ_{N,l}, and Σ_{N,l} are the mixture weight, mean vector, and diagonal covariance matrix, respectively, of the noise GMM. Hereinafter, the parameter set of the noise GMM (hereinafter also referred to as the "noise model parameters") is defined as λ = {w_{N,l}, μ_{N,l}, Σ_{N,l}}. The subscript N indicates a likelihood or parameter of the noise GMM.
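As an illustration of how a likelihood such as b_N(N_t) can be evaluated for a GMM with diagonal covariance matrices, the following Python sketch computes it in the log domain. The function names are hypothetical, and working with logarithms and the log-sum-exp trick is a numerical choice of this sketch, not something stated in the text. Each variance list holds the diagonal of one Σ_{N,l}.

```python
import math

def log_gaussian_diag(x, mean, var):
    """log N(x; mean, diag(var)) for a diagonal-covariance normal density."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log b(x) = log sum_l w_l N(x; mu_l, Sigma_l), computed via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

The same function applies to the silence GMM and the clean speech GMM by passing their weights, means, and variances instead. Mixture weights must be strictly positive here, since the sketch takes their logarithm.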

Non-Patent Document 1 performs noise suppression on the premise that the characteristics of the noise signal are stationary and that its distribution is unimodal. In the present embodiment, by contrast, the noise signal is defined as a signal arising from non-stationary noise that follows a multimodal distribution, and the noise model is represented by a GMM rather than a single normal distribution. The noise model is learned without supervision in the noise model estimation unit 105 described below.

<Noise Model Estimation Unit 105>
The noise model estimation unit 105 estimates the noise signal using the log mel spectrum O_t, the silence GMM, and the clean speech GMM, and learns the noise GMM without supervision using the estimated noise signal as learning data (s105). In the present embodiment, it is not the noise signal itself but its acoustic feature (log mel spectrum) that is estimated, and the noise GMM is learned from this.

For example, the noise model estimation unit 105 estimates the noise GMM by means of two nested EM algorithms, hereinafter called the first EM algorithm and the second EM algorithm. The EM algorithm is a method for estimating the parameters of a probability model: it optimizes the parameters by repeating an expectation step (E-step), which computes the expected value of the cost function (log-likelihood function) of the probability model, and a maximization step (M-step), which maximizes the cost function, until a convergence condition is satisfied.

In the first EM algorithm, the probability model generation processing (s303), first expected value calculation processing (s304), noise signal estimation processing (s305), and noise model parameter estimation processing (s306) described below are repeated, using the acoustic signal, until a convergence condition is satisfied, so that the likelihood of the probability model of the acoustic signal containing the noise signal and the speech signal (hereinafter also referred to as the "acoustic model") is maximized (see FIG. 5).

The second EM algorithm is executed in the noise model parameter estimation means 306 described below. Using the estimated noise signal, the second expected value calculation processing (s403) and the noise GMM parameter update processing (s404) described below are repeated until a convergence condition is satisfied, so that the likelihood of the noise GMM is maximized (see FIG. 7).

The noise model estimation unit 105 is described in detail below with reference to FIGS. 4 and 5.

As shown in FIG. 4, for example, the noise model estimation unit 105 includes first initial value estimation means 302, probability model generation means 303, first expected value calculation means 304, noise signal estimation means 305, noise model parameter estimation means 306, and first convergence determination means 307.

(First initial value estimation means 302)
First, the first initial value estimation means 302 initializes an index i indicating the iteration count of the first EM algorithm (s301). Next, the first initial value estimation means 302 receives the log mel spectrum O_t, estimates the initial values λ^(i=0) = {w_{N,l}^(i=0), μ_{N,l}^(i=0), Σ_{N,l}^(i=0)} of the noise model parameters in the first EM algorithm by the following equation (s302), and outputs them to the probability model generation means 303. Here, A is the number of frames used for initial value estimation (for example, A = 10).

In the above equation, the superscript (i) indicates a parameter in the i-th iteration of the estimation. diag denotes a diagonal matrix whose elements are those in the parentheses, and the superscript T denotes transposition.
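The initialization equation itself is not reproduced in this text. One plausible reading, given the frame count A, the diag operator, and the transpose, is that the mean and diagonal covariance are computed from the first A frames of the log mel spectrum, on the assumption that these frames contain only noise. The Python sketch below follows that reading. The uniform weights 1/L, the sharing of one mean and variance across all L components, and the function name are assumptions of this sketch, not statements from the specification.

```python
def initial_noise_parameters(log_mel_frames, num_init_frames=10, num_components=4):
    """Estimate initial noise GMM parameters from the first A frames.

    Returns (weights, means, variances); each variance list is the
    diagonal of one covariance matrix.
    """
    head = log_mel_frames[:num_init_frames]
    count = len(head)
    dim = len(head[0])
    # Sample mean of the first A frames, per dimension.
    mean = [sum(frame[r] for frame in head) / count for r in range(dim)]
    # Diagonal of the sample covariance (outer-product terms off the diagonal are discarded).
    var = [sum((frame[r] - mean[r]) ** 2 for frame in head) / count
           for r in range(dim)]
    # Assumption: uniform weights and identical components to start with.
    weights = [1.0 / num_components] * num_components
    means = [list(mean) for _ in range(num_components)]
    variances = [list(var) for _ in range(num_components)]
    return weights, means, variances
```

In practice a small variance floor would be needed before these values are used in a density, since a dimension that is constant over the first A frames yields a variance of zero.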

(Probability model generation means 303)
The probability model generation means 303 generates the acoustic model using the noise GMM, the clean speech GMM, and the silence GMM (s303). For example, the probability model generation means 303 receives the noise model parameters λ^(i) of the i-th iteration from the first initial value estimation means 302 or the first convergence determination means 307, receives the parameters (w_{S,j,k}, μ_{S,j,k}, Σ_{S,j,k}) of the silence GMM and the clean speech GMM from the GMM storage unit 107, and uses these values to construct the probability model of the log mel spectrum O_t as the following GMM.

In the above equation, b_{O,j}^(i)(O_t) is the likelihood of the probability model (of the log mel spectrum O_t) generated by the probability model generation means 303. w_{O,j,k,l}^(i), μ_{O,j,k,l}^(i), and Σ_{O,j,k,l}^(i) are the mixture weight, mean vector, and diagonal covariance matrix, respectively, of the probability model of the log mel spectrum O_t generated from the noise model parameters λ^(i) = {w_{N,l}^(i), μ_{N,l}^(i), Σ_{N,l}^(i)} and the parameters (w_{S,j,k}, μ_{S,j,k}, Σ_{S,j,k}) of the silence GMM or the clean speech GMM, and are given by the following equations.
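Equations (9) to (12) are not reproduced in this text. The surrounding description (element-wise log and exp, a vector of ones, the identity matrix, and the Jacobian H of a function h) matches the first-order vector Taylor series combination associated with Non-Patent Document 1. Under that assumption, the parameters could be written as below; the order of the equations and the exact form of H are assumptions, and the fraction inside diag is taken element-wise.

```latex
w_{O,j,k,l}^{(i)} = w_{S,j,k}\, w_{N,l}^{(i)}

\mu_{O,j,k,l}^{(i)} = h\!\left(\mu_{S,j,k},\ \mu_{N,l}^{(i)}\right)
  = \mu_{S,j,k} + \log\!\left(\bar{1} + \exp\!\left(\mu_{N,l}^{(i)} - \mu_{S,j,k}\right)\right)

\Sigma_{O,j,k,l}^{(i)} = H_{j,k,l}^{(i)}\, \Sigma_{S,j,k}\, {H_{j,k,l}^{(i)}}^{\mathsf{T}}
  + \left(I - H_{j,k,l}^{(i)}\right) \Sigma_{N,l}^{(i)} \left(I - H_{j,k,l}^{(i)}\right)^{\mathsf{T}}

H_{j,k,l}^{(i)} = \frac{\partial h}{\partial \mu_{S,j,k}}
  = \mathrm{diag}\!\left(\frac{\bar{1}}{\bar{1} + \exp\!\left(\mu_{N,l}^{(i)} - \mu_{S,j,k}\right)}\right)
```

The derivative of h with respect to the noise mean is I - H, which is why the same matrix H appears in both terms of the covariance.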

In the above equations, the functions log(·) and exp(·) operate on each element of a vector. ⁻1 is a vector whose elements are all 1, I is the identity matrix, and H_{j,k,l}^(i) is the Jacobian matrix of the function h(·) in Equation (10). The subscript O indicates a likelihood or parameter of the GMM of the acoustic signal containing the speech signal and the noise signal. The probability model generation means 303 outputs the acoustic model parameters w_{O,j,k,l}^(i), μ_{O,j,k,l}^(i), and Σ_{O,j,k,l}^(i) obtained by Equations (9) to (12) to the first expected value calculation means 304.
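To make the combination of one speech Gaussian with one noise Gaussian concrete, the following Python sketch applies the standard first-order vector Taylor series formulas to diagonal covariances. It assumes h(μ_S, μ_N) = μ_S + log(1 + exp(μ_N - μ_S)) applied element-wise, which is consistent with the description above but is not quoted from the specification; the function name combine_vts is hypothetical.

```python
import math

def combine_vts(w_s, mu_s, var_s, w_n, mu_n, var_n):
    """Combine one speech Gaussian and one noise Gaussian (diagonal covariances).

    Returns (w_o, mu_o, var_o) for the Gaussian of the noisy log mel spectrum.
    """
    mu_o, var_o = [], []
    for ms, vs, mn, vn in zip(mu_s, var_s, mu_n, var_n):
        # h = ms + log(1 + exp(mn - ms)) = log(exp(ms) + exp(mn)),
        # evaluated around the larger mean to avoid overflow.
        hi, lo = max(ms, mn), min(ms, mn)
        mean = hi + math.log1p(math.exp(lo - hi))
        # Diagonal Jacobian element dh/d(ms) = exp(ms) / (exp(ms) + exp(mn)).
        h = math.exp(ms - mean)
        mu_o.append(mean)
        var_o.append(h * h * vs + (1.0 - h) * (1.0 - h) * vn)
    return w_s * w_n, mu_o, var_o
```

When the speech mean is far above the noise mean in a given dimension, h approaches 1 and the combined Gaussian reduces to the speech Gaussian; in the opposite case it reduces to the noise Gaussian. Applying the function to every pair (k, l) of a speech GMM with K components and a noise GMM with L components gives K × L combined components per model j.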

(First expected value calculation means 304)
The first expected value calculation means 304 receives the acoustic model parameters w_{O,j,k,l}^(i), μ_{O,j,k,l}^(i), and Σ_{O,j,k,l}^(i) from the probability model generation means 303, receives the log mel spectrum O_t of the acoustic signal from the acoustic feature extraction unit 104, and calculates the expected value of the cost function Q_1(·) of the probability model of the log mel spectrum O_t in the i-th iteration by the following equation (E-step) (s304).

In the above equation, O_{0:T-1} = {O_0, ..., O_t, ..., O_{T-1}}, T is the total number of frames of the log mel spectrum O_t, and P_{t,j}^(i) and P_{t,j,k,l}^(i) are the posterior probabilities, given by the following equations, of the GMM type j and of the normal distribution indices k and l, respectively, at frame t. In particular, P_{t,j=0}^(i) is defined as the speech absence probability and P_{t,j=1}^(i) as the speech presence probability.
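The posterior equations are not reproduced in this text. The Python sketch below shows one way such posteriors can be computed for a single frame, under two assumptions that are not confirmed by the specification: the two models j = 0 and j = 1 have equal prior probability, and the component posterior is normalized within each model j. The function names are hypothetical. The input log_terms[j][c] holds log(w_{O,j,k,l} N(O_t; μ_{O,j,k,l}, Σ_{O,j,k,l})) for model j and component pair c = (k, l).

```python
import math

def log_sum_exp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def frame_posteriors(log_terms):
    """Return (p_model, p_component) for one frame.

    p_model[j] is the posterior of model j (p_model[1] plays the role of
    the speech presence probability); p_component[j][c] is the posterior
    of component pair c within model j.
    """
    log_b = [log_sum_exp(terms) for terms in log_terms]   # log b_{O,j}(O_t)
    log_total = log_sum_exp(log_b)
    p_model = [math.exp(lb - log_total) for lb in log_b]
    p_component = [[math.exp(v - lb) for v in terms]
                   for terms, lb in zip(log_terms, log_b)]
    return p_model, p_component
```

Under these assumptions the posteriors over j sum to one for every frame, so the speech absence and speech presence probabilities are complementary.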

The first expected value calculation means 304 outputs the calculated first expected value Q_1 to the first convergence determination means 307, and outputs P_{t,j}^(i) and P_{t,j,k,l}^(i) to the noise signal estimation means 305.

(Noise signal estimation means 305)
The noise signal estimation means 305 estimates the noise signal using the acoustic signal (s305). For example, the noise signal estimation means 305 receives P_{t,j}^(i) and P_{t,j,k,l}^(i) from the first expected value calculation means 304, receives the log mel spectrum O_t of the acoustic signal from the acoustic feature extraction unit 104, estimates the log mel spectrum N_t^(i) of the noise signal used to update the noise model parameters λ^(i), and outputs it to the noise model parameter estimation means 306. The log mel spectrum N_t^(i) of the noise is estimated by the following equation.

(Noise model parameter estimation means 306)
The noise model parameter estimation means 306 estimates the noise model parameters λ^(i) using the log mel spectrum N_t^(i) of the noise signal as learning data (M-step) (s306), and outputs the estimated noise model parameters to the first convergence determination means 307. The specific method of estimating the noise model parameters λ^(i) is described later.

(First convergence determination means 307)
The first convergence determination means 307 receives the first expected value Q_1 from the first expected value calculation means 304 and uses it to determine whether the convergence condition is satisfied (s307). If the condition is satisfied, it sets λ = λ^(i), outputs λ, and ends the processing of the noise model estimation unit 105. If not, it outputs λ^(i) to the probability model generation means 303, sets i ← i + 1 (s308), and outputs a control signal to each means so that the processing of s303 to s306 is repeated. The convergence condition may be, for example, that the difference between the latest first expected value Q_1(O_{0:T-1}, λ^(i)) and the previous first expected value Q_1(O_{0:T-1}, λ^(i-1)) is at most a predetermined value, or that the iteration count i has reached a predetermined value. For example, it can be expressed by the following equation, where η_1 = 0.0001.
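The convergence formula is not reproduced in this text. The following Python sketch shows one common way to implement the two conditions described for the first convergence determination means 307: a threshold on the change in Q_1 and a cap on the iteration count. Treating η_1 as a bound on the relative change, the cap of 20 iterations, and the function name has_converged are all assumptions of this sketch.

```python
def has_converged(q_current, q_previous, iteration, eta=1e-4, max_iterations=20):
    """Return True when the EM loop should stop.

    q_previous is None on the first iteration, when no earlier value
    of the expected cost function exists yet.
    """
    if iteration >= max_iterations:
        return True          # iteration cap reached
    if q_previous is None:
        return False         # nothing to compare against yet
    # Assumption: eta bounds the relative change of the expected value.
    return abs(q_current - q_previous) <= eta * abs(q_current)
```

A relative threshold is used here because the magnitude of Q_1 grows with the number of frames T, so an absolute threshold would behave differently for short and long signals.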

<Details of Noise Model Parameter Estimation Means 306>
As shown in FIG. 6, for example, the noise model parameter estimation means 306 includes second initial value estimation means 402, second expected value calculation means 403, parameter update means 404, and second convergence determination means 405. The noise model parameter estimation means 306 performs processing according to the flow shown in FIG. 7 and estimates the noise model parameters λ^(i) from the log mel spectrum N_t^(i) of the noise estimated by the noise signal estimation means 305, by means of the second EM algorithm. The noise model parameter estimation means 306 is described in detail below with reference to FIGS. 6 and 7.

(Second initial value estimation means 402)
The second initial value estimation means 402 first initializes an index ii indicating the iteration count of the second EM algorithm (s401). Next, the second initial value estimation means 402 receives the log mel spectrum N_t^(i) of the noise, uses it to estimate the initial values λ^(ii=0) = {w_{N,l}^(ii=0), μ_{N,l}^(ii=0), Σ_{N,l}^(ii=0)} of the noise model parameters λ^(ii) in the second EM algorithm by the following equation, and outputs them to the second expected value calculation means 403.

In the above equation, the index (ii) indicates a parameter at the ii-th iteration of the estimation. GaussRand(·) is a generator of normally distributed random numbers.

(Second expected value calculation means 403)
The second expected value calculation means 403 receives the noise log mel spectrum N^(i)_t from the noise signal estimation means 305 and the noise model parameter λ^(ii) of the second EM algorithm from the second initial value estimation means 402 or the second convergence determination means 405. It computes the expected value of the cost function Q_2(·) of the noise GMM at the ii-th iteration by the following equation (E-step) (s403) and outputs it to the second convergence determination means 405.

In the above equation, N^(i)_{0:T−1} = {N^(i)_0, ..., N^(i)_t, ..., N^(i)_{T−1}}, and P^(ii)_{t,l} is the posterior probability of normal distribution number l at frame t, given by the following equation.

The second expected value calculation means 403 outputs the computed P^(ii)_{t,l} to the parameter update means 404.
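The posterior equation itself is not reproduced in this text. As an illustration only, the following sketch assumes the standard GMM responsibility, in which P_{t,l} is proportional to w_{N,l} times the normal density of N_t under component l, and further assumes diagonal covariances. The function names `log_gauss_diag` and `posteriors` are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance normal distribution at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(frame, weights, means, variances):
    """Return (posterior of each component for this frame, frame log-likelihood).

    Assumption: standard GMM responsibilities with diagonal covariances;
    the patent's own equation is not available in this text.
    """
    log_joint = [
        math.log(w) + log_gauss_diag(frame, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the normalization stable for very small densities.
    peak = max(log_joint)
    log_norm = peak + math.log(sum(math.exp(lj - peak) for lj in log_joint))
    return [math.exp(lj - log_norm) for lj in log_joint], log_norm
```

Summing the returned frame log-likelihoods over all frames gives one possible quantity to monitor for convergence, although the cost function Q_2 of the embodiment may be defined differently.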

(Parameter update means 404)
The parameter update means 404 receives P^(ii)_{t,l}, receives the noise log mel spectrum N^(i)_t from the noise signal estimation means 305, updates the noise model parameter λ^(ii) by the following equation (M-step) (s404), and outputs the updated noise model parameter λ^(ii) to the second convergence determination means 405.
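The update equation is not reproduced in this text. The sketch below assumes the textbook maximum-likelihood M-step for a diagonal-covariance GMM (soft counts, weighted means, weighted variances). The function name `m_step` and the variance floor `var_floor` are hypothetical additions for illustration.

```python
def m_step(frames, post, var_floor=1e-6):
    """Re-estimate GMM weights, means, and diagonal variances.

    frames: list of T feature vectors (each of dimension R).
    post:   post[t][l] is the posterior of component l at frame t.
    Assumption: textbook GMM update; var_floor is a hypothetical safeguard
    against zero variances and is not taken from the patent.
    """
    num_frames, dim, num_comp = len(frames), len(frames[0]), len(post[0])
    weights, means, variances = [], [], []
    for l in range(num_comp):
        occ = sum(post[t][l] for t in range(num_frames))  # soft frame count
        mean = [
            sum(post[t][l] * frames[t][r] for t in range(num_frames)) / occ
            for r in range(dim)
        ]
        var = [
            max(
                sum(post[t][l] * (frames[t][r] - mean[r]) ** 2
                    for t in range(num_frames)) / occ,
                var_floor,
            )
            for r in range(dim)
        ]
        weights.append(occ / num_frames)
        means.append(mean)
        variances.append(var)
    return weights, means, variances
```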

(Second convergence determination means 405)
The second convergence determination means 405 receives the second expected value Q_2 from the second expected value calculation means 403 and uses it to determine whether the convergence condition is satisfied (s405). If it is satisfied, λ^(i) = λ^(ii) is set, λ^(i) is output, and the processing of the noise model parameter estimation means 306 ends. If not, λ^(ii) is output to the second expected value calculation means 403, ii ← ii + 1 (s406), a control signal is output to each unit so that the iteration continues, and the processing of s403 and s404 is repeated. For example, the convergence condition may be that the difference between the latest second expected value Q_2(O_{0:T−1}, λ^(i)) and the previous second expected value Q_2(O_{0:T−1}, λ^(i−1)) is less than or equal to a predetermined value, or that the number of iterations ii is greater than or equal to a predetermined value. For example, it can be expressed by the following formula.

where η_2 = 0.0001.
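The convergence formula is not reproduced in this text. As one possible reading, the sketch below stops when the relative change of the expected value falls to the threshold or below, or when an iteration cap is reached. The relative (rather than absolute) difference, the function name `has_converged`, and the cap `max_iter` are assumptions; only the threshold 0.0001 comes from the description.

```python
def has_converged(q_new, q_old, iteration, eta=1e-4, max_iter=100):
    """Stopping rule for an EM loop.

    Assumption: relative change of the expected value compared with eta;
    the patent's exact formula is not available. max_iter is hypothetical.
    """
    if iteration >= max_iter:
        return True          # iteration count reached the predetermined value
    if q_old is None:
        return False         # nothing to compare against on the first pass
    denom = max(abs(q_new), 1e-12)  # guard against division by zero
    return abs(q_new - q_old) / denom <= eta
```

A loop would call this after each E-step, passing `None` as `q_old` on the first iteration.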

<Noise Suppression Unit 106>
The noise suppression unit 106 suppresses the noise signal in the acoustic signal using the noise GMM (s106). For example, as shown in FIG. 8, the noise suppression unit 106 includes a noise suppression filter estimation means 501 and a noise suppression filter application means 502. The noise suppression filter estimation means 501 receives the log mel spectrum O_t of the acoustic signal, the parameters {w_{S,j,k}, μ_{S,j,k}, Σ_{S,j,k}} of the silence GMM and the clean speech GMM, and the noise model parameter λ, and estimates the noise suppression filter W^Lin_{t,m}. The noise suppression filter application means 502 receives the complex spectrum Spc_t and the noise suppression filter W^Lin_{t,m}, suppresses the noise, and obtains the noise-suppressed signal ^s_τ. Each means is described in detail below.

(Noise suppression filter estimation means 501)
The noise suppression filter estimation means 501 performs processing according to the flow shown in FIG. 9. First, it receives the parameters of the silence GMM and the clean speech GMM and the noise model parameter, and uses them to generate the parameters of the probability model of the log mel spectrum O_t of the acoustic signal as follows (s601).

Next, the noise suppression filter estimation means 501 computes the speech absence/presence probability P^(i)_{t,j} and the posterior probability P_{t,j,k,l}, using the parameters obtained for the probability model of the log mel spectrum O_t together with the log mel spectrum O_t itself (s602).

Next, the noise suppression filter estimation means 501 estimates the noise suppression filter W^Mel_{t,r} on the mel frequency axis by the following equation, using the silence GMM parameters, the clean speech GMM parameters, the noise model parameter, the speech absence/presence probability P^(i)_{t,j}, and the posterior probability P_{t,j,k,l} (s603).

The above equation is written element by element for the vectors.

Next, the noise suppression filter estimation means 501 converts the noise suppression filter W^Mel_{t,r} on the mel frequency axis into the noise suppression filter W^Lin_{t,m} on the linear frequency axis (s604) and outputs it to the noise suppression filter application means 502. The conversion applies cubic spline interpolation along the mel frequency axis to estimate the values of the noise suppression filter on the linear frequency axis.
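The mel-to-linear conversion can be pictured with the following sketch. To keep it short and dependency-free, plain piecewise-linear interpolation stands in for the cubic spline of the embodiment, so it illustrates only the resampling idea, not the described method. The mel formula 2595·log10(1 + f/700), the filter center frequencies, and the function names are assumptions that do not come from the description.

```python
import bisect
import math

def hz_to_mel(f):
    # Assumption: a commonly used mel formula; the patent does not state one.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_linear_gains(mel_gains, center_hz, num_bins, sample_rate):
    """Resample per-mel-channel gains onto num_bins linear bins (0..Nyquist).

    Piecewise-linear interpolation along the mel axis is used here as a
    stand-in for cubic spline interpolation. Bins outside the range of the
    channel centers take the nearest channel's gain.
    """
    centers = [hz_to_mel(f) for f in center_hz]
    out = []
    for m in range(num_bins):
        x = hz_to_mel(m * (sample_rate / 2.0) / (num_bins - 1))
        if x <= centers[0]:
            out.append(mel_gains[0])
        elif x >= centers[-1]:
            out.append(mel_gains[-1])
        else:
            hi = bisect.bisect_right(centers, x)  # first center above x
            lo = hi - 1
            frac = (x - centers[lo]) / (centers[hi] - centers[lo])
            out.append(mel_gains[lo] + frac * (mel_gains[hi] - mel_gains[lo]))
    return out
```

Replacing the linear segment with a spline evaluation would bring the sketch closer to the embodiment.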

(Noise suppression filter application means 502)
The noise suppression filter application means 502 performs processing according to the flow shown in FIG. 10. It receives the noise suppression filter W^Lin_{t,m} from the noise suppression filter estimation means 501 and multiplies the complex spectrum Spc_t received from the acoustic feature extraction unit 104 by W^Lin_{t,m} as in the following equation, obtaining the noise-suppressed complex spectrum ^S_{t,m} (s701).

Next, the noise suppression filter application means 502 applies the inverse fast Fourier transform to the complex spectrum ^S_{t,m} to obtain the noise-suppressed signal ^s_{t,n} at frame t (s702).

Next, the noise suppression filter application means 502 concatenates the noise-suppressed signals ^s_{t,n} of the individual frames while removing the window function w_n, as in the following equation, to obtain the continuous noise-suppressed signal ^s_τ (s703), and outputs it as the output of the noise suppression device 100.
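The concatenation equation is not reproduced in this text. One common way to join overlapping frames while undoing the analysis window is to overlap-add the frames and divide each sample by the sum of the window values that covered it; the sketch below assumes that scheme. The function names are hypothetical, and the Hamming window is written with the usual 0.54/0.46 coefficients.

```python
import math

def hamming(length):
    """Hamming window with the usual 0.54 / 0.46 coefficients."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def overlap_add(frames, window, shift):
    """Join windowed frames into one signal and undo the window.

    Assumption: each output sample is the sum of overlapping frame samples
    divided by the sum of the window values applied at that position.
    """
    frame_len = len(window)
    total = shift * (len(frames) - 1) + frame_len
    acc = [0.0] * total   # overlap-added frame samples
    norm = [0.0] * total  # overlap-added window values
    for t, frame in enumerate(frames):
        start = t * shift
        for n in range(frame_len):
            acc[start + n] += frame[n]
            norm[start + n] += window[n]
    return [a / w if w > 1e-8 else 0.0 for a, w in zip(acc, norm)]
```

With unmodified frames this reproduces the input signal, so any difference in the output comes from the suppression filter alone.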

<Effect>
With this configuration, the learning data can include not only the noise signal in time intervals where only noise is present, but also the noise signal in time intervals where both noise and speech are present. In other words, the noise signal can be used as learning data regardless of whether a speech signal is present. As a result, more of the variation and characteristics of the noise signal can be reflected in its probability model, the noise signal can be modeled more accurately, and noise suppression can be performed with high accuracy. The estimated noise signal may contain errors, but because probability model estimation models the statistical properties of the learning data, such errors are not a fatal problem.

Non-Patent Document 1 discloses a method that optimizes the probability model of the acoustic signal with the EM algorithm using the entire recorded acoustic signal. However, that technique performs noise suppression on the premise that the characteristics of the noise signal contained in the acoustic signal are stationary and that its distribution (frequency distribution or probability distribution) is unimodal. Many noise signals in real environments have non-stationary characteristics, and their distributions are often multimodal. The technique of Non-Patent Document 1 therefore cannot handle noise signals with non-stationary characteristics and multimodal distributions, and may not achieve sufficient noise suppression performance.

In the present embodiment, on the premise that the distribution of the noise signal is multimodal rather than unimodal, the probability model of the noise signal is expressed by a GMM instead of a single normal distribution and is estimated with the EM algorithm. With this configuration, a noise signal with non-stationary characteristics and a multimodal distribution can be modeled appropriately and suppressed effectively. Modeling such a noise signal requires a more complex model, and therefore more data, than modeling a unimodal noise signal. As described above, however, the present embodiment can use the noise signal as learning data regardless of whether a speech signal is present, so a large amount of data can be obtained and an optimal noise suppression filter can be designed.

Therefore, with the configuration of the present embodiment, even in an environment where various kinds of noise are present, the noise signal can be suppressed from the acoustic signal and the target speech signal can be extracted with high quality.

[Simulation results]
To show the effect of the present invention, an example is given in which an acoustic signal containing both a speech signal and a noise signal is input to the noise suppression device of the present invention and noise suppression is performed. The experimental method and results are described below.

In this experiment, the evaluation data consist of 100 utterances spoken by 23 male speakers, taken from IPA (Information-technology Promotion Agency, Japan)-98-TestSet. Noise recorded separately in an airport lobby, on a station platform, and on a street was added to these speech data on a computer at S/N ratios of 0 dB, 5 dB, and 10 dB. That is, nine sets of evaluation data were created (three noise types × three S/N ratios). Each speech recording is a monaural signal sampled at 16,000 Hz with 16-bit quantization. The acoustic feature extraction unit 104 was applied to this acoustic signal with a frame length of 20 ms (Frame = 320 sample points) and the frame start point shifted every 10 ms (Shift = 160 sample points).
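The description does not say how the noise was scaled to reach each S/N ratio. A minimal sketch, assuming the ratio is defined over the average power of the whole utterance and that the noise recording is at least as long as the speech, is shown below; the function name `mix_at_snr` is hypothetical.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the resulting S/N ratio equals snr_db.

    Assumption: S/N ratio = 10*log10(mean speech power / mean noise power),
    measured over the full utterance; noise must be at least as long as speech.
    """
    length = len(speech)
    p_speech = sum(s * s for s in speech) / length
    p_noise = sum(n * n for n in noise[:length]) / length
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling it with `snr_db` set to 0, 5, and 10 for each of the three noise recordings would yield nine conditions of the kind described above.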

The silence GMM and the clean speech GMM were GMMs with K = 128 mixture components, using an R = 24-dimensional log mel spectrum as the acoustic feature, and were trained on silence signals and clean speech signals, respectively. The number of mixture components L of the noise GMM was set to four values (1, 2, 3, and 4), and the results for each case are compared.

Performance was evaluated by speech recognition, using the word error rate (WER) defined by the following equation as the evaluation measure.

In the above equation, N is the total number of words, D is the number of deletion errors, S is the number of substitution errors, and I is the number of insertion errors. A smaller WER indicates higher speech recognition performance.
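The WER equation is not reproduced in this text. A minimal sketch, assuming the customary definition WER = (D + S + I) / N × 100 [%] with the error counts taken from a minimum edit-distance alignment, is shown below. The function name `word_error_rate` is hypothetical, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent from a minimum edit-distance word alignment.

    Assumption: WER = (D + S + I) / N * 100, where N is the number of
    reference words. The reference must contain at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

The sketch returns only the total; separating D, S, and I would require backtracking through the alignment.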

Speech recognition was performed with a recognizer based on finite-state transducers (T. Hori, et al., "Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition", IEEE Trans. on ASLP, May 2007, vol. 15, no. 4, pp. 1352-1365). The acoustic model is a speaker-independent triphone HMM; each HMM is a three-state left-to-right HMM, and each state has 16 normal distributions. The total number of HMM states is 2,000. The acoustic feature for speech recognition is a 39-dimensional vector consisting of 12-dimensional MFCCs (Mel-frequency cepstral coefficients), the log power, and the first- and second-order regression coefficients of each, analyzed with a frame length of 20 ms (Frame = 320 sample points) and a frame shift of 10 ms (Shift = 160 sample points). The language model is a trigram with a vocabulary of 20,000 words.
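The 39 dimensions follow from 13 static values (12 MFCCs plus log power) and their first- and second-order regression coefficients (13 × 3 = 39). The sketch below shows one common way to compute such regression coefficients. The regression window of ±2 frames, the repetition of edge frames, and the function names are assumptions not stated in the description.

```python
def delta(features, width=2):
    """Regression coefficients over +/-width frames.

    Assumption: the usual least-squares slope formula
    sum_k k*(c[t+k] - c[t-k]) / (2*sum_k k^2), with edge frames repeated.
    """
    num_frames = len(features)
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[0])):
            acc = 0.0
            for k in range(1, width + 1):
                nxt = features[min(t + k, num_frames - 1)][d]
                prv = features[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def stack_static_and_deltas(static):
    """Append first- and second-order regression coefficients per frame."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static values per frame, each stacked vector has 39 elements.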

FIG. 11 shows examples of noise model estimation. Panels 801 to 805 show the noise models estimated by the method disclosed in Non-Patent Document 1 and by the present invention (L = 1, 2, 3, 4), respectively, as distributions of the noise log mel spectrum obtained from the eighth mel filter (center frequency 1,022 Hz). In each panel, the broken line is the histogram of the noise log mel spectrum, the solid line is the shape of the noise probability model estimated by each method, and the dotted lines are the shapes of the component distributions that make up the noise GMM. The vertical axis shows normalized frequency or probability, and the horizontal axis shows the value of the noise mel spectrum obtained from the eighth mel filter.

The results in FIG. 11 show that the present invention estimates a noise probability model whose shape is closer to the histogram of the noise log mel spectrum than that of the method disclosed in Non-Patent Document 1. In particular, the results for 804 (L = 3) and 805 (L = 4) have almost the same shape as the histogram.

FIG. 12 shows the waveform of the speech signal, the waveform of the input acoustic signal (airport lobby noise, S/N ratio 0 dB), and the waveforms of the noise-suppressed signals obtained with the present invention (L = 1, 2, 3, 4). It can be seen that the present invention suppresses the noise effectively.

FIG. 13 shows the results of noise suppression, namely the speech recognition evaluation results for the method disclosed in Non-Patent Document 1 and for the present invention (L = 1, 2, 3, 4). The results in FIG. 13 show that the present invention achieves higher performance than the prior art.

<Other embodiments>
In the above embodiment, the window function w_n used in the frame extraction processing s201 of the acoustic feature extraction unit 104 need not be a Hamming window; a rectangular window, a Hanning window, a Blackman window, or another window function may be used.

In the above embodiment, another probability model such as an HMM (hidden Markov model) may be used instead of the silence GMM and the clean speech GMM as the speech model stored in advance in the storage unit. The stored speech model is also not limited to the two GMMs (silence GMM and clean speech GMM); more GMMs may be used. For example, a silence GMM, an unvoiced sound GMM, and a voiced sound GMM, or a GMM for each phoneme, may be used.

In the above embodiment, another probability model such as an HMM may be used as the noise model instead of the noise GMM. In that case, if each state of the HMM is expressed by a mixture of normal distributions (GMM) or the like, a noise signal with a multimodal distribution can be modeled as in the first embodiment.

In the above embodiment, the noise suppression filter estimation processing s603 of the noise suppression filter estimation means 501 may use the estimation result with the largest weight as it is, instead of the weighted average. In this case, it is desirable that this weight be sufficiently large compared with the weights of the other estimation results.
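The two alternatives can be contrasted with the following sketch, which combines several candidate filter estimates either by their weighted average or by taking the single estimate with the largest weight. The function name, the fallback to the weighted average, and the dominance threshold of 0.9 are hypothetical; the description only states that a sufficiently large weight is desirable.

```python
def combine_estimates(estimates, weights, use_max=False, min_dominant_weight=0.9):
    """Combine candidate filter vectors by weighted average or by max weight.

    Assumption: when use_max is requested but no normalized weight reaches
    min_dominant_weight (a hypothetical threshold), the weighted average is
    used instead.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    best = max(range(len(norm)), key=norm.__getitem__)
    if use_max and norm[best] >= min_dominant_weight:
        return list(estimates[best])  # take the dominant estimate unchanged
    dim = len(estimates[0])
    return [sum(w * e[r] for w, e in zip(norm, estimates)) for r in range(dim)]
```

Taking the maximum avoids the summation over all components, at the cost of discarding the information carried by the remaining weights.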

In the above embodiment, signals are passed directly between the units and means. Alternatively, the signals may be stored in a storage unit (not shown) and passed between the units and means via that storage unit.

The present invention is not limited to the above embodiments and modifications. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as required. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The noise suppression device described above can also be implemented on a computer. In this case, a program that causes the computer to function as the target device (a device having the functional configuration shown in the figures of the embodiments), or a program that causes the computer to execute each step of the processing procedure (as shown in the embodiments), is loaded into the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or downloaded via a communication line, and then executed.

The present invention can be used as a stage preceding automatic speech recognition: noise is suppressed from the acoustic signal, and the noise-suppressed signal is used for recognition. It can also be used to suppress the noise signal in received or recorded acoustic signals in communication systems such as TV conference systems and in recording systems.

Claims

A noise suppression device that suppresses a noise signal in an acoustic signal including the noise signal and a speech signal, comprising:
an acoustic feature extraction unit that extracts an acoustic feature of the acoustic signal;
a storage unit that stores a probability model of a speech signal that does not include noise (hereinafter referred to as “speech model”);
a noise model estimation unit that estimates the noise signal using the acoustic feature of the acoustic signal and the speech model, and performs unsupervised learning of a probability model of the noise signal (hereinafter referred to as “noise model”) using the estimated noise signal as learning data; and
a noise suppression unit that suppresses the noise signal in the acoustic signal using the speech model and the noise model.

The noise suppression device according to claim 1, wherein
the noise signal is defined as a signal based on non-stationary noise that follows a multimodal distribution, and the noise model estimation unit performs unsupervised learning of the noise model.

The noise suppression device according to claim 1 or 2, wherein
the noise model estimation unit includes:
probability model generation means for generating a probability model of the acoustic signal (hereinafter referred to as “acoustic model”) using the noise model and the speech model;
noise signal estimation means for estimating the noise signal using the acoustic signal; and
noise model parameter estimation means for estimating the parameters of the noise model using the noise signal as learning data, and
the processing of the probability model generation means, the noise signal estimation means, and the noise model parameter estimation means is repeated, using the acoustic signal, by an expectation-maximization method until a convergence condition is satisfied, so that the likelihood of the acoustic model is maximized.

A noise suppression method for suppressing a noise signal in an acoustic signal including the noise signal and a speech signal, comprising:
an acoustic feature extraction step of extracting an acoustic feature of the acoustic signal;
a noise model estimation step of estimating the noise signal using the acoustic feature of the acoustic signal and a probability model of a speech signal that does not include noise (hereinafter referred to as “speech model”), and performing unsupervised learning of a probability model of the noise signal (hereinafter referred to as “noise model”) using the estimated noise signal as learning data; and
a noise suppression step of suppressing the noise signal in the acoustic signal using the noise model.

The noise suppression method according to claim 4, wherein
the noise signal is defined as a signal based on non-stationary noise that follows a multimodal distribution, and unsupervised learning of the noise model is performed in the noise model estimation step.

The noise suppression method according to claim 4 or 5, wherein
the noise model estimation step includes:
a probability model generation sub-step of generating a probability model of the acoustic signal (hereinafter referred to as “acoustic model”) using the noise model and the speech model;
a noise signal estimation sub-step of estimating the noise signal using the acoustic signal; and
a noise model parameter estimation sub-step of estimating the parameters of the noise model using the noise signal as learning data, and
the probability model generation sub-step, the noise signal estimation sub-step, and the noise model parameter estimation sub-step are repeated, using the acoustic signal, by an expectation-maximization method until a convergence condition is satisfied, so that the likelihood of the acoustic model is maximized.

A program for causing a computer to function as the noise suppression device according to claim 1.