JP2014029407A - Noise suppression device, method and program

Info

Publication number: JP2014029407A
Application number: JP2012169893A
Authority: JP
Inventors: Masakiyo Fujimoto (藤本 雅清); Tomohiro Nakatani (中谷 智広)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-07-31
Filing date: 2012-07-31
Publication date: 2014-02-13
Anticipated expiration: 2032-07-31
Also published as: JP5740362B2

Abstract

PROBLEM TO BE SOLVED: To provide a noise suppression technique that suppresses non-stationary noise signals in an input signal.

SOLUTION: Using a silence model and a speech model, spectral information of the input signal and an acoustic feature used for estimating the parameters of a noise model are extracted from the input signal; noise model parameters and speaker-dependent speaker adaptation parameters are estimated using the acoustic feature, the silence model, and the speech model; a noise suppression filter is estimated using the acoustic feature, the silence model, the speech model, the noise model parameters, and the speaker adaptation parameters; and a target signal is output by applying the noise suppression filter to the spectral information. The noise model is expressed as a weighted mixture probability distribution. The posterior probabilities used to estimate the noise model parameters and the speaker adaptation parameters are estimated by marginalizing the posterior probabilities obtained from the input signal and the weighted mixture probability distribution of the input signal, which is obtained by combining the silence model and the speech model with the noise model.

Description

The present invention relates to a noise suppression technique for extracting a desired signal from a signal containing multiple acoustic signals by suppressing the noise signal.

When automatic speech recognition is used in a real environment, the noise must be removed from an acoustic signal that contains signals other than the speech signal to be processed, that is, noise, so that only the desired speech signal is extracted. The use of automatic speech recognition in real environments is highly anticipated in the coming information society, and this problem should be solved promptly.

Non-Patent Document 1 takes as input a mixture of a speech signal and a noise signal, and generates a probability model of the input signal from probability models of the speech signal and the noise signal that were estimated in advance. In doing so, the differences between the speech and noise probability models that make up the input-signal model and the statistics of the speech and noise signals actually contained in the input signal are expressed with a Taylor series approximation. These differences are estimated with the EM algorithm to optimize the probability model of the input signal. Noise is then suppressed using the parameters of the optimized input-signal model and the speech-signal model.

Non-Patent Document 2 takes as input a mixture of a speech signal and a noise signal. It adapts a probability model of the speech signal, trained on speech data from many speakers, to the characteristics of the speaker of the speech contained in the input signal (speaker adaptation). To handle noise signals whose statistical properties follow a multimodal distribution, it extracts the speech signal and the noise signal from the input signal and uses them to estimate the speaker adaptation parameters and a multimodal probability model of the noise signal with nested EM algorithms. It then generates an optimal probability model of the input signal from the speaker-adapted speech model and the estimated noise model, and suppresses noise using the parameters of the optimal input-signal model and the speaker-adapted speech model.

Non-Patent Document 1: P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment-independent speech recognition," in Proceedings of ICASSP '96, vol. II, pp. 733-736, May 1996.
Non-Patent Document 2: Masakiyo Fujimoto, Shinji Watanabe, and Tomohiro Nakatani, "Noise suppression by simultaneous application of speaker adaptation and noise mixture model estimation" (in Japanese; original title: 話者適応と雑音混合モデル推定の同時適用による雑音抑圧), IEICE, Speech Technical Committee, SP2011-91, pp. 113-118, Dec. 2011.

An indispensable technology for automatic speech recognition in real environments is noise suppression, which removes noise from the input acoustic signal to obtain a high-quality speech signal.

Non-Patent Document 1 discloses a method that uses the captured input signal to optimize, with the EM algorithm, the probability models of the speech signal and the noise signal that make up the probability model of the input signal, thereby optimizing the input-signal model. However, it performs noise suppression on the premise that the characteristics of the noise signal in the input are stationary and that its distribution (frequency distribution or probability distribution) is unimodal. Many noise signals in real environments are non-stationary, and their distributions are often multimodal. The technique of Non-Patent Document 1 therefore cannot handle non-stationary noise signals and does not achieve sufficient noise suppression performance. In addition, because the relationship between the speech signal and the noise signal in the input is expressed by a nonlinear function, no analytical solution is available when estimating the parameters of the speech and noise probability models, even with a Taylor series approximation. As a result, the technique of Non-Patent Document 1 cannot obtain optimal probability model parameters for the speech and noise signals, and again does not achieve sufficient noise suppression performance.

Non-Patent Document 2 discloses a method that extracts a speech signal and a noise signal from the captured input signal and uses them to estimate the speaker adaptation parameters and a probability model of the noise signal that follows a multimodal distribution, which makes it possible to handle non-stationary noise signals. In that technique, the speaker adaptation parameters and the multimodal noise model are each estimated by a different EM algorithm. In addition, a probability model of the input signal is generated from the speaker-adapted speech model and the estimated noise model, and estimation is repeated with the EM algorithm until the parameters of the input-signal model converge. The technique therefore involves three different EM algorithms arranged in a nested structure, which requires an extremely large amount of computation and is not practical. Furthermore, when the speech and noise signals extracted from the input signal contain many extraction errors, the estimation accuracy of the speaker adaptation parameters and of the multimodal noise model degrades. The technique of Non-Patent Document 2 does not take such extraction errors into account.

In view of this situation, an object of the present invention is to provide a noise suppression technique that uses a probability model of the noise signal based on a multimodal distribution and can effectively suppress non-stationary noise signals in the input signal.

The noise suppression technique of the present invention takes as its input signal an acoustic signal containing a speech signal and a noise signal, and outputs a signal in which the noise signal has been suppressed (the target signal). Using a probability model of the silence signal (silence model) and a probability model of the speech signal (speech model), it performs three processes. First, an acoustic feature extraction process outputs, from the input signal, spectral information of the input signal and an acoustic feature used for estimating the parameters of a probability model of the noise signal (noise model). Second, a parameter estimation process uses the acoustic feature, the silence model, and the speech model to estimate the parameters that represent the noise model (noise model parameters) and a speaker-dependent parameter used to represent the probability models (speaker adaptation parameter). Third, a noise suppression process uses the acoustic feature, the silence model, the speech model, the noise model parameters, and the speaker adaptation parameter to estimate a noise suppression filter, and outputs the target signal by applying that filter to the spectral information.

The noise model is expressed as a weighted mixture of multiple probability distributions. In the parameter estimation process, the posterior probabilities used for estimating the noise model parameters and the speaker adaptation parameter are estimated by marginalizing the posterior probabilities obtained from the input signal and the weighted mixture probability distribution of the input signal, which is obtained by combining the silence model and the speech model with the noise model.

According to the present invention, as described in detail later, non-stationary noise signals in the input signal can be suppressed effectively. Because the configuration does not nest multiple EM algorithms, it also does not require a large amount of computation.

A diagram showing an example of the functional configuration of the noise suppression device according to an embodiment of the present invention.
A diagram showing the processing procedure of the acoustic feature extraction unit.
A diagram showing an example of the functional configuration of the parameter estimation unit.
A diagram showing the processing procedure of the parameter estimation unit.
A diagram showing the method of selecting reliable data based on the SN ratio.
A diagram showing an example of the functional configuration of the noise suppression unit.
A diagram showing the processing procedure of the noise suppression filter estimation unit.
A diagram showing the processing procedure of the noise suppression filter application unit.
A diagram showing the method of selecting reliable data based on the log probability ratio.
A diagram showing an example of noise suppression according to the present invention.

An embodiment of the present invention is described below with reference to the drawings. In the drawings used in the following description, identical components are given identical reference numerals. Their names and functions are also identical, and their descriptions are not repeated. In the following description, symbols such as "^" and "~" used in the text should properly appear directly above the character that immediately follows them, but because of the limitations of text notation they are written immediately before that character. In the equations, these symbols appear in their proper positions. Vectors are written in the text in the form A^→, but in bold in the equations. Matrices are written in a regular font in the text, but in bold in the equations. Unless otherwise noted, processing described for an individual vector element is applied to all elements of all vectors and matrices.

<Outline>
Estimating the speaker adaptation parameters and a probability model of a noise signal that follows a multimodal distribution requires, in addition to the speech signal and the noise signal, the posterior probability (occupancy) of each distribution contained in the speech-signal and noise-signal probability models. The posterior probability is defined as the expected value of a hidden variable. A hidden variable indicates which distribution in the probability model generated the signal at a given time; it is called hidden because it cannot be observed directly from the signal alone. The posterior probability can be estimated easily with the EM algorithm, but the EM algorithm is an iterative parameter estimation algorithm and is comparatively expensive to compute. Applying separate EM algorithms to estimate the speaker adaptation parameters and the multimodal noise model therefore increases the amount of computation substantially. In addition, optimizing the probability model of the input signal, which is generated from the speaker-adapted speech model and the estimated noise model, requires yet another EM algorithm. Because the EM algorithms for the speaker adaptation parameters and the multimodal noise model are nested inside the EM algorithm that optimizes the input-signal model, an extremely large amount of computation is required.

The accuracy with which the speech signal and the noise signal are extracted from the input signal also depends on the SN ratio (signal-to-noise ratio) at each unit of time. At times where the energy of the speech signal is dominant (high SN ratio), the speech signal is extracted accurately and the noise signal is not. Conversely, where the energy of the speech signal is weak (low SN ratio), the speech signal is extracted inaccurately and the noise signal accurately. If the speaker adaptation parameters and the multimodal noise model are estimated without taking this extraction accuracy into account, the estimation performance of both degrades.

For this reason, the present invention estimates the posterior probabilities of the speech-signal and noise-signal probability models by marginalizing the posterior probabilities of the input-signal probability model, and estimates the speaker adaptation parameters and the multimodal noise model while taking into account the accuracy with which the speech and noise signals are extracted from the input signal. This removes the nested structure of EM algorithms, which greatly reduces the amount of computation, and improves the estimation of both the speaker adaptation parameters and the multimodal noise model.

In the embodiment, a Gaussian mixture model (GMM) is used as the probability model of the speech signal and as the probability model of the noise signal based on a multimodal distribution.

<Overall configuration of noise suppression device 100>
FIG. 1 shows an embodiment of the noise suppression device according to the present invention. Reference numeral 100 in the figure denotes the noise suppression device of the embodiment. The noise suppression device 100 includes three units. The acoustic feature extraction unit 104 extracts the features needed for noise suppression, a complex spectrum 112 and a log mel spectrum 113, from an input signal 103 in which a speech signal 101 and a noise signal 102 are mixed. The parameter estimation unit 105 uses the log mel spectrum 113 together with the silence GMM 108 and the clean speech GMM 109 stored in the GMM storage unit 107 to estimate the speaker adaptation parameter 110 and the parameter set 111 of the noise GMM, which is the probability model of the noise. The noise suppression unit 106 designs a noise suppression filter using the complex spectrum 112, the log mel spectrum 113, the silence GMM 108, the clean speech GMM 109, the speaker adaptation parameter 110, and the noise GMM parameter set 111, and suppresses the noise to obtain the noise-suppressed signal 114.

<Configuration of acoustic feature extraction unit 104>
The acoustic feature extraction unit 104 performs processing according to the flow shown in FIG. 2.
First, the frame extraction process S201 cuts the input signal o_τ 103 (τ is the sample index of the discrete signal) into frames of fixed length while moving the start point along the time axis in fixed steps. For example, acoustic signals o_{t,n} of Frame = 320 sample points (16,000 Hz × 20 ms) are cut out while the start point is moved by Shift = 160 sample points (16,000 Hz × 10 ms) at a time. Each frame is multiplied by a window function w_n, for example a Hamming window such as the following. Here t is the frame number and n denotes the n-th sample point within the frame.
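The frame extraction step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cut_frames` is hypothetical, and the window is NumPy's `np.hamming`, which computes 0.54 − 0.46·cos(2πn/(Frame − 1)). The source's own window equation is not reproduced in this text, so that exact form is an assumption.

```python
import numpy as np

FRAME = 320  # samples per frame (16,000 Hz x 20 ms)
SHIFT = 160  # samples between frame starts (16,000 Hz x 10 ms)

def cut_frames(signal, frame=FRAME, shift=SHIFT):
    """Cut a 1-D signal into overlapping, Hamming-windowed frames (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame:
        return np.empty((0, frame))
    num_frames = 1 + (len(signal) - frame) // shift
    # Assumed window: 0.54 - 0.46 * cos(2 * pi * n / (frame - 1))
    window = np.hamming(frame)
    starts = shift * np.arange(num_frames)
    index = starts[:, None] + np.arange(frame)[None, :]  # (num_frames, frame) sample indices
    return signal[index] * window

# One second of a constant signal at 16 kHz, only to show the resulting shape.
frames = cut_frames(np.ones(16000))
```

With these settings consecutive frames overlap by half a frame, so every sample away from the signal edges contributes to two frames.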

Next, an M-point fast Fourier transform S202 (M is a power of 2 no smaller than Frame, for example 512) is applied to o_{t,n} to obtain the complex spectrum Spc^→_t = {Spc_{t,0}, …, Spc_{t,m}, …, Spc_{t,M-1}} 112, where m is the frequency bin number. The mel filter bank analysis S203 and the logarithm S204 are then applied to the absolute values of Spc_{t,m} to compute the vector O^→_t = {O_{t,0}, …, O_{t,r}, …, O_{t,R-1}} 113, whose elements form an R-dimensional (for example R = 24) log mel spectrum; r is the element number of the vector. The acoustic feature extraction unit 104 thus outputs the complex spectrum Spc^→_t 112 and the log mel spectrum O^→_t 113. The complex spectrum Spc^→_t 112 is input to the noise suppression unit 106, and the log mel spectrum O^→_t 113 is input to both the parameter estimation unit 105 and the noise suppression unit 106.
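The FFT, mel filter bank, and logarithm steps might look like the sketch below. The source does not specify the filter bank design, so the triangular filters, the mel-scale formula, the use of only the non-redundant half of the spectrum, and the flooring constant are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def mel_filterbank(num_filters=24, fft_size=512, sample_rate=16000):
    """Triangular filters spaced evenly on the mel scale (a common textbook design)."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    num_bins = fft_size // 2 + 1
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2)
    centers = mel_to_hz(mel_points) * fft_size / sample_rate  # fractional bin positions
    bins = np.arange(num_bins)
    bank = np.zeros((num_filters, num_bins))
    for r in range(num_filters):
        left, mid, right = centers[r], centers[r + 1], centers[r + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        bank[r] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrum(frames, fft_size=512, num_filters=24, floor=1e-10):
    """Return the complex spectrum (M bins) and the R-dimensional log mel spectrum."""
    spc = np.fft.fft(frames, n=fft_size, axis=-1)       # frames are zero-padded to M points
    magnitude = np.abs(spc[..., : fft_size // 2 + 1])   # assumed: non-redundant half only
    mel = magnitude @ mel_filterbank(num_filters, fft_size).T
    return spc, np.log(np.maximum(mel, floor))          # floor avoids log(0)
```

The full complex spectrum is returned alongside the log mel spectrum because the text routes the former to the noise suppression unit and the latter to both downstream units.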

<Overview of parameter estimation unit 105 and definition of parameters>
The parameter estimation unit 105 receives the log mel spectrum O^→_t 113 and the parameters of the silence GMM 108 and the clean speech GMM 109, and computes optimal estimates of the speaker adaptation parameter 110 and the noise GMM parameter set 111.

The silence GMM 108 and the clean speech GMM 109 are given by the following equation.

In equation (2) above, j is an index that distinguishes the silence GMM 108 from the clean speech GMM 109: j = 0 denotes the silence GMM 108 and j = 1 the clean speech GMM 109. k is the index of a normal distribution contained in the silence GMM 108 or the clean speech GMM 109, and K is the total number of normal distributions (for example K = 128). S^→_t is the log mel spectrum of speech, and p_{SI,j}(S^→_t) is the likelihood of the silence GMM 108 or the clean speech GMM 109. w_{SI,j,k}, μ^→_{SI,j,k}, and Σ_{S,j,k} are, respectively, the mixture weight, mean vector, and diagonal covariance matrix of the silence GMM 108 or the clean speech GMM 109. These parameters are estimated in advance from training speech data of many speakers. The function N(·) is the probability density function of the multivariate normal distribution given by the following equation (3).
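Equations (2) and (3) are not reproduced in this text. A form consistent with the symbol definitions above, assuming the standard GMM likelihood and the standard R-dimensional normal density, would be the following; the summation range over the K components is left implicit because the source's index convention is not visible here.

```latex
p_{SI,j}(\mathbf{S}_t) = \sum_{k} w_{SI,j,k}\,
  \mathcal{N}\!\left(\mathbf{S}_t;\ \boldsymbol{\mu}_{SI,j,k},\ \boldsymbol{\Sigma}_{S,j,k}\right)
  \tag{2}

\mathcal{N}(\mathbf{x};\ \boldsymbol{\mu},\ \boldsymbol{\Sigma}) =
  \frac{1}{(2\pi)^{R/2}\,\lvert\boldsymbol{\Sigma}\rvert^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top}
  \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
  \tag{3}
```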

A GMM whose parameters are estimated from training speech data of many speakers is called a speaker-independent (SI) GMM, and a GMM whose parameters are estimated from training speech data of a specific speaker is called a speaker-dependent (SD) GMM. Because preparing a speaker-dependent GMM in advance is not realistic in practice, the SD GMM is obtained through speaker adaptation. That is, the mean vector μ^→_{SD,j,k} of the SD GMM is obtained by transforming the mean vector μ^→_{SI,j,k} of the SI GMM with the speaker adaptation process of the following equation (4).

In the above equation, A^→ and b^→ are an R×R transformation matrix and an R-dimensional bias vector, respectively. W and ξ^→_{j,k} are the (R+1)×R extended transformation matrix and the (R+1)-dimensional extended mean vector given by the following equations (5) and (6), and W is the speaker adaptation parameter 110. W does not depend on the index j that distinguishes the silence GMM 108 from the clean speech GMM 109, nor on the index k of the normal distributions contained in the silence GMM 108 or the clean speech GMM 109.
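The affine mean transform described here can be illustrated with a short sketch. Equations (4) to (6) are not reproduced in this text, so the layout of the extended matrix is an assumption: W is taken to have shape (R+1, R) with the bias in its first row, and the extended mean vector is taken to be [1, μ_SI]. The function names are hypothetical.

```python
import numpy as np

def adapt_means(means_si, A, b):
    """Apply mu_SD = A @ mu_SI + b to every row of means_si (shape K x R)."""
    return means_si @ A.T + b

def adapt_means_extended(means_si, W):
    """Same transform written with an extended matrix and extended mean vectors.

    Assumed layout (not confirmed by the source): W has shape (R + 1, R), with the
    bias b in row 0 and A transposed below it, and xi = [1, mu_SI].
    """
    xi = np.hstack([np.ones((means_si.shape[0], 1)), means_si])  # K x (R + 1)
    return xi @ W

R, K = 24, 128
rng = np.random.default_rng(0)
means_si = rng.normal(size=(K, R))                 # stand-in SI mean vectors
A = np.eye(R) + 0.01 * rng.normal(size=(R, R))     # near-identity transform
b = 0.1 * rng.normal(size=R)
W = np.vstack([b, A.T])                            # (R + 1) x R extended matrix
means_sd = adapt_means_extended(means_si, W)
```

Because a single W is shared by all components, one matrix adapts every mean vector of both the silence GMM and the clean speech GMM, which keeps the number of parameters to estimate small.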

The noise GMM, on the other hand, is given by the following equation (7).

In equation (7) above, l is the index of a normal distribution contained in the noise GMM, and L is the total number of normal distributions (for example L = 4). N^→_t is the log mel spectrum of the noise, and p_N(N^→_t) is the likelihood of the noise GMM. w_{N,l}, μ^→_{N,l}, and Σ_{N,l} are, respectively, the mixture weight, mean vector, and diagonal covariance matrix of the noise GMM. Hereinafter, the parameter set 111 of the noise GMM is defined as λ = {w_{N,l}, μ_{N,l}, Σ_{N,l}}.
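Evaluating the likelihood of a GMM with diagonal covariances, as used for the noise GMM, can be sketched as below. Equation (7) is not reproduced in this text; the code assumes the usual weighted sum of normal densities and works in the log domain for numerical stability. The function name is hypothetical.

```python
import numpy as np

def diag_gmm_log_likelihood(x, weights, means, variances):
    """log p(x) for a GMM with diagonal covariances.

    x: (T, R) feature vectors; weights: (L,); means, variances: (L, R).
    """
    x = np.atleast_2d(x)
    diff = x[:, None, :] - means[None, :, :]                            # T x L x R
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # L
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # T x L
    log_joint = np.log(weights) + log_comp
    peak = log_joint.max(axis=1, keepdims=True)                         # log-sum-exp trick
    return (peak + np.log(np.exp(log_joint - peak).sum(axis=1, keepdims=True)))[:, 0]
```

With diagonal covariances each component costs only O(R) per frame, which matters once the speech and noise components are later combined into K × L pairs.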

In the parameter estimation unit 105, the speaker adaptation parameter W 110 and the noise GMM parameter set λ 111 are estimated with the EM algorithm. The EM algorithm is a method for estimating the parameters of a probability model. It optimizes the parameters by alternating an expectation step (E-step), which computes the expected value of the model's cost function (log-likelihood function), and a maximization step (M-step), which maximizes the cost function, until a convergence condition is satisfied.

<Configuration of parameter estimation unit 105>
FIG. 3 shows the configuration of the parameter estimation unit 105, which includes an initialization unit 301, a probability/signal estimation unit 302, a reliable data selection unit 303, a speaker adaptation parameter estimation unit 304, a noise GMM estimation unit 305, and a convergence determination unit 306. The parameter estimation unit 105 performs processing according to the flow shown in FIG. 4.

<Initialization unit 301>
First, the iteration index i of the EM algorithm is initialized (S401).

Next, in the initial value estimation process S402, the initial values of the speaker adaptation parameter W 110 and the noise GMM parameter set λ 111 for the EM algorithm are estimated with the following equations. U is the number of frames used for initial value estimation (for example U = 10).

In the above equations, the superscript (i) indicates a parameter at the i-th iteration of the EM algorithm. In equation (8), 0^→ and I are the R-dimensional column vector whose elements are all 0 and the R×R identity matrix, respectively, and GaussRand(·) is a generator of multivariate normal random numbers.
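The initialization equations themselves are not reproduced in this text, so the sketch below is only one plausible reading of the ingredients named above: a zero bias and identity transform for W, statistics of the first U frames for the noise GMM, and a normal random generator standing in for GaussRand. How the source actually combines these quantities is an assumption, as are the function name, the uniform weights, and the variance floor.

```python
import numpy as np

def init_parameters(log_mel, num_noise_mix=4, num_init_frames=10, seed=0):
    """Hypothetical initial values for W and the noise GMM from the first U frames."""
    rng = np.random.default_rng(seed)
    R = log_mel.shape[1]
    head = log_mel[:num_init_frames]           # assumed to contain noise only
    mean = head.mean(axis=0)
    var = head.var(axis=0) + 1e-6              # small floor keeps variances positive
    # Zero bias in row 0, identity transform below: adaptation starts as a no-op.
    W0 = np.vstack([np.zeros(R), np.eye(R)])
    weights = np.full(num_noise_mix, 1.0 / num_noise_mix)
    # One random draw per mixture component around the sample mean (GaussRand stand-in).
    means = rng.normal(loc=mean, scale=np.sqrt(var), size=(num_noise_mix, R))
    variances = np.tile(var, (num_noise_mix, 1))
    return W0, weights, means, variances
```

Drawing the component means at random around a common mean is one simple way to keep the L noise components from starting out identical, which would otherwise leave them identical after every EM update.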

<Probability/signal estimation unit 302>
In the probability model parameter generation process S403, a probability model of the log mel spectrum O^→_t 113 is constructed as the following GMM, using the speaker adaptation parameter W^(i-1) and the noise GMM parameter set λ^(i-1) from the (i-1)-th iteration together with the parameters of the silence GMM 108 and the clean speech GMM 109.

In equation (14) above, p_{O,j}^(i)(O^→_t) is the likelihood of the probability model of the log mel spectrum O^→_t 113 generated in the probability model parameter generation process S403. w_{O,j,k,l}^(i), μ^→_{O,j,k,l}^(i), and Σ_{O,j,k,l}^(i) are, respectively, the mixture weight, mean vector, and diagonal covariance matrix of that model, generated from the speaker adaptation parameter W^(i-1) and the noise GMM parameter set λ^(i-1) of the (i-1)-th iteration and the parameters of the silence GMM 108 or the clean speech GMM 109. They are given by the following equations.

In the above equations, the functions log(·) and exp(·) operate on each element of a vector. $\vec{1}$ is the R-dimensional column vector whose elements are all 1, and $H_{j,k,l}^{(i)}$ is the Jacobian matrix of the function h(·).

In the expected value calculation process S404, the expected value of the cost function $Q_O(\cdot)$ of the probability model of the log mel spectrum $\vec{O}_t$ 113 at the i-th iteration is calculated by the following equation (the E-step of the EM algorithm).

In the above equation, $\vec{O}_{0:T-1} = \{\vec{O}_0, \ldots, \vec{O}_t, \ldots, \vec{O}_{T-1}\}$, and T is the total number of frames of the log mel spectrum $\vec{O}_t$ 113. $P_{t,j}^{(i)}$ and $P_{t,j,k,l}^{(i)}$ are the posterior probabilities at frame t for the GMM type j and for the normal distribution indices k and l, respectively, and are given by the following equations. In particular, $P_{t,j=0}^{(i)}$ is defined as the speech absence probability and $P_{t,j=1}^{(i)}$ as the speech presence probability.
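The posterior probabilities described above follow the usual GMM responsibility computation. As a minimal sketch, assuming the composite model is flattened into a list of diagonal-covariance Gaussian components (one per combination of j, k, and l), the posteriors for a single frame could be computed as follows. The function names and toy parameters are illustrative assumptions, not taken from the patent, and the patent's own equations are not reproduced here.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def component_posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given one frame x.

    weights must be strictly positive; means and variances hold one
    vector per component (hypothetical flattened (j, k, l) ordering).
    """
    log_joint = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_joint)
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Summing these posteriors over all components that belong to one GMM type j would then give a per-type probability such as the speech presence probability, under the same flattening assumption.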

The M-step of the EM algorithm consists of the signal estimation process S405, the posterior probability estimation process S406, the reliable data selection process S407, the speaker adaptation parameter estimation process S408, and the noise GMM parameter estimation process S409.

In the signal estimation process S405, the log mel spectrum of clean speech $\vec{S}_t^{(i)}$ and the log mel spectrum of noise $\vec{N}_t^{(i)}$, which are used to update the speaker adaptation parameter $W^{(i)}$ and the noise GMM parameter set $\lambda^{(i)}$, are estimated from the log mel spectrum $\vec{O}_t$ 113 by the following equations.

In the posterior probability estimation process S406, the posterior probabilities $P_{t,j,k}^{(i)}$ and $P_{t,l}^{(i)}$ needed to update the speaker adaptation parameter $W^{(i)}$ and the noise GMM parameter set $\lambda^{(i)}$ are estimated by the marginalization in the following equations.
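Marginalization here means summing the composite posterior over the indices that are not needed. The sketch below assumes that the posterior for one frame is available as a joint distribution over (j, k, l) that sums to 1; if the stored value is instead conditional on j, each term would first have to be weighted by the corresponding per-type probability. The dictionary layout and function name are hypothetical.

```python
from collections import defaultdict

def marginalize(posterior):
    """Marginalize a joint posterior over (j, k, l) for one frame.

    posterior maps (j, k, l) -> probability (assumed joint, summing to 1).
    Returns (p_jk, p_l): p_jk sums out the noise component index l,
    p_l sums out the GMM type j and the speech/silence component index k.
    """
    p_jk = defaultdict(float)
    p_l = defaultdict(float)
    for (j, k, l), p in posterior.items():
        p_jk[(j, k)] += p
        p_l[l] += p
    return dict(p_jk), dict(p_l)
```

For example, with two GMM types, one component each, and two noise components, `marginalize` returns one value per (j, k) pair for the speaker adaptation update and one value per l for the noise GMM update.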

<Reliable data selection unit 303>
In the reliable data selection process S407, the highly reliable estimated log mel spectra of clean speech $\hat{\vec{S}}_t^{(i)}$ and of noise $\hat{\vec{N}}_t^{(i)}$ that are used to estimate the speaker adaptation parameter $W^{(i)}$ and the noise GMM parameter set $\lambda^{(i)}$ are selected.

First, the SN ratio $\mathrm{SNR}_t^{(i)}$ at frame t is calculated by the following equation.

Next, k-means clustering is applied to the SN ratios $\mathrm{SNR}_t^{(i)}$ obtained from all frames to classify them into two classes C = 0, 1, and the average SN ratio of each class is defined as $\mathrm{AveSNR}_c^{(i)}$. Then, using the criterion shown in FIG. 5, it is determined whether clean speech or noise is dominant at frame t. If clean speech is dominant, the frame number t is stored in the set of clean speech signal frames $T_S^{(i)}$. Conversely, if noise is dominant, the frame number t is stored in the set of noise frames $T_N^{(i)}$.
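The two-class clustering step can be illustrated with a one-dimensional k-means. The sketch below is only an approximation of the selection: the actual criterion of FIG. 5 is not reproduced in this text, so the code simply assumes that frames in the cluster with the higher average SN ratio are speech-dominant and all others are noise-dominant. Function names and the sample values are hypothetical.

```python
def two_means_1d(values, max_iters=100):
    """Cluster scalar values into two classes with k-means (k = 2)."""
    centers = [min(values), max(values)]  # simple deterministic initialization
    labels = None
    for _ in range(max_iters):
        new_labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        for c in (0, 1):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return centers, new_labels

def select_frames(snr_per_frame):
    """Split frame indices into speech-dominant and noise-dominant sets.

    Assumption: the cluster with the higher mean SN ratio is speech-dominant.
    """
    centers, labels = two_means_1d(snr_per_frame)
    speech_class = 0 if centers[0] > centers[1] else 1
    t_s = [t for t, lab in enumerate(labels) if lab == speech_class]
    t_n = [t for t, lab in enumerate(labels) if lab != speech_class]
    return t_s, t_n
```

The cluster means returned by `two_means_1d` play the role of the per-class average SN ratios, which a finer criterion could compare against each frame's own value.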

<Speaker adaptation parameter estimation unit 304>
In the speaker adaptation parameter estimation process S408, the speaker adaptation parameter $W^{(i)}$ is updated by the following equation, using the log mel spectrum of clean speech $\hat{\vec{S}}_t^{(i)}$ estimated in the signal estimation process S405, the posterior probability $P_{t,j,k}^{(i)}$ estimated in the posterior probability estimation process S406, and the set of clean speech signal frames $T_S^{(i)}$ obtained in the reliable data selection process S407.

<Noise GMM estimation unit 305>
In the noise GMM parameter estimation process S409, the noise GMM parameter set $\lambda^{(i)}$ is updated by the following equations, using the log mel spectrum of noise $\hat{\vec{N}}_t^{(i)}$ estimated in the signal estimation process S405, the posterior probability $P_{t,l}^{(i)}$ estimated in the posterior probability estimation process S406, and the set of noise signal frames $T_N^{(i)}$ obtained in the reliable data selection process S407.

<Convergence determination unit 306>
In the convergence determination process S410, it is determined whether the convergence condition is satisfied. If it is satisfied, W = $W^{(i)}$ and λ = $\lambda^{(i)}$ are set, and the processing of the parameter estimation unit 105 ends. If it is not satisfied, i ← i + 1 (S411) and the processing from the probability model parameter generation process S403 onward is repeated. The convergence condition is given by the following equation, where η is, for example, η = 0.0001.
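The iteration control of S403 to S411 can be sketched as a loop that stops once a convergence test passes. The patent's convergence equation is not reproduced in this text, so the sketch below assumes a relative-change test on the cost function value with threshold η; the callback `em_step`, the iteration cap, and the test itself are assumptions made for illustration.

```python
def iterate_until_converged(em_step, eta=1e-4, max_iters=100):
    """Run EM iterations until the cost function stops changing.

    em_step(i) is a hypothetical callback that performs iteration i
    (E-step and M-step) and returns the cost function value Q at i.
    Assumed criterion: |Q(i) - Q(i-1)| <= eta * |Q(i-1)|.
    Returns (number of iterations performed, last Q value).
    """
    prev_q = None
    for i in range(1, max_iters + 1):
        q = em_step(i)
        if prev_q is not None and abs(q - prev_q) <= eta * abs(prev_q):
            return i, q
        prev_q = q
    return max_iters, prev_q
```

The iteration cap guards against a cost function that oscillates and never meets the threshold, which a pure threshold test would not handle.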

<Configuration of Noise Suppression Unit 106>
The noise suppression unit 106 is configured as shown in FIG. 6. It includes a noise suppression filter estimation unit 601, which receives the log mel spectrum $\vec{O}_t$ 113, the silence GMM 108, the clean speech GMM 109, the speaker adaptation parameter $W^{(i)}$, and the noise GMM parameter set $\lambda^{(i)}$ and estimates the noise suppression filter $F_{t,m}^{\mathrm{Lin}}$, and a noise suppression filter application unit 602, which receives the complex spectrum $\overrightarrow{\mathrm{Spc}}_t$ 112 and the noise suppression filter $F_{t,m}^{\mathrm{Lin}}$ and suppresses the noise to obtain the noise-suppressed signal $\hat{s}_\tau$ 114.

<Configuration of Noise Suppression Filter Estimation Unit 601>
The noise suppression filter estimation unit 601 performs processing according to the flow shown in FIG. 7.
First, in the probability model generation process S701, the parameters of the probability model of the log mel spectrum $\vec{O}_t$ 113 are generated as follows from the silence GMM 108, the clean speech GMM 109, the speaker adaptation parameter $W^{(i)}$, and the noise GMM parameter set $\lambda^{(i)}$.

In the probability calculation process S702, the speech absence/presence probability $P_{t,j}$ and the posterior probability $P_{t,j,k,l}$ are calculated using the parameters of the probability model of the log mel spectrum $\vec{O}_t$ 113 and the log mel spectrum $\vec{O}_t$ 113 itself.

In the noise suppression filter estimation process S703, the noise suppression filter $F_{t,r}^{\mathrm{Mel}}$ on the mel frequency axis is estimated by the following equation. The estimate uses the mean vector $\vec{\mu}_{SD,j,k}$ of the SD GMM, which is generated from the extended mean vector $\xi_{j,k}$ of the silence GMM 108 and the clean speech GMM 109 and from the speaker adaptation parameter W; the mean vector $\vec{\mu}_{N,l}$ of the noise GMM contained in the noise GMM parameter set λ; the speech absence/presence probability $P_{t,j}$; and the posterior probability $P_{t,j,k,l}$. Equation (36) is written per vector element.

In the noise suppression filter estimation process S704, the noise suppression filter $F_{t,r}^{\mathrm{Mel}}$ on the mel frequency axis is converted into the noise suppression filter $F_{t,m}^{\mathrm{Lin}}$ on the linear frequency axis. The conversion estimates the filter values on the linear frequency axis by applying cubic spline interpolation along the mel frequency axis.
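The mel-to-linear conversion can be sketched with a natural cubic spline whose knots lie on the mel axis. Several details below are assumptions that the text does not specify: the mel formula (the common 2595 log10(1 + f/700) definition), filter centers equally spaced on the mel scale up to the Nyquist frequency, natural boundary conditions, clamping of query points outside the outermost centers, clipping of the gain to [0, 1], and the number of linear bins.

```python
import bisect
import math

def hz_to_mel(f_hz):
    """Common mel-scale definition (assumed, not stated in the text)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    a, b = [0.0] * (n + 1), [1.0] * (n + 1)
    c, d = [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        a[i], b[i], c[i] = h[i - 1], 2.0 * (h[i - 1] + h[i]), h[i]
        d[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Thomas algorithm for the tridiagonal system of second derivatives.
    for i in range(1, n + 1):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    m = [0.0] * (n + 1)
    m[n] = d[n] / b[n]
    for i in range(n - 1, -1, -1):
        m[i] = (d[i] - c[i] * m[i + 1]) / b[i]

    def evaluate(x):
        i = max(0, min(n - 1, bisect.bisect_right(xs, x) - 1))
        t0, t1 = xs[i + 1] - x, x - xs[i]
        return ((m[i] * t0 ** 3 + m[i + 1] * t1 ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * t0
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * t1)

    return evaluate

def mel_gain_to_linear(gain_mel, n_bins, sample_rate=16000.0):
    """Interpolate R mel-channel gains onto n_bins linear-frequency bins."""
    r = len(gain_mel)
    nyquist_mel = hz_to_mel(sample_rate / 2.0)
    centers = [nyquist_mel * (i + 1) / (r + 1) for i in range(r)]  # assumed layout
    spline = natural_cubic_spline(centers, gain_mel)
    out = []
    for idx in range(n_bins):
        f_hz = (sample_rate / 2.0) * idx / (n_bins - 1)
        x = min(max(hz_to_mel(f_hz), centers[0]), centers[-1])  # clamp to knot range
        out.append(min(max(spline(x), 0.0), 1.0))  # clip overshoot (assumption)
    return out
```

Clipping is included because a cubic spline can overshoot between knots, which could otherwise produce gains below 0 or above 1.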

<Configuration of Noise Suppression Filter Application Unit 602>
The noise suppression filter application unit 602 performs processing according to the flow shown in FIG. 8.
In the filtering process S801, the complex spectrum $\overrightarrow{\mathrm{Spc}}_t$ 112 is multiplied by the noise suppression filter $F_{t,m}^{\mathrm{Lin}}$ as in the following equation to obtain the noise-suppressed complex spectrum $\hat{S}_{t,m}$. Equation (37) is written per vector element.
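The filtering step is an element-wise product of a real-valued gain and a complex spectrum. A minimal sketch for one frame, with hypothetical names and toy values, could look like this:

```python
def apply_filter(spectrum, gain):
    """Multiply each complex spectral bin by its real-valued filter gain."""
    if len(spectrum) != len(gain):
        raise ValueError("spectrum and gain must have the same number of bins")
    return [g * s for g, s in zip(gain, spectrum)]

# Toy frame with three bins: the second bin is fully suppressed.
suppressed = apply_filter([1 + 1j, 2 - 2j, -3 + 0j], [0.5, 0.0, 1.0])
```

Because the gain is real, the phase of each bin is left unchanged and only its magnitude is scaled.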

In the inverse fast Fourier transform process S802, the inverse fast Fourier transform is applied to the complex spectrum $\hat{S}_{t,m}$ to obtain the noise-suppressed speech $\hat{s}_{t,n}$ at frame t.

In the waveform concatenation process S803, the noise-suppressed speech $\hat{s}_{t,n}$ of each frame is concatenated while the window function $w_n$ is removed, as in the following equation, to obtain the continuous noise-suppressed speech $\hat{s}_\tau$ 114.
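One common way to concatenate overlapping frames while undoing the analysis window is overlap-add with normalization by the summed window. The patent's own concatenation equation is not reproduced in this text, so the sketch below is an assumed stand-in rather than the patented formula; it also assumes that each frame still carries the Hamming analysis window.

```python
import math

def hamming(n_len):
    """Hamming window of length n_len."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def overlap_add(frames, shift, window):
    """Concatenate windowed frames and divide out the accumulated window."""
    frame_len = len(window)
    total = shift * (len(frames) - 1) + frame_len
    out = [0.0] * total
    norm = [0.0] * total
    for t, frame in enumerate(frames):
        start = t * shift
        for n in range(frame_len):
            out[start + n] += frame[n]
            norm[start + n] += window[n]
    # The Hamming window never reaches zero, so norm stays positive;
    # the guard only protects against other window choices.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

With unmodified frames this procedure returns the original waveform, which makes it easy to check before a suppression filter is inserted between analysis and synthesis.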

The noise suppression device 100 is realized by causing a computer to execute a program written in computer-readable code. Such a program is stored on a computer-readable storage medium such as a magnetic disk or a CD-ROM, and is installed on the computer from the storage medium or over a communication line and then executed.

<Modification>
In the above embodiment, a window function other than the Hamming window, such as a rectangular window, a Hanning window, or a Blackman window, may be used as the window function $w_n$ in the frame extraction process S201.

In the above embodiment, another probability model such as a hidden Markov model (HMM) may be used as the probability model of the speech signal instead of the silence GMM 108 and the clean speech GMM 109.

In the above embodiment, more than the two GMMs (the silence GMM 108 and the clean speech GMM 109) may be used. For example, a silence GMM, an unvoiced sound GMM, and a voiced sound GMM, or a GMM for each phoneme, may be used.

In the above embodiment, another probability model such as an HMM may be used as the probability model of the noise signal instead of the noise GMM.

In the above embodiment, the speaker adaptation process may be performed using only the transformation matrix A, as in the following equation.

In the above embodiment, the speaker adaptation process may be performed using only the bias vector $\vec{b}$, as in the following equation.

In the above embodiment, the transformation matrix A and the bias vector $\vec{b}$, which are the parameters of the speaker adaptation process, may be made dependent on the index j that distinguishes the silence GMM 108 from the clean speech GMM 109 and on the index k of the normal distribution contained in the silence GMM 108 or the clean speech GMM 109, as in the following equation.

In the above embodiment, the reliable data selection may be performed using the log probability ratio $\mathrm{LPR}_t^{(i)}$ of the following equation instead of the SN ratio $\mathrm{SNR}_t^{(i)}$.

In this case, k-means clustering is applied to the log probability ratios $\mathrm{LPR}_t^{(i)}$ obtained from all frames to classify them into two classes C = 0, 1, and the average log probability ratio of each class is defined as $\mathrm{AveLPR}_c^{(i)}$. Then, using the criterion shown in FIG. 9, it is determined whether clean speech or noise is dominant at frame t. If clean speech is dominant, the frame number t is stored in the set of clean speech signal frames $T_S^{(i)}$. Conversely, if noise is dominant, the frame number t is stored in the set of noise frames $T_N^{(i)}$.

In the above embodiment, the reliable data selection may be performed using the threshold processing of the following equation instead of k-means clustering. $Th_{\mathrm{SNR}}$ and $Th_{\mathrm{LPR}}$ are the thresholds corresponding to $\mathrm{SNR}_t^{(i)}$ and $\mathrm{LPR}_t^{(i)}$, respectively.

In the above embodiment, the noise suppression filter estimation process S703 may use, instead of the weighted average, the single estimate with the largest weight, that is, the estimate with the largest product of the speech absence/presence probability $P_{t,j}$ and the posterior probability $P_{t,j,k,l}$. In this case, it is desirable that this weight be sufficiently larger than the weights of the other estimates.
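The difference between the weighted average and the maximum-weight variant can be shown on toy data. The sketch below assumes that one candidate filter estimate (a vector of per-channel gains) is available for each component combination, with the product of the two probabilities as its weight; the function name and data layout are hypothetical, and Equation (36) itself is not reproduced here.

```python
def combine_estimates(weights, estimates, use_max=False):
    """Combine per-component filter estimates into one filter vector.

    weights:   one non-negative weight per candidate estimate
    estimates: one gain vector per candidate (all of equal length)
    use_max:   if True, return the estimate with the largest weight
               instead of the weighted average
    """
    if use_max:
        best = max(range(len(weights)), key=lambda idx: weights[idx])
        return list(estimates[best])
    total = sum(weights)
    dim = len(estimates[0])
    return [
        sum(w * e[r] for w, e in zip(weights, estimates)) / total
        for r in range(dim)
    ]
```

When one weight dominates, both variants give nearly the same filter, which is why the maximum-weight shortcut is best suited to that situation.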

To show the effect of the present invention, an example is presented in which an acoustic signal containing both a speech signal and a noise signal was input to the noise suppression device according to the embodiment of the present invention and noise suppression was performed. The experimental method and results are described below.

In this experiment, the evaluation data consisted of 100 sentences uttered by 23 male speakers, taken from the IPA (Information-technology Promotion Agency, Japan)-98-TestSet. Noise recorded separately in an airport lobby, on a station platform, and on a street was superimposed on these speech data on a computer at SN ratios of 0 dB, 5 dB, and 10 dB. That is, nine types of evaluation data in total were created (three noise types × three SN ratios). Each speech datum is a monaural signal sampled at 16,000 Hz with 16-bit quantization. The acoustic feature extraction unit 104 was applied to this acoustic signal with a frame length of 20 ms (Frame = 320 sample points) and with the frame start point shifted every 10 ms (Shift = 160 sample points).
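The frame length and shift in sample points follow directly from the sampling frequency. The short sketch below reproduces that arithmetic; the helper names and the one-second example are illustrative only.

```python
def frame_params(sample_rate_hz, frame_ms, shift_ms):
    """Convert frame length and shift from milliseconds to sample points."""
    return sample_rate_hz * frame_ms // 1000, sample_rate_hz * shift_ms // 1000

def num_frames(n_samples, frame_len, shift):
    """Number of complete frames that fit into n_samples."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // shift

frame_len, shift = frame_params(16000, 20, 10)     # 320 and 160 sample points
frames_per_second = num_frames(16000, frame_len, shift)
```

With these settings, consecutive frames overlap by half a frame.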

For the silence GMM 108 and the clean speech GMM 109, GMMs with K = 128 mixture components were used, with the R = 24-dimensional log mel spectrum as the acoustic feature. Each was trained on training speech data from many speakers. The number of mixture components of the noise GMM was set to L = 4.

Performance was evaluated by speech recognition, and the evaluation measure was the word error rate (WER) given by the following equation. N is the total number of words, D is the number of deletion errors, S is the number of substitution errors, and I is the number of insertion errors. A smaller WER indicates higher speech recognition performance.
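The WER equation itself is not reproduced in this text. Assuming the standard definition, WER = (D + S + I) / N × 100 [%], it can be computed as in the sketch below; the function name and the sample counts are hypothetical.

```python
def word_error_rate(n_words, deletions, substitutions, insertions):
    """Word error rate in percent, using the standard definition (assumed)."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return 100.0 * (deletions + substitutions + insertions) / n_words

# Example: 200 reference words with 5 deletions, 10 substitutions, 5 insertions.
wer = word_error_rate(200, 5, 10, 5)
```

Because insertions are counted against the reference length N, the WER under this definition can exceed 100 %.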

Speech recognition was performed with a recognizer based on finite-state transducers (T. Hori, et al., "Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition," IEEE Trans. on ASLP, vol.15, no.4, pp.1352-1365, May 2007). The acoustic model is a speaker-independent triphone HMM; each HMM is a three-state left-to-right HMM, and each state has 16 normal distributions. The total number of states across all HMMs is 2,000. The acoustic feature for speech recognition is a 39-dimensional vector consisting of 12-dimensional MFCCs (Mel-frequency cepstral coefficients), the log power value, and the first-order and second-order regression coefficients of each, analyzed with a frame length of 20 ms (Frame = 320) and with the frame start point shifted every 10 ms (Shift = 160 sample points). The language model is a trigram, and the vocabulary size is 20,000 words.

FIG. 10 shows the speech recognition results obtained with no noise suppression, with the method disclosed in Non-Patent Document 1, with the method disclosed in Non-Patent Document 2, and with noise suppression according to the present invention. The results in FIG. 10 show that the present invention achieves higher performance than the prior art.

Claims

A noise suppression device that takes as an input signal an acoustic signal including a speech signal and a noise signal, and outputs a signal in which the noise signal is suppressed from the input signal (hereinafter referred to as a target signal), the device comprising:
a storage unit that stores a probability model of a silence signal (hereinafter referred to as a silence model) and a probability model of a speech signal (hereinafter referred to as a speech model);
an acoustic feature extraction unit that outputs, from the input signal, spectrum information of the input signal and an acoustic feature used for parameter estimation of a probability model of the noise signal (hereinafter referred to as a noise model);
a parameter estimation unit that uses the acoustic feature, the silence model, and the speech model to estimate a parameter expressing the noise model (hereinafter referred to as a noise model parameter) and a speaker-dependent parameter used to express a probability model of the acoustic feature (hereinafter referred to as a speaker adaptation parameter); and
a noise suppression unit that estimates a noise suppression filter using the acoustic feature, the silence model, the speech model, the noise model parameter, and the speaker adaptation parameter, and outputs the target signal by applying the noise suppression filter to the spectrum information,
wherein the noise model is expressed as a weighted mixture probability distribution composed of a plurality of probability distributions, and
the parameter estimation unit
estimates the posterior probabilities used for estimating the noise model parameter and the speaker adaptation parameter by marginalizing posterior probabilities obtained from the input signal and from a weighted mixture probability distribution of the input signal that is obtained by combining the silence model, the speech model, and the noise model.

The noise suppression device according to claim 1,
wherein the parameter estimation unit
estimates the noise model parameter and the speaker adaptation parameter by repeating a process of calculating the expected value of a cost function of the probability model of the acoustic feature and a process of maximizing the cost function until a convergence condition is satisfied.

The noise suppression device according to claim 2,
wherein the parameter estimation unit,
in the process of maximizing the cost function, estimates an acoustic feature of the noise signal and an acoustic feature of the speech signal, extracts the highly reliable ones from each, updates the speaker adaptation parameter using the highly reliable acoustic feature of the speech signal and the posterior probabilities, and updates the noise model parameter using the highly reliable acoustic feature of the noise signal and the posterior probabilities.

A noise suppression method that takes as an input signal an acoustic signal including a speech signal and a noise signal, and outputs a signal in which the noise signal is suppressed from the input signal (hereinafter referred to as a target signal),
the method using a probability model of a silence signal (hereinafter referred to as a silence model) and a probability model of a speech signal (hereinafter referred to as a speech model), and comprising:
an acoustic feature extraction step of outputting, from the input signal, spectrum information of the input signal and an acoustic feature used for parameter estimation of a probability model of the noise signal (hereinafter referred to as a noise model);
a parameter estimation step of using the acoustic feature, the silence model, and the speech model to estimate a parameter expressing the noise model (hereinafter referred to as a noise model parameter) and a speaker-dependent parameter used to express a probability model of the acoustic feature (hereinafter referred to as a speaker adaptation parameter); and
a noise suppression step of estimating a noise suppression filter using the acoustic feature, the silence model, the speech model, the noise model parameter, and the speaker adaptation parameter, and outputting the target signal by applying the noise suppression filter to the spectrum information,
wherein the noise model is expressed as a weighted mixture probability distribution composed of a plurality of probability distributions, and
in the parameter estimation step,
the posterior probabilities used for estimating the noise model parameter and the speaker adaptation parameter are estimated by marginalizing posterior probabilities obtained from the input signal and from a weighted mixture probability distribution of the input signal that is obtained by combining the silence model, the speech model, and the noise model.

A program for causing a computer to function as the noise suppression device according to any one of claims 1 to 3.