JP2016143043A

JP2016143043A - Speech model learning method, noise suppression method, speech model learning system, noise suppression system, speech model learning program, and noise suppression program

Info

Publication number: JP2016143043A
Application number: JP2015021453A
Authority: JP
Inventors: Masakiyo Fujimoto; Tomohiro Nakatani
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-02-05
Filing date: 2015-02-05
Publication date: 2016-08-08
Anticipated expiration: 2035-02-05
Also published as: JP6243858B2

Abstract

PROBLEM TO BE SOLVED: To improve noise suppression performance.
SOLUTION: A noise suppression system 200 records a mixed acoustic signal in which a speech signal and a noise signal are mixed, and extracts from the mixed acoustic signal a feature quantity used to perform noise suppression and a feature quantity used to calculate a speech posterior probability. Using the log mel spectrum, the speech GMM, the normalized log mel spectrum, and the trained speech DNN, the noise suppression system 200 estimates a speaker adaptation parameter and the parameter set of a noise GMM, which is a probability model of the noise. The noise suppression system 200 then uses the complex spectrum, the log mel spectrum, the speech GMM, the speaker adaptation parameter, the parameter set of the noise GMM, and the speech posterior probability to construct a noise suppression filter, and suppresses the noise to obtain a noise-suppressed signal.
SELECTED DRAWING: Figure 4

Description

The present invention relates to a speech model learning method, a noise suppression method, a speech model learning device, a noise suppression device, a speech model learning program, and a noise suppression program.

In recent years, automatic speech recognition has come to be used in a growing number of situations in the information society, and further technological progress is strongly expected. To use automatic speech recognition in a real environment, signals other than the speech signal to be processed, that is, noise, must be removed from the noisy acoustic signal so that the desired speech signal can be extracted.

For example, one disclosed method takes as input a signal in which a speech signal and a noise signal are mixed, and generates a probability model of the input mixed signal from probability models of the speech signal and the noise signal that were estimated in advance. The difference between each of the speech and noise probability models that make up the model of the input mixed signal and the statistics of the speech signal and noise signal actually contained in the input mixed signal is expressed by a Taylor series approximation. This difference is estimated with the EM algorithm, and the probability model of the input mixed signal is optimized. Noise is then suppressed using the parameters of the optimized probability model of the input mixed signal and of the probability model of the speech signal (see, for example, Non-Patent Document 1).

Another disclosed method also takes as input a signal in which a speech signal and a noise signal are mixed. To adapt a probability model of the speech signal, trained on learning speech data from many speakers, to the characteristics of the speaker in the input mixed signal (speaker adaptation), and to handle noise signals whose statistics follow a multimodal distribution, this method extracts a speech signal and a noise signal separately from the input mixed signal. The reliability of each extracted signal is calculated for each unit of time on the basis of the SN ratio. Using the extracted speech and noise signals and their reliabilities, the speaker adaptation parameter and a probability model of the noise signal that follows a multimodal distribution are estimated with the EM algorithm. An optimal probability model of the input mixed signal is then generated from the speaker-adapted probability model of the speech signal and the estimated probability model of the noise, and noise is suppressed using the parameters of the optimal probability model of the input mixed signal and of the speaker-adapted probability model of the speech signal (see, for example, Non-Patent Document 2).

Non-Patent Document 1: P. J. Moreno, B. Raj, and R. M. Stern, "A vector Taylor series approach for environment-independent speech recognition," in Proceedings of ICASSP '96, vol. II, pp. 733-736, May 1996.
Non-Patent Document 2: M. Fujimoto and T. Nakatani, "A reliable data selection for model-based noise suppression using unsupervised joint speaker adaptation and noise model estimation," in Proceedings of ICSPCC '12, pp. 4713-4716, Aug 2012.

However, the prior art of Non-Patent Document 1, for example, performs noise suppression on the premise that the characteristics of the noise signal contained in the input mixed signal are stationary and that its distribution (frequency distribution or probability distribution) is unimodal. Many noise signals in real environments are non-stationary, and their distributions are often multimodal. The technique therefore cannot cope with non-stationary noise signals and does not achieve sufficient noise suppression performance. In addition, because the relationship between the speech signal and the noise signal in the input mixed signal is expressed by a nonlinear function, no analytical solution is obtained when estimating the parameters of the speech and noise probability models, even with the Taylor series approximation. As a result, the optimal probability model parameters for the speech signal and the noise signal are not obtained, and noise suppression performance is insufficient.

The prior art of Non-Patent Document 2, for example, can handle non-stationary noise signals because it estimates a probability model of a noise signal that follows a multimodal distribution; it estimates the speaker adaptation parameter and that noise model with the EM algorithm. A Gaussian mixture model (GMM) is used as the probability model of the speech signal. Extracting the speech and noise signals from the input mixed signal, estimating the speaker adaptation parameter, and designing the noise suppression filter all require the posterior probability of each component distribution in the speech GMM (defined here as the speech posterior probability). Computing it amounts to a classification problem: deciding, at each time, which component distribution of the speech GMM the speech signal in the input mixed signal belongs to. However, the GMM performs poorly as a classifier, so sufficient noise suppression performance cannot be obtained with the GMM.

An example of the embodiments disclosed in the present application has been made in view of the above, and aims to improve noise suppression performance.

An example of the embodiments of the present application extracts an acoustic feature quantity from a learning speech signal. It then generates label information that associates the extracted acoustic feature quantity with a Gaussian mixture model of the speech signal. It then extracts a normalized acoustic feature quantity from a learning acoustic signal that contains the learning speech signal and a learning noise signal. Finally, it learns a speech model using the generated label information and the extracted normalized acoustic feature quantity.

In addition, an example of the embodiments of the present application stores the speech model learned by the above speech model learning method in a speech model storage unit. It then extracts an acoustic feature quantity from a mixed acoustic signal that contains a speech signal and a noise signal, and also extracts a normalized acoustic feature quantity from the mixed acoustic signal. It then calculates the speech posterior probability using the speech model and the extracted normalized acoustic feature quantity. Finally, it suppresses the noise signal in the mixed acoustic signal using the calculated speech posterior probability and the Gaussian mixture model of the speech signal.

According to an example of the embodiments disclosed in the present application, noise suppression performance can be improved, for example.

FIG. 1 is a diagram illustrating an example of the configuration of the speech model learning device.
FIG. 2 is a flowchart illustrating an example of the processing procedure of the first acoustic feature extraction unit of the speech model learning device.
FIG. 3 is a flowchart illustrating an example of the processing procedure of the second acoustic feature extraction unit of the speech model learning device.
FIG. 4 is a diagram illustrating an example of the configuration of the noise suppression device.
FIG. 5 is a diagram illustrating an example of the configuration of the parameter estimation unit of the noise suppression device.
FIG. 6 is a flowchart illustrating an example of the processing procedure of the parameter estimation unit of the noise suppression device.
FIG. 7 is a flowchart illustrating an example of the subroutine for reliable data selection performed by the parameter estimation unit of the noise suppression device.
FIG. 8 is a diagram illustrating an example of the configuration of the noise suppression unit of the noise suppression device.
FIG. 9 is a flowchart illustrating an example of the processing procedure of the noise suppression filter estimation unit of the noise suppression device.
FIG. 10 is a flowchart illustrating an example of the processing procedure of the noise suppression filter application unit of the noise suppression device.
FIG. 11 is a diagram illustrating an example of the effect of the embodiment.
FIG. 12 is a diagram illustrating an example of a computer that realizes the speech model learning device and the noise suppression device by executing a program.

[Embodiment]
Embodiments of the speech model learning method, noise suppression method, speech model learning device, noise suppression device, speech model learning program, and noise suppression program disclosed in the present application are described below. The following embodiments are merely examples and do not limit the technology disclosed in the present application. The embodiments shown below and other embodiments may be combined as appropriate as long as they do not contradict one another.

In the following embodiments, for a vector or scalar A, the notation "^A" denotes the symbol A with "^" written directly above it, and "￣A" denotes the symbol A with "￣" written directly above it. When A is a vector it is written "vector A", when A is a scalar it is written simply "A", and when A is a set it is written "set A". A function f of the vector A is written f(vector A). For a matrix A, A^−1 denotes the inverse of the matrix A.

The following embodiments introduce a classifier based on a deep neural network (DNN) as the classifier for the speech signal. A DNN is a kind of multilayer perceptron. Whereas an ordinary multilayer perceptron has about three classification layers, the embodiment uses more than three and builds a deeper network. Specifically, each classification layer is trained as a restricted Boltzmann machine (RBM); the RBMs of the layers are then connected and the parameters of the whole network are adjusted, which yields a neural network with deep classification layers. Such deep classification layers improve the classification performance for the speech signal.

To introduce a DNN-based speech classifier into noise suppression, each node in the output layer of the DNN must be associated with a component distribution of the speech GMM. To this end, a distribution label is first generated that indicates, for each time, which component distribution of the speech GMM the noise-free speech signal belongs to. The speech DNN is then trained using the mixture of the speech signal and the noise signal together with the distribution labels. This procedure associates each component of the speech GMM with a node in the output layer of the speech DNN.

Using the speech DNN also improves classification of the speech signal contained in the input mixed signal. This in turn improves the accuracy with which the speech signal and the noise signal are extracted from the input mixed signal, and the accuracy with which the speaker adaptation parameter and the noise suppression filter are estimated.

DNNs are described in detail in Reference 1, A. Mohamed, G. Dahl, and G. Hinton, "Acoustic Modeling Using Deep Belief Networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14-22, 2012, and in Reference 2, Yotaro Kubo, "Pattern Recognition by Deep Learning," Information Processing, vol. 54, no. 5, pp. 500-508, April 2013.

(Configuration of speech model learning device)
FIG. 1 is a diagram illustrating an example of the configuration of the speech model learning device. The speech model learning device 100 is connected to a speech GMM storage device 300 and a speech DNN storage device 400. The speech GMM storage device 300 stores a speech GMM 300a. The speech DNN storage device 400 stores a speech DNN 400a, which comprises the weight matrices W_j and bias vectors v_j, the parameters learned by the speech DNN learning unit 140 described later. The speech model learning device 100 takes as input a learning speech signal O^clean_τ and a learning mixed signal O^noisy_τ, in which the learning speech signal O^clean_τ is mixed with a learning noise signal, and outputs the DNN parameters, namely the weight matrices W_j and the bias vectors v_j. The speech model learning device 100 includes a first acoustic feature extraction unit 110, a second acoustic feature extraction unit 120, a maximum likelihood distribution estimation unit 130, and a speech DNN learning unit 140.

The first acoustic feature extraction unit 110 takes the learning speech signal O^clean_τ as input and extracts from it the vector O^clean_t of the learning log mel spectrum, which is the feature quantity used to obtain the corresponding distribution label Lab_t for training the speech DNN.

FIG. 2 is a flowchart illustrating an example of the processing procedure of the first acoustic feature extraction unit of the speech model learning device. The processing of the first acoustic feature extraction unit 110 is described with reference to FIG. 2. First, in frame cutout processing, the first acoustic feature extraction unit 110 cuts the learning speech signal O^clean_τ (τ is the sample index of the discrete signal) into frames of a fixed length while moving the start point along the time axis by a fixed step (step S110a). For example, it cuts out acoustic signals O^clean_{t,n} of Frame = 400 sample points (16,000 Hz × 25 ms) while moving the start point by Shift = 160 sample points (16,000 Hz × 10 ms) at a time. Here, t is the frame number and n denotes the n-th sample point within the frame. Each frame is multiplied by a window function w_n as it is cut out, for example the Hamming window of equation (1).
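The frame cutout of step S110a can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `cut_frames` is hypothetical, and because the image of equation (1) is not reproduced in this text, the window is assumed to be the standard Hamming window w_n = 0.54 − 0.46 cos(2πn / (Frame − 1)).

```python
import numpy as np

FRAME = 400   # 25 ms at 16,000 Hz
SHIFT = 160   # 10 ms at 16,000 Hz

def cut_frames(signal, frame=FRAME, shift=SHIFT):
    """Cut a 1-D signal into overlapping, windowed frames of shape (T, frame)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.shape[0] < frame:
        return np.empty((0, frame))
    num_frames = 1 + (signal.shape[0] - frame) // shift
    n = np.arange(frame)
    # Assumed form of equation (1): the standard Hamming window.
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (frame - 1))
    # Row t holds samples shift*t ... shift*t + frame - 1 of the input.
    starts = shift * np.arange(num_frames)
    index = starts[:, None] + n[None, :]
    return signal[index] * window[None, :]
```

Trailing samples that do not fill a complete frame are dropped in this sketch; the text does not say how the end of the signal is handled.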

The first acoustic feature extraction unit 110 then applies an M-point fast Fourier transform (M is a power of 2 that is at least Frame, for example M = 512) to the acoustic signal O^clean_{t,n} and obtains the complex spectrum vector Spc^clean_t = {Spc^clean_{t,0}, ..., Spc^clean_{t,m}, ..., Spc^clean_{t,M−1}}^T, where m is the frequency bin number (step S110b). Here {·}^T denotes the transpose of a matrix or vector. Next, mel filter bank analysis (step S110c) and logarithm processing (step S110d) are applied to the absolute value of each Spc^clean_{t,m} to calculate the vector O^clean_t = {O^clean_{t,0}, ..., O^clean_{t,r}, ..., O^clean_{t,R−1}}^T, whose elements are the R-dimensional (for example, R = 24) log mel spectrum, where r is the element number of the vector O^clean_t. The first acoustic feature extraction unit 110 outputs the vector O^clean_t as the learning log mel spectrum.
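Steps S110b to S110d can be sketched as below. The text does not specify the filter bank design, so triangular filters equally spaced on a common mel scale are an assumption, as are the function names, the use of only the first M/2 + 1 bins (the magnitude spectrum of a real signal is symmetric), and the small floor that keeps the logarithm finite.

```python
import numpy as np

def mel_filterbank(num_filters=24, fft_size=512, sample_rate=16000):
    """Triangular mel filters of shape (num_filters, fft_size // 2 + 1). Assumed design."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # num_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2))
    bin_hz = np.arange(fft_size // 2 + 1) * sample_rate / fft_size
    bank = np.zeros((num_filters, bin_hz.shape[0]))
    for r in range(num_filters):
        lo, mid, hi = edges_hz[r], edges_hz[r + 1], edges_hz[r + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[r] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrum(frames, fft_size=512, num_filters=24, sample_rate=16000, floor=1e-10):
    """Return the complex spectrum (T, M) and the log mel spectrum (T, R)."""
    spc = np.fft.fft(frames, n=fft_size, axis=1)          # step S110b (zero-padded to M)
    magnitude = np.abs(spc[:, : fft_size // 2 + 1])       # absolute value of each bin
    mel = magnitude @ mel_filterbank(num_filters, fft_size, sample_rate).T  # step S110c
    return spc, np.log(np.maximum(mel, floor))            # step S110d
```

The complex spectrum is returned alongside the log mel spectrum because the noise suppression device later uses both.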

The second acoustic feature extraction unit 120 takes as input the learning mixed signal O^noisy_τ, in which the learning speech signal O^clean_τ is mixed with a learning noise signal, and extracts from it the vector O^noisy_t of the learning normalized log mel spectrum, which is the feature quantity used for speech model learning.

FIG. 3 is a flowchart illustrating an example of the processing procedure of the second acoustic feature extraction unit of the speech model learning device. The processing of the second acoustic feature extraction unit 120 is described with reference to FIG. 3. In steps S120a to S120d, the second acoustic feature extraction unit 120 applies to O^noisy_τ the same processing as steps S110a to S110d of FIG. 2, which are applied to O^clean_τ.

Next, the second acoustic feature extraction unit 120 applies normalization processing to the log mel spectrum of the learning mixed signal O^noisy_τ obtained by the logarithm processing of step S120d (step S120e). Specifically, it computes the mean and standard deviation of the log mel spectrum over the entire learning mixed signal O^noisy_τ and uses them to normalize the log mel spectrum to mean 0 and variance 1.
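A minimal sketch of the normalization in step S120e follows. It assumes that the mean and standard deviation are taken separately for each of the R dimensions over all frames; the text only says they are computed from the entire log mel spectrum. The function name and the guard `eps` against zero variance are hypothetical.

```python
import numpy as np

def normalize_log_mel(log_mel, eps=1e-8):
    """Normalize a (T, R) log mel spectrum to mean 0 and variance 1 per dimension."""
    log_mel = np.asarray(log_mel, dtype=np.float64)
    mean = log_mel.mean(axis=0, keepdims=True)
    std = log_mel.std(axis=0, keepdims=True)
    # eps prevents division by zero for a constant dimension (an added safeguard).
    return (log_mel - mean) / np.maximum(std, eps)
```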

Next, the second acoustic feature extraction unit 120 calculates the first-order and second-order regression coefficients of the log mel spectrum of the learning mixed signal O^noisy_τ normalized in step S120e, and combines them with the normalized log mel spectrum into a 3R-dimensional vector O^norm_t = {O^norm_{t,0}, ..., O^norm_{t,r}, ..., O^norm_{t,3R−1}}^T (regression coefficient appending processing, step S120f). It then performs frame concatenation processing, which joins the vectors O^norm_t produced in step S120f over the Z frames before and after frame t, that is, frames {t−Z, ..., t, ..., t+Z}, into a 3R × (2Z+1)-dimensional vector O^norm_t = {vector O^norm_{t−Z}^T, ..., vector O^norm_t^T, ..., vector O^norm_{t+Z}^T}^T (for example, Z = 5) (step S120g). The second acoustic feature extraction unit 120 outputs this vector O^norm_t as the learning normalized log mel spectrum.
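Steps S120f and S120g can be sketched as follows. The text does not give the regression formula, so the common linear-regression delta over a window of ±2 frames is assumed, with the second-order coefficients computed as the delta of the first-order ones; padding the edges by repeating the first and last frames is also an assumption, and all function names are hypothetical.

```python
import numpy as np

def regression_coefficients(features, width=2):
    """Regression (delta) coefficients along time for a (T, D) array."""
    features = np.asarray(features, dtype=np.float64)
    num_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(d * d for d in range(1, width + 1))
    delta = np.zeros_like(features)
    for d in range(1, width + 1):
        ahead = padded[width + d : width + d + num_frames]    # frame t + d
        behind = padded[width - d : width - d + num_frames]   # frame t - d
        delta += d * (ahead - behind)
    return delta / denom

def add_regression_coefficients(norm_log_mel):
    """Step S120f: stack static, first-order and second-order terms -> (T, 3R)."""
    delta1 = regression_coefficients(norm_log_mel)
    delta2 = regression_coefficients(delta1)
    return np.concatenate([norm_log_mel, delta1, delta2], axis=1)

def splice_frames(features, context=5):
    """Step S120g: join frames t-Z ... t+Z -> (T, D * (2Z + 1))."""
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[z : z + num_frames] for z in range(2 * context + 1)], axis=1)
```

With R = 24 and Z = 5, each spliced vector has 3 × 24 × 11 = 792 elements, which matches the DNN input dimension D_0 = 3R × (2Z+1) given for the weight matrices.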

The maximum likelihood distribution estimation unit 130 obtains the corresponding distribution label Lab_t using the vector O^clean_t of the learning log mel spectrum, which is the output of the first acoustic feature extraction unit 110, and the speech GMM 300a stored in the main memory of the speech GMM storage device 300.

Using the vector O^clean_t of the learning log mel spectrum and the speech GMM 300a, the maximum likelihood distribution estimation unit 130 estimates the corresponding distribution label Lab_t used for training the speech DNN according to equation (2).
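The image of equation (2) is not reproduced in this text. Judging from the description of its terms in the paragraph that follows (mixture weight, mean vector, diagonal covariance matrix, and a maximum over 1 ≤ k ≤ K), it presumably has the form below. This is a reconstruction under that assumption, not a copy of the original equation.

```latex
\mathrm{Lab}_t
  = \operatorname*{argmax}_{1 \le k \le K}\;
    w_{SI,k}\,
    \mathcal{N}\!\left(\mathbf{O}^{clean}_{t};\,
      \boldsymbol{\mu}_{SI,k},\, \boldsymbol{\Sigma}_{SI,k}\right)
  \tag{2}
```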

In equation (2), k is the index of a normal distribution contained in the speech GMM 300a and runs up to K, the total number of normal distributions (for example, K = 512). Also in equation (2), w_{SI,k} is the mixture weight of the speech GMM 300a, the vector μ_{SI,k} is its mean vector, and Σ_{SI,k} is its diagonal covariance matrix. The parameters w_{SI,k}, μ_{SI,k}, and Σ_{SI,k} are estimated in advance from learning speech data of many speakers. The function N(·) in equation (2) is the probability density function of the multidimensional normal distribution given by equation (3). Equation (2) takes as the corresponding distribution label Lab_t the k that attains max{·} when k is scanned over the range 1 ≤ k ≤ K.
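The label estimation can be sketched as below. The image of equation (3) is not reproduced in this text, so the sketch assumes the usual multivariate normal density with a diagonal covariance and evaluates it in the log domain for numerical stability. The function and argument names are hypothetical, and the returned indices are 0-based, whereas the text counts k from 1.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log density of a normal distribution with diagonal covariance.

    x: (T, R) frames; mean, var: (K, R) per-component parameters. Returns (T, K).
    """
    diff = x[:, None, :] - mean[None, :, :]
    log_norm = np.sum(np.log(2.0 * np.pi * var), axis=1)       # (K,)
    mahalanobis = np.sum(diff * diff / var[None, :, :], axis=2)  # (T, K)
    return -0.5 * (log_norm[None, :] + mahalanobis)

def distribution_labels(clean_log_mel, weights, means, variances):
    """0-based index of the component maximizing w_k * N(O_t; mu_k, Sigma_k)."""
    scores = np.log(weights)[None, :] + log_gaussian_diag(clean_log_mel, means, variances)
    return np.argmax(scores, axis=1)
```

Taking the logarithm does not change which component attains the maximum, so the result is the same label as maximizing the weighted density directly.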

The speech DNN learning unit 140 learns the weight matrices W_j and bias vectors v_j, which are the parameters of the speech DNN 400a, using the corresponding distribution label Lab_t and the vector O^noisy_t of the learning normalized log mel spectrum. Specifically, using the corresponding distribution label Lab_t estimated by the maximum likelihood distribution estimation unit 130 and the vector O^noisy_t of the learning normalized log mel spectrum calculated by the second acoustic feature extraction unit 120, it trains a DNN with J hidden layers as the speech DNN 400a (for example, J = 5). The general method of training a DNN is as described in Reference 1 and Reference 2 above.

The speech DNN learning unit 140 outputs the weight matrices W_j and bias vectors v_j, the parameters of the speech DNN 400a, to the speech DNN storage device 400, which stores them in its main memory. The weight matrix W_j is a D_j × D_{j−1} matrix, and the bias vector v_j is a D_j-dimensional column vector (for example, D_0 = 3R × (2Z+1), D_j = 2048 (j = 1, ..., J−1), and D_J = K).
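The parameter shapes can be illustrated with the sketch below, which follows the example dimensions literally. The sigmoid hidden activation and the softmax output are assumptions, since the text defers training details to References 1 and 2. The random values merely stand in for trained parameters, and all names are hypothetical.

```python
import numpy as np

R, Z, K, J = 24, 5, 512, 5
HIDDEN = 2048

def layer_sizes(r=R, z=Z, k=K, j=J, hidden=HIDDEN):
    """[D_0, D_1, ..., D_J] following the example values in the text."""
    return [3 * r * (2 * z + 1)] + [hidden] * (j - 1) + [k]

def init_parameters(sizes, rng):
    """Placeholder parameters: W_j has shape (D_j, D_{j-1}), v_j has shape (D_j,)."""
    return [(0.01 * rng.standard_normal((d_out, d_in)), np.zeros(d_out))
            for d_in, d_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    """Map one input vector to K output probabilities (assumed activations)."""
    h = x
    for W, v in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + v)))   # sigmoid hidden layer
    W, v = params[-1]
    logits = W @ h + v
    logits = logits - logits.max()               # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()
```

With the default values, `layer_sizes()` gives an input of 792 elements and an output of 512 nodes, one per component of the speech GMM, so output node k can be read as the posterior of component k.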

(Configuration of noise suppression device)
FIG. 4 is a diagram illustrating an example of the configuration of the noise suppression device. The noise suppression device 200 is connected to the speech GMM storage device 300 and the speech DNN storage device 400. The noise suppression device 200 takes as input an input mixed signal O_τ in which a speech signal and a noise signal are mixed, and outputs a noise-suppressed signal ^S_τ, an estimate of the input mixed signal O_τ with the noise signal suppressed. The noise suppression device 200 includes a first acoustic feature extraction unit 210, a second acoustic feature extraction unit 220, a parameter estimation unit 230, and a noise suppression unit 240.

The first acoustic feature extraction unit 210 takes as input the input mixed signal O_τ, in which a speech signal and a noise signal are mixed, and extracts the complex spectrum vector Spc_t and the log mel spectrum vector O_t of the input mixed signal O_τ, which are the feature quantities used to perform noise suppression on O_τ. The first acoustic feature extraction unit 210 has the same processing function as the first acoustic feature extraction unit 110 of the speech model learning device 100.

The second acoustic feature extraction unit 220 takes the input mixed signal O_τ as input and extracts from it the normalized log mel spectrum vector O^DNN_t, which is the feature quantity used to calculate the speech posterior probability P_{t,k}. The second acoustic feature extraction unit 220 has the same processing function as the second acoustic feature extraction unit 120 of the speech model learning device 100.

The parameter estimation unit 230 estimates the speaker adaptation parameter vector b and the parameter set λ of the noise GMM, which is a probabilistic model of noise. It uses the log mel spectrum vector O_t extracted by the first acoustic feature extraction unit 210, the speech GMM 300a stored in the speech GMM storage device 300, the normalized log mel spectrum vector O^DNN_t extracted by the second acoustic feature extraction unit 220, and the speech DNN 400a stored in the speech DNN storage device 400.

The speech GMM 300a, which consists of parameters estimated from training speech data of many speakers, is called a speaker-independent (SI) GMM. A speech GMM consisting of parameters estimated from training speech data of a specific speaker is called a speaker-dependent (SD) GMM. Because it is not practical to train the speaker-independent GMM on training speech data of a specific speaker, a speaker-dependent GMM is obtained by speaker adaptation processing. That is, the mean vector μ_{SD,k} of the speaker-dependent GMM is obtained by transforming the mean vector μ_{SI,k} of the speaker-independent GMM through the speaker adaptation processing of equation (4) below.

In equation (4) above, the vector b is the speaker adaptation parameter and is an R-dimensional vector. The vector b does not depend on the index k of the normal distributions contained in the speech GMM 300a. The noise GMM, on the other hand, is given by equation (5) below.
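
Equation (4) itself is not reproduced in this text. As a minimal sketch of the speaker adaptation step, the following assumes that the transformation is an additive bias shared by all components, i.e. μ_{SD,k} = μ_{SI,k} + b. This form is an assumption made for illustration, and the function name `adapt_means` is hypothetical.

```python
def adapt_means(mu_si, b):
    """Shift every speaker-independent mean vector by the same bias b.

    Assumption: equation (4) is an additive bias, mu_SD_k = mu_SI_k + b.
    mu_si: list of K mean vectors (each a list of R floats).
    b:     list of R floats, shared by all K components.
    """
    return [[m + bias for m, bias in zip(mu_k, b)] for mu_k in mu_si]


mu_si = [[1.0, 2.0], [3.0, 4.0]]   # K = 2 components, R = 2 dimensions
b = [0.5, -1.0]
print(adapt_means(mu_si, b))        # [[1.5, 1.0], [3.5, 3.0]]
```

Because b does not depend on k, a single R-dimensional vector has to be estimated per speaker, regardless of how many components the speech GMM has.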

In equation (5) above, l is the index of a normal distribution contained in the noise GMM, and L is the total number of normal distributions (for example, L = 4). The vector N_t is the log mel spectrum of the noise, and p_N(N_t) is the likelihood of the noise GMM. In addition, w_{N,l} is a mixture weight of the noise GMM, the vector μ_{N,l} is a mean vector of the noise GMM, and the matrix Σ_{N,l} is a diagonal covariance matrix of the noise GMM. Hereinafter, the parameter set of the noise GMM is defined as λ = {w_{N,l}, μ_{N,l}, Σ_{N,l}}.
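
The definitions above describe a Gaussian mixture with diagonal covariances. The sketch below evaluates the log-likelihood of such a mixture for one frame. It follows the standard GMM definition, which is assumed to match equation (5); the function names are illustrative.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var = diagonal)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_l w_l * N(x; mu_l, Sigma_l), using log-sum-exp."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working in the log domain avoids underflow when R is large, since a product of 24 small densities quickly leaves the range of floating-point numbers.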

The parameter estimation unit 230 estimates the speaker adaptation parameter vector b and the noise GMM parameter set λ with the EM algorithm. The EM algorithm is a method for estimating the parameters of a probabilistic model. It optimizes the parameters by repeating an expectation step (E-step), which calculates the expected value of the cost function (log-likelihood function) of the probabilistic model, and a maximization step (M-step), which maximizes the cost function, until a convergence condition is satisfied.

Next, the detailed configuration of the parameter estimation unit 230 shown in FIG. 4 is described. FIG. 5 is a diagram illustrating an example of the configuration of the parameter estimation unit of the noise suppression device. As shown in FIG. 5, the parameter estimation unit 230 includes an initialization unit 231, a probability and signal estimation unit 232, a reliable data selection unit 233, a speaker adaptation parameter estimation unit 234, a noise GMM estimation unit 235, and a convergence determination unit 236.

FIG. 6 is a flowchart illustrating an example of the processing procedure of the parameter estimation unit of the noise suppression device. The processing of the parameter estimation unit 230 is described with reference to FIG. 6. First, the initialization unit 231 initializes the iteration index of the EM algorithm to i = 1 (step S230a). Next, the initialization unit 231 executes an initial value estimation process (step S230b), which estimates the initial values of the speaker adaptation parameter vector b and the noise GMM parameter set λ for the EM algorithm by equations (6) to (11) below. Here, U in equation (9) below is the number of frames used for initial value estimation (for example, U = 10). In equation (9) below, diag{·} denotes computing only the diagonal elements of the matrix · and setting the off-diagonal elements to 0.

In equation (9) above, the index i indicates a parameter of the i-th iteration of the EM algorithm. The vector 0 in equation (6) above is an R-dimensional column vector whose elements are all 0. GaussRand(·) in equation (10) above is a function that generates normally distributed random numbers.

Next, the probability and signal estimation unit 232 executes a speech posterior probability calculation process (step S230c), which calculates the speech posterior probability P_{t,k} by equations (12) to (15) below from the normalized log mel spectrum vector O^DNN_t and the parameters stored in the speech DNN 400a, namely the weight matrices W_j and the bias vectors v_j.

In equation (14) above, W_{j,k,k'} is an element of the weight matrix W_j and v_{j,k} is an element of the bias vector v_j. In equation (15) above, O^DNN_{t,K} is an element of the vector O^DNN_t.
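
Equations (12) to (15) are not reproduced in this text. The following is a minimal sketch of a feed-forward pass that turns a feature vector into posterior probabilities over K classes from weight matrices W_j and bias vectors v_j. The sigmoid hidden activations and the softmax output layer are assumptions (a common choice for such networks), not details confirmed by the text, and all function names are illustrative.

```python
import math


def affine(weights, bias, x):
    """Return W x + v for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


def softmax(z):
    peak = max(z)                      # subtract the maximum for stability
    e = [math.exp(zi - peak) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def speech_posteriors(layers, features):
    """layers: list of (W_j, v_j) pairs; the last pair is the output layer."""
    h = features
    for weights, bias in layers[:-1]:
        h = sigmoid(affine(weights, bias, h))
    weights, bias = layers[-1]
    return softmax(affine(weights, bias, h))
```

The returned list sums to 1, so each entry can be read as the posterior probability of one component k of the speech GMM for the frame.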

Next, the probability and signal estimation unit 232 executes a mixed signal GMM generation process (step S230d), which constructs the GMM of the log mel spectrum vector O_t shown in equation (16) below. It uses the speaker adaptation parameter vector b^(i−1) of the (i−1)-th iteration, the noise GMM parameter set λ^(i−1) of the (i−1)-th iteration, and the parameters of the speech GMM 300a.

In equation (16) above, p_O^(i)(O_t) is the likelihood of the log mel spectrum vector O_t with respect to the GMM generated from the speech GMM 300a in the mixed signal GMM generation process of step S230d. In addition, w_{O,k,l}^(i), the vector μ_{O,k,l}^(i), and the matrix Σ_{O,k,l}^(i) are, respectively, the mixture weight, mean vector, and diagonal covariance matrix of the GMM of the log mel spectrum vector O_t. This GMM is generated from the speaker adaptation parameter vector b^(i−1) of the (i−1)-th iteration, the noise GMM parameter set λ^(i−1), and the parameters of the speech GMM 300a, and its parameters are given by equations (17) to (20) below.

In equation (18) above, the logarithm log(·) and the exponential exp(·) are applied to each element of a vector. In equations (18) and (20) above, the vector 1 is an R-dimensional column vector whose elements are all 1. In equation (19) above, H_{k,l}^(i) is the Jacobian matrix of the function h(·).
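
Equations (17) to (20) are not reproduced in this text. The sketch below shows one common way to combine a speech Gaussian and a noise Gaussian in the log mel domain: a first-order expansion of the mismatch function O = S + log(1 + exp(N − S)). That this matches the patent's h(·) is an assumption, suggested only by the element-wise log and exp and the Jacobian mentioned above. The function name and argument layout are hypothetical, and `mu_s` is taken to be the already adapted speech mean.

```python
import math


def combine_component(mu_s, var_s, mu_n, var_n):
    """First-order statistics of O = S + log(1 + exp(N - S)).

    All arguments are per-dimension lists (diagonal covariances).
    Returns the mean and diagonal variance of the mixed-signal Gaussian.
    """
    mu_o, var_o = [], []
    for ms, vs, mn, vn in zip(mu_s, var_s, mu_n, var_n):
        mu_o.append(ms + math.log1p(math.exp(mn - ms)))
        h = 1.0 / (1.0 + math.exp(mn - ms))   # dO/dS at the expansion point
        var_o.append(h * h * vs + (1.0 - h) ** 2 * vn)
    return mu_o, var_o


print(combine_component([50.0], [1.0], [0.0], [3.0])[1][0])  # 1.0
```

When speech dominates a band, h approaches 1 and the mixed-signal statistics approach those of the speech component; when noise dominates, they approach those of the noise component. The sketch does not guard against overflow of `exp` for very large differences between the two means.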

Next, the probability and signal estimation unit 232 executes an expected value calculation process, which is the E-step of the EM algorithm (step S230e). This process calculates, by equation (21) below, the expected value of the cost function Q_O(·) of the probabilistic model of the log mel spectrum vector O_t in the i-th iteration.

In equation (21) above, O_{0:T−1} = {O_0, ..., O_t, ..., O_{T−1}}, and T is the total number of frames of the log mel spectrum vector O_t. In addition, P_{t,k,l}^(i) in equation (21) is the speech posterior probability given by equations (22) and (23) below for the normal distribution index k of the speech GMM 300a and the normal distribution index l of the noise GMM at frame t.

The M-step of the EM algorithm corresponds to the signal estimation process of step S230f, the reliable data selection process of step S230g, the speaker adaptation parameter estimation process of step S230h, and the noise GMM parameter estimation process of step S230i.

In step S230f, the probability and signal estimation unit 232 estimates, from the log mel spectrum vector O_t, the clean speech log mel spectrum vector S_t^(i) and the noise log mel spectrum vector N_t^(i) that are used to update the speaker adaptation parameter vector b^(i) and the noise GMM parameter set λ^(i). The clean speech log mel spectrum vector S_t^(i) and the noise log mel spectrum vector N_t^(i) are estimated by equations (24) and (25) below.

Next, the reliable data selection unit 233 executes a reliable data selection process (step S230g), which selects the estimated clean speech log mel spectrum vectors ^S_t^(i) and the estimated noise log mel spectrum vectors ^N_t^(i) to be used when estimating the speaker adaptation parameter vector b^(i) and the noise GMM parameter set λ^(i).

FIG. 7 is a flowchart illustrating an example of the reliable data selection subroutine executed by the parameter estimation unit of the noise suppression device. The reliable data selection process determines, for every frame, whether clean speech or noise is dominant. If clean speech is dominant, the frame number t is stored in the set T_S^(i) of clean speech signal frames; if noise is dominant, the frame number t is stored in the set T_N^(i) of noise frames. As shown in FIG. 7, the reliable data selection unit 233 first calculates SNR_t^(i), the signal-to-noise ratio of each frame t, by equation (26) below.

In equation (26) above, ^S_{t,r}^(i) is an element of the estimated clean speech log mel spectrum vector ^S_t^(i) at frame t, and ^N_{t,r}^(i) is an element of the estimated noise log mel spectrum vector ^N_t^(i) at frame t. The reliable data selection unit 233 then applies k-means clustering to SNR_t^(i), the signal-to-noise ratio of each frame t obtained by equation (26), classifies the SNR_t^(i) of all frames t into two classes C = 0, 1, and defines the average signal-to-noise ratio of each class as AveSNR_c^(i) (the above is step S230g-1).

The reliable data selection unit 233 then determines, for each frame t, whether AveSNR_{c=0}^(i) ≥ AveSNR_{c=1}^(i) holds (step S230g-2). If it determines that AveSNR_{c=0}^(i) ≥ AveSNR_{c=1}^(i), processing moves to step S230g-3. If it determines that AveSNR_{c=0}^(i) < AveSNR_{c=1}^(i), processing moves to step S230g-6.

In step S230g-3, the reliable data selection unit 233 determines whether SNR_t^(i) of each frame t satisfies SNR_t^(i) ∈ {C = 0}, that is, whether SNR_t^(i) belongs to the set {C = 0} (the cluster C = 0). For a frame t determined to satisfy SNR_t^(i) ∈ {C = 0}, processing moves to step S230g-4. For a frame t determined to satisfy SNR_t^(i) ∈ {C = 1}, processing moves to step S230g-5.

In step S230g-4, the reliable data selection unit 233 stores the frame number t determined in step S230g-3 in the set T_S^(i) of clean speech signal frames. In step S230g-5, the reliable data selection unit 233 stores the frame number t determined in step S230g-3 in the set T_N^(i) of noise signal frames.

In step S230g-6, on the other hand, the reliable data selection unit 233 determines whether SNR_t^(i) of each frame t satisfies SNR_t^(i) ∈ {C = 1}, that is, whether SNR_t^(i) belongs to the set {C = 1} (the cluster C = 1). For a frame t determined to satisfy SNR_t^(i) ∈ {C = 1}, processing moves to step S230g-7. For a frame t determined to satisfy SNR_t^(i) ∈ {C = 0}, processing moves to step S230g-8.

In step S230g-7, the reliable data selection unit 233 stores the frame number t determined in step S230g-6 in the set T_S^(i) of clean speech signal frames. In step S230g-8, the reliable data selection unit 233 stores the frame number t determined in step S230g-6 in the set T_N^(i) of noise signal frames. When the processing of steps S230g-4, S230g-5, S230g-7, and S230g-8 is complete, the reliable data selection unit 233 returns control to the processing of the parameter estimation unit 230 of the noise suppression device shown in FIG. 6.
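
The subroutine of FIG. 7 can be sketched as follows: cluster the per-frame SNR values into two classes, then treat the class with the higher average SNR as clean speech and the other as noise. The per-frame SNR values are taken as given, since equation (26) is not reproduced in this text. The min/max initialization of the k-means centers, the fixed iteration count, and the function names are assumptions made for illustration.

```python
def two_means(values, iterations=20):
    """1-D k-means with two classes, initialised at the min and max."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers


def select_reliable_frames(snr):
    """Return (speech_frames, noise_frames) as lists of frame indices."""
    labels, ave_snr = two_means(snr)
    # The class with the higher average SNR is treated as clean speech.
    speech_class = 0 if ave_snr[0] >= ave_snr[1] else 1
    speech = [t for t, lab in enumerate(labels) if lab == speech_class]
    noise = [t for t, lab in enumerate(labels) if lab != speech_class]
    return speech, noise


snr = [-5.0, 12.0, 15.0, -3.0, 14.0, -4.0]
print(select_reliable_frames(snr))  # ([1, 2, 4], [0, 3, 5])
```

Comparing the class averages first makes the result independent of which label the clustering happens to assign to the high-SNR cluster, which is the purpose of the two symmetric branches in the flowchart.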

Next, the speaker adaptation parameter estimation unit 234 executes a speaker adaptation parameter estimation process (step S230h), which updates the speaker adaptation parameter vector b^(i) by equation (27) below. It uses the speech posterior probability P_{t,k} obtained in the speech posterior probability calculation process of step S230c, the clean speech log mel spectrum ^S_t^(i) estimated in the signal estimation process of step S230f, and the set T_S^(i) of clean speech signal frames obtained in the reliable data selection process of step S230g.

Next, the noise GMM estimation unit 235 executes a noise GMM parameter estimation process (step S230i), which updates the noise GMM parameter set λ^(i) by equations (28) to (30) below. It uses the speech posterior probability P_{t,l}^(i) obtained in the expected value calculation process of step S230e, the noise log mel spectrum vector ^N_t^(i) estimated in the signal estimation process of step S230f, and the set T_N^(i) of noise signal frames obtained in the reliable data selection process of step S230g.

Next, the convergence determination unit 236 executes a convergence determination process (step S230j), which determines whether a predetermined convergence condition is satisfied. If the condition is satisfied, the convergence determination unit 236 sets b = b^(i) and ends the processing of the parameter estimation unit 230. If the condition is not satisfied, it increments i by 1 (i ← i + 1) (step S230k) and moves the processing to step S230d. The convergence condition is given by equation (31) below, in which Q_O(·) is defined by equation (21) above and η = 0.0001.
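
Equation (31) is not reproduced in this text. A typical stopping rule for EM compares the relative change of the expected cost between consecutive iterations with a small threshold; the sketch below assumes that form. The exact expression in the patent may differ, and the function name is hypothetical.

```python
def has_converged(q_curr, q_prev, eta=1e-4):
    """Assumed criterion: relative change of the expected cost below eta."""
    if q_curr == 0.0:
        return q_prev == 0.0
    return abs(q_curr - q_prev) / abs(q_curr) < eta


print(has_converged(-1000.0, -1000.05))  # True  (relative change 5e-5)
print(has_converged(-1000.0, -1001.0))   # False (relative change 1e-3)
```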

Next, the detailed configuration of the noise suppression unit 240 shown in FIG. 4 is described. FIG. 8 is a diagram illustrating an example of the configuration of the noise suppression unit of the noise suppression device. The noise suppression unit 240 constructs a noise suppression filter from the complex spectrum vector Spc_t, the log mel spectrum vector O_t, the speech GMM 300a, the speaker adaptation parameter vector b, the noise GMM parameter set λ, and the speech posterior probability P_{t,k}, and suppresses the noise to obtain the noise-suppressed signal ^S_τ.

As shown in FIG. 8, the noise suppression unit 240 includes a noise suppression filter estimation unit 241 and a noise suppression filter application unit 242. The noise suppression filter estimation unit 241 receives the log mel spectrum vector O_t, the speech GMM 300a, the speaker adaptation parameter vector b, the noise GMM parameter set λ, and the speech posterior probability P_{t,k}, and estimates the noise suppression filter F_{t,m}^Lin. The noise suppression filter application unit 242 receives the complex spectrum vector Spc_t and the noise suppression filter F_{t,m}^Lin, and suppresses the noise to obtain the noise-suppressed signal ^S_τ.

FIG. 9 is a flowchart illustrating an example of the processing procedure of the noise suppression filter estimation unit of the noise suppression device. First, the noise suppression filter estimation unit 241 executes a probabilistic model generation process (step S241a), which generates the GMM parameters of the log mel spectrum vector O_t from the speech GMM 300a, the speaker adaptation parameter vector b, and the noise GMM parameter set λ, as in equations (32) to (35) below.

Next, the noise suppression filter estimation unit 241 executes a probability calculation process (step S241b), which calculates the posterior probability P_{t,k,l} by equations (36) and (37) below from the GMM parameters of the log mel spectrum vector O_t, the log mel spectrum vector O_t, and the speech posterior probability P_{t,k}.

Next, the noise suppression filter estimation unit 241 executes a noise suppression filter estimation process (step S241c), which estimates the noise suppression filter F_{t,r}^Mel on the mel frequency axis as in equation (38) below. It uses the mean vector μ_{SD,k} of the speaker-dependent (SD) GMM, which is generated from the mean vector μ_{SI,k} of the speech GMM 300a and the speaker adaptation parameter vector b, the mean vector μ_{N,l} of the noise GMM contained in the noise GMM parameter set λ, and the posterior probability P_{t,k,l}. Equation (38) below is written element by element.
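
Equation (38) is not reproduced in this text. As a hedged illustration of a posterior-weighted filter on the mel axis, the sketch below assumes a Wiener-like gain exp(μ_{SD,k,r}) / (exp(μ_{SD,k,r}) + exp(μ_{N,l,r})) for each component pair, averaged with weights P_{t,k,l}. This form is an assumption, and `mel_filter` is a hypothetical name.

```python
import math


def mel_filter(post, mu_sd, mu_n):
    """Posterior-weighted gain per mel band for one frame.

    post[k][l]:  posterior P_{t,k,l} (all entries sum to 1).
    mu_sd[k][r]: adapted speech means; mu_n[l][r]: noise means (log mel).
    Returns R gains, each between 0 and 1.
    """
    dims = len(mu_sd[0])
    gains = [0.0] * dims
    for k, mu_s in enumerate(mu_sd):
        for l, mu_noise in enumerate(mu_n):
            for r in range(dims):
                # exp(s) / (exp(s) + exp(n)) written as a sigmoid of (s - n)
                ratio = 1.0 / (1.0 + math.exp(mu_noise[r] - mu_s[r]))
                gains[r] += post[k][l] * ratio
    return gains


print(mel_filter([[1.0]], [[0.0, 2.0]], [[0.0, 2.0]]))  # [0.5, 0.5]
```

With equal speech and noise means the gain is 0.5 in every band; the gain moves toward 1 where the speech mean dominates and toward 0 where the noise mean dominates.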

Next, the noise suppression filter estimation unit 241 executes a noise suppression filter conversion process (step S241d), which converts the noise suppression filter F_{t,r}^Mel on the mel frequency axis into the noise suppression filter F_{t,r}^Lin on the linear frequency axis. This conversion estimates the values of the noise suppression filter on the linear frequency axis by applying cubic spline interpolation along the mel frequency axis. When step S241d ends, the processing of the noise suppression filter estimation unit 241 ends.

FIG. 10 is a flowchart illustrating an example of the processing procedure of the noise suppression filter application unit of the noise suppression device. First, the noise suppression filter application unit 242 executes a filtering process (step S242a), which obtains the noise-suppressed complex spectrum ^S_{t,m} by multiplying the complex spectrum vector Spc_t by the noise suppression filter F_{t,m}^Lin, as in equation (39) below. Equation (39) below is written element by element.

Next, the noise suppression filter application unit 242 executes an inverse fast Fourier transform process (step S242b), which obtains the noise-suppressed speech ^S_{t,n} of frame t by applying the inverse fast Fourier transform to the complex spectrum ^S_{t,m}. The noise suppression filter application unit 242 then executes a waveform concatenation process (step S242c), which concatenates the noise-suppressed speech ^S_{t,n} of each frame t while removing the window function w_n, as in equations (40) and (41) below, to obtain the continuous noise-suppressed speech ^s_τ. When step S242c ends, the processing of the noise suppression filter application unit 242 ends.

[Effects of the embodiment]
To show the effect of the embodiment, an example is presented in which an acoustic signal containing both a speech signal and a noise signal was input to the noise suppression device 200 of the embodiment and noise suppression was performed. The experimental method and results are described below.

The experiment was evaluated on AURORA4, a database for speech recognition in noisy environments. The AURORA4 evaluation data consists of four sets: A, speech without noise; B, speech mixed with six types of noise; C, speech without noise recorded with a different microphone; and D, speech mixed with six types of noise recorded with a different microphone. Details of AURORA4 are given in Reference 3: N. Parihar, J. Picone, D. Pearce, H. G. Hirsch, "Performance analysis of the Aurora large vocabulary baseline system," in Proceedings of the European Signal Processing Conference, Vienna, Austria, 2004.

The AURORA4 speech data is a monaural signal sampled at 16,000 Hz with 16-bit quantization. Acoustic features were extracted from acoustic signals based on this speech data using a frame length of 25 ms (Frame = 400 samples) and moving the frame start point every 10 ms (Shift = 160 samples).
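
As a quick check of the framing arithmetic above, the following sketch converts the frame length and shift from milliseconds to samples and counts the full frames in one second of audio. Dropping the final partial frame is an assumption of this sketch, and `frame_starts` is an illustrative name.

```python
def frame_starts(num_samples, frame_len=400, shift=160):
    """Start indices of all full frames that fit in the signal."""
    if num_samples < frame_len:
        return []
    return list(range(0, num_samples - frame_len + 1, shift))


fs = 16000                                  # sampling frequency in Hz
print(fs * 25 // 1000, fs * 10 // 1000)     # 400 160
print(len(frame_starts(fs)))                # 98
```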

As the speech GMM 300a, a GMM with K = 512 mixture components and an R = 24-dimensional log mel spectrum as the acoustic feature was used, trained on the AURORA4 training speech data without added noise. The number of mixture components of the noise GMM was set to L = 4. The speech DNN 400a takes as its acoustic feature a vector of D_0 = 3R × (2Z + 1) = 792 dimensions in total, consisting of the R = 24-dimensional log mel spectrum, its first- and second-order regression coefficients, and the features of Z = 5 frames before and after the current frame. The DNN has J = 5 hidden layers, with D_0 = 792 nodes in the input layer, D_j = 2048 (j = 1, ..., 4) nodes in the hidden layers, and D_5 = K = 512 nodes in the output layer, and was trained on the AURORA4 training speech data with added noise.
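
The input dimension D_0 = 3R × (2Z + 1) = 792 follows from stacking 2Z + 1 = 11 frames of 3R = 72 values each. The sketch below builds such a spliced vector. Repeating the first or last frame at the utterance edges is an assumption made here, since the text does not say how edges are handled, and `splice` is a hypothetical name.

```python
def splice(frames, t, context=5):
    """Concatenate frames t-context .. t+context, clamping at the edges."""
    last = len(frames) - 1
    out = []
    for u in range(t - context, t + context + 1):
        out.extend(frames[min(max(u, 0), last)])
    return out


R, Z = 24, 5
frames = [[0.0] * (3 * R) for _ in range(20)]   # static + delta + delta-delta
print(len(splice(frames, 0, Z)))                 # 792
```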

Speech recognition was performed with a recognizer based on finite state transducers. Details of the recognizer are given in Reference 4: T. Hori, et al., "Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition," IEEE Trans. on ASLP, vol. 15, no. 4, pp. 1352-1365, May 2007.

The acoustic model is a DNN with seven hidden layers. Each hidden layer has 2048 nodes, and the output layer has 3042 nodes. The acoustic feature for speech recognition is a vector of 792 dimensions in total, consisting of a 24-dimensional log mel spectrum, analyzed with a frame length of 25 ms (Frame = 400 samples) and a frame shift of 10 ms (Shift = 160 samples), its first- and second-order regression coefficients, and the features of five frames before and after the current frame. The language model is a trigram with a vocabulary of 5,000 words. Speech recognition was evaluated by the word error rate (WER) of equation (42) below. In equation (42), N is the total number of words, D is the number of deletion errors, S is the number of substitution errors, and I is the number of insertion errors. A smaller WER indicates higher speech recognition performance.
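
Equation (42) is not reproduced in this text. The sketch below uses the standard definition of the word error rate, WER = (D + S + I) / N, expressed in percent; the scaling to percent is an assumption, and the function name is illustrative.

```python
def word_error_rate(n_words, deletions, substitutions, insertions):
    """WER in percent: 100 * (D + S + I) / N."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return 100.0 * (deletions + substitutions + insertions) / n_words


print(word_error_rate(1000, 20, 50, 10))   # 8.0
```

Because insertions are counted in the numerator, the WER can exceed 100% when the recognizer outputs many more words than the reference contains.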

FIG. 11 is a diagram illustrating an example of the effect of the embodiment. The "prior art" in FIG. 11 is the noise suppression result of the method disclosed in Reference 5: M. Fujimoto and T. Nakatani, "A reliable data selection for model-based noise suppression using unsupervised joint speaker adaptation and noise model estimation," in Proceedings of ICSPCC '12, pp. 4713-4716, Aug 2012. FIG. 11 compares the speech recognition results for "no noise suppression", "prior art", and "embodiment". As shown in FIG. 11, the embodiment yields a lower WER than the prior art on the noisy evaluation sets B and D, which indicates higher noise suppression performance.

That is, according to the embodiment, in an environment where various types of noise exist, the noise signal can be suppressed from the input acoustic signal and the target speech signal can be extracted with high quality, even if the noise signal contained in the acoustic signal is non-stationary noise that follows a multimodal distribution.

[Other Embodiments]
In other embodiments, in the frame extraction processing of step S110a in FIG. 2 and step S120a in FIG. 3, a window function other than the Hamming window, such as a rectangular window, a Hanning window, or a Blackman window, may be used as the window function w_n. In other embodiments, another probability model, such as a hidden Markov model (HMM), may be used as the probability model of the speech signal in place of the speech GMM 300a. Likewise, another probability model, such as an HMM, may be used as the probability model of the noise signal in place of the noise GMM.
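The interchangeable window functions w_n mentioned above can be sketched as follows. This is an illustrative sketch using the textbook definitions of the four windows, not the implementation of steps S110a and S120a. The names `make_window` and `cut_frames` and the frame length and shift in the usage lines are assumptions.

```python
import math


def make_window(name: str, length: int) -> list:
    """Return a symmetric window w_n, n = 0..length-1, using textbook coefficients."""
    if length < 2:
        raise ValueError("length must be at least 2")
    # w_n = a0 - a1*cos(2*pi*n/(N-1)) + a2*cos(4*pi*n/(N-1))
    coeffs = {
        "rectangular": (1.0, 0.0, 0.0),
        "hanning": (0.5, 0.5, 0.0),
        "hamming": (0.54, 0.46, 0.0),
        "blackman": (0.42, 0.5, 0.08),
    }
    a0, a1, a2 = coeffs[name]
    denom = length - 1
    return [
        a0 - a1 * math.cos(2.0 * math.pi * n / denom) + a2 * math.cos(4.0 * math.pi * n / denom)
        for n in range(length)
    ]


def cut_frames(signal: list, frame_len: int, shift: int, window_name: str = "hamming") -> list:
    """Cut overlapping frames from `signal` and multiply each by the chosen window."""
    window = make_window(window_name, frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        segment = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


# Hypothetical sizes: 20 samples, frame length 8, shift 4.
frames = cut_frames([1.0] * 20, frame_len=8, shift=4, window_name="blackman")
print(len(frames))
```

Swapping the window only changes the `window_name` argument, so the rest of the feature extraction can stay unchanged.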

In other embodiments, the speaker adaptation parameter vector b may be made dependent on the index k of the normal distribution contained in the speech GMM 300a, as in equation (43) below.

In other embodiments, the reliable data selection processing shown in step S230g of FIG. 6 and in FIG. 7 may be executed using a predetermined threshold Th_SNR, as in equation (44) below, instead of k-means clustering.
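To make the difference between the two selection rules concrete, the following sketch contrasts a fixed threshold with a two-class k-means split. It assumes that each frame has a scalar SNR-like score and that frames scoring above the threshold are the ones kept. Both assumptions, the function names, and the example values are hypothetical; the exact form of equation (44) is not reproduced here.

```python
def select_by_threshold(snr_db: list, th_snr: float) -> list:
    """Return indices of frames whose score exceeds the fixed threshold th_snr."""
    return [t for t, snr in enumerate(snr_db) if snr > th_snr]


def select_by_two_means(snr_db: list, n_iter: int = 20) -> list:
    """Return indices of frames assigned to the higher-mean cluster of a 1-D 2-means split."""
    lo, hi = min(snr_db), max(snr_db)  # initialize the two centroids at the extremes
    for _ in range(n_iter):
        high = [s for s in snr_db if abs(s - hi) <= abs(s - lo)]
        low = [s for s in snr_db if abs(s - hi) > abs(s - lo)]
        if not high or not low:
            break  # degenerate case: all scores fall into one cluster
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return [t for t, s in enumerate(snr_db) if abs(s - hi) <= abs(s - lo)]


# Hypothetical per-frame scores in dB.
scores = [-2.0, 15.0, 12.0, 0.5, 18.0, -1.0]
print(select_by_threshold(scores, th_snr=5.0))  # [1, 2, 4]
print(select_by_two_means(scores))              # [1, 2, 4]
```

The threshold rule needs no iteration but requires choosing Th_SNR in advance, whereas the clustering rule adapts its decision boundary to the scores observed in each utterance.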

In other embodiments, the noise suppression filter estimation processing in step S241c of FIG. 9 may use, instead of the average weighted by the posterior probabilities P_{t,k,l} as in equation (38) above, the estimate weighted by the largest weight, that is, by the maximum posterior probability P_{t,k,l}. In this case, it is desirable that the maximum posterior probability P_{t,k,l} be sufficiently larger than the other posterior probabilities P_{t,k,l}.
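The two ways of combining per-component filter estimates can be sketched as below for a single frame t. The sketch assumes that the component pairs (k, l) are flattened into one index j and that equation (38) is a linear sum of per-component filters weighted by their posteriors. The function names and the numbers are hypothetical.

```python
def filter_weighted_average(posteriors: list, filters: list) -> list:
    """Combine per-component filters as sum_j P_j * W_j, per frequency bin."""
    n_bins = len(filters[0])
    return [
        sum(p * w[m] for p, w in zip(posteriors, filters))
        for m in range(n_bins)
    ]


def filter_max_posterior(posteriors: list, filters: list) -> list:
    """Use only the component with the largest posterior, weighted by that posterior."""
    j_max = max(range(len(posteriors)), key=lambda j: posteriors[j])
    return [posteriors[j_max] * w for w in filters[j_max]]


# Hypothetical posteriors over three component pairs and two frequency bins.
P = [0.9, 0.07, 0.03]
W = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]
print(filter_weighted_average(P, W))  # approximately [0.758, 0.242]
print(filter_max_posterior(P, W))     # approximately [0.72, 0.18]
```

The two results approach each other as the largest posterior approaches 1, which is consistent with the stated preference that the maximum posterior be much larger than the others.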

(Device configuration of the speech model learning device and the noise suppression device)
The components of the speech model learning device 100 illustrated in FIG. 1 and the noise suppression device 200 illustrated in FIG. 4 are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form in which the functions of the speech model learning device 100 and the noise suppression device 200 are distributed or integrated is not limited to the illustrated one; all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the speech model learning device 100 and the noise suppression device 200 may be a single integrated device.

In the embodiment, the speech model learning device 100 and the noise suppression device 200 are separate devices, and the first acoustic feature extraction unit 110 and the second acoustic feature extraction unit 120 of the speech model learning device 100 are functional components separate from the first acoustic feature extraction unit 210 and the second acoustic feature extraction unit 220 of the noise suppression device 200. However, the configuration is not limited to this: the first acoustic feature extraction unit 110 and the first acoustic feature extraction unit 210, and/or the second acoustic feature extraction unit 120 and the second acoustic feature extraction unit 220, may be the same functional component.

In the embodiment, the speech GMM storage device 300 and the speech DNN storage device 400 are devices separate from the speech model learning device 100 and the noise suppression device 200. However, the configuration is not limited to this: the speech GMM storage device 300 and/or the speech DNN storage device 400 may be integrated with the speech model learning device 100 and/or the noise suppression device 200.

All or any part of the processing performed in the speech model learning device 100 and the noise suppression device 200 may be realized by a processing device such as a CPU (Central Processing Unit) and a program analyzed and executed by that processing device. The processing performed in the speech model learning device 100 and the noise suppression device 200 may also be realized as hardware using wired logic.

Among the processes described in the embodiment, all or part of the processes described as being performed automatically may be performed manually. Conversely, all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters described above and shown in the drawings may be changed as appropriate unless otherwise specified.

(Program)
FIG. 12 is a diagram illustrating an example of a computer that realizes the speech model learning device and the noise suppression device by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. In the computer 1000, these units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1051 and a keyboard 1052. The video adapter 1060 is connected to, for example, a display 1061.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the speech model learning device 100 and the noise suppression device 200 is stored, for example in the hard disk drive 1031, as a program module 1093 in which instructions to be executed by the computer 1000 are described. For example, a program module 1093 for executing information processing equivalent to the functional configuration of the speech model learning device 100 and the noise suppression device 200 is stored in the hard disk drive 1031.

The setting data used in the processing of the embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as necessary and executes them.

Note that the program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1031; they may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 via the network interface 1070.

The above embodiment and the other embodiments are included in the technology disclosed in the present application and, likewise, fall within the scope of the invention described in the claims and its equivalents.

100 Speech model learning device
110 First acoustic feature extraction unit
120 Second acoustic feature extraction unit
130 Maximum likelihood distribution estimation unit
140 Speech DNN learning unit
200 Noise suppression device
210 First acoustic feature extraction unit
220 Second acoustic feature extraction unit
230 Parameter estimation unit
231 Initialization unit
232 Probability and signal estimation unit
233 Reliable data selection unit
234 Speaker adaptation parameter estimation unit
235 Noise GMM estimation unit
236 Convergence determination unit
240 Noise suppression unit
241 Noise suppression filter estimation unit
242 Noise suppression filter application unit
300 Speech GMM storage device
300a Speech GMM
400 Speech DNN storage device
400a Speech DNN
1000 Computer
1010 Memory
1020 CPU

Claims

A speech model learning method executed by a speech model learning device,
A feature extraction step for learning that extracts an acoustic feature from a speech signal for learning;
An audio label generation step of generating label information that associates the acoustic feature amount extracted by the learning feature amount extraction step with a mixed normal distribution of the audio signal;
A learning normalized feature amount extraction step of extracting a normalized acoustic feature amount from a learning acoustic signal including the learning speech signal and the learning noise signal;
A speech model learning step of learning a speech model using the label information generated by the audio label generation step and the normalized acoustic feature amount extracted by the learning normalized feature amount extraction step.

The speech model learning method according to claim 1, wherein the speech model learning step learns the speech model by associating the mixed normal distribution of the audio signal with each node of the output layer of a deep neural network corresponding to the normalized acoustic feature amount extracted by the learning normalized feature amount extraction step.

A noise suppression method performed by a noise suppression device,
A speech model storage step of storing the speech model learned by the speech model learning method according to claim 1 or 2 in a speech model storage unit;
A feature extraction step of extracting an acoustic feature amount from a mixed acoustic signal including an audio signal and a noise signal;
A normalized feature extraction step of extracting a normalized acoustic feature from the mixed acoustic signal;
A speech posterior probability calculation step of calculating a speech posterior probability using the speech model and the normalized acoustic feature amount extracted by the normalized feature amount extraction step;
A noise suppression step of suppressing the noise signal in the mixed acoustic signal using the speech posterior probability calculated by the speech posterior probability calculation step and a mixed normal distribution of the audio signal.

The noise suppression method according to claim 3, further comprising:
A signal estimation step of estimating the audio signal and the noise signal included in the mixed acoustic signal;
A speaker adaptation parameter estimation step of estimating, from the audio signal and the noise signal estimated by the signal estimation step, a speaker adaptation parameter for adapting the mixed normal distribution of the audio signal to the speaker of the audio signal;
A noise mixed normal distribution generation step of generating a mixed normal distribution of the noise signal from the noise signal estimated by the signal estimation step;
A mixed normal distribution generation step of generating a mixed normal distribution of the mixed acoustic signal from the speaker adaptation parameter, the mixed normal distribution of the audio signal, and the mixed normal distribution of the noise signal; and
An expected value calculation step of calculating an expected value of the audio signal and an expected value of the noise signal included in the mixed acoustic signal from the speech posterior probability and the mixed normal distribution of the mixed acoustic signal,
Wherein the signal estimation step, the speaker adaptation parameter estimation step, the noise mixed normal distribution generation step, the mixed normal distribution generation step, and the expected value calculation step are recursively repeated on the expected value of the audio signal and the expected value of the noise signal calculated by the expected value calculation step until those expected values satisfy a predetermined condition.

A selection step of selecting a signal satisfying a predetermined condition from the voice signal and the noise signal estimated by the signal estimation step,
The speaker adaptation parameter estimation step estimates the speaker adaptation parameter from the voice signal and the noise signal selected by the selection step,
The noise suppression method according to claim 4, wherein the noise mixed normal distribution generation step generates a mixed normal distribution of the noise signal from the noise signal selected by the selection step.

A speech model learning device comprising:
A learning feature extraction unit that extracts an acoustic feature amount from a learning speech signal;
An audio label generation unit that generates label information associating the acoustic feature amount extracted by the learning feature extraction unit with a mixed normal distribution of the audio signal;
A learning normalized feature amount extraction unit that extracts a normalized acoustic feature amount from a learning acoustic signal including the learning speech signal and a learning noise signal; and
A speech model learning unit that learns a speech model using the label information generated by the audio label generation unit and the normalized acoustic feature amount extracted by the learning normalized feature amount extraction unit.

The speech model learning device according to claim 6, wherein the speech model learning unit learns the speech model by associating the mixed normal distribution of the audio signal with each node of the output layer of a deep neural network corresponding to the normalized acoustic feature amount extracted by the learning normalized feature amount extraction unit.

A noise suppression device comprising:
A speech model storage unit that stores the speech model learned by the speech model learning device according to claim 6;
A feature extraction unit that extracts an acoustic feature amount from a mixed acoustic signal including an audio signal and a noise signal;
A normalized feature amount extraction unit that extracts a normalized acoustic feature amount from the mixed acoustic signal;
A speech posterior probability calculation unit that calculates a speech posterior probability using the speech model and the normalized acoustic feature amount extracted by the normalized feature amount extraction unit; and
A noise suppression unit that suppresses the noise signal in the mixed acoustic signal using the speech posterior probability calculated by the speech posterior probability calculation unit and a mixed normal distribution of the audio signal.

The noise suppression device according to claim 8, further comprising:
A signal estimation unit that estimates the audio signal and the noise signal included in the mixed acoustic signal;
A speaker adaptation parameter estimation unit that estimates, from the audio signal and the noise signal estimated by the signal estimation unit, a speaker adaptation parameter for adapting the mixed normal distribution of the audio signal to the speaker of the audio signal;
A noise mixed normal distribution generation unit that generates a mixed normal distribution of the noise signal from the noise signal estimated by the signal estimation unit;
A mixed normal distribution generation unit that generates a mixed normal distribution of the mixed acoustic signal from the speaker adaptation parameter, the mixed normal distribution of the audio signal, and the mixed normal distribution of the noise signal; and
An expected value calculation unit that calculates an expected value of the audio signal and an expected value of the noise signal included in the mixed acoustic signal from the speech posterior probability and the mixed normal distribution of the mixed acoustic signal,
Wherein the signal estimation unit, the speaker adaptation parameter estimation unit, the noise mixed normal distribution generation unit, the mixed normal distribution generation unit, and the expected value calculation unit recursively repeat their processing on the expected value of the audio signal and the expected value of the noise signal calculated by the expected value calculation unit until those expected values satisfy a predetermined condition.

A selection unit that selects a signal that satisfies a predetermined condition from the voice signal and the noise signal estimated by the signal estimation unit;
The speaker adaptation parameter estimation unit estimates the speaker adaptation parameter from the voice signal and the noise signal selected by the selection unit,
The noise suppression apparatus according to claim 9, wherein the noise mixed normal distribution generation unit generates a mixed normal distribution of the noise signal from the noise signal selected by the selection unit.

A speech model learning program for causing a computer to function as the speech model learning device according to claim 6.

A noise suppression program for causing a computer to function as the noise suppression device according to claim 8, 9 or 10.