JP2008298844A

JP2008298844A - Noise suppressing device, computer program, and speech recognition system

Info

Publication number: JP2008298844A
Application number: JP2007141840A
Authority: JP
Inventors: Shigeki Matsuda (松田繁樹); Takeo Fukurotani (袋谷丈夫); Satoru Nakamura (中村哲)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2007-05-29
Filing date: 2007-05-29
Publication date: 2008-12-11

Abstract

PROBLEM TO BE SOLVED: To suppress noise in a short time using limited computational resources in a non-stationary noise environment.

SOLUTION: A noise suppression device includes: a noise probability distribution estimation section 200 that uses a particle filter to generate, for each frame, estimated parameters of the noise distribution from features extracted from frames of a predetermined period of an observation signal; an observation signal distribution estimation section 202 that adapts a Gaussian mixture model (GMM) for clean speech estimation according to the estimated noise probability distribution; a clean speech estimation section 204 that calculates an estimated feature of the target speech for each frame by minimum mean square error (MMSE) estimation; and a calculation control section 212 that controls the adaptation interval so that the GMM is adapted once every several frames. When adaptation is performed for a frame, the estimated feature of the target speech is calculated using the GMM adapted for that frame; when adaptation is not performed, it is calculated using the GMM adapted for an earlier frame.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to speech recognition in real environments where noise is present, and in particular to a noise suppression device for improving the speech recognition rate in environments with non-stationary noise, and to a speech recognition system using that device.

Speech recognition has been studied as a way to realize human-machine interfaces that are easy and natural for people to use. In recent years, large-scale speech and text databases combined with statistical speech recognition methods have made high recognition rates achievable. Today, applied technology is being developed to realize fast, high-accuracy speech recognition in the real environments where people and machines actually interact.

One of the major differences between a real environment and a laboratory environment is the presence of noise. Noise occurs continuously and irregularly at a volume that cannot be ignored, and it varies over time. Noise hinders speech recognition. Improving the speech recognition rate in real, noisy environments is a problem that must be solved urgently in developing applied speech recognition technology.

One technique for improving the speech recognition rate in noisy environments targets noise whose properties are stationary over time: the noise is estimated and suppressed at the preprocessing stage of speech recognition.

Non-Patent Document 1, listed below, discloses spectral subtraction, a general method for suppressing stationary noise. This method assumes that the amplitude spectrum of the noise observed in the interval before an utterance is the same as the amplitude spectrum of the noise during the utterance. Based on this assumption, the noise is suppressed by subtracting the amplitude spectrum of the noise observed immediately before the utterance from the amplitude spectrum of the speech signal observed during the utterance.
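The subtraction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of Non-Patent Document 1: the function names, the averaging of several pre-utterance frames into one noise estimate, and the flooring of negative results at zero are all assumptions made here.

```python
def mean_spectrum(frames):
    """Average amplitude spectrum over frames (assumed noise-only, pre-utterance)."""
    count = len(frames)
    return [sum(column) / count for column in zip(*frames)]


def spectral_subtraction(observed_frames, noise_spectrum, floor=0.0):
    """Subtract one fixed noise spectrum from every observed frame.

    Values that would become negative are clipped to `floor` (an assumption
    of this sketch; the source text does not describe the clipping rule).
    """
    return [
        [max(x - n, floor) for x, n in zip(frame, noise_spectrum)]
        for frame in observed_frames
    ]


# Hypothetical amplitude spectra with three frequency bins per frame.
pre_utterance = [[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]]
noise = mean_spectrum(pre_utterance)
utterance = [[5.0, 2.5, 0.5], [2.0, 6.0, 4.0]]
cleaned = spectral_subtraction(utterance, noise)
print(cleaned)  # [[3.0, 0.5, 0.0], [0.0, 4.0, 3.0]]
```

Because the noise estimate is fixed once before the utterance, the sketch also makes the stationarity assumption visible: any change in the noise during the utterance goes uncorrected.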

Non-Patent Document 2, listed below, discloses a noise suppression method for distributed speech recognition. This method uses the amplitude spectrum of the noise observed immediately before the utterance to suppress noise based on Wiener filter theory.

There are also techniques that estimate and suppress noise sequentially at the preprocessing stage of speech recognition. Non-Patent Document 3, listed below, discloses a method that applies a sequential EM (Expectation Maximization) algorithm to obtain maximum likelihood estimates of the noise sequentially. Estimating noise sequentially with the sequential EM algorithm makes it possible to estimate and suppress noise with high accuracy while following its temporal variation.

The method disclosed in Non-Patent Documents 4 and 5, listed below, which obtains noise estimates sequentially using a Kalman filter, is also in common use. This method estimates and suppresses noise sequentially by alternating one-step-ahead prediction and filtering.

Another approach to improving the speech recognition rate in noisy environments is to perform speech recognition adaptively using a probabilistic model that takes noise into account. For example, Patent Document 1, listed below, discloses a speech recognition system that uses a sequential estimation method called a particle filter to estimate noise parameters and to grow, over time, the hidden states that make up an HMM (Hidden Markov Model), and that performs speech recognition based on that HMM.

Non-Patent Document 1: S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. ASSP, Vol. 27, No. 2, pp. 113-120, 1979.
Non-Patent Document 2: European Telecommunications Standards Institute (ETSI), ETSI ES 202 050 V1.1.3, "Speech Processing, Transmission and Quality Aspects (STQ), Distributed Speech Recognition: Advanced Front-end Feature Extraction Algorithm; Compression Algorithms," Nov. 2003.
Non-Patent Document 3: M. Afify, O. Siohan, "Sequential Estimation with Optimal Forgetting for Robust Speech Recognition," IEEE Trans. SAP, Vol. 12, No. 1, pp. 19-26, 2004.
Non-Patent Document 4: Suguru Arimoto (有本卓), "Kalman Filter" (カルマンフィルター), Sangyo Tosho.
Non-Patent Document 5: Kiyoshi Nishiyama (西山清), supervised by Michio Nakano (中野道雄), "Kalman Filters Solved on a Personal Computer" (パソコンで解くカルマンフィルタ), Maruzen.
Non-Patent Document 6: A. M. Peinado, V. Sanchez, J. C. Segura, J. L. Perez-Cordoba, "MMSE-Based Channel Error Mitigation for Distributed Speech Recognition," Eurospeech 2001 - Scandinavia (7th European Conference on Speech Communication and Technology), pp. 2707-2710, 2001.
Patent Document 1: JP 2007-41499 A

The techniques described in Non-Patent Documents 1 and 2 both estimate and suppress noise on the premise that the noise is stationary. In real environments, however, most noise is non-stationary; that is, its acoustic characteristics change over time. The techniques of Non-Patent Documents 1 and 2 therefore cannot follow the temporal variation of the noise and cannot suppress it with high accuracy.

The technique described in Non-Patent Document 3 uses a sequential EM algorithm. When noise is estimated with the sequential EM algorithm, an iterative calculation must be carried out for every frame of the observed speech signal until the parameters for that frame converge to a local optimum of the likelihood function. An enormous amount of computation is therefore required every time the noise changes, and the computation takes time. It is consequently difficult to estimate and suppress noise in real time with this method.

The techniques described in Non-Patent Documents 4 and 5 estimate noise using a Kalman filter. This estimation method alternates one-step-ahead prediction and filtering, and does not require iterative calculation as the sequential EM algorithm does. However, the Kalman filter approach estimates the probability distribution on the assumption that the posterior distribution of the noise is a single normal distribution. If the true posterior distribution is a mixture distribution, it is approximated by a single normal distribution, and accuracy degrades as a result.
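To make the alternation of one-step-ahead prediction and filtering concrete, the following is a minimal scalar sketch. It is not taken from Non-Patent Documents 4 or 5: the random-walk noise model, the function name `kalman_step`, and the parameters `process_var` and `obs_var` are assumptions chosen for illustration.

```python
def kalman_step(mean, var, observation, process_var, obs_var):
    """One prediction step followed by one filtering step for a scalar state.

    Assumed model (hypothetical): state_t = state_{t-1} + w, w ~ N(0, process_var)
                                  obs_t   = state_t + v,     v ~ N(0, obs_var)
    """
    # One-step-ahead prediction: the mean carries over, uncertainty grows.
    predicted_mean = mean
    predicted_var = var + process_var

    # Filtering: blend the prediction with the new observation.
    gain = predicted_var / (predicted_var + obs_var)
    filtered_mean = predicted_mean + gain * (observation - predicted_mean)
    filtered_var = (1.0 - gain) * predicted_var
    return filtered_mean, filtered_var


mean, var = 0.0, 1.0
for observation in [2.0, 2.0, 2.0]:
    mean, var = kalman_step(mean, var, observation, process_var=0.0, obs_var=1.0)
print(mean, var)
```

Note that the state is always summarized by one mean and one variance. This is exactly the single-normal-distribution restriction the paragraph above points to: a multimodal posterior cannot be represented in this form.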

The speech recognition system described in Patent Document 1 performs speech recognition using a model that takes noise into account, and therefore achieves higher recognition accuracy. However, because it uses a particle filter, its computational cost is large, and it is difficult to run the system at high speed on a device with limited computational resources.

An object of the present invention is therefore to provide a noise suppression device that improves the speech recognition rate in environments with non-stationary noise and can suppress noise in a short time using limited computational resources.

Another object of the present invention is to provide a noise suppression device that improves the speech recognition rate in environments with non-stationary noise and can suppress noise in a short time while reducing the amount of computation.

A noise suppression device according to a first aspect of the present invention suppresses the noise component in an observation signal obtained by observing target speech in a noisy environment. The device includes: noise estimation means that receives features extracted from frames of a predetermined length, taken from the observation signal at a predetermined period, and that uses a particle filter having a plurality of particles to sequentially generate, for each frame, estimated parameters of a probability distribution representing the noise, based on a previously prepared acoustic model for clean speech estimation composed of a plurality of component distributions; adaptation means for adapting the acoustic model to the noise according to the noise probability distribution whose parameters were estimated by the noise estimation means; target speech estimation means for calculating, for each frame, an estimated feature of the target speech by MMSE (Minimum Mean Square Error) estimation, using the noise-adapted acoustic model and the feature of the observation signal; and control means for controlling the interval of adaptation by the adaptation means so that the acoustic model is adapted once every plurality of frames. When adaptation by the adaptation means is performed for a frame, the target speech estimation means calculates the estimated feature of the target speech for that frame using the Gaussian mixture model so adapted. When adaptation is not performed for a frame, the target speech estimation means calculates the estimated feature of the target speech for that frame using the acoustic model adapted by the adaptation means for an earlier frame.

Details of the MMSE method are disclosed in Non-Patent Document 6.

In this specification, the speech given to the noise suppression device (the observed speech) is regarded as the superposition of noise-free target speech and noise. The target speech in this sense is called "clean speech".

When the adaptation means adapts the acoustic model for clean speech estimation to the noise, the control means controls the adaptation interval so that the adaptation is performed once every plurality of frames. When adaptation is performed for a frame, the target speech estimation means estimates the target speech by MMSE estimation using that acoustic model. When adaptation is not performed for a frame, the target speech estimation means estimates the target speech by MMSE estimation using the acoustic model adapted for an earlier frame. Adapting the acoustic model is computationally expensive; because it need not be done for every frame, the amount of computation is reduced and the target speech can be estimated at high speed.

The acoustic model may be a Gaussian mixture distribution composed of a plurality of component distributions.

Using a Gaussian mixture distribution makes it easy to model the features of clean speech statistically even when they follow a complicated distribution.

Any two of the plurality of component distributions may have variances that differ from each other.

Because the component distributions are normally obtained statistically by training on speech samples, their variances usually differ from one another.

Alternatively, the plurality of component distributions may have means that differ from one another and variances that are equal to one another.

In this case, it becomes virtually unnecessary to take the distribution into account in the calculations of the noise estimation means, which further reduces the amount of computation and makes processing faster. As a result, it is possible to provide a noise suppression device that improves the speech recognition rate in environments with non-stationary noise and suppresses noise in a short time using limited computational resources.

Preferably, the noise suppression device further includes storage means for storing the adapted acoustic model in response to the adaptation means having adapted the acoustic model for a frame. When the acoustic model for clean speech estimation has been adapted for a frame, the target speech estimation means calculates the estimated feature of the target speech by MMSE estimation using that adapted acoustic model. When the acoustic model for clean speech estimation is not adapted for a frame, the target speech estimation means calculates the estimated feature of the target speech by the MMSE method using the adapted acoustic model stored in the storage means.

When the acoustic model is adapted for a frame, that acoustic model is stored in the storage means. When the acoustic model is not adapted for a frame, the estimated feature of the target speech is calculated by MMSE estimation using the acoustic model held in the storage means; in other words, the acoustic model adapted for an earlier frame is used. Because this processing is repeated continuously over time, using an acoustic model adapted for an earlier frame has almost no effect on estimation performance as long as the interval between adaptations is sufficiently short. Meanwhile, the amount of computation needed for adaptation is greatly reduced. As a result, the target speech can be estimated at high speed while performance is maintained.

More preferably, the control means includes: interval storage means for storing, in advance, information that defines the interval between frames at which the acoustic model for clean speech estimation is adapted to the noise; frame count storage means for storing the number of frames processed since the adaptation means last performed adaptation; addition means for adding 1 to the content of the frame count storage means each time a frame to be processed is given to the noise suppression device; determination means for determining whether the content of the frame count storage means is equal to the content of the interval storage means; means for enabling or disabling adaptation of the frame by the adaptation means according to the result of the determination; and means for clearing the frame count storage means to zero in response to the determination means having determined that the content of the frame count storage means equals the content of the interval storage means.
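The counter-based control described above, combined with the stored model from the preceding paragraphs, can be sketched as follows. This is a hedged illustration only: the class and function names, the `adapt` and `estimate` callables, and the use of an `initial_model` before the first adaptation are assumptions of this sketch and are not specified in the source.

```python
class AdaptationController:
    """Decides, frame by frame, whether the acoustic model should be re-adapted."""

    def __init__(self, interval):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.interval = interval          # interval storage
        self.frames_since_adaptation = 0  # frame count storage

    def should_adapt(self):
        # Add 1 for every incoming frame, then compare with the interval.
        self.frames_since_adaptation += 1
        if self.frames_since_adaptation == self.interval:
            self.frames_since_adaptation = 0  # clear the counter to zero
            return True
        return False


def suppress(frames, interval, initial_model, adapt, estimate):
    """Run estimation on every frame, adapting the model only every `interval` frames."""
    controller = AdaptationController(interval)
    stored_model = initial_model  # assumption: used until the first adaptation
    outputs = []
    for frame in frames:
        if controller.should_adapt():
            stored_model = adapt(frame)  # expensive step, done rarely
        outputs.append(estimate(stored_model, frame))  # cheap step, every frame
    return outputs


# Toy stand-ins: the "model" is just a label recording when it was adapted.
used = suppress(
    frames=[1, 2, 3, 4, 5, 6, 7],
    interval=3,
    initial_model="initial",
    adapt=lambda frame: "adapted@%d" % frame,
    estimate=lambda model, frame: model,
)
print(used)
```

With an interval of 3, the expensive `adapt` call runs on frames 3 and 6 only, while `estimate` runs on all seven frames using whichever model was stored most recently.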

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any of the noise suppression devices described above. The same effects as those of the noise suppression devices described above can therefore be obtained.

A speech recognition system according to a third aspect of the present invention includes any of the noise suppression devices described above, and speech recognition means that receives the estimated feature of the target speech calculated by the noise suppression device and performs speech recognition on the target speech using the acoustic model described above and a predetermined language model for the language to be recognized.

[First Embodiment]
An embodiment of the present invention is described below with reference to the drawings. In the drawings used in the following description, identical parts are given identical reference numerals. Their names and functions are also identical, so detailed descriptions of them are not repeated. Symbols such as "^" used in the text below should properly be written directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In equations, these symbols are written in their proper positions. Also, in the text below, vectors and matrices are in most cases written in ordinary type preceded by the word "vector" or "matrix", for example "vector X_t" or "matrix Σ_W", whereas in equations they are written in bold.

[Configuration]
<Overall configuration of the speech recognition system>
FIG. 1 shows the overall configuration of a speech recognition system 100 according to the present embodiment. Referring to FIG. 1, the speech recognition system 100 includes: a preprocessing unit 104 for extracting, from sound 122 generated by a sound source 102, a feature vector 126 representing the speech features used for speech recognition; a preprocessing acoustic model unit 106, connected to the preprocessing unit 104, for preparing a probabilistic model (acoustic model) representing the relationship between speech features and phonemes; a language model unit 108 for preparing a probabilistic model (language model) representing, among other things, word connection probabilities in the language to be recognized; a search unit 110 for searching for words and the like that correspond to the features output from the preprocessing unit 104, using the language model of the language model unit 108 and a predetermined acoustic model; and a recognition acoustic model unit 109, connected to the search unit 110, for preparing the acoustic model used in the search by the search unit 110.

The speech recognition system 100 further includes constraint condition parameters 138, which are used by the preprocessing unit 104 in extracting the feature vector 126 and which consist of coefficients that define the constraint conditions in the state space model described later.

The sound source 102 includes a speaker 116 who utters the speech to be recognized (target speech) 120, and a noise source 118 that generates noise 121 around the speaker 116. The sound 122 that is generated by the sound source 102 and recorded by the preprocessing unit 104 is the superposition of the noise-free target speech 120 produced by the utterance of the speaker 116 and the noise 121. In this specification, as stated above, the noise-free target speech 120 is called "clean speech". In contrast, the sound 122 that reaches and is recorded by the preprocessing unit 104, that is, the superposition of the clean speech 120 and the noise 121, is called "noise-superimposed speech".

The preprocessing unit 104 includes: a measurement unit 112 that records the noise-superimposed speech 122 and applies predetermined signal processing to the resulting observation signal to extract a predetermined feature vector 124 of the observation signal (hereinafter sometimes simply called the "feature of the observation signal"); and a noise suppression unit 114 that suppresses the noise component contained in the feature 124 of the observation signal extracted by the measurement unit 112, using the acoustic model prepared by the preprocessing acoustic model unit 106 and the constraint condition parameters 138.

The measurement unit 112 performs log mel filter bank analysis on the observation signal for each frame, with a frame interval of 10 milliseconds and a frame length of several tens of milliseconds, and outputs a vector whose elements are the resulting log mel spectrum as the feature 124 of the observation signal.
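The framing step alone can be sketched as below. The 10 ms frame interval comes from the text; the 16 kHz sampling rate and the 25 ms frame length are assumptions made here (the source says only "several tens of milliseconds"), and the function name `split_into_frames` is hypothetical. The log mel filter bank analysis itself is not shown.

```python
def split_into_frames(samples, sample_rate, shift_ms=10, length_ms=25):
    """Cut a sample sequence into overlapping frames.

    shift_ms:  frame interval (10 ms in the text).
    length_ms: frame length (25 ms is an assumed value for this sketch).
    """
    shift = sample_rate * shift_ms // 1000
    length = sample_rate * length_ms // 1000
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + length <= len(samples):
        frames.append(samples[start:start + length])
        start += shift
    return frames


# One second of a hypothetical 16 kHz signal (sample index used as the value).
signal = list(range(16000))
frames = split_into_frames(signal, sample_rate=16000)
print(len(frames), len(frames[0]))  # 98 400
```

Each frame would then be passed through the filter bank analysis to yield one feature vector per 10 ms.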

The noise suppression unit 114 uses the acoustic model prepared by the preprocessing acoustic model unit 106 and the constraint condition parameters 138 to sequentially estimate, frame by frame, the feature vector of the clean speech 120 from the feature 124 of the observation signal. It outputs the feature vector obtained by this sequential estimation to the search unit 110 as the speech feature vector 126 used for speech recognition. In doing so, it first estimates the feature vector of the noise 121 and then estimates the feature vector of the clean speech 120 based on that result. In this specification, the speech represented by the feature vector 126 is called the "estimated clean speech", and the feature vector 126 is called the "feature of the estimated clean speech".

The search unit 110 uses the feature 126 of the estimated clean speech to search for matching words and the like in the target language, based on the acoustic model prepared by the recognition acoustic model unit 109 and the language model prepared by the language model unit 108, and outputs the result as a recognition output 128.

<Acoustic model for preprocessing>
The acoustic model prepared by the preprocessing acoustic model unit 106 is described below. The preprocessing acoustic model unit 106 shown in FIG. 1 prepares and holds a Gaussian mixture model (GMM) 130 as the acoustic model for the clean speech 120. The preprocessing acoustic model unit 106 includes: a training data storage unit 132 for storing training data on the clean speech 120 prepared in advance; a model training unit 134 for training the GMM 130 using the training data in the training data storage unit 132; and a GMM storage unit 136 for storing the GMM 130 obtained by the training performed by the model training unit 134.

FIG. 2 schematically shows the concept of the GMM 130. Referring to FIG. 2, the GMM 130 is a probabilistic model in which a time-series signal is modeled by a single stationary signal source (state). The GMM 130 defines the vectors that may be output as the feature vector of the clean speech 120 and the probability that each such vector is output (hereinafter simply called the "output probability"). The output probability is represented by a mixture of normal distributions 140. The mixture 140 in the GMM 130 includes a plurality of component distributions 148A, 148B, ..., 148K, each of which is a single normal distribution. For example, let k_t denote a certain component distribution 150 contained in the mixture 140. The component distribution k_t is a single normal distribution and, in the multidimensional case, is represented by the mean vector μ_{S,k_t} of the distribution (hereinafter simply the "mean") and the covariance matrix Σ_{S,k_t} (hereinafter simply the "variance"). Both are learned (calculated) statistically in advance from a variety of speech samples. Let vector S_{k_t,t} denote the parameter vector output with the probability given by this component distribution k_t 150. In the following description, the parameter vector S_{k_t,t} output from the GMM 130 is called the "output parameter (of the GMM 130)".
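As a rough illustration of how an output probability is obtained from such a mixture, the sketch below evaluates the log density of a feature vector under a GMM. Several things here are assumptions rather than statements from the source: the mixture weights, the restriction to diagonal covariance matrices (so each "variance" is a list of per-dimension values), and all function names.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a normal distribution with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_density(x, weights, means, variances):
    """Log of the weighted sum of component densities, via log-sum-exp."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the maximum for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


# Hypothetical two-component model over two-dimensional features.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near_first = gmm_log_density([0.1, -0.1], weights, means, variances)
far_from_both = gmm_log_density([10.0, -10.0], weights, means, variances)
print(near_first > far_from_both)  # True
```

A vector close to one of the component means receives a much higher density than a vector far from all of them, which is what lets the mixture describe a multi-peaked feature distribution.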

<State space model>
The state space model is described below. A state space model is a dynamic model consisting of an observation equation, which represents the generation process of the observation signal, and a state equation, which represents the process by which the processing target changes (hereinafter referred to as the “state transition process”). FIG. 3 schematically shows the state space model 160.

Let X_t denote the feature quantity 124 (see FIG. 1) of the observation signal in the frame at time t (hereinafter simply referred to as the “t-th frame”). As described above, the feature quantity X_t of the observation signal is a vector whose elements are the log mel spectrum obtained from the noise-superimposed speech 122; that is, its elements are the log mel spectrum of the sound in which the clean speech 120 and the noise 121 are superimposed. Let the clean speech feature vector S_t be the vector whose elements are the log mel spectrum of the clean speech 120 in the t-th frame, and let the noise feature vector N_t be the vector whose elements are the log mel spectrum of the noise 121. The vectors X_t, S_t, and N_t have the same number of dimensions. The processing described below is performed on each element of these vectors and matrices, but for simplicity the following description does not refer to individual elements separately.

First, the generation process of the observation signal in the state space model 160 is described. The feature quantity X_t of the observation signal is a known vector obtained by measurement. In contrast, the clean speech feature vector S_t and the noise feature vector N_t are both unknown vectors that cannot be obtained by measurement.

Here, it is assumed that the output process of the clean speech 120 can be modeled by a GMM. That is, the clean speech feature vector S_t in the t-th frame is assumed to be represented by the output parameter vector S_{kt,t} output according to a certain element distribution k_t 150 (see FIG. 2) in the GMM 130. However, there is an error between the clean speech feature vector S_t and the output parameter vector S_{kt,t}. This error is also a vector and is denoted as the error vector V_t. As shown in the following expression, the elements of the error vector V_t are assumed to follow a probability distribution expressed by a single normal distribution with mean 0 and variance Σ_{S,kt}: V_t ~ N(0, Σ_{S,kt}).

In this expression, Σ_{S,kt} represents the covariance matrix of the parameters obtained from a certain element distribution k_t 150 in the GMM 130, and the symbol “~” indicates that the value on the left side follows the probability distribution shown on the right side; that is, the value on the left side can be estimated by sampling according to that distribution. “N(μ, Σ)” represents a single normal distribution with mean μ and variance Σ.

Based on the above assumption, the generation process of the feature quantity X_t 124 of the observation signal is expressed by the observation equation shown in the following equation (1), using the noise feature vector N_t, the output parameter vector S_{kt,t} from the GMM 130, and the error vector V_t.

In equation (1), I represents a unit vector. The logarithm and the exponential of a vector denote the vector whose components are obtained by taking the logarithm or the exponential of each element of the vector.
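A minimal sketch of this observation model in Python, assuming equation (1) has the form widely used for log mel spectra, X_t = S_{kt,t} + log(I + exp(N_t − S_{kt,t})) + V_t, evaluated element by element as described above. That form, the function name `observe`, and the use of plain lists for vectors are assumptions made for illustration only.

```python
import math

def observe(s, n, v=None):
    """Combine clean-speech and noise log mel vectors element by element.

    Assumed form of the observation equation (hypothetical reconstruction):
        x = s + log(1 + exp(n - s)) + v
    """
    if v is None:
        v = [0.0] * len(s)  # no modelling error
    # math.log1p(math.exp(d)) equals log(1 + exp(d)); adequate for moderate d
    return [si + math.log1p(math.exp(ni - si)) + vi
            for si, ni, vi in zip(s, n, v)]

# Equal speech and noise energy raises each log mel value by log(2).
x = observe([1.0, 2.0], [1.0, 2.0])
```

When the noise is far below the speech in a given band, the correction term approaches zero and the observed value approaches the clean speech value, which is the behaviour the model is meant to capture.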

Next, the state transition process of the processing target in the state space model 160 is described. In the state space model 160, the noise feature vector N_t is the processing target. Here, the noise feature vector N_t is assumed to change according to a random walk process. That is, a random change is assumed to occur between the noise feature vector N_{t-1} in the (t−1)-th frame and the noise feature vector N_t in the t-th frame. The vector representing this random change is the random Gaussian noise vector W_t. The elements of the random Gaussian noise vector W_t are assumed to be random Gaussian noise following a probability distribution expressed by a single normal distribution with mean 0 and variance Σ_W: W_t ~ N(0, Σ_W).

In this expression, Σ_W represents the covariance matrix of the random Gaussian noise vector W_t.

When a state equation expressing the state transition process of the noise feature vector N_t is defined on the basis of the above assumption, the state equation is given by the following equation (2).
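Written out under the random-walk assumption stated above, the state equation would take the following form. The indexing from frame t−1 to frame t follows the description of W_t as the change between those two frames and is an assumption; the published equation (2) may index the frames differently.

```latex
N_t = N_{t-1} + W_t, \qquad W_t \sim \mathcal{N}(0, \Sigma_W)
```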

However, under the above assumption based on the random walk process, the change in the noise feature vector N_t is defined by the random Gaussian noise vector W_t. The state equation shown in equation (2) therefore cannot accurately represent the temporal change of the noise feature vector N_t. In the present embodiment, a constraint condition is therefore imposed on the change of the noise feature vector N_t using the constraint condition parameter 138 shown in FIG. 1. Details of the constraint conditions and the constraint condition parameter 138 are described later.

<Configuration of Noise Suppression Unit 114>
FIG. 4 is a block diagram showing the configuration of the noise suppression unit 114 (see FIG. 1). Referring to FIG. 4, the noise suppression unit 114 includes: a noise probability distribution estimation unit 200 for sequentially estimating, for each frame, a probability distribution representing the probability that the noise feature vector N_t is output (hereinafter referred to as the “noise probability distribution”) using the feature quantity X_t 124 of the observation signal, the GMM 130, and the constraint condition parameter 138, and for generating a parameter 205 representing the noise probability distribution (hereinafter referred to as the “estimation parameter of the noise probability distribution”); and an observation signal distribution estimation unit 202 for estimating, on the basis of the GMM 130 and the estimation parameter 205 of the noise probability distribution estimated by the noise probability distribution estimation unit 200, a probability distribution representing the probability that the feature quantity X_t 124 of the observation signal is output (hereinafter referred to as the “observation signal distribution”), and for generating a parameter 208 representing the observation signal distribution (hereinafter referred to as the “parameter of the observation signal distribution”).

The noise suppression unit 114 further includes: an interval storage unit 206 for storing in advance the frame interval (denoted “L”) at which the observation signal distribution is estimated; a calculation control unit 212 for controlling the observation signal distribution estimation unit 202, in response to the estimation of the estimation parameter 205 of the noise probability distribution by the noise probability distribution estimation unit 200, so that the observation signal distribution is estimated once every L frames; and a number storage unit 218 used by the calculation control unit 212 to count the number of frames for which the observation signal distribution has not been estimated.

The calculation control unit 212 has a function of giving the observation signal distribution estimation unit 202 an estimation instruction signal 209 indicating that the current frame is one for which the observation signal distribution is to be estimated; a function of outputting an observation signal distribution storage control signal 207 indicating the timing at which the observation signal distribution estimated and output by the observation signal distribution estimation unit 202 for a frame is to be stored; and a function of outputting a selection instruction signal 211 that takes a first value when the current frame is one for which the observation signal distribution is to be estimated and a second value otherwise. Specifically, each time processing of one frame starts, the calculation control unit 212 adds 1 to the value I_L stored in the number storage unit 218 and compares the result with the value L stored in the interval storage unit 206. When the two are equal, the calculation control unit 212 outputs the observation signal distribution storage control signal 207 and the estimation instruction signal 209, and sets the selection instruction signal 211 to the first value. Otherwise, the calculation control unit 212 sets the selection instruction signal 211 to the second value. The calculation control unit 212 resets the value of the number storage unit 218 to zero when that value becomes equal to L.
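The counting behaviour of the calculation control unit 212 can be sketched as follows. This is a hypothetical illustration, not the embodiment's implementation: the class and function names are invented, the boolean return value stands in for the first and second values of the selection instruction signal 211, and `None` stands in for whatever is used before the first estimate, which the text does not specify.

```python
class CalculationControl:
    """Hypothetical sketch of the frame counter kept by the calculation control unit 212."""

    def __init__(self, interval):
        self.interval = interval  # L, held by the interval storage unit 206
        self.count = 0            # I_L, held by the number storage unit 218

    def start_frame(self):
        """Return True when the observation signal distribution is to be estimated."""
        self.count += 1
        if self.count == self.interval:
            self.count = 0        # reset once the count reaches L
            return True           # stands in for the first value of signal 211
        return False              # stands in for the second value of signal 211


def run(frames, interval, estimate):
    """Estimate once every `interval` frames and reuse the stored result otherwise."""
    control = CalculationControl(interval)
    stored = None                 # plays the role of the distribution storage unit 214
    used = []
    for frame in frames:
        if control.start_frame():
            stored = estimate(frame)  # fresh estimate, kept for the following frames
        used.append(stored)           # other frames reuse the last stored estimate
    return used


used = run(range(1, 7), 3, lambda f: f"dist@{f}")
```

With L = 3, the costly estimation runs on frames 3 and 6 only, and the frames in between reuse the most recent result.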

The noise suppression unit 114 further includes: a distribution storage unit 214 for storing and outputting, in response to the observation signal distribution storage control signal 207 from the calculation control unit 212, the observation signal distribution estimated by the observation signal distribution estimation unit 202; a selection unit 216 having two inputs connected respectively to the output of the distribution storage unit 214 and the output of the observation signal distribution estimation unit 202, for selecting the parameter 208 of the observation signal distribution when the selection instruction signal 211 has the first value and the parameter 213 of the last estimated observation signal distribution otherwise, and outputting the selected parameter as the parameter 217 of the observation signal distribution; and a clean speech estimation unit 204 for generating the feature quantity 126 of the estimated clean speech on the basis of the feature quantity 124 of the observation signal, the parameter 217 of the observation signal distribution output by the selection unit 216, and the GMM 130.

The noise probability distribution estimation unit 200 has a function of sequentially estimating the noise probability distribution for each frame and outputting the estimation parameter 205 of the noise probability distribution. Let X_{0:t} = {X_0, ..., X_t} denote the sequence of the feature vectors X_0, ..., X_t of the observation signal, and let N_{0:t} = {N_0, ..., N_t} denote the sequence of the noise feature vectors N_0, ..., N_t. The posterior probability distribution p(N_{0:t} | X_{0:t}) of the sequence N_{0:t} given the sequence X_{0:t} of observation signal vectors is expressed by the following equation (3) using a first-order Markov chain.

The problem of sequentially estimating the probability distribution of the noise feature vector N_t therefore reduces to the problem of estimating the sequence N_{0:t} that maximizes the posterior probability p(N_{0:t} | X_{0:t}) given the sequence X_{0:t} of observation signal vectors. The noise probability distribution estimation unit 200 performs this estimation on the basis of the feature quantity X_t 124 of the observation signal, the GMM 130, the state space model 160, and the above constraint condition parameter 138 relating to the noise state transition. For this purpose, the noise probability distribution estimation unit 200 uses a technique called a particle filter. In this technique, a large number of localized state spaces (particles) are generated within the state space represented by a state space model, the probability distribution of the parameters is estimated in each particle, and the particles are then used to approximately represent the probability distribution of the parameters in the state space.

Using the frame interval (L frames) stored in the interval storage unit 206 and the number storage unit 218, the calculation control unit 212 has a function of controlling the observation signal distribution estimation unit 202 so that the parameter of the observation signal distribution is estimated not for every frame but once every L frames. The calculation control unit 212 further has a function of giving the observation signal distribution storage control signal 207 to the distribution storage unit 214 so that, when the parameter 208 of the observation signal distribution is estimated by the observation signal distribution estimation unit 202, its value is stored, and a function of outputting to the selection unit 216 the selection instruction signal 211, which takes the first value for frames in which the parameter 208 of the observation signal distribution has been estimated by the observation signal distribution estimation unit 202 and the second value for all other frames.

The observation signal distribution estimation unit 202 has a function of calculating, as the parameter 208 of the observation signal distribution, the mean vector and the covariance matrix of the observation signal distribution in each particle. The parameter 208 of the observation signal distribution is calculated using, for example, an HMM composition method called the VTS (Vector Taylor Series) method.

When the observation signal distribution storage control signal 207 is given from the calculation control unit 212, the distribution storage unit 214 stores the parameter 208 of the observation signal distribution being output from the observation signal distribution estimation unit 202 at that time. The output of the distribution storage unit 214 is given to one input of the selection unit 216 as the parameter 213 of the last estimated observation signal distribution.

The selection unit 216 selects the parameter 208 of the observation signal distribution from the observation signal distribution estimation unit 202 when the selection instruction signal 211 given from the calculation control unit 212 has the first value, and the parameter 213 of the last estimated observation signal distribution from the distribution storage unit 214 when it has the second value, and gives the selected parameter to the clean speech estimation unit 204.

The clean speech estimation unit 204 has a function of estimating, for each frame, the clean speech parameters in each particle and calculating the feature quantity 126 of the estimated clean speech as a weighted sum of the estimated parameters. The feature quantity 126 of the estimated clean speech is calculated using, for example, the MMSE estimation method. The clean speech estimation unit 204 further has a function of issuing, to the noise probability distribution estimation unit 200, a request 210 for the transition to the next frame.

<Particle filter>
The particle filter is described below. In this technique, the initial parameters of a large number of particles are determined by random sampling or by sampling from a probability distribution representing the initial state of the parameters. The following processing is then performed for each frame. When the parameters of each particle have been determined for a certain frame, the parameters of each particle are first updated to those corresponding to the subsequent frame. Next, a weight is assigned to each particle according to the likelihood of the update. Next, the parameters of each particle corresponding to the subsequent frame are resampled according to the probability distribution of the parameters in the updated particles. Finally, the parameters of each particle corresponding to the subsequent frame are determined on the basis of the resampled parameters. By performing this processing for each frame, the parameters of each particle are determined sequentially.
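The update, weighting, and resampling steps described above can be sketched with a generic bootstrap particle filter on a scalar state. This is an illustration only: the random-walk update and the Gaussian likelihood are stand-ins chosen for brevity, not the state equation or likelihood of the embodiment, and the function name and default parameters are assumptions.

```python
import math
import random

def particle_filter_step(particles, observation, rng,
                         process_std=0.5, obs_std=1.0):
    """One frame of a generic bootstrap particle filter on a scalar state."""
    # 1. Update: move every particle to the next frame (random walk stand-in).
    moved = [p + rng.gauss(0.0, process_std) for p in particles]
    # 2. Weight: likelihood of the observation under each moved particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in moved]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0 / len(moved)] * len(moved)
    else:
        weights = [w / total for w in weights]
    # 3. Resample: draw particles in proportion to their weights.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, weights

rng = random.Random(0)
particles = [rng.gauss(0.0, 2.0) for _ in range(200)]
for obs in [1.0, 1.2, 0.9, 1.1]:
    particles, weights = particle_filter_step(particles, obs, rng)
estimate = sum(particles) / len(particles)
```

After a few frames the particle cloud concentrates around the observed values, so the mean of the particles serves as the state estimate.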

In the particle filter, each parameter in the state space model 160 is approximately expressed as a weighted sum of the parameters in the particles. Let the number of particles be J, and let the vector N_t^(j) denote the noise feature vector in the j-th particle (1 ≦ j ≦ J) in the t-th frame. Further, let w_t^(j) denote the weight for the j-th particle in the t-th frame. The posterior probability distribution p(N_{0:t} | X_{0:t}) shown in equation (3) is approximately expressed by the Monte Carlo sampling shown in the following equation (4).

In this equation, δ() represents the Dirac delta function.

Letting q(N_{0:t}^(j) | X_{0:t}) be the probability distribution that outputs the sequence N_{0:t}^(j) of noise feature vectors in the j-th particle, the weight w_t^(j) for the particle is given by the following equation (5).

The probability distribution q(N_{0:t}^(j) | X_{0:t}) is assumed to be expressed by the chain model shown in the following equation (6).

Further, the posterior probability distribution p(N_{0:t} | X_{0:t}) of the above equation (3) can be expressed as the following equation (7) by Bayes' rule.

Therefore, from equations (5), (6), and (7), the weight w_t^(j) for the particle is given by equation (8).

Assuming that p(N_t^(j) | N_{t-1}^(j)) = q(N_t^(j) | N_{0:t-1}^(j), X_{0:t}), equation (9) is obtained from equation (8).

p(X_t | N_t^(j)) in equation (9) is modeled by the probability density function shown in the following equation (10).
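For orientation, the textbook sequential importance sampling relations that fit the descriptions of equations (4), (5), (8), and (9) are sketched below, in that order. They are the standard forms under the stated assumption that the proposal q equals the transition density p(N_t^(j) | N_{t-1}^(j)); they are offered as an assumption about the shape of those equations, not as a transcription of them.

```latex
\begin{align}
p(N_{0:t} \mid X_{0:t}) &\approx \sum_{j=1}^{J} w_t^{(j)}\,
    \delta\!\left(N_{0:t} - N_{0:t}^{(j)}\right) \\
w_t^{(j)} &\propto \frac{p(N_{0:t}^{(j)} \mid X_{0:t})}{q(N_{0:t}^{(j)} \mid X_{0:t})} \\
w_t^{(j)} &\propto w_{t-1}^{(j)}\,
    \frac{p(X_t \mid N_t^{(j)})\, p(N_t^{(j)} \mid N_{t-1}^{(j)})}
         {q(N_t^{(j)} \mid N_{0:t-1}^{(j)}, X_{0:t})} \\
w_t^{(j)} &\propto w_{t-1}^{(j)}\, p(X_t \mid N_t^{(j)})
\end{align}
```

Under that assumption the transition density cancels against the proposal, so each weight is updated by multiplying it by the likelihood of the current observation alone.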

As the estimation parameter 205 of the noise probability distribution, the noise probability distribution estimation unit 200 sequentially calculates, for each particle j (1 ≦ j ≦ J), the parameters of the probability density function p(N_{0:t}^(j) | X_{0:t}) in equation (4) for the noise feature vector N_t^(j) in the particle and the weight w_t^(j) for the particle, on the basis of the state space model 160 shown in FIG. 3. The parameters of the probability density function p(N_{0:t}^(j) | X_{0:t}) include the mean vector ^N_t^(j) and the covariance matrix Σ_{Nt}^(j) of the noise feature vector N_t^(j) in the particle. Hereinafter, the mean vector ^N_t^(j) and the covariance matrix Σ_{Nt}^(j) of the probability density function p(N_{0:t} | X_{0:t}) are referred to as the “noise parameters in the (j-th) particle”.

<Constraint conditions on the state transition process>
As described above, the state equation shown in equation (2) cannot accurately represent the temporal change of the noise feature vector N_t. In the present embodiment, the state equation shown in the following equation (11) is therefore introduced for the change of the noise feature vector N_t^(j) (1 ≦ j ≦ J) in each particle.

In this state equation (11), the first and second terms constitute a constraint condition for suppressing the scattering of particles in the (t+1)-th frame. Hereinafter, this constraint condition is referred to as the first constraint condition. The third term of the state equation (11) is a constraint condition on the time transition of the noise feature vector in the j-th particle. Hereinafter, this constraint condition is referred to as the second constraint condition.

In the state equation (11), α is a forgetting factor, and β is a scaling factor for the second constraint condition.

In the first constraint condition, the vector ^N_t is the weighted average of the noise feature vectors N_t^(1), ..., N_t^(J) in the particles of the t-th frame, and is given by the following equation (12).

That is, under the first constraint condition, the noise feature vector in each particle is corrected so as to approach the weighted average vector ^N_t.

In the second constraint condition, the vector μ_{Nt}^(j) is the average (Polyak average) of the noise feature vectors N_{t−T+1}^(j), ..., N_t^(j) for the past T frames in the j-th particle, and is given by the following equation (13).

That is, under the second constraint condition, the Polyak average vector μ_{Nt}^(j) of each particle is fed back to the noise feature vector of that particle. In the present embodiment, the forgetting factor α and the scaling factor β for the second constraint condition in the state equation shown in equation (11), and the number of frames T in equation (13), are given as the constraint condition parameter 138 shown in FIG. 1.
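The two averages used by the constraint conditions can be sketched as follows, assuming that the weights are already normalized to sum to one and that equation (13) is a plain arithmetic mean over the last T frames. The function names and the list-based vector representation are hypothetical, and averaging fewer than T vectors when the history is short is an assumption made here for convenience.

```python
def weighted_average(particles, weights):
    """Weighted average over particles, as described for equation (12).

    Assumes the weights already sum to one.
    """
    dim = len(particles[0])
    return [sum(w * p[d] for p, w in zip(particles, weights))
            for d in range(dim)]

def polyak_average(history, T):
    """Arithmetic mean of the last T vectors of one particle (equation (13) as described).

    If fewer than T vectors are available, all of them are averaged.
    """
    recent = history[-T:]
    dim = len(recent[0])
    return [sum(v[d] for v in recent) / len(recent) for d in range(dim)]

# Two particles with weights 0.25 and 0.75.
n_hat = weighted_average([[1.0, 2.0], [3.0, 6.0]], [0.25, 0.75])
# History of one particle; only the last two frames are averaged.
mu = polyak_average([[0.0], [2.0], [4.0], [6.0]], T=2)
```

The weighted average pulls all particles toward a common centre, whereas the Polyak average is computed separately for each particle from its own recent history.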

The noise probability distribution estimation unit 200 sequentially estimates the noise probability distribution using a particle filter based on the state space model represented by the observation equation (1) and the above state equation (11).

<Configuration of Noise Probability Distribution Estimation Unit 200>
FIG. 5 is a block diagram showing the configuration of the noise probability distribution estimation unit 200. Referring to FIG. 5, the noise probability distribution estimation unit 200 includes a frame selection unit 220 for receiving the request 210 from the clean speech estimation unit 204, selecting the frame to be processed from the feature quantity 124 of the observation signal, and giving the feature quantity 124 of the observation signal corresponding to that frame to the output destination appropriate to the frame.

The noise probability distribution estimation unit 200 further includes: a noise initial distribution estimation unit 222 for receiving the feature quantity 124 of the observation signal from the frame selection unit 220, estimating a probability distribution representing the noise in the initial state (hereinafter referred to as the “noise initial distribution”), and determining, for a large number (J) of particles, the estimation parameter 205 of the noise probability distribution in the frame at t = 0 (hereinafter referred to as the “initial frame”); and a sequential calculation unit 224 for receiving the feature quantity 124 of the observation signal from the frame selection unit 220 and sequentially calculating, for each particle, the estimation parameter 205 of the noise probability distribution in the t-th frame (t ≧ 1).

The frame selection unit 220 selects the frames to be processed one after another each time the request 210 is given. When the initial frame is selected as the processing target, the frame selection unit 220 gives the first predetermined number of frames (for example, 10 frames) of the feature quantity X_t 124 of the observation signal to the noise initial distribution estimation unit 222. When any other frame (t ≧ 1) is selected as the processing target, the frame selection unit 220 gives the feature quantity X_t 124 of the observation signal in that frame to the sequential calculation unit 224.

The noise initial distribution estimation unit 222 estimates the parameters of the noise initial distribution as follows.

That is, the noise initial distribution estimation unit 222 estimates the noise initial distribution on the assumption that it is a single normal distribution. Let N_0 denote the initial value vector of the noise and p(N_0) the noise initial distribution. Letting μ_N be the mean vector and Σ_N the covariance matrix of the noise initial distribution p(N_0), the noise initial distribution p(N_0) is expressed as the following equation (14).

The noise initial distribution estimation unit 222 regards the feature quantity X_t 124 of the observation signal in the section covering the first predetermined number of frames as consisting only of the noise 121 component, and estimates the mean vector μ_N and the covariance matrix Σ_N of the noise initial distribution p(N_0) shown in equation (14). For example, when the section of 10 frames 0 ≦ t ≦ 9 consists only of the noise 121 component, the noise initial distribution estimation unit 222 calculates the mean vector μ_N and the covariance matrix Σ_N by the following equations (15) and (16), respectively. In equation (16), the “T” attached to the upper right of a vector denotes transposition.

The initial noise distribution estimation unit 222 then sets the vector N_0^(j) and the covariance matrix Σ_{N_0}^(j), which are the noise parameters of the j-th particle in the initial frame (t = 0), as in equations (17) and (18), respectively.

That is, the initial noise distribution estimation unit 222 generates the noise feature vector N_0^(j) of the j-th particle by sampling from the initial distribution p(N_0), and sets the covariance matrix Σ_{N_0}^(j) to the covariance matrix Σ_N of the initial distribution p(N_0). It performs the settings of equations (17) and (18) for every particle j (1 ≤ j ≤ J).
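
The particle initialization described above can be sketched as follows. To stay within the Python standard library, the sketch samples each dimension independently using only the diagonal of Σ_N; this is a simplification and not something the text states. The function name `init_particles` and the dictionary layout are hypothetical.

```python
import math
import random
from typing import Dict, List


def init_particles(mean: List[float], cov: List[List[float]],
                   num_particles: int, rng: random.Random) -> List[Dict]:
    """Create J particles for the initial frame.

    Each particle gets a noise vector drawn from p(N_0) and a copy of the
    covariance matrix of p(N_0).
    Assumption: off-diagonal covariance terms are ignored when sampling.
    """
    particles = []
    for _ in range(num_particles):
        sample = [rng.gauss(m, math.sqrt(cov[d][d]))
                  for d, m in enumerate(mean)]
        particles.append({"mean": sample, "cov": [row[:] for row in cov]})
    return particles


if __name__ == "__main__":
    rng = random.Random(0)
    for p in init_particles([0.0, 1.0], [[1.0, 0.0], [0.0, 4.0]], 3, rng):
        print(p["mean"])
```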

The sequential calculation unit 224 includes a GMM sampling unit 226 for sampling the output parameter 240 from the GMM 130. The sequential calculation unit 224 further includes: an update unit 230 that receives the observed-signal feature 124 and updates the noise parameters of each particle; a weight calculation unit 232 that calculates the weight of each updated particle; a resampling unit 234 that resamples the noise parameters of the particles based on the calculated weights; and an estimation parameter generation unit 236 that determines the noise parameters of each particle from the resampled particles and the particles of the (t−1)-th frame, and generates the estimation parameter 205 of the noise probability distribution.

For each particle j (1 ≤ j ≤ J), the GMM sampling unit 226 samples the element distribution k_t^(j) corresponding to the particle from the mixture distribution 140 in the GMM 130 (see FIG. 2) according to the mixture weights. The GMM sampling unit 226 then samples the output parameter vector S^(j)_{k_t^(j),t} from the element distribution k_t^(j) and gives it to the update unit 230. Let P_{S,k_t} denote the mixture weights of the element distributions 148A, ..., 148K in the GMM 130. The element distribution k_t^(j) then follows the probability distribution whose output probabilities are the mixture weights P_{S,k_t}; that is, it is obtained from the GMM 130 by the sampling shown in the following equation (19).

Let μ_{k_t^(j)} be the mean vector and Σ_{S,k_t^(j)} the covariance matrix of the element distribution k_t^(j). The output parameter vector S^(j)_{k_t^(j),t} of the GMM 130 for the j-th particle is obtained from the element distribution k_t^(j) by the sampling shown in the following equation (20).
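
The two-stage sampling described above (first a component by mixture weight, then a vector from that component) can be sketched as below. The sketch assumes diagonal covariances so that it needs only the standard library; the function name `sample_from_gmm` and its argument layout are illustrative, not taken from the patent.

```python
import math
import random
from typing import List, Tuple


def sample_from_gmm(weights: List[float], means: List[List[float]],
                    variances: List[List[float]],
                    rng: random.Random) -> Tuple[int, List[float]]:
    """Draw a component index k by mixture weight, then a vector from it.

    Assumption: every component has a diagonal covariance, given here as
    a list of per-dimension variances.
    """
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    s = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[k], variances[k])]
    return k, s


if __name__ == "__main__":
    rng = random.Random(1)
    # One call per particle j would give k_t^(j) and the sampled vector.
    print(sample_from_gmm([0.3, 0.7], [[0.0], [5.0]], [[1.0], [2.0]], rng))
```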

The frame selection unit 220 also has a function of requesting the GMM sampling unit 226 to sample the output parameters of the GMM for the t-th frame.

The update unit 230 has a function of updating the noise parameters of each particle from those for the (t−1)-th frame to those for the t-th frame by means of an extended Kalman filter whose state-space model is the dynamic model consisting of the observation equation (1) and the state equation (11) given above. The update is based on the constraint condition parameter 138, the state-space model 160 (FIG. 3), and the output parameter S^(j)_{k_t^(j),t} sampled by the GMM sampling unit 226. An extended Kalman filter is a Kalman filter adapted to state-space models that contain a nonlinear term, as the observation equation (1) does.

FIG. 6 is a block diagram of the update unit 230. Referring to FIG. 6, the update unit 230 includes a weighted average calculation unit 250 that, based on the estimation parameter 205 of the noise probability distribution for the (t−1)-th frame, calculates the weighted average vector N̂_{t−1} used in the first constraint of the state equation (11) for the (t−1)-th frame, using equation (12) above.

The update unit 230 further includes: a buffer memory unit 252 that stores, particle by particle, the noise parameters of each particle for the frames up to and including the (t−1)-th frame; a Polyak Average calculation unit 254 that, based on the noise parameters stored in the buffer memory unit 252 and the number of frames T specified by the constraint condition parameter 138, calculates for each particle the Polyak Average vector μ_{N_{t−1}}^(j) over T frames at the (t−1)-th frame, as given by equation (13) above; and a feedback unit 256 that, based on the Polyak Average vector μ_{N_{t−1}}^(j) and the estimation parameter 205 of the noise probability distribution for the (t−1)-th frame, calculates the vector corresponding to the feedback term in the second constraint of the state equation (11). The feedback unit 256 calculates the difference μ_{N_{t−1}}^(j) − N̂_{t−1}^(j) between the Polyak Average vector μ_{N_{t−1}}^(j) and the mean vector N̂_{t−1}^(j) at the (t−1)-th frame.
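
The buffer, averaging, and feedback steps for a single particle can be illustrated as follows. Equation (13) is not reproduced in this text, so the sketch assumes the Polyak Average is a plain unweighted mean of the particle's mean vectors over the last T buffered frames; the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def polyak_average(history: List[Vector], num_frames: int) -> Vector:
    """Unweighted mean of the last `num_frames` buffered mean vectors.

    Assumption: equation (13) is a simple average over T frames.
    `num_frames` is expected to be at least 1.
    """
    window = history[-num_frames:]
    dim = len(window[0])
    return [sum(v[d] for v in window) / len(window) for d in range(dim)]


def feedback_difference(polyak: Vector, current_mean: Vector) -> Vector:
    """Difference between the Polyak Average and the current mean vector."""
    return [p - c for p, c in zip(polyak, current_mean)]


if __name__ == "__main__":
    buffered = [[0.0], [2.0], [4.0]]      # per-frame means of one particle
    avg = polyak_average(buffered, 2)     # average over T = 2 frames
    print(avg, feedback_difference(avg, buffered[-1]))
```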

The update unit 230 further includes an extended Kalman filter unit 258 that updates the noise parameters of each particle from those for the (t−1)-th frame to those for the t-th frame, using an extended Kalman filter whose state-space model consists of the observation equation (1) and the state equation (11). To update the noise parameters of the j-th particle, the extended Kalman filter unit 258 uses the observed-signal feature X_t 124 of the t-th frame, the output parameter vector S^(j)_{k_t^(j),t} of the GMM 130 (see FIG. 2) for the j-th particle, the forgetting factor α and scaling factor β given as the constraint condition parameter 138, the weighted average vector N̂_{t−1}, and the difference μ_{N_{t−1}}^(j) − N̂_{t−1}^(j).

Equations (21) to (26) below are the distribution update equations of the extended Kalman filter in the present embodiment. In these equations, a parameter of the t-th frame predicted from the parameters of the (t−1)-th frame carries the subscript "t|t−1".

Here, as described above, the matrix Σ_W is the covariance matrix of the random Gaussian noise vector W_{t−1} that is added to the noise feature vector N_t in the state transition from the (t−1)-th frame to the t-th frame.

Referring again to FIG. 5, the weight calculation unit 232 has a function of calculating the weight w_t^(j) of each particle in the t-th frame by the calculation method of equations (9) and (10) above, based on the observed-signal feature vector X_t 124 of the t-th frame, the output parameter vector S^(j)_{k_t^(j),t} of the GMM 130 for each particle in the t-th frame, the mean vector N̂_t^(j) and covariance matrix Σ_{N_t}^(j) that are the noise parameters of that particle in that frame, and the weight w_{t−1}^(j) of the particle in the (t−1)-th frame. The weights w_t^(j) (1 ≤ j ≤ J) are normalized so that Σ_{j=1}^{J} w_t^(j) = 1.

The resampling unit 234 has a function of resampling the noise parameters of each particle for the t-th frame according to the noise probability distributions of the particles whose parameters have been updated. The resampling unit 234 does not resample noise parameters from the noise probability distribution of a particle that has been given only a very small weight w_t^(j). From the probability distribution of a particle with a large weight w_t^(j), on the other hand, it resamples a number of times that depends on the magnitude of w_t^(j), and assigns the resulting noise parameters to as many particles as the number of resamplings. The total number of resamplings and the total number of particles both remain constant (J). This is done because, as equation (9) above shows, the weight assigned to each particle corresponds to the likelihood of the observed-signal feature X_t 124.
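
Weight normalization and weight-proportional resampling can be sketched as below. The sketch uses multinomial resampling of particle indices, a common scheme that is assumed here for illustration; the text does not name a specific scheme, and it describes drawing new parameters from each selected particle's distribution, which this sketch leaves out. Function names are hypothetical.

```python
import random
from typing import List


def normalize_weights(weights: List[float]) -> List[float]:
    """Scale the weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def resample_indices(weights: List[float], rng: random.Random) -> List[int]:
    """Pick J source-particle indices with probability proportional to weight.

    Particles with negligible weight are rarely or never chosen, heavy
    particles are chosen several times, and the total stays at J.
    Assumption: multinomial resampling.
    """
    count = len(weights)
    return rng.choices(range(count), weights=weights, k=count)


if __name__ == "__main__":
    w = normalize_weights([1.0, 3.0, 6.0, 0.0])
    print(w, resample_indices(w, random.Random(0)))
```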

The estimation parameter generation unit 236 has a function of regenerating the particles for the t-th frame by the Metropolis-Hastings algorithm, a Markov chain Monte Carlo method. FIG. 7 is a block diagram of the estimation parameter generation unit 236. Referring to FIG. 7, the estimation parameter generation unit 236 includes a re-update unit 262 that re-updates the noise parameters of each particle from those for the (t−1)-th frame to those for the t-th frame. The re-update unit 262 generates the noise probability distribution in the state-space model 160 from the noise parameters of each particle obtained by the resampling in the resampling unit 234. Based on the generated probability distribution and the constraint condition parameter 138, it then re-updates the noise parameters of each particle with the extended Kalman filter expressed by the distribution update equations (21) to (26) above, in the same manner as the update unit 230 shown in FIG. 6.

The estimation parameter generation unit 236 further includes a weight recalculation unit 264 that calculates the weight of each re-updated particle (hereinafter "w_t^{*(j)}") by the calculation method of equations (9) and (10) above.

The estimation parameter generation unit 236 further includes: an acceptance probability calculation unit 266 that calculates, from the weight w_t^(j) of the resampled particle and the weight w_t^{*(j)} of the re-updated particle, the acceptance probability ν used to decide whether to accept the re-updated noise parameters; a random number generation unit 268 that generates a random number u in the closed interval from 0 to 1 by a predetermined random number generation method; and a parameter selection unit 270 that, based on the acceptance probability ν and the random number u, selects either the noise parameters of the resampled particle or those of the re-updated particle as the parameters of the particle for the t-th frame.

The acceptance probability calculation unit 266 has a function of calculating the acceptance probability ν from the weights w_t^(j) and w_t^{*(j)} according to the following equation (27).

The parameter selection unit 270 has a function of replacing the noise parameters and weight of a particle with the new parameters and weight obtained by the re-update when the random number u is less than or equal to the acceptance probability ν.
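
The accept-or-reject decision can be illustrated with the short sketch below. Equation (27) is not reproduced in this text; the sketch assumes the form ν = min(1, w*/w) that is commonly used for Metropolis-Hastings move steps in particle filters, and this is an assumption rather than a statement of the patent's formula. The function names are hypothetical.

```python
import random
from typing import Any


def acceptance_probability(w_resampled: float, w_reupdated: float) -> float:
    """Assumed form of equation (27): nu = min(1, w* / w)."""
    if w_resampled <= 0.0:
        return 1.0  # any re-updated particle beats a zero-weight one
    return min(1.0, w_reupdated / w_resampled)


def select_parameters(resampled: Any, reupdated: Any,
                      w_resampled: float, w_reupdated: float,
                      u: float) -> Any:
    """Keep the re-updated parameters when u <= nu, else the resampled ones."""
    nu = acceptance_probability(w_resampled, w_reupdated)
    return reupdated if u <= nu else resampled


if __name__ == "__main__":
    u = random.Random(0).random()  # stand-in for the random number generator
    print(select_parameters("resampled", "re-updated", 2.0, 1.0, u))
```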

〈Implementation by computer〉
As will also be apparent from the description below, the preprocessing unit 104, the preprocessing acoustic model unit 106, and the search unit 110 of the speech recognition system 100 shown in FIG. 1 can each be realized by computer hardware, a program executed on that hardware, and data stored in that hardware. FIG. 8 is a flowchart showing the control structure of a computer program that realizes the noise suppression processing performed by the noise suppression unit 114 included in the preprocessing unit 104 (see FIG. 1).

Referring to FIG. 8, when the noise suppression processing starts, step 282 estimates the initial distribution for the values of the elements of the noise feature N_0 in the initial state. That is, the mean vector μ_N and covariance matrix Σ_N, which are the parameters of the initial noise distribution p(N_0) shown in equation (4), are calculated by the method of equations (15) and (16) above. Then the vectors N_0^(j) (j = 1, ..., J) are sampled from p(N_0) according to equations (17) and (18) to estimate the noise parameters of each particle in the initial frame. Step 282 also assigns L−1 to the variable I_L; that is, the calculation control unit 212 shown in FIG. 4 stores L−1 in the count storage unit 218.

In step 284, the frame subject to noise suppression advances to the next frame. In the following description, the frame after this advance is taken to be the t-th frame.

In step 285, the particle filter is used to estimate the noise parameters of each particle for the frame being processed. That is, the mean vector N̂_t^(j) and covariance matrix Σ_{N_t}^(j), which are the parameters of the probability density function p(N_{0:t}^(j) | X_{0:t}), are estimated, the weight w_t^(j) of each particle is determined, and the estimation parameter 205 of the noise probability distribution is generated. The processing in this step is described later with reference to FIG. 11.

In step 286, 1 is added to the variable I_L.

In step 288, it is determined whether the value of the variable I_L equals the value of the constant L, and control branches according to the result. If the result is YES, control proceeds to step 292; otherwise, control proceeds to step 298, described later.

In step 292, the parameters 208 of the observed-signal distribution are estimated. That is, the probability distribution of the observed-signal feature X_t 124 for each particle is estimated using the noise parameters N̂_t^(j) and Σ_{N_t}^(j) of each particle determined in step 285. In addition, for each element distribution k (1 ≤ k ≤ K) of the GMM 130, the mean vector μ_{X_k,t}^(j) and covariance matrix Σ_{X_k,t}^(j) of the probability distribution of the observed-signal feature X_t 124 for the particle are calculated.

In step 294, the mean vectors and covariance matrices of the observed-signal probability distribution estimated in step 292 are stored in the distribution storage unit 214 (see FIG. 4). In the subsequent step 296, the variable I_L is cleared to 0. Control then proceeds to step 300.

If the result of the determination in step 288 is NO, control proceeds to step 298. In step 298, the mean vectors and covariance matrices that were stored in the distribution storage unit 214 the last time the determination in step 288 was YES are read from the distribution storage unit 214. After step 298, control proceeds to step 300.
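
The counter logic of steps 282 to 298 can be traced with the sketch below. It records, for each processed frame, the index of the frame whose observed-signal distribution parameters would be used. Frame indices here count processed frames from 0, and the function name `parameter_source_frames` is hypothetical.

```python
from typing import List


def parameter_source_frames(num_frames: int, interval: int) -> List[int]:
    """For each processed frame, return the frame whose parameters are used.

    Mirrors the counter I_L: it starts at L-1, is incremented once per
    frame, triggers a fresh estimate when it equals L, and is then cleared.
    """
    counter = interval - 1               # step 282: I_L = L - 1
    stored_from = -1                     # nothing stored yet
    sources = []
    for frame in range(num_frames):      # step 284: move to the next frame
        counter += 1                     # step 286
        if counter == interval:          # step 288: I_L == L ?
            stored_from = frame          # steps 292-294: estimate and store
            counter = 0                  # step 296
        # otherwise step 298: reuse the stored parameters
        sources.append(stored_from)
    return sources


if __name__ == "__main__":
    print(parameter_source_frames(8, 4))
```

Because the counter starts at L−1, the first processed frame always triggers an estimate, so stored parameters exist before they are first read.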

In step 300, the feature 126 of the estimated clean speech in the t-th frame is calculated by the MMSE estimation method. Specifically, the MMSE estimate vector Ŝ_t is calculated by the MMSE estimation method using either the parameters obtained in steps 285 and 292 or the parameters read from the distribution storage unit 214 in step 298, and is output as the feature 126 of the estimated clean speech (see FIG. 1).

In this equation, P(k | X_t, (j)) denotes the mixture weight for the element distribution k in the GMM 130 for the j-th particle. As in Patent Document 1, the mixture weight P(k | X_t, (j)) is calculated by the following equation.
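
The equations for the mixture weight are not reproduced in this text. As a rough illustration only, the sketch below computes posterior component weights in the usual way for a Gaussian mixture, proportional to the prior weight times the Gaussian likelihood of X_t under each adapted component. Diagonal covariances, strictly positive prior weights, and the function names are assumptions made for this sketch.

```python
import math
from typing import List


def log_gauss_diag(x: List[float], mean: List[float],
                   var: List[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def posterior_mixture_weights(x: List[float], priors: List[float],
                              means: List[List[float]],
                              variances: List[List[float]]) -> List[float]:
    """Posterior weight of each component given x (priors must be > 0)."""
    logs = [math.log(p) + log_gauss_diag(x, m, v)
            for p, m, v in zip(priors, means, variances)]
    peak = max(logs)                      # subtract the peak for stability
    unnorm = [math.exp(value - peak) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    print(posterior_mixture_weights([0.5], [0.4, 0.6],
                                    [[0.0], [2.0]], [[1.0], [1.0]]))
```

For frames in which no new estimate is made, the same function would simply be called with the stored means and variances.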

Note that equations (28) to (31) apply when t = n·L (n is zero or a positive integer). When t ≠ n·L, the parameters stored in the distribution storage unit 214 are used instead of parameters estimated by the observed signal distribution estimation unit 202, so the equation becomes as follows.

In equation (32), t_L is the value satisfying t_L = (n−1)·L < t < n·L (n is a positive integer).

This is explained with reference to FIG. 9. FIG. 9 shows, for L = 4, the relationship between the frame time (t) and the observed-signal distribution parameters used to calculate the mixture weights in each frame. Referring to FIG. 9, suppose the observed-signal distribution parameters are calculated at a time t = t_0. In the frames t = t_1, t_2, and t_3 that follow, until the parameters are calculated again four frames later at t = t_4, the observed-signal distribution parameters calculated at t = t_0 are used for the weight calculation.

Using the mixture weights calculated in this way, step 300 calculates the MMSE estimate vector Ŝ_t by the MMSE estimation method according to equations (28) to (30), and outputs it as the feature 126 of the estimated clean speech (see FIG. 1).

Next, in step 302, termination is checked. If the t-th frame is the last frame, the noise suppression processing ends. Otherwise, control returns to step 284.

In this way, the observed signal distribution estimation unit 202 estimates the observed-signal distribution only once every L frames and not in the frames in between. This shortens the processing time in the observed signal distribution estimation unit 202. Put simply, the amount of computation in this part falls to 1/L, so the computation required to estimate the clean speech can be reduced substantially.

FIG. 10 shows the frames in which the observed-signal distribution is estimated when L is 1, 2, 4, and 8. When L = 1, estimation is performed for every frame. When L = 2, as FIG. 10 shows, the number of frames in which estimation is performed is halved. When L = 4 and L = 8, it falls to 1/4 and 1/8, respectively.

FIG. 11 is a flowchart showing the control structure of a program that realizes the processing, performed in step 285 (see FIG. 8), of generating the estimation parameter 205 of the noise probability distribution. Referring to FIG. 11, when this processing starts, step 320 calculates the parameter vectors for the first and second constraints on the state transition process of the noise 121 that are used in the update by the extended Kalman filter. That is, the weighted average vector N̂_{t−1} of the noise parameters over the particles of the (t−1)-th frame is calculated using equation (12). Then, for each particle, the Polyak Average vector μ_{N_{t−1}}^(j) is calculated from the noise parameters of that particle over the past T frames, and the difference μ_{N_{t−1}}^(j) − N̂_{t−1}^(j) from the mean vector N̂_{t−1}^(j) is calculated.

In step 322, the noise parameters of each particle in the t-th frame are estimated from the noise probability distributions of the particles in the (t−1)-th frame, using the extended Kalman filter of equations (21) to (26).

In step 324, the weight w_t^(j) of each particle in the t-th frame is calculated by equations (9) and (10), and the weights w_t^(j) are then normalized. In step 326, the number of resamplings from each particle is determined from its weight w_t^(j), and parameters are resampled based on the noise probability distribution of that particle. In step 328, the particles of the t-th frame are regenerated using the Metropolis-Hastings algorithm.

FIG. 12 is a flowchart showing the details of the processing in step 328 (see FIG. 11). Referring to FIG. 12, when the processing of step 328 starts, step 340 calculates the weighted average vector N̂_{t−1} by the method of equation (12), as in step 320 of FIG. 11. Then, for each particle, the Polyak Average vector μ_{N_{t−1}}^(j) is calculated from the noise parameters of that particle over the past T frames, and the difference μ_{N_{t−1}}^(j) − N̂_{t−1}^(j) from the mean vector N̂_{t−1}^(j) is calculated.

In the subsequent step 342, the noise parameters of each particle are re-updated by the extended Kalman filter of equations (21) to (26), using the noise probability distribution represented by the noise parameters of each particle obtained by the resampling in step 326 (see FIG. 11). That is, new particles for the t-th frame are prepared; by the same processing as in step 322 (see FIG. 11), the parameters corresponding to the particles of the (t−1)-th frame are re-updated to parameters corresponding to the particles of the t-th frame; and the results are set as the parameters of the prepared particles. In step 344, the weights w_t^{*(j)} of the particles prepared in step 342 are calculated and normalized by the same processing as in step 324 of FIG. 11.

In step 346, the acceptance probability ν of the particle prepared in step 342 is determined by comparing the weight w_t^(j) calculated in step 324 with the weight w_t^{*(j)} calculated in step 344. In step 348, a random number u is generated by drawing a value from the uniform distribution U_[0,1] over the interval [0, 1]. In step 350, the random number u generated in step 348 is compared with the acceptance probability ν determined in step 346. If u is less than or equal to ν, control proceeds to step 352; otherwise, control proceeds to step 354. In step 352, the particle prepared in step 342 is accepted: the parameters obtained by the resampling in step 326 are replaced with the parameters of the prepared particle, and the processing ends. In step 354, the particle prepared in step 342 is rejected: the prepared particle and its parameters are discarded, and the processing ends.

〔Operation〕
The speech recognition system 100 according to the present embodiment operates as follows.

First, the operation by which the noise probability distribution estimation unit 200 shown in FIG. 5 generates the estimation parameter 205 of the noise probability distribution for the initial frame (t = 0) is described. The measurement unit 112 shown in FIG. 1 receives the noise-superimposed speech 122 from the sound source 102 and extracts the observed-signal feature X_t 124. The extracted feature X_t 124 is given to the noise probability distribution estimation unit 200, shown in FIG. 5, of the noise suppression unit 114.

Referring to FIG. 5, the frame selection unit 220 of the noise probability distribution estimation unit 200 gives the first 10 frames of the feature X_t 124 to the initial noise distribution estimation unit 222. The initial noise distribution estimation unit 222 estimates the initial noise distribution p(N_0) by the processing of equations (14) to (16) above. It then performs the sampling of equations (17) and (18) above J times from p(N_0). This sampling determines the vector N_0^(j) and covariance matrix Σ_{N_0}^(j), which are the initial noise parameters of each particle. The noise probability distribution estimation unit 200 outputs these parameters as the estimation parameter 205 of the noise probability distribution for the initial frame.

The value L is set in advance in the interval storage unit 206 shown in FIG. 4, and the value L−1 is stored in the count storage unit 218.

Next, the operation by which the sequential calculation unit 224 of the noise probability distribution estimation unit 200 generates the estimation parameter 205 of the noise probability distribution for the t-th frame (t ≥ 1) is described. Referring to FIG. 5, in response to the request 210 to start processing the next frame, the frame selection unit 220 requests the GMM sampling unit 226 to sample the output parameters of the GMM for the t-th frame and gives the observed-signal feature X_t 124 to the update unit 230.

The GMM sampling unit 226 samples the output parameter vector $S^{(j)}_{k_t^{(j)},t}$ from the GMM 130. For example, for the $j$-th particle, the GMM sampling unit 226 samples an element distribution $k_t^{(j)}$ from the mixture normal distribution 140 in the GMM 130 shown in FIG. 2, with a probability according to the mixture weights. Suppose that, as a result, the element distribution 150 is sampled as the element distribution $k_t^{(j)}$. The GMM sampling unit 226 then samples the output parameter vector $S^{(j)}_{k_t^{(j)},t}$ according to the output probability distribution represented by the element distribution $k_t^{(j)}$. The GMM sampling unit 226 samples the output parameter vector $S^{(j)}_{k_t^{(j)},t}$ for each of the $J$ particles by the above procedure and supplies the results to the update unit 230 shown in FIG. 5.
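The two-stage sampling performed by the GMM sampling unit 226 can be sketched as follows. This is a minimal illustration under the assumption of diagonal-covariance element distributions; the function name, the two-component model, and its parameter values are hypothetical and are not taken from the GMM 130.

```python
import random

def sample_gmm_output(weights, means, variances, rng):
    """Pick an element distribution by mixture weight, then draw a vector from it."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    s = [rng.gauss(m, v ** 0.5) for m, v in zip(means[k], variances[k])]
    return k, s

# Hypothetical two-component, two-dimensional clean-speech GMM.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]

rng = random.Random(1)
# One (component index, output vector) pair per particle, here with J = 4.
samples = [sample_gmm_output(weights, means, variances, rng) for _ in range(4)]
```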

FIG. 13 schematically shows an outline of the parameter update and resampling performed by the sequential calculation unit 224. In FIG. 13, a certain noise parameter is distributed along the horizontal direction, and time advances from top to bottom. Particles are shown schematically as open circles and filled circles. For example, a particle shown as an open circle has a very small weight $w_t^{(j)}$, and a particle shown as a filled circle has a large weight $w_t^{(j)}$.

With reference to FIG. 13, assume that the state space 420 is approximately represented by the particles corresponding to the $(t-1)$-th frame. The update unit 230 updates the noise parameters of each particle in the state space 420 to the noise parameters of each particle in the state space 430 corresponding to the $t$-th frame, as follows.

First, the extended Kalman filter unit 258 of the update unit 230 shown in FIG. 6 acquires the estimation parameter 205 of the estimated probability distribution for each particle of the $(t-1)$-th frame. The acquired estimation parameter 205 is supplied to the weighted average calculation unit 250, the buffer memory unit 252, and the feedback unit 256. At this point, the buffer memory unit 252 holds the estimation parameters 205 of the estimated probability distribution for at least the $T$ frames up to the $(t-1)$-th frame.

When given the estimation parameter 205 of the estimated probability distribution, the weighted average calculation unit 250 shown in FIG. 6 calculates the weighted average vector $\hat{N}_{t-1}$ shown in equation (12). When the first constraint in the state equation shown in equation (11) is introduced on the basis of this weighted average vector $\hat{N}_{t-1}$ and the mean vector of the noise is corrected, the noise parameter of the corrected noise probability distribution is closer to the weighted average vector $\hat{N}_{t-1}$ than the uncorrected mean vector $\hat{N}_{t-1}^{(j)}$ is. The scattering of the particles is therefore suppressed.

When the estimation parameter 205 of a new estimated probability distribution has been accumulated in the buffer memory unit 252, the Polyak Average calculation unit 254 uses the estimation parameters 205 for the $T$ frames accumulated in the buffer memory unit 252 to calculate, for each particle, the Polyak Average vector $\mu_{N_t}^{(j)}$ shown in equation (13). The calculated Polyak Average vector $\mu_{N_{t-1}}^{(j)}$ is supplied to the feedback unit 256. For each particle, the feedback unit 256 calculates the difference $\mu_{N_{t-1}}^{(j)} - \hat{N}_{t-1}^{(j)}$ between the Polyak Average vector $\mu_{N_{t-1}}^{(j)}$ and the mean vector $\hat{N}_{t-1}^{(j)}$. If the estimation parameters 205 for $T$ frames have not yet been accumulated in the buffer memory unit 252, the Polyak Average calculation unit 254 calculates the Polyak Average vector $\mu_{N_t}^{(j)}$ using the estimation parameters 205 of the noise probability distribution for as many frames as have been accumulated.

FIG. 14 schematically shows the concept of the Polyak Average and the feedback. FIGS. 14(A) and 14(B) each show the relationship between the Polyak Average vector $\mu_{N_t}^{(j)}$ of the $j$-th particle and the noise feature vectors $N_{t-4}^{(j)}, \ldots, N_{t+1}^{(j)}$ corresponding to that particle. FIG. 14(A) shows a case where the noise feature vector changes gently over time, and FIG. 14(B) shows a case where it changes sharply. In these figures, time advances from left to right, and the noise feature amount varies in the vertical direction. In FIGS. 14(A) and 14(B), the Polyak Average vector $\mu_{N_t}^{(j)}$ for the $t$-th frame is shown as an open circle. The Polyak Average vector $\mu_{N_t}^{(j)}$ shown in these figures is assumed to be taken over $T = 5$ frames.

With reference to FIG. 14(A), a difference $\mu_{N_t}^{(j)} - N_t^{(j)}$ arises between the noise feature amount $N_{t-1}^{(j)}$ in the $(t-1)$-th frame and the Polyak Average vector $\mu_{N_t}^{(j)}$. Similarly, when the change over time is sharp as shown in FIG. 14(B), a difference $\mu_{N_t}^{(j)} - N_t^{(j)}$ arises between the noise feature amount $N_t^{(j)}$ and the Polyak Average vector $\mu_{N_t}^{(j)}$. The variation of the noise feature vectors $N_{t-4}^{(j)}, \ldots, N_t^{(j)}$ in FIG. 14(B) is larger than the variation of the noise feature vectors $N_{t-4}^{(j)}, \ldots, N_t^{(j)}$ in FIG. 14(A). That is, the differences among the noise feature vectors $N_{t-4}^{(j)}, \ldots, N_t^{(j)}$ in FIG. 14(A) are smaller than the corresponding differences in FIG. 14(B).

The Polyak Average vector $\mu_{N_t}^{(j)}$ is the average of $N_{t-4}^{(j)}, \ldots, N_t^{(j)}$. The range that the Polyak Average vector $\mu_{N_t}^{(j)}$ can take is therefore the range from the minimum to the maximum of $N_{t-4}^{(j)}, \ldots, N_t^{(j)}$. Accordingly, as shown in FIG. 14(A), if the differences among these feature vectors are small, the range that the Polyak Average vector $\mu_{N_{t-1}}^{(j)}$ can take is correspondingly narrow, and the fluctuation range of the difference $\mu_{N_{t-1}}^{(j)} - N_{t-1}^{(j)}$ is naturally small. In contrast, as shown in FIG. 14(B), if the differences among the noise feature vectors are large, the range that the Polyak Average vector $\mu_{N_t}^{(j)}$ can take is correspondingly wide, and the fluctuation range of the difference $\mu_{N_t}^{(j)} - N_t^{(j)}$ is naturally large. That is, the difference $\mu_{N_t}^{(j)} - N_t^{(j)}$ reflects the change in the noise over the past $T$ frames. Predicting the noise feature vector $N_{t+1}^{(j)}$ of the next frame on the basis of this difference yields a feature vector that reflects the change in the noise over the past $T$ frames.
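The buffering, averaging, and difference calculation described for the buffer memory unit 252, the Polyak Average calculation unit 254, and the feedback unit 256 can be sketched for a single particle as follows. This is a minimal sketch that takes the Polyak Average to be the plain arithmetic mean of the buffered vectors, as stated above; the class name and the one-dimensional toy trajectory are hypothetical, and the forgetting and scaling factors of the state equation are not modeled.

```python
from collections import deque

class PolyakAverager:
    """Keeps up to T recent noise vectors for one particle and averages them."""

    def __init__(self, window):
        self.buffer = deque(maxlen=window)  # the oldest entry drops out automatically

    def push(self, noise_vec):
        self.buffer.append(list(noise_vec))

    def average(self):
        # Uses however many frames are stored when fewer than T are available.
        n = len(self.buffer)
        return [sum(col) / n for col in zip(*self.buffer)]

    def feedback(self, current):
        """Difference between the Polyak Average and the current noise vector."""
        return [m - c for m, c in zip(self.average(), current)]

avg = PolyakAverager(window=5)             # T = 5, as in FIG. 14
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:    # one-dimensional toy trajectory
    avg.push([value])
diff = avg.feedback([5.0])                 # average of the buffer minus the current vector
```

A rapidly changing trajectory produces a larger difference than a slowly changing one, which is the property the feedback relies on.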

The extended Kalman filter unit 258 (see FIG. 6) updates each particle with the extended Kalman filter shown in equations (21) to (26), on the basis of the weighted average vector $\hat{N}_{t-1}$, the difference vector $\mu_{N_{t-1}}^{(j)} - N_{t-1}^{(j)}$, the forgetting factor $\alpha$ and the scaling factor $\beta$ determined by the constraint parameter 138, the feature amount $X_t$ 124 of the observation signal, and the output parameter 240.

In this update, the scattering of $\hat{N}_{t-1}^{(j)}$ is suppressed in the one-step-ahead prediction parameter $N_{t|t-1}^{(j)}$ of the noise shown in equation (21). In addition, the variation of the parameters over the past $T$ frames is fed back. That is, when the past variation was large, the variation of the one-step-ahead prediction parameter $N_{t|t-1}^{(j)}$ is also large. Conversely, when the past variation was small, the variation of the one-step-ahead prediction parameter $N_{t|t-1}^{(j)}$ is also small. The constraint on the temporal transition of the parameters is therefore strengthened by the past variation of the parameters.

When each particle has been updated as described above, each particle in the state space 420 shown in FIG. 13 is updated, and the particles with the updated parameters represent the state space 430 corresponding to the $t$-th frame.

In response, the weight calculation unit 232 calculates the weight $w_t^{(j)}$ for each particle in the state space 430 by equations (22) and (23). The resampling unit 234 resamples the noise parameters of the particles on the basis of the weights $w_t^{(j)}$. In doing so, the resampling unit 234 first sets, for each particle in the state space 430, the number of times resampling is performed from that particle according to its weight $w_t^{(j)}$. The number of samplings from a particle with a very small weight, shown as an open circle, is set to 0. The number of samplings from a particle with a large weight, shown as a filled circle, is set to 1 to 3 according to the magnitude of the weight. The noise parameters are then resampled the set number of times on the basis of the noise probability distribution of each particle in the state space 430. In this way, the particles representing a new state space 440 corresponding to the $t$-th frame are formed.
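The resampling step can be sketched as follows. The passage does not say how the per-particle sampling counts are derived from the weights, so this sketch assumes multinomial resampling, one common choice in which counts are drawn in proportion to the weights and always sum to the number of particles. Each particle is represented as a hypothetical (mean, variance) pair with a diagonal Gaussian.

```python
import random
from collections import Counter

def resampling_counts(weights, rng):
    """Multinomial resampling (assumed scheme): number of draws allotted to each particle."""
    n = len(weights)
    picks = rng.choices(range(n), weights=weights, k=n)
    tally = Counter(picks)
    return [tally.get(i, 0) for i in range(n)]

def resample(particles, weights, rng):
    """Draw new noise parameters from each particle's distribution, count[i] times."""
    counts = resampling_counts(weights, rng)
    new_particles = []
    for (mean, var), c in zip(particles, counts):
        for _ in range(c):
            drawn = [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
            new_particles.append((drawn, list(var)))
    return new_particles, counts

rng = random.Random(3)
# Hypothetical one-dimensional particles; the first has a negligible weight.
particles = [([0.0], [1.0]), ([10.0], [1.0]), ([20.0], [1.0]), ([30.0], [1.0])]
weights = [0.0, 0.5, 0.3, 0.2]
new_particles, counts = resample(particles, weights, rng)
```

Particles with zero weight receive no draws, which corresponds to the open circles in FIG. 13.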

When such resampling by the resampling unit 234 is performed repeatedly, the noise parameters of many of the particles corresponding to a given frame may end up having been sampled from the probability distributions of the noise parameters of only a small number of particles corresponding to an earlier frame. The estimation parameter generation unit 236 prevents this by using the Metropolis-Hastings algorithm to newly generate parameters for the particles corresponding to the $t$-th frame. The re-update unit 262 shown in FIG. 7 re-updates the noise parameters of the particles in the state space 420 corresponding to the $(t-1)$-th frame according to the noise probability distribution in the state space 440. The weight recalculation unit 264 calculates the weight $w_t^{*(j)}$ for each re-updated particle. The acceptance probability calculation unit 266 calculates the acceptance probability $\nu$ on the basis of the weight $w_t^{*(j)}$ for the re-updated particle and the weight $w_t^{(j)}$ for the resampled particle. The parameter selection unit 270 compares the acceptance probability $\nu$ with a random number $u$ in the interval $[0, 1]$ generated by the random number generation unit 268. If the random number $u$ is less than or equal to the acceptance probability $\nu$, the parameter selection unit 270 replaces the parameters of the resampled particle with the parameters of the re-updated particle. Otherwise, it rejects the parameters of the re-updated particle.
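The accept-or-reject decision of the parameter selection unit 270 can be sketched as follows. The formula for the acceptance probability is not given in this passage, so the sketch assumes the usual Metropolis-Hastings form $\nu = \min(1, w_t^{*(j)} / w_t^{(j)})$. The function names and the string placeholders standing in for particle parameters are hypothetical.

```python
import random

def acceptance_probability(w_updated, w_resampled):
    """Assumed form: nu = min(1, w* / w)."""
    if w_resampled <= 0.0:
        return 1.0
    return min(1.0, w_updated / w_resampled)

def select_parameters(resampled, reupdated, w_resampled, w_updated, rng):
    """Adopt the re-updated parameters when u <= nu, otherwise keep the resampled ones."""
    nu = acceptance_probability(w_updated, w_resampled)
    u = rng.random()  # Python draws from [0, 1); the text specifies [0, 1]
    if u <= nu:
        return reupdated, w_updated
    return resampled, w_resampled

rng = random.Random(4)
# The re-updated particle has the larger weight here, so nu is 1 and it is adopted.
chosen, weight = select_parameters("resampled", "re-updated", 0.1, 0.2, rng)
```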

By repeating the above operation for each frame, the mean vector $\hat{N}_t^{(j)}$ and the covariance matrix $\Sigma_{N_t}^{(j)}$, which are the noise parameters of each particle, are estimated for each frame. The mean vector $\hat{N}_t^{(j)}$ and covariance matrix $\Sigma_{N_t}^{(j)}$ of each particle, together with the weight $w_t^{(j)}$ for each particle, constitute the estimation parameter 205 of the noise probability distribution. The noise probability distribution estimation unit 200 supplies the estimation parameter 205 of the noise probability distribution and the feature vector $X_t$ 124 of the observation signal, frame by frame, to the calculation control unit 212 and the observation signal distribution estimation unit 202 shown in FIG. 4.

With reference to FIG. 4, the calculation control unit 212 adds 1 to the value $I_L$ stored in the count storage unit 218 and determines whether the value $I_L$ is equal to the frame interval value $L$ stored in the interval storage unit 206.

When the value $I_L$ is equal to the frame interval value $L$, the calculation control unit 212 supplies an estimation instruction signal 209 to the observation signal distribution estimation unit 202 and supplies an observation signal distribution storage control signal 207 to the distribution storage unit 214. In the first round of processing, the value $I_L$ is always equal to the value $L$. For the first frame, therefore, the parameter 208 of the observation signal distribution is estimated by the observation signal distribution estimation unit 202 and stored in the distribution storage unit 214.

When given the estimation instruction signal 209 by the calculation control unit 212, the observation signal distribution estimation unit 202 shown in FIG. 4 generates, as the parameter 208 of the observation signal distribution, the mean vector and covariance matrix of the observation signal distribution for each particle corresponding to the $t$-th frame by the VTS method, on the basis of the estimation parameter 205 of the noise probability distribution and the GMM 130. The probability distribution of the noise and the probability distribution of the observation signal have thereby been estimated for each particle. These values are supplied to the selection unit 216 and the distribution storage unit 214.
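As a rough illustration of what a VTS-based adaptation computes, the following sketch evaluates only the zeroth-order mean of the observation signal distribution. It rests on two assumptions that this passage does not state: that the features are in a log-spectral domain, and that speech and noise combine through the standard additive mismatch function $x = s + \log(1 + e^{n - s})$. The covariance update, which requires the Jacobian of this function, is omitted, and the function name is hypothetical.

```python
import math

def vts_observed_mean(clean_mean, noise_mean):
    """Zeroth-order VTS mean per dimension: mu_x = mu_s + log(1 + exp(mu_n - mu_s))."""
    return [s + math.log1p(math.exp(n - s)) for s, n in zip(clean_mean, noise_mean)]

# Hypothetical element-distribution mean and particle noise mean (log-spectral values).
mu_x = vts_observed_mean([10.0, 4.0], [2.0, 4.0])
```

When the noise is far below the speech level the adapted mean stays close to the clean mean, and when the two are equal it rises by $\log 2$.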

The distribution storage unit 214 stores the parameter 208 of the observation signal distribution in response to the observation signal distribution storage control signal 207.

The calculation control unit 212 sets the selection instruction signal 211 to a first value. Because the selection instruction signal 211 has the first value, the selection unit 216 selects the parameter 208 of the observation signal distribution output by the observation signal distribution estimation unit 202 and supplies it to the clean speech estimation unit 204 as the parameter 217 of the observation signal distribution.

The clean speech estimation unit 204 calculates, by the MMSE estimation method, the MMSE estimate vector $\hat{S}_t^{(j)}$ of the clean speech 120 for each particle corresponding to the $t$-th frame. It further calculates the feature vector $\hat{S}_t$ 126 of the estimated clean speech for the $t$-th frame using the MMSE estimate vectors $\hat{S}_t^{(j)}$ and the weights $w_t^{(j)}$, and outputs it to the search unit 110 shown in FIG. 1.
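The final combination of per-particle estimates can be sketched as follows. The passage says only that the MMSE estimate vectors and the weights are used; this sketch assumes the combination is a weighted sum with the weights normalized to sum to 1. The function name and the numbers are hypothetical.

```python
def combine_particle_estimates(estimates, weights):
    """Weighted sum of per-particle MMSE estimates, with weights normalized to sum to 1."""
    total = sum(weights)
    dim = len(estimates[0])
    return [
        sum(w * est[d] for est, w in zip(estimates, weights)) / total
        for d in range(dim)
    ]

# Two particles with two-dimensional estimates; the second carries most of the weight.
s_hat = combine_particle_estimates([[1.0, 2.0], [3.0, 6.0]], [0.25, 0.75])
```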

When the value $I_L$ is not equal to the frame interval value $L$, the calculation control unit 212 does not supply the estimation instruction signal 209 to the observation signal distribution estimation unit 202. The observation signal distribution estimation unit 202 therefore does not estimate the observation signal distribution. In this case the calculation control unit 212 also does not supply the observation signal distribution storage control signal 207 to the distribution storage unit 214. As a result, the distribution storage unit 214 continues to hold the observation signal distribution most recently estimated by the observation signal distribution estimation unit 202. The calculation control unit 212 further sets the selection instruction signal 211 to a second value. Because the selection instruction signal 211 has the second value, the selection unit 216 selects the parameter 213 of the most recently estimated observation signal distribution, which is the output of the distribution storage unit 214, and supplies it to the clean speech estimation unit 204 as the parameter 217 of the observation signal distribution.

The clean speech estimation unit 204 therefore calculates, by the MMSE estimation method, the MMSE estimate vector $\hat{S}_t^{(j)}$ of the clean speech 120 for each particle corresponding to the $t$-th frame, using the parameter 208 of the observation signal distribution calculated for the $t_L$-th frame ($t_L < t$). It further calculates the feature vector $\hat{S}_t$ 126 of the estimated clean speech for the $t$-th frame using the MMSE estimate vectors $\hat{S}_t^{(j)}$ and the weights $w_t^{(j)}$, and outputs it to the search unit 110 shown in FIG. 1.
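The interplay of the calculation control unit 212, the count storage unit 218, and the distribution storage unit 214 can be sketched as a small controller. The counter starts at $L-1$ and is incremented before the comparison, as described above. Resetting the counter to 0 after each adaptation is an assumption, since the reset is not described in this passage. The class name and the placeholder adaptation function are hypothetical.

```python
class AdaptationController:
    """Runs the costly adaptation once every L frames and reuses the stored result otherwise."""

    def __init__(self, interval, adapt_fn):
        self.interval = interval       # L, held in the interval storage unit
        self.counter = interval - 1    # I_L, initialized to L - 1
        self.adapt_fn = adapt_fn       # stands in for the VTS-based adaptation
        self.cached = None             # stands in for the distribution storage unit

    def step(self, frame_index, noise_params):
        self.counter += 1
        if self.counter == self.interval:
            self.cached = self.adapt_fn(frame_index, noise_params)
            self.counter = 0           # assumed reset; not stated in the text
            return self.cached, True   # freshly adapted parameters
        return self.cached, False      # parameters from the last adapted frame

ctrl = AdaptationController(interval=4, adapt_fn=lambda t, n: ("adapted_at", t))
schedule = [ctrl.step(t, None) for t in range(9)]
adapted_frames = [t for t, (_, fresh) in enumerate(schedule) if fresh]
```

With $L = 4$, the frames between two adaptations all reuse the parameters computed at the most recent adapted frame.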

The search unit 110 shown in FIG. 1 uses the feature vector $\hat{S}_t$ 126 of the estimated clean speech to search for matching words and the like of the target language on the basis of the acoustic model held in the recognition acoustic model unit 109 and the language model held in the language model unit 108, and outputs the result as the recognition output 128.

As described above, with the speech recognition system 100 according to the present embodiment, changing the value of $L$ reduces the amount of computation for estimating the observation signal distribution to $1/L$. Estimation of the observation signal distribution accounts for about half of the total computation of noise removal by the particle filter. If the computation for estimating the observation signal distribution can be reduced as in the above embodiment, the total computation of noise removal can therefore be reduced substantially, and noise removal can be made faster as a result. Alternatively, noise removal of equivalent performance can be achieved with fewer computational resources.

[Speech recognition experiment]
A continuous speech recognition experiment was conducted to evaluate the effectiveness of the noise removal described in the above embodiment. The evaluation speech consisted of 510 utterances extracted as testset-01 from an utterance corpus called BTEC prepared by the applicant. These utterances are clean speech; noise recorded in a station concourse and noise recorded near a station ticket gate were each superimposed on them under conditions of 25, 15, and 10 dB. The acoustic model used in the evaluation experiment was trained on TRA, TRA-BLA, and APP-BLA (about 170,000 utterances), which were created by the applicant.

The experimental results are shown in Table 1 below. The computation figures shown in Table 1 were obtained on a computer with a commercially available CPU (central processing unit) of ordinary performance operating with a 1.2 GHz clock signal. In Table 1, "0.64×RT", for example, means that processing 10 seconds of noise-superimposed speech takes 6.4 seconds.
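The real-time (RT) notation amounts to a simple multiplication, shown here as a short sketch. The function name is hypothetical, and only the 0.64×RT example quoted above is used.

```python
def processing_time(real_time_factor, audio_seconds):
    """Seconds of computation needed for the given duration of input speech."""
    return real_time_factor * audio_seconds

seconds_needed = processing_time(0.64, 10.0)  # about 6.4 seconds for 10 seconds of speech
```

A factor below 1 means the processing keeps up with the incoming speech.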

As is clear from the results in Table 1, thinning out the estimation of the observation signal distribution reduces the computation time by about 28% at an estimation interval of $L = 16$ compared with the conventional estimation interval of $L = 1$. Moreover, although the computation time is greatly reduced, the speech recognition performance expressed as word accuracy is almost the same.

These experimental results show that, according to the present embodiment, good speech recognition performance can be maintained while the amount of computation is reduced in an environment where non-stationary noise is present.

[Modification]
In the above embodiment, a Gaussian mixture distribution prepared in advance by statistical processing (training) on sample speech is used as the clean acoustic model. A Gaussian mixture distribution is a multidimensional distribution that contains a plurality of element distributions for each dimension. Through prior training, the mean and variance are calculated for each element distribution. In many cases, therefore, the mean and variance differ from one element distribution to another, which allows even complex distributions to be modeled statistically. In this case, two element distributions may have the same mean, or may have the same variance, but it is normally not expected that both coincide.

However, the element distributions obtained by training need not be used as they are in order to realize the noise suppression according to the present invention. For example, even if only the means of the element distributions are used and the variance is assumed to be equal across all element distributions, noise suppression can be performed with exactly the same mechanism as in the above embodiment. In that case, only the mean of each element distribution needs to be stored as the acoustic model.

Furthermore, in calculating the means of the element distributions, the feature amounts may be calculated as continuous values, or discrete feature amounts may be defined in advance and each calculated feature amount quantized by replacing it with the nearest discrete feature amount.
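The quantization option can be sketched as a nearest-neighbor lookup. The passage does not specify how "nearest" is measured, so squared Euclidean distance is assumed here; the function name and the three-entry set of discrete feature amounts are hypothetical.

```python
def quantize(feature, codebook):
    """Return the predefined discrete vector closest to `feature` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda c: sq_dist(feature, c))

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # hypothetical discrete feature amounts
q = quantize([0.9, 1.2], codebook)               # closest entry is [1.0, 1.0]
```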

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a schematic diagram showing the configuration of the speech recognition system 100 according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing the concept of the GMM 130.
FIG. 3 is a schematic diagram showing the concept of the state space model 160 of the observation signal.
FIG. 4 is a block diagram showing the configuration of the noise suppression unit 114.
FIG. 5 is a block diagram showing the configuration of the noise probability distribution estimation unit 200.
FIG. 6 is a block diagram showing the configuration of the update unit 230.
FIG. 7 is a block diagram showing the configuration of the estimation parameter generation unit 236.
FIG. 8 is a flowchart showing the control structure of the noise suppression processing.
FIG. 9 is a schematic diagram showing the relationship between the parameters of the observation signal distribution used in MMSE estimation when L = 4.
FIG. 10 is a schematic diagram showing the timing of parameter estimation of the observation signal distribution when L = 1, 2, 4, 8.
FIG. 11 is a flowchart showing the control structure of the processing for generating the estimation parameter 205 of the noise probability distribution.
FIG. 12 is a flowchart showing the control structure of the sampling processing by the Metropolis-Hastings algorithm.
FIG. 13 is a diagram showing an outline of the processing by the particle filter.
FIG. 14 is a schematic diagram showing the concept of the Polyak Average and the feedback.

Explanation of symbols

100 speech recognition system
102 sound source
104 preprocessing unit
106 preprocessing acoustic model unit
108 language model unit
109 recognition acoustic model unit
110 search unit
112 measurement unit
114 noise suppression unit
116 speaker
118 noise source
120 clean speech
121 noise
122 noise-superimposed speech
124 feature amount of the observation signal
126 feature amount of the estimated clean speech
130 GMM
132 training data storage unit
134 model training unit
136 GMM storage unit
138 constraint parameter
160 state space model
200 noise probability distribution estimation unit
202 observation signal distribution estimation unit
204 clean speech estimation unit
206 interval storage unit
208 parameter of the observation signal distribution
214 distribution storage unit
216 selection unit
218 count storage unit
220 frame selection unit
222 noise initial distribution estimation unit
224 sequential calculation unit
226 GMM sampling unit
230 update unit
232 weight calculation unit
234 resampling unit
236 estimation parameter generation unit
240 output parameter
250 weighted average calculation unit
252 buffer memory unit
254 Polyak Average calculation unit
256 feedback unit
258 extended Kalman filter unit
262 re-update unit
264 weight recalculation unit
266 acceptance probability calculation unit
268 random number generation unit
270 parameter selection unit

Claims

A noise suppression device for suppressing a noise component in an observation signal obtained by observing a target speech in an environment where noise is generated, the device comprising:
noise estimation means for sequentially generating, for each frame, estimation parameters of a probability distribution representing the noise, using a particle filter having a plurality of particles, each of which receives a feature amount extracted from a frame of a predetermined time length obtained by framing the observation signal at predetermined intervals, based on an acoustic model for clean speech estimation made up of a plurality of element distributions prepared in advance;
adaptation means for adapting the acoustic model to the noise according to the noise probability distribution whose parameters are estimated by the noise estimation means;
target speech estimation means for calculating, for each frame, an estimated feature amount of the target speech by the MMSE estimation method using the acoustic model and the feature amount of the observation signal; and
control means for controlling the interval of adaptation by the adaptation means so that the adaptation of the acoustic model is performed once every plurality of frames,
wherein, when the adaptation means performs adaptation for a given frame, the target speech estimation means calculates the estimated feature amount of the target speech for that frame using the acoustic model adapted for that frame, and, when adaptation by the adaptation means is not performed for a given frame, the target speech estimation means calculates the estimated feature amount of the target speech for that frame using the acoustic model that the adaptation means adapted for a preceding frame.
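
The frame-by-frame behavior recited above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the names `run_frames`, `estimate_noise`, `adapt`, and `mmse` are hypothetical, the toy stand-in functions are placeholders rather than the particle filter, GMM adaptation, or MMSE computation of the embodiment, and adapting on frame indices that are multiples of the interval is one possible schedule, not one stated in the claim.

```python
def run_frames(features, interval, estimate_noise, adapt, mmse):
    """Estimate the target speech per frame, adapting the model every `interval` frames."""
    adapted_model = None
    estimates = []
    for t, y in enumerate(features):
        noise = estimate_noise(y)         # noise parameters are estimated for every frame
        if t % interval == 0:             # assumed schedule: adapt on frames 0, L, 2L, ...
            adapted_model = adapt(noise)  # model adapted for this frame
        # Otherwise the model adapted for a preceding frame is reused as-is.
        estimates.append(mmse(adapted_model, y))
    return estimates


# Toy stand-ins (placeholders only, not the computations of the embodiment).
def toy_estimate_noise(y):
    return 0.5 * y


def toy_adapt(noise):
    return {"noise_mean": noise}


def toy_mmse(model, y):
    return y - model["noise_mean"]


demo = run_frames([2.0, 4.0, 6.0, 8.0, 10.0], 4,
                  toy_estimate_noise, toy_adapt, toy_mmse)
```

With an interval of 4, the toy model built at frame 0 is reused for frames 1 to 3 and rebuilt at frame 4, so the costly adaptation step runs once per four frames while an estimate is still produced for every frame.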

The noise suppression device according to claim 1, wherein the acoustic model is a Gaussian mixture distribution including the plurality of element distributions.

The noise suppression device according to claim 2, wherein variances of any two of the plurality of element distributions are different from each other.

The noise suppression device according to claim 2, wherein the plurality of element distributions have equal variances.

The noise suppression device according to claim 1, further comprising storage means for storing the adapted acoustic model in response to the adaptation means adapting the acoustic model for clean speech estimation for a frame,
wherein, when the acoustic model for clean speech estimation is adapted for a given frame, the target speech estimation means calculates the estimated feature amount of the target speech by the MMSE estimation method using that adapted acoustic model, and, when the acoustic model for clean speech estimation is not adapted for a given frame, the target speech estimation means calculates the estimated feature amount of the target speech by the MMSE estimation method using the adapted acoustic model stored in the storage means.

The noise suppression device according to any one of claims 1 to 5, wherein the control means includes:
interval storage means for storing in advance information that determines the interval between frames at which the acoustic model for clean speech estimation is adapted to noise;
frame count storage means for storing the number of frames processed since the adaptation means last performed adaptation;
adding means for adding 1 to the content of the frame count storage means each time a frame to be processed is given to the noise suppression device;
determination means for determining whether the content of the frame count storage means is equal to the content of the interval storage means;
means for enabling or disabling adaptation for the frame by the adaptation means according to the determination result of the determination means; and
means for clearing the frame count storage means to zero in response to the determination means determining that the content of the frame count storage means is equal to the content of the interval storage means.
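
The counter-based control recited in this claim can be sketched in Python as follows. This is a hedged illustration, not the implementation of the embodiment: the class name `AdaptationController`, the method `on_frame`, and the helper `adaptation_schedule` are hypothetical, and starting the counter one short of the interval (so that the very first frame is adapted) is an assumption that the claim does not specify.

```python
class AdaptationController:
    """Hypothetical controller mirroring the means listed in the claim."""

    def __init__(self, interval, initial_count=None):
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.interval = interval  # interval storage
        # Assumption: start one short of the interval so the first frame is adapted.
        self.count = interval - 1 if initial_count is None else initial_count

    def on_frame(self):
        """Return True if adaptation is enabled for the frame just received."""
        self.count += 1                      # add 1 for each incoming frame
        adapt = self.count == self.interval  # compare the count with the interval
        if adapt:
            self.count = 0                   # clear the count when they are equal
        return adapt


def adaptation_schedule(interval, num_frames):
    """List, per frame, whether adaptation would be enabled."""
    controller = AdaptationController(interval)
    return [controller.on_frame() for _ in range(num_frames)]


schedule = adaptation_schedule(4, 8)
```

Under these assumptions an interval of 4 enables adaptation on the first and fifth of eight frames, and an interval of 1 enables it on every frame.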

A computer program that, when executed by a computer, causes the computer to operate as the noise suppression device according to any one of claims 1 to 6.

A speech recognition system comprising:
the noise suppression device according to any one of claims 1 to 6; and
speech recognition means for receiving the estimated feature amount of the target speech calculated by the noise suppression device and performing speech recognition on the target speech using the acoustic model and a predetermined language model for the language to be recognized.