JP4586577B2

JP4586577B2 - Disturbance component suppression device, computer program, and speech recognition system

Info

Publication number: JP4586577B2
Application number: JP2005057993A
Authority: JP
Inventors: 雅清藤本; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-03-02
Filing date: 2005-03-02
Publication date: 2010-11-24
Anticipated expiration: 2025-03-02
Also published as: JP2006243290A

Abstract

PROBLEM TO BE SOLVED: To provide a disturbance component suppression device that improves speech recognition performance in real environments affected by disturbances such as additive noise and reverberation, and that can suppress the disturbance components in a short time.

SOLUTION: A disturbance component suppression section 114 includes: a disturbance probability distribution estimation portion 200 that receives feature quantities 124, each extracted from a frame of a designated time length taken at designated periods from an observation signal obtained by observing a target speech in an environment where additive noise and multiplicative distortion occur, and that sequentially generates, frame by frame, parameters 206 representing the disturbance by using a particle filter having a plurality of particles; and a parameter generation portion 202 and a clean speech estimation portion 204 that calculate estimated feature quantities 126 of the target speech frame by frame by using the feature quantities 124 of the observation signal, the estimated parameters 206 of the disturbance, and a GMM 130.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to speech recognition technology for real environments in which disturbances that affect speech occur, and in particular to a disturbance component suppression device, and a speech recognition system using the same, for improving speech recognition accuracy in environments where non-stationary additive noise and multiplicative distortion such as reverberation occur.

Speech recognition technology has been studied as a means of realizing a human-machine interface that is easy and natural for humans. In recent years, large-scale speech and text databases combined with statistical, probabilistic speech recognition methods have made speech recognition with high recognition rates possible. Today, applied technologies are being developed to achieve fast, accurate speech recognition in the real environments where people interact with machines.

One major difference between real environments and environments such as laboratories is the presence of noise. Noise occurs incessantly and irregularly at a volume that cannot be ignored. In addition, in real environments speech undergoes multiplicative distortion that depends on the spatial transfer characteristics of speech in that environment or that is caused by reverberation and the like. Such disturbances hinder speech recognition. Improving speech recognition performance in environments where these disturbances occur is a problem that must be solved urgently in developing applied speech recognition technology.

One technique for improving speech recognition performance in noisy environments is to estimate and suppress noise at the preprocessing stage of speech recognition. Non-Patent Document 1, listed below, discloses the spectral subtraction method, a common method of noise suppression. This method assumes that the amplitude spectrum of the noise observed in the interval before an utterance is the same as the amplitude spectrum of the noise during the utterance, and suppresses noise by subtracting the noise amplitude spectrum observed immediately before the utterance from the amplitude spectrum of the speech signal obtained from the utterance.
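The spectral subtraction idea described above can be illustrated with a minimal Python sketch. The function names, the three-bin toy spectra, and the clamping of negative results to a floor value are assumptions made for illustration; they are not taken from Non-Patent Document 1.

```python
def estimate_noise_spectrum(noise_frames):
    """Average the amplitude spectra of frames observed before the utterance."""
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / len(noise_frames)
            for k in range(n_bins)]


def spectral_subtraction(speech_frames, noise_spectrum, floor=0.0):
    """Subtract one fixed noise amplitude spectrum from every speech frame.

    Clamping at `floor` is an assumption added here so that no amplitude
    becomes negative after the subtraction.
    """
    return [[max(amp - noise, floor) for amp, noise in zip(frame, noise_spectrum)]
            for frame in speech_frames]


# Hypothetical amplitude spectra with three frequency bins per frame.
noise_frames = [[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]]   # interval before the utterance
speech_frames = [[5.0, 2.5, 0.5], [4.0, 6.0, 3.0]]  # interval during the utterance

noise_spectrum = estimate_noise_spectrum(noise_frames)
cleaned = spectral_subtraction(speech_frames, noise_spectrum)
```

Because the same `noise_spectrum` is subtracted from every frame, the sketch has no way to follow noise that changes during the utterance.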

There are also techniques that sequentially estimate and suppress noise at the preprocessing stage of speech recognition. Non-Patent Document 2 discloses a method that applies a sequential EM (Expectation Maximization) algorithm to obtain maximum likelihood estimates of the noise sequentially. Sequential noise estimation with the sequential EM algorithm can estimate and suppress noise with high accuracy while coping with temporal variation of the noise.

The method disclosed in Non-Patent Documents 3 and 4, which sequentially obtains noise estimates using a Kalman filter, is also commonly used. This method sequentially estimates and suppresses noise by alternating one-step-ahead prediction and filtering.
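As an illustration of alternating one-step-ahead prediction and filtering, the following Python sketch runs a scalar Kalman filter. The random-walk state model, the variance values, and the observation sequence are assumptions chosen for the example and do not come from Non-Patent Documents 3 and 4.

```python
def kalman_step(mean, var, observation, process_var, obs_var):
    """One cycle of one-step-ahead prediction followed by filtering (scalar case)."""
    # Prediction: under the assumed random-walk model the mean is carried
    # over and the uncertainty grows by the process variance.
    pred_mean = mean
    pred_var = var + process_var
    # Filtering: correct the prediction with the newly observed value.
    gain = pred_var / (pred_var + obs_var)
    new_mean = pred_mean + gain * (observation - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


# Hypothetical noise-level observations, one per frame.
mean, var = 0.0, 1.0
for z in [1.0, 1.2, 0.9]:
    mean, var = kalman_step(mean, var, z, process_var=0.1, obs_var=0.5)
```

Each frame costs one prediction and one correction, with no iteration to convergence. The estimate is always a single mean and variance, that is, a single normal distribution.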

Another approach to improving speech recognition performance in noisy environments is to perform speech recognition adaptively using a probabilistic model that takes noise into account. For example, Patent Document 1, listed below, discloses a speech recognition system that uses a sequential estimation method called a particle filter to estimate noise parameters and to grow, over time, the hidden states constituting a hidden Markov model (HMM), and that performs speech recognition based on that hidden Markov model.

One technique for improving speech recognition performance in environments where multiplicative distortion occurs is to remove the distortion using cepstral mean subtraction (Cepstrum Mean Subtraction: CMS). This method can remove multiplicative distortion whose transfer characteristic has an impulse response shorter than the analysis window, such as distortion due to the characteristics of the recording microphone.
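A minimal Python sketch of cepstral mean subtraction follows. It relies on the fact that a time-invariant multiplicative distortion becomes a constant additive offset in the cepstral domain. The two-coefficient toy cepstra and the `channel` offset are hypothetical values chosen for illustration.

```python
def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean taken over all frames of an utterance."""
    n_frames = len(cepstra)
    n_coeffs = len(cepstra[0])
    mean = [sum(frame[k] for frame in cepstra) / n_frames for k in range(n_coeffs)]
    return [[c - m for c, m in zip(frame, mean)] for frame in cepstra]


# Hypothetical cepstra (three frames, two coefficients each).
clean = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
# A fixed channel adds the same offset to every frame in the cepstral domain.
channel = [0.5, -1.5]
distorted = [[c + h for c, h in zip(frame, channel)] for frame in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_distorted = cepstral_mean_subtraction(distorted)
```

After normalization the distorted and undistorted utterances coincide, which is why the method works for short, fixed transfer characteristics. It does not address distortion that spills over from one analysis frame into the next.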

Non-Patent Document 5 discloses a technique that performs speech recognition under reverberation by regarding reflected sound as additive noise. In this technique, speech observed under reverberation (hereinafter "reverberant speech") is expressed by first-order linear prediction. Let vectors $S_t^{\mathrm{Lin}}$ and $X_{S,t}^{\mathrm{Lin}}$ be the vectors whose elements are the linear mel spectra of the target speech and of the reverberant speech at time $t$, respectively, and let matrix $H^{\mathrm{Lin}}$ be the matrix whose diagonal components are the speech transfer characteristics in each mel frequency band, that is, the linear mel spectrum of the multiplicative distortion. Further, let matrix $A^{\mathrm{Lin}}$ be the matrix whose diagonal components are the linear prediction coefficients of the reverberation in each mel frequency band. In this technique, the reverberant speech vector $X_{S,t}^{\mathrm{Lin}}$ is expressed by the following recursion.

$$\mathbf{X}_{S,t}^{\mathrm{Lin}} = \mathbf{H}^{\mathrm{Lin}} \mathbf{S}_t^{\mathrm{Lin}} + \mathbf{A}^{\mathrm{Lin}} \mathbf{X}_{S,t-1}^{\mathrm{Lin}}$$

This technique also regards the elements of matrix $H^{\mathrm{Lin}}$, that is, the linear mel spectrum of the multiplicative distortion, and the elements of matrix $A^{\mathrm{Lin}}$, that is, the linear prediction coefficients of the reverberation, as time-invariant parameters, and estimates them with the EM algorithm. Because the recursion can also represent distortion whose impulse response is longer than the analysis window, the influence of reflected sound and the like can be modeled.
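The first-order recursion above can be traced numerically with a short Python sketch. Because both matrices are diagonal, they are stored as lists of their diagonal entries. The zero initial state, the two-band spectra, and the coefficient values are assumptions made for illustration.

```python
def reverberant_spectra(clean_frames, h_diag, a_diag):
    """Apply X_t = H S_t + A X_{t-1} frame by frame for diagonal H and A.

    The recursion is started from X = 0 before the first frame (an assumption).
    """
    prev = [0.0] * len(h_diag)
    out = []
    for s in clean_frames:
        cur = [h * s_k + a * x_k for h, s_k, a, x_k in zip(h_diag, s, a_diag, prev)]
        out.append(cur)
        prev = cur
    return out


# A single burst of speech energy followed by silence (hypothetical values).
clean_frames = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
reverberant = reverberant_spectra(clean_frames, h_diag=[2.0, 1.0], a_diag=[0.5, 0.25])
```

The burst in the first frame keeps contributing to later frames with geometrically decaying strength. This is how the recursion represents a response longer than one analysis window.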

Patent Document 1: JP 2002-251198 A
Non-Patent Document 1: S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. ASSP, Vol. 27, No. 2, pp. 113-120, 1979
Non-Patent Document 2: M. Afify, O. Siohan, "Sequential Estimation with Optimal Forgetting for Robust Speech Recognition," IEEE Trans. SAP, Vol. 12, No. 1, pp. 19-26, 2004
Non-Patent Document 3: 有本卓 (S. Arimoto), "Kalman Filter" (「カルマンフィルター」), Sangyo Tosho
Non-Patent Document 4: 西山清 (K. Nishiyama), supervised by 中野道雄 (M. Nakano), "Kalman Filter Solved on a Personal Computer" (「パソコンで解くカルマンフィルタ」), Maruzen
Non-Patent Document 5: T. Takiguchi et al., "Reverberant speech recognition using first-order linear prediction," ICASSP'04, pp. 869-872, May 2004

In real environments, most noise is non-stationary; that is, the acoustic characteristics of the noise vary over time. Techniques that estimate and suppress noise on the premise that the noise is stationary, such as the spectral subtraction method described in Non-Patent Document 1, cannot follow the temporal variation of the noise and therefore cannot suppress it with high accuracy.

The method using the sequential EM algorithm described in Non-Patent Document 2 iterates until the value converges to a local optimum of the likelihood function. An enormous amount of computation is therefore required every time the noise changes, and the computation takes time. It is thus difficult to estimate and suppress noise in real time with this method.

The estimation method using the Kalman filter disclosed in Non-Patent Documents 3 and 4 performs sequential estimation by alternating one-step-ahead prediction and filtering, and so does not require iterative computation like the sequential EM algorithm. However, the Kalman filter approach estimates the probability distribution on the assumption that the posterior distribution of the noise is a single normal distribution. If the true distribution is a mixture distribution, it is approximated by a single normal distribution, and accuracy degrades.

In techniques that perform speech recognition using a model that takes noise into account, such as the speech recognition system described in Patent Document 1, the probabilistic model is matched against speech on which noise is superimposed. Consequently, preprocessing that should be based on noise-free speech, such as acoustic model adaptation, cannot be performed.

In the technique described in Non-Patent Document 5, the influence of reflected sound is modeled by the above recursion. In general, however, reverberation is a phenomenon that arises when sound is observed or recorded at a point distant from the sound source. In an environment where the sound source and the observation point are far apart, not only the reflected sound but also the noise generated in the surroundings of the sound source and the observation point can no longer be ignored. The technique of Non-Patent Document 5 does not take this into account. That technique also regards the elements of matrix $H^{\mathrm{Lin}}$, that is, the linear mel spectrum of the multiplicative distortion, and the elements of matrix $A^{\mathrm{Lin}}$, that is, the linear prediction coefficients of the reverberation, as time-invariant parameters. In real environments, however, the sound source and the objects around it that reflect sound may move. In such environments, both the multiplicative distortion parameters and the linear prediction coefficients of the reverberation vary over time. The technique of Non-Patent Document 5 therefore cannot follow the temporal variation of the reverberation and cannot deal with the influence of disturbances with high accuracy.

It is therefore an object of the present invention to provide a disturbance component suppression device that improves speech recognition performance in environments where non-stationary noise and multiplicative distortion such as reverberation occur, and that can suppress disturbance components in a short time.

A disturbance component suppression device according to a first aspect of the present invention suppresses the disturbance components of an observation signal obtained by observing a target speech in an environment where additive noise and multiplicative distortion occur. The device includes: disturbance parameter estimation means for receiving the feature quantities extracted from respective frames of a predetermined time length taken from the observation signal at a predetermined period, and for sequentially generating, for each frame, estimated parameters of a probability distribution representing the disturbance by using a particle filter having a plurality of particles; and target speech estimation means for calculating an estimated feature quantity of the target speech for each frame by using the feature quantities of the observation signal, the estimated parameters, and a predetermined acoustic model of the target speech.

Preferably, the disturbance parameter estimation means includes: initial parameter setting means for setting an initial distribution of the disturbance and for setting, with probabilities according to that initial distribution, the initial parameters of the probability distribution representing the disturbance in each of the plurality of particles; update means for updating, using an extended Kalman filter and based on the acoustic model and the feature quantities of the observation signal, the estimated parameters of a preceding first frame in each particle to those corresponding to a second frame that follows the first frame; and weight calculation means for calculating the weight of each of the plurality of particles in the second frame.

More preferably, the initial parameter setting means includes: means for estimating an initial distribution of the additive noise based on the feature quantities of the observation signal and for sampling, with probabilities according to that initial distribution, the initial parameters of the probability distribution of the additive noise in each of the plurality of particles; and means for setting the initial parameters of the probability distribution of the multiplicative distortion in each of the plurality of particles to predetermined values.

More preferably, the disturbance parameter estimation means further includes: re-update means for re-updating, in each of the plurality of particles and based on the parameters resampled by the resampling means, the estimated parameters corresponding to the first frame to those corresponding to the second frame; and selection means for selecting, in each of the plurality of particles and according to a predetermined criterion, either the estimated parameters re-updated by the re-update means or the estimated parameters resampled by the resampling means as the estimated parameters of the second frame.
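The particle-level bookkeeping named in the preceding paragraphs (sampling initial parameters from an initial distribution, computing a weight per particle, and resampling) can be sketched in Python as follows. This is a toy, scalar illustration under stated assumptions: the disturbance parameter is a single number, the weight is a Gaussian likelihood of a directly observed value, resampling is plain multinomial resampling, and the per-particle extended Kalman filter update, the re-update, and the selection step are omitted. None of these specifics is taken from the specification.

```python
import math
import random


def init_particles(n_particles, init_mean, init_std, rng):
    """Sample one initial disturbance parameter per particle from the initial distribution."""
    return [rng.gauss(init_mean, init_std) for _ in range(n_particles)]


def normalized_weights(particles, observation, obs_std):
    """Weight each particle by an assumed Gaussian likelihood, then normalize to sum to 1."""
    likelihoods = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in particles]
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


def resample(particles, weights, rng):
    """Multinomial resampling: draw particles with probability equal to their weight."""
    return rng.choices(particles, weights=weights, k=len(particles))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
particles = init_particles(100, init_mean=0.0, init_std=1.0, rng=rng)
weights = normalized_weights(particles, observation=0.5, obs_std=1.0)
estimate = sum(w * p for w, p in zip(weights, particles))  # weighted mean over particles
resampled = resample(particles, weights, rng)
```

Because the distribution is carried by many weighted samples rather than by one mean and variance, a particle set of this kind can represent distributions that are not a single normal distribution.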

Preferably, the target speech estimation means includes: observation signal model synthesis means for synthesizing a probabilistic model of the observation signal corresponding to a frame, based on the feature quantities of the observation signal, the estimated parameters, and the acoustic model; and estimated feature quantity calculation means for calculating the estimated feature quantity of the target speech for each frame, based on the feature quantities of the observation signal, the estimated parameters, the acoustic model, and the probabilistic model of the observation signal.

More preferably, the observation signal model synthesis means includes parameter estimation means for estimating, for each of the plurality of particles, the parameters of the probabilistic model of the observation signal in that particle, based on the estimated parameters and the acoustic model.

The estimated feature quantity calculation means may include: means for calculating, for each frame, estimated parameters of the target speech in each of the plurality of particles, based on the feature quantities of the observation signal, the acoustic model, the estimated parameters, and the probabilistic model of the observation signal; and means for calculating the estimated feature quantity of the target speech in that frame based on the estimated parameters of the target speech in each of the plurality of particles.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any of the disturbance component suppression devices according to the first aspect of the present invention.

A speech recognition system according to a third aspect of the present invention includes: any of the disturbance component suppression devices according to the first aspect of the present invention; and speech recognition means for receiving the estimated feature quantity of the target speech calculated by the disturbance component suppression device and performing speech recognition on the target speech by using a predetermined acoustic model of the target speech and a predetermined language model of the language to be recognized.

An embodiment of the present invention is described below with reference to the drawings. In the drawings used in the following description, identical parts are given identical reference numerals. Their names and functions are also identical, so their descriptions are not repeated. Symbols such as "^" used in the text of the following description should properly be written directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In the formulas, these symbols are written in their proper positions. Also, in the text of the following description, vectors and matrices are written in ordinary text preceded by "vector", "matrix", and so on, as in "vector $X_t$" and "matrix $\Sigma_W$", whereas in the formulas they are all written in bold.

[Configuration]
FIG. 1 shows the overall configuration of a speech recognition system 100 according to the present embodiment. Referring to FIG. 1, the speech recognition system 100 includes: a preprocessing unit 104 for collecting sound 122 produced by a sound source 102 and extracting, from the collected sound, feature quantities used for recognition; a preprocessing acoustic model unit 106, connected to the preprocessing unit 104, for preparing a probabilistic model (acoustic model) representing the relationship between speech and phonemes; a language model unit 108 for preparing a probabilistic model (language model) representing, for example, word connection probabilities in the language to be recognized; a search unit 110 for searching, using the language model of the language model unit 108, for words and the like corresponding to the feature quantities output from the preprocessing unit 104; and a recognition acoustic model unit 109, connected to the search unit 110, for preparing the acoustic model used in the search by the search unit 110.

The sound source 102 includes a speaker 116 who utters the speech to be recognized (target speech) 120, and a disturbance factor 118 that affects the transmission of sound around the speaker 116. The sound 122 that reaches the preprocessing unit 104 is not the target speech 120 produced by the utterance of the speaker 116 but a sound altered by the influence of the disturbance factor 118. In this specification, the noise-free target speech 120 produced by the utterance of the speaker 116 is called "clean speech". The sound recorded by the preprocessing unit 104, that is, the sound 122 that reaches the preprocessing unit 104 after being altered by the disturbance factor 118, is called the "observed sound".

The preprocessing acoustic model unit 106 prepares and holds an acoustic model consisting of a Gaussian mixture model (GMM) of the clean speech 120. It includes a training data storage unit 132 for storing a large amount of training data prepared in advance, a model training unit 134 for training the GMM using the training data stored in the training data storage unit 132, and a GMM storage unit 136 for storing the GMM 130 obtained through training by the model training unit 134.

FIG. 2 schematically shows the concept of the GMM 130. Referring to FIG. 2, the GMM 130 is a probabilistic model in which the values of a time-series signal are modeled by a single stationary signal source (state). An output probability is defined in the GMM 130. Specifically, the GMM 130 defines the values that may be output as the clean speech 120 at time t and the probability with which each value is output. In the GMM 130, the output probability is represented by a mixture of normal distributions. For example, the GMM 130 has a mixture normal distribution consisting of the single normal distributions 148A, 148B, ..., 148K.
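The output probability of a Gaussian mixture can be written out in a few lines of Python. The sketch below is one-dimensional for readability, and its mixture weights, means, and variances are hypothetical. The GMM 130 itself models feature vectors and its parameters come from training, neither of which is reproduced here.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a single normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Output probability density of a mixture: weighted sum of single normal densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical three-component mixture; the weights must sum to 1.
weights = [0.5, 0.3, 0.2]
means = [-1.0, 0.0, 2.0]
variances = [1.0, 0.5, 2.0]
density = gmm_pdf(0.0, weights, means, variances)
```

With a single component of weight 1 the mixture reduces to an ordinary normal distribution, which is the special case that a Kalman-filter-style estimate is limited to.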

Referring again to FIG. 1, the preprocessing unit 104 includes: a measurement unit 112 for recording the observed sound 122 and applying predetermined signal processing to the resulting observation signal to extract a predetermined feature vector (hereinafter simply "feature quantity") 124 of the observation signal; and a disturbance component suppression unit 114 for suppressing, using the GMM 130, the disturbance components contained in the feature quantity 124 extracted by the measurement unit 112.

Specifically, the measurement unit 112 performs log mel filter bank analysis on the observation signal for each frame, each frame having a time length of several tens of milliseconds, and outputs as the feature quantity 124 a vector whose elements are the resulting log mel spectrum.
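The framing step that precedes the filter bank analysis can be sketched in Python as follows. The 16 kHz sampling rate, the 25 ms frame length, and the 10 ms frame period are assumptions chosen for illustration, since the specification only states that frames are several tens of milliseconds long. The log mel filter bank analysis itself is not shown.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Cut a signal into frames of fixed length, one frame every frame_shift samples."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


sample_rate = 16000                      # assumed sampling rate in Hz
frame_length = sample_rate * 25 // 1000  # assumed 25 ms frame -> 400 samples
frame_shift = sample_rate * 10 // 1000   # assumed 10 ms period -> 160 samples

signal = [0.0] * sample_rate             # one second of placeholder samples
frames = split_into_frames(signal, frame_length, frame_shift)
```

Each element of `frames` would then be passed to the filter bank analysis, giving one feature vector per frame.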

The disturbance component suppression unit 114 uses the GMM 130 to estimate the feature quantity of the clean speech 120 from the feature quantity 124 of the observation signal, and outputs the feature quantity 126 obtained by the estimation to the search unit 110. In this specification, the speech represented by the feature quantity 126 is called "estimated clean speech".

FIG. 3 schematically shows a signal model of the disturbance factor 118. Referring to FIG. 3, the clean speech 120 undergoes multiplicative distortion that depends on, among other things, the spatial transfer characteristics from the speaker 116 to the measurement unit 112 shown in FIG. 1, so the sound 500 that reaches the measurement unit 112 from the speaker 116 differs from the clean speech 120. Let vector $S_t^{\mathrm{Lin}}$ be the vector whose elements are the linear mel spectrum of the clean speech 120 in the frame at time $t$ (hereinafter simply "the t-th frame"), and let matrix $H_t^{\mathrm{Lin}}$ be the matrix whose diagonal components are the linear mel spectrum of the multiplicative distortion. If vector $X_{S,t}^{\mathrm{Lin}}(D)$ is the vector whose elements are the linear mel spectrum of the sound 500 reaching the measurement unit 112, then $X_{S,t}^{\mathrm{Lin}}(D)$ is in general expressed as follows.

$$\mathbf{X}_{S,t}^{\mathrm{Lin}}(D) = \mathbf{H}_t^{\mathrm{Lin}} \mathbf{S}_t^{\mathrm{Lin}}$$

However, the observed sound 122 is affected by reverberation. That is, it is affected not only by the sound 500 that arrives directly but also by the reflected sound 502 that reaches the measurement unit 112 after being reflected by surrounding walls and the like. In the present embodiment, reflected sound is regarded as additive noise. Let vector $X_{S,t}^{\mathrm{Lin}}(R)$ be the vector whose elements are the linear mel spectrum of the reflected sound 502, and let vector $X_{S,t}^{\mathrm{Lin}}$ represent the sound that reaches the measurement unit 112 under the influence of reverberation. Then $X_{S,t}^{\mathrm{Lin}}$ is expressed as follows.

$$\mathbf{X}_{S,t}^{\mathrm{Lin}} = \mathbf{X}_{S,t}^{\mathrm{Lin}}(D) + \mathbf{X}_{S,t}^{\mathrm{Lin}}(R)$$

Both the direct sound 500 and the reflected sound 502 are sounds emitted by the speaker 116, but because their propagation paths differ, the reflected sound 502 reaches the measurement unit 112 later than the direct sound 500. According to Non-Patent Document 5, if matrix $A_t^{\mathrm{Lin}}$ is the matrix whose diagonal components are the linear prediction coefficients of the reverberation in each mel frequency band, $X_{S,t}^{\mathrm{Lin}}(R)$ is expressed as follows.

$$\mathbf{X}_{S,t}^{\mathrm{Lin}}(R) = \mathbf{A}_t^{\mathrm{Lin}} \mathbf{X}_{S,t-1}^{\mathrm{Lin}}$$

さらに実環境では、話者１１６及び計測部１１２の周囲において雑音５０４が発生し、計測部１１２に到達する。ここに雑音５０４の線形メルスペクトルを要素に持つベクトルをＮ_t ^Linとし、観測音１２２の線形メルスペクトルを要素に持つベクトルをＸ_S+N,t ^Linとする。Ｘ_S+N,t ^Linは、次の信号モデルによりモデル化できる。すなわち、
Ｘ_S+N,t ^Lin＝Ｘ_S,t ^Lin＋Ｎ_t ^Lin＝Ｈ_t ^LinＳ_t ^Lin＋Ｎ_t ^Lin＋Ａ^LinＸ_S,t-1 ^Lin
反射音は観測できないため、この式において反射音のベクトルＸ_S,t-1 ^Linを次のように近似する。すなわち、
Ｘ_S,t-1 ^Lin＝Ｘ_S+N,t-1 ^Lin−Ｎ_t-1 ^Lin Further, in the actual environment, noise 504 is generated around the speaker 116 and the measurement unit 112 and reaches the measurement unit 112. Here, a vector having the linear mel spectrum of the noise 504 as an element is N _t ^Lin, and a vector having the linear mel spectrum of the observation sound 122 as an element is X _{S + N, t} ^Lin . X _{S + N, t} ^Lin can be modeled by the following signal model. That is,
X _{S + N, t} ^Lin = X _{S, t} ^Lin + N _t ^Lin = H _t ^Lin S _t ^Lin + N _t ^Lin + A ^Lin X _{S, t-1} ^Lin
Since the reflected sound cannot be observed, the reflected sound vector X _{S, t-1} ^Lin in this equation is approximated as follows. That is,
X _{S, t-1} ^Lin = X _{S + N, t-1} ^Lin − N _{t-1} ^Lin
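As an illustration only, the linear-domain signal model and the approximation of the reflected sound can be sketched in Python as follows. The sketch assumes that the diagonal matrices H and A are stored as lists of their diagonal entries, so that every product is element-wise; the function and variable names are hypothetical and do not appear in the disclosure.

```python
from typing import List

Vector = List[float]


def approx_previous_speech(x_obs_prev: Vector, n_prev: Vector) -> Vector:
    """Approximate X_{S,t-1}^Lin as X_{S+N,t-1}^Lin - N_{t-1}^Lin (element-wise)."""
    return [x - n for x, n in zip(x_obs_prev, n_prev)]


def observed_linear_spectrum(h_diag: Vector, s: Vector, n: Vector,
                             a_diag: Vector, x_obs_prev: Vector,
                             n_prev: Vector) -> Vector:
    """Compute X_{S+N,t}^Lin = H S_t + N_t + A X_{S,t-1} per mel band.

    h_diag and a_diag hold the diagonal entries of H and A (an assumption
    made here to keep the sketch free of matrix libraries).
    """
    x_s_prev = approx_previous_speech(x_obs_prev, n_prev)
    return [h * s_i + n_i + a * xp
            for h, s_i, n_i, a, xp in zip(h_diag, s, n, a_diag, x_s_prev)]
```

For example, with two mel bands, `observed_linear_spectrum([2.0, 1.0], [3.0, 4.0], [1.0, 1.0], [0.5, 0.0], [5.0, 6.0], [1.0, 2.0])` combines the distorted speech, the noise, and the delayed reflected component band by band.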

第ｔフレームにおける観測信号の特徴量１２４、すなわち観測音１２２から得られる対数メルスペクトルを要素に持つベクトルを特徴量ベクトルＸ_tとする。なお、特徴量ベクトルＸ_tは、ベクトルＸ_S,t ^Linの各要素を対数メルスペクトル領域に変換することにより得られるベクトルである。特徴量ベクトルＸ_tは、計測により得られる既知のパラメータである。特徴量ベクトルＸ_tは、クリーン音声１２０の対数メルスペクトルを要素に持つベクトルＳ_tが外乱の影響で変化したベクトルである。ベクトルＳ_tは、未知のベクトルである。外乱には、乗法性歪み、残響、及び加法性雑音による影響分が含まれる。ここに、乗法性歪みの対数メルスペクトルを対角成分に持つ行列をＨ_tとし、加法性雑音の対数メルスペクトルを要素に持つベクトルをＮ_tとする。また、外乱には、残響による影響分も含まれる。さらに第ｔフレームにおける線形予測係数の対数を対角成分に持つ行列を行列Ａ_t、反射音のベクトルＸ_S,t-1 ^Linの各要素を対数化したベクトルをＸ_S,t-1とする。 The feature quantity 124 of the observation signal in the t-th frame, that is, the vector whose elements are the log mel spectrum obtained from the observation sound 122, is defined as the feature vector X _t . Note that the feature vector X _t is obtained by converting each element of the vector X _{S, t} ^Lin into the log mel spectrum domain. The feature vector X _t is a known parameter obtained by measurement. It is the vector S _t , whose elements are the log mel spectrum of the clean speech 120, as changed by the influence of the disturbance. The vector S _t is an unknown vector. The disturbance includes the effects of multiplicative distortion, reverberation, and additive noise. Here, the matrix having the log mel spectrum of the multiplicative distortion as its diagonal components is denoted by H _t , and the vector having the log mel spectrum of the additive noise as its elements is denoted by N _t . Further, the matrix having the logarithms of the linear prediction coefficients in the t-th frame as its diagonal components is denoted by A _t , and the vector obtained by taking the logarithm of each element of the reflected sound vector X _{S, t−1} ^Lin is denoted by X _{S, t−1} .

上記したベクトルＸ_t、Ｓ_t、Ｎ_t、及びＸ_S,t-1の次元数は同一である。また、行列Ｈ_t及びＡ_tの行数及び列数は同一である。なお、以下に説明する処理はこれらベクトル及び行列の要素についてそれぞれ行なわれるが、以下の説明では、簡単のために各の要素を特に区別して言及することはしない。 The above-described vectors X _t , S _t , N _t , and X _{S, t−1} have the same number of dimensions. Likewise, the matrices H _t and A _t have the same number of rows and the same number of columns. The processing described below is performed on each element of these vectors and matrices; for simplicity, however, the following description does not refer to the individual elements separately.

図４に、観測信号の観測過程及び雑音の状態変化過程を表現する状態空間モデル１６０を示す。図４を参照して、状態空間モデル１６０において、クリーン音声１２０の出力過程はＧＭＭでモデル化できるものと仮定する。すなわち、第ｔフレームにおけるクリーン音声１２０の成分であるベクトルＳ_tは、ＧＭＭ１３０内のある要素分布にしたがって出力されるものと仮定する。 FIG. 4 shows a state space model 160 expressing the observation process of the observation signal and the state transition process of the noise. Referring to FIG. 4, in the state space model 160 it is assumed that the output process of the clean speech 120 can be modeled by a GMM. That is, the vector S _t , which is the component of the clean speech 120 in the t-th frame, is assumed to be output according to one of the element distributions in the GMM 130.

ＧＭＭ１３０において、第ｔフレームに対応する要素分布をｋ_tとする。なお、要素分布ｋ_tは、平均をμ_S,ktとし分散をΣ_S,ktとする単一正規分布とする。また、要素分布ｋ_tから出力されるパラメータのベクトルをベクトルＳ_kt,tとする。以下、ＧＭＭ１３０から出力されるパラメータベクトルＳ_kt,tを、「（ＧＭＭ１３０の）出力パラメータ」と呼ぶ。クリーン音声１２０の特徴量ベクトルＳ_tと、出力パラメータベクトルＳ_kt,tとの間には誤差が存在する。また、Ｘ_S+N,t ^Linを対数メルスペクトル領域に変換する際にも誤差を伴う。これらの誤差もまたベクトルであり、これらの誤差のベクトルをまとめて、ベクトルＶ_tとする。また、外乱要因１１８による外乱を表す行列をΛ_t＝（Ｎ_t，Ｈ_t，Ａ_t）とする。観測信号の特徴量ベクトルＸ_t（１２４）の観測過程は、上記のＸ_S+N,t ^Linを対数メルスペクトル領域に変換することにより、ＧＭＭ１３０の出力パラメータベクトルＳ_kt,t及び誤差ベクトルＶ_t、並びにベクトルＸ_t、Ｓ_t、Ｎ_t、及びＸ_S,t-1、並びに行列Ｈ_t及びΛ_tを用いて、次の式（１）により表現される。 In the GMM 130, let k _t denote the element distribution corresponding to the t-th frame. The element distribution k _t is a single normal distribution with mean μ _{S, kt} and variance Σ _{S, kt} . The vector of parameters output from the element distribution k _t is denoted by S _{kt, t} . Hereinafter, the parameter vector S _{kt, t} output from the GMM 130 is referred to as the “output parameter (of the GMM 130)”. There is an error between the feature vector S _t of the clean speech 120 and the output parameter vector S _{kt, t} . Converting X _{S + N, t} ^Lin into the log mel spectrum domain also introduces an error. These errors are also vectors, and they are collectively denoted by the vector V _t . The matrix representing the disturbance due to the disturbance factor 118 is denoted by Λ _t = (N _t , H _t , A _t ). By converting the above X _{S + N, t} ^Lin into the log mel spectrum domain, the observation process of the feature vector X _t (124) of the observation signal is expressed by the following equation (1), using the output parameter vector S _{kt, t} of the GMM 130, the error vector V _t , the vectors X _t , S _t , N _t , and X _{S, t−1} , and the matrices H _t and Λ _t .

なお、式（１）でＩは単位行列を表す。また行列の対数、行列の指数演算はそれぞれ、行列の各要素について対数をとり、又は指数計算をし、その結果を成分とする行列を表すものとする。誤差ベクトルＶ_tは、次の式（２）のように平均が０で分散がΣ_S,ktの単一正規分布で表現される確率分布にしたがう値を要素に持つものとする。

In Equation (1), I represents the identity matrix. The logarithm and the exponential of a matrix are taken element-wise; that is, each denotes the matrix whose components are the logarithms or exponentials of the corresponding elements. The elements of the error vector V _t follow a probability distribution represented by a single normal distribution with mean 0 and variance Σ _{S, kt} , as shown in the following equation (2).

ただしこの式においてΣ_S,ktはＧＭＭ１３０内のある要素分布ｋ_tより得られるパラメータの共分散行列を表し、記号「〜」は左辺の値が右辺に示される確率分布にしたがうことを示す。すなわち、左辺の値が右辺に示す確率分布にしたがったサンプリングにより推定できることを示す。また、この式において、「Ｎ（μ，Σ）」は、平均値ベクトルμ、分散Σの単一正規分布を表す。

In this equation, Σ _{S, kt} represents the covariance matrix of the parameters obtained from an element distribution k _t in the GMM 130, and the symbol "~" indicates that the value on the left side follows the probability distribution shown on the right side. That is, the value on the left side can be estimated by sampling according to the probability distribution shown on the right side. In this equation, “N (μ, Σ)” represents a single normal distribution with mean vector μ and variance Σ.

また状態空間モデル１６０において、外乱を表す行列Λ_tは、ランダムウォーク過程にしたがって変化するものと仮定する。すなわち、第ｔ−１フレームにおける外乱を表す行列Λ_t-1と時刻ｔにおける外乱を表す行列Λ_tとの間に誤差が生じるものと仮定する。ベクトルＮ_t、行列Ｈ_t、及び行列Ａ_tに対するこの誤差をそれぞれ、ベクトルＷ_N、行列Ｗ_H、及び行列Ｗ_Aとし、これらをまとめて誤差を表す行列Ｗ_t＝（Ｗ_Nt，Ｗ_Ht，Ｗ_At）と定義する。外乱を表す行列Λ_tの時間変動は、次の式（３）により表現される。 In the state space model 160, the matrix Λ _t representing the disturbance is assumed to change according to a random walk process. That is, an error is assumed to arise between the matrix Λ _t−1 representing the disturbance in the (t−1)-th frame and the matrix Λ _t representing the disturbance at time t. The errors for the vector N _t , the matrix H _t , and the matrix A _t are denoted by the vector W _N , the matrix W _H , and the matrix W _A , respectively, and these are collectively defined as the matrix W _t = (W _Nt , W _Ht , W _At ) representing the error. The time variation of the matrix Λ _t representing the disturbance is expressed by the following equation (3).

この式において、誤差行列Ｗ_t＝（Ｗ_Nt，Ｗ_Ht，Ｗ_At）を構成する、ベクトルＷ_Nt、行列Ｗ_Ht、及び行列Ｗ_Atはそれぞれ予測誤差であり、それぞれ平均０、共分散行列がΣ_WN、Σ_WH、及びΣ_WAの単一正規分布で表現される確率分布にしたがう白色性ガウス雑音であるものとする。誤差を表す行列Ｗ_tは、次の式（４）のように単一正規分布にしたがう。

In this equation, the vector W _Nt , the matrix W _Ht , and the matrix W _At that constitute the error matrix W _t = (W _Nt , W _Ht , W _At ) are prediction errors, and each is assumed to be white Gaussian noise following a probability distribution represented by a single normal distribution with mean 0 and covariance matrix Σ _WN , Σ _WH , and Σ _WA , respectively. The matrix W _t representing the error follows a single normal distribution, as shown in the following equation (4).

ただし、式（４）においてΣ_Wは、誤差を表す行列Ｗ_tの共分散行列を表す。

In Equation (4), Σ _W represents a covariance matrix of a matrix W _t representing an error.
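The random walk described above can be illustrated by the following Python sketch. It is not the equations (3) and (4) themselves, whose images are not reproduced in this text; it assumes the additive form "new state = previous state + zero-mean Gaussian error", a diagonal Σ _W given as per-element standard deviations, and a flattened list in place of Λ _t = (N _t , H _t , A _t ). All names are hypothetical.

```python
import random
from typing import List


def random_walk_step(prev_state: List[float], error_std: List[float],
                     rng: random.Random) -> List[float]:
    """Advance the disturbance parameters by one frame.

    Each element receives an independent zero-mean Gaussian error W_t,
    which corresponds to assuming a diagonal covariance matrix Sigma_W.
    """
    return [p + rng.gauss(0.0, s) for p, s in zip(prev_state, error_std)]


# Usage: three disturbance parameters drifting over five frames.
rng = random.Random(0)
state = [0.0, 0.0, 0.0]
for _ in range(5):
    state = random_walk_step(state, [0.1, 0.1, 0.1], rng)
```

With all standard deviations set to zero the state stays unchanged, which is a convenient sanity check.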

図１に示す外乱成分抑圧部１１４は、上記の式（１）〜式（４）により表現される状態空間モデル１６０を用いて、フレームごとに、クリーン音声の特徴量ベクトルを逐次推定する。 The disturbance component suppression unit 114 shown in FIG. 1 sequentially estimates the feature vector of clean speech for each frame using the state space model 160 expressed by the above equations (1) to (4).

図５に、外乱成分抑圧部１１４の構成をブロック図で示す。図５を参照して、外乱成分抑圧部１１４は、観測信号の特徴量Ｘ_t（１２４）を受けて、ＧＭＭ１３０を用いて状態空間モデル１６０における外乱を表す行列Λ_tの確率分布（以下、「外乱確率分布」と呼ぶ。）を推定するための外乱確率分布推定部２００と、外乱確率分布推定部２００により推定された外乱確率分布とＧＭＭ１３０とから観測信号の確率モデルの平均ベクトルと共分散行列とを生成するためのパラメータ生成部２０２と、外乱確率分布、観測信号の平均ベクトル及び共分散行列、並びにＧＭＭ１３０を用いて、推定クリーン音声の特徴量１２６を算出するためのクリーン音声推定部２０４とを含む。 FIG. 5 is a block diagram showing the configuration of the disturbance component suppression unit 114. Referring to FIG. 5, the disturbance component suppression unit 114 includes: a disturbance probability distribution estimation unit 200 that receives the feature quantity X _t (124) of the observation signal and uses the GMM 130 to estimate the probability distribution of the matrix Λ _t representing the disturbance in the state space model 160 (hereinafter referred to as the “disturbance probability distribution”); a parameter generation unit 202 that generates the mean vector and covariance matrix of the probability model of the observation signal from the disturbance probability distribution estimated by the disturbance probability distribution estimation unit 200 and the GMM 130; and a clean speech estimation unit 204 that calculates the feature quantity 126 of the estimated clean speech using the disturbance probability distribution, the mean vector and covariance matrix of the observation signal, and the GMM 130.

外乱確率分布推定部２００は、外乱確率分布をフレームごとに逐次推定し、外乱確率分布を表すパラメータ２０６（以下、単に「推定外乱分布２０６」と呼ぶ。）を出力する機能を持つ。ここに、外乱を表す行列Λ₀，…，Λ_tからなる行列の系列を系列Λ_0:t＝｛Λ₀，…，Λ_t｝とする。系列Λ_0:tの事後確率分布ｐ（Λ_0:t｜Ｘ_0:t）は、１次マルコフ連鎖を用いて、次の式（５）のように表される。 The disturbance probability distribution estimation unit 200 has a function of sequentially estimating the disturbance probability distribution for each frame and outputting a parameter 206 representing the disturbance probability distribution (hereinafter simply referred to as the “estimated disturbance distribution 206”). Here, the sequence of matrices Λ ₀ , ..., Λ _t representing the disturbance is denoted by Λ _{0: t} = {Λ ₀ , ..., Λ _t }. The posterior probability distribution p (Λ _{0: t} | X _{0: t} ) of the sequence Λ _{0: t} is expressed by the following equation (5) using a first-order Markov chain.

式（５）のｐ（Λ_t｜Λ_t-1）は、単一正規分布を用いて次の式（６）のようにモデル化される。

p (Λ _t | Λ _{t−1} ) in Equation (5) is modeled by a single normal distribution, as in the following equation (6).

また、式（５）のｐ（Ｘ_t｜Λ_t）は、単一正規分布を用いて次の式（７）のようにモデル化される。

Also, p (X _t | Λ _t ) in Equation (5) is modeled by a single normal distribution, as in the following equation (7).

したがって、状態空間モデル１６０を基に外乱を表す行列Λ_tの確率分布を逐次推定する問題は、観測信号ベクトルＸ_tが与えられた時の事後確率を最大にするような系列Λ_0:tを推定する問題に帰着する。外乱確率分布推定部２００は、観測信号ベクトルＸ_tと状態空間モデル１６０とに基づき、この推定を行なう。 Therefore, the problem of sequentially estimating the probability distribution of the matrix Λ _t representing the disturbance based on the state space model 160 reduces to the problem of estimating the sequence Λ _{0: t} that maximizes the posterior probability given the observation signal vector X _t . The disturbance probability distribution estimation unit 200 performs this estimation based on the observation signal vector X _t and the state space model 160.

外乱確率分布推定部２００は、外乱を表す行列Λ_tの確率分布を逐次的に推定する際に、パーティクルフィルタと呼ばれる手法を用いる。この推定法は、状態空間内に、局限された状態空間（パーティクル）を多数生成して、各パーティクルにおいてパラメータの確率分布を推定し、状態空間内におけるパラメータの確率分布を、各パーティクルにおいて推定された確率分布を用いて近似的に表現する手法である。この手法では、多数のパーティクルにおける初期的なパラメータを、ランダムなサンプリングにより、又は当該パラメータの初期分布からのサンプリングにより決定する。そして、以下の処理をフレームごとに行なう。すなわち、あるフレームに対応して各パーティクルにおいてパラメータが決定されると、各パーティクルのパラメータを当該フレームに後続するフレームに対応するものに更新し、その更新の尤度に応じて各パーティクルに対して重みを付与する。そして、更新後のパーティクルにおけるパラメータの確率分布にしたがい、当該後続のフレームに対応する各パーティクルのパラメータを再サンプリングする。再サンプリングされたパラメータを基に、当該後続のフレームに対応する各パーティクルのパラメータを決定する。以上の処理をフレームごとに行なうことにより、逐次的に各パーティクルにおけるパラメータを決定する。状態空間におけるパラメータは、パーティクルにおけるパラメータの重み付き和によって近似的に表現される。すなわち、パーティクルの数をＪ、ｊ番目のパーティクルにおいて外乱を表す行列Λ_tに対応する各パラメータからなる行列を行列Λ_t ^(j)とし、当該パーティクルに対する重みをｗ_t ^(j)とすると、式（５）に示す系列Λ_0:tの事後確率分布ｐ（Λ_0:t｜Ｘ_0:t）は、次の式（８）によって近似的に表現される。 The disturbance probability distribution estimation unit 200 uses a technique called a particle filter when sequentially estimating the probability distribution of the matrix Λ _t representing the disturbance. This estimation method generates many localized state spaces (particles) in the state space, estimates the parameter probability distribution in each particle, and estimates the parameter probability distribution in the state space for each particle. It is a technique to express approximately using the probability distribution. In this method, initial parameters in a large number of particles are determined by random sampling or sampling from an initial distribution of the parameters. Then, the following processing is performed for each frame. That is, when a parameter is determined for each particle corresponding to a certain frame, the parameter of each particle is updated to that corresponding to the frame subsequent to that frame, and for each particle according to the likelihood of the update. Give weight. Then, the parameter of each particle corresponding to the subsequent frame is resampled according to the parameter probability distribution in the updated particle. 
Based on the resampled parameters, the parameters of each particle corresponding to the subsequent frame are determined. By performing the above processing for each frame, the parameters of each particle are determined sequentially. The parameters in the state space are approximated by the weighted sum of the parameters in the particles. That is, letting J be the number of particles, Λ _t ^(j) be the matrix of parameters corresponding to the disturbance matrix Λ _t in the j-th particle, and w _t ^(j) be the weight for that particle, the posterior probability distribution p (Λ _{0: t} | X _{0: t} ) of the sequence Λ _{0: t} shown in Equation (5) is approximated by the following equation (8).
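The weighted-sum approximation can be illustrated with a short Python sketch. It only shows how a point estimate of the disturbance parameters might be formed from J weighted particles; it is not a reproduction of equation (8), and the function name and the flattened list representation of each particle are assumptions made for the example.

```python
from typing import List


def weighted_particle_mean(particles: List[List[float]],
                           weights: List[float]) -> List[float]:
    """Return the weighted sum of particle parameters.

    particles[j] holds the (flattened) disturbance parameters of particle j
    and weights[j] its weight. Dividing by the weight total makes the
    function usable even when the weights are not yet normalized.
    """
    total = sum(weights)
    dim = len(particles[0])
    return [sum(w * p[d] for p, w in zip(particles, weights)) / total
            for d in range(dim)]
```

For two equally weighted particles, `weighted_particle_mean([[0.0, 0.0], [2.0, 4.0]], [0.5, 0.5])` yields the midpoint of the two parameter vectors.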

パラメータ生成部２０２は、具体的にはＶＴＳ（Vector Taylor Series）法と呼ばれるＨＭＭ合成法によって、パーティクルフィルタにより推定された外乱確率分布を用い、複数のパーティクルにおける観測信号の特徴量ベクトルＸ_tの平均ベクトル及び共分散行列（２０８）をそれぞれ算出する機能を持つ。 Specifically, the parameter generation unit 202 has a function of calculating, for each of the plurality of particles, the mean vector and covariance matrix (208) of the feature vector X _t of the observation signal, using the disturbance probability distribution estimated by the particle filter and an HMM composition method called the VTS (Vector Taylor Series) method.

クリーン音声推定部２０４は、最小２乗誤差（Minimum Mean Square Error：ＭＭＳＥ）推定法で、フレームごとに、複数のパーティクルにおけるクリーン音声のパラメータをそれぞれ推定し、それら推定されたパラメータの重み付き和によって推定クリーン音声の特徴量１２６を算出する機能を持つ。クリーン音声推定部２０４はさらに、外乱確率分布推定部２００に、次のフレームへの移行に関する要求２１０を発行する機能を持つ。 The clean speech estimation unit 204 estimates a clean speech parameter for each of a plurality of particles for each frame by a minimum mean square error (MMSE) estimation method, and calculates a weighted sum of these estimated parameters. It has a function of calculating the feature quantity 126 of the estimated clean speech. The clean speech estimation unit 204 further has a function of issuing a request 210 regarding the transition to the next frame to the disturbance probability distribution estimation unit 200.

図６に、外乱確率分布推定部２００の構成をブロック図で示す。図６を参照して、外乱確率分布推定部２００は、観測信号の特徴量１２４とクリーン音声推定部２０４からの要求２１０とを受けて、処理対象となるフレームを選択し、当該フレームにおける観測信号の特徴量１２４をフレームに応じた出力先に出力するためのフレーム選択部２２０と、フレーム選択部２２０から最初の所定フレーム分の観測信号の特徴量１２４を受けて初期状態における外乱確率分布を推定し、各パーティクルにおける外乱の初期的なパラメータを決定するための外乱初期分布推定部２２２と、フレーム選択部２２０からｔ（ｔ≧１）番目フレームにおける観測信号の特徴量１２４を受けて、逐次的に、パーティクルにおける雑音のパラメータと当該パーティクルに対する重みとを算出するための逐次計算部２２４とを含む。 FIG. 6 is a block diagram showing the configuration of the disturbance probability distribution estimation unit 200. Referring to FIG. 6, the disturbance probability distribution estimation unit 200 includes: a frame selection unit 220 that receives the feature quantity 124 of the observation signal and the request 210 from the clean speech estimation unit 204, selects the frame to be processed, and outputs the feature quantity 124 of the observation signal in that frame to the output destination corresponding to the frame; a disturbance initial distribution estimation unit 222 that receives the feature quantities 124 of the observation signal for the first predetermined number of frames from the frame selection unit 220, estimates the disturbance probability distribution in the initial state, and determines the initial disturbance parameters of each particle; and a sequential calculation unit 224 that receives the feature quantity 124 of the observation signal in the t-th (t ≧ 1) frame from the frame selection unit 220 and sequentially calculates the noise parameters of each particle and the weight for that particle.

外乱初期分布推定部２２２は、時刻ｔ＝０のフレームにおける外乱を表す行列Λ₀＝（Ｎ₀，Ｈ₀，Ａ₀）の確率分布（以下、「外乱初期分布」）を推定する。この際、加法性雑音の初期分布を以下のようにして推定する。 The disturbance initial distribution estimation unit 222 estimates a probability distribution (hereinafter, “disturbance initial distribution”) of a matrix Λ ₀ = (N ₀ , H ₀ , A ₀ ) representing the disturbance in the frame at time t = 0. At this time, the initial distribution of additive noise is estimated as follows.

外乱初期分布推定部２２２はまず、加法性雑音の初期分布、すなわち加法性雑音の初期値ベクトルＮ₀の確率分布ｐ（Ｎ₀）が、単一正規分布であるものとみなし、加法性雑音の初期分布を推定する。加法性雑音の初期分布における平均ベクトルをμ_Nとし、共分散行列を行列Σ_Nとすると、加法性雑音の初期分布ｐ（Ｎ₀）は次の式（９）のように表される。 The disturbance initial distribution estimation unit 222 first considers that the initial distribution of additive noise, that is, the probability distribution p (N ₀ ) of the initial value vector N ₀ of additive noise is a single normal distribution, and Estimate the initial distribution. When the average vector in the initial distribution of additive noise is μ _N and the covariance matrix is a matrix Σ _N , the initial distribution p (N ₀ ) of additive noise is expressed as the following equation (9).

外乱初期分布推定部２２２は、最初の所定フレーム分の区間の観測信号の特徴量ベクトルＸ_tが加法性雑音の成分のみからなるものとみなし、加法性雑音の初期分布の平均ベクトルμ_N、及び共分散行列Σ_Nを推定する。例えば０≦ｔ≦９の１０フレーム分の区間がこの区間に該当する場合、外乱初期分布推定部２２２は、平均ベクトルμ_N及び共分散行列Σ_Nをそれぞれ、次の式（１０）及び式（１１）によって算出する。ただし、ベクトルの右肩に付した「Ｔ」は転置を表す。

The disturbance initial distribution estimation unit 222 regards the feature vectors X _t of the observation signal in the section of the first predetermined number of frames as consisting only of the additive noise component, and estimates the mean vector μ _N and the covariance matrix Σ _N of the initial distribution of the additive noise. For example, when the 10-frame section 0 ≦ t ≦ 9 corresponds to this section, the disturbance initial distribution estimation unit 222 calculates the mean vector μ _N and the covariance matrix Σ _N by the following equations (10) and (11), respectively. Here, the superscript “T” attached to a vector represents transposition.
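A minimal Python sketch of this initial estimation is given below. Because the images of equations (10) and (11) are not reproduced in this text, the sketch assumes the ordinary sample mean and a sample covariance normalized by the number of frames; the actual normalization in the equations may differ. The function name is hypothetical.

```python
from typing import List, Tuple


def noise_initial_stats(frames: List[List[float]]
                        ) -> Tuple[List[float], List[List[float]]]:
    """Estimate mu_N and Sigma_N from leading frames assumed to be noise only.

    frames[t] is the log mel feature vector X_t of frame t. The covariance
    is the average of (X_t - mu_N)(X_t - mu_N)^T over the frames
    (division by the frame count is an assumption of this sketch).
    """
    count = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / count for d in range(dim)]
    cov = [[sum((f[r] - mean[r]) * (f[c] - mean[c]) for f in frames) / count
            for c in range(dim)]
           for r in range(dim)]
    return mean, cov
```

In the 10-frame example of the text, `frames` would simply hold the feature vectors of frames 0 to 9.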

次に外乱初期分布推定部２２２は、初期状態での各パーティクルにおける外乱を表す行列Λ₀ ^(j)を構成するベクトルＮ₀ ^(j)、行列Ｈ₀ ^(j)、及び行列Ａ₀ ^(j)を、式（１２）のように設定する。

Next, the disturbance initial distribution estimation unit 222 includes a vector N ₀ ^(j) , a matrix H ₀ ^(j) , and a matrix A ₀ ^(j) that constitute a matrix Λ ₀ ^(j) representing the disturbance in each particle in the initial state. Is set as shown in Expression (12).

すなわち、各パーティクルにおける加法性雑音のベクトルＮ₀ ^(j)を、初期分布ｐ（Ｎ₀）からのサンプリングによって生成し、各パーティクルにおける乗法性歪みの行列Ｈ₀ ^(j)及び残響の行列Ａ₀ ^(j)の各要素の値を０に設定する。

That is, the additive noise vector N ₀ ^(j) of each particle is generated by sampling from the initial distribution p (N ₀ ), and every element of the multiplicative distortion matrix H ₀ ^(j) and the reverberation matrix A ₀ ^(j) of each particle is set to 0.

さらに外乱初期分布推定部２２２は、各パーティクルにおける外乱を表す行列Λ₀ ^(j)を構成するベクトルＮ₀ ^(j)、行列Ｈ₀ ^(j)、及び行列Ａ₀ ^(j)の共分散行列Σ_N0 ^(j)、Σ_H0 ^(j)、及びΣ_A0 ^(j)を式（１３）のように設定する。 Further, the disturbance initial distribution estimation unit 222 has a covariance matrix Σ of a vector N ₀ ^(j) , a matrix H ₀ ^(j) , and a matrix A ₀ ^(j) that constitute a matrix Λ ₀ ^(j) representing the disturbance in each particle. _N0 ^(j) , Σ _H0 ^(j) , and Σ _A0 ^(j) are set as in equation (13).

すなわち、各パーティクルにおける加法性雑音のベクトルＮ₀ ^(j)の共分散行列を、初期分布ｐ（Ｎ₀）の共分散行列に設定し、各パーティクルにおける乗法性歪みの行列Ｈ₀ ^(j)及び残響の行列Ａ₀ ^(j)の共分散行列の各要素を０に設定する。外乱初期分布推定部２２２は、式（１２）と式（１３）とに示す設定を、パーティクルｊ（１≦ｊ≦Ｊ）ごとに行なう。

That is, the covariance matrix of the additive noise vector N ₀ ^(j) in each particle is set to the covariance matrix of the initial distribution p (N ₀ ), and the multiplicative distortion matrix H ₀ ^(j) in each particle and Each element of the covariance matrix of the reverberation matrix A ₀ ^(j) is set to zero. The disturbance initial distribution estimation unit 222 performs the setting shown in Expression (12) and Expression (13) for each particle j (1 ≦ j ≦ J).
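The initialization described for equations (12) and (13) can be sketched in Python as follows. For brevity the sketch assumes a diagonal Σ _N , passed as a list of variances, so that each element of N ₀ ^(j) can be drawn independently; a full covariance matrix would require a matrix factorization instead. The `Particle` class and all field names are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Particle:
    n: List[float]       # additive noise N_0^(j)
    h: List[float]       # diagonal of multiplicative distortion H_0^(j)
    a: List[float]       # diagonal of reverberation A_0^(j)
    var_n: List[float]   # variances of N_0^(j) (diagonal assumption)
    var_h: List[float]   # variances of H_0^(j)
    var_a: List[float]   # variances of A_0^(j)


def init_particles(mu_n: List[float], var_n: List[float],
                   num_particles: int, rng: random.Random) -> List[Particle]:
    """Draw N_0^(j) from N(mu_N, Sigma_N); set H, A and their variances to 0."""
    dim = len(mu_n)
    particles = []
    for _ in range(num_particles):
        n0 = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu_n, var_n)]
        particles.append(Particle(n=n0, h=[0.0] * dim, a=[0.0] * dim,
                                  var_n=list(var_n),
                                  var_h=[0.0] * dim, var_a=[0.0] * dim))
    return particles
```

Each of the J particles therefore starts from a different noise sample but from identical, zero-valued distortion and reverberation parameters.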

逐次計算部２２４は、ＧＭＭ１３０の出力パラメータ１４０をサンプリンするためのＧＭＭサンプリング部２２６と、第ｔフレームにおける観測信号の特徴量１２４を受け、各パーティクルにおける外乱のパラメータを更新するための更新部２３０と、更新後のパーティクルに対する重みをそれぞれ算出するための重み算出部２３２と、重み算出部２３２により算出された重みに基づき、パーティクルにおける外乱のパラメータを再サンプリングするための再サンプリング部２３４と、再サンプリングされたパーティクルにおける外乱のパラメータと第ｔ−１フレームにおける各パーティクルにおける外乱のパラメータとに基づき、各パーティクルにおける外乱のパラメータを決定し、推定外乱分布２０６を生成するための推定外乱分布生成部２３６とを含む。 The sequential calculation unit 224 receives a GMM sampling unit 226 for sampling the output parameter 140 of the GMM 130, an update unit 230 for updating the disturbance parameter in each particle in response to the observed signal feature 124 in the t-th frame, , A weight calculation unit 232 for calculating the weights for the updated particles, a re-sampling unit 234 for re-sampling the disturbance parameters in the particles based on the weights calculated by the weight calculation unit 232, and resampling The estimated disturbance distribution generation for determining the disturbance parameter for each particle and generating the estimated disturbance distribution 206 based on the disturbance parameter for the generated particle and the disturbance parameter for each particle in the (t-1) th frame. And a 236.

更新部２３０は、状態空間モデル１６０（図４）を基に構成される拡張カルマンフィルタを用いて、第ｔ−１フレームに対応するパーティクルにおける雑音のパラメータを、第ｔフレームに対応するものに更新する機能を持つ。拡張カルマンフィルタは、式（１）に示すように非線形項を含む状態空間モデルに対応したカルマンフィルタである。本実施の形態における拡張カルマンフィルタの分布更新式を、以下の式（１４）〜式（１９）に示す。なお、これらの数式において第ｔ−１フレームに対応するパラメータから予測される第ｔフレームにおけるパラメータについては添え字として「_t|t-1」を付してある。 The updating unit 230 uses the extended Kalman filter configured based on the state space model 160 (FIG. 4) to update the noise parameter in the particle corresponding to the t−1 frame to the one corresponding to the t frame. Has function. The extended Kalman filter is a Kalman filter corresponding to a state space model including a nonlinear term as shown in Expression (1). The distribution update formulas of the extended Kalman filter in the present embodiment are shown in the following formulas (14) to (19). In these equations, “ _{t | t−1} ” is attached as a subscript to the parameter in the t-th frame predicted from the parameter corresponding to the t−1 frame.

ただし、式（１７）及び式（１８）のベクトルＳ^(j) _kt ^(j) _,tは、ｊ番目のパーティクルにおいてＧＭＭ１３０（図２参照）の出力パラメータベクトルＳ_kt,tに相当するパラメータである。また前述した通り、行列Σ_Wは、第ｔ−１フレームから第ｔフレームへの状態変化の際に外乱を表す行列Λ_tに生じる誤差の行列Ｗ_tの共分散行列を表す。

However, the vector S ^(j) _kt ^(j) _{, t} in Equations (17) and (18) is the parameter corresponding to the output parameter vector S _{kt, t} of the GMM 130 (see FIG. 2) in the j-th particle. As described above, the matrix Σ _W represents the covariance matrix of the error matrix W _t that arises in the matrix Λ _t representing the disturbance at the state change from the (t−1)-th frame to the t-th frame.

ＧＭＭサンプリング部２２６は、ＧＭＭ１３０（図２参照）内の混合分布から、要素分布である単一正規分布ｋ_t ^(j)をその混合重みに基づいてサンプリングする。ＧＭＭサンプリング部２２６はさらに、サンプリングされた要素分布ｋ_t ^(j)から出力パラメータベクトルＳ^(j) _kt ^(j) _,tを確率分布にしたがってサンプリングして、更新部２３０に与える。ＧＭＭ１３０における要素分布ｋ_tの混合重みをＰ_S,st ^(j) _,ktとすると、要素分布ｋ_t ^(j)は、混合重みＰ_S,st ^(j) _,ktを出力確率とする確率分布にしたがう。すなわち、ＧＭＭ１３０から次の式（２０）に示すサンプリングによって得られる。 The GMM sampling unit 226 samples a single normal distribution k _t ^(j) , which is an element distribution, from the mixture distribution in the GMM 130 (see FIG. 2) based on its mixture weight. The GMM sampling unit 226 further samples the output parameter vector S ^(j) _kt ^(j) _{, t} from the sampled element distribution k _t ^(j) according to its probability distribution, and supplies it to the update unit 230. Letting P _{S, st} ^(j) _{, kt} be the mixture weight of the element distribution k _t in the GMM 130, the element distribution k _t ^(j) follows a probability distribution whose output probability is the mixture weight P _{S, st} ^(j) _{, kt} . That is, it is obtained from the GMM 130 by the sampling shown in the following equation (20).

要素分布ｋ_t ^(j)の平均ベクトルをベクトルμ_kt ^(j)とし、要素分布ｋ_t ^(j)の共分散行列を行列Σ_S,kt ^(j)とすると、ｊ番目のパーティクルにおけるＧＭＭ１３０の出力パラメータベクトルＳ^(j) _kt ^(j) _,tは、要素分布ｋ_t ^(j)から、次の式（２１）に示すサンプリングによって得られる。

If the average vector of the element distribution k _t ^(j) is the vector μ _kt ^(j) and the covariance matrix of the element distribution k _t ^(j) is the matrix Σ _{S, kt} ^(j) , the output of the GMM 130 at the j-th particle The parameter vector S ^(j) _kt ^(j) _{, t} is obtained from the element distribution k _t ^(j) by sampling shown in the following equation (21).
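The two-stage sampling of equations (20) and (21) can be sketched in Python as follows: an element distribution is first chosen with probability equal to its mixture weight, and the output parameter vector is then drawn from that element's normal distribution. The sketch assumes diagonal covariance matrices given as lists of variances, and the function and argument names are hypothetical.

```python
import math
import random
from typing import List, Tuple


def sample_gmm(mix_weights: List[float], means: List[List[float]],
               variances: List[List[float]],
               rng: random.Random) -> Tuple[int, List[float]]:
    """Sample an element index k and an output vector S from a GMM.

    Stage 1: choose k with probability proportional to mix_weights[k].
    Stage 2: draw each element of S from N(means[k][d], variances[k][d])
    (diagonal covariance is an assumption of this sketch).
    """
    k = rng.choices(range(len(mix_weights)), weights=mix_weights, k=1)[0]
    s = [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[k], variances[k])]
    return k, s
```

In a particle filter this function would be called once per particle and per frame, so that different particles may follow different element distributions.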

なお、フレーム選択部２２０はさらに、ＧＭＭサンプリング部２２６に対し、第ｔフレームにおけるＧＭＭの出力パラメータのサンプリングを要求する機能を持つ。

The frame selection unit 220 further has a function of requesting the GMM sampling unit 226 to sample the output parameters of the GMM in the t-th frame.

重み算出部２３２は、第ｔフレームでの観測信号の特徴量ベクトルＸ_tと、第ｔフレームの各パーティクルにおけるＧＭＭ１３０の出力パラメータベクトルＳ^(j) _kt ^(j) _,t、及び外乱のパラメータ行列Λ_t ^(j)と第ｔ−１フレームのパーティクルに対する重みｗ_t-1 ^(j)とを基に、次の式（２２）及び式（２３）に示す算出方法を用いて、第ｔフレームのパーティクルに対する重みｗ_t ^(j)を算出する機能を持つ。 The weight calculation unit 232 has a function of calculating the weight w _t ^(j) for each particle in the t-th frame by the calculation method shown in the following equations (22) and (23), based on the feature vector X _t of the observation signal in the t-th frame, the output parameter vector S ^(j) _kt ^(j) _{, t} of the GMM 130 and the disturbance parameter matrix Λ _t ^(j) of each particle in the t-th frame, and the weight w _t-1 ^(j) for the particle in the (t−1)-th frame.

なお、重みｗ_t ^(j)（１≦ｊ≦Ｊ）は、Σ_j=1〜Ｊｗ_t ^(j)＝１となるように正規化される。

The weights w _t ^(j) (1 ≦ j ≦ J) are normalized so that Σ _{j = 1 to J} w _t ^(j) = 1.
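The normalization step can be written as a small Python helper. The check for a non-positive total is a safeguard added for this sketch and is not mentioned in the description; the function name is hypothetical.

```python
from typing import List


def normalize_weights(weights: List[float]) -> List[float]:
    """Scale particle weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        # Added safeguard: all particles have zero likelihood.
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]
```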

再サンプリング部２３４は、パラメータが更新されたパーティクルにおける外乱のパラメータの確率分布にしたがい、時刻ｔに対応する各パーティクルにおける外乱のパラメータ行列Λ_t ^(j)を再サンプリングする機能を持つ。この際、再サンプリング部２３４は、微小な重みｗ_t ^(j)しか与えられていないパーティクルにおける確率分布からは、パラメータの再サンプリングを行なわない。一方、大きな重みｗ_t ^(j)が与えられているパーティクルにおける確率分布からは、パラメータを重みｗ_t ^(j)の大きさに応じた回数の再サンプリングを行ない、得られたパラメータをそれぞれ、当該再サンプリングの回数と同数のパーティクルに割当てる。ただし再サンプリングの全回数及びパーティクルの全数は一定（Ｊ）である。このようにするのは、各パーティクルに割当てられる重みが、式（２２）から分かるように観測された特徴量ベクトルＸ_tの尤度に対応しているからである。 The re-sampling unit 234 has a function of re-sampling the disturbance parameter matrix Λ _t ^(j) in each particle corresponding to the time t according to the probability distribution of the disturbance parameter in the particle whose parameter has been updated. At this time, the re-sampling unit 234 does not re-sample the parameters from the probability distribution of the particles to which only a minute weight w _t ^(j) is given. On the other hand, from the probability distribution of particles with a large weight w _t ^(j) , the parameters are resampled a number of times according to the size of the weight w _t ^(j) , and the obtained parameters are Assign to the same number of particles as the number of resampling. However, the total number of resampling and the total number of particles are constant (J). This is because the weight assigned to each particle corresponds to the likelihood of the observed feature vector X _t as can be seen from Equation (22).
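The description above does not name a specific resampling scheme. As one possible illustration, the following Python sketch uses systematic resampling, a common technique that is substituted here as an assumption: it returns J particle indices, selects heavily weighted particles several times, and never selects particles with negligible weight. The function name is hypothetical.

```python
import random
from typing import List


def systematic_resample(weights: List[float], rng: random.Random) -> List[int]:
    """Return J indices, each particle chosen roughly J * weight times."""
    count = len(weights)
    step = sum(weights) / count
    start = rng.random() * step          # single random offset in [0, step)
    indices = []
    i = 0
    cumulative = weights[0]
    for m in range(count):
        target = start + m * step        # evenly spaced sampling points
        # Advance to the particle whose cumulative weight covers the target.
        while target >= cumulative and i < count - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices
```

The returned indices identify which particles' parameters are copied into the J particles of the next step, so the total number of particles stays constant.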

推定外乱分布生成部２３６は、Metropolis-Hastingsアルゴリズムにより、第ｔフレームに対応するパーティクルの再生成を行なう機能を持つ。図７に、推定外乱分布生成部２３６の構成をブロック図で示す。図７を参照して、推定外乱分布生成部２３６は、再サンプリング部２３４による再サンプリングで得られた各パーティクルにおける外乱の確率分布を用いて状態空間モデル１６０における外乱の確率分布を表し、当該表した確率分布に基づき、第ｔ−１フレームに対応するパーティクルにおける外乱のパラメータを第ｔフレームに対応するものへ、上記の式（１４）〜式（１９）に示す拡張カルマンフィルタを用いて再更新するための再更新部２６２と、再更新されたパーティクルに対する重み（これを以下「ｗ_t ^*(j)」とする。）を上記の式（２２）及び式（２３）に示す算出方法を用いて算出するための重み再計算部２６４と、再サンプリングされたパーティクルに対する重みｗ_t ^(j)及び再更新されたパーティクルに対する重みｗ_t ^*(j)から、再更新されたパラメータを許容するか否かの判定に用いる許容確率νを算出するための許容確率算出部２６６と、所定の乱数発生方法により０から１までの閉区間内の乱数ｕを発生させるための乱数発生部２６８と、許容確率νと乱数ｕとに基づき、第ｔフレームに対応するパーティクルにおけるパラメータとして、再サンプリングされたパーティクルにおけるパラメータと、再更新されたパーティクルにおけるパラメータとの一方を選択するためのパラメータ選択部２７０とを含む。 The estimated disturbance distribution generation unit 236 has a function of regenerating the particles corresponding to the t-th frame by the Metropolis-Hastings algorithm. FIG. 7 is a block diagram showing the configuration of the estimated disturbance distribution generation unit 236. Referring to FIG. 7, the estimated disturbance distribution generation unit 236 includes: a re-update unit 262 that represents the disturbance probability distribution in the state space model 160 using the disturbance probability distribution of each particle obtained by the resampling in the resampling unit 234 and, based on that distribution, re-updates the disturbance parameters of the particle corresponding to the (t−1)-th frame to those corresponding to the t-th frame using the extended Kalman filter shown in Equations (14) to (19) above; a weight recalculation unit 264 that calculates the weight for each re-updated particle (hereinafter “w _t ^{* (j)} ”) by the calculation method shown in Equations (22) and (23) above; an allowable probability calculation unit 266 that calculates, from the weight w _t ^(j) for the resampled particle and the weight w _t ^{* (j)} for the re-updated particle, an allowable probability ν used to determine whether to accept the re-updated parameters; a random number generation unit 268 that generates a random number u within the closed interval from 0 to 1 by a predetermined random number generation method; and a parameter selection unit 270 that selects, based on the allowable probability ν and the random number u, either the parameters of the resampled particle or the parameters of the re-updated particle as the parameters of the particle corresponding to the t-th frame.

許容確率算出部２６６は、重みｗ_t ^(j)及び重みｗ_t ^*(j)から次の式（２４）にしたがって、許容確率νを算出する機能を持つ。 The allowable probability calculation unit 266 has a function of calculating the allowable probability ν from the weight w _t ^(j) and the weight w _t ^{* (j)} according to the following equation (24).

パラメータ選択部２７０は、ｕが許容確率ν以下であれば、当該パーティクルにおける外乱のパラメータを再更新で得られた新たなパラメータに変更する機能を持つ。 The parameter selection unit 270 has a function of changing the disturbance parameter of the particle to a new parameter obtained by re-update if u is equal to or less than the allowable probability ν.
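The acceptance step can be illustrated by the following Python sketch. The image of equation (24) is not reproduced in this text, so the sketch assumes the usual Metropolis-Hastings form ν = min(1, w*/w), where w is the weight of the resampled particle and w* the weight of the re-updated particle; the actual equation may differ. All names are hypothetical.

```python
from typing import TypeVar

T = TypeVar("T")


def acceptance_probability(w_resampled: float, w_reupdated: float) -> float:
    """Assumed form of the allowable probability: min(1, w* / w)."""
    if w_resampled <= 0.0:
        return 1.0  # added safeguard: any candidate beats a zero-weight particle
    return min(1.0, w_reupdated / w_resampled)


def select_parameters(resampled: T, reupdated: T, w_resampled: float,
                      w_reupdated: float, u: float) -> T:
    """Keep the re-updated parameters when u <= nu, else the resampled ones."""
    nu = acceptance_probability(w_resampled, w_reupdated)
    return reupdated if u <= nu else resampled
```

Here `u` would be supplied by a uniform random number generator on the interval from 0 to 1, for example `random.random()`.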

［プログラム構造］
以下の説明からも明らかなように、図１に示す音声認識システム１００の前処理部１０４、前処理用音響モデル部１０６、及び探索部１１０は、いずれもコンピュータハードウェアとその上で実行されるプログラムにより実現可能である。図８に、本実施の形態に係る前処理部１０４に含まれる外乱成分抑圧部１１４が行なう外乱成分の抑圧処理を実現するコンピュータプログラムの制御構造をフローチャートで示す。 [Program structure]
As will be apparent from the following description, the preprocessing unit 104, the preprocessing acoustic model unit 106, and the search unit 110 of the speech recognition system 100 shown in FIG. 1 can each be realized by computer hardware and a program executed on it. FIG. 8 is a flowchart showing the control structure of a computer program that realizes the disturbance component suppression processing performed by the disturbance component suppression unit 114 included in the preprocessing unit 104 according to the present embodiment.

図８を参照して、外乱成分の抑圧処理が開始されると、ステップ３０２において、初期状態における外乱Λ₀の各要素の値に対応する初期分布を推定する。すなわち、上記の式（１０）及び式（１１）に示す算出方法により、式（９）に示す加法性雑音の初期分布ｐ（Ｎ₀）のパラメータ平均ベクトルμ_N及び共分散行列Σ_Nを算出する。さらに、式（１２）及び式（１３）にしたがい、初期分布ｐ（Ｎ₀）からパラメータベクトルＮ₀ ^(j)（ｊ＝１，…，Ｊ）をサンプリングし、各パーティクルにおける加法性雑音の初期的なパラメータに推定する。またこの際、各パーティクルにおける乗法性雑音の初期的なパラメータ行列Ｈ₀ ^(j)及び残響の初期的なパラメータ行列Ａ₀ ^(j)についても、それぞれ式（１２）及び式（１３）にしたがい設定を行なう。 Referring to FIG. 8, when the disturbance component suppression processing is started, in step 302 the initial distribution corresponding to the value of each element of the disturbance Λ ₀ in the initial state is estimated. That is, the mean vector μ _N and the covariance matrix Σ _N , which are the parameters of the initial distribution p (N ₀ ) of the additive noise shown in Equation (9), are calculated by the calculation method shown in Equations (10) and (11) above. Further, in accordance with Equations (12) and (13), the parameter vectors N ₀ ^(j) (j = 1, ..., J) are sampled from the initial distribution p (N ₀ ) and used as the initial additive noise parameters of the particles. At this time, the initial parameter matrix H ₀ ^(j) of the multiplicative noise and the initial parameter matrix A ₀ ^(j) of the reverberation of each particle are also set in accordance with Equations (12) and (13), respectively.

In step 304, the frame subject to disturbance suppression is advanced to the next frame. In step 306, a particle filter is used to estimate the parameters of the probability distribution corresponding to the matrix that represents the disturbance in the frame being processed. That is, the disturbance parameter matrix Λ_t^(j) of each particle and the covariance matrix of Λ_t^(j) are estimated, and the weight w^(j) of each particle is determined. The processing in this step is described later with reference to FIG. 9.

In step 308, the disturbance parameter matrix Λ_t^(j) determined for each particle in step 306 and its covariance matrix are used to estimate the probability distribution of the feature vector X_t (124) of the observation signal in each particle. In addition, for each element distribution k (1 ≤ k ≤ K) constituting the GMM 130, the mean vector μ_{Xk,t}^(j) and the covariance matrix Σ_{Xk,t}^(j) of the probability model of the observation signal in the particle are calculated.

In step 310, the feature amount of the clean speech in the t-th frame is estimated by the MMSE estimation method. That is, the parameters obtained in steps 306 and 308 are used to calculate the MMSE estimate vector Ŝ_t by the MMSE estimation method, and this vector is output as the feature amount 126 of the estimated clean speech (see FIG. 1).

In this equation, P(k | X_t, (j)) represents the mixture weight for element distribution k in the GMM 130 for the j-th particle. The mixture weight P(k | X_t, (j)) is calculated by the following equation.
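For illustration only, a posterior mixture weight of this kind is commonly obtained by Bayes' rule from the prior mixture weights and the likelihood of X_t under each element distribution. The Python sketch below assumes diagonal covariances and takes the observation model parameters of one particle as plain lists; the function names and the diagonal-covariance simplification are hypothetical and are not taken from the equations of this specification.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def posterior_mixture_weights(x, priors, means, variances):
    """Return P(k | x) for every element distribution k via Bayes' rule."""
    log_terms = [
        math.log(p) + log_gaussian_diag(x, m, v)
        for p, m, v in zip(priors, means, variances)
    ]
    peak = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(t - peak) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Working in the log domain avoids underflow when the feature vector has many dimensions, which would likely matter for a 23-dimensional log mel spectrum and several hundred mixture components.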

In step 312, an end determination is made. That is, if the t-th frame is the final frame, the disturbance component suppression processing ends. Otherwise, the process returns to step 304.
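The control structure of steps 302 to 312 can be sketched as a simple per-frame loop. In the Python sketch below, the four callables passed in are hypothetical placeholders for the processing of steps 302, 306, 308, and 310; their names and signatures are assumptions made for illustration and do not appear in the specification.

```python
def suppress_disturbance(frames, init_particles, estimate_disturbance,
                         build_observation_model, mmse_estimate):
    """Run the per-frame loop of FIG. 8 with caller-supplied step functions."""
    particles = init_particles(frames)  # step 302: initial distribution
    estimates = []
    for x_t in frames:  # step 304: advance to the next frame
        # Step 306: particle filter update of the disturbance parameters.
        particles, weights = estimate_disturbance(particles, x_t)
        # Step 308: per-particle probability model of the observation signal.
        models = [build_observation_model(p) for p in particles]
        # Step 310: MMSE estimate of the clean speech feature for this frame.
        estimates.append(mmse_estimate(x_t, particles, weights, models))
    # Step 312: the loop ends after the final frame.
    return estimates
```

Passing the steps in as functions keeps the loop itself independent of how each step is realized, which mirrors the way FIG. 8 defers the detail of step 306 to FIG. 9.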

FIG. 9 is a flowchart showing the control structure of a program that realizes the disturbance probability distribution estimation processing performed in step 306 (see FIG. 8). Referring to FIG. 9, when the disturbance probability distribution estimation processing starts, step 322 uses the extended Kalman filter given by equations (14) to (19) to estimate the disturbance probability distribution of each particle in the t-th frame from the disturbance probability distribution of that particle in the (t-1)-th frame.

In step 324, the weight w_t^(j) of each particle in the t-th frame is calculated by equations (22) and (23) and normalized. In step 326, the number of times to resample from each particle is determined based on the weight w_t^(j) given to that particle, and parameters are resampled based on the disturbance probability distribution of that particle. In step 328, the particles of the t-th frame are regenerated using the Metropolis-Hastings algorithm.

FIG. 10 is a flowchart showing the details of the processing in step 328 (see FIG. 9). Referring to FIG. 10, when the processing of step 328 starts, step 342 re-updates the disturbance probability distribution using the parameters of the particles obtained by the resampling in step 326 (see FIG. 9). That is, a new particle for the frame at time t is prepared, the parameters corresponding to the particle of the (t-1)-th frame are re-updated to parameters corresponding to the particle of the t-th frame by the same processing as in step 322 (see FIG. 9), and the result is set as the parameters of the prepared particle. In step 344, the weight w_t^*(j) of the particle prepared in step 342 is calculated and normalized by the same processing as in step 324 shown in FIG. 9.

In step 346, the acceptance probability ν of the particle prepared in step 342 is determined by comparing the weight w_t^(j) calculated in step 324 with the weight w_t^*(j) calculated in step 344. In step 348, a random number u is generated by drawing a value uniformly from the set U_[0,1] of values in the interval [0, 1]. In step 350, the random number u generated in step 348 is compared with the acceptance probability ν determined in step 346. If u is less than or equal to ν, the process proceeds to step 352; otherwise, it proceeds to step 354. In step 352, the particle prepared in step 342 is accepted. That is, the parameters obtained by the resampling in step 326 are replaced with the parameters of the prepared particle, and the processing ends. In step 354, the particle prepared in step 342 is rejected. That is, the prepared particle and its parameters are discarded, and the processing ends.
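Steps 346 to 354 amount to a single accept-or-reject decision. The Python sketch below assumes the acceptance probability takes the form ν = min(1, w*/w), which is a common choice when a proposal is compared with the current particle by weight; the specification states only that ν is determined by comparing the two weights, so this form, the function name, and the argument names are assumptions.

```python
import random


def metropolis_hastings_step(current, w_current, proposal, w_proposal, rng=random):
    """Return (kept parameters, accepted flag) for one particle."""
    # Step 346: acceptance probability from the two weights (assumed form).
    if w_current <= 0.0:
        nu = 1.0
    else:
        nu = min(1.0, w_proposal / w_current)
    # Step 348: uniform random number in the unit interval.
    u = rng.random()
    # Steps 350 to 354: accept the proposal if u <= nu, otherwise reject it.
    if u <= nu:
        return proposal, True
    return current, False
```

With this form, a re-updated particle whose weight is at least as large as that of the resampled particle is always accepted, and a worse one is accepted only occasionally.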

[Operation]
The speech recognition system 100 according to the present embodiment operates as follows. First, the operation by which the disturbance probability distribution estimation unit 200 shown in FIG. 6 estimates the probability distribution of the disturbance in the initial state is described. The measurement unit 112 shown in FIG. 1 receives the observation sound 122 from the sound source 102 and extracts the feature amount X_t (124) of the observation signal. The extracted feature amount X_t (124) is given to the disturbance probability distribution estimation unit 200, shown in FIG. 5, of the disturbance component suppression unit 114. Referring to FIG. 6, the frame selection unit 220 of the disturbance probability distribution estimation unit 200 gives the first 10 frames of the feature amount X_t (124) to the disturbance initial distribution estimation unit 222. The disturbance initial distribution estimation unit 222 estimates the initial distribution p(N₀) of the additive noise by the processing shown in equations (9) to (11) above. It then performs the sampling shown in equations (12) and (13) above J times from the initial noise distribution p(N₀). This sampling determines the initial noise parameter vector N₀^(j) and the covariance matrix Σ_N0^(j) of each particle. The initial parameter matrix H₀^(j) of the multiplicative distortion and its covariance matrix Σ_H0^(j) are both set to 0, and the initial parameter matrix A₀^(j) of the reverberation and its covariance matrix Σ_A0^(j) are both set to 0. The disturbance probability distribution estimation unit 200 outputs these parameters as the parameters of the estimated disturbance distribution 206 for the frame at time t = 0.
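The initialization described above can be sketched as follows. This Python sketch assumes that the initial noise distribution is a Gaussian whose mean and per-dimension variance are the sample statistics of the first 10 frames, and it represents every parameter as a plain list with a diagonal covariance; equations (9) to (13) are not reproduced here, so these simplifications, the function name, and the dictionary keys are assumptions.

```python
import random


def init_particles(features, num_particles, num_init_frames=10, rng=random):
    """Estimate the initial noise distribution and draw the initial particles."""
    head = features[:num_init_frames]  # the first frames are treated as noise only
    n, dim = len(head), len(head[0])
    mean = [sum(f[d] for f in head) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in head) / n for d in range(dim)]
    particles = []
    for _ in range(num_particles):
        particles.append({
            # Additive noise: one draw from the estimated initial distribution.
            "N": [rng.gauss(mean[d], var[d] ** 0.5) for d in range(dim)],
            "Sigma_N": list(var),
            # Multiplicative distortion and reverberation start at zero.
            "H": [0.0] * dim, "Sigma_H": [0.0] * dim,
            "A": [0.0] * dim, "Sigma_A": [0.0] * dim,
        })
    return mean, var, particles
```

Calling `init_particles(features, num_particles=20)` would produce the particle count used in the experiment described later in this specification.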

Next, the operation by which the disturbance probability distribution estimation unit 200 estimates the estimated disturbance distribution 206 in the t-th frame (t ≥ 1) is described. Referring to FIG. 6, in response to a processing start request 210 for the next frame, the frame selection unit 220 gives the feature amount X_t (124) of the observation signal to the update unit 230 and requests the GMM sampling unit 226 to sample the output parameters of the GMM for the t-th frame. In response, the update unit 230 acquires the parameters of the estimated disturbance distribution 206 of each particle in the (t-1)-th frame.

The GMM sampling unit 226 samples the output parameter vector S^(j)_{k_t^(j),t} from the GMM 130. FIG. 11 schematically shows an outline of this sampling. For the j-th particle, for example, an element distribution k_t^(j) (402) is first sampled from the Gaussian mixture distribution 400 in the GMM 130 with a probability given by the mixture weights. The GMM sampling unit 226 then samples the output parameter vector S^(j)_{k_t^(j),t} (404) according to the output probability distribution represented by the element distribution k_t^(j) (402). The GMM sampling unit 226 samples the output parameter vector S^(j)_{k_t^(j),t} for each of the J particles by this procedure and gives the results to the update unit 230 shown in FIG. 6.
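The two-stage sampling described above can be sketched in a few lines of Python. The sketch assumes diagonal covariances and represents the GMM as three parallel lists; the function name and this representation are hypothetical.

```python
import random


def sample_from_gmm(weights, means, variances, rng=random):
    """Draw an element distribution index k, then a vector from that component."""
    # Stage 1: choose the element distribution according to the mixture weights.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Stage 2: draw the output parameter vector from the chosen Gaussian.
    sample = [rng.gauss(m, v ** 0.5) for m, v in zip(means[k], variances[k])]
    return k, sample
```

Calling this function once per particle gives each of the J particles its own element distribution and output parameter vector.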

FIG. 12 schematically shows an outline of the parameter update and resampling performed by the sequential calculation unit 224. In FIG. 12, a disturbance parameter is distributed along the horizontal direction, and time advances from top to bottom. Particles are shown schematically as open circles and filled circles. For example, particles shown as open circles have a very small weight w_t^(j), and particles shown as filled circles have a large weight w_t^(j).

Referring to FIG. 12, assume that the state space 420 is approximately represented by the particles corresponding to the (t-1)-th frame. The update unit 230 uses the extended Kalman filter given by equations (14) to (19) to update the parameter matrix Λ̂_{t-1}^(j) of the disturbance distribution of each particle in the state space 420 to the parameter matrix Λ̂_t^(j) of the estimated disturbance distribution corresponding to the t-th frame. Each particle in the state space 420 is thereby updated, and the particles with updated parameters represent the state space 430 corresponding to the t-th frame.

Next, the weight calculation unit 232 calculates the weight w_t^(j) of each particle in the state space 430 by equations (22) and (23). The resampling unit 234 resamples the disturbance parameters of the particles based on the weights w_t^(j). To do so, the resampling unit 234 first sets, for each particle in the state space 430, the number of times to resample from that particle according to w_t^(j). The number of samplings from a particle with a very small weight, shown as an open circle, is set to 0. The number of samplings from a particle with a large weight, shown as a filled circle, is set to 1 to 3 according to the magnitude of the weight. The disturbance parameters are then resampled the set number of times from each particle, based on the probability distribution of the disturbance of that particle in the state space 430. In this way, the particles that represent a new state space 440 corresponding to the t-th frame are formed.
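One common way to turn weights into per-particle sampling counts is systematic resampling, sketched below in Python. The specification does not state which scheme the resampling unit 234 uses, so the choice of systematic resampling and the function name are assumptions made for illustration.

```python
import random


def resampling_counts(weights, rng=random):
    """Return how many times to resample from each particle (counts sum to J)."""
    n = len(weights)
    total = sum(weights)
    step = 1.0 / n
    pointer = rng.random() * step  # one random offset, then evenly spaced pointers
    counts = [0] * n
    j = 0
    cumulative = weights[0] / total
    for _ in range(n):
        # Advance to the particle whose cumulative weight covers the pointer.
        while pointer > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        counts[j] += 1
        pointer += step
    return counts
```

With four particles weighted 0.1, 0.6, 0.3, and 0.0 and a random offset of 0.5, this gives counts of 0, 3, 1, and 0: particles with very small weights are dropped and heavily weighted particles are drawn from several times, as in FIG. 12.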

If the resampling unit 234 repeats such resampling, the disturbance parameters of most particles corresponding to a given frame may end up having been sampled from the probability distributions of the disturbance parameters of only a few particles corresponding to earlier frames. The estimated disturbance distribution generation unit 236 prevents this by using the Metropolis-Hastings algorithm to newly generate parameters for the particles corresponding to the t-th frame. The re-update unit 262 shown in FIG. 7 re-updates the disturbance parameters of the particles in the state space 420 corresponding to the (t-1)-th frame according to the estimated disturbance distribution in the state space 440. The weight recalculation unit 264 calculates the weight w_t^*(j) of each re-updated particle. The acceptance probability calculation unit 266 calculates the acceptance probability ν from the weight w_t^*(j) of the re-updated particle and the weight w_t^(j) of the resampled particle. The parameter selection unit 270 compares the acceptance probability ν with a random number u in the interval [0, 1] generated by the random number generation unit 268. If u is less than or equal to ν, the parameter selection unit 270 replaces the parameters of the resampled particle with the parameters of the re-updated particle. Otherwise, it discards the parameters of the re-updated particle.

By repeating the above operation for each frame, the parameter vectors N_t^(j) and H_t^(j) and the matrix A_t^(j) of the estimated disturbance distribution 206 of each particle, together with the covariance matrices Σ_Nt^(j), Σ_Ht^(j), and Σ_At^(j), are estimated for each frame. For each frame, the disturbance probability distribution estimation unit 200 gives the parameter vectors N_t^(j) and H_t^(j) and the matrix A_t^(j) of the estimated disturbance distribution 206 of each particle, the covariance matrices Σ_Nt^(j), Σ_Ht^(j), and Σ_At^(j), the weight w_t^(j) of each particle, and the feature vector X_t of the observation signal to the parameter generation unit 202 shown in FIG. 5.

Referring to FIG. 5, the parameter generation unit 202 uses the VTS method to generate the mean vector and covariance matrix (208) of the probability model of the observation signal of each particle corresponding to the t-th frame. The probability distribution of the disturbance and the probability distribution of the observation signal have thus been estimated for each particle. The clean speech estimation unit 204 calculates, by the MMSE estimation method, the MMSE estimate vector Ŝ_t^(j) of the clean speech for each particle corresponding to the t-th frame. It then uses the MMSE estimate vectors Ŝ_t^(j) and the weights w_t^(j) to calculate the estimated feature vector Ŝ_t of the clean speech at time t, and outputs it to the search unit 110 shown in FIG. 1.
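The final combination of per-particle estimates can be illustrated with a weighted average. The Python sketch below assumes that Ŝ_t is the sum of the per-particle MMSE estimates Ŝ_t^(j) weighted by the normalized w_t^(j); the specification says only that the estimates and the weights are used together, so the weighted-average form and the function name are assumptions.

```python
def combine_particle_estimates(estimates, weights):
    """Weighted average of per-particle clean speech estimates."""
    total = sum(weights)  # renormalize in case the weights do not sum to one
    dim = len(estimates[0])
    return [
        sum(w * s[d] for w, s in zip(weights, estimates)) / total
        for d in range(dim)
    ]
```

For two particles with estimates [0.0, 2.0] and [4.0, 6.0] and weights 0.25 and 0.75, the combined estimate is [3.0, 5.0].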

The search unit 110 shown in FIG. 1 uses the estimated feature vector Ŝ_t of the clean speech to search for matching words and the like in the target language, based on the acoustic model held in the recognition acoustic model unit 109 and the language model held in the language model unit 108, and outputs the result as the recognition output 128.

[Experiment]
To confirm the effect of the speech recognition system 100 according to the present embodiment, an experiment on estimating noise from the observation signal and an experiment on recognizing the observation signal were conducted. The experimental methods and results are described below.

In this experiment, observation signals were generated by convolving a reverberation impulse response with the data of 1001 clean speech sentences recorded in a common database for evaluating Japanese speech recognition in noise, and then artificially adding additive noise. The reverberation impulse responses, with reverberation times of 0.3 seconds and 1.3 seconds, were taken from a real-environment speech and acoustic database. Factory noise and road construction noise, each recorded in a real environment, were used as the added noise. Samples with no added noise and samples in which noise was added to the clean speech at SNRs (signal-to-noise ratios) of 20 dB, 15 dB, and 10 dB were prepared. Each prepared sample was processed with a 23rd-order log mel filter bank, and vectors whose elements are the components of the resulting 23rd-order log mel spectrum were generated and used as the feature vectors to be recognized.
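Adding noise at a target SNR means scaling the noise so that the ratio of speech power to noise power matches the requested value in decibels. The Python sketch below shows that scaling for signals held as lists of samples; the function name is hypothetical, the reverberation convolution is omitted, and the specification does not describe how the mixing was actually carried out.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling the noise to the requested SNR in dB."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Choose the gain so that 10 * log10(p_speech / (gain**2 * p_noise)) == snr_db.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 10 dB the scaled noise carries one tenth of the speech power, and at 20 dB one hundredth.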

In the recognition experiment, for comparison, the feature amounts used for the search were generated from each of the above samples by the following five processing methods, which include the disturbance component suppression method according to the present embodiment: the feature amounts of the observation signal without disturbance suppression, that is, the HTK baseline (Baseline); feature amounts subjected to noise suppression by the ETSI Advanced Front-End (ES 202) recommended by ETSI (European Telecommunications Standards Institute) (ETSI); estimated feature amounts obtained by conventional MMSE estimation (MMSE); estimated feature amounts obtained by the method described in Non-Patent Document 5 (EM); and estimated feature amounts obtained by the disturbance component suppression processing according to the present embodiment using a particle filter (Proposed).

For the disturbance component suppression processing using the particle filter, a model with 512 mixture components was used as the GMM 130 (see FIG. 2). In this processing, the covariance matrices of the error vector W_t were set to Σ_WN = Σ_WH = Σ_WA = diag(0.01). The total number of particles J used in the processing was set to 20.

The feature amounts used for speech recognition with the estimated clean speech after suppression were 39-dimensional MFCCs (Mel Frequency Cepstrum Coefficients) (12th-order MFCC + C0 + Δ + ΔΔ). An HMM with 16 states and 20 mixture components was used for the recognition acoustic model 109 shown in FIG. 1.

When a commercially available 32-bit CPU (Central Processing Unit) with a clock frequency of 3.2 GHz was used for the processing in this recognition experiment, the processing time was 0.8 times the real-time duration of the observation signal. That is, the recognition processing can be performed in real time.

Tables 1 to 4 show the recognition accuracy obtained in the recognition experiment for each sample, by processing method.

Tables 1 to 4 show that good word recognition accuracy is obtained by performing the suppression processing using the particle filter (Proposed). In particular, for observation signals with a reverberation time of 1.3 seconds, the suppression processing using the particle filter (Proposed) yields high word recognition accuracy.

These experimental results show that the disturbance component suppression processing of the present embodiment improves speech recognition performance in environments subject to non-stationary additive noise and distortion due to reverberation, and that real-time processing is possible.

[Modifications, etc.]
In the present embodiment, particle filter processing is used to suppress disturbance components. Acoustic model adaptation can therefore also be performed before the search that uses the parameters of the estimated clean speech after noise suppression. Acoustic model adaptation makes it possible to use an acoustic model adapted to the estimated clean speech for the search, so recognition accuracy can be expected to improve.

In the present embodiment, a GMM is used as the acoustic model for preprocessing, but an HMM may be used instead. In that case, a state is sampled according to the transition probabilities of the HMM before the element distribution is sampled as shown in equation (20) above.
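The HMM variant can be sketched by adding one sampling stage in front of the element distribution sampling. In the Python sketch below, the transition matrix and the per-state mixture weights are plain nested lists; the function name and this representation are hypothetical, and equation (20) itself is not reproduced.

```python
import random


def sample_state_then_component(prev_state, transitions, mixture_weights, rng=random):
    """Sample the next HMM state, then an element distribution within that state."""
    row = transitions[prev_state]
    # First stage: next state according to the transition probabilities.
    state = rng.choices(range(len(row)), weights=row, k=1)[0]
    # Second stage: element distribution according to that state's mixture weights.
    comp_weights = mixture_weights[state]
    k = rng.choices(range(len(comp_weights)), weights=comp_weights, k=1)[0]
    return state, k
```

Each particle would then carry its sampled state forward to the next frame, so that the state sequence follows the HMM transition structure.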

The embodiment disclosed herein is merely illustrative, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim, with due consideration of the detailed description of the invention, and includes all modifications within the meaning and range equivalent to the wording of the claims.

FIG. 1 is a schematic diagram showing the configuration of the speech recognition system 100 according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing the concept of the GMM 130.
FIG. 3 is a diagram schematically showing the disturbance factor 118.
FIG. 4 is a schematic diagram showing the concept of the state space model 160 of the observation signal.
FIG. 5 is a block diagram showing the configuration of the disturbance component suppression unit 114.
FIG. 6 is a block diagram showing the configuration of the disturbance probability distribution estimation unit 200.
FIG. 7 is a block diagram showing the configuration of the estimated disturbance distribution generation unit 236.
FIG. 8 is a flowchart showing the control structure of the noise suppression processing.
FIG. 9 is a flowchart showing the control structure of the disturbance probability distribution estimation processing.
FIG. 10 is a flowchart showing the control structure of the sampling processing by the Metropolis-Hastings algorithm.
FIG. 11 is a diagram showing an outline of the operation of sampling parameters from the GMM 130.
FIG. 12 is a diagram showing an outline of the processing by the particle filter.

Explanation of symbols

100 Speech recognition system
102 Sound source
104 Preprocessing unit
106 Preprocessing acoustic model unit
108 Language model unit
109 Recognition acoustic model unit
110 Search unit
112 Measurement unit
114 Disturbance component suppression unit
116 Speaker
118 Disturbance factor
120 Clean speech
122 Observation sound
124 Feature amount of observation signal
126 Feature amount of estimated clean speech
130 GMM
132 Learning data storage unit
134 Model learning unit
136 GMM storage unit
160 State space model
200 Disturbance probability distribution estimation unit
202 Parameter generation unit
204 Clean speech estimation unit
220 Frame selection unit
222 Disturbance initial distribution estimation unit
224 Sequential calculation unit
226 GMM sampling unit
230 Update unit
232 Weight calculation unit
234 Resampling unit
236 Estimated disturbance distribution generation unit
262 Re-update unit
264 Weight recalculation unit
266 Acceptance probability calculation unit
268 Random number generation unit
270 Parameter selection unit

Claims

A disturbance component suppression device that suppresses a disturbance component of an observation signal obtained by observing target speech in an environment where additive noise and multiplicative distortion occur due to disturbance, wherein the disturbance component includes a component due to additive noise, a component due to reverberation, and a component due to multiplicative distortion, the device comprising:
disturbance parameter estimation means for receiving feature amounts extracted from frames of a predetermined time length, framed at predetermined intervals from the observation signal, regarding the component due to the reverberation as an additive disturbance, and sequentially generating, for each frame, estimation parameters of a probability distribution representing the disturbance by using a particle filter having a plurality of particles;
target speech estimation means for calculating an estimated feature amount of the target speech for each frame by using the feature amount of the observation signal, the estimation parameters, and a predetermined acoustic model related to the target speech; and
means for approximating a reverberation component included in the disturbance by the difference between the observation signal and the component due to the additive noise estimated by the disturbance parameter estimation means, and inputting the reverberation component to the disturbance parameter estimation means.

The disturbance component suppression device according to claim 1, wherein the disturbance parameter estimation means includes:
initial parameter setting means for setting an initial distribution of the disturbance and setting initial parameters of the probability distribution representing the disturbance in each of the plurality of particles with a probability according to the initial distribution;
updating means for, based on the acoustic model and the feature amount of the observation signal, regarding the component due to the reverberation as an additive disturbance and using an extended Kalman filter to update the estimation parameters of a preceding first frame in each particle to estimation parameters corresponding to a second frame following the first frame; and
weight calculating means for calculating the weight of each of the plurality of particles in the second frame.

A computer program that, when executed by a computer, causes the computer to operate as the disturbance component suppression device according to claim 1 or 2.

A speech recognition system comprising:
the disturbance component suppression device according to claim 1 or 2; and
speech recognition means for receiving the estimated feature amount of the target speech calculated by the disturbance component suppression device and performing speech recognition on the target speech by using a predetermined acoustic model related to the target speech and a predetermined language model related to a recognition target language.