JPH09230886A

JPH09230886A - Noise-resistant hidden markov model creating method for speech recognition and speech recognition device using the method

Info

Publication number: JPH09230886A
Application number: JP8047551A
Authority: JP
Inventors: Yasuhiro Minami; 泰浩南; Tomoko Matsui; 知子松井; Sadaoki Furui; 貞煕古井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-03-06
Filing date: 1996-03-05
Publication date: 1997-09-05

Abstract

PROBLEM TO BE SOLVED: To obtain a noise-resistant speech HMM (Hidden Markov Model) that is also effective against multiplicative distortion in the linear spectral domain, such as that caused by passing speech through a telephone line. SOLUTION: Noise in the utterance environment is recorded (S1), and an HMM of the noise is generated. The output probability distributions of the noise HMM and of a speech HMM created from speech unaffected by noise or multiplicative distortion are each transformed to the linear spectral domain (S31), and the speech HMM distribution in the linear spectral domain is multiplied by an unknown multiplicative distortion W. The result of this multiplication and the noise HMM distribution in the linear spectral domain are convolved (S322), and the result is inversely transformed (S33) to the original speech HMM domain, yielding an incomplete noise-resistant speech HMM (S3) in which the multiplicative distortion remains an unknown. The likelihood of this incomplete HMM for the input speech is computed, the multiplicative distortion that maximizes this likelihood is estimated (S4), and the estimated value is substituted into the incomplete HMM to obtain the noise-resistant speech HMM (S5).

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field of the Invention] The present invention is used in speech recognition methods that employ a Hidden Markov Model (hereinafter, HMM). It relates to a method of creating a noise-resistant HMM for speech recognition that is suited to recognizing speech with added background noise and speech that has undergone multiplicative distortion, for example by being passed through a telephone line, and to a speech recognition device to which that creation method is applied.

【０００２】[0002]

[Prior Art] A conventional example of a speech recognition method using HMMs is described with reference to FIG. 1. For speech input through speech input means 1, speech recognition means 2 computes the similarity to each speech HMM stored in speech HMM storage unit 3, and a recognition result based on those values is output through recognition result output means 4.

[0003] Conventionally, speech HMMs are generally created from speech recorded under noise-free conditions. Because a speech HMM obtained in this way is unaffected by noise, it is not an appropriate model for computing the similarity to input speech that has been affected by noise in a real, noisy environment, or to input speech that has been distorted by passing through a telephone line, and speech recognition performance degrades significantly.

[0004] On the other hand, speech affected by noise is sometimes recorded in a noisy environment and used to create speech HMMs. Because the variety of noise is enormous, however, obtaining high recognition performance in this way bloats the overall configuration of the speech recognition device. Moreover, creating speech HMMs from training speech recorded in the real environment requires a long recording, for example about 24 hours, and creating the speech HMMs takes, for example, about two months. Recording training speech for each real environment and creating speech HMMs from it therefore could not be done easily or in a short time.

[0005] For this reason, a speech recognition method that creates a noise-resistant speech HMM adapted to the real environment quickly and relatively easily, and that achieves a high recognition rate even in noisy environments, was proposed in Reference 1 (F. Matin, K. Shikano & Y. Minami, "Recognition of Noisy Speech by Composition of Hidden Markov Models", ISSN 1018-4074 (Volume 2), ESCA, pp. 1031-1034). Since the present invention improves on this conventional method, the method of creating the noise-resistant speech HMM used in it is briefly described below.

[0006] In this conventional method, noise is recorded in the real environment, a noise HMM is created from the recorded noise, and this noise HMM is combined in a product space with a speech HMM that is unaffected by noise or multiplicative distortion. The combination is performed as shown in FIG. 2. Cepstrum coefficients are widely used as the acoustic parameters of HMMs for speech recognition. The cepstrum coefficients are related to the log spectrum (log power spectrum) by the cosine transform.

[0007] Both the noise HMM and the speech HMM are assumed to have been created in the cepstrum domain. The output probability distributions of the noise HMM and the speech HMM are each cosine-transformed to obtain their distributions in the log-spectral domain (S311). These log-spectral distributions are then exponential-transformed to obtain the corresponding distributions in the linear spectral domain (S312). The linear-spectral distributions of the noise HMM and the speech HMM are combined by a convolution operation (S32); the combined HMM is log-transformed (S331) and then inverse-cosine-transformed to create the noise-resistant speech HMM in the cepstrum domain (S332).

[0008] Here, the output probability distributions of the speech HMM and the noise HMM are assumed to be represented by mixtures of normal distributions (Gaussian mixtures). Since a normal distribution is described by its mean and covariance, each transform in FIG. 2 transforms the mean and the covariance. The transforms are described further below for the case where the HMM output probability distribution is a normal distribution.

[0009] First, the cepstrum coefficients of order 0 through p are taken as the parameters of the output probability distributions of these HMMs and written as

$$C = (C_0\; C_1\; C_2\; \cdots\; C_{p-1}\; C_p) \tag{1}$$

$$D = (D_0\; D_1\; D_2\; \cdots\; D_{p-1}\; D_p) \tag{2}$$

where C is the mean of a speech HMM distribution and D is the mean of a noise HMM distribution. The conversion from cepstrum coefficients to the log spectrum is known as the cosine transform. It is a linear transform and is represented by a (p+1)×m transform matrix (COS), where m is the order of the log spectrum. Denoting the log-spectral mean vectors of the speech HMM and noise HMM distributions by LC and LD respectively,

$$LC = C\,(\mathrm{COS}) \tag{3}$$

$$LD = D\,(\mathrm{COS}) \tag{4}$$

(FIG. 2, S311). If the covariances of the speech HMM and noise HMM distributions in the cepstrum domain are $\Sigma^{C}$ and $\Sigma^{D}$ respectively, the covariances $\Sigma^{LC}$ and $\Sigma^{LD}$ of these distributions in the log-spectral domain are

$$\Sigma^{LC} = (\mathrm{COS})\,\Sigma^{C}\,(\mathrm{COS})^{t} \tag{5}$$

$$\Sigma^{LD} = (\mathrm{COS})\,\Sigma^{D}\,(\mathrm{COS})^{t} \tag{6}$$

where $^{t}$ denotes the transpose. In this way the log-spectral means and covariances of the normal distributions of the speech HMM and the noise HMM are obtained (FIG. 2, S311).
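The cosine-transform step (S311) is a linear map applied to a Gaussian, so the mean and the covariance can be pushed through it directly. The following is a minimal sketch under stated assumptions, not the patent's implementation: the text only names the matrix (COS), so the DCT-style definition in `cosine_matrix` is hypothetical, and the sketch uses a matrix of shape m × (p+1) acting on column vectors (the transpose of the row-vector convention in equations (3) and (4)) so that the shapes in the covariance product line up. All function names and numbers are illustrative.

```python
import math

def cosine_matrix(p, m):
    """Hypothetical m x (p+1) matrix mapping cepstra c_0..c_p to m log-spectral bins.

    The source does not define (COS); this DCT-style form is an assumption.
    """
    return [[1.0 if k == 0 else 2.0 * math.cos(math.pi * k * (b + 0.5) / m)
             for k in range(p + 1)]
            for b in range(m)]

def matvec(a, x):
    """Matrix times column vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def matmul(a, b):
    """Matrix times matrix."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def cepstrum_to_log_spectrum(mean_c, cov_c, cos):
    """Linear map of a Gaussian: mean -> A c, covariance -> A Sigma A^T."""
    mean_l = matvec(cos, mean_c)
    cov_l = matmul(matmul(cos, cov_c), transpose(cos))
    return mean_l, cov_l

# Illustrative values only: p = 2 (three cepstra), m = 4 log-spectral bins.
cos = cosine_matrix(p=2, m=4)
mean_l, cov_l = cepstrum_to_log_spectrum(
    [1.0, 0.5, -0.2],
    [[0.3, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.1]],
    cos)
```

Even when the cepstral covariance is diagonal, as in this example, the log-spectral covariance is generally a full matrix, which is why the later steps index it as $\Sigma_{ij}$.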

[0010] Next, the exponential transform that converts the log spectrum into the linear spectrum is described. The result of this transform is not a normal distribution, but it is approximated by one. When the log-spectral means LC, LD and covariances $\Sigma^{LC}$, $\Sigma^{LD}$ are exponential-transformed, the means SC, SD and covariances $\Sigma^{SC}$, $\Sigma^{SD}$ of the resulting distributions are

$$SC^{i} = \exp\!\left(LC^{i} + \Sigma^{LC}_{ii}/2\right) \tag{7}$$

$$SD^{i} = \exp\!\left(LD^{i} + \Sigma^{LD}_{ii}/2\right) \tag{8}$$

$$\Sigma^{SC}_{ij} = SC^{i} \times SC^{j} \times \left\{\exp\!\left(\Sigma^{LC}_{ij}\right) - 1\right\} \tag{9}$$

$$\Sigma^{SD}_{ij} = SD^{i} \times SD^{j} \times \left\{\exp\!\left(\Sigma^{LD}_{ij}\right) - 1\right\} \tag{10}$$

with $i, j = 0, 1, 2, \ldots, p$ (S312).
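Equations (7) to (10) are the standard moments of a log-normal variable, used here to approximate the exponentiated Gaussian by another Gaussian. A minimal sketch, assuming plain Python lists for the mean vector and covariance matrix (the function name `log_to_linear` and the sample numbers are illustrative, not from the source):

```python
import math

def log_to_linear(mean_l, cov_l):
    """Moments of exp(x) for x ~ N(mean_l, cov_l), as in equations (7)-(10)."""
    n = len(mean_l)
    # Mean of each linear-spectral component uses the diagonal variance.
    mean_s = [math.exp(mean_l[i] + cov_l[i][i] / 2.0) for i in range(n)]
    # Covariance between components i and j.
    cov_s = [[mean_s[i] * mean_s[j] * (math.exp(cov_l[i][j]) - 1.0)
              for j in range(n)]
             for i in range(n)]
    return mean_s, cov_s

# Illustrative two-bin example.
mean_s, cov_s = log_to_linear([0.0, 1.0], [[0.5, 0.1], [0.1, 0.2]])
```

Note that the linear-domain mean grows with the log-domain variance, so the mean and the covariance cannot be transformed independently of each other.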

[0011] Ambient noise is additive, and speech and noise can be added in the linear spectral domain. The mean M and covariance $\Sigma^{M}$ of the noise-resistant speech HMM distribution, which is the sum of the speech HMM and noise HMM distributions in the linear spectral domain, are therefore obtained by the following equations (FIG. 2, S32):

$$M^{i} = SC^{i} + SD^{i} \tag{11}$$

$$\Sigma^{M}_{ij} = \Sigma^{SC}_{ij} + \Sigma^{SD}_{ij} \tag{12}$$

The mean $M^{i}$ and covariance $\Sigma^{M}$ obtained in this way are converted back to the cepstrum domain by reversing the steps so far. First, the log transform, which is the inverse of the exponential transform, is applied. Denoting the log-transformed mean by LM and covariance by $\Sigma^{LM}$, and since this is the inverse of the exponential transform,

$$LM^{i} = \log\!\left(M^{i}\right) - \tfrac{1}{2}\log\!\left(\Sigma^{M}_{ii}/(M^{i})^{2} + 1\right) \tag{13}$$

$$\Sigma^{LM}_{ij} = \log\!\left(\Sigma^{M}_{ij}/(M^{i}M^{j}) + 1\right) \tag{14}$$

are computed (FIG. 2, S331). Further, the inverse cosine transform (COS′), an m×(p+1) matrix, converts the log spectrum to the cepstrum domain, and the mean S and covariance $\Sigma^{S}$ of the output probability distribution of the noise-resistant speech HMM are obtained by the following equations (FIG. 2, S332).

[0012]

$$S = LM\,(\mathrm{COS}') \tag{15}$$

$$\Sigma^{S} = (\mathrm{COS}')\,\Sigma^{LM}\,(\mathrm{COS}')^{t} \tag{16}$$

When each distribution is a single normal distribution, the above transforms are applied to the two distributions. When the distributions are mixtures of normal distributions, the above transforms are applied to every combination of component distributions. For example, if the speech HMM is a mixture of three normal distributions and the noise HMM is a mixture of three normal distributions, the noise-resistant speech HMM is a mixture of 3 × 3 = 9 normal distributions.
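The linear-domain addition of equations (11) and (12) and the log transform of equations (13) and (14) can be sketched as follows. This is an illustrative sketch with hypothetical function names; it assumes the two Gaussians are independent (so that their covariances add) and omits the final inverse cosine transform of equations (15) and (16), which is again a plain linear map.

```python
import math

def add_in_linear_domain(mean_sc, cov_sc, mean_sd, cov_sd):
    """Equations (11) and (12): means and covariances add for speech + noise."""
    n = len(mean_sc)
    mean_m = [mean_sc[i] + mean_sd[i] for i in range(n)]
    cov_m = [[cov_sc[i][j] + cov_sd[i][j] for j in range(n)] for i in range(n)]
    return mean_m, cov_m

def linear_to_log(mean_m, cov_m):
    """Equations (13) and (14): inverse of the log-normal moment formulas."""
    n = len(mean_m)
    mean_lm = [math.log(mean_m[i])
               - 0.5 * math.log(cov_m[i][i] / mean_m[i] ** 2 + 1.0)
               for i in range(n)]
    cov_lm = [[math.log(cov_m[i][j] / (mean_m[i] * mean_m[j]) + 1.0)
               for j in range(n)]
              for i in range(n)]
    return mean_lm, cov_lm

# Illustrative one-bin example: combine a speech and a noise component.
mean_m, cov_m = add_in_linear_domain([2.0], [[0.4]], [1.0], [[0.1]])
mean_lm, cov_lm = linear_to_log(mean_m, cov_m)
```

For mixtures, the same two calls would be repeated for every (speech component, noise component) pair, which is where the 3 × 3 = 9 components mentioned above come from.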

[0013] The description above took one distribution from each of the speech and noise HMMs and described how they are combined. A speech HMM can usually be represented by a model of about three states A, B, C that transitions from right to left, as shown at the upper left of FIG. 3. For the noise model, an ergodic HMM that moves between two states 1 and 2, as shown at the upper right of FIG. 3, is suitable. In this case the noise-resistant speech HMM is a product model, as shown at the bottom of FIG. 3, with six states 1A, 1B, 1C, 2A, 2B, 2C, each of which is a combination of a speech HMM state and a noise HMM state. For example, state 1A is formed from state A of the speech HMM and state 1 of the noise HMM, and the output distribution in state 1A is $P_A * P_1$. Here $P_A$ is the distribution in state A of the speech HMM, $P_1$ is the distribution in state 1 of the noise HMM, and $*$ denotes the transform described with reference to FIG. 2. The same operation is performed for each of the states 1B, 1C and 2A to 2C, using the distributions of the corresponding speech HMM and noise HMM states. Furthermore, the transition probabilities between states of the noise-resistant speech HMM are products of the speech HMM transition probabilities and the noise HMM transition probabilities, as shown at the bottom of FIG. 3. For example, the transition probability from state 1A to state 1B is the product $a_{AB} \times a_{11}$ of the transition probability $a_{AB}$ from state A to state B of the speech HMM and the transition probability $a_{11}$ from state 1 to state 1 of the noise HMM.
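The product model can be sketched by enumerating state pairs and multiplying transition probabilities. This is a minimal illustration: the state names follow FIG. 3 as described above, but every probability value below is hypothetical, and the speech model is simplified so that its last state C only loops on itself.

```python
from itertools import product

def product_hmm(noise_states, noise_trans, speech_states, speech_trans):
    """Build product states (e.g. "1A") and transition probabilities.

    Transitions missing from speech_trans are treated as probability 0.
    """
    pairs = list(product(noise_states, speech_states))
    states = [n + s for n, s in pairs]
    trans = {}
    for (n1, s1), (n2, s2) in product(pairs, repeat=2):
        trans[(n1 + s1, n2 + s2)] = (noise_trans[(n1, n2)]
                                     * speech_trans.get((s1, s2), 0.0))
    return states, trans

# Hypothetical 2-state ergodic noise HMM.
noise_trans = {("1", "1"): 0.9, ("1", "2"): 0.1,
               ("2", "1"): 0.2, ("2", "2"): 0.8}
# Hypothetical 3-state speech HMM with self-loops and forward transitions.
speech_trans = {("A", "A"): 0.6, ("A", "B"): 0.4,
                ("B", "B"): 0.7, ("B", "C"): 0.3,
                ("C", "C"): 1.0}

states, trans = product_hmm(["1", "2"], noise_trans,
                            ["A", "B", "C"], speech_trans)
```

With these numbers, `trans[("1A", "1B")]` is the product of the speech transition A to B and the noise self-transition of state 1, matching the $a_{AB} \times a_{11}$ example in the text, and each row of the product transition table still sums to one.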

[0014] This conventional method does yield a speech HMM that is somewhat robust to additive noise. Moreover, an existing set of speech HMMs can be used, while the noise HMM only requires recording noise in the real environment for a short time, for example 5 to 6 seconds and at most about 20 seconds, and can be created from the recorded noise in about 1 second. The time from recording the environmental noise, creating the noise HMM, and combining it with the speech HMMs as described above, to recognizing input speech with the resulting noise-resistant speech HMM is short, about one minute (and much shorter with high-speed computing means).

【００１５】[0015]

[Problem to be Solved by the Invention] However, this conventional method has no effect at all on multiplicative distortion in the linear spectral domain, which arises in a speech signal when, for example, it is passed through a telephone line, and it was not sufficient against additive noise either. An object of the present invention is to provide a method of creating a noise-resistant hidden Markov model for speech recognition that achieves a high recognition rate not only for additive noise but also for speech with multiplicative distortion in the linear spectral domain, and that can be created in a short time and relatively easily.

[0016] Another object of the present invention is to provide a speech recognition device that achieves a high recognition rate not only for additive noise but also for speech with multiplicative distortion in the linear spectral domain, and that can create the noise-resistant speech HMM relatively easily and in a short time.

【００１７】[0017]

[Means for Solving the Problem] According to the method of the present invention, a first step creates an incomplete noise-resistant speech HMM, which contains the multiplicative distortion in the linear spectral domain or the S/N (signal-to-noise ratio) as an unknown (variable), from a speech HMM that is unaffected by noise or multiplicative distortion (hereinafter collectively called noise) and an HMM created from noise (noise HMM). A second step estimates the multiplicative distortion or S/N that maximizes the likelihood of the incomplete noise-resistant speech HMM for the input speech. A third step substitutes the estimated value into the incomplete noise-resistant speech HMM to complete the noise-resistant speech HMM.

[0018] The first step obtains the incomplete noise-resistant speech HMM by combining the speech HMM and the noise HMM in a product space. In this combination, a fourth step transforms the output probability distributions of the speech HMM and the noise HMM to the linear spectral domain, a fifth step convolves these linear-spectral output probability distributions with each other, and a sixth step inversely transforms the convolution result to the original speech HMM domain.

[0019] When the output probability distributions of the speech HMM and the noise HMM are expressed in the cepstrum domain, they are cosine-transformed and then exponential-transformed, and the sixth step log-transforms the convolution result and then inverse-cosine-transforms it. When the output probability distributions of the speech HMM and the noise HMM are expressed in the log-spectral domain, the fourth step exponential-transforms each of them, and the sixth step log-transforms the convolution result.

[0020] The combination in the product space is performed by convolving the output probability distributions of the speech HMM and the noise HMM, each expressed in the linear spectral domain. In the second step, various multiplicative distortions or S/N values are given to the incomplete noise-resistant speech HMM, the likelihood of each resulting HMM for the input speech is computed, and the multiplicative distortion or S/N that gives the maximum likelihood is taken as the estimate.
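The candidate-search form of the second step can be illustrated with a deliberately small toy. The sketch below is an assumption-laden simplification, not the patent's procedure: it replaces the HMM set with a single scalar Gaussian evaluated in the linear domain, whose mean and variance depend on a candidate distortion `w` as `w * sc + sd` and `w * w * var_sc + var_sd`. All names and numbers are hypothetical.

```python
import math

def gaussian_log_likelihood(xs, mean, var):
    """Log-likelihood of independent samples xs under N(mean, var)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
               for x in xs)

def grid_search_distortion(xs, candidates, sc, var_sc, sd, var_sd):
    """Return the candidate distortion w with the highest likelihood."""
    def score(w):
        # Toy "incomplete model": speech scaled by w, plus additive noise.
        return gaussian_log_likelihood(xs, w * sc + sd,
                                       w * w * var_sc + var_sd)
    return max(candidates, key=score)

# Hypothetical observations and model parameters.
observations = [2.1, 1.9, 2.0, 2.2, 1.8]
best_w = grid_search_distortion(observations, [0.25, 0.5, 1.0, 2.0],
                                sc=2.0, var_sc=0.1, sd=1.0, var_sd=0.05)
```

In a real system the score would be the likelihood of the input speech under the full set of incomplete noise-resistant speech HMMs, and the cost of the search grows with the number of candidates tried.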

[0021] The second step performs the estimation by iterative computation using the maximum likelihood estimation method or the steepest descent method. The noise HMM used in the first step is created by recording environmental noise. The convolution of the speech HMM distribution and the noise HMM distribution in the linear spectral domain is performed after multiplying the linear spectrum of the speech HMM distribution by the unknown multiplicative distortion.

[0022] The convolution of the speech HMM distribution and the noise HMM distribution in the linear spectral domain is performed after multiplying the linear spectrum of the speech HMM distribution by $10^{-(S/N)/2}$, or after multiplying the linear spectrum of the noise HMM distribution by $10^{(S/N)/2}$ (S/N being the unknown). Alternatively, the combination in the product space is performed after adding an unknown multiplicative distortion component to each cepstrum coefficient of the output probability distribution of the speech HMM expressed in the cepstrum domain.

[0023] The combination in the product space is performed after adding an unknown S/N component to the 0th-order cepstrum coefficient of the output probability distribution of one of the speech HMM and the noise HMM expressed in the cepstrum domain. Alternatively, the combination in the product space is performed after adding an unknown multiplicative distortion component to the log spectrum of the output probability distribution of the speech HMM expressed in the log-spectral domain.

[0024] The combination in the product space is performed after adding an unknown S/N component to the log spectrum of the output probability distribution of one of the speech HMM and the noise HMM expressed in the log-spectral domain.

The speech recognition device of the present invention recognizes input speech using a noise-resistant speech HMM created from a speech HMM and a noise HMM by the method of the invention. A set of speech HMMs unaffected by noise or multiplicative distortion is stored in a speech HMM storage unit. Ambient noise is input through noise input means, and noise HMM creating means creates a noise HMM from the input noise and stores it in a noise HMM storage unit. Incomplete noise-resistant speech HMM combining means combines, in a product space, the stored noise HMM, the speech HMMs in the speech HMM storage unit, and the S/N or multiplicative distortion from an S/N or multiplicative distortion storage unit, with the S/N or multiplicative distortion as an unknown, to create an incomplete noise-resistant speech HMM. S/N or multiplicative distortion estimating means estimates the unknown of this incomplete noise-resistant speech HMM so that the likelihood of the incomplete noise-resistant speech HMM for the input speech from speech input means is maximized. Noise-resistant speech HMM completing means substitutes the estimated S/N or multiplicative distortion value into the incomplete noise-resistant speech HMM, and the completed noise-resistant speech HMM is stored in a noise-resistant speech HMM storage unit. Speech recognition means computes the similarity between the input speech and this noise-resistant speech HMM, and a recognition result is output based on the computed result.

【００２５】[0025]

[Embodiments of the Invention] FIG. 4 shows an example of the functional configuration of a speech recognition device according to the present invention, and FIG. 5A shows the processing procedure in an example of the method of the invention. The speech HMM storage unit 3 stores a set of speech HMMs created in advance from speech recorded under noise-free conditions. Environmental noise is recorded through noise input means 5 (S1), and noise HMM creating means 6 creates a noise HMM from the recorded environmental noise (S2). The noise HMM can be created by the same technique as a speech HMM, which is described, for example, in S. E. Levinson, L. R. Rabiner, and M. M. Sondhi: "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition", The Bell System Technical Journal, Vol. 62, No. 4 (April 1983). The created noise HMM is stored in the noise HMM storage unit 7. The noise HMM stored in the noise HMM storage unit 7 is expressed with the same feature parameters as the speech HMMs stored in the speech HMM storage unit 3; that is, if the speech HMMs are expressed in, for example, the cepstrum domain, the noise HMM is also expressed in the cepstrum domain.

[0026] Incomplete noise-resistant speech HMM combining means 8 creates, from the noise HMM in the noise HMM storage unit 7 and the speech HMMs in the speech HMM storage unit 3, an incomplete noise-resistant speech HMM that contains, as an unknown, the multiplicative distortion in the linear spectral domain or the S/N held in the S/N or multiplicative distortion storage unit 9 (S3). The incomplete noise-resistant speech HMM is created by combining the noise HMM and the speech HMM in a product space; in doing so, the unknown multiplicative distortion is multiplied into or added to the speech HMM, or the unknown S/N is multiplied into or added to the speech HMM or the noise HMM.

[0027] The combination of the noise HMM and the speech HMM in the product space is performed, for example, in the same way as in the conventional technique, as shown in FIG. 5B. The output probability distributions of the noise HMM and the speech HMM are transformed to the linear spectral domain (S31), and these linear-spectral output probability distributions are convolved. In doing so, either the linear spectrum of the speech HMM output probability distribution is multiplied by the multiplicative distortion (unknown), or the linear spectrum of the output probability distribution of one of the speech HMM and the noise HMM is multiplied by a component corresponding to the S/N (unknown) (S321). The multiplied output probability distribution and the unmultiplied output probability distribution are then convolved (S322), and the convolution result is inversely transformed to the original domain to obtain the incomplete noise-resistant speech HMM (S33).

As in the description of the conventional method shown in FIG. 2, both the speech HMM and the noise HMM are assumed to have been created in the cepstrum domain. FIG. 6 shows a more specific example of the method of the invention, with parts corresponding to FIG. 2 given the same symbols. The means C and D of the output probability distributions of the speech HMM and the noise HMM are given by equations (1) and (2). They are cosine-transformed into the log spectra LC and LD respectively, and the covariances of the distributions are likewise transformed into $\Sigma^{LC}$ and $\Sigma^{LD}$ (S311). These are then transformed into the linear-spectral means $SC^{i}$, $SD^{i}$ and covariances $\Sigma^{SC}_{ij}$, $\Sigma^{SD}_{ij}$ (S312).

In the field of communications it is known that when speech S uttered in the presence of environmental noise N is picked up by a microphone and passed through a transmission channel, the output X is expressed as X = WS + N, where W is the distortion introduced by the microphone and the transmission channel. In view of this, in this embodiment the output probability distribution of the speech HMM is multiplied by the unknown multiplicative distortion W held in the multiplicative distortion storage unit 9 (FIG. 4) (S321), and the multiplied distribution is convolved with the noise HMM distribution (S322). That is, the following equations are computed in place of equations (11) and (12).

【００２８】
$$M^i = W_i\,SC^i + SD^i \tag{17}$$
$$\Sigma^{M}_{ij} = W_i W_j\,\Sigma^{SC}_{ij} + \Sigma^{SD}_{ij} \tag{18}$$
As in the case of FIG. 2, the results $M^i$ and $\Sigma^{M}_{ij}$ of this convolution operation are logarithmically transformed by equations (13) and (14) (S₃₃₁) and then inverse-cosine-transformed by equations (15) and (16). This yields an incomplete noise-resistant speech HMM in the original cepstrum domain that contains the multiplicative distortion as an unknown (S₃₃₂).
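As a minimal sketch of equations (17) and (18), the following Python function combines the linear-spectral-domain mean and covariance of one speech state with those of one noise state for a candidate distortion vector $W$. The function name `compose_with_distortion`, the list-of-lists layout, and the toy numbers are assumptions made for illustration; the logarithmic and inverse cosine transforms of equations (13) to (16) are not included.

```python
def compose_with_distortion(w, sc_mean, sc_cov, sd_mean, sd_cov):
    """Combine speech and noise statistics in the linear spectral domain.

    w       : candidate multiplicative distortion, one value per channel
    sc_mean : speech mean SC^i;  sc_cov : speech covariance Sigma^SC_ij
    sd_mean : noise mean SD^i;   sd_cov : noise covariance Sigma^SD_ij
    Returns (mean, cov) following equations (17) and (18).
    """
    n = len(w)
    # Equation (17): M^i = W_i * SC^i + SD^i
    mean = [w[i] * sc_mean[i] + sd_mean[i] for i in range(n)]
    # Equation (18): Sigma^M_ij = W_i * W_j * Sigma^SC_ij + Sigma^SD_ij
    cov = [[w[i] * w[j] * sc_cov[i][j] + sd_cov[i][j] for j in range(n)]
           for i in range(n)]
    return mean, cov


# Toy two-channel example (all numbers are made up for illustration).
mean, cov = compose_with_distortion(
    w=[2.0, 0.5],
    sc_mean=[4.0, 8.0], sc_cov=[[1.0, 0.2], [0.2, 2.0]],
    sd_mean=[1.0, 1.0], sd_cov=[[0.5, 0.0], [0.0, 0.5]],
)
```

Because $W$ enters only through these two formulas, the same function can be re-evaluated for each trial value of $W$ during estimation.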

【００２９】Next, the multiplicative distortion $W$ ($= W_1, \ldots, W_m$), which is the unknown in the incomplete noise-resistant speech HMM obtained in this way, is estimated (FIG. 5A, S₄). To this end, speech is input from the speech input means 1 (FIG. 4), and the S/N or multiplicative distortion estimation means 10 estimates, for the input speech sequence $X$, the multiplicative distortion $W$ that maximizes the likelihood $P(X \mid M(W))$ of the set $M(W)$ of incomplete noise-resistant speech HMMs. This estimate can be obtained by iterative computation using the steepest descent method or the maximum likelihood estimation method.

【００３０】That is, to maximize the likelihood $P(X \mid M(W))$ by the steepest descent method, the following iteration is carried out.
1. Initialize $W$.
2. Estimate the next $W^t$ using $W^t = W^{t-1} + \varepsilon\,\partial P(X \mid M(W)) / \partial W$.
3. Update $W^{t-1}$ with $W^t$.

【００３１】4. Repeat steps 2 and 3 until convergence.
A suitably small value is used for $\varepsilon$.
To maximize the likelihood $P(X \mid M(W))$ by the maximum likelihood estimation method, proceed as follows. In general, the following relation holds between the trellis likelihood and the Viterbi likelihood:
$$P(X \mid M(W)) = \sum_{S} P(X, S \mid M(W))$$
where the sum runs over all $S$, and $S$ denotes a state transition. Further, writing $W'$ for the updated $W$, $Q(W, W')$ is defined as follows.

【００３２】
$$Q(W, W') = \sum_{S} P(X, S \mid M(W)) \log P(X, S \mid M(W'))$$
where the sum runs over all $S$. If $Q(W, W') > Q(W, W)$, then $P(X \mid M(W')) > P(X \mid M(W))$ holds. A method of estimating $W$ that uses this principle together with maximum likelihood estimation is given next.
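The implication between $Q$ and the likelihood can be checked with a short derivation. The following is a sketch based on Jensen's inequality; the shorthand $P_W(\cdot) = P(\cdot \mid M(W))$ is introduced only for this derivation and does not appear in the original text.

```latex
\begin{align*}
\log \frac{P_{W'}(X)}{P_{W}(X)}
  &= \log \sum_{S} \frac{P_{W}(X,S)}{P_{W}(X)} \cdot \frac{P_{W'}(X,S)}{P_{W}(X,S)} \\
  &\ge \sum_{S} \frac{P_{W}(X,S)}{P_{W}(X)} \log \frac{P_{W'}(X,S)}{P_{W}(X,S)} \\
  &= \frac{Q(W,W') - Q(W,W)}{P_{W}(X)}
\end{align*}
```

Since $P_W(X) > 0$, a positive difference $Q(W,W') - Q(W,W)$ makes the left-hand side positive, which is the stated inequality $P(X \mid M(W')) > P(X \mid M(W))$.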

【００３３】
1. Initialize $W$.
2. Estimate, by the maximum likelihood estimation method, the $W^t$ that maximizes $Q(W^{t-1}, W^t)$.
3. Update $W^{t-1}$ with $W^t$.
4. Repeat steps 2 and 3 until convergence.

The value of the multiplicative distortion $W$ estimated in this way is substituted into each incomplete noise-resistant speech HMM by the HMM completion means 11 (FIG. 4) to complete the noise-resistant speech HMMs, which are stored in the noise-resistant speech HMM storage unit 12 (FIG. 5A, S₅). To recognize unknown speech, the unknown speech is input from the speech input means 1, the speech recognition means 2 calculates its similarity to each noise-resistant speech HMM stored in the noise-resistant speech HMM storage unit 12, and the recognition result is output based on that calculation. The input speech $X$ used to estimate the multiplicative distortion need not be training speech; it may be the unknown speech that is to be recognized. In the latter case, the recognition processing described above is applied to that unknown speech after the multiplicative distortion has been estimated.
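The steepest-descent iteration described in 【００３０】 and 【００３１】 can be sketched as below. The text calls the method steepest descent, but the update adds the gradient, so the sketch climbs the likelihood. Everything here is an illustrative assumption: `toy_likelihood` is a hypothetical smooth stand-in for $P(X \mid M(W))$, the gradient is approximated by central differences, and the convergence test on the update size is one possible choice.

```python
import math


def steepest_ascent(likelihood, w0, eps=0.01, tol=1e-8, max_iter=10000, h=1e-6):
    """Iterate W^t = W^(t-1) + eps * dP/dW until the update is negligible."""
    w = list(w0)                                   # step 1: initialise W
    for _ in range(max_iter):
        grad = []
        for i in range(len(w)):                    # central-difference gradient
            up = w[:i] + [w[i] + h] + w[i + 1:]
            dn = w[:i] + [w[i] - h] + w[i + 1:]
            grad.append((likelihood(up) - likelihood(dn)) / (2 * h))
        w_next = [wi + eps * gi for wi, gi in zip(w, grad)]    # step 2
        done = max(abs(a - b) for a, b in zip(w_next, w)) < tol
        w = w_next                                 # step 3: W^(t-1) <- W^t
        if done:                                   # step 4: stop at convergence
            break
    return w


def toy_likelihood(w):
    # Hypothetical stand-in for P(X | M(W)), peaked at W = (1.5, 0.8).
    return math.exp(-((w[0] - 1.5) ** 2 + (w[1] - 0.8) ** 2))


w_hat = steepest_ascent(toy_likelihood, [1.0, 1.0], eps=0.1)
```

In a real system the likelihood would be evaluated by scoring the input speech against the incomplete noise-resistant speech HMMs, which is far more expensive than this toy function, so the step size and stopping rule would need tuning.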

【００３４】In the specific example above, the speech HMM and the noise HMM were taken to be in the cepstrum domain, but the invention can also be applied to HMMs created in the logarithmic spectrum (logarithmic power spectrum) domain. In this case, as shown in FIG. 7, the exponential transformation of step S₃₁₂ in FIG. 6 is applied to the noise HMM distribution and the speech HMM distribution in the logarithmic spectrum domain to obtain the corresponding distributions in the linear spectral domain (S₃₁). The linear spectrum of the speech HMM distribution is multiplied by the multiplicative distortion $W$ (S₃₂₁), a convolution operation is performed between the multiplied speech HMM distribution and the noise HMM distribution (S₃₂₂), and the result is logarithmically transformed to obtain the incomplete noise-resistant speech HMM (S₃₃).

【００３５】Similarly, when the speech HMM and the noise HMM have each been created in the linear spectral domain, the incomplete noise-resistant speech HMM is obtained directly by convolving their distributions. Furthermore, although the multiplicative distortion $W$ was introduced above as the unknown in the incomplete noise-resistant speech HMM, the S/N may be introduced instead. In that case, the S/N is introduced into the mean $M^i$ and the covariance $\Sigma^{M}_{ij}$ of the distribution of the incomplete noise-resistant speech HMM, obtained by convolving the speech HMM distribution with the noise HMM distribution, as in the following equations.

【００３６】
$$M^i = SC^i + k\,SD^i \tag{19}$$
$$\Sigma^{M}_{ij} = \Sigma^{SC}_{ij} + k^2\,\Sigma^{SD}_{ij} \tag{20}$$
$$k = 10^{-((S/N)/2)} \tag{21}$$
In this case, $k$ rather than $W$ is stored in the S/N or multiplicative distortion storage unit 9 in FIG. 4, and the multiplication in step S₃₂₁ of FIGS. 5B, 6, and 7 multiplies the noise HMM distribution (or, alternatively, the speech HMM distribution) by $k$. That is, equations (19), (20), and (21) are computed. Further, the S/N or multiplicative distortion estimation means 10 in FIG. 4 selects the S/N so as to maximize the likelihood $P(X \mid M(S/N))$. That is, for the input speech sequence $X$, it estimates the S/N that maximizes the likelihood of the set $M(S/N)$ of incomplete noise-resistant speech HMMs, which is a function of the S/N. Training speech or recognition-target speech uttered in the environment in which the noise HMM was created should be used as this input speech. This S/N estimate can likewise be obtained by iterative computation, using the maximum likelihood estimation method or the steepest descent method to find the S/N that maximizes $P(X \mid M(S/N))$. Another example of S/N estimation is shown in FIG. 8. $N$ values of the S/N, $(S/N)_1$ to $(S/N)_N$, are prepared and substituted into equations (19), (20), and (21) to create $N$ incomplete noise-resistant speech HMMs. The likelihood $P(X \mid M(S/N))$ of each of these $N$ HMMs for the input speech is calculated, and the S/N giving the largest of the $N$ likelihoods is selected.
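Equations (19) to (21) can be sketched in the same way as the multiplicative-distortion case. The function names `snr_scale` and `compose_with_snr` and the toy numbers are assumptions for illustration. The exponent in `snr_scale` is taken exactly as equation (21) is printed; the unit of the S/N value is not stated at this point in the text, so the sketch makes no assumption about it.

```python
def snr_scale(snr):
    """Equation (21) as printed: k = 10 ** (-((S/N) / 2))."""
    return 10.0 ** (-(snr / 2.0))


def compose_with_snr(snr, sc_mean, sc_cov, sd_mean, sd_cov):
    """Scale the noise statistics by k and add them to the speech statistics."""
    k = snr_scale(snr)
    n = len(sc_mean)
    # Equation (19): M^i = SC^i + k * SD^i
    mean = [sc_mean[i] + k * sd_mean[i] for i in range(n)]
    # Equation (20): Sigma^M_ij = Sigma^SC_ij + k^2 * Sigma^SD_ij
    cov = [[sc_cov[i][j] + k * k * sd_cov[i][j] for j in range(n)]
           for i in range(n)]
    return mean, cov


# Toy single-channel example (numbers are made up for illustration).
k = snr_scale(2.0)
mean, cov = compose_with_snr(2.0, [4.0], [[1.0]], [1.0], [[0.5]])
```

Here the noise side is scaled, matching the default described in the text; scaling the speech side instead would move `k` onto `sc_mean` and `sc_cov`.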

【００３７】As $(S/N)_1$ to $(S/N)_N$, for example, the ratio $(S+N)/N$ of the sum of the signal $S$ and the noise $N$ to the noise $N$ is measured in the noise recording environment used to create the noise HMM. This value and the two values ±3 dB from it are used as three candidates, the likelihood $P(X \mid M(S/N))$ is obtained for each of the three, and the S/N that gives the maximum is chosen. A formulation that maximizes the Viterbi-algorithm likelihood $P(X, S \mid M(S/N))$ instead of $P(X \mid M(S/N))$ is also possible, where $S$ denotes the state transitions of the HMM. The multiplicative distortion $W$ may also be estimated in the manner shown in FIG. 8.
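The candidate-selection procedure of FIG. 8 amounts to scoring a small set of S/N values and keeping the best one. The sketch below shows that selection with three candidates built from a measured value and ±3 dB. The measured value of 12 dB and the function `toy_likelihood` are hypothetical; in practice the score would come from evaluating the input speech against the HMMs built for each candidate.

```python
def select_snr(candidates, likelihood):
    """Return the candidate S/N whose model scores the input speech best."""
    return max(candidates, key=likelihood)


measured = 12.0                      # hypothetical measured (S+N)/N in dB
candidates = [measured - 3.0, measured, measured + 3.0]


def toy_likelihood(snr):
    # Hypothetical stand-in for P(X | M(S/N)); largest near 14 dB.
    return -(snr - 14.0) ** 2


best = select_snr(candidates, toy_likelihood)
```

With only a handful of candidates this avoids any gradient computation, at the cost of resolution limited to the spacing of the candidate values.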

【００３８】In the above, the multiplicative distortion $W$ or the S/N was introduced into the incomplete noise-resistant speech HMM in the linear spectral domain. Alternatively, in the cepstrum domain, $W$ may be added to equation (1), that is, $C + W = (C_0 + W_0, C_1 + W_1, \ldots, C_p + W_p)$ may be obtained, and the speech HMM with $W$ added may be synthesized with the noise HMM in the product space. Also, $C + k = (C_0 + k, C_1, \ldots, C_p)$ may be obtained, where $k = \alpha \log(10^{(S/N)/2})$ and $\alpha$ is a constant determined by the cosine transform, and the speech HMM with $k$ added may be synthesized with the noise HMM in the product space; or $k$ may be added to the noise HMM, which is then synthesized in the product space with the speech HMM to which $k$ has not been added. Furthermore, $W$ or the S/N may be introduced in the logarithmic spectrum domain. That is, $LC + W = C(\mathrm{COS}) + W$ may be obtained and synthesized with $LD$ in the product space; or $LC + k = (LC_0 + k, LC_1 + k, \ldots, LC_P + k)$, with $k = \log(10^{(S/N)/2})$, may be obtained and synthesized with $LD$ in the product space; or $LC + k$ may be obtained and synthesized with $LD$ in the product space.
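One way to see why a uniform S/N offset touches only the zeroth cepstral coefficient is that a constant added to every logarithmic spectrum channel maps, under a cosine transform, to a change in the zeroth-order term alone. The sketch below checks this numerically. It assumes an unnormalized DCT-II as the cosine transform, which may be scaled differently from the transform defined in the patent; under this assumption the scaling constant equals the number of channels.

```python
import math


def dct(x):
    """Unnormalised DCT-II (an assumed form of the cosine transform)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * m * (j + 0.5) / n) for j in range(n))
            for m in range(n)]


log_spec = [0.3, 1.2, -0.4, 0.9]     # toy logarithmic spectrum mean
offset = 0.7                         # same constant added to every channel

before = dct(log_spec)
after = dct([v + offset for v in log_spec])
diff = [a - b for a, b in zip(after, before)]
# diff[0] is len(log_spec) * offset; the higher-order terms are unchanged.
```

A channel-dependent distortion, by contrast, generally changes every cepstral coefficient, which is consistent with $W$ being added to all terms $C_0, \ldots, C_p$.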

【００３９】Further, in the above the noise HMM was created in the same domain as the speech HMM, but, for example, the speech HMM may be obtained in the cepstrum domain and the noise HMM in the logarithmic spectrum domain. In this case, the amount of computation is reduced by the operation that would otherwise convert the noise HMM from the cepstrum domain to the logarithmic spectrum domain.

【００４０】[0040]

[Effects of the Invention] As described above, according to the present invention, a noise HMM is created from the noise at the place of utterance. From this noise HMM and a speech HMM created from speech recorded beforehand in a noise-free condition, an incomplete noise-resistant speech HMM is created in which the S/N or the multiplicative distortion is introduced as an unknown, and the S/N or multiplicative distortion that maximizes the likelihood of this incomplete noise-resistant speech HMM for the input speech is determined to create the noise-resistant speech HMM. In this way, a speech HMM can be created that is robust to S/N fluctuation, microphone distortion, line distortion, and variation in the speaker's utterances. Consequently, the speech recognition device of the present invention, which uses the speech HMM obtained by the method of the invention, can recognize noise-corrupted speech, line-distorted speech, and the like at a higher recognition rate than before.

【００４１】By including the multiplicative distortion as an unknown in the speech HMM and synthesizing it with the noise HMM in the product space, or by including the S/N as an unknown in one of the speech HMM and the noise HMM and synthesizing it in the product space with the HMM that does not include the unknown, an incomplete noise-resistant speech HMM that models noisy speech can be constructed. By then determining the multiplicative distortion or S/N that maximizes the likelihood of this HMM for the input speech, the noise-resistant speech HMM can be constructed.

【００４２】Further, the incomplete noise-resistant speech HMM is obtained by converting both the speech HMM and the noise HMM into the linear spectral domain, applying the multiplicative distortion to the output probability distribution of the speech HMM in that domain, performing a convolution operation between this and the output probability distribution of the noise HMM, and inversely transforming the result into the domain of the original speech HMM. Similarly, the incomplete noise-resistant speech HMM is obtained by converting both the speech HMM and the noise HMM into the linear spectral domain, multiplying the linear spectrum of the output probability distribution of one of them by the S/N component, performing a convolution operation between this and the output probability distribution of the HMM that was not multiplied by the S/N component, and inversely transforming the result into the domain of the original speech HMM.

【００４３】The distribution of the incomplete noise-resistant speech HMM in the cepstrum domain can be calculated by transforming the output probability distributions of the speech HMM and the noise HMM, expressed in the cepstrum domain, into the linear spectral domain by the cosine transform and the exponential transform, then performing the convolution operation, and applying the inverse transformations, namely the logarithmic transform and the inverse cosine transform, to the result. In particular, when the output probability distributions of the speech HMM and the noise HMM are expressed in the logarithmic spectrum domain, they are exponentially transformed, the convolution operation is performed, and the result is logarithmically transformed to obtain the incomplete noise-resistant speech HMM in the logarithmic spectrum domain.
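The chain of transforms in this paragraph can be illustrated for mean vectors alone. The sketch below is a simplification under several assumptions: it uses an unnormalized DCT-II and its inverse as the cosine transform pair, takes as many cepstral coefficients as spectral channels, and ignores the covariance terms entirely, so it is not the computation of equations (13) to (16). The function names and the toy vectors are hypothetical.

```python
import math


def cep_to_logspec(c):
    """Cepstrum to logarithmic spectrum (inverse of the DCT-II below)."""
    n = len(c)
    return [(c[0] + 2.0 * sum(c[m] * math.cos(math.pi * m * (j + 0.5) / n)
                              for m in range(1, n))) / n
            for j in range(n)]


def logspec_to_cep(x):
    """Logarithmic spectrum to cepstrum (unnormalised DCT-II)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * m * (j + 0.5) / n) for j in range(n))
            for m in range(n)]


def combine_cepstral_means(speech_cep, noise_cep):
    """Means only: go to the linear domain, add, and come back."""
    s_lin = [math.exp(v) for v in cep_to_logspec(speech_cep)]
    n_lin = [math.exp(v) for v in cep_to_logspec(noise_cep)]
    m_lin = [s + n for s, n in zip(s_lin, n_lin)]        # linear-domain sum
    return logspec_to_cep([math.log(v) for v in m_lin])


speech_cep = [2.0, 0.5, -0.3, 0.1]   # toy cepstral means
noise_cep = [0.4, 0.0, 0.2, -0.1]
combined = combine_cepstral_means(speech_cep, noise_cep)
```

For HMMs already expressed in the logarithmic spectrum domain, the two cosine-transform calls drop out and only the exponential and logarithmic steps remain.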

【００４４】A phoneme recognition experiment was conducted to examine the effect of the method and device of the present invention. The evaluation data consisted of 51 sentences from a telephone directory assistance task uttered by one speaker. The experiment used speech to which noise of 12 dB and 6 dB had been added and which was further distorted by multiplicative noise. The experimental phoneme recognition rates are shown in the table below. All models are in the cepstrum domain. The "speech HMM" experiment used a speech HMM with no processing applied, that is, the speech HMM in the speech HMM storage unit 3 in FIG. 4. The "HMM synthesis only" experiment used the HMM obtained by the noise-resistant HMM creation method proposed in Document 1, cited in the background section. "HMM synthesis + S/N estimation" used an HMM created by the method of the invention with the S/N introduced as the unknown, and "HMM synthesis + multiplicative distortion estimation" used an HMM created by the method of the invention with the multiplicative distortion introduced as the unknown. In creating the HMMs by the method of the invention, the S/N and the multiplicative distortion were estimated by the steepest descent method. These results show that conventional HMM synthesis alone gives a better recognition rate than the unprocessed speech HMM, that the method of the invention is more robust still to additive noise, and that it yields a noise-resistant speech HMM markedly more robust to multiplicative distortion than before, improving the recognition results.

As described above, according to the present invention, a noise model of the utterance environment is used and synthesized with a conventional speech HMM to create an incomplete noise-resistant speech HMM with the multiplicative distortion or the S/N as an unknown, and the unknown that maximizes the likelihood of that HMM for the input speech is estimated. A robust speech HMM suited to the utterance environment can therefore be created, and the speech recognition device of the invention using this HMM can achieve a high recognition rate. Moreover, with the device of the invention, the time from creation of the noise HMM to recognition of the input speech is, for example, about one minute (shorter with a faster processor), so recognition can be performed quickly and with relatively simple processing.

[Brief description of drawings]

FIG. 1 is a block diagram showing a simplified functional configuration of a speech recognition device.

FIG. 2 is a flowchart showing the procedure of a conventional noise-resistant speech HMM creation process.

FIG. 3 is a diagram showing the state transitions of a speech HMM, a noise HMM, and the noise-resistant speech HMM obtained by combining them.

FIG. 4 is a block diagram showing the functional configuration of an embodiment of the speech recognition device according to the present invention.

FIG. 5A is a flowchart showing the processing procedure of an embodiment of the noise-resistant speech HMM creation method according to the present invention, and FIG. 5B is a flowchart showing a specific example of the processing of step S₃ in FIG. 5A.

FIG. 6 is a flowchart showing a more specific example of the processing in FIG. 5B.

FIG. 7 is a flowchart showing yet another specific example of the processing in FIG. 5B.

FIG. 8 is a diagram showing an example of a method for estimating the S/N of the incomplete noise-resistant speech HMM.

Claims

[Claims]

1. A method for creating a noise-resistant hidden Markov model for speech recognition, comprising: a first step of creating, from a speech HMM that is not affected by noise and multiplicative distortion (hereinafter simply referred to as noise) and an HMM created from noise (noise HMM), an incomplete noise-resistant speech HMM that includes, as an unknown, a multiplicative distortion in the linear spectral domain or an S/N (signal/noise); a second step of estimating the multiplicative distortion or S/N so that the likelihood of the incomplete noise-resistant speech HMM with respect to input speech is maximized; and a third step of substituting the estimated value into the incomplete noise-resistant speech HMM to complete the noise-resistant speech HMM.

2. The hidden Markov model creation method according to claim 1, wherein the first step comprises a fourth step of applying the multiplicative distortion to the speech HMM, and a fifth step of synthesizing, in the product space, the speech HMM to which the multiplicative distortion has been applied and the noise HMM.

3. The hidden Markov model creation method according to claim 2, wherein the fourth step is the addition of the multiplicative distortion to the mean of the output probability distribution of the speech HMM expressed in the cepstrum domain.

4. The hidden Markov model creation method according to claim 2, wherein the fourth step is the addition of the multiplicative distortion to the mean of the output probability distribution of the speech HMM expressed in the logarithmic spectrum domain.

5. The hidden Markov model creation method according to claim 1, wherein the first step comprises a fourth step of adding a component including the S/N to one of the speech HMM and the noise HMM, and a fifth step of synthesizing, in the product space, the HMM to which the component including the S/N has been added and the HMM to which it has not been added.

6. The hidden Markov model creation method according to claim 5, wherein the fourth step is the addition of the component including the S/N to the zeroth-order term of the mean of the output probability distribution of the HMM expressed in the cepstrum domain.

7. The hidden Markov model creation method according to claim 5, wherein the fourth step is the addition of the component including the S/N to the mean of the output probability distribution of the HMM expressed in the logarithmic spectrum domain.

8. The hidden Markov model creation method according to claim 1, wherein the first step comprises: a sixth step of converting the output probability distributions of the speech HMM and the noise HMM into the linear spectral domain; a seventh step of multiplying the output probability distribution of the speech HMM in the linear spectral domain by the multiplicative distortion; an eighth step of performing a convolution operation between the output probability distribution of the speech HMM multiplied by the multiplicative distortion and the output probability distribution of the noise HMM in the linear spectral domain; and a ninth step of inversely transforming the result of the convolution operation into the domain of the original speech HMM.

9. The hidden Markov model creation method according to claim 1, wherein the first step comprises: a sixth step of converting the output probability distributions of the speech HMM and the noise HMM into the linear spectral domain; a step of multiplying the output probability distribution of one of the speech HMM and the noise HMM in the linear spectral domain by a component including the S/N; a seventh step of performing a convolution operation between the output probability distribution of the multiplied HMM and the output probability distribution of the non-multiplied HMM; and a ninth step of inversely transforming the result of the convolution operation into the domain of the original speech HMM.

10. The hidden Markov model creation method according to claim 8 or 9, wherein the sixth step is a step of cosine-transforming the output probability distributions of the speech HMM and the noise HMM, each expressed in the cepstrum domain, and further applying an exponential transformation, and the ninth step is a step of logarithmically transforming the result of the convolution operation and further applying an inverse cosine transformation to the result.

11. The hidden Markov model creation method according to claim 8 or 9, wherein the sixth step is a step of exponentially transforming the output probability distributions of the speech HMM and the noise HMM, each expressed in the logarithmic spectrum domain, and the ninth step is a step of logarithmically transforming the result of the convolution operation.

12. The hidden Markov model creation method according to claim 1, wherein the first step comprises a fourth step of multiplying the output probability distribution of the speech HMM expressed in the linear spectral domain by the multiplicative distortion, and a fifth step of performing a convolution operation between the output probability distribution multiplied by the multiplicative distortion and the output probability distribution of the noise HMM expressed in the linear spectral domain.

13. The hidden Markov model creation method according to claim 1, wherein the first step comprises a fourth step of multiplying the output probability distribution of one of the speech HMM and the noise HMM, expressed in the linear spectral domain, by the component including the S/N, and a fifth step of performing a convolution operation between the multiplied output probability distribution and the non-multiplied output probability distribution.

14. The hidden Markov model creation method according to claim 5, 9, or 13, wherein the component including the S/N is $10^{((S/N)/2)}$.

15. The hidden Markov model creation method according to claim 1, wherein the estimation in the second step is performed by iterative computation using the maximum likelihood estimation method.

16. The hidden Markov model creation method according to claim 1, wherein the estimation in the second step is performed by iterative computation using the steepest descent method.

17. The hidden Markov model creation method according to claim 1, wherein the second step applies multiplicative distortions of various values to the incomplete noise-resistant speech HMM, obtains the likelihood with respect to the input speech of each of the resulting incomplete noise-resistant speech HMMs, and takes as the estimated value the value of the multiplicative distortion given to the HMM with the maximum likelihood.

18. The hidden Markov model creation method according to claim 1, wherein the second step applies S/N values of various magnitudes to the incomplete noise-resistant speech HMM, obtains the likelihood with respect to the input speech of each of the resulting incomplete noise-resistant speech HMMs, and takes as the estimated value the S/N value given to the HMM with the maximum likelihood.

19. The hidden Markov model creation method according to claim 1, further comprising a step of creating the noise HMM used in the first step by recording noise in the environment of the speech to be recognized.

20. The hidden Markov model creation method according to claim 1, wherein the speech HMM is expressed in the cepstrum domain and the noise HMM is expressed in the logarithmic spectrum domain.

21. A speech recognition device using noise-resistant speech HMMs created from speech HMMs not affected by noise and multiplicative distortion and a noise HMM, comprising: a speech HMM storage unit storing the set of speech HMMs; a noise HMM storage unit storing the noise HMM; a noise-resistant speech HMM storage unit storing the set of noise-resistant speech HMMs; a multiplicative distortion or S/N storage unit storing an unknown multiplicative distortion or S/N; noise input means for recording noise in the same environment as the speech to be recognized; noise HMM creating means for creating the noise HMM from the noise recorded by the noise input means and storing it in the noise HMM storage unit; incomplete noise-resistant speech HMM creating means for creating, from the noise HMM in the noise HMM storage unit, the speech HMMs in the speech HMM storage unit, and the unknown multiplicative distortion or S/N in the multiplicative distortion or S/N storage unit, incomplete noise-resistant speech HMMs containing the multiplicative distortion or S/N as an unknown; speech input means for inputting speech to be recognized; multiplicative distortion or S/N estimating means for estimating the multiplicative distortion or S/N that maximizes the likelihood of the incomplete noise-resistant speech HMMs with respect to the speech input from the speech input means; noise-resistant speech HMM completion means for substituting the multiplicative distortion or S/N estimated by the estimating means into the incomplete noise-resistant speech HMMs to create the set of noise-resistant speech HMMs and storing it in the noise-resistant speech HMM storage unit; and speech recognition means for calculating the similarity between the speech input by the speech input means and each noise-resistant speech HMM in the noise-resistant speech HMM storage unit and outputting a recognition result based on the calculation result.