JP2000075888A

JP2000075888A - Learning method of hidden Markov model and voice recognition system

Info

Publication number: JP2000075888A
Application number: JP10246761A
Authority: JP
Inventors: Kazuhiko Shudo; 和彦首藤
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1998-09-01
Filing date: 1998-09-01
Publication date: 2000-03-14

Abstract

PROBLEM TO BE SOLVED: To provide a hidden Markov model (HMM) learning method capable of achieving a higher recognition rate in noisy environments by performing HMM learning on training speech data to output a noise-mixed speech model, removing an estimated noise model from it, and outputting a noise-independent speech model from which the noise has been removed. SOLUTION: Noise-mixed training speech data 10 is speech data mixed with noise. An HMM learning processing unit 11 performs HMM learning on the noise-mixed training speech data 10 and outputs the result to a noise subtraction processing unit 15 as a noise-mixed speech model 12. A noise estimation unit 13 estimates noise from the noise-mixed training speech data 10 and outputs an estimated noise model 14. The noise subtraction processing unit 15 removes the estimated noise model 14 from the noise-mixed speech model 12 and outputs a noise-independent speech model 16 from which the noise has been removed. As a result, a robust speech model can be created that is not tied to the specific noise present during learning.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a learning method for a hidden Markov model and to a speech recognition system using it, and more particularly to a hidden Markov model learning method and a speech recognition system suited to noisy environments such as the inside of a vehicle, as encountered, for example, in voice-operated car navigation.

[0002]

[Description of the Related Art] Speech recognition has shifted in recent years from classical pattern matching to statistical methods, and the latter are becoming mainstream. Among the statistical methods, Markov models with probabilistic finite states have been proposed; these are usually called HMMs (hidden Markov models). With an HMM, a high recognition rate can be achieved by training speech models on training speech data.

[0003] FIG. 4 is a block diagram showing the configuration of a conventional continuous speech recognition system using this type of HMM.

[0004] In FIG. 4, the continuous speech recognition system comprises an A/D converter 1, an LPC analysis unit 2, a background noise sequential learning unit 3, a speech detection unit 4, a switching unit 5, a Viterbi matching unit 6, an HMM parameter estimation unit 7, and an HMM speech dictionary 8. The A/D converter 1 and the LPC analysis unit 2 form a speech analysis block; the background noise sequential learning unit 3 and the speech detection unit 4 form a speech section detection block; the switching unit 5 and the Viterbi matching unit 6 form an HMM matching block; and the HMM parameter estimation unit 7 and the HMM speech dictionary 8 form an HMM model learning block.

[0005] The A/D converter 1 samples the input speech signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal.

[0006] The LPC analysis unit 2 divides the speech waveform into short sections (called frames, usually 10 to 30 ms long) and extracts feature parameters for each frame. The speech analysis uses LPC (Linear Predictive Coding) analysis, which is widely used as an efficient method suited to the characteristics of speech, and the LPC cepstrum is calculated from the LPC coefficients. The cepstrum is the inverse Fourier transform of the logarithmic spectrum; it has properties close to those of human hearing and can represent speech efficiently with a relatively small number of parameters.
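The framing step in [0006] can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the 20 ms frame length, the non-overlapping frames, and the function name `split_into_frames` are assumptions; only the 8 kHz sampling rate and the 10 to 30 ms range come from the text.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of (silent) audio at 8 kHz, split into 20 ms frames.
frames = split_into_frames([0.0] * 8000)
```

Practical front ends usually overlap adjacent frames and apply a window before LPC analysis; both are omitted here to keep the sketch short.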

[0007] The speech detection unit 4 stores the estimated mean values of the log power and the LPC cepstrum in noise sections as a noise feature vector, computes the distance between this noise feature vector and the feature vector of the input signal, and detects speech sections from the change of that distance over time.

[0008] The background noise sequential learning unit 3 updates the noise feature vector in sections judged to be noise, thereby performing sequential adaptive learning of the noise features, and also sets the threshold automatically through adaptive learning of the distance variation.
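The interplay of speech detection ([0007]) and sequential noise learning ([0008]) might look like the sketch below. It is an illustration under stated assumptions: the Euclidean distance, the exponential-smoothing update with `rate`, and the fixed `threshold` are all hypothetical choices (the text says the threshold is set adaptively, which is not modelled here).

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_and_adapt(frames, noise_vec, threshold, rate=0.1):
    """Label each frame as speech (True) or noise (False).

    Frames judged to be noise update the stored noise feature vector by
    exponential smoothing; `rate` and the fixed `threshold` are hypothetical.
    """
    labels = []
    noise = list(noise_vec)
    for feat in frames:
        is_speech = distance(feat, noise) > threshold
        labels.append(is_speech)
        if not is_speech:
            # Sequential adaptation: move the noise vector toward this frame.
            noise = [(1 - rate) * n + rate * f for n, f in zip(noise, feat)]
    return labels, noise


labels, noise = detect_and_adapt(
    [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [0.1, 0.1]],
    noise_vec=[0.0, 0.0],
    threshold=1.0,
)
```

Only the frame far from the stored noise vector is labelled as speech, and the noise vector drifts slowly toward the frames judged to be noise.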

[0009] The Viterbi matching unit 6 performs HMM matching using the Viterbi algorithm. HMM matching compares HMM models representing phonemes or words with unknown input speech and obtains a similarity.

[0010] The HMM parameter estimation unit 7 performs HMM model learning using the EM (Expectation Maximization) algorithm. In HMM model learning, the parameters of the HMM models are estimated from speech data prepared in advance.

[0011] The switching unit 5 switches between the HMM matching and HMM model learning processes. The HMM speech dictionary 8 stores the results of HMM model learning by the HMM parameter estimation unit 7 and is referenced during HMM matching by the Viterbi matching unit 6.

[0012] In general, an HMM consists of a plurality of states (for example, speech features) and transitions between the states. The HMM further has transition probabilities representing the transitions between states and output probability distributions (usually Gaussian) that output a feature vector on each transition. FIG. 5 shows an example of word speech recognition using such an HMM.

[0013] FIG. 5 is a state transition diagram showing the structure of a word HMM used in the speech recognition method.

[0014] In FIG. 5, s1, s2, s3, and s4 denote states of the HMM, such as speech features; a11, a12, a22, a23, a33, a34, a44, and a45 are state transition probabilities; and (u1, σ1), (u2, σ2), (u3, σ3), and (u4, σ4) denote output probability distributions.

[0015] In the HMM, when a state transition occurs with state transition probability aij (i = 1, ..., 4; j = 1, ..., 5), a vector is output according to the output probability distribution (uk, σk). To recognize an uttered word using HMMs, each word's HMM is first trained on the training data prepared for that word so that it outputs the word's vector sequence with the highest probability. Then the vector sequence of an unknown uttered word is input, and the word HMM that gives the highest output probability is taken as the recognition result.
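Recognition by picking the best-scoring word HMM, as described in [0015] with the Viterbi matching of [0009], can be sketched as below. Several simplifications are assumptions of this sketch and not taken from the patent: observations are scalars rather than feature vectors, outputs are attached to states rather than to transitions, each state has only a self-loop and a transition to the next state, and the two word models are hypothetical.

```python
import math


def log_gauss(x, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi_score(obs, stay, means, stds):
    """Best-path log-likelihood of `obs` under a left-to-right HMM.

    stay[i] is the self-loop probability a_ii; 1 - stay[i] is the probability
    of moving to the next state (or of leaving the model from the last state).
    The path must start in state 0 and end by leaving the last state.
    """
    n = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n
    score[0] = log_gauss(obs[0], means[0], stds[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for j in range(n):
            best = score[j] + math.log(stay[j])            # stay in state j
            if j > 0:                                      # or arrive from j - 1
                best = max(best, score[j - 1] + math.log(1 - stay[j - 1]))
            if best > neg_inf:
                new[j] = best + log_gauss(x, means[j], stds[j])
        score = new
    return score[-1] + math.log(1 - stay[-1])


def recognize(obs, word_models):
    """Return the word whose HMM gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi_score(obs, *word_models[w]))


# Two hypothetical two-state word models: (stay probabilities, means, stds).
models = {
    "low_high": ([0.5, 0.5], [0.0, 4.0], [1.0, 1.0]),
    "high_low": ([0.5, 0.5], [4.0, 0.0], [1.0, 1.0]),
}
result = recognize([0.1, -0.2, 3.9, 4.2], models)
```

The Viterbi score is the likelihood of the single best state path, which is commonly used in place of the full output probability summed over all paths.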

[0016] In this type of speech recognition method, an HMM is assigned to each uttered word itself and trained, and the recognition result is decided by the likelihood (that is, the output probability of the vector sequence). Such word HMMs give excellent recognition accuracy, but they have drawbacks: a huge amount of training data is needed as the recognition vocabulary grows, and speech other than the trained words cannot be recognized at all.

[0017] In speech recognition in special environments with loud noise, for example when place names are entered by voice in car navigation, performance degrades markedly in the very noisy vehicle interior, and it is difficult to achieve a recognition rate acceptable to users. According to HMM theory, however, one possible approach is to prepare training speech data mixed with noise and to train the speech models on it by HMM learning. This method is hereinafter called the noise learning method.

[0018] A speech model obtained in this way gives a fairly good recognition rate even when used in the actual field. However, the noise learning method has a serious constraint: the noise environment of the training speech data must be the same as the noise environment in which the speech recognition system is used. That is, because the trained speech model depends on the noise present during training, the recognition rate degrades if the noise environment differs. Applying this method in the actual field is therefore not practical.

[0019] Many noise countermeasures have been proposed, such as the noise removal method called spectral subtraction and adaptive processing using multiple microphones. Recently, a technique for adapting HMM models to noise, called the PMC (Parallel Model Combination) method, has been studied extensively. One example is described in IEICE Technical Report SP92-96, Frank Martin, Kyohiro Shikano, Yasuhiro Minami, Yoichi Okabe: "Recognition of Noisy Speech by Composition of Hidden Markov Models".

[0020] Because the PMC method does not use a noise-dependent speech model, it can be applied in any noise environment.

[0021]

[Problems to be Solved by the Invention] However, such conventional noise-robust speech recognition systems have the following problems.

[0022] For example, when speech recognition is attempted in a vehicle, the PMC method improves the recognition rate to some extent, but the rate achieved is still insufficient for use in the actual field. A likely reason is that the PMC method mainly targets additive noise, whereas in the actual field there are further factors such as multiplicative noise and the Lombard effect, in which the spectrum of the utterance itself is deformed when speaking in noise.

[0023] An object of the present invention is to provide a hidden Markov model learning method and a speech recognition system that extend the PMC method to handle multiplicative noise, the Lombard effect, and other factors not covered by the conventional PMC method, and that can thereby achieve a higher recognition rate in noisy environments.

[0024]

[Means for Solving the Problems] A hidden Markov model learning method according to the present invention performs hidden Markov model (HMM) learning on training speech data mixed with noise to create a speech model. The method is characterized by estimating the noise contained in the training speech data to generate an estimated noise model, performing HMM learning on the training speech data to output a noise-mixed speech model, and removing the estimated noise model from the noise-mixed speech model to output a noise-independent speech model from which the noise has been removed.

[0025] The estimated noise model may be removed as follows. The output probability distributions of the estimated noise model and of the noise-mixed speech model are each converted from values in cepstrum space to values in log-spectrum space by a cosine transform, and then to values in linear-spectrum space by an exponential transform. From the mean and the variance of the output probability distribution of the noise-mixed speech model in linear-spectrum space calculated in this way, the mean and the variance of the output probability distribution of the estimated noise model in linear-spectrum space are subtracted, respectively. The result is converted to values in log-spectrum space by a logarithmic transform, and an output probability distribution in cepstrum space is then obtained by an inverse cosine transform. A speech model in which the additive noise has been removed from the noise-mixed speech model is thus obtained.

[0026] A speech recognition system according to the present invention performs hidden Markov model (HMM) learning on training speech data to create a speech model and performs speech recognition using that speech model. The system comprises: a training speech database containing training speech data mixed with noise; noise estimation means for estimating the noise contained in the training speech data and generating an estimated noise model; HMM learning processing means for performing HMM learning on the training speech data and outputting a noise-mixed speech model; noise subtraction processing means for removing the estimated noise model from the noise-mixed speech model and outputting a noise-independent speech model from which the noise has been removed; and speech recognition means for performing speech recognition using the noise-independent speech model.

[0027] In the speech recognition system according to the present invention, the noise subtraction processing means may convert the output probability distributions of the estimated noise model and of the noise-mixed speech model from values in cepstrum space to values in log-spectrum space by a cosine transform, and then to values in linear-spectrum space by an exponential transform; subtract the mean and the variance of the output probability distribution of the estimated noise model in linear-spectrum space from the mean and the variance, respectively, of the output probability distribution of the noise-mixed speech model in linear-spectrum space calculated in this way; convert the result to values in log-spectrum space by a logarithmic transform; and obtain an output probability distribution in cepstrum space by an inverse cosine transform, thereby obtaining a speech model in which the additive noise has been removed from the noise-mixed speech model.

[0028] The speech recognition system according to the present invention may also comprise: a noise-independent speech model; noise estimation means for estimating noise from the input and generating an estimated noise model; model combination means for combining the noise-independent speech model with the estimated noise model and outputting a noise-mixed speech model; and recognition means for performing HMM recognition processing using the noise-mixed speech model.

[0029] The model combination means may combine the noise-independent speech model and the noise model by the PMC method.

[0030]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0031] First Embodiment

FIG. 1 is a block diagram showing an apparatus used to carry out the hidden Markov model learning method according to the first embodiment of the present invention.

[0032] In FIG. 1, reference numeral 10 denotes noise-mixed training speech data (training speech data) provided as a speech database. The noise-mixed training speech data 10 is speech data in which noise is mixed.

[0033] The HMM learning processing unit 11 (HMM learning processing means) performs HMM learning by a conventional method on the noise-mixed training speech data 10 and outputs the result as a noise-mixed speech model 12 to the noise subtraction processing unit 15 (noise subtraction processing means).

[0034] The noise estimation unit 13 (noise estimation means) estimates noise from the noise-mixed training speech data 10 and outputs an estimated noise model 14.

[0035] The noise subtraction processing unit 15 removes the estimated noise model 14 from the noise-mixed speech model 12 and outputs a noise-independent speech model 16 from which the noise has been removed.

[0036] The apparatus used to carry out the hidden Markov model learning method according to this embodiment thus comprises: the HMM learning processing unit 11, which performs HMM learning on the noise-mixed training speech data 10 and outputs the noise-mixed speech model 12; the noise estimation unit 13, which, when the speech model is trained by HMM learning, estimates the noise contained in the noise-mixed training speech data 10 and generates the estimated noise model 14; and the noise subtraction processing unit 15, which removes the estimated noise model 14 from the noise-mixed speech model 12 and outputs the noise-independent speech model 16 from which the noise has been removed.
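The data flow of FIG. 1 summarized in [0036] can be written as a small wiring function. This is only a structural sketch: the function name and the three callables passed in are hypothetical stand-ins for units 11, 13, and 15, and their signatures are assumptions.

```python
def learn_noise_independent_model(data, hmm_learn, estimate_noise, subtract_noise):
    """Wire together units 11, 13, and 15 of FIG. 1.

    The three callables are placeholders for the processing units; their
    signatures are assumptions made for this sketch.
    """
    noise_mixed_model = hmm_learn(data)      # unit 11 produces model 12
    noise_model = estimate_noise(data)       # unit 13 produces model 14
    return subtract_noise(noise_mixed_model, noise_model)  # unit 15 produces model 16
```

Both the HMM learning and the noise estimation read the same noise-mixed training data, and the subtraction step only ever sees the two models, never the raw data.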

[0037] The operation of the hidden Markov model learning method carried out by the apparatus configured as described above is explained below.

[0038] The noise-mixed training speech data 10 is a speech database in which noise is mixed. The HMM learning processing unit 11 performs HMM learning by a conventional method on the noise-mixed training speech data 10. The resulting speech model is called the noise-mixed speech model 12. The noise-mixed speech model 12 reflects the properties of the noise mixed into the noise-mixed training speech data 10. Using this model, speech carrying noise of the same nature can be recognized at a high rate, but the model is weak against speech carrying noise of a different nature.

[0039] Meanwhile, the noise estimation unit 13 estimates noise from the noise-mixed training speech data 10 and outputs the estimated noise model 14. Here, the estimated noise model 14 is taken to be an HMM with one state and one mixture component, so its only model parameters are a mean and a variance. The estimated noise model can be created, for example, by extracting the non-speech sections of the noise-mixed training speech data 10, regarding the signal in those sections as noise, and computing its mean and variance.
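The noise model estimate described in [0039] amounts to a mean and a variance over the non-speech frames. A minimal sketch follows; the function name, the boolean speech flags, and the use of one variance per feature dimension (a diagonal covariance) are assumptions, since the text does not say how the non-speech sections are marked or whether a full variance matrix is estimated.

```python
def estimate_noise_model(frames, is_speech):
    """Return (mean, variance) vectors over the frames flagged as non-speech.

    frames: list of feature vectors; is_speech: parallel list of booleans.
    Assumption: diagonal covariance, i.e. one variance per feature dimension.
    """
    noise = [f for f, s in zip(frames, is_speech) if not s]
    if not noise:
        raise ValueError("no non-speech frames to estimate noise from")
    n, dim = len(noise), len(noise[0])
    mean = [sum(f[d] for f in noise) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in noise) / n for d in range(dim)]
    return mean, var


# The middle frame is flagged as speech and therefore ignored.
mean, var = estimate_noise_model(
    [[1.0, 2.0], [9.0, 9.0], [3.0, 2.0]],
    [False, True, False],
)
```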

[0040] The noise-mixed speech model 12 described above can be regarded, roughly, as a model that integrates the following speech and noise components (A), (B), and (C).

[0041]
(A) Speech deformed by the Lombard effect
(B) Multiplicative noise
(C) Additive noise

The noise-mixed speech model 12 is thought to have a low recognition rate for speech mixed with noise different from that at training time mainly because the nature of (C), the additive noise, differs. That is, (B), the multiplicative noise, is mainly a matter of the speech transmission system such as the microphone, and is unrelated to noise inside or outside the vehicle as long as the same microphone and so on are used. The Lombard effect may depend to some degree on the type and level of the noise, but compared with (C), the additive noise, it is thought to be relatively little affected by differences in the noise.

[0042] The purpose of the noise subtraction processing unit 15 is to remove (C), the additive noise, whose nature changes greatly with the type and level of the noise, while retaining (A), the speech deformed by the Lombard effect, and (B), the multiplicative noise, obtained through learning. This produces a speech model that is robust against the Lombard effect and multiplicative noise and is not limited to a particular type or level of additive noise. The speech model produced in this way is here called the noise-independent speech model 16. This speech model alone cannot, of course, cope with additive noise, but that can be handled by applying conventional PMC processing. The details are described in the second embodiment.

[0043] FIG. 2 is a flowchart showing the flow of operation of the noise subtraction processing unit 15; in FIG. 2, S denotes each step of the flow. The basic idea of this processing is the same as that of the PMC method.

[0044] Whereas the PMC method generates a noise-mixed speech model by adding and combining a noise model and a speech model, the processing in the noise subtraction processing unit 15 subtracts a noise model from a noise-mixed speech model to generate a speech model from which the additive noise has been removed (but in which the learned results for the Lombard effect and multiplicative noise are preserved).

[0045] The operation of the noise subtraction processing unit 15 is described below with reference to FIG. 2. The subscripts cp, lg, and ln on the vectors M, N, and S indicate values in cepstrum, log-spectrum (logarithm), and linear-spectrum (linear) space, respectively.

[0046] The HMM output probability distribution {Mcp, ΣMcp}, whose parameters are cepstra (Mcp is the mean vector of the output probability distribution and ΣMcp is its variance matrix), is converted from cepstrum space to values {Mlg, ΣMlg} in log-spectrum space by the cosine transform of step S1. It is then converted to values {Mln, ΣMln} in linear-spectrum space by the exponential transform of step S2. Steps S1 and S2 are the same as in the conventional PMC method, and their details are omitted.
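Since the patent omits the details of steps S1 and S2, the sketch below fills them in with formulas commonly used for PMC; these are assumptions, not taken from the text. Step S1 is assumed to use an orthonormal DCT-II matrix C with cepstrum = C × log spectrum, so the inverse is the transpose. Step S2 is assumed to use the log-normal relation between a Gaussian in log space and its mean and covariance in linear space. All function names are hypothetical.

```python
import math


def dct_matrix(p):
    """Orthonormal DCT-II matrix C (assumed convention: cepstrum = C x log spectrum)."""
    return [[math.sqrt((1.0 if i == 0 else 2.0) / p)
             * math.cos(math.pi * i * (2 * j + 1) / (2 * p))
             for j in range(p)] for i in range(p)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cepstrum_to_log(mean_cp, cov_cp):
    """Step S1: map (mean, covariance) from cepstrum to log-spectrum space."""
    c = dct_matrix(len(mean_cp))
    ct = transpose(c)  # inverse of an orthonormal matrix
    mean_lg = [row[0] for row in matmul(ct, [[m] for m in mean_cp])]
    cov_lg = matmul(matmul(ct, cov_cp), c)
    return mean_lg, cov_lg


def log_to_linear(mean_lg, cov_lg):
    """Step S2: log-normal mapping from log-spectrum to linear-spectrum space."""
    p = len(mean_lg)
    mean_ln = [math.exp(mean_lg[k] + cov_lg[k][k] / 2.0) for k in range(p)]
    cov_ln = [[mean_ln[k] * mean_ln[l] * (math.exp(cov_lg[k][l]) - 1.0)
               for l in range(p)] for k in range(p)]
    return mean_ln, cov_ln
```

Note that even a diagonal cepstral covariance generally becomes a full matrix after step S1, which is why the variance is handled as a matrix with components (k, l).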

[0047] These steps are performed for both the noise-mixed speech HMM and the noise HMM.

[0048] Step S3 is where the processing differs from the conventional PMC method. The output probability distribution {Nln, ΣNln} of the noise is subtracted from the output probability distribution {Mln, ΣMln} of the noise-mixed speech in linear-spectrum space, as shown in equations (1) and (2), to obtain the output probability distribution {Sln, ΣSln}, in linear-spectrum space, of the speech from which the noise has been removed.

[0049]

\[ S_{ln}(k) = \max\{\, M_{ln}(k) - N_{ln}(k),\ \alpha\, N_{ln}(k) \,\} \tag{1} \]

\[ \Sigma S_{ln}(k,l) = \max\{\, \Sigma M_{ln}(k,l) - \Sigma N_{ln}(k,l),\ \alpha^{2}\, \Sigma N_{ln}(k,l) \,\} \tag{2} \]

where k, l = 1, 2, ..., p (p is the dimension of the vectors Sln, Mln, and so on). Here, for example, Sln(k) denotes the k-th component of the mean vector Sln of the output probability distribution, and ΣSln(k, l) denotes the (k, l) component of the variance matrix ΣSln of the output probability distribution. α is a positive constant; for example, α = 0.1 may be used.

[0050] Equations (1) and (2) can be regarded basically as subtracting the noise spectrum from the spectrum of the noise-mixed speech. To prevent the result of the subtraction from becoming negative, flooring is applied: the noise spectrum multiplied by the constant α is used as the minimum value. This processing removes the additive noise component from the noise-mixed speech.
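Step S3 with its flooring can be sketched directly from equations (1) and (2). The function name and the example numbers are hypothetical; the default α = 0.1 is the example value given in [0049].

```python
def subtract_noise(mean_m, cov_m, mean_n, cov_n, alpha=0.1):
    """Apply equations (1) and (2) element by element in linear-spectrum space.

    mean_m, cov_m: mean vector and variance matrix of the noise-mixed speech.
    mean_n, cov_n: mean vector and variance matrix of the estimated noise.
    """
    p = len(mean_m)
    # Equation (1): subtract the noise mean, floored at alpha * noise mean.
    mean_s = [max(mean_m[k] - mean_n[k], alpha * mean_n[k]) for k in range(p)]
    # Equation (2): subtract the noise variance, floored at alpha^2 * noise variance.
    cov_s = [[max(cov_m[k][l] - cov_n[k][l], alpha * alpha * cov_n[k][l])
              for l in range(p)] for k in range(p)]
    return mean_s, cov_s


# In the second dimension the noise exceeds the mixed value, so the floor applies.
mean_s, cov_s = subtract_noise(
    mean_m=[5.0, 1.0], cov_m=[[4.0, 0.0], [0.0, 1.0]],
    mean_n=[2.0, 2.0], cov_n=[[1.0, 0.0], [0.0, 2.0]],
)
```

In the example, the first component is a plain difference, while the second falls back to the floor because the estimated noise is larger than the noise-mixed value.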

[0051] The speech probability distribution {Sln, ΣSln} in linear-spectrum space obtained in this way, from which the additive noise has been removed, is converted by a logarithmic transform in step S4 to values {Slg, ΣSlg} in log-spectrum space, and an inverse cosine transform is then applied in step S5 to obtain values {Scp, ΣScp} in cepstrum space. Steps S4 and S5 are the same as in the conventional PMC method, and a detailed description is omitted.
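Steps S4 and S5 are likewise not detailed in the patent. The sketch below uses the inverse of the log-normal relation for S4 and an orthonormal DCT-II matrix for S5; both conventions are assumptions of this sketch, as are the function names. Note that the patent calls S5 the inverse cosine transform, so its forward and inverse directions may be labelled the other way round from the convention assumed here.

```python
import math


def dct_matrix(p):
    """Orthonormal DCT-II matrix C (assumed convention: cepstrum = C x log spectrum)."""
    return [[math.sqrt((1.0 if i == 0 else 2.0) / p)
             * math.cos(math.pi * i * (2 * j + 1) / (2 * p))
             for j in range(p)] for i in range(p)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def linear_to_log(mean_ln, cov_ln):
    """Step S4: inverse log-normal mapping, linear-spectrum to log-spectrum space.

    Requires positive means and cov_ln[k][l] > -mean_ln[k] * mean_ln[l].
    """
    p = len(mean_ln)
    cov_lg = [[math.log(cov_ln[k][l] / (mean_ln[k] * mean_ln[l]) + 1.0)
               for l in range(p)] for k in range(p)]
    mean_lg = [math.log(mean_ln[k]) - cov_lg[k][k] / 2.0 for k in range(p)]
    return mean_lg, cov_lg


def log_to_cepstrum(mean_lg, cov_lg):
    """Step S5: map (mean, covariance) from log-spectrum to cepstrum space."""
    c = dct_matrix(len(mean_lg))
    ct = [list(col) for col in zip(*c)]
    mean_cp = [row[0] for row in matmul(c, [[m] for m in mean_lg])]
    cov_cp = matmul(matmul(c, cov_lg), ct)
    return mean_cp, cov_cp
```

The flooring in step S3 keeps the linear-space means positive, which is what makes the logarithm in step S4 well defined for the mean vector.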

[0052] In this way, a speech HMM from which the additive noise has been removed is finally obtained. Because only the additive noise has been removed and the learned results for the Lombard effect and multiplicative noise are preserved, this speech model can be regarded as robust against the Lombard effect and the like and as not limited to a particular type or level of additive noise. The resulting speech model is output as the noise-independent speech model 16.

[0053] As described above, the Hidden Markov Model learning method according to the first embodiment uses: an HMM learning processing unit 11 that performs HMM learning on the noise-mixed training speech data 10 and outputs the result as the noise-mixed speech model 12; a noise estimation unit 13 that, when the speech model is HMM-trained, estimates the noise contained in the noise-mixed training speech data 10 and generates the estimated noise model 14; and a noise subtraction processing unit 15 that removes the estimated noise model 14 from the noise-mixed speech model 12 and outputs the noise-independent speech model 16 from which the noise has been removed. The method estimates the noise contained in the noise-mixed training speech data 10 to generate the estimated noise model 14, performs HMM learning on the noise-mixed training speech data 10 to output the noise-mixed speech model 12, and has the noise subtraction processing unit 15 remove the estimated noise model 14 from the noise-mixed speech model 12 to output the noise-independent speech model 16. As a result, only the additive noise estimated from the noise-mixed training speech data 10 is removed from the noise-mixed speech model 12 obtained by HMM learning, so a speech model that is robust against the Lombard effect and similar phenomena can be created without being tied to the specific noise present at training time.
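The subtraction performed by the noise subtraction processing unit 15 can be sketched as follows for a single Gaussian output distribution. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes static cepstral features, an orthonormal DCT-II matrix as the cosine transform, the log-normal approximation from the PMC literature, and no gain-matching term. All function names (`dct_matrix`, `cep_to_lin`, `lin_to_cep`, `subtract_noise`) are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); its inverse is C.T (assumed transform)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    c[0, :] /= np.sqrt(2.0)
    return c

def cep_to_lin(mu_cp, cov_cp, c):
    """Cepstrum -> log spectrum (via C.T) -> linear spectrum (log-normal moments)."""
    mu_lg = c.T @ mu_cp
    cov_lg = c.T @ cov_cp @ c
    mu_ln = np.exp(mu_lg + 0.5 * np.diag(cov_lg))
    cov_ln = np.outer(mu_ln, mu_ln) * (np.exp(cov_lg) - 1.0)
    return mu_ln, cov_ln

def lin_to_cep(mu_ln, cov_ln, c):
    """Linear spectrum -> log spectrum (step S4) -> cepstrum (step S5)."""
    ratio = cov_ln / np.outer(mu_ln, mu_ln) + 1.0
    if np.any(mu_ln <= 0.0) or np.any(ratio <= 0.0):
        raise ValueError("subtracted moments are not valid log-normal moments")
    cov_lg = np.log(ratio)
    mu_lg = np.log(mu_ln) - 0.5 * np.diag(cov_lg)
    return c @ mu_lg, c @ cov_lg @ c.T

def subtract_noise(noisy_cp, noise_cp, c):
    """Remove an estimated noise Gaussian from a noise-mixed speech Gaussian.

    Both arguments are (mean, covariance) pairs in cepstrum space.
    Means and covariances are subtracted in the linear spectrum domain.
    """
    mu_r, cov_r = cep_to_lin(*noisy_cp, c)
    mu_n, cov_n = cep_to_lin(*noise_cp, c)
    return lin_to_cep(mu_r - mu_n, cov_r - cov_n, c)
```

In a full model the same operation would be repeated for every output distribution of every HMM state. The explicit `ValueError` reflects a practical design choice: if the estimated noise is larger than the noisy model in some frequency band, the subtraction has no valid result, and a real system would need some flooring rule that the text does not specify.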

[0054] Therefore, when applied to a speech recognition system using HMMs, the method extends the PMC method so that it can also handle multiplicative noise, the Lombard effect, and other factors that the conventional PMC method does not address, and a higher recognition rate can be achieved in noisy environments.

[0055] Second Embodiment

The second embodiment is a system that performs speech recognition using the noise-independent speech model obtained in the first embodiment.

[0056] In the first embodiment, only the additive noise estimated from the training speech data is removed, so a speech model that is robust against the Lombard effect and similar phenomena could be created without being tied to the specific noise present at training time. However, because additive noise has been removed from the noise-independent speech model 16, the model cannot be used as it is to recognize speech mixed with additive noise. As noted in the description of the related art, various techniques have been proposed for dealing with such additive noise.

[0057] One possibility is therefore to reduce the additive noise in the input speech with, for example, the spectral subtraction method, and then perform speech recognition with the noise-independent speech model of the first embodiment. Given that the noise-independent speech model is robust against the Lombard effect, this approach can be expected to improve performance to some extent.

[0058] However, considering that the noise-independent speech model of the first embodiment is generated essentially along the lines of the PMC method, it is clear that handling additive noise with the PMC method has a better affinity with the noise-independent speech model than handling it with the spectral subtraction method. Using the PMC method is therefore the most effective way to achieve a high recognition rate.

[0059] FIG. 3 is a block diagram showing the configuration of the speech recognition system according to the second embodiment of the present invention. In describing this speech recognition system, parts identical to those in the block diagram of FIG. 1 are given the same reference numerals.

[0060] In FIG. 3, reference numeral 20 denotes a speech input unit that converts speech input from a microphone or the like into a digital signal, and 21 denotes a speech analysis unit, consisting of an LPC analysis unit, that divides the speech waveform into short segments and extracts feature parameters for each frame to analyze the speech.
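The text does not specify frame lengths, windowing, or the LPC order used by the speech analysis unit 21. The following is only a minimal sketch of frame-wise LPC analysis, assuming the autocorrelation method with the Levinson-Durbin recursion. The function names and all parameter values are hypothetical.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Cut a 1-D signal into overlapping frames of frame_len samples, hop apart."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def lpc(frame, order):
    """LPC coefficients a[0..order] (a[0] = 1) and residual energy of one frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    if r[0] <= 0.0:
        raise ValueError("frame has no energy")
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the order-i predictor.
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1 : 0 : -1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

A typical use would apply a window to each frame before analysis, for example `lpc(frame * np.hamming(len(frame)), order)`. Cepstral features, which the noise model and speech model operate on, would then be derived from the LPC coefficients; that step is not shown here.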

[0061] The noise estimation unit 13 estimates the noise in the input signal from the output of the speech analysis unit 21 and outputs it, as the estimated noise model 14, to the model synthesis unit (PMC processing unit) 22 (model synthesis means).

[0062] The noise-independent speech model 16 is the speech model generated by the noise subtraction processing unit 15 (FIGS. 1 and 2). Because additive noise has been removed, it is a speech model that is robust against the Lombard effect and multiplicative noise and is not limited to any particular type or degree of additive noise.

[0063] The model synthesis unit 22 combines the estimated noise model 14 and the noise-independent speech model 16 using the PMC method and outputs the new speech model as the noise-mixed speech model 23. The algorithm of the PMC method is the same as in the conventional method.

[0064] The analysis result from the speech analysis unit 21 and the noise-mixed speech model 23 are output to the recognition unit 24 (recognition means). The recognition unit 24 performs recognition processing on the output of the model synthesis unit 22 using the noise-mixed speech model 23, outputs the recognition result 25, and ends the processing.

[0065] Thus, the Hidden Markov Model learning method according to this embodiment is identical to the conventional PMC method except that the noise-independent speech model 16 is used as the speech model in place of a clean speech model.

[0066] The operation of the speech recognition system configured as described above is explained below.

[0067] Speech from the speech input unit 20 is input to the speech analysis unit 21, analyzed there, and output to the noise estimation unit 13 and the recognition unit 24.

[0068] The noise estimation unit 13 estimates the noise from the input, creates the estimated noise model 14, and outputs it to the model synthesis unit 22. Meanwhile, the model synthesis unit 22 also receives the noise-independent speech model 16, which, because additive noise has been removed, is robust against the Lombard effect and multiplicative noise and is not limited to any particular type or degree of additive noise.

[0069] The model synthesis unit 22 combines the estimated noise model 14 and the noise-independent speech model 16 by PMC and obtains the noise-mixed speech model 23, which takes the effect of additive noise into account.
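The PMC combination performed by the model synthesis unit 22 can be sketched as follows. This is an illustrative sketch only, assuming static cepstral features, one Gaussian per HMM state, a single-Gaussian noise model, an orthonormal DCT-II matrix as the cosine transform, the log-normal approximation, and no gain-matching term. The names `dct_matrix`, `to_linear`, `to_cepstrum`, and `pmc_combine` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); its inverse is C.T (assumed transform)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    c[0, :] /= np.sqrt(2.0)
    return c

def to_linear(mu_cp, cov_cp, c):
    """Map a cepstral Gaussian to linear-spectrum moments (log-normal assumption)."""
    mu_lg = c.T @ mu_cp
    cov_lg = c.T @ cov_cp @ c
    mu_ln = np.exp(mu_lg + 0.5 * np.diag(cov_lg))
    cov_ln = np.outer(mu_ln, mu_ln) * (np.exp(cov_lg) - 1.0)
    return mu_ln, cov_ln

def to_cepstrum(mu_ln, cov_ln, c):
    """Map linear-spectrum moments back to a cepstral Gaussian."""
    cov_lg = np.log(cov_ln / np.outer(mu_ln, mu_ln) + 1.0)
    mu_lg = np.log(mu_ln) - 0.5 * np.diag(cov_lg)
    return c @ mu_lg, c @ cov_lg @ c.T

def pmc_combine(speech_states, noise, c):
    """Combine each state's speech Gaussian with one noise Gaussian.

    speech_states: list of (mean, covariance) pairs in cepstrum space.
    noise: a single (mean, covariance) pair in cepstrum space.
    Speech and noise are treated as additive in the linear spectrum domain.
    """
    mu_n, cov_n = to_linear(*noise, c)
    combined = []
    for mu_cp, cov_cp in speech_states:
        mu_s, cov_s = to_linear(mu_cp, cov_cp, c)
        combined.append(to_cepstrum(mu_s + mu_n, cov_s + cov_n, c))
    return combined
```

Because the combination adds moments in the same domain in which the first embodiment subtracts them, the two operations are inverses of each other under these assumptions, which is one way to read the "affinity" between the noise-independent speech model and the PMC method.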

[0070] The recognition unit 24 performs recognition processing using this noise-mixed speech model 23 and outputs the recognition result 25.

[0071] Compare this with the conventional PMC method using a clean speech model. That method is effective against additive noise but powerless against deformation of the speech itself, such as the Lombard effect. In the present method, by contrast, the noise-independent speech model is robust against the Lombard effect, so a speech recognition system that is robust against both additive noise and the Lombard effect can be built.

[0072] Compare this also with using the noise-mixed speech model learned from the noise-mixed training speech data 10 as it is (the noise learning method). In that case the noise-mixed speech model 23 is robust against the Lombard effect, but the type and degree of additive noise are limited to the noise present at training time. Unlike the PMC method, it cannot estimate additive noise from the test-time input and generate a new noise-mixed speech model adapted to that noise. In this embodiment, applying the PMC method makes it possible to handle additive noise that differs from the noise at training time.

[0073] As described above, the speech recognition system according to the second embodiment comprises: a noise estimation unit 13 that estimates noise from the input and generates the estimated noise model 14; a model synthesis unit 22 that combines the noise-independent speech model 16 with the estimated noise model 14 and outputs the noise-mixed speech model 23; and a recognition unit 24 that performs HMM recognition processing using the noise-mixed speech model 23. In place of a clean speech model, the system uses the noise-independent speech model 16, obtained by removing only the additive noise from a noise-mixed speech model trained on noisy data, and recognizes noise-mixed speech by the PMC method. A noise-robust speech recognition system can therefore be built that is not limited to a specific noise and is robust not only against additive noise but also against the Lombard effect and similar phenomena. In addition, because the noise-independent speech model 16 has an affinity with the PMC method, recognizing noise-mixed speech with the noise-independent speech model in combination with the PMC method yields higher recognition performance than combining it with other additive-noise removal methods.

[0074] Therefore, if a speech recognition system with these advantages is applied to, for example, a noise-robust speech recognition system used in voice-operated car navigation, the recognition rate of that device can be greatly improved.

[0075] The Hidden Markov Model learning method according to the first embodiment and the speech recognition system according to the second embodiment can be applied to any speech recognition system that performs HMM learning from training speech data, and can also be implemented as circuits built into various terminals.

[0076] Furthermore, the number, types, connection arrangement, and so on of the processing units and processes that make up the Hidden Markov Model learning method according to the first embodiment and the speech recognition system according to the second embodiment are not limited to those of the embodiments described above.

[0077]

[Effects of the Invention] In the Hidden Markov Model learning method according to the present invention, the noise contained in the training speech data is estimated to generate an estimated noise model, HMM learning is performed on the training speech data to output a noise-mixed speech model, and the noise subtraction processing means removes the estimated noise model from the noise-mixed speech model to output a noise-independent speech model from which the noise has been removed. Only the additive noise estimated from the training speech data is therefore removed, and a speech model that is robust against the Lombard effect and similar phenomena can be created without being tied to the specific noise present at training time.

[0078] The speech recognition system according to the present invention comprises: a training speech database holding training speech data mixed with noise; noise estimation means for estimating the noise contained in the training speech data and generating an estimated noise model; HMM learning processing means for performing HMM learning on the training speech data and outputting a noise-mixed speech model; noise subtraction processing means for removing the estimated noise model from the noise-mixed speech model and outputting a noise-independent speech model from which the noise has been removed; and speech recognition means for performing speech recognition using the noise-independent speech model. This configuration extends the PMC method so that it can also handle multiplicative noise, the Lombard effect, and other factors that the conventional PMC method does not address, and a higher recognition rate can be achieved in noisy environments.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an apparatus used to carry out the Hidden Markov Model learning method according to a first embodiment to which the present invention is applied.

FIG. 2 is a flowchart showing the flow of operation of the noise subtraction processing unit in the above Hidden Markov Model learning method.

FIG. 3 is a block diagram showing the configuration of a speech recognition system according to a second embodiment to which the present invention is applied.

FIG. 4 is a block diagram showing the configuration of a conventional continuous speech recognition system using HMMs.

FIG. 5 is a diagram showing the structure of a word Hidden Markov Model used in the speech recognition method.

[Explanation of symbols]

10: noise-mixed training speech data (training speech data)
11: HMM learning processing unit (HMM learning processing means)
12, 23: noise-mixed speech model
13: noise estimation unit (noise estimation means)
14: estimated noise model
15: noise subtraction processing unit (noise subtraction processing means)
16: noise-independent speech model
20: speech input unit
21: speech analysis unit
22: model synthesis unit (PMC processing unit) (model synthesis means)
24: recognition unit (recognition means)
25: recognition result

Claims

[Claims]

1. A Hidden Markov Model learning method for performing Hidden Markov Model (HMM) learning from training speech data mixed with noise to generate a speech model, the method comprising: estimating the noise contained in the training speech data to generate an estimated noise model; performing HMM learning from the training speech data to output a noise-mixed speech model; and removing the estimated noise model from the noise-mixed speech model to output a noise-independent speech model from which the noise has been removed.

2. The Hidden Markov Model learning method according to claim 1, wherein the removal of the estimated noise model is performed by: converting each output probability distribution of the estimated noise model and of the noise-mixed speech model from values in cepstrum space to values in log-spectrum space by a cosine transform, and further to values in linear spectrum space by an exponential transform; subtracting the mean and variance of the output probability distribution of the estimated noise model in linear spectrum space from the mean and variance, respectively, of the output probability distribution of the noise-mixed speech model in linear spectrum space; converting the result to values in log-spectrum space by a logarithmic transform; and obtaining the output probability distribution in cepstrum space by an inverse cosine transform, thereby obtaining a speech model in which additive noise has been removed from the noise-mixed speech model.

3. A speech recognition system that performs Hidden Markov Model (HMM) learning from training speech data to generate a speech model and performs speech recognition using the speech model, the system comprising: a training speech database holding training speech data mixed with noise; noise estimation means for estimating the noise contained in the training speech data and generating an estimated noise model; HMM learning processing means for performing HMM learning from the training speech data and outputting a noise-mixed speech model; noise subtraction processing means for removing the estimated noise model from the noise-mixed speech model and outputting a noise-independent speech model from which the noise has been removed; and speech recognition means for performing speech recognition using the noise-independent speech model.

4. The speech recognition system according to claim 3, wherein the noise subtraction processing means: converts each output probability distribution of the estimated noise model and of the noise-mixed speech model from values in cepstrum space to values in log-spectrum space by a cosine transform, and further to values in linear spectrum space by an exponential transform; subtracts the mean and variance of the output probability distribution of the estimated noise model in linear spectrum space from the mean and variance, respectively, of the output probability distribution of the noise-mixed speech model in linear spectrum space; converts the result to values in log-spectrum space by a logarithmic transform; and obtains the output probability distribution in cepstrum space by an inverse cosine transform, thereby obtaining a speech model in which additive noise has been removed from the noise-mixed speech model.

5. The speech recognition system according to claim 3, comprising: the noise-independent speech model; noise estimation means for estimating noise from an input and generating an estimated noise model; model synthesis means for combining the noise-independent speech model and the estimated noise model and outputting a noise-mixed speech model; and recognition means for performing HMM recognition processing using the noise-mixed speech model.

6. A speech recognition system as described, wherein the model synthesis means combines the noise-independent speech model and the noise model by the PMC (Parallel Model Combination) method.