JP2000075889A

JP2000075889A - Voice recognizing system and its method

Info

Publication number: JP2000075889A
Application number: JP10246768A
Authority: JP
Inventors: Kazuhiko Shudo; 和彦首藤
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1998-09-01
Filing date: 1998-09-01
Publication date: 2000-03-14

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognizing system capable of following various changes in the noise environment and consistently achieving a high recognition rate, by selecting the optimal voice model for the noise environment at the time of recognition and performing voice recognition processing with the selected voice model. SOLUTION: The analysis result from a voice analyzer 11 and plural voice models 12 (voice models 1 to N) are supplied to a voice recognizing part 13. The voice recognizing part 13 performs voice recognition processing independently for each of the voice models 1 to N by matching the voice feature quantity obtained from the voice analyzer 11 against the templates stored in the voice models, and outputs the results to a probability comparing part 14. In addition to each recognized symbol, the voice recognizing part 13 calculates a numerical value indicating its certainty and outputs that value to the probability comparing part 14 as well. The probability comparing part 14 compares the certainty values output by the voice recognizing part 13 and outputs the recognized symbol with the highest certainty as the recognition result 15 of the system.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a noise-robust speech recognition system and speech recognition method, for example a speech recognition system and method suited to noisy environments such as a vehicle interior, as used in voice-operated car navigation systems.

[0002]

[Description of the Related Art] In speech recognition, classical pattern-matching techniques have in recent years given way to statistical techniques, which are becoming mainstream. The statistical approach uses a Markov model with probabilistic finite states, usually called an HMM (hidden Markov model). With an HMM, a high recognition rate can be achieved by training the speech model on training speech data.

[0003] FIG. 3 is a block diagram showing the configuration of a conventional continuous speech recognition system using this type of HMM.

[0004] In FIG. 3, the continuous speech recognition system comprises an A/D conversion unit 1, an LPC analysis unit 2, a background noise sequential learning unit 3, a speech detection unit 4, a switching unit 5, a Viterbi matching unit 6, an HMM parameter estimation unit 7, and an HMM speech dictionary 8. The A/D conversion unit 1 and the LPC analysis unit 2 form a speech analysis block; the background noise sequential learning unit 3 and the speech detection unit 4 form a speech section detection block; the switching unit 5 and the Viterbi matching unit 6 form an HMM matching block; and the HMM parameter estimation unit 7 and the HMM speech dictionary 8 form an HMM model training block.

[0005] The A/D conversion unit 1 samples the input speech signal at a predetermined sampling frequency (for example, 8 kHz) and converts it into a digital signal.

[0006] The LPC analysis unit 2 divides the speech waveform into short sections (called frames, usually 10 to 30 ms long) and extracts feature parameters for each frame. Speech analysis uses LPC (Linear Predictive Coding) analysis, which is widely used as an efficient method suited to the characteristics of speech, and the LPC cepstrum is calculated from the LPC coefficients. The cepstrum is the inverse Fourier transform of the logarithmic spectrum; it has properties close to human auditory characteristics and can represent speech efficiently with a relatively small number of parameters.
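
The conversion from LPC coefficients to the LPC cepstrum is commonly done with a short recursion rather than an explicit Fourier transform. The following is a minimal sketch of that standard recursion, not the specific implementation of this patent. It assumes the all-pole model H(z) = 1 / (1 - Σ a_k z^-k); the function name and argument names are illustrative only.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC predictor coefficients to LPC cepstral coefficients.

    Assumes the all-pole model H(z) = 1 / (1 - sum_k a[k-1] * z**-k).
    Returns [c_1, ..., c_n_ceps]; the gain term c_0 is omitted.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only while n is within the predictor order.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_(n-k).
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a first-order predictor with a single coefficient a, the recursion reproduces the closed form c_n = a^n / n, which is a convenient sanity check.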

[0007] The speech detection unit 4 stores the estimated mean values of the log power and the LPC cepstrum in noise sections as a noise feature vector, computes the distance between this noise feature vector and the feature vector of the input signal, and detects speech sections from the change of that distance over time.

[0008] The background noise sequential learning unit 3 updates the noise feature vector in sections judged to be noise, thereby performing sequential adaptive learning of the noise features, and also sets the threshold automatically through adaptive learning of the distance variation.
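
The interplay between speech detection and sequential noise learning can be illustrated with a simplified sketch. It assumes a Euclidean distance, an exponential moving average for the noise update, and a fixed threshold; the conventional system described here adapts the threshold automatically, which this sketch does not attempt. All names and default values are hypothetical.

```python
import math

def detect_speech(frames, threshold=2.0, rate=0.1, init_frames=3):
    """Label each feature vector True (speech) or False (noise).

    The noise feature vector starts as the mean of the first few frames
    and is updated only on frames judged to be noise.
    """
    dim = len(frames[0])
    noise = [sum(f[d] for f in frames[:init_frames]) / init_frames
             for d in range(dim)]
    labels = []
    for f in frames:
        # Distance between the input feature vector and the noise vector.
        dist = math.sqrt(sum((x - m) ** 2 for x, m in zip(f, noise)))
        is_speech = dist > threshold
        labels.append(is_speech)
        if not is_speech:
            # Sequential adaptation of the noise feature vector.
            noise = [(1 - rate) * m + rate * x for m, x in zip(noise, f)]
    return labels
```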

[0009] The Viterbi matching unit 6 performs HMM matching using the Viterbi algorithm. HMM matching compares an HMM model representing a phoneme or a word with the unknown input speech and computes their similarity.

[0010] The HMM parameter estimation unit 7 performs HMM model training using the EM (Expectation Maximization) algorithm. In HMM model training, the parameters of the HMM models are estimated from speech data prepared in advance.

[0011] The switching unit 5 switches between the HMM matching and HMM model training processes. The HMM speech dictionary 8 stores the results of HMM model training by the HMM parameter estimation unit 7 and is referenced during HMM matching by the Viterbi matching unit 6.

[0012] In general, an HMM consists of a plurality of states (for example, features of speech) and transitions between the states. An HMM further has transition probabilities that represent the transitions between states, and output probability distributions (usually Gaussian) that output a feature vector on each transition. FIG. 4 shows an example of word speech recognition using such an HMM.

[0013] FIG. 4 is a state transition diagram showing the structure of a word HMM used in the speech recognition method.

[0014] In FIG. 4, s1, s2, s3, and s4 denote HMM states such as speech features; a11, a12, a22, a23, a33, a34, a44, and a45 denote state transition probabilities; and (u1, σ1), (u2, σ2), (u3, σ3), and (u4, σ4) denote output probability distributions.

[0015] In the HMM, when a state transition occurs with state transition probability aij (i = 1, ..., 4; j = 1, ..., 5), a vector is output according to the output probability distribution (uk, σk). To recognize an uttered word using HMMs, each word HMM is first trained on the training data prepared for that word so that it outputs the word's vector sequence with the highest probability. Next, the vector sequence of an uttered unknown word is input, and the word HMM that gives the highest output probability is taken as the recognition result.
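
Word recognition with left-to-right HMMs of the kind in FIG. 4 can be sketched as follows. This is an illustrative simplification, not the patent's implementation: features are scalars rather than vectors, Gaussian outputs are attached to states rather than transitions, and the Viterbi (best-path) score stands in for the output probability. The function names and the model layout (self-transition probabilities, means, variances) are assumptions.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_score(obs, stay, means, variances):
    """Best-path log-likelihood of a left-to-right HMM.

    stay[i] is the self-transition probability a_ii; 1 - stay[i] is the
    probability of moving to state i+1 (or exiting from the last state).
    """
    n = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n
    score[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for j in range(n):
            best = score[j] + math.log(stay[j])
            if j > 0:
                best = max(best, score[j - 1] + math.log(1 - stay[j - 1]))
            new[j] = best + log_gauss(x, means[j], variances[j])
        score = new
    return score[-1] + math.log(1 - stay[-1])

def recognize(obs, word_models):
    """Return the word whose HMM gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi_score(obs, *word_models[w]))
```

Here `word_models` maps each word to a `(stay, means, variances)` tuple. Training the models (for example with the EM algorithm mentioned above) is outside the scope of this sketch.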

[0016] In this type of speech recognition method, an HMM is assigned to each uttered word itself and trained, and the recognition result is decided by the likelihood (that is, the output probability of the vector sequence). Such word HMMs offer excellent recognition accuracy, but they have drawbacks: a huge amount of training data is needed as the recognition vocabulary grows, and speech other than the trained words cannot be recognized at all.

[0017] In recent years, attempts have been made to operate products such as car navigation systems by the user's voice using speech recognition. In a very noisy environment such as a vehicle interior, with noise from outside the vehicle and running noise, applying speech recognition as is, without noise countermeasures, yields a low recognition rate and is not practical. Various noise countermeasures have therefore been proposed, for example a noise removal method called spectral subtraction, adaptive processing methods using multiple microphones, and adaptation of HMM models to noise by the method called PMC (Parallel Model Combination).

[0018] At present, even with these methods, it is difficult to achieve a recognition rate acceptable to users in the actual field. On the other hand, one can perform HMM training using speech data uttered in a noisy environment rather than a quiet one, creating a speech model in which noise is mixed. This method is called the noise learning method here. With a speech model obtained by this method, a fairly high recognition rate can be achieved even in the actual field.

[0019]

[Problems to Be Solved by the Invention] However, such conventional noise-robust speech recognition systems have the following problems.

[0020] Namely, the noise learning method has the serious constraint that the noise environment at training time and at test time must be the same to achieve a good recognition rate. Consider, for example, an application such as voice input of place names in car navigation. The environment surrounding the vehicle interior, such as weather, road surface conditions, noise outside the vehicle, and traveling speed, changes in many ways, and the noise changes accordingly. A speech model trained by noise learning on a training speech database collected under a particular weather, road surface condition, outside noise, and traveling speed gives a good recognition rate in scenes that match the training conditions, but the recognition rate drops as soon as the road surface condition or the traveling speed changes.

[0021] Thus, it is not practical to apply the noise learning method as is in a field where the noise environment changes in various ways.

[0022] An object of the present invention is to provide a speech recognition system and a speech recognition method that, in a field where the noise environment changes in various ways, can follow those changes and always achieve a good recognition rate.

[0023]

[Means for Solving the Problems] A speech recognition system according to the present invention performs speech recognition using a speech model and comprises: a plurality of speech models prepared for different noise environments; speech model selection means for selecting the optimal speech model according to the noise environment at the time of recognition; and speech recognition means for performing speech recognition processing using the speech model selected by the speech model selection means.

[0024] A speech recognition system according to the present invention performs speech recognition using a speech model and comprises: a plurality of speech models prepared for different noise environments; speech recognition means for performing speech recognition processing using each of the speech models; and speech model selection means for selecting one of the recognition results produced by the speech recognition means.

[0025] The speech recognition processing may be speech recognition processing based on a hidden Markov model (HMM), and the plurality of speech models may be speech models prepared according to the different degrees, types, and so on of noise expected in the field where the system is used.

[0026] The plurality of speech models may be HMM word models or HMM phoneme models, and may be speech models corresponding to noise environments, created by HMM training based on speech databases uttered under different noise environments.

[0027] The speech model selection means may perform speech recognition processing individually with each speech model and select the speech model that gives the highest recognition likelihood.

[0028] The speech recognition system according to the present invention may comprise noise estimation means for estimating the noise contained in the input, each speech model may include a corresponding noise model, and the speech model selection means may match the noise estimated by the noise estimation means against the noise model of each speech model and select the speech model having the most similar noise model.

[0029] The matching between the estimated noise and the noise models may be performed by a matching process between each noise model and the estimated noise, with the resulting matching score used as the measure of similarity; the matching process may be Viterbi matching.

[0030] A speech recognition method according to the present invention performs speech recognition using a speech model and is characterized by preparing a plurality of speech models for different noise environments, selecting the optimal speech model according to the noise environment at the time of recognition, and performing speech recognition processing using the selected speech model.

[0031] A speech recognition method according to the present invention performs speech recognition using a speech model and is characterized by preparing a plurality of speech models for different noise environments, performing speech recognition processing using each of the speech models, and selecting one of the resulting recognition results.

[0032] A speech recognition method according to the present invention is characterized in that the noise contained in the input is estimated, each speech model includes a corresponding noise model, and, in selecting a speech model, the estimated noise is matched against the noise model of each speech model and the speech model having the most similar noise model is selected.

[0033]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0034] First Embodiment. FIG. 1 is a flowchart showing the configuration and processing of a speech recognition system according to a first embodiment of the present invention.

[0035] In FIG. 1, reference numeral 10 denotes a speech input unit that converts speech input from a microphone or the like into a digital signal, and 11 denotes a speech analysis unit consisting of an LPC analysis unit that divides the speech waveform into short sections and extracts feature parameters for each frame to analyze the speech. The speech analysis unit 11 uses LPC analysis, which is widely used as an efficient method suited to the characteristics of speech, and calculates the LPC cepstrum from the LPC coefficients.

[0036] The speech recognition system has a plurality of speech models 12 (speech models 1 to N) in advance. The speech models 12 are different speech models 1 to N corresponding to the different noise environments 1 to N expected in the field where the system is used.

[0037] The analysis result from the speech analysis unit 11 and the plurality of speech models 12 (speech models 1 to N) are supplied to a speech recognition unit 13 (speech recognition means).

[0038] The speech recognition unit 13 performs speech recognition processing independently for each of the speech models 1 to N by matching the speech features obtained from the speech analysis unit 11 against the templates stored in the speech model, and outputs the results to a probability comparison unit 14 (speech model selection means). In the speech recognition processing, the speech recognition unit 13 calculates, in addition to the recognized symbol (recognized word), a numerical value indicating its likelihood, and outputs these likelihoods P-1 to P-N to the probability comparison unit 14 as well.

[0039] The probability comparison unit 14 compares the likelihoods P-1 to P-N output by the speech recognition unit 13, takes the largest one, P-MAX, and outputs the corresponding recognized symbol S-MAX as the recognition result 15 of the system.

[0040] The speech recognition unit 13 constitutes the speech recognition means that performs speech recognition processing using each of the speech models 1 to N prepared for different noise environments, and, because it also calculates the numerical value indicating the likelihood of each recognition result, it constitutes part of the speech model selection means as well.

[0041] The probability comparison unit 14 constitutes the speech model selection means that selects the optimal recognition result by comparing the likelihoods P-1 to P-N.

[0042] Thus, the speech recognition system according to this embodiment is characterized by having a plurality of speech models 12 (speech models 1 to N) corresponding to different degrees, types, and so on of noise, the speech recognition unit 13 that performs speech recognition processing independently for each of the speech models 1 to N, and the probability comparison unit 14 that selects one of the recognition results produced by the speech recognition unit 13.

[0043] The operation of the speech recognition system configured as described above is explained below.

[0044] The speech input unit 10 receives sound from a microphone or the like and converts the signal into a digital signal by A/D conversion.

[0045] The speech analysis unit 11 performs, for example, LPC analysis on the signal from the speech input unit 10, extracts its features, and supplies them as input to the speech recognition unit 13 in the following stage.

[0046] The system has a plurality of speech models 12 (speech models 1 to N) in advance. The speech models may be HMM word models or HMM phoneme models. To handle the different noise environments 1 to N expected in the field where the system is used, correspondingly different speech models 1 to N are prepared.

[0047] Ideally, these speech models are created by building speech databases uttered under the various noise environments (different SN ratios, noise types, and so on) of the actual field where the system is used, and obtaining, by HMM training for each identical or similar noise environment, a speech model corresponding to that noise environment. If that is difficult, however, noise data alone may be collected and added on a computer to a speech database uttered in a quiet environment, varying the SN ratio and the type of noise data to create simulated speech databases for various noise environments, from which the speech models are then obtained by HMM training.
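
Adding recorded noise to clean speech at a chosen SN ratio amounts to scaling the noise before summing. The sketch below illustrates one way to do this under simple assumptions: signals are plain lists of samples, power is the mean squared amplitude over the whole utterance, and the noise recording is looped if it is shorter than the speech. The function name is hypothetical.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    n = len(speech)
    # Loop the noise recording so that it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(n)]
    p_speech = sum(s * s for s in speech) / n
    p_noise = sum(v * v for v in noise) / n
    # Choose the gain so that p_speech / (gain**2 * p_noise) == 10**(snr_db/10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(speech, noise)]
```

Repeating this for several SN ratios and several noise recordings would give one simulated training database per noise environment.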

[0048] The speech recognition unit 13 recognizes speech by matching the speech features obtained from the speech analysis unit 11 against the templates stored in a speech model. An ordinary speech recognition system uses a single speech model, but this embodiment uses the plurality of speech models 12 (speech models 1 to N) described above in order to handle different noise environments. The speech recognition unit 13 performs speech recognition processing independently for each of the speech models 1 to N. In speech recognition processing based on HMMs or the like, a numerical value indicating the likelihood is calculated in addition to the recognized symbol (recognized word). The speech recognition unit 13 therefore outputs the recognized symbols S-1 to S-N obtained with each of the speech models 1 to N, together with their likelihoods P-1 to P-N.

[0049] The probability comparison unit 14 compares the likelihoods P-1 to P-N output by the speech recognition unit 13, takes the largest one, P-MAX, and outputs the recognized symbol S-MAX corresponding to P-MAX as the recognition result 15 of the system.
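
The selection step of the first embodiment reduces to taking the maximum over the per-model likelihoods. A minimal sketch follows, assuming each noise-specific recognizer is available as a callable that returns a (symbol, likelihood) pair; that interface and the function name are illustrative and not taken from the patent.

```python
def recognize_with_all_models(features, models):
    """Run every noise-specific recognizer and keep the most confident result.

    `models` maps a model name to a callable returning (symbol, likelihood).
    Returns (symbol, likelihood, model_name) for the best-scoring model.
    """
    best = None
    for name, recognizer in models.items():
        symbol, likelihood = recognizer(features)  # S-i and P-i for model i
        if best is None or likelihood > best[1]:
            best = (symbol, likelihood, name)      # current S-MAX and P-MAX
    return best
```

Because every model is evaluated for every utterance, the cost of this scheme grows linearly with the number of models N.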

[0050] As described above, the speech recognition system according to the first embodiment comprises: different speech models 1 to N corresponding to the different noise environments 1 to N expected in the field where the system is used; the speech recognition unit 13, which performs speech recognition processing independently for each of the speech models 1 to N, calculates in that processing a numerical value indicating the likelihood in addition to the recognized symbol (recognized word), and outputs the likelihoods P-1 to P-N; and the probability comparison unit 14, which compares the likelihoods P-1 to P-N output by the speech recognition unit 13, takes the largest one, P-MAX, and outputs the corresponding recognized symbol S-MAX as the recognition result 15 of the system. Recognition processing is thus performed with every speech model, and the recognition result of the speech model that outputs the highest likelihood is adopted. As a result, even where the noise environment changes from moment to moment, as inside a vehicle, the system can follow the environment and consistently achieve a high recognition rate.

[0051] In the speech recognition system according to the first embodiment described above, a plurality of speech models 12 are prepared for different noise environments, speech recognition processing is performed using each speech model 12, and one of the resulting recognition results is selected. Alternatively, a speech model may be selected first and speech recognition processing then performed using the selected speech model. This variant is described below as a second embodiment.

[0052] Second Embodiment. FIG. 2 is a flowchart showing the configuration and processing of a speech recognition system according to a second embodiment of the present invention. In describing the speech recognition system according to this embodiment, parts identical to the configuration and processing of the speech recognition system shown in FIG. 1 are given the same reference numerals.

[0053] In FIG. 2, reference numeral 10 denotes a speech input unit that converts speech input from a microphone or the like into a digital signal, and 11 denotes a speech analysis unit consisting of an LPC analysis unit that divides the speech waveform into short sections and extracts feature parameters for each frame to analyze the speech. As in the first embodiment, the speech analysis unit 11 uses LPC analysis, which is widely used as an efficient method suited to the characteristics of speech, and calculates the LPC cepstrum from the LPC coefficients.

[0054] The noise estimation unit 21 (noise estimation means) estimates the noise contained in the input signal and outputs estimated noise 22.

[0055] As in the first embodiment, the system has a plurality of speech models 23 (speech models 1 to N) in advance. The speech models 23 are different speech models 1 to N corresponding to the different noise environments 1 to N expected in the field where the system is used. In addition, for each speech model 23 (speech models 1 to N), a noise model 24 (noise models 1 to N) of the noise environment used in training that speech model is also prepared.

[0056] The speech model selection unit 25 (speech model selection means) compares the estimated noise 22 output by the noise estimation unit 21 with the noise model 24 stored with each speech model 23, and selects, from among the speech models 1 to N, the speech model having the most similar noise model.

[0057] The speech recognition unit 26 performs HMM recognition processing using the speech model selected by the speech model selection unit 25 and outputs the result as the recognition result 27.

[0058] The operation of the speech recognition system configured as described above is explained below.

[0059] The speech input unit 10 receives sound from a microphone or the like and converts the signal into a digital signal by A/D conversion. The speech analysis unit 11 performs, for example, LPC analysis on the signal from the speech input unit 10, extracts its features, and outputs them to the noise estimation unit 21 and the speech recognition unit 26.

【００６０】雑音推定部２１では、入力信号に含まれる
雑音を推定する。推定方法としては従来様々な手法が提
案されており、それを用いればよい。例えば、信号中の
発声区間を同定し、非発声区間での信号を雑音とみなす
などすればよい。The noise estimation unit 21 estimates the noise contained in the input signal. Various estimation methods have been proposed, and any of them may be used. For example, the speech segments in the signal may be identified, and the signal in the non-speech segments may be regarded as noise.
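
The example given above, treating the signal in non-speech segments as noise, can be sketched with a simple energy threshold. This is a minimal sketch only: the patent leaves the detection method open, and the threshold rule, the `ratio` parameter, and the function names are assumptions introduced here.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def noise_frame_indices(frames, ratio=4.0):
    """Indices of frames regarded as non-speech (noise).

    Assumption: a frame counts as noise when its energy is at most
    `ratio` times the lowest frame energy in the input.
    """
    energies = [frame_energy(f) for f in frames]
    threshold = min(energies) * ratio
    return [i for i, e in enumerate(energies) if e <= threshold]


def estimate_noise(frames, ratio=4.0):
    """Return the frames regarded as noise; these play the role of the estimated noise 22."""
    return [frames[i] for i in noise_frame_indices(frames, ratio)]
```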

【００６１】音声モデル２３（音声モデル１〜Ｎ）につ
いては第１の実施形態で説明したが、ここではさらに各
音声モデル１〜Ｎごとに、学習の際の雑音環境の雑音モ
デル２４（雑音モデル１〜Ｎ）も別個用意する。雑音モ
デル２４の作成方法としては、騒音環境下で実際に発声
した音声データベースを用いた場合には、非発話区間で
の信号を雑音とみなし、ＨＭＭ学習によって求めること
ができるし、雑音データから計算機上で擬似的に作成し
た音声データベースを用いた場合には、もともとの雑音
データからＨＭＭ学習によって求めればよい。こうして
音声モデル１〜Ｎに対応して雑音モデル１〜Ｎを作成す
る。The speech models 23 (speech models 1 to N) were described in the first embodiment; here, in addition, a noise model 24 (noise models 1 to N) of the noise environment at training time is prepared separately for each of speech models 1 to N. The noise model 24 can be created as follows. When a speech database actually uttered in a noisy environment is used, the signal in the non-speech segments is regarded as noise and the noise model is obtained from it by HMM training. When a speech database synthesized on a computer from noise data is used, the noise model is obtained by HMM training from the original noise data. In this way, noise models 1 to N are created corresponding to speech models 1 to N.
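
In the simplest case the description mentions for the noise model, a one-state, one-mixture HMM, training reduces to estimating a mean and a variance per feature dimension from the noise frames. The sketch below assumes that case and a diagonal covariance; `fit_noise_model`, the variance floor, and the dictionary layout that pairs each noise environment with its noise model are hypothetical.

```python
def fit_noise_model(noise_frames, var_floor=1e-6):
    """Estimate (mean, var) per feature dimension from noise feature vectors.

    Assumption: the noise model is a single diagonal Gaussian, i.e. a
    one-state, one-mixture HMM with transition probabilities ignored.
    """
    n = len(noise_frames)
    dim = len(noise_frames[0])
    mean = [sum(f[d] for f in noise_frames) / n for d in range(dim)]
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in noise_frames) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def build_noise_models(noise_frames_by_environment):
    """Map each noise environment (1..N) to its noise model (mean, var)."""
    return {
        env: fit_noise_model(frames)
        for env, frames in noise_frames_by_environment.items()
    }
```

The variance floor only guards against a zero variance when a dimension is constant in the training data; a multi-state or multi-mixture noise model would need full HMM training instead.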

【００６２】音声モデル選択部２５では、音声モデル１
〜Ｎのうちで、雑音推定部２１の出力の推定雑音２２と
各音声モデル２３に蓄えられた雑音モデル２４とを比較
し、最も類似している雑音モデルを持った音声モデルを
選択する。類似尺度としては、雑音モデルを用いて従来
のマッチング手法（ビタビ照合など）で推定雑音とのマ
ッチングを行い、結果として出力されるマッチング度
（ＨＭＭでは確率）を用いる。雑音モデルとして１ステ
ート、１混合のＨＭＭモデルを選んだ場合には、雑音モ
デルは平均と分散のみで表されるので、推定雑音からあ
らかじめその平均と分散を計算し、それと雑音モデルと
の間のガウス距離を類似尺度として用いてもよい。The speech model selection unit 25 compares the estimated noise 22 output by the noise estimation unit 21 with the noise model 24 stored with each speech model 23 and, from among speech models 1 to N, selects the speech model whose noise model is most similar. As the similarity measure, the estimated noise is matched against each noise model by a conventional matching technique (such as Viterbi matching), and the resulting degree of matching (a probability, in the case of an HMM) is used. When a one-state, one-mixture HMM is chosen as the noise model, the noise model is represented by a mean and a variance alone, so the mean and variance of the estimated noise may be computed in advance and the Gaussian distance between them and the noise model used as the similarity measure.
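
The two similarity measures described above can be sketched as follows for one-state, one-mixture noise models with diagonal covariance. With a single state, Viterbi matching reduces to summing the log emission probabilities of the frames (the self-transition probability is ignored here, an assumption). The patent does not define the "Gaussian distance"; the symmetric Kullback-Leibler divergence used below is one possible reading, swapped in for illustration. All function names are hypothetical.

```python
import math


def log_likelihood(frames, mean, var):
    """Log probability of the frames under a diagonal Gaussian noise model."""
    total = 0.0
    for x in frames:
        for xd, m, v in zip(x, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
    return total


def gaussian_distance(mean_a, var_a, mean_b, var_b):
    """Symmetric KL divergence between two diagonal Gaussians (assumed measure)."""
    d = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        d += 0.5 * (va / vb + vb / va - 2.0)
        d += 0.5 * (ma - mb) ** 2 * (1.0 / va + 1.0 / vb)
    return d


def select_by_likelihood(estimated_frames, noise_models):
    """Index of the noise model giving the estimated noise the highest probability."""
    scores = [log_likelihood(estimated_frames, m, v) for m, v in noise_models]
    return scores.index(max(scores))


def select_by_distance(est_mean, est_var, noise_models):
    """Index of the noise model closest to the estimated noise statistics."""
    dists = [gaussian_distance(est_mean, est_var, m, v) for m, v in noise_models]
    return dists.index(min(dists))
```

The returned index identifies the speech model to hand to the speech recognition unit 26. The distance-based variant needs only one mean and variance computation per utterance, which is cheaper than scoring every frame against every noise model.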

【００６３】音声認識部２６では、音声モデル選択部２
５によって選択された音声モデルを用いてＨＭＭ認識処
理を行い、認識結果２７を出力して処理を終える。この
場合、第１の実施形態のように複数の音声モデルについ
て認識処理を行う必要はない。The speech recognition unit 26 performs HMM recognition processing using the speech model selected by the speech model selection unit 25, outputs the recognition result 27, and ends the processing. In this case, unlike the first embodiment, recognition processing need not be performed for multiple speech models.
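
To show what recognizing with only the selected speech model involves, the sketch below scores an observation sequence against the word HMMs of a single speech model with log-domain Viterbi matching. It assumes discrete observation symbols for brevity, whereas the system described here works on analysis features; `viterbi_log_score`, `recognize`, and the model layout are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log_score(obs, start, trans, emit):
    """Log probability of the best state path for a discrete-output HMM.

    start[s]: initial probability, trans[p][s]: transition p -> s,
    emit[s][o]: probability that state s emits symbol o.
    """
    states = range(len(start))
    score = [safe_log(start[s]) + safe_log(emit[s][obs[0]]) for s in states]
    for o in obs[1:]:
        score = [
            max(score[p] + safe_log(trans[p][s]) for p in states)
            + safe_log(emit[s][o])
            for s in states
        ]
    return max(score)


def recognize(obs, speech_model):
    """Return the word whose HMM gives the highest Viterbi score.

    speech_model maps word -> (start, trans, emit) for the selected model only.
    """
    return max(speech_model, key=lambda w: viterbi_log_score(obs, *speech_model[w]))
```

In the first embodiment the equivalent of `recognize` would run once for each of speech models 1 to N; here it runs once, on the selected model.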

【００６４】以上説明したように、第２の実施形態に係
る音声認識システムは、さらに各音声モデル２３（音声
モデル１〜Ｎ）ごとに、音声モデル１〜Ｎに対応して学
習の際の雑音環境の雑音モデル２４（雑音モデル１〜
Ｎ）も用意し、音声モデル１〜Ｎのうちで、雑音推定部
２１の出力の推定雑音２２と各音声モデルに蓄えられた
雑音モデル２４とを比較し、最も類似している雑音モデ
ルを持った音声モデルを選択する音声モデル選択部２５
と、音声モデル選択部２５によって選択された音声モデ
ルを用いてＨＭＭ認識処理を行う音声認識部２６とを備
え、入力から雑音成分を推定し、複数の音声モデルのう
ち、その推定雑音に最も類似した雑音モデルを持つ音声
モデルをあらかじめ選択して、その音声モデルについて
のみ認識処理を行うようにしたので、第１の実施形態と
比べて、音声認識部２６の処理の負荷が大幅に軽減さ
れ、リアルタイムの音声認識が可能になる。As described above, the speech recognition system according to the second embodiment additionally prepares, for each speech model 23 (speech models 1 to N), a noise model 24 (noise models 1 to N) of the noise environment at training time. It comprises the speech model selection unit 25, which compares the estimated noise 22 output by the noise estimation unit 21 with the noise model 24 stored with each speech model and selects, from among speech models 1 to N, the speech model whose noise model is most similar, and the speech recognition unit 26, which performs HMM recognition processing using the speech model selected by the speech model selection unit 25. Because the noise component is estimated from the input, the speech model whose noise model is most similar to the estimated noise is selected in advance from the plurality of speech models, and recognition processing is performed only for that speech model, the processing load on the speech recognition unit 26 is greatly reduced compared with the first embodiment, and real-time speech recognition becomes possible.

【００６５】したがって、このような優れた特長を有す
る音声認識システムを、例えば音声による操作が可能な
カーナビゲーションに用いられる耐雑音音声認識システ
ムに適用すれば、この装置において認識率の大幅な向上
を図ることができる。Therefore, if a speech recognition system having these advantages is applied to, for example, a noise-tolerant speech recognition system for voice-operated car navigation, the recognition rate of that device can be greatly improved.

【００６６】なお、上記各実施形態に係る音声認識シス
テム及び音声認識方法は、音声モデルを用いて音声認識
を行う音声認識システムには全て適用することができ、
各種端末に組み込まれる回路として実施することもでき
る。The speech recognition system and speech recognition method according to each of the above embodiments can be applied to any speech recognition system that performs speech recognition using speech models, and can also be implemented as a circuit incorporated in various terminals.

【００６７】また、上記各実施形態では、複数の音声モ
デルが、ＨＭＭ学習により作成された音声モデル、特
に、異なる雑音環境下で発声した音声データベースを基
にＨＭＭ学習により作成した雑音環境に対応した音声モ
デルであることが望ましいが、雑音の程度、種類等に応
じて複数の音声モデルを用いるものであればどのような
方法であってもよい。In each of the above embodiments, the plurality of speech models are preferably speech models created by HMM training, in particular speech models corresponding to noise environments that are created by HMM training from speech databases uttered under different noise environments. However, any method may be used as long as a plurality of speech models are used according to the degree, type, and so on of the noise.

【００６８】さらに、上記各実施形態に係る音声認識シ
ステムを構成する各処理部や各種プロセスの数、種類接
続状態などは前述した各実施形態に限られない。Further, the number, types, and connections of the processing units and processes constituting the speech recognition system according to each of the above embodiments are not limited to those of the embodiments described above.

【００６９】[0069]

【発明の効果】本発明に係る音声認識システム及び音声
認識方法では、異なる雑音環境に対応して用意された複
数の音声モデルと、各音声モデルを用いて音声認識処理
を行う音声認識手段と、音声認識手段により認識した認
識結果のうちの一つを選択する音声モデル選択手段とを
備えているので、雑音環境が様々に変化するフィールド
において、雑音環境が様々に変化してもそれに追随して
常に良好な認識率を上げることができる。The speech recognition system and speech recognition method according to the present invention comprise a plurality of speech models prepared for different noise environments, speech recognition means for performing speech recognition processing using each speech model, and speech model selection means for selecting one of the recognition results produced by the speech recognition means. Consequently, in a field where the noise environment changes in various ways, the system can follow those changes and always achieve a good recognition rate.

【００７０】本発明に係る音声認識システム及び音声認
識方法では、異なる雑音環境に対応して用意された複数
の音声モデルと、認識時の雑音環境に応じて最適な音声
モデルを選択する音声モデル選択手段と、音声モデル選
択手段により選択された音声モデルを用いて音声認識処
理を行う音声認識手段とを備えているので、音声認識処
理の負荷を大幅に軽減しつつ認識率を上げることがで
き、リアルタイムの音声認識が可能になる。The speech recognition system and speech recognition method according to the present invention comprise a plurality of speech models prepared for different noise environments, speech model selection means for selecting the optimal speech model according to the noise environment at the time of recognition, and speech recognition means for performing speech recognition processing using the speech model selected by the speech model selection means. Consequently, the recognition rate can be raised while the load of speech recognition processing is greatly reduced, and real-time speech recognition becomes possible.

[Brief description of the drawings]

【図１】本発明を適用した第１の実施形態に係る音声認
識システムの構成及び処理を示すフローチャートであ
る。FIG. 1 is a flowchart showing the configuration and processing of a speech recognition system according to a first embodiment to which the present invention has been applied.

【図２】本発明を適用した第２の実施形態に係る音声認
識システムの構成及び処理を示すフローチャートであ
る。FIG. 2 is a flowchart showing the configuration and processing of a speech recognition system according to a second embodiment to which the present invention has been applied.

【図３】従来のＨＭＭを用いた連続音声認識システムの
構成を示すブロック図である。FIG. 3 is a block diagram showing a configuration of a conventional continuous speech recognition system using an HMM.

【図４】音声認識方法に用いられる単語ヒドン・マルコ
フ・モデルの構造を示す図である。FIG. 4 is a diagram showing a structure of a word Hidden Markov Model used in a speech recognition method.

[Explanation of symbols]

１０音声入力部、１１音声分析部、１２，２３複
数の音声モデル（音声モデル１〜Ｎ）、１３音声認識
部（音声認識手段）、１４確率比較部（音声モデル選
択手段）、１５認識結果、２１雑音推定部（雑音推
定手段）、２２推定雑音、２４雑音モデル、２５音
声モデル選択部（音声モデル選択手段）、２６音声認
識部、２７認識結果Reference signs: 10 voice input unit; 11 voice analysis unit; 12, 23 plurality of speech models (speech models 1 to N); 13 speech recognition unit (speech recognition means); 14 probability comparison unit (speech model selection means); 15 recognition result; 21 noise estimation unit (noise estimation means); 22 estimated noise; 24 noise model; 25 speech model selection unit (speech model selection means); 26 speech recognition unit; 27 recognition result

Claims

[Claims]

1. A speech recognition system for performing speech recognition using a speech model, comprising: a plurality of speech models prepared for different noise environments; speech model selection means for selecting an optimal speech model according to the noise environment at the time of recognition; and speech recognition means for performing speech recognition processing using the speech model selected by the speech model selection means.

2. A speech recognition system for performing speech recognition using a speech model, comprising: a plurality of speech models prepared for different noise environments; speech recognition means for performing speech recognition processing using each speech model; and speech model selection means for selecting one of the recognition results produced by the speech recognition means.

3. The speech recognition system according to claim 1, wherein the speech recognition process is a speech recognition process based on a Hidden Markov Model (HMM).

4. The speech recognition system according to any one of claims 1, 2, and 3, wherein the speech models are speech models prepared according to the different noise levels and types assumed in the field where the system is used.

5. The speech recognition system according to claim 1, wherein the plurality of speech models are HMM word models or HMM phoneme models.

6. The speech recognition system according to claim 1, wherein the plurality of speech models are speech models corresponding to noise environments, created by HMM training from speech databases uttered under different noise environments.

7. The speech recognition system according to claim 2, wherein the speech recognition means performs speech recognition processing individually for each speech model, and the speech model selection means selects the speech model that yields the highest recognition probability in the speech recognition processing.

8. The speech recognition system according to claim 1, further comprising noise estimation means for estimating noise included in the input, wherein each speech model includes a corresponding noise model, and the speech model selection means compares the estimated noise produced by the noise estimation means with the noise model of each speech model and selects the speech model having the most similar noise model.

9. The speech recognition system according to claim 8, wherein the comparison between the estimated noise and the noise model is performed by matching the noise model against the estimated noise, and the degree of matching is used as the measure of similarity.

10. The speech recognition system according to claim 9, wherein said matching processing is Viterbi matching.

11. A speech recognition method for performing speech recognition using a speech model, comprising: preparing a plurality of speech models corresponding to different noise environments; selecting an optimal speech model according to the noise environment at the time of recognition; and performing speech recognition processing using the selected speech model.

12. A speech recognition method for performing speech recognition using a speech model, comprising: preparing a plurality of speech models corresponding to different noise environments; performing speech recognition processing using each speech model; and selecting one of the resulting recognition results.

13. The speech recognition method according to claim 11, further comprising estimating noise included in the input, wherein each speech model includes a corresponding noise model, and, in selecting the speech model, the estimated noise is compared with the noise model of each speech model and the speech model having the most similar noise model is selected.