JP3704080B2

JP3704080B2 - Speech recognition method, speech recognition apparatus, and speech recognition program

Info

Publication number: JP3704080B2
Application number: JP2001342291A
Authority: JP
Inventors: 亮典小柴
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-11-07
Filing date: 2001-11-07
Publication date: 2005-10-05
Anticipated expiration: 2021-11-07
Also published as: JP2003140684A

Description

[0001]
[Technical field of the invention]
The present invention relates to speech recognition using a hidden Markov model.
[0002]
[Prior art]
In recent years, methods using a hidden Markov model (HMM) with mixed multidimensional normal distributions (that is, Gaussian mixture output distributions) have been studied and applied as an effective approach to speech recognition. The approach is described in detail in, for example, Seiichi Nakagawa, “Speech recognition using a probabilistic model”, IEICE, 1998. The speech recognition method using an HMM with mixed multidimensional normal distributions, as described in that and similar references, is briefly outlined below.
[0003]
First, for each structural unit of speech (phoneme, syllable, word, etc.), the speech is represented by a model consisting of states and arcs connecting them, as shown in FIG. 2. Each state has state transition probabilities aij, the probability of a transition from state i to state j. Each arc has an output probability bij(y) for the acoustic feature output during the transition. Here i and j are state numbers and y denotes an acoustic feature.
[0004]
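An HMM with mixed multidimensional normal distributions gives the output probability bij(y) by the following equation (1).
[0005]
[Expression 1]
$$ b_{ij}(y) = \sum_{m} \lambda_{ijm} \, N(y;\, \mu_{ijm}, \Sigma_{ijm}) $$
where N(y; μ, Σ) denotes the density of a multidimensional normal distribution with mean vector μ and covariance matrix Σ.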
[0006]
Here, λijm is the branching probability (mixture weight) of the m-th multidimensional normal distribution on the path from state i to state j, and μijm and Σijm are the mean vector and covariance matrix of that m-th distribution, respectively.
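As an illustration of the mixture output probability described above, the following minimal Python sketch evaluates bij(y) for a single arc as a weighted sum of multidimensional normal densities. The function names and the two-component example parameters are hypothetical and chosen only for illustration; they do not come from the patent.

```python
import numpy as np

def gaussian_pdf(y, mean, cov):
    """Density of a multidimensional normal distribution N(y; mean, cov)."""
    y = np.asarray(y, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    k = mean.shape[0]
    diff = y - mean
    # Solve cov * x = diff instead of forming the explicit inverse.
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * mahalanobis_sq) / norm)

def mixture_output_probability(y, weights, means, covs):
    """b_ij(y) = sum over m of lambda_ijm * N(y; mu_ijm, Sigma_ijm) for one arc."""
    return sum(w * gaussian_pdf(y, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# Hypothetical two-component mixture in two dimensions.
weights = [0.6, 0.4]                      # branching probabilities, sum to 1
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.diag([0.5, 2.0])]
b = mixture_output_probability(np.array([0.0, 0.0]), weights, means, covs)
```

The sketch uses full covariance matrices; with diagonal covariances the same value could be computed per dimension without a linear solve.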
[0007]
To recognize unspecified speakers with high accuracy using this method, these parameters must be trained in advance on data collected from a plurality of speakers.
[0008]
In practice, however, it is impossible to prepare training data that covers every speaker who will use the system, so in a real environment recognition accuracy may drop significantly for some speakers.
[0009]
This decrease in recognition accuracy is described with reference to FIG. 3.
[0010]
In FIG. 3, the normal distribution drawn with a solid line is the output probability distribution of a certain phoneme, learned from the utterances of a plurality of speakers. In contrast, the output probability distribution of a speaker whose utterances differ acoustically from those of the speakers collected as training data (call this speaker A) is drawn with a broken line. For simplicity, the normal distribution in FIG. 3 is one-dimensional. When the feature vector of speech uttered by speaker A is y as shown in FIG. 3, the output probability calculated from the learned normal distribution is P. This P is smaller than the output probability PT calculated from speaker A's actual output probability distribution. In other words, the output probability for this phoneme becomes unduly small, which lowers recognition accuracy. Therefore, when the training data is insufficient, or when many unspecified speakers use the system, the learned output probability distribution does not cover the acoustic distributions of all anticipated speakers, and recognition accuracy may drop significantly for some speakers.
[0011]
To address this problem, methods have been proposed that use speaker adaptation to adaptively learn the mean vector and covariance matrix for each speaker. However, relearning these parameters by speaker adaptation requires a certain amount of data from each speaker, so the approach is unsuitable when speakers change one after another within a short time.
[0012]
As described above, conventional speech recognition using an HMM with mixed multidimensional normal distributions has the problem that, when the data prepared for training is insufficient, adequate recognition accuracy cannot be obtained for speakers whose utterances deviate from the training data, so unspecified speakers cannot be handled.
[0013]
[Problems to be solved by the invention]
In speech recognition using an HMM, it is valuable in practice to obtain the required recognition accuracy even for speakers whose utterances differ acoustically from the utterances prepared as training data.
[0014]
The present invention has been made in view of these circumstances, and its object is to provide a speech recognition method, apparatus, and program capable of highly accurate speech recognition even for speakers who were not covered by the HMM training.
[0015]
[Means for Solving the Problems]
The speech recognition method according to the present invention detects utterance sections of speech from an input speech signal, extracts a feature vector sequence by analyzing the speech signal of each detected utterance section, calculates a matching score by pattern matching between the extracted feature vector sequence and hidden Markov models that are prepared in advance for each predetermined recognition candidate and have normal distributions, and determines the recognition result of the speech based on the calculated matching score. In this method, the pattern matching includes a likelihood calculation of the feature vector sequence based on the hidden Markov model, and the likelihood calculation gives an equal output probability to feature vectors within a predetermined distance range from the mean vector of the hidden Markov model.
[0016]
[Detailed description of the invention]
Embodiments of the present invention are described below with reference to the drawings. The embodiments relate to speech recognition using a hidden Markov model (HMM) with mixed multidimensional normal distributions, and in particular to a speech recognition apparatus that calculates the matching likelihood with a matching scheme that is robust against speaker-dependent variation in the multidimensional normal distributions.
[0017]
[First Embodiment]
FIG. 1 is a block diagram showing the schematic configuration of a speech recognition apparatus according to the first embodiment of the present invention. The apparatus comprises: an utterance section detection unit 101 that acoustically analyzes an input speech signal to detect utterance sections; a feature vector extraction unit 102 that extracts feature vectors by analyzing the speech signal of each utterance section detected by the utterance section detection unit 101; a standard feature pattern storage unit 104 that stores, as the standard feature pattern of each predetermined recognition candidate, a previously trained hidden Markov model (HMM) with mixed multidimensional normal distributions; a pattern matching unit 103 that calculates a matching score (likelihood calculation) by pattern matching between the feature vector sequence extracted by the feature vector extraction unit 102 and the HMM of each recognition candidate stored in the standard feature pattern storage unit 104; and a recognition result determination unit 105 that determines the recognized utterance content from the matching scores obtained for each recognition candidate by the pattern matching unit 103. In the configuration of FIG. 1, a speech input unit may also be connected, comprising a microphone and an A/D (analog/digital) converter that receive the speaker's speech and convert it into a digital electrical signal (digital speech signal). In that case, the input speech signal to the utterance section detection unit 101 is supplied from the speech input unit.
[0018]
A speech recognition process performed by the speech recognition apparatus according to the present embodiment configured as described above will be described.
[0019]
First, the utterance section detection unit 101 detects the utterance sections of speech in the input speech signal. The feature vector extraction unit 102 then performs frequency analysis on the speech signal of each detected utterance section, for each of a plurality of predetermined frequency bands. This converts the speech signal of the utterance section into a feature vector sequence (feature vector time series) {xt}. Each feature vector (feature parameter) is computed over a unit of fixed time length called a frame.
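The frame-by-frame frequency analysis described above can be sketched as follows. This is only one possible realization under stated assumptions: the frame length, hop size, Hann window, equal-width bands, and log band energies are all hypothetical choices for illustration, since the text does not fix them.

```python
import numpy as np

def band_energy_features(signal, frame_len=256, hop=128, n_bands=8):
    """Convert a 1-D signal into a sequence of log band-energy vectors {x_t}."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, n_bands))              # too short for a single frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    features = []
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2    # power spectrum of the frame
        bands = np.array_split(power, n_bands)     # equal-width frequency bands
        energies = np.array([band.sum() for band in bands])
        features.append(np.log(energies + 1e-10))  # small floor avoids log(0)
    return np.array(features)

# Hypothetical input: one second of a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
feats = band_energy_features(np.sin(2 * np.pi * 440.0 * t))
```

Each row of `feats` is one feature vector for one frame; cepstral features could be substituted without changing the rest of the pipeline.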
[0020]
Well-known feature vectors typically used for speech recognition include the power spectrum, which can be obtained with band-pass filters or a Fourier transform, and cepstrum coefficients obtained by LPC (linear prediction) analysis. This embodiment can likewise use any known feature vector; that is, the present invention is not limited to a particular type of feature vector.
[0021]
The feature vector time series extracted by the feature vector extraction unit 102 is sent to the pattern matching unit 103. In the standard feature pattern storage unit 104, predetermined recognition candidates (recognition units) are stored as HMMs having a mixed multidimensional normal distribution learned in advance.
[0022]
The pattern matching unit 103 calculates a matching score by pattern matching between the feature vector time series sent from the feature vector extraction unit 102 and the HMMs with mixed multidimensional normal distributions stored in the standard feature pattern storage unit 104.
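The text does not state which algorithm produces the matching score. As one common choice, the following sketch computes a Viterbi (best-path) log-likelihood for an HMM whose output probabilities are attached to arcs, then selects the candidate with the highest score. The function names, the model dictionary layout, and the toy two-state models are assumptions made for illustration.

```python
import numpy as np

def viterbi_log_score(features, log_a, log_b, start_state=0, final_state=None):
    """Best-path log-likelihood of a feature sequence under an arc-emission HMM.

    log_a[i, j]    : log transition probability from state i to state j
    log_b(i, j, y) : log output probability of feature y on arc (i, j)
    """
    n_states = log_a.shape[0]
    if final_state is None:
        final_state = n_states - 1
    delta = np.full(n_states, -np.inf)
    delta[start_state] = 0.0
    for y in features:
        new_delta = np.full(n_states, -np.inf)
        for j in range(n_states):
            for i in range(n_states):
                if np.isfinite(delta[i]) and np.isfinite(log_a[i, j]):
                    score = delta[i] + log_a[i, j] + log_b(i, j, y)
                    new_delta[j] = max(new_delta[j], score)
        delta = new_delta
    return delta[final_state]

def recognize(features, models):
    """Return the candidate whose HMM gives the highest matching score."""
    scores = {name: viterbi_log_score(features, m["log_a"], m["log_b"])
              for name, m in models.items()}
    return max(scores, key=scores.get), scores

def make_model(mean):
    """Hypothetical two-state left-to-right model with 1-D unit-variance outputs."""
    with np.errstate(divide="ignore"):           # log(0) = -inf marks a missing arc
        log_a = np.log(np.array([[0.5, 0.5], [0.0, 1.0]]))
    def log_b(i, j, y):
        return -0.5 * (y - mean) ** 2 - 0.5 * np.log(2.0 * np.pi)
    return {"log_a": log_a, "log_b": log_b}

models = {"zero": make_model(0.0), "five": make_model(5.0)}
best, scores = recognize([0.1, -0.2, 0.3], models)
```

Because `log_b` is passed in as a function, a modified output probability such as the one introduced in this embodiment can be substituted without touching the search itself.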
[0023]
In the present embodiment, the pattern matching unit 103 calculates the output probability based on the following equation (2). Pattern matching is performed using the output probability value.
[0024]
[Expression 2]

[0025]
Here, in equation (2), d(y − μ) denotes the distance between the vectors y and μ.
[0026]
FIG. 4 shows the output probability distribution used in this pattern matching. For simplicity, the normal distribution is one-dimensional here.
[0027]
FIG. 5 is a diagram for explaining the operational effects of the present embodiment. In the figure, y represents a feature vector at a certain time. The output probability for the feature vector y is P for the learned normal distribution, but in the present embodiment, it is Pc according to the above equation (2).
[0028]
That is, in this embodiment, even when the output probability distribution of a speaker who differs acoustically from the speakers in the training data is as shown by the broken line, the output probability always takes the same value Pc within distance D of the mean vector of the learned normal distribution. This prevents the output probability from becoming unduly low on the steep part of the curve because of the shift between the normal distribution curves; that is, it prevents the value that should be read from the broken-line curve from being replaced by the much smaller value at point P on the curve learned from the training data. The matching score therefore does not fall for this reason, and recognition accuracy is not degraded.
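A one-dimensional numerical sketch of this effect is given below. Equation (2) itself is not reproduced in this text, so the form used here is an assumption: within distance D of the mean the output probability equals the value at the mean (Pc), as the claims state, and outside that range the ordinary normal density is used. The distance is taken as the absolute difference, and all numbers are hypothetical.

```python
import math

def normal_pdf(y, mean, sigma):
    """One-dimensional normal density."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def flat_top_output_probability(y, mean, sigma, D):
    """Give every y within distance D of the mean the value at the mean (Pc).

    Assumption: outside the range the ordinary normal density is used.
    """
    if abs(y - mean) <= D:
        return normal_pdf(mean, mean, sigma)   # Pc, the peak value
    return normal_pdf(y, mean, sigma)

# Hypothetical numbers: learned distribution N(0, 1), speaker A centred at 1.5.
y = 1.5
P = normal_pdf(y, 0.0, 1.0)                              # learned model, unmodified
PT = normal_pdf(y, 1.5, 1.0)                             # speaker A's own distribution
Pc = flat_top_output_probability(y, 0.0, 1.0, D=2.0)     # flattened learned model
```

With these numbers P is well below PT, while Pc equals the peak value and so matches PT here because both distributions share the same variance. Under the assumed form the flattened function no longer integrates to 1, so it acts as a score rather than a normalized density.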
[0029]
According to the first embodiment of the present invention, even when an HMM with mixed multidimensional normal distributions has been trained on limited training data, highly accurate speech recognition is possible without lowering the matching score for speakers outside the training data.
[0030]
[Second Embodiment]
Next, a second embodiment of the present invention is described. The second embodiment is a modification of the first embodiment described above, and its basic configuration is the same as that shown in FIG. 1 for the first embodiment.
[0031]
In the first embodiment described above, the output probability is calculated based on the equation (2). The second embodiment is configured to set a range that gives an equal output probability independently for each dimension of a feature vector (a dimension of a mixed multidimensional normal distribution) when calculating the output probability.
[0032]
In this case, the output probability can be calculated as shown in Equation (3).
[0033]
[Expression 3]

[0034]
FIG. 6 is a diagram for explaining the operational effects of the second embodiment. In FIG. 5 described in the first embodiment, for simplicity, each dimension of the feature vector is assumed to be uncorrelated, that is, the covariance matrix is diagonal.
[0035]
FIG. 6 shows a normal distribution for a first dimension with variance σi and mean μi, and a normal distribution for a second dimension with variance σj and mean μj. According to the second embodiment, based on the above equations (4) and (5), the range that is given an equal output probability can be varied for each of the dimensions i and j. The second embodiment therefore provides the same effects as the first embodiment and, in addition, allows the robustness of the output probability to be adjusted with the magnitude of the variance of the learned normal distribution taken into account.
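The per-dimension ranges of the second embodiment can be sketched as follows. The equations for this embodiment are not reproduced in this text, so the details are assumptions: a diagonal covariance, `sigma[k]` treated as the standard deviation of dimension k, a flat range D_k = alpha * sigma[k] that widens with the spread of each dimension, and the ordinary normal density outside the range. The function name and the parameter `alpha` are hypothetical.

```python
import numpy as np

def per_dimension_flat_top(y, mean, sigma, alpha=1.0):
    """Product over dimensions of flat-topped 1-D normal densities.

    Assumptions: diagonal covariance, sigma[k] is the standard deviation of
    dimension k, and the flat range of dimension k is D_k = alpha * sigma[k].
    """
    y, mean, sigma = (np.asarray(v, dtype=float) for v in (y, mean, sigma))
    D = alpha * sigma
    # Inside the range, treat the deviation as zero so the density equals its peak.
    diff = np.where(np.abs(y - mean) <= D, 0.0, y - mean)
    dens = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.prod(dens))

# Hypothetical two-dimensional example: the second dimension has a wider range.
mean = [0.0, 0.0]
sigma = [1.0, 2.0]
peak = per_dimension_flat_top(mean, mean, sigma)
inside = per_dimension_flat_top([0.9, -1.9], mean, sigma)   # both dimensions in range
outside = per_dimension_flat_top([0.5, 3.0], mean, sigma)   # second dimension out of range
```

In this example `inside` equals `peak`, while `outside` is reduced only by the second dimension, whose deviation of 3.0 exceeds its range of 2.0.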
[0036]
The present invention is not limited to the above-described embodiment, and can be implemented with various modifications.
[0037]
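For example, the above embodiments describe speech recognition using an HMM with mixed multidimensional normal distributions, but the present invention is also applicable to speech recognition using an HMM with multidimensional normal distributions that are not mixtures, or an HMM with normal distributions that are neither mixtures nor multidimensional.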
[0038]
[Effects of the invention]
As described above, according to the present invention, the matching likelihood calculation in speech recognition using an HMM with normal distributions uses an equal output probability for input vectors within a predetermined range centered on the mean vector of the normal distribution. This avoids degradation of the matching likelihood for speakers with diverse acoustic characteristics that fall outside the range of the utterances prepared as training data. Highly accurate speech recognition therefore becomes possible, which is of great practical benefit.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the basic configuration of a speech recognition apparatus according to the first embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of a hidden Markov model (HMM).
FIG. 3 is a diagram for explaining the problem of reduced recognition accuracy in conventional speech recognition, comparing on a graph the output probabilities calculated from two different normal distributions.
FIG. 4 is a diagram showing the output probability distribution when an equal output probability is given to feature vectors within a predetermined range from the mean vector.
FIG. 5 is a diagram for explaining the operational effects of the first embodiment of the present invention.
FIG. 6 is a diagram for explaining the operational effects of the second embodiment of the present invention, showing the output probability distribution when the range given an equal output probability is varied according to the variance of the normal distribution.
[Explanation of reference numerals]
101 ... utterance section detection unit
102 ... feature vector extraction unit
103 ... pattern matching unit
104 ... standard pattern storage unit
105 ... recognition result determination unit

Claims

A speech recognition method in which an utterance section of speech is detected from an input speech signal, a feature vector sequence is extracted by analyzing the speech signal of each detected utterance section, a matching score is calculated by pattern matching between the extracted feature vector sequence and hidden Markov models that are prepared in advance for each predetermined recognition candidate and have mixed multidimensional normal distributions, and the recognition result of the speech is determined based on the calculated matching score,
wherein the pattern matching includes a likelihood calculation of the feature vector sequence based on the hidden Markov model, and the likelihood calculation gives, when an input feature vector is within a predetermined distance range from the mean vector of the mixed multidimensional normal distribution of each state of the hidden Markov model, the same output probability as when the input feature vector equals that mean vector.

The speech recognition method according to claim 1, wherein the range in which the equal output probability is given to the feature vector sequence is set independently for each dimension of the mixed multidimensional normal distribution.

A speech recognition apparatus comprising:
utterance section detection means for detecting an utterance section of speech from an input speech signal;
feature vector extraction means for extracting a feature vector sequence by analyzing the speech signal of the utterance section detected by the utterance section detection means;
storage means for storing hidden Markov models that are prepared for each predetermined recognition candidate and have mixed multidimensional normal distributions;
pattern matching means for calculating a matching score for each recognition candidate by pattern matching between the feature vector sequence extracted by the feature vector extraction means and the hidden Markov models stored in the storage means, the pattern matching means including calculation means for calculating the output probability of the feature vector sequence such that, when an input feature vector is within a predetermined distance range from the mean vector of the mixed multidimensional normal distribution of each state of the hidden Markov model, the output probability takes the same value as when the input feature vector equals that mean vector; and
recognition result determination means for determining a recognition result based on the matching score for each recognition candidate obtained by the pattern matching means.

A speech recognition program for causing a computer to function as:
utterance section detection means for detecting an utterance section of speech from an input speech signal;
feature vector extraction means for extracting a feature vector sequence by analyzing the speech signal of the utterance section detected by the utterance section detection means;
storage means for storing hidden Markov models that are prepared for each predetermined recognition candidate and have mixed multidimensional normal distributions;
pattern matching means for calculating a matching score for each recognition candidate by pattern matching between the feature vector sequence extracted by the feature vector extraction means and the hidden Markov models stored in the storage means, the pattern matching means including calculation means for calculating the output probability of the feature vector sequence such that, when an input feature vector is within a predetermined distance range from the mean vector of the mixed multidimensional normal distribution of each state of the hidden Markov model, the output probability takes the same value as when the input feature vector equals that mean vector; and
recognition result determination means for determining a recognition result based on the matching score for each recognition candidate obtained by the pattern matching means.