JP2003140684A

JP2003140684A - Method, device, and program for voice recognition

Info

Publication number: JP2003140684A
Application number: JP2001342291A
Authority: JP
Inventors: Akinori Koshiba; 亮典小柴
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-11-07
Filing date: 2001-11-07
Publication date: 2003-05-16
Anticipated expiration: 2021-11-07
Also published as: JP3704080B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition method and device that recognize, with high accuracy, the speech of speakers other than those used to train the HMM. SOLUTION: In speech recognition using an HMM with a mixture multidimensional normal distribution, an equal output probability is used for input feature vectors within a specified distance of the mean vector of the mixture multidimensional normal distribution. This prevents recognition accuracy from degrading for speakers whose speech is acoustically different from, and falls outside the range of, the utterances prepared as training data.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to speech recognition using a hidden Markov model.

[0002]

[Description of the Related Art] In recent years, methods using a hidden Markov model (HMM) with a mixture multidimensional normal distribution have been studied and applied as an effective means of speech recognition. The method is described in detail in, for example, the reference (Seiichi Nakagawa, "Speech Recognition by Stochastic Models", IEICE, 1998). The speech recognition method using an HMM with a mixture multidimensional normal distribution described in such references is briefly explained below.

[0003] First, speech is represented, for each of its constituent units (phonemes, syllables, words, and so on), by a model consisting of states and the arcs connecting them, as shown in FIG. 2. Each state has state transition probabilities a_ij indicating the probability of a transition from that state to another state. Each arc has an output probability b_ij(y) of the acoustic feature that is output at the time of the transition. Here, i and j denote state numbers, and y denotes an acoustic feature.

[0004] An HMM with a mixture multidimensional normal distribution gives the output probability b_ij(y) by the following equation (1).

[0005]

[Equation 1]

[0006] Here, λ_ijm denotes the branch probability of the m-th multidimensional normal distribution on the path transitioning from state i to state j. μ_ijm and Σ_ijm denote the mean vector and the covariance matrix of the m-th multidimensional normal distribution, respectively.
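The image for equation (1) is not reproduced in this text. Judging only from the symbol definitions in paragraph [0006], it presumably has the standard Gaussian-mixture form sketched below. This is a reconstruction under that assumption, not the patent's own typesetting; the dimension symbol n and the density notation N(·) are introduced here for illustration.

```latex
b_{ij}(y) = \sum_{m} \lambda_{ijm}\, \mathcal{N}\!\left(y;\ \mu_{ijm}, \Sigma_{ijm}\right),
\qquad
\mathcal{N}(y;\ \mu, \Sigma) =
\frac{1}{(2\pi)^{n/2}\,\lvert\Sigma\rvert^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(y-\mu)^{\top}\Sigma^{-1}(y-\mu)\right)
```

Under this reading, the branch probabilities λ_ijm act as mixture weights for the arc from state i to state j.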

[0007] To recognize unspecified speakers with high accuracy using this method, these parameters must be learned in advance from data collected from a plurality of speakers.

[0008] In practice, however, it is impossible to create training data that anticipates every speaker who will use the system. When the system is used in a real environment, recognition accuracy may therefore drop significantly for some speakers.

[0009] This drop in recognition accuracy is explained with reference to FIG. 3.

[0010] In FIG. 3, the normal distribution drawn with a solid line is the output probability distribution of a certain phoneme, learned from the utterances of a plurality of speakers. In contrast, the output probability distribution of a speaker whose utterances are acoustically different from those of the speakers collected as training data (call this speaker A) is drawn with a broken line. For simplicity, the normal distribution in FIG. 3 is one-dimensional. When the feature vector of speech uttered by speaker A is y as shown in FIG. 3, the output probability computed from the learned normal distribution is P. This P is smaller than the output probability PT computed from speaker A's actual output probability distribution. In other words, the output probability for this phoneme becomes unreasonably small, which means that recognition accuracy degrades. Therefore, when the training data is insufficient, or when many unspecified speakers use the system, the learned output probability distribution does not cover the acoustic distributions of all expected speakers, and speakers for whom recognition accuracy drops significantly may appear.
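The effect described for FIG. 3 can be illustrated numerically with a one-dimensional sketch. All numbers below (means, standard deviations, and the observation y) are made-up values chosen for illustration; none come from the patent.

```python
import math


def normal_pdf(y, mean, std):
    """Density of a one-dimensional normal distribution at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical learned distribution (solid line in FIG. 3).
learned_mean, learned_std = 0.0, 1.0
# Hypothetical distribution of speaker A (broken line in FIG. 3).
speaker_a_mean, speaker_a_std = 1.5, 1.0

y = 2.0  # a feature value uttered by speaker A

P = normal_pdf(y, learned_mean, learned_std)        # from the learned model
PT = normal_pdf(y, speaker_a_mean, speaker_a_std)   # from speaker A's own distribution

print(f"P = {P:.4f}, PT = {PT:.4f}")
```

With these values P comes out several times smaller than PT, which is the "unreasonably small" output probability that the paragraph describes.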

[0011] To address this problem, methods have been proposed that use speaker adaptation techniques to adaptively learn the mean vector and covariance matrix for each speaker. However, re-learning these parameters through speaker adaptation requires a certain amount of data for each speaker, so the approach is not suited to situations in which speakers change one after another within a short time.

[0012] Thus, in conventional speech recognition using an HMM with a mixture multidimensional normal distribution, when the data prepared for training is insufficient, adequate recognition accuracy cannot be obtained for speakers whose utterances deviate from the training data, and the system cannot handle unspecified speakers.

[0013]

[Problems to Be Solved by the Invention] In speech recognition using an HMM, it is also valuable in practical terms to obtain the required recognition accuracy even for speakers whose utterances are acoustically different from those prepared as training data.

[0014] The present invention has been made in view of these circumstances. Its object is to provide a speech recognition method, device, and program capable of highly accurate speech recognition even for speakers who were not part of the HMM training.

[0015]

[Means for Solving the Problems] A speech recognition method according to the present invention detects utterance sections of speech from an input speech signal, extracts a feature vector sequence by analyzing the speech signal of each detected utterance section, calculates a matching score by pattern matching between the extracted feature vector sequence and hidden Markov models that are prepared in advance for each predetermined recognition candidate and have a normal distribution, and determines the recognition result of the speech based on the calculated matching score. The method is characterized in that the pattern matching includes a likelihood calculation of the feature vector sequence based on the hidden Markov model, and in that this likelihood calculation gives an equal output probability to the feature vector sequence within a predetermined distance range from the mean vector of the hidden Markov model.

[0016]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. The embodiments relate to speech recognition using a hidden Markov model (HMM) with a mixture multidimensional normal distribution, and to a speech recognition device that calculates the matching likelihood with a matching scheme that is robust against speaker-dependent variation in the multidimensional normal distribution.

[0017] [First Embodiment] FIG. 1 is a block diagram showing the schematic configuration of a speech recognition device according to the first embodiment of the present invention. The speech recognition device shown in the figure comprises: an utterance section detection unit 101 that acoustically analyzes an input speech signal to detect utterance sections; a feature vector extraction unit 102 that extracts feature vectors by analyzing the speech signal of the utterance section detected by the utterance section detection unit 101; a standard feature pattern storage unit 104 that stores hidden Markov models (HMMs) with mixture multidimensional normal distributions, trained in advance, as the standard feature patterns of the predetermined recognition candidates; a pattern matching unit 103 that calculates a matching score (likelihood calculation) by pattern matching between the feature vector sequence extracted by the feature vector extraction unit 102 and the HMM of each recognition candidate stored in the standard feature pattern storage unit 104; and a recognition result determination unit 105 that determines the recognized utterance content based on the matching score of each recognition candidate obtained by the pattern matching unit 103. In the configuration shown in FIG. 1, a speech input unit may also be connected, which includes a microphone for receiving the speech uttered by a speaker and an A/D (analog/digital) converter for converting it into a digital electrical signal (digital speech signal). In that case, the input speech signal to the utterance section detection unit 101 is supplied from the speech input unit.

[0018] The speech recognition processing performed by the speech recognition device of this embodiment, configured as described above, is explained next.

[0019] First, the utterance section detection unit 101 detects an utterance section of speech from the input speech signal. The feature vector extraction unit 102 then performs frequency analysis on the speech signal of the detected utterance section for each of a plurality of predetermined frequency bands. As a result, the speech signal of the utterance section is converted into a feature vector sequence (feature vector time series) {xt}. Each feature vector (feature parameter) is obtained in units of a fixed time length called a frame.
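As a rough illustration of frame-wise frequency analysis, the sketch below cuts a signal into fixed-length frames and computes a log energy per frequency band from an FFT power spectrum. The function name `band_energy_features`, the frame length, the hop size, the Hanning window, and the equal-width bands are all assumptions made for this example; the patent does not specify any of them.

```python
import numpy as np


def band_energy_features(signal, frame_len=256, hop=128, n_bands=8):
    """Return an array of shape (n_frames, n_bands) of log band energies.

    Hypothetical helper: frames overlap by (frame_len - hop) samples and
    the power spectrum is split into n_bands roughly equal-width bands.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    feats = np.empty((n_frames, n_bands))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum of the frame
        bands = np.array_split(power, n_bands)            # group bins into bands
        feats[t] = np.log([b.sum() + 1e-10 for b in bands])  # small floor avoids log(0)
    return feats


# Usage: a low-frequency sine puts most of its energy in the lowest band.
x = np.sin(2.0 * np.pi * 0.02 * np.arange(1024))
features = band_energy_features(x)
print(features.shape)
```

Each row of the result plays the role of one feature vector xt in the sequence {xt}.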

[0020] Well-known feature vectors used for speech recognition include the power spectrum, which can be obtained with band-pass filters or a Fourier transform, and cepstral coefficients obtained by LPC (linear prediction) analysis. This embodiment can also use such known feature vectors. That is, the present invention is not limited to any particular type of feature vector.

[0021] The time series of feature vectors extracted by the feature vector extraction unit 102 is sent to the pattern matching unit 103. The standard feature pattern storage unit 104 stores the predetermined recognition candidates (recognition units) as HMMs with mixture multidimensional normal distributions that have been trained in advance.

[0022] The pattern matching unit 103 calculates a matching score by pattern matching between the time series of feature vectors sent from the feature vector extraction unit 102 and the HMMs with mixture multidimensional normal distributions stored in the standard feature pattern storage unit 104.
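The patent does not say which algorithm produces the matching score. One common choice for HMMs is the Viterbi (best-path) log score, sketched below for the arc-emission model of paragraph [0003], where output probabilities are attached to arcs i→j. The function name, the array layout, and the convention of starting in state 0 and ending in the last state are assumptions for this example.

```python
import numpy as np


def viterbi_log_score(log_a, log_b):
    """Best-path log score of an arc-emission HMM.

    log_a: array (S, S), log transition probabilities a_ij.
    log_b: array (T, S, S), log output probability of frame t on arc i -> j.
    Assumes the path starts in state 0 and must end in state S - 1.
    """
    n_frames, n_states, _ = log_b.shape
    score = np.full(n_states, -np.inf)
    score[0] = 0.0
    for t in range(n_frames):
        # cand[i, j]: best score of reaching state j at frame t via arc i -> j
        cand = score[:, None] + log_a + log_b[t]
        score = cand.max(axis=0)
    return float(score[-1])


# Usage: a two-state left-to-right model with a constant output probability of 0.1.
with np.errstate(divide="ignore"):
    example_log_a = np.log(np.array([[0.5, 0.5], [0.0, 1.0]]))
example_log_b = np.full((2, 2, 2), np.log(0.1))
print(viterbi_log_score(example_log_a, example_log_b))
```

A recognizer built this way would compute one such score per recognition candidate and report the candidate with the highest score, which corresponds to the role of the recognition result determination unit 105.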

[0023] In this embodiment, the pattern matching unit 103 calculates the output probability based on the following equation (2). Pattern matching is performed using the resulting output probability values.

[0024]

[Equation 2]

[0025] In equation (2), d(y−μ) denotes the distance between the vectors y and μ.

[0026] FIG. 4 shows the output probability distribution used in this pattern matching. For simplicity, the normal distribution is one-dimensional here.

[0027] FIG. 5 is a diagram for explaining the effect of this embodiment. In the figure, y denotes the feature vector at a certain time. The output probability for this feature vector y is P under the learned normal distribution, but in this embodiment it becomes Pc according to equation (2) above.

[0028] That is, in this embodiment, even when the output probability distribution of a speaker who is acoustically different from the speakers prepared as training data is as shown by the broken line, the output probability is always set to the same value Pc within the range of distance D from the mean vector of the learned normal distribution. This avoids the situation in which, because of the shift between the normal distribution curves, the output probability becomes unreasonably low on the steep part of the curve, that is, the situation in which the value that should be read from the normal distribution curve drawn with the broken line is instead the value at point P on the normal distribution curve based on the training data. The matching score therefore does not drop from this cause, and recognition accuracy does not degrade.
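The image for equation (2) is not reproduced in this text, so the following is only a sketch of the behavior that paragraphs [0025] to [0028] describe: a constant output probability Pc within distance D of the mean, and the ordinary normal density outside it. Two points are assumptions of this example and are not stated in the text: the distance d is taken to be Euclidean, and Pc defaults to the density at the mean (the peak value). The function names are hypothetical, and a diagonal covariance is used for simplicity.

```python
import numpy as np


def gaussian_pdf(y, mean, var):
    """Normal density with diagonal covariance; `var` holds per-dimension variances."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    mean = np.broadcast_to(np.asarray(mean, dtype=float), y.shape)
    var = np.broadcast_to(np.asarray(var, dtype=float), y.shape)
    norm = 1.0 / np.sqrt(np.prod(2.0 * np.pi * var))
    return float(norm * np.exp(-0.5 * np.sum((y - mean) ** 2 / var)))


def flat_top_output_prob(y, mean, var, D, pc=None):
    """Output probability that is constant (Pc) within distance D of the mean."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    mean = np.broadcast_to(np.asarray(mean, dtype=float), y.shape)
    if pc is None:
        # Assumption: use the peak density as the plateau value Pc.
        pc = gaussian_pdf(mean, mean, var)
    # Assumption: d(y - mu) is the Euclidean distance.
    if np.linalg.norm(y - mean) <= D:
        return pc
    return gaussian_pdf(y, mean, var)


# Usage: inside the range the value no longer depends on y.
print(flat_top_output_prob(2.0, 0.0, 1.0, D=2.5))   # inside: Pc
print(flat_top_output_prob(0.5, 0.0, 1.0, D=2.5))   # inside: same Pc
print(flat_top_output_prob(3.0, 0.0, 1.0, D=2.5))   # outside: ordinary density
```

With the peak value as Pc, a feature vector on the steep flank of the learned curve but still within D is scored as if it were at the mean, which is one way to obtain the robustness the paragraph describes. Other plateau values can be tried by passing `pc` explicitly.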

[0029] According to the first embodiment of the present invention, even when an HMM with a mixture multidimensional normal distribution has been trained on limited training data, highly accurate speech recognition is possible without lowering the matching score for speakers outside the training data.

[0030] [Second Embodiment] Next, a second embodiment of the present invention is described. The second embodiment is a modification of the first embodiment described above, and its basic configuration is the same as that shown in FIG. 1 for the first embodiment.

[0031] The first embodiment described above is characterized by calculating the output probability based on equation (2). The second embodiment is configured so that, when the output probability is calculated, the range that gives equal output probabilities is set independently for each dimension of the feature vector (each dimension of the mixture multidimensional normal distribution).

[0032] In this case, the output probability can be calculated as in equation (3).

[0033]

[Equation 3]

[0034] FIG. 6 is a diagram for explaining the effect of the second embodiment. In FIG. 5, described for the first embodiment, it was assumed for simplicity that the dimensions of the feature vector are uncorrelated, that is, that the covariance matrix is diagonal.

[0035] FIG. 6 shows a normal distribution for a first dimension with variance σi and mean μi, and a normal distribution for a second dimension with variance σj and mean μj. According to the second embodiment, based on equations (4) and (5) above, the range that gives equal output probabilities can be varied for each of the dimensions i and j. The second embodiment therefore provides the same effect as the first embodiment and, in addition, allows the robustness of the output probability to be adjusted with the magnitude of the variance of the learned normal distribution taken into account.
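The images for equation (3) and for the equations (4) and (5) mentioned above are not reproduced in this text. The sketch below shows one way a per-dimension flat range could work under a diagonal covariance: each dimension k contributes either its ordinary one-dimensional density or, when |y_k − μ_k| ≤ D_k, a constant. Using the per-dimension peak density as that constant and tying D_k to the standard deviation are assumptions of this example, as is the function name.

```python
import numpy as np


def per_dim_flat_top_prob(y, mean, var, D):
    """Product over dimensions of 1-D densities, each flattened within its own range D[k]."""
    y, mean, var, D = (np.asarray(a, dtype=float) for a in (y, mean, var, D))
    dens = np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    peak = 1.0 / np.sqrt(2.0 * np.pi * var)   # assumption: plateau at the per-dimension peak
    inside = np.abs(y - mean) <= D             # per-dimension range test
    return float(np.prod(np.where(inside, peak, dens)))


# Usage: ranges proportional to the standard deviation of each dimension (an assumption).
example_mean = np.array([0.0, 0.0])
example_var = np.array([1.0, 4.0])
example_D = 1.0 * np.sqrt(example_var)         # D = [1.0, 2.0]
print(per_dim_flat_top_prob([0.5, 3.0], example_mean, example_var, example_D))
```

In this sketch a dimension with larger variance receives a wider flat range, which is one reading of adjusting robustness according to the variance of the learned distribution.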

[0036] The present invention is not limited to the embodiments described above and can be implemented with various modifications.

[0037] For example, the embodiments above describe speech recognition using an HMM with a mixture multidimensional normal distribution, but the present invention is also applicable to speech recognition using an HMM with a multidimensional normal distribution that is not a mixture, or an HMM with a normal distribution that is neither a mixture nor multidimensional.

[0038]

[Effects of the Invention] As described above, according to the present invention, in the matching likelihood calculation of speech recognition using an HMM with a normal distribution, an equal output probability is used for input vectors within a predetermined range centered on the mean vector of the normal distribution. This avoids degradation of the matching likelihood for speakers with diverse acoustic characteristics that fall outside the range of the utterances prepared as training data. Highly accurate speech recognition therefore becomes possible, which is of great practical benefit.

[Brief description of drawings]

FIG. 1 is a block diagram showing the basic configuration of a speech recognition device according to the first embodiment of the present invention.

FIG. 2 is a diagram showing the structure of a hidden Markov model (HMM).

FIG. 3 is a diagram for explaining the problem of degraded recognition accuracy in conventional speech recognition, comparing on a graph the output probabilities calculated from two different normal distributions.

FIG. 4 is a diagram showing the output probability distribution when equal output probabilities are given to feature vectors within a predetermined range from the mean vector.

FIG. 5 is a diagram for explaining the effect of the first embodiment of the present invention.

FIG. 6 is a diagram for explaining the effect of the second embodiment of the present invention, showing the output probability distribution when the range that gives equal output probabilities is changed according to the variance of the normal distribution.

[Explanation of symbols]

101 ... Utterance section detection unit
102 ... Feature vector extraction unit
103 ... Pattern matching unit
104 ... Standard pattern storage unit
105 ... Recognition result determination unit

Claims

1. A speech recognition method comprising: detecting utterance sections of speech from an input speech signal; extracting a feature vector sequence by analyzing the speech signal of each detected utterance section; calculating a matching score by pattern matching between the extracted feature vector sequence and a hidden Markov model that is prepared in advance for each predetermined recognition candidate and has a normal distribution; and determining a recognition result of the speech based on the calculated matching score, wherein the pattern matching includes a likelihood calculation of the feature vector sequence based on the hidden Markov model, and the likelihood calculation gives an equal output probability to the feature vector sequence within a predetermined distance range from the mean vector of the hidden Markov model.

2. A speech recognition method comprising: detecting utterance sections of speech from an input speech signal; extracting a feature vector sequence by analyzing the speech signal of each detected utterance section; calculating a matching score by pattern matching between the extracted feature vector sequence and a hidden Markov model that is prepared in advance for each predetermined recognition candidate and has a mixture multidimensional normal distribution; and determining a recognition result of the speech based on the calculated matching score, wherein the pattern matching includes a likelihood calculation of the feature vector sequence based on the hidden Markov model, and the likelihood calculation gives an equal output probability to the feature vector sequence within a predetermined distance range from the mean vector of the hidden Markov model.

3. The speech recognition method according to claim 2, wherein the range that gives equal output probabilities to the feature vector sequence is set independently for each dimension of the mixture multidimensional normal distribution.

4. A speech recognition device comprising: utterance section detection means for detecting an utterance section of speech from an input speech signal; feature vector extraction means for extracting a feature vector sequence by analyzing the speech signal of the utterance section detected by the utterance section detection means; storage means for storing a hidden Markov model that is prepared for each predetermined recognition candidate and has a normal distribution; pattern matching means for calculating a matching score for each recognition candidate by pattern matching between the feature vector sequence extracted by the feature vector extraction means and the hidden Markov model stored in the storage means, the pattern matching means including calculation means for calculating the output probability of the feature vector sequence so that it takes an equal value within a predetermined range from the mean vector of the hidden Markov model; and recognition result determination means for determining a recognition result based on the matching score of each recognition candidate obtained by the pattern matching means.

5. A speech recognition device comprising: utterance section detection means for detecting an utterance section of speech from an input speech signal; feature vector extraction means for extracting a feature vector sequence by analyzing the speech signal of the utterance section detected by the utterance section detection means; storage means for storing a hidden Markov model that is prepared for each predetermined recognition candidate and has a mixture multidimensional normal distribution; pattern matching means for calculating a matching score for each recognition candidate by pattern matching between the feature vector sequence extracted by the feature vector extraction means and the hidden Markov model stored in the storage means, the pattern matching means including calculation means for calculating the output probability of the feature vector sequence so that it takes an equal value within a predetermined range from the mean vector of the hidden Markov model; and recognition result determination means for determining a recognition result based on the matching score of each recognition candidate obtained by the pattern matching means.

6. A speech recognition program for providing the functions of: utterance section detection means for detecting an utterance section of speech from an input speech signal; feature vector extraction means for extracting a feature vector sequence by analyzing the speech signal of the utterance section detected by the utterance section detection means; storage means for storing a hidden Markov model that is prepared for each predetermined recognition candidate and has a normal distribution; pattern matching means for calculating a matching score for each recognition candidate by pattern matching between the feature vector sequence extracted by the feature vector extraction means and the hidden Markov model stored in the storage means, the pattern matching means including calculation means for calculating the output probability of the feature vector sequence so that it takes an equal value within a predetermined range from the mean vector of the hidden Markov model; and recognition result determination means for determining a recognition result based on the matching score of each recognition candidate obtained by the pattern matching means.