JPH0968996A

JPH0968996A - Voice recognition method

Info

Publication number: JPH0968996A
Application number: JP7225224A
Authority: JP
Inventors: Takashi Miki; 敬三木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1995-09-01
Filing date: 1995-09-01
Publication date: 1997-03-11

Abstract

PROBLEM TO BE SOLVED: To increase recognition speed in speech recognition that uses hidden Markov models as standard patterns. SOLUTION: When the output probability of an input speech feature vector is obtained from a hidden Markov model having an uncorrelated mixture normal distribution, the largest of the output probabilities obtained from the individual normal distributions of the mixture is taken as the output probability of the input speech feature vector. If the output probability obtained from the Q-th normal distribution was the largest in the immediately preceding frame, it is likely to be the largest in the next frame as well. The output probability obtained from the Q-th normal distribution is therefore used as a maximum value candidate. While the output probability from each of the other normal distributions is being calculated, the partial result is compared with the maximum value candidate, and the calculation is interrupted depending on the result of the comparison.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a speech recognition method using a hidden Markov model.

[0002]

[Prior art]

As disclosed, for example, in Seiichi Nakagawa, "Speech Recognition by Stochastic Models" (確率モデルによる音声認識), IEICE (1988), ISBN 4-88552-072-X, speech recognition widely uses the hidden Markov model (hereinafter, HMM) as a speech standard pattern. An HMM serving as a speech standard pattern is described by several states, for example $S_0$ to $S_3$, by the probability $a_{ij}$ of a transition from state $S_i$ to state $S_j$, and by the probability $b_{ij}(V_t)$ that a speech symbol vector $V_t$ is output during that transition. The output probability $b_{ij}(V_t)$ is generally represented by an uncorrelated mixture normal distribution composed of a plurality of normal distributions.

[0003] In speech recognition using an HMM, an input speech feature vector $x_t$ is extracted from the input speech signal for each frame of a speech section. The uncorrelated mixture normal distribution of the HMM is then used to calculate the output probability of the input speech feature vector $x_t$, namely $b_{ij}(x_t) = \sum_m \{\lambda_{ijm} b_{ijm}(x_t)\}$. Here, $\lambda_{ijm}$ is the weight of the m-th normal distribution in the uncorrelated mixture normal distribution, and $b_{ijm}(x_t)$ is the unweighted probability of the input speech feature vector $x_t$ obtained from the m-th normal distribution.
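The conventional calculation in [0003] can be sketched as follows. This is an illustration only: it assumes that "uncorrelated" means each normal distribution has a diagonal covariance matrix (given here as a list of per-component variances), and the names `gaussian_diag` and `mixture_output_probability` and the sample parameters are hypothetical.

```python
import math

def gaussian_diag(x, mean, var):
    """Unweighted probability b_ijm(x_t) of one normal distribution.

    Assumption: the covariance matrix is diagonal, so `var` lists the
    variance of each vector component and the determinant is their product.
    """
    p = len(x)
    det = math.prod(var)
    d2 = sum((xk - mk) ** 2 / vk for xk, mk, vk in zip(x, mean, var))
    return (2 * math.pi) ** (-p / 2) * det ** -0.5 * math.exp(-d2 / 2)

def mixture_output_probability(x, weights, means, variances):
    """b_ij(x_t) as the weighted sum over all M normal distributions."""
    return sum(w * gaussian_diag(x, mu, var)
               for w, mu, var in zip(weights, means, variances))

# Hypothetical two-distribution mixture over 2-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
b = mixture_output_probability([0.0, 0.0], weights, means, variances)
```

Every one of the M distributions is evaluated in full for every frame, which is the cost the invention sets out to reduce.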

[0004] Next, the likelihood between the HMM and the time series of input speech feature vectors $x_t$ extracted from the start frame to the end frame of the speech section is obtained using the output probability $b_{ij}(x_t)$ of each input speech feature vector $x_t$. The likelihood is obtained for each HMM, and the category name assigned to the HMM that gives the maximum likelihood is taken as the recognition result for the input speech signal.
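The text does not say how the likelihood is accumulated over frames. The sketch below assumes a best-path (Viterbi) score in the log domain, with outputs attached to transitions as in the $b_{ij}$ notation. The function names, the model layout, and the toy models are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_log_likelihood(log_init, log_trans, log_out, final_states, num_frames):
    """Best-path log likelihood of one HMM for a sequence of frames.

    log_init[i]      : log initial probability of state i
    log_trans[i][j]  : log a_ij (NEG_INF if the transition is not allowed)
    log_out(i, j, t) : log b_ij(x_t) for frame t = 0 .. num_frames - 1
    """
    n = len(log_init)
    score = list(log_init)
    for t in range(num_frames):
        nxt = [NEG_INF] * n
        for i in range(n):
            if score[i] == NEG_INF:
                continue
            for j in range(n):
                if log_trans[i][j] == NEG_INF:
                    continue
                cand = score[i] + log_trans[i][j] + log_out(i, j, t)
                if cand > nxt[j]:
                    nxt[j] = cand
        score = nxt
    return max(score[f] for f in final_states)

def recognize(models, num_frames):
    """Return the category name whose HMM gives the largest log likelihood."""
    return max(models,
               key=lambda name: viterbi_log_likelihood(*models[name], num_frames))

# Two toy left-to-right models that differ only in their output log probabilities.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
models = {
    "yes": (log_init, log_trans, lambda i, j, t: -1.0, [1]),
    "no": (log_init, log_trans, lambda i, j, t: -2.0, [1]),
}
result = recognize(models, num_frames=3)
```

In a real recognizer `log_out` would evaluate the mixture for the feature vector of frame t. It is called once per allowed transition per frame, so its cost dominates the total.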

[0005]

[Problem to be solved by the invention] However, obtaining $\sum_m \{\lambda_{ijm} b_{ijm}(x_t)\}$ as the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$ increases the amount of calculation, which makes it difficult to calculate at high speed the likelihood between the time series of input speech feature vectors $x_t$ and the HMM. It has therefore been desired to obtain the output probability $b_{ij}(x_t)$ more simply while suppressing any loss of accuracy.

[0006]

[Means for solving the problem] To solve the problem described above, the speech recognition method of the present invention uses hidden Markov models as speech standard patterns. Each hidden Markov model has an uncorrelated mixture normal distribution that comprises a plurality of mutually uncorrelated normal distributions and represents the output probability of a speech symbol vector output from the model. The likelihood between a hidden Markov model and the time series of input speech feature vectors extracted from the start frame to the end frame of a speech section is calculated using the logarithm of the output probability of each input speech feature vector, and the category name assigned to the hidden Markov model that gives the maximum likelihood is taken as the recognition result for the input speech signal of that speech section. In this method, the following quantities are defined.

$b_{ij}(x_t)$: the output probability that the input speech feature vector $x_t$ extracted in the t-th frame is output from a hidden Markov model having an uncorrelated mixture normal distribution with a total of M normal distributions ($1 \le t \le T$; the first frame is the start frame of the speech section and the T-th frame is its end frame).

$g_{ijm}(x_t)$: the weighted probability of the input speech feature vector $x_t$ calculated from the m-th ($1 \le m \le M$) of the M normal distributions, where

$g_{ijm}(x_t) = \lambda_{ijm} b_{ijm}(x_t)$,
$b_{ijm}(x_t) = (2\pi)^{-p/2} |\rho_{ijm}|^{-1/2} \exp\{-D_{ijmt}^2 / 2\}$,
$D_{ijmt}^2 = (x_t - \mu_{ijm})' \rho_{ijm}^{-1} (x_t - \mu_{ijm})$,

and where $\lambda_{ijm}$ is the weight of the m-th normal distribution, $b_{ijm}(x_t)$ is the unweighted probability of the input speech feature vector $x_t$ calculated from the m-th normal distribution, $p$ is the order of the input speech feature vector $x_t$, $\rho_{ijm}$ is the variance-covariance matrix of the m-th normal distribution, $\mu_{ijm}$ is the mean vector of the m-th normal distribution, and $D_{ijmt}$ is the Mahalanobis generalized distance between the input speech feature vector $x_t$ and the m-th normal distribution.

$G_{ijm}(x_t)$: the logarithm of the weighted probability $g_{ijm}(x_t)$, where

$G_{ijm}(x_t) = E_{ijm} - D_{ijmt}^2 / 2$,
$E_{ijm} = \ln(\lambda_{ijm}) + \ln\{(2\pi)^{-p/2} |\rho_{ijm}|^{-1/2}\}$.

Among the logarithms $G_{ijm}(x_t)$ of the weighted probabilities $g_{ijm}(x_t)$ calculated from the M normal distributions, the largest is used as the logarithm of the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$ in calculating the likelihood with the hidden Markov model. For this purpose, a reference information storage unit is provided that stores a maximum value candidate, used when $t \ge 2$ to detect the largest logarithm $G_{ijm}(x_t)$ in the t-th frame, and an index indicating, when $t \ge 2$, which normal distribution gave the largest logarithm in the (t-1)-th frame.

When $t = 1$, the logarithm $G_{ijm}(x_t)$ is calculated for every one of the M normal distributions and the largest is detected. This largest logarithm is taken as the logarithm of the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$ in the first frame, and the index of the normal distribution that gave it is stored.

When $t \ge 2$, the procedure is as follows.

(1) First, the logarithm $G_{ijm}(x_t)$ calculated using the normal distribution that corresponds to the stored index is stored as the maximum value candidate.

(2) For each of the remaining normal distributions, that is, those of the M normal distributions that do not correspond to the index, the partially calculated $G_{ijm}(x_t)$ is compared with the maximum value candidate at every one or more operation intervals of the calculation of the term $-D_{ijmt}^2 / 2$.

(3-A) If the partially calculated $G_{ijm}(x_t)$ falls below the maximum value candidate, the calculation of that $G_{ijm}(x_t)$ is terminated, and calculation of $G_{ijm}(x_t)$ then starts for the next remaining normal distribution.

(3-B) If the calculation of $G_{ijm}(x_t)$ is completed without the partially calculated value falling below the maximum value candidate, the maximum value candidate is overwritten with that $G_{ijm}(x_t)$, and calculation of $G_{ijm}(x_t)$ then starts for the next remaining normal distribution.

(4) When the calculation of $G_{ijm}(x_t)$ has finished for all M normal distributions, the index in the reference information storage unit is overwritten with the index of the normal distribution that gave the maximum value candidate stored at that point, and that maximum value candidate is used as the logarithm of the output probability $b_{ij}(x_t)$ in calculating the likelihood with the hidden Markov model.

[0007] According to this invention, the largest logarithm $G_{ijm}(x_t)$ among the logarithms of the weighted probabilities $g_{ijm}(x_t)$ calculated from the M normal distributions is used as the logarithm of the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$ in calculating the likelihood with the hidden Markov model. This is nothing other than using the largest weighted probability $g_{ijm}(x_t)$ among those calculated from the M normal distributions as the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$.

[0008] In contrast, the output probability $b_{ij}(x_t)$ of the speech feature vector $x_t$ typically used in the past is the linear sum $\sum_m \{\lambda_{ijm} b_{ijm}(x_t)\}$ of the weighted probabilities $g_{ijm}(x_t) = \lambda_{ijm} b_{ijm}(x_t)$ obtained from the individual normal distributions of the uncorrelated mixture normal distribution.

[0009] Because the M normal distributions of the hidden Markov model are uncorrelated with one another, the distance between the input speech feature vector $x_t$ and a normal distribution whose weighted probability $g_{ijm}(x_t)$ is not the maximum is longer than the distance to the normal distribution whose weighted probability $g_{ijm}(x_t)$ is the maximum.

[0010] Consequently, the weighted probabilities $g_{ijm}(x_t)$ that are not the maximum are negligibly small compared with the maximum weighted probability $g_{ijm}(x_t)$. Even when the maximum weighted probability $g_{ijm}(x_t)$ is used as the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$, as in this invention, the output probability $b_{ij}(x_t)$ obtained is approximately equal to the conventional one.
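The size of this approximation can be checked numerically in the log domain. The sketch below compares the logarithm of the conventional sum with the largest logarithm. The values in `G` are made up for illustration and assume one clearly dominant distribution; `log_sum` is a hypothetical helper.

```python
import math

def log_sum(log_terms):
    """Logarithm of sum(exp(v)), computed stably; this is the conventional value."""
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(v - peak) for v in log_terms))

# Hypothetical logarithms G_ijm(x_t) for M = 4 distributions.
G = [-12.0, -35.5, -48.2, -60.1]
exact = log_sum(G)    # log of the sum of all weighted probabilities
approx = max(G)       # the value the invention uses
error = exact - approx
```

The difference is never negative and can never exceed ln(M), since the sum is at most M times its largest term. It approaches zero only when the largest term dominates, which is the situation [0009] and [0010] argue is typical.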

[0011] The logarithm of the weighted probability $g_{ijm}(x_t)$ is expressed as $G_{ijm}(x_t) = E_{ijm} - D_{ijmt}^2 / 2$. For the m-th normal distribution, $\lambda_{ijm}$ and $|\rho_{ijm}|$ are constant, so $E_{ijm}$ is constant. The partially calculated value of $G_{ijm}(x_t)$ therefore starts at its peak $E_{ijm}$ and decreases at each operation interval of the calculation of $-D_{ijmt}^2 / 2$. Here, one operation interval of the calculation of $-D_{ijmt}^2 / 2$ means the interval from the start to the end of the operations performed on one vector component of the input speech feature vector $x_t$.
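A minimal sketch of this behavior follows, assuming a diagonal covariance matrix so that $D_{ijmt}^2$ is a sum of one non-negative term per vector component. The generator name `running_log_values` and the sample numbers are hypothetical.

```python
import math

def running_log_values(x, weight, mean, var):
    """Yield the partially calculated G after each operation interval.

    The first value is E_ijm itself; each later value subtracts the
    contribution of one more vector component of x_t.
    """
    p = len(x)
    # E_ijm = ln(lambda_ijm) + ln{(2*pi)^(-p/2) * |rho_ijm|^(-1/2)}; it does not depend on x_t.
    e = (math.log(weight) - 0.5 * p * math.log(2 * math.pi)
         - 0.5 * sum(math.log(v) for v in var))
    g = e
    yield g
    for xk, mk, vk in zip(x, mean, var):
        g -= (xk - mk) ** 2 / (2 * vk)  # one operation interval
        yield g

values = list(running_log_values([1.0, -2.0, 0.5], 0.5,
                                 [0.0, 0.0, 0.0], [1.0, 4.0, 0.25]))
```

Because every subtracted term is non-negative, the sequence never increases. Once a partial value has dropped below some threshold, the final value must also be below it.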

[0012] For this reason, at every one or more operation intervals of the calculation of $-D_{ijmt}^2 / 2$, the partially calculated $G_{ijm}(x_t)$ is compared with the maximum value candidate (step (2) above). If the partially calculated $G_{ijm}(x_t)$ falls below the maximum value candidate, the calculation of that $G_{ijm}(x_t)$ is terminated partway (step (3-A) above), which reduces the amount of calculation needed to detect the largest logarithm $G_{ijm}(x_t)$.
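The cutoff for a single normal distribution might look like the sketch below. It assumes a diagonal covariance and a precomputed $E_{ijm}$ passed in as `e`. The function name, the `interval` parameter, and the extra comparison on the last component (added so that a value is never returned without having been checked) are assumptions made for the example.

```python
def log_value_with_cutoff(x, e, mean, var, candidate, interval=1):
    """Calculate G for one distribution, abandoning it once it cannot win.

    Returns (G, operations) if the calculation runs to completion, or
    (None, operations) if the partial value fell below `candidate`.
    """
    p = len(x)
    g = e
    for k, (xk, mk, vk) in enumerate(zip(x, mean, var), start=1):
        g -= (xk - mk) ** 2 / (2 * vk)
        # Compare every `interval` components, and always on the last one.
        if (k % interval == 0 or k == p) and g < candidate:
            return None, k
    return g, p

x = [0.0, 0.0, 0.0, 0.0]
mean = [5.0, 0.0, 0.0, 0.0]
var = [1.0, 1.0, 1.0, 1.0]
abandoned = log_value_with_cutoff(x, -1.0, mean, var, candidate=-3.0)
completed = log_value_with_cutoff(x, -1.0, mean, var, candidate=-20.0)
coarse = log_value_with_cutoff(x, -1.0, mean, var, candidate=-3.0, interval=2)
```

A larger `interval` means fewer comparisons but a later exit. Which setting is faster depends on the relative cost of a comparison and a component update on the target hardware.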

[0013] Moreover, the input speech feature vector $x_{t-1}$ of the (t-1)-th frame and the input speech feature vector $x_t$ of the t-th frame are close in time, so the components of the vectors $x_t$ and $x_{t-1}$ are likely to be similar to each other.

[0014] Accordingly, if the logarithm $G_{ijI}(x_{t-1})$ of the weighted probability $g_{ijI}(x_{t-1})$ obtained from the I-th normal distribution was the largest logarithm $G_{ijm}(x_{t-1})$ in the (t-1)-th frame, the logarithm $G_{ijI}(x_t)$ of the weighted probability $g_{ijI}(x_t)$ obtained from the same I-th normal distribution is likely to be the largest logarithm $G_{ijm}(x_t)$ in the following t-th frame as well.

[0015] For this reason, the logarithm $G_{ijI}(x_t)$ obtained from the I-th normal distribution is used as the initial value of the maximum value candidate (step (1) above), and when a partially calculated $G_{ijm}(x_t)$ falls below the maximum value candidate, its calculation is terminated partway (step (3-A) above). This reduces the amount of calculation needed to detect the largest logarithm $G_{ijm}(x_t)$.

[0016]

[Embodiments of the invention] FIG. 1 is a functional block diagram schematically showing a configuration example of a speech recognition apparatus suitable for carrying out the speech recognition method of this invention.

[0017] The speech recognition apparatus 10 shown in the figure comprises a dictionary unit 12, an acoustic processing unit 14, a speech section detection unit 16, an HMM matching unit 18, and a reference information storage unit 20.

[0018] The dictionary unit 12 stores hidden Markov models as speech standard patterns. Each hidden Markov model has an uncorrelated mixture normal distribution that comprises a plurality of mutually uncorrelated normal distributions and represents the output probability of a speech symbol vector output from the model.

[0019] The acoustic processing unit 14 extracts an input speech feature vector from the input speech signal for each frame of fixed time width. The speech section detection unit 16 detects a speech section from the input speech signal.

[0020] The HMM matching unit 18 calculates the likelihood between a hidden Markov model and the time series of input speech feature vectors extracted from the start frame to the end frame of the speech section, using the output probability of each input speech feature vector, and takes the category name assigned to the hidden Markov model that gives the maximum likelihood as the recognition result for the input speech signal of that speech section.

[0021] Here, the following quantities are defined.

$b_{ij}(x_t)$: the output probability that the input speech feature vector $x_t$ extracted in the t-th frame is output from a hidden Markov model having an uncorrelated mixture normal distribution with a total of M normal distributions ($1 \le t \le T$; the first frame is the start frame of the speech section and the T-th frame is its end frame).

$g_{ijm}(x_t)$: the weighted probability of the input speech feature vector $x_t$ calculated from the m-th ($1 \le m \le M$) of the M normal distributions, where

$g_{ijm}(x_t) = \lambda_{ijm} b_{ijm}(x_t)$,
$b_{ijm}(x_t) = (2\pi)^{-p/2} |\rho_{ijm}|^{-1/2} \exp\{-D_{ijmt}^2 / 2\}$,
$D_{ijmt}^2 = (x_t - \mu_{ijm})' \rho_{ijm}^{-1} (x_t - \mu_{ijm})$,

and where $\lambda_{ijm}$ is the weight of the m-th normal distribution, $b_{ijm}(x_t)$ is the unweighted probability of the input speech feature vector $x_t$ calculated from the m-th normal distribution, $p$ is the order of the input speech feature vector $x_t$, $\rho_{ijm}$ is the variance-covariance matrix of the m-th normal distribution, $\mu_{ijm}$ is the mean vector of the m-th normal distribution, $D_{ijmt}$ is the Mahalanobis generalized distance between the input speech feature vector $x_t$ and the m-th normal distribution, and $(x_t - \mu_{ijm})'$ is the transpose of $(x_t - \mu_{ijm})$.

$G_{ijm}(x_t)$: the logarithm of the weighted probability $g_{ijm}(x_t)$, where

$G_{ijm}(x_t) = E_{ijm} - D_{ijmt}^2 / 2$,
$E_{ijm} = \ln(\lambda_{ijm}) + \ln\{(2\pi)^{-p/2} |\rho_{ijm}|^{-1/2}\}$.

As the logarithm of the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$ used in calculating the likelihood with the hidden Markov model, the largest $G_{ijm}(x_t)$ among the logarithms of the weighted probabilities $g_{ijm}(x_t)$ calculated from the M normal distributions is used.

[0022] The reference information storage unit 20 stores a maximum value candidate, used when $t \ge 2$ to detect the largest logarithm $G_{ijm}(x_t)$ in the t-th frame, and an index indicating, when $t \ge 2$, which normal distribution gave the largest logarithm in the (t-1)-th frame.

[0023] The HMM matching unit 18 calculates the likelihood between the hidden Markov model and the time series of input speech feature vectors $x_t$ output from the start frame to the end frame as follows.

[0024] When $t = 1$, the logarithm $G_{ijm}(x_t)$ is calculated for every one of the M normal distributions and the largest is detected. This largest logarithm $G_{ijm}(x_t)$ is taken as the logarithm of the output probability $b_{ij}(x_t)$ of the input speech feature vector $x_t$ in the first frame, and the index of the normal distribution that gave it is stored.

[0025] When $t \ge 2$, the procedure is as follows.

(1) First, the logarithm $G_{ijm}(x_t)$ calculated using the normal distribution that corresponds to the stored index is stored as the maximum value candidate.

(2) For each of the remaining normal distributions, that is, those of the M normal distributions that do not correspond to the index, the partially calculated $G_{ijm}(x_t)$ is compared with the maximum value candidate at every one or more operation intervals of the calculation of the term $-D_{ijmt}^2 / 2$.

(3-A) If the partially calculated $G_{ijm}(x_t)$ falls below the maximum value candidate, the calculation of that $G_{ijm}(x_t)$ is terminated, and calculation of $G_{ijm}(x_t)$ then starts for the next remaining normal distribution.

(3-B) If the calculation of $G_{ijm}(x_t)$ is completed without the partially calculated value falling below the maximum value candidate, the maximum value candidate is overwritten with that $G_{ijm}(x_t)$, and calculation of $G_{ijm}(x_t)$ then starts for the next remaining normal distribution.

(4) When the calculation of $G_{ijm}(x_t)$ has finished for all M normal distributions, the index in the reference information storage unit 20 is overwritten with the index of the normal distribution that gave the maximum value candidate stored at that point, and that maximum value candidate is used as the logarithm of the output probability $b_{ij}(x_t)$ in calculating the likelihood with the hidden Markov model.
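The handling of $t = 1$ and steps (1) to (4) can be put together as in the sketch below. It is not the patented implementation: it assumes diagonal covariances, compares after every vector component, and models the reference information storage unit as a plain dictionary `ref`. The class and function names and the sample mixture are hypothetical.

```python
import math

class Component:
    """One normal distribution of the mixture (diagonal covariance assumed)."""
    def __init__(self, weight, mean, var):
        self.mean, self.var = mean, var
        p = len(mean)
        # E_ijm is constant for the distribution, so it is computed once.
        self.e = (math.log(weight) - 0.5 * p * math.log(2 * math.pi)
                  - 0.5 * sum(math.log(v) for v in var))

def full_value(x, c):
    """G for distribution c, calculated to completion."""
    g = c.e
    for xk, mk, vk in zip(x, c.mean, c.var):
        g -= (xk - mk) ** 2 / (2 * vk)
    return g

def frame_log_output(x, comps, ref):
    """Return the largest G for one frame and update the reference store."""
    if ref.get("index") is None:                    # t = 1: evaluate everything
        values = [full_value(x, c) for c in comps]
        best = max(range(len(comps)), key=values.__getitem__)
        ref["index"], ref["candidate"] = best, values[best]
        return values[best]
    best = ref["index"]
    ref["candidate"] = full_value(x, comps[best])   # step (1)
    for m, c in enumerate(comps):                   # step (2)
        if m == ref["index"]:
            continue
        g = c.e
        for xk, mk, vk in zip(x, c.mean, c.var):
            g -= (xk - mk) ** 2 / (2 * vk)
            if g < ref["candidate"]:                # step (3-A): abandon
                break
        else:                                       # step (3-B): new candidate
            ref["candidate"], best = g, m
    ref["index"] = best                             # step (4)
    return ref["candidate"]

comps = [Component(0.5, [0.0, 0.0], [1.0, 1.0]),
         Component(0.3, [3.0, 3.0], [1.0, 1.0]),
         Component(0.2, [-3.0, 3.0], [1.0, 1.0])]
frames = [[0.1, 0.0], [0.2, -0.1], [2.9, 3.1]]
ref = {}
outputs = [frame_log_output(x, comps, ref) for x in frames]
```

The pruned search returns the same value as evaluating every distribution in full, because a partial value only decreases. In a recognizer with many mixtures, each mixture would presumably need its own stored index and candidate; the text describes only one pair.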

[0026] FIG. 2 is a diagram for explaining the hidden Markov model used as a speech standard pattern. A hidden Markov model (hereinafter, HMM) used as a speech standard pattern represents the speech signal for one unit of speech recognition, here one word, to which a category z is assigned. A plurality of HMMs are prepared individually for the respective categories, and the HMMs and the categories z are stored in the dictionary unit 12 in association with each other.

[0027] An HMM is defined by a set 1 of states consisting of a total of I states $S_1$ to $S_I$, a set 2 of speech symbol vectors x, a set 3 of state transition probabilities $a_{ij}$, a set 4 of output probabilities $b_{ij}(x)$, a set 5 of initial state probabilities $\Phi_i$, and a set 6 of final states F.

[0028]

[Equation 1]

[0029] For example, in FIG. 2, $a_{12}$ is the probability of a transition from state $S_1$ to state $S_2$, and $b_{12}(x)$ is the probability that the speech symbol vector x is output on the transition from state $S_1$ to state $S_2$. Likewise, $a_{22}$ is the probability of a transition from state $S_2$ to state $S_2$, and $b_{22}(x)$ is the probability that the speech symbol vector x is output on the transition from state $S_2$ to state $S_2$.

[0030] The sets 1 to 6 that define an HMM are obtained separately for each category z by statistical methods. That is, a variety of speech signals corresponding to category z are collected, for example speech signals collected by age or by sex, or speech signals uttered in different ways, and the sets 1 to 6 that express the statistical properties of these speech signals are obtained.

[0031] The output probability $b_{ij}(x)$ is represented by an uncorrelated mixture normal distribution (an uncorrelated continuous probability density distribution) composed of a plurality of normal distributions that are uncorrelated with one another and are each a function of the speech symbol vector x. The uncorrelated mixture normal distribution has the advantages of being simple to handle mathematically and having high expressive power.

[0032] Next, the processing flow of the speech recognition method of this embodiment is described concretely, together with the operation of the speech recognition apparatus 10.

[0033] The acoustic processing unit 14 extracts an input speech feature vector $x_t$ from the input speech signal for each frame. The frame number t assigned to the input speech feature vector $x_t$ at this point is a number assigned sequentially, with the frame at the start of acoustic processing as frame t = 1. In the HMM matching unit 18, described later, this frame number t is rewritten to a number assigned sequentially from the start frame to the end frame of the speech section, with the start frame of the speech section as the first frame (t = 1).

[0034] The input speech feature vector $x_t$ can be written as $x_t = (x_{t1}, x_{t2}, \ldots, x_{tp})$, where p is the order of the input speech feature vector $x_t$ and $x_{t1}$ to $x_{tp}$ are its vector components.

[0035] As the vector components of the input speech feature vector $x_t$, it is possible to use, for example, values obtained from the outputs of the individual filters when the input speech signal is fed to a band filter group consisting of a plurality of band-pass filters with different center frequencies, power spectrum components obtained by Fourier analysis of the input speech signal, or LPC cepstrum coefficients obtained by linear predictive (LPC) analysis of the input speech signal. An example in which the input speech feature vector $x_t$ is extracted using a band filter group is described here.

【００３６】The acoustic processing unit 14 converts the input speech signal from analog to digital and passes the converted signal through the band filter group, separating it into signal components of the frequency bands (channels) corresponding to the individual band-pass filters. This yields a total of p signal components x1 to xp, each in a different frequency band. The acoustic processing unit 14 then rectifies the signal component x1 and obtains, for each frame, the average value of the rectified signal component x1 (the absolute value of x1). This average is obtained by dividing the rectified signal component x1 by the time width of one frame. The average value of x1 obtained in the t-th frame is extracted as the component x_t1 of the input speech feature vector x_t. The components x_t2 to x_tp are extracted in the same way from the remaining signal components x2 to xp.
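The per-frame averaging described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the band filter outputs are already available as the rows of an array, that frames do not overlap, and that the average is the sum of the rectified samples in a frame divided by the frame length. The function name `frame_features` and its parameters are hypothetical.

```python
import numpy as np

def frame_features(channels: np.ndarray, frame_len: int) -> np.ndarray:
    """Return one feature vector x_t per frame.

    channels: shape (p, n_samples), one row per band-pass channel (assumed input).
    Result:   shape (n_frames, p); row t holds (x_t1, ..., x_tp).
    """
    p, n = channels.shape
    n_frames = n // frame_len
    # Rectify: take the absolute value of every sample, dropping a trailing partial frame.
    rectified = np.abs(channels[:, : n_frames * frame_len])
    # Average the rectified samples inside each frame, then put frames on the first axis.
    return rectified.reshape(p, n_frames, frame_len).mean(axis=2).T

# Two channels, four samples, two samples per frame.
demo = frame_features(np.array([[1.0, -1.0, 2.0, -2.0],
                                [0.0, 0.0, 4.0, -4.0]]), frame_len=2)
```

Each row of `demo` is one input speech feature vector; with p channels the vector has p components.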

【００３７】Next, the speech section detection unit 16 detects the start frame and the end frame of the speech section on the basis of the input speech feature vectors x_t from the acoustic processing unit 14, and generates section information indicating which frames are the start frame and the end frame of the speech section. A speech section is a section containing the speech signal for one unit of speech recognition. The unit of recognition may be a word, a phoneme, or some other unit; here it is a word.

【００３８】The HMM matching unit 18 receives the section information and the input speech feature vectors x_t from the speech section detection unit 16 and generates the time series x_1 to x_T of the input speech feature vectors extracted from the start frame to the end frame of the speech section. Here the frame number t is renumbered so that the start frame of the speech section is the first frame (t = 1) and the numbers run sequentially from the start frame to the end frame of the speech section.

【００３９】The HMM matching unit 18 then obtains the likelihood ln{P(x_1 to x_T)} between the vector time series x_1 to x_T and each HMM stored in the dictionary unit 12, and outputs, as the recognition result, the category z assigned to the HMM that yields the maximum likelihood.

【００４０】Here, P(x_1 to x_T) is the probability that the vector time series x_1 to x_T appears in the HMM, and is expressed by the following equation (1).

【００４１】[0041]

【数２】 [Equation 2]

【００４２】In equation (1), *i denotes an index i that satisfies S_i ∈ F (that is, the number i assigned to a state S_i belonging to the final states F). Equation (1) therefore takes, as the appearance probability P(x_1 to x_T), the largest of the forward probabilities c_iT for which i = *i.

【００４３】The forward probability c_iT is obtained approximately by the Viterbi algorithm, using the recurrence relations shown in the following equations (2) and (3).

【００４４】[0044]

【数３】 (Equation 3)

【００４５】Writing ln(a_ij) = A_ij, ln{b_ij(x_t)} = B_ij(x_t), and ln(c_it) = C_it (hereinafter called the transition logarithmic value A_ij, the output logarithmic value B_ij, and the forward logarithmic value C_it), equations (1) to (3) can be transformed into equations (4) to (6) for calculating the likelihood ln{P(x_1 to x_T)}.

【００４６】[0046]

【数４】 (Equation 4)

【００４７】Because equations (5) and (6) are recurrence relations in t, the forward logarithmic values C_it for t = 1, 2, ..., T can be calculated in sequence as shown in the following equations.

【００４８】[0048]

【数５】 (Equation 5)

【００４９】In an HMM, one or more transition paths lead from the initial state to the state S_i while generating the vector sequence x_1 to x_t, and in most cases there are several. When there are several transition paths, a forward logarithmic value C_it is obtained for each path, so that a plurality of forward logarithmic values C_it, one per transition path, are obtained.

【００５０】For an HMM assigned the category z, the HMM matching unit 18 obtains the forward logarithmic values C_iT and takes the largest of those with i = *i as the likelihood ln{P(x_1 to x_T)} between the vector time series x_1 to x_T and that HMM. It obtains the likelihood ln{P(x_1 to x_T)} in this way for every HMM stored in the dictionary unit 12, and outputs the category z assigned to the HMM that yields the maximum likelihood as the recognition result for the input speech signal from which the vector time series x_1 to x_T was obtained.
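The matching procedure can be illustrated with a small log-domain Viterbi sketch. Equations (4) to (6) are not reproduced in this text, so the recursion below is the textbook form, C_jt = max over i of (C_i,t-1 + A_ij + B_ij(x_t)), and is only assumed to correspond to them. Outputs are attached to transitions, following the b_ij notation. All function and variable names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_log_likelihood(A, B_seq, initial, finals):
    """Largest forward logarithmic value C_iT over the final states.

    A[i][j]:        transition logarithmic value (NEG_INF where no transition exists)
    B_seq[t][i][j]: output logarithmic value of frame t on the transition i -> j
    initial:        index of the initial state (assumed single)
    finals:         indices of the final states F
    """
    n = len(A)
    # Initial values C_i0: log 1 for the initial state, log 0 elsewhere.
    C = [0.0 if i == initial else NEG_INF for i in range(n)]
    for B in B_seq:
        # Keep only the best incoming path for each destination state j.
        C = [max(C[i] + A[i][j] + B[i][j] for i in range(n)) for j in range(n)]
    return max(C[i] for i in finals)

def recognize(likelihoods):
    """Return the category whose HMM gave the largest likelihood."""
    return max(likelihoods, key=likelihoods.get)

# Two-state left-to-right model, two frames, every output probability set to 1.
half = math.log(0.5)
A = [[half, half], [NEG_INF, 0.0]]
B_seq = [[[0.0, 0.0], [0.0, 0.0]]] * 2
score = viterbi_log_likelihood(A, B_seq, initial=0, finals=[1])
best_category = recognize({"yes": -3.0, "no": -5.0})
```

In a full recognizer, `viterbi_log_likelihood` would be run once per HMM in the dictionary and the resulting scores passed to `recognize`.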

【００５１】In the process of calculating the likelihood ln{P(x_1 to x_T)} described above, the most complex operation is the one that obtains the output logarithmic value B_ij(x_t). To perform this operation at high speed, the output probability b_ij(x_t) is defined as in the following equation (12). The output probability b_ij(x_t) is the probability that the input speech feature vector x_t is output from a hidden Markov model having an uncorrelated mixture normal distribution made up of a total of M normal distributions.

【００５２】[0052]

【数６】 (Equation 6)

【００５３】In equation (12), g_ijm(x_t) is the weighted probability of the input speech feature vector x_t calculated from the m-th normal distribution of the uncorrelated mixture normal distribution consisting of a total of M normal distributions, and can be expressed using the following equations (13) to (15).

【００５４】[0054]

【数７】 (Equation 7)

【００５５】In equation (13), λ_ijm is the weight of the m-th normal distribution and b_ijm(x_t) is the unweighted probability of the input speech feature vector x_t calculated from the m-th normal distribution. The unweighted probability b_ijm(x_t) is given by equation (14), in which p is the order of the input speech feature vector x_t, ρ_ijm is the variance-covariance matrix of the m-th normal distribution, and D_ijmt is the Mahalanobis generalized distance between the input speech feature vector x_t and the m-th normal distribution. The Mahalanobis generalized distance D_ijmt is given by equation (15), in which μ_ijm is the mean vector of the m-th normal distribution and (x_t − μ_ijm)' is the transpose of (x_t − μ_ijm).
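The equation image for (13) to (15) is not reproduced in this text. Going by the symbol definitions in this paragraph and the expressions repeated in claim 1, the three equations presumably read as follows; this is a reconstruction, not the original typesetting.

```latex
\begin{align}
g_{ijm}(x_t) &= \lambda_{ijm}\, b_{ijm}(x_t) \tag{13}\\
b_{ijm}(x_t) &= (2\pi)^{-p/2}\,\lvert\rho_{ijm}\rvert^{-1/2}
                \exp\!\left(-\tfrac{1}{2} D_{ijmt}^{2}\right) \tag{14}\\
D_{ijmt}^{2} &= (x_t-\mu_{ijm})'\,\rho_{ijm}^{-1}\,(x_t-\mu_{ijm}) \tag{15}
\end{align}
```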

【００５６】Equation (12) states that, among the weighted probabilities g_ijm(x_t) obtained from the individual normal distributions of the uncorrelated mixture normal distribution consisting of a total of M normal distributions, the largest weighted probability g_ijm(x_t) is taken as the output probability b_ij(x_t) of the input speech feature vector x_t.

【００５７】A typical conventional output probability b_ij(x_t) is expressed as a linear sum of the weighted probabilities g_ijm(x_t). Even when the largest weighted probability g_ijm(x_t) is used as the output probability b_ij(x_t), as in equation (12), an output probability approximately equal to the conventional one can be obtained. This is because the M normal distributions of the uncorrelated mixture normal distribution are uncorrelated with one another, so the weighted probabilities g_ijm(x_t) that are not the maximum are considered to be very small compared with the largest weighted probability g_ijm(x_t).
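The size of the approximation error can be checked numerically. The sketch below compares the logarithm of the conventional sum with the logarithm of the largest term for a one-dimensional mixture; the mixture parameters are invented purely for illustration, and the helper names are hypothetical.

```python
import math

def log_weighted(x, weight, mean, var):
    """Logarithm of weight * N(x; mean, var) for a one-dimensional normal distribution."""
    return math.log(weight) - 0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_sum_and_max(x, components):
    """Return (log of the sum of weighted probabilities, largest weighted log value)."""
    G = [log_weighted(x, *c) for c in components]
    peak = max(G)
    # Stable log-sum-exp: factor out the largest term before exponentiating.
    log_sum = peak + math.log(sum(math.exp(g - peak) for g in G))
    return log_sum, peak

# Two well-separated components given as (weight, mean, variance); illustrative values only.
log_sum, peak = log_sum_and_max(0.5, [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)])
gap = log_sum - peak
```

The gap is always between 0 and ln M. It is close to 0 when one component dominates, which is the situation the paragraph above relies on; when several components overlap near x, the gap grows and the approximation is coarser.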

【００５８】Using equation (12), the logarithmic value B_ij(x_t) of the output probability b_ij(x_t) (hereinafter, the output logarithmic value B_ij(x_t)) can be expressed as in the following equation (16).

【００５９】[0059]

【数８】 (Equation 8)

【００６０】The weighted logarithmic value G_ijm(x_t) in equation (16) is the logarithmic value of the weighted probability g_ijm(x_t), and can be expressed as in the following equation (17) using equations (13) to (15).

【００６１】[0061]

【数９】 [Equation 9]
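The equation images for (12), (16) and (17) are not reproduced in this text. From the surrounding prose and the definitions repeated in claim 1, they presumably take the following form. This is a reconstruction under that assumption; the step from (12) to (16) uses only the fact that the logarithm is monotonically increasing, so the largest g_ijm and the largest G_ijm come from the same normal distribution.

```latex
\begin{align}
b_{ij}(x_t) &= \max_{1 \le m \le M} g_{ijm}(x_t) \tag{12}\\
B_{ij}(x_t) &= \ln b_{ij}(x_t) = \max_{1 \le m \le M} G_{ijm}(x_t) \tag{16}\\
G_{ijm}(x_t) &= E_{ijm} - \tfrac{1}{2} D_{ijmt}^{2}, \qquad
E_{ijm} = \ln \lambda_{ijm}
        + \ln\!\left\{(2\pi)^{-p/2}\,\lvert\rho_{ijm}\rvert^{-1/2}\right\} \tag{17}
\end{align}
```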

【００６２】Attention is now turned to the weighted logarithmic value G_ijm(x_t). Because the M normal distributions that make up the uncorrelated mixture normal distribution of the HMM are all uncorrelated, the variance-covariance matrix ρ_ijm of each normal distribution is a diagonal matrix.

【００６３】Writing A_ijmrs for the element in row r and column s of the variance-covariance matrix ρ_ijm (the m-th ρ_ijm of the uncorrelated mixture normal distribution), Br for the r-th component of the input speech feature vector x_t, and C_ijmr for the r-th component of the mean vector μ_ijm (the m-th μ_ijm of the uncorrelated mixture normal distribution), equation (15) can be transformed into the following equation (18).

【００６４】[0064]

【数１０】 (Equation 10)

【００６５】Because the variance-covariance matrix ρ_ijm is a diagonal matrix, A_ijmrs = 0 for r ≠ s, and equation (18) can therefore be transformed into the following equation (19).

【００６６】[0066]

【数１１】 [Equation 11]

【００６７】Moreover, because the variance-covariance matrix ρ_ijm is an inverse correlation matrix, A_ijmrr ≥ 0 holds, so each term A_ijmrr·(Br − C_ijmr)² in equation (19) is non-negative, and therefore D_ijmt² ≥ 0.

【００６８】Accordingly, in equation (17), E_ijm is a constant determined for each normal distribution and D_ijmt² ≥ 0, so the weighted logarithmic value G_ijm(x_t) in the course of calculation is the value obtained by successively subtracting from E_ijm the terms A_ijmrr·(Br − C_ijmr)² of equation (19). In other words, the value of G_ijm(x_t) in the course of calculation starts at its peak E_ijm and decreases at each operation interval of the operation A_ijmrr·(Br − C_ijmr)², which is performed for one component of the input speech feature vector x_t.
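The monotone decrease of the partial result can be seen in a few lines. This sketch assumes the diagonal weights A_ijmrr are supplied as a precomputed list `a_diag` and, following G = E − D²/2 as stated in the claims, subtracts half of each term; the function and argument names are hypothetical.

```python
def partial_log_values(E, a_diag, x, mean):
    """Return the running value of G after each component term is subtracted.

    E:      constant term E_ijm of one normal distribution
    a_diag: non-negative diagonal weights A_ijmrr (assumed precomputed)
    x:      components Br of the input speech feature vector
    mean:   components C_ijmr of the mean vector
    """
    G = E
    values = [G]
    for a, b, c in zip(a_diag, x, mean):
        G -= 0.5 * a * (b - c) ** 2  # one operation interval: one component of x
        values.append(G)
    return values

trace = partial_log_values(E=0.0, a_diag=[1.0, 2.0], x=[1.0, 1.0], mean=[0.0, 0.0])
```

Because every subtracted term is non-negative, the list never increases. Once a partial value drops to or below some threshold, the final value is guaranteed to be at or below that threshold as well.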

【００６９】Next, the flow of the likelihood calculation performed by the HMM matching unit 18 is described with reference to FIGS. 3 and 4. FIG. 3 shows the operation flow for calculating the maximum weighted logarithmic value G_ijm(x_t) when t = 1, and FIG. 4 shows the operation flow for calculating the maximum weighted logarithmic value G_ijm(x_t) when t ≥ 2.

【００７０】First, the HMM matching unit 18 sets the initial value C_i0 of the forward logarithmic value. Next, it obtains the forward logarithmic value C_it for t = 1, that is, for the input speech feature vector x_t of the start frame (the first frame).

【００７１】To this end, the HMM matching unit 18 searches the HMM for the speech symbol vector corresponding to the input speech feature vector x_t. Using the uncorrelated mixture normal distribution that represents the output probability of that speech symbol vector as the uncorrelated mixture normal distribution that represents the output probability b_ij(x_t) of the input speech feature vector x_t, it calculates the weighted logarithmic value G_ijm(x_t) from each normal distribution of this mixture and detects the maximum weighted logarithmic value G_ijm(x_t) (S1 in FIG. 3). It then stores the maximum weighted logarithmic value G_ijm(x_t) as the output logarithmic value B_ij(x_t) of the input speech feature vector x_t, and stores the number m of the normal distribution that yielded the maximum as the index Q_ij (S2 in FIG. 3). When several state transitions output the corresponding speech symbol vector, each state transition has its own uncorrelated mixture normal distribution representing the output probability of the speech symbol vector. Each of these is used as the uncorrelated mixture normal distribution for the input speech feature vector x_t, and a separate output logarithmic value B_ij(x_t) and index Q_ij are obtained and stored for each state transition.
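For the first frame, the search in FIG. 3 is an exhaustive maximum over the M normal distributions. A minimal sketch follows, assuming each normal distribution is described by a precomputed constant E[m], diagonal weights a_diag[m] (standing in for A_ijmrr) and a mean vector means[m]. These names and the 0-based index are illustrative choices, not part of the patent.

```python
def first_frame_output(x, E, a_diag, means):
    """Return (B, Q): the largest weighted logarithmic value and its distribution index."""
    best, q = float("-inf"), -1
    for m in range(len(E)):
        # Full evaluation of G = E - D^2 / 2 for the m-th normal distribution.
        G = E[m] - 0.5 * sum(a * (b - c) ** 2 for a, b, c in zip(a_diag[m], x, means[m]))
        if G > best:
            best, q = G, m
    return best, q

B1, Q1 = first_frame_output(
    x=[0.0, 0.0],
    E=[0.0, 0.0, 0.0],
    a_diag=[[1.0, 1.0]] * 3,
    means=[[5.0, 5.0], [0.1, 0.0], [1.0, 1.0]],
)
```

The pair returned here corresponds to what the text stores as B_ij(x_t) and Q_ij for one state transition; a model with several transitions would keep one such pair per transition.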

【００７２】The HMM matching unit 18 then uses the output logarithmic value B_ij(x_t) thus calculated to calculate the forward logarithmic value C_it for t = 1.

【００７３】Next, the HMM matching unit 18 calculates the forward logarithmic value C_it of the input speech feature vector x_t for t ≥ 2.

【００７４】To this end, the HMM matching unit 18 searches the HMM for the speech symbol vector corresponding to the input speech feature vector x_t. Using the uncorrelated mixture normal distribution that represents the output probability of that speech symbol vector as the uncorrelated mixture normal distribution that represents the output probability b_ij(x_t) of the input speech feature vector x_t, it looks up, among the normal distributions of this mixture, the Q_ij-th normal distribution corresponding to the index Q_ij, and calculates the weighted logarithmic value G_ijm(x_t) from this normal distribution. It stores the calculated weighted logarithmic value G_ijm(x_t) as the maximum value candidate G_ijQ(x_t), and then initializes the normal distribution number m to m = 1 (S1 in FIG. 4).

【００７５】Next, it is determined whether the normal distribution number m is equal to the index Q_ij (S2 in FIG. 4).

【００７６】If the number m is not equal to the index Q_ij in S2 of FIG. 4, calculation of the weighted logarithmic value G_ijm(x_t) is started using the m-th normal distribution (S3 in FIG. 4). First the E_ijm term of G_ijm(x_t) is calculated (S4 in FIG. 4), and then the operation for the D_ijmt² term of G_ijm(x_t) is performed for one operation interval or for several operation intervals (S5 in FIG. 4). One operation interval is the interval of the operation performed for one component of the input speech feature vector x_t. It is then determined whether the weighted logarithmic value G_ijm(x_t) in the course of calculation is larger than the maximum value candidate G_ijQ(x_t) (S6 in FIG. 4).

【００７７】If G_ijm(x_t) > G_ijQ(x_t) in S6 of FIG. 4, it is determined whether the operation for D_ijmt² has been completed for all components of the input speech feature vector x_t (S7 in FIG. 4); if it has not, the process returns to the operation of S5. If the operation for D_ijmt² has been completed, the maximum value candidate G_ijQ(x_t) is rewritten with the completed weighted logarithmic value G_ijm(x_t), and the index Q_ij is rewritten with the number m of the normal distribution that yielded it (S8). It is then determined whether processing has been completed for all M normal distributions (S9). If it has not, 1 is added to the normal distribution number m (S11) and the process returns to S2. If it has, the maximum value candidate G_ijQ(x_t) stored at that time is stored as the output logarithmic value B_ij(x_t) of the input speech feature vector x_t (S10). If G_ijm(x_t) ≤ G_ijQ(x_t) in S6 of FIG. 4, the process goes to S9 without performing S7 and S8.

【００７８】If m = Q_ij in S2, the process goes to S9 without performing S3 to S8.
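Steps S1 to S11 of FIG. 4 can be sketched as a single function. This is an illustrative reading of the flow, not the patent's implementation: the normal distributions are assumed to be given as precomputed constants E[m], diagonal weights a_diag[m] (standing in for A_ijmrr) and mean vectors means[m]; indices are 0-based; `check_every` is a hypothetical parameter for the number of operation intervals between comparisons.

```python
def output_log_value(x, E, a_diag, means, q_prev, check_every=1):
    """Return (B, Q) for a frame with t >= 2, abandoning hopeless candidates early."""
    # S1: fully evaluate the distribution that won in the previous frame.
    best = E[q_prev] - 0.5 * sum(
        a * (b - c) ** 2 for a, b, c in zip(a_diag[q_prev], x, means[q_prev]))
    q = q_prev
    for m in range(len(E)):                  # S9 / S11: visit every distribution
        if m == q:                           # S2: skip the current index holder
            continue
        G = E[m]                             # S3, S4: start from the constant term
        pruned = False
        for r, (a, b, c) in enumerate(zip(a_diag[m], x, means[m]), start=1):
            G -= 0.5 * a * (b - c) ** 2      # S5: one operation interval
            if r % check_every == 0 and G <= best:   # S6: cannot win any more
                pruned = True
                break
        if not pruned and G > best:          # S7 done; S8: new candidate and index
            best, q = G, m
    return best, q                           # S10: best becomes B_ij(x_t)

B2, Q2 = output_log_value(
    x=[0.0, 0.0],
    E=[0.0, 0.0, 0.0],
    a_diag=[[1.0, 1.0]] * 3,
    means=[[5.0, 5.0], [0.1, 0.0], [1.0, 1.0]],
    q_prev=0,
)
```

Early termination is safe because the partial value of G only decreases, so a candidate that has already fallen to or below the current maximum can never overtake it. The final `G > best` check covers the case where `check_every` does not divide the vector length and the last few terms were subtracted without a comparison.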

【００７９】When several state transitions output the corresponding speech symbol vector, each state transition has its own uncorrelated mixture normal distribution representing the output probability of the speech symbol vector. Each of these is used as the uncorrelated mixture normal distribution for the input speech feature vector x_t, and the processing of S1 to S11 in FIG. 4 is performed separately for each state transition.

【００８０】Each time the output logarithmic value B_ij(x_t) is obtained for an input speech feature vector x_t with t = 2 to T, the forward logarithmic value C_it is obtained, and the forward logarithmic value C_iT finally obtained is taken as the likelihood between the input speech feature vectors x_1 to x_T and the HMM.

【００８１】As already explained, the value of G_ijm(x_t) in the course of calculation starts at its peak E_ijm and decreases at each operation interval of the operation A_ijmrr·(Br − C_ijmr)², which is performed for one component of the input speech feature vector x_t. Therefore, when G_ijm(x_t) ≤ G_ijQ(x_t) in the determination of S6 in FIG. 4, terminating the calculation of that G_ijm(x_t) partway omits wasted operations and improves the computation speed.

【００８２】In addition, the number m of the normal distribution that yielded the maximum weighted logarithmic value G_ijm(x_t) in the immediately preceding frame is stored as the index Q_ij, and in the next frame the weighted logarithmic value G_ijm(x_t) obtained from the normal distribution corresponding to the index Q_ij is used as the maximum value candidate. This too omits wasted operations and improves the computation speed. The reason is that the input speech feature vectors x_t of the preceding frame and the next frame are similar, so the weighted logarithmic value G_ijm(x_t) obtained from the normal distribution corresponding to the index Q_ij is likely to be the maximum in the next frame as well.
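The benefit of starting from a good candidate can be made concrete by counting how many per-component terms are evaluated. The sketch below is a simplified variant of the search (it compares after every term and skips only the starting distribution), and the data are invented for illustration; all names are hypothetical.

```python
def terms_evaluated(x, E, a_diag, means, q_init):
    """Count per-component terms computed when the search starts from q_init."""
    count = len(x)  # the starting distribution is always evaluated in full
    best = E[q_init] - 0.5 * sum(
        a * (b - c) ** 2 for a, b, c in zip(a_diag[q_init], x, means[q_init]))
    for m in range(len(E)):
        if m == q_init:
            continue
        G = E[m]
        for a, b, c in zip(a_diag[m], x, means[m]):
            G -= 0.5 * a * (b - c) ** 2
            count += 1
            if G <= best:   # abandon this distribution
                break
        else:
            best = G        # finished without being abandoned, so G > best
    return count

x = [0.0, 0.0, 0.0, 0.0]
E = [0.0, 0.0, 0.0]
a_diag = [[1.0] * 4] * 3
means = [[0.1, 0.0, 0.0, 0.0], [3.0] * 4, [4.0] * 4]
good_start = terms_evaluated(x, E, a_diag, means, q_init=0)  # true winner first
poor_start = terms_evaluated(x, E, a_diag, means, q_init=2)  # worst candidate first
```

With the true winner as the starting candidate, every other distribution is abandoned after its first term; with a poor starting candidate, the winner has to be evaluated in full before pruning becomes effective. An exhaustive evaluation would compute 12 terms for this data.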

【００８３】[0083]

[Effects of the Invention]

As is clear from the above description, in the speech recognition method of this invention the M normal distributions of the hidden Markov model are uncorrelated with one another, so the distance between the input speech feature vector x_t and a normal distribution whose weighted probability g_ijm(x_t) is not the maximum is longer than the distance to the normal distribution whose weighted probability g_ijm(x_t) is the maximum. The weighted probabilities g_ijm(x_t) that are not the maximum are therefore negligibly small compared with the largest weighted probability g_ijm(x_t). Consequently, even when the largest weighted probability g_ijm(x_t) is used as the output probability b_ij(x_t) of the input speech feature vector x_t, as in this invention, an output probability b_ij(x_t) approximately equal to the conventional one can be obtained.

【００８４】The logarithmic value G_ijm(x_t) of the weighted probability g_ijm(x_t) is expressed as G_ijm(x_t) = E_ijm − D_ijmt²/2, and E_ijm is constant for the m-th normal distribution, so the logarithmic value G_ijm(x_t) in the course of calculation starts at its peak E_ijm and decreases at each operation interval of the −D_ijmt²/2 operation.

【００８５】Therefore, by comparing the logarithmic value G_ijm(x_t) in the course of calculation with the maximum value candidate at every one or more operation intervals of the −D_ijmt²/2 operation, and terminating the calculation of that logarithmic value partway once it becomes smaller than the maximum value candidate, the amount of computation required to detect the maximum logarithmic value G_ijm(x_t) can be reduced.

【００８６】Moreover, the input speech feature vector x_t-1 of the immediately preceding frame and the input speech feature vector x_t of the next frame are close in time, so the components of x_t and x_t-1 are likely to be similar. Consequently, if the logarithmic value G_ijI(x_t-1) of the weighted probability g_ijI(x_t-1) obtained from the I-th normal distribution is the maximum logarithmic value G_ijm(x_t-1) in the (t−1)-th frame, the logarithmic value G_ijI(x_t) of the weighted probability g_ijI(x_t) obtained from the I-th normal distribution is also likely to be the maximum logarithmic value G_ijm(x_t) in the following t-th frame.

【００８７】Therefore, by using the logarithmic value G_ijI(x_t) obtained from this I-th normal distribution as the initial value of the maximum value candidate, and terminating the calculation of a logarithmic value G_ijm(x_t) partway once it becomes smaller than the maximum value candidate, the amount of computation required to detect the maximum logarithmic value G_ijm(x_t) can be reduced.

[Brief description of drawings]

【図１】FIG. 1 is a diagram schematically showing the configuration of a speech recognition device suitable for carrying out this invention.

【図２】FIG. 2 is a diagram for explaining an HMM.

【図３】FIG. 3 is an operation flow for calculating the weighted logarithmic value G_ijm(x_t) when t = 1.

【図４】FIG. 4 is an operation flow for calculating the weighted logarithmic value G_ijm(x_t) when t ≥ 2.

[Explanation of symbols]

10: speech recognition device; 12: dictionary unit; 14: acoustic processing unit; 16: speech section detection unit; 18: HMM matching unit; 20: reference information storage unit

Claims

[Claims]

1. A speech recognition method in which a hidden Markov model is used as a standard speech pattern, the hidden Markov model having an uncorrelated mixture normal distribution that comprises a plurality of mutually uncorrelated normal distributions and represents the output probability of a speech symbol vector output from the model; in which the likelihood between the hidden Markov model and the time series of input speech feature vectors extracted from the start frame to the end frame of a speech section is calculated using the logarithmic value of the output probability of each input speech feature vector; and in which the category name assigned to the hidden Markov model that yields the maximum likelihood is taken as the recognition result for the input speech signal of the speech section, wherein, with

b_ij(x_t): the output probability with which the input speech feature vector x_t extracted in the t-th frame is output from the hidden Markov model having the uncorrelated mixture normal distribution composed of a total of M normal distributions (1 ≤ t ≤ T; the first frame is the start frame of the speech section and the T-th frame is the end frame of the speech section),

g_ijm(x_t): the weighted probability of the input speech feature vector x_t calculated from the m-th (1 ≤ m ≤ M) of the M normal distributions (where g_ijm(x_t) = λ_ijm b_ijm(x_t), b_ijm(x_t) = (2π)^(-p/2) |ρ_ijm|^(-1/2) exp{−D_ijmt²/2}, D_ijmt² = (x_t − μ_ijm)' ρ_ijm^(-1) (x_t − μ_ijm); λ_ijm: weight of the m-th normal distribution; b_ijm(x_t): unweighted probability of the input speech feature vector x_t calculated from the m-th normal distribution; p: order of the input speech feature vector x_t; ρ_ijm: variance-covariance matrix of the m-th normal distribution; μ_ijm: mean vector of the m-th normal distribution; D_ijmt: Mahalanobis generalized distance between the input speech feature vector x_t and the m-th normal distribution), and

G_ijm(x_t): the logarithmic value of the weighted probability g_ijm(x_t) (where G_ijm(x_t) = E_ijm − D_ijmt²/2 and E_ijm = ln(λ_ijm) + ln{(2π)^(-p/2) |ρ_ijm|^(-1/2)}),

the maximum logarithmic value G_ijm(x_t) among the logarithmic values G_ijm(x_t) of the weighted probabilities g_ijm(x_t) calculated from each of the M normal distributions is used as the logarithmic value of the output probability b_ij(x_t) of the input speech feature vector x_t in calculating the likelihood with the hidden Markov model;

a reference information storage unit is provided that stores a maximum value candidate used, when t ≥ 2, for detecting the maximum logarithmic value G_ijm(x_t) in the t-th frame, and an index indicating, when t ≥ 2, from which normal distribution the maximum logarithmic value G_ijm(x_t-1) was obtained in the (t−1)-th frame;

when t = 1, the logarithmic value G_ijm(x_t) is calculated for each of all M normal distributions, the maximum logarithmic value G_ijm(x_t) is detected, the maximum logarithmic value G_ijm(x_t) is taken as the logarithmic value of the output probability b_ij(x_t) of the input speech feature vector x_t in the first frame, and the index corresponding to the normal distribution that yielded the maximum logarithmic value G_ijm(x_t) is stored; and

when t ≥ 2,
(1) first, the logarithmic value G_ijm(x_t) calculated using the normal distribution corresponding to the index is stored as the maximum value candidate,
(2) in calculating the logarithmic value G_ijm(x_t) using each remaining normal distribution, among the M normal distributions, that does not correspond to the index, the logarithmic value G_ijm(x_t) in the course of calculation is compared with the maximum value candidate at every one or more operation intervals of the operation that calculates the −D_ijmt²/2 term,
(3-A) when the logarithmic value G_ijm(x_t) in the course of calculation becomes smaller than the maximum value candidate, the calculation of that logarithmic value G_ijm(x_t) is terminated and calculation of the logarithmic value G_ijm(x_t) is then started for the next remaining normal distribution,
(3-B) when the calculation of the logarithmic value G_ijm(x_t) is completed without the value in the course of calculation becoming smaller than the maximum value candidate, the maximum value candidate is rewritten to that logarithmic value G_ijm(x_t) and calculation of the logarithmic value G_ijm(x_t) is then started for the next remaining normal distribution, and
(4) when the calculation of the logarithmic value G_ijm(x_t) has been finished for all of the remaining normal distributions, the index in the reference information storage unit is rewritten to the index corresponding to the normal distribution that yielded the maximum value candidate stored at that time, and the likelihood with the hidden Markov model is calculated using that maximum value candidate as the logarithmic value of the output probability b_ij(x_t).