JPH1185187A - Acoustic model generating device and speech recognition device

Info

Publication number: JPH1185187A
Application number: JP9245206A
Authority: JP
Inventors: Atsushi Nakamura; 篤中村
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1997-09-10
Filing date: 1997-09-10
Publication date: 1999-03-30
Anticipated expiration: 2017-09-10
Also published as: JP3009640B2

Abstract

PROBLEM TO BE SOLVED: To provide an acoustic model generating device and a speech recognition device that reflect the characteristics of speech data more precisely, improve the expressive power of a speaker-independent acoustic model compared with conventional examples, and recognize speech at a higher recognition rate. SOLUTION: An initial hidden Markov model (HMM) generation unit 21 generates an initial HMM by a prescribed learning algorithm based on the feature parameters of the speech data. An HMM reconstruction unit 22 reconstructs the initial HMM by adding components to the Gaussian mixture distributions of the HMM, based on the tendency of frame errors, which are frame-by-frame discrimination errors that the initial HMM makes on the speech data, and thereby generates a reconstructed HMM. A retraining unit 23 retrains the reconstructed HMM by a prescribed learning algorithm based on the feature parameters of the speech data, and thereby generates the acoustic model, which is the retrained HMM.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to an acoustic model generating device that generates an acoustic model such as a hidden Markov model (hereinafter, HMM), and to a speech recognition device that uses the generated acoustic model to recognize speech from the speech signal of an input uttered sentence.

[0002]

[Prior art] In a speaker-independent speech recognition device, the HMM used as the acoustic model must be both precise and robust in order to improve recognition performance. Improving precision is a question of how faithfully the acoustic phenomena of each acoustic unit (for example, a phoneme) are modeled from actual speech samples. In practice, however, the amount of usable speech samples is limited, so pursuing precision too far produces a model that is specialized to those samples alone and has very low robustness (overtraining). To secure robustness, model parameters are therefore often shared among different acoustic units. Sharing model parameters increases the amount of speech samples per parameter. As a result, the parameter estimates become comparatively reliable, especially for acoustic units that appear only rarely in the speech samples, and the robustness of the modeling as a whole improves. Sharing model parameters, however, means precisely that acoustic units sharing parameters are, at least in part, not discriminated from one another, and in some cases this may sacrifice modeling precision.

[0003] Precision and robustness thus trade off against each other: pursuing one impairs the other. It is important to find the balance point between the two as appropriately as possible for the available amount of speech samples. This balance point will also differ depending on the kind of language model that is combined with the acoustic model. For example, the role the acoustic model must play appears to differ between the case where only phoneme connection rules are used as the language model and the case where a word N-gram is used.

[0004] In the former case, the demand on the acoustic model's phoneme discrimination ability, that is, its precision, will be strong. This is because, if the acoustic model fails to make a clear-cut identification, the chance of recovering from that error with phoneme connection rules alone is limited. In the latter case, even if the acoustic model does not always succeed in making a clear-cut identification, the chance of recovering from the error is greater than in the former case, owing to constraints such as the phoneme sequences permitted by the vocabulary and the word connection probabilities. In other words, when relatively strong language constraints such as the latter are used, what is required of the acoustic model, compared with the case of loose language constraints, is not so much precision, that is, "giving the highest likelihood to the corresponding phoneme and low likelihoods to the other phonemes", as robustness, that is, "more distributions giving a reasonably high likelihood to the corresponding phoneme". Achieving the latter is considered to give better results overall, even if the former is sacrificed to some extent.

[0005] A representative prior study on sharing Gaussians among multiple Gaussian mixture distributions in an HMM is the construction of semi-continuous HMMs described, for example, in prior art document 1: X. D. Huang et al., "Unified Techniques for Vector Quantization and Hidden Markov Modeling Using Semi-continuous Models", Proceedings of ICASSP '89, pp. 639-642, 1989 (hereinafter, the first conventional example). As a method of determining distribution sharing relationships from speech samples, there is the successive state splitting and merging method (see, for example, prior art document 2: Jun-ichi Takami, "Automatic generation of highly efficient hidden Markov networks by the state splitting and merging method", Transactions of the IEICE (D-II), J78-D-II, No. 5, pp. 717-726, May 1995) (hereinafter, the second conventional example).

[0006]

[Problems to be solved by the invention] The sharing relationships obtained by the method of the first conventional example, however, are determined from parametric distances between Gaussian mixture distributions, and therefore do not reflect the characteristics of actual speech samples. The second conventional example has the following drawbacks:
(a) The sharing relationships are provisionally determined from parametric distances between Gaussian mixture distributions, and only the final acceptance or rejection is decided using speech samples, so there is no guarantee that a sharing structure based on the characteristics of the speech samples is formed.
(b) Gaussians are shared only in units of the entire set of components of a Gaussian mixture distribution, so the distributions lack expressive power.
(c) The parameters of the entire HMM are basically re-estimated at every merging decision for a single state, so it takes a long time to reach the final sharing relationships.

[0007] An object of the present invention is to solve the above problems and to provide an acoustic model generating device, and a speech recognition device, for generating an acoustic model that reflects the characteristics of speech data more precisely, improves the expressive power of a speaker-independent acoustic model compared with the conventional examples, and enables speech recognition at a higher recognition rate.

[0008]

[Means for solving the problems] The acoustic model generating device according to claim 1 of the present invention comprises: first generating means for generating an initial hidden Markov model by a predetermined learning algorithm based on feature parameters of predetermined speech data; second generating means for reconstructing the initial hidden Markov model generated by the first generating means by adding components to Gaussian mixture distributions of the hidden Markov model, based on the tendency of frame errors, which are discrimination errors that the initial hidden Markov model makes on the speech data in units of frames of a predetermined duration, thereby generating a reconstructed hidden Markov model; and third generating means for retraining the hidden Markov model generated by the second generating means by a predetermined learning algorithm based on the feature parameters of the speech data, thereby generating an acoustic model that is the retrained hidden Markov model.

[0009] The acoustic model generating device according to claim 2 is the acoustic model generating device according to claim 1, wherein the second generating means comprises first processing means and second processing means. The first processing means performs Viterbi alignment between the initial hidden Markov model and the speech data and thereby obtains (a) the Viterbi sequence of the Gaussian mixture distributions contained in the initial hidden Markov model and (b) the maximum-likelihood sequence of Gaussians, each of which is contained in the initial hidden Markov model as a component of a Gaussian mixture distribution and gives the highest likelihood to the corresponding frame of the speech data. The second processing means uses the frequency of occurrence of each pair of a Gaussian mixture distribution and a Gaussian found at the same time in the Viterbi sequence and the maximum-likelihood sequence obtained by the first processing means. For every combination of a Gaussian mixture distribution and a Gaussian contained in the initial hidden Markov model, it computes the frame error probability, that is, the probability that a frame error occurs in the Gaussian mixture distribution and the maximum-likelihood Gaussian at that time is the Gaussian of the combination. When a computed frame error probability exceeds a predetermined threshold, it adds that Gaussian as a new component of that Gaussian mixture distribution. Each Gaussian added as a new component by the second processing means becomes a component shared by both the Gaussian mixture distribution to which it belonged as a component in the initial hidden Markov model and the Gaussian mixture distribution to which it newly belongs as a component as a result of the second processing means.

[0010] Further, the speech recognition device according to the present invention comprises speech recognition means for recognizing speech from the speech signal of an input uttered sentence, using the acoustic model generated by the acoustic model generating device according to claim 1 or 2.

[0011]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings.

[0012] FIG. 1 is a block diagram of a continuous speech recognition device according to one embodiment of the present invention. This embodiment concerns a speech recognition device that uses a word-level N-gram. It assumes that an already trained acoustic model (HMM) is overly precise, that is, suited to speech recognition under loose language constraints, and it shares Gaussians among the multiple Gaussian mixture distributions of the trained HMM in order to increase its robustness. From a modeling standpoint, this sharing structure should be determined from speech samples. To solve the problems of the prior art, this embodiment uses a method of sharing Gaussians among multiple Gaussian mixture distributions in which (a) the sharing relationships are determined from speech data alone, (b) sharing is performed for each Gaussian individually, and (c) the parameters are re-estimated only once to determine the sharing structure, so the sharing relationships can be determined in a short time.

[0013] As shown in FIG. 1, the speech recognition device of this embodiment comprises:
(a) an initial HMM generation unit 21 that generates an initial HMM by a predetermined learning algorithm based on the feature parameters of predetermined speech data stored in a speech data memory 30;
(b) an HMM reconstruction unit 22 that reconstructs the initial HMM generated by the initial HMM generation unit 21 by adding components to Gaussian mixture distributions of the HMM, based on the tendency of frame errors, which are discrimination errors that the initial HMM makes on the speech data in units of frames of a predetermined duration, and thereby generates a reconstructed HMM; and
(c) a retraining unit 23 that retrains the HMM generated by the HMM reconstruction unit 22 by a predetermined learning algorithm based on the feature parameters of the speech data, and thereby generates an acoustic model that is the retrained HMM.

[0014] Here, the HMM reconstruction unit 22 (b1) performs Viterbi alignment between the initial HMM and the speech data to obtain the set of Gaussian mixture distributions and the set of Gaussians contained in the initial HMM, and then (b2), for every combination of an obtained Gaussian mixture distribution and an obtained Gaussian, computes the frame error probability, that is, the probability that a frame error occurs in the Gaussian mixture distribution and the maximum-likelihood Gaussian at that time is the Gaussian of the combination. When a computed frame error probability exceeds a predetermined threshold, it adds that Gaussian as a new component of that Gaussian mixture distribution.

[0015] The speech recognition device of FIG. 1 then recognizes speech from the speech signal of an input uttered sentence, using the phoneme HMM generated by the retraining unit 23. The device of this embodiment is a continuous speech recognition device with a word matching unit 4, which uses the known one-pass Viterbi decoding method to detect word hypotheses for the uttered sentence from the feature parameters of its speech signal and to compute and output their acoustic likelihoods. The device further comprises a likelihood correction unit 7 and a word hypothesis narrowing unit 6. For each word hypothesis output from the word matching unit 4 via a buffer memory 5, the likelihood correction unit 7 corrects the acoustic likelihood of the hypothesis by delaying the acoustic likelihood peak at the temporal center of each phoneme of the word so that it moves to a time later than that center. Based on word hypotheses whose total likelihoods include the acoustic likelihoods output from the likelihood correction unit 7, the word hypothesis narrowing unit 6 narrows down the word hypotheses so that, for each leading-phoneme environment of a word, they are represented by the single word hypothesis with the highest total likelihood computed from the utterance start time to the end time of that word.

[0016] The HMM generation process is described in detail below. The speech data memory 30 stores in advance the feature parameters of speech data from unspecified speakers. The feature parameters are obtained by LPC analysis of speech samples produced by A/D conversion of the speech waveform signal, frame by frame, and comprise the log power, 16th-order LPC cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The initial HMM generation unit 21 generates an initial HMM by the Baum-Welch learning algorithm based on the feature parameters of the predetermined speech data stored in the speech data memory 30, and stores it in an initial HMM memory 31.
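As a rough illustration of the feature vector layout described above, the following Python sketch appends Δ features to static features (log power plus 16 cepstrum coefficients) to give 34 values per frame. The text does not say how the Δ features are computed, so the first-order frame difference used here, and the function name `add_deltas`, are assumptions made only for illustration.

```python
def add_deltas(static_frames):
    """Append first-order differences to each static frame.

    static_frames: list of frames, each [log_power, c1, ..., c16] (17 values).
    The frame difference is an assumed delta definition, not the patent's.
    """
    out = []
    for t, cur in enumerate(static_frames):
        # The first frame has no predecessor, so its delta is taken as zero.
        prev = static_frames[t - 1] if t > 0 else cur
        delta = [c - p for c, p in zip(cur, prev)]
        out.append(list(cur) + delta)
    return out


# Hypothetical static features for three frames (17 values each).
static = [[float(t + k) for k in range(17)] for t in range(3)]
features = add_deltas(static)
```

Each resulting frame holds 17 static values followed by 17 Δ values.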

[0017] Next, based on the speech data in the speech data memory 30, the HMM reconstruction unit 22 reconstructs the Gaussian mixture distributions of the initial HMM (component addition and sharing) by the method of this embodiment, and stores the reconstructed HMM in a reconstructed HMM memory 32. The retraining unit 23 then re-estimates the parameters of the reconstructed HMM and stores the result in a phoneme HMM memory 11 as the final phoneme HMM. The reconstruction and the parameter re-estimation of this embodiment basically reuse the speech data that was used to create the initial model. This has the advantage that no new speech samples need to be prepared for this processing.

[0018] Next, the thinking behind component addition in the HMM reconstruction process is described. The essence of the method of this embodiment is to add components to Gaussian mixture distributions in view of the tendency of the frame-by-frame discrimination errors (hereinafter, frame errors) that the initial HMM makes on the speech data. A frame error is a frame that satisfies the following condition in the Viterbi alignment between the initial HMM and the speech samples.

(Equation 1)

[0019] Here, the function max(·) returns the maximum function value as the variable γ (a Gaussian mixture distribution) ranges over γ ∈ Γ. The other symbols are:
o_t: the feature vector at time t;
g_t: the Gaussian mixture distribution assigned to time t by the Viterbi alignment;
Γ: the set of all Gaussian mixture distributions in the initial HMM (that is, the set of the Gaussian mixture distributions of the states of the initial HMM);
P(o|g): the likelihood that the feature vector o is output from the distribution g.
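Equation 1 itself is not reproduced in this text. From the definitions of max(·), o_t, g_t, Γ and P(o|g) given here, a plausible form of the frame error condition is the following. This is a reconstruction under that assumption, not the equation as printed in the patent.

```latex
P(o_t \mid g_t) \;<\; \max_{\gamma \in \Gamma} P(o_t \mid \gamma)
```

Read this way, a frame error occurs at time t when some Gaussian mixture distribution in Γ gives the feature vector o_t a higher likelihood than the mixture g_t that the Viterbi alignment assigned to that frame.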

[0020] In the following, a frame error is denoted by the event E, and the complement of E is written E^c. A Gaussian mixture distribution that frequently makes frame errors for a particular acoustic phenomenon is prone to a drop in acoustic likelihood when it is matched against that phenomenon on the correct path during actual speech recognition. Now consider the following Gaussian x_t for time t.

(Equation 2)

Here, the function argmax returns the value of the variable ξ (a Gaussian) at which the function value is largest as ξ ranges over ξ ∈ Ξ. Ξ is the set of all Gaussians in the initial HMM (that is, the set of the Gaussians that make up the Gaussian mixture distributions of the states of the initial HMM).
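Equation 2 is likewise not reproduced in this text, but the description of argmax suggests that x_t is the Gaussian in Ξ that gives o_t the highest likelihood. The Python sketch below shows that selection under two assumptions that the text does not state: the Gaussians have diagonal covariance, and they are stored as (mean, variance) pairs keyed by a hypothetical identifier.

```python
import math


def log_gauss_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector o."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )


def max_likelihood_gaussian(o, gaussians):
    """Return the id of the Gaussian in `gaussians` (the set Xi) that gives
    feature vector o the highest likelihood, i.e. a candidate for x_t."""
    return max(gaussians, key=lambda k: log_gauss_diag(o, *gaussians[k]))


# Hypothetical two-dimensional example with three Gaussians.
gaussians = {
    "xi_a": ([0.0, 0.0], [1.0, 1.0]),
    "xi_b": ([3.0, 3.0], [1.0, 1.0]),
    "xi_c": ([-3.0, 2.0], [0.5, 0.5]),
}
frames = [[0.2, -0.1], [2.8, 3.1], [-2.9, 2.2]]
x_seq = [max_likelihood_gaussian(o, gaussians) for o in frames]
```

Applying the selection to every frame gives a maximum-likelihood sequence of Gaussians, one identifier per frame.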

[0021] x_t is by nature a component of one of the Gaussian mixture distributions in the HMM, but here it is treated as a standalone Gaussian that gives the feature vector o_t the maximum acoustic likelihood. By analyzing how often each element occurs in the time series {g_t} of elements of Γ and the time series {x_t} of elements of Ξ, the conditional frame error probability P(E, ξ|γ) (γ ∈ Γ, ξ ∈ Ξ) is obtained. P(E, ξ|γ) is the probability that a frame error occurs in the Gaussian mixture distribution γ and the maximum-likelihood Gaussian at that time is the Gaussian ξ. If P(E, ξ|γ) is reasonably large, the Gaussian mixture distribution γ can be said to be prone to a drop in acoustic likelihood when it is matched against acoustic phenomena in the neighborhood of the Gaussian ξ. Adding the Gaussian ξ as a new component of the Gaussian mixture distribution γ is therefore expected to suppress this adverse effect.
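A minimal Python sketch of how P(E, ξ|γ) might be estimated from occurrence counts follows. The normalization is an assumption: the estimate divides the number of error frames for the pair (γ, ξ) by the total number of frames aligned to γ, with or without error. The patent's own formula is not part of this passage, and the function name and data layout are hypothetical.

```python
from collections import Counter


def conditional_frame_error_prob(err_counts, ok_counts):
    """Estimate P(E, xi | gamma) from counts (assumed normalization).

    err_counts[(gamma, xi)]: frames aligned to mixture gamma that were frame
        errors and whose maximum-likelihood Gaussian was xi.
    ok_counts[gamma]: frames aligned to gamma without a frame error.
    """
    total = Counter(ok_counts)  # copy, then add error frames per mixture
    for (gamma, _xi), c in err_counts.items():
        total[gamma] += c
    return {
        (gamma, xi): c / total[gamma]
        for (gamma, xi), c in err_counts.items()
        if total[gamma] > 0
    }


# Hypothetical counts for two mixtures.
err_counts = {("g1", "x7"): 30, ("g1", "x9"): 10, ("g2", "x7"): 5}
ok_counts = {"g1": 60, "g2": 95}
probs = conditional_frame_error_prob(err_counts, ok_counts)
```

With these counts, mixture "g1" has 100 aligned frames in total, so the pair ("g1", "x7") gets an estimate of 30/100.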

[0022] Next, the algorithm of the HMM reconstruction process is described. Component addition and sharing for the Gaussian mixture distributions are realized by executing the following steps SS1 to SS5.
(Step SS1) Perform Viterbi alignment between the initial HMM and the speech samples to obtain the time series of elements of Γ, the set of Gaussian mixture distributions, that is, the Viterbi sequence {g_t}, and the time series of elements of Ξ, the set of Gaussians, that is, the maximum-likelihood sequence {x_t} of Gaussians (see FIG. 7).
(Step SS2) From the time series obtained in step SS1, obtain the conditional frame error probability P(E, ξ|γ) for every combination of a Gaussian mixture distribution γ and a Gaussian ξ (see FIG. 8).
(Step SS3) Execute step SS4 for every Gaussian mixture distribution γ.
(Step SS4) Execute step SS5 for every Gaussian ξ.
(Step SS5) If the conditional frame error probability P(E, ξ|γ) exceeds a predetermined threshold, add the Gaussian ξ as a new component of the Gaussian mixture distribution γ (see FIG. 9). The added component is shared between the Gaussian mixture distribution γ and the Gaussian mixture distribution to which the Gaussian ξ originally belonged.
The thresholding in step SS5 prevents components from being added in response to frame errors caused by incidental noise and the like in the speech data.
In other words, the HMM reconstruction process comprises the following:
(I) A first process performs Viterbi alignment between the initial HMM and the speech data and thereby obtains (a) the Viterbi sequence of the Gaussian mixture distributions contained in the initial HMM and (b) the maximum-likelihood sequence of Gaussians, each of which is contained in the initial HMM as a component of a Gaussian mixture distribution and gives the highest likelihood to the corresponding frame of the speech data.
(II) A second process uses the frequency of occurrence of each pair of a Gaussian mixture distribution and a Gaussian found at the same time in the Viterbi sequence and the maximum-likelihood sequence obtained by the first process. For every combination of a Gaussian mixture distribution and a Gaussian contained in the initial HMM, it computes the frame error probability, that is, the probability that a frame error occurs in the Gaussian mixture distribution and the maximum-likelihood Gaussian at that time is the Gaussian of the combination. When a computed frame error probability exceeds a predetermined threshold, it adds that Gaussian as a new component of that Gaussian mixture distribution.
(III) Each Gaussian added as a new component of a Gaussian mixture distribution by the second process becomes a component shared by both the Gaussian mixture distribution to which it belonged as a component in the initial HMM and the Gaussian mixture distribution to which it newly belongs as a result of the second process.
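Steps SS3 to SS5 can be sketched in Python as a double loop over mixtures and Gaussians with a threshold test. The sketch assumes that each mixture is a list of Gaussian identifiers, so that a Gaussian is shared when its identifier appears in more than one list. The data layout, the function name, the example threshold of 0.1 and the check that skips Gaussians already in the mixture are all illustrative assumptions. Mixture weights are not handled here.

```python
def add_shared_components(mixtures, probs, threshold):
    """Add Gaussian xi to mixture gamma when P(E, xi | gamma) > threshold.

    mixtures: {mixture_id: [gaussian_id, ...]}
    probs: {(mixture_id, gaussian_id): conditional frame error probability}
    Returns a new dict; the input is left unchanged.
    """
    all_gaussians = sorted({xi for comps in mixtures.values() for xi in comps})
    rebuilt = {gamma: list(comps) for gamma, comps in mixtures.items()}
    for gamma in mixtures:                    # step SS3: every mixture
        for xi in all_gaussians:              # step SS4: every Gaussian
            p = probs.get((gamma, xi), 0.0)
            # Step SS5: threshold test (skipping duplicates is an assumption).
            if p > threshold and xi not in rebuilt[gamma]:
                rebuilt[gamma].append(xi)     # xi is now shared by reference
    return rebuilt


# Hypothetical model with two mixtures of two Gaussians each.
mixtures = {"g1": ["x1", "x2"], "g2": ["x7", "x8"]}
probs = {("g1", "x7"): 0.30, ("g1", "x8"): 0.02, ("g2", "x1"): 0.01}
rebuilt = add_shared_components(mixtures, probs, threshold=0.1)
```

In this example only the pair ("g1", "x7") exceeds the threshold, so "x7" ends up in both "g1" and its original mixture "g2".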

[0023] FIGS. 3 to 6 are flowcharts showing the details of the HMM reconstruction process executed by the HMM reconstruction unit 22 of FIG. 1. First, in step S1 of FIG. 3, the parameter n, the speech data number, is initialized to 0. Next, in step S2, Viterbi alignment is performed between speech data #n and the initial HMM to obtain the Viterbi sequence {g^n_t} of Gaussian mixture distributions and the maximum-likelihood sequence {x^n_t} of Gaussians. In step S3 it is determined whether step S2 has been performed for all the data. If NO, the parameter n is incremented by 1 in step S4 and step S2 is executed again. If YES in step S3, the process proceeds to step S11 of FIG. 4.

【００２４】図４のステップＳ１１で、すべての計数値
Ｃ（・）を０に初期化し、ステップＳ１２で、音声デー
タ番号ｎのパラメータを０に初期化し、ステップＳ１３
でフレーム番号ｔのパラメータを０に初期化した後、ス
テップＳ１４で、フレーム誤りが生じたか否かが判断さ
れ、フレーム誤りが生じたときは、ステップＳ１５でフ
レーム誤りを計数する計数値Ｃ（Ｅ，ｇⁿｔ｜ｘⁿｔ）を
１だけインクリメントしてステップＳ１７に進む。一
方、ステップＳ１４でフレーム誤りが生じていないとき
は、ステップＳ１６で、フレーム誤りが生じていないこ
とを計数する計数値Ｃ（Ｅ^c｜ｇⁿｔ）を１だけインクリ
メントしてステップＳ１７に進む。ステップＳ１７で
は、音声データ＃ｎの全フレームについてステップＳ１
４の処理を実施したか否かが判断され、ＮＯのときはス
テップＳ１８でパラメータｔを１だけインクリメントし
てステップＳ１４に戻る。一方、ステップＳ１７でＹＥ
ＳのときはステップＳ１９で、全音声データについてス
テップＳ１４の処理を実施したか否かが判断され、ＮＯ
のときはステップＳ２０でデータ番号ｎのパラメータを
１だけインクリメントしてステップＳ１３に戻る。一
方、ステップＳ１９でＹＥＳのときは、図５のステップ
Ｓ２１に進む。
In step S11 of FIG. 4, all count values C(·) are initialized to 0. In step S12, the speech data number n is initialized to 0, and in step S13 the frame number t is initialized to 0. In step S14, it is determined whether a frame error has occurred. If so, the count value $C(E, g^n_t \mid x^n_t)$, which counts frame errors, is incremented by 1 in step S15 and the process proceeds to step S17. If no frame error has occurred in step S14, the count value $C(E^c \mid g^n_t)$, which counts frames without a frame error, is incremented by 1 in step S16 and the process proceeds to step S17. In step S17, it is determined whether step S14 has been performed for all frames of speech data #n. If NO, t is incremented by 1 in step S18 and the process returns to step S14. If YES in step S17, it is determined in step S19 whether step S14 has been performed for all speech data. If NO, n is incremented by 1 in step S20 and the process returns to step S13. If YES in step S19, the process proceeds to step S21 of FIG. 5.
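The counting loop of steps S11 to S20 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the names `count_frame_errors`, `alignments`, and `components` are hypothetical, each utterance is given as a list of (mixture id, maximum likelihood Gaussian id) pairs taken from the Viterbi alignment, and a frame error is assumed to occur when the maximum likelihood Gaussian is not a component of the aligned mixture.

```python
from collections import Counter

def count_frame_errors(alignments, components):
    """Count frame errors and non-errors over all utterances.

    alignments: list of utterances; each is a list of (g, x) pairs, where g is
        the Viterbi-aligned mixture id and x the maximum likelihood Gaussian id.
    components: dict mapping mixture id -> set of Gaussian ids it contains.
    Assumption: a frame is an error when x is not a component of g.
    """
    error_counts = Counter()    # C(E, g | x), keyed by (g, x)
    correct_counts = Counter()  # C(E^c | g), keyed by g
    for utterance in alignments:            # loop over n (steps S12, S19, S20)
        for g, x in utterance:              # loop over t (steps S13, S17, S18)
            if x not in components[g]:      # step S14: frame error?
                error_counts[(g, x)] += 1   # step S15
            else:
                correct_counts[g] += 1      # step S16
    return error_counts, correct_counts

# Hypothetical toy data: two mixtures, three Gaussians.
components = {"g0": {"x0", "x1"}, "g1": {"x2"}}
alignments = [[("g0", "x0"), ("g0", "x2"), ("g1", "x2")],
              [("g0", "x2"), ("g1", "x0")]]
errors, correct = count_frame_errors(alignments, components)
```

Keying the error counter by the (mixture, Gaussian) pair keeps one count per combination, which is what the later per-combination probability needs.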

【００２５】図５のステップＳ２１では、ガウス混合分
布の番号ｉのパラメータを０に初期化し、ステップＳ２
２でガウス分布の番号ｊのパラメータを０に初期化した
後、ステップＳ２３で、次式を用いて条件付きフレーム
誤り確率Ｐ（Ｅ,γ_i|ξ_j）を計算する。
In step S21 of FIG. 5, the Gaussian mixture distribution number i is initialized to 0, and in step S22 the Gaussian distribution number j is initialized to 0. Then, in step S23, the conditional frame error probability $P(E, \gamma_i \mid \xi_j)$ is calculated using the following equation.

【数３】そして、ステップＳ２４で全ガウス分布についてステッ
プＳ２３の処理を実施したか否かが判断され、ＮＯのと
きはステップＳ２５でパラメータｊを１だけインクリメ
ントしてステップＳ２３に戻る。一方、ステップＳ２４
でＹＥＳのときはステップＳ２６で、全ガウス混合分布
についてステップＳ２３の処理を実施したか否かが判断
され、ＮＯのときはステップＳ２７でパラメータｉを１
だけインクリメントしてステップＳ２２に戻る。一方、
ステップＳ２６でＹＥＳのときは、図６のステップＳ３
１に進む。
(Equation 3) Then, in step S24, it is determined whether step S23 has been performed for all Gaussian distributions. If NO, j is incremented by 1 in step S25 and the process returns to step S23. If YES in step S24, it is determined in step S26 whether step S23 has been performed for all Gaussian mixture distributions. If NO, i is incremented by 1 in step S27 and the process returns to step S22. If YES in step S26, the process proceeds to step S31 of FIG. 6.

【００２６】図６のステップＳ３１では、ガウス混合分
布の番号ｉのパラメータを０に初期化し、ステップＳ３
２でガウス分布の番号ｊのパラメータを０に初期化した
後、ステップＳ３３で、条件付きフレーム誤り確率Ｐ
（Ｅ,γ_i|ξ_j）がしきい値ρ（好ましい実施形態では
０．０１である。）を超えるとき、ガウス混合分布ξを
ガウス分布γの新たなコンポーネントとして追加した
後、ステップＳ３５に進み、一方、ステップＳ３３でＮ
ＯであるときはそのままステップＳ３５に進む。ステッ
プＳ３５で、全ガウス分布についてステップＳ３３の処
理を実施したか否かが判断され、ＮＯのときはステップ
Ｓ３６でパラメータｊを１だけインクリメントしてステ
ップＳ３３に戻る。一方、ステップＳ３５でＹＥＳのと
きは、ステップＳ３７で、全ガウス混合分布についてス
テップＳ３３の処理を実施したか否かが判断され、ＮＯ
のときはステップＳ３８でパラメータｉを１だけインク
リメントしてステップＳ３２に戻る。一方、ステップＳ
３７でＹＥＳのときはステップＳ３９で得られた再構成
ＨＭＭをメモリ３２に格納して、当該ＨＭＭ再構成処理
を終了する。
In step S31 of FIG. 6, the Gaussian mixture distribution number i is initialized to 0, and in step S32 the Gaussian distribution number j is initialized to 0. Then, in step S33, when the conditional frame error probability $P(E, \gamma_i \mid \xi_j)$ exceeds the threshold ρ (0.01 in the preferred embodiment), the Gaussian distribution $\xi_j$ is added as a new component of the Gaussian mixture distribution $\gamma_i$, and the process proceeds to step S35; if NO in step S33, the process proceeds directly to step S35. In step S35, it is determined whether step S33 has been performed for all Gaussian distributions. If NO, j is incremented by 1 in step S36 and the process returns to step S33. If YES in step S35, it is determined in step S37 whether step S33 has been performed for all Gaussian mixture distributions. If NO, i is incremented by 1 in step S38 and the process returns to step S32. If YES in step S37, the reconstructed HMM thus obtained is stored in the reconstructed HMM memory 32 in step S39, and the HMM reconstruction process ends.
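The thresholded component addition of steps S31 to S39 can be illustrated with a short Python sketch. It is only an illustration under assumptions: `add_shared_components`, `components`, and `error_prob` are hypothetical names, the probabilities are taken as given inputs (the sketch does not reproduce Equation 3), and the numeric values are made up.

```python
def add_shared_components(components, error_prob, rho=0.01):
    """Return new mixture membership after thresholded component addition.

    components: dict mixture id -> set of Gaussian ids (initial HMM).
    error_prob: dict (mixture id, Gaussian id) -> conditional frame error
        probability; how it is computed (Equation 3) is outside this sketch.
    rho: threshold for component addition (0.01 in the preferred embodiment).
    """
    new_components = {g: set(xs) for g, xs in components.items()}
    for (g, x), p in error_prob.items():
        if p > rho:
            # The Gaussian x is referenced, not copied: it stays a member of
            # its original mixture and is now shared with mixture g as well.
            new_components[g].add(x)
    return new_components

components = {"g0": {"x0", "x1"}, "g1": {"x2"}}
error_prob = {("g0", "x2"): 0.05, ("g1", "x0"): 0.004}  # hypothetical values
reconstructed = add_shared_components(components, error_prob)
```

Because Gaussians are stored by id, adding one to a second mixture creates sharing without increasing the total number of Gaussian distributions.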

【００２７】さらに、再学習部２３では、ＨＭＭの再構
成の後、尤度最大、尤度比最大等の基準により、例え
ば、バーム・ウエルチの学習アルゴリズムを用いて、以
下の各パラメータを再推定して、再学習後の音素ＨＭＭ
を音素ＨＭＭメモリ１１に格納する。（ａ）各ガウス分布の平均、（ｂ）各ガウス分布の分散（ｃ）各ガウス混合分布の混合重み（ｄ）状態遷移確率
Further, after the HMM has been reconstructed, the relearning unit 23 re-estimates the following parameters under a criterion such as maximum likelihood or maximum likelihood ratio, using, for example, the Baum-Welch learning algorithm, and stores the relearned phoneme HMM in the phoneme HMM memory 11: (a) the mean of each Gaussian distribution, (b) the variance of each Gaussian distribution, (c) the mixture weights of each Gaussian mixture distribution, and (d) the state transition probabilities.

【００２８】ガウス分布の平均、分散、及び状態遷移確
率については、初期ＨＭＭの値をそのまま初期値として
用いる。また、ガウス混合分布の混合重みについては、
フレーム誤り確率、及びコンポーネント追加実行のしき
い値を考慮して、以下のように初期値を定める。
For the means and variances of the Gaussian distributions and the state transition probabilities, the values of the initial HMM are used as initial values without change. For the mixture weights of the Gaussian mixture distributions, initial values are determined as follows, taking into account the frame error probabilities and the threshold for component addition.

【００２９】まず、ガウス混合分布γにガウス分布ξが
新たなコンポーネントとして追加された場合の混合重み
初期値を、条件付きフレーム誤り確率の値をそのまま用
いて、次式とする。
First, when a Gaussian distribution ξ is added as a new component to a Gaussian mixture distribution γ, the initial value of its mixture weight is set to the value of the conditional frame error probability itself, as in the following equation.

【数４】 $\hat{w}^{\gamma}_{\xi} = P(E, \xi \mid \gamma)$

【００３０】コンポーネントが追加されたことにより、
初期ＨＭＭに元々含まれていたコンポーネントの混合重
みに対しても新たな初期値が必要となる。これらを、次
式により与える。
Because components have been added, new initial values are also needed for the mixture weights of the components originally included in the initial HMM. These are given by the following equation.

【数５】 $\hat{w}^{\gamma}_{\xi} = \{P(E^{c} \mid \gamma) + P(E, \Xi^{\gamma}_{\rho} \mid \gamma)\} \cdot w^{\gamma}_{\xi}$
ただし、ｗ^γ _ξは、初期ＨＭＭにおける、ガウス分布ξのガウス混合分布γにおける混合重みである。ここで、Ξ^γ _ρは、コンポーネント追加実行のしきい値がρのときに、ガウス混合分布γに対するコンポーネント追加の対象とならないガウス分布の集合であり、以下によって与えられる。
Here, $w^{\gamma}_{\xi}$ is the mixture weight of Gaussian distribution ξ in Gaussian mixture distribution γ in the initial HMM, and $\Xi^{\gamma}_{\rho}$ is the set of Gaussian distributions that are not subject to component addition to the Gaussian mixture distribution γ when the threshold for component addition is ρ. It is given by the following.

【数６】 $\Xi^{\gamma}_{\rho} = \{\xi \mid P(E, \xi \mid \gamma) < \rho;\ \xi \in \Xi\}$
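Equations 4 to 6 can be combined into a small Python sketch of the weight initialization for one mixture γ. The sketch rests on assumptions that are not stated in the source: $P(E, \Xi^{\gamma}_{\rho} \mid \gamma)$ is read as the sum of $P(E, \xi \mid \gamma)$ over the Gaussians in the set, a Gaussian whose probability equals ρ exactly is treated as added, and the function name, argument names, and numbers are hypothetical.

```python
def initial_mixture_weights(old_weights, error_prob, correct_prob, rho=0.01):
    """Initial mixture weights for one reconstructed mixture gamma.

    old_weights: dict Gaussian id -> weight w in the initial HMM (sums to 1).
    error_prob: dict Gaussian id -> P(E, xi | gamma) for Gaussians outside gamma.
    correct_prob: P(E^c | gamma), the probability of no frame error.
    """
    added = {xi: p for xi, p in error_prob.items() if p >= rho}
    # P(E, Xi_rho | gamma): total error probability of Gaussians not added (Eq. 6).
    not_added_mass = sum(p for p in error_prob.values() if p < rho)
    scale = correct_prob + not_added_mass                # factor in Eq. 5
    weights = {xi: scale * w for xi, w in old_weights.items()}
    weights.update(added)                                # Eq. 4
    return weights

# Hypothetical values; the three probabilities sum to 1.
weights = initial_mixture_weights(
    old_weights={"x0": 0.6, "x1": 0.4},
    error_prob={"x2": 0.05, "x3": 0.004},
    correct_prob=0.946,
)
```

Under these assumptions the new weights again sum to 1, because the probability mass given to the added components is exactly what the scale factor removes from the original ones.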

【００３１】本実施形態において用いる尤度補正部７の
尤度補正は、遅延決定（Delayed decision）のビーム探
索と呼ぶことができる。この遅延決定のビーム探索は、
第４の従来例のような尤度の先読みや、非線形関数によ
る尤度のマッピングによらずに、すでに探索を終えた経
路の尤度の評価を遅らせることによって、尤度の局所的
変動に対処する。なお、以下の計算において、尤度とは
対数尤度を指すものとする。本実施形態において、各符
号を尤度補正部７においてのみ以下のように定義する。（ａ）ｔ：時刻；（ｂ）Ｓ：ビーム探索の経路；（ｃ）ｑ_A（Ｓ，ｔ）：経路Ｓ，時刻ｔにおける音響尤
度；（ｄ）Ｑ_A（Ｓ，ｔ）：経路Ｓ，時刻ｔにおける文頭か
ら累積音響尤度；（ｅ）Ｑ_L（Ｓ，ｔ）：経路Ｓ，時刻ｔにおける文頭か
らの累積言語尤度。
The likelihood correction performed by the likelihood correction unit 7 in this embodiment can be called a delayed-decision beam search. This delayed-decision beam search does not rely on likelihood look-ahead, as in the fourth conventional example, or on mapping likelihoods through a nonlinear function; instead, it copes with local fluctuations in likelihood by delaying the evaluation of the likelihood of paths that have already been searched. In the following calculations, likelihood means log likelihood. In this embodiment, the following symbols are defined for use in the likelihood correction unit 7 only: (a) t: time; (b) S: a beam search path; (c) $q_A(S,t)$: the acoustic likelihood of path S at time t; (d) $Q_A(S,t)$: the cumulative acoustic likelihood of path S from the beginning of the sentence up to time t; (e) $Q_L(S,t)$: the cumulative language likelihood of path S from the beginning of the sentence up to time t.

【００３２】ここで、音響尤度は、単語照合部４におい
て音素ＨＭＭメモリ１１内の音素ＨＭＭを参照して計算
される尤度であり、言語尤度は、単語照合部４において
統計的言語モデルメモリ１３内の言語モデルを参照して
計算される尤度である。以上のように定義したとき、一
般に、累積音響尤度は１フレーム毎の音響尤度を足し合
わせることによって次式で求められる。
Here, the acoustic likelihood is the likelihood calculated by the word matching unit 4 with reference to the phoneme HMM in the phoneme HMM memory 11, and the language likelihood is the likelihood calculated by the word matching unit 4 with reference to the language model in the statistical language model memory 13. With these definitions, the cumulative acoustic likelihood is generally obtained by summing the per-frame acoustic likelihoods, as in the following equation.

【数７】 $Q_A(S,t) = Q_A(S,t-1) + q_A(S,t)$

【００３３】そして、ビーム探索に使用する文頭からの
累積総合尤度Ｑ_all（Ｓ，ｔ）は、音響尤度Ｑ_A（Ｓ，
ｔ）と言語尤度Ｑ_L（Ｓ，ｔ）を用いて次式で計算され
る。
The cumulative total likelihood $Q_{all}(S,t)$ from the beginning of the sentence, which is used for the beam search, is calculated from the cumulative acoustic likelihood $Q_A(S,t)$ and the cumulative language likelihood $Q_L(S,t)$ by the following equation.

【数８】 $Q_{all}(S,t) = Q_A(S,t) + \alpha \cdot Q_L(S,t)$

【００３４】ここで、定数αは言語尤度の音響尤度に対
する重み係数であり、好ましい実施形態においては、α
＝４．５である。本実施形態における、遅延決定のビー
ム探索では、次式に示すように、上記数２において、Ｑ
_A（Ｓ，ｔ）の代わりにＱ_A（Ｓ，ｔ）から遅延音響尤度
Ｑ_Ad（Ｓ，ｔ）を差し引いた尤度Ｑ_A’（Ｓ，ｔ）を使
用する。すなわち、時刻ｔ−１では、次式に示すよう
に、Ｑ_A（Ｓ，ｔ−１）の代わりにＱ_A（Ｓ，ｔ−１）か
ら遅延音響尤度Ｑ_Ad（Ｓ，ｔ−１）を差し引いた尤度Ｑ
_A’（Ｓ，ｔ−１）を使用する。
Here, the constant α is a weighting coefficient of the language likelihood relative to the acoustic likelihood; in the preferred embodiment, α = 4.5. In the delayed-decision beam search of this embodiment, Equation 8 is evaluated using, in place of $Q_A(S,t)$, the likelihood $Q_A'(S,t)$ obtained by subtracting the delayed acoustic likelihood $Q_{Ad}(S,t)$ from $Q_A(S,t)$, as shown in the following equation. Likewise, at time $t-1$, the likelihood $Q_A'(S,t-1)$ obtained by subtracting the delayed acoustic likelihood $Q_{Ad}(S,t-1)$ from $Q_A(S,t-1)$ is used in place of $Q_A(S,t-1)$.

【００３５】[0035]

【数９】 $Q_A'(S,t) = Q_A(S,t) - Q_{Ad}(S,t)$
ここで、上記数９の右辺の第２項の尤度Ｑ_Ad（Ｓ，ｔ）は次式で計算される。
Here, the likelihood $Q_{Ad}(S,t)$, the second term on the right-hand side of Equation 9, is calculated by the following equations.

【数１０】 $D = Q_{Ad}(S,t-1) + q_A(S,t)$

【数１１】 $Q_{Ad}(S,t) = F(D) \cdot D$

【００３６】上記数９を書き換えると、上記数７を参照
して書き換えると、次式を得る。
Rewriting Equation 9 with reference to Equation 7 gives the following equation.

【数１２】 $Q_A'(S,t) = Q_A'(S,t-1) + q_A'(S,t)$
ここで、尤度ｑ_A’（Ｓ，ｔ）を次式により決定する。
Here, the likelihood $q_A'(S,t)$ is determined by the following equation.

【数１３】 $q_A'(S,t) = f(x) = f\bigl(q_A(S,t) + Q_A(S,t-1) - Q_A'(S,t-1)\bigr)$
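Equations 7 and 9 to 13 can be checked for mutual consistency with a short derivation. The following sketch uses only the symbols defined above; the closing relation between f and F is a consequence drawn here from those equations, not a statement made in the source.

```latex
\begin{align*}
x &= q_A(S,t) + Q_A(S,t-1) - Q_A'(S,t-1) \\
  &= q_A(S,t) + Q_{Ad}(S,t-1) = D
     && \text{(Eqs. 9, 10)} \\
q_A'(S,t) &= Q_A'(S,t) - Q_A'(S,t-1)
     && \text{(Eq. 12)} \\
  &= \bigl[Q_A(S,t) - Q_{Ad}(S,t)\bigr] - \bigl[Q_A(S,t-1) - Q_{Ad}(S,t-1)\bigr]
     && \text{(Eq. 9)} \\
  &= q_A(S,t) + Q_{Ad}(S,t-1) - Q_{Ad}(S,t)
     && \text{(Eq. 7)} \\
  &= D - F(D)\,D = \bigl(1 - F(D)\bigr)\,D
     && \text{(Eqs. 10, 11)}
\end{align*}
```

If the equations are read this way, the argument x of Equation 13 coincides with D, and the two functions are tied together by $f(x) = (1 - F(x))\,x$.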

【００３７】ここで、上記数１３における｛Ｑ_A(Ｓ，ｔ
−１)−Ｑ_A’(Ｓ，ｔ−１)｝は、Ｑ_Ad（Ｓ，ｔ−１）で
あり、本特許出願人による特許出願の特開平９−８１１
８５号公報における実施形態と比較して１時刻前の過小
評価分であり、このデータは、尤度補正部７に接続され
る過小評価尤度メモリ１４に順次記憶されて、次の時刻
ｔにおける音響尤度を補正して総合尤度を計算するため
に用いられる。従って、本実施形態においては、尤度補
正部７は、時刻（ｔ−１）において、各単語仮説に対し
て、１時刻前の過小評価分データである上記数７におけ
る｛Ｑ_A(Ｓ，ｔ−１)−Ｑ_A’(Ｓ，ｔ−１)｝を計算し
て、過小評価尤度メモリ１４に記憶し、次いで、時刻ｔ
において、上記数１２と上記数１３とを用いて、過小評
価するように補正された音響尤度Ｑ_A’（Ｓ，ｔ）を計
算し、次いで、上記数８を書き換えた次式とを用いて、
累積尤度である総合尤度Ｑ’_all（Ｓ，ｔ）を計算し、
当該計算された総合尤度Ｑ’_all（Ｓ，ｔ）を有する単
語仮説をバッファメモリ５を介して単語仮説絞込部６に
出力する。
Here, $\{Q_A(S,t-1) - Q_A'(S,t-1)\}$ in Equation 13 equals $Q_{Ad}(S,t-1)$; compared with the embodiment of Japanese Patent Laid-Open Publication No. 9-81185, a patent application by the present applicant, it is the underestimated amount at the preceding time. This data is sequentially stored in the underestimated likelihood memory 14 connected to the likelihood correction unit 7 and is used to correct the acoustic likelihood at the next time t when the total likelihood is calculated. Accordingly, in this embodiment, at time $(t-1)$ the likelihood correction unit 7 calculates, for each word hypothesis, $\{Q_A(S,t-1) - Q_A'(S,t-1)\}$ of Equation 13, the underestimated amount at the preceding time, and stores it in the underestimated likelihood memory 14. Then, at time t, it calculates the acoustic likelihood $Q_A'(S,t)$, corrected so as to be underestimated, using Equations 12 and 13; calculates the total likelihood $Q'_{all}(S,t)$, which is a cumulative likelihood, using the following equation obtained by rewriting Equation 8; and outputs the word hypotheses having the calculated total likelihood $Q'_{all}(S,t)$ to the word hypothesis narrowing unit 6 via the buffer memory 5.

【数１４】 $Q'_{all}(S,t) = Q_A'(S,t) + \alpha \cdot Q_L(S,t)$
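One frame of the delayed-decision update (Equations 7, 9, 10, 11, and 14) can be sketched in Python as below. This is an illustration only: the source does not give the form of F, so `delay_ratio` is a hypothetical constant placeholder, and the function name, argument names, and log likelihood values are likewise assumptions chosen to make the arithmetic easy to follow.

```python
ALPHA = 4.5  # language likelihood weight in the preferred embodiment

def delay_ratio(d):
    """Hypothetical second function F(D); the source does not give its form."""
    return 0.5

def corrected_step(q_a, big_q_a_prev, q_ad_prev, big_q_l, F=delay_ratio):
    """Advance one frame for one path S and return the updated quantities.

    q_a: acoustic log likelihood q_A(S, t) of the current frame.
    big_q_a_prev: cumulative acoustic log likelihood Q_A(S, t-1).
    q_ad_prev: delayed acoustic likelihood Q_Ad(S, t-1) read from memory.
    big_q_l: cumulative language log likelihood Q_L(S, t).
    """
    big_q_a = big_q_a_prev + q_a             # Eq. 7
    d = q_ad_prev + q_a                      # Eq. 10
    q_ad = F(d) * d                          # Eq. 11
    big_q_a_corr = big_q_a - q_ad            # Eq. 9
    total = big_q_a_corr + ALPHA * big_q_l   # Eq. 14
    return big_q_a, q_ad, big_q_a_corr, total

# Hypothetical log likelihood values for a single frame.
Q_A, Q_Ad, Q_A_corr, Q_all = corrected_step(
    q_a=-8.0, big_q_a_prev=-100.0, q_ad_prev=-4.0, big_q_l=-10.0)
```

The returned `q_ad` plays the role of the value kept in the underestimated likelihood memory 14 for use at the next frame.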

【００３８】なお、上記数１３において、関数ｆ（ｘ）
は、上記尤度ｘに対する遅延割合を求める第１の関数で
あり、例えば、関数ｘは、ｘが増加するにつれて、概
ね、関数ｆ（ｘ）の傾斜を小さくするように変化する関
数である。また、上記数１１における関数Ｆ（Ｄ）は上
記第１の関数に関連し、尤度Ｄに対する遅延割合を求め
る第２の関数である。
In Equation 13, the function f(x) is a first function that determines the delay ratio for the likelihood x; for example, it is a function that changes such that its slope generally becomes smaller as x increases. The function F(D) in Equation 11 is a second function, related to the first function, that determines the delay ratio for the likelihood D.

【００３９】次いで、図１の連続音声認識装置の構成及
び動作について説明する。図１において、音素ＨＭＭメ
モリ１１は、単語照合部４に接続され、音素ＨＭＭを予
め記憶し、当該音素ＨＭＭは、各状態を含んで表され、
各状態はそれぞれ以下の情報を有する。（ａ）状態番号（ｂ）受理可能なコンテキストクラス（ｃ）先行状態、及び後続状態のリスト（ｄ）出力確率密度分布のパラメータ（ｅ）自己遷移確率及び後続状態への遷移確率なお、本実施例において用いる音素ＨＭＭは、各分布が
どの話者に由来するかを特定する必要があるため、所定
の話者混合ＨＭＭを変換して作成する。ここで、出力確
率密度関数は３４次元の対角共分散行列をもつ混合ガウ
ス分布である。
Next, the configuration and operation of the continuous speech recognition apparatus of FIG. 1 will be described. In FIG. 1, the phoneme HMM memory 11 is connected to the word matching unit 4 and stores a phoneme HMM in advance. The phoneme HMM is represented as a set of states, each of which has the following information: (a) a state number; (b) the acceptable context classes; (c) lists of preceding and succeeding states; (d) the parameters of the output probability density distribution; and (e) the self-transition probability and the transition probabilities to succeeding states. Because it is necessary to identify which speaker each distribution is derived from, the phoneme HMM used in this embodiment is created by converting a predetermined speaker-mixture HMM. The output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices.

【００４０】また、単語辞書メモリ１２は、単語照合部
４に接続され、単語辞書を予め記憶し、当該単語辞書
は、音素ＨＭＭメモリ１１内の音素ＨＭＭの各単語毎に
シンボルで表した読みを示すシンボル列を格納する。さ
らに、統計的言語モデルメモリ１３は、単語照合部４に
接続され、所定の統計的言語モデルを予め記憶する。こ
こで、統計的言語モデルは、例えば、従来技術文献６
「政瀧浩和ほか，“連続音声認識のための可変長連鎖統
計言語モデル”，電子通信情報学会技術報告，ＳＰ９５
−７３，１９９５年１１月」において開示されている、
時間方向の長さが可変である可変長Ｎ−ｇｒａｍと呼ば
れる言語モデルを使用することができる。当該統計的言
語モデルは、品詞クラスと単語との可変長Ｎ−ｇｒａｍ
であり、次の３種類のクラス間のバイグラムとして表現
する。（ａ）品詞クラス、（ｂ）品詞クラスから分離した単語のクラス、及び、（ｃ）連接単語が結合してできたクラス。
The word dictionary memory 12 is connected to the word matching unit 4 and stores a word dictionary in advance; for each word, the word dictionary stores a symbol string that gives its reading in terms of the symbols of the phoneme HMMs in the phoneme HMM memory 11. The statistical language model memory 13 is also connected to the word matching unit 4 and stores a predetermined statistical language model in advance. As the statistical language model, it is possible to use, for example, the language model called a variable-length N-gram, whose length in the time direction is variable, disclosed in Prior Art Document 6: Hirokazu Masataki et al., "Variable Length Statistical Language Model for Continuous Speech Recognition", IEICE Technical Report, SP95-73, November 1995. This statistical language model is a variable-length N-gram of part-of-speech classes and words, and is expressed as bigrams among the following three kinds of classes: (a) part-of-speech classes; (b) classes of words separated from part-of-speech classes; and (c) classes formed by joining consecutive words.

【００４１】図１の連続音声認識装置において、特徴抽
出部２と、単語照合部４と、尤度補正部７と、単語仮説
絞込部６と、初期ＨＭＭ生成部２１と、ＨＭＭ再構成部
２２と、再学習部２３とは、例えば、ＣＰＵを備えたデ
ジタル計算機で構成される。また、バッファメモリ３，
５と、音素ＨＭＭメモリ１１と、単語辞書メモリ１２
と、統計的言語モデルメモリ１３と、過小評価尤度メモ
リ１４と、音声データメモリ３０と、初期ＨＭＭメモリ
３１と、再構成されたＨＭＭメモリ３２とは、例えば、
ハードディスクメモリで構成される。
In the continuous speech recognition apparatus of FIG. 1, the feature extraction unit 2, the word matching unit 4, the likelihood correction unit 7, the word hypothesis narrowing unit 6, the initial HMM generating unit 21, the HMM reconstructing unit 22, and the relearning unit 23 are implemented by, for example, a digital computer having a CPU. The buffer memories 3 and 5, the phoneme HMM memory 11, the word dictionary memory 12, the statistical language model memory 13, the underestimated likelihood memory 14, the speech data memory 30, the initial HMM memory 31, and the reconstructed HMM memory 32 are implemented by, for example, hard disk memory.

【００４２】図１において、話者の発声音声はマイクロ
ホン１に入力されて音声信号に変換された後、特徴抽出
部２に入力される。特徴抽出部２は、入力された音声信
号をＡ／Ｄ変換した後、例えばＬＰＣ分析を実行し、対
数パワー、１６次ケプストラム係数、Δ対数パワー及び
１６次Δケプストラム係数を含む３４次元の特徴パラメ
ータを抽出する。抽出された特徴パラメータの時系列は
バッファメモリ３を介して単語照合部４に入力される。
In FIG. 1, a speaker's utterance is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the word matching unit 4 via the buffer memory 3.
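The layout of the 34-dimensional feature vector can be made concrete with a small Python sketch. It does not perform LPC analysis; it only assembles static and delta features that are assumed to be available already. The function name and the toy values are hypothetical, and the delta features are assumed to be first differences between adjacent frames, since the source does not specify the delta formula.

```python
def assemble_features(log_power, cepstra):
    """Build 34-dimensional feature vectors from static features.

    log_power: list of per-frame log power values.
    cepstra: list of per-frame lists of 16 cepstrum coefficients.
    Assumption: delta features are first differences between adjacent frames
    (zero for the first frame); the source does not specify the delta formula.
    """
    features = []
    for t in range(len(log_power)):
        prev = max(t - 1, 0)
        d_power = log_power[t] - log_power[prev]
        d_cep = [c - p for c, p in zip(cepstra[t], cepstra[prev])]
        # Layout: log power (1) + cepstrum (16)
        #         + delta log power (1) + delta cepstrum (16) = 34.
        features.append([log_power[t]] + list(cepstra[t]) + [d_power] + d_cep)
    return features

frames = assemble_features([1.0, 1.5], [[0.0] * 16, [0.1] * 16])  # hypothetical values
```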

【００４３】単語照合部４は、ワン−パス・ビタビ復号
化法を用いて、バッファメモリ３を介して入力される特
徴パラメータのデータに基づいて、音素ＨＭＭメモリ１
１内の音素ＨＭＭと、単語辞書メモリ１２内の単語辞書
と、統計的言語モデルメモリ１３内の統計的言語モデル
とを用いて単語仮説を検出し、音素ＨＭＭに基づいた音
響尤度と、統計的言語モデルに基づいた言語尤度とを計
算して、単語仮説とともに尤度補正部７に出力する。こ
こで、単語照合部４は、各時刻の各ＨＭＭの状態毎に、
単語内の尤度と発声開始からの音響尤度を計算する。音
響尤度及び言語尤度を含む尤度は、単語の識別番号、単
語の開始時刻、先行単語の違い毎に個別にもつ。また、
計算処理量の削減のために、音素ＨＭＭ、単語辞書及び
統計的言語モデルとに基づいて計算される総合尤度のう
ちの低い総合尤度のグリッド仮説を削減する。単語照合
部４は、その結果の単語仮説と総合尤度の情報を発声開
始時刻からの時間情報（具体的には、例えばフレーム番
号）とともに尤度補正部７に出力する。
The word matching unit 4 uses the one-pass Viterbi decoding method to detect word hypotheses from the feature parameter data input via the buffer memory 3, using the phoneme HMM in the phoneme HMM memory 11, the word dictionary in the word dictionary memory 12, and the statistical language model in the statistical language model memory 13. It calculates the acoustic likelihood based on the phoneme HMM and the language likelihood based on the statistical language model, and outputs them together with the word hypotheses to the likelihood correction unit 7. Here, for each HMM state at each time, the word matching unit 4 calculates the likelihood within the word and the acoustic likelihood from the start of the utterance. The likelihoods, including the acoustic and language likelihoods, are held separately for each combination of word identification number, word start time, and preceding word. To reduce the amount of computation, grid hypotheses with a low total likelihood, as calculated from the phoneme HMM, the word dictionary, and the statistical language model, are pruned. The word matching unit 4 outputs the resulting word hypotheses and total likelihood information to the likelihood correction unit 7, together with time information from the utterance start time (specifically, for example, frame numbers).

【００４４】これに応答して、尤度補正部７は、時刻
（ｔ−１）において、各単語仮説に対して、１時刻前の
過小評価分データである上記数７における｛Ｑ_A(Ｓ，ｔ
−１)−Ｑ_A’(Ｓ，ｔ−１)｝を計算して、過小評価尤度
メモリ１４に記憶し、次いで、時刻ｔにおいて、上記数
６と上記数７とを用いて、過小評価するように補正され
た音響尤度Ｑ_A’（Ｓ，ｔ）を計算し、次いで、上記数
８とを用いて、総合尤度Ｑ’_all（Ｓ，ｔ）を計算し、
当該計算された総合尤度Ｑ’_all（Ｓ，ｔ）を有する単
語仮説をバッファメモリ５を介して単語仮説絞込部６に
出力する。
In response, at time $(t-1)$ the likelihood correction unit 7 calculates, for each word hypothesis, $\{Q_A(S,t-1) - Q_A'(S,t-1)\}$ of Equation 13, the underestimated amount at the preceding time, and stores it in the underestimated likelihood memory 14. Then, at time t, it calculates the acoustic likelihood $Q_A'(S,t)$, corrected so as to be underestimated, using Equations 12 and 13; calculates the total likelihood $Q'_{all}(S,t)$ using Equation 14; and outputs the word hypotheses having the calculated total likelihood $Q'_{all}(S,t)$ to the word hypothesis narrowing unit 6 via the buffer memory 5.

【００４５】単語仮説絞込部６は、尤度補正部７からバ
ッファメモリ５を介して出力される総合尤度を有する単
語仮説に基づいて、終了時刻が等しく開始時刻が異なる
同一の単語の単語仮説に対して、当該単語の先頭音素環
境毎に、発声開始時刻から当該単語の終了時刻に至る計
算された総合尤度のうちの最も高い尤度を有する１つの
単語仮説で代表させるように単語仮説の絞り込みを行っ
た後、絞り込み後のすべての単語仮説の単語列のうち、
最大の総合尤度を有する仮説の単語列を認識結果として
出力する。本実施形態においては、好ましくは、処理す
べき当該単語の先頭音素環境とは、当該単語より先行す
る単語仮説の最終音素と、当該単語の単語仮説の最初の
２つの音素とを含む３つの音素並びをいう。
Based on the word hypotheses with total likelihoods output from the likelihood correction unit 7 via the buffer memory 5, the word hypothesis narrowing unit 6 narrows down the word hypotheses of the same word that have the same end time but different start times so that, for each head phoneme environment of the word, they are represented by the single word hypothesis having the highest of the total likelihoods calculated from the utterance start time to the end time of the word. After this narrowing, it outputs as the recognition result the word string of the hypothesis having the maximum total likelihood among the word strings of all remaining word hypotheses. In this embodiment, the head phoneme environment of the word to be processed preferably means a sequence of three phonemes consisting of the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word.

【００４６】例えば、図２に示すように、（ｉ−１）番
目の単語Ｗ_i-1の次に、音素列ａ₁，ａ₂，…，ａ_nからな
るｉ番目の単語Ｗ_iがくるときに、単語Ｗ_i-1の単語仮説
として６つの仮説Ｗａ，Ｗｂ，Ｗｃ，Ｗｄ，Ｗｅ，Ｗｆ
が存在している。ここで、前者３つの単語仮説Ｗａ，Ｗ
ｂ，Ｗｃの最終音素は／ｘ／であるとし、後者３つの単
語仮説Ｗｄ，Ｗｅ，Ｗｆの最終音素は／ｙ／であるとす
る。終了時刻ｔ_eと先頭音素環境が等しい仮説（図２で
は先頭音素環境が“ｘ／ａ₁／ａ₂”である上から３つの
単語仮説）のうち総合尤度が最も高い仮説（例えば、図
２において１番上の仮説）以外を削除する。なお、上か
ら４番めの仮説は先頭音素環境が違うため、すなわち、
先行する単語仮説の最終音素がｘではなくｙであるの
で、上から４番めの仮説を削除しない。すなわち、先行
する単語仮説の最終音素毎に１つのみ仮説を残す。図２
の例では、最終音素／ｘ／に対して１つの仮説を残し、
最終音素／ｙ／に対して１つの仮説を残す。
For example, as shown in FIG. 2, suppose that the i-th word $W_i$, consisting of the phoneme string $a_1, a_2, \ldots, a_n$, follows the (i-1)-th word $W_{i-1}$, and that six hypotheses Wa, Wb, Wc, Wd, We, and Wf exist as word hypotheses for $W_{i-1}$. Assume that the final phoneme of the first three word hypotheses Wa, Wb, and Wc is /x/ and that the final phoneme of the last three word hypotheses Wd, We, and Wf is /y/. Among the hypotheses with the same end time $t_e$ and the same head phoneme environment (in FIG. 2, the top three word hypotheses, whose head phoneme environment is "x/a_1/a_2"), all but the hypothesis with the highest total likelihood (for example, the topmost hypothesis in FIG. 2) are deleted. The fourth hypothesis from the top is not deleted because its head phoneme environment differs; that is, the final phoneme of its preceding word hypothesis is y, not x. In other words, only one hypothesis is kept for each final phoneme of the preceding word hypothesis. In the example of FIG. 2, one hypothesis is kept for the final phoneme /x/ and one for the final phoneme /y/.
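The narrowing rule illustrated by FIG. 2 can be sketched in Python as a group-and-keep-best operation. This is a simplified sketch, not the apparatus itself: the dictionary keys, the function name, and the likelihood values are hypothetical, and the head phoneme environment is taken to be the three-phoneme sequence of the preferred embodiment.

```python
def narrow_hypotheses(hypotheses):
    """Keep the best hypothesis per (word, end time, head phoneme environment).

    hypotheses: list of dicts with hypothetical keys 'word', 'start', 'end',
        'prev_final_phoneme', 'phonemes' (phoneme list of the word) and
        'likelihood' (total log likelihood up to the end of the word).
    """
    best = {}
    for h in hypotheses:
        # Head phoneme environment: final phoneme of the preceding word
        # hypothesis plus the first two phonemes of this word.
        env = (h["prev_final_phoneme"],) + tuple(h["phonemes"][:2])
        key = (h["word"], h["end"], env)
        if key not in best or h["likelihood"] > best[key]["likelihood"]:
            best[key] = h
    return list(best.values())

# Six hypotheses of the same word and end time, as in FIG. 2 (made-up scores).
hyps = [
    {"word": "W", "start": s, "end": 50, "prev_final_phoneme": p,
     "phonemes": ["a1", "a2", "a3"], "likelihood": l}
    for s, p, l in [(10, "x", -90.0), (12, "x", -95.0), (14, "x", -97.0),
                    (11, "y", -93.0), (13, "y", -92.0), (15, "y", -99.0)]
]
kept = narrow_hypotheses(hyps)
```

With these made-up scores, one hypothesis survives for the preceding final phoneme /x/ and one for /y/, mirroring the outcome described for FIG. 2.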

【００４７】以上の実施形態においては、当該単語の先
頭音素環境とは、当該単語より先行する単語仮説の最終
音素と、当該単語の単語仮説の最初の２つの音素とを含
む３つの音素並びとして定義されているが、本発明はこ
れに限らず、先行する単語仮説の最終音素と、最終音素
と連続する先行する単語仮説の少なくとも１つの音素と
を含む先行単語仮説の音素列と、当該単語の単語仮説の
最初の音素を含む音素列とを含む音素並びとしてもよ
い。
In the above embodiment, the head phoneme environment of a word is defined as a sequence of three phonemes consisting of the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word. However, the present invention is not limited to this; the head phoneme environment may be a phoneme sequence consisting of a phoneme string of the preceding word hypothesis, which includes its final phoneme and at least one phoneme contiguous with that final phoneme, and a phoneme string that includes the first phoneme of the word hypothesis of the word.

【００４８】[0048]

【実施例】本発明者は、図１の音声認識装置の有効性を
確認するために、以下の実験を行った。上述の方法によ
り、既学習ＨＭＭのガウス混合分布再構成を行なった。
初期ＨＭＭとしては、特許出願人が所有する旅行設定の
コーパス（テキストデータ）の男性話者１７５名による
自然発話音声を学習用音声サンプルとし、公知のＭＬ−
ＳＳＳアルゴリズムによって作成した男性話者用不特定
話者ＨＭｎｅｔを用いた。本ＨＭｎｅｔは前後環境依存
の音素ＨＭＭを状態共有ネットワークによって表現して
いる。初期ＨＭＭに用いたＨＭｎｅｔについての条件を
表１及び表２に示す。
EXAMPLES The present inventor conducted the following experiments to confirm the effectiveness of the speech recognition apparatus of FIG. 1. The Gaussian mixture distributions of an already-trained HMM were reconstructed by the method described above. As the initial HMM, a speaker-independent HMnet for male speakers was used; it was created by the known ML-SSS algorithm using, as training speech samples, spontaneous speech by 175 male speakers from a travel-domain corpus (text data) owned by the patent applicant. This HMnet represents context-dependent phoneme HMMs by means of a state-sharing network. The conditions for the HMnet used as the initial HMM are shown in Tables 1 and 2.

【００４９】[0049]

【表１】音響分析条件
─────────────────────────────
サンプリング周波数：１２ｋＨｚ
量子化：１６ビット線形
プリエンファシス：１−０．９７ｚ^-1
ウインドウ：２０ｍｓハミング
フレームシフト：１０ｍｓ
特徴ベクトル：対数パワー＋１６次ＬＰＣケプストラム係数＋Δ対数パワー＋１６次Δケプストラム係数
─────────────────────────────
[Table 1] Acoustic analysis conditions
─────────────────────────────
Sampling frequency: 12 kHz
Quantization: 16-bit linear
Pre-emphasis: 1 - 0.97 z^-1
Window: 20 ms Hamming
Frame shift: 10 ms
Feature vector: log power + 16th-order LPC cepstrum coefficients + Δ log power + 16th-order Δ cepstrum coefficients
─────────────────────────────

【００５０】[0050]

【表２】ＨＭｎｅｔの構造に関する条件
─────────────────────────────
（ａ）４０１状態の男性話者独立ＨＭｎｅｔ
　状態分割された複数の音素ＨＭＭに対する４００状態（３つのコンテキスト依存型ＨＭＭ）
　無音ＨＭＭに対する１状態
（ｂ）音響単位：日本語２５音素＋無音
（ｃ）混合サイズ：１０混合／状態
（ｄ）共分散タイプ：直交
─────────────────────────────
[Table 2] Conditions on the structure of the HMnet
─────────────────────────────
(a) Speaker-independent HMnet for male speakers with 401 states
  400 states for the state-split phoneme HMMs (three context-dependent HMMs)
  1 state for the silence HMM
(b) Acoustic units: 25 Japanese phonemes + silence
(c) Mixture size: 10 mixtures per state
(d) Covariance type: diagonal
─────────────────────────────

【００５１】このＨＭｎｅｔに対して、上記の男性話者
１７５名による自然発話音声を再び用い、コンポーネン
ト追加実行のしきい値ｒ＝０．０１にて、ガウス混合分
布の再構成を行なった。その結果、初期ＨＭＭにおける
総コンポーネント数４０００に対し、全体で２６０３回
のコンポーネント追加が行なわれ、初期ＨＭＭにおいて
一律１０であった各ガウス混合分布の混合数が、１０〜
３６の範囲で分布することとなった（図１０参照。）。
For this HMnet, the Gaussian mixture distributions were reconstructed with the threshold for component addition set to ρ = 0.01, again using the spontaneous speech of the 175 male speakers described above. As a result, a total of 2603 component additions were made to the 4000 components of the initial HMM, and the number of mixture components per Gaussian mixture distribution, which was uniformly 10 in the initial HMM, came to range from 10 to 36 (see FIG. 10).

【００５２】次に、コンポーネント追加における混合ガ
ウス分布の追加元と追加先の関係について調べてみる
と、コンポーネント追加の半数以上は、同じ中心音素を
表現する分布同士でが行なわれていることが分かった。
本実験で用いたＨＭｎｅｔは、中心音素毎の異音ＨＭＭ
を環境方向、あるいは時間方向に逐次状態分割して得ら
れたものである。従って、同じ中心音素を表現する分布
同士でのコンポーネント追加は、主に、ＨＭｎｅｔ作成
過程の逐次状態分割により、いずれかの状態で表現不能
になった音響現象を再び表現可能とする働きをしている
と考えることができる。なお、このような、コンポーネ
ント追加が、環境方向、時間方向の両方について行なわ
れていることも確認した。
Next, an examination of the relationship between the source and destination distributions of the component additions showed that more than half of the component additions took place between distributions representing the same central phoneme. The HMnet used in this experiment was obtained by successively splitting the states of an allophone HMM for each central phoneme in the contextual direction or the temporal direction. Component addition between distributions representing the same central phoneme can therefore be regarded mainly as restoring the ability to represent acoustic phenomena that had become unrepresentable in some state as a result of the successive state splitting during HMnet creation. It was also confirmed that such component additions occurred in both the contextual and temporal directions.

【００５３】また、パラメータの再推定処理において
は、ガウス混合分布再構成後のＨＭＭに対してパラメー
タの再推定を行なった。パラメータ推定にあたっては、
その基準を適切に選ぶことにより、再構成の効果を最大
限に引き出すことが期待できるが、本実験では、主に、
再構成によるガウス混合分布の構造の変化がもたらす効
果を評価することを目的とし、初期ＨＭＭの作成時と同
様の尤度最大基準を採用した。初期ＨＭＭの作成時、及
び分布再構成時と同様の男性話者１７５名による自然発
話音声を用い、バーム・ウエルチ（Ｂａｕｍ−Ｗｅｌｃ
ｈ）の学習アルゴリズムによって、以下のパラメータを
推定した。（ａ）ガウス分布の平均、（ｂ）ガウス分布の分散、
（ｃ）ガウス混合分布の混合重み、及び（ｄ）状態遷移
確率。
In the parameter re-estimation process, the parameters of the HMM after Gaussian mixture distribution reconstruction were re-estimated. Choosing the estimation criterion appropriately could be expected to bring out the full effect of the reconstruction; in this experiment, however, the main aim was to evaluate the effect of the structural change in the Gaussian mixture distributions caused by the reconstruction, so the same maximum likelihood criterion as in the creation of the initial HMM was adopted. Using the same spontaneous speech of 175 male speakers as in the creation of the initial HMM and in the distribution reconstruction, the following parameters were estimated by the Baum-Welch learning algorithm: (a) the means of the Gaussian distributions, (b) the variances of the Gaussian distributions, (c) the mixture weights of the Gaussian mixture distributions, and (d) the state transition probabilities.

【００５４】さらに、連続音声認識実験について述べ
る。ガウス混合分布再構成とその後のパラメータ再推定
によって得られた再構成ＨＭＭを用いて、連続音声認識
実験を行い、初期ＨＭＭをそのまま用いた場合（以下、
比較例という。）とその認識率を比較した。実験条件を
以下に示す。（ａ）連続音声認識器：マルチパス探索と単語グラフ出
力を特徴とする連続音声認識装置（図１参照。）。（ｂ）言語モデル：可変長単語クラスＮ−ｇｒａｍ、分離クラス数：５００。（ｃ）単語辞書：語彙数６９２２（ｄ）テストデータ：男性オープン話者７名分の旅行会
話音声、特許出願人が所有する旅行設定のコーパス（テ
キストデータ）、８１発声、延べ９３７単語。（ｅ）評価基準次式で定義される、単語グラフ中の第一位認識候補に対
する単語アキュラシーと単語％コレクト。
Next, the continuous speech recognition experiment is described. A continuous speech recognition experiment was performed using the reconstructed HMM obtained by Gaussian mixture distribution reconstruction followed by parameter re-estimation, and its recognition rate was compared with that obtained when the initial HMM was used as is (hereinafter, the comparative example). The experimental conditions were as follows. (a) Continuous speech recognizer: a continuous speech recognition apparatus featuring multi-pass search and word graph output (see FIG. 1). (b) Language model: variable-length word-class N-gram; number of separated classes: 500. (c) Word dictionary: vocabulary of 6922 words. (d) Test data: travel conversation speech by seven open male speakers, from the travel-domain corpus (text data) owned by the patent applicant; 81 utterances, 937 words in total. (e) Evaluation criteria: word accuracy and word % correct for the top recognition candidate in the word graph, defined by the following equations.

【数１５】 $\text{word accuracy} = (N - I - D - S)/N$

【数１６】 $\text{word \% correct} = (N - D - S)/N$
ここで、Ｎ：正解単語数、Ｉ：挿入誤り数、Ｄ：脱落誤り数、Ｓ：置換誤り数。
Here, N is the number of words in the correct transcription, I is the number of insertion errors, D is the number of deletion errors, and S is the number of substitution errors.
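The two evaluation measures of Equations 15 and 16 translate directly into code. The following Python sketch uses hypothetical function names and made-up error counts purely for illustration; the counts are not results from the experiment.

```python
def word_accuracy(n, insertions, deletions, substitutions):
    """Equation 15: (N - I - D - S) / N."""
    return (n - insertions - deletions - substitutions) / n

def word_percent_correct(n, deletions, substitutions):
    """Equation 16: (N - D - S) / N; insertion errors are not penalized."""
    return (n - deletions - substitutions) / n

# Made-up counts for illustration only.
acc = word_accuracy(n=100, insertions=5, deletions=3, substitutions=12)
cor = word_percent_correct(n=100, deletions=3, substitutions=12)
```

The only difference between the two measures is the insertion term, so word accuracy can never exceed word % correct.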

【００５５】連続音声認識装置のビーム幅、及び言語尤
度重みは、予備実験によって、初期ＨＭＭを用いた音声
認識において単語アキュラシーが最大になるように設定
した。初期ＨＭＭに対する最適設定から、上記で定義さ
れた、言語尤度の音響尤度に対する重み係数αのみを変
化させた際の認識結果を図１１に示す。単語アキュラシ
ー、単語％コレクトいずれについても、再学習後の再構
成ＨＭＭ（実施例）が、初期ＨＭＭ（比較例）を上回っ
ていることが分かる。
The beam width and the language likelihood weight of the continuous speech recognition apparatus were set, through preliminary experiments, so as to maximize word accuracy in speech recognition using the initial HMM. FIG. 11 shows the recognition results obtained when, starting from this optimal setting for the initial HMM, only the weighting coefficient α of the language likelihood relative to the acoustic likelihood, defined above, was varied. For both word accuracy and word % correct, the relearned reconstructed HMM (example) outperforms the initial HMM (comparative example).

[0056] In this embodiment, a method has been devised that improves the expressive power of an already-trained Gaussian-mixture speaker-independent HMM by reconstructing its Gaussian mixture distributions using speech samples, with the aim of suppressing local drops in acoustic likelihood. The method adds and shares components based on the error tendencies obtained by matching the trained HMM against speech samples. It was confirmed that this effectively suppresses local drops in acoustic likelihood and, as a result, improves the speech recognition rate.

[0057] The effects of the relearned reconstructed HMM of this embodiment according to the present invention are considered further below.
(a) Expressive power of the distributions: The expressive power of a distribution is determined by the number of mixture components of each Gaussian mixture distribution. In the successive state splitting and merging method, every Gaussian mixture distribution has the same number of mixture components, so no structure is generated that adapts to the granularity of what each Gaussian mixture distribution must represent. In the present invention, the example yields Gaussian mixture distributions with a variety of mixture counts, ranging from 5 (the mixture count of the initial model) to 36. Moreover, because the distributions are shared, the total number of Gaussian distributions is unchanged from before the method is applied, so the expressive power of the distributions can be raised while the robustness of the acoustic model is preserved.
(b) Computation time for determining the shared structure: Most of the time needed to determine the final shared structure is spent computing the likelihood of the acoustic model on the training data. In the successive state splitting and merging method of the second conventional example, the parameters of the entire HMM are re-estimated each time a merge is decided for one state, so that, with N denoting the total number of states of the model,

[Equation 17] N + N = 2 × N likelihood calculations are required. Here, the first term on the left side of Equation 17 is the likelihood calculation performed for each state split. In the embodiment of the present invention, the shared structure for all states is determined in a single batch, so determining the shared structure itself requires only two likelihood calculations. Therefore,

[Equation 18] N + 2 likelihood calculations are required. Here, the first term on the left side of Equation 18 is the likelihood calculation performed for each state split. Since N is usually set between 400 and 1000, the computation time is expected to be cut roughly in half.
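The "roughly half" estimate follows directly from comparing the counts 2 × N and N + 2. A short Python sketch makes the arithmetic explicit; the function names are hypothetical, and the sketch assumes, as the text does, that total time is dominated by the number of likelihood calculations over the training data.

```python
def likelihood_passes_conventional(n_states: int) -> int:
    # N calculations for state splitting plus N for the per-state merge decisions.
    return n_states + n_states


def likelihood_passes_proposed(n_states: int) -> int:
    # N calculations for state splitting plus 2 to fix the whole shared structure at once.
    return n_states + 2


for n in (400, 1000):
    ratio = likelihood_passes_proposed(n) / likelihood_passes_conventional(n)
    print(f"N = {n}: {likelihood_passes_proposed(n)} vs "
          f"{likelihood_passes_conventional(n)} calculations, ratio = {ratio:.4f}")
```

For N = 400 the ratio is 402 / 800 = 0.5025, and for N = 1000 it is 1002 / 2000 = 0.501, which is consistent with the stated halving.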

[0058] As described above, according to this embodiment, the HMM obtained by reconstructing the initial HMM through the addition of components and then relearning it has greater distributional expressive power than the initial HMM while retaining its robustness as an acoustic model. Speech recognition using this HMM can therefore achieve a higher recognition rate than the prior art. In addition, the computation time for determining the shared structure can be roughly halved compared with the second conventional example, so the HMM can be built faster.

【００５９】[0059]

[Effects of the Invention] As detailed above, the acoustic model generation device according to claim 1 of the present invention comprises: first generating means for generating an initial hidden Markov model by a predetermined learning algorithm based on feature parameters of predetermined speech data; second generating means for reconstructing the initial hidden Markov model generated by the first generating means, by adding components to the Gaussian mixture distributions of the hidden Markov model based on the tendency of frame errors, which are discrimination errors in frame units of a predetermined time that the initial hidden Markov model makes on the speech data, thereby generating a reconstructed hidden Markov model; and third generating means for relearning the hidden Markov model generated by the second generating means by a predetermined learning algorithm based on the feature parameters of the speech data, thereby generating an acoustic model that is the relearned hidden Markov model. Consequently, the HMM obtained by reconstructing the initial HMM through the addition of components and then relearning it has greater distributional expressive power than the initial HMM while retaining its robustness as an acoustic model. Speech recognition using this HMM can therefore achieve a higher recognition rate than the prior art. In addition, the computation time for determining the shared structure can be roughly halved compared with the second conventional example, so the HMM can be built faster.

[0060] In the acoustic model generation device according to claim 2, in the acoustic model generation device according to claim 1, the second generating means comprises first processing means and second processing means. The first processing means executes Viterbi alignment between the initial hidden Markov model and the speech data to obtain (a) a Viterbi sequence of the Gaussian mixture distributions contained in the initial hidden Markov model and (b) a maximum likelihood sequence of the Gaussian distributions that give the highest likelihood for each frame of the speech data and are contained in the initial hidden Markov model as components of the Gaussian mixture distributions. The second processing means takes the Viterbi sequence of Gaussian mixture distributions and the maximum likelihood sequence of Gaussian distributions obtained by the first processing means and, based on the frequency of occurrence of each combination of a Gaussian mixture distribution and a Gaussian distribution occurring at the same time, calculates, for every combination of a Gaussian mixture distribution and a Gaussian distribution contained in the initial hidden Markov model, the frame error probability that a frame error occurs in that Gaussian mixture distribution while the maximum likelihood Gaussian distribution at that time is the Gaussian distribution of that combination. When a calculated frame error probability exceeds a predetermined threshold, the second processing means adds that Gaussian distribution as a new component of that Gaussian mixture distribution. Each Gaussian distribution added in this way becomes a component shared by both the Gaussian mixture distribution to which it belonged as a component in the initial hidden Markov model and the Gaussian mixture distribution to which it newly belongs through the second processing means. Consequently, the HMM obtained by reconstructing the initial HMM through the addition of components and then relearning it has greater distributional expressive power than the initial HMM while retaining its robustness as an acoustic model. Speech recognition using this HMM can therefore achieve a higher recognition rate than the prior art. In addition, the computation time for determining the shared structure can be roughly halved compared with the second conventional example, so the HMM can be built faster.
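The processing attributed to the second processing means can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name `reconstruct_mixtures`, the data layout (one mixture identifier and one Gaussian identifier per frame), the toy sequences, and the threshold value are all hypothetical. In particular, a frame is treated as a frame error when the maximum likelihood Gaussian is not a component of the Viterbi-aligned mixture, and the frame error probability is normalized by the number of frames aligned to that mixture; the exact definition used in the embodiment is given earlier in the description and may differ.

```python
from collections import Counter


def reconstruct_mixtures(viterbi_seq, ml_seq, components, threshold):
    """Add frequently confused Gaussians to mixtures as shared components.

    viterbi_seq: mixture id aligned to each frame (Viterbi sequence)
    ml_seq:      id of the highest-likelihood Gaussian for each frame
    components:  dict mapping mixture id -> set of its Gaussian ids
    threshold:   minimum frame error probability for adding a component
    """
    if len(viterbi_seq) != len(ml_seq):
        raise ValueError("both sequences must cover the same frames")

    # Co-occurrence counts of (mixture, Gaussian) pairs at the same time.
    pair_counts = Counter(zip(viterbi_seq, ml_seq))
    mixture_counts = Counter(viterbi_seq)

    new_components = {m: set(gs) for m, gs in components.items()}
    for (mixture, gaussian), count in pair_counts.items():
        if gaussian in components[mixture]:
            continue  # the best Gaussian already belongs to this mixture: no frame error
        # Assumed normalization: fraction of this mixture's frames with this error.
        error_prob = count / mixture_counts[mixture]
        if error_prob > threshold:
            # The Gaussian keeps its original mixture and is now shared with this one.
            new_components[mixture].add(gaussian)
    return new_components


# Toy data (hypothetical): two mixtures with two Gaussians each, eight frames.
components = {"A": {"a1", "a2"}, "B": {"b1", "b2"}}
viterbi_seq = ["A", "A", "A", "A", "B", "B", "B", "B"]
ml_seq = ["a1", "b1", "b1", "a2", "b1", "b2", "a1", "b2"]

result = reconstruct_mixtures(viterbi_seq, ml_seq, components, threshold=0.3)
print(result)
```

In this toy example, Gaussian b1 wins in half of the frames aligned to mixture A and is therefore added to A, while a1 wins in only a quarter of the frames aligned to B and is not added. Mixture A grows from two to three components, yet the total number of distinct Gaussians remains four, which mirrors the point that sharing raises expressive power without increasing the number of Gaussian distributions.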

[0061] Furthermore, the speech recognition device according to the present invention comprises speech recognition means for performing speech recognition based on the speech signal of an input uttered speech sentence, using the acoustic model generated by the acoustic model generation device according to claim 1 or 2. Consequently, the HMM obtained by reconstructing the initial HMM through the addition of components and then relearning it has greater distributional expressive power than the initial HMM while retaining its robustness as an acoustic model. Speech recognition using this HMM can therefore achieve a higher recognition rate than the prior art. In addition, the computation time for determining the shared structure can be roughly halved compared with the second conventional example, so the HMM can be built faster.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a timing chart showing the processing of the word hypothesis narrowing unit 6 in the speech recognition device of FIG. 1.

FIG. 3 is a flowchart showing the first part of the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 4 is a flowchart showing the second part of the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 5 is a flowchart showing the third part of the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 6 is a flowchart showing the fourth part of the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 7 is a diagram showing an example of the structure of the Viterbi sequence and the maximum likelihood sequence in the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 8 is a diagram showing an example of the calculation of the frame error probability in the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 9 is a diagram showing the addition of components based on the error probability in the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 10 is a graph showing an example of the distribution of mixture counts after the HMM reconstruction processing executed by the HMM reconstruction unit 22 of FIG. 1.

FIG. 11 is a graph showing experimental results for the speech recognition device of FIG. 1, comparing speech recognition results.

[Explanation of symbols]

1: microphone; 2: feature extraction unit; 3, 5: buffer memories; 4: word matching unit; 6: word hypothesis narrowing unit; 7: likelihood correction unit; 11: phoneme HMM memory; 12: word dictionary memory; 13: statistical language model; 14: underestimated likelihood memory; 21: initial HMM generation unit; 22: HMM reconstruction unit; 23: relearning unit; 30: speech data memory; 31: initial HMM memory; 32: reconstructed HMM memory.

Claims

[Claims]

1. An acoustic model generation device comprising: first generating means for generating an initial hidden Markov model by a predetermined learning algorithm based on feature parameters of predetermined speech data; second generating means for reconstructing the initial hidden Markov model generated by the first generating means, by adding components to the Gaussian mixture distributions of the hidden Markov model based on the tendency of frame errors, which are discrimination errors in frame units of a predetermined time that the initial hidden Markov model makes on the speech data, thereby generating a reconstructed hidden Markov model; and third generating means for relearning the hidden Markov model generated by the second generating means by a predetermined learning algorithm based on the feature parameters of the speech data, thereby generating an acoustic model that is the relearned hidden Markov model.

2. The acoustic model generation device according to claim 1, wherein the second generating means comprises: first processing means for executing Viterbi alignment between the initial hidden Markov model and the speech data to obtain (a) a Viterbi sequence of the Gaussian mixture distributions contained in the initial hidden Markov model and (b) a maximum likelihood sequence of the Gaussian distributions that give the highest likelihood for each frame of the speech data and are contained in the initial hidden Markov model as components of the Gaussian mixture distributions; and second processing means for calculating, based on the frequency of occurrence of each combination of a Gaussian mixture distribution and a Gaussian distribution occurring at the same time in the Viterbi sequence of Gaussian mixture distributions and the maximum likelihood sequence of Gaussian distributions obtained by the first processing means, for every combination of a Gaussian mixture distribution and a Gaussian distribution contained in the initial hidden Markov model, the frame error probability that a frame error occurs in that Gaussian mixture distribution while the maximum likelihood Gaussian distribution at that time is the Gaussian distribution of that combination, and for adding that Gaussian distribution as a new component of that Gaussian mixture distribution when the calculated frame error probability exceeds a predetermined threshold; and wherein each Gaussian distribution added as a new component of a Gaussian mixture distribution by the second processing means becomes a component shared by both the Gaussian mixture distribution to which it belonged as a component in the initial hidden Markov model and the Gaussian mixture distribution to which it newly belongs as a component through the second processing means.

3. A speech recognition device comprising speech recognition means for performing speech recognition based on the speech signal of an input uttered speech sentence, using an acoustic model generated by the acoustic model generation device according to claim 1 or 2.