JP4394972B2

JP4394972B2 - Acoustic model generation method and apparatus for speech recognition, and recording medium recording an acoustic model generation program for speech recognition

Info

Publication number: JP4394972B2
Application number: JP2004043048A
Authority: JP
Inventors: Shinji Watanabe (渡部晋治); Atsushi Nakamura (中村篤)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-02-19
Filing date: 2004-02-19
Publication date: 2010-01-06
Anticipated expiration: 2024-02-19
Also published as: JP2005234214A

Description

The present invention relates to a method and an apparatus for generating an acoustic model for speech recognition, and to a recording medium on which a program for generating an acoustic model for speech recognition is recorded.

An outline of a speech recognition apparatus that uses an acoustic model is described first (see FIG. 1).
The speech recognition apparatus consists of three parts: a feature conversion unit that converts training speech signal data into time-series feature vectors frame by frame; an acoustic model generation unit that learns the model parameters and determines an appropriate model structure; and a recognition unit that uses the resulting acoustic model to compute a score for the time-series feature vectors of unknown input speech and outputs a recognition result that also takes into account the scores from the pronunciation dictionary, the language model, and so on.

The acoustic model generation unit is described next.
The mainstream approach to acoustic modeling today, illustrated in FIG. 2 (3 states, 3 mixture components), represents the feature time series of one phoneme with a hidden Markov model (HMM) and uses a Gaussian mixture distribution as the output distribution of each HMM state.
An HMM can express the piecewise stationary nature of speech as a combination of stationary stochastic processes and state transitions. Using a Gaussian mixture distribution makes it possible to represent statistically the fluctuations in speech that arise from various factors.
Let the set of state sequences be S = {s^0, s^1, ..., s^T}, the sequence of Gaussian mixture components be V = {v^0, v^1, ..., v^T}, and the D-dimensional time-series feature vectors be O = {O^t ∈ R^D | t = 1, ..., T}. The HMM output distribution over the complete data consisting of O, S, and V is then given for each phoneme category by equation (1).

To distinguish superscripts from powers in the text, a variable raised to a power is always enclosed in parentheses ().
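The equation for the HMM output distribution referred to above does not survive in this text. As a sketch only, the standard complete-data likelihood of an HMM with Gaussian mixture output distributions has the form below. The symbols a (state transition probability), w (mixture weight), μ and Σ (mean and covariance of each Gaussian component), and the parameter set Θ are assumed here by analogy with the rest of the description; they are not taken from the missing equation.

```latex
p(O, S, V \mid \Theta)
  = \prod_{t=1}^{T} a_{s^{t-1} s^{t}} \, w_{s^{t} v^{t}} \,
    \mathcal{N}\!\left(O^{t} \mid \mu_{s^{t} v^{t}}, \Sigma_{s^{t} v^{t}}\right)
```

Here the superscript t is a time index, not a power, in line with the convention stated above.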

Next, phoneme categories are described. Because utterances change with the preceding and following phonemes, owing to the physical constraints of the speech production mechanism, intonation, and other factors, it has become common in recent years to use triphones, which take into account the phoneme context immediately before and after, as the phoneme categories. A triphone model is more complex than a model based on phoneme categories that ignore phoneme context (monophones).
The total number of triphones is enormous, and it is difficult to prepare enough data to train every triphone adequately. For this reason, triphones with similar acoustic properties are clustered at the level of HMM states and treated as a single phoneme category, which avoids the overfitting caused by insufficient data (see FIG. 3). Deciding how to perform this state clustering and how many states to use in total is called model structure selection for the acoustic model (phoneme environment state clustering). Model structure selection also includes setting the number of mixture components per state (determination of the number of mixtures).

Because the optimal model structure depends on the training data, the model structure must be selected according to the training data. The conventional maximum likelihood method only guarantees its estimates when the amount of data is sufficiently large and provides no basis for judging the validity of a model structure selection. Model structure selection has therefore had to rely on rules of thumb. Furthermore, an acoustic model for speech recognition has as many as one million parameters in total, expressed hierarchically through HMMs, phoneme environment state clustering, and Gaussian mixture distributions, and it also contains hidden variables such as the HMM state sequence and the Gaussian mixture component sequence, so its model structure is extremely complex. Current acoustic model construction therefore has the problem that it must rely on the rules of thumb of a limited number of people (experts) who understand this complex structure. Meanwhile, practical speech recognition applications are inevitably used in environments that differ from the data used for training in advance, so the acoustic model frequently has to be rebuilt from speech data newly obtained in different environments. Rules of thumb are needed each time this happens, and the cost becomes enormous.

(Empirical rule method)
Conventional acoustic model construction based on rules of thumb is described (see FIG. 4: s11 to s18).
(s11) For the training data, the total number of clustered states (number of states) and the number of mixtures are set from experience, based on the amount of training data and the nature of the training data (time-series feature vectors). Then, (s12) state clustering such as the phoneme decision tree method or the successive state splitting method is performed until the predetermined total number of states is reached, and the state clustering structure is set (see FIG. 4). In the conventional method, an evaluation function that expresses the closeness of distributions, such as the likelihood or the distance between mean vectors, is used as the clustering criterion (see FIG. 5). However, these evaluation functions say nothing about the quality of the model structure, such as the total number of states, so the total number of states has to be set by rules of thumb derived from the training data. In addition, state clustering eliminates the hidden variables by assuming that 1) the output distribution representing a state is a single Gaussian distribution (number of mixtures 1), and 2) the assignment of training data to states is fixed. Regarding 1), clustering with Gaussian mixture distributions, which express the fluctuations in speech arising from various factors, would properly be more appropriate. With Gaussian mixture distributions, however, computing the evaluation function requires the expectation-maximization method, which repeatedly computes the posterior probabilities of the hidden variables for every frame of the training data until the evaluation function converges, and this has to be done for each combination of states, so the amount of computation is enormous.
In contrast, with a single Gaussian distribution (number of mixtures 1), the statistics of the state J = {1 + 2} that shares state j = 1 and state j = 2 (number of frames ζ₁₊₂, mean vector μ₁₊₂, diagonal covariance matrix σ₁₊₂) are obtained from the statistics of state 1 and state 2 (numbers of frames ζ₁, ζ₂, mean vectors μ₁, μ₂, diagonal covariance matrices σ₁, σ₂) by equation (2) (s12-1).
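The pooled statistics of a shared state can be computed directly from the per-state statistics. The following Python sketch assumes the usual moment-matching formulas for merging two Gaussians with diagonal covariance: frame counts add, means are frame-weighted averages, and variances are recovered from the pooled second moment. The equation itself is not reproduced in this text, so this form and the function name `merge_state_stats` are assumptions for illustration.

```python
def merge_state_stats(zeta1, mu1, var1, zeta2, mu2, var2):
    """Pool single-Gaussian statistics of two states (diagonal covariance).

    zeta*: number of frames; mu*, var*: per-dimension mean and variance lists.
    Hypothetical helper; assumes maximum likelihood (biased) variances.
    """
    zeta = zeta1 + zeta2
    # Frame-weighted average of the two mean vectors.
    mu = [(zeta1 * a + zeta2 * b) / zeta for a, b in zip(mu1, mu2)]
    # Pooled second moment minus the squared pooled mean, per dimension.
    var = [
        (zeta1 * (s1 + a * a) + zeta2 * (s2 + b * b)) / zeta - m * m
        for a, b, s1, s2, m in zip(mu1, mu2, var1, var2, mu)
    ]
    return zeta, mu, var


# State 1 holds the frames {0, 2}; state 2 holds the frames {4, 4, 10}.
zeta, mu, var = merge_state_stats(2, [1.0], [1.0], 3, [6.0], [8.0])
```

Because only counts, means, and variances are needed, no pass over the training frames is required when two states are merged.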

Here, (μ)² = {(μ_{d=1})², ..., (μ_{d=D})²}′, where ′ denotes transposition. A diagonal covariance matrix is used here to shorten the computation time, so the covariance matrix is represented by a D-dimensional vector composed of its diagonal components (that is, σ = (σ₁₁, ..., σ_DD)′, where σ_yz is the component in row y and column z of the covariance matrix). The result is extremely simple: the expectation-maximization method is not needed, the statistics of the desired cluster state are obtained analytically from the individual statistics, and the evaluation function, which is a function of those statistics, is also computed analytically (s12-2). Therefore, if the statistics of the smallest clusters (the triphone states, when triphones are the phoneme categories) are computed and stored in advance, the evaluation function can be computed quickly using equation (2) (s12-1). After that, assumptions 1) and 2) are dropped, and maximum likelihood training of the HMM with Gaussian mixture distributions is performed (s13). At this point, multiple acoustic models are created by varying the number of mixtures (s16). By changing the total number of states and repeating the above procedure (s17), multiple acoustic model structures with different numbers of states and mixtures can be created (s14). Finally, to judge the quality of each acoustic model, recognition is performed on evaluation data (s15), and the model with the best recognition rate is adopted as the acoustic model (s18). However, when the recognition rate is used as the evaluation criterion, the acoustic model becomes specialized to the evaluation data rather than to the data actually to be recognized, so this is not necessarily a good evaluation for a speech recognition system, which is premised on recognizing unknown data.
In addition, because speech recognition is a large-scale system in which the language model and other components are intricately intertwined, rules of thumb are also indispensable for producing recognition results, and doing so takes time. This procedure is called the empirical rule method.

(Two-stage method)
Acoustic model structure determination using asymptotic information criteria such as MDL (minimum description length), BIC (Bayesian information criterion), and AIC (Akaike information criterion) (asymptotic meaning that they work only when there is sufficient training data), or using a variational Bayesian evaluation function, can judge the quality of a model from the evaluation function. It therefore does not require setting the total number of states by rules of thumb or computing recognition rates to evaluate the quality of the model structure [see Patent Document 1 and Non-Patent Documents 1 and 2]. The same advantage is obtained in determining the number of mixtures by using an evaluation function [see Non-Patent Document 3]. However, [Non-Patent Documents 1 and 2] use MDL, BIC, and AIC as the evaluation function, and structure determination with these criteria does not work well when training data is scarce. Moreover, acoustic models contain hidden variables, and in that case too MDL, BIC, and AIC cannot determine the model structure accurately. The variational Bayesian evaluation function does not depend on the amount of training data, and it can accurately reflect the structure in the evaluation function even for a complexly structured acoustic model with hidden variables. However, an actual acoustic model is expressed as a combination of phoneme environment clustering and determination of the number of mixtures, and exhaustively searching for the optimal acoustic model takes a very long time even with the variational Bayesian evaluation function.

To avoid this, the following method has been proposed. (s21) First, state clustering is performed using assumptions 1) and 2) as before to eliminate the hidden variables, based on the statistics of each state computed in advance (see FIG. 7). Unlike the empirical rule method, choosing the state clustering with the highest evaluation function value allows the state clustering to be built without relying on rules of thumb. (s22) Assumptions 1) and 2) are then dropped, and maximum likelihood training of the HMM with Gaussian mixture distributions is performed. At this point, multiple acoustic models are built by varying the number of mixtures, and (s23) the acoustic model with the highest evaluation function value is taken as the optimal acoustic model. This automatic determination of the model structure in two stages is called the two-stage method [see Non-Patent Document 3]. The two-stage method needs no rules of thumb, so the acoustic model can be built by a computer, and because only one state-sharing structure has to be created, the model structure search space is reduced and the acoustic model can be built in less time than with the conventional method. However, because the search for the optimal model structure is carried out independently in the state clustering step and in the mixture number determination step, a locally optimal model structure is selected, and recognition performance is lower than that of the conventional method using rules of thumb (see FIG. 6 and Table 1).
Thus, because the conventional two-stage method is a search for a locally optimal model, it is functionally incapable of automatically constructing the optimal acoustic model.
The present invention constructs the optimal acoustic model by optimizing phoneme environment state clustering and the determination of the number of mixtures simultaneously.
[Patent Document 1] Koichi Shinoda, Japanese Patent Laid-Open No. 2002-268675, "Speech Recognition Device".
[Non-Patent Document 1] Koichi Shinoda and Takao Watanabe, "Creation of Acoustic Models by State Clustering Using Information Criteria," IEICE Technical Report, SP1996-79, pp. 9-15, 1996.
[Non-Patent Document 2] Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, and Naonori Ueda, "Selection of State-Sharing HMM Structures Based on a Bayesian Approach," IEICE Transactions D-II, Vol. 86-D-II, pp. 776-786, 2003.
[Non-Patent Document 3] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, "Bayesian acoustic modeling for spontaneous speech recognition," in Proc. SSPR2003, pp. 47-50, 2003.

Acoustic models are constructed by one of two different methods. In the first method, an expert who is thoroughly familiar with acoustic model structure builds the model based on experience; in the second method, the model is built automatically.
The first method has the problem of being very costly, because it relies on human intervention or rules of thumb. Practical speech recognition applications are inevitably used in environments that differ from the data used for training in advance, so the acoustic model frequently has to be rebuilt from speech data newly obtained in different environments. Rules of thumb are needed each time this happens, and the cost becomes enormous.
The second method has the problem of requiring a large amount of computation, because determining the acoustic model structure is computationally very expensive. There are two causes: the huge search space that must be explored to find the optimal model structure, and the large amount of computation needed to evaluate the evaluation function for each individual structure.

The search space becomes enormous because model structure determination is achieved by simultaneously optimizing two things: phoneme environment state clustering and the selection of the number of mixture components of the Gaussian mixture distributions. Computing the evaluation function takes a great deal of time because, for every frame of the training data, the posterior probabilities of the hidden variables of the Gaussian mixture distribution must be computed until convergence. The two-stage method, proposed to reduce the evaluation function computation time, selects a locally optimal model structure, so although computation time is reduced, performance deteriorates.
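The difference between a two-stage search and a joint search can be illustrated with a toy example. In the Python sketch below, the score table is invented purely for illustration (it is not data from the patent), and the function names are hypothetical. The two-stage search fixes the number of states using single-Gaussian scores and only then picks the number of mixtures, whereas the joint search examines every combination.

```python
def two_stage_search(score, state_counts, mixture_counts):
    """Pick the state count with one mixture component, then the mixture count."""
    best_states = max(state_counts, key=lambda n: score[(n, 1)])
    best_mix = max(mixture_counts, key=lambda k: score[(best_states, k)])
    return best_states, best_mix


def joint_search(score, state_counts, mixture_counts):
    """Examine every (state count, mixture count) pair."""
    pairs = ((n, k) for n in state_counts for k in mixture_counts)
    return max(pairs, key=lambda nk: score[nk])


# Invented evaluation function values, keyed by (total states, mixtures per state).
states = [1000, 2000, 4000]
mixtures = [1, 4, 16]
toy_score = {
    (1000, 1): -50, (1000, 4): -30, (1000, 16): -20,
    (2000, 1): -45, (2000, 4): -28, (2000, 16): -25,
    (4000, 1): -40, (4000, 4): -32, (4000, 16): -38,
}

local_choice = two_stage_search(toy_score, states, mixtures)
global_choice = joint_search(toy_score, states, mixtures)
```

In this toy table the two-stage search settles on a structure with many states and few mixtures, while the joint search finds a better-scoring structure with fewer states and more mixtures, at the cost of evaluating every combination.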

The present invention belongs to the second method. In automatically determining the acoustic model structure by a Bayesian criterion, it reduces the model structure search space by performing phoneme environment state clustering with Gaussian mixture distributions. In computing the model structure evaluation function used in the Bayesian criterion, it reduces the amount of computation by approximating the statistics of the Gaussian mixture distribution using prior knowledge and computing the evaluation function approximately from precomputed statistics alone.

The invention achieves automatic construction of the optimal acoustic model in about the same computation time as the two-stage method, while maintaining performance equivalent to that of an acoustic model built with rules of thumb.
Table 1 compares the computation time and recognition performance of the method using rules of thumb (empirical rule method), the two-stage method, and the acoustic model creation method introduced in the present invention (phoneme environment state clustering using Gaussian mixture distributions, combined with the evaluation function approximation described at the end of the embodiment). The method of the invention automatically builds, without rules of thumb, an optimal acoustic model whose recognition performance is about the same as that of the empirical rule method and whose computation time is shorter. Compared with the two-stage method, the conventional automatic construction method, its computation time is about the same, and because it was able to build the optimal acoustic model, its recognition performance exceeded that of the two-stage method. The present invention thus makes it possible to construct the optimal acoustic model structure automatically in a practical computation time.

The present invention is described with reference to FIGS. 8 to 10.
As a means of building an acoustic model that optimizes phoneme environment state clustering and the number of mixtures simultaneously, the construction of a decision tree using Gaussian mixture distributions is proposed. The decision tree is built with the phoneme decision tree method, which clusters efficiently by merging and splitting nodes using phonetic questions. The same discussion applies to the successive state splitting method, which also considers structure determination in the temporal (state sequence) direction. Among the multiple acoustic models obtained by these clusterings with different numbers of mixtures per state, the model with the highest Bayesian criterion evaluation function value is taken as the optimal model structure (see FIG. 8 (s1)).

Because this method selects the state clustering with the highest evaluation function value, the state clustering can be built without rules of thumb, and the model structure search space is reduced as in the two-stage method (see FIG. 9 (s1-1)). This search method exploits the unimodality of the acoustic model structure for speech recognition, which guarantees optimality. Moreover, because Gaussian mixture distributions are used during state clustering, more accurate acoustic model parameters (under the Bayesian criterion, these are the posterior distribution parameters for Θ) and a more accurate acoustic model structure can be obtained than with the two-stage method, which performs state clustering with single Gaussian distributions. In this case, however, the evaluation function computed at each clustering step involves the hidden variables of the Gaussian mixture model. As with the maximum likelihood method, the posterior probabilities of the hidden variables must be computed repeatedly for every frame of the training data until the evaluation function converges (the variational Bayesian expectation-maximization method), so obtaining the final state clustering structure with Gaussian mixture distributions takes an enormous amount of computation time.

Therefore, to shorten the computation time, a technique is proposed in which the statistics of the Gaussian mixture distribution are derived approximately from the sufficient statistics of each state, and the Bayesian criterion evaluation function is derived approximately without the variational Bayesian expectation-maximization method (see FIG. 10 (s1-2)).
First, it is assumed that the assignment of training data is fixed not only for each state but also for each mixture component. The contribution of the hidden variables to the Bayesian criterion evaluation function can then be approximated by −Σ_k w_jk log w_jk, so the evaluation function of state j can be expressed approximately as follows.

The per-component statistics ζ_jk, w_jk, μ_jk, and σ_jk can be obtained by Viterbi alignment, k-means clustering, or the like. Among the prior distribution parameters, φ_jk⁰, ξ_jk⁰, η_jk⁰, σ_jk⁰, and R_jk⁰ can be set from, for example, the per-component statistics (number of frames ζ, mean μ, variance σ) obtained when monophone HMM states or phonemes are represented by mixture distributions. Another approach is to learn the prior distribution parameters so that the Bayesian criterion evaluation function F^m is maximized. However, because training an acoustic model for speech recognition takes an enormous amount of time, it is more realistic to assign fixed values to φ_jk⁰, ξ_jk⁰, and η_jk⁰ in particular, and to set only ν_jk⁰ and R_jk⁰ from the mixture distribution statistics of the monophone HMM states. The discussion below therefore uses prior distribution parameters φ⁰, ξ⁰, and η⁰ that are uniform across all states and all mixture components.
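The hidden-variable term −Σ_k w_jk log w_jk mentioned above depends only on the mixture weights, so it can be evaluated from stored per-component frame counts. The Python sketch below assumes that w_jk is the fraction of the frames of state j assigned to component k (for example after a Viterbi alignment or k-means step); this definition and the function names are assumptions, and the sketch covers only this one term, not the full evaluation function.

```python
import math


def mixture_weights(component_frames):
    """Assumed weights w_jk: share of state j's frames held by each component."""
    total = sum(component_frames)
    return [z / total for z in component_frames]


def hidden_variable_term(weights):
    """Approximate hidden-variable contribution, -sum_k w_jk * log(w_jk)."""
    # Components with zero weight contribute nothing (limit of w * log w as w -> 0).
    return -sum(w * math.log(w) for w in weights if w > 0.0)


weights = mixture_weights([50, 30, 20])      # fixed frame counts per component
term = hidden_variable_term(weights)
uniform_term = hidden_variable_term([0.25] * 4)
```

With the assignment fixed, this term is computed once per candidate state from counts alone, with no iteration over frames.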

Next, another possible technique assumes that the per-component statistics μ_jk and σ_jk are uniform within a state, takes ν_jk⁰ and σ_jk⁰ instead from the mixture distribution statistics of monophone HMM states trained in advance, and makes the number of frames proportional to the mixture weight coefficients of the monophone HMM state (that is, μ_jk = μ_j, σ_jk = σ_j, ζ_jk = ζ_j ζ_k / Σ_k ζ_k, ν_jk⁰ = μ_k, R_jk⁰ = η̃_jk σ_k). The posterior distribution parameters in this case are expressed as follows.

Alternatively, by assuming that the number of frames and the posterior distribution of the variance are identical across mixture components (that is, ζ_jk = ζ_j / L, where L is the number of mixtures, set to between 10 and 30 according to the amount of training data), the posterior distribution parameters can be obtained without the prior mixture distribution statistics that the above technique requires.
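The two ways of approximating the per-component frame counts described above can be sketched in Python as follows. The formulas ζ_jk = ζ_j ζ_k / Σ_k ζ_k and ζ_jk = ζ_j / L are taken from the text; the function names and the example numbers are hypothetical.

```python
def frames_proportional(zeta_j, monophone_frames):
    """zeta_jk = zeta_j * zeta_k / sum_k zeta_k, using monophone component counts."""
    total = sum(monophone_frames)
    return [zeta_j * zeta_k / total for zeta_k in monophone_frames]


def frames_uniform(zeta_j, num_mixtures):
    """zeta_jk = zeta_j / L, the same share for each of the L components."""
    return [zeta_j / num_mixtures] * num_mixtures


# A state with 200 frames, split according to a 3-component monophone mixture.
proportional = frames_proportional(200, [10, 30, 60])
# The same state split evenly over L = 10 components.
uniform = frames_uniform(200, 10)
```

Both variants preserve the total frame count of the state; the uniform variant needs no monophone mixture statistics at all.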

With the above approximations, the evaluation function for Gaussian mixture distributions is composed only of statistics computed in advance, so it can be computed easily without computing the posterior probabilities of the hidden variables for every frame.
Thus, as the difference between FIG. 8 and FIG. 6 shows, by performing state clustering with Gaussian mixture distributions the present invention makes possible the automatic construction of the optimal acoustic model, which was functionally impossible with the conventional method (two-stage method). By using equations (5) and (7) to obtain the Gaussian mixture statistics approximately and computing the Bayesian evaluation function from them, state clustering with Gaussian mixture distributions can be carried out in a practical computation time.
Next, the evaluation function is computed repeatedly while varying the number of mixtures (s2), and the acoustic model with the highest evaluation function value is selected (s3).
In FIG. 8, the model creation unit, which is part of the acoustic model construction unit, performs the processing of (s1) and (s2), and the acoustic model selection unit performs the processing of (s3).
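The outer loop formed by steps (s1) to (s3) can be sketched as below. The callable `build_and_score` is a hypothetical stand-in for clustering with a given number of mixtures and returning the approximate Bayesian evaluation function value; the toy scoring function in the example is invented and is not the patent's evaluation function.

```python
def select_acoustic_model(mixture_counts, build_and_score):
    """Return (best mixture count, best score) over the candidate mixture counts."""
    # (s1), (s2): build a clustered model and score it for each mixture count.
    scored = [(build_and_score(k), k) for k in mixture_counts]
    # (s3): keep the candidate with the highest evaluation function value.
    best_score, best_k = max(scored)
    return best_k, best_score


# Invented stand-in score with a single peak at 8 mixtures per state.
best = select_acoustic_model([1, 2, 4, 8, 16], lambda k: -(k - 8) ** 2)
```

Because each candidate is scored from precomputed statistics, the loop costs one clustering pass per mixture count rather than one iterative training run per candidate.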

The acoustic model generation apparatus of the present invention can be implemented on a computer. In that case, an acoustic model generation program that causes the computer to execute each step of the method shown in the figures is loaded into the computer from a recording medium such as a CD-ROM or a magnetic disk device, or downloaded over a communication line, and the computer executes the program.

FIG. 1 is a diagram showing the schematic configuration of a speech recognition apparatus.
FIG. 2 is a diagram explaining an acoustic model that represents one phoneme.
FIG. 3 is a diagram explaining the clustering of triphone HMM states whose central phoneme is /a/.
FIG. 4 is a diagram showing the procedure for building an acoustic model by rules of thumb.
FIG. 5 is a diagram showing the procedure for phoneme environment state clustering with one mixture component, based on rules of thumb.
FIG. 6 is a diagram showing the procedure for automatic acoustic model construction by computer (two-stage method).
FIG. 7 is a diagram showing the procedure for phoneme environment state clustering with one mixture component by computer (s21).
FIG. 8 is a diagram showing the procedure for automatic acoustic model construction by computer (clustering of phoneme environment states using Gaussian mixture distributions).
FIG. 9 is a diagram showing the procedure for phoneme environment state clustering using Gaussian mixture distributions (variational Bayesian expectation-maximization method) (s1-1).
FIG. 10 is a diagram showing the procedure for phoneme environment state clustering using Gaussian mixture distributions (approximation using mixture statistics) (s1-2).

Claims

An acoustic model generation apparatus for speech recognition, comprising:
a feature amount conversion unit that converts a training speech signal into time-series feature amounts;
a model creation unit that performs clustering using the phoneme decision tree method based on the time-series feature amounts, performs phoneme environment state clustering using a Gaussian mixture distribution during this clustering, generates acoustic models obtained by the clustering that have different numbers of mixture components per state, and calculates a Bayesian criterion evaluation function; and
an acoustic model selection unit that selects the acoustic model structure having the maximum Bayesian criterion evaluation function value,
wherein the calculation of the model structure evaluation function based on the Bayesian criterion approximates the statistics of the Gaussian mixture distribution using prior knowledge, so that the evaluation function is obtained without repeatedly calculating the posterior probability values of the hidden variables for each frame of the training data until the evaluation function converges.

The acoustic model generation apparatus for speech recognition according to claim 1,
wherein a successive state splitting method is used for the phoneme environment state clustering.

A method for generating an acoustic model for speech recognition, comprising:
a step in which a feature amount conversion unit converts a training speech signal into time-series feature amounts;
a step in which a model creation unit performs clustering using the phoneme decision tree method based on the time-series feature amounts, performs phoneme environment state clustering using a Gaussian mixture distribution during this clustering, generates acoustic models obtained by the clustering that have different numbers of mixture components per state, and calculates a Bayesian criterion evaluation function; and
a step in which an acoustic model selection unit selects the acoustic model structure having the maximum Bayesian criterion evaluation function value,
wherein the calculation of the model structure evaluation function based on the Bayesian criterion approximates the statistics of the Gaussian mixture distribution using prior knowledge, so that the evaluation function is obtained without repeatedly calculating the posterior probability values of the hidden variables for each frame of the training data until the evaluation function converges.

The method for generating an acoustic model for speech recognition according to claim 3,
wherein a successive state splitting method is used for the phoneme environment state clustering.

A recording medium on which is recorded an acoustic model generation program for speech recognition that causes a computer to execute:
a process of converting a training speech signal into time-series feature amounts;
a process of performing clustering using the phoneme decision tree method based on the time-series feature amounts, performing phoneme environment state clustering using a Gaussian mixture distribution during this clustering, generating acoustic models obtained by the clustering that have different numbers of mixture components per state, and calculating a Bayesian criterion evaluation function; and
a process of selecting the acoustic model structure having the maximum Bayesian criterion evaluation function value,
wherein the calculation of the model structure evaluation function based on the Bayesian criterion approximates the statistics of the Gaussian mixture distribution using prior knowledge, so that the evaluation function is obtained without repeatedly calculating the posterior probability values of the hidden variables for each frame of the training data until the evaluation function converges.

The recording medium on which the acoustic model generation program for speech recognition according to claim 5 is recorded,
wherein the program uses a successive state splitting method for the phoneme environment state clustering.