JPH0997096A - Sound model producing method for speech recognition

Info

Publication number: JPH0997096A
Application number: JP7274693A
Authority: JP
Inventors: Junichi Takami; 淳一鷹見
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 1995-09-28
Filing date: 1995-09-28
Publication date: 1997-04-08

Abstract

PROBLEM TO BE SOLVED: To speed up the learning process by forming the initial model and the models to be merged or divided as single Gaussian distribution models, and by creating a two-mixture Gaussian distribution only temporarily during the dividing process. SOLUTION: A small model with few signal sources is prepared as the initial model, and a dividing process and a merging process are then repeated on the signal sources. The total likelihood calculated during learning is assigned to P(1), which denotes the total likelihood when the model has only one signal source, and a model is formed such as one with M = 4 signal sources in which no signal source is shared between states. To decide the similarity between signal sources, the size of the output probability distribution of the signal source that the merging process would produce is used as the evaluation measure. On the basis of the single Gaussian distribution model, the two single Gaussian distributions of the two signal sources to be merged are merged into one single Gaussian distribution, and the single Gaussian distribution of the signal source to be divided is re-formed into a two-mixture Gaussian distribution, which is then divided into two single Gaussian distributions.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a method for generating an acoustic model for speech recognition. More particularly, for speech recognition using a hidden Markov model (HMM), it relates to an acoustic model generation method that optimally determines the unit of each model, the structure of the state network, the sharing structure of signal sources among multiple states, and the parameters of the signal sources, so that the widest possible range of speech phenomena is modeled with the minimum necessary model parameters.

[0002]

[Prior Art] To achieve highly accurate and robust speech recognition with HMMs, an important issue is how to reconcile model detail with robustness. Making the model more detailed requires choosing phoneme context categories appropriately so that they cover the entire speech space. Estimating a robust model from a limited number of training speech samples requires a mechanism that reduces redundancy in the model parameters and efficiently represents only the essential information of speech with the minimum necessary parameters.

[0003] To meet this need, the Successive State Splitting (SSS) method was developed, which generates an appropriate model by state division alone. However, because it relies only on successive two-way division of states, the state network structures it can reach are limited, and redundancy in the model parameters cannot be completely removed.

[0004] Accordingly, in Japanese Patent Application No. 6-284135, the present inventor provided a method for generating an acoustic model for speech recognition that overcomes the drawback of SSS, which generates a model only by successive two-way division of states. That method implements both a division process and a fusion process for signal sources and proceeds by selecting one of the two at each step. Without losing the advantages of SSS, it makes arbitrary state network structures achievable and yields an efficient representation that expresses the widest range of speech phenomena accurately and robustly with the minimum necessary model parameters.

[0005]

[Problems to be Solved by the Invention] However, the method shown in the embodiment of Japanese Patent Application No. 6-284135 is based on a Gaussian mixture model with two mixture components. Its signal source fusion process fuses the two two-mixture Gaussian distributions held by a pair of fusion target signal sources into one two-mixture Gaussian distribution. Its signal source division process divides the one two-mixture Gaussian distribution held by the division target signal source into two single Gaussian distributions and then re-forms each of them into a two-mixture Gaussian distribution.

[0006] In general, however, training a Gaussian mixture model is known to take much more time than training a single Gaussian model, and the method described in Japanese Patent Application No. 6-284135 likewise required a great deal of time.

[0007] Accordingly, an object of the present invention is to provide a method for generating an acoustic model for speech recognition that implements both division and fusion of signal sources and proceeds by selecting one of them at each step, so that, without losing the advantages of SSS, arbitrary state network structures can be realized and the widest range of speech phenomena can be expressed accurately and robustly with the minimum necessary model parameters, and further to enable fast training based on a single Gaussian distribution model.

[0008]

[Means for Solving the Problems] To achieve the above object, the initial model and the models subjected to fusion or division processing are formed as single Gaussian distribution models, and a two-mixture Gaussian distribution is created only temporarily when division processing is performed, thereby speeding up the training process.

[0009] In the present invention, fusion and division of signal sources are performed under the criterion of maximizing an evaluation value over all training samples. As a result, the number of signal sources rises and falls locally but gradually increases overall.

[0010] As a result, the model is refined step by step. Ultimately, an acoustic model in which the unit of each model, the structure of the state network, the sharing structure of signal sources among multiple states, and the parameters of the output probability distributions are all optimally determined under a common evaluation criterion can be generated automatically and faster than with the conventional method.

[0011]

[Embodiments of the Invention] FIG. 1 is a flowchart outlining the method of the present invention for generating an acoustic model for speech recognition. The present invention operates on a probabilistic model that expresses the shape of the speech feature pattern within a very short unit of time (the static feature of speech) and its change over time (the dynamic feature of speech) as a chain of signal sources. By repeatedly fusing or dividing individual output probability distributions according to a common evaluation criterion (likelihood maximization), it determines the model units, the structure of the state network, the sharing structure of signal sources among multiple states, and the parameters of the output probability distributions simultaneously and automatically.

[0012] A more specific description follows with reference to FIG. 1. First, a small model (total number of signal sources used in the whole model M = 1) is prepared as the initial model (step 1). For example, this model has one state (a concept in the model structure that is associated with a specific phoneme context category) and one signal source (the smallest component of the model, consisting of an output probability distribution expressed as a single Gaussian distribution and state transition probabilities). In the subsequent processing, division and fusion are applied repeatedly to this signal source. The total likelihood computed during training is assigned to P(1), which denotes the total likelihood when the number of signal sources is 1, and a model such as the one illustrated in step 2 (a model with M = 4 signal sources whose states do not share signal sources) is formed.

[0013] The model formed while this method runs is called a hidden Markov network (HMnet) and can be represented as a network of multiple states. An HMnet consists of the following information.

[0014]
(1) Components of an HMnet:
・A set of signal sources.
・A set of states.
(2) Components of a signal source:
・Signal source number (index).
・Output probability distribution (a single Gaussian distribution with a diagonal covariance matrix).
・Self-loop probability and transition probability to the next state.
(3) Components of a state:
・State number (index).
・Pointer to a signal source (signal source number).
・Acceptable phoneme environment category (defined as a direct product space of phoneme environment factors).
・Lists of preceding and succeeding states.
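The component list above maps naturally onto a few record types. The following is a minimal sketch in Python, not the patent's implementation: the class and field names, the factor name used in the usage note, and the placeholder parameter values (zero mean, unit variance, 0.5 transition probabilities) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SignalSource:
    index: int               # signal source number
    mean: list[float]        # per-dimension mean of the single Gaussian
    var: list[float]         # diagonal of the covariance matrix
    a_self: float            # self-loop probability
    a_next: float            # transition probability to the next state


@dataclass
class State:
    index: int                              # state number
    source: int                             # pointer (index) to a SignalSource
    context: dict[str, frozenset[str]]      # factor name -> accepted elements;
                                            # the category is their direct product
    preceding: list[int] = field(default_factory=list)
    succeeding: list[int] = field(default_factory=list)


@dataclass
class HMnet:
    sources: dict[int, SignalSource] = field(default_factory=dict)
    states: dict[int, State] = field(default_factory=dict)


def initial_model(dim: int, context: dict[str, frozenset[str]]) -> HMnet:
    """Build a step-1 style model: one state and one signal source (M = 1).

    The parameter values are placeholders that training would overwrite.
    """
    src = SignalSource(index=0, mean=[0.0] * dim, var=[1.0] * dim,
                       a_self=0.5, a_next=0.5)
    state = State(index=0, source=0, context=context)
    return HMnet(sources={0: src}, states={0: state})
```

For example, `initial_model(3, {"center": frozenset({"a", "i"})})` yields a network whose single state points at signal source 0. Because states hold only a source index, several states can share one signal source without duplicating its parameters.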

[0015] In selecting the signal sources to be fused (step 3), the size of the output probability distribution of the signal source that the fusion process would produce is used as the evaluation measure for judging the similarity between signal sources. That is, for every pair of signal sources Q(i) and Q(j), the size Dij of the distribution obtained by fusing their output probability distributions (both single distributions) is computed approximately by equation (1).

[0016]

[Equation 1]
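Equation (1) is not reproduced in this text, so the pair search of step 3 can only be sketched with the size measure left as a parameter. In the Python sketch below, `select_fusion_pair` is a hypothetical helper that scans every pair of signal sources, and `example_size` is a stand-in measure (the total variance of an equal-weight fusion), which is an assumption and not the patent's formula.

```python
from itertools import combinations


def select_fusion_pair(sources, fused_size):
    """Return the index pair (i, j), i < j, whose fusion has the smallest size.

    sources:    dict mapping a source index to (mean, var) lists.
    fused_size: callable taking two (mean, var) tuples and returning the
                size of the distribution their fusion would produce.
    Requires at least two sources.
    """
    return min(
        combinations(sorted(sources), 2),
        key=lambda ij: fused_size(sources[ij[0]], sources[ij[1]]),
    )


def example_size(a, b):
    """Stand-in for equation (1): total variance of an equal-weight fusion."""
    (mean_a, var_a), (mean_b, var_b) = a, b
    return sum(
        0.5 * (va + vb) + 0.25 * (ma - mb) ** 2
        for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b)
    )
```

With three one-dimensional sources centred at 0.0, 0.1 and 5.0, the two nearby sources are chosen, since fusing them produces the most compact distribution.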

[0017] The two signal sources Q(i) and Q(j) for which Dij is smallest are selected as the targets of the fusion process. Fusion of signal sources (step 4) is carried out by fusing the two signal sources Q(i) and Q(j) to create a new signal source Q(I). The mean μIk and variance σIk² of the output probability distribution of Q(I) are computed by equations (4) and (5), respectively.

[0018]

[Equation 2]
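Equations (4) and (5) are not reproduced in this text. One common way to fuse two diagonal Gaussians into one is moment matching weighted by the number of frames each source accounts for; the Python sketch below assumes that technique and may differ from the patent's actual formulas. The function name and the frame counts `n_i` and `n_j` are hypothetical.

```python
def fuse_gaussians(mean_i, var_i, n_i, mean_j, var_j, n_j):
    """Fuse two diagonal Gaussians by count-weighted moment matching.

    Returns (mean, var) of the single Gaussian that has the same first and
    second moments as the weighted pair. This is an assumed stand-in for
    equations (4) and (5).
    """
    w_i = n_i / (n_i + n_j)
    w_j = 1.0 - w_i
    mean = [w_i * a + w_j * b for a, b in zip(mean_i, mean_j)]
    var = [
        # within-source variance plus the spread of each mean around the new mean
        w_i * (va + (a - m) ** 2) + w_j * (vb + (b - m) ** 2)
        for a, va, b, vb, m in zip(mean_i, var_i, mean_j, var_j, mean)
    ]
    return mean, var
```

Under this assumption, fusing two equally weighted unit-variance Gaussians centred at 0 and 2 gives mean 1 and variance 2: the fused variance grows with the distance between the original means, which is why a small fused distribution indicates similar sources.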

[0019] For the self-transition probability aI^self of Q(I) and its transition probability aI^next to the subsequent state, the values given by equations (6) and (7), respectively, are used.

[0020]

[Equation 3]

[0021] The Q(I) obtained by this process is shared by all states to which Q(i^) or Q(j^) was assigned before the fusion. To do this, for every state whose signal source pointer has the value i^ or j^, that value is replaced with I. This process temporarily brings the number of signal sources in the whole model to M−1.
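The pointer replacement described in this paragraph can be sketched as follows. This is a minimal Python illustration; the dictionary layout (state number mapped to signal source number) and the function name are assumptions.

```python
def relink_sources(state_to_source, i_hat, j_hat, new_index):
    """Point every state that used source i_hat or j_hat at the fused source.

    state_to_source: dict mapping a state number to its signal source number.
    Returns a new dict; the input is left unchanged so the pre-fusion model
    can still be used if the fusion is later rejected.
    """
    return {
        state: (new_index if source in (i_hat, j_hat) else source)
        for state, source in state_to_source.items()
    }
```

Returning a new mapping rather than editing in place is a design choice for this sketch: the method may discard the fusion result and fall back to the model as it was before fusion.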

[0022] At this point, it is decided whether to adopt the model obtained from the fusion process. The fusion result is adopted only if the total likelihood obtained from the model after fusion (denoted P'(M-1)) exceeds the likelihood P(M-1) that was already computed, earlier in the process, when the total number of distributions was M−1. In that case, M is changed to M−1 and processing proceeds to retraining of the model (to step 9). If the fusion result is not adopted, processing enters the division phase, applied to the model as it was before the fusion (the model of step 2) (to step 5).
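The adoption test compares the fused model against the best result previously recorded for the same number of signal sources. A minimal Python sketch, assuming the recorded likelihoods are kept in a dictionary keyed by M (the function name, the dictionary, and the returned labels are hypothetical):

```python
def decide_after_fusion(p_fused, p_history, m):
    """Decide the next step after a trial fusion.

    p_fused:   total likelihood P'(M-1) of the model after fusion.
    p_history: dict mapping a source count to the total likelihood P(.)
               recorded when the model last had that many sources.
    m:         number of signal sources before the fusion.
    Returns ("retrain", m - 1) when the fusion is adopted (go to step 9),
    otherwise ("divide", m) to continue from the pre-fusion model (step 5).
    """
    if p_fused > p_history[m - 1]:
        return "retrain", m - 1
    return "divide", m
```

The comparison is strict, following the text's wording that the fused model's likelihood must exceed P(M-1).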

[0023] Before the actual division, the signal source to be divided is selected (step 5). For every signal source Q(i), its size di is computed by equation (8), and the signal source with the largest di (denoted Q(i^)) is selected as the division target.

[0024]

[Equation 4]

[0025] Next, Q(i^) is divided into two signal sources, Q(I) and Q(J). To do this, the Viterbi algorithm is first applied to all training samples that use Q(i^) in the likelihood computation, and the state path of each sample is obtained.

[0026] Next, based on the state paths obtained, all frames of the training samples that are associated with Q(i^) are extracted. The data of all the extracted frames are then divided into two groups by vector quantization, and the mean and variance are computed for each group. Finally, the distribution of each group is assigned as the output probability distribution of one of the two new signal sources, and the self-transition probability of Q(i^) and its transition probability to the subsequent state are copied unchanged to both. M is then changed to M+1.
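The division step can be sketched once the frames aligned to Q(i^) have been extracted. The text says only "vector quantization"; the Python sketch below assumes a two-centroid k-means with an LBG-style start (the global mean perturbed in both directions), which is one common way to build a two-entry codebook. All function names and the dictionary layout are hypothetical, and the Viterbi alignment that produces `frames` is outside the sketch.

```python
def mean_var(frames):
    """Per-dimension mean and (biased) variance of a list of frame vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    var = [sum((f[k] - mean[k]) ** 2 for f in frames) / n for k in range(dim)]
    return mean, var


def split_frames(frames, max_iters=20):
    """Split frames into two groups; return [(mean, var), (mean, var)]."""
    mean, var = mean_var(frames)
    delta = [0.01 * v ** 0.5 for v in var]
    cents = [[m - d for m, d in zip(mean, delta)],
             [m + d for m, d in zip(mean, delta)]]

    def assign(centroids):
        groups = ([], [])
        for f in frames:
            d0 = sum((x - c) ** 2 for x, c in zip(f, centroids[0]))
            d1 = sum((x - c) ** 2 for x, c in zip(f, centroids[1]))
            groups[0 if d0 <= d1 else 1].append(f)
        if not groups[0] or not groups[1]:
            raise ValueError("frames could not be separated into two groups")
        return groups

    groups = assign(cents)
    for _ in range(max_iters):
        new_cents = [mean_var(g)[0] for g in groups]
        if new_cents == cents:        # assignments have stopped changing
            break
        cents = new_cents
        groups = assign(cents)
    return [mean_var(g) for g in groups]


def divide_source(frames, a_self, a_next):
    """Create two sources from one; both copy the transition probabilities."""
    (m_i, v_i), (m_j, v_j) = split_frames(frames)
    return ({"mean": m_i, "var": v_i, "a_self": a_self, "a_next": a_next},
            {"mean": m_j, "var": v_j, "a_self": a_self, "a_next": a_next})
```

The two groups together play the role of the temporary two-mixture distribution: it exists only long enough to produce two single Gaussians, so no mixture model has to be trained.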

[0027] This completes the division of the signal source. When a signal source is divided, the states must be reconfigured at the same time. State reconfiguration is carried out by adopting whichever gives the largest value among the following: the maximum likelihood PD achieved by rearranging only the sharing structure of the signal sources, the maximum likelihood PC achieved by dividing one state in the phoneme environment direction, and the maximum likelihood PT achieved by dividing one state in the time direction.

[0028] Rearranging only the sharing structure of the signal sources (step 6) is needed only when the division target signal source Q(i^) is shared by multiple states. In that case, all subsequent state division processing is carried out on the model obtained from this rearrangement. If Q(i^) is used by only one state, this step is skipped, and processing moves on with PD set to negative infinity (−∞).
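The choice among the three reconfiguration options, including the rule that PD becomes −∞ when Q(i^) is used by only one state, can be sketched as below. This is a minimal Python illustration that covers only the comparison of PD, PC and PT; the function name, argument names and option labels are hypothetical.

```python
import math


def choose_reconfiguration(n_sharing_states, p_d, p_c, p_t):
    """Pick the reconfiguration with the largest likelihood.

    n_sharing_states: number of states that point at the divided source.
    p_d, p_c, p_t:    likelihoods for sharing rearrangement, division in the
                      phoneme environment direction, and division in the
                      time direction.
    """
    if n_sharing_states < 2:
        p_d = -math.inf      # rearrangement is skipped for an unshared source
    candidates = {"sharing": p_d, "context": p_c, "time": p_t}
    return max(candidates, key=candidates.get)
```

Setting PD to −∞ lets the unshared case go through the same comparison without a separate branch, since any finite PC or PT will then win.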

[0029] Let S denote the set of states that have a pointer to the signal source Q(i^). The sharing structure is rearranged by assigning either Q(I) or Q(J) to each element of S. This assignment is made by finding the maximum value PD computed by equation (9).

[0030]

[Equation 5]

[0031] Once PD has been obtained, Q(I) is assigned to state s if psI(Ys) > psJ(Ys); otherwise Q(J) is assigned to state s. State division in the phoneme environment direction (step 7) is performed by dividing one state s among the elements of S into two states and connecting them in parallel.
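The per-state rule for rearranging the sharing structure (compare psI(Ys) with psJ(Ys) and keep the better source) can be sketched as follows. Equation (9) is not reproduced in this text, so the Python sketch assumes that the scores are log-likelihoods and that PD is the sum over the states in S of the larger of the two scores; that form, the function name and the input layout are all assumptions.

```python
def reassign_sharing(scores):
    """Assign Q(I) or Q(J) to each state that used the divided source.

    scores: dict mapping a state number to (p_sI, p_sJ), the score of that
            state's training data under Q(I) and under Q(J).
    Returns (assignment, p_d), where assignment maps each state to "I" or "J"
    and p_d is the assumed form of equation (9).
    """
    assignment = {
        state: ("I" if p_i > p_j else "J")
        for state, (p_i, p_j) in scores.items()
    }
    p_d = sum(max(pair) for pair in scores.values())
    return assignment, p_d
```

Ties go to Q(J), matching the text's rule that Q(I) is chosen only when its score is strictly larger.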

[0032] In this case, the training samples represented by paths through the state being divided must be distributed between the two paths that pass through the newly created states. This is done by finding, over the states s and the phoneme environment factors f that are divisible in state s (factors with two or more elements), the state s^ and factor f^ that maximize PC computed by equation (10), and dividing the elements belonging to f^.

[0033]

[Equation 6]

[0034] Once the state s^ to be divided and the factor f^ have been found, the path to which each element a s^ f^ e of f^ is assigned is decided according to equation (11), using the values qI(y s^ f^ e) and qJ(y s^ f^ e) already obtained while computing equation (10).

[0035]

[Equation 7]

[0036] After AIf^ and AJf^ have been determined, the following is done for the two states S(I') and S(J') newly created by dividing state s^. First, I and J are assigned, respectively, to the signal source pointers of these states. Next, for their phoneme environment information, AIf^ and AJf^ are assigned, respectively, to the part concerning factor f^, and for every factor f other than f^, the content of factor f held by the state before division is copied unchanged. This completes state division in the phoneme environment direction. State division in the time direction (step 8) is performed by dividing one state s among the elements of S into two states and connecting them in series. In this case there are two possibilities, depending on which of Q(I) and Q(J) is assigned to the preceding state. The state s^ and the order of the signal sources that maximize PT computed by equation (12) are therefore determined.

[0037]

[Equation 8]

[0038] After that, the following is done for the two states S(I') and S(J') newly created by dividing state s^. First, I and J are assigned, respectively, to the signal source pointers of these states. Next, if rI(Ys^) > rJ(Ys^), state S(I') is placed first; otherwise state S(J') is placed first, and the network structure is reconfigured accordingly. Finally, the phoneme environment information held by state s^ before division is copied unchanged to both states. This completes state division in the time direction.

[0039] The signal sources of the HMnet at this point have very likely lost their optimality for the model as a whole, because fusion or division was applied to only some of them. To optimize the parameters of all signal sources and prepare for the next iteration, the output probability distributions and state transition probabilities of all signal sources within the range affected by the fusion or division are retrained (step 9).

[0040] The total likelihood achieved by training is then assigned to P(M), and fusion and division of signal sources continue until the number of signal sources M in the whole model reaches a predetermined value. The processing up to this point determines the structure of the HMnet. At this point, every signal source has a single Gaussian distribution as its output probability distribution. Finally, training that changes these output probability distributions to the form ultimately desired (step 10) is performed on the whole HMnet (this step is unnecessary if the single Gaussian distributions are used as they are). This completes the generation of the HMnet.
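The overall control flow of steps 1 to 9 can be summarized as a loop that tries a fusion first, falls back to a division, retrains, and records P(M). The Python sketch below only wires the steps together: `count`, `train`, `evaluate`, `fuse` and `divide` are hypothetical callables supplied by the caller, and the `max_steps` guard is an addition of this sketch, not part of the described method.

```python
def generate_structure(model, target_m, count, train, evaluate, fuse, divide,
                       max_steps=1000):
    """Grow a model until it has target_m signal sources.

    count(model)    -> number of signal sources M
    train(model)    -> (retrained model, total likelihood)    # step 9
    evaluate(model) -> total likelihood without retraining    # P'(M-1)
    fuse(model)     -> model with M - 1 sources                # steps 3-4
    divide(model)   -> model with M + 1 sources                # steps 5-8
    Returns (model, p, trace): p maps M to the recorded P(M), and trace
    lists M after each training pass.
    """
    model, total = train(model)
    p = {count(model): total}            # P(1) when the initial model has M = 1
    trace = [count(model)]
    for _ in range(max_steps):
        m = count(model)
        if m >= target_m:
            break
        adopted = False
        if m >= 2:
            candidate = fuse(model)
            # Adopt only if P'(M-1) exceeds the recorded P(M-1); with no
            # record for M-1 the fusion is never adopted.
            if evaluate(candidate) > p.get(m - 1, float("inf")):
                model, adopted = candidate, True
        if not adopted:
            model = divide(model)        # applied to the pre-fusion model
        model, total = train(model)
        p[count(model)] = total          # P(M) <- likelihood achieved by training
        trace.append(count(model))
    return model, p, trace
```

Because a fusion is adopted only when it beats the likelihood recorded the last time the model had M−1 sources, the source count can dip locally while still rising overall, which is the behaviour described in paragraph [0009].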

[0041]

[Effects of the Invention] Because the method of the present invention for generating an acoustic model for speech recognition repeatedly applies fusion and division of signal sources, selecting between them at each step, it can automatically and quickly generate an acoustic model that represents diverse speech phenomena well with the minimum necessary number of signal sources.

[Brief description of drawings]

[FIG. 1] A flowchart explaining the mechanism of one embodiment of the method of the present invention for generating an acoustic model for speech recognition.

[Explanation of symbols]

1 Initial model creation step
2 Step of creating an example of a model generated during processing
3 Step of selecting the signal sources to be fused
4 Signal source fusion step
5 Step of selecting the signal source to be divided
6 Step of rearranging the signal source sharing structure
7 Step of state division in the phoneme context direction
8 Step of state division in the time direction
9 Model retraining step
10 Distribution shape changing step

Claims

[Claims]

1. A method for generating a phoneme-context-dependent acoustic model for speech information processing using a hidden Markov model in which a static feature of speech, namely the shape of the speech feature pattern within a very short time, and a dynamic feature of speech, namely its change over time, are modeled as a chain of a plurality of signal sources each consisting of one output probability distribution and one set of state transition probabilities, wherein, starting from an initial model having a small number of signal sources, division or fusion of signal sources is successively selected and applied so that the phoneme context categories serving as the model units, the number of states used to represent each model and their sharing relationships among multiple models, the sharing relationships of each signal source among multiple states, and the shape of each output probability distribution are all determined under a common evaluation criterion, the method being characterized in that, on the basis of a single Gaussian distribution model, the fusion of signal sources is performed by fusing the two single Gaussian distributions of the two fusion target signal sources into one single Gaussian distribution, and the division of a signal source is performed by re-forming the one single Gaussian distribution of the division target signal source into a two-mixture Gaussian distribution and then dividing it into two single Gaussian distributions.

2. The method for generating an acoustic model for speech recognition according to claim 1, characterized in that the division process that re-forms the one single Gaussian distribution of the division target signal source into one two-mixture Gaussian distribution obtains, by the Viterbi algorithm, the state paths of all training samples that use the division target signal source in the likelihood computation, extracts, based on the state paths obtained, all frames of the training samples assigned to the division target signal source, and forms the two Gaussian distributions by vector quantization of the data of all frames of the extracted training samples.