JPH04125599A - Reference pattern generating method - Google Patents
Reference pattern generating methodInfo
- Publication number
- JPH04125599A (application JP2246863A / JP24686390A)
- Authority
- JP
- Japan
- Prior art keywords
- vector output
- states
- output probability
- transition
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Description
DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a method for creating standard patterns used in pattern recognition such as speech recognition.
In the field of pattern recognition, such as speech recognition, methods that use probabilistic models as standard patterns for recognition have attracted attention in recent years. In particular, the hidden Markov model (hereinafter, HMM) is widely used as a model representing standard patterns in the field of speech recognition.
An HMM is defined by a set of states, transition probabilities between the states, and vector output probabilities of the states or transitions; recognition is performed by computing the likelihood of each HMM for the input pattern. Speech recognition using HMMs is described in detail in the book "Speech Recognition by Probabilistic Models" (S. Nakagawa).
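As a minimal sketch of this likelihood computation (hypothetical and illustrative only; the function name and plain list-based model representation are not from the patent), the forward algorithm accumulates the probability of the input frames over all state paths:

```python
def forward_likelihood(A, b, pi):
    """Forward algorithm: P(input pattern | HMM).

    A:  A[i][j] = transition probability from state i to state j
    b:  b[t][i] = output probability of state i for input frame t
    pi: pi[i]   = initial probability of state i
    """
    S = len(pi)
    alpha = [pi[i] * b[0][i] for i in range(S)]          # alpha_1(i)
    for t in range(1, len(b)):                           # fold in each frame
        alpha = [b[t][j] * sum(alpha[i] * A[i][j] for i in range(S))
                 for j in range(S)]
    return sum(alpha)                                    # sum over final states
```

In practice the computation is done with log probabilities to avoid underflow, and recognition selects the model with the highest likelihood.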
As a way to determine the parameters of an HMM in which the vector output probability of each state (or transition) is represented by a mixture of continuous distributions, learning methods such as the Baum-Welch algorithm are known, which start from certain initial values and iteratively update the parameters using training data. In this case, initial values of parameters such as the mean of the output probability distribution must be determined for each of the mixed distributions. Known methods for giving initial values to these parameters include:
(a) giving them as random numbers;
(b) perturbing the parameters obtained for the single-distribution case with random values ("A Study of Japanese Phoneme Recognition Using Continuous Output Distribution HMMs," IEICE Technical Report on Speech, SP89-48).
On the other hand, as a method that determines the parameters directly from the training data, rather than by updating from initial values, the following is known:
(c) after segmenting the training data, clustering is performed to obtain as many clusters as the number of mixture components, and parameters such as the means are computed from the data of each cluster ("High Performance Connected Digit Recognition Using Hidden Markov Models," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 8, pp. 1214-1224, August 1989).
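The clustering-based initialization of method (c) can be illustrated roughly as follows (a hypothetical one-dimensional sketch; real systems cluster multi-dimensional feature vectors, and the function name and return convention here are assumptions, not taken from the cited paper):

```python
import random
import statistics

def cluster_init(data, k, iters=20, seed=0):
    """Split one state's training samples into k clusters with k-means;
    each cluster's mean/variance/size initializes one mixture component."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:                                    # assign to nearest center
            nearest = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[nearest].append(x)
        centers = [statistics.fmean(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]      # recompute centers
    means = centers
    variances = [statistics.pvariance(cl, mu) if len(cl) > 1 else 1.0
                 for cl, mu in zip(clusters, means)]
    weights = [len(cl) / len(data) for cl in clusters]
    return means, variances, weights
```

The per-cluster means, variances, and relative sizes then serve as the mixture means, covariances, and mixture weights; this is the clustering computation whose cost the invention seeks to avoid.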
The values determined in this way can also be used as initial values and then updated by the Baum-Welch algorithm or the like.
[Problems to Be Solved by the Invention]

When a method that iteratively updates parameters through learning is used, it is known that the setting of the initial values is important for the learning to proceed efficiently. However, if random numbers are used as in (a), or the parameters of the single-distribution case are used as in (b), the learning takes a long time to converge, and the converged values are likely to be local optima rather than the global optimum.
Method (c), on the other hand, does not necessarily require learning for parameter updating, and even when its output is used as initial values for updating, convergence is expected within a small number of iterations; however, it has the drawback of requiring a large amount of computation, since calculations for clustering and the like are needed.
An object of the present invention is to provide a standard pattern creation method that eliminates these drawbacks.
[Means for Solving the Problems]

The first invention is a method for creating a standard pattern defined by a set of states, transition probabilities between the states, and vector output probabilities of the states or transitions, characterized in that a standard pattern is created whose vector output probability for each state or transition is a mixed continuous distribution obtained by mixing, with weights, the vector output probability distributions of the corresponding states or transitions of a plurality of standard patterns whose vector output probabilities are represented by continuous distributions.
The second invention is a method for creating a standard pattern for speech recognition defined by a set of states, transition probabilities between the states, and vector output probabilities of the states or transitions, characterized in that a standard pattern is created whose vector output probability for each state or transition is a mixed continuous distribution obtained by mixing, with weights, the vector output probability distributions of the corresponding states or transitions of standard patterns whose vector output probabilities are represented by continuous distributions, each trained, for each of a plurality of speakers, using that speaker's speech data.
The third invention is a method for creating a standard pattern for speech recognition defined by a set of states, transition probabilities between the states, and vector output probabilities of the states or transitions, characterized in that a standard pattern is created whose vector output probability for each state or transition is a mixed continuous distribution obtained by mixing, with weights, the vector output probability distributions of the corresponding states or transitions of standard patterns whose vector output probabilities are represented by continuous distributions, each trained for one environment using speech data uttered or recorded in different environments.
[Operation]

According to the present invention, the vector output probability distribution represented by a mixed continuous distribution is obtained by synthesizing it from the vector output probability distributions of a plurality of standard patterns that have already been trained, so the parameters of the standard pattern can be determined simply. Moreover, if the standard patterns used for the synthesis are chosen appropriately, then when the result is used as the initial parameters for learning by the Baum-Welch method or the like, convergence is expected in fewer training iterations than when the initial parameters are determined by random numbers, and the probability of converging to a local optimum is also expected to be smaller. The synthesized pattern can also be used as it is, without parameter updating by learning.
If, as in the second invention, standard patterns trained for a plurality of speakers, each using that speaker's speech data, are used for the synthesis, a standard pattern for speaker-independent speech recognition whose vector output probabilities are represented by mixed continuous output distributions can be created simply.
If, as in the third invention, standard patterns trained for each environment, using speech data uttered or recorded in different environments, are used for the synthesis, a standard pattern whose vector output probabilities are represented by mixed continuous output distributions and which is robust against environmental variation can be created simply.
[Embodiments]

Fig. 1 is a block diagram for explaining an embodiment in which the first invention is applied to creating an HMM model for speaker-independent speech recognition. HMM model A (3) is created from the training data of speaker A (1), and HMM model B (4) from the training data of speaker B (2). As speakers A and B, for example, one standard male speaker and one standard female speaker are selected.
The HMM model has the form shown in Fig. 2.
For each state i, state transition probabilities a_ii and a_ii+1 (with a_ii + a_ii+1 = 1) and an output probability distribution b_i(y) for the output vector y are defined. The state transition probabilities and output probability distributions of model A are written a_ii^A, b_i^A(y), and so on. If the output vector probability distributions are represented by single Gaussian distributions, they are expressed as

    b_i^A(y) = N(y; μ_i^A, Σ_i^A)
    b_i^B(y) = N(y; μ_i^B, Σ_i^B)

where N(y; μ_i, Σ_i) denotes a multidimensional Gaussian distribution with mean vector μ_i and covariance matrix Σ_i. From model A and model B, an HMM model C (5) for speaker-independent speech recognition is created. Let the state transition probabilities of model C be a_ii^C and a_ii+1^C, and its output probability distribution be b_i^C. Suppose the output probability distribution is represented by the following Gaussian mixture with two components:

    b_i^C(y) = λ^1 N(y; μ_i^1, Σ_i^1) + λ^2 N(y; μ_i^2, Σ_i^2)

The parameters of model C are then determined as follows:

    a_ii^C   = (a_ii^A + a_ii^B) / 2
    a_ii+1^C = (a_ii+1^A + a_ii+1^B) / 2
    μ_i^1 = μ_i^A,  Σ_i^1 = Σ_i^A
    μ_i^2 = μ_i^B,  Σ_i^2 = Σ_i^B
    λ^1 = λ^2 = 1/2

The model C created in this way can be used as it is as an HMM model for speaker-independent speech recognition, or it can be used as an initial model for creating a better model C' (7) by training with the Baum-Welch method or the like on the training data of a larger number of speakers (6).
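The parameter assignment for model C can be sketched as follows (a hypothetical minimal implementation; the diagonal-covariance representation and the names `State`/`merge_states` are illustrative assumptions, not from the patent):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GaussianComponent:
    weight: float            # mixture weight λ
    mean: List[float]        # mean vector μ
    var: List[float]         # diagonal of covariance Σ

@dataclass
class State:
    a_self: float            # a_ii
    a_next: float            # a_ii+1
    components: List[GaussianComponent]  # mixture forming b_i(y)

def merge_states(sA: State, sB: State) -> State:
    """Build model C's state i from the corresponding states of A and B:
    transition probabilities are averaged, and the two output distributions
    become the components of a mixture with weights 1/2 each."""
    return State(
        a_self=(sA.a_self + sB.a_self) / 2,
        a_next=(sA.a_next + sB.a_next) / 2,
        components=[GaussianComponent(0.5 * c.weight, c.mean, c.var)
                    for c in sA.components + sB.components],
    )
```

Because each input component's weight is halved rather than replaced, the same function also covers the case where A and B are themselves mixtures: the resulting component count is the sum of the two input counts, as stated below.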
Model C can be created in the same way when the available models A and B have output probability distributions represented by Gaussian mixtures. In this case, the number of mixture components of the output probability distribution of model C is the sum of the numbers of mixture components of the output probability distributions of models A and B.
Next, an embodiment of the second invention will be described.
Speech data of a small number of words uttered by a large number of speakers is clustered to divide the speakers into M clusters, and from each cluster the speaker at its center is selected, giving M speakers. For each of the M speakers, an HMM model whose output probability distributions are represented by single Gaussian distributions is trained using an amount of speech data sufficient for HMM training. From the M models thus created, an HMM model for speaker-independent speech recognition is obtained by creating, as in the embodiment of the first invention, an HMM model whose output probability distributions are Gaussian mixtures with M components. Since only a small amount of data is needed for the clustering used to select the M speakers, the amount of computation is smaller than in the conventional method (c).
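Generalizing the two-model synthesis to the M per-speaker models, the corresponding states can be combined in one step (a hypothetical sketch; the dict-based representation and the equal default weights of 1/M are illustrative assumptions):

```python
def mix_models(states, weights=None):
    """Combine corresponding states of M trained single-Gaussian models into
    one state whose output distribution is an M-component mixture.

    states:  list of dicts with keys "a_self", "a_next", "mean", "var"
    weights: mixture weights (default: equal weights 1/M)
    """
    M = len(states)
    if weights is None:
        weights = [1.0 / M] * M
    # Weighted average of transition probabilities (equal weights reproduce
    # the simple averaging of the two-model embodiment).
    a_self = sum(w * s["a_self"] for w, s in zip(weights, states))
    a_next = sum(w * s["a_next"] for w, s in zip(weights, states))
    # Each model's Gaussian becomes one mixture component.
    mixture = [(w, s["mean"], s["var"]) for w, s in zip(weights, states)]
    return {"a_self": a_self, "a_next": a_next, "mixture": mixture}
```

Applying this to every state yields the speaker-independent model directly, without the per-frame clustering cost of method (c).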
Finally, an embodiment of the third invention will be described. In the embodiment of the first invention, if models A and B are chosen to be models trained on data uttered by a given speaker under different environments (for example, a quiet environment and a noisy environment), a recognition model that is robust against environmental variation can be created as model C.
As described above, according to the first invention, the parameters of a standard pattern whose vector output probabilities are represented by mixed continuous distributions can be determined simply using a plurality of standard patterns that have already been trained, and the result can be used for pattern recognition either as it is or after a small number of training iterations starting from these values.

Further, according to the second and third inventions, a standard pattern for speaker-independent recognition and a standard pattern robust against environmental variation, respectively, can be created simply.
Fig. 1 is a block diagram for explaining an embodiment in which the first invention is applied to creating an HMM model for speaker-independent speech recognition.
Fig. 2 is a diagram showing the form of the HMM model in the embodiment.
1 ..... training data of speaker A
2 ..... training data of speaker B
3 ..... HMM model A
4 ..... HMM model B
5 ..... HMM model C
6 ..... training data of many speakers
7 ..... HMM model C'
Agent: Patent Attorney Yoshiyuki Iwasa
Claims (3)

(1) A standard pattern creation method for creating a standard pattern defined by a set of states, transition probabilities between the states, and vector output probabilities of the states or transitions, characterized in that a standard pattern is created whose vector output probability for each state or transition is a mixed continuous distribution obtained by mixing, with weights, the vector output probability distributions of the corresponding states or transitions of a plurality of standard patterns whose vector output probabilities are represented by continuous distributions.

(2) A standard pattern creation method for creating a standard pattern for speech recognition defined by a set of states, transition probabilities between the states, and vector output probabilities of the states or transitions, characterized in that a standard pattern is created whose vector output probability for each state or transition is a mixed continuous distribution obtained by mixing, with weights, the vector output probability distributions of the corresponding states or transitions of standard patterns whose vector output probabilities are represented by continuous distributions, each created by learning, for each of a plurality of speakers, using that speaker's speech data.

(3) A standard pattern creation method for creating a standard pattern for speech recognition defined by a set of states, transition probabilities between the states, and vector output probabilities of the states or transitions, characterized in that a standard pattern is created whose vector output probability for each state or transition is a mixed continuous distribution obtained by mixing, with weights, the vector output probability distributions of the corresponding states or transitions of standard patterns whose vector output probabilities are represented by continuous distributions, each created by learning, for each environment, using speech data uttered or recorded in different environments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP24686390A JP3251005B2 (en) | 1990-09-17 | 1990-09-17 | Standard pattern creation method |
Publications (2)
Publication Number | Publication Date |
---|---|
JPH04125599A true JPH04125599A (en) | 1992-04-27 |
JP3251005B2 JP3251005B2 (en) | 2002-01-28 |
Family
ID=17154851
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP24686390A Expired - Fee Related JP3251005B2 (en) | 1990-09-17 | 1990-09-17 | Standard pattern creation method |
Country Status (1)
Country | Link |
---|---|
JP (1) | JP3251005B2 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH08123468A (en) * | 1994-10-24 | 1996-05-17 | Atr Onsei Honyaku Tsushin Kenkyusho:Kk | Unspecified speaker model generating device and speech recognition device |
US7603276B2 (en) | 2002-11-21 | 2009-10-13 | Panasonic Corporation | Standard-model generation for speech recognition using a reference model |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6367197A (en) * | 1986-09-09 | 1988-03-25 | 松田 健次 | Elliptic trammel |
Also Published As
Publication number | Publication date |
---|---|
JP3251005B2 (en) | 2002-01-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FPAY | Renewal fee payment (event date is renewal date of database) |
Free format text: PAYMENT UNTIL: 20071116 Year of fee payment: 6 |
|
FPAY | Renewal fee payment (event date is renewal date of database) |
Free format text: PAYMENT UNTIL: 20081116 Year of fee payment: 7 |
|
FPAY | Renewal fee payment (event date is renewal date of database) |
Free format text: PAYMENT UNTIL: 20091116 Year of fee payment: 8 |
|
LAPS | Cancellation because of no payment of annual fees |