JP2852210B2 - Unspecified speaker model creation device and speech recognition device

Info

Publication number: JP2852210B2
Application number: JP7239821A
Authority: JP
Inventors: 政啓外村; 昭一松永
Original assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Current assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Priority date: 1995-09-19
Filing date: 1995-09-19
Publication date: 1999-01-27
Anticipated expiration: 2015-09-19
Also published as: JPH0981178A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an unspecified speaker model creation device that creates a hidden Markov model (hereinafter referred to as an HMM) of an unspecified speaker based on the hidden Markov models of a plurality of specific speakers, and to a speech recognition device that uses this unspecified speaker model creation device.

[0002]

[Prior Art] Conventionally, the Baum-Welch training algorithm (hereinafter referred to as the first conventional example) has been widely used to create an HMM of an unspecified speaker based on specific speaker models for training (see, for example, Seiichi Nakagawa, "Speech Recognition by Stochastic Models", pp. 55-64, IEICE, July 1988). In the first conventional example, the HMM of the unspecified speaker is created by re-estimating the HMM parameters from two quantities: the forward probability of having observed the partial observation sequence {y_1, y_2, y_3, ..., y_t} from time 1 to time t and being in state i at time t, and the backward probability of observing the partial observation sequence {y_{t+1}, y_{t+2}, y_{t+3}, ..., y_r} from time t+1 to the end, given state i at time t.
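The forward and backward probabilities described above can be illustrated with a minimal sketch. It assumes a discrete-output HMM for brevity (the patent works with Gaussian outputs), and the names `pi`, `A`, `B`, and the toy numbers are hypothetical, chosen only for illustration.

```python
def forward(pi, A, B, obs):
    """alpha[t][i] = P(y_1..y_t, state at time t is i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[k] * A[k][i] for k in range(n)) * B[i][y]
                      for i in range(n)])
    return alpha


def backward(A, B, obs):
    """beta[t][i] = P(y_{t+1}..y_end | state at time t is i)."""
    n = len(A)
    beta = [[1.0] * n]
    for y in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][k] * B[k][y] * nxt[k] for k in range(n))
                        for i in range(n)])
    return beta


def state_posteriors(alpha, beta):
    """gamma[t][i] = P(state at time t is i | all observations)."""
    total = sum(alpha[-1])
    return [[a * b / total for a, b in zip(at, bt)]
            for at, bt in zip(alpha, beta)]


# Toy two-state model with two discrete symbols (all values are made up).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
alpha = forward(pi, A, B, obs)
beta = backward(A, B, obs)
gamma = state_posteriors(alpha, beta)
```

The posteriors `gamma` are the per-state occupation weights that Baum-Welch re-estimation accumulates over all training data, which is why its cost grows with the amount of speech.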

[0003] When the method of the first conventional example is used, it is desirable to train the model on speech data from many speakers in order to cope with the variation in acoustic features across diverse speakers, so the amount of training data tends to become large. When handling such a large amount of data, however, the enormous computational cost remains a problem even today, when computer processing speeds continue to increase.

[0004] To reduce the computational cost of creating such an unspecified speaker model, Kosaka et al. have already proposed the CCL method (hereinafter referred to as the second conventional example), which combines speaker clustering of specific speaker models with model synthesis (see Conventional Document 2: Kosaka et al., "Method of Creating an Unspecified Speaker Model Using a Clustering Technique", Transactions of the Acoustical Society of Japan, 1-R-12, November 1994). In the method of the second conventional example, clustering is performed with the entire model set spanning all phonemes as the unit, on the assumption that the similarity between speakers' acoustic features is the same throughout the acoustic space. Specifically, sufficiently trained specific speaker models are clustered by defining a distance between models, and the specific speaker models are then synthesized to create the unspecified speaker model.

[0005]

[Problems to Be Solved by the Invention] The method of the second conventional example can create an unspecified speaker model with a small amount of computation. However, a well-performing model cannot be obtained unless all parameters of the specific speaker models have been sufficiently trained, so a large amount of utterance data is required for each speaker. In addition, the number of mixture components of the output Gaussian mixture distribution is always the same in every state of the HMM, so wasted parameters accumulate in states where the feature values vary little between speakers.

[0006] A first object of the present invention is to solve the above problems and to provide an unspecified speaker model creation device that does not require all parameters of each specific speaker model to have been trained, that can create an unspecified speaker model even when the set of trained parameters differs from speaker to speaker, and that requires little memory in the processing device and shortens the computation time. A second object of the present invention is, in addition to the first object, to provide a speech recognition device that performs speech recognition using the created unspecified speaker model and improves the speech recognition rate compared with the conventional examples.

[0007]

[Means for Solving the Problems] The unspecified speaker model creation device according to claim 1 of the present invention creates a hidden Markov model with Gaussian mixture distributions of an unspecified speaker based on input single-Gaussian hidden Markov models of a plurality of specific speakers, and is characterized by comprising model creation means that creates the hidden Markov model with Gaussian mixture distributions of the unspecified speaker by clustering the output Gaussian distributions of each state of the input single-Gaussian hidden Markov models of the plurality of specific speakers independently for each state and then synthesizing them.

[0008] The unspecified speaker model creation device according to claim 2 is the unspecified speaker model creation device according to claim 1, characterized in that the model creation means comprises: learning means that, based on input uttered speech data of a plurality of specific speakers, creates a plurality of single-Gaussian hidden Markov models for the specific speakers by training output Gaussian distributions only for the states for which uttered speech data exists, by a predetermined training method using the same initial speaker hidden Markov model for all of the speakers; clustering means that, based on the plurality of single-Gaussian hidden Markov models for the specific speakers created by the learning means, clusters the output Gaussian distributions into a plurality of clusters using the distance between output Gaussian distributions as the criterion, so that each cluster contains output Gaussian distributions that are a shorter distance apart; synthesis means that, based on the single-Gaussian hidden Markov models clustered for each state by the clustering means, synthesizes the hidden Markov models of the plurality of output Gaussian distributions in each cluster into a single-Gaussian hidden Markov model for each state; and mixing means that creates the hidden Markov model with Gaussian mixture distributions of the unspecified speaker by mixing the single-Gaussian hidden Markov models of each state synthesized by the synthesis means.

[0009] The unspecified speaker model creation device according to claim 3 is the unspecified speaker model creation device according to claim 2, characterized in that the clustering means extracts, for each state, only the output Gaussian distributions trained with an amount of data equal to or greater than a preset threshold, and then performs the clustering.

[0010] The unspecified speaker model creation device according to claim 4 is the unspecified speaker model creation device according to claim 2 or 3, characterized in that the clustering means repeats the clustering in each state until the average distance between the center of each cluster and each output Gaussian distribution falls to or below a predetermined distance, thereby determining the number of clusters in each state so that the number of clusters increases as the variation of the output Gaussian distributions in that state increases.

[0011] The speech recognition device according to claim 5 of the present invention is characterized by comprising: the unspecified speaker model creation device according to any one of claims 1 to 4, which creates a hidden Markov model with Gaussian mixture distributions of an unspecified speaker based on input single-Gaussian hidden Markov models of a plurality of specific speakers; and speech recognition means that, based on the speech signal of an input uttered sentence, performs speech recognition using the hidden Markov model with mixture distributions of the unspecified speaker created by the unspecified speaker model creation device.

[0012]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to one embodiment of the present invention. The speech recognition device of this embodiment is characterized in particular by comprising an unspecified speaker model creation unit 31 and by performing speech recognition with reference to the HM network stored in the memory of the HM network 11. The unspecified speaker model creation unit 31 trains output Gaussian distributions only for the states in which data exists, applying the known maximum likelihood estimation method to the uttered speech data of N specific speakers stored in the memory of the specific-speaker uttered speech data 30. It then takes only the trained output Gaussian distribution parameters out of the specific speaker models, clusters them for each corresponding HMM state, and performs synthesis and mixing to create a hidden Markov network with Gaussian mixture distributions (hereinafter referred to as an HM network), which it stores in the memory of the HM network 11.

[0013] This speech recognition device comprises a microphone 1, a feature extraction unit 2, a buffer memory 3, a phoneme matching unit 4, and a phoneme-context-dependent LR parser (hereinafter referred to as the LR parser) 5. The LR parser 5 executes speech recognition processing with reference to the LR table 13 stored in memory, which is created from a predetermined context-free grammar stored in the memory of the context-free grammar database 20.

[0014] FIG. 2 is a flowchart of the unspecified speaker model creation process executed by the unspecified speaker model creation unit 31. In this process, first, in step S1, feature parameters are extracted from the uttered speech data of the N specific speakers. Based on the extracted feature parameters, the means and variances of the output Gaussian distributions are trained, only for the states in which data exists, by the known maximum likelihood estimation method, using for all N speakers the same initial speaker model, an HM network with one mixture component per state. This yields N single-Gaussian HM networks for the specific speakers.

[0015] Next, in step S2, as shown in FIG. 3, only the output Gaussian distributions trained with an amount of data equal to or greater than a preset threshold are extracted for each state from the N single-Gaussian HM networks for the specific speakers. Then, as shown in FIG. 4, they are clustered into a plurality of clusters using the known Bhattacharyya distance between output Gaussian distributions as the criterion, so that each cluster contains output Gaussian distributions that are a shorter distance apart. The threshold on the amount of training data is imposed so that unreliable output Gaussian distributions do not adversely affect the clustering. As a result, a highly reliable HM network 11 can be obtained, and speech recognition using this HM network 11 achieves a higher recognition rate than the conventional examples. In this clustering, the number of clusters K is determined for each state according to the variation of its member output Gaussian distributions, by repeating the clustering until the average Bhattacharyya distance between the center of each cluster and each output Gaussian distribution falls to or below a predetermined distance. K is therefore set relatively large when the variation is large and relatively small when the variation is small. A maximum number of clusters Kmax and a minimum number of clusters Kmin may also be set when determining K. Furthermore, when the amount of training data is small, K is preferably set small.
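Step S2 can be sketched for a single state as follows. This is only an illustration under stated assumptions: the text names the Bhattacharyya distance and the stopping rule but not the clustering algorithm, so a simple k-medoids procedure with the medoid standing in for the "cluster center" is swapped in here, covariances are assumed diagonal, and every function name and number is hypothetical.

```python
import math


def bhattacharyya(g1, g2):
    """Bhattacharyya distance between two diagonal Gaussians (mean, variance)."""
    (m1, v1), (m2, v2) = g1, g2
    dist = 0.0
    for a, b, va, vb in zip(m1, m2, v1, v2):
        avg = 0.5 * (va + vb)
        dist += 0.125 * (a - b) ** 2 / avg + 0.5 * math.log(avg / math.sqrt(va * vb))
    return dist


def assign_to(gaussians, medoids):
    """Index of the nearest medoid for every Gaussian."""
    return [min(medoids, key=lambda m: bhattacharyya(g, gaussians[m]))
            for g in gaussians]


def cluster(gaussians, k, iters=20):
    """Assumed k-medoids clustering with farthest-point initialisation."""
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(len(gaussians)), key=lambda i: min(
            bhattacharyya(gaussians[i], gaussians[m]) for m in medoids)))
    for _ in range(iters):
        assign = assign_to(gaussians, medoids)
        new = []
        for m in medoids:
            members = [i for i, a in enumerate(assign) if a == m] or [m]
            new.append(min(members, key=lambda c: sum(
                bhattacharyya(gaussians[c], gaussians[i]) for i in members)))
        if new == medoids:
            break
        medoids = new
    return medoids, assign_to(gaussians, medoids)


def cluster_state(gaussians, frames, min_frames, max_mean_dist, k_min=1, k_max=8):
    """Drop poorly trained Gaussians, then grow K until the mean distance is small."""
    kept = [g for g, n in zip(gaussians, frames) if n >= min_frames]
    if not kept:
        return [], [], []
    k_max = min(k_max, len(kept))
    k = min(k_min, k_max)
    while True:
        medoids, assign = cluster(kept, k)
        mean_dist = sum(bhattacharyya(g, kept[a])
                        for g, a in zip(kept, assign)) / len(kept)
        if mean_dist <= max_mean_dist or k >= k_max:
            return kept, medoids, assign
        k += 1


# Five speakers' Gaussians for one state; the last has too few frames.
gaussians = [([0.0], [1.0]), ([0.1], [1.0]), ([5.0], [1.0]),
             ([5.1], [1.0]), ([100.0], [1.0])]
frames = [50, 50, 50, 50, 3]
kept, medoids, assign = cluster_state(gaussians, frames,
                                      min_frames=10, max_mean_dist=0.01)
```

In this toy state the under-trained Gaussian is discarded by the frame threshold, and the remaining four split into two clusters because a single cluster leaves the mean distance above the limit.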

[0016] Next, in step S3, using the per-state clustering result of step S2, the plurality of output Gaussian distributions in each cluster are synthesized into a single Gaussian distribution for each state, as shown in FIG. 5. The synthesis was carried out in the same way as in Conventional Document 2, except that the total number of output Gaussian distributions and the clustering result differ from state to state. The synthesis method of step S3 is described in detail below. Then, in step S4, for each state, the synthesized single Gaussian distributions of all clusters are mixed using the known speaker mixing method to create an HM network with Gaussian mixture distributions, which is stored in the memory of the HM network 11. The mixture ratio of each cluster was set proportional to the total amount of training data of the output Gaussian distributions belonging to that cluster. That is, the larger the amount of training data of a cluster's members, the larger its mixture ratio.
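The mixture ratio rule of step S4 can be written down as a short sketch. It assumes the "amount of training data" is a frame count per member Gaussian; the function name and the counts are hypothetical.

```python
def mixture_weights(cluster_frame_counts):
    """Mixture ratio per cluster, proportional to its members' total frame count.

    cluster_frame_counts[k] lists the training frame counts of the output
    Gaussian distributions that belong to cluster k of one state.
    """
    totals = [sum(counts) for counts in cluster_frame_counts]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Three clusters in one state, with made-up frame counts per member.
weights = mixture_weights([[120, 80], [50], [30, 20]])
```

Here the first cluster holds 200 of the 300 frames, so it receives two thirds of the mixture weight.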

[0017] The mean μh_j and variance Sh_j after synthesis for each cluster, used in step S3, are given by Equations 1 and 2 below. The weighting coefficient w_j^(i) is given by Equation 3 below.

[0018]

(Equation 1)

(Equation 2)

(Equation 3)

[0019] Equations 1 and 2 give the mean and the variance, respectively, obtained when a plurality of Gaussian distributions are regarded as a single Gaussian distribution. Here, μ_j^(i) and S_j^(i) denote the mean and variance of the output probability density function, a single Gaussian distribution, in state j of the i-th HM network (i being a natural number), and n_j^(i) denotes the number of samples in state j of the i-th HM network. That is, as is clear from Equation 1, the mean μh_j and variance Sh_j after synthesis are computed from the pre-synthesis means μ_j^(i) and variances S_j^(i), each weighted by a weighting coefficient w_j^(i) that grows with the number of samples n_j^(i) in that state.
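Equations 1 to 3 themselves are not reproduced in this text, so the following is only a sketch of one form consistent with the description: it assumes w_j^(i) = n_j^(i) / Σ n_j^(k), takes the merged mean as the weighted mean, and takes the merged variance from the weighted second moment (moment matching). Covariances are assumed diagonal, and the function name is hypothetical.

```python
def merge_gaussians(means, variances, counts):
    """Merge several diagonal Gaussians of one cluster into a single Gaussian.

    Assumed forms (not taken from the patent's equation images):
      w_i  = n_i / sum(n)                           (weighting coefficient)
      mean = sum_i w_i * mu_i                       (merged mean)
      var  = sum_i w_i * (S_i + mu_i**2) - mean**2  (merged variance)
    """
    total = sum(counts)
    w = [n / total for n in counts]
    dim = len(means[0])
    mean = [sum(wi * m[d] for wi, m in zip(w, means)) for d in range(dim)]
    var = [sum(wi * (v[d] + m[d] ** 2) for wi, m, v in zip(w, means, variances))
           - mean[d] ** 2
           for d in range(dim)]
    return mean, var


# Two one-dimensional Gaussians trained on 30 and 10 samples.
merged_mean, merged_var = merge_gaussians([[0.0], [2.0]], [[1.0], [1.0]], [30, 10])
```

With these inputs the weights are 0.75 and 0.25, so the merged mean is pulled toward the better-trained Gaussian, and the merged variance exceeds either input variance because it also absorbs the spread between the two means.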

[0020] In this embodiment, the HM network 11 is used as the statistical phoneme model set for speech recognition. The HM network 11 is an efficiently represented phoneme-environment-dependent model, and one HM network contains many phoneme-environment-dependent models. The HM network 11 consists of connected states, each containing a Gaussian distribution, and states are shared among the individual phoneme-environment-dependent models. A robust model can therefore be created even when the amount of data available for parameter estimation is insufficient. The HM network 11 is generated automatically using the Successive State Splitting method (hereinafter referred to as SSS). SSS simultaneously determines the topology of the HM network, determines the allophone clusters, and estimates the Gaussian distribution parameters of each state. In this embodiment, the parameters of the HM network are output probabilities represented by Gaussian distributions and transition probabilities, so at recognition time the network can be handled in the same way as an ordinary HMM.

[0021] Next, an SSS-LR (left-to-right, rightmost type) unspecified-speaker continuous speech recognition device using the speech recognition method of this embodiment is described. This device uses an efficient phoneme-environment-dependent HMM representation called the HM network 11, which is stored in memory. SSS starts from a probabilistic model that represents the temporal evolution of speech parameters as probabilistic transitions between probabilistic stationary signal sources (states) assigned in the phoneme feature space, and refines the model successively by repeatedly splitting individual states in the context direction or in the time direction according to a likelihood-maximization criterion.

[0022] In FIG. 1, the speaker's uttered speech is input to the microphone 1, converted into a speech signal, and input to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters consisting of the log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3.
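The 34 dimensions break down as 1 + 16 + 1 + 16. The sketch below assembles such vectors from per-frame log power and cepstra; the LPC analysis itself is omitted, and the delta definition (half the difference between the neighbouring frames) is an assumption, since the text does not specify how the Δ terms are computed.

```python
def delta(frames):
    """Assumed delta: half the difference between the next and previous frame."""
    out = []
    last = len(frames) - 1
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def assemble_features(log_power, cepstra):
    """Order: log power, 16 cepstra, delta log power, 16 delta cepstra."""
    static = [[p] + list(c) for p, c in zip(log_power, cepstra)]
    return [s + d for s, d in zip(static, delta(static))]


# Three made-up frames with 16 cepstral coefficients each.
log_power = [1.0, 2.0, 4.0]
cepstra = [[0.1 * t] * 16 for t in range(3)]
features = assemble_features(log_power, cepstra)
```

Each entry of `features` is one 34-dimensional vector of the kind passed on to the phoneme matching unit.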

[0023] The HM network 11 in the memory connected to the phoneme matching unit 4 is represented as a plurality of networks whose nodes are states, and each state holds the following information:
(a) state number
(b) acceptable context cluster
(c) list of preceding states and succeeding states
(d) parameters of the output probability density distribution
(e) self-transition probability and transition probability to the succeeding state
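The per-state information (a) to (e) could be held in a record such as the one below. This is a hypothetical layout for illustration only: the field names, the representation of a context cluster as three phoneme sets, and the single succeeding transition are assumptions, not details given in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass
class Gaussian:
    mean: List[float]
    variance: List[float]


@dataclass
class HMNetState:
    state_id: int                                # (a) state number
    # (b) acceptable context cluster, assumed here to be sets of
    #     preceding / centre / succeeding phonemes
    context: Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]
    predecessors: List[int]                      # (c) preceding states
    successors: List[int]                        # (c) succeeding states
    mixture: List[Tuple[float, Gaussian]]        # (d) (weight, Gaussian) pairs
    self_transition: float                       # (e) self-transition probability
    next_transition: float                       # (e) transition to succeeding state


state = HMNetState(
    state_id=7,
    context=(frozenset({"a", "o"}), frozenset({"k"}), frozenset({"i"})),
    predecessors=[3],
    successors=[8],
    mixture=[(1.0, Gaussian(mean=[0.0] * 34, variance=[1.0] * 34))],
    self_transition=0.6,
    next_transition=0.4,
)
```

Keeping the mixture as a list of variable length matches the idea that different states may carry different numbers of mixture components.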

[0024] The phoneme matching unit 4 executes phoneme matching in response to a phoneme matching request from the phoneme-context-dependent LR parser 5. The likelihood of the data within the phoneme matching interval is computed using the unspecified speaker model, and this likelihood is returned to the LR parser 5 as the phoneme matching score. Because the model used here is equivalent to an HMM, the likelihood is computed with the forward pass algorithm used for ordinary HMMs, without modification.
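A log-domain version of such a forward pass over a short left-to-right chain of states might look like the sketch below. It assumes diagonal-covariance Gaussian mixtures, one self-loop and one forward transition per state, and that the score sums over all paths; the dictionary layout and all numbers are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mean, var))


def logsumexp(values):
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_output(x, state):
    """Log output probability of frame x under the state's Gaussian mixture."""
    return logsumexp([math.log(w) + log_gauss(x, m, v)
                      for w, m, v in state["mix"]])


def forward_score(frames, states):
    """Log likelihood of frames passing through states from first to last."""
    alpha = [NEG_INF] * len(states)
    alpha[0] = log_output(frames[0], states[0])
    for x in frames[1:]:
        new = []
        for j, s in enumerate(states):
            paths = [alpha[j] + math.log(s["loop"])]
            if j > 0:
                paths.append(alpha[j - 1] + math.log(states[j - 1]["next"]))
            new.append(logsumexp(paths) + log_output(x, s))
        alpha = new
    return alpha[-1] + math.log(states[-1]["next"])


# Two made-up states; the second has a two-component mixture.
states = [
    {"mix": [(1.0, [0.0], [1.0])], "loop": 0.5, "next": 0.5},
    {"mix": [(0.5, [1.0], [1.0]), (0.5, [3.0], [1.0])], "loop": 0.5, "next": 0.5},
]
score = forward_score([[0.1], [0.2], [1.1], [2.9]], states)
```

Because each state carries its own list of mixture components, the same routine works unchanged when states have different numbers of components.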

[0025] Meanwhile, the predetermined context-free grammar (CFG) database 20 in memory is converted automatically, in a known manner, into an LR table, which is stored in the memory of the LR table 13. The LR parser 5 refers to the LR table 13 and processes the input phoneme prediction data from left to right without backtracking. Where there is syntactic ambiguity, the stack is split and all candidate analyses are processed in parallel. The LR parser 5 predicts the next phoneme from the LR table 13 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching with reference to the information in the HM network 11 corresponding to that phoneme and returns the likelihood to the LR parser 5 as the speech recognition score. By concatenating phonemes one after another in this way, continuous speech is recognized and the speech recognition result data is output. When a plurality of phonemes are predicted during continuous speech recognition, the existence of all of them is checked, and high-speed processing is achieved by beam-search pruning that keeps only the partial trees with high partial recognition likelihood.
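The beam-search pruning described above can be sketched as follows. The LR table lookup and the acoustic scoring are replaced by hypothetical stand-in functions (`predict` and `score`), so this illustrates only the keep-the-best-partial-hypotheses step, not the parser itself.

```python
def prune(hypotheses, beam_width):
    """Keep the beam_width partial hypotheses with the highest log likelihood."""
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)[:beam_width]


def extend(hypotheses, predict, score, beam_width):
    """Extend every hypothesis by each predicted phoneme, then prune."""
    expanded = [(seq + (p,), total + score(seq, p))
                for seq, total in hypotheses
                for p in predict(seq)]
    return prune(expanded, beam_width)


# Stand-ins: the "grammar" always predicts two phonemes, and the
# "matching unit" returns a fixed log likelihood per phoneme.
def predict(seq):
    return ("a", "b")


def score(seq, phoneme):
    return {"a": -1.0, "b": -2.0}[phoneme]


beam = [((), 0.0)]
for _ in range(2):
    beam = extend(beam, predict, score, beam_width=2)
```

After two steps four phoneme sequences are possible, but only the two best-scoring ones remain in `beam`.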

[0026] In the above embodiment, the specific-speaker uttered speech data 30, the HM network 11, the LR table 13, and the context-free grammar database 20 are each stored in, for example, a hard disk memory. The phoneme matching unit 4, the LR parser 5, and the unspecified speaker model creation unit 31 are implemented by, for example, a digital computer.

[0027] In the above embodiment, the unspecified speaker model is created by the unspecified speaker model creation process of FIG. 2. The HM network created by that process may also be retrained with the known Baum-Welch training algorithm to produce the final HM network.

[0028]

[Examples] To confirm the effectiveness of the speech recognition device of FIG. 1, the present inventors conducted the following experiments. The experiments used an HM network in which the states of context-dependent phoneme HMMs are shared efficiently (see, for example, Conventional Document 3: Takami et al., "Automatic Generation of Hidden Markov Networks by Successive State Splitting on Phoneme Context and Time", IEICE Technical Report, SP91-88, December 1991). The structure of the HM network was determined using speech data of 2620 words uttered by one speaker, and two models were created, with 200 and 600 states in total. A one-state, 10-mixture silence model was added to each model. In the initial speaker model used for training the specific speaker models, every state except the silence model had a single distribution, and the initial parameter values were determined from the same speech data used for structure determination. Starting from this initial speaker model, 81 specific speaker models were created by training the means and variances of the output Gaussian distributions by maximum likelihood estimation on the spontaneous speech of 81 male speakers taken from a spontaneous-speech recognition database owned by the present applicant, whose task is travel planning (see, for example, Conventional Document 4: T. Morimoto et al., "A Speech and Language Database for Speech Translation Research", Proc. of ICSLP '94, pp. 1791-1794, 1994). Because the amount of data per speaker was small, about 20 utterances, a variance was updated only when its value became larger than the initial parameter. In these experiments, only male speakers were used for creating the unspecified speaker model and for the recognition experiments. The recognition experiments were performed on 9 male speakers who were not included in the training data, selected from the same spontaneous speech database used for training.
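The variance update rule used in the experiment, where an estimated variance replaces the initial one only if it is larger, amounts to an element-wise maximum. The sketch below assumes the rule is applied per dimension of a diagonal variance; the function name and values are hypothetical.

```python
def update_variance(initial_var, estimated_var):
    """Accept an estimated variance only where it exceeds the initial value."""
    return [max(v0, v) for v0, v in zip(initial_var, estimated_var)]


# With only about 20 utterances per speaker, some estimates collapse
# toward zero; those dimensions keep the initial variance instead.
updated = update_variance([1.0, 2.0, 0.5], [0.2, 3.0, 0.5])
```

This acts as a floor that keeps sparsely trained Gaussians from becoming too narrow.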

[0029] Unspecified speaker models were created by two methods, the CCL method of the second conventional example, which uses model-based clustering with the entire HM network as the unit, and the method according to the present invention, which uses the result of clustering each HMM state separately, and the performance of the two was compared in phoneme recognition experiments. In the state-based clustering method according to the present invention, only those output Gaussian distributions of each state of the specific speaker models whose state occupancy during training was 10 frames or more were used. In addition, the model created by state-based clustering was used as an initial model and retrained with the Baum-Welch training algorithm, and the recognition rate of the retrained model was also measured and compared. Table 1 shows the experimental conditions: the analysis conditions, the parameters used, and the training and recognition data.

[0030]

[Table 1] Experimental conditions
─────────────────────────────────
Analysis conditions: sampling frequency = 12 kHz; Hamming window = 20 ms; frame period = 5 ms
─────────────────────────────────
Parameters used: 16th-order LPC cepstrum + 16th-order Δ cepstrum + log power + Δ log power
─────────────────────────────────
Training data: 81 male speakers, 1 conversation per speaker (1799 utterances in total)
─────────────────────────────────
Evaluation data for the unspecified speaker model: 9 male speakers, 1 conversation per speaker (11 to 29 utterances)
─────────────────────────────────

[0031] Tables 2 and 3 show the total number of output Gaussian distributions contained in the HM network, for each number of states and each number of mixtures, for models created by the CCL method of the second conventional example (abbreviated as "model clustering" in the tables) and by the state-based clustering method according to the present invention (abbreviated as "state-based clustering" in the tables). With the CCL method of the second conventional example, the number of mixture components is the same for every state except the silence model. With the state-based clustering according to the present invention, when the number of output Gaussian distributions extracted from the specific speaker models for a state (those trained on 10 or more frames of data) is smaller than the set number of mixtures, the number of extracted distributions becomes the number of mixture components for that state, so the total number of distributions is smaller than with model-based clustering. In this experiment, however, the degree of variation of the mean values of the extracted output Gaussian distributions in each state was not taken into account in determining the number of mixtures. These results indicate that, when a free-utterance speech database is used, for which it is difficult to collect phoneme-balanced speech data, designing the number of mixture components for each state may prevent an unnecessary increase in parameters.
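The rule described in this paragraph for the number of mixture components per state can be sketched as follows. This is only an illustration: the function name, the state labels and the frame counts are hypothetical, and only the 10-frame threshold and the "smaller of the two numbers" rule come from the text.

```python
def mixtures_per_state(frame_counts_by_state, target_mixtures, min_frames=10):
    """Return the number of mixture components assigned to each state.

    frame_counts_by_state maps a state to a list holding, for every specific
    speaker model, the number of frames used to train that state's Gaussian.
    """
    result = {}
    for state, frame_counts in frame_counts_by_state.items():
        # Only Gaussians trained on at least min_frames frames are extracted.
        usable = sum(1 for n in frame_counts if n >= min_frames)
        # If fewer Gaussians were extracted than requested, use them all.
        result[state] = min(target_mixtures, usable)
    return result


# Hypothetical frame counts for two states across seven speakers.
counts = {
    "s1": [120, 45, 10, 9, 300, 0, 77],  # five Gaussians pass the threshold
    "s2": [3, 12, 0, 0, 25, 8, 0],       # only two Gaussians pass
}
per_state = mixtures_per_state(counts, target_mixtures=5)
total = sum(per_state.values())
print(per_state, total)
```

With a fixed number of mixtures for every state, these two states would hold 10 distributions; with the per-state rule the sparsely observed state contributes fewer, which is the effect reported in Tables 2 and 3.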

[0032]

[Table 2] Total number of distributions in the unspecified speaker model (201-state HM network)
──────────────────────────────────────────────────────────────────────
Creation method / number of mixtures        5       10       15       20
──────────────────────────────────────────────────────────────────────
Model clustering                         1010     2010     3010     4010
──────────────────────────────────────────────────────────────────────
State-based clustering                    979     1903     2798     3678
──────────────────────────────────────────────────────────────────────

[0033]

[Table 3] Total number of distributions in the unspecified speaker model (601-state HM network)
──────────────────────────────────────────────────────────────────────
Creation method / number of mixtures        3        5       10       15
──────────────────────────────────────────────────────────────────────
Model clustering                         1810     3010     6010     9010
──────────────────────────────────────────────────────────────────────
State-based clustering                   1617     2540     4614     6447
──────────────────────────────────────────────────────────────────────

[0034] Tables 4 and 5 show the results of phoneme recognition experiments using the unspecified speaker models created by each method. The figures in the tables are averages over the nine male speakers.

[0035]

[Table 4] Comparison of phoneme recognition rate (%) by model creation method (201-state HM network)
──────────────────────────────────────────────────────────────────────
Creation method / number of mixtures        5       10       15       20
──────────────────────────────────────────────────────────────────────
Baum-Welch                               65.9     66.8        -        -
──────────────────────────────────────────────────────────────────────
Model clustering                         62.2     62.5     63.3     63.2
──────────────────────────────────────────────────────────────────────
State-based clustering                   63.6     64.1     64.0     64.5
──────────────────────────────────────────────────────────────────────
State-based clustering + Baum-Welch      68.0     68.6        -        -
──────────────────────────────────────────────────────────────────────

[0036]

[Table 5] Comparison of phoneme recognition rate (%) by model creation method (601-state HM network)
──────────────────────────────────────────────────────────────────────
Creation method / number of mixtures        3        5       10       15
──────────────────────────────────────────────────────────────────────
Baum-Welch                               67.6     67.8        -        -
──────────────────────────────────────────────────────────────────────
Model clustering                         65.1     65.5     66.2     66.2
──────────────────────────────────────────────────────────────────────
State-based clustering                   67.8     67.9     67.8     67.8
──────────────────────────────────────────────────────────────────────
State-based clustering + Baum-Welch      69.2     69.2        -        -
──────────────────────────────────────────────────────────────────────

[0037] Viewing the results of Tables 4 and 5 together with those of Tables 2 and 3, the state-based clustering method according to the present invention achieves higher recognition performance with fewer parameters than the CCL method of the second conventional example under all conditions, and the difference in recognition rate is larger for the 601-state HM network than for the 201-state HM network. Considering the speed of actual recognition processing and the efficiency of speaker adaptation, an unspecified speaker model is considered better when it achieves high recognition performance with as few parameters as possible. This shows that the state-based clustering method according to the present invention is effective for obtaining a high-performance model.
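The comparison drawn in this paragraph can be checked directly from the figures in Tables 2 to 5. The short script below is only a convenience for that arithmetic; the names `TABLES`, `compare` and `mean_gain` are illustrative, and the numbers are copied from the tables.

```python
# Figures copied from Tables 2 to 5. "model" is the CCL method (model
# clustering) and "state" is state-based clustering.
TABLES = {
    201: {
        "mixtures": [5, 10, 15, 20],
        "dists": {"model": [1010, 2010, 3010, 4010],
                  "state": [979, 1903, 2798, 3678]},
        "rate": {"model": [62.2, 62.5, 63.3, 63.2],
                 "state": [63.6, 64.1, 64.0, 64.5]},
    },
    601: {
        "mixtures": [3, 5, 10, 15],
        "dists": {"model": [1810, 3010, 6010, 9010],
                  "state": [1617, 2540, 4614, 6447]},
        "rate": {"model": [65.1, 65.5, 66.2, 66.2],
                 "state": [67.8, 67.9, 67.8, 67.8]},
    },
}


def compare(table):
    """Return (mixtures, distributions saved, rate gain in points) per column."""
    rows = []
    for i, m in enumerate(table["mixtures"]):
        saved = table["dists"]["model"][i] - table["dists"]["state"][i]
        gain = round(table["rate"]["state"][i] - table["rate"]["model"][i], 1)
        rows.append((m, saved, gain))
    return rows


def mean_gain(table):
    """Average recognition-rate gain of state-based over model clustering."""
    gains = [gain for _, _, gain in compare(table)]
    return sum(gains) / len(gains)


for states, table in TABLES.items():
    for m, saved, gain in compare(table):
        print(f"{states} states, {m} mixtures: "
              f"{saved} fewer distributions, +{gain} points")
    print(f"{states} states: mean gain {mean_gain(table):.2f} points")
```

In every column the state-based model has fewer distributions and a higher recognition rate, and the average gain is larger for the 601-state network than for the 201-state network.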

[0038] As for the relationship between the number of states of the HM network and recognition performance, the 601-state HM network shows higher recognition performance than the 201-state HM network, and this holds for both the CCL method of the second conventional example and the state-based clustering method according to the present invention. The likely cause is that, with 201 states, the phonetic environment is not yet subdivided finely enough in the model. Unless the states are split so that the phonetic environment is sufficiently subdivided, the output Gaussian distribution of each state must simultaneously represent the variation in acoustic features caused by both the phonetic environment and the speaker. This makes it difficult to separate phonetic characteristics from speaker characteristics and is thought to increase the likelihood of recognition errors.

[0039] Furthermore, as is clear from Tables 4 and 5, when the model is clustered by the state-based clustering method according to the present invention and then retrained with the Baum-Welch learning algorithm, a higher phoneme recognition rate is obtained than with the other methods.

[0040] Finally, the time required to create the unspecified speaker model is described. The CCL method of the second conventional example, disclosed in Conventional Document 2, is reported to require only a few percent of the computation time of the Baum-Welch learning algorithm. With the state-based clustering according to the present invention, the computation time increases relative to the CCL method of the second conventional example because clustering is performed more times. However, this time is very small compared with the training time of the specific speaker models, which accounts for most of the time required for model creation. In terms of total time, therefore, an unspecified speaker model can be created in a few percent of the computation time of the Baum-Welch learning algorithm, as with the CCL method of the second conventional example.

[0041] As described above, according to the embodiment of the present invention, an HMM of a mixed Gaussian distribution of unspecified speakers is created by clustering the output Gaussian distributions of each state of the input single-Gaussian HMMs of a plurality of specific speakers independently for each state and combining them. Consequently, not all parameters of each specific speaker model need to have been trained, and the method can also handle cases in which the trained parameters differ from speaker to speaker. It can therefore be used for speech data from speakers with few utterances and for data whose utterance content differs from speaker to speaker, such as free-utterance speech. Furthermore, for each state of the HMM, the number of clusters can be determined from the variation of the mean values of the output Gaussian distributions extracted from the specific speaker models and from the amount of their training data. The number of mixture components can thus be determined for each state of the HMM in consideration of the amount of training data and the degree of variation in acoustic features between speakers. By performing speech recognition with the HMM of this unspecified speaker model, speech can be recognized at a higher speech recognition rate than in the conventional examples.

[0042]

[Effects of the Invention] As described in detail above, the unspecified speaker model creation device according to claim 1 of the present invention, which creates a hidden Markov model of a mixed Gaussian distribution of unspecified speakers based on input hidden Markov models of single Gaussian distributions of a plurality of specific speakers, comprises model creation means for creating the hidden Markov model of the mixed Gaussian distribution of unspecified speakers by clustering the output Gaussian distributions of each state of the input hidden Markov models of single Gaussian distributions of the plurality of specific speakers independently for each state and combining them. Specifically, the model creation means comprises: learning means for creating a plurality of hidden Markov models of single Gaussian distributions for specific speakers by learning, based on input uttered speech data of the plurality of specific speakers and using the same initial speaker hidden Markov model for the plurality of speakers, the output Gaussian distributions by a predetermined learning method only for the states for which the uttered speech data exists; clustering means for clustering, based on the plurality of hidden Markov models of single Gaussian distributions for specific speakers created by the learning means, the output Gaussian distributions into a plurality of clusters, using the distance between the output Gaussian distributions as the criterion, such that each cluster contains output Gaussian distributions that are at shorter distances from one another; combining means for combining, based on the hidden Markov models of single Gaussian distributions clustered for each state by the clustering means, the hidden Markov models of the plurality of output Gaussian distributions in each cluster into a hidden Markov model of a single Gaussian distribution for each state; and mixing means for creating the hidden Markov model of the mixed Gaussian distribution of unspecified speakers by mixing the hidden Markov models of single Gaussian distributions of each state combined by the combining means.
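The clustering, combining and mixing steps listed above can be sketched for a single HMM state as follows. This is a minimal illustration and not the patent's implementation: the `Gaussian` class and all function names are hypothetical, the distance is assumed to be the squared Euclidean distance between means, clustering is assumed to be plain k-means seeded with the first k inputs, and combination is assumed to be moment matching weighted by frame count. Training of the specific speaker models is taken as already done.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    mean: List[float]
    var: List[float]   # diagonal covariance (assumption)
    frames: int        # amount of training data behind this Gaussian


def sq_dist(a, b):
    """Squared Euclidean distance between two mean vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster(gaussians, k, iters=20):
    """Clustering step: k-means over Gaussian means; returns non-empty clusters."""
    centers = [g.mean[:] for g in gaussians[:k]]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centers]
        for g in gaussians:
            nearest = min(range(len(centers)),
                          key=lambda c: sq_dist(g.mean, centers[c]))
            groups[nearest].append(g)
        for c, members in enumerate(groups):
            if members:
                dim = len(members[0].mean)
                centers[c] = [sum(m.mean[d] for m in members) / len(members)
                              for d in range(dim)]
    return [members for members in groups if members]


def combine(members):
    """Combining step: moment-match one cluster into a single Gaussian."""
    total = sum(m.frames for m in members)
    dim = len(members[0].mean)
    mean = [sum(m.frames * m.mean[d] for m in members) / total
            for d in range(dim)]
    var = [sum(m.frames * (m.var[d] + m.mean[d] ** 2) for m in members) / total
           - mean[d] ** 2 for d in range(dim)]
    return Gaussian(mean, var, total)


def mix(components):
    """Mixing step: weight each combined Gaussian by its share of the data."""
    total = sum(c.frames for c in components)
    return [(c.frames / total, c) for c in components]


def build_state_mixture(gaussians, k):
    return mix([combine(members) for members in cluster(gaussians, k)])


# One state as seen by four hypothetical specific speaker models (1-D features).
speakers = [
    Gaussian([0.0], [1.0], 10),
    Gaussian([5.0], [1.0], 20),
    Gaussian([0.2], [1.0], 30),
    Gaussian([5.2], [1.0], 20),
]
mixture = build_state_mixture(speakers, k=2)
for weight, g in mixture:
    print(round(weight, 3), [round(m, 3) for m in g.mean])
```

Running the same procedure independently for every state, each with its own number of clusters, gives the per-state mixtures of the unspecified speaker model.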

[0043] That is, by extracting only the trained output Gaussian distributions from a large number of specific speaker models and clustering them independently in each state of the HMM, the number of clusters can be determined in consideration of the magnitude of feature variation and the amount of training data in each state, so the optimal number of output Gaussian distributions can be determined for each state. In addition, because only the trained output Gaussian distributions of each specific speaker model can be used selectively, not all output Gaussian distributions of each specific speaker model need to have been trained, and the device can be used effectively even with a database containing little speech per speaker. Moreover, because parameters are estimated separately for each speaker, the amount of computation can be reduced dramatically compared with the method based on the Baum-Welch learning algorithm of the first conventional example, which trains on all the data at once. The time required to create the unspecified speaker model can therefore be shortened considerably.

[0044] According to the unspecified speaker model creation device of claim 3, the clustering means extracts, for each state, only the output Gaussian distributions trained with an amount of data equal to or greater than a preset threshold, and then performs clustering. This makes it possible to create a more reliable, optimal unspecified speaker model. Accordingly, performing speech recognition with this unspecified speaker model yields a higher speech recognition rate than the conventional examples.

[0045] Further, according to the unspecified speaker model creation device of claim 4, the clustering means repeats clustering until the average distance between the center of each cluster obtained in each state and the output Gaussian distributions becomes equal to or less than a predetermined distance, thereby determining the number of clusters in each state such that the larger the variation of the output Gaussian distributions in a state, the larger the number of clusters. The number of clusters can therefore be determined in consideration of the variation of the output Gaussian distributions in each state, and the optimal number of output Gaussian distributions can be determined for each state. This makes it possible to create a more reliable, optimal unspecified speaker model. Accordingly, performing speech recognition with this unspecified speaker model yields a higher speech recognition rate than the conventional examples.
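The stopping rule of claim 4 can be illustrated with the sketch below, which increases the number of clusters for one state until the average distance from each distribution's mean to its cluster center is at or below a threshold. The details are assumptions made for illustration: the function names, the Euclidean distance between means, the naive k-means seeded with the first k points, the upper limit on the number of clusters, and the sample data are not taken from the patent.

```python
import math


def dist(p, q):
    """Euclidean distance between two mean vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iters=20):
    """Naive k-means seeded with the first k points; returns centers and groups."""
    centers = [list(p) for p in points[:k]]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            groups[idx].append(p)
        # Recompute each center; keep the old one if a cluster became empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups


def mean_distance_to_center(centers, groups):
    dists = [dist(p, c) for c, g in zip(centers, groups) for p in g]
    return sum(dists) / len(dists)


def choose_cluster_count(points, max_distance, max_clusters):
    """Increase k until the average distance to the cluster center is small enough.

    points must be non-empty; max_clusters caps the search (an assumption).
    """
    limit = min(max_clusters, len(points))
    for k in range(1, limit + 1):
        centers, groups = kmeans(points, k)
        if mean_distance_to_center(centers, groups) <= max_distance:
            break
    return k, centers


# Means of the Gaussians extracted for two hypothetical states (1-D features).
tight_state = [(0.0,), (0.1,), (0.2,)]
spread_state = [(0.0,), (4.0,), (8.0,), (0.2,), (4.2,), (8.2,)]

k_tight, _ = choose_cluster_count(tight_state, max_distance=0.5, max_clusters=5)
k_spread, _ = choose_cluster_count(spread_state, max_distance=0.5, max_clusters=5)
print(k_tight, k_spread)
```

The state whose means are close together stops at a single cluster, while the state with widely scattered means needs more clusters before the threshold is met, which is the behavior the paragraph describes. The result of this naive k-means depends on the order of the input points, so a practical version would need a more careful initialization.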

[0046] According to the speech recognition device of claim 5 of the present invention, the device comprises: the unspecified speaker model creation device according to any one of claims 1 to 4, which creates a hidden Markov model of a mixed Gaussian distribution of unspecified speakers based on input hidden Markov models of single Gaussian distributions of a plurality of specific speakers; and speech recognition means for performing speech recognition, based on the speech signal of an input uttered speech sentence, using the hidden Markov model of the mixed distribution of unspecified speakers created by the unspecified speaker model creation device. Accordingly, performing speech recognition with this unspecified speaker model yields a higher speech recognition rate than the conventional examples.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the unspecified speaker model creation process executed by the unspecified speaker model creation unit of FIG. 1.

FIG. 3 is a diagram showing the training of the specific speaker models and the extraction of output Gaussian distributions in the unspecified speaker model creation process executed by the unspecified speaker model creation unit of FIG. 1.

FIG. 4 is a diagram showing the clustering of output Gaussian distributions for each state in the unspecified speaker model creation process executed by the unspecified speaker model creation unit of FIG. 1.

FIG. 5 is a diagram showing the mixing of a plurality of probability density functions for each cluster in the unspecified speaker model creation process executed by the unspecified speaker model creation unit of FIG. 1.

[Explanation of symbols]

1: microphone; 2: feature extraction unit; 3: buffer memory; 4: phoneme matching unit; 5: LR parser; 11: hidden Markov network (HM network); 13: LR table; 20: context-free grammar database; 30: uttered speech data of specific speakers; 31: unspecified speaker model creation unit.

Continuation of the front page

(56) References: JP-A-4-125599 (JP, A); JP-A-63-257798 (JP, A); Proceedings of the Acoustical Society of Japan (September 1995), 3-2-9, pp. 123-124

(58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 521; G10L 3/00 531; G10L 3/00 535; JICST file (JOIS)

Claims

(57) [Claims]

1. An unspecified speaker model creation device for creating a hidden Markov model of a mixed Gaussian distribution of unspecified speakers based on input hidden Markov models of single Gaussian distributions of a plurality of specific speakers, the device comprising model creation means for creating the hidden Markov model of the mixed Gaussian distribution of unspecified speakers by clustering the output Gaussian distributions of each state of the input hidden Markov models of single Gaussian distributions of the plurality of specific speakers independently for each state and combining them.

2. The unspecified speaker model creation device according to claim 1, wherein the model creation means comprises: learning means for creating a plurality of hidden Markov models of single Gaussian distributions for specific speakers by learning, based on input uttered speech data of the plurality of specific speakers and using the same initial speaker hidden Markov model for the plurality of speakers, the output Gaussian distributions by a predetermined learning method only for the states for which the uttered speech data exists; clustering means for clustering, based on the plurality of hidden Markov models of single Gaussian distributions for specific speakers created by the learning means, the output Gaussian distributions into a plurality of clusters, using the distance between the output Gaussian distributions as the criterion, such that each cluster contains output Gaussian distributions that are at shorter distances from one another; combining means for combining, based on the hidden Markov models of single Gaussian distributions clustered for each state by the clustering means, the hidden Markov models of the plurality of output Gaussian distributions in each cluster into a hidden Markov model of a single Gaussian distribution for each state; and mixing means for creating the hidden Markov model of the mixed Gaussian distribution of unspecified speakers by mixing the hidden Markov models of single Gaussian distributions of each state combined by the combining means.

3. The unspecified speaker model creation device according to claim 2, wherein the clustering means extracts, for each state, only the output Gaussian distributions learned with an amount of data equal to or greater than a preset threshold, and then performs clustering.

4. The unspecified speaker model creation device according to claim 2, wherein the clustering means repeats clustering until the average distance between the center of each cluster obtained in each state and the output Gaussian distributions becomes equal to or less than a predetermined distance, thereby determining the number of clusters in each state such that the larger the variation of the output Gaussian distributions in a state, the larger the number of clusters.

5. A speech recognition device comprising: the unspecified speaker model creation device according to any one of claims 1 to 4, which creates a hidden Markov model of a mixed Gaussian distribution of unspecified speakers based on input hidden Markov models of single Gaussian distributions of a plurality of specific speakers; and speech recognition means for performing speech recognition, based on the speech signal of an input uttered speech sentence, using the hidden Markov model of the mixed distribution of unspecified speakers created by the unspecified speaker model creation device.