JPH0981182A

JPH0981182A - Learning device for hidden Markov model (HMM) and voice recognition device

Info

Publication number: JPH0981182A
Application number: JP7232436A
Authority: JP
Inventors: Atsushi Nakamura; 篤中村
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1995-09-11
Filing date: 1995-09-11
Publication date: 1997-03-28
Anticipated expiration: 2015-09-11
Also published as: JP2886118B2

Abstract

PROBLEM TO BE SOLVED: To reduce the time and cost required to change the vocabulary of registered words by generating pseudo word learning data from uniform random numbers and retraining the garbage HMM on that data. SOLUTION: When the vocabulary of registered words is changed, the garbage HMM 12 is trained by an ME learning method that uses pseudo word learning data, so that no new voice samples need to be collected or processed. The pseudo word learning data are generated from a prescribed word HMM 11, which is a phoneme HMM that takes the preceding and following phoneme environment into account (a CD phoneme HMM), and from uniform random numbers generated by a digital electronic computer. Because the garbage HMM 12 is retrained on data generated in this way, no work is required to collect or process new voice samples for learning.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an HMM learning device for training a hidden Markov model (hereinafter, HMM) for speech recognition, and to a speech recognition device that recognizes speech using the HMM trained by that learning device.

[0002]

[Prior Art] A conventional continuous speech recognition device performs speech recognition with HMMs based on acoustic feature parameters extracted from the input uttered speech, and outputs the result.

[0003] In the extraction of registered words (also called spotting) by a speech recognition unit using the above HMMs, the garbage HMM used to detect unregistered words has a large influence on spotting performance. Conventionally, a learning method based on a minimum error criterion (hereinafter, the ME learning method) has been used to train the garbage HMM, and its effectiveness is reported in, for example, Prior Document 1, Komori et al., "Minimum error classification training for HMM-based keyword spotting", Proc. ICSLP 92, Vol. I, pp. 9-12, 1992, and Prior Document 2, Torre et al., "Discriminative training of garbage model for non-vocabulary utterance rejection", Proc. ICSLP 94, Vol. I, pp. 475-478, 1994.

[0004]

[Problem to Be Solved by the Invention] However, because these conventional methods use a large number of voice samples for learning, steps such as collecting and segmenting voice samples are needed every time the vocabulary of registered words is changed, so that a rapid change of the registered vocabulary is impossible in principle.

[0005] An object of the present invention is to solve the above problem and to provide an HMM learning device that can train an HMM after a change in the vocabulary of registered words more easily and quickly than the conventional examples, and a speech recognition device that recognizes speech using the HMM trained by that learning device.

[0006]

[Means for Solving the Problem] The hidden Markov model learning device according to claim 1 of the present invention comprises: random number generating means for generating multidimensional uniform random numbers; data generating means for converting the multidimensional uniform random numbers generated by the random number generating means into a plurality of Gaussian random numbers that follow the multidimensional Gaussian distributions of a predetermined word hidden Markov model for recognizing registered words registered in advance, and for outputting the converted Gaussian random numbers, which are a plurality of feature parameters, as pseudo word learning data; and learning means for learning a plurality of parameters of a garbage hidden Markov model for detecting unregistered words that are not registered in advance, by updating those parameters, based on the pseudo word learning data output from the data generating means and on the word hidden Markov model, so that the value of a predetermined cost function is minimized.

[0007] The hidden Markov model learning device according to claim 2 is the device of claim 1, wherein the cost function is calculated by adding: an index value indicating the likelihood of an error in which an uttered word is not recognized, calculated for each word to be recognized from a score for speech recognition computed from the word learning data and the word hidden Markov model and a score for speech recognition computed from the word learning data and the garbage hidden Markov model; and an index value indicating the likelihood of an error in which a word that was not uttered appears in the recognition result, calculated for each word to be recognized from a score for speech recognition computed from the word learning data excluding that of the word to be recognized and the word hidden Markov model and a score for speech recognition computed from the word learning data and the garbage hidden Markov model.

[0008] Further, the hidden Markov model learning device according to claim 3 is the device of claim 2, wherein the score indicating the likelihood for speech recognition is a score calculated by the Viterbi decoding method.
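As a rough illustration of the Viterbi score referred to in claim 3, the sketch below computes the best-path log score of a feature sequence under a left-to-right HMM with self-loops. It assumes one diagonal-covariance Gaussian per state, a path that starts in the first state and ends in the last, and log-domain scores; the function and parameter names are hypothetical and are not taken from the patent.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def viterbi_score(frames, means, variances, stay_prob):
    """Best-path log score of `frames` under a left-to-right HMM.

    Assumptions (illustrative only): state k stays with probability
    stay_prob[k] (strictly between 0 and 1) and otherwise moves to
    state k + 1; the path starts in state 0 and must end in the last state.
    """
    n = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n
    score[0] = log_gauss(frames[0], means[0], variances[0])
    for x in frames[1:]:
        new = [neg_inf] * n
        for k in range(n):
            stay = score[k] + math.log(stay_prob[k])
            move = score[k - 1] + math.log(1 - stay_prob[k - 1]) if k > 0 else neg_inf
            new[k] = max(stay, move) + log_gauss(x, means[k], variances[k])
        score = new
    return score[-1]
```

A sequence that is too short to reach the last state receives a score of negative infinity, which is one simple way to rule out incomplete paths.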

[0009] A speech recognition device according to the present invention comprises: the hidden Markov model learning device according to claim 1, 2 or 3; and speech recognition means for recognizing speech, based on the speech signal of an input uttered sentence, using the word hidden Markov model for recognizing registered words registered in advance and the garbage hidden Markov model, trained by the learning device, for detecting unregistered words that are not registered in advance, and for outputting the speech recognition result.

[0010]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. To allow the vocabulary set of registered words to be changed quickly and easily, the present invention is characterized by training the garbage HMM 12 with an ME learning method that uses pseudo word learning data (hereinafter, word learning data), a method that requires no collection or processing of new voice samples when the vocabulary set is changed. The pseudo word learning data are generated from the predetermined word HMM 11, which is a phoneme HMM that takes the preceding and following phoneme environment into account (hereinafter, CD phoneme HMM), and from uniform random numbers generated by a digital electronic computer.

[0011] FIG. 1 is a block diagram of a speech recognition device according to one embodiment of the present invention. The registered-word spotting algorithm used in the word matching unit 4 is based on one-pass Viterbi decoding. As shown in FIG. 9, the acoustic models are an HMM for each registered word, formed by concatenating CD phoneme HMMs, together with a silence HMM and a garbage HMM of one state each. The HMMs for the registered words, which are used to recognize the plurality of registered words registered in advance, and the silence HMM are stored in the memory of the word HMM 11 in FIG. 1, while the garbage HMM for detecting unregistered words that are not registered in advance is stored in the memory of the garbage HMM 12 in FIG. 1. The memories of the HMMs 11 and 12 are, for example, hard disk memories.

[0012] The word HMM 11 has the connections between states shown in FIG. 8(a) and the information structure shown in FIG. 8(b). As shown in FIG. 8(a), the word HMM 11 is represented by a state transition diagram in which n states are connected in cascade, and each state has a self-loop. As shown in FIG. 8(b), each state of the word HMM 11 holds a self-loop probability, a state transition probability, and output distribution data; the output distribution data comprise a multidimensional Gaussian distribution number, a mixture weight, dimension numbers, and the mean and variance for each dimension. Here, a multidimensional Gaussian distribution is, for example, a Gaussian distribution over 34-dimensional feature parameters consisting of 16-dimensional LPC cepstrum, 16-dimensional Δ cepstrum, log power, and Δ log power.
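The information structure described for FIG. 8(b) could be held in memory roughly as follows. This is only a sketch: the class and field names are hypothetical, the per-dimension mean and variance are read as a diagonal covariance, and `is_consistent` is an illustrative sanity check that is not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Gaussian:
    number: int            # multidimensional Gaussian distribution number
    weight: float          # mixture weight of this component
    mean: List[float]      # mean for each dimension
    var: List[float]       # variance for each dimension

@dataclass
class State:
    self_loop_prob: float  # probability of staying in this state
    transition_prob: float # probability of moving to the next state
    mixture: List[Gaussian] = field(default_factory=list)

@dataclass
class WordHMM:
    word: str
    states: List[State] = field(default_factory=list)  # cascaded, left to right

def is_consistent(hmm, dim=34, tol=1e-6):
    """Check that mixture weights sum to 1 and every Gaussian has `dim` dimensions."""
    for state in hmm.states:
        if abs(sum(g.weight for g in state.mixture) - 1.0) > tol:
            return False
        if any(len(g.mean) != dim or len(g.var) != dim for g in state.mixture):
            return False
    return True
```

The default of 34 dimensions follows the example feature set given in the text; other feature sets would pass a different `dim`.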

[0013] FIG. 2 shows the garbage HMM learning process executed by the HMM learning unit 20 of FIG. 1. The parameters of the word HMM 11, which includes the CD phoneme HMMs and the silence HMM, and of the garbage HMM 12 are trained in advance by the well-known Baum-Welch algorithm to set their initial values, and only the garbage HMM 12 is retrained by the ME learning method through the process of FIG. 2.

[0014] As shown in FIG. 2, first, in step S1, a word learning data creation process is executed to create the pseudo word learning data. The word learning data are generated for each word to be recognized from the information held by that word's HMM in the word HMM 11. Specifically, for each word, uniform pseudo-random numbers generated by a digital electronic computer are converted into random numbers that follow the multidimensional Gaussian distribution determined by the state transition rules of the word HMM and the mixture weight distribution of each state, and are output; this procedure is repeated until the final state of the word HMM is reached.

[0015] Next, in step S2, based on the created word learning data, each parameter of the garbage HMM 12 is updated iteratively so that the value of a cost function, defined to correspond to the index values of recognition error likelihood described in detail below, reaches its minimum (in practice, a local minimum), yielding a new garbage HMM. In practice, a plurality of word learning data sets are prepared, and learning proceeds so as to minimize the average of the costs over the word learning data sets.

[0016] The cost function used in step S2 is defined as an index value of recognition error likelihood, calculated from differences in Viterbi scores obtained using the word learning data, the word HMM 11, and the garbage HMM 12, which is the HMM for detecting unregistered words. There are two kinds of recognition error: an error in which an uttered word is not recognized (a word omission error) and an error in which a word that was not uttered appears in the recognition result (a word insertion error). The index value for each kind of error is calculated by the method described in detail below. The cost function C is given by Equation 1 below.

[0017]

[Equation 1]

[0018] Here, E(Δ) is the sigmoid function given by Equation 2 below.

[Equation 2] E(Δ) = 1 / (1 + exp(−αΔ))
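The sigmoid of Equation 2 can be evaluated as in the short sketch below. The source does not give a value for α, so the default `alpha=1.0` is an assumption, and the function name is hypothetical. The two branches only avoid floating-point overflow for large score differences; they compute the same function.

```python
import math

def sigmoid(delta, alpha=1.0):
    """E(delta) = 1 / (1 + exp(-alpha * delta)), evaluated without overflow."""
    z = alpha * delta
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)
```

A larger α makes the function approach a hard 0/1 error count, while a smaller α gives a smoother cost that is easier to minimize by gradient steps.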

[0019] P(S, x) is the function given by Equation 3 below, and denotes the maximum of the Viterbi scores V(s, x) of the HMMs s for the word learning data x.

[Equation 3]

[0020] Further, ghω is the function given by Equation 4 below, and is the argument w that maximizes the Viterbi score V(ω, w).

[Equation 4]

[0021] The remaining symbols are defined as follows.
W: the set of word learning data
|W|: the number of words in the set of word learning data
Ω: the set of word HMMs 11
|Ω|: the number of words in the set of word HMMs 11
γ: the set of garbage HMMs 12
V(s, x): the Viterbi score of HMM s for word learning data x
h: the bijection that maps each element of the set W of word learning data (that is, each item of word learning data) to the corresponding element of the set Ω of word HMMs 11 (that is, each HMM in the word HMM 11)
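Equations 1, 3 and 4 are not reproduced in this text. The block below is one possible reconstruction, inferred from the verbal description of the two index values in paragraphs [0030] to [0033] and from the symbol definitions in [0021]. It is an assumption rather than the original notation; in particular, writing the garbage score as P(γ, ·), the subscript form of g_{hω}, and the exact index set of the arg max are guesses.

```latex
% Equation 1 (assumed form): omission term plus insertion term
C = \frac{1}{|W|} \sum_{w \in W} E\bigl( P(\gamma, w) - V(h(w), w) \bigr)
  + \frac{1}{|\Omega|} \sum_{\omega \in \Omega}
      E\bigl( V(\omega, g_{h\omega}) - P(\gamma, g_{h\omega}) \bigr)

% Equation 3 (assumed form): best Viterbi score over a set S of HMMs
P(S, x) = \max_{s \in S} V(s, x)

% Equation 4 (assumed form): best-scoring word learning data other than the word's own
g_{h\omega} = \operatorname*{arg\,max}_{w \in W \setminus \{ h^{-1}(\omega) \}} V(\omega, w)
```

Under this reading, the first term grows when the garbage HMM outscores the correct word HMM on that word's data, and the second grows when a word HMM outscores the garbage HMM on another word's data.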

[0022] In the cost minimization of step S2, each parameter θ of the garbage HMM 12 (that is, the means, variances, and mixture weights) is updated iteratively by Equation 5 below until the cost function value converges.

[0023]

[Equation 5] θ^(i) = θ^(i−1) − β [∂C/∂θ] (θ = θ^(i−1))

[0024] Here, θ^(i) is the parameter obtained by the i-th update, and β is a learning constant that takes a value of, for example, 0.1 to 0.5. The second term on the right-hand side of Equation 5, [∂C/∂θ] (θ = θ^(i−1)), is the value of ∂C/∂θ at θ = θ^(i−1).

[0025] FIG. 3 is a flowchart of the word learning data generation process (step S1) of FIG. 2. As shown in FIG. 3, the word number j is set to 1 in step S11, and in step S12 the feature parameter sequence generation process, described in detail below, is executed for the word with word number j (hereinafter, word #j). Next, in step S13, the phoneme durations of the generated feature parameter sequence are checked, and in step S14 it is determined whether the phoneme durations are normal. As the specific criterion, 20 milliseconds or less was judged normal for a vowel, and 10 milliseconds or less was judged normal for a consonant. If the phoneme durations are normal, the generated feature parameter sequence is output to the working memory 21 and temporarily stored in step S15, the word number j is incremented by 1 in step S16, and the process proceeds to step S17. In step S17, it is determined whether word learning data have been generated for all words; if not, the process returns to step S12 and repeats the above steps, and if so, the word learning data generation process ends. If the phoneme durations are judged abnormal in step S14, the generated feature parameter sequence is discarded without being output, and the process returns to step S12 to generate further uniform random numbers and another feature parameter sequence.
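The generate-and-check loop of FIG. 3 amounts to rejection sampling per word, which might be sketched as follows. The generator and the duration check are passed in as functions so that the sketch does not commit to any particular criterion; `generate`, `is_normal` and `max_tries` are hypothetical, and the retry limit is a safety guard added here that the flowchart does not have.

```python
import random

def generate_word_data(words, generate, is_normal, rng, max_tries=1000):
    """Steps S11 to S17: keep one accepted feature sequence per registered word."""
    data = {}
    for word in words:                    # S11, S16, S17: loop over all words
        for _ in range(max_tries):
            seq = generate(word, rng)     # S12: draw a candidate sequence
            if is_normal(word, seq):      # S13, S14: phoneme duration check
                data[word] = seq          # S15: store the accepted sequence
                break
        else:
            raise RuntimeError(f"no valid sequence generated for {word!r}")
    return data
```

Rejected sequences are simply dropped, so the accepted data follow the word HMM's distribution conditioned on passing the duration check.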

[0026] FIG. 4 is a flowchart of the feature parameter sequence generation process for word #j (step S12) of FIG. 3. As shown in FIG. 4, the state number i is first set to 1 in step S21. In step S22, the multidimensional Gaussian distribution number j of word #j in the word HMM 11 is determined from a uniform random number generated by the digital electronic computer that constitutes the HMM learning unit 20 (the uniform random number takes a value between 0 and 1) and from the mixture weight distribution of the state with state number i (hereinafter, state #i). That is, because the mixture weights of the multidimensional Gaussian distributions of word #j sum to 1, the multidimensional Gaussian distribution number j is determined by checking which cumulative sum of mixture weights the generated uniform random number falls under. For example, the cumulative sum of mixture weights for multidimensional Gaussian distribution #2 is the mixture weight of distribution #2 plus the mixture weight of distribution #1; when the generated uniform random number exceeds the mixture weight of distribution #1 and is less than or equal to the cumulative sum for distribution #2, j = 2 is chosen.
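The cumulative-weight selection of step S22 can be written in a few lines. The function name `pick_component` is hypothetical, and the final `return` is a guard against floating-point rounding that the text does not mention.

```python
def pick_component(weights, u):
    """Return the 1-based number of the first component whose cumulative
    mixture weight is greater than or equal to the uniform random number u."""
    cumulative = 0.0
    for number, w in enumerate(weights, start=1):
        cumulative += w
        if u <= cumulative:
            return number
    return len(weights)  # reached only if rounding leaves the total just below u
```

With mixture weights `[0.2, 0.5, 0.3]`, a draw of `u = 0.25` exceeds the weight of distribution #1 and lies below the cumulative sum for distribution #2, so `pick_component` returns 2, matching the example in the text.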

[0027] Next, in step S23, multidimensional uniform random numbers generated by the digital electronic computer are converted into a plurality of Gaussian random numbers (also called normal random numbers) that follow the multidimensional Gaussian distribution with Gaussian distribution number j of word #j in the word HMM 11 (hereinafter, multidimensional Gaussian distribution #j), and the result is output to the working memory 21 as a feature parameter sequence. Here, Gaussian random numbers that follow multidimensional Gaussian distribution #j are Gaussian random numbers whose distribution has the same mean, variance, and shape as that distribution.
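The text does not name the method used to convert uniform random numbers into Gaussian ones. One common choice is the Box-Muller transform, sketched below for a diagonal-covariance Gaussian; both the transform and the diagonal covariance are assumptions of this sketch, and `gaussian_vector` is a hypothetical name.

```python
import math
import random

def gaussian_vector(mean, var, rng):
    """Turn pairs of uniform random numbers into one sample of a diagonal Gaussian."""
    sample = []
    for m, v in zip(mean, var):
        u1 = 1.0 - rng.random()   # lies in (0, 1], so log(u1) is finite
        u2 = rng.random()
        z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)  # standard normal
        sample.append(m + math.sqrt(v) * z)   # shift and scale to the target mean and variance
    return sample
```

Calling `gaussian_vector(mean, var, random.Random(0))` with 34-element `mean` and `var` lists would yield one 34-dimensional pseudo feature vector.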

[0028] Further, in step S24, whether a state transition occurs is decided from a uniform random number generated by the digital electronic computer and the transition probability of state #1 of word #j in the word HMM 11. That is, a state transition is judged to occur when the generated uniform random number is less than or equal to the transition probability, and not to occur when it exceeds the transition probability. Next, in step S25, it is determined whether a state transition occurs. If not, the step is treated as a self-loop, and the process returns to step S22 to generate further uniform random numbers and another feature parameter sequence. If a state transition is judged to occur in step S25, the state number i is incremented by 1 in step S26, and in step S27 it is determined whether state #i is the final state of the HMM. If it is not the final state, the process returns to step S22 and repeats the above steps for the next state; if it is the final state, the process returns to the main routine of FIG. 12.
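Steps S21 to S27 together describe sampling a frame sequence from a left-to-right HMM. A compact sketch follows. It makes several assumptions: each state is a dict with a hypothetical `'trans'` probability and a `'mix'` list of `(weight, mean, var)` tuples, covariances are diagonal, leaving the last emitting state counts as reaching the final state, and `rng.gauss` stands in for the explicit uniform-to-Gaussian conversion.

```python
import math
import random

def sample_word(states, rng):
    """Generate one pseudo feature sequence for a word (steps S21 to S27)."""
    frames, i = [], 0                                  # S21: start at the first state
    while i < len(states):                             # S27: stop at the final state
        state = states[i]
        u, cumulative = rng.random(), 0.0              # S22: choose a mixture component
        chosen = state["mix"][-1]
        for comp in state["mix"]:
            cumulative += comp[0]
            if u <= cumulative:
                chosen = comp
                break
        _, mean, var = chosen                          # S23: emit one Gaussian frame
        frames.append([rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)])
        if rng.random() <= state["trans"]:             # S24, S25: transition or self-loop
            i += 1                                     # S26: move to the next state
    return frames
```

Each pass through the loop emits one frame, so the number of frames spent in a state follows a geometric distribution governed by that state's transition probability.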

[0029] FIG. 5 is a flowchart of the cost function calculation process, a subroutine executed in step S2 of FIG. 2. As shown in FIG. 5, the word omission error likelihood index calculation process described below is executed in step S31, and the word insertion error likelihood index calculation process is executed in step S32. In step S33, the value calculated in step S31 and stored in the calculation buffer Buff1 is added to the value calculated in step S32 and stored in the calculation buffer Buff2, and the sum is taken as the cost function value C.

[0030] FIG. 6 is a flowchart of the word omission error likelihood index calculation process of FIG. 5. In this process, for each word to be recognized, a Viterbi score is calculated from the pseudo word learning data and the HMM of that word in the word HMM 11, and a Viterbi score is calculated from the pseudo word learning data and the garbage HMM 12. The Viterbi score of the word's HMM is subtracted from the Viterbi score of the garbage HMM 12, and the difference is smoothed by the sigmoid function. The sum of the smoothed values is normalized by dividing by the number of words to be recognized, giving the word omission error likelihood index value.

[0031] As shown in FIG. 6, the calculation buffer Buff1 is set to 0 in step S41 and the word number j is set to 1 in step S42. In step S43, the part to the right of Σ in the first term on the right-hand side of Equation 1, that is, the Viterbi score of the garbage HMM 12 minus the Viterbi score of the word's HMM, is calculated; the calculated value is added to the value of the calculation buffer Buff1, and the buffer is updated with the sum. In step S44 the word number j is incremented by 1, and in step S45 it is determined whether step S43 has been completed for all words. If not, the process returns to step S43 and repeats the above steps; if so, the process proceeds to step S46. In step S46, the value of the calculation buffer Buff1 is divided by the number of words in the set of word learning data, and the quotient is stored in Buff1. Finally, in step S47, the value of Buff1 is output to the working memory 21 and stored as the word omission error likelihood index value, which corresponds to the first term on the right-hand side of Equation 1.
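The loop of FIG. 6 can be sketched as below. The scoring function is passed in so that any Viterbi implementation can be used; `score`, `word_data`, `word_hmms` and `alpha` are hypothetical names, and the sketch assumes the sigmoid is applied to each difference before summation, as described for this index in the text.

```python
import math

def sigmoid(z):
    """1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def omission_index(word_data, word_hmms, garbage, score, alpha=1.0):
    """Mean sigmoid of (garbage score - own word HMM score) over all words."""
    total = 0.0                                              # S41: clear the buffer
    for word, x in word_data.items():                        # S42, S44, S45: loop over words
        delta = score(garbage, x) - score(word_hmms[word], x)
        total += sigmoid(alpha * delta)                      # S43: accumulate smoothed difference
    return total / len(word_data)                            # S46: normalize by word count
```

The index approaches 1 when the garbage model outscores the correct word model on most words, and approaches 0 when the word models win clearly.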

[0032] FIG. 7 is a flowchart of the word insertion error likelihood index calculation process of FIG. 5. In this process, for each word to be recognized, Viterbi scores are calculated from all word learning data other than that word's own and the HMM of that word in the word HMM 11. The word learning data ghω that give the largest of these Viterbi scores, and that score y, are stored. A Viterbi score z is then calculated from the word learning data ghω and the garbage HMM 12. The score z is subtracted from the score y, and the difference is smoothed by the sigmoid function. The sum of the smoothed values is normalized by dividing by the number of words to be recognized, giving the word insertion error likelihood index value.

[0033] As shown in FIG. 7, the calculation buffer Buff2 is set to 0 in step S51 and the word number j is set to 1 in step S52. In step S53, Viterbi scores are calculated from the HMM of word #j in the word HMM 11 (hereinafter, word HMM #j) and each item of word learning data other than that of word #j, and the word learning data ghω giving the largest of the calculated scores are selected and stored in the working memory 21. Next, in step S54, the Viterbi score z is calculated from the word learning data ghω and the garbage HMM 12; the part to the right of Σ in the second term on the right-hand side of Equation 1, that is, the largest Viterbi score y minus the Viterbi score z calculated from the garbage HMM 12, is calculated; the calculated value is added to the value of the calculation buffer Buff2, and the buffer is updated with the sum. Then, in step S54, the word number j is incremented by 1, and in step S55 it is determined whether steps S53 and S54 have been completed for all words. If not, the process returns to step S53 and repeats the above steps; if so, the process proceeds to step S57. In step S57, the value of the calculation buffer Buff2 is divided by the number of words in the set of word HMMs 11, and the quotient is stored in Buff2. Finally, in step S58, the value of Buff2 is output to the working memory 21 and stored as the word insertion error likelihood index value, which corresponds to the second term on the right-hand side of Equation 1.
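The loop of FIG. 7 can be sketched in the same style. As before, `score`, `word_data`, `word_hmms` and `alpha` are hypothetical names; the sketch assumes at least two words, and that the sigmoid is applied to each difference y − z before summation.

```python
import math

def sigmoid(z):
    """1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def insertion_index(word_data, word_hmms, garbage, score, alpha=1.0):
    """Mean sigmoid of (best score on another word's data - garbage score on that data)."""
    total = 0.0                                              # S51: clear the buffer
    for word, hmm in word_hmms.items():                      # S52: loop over word HMMs
        others = [x for w, x in word_data.items() if w != word]
        g = max(others, key=lambda x: score(hmm, x))         # S53: best-scoring other data
        y, z = score(hmm, g), score(garbage, g)              # S54: word score and garbage score
        total += sigmoid(alpha * (y - z))
    return total / len(word_hmms)                            # S57: normalize by HMM count
```

Adding this value to the omission index gives the cost C of step S33; retraining the garbage HMM trades the two error types against each other.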

[0034] Next, a speech recognition device for word recognition, which recognizes speech using the garbage HMM 12 retrained by the above method and the word HMM 11, is described with reference to FIG. 1.

[0035] In FIG. 1, the HMM learning unit 20 retrains the garbage HMM 12 based on the pseudo word learning data and the word HMM 11, and stores the result in the memory of the garbage HMM 12. Meanwhile, the speaker's utterance is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. After A/D conversion of the input speech signal, the feature extraction unit 2 performs, for example, LPC analysis and extracts 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of the extracted feature parameters is input to the word matching unit 4 via the buffer memory 3.
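As an illustration of how the 34 dimensions add up (1 + 16 + 1 + 16), the following sketch assembles one feature vector per frame from log power and 16 cepstrum coefficients that are assumed to be already available. The text does not say how the Δ terms are computed; the halved difference between neighbouring frames used here is an assumption, and the function names are hypothetical.

```python
def delta(frames):
    """Assumed delta: half the difference between the next and previous frame.

    At the first and last frame the frame itself stands in for the missing neighbour.
    """
    out = []
    last = len(frames) - 1
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def assemble_features(log_power, cepstra):
    """Build 34-dimensional vectors per frame.

    log_power : list of floats, one per frame
    cepstra   : list of 16-element lists, one per frame
    Output order: log power, 16 cepstra, delta log power, 16 delta cepstra.
    """
    static = [[p] + list(c) for p, c in zip(log_power, cepstra)]  # 17 dims
    dynamic = delta(static)                                       # 17 dims
    return [s + d for s, d in zip(static, dynamic)]               # 34 dims
```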

[0036] Based on the time-series data of the feature parameters stored in the buffer memory 3, the word matching unit 4 calculates Viterbi scores for the data within the word matching interval by the known one-pass Viterbi decoding method, using the word HMM 11 for recognizing registered words and the garbage HMM 12 for detecting unregistered words, and outputs the words corresponding to the maximum Viterbi score as the recognized word sequence.
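The Viterbi score on which this decision rests can be sketched for a single model and a single segment as below. This is not the one-pass decoder itself, which additionally searches over word boundaries; it only shows the log-domain maximisation over state paths and the final choice of the best-scoring model. The per-frame emission log-likelihoods are assumed to be precomputed, and all names are hypothetical.

```python
import math


def viterbi_score(log_init, log_trans, log_emit):
    """Log-probability of the best state path.

    log_init[s]      : log initial probability of state s
    log_trans[p][s]  : log transition probability from state p to state s
    log_emit[t][s]   : log-likelihood of frame t in state s (assumed precomputed)
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    for t in range(1, len(log_emit)):
        # For each state keep only the best predecessor.
        score = [
            max(score[p] + log_trans[p][s] for p in range(n_states)) + log_emit[t][s]
            for s in range(n_states)
        ]
    return max(score)


def best_model(scores):
    """Return the name of the model (word HMM or garbage HMM) with the maximum score."""
    return max(scores, key=scores.get)
```

If the garbage model wins for a segment, that segment would be treated as an unregistered word rather than as any registered word.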

[0037] In the above embodiment, the word matching unit 4 and the HMM learning unit 20 are implemented by, for example, a digital electronic computer.

[0038]

[Examples] To confirm the effectiveness of the HMM learning unit 20 of this embodiment, the inventor conducted the following experiment. The experimental conditions are shown in Table 1. As the CD phoneme HMM, a 200-state speaker-independent HM network was used, obtained by combining a plurality of speaker-adapted hidden Markov networks (hereinafter referred to as HM networks). As the target vocabulary for spotting, 20 words were selected from a spontaneous speech dialogue corpus owned by the present applicant whose task is travel planning such as hotel reservation (see Conventional Document 3: Morimoto et al., "A speech and language database for speech translation research", Proc. ICSLP 94, Vol. IV, pp. 1791-1794, 1994).

[0039]

[Table 1] Experimental conditions
───────────────────────────────────
Acoustic analysis conditions
  Sampling frequency: 12 kHz
  Quantization: 16 bits
  Pre-emphasis: 1 - 0.97 z^-1
  Hamming window: 20 ms
  Frame shift: 5 ms
  Feature parameters: 16-dimensional LPC cepstrum + 16-dimensional Δ cepstrum + power + Δ power
───────────────────────────────────
HMM topology
  Word HMM: 3 or 4 states, 5 mixtures
  Silence HMM: 1 state, 10 mixtures
  Garbage HMM: 1 state, 20 mixtures
───────────────────────────────────

[0040] In training by the ME learning method, it is important to start from as good an initial model as possible. In this experiment, the initial garbage HMM was created by combining a plurality of speaker-dependent models. In this method, a plurality of HMMs are first created so as to secure resolution with respect to both acoustic characteristics and speaker characteristics, and they are then combined into a single HMM having the desired number of mixtures.

[0041] As word learning data, 20 sets of pseudo word learning data were generated for the entire vocabulary. Training also requires learning data corresponding to unregistered words. If statistical language data on unregistered words is available, word learning data for unregistered words can be generated by building a language model that represents unregistered words in general and applying the data generation method described above to it. In this experiment, as an example of applying the method when such language data is not available, substitutes for the unregistered-word learning data were selected from the pseudo word learning data. Specifically, in calculating the word insertion error likelihood index value for each registered word in Expression 1, the substitute for the unregistered-word learning data was the pseudo word learning data, excluding that of the registered word in question, that gives the maximum Viterbi score for the HMM of that registered word. As a result, the garbage HMM 12 is trained so that, for each set of word learning data, it gives a score lower than that of the correct word HMM and higher than that of the incorrect word HMMs.

[0042] The initial garbage HMM thus created was then retrained by the learning method described above. In this experiment, the means and the mixture weights were updated. The cost function value converged after 20 iterations.

[0043] Next, the spotting experiment and its results are described. Using the retrained garbage HMM, a speaker-open registered-word spotting experiment was conducted for one male and one female speaker. As test data, four dialogues were selected from the spontaneous speech dialogue corpus mentioned above. The total number of utterances was 60, and the registered words appeared 22 times in total. As shown in FIG. 10, using the garbage HMM 12 retrained by ME learning with the method of the present invention improved the word insertion rate as a function of the registered-word deletion rate, compared with using the initial garbage HMM. This result shows that the learning method of the present invention is effective in improving spotting performance even when substitute unregistered-word learning data is used.

[0044] As described above, according to this embodiment, pseudo word learning data is generated from uniform random numbers and the garbage HMM 12 is retrained on it, so no work such as collecting or processing new speech samples for learning data is needed for the words to be estimated, and the time and cost required to change the vocabulary set of registered words are greatly reduced. The vocabulary set of registered words can therefore be changed and the garbage HMM 12 retrained more easily and quickly than in the conventional example. Furthermore, speech recognition using the retrained garbage HMM 12 achieves a recognition rate almost equal to that of the conventional example.

[0045]

[Effects of the Invention] As described in detail above, the hidden Markov model learning apparatus according to the present invention comprises: random number generating means for generating multidimensional uniform random numbers; data generating means for converting the multidimensional uniform random numbers generated by the random number generating means into a plurality of Gaussian random numbers that follow the multidimensional Gaussian distribution of a predetermined word hidden Markov model for recognizing pre-registered registered words, and for outputting the converted Gaussian random numbers as pseudo word learning data consisting of a plurality of feature parameters; and learning means for learning the parameters of a garbage hidden Markov model for detecting unregistered words that are not registered in advance, by updating those parameters, based on the pseudo word learning data output from the data generating means and the word hidden Markov model, so that the value of a predetermined cost function is minimized. Accordingly, pseudo word learning data is generated from uniform random numbers and the garbage hidden Markov model is retrained on it, so no work such as collecting or processing new speech samples for learning data is needed for the words to be estimated, and the time and cost required to change the vocabulary of registered words are greatly reduced. The vocabulary of registered words can therefore be changed and the garbage hidden Markov model retrained more easily and quickly than in the conventional example.
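The conversion from uniform random numbers to Gaussian random numbers can be illustrated with the Box-Muller transform. The text does not name the transform that is actually used, so Box-Muller is an assumption here, as is the diagonal covariance (one mean and one variance per dimension); the function names are hypothetical.

```python
import math
import random


def gaussian_from_uniform(u1, u2):
    """Box-Muller transform: two uniforms (u1 in (0, 1], u2 in [0, 1)) to one standard normal."""
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def pseudo_feature_vector(means, variances, rng):
    """Draw one pseudo feature vector from a Gaussian with the given means and variances.

    means, variances : per-dimension parameters, assumed to come from one HMM state
    rng              : a random.Random instance supplying the uniform random numbers
    """
    vec = []
    for mu, var in zip(means, variances):
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(u1) is defined
        u2 = rng.random()
        vec.append(mu + math.sqrt(var) * gaussian_from_uniform(u1, u2))
    return vec
```

Calling `pseudo_feature_vector` once per frame, with the parameters of the state occupied at that frame, would give a sequence of pseudo feature parameters for one word without recording any speech.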

[0046] The speech recognition apparatus according to the present invention comprises the above hidden Markov model learning apparatus and speech recognition means that, based on the speech signal of an input spoken utterance, performs speech recognition using the word hidden Markov model for recognizing pre-registered registered words and the garbage hidden Markov model, trained by the hidden Markov model learning apparatus, for detecting unregistered words that are not registered in advance, and outputs the recognition result. Accordingly, speech recognition can be performed at a recognition rate almost equal to that of the conventional example, using a garbage hidden Markov model that is retrained more easily and quickly than in the conventional example.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the garbage HMM learning process executed by the HMM learning unit 20 of FIG. 1.

FIG. 3 is a flowchart showing the word learning data generation process, which is a subroutine of FIG. 2.

FIG. 4 is a flowchart showing the feature parameter sequence generation process, which is a subroutine of FIG. 3.

FIG. 5 is a flowchart showing the cost function calculation process, which is a subroutine executed in step S2 of FIG. 2.

FIG. 6 is a flowchart showing the word deletion error likelihood index value calculation process, which is a subroutine of FIG. 5.

FIG. 7 is a flowchart showing the word insertion error likelihood index value calculation process, which is a subroutine of FIG. 5.

FIG. 8 shows the structure of a word HMM, in which (a) is a state transition diagram showing the connections between states in the word HMM, and (b) is a diagram showing the information structure of the word HMM.

FIG. 9 is a state transition diagram showing the language model for spotting used in the speech recognition apparatus of FIG. 1.

FIG. 10 is a graph showing the word insertion rate against the registered-word deletion rate, obtained in an experiment with the speech recognition apparatus of FIG. 1.

[Explanation of Reference Numerals] 1: microphone; 2: feature extraction unit; 3: buffer memory; 4: word matching unit; 11: word HMM; 12: garbage HMM; 20: HMM learning unit; 21: working memory.

Claims

[Claims]

1. A hidden Markov model learning apparatus comprising: random number generating means for generating multidimensional uniform random numbers; data generating means for converting the multidimensional uniform random numbers generated by the random number generating means into a plurality of Gaussian random numbers that follow the multidimensional Gaussian distribution of a predetermined word hidden Markov model for recognizing pre-registered registered words, and for outputting the converted Gaussian random numbers as pseudo word learning data consisting of a plurality of feature parameters; and learning means for learning a plurality of parameters of a garbage hidden Markov model for detecting unregistered words that are not registered in advance, by updating those parameters, based on the pseudo word learning data output from the data generating means and the word hidden Markov model, so that the value of a predetermined cost function is minimized.

2. The hidden Markov model learning apparatus according to claim 1, wherein the cost function is a function calculated based on: an index value indicating the likelihood of an error in which a spoken word is not recognized, calculated for a recognition target word from the score for speech recognition obtained with the word learning data and the word hidden Markov model and the score for speech recognition obtained with the word learning data and the garbage hidden Markov model; and an index value indicating the likelihood of an error in which an unspoken word appears in the recognition result, calculated for the recognition target word from the score for speech recognition obtained with the word learning data excluding that of the recognition target word and the word hidden Markov model and the score for speech recognition obtained with the word learning data and the garbage hidden Markov model.

3. The hidden Markov model learning apparatus according to claim 2, wherein the score indicating the likelihood for speech recognition is a score calculated by a Viterbi decoding method.

4. A speech recognition apparatus comprising: the hidden Markov model learning apparatus according to claim 1, 2 or 3; and speech recognition means that, based on the speech signal of an input spoken utterance, performs speech recognition using the word hidden Markov model for recognizing pre-registered registered words and the garbage hidden Markov model, trained by the hidden Markov model learning apparatus, for detecting unregistered words that are not registered in advance, and outputs a speech recognition result.