JPH07239695A - Learning method of hidden markov model - Google Patents

Learning method of hidden markov model

Info

Publication number
JPH07239695A
JPH07239695A JP6029291A JP2929194A
Authority
JP
Japan
Prior art keywords
learning
sentence
phoneme
hmm
markov model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP6029291A
Other languages
Japanese (ja)
Other versions
JP3091623B2 (en)
Inventor
Takashi I (傑 易)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oki Electric Industry Co Ltd
Original Assignee
Oki Electric Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oki Electric Industry Co Ltd filed Critical Oki Electric Industry Co Ltd
Priority to JP06029291A priority Critical patent/JP3091623B2/en
Publication of JPH07239695A publication Critical patent/JPH07239695A/en
Application granted granted Critical
Publication of JP3091623B2 publication Critical patent/JP3091623B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Abstract

PURPOSE: To enable highly accurate learning of phoneme HMMs (Hidden Markov Models) when the phoneme HMMs are learned from sentence speech by a concatenated learning method, by eliminating the mismatch between the speech data and the phoneme notation caused by differences in the nasalization of "ga"-row sounds and in the realization of long vowels.

CONSTITUTION: When phoneme HMMs are learned from sentence speech, a sentence HMM is first formed in step 5 by using the phoneme notation attached to the input sentence speech and substituting /g/ and /ng/ at each point where a "ga"-row syllable occurs, while referring to a phoneme HMM dictionary 4 for recognition. In step 6 the candidate sentence HMMs are matched against the input speech, their likelihoods are calculated, and the syllable notation is determined. The phoneme HMMs are then connected to form the sentence HMM in step 7, the sentence HMM is trained in step 8 and decomposed back into phoneme HMMs in step 9, and whether the phoneme HMMs have converged is decided in step 12. If they have not converged, the process returns via step 11 to the sentence HMM training of step 8, and the training and decomposition are repeated.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a learning method for Hidden Markov Models (hereinafter referred to as HMMs) in speech recognition.

[0002]

2. Description of the Related Art

Techniques in this field have been described, for example, in the following references.

Reference 1: S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, 62[4], pp. 1035-1074 (April 1983), American Telephone and Telegraph Company (US).

Reference 2: Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (July 1988), IEICE, pp. 55-61.

In recent years, speech recognition technology has shifted from classical pattern-matching methods to statistical methods, and the latter are becoming mainstream. As a statistical method, the Markov model with probabilistic finite states (hereinafter referred to as HMM) has been proposed. In general, an HMM consists of a plurality of states (representing, for example, speech features) and transitions between those states. An HMM has transition probabilities, which describe the transitions between states, and output probabilities, which describe the labels (typical speech feature parameters, usually numbering from several tens to several thousands) emitted when a transition is made. A speech recognition method using such HMMs is described in the above references, and an example for word speech recognition is shown in Fig. 2.
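The HMM just described — states, transition probabilities, and label output probabilities — can be sketched numerically. The following is a minimal illustration, not taken from the patent: the two-state left-to-right topology and all probability values are hypothetical, and for simplicity the label is emitted by the destination state of each transition.

```python
import numpy as np

# Minimal discrete HMM sketch (hypothetical values, not from the patent):
# a[i, j] = state transition probability, b[j, k] = probability that
# state j outputs label k. Labels are emitted by the destination state.

def sequence_likelihood(a, b, labels, start_state=0):
    """Total probability that the HMM emits the observed label sequence."""
    alpha = np.zeros(b.shape[0])
    alpha[start_state] = 1.0
    for k in labels:
        alpha = (alpha @ a) * b[:, k]   # one transition + one label output
    return float(alpha.sum())

a = np.array([[0.5, 0.5],
              [0.0, 1.0]])             # 2-state left-to-right model
b = np.array([[0.6, 0.4],
              [0.1, 0.9]])             # 2 possible labels
print(sequence_likelihood(a, b, [0, 1]))
```

Because the rows of `a` and `b` are proper distributions, the likelihoods of all label sequences of a given length sum to one, which is a convenient sanity check on the sketch.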

[0003] Fig. 2 shows the structure of a word HMM used in this speech recognition method. In Fig. 2, s1, s2, s3, and s4 are states of the HMM representing, for example, speech features; a11, a12, a22, a23, a33, a34, a44, and a45 are state transition probabilities; and b1(k), b2(k), b3(k), and b4(k) are label output probabilities. In the HMM, when a state transition is made with state transition probability aij (i = 1, ..., 4; j = 1, ..., 5), a label is output with label output probability bj(k). For each spoken word, an HMM is trained so as to output the label sequence of that word with the highest probability. The label sequence of an unknown spoken word is then input, and the word HMM that gives the highest output probability is taken as the recognition result. If words are replaced by sentences, speech uttered sentence by sentence can be recognized in the same way. In this type of speech recognition method, an HMM is assigned to each spoken word or sentence itself and trained, and the recognition result is determined by the likelihood (that is, the output probability of the label sequence). Such word or sentence HMMs guarantee excellent recognition accuracy, but they have drawbacks: as the recognition vocabulary grows, an enormous amount of training data is required, and speech other than the trained items cannot be recognized at all.
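The recognition rule above — score the unknown label sequence against every word HMM and return the word whose model gives the highest output probability — can be sketched as follows. All model parameters are hypothetical; `forward` is a minimal forward-pass likelihood, included so the sketch is self-contained.

```python
import numpy as np

# Hedged sketch of the recognition rule above: the word whose HMM gives the
# highest output probability for the unknown label sequence is the result.
# All parameters are hypothetical; emission is on the destination state.

def forward(a, b, labels):
    alpha = np.zeros(b.shape[0])
    alpha[0] = 1.0
    for k in labels:
        alpha = (alpha @ a) * b[:, k]
    return float(alpha.sum())

def recognize(word_models, labels):
    return max(word_models, key=lambda w: forward(*word_models[w], labels))

a = np.array([[0.5, 0.5], [0.0, 1.0]])
word_models = {
    "word_A": (a, np.array([[0.9, 0.1], [0.9, 0.1]])),  # prefers label 0
    "word_B": (a, np.array([[0.1, 0.9], [0.1, 0.9]])),  # prefers label 1
}
print(recognize(word_models, [0, 0]))   # → word_A
```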

[0004] In phonetics, on the other hand, words and sentences are usually represented as sequences of phonetic units called phonemes. Accordingly, there is a method in which an HMM is prepared for each phoneme, these phoneme HMMs are concatenated to generate a word or sentence HMM, and word recognition is performed. In particular, when recognizing sentence speech it is difficult to prepare a large amount of sentence speech, so it is nearly impossible to train HMMs for all the sentences to be recognized. For sentence speech, therefore, generating the sentence HMM from phoneme HMMs is the practical approach. To train the phonemes, information indicating the interval in which each phoneme occurs in the training data — that is, label information — must also be prepared. Automatic labeling by computer, however, does not achieve satisfactory accuracy, so labeling is done almost entirely by hand. Learning methods that require no label information have therefore been proposed. In such a method, initial models of the phoneme HMMs are first prepared. For training data consisting of sentence utterances whose content is known but which carry no labels, these initial phoneme HMMs are concatenated to construct sentence HMMs, and the sentence HMMs are trained on the sentence utterance data. In this case, the training process can be carried out as long as the beginning and end of each sentence are known. The sentence HMMs are then decomposed, by the reverse of the concatenation procedure, to yield phoneme HMMs. To improve training accuracy, the concatenated training and the decomposition are repeated, producing accurate phoneme HMMs. Naturally, this concatenated learning method can also be applied to word speech.

[0005]

[Problems to be Solved by the Invention] The conventional phoneme HMM concatenated learning method, however, has the following problems.

(1) When speakers utter sentence speech, whether the "ga"-row syllables and "ga"-row contracted sounds (hereinafter simply the "ga" row) are nasalized varies from speaker to speaker, because speakers differ in their region of origin, educational background, and so on. For example, "kokugo" (the national language) may be pronounced in two ways, /kokugo/ or /kokungo/. If this difference is ignored when training the HMMs, and the phoneme HMMs are concatenated and trained according to a uniformly non-nasalized (or uniformly nasalized) phoneme notation, speech data of different character are assigned to the same HMM, and a loss of phoneme HMM accuracy is unavoidable. This difference in "ga"-row pronunciation must therefore be detected. Conventionally, detecting the nasalization of the "ga" row has had to rely on human listening, which requires a great deal of time and labor.

(2) Similarly, because speakers differ in their region of origin, educational background, and so on, how long vowels are pronounced varies from speaker to speaker. For example, "toukyou" (Tokyo) may be pronounced in four ways: /toukyou/, /tookyoo/, /toukyoo/, or /tookyou/. If this difference is ignored when training the HMMs, and the phoneme HMMs are concatenated uniformly according to the romanized notation (that is, /toukyou/), speech data of different character are again assigned to the same HMM, and a loss of phoneme HMM accuracy is unavoidable. The difference in long-vowel pronunciation must therefore be detected, and conventionally this detection, too, has had to rely on human listening, requiring a great deal of time and labor.

Thus, the nasalization of the "ga" row and the differing pronunciations of long vowels inevitably degrade phoneme HMM accuracy, and avoiding this degradation has required human listening to detect "ga"-row nasalization and long-vowel differences, at a large cost in time and labor.

[0006]

[Means for Solving the Problems] To solve the above problems, in the first invention, when phoneme HMMs are learned from continuous speech data, initial models of the phoneme HMMs are concatenated to construct a sentence HMM. An HMM learning method is used that learns the phoneme HMMs by repeating a learning process that trains the sentence HMM, a decomposition process that, after the learning process, decomposes the learning result into phoneme HMMs, and a concatenation process that reassembles the decomposed phoneme HMMs into a sentence HMM. In this method, the following means are taken: whether a "ga"-row syllable or "ga"-row contracted sound contained in the training sentence speech data has been nasalized is detected by a speech recognition technique, and when the phoneme HMMs are concatenated to generate the sentence HMM, the phoneme HMMs corresponding to the recognition result are concatenated and trained, so that the phoneme HMMs are learned accordingly. The second invention takes the following means in an HMM learning method similar to that of the first invention: differences in the pronunciation of long vowels contained in the training sentence speech data are detected by a speech recognition technique, and when the phoneme HMMs are concatenated to generate the sentence HMM, the phoneme HMMs corresponding to the detected long-vowel pronunciations are concatenated and trained in accordance with the recognition result, so that the phoneme HMMs are learned accordingly.

[0007]

[Operation] According to the first invention, with the HMM configured as described above, if the training sentence speech data contain N "ga"-row syllables or "ga"-row contracted sounds, the number of sentence HMMs made possible by nasalization of these "ga"-row sounds is 2^N. By matching these 2^N sentence HMMs against the training sentence speech data, whether each "ga"-row syllable or contracted sound contained in the data has been nasalized is detected by a speech recognition technique. In this way, before the concatenated learning is performed, the nasalization of each "ga"-row syllable or contracted sound in the training sentence speech data is accurately detected by the speech recognition technique and the corresponding HMMs are concatenated for training; it can therefore be recognized whether each "ga"-row sound has been nasalized, and highly accurate phoneme HMM learning becomes possible. According to the second invention, with the HMM configured as described above, the differences in long-vowel pronunciation are detected by a speech recognition technique, by matching the training sentence speech data against the plurality of sentence HMMs made possible by those differences. In this way, before the concatenated learning is performed, the long-vowel pronunciations in the training sentence speech data are accurately recognized and the corresponding HMMs are concatenated for training; the differences in long-vowel pronunciation can therefore be recognized, and highly accurate phoneme HMM learning becomes possible. The above problems can thereby be solved.
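The 2^N count argued above can be made concrete: each "ga"-row site doubles the number of candidate transcriptions. A minimal sketch, treating the phoneme string as a list of symbols (the /g/ vs /ng/ convention follows the text; the function name is illustrative):

```python
from itertools import product

# Sketch of the candidate enumeration above: each "ga"-row site may be
# realized as /g/ or as its nasalized counterpart /ng/, giving 2**N
# candidate transcriptions for N sites.

def nasalization_candidates(phonemes):
    choices = [("g", "ng") if p == "g" else (p,) for p in phonemes]
    return ["".join(c) for c in product(*choices)]

cands = nasalization_candidates(list("gagaku"))   # N = 2 sites -> 4 candidates
print(len(cands), cands)
```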

[0008]

[Embodiments] First Embodiment. Fig. 1 is a flowchart of the processing of an HMM learning method according to a first embodiment of the present invention; the HMM learning method of this first embodiment is described with reference to this figure. In the HMM learning method of the first embodiment, steps 1 to 13 of Fig. 1 are executed, for example, on a program-controlled computer. First, when learning is started in step 1 of Fig. 1, a speech signal of the training data (for example, sentence speech) is input in step 2, and the process proceeds to the preprocessing of step 3. In the preprocessing of step 3, for example, the input analog speech signal is converted into a digital signal by analog-to-digital conversion (hereinafter, A/D conversion), speech feature parameters are extracted, for example as LPC cepstra obtained by LPC (Linear Predictive Coding) analysis, and the process proceeds to step 5. In step 5, using the phoneme notation attached to the input sentence speech and referring to a phoneme HMM dictionary 4 for recognition prepared in advance, each point where a "ga"-row syllable occurs is detected, /g/ and /ng/ are substituted there to generate candidate sentence HMMs, and the process proceeds to step 6. In step 6, the sentence HMMs generated in step 5 are matched against the input speech, their likelihoods are calculated, and the sentence HMM giving the highest output probability is passed to the next step 7 as the correct phoneme notation. Through steps 1 to 6, whether each "ga"-row syllable contained in the training sentence speech data has been nasalized is thus recognized.
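The decision of step 6 can be sketched as picking the maximum-likelihood candidate. The numeric scores below are hypothetical placeholders for the HMM matching of step 6, not values from the patent.

```python
# Hedged sketch of step 6: adopt, as the correct phoneme notation, the
# candidate sentence HMM with the highest likelihood against the input.
# The scores are hypothetical stand-ins for the actual HMM likelihoods.

def decide_notation(candidate_scores):
    """candidate_scores: dict mapping transcription -> likelihood."""
    return max(candidate_scores, key=candidate_scores.get)

scores = {
    "kokugowomanabu":  0.12,   # hypothetical likelihood
    "kokungowomanabu": 0.34,   # hypothetical likelihood
}
print(decide_notation(scores))   # → kokungowomanabu
```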

[0009] For example, if the sentence "kokugo wo manabu" (to study the national language) is input, the following two sentence HMMs are generated in step 5:

(1) /kokugowomanabu/
(2) /kokungowomanabu/

In step 6, these two sentence HMMs are matched against the input speech and their likelihoods are calculated. Suppose here that the second HMM gives the larger output probability; then /kokungowomanabu/ is passed to the next step 7 as the correct phoneme notation. In step 7, referring to the phoneme notation determined in step 6 and a phoneme dictionary 10, the phoneme HMMs are concatenated to generate the sentence HMM, and the result is sent to step 8. In step 8, the sentence HMM parameters are estimated using the input training speech. For this estimation, for example, the Baum-Welch (B-W) algorithm described in Reference 2 is used. In the B-W algorithm, for an observed label sequence O = o_1, o_2, ..., o_T and a state sequence I = i_1, i_2, ..., i_T, a forward variable α_t(i) and a backward variable β_t(i) are defined as in equation (1):

α_t(i) = Pr(o_1, o_2, ..., o_t, i_t = s_i)
β_t(i) = Pr(o_{t+1}, o_{t+2}, ..., o_T | i_t = s_i)   ...(1)

The state transition probability a_ij and the label output probability b_j(k) are then estimated as in the following equation (2).
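The forward and backward variables of equation (1) can be computed recursively. A minimal numerical sketch, with hypothetical parameters; for brevity the label is emitted by the destination state and the model starts deterministically in its first state:

```python
import numpy as np

# Hedged numerical sketch of equation (1): forward variables alpha[t, i]
# and backward variables beta[t, i] for a discrete HMM. Parameters are
# hypothetical; emission is attached to the destination state for brevity.

def forward_backward(a, b, labels):
    T, n = len(labels), b.shape[0]
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))
    alpha[0, 0] = b[0, labels[0]]                 # start in state s1
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * b[:, labels[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (b[:, labels[t + 1]] * beta[t + 1])
    return alpha, beta

a = np.array([[0.6, 0.4], [0.3, 0.7]])
b = np.array([[0.8, 0.2], [0.2, 0.8]])
alpha, beta = forward_backward(a, b, [0, 1, 0])
# For every t, sum_i alpha[t, i] * beta[t, i] equals Pr(O) — a useful check.
print((alpha * beta).sum(axis=1))
```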

[0010]

[Equation 1]

When the training of the sentence HMM is finished in this way, the sentence HMM is decomposed into phoneme HMMs in step 9, and the corrected phoneme HMMs are stored in a phoneme dictionary 10. Whether these phoneme HMMs have converged is checked in step 12; if they have converged (that is, if the difference between the previous and current values of the phoneme HMM parameters is sufficiently small), learning ends at step 13. If the check in step 12 shows that they have not converged, the phoneme HMMs decomposed in step 9 are concatenated in step 11 to reconstruct the sentence HMM, the process returns from step 11 to the sentence HMM training of step 8, and the learning and decomposition described above are repeated.
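The typeset form of the re-estimation formulas of equation (2) is not reproduced in this text. They are the standard Baum-Welch updates; a reconstruction consistent with the forward and backward variables of equation (1) would be:

```latex
% Standard Baum-Welch re-estimation (reconstruction: the original equation
% image is not reproduced here; this is the textbook form consistent with
% the forward/backward variables of equation (1)).
\begin{aligned}
\bar{a}_{ij} &=
  \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}
       {\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)},\\[4pt]
\bar{b}_j(k) &=
  \frac{\sum_{t\,:\,o_t = k} \alpha_t(j)\, \beta_t(j)}
       {\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}.
\end{aligned}
```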

[0011] As described above, the first embodiment has the following advantage. When phoneme HMMs are learned from sentence speech, candidate sentence HMMs are generated in step 5 by substituting /g/ and /ng/ at each point where a "ga"-row syllable occurs; in step 6 these sentence HMMs are matched against the input speech, their likelihoods are calculated, and whether each "ga"-row sound contained in the training speech data has been nasalized is detected, so the concatenated learning can be performed accurately. The mismatch between the speech data and the phoneme notation caused by nasalization of "ga"-row sounds is thereby eliminated, and highly accurate HMM learning becomes possible.

[0012] Second Embodiment. Fig. 3 is a flowchart of the processing of an HMM learning method according to a second embodiment of the present invention. Whereas the first embodiment detects whether the "ga"-row sounds contained in the training sentence speech data have been nasalized, this second embodiment detects the long vowels contained in the training sentence speech data. The HMM learning method of the second embodiment is described with reference to this figure. In the HMM learning method of the second embodiment, steps 21 to 33 of Fig. 3 are executed, for example, on a program-controlled computer. First, when learning is started in step 21 of Fig. 3, a speech signal of the training data (for example, sentence speech) is input in step 22, and the process proceeds to the preprocessing of step 23. In the preprocessing of step 23, for example, the input analog speech signal is converted into a digital signal by A/D conversion, speech feature parameters are extracted, for example as LPC cepstra obtained by LPC analysis, and the process proceeds to step 25. In step 25, using the phoneme notation attached to the input sentence speech and referring to a phoneme HMM dictionary 24 for recognition, each point containing a long vowel is detected, the phoneme HMMs of the possible pronunciations are substituted there to generate candidate sentence HMMs, and the process proceeds to step 26. In step 26, the sentence HMMs generated in step 25 are matched against the input speech, their likelihoods are calculated, and the sentence HMM giving the highest output probability is passed to the next step 27 as the correct phoneme notation. Through steps 21 to 26, the differences in the pronunciation of the long vowels contained in the training sentence speech data are thus recognized.

[0013] For example, if the sentence "toukyou ni iku" (to go to Tokyo) is input, the following four sentence HMMs are generated in step 25:

(1) /toukyouniiku/
(2) /toukyooniiku/
(3) /tookyouniiku/
(4) /tookyooniiku/

In step 26, these four sentence HMMs are matched against the input speech, their likelihoods are calculated, and the process proceeds to step 27. In this step 26, the HMM corresponding to the pronunciation of the long vowels contained in the input sentence speech is detected. Suppose here that the fourth HMM gives the largest output probability; then /tookyooniiku/ is passed to the next step 27 as the correct phoneme notation. In step 27, referring to the phoneme notation determined in step 26 and a phoneme dictionary 30, the phoneme HMMs are concatenated to generate the sentence HMM, and the result is sent to step 28. In step 28, the sentence HMM parameters are estimated using the input training speech. For this estimation, for example, the B-W algorithm is used: for the observed label sequence O = o_1, o_2, ..., o_T and state sequence I = i_1, i_2, ..., i_T, the forward variable α_t(i) and backward variable β_t(i) are defined as in equation (1), and the state transition probability a_ij and the label output probability b_j(k) are estimated as in equation (2).
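The four candidates above arise combinatorially from two long-vowel sites, each of which may be realized as /ou/ or /oo/. A minimal sketch of this candidate generation; writing each long-vowel site as a tuple of options is an illustrative convention, not the patent's notation:

```python
from itertools import product

# Sketch of step 25's candidate generation: each long-vowel site may be
# realized in more than one way; sites are written as tuples of options
# (an illustrative convention, not the patent's notation).

def pronunciation_variants(units):
    choices = [u if isinstance(u, tuple) else (u,) for u in units]
    return ["".join(c) for c in product(*choices)]

cands = pronunciation_variants(["t", ("ou", "oo"), "ky", ("ou", "oo"), "niiku"])
print(len(cands), cands)   # 2 sites × 2 options each → 4 candidates
```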

[0014] When the training of the sentence HMM is finished in this way, the sentence HMM is decomposed into phoneme HMMs in step 29, and the corrected phoneme HMMs are stored in a phoneme dictionary 30. Whether these phoneme HMMs have converged is checked in step 32; if they have converged (that is, if the difference between the previous and current values of the phoneme HMM parameters is sufficiently small), learning ends at step 33. If the check in step 32 shows that they have not converged, the phoneme HMMs decomposed in step 29 are concatenated in step 31 to reconstruct the sentence HMM, the process returns from step 31 to the sentence HMM training of step 28, and the learning and decomposition described above are repeated. As described above, the second embodiment has the following advantage. When phoneme HMMs are learned from sentence speech, candidate sentence HMMs are generated in step 25 by substituting the phoneme HMMs of the possible pronunciations at each point containing a long vowel; in step 26 the sentence HMMs are matched against the input speech, and the HMM corresponding to the pronunciation of the long vowels contained in the input sentence speech is detected, so the concatenated learning can be performed accurately. The mismatch between the speech data and the phoneme notation caused by differences in long-vowel pronunciation is thereby eliminated, and highly accurate HMM learning becomes possible. The present invention is not limited to the above embodiments, and various modifications are possible. For example, although sentence speech has been used as an example, the invention is also applicable to word speech and phrase speech.
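The convergence test of steps 12 and 32 can be sketched as a parameter-difference threshold. The tolerance value and parameter layout below are hypothetical choices, not taken from the patent:

```python
import numpy as np

# Hedged sketch of the convergence check (steps 12/32): stop iterating when
# every phoneme HMM parameter changes by less than a tolerance between the
# previous and current iterations. The tolerance is a hypothetical choice.

def converged(prev_params, new_params, tol=1e-4):
    return all(np.max(np.abs(p - q)) < tol
               for p, q in zip(prev_params, new_params))

prev = [np.array([0.50, 0.50]), np.array([0.90, 0.10])]
new = [np.array([0.50001, 0.49999]), np.array([0.90, 0.10])]
print(converged(prev, new))   # → True
```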

[0015]

[Effects of the Invention] As described in detail above, according to the first invention, it is first detected by a speech recognition technique whether the "ga"-row syllables and "ga"-row contracted sounds contained in the training sentence speech data have been nasalized, and when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the hidden Markov models corresponding to the detected "ga"-row syllables and contracted sounds are substituted at the positions where they occur and trained, so that concatenated training can be performed accurately. The inconsistency between the speech data and the phoneme notation caused by nasalization of "ga"-row sounds is therefore eliminated, and highly accurate HMM training becomes possible. According to the second invention, differences in the pronunciation of long vowels contained in the training sentence speech data are detected by a speech recognition technique, and when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the hidden Markov models corresponding to the detected long-vowel pronunciations are concatenated and trained in accordance with the recognition result, so that concatenated training can be performed accurately. The inconsistency between the speech data and the phoneme notation caused by differences in long-vowel pronunciation is therefore eliminated, and highly accurate HMM training becomes possible.
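The learning loop common to both inventions — concatenate phoneme models into a sentence model, train it, decompose it back into phoneme models, and repeat until convergence — could be sketched very roughly as follows. The scalar "models" and the averaging update are toy stand-ins for full HMMs and Baum-Welch re-estimation; all names and the equal-length alignment are illustrative assumptions:

```python
# Minimal sketch of the concatenate / train / decompose loop.
# Each "phoneme model" is a single scalar mean, and "training" is an
# average over the frames aligned to that phoneme across all sentences
# -- stand-ins for HMM parameters and Baum-Welch re-estimation.

def train_phoneme_models(sentences, models, eps=1e-4, max_iter=100):
    """sentences: list of (phoneme_notation, frames) pairs."""
    for _ in range(max_iter):
        # Concatenation: each sentence model is its phoneme models in
        # notation order, sharing parameters across sentences.
        # Training + decomposition: re-estimate each phoneme from the
        # frames aligned to every one of its occurrences.
        sums, counts = {}, {}
        for notation, frames in sentences:
            seg = max(1, len(frames) // len(notation))
            for i, p in enumerate(notation):
                for x in frames[i * seg:(i + 1) * seg]:
                    sums[p] = sums.get(p, 0.0) + x
                    counts[p] = counts.get(p, 0) + 1
        new = {p: sums[p] / counts[p] for p in sums}
        delta = max(abs(new[p] - models.get(p, 0.0)) for p in new)
        models.update(new)
        if delta < eps:   # convergence test (steps 12 / 32)
            break
    return models
```

In the patented method, the notation passed into such a loop is not fixed in advance but is first corrected by the likelihood-based detection of nasalized and long-vowel variants, which is what removes the data/notation mismatch during training.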

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the processing of an HMM learning method according to a first embodiment of the present invention.

FIG. 2 is a diagram showing an example of the structure of a word HMM used in a conventional speech recognition method.

FIG. 3 is a flowchart of the processing of an HMM learning method according to a second embodiment of the present invention.

[Explanation of Symbols]

4, 24   Phoneme HMM dictionary for recognition
5       Sentence HMM generation considering nasalization of "ga"-row syllables
6, 26   Phoneme notation determination by HMM likelihood calculation
7, 27   Sentence HMM construction by concatenation of phoneme HMMs
8, 28   Sentence HMM training (Baum-Welch algorithm)
9, 29   Decomposition of the sentence HMM into phoneme HMMs
10, 30  Phoneme HMM dictionary
11, 31  Sentence HMM reconstruction
12, 32  Phoneme HMM convergence determination
25      Sentence HMM generation considering fluctuation in long-vowel pronunciation

Claims (2)

[Claims]
1. A method of learning hidden Markov models in which, when phoneme hidden Markov models are learned using continuous speech data, initial phoneme hidden Markov models are concatenated to construct a sentence hidden Markov model, and the phoneme hidden Markov models are learned by repeating a learning process of training the sentence hidden Markov model, a decomposition process of decomposing, after the learning process, the training result into phoneme hidden Markov models, and a concatenation process of reconstructing the decomposed phoneme hidden Markov models into a sentence hidden Markov model, the method being characterized in that whether a "ga"-row syllable or a "ga"-row contracted sound contained in the training sentence speech data has been nasalized is detected by a speech recognition technique, and, when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the hidden Markov models corresponding to the recognition result are concatenated and trained, thereby learning the phoneme hidden Markov models.
2. A method of learning hidden Markov models in which, when phoneme hidden Markov models are learned using continuous speech data, initial phoneme hidden Markov models are concatenated to construct a sentence hidden Markov model, and the phoneme hidden Markov models are learned by repeating a learning process of training the sentence hidden Markov model, a decomposition process of decomposing, after the learning process, the training result into phoneme hidden Markov models, and a concatenation process of reconstructing the decomposed phoneme hidden Markov models into a sentence hidden Markov model, the method being characterized in that differences in the pronunciation of long vowels contained in the training sentence speech data are detected by a speech recognition technique, and, when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the phoneme hidden Markov models corresponding to the pronunciation of the long vowels are concatenated and trained in accordance with the recognition result, thereby learning the phoneme hidden Markov models.
JP06029291A 1994-02-28 1994-02-28 Learning method of hidden Markov model Expired - Fee Related JP3091623B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP06029291A JP3091623B2 (en) 1994-02-28 1994-02-28 Learning method of hidden Markov model


Publications (2)

Publication Number Publication Date
JPH07239695A true JPH07239695A (en) 1995-09-12
JP3091623B2 JP3091623B2 (en) 2000-09-25

Family

ID=12272150

Family Applications (1)

Application Number Title Priority Date Filing Date
JP06029291A Expired - Fee Related JP3091623B2 (en) 1994-02-28 1994-02-28 Learning method of hidden Markov model

Country Status (1)

Country Link
JP (1) JP3091623B2 (en)

Also Published As

Publication number Publication date
JP3091623B2 (en) 2000-09-25

Similar Documents

Publication Publication Date Title
JP4301102B2 (en) Audio processing apparatus, audio processing method, program, and recording medium
US5333275A (en) System and method for time aligning speech
US8019602B2 (en) Automatic speech recognition learning using user corrections
US6085160A (en) Language independent speech recognition
Al-Qatab et al. Arabic speech recognition using hidden Markov model toolkit (HTK)
JP2003316386A (en) Method, device, and program for speech recognition
EP2048655A1 (en) Context sensitive multi-stage speech recognition
KR101014086B1 (en) Voice processing device and method, and recording medium
JP5184467B2 (en) Adaptive acoustic model generation apparatus and program
Sirigos et al. A hybrid syllable recognition system based on vowel spotting
JP2886118B2 (en) Hidden Markov model learning device and speech recognition device
Syadida et al. Sphinx4 for indonesian continuous speech recognition system
JP3091623B2 (en) Learning Hidden Markov Model
Delić et al. A Review of AlfaNum Speech Technologies for Serbian, Croatian and Macedonian
Amdal et al. Pronunciation variation modeling in automatic speech recognition
KR102405547B1 (en) Pronunciation evaluation system based on deep learning
Gollan et al. Towards automatic learning in LVCSR: rapid development of a Persian broadcast transcription system.
Kessens et al. Improving recognition performance by modelling pronunciation variation.
JPH09114482A (en) Speaker adaptation method for voice recognition
JPH08211891A (en) Learning method for hidden markov model
JP2912513B2 (en) Learning Hidden Markov Model
JPH09160586A (en) Learning method for hidden markov model
JPH07121192A (en) Method for learning hidden markov model
Sadashivappa MLLR Based Speaker Adaptation for Indian Accents
JPH08211893A (en) Speech recognition device

Legal Events

Date Code Title Description
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20000711

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20080721

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090721

Year of fee payment: 9

LAPS Cancellation because of no payment of annual fees