JPH07239695A - Learning method of hidden markov model - Google Patents

Learning method of hidden markov model

Info

Publication number
JPH07239695A
JPH07239695A JP6029291A JP2929194A
Authority
JP
Japan
Prior art keywords
learning
sentence
phoneme
hmm
markov model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP6029291A
Other languages
Japanese (ja)
Other versions
JP3091623B2 (en)
Inventor
Takashi I (傑 易)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oki Electric Industry Co Ltd
Original Assignee
Oki Electric Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oki Electric Industry Co Ltd filed Critical Oki Electric Industry Co Ltd
Priority to JP06029291A priority Critical patent/JP3091623B2/en
Publication of JPH07239695A publication Critical patent/JPH07239695A/en
Application granted granted Critical
Publication of JP3091623B2 publication Critical patent/JP3091623B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Abstract

PURPOSE: To enable highly accurate learning of phoneme HMMs (Hidden Markov Models) when the phoneme HMMs are learned from sentence speech by a concatenated learning method, by eliminating the mismatch between the speech data and the phoneme notation caused by differences in the nasalization of "ga"-row sounds and in the realization of long vowels.

CONSTITUTION: When phoneme HMMs are learned from sentence speech, a sentence HMM is first formed in step 5 by using the phoneme notation attached to the input sentence speech and substituting /g/ and /ng/ at each point where a "ga"-row syllable occurs, while referring to a phoneme HMM dictionary 4 for recognition. In step 6 the candidate sentence HMMs are matched against the input speech, their likelihoods are calculated, and the syllable notation is determined. The phoneme HMMs are then connected to form the sentence HMM in step 7, the sentence HMM is trained in step 8 and decomposed back into phoneme HMMs in step 9, and whether the phoneme HMMs have converged is decided in step 12. If they have not converged, the process returns via step 11 to the sentence HMM training of step 8, and the training and decomposition are repeated.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a learning method for Hidden Markov Models (hereinafter referred to as HMMs) in speech recognition.

[0002]

2. Description of the Related Art

Techniques in this field have been described, for example, in the following references.

Reference 1: S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, 62[4], pp. 1035-1074 (April 1983), American Telephone and Telegraph Company (US).

Reference 2: Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (July 1988), IEICE, pp. 55-61.

In recent years, speech recognition technology has shifted from classical pattern-matching methods to statistical methods, and the latter are becoming mainstream. As a statistical method, the Markov model with probabilistic finite states (hereinafter referred to as HMM) has been proposed. In general, an HMM consists of a plurality of states (representing, for example, speech features) and transitions between those states. An HMM has transition probabilities, which describe the transitions between states, and output probabilities, which describe the labels (typical speech feature parameters, usually numbering from several tens to several thousands) emitted when a transition is made. A speech recognition method using such HMMs is described in the above references, and an example for word speech recognition is shown in Fig. 2.
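The HMM just described — states, transition probabilities, and label output probabilities — can be sketched numerically. The following is a minimal illustration, not taken from the patent: the two-state left-to-right topology and all probability values are hypothetical, and for simplicity the label is emitted by the destination state of each transition.

```python
import numpy as np

# Minimal discrete HMM sketch (hypothetical values, not from the patent):
# a[i, j] = state transition probability, b[j, k] = probability that
# state j outputs label k. Labels are emitted by the destination state.

def sequence_likelihood(a, b, labels, start_state=0):
    """Total probability that the HMM emits the observed label sequence."""
    alpha = np.zeros(b.shape[0])
    alpha[start_state] = 1.0
    for k in labels:
        alpha = (alpha @ a) * b[:, k]   # one transition + one label output
    return float(alpha.sum())

a = np.array([[0.5, 0.5],
              [0.0, 1.0]])             # 2-state left-to-right model
b = np.array([[0.6, 0.4],
              [0.1, 0.9]])             # 2 possible labels
print(sequence_likelihood(a, b, [0, 1]))
```

Because the rows of `a` and `b` are proper distributions, the likelihoods of all label sequences of a given length sum to one, which is a convenient sanity check on the sketch.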

[0003] Fig. 2 shows the structure of a word HMM used in this speech recognition method. In Fig. 2, s1, s2, s3, and s4 are states of the HMM representing, for example, speech features; a11, a12, a22, a23, a33, a34, a44, and a45 are state transition probabilities; and b1(k), b2(k), b3(k), and b4(k) are label output probabilities. In the HMM, when a state transition is made with state transition probability aij (i = 1, ..., 4; j = 1, ..., 5), a label is output with label output probability bj(k). For each spoken word, an HMM is trained so as to output the label sequence of that word with the highest probability. The label sequence of an unknown spoken word is then input, and the word HMM that gives the highest output probability is taken as the recognition result. If words are replaced by sentences, speech uttered sentence by sentence can be recognized in the same way. In this type of speech recognition method, an HMM is assigned to each spoken word or sentence itself and trained, and the recognition result is determined by the likelihood (that is, the output probability of the label sequence). Such word or sentence HMMs guarantee excellent recognition accuracy, but they have drawbacks: as the recognition vocabulary grows, an enormous amount of training data is required, and speech other than the trained items cannot be recognized at all.
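The recognition rule above — score the unknown label sequence against every word HMM and return the word whose model gives the highest output probability — can be sketched as follows. All model parameters are hypothetical; `forward` is a minimal forward-pass likelihood, included so the sketch is self-contained.

```python
import numpy as np

# Hedged sketch of the recognition rule above: the word whose HMM gives the
# highest output probability for the unknown label sequence is the result.
# All parameters are hypothetical; emission is on the destination state.

def forward(a, b, labels):
    alpha = np.zeros(b.shape[0])
    alpha[0] = 1.0
    for k in labels:
        alpha = (alpha @ a) * b[:, k]
    return float(alpha.sum())

def recognize(word_models, labels):
    return max(word_models, key=lambda w: forward(*word_models[w], labels))

a = np.array([[0.5, 0.5], [0.0, 1.0]])
word_models = {
    "word_A": (a, np.array([[0.9, 0.1], [0.9, 0.1]])),  # prefers label 0
    "word_B": (a, np.array([[0.1, 0.9], [0.1, 0.9]])),  # prefers label 1
}
print(recognize(word_models, [0, 0]))   # → word_A
```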

[0004] In phonetics, on the other hand, words and sentences are usually represented as sequences of phonetic units called phonemes. Accordingly, there is a method in which an HMM is prepared for each phoneme, these phoneme HMMs are concatenated to generate a word or sentence HMM, and word recognition is performed. In particular, when recognizing sentence speech it is difficult to prepare a large amount of sentence speech, so it is nearly impossible to train HMMs for all the sentences to be recognized. For sentence speech, therefore, generating the sentence HMM from phoneme HMMs is the practical approach. To train the phonemes, information indicating the interval in which each phoneme occurs in the training data — that is, label information — must also be prepared. Automatic labeling by computer, however, does not achieve satisfactory accuracy, so labeling is done almost entirely by hand. Learning methods that require no label information have therefore been proposed. In such a method, initial models of the phoneme HMMs are first prepared. For training data consisting of sentence utterances whose content is known but which carry no labels, these initial phoneme HMMs are concatenated to construct sentence HMMs, and the sentence HMMs are trained on the sentence utterance data. In this case, the training process can be carried out as long as the beginning and end of each sentence are known. The sentence HMMs are then decomposed, by the reverse of the concatenation procedure, to yield phoneme HMMs. To improve training accuracy, the concatenated training and the decomposition are repeated, producing accurate phoneme HMMs. Naturally, this concatenated learning method can also be applied to word speech.

[0005]

[Problems to be Solved by the Invention] The conventional phoneme HMM concatenated learning method, however, has the following problems.

(1) When speakers utter sentence speech, whether the "ga"-row syllables and "ga"-row contracted sounds (hereinafter simply the "ga" row) are nasalized varies from speaker to speaker, because speakers differ in their region of origin, educational background, and so on. For example, "kokugo" (the national language) may be pronounced in two ways, /kokugo/ or /kokungo/. If this difference is ignored when training the HMMs, and the phoneme HMMs are concatenated and trained according to a uniformly non-nasalized (or uniformly nasalized) phoneme notation, speech data of different character are assigned to the same HMM, and a loss of phoneme HMM accuracy is unavoidable. This difference in "ga"-row pronunciation must therefore be detected. Conventionally, detecting the nasalization of the "ga" row has had to rely on human listening, which requires a great deal of time and labor.

(2) Similarly, because speakers differ in their region of origin, educational background, and so on, how long vowels are pronounced varies from speaker to speaker. For example, "toukyou" (Tokyo) may be pronounced in four ways: /toukyou/, /tookyoo/, /toukyoo/, or /tookyou/. If this difference is ignored when training the HMMs, and the phoneme HMMs are concatenated uniformly according to the romanized notation (that is, /toukyou/), speech data of different character are again assigned to the same HMM, and a loss of phoneme HMM accuracy is unavoidable. The difference in long-vowel pronunciation must therefore be detected, and conventionally this detection, too, has had to rely on human listening, requiring a great deal of time and labor.

Thus, the nasalization of the "ga" row and the differing pronunciations of long vowels inevitably degrade phoneme HMM accuracy, and avoiding this degradation has required human listening to detect "ga"-row nasalization and long-vowel differences, at a large cost in time and labor.

[0006]

[Means for Solving the Problems] To solve the above problems, in the first invention, when phoneme HMMs are learned from continuous speech data, initial models of the phoneme HMMs are concatenated to construct a sentence HMM. An HMM learning method is used that learns the phoneme HMMs by repeating a learning process that trains the sentence HMM, a decomposition process that, after the learning process, decomposes the learning result into phoneme HMMs, and a concatenation process that reassembles the decomposed phoneme HMMs into a sentence HMM. In this method, the following means are taken: whether a "ga"-row syllable or "ga"-row contracted sound contained in the training sentence speech data has been nasalized is detected by a speech recognition technique, and when the phoneme HMMs are concatenated to generate the sentence HMM, the phoneme HMMs corresponding to the recognition result are concatenated and trained, so that the phoneme HMMs are learned accordingly. The second invention takes the following means in an HMM learning method similar to that of the first invention: differences in the pronunciation of long vowels contained in the training sentence speech data are detected by a speech recognition technique, and when the phoneme HMMs are concatenated to generate the sentence HMM, the phoneme HMMs corresponding to the detected long-vowel pronunciations are concatenated and trained in accordance with the recognition result, so that the phoneme HMMs are learned accordingly.

[0007]

[Operation] According to the first invention, with the HMM configured as described above, if the training sentence speech data contain N "ga"-row syllables or "ga"-row contracted sounds, the number of sentence HMMs made possible by nasalization of these "ga"-row sounds is 2^N. By matching these 2^N sentence HMMs against the training sentence speech data, whether each "ga"-row syllable or contracted sound contained in the data has been nasalized is detected by a speech recognition technique. In this way, before the concatenated learning is performed, the nasalization of each "ga"-row syllable or contracted sound in the training sentence speech data is accurately detected by the speech recognition technique and the corresponding HMMs are concatenated for training; it can therefore be recognized whether each "ga"-row sound has been nasalized, and highly accurate phoneme HMM learning becomes possible. According to the second invention, with the HMM configured as described above, the differences in long-vowel pronunciation are detected by a speech recognition technique, by matching the training sentence speech data against the plurality of sentence HMMs made possible by those differences. In this way, before the concatenated learning is performed, the long-vowel pronunciations in the training sentence speech data are accurately recognized and the corresponding HMMs are concatenated for training; the differences in long-vowel pronunciation can therefore be recognized, and highly accurate phoneme HMM learning becomes possible. The above problems can thereby be solved.
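The 2^N count argued above can be made concrete: each "ga"-row site doubles the number of candidate transcriptions. A minimal sketch, treating the phoneme string as a list of symbols (the /g/ vs /ng/ convention follows the text; the function name is illustrative):

```python
from itertools import product

# Sketch of the candidate enumeration above: each "ga"-row site may be
# realized as /g/ or as its nasalized counterpart /ng/, giving 2**N
# candidate transcriptions for N sites.

def nasalization_candidates(phonemes):
    choices = [("g", "ng") if p == "g" else (p,) for p in phonemes]
    return ["".join(c) for c in product(*choices)]

cands = nasalization_candidates(list("gagaku"))   # N = 2 sites -> 4 candidates
print(len(cands), cands)
```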

[0008]

[Embodiments] First Embodiment. Fig. 1 is a flowchart of the processing of an HMM learning method according to a first embodiment of the present invention; the HMM learning method of this first embodiment is described with reference to this figure. In the HMM learning method of the first embodiment, steps 1 to 13 of Fig. 1 are executed, for example, on a program-controlled computer. First, when learning is started in step 1 of Fig. 1, a speech signal of the training data (for example, sentence speech) is input in step 2, and the process proceeds to the preprocessing of step 3. In the preprocessing of step 3, for example, the input analog speech signal is converted into a digital signal by analog-to-digital conversion (hereinafter, A/D conversion), speech feature parameters are extracted, for example as LPC cepstra obtained by LPC (Linear Predictive Coding) analysis, and the process proceeds to step 5. In step 5, using the phoneme notation attached to the input sentence speech and referring to a phoneme HMM dictionary 4 for recognition prepared in advance, each point where a "ga"-row syllable occurs is detected, /g/ and /ng/ are substituted there to generate candidate sentence HMMs, and the process proceeds to step 6. In step 6, the sentence HMMs generated in step 5 are matched against the input speech, their likelihoods are calculated, and the sentence HMM giving the highest output probability is passed to the next step 7 as the correct phoneme notation. Through steps 1 to 6, whether each "ga"-row syllable contained in the training sentence speech data has been nasalized is thus recognized.
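The decision of step 6 can be sketched as picking the maximum-likelihood candidate. The numeric scores below are hypothetical placeholders for the HMM matching of step 6, not values from the patent.

```python
# Hedged sketch of step 6: adopt, as the correct phoneme notation, the
# candidate sentence HMM with the highest likelihood against the input.
# The scores are hypothetical stand-ins for the actual HMM likelihoods.

def decide_notation(candidate_scores):
    """candidate_scores: dict mapping transcription -> likelihood."""
    return max(candidate_scores, key=candidate_scores.get)

scores = {
    "kokugowomanabu":  0.12,   # hypothetical likelihood
    "kokungowomanabu": 0.34,   # hypothetical likelihood
}
print(decide_notation(scores))   # → kokungowomanabu
```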

[0009] For example, if the sentence "kokugo wo manabu" (to study the national language) is input, the following two sentence HMMs are generated in step 5:

(1) /kokugowomanabu/
(2) /kokungowomanabu/

In step 6, these two sentence HMMs are matched against the input speech and their likelihoods are calculated. Suppose here that the second HMM gives the larger output probability; then /kokungowomanabu/ is passed to the next step 7 as the correct phoneme notation. In step 7, referring to the phoneme notation determined in step 6 and a phoneme dictionary 10, the phoneme HMMs are concatenated to generate the sentence HMM, and the result is sent to step 8. In step 8, the sentence HMM parameters are estimated using the input training speech. For this estimation, for example, the Baum-Welch (B-W) algorithm described in Reference 2 is used. In the B-W algorithm, for an observed label sequence O = o_1, o_2, ..., o_T and a state sequence I = i_1, i_2, ..., i_T, a forward variable α_t(i) and a backward variable β_t(i) are defined as in equation (1):

α_t(i) = Pr(o_1, o_2, ..., o_t, i_t = s_i)
β_t(i) = Pr(o_{t+1}, o_{t+2}, ..., o_T | i_t = s_i)   ...(1)

The state transition probability a_ij and the label output probability b_j(k) are then estimated as in the following equation (2).
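The forward and backward variables of equation (1) can be computed recursively. A minimal numerical sketch, with hypothetical parameters; for brevity the label is emitted by the destination state and the model starts deterministically in its first state:

```python
import numpy as np

# Hedged numerical sketch of equation (1): forward variables alpha[t, i]
# and backward variables beta[t, i] for a discrete HMM. Parameters are
# hypothetical; emission is attached to the destination state for brevity.

def forward_backward(a, b, labels):
    T, n = len(labels), b.shape[0]
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))
    alpha[0, 0] = b[0, labels[0]]                 # start in state s1
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * b[:, labels[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (b[:, labels[t + 1]] * beta[t + 1])
    return alpha, beta

a = np.array([[0.6, 0.4], [0.3, 0.7]])
b = np.array([[0.8, 0.2], [0.2, 0.8]])
alpha, beta = forward_backward(a, b, [0, 1, 0])
# For every t, sum_i alpha[t, i] * beta[t, i] equals Pr(O) — a useful check.
print((alpha * beta).sum(axis=1))
```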

[0010]

[Equation 1]

When the training of the sentence HMM is finished in this way, the sentence HMM is decomposed into phoneme HMMs in step 9, and the corrected phoneme HMMs are stored in a phoneme dictionary 10. Whether these phoneme HMMs have converged is checked in step 12; if they have converged (that is, if the difference between the previous and current values of the phoneme HMM parameters is sufficiently small), learning ends at step 13. If the check in step 12 shows that they have not converged, the phoneme HMMs decomposed in step 9 are concatenated in step 11 to reconstruct the sentence HMM, the process returns from step 11 to the sentence HMM training of step 8, and the learning and decomposition described above are repeated.
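The typeset form of the re-estimation formulas of equation (2) is not reproduced in this text. They are the standard Baum-Welch updates; a reconstruction consistent with the forward and backward variables of equation (1) would be:

```latex
% Standard Baum-Welch re-estimation (reconstruction: the original equation
% image is not reproduced here; this is the textbook form consistent with
% the forward/backward variables of equation (1)).
\begin{aligned}
\bar{a}_{ij} &=
  \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}
       {\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)},\\[4pt]
\bar{b}_j(k) &=
  \frac{\sum_{t\,:\,o_t = k} \alpha_t(j)\, \beta_t(j)}
       {\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}.
\end{aligned}
```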

[0011] As described above, the first embodiment has the following advantage. When phoneme HMMs are learned from sentence speech, candidate sentence HMMs are generated in step 5 by substituting /g/ and /ng/ at each point where a "ga"-row syllable occurs; in step 6 these sentence HMMs are matched against the input speech, their likelihoods are calculated, and whether each "ga"-row sound contained in the training speech data has been nasalized is detected, so the concatenated learning can be performed accurately. The mismatch between the speech data and the phoneme notation caused by nasalization of "ga"-row sounds is thereby eliminated, and highly accurate HMM learning becomes possible.

[0012] Second Embodiment. Fig. 3 is a flowchart of the processing of an HMM learning method according to a second embodiment of the present invention. Whereas the first embodiment detects whether the "ga"-row sounds contained in the training sentence speech data have been nasalized, this second embodiment detects the long vowels contained in the training sentence speech data. The HMM learning method of the second embodiment is described with reference to this figure. In the HMM learning method of the second embodiment, steps 21 to 33 of Fig. 3 are executed, for example, on a program-controlled computer. First, when learning is started in step 21 of Fig. 3, a speech signal of the training data (for example, sentence speech) is input in step 22, and the process proceeds to the preprocessing of step 23. In the preprocessing of step 23, for example, the input analog speech signal is converted into a digital signal by A/D conversion, speech feature parameters are extracted, for example as LPC cepstra obtained by LPC analysis, and the process proceeds to step 25. In step 25, using the phoneme notation attached to the input sentence speech and referring to a phoneme HMM dictionary 24 for recognition, each point containing a long vowel is detected, the phoneme HMMs of the possible pronunciations are substituted there to generate candidate sentence HMMs, and the process proceeds to step 26. In step 26, the sentence HMMs generated in step 25 are matched against the input speech, their likelihoods are calculated, and the sentence HMM giving the highest output probability is passed to the next step 27 as the correct phoneme notation. Through steps 21 to 26, the differences in the pronunciation of the long vowels contained in the training sentence speech data are thus recognized.

[0013] For example, if the sentence "toukyou ni iku" (to go to Tokyo) is input, the following four sentence HMMs are generated in step 25:

(1) /toukyouniiku/
(2) /toukyooniiku/
(3) /tookyouniiku/
(4) /tookyooniiku/

In step 26, these four sentence HMMs are matched against the input speech, their likelihoods are calculated, and the process proceeds to step 27. In this step 26, the HMM corresponding to the pronunciation of the long vowels contained in the input sentence speech is detected. Suppose here that the fourth HMM gives the largest output probability; then /tookyooniiku/ is passed to the next step 27 as the correct phoneme notation. In step 27, referring to the phoneme notation determined in step 26 and a phoneme dictionary 30, the phoneme HMMs are concatenated to generate the sentence HMM, and the result is sent to step 28. In step 28, the sentence HMM parameters are estimated using the input training speech. For this estimation, for example, the B-W algorithm is used: for the observed label sequence O = o_1, o_2, ..., o_T and state sequence I = i_1, i_2, ..., i_T, the forward variable α_t(i) and backward variable β_t(i) are defined as in equation (1), and the state transition probability a_ij and the label output probability b_j(k) are estimated as in equation (2).
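The four candidates above arise combinatorially from two long-vowel sites, each of which may be realized as /ou/ or /oo/. A minimal sketch of this candidate generation; writing each long-vowel site as a tuple of options is an illustrative convention, not the patent's notation:

```python
from itertools import product

# Sketch of step 25's candidate generation: each long-vowel site may be
# realized in more than one way; sites are written as tuples of options
# (an illustrative convention, not the patent's notation).

def pronunciation_variants(units):
    choices = [u if isinstance(u, tuple) else (u,) for u in units]
    return ["".join(c) for c in product(*choices)]

cands = pronunciation_variants(["t", ("ou", "oo"), "ky", ("ou", "oo"), "niiku"])
print(len(cands), cands)   # 2 sites × 2 options each → 4 candidates
```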

[0014] When the training of the sentence HMM is finished in this way, the sentence HMM is decomposed into phoneme HMMs in step 29, and the corrected phoneme HMMs are stored in a phoneme dictionary 30. Whether these phoneme HMMs have converged is checked in step 32; if they have converged (that is, if the difference between the previous and current values of the phoneme HMM parameters is sufficiently small), learning ends at step 33. If the check in step 32 shows that they have not converged, the phoneme HMMs decomposed in step 29 are concatenated in step 31 to reconstruct the sentence HMM, the process returns from step 31 to the sentence HMM training of step 28, and the learning and decomposition described above are repeated. As described above, the second embodiment has the following advantage. When phoneme HMMs are learned from sentence speech, candidate sentence HMMs are generated in step 25 by substituting the phoneme HMMs of the possible pronunciations at each point containing a long vowel; in step 26 the sentence HMMs are matched against the input speech, and the HMM corresponding to the pronunciation of the long vowels contained in the input sentence speech is detected, so the concatenated learning can be performed accurately. The mismatch between the speech data and the phoneme notation caused by differences in long-vowel pronunciation is thereby eliminated, and highly accurate HMM learning becomes possible. The present invention is not limited to the above embodiments, and various modifications are possible. For example, although sentence speech has been used as an example, the invention is also applicable to word speech and phrase speech.
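The convergence test of steps 12 and 32 can be sketched as a parameter-difference threshold. The tolerance value and parameter layout below are hypothetical choices, not taken from the patent:

```python
import numpy as np

# Hedged sketch of the convergence check (steps 12/32): stop iterating when
# every phoneme HMM parameter changes by less than a tolerance between the
# previous and current iterations. The tolerance is a hypothetical choice.

def converged(prev_params, new_params, tol=1e-4):
    return all(np.max(np.abs(p - q)) < tol
               for p, q in zip(prev_params, new_params))

prev = [np.array([0.50, 0.50]), np.array([0.90, 0.10])]
new = [np.array([0.50001, 0.49999]), np.array([0.90, 0.10])]
print(converged(prev, new))   # → True
```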

[0015]

[Effects of the Invention] As described in detail above, according to the first invention, it is first detected by a speech recognition technique whether the "ga"-row syllables and "ga"-row contracted sounds contained in the training sentence speech data have been nasalized, and when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the hidden Markov models corresponding to the detected "ga"-row syllables and contracted sounds are substituted at the positions where they occur and trained, so that concatenated training can be performed accurately. The inconsistency between the speech data and the phoneme notation caused by nasalization of "ga"-row sounds is therefore eliminated, and highly accurate HMM training becomes possible. According to the second invention, differences in the pronunciation of long vowels contained in the training sentence speech data are detected by a speech recognition technique, and when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the hidden Markov models corresponding to the detected long-vowel pronunciations are concatenated and trained in accordance with the recognition result, so that concatenated training can be performed accurately. The inconsistency between the speech data and the phoneme notation caused by differences in long-vowel pronunciation is therefore eliminated, and highly accurate HMM training becomes possible.
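The learning loop common to both inventions — concatenate phoneme models into a sentence model, train it, decompose it back into phoneme models, and repeat until convergence — could be sketched very roughly as follows. The scalar "models" and the averaging update are toy stand-ins for full HMMs and Baum-Welch re-estimation; all names and the equal-length alignment are illustrative assumptions:

```python
# Minimal sketch of the concatenate / train / decompose loop.
# Each "phoneme model" is a single scalar mean, and "training" is an
# average over the frames aligned to that phoneme across all sentences
# -- stand-ins for HMM parameters and Baum-Welch re-estimation.

def train_phoneme_models(sentences, models, eps=1e-4, max_iter=100):
    """sentences: list of (phoneme_notation, frames) pairs."""
    for _ in range(max_iter):
        # Concatenation: each sentence model is its phoneme models in
        # notation order, sharing parameters across sentences.
        # Training + decomposition: re-estimate each phoneme from the
        # frames aligned to every one of its occurrences.
        sums, counts = {}, {}
        for notation, frames in sentences:
            seg = max(1, len(frames) // len(notation))
            for i, p in enumerate(notation):
                for x in frames[i * seg:(i + 1) * seg]:
                    sums[p] = sums.get(p, 0.0) + x
                    counts[p] = counts.get(p, 0) + 1
        new = {p: sums[p] / counts[p] for p in sums}
        delta = max(abs(new[p] - models.get(p, 0.0)) for p in new)
        models.update(new)
        if delta < eps:   # convergence test (steps 12 / 32)
            break
    return models
```

In the patented method, the notation passed into such a loop is not fixed in advance but is first corrected by the likelihood-based detection of nasalized and long-vowel variants, which is what removes the data/notation mismatch during training.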

[Brief Description of the Drawings]

FIG. 1 is a flowchart of the processing of an HMM learning method according to a first embodiment of the present invention.

FIG. 2 is a diagram showing an example of the structure of a word HMM used in a conventional speech recognition method.

FIG. 3 is a flowchart of the processing of an HMM learning method according to a second embodiment of the present invention.

[Explanation of Symbols]

4, 24   Phoneme HMM dictionary for recognition
5       Sentence HMM generation considering nasalization of "ga"-row syllables
6, 26   Phoneme notation determination by HMM likelihood calculation
7, 27   Sentence HMM construction by concatenation of phoneme HMMs
8, 28   Sentence HMM training (Baum-Welch algorithm)
9, 29   Decomposition of the sentence HMM into phoneme HMMs
10, 30  Phoneme HMM dictionary
11, 31  Sentence HMM reconstruction
12, 32  Phoneme HMM convergence determination
25      Sentence HMM generation considering fluctuation in long-vowel pronunciation

Claims (2)

[Claims]
1. A method of learning hidden Markov models in which, when phoneme hidden Markov models are learned using continuous speech data, initial phoneme hidden Markov models are concatenated to construct a sentence hidden Markov model, and the phoneme hidden Markov models are learned by repeating a learning process of training the sentence hidden Markov model, a decomposition process of decomposing, after the learning process, the training result into phoneme hidden Markov models, and a concatenation process of reconstructing the decomposed phoneme hidden Markov models into a sentence hidden Markov model, the method being characterized in that whether a "ga"-row syllable or a "ga"-row contracted sound contained in the training sentence speech data has been nasalized is detected by a speech recognition technique, and, when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the hidden Markov models corresponding to the recognition result are concatenated and trained, thereby learning the phoneme hidden Markov models.
2. A method of learning hidden Markov models in which, when phoneme hidden Markov models are learned using continuous speech data, initial phoneme hidden Markov models are concatenated to construct a sentence hidden Markov model, and the phoneme hidden Markov models are learned by repeating a learning process of training the sentence hidden Markov model, a decomposition process of decomposing, after the learning process, the training result into phoneme hidden Markov models, and a concatenation process of reconstructing the decomposed phoneme hidden Markov models into a sentence hidden Markov model, the method being characterized in that differences in the pronunciation of long vowels contained in the training sentence speech data are detected by a speech recognition technique, and, when the phoneme hidden Markov models are concatenated to generate the sentence hidden Markov model, the phoneme hidden Markov models corresponding to the pronunciation of the long vowels are concatenated and trained in accordance with the recognition result, thereby learning the phoneme hidden Markov models.
JP06029291A 1994-02-28 1994-02-28 Learning method of hidden Markov model Expired - Fee Related JP3091623B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP06029291A JP3091623B2 (en) 1994-02-28 1994-02-28 Learning method of hidden Markov model


Publications (2)

Publication Number Publication Date
JPH07239695A true JPH07239695A (en) 1995-09-12
JP3091623B2 JP3091623B2 (en) 2000-09-25

Family

ID=12272150

Family Applications (1)

Application Number Title Priority Date Filing Date
JP06029291A Expired - Fee Related JP3091623B2 (en) 1994-02-28 1994-02-28 Learning method of hidden Markov model

Country Status (1)

Country Link
JP (1) JP3091623B2 (en)

Also Published As

Publication number Publication date
JP3091623B2 (en) 2000-09-25

Similar Documents

Publication Publication Date Title
JP4301102B2 (en) Audio processing apparatus, audio processing method, program, and recording medium
US5333275A (en) System and method for time aligning speech
US8019602B2 (en) Automatic speech recognition learning using user corrections
US6085160A (en) Language independent speech recognition
Al-Qatab et al. Arabic speech recognition using hidden Markov model toolkit (HTK)
JP2003316386A (en) Method, device, and program for speech recognition
EP2048655A1 (en) Context sensitive multi-stage speech recognition
KR101014086B1 (en) Voice processing device and method, and recording medium
JP5184467B2 (en) Adaptive acoustic model generation apparatus and program
Sirigos et al. A hybrid syllable recognition system based on vowel spotting
JP2886118B2 (en) Hidden Markov model learning device and speech recognition device
Syadida et al. Sphinx4 for indonesian continuous speech recognition system
JP3091623B2 (en) Learning Hidden Markov Model
Delić et al. A Review of AlfaNum Speech Technologies for Serbian, Croatian and Macedonian
Amdal et al. Pronunciation variation modeling in automatic speech recognition
KR102405547B1 (en) Pronunciation evaluation system based on deep learning
Gollan et al. Towards automatic learning in LVCSR: rapid development of a Persian broadcast transcription system.
Kessens et al. Improving recognition performance by modelling pronunciation variation.
JPH09114482A (en) Speaker adaptation method for voice recognition
JPH08211891A (en) Learning method for hidden markov model
JP2912513B2 (en) Learning Hidden Markov Model
JPH09160586A (en) Learning method for hidden markov model
JPH07121192A (en) Method for learning hidden markov model
Sadashivappa MLLR Based Speaker Adaptation for Indian Accents
JPH08211893A (en) Speech recognition device

Legal Events

Date Code Title Description
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20000711

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20080721

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090721

Year of fee payment: 9

LAPS Cancellation because of no payment of annual fees