JPH0827638B2

JPH0827638B2 - Phoneme-based speech recognition device

Info

Publication number: JPH0827638B2
Application number: JP63182225A
Authority: JP
Inventors: 和永吉田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1988-07-20
Filing date: 1988-07-20
Publication date: 1996-03-21
Anticipated expiration: 2011-03-21
Also published as: JPH0229799A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to an improvement of a speech recognition device that can recognize large-vocabulary continuous speech by using phonemes, the basic units that make up speech, as its recognition units.

(Prior Art) Methods that recognize speech in units of phonemes, the basic units that make up speech, already exist. In such a method, phoneme standard patterns, which form the basis of recognition, are first obtained by learning. Word speech is then recognized using word standard patterns synthesized from these phoneme standard patterns according to a word dictionary written in phoneme notation. Here the term "phoneme" means a unit of recognition and is used in a broad sense that covers not only phonemes in the phonetic sense but also syllables and chains of several phonemes. Recognition targets other than words, such as phrases and sentences, are also possible, but the description below deals with the recognition of words.

One example of a phoneme-based recognition method is given in the paper titled "Syllable-Based Japanese Speech Recognition" (hereinafter Reference 1), published on pages 477 to 484 of Acoustical Society of Japan, Speech Research Group Material S85-62 (December 20, 1985). The paper describes a speech recognition method whose recognition units (phonemes) are consonant + vowel (CV, where C denotes a consonant and V a vowel) and vowel + consonant + vowel (VCV). In this method, training speech uttered continuously word by word is segmented into CV and VCV phonemes, and phoneme standard patterns are created from the speech in the segmented intervals (creating or updating standard patterns in this way is called learning). At recognition time, words (phrases) are recognized by DP matching against concatenations of the phoneme standard patterns obtained.

In addition, the paper titled "The Role of Word-Dependent Coarticulatory Effects in a Phoneme-Based Speech Recognition System" (1986), published on page 1593 of IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, 30.9 (hereinafter Reference 2), describes a method that performs recognition with the hidden Markov model (hereinafter "HMM") method, using phonemes as the recognition units. This method is described below.

First, the method of obtaining phoneme-level HMMs by learning is described (these are called phoneme HMMs and are equivalent to phoneme standard patterns). The training speech is converted by vector quantization into an observation label sequence O(t), 1 ≤ t ≤ T. FIG. 2 shows an example of a phoneme HMM. An HMM is represented by a state transition network like the one in the figure. The parameters defined for an HMM are the output probability b_i(O(t)) of the observation label O(t) in state i and the state transition probability a(i,j) from state i to state j.

When phoneme HMMs are trained, phoneme HMMs are first created in advance from segmented speech data of a representative speaker. These serve as the initial model, that is, the initial values for phoneme HMM learning. For a new speaker, the initial model is updated by a learning process using training speech uttered by that speaker, which produces that speaker's phoneme HMMs. This learning process can be carried out with the FB (Forward-Backward) algorithm. The FB algorithm is described in detail in, for example, the paper titled "Structural Methods in Automatic Speech Recognition" (November 1985), published on page 1652 of Proceedings of the IEEE, Vol. 73, No. 11 (hereinafter Reference 3). To train phoneme HMMs from training speech uttered word by word, the following operation is repeated: the intermediate parameters of the FB algorithm, which are updated word by word, are pooled per phoneme HMM to obtain the new phoneme HMM parameters.
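The pooling step in the paragraph above can be sketched roughly as follows. This is a minimal illustration, not the procedure of Reference 2 or Reference 3: it assumes a forward-backward pass over each training word has already produced expected transition counts for every phoneme occurrence, and the function name `pool_by_phoneme` and the count layout are hypothetical.

```python
def pool_by_phoneme(occurrences):
    """Sum expected transition counts per phoneme, then normalize each row.

    occurrences: list of (phoneme, counts) pairs, where counts[i][j] is the
    expected number of i -> j transitions for one occurrence of the phoneme
    in one training word (assumed to come from a forward-backward pass).
    """
    totals = {}
    for phoneme, counts in occurrences:
        if phoneme not in totals:
            totals[phoneme] = [[0.0] * len(row) for row in counts]
        for i, row in enumerate(counts):
            for j, value in enumerate(row):
                totals[phoneme][i][j] += value

    pooled = {}
    for phoneme, counts in totals.items():
        pooled[phoneme] = []
        for row in counts:
            s = sum(row)
            # Rows with no observed transitions are left at zero.
            pooled[phoneme].append([v / s if s > 0 else 0.0 for v in row])
    return pooled


# Two occurrences of [a] (from different words) and one of [s]; toy numbers.
occurrences = [
    ("a", [[3.0, 1.0], [0.0, 2.0]]),
    ("a", [[1.0, 1.0], [0.0, 2.0]]),
    ("s", [[2.0, 2.0], [0.0, 1.0]]),
]
pooled = pool_by_phoneme(occurrences)
```

Because the counts of both [a] occurrences are added before normalization, every word that contains [a] contributes to the same [a] model, which is the point of collecting the word-level statistics per phoneme.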

To recognize input speech, word HMMs are built by concatenating phoneme HMMs according to a word dictionary written in phoneme notation, and the probability that each word HMM generates the input speech is computed as the forward probability of the FB algorithm described above. The word with the highest probability is the recognition result.
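As a rough illustration of recognition by forward probability, the sketch below computes the forward probability of a discrete HMM and picks the word with the highest value. The data layout, the function names, and the toy models "A" and "B" are assumptions made for this example; the sketch sums over all states at the last frame, which is one common convention and not necessarily that of Reference 2.

```python
def forward_probability(initial, trans, emit, labels):
    """Probability that a discrete HMM generates the label sequence.

    initial[i]  : probability of starting in state i
    trans[i][j] : state transition probability a(i, j)
    emit[i][k]  : output probability b_i(k) of observation label k
    labels      : observation label sequence O(1..T) as integers
    """
    n = len(initial)
    alpha = [initial[i] * emit[i][labels[0]] for i in range(n)]
    for label in labels[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][label]
            for j in range(n)
        ]
    return sum(alpha)


def recognize(word_hmms, labels):
    """Return the word whose HMM gives the highest forward probability."""
    return max(word_hmms, key=lambda w: forward_probability(*word_hmms[w], labels))


# Two toy two-state left-to-right word models that differ only in outputs.
initial = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
word_hmms = {
    "A": (initial, trans, [[0.9, 0.1], [0.1, 0.9]]),
    "B": (initial, trans, [[0.1, 0.9], [0.9, 0.1]]),
}
labels = [0, 0, 1, 1]
best = recognize(word_hmms, labels)  # "A"
```

For long label sequences the products underflow, so a practical version would scale `alpha` at each frame or work with log probabilities.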

(Problems to Be Solved by the Invention) In phoneme-based speech recognition, it is generally preferable to use as recognition units phoneme chains of some length, such as the CV and VCV units described in Reference 1 (called compound phonemes), rather than single phonemes such as C and V (called monophonemes). The reasons include that a speech pattern changes greatly depending on which phonemes precede and follow it, and that the transition from one monophoneme to the next carries strongly distinctive features.

In general, however, there are far more kinds of compound phonemes than of monophonemes. For example, Japanese has only about 20 monophonemes such as C and V, but more than 1000 VCV compound phonemes. Training all VCVs therefore requires an enormous amount of training speech containing those compound phonemes, and the amount of processing needed for learning also becomes very large, which is a drawback.

Furthermore, utterance variations such as devoicing and vowel lengthening may occur when speech is uttered. One way to handle such variations is to prepare phoneme HMMs for devoiced phonemes and lengthened-vowel phonemes in addition to the ordinary phoneme HMMs. However, whether a variation occurs is probabilistic: a particular variation may be absent from the training speech, or every phoneme that can undergo a variation such as devoicing may actually undergo it. In such cases, the standard patterns of the phonemes with the variation, or of the phonemes without it, are not learned, which is a drawback.

In addition, the initial model is generally created from the utterances of a representative speaker. When phoneme HMMs are trained, if the new speaker's training speech differs greatly from the representative speaker's speech patterns, the correspondence between the phoneme segments of the training speech and the intervals of the phoneme HMMs can shift badly, so that learning is sometimes not performed correctly, which is a drawback.

An object of the present invention is to eliminate the above drawbacks and to realize a high-performance speech recognition device by making it possible to learn, from a small amount of training speech, standard patterns for a variety of utterance variations and for the many kinds of compound phonemes.

(Means for Solving the Problems) The phoneme-based speech recognition device according to the first invention of the present application comprises: a monophoneme learning unit that obtains monophoneme standard patterns from training speech; a phoneme combination unit that creates compound phoneme standard patterns by combining one or more of the obtained monophoneme standard patterns; a compound phoneme learning unit that performs learning with training speech, starting from the created compound phoneme standard patterns; and a speech recognition unit that recognizes input speech using the compound phoneme standard patterns.

The phoneme-based speech recognition device according to the second invention of the present application comprises, in addition to the first invention: an utterance variation detection unit that obtains utterance variation information for the training speech; a monophoneme learning unit that performs learning based on the utterance variation information; and a compound phoneme learning unit that performs learning based on the utterance variation information.

The phoneme-based speech recognition device according to the third invention of the present application comprises, in addition to the first and second inventions: a phoneme selection unit that selects, for each standard pattern, between an initial standard pattern obtained in advance and an initial standard pattern obtained from training speech; and a monophoneme learning unit that obtains monophoneme standard patterns from training speech using the selected initial standard patterns as initial values.

(Operation) The operation of the phoneme-based speech recognition device according to the present invention is described next. In the following description, the monophonemes are phonemes such as C and V, and the compound phonemes are phoneme chains such as word-initial CV and VCV. The recognition method is the phoneme-level HMM described in Reference 2. The same applies when other methods are used.

The present invention uses compound phoneme HMMs as the recognition units. When these compound phoneme HMMs are obtained by learning, the training speech may not contain speech patterns for every defined compound phoneme, for example when the amount of training speech is small or when it contains utterance variations. As a result, some compound phoneme HMMs may remain untrained.

To deal with this, the present invention first obtains monophoneme HMMs by learning when obtaining compound phoneme HMMs. A monophoneme HMM corresponds to a piece of a compound phoneme HMM. A compound phoneme V1C1V2 (for example [asi]) is divided into the monophonemes V1 ([a]), C1 ([s]), and V2 ([i]). Because there are only a limited number of monophoneme HMMs (about 20), it is easy to prepare training speech that contains every monophoneme. Utterance variations can also be handled with monophoneme HMMs, by replacing the phoneme in which the variation occurred with a similar phoneme. For example, the monophoneme HMM of a devoiced vowel can be replaced with the monophoneme HMM of a fricative such as [s]. The content of the utterance variations in the training speech (for example, whether devoicing occurs) is assumed to be known in advance as utterance variation information.

The monophoneme HMMs can be trained by, for example, the method described in Reference 2. FIG. 3 shows an example of the monophoneme HMM for the monophoneme [a]. Here a two-state HMM, as shown in the figure, is used as the monophoneme HMM. When monophoneme HMMs are trained, monophoneme HMMs obtained from the speech of a representative speaker are first used as the initial model. These monophoneme HMMs are concatenated according to a word dictionary written in monophoneme notation to create word HMMs. FIG. 5 shows an example of the word HMM [asahi], obtained by concatenating the monophoneme HMMs [a], [a], [s], [a], [a], [h], [i], [i]. The vowel monophonemes [a] and [i] appear twice in succession to allow for the division into the compound phonemes [asa] and [ahi].
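The [asahi] example can be reproduced with a small sketch. The rule used here, repeating every vowel so that neighboring VCV units can each take one copy, is inferred from this single example and is an assumption, as are the function names and the five-vowel inventory.

```python
VOWELS = set("aiueo")  # assumed vowel inventory for this illustration


def monophoneme_sequence(phonemes):
    """Repeat each vowel twice so that neighboring VCV units can share it."""
    sequence = []
    for p in phonemes:
        sequence.extend([p, p] if p in VOWELS else [p])
    return sequence


def vcv_units(phonemes):
    """List the VCV compound phonemes contained in a V C V C V ... word."""
    units = []
    for i in range(len(phonemes) - 2):
        v1, c, v2 = phonemes[i:i + 3]
        if v1 in VOWELS and c not in VOWELS and v2 in VOWELS:
            units.append(v1 + c + v2)
    return units


word = ["a", "s", "a", "h", "i"]
print(monophoneme_sequence(word))  # ['a', 'a', 's', 'a', 'a', 'h', 'i', 'i']
print(vcv_units(word))             # ['asa', 'ahi']
```

With the vowels repeated, the chain a-a-s-a-a-h-i-i can be cut between the two copies of each vowel, which leaves [asa] and [ahi] as contiguous pieces of the word HMM.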

Starting from such word HMMs, the monophoneme HMMs are trained by the method described in Reference 2, using training speech uttered as words. When training speech containing utterance variations is used, the monophoneme HMMs are concatenated according to the utterance variation information described above to create the word HMMs used for learning.

Compound phoneme HMMs can be created by concatenating the monophoneme HMMs obtained in this way. FIG. 4 shows an example of the compound phoneme HMM [asa], obtained by concatenating the monophoneme HMMs [a], [s], and [a]. Because of coarticulation, the speech pattern of the same phoneme can change depending on the neighboring phonemes. A simple concatenation of monophoneme HMMs is therefore not fully adequate, but it can be used as an approximation of a compound phoneme HMM.
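One way to picture the concatenation of monophoneme HMMs is sketched here. The representation is an assumption of this example and is not taken from FIG. 4: each model is a transition matrix plus a per-state exit probability, and leaving one model means entering the first state of the next. Output probabilities are omitted because they would simply be listed state by state in the same order.

```python
def concatenate(models):
    """Chain left-to-right HMMs into one larger left-to-right HMM.

    Each model is a pair (trans, exit_prob): trans[i][j] is a(i, j) inside
    the model and exit_prob[i] is the probability of leaving the model from
    state i, so every row satisfies sum(trans[i]) + exit_prob[i] == 1.
    """
    total = sum(len(trans) for trans, _ in models)
    big = [[0.0] * total for _ in range(total)]
    big_exit = [0.0] * total
    offset = 0
    for index, (trans, exit_prob) in enumerate(models):
        n = len(trans)
        last = index == len(models) - 1
        for i in range(n):
            for j in range(n):
                big[offset + i][offset + j] = trans[i][j]
            if last:
                big_exit[offset + i] = exit_prob[i]
            else:
                # Leaving this model means entering the first state of the next.
                big[offset + i][offset + n] = exit_prob[i]
        offset += n
    return big, big_exit


# Toy two-state monophoneme models; the numbers are illustrative only.
a = ([[0.6, 0.4], [0.0, 0.7]], [0.0, 0.3])
s = ([[0.5, 0.5], [0.0, 0.8]], [0.0, 0.2])
asa_trans, asa_exit = concatenate([a, s, a])  # six-state model for [asa]
```

The result has the same form as its inputs, so the same function can chain compound phoneme models into a word model.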

Furthermore, for the compound phonemes that are present in the training speech, the present invention trains the compound phoneme HMMs using the compound phoneme HMMs created by combining monophoneme HMMs as the initial model. This learning can be performed in the same way as for the monophoneme HMMs.

As a result, a model that includes the effects of coarticulation can be created for the compound phonemes present in the training speech.

Thus, according to the present invention, compound phoneme HMMs can be created not only for the compound phonemes present in the training speech but also, approximately, for those that are absent, so that a limited amount of training speech can be used effectively to train compound phoneme HMMs.

The description so far assumed that the utterance variations in the training speech are known in advance. Obtaining utterance variation information, however, requires a method such as specifying beforehand whether each variation should occur when the training speech is uttered, for example uttering the speech so that certain vowels are devoiced and others are not. Such a method places a burden on the user. In contrast, the present invention can also use a method that detects the utterance variations in the training speech automatically. This automatic detection method is described below.

First, word dictionaries covering all utterance variations are prepared, and word HMMs are created by concatenating, according to those dictionaries, the monophoneme HMMs obtained from a representative speaker. The probability that each of these word HMMs generates the training speech is computed, and the utterance variation of the dictionary entry with the highest probability is taken as the utterance variation of the training speech. As in word recognition, this probability can be obtained as the forward probability described in Reference 3.

For example, [u], the fourth and sixth monophoneme in the word for "applause" (pronounced [hakusyu]), may be devoiced. Dictionaries are therefore prepared for all possible combinations, [hakusyu], [haku-syu], [hakusyu-], and [haku-syu-] (a devoiced [u] is written [u-]), and the utterance variation is determined using the word HMMs built from those dictionaries. If, for example, [haku-syu] has the highest probability, the first [u] is judged to be devoiced.
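The enumeration of variant dictionaries and the final choice can be sketched as follows. The function name `variants` and the positions list are hypothetical, and the scores are made-up placeholders standing in for the forward probabilities of the variant word HMMs; they are not measured values.

```python
from itertools import product


def variants(phonemes, optional):
    """Enumerate every combination of devoiced / plain for the given positions.

    phonemes: list of monophoneme symbols; optional: 0-based positions that
    may be devoiced. A devoiced phoneme is written with a trailing "-".
    """
    results = []
    for choice in product([False, True], repeat=len(optional)):
        marked = list(phonemes)
        for position, devoiced in zip(optional, choice):
            if devoiced:
                marked[position] += "-"
        results.append("".join(marked))
    return results


hakusyu = ["h", "a", "k", "u", "sy", "u"]  # [sy] counted as one monophoneme
entries = variants(hakusyu, optional=[3, 5])
print(entries)  # ['hakusyu', 'hakusyu-', 'haku-syu', 'haku-syu-']

# Placeholder scores only; real values would be forward probabilities.
scores = {"hakusyu": 1e-9, "hakusyu-": 2e-10, "haku-syu": 4e-8, "haku-syu-": 3e-9}
detected = max(scores, key=scores.get)  # 'haku-syu': first [u] devoiced
```

The number of variants doubles with each position that may be devoiced, so the dictionary for one word stays small as long as only a few of its phonemes can vary.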

Using the utterance variation information obtained in this way makes it possible to learn from speech that contains utterance variations.

In the description so far, the initial model is based on the utterances of a representative speaker. Monophoneme HMMs can normally be trained this way, but, as already noted, learning may fail when the new speaker's training speech differs greatly from the representative speaker's speech patterns. The present invention therefore creates the initial models of monophonemes whose patterns may vary widely between speakers, such as vowels, directly from the training speech.

The case of creating the initial models of vowels from training speech is described here, but the same applies to initial models other than vowels. First, speech that is easy to segment into monophonemes is prepared as training speech. For example, if vowels uttered in isolation are used as training speech, the speech interval of the speech data (for example, the portion whose amplitude exceeds a certain level) can be taken as the segment of the vowel monophoneme. Alternatively, vowel intervals can be extracted with, for example, the segmentation method described from page 73 of "Speech Recognition" (Kyoritsu Shuppan), which allows a wide variety of speech to be used for creating the initial model.

This training speech is segmented into monophonemes by the above method, and, using the speech within each monophoneme segment as training speech, a monophoneme HMM can be trained for each monophoneme with the FB algorithm. In this case there is no risk that the correspondence between the training speech segments and the monophoneme HMMs will shift, so the initial model for this learning can be, for example, random values or monophoneme HMMs created from a representative speaker. It is also possible to derive only the observation label output probability b_i(x) from the occurrence frequencies of all observation labels in the corresponding segment, and to use the representative speaker's state transition probabilities a(i,j) unchanged.
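Deriving the output probability from label frequencies could look roughly like this. The text does not say how a segment is divided between the two states of a monophoneme HMM, so this sketch estimates a single distribution per segment; the small `floor` constant that keeps unseen labels above zero is also an assumption, as is the function name.

```python
from collections import Counter


def output_probabilities(segment_labels, num_labels, floor=1e-6):
    """Estimate b(x) for one segment as smoothed relative label frequencies.

    segment_labels: observation labels (integers) inside the segment.
    num_labels: size of the vector-quantization codebook.
    floor: small constant added to every count so that unseen labels do not
           get probability zero (an assumption of this sketch).
    """
    counts = Counter(segment_labels)
    total = len(segment_labels) + floor * num_labels
    return [(counts[x] + floor) / total for x in range(num_labels)]


# Labels observed in one isolated-vowel segment, with a codebook of size 4.
b = output_probabilities([2, 2, 3, 2, 1], num_labels=4)
```

Label 2 occurs three times out of five, so `b[2]` is close to 0.6, and the four values sum to 1.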

Obtaining the initial models of the monophonemes from the speech of the speaker being trained in this way makes stable learning possible.

The description above is based on the HMM method, but the same applies when the DP matching method described in Reference 1 is used. In that case, monophoneme standard patterns and compound phoneme standard patterns are prepared as the phoneme standard patterns. The following iterative learning method for standard patterns using DP matching can be used. Word standard patterns are formed by concatenating initial phoneme standard patterns obtained in advance according to a word dictionary, and the training speech is segmented into phonemes by DP matching against these word standard patterns. The resulting phoneme segments are averaged over occurrences of the same phoneme to create new phoneme standard patterns. This operation is repeated to update the phoneme standard patterns.
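For readers unfamiliar with DP matching, the matching step alone can be sketched as a basic dynamic time warping recursion. This is a generic textbook form with a symmetric step pattern and scalar features chosen for brevity; it is not the specific formulation of Reference 1, and the segmentation and averaging steps described above are not shown.

```python
def dp_match(pattern, speech, dist=lambda a, b: abs(a - b)):
    """Return the minimum accumulated distance between two sequences."""
    inf = float("inf")
    n, m = len(pattern), len(speech)
    # cost[i][j]: best accumulated distance aligning pattern[:i] with speech[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(pattern[i - 1], speech[j - 1]) + best
    return cost[n][m]


print(dp_match([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the middle frame is stretched
print(dp_match([1, 2, 3], [1, 3]))        # 1.0: one frame has no exact match
```

Tracing back the choices made by `min` would give the frame-by-frame alignment, from which the phoneme boundaries in the training speech can be read off.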

With such an iterative learning method, the learning of standard patterns for DP matching can be handled in the same way as the method using the FB algorithm for HMMs, so the learning method of the present invention can be applied.

(Embodiment) An embodiment of the phoneme-based speech recognition device according to the present invention is described with reference to the drawings. FIG. 1 is a block diagram showing one embodiment of the present invention. The learning method for obtaining the compound phoneme HMMs used for recognition is described first.

The initial model memory 1 holds the parameters of the monophoneme HMMs (the initial model) that serve as initial values for monophoneme HMM learning. This initial model is obtained in advance from speech uttered by a representative speaker and segmented into monophonemes, which can be done by applying the FB algorithm described in Reference 3 to each monophoneme. The monophoneme HMM used here is a two-state model like the one shown in FIG. 3.

The initial model training speech memory 2 holds speech data of the initial model training speech, converted into observation label sequences by vector quantization. This speech data is input to the segmentation unit 3 and segmented into monophonemes. Here vowels uttered in isolation are used as the initial model training speech, and the portion of the speech data whose amplitude exceeds a predetermined value is taken as the vowel monophoneme data. The parameter creation unit 4 obtains the parameters of the monophoneme HMMs from the input monophoneme data with the FB algorithm.

The initial model selection unit 5 selects, according to a predetermined rule, monophoneme HMMs from those in the initial model memory 1 and those obtained by the parameter creation unit 4, and outputs them as the initial model. One such rule is to take the initial models (monophoneme HMMs) of vowels from the parameter creation unit 4 and all others from the initial model memory 1. When no initial model training speech is used, the monophoneme HMMs in the initial model memory 1 are output as the initial model.
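The example selection rule can be written down in a few lines. The dictionaries, the function name, and the string placeholders standing in for HMM parameter sets are all hypothetical; only the rule itself (vowels from the parameter creation unit, everything else from the initial model memory) comes from the text.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel inventory


def select_initial_models(memory_models, created_models):
    """Prefer speaker-derived models for vowels, stored models otherwise."""
    selected = {}
    for phoneme, stored in memory_models.items():
        if phoneme in VOWELS and phoneme in created_models:
            selected[phoneme] = created_models[phoneme]
        else:
            selected[phoneme] = stored
    return selected


memory_models = {"a": "a@representative", "s": "s@representative"}
created_models = {"a": "a@new-speaker"}
print(select_initial_models(memory_models, created_models))
# {'a': 'a@new-speaker', 's': 's@representative'}
```

Passing an empty `created_models` returns the stored models unchanged, which corresponds to the case where no initial model training speech is used.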

The training speech memory 6 holds training speech data expressed as observation label sequences. The utterance variation detection unit 7 reads from the word dictionary memory 8 the dictionaries that cover all utterance variations of the word corresponding to the training speech data. It concatenates the initial models according to the notation in these word dictionaries to create a word HMM for each utterance variation. It then computes, as a forward probability, the probability that each of these word HMMs generates the training speech data, and stores the utterance variation of the word HMM with the highest probability in the utterance variation information memory 9 as the utterance variation information for that training speech data. Besides information obtained in this way, information obtained by examining the training speech in advance can also be used as the utterance variation information.

The monophoneme learning unit 10 trains the monophoneme HMMs using the speech data in the training speech memory 6. It first selects, based on the utterance variation information in the utterance variation information memory 9, the word dictionary containing the corresponding utterance variation from the word dictionary memory 8, and concatenates the initial models to create the word HMM corresponding to the training speech. As in the method described in Reference 2, the parameters of this word HMM are updated, and the updated parameters are pooled per monophoneme to advance the learning of the monophoneme HMMs. This learning process is repeated until the parameters converge.

The trained monophoneme HMMs are concatenated in the monophoneme combination unit 11 according to predetermined rules to produce compound phoneme HMMs. One such rule is that the compound phoneme [asa] is created by concatenating the monophonemes [a], [s], and [a] as shown in FIG. 4.

The compound phoneme learning unit 12 trains the compound phoneme HMMs using the speech data in the training speech memory 6, with the compound phoneme HMMs created by the monophoneme combination unit 11 as the initial model. The learning method is the same as that used in the monophoneme learning unit 10.

The resulting compound phoneme HMMs are held in the compound phoneme HMM memory 13.

Next, the method of recognizing input speech is described. Recognition is performed in the recognition unit 14. The recognition method is the same as that described in Reference 2. That is, the input speech is converted into an observation label sequence by vector quantization. Word HMMs are then created one after another by concatenating the compound phoneme HMMs in the compound phoneme HMM memory 13 according to the word dictionary in the word dictionary memory 8. The probability that each word HMM generates the input speech is computed as a forward probability, and the word whose word HMM gives the highest probability is the recognition result.

(Effects of the Invention) According to the present invention, standard patterns for many utterance variations and for the many kinds of compound phonemes can be learned from a small amount of training speech, so a high-performance speech recognition device can be realized.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a diagram showing an example of a phoneme HMM, FIG. 3 is a diagram showing an example of a monophoneme HMM, FIG. 4 is a diagram showing an example of a compound phoneme HMM, and FIG. 5 is a diagram showing an example of a word HMM. In the figures: 1, initial model memory; 2, initial model training speech memory; 3, segmentation unit; 4, parameter creation unit; 5, initial model selection unit; 6, training speech memory; 7, utterance variation detection unit; 8, word dictionary memory; 9, utterance variation information memory; 10, monophoneme learning unit; 11, monophoneme combination unit; 12, compound phoneme learning unit; 13, compound phoneme HMM memory; 14, recognition unit.

Claims

[Claims]

1. A phoneme-based speech recognition device comprising: a monophoneme learning unit that obtains monophoneme standard patterns from training speech; a phoneme combination unit that creates compound phoneme standard patterns by combining one or more of the obtained monophoneme standard patterns; a compound phoneme learning unit that performs learning with training speech, starting from the created compound phoneme standard patterns; and a recognition unit that recognizes input speech using the compound phoneme standard patterns.

2. The phoneme-based speech recognition device according to claim 1, further comprising: an utterance variation detection unit that obtains utterance variation information for the training speech; a monophoneme learning unit that performs learning based on the utterance variation information; and a compound phoneme learning unit that performs learning based on the utterance variation information.

3. The phoneme-based speech recognition device according to claim 1, further comprising: a phoneme selection unit that selects, for each standard pattern, between an initial standard pattern obtained in advance and an initial standard pattern obtained from training speech; and a monophoneme learning unit that obtains monophoneme standard patterns from training speech using the selected initial standard pattern as an initial value.