JPH04284498A - Method of voice recognition - Google Patents
Method of voice recognition
- Publication number
- JPH04284498A (applications JP3049687A, JP4968791A)
- Authority
- JP
- Japan
- Prior art keywords
- speaker
- hmm
- state
- word
- phoneme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims description 10
- 230000002194 synthesizing effect Effects 0.000 claims description 3
- 230000007704 transition Effects 0.000 abstract description 12
- 239000013598 vector Substances 0.000 abstract description 8
- 238000001228 spectrum Methods 0.000 abstract description 3
- 238000010586 diagram Methods 0.000 description 7
- 230000015572 biosynthetic process Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000003786 synthesis reaction Methods 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 1
- 230000006866 deterioration Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
- 230000003595 spectral effect Effects 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
Abstract
Description
[0001]
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech recognition method that uses hidden Markov models and is applied to speaker-independent, large-vocabulary continuous speech recognition to improve recognition performance.
[0002]
2. Description of the Prior Art
In speaker-independent speech recognition based on hidden Markov models (see, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models", IEICE, 1988), a codebook created from the speech spectra of many speakers is often used. This codebook is called the universal codebook. However, as shown in FIG. 4B, the codebook space 11 of a particular speaker is only a subspace of the universal codebook 12. Moreover, the movement of codewords within the codebook 12 is also unique to each speaker.
[0003]
Despite this fact, speaker-independent speech recognition based on hidden Markov models (HMMs) has used the universal codebook 12 to build word- or phoneme-level HMMs from large amounts of speech data collected from many speakers. The constraint imposed by the speaker-specific codebook space 11 is therefore not taken into account at all, which causes various side effects and degrades recognition performance in speaker-independent, large-vocabulary continuous speech recognition.
[0004]
SUMMARY OF THE INVENTION
According to the present invention, a hidden Markov model representing phonemes/words for unspecified speakers is combined with a hidden Markov model representing the characteristics of the speaker, and the combined model is used to recognize that speaker's speech. That is, the basic formula of statistical continuous speech recognition that takes the speaker into account is written as follows; the terms involving S are what this invention introduces.
[0005]
P(W,S|Y) = P(W,S) P(Y|W,S) / P(Y)
         = P(S) P(W|S) P(Y|W,S) / P(Y)
where
W: word string
S: speaker
Y: vector sequence of the input speech
P(S): probability that speaker S is using this speech recognition apparatus
P(W|S): probability that speaker S utters the word string W, regarded as a statistical language model for speaker S (see, for example, Shikano, "Speech Recognition by Statistical Methods", Journal of the IEICE, Vol. 73, No. 12, pp. 1276-1285, Dec. 1990).
[0006]
P(Y|W,S): probability of the vector sequence Y of the input speech when speaker S utters the content W (the acoustic model). The problem of statistical continuous speech recognition is therefore to estimate, using the speaker information S, the word string W that achieves
max_{W,S} { P(S) P(W|S) P(Y|W,S) }.
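The joint decision rule above can be sketched as a toy search over candidate speakers and word strings. This sketch is not part of the patent; all probability tables below are hypothetical illustrative numbers, and in a real recognizer P(Y|W,S) would come from HMM likelihood computation rather than a lookup table.

```python
import itertools

speakers = ["s1", "s2"]
word_strings = [("hello",), ("hello", "world")]

p_speaker = {"s1": 0.6, "s2": 0.4}                      # P(S)
p_words_given_speaker = {                               # P(W|S)
    ("s1", ("hello",)): 0.7, ("s1", ("hello", "world")): 0.3,
    ("s2", ("hello",)): 0.2, ("s2", ("hello", "world")): 0.8,
}
p_acoustic = {                                          # P(Y|W,S), for one fixed Y
    ("s1", ("hello",)): 0.01, ("s1", ("hello", "world")): 0.05,
    ("s2", ("hello",)): 0.02, ("s2", ("hello", "world")): 0.04,
}

def recognize():
    # Jointly maximize P(S) P(W|S) P(Y|W,S) over speaker S and word string W.
    return max(
        itertools.product(speakers, word_strings),
        key=lambda sw: p_speaker[sw[0]]
        * p_words_given_speaker[sw]
        * p_acoustic[sw],
    )

print(recognize())  # → ('s2', ('hello', 'world'))
```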
[0007]
Here, P(S) represents the probability that speaker S is using this speech recognition apparatus. We further consider modeling the speaker-dependent acoustic model (word/phoneme model) P(Y|W,S) with a hidden Markov model (HMM). If a large amount of word/phoneme speech data were collected for each speaker, a word/phoneme HMM could be trained per speaker; in practice, however, having every speaker produce large amounts of speech is not realistic. We therefore use, as is commonly done, a word/phoneme HMM P(Y|W) trained on speech data from many speakers, and consider constraining this P(Y|W) to the speaker-specific codebook space. In the following, an HMM is also used to represent the speaker-specific codebook space and the movement of codewords within it.
[0008]
An HMM is represented by the following six-tuple:
HMM: M = (U, V, T, P, I, F)
where
U: set of states
V: set of input vectors
T: set of transition probabilities
P: set of output probabilities
I: initial state
F: final state
The input sequence is written as
Y = y1 y2 ... yt ... yN.
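The six-tuple above admits a direct representation in code. The following minimal sketch (not from the patent; the class and helper names are our own) stores a discrete HMM as M = (U, V, T, P, I, F) and evaluates a sequence likelihood with the forward algorithm, placing output probabilities on transitions to match the P12(yt) convention used later in the text.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    U: list        # set of states
    V: list        # set of input (observation) symbols
    T: dict        # transition probabilities: T[(u, u2)]
    P: dict        # output probabilities on transitions: P[(u, u2, v)]
    I: object      # initial state
    F: object      # final state

    def likelihood(self, Y):
        # Forward algorithm: alpha[u] = P(prefix of Y, current state u).
        alpha = {u: (1.0 if u == self.I else 0.0) for u in self.U}
        for y in Y:
            alpha = {
                u2: sum(
                    alpha[u]
                    * self.T.get((u, u2), 0.0)
                    * self.P.get((u, u2, y), 0.0)
                    for u in self.U
                )
                for u2 in self.U
            }
        return alpha[self.F]

# Tiny two-state example: emit "a" on the self-loop, "b" on the way out.
m = HMM(
    U=[1, 2], V=["a", "b"],
    T={(1, 1): 0.5, (1, 2): 0.5, (2, 2): 1.0},
    P={(1, 1, "a"): 1.0, (1, 2, "b"): 1.0, (2, 2, "b"): 1.0},
    I=1, F=2,
)
print(m.likelihood(["a", "b"]))  # → 0.25
```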
[0009]
As the HMM representing speaker characteristics, consider an ergodic HMM whose parameters are estimated, for each speaker, from arbitrary utterances. A simple example of such an ergodic HMM is shown in FIG. 3A. That is, a relatively short training utterance is input for each speaker; with t12 denoting the transition probability from state 1 to state 2, and P12(yt) the probability that the input vector yt at time t is output on the transition from state 1 to state 2, a model that moves among states 1, 2, and 3 (an ergodic HMM) is built for each speaker. This speaker HMM is written as
Msi = (Usi, V, Tsi, Psi, Isi, Fsi) : speaker i (i = 1, ..., L).
As the word/phoneme HMM whose parameters are estimated from the speech data of many speakers, consider an HMM with left-to-right transitions as shown in FIG. 3B; it is built from the speech data of many speakers. This word/phoneme HMM is written as
Mpj = (Upj, V, Tpj, Ppj, Ipj, Fpj) : word/phoneme j (j = 1, ..., M).
[0010]
In the present invention, an HMM is built in the product space of these two HMMs, Msi and Mpj. This combined HMM is called the speaker-constrained word/phoneme HMM and is defined as follows:
Mpji = (Upji, V, Tpji, Ppji, Ipji, Fpji) : word/phoneme (j), speaker (i)
     = (Upj×Usi, V, Tpj×Tsi, Λ(Ppj×Psi), Ipj×Isi, Fpj×Fsi)
The speech of speaker i is then recognized by using this speaker-constrained word/phoneme HMM to find the speaker i and word/phoneme j that maximize the probability. In the expression above, Λ( ) is a scale factor that normalizes the output probabilities so that they sum to one.
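As a rough numerical sketch of this product-space construction (our own illustration, not code from the patent): the combined transitions can be formed as the Kronecker product Tpj × Tsi over state pairs, and the combined output scores as elementwise products Ppj × Psi, renormalized per state pair in the role of Λ( ). All matrix values are illustrative.

```python
import numpy as np

T_s = np.array([[0.5, 0.5],        # 2-state ergodic speaker HMM
                [0.5, 0.5]])
T_p = np.array([[0.6, 0.4, 0.0],   # 3-state left-to-right phoneme HMM
                [0.0, 0.7, 0.3],
                [0.0, 0.0, 1.0]])

# Output probabilities over a 4-symbol codebook, one row per state.
P_s = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.7]])
P_p = np.array([[0.4, 0.4, 0.1, 0.1],
                [0.1, 0.4, 0.4, 0.1],
                [0.1, 0.1, 0.4, 0.4]])

# Product-space transitions: a 6x6 matrix over state pairs (p, s).
T_ps = np.kron(T_p, T_s)

# Combined outputs: elementwise product per state pair, then the
# Lambda() step renormalizes each row so outputs sum to one.
P_ps = np.einsum("pk,sk->psk", P_p, P_s).reshape(6, 4)
P_ps = P_ps / P_ps.sum(axis=1, keepdims=True)

assert np.allclose(T_ps.sum(axis=1), 1.0)
assert np.allclose(P_ps.sum(axis=1), 1.0)
```

Because both factors are stochastic, the Kronecker product is again stochastic, so only the output probabilities need the explicit Λ( ) renormalization.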
[0011]
[Embodiment] As described above, the present invention uses an HMM obtained by combining an HMM representing phonemes/words for unspecified speakers with an ergodic HMM representing speaker characteristics. FIG. 1 shows a configuration example of a six-state speaker-constrained word/phoneme HMM obtained by combining the ergodic speaker HMM with two states 1 and 2 shown in FIG. 2A with the speaker-independent word/phoneme HMM with three states A, B, and C shown in FIG. 2B. The formulas for computing the transition probabilities and output probabilities are given in the figure; the transition and output probabilities are omitted for some transitions, but they can be computed in the same way.
[0012]
FIG. 4A shows an example of a speech recognition apparatus that uses such a speaker-constrained word/phoneme HMM. Speech input from the input terminal 1 is converted into a digital signal in the feature extraction unit 2, subjected to LPC cepstrum analysis, and then vector-quantized frame by frame (10 ms) using the universal codebook. The speaker-model HMM learning unit 3 selects the speaker HMM with the highest likelihood from a plurality of ergodic speaker HMMs stored in advance, and also performs additional training of that HMM on the input speech. Next, in the speaker-constrained phoneme HMM synthesis unit 4, the speaker model HMM and the speaker-independent phoneme model HMM 5 are combined according to the present invention into a speaker-constrained phoneme HMM. The continuous speech recognition unit 6 uses this speaker-constrained phoneme HMM to recognize the content of the input speech and outputs a recognition result 7.
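The front-end step described for FIG. 4A — quantizing each 10 ms frame's feature vector to the nearest codeword of the universal codebook — can be sketched as follows. Random vectors stand in for LPC cepstrum coefficients, and the tiny codebook is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 12))      # 8 codewords, 12 cepstral dims
frames = rng.normal(size=(100, 12))      # 100 frames of features

def vector_quantize(frames, codebook):
    # Index of the nearest codeword (Euclidean distance) for each frame.
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

codes = vector_quantize(frames, codebook)
print(codes.shape)  # one codeword index per frame
```

The resulting codeword indices are the discrete observation symbols consumed by the discrete HMMs discussed above.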
[0013]
In this embodiment, a procedure was shown in which the synthesis unit 4 first builds the speaker-constrained phoneme HMM and continuous speech recognition is performed afterward. It is also possible, however, to build speaker-constrained phoneme HMMs incrementally while continuous speech recognition is in progress.
[0014]
[Effects of the Invention] As described above, according to the present invention, using a speaker HMM makes it possible to constrain a speaker-independent phoneme/word HMM to the codebook space and spectral movement specific to the speaker, so that a high recognition rate can be achieved. With this technique, the speaker no longer needs to produce a large amount of speech data in order to build a speaker-dependent phoneme/word HMM. A speaker model is selected from a small amount of arbitrary speech data and adapted by additional training, and combining this speaker HMM with the phoneme/word HMM makes it possible to build a highly accurate speaker-dependent phoneme/word HMM.
[0015]
The above description has mainly dealt with discrete HMMs, but the method can be applied in the same way to fuzzy-vector-quantization-based HMMs and continuous-density HMMs. Likewise, the present invention is applicable to speech recognition schemes in general in which the constraint is expressed by two HMMs; it can be used, for example, to adapt to noisy environments or to different types of microphones and thereby improve recognition performance. That is, recognition performance can be improved by using, in place of the ergodic HMM of speaker characteristics, an ergodic HMM representing the characteristics of a microphone. Beyond speech recognition, the method can also be applied to any problem in which the constraints are given by two HMMs.
[FIG. 1] A diagram showing a configuration example of a speaker-constrained word/phoneme HMM obtained by combining a speaker-independent word/phoneme HMM with an ergodic speaker HMM.
[FIG. 2] A shows a configuration example of an ergodic speaker HMM; B shows a configuration example of a speaker-independent word/phoneme HMM.
[FIG. 3] A shows a simple example of an ergodic HMM representing speaker characteristics; B shows a simple example of a word/phoneme HMM.
[FIG. 4] A is a block diagram showing an example of a continuous speech recognition system to which the present invention is applied; B shows an example of the relationship between the universal codebook and the space of a speaker-specific codebook.
Claims (1)
[Claim 1] A speech recognition method characterized in that a hidden Markov model representing phonemes/words for unspecified speakers and a hidden Markov model representing the characteristics of a speaker are combined, and speech recognition of said speaker is performed using the combined model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP3049687A JP3036706B2 (en) | 1991-03-14 | 1991-03-14 | Voice recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP3049687A JP3036706B2 (en) | 1991-03-14 | 1991-03-14 | Voice recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
JPH04284498A true JPH04284498A (en) | 1992-10-09 |
JP3036706B2 JP3036706B2 (en) | 2000-04-24 |
Family
ID=12838098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP3049687A Expired - Fee Related JP3036706B2 (en) | 1991-03-14 | 1991-03-14 | Voice recognition method |
Country Status (1)
Country | Link |
---|---|
JP (1) | JP3036706B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0720891A (en) * | 1993-06-25 | 1995-01-24 | Koninkl Ptt Nederland Nv | Method for detection of best path through probable network in order to recognize, especially, voice and image |
Also Published As
Publication number | Publication date |
---|---|
JP3036706B2 (en) | 2000-04-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP2733955B2 (en) | Adaptive speech recognition device | |
JP2826215B2 (en) | Synthetic speech generation method and text speech synthesizer | |
Sisman et al. | A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder. | |
JP4109063B2 (en) | Speech recognition apparatus and speech recognition method | |
JP3050934B2 (en) | Voice recognition method | |
JP6437581B2 (en) | Speaker-adaptive speech recognition | |
JPH0772840B2 (en) | Speech model configuration method, speech recognition method, speech recognition device, and speech model training method | |
JPH05241589A (en) | Voice coder having speaker-dependent prototype generated from non-user reference data | |
JPH0636156B2 (en) | Voice recognizer | |
JP2002304190A (en) | Method for generating pronunciation change form and method for speech recognition | |
US5943647A (en) | Speech recognition based on HMMs | |
JPH08123484A (en) | Method and device for signal synthesis | |
JPH1069290A (en) | Speech processor | |
JP2898568B2 (en) | Voice conversion speech synthesizer | |
Zgank et al. | Predicting the acoustic confusability between words for a speech recognition system using Levenshtein distance | |
KR20220134347A (en) | Speech synthesis method and apparatus based on multiple speaker training dataset | |
JP4461557B2 (en) | Speech recognition method and speech recognition apparatus | |
JPH08211897A (en) | Speech recognition device | |
JP2003330484A (en) | Method and device for voice recognition | |
JPH04284498A (en) | Method of voice recognition | |
JPH10254473A (en) | Method and device for voice conversion | |
JPH1195786A (en) | Method and device for pattern recognition, and recording medium which stores pattern recognition program | |
JP3532248B2 (en) | Speech recognition device using learning speech pattern model | |
JPH04318600A (en) | Voice recognizing method | |
JP2976795B2 (en) | Speaker adaptation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
LAPS | Cancellation because of no payment of annual fees |