JPH10319986A

JPH10319986A - Acoustic model creating method

Info

Publication number: JPH10319986A
Application number: JP9127062A
Authority: JP
Inventors: Takatoshi Sanehiro; 貴敏實廣; Shoichi Matsunaga; 昭一松永; Shigeki Sagayama; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-05-16
Filing date: 1997-05-16
Publication date: 1998-12-04

Abstract

PROBLEM TO BE SOLVED: To create an acoustic model (HMM) that supports flexible recognition without being influenced by the peculiarities of the training speech data from which it is built. SOLUTION: Training speech data 17 are converted into a feature parameter sequence (18), and this sequence is recognized by a speech recognizer with no vocabulary constraint that allows free concatenation of phoneme units (32). The recognizer yields not only the correct answer but also several rival candidates close to it. The parameters of an initial acoustic model 31 are then corrected, using the probabilities of these candidates, so that errors decrease (33). Recognition is performed again with the corrected acoustic model 34 and the process is repeated; in short, discriminative training is performed by the minimum classification error (MCE) method.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method of creating the models used in a pattern recognition method in which the features of each category are modeled in advance and input data are recognized by computing the probability of each model for an input feature sequence.

[0002]

[Prior Art] Methods that use probabilistic models grounded in probability and statistics are useful in pattern recognition of speech, characters, figures, and the like. The following describes the prior art using hidden Markov models (hereinafter HMMs), taking speech recognition as the example. Hidden Markov models are described in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," edited by the Institute of Electronics, Information and Communication Engineers (1988).

[0003] In conventional speech recognition apparatuses, the approach of modeling each category of a speech unit (phoneme, syllable, word, etc.) with an HMM performs well and is currently the mainstream. FIG. 4 shows an example of the functional configuration of a conventional HMM-based speech recognition apparatus. Speech input from the input terminal 11 is converted into a digital signal by the A/D converter 12. The speech feature parameter extraction unit 13 extracts speech feature parameters from the digital signal. The HMMs created in advance for each speech unit are read from the model parameter memory 14, and the model probability calculation unit 15 calculates the probability of each model for the input speech. The speech unit represented by the model with the highest probability is output from the recognition result output unit 16 as the recognition result.
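The decision rule at the end of this pipeline (output the unit whose model scores highest) can be sketched as follows. This is only an illustration: the function name `recognize` and the log-likelihood values are made up, and a real apparatus would obtain the scores from the HMMs in the model parameter memory.

```python
def recognize(log_likelihoods):
    """Return the speech unit whose model gives the highest log-likelihood.

    log_likelihoods: dict mapping a unit label to the log-probability
    that its model assigns to the input feature sequence.
    """
    return max(log_likelihoods, key=log_likelihoods.get)


# Hypothetical scores for three phoneme models on one input.
scores = {"a": -120.5, "i": -98.2, "u": -101.7}
best = recognize(scores)
```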

[0004] The HMM most commonly used as an acoustic model today has three states and three loops. An HMM is created for each speech unit (in general a word, phoneme, syllable, or the like). Each state is assigned a statistical probability distribution over the speech feature parameters. In the current mainstream, phonemes or syllables rather than words are used as the speech unit, and their HMMs are concatenated according to the vocabulary to be recognized. Before use in the recognition apparatus, the acoustic model is generated from the acoustic model training speech data 17. The training data from the database 17 are converted into feature parameters by the speech feature parameter extraction unit 18, and the acoustic model parameter learning unit 19 uses them to train the model, starting from the initial model obtained by the initial acoustic model generation unit 21. The model parameters obtained here are used in the recognition apparatus.

[0005] In the field of pattern recognition, there is a method in which each category is represented by a probability distribution and the distributions are estimated so as to reduce classification errors and thereby raise the classification rate. In speech recognition, the MCE method has been proposed as one way of applying this to HMMs: easily confused models are trained so as to minimize classification errors between them, which improves recognition accuracy. The method is outlined in Chapter 5, B.-H. Juang, W. Chou and C.-H. Lee, "Statistical and Discriminative Methods for ASR," of Chin-Hui Lee, Frank K. Soong and Kuldip K. Paliwal, "Automatic Speech and Speaker Recognition: Advanced Topics" (Kluwer Academic Publishers, 1996).

[0006] In this kind of discriminative training, when the recognition target contains a single modeled unit, the correct category and the rival categories are simply trained so that errors become small. When the recognition target contains several modeled units, the input is segmented with the models, and each section is trained so as to minimize the classification error between the correct category and the rival categories.

[0007]

[Problem to Be Solved by the Invention] With this discriminative training, word HMMs and subword HMMs for units smaller than a word (in Japanese, phonemes, syllables, etc.) can be built and recognition accuracy improved. With word HMMs, however, the recognizable vocabulary is fixed. Even with subword units, merely recognizing the training speech and using its recognition errors as rival candidates fixes the words and yields a task-dependent model: if the training speech data consist only of city names, the model depends on that data, and if they consist only of digit utterances, it depends on that data. In that case no accuracy gain can be guaranteed when the recognition task differs from the task of the training data. In particular, in ordinary vocabulary-based speech recognition, even when the alignment between categories does not match accurately, a locally high probability produces a locally high score that becomes dominant, so that the candidate wins and a recognition error often results. Discriminative training with such rival candidates may end up learning from partially distorted segments. On recognition tasks other than the training task, accuracy can then fall below what it was before discriminative training.

[0008] The present invention was made in view of the above drawbacks of the conventional methods, and its object is to provide a method of creating, by discriminative training, a general-purpose acoustic model that does not depend on the training task.

[0009]

[Means for Solving the Problem] In the present invention, in order to create a general-purpose speech model by discriminative training, the recognition results to be discriminated against are those generated by a speech recognizer with no vocabulary constraint, that is, one that allows free concatenation of phonemes or other acoustic units. A speech recognizer with no vocabulary constraint here means a recognition method that has no vocabulary and permits arbitrary sequences of phonemes, syllables, or the like, so that any word or sentence can be recognized.

[0010] The recognition accuracy of such recognizers is at present generally low. However, their partial errors are thought to be biased in many cases. If this general error tendency can be learned discriminatively, the resulting speech model can be used for general purposes. As a training method, it is also conceivable to restrict the training sections to the erroneous portions only.

[0011]

[Embodiments of the Invention] As an embodiment of the present invention, an example in which the discriminative training is MCE is described below. First, the general formulation of MCE is given. The description follows the formulation in the APPENDIX of C.-S. Liu, C.-H. Lee, W. Chou, B.-H. Juang and A. E. Rosenberg, "A study on minimum error discriminative training for speaker recognition," J. Acoust. Soc. Am. 97(1), pp. 637-648, January 1995.

[0012] For the parameter set P of the acoustic model, let the discriminant function g_k(X, P) be the log-likelihood of the observation vector sequence X for class (model) k, and define the misclassification measure as

$$d_k(X, P) = -g_k(X, P) + G_k(X, P) \qquad (1)$$

where

$$G_k(X, P) = \log\left[\frac{1}{K-1}\sum_{j \neq k} \exp\{e\, g_j(X, P)\}\right]^{1/e} \qquad (2)$$

The sum runs over every value of j other than j = k, and G_k corresponds to the likelihood of the rival candidates (easily confused models) for the correct class (model) k. Here K is the number of rival candidates and e is a constant. The loss function L is defined with a sigmoid function as follows.
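As a rough numerical illustration of Eqs. (1) and (2), the sketch below computes G_k and d_k from plain lists of log-likelihoods. The function names are hypothetical, and the normalizing factor 1/(K−1) is taken here to be one over the number of rival scores supplied, which is an assumption about how the candidates are counted.

```python
import math


def rival_score(g_rivals, e=1.0):
    """G_k of Eq. (2): (1/e) * log of the mean of exp(e * g_j) over rivals."""
    m = max(g_rivals)  # shift by the maximum for numerical stability
    s = sum(math.exp(e * (g - m)) for g in g_rivals)
    return m + (math.log(s) - math.log(len(g_rivals))) / e


def misclassification(g_correct, g_rivals, e=1.0):
    """d_k of Eq. (1): grows as the rivals outscore the correct class."""
    return -g_correct + rival_score(g_rivals, e)


# Correct class scores -10; both rivals score -12, so d_k is negative.
d = misclassification(-10.0, [-12.0, -12.0])
```

With a large constant e, G_k approaches the best rival score, so the measure is dominated by the single most confusable candidate; with a small e, all rivals contribute more evenly.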

[0013]

$$L_k(X; P) = L(d_k) = \frac{1}{1 + \exp\{-a\,(d_k + b)\}} \qquad (3)$$

Here a and b are constants. The parameter set P consists of the transition probabilities A, the mixture weights C, the means μ, and the variances U. To keep them from becoming negative, the transition probabilities, weights, and variances are formulated in the logarithmic domain, and the resulting parameter set is denoted P′. The parameters of the acoustic model are updated by the following equation.
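The sigmoid loss of Eq. (3) can be sketched as below. The function name `mce_loss` and the default values a = 1, b = 0 are illustrative assumptions; the source only states that a and b are constants.

```python
import math


def mce_loss(d, a=1.0, b=0.0):
    """Sigmoid loss of Eq. (3): near 0 for a confident correct decision
    (d very negative), near 1 for a clear error (d very positive)."""
    return 1.0 / (1.0 + math.exp(-a * (d + b)))


loss_at_boundary = mce_loss(0.0)  # exactly on the decision boundary
```

Because the loss is a smooth approximation of the 0/1 error count, it can be differentiated with respect to the model parameters, which is what the update rule relies on.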

[0014]

$$P'_{t+1} = P'_t - \epsilon_t V_t \nabla L_k(X; P')\big|_{P' = P'_t} \qquad (4)$$

Here ε_t is the learning step size, a small positive real number, and V_t is a positive definite matrix. The small change ∇L_k(X; P′) evaluated at P′ = P′_t is controlled through ε_t and V_t to update the parameters P′. The optimum point is found by moving the parameters P′ little by little.
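A minimal sketch of one update of Eq. (4) follows, assuming for simplicity that V_t is diagonal (the identity by default). The function name `gpd_step` and the list-based parameter layout are hypothetical choices for illustration.

```python
def gpd_step(params, grad, step, v_diag=None):
    """One update of Eq. (4): P'_{t+1} = P'_t - step * V_t * grad.

    params, grad: lists of floats of equal length.
    v_diag: diagonal of the positive definite matrix V_t (assumed diagonal).
    """
    if v_diag is None:
        v_diag = [1.0] * len(params)
    return [p - step * v * g for p, v, g in zip(params, v_diag, grad)]


updated = gpd_step([1.0, 2.0], [0.5, -1.0], step=0.1)
```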

[0015] For each class (model) C_j, j = 1, …, K, the gradient ∇L_k(X; P′) is given by

$$\nabla_{r_j} L_k(X; P') = \frac{\partial L_k}{\partial d_k}\,\frac{\partial d_k}{\partial g_j}\,\nabla_{r_j}\left[g_j(X; P')\right] \qquad (5)$$

where L_k is as defined in Eq. (3) and ∇_{r_j} denotes the gradient with respect to the parameters of class j. The individual factors are

$$\frac{\partial L_k}{\partial d_k} = a\,L_k\,(1 - L_k) \qquad (6)$$

$$\frac{\partial d_k}{\partial g_j} = \begin{cases} -1, & j = k \\ \left[\sum_{n \neq k} \exp\{e\, g_n(X, P)\}\right]^{-1} \exp\{e\, g_j(X, P)\}, & j \neq k \end{cases} \qquad (7)$$

The sum runs over every value of n other than n = k. The case j = k gives the weight for the correct answer (candidate), and the case j ≠ k gives each rival candidate a weight according to the size of its likelihood. As can be seen from Eq. (4), for the correct answer the second term on the right-hand side takes a positive sign and the parameters as a whole move toward the parameters estimated by training, whereas for a rival candidate it takes a negative sign and the parameters move away from the estimated parameters.
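The two scalar factors in Eqs. (6) and (7) can be sketched as follows. The function names are hypothetical, and classes are indexed by list position purely for illustration.

```python
import math


def dL_dd(loss, a=1.0):
    """Eq. (6): slope of the sigmoid loss, largest near the decision boundary."""
    return a * loss * (1.0 - loss)


def dd_dg(g, k, e=1.0):
    """Eq. (7): -1 for the correct class k; for each rival, its share of
    the exponentiated scores summed over all rivals."""
    rivals = [j for j in range(len(g)) if j != k]
    m = max(g[j] for j in rivals)  # shift for numerical stability
    z = sum(math.exp(e * (g[j] - m)) for j in rivals)
    return [-1.0 if j == k else math.exp(e * (g[j] - m)) / z
            for j in range(len(g))]


# Class 0 is correct; classes 1 and 2 are rivals with different scores.
weights = dd_dg([-10.0, -11.0, -13.0], k=0)
```

The rival weights sum to one, so a rival that scores much lower than the best rival contributes little to the update.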

[0016] Consider the case where the log-likelihood g_j(X, P) comes from Gaussian mixture distributions. ∇g_j is computed for the transition probabilities, means, variances, and weights as follows. For the transition probability a_{s1,s2} from state S1 to state S2 of class j,

$$\frac{\partial g_j}{\partial a'_{s1,s2}} = \sum_{t=1}^{T} \delta(q_{t-1} - S1)\,\delta(q_t - S2), \quad S1 = 1, \ldots, N, \quad S2 = S1 \text{ or } S1 + 1 \qquad (8)$$

For the observation probability parameters B_{s1m} = {μ_{s1m}, c_{s1m}, U_{s1m}} of the m-th mixture component of state S1 of class j,

$$\frac{\partial g_j}{\partial B_{s1m}} = \sum_{t=1}^{T} \frac{1}{b_{q_t}(x_t)}\,\frac{\partial b_{q_t}(x_t)}{\partial B_{s1m}}\,\delta(q_t - S1) \qquad (9)$$

For the mean μ_{s1m},

$$\frac{\partial b_{q_t}(x_t)}{\partial \mu_{s1m}} = c_{s1m}\,N(x_t \mid \mu_{s1m}, U_{s1m})\,\frac{x_t - \mu_{s1m}}{U_{s1m}} \qquad (10)$$

For the variance U′_{s1m},

$$\frac{\partial b_{q_t}(x_t)}{\partial U'_{s1m}} = c_{s1m}\,N(x_t \mid \mu_{s1m}, U_{s1m}) \cdot \frac{1}{2}\left\{\frac{(x_t - \mu_{s1m})^2}{U_{s1m}} - 1\right\} \qquad (11)$$

For the mixture weight c′_{s1m},

$$\frac{\partial b_{q_t}(x_t)}{\partial c'_{s1m}} = c_{s1m}\,N(x_t \mid \mu_{s1m}, U_{s1m}) \qquad (12)$$

The acoustic model is created by discriminative error training in this way. When a task-dependent acoustic model is to be created, recognition results matched to the task or vocabulary suffice, but when a task-independent acoustic model is to be created, the results of a speech recognizer with no vocabulary constraint are used.
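The mean and variance derivatives in Eqs. (10) and (11) can be checked against finite differences. The sketch below does this for a single scalar dimension; the function names are hypothetical, and the primed variance is taken to be U′ = log U, following the statement that variances are formulated in the logarithmic domain.

```python
import math


def gauss(x, mu, var):
    """Scalar Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def db_dmu(x, c, mu, var):
    """Eq. (10): derivative of c * N(x | mu, var) with respect to the mean."""
    return c * gauss(x, mu, var) * (x - mu) / var


def db_dlogvar(x, c, mu, var):
    """Eq. (11): derivative with respect to U' = log(var)."""
    return c * gauss(x, mu, var) * 0.5 * ((x - mu) ** 2 / var - 1.0)


# Central finite differences as an independent check of both formulas.
x, c, mu, var, h = 1.3, 0.4, 0.5, 2.0, 1e-6
num_mu = (c * gauss(x, mu + h, var) - c * gauss(x, mu - h, var)) / (2 * h)
lv = math.log(var)
num_lv = (c * gauss(x, mu, math.exp(lv + h))
          - c * gauss(x, mu, math.exp(lv - h))) / (2 * h)
```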

[0017] FIG. 1 shows the discriminative training procedure according to the present invention. The feature parameters of the training speech data 17 are extracted by the extraction unit 18, and the resulting feature parameter sequence is recognized by the vocabulary-free speech recognition unit 32 using the initial acoustic model parameters 31. This yields multiple candidates and their probabilities. The discriminative error training acoustic parameter estimation unit 33 estimates parameters so as to minimize the misclassification measure d_k or the loss L_k defined above, and updates the acoustic model parameters. With the updated acoustic model parameters 34, the vocabulary-free speech recognition unit 32 again recognizes the feature parameter sequence of the training speech data. Repeating this yields an optimal acoustic model. Of the acoustic model parameters 34, the variances and mixture weights take values biased toward the training data when the amount of training data is small, which may instead reduce accuracy. When the training data are scarce, it is possible to train only the means, which are the most effective of the parameters.
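The recognize-then-update loop can be illustrated with a toy example in which only the means are trained. Everything here is a simplifying assumption rather than the method of the source: each class is a one-dimensional unit-variance Gaussian, the best-scoring wrong class stands in for the candidates of the vocabulary-free recognizer, and the names `train_means`, `step`, and `epochs` are hypothetical.

```python
import math


def train_means(data, means, step=0.5, a=1.0, epochs=20):
    """Toy mean-only MCE training.

    data: list of (x, correct_class); means: dict class -> mean.
    Returns a new dict of updated means; the input dict is not modified.
    """
    means = dict(means)
    for _ in range(epochs):
        for x, k in data:
            # Log-likelihood of x under each unit-variance Gaussian
            # (constant term dropped, since it cancels in d).
            g = {c: -0.5 * (x - m) ** 2 for c, m in means.items()}
            # Stand-in for the recognizer: the best-scoring wrong class.
            j = max((c for c in g if c != k), key=g.get)
            d = -g[k] + g[j]                         # Eq. (1), one rival
            loss = 1.0 / (1.0 + math.exp(-a * d))    # Eq. (3) with b = 0
            w = a * loss * (1.0 - loss)              # Eq. (6)
            means[k] += step * w * (x - means[k])    # pull correct model closer
            means[j] -= step * w * (x - means[j])    # push rival model away
    return means


data = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (3.0, "b")]
initial = {"a": 2.0, "b": 2.5}
trained = train_means(data, initial)
```

In this toy setting the mean of class "a" is drawn toward its own samples while the mean of class "b" is pushed away from them, which is the qualitative behavior described for the correct class and the rival candidates.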

[0018] The speech section used for training, that is, for parameter estimation, depends on how the observation vector sequence X in Eq. (1) is defined, and the characteristics of the resulting acoustic model differ accordingly. The description below uses the phoneme as the segment unit, but when HMMs are actually used the segment unit is the state. If one phoneme HMM consists of three states, one phoneme section is segmented by a sequence of three states. In other words, the parameters are updated so that errors decrease between the corresponding states of the correct phoneme and the erroneous phoneme (rival candidate).

[0019] (a) In the method corresponding to the invention of claim 4, the observation vector sequence X in Eq. (1) is the entire input utterance, and the multiple candidates from vocabulary-free speech recognition are used as the rival candidates. As shown in FIG. 2(a), parameters are estimated in every segment obtained for each candidate. In this case a segment may also contain correctly recognized portions, and when the correct sections outnumber the erroneous ones, the errors are not reflected much in the training.

[0020] (b) In the method corresponding to the invention of claim 5, the multiple candidates from vocabulary-free speech recognition are used as the rival candidates, and, as shown in FIG. 2(b), parameters are estimated using only the segment sections that differ from the correct answer (the hatched sections in the figure). Because the multiple candidates obtained from a vocabulary-free recognizer are often only partially wrong, training can emphasize the portions that are easily confused. In this case Eq. (1) is modified as follows.

[0021]

$$d_k(X_{kj}, r_k, r_j) = -g_k(X_{kj}, r_k) + g_j(X_{kj}, r_j), \quad j \neq k \qquad (13)$$

Here r_k is the parameter set for the correct partial class (model) k, and r_j is the parameter set for the partial class (model) j of a rival candidate in which a section differing from the correct answer exists; both are subsets of P. X_{kj} is, for the correct partial class k, the segment section within the rival candidate that differs from the correct answer, and it corresponds to class j. That is, X in Eq. (1) is replaced with X_{kj}, so that d_k takes only the erroneous sections into account. For example, for the correct segment "0", only its initial portion is used for the parameter update.
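One way to picture the sections X_kj is to compare frame-level labels of the correct transcription and a rival candidate and keep only the frame ranges where they disagree. The sketch below assumes such frame-aligned label sequences are available; the function name and the example labels are hypothetical.

```python
def differing_spans(correct, candidate):
    """Return [start, end) frame ranges where the candidate's label
    differs from the correct label at the same frame."""
    spans, start = [], None
    for t, (c, h) in enumerate(zip(correct, candidate)):
        if c != h and start is None:
            start = t                      # a differing section begins
        elif c == h and start is not None:
            spans.append((start, t))       # the differing section ends
            start = None
    if start is not None:                  # section runs to the last frame
        spans.append((start, min(len(correct), len(candidate))))
    return spans


# Nine frames; the candidate mislabels frames 3 and 4 only.
spans = differing_spans(list("aaakkkooo"), list("aaattkooo"))
```

Only the frames inside these ranges would then enter the measure of Eq. (13), so correctly recognized portions do not dilute the update.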

[0022] (c) For comparison, a method is described that does not use a vocabulary-free speech recognizer but instead replaces each correct phoneme segment, as it stands, with another phoneme. As shown in FIG. 2(c), each correct phoneme section corresponds to X, and the rival candidates are the phonemes, other than the correct one, whose likelihood in that section is close to it. This method allows simple training without using the results of a vocabulary-free recognizer. For example, the section of the correct phoneme i is scored with the class of a rival candidate a, the same section is likewise scored with every other class, rival candidates are chosen in descending order of these scores, and parameter estimation is performed using the parameters of the chosen rival candidates. Next, experiments demonstrating the effect of the present invention are described.
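The candidate selection of method (c) amounts to ranking every non-correct phoneme by its score on the correct segment. A minimal sketch, with a hypothetical function name, made-up scores, and an assumed cutoff of n rivals:

```python
def pick_rivals(segment_scores, correct, n=2):
    """Rank all phonemes except the correct one by their score on the
    segment and return the n best as rival candidates."""
    rivals = {p: s for p, s in segment_scores.items() if p != correct}
    return sorted(rivals, key=rivals.get, reverse=True)[:n]


# Hypothetical log-likelihoods of each phoneme model on a segment of "i".
scores = {"i": -50.0, "e": -52.0, "a": -60.0, "u": -55.0}
rivals = pick_rivals(scores, correct="i")
```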

[0023] The experimental conditions were a 12 kHz sampling frequency, a 32 ms frame length, and an 8 ms frame period. The features were a 16th-order selective linear prediction cepstrum, a 16th-order Δcepstrum, and Δpower. The initial model consisted of context-dependent phoneme HMMs with about 450 states, with 4 mixture components per state, covering 27 phonemes. The training data for the initial model were the ATR database Set A, comprising 216 phoneme-balanced words and 5240 important words, each spoken by 10 male and 10 female speakers, together with the 503 sentences of the Acoustical Society of Japan database spoken by 30 male and 34 female speakers, 9,600 sentences in total. A general-purpose acoustic model was created from these data.
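The framing conditions translate into sample counts as sketched below. Two points are assumptions rather than statements from the source: that a 16th-order cepstrum contributes 16 coefficients (giving a 33-dimensional feature vector with the Δcepstrum and Δpower), and that frames are taken without padding at the end of the signal.

```python
SAMPLE_RATE_HZ = 12_000
FRAME_MS = 32
SHIFT_MS = 8

frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000    # samples per frame
frame_shift = SAMPLE_RATE_HZ * SHIFT_MS // 1000  # samples between frame starts
feature_dim = 16 + 16 + 1                        # cepstrum + delta cepstrum + delta power


def num_frames(num_samples):
    """Number of full frames in a signal, assuming no padding."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift


frames_per_second = num_frames(SAMPLE_RATE_HZ)   # frames in one second of audio
```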

[0024] For discriminative training, 216 words from the ATR database Set C, each spoken by 10 male and 10 female speakers, were used. Creating a general-purpose model requires training data that cover a wide range of phonemes and phoneme contexts, so the 216 phoneme-balanced words were used. Evaluation was carried out on (a) the trained task, (b) an untrained task, (c) a speech recognizer with no vocabulary constraint, and (d) a word speech recognizer (a speech recognizer with a vocabulary constraint). For the trained task, the evaluation data were the same 216 words of Set C spoken by speakers not used in training, 5 male and 5 female. For the untrained task, 100 city names recorded in a different environment, spoken by 5 male and 4 female speakers, were used. In the word recognition experiment on the 100 city names, station names and the like were added to make the recognition task harder, giving a vocabulary of 1202 words.

[0025] The results are shown in FIG. 3. In the table, "without long vowels" means that an answer was counted as correct whether or not long vowels were distinguished, and "with long vowels" means that an answer was counted as correct only when long and short vowels were also correct. For both the trained task (216 words) and the untrained task (100 city names), the accuracy of vocabulary-free speech recognition improved. On the trained task, methods (a) and (b) both improved by about 2% "without long vowels" and by about 3% "with long vowels." For method (c) the improvement was slightly smaller. On the untrained task, "with long vowels," method (a) improved by 13.1%, method (b) by 7.4%, and method (c) by 11.7%. The improvement in the long vowel portions is particularly large.

[0026] In word recognition, accuracy on the trained task stayed about the same or rose slightly. On the untrained task, all three methods improved by about 2%. This shows that a task-independent acoustic model was obtained. Comparing methods (a), (b), and (c), method (a) was relatively the best in every evaluation. The discriminative training is not limited to the example given above; any procedure that searches for easily confused candidates and updates the parameters so that errors against them decrease may be used.

[0027]

[Effect of the Invention] As described above, according to the present invention, a general-purpose acoustic model whose accuracy does not depend on the training task can be created.

[Brief description of the drawings]

FIG. 1 is a functional block diagram showing the flow of discriminative training using the vocabulary-free speech recognizer of the present invention.

FIG. 2 is a diagram showing, for each method, the training sections targeted when the acoustic model of the present invention is discriminatively trained.

FIG. 3 is a diagram showing experimental results that illustrate the effect of the present invention.

FIG. 4 is a block diagram showing the functional configuration of a speech recognition apparatus.

Claims

[Claims]

1. A method for creating a general-purpose acoustic model from an initial probabilistic acoustic model, comprising: extracting, from training speech data, speech feature parameters corresponding to the probabilistic acoustic model; performing, on the training speech, vocabulary-free speech recognition that allows free concatenation of units smaller than words, such as phonemes or syllables; and using the recognition results to find rival candidates that differ from the correct recognition result, and updating the initial acoustic model by discriminative training so as to reduce errors.

2. The method according to claim 1, wherein the stochastic acoustic model is a hidden Markov model.

3. The acoustic model creation method according to claim 2, wherein the discriminative training is minimum classification error training (MCE: Minimum Classification Error).

4. The acoustic model creation method according to claim 1, wherein the discriminative training learns the parameters of the acoustic model using the respective results of the correct answer and the rival candidates in all segment sections.

5. The acoustic model creation method according to claim 1, wherein the discriminative training learns the parameters of the acoustic model using the results of the rival candidates only in the segment sections that differ from the correct answer.