JP3256979B2

JP3256979B2 - A method for finding the likelihood of an acoustic model for input speech

Info

Publication number: JP3256979B2
Application number: JP09693591A
Authority: JP
Inventors: 達雄松岡; 清宏鹿野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1991-04-26
Filing date: 1991-04-26
Publication date: 2002-02-18
Anticipated expiration: 2017-02-18
Also published as: JPH04326400A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a method for finding the likelihood, for input speech, of an acoustic model that is used as a model in a speech recognition system and that is obtained by extracting acoustic features of speech and statistically modeling those features.

[0002]

[Prior Art] In a speech recognition system that uses the hidden Markov model (HMM), a technique for modeling the acoustic features of speech probabilistically and statistically, one or more HMMs are set up for each recognition target category, that is, for each vocabulary item (or recognition unit) such as a syllable, word, or phrase, and are trained using training speech. At recognition time, the probability that the input speech is observed from each of these models is computed, and the category of the model with the highest likelihood is judged to be the category of the input speech, or the categories are listed as recognition candidates in descending order of likelihood. Because an HMM is a statistical model, it stores internally, as probability distributions, how strongly a given category is associated with a given acoustic feature, according to how often they appeared in the training speech.

[0003] In a speech recognition system, however, the speech to be recognized is often uttered in a speaking style (for example, fast, slow, loud, or soft), under sound pickup conditions (microphone, transmission characteristics, and so on), or in an ambient noise environment that differs from those of the training speech, and this lowers the recognition rate. In speaker-independent speech recognition, differences between speakers also lower the recognition rate. An HMM can model the acoustic features that appear in the training speech very well, but it cannot model well the acoustic features that appear rarely or not at all in the training speech. Several methods have therefore been proposed for obtaining a high recognition rate in HMM-based speech recognition systems even for speech in various speaking styles (for example, fast, slow, loud, soft, or questioning intonation).

[0004] One such method deliberately collects utterances in various speaking styles, in addition to ordinary utterances, as training speech and trains the HMMs on that speech, so that the models are built to be robust to speech in various speaking styles (references 1 and 2). In practice, however, it is very difficult to collect speech in various speaking styles, and it is impossible to cover every different speaking style, sound pickup condition, ambient noise condition, and so on, so the range of target speech that this method can handle is limited. A further major problem is that the computation time needed to train the HMMs becomes enormous as the amount of training speech grows.

[0005] The other method is applicable only when the HMM is a discrete probability distribution model. With a discrete HMM, the speech signal sequence is represented in coded form, so this method (reference 2) adapts an already trained model to another speaking style by converting the codebook used for that coding with a technique called codebook mapping (reference 3). The method is explained here using the example of adapting a model trained on speech uttered at normal speed to fast speech. Given speech uttered at normal speed and speech uttered fast, codebook 1 is designed from the normal-speed speech and codebook 2 from the fast speech. The normal-speed speech is vector-quantized with codebook 1, and the HMM is trained on the resulting sequences of codebook 1 codewords. Next, normal-speed speech and fast speech with the same utterance content are vector-quantized with codebook 1 and codebook 2 respectively, and the correspondence between the codewords of codebook 1 and those of codebook 2 is found by DP matching. When fast speech is to be recognized, it is vector-quantized with codebook 2, the result is converted into a sequence of codebook 1 codewords using the correspondence between the two codebooks, and the fast speech can then be recognized with the HMM trained using codebook 1.

However, this method also presupposes that speech in various speaking styles is available as training speech. In the example above, both normal-speed and fast speech must be available in quantities large enough to design a codebook for each. Codebook mapping also cannot be performed unless speech with the same utterance content but different speaking rates is available. Moreover, if there is enough fast speech to design a codebook, the HMM itself could be trained on the fast speech. This method therefore shares the training speech collection problem of the previous method. In addition, although the HMM needs to be trained on only one style of utterance, a codebook must be designed for each speaking style, which requires about as much computation time as training the HMM.

References

(1) Lippmann, R., et al., "Multi-style training for robust isolated-word speech recognition," Proceedings of International Conference on Acoustics, Speech and Signal Processing '87, 17.4, pp. 705-708, 1987

(2) Miki, et al., "Speech Recognition Using Adaptation to Utterance Variation," IEICE Technical Report, SP90-19, 1990.6

(3) Shikano, K., et al., "Speaker Adaptation through Vector Quantization," Proceedings of International Conference on Acoustics, Speech and Signal Processing '86, pp. 2643-2646, 1986

An object of the present invention is to provide a method for finding the likelihood of an acoustic model for input speech that greatly reduces the labor of collecting training speech, which has been a major problem when trying to obtain high recognition performance for speech in various speaking styles; that makes unnecessary the processing that required enormous computation time, such as re-estimating HMM parameters and redesigning codebooks; and that, using limited training speech and computation time, achieves high recognition performance even for speech in speaking styles not included in the training speech.

[0006]

[Means for Solving the Problem] According to the present invention, for each recognition target category, n acoustic models are created using n training speech sets (n is an integer of 2 or more) with mutually different statistical properties, one model per set, and these n acoustic models are each weighted by a coupling coefficient and combined into a single acoustic model for that recognition target category.

[0007] To raise the recognition rate further, the coupling coefficients are adapted by training them on part of the speech to be recognized. This training is carried out by training on one part of the training speech, evaluating on the remaining training speech, and repeating the same procedure with a different part, so as to find the optimal coupling coefficients. In this way, a model with high recognition performance for the target speech can be built with several times to several tens of times less training speech and computation time than the conventional methods. Compared with the prior art, the advantages are that (1) the HMMs themselves need not be retrained, (2) the codebook need not be redesigned, and (3) only a fraction, down to about one part in several tens, of the training speech with the same statistical properties as the recognition target is needed compared with the conventional methods.

[0008]

[Embodiments]

Embodiment 1

As one embodiment of the invention, the following describes, with reference to the drawings, the case where two acoustic models created from training speech sets with different statistical properties, namely an HMM trained on speech uttered word by word and an HMM trained on speech uttered phrase by phrase, are used to construct a model adapted to freely uttered continuous speech.

[0009] FIG. 1 shows a speech recognition apparatus to which the invention is applied. The speech input unit 1 converts the analog speech signal A into a digital speech signal B, and the acoustic feature extraction unit 2 extracts acoustic features C (for example, cepstrum, Δcepstrum, and Δpower) from the digital speech signal B. The computation unit 3 calculates the coupling coefficients, the model likelihoods, and so on; the HMM parameters, the coupling coefficients, and so on are stored in the memory 4; and the computation unit 3 outputs the recognition result E. The acoustic feature extraction unit 2 may be implemented in hardware or in software. When it is implemented in software, it may run on the computation unit 3, provided that the computation unit 3 has sufficient computing power.

[0010] FIG. 2 shows the acoustic model obtained by combining these two HMMs, together with the procedure for computing its likelihood. The computation shown in the figure is performed in the computation unit 3 of FIG. 1. A combined model is provided for each recognition target category (vocabulary item). For the acoustic features C of the freely uttered continuous speech to be recognized, the likelihood of the HMM for that vocabulary item trained on word-by-word speech (model 1) is computed (5), and likewise the likelihood of the HMM for that vocabulary item trained on phrase-by-phrase speech (model 2) is computed (6). The likelihoods obtained from model 1 and model 2 are multiplied (7, 8) by the coupling coefficients $\lambda_1$ and $\lambda_2$ respectively, and the two products are added (9); the sum is the likelihood D of the combined model for the input speech. In other words, the combined model is a single acoustic model for the vocabulary item, and the likelihood $P(x)$ that this acoustic model outputs the acoustic feature vector $x$ is given by equation (1), where $P_1(x)$ and $P_2(x)$ are the likelihoods that model 1 and model 2 output $x$. That is, model 1 and model 2 are linearly combined.

[0011]

$$P(x) = \lambda_1 P_1(x) + \lambda_2 P_2(x) \tag{1}$$

where $\lambda_1 + \lambda_2 = 1$.

The coupling coefficients $\lambda_1$ and $\lambda_2$ of equation (1) may be fixed in advance on the basis of the properties of models 1 and 2, or they may be determined so as to suit the recognition target by training them on training speech of the continuous utterances to be recognized. This adaptation is carried out, for example, by iterating equations (2) and (3).
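As an illustration of equation (1), the following is a minimal Python sketch, not the patent's implementation. The function name `combined_likelihood` and the numeric likelihoods are hypothetical; in a real system the per-model likelihoods would come from HMM evaluation, which is not shown here.

```python
def combined_likelihood(likelihoods, weights):
    """Return P(x) = sum_j lambda_j * P_j(x) for one recognition category.

    likelihoods: per-model likelihoods P_j(x) of the same observation x
    weights:     coupling coefficients lambda_j, which must sum to 1
    """
    if len(likelihoods) != len(weights):
        raise ValueError("one coupling coefficient per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("coupling coefficients must sum to 1")
    return sum(w * p for w, p in zip(weights, likelihoods))


# Hypothetical likelihoods of one observation under model 1 and model 2,
# combined with fixed (not adapted) coupling coefficients.
p = combined_likelihood([0.02, 0.08], [0.25, 0.75])
```

The same function covers the case of more than two models, since it only assumes one coefficient per model.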

[0012]

$$c_j = \sum_{i=1}^{N} \frac{\lambda_j P_j(x_i)}{P(x_i)} \qquad (j = 1, 2;\ i = 1, 2, 3, \ldots, N) \tag{2}$$

$$\lambda'_j = \frac{c_j}{\sum_{k} c_k} \tag{3}$$

where N is the number of tokens of the vocabulary item in the adaptation training speech. This is the method called "held-out interpolation", in which the training data is divided in two, the models are trained on one half, and λ is estimated on the remaining data. The estimation is not limited to this method; λ may also be estimated using, for example, "deleted interpolation". In that case it is obtained by repeated re-estimation using the recurrences (2)′ and (3)′.
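The iteration of equations (2) and (3) can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `reestimate_weights`, the stopping tolerance, the iteration cap, and the held-out likelihood values are all hypothetical, and the per-model likelihoods $P_j(x_i)$ are assumed to have been computed beforehand on the held-out tokens.

```python
def reestimate_weights(weights, likelihoods, max_iter=100, tol=1e-8):
    """Iterate equations (2) and (3) until the coefficients stop changing.

    weights:     initial coupling coefficients lambda_j (sum to 1)
    likelihoods: likelihoods[i][j] = P_j(x_i) for held-out token i, model j
    """
    lam = list(weights)
    for _ in range(max_iter):
        counts = [0.0] * len(lam)
        for row in likelihoods:
            # P(x_i) under the current combined model, equation (1)
            total = sum(l * p for l, p in zip(lam, row))
            for j, (l, p) in enumerate(zip(lam, row)):
                counts[j] += l * p / total          # equation (2)
        norm = sum(counts)
        new_lam = [c / norm for c in counts]        # equation (3)
        if max(abs(a - b) for a, b in zip(lam, new_lam)) < tol:
            return new_lam
        lam = new_lam
    return lam


# Hypothetical held-out likelihoods for four tokens under two models.
held_out = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.1, 0.9]]
lam = reestimate_weights([0.5, 0.5], held_out)
```

Only the coupling coefficients change in this loop; the likelihood table is fixed, which reflects the point that the HMM parameters themselves are not retrained.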

[0013]

$$c_j = \sum_{i} \frac{\lambda_j P^{i}_j(x_i)}{P^{i}(x_i)} \tag{2′}$$

$$\lambda'_j = \frac{c_j}{\sum_{k} c_k} \tag{3′}$$

Here $P^{i}_j(x_i)$ and $P^{i}(x_i)$ are the likelihoods of $x_i$ under the models trained with the i-th data item $x_i$ removed. λ′ is substituted into equations (1) and (2)′, and re-estimation is repeated until the values converge.
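To make the leave-one-out idea behind equations (2)′ and (3)′ concrete, here is a toy Python sketch. It deliberately swaps the HMMs for two very simple discrete models, which is an assumption made only for illustration: model 1 is a uniform distribution over a vocabulary of symbols, and model 2 is a relative-frequency model whose estimate for token $x_i$ is computed with $x_i$ itself removed from the counts. The names `loo_likelihoods` and `deleted_interpolation` and the fixed iteration count are hypothetical.

```python
from collections import Counter


def loo_likelihoods(tokens, vocab_size):
    """For each token x_i, return (P1(x_i), P2^i(x_i)).

    P1 is uniform and does not depend on the data. P2^i is the relative
    frequency of x_i estimated from all tokens except x_i itself.
    """
    counts = Counter(tokens)
    n = len(tokens)
    rows = []
    for x in tokens:
        p_uniform = 1.0 / vocab_size
        p_freq = (counts[x] - 1) / (n - 1)   # x_i removed before counting
        rows.append((p_uniform, p_freq))
    return rows


def deleted_interpolation(tokens, vocab_size, iters=50):
    """Re-estimate two coupling coefficients from leave-one-out likelihoods."""
    rows = loo_likelihoods(tokens, vocab_size)
    lam = [0.5, 0.5]
    for _ in range(iters):
        c = [0.0, 0.0]
        for row in rows:
            total = lam[0] * row[0] + lam[1] * row[1]
            for j in range(2):
                c[j] += lam[j] * row[j] / total      # equation (2)'
        s = sum(c)
        lam = [cj / s for cj in c]                   # equation (3)'
    return lam


# The symbol "c" occurs once, so its leave-one-out frequency is zero and
# only the uniform model can explain it.
lam_del = deleted_interpolation(list("aaabbc"), vocab_size=3)
```

In this toy setting a token seen only once forces some weight onto the smoother model, which is the behaviour that leaving the token out is meant to expose.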

[0014] FIG. 3 shows how the recognition result is decided at recognition time. As an example, the case of a recognition vocabulary of five items is shown. The figure shows that the likelihoods, for the acoustic features C of the input speech, of the acoustic models $11_1$ to $11_5$ for vocabulary items 1 to 5 are $D_1$ to $D_5$ respectively. At recognition time, the vocabulary item whose likelihood is the largest among $D_1$ to $D_5$ is taken as the recognition result, or the items are listed as recognition candidates in descending order of likelihood.
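The decision rule of FIG. 3 amounts to sorting the categories by their combined likelihood. A minimal Python sketch follows; the function name `rank_categories`, the category labels, and the likelihood values are hypothetical.

```python
def rank_categories(scores):
    """Return category names sorted by descending combined likelihood D."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical likelihoods D_1..D_5 for a five-item vocabulary.
scores = {
    "item1": 0.012,
    "item2": 0.047,
    "item3": 0.003,
    "item4": 0.031,
    "item5": 0.020,
}
ranking = rank_categories(scores)   # candidate list, best first
best = ranking[0]                   # single recognition result
```

Taking `ranking[0]` gives the single best result, while the full list corresponds to outputting candidates in order of likelihood.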

[0015] FIG. 4 shows the results of a phoneme recognition experiment on the 18 Japanese consonants using a model built with the method of this embodiment, that is, a model trained on word-by-word speech and a model trained on phrase-by-phrase speech, linearly combined and then adapted to continuously uttered target speech. The "batch-trained model" results were obtained with a model trained on word utterances, phrase utterances, and continuous utterances all together. The "combined model" results were obtained with the model of the embodiment above, that is, the linear combination of the word-trained model and the phrase-trained model with its coupling coefficients trained on continuous speech; "no training" under condition (3) means that the coupling coefficients were not adapted using continuous speech. Condition (1) uses 20 sentences of continuous speech in addition to the word and phrase speech, condition (2) uses 10 sentences of continuous speech in addition to the word and phrase speech, and condition (3) uses only word and phrase speech, with no continuous speech. (a), (b), and (c) indicate the type of evaluation speech. The results in FIG. 4 show that in all of conditions (1), (2), and (3) the combined model, that is, the embodiment of the invention, gives the higher recognition rate for continuous speech, and in particular that it gives a high recognition rate even without adaptation training of the coupling coefficients, which demonstrates the effectiveness of the method of the invention.

Embodiment 2

Next, as Embodiment 2 of the invention, the case is described where a context-independent HMM and context-dependent HMMs are used as the models based on training speech sets with different statistical properties, and are combined to make the acoustic model robust to variation in speaking style. The acoustic properties (or acoustic manner) of the phoneme /a/, for example, differ depending on which phonemes precede and follow it. When separate HMMs are provided according to the preceding and following phonemes, those HMMs are called context-dependent HMMs. For example, in the word /akai/ the first /a/ is word-initial and is immediately followed by /k/, and the second /a/ is immediately preceded by /k/ and immediately followed by /i/. When the first /a/ is represented by an HMM named #-a-k and the second /a/ by an HMM named k-a-i, and each model is trained only on instances of /a/ whose context matches, these HMMs are called context-dependent HMMs. When the context of a context-dependent HMM is taken to be one phoneme on each side, the model is called triphone-based. When only one of the preceding and following phonemes is taken into account, it is called biphone-based. Taking more preceding and following phonemes tightens the constraint on the acoustic environment and makes the model more detailed, but each context then occurs less often, and collecting training speech becomes difficult. When, on the other hand, every /a/ is represented by a single HMM /a/ regardless of the surrounding phonemes, that HMM is called a context-independent HMM. A context-independent HMM is trained on /a/ from many different phonemic environments, so the variance of the acoustic properties of the training speech is large, but the phoneme occurs often, so a large amount of training speech can be used.
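The triphone naming used for /akai/ can be generated mechanically. The Python sketch below is only an illustration: the function name `triphone_labels` is hypothetical, and using "#" for the end of the word as well as the start is an assumption, since the text shows "#" only for the word-initial context.

```python
def triphone_labels(phonemes, boundary="#"):
    """Return one left-center-right label per phoneme in the word.

    boundary marks the word edge; the text uses '#' for the word start,
    and the same symbol is assumed here for the word end.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}-{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


labels = triphone_labels(["a", "k", "a", "i"])
# labels == ['#-a-k', 'a-k-a', 'k-a-i', 'a-i-#']
```

The first and third labels are the two context-dependent models of /a/ discussed for /akai/; a context-independent model would map both to the single label "a".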

[0016] FIG. 5 shows the process of computing the likelihood of a model in which a context-independent HMM and context-dependent HMMs are linearly combined. Model $12_1$ is the context-independent HMM, and models $12_2$ to $12_M$ are (M−1) context-dependent HMMs. For example, when M = 3, model $12_2$ is a biphone-based and model $12_3$ a triphone-based context-dependent HMM. Still more detailed context-dependent HMMs, such as quadraphone-based (four-phoneme chain) and quintphone-based (five-phoneme chain) models, may also be considered. As in Embodiment 1, the likelihood $P(x)$ that the combined model outputs the acoustic feature vector $x$ is given by the following equation, where $P_1(x)$ to $P_M(x)$ are the likelihoods that models $12_1$ to $12_M$ output $x$, and $\lambda_1$ to $\lambda_M$ are the coupling coefficients by which they are multiplied at $13_1$ to $13_M$ respectively.

[0017]

$$P(x) = \lambda_1 P_1(x) + \lambda_2 P_2(x) + \cdots + \lambda_M P_M(x) \tag{4}$$

where $\lambda_1 + \lambda_2 + \cdots + \lambda_M = 1$.

The coefficients $\lambda_1$ to $\lambda_M$ of equation (4) are obtained by iterating the following equations (5) and (6).

$$c_j = \sum_{i} \frac{\lambda_j P_j(x_i)}{P(x_i)} \tag{5}$$

$$\lambda'_j = \frac{c_j}{\sum_{k} c_k} \tag{6}$$

Here, with M = 3, let $P_1(x_i)$ be the likelihood under the context-independent HMM, $P_2(x_i)$ the likelihood under the biphone-based context-dependent HMM, and $P_3(x_i)$ the likelihood under the triphone-based context-dependent HMM. The likelihood $P(x_i)$ of the model combining them, for the observation vector $x_i$, is then given by equation (7).

[0018]

$$P(x_i) = \lambda_1 P_1(x_i) + \lambda_2 P_2(x_i) + \lambda_3 P_3(x_i) \tag{7}$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$.

This model is adapted to the target speech by training $\lambda_1$, $\lambda_2$, and $\lambda_3$ on part of the speech to be recognized.

Embodiment 3

HMMs trained separately for each speaker can be used as the HMMs obtained from training speech sets with different statistical properties, and combined so as to adapt to the utterances of a new speaker. The models trained on the speech of different speakers are combined as shown in FIG. 5, and the coupling coefficients are adapted by training on the speech of the speaker to be recognized. The coupling coefficients are trained in the adaptation in the same way as in Embodiment 2.

Embodiment 4

HMM-based speech recognition, in which the acoustic models are trained on training speech, is easily affected by differences between the sound pickup system used for the training speech and the one used for the actual target speech. To build a recognition system that is robust to such differences, the method of the invention linearly combines models each trained on speech picked up by one of several different sound pickup systems, these being the training speech sets with different statistical properties, and adapts the coupling coefficients by training on speech picked up by the sound pickup system of the target speech. This reduces the influence of the sound pickup system.
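Embodiments 2 to 4 all use the M-model combination of equations (4) and (7). In practice HMM likelihoods of whole utterances are often extremely small numbers, so an implementation might evaluate the combination in the log domain. The sketch below is an assumption about how that could be done and is not described in the patent; the function name `log_combined_likelihood` and the log-likelihood values are hypothetical.

```python
import math


def log_combined_likelihood(log_likelihoods, weights):
    """Return log P(x) where P(x) = sum_j lambda_j * P_j(x).

    log_likelihoods: log P_j(x) for each of the M models
    weights:         coupling coefficients lambda_j (sum to 1)
    Uses the log-sum-exp trick so that very small likelihoods do not
    underflow to zero. Models with a zero coefficient are skipped.
    """
    terms = [
        math.log(w) + lp
        for w, lp in zip(weights, log_likelihoods)
        if w > 0.0
    ]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Hypothetical log-likelihoods of one utterance under a context-independent,
# a biphone-based, and a triphone-based model (M = 3).
log_p = log_combined_likelihood([-1200.0, -1195.0, -1190.0], [0.2, 0.3, 0.5])
```

Exponentiating these log-likelihoods directly would underflow to zero in double precision, which is why the largest term is factored out before summing.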

[0019]

[Effects of the Invention] According to the invention, an HMM acoustic model is built from each of several training speech sets with different statistical properties, and for each category these acoustic models are weighted by coupling coefficients and combined into a single acoustic model for that category, which raises the recognition rate. In particular, by adapting the coupling coefficients on the target speech, an HMM acoustic model can be constructed that is adapted to target speech whose statistical properties differ from those of the training speech. Adaptation training learns only the coupling coefficients and not the parameters of the HMMs themselves, so the number of parameters to be learned is very small, and adaptation is therefore possible with little training speech and a short computation time.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the general configuration of a speech recognition system.

FIG. 2 is a diagram showing the linear combination of two models in one example of the invention.

FIG. 3 is a diagram explaining how the recognition result is decided at recognition time using the combined model obtained in FIG. 2.

FIG. 4 is a diagram showing recognition results for the 18 Japanese consonants.

FIG. 5 is a diagram showing the linear combination of multiple models according to the invention in the general case.

Continuation of the front page

(56) References: JP-A-61-183696 (JP, A); JP-A-62-5300 (JP, A); JP-A-61-180298 (JP, A); JP-A-61-121093 (JP, A); IEICE Technical Report (SP 89 46-53), pp. 17-24, "Examination of Japanese Phoneme Recognition by Continuous Output Distribution HMM"

(58) Fields investigated (Int. Cl.⁷, DB name): G10L 15/14, G10L 15/06

Claims

(57) [Claims]

1. A method for finding the likelihood, for input speech, of an acoustic model obtained by statistically modeling acoustic features of speech, characterized in that, for each recognition target category, the individual likelihood for the input speech is obtained for each of n individual acoustic models created using n training speech sets (n is an integer of 2 or more) with different statistical properties, and the individual likelihoods are weighted by coupling coefficients determined in advance according to the properties of each of the acoustic models and combined to find the desired likelihood.

2. The method for finding the likelihood of an acoustic model for input speech according to claim 1, wherein the coupling coefficients are adapted by training using a part of the speech to be recognized.