JPH08123465A

JPH08123465A - Adapting method for acoustic model

Info

Publication number: JPH08123465A
Application number: JP6264097A
Authority: JP
Inventors: Tatsuo Matsuoka; 達雄松岡; Sadahiro Furui; 貞煕古井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1994-10-27
Filing date: 1994-10-27
Publication date: 1996-05-17

Abstract

PURPOSE: To raise the recognition rate using only a small amount of training speech and a small amount of computation. CONSTITUTION: A semi-continuous HMM is trained on speaker-independent training speech; its base distributions are stored in a codebook 15, and the weighting coefficients for the respective base distributions are stored in a weighting coefficient memory 16. In addition, weighting coefficients trained over all phonemes, independently of any individual phoneme, are stored as all-phoneme-model weighting coefficients 19. Training speech of the kind to be recognized is then input and, using the all-phoneme-model weighting coefficients 19, only the base distributions in the codebook 15 are adapted and stored in a codebook 17. At recognition time, input speech is recognized using the base distributions in the codebook 17 and the weighting coefficients in the weighting coefficient memory 16.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a method of adapting an acoustic model, which serves as the reference pattern in speech recognition and has been trained in advance on speech recorded in a different environment (training speech), to speech whose properties differ from those of the training speech, for example speech affected by the characteristics of a particular sound pickup system or line, or speech from a particular speaker.

[0002]

[Prior Art] In a speech recognition system that uses hidden Markov models (HMMs), a technique for modeling the acoustic features of speech probabilistically and statistically, one or more HMMs are set up for each recognition category, that is, for each vocabulary item or recognition unit such as a phoneme, syllable, or word, and are trained (that is, built) using training speech. At recognition time, the probability that the input speech is observed from each of these models is computed, and recognition candidates are output in descending order of likelihood. Because an HMM is a statistical model, it represents internally, as probability distributions, how strongly a given acoustic feature is associated with a given category, according to how often the feature appeared in the training speech. As shown in FIG. 4A, acoustic models M₁ to M_M are obtained in advance, one for every recognition category (for example, every phoneme). Each model passes in sequence through four states: an initial state a (near the beginning of the phoneme), a second state b, a third state c, and a final state d (the end of the phoneme). Each state represents the statistical distribution of the acoustic features of that phoneme in that state, and transition probabilities from state to state are given. The probability that the input speech is output by a given acoustic model is computed to obtain the likelihood of that model for the input speech.

[0003] HMMs are broadly classified, according to how the probability distributions are represented, into three types: discrete distribution models, continuous distribution models, and semi-continuous distribution models. In a discrete distribution model, the acoustic features of speech are represented by coded, discrete values. For example, as shown in FIG. 5A, an acoustic feature is represented by one of N representative feature vectors A₁ to A_N, and codes (for example, numbers) C₁ to C_N are assigned to these feature vectors. For each of the acoustic models M₁ to M_M representing the phonemes, output probabilities P₁ to P_N are associated one-to-one with the codes C₁ to C_N, as shown in FIG. 5B. For each frame of the input speech, the representative feature vector among A₁ to A_N that is closest to the frame's feature vector is found, so that the input speech is converted into a sequence of the codes of those representative vectors. The output probability of this code sequence is then computed for each of the acoustic models M₁ to M_M, and the phoneme corresponding to the acoustic model with the highest output probability (the largest likelihood) is output as the recognition result.
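The discrete-model recognition steps described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: squared Euclidean distance is assumed for "closest", each model is reduced to a single state with no transition probabilities, and the codebook, phoneme labels, and probability values are hypothetical.

```python
import math

def nearest_code(x, codebook):
    """Index of the representative vector closest to x (squared Euclidean distance, an assumption)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])))

def quantize(frames, codebook):
    """Convert a sequence of feature vectors into a sequence of codes."""
    return [nearest_code(x, codebook) for x in frames]

def log_score(codes, output_probs):
    """Log output probability of a code sequence under a single-state model (simplification)."""
    return sum(math.log(output_probs[c]) for c in codes)

# Hypothetical codebook A_1..A_3 and per-model output probabilities P_1..P_3.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
models = {"a": [0.7, 0.2, 0.1], "u": [0.1, 0.2, 0.7]}

frames = [[3.8, 0.2], [4.1, -0.1], [0.9, 1.2]]
codes = quantize(frames, codebook)          # [2, 2, 1]
best = max(models, key=lambda m: log_score(codes, models[m]))
```

A full discrete HMM would also sum over state sequences using the transition probabilities; the sketch only shows the quantize-then-look-up structure.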

[0004] In a continuous distribution model, the acoustic feature vector is handled as a continuous quantity. For example, in the acoustic model M_M in FIG. 4A, the acoustic features of the initial state a are represented by a distribution D₁, and those of states b to d by distributions D₂ to D₄, respectively. Continuous distribution models include single-distribution models and mixture-distribution models. The model M_M in FIG. 4A is a mixture-distribution model in which, as shown for example in FIG. 4B, one mixture distribution D₀ is expressed as a weighted sum of several distributions V₁ to V₃. Each of the distributions V₁ to V₃ approximates the distribution of the acoustic features of speech by a Gaussian and is described by a mean μ₁ to μ₃ and a covariance matrix σ₁ to σ₃, respectively. Each of the acoustic models M₁ to M_M is represented, as shown in FIG. 5C, by such a set of distributions for each state together with weighting coefficients (not shown in the figure).
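In standard notation, the mixture distribution described above can be written as follows. The component weights w_k, the feature dimension d, and the symbol Σ_k for the covariance matrix (written σ in the text) are notational assumptions introduced here, not symbols taken from the patent.

```latex
\begin{aligned}
D_0(x) &= \sum_{k=1}^{3} w_k \, V_k(x), \qquad \sum_{k=1}^{3} w_k = 1, \quad w_k \ge 0, \\
V_k(x) &= \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}}
          \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) \right).
\end{aligned}
```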

[0005] For the input speech, the output probability is computed for each acoustic model from the distributions of its states; the output probability of each model, that is, its likelihood, is obtained, and the phoneme of the model with the largest likelihood is taken as the recognition result. A mixture-distribution model allows the distribution to be estimated precisely, but because it has many parameters to estimate, it requires a correspondingly large amount of training speech. A semi-continuous distribution model combines features of the discrete distribution model and of the mixture type of continuous distribution model. Specifically, in a Gaussian-mixture continuous distribution model, the number of mixture components is made sufficiently large, for example 256, the same distributions V₁ to V_N are used for every acoustic model, and the acoustic models are distinguished from one another by their weighting coefficients. For example, as shown in FIG. 6, for state a each of the acoustic models M₁ to M_M is given a weight W for each of the distributions V₁ to V_N. Likewise, for states b, c, and d each of the acoustic models M₁ to M_M is given a weight W for each of the distributions V₁ to V_N. In other words, the base distributions V₁ to V_N are shared across all acoustic models and all states, and the value of the weighting coefficient W_i is determined for each state of each acoustic model as a value specific to that phoneme. For the input speech, the output probability is computed for each acoustic model, and the phoneme of the model with the largest output probability is taken as the recognition result. The semi-continuous model uses the base distributions V₁ to V_N in place of the feature vectors A₁ to A_N of the discrete model. It therefore combines the discrete model's property that the parameter space is represented by a single codebook, as in FIG. 5A, with the mixture-type continuous model's property that each phoneme model is represented in detail by a Gaussian mixture.

[0006] Speech recognition using a statistical model such as an HMM presupposes that the training speech used to estimate the model parameters and the speech actually to be recognized are picked up under similar conditions. That is, the acoustic environment, for example the background noise and the line characteristics, is assumed to be nearly the same at training time and at recognition time. When the pickup conditions differ between training and recognition, the acoustic features of the speech actually being recognized differ from those represented by the model, and recognition accuracy deteriorates.

[0007] Variations in the acoustic features between training and recognition are of two kinds: those that act additively on the spectrum and those that act on it as a filter. Background noise and the like is added to the speech as power, so it is also additive in the spectral domain. Differences in line characteristics (distortion), on the other hand, change the shape of the spectral envelope, usually its slope, and therefore act as a filter in the spectral domain.
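The two kinds of variation can be written compactly as follows. The symbols are assumptions introduced here for illustration: X is the clean speech spectrum, N the noise spectrum, H the frequency response of the line, and Y the observed spectrum. The additive form further assumes that speech and noise are uncorrelated.

```latex
\begin{aligned}
\text{additive (background noise):} \quad & |Y(\omega)|^2 \approx |X(\omega)|^2 + |N(\omega)|^2 \\
\text{filter-like (line characteristics):} \quad & |Y(\omega)|^2 = |H(\omega)|^2 \, |X(\omega)|^2 \\
\text{filter effect in the log spectrum:} \quad & \log |Y(\omega)|^2 = \log |X(\omega)|^2 + \log |H(\omega)|^2
\end{aligned}
```

Taking the logarithm turns a time-invariant filter into a constant offset, whereas additive noise does not reduce to a simple offset in the log domain.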

[0008] When the acoustic conditions differ between training and recognition, attempts have been made to improve recognition performance by adapting the recognition system to the acoustic conditions of the speech to be recognized. Two methods proposed so far are described below. The first is the method known as cepstral mean normalization. The cepstrum, defined as the inverse Fourier transform of the log spectrum, is often used as the acoustic feature of speech. In the cepstral domain, a filter in the spectral domain is realized by addition or subtraction, so distortion caused by variation in line characteristics can be corrected by adding to or subtracting from the cepstrum. Cepstral mean normalization is a simple and effective line-characteristic correction method based on this principle. When cepstral coefficients are used as the acoustic features, subtracting from the cepstral time series its mean over the speech interval flattens the time-invariant spectral trend. However, because the principle of cepstral mean normalization is to subtract, by long-term averaging, the time-invariant spectral envelope of the line, no benefit can be expected unless the average is taken over a reasonably long speech interval. Moreover, because the method simply subtracts the mean of the cepstral time series over some interval, estimation errors arise from differences in speech energy or in SN ratio, so the improvement it can provide is limited.
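Cepstral mean normalization as described above amounts to a per-dimension mean subtraction over the utterance. The following is a minimal sketch; the function name and the toy two-dimensional "cepstra" are hypothetical, and real features would come from a front end that is not shown.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over the utterance from every cepstral frame."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

# Hypothetical clean cepstra, and the same frames with a constant line offset added.
clean = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
channel = [0.5, -1.5]
observed = [[c + h for c, h in zip(f, channel)] for f in clean]

norm_clean = cepstral_mean_normalize(clean)
norm_observed = cepstral_mean_normalize(observed)  # the constant offset is removed
```

The example also illustrates the limitation noted in the text: the mean removed is that of the whole interval, so with a short interval the estimate also absorbs part of the speech content itself.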

[0009] The second is a model adaptation method based on codebook conversion. This method was proposed for speaker adaptation, but provided the model is codebook-based, it is considered generally applicable as an adaptation technique for mismatches between the recording environments of the training speech and the speech to be recognized. With this method, a discrete distribution model or a semi-continuous distribution model can be adapted by converting the codebook from one obtained from the training speech to one obtained from the speech to be recognized. The method is explained using the example of adapting a model trained on speech from line A, the line on which the training speech was recorded, to speech from line B, the line on which the speech to be recognized is recorded. Given speech from line A and speech from line B, codebook A is designed from the line-A speech and codebook B from the line-B speech. The line-A speech is then vector-quantized with codebook A, and the HMMs are trained (built) from the resulting sequences of codebook-A codes. Next, speech with the same utterance content on line B is vector-quantized, using codebook A and codebook B respectively, and the correspondence between the codes of codebook A and those of codebook B is obtained by DP matching. When speech from line B is to be recognized, it is vector-quantized with codebook B, the result is converted into a codebook-A code sequence using the correspondence between codebook A and codebook B, and the line-B speech can then be recognized with the HMMs trained using codebook A. However, this method requires enough line-B speech, that is, speech from the line on which the speech to be recognized is recorded, to design a codebook, and it also requires speech with exactly the same utterance content as the line-A speech. An adaptation method that needs less adaptation speech and places looser constraints on the utterance content was therefore needed.
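The code-conversion step of this prior-art method can be sketched as below. The sketch assumes the DP matching has already been done and has produced frame-aligned pairs of codes; deriving the correspondence by majority vote over those pairs is an assumption made here for illustration, since the text does not say how the correspondence is tabulated. All names and values are hypothetical.

```python
from collections import Counter, defaultdict

def build_code_mapping(aligned_pairs):
    """Map each codebook-B code to the codebook-A code it is most often aligned with.

    aligned_pairs: iterable of (code_b, code_a) for frames already aligned by DP matching.
    """
    votes = defaultdict(Counter)
    for code_b, code_a in aligned_pairs:
        votes[code_b][code_a] += 1
    return {b: counts.most_common(1)[0][0] for b, counts in votes.items()}

def convert_sequence(codes_b, mapping):
    """Convert a codebook-B code sequence into codebook-A codes (KeyError if a code was never seen)."""
    return [mapping[c] for c in codes_b]

# Hypothetical aligned pairs from utterances with the same content on both lines.
pairs = [(0, 2), (0, 2), (0, 1), (1, 0), (1, 0), (2, 1)]
mapping = build_code_mapping(pairs)            # {0: 2, 1: 0, 2: 1}
converted = convert_sequence([0, 1, 2, 1], mapping)  # [2, 0, 1, 0]
```

The need for `pairs` is exactly the drawback the paragraph points out: they exist only if the same content has been recorded on both lines.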

[0010]

[Problem to Be Solved by the Invention] An object of the present invention is to provide a method of adapting an acoustic model that achieves high recognition accuracy even when the training speech and the speech to be recognized differ in their properties, and that can be carried out with a small amount of training speech and a small amount of computation.

[0011]

[Means for Solving the Problem] According to the present invention, the acoustic model is composed of a codebook that represents the parameter space by a plurality of base distributions and of weighting coefficients for the base distributions in that codebook. Using an all-category acoustic model trained independently of the individual recognition categories, the base distributions representing the parameter space are re-estimated, and thereby adapted, with speech of differing properties, that is, speech having the same properties as the speech encountered at recognition time.

[0012] In the invention of claim 2, the change of each re-estimated base distribution relative to the base distribution before re-estimation is taken as its adaptation vector. The base distributions are clustered according to speech power, the adaptation vectors are averaged over the base distributions belonging to each cluster, and each base distribution of a cluster is adapted using that cluster's averaged adaptation vector.
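The clustering and averaging of adaptation vectors described for claim 2 can be sketched as follows. Several details are assumptions made for illustration: each base distribution is given a scalar speech-power value, clusters are formed by ranking those values and splitting them into equal-sized groups, and only the means are adapted. The text does not specify the clustering procedure.

```python
def cluster_by_power(powers, n_clusters):
    """Assign each base distribution to a cluster by the rank of its speech power (assumed scheme)."""
    order = sorted(range(len(powers)), key=lambda i: powers[i])
    labels = [0] * len(powers)
    for rank, i in enumerate(order):
        labels[i] = rank * n_clusters // len(powers)
    return labels

def smooth_adaptation(old_means, new_means, labels):
    """Replace each adaptation vector by its cluster average and apply it to the old mean."""
    dim = len(old_means[0])
    vectors = [[n - o for n, o in zip(nm, om)] for nm, om in zip(new_means, old_means)]
    sums, counts = {}, {}
    for e, c in zip(vectors, labels):
        acc = sums.setdefault(c, [0.0] * dim)
        for d in range(dim):
            acc[d] += e[d]
        counts[c] = counts.get(c, 0) + 1
    averaged = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return [[o + a for o, a in zip(om, averaged[c])] for om, c in zip(old_means, labels)]

# Hypothetical one-dimensional codebook with four base distributions.
powers = [1.0, 9.0, 2.0, 8.0]
old_means = [[0.0], [10.0], [1.0], [11.0]]
new_means = [[0.4], [10.9], [1.2], [12.1]]   # means after re-estimation

labels = cluster_by_power(powers, 2)          # [0, 1, 0, 1]
smoothed = smooth_adaptation(old_means, new_means, labels)
```

Every base distribution in a cluster is shifted by the same averaged vector, so a distribution that saw few adaptation frames borrows the shift estimated from its cluster neighbours.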

[0013] In the invention of claim 3, the averaged adaptation vector of each cluster and the adaptation vector of each base distribution in that cluster are combined by weighted averaging, and the base distributions of the cluster are adapted using the resulting weighted-average adaptation vectors.

[0014]

[Operation] The above configuration has two advantages: (1) because an all-phoneme HMM is used, adaptation can be performed with arbitrary utterances, regardless of the content of the adaptation speech; and (2) because the level of the speech power is taken into account, the line characteristics can be adapted more accurately. That is, the method exploits the fact that additive noise has little effect when the speech power is large and the SN ratio is high, and that the reverse holds when the power is small. The codebook can thus be adapted so that, for base distributions (codewords) belonging to a high-power cluster, the average correction over the base distributions (codewords) of that cluster is emphasized, whereas for base distributions (codewords) belonging to a low-power cluster, the correction of the base distribution itself is emphasized.
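One way to write the weighting described above is sketched here. The symbols are assumptions introduced for illustration: E_i is the adaptation vector of base distribution i (the change in its mean μ_i), C_c is the set of base distributions in power cluster c, and λ_c, between 0 and 1, is a cluster-dependent weight. The text states only the direction of the emphasis, not how the weights are chosen.

```latex
\begin{aligned}
\bar{E}_c   &= \frac{1}{|C_c|} \sum_{j \in C_c} E_j, \\
\hat{E}_i   &= \lambda_c \, \bar{E}_c + (1 - \lambda_c) \, E_i \qquad (i \in C_c), \\
\hat{\mu}_i &= \mu_i + \hat{E}_i .
\end{aligned}
```

Under this reading, λ_c would be closer to 1 for high-power clusters and closer to 0 for low-power clusters.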

[0015]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings, for the case in which an acoustic model trained on speech recorded in an environment with relatively good acoustic conditions, such as a soundproof room, is adapted to telephone speech whose characteristics differ from those of the training speech. In this example a semi-continuous HMM is used as the acoustic model. The method of the invention is applicable to discrete distribution HMMs and continuous distribution HMMs as well, provided that the model parameter space is represented by a set of base distributions and those base distributions are shared among the models.

[0016] FIG. 1 shows a speech recognition apparatus to which the invention is applied. An analog speech signal from an input terminal 11 is converted into a digital speech signal by a speech input unit 12, and acoustic features (for example, cepstrum, Δcepstrum, and Δpower) are extracted from the digital speech signal by an acoustic feature extraction unit 13. When HMMs are used as the acoustic models, the HMM parameters (the means and covariances of the acoustic feature vectors and the transition probabilities) and the weighting coefficients of the distributions are computed by a computation unit 14. A semi-continuous HMM is composed of a codebook that represents the parameter space by a plurality of base distributions and of weighting coefficients for the base distributions in that codebook. The base distributions of the HMMs trained on the speech recorded in the relatively good environment are stored in a speaker-independent codebook 15, and the weighting coefficients of each HMM for the base distributions are stored in a weighting coefficient memory 16. In the present invention, moreover, the base distributions obtained by adapting those of the speaker-independent codebook with telephone speech are stored in an adapted codebook 17. The recognition result is output from the computation unit 14 to an output terminal 18. The acoustic feature extraction unit 13 may be implemented in hardware or in software. If implemented in software, it may be implemented on the computation unit 14, provided that unit has sufficient computing capacity.

[0017] The set of base distributions before adaptation, that is, the set stored in the speaker-independent codebook 15, consists of V₁ to V_N, as shown for example in FIG. 2A. As described above, a semi-continuous HMM has a weighting coefficient for each distribution in this codebook, and the likelihood for the input speech is obtained as the weighted sum of the probability density values of the distributions. The size of the codebook 15, that is, the number of base distributions, is often about 256 when, for example, cepstral coefficients are used as the acoustic features. Let x be a feature vector of the input speech, V₁(x), V₂(x), V₃(x), ..., V_N(x) the probability density values of the base distributions, and W₁, W₂, W₃, ..., W_N the weighting coefficients for the respective distributions. The likelihood F(x) for the feature vector x is then given by

F(x) = W₁V₁(x) + W₂V₂(x) + W₃V₃(x) + ... + W_N V_N(x)   ... (1)

where W₁ to W_N take different values for each HMM.
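Equation (1) can be evaluated directly once the base distributions and the weights are known. The sketch below assumes diagonal-covariance Gaussians and a single state per phoneme model; the three-entry codebook, the phoneme labels, and all numeric values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian (the diagonal form is an assumption)."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)

def likelihood(x, codebook, weights):
    """Equation (1): F(x) = sum over i of W_i * V_i(x)."""
    return sum(w * gaussian_pdf(x, mean, var)
               for w, (mean, var) in zip(weights, codebook))

# Shared codebook V_1..V_3 as (mean, variance) pairs; every model uses the same entries.
codebook = [([0.0, 0.0], [1.0, 1.0]),
            ([3.0, 0.0], [1.0, 1.0]),
            ([0.0, 3.0], [1.0, 1.0])]

# Model-specific weights W_1..W_3; only these differ between phoneme models.
weights = {"a": [0.8, 0.1, 0.1], "i": [0.1, 0.8, 0.1]}

x = [2.9, 0.1]
scores = {phoneme: likelihood(x, codebook, w) for phoneme, w in weights.items()}
best = max(scores, key=scores.get)   # "i", whose weight is concentrated on V_2
```

Because the densities V_i(x) do not depend on the model, they can be computed once per frame and reused for every phoneme model, which is one practical benefit of sharing the codebook.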

[0018] The parameters that determine the form of the HMM for each phoneme (the means and covariances of the Gaussian distributions V₁, V₂, V₃, ..., V_N and the weighting coefficients W₁, W₂, W₃, ..., W_N for the respective distributions) are estimated from a large amount of speech data by the forward-backward algorithm. The base distributions V₁ to V_N are shared across all models and all states, and the value of W_i is estimated for each state of each model as a value specific to that phoneme model.

[0019] In the adaptation according to the invention, the W_i of each model, that is, the contents of the weighting coefficient memory 16, are left unchanged, and only the base distributions V₁ to V_N are adapted. Through adaptation, the means and variances of the base distributions V₁ to V_N change so that they represent well the parameter space of the speech to be recognized. A change in the mean moves the position of a distribution, and a change in the covariance alters its extent. Through this adaptation the distributions V₁ to V_N are changed as shown in FIG. 2B. When a sufficient amount of adaptation speech cannot be obtained, the covariances may be left unchanged and only the means changed. Because the base distributions V₁ to V_N themselves move to the new parameter space, each phoneme model becomes adapted to the new parameter space even though its model-specific W_i do not change.

[0020] A concrete method of re-estimating the means and covariances for adaptation is described on the assumption that the models are defined per phoneme. The means and covariances are re-estimated using an all-phoneme HMM. That is, a model is trained, independently of the individual phonemes to be recognized, on all the speech to be recognized, so that it gives a relatively large likelihood to every phoneme; using the weighting coefficients W₁ to W_N of this so-called all-phoneme model, only the codebook 15 (the means and covariances) is re-trained. Ordinarily, each phoneme model has large weighting coefficients for the several base distributions, among the 256 codewords (base distributions), that are particularly important for representing that phoneme, and very small weighting coefficients, close to zero, for the rest. Consequently, if re-estimation were carried out with the individual phoneme models, codewords (base distributions) with large weights would move a great deal while those with small weights would hardly move, and the codebook as a whole could not be re-estimated in a balanced way. Re-estimation is therefore performed with the all-phoneme HMM, whose weighting coefficients are well balanced over all codewords (all base distributions). This all-phoneme HMM is trained in advance, at the time the speaker-independent codebook 15 and the weighting coefficients are trained, and its weighting coefficients are stored in the memory 16 as all-phoneme-model weighting coefficients 19. When the adapted codebook 17 is created, these all-phoneme-model weighting coefficients are used and, otherwise as in ordinary training, the means and covariances of the phoneme models (HMMs) are estimated by the forward-backward algorithm to produce the adapted codebook 17.
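The re-estimation of the codebook means with fixed all-phoneme weights can be sketched as below. This is a simplified illustration under stated assumptions, not the patent's procedure: the all-phoneme model is reduced to a single state, so the forward-backward occupation probabilities reduce to per-frame posteriors of the base distributions; covariances are diagonal and kept fixed; and only one re-estimation pass is made. All names and values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian (the diagonal form is an assumption)."""
    log_p = 0.0
    for xi, m, v in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
    return math.exp(log_p)

def reestimate_means(frames, codebook, all_phoneme_weights):
    """One pass of mean re-estimation; weights and variances are left unchanged."""
    n = len(codebook)
    dim = len(frames[0])
    num = [[0.0] * dim for _ in range(n)]
    den = [0.0] * n
    for x in frames:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, (m, v) in zip(all_phoneme_weights, codebook)]
        total = sum(joint)
        if total == 0.0:
            continue  # frame too far from every base distribution to contribute
        for k in range(n):
            gamma = joint[k] / total          # posterior of base distribution k
            den[k] += gamma
            for d in range(dim):
                num[k][d] += gamma * x[d]
    adapted = []
    for k, (mean, var) in enumerate(codebook):
        new_mean = [num[k][d] / den[k] for d in range(dim)] if den[k] > 0.0 else list(mean)
        adapted.append((new_mean, list(var)))
    return adapted

# Hypothetical one-dimensional codebook and adaptation frames shifted by about +1.
codebook = [([0.0], [1.0]), ([5.0], [1.0])]
all_phoneme_weights = [0.5, 0.5]              # balanced over all base distributions
frames = [[1.0], [0.8], [1.2], [6.0], [5.9], [6.1]]
adapted = reestimate_means(frames, codebook, all_phoneme_weights)
```

With balanced weights every base distribution receives its share of the adaptation frames; with the sharply peaked weights of a single phoneme model, most entries would have near-zero posteriors and would hardly move, which is the imbalance the paragraph describes.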

[0021] Because the all-phoneme model is independent of the phonemes, it can be trained regardless of utterance content; a further advantage is therefore that no constraint such as having to utter prescribed adaptation speech is needed. The adaptation training computations described above are carried out in the computation unit 14 in FIG. 1. The adaptation speech from line B need not be labeled with the intervals corresponding to the recognition categories (phonemes). The adaptation speech from line B is converted from an analog speech signal into a digital speech signal and then into acoustic feature vectors by the speech input unit 12 and the acoustic feature extraction unit 13. With the acoustic feature vectors of the speech intervals of the line-B adaptation speech as observation samples, the means and covariances of the distributions of the all-phoneme HMM, as well as its weighting coefficients, can be re-estimated by the forward-backward algorithm. For the individual phoneme HMMs, the weighting coefficients need not be re-estimated or updated; the codebook 15 need only be replaced by the adapted one. In this way, HMMs are created whose base-distribution weighting coefficients are the same as those of the original speaker-independent models, that is, the contents of the memory 16, and in which the means and covariances of the codebook 15 are optimized for the line-B speech; the result is the adapted codebook 17.

[0022] Ordinary speaker-independent speech recognition uses the speaker-independent models, whose acoustic features are represented by the speaker-independent codebook 15 and the weighting coefficient memory 16. When speech from line B is to be recognized, the HMMs adapted to the target line B, formed by the adapted codebook 17 and the weighting coefficient memory 16, are used: the likelihood of the HMM of each recognition category for the line-B input speech is obtained, and the category of the model with the highest likelihood is taken as the recognition result, or recognition candidates are output in descending order of likelihood.

【００２３】図３Ａに、この発明方法によりマイク音声
で学習した半連続ＨＭＭのコードブック１５を電話音声
へ適応化した場合のその電話音声に対する音素認識結果
を示す。音響的特徴量はケプストラムとΔケプストラム
各１２次元である。図中、ＣＭＮは従来技術の項で述べ
たケプストラム平均値正規化法、ｍｅａｎは各基底分布
の平均値だけを適応化したもの、ｍｅａｎａｎｄｖ
ａｒ．は平均値と共分散を同時に適応化したもの、ｍｅ
ａｎ＋ｖａｒは平均値だけを適応化した後に、共分散だ
けを適応化したものである。この図からＣＭＮにこの発
明方法を組み合わせると５５．４％まで認識率が向上し
た。請求項２の発明の実施例半連続ＨＭＭのコードブック１５を各基底分布の音声パ
ワーにしたがってクラスタリングする。すなわち音声パ
ワーの近い基底分布は同じクラスタに属する。前記請求
項１の発明の実施例において求めた各基底分布（コード
ブック１５）に対応する適応化基底分布（コードブック
１７）の変化を適応化ベクトルとする時、基底分布の属
するクラスタごとにその適応化ベクトルを平均化して、
そのクラスタの代表適応化ベクトル（平均化適応化ベク
トル）とし、そのクラスタに属する基底分布すべてをそ
のクラスタの代表適応化ベクトルにより適応化する。例
えば音声パワークラスタリングにより、例えば図２Ａ中
の基底分布Ｖ₂，Ｖ₃，Ｖ ₆が同じクラスタに属したと
すると、基底分布Ｖ₂，Ｖ₃，Ｖ₆の適応化コードブッ
ク１７中の各対応する基底分布への変化ベクトル（適応
化ベクトル）Ｅ₂，Ｅ₃，Ｅ₆（この場合は平均値の変
化を示すベクトル）を平均化し、その平均化適応化ベク
トルＥ_mを用いて、そのクラスタに属する基底分布
Ｖ₂，Ｖ₃，Ｖ₆を適応化する。この場合は一種の平滑
化の効果により適応化用音声が少量の場合にも頑健な適
応化が期待できる。FIG. 3A shows the phoneme recognition results for telephone speech when the codebook 15 of a semi-continuous HMM trained on microphone speech is adapted to telephone speech by the method of this invention. The acoustic features are the cepstrum and the delta cepstrum, 12 dimensions each. In the figure, CMN is the cepstral mean normalization method described in the prior-art section; "mean" adapts only the mean of each basis distribution; "mean and var." adapts the means and covariances simultaneously; and "mean + var" adapts only the means first and then only the covariances. As the figure shows, combining CMN with the method of this invention raised the recognition rate to 55.4%.

Embodiment of the invention of claim 2

The codebook 15 of the semi-continuous HMM is clustered according to the speech power of each basis distribution, so that basis distributions with similar speech power belong to the same cluster. The change from each basis distribution (codebook 15) to the corresponding adapted basis distribution (codebook 17) obtained in the embodiment of claim 1 is taken as its adaptation vector. The adaptation vectors are averaged over each cluster to give that cluster's representative adaptation vector (averaged adaptation vector), and every basis distribution belonging to the cluster is adapted with that representative vector. For example, suppose that speech-power clustering places the basis distributions V₂, V₃, and V₆ in FIG. 2A in the same cluster. The change vectors (adaptation vectors) E₂, E₃, and E₆ from V₂, V₃, and V₆ to their corresponding basis distributions in the adaptation codebook 17 (here, vectors representing the change in the mean) are averaged, and the averaged adaptation vector E_m is used to adapt V₂, V₃, and V₆. This acts as a kind of smoothing, so robust adaptation can be expected even when only a small amount of adaptation speech is available.
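
The cluster-wise averaging of adaptation vectors can be sketched as below. The sketch assumes the cluster assignment of each basis distribution (obtained from speech power) is already given and that the adaptation vectors are changes in the mean; the function name, argument names, and toy values are hypothetical.

```python
def cluster_average_adapt(means, adapted_means, cluster_of):
    """Shift every basis mean by the average adaptation vector of its cluster.

    means         : original codebook means (codebook 15)
    adapted_means : individually re-estimated means (codebook 17)
    cluster_of    : cluster id of each basis distribution (assumed given)
    """
    dim = len(means[0])
    sums, counts = {}, {}
    for old, new, c in zip(means, adapted_means, cluster_of):
        e = [n - o for n, o in zip(new, old)]      # adaptation vector E_k
        acc = sums.setdefault(c, [0.0] * dim)
        for d in range(dim):
            acc[d] += e[d]
        counts[c] = counts.get(c, 0) + 1
    # Representative (averaged) adaptation vector E_m of each cluster.
    rep = {c: [s / counts[c] for s in sums[c]] for c in sums}
    smoothed = [[o + r for o, r in zip(old, rep[c])]
                for old, c in zip(means, cluster_of)]
    return smoothed, rep

# Hypothetical 1-D codebook: the first two basis distributions share a cluster.
smoothed_means, representative = cluster_average_adapt(
    means=[[0.0], [1.0], [5.0]],
    adapted_means=[[0.4], [1.2], [5.9]],
    cluster_of=[0, 0, 1],
)
```

In this toy case the individual shifts of 0.4 and 0.2 in cluster 0 are both replaced by their average, which is the smoothing effect the text refers to.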

【００２４】図３Ｂにこの請求項２の発明の方法でコー
ドブックを適応化した場合の認識結果を示す。音響的特
徴量としてケプストラム、Δケプストラムに加え、正規
化対数パワーとその一次微分（Δパワー）を用いた。ク
ラスタリングは正規化対数パワーにより行なった。クラ
スタ数は実験的に最適値を求め、５とした。特徴量が増
えたことにより先の実験より全体的に認識率が向上して
いるが、パワーでクラスタリングした場合はＣＭＮやｍ
ｅａｎ（全音素ＨＭＭでコードブックの平均値を適応化
した場合）より高い認識率を示している。請求項３の発明の実施例前記実施例における各基底分布に対応する適応化ベクト
ルと、その基底分布の属するクラスタの代表適応化ベク
トル（平均化適応化ベクトル）との重み付き線形和を新
たに適応化ベクトルとしてコードブックを適応化する。
音声パワーが大きいところではおもにフィルタ的な歪み
の影響が精度の劣化原因として考えられ、音声パワーが
小さいところでは加算的な雑音の影響も無視できないと
考えられるため、音声パワーの大小によって、基底分布
自身に対応する適応化ベクトルとクラスタの代表適応化
ベクトルの寄与率を操作することで、より高精度な適応
化が実現できると期待できる。FIG. 3B shows the recognition results when the codebook is adapted by the method of the invention of claim 2. In addition to the cepstrum and the delta cepstrum, the normalized log power and its first derivative (delta power) were used as acoustic features. Clustering was performed on the normalized log power. The number of clusters was set to 5, the optimum found experimentally. Because of the added features, recognition rates are higher overall than in the previous experiment, and clustering by power gives a higher recognition rate than CMN or "mean" (adapting the codebook means with the all-phoneme HMM).

Embodiment of the invention of claim 3

The codebook is adapted using, as the new adaptation vector, a weighted linear sum of the adaptation vector of each basis distribution in the preceding embodiment and the representative adaptation vector (averaged adaptation vector) of the cluster to which that basis distribution belongs. Where speech power is high, filter-like distortion is considered the main cause of accuracy degradation, whereas where speech power is low the effect of additive noise cannot be ignored either. Adjusting the relative contributions of a basis distribution's own adaptation vector and its cluster's representative adaptation vector according to the speech power can therefore be expected to give more accurate adaptation.
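
The weighted linear sum can be written as a short sketch. The text does not specify how the contribution ratio depends on speech power, so the weight is left as a caller-supplied parameter; the assumption that the two weights sum to one, along with the function and parameter names, is illustrative only.

```python
def interpolate_adaptation(own_vec, cluster_vec, own_weight):
    """Weighted linear sum of a basis distribution's own adaptation vector
    and the representative adaptation vector of its cluster.

    own_weight is the contribution of the individual vector; the cluster
    vector receives 1 - own_weight (an assumed normalization).
    """
    if not 0.0 <= own_weight <= 1.0:
        raise ValueError("own_weight must lie in [0, 1]")
    return [own_weight * e + (1.0 - own_weight) * r
            for e, r in zip(own_vec, cluster_vec)]

# Hypothetical values: lean mostly on the cluster's representative vector.
new_vec = interpolate_adaptation(own_vec=[0.4], cluster_vec=[0.3], own_weight=0.25)
```

Setting `own_weight` to 0 reproduces the cluster-averaged adaptation of claim 2, and setting it to 1 reproduces the individual adaptation of claim 1.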

【００２５】上述ではこの発明を回線音声に適応化させ
る場合に適用したが、いわゆる話者適応にも適用でき
る。また音響モデルとしてはＨＭＭに限らない。The invention has been described above as applied to adaptation to line speech, but it can also be applied to so-called speaker adaptation. The acoustic model is not limited to an HMM.

【００２６】[0026]

【発明の効果】以上述べたように、この発明によれば、
（１）任意の発声内容の適応化音声により認識対象とな
る音声の特性へ音響モデルを適応化することができ、
（２）音声パワーに応じた適応化を行なうことでより頑
健で精度の高い適応化が可能となる、（３）各カテゴリ
モデルの分布係数は再推定せず、共通のコードブックだ
けを再推定するため適応化学習に要する学習音声は少な
くてよく、そのため計算時間も少ない、などの利点があ
る。As described above, the invention offers the following advantages:
(1) the acoustic model can be adapted to the characteristics of the speech to be recognized using adaptation speech with arbitrary utterance content;
(2) performing the adaptation according to speech power makes it more robust and more accurate; and
(3) because the distribution coefficients of each category model are not re-estimated and only the shared codebook is re-estimated, little adaptation speech is needed and the computation time is correspondingly short.

[Brief description of drawings]

【図１】この発明を適用した音声認識システムの構成を
示すブロック図。FIG. 1 is a block diagram showing a configuration of a voice recognition system to which the invention is applied.

【図２】この発明による音響モデルの適応化の様子を示
す図。FIG. 2 is a diagram showing a state of adaptation of an acoustic model according to the present invention.

【図３】この発明の効果を示す図。FIG. 3 is a diagram showing the effect of the present invention.

【図４】Ａは音響モデルの例を示す図。Ｂは混合分布の
例を示す図である。FIG. 4A is a diagram showing an example of an acoustic model. FIG. 4B is a diagram showing an example of a mixture distribution.

【図５】Ａは離散分布モデルのコードブックの例を示す
図、Ｂはその各音響モデルの例を示す図、Ｃは連続分布
モデルの例を示す図である。FIG. 5A is a diagram showing an example of a codebook of a discrete distribution model, FIG. 5B is a diagram showing an example of each of its acoustic models, and FIG. 5C is a diagram showing an example of a continuous distribution model.

【図６】半連続分布モデルの例を示す図。FIG. 6 is a diagram showing an example of a semi-continuous distribution model.

Claims


1. A method of adapting an acoustic model, in which acoustic features are extracted from training speech, the features are statistically modeled, and the resulting acoustic model corresponding to each recognition category is adapted using speech encountered at recognition time whose properties differ from those of the training speech, wherein the acoustic model comprises a codebook that represents the parameter space with a plurality of basis distributions and weighting coefficients for each basis distribution in the codebook, and wherein the basis distributions representing the parameter space are re-estimated and adapted with the speech of differing properties, using an all-category acoustic model trained independently of the individual recognition categories.

2. The acoustic model adaptation method according to claim 1, wherein the change of each re-estimated basis distribution relative to the basis distribution before re-estimation is taken as an adaptation vector, the basis distributions are clustered according to speech power, the adaptation vectors of the basis distributions belonging to each cluster are averaged, and each basis distribution of the cluster is adapted using the averaged adaptation vector.

3. The acoustic model adaptation method according to claim 1, wherein the change of each re-estimated basis distribution relative to the basis distribution before re-estimation is taken as an adaptation vector, the basis distributions are clustered according to speech power, the adaptation vectors of the basis distributions belonging to each cluster are averaged, the averaged adaptation vector of each cluster and the adaptation vector of each basis distribution in that cluster are combined by weighted averaging, and each basis distribution of the cluster is adapted using the weighted-average adaptation vector.