JPH1097273A

JPH1097273A - Speaker adapting method for voice model, speech recognizing method using the same method, and recording medium recorded with the same method

Info

Publication number: JPH1097273A
Application number: JP9093855A
Authority: JP
Inventors: Tomoko Matsui (松井知子); Sadaoki Furui (古井貞煕)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1996-08-02
Filing date: 1997-04-11
Publication date: 1998-04-14
Anticipated expiration: 2017-04-11
Also published as: JP3216565B2

Abstract

PROBLEM TO BE SOLVED: To increase the recognition rate by extracting model sequences of speaker-independent speech models using the speech of the speaker to be recognized, and adapting the speaker-independent speech models according to those model sequences. SOLUTION: A speech model storage part 12 stores in advance, as a reference model dictionary, speaker-independent speech models corresponding to the recognition categories. The procedure for carrying out the speech recognition method is recorded in advance in a storage part 10M of a control part 10, which controls the processing of the respective parts 11 to 16 of the speech recognition system according to that procedure. A plurality of model sequences are extracted so that the correct model sequence is included with high probability, and speaker adaptation is performed for each model sequence. The likelihoods of the speech to be recognized under the speaker-adapted speech models are compared, and the speech model with the highest likelihood is selected, so that adaptation is based on a more correct model sequence.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speaker adaptation method for speech models. Speaker-independent speech models, obtained by modeling the speech features that correspond to recognition categories such as phonemes and words from the speech of many speakers, for example as hidden Markov models (hereinafter HMMs), are adapted using the speech of the speaker to be recognized so as to raise the recognition rate for that speaker. The invention also relates to a speech recognition method using the adaptation method and to a recording medium on which the method is recorded.

[0002]

2. Description of the Related Art

In speech recognition, the feature parameter sequence of the input speech is compared with sequences of reference speech (word or phoneme) models created in advance, and the most likely speech model sequence is output as the recognition result. The reference model dictionary, which is the set of reference speech models used for recognition, is created, for example, as follows. The speech of many speakers is analyzed in advance, for example by linear predictive analysis, to obtain feature parameter sequences. From those sequences, a reference word model or phoneme model (hereinafter collectively called a speech model) is created for each word or phoneme, and the models together form the reference model dictionary.

[0003] In recent speech recognition, an HMM, which represents speech by the statistics of a feature parameter sequence such as the cepstrum obtained for each fixed-length frame, is often used as the reference speech model, so the HMM is briefly described here with reference to FIG. 1. In a phoneme HMM, for example, the feature parameter sequence corresponding to a phoneme generally has roughly distinct characteristics in its start, middle, and end regions, so it can be defined by three states representing those three regions, as shown in FIG. 1A. Because speech is continuous, the boundaries between the regions are not actually sharp; FIG. 1A draws them explicitly only to simplify the explanation. These regions are called the first, second, and third states and are denoted S₁, S₂, and S₃.

[0004] Each state of a phoneme HMM represents the distribution of the feature parameters in the corresponding region of the phoneme, and in practice is expressed as a combination of Gaussian distributions over the dimensions of the feature parameter vector. FIG. 1B shows an example in which the feature parameter vector is one-dimensional and each state is represented by a combination of four Gaussian distributions. In that case, state S_i is defined by the means m_i = {m_i1, m_i2, m_i3, m_i4}, the variances σ_i = {σ_i1, σ_i2, σ_i3, σ_i4}, and the weighting factors w_i = {w_i1, w_i2, w_i3, w_i4} of its four Gaussian distributions. A distribution expressed as such a combination of Gaussian distributions is called a Gaussian mixture distribution.
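The mixture described for FIG. 1B can be sketched in a few lines of Python. This is only an illustration: the numerical values of `m_i`, `sigma_i`, and `w_i` are hypothetical, and σ is treated as a variance, following the wording of the paragraph.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, means, variances, weights):
    """Density of a 1-D Gaussian mixture: sum over k of w_k * N(x; m_k, var_k)."""
    return sum(w * gaussian_pdf(x, m, v)
               for m, v, w in zip(means, variances, weights))

# Hypothetical parameters for one state S_i with four mixture components.
m_i = [-1.0, 0.0, 1.0, 2.5]
sigma_i = [0.5, 1.0, 0.8, 1.2]   # treated as variances here
w_i = [0.2, 0.4, 0.3, 0.1]       # weighting factors, chosen to sum to 1

density = mixture_pdf(0.3, m_i, sigma_i, w_i)
```

Because the weights sum to 1, the mixture is itself a probability density, which is what lets it serve as the output distribution of a state.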

[0005] Further, for each of the first, second, and third states S₁, S₂, and S₃, the probabilities a₁₁, a₂₂, a₃₃ of remaining in the same state and the probabilities a₁₂, a₂₃ of moving to the next state are defined, for example, as shown in FIG. 1C. The set of statistical parameters consisting of the means, variances, and weighting factors of all Gaussian mixture distributions in all states, together with the state transition probabilities, is called the model parameters of the HMM representing the phoneme and is denoted θ. The reference model dictionary stores the model parameters of the speech HMMs for all predetermined phonemes (or words).

[0006] In speech recognition, the feature parameter sequence obtained from the input speech is fitted to each possible speech HMM sequence, and the best-fitting speech HMM sequence is output as the recognition result. The degree of fit is called the likelihood of the HMM. In practice it is obtained as the product of the probabilities computed from the Gaussian mixture distributions and the state transition probabilities when the feature parameter sequence is probabilistically assigned to the states contained in the speech HMM sequence. To adapt the speech models to the speech of an arbitrary speaker, as described later, the model parameters θ of each speech HMM (for example, the means m₁, m₂, m₃ of the Gaussian mixture distributions in the three states) are varied so that the likelihood of the speech model sequence corresponding to that speaker's speech is maximized.
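One common way to compute such a likelihood is the forward algorithm, which sums the product of output and transition probabilities over all assignments of frames to states. The sketch below assumes a three-state left-to-right HMM with a single 1-D Gaussian per state instead of a mixture, and all numerical values are hypothetical; it illustrates the idea and is not the procedure specified in the patent.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def forward_likelihood(obs, means, variances, trans):
    """Likelihood of obs under a left-to-right HMM, via the forward algorithm.

    trans[i][j] is the probability of moving from state i to state j.
    The model is assumed to start in state 0 and end in the last state.
    """
    n = len(means)
    # alpha[j]: probability of the frames seen so far, ending in state j.
    alpha = [0.0] * n
    alpha[0] = gaussian_pdf(obs[0], means[0], variances[0])
    for x in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n))
            * gaussian_pdf(x, means[j], variances[j])
            for j in range(n)
        ]
    return alpha[-1]

# Hypothetical transition matrix: a11, a12 / a22, a23 / a33.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
obs = [0.1, -0.1, 1.0, 1.2, 2.0, 1.9]

good = forward_likelihood(obs, [0.0, 1.0, 2.0], [0.25] * 3, trans)
bad = forward_likelihood(obs, [3.0, 4.0, 5.0], [0.25] * 3, trans)
```

Here `good` is larger than `bad` because the first set of means fits the frames better, which is the sense in which adaptation "varies θ so that the likelihood is maximized".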

[0007] For example, an interactive automatic ticket vending machine installed in a station and operated by speech recognition must recognize the destination station name spoken by an arbitrary user and issue the corresponding ticket. To raise the recognition rate for the speech of such arbitrary speakers, one approach is to adapt the reference model dictionary to the user with the speech the user has uttered, and then re-recognize that speech with the adapted dictionary.

[0008] Speaker adaptation techniques can generally be divided into supervised adaptation, in which the content of the utterances used for adaptation is known, and unsupervised adaptation, in which it is unknown. They can also be classified into an offline type, in which the recognition system collects a suitable amount of speech data in advance and uses it for adaptation, and an online type, in which unsupervised adaptation is performed on each utterance as it is recognized.

[0009] Unsupervised online speaker adaptation is called instantaneous speaker adaptation. It is particularly effective in applications where many users take turns using the recognition system, as in the station ticket vending machine described above. However, instantaneous speaker adaptation must be performed without supervision and with only a small amount of speech data.

[0010] In conventional unsupervised speaker adaptation, the input speech is first recognized once with the speaker-independent reference model dictionary (for example, a set of reference phoneme HMMs), using a decoding algorithm such as the Viterbi algorithm (see Seiichi Nakagawa, "Speech Recognition by Stochastic Models," IEICE, 1988), and the phoneme HMM sequence of the input speech is estimated. Phoneme HMMs are then selected from the dictionary and concatenated according to the estimated phoneme HMM sequence Λ. A function G_η(θ) (the model transformation function) maps the model parameters θ of all phoneme HMMs in the reference model dictionary to the phoneme model parameters of the speaker to be recognized. Its transformation parameter η is determined so that the likelihood of the concatenated phoneme HMM sequence is maximized, for example by maximum a posteriori (MAP) estimation (see, for example, J. L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech and Audio Processing, Vol. 2, No. 2, pp. 291-298, 1994), according to the following equation.

[0011]

(Equation 1)

[0012] Here, X is the input speech, f() is the likelihood function, and g() is the prior probability density function. Using this η′, the parameters θ′ of the phoneme HMMs adapted to the speaker to be recognized are calculated by the following equation.

[0013]

(Equation 2)

[0014] In practice, for example, among the parameters θ representing each phoneme model, the variances σ, weighting factors w, and state transition probabilities a are assumed not to change, and only the means m are adapted. The model transformation function G_η() is then given by the following equation.

[0015]

(Equation 3)

[0016] The input speech X is then re-recognized with the reference model dictionary adapted to the speaker in this way, and the recognition result is output. However, with the mapping of the model parameters θ by equations (1) and (2), the phoneme HMM sequence Λ cannot be estimated correctly for input speech X from a speaker for whom the speaker-independent reference phoneme HMM dictionary performs poorly, so the benefit of speaker adaptation is not always obtained.
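Equations (1) to (3) are not reproduced in this text. Based only on the surrounding definitions (X, Λ, θ, η, f(), g(), G_η) and on the later statement that the change Δm in the Gaussian means is used as η, they plausibly take the form below. This is a reconstruction under those assumptions, not the published equations; in particular, the argument of the prior g() and the additive form of equation (3) are assumed.

```latex
\begin{align}
\eta'   &= \operatorname*{argmax}_{\eta}\; f(X \mid \Lambda, \eta, \theta)\, g(\eta) \tag{1}\\
\theta' &= G_{\eta'}(\theta) \tag{2}\\
G_{\eta}(m) &= m + \eta \tag{3}
\end{align}
```

Read this way, equation (1) picks the transformation parameter by MAP estimation for the single recognized sequence Λ, equation (2) applies it to the dictionary, and equation (3) restricts the transformation to a shift of the means.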

[0017]

SUMMARY OF THE INVENTION

A first object of the present invention is to provide a speaker adaptation method for speech models that, through speaker adaptation, allows the speech model sequence for the input speech to be estimated more accurately. A second object of the present invention is to provide a speech recognition method with a high recognition rate that uses this speaker adaptation method.

[0018] A third object of the present invention is to provide a recording medium on which the above speech recognition method is recorded.

[0019]

According to the present invention, in a feature extraction step, the input speech of the speaker to be recognized is analyzed to extract a feature parameter sequence. In a model sequence extraction step, the extracted feature parameter sequence and the speaker-independent speech models are used to extract a plurality of hypothesized model sequences estimated to correspond to the feature parameter sequence of the input speech. In a provisional adaptation step, for each extracted model sequence, the speaker-independent speech models are concatenated according to that sequence and adapted so that the likelihood of the concatenated model for the feature parameter sequence of the input speech is maximized. In an adapted model selection step, for each model sequence, the likelihood of the model obtained by concatenating the adapted speech models according to that sequence is computed for the feature parameter sequence of the input speech, and, based on those likelihoods, the adapted speech models corresponding to one model sequence are selected as the adapted speech models.

[0020] The adapted model selection step selects the adapted speech models based not on the likelihood obtained with the speaker-independent speech models but on the likelihood obtained with the speech models after speaker adaptation. This rests on the principle that even when the speaker-independent speech models give a low likelihood for the correct model sequence, the speaker-adapted speech models give a high likelihood for that sequence. In the present invention, a plurality of model sequences are extracted so that the correct model sequence is included with high probability, speaker adaptation is performed for each of them, the likelihoods of the speech to be recognized under the respective adapted speech models are compared, and the speech models giving the highest likelihood are selected. Adaptation is thereby based on a more correct model sequence.

[0021] To extract a plurality of model sequences that include the correct one in the model sequence extraction step, the N-best paradigm of multiple-pass search strategies can be used, for example as described in C.-H. Lee et al. (eds.), "Automatic Speech and Speaker Recognition" (Chapter 18, "Multiple-pass search strategies"), Kluwer Academic Publishers, 1995. This effectively reduces the search space of model sequences. Specifically, the parameters of the speaker-independent speech models are adapted with the speech of the speaker to be recognized so as to increase the likelihood for that speech, and as adaptation proceeds the hypothesis (model sequence) used for adaptation is re-selected from among the N-best hypotheses, so that the speech models are adapted to the speaker.

[0022] In the present invention, the adaptation for each model sequence is first performed coarsely, with the parameters strongly tied. Using the coarsely adapted speech models and the speech of the speaker to be recognized, adaptation is then performed again for each of the model sequences, and the models giving the highest likelihood for the speaker's speech are selected as the adapted speech models. In that pass the adaptation for each model sequence is performed more finely, with the parameters tied more loosely than before. Instead of adapting the speech models with the model parameters of the single model sequence with the highest likelihood chosen in the adapted model selection step, the speech models may be adapted with model parameters obtained by averaging the model parameters of the N model sequences with weights corresponding to their likelihoods. The above procedure is repeated at least once. The repetition may cover only the provisional adaptation step and the adapted model selection step; that is, the model sequences extracted initially may be reused without returning to the model sequence extraction step.
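The likelihood-weighted averaging mentioned above can be sketched as follows. The text does not say how the weights are derived from the likelihoods, so this sketch assumes they are simply the likelihoods normalized to sum to 1, computed from log-likelihoods for numerical stability. The function names and the numbers are hypothetical.

```python
import math

def likelihood_weights(log_likelihoods):
    """Normalize likelihoods into weights that sum to 1, working from their logs."""
    peak = max(log_likelihoods)
    scaled = [math.exp(ll - peak) for ll in log_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]

def weighted_average_params(param_sets, log_likelihoods):
    """Average N parameter vectors, weighting each by its normalized likelihood."""
    weights = likelihood_weights(log_likelihoods)
    dim = len(param_sets[0])
    return [sum(w * p[d] for w, p in zip(weights, param_sets)) for d in range(dim)]

# Hypothetical adapted means for N = 3 model sequences and their log-likelihoods.
adapted_means = [[0.2, 1.1, 2.0], [0.4, 1.3, 2.2], [1.0, 2.0, 3.0]]
log_liks = [-10.0, -10.0, -30.0]
averaged = weighted_average_params(adapted_means, log_liks)
```

With these numbers the third hypothesis is so unlikely that it contributes almost nothing, and the result is close to the plain average of the first two parameter sets.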

[0023] When a speech recognition result is to be output, the model sequence with the highest likelihood finally selected in the adapted model selection step is output as the recognition result. Alternatively, the input speech is re-recognized with the adapted speech models corresponding to the model sequence selected in the adapted model selection step, and the model sequence giving the highest likelihood is output as the recognition result. A recognition method using the adaptation algorithm of the present invention can be recorded in advance on a recording medium, and that recording medium can be used in various speaker-independent speech recognition systems.

[0024]

BEST MODE FOR CARRYING OUT THE INVENTION

First Embodiment

A first embodiment of the present invention is described with reference to the flowchart of FIG. 2 and the functional block diagram of the speech recognition system in FIG. 3. The speech model storage unit 12 is assumed to store in advance, as a reference model dictionary, speaker-independent speech models, for example speaker-independent speech HMMs, that correspond to recognition categories such as words and were trained on the speech of many speakers. The procedure for carrying out the speech recognition method using the speaker adaptation method of the present invention, described below, is recorded in advance in the storage unit 10M of the control unit 10, and the control unit 10 controls the processing of the units 11 to 16 of the speech recognition system in FIG. 3 according to that procedure.

[0025] Step S1 (feature extraction step): The feature extraction unit 11 extracts features from the speech data of the speaker to be recognized. The input speech data is subjected to LPC analysis frame by frame, and a time series of feature parameter vectors, for example cepstra or Δ-cepstra, is obtained as the feature parameter sequence.

Step S2 (model sequence extraction step): The model sequence selection unit 13 concatenates speech models selected from the speech model storage unit 12 and extracts the N model sequences Λ₁, Λ₂, ..., Λ_N estimated to be closest to the speech data X converted into the feature parameter sequence in step S1. The model sequences are extracted, for example, by the technique described in W. Chou et al., "An algorithm of high resolution and efficient multiple string hypothesization for continuous speech recognition using inter-word models," Proc. ICASSP, pp. II-153-156, 1994.

[0026] Specifically, for example, the likelihood calculation unit 14 calculates the likelihood f(X|Λ,θ) of the speech data X, converted into the feature parameter sequence, for every possible speech HMM sequence Λ, and the model sequence selection unit 13 selects and extracts the N model sequences Λ_n (n = 1, ..., N) with the highest likelihoods, where N is a predetermined integer of 2 or more. N is chosen large enough that the model sequence correctly representing the speech to be recognized (the correct answer) is included among the N extracted model sequences with high probability. For example, for four-digit spoken numbers (word sequences), N = 10 gives a probability of about 97% that the correct model sequence is among the 10 extracted sequences, so N is set to 10. A larger N further raises the probability that the correct model sequence is included, and, leaving aside computation and processing time, a larger N is preferable. N must also be increased as the speech to be recognized becomes more complex. What matters in the present invention is that the N extracted model sequences contain the correct answer with high probability.
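The selection of the N highest-likelihood sequences can be illustrated with a short Python sketch. It assumes the candidate sequences have already been scored, which is a simplification: a practical recognizer would use an N-best search such as the one cited for step S2 instead of scoring every possible sequence. The candidate strings and scores are hypothetical.

```python
import heapq

def n_best(scored_hypotheses, n):
    """Return the n hypotheses with the highest likelihood, best first.

    scored_hypotheses is an iterable of (log_likelihood, model_sequence) pairs.
    """
    if n < 2:
        raise ValueError("N must be an integer of 2 or more")
    return heapq.nlargest(n, scored_hypotheses, key=lambda pair: pair[0])

# Hypothetical four-digit hypotheses with their log-likelihoods.
candidates = [
    (-52.1, "1 2 3 4"),
    (-50.3, "1 2 3 9"),
    (-61.0, "7 2 3 4"),
    (-55.7, "1 2 8 4"),
]
top = n_best(candidates, 3)
```

Note that the best-scoring entry under the speaker-independent models need not be the correct one; keeping several candidates is what gives the later adaptation steps a chance to recover it.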

[0027] Step S3 (provisional adaptation step): For the speech data X converted into the feature parameter sequence, the adaptation unit 15 evaluates the following equation for each model sequence Λ_n (n = 1, ..., N) extracted in step S2.

[0028]

(Equation 4)

[0029] This yields the transformation parameter η′_n that maximizes the likelihood function value f(X|Λ_n,η_n,θ) of the model sequence Λ_n for the speech data X. With η′_n, the model parameters θ of the speaker-independent speech HMMs making up the model sequence Λ_n are transformed to obtain θ′_n (= G_η′_n(θ)). For this, the Baum-Welch algorithm (see, for example, Seiichi Nakagawa, "Speech Recognition by Stochastic Models," IEICE, 1988) or the MAP estimation algorithm can be used, for example. With the model parameters θ′_n transformed using η′_n, all speech HMMs making up each of the N model sequences Λ_n (n = 1, ..., N) have been provisionally adapted to the speaker to be recognized.
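Equation (4) is not reproduced in this text. From the description in paragraph [0029], which says it gives the η′_n that maximizes f(X|Λ_n,η_n,θ), it plausibly has the maximum-likelihood form below. This is a reconstruction under that assumption, not the published equation; if MAP estimation is used instead, a prior factor g(η_n) would presumably multiply the likelihood.

```latex
\begin{equation}
\eta'_n = \operatorname*{argmax}_{\eta_n}\; f(X \mid \Lambda_n, \eta_n, \theta),
\qquad n = 1, \dots, N \tag{4}
\end{equation}
```

The only difference from the conventional method under this reading is the index n: the maximization is carried out once per hypothesized model sequence instead of once for a single recognized sequence.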

[0030] That is, as shown in FIG. 4, model sequence Λ₁ consists of the sequence of speech models λ₁₁, λ₁₂, ..., λ_1k1, model sequence Λ₂ consists of speech models λ₂₁, λ₂₂, ..., λ_2k2, and model sequence Λ_N consists of speech models λ_N1, λ_N2, ..., λ_NkN. Using equation (4), the transformation parameter η′₁ that maximizes the likelihood function f(X|Λ₁,η₁,θ₁) of the model sequence Λ₁ for the speech data X as η₁ is varied is determined, and with this η′₁ the model parameters θ of the speaker-independent speech HMMs are transformed into θ₁ (= G_η₁(θ)). When the transformation parameter η_n is determined in common for all model parameters θ of all speech models (HMMs) λ_n1, λ_n2, ..., λ_nkn making up the model sequence Λ_n, the transformations of the parameters θ of the individual speech models constrain one another; this is referred to as a strongly tied state of the parameters.

[0031] Similarly, the transformation parameter η₂ for the speech models λ₂₁, λ₂₂, ..., λ_2k2 of model sequence Λ₂ is varied, the transformation parameter η′₂ that maximizes the likelihood function f(X|Λ₂,η₂,θ₂) of the model sequence Λ₂ for the speech data X is determined, and the provisionally adapted model parameters θ₂ (= G_η₂(θ)) are obtained from η′₂. In the same way, for each model sequence Λ_n the transformation parameter η′_n that maximizes the likelihood function f(X|Λ_n,η_n,θ_n) is determined and the provisionally adapted model parameters θ_n are obtained. This gives N sets of provisionally adapted speech model parameters θ₁, ..., θ_N.

[0032] Step S4 (adapted model selection step): Next, for each model sequence Λ_n with the model parameters θ_n provisionally adapted in step S3, the likelihood for the input speech data X, that is, the likelihood function f(X|Λ_n,θ_n), is computed, and the model sequence Λ_i giving the largest of the likelihoods f(X|Λ₁,θ₁), ..., f(X|Λ_N,θ_N) is selected as the correct model sequence. The speech HMMs making up that model sequence Λ_i are therefore speech HMMs adapted to the speaker.
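Steps S3 and S4 can be illustrated with a deliberately simplified Python sketch. It assumes that each hypothesis has already been aligned to the frames, so a hypothesis is just a list of per-frame Gaussian means; that all Gaussians share one variance; and that the transformation is a single common shift of the means (the strongly tied case). Under those assumptions the maximum-likelihood shift is the average residual. The hypothesis names and values are hypothetical, and the HMM alignment itself is omitted.

```python
import math

def log_likelihood(obs, means, var=1.0):
    """Log-likelihood of frames under per-frame Gaussians with a shared variance."""
    return sum(-0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - m) ** 2 / var
               for x, m in zip(obs, means))

def adapt_shift(obs, means):
    """Maximum-likelihood common shift of all means (one eta per hypothesis)."""
    return sum(x - m for x, m in zip(obs, means)) / len(obs)

def select_adapted(obs, hypotheses):
    """Adapt every hypothesis, then pick the one with the highest adapted likelihood."""
    results = []
    for name, means in hypotheses.items():
        eta = adapt_shift(obs, means)
        adapted = [m + eta for m in means]
        results.append((log_likelihood(obs, adapted), name, eta))
    return max(results)

X = [1.0, 3.0, 5.0, 3.0, 1.0]
hypotheses = {
    "A": [0.0, 2.0, 4.0, 2.0, 0.0],   # right shape, offset by the speaker
    "B": [1.0, 3.0, 5.0, 2.0, 0.0],   # closer before adaptation, wrong shape
}
before = {name: log_likelihood(X, means) for name, means in hypotheses.items()}
best_ll, best_name, best_eta = select_adapted(X, hypotheses)
```

In this toy case hypothesis B scores higher before adaptation, but A wins once each hypothesis has been adapted, which mirrors the principle stated in paragraph [0020].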

[0033] As indicated by the broken line in FIG. 2, the provisional adaptation of step S3 may be repeated a predetermined number of times in succession. In that case, at each repetition the provisional adaptation of each model sequence by the transformation parameter η′_n is performed more finely, in a more loosely tied state than in the previous provisional adaptation. That is, for each model sequence Λ_n, all model parameters θ of all speech models making up the sequence are classified (clustered) into, for example, two groups, and the transformation parameters η_a and η_b are controlled independently so that the likelihood is maximized for the model parameters of each of the two groups (clusters); the provisional adaptation of the model sequence Λ_n is thus performed according to equations (2) and (3). In practice, as noted above, the change Δm in the Gaussian means m is used as the transformation parameter η (η = Δm), for example. In that case, the means of all distributions contained in all models making up each model sequence Λ_n are clustered into two groups, and a common change in the mean, Δm_a or Δm_b, is determined for each of the two clusters. Provisional adaptation is performed in this way so as to maximize the likelihood for each of the N model sequences.
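The looser tying with two clusters can be sketched in the same simplified, frame-aligned setting. The sketch assumes equal variances, so the maximum-likelihood shift for each cluster is the average residual over the frames aligned to that cluster's Gaussians. The cluster labels `"a"` and `"b"`, the function names, and the numbers are hypothetical.

```python
def cluster_shifts(obs, means, cluster_of):
    """Maximum-likelihood mean shift per cluster, assuming equal variances.

    cluster_of[t] names the cluster of the Gaussian aligned to frame t.
    """
    sums, counts = {}, {}
    for x, m, c in zip(obs, means, cluster_of):
        sums[c] = sums.get(c, 0.0) + (x - m)
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def apply_shifts(means, cluster_of, shifts):
    """Shift each mean by the delta of the cluster it belongs to."""
    return [m + shifts[c] for m, c in zip(means, cluster_of)]

means = [0.0, 2.0, 4.0, 2.0, 0.0]
cluster_of = ["a", "a", "b", "b", "a"]
obs = [0.5, 2.5, 3.0, 1.0, 0.5]

shifts = cluster_shifts(obs, means, cluster_of)    # delta m_a and delta m_b
adapted = apply_shifts(means, cluster_of, shifts)
```

A single shared shift could not fit these frames, because cluster "a" needs to move up and cluster "b" down; two independent shifts fit them exactly, at the cost of estimating more parameters from the same small amount of speech.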

[0034] When a speech recognition result is to be output, the maximum-likelihood model sequence Λ_i selected in step S4 is output as the recognition result in step S5. Alternatively, as indicated by the dashed line, all the speech models of the reference model dictionary are adapted in step S6 using the conversion parameter η'_i that was used for the adapted model sequence Λ_i finally selected in step S4. Then, in step S7, the input speech data X is re-recognized using the adapted reference model dictionary, the one model sequence with the maximum likelihood is extracted, and the extracted model sequence is output as the recognition result in step S5. In that case, the re-recognition processing finds, for the input speech data X converted into a feature parameter sequence, the model sequence that maximizes the likelihood function f(X|Λ,θ), for example by the Viterbi algorithm, and outputs it as the recognition result.

Second Embodiment

FIG. 5 shows the processing procedure of a second embodiment of the model adaptation method according to the present invention. In the second embodiment, the initial steps S1 to S4 are the same as steps S1 to S4 of the first embodiment in FIG. 2. In step S1, the input speech is subjected to LPC analysis and converted into a feature parameter sequence. In step S2, the N speech model sequences with the highest likelihood for the feature parameter sequence of the input speech are selected from the reference model dictionary. In step S3, each of the N model sequences is adapted so that its likelihood is maximized. In step S4, the model sequence with the maximum likelihood is selected from the N provisionally adapted model sequences. After step S4, the following processing is also executed.

[0035] Step S5 (N-model adaptation process): The conversion parameter η'_i that gave the maximum likelihood in step S3 for the model sequence Λ_i selected in step S4 is used in common to adapt the N model sequences as they were before the provisional adaptation in the immediately preceding step S3. The process then returns to step S3, where each of the N model sequences Λ₁, …, Λ_N adapted in step S5 is again provisionally adapted so that its likelihood for the input speech data X is maximized. This provisional adaptation with the conversion parameter η_n is, however, performed with the parameters within each model sequence more loosely tied than in the previous provisional adaptation.

[0036] In step S4, the model sequence that again gives the maximum likelihood among the N model sequences is selected as the correct model sequence. In the second embodiment described above, steps S5, S3, and S4 may be repeated a further desired number of times, and the model sequence that gives the maximum likelihood in the final step S4 may be selected as the correct model sequence.

[0037] In the second embodiment, a speech recognition result is output by outputting the model sequence finally selected in step S4 from the recognition result output unit 16 as the recognition result in step S6. Alternatively, as indicated by the dashed line, all the speech models of the reference model dictionary are adapted in step S7 using in common the same conversion parameter η'_i that was given in the immediately preceding step S3 to the model sequence selected in step S4. Then, in step S8, the input speech data X is re-recognized using the adapted reference model dictionary, the one model sequence with the maximum likelihood is extracted, and it is output as the recognition result in step S5. In that case, the re-recognition processing finds, for the input speech data X converted into a time series of feature parameters, the model sequence that maximizes the likelihood function f(X|θ), for example by the Viterbi algorithm, and outputs it as the recognition result. As in the first embodiment of FIG. 2, the provisional adaptation in step S3 may be repeated a plurality of times before proceeding to the adaptation model selection process in step S4.

Third Embodiment

FIG. 6 shows the processing procedure of a third embodiment of the model adaptation method according to the present invention. In the second embodiment described above, the N model sequences as they were before provisional adaptation are adapted in step S5 using the same conversion parameter η'_i that was given in the immediately preceding step S3 to the model sequence selected in step S4. In the third embodiment, by contrast, all models of the reference model dictionary are adapted in step S5, and the process returns to step S2 rather than step S3. There, the dictionary adapted in step S5 is used to select again the N model sequences with the highest likelihood for the input speech data X.

[0038] That is, as in the second embodiment of FIG. 5, the model sequence that shows the maximum likelihood among the N provisionally adapted model sequences is selected in step S4 as the provisionally correct model sequence. In the third embodiment, in step S5, all models of the reference model dictionary are adapted using the same conversion parameter η'_i that was given in step S3 to the selected model sequence, and the process returns to step S2. In step S2, the adapted reference model dictionary is used to select anew the N model sequences with the highest likelihood for the input speech data X, and in steps S3 and S4 provisional adaptation and adaptation model selection are performed as before.
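The loop of steps S2 to S5 described above can be sketched as follows. This is a toy illustration under strong assumptions and not the patent's implementation: each dictionary entry is a fixed list of 1-D means already aligned frame by frame to the input, variances are 1, and the conversion parameter is a single shift shared by all means. The names `loglik`, `n_best`, `adapt_and_select`, and `iterate_dictionary_adaptation` and the data are hypothetical.

```python
def loglik(frames, means):
    """Unit-variance Gaussian log-likelihood, up to an additive constant."""
    return -0.5 * sum((x - m) ** 2 for x, m in zip(frames, means))


def n_best(frames, dictionary, n):
    """Step S2: the n entries with the highest likelihood for the input."""
    ranked = sorted(dictionary,
                    key=lambda w: loglik(frames, dictionary[w]),
                    reverse=True)
    return ranked[:n]


def adapt_and_select(frames, dictionary, candidates):
    """Steps S3 and S4: fit one shared shift per candidate, keep the best."""
    scored = []
    for w in candidates:
        bias = sum(x - m for x, m in zip(frames, dictionary[w])) / len(frames)
        adapted = [m + bias for m in dictionary[w]]
        scored.append((loglik(frames, adapted), w, bias))
    return max(scored)


def iterate_dictionary_adaptation(frames, dictionary, n, rounds):
    """Repeat S2 -> S3 -> S4 -> S5 for a fixed number of rounds."""
    word = None
    for _ in range(rounds):
        candidates = n_best(frames, dictionary, n)                   # S2
        _, word, bias = adapt_and_select(frames, dictionary,
                                         candidates)                 # S3, S4
        # S5: adapt every model in the dictionary with the winning shift.
        dictionary = {w: [m + bias for m in ms]
                      for w, ms in dictionary.items()}
    return word, dictionary


# Toy data: the speaker shifts entry "A" by +1.5, so "B" wins before adaptation.
frames = [2.5, 4.5, 2.5]
dictionary = {"A": [1.0, 3.0, 1.0], "B": [2.0, 4.0, 3.0], "C": [9.0, 9.0, 9.0]}
first_before = n_best(frames, dictionary, 1)
word, adapted = iterate_dictionary_adaptation(frames, dictionary, n=2, rounds=2)
first_after = n_best(frames, adapted, 1)
```

In this toy case the correct entry is only second-best before adaptation, yet it is selected in step S4 and ranks first once the dictionary has been adapted, which is the situation the method is designed for.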

[0039] Steps S5, S2, S3, and S4 may be repeated further. In that case as well, it is preferable, as described above, to perform the provisional adaptation of the model sequences in step S3 with hierarchical clustering of the model parameters at each repetition. The adaptation of the reference model dictionary in step S5 then applies to the parameters of all models in the dictionary the same clustering that was applied to the model parameters in step S3, so that the model parameters θ are divided into groups. The dictionary is adapted by independently applying to the model parameters of each group the conversion parameter determined in step S3 for the corresponding cluster, for example η_ia or η_ib.

[0040] In this repetition of steps S5, S2, S3, and S4, the number of clusters is increased hierarchically: at each repetition, step S3 further clusters into, for example, two groups the model parameters of each cluster obtained by the clustering for each model sequence in the previous step S3. In the third model adaptation, therefore, the conversion parameters η_ia, η_ib, η_ic, and η_id are applied independently to the model parameters of four clusters. Likewise, in the clustering of the model parameters of all models in the dictionary in step S5, the number of clusters is increased hierarchically each time the processing loop is repeated, so that the tying of parameters between models is progressively loosened. In the final step S4, the model sequence with the highest likelihood among the provisionally adapted model sequences obtained in the immediately preceding step S3 is selected as the correct model sequence. In addition, as indicated by a line in FIG. 6, the provisional adaptation in step S3 may be repeated a predetermined number of times in succession, each time with the conversion parameters more loosely tied than the previous time.
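The hierarchical growth of the clusters described above (one cluster, then two, then four) can be sketched as below. The splitting criterion is not given in this passage, so the sketch makes a labeled assumption: every cluster is split at the median of a scalar value attached to each distribution, for example one component of its mean. The function name `split_clusters` and the data are hypothetical.

```python
from statistics import median


def split_clusters(values, labels):
    """Split every existing cluster in two, so the cluster count doubles.

    values: one scalar per Gaussian (assumed: a component of its mean)
    labels: current cluster label of each Gaussian
    Assumption: a cluster is split at its own median value.
    """
    new_labels = list(labels)
    for c in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        cut = median(values[i] for i in members)
        for i in members:
            # Children of cluster c are numbered 2c (lower) and 2c + 1 (upper).
            new_labels[i] = 2 * c + (1 if values[i] > cut else 0)
    return new_labels


values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
labels = [0] * len(values)            # first pass: all distributions tied
cluster_counts = [len(set(labels))]
for _ in range(2):                    # second and third adaptation passes
    labels = split_clusters(values, labels)
    cluster_counts.append(len(set(labels)))
```

Each pass would then estimate one conversion parameter per label, so the third pass works with four independent parameters, matching η_ia to η_id in the text.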

[0041] When a recognition result is to be output, the maximum-likelihood model sequence selected in step S4 of the last repetition is output. Alternatively, after the repeated adaptation processing of steps S2 to S5 has been executed a predetermined number of times, the processing may end with the reference model dictionary adaptation process of step S5. Then, as indicated by the dashed line, the input speech data X is re-recognized in step S6 using the adapted reference model dictionary, and the one model sequence with the maximum likelihood is extracted and output as the recognition result. In that case, the re-recognition processing finds, for the input speech data X converted into a time series of feature parameters, the model sequence that maximizes the likelihood function f(X|θ), for example by the Viterbi algorithm, and outputs it as the recognition result. As in the first embodiment of FIG. 2, the provisional adaptation in step S3 may be repeated a plurality of times before proceeding to the adaptation model selection process in step S4.
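For readers unfamiliar with the Viterbi algorithm mentioned above, the following is a generic log-domain sketch for a single small HMM. It is not the recognizer of the patent, which searches over sequences of speech models. The function name `viterbi` and the two-state example probabilities are hypothetical.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log score.

    log_init:  log initial probability of each state
    log_trans: log_trans[i][j] is the log probability of moving i -> j
    log_emit:  log_emit[t][s] is the log likelihood of frame t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []                                  # best predecessor per frame
    for t in range(1, len(log_emit)):
        prev, step, score = score, [], []
        for j in range(n_states):
            i_best = max(range(n_states),
                         key=lambda i: prev[i] + log_trans[i][j])
            step.append(i_best)
            score.append(prev[i_best] + log_trans[i_best][j] + log_emit[t][j])
        back.append(step)
    state = max(range(n_states), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for step in reversed(back):                # trace predecessors backwards
        state = step[state]
        path.append(state)
    return path[::-1], best


def logs(rows):
    """Convert a table of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in rows]


log_init = [math.log(0.9), math.log(0.1)]
log_trans = logs([[0.5, 0.5], [0.1, 0.9]])
log_emit = logs([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
path, best = viterbi(log_init, log_trans, log_emit)
```

Working in the log domain avoids numerical underflow when many frame likelihoods are multiplied together.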

[0042] In the first, second, and third embodiments described above, the product f(X|Λ_n,η_n,θ)g(θ) of the likelihood function and a prior probability density function g(θ) may be used in step S3 in place of the likelihood function f(X|Λ_n,η_n,θ). The speech model described above is not limited to an HMM. The extraction of model sequences in step S2 is also not limited to the N-best paradigm. What matters is that the sequences are extracted so that the correct model sequence is included with high probability.
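The effect of multiplying the likelihood by a prior g(θ) can be illustrated with a small sketch. The form of g(θ) is not specified here, so the sketch assumes a Gaussian prior that keeps the adapted mean close to the original mean, which is equivalent to a zero-mean Gaussian prior with variance `prior_var` on the shared shift. It also assumes unit-variance Gaussians. The names `ml_bias` and `map_bias` are hypothetical.

```python
def ml_bias(residuals):
    """Shift maximizing f alone: the mean residual (unit-variance case)."""
    return sum(residuals) / len(residuals)


def map_bias(residuals, prior_var):
    """Shift maximizing f * g for a zero-mean Gaussian prior on the shift.

    Setting the derivative of
        -0.5 * sum((r_t - b)^2) - 0.5 * b^2 / prior_var
    with respect to b to zero gives b = sum(r_t) / (T + 1 / prior_var).
    """
    return sum(residuals) / (len(residuals) + 1.0 / prior_var)


residuals = [1.0, 1.0, 1.0, 1.0]       # x_t - m for the aligned frames
shift_ml = ml_bias(residuals)
shift_map = map_bias(residuals, prior_var=0.25)
```

With few frames the prior pulls the estimate toward zero, and as the number of frames grows the two estimates converge.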

[0043] In the second and third embodiments described above, the N model sequences Λ₁ to Λ_N selected according to the N-best method are provisionally adapted in step S3, the one model sequence showing the maximum likelihood is selected from among them in step S4, and the other models are adapted using the conversion parameter of the selected model sequence. The correct model sequence, however, does not necessarily show the maximum likelihood after provisional adaptation. With the HMM for unspecified speakers, when the correct model sequence is ranked below the 3-best in particular, the correct answer often does not reach first place even after provisional adaptation. Therefore, as described below, instead of selecting only the single maximum-likelihood model sequence from the N model sequences in step S4, lower-ranked model sequences may also be allowed to contribute to the adaptation, with their reliability taken into account.

[0044] That is, in this modification, a reliability C_n (n = 1, …, N) is defined in step S4 for each of the N highest-likelihood model sequences Λ₁ to Λ_N selected in step S2 by the following equation:

$$C_n = \left\{ \frac{f(X \mid \Lambda_n, \theta_n)}{f_{\max}} \right\}^{r}$$

Here, f_max is the largest of the likelihoods for the input speech data X shown by the N model sequences Λ_n after provisional adaptation, f(X|Λ_n,θ_n) is the likelihood shown by each model sequence Λ_n after provisional adaptation, and r is a constant parameter set experimentally. This reliability C_n is calculated for each sequence, and the model parameters of the speech HMM are calculated according to the following equation.

[0045]

$$\theta' = \frac{\sum_{n=1}^{N} C_n \theta_n}{\sum_{n=1}^{N} C_n}$$

The adaptation in step S5 is performed using the parameter θ' obtained in this way. This method is called the smoothing estimation method.
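The reliability weighting and the smoothed parameter θ' above can be computed as in the following sketch. It works with log-likelihoods, which is an implementation choice made here to avoid underflow and is not stated in the text: since C_n = (f_n / f_max)^r, it follows that C_n = exp(r (log f_n − log f_max)). The function name `smoothed_parameters` and the numbers are hypothetical.

```python
import math


def smoothed_parameters(log_likelihoods, thetas, r):
    """Return the reliabilities C_n and the smoothed parameter vector.

    log_likelihoods: log f(X | Lambda_n, theta_n) after provisional adaptation
    thetas:          one adapted parameter vector theta_n per model sequence
    r:               experimentally chosen exponent
    """
    log_max = max(log_likelihoods)
    # C_n = (f_n / f_max) ** r, evaluated in the log domain.
    conf = [math.exp(r * (ll - log_max)) for ll in log_likelihoods]
    total = sum(conf)
    dim = len(thetas[0])
    theta_smooth = [sum(c * th[d] for c, th in zip(conf, thetas)) / total
                    for d in range(dim)]
    return conf, theta_smooth


log_likelihoods = [math.log(1.0), math.log(0.5), math.log(0.25)]
thetas = [[0.0], [2.0], [4.0]]
conf, theta_smooth = smoothed_parameters(log_likelihoods, thetas, r=1.0)
```

Because every ratio f_n / f_max is at most 1, a larger r concentrates the weight on the top-ranked sequence, while r = 0 gives a plain average of the N parameter sets.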

[0046]

[Effects of the Invention] As described above, in the present invention a plurality of model sequences of the speech models for unspecified speakers are extracted using the speech of the speaker to be recognized, and the speech models for unspecified speakers are adapted for each model sequence. Therefore, if the correct answer is included among the extracted model sequences, it is very likely that, among the likelihoods between the speech data X and the model sequences adapted individually in step S3, the one for the correct model sequence will be the largest. Because the present invention adopts the adapted speech model based on this maximum-likelihood model sequence, a more correct adaptation is achieved. For example, suppose that the model sequences Λ₁, …, Λ_N are extracted in step S2 in descending order of likelihood. If the likelihood of the correct model sequence ranks below first place, the conventional method adapts the speech models using an incorrect model sequence. In the present invention, however, steps S3 and S4 make it likely that the likelihood with the adapted speech model based on the correct model sequence becomes the largest, so a more correct adaptation is obtained. It is therefore important, as stated above, that the N model sequences extracted in step S2 include the correct answer with high probability.

The effect of the present invention was examined in a recognition experiment on four-digit numbers (word sequences), described below. The speech HMMs used in the experiment were continuous HMMs with four Gaussian mixture components. Thirteen speech HMMs were prepared for the digits in the four-digit numbers (/rei/, /maru/, /zero/, /ichi/, /ni/, /san/, /yon/, /go/, /roku/, /nana/, /hachi/, /kyu/, /ku/). The speech HMMs for unspecified speakers were created from a total of 24,194 utterances (isolated digits, two-digit numbers, or four-digit numbers) by 177 male speakers, with the HMM parameters estimated by the Baum-Welch algorithm. Evaluation was based on the four-digit recognition rate when six utterances (four-digit numbers) from each of 100 male speakers, different from those used to create the speech HMMs for unspecified speakers, were used for speaker adaptation and recognition. As feature parameters, the cepstrum and Δcepstrum were extracted from speech data sampled at 8 kHz by LPC analysis of order 12 with a frame length of 32 ms at a frame period of 8 ms.
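The framing stated above translates into sample counts as in the sketch below: at 8 kHz, a 32 ms frame is 256 samples and an 8 ms frame period is a hop of 64 samples. The sketch covers only the framing. The windowing, the LPC analysis, and the cepstrum computation are omitted, and dropping a trailing partial frame is an assumption made here. The name `frame_signal` is hypothetical.

```python
SAMPLE_RATE_HZ = 8000
FRAME_PERIOD_MS = 8
FRAME_LENGTH_MS = 32

HOP = SAMPLE_RATE_HZ * FRAME_PERIOD_MS // 1000            # 64 samples
FRAME_LENGTH = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000   # 256 samples


def frame_signal(signal):
    """Cut a sample list into overlapping analysis frames.

    Assumption: a trailing partial frame is dropped rather than padded.
    """
    if len(signal) < FRAME_LENGTH:
        return []
    n_frames = 1 + (len(signal) - FRAME_LENGTH) // HOP
    return [signal[i * HOP:i * HOP + FRAME_LENGTH] for i in range(n_frames)]


one_second = [0.0] * SAMPLE_RATE_HZ
frames = frame_signal(one_second)
```

Consecutive frames overlap by 192 samples (24 ms), so each second of speech yields roughly 125 feature vectors for the HMM.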

[0047] In the speaker adaptation, following the third embodiment, only the bias component Δm of the mean m of each Gaussian mixture distribution of the speech HMM was estimated (that is, η = Δm). The total number of biases Δm was initially set to 1, common to all distributions of all models (strongly tied), and was then increased to 2, 4, 8, and 16 as the adaptation process was repeated, so that the biases were estimated hierarchically (that is, clustering was performed hierarchically). The number N of N-best hypotheses (the model sequences extracted in step S2) was set to 10.

[0048] FIG. 7 shows three recognition rates: the rate obtained with the speech HMMs for unspecified speakers (baseline); the rate obtained with speech HMMs adapted by the conventional adaptation method, which adapts the speech HMMs for unspecified speakers to the speaker according to the model sequence obtained by recognizing the speech with those HMMs (conventional method); and the rate obtained with speech HMMs adapted by the method of the third embodiment of the present invention (present method). These results show that the method of the present invention is more effective than the conventional method.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining an HMM, in which A shows a feature parameter sequence with three characteristic regions of a phoneme, B shows schematic Gaussian mixture distributions of the feature parameters in each region, and C shows the state transitions between regions.

FIG. 2 is a flowchart showing the procedure of the method according to the first embodiment of the present invention.

FIG. 3 is a functional block diagram of a speech recognition system embodying the present invention.

FIG. 4 is a diagram showing the extracted model sequences obtained in step S2 and the corresponding likelihood functions.

FIG. 5 is a flowchart showing the procedure of the method according to the second embodiment of the present invention.

FIG. 6 is a flowchart showing the procedure of the method according to the third embodiment of the present invention.

FIG. 7 is a diagram showing experimental results that demonstrate the effect of the present invention.

Claims

1. A speaker adaptation method for a speech model, which adapts the speech models for unspecified speakers of a reference model dictionary to the speech of a speaker to be recognized, the dictionary being trained using the speech of a large number of speakers and representing, by model parameters, speech models corresponding to recognition categories such as phonemes or words, the method comprising: (a) a feature extraction step of extracting a feature parameter sequence from the input speech of the speaker to be recognized; (b) a model sequence extraction step of extracting, from the speech models for unspecified speakers in the reference model dictionary, a plurality of hypothesis model sequences estimated to correspond to the feature parameter sequence of the input speech; (c) a provisional adaptation step of provisionally adapting each of the extracted hypothesis model sequences by controlling the model parameters of that hypothesis model sequence so that its likelihood for the feature parameter sequence of the input speech is maximized; and (d) an adaptation model selection step of selecting at least one of the provisionally adapted hypothesis model sequences as an adapted speech model sequence, based on the likelihoods of the respective provisionally adapted hypothesis model sequences for the feature parameter sequence.

2. The speaker adaptation method for a speech model according to claim 1, wherein the adaptation model selection step (d) comprises selecting, as the adapted speech model sequence, the hypothesis model sequence that has the maximum likelihood for the feature parameter sequence after being provisionally adapted in the provisional adaptation step (c).

3. The speaker adaptation method for a speech model according to claim 1 or 2, further comprising: a step of adapting the speech model sequences for unspecified speakers based on the model parameters of the adapted speech model sequence selected in the adaptation model selection step (d); and a step of repeating at least once the provisional adaptation step (c), performed more finely on each of the extracted hypothesis model sequences using the adapted speech models for unspecified speakers, with respect to the likelihood of the sequence for the extracted feature parameter sequence and with the conversion parameters more loosely tied than the previous time, together with the reselection of the adapted model sequence in the adaptation model selection step (d).

4. The speaker adaptation method for a speech model according to claim 1, further comprising: a reference model dictionary adaptation step of adapting the speech models of the reference model dictionary to the provisionally adapted model sequence selected in the adaptation model selection step (d); a step of extracting again, in the model sequence extraction step and using the adapted reference model dictionary, the plurality of hypothesis model sequences having the highest likelihood for the input speech; a step of provisionally adapting those hypothesis model sequences more finely in the provisional adaptation step (c), with the parameters more loosely tied; and a step of reselecting, from among the provisionally adapted hypothesis model sequences, the adapted model sequence having the maximum likelihood.

5. The speaker adaptation method for a speech model according to claim 2 or 4, further comprising a step of repeating the provisional adaptation step (c) a predetermined number of times in succession, each time more finely and with the conversion parameters more loosely tied than the previous time.

6. The speaker adaptation method for a speech model according to claim 1 or 2, wherein the model sequence extraction step (b) is a step of obtaining a plurality of model sequences having the highest likelihood for the feature parameter sequence of the input speech.

7. The speaker adaptation method for a speech model according to claim 1, wherein the adaptation model selection step (d) comprises selecting, as the adapted speech model, a sum of the adapted speech models corresponding to the extracted model sequences provisionally adapted in the provisional adaptation step, weighted according to the likelihoods of those model sequences for the feature parameter sequence of the input speech.

8. The method according to claim 1, wherein the speech model is a speech HMM defined by statistical parameters.

9. The speaker adaptation method for a speech model according to claim 8, wherein the adaptation or the provisional adaptation is a process of adapting so that the likelihood is maximized by controlling the mean of a Gaussian distribution, which is at least one of the model parameters defining the speech HMM.

10. A speech recognition method comprising, in the speaker adaptation method for a speech model according to claim 1 or 2, a step of outputting the recognition category corresponding to the model sequence selected in the adaptation model selection step as the speech recognition result for the input speech of the speaker to be recognized.

11. The speaker adaptation method for a speech model according to claim 2, further comprising: a reference model dictionary adaptation step of adapting all the speech models in the reference model dictionary in accordance with the model sequence selected in the adaptation model selection step; and a re-recognition step of extracting, from the adapted reference model dictionary, the speech model sequence having the highest likelihood for the feature parameter sequence of the input speech and outputting it as a recognition result.

12. A speech recognition method comprising, in the method according to claim 4, a step of performing recognition processing again on the feature parameter sequence of the input speech using the last-adapted speech models for unspecified speakers in the reference model dictionary, extracting the one model sequence that gives the maximum likelihood, and outputting it as the recognition result for the input speech of the speaker to be recognized.

13. The speaker adaptation method for a speech model according to claim 11, wherein the re-recognition processing is a process of obtaining, by the Viterbi algorithm, the model sequence that maximizes the likelihood function for the input speech data converted into a time series of feature parameters.

14. A recording medium on which the procedure of the speech recognition method according to claim 10, 12, or 13 is recorded.