JPH1185186A

JPH1185186A - Nonspecific speaker acoustic model forming apparatus and speech recognition apparatus

Info

Publication number: JPH1185186A
Application number: JP9242513A
Authority: JP
Inventors: Jun Ishii; 純石井
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1997-09-08
Filing date: 1997-09-08
Publication date: 1999-03-30
Anticipated expiration: 2017-09-08
Also published as: JP3088357B2

Abstract

PROBLEM TO BE SOLVED: To improve the speech recognition rate by training on the feature parameters of speaker-normalized speech data and then learning the variation caused by differences in speaker characteristics. SOLUTION: A phoneme collation unit 4 executes phoneme collation processing in response to a phoneme collation request from a phoneme-context-dependent LR parser 5, and calculates the likelihood of the data within the phoneme collation interval by using the speaker model of the HMM stored in the memory of a speaker-adapted hidden Markov model (HMM) 11. The likelihood is returned to the LR parser 5 as a phoneme collation score. The LR parser 5 refers to an LR table 12, processes the input phoneme prediction data from left to right without backtracking, predicts the next phoneme from the LR table 12, and outputs it to the phoneme collation unit 4. The phoneme collation unit 4 performs collation by referring to the information in the HMM 11 for that phoneme and returns the likelihood to the LR parser 5 as a speech recognition score; the LR parser 5 successively concatenates the phonemes and thereby recognizes continuous speech.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speaker-independent (unspecified-speaker) acoustic model generation apparatus that generates a hidden Markov model (hereinafter, HMM) serving as a speaker-independent acoustic model, by performing speaker normalization on an initial speaker model using feature parameters of speaker-dependent speech data and then performing conversion to speaker independence, and to a speech recognition apparatus that performs speech recognition using the generated speaker-independent HMM.

[0002]

[Prior Art] For speech recognition applications, there is strong demand for speaker-independent speech recognition systems that can be used without prior speaker enrollment. However, the performance of current speaker-independent speech recognition is lower than that of speaker-dependent speech recognition, with an error rate about two to three times higher. To improve the performance of speaker-independent speech recognition, speaker adaptation, which uses a small amount of adaptation data uttered by a specific speaker to bring the acoustic model for speaker-independent recognition closer to that speaker, has been studied (see, for example, Prior Art Document 1: C. L. Leggetter et al., "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models", Computer Speech and Language, Vol. 9, pp. 171-185, 1995). However, a large amount of adaptation data for training is required before performance equivalent to that of speaker-dependent speech recognition is reached.

[0003] In general, a speaker-independent HMM (hereinafter, SI-HMM) is trained using speech data from a plurality of speakers. Although the training data contains not only differences between speakers but also differences in, for example, the situation (context) in which each training unit occurs, it is processed in the same way as when training an acoustic model for speaker-dependent speech recognition (a speaker-dependent HMM; hereinafter, SD-HMM). As a result, variation due to speaker differences and variation due to phonemic context are both mixed into the SI-HMM, producing a model with a large spread. This is considered to be one cause of the degradation in discrimination performance. In a speech recognition system based on continuous mixture density HMMs, this appears as an increase in the variance of the Gaussian distributions; recognition units then overlap, which makes discrimination difficult.

[0004] In particular, when speaker adaptation is performed using the conventional multiple regression mapping model disclosed in Prior Art Document 1 with only a small amount of adaptation data for training, the adaptation parameters are estimated relatively poorly and the speech recognition rate is relatively low.

[0005] To solve the above problems, the present applicant disclosed a speaker normalization apparatus and a speaker adaptation apparatus in Japanese Patent Application No. 09-054596. The speaker normalization apparatus is characterized by "comprising: first calculating means for calculating, for each speaker, by the maximum likelihood linear regression method, a first transformation coefficient including a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model, with respect to an initial model of a predetermined hidden Markov model, based on feature vectors of speech data dependent on each of a plurality of speakers; second calculating means for calculating feature vectors of normalized speech data by subtracting, for each speaker, the constant term vector calculated by the first calculating means from the feature vectors of the speech data dependent on each of the plurality of speakers; and third calculating means for calculating model parameters of a speaker-normalized hidden Markov model by training the initial model of the hidden Markov model with a predetermined training algorithm, based on the feature vectors of the normalized speech data calculated by the second calculating means." The speaker adaptation apparatus is characterized by "comprising: fourth calculating means for calculating, by the maximum likelihood linear regression method, a second transformation coefficient including a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model, with respect to the hidden Markov model calculated by the third calculating means of the speaker normalization apparatus, based on feature vectors of speech data of the speaker to be adapted to; fifth calculating means for calculating, by maximum a posteriori estimation, a third transformation coefficient including a transformation matrix and a constant term vector for transforming mean vectors based on a speaker-adapted multiple regression mapping model, based on the second transformation coefficient including the transformation matrix and the constant term vector calculated by the fourth calculating means; and sixth calculating means for calculating the mean vectors of the speaker-adapted hidden Markov model by executing a predetermined linear transformation process on the third transformation coefficient including the transformation matrix and the constant term vector calculated by the fifth calculating means."

[0006]

[Problems to Be Solved by the Invention] However, the above speaker adaptation apparatus has the problem that speaker adaptation processing must be performed at recognition time. For a speaker-independent model, a system that needs no speaker adaptation processing is desirable. An object of the present invention is to solve the above problems and to provide a speaker-independent acoustic model generation apparatus and a speech recognition apparatus that can improve the speech recognition rate in speaker-independent speech recognition compared with the prior art.

[0007]

[Means for Solving the Problems] The speaker-independent acoustic model generation apparatus according to claim 1 of the present invention comprises: first calculating means for obtaining a hidden Markov model adapted to each speaker by calculating, for each speaker, by the maximum likelihood linear regression method, a first transformation coefficient including a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model, with respect to an initial model of a predetermined hidden Markov model, based on feature vectors of speech data dependent on each of a plurality of speakers; second calculating means for calculating, based on the hidden Markov model adapted to each speaker obtained by the first calculating means, an optimal state sequence from the speech data and the text data of its utterance content using the Viterbi algorithm, and for calculating, for the optimal state at each time, a mixture component sequence in which the feature vector of the speech data gives the maximum output probability; third calculating means for calculating feature vectors of speaker-normalized speech data by speaker-normalizing the feature vectors of the speech data using the mean vectors, before and after speaker adaptation, of the mixture components in each state of the optimal state sequence calculated by the second calculating means; fourth calculating means for calculating model parameters of a speaker-normalized hidden Markov model by training the initial model of the hidden Markov model with a predetermined training algorithm, based on the feature vectors of the normalized speech data calculated by the third calculating means; fifth calculating means for obtaining mean vectors of a hidden Markov model adapted to each speaker by calculating, for each speaker, by the maximum likelihood linear regression method, a second transformation coefficient including a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model, with respect to the speaker-normalized hidden Markov model calculated by the fourth calculating means; and sixth calculating means for obtaining a speaker-independent hidden Markov model by calculating its mean vectors and covariance matrices through conversion to speaker independence, based on the mean vectors of the adapted hidden Markov models obtained by the fifth calculating means and the model parameters of the speaker-normalized hidden Markov model calculated by the fourth calculating means.

[0008] The speech recognition apparatus according to claim 2 of the present invention comprises speech recognition means for recognizing speech based on the speech signal of an input uttered sentence, using the hidden Markov model calculated by the sixth calculating means of the speaker-independent acoustic model generation apparatus according to claim 1, and for outputting a speech recognition result.

[0009]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0010] FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention. This embodiment is characterized by comprising a speaker normalization control unit 20 and a speaker-independent conversion control unit 21.

[0011] The speaker normalization control unit 20 operates as follows.
(a) Based on the feature vectors $O^m = [o^m_1, o^m_2, \ldots, o^m_{T_m}]$ ($m = 1, 2, \ldots, M$) of the speech data 32-1 to 32-M, each dependent on one of M speakers, it calculates for each speaker m, by the maximum likelihood linear regression method and using Equation 1 below, first transformation coefficients $A^m$, $b^m$, which include a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model, with respect to the initial model 31 of a predetermined HMM (hereinafter, initial HMM), thereby obtaining an HMM $\lambda h^m$ adapted to each speaker.
(b) Based on the HMM $\lambda h^m$ adapted to each speaker m, it calculates the optimal state sequence $p^m = [p^m_1, p^m_2, \ldots, p^m_{T_m}]$ from the speech data and the text data of its utterance content (stored in memory together with the speech data 32-1 to 32-M) using the Viterbi algorithm, and calculates, using Equation 2 below, the mixture component sequence $q^m = [q^m_1, q^m_2, \ldots, q^m_{T_m}]$ in which the feature vector $O^m$ of the speech data gives the maximum output probability for the optimal state at each time.
(c) Using the mean vectors, before and after speaker adaptation, of the mixture component $q^m_t$ in each state $p^m_t$ of the calculated optimal state sequence, it speaker-normalizes the feature vectors of the speech data and thereby calculates, using Equation 3 below, the feature vectors $Ob = [Ob^1, Ob^2, \ldots, Ob^M]$ of the speaker-normalized speech data.
(d) Based on the calculated feature vectors $Ob$ of the normalized speech data, it trains the initial HMM with the Baum-Welch training algorithm using Equations 4 to 8 below, thereby calculating the model parameters of the speaker-normalized HMM $\lambda b$. Here, the model parameters include HMM parameters such as the mean vectors, the variances of the Gaussian distributions, and the state transition probabilities.

[0012] The speaker-independent conversion control unit 21 operates as follows.
(e) For the speaker-normalized HMM $\lambda b$ calculated above, it calculates for each speaker m, by the maximum likelihood linear regression method and using Equation 9, second transformation coefficients including a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model, thereby obtaining the mean vectors of an HMM adapted to each speaker m (the parameters adapted here are the means of the Gaussian distributions).
(f) Based on the mean vectors of the adapted HMMs thus obtained and the covariance matrices, which are model parameters of the speaker-normalized HMM calculated by the speaker normalization control unit 20, it performs conversion to speaker independence using Equations 10 and 11 below, calculating the mean vectors and covariance matrices of the speaker-independent HMM $\lambda a$ and thereby obtaining the speaker-independent HMM $\lambda a$. For the transition probabilities and mixture weights of the HMM $\lambda a$, the parameters of the speaker-normalized HMM $\lambda b$ are used.

[0013] Further, the speech recognition apparatus of FIG. 1 uses the speaker-independent HMM 11 to recognize speech based on the speech signal of an input uttered sentence and outputs a speech recognition result.

[0014] First, the creation of a speaker-independent model using speaker normalization in this embodiment is described. When spontaneous speech data such as conversation is processed in a single batch, speaking style differs greatly from one training speaker to another, so an acoustic model with a large spread is generated. The utterance content also differs from speaker to speaker, so the number of training speakers is unbalanced across recognition units, and the variation due to speaker differences can no longer be represented correctly. For these reasons, the HMM is considered to have a broad and biased output probability distribution, as shown in FIG. 6(a), which lowers speaker-independent recognition performance. A speaker-independent model based on spontaneous speech is therefore created by the following two-stage process.
(a) Speaker normalization is performed, and training is carried out so as to obtain the speaker-normalized model (SN model: $\lambda b$) of FIG. 6(b), which contains only within-speaker phonemic variation.
(b) Using the speaker-normalized model $\lambda b$ as a reference, the variation due to speaker differences is estimated, and the speaker-normalized speaker-independent model (SN-SI model: $\lambda a$) of FIG. 6(c) is obtained.

[0015] First, speaker normalization of the training data is described. Speaker normalization is performed by obtaining the set of speaker-normalized observation sequences $Ob = [Ob^1, Ob^2, \ldots, Ob^M]$ from the set $O = [O^1, O^2, \ldots, O^M]$ of observation sequences of feature parameters taken from the speech data of M training speakers. The observation sequence of feature parameters for speaker m is $O^m = [o^m_1, o^m_2, \ldots, o^m_{T_m}]$, where each $o$ is an n-dimensional vector and the subscript is the time (specifically, the frame number). This embodiment describes a speaker normalization method that uses a speaker adaptation method and assumes that the speaker-normalized observation vector is given by the position of the observation vector relative to the speaker-adapted model. The adapted model $\lambda h^m$ of speaker m (in this specification, "model" means an HMM) is obtained by adaptive training that starts from the initial model $\lambda$ and uses the observation sequence $O^m$ of feature vectors as training data. Here, the maximum likelihood linear regression method (hereinafter, MLLR; see, for example, Prior Art Document 1) is used as the speaker adaptation method, and the mean vector $\mu(j,k)$ of a Gaussian distribution (mixture component k in state j) is mapped to the adapted mean vector $\mu h^m(j,k)$ by the following equation.

(Equation 1) $$\mu h^m(j,k) = A^m \mu(j,k) + b^m$$

[0016] Here, $A^m$ and $b^m$ are an $n \times n$ matrix and an n-dimensional vector, respectively, and are estimated for each sharing class of Gaussian distributions; n is the dimensionality of the feature vector. FIG. 4 shows a conceptual diagram of the MLLR processing.
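The mapping of Equation 1 can be illustrated with a minimal sketch in plain Python. It only applies an already estimated transform to the Gaussian means; estimating $A^m$ and $b^m$ by maximum likelihood is not shown. The function names, the dictionary layout, and the `class_of` assignment of Gaussians to sharing classes are hypothetical and chosen for illustration only.

```python
def mllr_transform_mean(A, b, mu):
    """Return A*mu + b (Equation 1) for an n x n matrix A and n-vectors b, mu."""
    n = len(mu)
    if len(A) != n or any(len(row) != n for row in A) or len(b) != n:
        raise ValueError("A must be n x n and b must have length n")
    return [sum(A[i][j] * mu[j] for j in range(n)) + b[i] for i in range(n)]


def adapt_means(means, class_of, transforms):
    """Apply to each Gaussian mean the transform of its sharing class.

    means:      {(j, k): mu}        mean of mixture component k in state j
    class_of:   {(j, k): class_id}  hypothetical sharing-class assignment
    transforms: {class_id: (A, b)}  one transform per sharing class
    """
    adapted = {}
    for key, mu in means.items():
        A, b = transforms[class_of[key]]
        adapted[key] = mllr_transform_mean(A, b, mu)
    return adapted


# Toy data: two Gaussians in state 0 that share one transform.
means = {(0, 0): [1.0, 2.0], (0, 1): [0.0, -1.0]}
class_of = {(0, 0): "c0", (0, 1): "c0"}
transforms = {"c0": ([[2.0, 0.0], [0.0, 1.0]], [0.5, -0.5])}
adapted = adapt_means(means, class_of, transforms)
```

Because one transform is shared by every Gaussian in a class, even Gaussians that were never observed in a speaker's data are moved, which is why the estimate is made per sharing class rather than per Gaussian.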

[0017] Next, using the adapted model $\lambda h^m$ of speaker m, the optimal state sequence $p^m = [p^m_1, p^m_2, \ldots, p^m_{T_m}]$ is obtained by the Viterbi algorithm from the observation sequence $O^m$ of the feature vectors of speaker m and the text data of its utterance content. Then, for the optimal state at each time, the mixture component sequence $q^m = [q^m_1, q^m_2, \ldots, q^m_{T_m}]$ in which the observation sequence $O^m$ gives the maximum output probability is extracted by the following equation.

(Equation 2) $$q^m_t = \operatorname*{argmax}_{q \in Q^m_t} \left[ c(p^m_t, q) \cdot N\left(o^m_t, \mu h^m(p^m_t, q), U(p^m_t, q)\right) \right]$$

[0018] Here, $Q^m_t$ is the set of mixture components in the optimal state at time t, $c$ is the mixture weight, and $U$ is the covariance matrix. The function argmax(·) returns the value of the variable q that maximizes its argument as q ranges over $Q^m_t$. The function N(·) is the output probability for the feature parameter $o^m_t$ given the mean vector $\mu h^m(p^m_t, q)$ and the covariance matrix $U(p^m_t, q)$. Next, the speaker-normalized observation sequence $Ob^m = [ob^m_1, ob^m_2, \ldots, ob^m_{T_m}]$ is obtained according to the following equation, using the mean vectors, before and after speaker adaptation, of the mixture component $q^m_t$ in state $p^m_t$ obtained above.

(Equation 3) $$ob^m_t = o^m_t - \mu h^m(p^m_t, q^m_t) + \mu(p^m_t, q^m_t)$$

[0019] That is, the observation $o^m_t$ of the feature parameters of the speech data is speaker-normalized by subtracting the mean vector after speaker adaptation, $\mu h^m(p^m_t, q^m_t)$, and adding the mean vector before speaker adaptation, $\mu(p^m_t, q^m_t)$, so that the origin is aligned with that before speaker adaptation. FIG. 5 shows a conceptual diagram of the speaker normalization processing. The above processing is performed for every training speaker, that is, speaker by speaker, to obtain the set of speaker-normalized observation sequences $Ob = [Ob^1, Ob^2, \ldots, Ob^M]$.
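Equations 2 and 3 can be sketched together as follows. This is only an illustration under stated assumptions: the Viterbi state path is taken as given, the covariance matrices are assumed to be diagonal (the text does not say so), all mixture weights are assumed positive, and the tuple layout of `model` and all function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption here)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
        for xi, mi, v in zip(x, mean, var)
    )


def best_component(o_t, weights, adapted_means, variances):
    """Equation 2: index q maximizing c(q) * N(o_t; adapted mean, covariance)."""
    scores = [
        math.log(c) + log_gauss_diag(o_t, m, v)
        for c, m, v in zip(weights, adapted_means, variances)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def normalize_frame(o_t, adapted_mean, initial_mean):
    """Equation 3: subtract the adapted mean, add back the pre-adaptation mean."""
    return [o - mh + m for o, mh, m in zip(o_t, adapted_mean, initial_mean)]


def normalize_utterance(frames, state_path, model):
    """model[state] = (weights, initial_means, adapted_means, variances)."""
    out = []
    for o_t, p_t in zip(frames, state_path):
        weights, initial_means, adapted_means, variances = model[p_t]
        q_t = best_component(o_t, weights, adapted_means, variances)
        out.append(normalize_frame(o_t, adapted_means[q_t], initial_means[q_t]))
    return out


# Toy model: one state, two components; adaptation shifted both means by +1.
model = {
    0: (
        [0.5, 0.5],
        [[0.0, 0.0], [4.0, 4.0]],   # means before adaptation
        [[1.0, 1.0], [5.0, 5.0]],   # means after adaptation
        [[1.0, 1.0], [1.0, 1.0]],   # diagonal variances
    )
}
frames = [[1.2, 0.9], [5.0, 5.5]]
normalized = normalize_utterance(frames, [0, 0], model)
```

In the toy data each frame keeps its offset from the Gaussian it belongs to, but is re-centered on that Gaussian's mean before adaptation, which removes the speaker-specific shift.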

[0020] Next, training of the speaker-normalized model is described. First, the initial model $\lambda$ is retrained using the speaker-normalized observation sequences $Ob$. The mean $\mu b(j,k)$ and covariance matrix $Ub(j,k)$ of each Gaussian distribution are updated by the following equations.

(Equation 4)

(Equation 5)

[0021] Here,

(Equation 6)

(Equation 7)

(Equation 8)

[0022] Here, $\gamma^m_t(j,k)$ is the expected value of observing the feature parameter $ob^m_t$ in mixture component k of state j, and {·}' denotes the transpose. The transition probabilities, mixture weights, and other parameters of the HMM are updated in the same way. The updated acoustic model then replaces the initial model $\lambda$ described above, the normalization process is repeated a fixed number of times, and the model finally obtained is taken as the SN model $\lambda b$.
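Equations 4 to 8 are not reproduced in this text, so the following sketch shows only the textbook occupancy-weighted re-estimation of one Gaussian's mean and covariance, which is an assumption and may differ in detail from the patent's equations. The occupancies `gammas` stand in for $\gamma^m_t(j,k)$ and are taken as given (computing them with the forward-backward procedure is not shown); the frames would be the normalized observations $ob^m_t$ pooled over all speakers.

```python
def reestimate_gaussian(frames, gammas):
    """Occupancy-weighted mean and full covariance of one Gaussian.

    frames: list of n-dimensional observation vectors (normalized frames)
    gammas: one occupancy value per frame for this (state, component) pair
    """
    total = sum(gammas)
    if total <= 0.0:
        raise ValueError("component has zero occupancy")
    n = len(frames[0])
    mean = [
        sum(g * x[i] for g, x in zip(gammas, frames)) / total
        for i in range(n)
    ]
    cov = [
        [
            sum(g * (x[i] - mean[i]) * (x[j] - mean[j])
                for g, x in zip(gammas, frames)) / total
            for j in range(n)
        ]
        for i in range(n)
    ]
    return mean, cov


# Toy data: four frames on the corners of a square, all fully assigned.
frames = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
gammas = [1.0, 1.0, 1.0, 1.0]
mean, cov = reestimate_gaussian(frames, gammas)
```

Since the frames were re-centered speaker by speaker before this step, the covariance estimated here reflects mainly within-speaker variation, which is the stated purpose of the SN model.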

[0023] Next, creation of the speaker-normalized speaker-independent model is described. This is a method for creating, for the purpose of speaker-independent speech recognition, a speaker-normalized model in which the variation due to speaker differences is represented (the SN-SI model). The variation due to speaker differences is represented by creating a speaker-adapted model for each training speaker with the speaker-normalized model as the initial model, and then combining the Gaussian distributions.

[0024] (a) With the SN model $\lambda b$ as the initial model, an adapted model is created for each training speaker by MLLR according to the following equation, in the same way as in the MLLR processing of the speaker normalization control unit 20. The adapted parameters are the means of the Gaussian distributions.

(Equation 9) $$\mu hb^m(j,k) = Ab^m \mu b(j,k) + bb^m$$

(b) From the mean vectors $\mu hb^m(j,k)$ of the adapted models and the covariance matrix $Ub(j,k)$ of the SN model, the mean vector $\mu a(j,k)$ and the covariance matrix $Ua(j,k)$ are obtained by Equations 10 and 11, yielding the SN-SI model $\lambda a$. The transition probabilities and mixture weights take the values of the SN model.

(Equation 10)

(Equation 11)
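Equations 10 and 11 are not reproduced in this text. One plausible reading, offered purely as an assumption and not as the patent's formulas, is moment matching: the M per-speaker Gaussians, which share the SN covariance $Ub(j,k)$ and differ only in their adapted means, are treated as an equally weighted mixture and replaced by the single Gaussian with the same mean and covariance. The function name and the equal weighting of speakers are hypothetical.

```python
def pool_speakers(adapted_means, shared_cov):
    """Moment-match M equally weighted Gaussians that share one covariance.

    adapted_means: list of M per-speaker mean vectors for one (state, component)
    shared_cov:    n x n covariance of the speaker-normalized model
    Returns the pooled mean and the pooled covariance, i.e. the shared
    covariance plus the scatter of the speaker means around the pooled mean.
    """
    M = len(adapted_means)
    n = len(shared_cov)
    mean = [sum(mu[i] for mu in adapted_means) / M for i in range(n)]
    cov = [
        [
            shared_cov[i][j]
            + sum((mu[i] - mean[i]) * (mu[j] - mean[j]) for mu in adapted_means) / M
            for j in range(n)
        ]
        for i in range(n)
    ]
    return mean, cov


# Toy data: two speakers whose adapted means differ only along the first axis.
pooled_mean, pooled_cov = pool_speakers(
    [[0.0, 0.0], [2.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Under this reading the covariance grows only along directions in which the speakers actually differ, so between-speaker variation is added back on top of the within-speaker covariance of the SN model.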

[0025] In FIG. 1, the speaker normalization control unit 20, the speaker-independent conversion control unit 21, the feature extraction unit 2, the phoneme collation unit 4, and the LR parser 5 are implemented by, for example, an arithmetic and control device such as a digital computer, and the buffer memory 3 is, for example, a hard disk memory. The initial HMM 31, the feature parameter vectors of the speech data of speakers 1 to M, the speaker-normalized HMM 33, the speaker-independent HMM 11, the LR table 12, and the context-free grammar 13 are stored in, for example, a hard disk memory. The speech data 32-1 to 32-M of each speaker consist of the vectors of feature parameters extracted from that speaker's speech waveform signal, that is, the feature vectors, and the text data of the utterance content. In this specification, "speech data" refers to feature vectors.

[0026] FIG. 2 is a flowchart showing the speaker normalization process executed by the speaker normalization control unit 20 of FIG. 1. First, in step S1 of FIG. 2, the speech data 32-1 to 32-M of the speakers m are read, and the initial HMM (the initial model of the HMM) 31, which is a speaker-independent HMM, is read and set as the HMM to be processed. Next, in step S2, as shown in FIG. 4, the HMM to be processed is adaptively trained using the feature parameters O^m of the speech data 32-1 to 32-M of each speaker m as training data. Here, the MLLR method is applied with Equation 1 to map, for each speaker, the mean vector μ(j, k) of each Gaussian distribution to the adapted mean vector μh^m(j, k), which yields the adapted model λh^m of speaker m. Next, in step S3, for each speaker, the optimal state sequence p^m is computed by the Viterbi algorithm from the observation sequence O^m and the text data of its utterance content using the adapted model λh^m, and, for the optimal state at each time, the mixture component sequence q^m for which the observation sequence O^m gives the maximum output probability is computed using Equation 2. In step S4, the speaker-normalized observation sequence Ob^m is computed according to Equation 3 using the mean vectors, before and after speaker adaptation, of the mixture component q^m_t in state p^m_t. In step S5, the initial HMM is retrained on the speaker-normalized observation sequences Ob^m using the Baum-Welch training algorithm. In step S6, it is determined whether the predetermined number of iterations has been reached. If not, the retrained HMM is set as the HMM to be processed in step S7, and the process returns to step S2 and repeats the above processing. If the predetermined number of iterations (five in the preferred embodiment) has been reached in step S6, the retrained HMM is stored in memory as the speaker-normalized HMM 33 in step S8, and the speaker normalization process ends.
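Steps S3 and S4 can be illustrated for a single frame with the minimal sketch below. It rests on two assumptions that this passage does not confirm, because Equations 2 and 3 are not reproduced here: the best mixture component is chosen by unweighted Gaussian log density, and normalization subtracts the shift between the adapted and the original mean of that component. All function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v
        for xi, mu, v in zip(x, mean, var)
    )

def best_component(x, means, variances):
    """Index of the component with the highest log density at x.
    Assumption: mixture weights are ignored in this choice."""
    scores = [log_gauss_diag(x, m, v) for m, v in zip(means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)

def normalize_frame(x, adapted_means, original_means, variances):
    """Remove the speaker-specific shift from one frame.
    Assumed form of Equation 3: x - (adapted mean - original mean)."""
    k = best_component(x, adapted_means, variances)
    return [xi - (ah - mu)
            for xi, ah, mu in zip(x, adapted_means[k], original_means[k])]
```

In the process described above, the state at each time would come from the Viterbi alignment; here the components of one state are simply passed in as lists.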

[0027] FIG. 3 is a flowchart showing the unspecified speaker conversion process (conversion to a speaker-independent model) executed by the unspecified speaker conversion control unit 21 of FIG. 1. In step S11 of FIG. 3, the speaker-normalized HMM 33 and the speech data 32-1 to 32-M of the speakers are read. Next, in step S12, an adapted model is computed for each speaker by the MLLR method using Equation 9. In step S13, the speaker-independent model λb is computed using Equations 10 and 11. Finally, the speaker-independent HMM 11 is stored in memory, and the unspecified speaker conversion process ends.

[0028] The speaker-independent HMM 11 is connected to the phoneme matching unit 4 and can also be represented, as an HM network, by a network of multiple states. Each state in the HMM 11 can be regarded as one probabilistic stationary signal source in the speech space and holds the following information: (a) a state number, (b) the acceptable context class, (c) lists of preceding and succeeding states, (d) the parameters of the probability distribution assigned to the speech feature space, and (e) the self-transition probability and the transition probabilities to succeeding states. In the speaker-independent HMM 11, when input data and its context information are given, the model for the input data is uniquely determined by concatenating the states that can accept that context within the constraints of the preceding and succeeding state lists. Here, the output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices (referred to in this specification as a Gaussian distribution), and each Gaussian distribution has been speaker-normalized by the speaker normalization control unit 20 using the initial HMM 31 and then converted to a speaker-independent distribution by the unspecified speaker conversion control unit 21 using the speaker-normalized HMM 33.
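The per-state information (a) to (e) can be pictured as a record like the one sketched below. The field names, the use of plain string labels for the context class, and the two helper methods are illustrative assumptions rather than the data layout of the apparatus.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

@dataclass
class HMNetState:
    state_id: int                 # (a) state number
    contexts: FrozenSet[str]      # (b) acceptable context class (hypothetical encoding)
    predecessors: List[int]       # (c) list of preceding states
    successors: List[int]         # (c) list of succeeding states
    weights: List[float]          # (d) mixture weights
    means: List[List[float]]      # (d) mean vector per mixture component
    variances: List[List[float]]  # (d) diagonal covariance per mixture component
    self_loop_prob: float         # (e) self-transition probability
    next_probs: Dict[int, float]  # (e) transition probability per succeeding state

    def accepts(self, context: str) -> bool:
        """True if this state can accept the given context label."""
        return context in self.contexts

    def transitions_normalized(self, tol: float = 1e-9) -> bool:
        """True if the outgoing transition probabilities sum to one."""
        total = self.self_loop_prob + sum(self.next_probs.values())
        return abs(total - 1.0) < tol
```

A model for a given input would then be assembled by following `successors` only through states whose `accepts` check passes for the input's context.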

[0029] In general, when speaker adaptation is performed on a continuous-density HMM with a small amount of adaptation data, adapting the means of the Gaussian distributions is known to be more effective than adapting the other parameters (see, for example, Prior Art Document 2: Kazumi Ohkura (大倉計美) et al., "Moving Vector Field Smoothing Speaker Adaptation Method Using Mixture Continuous-Density HMMs", Proceedings of the Acoustical Society of Japan, 2-Q-17, pp. 191-192, March 1992). In this embodiment, only the mean of each Gaussian distribution is adapted; the variances, the state transition probabilities, and the weight coefficients of the Gaussian mixture are not adapted.
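As a minimal sketch of mean-only adaptation, the code below applies an affine transform to every mean vector and leaves the other parameters untouched. It assumes a single transform shared by all Gaussians and a diagonal regression matrix, which matches the experimental setup described later only in part (that setup uses 16 tying classes). The dictionary layout and names are hypothetical.

```python
def adapt_means(means, a_diag, b):
    """Return a_diag * mu + b (element-wise) for every mean vector mu."""
    return [[a * m + c for a, m, c in zip(a_diag, mu, b)] for mu in means]

def adapt_model(model, a_diag, b):
    """Adapt only the means; variances, mixture weights and transition
    probabilities are carried over unchanged."""
    adapted = dict(model)
    adapted["means"] = adapt_means(model["means"], a_diag, b)
    return adapted
```

Restricting the update to the means keeps the number of estimated parameters small, which is the point made above about small amounts of adaptation data.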

[0030] Next, an SSS-LR (left-to-right rightmost) speaker-independent continuous speech recognition apparatus that uses the speaker normalization method and speaker adaptation method of this embodiment is described. This apparatus uses an efficient phoneme-environment-dependent HMM representation stored in the memory of the HM network that includes the HMM 11. In SSS, a probabilistic model represents the temporal course of speech parameters by probabilistic transitions between probabilistic stationary signal sources (states) assigned to the phoneme feature space. The model is refined successively by repeatedly splitting individual states in the context direction or the time direction on the basis of a likelihood maximization criterion.

[0031] In FIG. 1, a speaker's uttered speech is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of the extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3.
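The 34 dimensions follow from 1 + 16 + 1 + 16. The sketch below assembles one such vector. The ordering of the components and the use of a simple first difference for the Δ terms are assumptions, since this passage does not state how the Δ parameters are computed.

```python
CEPSTRUM_ORDER = 16

def feature_vector(log_power, cepstrum, prev_log_power, prev_cepstrum):
    """Build [log power, 16 cepstra, delta log power, 16 delta cepstra]."""
    if len(cepstrum) != CEPSTRUM_ORDER or len(prev_cepstrum) != CEPSTRUM_ORDER:
        raise ValueError("expected 16 cepstrum coefficients per frame")
    # Assumption: deltas are plain differences against the previous frame.
    delta_power = log_power - prev_log_power
    delta_cepstrum = [c - p for c, p in zip(cepstrum, prev_cepstrum)]
    return [log_power] + list(cepstrum) + [delta_power] + delta_cepstrum
```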

[0032] The phoneme matching unit 4 executes phoneme matching in response to a phoneme matching request from the phoneme-context-dependent LR parser 5. The likelihood of the data within the phoneme matching interval is computed using the speaker model of the phoneme HMM stored in the memory of the speaker-adapted HMM 11, and this likelihood value is returned to the LR parser 5 as the phoneme matching score. The forward-pass algorithm is used for this computation.
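A forward-pass likelihood can be sketched in the log domain as follows. This is a generic textbook formulation rather than the apparatus's implementation: it takes precomputed per-frame log output probabilities as input and, as a simplifying assumption, sums over all states at the final frame instead of requiring a designated final state.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(log_init, log_trans, log_obs):
    """log_init[i]: log initial probability of state i.
    log_trans[i][j]: log transition probability from i to j.
    log_obs[t][j]: log output probability of frame t in state j.
    Returns the log likelihood of the whole frame sequence."""
    n = len(log_init)
    alpha = [log_init[i] + log_obs[0][i] for i in range(n)]
    for t in range(1, len(log_obs)):
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + log_obs[t][j]
            for j in range(n)
        ]
    return logsumexp(alpha)
```

Working in the log domain avoids underflow on long phoneme matching intervals, where products of many small probabilities would otherwise round to zero.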

[0033] Meanwhile, a predetermined context-free grammar (CFG) in the context-free grammar database 13 is automatically converted, in a known manner, into the LR table 12, which is stored in its memory. The LR parser 5 refers to the LR table 12 and processes the input phoneme prediction data from left to right without backtracking. When there is syntactic ambiguity, the stack is split and all candidate analyses are processed in parallel. The LR parser 5 predicts the next phoneme from the LR table 12 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching by referring to the information in the HMM 11 corresponding to that phoneme and returns the likelihood to the LR parser 5 as a speech recognition score. Continuous speech is recognized by concatenating phonemes in this way. When multiple phonemes are predicted during recognition, the presence of all of them is checked, and high-speed processing is achieved by beam-search pruning that keeps only the partial trees with a high partial recognition likelihood.
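The predict, score, and prune cycle described above can be sketched as one beam-search step. The `predictions` callback is a hypothetical stand-in for the combination of LR-table prediction and phoneme matching score; the beam width and the tuple layout are likewise assumptions.

```python
import heapq

def extend_and_prune(hypotheses, predictions, beam_width):
    """Extend every partial hypothesis by each predicted phoneme and keep
    only the beam_width candidates with the highest accumulated score.

    hypotheses: list of (phoneme_tuple, log_score).
    predictions: callable mapping a phoneme tuple to a list of
                 (next_phoneme, log_likelihood) pairs (hypothetical).
    """
    candidates = []
    for phonemes, score in hypotheses:
        for phoneme, log_likelihood in predictions(phonemes):
            candidates.append((phonemes + (phoneme,), score + log_likelihood))
    return heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
```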

【００３４】[0034]

[Examples]

The present inventor conducted the following evaluation experiments on the speech recognition apparatus configured as described above. As the evaluation, a continuous word recognition experiment was carried out under the experimental conditions listed in Table 1.

【００３５】[0035]

[Table 1] Experimental conditions
───────────────────────────────────
Acoustic analysis
───────────────────────────────────
Sampling frequency: 12 kHz
Frame shift: 10 ms
Frame length: 20 ms (Hamming window)
Feature vector: 16th-order cepstrum coefficients, 16th-order Δ cepstrum coefficients, log power, Δ power
───────────────────────────────────
Speech data
───────────────────────────────────
Travel conversation task
Training: 99 male speakers (13K words, 1,237 utterances), 131 female speakers (20K words, 1,725 utterances)
Evaluation: 16 male speakers (2,102 words, 196 utterances), 19 female speakers (2,844 words, 244 utterances)
───────────────────────────────────
HMM
───────────────────────────────────
HMnet created by ML-SSS (5 mixtures/state) + 1-state (10 mixtures) silence model
───────────────────────────────────
Language model
───────────────────────────────────
Variable-length N-gram
Training: 414,326 words (6,396 distinct words), perplexity: 19.34
───────────────────────────────────
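The acoustic analysis settings in Table 1 translate into sample counts as in the sketch below: at 12 kHz, a 10 ms shift is 120 samples and a 20 ms frame is 240 samples. The frame-count rule (no padding of the final partial frame) and the helper names are assumptions.

```python
import math

SAMPLE_RATE_HZ = 12_000  # from Table 1

def ms_to_samples(ms):
    """Convert a duration in milliseconds to a number of samples."""
    return SAMPLE_RATE_HZ * ms // 1000

def hamming(n):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]

def num_frames(num_samples, frame_len, frame_shift):
    """Number of full frames that fit (assumption: no padding)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift
```

Because the frame length is twice the shift, consecutive frames overlap by half.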

[0036] The speech data were travel conversation speech recorded by the present applicant. A male speaker model, a female speaker model, and a gender-independent speaker model were created from the data of 99 male and 131 female speakers and evaluated. The structure of the acoustic model was also determined from the above speech data: the speech model used an HMnet (5 mixtures/state) determined by the known ML-SSS (maximum likelihood successive state splitting), and the silence model had one state (10 mixtures). The MLLR method used to create the SN-SI model used 16 tying classes (determined by clustering the Gaussian distributions of the speaker-independent model on the basis of the known Kullback divergence), and the diagonal components of the regression matrix A and the constant term b were estimated. The speaker normalization process was iterated five times, with parameter estimation by the Baum-Welch training algorithm. The language model was a known variable-length N-gram (see, for example, Prior Art Document 3: H. Masataki et al., "Variable-Order N-Gram Generation by Word-Class Splitting and Consecutive Word Grouping", Proceedings of ICASSP'96, pp. 188-191, 1996), and the recognition results were evaluated on the top candidate of a beam search using a word graph.
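The distance that could drive such a clustering can be sketched with the closed-form Kullback-Leibler divergence between two diagonal-covariance Gaussians. Whether the experiments used the one-sided or a symmetrized divergence, and which clustering algorithm was applied, is not stated above; the symmetric variant below is an assumption.

```python
import math

def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariance matrices."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )

def symmetric_kl(mean_p, var_p, mean_q, var_q):
    """Symmetrized divergence, usable as a clustering distance (assumption)."""
    return (kl_diag_gauss(mean_p, var_p, mean_q, var_q)
            + kl_diag_gauss(mean_q, var_q, mean_p, var_p))
```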

[0037] Next, the experimental results are described. Using the SN-SI model, continuous speech recognition experiments were performed on 16 male and 19 female speakers (open with respect to the acoustic model, closed with respect to the language model). Table 2 shows the results, evaluated by word accuracy, for HMMs with 401, 601, 801, and 1001 states. For comparison, the recognition results of the SN model and the conventional speaker-independent (SI) model are also shown.

【００３８】[0038]

[Table 2] Continuous word recognition results: word error rate (%)
───────────────────────────────────
                                         Number of HMM states (number of distributions)
Model                             Type   401 (2010)  601 (3010)  801 (4010)  1001 (5010)
───────────────────────────────────
Male speaker model                SN-SI  44.3        42.6        43.9        40.4
                                  SN     48.1        49.6        50.8        44.9
                                  SI     45.4        46.8        45.7        41.2
───────────────────────────────────
Female speaker model              SN-SI  29.9        28.7        30.3        28.6
                                  SN     32.9        32.1        35.0        35.0
                                  SI     31.6        30.3        32.9        31.5
───────────────────────────────────
Gender-independent speaker model  SN-SI  36.0        31.9        33.8        33.5
                                  SN     38.9        35.5        37.3        38.5
                                  SI     37.7        33.8        34.7        35.4
───────────────────────────────────
(Note) Upper row: speaker-normalized speaker-independent (SN-SI) model. Middle row: speaker-normalized (SN) model. Lower row: speaker-independent (SI) model.

[0039] As is clear from Table 2, the SN-SI model gave better recognition results than the conventional SI model for every type of acoustic model. This shows that it is effective to first learn the phonetic variation within speakers by speaker normalization and then estimate the variation due to speaker differences. The SN model has a lower recognition rate than the conventional SI model. This can be understood as follows: because the SN model does not contain the variation due to speaker differences, it performs poorly in speaker-independent speech recognition.
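The comparison can be checked directly against the Table 2 figures, as in the short sketch below. It treats the values as word error rates, following the table caption; the relative-reduction helper is an added illustration and not a metric reported in the experiments.

```python
STATES = [401, 601, 801, 1001]

# Word error rates (%) transcribed from Table 2.
WER = {
    "male": {"SN-SI": [44.3, 42.6, 43.9, 40.4],
             "SN": [48.1, 49.6, 50.8, 44.9],
             "SI": [45.4, 46.8, 45.7, 41.2]},
    "female": {"SN-SI": [29.9, 28.7, 30.3, 28.6],
               "SN": [32.9, 32.1, 35.0, 35.0],
               "SI": [31.6, 30.3, 32.9, 31.5]},
    "gender-independent": {"SN-SI": [36.0, 31.9, 33.8, 33.5],
                           "SN": [38.9, 35.5, 37.3, 38.5],
                           "SI": [37.7, 33.8, 34.7, 35.4]},
}

def always_lower(better, worse):
    """True if `better` has a lower error rate than `worse` in every cell."""
    return all(a < b
               for rows in WER.values()
               for a, b in zip(rows[better], rows[worse]))

def relative_reduction(baseline, proposed):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - proposed) / baseline
```

With these figures, `always_lower("SN-SI", "SI")` and `always_lower("SI", "SN")` both hold, which matches the ordering described above.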

[0040] As described above, according to this embodiment, to build a speaker-independent model from spontaneous speech in which utterance content and speaking style differ from speaker to speaker, the speaker normalization process is first applied to learn the phonetic variation within speakers, and the unspecified speaker conversion process is then performed by retraining on the variation due to speaker differences. That is, training uses the feature parameters of the speaker-normalized speech data to generate a speaker-normalized model, after which the variation due to differences in speaker characteristics is learned. This reduces the adverse effect of bias in the training speech data, and speech recognition using the resulting speaker-independent model achieves a significantly higher recognition rate than the prior art.

【００４１】[0041]

[Effects of the invention]

As described in detail above, the unspecified speaker acoustic model generation apparatus according to the present invention comprises: first computing means for obtaining a hidden Markov model adapted to each speaker by computing, for each speaker, a first transformation coefficient, which includes a transformation matrix and a constant term vector for transforming mean vectors on the basis of a multiple regression mapping model, by the maximum likelihood linear regression method applied to an initial model of a predetermined hidden Markov model on the basis of feature vectors of speech data each dependent on one of a plurality of speakers; second computing means for computing, on the basis of the hidden Markov model adapted to each speaker obtained by the first computing means, an optimal state sequence from the speech data and the text data of its utterance content using the Viterbi algorithm, and for computing, for the optimal state at each time, the mixture component sequence for which the feature vector of the speech data gives the maximum output probability; third computing means for computing speaker-normalized feature vectors of the speech data by speaker-normalizing the feature vectors of the speech data using the mean vectors, before and after speaker adaptation, of the mixture component in each state of the optimal state sequence computed by the second computing means; fourth computing means for computing the model parameters of a speaker-normalized hidden Markov model by training the initial model of the hidden Markov model with a predetermined training algorithm on the basis of the normalized feature vectors of the speech data computed by the third computing means; fifth computing means for obtaining the mean vectors of a hidden Markov model adapted to each speaker by computing, for each speaker, a second transformation coefficient, which includes a transformation matrix and a constant term vector for transforming mean vectors on the basis of the multiple regression mapping model, by the maximum likelihood linear regression method applied to the speaker-normalized hidden Markov model computed by the fourth computing means; and sixth computing means for obtaining a speaker-independent hidden Markov model by performing unspecified speaker conversion on the basis of the mean vectors of the adapted hidden Markov models obtained by the fifth computing means and the model parameters of the speaker-normalized hidden Markov model computed by the fourth computing means, thereby computing the mean vectors and covariance matrices of the speaker-independent hidden Markov model. Accordingly, to build a speaker-independent model from spontaneous speech in which utterance content and speaking style differ from speaker to speaker, the speaker normalization process is first applied to learn the phonetic variation within speakers, and the unspecified speaker conversion process is then performed by retraining on the variation due to speaker differences. That is, training uses the feature parameters of the speaker-normalized speech data to generate a speaker-normalized model, after which the variation due to differences in speaker characteristics is learned. This reduces the adverse effect of bias in the training speech data, and speech recognition using the resulting speaker-independent model achieves a significantly higher recognition rate than the prior art.

[0042] The speech recognition apparatus according to claim 2 of the present invention comprises speech recognition means for performing speech recognition on the speech signal of an input uttered sentence, using the hidden Markov model computed by the sixth computing means of the unspecified speaker acoustic model generation apparatus, and for outputting a speech recognition result. Accordingly, training uses the feature parameters of the speaker-normalized speech data to generate a speaker-normalized model, after which the variation due to differences in speaker characteristics is learned. This reduces the adverse effect of bias in the training speech data, and speech recognition using the resulting speaker-independent model achieves a significantly higher recognition rate than the prior art.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the speaker normalization process executed by the speaker normalization control unit of FIG. 1.

FIG. 3 is a flowchart showing the unspecified speaker conversion process executed by the unspecified speaker conversion control unit of FIG. 1.

FIG. 4 is a diagram showing the MLLR processing executed by the speaker normalization control unit of FIG. 1.

FIG. 5 is a diagram showing the speaker normalization processing executed by the speaker normalization control unit of FIG. 1.

FIG. 6 shows the output probability distributions of the models prepared or generated by the apparatus of FIG. 1: (a) shows the output probability distribution of the speaker-independent model, (b) shows the output probability distribution of the speaker-normalized model, and (c) shows the output probability distribution of the speaker-normalized speaker-independent model.

[Explanation of symbols]

1: microphone, 2: feature extraction unit, 3: buffer memory, 4: phoneme matching unit, 5: LR parser, 11: speaker-independent HMM, 12: LR table, 13: context-free grammar (CFG) database, 20: speaker normalization control unit, 21: unspecified speaker conversion control unit, 31: initial HMM, 32-1 to 32-M: speech data of speakers 1 to M, 33: speaker-normalized HMM.

Claims

[Claims]

1. An unspecified speaker acoustic model generation apparatus comprising:
first computing means for obtaining a hidden Markov model adapted to each speaker by computing, for each speaker, a first transformation coefficient, which includes a transformation matrix and a constant term vector for transforming mean vectors on the basis of a multiple regression mapping model, by the maximum likelihood linear regression method applied to an initial model of a predetermined hidden Markov model on the basis of feature vectors of speech data each dependent on one of a plurality of speakers;
second computing means for computing, on the basis of the hidden Markov model adapted to each speaker obtained by the first computing means, an optimal state sequence from the speech data and the text data of its utterance content using the Viterbi algorithm, and for computing, for the optimal state at each time, the mixture component sequence for which the feature vector of the speech data gives the maximum output probability;
third computing means for computing speaker-normalized feature vectors of the speech data by speaker-normalizing the feature vectors of the speech data using the mean vectors, before and after speaker adaptation, of the mixture component in each state of the optimal state sequence computed by the second computing means;
fourth computing means for computing the model parameters of a speaker-normalized hidden Markov model by training the initial model of the hidden Markov model with a predetermined training algorithm on the basis of the normalized feature vectors of the speech data computed by the third computing means;
fifth computing means for obtaining the mean vectors of a hidden Markov model adapted to each speaker by computing, for each speaker, a second transformation coefficient, which includes a transformation matrix and a constant term vector for transforming mean vectors on the basis of the multiple regression mapping model, by the maximum likelihood linear regression method applied to the speaker-normalized hidden Markov model computed by the fourth computing means; and
sixth computing means for obtaining a speaker-independent hidden Markov model by performing unspecified speaker conversion on the basis of the mean vectors of the adapted hidden Markov models obtained by the fifth computing means and the model parameters of the speaker-normalized hidden Markov model computed by the fourth computing means, thereby computing the mean vectors and covariance matrices of the speaker-independent hidden Markov model.

2. A speech recognition apparatus comprising speech recognition means for performing speech recognition on the speech signal of an input uttered sentence, using the hidden Markov model computed by the sixth computing means of the unspecified speaker acoustic model generation apparatus according to claim 1, and for outputting a speech recognition result.