JP3035239B2

JP3035239B2 - Speaker normalization device, speaker adaptation device, and speech recognition device

Info

Publication number: JP3035239B2
Application number: JP9054596A
Authority: JP
Inventors: 純石井; 政啓外村
Original assignee: 株式会社エイ・ティ・アール音声翻訳通信研究所
Priority date: 1997-03-10
Filing date: 1997-03-10
Publication date: 2000-04-24
Anticipated expiration: 2017-03-10
Also published as: JPH10254485A

Abstract

PROBLEM TO BE SOLVED: To provide a speaker normalization device, a speaker adaptation device, and a speech recognition device that estimate adaptation parameters more accurately than conventional devices and thereby improve the speech recognition rate. SOLUTION: A speaker normalization control section 20 computes, for each speaker, transformation coefficients comprising a transformation matrix and a constant term vector. The coefficients transform the mean vectors of an initial hidden Markov model (HMM) 31 according to a multiple regression mapping model, and are estimated by maximum likelihood linear regression from speech data that depend on a plurality of speakers. The constant term vector is subtracted from the speech data to compute normalized speech data, and the initial HMM is trained on the normalized data to obtain a speaker-normalized HMM. A speaker adaptation control section 21 computes transformation coefficients for the speaker-normalized HMM 33 from speaker adaptation training data. It then computes speaker-adapted transformation coefficients by MAP estimation and applies the resulting linear transformation to compute the mean vectors of the HMM after speaker adaptation, yielding a speaker-adapted HMM 11.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to: a speaker normalization device that creates a speaker-normalized hidden Markov model (hereinafter, HMM) by performing speaker normalization on an initial speaker model using feature parameters of speaker-dependent speech data; a speaker adaptation device that creates a speaker-adapted HMM by performing speaker adaptation on the speaker-normalized HMM using speaker adaptation training data; and a speech recognition device that recognizes speech using the speaker-normalized or speaker-adapted HMM.

[0002]

[Prior Art] In speech recognition applications, there is strong demand for speaker-independent speech recognition systems that can be used without prior speaker enrollment. However, speaker-independent recognition currently performs worse than speaker-dependent recognition, with an error rate roughly two to three times higher. To improve speaker-independent recognition, speaker adaptation, which uses a small amount of adaptation data uttered by a specific speaker to move the speaker-independent acoustic model toward that speaker, has been studied (see, for example, Prior Art Document 1: C. L. Leggetter et al., "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models", Computer Speech and Language, Vol. 9, pp. 171-185, 1995). Nevertheless, a large amount of adaptation training data is still required before performance becomes equivalent to that of speaker-dependent recognition.

[0003]

[Problems to Be Solved by the Invention] In general, a speaker-independent HMM (hereinafter, SI-HMM) is trained on speech data from a plurality of speakers. Although the training data mixes differences between speakers with differences in the circumstances (context) in which each training unit occurs, it is processed in the same way as when training an acoustic model for speaker-dependent recognition, that is, a speaker-dependent HMM (hereinafter, SD-HMM). As a result, variation due to speaker differences and variation due to phonemic context are both mixed into the SI-HMM, which becomes a model with a large spread. This is considered one cause of degraded discrimination performance. In a speech recognition system based on continuous mixture density HMMs, the effect appears as increased variance of the Gaussian distributions, which causes recognition units to overlap and makes discrimination difficult.

[0004] In particular, when speaker adaptation is performed with the conventional multiple regression mapping model disclosed in Prior Art Document 1 and only a small amount of adaptation training data is available, the adaptation parameters are estimated with relatively poor accuracy and the speech recognition rate is relatively low.

[0005] An object of the present invention is to solve the above problems and to provide a speaker normalization device, a speaker adaptation device, and a speech recognition device that can improve the estimation accuracy of the adaptation parameters compared with the conventional example and can also improve the speech recognition rate.

[0006]

[Means for Solving the Problems] The speaker normalization device according to claim 1 of the present invention comprises: a storage device that stores feature vectors of speech data, each dependent on one of a plurality of speakers, as training data for training an initial model of a predetermined hidden Markov model; first calculating means for calculating, for each speaker, based on the feature vectors of the speech data stored in the storage device, first transformation coefficients for the initial model of the hidden Markov model by maximum likelihood linear regression, the first transformation coefficients including a transformation matrix for transforming mean vectors based on a multiple regression mapping model and a constant term vector representing individual differences common across the spectrum; second calculating means for calculating feature vectors of normalized speech data by subtracting, for each speaker, the constant term vector calculated by the first calculating means from the feature vectors of the speech data stored in the storage device; and third calculating means for calculating model parameters of a speaker-normalized hidden Markov model by training the initial model of the hidden Markov model with a predetermined training algorithm, based on the feature vectors of the normalized speech data calculated by the second calculating means.

[0007] The speaker adaptation device according to claim 2 of the present invention comprises: fourth calculating means for calculating, based on feature vectors of speech data of the speaker to be adapted to, second transformation coefficients for the hidden Markov model calculated by the third calculating means of the speaker normalization device according to claim 1, by maximum likelihood linear regression, the second transformation coefficients including a transformation matrix for transforming mean vectors based on a multiple regression mapping model and a constant term vector; fifth calculating means for calculating, by maximum a posteriori probability estimation and based on the second transformation coefficients calculated by the fourth calculating means, third transformation coefficients including a transformation matrix for transforming mean vectors based on a speaker-adapted multiple regression mapping model and a constant term vector; and sixth calculating means for calculating the mean vectors of the hidden Markov model after speaker adaptation by executing a predetermined linear transformation on the third transformation coefficients calculated by the fifth calculating means.

[0008] The speech recognition device according to claim 3 comprises speech recognition means for recognizing speech based on the speech signal of an input uttered sentence, using the hidden Markov model calculated by the third calculating means of the speaker normalization device according to claim 1, and for outputting a speech recognition result.

[0009] The speech recognition device according to claim 4 comprises speech recognition means for recognizing speech based on the speech signal of an input uttered sentence, using a hidden Markov model that includes the mean vectors calculated by the sixth calculating means of the speaker adaptation device according to claim 2, and for outputting a speech recognition result.

[0010]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to one embodiment of the present invention. This embodiment is characterized by comprising a speaker normalization control unit 20 and a speaker adaptation control unit 21.

[0011] The speaker normalization control unit 20: (a) based on the feature vectors of speech data 32-1 to 32-M, each dependent on one of $M$ speakers, calculates for each speaker $m$ ($m = 1, 2, \ldots, M$), using Equations 6 to 11 described later, first transformation coefficients $A_c^{(m)}, b_c^{(m)}$ for the initial model of a predetermined HMM (hereinafter, initial HMM) 31 by maximum likelihood linear regression, the coefficients including a transformation matrix for transforming mean vectors based on the multiple regression mapping model and a constant term vector; (b) using Equation 12 described later, calculates feature vectors $\hat{o}_t$ of normalized speech data by subtracting, for each speaker $m$, the calculated constant term vector $b_c^{(m)}$ from the feature vectors $o_t^{(m)}$ of the speech data 32-1 to 32-M; and (c) calculates the model parameters of a speaker-normalized HMM 33 by training the initial model 31 of the hidden Markov model with a predetermined training algorithm, based on the calculated feature vectors $\hat{o}_t$ of the normalized speech data. Here, the model parameters include HMM parameters such as the mean vectors, the variances of the Gaussian distributions, and the state transition probabilities.

[0012] The speaker adaptation control unit 21: (d) based on the feature vectors of speaker adaptation training data 34, which is the speech data used for speaker adaptation, calculates, using Equation 6 described later, second transformation coefficients $A_c, b_c$ for the speaker-normalized HMM 33 calculated by the speaker normalization device 20 by maximum likelihood linear regression, the coefficients including a transformation matrix for transforming mean vectors based on the multiple regression mapping model and a constant term vector; (e) based on the calculated second transformation coefficients $A_c, b_c$, calculates by maximum a posteriori probability estimation, using Equations 14 and 15 described later, third transformation coefficients $A_{c,k}^{MAP}, b_{c,k}^{MAP}$ including a transformation matrix for transforming mean vectors based on the speaker-adapted multiple regression mapping model and a constant term vector; and (f) calculates the mean vectors $\hat{\mu}_k^{MAP}$ of the HMM after speaker adaptation by executing a predetermined linear transformation on the calculated third transformation coefficients $A_{c,k}^{MAP}, b_{c,k}^{MAP}$ using Equation 13 described later.

[0013] The speech recognition device of FIG. 1 recognizes speech based on the speech signal of an input uttered sentence using the speaker-adapted HMM 11 and outputs a speech recognition result. Alternatively, it may recognize speech using the speaker-normalized HMM 33 and output the speech recognition result.

[0014] In the embodiment of the present invention, we studied generating an acoustic model with a speaker normalization method that removes speaker characteristics. Speaker normalization reduces the spread of the model, so improved discrimination performance can be expected. Moreover, if such normalization yields a model whose remaining variation can be regarded as mainly due to phonemic context, that model should also be effective as an initial model for speaker adaptation. The normalization uses the constant term of the multiple regression mapping model. The constant term is considered to represent individual differences common to a wide range of the spectrum, such as the overall shape of the glottal source spectrum and channel characteristics. Treating the constant term as an individual difference vector, normalization is performed by subtracting it from the training data. Furthermore, speaker adaptation that takes as its initial model the speaker-normalized HMM trained on the normalized speech data is performed with a method that combines speaker adaptation by the multiple regression mapping model with maximum a posteriori probability estimation (hereinafter, MAP estimation).

[0015] First, the multiple regression mapping model used in this embodiment is described. Speaker adaptation with the multiple regression mapping model is performed by transforming the mean vector $\mu_k$ (of dimension $n$) of the $k$-th Gaussian distribution of the initial model into the mean vector $\hat{\mu}_k$ of the speaker-adapted model according to the following equation.

[0016]

[Equation 1] $\hat{\mu}_k = A_c \mu_k + b_c$

[0017] Here, $A_c$ is an $n \times n$ transformation matrix and $b_c$ is an $n$-dimensional constant term vector; both are obtained for each class $\Omega_c$ of tied Gaussian distributions. The following describes estimation by maximum likelihood linear regression (hereinafter, the MLLR method; see, for example, Prior Art Document 1), which estimates the transformation coefficients $A_c, b_c$ under the maximum likelihood criterion with respect to the adaptation training data. In the MLLR method, the probability density function $b_k(o_t)$ of observing the input vector $o_t$ in the $k$-th Gaussian distribution (hereinafter, Gaussian distribution $k$) at time $t$ is assumed to be as follows.

[0018]

[Equation 2] $b_k(o_t) = \dfrac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left[ -\dfrac{1}{2} \{o_t - (A_c \mu_k + b_c)\}' \, \Sigma_k^{-1} \, \{o_t - (A_c \mu_k + b_c)\} \right]$
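Equations 1 and 2 can be checked numerically with a few lines of NumPy. The sketch below is illustrative only: the function names `transform_mean` and `log_density` are hypothetical, the covariance is passed as a vector of diagonal variances, and the density is returned in the log domain for numerical stability.

```python
import numpy as np

def transform_mean(A_c, b_c, mu_k):
    """Equation 1: adapted mean = A_c @ mu_k + b_c."""
    return A_c @ mu_k + b_c

def log_density(o_t, mu_k, var_k, A_c, b_c):
    """Natural log of Equation 2, assuming a diagonal covariance.

    var_k holds the diagonal entries of Sigma_k, so |Sigma_k| is their
    product and the quadratic form reduces to a weighted sum of squares.
    """
    n = o_t.shape[0]
    diff = o_t - transform_mean(A_c, b_c, mu_k)
    log_norm = -0.5 * (n * np.log(2.0 * np.pi) + np.sum(np.log(var_k)))
    return log_norm - 0.5 * np.sum(diff * diff / var_k)
```

With an identity transformation ($A_c = I$, $b_c = 0$) the function reduces to the ordinary Gaussian log density of the unadapted model.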

[0019] Here, $\Sigma_k$ is the diagonal covariance matrix $\mathrm{diag}[\sigma_{k1}^2, \sigma_{k2}^2, \ldots, \sigma_{kn}^2]$, the prime (') denotes transposition, and $\Sigma_k^{-1}$ denotes the inverse of the matrix $\Sigma_k$. The transformation coefficients are obtained by maximizing Baum's auxiliary function, given by the following equation.

[0020]

[Equation 3]

[0021] Here, $O$ denotes the sequence of feature vectors $(o_1, o_2, \ldots, o_T)$ of adaptation data of $T$ frames, and $\lambda$ and $\bar{\lambda}$ are the model parameters before and after adaptation. $\theta$ is a state sequence $(\theta_1, \theta_2, \ldots, \theta_T)$, and $\Theta$ denotes the set of all possible state sequences. $F(O, \theta \mid \lambda)$ and $F(O, \theta \mid \bar{\lambda})$ are the likelihoods for state sequence $\theta$ before and after adaptation, respectively.
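The body of Equation 3 is not reproduced in this text. For orientation only, the standard form of Baum's auxiliary function, written with the symbols defined in the preceding paragraph, is sketched below; that the patent's Equation 3 has exactly this form is an assumption.

```latex
Q(\lambda, \bar{\lambda}) = \sum_{\theta \in \Theta} F(O, \theta \mid \lambda) \, \log F(O, \theta \mid \bar{\lambda})
```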

[0022] The transformation coefficients $A_c, b_c$ at which the auxiliary function attains its maximum are obtained by partially differentiating the auxiliary function with respect to $A_c$ and $b_c$ and setting the derivatives to zero over the tied class $\Omega_c$, as in the following equations.

[0023]

[Equation 4]

[Equation 5]

[0024] Here, $\gamma_k(t)$ is the expected value of observing the input vector in Gaussian distribution $k$ at time $t$, and $\mu_k'$ is the transpose of the mean vector $\mu_k$. From Equations 4 and 5, the elements $a_{cp,i}$ of the $p$-th row of the transformation matrix $A_c$ and the $p$-th element $b_{cp}$ of the constant term $b_c$ are therefore given by the following equation.

[0025]

[Equation 6]

[0026] Here,

[Equation 7]

[Equation 8]

[Equation 9]

[Equation 10]

[Equation 11]

[0027] Here, $\mu_{ki}$ is the $i$-th element of the mean vector, $\sigma_{kp}$ is the $(p, p)$ element of the diagonal covariance matrix, and $o_{tp}$ is the $p$-th element of the input vector at time $t$. This concludes the description of the multiple regression mapping model.
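Because the bodies of Equations 4 to 11 are not reproduced in this text, the following is a minimal sketch of the usual row-by-row MLLR solution for diagonal covariances, not a transcription of the patent's formulas. It assumes that the occupation values $\gamma_k(t)$ have already been computed (for example by the forward-backward algorithm) and that all Gaussians passed in belong to one tied class $\Omega_c$. The function name `estimate_mllr` and the array layout are hypothetical.

```python
import numpy as np

def estimate_mllr(means, variances, gammas, observations):
    """Estimate (A_c, b_c) for one tied class of Gaussians.

    means:        (K, n) mean vectors mu_k
    variances:    (K, n) diagonal variances of Sigma_k
    gammas:       (T, K) occupation values gamma_k(t)
    observations: (T, n) input vectors o_t
    """
    K, n = means.shape
    # Extended mean xi_k = [1, mu_k], so that W @ xi_k = b_c + A_c @ mu_k.
    xi = np.hstack([np.ones((K, 1)), means])       # (K, n+1)
    occ = gammas.sum(axis=0)                       # (K,)   sum_t gamma_k(t)
    weighted_obs = gammas.T @ observations         # (K, n) sum_t gamma_k(t) o_t
    W = np.zeros((n, n + 1))
    for p in range(n):
        inv_var = 1.0 / variances[:, p]
        # Normal equations for row p: G w_p = z, obtained by setting the
        # derivative of the weighted squared error to zero.
        G = (xi * (occ * inv_var)[:, None]).T @ xi
        z = xi.T @ (weighted_obs[:, p] * inv_var)
        W[p] = np.linalg.solve(G, z)
    return W[:, 1:], W[:, 0]                       # A_c, b_c
```

The linear system for each row is solvable only when the class contains enough Gaussians with nonzero occupancy (at least $n + 1$ in general position); tying Gaussians into classes $\Omega_c$ is what makes this feasible with little adaptation data.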

[0028] Next, creation of an acoustic model by speaker normalization using the multiple regression mapping model is described. The constant term $b_c$ of the multiple regression mapping model is considered to represent individual differences common to a wide range of the spectrum, such as the overall shape of the glottal source spectrum and channel characteristics. In this embodiment, therefore, the constant term $b_c$ is assumed to be an individual difference vector and is used for speaker normalization. FIGS. 4 and 5 are conceptual diagrams of the invented speaker normalization method. FIG. 2 is a flowchart of the speaker normalization process, executed by the speaker normalization control unit 20 of FIG. 1, that creates a speaker-normalized model from the speech data of $M$ speakers, and FIG. 7 is the corresponding block diagram. In FIG. 1, the speaker normalization control unit 20, the speaker adaptation control unit 21, the feature extraction unit 2, the phoneme matching unit 4, and the LR parser 5 are implemented by, for example, an arithmetic and control device such as a digital computer. The buffer memory 3 is, for example, a hard disk memory, and the initial HMM 31, the feature parameter vectors of the speech data of speakers 1 to M, the speaker-normalized HMM 33, the speaker adaptation training data 34, the speaker-adapted HMM 11, the LR table 12, and the context-free grammar 13 are stored in, for example, a hard disk memory. The speech data 32-1 to 32-M of each speaker are vectors of feature parameters extracted from that speaker's speech waveform signal, that is, feature vectors. In this specification, speech data means feature vectors. The procedure for creating the speaker-normalized model is described below with reference to FIGS. 2 and 7.

[0029] Referring to FIGS. 1, 2, and 7, first, in step S1 of FIG. 2, the initial HMM (initial model of the HMM) 31, which is a speaker-independent HMM, is read out and taken as the HMM to be processed. Next, in step S2, as shown in FIG. 4, the transformation coefficients $A_c^{(m)}, b_c^{(m)}$, $m = 1, 2, \ldots, M$, of the multiple regression mapping model are calculated for each of speakers 1 to M by applying the MLLR method to the HMM being processed, using Equations 6 to 11. Then, in step S3, as shown in FIG. 5, the normalized speech data $\hat{o}_t$ is calculated using Equation 12, by subtracting the constant term vector $b_c^{(m)}$ of the multiple regression mapping model from the speech data $o_t^{(m)}$ 32-1 to 32-M of each of speakers 1 to M.

[Equation 12] $\hat{o}_t = o_t^{(m)} - b_c^{(m)}, \quad 1 \le m \le M$
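Equation 12 amounts to a per-speaker offset subtraction. The sketch below illustrates it under a simplifying assumption that each speaker has a single constant term vector (one tied class); the function name `normalize_speakers` and the list-of-arrays layout are hypothetical.

```python
import numpy as np

def normalize_speakers(features_by_speaker, offsets_by_speaker):
    """Equation 12: subtract b_c^(m) from every frame o_t^(m) of speaker m.

    features_by_speaker: list of (T_m, n) arrays, one per speaker
    offsets_by_speaker:  list of (n,) constant term vectors, one per speaker
    """
    return [feats - b for feats, b in zip(features_by_speaker, offsets_by_speaker)]

# Two toy speakers with two-dimensional feature vectors.
feats = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([[10.0, 10.0]])]
offsets = [np.array([1.0, 1.0]), np.array([4.0, 5.0])]
normalized = normalize_speakers(feats, offsets)
```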

[0030] Next, in step S4, the HMM is retrained on the normalized speech data $\hat{o}_t$, which is accompanied by text data, using the Baum-Welch training algorithm. In step S5, it is determined whether a predetermined number of iterations has been reached. If not, in step S6 the retrained HMM is taken as the HMM to be processed, and the process returns to step S2 and repeats the above steps. If the predetermined number of iterations (three in the preferred example) has been reached in step S5, the retrained HMM is stored in memory as the speaker-normalized HMM 33 in step S7, and the speaker normalization process ends.
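The control flow of steps S1 to S7 can be sketched as a short loop. This is a skeleton under stated assumptions, not the patent's implementation: `estimate_offset` stands in for the MLLR estimation of $b_c^{(m)}$ and `retrain` stands in for Baum-Welch retraining, and both are hypothetical callables supplied by the caller. The toy stand-ins at the bottom treat the "model" as a single mean vector purely to make the loop executable.

```python
import numpy as np

def train_speaker_normalized(initial_model, data_by_speaker,
                             estimate_offset, retrain, n_iter=3):
    model = initial_model                                    # S1: start from the initial HMM
    for _ in range(n_iter):                                  # S5/S6: fixed number of passes
        offsets = [estimate_offset(model, x)                 # S2: per-speaker constant term
                   for x in data_by_speaker]
        normalized = [x - b                                  # S3: Equation 12
                      for x, b in zip(data_by_speaker, offsets)]
        model = retrain(model, normalized)                   # S4: retrain on normalized data
    return model                                             # S7: speaker-normalized model

# Toy stand-ins (assumptions for illustration only).
calls = []

def toy_estimate_offset(model, frames):
    return frames.mean(axis=0) - model

def toy_retrain(model, normalized):
    calls.append(1)
    return np.vstack(normalized).mean(axis=0)

data = [np.array([[1.0, 1.0], [3.0, 3.0]]), np.array([[10.0, 10.0]])]
final_model = train_speaker_normalized(np.zeros(2), data,
                                       toy_estimate_offset, toy_retrain)
```

Passing the estimator and trainer as arguments keeps the loop independent of any particular HMM toolkit.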

[0031] Next, speaker adaptation by the MLLR method combined with MAP estimation is described. Because the MLLR method estimates the mean vectors under the maximum likelihood criterion with respect to the adaptation training data, it is not a form of speaker adaptation that makes effective use of the prior knowledge in the initial model. Therefore, even if the speaker-normalized model carries good prior knowledge, that knowledge may not be fully exploited. We have accordingly devised speaker adaptation by a method that applies MAP estimation, which does make effective use of prior knowledge (see, for example, Prior Art Document 2: C. H. Lee et al., "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models", IEEE Transactions on Signal Processing, Vol. 39, No. 4, pp. 806-814, 1991), to the MLLR method (hereinafter, the MAP-MLLR method; processing by the MAP-MLLR method is called MAP-MLLR processing), as follows. The mean vector $\hat{\mu}_k^{MAP}$ of Gaussian distribution $k$ after speaker adaptation by the MAP-MLLR method is given by the following equation.

[0032]

[Equation 13] $\hat{\mu}_k^{MAP} = A_{c,k}^{MAP} \mu_k + b_{c,k}^{MAP}$

[Equation 15]

[0033] Here, $I$ is the $n \times n$ identity matrix, and $\tau_k$ is a constant expressing the confidence in the prior knowledge. In the preferred example, $\tau_k = 4.0$.

[0034] The mean vector estimated by MAP estimation is a linear combination of the mean vector of the initial model (the prior knowledge) and the mean vector obtained by maximum likelihood estimation. FIG. 6 is a conceptual diagram of mean vector estimation by the MLLR method combined with MAP estimation. The thickness of each arrow in FIG. 6 indicates the magnitude of the expected value of observing training data in the Gaussian distribution. As in the example of FIG. 6, a Gaussian distribution with a large expected value is estimated near the mean vector estimated by the MLLR method; conversely, a Gaussian distribution with a small expected value is estimated near the mean vector of the initial model. Introducing MAP estimation in this way yields speaker adaptation that takes into account the reliability of the mean vector estimated by MLLR speaker adaptation and makes appropriate use of the prior knowledge. The method of this embodiment estimates all coefficients and obtains transformation coefficients for each individual Gaussian distribution. It can therefore perform more precise speaker adaptation than the conventional example.
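Equations 14 and 15 are not reproduced in this text, so the following sketch is an assumption rather than the patent's stated formulas. It uses one parameterization that is consistent with the description above: the per-Gaussian coefficients interpolate between the identity transformation (the prior) and the MLLR transformation, with a weight that grows with the occupancy of Gaussian $k$. The weight `occupancy_k / (occupancy_k + tau_k)`, where `occupancy_k` stands for $\sum_t \gamma_k(t)$, follows the usual MAP mean update; the function names are hypothetical.

```python
import numpy as np

def map_mllr_coefficients(A_c, b_c, occupancy_k, tau_k=4.0):
    """Per-Gaussian coefficients blending the prior (identity) with MLLR."""
    n = A_c.shape[0]
    w = occupancy_k / (occupancy_k + tau_k)    # weight on the MLLR estimate
    A_map = (1.0 - w) * np.eye(n) + w * A_c
    b_map = w * b_c
    return A_map, b_map

def map_mllr_mean(A_c, b_c, mu_k, occupancy_k, tau_k=4.0):
    """Equation 13 applied with the blended coefficients."""
    A_map, b_map = map_mllr_coefficients(A_c, b_c, occupancy_k, tau_k)
    return A_map @ mu_k + b_map
```

Under this assumption a Gaussian with zero occupancy keeps its initial mean, a Gaussian whose occupancy equals $\tau_k$ lands halfway between the initial mean and the MLLR mean, and a heavily observed Gaussian approaches the MLLR mean, which matches the behaviour described for FIG. 6.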

[0035] FIG. 3 is a flowchart of the speaker adaptation process executed by the speaker adaptation control unit 21 of FIG. 1, and FIG. 8 is the corresponding block diagram. In FIG. 3, first, in step S11, the speaker-normalized HMM 33 and the speaker adaptation training data 34, which contains the feature vectors of the speech data of the speaker to be adapted to, are read out. Next, in step S12, the transformation coefficients $A_c, b_c$ are calculated by the MLLR method using Equations 6 to 11. Then, in step S13, the transformation coefficients $A_{c,k}^{MAP}, b_{c,k}^{MAP}$ are calculated by the MAP method using Equations 14 and 15. A linear transformation is then applied using Equation 13 to obtain the speaker-adapted HMM 11. Finally, the speaker-adapted HMM 11 is stored in memory. This completes the speaker adaptation process by the MAP-MLLR method.

[0036] The speaker-adapted HMM 11 is connected to the phoneme matching unit 4 and can also be represented as an HM network, that is, a network of multiple states. Each state in the HMM 11 can be regarded as a single probabilistic stationary signal source in the speech space and holds the following information: (a) a state number; (b) the acceptable context class; (c) lists of preceding and succeeding states; (d) the parameters of the probability distribution assigned in the speech feature space; and (e) the self-transition probability and the transition probabilities to succeeding states. Given input data and its context information, the speaker-adapted HMM 11 uniquely determines the model for that input by concatenating the states that can accept the context, subject to the constraints of the preceding-state and succeeding-state lists. The output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices (referred to in this specification simply as a Gaussian distribution). Each Gaussian distribution has been speaker-normalized by the speaker normalization control unit 20 starting from the initial HMM 31, and then speaker-adapted by the speaker adaptation control unit 21 starting from the speaker-normalized HMM 33. The speaker-normalized HMM 33 may instead be connected to the phoneme matching unit 4 and used for phoneme detection.
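The per-state information (a) to (e) maps naturally onto a small record type. The following is a minimal sketch, not the patent's data layout: the class and field names, the string encoding of the context class, and the consistency check are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    weight: float          # mixture weight
    mean: List[float]      # mean vector (34-dimensional in the embodiment)
    variance: List[float]  # diagonal of the covariance matrix


@dataclass
class HMNetState:
    state_id: int              # (a) state number
    context_class: str         # (b) acceptable context class (encoding is hypothetical)
    predecessors: List[int]    # (c) preceding states
    successors: List[int]      # (c) succeeding states
    mixture: List[Gaussian]    # (d) output distribution parameters
    self_loop_prob: float      # (e) self-transition probability
    next_probs: List[float]    # (e) transition probability to each successor

    def is_normalized(self) -> bool:
        """Check that transition probabilities and mixture weights each sum to 1."""
        trans = self.self_loop_prob + sum(self.next_probs)
        weights = sum(g.weight for g in self.mixture)
        return math.isclose(trans, 1.0) and math.isclose(weights, 1.0)


state = HMNetState(
    state_id=7,
    context_class="a/k/*",   # made-up notation, for illustration only
    predecessors=[3],
    successors=[8],
    mixture=[Gaussian(0.5, [0.0, 0.0], [1.0, 1.0]),
             Gaussian(0.5, [1.0, -1.0], [0.5, 2.0])],
    self_loop_prob=0.75,
    next_probs=[0.25],
)
print(state.is_normalized())  # True
```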

[0037] In general, when a model based on continuous-density HMMs is adapted to a speaker with a small amount of adaptation data, adapting the means of the Gaussian distributions is known to be more effective than adapting the other parameters (see, for example, Prior Art Document 3: K. Ohkura et al., "Moving vector field smoothing speaker adaptation method using continuous mixture density HMMs", Proceedings of the Acoustical Society of Japan, 2-Q-17, pp. 191-192, March 1992). In the present embodiment, only the mean of each Gaussian distribution is adapted; the variances, the state transition probabilities, and the mixture weights of the Gaussian mixtures are not adapted.

[0038] Next, an SSS-LR (left-to-right rightmost type) speaker-independent continuous speech recognition device that uses the speaker normalization method and the speaker adaptation method of the present embodiment described above will be described. This device uses an efficient phoneme-context-dependent HMM representation, stored in memory as an HM network that includes the HMM 11. In SSS, a probabilistic model represents the temporal evolution of the speech parameters through probabilistic transitions between probabilistic stationary signal sources (states) assigned in the phoneme feature space. The model is refined step by step by repeatedly splitting individual states in the context direction or the time direction according to a likelihood-maximization criterion.

[0039] In FIG. 1, the speaker's uttered speech is input to the microphone 1, converted into a speech signal, and passed to the feature extraction unit 2. The feature extraction unit 2 A/D-converts the input speech signal and then performs, for example, LPC analysis to extract 34-dimensional feature parameters consisting of the log power, 16th-order cepstrum coefficients, the Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3.
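The 34 dimensions break down as 1 + 16 + 1 + 16. A minimal sketch of assembling such vectors is given below. The ordering of the components and the use of a plain first difference for the Δ terms are assumptions (the text does not state how the Δ parameters are computed), and the function name and toy values are hypothetical.

```python
from typing import List

CEPSTRUM_ORDER = 16


def assemble_features(log_power: List[float],
                      cepstra: List[List[float]]) -> List[List[float]]:
    """Build one 34-dimensional vector per frame.

    Layout: [log power, 16 cepstra, delta log power, 16 delta cepstra].
    Deltas are plain first differences to the previous frame (an assumption);
    the first frame uses itself as its predecessor, so its deltas are zero.
    """
    features = []
    for t, (power, cep) in enumerate(zip(log_power, cepstra)):
        if len(cep) != CEPSTRUM_ORDER:
            raise ValueError("expected 16 cepstrum coefficients per frame")
        prev = max(t - 1, 0)
        d_power = power - log_power[prev]
        d_cep = [c - p for c, p in zip(cep, cepstra[prev])]
        features.append([power, *cep, d_power, *d_cep])
    return features


# Two toy frames with made-up values.
frames = assemble_features([1.0, 1.5], [[0.0] * 16, [0.25] * 16])
print(len(frames[1]))  # 34
```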

[0040] The phoneme matching unit 4 performs phoneme matching in response to a phoneme matching request from the phoneme-context-dependent LR parser 5. Using the speaker model of the phoneme HMMs stored in the memory of the speaker-adapted HMM 11, it computes the likelihood of the data in the phoneme matching interval and returns this likelihood to the LR parser 5 as the phoneme matching score. The forward-pass algorithm is used for this computation.
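A likelihood of this kind can be computed with the textbook forward recursion. The sketch below works in the log domain and takes the emission probabilities as given. It is a generic illustration, not the device's implementation: the function names are hypothetical, and summing over all states at the last frame (rather than requiring the final state) is a simplifying assumption.

```python
import math
from typing import List

NEG_INF = float("-inf")


def log0(p: float) -> float:
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def logsumexp(values: List[float]) -> float:
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(log_init: List[float],
                           log_trans: List[List[float]],
                           log_emit: List[List[float]]) -> float:
    """Forward algorithm; log_emit[t][j] is the log emission score of frame t in state j."""
    n = len(log_init)
    alpha = [log_init[j] + log_emit[0][j] for j in range(n)]
    for t in range(1, len(log_emit)):
        alpha = [logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
                 + log_emit[t][j]
                 for j in range(n)]
    return logsumexp(alpha)


# Toy two-state left-to-right model with two frames (made-up numbers).
log_init = [log0(1.0), log0(0.0)]
log_trans = [[log0(0.5), log0(0.5)],
             [log0(0.0), log0(1.0)]]
log_emit = [[log0(0.6), log0(0.1)],
            [log0(0.2), log0(0.7)]]
ll = forward_log_likelihood(log_init, log_trans, log_emit)
print(round(math.exp(ll), 6))  # 0.27
```

The value 0.27 is the sum over the two possible state paths, 0.6 * 0.5 * 0.2 + 0.6 * 0.5 * 0.7.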

[0041] Meanwhile, a predetermined context-free grammar (CFG) in the context-free grammar database 13 is automatically converted, in a known manner, into the LR table 12, which is stored in its memory. The LR parser 5 refers to the LR table 12 and processes the input phoneme prediction data from left to right without backtracking. When there is syntactic ambiguity, the stack is split and all candidate analyses are processed in parallel. The LR parser 5 predicts the next phoneme from the LR table 12 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching with reference to the information in the HMM 11 for that phoneme and returns the likelihood to the LR parser 5 as the speech recognition score. Continuous speech is recognized by concatenating phonemes one after another in this way. When several phonemes are predicted during this recognition, the presence of each of them is checked, and a beam search prunes the hypotheses so that only partial trees with a high partial recognition likelihood remain, which achieves high-speed processing.
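The interplay of prediction, scoring, and pruning can be illustrated with a single beam-search step. This is a generic sketch under stated assumptions: the callbacks `predict` and `score` stand in for the LR parser and the phoneme matching unit, and their signatures, the beam width, and the toy scores are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]   # (phoneme sequence, log-likelihood)


def beam_step(beam: List[Hypothesis],
              predict: Callable[[Tuple[str, ...]], Iterable[str]],
              score: Callable[[Tuple[str, ...], str], float],
              beam_width: int) -> List[Hypothesis]:
    """Extend every hypothesis by each predicted phoneme, then keep the best few."""
    expanded = []
    for phones, logp in beam:
        for ph in predict(phones):
            expanded.append((phones + (ph,), logp + score(phones, ph)))
    # Pruning: only the highest-scoring partial hypotheses survive.
    expanded.sort(key=lambda h: h[1], reverse=True)
    return expanded[:beam_width]


# Toy stand-ins: the "grammar" always allows /a/ or /k/, with fixed scores.
toy_scores = {"a": -1.0, "k": -2.0}
beam: List[Hypothesis] = [((), 0.0)]
for _ in range(2):
    beam = beam_step(beam,
                     predict=lambda phones: ("a", "k"),
                     score=lambda phones, ph: toy_scores[ph],
                     beam_width=2)
print(beam[0])  # (('a', 'a'), -2.0)
```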

[0042]

[Examples] The present inventors evaluated the speech recognition device configured as described above in the following experiment. The evaluation was a phoneme-typewriter-type phoneme recognition experiment over 26 phonemes with no language constraints. Table 1 shows the acoustic analysis conditions and the speech data used.

[0043]

[Table 1] Experimental conditions
───────────────────────────────────
Analysis conditions          Sampling frequency: 12 kHz
                             Window: 20 ms Hamming window
                             Frame period: 5 ms
───────────────────────────────────
Parameters used              16th-order LPC cepstrum + 16th-order Δ cepstrum
                             + log power + Δ log power
───────────────────────────────────
Training data                9 male and 6 female speakers selected from
                             146 male and 139 female speakers (50 sentences each)
───────────────────────────────────
Adaptation/recognition data  Speakers: 3 male (MAU, MMY, MTM),
                             3 female (FAF, FMS, FYM)
                             Adaptation data: n phrases drawn at random from
                             598 phrases (SB1, SB2, SB4 tasks)
                             Recognition data: 279 phrases (SB3 task)
───────────────────────────────────
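The analysis conditions in Table 1 fix the frame geometry. As a small worked example (the helper name and the one-second signal length are illustrative assumptions, and frames are assumed not to be padded at the signal edges), the window and shift lengths in samples follow directly from the 12 kHz sampling frequency:

```python
SAMPLE_RATE_HZ = 12_000   # Table 1: sampling frequency
WINDOW_MS = 20            # Table 1: Hamming window length
SHIFT_MS = 5              # Table 1: frame period

window_len = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # samples per window
shift_len = SAMPLE_RATE_HZ * SHIFT_MS // 1000     # samples per frame shift


def num_frames(num_samples: int) -> int:
    """Number of full windows that fit, assuming no padding at the edges."""
    if num_samples < window_len:
        return 0
    return (num_samples - window_len) // shift_len + 1


print(window_len, shift_len, num_frames(SAMPLE_RATE_HZ))  # 240 60 197
```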

[0044] The state-sharing structure (HM network) of the acoustic model before adaptation was determined from the word utterances of one male speaker by the successive state splitting method (see, for example, Prior Art Document 4: J. Takami et al., "A Successive State Splitting Algorithm for Efficient Allophone Modeling", Proceedings of ICASSP'92, pp. 573-576, 1992). The model had 200 states (5 mixture components each), to which a one-state silence model (10 mixture components) was added. The number of shared classes for the MLLR method used in speaker normalization and speaker adaptation was set to one; that is, all Gaussian distributions are tied together and a single set of transform coefficients is estimated for them.

[0045] The speaker-normalized model and, for comparison, the conventional SI-HMM model were both trained with the Baum-Welch algorithm on the speech data of 15 speakers. These 15 speakers were selected as representative speakers from the models of 285 speakers by a clustering method (see, for example, Prior Art Document 5: T. Kosaka et al., "Tree-Structured Speaker Clustering for Fast Speaker Adaptation", Proceedings of ICASSP'94, pp. 245-248, 1994). The speaker normalization process of step S5 described above was repeated three times. In the speaker adaptation process, the constant τ_k, which expresses the confidence in the prior knowledge for the MAP estimation method, was set to the same value for all Gaussian distributions; the experimentally determined value 4.0 was used. Supervised speaker adaptation was carried out by the procedure shown in FIGS. 3 and 8. For each number of adaptation phrases, the evaluation was repeated three times with different selected phrases, and the average phoneme recognition rate was obtained.

[0046] First, to confirm that speaker normalization improves discrimination performance, a phoneme recognition experiment was performed with the speaker-normalized HMM 33 and no adaptation. Table 2 shows the results, together with the recognition results of the conventional SI-HMM model for comparison.

[0047]

[Table 2] Speech recognition results with the speaker-normalized HMM: phoneme error rate (%)
───────────────────────────────────
Model                        MAU   MMY   MTM   FAF   FMS   FYM   Average
───────────────────────────────────
Speaker-normalized model     15.2  15.2  12.0  20.2  18.4  29.5  18.4
Speaker-independent model    15.5  17.0  13.3  21.9  25.2  33.4  21.1
───────────────────────────────────

[0048] As Table 2 shows, the speaker-normalized model gives a higher recognition rate for all six evaluation speakers, and the average phoneme error rate falls from 21.1% to 18.4% (an error reduction rate of 12.8%). The improvement is especially large for the speakers whose recognition rate with the conventional SI-HMM model is low (FMS and FYM). This is thought to be because speaker normalization reduces the variances of the Gaussian distributions, which sharpens the discrimination between recognition units and improves performance.
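The 12.8% figure is the relative reduction of the average error rate, (21.1 - 18.4) / 21.1. The short check below reproduces it from the Table 2 values and also lists the two speakers with the largest absolute gain; the function and variable names are illustrative.

```python
SPEAKERS = ["MAU", "MMY", "MTM", "FAF", "FMS", "FYM"]
NORMALIZED = [15.2, 15.2, 12.0, 20.2, 18.4, 29.5]    # Table 2, speaker-normalized model
INDEPENDENT = [15.5, 17.0, 13.3, 21.9, 25.2, 33.4]   # Table 2, speaker-independent model


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# Average error rates as printed in Table 2.
print(round(relative_reduction(21.1, 18.4), 1))  # 12.8

# Absolute gain per speaker, largest first.
gains = sorted(
    ((spk, base - norm) for spk, norm, base in zip(SPEAKERS, NORMALIZED, INDEPENDENT)),
    key=lambda item: item[1],
    reverse=True,
)
top_two = [spk for spk, _ in gains[:2]]
print(top_two)  # ['FMS', 'FYM']
```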

[0049] Next, Table 3 shows the recognition results of speaker adaptation by the MAP-MLLR method when the speaker-normalized HMM 33 is used as the initial model and when the conventional SI-HMM model is used.

[0050]

[Table 3] Speech recognition results with the speaker-adapted HMM: phoneme error rate (%)
Upper row: speaker-normalized model; lower row: speaker-independent model
──────────────────────────────
          Number of adaptation phrases
Speaker   3     5     7     10    20
──────────────────────────────
MAU       15.8  15.0  14.9  15.2  13.7
          16.4  15.7  14.9  15.3  14.3
──────────────────────────────
MMY       15.3  14.6  14.4  14.2  13.6
          17.3  16.0  16.0  15.3  14.6
──────────────────────────────
MTM       11.8  11.8  11.0  10.9   9.9
          13.3  13.2  12.8  12.3  10.6
──────────────────────────────
FAF       19.0  16.8  15.6  14.9  14.1
          21.8  19.8  18.5  16.5  15.1
──────────────────────────────
FMS       19.5  18.5  17.7  16.6  13.9
          26.3  23.9  22.4  20.0  15.6
──────────────────────────────
FYM       26.6  23.9  23.2  21.4  19.4
          29.6  24.0  25.4  24.2  19.6
──────────────────────────────
Average   18.0  16.8  16.1  15.6  14.1
          20.8  18.8  18.3  17.2  14.9
──────────────────────────────

[0051] As Table 3 shows, for every speaker and every number of adaptation phrases, speaker adaptation that starts from the speaker-normalized HMM 33 gives a recognition rate at least as high as adaptation from the SI-HMM model (the two are tied only for speaker MAU with 7 phrases). The speaker-normalized model is an initial model whose prior knowledge is well suited to speaker adaptation, and it achieves accurate speaker adaptation.
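The comparison can be checked cell by cell against the Table 3 values. The sketch below lists the cells where the speaker-normalized initial model is worse than, or exactly tied with, the speaker-independent one; the variable names are illustrative and the numbers are copied from the table.

```python
PHRASE_COUNTS = [3, 5, 7, 10, 20]

# Phoneme error rates (%) from Table 3, one list per speaker.
NORMALIZED = {
    "MAU": [15.8, 15.0, 14.9, 15.2, 13.7],
    "MMY": [15.3, 14.6, 14.4, 14.2, 13.6],
    "MTM": [11.8, 11.8, 11.0, 10.9, 9.9],
    "FAF": [19.0, 16.8, 15.6, 14.9, 14.1],
    "FMS": [19.5, 18.5, 17.7, 16.6, 13.9],
    "FYM": [26.6, 23.9, 23.2, 21.4, 19.4],
}
INDEPENDENT = {
    "MAU": [16.4, 15.7, 14.9, 15.3, 14.3],
    "MMY": [17.3, 16.0, 16.0, 15.3, 14.6],
    "MTM": [13.3, 13.2, 12.8, 12.3, 10.6],
    "FAF": [21.8, 19.8, 18.5, 16.5, 15.1],
    "FMS": [26.3, 23.9, 22.4, 20.0, 15.6],
    "FYM": [29.6, 24.0, 25.4, 24.2, 19.6],
}

ties, worse = [], []
for speaker, rates in NORMALIZED.items():
    for n, norm, indep in zip(PHRASE_COUNTS, rates, INDEPENDENT[speaker]):
        if norm == indep:
            ties.append((speaker, n))
        elif norm > indep:
            worse.append((speaker, n))

print(worse)  # []
print(ties)   # [('MAU', 7)]
```

No cell favors the speaker-independent initial model, and the only tie is MAU with 7 adaptation phrases.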

[0052] As described above, the present embodiment provides a method of creating a speaker-normalized model using a multiple regression mapping model. The acoustic model 33 created by this speaker normalization method achieved higher phoneme recognition performance than the conventional SI-HMM model. Moreover, when the speaker-normalized HMM 33 was used as the initial model and speaker adaptation was performed by the MAP-MLLR method, the prior knowledge of the initial model was reflected and accurate speaker adaptation was achieved. In addition, even with a small amount of training or adaptation data, the accuracy of estimating the parameters of the speaker-normalized or speaker-adapted HMM can be greatly improved over the conventional example.

[0053]

[Effects of the Invention] As described in detail above, the speaker normalization device according to claim 1 of the present invention comprises: a storage device that stores feature vectors of speech data dependent on each of a plurality of speakers, the speech data being training data for training an initial model of a predetermined hidden Markov model; first computing means for computing, for each speaker, by the maximum likelihood linear regression method applied to the initial model of the hidden Markov model and based on the feature vectors of the speech data stored in the storage device, a first transform coefficient that includes a transformation matrix for transforming mean vectors based on a multiple regression mapping model and a constant term vector representing individual differences common across the spectrum; second computing means for subtracting, for each speaker, the constant term vector computed by the first computing means from the feature vectors of the speech data stored in the storage device, thereby computing normalized feature vectors of the speech data; and third computing means for computing the model parameters of a speaker-normalized hidden Markov model by training the initial model of the hidden Markov model with a predetermined training algorithm based on the normalized feature vectors computed by the second computing means. The speaker normalization device can therefore estimate the parameters of the hidden Markov model far more accurately than the conventional example, and speech recognition using the speaker-normalized hidden Markov model obtained by the device achieves a higher speech recognition rate than the conventional example.
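The role of the second computing means, subtracting each speaker's constant term vector from that speaker's feature vectors, can be sketched in a few lines. The function name, the dictionary layout, and the 2-dimensional toy values are assumptions for illustration; estimating the constant term vectors by MLLR and retraining the HMM are outside this sketch.

```python
from typing import Dict, List

Frames = List[List[float]]


def subtract_constant_term(frames: Frames, b: List[float]) -> Frames:
    """Return the frames with the speaker's constant term vector b removed."""
    return [[x - b_i for x, b_i in zip(frame, b)] for frame in frames]


# Toy data: two speakers, 2-dimensional features, made-up constant term vectors.
data: Dict[str, Frames] = {
    "spk1": [[1.0, 2.0], [3.0, 4.0]],
    "spk2": [[0.0, 0.5]],
}
constant_terms: Dict[str, List[float]] = {
    "spk1": [0.5, 1.0],
    "spk2": [-1.0, 0.25],
}

normalized = {spk: subtract_constant_term(frames, constant_terms[spk])
              for spk, frames in data.items()}
print(normalized["spk1"])  # [[0.5, 1.0], [2.5, 3.0]]
print(normalized["spk2"])  # [[1.0, 0.25]]
```

The normalized frames from all speakers would then be pooled to train the speaker-normalized model.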

[0054] Further, the speaker adaptation device according to claim 2 of the present invention comprises: fourth computing means for computing, by the maximum likelihood linear regression method applied to the hidden Markov model computed by the third computing means of the speaker normalization device according to claim 1 and based on feature vectors of the speech data of the speaker to be adapted to, a second transform coefficient that includes a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model; fifth computing means for computing, by the maximum a posteriori estimation method and based on the second transform coefficient computed by the fourth computing means, a third transform coefficient that includes a transformation matrix and a constant term vector for transforming mean vectors based on the speaker-adapted multiple regression mapping model; and sixth computing means for computing the mean vectors of the speaker-adapted hidden Markov model by performing a predetermined linear transformation process on the third transform coefficient computed by the fifth computing means. The speaker adaptation device can therefore estimate the speaker adaptation parameters far more accurately than the conventional example, and speech recognition using the speaker-adapted hidden Markov model obtained by the device achieves a higher speech recognition rate than the conventional example.

[0055] Further, the speech recognition device according to claim 3 comprises speech recognition means that recognizes speech from the speech signal of an input uttered sentence, using the hidden Markov model computed by the third computing means of the speaker normalization device according to claim 1, and outputs the speech recognition result. Speech recognition using the speaker-normalized hidden Markov model obtained by the speaker normalization device therefore achieves a higher speech recognition rate than the conventional example.

[0056] Further, the speech recognition device according to claim 4 comprises speech recognition means that recognizes speech from the speech signal of an input uttered sentence, using a hidden Markov model that includes the mean vectors computed by the sixth computing means of the speaker adaptation device according to claim 2, and outputs the speech recognition result. Speech recognition using the speaker-adapted hidden Markov model obtained by the speaker adaptation device therefore achieves a higher speech recognition rate than the conventional example.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the speaker normalization process executed by the speaker normalization control unit of FIG. 1.

FIG. 3 is a flowchart showing the speaker adaptation process executed by the speaker adaptation control unit of FIG. 1.

FIG. 4 is a diagram showing the MLLR process executed by the speaker normalization control unit of FIG. 1.

FIG. 5 is a diagram showing the speaker normalization process executed by the speaker normalization control unit of FIG. 1.

FIG. 6 is a diagram showing the speaker adaptation process executed by the speaker adaptation control unit of FIG. 1.

FIG. 7 is a block diagram showing the speaker normalization process executed by the speaker normalization control unit of FIG. 1.

FIG. 8 is a block diagram showing the speaker adaptation process executed by the speaker adaptation control unit of FIG. 1.

[Description of Reference Numerals] 1: microphone; 2: feature extraction unit; 3: buffer memory; 4: phoneme matching unit; 5: LR parser; 11: speaker-adapted HMM; 12: LR table; 13: context-free grammar database; 20: speaker normalization control unit; 21: speaker adaptation control unit; 31: initial HMM; 32-1 to 32-M: speech data of speakers 1 to M; 33: speaker-normalized HMM; 34: training data for speaker adaptation.

Continuation of the front page

(56) References
Proceedings of the Acoustical Society of Japan 1996 Autumn Meeting, I, 3-3-17, "Study of speaker adaptation method based on multiple regression model", pp. 119-120 (September 25, 1996)
Proceedings of 1995 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, "Speaker Adaptation based on Spectral Normalization and Dynamic HMM Parameter Adaptation", pp. 704-707
Proceedings of the Acoustical Society of Japan 1997 Spring Meeting, I, 2-6-16, "Speaker normalization method for speaker adaptation using multiple regression models", pp. 75-76 (March 17, 1997)
Proceedings of the Acoustical Society of Japan 1995 Autumn Meeting, I, 3-2-9, "Study of speaker-independent models using state-dependent speaker clustering", pp. 123-124 (September 1995)
Proceedings of the Acoustical Society of Japan 1995 Spring Meeting, I, 2-5-6, "Effect of smoothing coefficient control in the MAP-VFS speaker adaptation method", pp. 41-42 (March 1995)
IEICE Technical Report [Speech], Vol. 94, No. 271, SP94-51, "Speaker adaptation method integrating maximum a posteriori estimation and moving vector field smoothing", pp. 25-30 (October 13, 1994)
Proceedings of the Acoustical Society of Japan 1996 Spring Meeting, I, 1-5-22, "Study of speaker adaptation using a restricted multiple regression model", pp. 51-52 (issued March 26, 1996)
Proceedings of 1996 IEEE International Conference on Spoken Language Processing, "Novel Training Method for Classifiers used in Speaker Adaptation", pp. 2119-2122, 1996
Proceedings of 1996 IEEE International Conference on Spoken Language Processing, "Compact Model for Speaker-Adaptive Training", pp. 1137-1140, 1996
Proceedings of 1996 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, "Normalized Discriminant Analysis with Application to a Hybrid Speaker-Verification System", pp. 681-684
Proceedings of 1996 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, "Speaker Background Models for Connected Digit Password Speaker Verification", pp. 81-84
Proceedings of 1981 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1/3, "Speaker Identification and Verification Combined with Speaker Independent Word Recognition", pp. 184-187
Proceedings of 1997 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, "Speaker-Adapted Training on the Switchboard Corpus", pp. 1059-1062
Proceedings of 1997 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, "Speaker-Adaptive Training: A Maximum Likelihood Approach to Speaker Normalization", pp. 1043-1046

(58) Fields searched (Int. Cl.⁷, DB name): G10L 15/14, G10L 15/06, G10L 15/10; JICST file (JOIS)

Claims

(57) [Claims]

1. A speaker normalization device characterized by comprising: a storage device that stores feature vectors of speech data dependent on each of a plurality of speakers, the speech data being training data for training an initial model of a predetermined hidden Markov model; first computing means for computing, for each speaker, by the maximum likelihood linear regression method applied to the initial model of the hidden Markov model and based on the feature vectors of the speech data stored in the storage device, a first transform coefficient that includes a transformation matrix for transforming mean vectors based on a multiple regression mapping model and a constant term vector representing individual differences common across the spectrum; second computing means for subtracting, for each speaker, the constant term vector computed by the first computing means from the feature vectors of the speech data stored in the storage device, thereby computing normalized feature vectors of the speech data; and third computing means for computing the model parameters of a speaker-normalized hidden Markov model by training the initial model of the hidden Markov model with a predetermined training algorithm based on the normalized feature vectors computed by the second computing means.

2. A speaker adaptation device characterized by comprising: fourth computing means for computing, by the maximum likelihood linear regression method applied to the hidden Markov model computed by the third computing means of the speaker normalization device according to claim 1 and based on feature vectors of the speech data of the speaker to be adapted to, a second transform coefficient that includes a transformation matrix and a constant term vector for transforming mean vectors based on a multiple regression mapping model; fifth computing means for computing, by the maximum a posteriori estimation method and based on the second transform coefficient computed by the fourth computing means, a third transform coefficient that includes a transformation matrix and a constant term vector for transforming mean vectors based on the speaker-adapted multiple regression mapping model; and sixth computing means for computing the mean vectors of the speaker-adapted hidden Markov model by performing a predetermined linear transformation process on the third transform coefficient computed by the fifth computing means.

3. A speech recognition device characterized by comprising speech recognition means that recognizes speech from the speech signal of an input uttered sentence, using the hidden Markov model computed by the third computing means of the speaker normalization device according to claim 1, and outputs a speech recognition result.

4. A speech recognition device characterized by comprising speech recognition means that recognizes speech from the speech signal of an input uttered sentence, using a hidden Markov model that includes the mean vectors of the hidden Markov model computed by the sixth computing means of the speaker adaptation device according to claim 2, and outputs a speech recognition result.