JPH0981179A - Speaker adaptive device and voice recognition device

Info

Publication number: JPH0981179A
Application number: JP7239819A
Authority: JP
Inventors: Jun Ishii; 純石井; Masahiro Tonomura; 政啓外村; Shoichi Matsunaga; 昭一松永
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1995-09-19
Filing date: 1995-09-19
Publication date: 1997-03-28
Anticipated expiration: 2015-09-19
Also published as: JP2888781B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of movement vector estimation and raise the speech recognition rate by selecting, for each target vector to be processed, a predetermined number of top-ranked neighboring vectors whose distance from the target vector is smallest. SOLUTION: A speaker adaptation control section 31 adaptively trains an initial speaker model 30, which includes speaker cluster models, using speaker-adaptation training data 32 (for example, sentence-utterance text data), converts it into a speaker-independent phoneme model of phoneme HMMs, and stores the model in the memory of a hidden Markov network (HM network) 11; speech recognition is performed based on the network 11. Specifically, the section 31 executes in order the computation of the movement vectors, the interpolation of the movement vectors, the smoothing of the movement vectors, and a training process that performs speaker adaptation using the processed movement vectors. During interpolation and smoothing, neighboring vectors are selected using the tree structure of the state-splitting process of the known successive state splitting method.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speaker adaptation device that creates a hidden Markov model (hereinafter, HMM) by adapting an initial speaker model to a speaker using speaker-adaptation training data, and to a speech recognition device that recognizes speech using that HMM.

[0002]

[Prior Art] Conventionally, when speaker adaptation is performed with a small amount of training material for a speech recognition device that uses HMMs, compensating for the lack of information is indispensable for obtaining a stable adaptation effect. For this reason, various speaker adaptation methods that use the information contained in an initial speaker model, such as a speaker-independent model, as prior knowledge have been studied (see, for example, Prior Art Document 1: Kazumi Ohkura et al., "Movement vector field smoothing speaker adaptation method using continuous mixture density HMMs," Proceedings of the Acoustical Society of Japan, 2-Q-17, pp. 191-192, March 1992).

[0003] For example, in the conventional movement vector field smoothing speaker adaptation method (hereinafter, the VFS method) disclosed in Prior Art Document 1, the difference between the value estimated from the speaker-adaptation training data and the initial value of each model parameter is defined as a movement vector. Speaker adaptation is then performed by smoothing each movement vector using information from acoustically neighboring movement vectors, which reduces the estimation error, and by interpolating the model parameters that remain untrained because no corresponding training data exists.

[0004]

[Problem to Be Solved by the Invention] The conventional VFS method reduces, through interpolation and smoothing of parameters, the parameter estimation error that arises because only a small amount of speaker-adaptation training data is available. In the conventional VFS method, however, the neighboring vectors used for this interpolation and smoothing are those vectors, among the ones trained with the speaker-adaptation training data, that are close in Euclidean distance; the concept of phoneme environment is not involved. Under a pure distance criterion, interpolation and smoothing may be carried out with vectors from different phoneme environments. As a result, the fact that movement vectors have properties specific to each phoneme environment is not reflected, and the estimation accuracy of the movement vectors degrades.

[0005] An object of the present invention is to solve the above problem and to provide a speaker adaptation device that can improve the estimation accuracy of the movement vectors and raise the speech recognition rate compared with the conventional example, and a speech recognition device that recognizes speech using the resulting HMM.

[0006]

[Means for Solving the Problem] The speaker adaptation device according to claim 1 of the present invention is a speaker adaptation device for computing a hidden Markov model speaker model for speech recognition by adapting and training an initial speaker model on the basis of speaker-adaptation training data, using movement vectors that represent the relationship between the feature vectors of the hidden Markov model before and after speaker adaptation, the device comprising: smoothing means for smoothing a first feature vector of the hidden Markov model, for which speaker-adaptation training data exists and which has been adapted on the basis of that data, using the first feature vector and a plurality of second feature vectors of the speaker-adapted hidden Markov model in its neighborhood; and interpolation means for interpolating a mean vector of a Gaussian distribution of the speaker-adapted hidden Markov model for which no speaker-adaptation training data exists and which was therefore not computed by the smoothing means, using the movement vectors of mean vectors of Gaussian distributions of the speaker-adapted hidden Markov model that have training data, were computed by the smoothing means, and lie in the neighborhood of the pre-adaptation mean vector corresponding to the mean vector being interpolated; wherein the smoothing means and the interpolation means each comprise selection means that uses the tree structure of the state-splitting process of the successive state splitting method to select, from among the vectors in the layers below a certain node of the tree, a predetermined number of top-ranked vectors having the smallest distance between the target vector to be processed and the neighboring vector.

[0007] The speaker adaptation device according to claim 2 is the speaker adaptation device according to claim 1, wherein the selection means extracts the lowest-layer node corresponding to the state to which the target vector belongs; traces the tree structure upward from the extracted lowest-layer node until the number of speaker-adapted (trained) vectors in the states at or below some higher-layer node reaches the predetermined number or more, and takes that node as the top node; and, among the vectors in the states at or below the top node, selects the predetermined number of top-ranked vectors with the smallest distance between the target vector and the neighboring vector as the selected vectors for the interpolation and smoothing processes.

[0008] Further, the speaker adaptation device according to claim 3 is the speaker adaptation device according to claim 1 or 2, wherein the smoothing means smooths a mean vector of a Gaussian distribution of the speaker-adapted hidden Markov model for which speaker-adaptation training data exists and which was computed by the smoothing means, using that mean vector and the movement vectors of neighboring mean vectors of Gaussian distributions of the speaker-adapted hidden Markov model that likewise have training data and were computed by the smoothing means, on the basis of a continuity constraint on the movement vectors, and using a smoothing coefficient that expresses the smoothing strength and is determined in advance so that the strength decreases as the amount of speaker-adaptation training data for the Gaussian distribution increases.

[0009] The speech recognition device according to claim 4 of the present invention comprises the speaker adaptation device according to any one of claims 1 to 3, and speech recognition means that, on the basis of the speech signal of an input spoken sentence, performs speech recognition using the hidden Markov model speaker model adapted by the speaker adaptation device and outputs a speech recognition result.

[0010]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to one embodiment of the present invention. The speech recognition device of this embodiment includes a speaker adaptation control unit 31 that adaptively trains an initial speaker model 30, which contains speaker cluster models, using speaker-adaptation training data 32 (for example, sentence-utterance text data), converts it into a speaker-independent phoneme model of phoneme HMMs, and stores the result in the memory of a hidden Markov network (hereinafter, HM network) 11; speech recognition is performed on the basis of the HM network 11. In particular, as shown in FIG. 3, the speaker adaptation control unit 31 executes in order a movement vector computation (step S1), a movement vector interpolation (step S2), a movement vector smoothing (step S3), and a training process that performs speaker adaptation using the processed movement vectors (step S4). In the interpolation and smoothing of the movement vectors, neighboring vectors are selected using the tree structure of the state-splitting process of the known successive state splitting (SSS) method.

[0011] First, the selection of vectors in the interpolation and smoothing processes of the VFS method is described. The conventional VFS method treats speaker adaptation as the problem of retraining an HMM with a small amount of training material (that is, speaker-adaptation training data), and proceeds in three steps: (1) computation of the movement vectors, (2) interpolation of the movement vectors, and (3) smoothing of the movement vectors. Here, a movement vector is the difference between the mean vectors of corresponding Gaussian distributions in the initial model and the adapted model. Steps (2) and (3) are carried out using the K neighboring vectors of the vector being interpolated or smoothed. Interpolation is applied to a vector p that was not trained by the speaker-adaptation training data: using the distance $d_{p,k}$ between vector p and each vector k that was trained by the data as the criterion, K vectors are selected in ascending order of distance, and the movement vector of p is estimated from their movement vectors by interpolation and extrapolation. Smoothing likewise uses the K vectors with the smallest distance $d_{p,k}$ from the vector p being smoothed. However, this selects vectors purely by distance in the vector space and takes no account of the phoneme environment. Between vectors whose phoneme environments are identical or highly similar, the properties of the movement vectors are also expected to be highly similar. The conventional VFS method, which uses vectors from different phonemes for interpolation and smoothing as long as they are close in the vector space, is therefore expected to produce large errors in the estimated movement vectors. In the present invention, as detailed below, the neighboring vectors used in VFS interpolation and smoothing are selected according to the similarity of phoneme environment, and the selection uses the tree structure of the state-splitting process of the known successive state splitting (SSS) method disclosed in Prior Art Document 2: Jun-ichi Takami et al., "Automatic generation of hidden Markov networks by successive state splitting on phoneme context and time," IEICE Technical Report, SP91-88, December 1991.
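The distance-only selection rule of the conventional VFS method can be sketched as follows. This is a minimal illustration, assuming toy two-dimensional mean vectors; the names `euclidean`, `select_k_nearest` and `trained_means` are hypothetical and do not come from the patent.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_k_nearest(target, trained, k):
    """Return the indices of the k trained vectors closest to `target`.

    `trained` maps a vector index to its mean vector. Only distance is
    used; the phoneme environment plays no role in this conventional rule.
    """
    ranked = sorted(trained, key=lambda idx: euclidean(target, trained[idx]))
    return ranked[:k]

# Toy data (assumed): four trained mean vectors in a 2-D feature space.
trained_means = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [5.0, 5.0], 3: [0.0, 2.0]}
nearest = select_k_nearest([0.2, 0.1], trained_means, k=2)
print(nearest)
```

Because the ranking depends on distance alone, a vector from an unrelated phoneme environment is chosen whenever it happens to lie nearby, which is the weakness the tree-based selection is meant to address.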

[0012] The successive state splitting (SSS) method is an algorithm devised to generate a hidden Markov network (hereinafter, HM network), which is represented as a concatenation of multiple states and yields the whole set of context-dependent phoneme models simultaneously. Starting from an initial model consisting of a single state, it builds the network by repeatedly splitting the state with the largest distribution in either the phoneme-environment direction or the time direction.

[0013] The principle of the successive state splitting (SSS) method is as follows. Its basic idea is to refine, step by step, a probabilistic model that represents the temporal evolution of speech feature parameters as probabilistic transitions among stochastic stationary signal sources (states) allocated in the phoneme feature space, by repeatedly splitting individual states in the context direction or the time direction according to a likelihood-maximization criterion. This allows the choice of model units, the determination of the model structure, and the estimation of each state's parameters to be achieved simultaneously under a common evaluation criterion. FIG. 6 shows the processing flow of the SSS method, and the principle is explained below with reference to FIG. 6.

[0014] First, as the initial model, a model consisting of a single state and a single path connecting that state from start to end is formed from all speech samples, and splitting begins from this state. At any point, a state is split in one of two directions: the context direction, which entails splitting a path, or the time direction, which does not. When a state is split in the context direction, the context class assigned to each path is split along with the path. The split actually adopted is the one that, among all splits possible at that point (including the ways of splitting the context class), maximizes the total likelihood over the speech samples. Repeating such state splits produces an efficient model that achieves high likelihood with a small number of states.

[0015] Tracing the state-splitting process of the SSS method yields a tree structure as shown in FIG. 8. From (a) to (b) in FIG. 8, state S0 is split into states S0 and S1, and a node $N_1$ branching from the original node $N_0$ branches in parallel, on an equal footing, to states S0 and S1. From (b) to (c), state S1 is split into states S1 and S2, and a node $N_2$ branching from node $N_1$ branches in parallel to states S1 and S2. From (c) to (d), state S0 is split into states S0 and S3, and a node $N_3$ branching from node $N_1$ branches in parallel to states S0 and S3. From (d) to (e), state S2 is split into states S2 and S4, and a node $N_4$ branching from node $N_2$ branches in parallel to states S2 and S4. In the same way, states continue to be split so that each node branches in two, and the tree structure is built.
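The split history of FIG. 8 can be recorded as a binary tree with a few lines of Python. The sketch below is an illustration only: the class names `Node` and `SplitTree` and the rule that a split turns the old leaf into the next internal node are assumptions chosen so that the node labels match the description above.

```python
class Node:
    """One node of the split-history tree; a leaf carries a state id."""
    def __init__(self, label, parent=None, state=None):
        self.label = label
        self.parent = parent
        self.state = state
        self.children = []

class SplitTree:
    def __init__(self, first_state=0):
        # (a) in FIG. 8: a single state S0 under the root node N0.
        self.root = Node("N0")
        leaf = Node("S%d" % first_state, parent=self.root, state=first_state)
        self.root.children.append(leaf)
        self.leaf_of = {first_state: leaf}
        self.n_internal = 1

    def split(self, state, new_state):
        """Record that `state` was split into `state` and `new_state`."""
        node = self.leaf_of[state]
        node.label = "N%d" % self.n_internal  # the old leaf becomes internal
        node.state = None
        self.n_internal += 1
        for s in (state, new_state):
            leaf = Node("S%d" % s, parent=node, state=s)
            node.children.append(leaf)
            self.leaf_of[s] = leaf

    def states_below(self, node):
        """All state ids in the subtree rooted at `node`."""
        if node.state is not None:
            return [node.state]
        found = []
        for child in node.children:
            found.extend(self.states_below(child))
        return found

# Splits (a)->(b)->(c)->(d)->(e) of FIG. 8.
tree = SplitTree()
for old, new in [(0, 1), (1, 2), (0, 3), (2, 4)]:
    tree.split(old, new)
print(sorted(tree.states_below(tree.root)))
```

Keeping a `parent` pointer on every node is what later allows the tree to be traced upward from the leaf of a given state.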

[0016] In the tree structure obtained from the SSS state-splitting process, the states below any given node were a single state at some earlier stage, so their phoneme environments are considered highly similar, and the similarity is considered higher for lower-layer nodes. The method of the present invention uses this tree to select the neighboring vectors of the VFS method according to the similarity of phoneme environment, thereby taking the phoneme-environment-specific properties of the movement vectors into account and improving their estimation accuracy.

[0017] Next, the method according to the present invention for selecting the neighboring vectors used in the interpolation and smoothing processes is described. For a target vector p of VFS interpolation or smoothing, the neighboring vectors k are selected by the following procedure.
(1) Extract the lowest-layer node corresponding to the state to which the target vector p belongs.
(2) Trace the tree structure upward from the extracted lowest-layer node until the number of speaker-adapted (trained) vectors in the states at or below some higher-layer node is K or more, and take that node as the top node.
(3) Among the vectors in the states at or below the top node, take the K vectors with the smallest distance $d_{p,k}$ between the target vector p and the neighboring vector k as the selected vectors for the interpolation and smoothing processes.
That is, this embodiment is characterized in that the tree structure built by the SSS state-splitting process is used to select, from among the vectors in the layers below a certain node of the tree, the K vectors with the smallest distance $d_{p,k}$ between the target vector p and the neighboring vector k.
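Steps (1) to (3) can be sketched in Python as below. This is a minimal sketch under stated assumptions: the tree is given as hypothetical `parent` and `children` dictionaries, the state leaf itself is treated as the lowest-layer node, Euclidean distance stands in for $d_{p,k}$, and if the root is reached with fewer than K trained vectors all available vectors are returned (a fallback the text does not specify).

```python
import math

def distance(a, b):
    """Euclidean distance, used here as the distance d_{p,k} (assumed)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def states_below(node, children):
    """All leaf states in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(states_below(kid, children))
    return found

def select_neighbors(state, target, parent, children, trained, k):
    """Return (top node, K nearest trained vectors under that node)."""
    node = state                                  # step (1)
    while True:                                   # step (2): climb the tree
        pool = [v for s in states_below(node, children)
                for v in trained.get(s, [])]
        if len(pool) >= k or parent.get(node) is None:
            break
        node = parent[node]
    pool.sort(key=lambda v: distance(target, v))  # step (3)
    return node, pool[:k]

# Tree of FIG. 8(e); the trained vectors are assumed toy values.
parent = {"N0": None, "N1": "N0", "N3": "N1", "N2": "N1", "N4": "N2",
          "S0": "N3", "S3": "N3", "S1": "N2", "S2": "N4", "S4": "N4"}
children = {"N0": ["N1"], "N1": ["N3", "N2"], "N3": ["S0", "S3"],
            "N2": ["S1", "N4"], "N4": ["S2", "S4"]}
trained = {"S0": [[0.9, 1.1]], "S1": [[2.0, 2.0]],
           "S3": [[9.0, 9.0]], "S4": [[1.0, 1.0]]}

top, chosen = select_neighbors("S2", [1.0, 1.2], parent, children, trained, 2)
print(top, chosen)
```

In this toy case the vector in state S0 is the closest one in Euclidean terms, yet it is not selected, because the climb stops at the node under which K = 2 trained vectors are first available and S0 lies in a different branch.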

[0018] FIG. 9 shows an example of tree nodes and the corresponding vector selection ranges when neighboring vectors are selected for a vector in state j. As the tree is traced upward from the lowest-layer node $N_{j0}$ corresponding to state j through the nodes $N_{j1}$, $N_{j2}$, $N_{j3}$, ..., the vector selection range at each node is one of the following groups, and neighboring vectors are selected within that range:
(a) group G0, containing state Sj of node $N_{j0}$;
(b) group G1, containing states Sj, Sk, Sl in the tree from node $N_{j1}$ down to the lowest layer;
(c) group G2, containing states Sj, Sk, Sl, Sm in the tree from node $N_{j2}$ down to the lowest layer;
(d) group G3, containing states Sj, Sk, Sl, Sm, Sn, So in the tree from node $N_{j3}$ down to the lowest layer; and so on.

[0019] The HM network 11 (the bottommost one in FIG. 6), which is generated automatically by the SSS method and connected to the phoneme matching unit 4, can be represented as a network of multiple states. Each state can be regarded as one stochastic stationary signal source in the speech space and holds the following information:
(a) a state number;
(b) the acceptable context class;
(c) lists of preceding and succeeding states;
(d) the parameters of the probability distribution allocated in the speech feature space;
(e) the self-transition probability and the transition probabilities to succeeding states.
In the HM network 11, given input data and its context information, the model for the input data is determined uniquely by concatenating the states that can accept that context, within the constraints of the preceding-state and succeeding-state lists. Because this model is equivalent to an HMM in which multiple states are connected in cascade and each state has a self-loop, as shown in FIG. 7, the forward-pass algorithm for likelihood computation and the Baum-Welch algorithm for parameter estimation can be used as they are, just as for an ordinary HMM. The output probability density function is a Gaussian mixture with 34-dimensional diagonal covariance matrices (hereinafter, Gaussian distribution), and each Gaussian distribution is trained by the speaker adaptation control unit 31 using the initial speaker model 30.
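The per-state information (a) to (e) can be pictured as a small record type. The Python sketch below is only an illustration: the field names, the dictionary layout of the context class, and the example values are assumptions, and only the 34-dimensional diagonal-covariance Gaussian mixture is taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Gaussian:
    weight: float          # mixture weight
    mean: List[float]      # mean vector
    variance: List[float]  # diagonal of the covariance matrix

@dataclass
class HMNetState:
    state_id: int                        # (a) state number
    context_class: Dict[str, List[str]]  # (b) acceptable context class
    preceding: List[int]                 # (c) preceding states
    succeeding: List[int]                # (c) succeeding states
    mixture: List[Gaussian]              # (d) output distribution parameters
    self_transition: float               # (e) self-transition probability
    next_transition: Dict[int, float]    # (e) transitions to succeeding states

DIM = 34  # feature dimension stated in the text

# Example state with assumed values.
state = HMNetState(
    state_id=7,
    context_class={"preceding": ["a", "i"], "center": ["k"], "succeeding": ["o"]},
    preceding=[3],
    succeeding=[9, 12],
    mixture=[Gaussian(1.0, [0.0] * DIM, [1.0] * DIM)],
    self_transition=0.6,
    next_transition={9: 0.3, 12: 0.1},
)
print(state.state_id, len(state.mixture[0].mean))
```

In a speaker adaptation step of the kind described here, only the `mean` fields would change, while weights, variances and transition probabilities stay fixed.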

[0020] In general, when speaker adaptation with a small amount of adaptation data is applied to a continuous-density HMM, adapting the means of the Gaussian distributions is known to be more effective than adapting the other parameters (see, for example, Prior Art Document 1). In this embodiment only the mean of each Gaussian distribution is adapted; the variances, the state transition probabilities, and the mixture weights of the Gaussian mixtures are not adapted.

[0021] In this embodiment, the buffer memory 3, the HM network 11, the LR table 13, the context-free grammar database 20, the initial speaker model 30, and the speaker-adaptation training data 32 are stored in a storage device such as a hard disk memory.

[0022] The speaker adaptation process in the speaker adaptation control unit 31 is described below with reference to FIGS. 2 and 3. In this process, the initial speaker model 30, which consists of initial phoneme HMMs, is first trained using speaker-adaptation training data (hereinafter, training data) 32 that includes, for example, sentence-utterance text data. Phoneme HMMs are concatenated according to the phoneme label sequence corresponding to the sentence-utterance text data to form a sentence HMM; the sentence HMM is trained with the sentence-utterance data serving as the speaker-adaptation training data and is then separated again into phoneme HMM units. In this way the HM network 11 consisting of phoneme HMMs is trained.

[0023] That is, in this speaker adaptation process, the means of the standard speaker's phoneme HMMs are retrained for the phonemes contained in the unknown speaker's speech. First, the standard speaker's phoneme HMMs are taken as the initial speaker model for the unknown speaker's phoneme HMMs. The unknown speaker's HMMs are then concatenated to match the phoneme sequence of the unknown speaker's input speech, and, of the HMM transition probabilities, the means and variances of the output probabilities, and the branch probabilities, only the means are trained by concatenated training. Specifically, the difference between the HMM mean vectors before and after concatenated training is regarded as a movement vector, and for HMM mean vectors that were not trained, the movement vector is interpolated and the mean vector is moved accordingly.

[0024] First, in step S1, the movement vectors are computed as follows. From the k-th trained mean vector $c_k^I$ in the set $C^I = (c_1^I, \ldots, c_K^I)$ of mean vectors of the Gaussian distributions of all phoneme HMMs of the unknown speaker in the initial speaker model (where K is the total number of Gaussian distributions, $k \in K_1$, and $K_1$ is the set of indices of mean vectors of HMMs for phonemes that occurred in the training speech) and the corresponding $c_k^R$ in the set $C^R$ of mean vectors of the Gaussian distributions of the standard speaker in the speaker-adaptation training data, the difference vector $v_k$ of the mean vectors is computed and taken as the movement vector in the speaker space.

[0025]

[Equation 1] $v_k = c_k^I - c_k^R, \quad k \in K_1$
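Equation 1 amounts to a component-wise subtraction for every trained index. The following Python sketch illustrates it; the function name `movement_vectors`, the dictionary layout, and the toy two-dimensional means are assumptions made for the example.

```python
def movement_vectors(adapted, reference, trained_ids):
    """Compute v_k = c_k^I - c_k^R for every k in K1 (Equation 1)."""
    return {k: [a - r for a, r in zip(adapted[k], reference[k])]
            for k in trained_ids}

# c_k^R: means before adaptation; c_k^I: means after adaptation (toy values).
reference = {0: [1.0, 2.0], 1: [0.0, 0.0], 2: [3.0, 3.0]}
adapted = {0: [1.5, 1.0], 1: [0.25, 0.5]}
k1 = {0, 1}  # indices with training data; index 2 was not trained

moves = movement_vectors(adapted, reference, k1)
print(moves)
```

Index 2 has no entry in `moves`; its movement vector is what the interpolation step has to supply.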

[0026] Here, $K_1$ is the set of Gaussian distributions for which training data existed. This is illustrated in FIG. 3. As shown in FIG. 3, when, for example, three Gaussian distributions exist in the acoustic space AS1 of the initial speaker model before adaptive training and three Gaussian distributions exist in the acoustic space AS2 of the speaker model after adaptive training, the mean vector $c_k^R$ of a Gaussian distribution before adaptive training is adapted to the mean vector $c_k^I$ of the Gaussian distribution after adaptive training.

[0027] Next, in step S2 of FIG. 2, the movement vectors are interpolated as follows. Of the set C^I of mean vectors of the Gaussian distributions of all phoneme HMMs of the unknown speaker, each mean vector c_n^I of a Gaussian distribution belonging to an untrained HMM, that is, an HMM for a phoneme for which no speaker adaptation training data existed (n ∈ K_2, where K_2 is the set of Gaussian distributions for which no speaker adaptation training data existed), is obtained by moving c_n^R by a movement vector v_n. The movement vector v_n is calculated from the trained movement vectors v_k (k ∈ K_1) and the fuzzy membership function μ_{n,k} between the mean vector c_n^R and the mean vector c_k^R. Here, the movement vectors v_k (k ∈ K_1) are those selected, as described above, using the tree structure of the state splitting process of the successive state splitting (SSS) method.

[0028]

[Equation 2] (see original document)

[Equation 3] c_n^I = c_n^R + v_n

[Equation 4] (see original document)

[0029] Here, d_{n,k} denotes the distance between the mean vector c_n^R and the mean vector c_k^R. The movement vector calculation and interpolation described above are explained with reference to FIG. 4. FIGS. 4(a) and 4(b) show the case where the HMMs contain four Gaussian distributions in total. Concatenated training has moved the mean vectors c_1^R, c_2^R and c_3^R to the mean vectors c_1^I, c_2^I and c_3^I, respectively, while the mean vector c_n^R has not been trained. In this case, the mean vector c_n^I is calculated using c_1^R, c_2^R, c_3^R, the movement vectors v_1, v_2, v_3, and the fuzzy membership functions μ_{n,1}, μ_{n,2}, μ_{n,3}.

[0030] As shown in FIG. 4(a), three mean vectors c_1^R, c_2^R and c_3^R lie in the neighborhood of the mean vector c_n^R of an untrained Gaussian distribution for which no speaker adaptation training data existed. Then, as shown in FIG. 4(b), the movement vector v_n of the mean vector c_n^R is obtained from their movement vectors v_1, v_2 and v_3 using Equation 3, and this interpolation of the movement vector yields the mean vector c_n^I of the untrained Gaussian distribution.
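The interpolation of step S2 can be sketched as follows. Equations 2 and 4 are not reproduced in this text, so the sketch assumes a commonly used form: v_n is the membership-weighted sum of the neighboring movement vectors, and the fuzzy membership μ_{n,k} is the normalized inverse of the distance d_{n,k}. Both forms, the function name `interpolate_mean`, and the sample values are assumptions, not the patent's stated equations. Only Equation 3 (c_n^I = c_n^R + v_n) is taken directly from the text.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def interpolate_mean(c_n_ref: Vector,
                     neighbor_refs: Dict[int, Vector],
                     neighbor_moves: Dict[int, Vector]) -> Tuple[Vector, Vector]:
    """Return (c_n^I, v_n) for an untrained mean vector c_n^R.

    neighbor_refs  : trained neighbors c_k^R, keyed by index k.
    neighbor_moves : their movement vectors v_k.
    """
    dists = {k: math.dist(c_n_ref, c_k) for k, c_k in neighbor_refs.items()}
    coincident = [k for k, d in dists.items() if d == 0.0]
    if coincident:
        # A trained vector at the same position: reuse its movement vector.
        v_n = list(neighbor_moves[coincident[0]])
    else:
        # ASSUMED membership: mu_{n,k} = (1/d_{n,k}) / sum_j (1/d_{n,j}).
        total = sum(1.0 / d for d in dists.values())
        mu = {k: (1.0 / d) / total for k, d in dists.items()}
        # ASSUMED interpolation: v_n = sum_k mu_{n,k} * v_k.
        v_n = [sum(mu[k] * neighbor_moves[k][i] for k in mu)
               for i in range(len(c_n_ref))]
    c_n_adapted = [c + v for c, v in zip(c_n_ref, v_n)]  # Equation 3
    return c_n_adapted, v_n


# Hypothetical layout echoing FIG. 4: three trained neighbors around c_n^R.
refs = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
moves = {1: [0.3, 0.0], 2: [0.0, 0.3], 3: [0.3, 0.3]}
c_n_adapted, v_n = interpolate_mean([0.0, 0.0], refs, moves)
print(c_n_adapted, v_n)
```

In this example the three neighbors are equidistant, so each receives the same membership and v_n is simply the average of the three movement vectors.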

[0031] The model obtained by the above steps contains estimation errors when the number of adaptation words is insufficient. The directions of movement vectors derived from a model containing such errors are considered to vary discontinuously. Therefore, a continuity constraint is imposed on the movement vectors used to move through the speaker space so that their directions are aligned. In other words, smoothing is performed to absorb the estimation errors.

[0032] Next, in the smoothing of step S3, for each mean vector c_k^I, the difference vector v_m between the m-th mean vector c_m^I in its neighborhood and c_m^R is obtained. The difference vectors v_m are then smoothed using the fuzzy membership function μ_{k,m}, and the smoothed movement vector v_k^S is obtained using Equation 5 below.

[0033]

[Equation 5] (see original document)

[0034] Here, N(k) denotes the indices of the mean vectors in the k-neighborhood of the mean vector c_k^R, and α_m is a constant giving the reliability of v_m. When k = m, μ_{k,m} = 1. The mean vectors in the k-neighborhood of c_k^R are, as described above, the mean vectors selected using the tree structure of the state splitting process of the successive state splitting (SSS) method.

[0035] Finally, in step S4, using the processed movement vector v_k^S and the mean vector c_k^R, the mean vector c_k^R of the initial speaker model is adapted to the unknown speaker as shown in Equation 6 below. That is, the initial speaker model 30 is trained by speaker adaptation using the calculated movement vectors v_k^S, and the resulting speaker model of the phoneme HMMs is stored in the memory of the HM network 11.

[0036]

[Equation 6] c_k^S = c_k^R + v_k^S

[0037] Here, c_k^S is the mean vector of the Gaussian distribution of the speaker-adapted phoneme HMM obtained after smoothing. In this embodiment, α_m = 1 (m ∈ K_1) and α_m = 0 (m ∈ K_2). The fuzzy membership function μ_{k,m} (k ≠ m) was calculated using all mean vectors c_m^R with m ∈ K_1.

[0038] The above processing is explained with reference to FIG. 5. FIG. 5 shows the case where the HMMs contain four Gaussian distributions in total. Suppose that the processing in steps S3 to S5 has moved the mean vectors c_1^R, c_2^R, c_3^R and c_k^R to c_1^I, c_2^I, c_3^I and c_k^I, respectively. Consider the movement vector v_k corresponding to c_k^I. The movement vector v_k is smoothed using v_1, v_2, v_3 and v_k, their corresponding fuzzy membership functions, and the reliability weighting coefficients α_m of the respective movement vectors, giving v_k^S.
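The smoothing of step S3 and the update of step S4 can be sketched as below. Equation 5 is not reproduced in this text, so the sketch assumes a normalized weighted average, v_k^S = Σ α_m μ_{k,m} v_m / Σ α_m μ_{k,m} over m in N(k). This form is an assumption. The memberships μ_{k,m} are passed in as data rather than computed, because their formula is likewise not given here. The rule μ_{k,k} = 1, the α values (1 for K_1, 0 for K_2) and Equation 6 come from the text. The function name and sample numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def smooth_movement_vector(k: int,
                           neighborhood: Sequence[int],
                           v: Dict[int, Vector],
                           mu: Dict[Tuple[int, int], float],
                           alpha: Dict[int, float]) -> Vector:
    """Return the smoothed movement vector v_k^S (ASSUMED weighted average)."""
    dim = len(v[k])
    numerator = [0.0] * dim
    denominator = 0.0
    for m in neighborhood:
        membership = 1.0 if m == k else mu[(k, m)]  # mu_{k,m} = 1 when k = m
        weight = alpha[m] * membership
        denominator += weight
        for i in range(dim):
            numerator[i] += weight * v[m][i]
    if denominator == 0.0:
        return list(v[k])  # nothing reliable nearby: leave v_k unchanged
    return [x / denominator for x in numerator]


# Hypothetical values: vector 2 belongs to K_2, so alpha_2 = 0 and it is ignored.
v = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [9.0, 9.0]}
mu = {(0, 1): 0.5, (0, 2): 0.5}
alpha = {0: 1.0, 1: 1.0, 2: 0.0}
v_s = smooth_movement_vector(0, [0, 1, 2], v, mu, alpha)

c_k_ref = [2.0, 2.0]
c_k_s = [c + d for c, d in zip(c_k_ref, v_s)]  # Equation 6
print(v_s, c_k_s)
```

With these inputs the weights are 1, 0.5 and 0, so the smoothed vector is pulled one third of the way from v_0 toward v_1.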

[0039] Next, an SSS-LR (left-to-right rightmost) speaker-independent continuous speech recognition apparatus that uses the speaker adaptation method of this embodiment is described. This apparatus uses an efficient phoneme-environment-dependent HMM representation stored in the memory of the HM network 11. In the SSS method, the temporal evolution of speech parameters is represented by a probabilistic model of stochastic transitions between stochastic stationary signal sources (states) assigned in the phoneme feature space. The model is refined successively by repeatedly splitting individual states in the context direction or the time direction according to a maximum likelihood criterion.

[0040] In FIG. 1, the speaker adaptation control unit 31 takes the initial speaker model 30, which includes speaker class models, and calculates movement vectors by the speaker adaptation process shown in FIG. 2 using the speaker adaptation training data 32, which is, for example, text data of uttered sentences. It then performs adaptive training using the calculated movement vectors, converts the result into a speaker-independent phoneme model of HMMs, and stores it in the memory of the HM network 11. Meanwhile, the speaker's utterance is input to the microphone 1, converted into a speech signal, and input to the feature extraction unit 2. The feature extraction unit 2 performs A/D conversion on the input speech signal and then, for example, LPC analysis, and extracts 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3.
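The 34 dimensions follow from 1 (log power) + 16 (cepstrum) + 1 (Δ log power) + 16 (Δ cepstrum). The short sketch below assembles one such frame. The ordering of the four groups inside the vector and the function name `build_feature_vector` are assumptions for illustration; the text specifies only which parameters are included.

```python
from typing import List

CEPSTRUM_ORDER = 16  # from the text: 16th-order cepstrum and delta cepstrum


def build_feature_vector(log_power: float,
                         cepstrum: List[float],
                         delta_log_power: float,
                         delta_cepstrum: List[float]) -> List[float]:
    """Concatenate the four parameter groups into one 34-dimensional frame.

    The group ordering used here is an assumption, not taken from the text.
    """
    if len(cepstrum) != CEPSTRUM_ORDER or len(delta_cepstrum) != CEPSTRUM_ORDER:
        raise ValueError("expected 16 cepstrum and 16 delta-cepstrum coefficients")
    return [log_power, *cepstrum, delta_log_power, *delta_cepstrum]


# Placeholder coefficient values, used only to check the layout.
frame = build_feature_vector(-1.5, [0.0] * 16, 0.1, [0.0] * 16)
print(len(frame))
```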

[0041] The phoneme matching unit 4 performs phoneme matching in response to a phoneme matching request from the phoneme-context-dependent LR parser 5. The likelihood of the data in the phoneme matching interval is calculated using the speaker model of the phoneme HMMs stored in the memory of the HM network 11, and this likelihood is returned to the LR parser 5 as the phoneme matching score. The forward-pass algorithm is used for this calculation.

[0042] Meanwhile, a predetermined context-free grammar (CFG) in the context-free grammar database 20 is automatically converted in a known manner into an LR table, which is stored in the memory of the LR table 13. The LR parser 5 refers to the LR table 13 and processes the input phoneme prediction data from left to right without backtracking. Where there is syntactic ambiguity, the stack is split and all candidates are analyzed in parallel. The LR parser 5 predicts the next phoneme from the LR table 13 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching by referring to the information in the HM network 11 corresponding to that phoneme and returns the likelihood to the LR parser 5 as the speech recognition score. Continuous speech is recognized by concatenating phonemes in sequence in this way. When multiple phonemes are predicted during continuous speech recognition, all of them are checked, and high-speed processing is achieved by beam-search pruning that retains only the partial trees with high partial recognition likelihood.
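The pruning step at the end of this paragraph can be illustrated with a minimal sketch. It keeps a fixed number of partial hypotheses ranked by log-likelihood. The fixed beam width, the tuple layout `(log_likelihood, phonemes)` and the function name `prune_beam` are assumptions; the text states only that a beam search retains the partial trees with high likelihood.

```python
import heapq
from typing import List, Tuple

Hypothesis = Tuple[float, List[str]]  # (log-likelihood, phoneme sequence so far)


def prune_beam(hypotheses: List[Hypothesis], beam_width: int) -> List[Hypothesis]:
    """Keep the beam_width partial hypotheses with the highest log-likelihood."""
    return heapq.nlargest(beam_width, hypotheses, key=lambda h: h[0])


# Hypothetical partial hypotheses after the parser predicted three phonemes.
candidates = [
    (-2.5, ["k", "a"]),
    (-7.0, ["k", "o"]),
    (-1.0, ["k", "i"]),
]
kept = prune_beam(candidates, beam_width=2)
print(kept)
```

A threshold relative to the best score would be an equally plausible pruning rule; the fixed width is used here only because it is the shortest to show.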

[0043] The above embodiment uses the reliability weighting coefficient α_m for each movement vector, but the present invention is not limited to this. The weighting coefficient α_m may instead be replaced by a weighting coefficient λ_{a,b} controlled as detailed below. In this embodiment, the number of neighbors K is 6, and the weighting coefficient λ_{a,b} can be a Gaussian window calculated by Equation 7 below.

[0044]

[Equation 7] λ_{a,b} = exp(−d_{a,b} / fp)

[0045] Here, d_{a,b} is the distance between the mean vector c_a^I and the mean vector c_b^I, and fp is a predetermined positive weight control parameter, namely a smoothing coefficient that indicates the strength of smoothing, given by Equation 8 below.

[0046]

[Equation 8] fp = (f · α) / (n_p + α)

[0047] Here, f is a predetermined initial value of the smoothing coefficient fp that is common to all parameters, and n_p is the amount of adaptation training data for the p-th Gaussian distribution.

[0048] That is, in this embodiment, the smoothing coefficient fp is controlled independently for each parameter, that is, for each Gaussian mean. This takes into account that the amount of adaptation training data available for each parameter is uneven, depending on the content of the adaptation training data, and it allows control by a criterion that does not depend on the model structure, such as the number of states or the number of mixture components. With Equation 8, the strength of smoothing for each parameter weakens as the amount of adaptation training data increases, and as the data amount n_p tends to infinity the result converges to the same state as when no smoothing is performed. The speed of this convergence is determined by the coefficient α. In this embodiment, an experimentally determined value of α was used.
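The behavior described here can be checked numerically with Equations 7 and 8. In the sketch below, the function names and the sample values f = 1.0 and α = 10.0 are assumptions chosen for illustration; the original gives no numeric values for f or α.

```python
import math


def smoothing_coefficient(f: float, alpha: float, n_p: float) -> float:
    """Equation 8: fp = (f * alpha) / (n_p + alpha)."""
    return (f * alpha) / (n_p + alpha)


def gaussian_window_weight(d_ab: float, fp: float) -> float:
    """Equation 7: lambda_{a,b} = exp(-d_{a,b} / fp), for fp > 0."""
    return math.exp(-d_ab / fp)


F_INITIAL = 1.0   # hypothetical initial value f
ALPHA = 10.0      # hypothetical convergence-speed coefficient
DISTANCE = 1.0    # hypothetical distance d_{a,b} to a neighbor

for n_p in (0, 10, 100, 1000):
    fp = smoothing_coefficient(F_INITIAL, ALPHA, n_p)
    weight = gaussian_window_weight(DISTANCE, fp)
    print(n_p, fp, weight)
```

As n_p grows, fp shrinks toward zero and the weight given to any neighbor at non-zero distance vanishes, which matches the statement that the result converges to the unsmoothed case.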

[0049] Finally, in step S4, the initial speaker model stored in the memory 30 is trained by speaker adaptation using the movement vectors calculated in step S2 or S3, and the resulting speaker model of the HM network is stored in the memory of the HM network 11.

[0050] This embodiment uses the smoothing coefficient fp given by Equation 8, but the present invention is not limited to this. It suffices to use a smoothing coefficient fp, indicating the strength of smoothing, that is determined in advance so that the strength of smoothing decreases at least as the amount of speaker adaptation training data for the Gaussian distribution increases. For example, the smoothing coefficients fp1 to fp4 given by Equations 9 to 12 below may be used in place of fp.

[0051]

[Equation 9] fp1 = f{1 − (n_i/α)} when n_i < α; fp1 = 0 when n_i ≥ α

[Equation 10] fp2 = f{1 − (n_i/α)}² when n_i < α; fp2 = 0 when n_i ≥ α

[Equation 11] fp3 = f · exp(−n_i/α)

[Equation 12] fp4 = f · exp(−n_i/α)²
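The four alternative coefficients can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the sample values of f and α are assumptions, and Equation 12 is read as squaring the exponential term, which is one possible reading of the notation.

```python
import math


def fp1(f: float, alpha: float, n_i: float) -> float:
    """Equation 9: linear decay to zero at n_i = alpha."""
    return f * (1.0 - n_i / alpha) if n_i < alpha else 0.0


def fp2(f: float, alpha: float, n_i: float) -> float:
    """Equation 10: quadratic decay to zero at n_i = alpha."""
    return f * (1.0 - n_i / alpha) ** 2 if n_i < alpha else 0.0


def fp3(f: float, alpha: float, n_i: float) -> float:
    """Equation 11: exponential decay."""
    return f * math.exp(-n_i / alpha)


def fp4(f: float, alpha: float, n_i: float) -> float:
    """Equation 12, ASSUMED reading: the exponential term is squared."""
    return f * math.exp(-n_i / alpha) ** 2


# Hypothetical values f = 1.0 and alpha = 10.0.
for n_i in (0.0, 5.0, 10.0, 20.0):
    print(n_i, fp1(1.0, 10.0, n_i), fp2(1.0, 10.0, n_i),
          fp3(1.0, 10.0, n_i), fp4(1.0, 10.0, n_i))
```

All four decrease as n_i grows, which is the only property the text requires. Note that fp1 and fp2 reach exactly zero, so a caller using them in Equation 7 would need to treat fp = 0 as "no smoothing" rather than divide by it.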

[0052] In the above embodiment, the speaker adaptation control unit 31, the phoneme matching unit 4, and the LR parser 5 are implemented by, for example, a digital computer. The above embodiment uses the HM network 11, in which the phoneme HMMs are represented as a network, but the present invention is not limited to this, and phoneme HMMs may be used in place of the HM network 11.

[0053]

[Example] The present inventors carried out the following experiment to evaluate the speech recognition apparatus of this embodiment. An HM network with 200 states was used. The initial speaker model before speaker adaptation was a speaker-independent model (see Kosaka et al., "A method for creating speaker-independent models using a clustering technique", Proceedings of the Acoustical Society of Japan, 1-R-12, November 1994), that is, a model created by synthesis from speaker-independent models for 285 speakers, with five mixture components per state. The number of neighboring vectors used for the conventional VFS method was 6. Table 1 lists the analysis conditions, the parameters used, and the adaptation and recognition data. In the experiment, for each number of adaptation phrases, the evaluation was repeated three times with different selected phrases, and the average phoneme recognition rate was calculated.

[0054]

[Table 1] Experimental conditions
───────────────────────────────────
Analysis conditions
  Sampling frequency: 12 kHz
  Hamming window: 20 ms
  Frame period: 5 ms
───────────────────────────────────
Parameters used
  16th-order LPC cepstrum + 16th-order Δ cepstrum + log power + Δ log power
───────────────────────────────────
Training data
  146 male + 139 female speakers (50 sentences per speaker)
───────────────────────────────────
Adaptation / recognition data
  (a) Speakers: 4 male (MAU, MMY, MSH, MTM)
  (b) Adaptation data: n phrases taken at random from 598 phrases (tasks SB1, SB2 and SB4, owned by the present applicant)
  (c) Recognition data: 279 phrases (task SB3, owned by the present applicant)
───────────────────────────────────
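The analysis conditions in Table 1 translate into sample counts as follows. The sketch converts the 20 ms window and 5 ms frame period at 12 kHz into samples and counts the frames in a signal. The rule that only complete windows are counted, with no padding at the end, is an assumption, as are the function names.

```python
SAMPLING_RATE_HZ = 12_000  # Table 1: sampling frequency
WINDOW_MS = 20             # Table 1: Hamming window length
FRAME_PERIOD_MS = 5        # Table 1: frame period


def to_samples(duration_ms: int, rate_hz: int = SAMPLING_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return rate_hz * duration_ms // 1000


def frame_count(num_samples: int) -> int:
    """Number of complete analysis windows (ASSUMED: no padding at the end)."""
    window = to_samples(WINDOW_MS)
    shift = to_samples(FRAME_PERIOD_MS)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


window_samples = to_samples(WINDOW_MS)
shift_samples = to_samples(FRAME_PERIOD_MS)
frames_in_one_second = frame_count(SAMPLING_RATE_HZ)
print(window_samples, shift_samples, frames_in_one_second)
```

Each window therefore spans 240 samples and successive windows overlap by three quarters of their length.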

[0055] Table 2 shows the results of the phoneme recognition experiment with the four male speakers. For comparison, it also shows the results of the conventional VFS method, in which the neighboring vectors used for interpolation and smoothing of the movement vectors are selected by the distance d_{p,k}.

[0056]

[Table 2] Speaker adaptation results: phoneme recognition error rate (%)
Upper row: conventional VFS method; lower row: method of this embodiment
───────────────────────────────────────────────
Speaker   Before       Number of adaptation phrases
          adaptation   10     20     30     40     50
───────────────────────────────────────────────
MAU       19.1         17.4   14.8   13.4   12.7   11.9
                       17.3   14.2   13.3   12.7   12.1
───────────────────────────────────────────────
MMY       20.8         17.5   16.1   15.3   14.5   14.3
                       18.6   16.3   15.1   14.4   14.0
───────────────────────────────────────────────
MSH       26.9         19.4   17.5   17.2   16.8   16.5
                       20.2   17.8   16.8   15.9   15.6
───────────────────────────────────────────────
MTM       18.7         14.2   12.2   10.8   10.7   10.5
                       15.1   12.1   10.6    9.8   10.1
───────────────────────────────────────────────
Average   21.4         17.1   15.2   14.2   13.6   13.3
                       17.8   15.1   13.9   13.2   13.0
───────────────────────────────────────────────

[0057] As Table 2 shows, when the number of adaptation phrases is small, the conventional VFS method gives a higher recognition rate than the method of this embodiment, which uses the tree structure of the state splitting process of the successive state splitting (SSS) method (see the case of 10 adaptation phrases, where the average error rate is 17.1% versus 17.8%). When the number of adaptation phrases is larger, however, the method of this embodiment gives a slightly higher recognition rate (see the cases of 20, 30, 40 and 50 adaptation phrases). A likely reason for the lower recognition rate of this embodiment with few adaptation phrases is that, because few vectors have been adaptively trained, the tree structure is climbed to nodes in the upper layers and vectors with low phoneme-environment similarity are selected. With many adaptation phrases, vectors are selected from the states below nodes in the lower layers of the SSS tree structure, and the recognition rate is higher than that of the conventional VFS method. Interpolating and smoothing the movement vectors with vectors from the states below a lower-layer node of the SSS tree structure therefore takes the phoneme environment into account and proves more effective than selection by inter-vector distance alone.
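The comparison drawn from Table 2 can be reproduced from the average rows alone. The sketch below uses only the averages printed in the table; the variable and function names are hypothetical, and the relative error reduction is an added convenience measure, not a figure reported in the original.

```python
# Average phoneme recognition error rates (%) from Table 2.
BEFORE_ADAPTATION = 21.4
VFS = {10: 17.1, 20: 15.2, 30: 14.2, 40: 13.6, 50: 13.3}       # conventional
SSS_TREE = {10: 17.8, 20: 15.1, 30: 13.9, 40: 13.2, 50: 13.0}  # this embodiment


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate removed by adaptation."""
    return (before - after) / before


# Phrase counts at which the tree-based selection has the lower error rate.
better_counts = [n for n in sorted(VFS) if SSS_TREE[n] < VFS[n]]

for n in sorted(VFS):
    print(n,
          f"{relative_reduction(BEFORE_ADAPTATION, VFS[n]):.1%}",
          f"{relative_reduction(BEFORE_ADAPTATION, SSS_TREE[n]):.1%}")
print(better_counts)
```

The list of phrase counts confirms the pattern described in the text: the conventional method is ahead only at 10 phrases.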

[0058] In the above experiment, the smoothing coefficient of the conventional VFS method was not controlled. However, as in the variation described at the end of the above embodiment, controlling the smoothing coefficient according to phoneme-environment similarity is expected to improve the speech recognition rate further.

[0059] As described above, the neighboring vectors used in the interpolation and smoothing of the conventional VFS method are selected using the tree structure built by the state splitting process of the successive state splitting (SSS) method. The interpolation and smoothing therefore incorporate phoneme-environment similarity, so that the estimation accuracy of the movement vectors, and hence the speech recognition rate, can be improved compared with the conventional example.

[0060]

[Effects of the Invention] As detailed above, the speaker adaptation apparatus according to claim 1 of the present invention calculates a hidden Markov model speaker model for speech recognition by training an initial speaker model through speaker adaptation based on speaker adaptation training data, using movement vectors that express the relationship between the feature vectors of the hidden Markov model before and after speaker adaptation. The apparatus comprises smoothing means and interpolation means. The smoothing means smooths a first feature vector of the hidden Markov model, for which speaker adaptation training data existed and which has been speaker-adapted based on that data, using the first feature vector and a plurality of second feature vectors of the speaker-adapted hidden Markov model in its neighborhood. The interpolation means interpolates each mean vector of a Gaussian distribution of the speaker-adapted hidden Markov model for which no speaker adaptation training data existed, and which was therefore not calculated by the smoothing means. For this it uses the movement vectors of mean vectors of Gaussian distributions of the speaker-adapted hidden Markov model for which speaker adaptation training data existed, which were calculated by the smoothing means, and which lie in the neighborhood of the pre-adaptation mean vector corresponding to the mean vector being interpolated. The smoothing means and the interpolation means each comprise selection means that uses the tree structure of the state splitting process of the successive state splitting method to select, from the vectors in the layers below a given node of the tree structure, a predetermined number of vectors with the smallest distances between the target vector to be processed and the neighboring vector. The interpolation and smoothing therefore incorporate phoneme-environment similarity, so that the estimation accuracy of the movement vectors, and hence the speech recognition rate, can be improved compared with the conventional example.

[0061] In the above speaker adaptation apparatus, the selection means extracts the lowest-layer node corresponding to the state to which the target vector belongs. It then climbs the tree structure from the extracted lowest-layer node until the number of speaker-adaptation-trained vectors in the states below a node in a higher layer reaches at least the predetermined number, and takes that node as the top node. From the vectors in the states below the top node, it selects the predetermined number of vectors with the smallest distances to the target vector as the selected vectors for the interpolation and smoothing. The interpolation and smoothing thereby incorporate phoneme-environment similarity, so that the estimation accuracy of the movement vectors, and hence the speech recognition rate, can be improved compared with the conventional example.
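The selection procedure in this paragraph can be sketched as a small tree walk. The `Node` class, the helper names, and the toy tree are assumptions made for illustration; the original describes the procedure in prose only. Euclidean distance is also an assumption, since the text does not name the distance measure.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Trained = Tuple[int, List[float]]  # (vector index, mean vector)


@dataclass
class Node:
    """One node of the state-splitting tree (hypothetical representation)."""
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)
    trained: List[Trained] = field(default_factory=list)  # held by leaf states


def add_child(parent: Node, child: Node) -> Node:
    child.parent = parent
    parent.children.append(child)
    return child


def trained_below(node: Node) -> List[Trained]:
    """All adaptation-trained vectors in the states at or below this node."""
    found = list(node.trained)
    for child in node.children:
        found.extend(trained_below(child))
    return found


def select_neighbors(leaf: Node, target: List[float], count: int) -> List[int]:
    """Climb from the leaf until enough trained vectors exist, then take the
    `count` nearest ones (ASSUMED: Euclidean distance)."""
    node = leaf
    while len(trained_below(node)) < count and node.parent is not None:
        node = node.parent
    candidates = sorted(trained_below(node),
                        key=lambda item: math.dist(item[1], target))
    return [index for index, _ in candidates[:count]]


# Toy tree: root -> (a -> a1, a2), b. Vector 3 is close but in another subtree.
root = Node()
a = add_child(root, Node())
b = add_child(root, Node(trained=[(3, [0.1, 0.0])]))
a1 = add_child(a, Node(trained=[(0, [0.0, 0.0])]))
a2 = add_child(a, Node(trained=[(1, [1.0, 0.0]), (2, [3.0, 0.0])]))

print(select_neighbors(a1, [0.2, 0.0], 2))
print(select_neighbors(a1, [0.2, 0.0], 4))
```

With two neighbors requested, the climb stops at node `a`, so vector 3 is excluded even though it is the closest by distance. This is the point at which the method differs from purely distance-based selection. With four requested, the climb reaches the root and vector 3 is included.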

[0062] Furthermore, in the above speaker adaptation apparatus, the smoothing means smooths each mean vector of a Gaussian distribution of the speaker-adapted hidden Markov model for which speaker adaptation training data existed. It does so using that mean vector and the movement vectors of the neighboring mean vectors of Gaussian distributions of the speaker-adapted hidden Markov model for which speaker adaptation training data existed and which were calculated by the smoothing means, based on the continuity constraint on the movement vectors. The smoothing uses a smoothing coefficient indicating a smoothing strength determined in advance so that the strength decreases as the amount of speaker adaptation training data for the Gaussian distribution increases. Consequently, for movement vectors with little training data, smoothing effectively absorbs the estimation error, while for parameters with small estimation error trained on a large amount of data, smoothing is weakened so that performance does not degrade. Good adaptation performance can thus be obtained consistently over a wide range of adaptation training data amounts. In addition, because the smoothing strength is controlled individually for each movement vector, smoothing can be controlled to take account of any imbalance in the phonemes contained in the adaptation training data. Therefore, speech recognition with a speaker model adapted using the calculated and smoothed movement vectors achieves a higher speech recognition rate than the conventional example, and also than the apparatus according to claim 1 or 2.

[0063] Furthermore, the speech recognition apparatus according to claim 4 of the present invention comprises the above speaker adaptation apparatus and speech recognition means that, based on the speech signal of an input uttered sentence, performs speech recognition using the hidden Markov model speaker model adapted by the speaker adaptation apparatus and outputs the recognition result. A higher speech recognition rate than the conventional example can therefore be obtained.

[Brief description of drawings]

【図１】本発明に係る一実施形態である音声認識装置
のブロック図である。FIG. 1 is a block diagram of a voice recognition device according to an embodiment of the present invention.

【図２】図１の話者適応制御部３１によって実行され
る話者適応化処理を示すフローチャートである。FIG. 2 is a flowchart showing the speaker adaptation process executed by the speaker adaptation control unit 31 of FIG. 1.

【図３】移動ベクトルを用いて図２の話者適応化処理
を実行する場合における、適応学習前の初期話者モデル
の音響空間ＡＳ１から適応学習後の話者モデルの音響空
間ＡＳ２への変換を示す概念図である。FIG. 3 is a conceptual diagram showing the conversion from the acoustic space AS1 of the initial speaker model before adaptive learning to the acoustic space AS2 of the speaker model after adaptive learning when the speaker adaptation process of FIG. 2 is executed using movement vectors.

【図４】（ａ）は、図２のステップＳ１で実行される
移動ベクトルの計算処理を示す概念図であり、（ｂ）
は、図２のステップＳ２で実行される移動ベクトルの補
間処理を示す概念図である。FIG. 4(a) is a conceptual diagram showing the movement vector calculation process executed in step S1 of FIG. 2, and FIG. 4(b) is a conceptual diagram showing the movement vector interpolation process executed in step S2 of FIG. 2.

【図５】図２のステップＳ３で実行される移動ベクト
ルの平滑化処理を示す概念図である。FIG. 5 is a conceptual diagram showing the movement vector smoothing process executed in step S3 of FIG. 2.

【図６】図１の話者適応制御部３１によって実行され
る逐次状態分割法（ＳＳＳ）の原理を示す図である。FIG. 6 is a diagram showing the principle of the sequential state division method (SSS) executed by the speaker adaptation control unit 31 of FIG. 1.

【図７】図１の音声認識装置において用いるＨＭ網の
個々のモデル構造を示す状態遷移図である。FIG. 7 is a state transition diagram showing the structure of an individual model in the HM network used in the voice recognition device of FIG. 1.

【図８】図１の話者適応制御部３１によって実行され
る逐次状態分割法（ＳＳＳ）による状態分割過程の木構
造を示す概念図である。FIG. 8 is a conceptual diagram showing the tree structure of the state division process of the sequential state division method (SSS) executed by the speaker adaptation control unit 31 of FIG. 1.

【図９】図１の話者適応制御部３１における処理にお
いて選択される近傍ベクトルの選択範囲を示す概念図で
ある。FIG. 9 is a conceptual diagram showing the selection range of the neighboring vectors selected in the processing by the speaker adaptation control unit 31 of FIG. 1.

[Explanation of symbols]

１…マイクロホン、２…特徴抽出部、３…バッファメモリ、４…音素照合部、５…ＬＲパーザ、１１…隠れマルコフ網（ＨＭ網）、１３…ＬＲテーブル、２０…文脈自由文法データベース、３０…初期話者モデル、３１…話者適応化制御部、３２…話者適応用学習データ、Ｓ１…移動ベクトルの計算処理、Ｓ２…移動ベクトルの補間処理、Ｓ３…移動ベクトルの平滑化処理、Ｓ４…処理後の移動ベクトルを用いて話者適応化する処
理。1 ... Microphone, 2 ... Feature extraction unit, 3 ... Buffer memory, 4 ... Phoneme matching unit, 5 ... LR parser, 11 ... Hidden Markov network (HM network), 13 ... LR table, 20 ... Context-free grammar database, 30 ... Initial speaker model, 31 ... Speaker adaptation control unit, 32 ... Speaker adaptation learning data, S1 ... Movement vector calculation process, S2 ... Movement vector interpolation process, S3 ... Movement vector smoothing process, S4 ... Speaker adaptation process using the processed movement vectors.

───────────────────────────────────────────────────── フロントページの続き (72)発明者外村政啓京都府相楽郡精華町大字乾谷小字三平谷５番地株式会社エイ・ティ・アール音声翻訳通信研究所内 (72)発明者松永昭一京都府相楽郡精華町大字乾谷小字三平谷５番地株式会社エイ・ティ・アール音声翻訳通信研究所内 ───────────────────────────────────────────────────── Continuation of the front page (72) Inventor: Masahiro Tonomura, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto (72) Inventor: Shoichi Matsunaga, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Koaza Sanpeidani, Oaza Inuidani, Seika-cho, Soraku-gun, Kyoto

Claims

[Claims]

1. A speaker adaptation device that calculates a speaker model of a hidden Markov model for voice recognition by speaker-adaptively learning an initial speaker model on the basis of speaker adaptation learning data, using movement vectors that indicate the relationship between feature vectors of the hidden Markov model before and after speaker adaptation, the device comprising: smoothing means for smoothing a first feature vector of the hidden Markov model, for which speaker adaptation learning data exists and which has been speaker-adapted on the basis of that data, using the first feature vector and a plurality of second feature vectors of the speaker-adapted hidden Markov model in its vicinity; and interpolation means for interpolating the mean vector of a Gaussian distribution of the speaker-adapted hidden Markov model, for which no speaker adaptation learning data exists and which has not been calculated by the smoothing means, using the movement vectors of the mean vectors of Gaussian distributions of the speaker-adapted hidden Markov model that lie in the vicinity of the corresponding mean vector of the Gaussian distribution of the hidden Markov model before speaker adaptation, for which speaker adaptation learning data exists, and that have been calculated by the smoothing means; wherein the smoothing means and the interpolation means each comprise selection means that uses the tree structure of the state division process of the sequential state division method to select, from the vectors in the layers below a node of the tree structure, a predetermined number of top-ranked vectors having the smallest distance between the target vector to be processed and a neighboring vector.

2. The speaker adaptation device according to claim 1, wherein the selection means extracts the lowest-layer node corresponding to the state to which the target vector belongs, traces the tree structure back from the extracted lowest-layer node toward nodes in higher layers until it reaches a node for which the number of speaker-adaptively learned vectors in the states below it is equal to or greater than a predetermined number, sets that node as the top node, and selects, from the vectors in the states below the top node, a predetermined number of top-ranked vectors having the smallest distance between the target vector and a neighboring vector as the selected vectors for the interpolation processing and the smoothing processing.

3. The speaker adaptation device according to claim 1, wherein the smoothing means smooths the mean vector of a Gaussian distribution of the speaker-adapted hidden Markov model, for which speaker adaptation learning data exists and which has been calculated by the smoothing means, using the movement vector of that mean vector and the movement vectors of the mean vectors of Gaussian distributions of the speaker-adapted hidden Markov model in its vicinity, for which speaker adaptation learning data exists and which have been calculated by the smoothing means, on the basis of the constraint of continuity of the movement vectors, with a smoothing coefficient indicating a smoothing strength that is predetermined so that the smoothing becomes weaker as the amount of speaker adaptation learning data for the Gaussian distribution increases.

4. A voice recognition device comprising: the speaker adaptation device according to claim 1; and voice recognition means for performing voice recognition, on the basis of the voice signal of an input uttered sentence, using the speaker model of the hidden Markov model speaker-adapted by the speaker adaptation device, and for outputting a voice recognition result.
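
The neighbor selection recited in claim 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the device: the `Node` class, the helper names, the parameters `min_count` and `k`, and the use of Euclidean distance are all hypothetical. Each leaf is assumed to hold the already-adapted vectors of one state, and the target vector itself is assumed not to be stored among the candidates.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass(eq=False)
class Node:
    """Hypothetical node of the state-division tree."""
    vectors: List[Sequence[float]] = field(default_factory=list)  # adapted vectors of this state
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def add_child(parent: Node, child: Node) -> Node:
    """Attach child under parent and return the child."""
    child.parent = parent
    parent.children.append(child)
    return child


def vectors_below(node: Node) -> List[Sequence[float]]:
    """Collect the adapted vectors stored at this node and all its descendants."""
    found = list(node.vectors)
    for child in node.children:
        found.extend(vectors_below(child))
    return found


def select_neighbors(
    leaf: Node, target: Sequence[float], min_count: int, k: int
) -> List[Sequence[float]]:
    """Select up to k nearest adapted vectors for the target vector.

    Starting at the leaf of the target's state, climb toward the root
    until the subtree holds at least min_count adapted vectors (or the
    root is reached); that node becomes the top node. Then return the k
    candidates under the top node with the smallest distance to target.
    """
    node = leaf
    while len(vectors_below(node)) < min_count and node.parent is not None:
        node = node.parent
    candidates = vectors_below(node)
    return sorted(candidates, key=lambda v: math.dist(v, target))[:k]
```

Restricting the search to the subtree under the top node means that the distance ranking is computed only among states that were split from a common ancestor, rather than over every vector in the model.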