JPH11143486A

JPH11143486A - Device and method adaptable for speaker

Info

Publication number: JPH11143486A
Application number: JP9306887A
Authority: JP
Inventors: Kazuhiko Sumiya; 和彦住谷; Nobuyuki Saito; 伸行斎藤
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1997-11-10
Filing date: 1997-11-10
Publication date: 1999-05-28

Abstract

PROBLEM TO BE SOLVED: To achieve highly accurate speaker adaptation when using maximum a posteriori probability estimation. SOLUTION: A set 17 of a large number of speaker models independent of the speaker to be adapted is prepared using acoustic analysis means equivalent to an acoustic analysis part 10. Speech of the speaker to be adapted is then input and analyzed by the acoustic analysis part 10 to obtain the distribution of that speaker's feature parameter vectors, which is stored as sample data 16 for adaptation. An adaptation model creation part 15 measures the distance between the model of the speaker to be adapted, stored as the sample data 16 for adaptation, and each of the large number (N) of speaker models stored as the set 17 of speaker models, and selects M speaker models in order of increasing distance from the model of the speaker to be adapted. The initial model is determined as a weighted sum of the selected M speaker models.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speaker adaptation technique that uses speech uttered by the speaker to be adapted as training data to modify an initial speech model and create a speech model adapted to that speaker.

[0002]

2. Description of the Related Art Hidden Markov models (hereinafter abbreviated as HMMs) are widely used in speech recognition because they cope well with spectral and temporal variation in speech and achieve high recognition accuracy. An HMM is a state transition model with transition probabilities between states and output probabilities of symbols associated with state transitions. For modeling a signal that changes continuously over time, such as a speech signal, a so-called left-to-right model, in which the state moves from left to right, is appropriate. FIG. 1 shows an example of a left-to-right HMM with four states. Here, 1, 2, 3, and 4 denote the states, and a_ij (i, j = 1, ..., 4) denotes the probability of a transition from state i to state j. b_j(o) denotes the probability that symbol o is observed in state j upon a state transition. In speech recognition, a feature parameter vector is normally used as this symbol.
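The left-to-right structure described above can be illustrated with a small transition matrix. This is a minimal sketch, not the topology of FIG. 1 itself: it assumes each state may only stay where it is or advance to the next state, and the function name `left_to_right_transitions` and the self-loop probability `p_stay` are hypothetical.

```python
def left_to_right_transitions(n_states: int, p_stay: float = 0.6) -> list[list[float]]:
    """Build a left-to-right transition matrix a[i][j] (states are 0-based here).

    Assumption: only self-loops and moves to the next state are allowed.
    """
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            a[i][i] = p_stay            # remain in state i
            a[i][i + 1] = 1.0 - p_stay  # advance to state i + 1
        else:
            a[i][i] = 1.0               # last state only loops on itself in this sketch
    return a


A = left_to_right_transitions(4)
for row in A:
    print(row)
```

Every entry below the diagonal is zero, which is what prevents the model from moving back to an earlier state.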

[0003] In HMM-based speech recognition, an HMM speech model is prepared for each unit of speech to be recognized (for example, phone, phoneme, syllable, or word), and model parameters such as the state transition probabilities a_ij and the output probabilities b_j(o) of the feature parameter vectors are determined in advance using training speech data. At recognition time, the input speech is analyzed and converted into a sequence of feature parameter vectors, and the sequence of speech units corresponding to the models most likely to have produced that sequence of parameter vectors is determined and taken as the recognition result.

[0004] In a statistical model such as an HMM, accuracy generally improves as the amount of training data increases. A highly accurate model for a given speaker can therefore be created from a large amount of speech uttered by that speaker. However, this requires collecting a large amount of speech data before use, which demands a large number of utterances from the speaker and has been a practical obstacle.

[0005] Alternatively, speech data from a large number of unspecified speakers can be collected in advance and used to build a standard, speaker-independent model. With such a model, the user does not need to produce a large number of utterances to train the speaker model before use and can start using the system immediately. However, because the model is not fitted to each speaker, recognition accuracy is insufficient. In general, recognition accuracy can be raised by preparing a speech model fitted to the individual, but creating that model requires many utterances to be prepared in advance, so recognition accuracy and user effort are in a trade-off.

[0006] For these reasons, "speaker adaptation" has attracted attention in certain applications of speech recognition. It aims to allow use to begin with only a small amount of speaker-specific utterance data and to improve recognition accuracy incrementally as further speaker-specific data is added. For applications used more or less personally, such as dictation, it is desirable that the system can be used after a simple procedure and that recognition accuracy improves progressively with use, and speaker adaptation should be effective there.

[0007] One approach to speaker adaptation modifies an initial speech model obtained in advance by incorporating the characteristics of the actual speaker and usage environment. Among such approaches, re-estimation of speech model parameters based on Bayesian estimation has been tried and has proved effective. Bayesian estimation determines parameters on the basis of Bayes' theorem, according to the following reasoning.

[0008] By Bayes' theorem, the probability P(H_i|A) that cause H_i produced an observed event A is computed from the product of the probability P(H_i) of the cause and the probability P(A|H_i) that the event arises from that cause. P(H_i) is called the prior probability, and P(H_i|A) is called the posterior probability. In other words, the posterior probability is regarded as being determined by the prior probability P(H_i), predicted beforehand, and the probability P(A|H_i) of the sample.
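The relationship between prior, sample probability, and posterior can be made concrete with a small numerical sketch. The normalization by the total probability of the event is the usual form of Bayes' theorem and is stated here as an addition to the text above; the function name `posterior` and the example numbers are hypothetical.

```python
def posterior(priors: list[float], likelihoods: list[float]) -> list[float]:
    """Return P(H_i | A) for each cause H_i.

    priors[i]      = P(H_i)      prior probability of cause i
    likelihoods[i] = P(A | H_i)  probability of the event given cause i
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(H_i) * P(A | H_i)
    evidence = sum(joint)                                  # P(A), the normalizer
    if evidence == 0.0:
        raise ValueError("event A has zero probability under every cause")
    return [j / evidence for j in joint]


# Two candidate causes: the first is more likely a priori,
# but the observed event is far more probable under the second.
post = posterior([0.7, 0.3], [0.2, 0.9])
print(post)
```

In this example the observation outweighs the prior, so the second cause ends up with the larger posterior probability.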

[0009] Take the parameters of a distribution as the cause and the sampled data as the result. The posterior probability of the distribution parameters is then obtained from the prior probability, which is a prediction of those parameters, and the probability of the result obtained from the sample data. Hence, if a correct prediction of the model parameters is available, it can be incorporated effectively into the determination of the posterior probability.

[0010] In general, if, for a given probability of the result, the prior probability and the posterior probability belong to the same family of distributions, the sample merely induces a transformation within that family, which makes mathematical treatment easy. The family of the prior and posterior distributions in this case is called a natural conjugate distribution. For the mean parameter θ of a Gaussian distribution N(θ, σ²), the Gaussian distribution N(λ, τ²) is known to be a natural conjugate distribution.

[0011] A method of estimating parameters and performing speaker adaptation within a framework based on Bayes' theorem is disclosed, as the maximum a posteriori probability estimation method, in, for example, Chin-Hui Lee, Chih-Heng Lin and Biing-Hwang Juang, "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models", IEEE Transactions on Signal Processing, Vol. 39, No. 4, April 1991 (hereinafter Reference 1). According to this, when the mean parameter μ follows a prior probability P_0(μ) and its variance σ² is known and fixed, the conjugate prior distribution of μ is a Gaussian distribution with mean ν and variance τ². Using this prior, the maximum a posteriori probability estimate of the parameter is given by

[0012]

[Equation 1]

Here, n is the number of training samples observed in the corresponding HMM state, and o is the mean of the samples. That is, the maximum a posteriori probability estimate of the mean is a weighted average of the prior mean ν and the sample mean o. As is clear from (1), when n is 0, that is, when there are no samples at all, the estimate remains the prior mean ν. When n is sufficiently large, that is, when there are sufficiently many samples, the estimate approaches the sample mean o. In this way, the maximum a posteriori probability estimation method incorporates a distribution predicted in advance and moves the model parameters asymptotically toward the characteristics of the training samples as the number of samples grows. The accuracy of the speaker model can therefore be improved incrementally with use, and ideal speaker adaptation can be realized.
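Equation (1) appears only as a placeholder in this text. As a hedged reconstruction, the standard conjugate-Gaussian result that matches the description above (a weighted average of ν and the sample mean that equals ν at n = 0 and tends to the sample mean as n grows) can be written as follows. The hat on the estimate and the bar over o are notational assumptions.

```latex
\hat{\mu}_{\mathrm{MAP}}
  = \frac{n\tau^{2}}{\sigma^{2} + n\tau^{2}}\,\bar{o}
  + \frac{\sigma^{2}}{\sigma^{2} + n\tau^{2}}\,\nu
\tag{1}
```

The two weights sum to one, and a small prior variance τ² (a confident prior) keeps the estimate close to ν until more samples arrive.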

[0013] Reference 1 also describes, as examples of speaker adaptation of other parameters by Bayesian estimation, the case of the variance and the case of both the mean and the variance, but it shows that speaker adaptation of the mean is highly effective for recognition performance.

[0014]

[Problems to Be Solved by the Invention] Speaker adaptation schemes using parameter estimation by the maximum a posteriori probability estimation method have been disclosed. Conventional Example 1, disclosed in Japanese Patent Application Laid-Open No. 8-95592, performs speaker adaptation that fits phoneme models to a specific speaker starting from standard phoneme models. As the standard initial speech model, it uses a speaker-independent model trained in advance on speech data uttered by a variety of speakers of all ages and both sexes and intended to recognize the speech of an unspecified large number of speakers. In the adaptation scheme of Conventional Example 1, the distribution of the standard initial speech model used as the prior distribution is broad, and it is difficult, with a small amount of training data, to compensate for this sufficiently and correct the parameters to fit the speaker to be adapted.

[0015] Conventional Example 2, disclosed in Japanese Patent Application Laid-Open No. 8-110792, creates the initial speaker model using speaker clusters created with a tree-structured clustering model. This method can narrow the distribution of the initial speaker model to some extent, but when the cluster center deviates greatly from the vectors of the model of the speaker to be adapted, the correction has little effect and adequate adaptation may not be possible.

[0016] As described above, the maximum a posteriori probability estimation method modifies the parameters of an initial speaker model using the speaker's utterance sample data so as to fit the model to the speaker, and it can improve the accuracy of the model progressively with the number of samples. However, the choice of this initial speaker model strongly affects the adaptation capability. The present invention has been made in view of this point and provides means for achieving highly accurate speaker adaptation in speaker adaptation using the maximum a posteriori probability estimation method. An object of the present invention is to realize a speaker adaptation device that, by selecting an optimal initial speaker model, creates a highly accurate speech model for a specific speaker to be adapted from a small amount of that speaker's utterance data. Using this speaker adaptation device, a speech recognition device with highly accurate speaker adaptation capability can be realized.

[0017]

[Means for Solving the Problems] The present invention is a device that re-estimates the parameters of a speaker model by the maximum a posteriori probability estimation method using an initial speaker model and adaptation training data, and thereby performs speaker adaptation. It is characterized in that a large number (N) of speaker models created from many speakers are prepared in advance, the M speaker models closest in distance to the speaker to be adapted are selected from among them, and the selected M speaker models are mixed to form the initial speaker model. The number M (M << N) is determined on the basis of the distance relationship between the speaker to be adapted and each of the N speaker models prepared in advance. This makes it possible to determine a highly accurate initial model suited to the distribution and to suppress an unnecessary increase in parameters.

[0018] In this configuration, the N speaker models to be mixed may be weighted according to their distance from the feature vector of the speaker to be adapted.

[0019] Further, taking the distance between the speaker to be adapted and the nearest speaker model as a base distance, the number N of speaker models to be mixed may be made variable by selecting those speaker models whose distance to the speaker to be adapted is within a fixed range relative to the base distance.

[0020] The present invention can also be realized as a method, and at least part of it can be realized as a computer program product.

[0021]

[Embodiments of the Invention] Embodiments of the present invention are described below.

[0022] FIG. 2 is a block diagram showing an embodiment of a speaker adaptation device according to the present invention and a speech recognition system using it. The speech recognition system analyzes the input speech in the acoustic analysis unit 10 and extracts a sequence of feature parameter vectors. The phoneme matching unit 11 matches the extracted sequence of feature parameter vectors against the speech model 13 while referring to coarse information from the language model 14, and creates a plurality of candidate phoneme sequences. The language recognition unit 12 re-evaluates these candidate phoneme sequences using detailed information from the language model 14 and fixes the final recognition result. The means for creating the speaker-adapted speech model 13 is the speaker adaptation device according to the present invention, and its method and configuration are described below.

[0023] In this embodiment, speech from a large number of speakers is first collected in advance, and a set 17 of many speaker models independent of the speaker to be adapted is prepared using acoustic analysis means (not shown) equivalent to the acoustic analysis unit 10. That is, by collecting speech from N speakers, N distributions are computed and prepared beforehand. FIG. 3 schematically shows, in broken lines, the output probability distributions of the feature parameter vectors of these speaker models. For convenience of explanation the feature parameter vector is drawn as two-dimensional, but in practice it is usually a multidimensional vector of about 30 dimensions. As shown in FIG. 3, the speaker models are distributed in the multidimensional space, overlapping one another. Next, for speaker adaptation, speech of the speaker to be adapted is input and analyzed by the acoustic analysis unit 10 to obtain the distribution of that speaker's feature parameter vectors, which is stored as adaptation sample data 16. FIG. 4 shows, in a solid line, an example of the distribution of the output probability of the feature parameter vectors of the speaker to be adapted obtained from the adaptation sample data, together with its relationship to the distributions of the many specific-speaker models obtained earlier. With the model of the speaker to be adapted and the many (N) speaker models determined in this way, the adaptation model creation unit 15 creates the adapted speech model 13 by the following procedure. The adaptation model creation unit 15 first measures the distance between the model of the speaker to be adapted, stored as the adaptation sample data 16, and each of the N speaker models stored as the set 17 of speaker models. Assuming multivariate Gaussian distributions for the speaker models, the Mahalanobis distance between the center vector of the model of the speaker to be adapted and each specific-speaker model can be used as this distance. Using this distance, M speaker models are selected in order of increasing distance from the model of the speaker to be adapted. FIG. 4 illustrates, for M = 3, how the n-th, (n+1)-th, and (n+2)-th models nearest to the model of the speaker to be adapted are selected. Once the M speaker models have been selected, the initial model is determined as their weighted sum.

[0024] The weights used here can be determined on the basis of these distances.

[0025] The number M of models to be mixed need not be a fixed value and may instead be varied according to the distribution. When the choice of models to serve as candidates for the initial model is uncertain, the number M of models to be mixed is increased, and when it is certain, M is decreased. To this end, the Mahalanobis distances between the center vector of the model of the speaker to be adapted and the N specific-speaker models are measured, and those models are selected whose distance lies within a fixed range of the shortest of the distances to all models. With M denoting the number of models so selected, the initial model is created by mixing the M models by weighted addition, as in the previous example. Here too, the weights can be determined on the basis of these distances. In this way, for ambiguous, uncertain initial model candidates, a fine-grained model is created with a larger number of mixture components. Conversely, for unambiguous, reliable initial model candidates, a model with fewer mixture components and reduced computation can be created. As the fixed range, with D_min denoting the shortest of the distances between the center vector of the model of the speaker to be adapted and all models, one may select the models for which the ratio of the distance to D_min is at most a fixed value, or the models for which the difference between the distance and D_min is at most a fixed value.
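The two range criteria described above (ratio to D_min, or difference from D_min) can be sketched as follows. The function name `select_within_range` and the threshold parameters `max_ratio` and `max_diff` are hypothetical, and the threshold values in the example are arbitrary; the text does not specify them.

```python
from typing import Optional


def select_within_range(distances: list[float],
                        max_ratio: Optional[float] = None,
                        max_diff: Optional[float] = None) -> list[int]:
    """Return indices of models whose distance is within a fixed range of D_min.

    Exactly one criterion must be given:
      max_ratio: keep model k if distances[k] / D_min <= max_ratio
      max_diff:  keep model k if distances[k] - D_min <= max_diff
    Indices are returned nearest first; M is the length of the result.
    """
    if (max_ratio is None) == (max_diff is None):
        raise ValueError("give exactly one of max_ratio or max_diff")
    d_min = min(distances)
    if max_ratio is not None:
        # Written as a product so that D_min == 0 does not cause a division error.
        keep = [k for k, d in enumerate(distances) if d <= max_ratio * d_min]
    else:
        keep = [k for k, d in enumerate(distances) if d - d_min <= max_diff]
    return sorted(keep, key=lambda k: distances[k])


dists = [2.0, 1.0, 1.4, 5.0, 1.1]
by_ratio = select_within_range(dists, max_ratio=1.5)
by_diff = select_within_range(dists, max_diff=0.2)
print(by_ratio, by_diff)
```

With these numbers the ratio criterion keeps three models and the difference criterion keeps two, so M varies with how tightly the nearest models cluster around D_min.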

[0026] Specific examples are described in detail below.

[0027] [Example 1] A speaker adaptation scheme for continuous speech recognition is shown here as an example. For continuous speech recognition, it is appropriate to take the phoneme as the unit of recognition. An HMM is therefore created for each phoneme. An HMM is created by running a procedure called training, which determines the HMM parameters from speech data. To train phoneme-level HMMs from continuously uttered speech data, training is performed on the uttered phoneme sequence with the final state of each preceding phoneme's HMM connected to the initial state of the following phoneme's HMM.

[0028] To carry out the present invention, utterance data must be collected from many speakers in advance and an HMM created for each speech unit used as the recognition unit. As described above, a so-called left-to-right HMM, in which the state moves from left to right, is appropriate for modeling a signal that changes continuously over time, such as a speech signal, so a four-state HMM such as that shown in FIG. 1, for example, is used. In FIG. 1, 1, 2, 3, and 4 denote the states, and a_ij (i, j = 1, ..., 4) denotes the probability of a transition from state i to state j. In state j, the event observed from the model is denoted b_j(o). Because the feature parameter vector of speech, which is the object of this observation, is a continuous signal, b_j(o) can be expressed by the following continuous probability density function.

[0029]

[Equation 2]

Here, o is the observation vector, and N[...] is the Gaussian distribution function with mean vector μ_j and variance-covariance matrix Σ_j.

[0030] The parameters of each model, a_ij (i, j = 1, ..., 4) and b_j(o) (j = 1, ..., 4), can be obtained by iterative computation using the well-known Baum-Welch method. Re-estimation by the Baum-Welch method is described in detail in, for example, Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", Prentice Hall PTR, 1993. In outline, it is as follows. In the Baum-Welch method, the parameters are re-estimated as

[0031]

[Equation 3]

and by substituting the re-estimated parameters and repeating the estimation, a model can be created for which the probability of observing o is at a local maximum.

[0032] Assuming a Gaussian distribution as in Equation (2) for b_j(o), the re-estimation formulas for μ_j and Σ_j are

[0033]

[Equation 4]

Here, γ_t(j) is the probability of being in state j at time t, given the observation sequence o and the model. In this way, the models of the multiple speakers to be prepared in advance are determined.
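Equation (4) appears only as a placeholder in this text. As a hedged reconstruction, the standard Baum-Welch re-estimation formulas for a single Gaussian per state, which are consistent with the definition of γ_t(j) given above, are shown below. The hats, the observation index o_t, and the sequence length T are notational assumptions, and the exact form used in the original may differ.

```latex
\hat{\mu}_j = \frac{\sum_{t=1}^{T} \gamma_t(j)\, o_t}{\sum_{t=1}^{T} \gamma_t(j)},
\qquad
\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} \gamma_t(j)\,(o_t - \hat{\mu}_j)(o_t - \hat{\mu}_j)^{t}}
                      {\sum_{t=1}^{T} \gamma_t(j)}
\tag{4}
```

Each observation contributes to state j in proportion to the probability that the model was in state j when it was observed.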

[0034] Adaptation is performed in the following steps. First, speech data for adaptation training is collected from the speaker to be adapted. Then, for the HMM of each phoneme contained in the adaptation training data, the parameters are determined according to the method described above. At that time, the frames corresponding to each state of the phoneme HMM are determined by Viterbi segmentation, which gives the number of samples of the adaptation training data.

[0035] Next, the distance between the model of the speaker to be adapted, obtained from the training data, and each of the many speaker models obtained in advance is computed. The Mahalanobis distance between the mean vector and the speaker model is used as this distance.

[0036] Let the mean vector of the distribution of the j-th state of the k-th speaker model be μ_kj and its variance-covariance matrix be Σ_kj, and let D_kj be the Mahalanobis distance between the mean vector μ_j of the j-th state of the speaker to be adapted and this k-th model. The square of D_kj is then obtained by

[0037]

[Equation 5]

Here, Σ_kj^-1 is the inverse of the variance-covariance matrix Σ_kj, and (μ_kj − μ_j)^t is the transpose of the vector (μ_kj − μ_j).
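Equation (5) appears only as a placeholder in this text. From the symbols defined around it (the inverse matrix Σ_kj^-1 and the transposed difference vector), the squared Mahalanobis distance presumably takes the usual form, reconstructed here under that assumption:

```latex
D_{kj}^{2} = (\mu_{kj} - \mu_j)^{t}\, \Sigma_{kj}^{-1}\, (\mu_{kj} - \mu_j)
\tag{5}
```

Because the difference is scaled by the covariance of the k-th model, a broad model counts as closer than a narrow one at the same Euclidean distance.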

[0038] The Mahalanobis distance D_kj is computed for all speaker models, and the M speaker models with the smallest distances are selected. FIG. 4 illustrates this for M = 3: the three nearest models, n, n+1, and n+2, are selected. The subscripts relating to the state are omitted.
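The distance computation and nearest-M selection described above can be sketched as follows. To stay within the standard library, this sketch assumes diagonal variance-covariance matrices, which is a simplification not stated in the text; the function names `mahalanobis_sq_diag` and `nearest_models` and the sample data are hypothetical.

```python
import heapq


def mahalanobis_sq_diag(mu_target: list[float],
                        mu_model: list[float],
                        var_model: list[float]) -> float:
    """Squared Mahalanobis distance for a diagonal covariance (assumption).

    With a diagonal matrix, the quadratic form reduces to a sum of
    squared differences, each divided by the model's variance.
    """
    return sum((m - t) ** 2 / v for m, t, v in zip(mu_model, mu_target, var_model))


def nearest_models(mu_target: list[float],
                   models: list[tuple[list[float], list[float]]],
                   m: int) -> list[tuple[int, float]]:
    """Return (model index, squared distance) for the m nearest models, nearest first."""
    scored = [(mahalanobis_sq_diag(mu_target, mu, var), k)
              for k, (mu, var) in enumerate(models)]
    return [(k, d2) for d2, k in heapq.nsmallest(m, scored)]


# Four prepared speaker models as (mean vector, per-dimension variances).
models = [
    ([3.0, 0.0], [1.0, 1.0]),
    ([0.0, 2.0], [1.0, 4.0]),
    ([1.0, 1.0], [4.0, 4.0]),
    ([5.0, 5.0], [1.0, 1.0]),
]
picks = nearest_models([0.0, 0.0], models, m=2)
print(picks)
```

Model 2 is selected ahead of model 1 even though its mean is not the closest in every dimension, because its larger variances shrink the Mahalanobis distance.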

[0039] Now consider the feature vector of the m-th (1 ≦ m ≦ M) speaker model among the selected M speaker models. Let the mean of one element of that feature vector be μ_mj and its variance be τ_mj². Suppose that the mean μ_j of the element of the feature vector has a prior distribution P_0(μ) and that its variance σ_j² is a known, fixed value. The conjugate prior distribution for μ_j is then a Gaussian distribution. Taking the m-th (1 ≦ m ≦ M) speaker model as the conjugate prior distribution for the mean, the maximum a posteriori probability estimation method gives the estimate of the mean, μ_mj^MAP, as

[0040]

[Equation 6]

Here, n is the number of samples.

[0041] That is, with the m-th (1 ≦ m ≦ M) speaker model as the initial model, the estimate μ_mj^MAP of the mean of an element of the feature vector is a weighted average of the prior mean μ_mj and the mean o_m of the samples observed in adaptation training.
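The weighted average described above can be sketched for a single element of the feature vector. The weights used here are the standard conjugate-Gaussian ones, n·τ² / (σ² + n·τ²) for the sample mean and σ² / (σ² + n·τ²) for the prior mean; this is an assumed form of Equation (6), consistent with the behavior described for Equation (1). The function name `map_mean` and the example values are hypothetical.

```python
def map_mean(prior_mean: float, prior_var: float,
             obs_var: float, samples: list[float]) -> float:
    """MAP estimate of a Gaussian mean with a Gaussian prior (scalar case).

    prior_mean: mean of the selected speaker model (mu_mj in the text)
    prior_var:  variance of that model (tau_mj squared)
    obs_var:    known, fixed variance of the observations (sigma_j squared)
    samples:    adaptation samples assigned to this state
    """
    n = len(samples)
    if n == 0:
        return prior_mean  # no adaptation data: the prior mean is kept
    sample_mean = sum(samples) / n
    w = n * prior_var / (obs_var + n * prior_var)  # weight on the sample mean
    return w * sample_mean + (1.0 - w) * prior_mean


no_data = map_mean(0.0, 1.0, 1.0, [])
one_sample = map_mean(0.0, 1.0, 1.0, [2.0])
many_samples = map_mean(0.0, 1.0, 1.0, [2.0] * 1000)
print(no_data, one_sample, many_samples)
```

With one sample and equal variances the estimate sits halfway between prior and sample mean; with a thousand samples it is almost entirely determined by the data.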

[0042] Using the estimates obtained for the m-th speaker models, the model for the adaptation target speaker is then given in the following form:

【００４３】[0043]

(Equation 7) Here, c_mj is the weighting coefficient for the m-th speaker in the j-th state of the speaker model. It can be determined from the distances already calculated in the process above. That is,

【００４４】[0044]

(Equation 8) may be used as the weight. With this choice, clearly

【００４５】[0045]

(Equation 9) Thus the weights are determined so as to satisfy the statistical constraint.
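Equations 7 to 9 are not reproduced in the text, so the sketch below only illustrates one plausible reading. It assumes that the weights are inversely proportional to the distances and normalized to sum to one, and that the adapted model is a weighted sum of Gaussians centered on the MAP-estimated means. Both assumptions, and all names, are hypothetical rather than taken from the patent.

```python
import numpy as np

def distance_weights(distances, eps=1e-12):
    """Weights that grow as the distance shrinks and that sum to one.
    Inverse-distance weighting is an assumed choice, not the patent's Equation 8."""
    d = np.asarray(distances, dtype=float)
    inv = 1.0 / (d + eps)  # eps guards against a zero distance
    return inv / inv.sum()

def mixture_density(x, map_means, obs_var, weights):
    """Density of a scalar feature x under a weighted sum of Gaussians
    that share the known variance obs_var."""
    means = np.asarray(map_means, dtype=float)
    comps = np.exp(-0.5 * (x - means) ** 2 / obs_var) / np.sqrt(2.0 * np.pi * obs_var)
    return float(np.dot(weights, comps))
```

Normalizing the weights keeps the mixture a valid probability density, which is presumably the statistical constraint referred to above.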

[0046] [Specific Example 2] In Specific Example 1, the number of mixture components M was fixed. Varying M according to the distribution of the models yields an efficient model whose adaptation performance is high relative to its computational cost. That is, when the distribution of distances between the center vector of the adaptation target speaker model and the many specific speaker models prepared in advance differs from state to state and is not uniform, the value of M is varied according to the degree of certainty. For example, when those distances differ greatly from one speaker model to another and the target is close to only a few speaker models, it is sufficient to use only those models as the distributions to be mixed, which prevents a needless increase in the number of parameters.

[0047] In Specific Example 2, the Mahalanobis distance D_k between the adaptation target speaker's model and each of the many speaker models obtained in advance is computed, and the number of mixture components is determined from these distances. To carry out the invention, as in Specific Example 1, utterance data is first collected from many speakers in advance, and HMMs of those speakers are created for each speech unit that serves as a recognition unit. Next, adaptation training speech is collected from the adaptation target speaker, and model parameters are determined for the HMM of each phoneme contained in the adaptation training data. The distances between the resulting adaptation target speaker model and the many speaker models are then calculated, using the Mahalanobis distance. In Specific Example 2, the smallest of the distances between the adaptation target speaker's model and the many speaker models is denoted

【００４８】[0048]

(Equation 10) D_min = min_k D_k. With D_min defined in this way, only the models k whose distance D_k lies within a prescribed range relative to D_min are selected as components to be mixed.

[0049] This is illustrated in the figures as follows. In FIG. 5, for example, the distance D_n to model n is the smallest, and the distances D_n+1 and D_n+2 to models n+1 and n+2 are within a certain range of that minimum distance, so the three models n, n+1, and n+2 are selected as the models to be mixed. In FIG. 6, by contrast, the distance to model n is the smallest and the distance to model n+1 is within the range of the minimum distance, but the distance to the next closest model, n+2, is not. Only the two models n and n+1 are therefore selected as the models to be mixed.

[0050] One way to determine the range relative to the minimum distance is to require that the ratio of the distance D_k to model k to the minimum distance D_min be no greater than a constant δ, that is,

【００５１】[0051]

(Equation 11) D_k / D_min ≤ δ. Every k-th model satisfying this condition is selected.

[0052] Alternatively, the difference between the distance D_k to model k and the minimum distance D_min may be required to be no greater than a constant δ', that is,

【００５３】[0053]

(Equation 12) D_k − D_min ≤ δ'. The k-th models satisfying this condition can likewise be selected.
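The two selection rules described above (ratio to the minimum distance no greater than δ, or difference from the minimum distance no greater than δ') can be sketched as follows. The function names are hypothetical, and the distances are assumed to be non-negative values held in a NumPy-compatible sequence.

```python
import numpy as np

def select_by_ratio(distances, delta):
    """Indices k with D_k / D_min <= delta.
    Written as D_k <= delta * D_min to avoid dividing by a zero minimum."""
    d = np.asarray(distances, dtype=float)
    return np.flatnonzero(d <= delta * d.min())

def select_by_difference(distances, delta_prime):
    """Indices k with D_k - D_min <= delta_prime."""
    d = np.asarray(distances, dtype=float)
    return np.flatnonzero(d - d.min() <= delta_prime)
```

The number of returned indices is the variable mixture count M'. For distances `[1.0, 1.2, 1.4, 3.0]` and `delta = 1.5`, the first three models are kept, as in the FIG. 5 situation. For `[1.0, 1.2, 2.0, 3.0]`, only the first two are kept, as in FIG. 6.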

[0054] With either method, let M' be the number of selected models. Using the estimates obtained for the selected speaker models, the model for the adaptation target speaker is given, as in Specific Example 1, in the following form:

【００５５】[0055]

(Equation 13) Here, c_mj is the weighting coefficient for the m-th speaker in the j-th state of the speaker model. It can be determined from the distances calculated in the process above. That is,

【００５６】[0056]

(Equation 14) may be used as the weight.

【００５７】[0057]

[Effects of the Invention] The present invention concerns an apparatus that re-estimates the parameters of a speaker model by the maximum a posteriori probability estimation method, using an initial speaker model assumed in advance and adaptation training data, and thereby performs speaker adaptation. It aims to assume, as the initial speaker model, a predictive distribution considered to be as close as possible to the characteristics of the adaptation target speaker. Using such a distribution as the initial speaker model allows accurate estimation of the model parameters from a small amount of adaptation training data, so that good speaker adaptation is achieved. The invention discloses a method that brings out this effect by measuring the distances between many specific speaker models obtained in advance and the adaptation target speaker and selectively using the N models closest in distance, which makes high adaptability possible.

[0058] The present invention also discloses a method of varying the number M of models to be selected, according to the relative relationship of their distances, when the models closest in distance are selected. This selects the optimal mixture model according to the shape of the assumed predictive distribution, so that the best adaptation capability is obtained while the amount of computation is also reduced.

[Brief description of the drawings]

FIG. 1 is a schematic diagram illustrating an example of an HMM.

FIG. 2 is a block diagram showing an embodiment of the present invention.

FIG. 3 is a diagram illustrating an example of the distribution of a large number of specific speaker models.

FIG. 4 is a diagram illustrating an example of the distribution of the adaptation target speaker obtained from training data.

FIG. 5 is a diagram illustrating an example of the distribution of the specific speaker models nearest to the adaptation target speaker model.

FIG. 6 is a diagram illustrating another example of the distribution of the specific speaker models nearest to the adaptation target speaker model.

[Explanation of symbols]

1 to 4: HMM states
10: Acoustic analysis unit
11: Phoneme matching unit
12: Language recognition unit
13: Speech model
14: Language model
15: Adaptation model creation unit
16: Adaptation sample data
17: Set of specific speaker models
20: Distribution of feature vectors of many speakers
21: Distribution of feature vectors of the adaptation target speaker

Claims

[Claims]

1. A speaker adaptation apparatus that performs speaker adaptation by re-estimating the parameters of a speaker model by a maximum a posteriori probability estimation method using an initial speaker model and adaptation training data, wherein: a number of initial speaker models are created from a number of speakers; features of the adaptation target speaker are extracted from adaptation training data uttered by the adaptation target speaker; N speaker models are selected in order of closeness in distance to the features of the adaptation target speaker; each of the selected N speaker models is taken as a distribution assumed in advance, and the parameters of the speaker model are estimated using the adaptation training data; and the N speaker models having the estimated parameters are mixed by addition to create a speech model of the adaptation target speaker.

2. The speaker adaptation apparatus according to claim 1, wherein the N speaker models to be mixed are weighted according to a distance from a feature vector of the adaptation target speaker.

3. The speaker adaptation apparatus according to claim 1, wherein, with the distance between the adaptation target speaker and the speaker model closest to it in distance taken as a base distance, the number N of speaker models to be mixed is made variable by selecting the speaker models whose distance from the adaptation target speaker is within a certain range relative to the base distance.

4. A speaker adaptation method that performs speaker adaptation by re-estimating the parameters of a speaker model by a maximum a posteriori probability estimation method using an initial speaker model and adaptation training data, wherein: a number of initial speaker models are created from a number of speakers; features of the adaptation target speaker are extracted from adaptation training data uttered by the adaptation target speaker; N speaker models are selected in order of closeness in distance to the features of the adaptation target speaker; each of the selected N speaker models is taken as a distribution assumed in advance, and the parameters of the speaker model are estimated using the adaptation training data; and the N speaker models having the estimated parameters are mixed by addition to create a speech model of the adaptation target speaker.