JP3448371B2

JP3448371B2 - HMM learning device

Info

Publication number: JP3448371B2
Application number: JP26395894A
Authority: JP
Inventors: 計美大倉
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1994-03-18
Filing date: 1994-10-27
Publication date: 2003-09-22
Anticipated expiration: 2018-09-22
Also published as: JPH07306690A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a learning device for a Hidden Markov Model (hereinafter, HMM), a probabilistic model that approximately represents the statistical characteristics of speech by a distribution such as a Gaussian distribution.

[0002] [Detailed Description of the Invention]

[0003]

[Prior Art] In recent years, speech recognition devices using HMMs have been actively developed. An HMM models the statistical characteristics of speech obtained from a large amount of speech data. Such a model has two advantages: (1) fluctuations in utterances can be handled statistically in the form of distributions, and (2) differences in utterance duration between speakers can be absorbed.

[0004] Word recognition using phoneme HMMs having these advantages is described below as an example.

[0005] In general, a word is composed of smaller units, for example a sequence of phonemes. If an HMM is created for each phoneme, any word can therefore be recognized by concatenating the phoneme HMMs. FIG. 4 is a conceptual diagram of word recognition based on phoneme HMMs.

[0006] Suppose the recognition targets registered in the dictionary are the three words "uchikesu" (U/CH/I/K/E/S/U), "uchiawase" (U/CH/I/A/W/A/S/E), and "uru" (U/R/U). The only phoneme HMMs that need to be created are then the nine phonemes that appear in the dictionary: U, CH, I, K, E, S, A, W, and R.
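The phoneme inventory for this three-word dictionary can be checked with a few lines of Python. This is only an illustrative sketch; the names `dictionary` and `phoneme_inventory` are hypothetical and do not appear in the patent.

```python
# Dictionary from paragraph [0006]: each word maps to its phoneme sequence.
dictionary = {
    "uchikesu": ["U", "CH", "I", "K", "E", "S", "U"],
    "uchiawase": ["U", "CH", "I", "A", "W", "A", "S", "E"],
    "uru": ["U", "R", "U"],
}


def phoneme_inventory(words):
    """Return the distinct phonemes in order of first appearance."""
    seen = []
    for phonemes in words.values():
        for phoneme in phonemes:
            if phoneme not in seen:
                seen.append(phoneme)
    return seen


inventory = phoneme_inventory(dictionary)
print(inventory, len(inventory))
```

Only the phonemes in `inventory` need a trained HMM; a word HMM is then the concatenation of the phoneme HMMs listed for that word.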

[0007] Accordingly, in word recognition, a word HMM corresponding to each word in the dictionary is created by concatenating phoneme HMMs, and the word closest to the input speech (a word) is obtained in terms of probabilistic likelihood.

[0008] Thus, by learning speech information from many speakers in advance and creating phoneme HMMs, the input speech can be recognized even when it is a word. This concludes the overview of HMMs.

[0009] Such HMMs are generally created using several hundred training words. Having the user utter several hundred words, however, is unrealistic given the burden on the user. To avoid this, speaker adaptation tunes an HMM to the user's voice characteristics using a small number of training words. One such speaker adaptation method is disclosed in the Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J76-D-II, No. 12, December 1993, pp. 2469-2476.

[0010] A speech recognition method using such an HMM is described below with reference to FIGS. 5 to 9.

[0011] FIG. 5 shows speech patterns representing the relationship between time and logarithmic power (hereinafter simply "power") for the input utterances "ba" and "bu". Such a speech pattern is obtained by dividing the speech band of the input speech with, for example, 16 band-pass filters, performing frequency analysis of the speech, and then taking the logarithmic power at each time. As FIG. 5(b) shows, the power change differs even within the same phoneme /b/ segment. Within the /b/ segment the power change is small at first and gradually grows. Focusing on this power change, the phoneme /b/ can be broadly divided into three sections: section 1, where the power change is small but fluctuation is large; section 2, where the power begins to rise; and section 3, where the power rises steeply.

[0012] FIG. 6(a) shows the phoneme /b/ segment of FIG. 5(b) divided into sections 1 to 3. FIG. 6(b) shows, for each of sections 1 to 3 of the phoneme /b/ in FIG. 6(a), a histogram of the amount of power change approximated by a Gaussian distribution. An HMM generally represents the characteristics of speech by distributions such as these distributions 1 to 3. For example, when speech is analyzed with a 16-channel band-pass filter bank, one Gaussian distribution is obtained for each channel. If these 16 Gaussian distributions are regarded as one component, the means of the 16 Gaussian distributions in the component can be expressed as a vector, hereinafter called the mean vector.

[0013] FIG. 7 is a schematic block diagram of a conventional HMM learning device based on HMM speaker adaptation, and of a speech recognition device using this learning device.

[0014] In FIG. 7, reference numeral 1 denotes a speech analysis unit that analyzes the features of the input speech for each frequency band, and 2 denotes an initial model storage unit that stores an initial HMM obtained by training. The initial model may be a speaker-dependent HMM created from the speech of a specific speaker, or a speaker-independent HMM trained on the speech of many speakers. As the specific training method, a well-known technique such as the forward-backward algorithm or a training rule based on Viterbi alignment may be used.

[0015] Reference numeral 3 denotes a learning unit that retrains the initial model using the input speech. Of the parameters representing the HMM, the learning unit 3 is assumed to train only the mean vectors.

[0016] The training of the mean vectors in the learning unit 3 is described below.

[0017] Let the set of mean vectors of the components in the initial HMM be C^R = (c_1^R, ..., c_k^R, ..., c_m^R), where m is the total number of components. The learning unit 3 retrains this set C^R using the analysis results from the speech analysis unit 1, yielding the retrained set of mean vectors C^I = (c_1^I, ..., c_k^I, ..., c_m^I). Here c_k^R and c_k^I correspond to each other; that is, c_k^R becomes c_k^I after training.

[0018] Reference numeral 4 denotes a speaker adaptation unit that turns the HMM retrained in the learning unit 3 into a more accurate model. The speaker adaptation unit 4 calculates the movement vectors, interpolates the movement vector v_n for each mean vector c_n^R belonging to a phoneme HMM that was not trained, smooths all of the movement vectors, and adapts the mean vectors.

[0019] Reference numeral 5 denotes a post-adaptation model storage unit that stores the HMM after adaptation by the speaker adaptation unit 4.

[0020] This is the configuration of the conventional HMM learning device. The processing in the speaker adaptation unit 4 is described next.

(A) Calculation of the movement vector v_k

The difference between the mean vector of each component in the initial model C^R and the corresponding mean vector in the retrained model C^I is obtained by the following equation. This difference is hereinafter called the movement vector.

[0021] v_k = c_k^I - c_k^R (k = 1, 2, ..., m)

(B) Interpolation of the movement vector v_n

FIG. 8(a) shows that the mean vectors (c_1^R, c_2^R, c_3^R) of the initial model were retrained into the mean vectors (c_1^I, c_2^I, c_3^I). The mean vector c_n^R was not trained, because no phoneme related to c_n^R occurred in the training speech. The vectors (v_1, v_2, v_3) in the figure are the movement vectors obtained from the components that were trained.

[0022] Interpolation of the movement vectors means obtaining the movement vector v_n for an untrained mean vector c_n^R as shown in FIG. 8(b).

[0023] FIG. 8(b) is a conceptual diagram of the calculation of the movement vector v_n. The movement vector v_n is calculated by interpolation from the movement vectors v_1, v_2, and v_3, and can be expressed as a weighted average of v_1, v_2, and v_3.

(C) Smoothing of the movement vectors

FIG. 9 is a conceptual diagram of the smoothing of the movement vectors, which is performed after the interpolation of the movement vector v_n described in (B).

[0024] As described above, when training has not been performed with a sufficient number of words, the estimated movement vectors are considered to contain large estimation errors. The directions of movement vectors calculated from such error-prone estimates are therefore considered to vary discontinuously. Discontinuous here means that, in FIG. 9, the movement vectors v_1, v_2, and v_3 all point in roughly the same direction, whereas v_k points in a different direction from these three.

[0025] A smoothed movement vector v_k^S is therefore calculated by correcting v_k on the basis of the movement vectors v_1, v_2, and v_3 of the mean vectors in the neighborhood of the mean vector c_k^R. That is, under the influence of v_1, v_2, and v_3, the movement vector v_k is corrected slightly to the left. This correction is called smoothing.

(D) Adaptation of the mean vectors

Using the smoothed movement vector v_k^S obtained in (C) and the mean vector C_k^R, the mean vector C_k^S (k = 1, ..., m) of the speaker-adapted HMM is calculated by the following equation.

[0026] C_k^S = C_k^R + v_k^S
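Steps (A) to (D) of the conventional method can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the method of the cited paper: the text says only that interpolation is a weighted average and that smoothing corrects a vector using its neighbors, so the inverse-distance weights and the blending factor `alpha` are assumptions, and all function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_average(vectors, weights):
    """Component-wise weighted average of equally sized vectors."""
    total = sum(weights)
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(len(vectors[0]))]


def movement_vectors(initial, retrained):
    """(A) v_k = c_k^I - c_k^R for every component k that was retrained."""
    return {k: [new - old for new, old in zip(retrained[k], initial[k])]
            for k in retrained}


def interpolate(initial, moves, n, eps=1e-9):
    """(B) Movement vector of untrained component n as a weighted average
    of the known movement vectors. Inverse-distance weights are an assumption."""
    keys = list(moves)
    weights = [1.0 / (dist(initial[n], initial[k]) + eps) for k in keys]
    return weighted_average([moves[k] for k in keys], weights)


def smooth(initial, moves, k, alpha=0.5, eps=1e-9):
    """(C) Pull v_k toward the weighted average of the other movement vectors.
    The blending factor alpha is an assumption."""
    others = [j for j in moves if j != k]
    weights = [1.0 / (dist(initial[k], initial[j]) + eps) for j in others]
    average = weighted_average([moves[j] for j in others], weights)
    return [(1.0 - alpha) * v + alpha * a for v, a in zip(moves[k], average)]


def adapt(initial, smoothed):
    """(D) C_k^S = C_k^R + v_k^S."""
    return {k: [c + v for c, v in zip(initial[k], smoothed[k])] for k in initial}


# Three trained components and one untrained component "cn" (toy 2-D data).
initial = {"c1": [0.0, 0.0], "c2": [2.0, 0.0], "c3": [0.0, 2.0], "cn": [1.0, 1.0]}
retrained = {"c1": [1.0, 0.0], "c2": [3.0, 0.0], "c3": [1.0, 2.0]}

moves = movement_vectors(initial, retrained)
moves["cn"] = interpolate(initial, moves, "cn")
smoothed = {k: smooth(initial, moves, k) for k in moves}
adapted = adapt(initial, smoothed)
```

In this toy data every trained component moves by the same vector, so the untrained component "cn" inherits that movement. With few trained components, both `interpolate` and `smooth` depend on very few vectors, which is the weakness the invention addresses.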

[0027]

[Problems to Be Solved by the Invention] As described above, conventional HMM creation performs (B) interpolation of the movement vectors and (C) smoothing of the movement vectors. When the training speech material, that is, the word utterances, is small in quantity, however, only a few mean vectors are trained in the learning unit 3, and only a few movement vectors are obtained from them. In such a case, that is, when the movement vectors v_1, v_2, v_3, ... are few, many vectors must be interpolated from them, and the smoothing is likewise performed from the few vectors obtained by training. As a result, only a model ill-suited to the input speaker can be obtained.

[0028]

[Means for Solving the Problems] The present invention has been made in view of the above problem. It is characterized in that, from among the speaker subspace movement vectors v_{i,s,m}^n of a plurality of representative speakers, the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker that is close in distance to the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker, obtained from a small amount of training speech material of the input speaker, is selected, and a speaker-independent HMM is adapted to the input speaker by correcting the selected speaker subspace movement vector v_{i,s,m}^{spno}.

[0029] Further, the present invention is characterized by comprising: a speech analysis unit (1) that analyzes the features of input speech; an initial model storage unit (2) that stores an initial HMM; a learning unit (3) that trains the HMM stored in the initial model storage unit (2) using the result of analyzing the input speaker's speech in the speech analysis unit (1); an input-speaker speaker subspace movement vector calculation unit (10a) that calculates the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker using difference vectors obtained from the difference between the mean vectors μ_{i,s,m}^{inp} of the input speaker's HMM trained in the learning unit (3) and the mean vectors μ_{i,s,m} of the HMM stored in the initial model storage unit (2); an input-speaker speaker subspace movement vector storage unit (10b) that stores the speaker subspace movement vector v_{i,s,m}^{inp} obtained by the calculation unit (10a); a representative-speaker speaker subspace movement vector storage unit (12) that stores the speaker subspace movement vectors v_{i,s,m}^n of representative speakers; a representative speaker selection unit (10c) that selects the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker that is close in distance to the speaker subspace movement vector v_{i,s,m}^{inp} stored in the storage unit (10b); a post-speaker-adaptation model construction unit (10d) that obtains the speaker-adapted mean vectors μ_{i,s,m}^{inp} using the representative speaker's speaker subspace movement vector v_{i,s,m}^{spno} obtained in the representative speaker selection unit (10c), the input speaker's speaker subspace movement vector v_{i,s,m}^{inp}, and the mean vectors μ_{i,s,m} of the initial model; and a post-adaptation model storage unit (14) that stores the speaker-adapted mean vectors μ_{i,s,m}^{inp}.

[0030]

[Operation] The mean vectors of the components in the initial HMM are trained using the analysis results of the input speech analyzed in the speech analysis unit (1).

[0031] Next, the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker is calculated using the difference between each mean vector of the components in the initial model and the corresponding retrained mean vector.

[0032] The speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker that is close in distance to the input speaker's speaker subspace movement vector v_{i,s,m}^{inp}, stored in the input-speaker speaker subspace movement vector storage unit (10b), is then selected.

[0033] The post-speaker-adaptation model construction unit (10d) obtains the speaker-adapted mean vectors μ_{i,s,m}^{inp} using the representative speaker's speaker subspace movement vector v_{i,s,m}^{spno} obtained in the representative speaker selection unit (10c), the input speaker's speaker subspace movement vector v_{i,s,m}^{inp}, and the mean vectors μ_{i,s,m} of the initial model.

[0034] Finally, the speaker-adapted mean vectors μ_{i,s,m}^{inp} are stored in the post-adaptation model storage unit (14).

[0035]

[Embodiment] An embodiment of the present invention is described with reference to FIGS. 1 to 3. The HMMs are assumed to be of the Gaussian mixture type with diagonal covariance matrices.

[0036] FIG. 1 is a schematic block diagram of the HMM learning device according to the present invention, and FIG. 2 is a detailed block diagram centered on the speaker adaptation unit 10 of the device. Components identical to those of the conventional HMM learning device are given the same reference numerals.

[0037] The first difference between the HMM learning device of the present invention and the conventional one is that a speaker adaptation unit 10 is provided in place of the speaker adaptation unit 4. The speaker adaptation unit 10 consists of an input-speaker speaker subspace movement vector calculation unit 10a, an input-speaker speaker subspace movement vector storage unit 10b, a representative speaker selection unit 10c, and a post-adaptation model construction unit 10d.

[0038] The second difference is that a representative-speaker speaker subspace movement vector storage unit 12, connected to the representative speaker selection unit 10c, and a representative-speaker speaker subspace movement vector calculation unit 11 are provided.

[0039] The third difference is that a post-adaptation model creation unit 13, which creates an HMM on the basis of the initial model storage unit 2, the representative-speaker speaker subspace movement vector storage unit 12, and the speaker adaptation unit 10, and a post-adaptation model storage unit 14, which stores this HMM, are provided.

[0040] The functions of the characteristic components of the present invention are now described in detail: (A) the representative-speaker speaker subspace movement vector calculation unit 11, (B) the representative-speaker speaker subspace movement vector storage unit 12, (C) the input-speaker speaker subspace movement vector calculation unit 10a, (D) the input-speaker speaker subspace movement vector storage unit 10b, (E) the representative speaker selection unit 10c, (F) the post-speaker-adaptation model construction unit 10d, (G) the post-adaptation model creation unit 13, and (H) the post-adaptation model storage unit 14.

(A) Representative-speaker speaker subspace movement vector calculation unit 11

The calculation unit 11 obtains the speaker subspace movement vectors of a plurality of representative speakers. A speaker subspace movement vector is obtained using the differences between the mean vectors of the Gaussian distributions of the initial model and those of the HMM obtained by retraining the initial model. It can be obtained by the following steps.

[0041] Step 1: The initial model (λ) stored in the initial model storage unit is used as the initial phoneme HMM model of each representative speaker. Here I is the number of phoneme HMMs; this embodiment uses 39 phoneme HMMs, so I = 39. λ_i denotes the i-th phoneme HMM.

[0042] λ = {λ_1, ..., λ_i, ..., λ_I}

Each λ_i is expressed as λ_i = {w_{i,s,m}, a_{i,s1,s2}, μ_{i,s,m}, σ_{i,s,m}}.

[0043] Here w_{i,s,m}, μ_{i,s,m}, and σ_{i,s,m} denote, respectively, the weight, the mean vector, and the vector of variances of the m-th Gaussian distribution in the s-th state of the i-th phoneme HMM. a_{i,s1,s2} denotes the transition probability from state s1 to state s2 of the i-th phoneme HMM. Since this embodiment uses 33-dimensional feature vectors, μ_{i,s,m} and σ_{i,s,m} are 33-dimensional vectors.
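The parameter set λ_i = {w, a, μ, σ} can be pictured as a small data structure. The sketch below is illustrative only: the class names, the number of states (3), the number of mixture components per state (2), and the left-to-right transition values are assumptions, since the text fixes only the 39 phoneme HMMs and the 33-dimensional features.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

FEATURE_DIM = 33   # feature dimensionality stated in the embodiment
NUM_PHONEMES = 39  # I = 39 phoneme HMMs stated in the embodiment


@dataclass
class Gaussian:
    weight: float          # w_{i,s,m}
    mean: List[float]      # mu_{i,s,m}
    variance: List[float]  # sigma_{i,s,m}, diagonal covariance


@dataclass
class PhonemeHMM:
    mixtures: List[List[Gaussian]]  # mixtures[s][m]
    # a_{i,s1,s2}, keyed by (s1, s2)
    transitions: Dict[Tuple[int, int], float] = field(default_factory=dict)


def make_hmm(num_states: int, num_mixtures: int) -> PhonemeHMM:
    """Build a placeholder left-to-right HMM (topology is an assumption)."""
    mixtures = [
        [Gaussian(1.0 / num_mixtures, [0.0] * FEATURE_DIM, [1.0] * FEATURE_DIM)
         for _ in range(num_mixtures)]
        for _ in range(num_states)
    ]
    transitions = {}
    for s in range(num_states):
        transitions[(s, s)] = 0.5      # stay in state s
        transitions[(s, s + 1)] = 0.5  # advance; s + 1 == num_states means exit
    return PhonemeHMM(mixtures, transitions)


# lambda = {lambda_1, ..., lambda_I}
model = {i: make_hmm(num_states=3, num_mixtures=2) for i in range(NUM_PHONEMES)}
```

Indexing `model[i].mixtures[s][m]` then mirrors the subscript triple (i, s, m) used throughout the description.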

[0044] The initial model may be either a speaker-dependent model or a speaker-independent model.

[0045] Note that μ_{i,s,m} is the same as C_k^R in the conventional example; for convenience of description in this embodiment, the symbol μ_{i,s,m} is used hereinafter.

[0046] Step 2: The HMMs of the representative speaker are concatenated to correspond to the phoneme sequence of the representative speaker's input speech, and training is performed. Only w_{i,s,m} and μ_{i,s,m} are trained, and λ_i^n = {w_{i,s,m}^n, a_{i,s1,s2}, μ_{i,s,m}^n, σ_{i,s,m}} is obtained as the model of the n-th speaker. Here n is the representative speaker number, n = 1, 2, ..., N; this embodiment uses 30 representative speakers, so N = 30.

[0047] Step 3: For each representative speaker, the difference of the means t_{i,s,m}^n is obtained.

[0048] For all (i, s, m) ∈ Ω: t_{i,s,m}^n = μ_{i,s,m}^n - μ_{i,s,m} (n = 1, 2, ..., N)

Here Ω denotes the set of index triples (i, s, m) of the mean vectors μ_{i,s,m} contained in λ.
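Step 3 is a per-index vector subtraction. A minimal sketch, assuming the mean vectors are held in dictionaries keyed by the triple (i, s, m); the function name and the toy two-dimensional data are hypothetical.

```python
def mean_differences(initial_means, speaker_means):
    """Return t[idx] = speaker mean - initial mean for every index in Omega."""
    return {
        idx: [s - r for s, r in zip(speaker_means[idx], initial_means[idx])]
        for idx in initial_means
    }


# Toy 2-D example; keys are (phoneme i, state s, mixture m).
initial = {("a", 1, 1): [1.0, 2.0], ("a", 1, 2): [0.5, 0.5]}
speaker_n = {("a", 1, 1): [1.5, 1.0], ("a", 1, 2): [0.5, 1.0]}

t_n = mean_differences(initial, speaker_n)
```

Calling the same function once per representative speaker n = 1, ..., N gives the full set of difference vectors.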

[0049] The difference of the means t_{i,s,m}^n is the same as the movement vector in the conventional example.

[0050] Step 4: The speaker subspace movement vectors v_{i,s,m}^n of the representative speaker are obtained according to Equation 1. Here the K mean vectors closest in distance to μ_{i,s,m} are used, so that a speaker subspace movement vector is obtained for each subspace.

[0051]

[Equation 1]

[0052] Here K_{i,s,m} is the set of indices of the K mean vectors in the neighborhood of μ_{i,s,m}, and D(a, b) denotes the distance between vectors a and b. f is a variable, called the fuzziness, that controls the value of the fuzzy membership function. Besides the fuzzy membership function, functions such as a triangular window, a rectangular window, or a Gaussian distribution may also be used.
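Equation 1 itself is not reproduced in this text, so the following is only a sketch of the idea described around it: a weighted average of the difference vectors t of the K nearest mean vectors. It uses a Gaussian window, which paragraph [0052] names as an admissible alternative to the fuzzy membership function. The `width` parameter, the Euclidean choice for D, and the function names are assumptions.

```python
import math


def dist(a, b):
    """D(a, b): Euclidean distance (an assumed choice of distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subspace_movement_vector(target, means, diffs, k, width=1.0):
    """Weighted average of the difference vectors of the k mean vectors
    nearest to means[target], using a Gaussian window as the weight."""
    centre = means[target]
    # K_{i,s,m}: indices of the k nearest mean vectors that have a difference.
    nearest = sorted(diffs, key=lambda idx: dist(centre, means[idx]))[:k]
    weights = [math.exp(-dist(centre, means[idx]) ** 2 / (2.0 * width ** 2))
               for idx in nearest]
    total = sum(weights)
    return [sum(w * diffs[idx][d] for idx, w in zip(nearest, weights)) / total
            for d in range(len(centre))]


# Toy 2-D data keyed by (phoneme i, state s, mixture m).
means = {("a", 1, 1): [0.0, 0.0], ("a", 1, 2): [1.0, 0.0], ("b", 1, 1): [10.0, 0.0]}
diffs = {("a", 1, 1): [1.0, 0.0], ("a", 1, 2): [3.0, 0.0], ("b", 1, 1): [100.0, 0.0]}

v = subspace_movement_vector(("a", 1, 1), means, diffs, k=2)
```

With k = 2 the distant component ("b", 1, 1) is excluded, and the result lies between the two nearby difference vectors, closer to the target's own. Passing a `diffs` dictionary that covers only trained indices restricts the neighborhood to those indices.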

[0053] Alternatively, t_{i,s,m}^n itself may be used as the speaker subspace movement vector of the representative speaker.

[0054] The training need only include at least μ_{i,s,m} among {w_{i,s,m}, a_{i,s1,s2}, μ_{i,s,m}, σ_{i,s,m}}. Naturally, all of {w_{i,s,m}, a_{i,s1,s2}, μ_{i,s,m}, σ_{i,s,m}} may be trained.

(B) Representative-speaker speaker subspace movement vector storage unit 12

The storage unit 12 stores the speaker subspace movement vectors v_{i,s,m}^n of the plurality of representative speakers calculated by the representative-speaker speaker subspace movement vector calculation unit 11.

(C) Input-speaker speaker subspace movement vector calculation unit 10a

Based on the model trained by the learning unit 3, the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker is obtained by the following steps. Here inp denotes the input speaker.

[0055] Step 1: The difference of the means t_{i,s,m}^{inp} is calculated.

[0056]

[Equation 2]

[0057] Step 2: The input-speaker subspace movement vector v_{i,s,m}^{inp} is obtained according to Equation 3.

[0058]

[Equation 3]

[0059] Here E denotes the set of indices of the mean vectors of the phoneme HMMs corresponding to the phonemes that appeared in the training speech material.

(D) Input-speaker speaker subspace movement vector storage unit 10b

The storage unit 10b stores the input-speaker subspace movement vector v_{i,s,m}^{inp} calculated by the input-speaker speaker subspace movement vector calculation unit 10a.

(E) Representative speaker selection unit 10c

Taking into account the branch probability of each component of the phoneme HMMs, the selection unit 10c selects, according to Equation 4, the number (spno) of the representative speaker whose subspace movement vectors v_{i,s,m}^n are close in distance to the input-speaker subspace movement vectors v_{i,s,m}^{inp}, together with the subspace movement vectors v_{i,s,m}^{spno} of that representative speaker.

[0060]

[Equation 4]
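Equation 4 is not reproduced in this text. The sketch below illustrates one plausible reading of the selection in (E): choose the representative speaker that minimizes a weighted sum of distances between movement vectors. Treating the branch probability as a per-index weight, summing over the indices available for the input speaker, and using Euclidean distance are all assumptions, as are the function names.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors (assumed choice of D)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_representative(v_inp, v_reps, weights):
    """Return spno, the number of the representative speaker whose movement
    vectors are closest to the input speaker's.

    v_inp:   {index: vector} for the input speaker
    v_reps:  {n: {index: vector}} for each representative speaker n
    weights: {index: float}, assumed to play the role of branch probabilities
    """
    def score(n):
        return sum(weights[idx] * dist(v_inp[idx], v_reps[n][idx])
                   for idx in v_inp)

    return min(v_reps, key=score)


# Toy 2-D data keyed by (phoneme i, state s, mixture m).
v_inp = {("a", 1, 1): [1.0, 0.0], ("a", 1, 2): [0.0, 1.0]}
v_reps = {
    1: {("a", 1, 1): [5.0, 0.0], ("a", 1, 2): [0.0, 4.0]},
    2: {("a", 1, 1): [1.2, 0.0], ("a", 1, 2): [0.0, 0.9]},
}
weights = {("a", 1, 1): 0.6, ("a", 1, 2): 0.4}

spno = select_representative(v_inp, v_reps, weights)
v_spno = v_reps[spno]
```

Here representative speaker 2 is selected because its movement vectors nearly coincide with those of the input speaker.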

[0061] (F) Speaker-adapted model construction unit 10d
The speaker-adapted model construction unit 10d obtains the speaker-adapted mean vector μ_{i,s,m}^{inp} using the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker obtained by the representative speaker selection unit 10c, the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker, and the mean vector μ_{i,s,m} of the initial model.

[0062]

[Equation 5]

[0063] In this embodiment, W is set to 0.5.
(G) Adapted model creation unit 13
The adapted model creation unit 13 creates the adapted model from the speaker-adapted mean vector μ_{i,s,m}^{inp} constructed by the speaker-adapted model construction unit 10d together with one of the following parameter sets:
the Gaussian weights w_{i,s,m}, transition probabilities a_{i,s1,s2}, and variance vectors σ_{i,s,m} of the initial model stored in the initial model storage unit 2; or
the Gaussian weights w_{i,s,m}^{inp}, transition probabilities a_{i,s1,s2}^{inp}, and variance vectors σ_{i,s,m}^{inp} of the input speaker; or
the Gaussian weights w_{i,s,m}^{spno}, transition probabilities a_{i,s1,s2}^{spno}, and variance vectors σ_{i,s,m}^{spno} stored in the representative-speaker speaker subspace movement vector storage unit 12.
(H) Adapted model storage unit 14
The adapted model storage unit 14 stores the adapted model created by the adapted model creation unit 13.
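The mean update performed by the speaker-adapted model construction unit 10d can be sketched as below. Equation 5 is not reproduced in this text, so the sketch assumes it has the same interpolation form that paragraph [0097] states for the adapted mean, with the selected representative speaker spno in place of n: the adapted mean is W times v^inp plus (1 - W) times v^spno plus the initial mean. The function name `adapted_mean` is hypothetical.

```python
from typing import List


def adapted_mean(mu: List[float],
                 v_inp: List[float],
                 v_spno: List[float],
                 w: float = 0.5) -> List[float]:
    """Speaker-adapted mean for one Gaussian component.

    Assumed form (taken from the interpolation in paragraph [0097]):
        mu_adapted = w * v_inp + (1 - w) * v_spno + mu
    w = 0.5 is the value used in the embodiment.
    """
    if not (len(mu) == len(v_inp) == len(v_spno)):
        raise ValueError("all vectors must have the same dimension")
    return [w * a + (1.0 - w) * b + m for a, b, m in zip(v_inp, v_spno, mu)]


mu_adapted = adapted_mean(mu=[1.0, 2.0], v_inp=[0.5, -1.0], v_spno=[1.5, 1.0])
```

With W = 0.5 the input speaker's own movement vector and the selected representative speaker's movement vector contribute equally. Setting `w=1.0` would ignore the representative speaker entirely.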

[0064] Using the configuration of the present invention described above, the learning process and the speaker adaptation process performed when the speaker-adapted model construction unit 10d constructs the speaker-adapted model are described below step by step, following the flowchart of FIG. 3 and with reference to FIGS. 1 and 2.

[0065] The flowchart of FIG. 3 can be broadly divided into a learning process in the first half (steps S1 to S5) and a speaker adaptation process in the second half (steps S6 to S9).

[0066] First, in step S1, the phoneme HMMs stored in the initial model storage unit 2 are taken as the initial model of the input speaker. In step S2, the phoneme HMMs stored in the initial model storage unit 2 are concatenated to create word HMMs. In step S3, the learning unit 3 trains the word HMMs using the training speech material. In step S4, the learning unit 3 decomposes the word HMMs back into phoneme HMMs. In step S5, it is determined whether a termination condition is satisfied, for example whether the mean vectors of the phoneme HMMs appearing in the training speech material have converged through repeated training. If the termination condition is satisfied, the process proceeds to step S6; otherwise it returns to step S2.
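The loop over steps S2 to S5 can be sketched as follows. The callable `reestimate` is a hypothetical stand-in for one pass of steps S2 to S4 (concatenate phoneme HMMs into word HMMs, train, decompose), and the convergence test (largest change in any mean element below a tolerance) is one possible reading of the termination condition, not the source's definition. The tolerance and iteration cap are likewise assumptions.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Means = Dict[Hashable, List[float]]


def train_until_converged(means: Means,
                          reestimate: Callable[[Means], Means],
                          tol: float = 1e-3,
                          max_iter: int = 50) -> Tuple[Means, int]:
    """Repeat re-estimation until the mean vectors stop moving (step S5)."""
    for iteration in range(1, max_iter + 1):
        new_means = reestimate(means)  # stands in for steps S2 to S4
        change = max(
            abs(a - b)
            for key in new_means
            for a, b in zip(new_means[key], means[key])
        )
        means = new_means
        if change < tol:  # termination condition satisfied: go on to step S6
            return means, iteration
    return means, max_iter


def toy_reestimate(means: Means) -> Means:
    """Toy update: move every mean halfway toward 1.0 (illustration only)."""
    return {k: [x + 0.5 * (1.0 - x) for x in v] for k, v in means.items()}


final_means, n_iter = train_until_converged({("b", 1, 1): [0.0]}, toy_reestimate)
```

A real implementation would replace `toy_reestimate` with Baum-Welch or Viterbi re-estimation over the word HMMs; the loop structure is the only part this sketch is meant to show.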

[0067] In step S6, the input-speaker speaker subspace movement vector calculation unit 10a obtains the difference vector t_{i,s,m}^{inp} between the mean vector of each component of the initial model and the mean vector of the trained phoneme HMM. In step S7, the same calculation unit 10a obtains the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker.

[0068] In step S8, the representative speaker selection unit 10c selects the number (spno) of the representative speaker whose subspace movement vector v_{i,s,m}^{n} is close to the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker. In step S9, the speaker-adapted model construction unit 10d obtains the speaker-adapted mean vector μ_{i,s,m}^{inp} using the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker obtained by the representative speaker selection unit 10c, the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker, and the mean vector μ_{i,s,m} of the initial model.

[0069] Training was performed with the HMM learning device of the present invention, and an evaluation test was conducted.

[0070] For the initial model, a speaker-independent model created from part of the speech material of 30 male speakers in the Acoustical Society of Japan (ASJ) continuous speech database was used. The representative speaker models were created from the 30 male speakers of the same database. Evaluation used 100 place-name words spoken by 70 male speakers, taken from the Denshikyo (JEIDA) common Japanese speech data. The analysis conditions were a sampling frequency of 12 kHz, a Hamming window length of 21.3 ms, 16th-order LPC analysis, and a frame period of 5 ms. The feature was a 33-dimensional vector consisting of the 16th-order LPC cepstrum, the 16th-order Δ cepstrum, and the Δ log power. The HMMs were 4-state, 3-loop Gaussian mixture models with diagonal covariance matrices, and the arcs leaving each state were tied arcs. The number of models was 39.
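As a quick check of the analysis conditions above, the following sketch converts them into sample counts and confirms the 33-dimensional feature size. Rounding the window to a whole number of samples and the frame-count formula (no padding at the signal edges) are assumptions made for illustration; the source does not state them.

```python
SAMPLE_RATE_HZ = 12_000   # sampling frequency from the text
WINDOW_MS = 21.3          # Hamming window length from the text
FRAME_PERIOD_MS = 5.0     # frame period from the text

# Window and frame shift expressed in samples (rounding is an assumption).
window_samples = round(SAMPLE_RATE_HZ * WINDOW_MS / 1000)       # 256
shift_samples = round(SAMPLE_RATE_HZ * FRAME_PERIOD_MS / 1000)  # 60

# 16 LPC cepstrum + 16 delta cepstrum + 1 delta log power.
feature_dim = 16 + 16 + 1                                       # 33


def frame_count(n_samples: int) -> int:
    """Number of full analysis frames in a signal, assuming no edge padding."""
    if n_samples < window_samples:
        return 0
    return 1 + (n_samples - window_samples) // shift_samples
```

The 21.3 ms window corresponds to about 256 samples at 12 kHz, a power of two that is convenient for windowed analysis. Under the stated no-padding assumption, one second of speech yields `frame_count(12_000)` frames of 33 values each.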

[0071] For the speaker adaptation speech material, a subset of the 100 place names (1 to 10 place names) was used. For the evaluation speech material, the place names not used for speaker adaptation were used. The evaluation was run twice with different speaker adaptation word sets. The recognition vocabulary consisted of the 100 place names.

[0072] Table 1 shows, for each number of adaptation words, the word recognition results obtained with the learning device of the present invention, the conventional word recognition results obtained without the speaker adaptation method, and the conventional word recognition results obtained with the speaker-independent model (initial model).

[0073]

[Table 1]

[0074] As this table shows, the recognition rate obtained with the learning device of the present invention is higher than that of the conventional word recognition results, which indicates that the HMM learning device of the present invention is effective.

[0075] In the learning unit 3 of this embodiment, as in the conventional learning unit 3, the mean vectors μ_{i,s,m}^{inp} of the Gaussian distributions of the unknown speaker's phoneme HMMs are re-trained. The invention is not limited to this, however: any combination that includes the Gaussian weights w_{i,s,m}, the variance vectors σ_{i,s,m}, or the transition probabilities a_{i,s1,s2} may be re-trained.

[0076] For step S8 of FIG. 3, only the concept was described above: the representative speaker selection unit 10c selects the number (spno) of the representative speaker whose subspace movement vector v_{i,s,m}^{n} is close to the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker. The inventor proposes the following two concrete representative speaker selection methods: (1) a method based on the distance between speaker space movement vectors, and (2) a method based on the likelihood of the HMM for the training speech. Each method is described in detail below.
(1) Method based on the distance between speaker space movement vectors
(a) Distance between speaker space movement vectors
This method has two variants: (a-1) using all speaker space movement vectors, and (a-2) using only the speaker space movement vectors for the trained phoneme HMMs. The representative speaker is selected with the following equations, respectively.

[0077] (a-1) When all speaker space movement vectors are used

[0078]

[Equation 6]

[0079] (a-2) When only the speaker space movement vectors for the trained phoneme HMMs are used

[0080]

[Equation 7]

[0081] Here, d_a is the distance measure given by Equation 8.

[0082]

[Equation 8]

[0083] Here, v_{i,s,m,d}^{inp} denotes the value of v_{i,s,m}^{inp} for the mean of the d-th element of the feature parameters. Since a 33-dimensional feature vector is used in this embodiment, 1 ≤ d ≤ 33. W_d denotes the weight of that element.
(b) Method based on the distance between speaker space movement vectors taking the variance of the Gaussian distribution into account
As above, this method has two variants: (b-1) using all speaker space movement vectors, and (b-2) using only the speaker space movement vectors for the trained phoneme HMMs. The representative speaker was selected with the following equations, respectively.
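The distance-based selection of method (a) can be sketched as follows. Equations 6 to 8 are not reproduced in this text, so the sketch assumes that d_a is a weighted squared difference summed over the feature dimensions with weights W_d, that the per-component distances are summed over the chosen set of components, and that the representative speaker with the smallest total is selected. The names `d_a`, `select_representative`, and the `keys` argument are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Optional

Vector = List[float]
VectorSet = Dict[Hashable, Vector]  # key stands for (phoneme i, state s, mixture m)


def d_a(v_inp: Vector, v_rep: Vector, weights: Vector) -> float:
    """Assumed form of the distance measure: sum over d of W_d * (difference)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, v_inp, v_rep))


def select_representative(v_inp: VectorSet,
                          reps: Dict[int, VectorSet],
                          weights: Vector,
                          keys: Optional[Iterable[Hashable]] = None) -> int:
    """Return the number (spno) of the closest representative speaker.

    keys=None uses every component (variant a-1); passing the set E of
    trained components restricts the sum to them (variant a-2).
    """
    use_keys = list(v_inp) if keys is None else list(keys)
    best_n, best_total = None, float("inf")
    for n, v_rep in reps.items():
        total = sum(d_a(v_inp[k], v_rep[k], weights) for k in use_keys)
        if total < best_total:
            best_n, best_total = n, total
    if best_n is None:
        raise ValueError("no representative speakers given")
    return best_n


v_input = {"a": [1.0, 0.0], "b": [0.0, 0.0]}
representatives = {
    1: {"a": [1.0, 0.0], "b": [5.0, 5.0]},
    2: {"a": [2.0, 0.0], "b": [0.0, 0.0]},
}
spno_all = select_representative(v_input, representatives, [1.0, 1.0])
spno_trained = select_representative(v_input, representatives, [1.0, 1.0], keys=["a"])
```

In this toy data the two variants pick different speakers: over all components speaker 2 is closest, while restricting to the single component `"a"` favours speaker 1. This illustrates why the choice between the two variants can matter.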

[0084] (b-1) When all speaker space movement vectors are used

[0085]

[Equation 9]

[0086] (b-2) When only the speaker space movement vectors for the trained phoneme HMMs are used

[0087]

[Equation 10]

[0088] Here, d_b is the distance measure given by Equation 11.

[0089]

[Equation 11]

[0090] Naturally, if the variances are also re-trained when the initial model is re-trained for each representative speaker and for the input speaker, Equations 9, 10, and 11 become Equations 12, 13, and 14, respectively.

[0091]

[Equation 12]

[0092]

[Equation 13]

[0093]

[Equation 14]

[0094] Here, σ_{i,s,m,d}^{inp} denotes the variance value of σ_{i,s,m}^{inp} for the d-th element of the feature parameters. Since a 33-dimensional feature vector is used in this embodiment, 1 ≤ d ≤ 33 as above.
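A variance-aware distance of the kind used in method (b) can be sketched as follows. Equations 9 to 14 are not reproduced in this text, so the form below, a squared difference divided by the per-dimension variance of the Gaussian component, is an assumption for illustration, and the function name `d_b_sketch` is hypothetical.

```python
from typing import List


def d_b_sketch(v_inp: List[float],
               v_rep: List[float],
               variance: List[float]) -> float:
    """Assumed variance-aware distance: sum over d of (difference)^2 / variance_d.

    Dimensions in which the Gaussian is broad (large variance) contribute
    less, so a given shift counts for more where the distribution is narrow.
    """
    if any(var <= 0.0 for var in variance):
        raise ValueError("variances must be positive")
    return sum((a - b) ** 2 / var for a, b, var in zip(v_inp, v_rep, variance))


plain = d_b_sketch([1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
scaled = d_b_sketch([1.0, 1.0], [0.0, 0.0], [1.0, 4.0])
```

With unit variances the result equals the plain squared distance. Raising the variance of the second dimension to 4 reduces its contribution to a quarter.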

[0095] (2) Method based on the likelihood of the HMM for the training speech
This method selects the representative speaker by the likelihood L(Λ_j(λ'), o_j) obtained when the training speech material (O) is recognized using the phoneme HMMs (λ'). Here, Λ_j(λ') denotes the word model corresponding to the j-th training utterance o_j. The following two kinds of phoneme HMM were used, and the likelihood was computed for each. First, the likelihood L(Λ_j^{n}(λ^{n'}), o_j) is obtained using the HMM (λ^{n'}) in which the mean vectors μ_{i,s,m}^{n} of λ^{n} are moved by the speaker space movement vectors v_{i,s,m}^{n} of the n-th representative speaker. The i-th phoneme HMM of λ^{n'} is expressed as λ^{n'} = {w_{i,s,m}, a_{i,sj,sk}, μ_{i,s,m}^{n'}, σ_{i,s,m}^{2}}, where μ_{i,s,m}^{n'} = v_{i,s,m}^{n} + μ_{i,s,m}, and the cumulative likelihood is calculated according to Equation 15.

[0096]

[Equation 15]

[0097] Next, the likelihood L(Λ_j^{n}(λ^{adapt}(n)), o_j) is obtained using the adapted HMM (λ^{adapt}(n)) produced by speaker adaptation based on the n-th representative speaker. The i-th phoneme HMM of λ^{adapt}(n) is expressed as λ_i^{adapt}(n) = {w_{i,s,m}, a_{i,sj,sk}, μ_{i,s,m}^{adapt}(n), σ_{i,s,m}^{2}}, where μ_{i,s,m}^{adapt}(n) = W v_{i,s,m}^{inp} + (1.0 - W) v_{i,s,m}^{n} + μ_{i,s,m}, and the cumulative likelihood is calculated according to Equation 16.

[0098]

[Equation 16]
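The likelihood-based selection can be sketched as follows. Equations 15 and 16 are not reproduced in this text, and a full implementation would score each training utterance against its word HMM. As a deliberately simplified stand-in, this sketch scores frames against a single diagonal-covariance Gaussian, sums the log-likelihoods, and selects the representative speaker with the largest total. The mean formulas are the ones stated in the text; everything else (function names, the single-Gaussian scoring) is an assumption for illustration.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def log_gauss(x: Vector, mean: Vector, var: Vector) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def candidate_mean(mu: Vector, v_rep: Vector,
                   v_inp: Optional[Vector], w: float) -> Vector:
    """Mean of the candidate model for one representative speaker.

    v_inp is None: shifted mean, mu' = v_rep + mu (first kind of HMM).
    otherwise:     adapted mean, w*v_inp + (1-w)*v_rep + mu (second kind).
    """
    if v_inp is None:
        return [m + r for m, r in zip(mu, v_rep)]
    return [w * a + (1.0 - w) * r + m for a, r, m in zip(v_inp, v_rep, mu)]


def select_by_likelihood(frames: List[Vector], mu: Vector, var: Vector,
                         reps: Dict[int, Vector],
                         v_inp: Optional[Vector] = None, w: float = 0.5) -> int:
    """Return the speaker number whose candidate model best explains the frames."""
    best_n, best_ll = None, -math.inf
    for n, v_rep in reps.items():
        mean = candidate_mean(mu, v_rep, v_inp, w)
        ll = sum(log_gauss(x, mean, var) for x in frames)  # cumulative log-likelihood
        if ll > best_ll:
            best_n, best_ll = n, ll
    if best_n is None:
        raise ValueError("no representative speakers given")
    return best_n


reps = {1: [1.0], 2: [-1.0]}
chosen_shifted = select_by_likelihood([[0.9], [1.1]], [0.0], [1.0], reps)
chosen_adapted = select_by_likelihood([[0.9], [1.1]], [0.0], [1.0], reps, v_inp=[1.0])
```

Because every candidate model has to be evaluated against all training frames, this approach costs more computation than comparing movement vectors directly.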

[0099] After the representative speaker has been selected by one of the representative speaker selection methods described above, processing similar to step S9 of FIG. 3 is performed.

[0100] FIG. 10 shows the recognition results obtained using the representative speakers selected by the representative speaker selection methods proposed here.

[0101] As FIG. 10 makes clear, among the methods (1) based on the distance between speaker space movement vectors, those that take the variance into account (D_b1 and D_b2 above) give a higher recognition rate than those that use the plain distance between speaker space movement vectors (D_a1 and D_a2 above). Among the former, using the distance between the speaker space movement vectors of only the trained phonemes (D_b2 above) gives slightly higher recognition performance than using the distance based on all speaker space movement vectors (D_b1 above). This is thought to be caused by interpolation errors in the movement vectors of the phonemes that were not trained.

[0102] Meanwhile, the method based on the distance between speaker space movement vectors taking the variance of the Gaussian distribution into account (D_b1 and D_b2 above) performs almost as well as the method based on the likelihood of the HMM for the training speech (L_1 and L_2 above). In addition, the method based on the distance between speaker space movement vectors has the advantage of requiring less computation than the likelihood-based method.

[0103] These results show that, in the method based on the distance between speaker space movement vectors, taking the variance of the Gaussian distribution into account gives a better recognition rate.

[0104] Furthermore, the recognition rate is expected to improve if the distance measure D used in the fuzzy membership function (for example, F_{i',s',m'} in Equation 1) is likewise replaced by the variance-aware distance measure d.

[0105]

[Effects of the Invention] As is clear from the above description, according to the present invention, the speaker-adapted mean vector μ_{i,s,m}^{inp} is obtained using the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker obtained by the representative speaker selection unit, the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker, and the mean vector μ_{i,s,m} of the initial model. As a result, a model appropriate for the speaker can be obtained even when learning with a small number of learning models.

[Brief description of drawings]

FIG. 1 is a schematic configuration diagram of an HMM learning device according to the present invention.

FIG. 2 is a configuration diagram of the HMM learning device according to the present invention, centered on the speaker adaptation unit.

FIG. 3 is a flowchart of the learning process and the speaker adaptation process according to the present invention.

FIG. 4 is a conceptual diagram of conventional word recognition based on phoneme HMMs.

FIG. 5 shows speech patterns representing the relationship between power and time for the input speech "ba" and "bu".

FIG. 6 shows the division of the phoneme /b/ into sections, and the approximation of each section by a Gaussian distribution.

FIG. 7 is a schematic configuration diagram of a conventional HMM learning device based on HMM speaker adaptation, and of a speech recognition device using this learning device.

FIG. 8 is a diagram showing the correspondence of the mean vectors of the initial model before and after re-training.

FIG. 9 is a conceptual diagram of the smoothing of movement vectors.

FIG. 10 is a diagram comparing the recognition rates of the representative speaker selection methods in the present invention.
[Explanation of symbols]
1 ... speech analysis unit
2 ... initial model storage unit
3 ... learning unit
4 ... speaker adaptation unit
5 ... adapted model storage unit
10a ... input-speaker speaker subspace movement vector calculation unit
10b ... input-speaker speaker subspace movement vector storage unit
10c ... representative speaker selection unit
10d ... speaker-adapted model construction unit
11 ... representative-speaker speaker subspace movement vector calculation unit
12 ... representative-speaker speaker subspace movement vector storage unit
13 ... adapted model creation unit
14 ... adapted model storage unit

Continuation of front page

(56) References
Okura, Onishi, Iida, "Speaker adaptation of speaker-independent HMMs based on speaker space movement vectors of multiple representative speakers," IEICE Technical Report [Speech], Japan, June 16, 1994, Vol. 94, No. 90, SP94-21, pp. 53-60
Okura, Onishi, Iida, "Representative speaker selection methods in the speaker adaptation method based on speaker space movement vectors," Proceedings of the 1994 Autumn Meeting of the Acoustical Society of Japan, Japan, October 31, 1994, 2-8-22, pp. 81-82
Okura, Onishi, Iida, "Speaker adaptation of a speaker-independent model based on speaker space movement vectors," Proceedings of the 1994 Spring Meeting of the Acoustical Society of Japan, Japan, March 23, 1994, 3-7-9, pp. 105-106
Miyanaga, Sagayama, "A study of standard speaker selection in the movement vector field smoothing speaker adaptation method," Proceedings of the 1992 Autumn Meeting of the Acoustical Society of Japan, Japan, October 5, 1992, 2-5-2, pp. 121-122
Okura, Sugiyama, Sagayama, "Movement vector field smoothing speaker adaptation for mixture continuous-density HMMs," IEICE Transactions D-II, Japan, December 1993, Vol. J76-D-II, No. 12, pp. 2469-2476

(58) Fields searched (Int.Cl.7, DB name)
G10L 15/06
JICST file (JOIS)

Claims

(57) [Claims]

1. An HMM learning device characterized in that, from among the speaker subspace movement vectors v_{i,s,m}^{n} of a plurality of representative speakers, the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker that is close in distance to the speaker subspace movement vector v_{i,s,m}^{inp} of an input speaker, obtained from a small amount of training speech material of the input speaker, is selected, and a speaker-independent HMM is adapted to the input speaker by modifying the selected speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker.

2. An HMM learning device characterized by comprising:
a speech analysis unit (1) that analyzes the features of input speech;
an initial model storage unit (2) that stores an initial model of the HMM;
a learning unit (3) that trains the HMM stored in the initial model storage unit (2) using the result of analyzing the input speaker's speech in the speech analysis unit (1);
an input-speaker speaker subspace movement vector calculation unit (10a) that calculates the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker using a difference vector obtained from the difference between the mean vector μ_{i,s,m}^{inp} of the input speaker's HMM trained in the learning unit (3) and the mean vector μ_{i,s,m} of the HMM stored in the initial model storage unit (2);
an input-speaker speaker subspace movement vector storage unit (10b) that stores the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker obtained by the calculation unit (10a);
a representative-speaker speaker subspace movement vector storage unit (12) that stores the speaker subspace movement vectors v_{i,s,m}^{n} of the representative speakers;
a representative speaker selection unit (10c) that selects the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker that is close in distance to the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker stored in the storage unit (10b);
a speaker-adapted model construction unit (10d) that obtains the speaker-adapted mean vector μ_{i,s,m}^{inp} using the speaker subspace movement vector v_{i,s,m}^{spno} of the representative speaker obtained by the representative speaker selection unit (10c), the speaker subspace movement vector v_{i,s,m}^{inp} of the input speaker, and the mean vector μ_{i,s,m} of the initial model; and
an adapted model storage unit (14) that stores the speaker-adapted mean vector μ_{i,s,m}^{inp}.