JP2004279454A

JP2004279454A - Method for speech generation model speaker adaptation, and its device, its program, and its recording medium

Info

Publication number: JP2004279454A
Application number: JP2003066847A
Authority: JP
Inventors: Sadao Hiroya; 定男廣谷; Masaaki Honda; 雅彰誉田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-03-12
Filing date: 2003-03-12
Publication date: 2004-10-07
Anticipated expiration: 2023-03-12
Also published as: JP4230254B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech generation model adaptation method that can adapt to a speaking style based on the speaker's articulatory movements and achieve sufficient accuracy; that is, to adapt not only the acoustic space but the speech generation model itself, based on articulatory movement estimated from speech with dynamic features taken into account.

SOLUTION: The speech generation model is a statistical model consisting of hidden Markov models (hereinafter HMMs) that serve as dynamic models of articulatory movement and, for each HMM state, a linear function that determines the speech spectrum from the articulatory parameters (the articulatory/acoustic mapping). The model is adapted to given speech as follows. The speech generated by the model is adapted so that the output probability of the given speech is maximized; using the adapted speech, the articulatory movement that maximizes the posterior probability is estimated from the given speech; and adaptation is then performed so as to maximize the output probability of that articulatory movement and also the output probability of the given speech with respect to the speech generated from the estimated articulatory movement. The speech generation model is thereby adapted so that the output probability of the given speech is maximized.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、調音運動を表現する確率的な動的モデル（以下調音モデルと書く）と、調音パラメータベクトルと音声パラメータベクトルとを関連付ける調音・音響マッピング係数を含む音声生成モデルを、入力された話者の音声信号を用いて適応化する音声生成モデル適応化方法、その装置、プログラム及びその記録媒体に関する。
【０００２】
【従来の技術】
音声信号からその音声の調音運動の逆推定手法として、調音運動の動的な振舞いを記述した隠れマルコフモデル（以下ＨＭＭと書く）と、調音運動の調音パラメータベクトルと音声スペクトル（音声パラメータベクトル）との関係を関数近似するための調音・音響マッピング係数とにより構成される音声生成モデルに基づき、音声信号からその音声の調音運動（調音パラメータベクトル系列）を逆推定する方法を提案した（非特許文献１）。
【０００３】
しかし、調音運動の逆推定に関する研究は特定話者を対象としたものが多い。これまでに不特定話者の音声入力を対象とした研究は、ニューラルネットワークを用いた逆推定法に基づく話者適応化法が考えられているが、入力音声とモデル音声との声道長正規化に基づくものであった（Ｓ．ＤｕｓａｎａｎｄＬ．Ｄｅｎｇ，“Ｖｏｃａｌ−ＴｒａｃｔＬｅｎｇｔｈＮｏｒｍａｌｉｚａｔｉｏｎｆｏｒＡｃｏｕｓｔｉｃ−ｔｏ−ＡｒｔｉｃｕｌａｔｏｒｙＭａｐｐｉｎｇＵｓｉｎｇＮｅｕｒａｌＮｅｔｗｏｒｋｓ，”ｉｎＴｈｅ１３８^ｔｈＭｅｅｔｉｎｇｏｆｔｈｅＡｃｏｕｓｔｉｃＳｏｃｉｅｔｙｏｆＡｍｅｒｉｃａ，１９９９．）。
また、ある音声パラメータを持つモデルに対して、入力音声パラメータの出力確率（尤度）を最大にするようにそのモデルのパラメータを適応化する手法がある（非特許文献２）。
【０００４】
【非特許文献１】
ＳａｄａｏＨｉｒｏｙａａｎｄＭａｓａａｋｉＨｏｎｄａ，“Ａｃｏｕｓｔｉｃ−ｔｏ−ａｒｔｉｃｕｌａｔｏｒｙｉｎｖｅｒｓｅｍａｐｐｉｎｇｕｓｉｎｇａｎＨＭＭ−ｂａｓｅｄｓｐｅｅｃｈｐｒｏｄｕｃｔｉｏｎｍｏｄｅｌ，”ｉｎＩＣＳＬＰ，２００２，ｐｐ．２３０５−２３０８．
【非特許文献２】
Ｃ．Ｊ．ＬｅｇｇｅｔｔｅｒａｎｄＰ．Ｃ．Ｗｏｏｄｌａｎｄ，“Ｍａｘｉｍｕｍｌｉｋｅｌｉｈｏｏｄｌｉｎｅａｒｒｅｇｒｅｓｓｉｏｎｆｏｒｓｐｅａｋｅｒａｄａｐｔａｔｉｏｎｏｆｃｏｎｔｉｎｕｏｕｓｄｅｎｓｉｔｙｈｉｄｄｅｎｍａｒｋｏｖｍｏｄｅｌｓ，”ｉｎＣｏｍｐｕｔｅｒＳｐｅｅｃｈａｎｄＬａｎｇｕａｇｅ，ｖｏｌ．９，ｐｐ．１７１−１８５，１９９５．
【０００５】
【発明が解決しようとする課題】
従来の声道長正規化に基づく不特定話者の音声入力を対象とした調音運動の逆推定の研究は、入力音声の音響空間を、特定話者のモデルの音響空間に適応させるものである。しかしながら、音声スペクトルと調音運動の間の冗長性から、音響空間の適応が調音運動の適応に直接結びつかず、したがって、音響空間の適応だけでは、発話者の調音運動に基づく発話スタイルを適応することができない。また、従来の適応化法の研究では、あるモデルのパラメータに対して、入力パラメータ系列の尤度を上げるようにパラメータを適応化していくため、高い尤度ではあるが、パラメータに関する動的な制約などは利用していないため、十分な精度が得られない。この発明の目的は発話者の調音運動に基づく発話スタイルを適応することができ、かつ十分な精度が得られる音声生成モデル適応化方法、その装置、プログラム及びその記録媒体を提供することにある。つまりこの発明が解決しようとする課題は、音響空間の適応だけではなく、音声生成モデル自体を、音声から動的な特徴を考慮して推定した調音運動に基づいて適応化することであるとも云える。
【０００６】
【課題を解決するための手段】
この発明によれば、入力話者の音声信号から、（１）既存の音声生成モデルにより生成される音声スペクトルを、入力された音声スペクトルに適応化することにより、音声生成モデルを入力話者に適応する、あるいは（２）入力話者の音声信号から既存の音声生成モデルに基づきその入力音声の調音運動を決定し、この決定された調音運動を用いて、その既存の音声生成モデル中の調音運動軌道の確率的な動的モデル（調音モデルと書く）、更に必要に応じて調音パラメータベクトルに対する音声スペクトルベクトルを関連させるマッピング係数を入力話者に適応化する。
【０００７】
前記（１）の方法は、既存の音声生成モデルにより生成される音声スペクトル（パラメータ）ベクトルを、入力話者の入力音声の音声スペクトル（パラメータ）ベクトル系列の出力確率が最大となる適応化を行う。その適応化を関連付ける関係係数を用いて音声生成モデル中の調音・音響マッピング係数を適応化する。
前記（２）の方法は、入力話者の入力音声スペクトル（パラメータ）ベクトル系列に対する事後確率が最大となる調音運動（調音パラメータベクトル系列）を、既存の音声生成モデルを用いて決定し、この決定された調音運動（調音パラメータベクトル系列）の出力確率が最大となるように調音モデルを適応化する。また、必要に応じて上記の決定された調音運動から生成される音声スペクトル（パラメータ）ベクトルに対する、入力話者の入力音声スペクトル（パラメータ）ベクトル系列の出力確率が最大となるように調音・音響マッピング係数を適応化する。
【０００８】
また、前記（２）の方法は、前記（１）の方法を組み合わせ、初めに既存の音声生成モデルにより生成される音声スペクトル（パラメータ）ベクトルを、入力話者の入力音声の音声スペクトル（パラメータ）ベクトル系列に適応化した後、この適応化された音声スペクトル（パラメータ）ベクトルを用いて、入力された音声スペクトル（パラメータ）ベクトル系列から調音運動（調音パラメータベクトル）を決定し、決定された調音運動を用いて、少なくとも調音モデルを適応化し、更に必要に応じて調音・音響マッピング係数を適応化する。
【０００９】
【発明の実施の形態】
まずこの発明における適応化の対象である音声生成モデルの作成方法を説明する。
モデル作成
文章を連続発声した音声信号と、磁気センサシステムにより同時観測された調音データを用いて、調音・音響対コードブックを作成する。音声信号はフレームごとに、例えば毎秒２５０回のレートで、窓長３２ｍｓのブラックマン窓で切り出され、スペクトル分析され、例えば０次項を除いた１６次のメルケプストラム係数が音声パラメータとして求められる。必要に応じてその音声パラメータから微分（差分）により、時間的変化として速度、加速度のパラメータが検出され、これら音声パラメータと速度、加速度パラメータを要素とするベクトルが音声パラメータベクトルｙとして生成される。
【００１０】
同時に観測された調音器官の複数の各位置、例えば下顎と、上・下唇と、舌上の４箇所と軟口蓋の計８点のそれぞれについての水平方向および垂直方向における各位置情報信号が毎秒２５０回のレートで取り込まれ、その各位置パラメータから必要に応じて、微分（差分）により時間的変化としての速度パラメータが求められ、更に必要に応じて各速度パラメータの微分（差分）により時間的変化としての加速度パラメータが求められる。これら各１６個の位置パラメータ、速度パラメータ、加速度パラメータを要素とする調音パラメータベクトルｘが生成される。
【００１１】
つまりこの例では音声パラメータベクトルｙ、調音パラメータベクトルｘはそれぞれ下記のように４８個の要素からなるベクトルである。
ｙ＝［ｋ_１，……，ｋ_１６，ｋ_１′，……，ｋ_１６′，ｋ_１″，……，ｋ_１６″］
ｘ＝［ｐ_ａ，……，ｐ_ｎ，ｐ_ａ′，……，ｐ_ｎ′，ｐ_ａ″，……，ｐ_ｎ″］
このようにして同一の時点において求まった音声パラメータベクトルｙと調音パラメータベクトルｘを対とするデータを複数個、例えば２０万セット保持して調音・音響対コードブックを構成する。
【００１２】
このようにして求めた調音パラメータベクトルｘおよび音声パラメータベクトルｙを用いて調音運動を表現する確率的な動的モデル（以下調音モデルと書く）、この例では隠れマルコフモデル（以下ＨＭＭと記す）λを作成する。このＨＭＭのモデルλの作成は、前記文章の連続発声により得られた全体の音声パラメータベクトル系列ｙの出力確率Ｐ（ｙ，ｑ｜λ）が最大となるようにする。ここでｑは全体の音声パラメータベクトル系列ｙに対する状態系列を表す。この例ではＨＭＭのモデルλの構造は、２音素組の３状態１混合ガウス分布で、スキップなしのｌｅｆｔ−ｔｏ−ｒｉｇｈｔモデルとする。例えば図１に示すように３つの状態ｑ_１，ｑ_２，ｑ_３があり、各状態での調音パラメータベクトル、音声パラメータベクトルの各出力確率をそれぞれ１つのガウス分布とし、状態過程は同一状態から同一状態への遷移（ループ）と、ｑ_１からｑ_２又はｑ_２からｑ_３への遷移の計５つのみである。各音素について次に続く異なる音素ごとにモデルが作られる。
【００１３】
調音パラメータベクトル系列ｘを含むモデルにおいては、状態系列ｑを構成する各１つの状態をｑ_ｊとする時、状態ｑ_ｊの音声パラメータベクトルｙの出力確率は、その状態ｑ_ｊへの遷移確率Ｐ_ｔ＝Ｐ（ｑ_ｊ｜λ）と、その状態ｑ_ｊに対する調音パラメータベクトルｘの出力確率Ｐ_ｘ＝Ｐ（ｘ｜ｑ_ｊ，λ）と、その状態ｑ_ｊに対する調音パラメータベクトルｘに対する音声パラメータベクトルｙの出力確率Ｐ_ｙ＝Ｐ（ｙ｜ｘ，ｑ_ｊ，λ）との積である。従ってＰ（ｙ，ｑ_ｊ｜λ）＝∫Ｐ（ｙ｜ｘ，ｑ_ｊ，λ）Ｐ（ｘ｜ｑ_ｊ，λ）Ｐ（ｑ_ｊ｜λ）ｄｘが最大となるように各モデルを作成すればよい。ここで与えられた調音パラメータベクトルに対する音声パラメータの出力確率Ｐ（ｙ｜ｘ，ｑ_ｊ，λ）と、調音パラメータベクトルの出力確率Ｐ（ｘ｜ｑ_ｊ，λ）は共にガウス分布を仮定する。
【００１４】
図２にモデル作成処理手順例を示す。この学習法は「ビタビ学習法」と呼ばれるものである。まず入力音声パラメータベクトル系列ｙ及び入力調音パラメータベクトル系列ｘと発声文章との対応付けにより各同一音素の両パラメータベクトル対を集め、その各音素ごとに、その複数の各パラメータベクトル対ごとに前記３状態ｑ_１，ｑ_２，ｑ_３をそれぞれ同一時間長として対応付け、各状態ごとにモデルパラメータＡ，ｂ，ｘ_ｍ，σ_ｘ，ｗ_ｍ，σ_ｗを演算し、つまり初期モデルλを作って記憶する（Ｓ１）。
【００１５】
つまり調音パラメータベクトルｘから音声パラメータベクトルｙを決定する関数ｙ＝ｆ（ｘ）として、ｙ＝Ａｘ＋ｂを用い、調音パラメータベクトルｘを用いて計算した音声パラメータベクトルｙ′＝Ａｘ＋ｂと、その調音パラメータベクトルｘと対をなす音声パラメータベクトルｙとの二乗誤差が最小となるＡとｂを求め、かつｙ′のｙに対する誤差ｗを求め、その誤差ｗの平均ｗ_ｍを計算し、更に誤差ｗの共分散σ_ｗを計算し、調音パラメータベクトルｘの平均ｘ_ｍを計算し、調音パラメータベクトルｘの共分散σ_ｘを計算し、状態遷移確率γを計算する。初期状態遷移確率γは自己遷移確率を０．８、ある状態から他の状態に遷移する確率を０．２など適当な値に設定し、その後はある状態ｑ_ｊに注目した場合、その状態に対応するフレームすべてに対して、同じ状態に遷移するフレームの数をその状態に対応するフレームの総数で割った値を自己遷移確率とし、ある状態から他の状態に遷移する確率を（１−自己遷移確率）として計算する。
【００１６】
これらモデルパラメータＡ，ｂ，ｗ_ｍ，σ_ｗ，ｘ_ｍ，σ_ｍ，Ｐ_ｔを各音素の各状態ごとに計算して音素対応に記憶する。なお、変換関数はこの例では左辺のベクトルｙは要素数が４８であり、右辺中のベクトルｘも要素数が４８であり、係数Ａは４８×４８の行列となり、定数ｂも要素数が４８のベクトルとなる。
従ってＡ，ｂを決定するにはｙとｘの対を最低４８個必要とする。
次にこの初期モデルλに対して入力音声パラメータベクトルｙの出力確率
Ｐ（ｙ，ｑ｜λ）＝∫Ｐ（ｙ｜ｘ，ｑ，λ）Ｐ（ｘ｜ｑ，λ）Ｐ（ｑ｜λ）ｄｘ（１）
が最大になるように音声パラメータベクトルｙおよび調音パラメータベクトルｘに状態ｑ_ｊを対応付けることをビタビ（Ｖｉｔｅｒｂｉ）アルゴリズムを用いて決定する（Ｓ２）。つまり前記文章の最初の音素を初期値としてその各状態における調音パラメータベクトルｘに対する音声パラメータベクトルｙの出力確率Ｐ（ｙ｜ｘ，ｑ，λ）と調音パラメータベクトルｘの出力確率Ｐ（ｘ｜ｑ，λ）とを、先に記憶したモデルを参照して、確率がガウス分布していることに基づき、それぞれ下記式（２）、式（３）により求める。
【００１７】
Ｐ（ｙ｜ｘ，ｑ，λ）＝［１／（（２π）^Ｎ／２）｜σ_ｗ｜^１／２］×ｅｘｐ［−（１／２）（ｙ−Ａｘ−ｂ−ｗ_ｍ）^Ｔσ_ｗ ^−１（ｙ−Ａｘ−ｂ−ｗ_ｍ）］（２）
Ｐ（ｘ｜ｑ，λ）＝［１／（（２π）^Ｍ／２）｜σ_ｘ｜^１／２）］×ｅｘｐ［−（１／２）（ｘ−ｘ_ｍ）^Ｔσ_ｘ ^−１（ｘ−ｘ_ｍ）］（３）
Ｎはベクトルｙの次数、Ｍはベクトルｘの次数であり、前記例では共に４２であり、（）^Ｔは行列の転置を表わす。
【００１８】
また遷移確率Ｐ（ｑ｜λ）を求め、Ｐ（ｙ｜ｘ，ｑ，λ）とＰ（ｘ｜ｑ，λ）とＰ（ｑ｜λ）の積をブランチメトリックとし、各状態について求めたブランチメトリックの最大のものを生き残りパスとし、そのブランチメトリックをそれまでのパスメトリックに加算することを順次行う。最終的に得られたパスメトリックの最大の状態系列ｑが式（１）を最大とするものである。
次にこの状態系列ｑの決定の際に求まった入力音声パラメータベクトルｙの出力確率の最大値、つまり最大パスメトリックの値が収束したかを調べ（Ｓ３）、収束していなければステップＳ２で決定された状態系列ｑと入力音声パラメータベクトル系列ｙ及び入力調音パラメータベクトル系列ｘとを対応付け、その
状態系列ｑにおけるモデルからモデルへの変化点を検出して、音素区間の入力音声パラメータベクトル系列ｙ及び入力調音パラメータベクトル系列ｘに対す
る対応付けを再設定する（Ｓ４）。
【００１９】
この再設定された各音素についての音声パラメータベクトル及び調音パラメータベクトルの集合について、各モデルパラメータＡ，ｂ，ｗ_ｍ，σ_ｗ，ｘ_ｍ，σ_ｘ，Ｐ_ｔをそれぞれ演算し、つまり音素モデルを作成し、記憶していた対応モデルパラメータを更新記憶してステップＳ２に戻る（Ｓ５）。
以下ステップＳ２〜Ｓ５を繰返すことにより、得られる音声パラメータベクトルの出力確率の最大値はほぼ一定値となり、つまりステップＳ３で収束したことが検出されて終了とする。
【００２０】
このようにして得られたＨＭＭの各モデルは、例えば図３に示すように各音素対応のモデルλ１〜λＪの格納部２５−１〜２５−Ｊごとに状態遷移確率γ（これは前述したように各ループと隣りへとの計５つの確率よりなる）が遷移確率格納部２７に格納され、各状態ごとのＡ，ｂが係数格納部２８に格納され、ｘ_ｍ，ｗ_ｍが平均格納部２９に、σ_ｍ，σ_ｗが共分散格納部３１に格納される。係数Ａ，ｂは調音パラメータベクトルＸと対応した音声パラメータベクトルｙの近似値を対応ずけるためのパラメータであるから調音・音響マッピング係数と呼ぶ。その他のパラメータｘ_ｍ，ｗ_ｍ，σ_ｍ，σ_ｗは調音モデルと呼ぶ。またＰ（ｙ｜ｘ，ｑ，λ）は式（２）で計算され、Ｐ（ｘ｜ｑ，λ）は式（３）で計算されるから、調音パラメータベクトルｘに対する音声パラメータベクトルｙの出力確率、また調音パラメータベクトルｘの出力確率もモデル記憶部２５に格納されていると云える。モデル作成方法として「ビタビ学習法」を示したが、より精度の良い学習法「ＥＭ学習法」（Ｅｘｐｅｃｔａｔｉｏｎ−Ｍａｘｉｍｉｚａｔｉｏｎ）を用いてもよい。
【００２１】
第１実施形態
この発明の第１実施形態においては既存の音声生成モデルにより生成される音声パラメータ（スペクトル）ベクトルを、入力話者の入力音声パラメータ（スペクトル）ベクトル系列の出力確率が最大となるように適応化し、この生成音声パラメータベクトルと入力音声パラメータベクトル系列とを関係付ける係数を用いて、音声生成モデル中の調音・音響マッピング係数を適応化する。
以下この第１実施形態を、図４及び図５を参照して説明する。話者の入力音声信号は入力端子１１からディジタル信号として入力され信号記憶部４２に一旦格納される（Ｓ１）。この話者入力音声信号は音声パラメータベクトル生成部４３において、フレームごとに入力音声パラメータ（スペクトル）ベクトルｙが生成され、入力音声パラメータベクトル系列Ｙが生成される（Ｓ２）。例えば入力音声信号はフレームごとにスペクトル分析され、音声パラメータが検出され（Ｓ２−１）、更にそのスペクトルの時間的変化としての速度、加速度パラメータが検出され（Ｓ２−２）、これら両パラメータにより音声パラメータベクトルとされ、各フレームの音声パラメータベクトルの時系列が音声パラメータベクトル系列Ｙとされる。これらパラメータとしては、適応化の対象であるモデル記憶部４８に記憶されている音声生成モデルの作成時に用いた音声パラメータと同一のもの、前記例では０次項を除いた１６次のメルケプストラム係数とその速度パラメータが検出される。この入力音声パラメータベクトル系列Ｙは記憶部４２に一旦格納される（Ｓ３）。なお話者入力音声信号と対応した文章の音素列が音素列記憶部４５に格納される。
【００２２】
音声生成モデルより生成された音声パラメータベクトルの平均ベクトルｙ_ｍｊに対して、入力音声パラメータベクトル系列Ｙ＝（Ｙ_１，…，Ｙ_２）の出力確率Ｐ（Ｙ｜ｑ，λ）が最大となるように、音声生成モデルの平均ベクトルを適応化する。出力確率Ｐ（Ｙ｜ｑ，λ）を最大化する平均ベクトルｙ_ｍｊは前記非特許文献２の１７４頁を参考にすると、対数尤度ｌｏｇＰ（Ｙ｜ｑ，λ）を最大にするように求めればよい。従って
ｌｏｇＰ（Ｙ｜ｑ，λ）＝Ｋ−（１／２）Σ_ｔΣ_ｊγ_ｔ（ｊ）（Ｙ_ｔ−Ｈ_ｓｙ_ｍｊ）^Ｔσ_ｙｊ ^−１（Ｙ_ｔ−Ｈ_ｓｙ_ｍｊ）
を最大にするＨ_ｓを
Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｙｊ ^−１Ｙ_ｔｙ_ｍｊ ^Ｔ＝Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｙｊ ^−１Ｈ_ｓｙ_ｍｊｙ_ｍｊ ^Ｔ（４）
を計算することで求めることができる。ここでｔはベクトル系列の離散的時刻を、ｊは各音素における状態番号をそれぞれ表わし、Ｋは定数、Ｈ_ｓは回帰係数であり、γ_ｔ（ｊ）は音声パラメータベクトルが時刻ｔで状態ｊに存在する確率であってγ_ｔ（ｊ）＝Ｐ（ｑ_ｔ＝ｊ｜ｙ，λ）であり、音声パラメータベクトルｙ_ｊの平均ベクトルはｙ_ｍｊ＝Ａ_ｊｘ_ｍｊ＋ｂ_ｊにより、ベクトルｙ_ｊの共分散行列はσ_ｙｊ＝Ａ_ｊσ_ｘｊＡ_ｊ ^Ｔ＋σ_ｗにより求める。（）^Ｔは転置行列を表わし、ｓは音響空間を分割するクラスタを表わす。つまり音声生成モデルλの全てを１つのクラスタとするか、あるいは母音と子音とを別のクラスタとして求めるなど、全音素モデルをいくつかのクラスタに分けて求める。
【００２３】
つまり図４、図５に示すように入力音声パラメータベクトル系列Ｙと、音声
生成モデルとを用いて、音声関係係数算出部４６で、音声関係係数Ｈ_ｓを計算する（Ｓ４）。モデル記憶部４４中の各モデルの調音・音響マッピング係数Ａ_ｊ，ｂ_ｊと調音平均ベクトルｘ_ｍｊを取出し、音声パラメータの平均ベクトルｙ_ｍｊを平均ベクトル生成部４７でｙ_ｍｊ＝Ａ_ｊｘ_ｍｊ＋ｂ_ｊの計算によりそれぞれ生成する（Ｓ４−１）。またモデル記憶部４４中の各モデルの調音パラメータベクトルｘ_ｊの共分散σ_ｘｊと調音平均ベクトルｘ_ｍｊの誤差ｗ_ｊの共分散σ_ｗｊを取り出し、音声パラメータベクトルｙ_ｊの共分散行列σ_ｙｊを共分散計算部４８でσ_ｙｊ＝Ａ_ｊσ_ｘｊＡ_ｊ ^Ｔ＋σ_ｗｊの計算によりそれぞれ生成する（Ｓ４−２）。更に音声関係係数算出部４６において、記憶部４５内の音素系列に従って各音素についてモデル記憶部４４内の対応音素モデルλの遷移確率γ_ｊを取出し、これと、入力音声パラメータベクトル系列Ｙと、平均ベクトル生成部４７よりの平均ベクトルｙ_ｍｊと、共分散計算部４８よりの共分散行列σ_ｙｊとを用いて、Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｙｊ ^−１Ｙ_ｔｙ_ｍｊ ^Ｔ，Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｙｊ ^−１ｙ_ｍｊｙ_ｍｊ ^Ｔを計算し、式（４）を満す回帰係数（音声関係係数）Ｈ_ｓを求める（Ｓ４−３）。
【００２４】
この音声関係係数Ｈ_ｓを用いて、音声生成モデルの調音・音響マッピング係数Ａ_ｊ，ｂ_ｊを、それぞれＨ_ｓＡ_ｊ，Ｈ_ｓｂ_ｊと係数適応化部４９で入力話者音声に適応化する（Ｓ５）。
更に必要に応じて、先に求めた音声パラメータベクトルの平均ベクトルｙ_ｍｊを音声適応化部５１でＹ_ｍｊ＝Ｈ_ｓｙ_ｍｊの計算により変更する（Ｓ６）。
この適応化された音声生成モデルを用いれば、調音パラメータベクトル系列を入力して、これと対応した前記入力話者の音声に近い音声信号を合成することができる。
なお図５において制御部５２は各部を順次動作させ、また各記憶部に対する読み書きを行う。
【００２５】
第２実施形態
次にこの発明の第２実施形態を図６及び図７を参照して説明する。第２実施形態は入力音声の調音運動を、適応対象音声生成モデルを用いて決定し、この調音運動の出力確率が最大となるように音声生成モデルを適応化する。
入力端子４１からの入力話者の入力音声信号から入力音声パラメータ（スペクトル）ベクトル系列Ｙを音声パラメータベクトル生成部４３で生成し（Ｓ２）、これを一旦記憶部４３に記憶する（Ｓ３）ことは第１実施形態と同様である。
【００２６】
この第２実施形態においては、入力音声パラメータベクトル系列Ｙの出力確率Ｐ（Ｙ｜ｑ，λ）を最大にする状態系列ｑを、記憶部４５の音素系列に基づき、例えばビタビアルゴリズムにより状態系列生成部６１で生成する（Ｓ４）。この生成の手法は先に述べたモデル作成法とほぼ同様に行えばよい。
次にこの状態系列ｑに対して事後確率Ｐ（ｘ｜ｙ，ｑ，λ）を最大にする調音運動、つまり調音パラメータベクトル系列を調音パラメータベクトル生成部６２で生成する（Ｓ５）。Ｐ（ｘ｜ｙ，ｑ，λ）を最大にする調音パラメータベクトル系列ｘは前記非特許文献１の２３０６頁左欄の記載から明らかなように次式（５）を最小化する系列ｘ _ｅを求めればよい。
【００２７】
Ｊ＝（Ｙ−Ａｘ _ｅ−ｂ）^Ｔσ_ｗ ^−１（Ｙ−Ａｘ _ｅ−ｂ）（５）
つまり非特許文献１中の式（４）（下記の式）により求める。
ｘ _ｅ＝（σ_ｘ ^−１＋Ａ^Ｔσ_ｗ ^−１Ａ）^−１（σ_ｘ ^−１ｘ_ｍ＋Ａ^Ｔσ_ｗ ^−１（ｙ−ｂ））
このようにして生成された調音パラメータベクトル系列ｘ _ｅと、出力確率Ｐ（ｘ _ｅ｜ｑ，λ）が最大となる調音パラメータベクトルｘ_ｅの平均ベクトルｘ_ｅｍとを関係付ける次式（６）を平均関係係数計算部６３で計算して、平均関係係数Ｃ_ｓを求める（Ｓ６）。
【００２８】
Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｘｊ ^−１ｘ_ｅｔｘ_ｍｊ ^Ｔ＝Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｘｊ ^−１Ｃ_ｓｘ_ｍｊｘ_ｍｊ ^Ｔ（６）
つまり生成された調音パラメータベクトル系列ｘ _ｅの各ベクトルｘ_ｅｔについて、記憶部４４中の音声生成モデルの対応音素モデルλの遷移確率γ_ｊ、共分散σ_ｘｊ、平均ｘ_ｍｊを取出し、Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｘｊ ^−１ｘ_ｅｔｘ_ｍｊ，Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｘｊ ^−１ｘ_ｍｊｘ_ｍｊ ^Ｔを計算して式（６）を計算して回帰係数Ｃ_ｓを求める。このようにして求めた平均関係係数Ｃ_ｓを用いて、記憶部４４中の調音平均ベクトルｘ_ｍｊを平均適応化部６４でＸ_ｍｊ＝Ｃ_ｓｘ_ｍｊとして調音平均ベクトルを適応化する（Ｓ７）。
【００２９】
この適応化された音声生成モデルを用いて入力話者の音声信号の調音運動（調音パラメータベクトル系列）を求めることにより、適応化前のモデルを用いる場合よりも高い精度で調音運動を求めることができる。
更に調音・音響マッピング係数Ａ_ｊ，ｂ_ｊも適応化する場合は次のようにする。
ステップＳ１５で生成された調音パラメータベクトル系列ｘ _ｅの各調音スペクトルベクトルｘ_ｅｔと対応する音声スペクトルベクトルを音素系列を参照しながら、音声生成モデルの調音・音響マッピング係数Ａ_ｊ，ｂ_ｊを用いて、音声ベクトル生成部６５で音声スペクトルベクトルｙ_ｊ＝Ａ_ｊｘ_ｅｔ＋ｂ_ｊを生成する（Ｓ８）。
【００３０】
この音声生成モデルを用いた音声スペクトルベクトルｙに対して、入力音声パラメータベクトル系列Ｙの出力確率Ｐ（Ｙ｜ｑ，λ）を最大にする調音・音響マッピング係数は、第１実施形態の場合と同様に、
Ｐ（Ｙ｜ｑ，λ）＝∫Ｐ（Ｙ｜ｘ，ｑ，λ）Ｐ（ｘ｜ｑ，λ）ｄｘの対数尤度ｌｏｇＰ（Ｙ｜ｑ，λ）を最大にすることにより与えられる。従って、式（４）の導出と同様に次式（７）を満す回帰係数（マッピング関係係数）Ｄ_ｓをマッピング関係係数算出部６６で算出する（Ｓ９）。
【００３１】
Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｗｊ ^−１Ｙ_ｔ（Ａ_ｊｘ_ｅｔ＋ｂ_ｊ）^Ｔ＝Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｗｊ ^−１Ｄ_ｓ（Ａ_ｊｘ_ｅｔ＋ｂ）（Ａ_ｊｘ_ｅｔ＋ｂ_ｊ）^Ｔ（７）
つまりマッピング関係係数計算部６６で、入力音声パラメータベクトル系列Ｙと、音声ベクトル生成部６５よりの音声ベクトルｙ_ｊと、モデル記憶部４４中の誤差の共分散σ_ｗｊ、遷移確率γ_ｊとを用いて、
Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｗｊ ^−１Ｙ_ｔｙ_ｊ ^Ｔ＝Σ_ｔΣ_ｊγ_ｔ（ｊ）σ_ｗｊ ^−１Ｄ_ｓｙ_ｊｙ_ｊ ^Ｔ
を満すＤ_ｓを計算する。
【００３２】
このマッピング関係係数Ｄ_ｓを用いてモデル記憶部４４中の各音声生成モデルの各調音・音響マッピング係数Ａ_ｊ，ｂ_ｊを係数適応化部６７でＤ_ｓＡ_ｊ，Ｄ_ｓｂ_ｊとして適応化する（Ｓ１０）。
ステップＳ４における状態系列の生成は、第１実施形態において音声適応化部５１で生成した適応化音声平均ベクトルＹ_ｍｊ（＝Ｈ_ｓｙ_ｍｊ）を用いて行ってもよい。この場合、回帰係数Ｃ_ｓとＤ_ｓで共通のクラスタを用いた場合、回帰係数の冗長性のため第１実施形態と同じ尤度になるが、適応化された音声生成モデルは第１実施形態と異なるものとなる。
【００３３】
以上のような各種の適応化法により音声生成モデルを話者音声に適応化し、その適応化音声生成モデルを用いて、その話者の音声信号に対する調音運動を例えば非特許文献１に示すように推定する。
実験
日本人男性３名によって発声された３５６文章の音声信号と調音データを用い、モデル作成の項で述べた条件でパラメータベクトルを生成し、３者ごとにモデルを作成し、各入力話者から、入力話者以外の話者２名のそれぞれのモデルに対して適応を行い、評価は計６つのテストの平均で行った。今回用いた調音データは、調音観測点上に小さな受信コイルを接着する磁気センサシステムを用いて観測された。しかし、話者毎に受信コイルを接着する位置が異なり、また、話者毎に調音器官の大きさが異なるため、入力話者の観測した調音運動と別の特定話者のモデルを用いて推定した調音運動は、そのままでは比較することができない。したがって、あらかじめ求めた入力話者の観測調音運動と別の話者の観測調音運動の位置とサイズの線形変換を用いて正規化し、評価を行った。適応の際には、教師ありの学習を用い、適応文章数は４０とした。調音運動の逆推定の際には、音素ありの条件を用いた（非特許文献１）。適応化法は、（Ａ）第１実施形態、（Ｂ）第２実施形態、（Ｃ）第１実施形態と第２実施形態との併用の３つで実験を行った。クラスタ数ｓは全適応化法において共通とした。
【００３４】
図８に、クラスタ数の値を１，３，５，１０としたときの、学習データに対する音響パラメータベクトルの対数尤度を示す。適応化法を用いることで、話者独立モデルよりも尤度が上昇することが分かる。また、（Ａ）法と（Ｃ）法の尤度はほぼ同じであり、（Ｂ）法の尤度はそれらに比べて低い。
なお、話者の音声と調音運動を用いて作ったモデル（話者モデル）を用いた場合と、話者と無関係の音声とその対応調音運動を用いて作ったモデル（話者独立モデル）を用いた場合についての実験結果も示した。
【００３５】
図９に、クラスタ数における、適応化法による調音運動の二乗誤差を示す。適応化法を用いることで、すべてのクラスタ数において話者独立モデルよりも誤差が減少している。（Ａ）法を用いた場合、調音運動に関する適応は行われないため、クラスタの数によらず、誤差はほぼ一定である。一方、（Ｂ）法の場合、クラスタ数の増加につれて誤差が減少していく。（Ｃ）法は、クラスタ数が５までは（Ｂ）法よりも誤差が小さいが、クラスタ数が１０では尤度が高いにも関わらず、誤差が大きくなっている。
【００３６】
図１０に各種音素毎の二乗誤差を示す。評価はクラスタ数１０を用いて行った。‘Ｔｏｔａｌ’は発声全体の二乗誤差であり、‘Ｖｏｗｅｌ’から‘Ｎａｓａｌ’まではそれぞれ、その発声の際に重要な調音器官における二乗誤差である。適応化法を用いることで、すべての音素クラスに対して話者独立モデルよりも向上が見られた。最大約４４．４％の改善が見られた。また、適応化法による音素クラスに対する誤差の違いは見られなかった。
入力男性話者が「やるべきことはやっており何ら落ち度はない」という文章を発声した音声信号から、話者独立モデルを利用して推定された調音運動軌道（太線）と観測された調音運動軌道（細線）の垂直信号の例を図１１に示し、（Ｃ）法により適応化したモデルを利用して推定された調音運動軌道（太線）と観測された調音運動軌道（細線）の垂直信号の例を図１２に示す。これら両図を比較すれば図１２の方が太線が細線に近いものとなっており、モデル適応化の効果が得られていることが理解できる。
【００３７】
また推定された調音運動は発声した音素の特徴を良く再現している。推定された調音運動から生成した音声スペクトルと入力音声スペクトルとのスペクトル歪みも約６９．０％の改善が見られた。
図５、図７に示した適応化装置をコンピュータに機能させてもよい。この場合は図４又は図７に示した適応化方法の各手順をコンピュータに実行させるためのプログラムをＣＤ−ＲＯＭ、磁気ディスクなどの記録媒体からコンピュータにインストールし、又は通信回線を介してダウンロードし、そのプログラムをコンピュータに実行させればよい。上述においては調音パラメータベクトルを変数として音声パラメータベクトルを近似する関数に線形関数を用いたが他の関数でもよい。音声パラメータベクトル、及び調音パラメータベクトルとしては加速度成分や速度成分を用いなくてもよい。
【００３８】
【発明の効果】
この発明によれば、調音運動を表現する確率的な動的モデルと調音パラメータベクトルと音声パラメータベクトルとを関連付ける調音・音響マッピング係数とを含む音声生成モデルを話者適応化することができ、この適応化した音声生成モデルを使用することにより、入力話者音声の調音運動を、適応化しないモデルを用いる場合より、精度よく推定することができる。更に、調音パラメータベクトルから音声合成する場合に、所望の話者の音声を合成することができる。また、同様にこのモデルを用いて音声認識する場合も高認識精度を得ることができるようになる。更にこの発明によれば小量の音声データからでも音声生成モデルを適応化することができる。
【図面の簡単な説明】
【図１】１つの音素モデルの状態遷移の例を示す図。
【図２】モデル作成手順の例を示す流れ図。
【図３】音声生成モデルが記憶されている記憶装置の記憶内容例を示す図。
【図４】この発明の第１実施形態の処理手順の例を示す流れ図。
【図５】この発明の第１実施形態の機能構成例を示すブロック図。
【図６】この発明の第２実施形態の処理手順の例を示す流れ図。
【図７】この発明の第２実施形態の機能構成例を示すブロック図。
【図８】学習音声パラメータベクトルに対する合成音声パラメータベクトルの対数尤度の実験結果を示すグラフ。
【図９】入力音声信号に対して推定した調音パラメータベクトルの２乗誤差の実験結果を示すグラフ。
【図１０】入力音声信号に対して推定した調音パラメータベクトルの各種音素ごとの２乗誤差の実験結果を示す図。
【図１１】実測した調音運動と、話者独立モデルを利用して推定した調音運動の例を示す図。
【図１２】実測した調音運動と、話者適応化したモデルを利用して推定した調音運動の例を示す図。

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a speech generation model adaptation method, and a device, program, and recording medium therefor, that adapt a speech generation model to an input speaker using that speaker's speech signal. The speech generation model comprises a probabilistic dynamic model expressing articulatory movement (hereinafter referred to as the articulatory model) and articulatory/acoustic mapping coefficients that associate an articulatory parameter vector with a speech parameter vector.
[0002]
[Prior art]
As a method of inversely estimating articulatory movement from a speech signal, a method has been proposed that estimates the articulatory movement (articulatory parameter vector sequence) of speech from the speech signal on the basis of a speech generation model. This model is composed of a hidden Markov model (hereinafter referred to as HMM) describing the dynamic behavior of articulatory movement, and articulatory/acoustic mapping coefficients that approximate, as a function, the relationship between the articulatory parameter vector of the articulatory movement and the speech spectrum (speech parameter vector) (Non-Patent Document 1).
[0003]
However, most studies on inverse estimation of articulatory movement target a specific speaker. For speech input from unspecified speakers, a speaker adaptation method based on an inverse estimation method using a neural network has been considered, but it relied on vocal tract length normalization between the input speech and the model speech (S. Dusan and L. Deng, "Vocal-Tract Length Normalization for Acoustic-to-Articulatory Mapping Using Neural Networks," in The 138th Meeting of the Acoustic Society of America, 1999).
In addition, there is a method that, for a model with given speech parameters, adapts the model parameters so that the output probability (likelihood) of the input speech parameters is maximized (Non-Patent Document 2).
[0004]
[Non-patent document 1]
Sadao Hiroya and Masaaki Honda, “Acoustic-to-articulatory inverse mapping using an HMM-based speech production model,” in ICSLP, 2002, pp. 2305-2308.
[Non-patent document 2]
C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," in Computer Speech and Language, vol. 9, pp. 171-185, 1995.
[0005]
[Problems to be solved by the invention]
Conventional research on inverse estimation of articulatory movement for speech input from unspecified speakers, based on vocal tract length normalization, adapts the acoustic space of the input speech to the acoustic space of a specific speaker's model. However, because of the redundancy between the speech spectrum and articulatory movement, adapting the acoustic space does not directly lead to adapting the articulatory movement; adapting the acoustic space alone therefore cannot adapt the speaking style that arises from the speaker's articulatory movements. In addition, conventional adaptation methods adapt the parameters of a model so as to raise the likelihood of the input parameter sequence. A high likelihood results, but dynamic constraints on the parameters are not used, so sufficient accuracy cannot be obtained. An object of the present invention is to provide a speech generation model adaptation method, and a device, program, and recording medium therefor, that can adapt the speaking style based on the speaker's articulatory movements and obtain sufficient accuracy. In other words, the problem to be solved by the present invention can also be stated as adapting not only the acoustic space but the speech generation model itself, based on articulatory movement estimated from speech with dynamic features taken into account.
[0006]
[Means for Solving the Problems]
According to the present invention, starting from the speech signal of an input speaker, either (1) the speech generation model is adapted to the input speaker by adapting the speech spectrum generated by an existing speech generation model to the input speech spectrum, or (2) the articulatory movement of the input speech is determined from the input speaker's speech signal on the basis of the existing speech generation model, and the determined articulatory movement is used to adapt to the input speaker the probabilistic dynamic model of the articulatory trajectory in the existing speech generation model (referred to as the articulatory model) and, if necessary, the mapping coefficients that relate the speech spectrum vector to the articulatory parameter vector.
[0007]
In method (1), the speech spectrum (parameter) vector generated by the existing speech generation model is adapted so that the output probability of the speech spectrum (parameter) vector sequence of the input speaker's input speech is maximized. The articulatory/acoustic mapping coefficients in the speech generation model are then adapted using the relation coefficients that express this adaptation.
In method (2), the articulatory movement (articulatory parameter vector sequence) that maximizes the posterior probability given the input speaker's input speech spectrum (parameter) vector sequence is determined using the existing speech generation model, and the articulatory model is adapted so that the output probability of the determined articulatory movement (articulatory parameter vector sequence) is maximized. If necessary, the articulatory/acoustic mapping coefficients are also adapted so that the output probability of the input speaker's input speech spectrum (parameter) vector sequence, with respect to the speech spectrum (parameter) vector generated from the determined articulatory movement, is maximized.
[0008]
Method (2) may also be combined with method (1). First, the speech spectrum (parameter) vector generated by the existing speech generation model is adapted to the speech spectrum (parameter) vector sequence of the input speaker's input speech. Then, using the adapted speech spectrum (parameter) vector, the articulatory movement (articulatory parameter vectors) is determined from the input speech spectrum (parameter) vector sequence. Finally, the determined articulatory movement is used to adapt at least the articulatory model and, if necessary, the articulatory/acoustic mapping coefficients.
[0009]
BEST MODE FOR CARRYING OUT THE INVENTION
First, a method for creating a speech generation model to be adapted in the present invention will be described.
Model creation
An articulatory/acoustic pair codebook is created using a speech signal of continuously uttered sentences and articulatory data observed simultaneously by a magnetic sensor system. The speech signal is cut out frame by frame, for example at a rate of 250 frames per second with a Blackman window of length 32 ms, and subjected to spectrum analysis; for example, 16th-order mel-cepstral coefficients excluding the 0th-order term are obtained as the speech parameters. If necessary, velocity and acceleration parameters are obtained from the speech parameters as temporal changes by differentiation (differencing), and a vector whose elements are the speech parameters and the velocity and acceleration parameters is generated as the speech parameter vector y.
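The framing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sampling rate (16 kHz here) is not given in the text, `frame_signal` is a hypothetical helper name, and the mel-cepstral analysis that would follow is not shown.

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_rate: int = 250, window_ms: float = 32.0) -> np.ndarray:
    """Cut a signal into overlapping Blackman-windowed frames."""
    hop = sample_rate // frame_rate                     # samples between frame starts
    width = int(round(sample_rate * window_ms / 1000))  # samples per window
    n_frames = 1 + (len(signal) - width) // hop         # only frames that fit entirely
    window = np.blackman(width)
    starts = hop * np.arange(n_frames)
    return np.stack([signal[s:s + width] * window for s in starts])

sample_rate = 16000  # assumed; the sampling rate is not stated in the text
one_second = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
frames = frame_signal(one_second, sample_rate)  # one row per frame
```

With these assumed values the hop is 64 samples and the window is 512 samples, so consecutive frames overlap heavily; each row of `frames` would then be passed to the spectrum analysis.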
[0010]
The position signals in the horizontal and vertical directions for each of a plurality of simultaneously observed articulator positions, for example a total of eight points (the lower jaw, the upper and lower lips, four points on the tongue, and the soft palate), are sampled at a rate of 250 times per second. If necessary, velocity parameters are obtained from the position parameters as temporal changes by differentiation (differencing), and, if further necessary, acceleration parameters are obtained by differentiation (differencing) of the velocity parameters. An articulatory parameter vector x is generated whose elements are these 16 position parameters, 16 velocity parameters, and 16 acceleration parameters.
[0011]
That is, in this example, the speech parameter vector y and the articulatory parameter vector x are each vectors of 48 elements, as follows.
y = [k_1, ..., k_16, k_1', ..., k_16', k_1'', ..., k_16'']
x = [p_a, ..., p_n, p_a', ..., p_n', p_a'', ..., p_n'']
A large number of pairs, for example 200,000 sets, each consisting of the speech parameter vector y and the articulatory parameter vector x obtained at the same point in time, are stored to form the articulatory/acoustic pair codebook.
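As a rough sketch of how such 48-element vectors could be assembled, the snippet below stacks 16 static parameters with their first and second differences. The difference scheme (`np.gradient`, central differences) and the random placeholder data are assumptions for illustration; the text only says "differentiation (difference)".

```python
import numpy as np

def add_dynamic_features(static: np.ndarray) -> np.ndarray:
    """Stack static parameters with first and second differences.

    static: array of shape (T, D), one row per frame.
    returns: array of shape (T, 3 * D) laid out as [static, velocity, acceleration].
    """
    # np.gradient uses central differences inside the sequence and
    # one-sided differences at the two ends, so the length T is preserved.
    velocity = np.gradient(static, axis=0)
    acceleration = np.gradient(velocity, axis=0)
    return np.hstack([static, velocity, acceleration])

rng = np.random.default_rng(0)
mel_cepstra = rng.normal(size=(250, 16))     # placeholder: 1 s of 16-dim frames at 250 frames/s
coil_positions = rng.normal(size=(250, 16))  # placeholder: 8 points x (horizontal, vertical)

y = add_dynamic_features(mel_cepstra)     # speech parameter vectors, 48 elements each
x = add_dynamic_features(coil_positions)  # articulatory parameter vectors, 48 elements each
codebook = list(zip(x, y))                # simultaneous (x, y) pairs
```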
[0012]
Using the articulatory parameter vectors x and speech parameter vectors y obtained in this manner, a probabilistic dynamic model expressing articulatory movement (hereinafter referred to as the articulatory model), in this example a hidden Markov model (hereinafter referred to as HMM) λ, is created. The HMM λ is created so that the output probability P(y, q | λ) of the entire speech parameter vector sequence y obtained from the continuous utterance of the sentences is maximized. Here q denotes the state sequence for the entire speech parameter vector sequence y. In this example, the HMM λ is a two-phoneme (biphone) model with three states and a single Gaussian per state, in a left-to-right structure with no skips. For example, as shown in FIG. 1, there are three states q_1, q_2, q_3; the output probabilities of the articulatory parameter vector and of the speech parameter vector in each state are each a single Gaussian distribution; and the only state transitions are the self-transition (loop) of each state and the transitions from q_1 to q_2 and from q_2 to q_3, five in total. For each phoneme, a separate model is created for each different following phoneme.
[0013]
In a model that includes the articulatory parameter vector sequence x, let q_j denote each state in the state sequence q. The output probability of the speech parameter vector y in state q_j is the product of the transition probability P_t = P(q_j | λ) into that state, the output probability P_x = P(x | q_j, λ) of the articulatory parameter vector x for that state, and the output probability P_y = P(y | x, q_j, λ) of the speech parameter vector y given the articulatory parameter vector x for that state. Each model is therefore created so that P(y, q_j | λ) = ∫ P(y | x, q_j, λ) P(x | q_j, λ) P(q_j | λ) dx is maximized. Both the output probability P(y | x, q_j, λ) of the speech parameters given the articulatory parameter vector and the output probability P(x | q_j, λ) of the articulatory parameter vector are assumed to be Gaussian.
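Because both factors are Gaussian and the mapping from x to y is linear, the integral over x has a closed form. The step below is the standard linear-Gaussian marginalization identity, written with the symbols this description uses for the per-state parameters (A, b for the mapping, w_m and σ_w for the mean and covariance of the mapping error, x_m and σ_x for the mean and covariance of x). It is offered as a reading aid rather than as text from the specification.

```latex
\begin{aligned}
P(y \mid q_j, \lambda)
  &= \int \mathcal{N}\!\left(y;\; A x + b + w_m,\; \sigma_w\right)\,
          \mathcal{N}\!\left(x;\; x_m,\; \sigma_x\right)\, dx \\
  &= \mathcal{N}\!\left(y;\; A x_m + b + w_m,\; A\,\sigma_x A^{T} + \sigma_w\right)
\end{aligned}
```

When A and b are fitted by least squares with the offset b included, the error mean w_m is zero on the fitting data, so the mean of y reduces to A x_m + b.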
[0014]
FIG. 2 shows an example of the model creation procedure. This training method is called the "Viterbi learning method". First, by aligning the input speech parameter vector sequence y and the input articulatory parameter vector sequence x with the uttered sentences, the parameter vector pairs belonging to each phoneme are collected. For each phoneme, each of its sequences of parameter vector pairs is divided among the three states q_1, q_2, q_3 with equal duration, and the model parameters A, b, x_m, σ_x, w_m, σ_w are computed for each state; that is, an initial model λ is created and stored (S1).
[0015]
That is, y = Ax + b is used as the function y = f(x) that determines the speech parameter vector y from the articulatory parameter vector x. A and b are found that minimize the squared error between the speech parameter vector y' = Ax + b computed from the articulatory parameter vector x and the speech parameter vector y paired with that x. The error w of y' with respect to y is then obtained, and its mean w_m and covariance σ_w are computed; the mean x_m and covariance σ_x of the articulatory parameter vector x are computed; and the state transition probability γ is computed. The initial state transition probabilities γ are set to suitable values, such as 0.8 for the self-transition probability and 0.2 for the probability of moving to another state. Thereafter, for a given state q_j, the self-transition probability is the number of frames assigned to that state that transition to the same state, divided by the total number of frames assigned to that state, and the probability of moving to another state is (1 - self-transition probability).
[0016]
These model parameters A, b, w_m, σ_w, x_m, σ_x, P_t are computed for each state of each phoneme and stored per phoneme. In the transformation function of this example, the vector y on the left side has 48 elements and the vector x on the right side also has 48 elements, so the coefficient A is a 48 × 48 matrix and the constant b is a vector of 48 elements.
Therefore, at least 48 pairs of y and x are required to determine A and b.
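The per-state estimation described above can be sketched as an ordinary least-squares fit followed by sample statistics. The function names and the synthetic data are hypothetical, and the small dimensions are chosen only to keep the example readable; this is one plausible reading of the procedure, not the specification's implementation.

```python
import numpy as np

def fit_state(x: np.ndarray, y: np.ndarray) -> dict:
    """Estimate the per-state parameters from paired frames.

    x: (n, M) articulatory parameter vectors assigned to the state.
    y: (n, N) speech parameter vectors for the same frames.
    """
    n = x.shape[0]
    # Append a column of ones so that the offset b is fitted together with A.
    x_aug = np.hstack([x, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(x_aug, y, rcond=None)  # shape (M + 1, N)
    A, b = coef[:-1].T, coef[-1]
    w = y - (x @ A.T + b)                             # mapping error of each frame
    return {
        "A": A, "b": b,
        "w_m": w.mean(axis=0), "sigma_w": np.cov(w, rowvar=False),
        "x_m": x.mean(axis=0), "sigma_x": np.cov(x, rowvar=False),
    }

def self_transition_probability(states: np.ndarray, j: int) -> float:
    """Fraction of frames in state j whose next frame is also in state j.

    The last frame has no successor, so it is left out of the count
    (a simplification made for this sketch).
    """
    in_j = states[:-1] == j
    stays = in_j & (states[1:] == j)
    return float(stays.sum() / in_j.sum())

rng = np.random.default_rng(0)
A_true = rng.normal(size=(3, 2))
b_true = np.array([1.0, -2.0, 0.5])
x = rng.normal(size=(200, 2))
y = x @ A_true.T + b_true + 0.01 * rng.normal(size=(200, 3))
params = fit_state(x, y)
p_stay = self_transition_probability(np.array([0, 0, 0, 1, 1, 2]), 0)
```

The probability of leaving the state is then `1 - p_stay`, as in the text.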
Next, for this initial model λ, the assignment of states q_j to the speech parameter vectors y and the articulatory parameter vectors x is determined using the Viterbi algorithm so that the output probability of the input speech parameter vector y,
P(y, q | λ) = ∫ P(y | x, q, λ) P(x | q, λ) P(q | λ) dx    (1)
is maximized (S2). That is, starting from the first phoneme of the sentence, the output probability P(y | x, q, λ) of the speech parameter vector y given the articulatory parameter vector x and the output probability P(x | q, λ) of the articulatory parameter vector x in each state are obtained from the previously stored model by equations (2) and (3) below, on the basis that the probabilities are Gaussian.
[0017]
P(y | x, q, λ) = [1 / ((2π)^{N/2} |σ_w|^{1/2})] × exp[-(1/2) (y - Ax - b - w_m)^T σ_w^{-1} (y - Ax - b - w_m)]    (2)
P(x | q, λ) = [1 / ((2π)^{M/2} |σ_x|^{1/2})] × exp[-(1/2) (x - x_m)^T σ_x^{-1} (x - x_m)]    (3)
N is the dimension of the vector y and M is the dimension of the vector x, both of which are 48 in the above example, and ( )^T denotes the transpose of a matrix.
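Equations (2) and (3) are ordinary multivariate Gaussian densities, so they can be evaluated with one shared routine. The sketch below works in the log domain, which is an implementation choice made here for numerical stability and not something the text prescribes; the function names and the small test values are hypothetical.

```python
import numpy as np

def gaussian_logpdf(v: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    """Log of a multivariate Gaussian density with the given mean and covariance."""
    d = v - mean
    dim = v.shape[0]
    _, logdet = np.linalg.slogdet(cov)   # log |cov| without overflow
    maha = d @ np.linalg.solve(cov, d)   # d^T cov^{-1} d
    return float(-0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha))

def log_p_y_given_x(y, x, A, b, w_m, sigma_w) -> float:
    """Log of equation (2): y is Gaussian around A x + b + w_m."""
    return gaussian_logpdf(y, A @ x + b + w_m, sigma_w)

def log_p_x(x, x_m, sigma_x) -> float:
    """Log of equation (3): x is Gaussian around x_m."""
    return gaussian_logpdf(x, x_m, sigma_x)

x = np.array([0.5, -1.0])
A = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([0.1, 0.2])
y = A @ x + b  # a speech vector with zero mapping error
log_emission = (log_p_y_given_x(y, x, A, b, np.zeros(2), np.eye(2))
                + log_p_x(x, np.zeros(2), np.eye(2)))
```

In the log domain the product of the two densities becomes the sum `log_emission`, to which the log transition probability can be added.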
[0018]
The transition probability P(q | λ) is also obtained, and the product of P(y | x, q, λ), P(x | q, λ), and P(q | λ) is used as the branch metric. For each state, the path with the largest branch metric is kept as the surviving path, and its branch metric is added to the path metric accumulated so far; this is repeated in sequence. The state sequence q with the largest final path metric is the one that maximizes equation (1).
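The search described above can be sketched as a log-domain Viterbi pass over a left-to-right chain with no skips. The sketch assumes the per-frame emission terms have already been computed, that the path starts in the first state and ends in the last, and that there are at least as many frames as states; a chain of concatenated phoneme models is treated as one longer chain. The function name and the toy probabilities are hypothetical.

```python
import numpy as np

def viterbi_left_to_right(log_emit: np.ndarray, log_self: np.ndarray,
                          log_next: np.ndarray):
    """Best left-to-right, no-skip state sequence.

    log_emit: (T, S) log of P(y_t | x_t, q, λ) P(x_t | q, λ) for frame t, state q.
    log_self: (S,) log self-transition probability of each state.
    log_next: (S,) log probability of moving from state s to state s + 1.
    Returns (states, best_log_path_metric).
    """
    T, S = log_emit.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emit[0, 0]  # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log_self[s]
            move = score[t - 1, s - 1] + log_next[s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1  # surviving predecessor
            score[t, s] = max(stay, move) + log_emit[t, s]
    states = np.empty(T, dtype=int)
    states[-1] = S - 1            # and end in the last state
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states, float(score[-1, S - 1])

# Toy emissions that favour two frames in each of three states.
p = np.full((6, 3), 0.1)
for t, s in enumerate([0, 0, 1, 1, 2, 2]):
    p[t, s] = 0.8
states, score = viterbi_left_to_right(
    np.log(p), np.log(np.full(3, 0.8)), np.log(np.full(3, 0.2)))
```

The returned `states` give the frame-to-state assignment used to regroup the frames before the parameters are re-estimated.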
Next, it is checked whether the maximum output probability of the input speech parameter vector y obtained when determining the state sequence q, that is, the value of the maximum path metric, has converged (S3). If it has not converged, the state sequence q determined in step S2 is aligned with the input speech parameter vector sequence y and the input articulatory parameter vector sequence x, the points in the state sequence q where one model changes to the next are detected, and the assignment of phoneme segments to the input speech parameter vector sequence y and the input articulatory parameter vector sequence x is reset (S4).
[0019]
For the set of speech parameter vectors and articulatory parameter vectors reassigned to each phoneme, the model parameters A, b, w_m, σ_w, x_m, σ_x, P_t are computed; that is, the phoneme models are created again, the stored model parameters are updated, and the process returns to step S2 (S5).
By repeating steps S2 to S5, the maximum output probability of the speech parameter vectors becomes substantially constant; that is, convergence is detected in step S3 and the process ends.
[0020]
The HMM models obtained in this manner are stored, for example as shown in FIG. 3, in storage units 25-1 to 25-J, one for each of the per-phoneme models λ1 to λJ. In each, the state transition probabilities γ (which, as described above, consist of five probabilities in total: the loops and the transitions to the adjacent state) are stored in the transition probability storage unit 27, A and b for each state are stored in the coefficient storage unit 28, x_m and w_m are stored in the mean storage unit 29, and σ_x and σ_w are stored in the covariance storage unit 31. The coefficients A and b are called the articulatory/acoustic mapping coefficients because they are the parameters that associate an articulatory parameter vector x with an approximation of the corresponding speech parameter vector y. The other parameters x_m, w_m, σ_x, σ_w are called the articulatory model. Further, since P(y | x, q, λ) is calculated by equation (2) and P(x | q, λ) by equation (3), the output probability of the speech parameter vector y given the articulatory parameter vector x and the output probability of the articulatory parameter vector x can also be said to be stored in the model storage unit 25. Although the "Viterbi learning method" has been described as the model creation method, the more accurate "EM learning method" (Expectation-Maximization) may be used instead.
[0021]
First embodiment
In the first embodiment of the present invention, the speech parameter (spectrum) vectors generated by an existing speech generation model are adapted so that the output probability of the input speech parameter (spectrum) vector sequence of the input speaker is maximized, and the articulation/acoustic mapping coefficients in the speech generation model are adapted using the coefficient that relates the generated speech parameter vectors to the input speech parameter vector sequence.
The first embodiment is described below with reference to FIGS. 4 and 5. The speech signal of the input speaker is input as a digital signal from the input terminal 11 and temporarily stored in the signal storage unit 42 (S1). From this input speech signal, the speech parameter vector generation unit 43 generates an input speech parameter (spectrum) vector for each frame, producing the input speech parameter vector sequence y (S2). For example, the input speech signal is subjected to spectrum analysis frame by frame to detect the speech parameters (S2-1), and the velocity and acceleration parameters, which represent the temporal change of the spectrum, are detected (S2-2). Together these form the speech parameter vector of the frame, and the time series of the speech parameter vectors of the frames is the speech parameter vector sequence y. These parameters are the same as the speech parameters used when creating the speech generation model to be adapted, which is stored in the model storage unit 48; in the above example, the 16th-order mel-cepstral coefficients excluding the 0th-order term and their velocity parameters are used. The input speech parameter vector sequence y is temporarily stored in the storage unit 42 (S3). The phoneme string of the sentence corresponding to the input speech signal is stored in the phoneme string storage unit 45.
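As a rough illustration of step S2-2, the following Python sketch (using NumPy) appends velocity and acceleration components to per-frame static parameters. The function name and the use of central differences are assumptions made for illustration; the text above does not specify how the temporal changes are computed.

```python
import numpy as np

def add_dynamic_features(static):
    """Append velocity and acceleration to per-frame static parameters.

    static: array of shape (T, D), one row per frame.
    Returns an array of shape (T, 3 * D): [static, velocity, acceleration].
    """
    static = np.asarray(static, dtype=float)
    # Finite differences along the time axis (an assumed choice).
    velocity = np.gradient(static, axis=0)
    acceleration = np.gradient(velocity, axis=0)
    return np.hstack([static, velocity, acceleration])

# Toy input: a linear ramp with one static parameter per frame.
frames = np.array([[0.0], [1.0], [2.0], [3.0]])
features = add_dynamic_features(frames)
```

For the linear ramp, the velocity column is constant and the acceleration column is zero, which is a quick sanity check on the differencing.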
[0022]
The average vector y_mj of the speech parameter vectors generated from the speech generation model is adapted so that the output probability P(y | q, λ) of the input speech parameter vector sequence y = (y_1, ..., y_T) is maximized. To maximize P(y | q, λ), the average vector y_mj may be determined so as to maximize the log likelihood log P(y | q, λ), with reference to page 174 of Non-Patent Document 2. Therefore, the H_s that maximizes
$$\log P(y \mid q, \lambda) = K - \frac{1}{2} \sum_t \sum_j \gamma_t(j)\, (y_t - H_s y_{mj})^T \sigma_{yj}^{-1} (y_t - H_s y_{mj})$$
can be obtained by solving
$$\sum_t \sum_j \gamma_t(j)\, \sigma_{yj}^{-1} y_t\, y_{mj}^T = \sum_t \sum_j \gamma_t(j)\, \sigma_{yj}^{-1} H_s\, y_{mj}\, y_{mj}^T \qquad (4)$$
Here, t is the discrete time of the vector sequence, j is the state number of each phoneme, K is a constant, H_s is the regression coefficient, and γ_t(j) is the probability that the speech parameter vector is in state j at time t, given by γ_t(j) = P(q_t = j | y, λ). The mean vector of the speech parameter vector y_j is obtained as y_mj = A_j x_mj + b_j, and its covariance matrix as σ_yj = A_j σ_xj A_j^T + σ_wj. ( )^T denotes the transpose, and s denotes a cluster that divides the acoustic space. That is, the whole set of phoneme models is divided into several clusters; for example, all the speech generation models λ may form a single cluster, or vowels and consonants may be treated as separate clusters.
[0023]
That is, as shown in FIG. 4 and FIG. 5, the voice-related coefficient calculating unit 46 calculates the voice-related coefficient H_s using the input speech parameter vector sequence y and the speech generation model (S4). First, the average vector generation unit 47 computes the average vector of the speech parameters, y_mj = A_j x_mj + b_j, from the articulatory/acoustic mapping coefficients A_j and b_j and the articulatory average vector x_mj of each model in the model storage unit 44 (S4-1). Next, the covariance calculation unit 48 computes the covariance matrix of the speech parameter vector y_j, σ_yj = A_j σ_xj A_j^T + σ_wj, from the covariance σ_xj of the articulation parameter vector x_j and the covariance σ_wj of the error w_j of each model in the model storage unit 44 (S4-2). Then, for each phoneme in the phoneme sequence in the storage unit 45, the speech relation coefficient calculating unit 46 uses the transition probability γ_j of the corresponding phoneme model λ in the model storage unit 44, the input speech parameter vector sequence y, the average vector y_mj from the average vector generation unit 47, and the covariance matrix σ_yj from the covariance calculation unit 48 to calculate Σ_t Σ_j γ_t(j) σ_yj^-1 y_t y_mj^T and Σ_t Σ_j γ_t(j) σ_yj^-1 y_mj y_mj^T, and obtains the regression coefficient (speech-related coefficient) H_s that satisfies equation (4) (S4-3).
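As a rough illustration of steps S4-1 to S4-3, the following Python sketch (using NumPy) computes y_mj and σ_yj for each state and then solves equation (4) for H_s. The function names, array layout, and toy values are assumptions made for illustration and are not taken from the patent. The sketch treats γ_t(j) as a given occupancy matrix and uses a single cluster s. Because σ_yj differs from state to state, H_s is found by vectorizing equation (4) with the standard identity vec(P H G) = (G^T ⊗ P) vec(H).

```python
import numpy as np

def speech_mean_and_cov(A, b, x_mean, x_cov, w_cov):
    """S4-1, S4-2: y_mj = A_j x_mj + b_j and sigma_yj = A_j sigma_xj A_j^T + sigma_wj."""
    return A @ x_mean + b, A @ x_cov @ A.T + w_cov

def solve_speech_relation(gamma, Y, y_means, y_covs):
    """S4-3: solve equation (4) for H_s (single cluster).

    gamma:   (T, J) occupancies gamma_t(j)
    Y:       (T, D) input speech parameter vectors y_t
    y_means: (J, D) model means y_mj
    y_covs:  (J, D, D) model covariances sigma_yj
    """
    D = Y.shape[1]
    system = np.zeros((D * D, D * D))
    data_term = np.zeros((D, D))
    for j in range(gamma.shape[1]):
        prec = np.linalg.inv(y_covs[j])
        m = y_means[j]
        # Data side of (4): sum_t gamma_t(j) sigma_yj^-1 y_t y_mj^T
        data_term += prec @ np.outer(gamma[:, j] @ Y, m)
        # Model side of (4), vectorized column-major: (G^T kron P) vec(H)
        G = gamma[:, j].sum() * np.outer(m, m)
        system += np.kron(G.T, prec)
    h = np.linalg.solve(system, data_term.reshape(-1, order="F"))
    return h.reshape(D, D, order="F")

# Toy model with three states (all values are made up for illustration).
A = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[2.0, 0.0], [0.0, 1.0]],
              [[1.0, 1.0], [0.0, 1.0]]])
b = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x_means = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
stats = [speech_mean_and_cov(A[j], b[j], x_means[j], np.eye(2), 0.1 * np.eye(2))
         for j in range(3)]
y_means = np.array([s[0] for s in stats])
y_covs = np.array([s[1] for s in stats])

H_true = np.array([[1.2, 0.1], [-0.3, 0.9]])
states = np.array([0, 0, 1, 1, 2, 2])
gamma = np.eye(3)[states]        # hard (one-hot) occupancies
Y = y_means[states] @ H_true.T   # noise-free data: y_t = H_true y_mj
H_s = solve_speech_relation(gamma, Y, y_means, y_covs)
```

With noise-free toy data the sketch recovers H_true. The linear system is solvable only when the state means y_mj span the speech parameter space, which is one reason a cluster needs enough distinct states.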
[0024]
Using this voice relation coefficient H_s, the coefficient adaptation unit 49 adapts the articulatory/acoustic mapping coefficients A_j and b_j of the speech generation model to H_s A_j and H_s b_j, thereby adapting them to the input speaker's voice (S5).
Further, if necessary, the voice adaptation unit 51 adapts the previously obtained average vector y_mj of the speech parameter vectors as y_mj = H_s y_mj (S6).
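The following short derivation, added here as an explanatory note rather than as part of the original description, shows why multiplying the mapping coefficients by H_s in step S5 is consistent with transforming the speech average vector in step S6. It uses only the definition y_mj = A_j x_mj + b_j given above.

```latex
\begin{aligned}
H_s\, y_{mj} &= H_s \left( A_j x_{mj} + b_j \right) \\
             &= (H_s A_j)\, x_{mj} + (H_s b_j)
\end{aligned}
```

The same identity holds for any articulation parameter vector in place of x_mj, so the adapted coefficients H_s A_j and H_s b_j map every articulation vector to the H_s-transformed speech vector.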
If this adapted speech generation model is used, it is possible to input an articulatory parameter vector sequence and synthesize a speech signal corresponding to the input, which is close to the speech of the input speaker.
In FIG. 5, the control unit 52 sequentially operates each unit, and performs reading and writing for each storage unit.
[0025]
Second embodiment
Next, a second embodiment of the present invention will be described with reference to FIGS. 6 and 7. In the second embodiment, the articulatory motion of the input speech is estimated using the speech generation model to be adapted, and the speech generation model is adapted so that the output probability of this articulatory motion is maximized.
As in the first embodiment, the speech parameter vector generation unit 43 generates the input speech parameter (spectrum) vector sequence y from the speech signal of the input speaker supplied at the input terminal 41 (S2), and the sequence is temporarily stored in the storage unit 43 (S3).
[0026]
In the second embodiment, the state sequence generation unit 61 generates the state sequence q that maximizes the output probability P(y | q, λ) of the input speech parameter vector sequence y, for example by the Viterbi algorithm, based on the phoneme sequence in the storage unit 45 (S4). This may be done in substantially the same manner as in the model creation method described above.
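As a rough illustration of step S4, the following Python sketch (using NumPy) implements the generic Viterbi algorithm in the log domain. It is a textbook sketch under assumed inputs, not the patent's implementation: the function name, the array layout, and the toy two-state left-to-right model are all hypothetical, and the per-frame emission log-probabilities are taken as given rather than computed from the speech generation model.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence as a list of state indices.

    log_init:  (J,) log initial state probabilities
    log_trans: (J, J) log transition probabilities, indexed [from, to]
    log_emit:  (T, J) log emission probabilities per frame and state
    """
    T, J = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, J), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans        # cand[i, j]: from i to j
        back[t] = np.argmax(cand, axis=0)        # best predecessor of each j
        score = cand[back[t], np.arange(J)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy left-to-right model: state 1 (almost) never returns to state 0.
log_init = np.log(np.array([1.0, 1e-12]))
log_trans = np.log(np.array([[0.5, 0.5], [1e-12, 1.0]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
best_path = viterbi(log_init, log_trans, log_emit)
```

In this toy case the first two frames favor state 0 and the last two favor state 1, so the best path switches states once, in the middle.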
Next, the articulatory parameter vector generator 62 generates the articulatory motion, that is, the articulatory parameter vector sequence, that maximizes the posterior probability P(x | y, q, λ) for this state sequence q (S5). As is clear from the description in the left column of page 2306 of Non-Patent Document 1, the articulatory parameter vector sequence x that maximizes P(x | y, q, λ) is obtained as the sequence x_e that minimizes the following equation (5).
[0027]
$$J = (y - A x_e - b)^T \sigma_w^{-1} (y - A x_e - b) \qquad (5)$$
That is, it is obtained by Expression (4) of Non-Patent Document 1, reproduced below.
$$x_e = \left( \sigma_x^{-1} + A^T \sigma_w^{-1} A \right)^{-1} \left( \sigma_x^{-1} x_m + A^T \sigma_w^{-1} (y - b) \right)$$
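As a rough illustration of step S5, the following Python sketch (using NumPy) evaluates the closed-form estimate above for a single frame and state. The function name and toy values are assumptions made for illustration. Note one assumption about the math: equation (5) as printed shows only the data term, while the closed form is the minimizer of that term plus the Gaussian prior term (x_e - x_m)^T σ_x^-1 (x_e - x_m), which is the standard maximum a posteriori result for a linear-Gaussian model. The sketch checks this by confirming that the gradient of the combined objective vanishes at x_e, and it ignores any coupling between frames introduced by dynamic features.

```python
import numpy as np

def map_articulation(y, A, b, x_mean, x_cov, w_cov):
    """Closed-form estimate of the articulation vector for one frame and state."""
    x_prec = np.linalg.inv(x_cov)   # sigma_x^-1
    w_prec = np.linalg.inv(w_cov)   # sigma_w^-1
    lhs = x_prec + A.T @ w_prec @ A
    rhs = x_prec @ x_mean + A.T @ w_prec @ (y - b)
    return np.linalg.solve(lhs, rhs)

# Toy values (made up): 3-dimensional speech vector, 2-dimensional articulation vector.
A = np.array([[1.0, 0.5], [0.0, 2.0], [1.0, 1.0]])
b = np.array([0.1, -0.2, 0.3])
x_mean = np.array([0.5, -0.5])
x_cov = np.diag([1.0, 2.0])
w_cov = 0.1 * np.eye(3)
y = np.array([1.0, 2.0, 3.0])

x_e = map_articulation(y, A, b, x_mean, x_cov, w_cov)

# Gradient (up to a constant factor) of data term plus prior term at x_e.
gradient = (A.T @ np.linalg.inv(w_cov) @ (y - A @ x_e - b)
            - np.linalg.inv(x_cov) @ (x_e - x_mean))
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the combined precision matrix explicitly.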
For the articulation parameter vector sequence x_e generated in this way, the average relation coefficient calculation unit 63 obtains the average relation coefficient C_s, which gives the average vector x_em of the articulation parameter vectors x_e that maximizes the output probability P(x_e | q, λ), by solving the following equation (6) (S6).
[0028]
$$\sum_t \sum_j \gamma_t(j)\, \sigma_{xj}^{-1} x_{et}\, x_{mj}^T = \sum_t \sum_j \gamma_t(j)\, \sigma_{xj}^{-1} C_s\, x_{mj}\, x_{mj}^T \qquad (6)$$
In other words, for each vector x_et of the generated articulation parameter vector sequence x_e, the transition probability γ_j, the covariance σ_xj, and the average x_mj of the corresponding phoneme model λ of the speech generation model are read from the storage unit 44; the sums Σ_t Σ_j γ_t(j) σ_xj^-1 x_et x_mj^T and Σ_t Σ_j γ_t(j) σ_xj^-1 x_mj x_mj^T are calculated; and equation (6) is solved for the regression coefficient C_s. Using the average relation coefficient C_s thus obtained, the average adaptation unit 64 adapts each articulation average vector x_mj in the storage unit 44 as x_mj = C_s x_mj (S7).
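As an explanatory note, equation (6) has a simple closed form in one special case. Assume, purely for illustration, that all states in cluster s share the same covariance σ_x (the description above does not require this). Multiplying both sides of equation (6) from the left by σ_x then removes the covariance, giving:

```latex
C_s = \left( \sum_t \sum_j \gamma_t(j)\, x_{et}\, x_{mj}^T \right)
      \left( \sum_t \sum_j \gamma_t(j)\, x_{mj}\, x_{mj}^T \right)^{-1}
```

This requires the second matrix to be invertible, that is, the state means x_mj in the cluster must span the articulation parameter space. When the covariances differ between states, equation (6) is still linear in C_s but has to be solved as a general linear system.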
[0029]
By using this adapted speech generation model to estimate the articulatory motion (articulatory parameter vector sequence) from the speech signal of the input speaker, the articulatory motion can be obtained with higher accuracy than with the model before adaptation.
When the articulation/acoustic mapping coefficients A_j and b_j are also to be adapted, the speech vector generation unit 65 takes each articulation parameter vector x_et of the sequence x_e generated in step S5 and, referring to the phoneme sequence, applies the mapping coefficients A_j and b_j of the corresponding speech generation model to generate the speech spectrum vector y_j = A_j x_et + b_j (S8).
[0030]
For the speech spectrum vectors y_j generated with this speech generation model, the output probability P(y | q, λ) of the input speech parameter vector sequence y is maximized, as in the first embodiment. That is, the log likelihood log P(y | q, λ) of
P(y | q, λ) = ∫ P(y | x, q, λ) P(x | q, λ) dx
is maximized. Therefore, as in the derivation of equation (4), the mapping relation coefficient calculation unit 66 calculates a regression coefficient (mapping relation coefficient) D_s satisfying the following equation (7) (S9).
[0031]
$$\sum_t \sum_j \gamma_t(j)\, \sigma_{wj}^{-1} y_t\, (A_j x_{et} + b_j)^T = \sum_t \sum_j \gamma_t(j)\, \sigma_{wj}^{-1} D_s\, (A_j x_{et} + b_j)(A_j x_{et} + b_j)^T \qquad (7)$$
That is, the mapping relation coefficient calculation unit 66 uses the input speech parameter vector sequence y, the speech vectors y_j from the speech vector generation unit 65, and the error covariance σ_wj and transition probability γ_j in the model storage unit 44 to calculate the D_s that satisfies
$$\sum_t \sum_j \gamma_t(j)\, \sigma_{wj}^{-1} y_t\, y_j^T = \sum_t \sum_j \gamma_t(j)\, \sigma_{wj}^{-1} D_s\, y_j\, y_j^T$$
[0032]
Using this mapping relation coefficient D_s, the coefficient adaptation unit 67 adapts the articulatory/acoustic mapping coefficients A_j and b_j of each speech generation model in the model storage unit 44 to D_s A_j and D_s b_j (S10).
The state sequence in step S4 may also be generated using the adapted speech average vector y_mj (= H_s y_mj) generated by the speech adaptation unit 51 of the first embodiment. In this case, when the regression coefficients C_s and D_s use a common cluster, the likelihood is the same as in the first embodiment because of the redundancy of the regression coefficients, but the adapted speech generation model differs from that of the first embodiment.
[0033]
The speech generation model is adapted to the speaker's speech by any of the adaptation methods described above, and the articulatory motion of the speaker's speech signal is estimated using the adapted speech generation model, for example by the method shown in Non-Patent Document 1.
Experiment
Speech signals and articulatory data of 356 sentences uttered by three Japanese men were used. Parameter vectors were generated under the conditions described in the section on model creation, and a model was created for each of the three speakers. For each input speaker, the models of the two other speakers were adapted, and the evaluation used the average over the resulting six tests. The articulatory data were observed with a magnetic sensor system in which small receiving coils are attached to the articulatory observation points. However, because the positions at which the receiving coils are attached and the sizes of the articulatory organs differ from speaker to speaker, the articulatory movement observed for the input speaker cannot be compared directly with the articulatory movement estimated using another speaker's model. Therefore, the evaluation was performed after normalization by a linear transformation, obtained in advance, of position and size between the observed articulatory movement of the input speaker and that of the other speaker. Supervised learning was used for adaptation, with 40 adaptation sentences. For the inverse estimation of articulatory motion, the condition with phoneme information was used (Non-Patent Document 1). Three adaptation methods were tested: (A) the first embodiment, (B) the second embodiment, and (C) a combination of the first and second embodiments. The number of clusters s was the same for all adaptation methods.
[0034]
FIG. 8 shows the log likelihood of the acoustic parameter vectors for the learning data when the number of clusters is 1, 3, 5, and 10. The adaptation methods increase the likelihood compared with the speaker-independent model. The likelihoods of methods (A) and (C) are almost the same, and that of method (B) is lower.
The figure also shows the results for a model created from the speaker's own speech and articulatory movement (speaker model) and for a model created from speech unrelated to the speaker and its corresponding articulatory movement (speaker-independent model).
[0035]
FIG. 9 shows the squared error of the articulatory motion for each adaptation method as a function of the number of clusters. With the adaptation methods, the error is smaller than that of the speaker-independent model for every number of clusters. With method (A), no adaptation of the articulatory movement is performed, so the error is substantially constant regardless of the number of clusters. With method (B), on the other hand, the error decreases as the number of clusters increases. Method (C) has a smaller error than method (B) for up to 5 clusters, but its error is large with 10 clusters despite the high likelihood.
[0036]
FIG. 10 shows the squared error for each phoneme class, evaluated with 10 clusters. 'Total' is the squared error over the entire utterance, and 'Vowel' through 'Nasal' are the squared errors of the articulatory organs that are important for producing those phonemes. With the adaptation methods, every phoneme class improved over the speaker-independent model, by up to about 44.4%. In addition, the adaptation methods showed no difference in error between phoneme classes.
FIG. 11 shows an example of the vertical signal of the articulatory trajectory estimated using the speaker-independent model (thick line) and of the observed articulatory trajectory (thin line), for a speech signal in which the input male speaker uttered a sentence saying "I am doing what I need to do and there is no fault". FIG. 12 shows the corresponding example for the articulatory trajectory estimated using the model adapted by method (C) (thick line) and the observed articulatory trajectory (thin line). Comparing the two figures, the thick line in FIG. 12 is closer to the thin line, which shows the effect of the model adaptation.
[0037]
The estimated articulatory movements reproduce the characteristics of the uttered phonemes well. The spectral distortion between the speech spectrum generated from the estimated articulatory movement and the input speech spectrum was also improved, by about 69.0%.
The adaptation devices shown in FIGS. 5 and 7 may be implemented on a computer. In this case, a program for causing the computer to execute each procedure of the adaptation method shown in FIG. 4 or FIG. 6 is installed in the computer from a recording medium such as a CD-ROM or a magnetic disk, or downloaded via a communication line, and is executed by the computer. In the above description, a linear function is used to approximate the speech parameter vector with the articulation parameter vector as its variable, but another function may be used. The acceleration and velocity components need not be included in the speech parameter vector and the articulation parameter vector.
[0038]
[Effect of the invention]
According to the present invention, a speech generation model that includes a probabilistic dynamic model representing articulatory movement and articulation/acoustic mapping coefficients that associate an articulation parameter vector with a speech parameter vector can be adapted to a speaker. By using the adapted speech generation model, the articulatory motion of the input speaker's speech can be estimated more accurately than with an unadapted model. Further, when speech is synthesized from articulation parameter vectors, the speech of a desired speaker can be synthesized. Similarly, when this model is used for speech recognition, high recognition accuracy can be obtained. Further, according to the present invention, a speech generation model can be adapted even from a small amount of speech data.
[Brief description of the drawings]
FIG. 1 is a diagram showing an example of a state transition of one phoneme model.
FIG. 2 is a flowchart illustrating an example of a model creation procedure.
FIG. 3 is a diagram showing an example of storage contents of a storage device in which a speech generation model is stored.
FIG. 4 is a flowchart showing an example of a processing procedure according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing a functional configuration example of the first embodiment of the present invention.
FIG. 6 is a flowchart illustrating an example of a processing procedure according to a second embodiment of the present invention.
FIG. 7 is a block diagram showing a functional configuration example according to a second embodiment of the present invention.
FIG. 8 is a graph showing experimental results of log likelihood of a synthesized speech parameter vector with respect to a learning speech parameter vector.
FIG. 9 is a graph showing an experimental result of a square error of an articulation parameter vector estimated for an input audio signal.
FIG. 10 is a diagram showing experimental results of square errors of various types of phonemes of an articulation parameter vector estimated for an input speech signal.
FIG. 11 is a diagram showing an example of an actually measured articulation motion and an articulation motion estimated using a speaker independent model.
FIG. 12 is a diagram showing an example of an actually measured articulation motion and an articulation motion estimated using a speaker-adapted model.

Claims

A speech generation model speaker adaptation method for adapting, to an input speaker speech signal, a speech generation model stored in a storage device and including a probabilistic dynamic model representing articulatory movement (hereinafter referred to as an articulation model) and articulation/acoustic mapping coefficients that associate an articulation parameter vector with a speech parameter vector, the method comprising:
a procedure of analyzing the input speaker speech signal for each frame to generate an input speech parameter vector sequence;
a procedure of generating, for each of the speech generation models, the respective average vector of the speech parameter vectors from the articulation parameter vectors using the respective articulation/acoustic mapping coefficients;
a procedure of relating the average vector sequence of the input speech parameter vector sequence (hereinafter referred to as the input speech average vector) to the sequence of the average vectors so that the output probability of the input speech parameter vector sequence is maximized; and
a procedure of applying the relation to each of the articulation/acoustic mapping coefficients of each of the speech generation models, thereby adapting the speech generation model.

A speech generation model speaker adaptation method for adapting, from an input speaker speech signal, a speech generation model stored in a storage device and including an articulation model and articulation/acoustic mapping coefficients, the method comprising:
a procedure of analyzing the input speaker speech signal for each frame to generate an input speech parameter vector sequence;
a procedure of generating, using the speech generation model, an articulation parameter vector sequence that maximizes the posterior probability for the input speech parameter vector sequence;
a procedure of relating the average vector sequence of the articulation parameter vector sequence to the average vector sequence of the speech generation model so that the output probability of the articulation parameter vector sequence is maximized; and
a procedure of applying the relation to the average vectors of the speech generation model, thereby adapting the average vectors.

The method according to claim 2, further comprising:
a procedure of generating a speech parameter vector sequence corresponding to the articulation parameter vector sequence using the articulation/acoustic mapping coefficients of each speech generation model;
a procedure of relating the generated speech parameter vector sequence to the input speech parameter vector sequence so that the output probability of the input speech parameter vector sequence is maximized; and
a procedure of applying the relation to the articulation/acoustic mapping coefficients of the speech generation model, thereby adapting them.

The method wherein the procedure of generating the articulation parameter vector sequence includes:
a procedure of generating, for each of the speech generation models, the respective average vector of the speech parameter vectors corresponding to the articulation parameter vector sequence using the respective articulation/acoustic mapping coefficients;
a procedure of adapting the average vectors so that the output probability of the input speech parameter vector sequence is maximized;
a procedure of generating the sequence of the average vectors that maximizes the output probability of the input speech parameter vector sequence; and
a procedure of generating, with respect to the average vector sequence, the articulation parameter vector sequence that maximizes the posterior probability for the input speech parameter vector sequence.

A speech generation model speaker adaptation device for adapting, to an input speaker speech signal, a speech generation model stored in a storage device and including a probabilistic dynamic model representing articulatory movement (hereinafter referred to as an articulation model) and articulation/acoustic mapping coefficients that associate an articulation parameter vector with a speech parameter vector, the device comprising:
a speech parameter vector generation unit that performs spectrum analysis of the input speaker speech signal for each frame and generates an input speech parameter vector sequence;
an average vector generation unit that receives the articulation/acoustic mapping coefficients of each of the speech generation models and generates the average vector of the speech parameter vectors;
a speech relation coefficient calculation unit that receives the average vector sequence of the input speech parameter vector sequence (hereinafter referred to as the input speech average vector) and the respective average vectors, and calculates a speech relation coefficient that relates them so that the output probability of the input speech parameter vector sequence is maximized; and
a speech adaptation unit that multiplies each articulation/acoustic mapping coefficient of each speech generation model by the speech relation coefficient and outputs the adapted articulation/acoustic mapping coefficients.

A speech generation model speaker adaptation device for adapting, from an input speaker speech signal, a speech generation model stored in a storage device and including an articulation model and articulation/acoustic mapping coefficients, the device comprising:
a speech parameter vector generation unit that performs spectrum analysis of the input speaker speech signal for each frame and generates an input speech parameter vector sequence;
an articulation parameter vector generation unit that receives the speech generation model and the input speech parameter vector sequence and generates an articulation parameter vector sequence that maximizes the posterior probability for the input speech parameter vector sequence;
an average relation coefficient calculation unit that receives the articulation parameter vector sequence and each articulation average vector of each speech generation model, and calculates an average relation coefficient that relates them so that the output probability of the articulation parameter vector sequence is maximized; and
an average vector adaptation unit that multiplies each articulation average vector of each speech generation model by the average relation coefficient to generate an adapted articulation average vector.

The speech generation model speaker adaptation device as described, further comprising:
a speech vector generation unit that receives the articulation/acoustic mapping coefficients of each of the speech generation models and the articulation parameter vector sequence and generates a corresponding speech parameter vector sequence;
a mapping relation coefficient calculation unit that receives the generated speech parameter vector sequence and the input speech parameter vector sequence, and calculates a mapping relation coefficient that relates them so that the output probability of the input speech parameter vector sequence is maximized; and
a mapping adaptation unit that multiplies each articulation/acoustic mapping coefficient of each speech generation model by the mapping relation coefficient to generate adapted articulation/acoustic mapping coefficients.

The speech generation model speaker adaptation device according to claim 6, wherein the articulation parameter vector generation unit comprises:
a speech average vector generation unit that receives each articulation/acoustic mapping coefficient and each articulation average vector of each of the speech generation models, and generates each average vector of the speech parameter vectors;
a speech relation coefficient calculation unit that receives the input speech parameter vector sequence and each of the speech average vectors, and calculates a speech relation coefficient that relates them so that the output probability of the input speech parameter vector sequence is maximized;
a speech adaptation unit that multiplies each of the speech average vectors by the speech relation coefficient to generate adapted speech average vectors;
a speech average vector sequence generation unit that generates the sequence of adapted speech average vectors that maximizes the output probability of the input speech parameter vector sequence; and
a parameter sequence generation unit that generates, with respect to the adapted speech average vector sequence, the articulation parameter vector sequence that maximizes the posterior probability for the input speech parameter vector sequence.

A program for causing a computer to execute each procedure of the speech generation model speaker adaptation method according to claim 1.

A computer-readable recording medium recording the program according to claim 9.