JP4242676B2

JP4242676B2 - Decomposition method to create a mouth shape library

Info

Publication number: JP4242676B2
Application number: JP2003066584A
Authority: JP
Inventors: Jean-Claude Junqua (クロードジュンカジャン)
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2002-03-12
Filing date: 2003-03-12
Publication date: 2009-03-25
Anticipated expiration: 2023-03-12
Also published as: JP2003280677A

Abstract

PROBLEM TO BE SOLVED: To create a library of mouth shapes from only a small amount of mouth shape data.
SOLUTION: The library of mouth shapes is created by separating speaker-dependent and speaker-independent variability. Preferably, speaker-dependent variability is modeled by a speaker space 42, while speaker-independent variability (i.e., context dependency) is modeled by a set 44 of normalized mouth shapes that need be built only once. Given a small amount of data from a new speaker, a corresponding mouth shape library can be constructed by estimating the point in speaker space that maximizes the likelihood of the adaptation data and by combining speaker-dependent and speaker-independent variability. To build the speaker space 42, a context-independent parametric representation of the mouth shapes is obtained. Then a supervector containing the set of context-independent mouth shapes is formed for each speaker included in the speaker space 42. Dimensionality reduction 38 is used to find the axes of the speaker space 42.
COPYRIGHT: (C)2004, JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to the generation of mouth shapes for use in various multimedia applications, such as audiovisual text-to-speech synthesis systems that display synthetic or simulated mouth shapes. In particular, it relates to a system and method for creating a mouth shape library based on a technique that separates speaker-dependent variation from speaker-independent variation.
[0002]
[Prior art]
Generating animation sequences of a talking head in multimedia and text-to-speech synthesis applications can be very tedious, especially when images representing the various mouth shapes have to be captured. Because mouth shapes are affected by coarticulation (the influence of neighboring sounds on one another), a large library storing many mouth shapes is required to match the speech to the talking-head animation. Advances in 3D modeling techniques and the availability of faster computers have increased interest in developing realistic talking heads based on images captured from real people and on the latest modeling techniques.
[0003]
[Problems to be solved by the invention]
However, although it has become possible to build a computer model of a real face from a set of images, it is still difficult to create the mouth shape library needed to synchronize the audio data well with the image data, that is, the video data.
[0004]
Although progress continues in this area, the solutions proposed so far require a coarticulation library to be built from a large number of mouth shapes, which is very time-consuming. At present, there is no effective way to create a mouth shape library in which audio and images are well synchronized unless a particular speaker records mouth shape samples for many hours.
[0005]
While it would be really desirable to create a mouth shape library that synchronizes audio and images well with only a small amount of mouth shape data, no such technology has ever existed.
[0006]
Accordingly, it is an object of the present invention to provide a system and method for creating a mouth shape library with only a small amount of mouth shape data.
[0007]
[Means for Solving the Problems]
In a first aspect of the present invention, a method for creating a mouth shape library is provided. The method comprises the steps of: providing information on a speaker-independent mouth shape model; providing information on the variation of a speaker-dependent mouth shape model; obtaining mouth shape data of a new speaker; estimating information on the speaker-dependent mouth shape model based on the new speaker's mouth shape data and the information on the variation of the speaker-dependent mouth shape model; and creating a mouth shape library based on the information on the speaker-independent mouth shape model and the information on the speaker-dependent mouth shape model.
[0008]
In a second aspect of the present invention, an adaptive audiovisual text-to-speech synthesis system is provided. The system comprises: a computer memory that stores information on a speaker-independent mouth shape model and information on the variation of a speaker-dependent mouth shape model; an input unit that receives mouth shape data of a speaker; and a mouth shape library creation module that estimates information on the speaker-dependent mouth shape model based on the mouth shape data and the information on the variation of the speaker-dependent mouth shape model, and that creates a mouth shape library based on the information on the speaker-independent mouth shape model and the information on the speaker-dependent mouth shape model.
[0009]
In a third aspect of the present invention, a method for producing a mouth shape library creation module for use in an adaptive audiovisual text-to-speech synthesis system is provided. The method comprises the steps of: obtaining information on a speaker-independent mouth shape model and information on the variation of a speaker-dependent mouth shape model based on mouth shape data from a plurality of learning speakers; storing the information on the speaker-independent mouth shape model and the information on the variation of the speaker-dependent mouth shape model in a computer memory; estimating information on the speaker-dependent mouth shape model based on speaker-dependent mouth shape data and the information on the variation of the speaker-dependent mouth shape model; and creating a mouth shape library based on the information on the speaker-independent mouth shape model and the information on the speaker-dependent mouth shape model.
[0010]
In a preferred embodiment, speaker-dependent variation (variation between speakers) is modeled by a speaker space, while speaker-independent variation (that is, context dependency, or variation due to context) is modeled by a set of normalized mouth shapes that needs to be built only once. Given a small amount of data from a new speaker, a corresponding mouth shape library can be created by estimating the point in the speaker space that maximizes the likelihood of the adaptation data. Because this technique allows a mouth shape library to be created from only a few mouth shape instances, it greatly simplifies the production of talking heads. To build the speaker space, a parametric representation of the mouth shapes is obtained. Then, for each speaker in the speaker space, a supervector containing the set of context-independent mouth shapes is formed. The axes of the speaker space are then found using a dimensionality reduction technique such as principal component analysis (PCA) or linear discriminant analysis (LDA).
[0011]
Other areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be noted that the following detailed description and specific examples showing preferred embodiments of the present invention are merely examples, and are not intended to limit the scope of the present invention.
[0012]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
[0013]
It should be noted that the following description of the preferred embodiment is merely illustrative in nature and does not limit the present invention, its application and usage.
[0014]
In a preferred embodiment of the present invention, a mouth shape library is created using a model-based system. After being trained on N learning speakers, the model-based system is used to generate mouth shape data by adapting to mouth shape data from a new speaker (who may, in some cases, be one of the learning speakers). The system takes context into account by identifying mouth shape characteristics as a function of the preceding and following mouth shapes. In the preferred embodiment, speaker-independent variation and speaker-dependent variation are separated, that is, decomposed. The system associates context-dependent mouth shapes (mouth shapes in a specific context) with speaker-independent variation, and context-independent mouth shapes with speaker-dependent variation.
[0015]
During training, the speaker-independent data is stored in a decision tree that organizes the data by context. Also during training, the speaker-dependent data is used to construct an eigenspace that represents the speaker-dependent characteristics of the population of N learning speakers.
[0016]
Later, when a new mouth shape library is needed, the new speaker provides samples of mouth shape data for some, but not necessarily all, visemes. A viseme is a mouth shape associated with the articulation of a specific phoneme. From this data sample, the new speaker is placed, that is, projected, into the eigenspace. A speaker-dependent (context-independent) parameter set is estimated from the position of the new speaker in the eigenspace. From these parameters, the system generates context-independent centroids. Context-dependent data from the decision tree is then added to the centroids; it may be added as offsets, one for each different context. In this way, the entire mouth shape library can be created. For a better understanding of this library creation process, a detailed description is given below with reference to FIGS. 1 to 3.
[0017]
As shown in FIG. 1, the mouth shape library creation method 10 starts at 12 and proceeds to step 14, where information on a speaker-independent mouth shape model is provided. In the preferred embodiment, the speaker-independent mouth shape model information corresponds to a parameter space stored in context-dependent delta decision trees. The method 10 proceeds to step 16, where information on the variation of the speaker-dependent mouth shape model is provided. In the preferred embodiment, a context-independent speaker space is generated in step 16; this speaker space can be obtained by representing a plurality of mouth shapes parametrically, speaker by speaker, so as to produce a context-independent, speaker-dependent parameter space. Also in the preferred embodiment, the speaker-dependent data is used to generate an eigenspace corresponding to the N learning speakers. The method 10 then proceeds to step 18, where mouth shape data of the new speaker is obtained. Preferably, the data is obtained through image detection following a prompt for mouth shape input, and the mouth shape input is represented parametrically in step 18. In embodiments that represent the population of N speakers in an eigenspace, it is not necessary to obtain input data from the new speaker for all of the different visemes.
[0018]
Next, the method 10 proceeds to step 20, where speaker-dependent mouth shape model information is estimated based on the obtained mouth shape data and the information on the variation of the speaker-dependent mouth shape model. The method 10 then proceeds to step 22, where a mouth shape library is created based on the speaker-independent mouth shape model information and the speaker-dependent mouth shape model information. In the preferred embodiment, step 22 adds the speaker-dependent, context-independent parameter space to the speaker-independent, context-dependent parameter space to obtain a speaker-dependent, context-dependent parameter space. The method 10 then ends at 24.
[0019]
In the preferred embodiment, step 20 creates a speaker-dependent, context-independent supervector based on the speaker-dependent parametric representation and the information on the variation of the speaker-dependent mouth shape model. Specifically, a point in the speaker space (eigenspace) is estimated from the speaker-dependent parametric representation, and the speaker-dependent, context-independent supervector is created from that estimated point. One way to estimate a suitable point, if all visemes are available, is to use the Euclidean distance. However, if the parametric representation corresponds to Gaussian distributions from hidden Markov models, a maximum likelihood estimation technique (MLET) can be used, on the assumption that mouth shape movements are state transitions. In practice, the maximum likelihood estimation technique selects the supervector in the speaker space that best matches the speaker's mouth shape input data, regardless of how much mouth shape data is actually available.
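The Euclidean-distance option mentioned above can be illustrated with a minimal sketch. It assumes that the eigenvectors are stored as orthonormal rows of a NumPy array and that a mean supervector was subtracted when the eigenspace was built. The function name, the array layout and the mean offset are illustrative assumptions, not details given in this publication.

```python
import numpy as np

def project_to_speaker_space(supervector, mean, eigenvectors):
    """Return eigenspace coordinates and the closest supervector in the space.

    eigenvectors: (E, D) array whose rows are assumed orthonormal (hypothetical
    layout). With orthonormal rows, the orthogonal projection is the point of
    the eigenspace at minimum Euclidean distance from the input supervector.
    """
    centered = np.asarray(supervector, dtype=float) - mean
    weights = eigenvectors @ centered            # coordinates in the eigenspace
    estimate = mean + eigenvectors.T @ weights   # back to supervector form
    return weights, estimate

# Toy example: a 3-dimensional supervector and a 2-dimensional speaker space.
mean = np.array([1.0, 1.0, 1.0])
eigenvectors = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
weights, estimate = project_to_speaker_space([3.0, 0.0, 6.0], mean, eigenvectors)
```

This simple projection needs a complete supervector, which is why the text restricts it to the case where all visemes are available.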
[0020]
The maximum likelihood estimation technique uses a probability function Q that represents the probability of generating the observed data for a given set of mouth shape models. The probability function Q is easier to manipulate when it includes not only the probability term P but also its logarithm, log P. The probability function is then maximized by taking its derivative with respect to each eigenvalue individually. For example, if the speaker space has 100 dimensions, the system computes 100 derivatives of the probability function Q, sets each to zero, and solves for each eigenvalue W.
[0021]
The set of eigenvalues W thus obtained represents the eigenvalues necessary to identify the point in the speaker space corresponding to the maximum likelihood point. Therefore, the set of eigenvalues W includes the maximum likelihood vector in the speaker space. This maximum likelihood vector can then be used to create a supervector corresponding to the optimal point in the speaker space.
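As a rough illustration of solving for the values W, the sketch below assumes a simplified model: one Gaussian per viseme, each observed frame assigned to exactly one viseme (occupation probability 1), and known inverse covariances. Under these assumptions, setting the derivatives to zero gives a linear system in W. The function name, the array shapes and the hard assignment are hypothetical choices made for the example.

```python
import numpy as np

def estimate_weights(eigen_means, inv_covs, frames):
    """Solve the zero-derivative conditions for the weights W.

    eigen_means: (E, S, n) array, mean of viseme s in eigenvector j (assumed layout).
    inv_covs:    (S, n, n) array, inverse covariance of each viseme Gaussian.
    frames:      list of (viseme_index, observation) pairs; each frame is
                 assumed to belong fully to one viseme (gamma = 1).
    """
    E = eigen_means.shape[0]
    A = np.zeros((E, E))
    b = np.zeros(E)
    for s, o_t in frames:
        M = eigen_means[:, s, :]        # (E, n): means of viseme s per eigenvector
        A += M @ inv_covs[s] @ M.T      # A[e, j] = mu(e)^T C^-1 mu(j)
        b += M @ inv_covs[s] @ o_t      # b[e]    = mu(e)^T C^-1 o_t
    return np.linalg.solve(A, b)

# Toy data: 2 eigenvectors, 3 visemes, 2-dimensional features.
eigen_means = np.array([[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                        [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]])
inv_covs = np.stack([np.eye(2)] * 3)
true_w = np.array([0.5, -2.0])
# The new speaker only provides visemes 0 and 1; viseme 2 is never observed.
frames = [(s, true_w @ eigen_means[:, s, :]) for s in (0, 1)]
w = estimate_weights(eigen_means, inv_covs, frames)
unseen_mean = w @ eigen_means[:, 2, :]   # estimate for the unobserved viseme
```

In this toy case the weights are recovered from two of the three visemes, and the mean of the third viseme follows from the same weights.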
[0022]
With regard to maximum likelihood, in the framework of the present invention it is necessary to maximize the likelihood of the observation O given the model. This can be done by iteratively maximizing the auxiliary function Q given by the following equation:
[Equation 1]

(where λ is the model and λ^ is the estimated model).
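The formula referred to above (Equation 1 in the original publication) is not reproduced in this text. A plausible reconstruction, assuming the standard auxiliary function used for maximum likelihood estimation with hidden Markov models, is sketched here. The summation variable θ, ranging over state sequences, is an assumed symbol.

```latex
\[
Q(\lambda, \hat{\lambda}) = \sum_{\theta \in \text{states}}
    P(O, \theta \mid \lambda)\, \log P(O, \theta \mid \hat{\lambda})
\]
```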
[0023]
As a preliminary approximation, the maximization may be carried out with respect to the means only. Since the probability P is given by the set of mouth shape models, the following equation is obtained.
[Equation 2]

where
[Equation 3]

and $o_t$ is the feature vector at time t, $C_m^{(s)-1}$ is the inverse covariance of mixture Gaussian m of state s, $\hat{\mu}_m^{(s)}$ is the approximated adapted mean of mixture component m of state s, and $\gamma_m^{(s)}(t)$ is the probability P(using mixture Gaussian m | λ, $o_t$).
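The equation images for Equations 2 and 3 are not reproduced in this text. A hedged reconstruction, consistent with the symbols defined above and with the usual Gaussian form of the auxiliary function, might read as follows. The feature dimension n and the summation limits are assumptions.

```latex
\begin{align}
Q(\lambda, \hat{\lambda}) &= \text{const} - \frac{1}{2}\, P(O \mid \lambda)
  \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t)
  \left[ n \log(2\pi) + \log \left| C_m^{(s)} \right| + h(o_t, m, s) \right] \\
h(o_t, m, s) &= \left( o_t - \hat{\mu}_m^{(s)} \right)^{T}
  C_m^{(s)-1} \left( o_t - \hat{\mu}_m^{(s)} \right)
\end{align}
```

Only the term $h(o_t, m, s)$ depends on the adapted means, which is why the later derivatives involve this term alone.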
[0024]
Assume that the Gaussian means of the new speaker's mouth shape models lie in the speaker space, and that this space is spanned by the following mean supervectors $\mu_j$ (j = 1, ..., E).
[Equation 4]

[0025]
Here, $\mu_m^{(s)}(j)$ denotes the mean vector of mixture Gaussian m in state s of eigenvector (eigenmodel) j. The following approximation $\hat{\mu}$ is then required.
[Equation 5]

[0026]
The $\mu_j$ are orthogonal, and the $w_j$ are the eigenvalues of the speaker model. It is assumed here that any new speaker can be modeled as a linear combination of the speakers in the database of observed speakers. In that case,
[Equation 6]

holds, where s is a state of the model λ and m is one of the mixture Gaussians of M.
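The images for Equations 4 to 6 are not reproduced in this text. Based on the definitions in the surrounding paragraphs, they can plausibly be reconstructed as below. The index limits $S_\lambda$ (number of states) and $M_{S_\lambda}$ (number of mixture components in the last state) are assumed symbols.

```latex
\begin{align}
\mu_j &= \left[ \mu_1^{(1)}(j),\ \mu_2^{(1)}(j),\ \ldots,\ \mu_m^{(s)}(j),\ \ldots,\
  \mu_{M_{S_\lambda}}^{(S_\lambda)}(j) \right]^{T} \\
\hat{\mu} &= \sum_{j=1}^{E} w_j\, \mu_j \\
\hat{\mu}_m^{(s)} &= \sum_{j=1}^{E} w_j\, \mu_m^{(s)}(j)
\end{align}
```

The third line is simply the second line restricted to the block of the supervector that belongs to mixture Gaussian m of state s.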
[0027]
To maximize the probability function Q, it is necessary to set $\partial Q / \partial w_e = 0$ for e = 1, ..., E (note that, since the eigenvectors are orthogonal, $\partial w_i / \partial w_j = 0$ for i ≠ j). Therefore, the following equation holds.
[Equation 7]

[0028]
When the above derivative is calculated, the following equation is obtained.
[Equation 8]

[0029]
Furthermore, from the above equation, the following series of linear equations is obtained.
[Equation 9]
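The images for Equations 7 to 9 are not reproduced in this text. The derivation they describe can be sketched as follows, assuming that the adapted mean is the weighted sum of the eigenvector means and that the inverse covariances are symmetric. This is a reconstruction from the surrounding definitions, not a copy of the original figures.

```latex
\begin{align}
\frac{\partial Q}{\partial w_e} = 0 &= \sum_{s} \sum_{m} \sum_{t}
  \frac{\partial}{\partial w_e} \left\{ \gamma_m^{(s)}(t)\, h(o_t, m, s) \right\},
  \qquad e = 1, \ldots, E \\
0 &= \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t)
  \left\{ -\mu_m^{(s)T}(e)\, C_m^{(s)-1}\, o_t
  + \sum_{j=1}^{E} w_j\, \mu_m^{(s)T}(j)\, C_m^{(s)-1}\, \mu_m^{(s)}(e) \right\} \\
\sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t)\, \mu_m^{(s)T}(e)\, C_m^{(s)-1}\, o_t
  &= \sum_{s} \sum_{m} \sum_{t} \gamma_m^{(s)}(t)
  \sum_{j=1}^{E} w_j\, \mu_m^{(s)T}(j)\, C_m^{(s)-1}\, \mu_m^{(s)}(e),
  \qquad e = 1, \ldots, E
\end{align}
```

The last line is a set of E linear equations in the E unknowns $w_1, \ldots, w_E$, so it can be solved with any standard linear solver.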

[0030]
As shown in FIG. 2, in the preferred embodiment of the decomposition into speaker-dependent variation and speaker-independent variation, a parameter space is generated from the mouth shapes input by N learning speakers 26. This parameter space of the learning speakers consists of supervectors 28 created from the mouth shape data collected from the learning speakers. For example, the mouth shapes may be modeled as hidden Markov models, or other probabilistic models, having one or more Gaussian distributions per state. The parameter space may be constructed from the parameter values used to define the Gaussian distributions.
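As a small illustration of how a supervector might be assembled, the sketch below concatenates the Gaussian mean vectors of each viseme in a fixed order. The viseme labels, the use of means only, and the function name are assumptions made for the example.

```python
import numpy as np

def build_supervector(viseme_means, viseme_order):
    """Concatenate per-viseme Gaussian means into one supervector.

    viseme_means: dict mapping a viseme label to its mean vector (hypothetical).
    viseme_order: fixed ordering shared by all speakers, so that the same
                  position in every supervector refers to the same parameter.
    """
    return np.concatenate(
        [np.asarray(viseme_means[v], dtype=float) for v in viseme_order]
    )

viseme_order = ["aa", "m"]
speaker_means = {"m": [0.1, 0.2], "aa": [0.9, 0.4]}
supervector = build_supervector(speaker_means, viseme_order)
```

Using the same ordering for every learning speaker is what makes the supervectors comparable across speakers.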
[0031]
The context-dependent (speaker-independent) variation and the context-independent (speaker-dependent) variation are separated, that is, decomposed, as follows. First, context-independent, speaker-dependent data 34 is obtained from the learning speaker data 26, and the mean of this data 34 is supplied as input to a separation process 30. The separation process 30 obtains knowledge of the context from labeled context information 32 and also receives the learning speaker data 26 as input. Using the knowledge of the context, the separation process 30 subtracts the mean of the context-independent, speaker-dependent data 34 from the learning speaker data. In this way, the separation process 30 generates, that is, extracts, context-dependent, speaker-independent data 36. This context-dependent, speaker-independent data 36 is stored in the data structure of delta decision trees 44.
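The subtraction performed by the separation process can be sketched as follows. The sketch assumes that each training sample carries a speaker label, a viseme label and a context label, and that the residuals for a given viseme and context are averaged over speakers. The averaging step and all names are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

def extract_deltas(samples, speaker_means):
    """Compute context-dependent offsets from labeled training samples.

    samples:       iterable of (speaker, viseme, context, params) tuples.
    speaker_means: dict mapping (speaker, viseme) to the context-independent,
                   speaker-dependent mean for that speaker and viseme.
    """
    residuals = defaultdict(list)
    for speaker, viseme, context, params in samples:
        centroid = speaker_means[(speaker, viseme)]
        # Remove the speaker-dependent part; what remains depends on context.
        residuals[(viseme, context)].append(np.asarray(params, dtype=float) - centroid)
    # Assumption: pool the residuals of all speakers for each (viseme, context).
    return {key: np.mean(vals, axis=0) for key, vals in residuals.items()}

speaker_means = {("A", "m"): np.array([1.0, 1.0]),
                 ("B", "m"): np.array([3.0, 3.0])}
samples = [("A", "m", ("a", "a"), [1.5, 1.0]),
           ("B", "m", ("a", "a"), [3.5, 3.0]),
           ("A", "m", ("u", "u"), [0.5, 1.0])]
deltas = extract_deltas(samples, speaker_means)
```

In this toy data both speakers show the same offset for the context ("a", "a"), which is the kind of regularity the speaker-independent data is meant to capture.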
[0032]
In the preferred embodiment, the Gaussian distribution data representing the context-dependent, speaker-independent data 36 is stored in the form of delta decision trees 44 for the various visemes. Each delta decision tree 44 consists of yes/no questions about the context at the non-terminal nodes 46 and Gaussian distribution data representing specific mouth shapes at the terminal nodes 48.
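One possible in-memory form of such a tree is sketched below. It assumes that a context is encoded as a pair (previous viseme, next viseme) and that each leaf stores only a mean offset rather than full Gaussian data. The class, the questions and the numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Context = Tuple[str, str]  # (previous viseme, next viseme): assumed encoding

@dataclass
class Node:
    question: Optional[Callable[[Context], bool]] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    delta: Optional[Tuple[float, ...]] = None              # set on leaves only

def lookup_delta(node: Node, context: Context) -> Tuple[float, ...]:
    """Walk from the root to a leaf by answering yes/no context questions."""
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.delta

# Hypothetical tree for one viseme.
tree = Node(
    question=lambda ctx: ctx[0] in {"p", "b", "m"},   # is the previous viseme bilabial?
    yes=Node(delta=(-0.2, 0.0)),
    no=Node(
        question=lambda ctx: ctx[1] in {"u", "o"},    # is the next viseme rounded?
        yes=Node(delta=(0.1, 0.3)),
        no=Node(delta=(0.0, 0.0)),
    ),
)
```

Contexts that never occurred in the training data still reach some leaf, which is one practical reason for organizing the data in a tree rather than a flat table.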
[0033]
Meanwhile, the context-independent, speaker-dependent data 34 is kept in the form of supervectors, and the dimensionality of these supervectors is reduced (38) by a suitable dimensionality reduction technique such as principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA), factor analysis (FA) or singular value decomposition (SVD). As a result, a set of eigenvectors and associated eigenvalues is extracted. In the preferred embodiment, process 40 discards some of the lowest-order eigenvectors in order to reduce the size of the speaker space 42. This leaves a chosen number of higher-order eigenvectors, which make up the eigenspace, that is, the speaker space 42. Although all of the generated eigenvectors could be retained, process 40 is preferably performed to reduce the memory needed to store the speaker space 42.
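The reduction and truncation steps can be sketched with PCA computed through a singular value decomposition. The sketch assumes one supervector per learning speaker, stored as the rows of a matrix, and mean-centering before the decomposition. The function name and the number of retained eigenvectors are illustrative assumptions.

```python
import numpy as np

def build_speaker_space(supervectors, n_keep):
    """PCA via SVD: return the mean, the top eigenvectors and their eigenvalues.

    supervectors: (N, D) array, one row per learning speaker.
    n_keep:       number of leading eigenvectors retained (the truncation step).
    """
    X = np.asarray(supervectors, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the eigenvectors of the sample covariance matrix,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    eigenvalues = singular_values ** 2 / (X.shape[0] - 1)
    return mean, vt[:n_keep], eigenvalues[:n_keep]

# Toy data: three speakers whose supervectors differ along one direction only.
supervectors = [[0.0, 0.0, 1.0],
                [2.0, 0.0, 1.0],
                [4.0, 0.0, 1.0]]
mean, basis, eigenvalues = build_speaker_space(supervectors, n_keep=1)
```

Keeping fewer eigenvectors reduces storage and the number of weights to estimate for a new speaker, at the cost of representing less of the variation between speakers.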
[0034]
Once the eigenspace (speaker space) 42 and the delta decision tree 44 have been generated for the N learning speakers, the system is ready to be used in creating a new speaker mouth shape library. In this case, the new speaker may be a speaker who has not provided mouth shape data in advance during learning, or may be one of the speakers who participated during learning.
[0035]
FIG. 3 shows a system and process for creating a new library.
[0036]
As shown in FIG. 3, a parametric representation 50 of mouth shape data is first obtained from the new speaker. Although a complete set of mouth shape parameter data for all visemes could be collected at this stage, this is not necessary in practice. It is enough to obtain a sample of mouth shape data sufficient to identify a point in the eigenspace. A point P in the speaker space 42 is thus estimated from the parametric representation 50 of the mouth shape data, and a context-independent, speaker-dependent parameter space 52 is generated in the form of a centroid 53 corresponding to the point P in the eigenspace (speaker space). One major advantage of using the eigenspace is that the parameters of visemes whose mouth shapes the new speaker did not provide can be estimated automatically. This is possible because the eigenspace is based on the speaker-dependent data of the population of N learning speakers, for whom a complete set of mouth shape data has preferably been provided.
[0037]
As indicated by reference numeral 54, the context-dependent, speaker-independent mouth shape data 48 stored in the form of the delta decision trees 44 is added to the centroid 53 of the context-independent, speaker-dependent parameter space 52, which yields the mouth shape library 56.
[0038]
Specifically, the context-dependent, speaker-independent data is retrieved from the delta decision trees for each context, and the retrieved data is combined with, that is, added to, the speaker-dependent data created using the eigenspace, thereby creating the new speaker's mouth shape library. In effect, the speaker-dependent data created from the eigenspace can be regarded as a centroid, and the speaker-independent data as a "delta", that is, an offset from that centroid. In this regard, the data created from the eigenspace represents the mouth shape information of the particular speaker (part of which consists of estimates obtained through the eigenspace), whereas the data obtained from the delta decision trees represents the speaker-independent differences between mouth shapes in different contexts. A new mouth shape library is thus created by combining, for each context, the speaker-dependent information (centroid) with the speaker-independent information (offset).
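The combination of centroids and offsets described here can be sketched in a few lines. The sketch assumes that the centroids are indexed by viseme and the offsets by (viseme, context) pairs. The data layout and names are assumptions for illustration.

```python
import numpy as np

def build_library(centroids, deltas):
    """Add each context-dependent offset to the centroid of its viseme.

    centroids: dict mapping viseme to the speaker-dependent mean vector.
    deltas:    dict mapping (viseme, context) to the speaker-independent offset.
    """
    return {
        (viseme, context): centroids[viseme] + delta
        for (viseme, context), delta in deltas.items()
    }

centroids = {"m": np.array([1.0, 2.0])}
deltas = {("m", ("a", "a")): np.array([0.5, 0.0]),
          ("m", ("u", "u")): np.array([-0.5, 0.0])}
library = build_library(centroids, deltas)
```

Because the offsets are shared by all speakers, only the centroids have to be re-estimated when a library is built for another speaker.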
[0039]
As shown in FIG. 4, in the adaptive audiovisual text-to-speech synthesis system 58 of the present invention, speaker-independent mouth shape model information 60 and information 62 on the variation of the speaker-dependent mouth shape model are stored in a computer memory. The system 58 also has an input unit 64 that receives mouth shape data 66 from a new speaker. A mouth shape library creation module 68 is operable to estimate speaker-dependent mouth shape model information (not shown) based on the mouth shape data 66 from the new speaker and the information 62 on the variation of the speaker-dependent mouth shape model, and to create a mouth shape library 70 based on the speaker-independent mouth shape model information 60 and the speaker-dependent mouth shape model information (not shown).
[0040]
The above description of the present invention is merely exemplary in nature and, therefore, variations that do not depart from the spirit of the present invention are intended to be included within the scope of the present invention. Such variations are not to be regarded as a departure from the spirit and scope of the present invention.
[Brief description of the drawings]
FIG. 1 is a flowchart of a mouth shape library creation method according to the present invention.
FIG. 2 is a block diagram illustrating decomposition into speaker-dependent variation and speaker-independent variation in a preferred embodiment of the present invention.
FIG. 3 is a block diagram of a mouth shape library creation method according to a preferred embodiment of the present invention.
FIG. 4 is a block diagram of an adaptive audiovisual text speech synthesis system according to the present invention.
[Explanation of symbols]
28 Supervector
44 Delta decision tree
50 Parametric representation of mouth shape data
52 Context-independent, speaker-dependent parameter space
53 Centroid
56 Mouth shape library
58 Adaptive audiovisual text-to-speech synthesis system
64 Input unit

Claims

A step (14) of extracting speaker-independent data (36) by subtracting, from mouth shape data (26) obtained from at least one learning speaker, the average value of speaker-dependent data (34) in which the mouth shape data (26) is modeled in the form of a probability model, and of organizing the speaker-independent data (36) by context and providing it;
A step (16) of providing an eigenspace (42) by constructing a parameter space using the parameters that define the probability model in the speaker-dependent data (34);
A step (18) of obtaining a parametric representation (50) of a new speaker's mouth shape data;
A step (20) of estimating a parameter space (52) that defines a probability model of the new speaker's mouth shape data by projecting the parametric representation (50) of the new speaker's mouth shape data into the eigenspace (42); and
A step (22) of creating a mouth shape library (56) by adding the speaker-independent data (44) organized by context to the parameter space (52) that defines the probability model of the new speaker's mouth shape data.
A mouth shape library creation method characterized by comprising the above steps.

The mouth shape library creation method of claim 1,
characterized in that, in the step (14), the speaker-independent data (36) is organized into a decision tree.

The mouth shape library creation method of claim 1,
characterized in that the step (14) includes organizing the speaker-independent data (36) into a decision tree whose nodes are organized according to context.

The mouth shape library creation method of claim 1,
characterized in that the step (16) includes steps (38, 40) of reducing the number of dimensions of the eigenspace (42).

The mouth shape library creation method of claim 1,
characterized in that the new speaker's mouth shape data consists of visemes.

The mouth shape library creation method of claim 1,
characterized in that the step (18) is performed by collecting a sample of viseme data from the new speaker.

The mouth shape library creation method of claim 6,
characterized in that the sample of viseme data is an incomplete set of the visemes that make up spoken words.

Means for extracting speaker-independent data (36) by subtracting, from mouth shape data (26) obtained from at least one learning speaker, the average value of speaker-dependent data (34) in which the mouth shape data (26) is modeled in the form of a probability model, and for organizing the speaker-independent data (36) by context and providing it;
Means for providing an eigenspace (42) by constructing a parameter space using the parameters that define the probability model in the speaker-dependent data (34);
Means for obtaining a parametric representation (50) of a new speaker's mouth shape data;
Means for estimating a parameter space (52) that defines a probability model of the new speaker's mouth shape data by projecting the parametric representation (50) of the new speaker's mouth shape data into the eigenspace (42); and
Means for creating a mouth shape library (56) by adding the speaker-independent data (44) organized by context to the parameter space (52) that defines the probability model of the new speaker's mouth shape data.
A mouth shape library creation system characterized by comprising the above means.