JP4459415B2

JP4459415B2 - Image processing apparatus, image processing method, and computer-readable information storage medium

Info

Publication number: JP4459415B2
Application number: JP2000264924A
Authority: JP
Inventors: 哲也河野
Original assignee: Namco Ltd; Namco Bandai Games Inc
Current assignee: Namco Ltd; Bandai Namco Entertainment Inc
Priority date: 2000-09-01
Filing date: 2000-09-01
Publication date: 2010-04-28
Anticipated expiration: 2020-09-01
Also published as: JP2002074395A

Description

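[0001]
[Technical Field of the Invention]
The present invention relates to an image processing apparatus, an image processing method, and a computer-readable information storage medium that generate an image of an object whose mouth movement and the like change in response to speech.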
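[0002]
[Prior Art]
In fields such as animation and the dubbing of live-action video, the mouth movements of a character must be drawn to match the character's speech in order to express facial expressions realistically. For animation, for example, the creator of the key drawings may determine the shape of the mouth and its surroundings by hand. The creator imagines the character's mouth movement while listening to the character's speech in a scene, or while reading the lines prepared for that scene, and draws the mouth shape accordingly. Reproducing a natural mouth shape therefore requires a certain amount of experience and intuition.
[0003]
When the speech of live-action video is dubbed, for example from English into Japanese, the mouth shape image synthesizing device disclosed in Japanese Patent No. 2795084 is known as a means of deforming the area around the mouth of the person in the video to match the Japanese speech. This device analyzes the content of the dubbed speech to extract phoneme and duration information, and determines the shape of the mouth and its surroundings based on that information.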
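[0004]
[Problems to Be Solved by the Invention]
The manual method of determining the shape of a character's mouth and its surroundings depends on the skill and ability of each worker, such as experience and intuition. Work efficiency is therefore poor, and because results differ from worker to worker, reproducibility is low and it is difficult to generate natural expressions.
[0005]
The method that determines the shape of the mouth and its surroundings from phoneme and duration information requires, as preprocessing, extracting that information from the speech content, and also requires registering in advance a different mouth shape pattern for each phoneme. The processing load is therefore heavy and the amount of data needed for processing is large. For example, phonemes contained in speech can be extracted by speech recognition, but this requires registering a speech pattern for each phoneme in advance, comparing the pattern obtained by analyzing the actual speech with each registered pattern, and extracting the phoneme corresponding to the registered pattern with the highest similarity. A registered pattern is thus needed for every phoneme to be extracted, and searching for the most similar registered pattern imposes a heavy processing load.
[0006]
When an image of the entire head is generated, obtaining a natural expression requires not only the mouth shape described above but also opening and closing the eyelids at appropriate timings that do not look unnatural. A method that has a light processing load and yields natural expressions is therefore desired.
[0007]
The present invention was made in view of these points. Its object is to provide an image processing apparatus, an image processing method, and a computer-readable information storage medium that obtain natural expressions, impose a light processing load, and reduce the required storage capacity.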
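[0008]
[Means for Solving the Problems]
To solve the problems described above, the image processing apparatus of the present invention comprises frequency analysis means, mouth shape setting means, and coordinate correction means. The frequency analysis means performs frequency analysis on the input speech and detects a sound pressure level for each of a plurality of frequency components. The mouth shape setting means sets a mouth shape based on the sound pressure levels of the plurality of frequency components detected by the frequency analysis means. The coordinate correction means corrects the coordinates of the plurality of polygons constituting a three-dimensional object of a head so that the object takes the mouth shape set by the mouth shape setting means. The mouth shape setting means sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate, as mouth shape parameters, values for the position of the upper lip, the position of the lower lip, and the position of the corners of the mouth. Because uttered speech and mouth shape are closely related, detecting the sound pressure level of each frequency component of the input speech and reflecting it in the mouth shape yields a more natural expression. Moreover, the same input speech always produces the same mouth shape, so high reproducibility is achieved. In addition, the mouth shape can be set by frequency analysis of the input speech alone, so the processing load is lighter than in the conventional method, which had to extract phonemes by speech recognition. Furthermore, no dictionary for speech recognition is needed, and mouth shape patterns for each phoneme do not have to be registered, so the storage capacity of memory and the like can be reduced.
[0009]
It is also desirable that texture mapping means apply texture information with a preset pattern to each polygon constituting the three-dimensional object. Because a mouth shape corresponding to the input speech has been set and the vertex coordinates of each polygon have been corrected, texture mapping these polygons readily yields an image of the three-dimensional object in which the shape of the mouth and its surroundings has changed.
[0010]
The mouth shape setting means sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate the values of the mouth shape parameters needed to specify the shape of the mouth and its surroundings. If the relational expressions between the mouth shape parameters and the sound pressure levels are defined in advance, the mouth shape can be set simply by evaluating these expressions, which greatly reduces the processing load.
[0011]
In particular, the mouth shape parameters include parameters corresponding to the position of the upper lip, the position of the lower lip, and the position of the corners of the mouth. Setting these positions specifies the mouth shape.
It is also desirable to set an upper or lower limit for each of the upper lip position, the lower lip position, and the mouth corner position, and to set the parameter to that limit when the value calculated from the sound pressure levels falls outside it. Given how a real mouth moves and how the area around it is structured, the movement of the upper lip, lower lip, and mouth corners is limited. Moving them only within a predetermined range therefore prevents the mouth shape from becoming unnatural. For example, when speech with an extremely high sound pressure level in each frequency component is input, a mouth shape that could not occur in reality is avoided.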
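[0012]
It is also desirable that eyelid parameter setting means set an eyelid parameter, which indicates the timing for moving the eyelids, based on the sound pressure level detected by the frequency analysis means, and that the coordinate correction means correct the coordinates of the polygons related to opening and closing the eyelids based on the eyelid parameter, in addition to correcting the coordinates of the polygons related to the mouth shape. On a real human head, the eyelids close and open along with the movement of the mouth when speech is uttered, so taking eyelid movement into account when generating a head image yields a more natural expression.
[0013]
Alternatively, the correction of polygon coordinates in line with the eyelid opening and closing timing given by the eyelid parameter may be performed separately from the processing related to the mouth shape. In this case, the image processing apparatus of the present invention comprises specific frequency component analysis means, eyelid parameter setting means, and coordinate correction means. The specific frequency component analysis means detects the sound pressure level of a specific frequency component contained in the input speech. The eyelid parameter setting means sets an eyelid parameter, which indicates the timing for moving the eyelids, based on the sound pressure level of the specific frequency component detected by the specific frequency component analysis means. The coordinate correction means corrects the coordinates of the plurality of polygons constituting the three-dimensional object of the head so that the eyelids open and close according to the eyelid parameter set by the eyelid parameter setting means. Because the eyelid timing is tied to the sound pressure level of a specific frequency component of the input speech, a natural expression close to a person's actual expression can be generated.
[0014]
In particular, the frequency component whose sound pressure level is used to set the eyelid parameter is desirably one that is not linked to the movement of the mouth. This prevents the eyelids and the mouth from always moving together, which would look unnatural. In addition, because the eyelid parameter is set from the sound pressure level of a specific frequency component and this determines when the eyelids move, no other computation is needed, such as that required to move the eyelids at random timings, and the processing can be simplified.
[0015]
The image processing method of the present invention is a method for an image processing apparatus comprising frequency analysis means, mouth shape setting means, and coordinate correction means. It has a first step in which the frequency analysis means performs frequency analysis on the input speech and detects a sound pressure level for each of a plurality of frequency components; a second step in which the mouth shape setting means sets a mouth shape based on the sound pressure levels of the frequency components detected in the first step; and a third step in which the coordinate correction means corrects the coordinates of the plurality of polygons constituting a three-dimensional object of a head so that the object takes the mouth shape set in the second step. The mouth shape setting means sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate, as mouth shape parameters, values for the position of the upper lip, the position of the lower lip, and the position of the corners of the mouth. The image processing apparatus may further comprise eyelid parameter setting means, and the method may have a fourth step in which the eyelid parameter setting means sets an eyelid parameter, which indicates the timing for moving the eyelids, based on the sound pressure level detected in the first step. In that case, in the third step the coordinate correction means corrects the coordinates of the polygons related to the mouth shape and also corrects the coordinates of the polygons related to opening and closing the eyelids based on the eyelid parameter set in the fourth step.
[0016]
The information storage medium of the present invention stores a program that causes a computer to function as frequency analysis means that performs frequency analysis on the input speech and detects a sound pressure level for each of a plurality of frequency components; mouth shape setting means that sets a mouth shape based on the sound pressure levels of the plurality of frequency components detected by the frequency analysis means; and coordinate correction means that corrects the coordinates of the plurality of polygons constituting a three-dimensional object of a head so that the object takes the mouth shape set by the mouth shape setting means. The mouth shape setting means sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate, as mouth shape parameters, values for the position of the upper lip, the position of the lower lip, and the position of the corners of the mouth.
[0017]
Carrying out the image processing method of the present invention, or executing the program stored in the information storage medium of the present invention, forms a natural mouth shape that reflects the content of the input speech. In particular, the same input speech always produces the same mouth shape, so an image with a natural expression can be generated with high reproducibility. In addition, the mouth shape can be set by frequency analysis of the input speech alone, so the processing load is lighter than in the conventional method, which had to extract phonemes by speech recognition. Furthermore, no dictionary for speech recognition is needed, and mouth shape patterns for each phoneme do not have to be registered, so the storage capacity of memory and the like can be reduced.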
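[0018]
[Embodiments of the Invention]
An embodiment to which the present invention is applied is described in detail below with reference to the drawings.
FIG. 1 shows the configuration of an image processing apparatus according to an embodiment to which the present invention is applied. The image processing apparatus 100 shown in FIG. 1 includes an analog-to-digital (A/D) conversion unit 10, an object control unit 20, and an image generation unit 30.
[0019]
The A/D conversion unit 10 receives an analog speech signal, samples it at a predetermined sampling frequency, and outputs digital speech data at predetermined time intervals.
The object control unit 20 calculates the shape of a three-dimensional object that models a human head. For this purpose, the object control unit 20 includes a frequency analysis unit 22, a mouth shape parameter conversion unit 24, an eyelid parameter conversion unit 26, and a coordinate correction unit 28.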
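[0020]
The frequency analysis unit 22 performs frequency analysis of the input speech based on the speech data input at predetermined time intervals. For example, the frequency analysis unit 22 divides the audible frequency band into 64 regions and detects the sound pressure level of the input speech in each region.
The mouth shape parameter conversion unit 24 sets, based on the frequency analysis result from the frequency analysis unit 22, the values of the mouth shape parameters needed to determine the shape of the mouth and its surroundings on the head. In this embodiment, for example, the following five mouth shape parameters Pa to Pe are used.
[0021]
1) Parameter Pa, indicating the amount of upward change of the upper lip
2) Parameter Pb, indicating the amount of forward change of the upper lip
3) Parameter Pc, indicating the amount of downward change of the lower lip
4) Parameter Pd, indicating the amount of forward change of the lower lip
5) Parameter Pe, indicating the amount of lateral change of the mouth corners
The eyelid parameter conversion unit 26 sets, based on the frequency analysis result from the frequency analysis unit 22, the value of an eyelid parameter that indicates the movement of the eyelids on the head, that is, the timing for closing the eyelids. Because linking eyelid movement to mouth movement makes the expression unnatural, in this embodiment the eyelid parameter is set from the sound pressure level of a frequency component that appears to have no direct relation to mouth movement, and this determines when the eyelids close.
[0022]
The coordinate correction unit 28 changes, as needed, the vertex coordinates of the polygons constituting the three-dimensional object that models the head, so as to reflect the mouth shape determined by the mouth shape parameter conversion unit 24 and the eyelid movement determined by the eyelid parameter conversion unit 26. For example, let the "normal" state be the state in which no speech is input and the eyelids are open. For a predetermined range covering the mouth and its surroundings, the amount of change of each polygon's vertex coordinates relative to the normal state is calculated from the values of the five mouth shape parameters Pa to Pe. When the eyelid parameter conversion unit 26 signals the timing for an eyelid movement, the amounts of change of the vertex coordinates needed to reproduce the whole movement, from the normal state through closing the eyelids to opening them again, are calculated.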
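[0023]
The image generation unit 30 performs the processing needed to actually display, on the display device 110, the image of the three-dimensional object whose polygon vertex coordinates have been calculated by the object control unit 20. For this purpose, the image generation unit 30 includes a perspective projection conversion unit 32, a texture mapping unit 34, and a frame memory 36.
[0024]
The perspective projection conversion unit 32 performs perspective projection conversion with respect to a preset viewpoint position. This yields a two-dimensional image (pseudo three-dimensional image) of the three-dimensional object, placed in a virtual three-dimensional space, as seen from that viewpoint. The texture mapping unit 34 performs texture mapping, which associates a texture image with each polygon in the two-dimensional image obtained by the perspective projection conversion unit 32, and shading. A specific example of the configuration of the texture mapping unit 34 is described later. After texture mapping and shading, the two-dimensional image is stored in the frame memory 36, then read out in a predetermined order and displayed on the screen of the display device 110.
[0025]
FIG. 2 shows the detailed configuration of the texture mapping unit 34. As shown in FIG. 2, the texture mapping unit 34 includes a mapping coordinate calculation unit 50, a shape memory 52, a texture application unit 54, a texture memory 56, and a shading correction unit 58.
[0026]
The mapping coordinate calculation unit 50 calculates the correspondence between each polygon constituting the three-dimensional object and the textures. The shape memory 52 stores information about each polygon. The texture application unit 54 applies a texture to each polygon. The texture memory 56 stores information about each texture. The shading correction unit 58 corrects the brightness of the texture image applied to each polygon according to the position of the light source and the environmental conditions surrounding the three-dimensional object. For example, where the relief of the nose, lips, and so on casts shadows, this shading correction reproduces the specific shading.
[0027]
The frequency analysis unit 22 corresponds to the frequency analysis means and the specific frequency component analysis means, the mouth shape parameter conversion unit 24 to the mouth shape setting means, the eyelid parameter conversion unit 26 to the eyelid parameter setting means, the coordinate correction unit 28 to the coordinate correction means, and the texture mapping unit 34 to the texture mapping means.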
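[0028]
The image processing apparatus 100 of this embodiment is configured as described above. Its operation is described next.
FIGS. 3 to 5 show frequency analysis results for the human voice. FIG. 3 shows the result for the vowel "a", FIG. 4 for the vowel "i", and FIG. 5 for the vowel "u". In these figures the horizontal axis corresponds to frequency, and the sound level is shown for each of the frequency bands obtained by dividing roughly the audible band into 64.
[0029]
As FIGS. 3 to 5 show, the frequency distributions for the three different vowels "a", "i", and "u" contain three peaks that change in characteristic ways. The frequency bands containing these three peaks are called regions A, B, and C, in order from the low-frequency side.
[0030]
For example, in the frequency distribution of the vowel "a" in FIG. 3, the sound levels in regions A and B are high and the sound level in region C is low. In the distribution of the vowel "i" in FIG. 4, the sound levels in regions A and C are high and the sound level in region B is low. In the distribution of the vowel "u" in FIG. 5, the sound level in region A is very high and the sound levels in regions B and C are low.
[0031]
Thus the sound levels in the three regions A to C show a different tendency for each of the three vowels "a", "i", and "u". Because uttered sound and mouth shape are closely related, reflecting these differences in sound level in the shape of the mouth and its surroundings makes it possible to reproduce a mouth and surrounding shape that matches the frequency distribution of the input speech.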
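[0032]
FIGS. 6 and 7 show the shape of the mouth and its surroundings when the vowel "a" is uttered. FIG. 6 shows the face from the front, and FIG. 7 shows it from the right side.
Similarly, FIGS. 8 and 9 show the shape of the mouth and its surroundings when the vowel "i" is uttered. FIG. 8 shows the face from the front, and FIG. 9 shows it from the right side.
[0033]
FIGS. 10 and 11 show the shape of the mouth and its surroundings when the vowel "u" is uttered. FIG. 10 shows the face from the front, and FIG. 11 shows it from the right side.
FIGS. 12 and 13 show the shape of the mouth and its surroundings in the normal state, in which nothing is uttered. FIG. 12 shows the face from the front, and FIG. 13 shows it from the right side.
[0034]
As FIGS. 6 to 11 show, observing the mouth and its surroundings for the three vowels "a", "i", and "u" reveals that 1) the position of the upper lip, 2) the position of the lower lip, and 3) the lateral position of the mouth corners change from vowel to vowel. Moreover, according to the frequency analysis results in FIGS. 3 to 5, the sound pressure levels of the three frequency bands A, B, and C show a different tendency for each vowel. If expressions can be defined that give the upper lip position, the lower lip position, and the lateral mouth corner position from these sound pressure levels, then for any sound these positions can be calculated from the sound pressure level of each band, and the mouth shape can be determined from them.
[0035]
Specifically, in this embodiment, with the sound pressure levels of the three frequency bands shown in FIGS. 3 to 5 denoted L1, L2, and L3 in order from the low-frequency side, the mouth shape parameter conversion unit 24 calculates the values of the five parameters Pa to Pe using the relational expressions below. Each of L1, L2, and L3 can be obtained by summing the sound pressure levels of the 64-division regions that fall within the corresponding frequency band; the average may be used instead of the sum.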
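[0036]
1) Parameter Pa, indicating the amount of upward change of the upper lip:
The parameter Pa, which indicates the amount of upward change of the upper lip position p1 shown in FIG. 12, can be calculated using the relational expression
Pa = A × L2 ... (1)
Here A is a suitable coefficient, set for example to A = 0.5.
[0037]
2) Parameter Pb, indicating the amount of forward change of the upper lip:
The parameter Pb, which indicates the amount of forward change of the upper lip position p1 shown in FIG. 13, can be calculated using the relational expression
Pb = B × L1 − C × L2 − D × L3 ... (2)
Here B, C, and D are suitable coefficients, set for example to B = 1.0, C = 1.0, D = 1.0.
[0038]
3) Parameter Pc, indicating the amount of downward change of the lower lip:
The parameter Pc, which indicates the amount of downward change of the lower lip position p2 shown in FIG. 13, can be calculated using the relational expression
Pc = E × L2 ... (3)
Here E is a suitable coefficient, set for example to E = 3.0.
[0039]
4) Parameter Pd, indicating the amount of forward change of the lower lip:
The parameter Pd, which indicates the amount of forward change of the lower lip position p3 shown in FIG. 12, can be calculated using the relational expression
Pd = F × L1 − G × L2 − H × L3 ... (4)
Here F, G, and H are suitable coefficients, set for example to F = 1.0, G = 1.0, H = 1.0.
[0040]
5) Parameter Pe, indicating the amount of lateral change of the mouth corners:
The parameter Pe, which indicates the amount of lateral change of the mouth corner position p4 shown in FIG. 12, can be calculated using the relational expression
Pe = I × L3 − J × L1 ... (5)
Here I and J are suitable coefficients, set for example to I = 3.0, J = 2.0.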
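Equations (1) to (5) are simple enough to evaluate directly. The following is a minimal Python sketch, not the patented implementation: the names `MouthCoefficients` and `mouth_parameters` are hypothetical, the default coefficients are the example values quoted in the text, and F, G, and H are assumed to be 1.0 on the reading that the example values given with equation (4) apply to those coefficients. The results are raw values with no limits applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouthCoefficients:
    """Example coefficient values quoted in the text (F, G, H assumed to be 1.0)."""
    A: float = 0.5
    B: float = 1.0
    C: float = 1.0
    D: float = 1.0
    E: float = 3.0
    F: float = 1.0
    G: float = 1.0
    H: float = 1.0
    I: float = 3.0
    J: float = 2.0


def mouth_parameters(l1, l2, l3, k=MouthCoefficients()):
    """Evaluate equations (1)-(5) for band levels L1, L2, L3 (raw, unclamped)."""
    return {
        "Pa": k.A * l2,                         # (1) upper lip, upward
        "Pb": k.B * l1 - k.C * l2 - k.D * l3,   # (2) upper lip, forward
        "Pc": k.E * l2,                         # (3) lower lip, downward
        "Pd": k.F * l1 - k.G * l2 - k.H * l3,   # (4) lower lip, forward
        "Pe": k.I * l3 - k.J * l1,              # (5) mouth corners, lateral
    }
```

For instance, `mouth_parameters(1.0, 2.0, 0.5)` gives a large Pc and negative Pb, Pd, and Pe, which shows why raw results may need to be limited before they are applied to the model.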
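[0041]
The positions p1 to p4 above were chosen for convenience, and their vertical positions and so on may be shifted somewhat around the mouth. Real mouth movement stays within a plausible range, but values calculated with equations (1) to (5) can fall outside it. For example, when the sound pressure level of the input speech is very high, the calculation may yield mouth shape parameter values corresponding to a mouth opening that could not occur in reality. It is therefore desirable to give each of the mouth shape parameters Pa to Pe an upper limit that keeps the mouth shape from becoming unnatural. When the calculation yields a value above the upper limit, the parameter is forced to the upper limit. In addition, because each mouth shape parameter is defined relative to the normal state with the mouth closed, negative values are undesirable. It is therefore desirable to give each of the parameters Pa to Pe a lower limit so that it does not become negative. When the calculation yields a negative value, the parameter is forced to the lower limit "0".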
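The limiting step described above amounts to clamping each parameter to a range. A minimal sketch, assuming the parameters are held in a dictionary keyed by name; the function name `clamp_parameters` and the per-parameter upper limits passed in are hypothetical, since the text gives no numeric limits other than the lower limit 0.

```python
def clamp_parameters(params, upper_limits, lower_limit=0.0):
    """Force each mouth shape parameter into [lower_limit, upper_limits[name]]."""
    return {
        name: min(max(value, lower_limit), upper_limits[name])
        for name, value in params.items()
    }
```

With illustrative limits of 1.0, a raw value of 5.0 becomes 1.0, a raw value of -1.5 becomes 0.0, and a value already inside the range is left unchanged.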
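[0042]
In this embodiment, the mouth shape is changed according to the content of the input speech, and the eyelids are opened and closed in a way that is not linked to this change in mouth shape. The eyelid parameter indicating the opening and closing timing is set from the sound pressure level of a predetermined frequency band.
[0043]
FIGS. 14 and 15 show the frequency band used to set the eyelid parameter. As shown in FIG. 14, when the sound pressure level of frequency region D exceeds a predetermined threshold, the eyelid parameter conversion unit 26 sets the eyelid parameter to indicate the timing for closing the eyelids. For example, the eyelid parameter M is set to 1. As shown in FIG. 15, when the sound pressure level of frequency region D is below the predetermined threshold, the eyelid parameter conversion unit 26 sets the eyelid parameter to indicate that the eyelids are open. For example, the eyelid parameter M is set to 0.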
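The threshold test for the eyelid parameter M can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, a level exactly equal to the threshold is treated as "open" because the text does not specify that case, and `blink_onsets` is an added helper that reports the frames where M changes from 0 to 1, which is one plausible way to trigger a close-and-reopen eyelid motion.

```python
def eyelid_parameter(level_d, threshold):
    """Return M = 1 (close the eyelids) if the region D level exceeds the threshold, else 0."""
    return 1 if level_d > threshold else 0


def blink_onsets(levels_d, threshold):
    """Indices of frames where M rises from 0 to 1 (assumed trigger for one blink)."""
    onsets = []
    previous = 0
    for frame, level in enumerate(levels_d):
        current = eyelid_parameter(level, threshold)
        if current == 1 and previous == 0:
            onsets.append(frame)
        previous = current
    return onsets
```

Only one comparison per frame is needed, which matches the text's point that no extra computation, such as random timing, is required.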
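[0044]
FIG. 16 shows the operating procedure by which the image processing apparatus 100 of this embodiment generates a head image. It shows, for example, the procedure for generating the head image for one screen.
When an analog speech signal is input, the A/D conversion unit 10 converts it into digital speech data at a predetermined sampling frequency and passes the data to the frequency analysis unit 22. The frequency analysis unit 22 performs predetermined frequency analysis processing by applying an FFT (fast Fourier transform) to the input speech data (step 100). Specifically, the frequency analysis unit 22 obtains the sound pressure level of each region obtained by dividing roughly the audible frequency band into 64, then sums the sound pressure levels of the regions within each of the three frequency bands A, B, and C shown in FIG. 3 and elsewhere to obtain the three sound pressure levels L1, L2, and L3 used to calculate the mouth shape parameters, and passes them to the mouth shape parameter conversion unit 24. The frequency analysis unit 22 also extracts, from the sound pressure levels of the 64 regions, only the sound pressure level of the specific frequency band D shown in FIGS. 14 and 15 and passes it to the eyelid parameter conversion unit 26.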
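The analysis in step 100 can be sketched in Python as below. Several things here are assumptions rather than details from the text: a naive DFT stands in for the FFT (it produces the same spectrum, only more slowly), the "sound pressure level" is taken as the linear spectral magnitude, the 64 regions are taken as equal-width groups of spectral bins, and the band index ranges passed to `region_level` are hypothetical because the text does not give the boundaries of regions A to D.

```python
import cmath
import math


def band_levels(samples, num_bands=64):
    """Magnitude spectrum of one frame, summed into num_bands equal-width bands."""
    n = len(samples)
    half = n // 2  # bins up to the Nyquist frequency
    if half < num_bands:
        raise ValueError("frame too short for the requested number of bands")
    magnitudes = []
    for k in range(half):
        # Naive DFT bin k; an FFT would return the same value.
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(acc) / n)
    bins_per_band = half // num_bands
    return [
        sum(magnitudes[b * bins_per_band:(b + 1) * bins_per_band])
        for b in range(num_bands)
    ]


def region_level(levels, first_band, last_band, average=False):
    """Sum (or average) the band levels from first_band to last_band inclusive."""
    selected = levels[first_band:last_band + 1]
    total = sum(selected)
    return total / len(selected) if average else total
```

L1, L2, and L3 would then be three calls to `region_level` with the band ranges chosen for regions A, B, and C, and the level for region D would be a fourth call.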
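[0045]
Next, the mouth shape parameter conversion unit 24 sets the values of the five parameters Pa, Pb, Pc, Pd, and Pe, which are used to determine the shape of the mouth and its surroundings, based on the three sound pressure levels L1, L2, and L3 received from the frequency analysis unit 22 (step 101). The calculation is as described above; each parameter value is obtained by a simple computation using equations (1) to (5).
[0046]
In parallel with, or before or after, this processing by the mouth shape parameter conversion unit 24, the eyelid parameter conversion unit 26 sets the value of the eyelid parameter based on the sound pressure level of frequency band D received from the frequency analysis unit 22 (step 102). As described above, the eyelid parameter can be set by the simple process of comparing the sound pressure level of frequency band D with a predetermined threshold.
[0047]
Next, the coordinate correction unit 28 corrects the vertex coordinates of the polygons constituting the three-dimensional object that models the head, based on the values of the mouth shape parameters Pa to Pe set by the mouth shape parameter conversion unit 24 and the value of the eyelid parameter M set by the eyelid parameter conversion unit 26 (step 103).
[0048]
In practice, the mouth shape parameters Pa to Pe give the amounts of change of the specific positions p1 to p4, but the vertex coordinates are corrected for the polygons within a predetermined range centered on each of these positions. Taking the upper lip position p1 shown in FIG. 12 as an example, the predetermined range containing p1 extends vertically from the upper lip to the base of the nose and horizontally over roughly the width of the lips, and the vertex coordinates of the polygons within this range are corrected based on the parameter Pa calculated with equation (1). The vertex coordinates within this range are not all changed by the same amount; the amount of correction decreases with distance from the upper lip position p1.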
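One way to realize a correction that fades with distance from p1 is a weight that is 1 at p1 and falls to 0 at the edge of the range. The sketch below is an assumption-heavy illustration: the text does not specify the falloff curve, so a linear (bilinear) falloff is used; the y axis is assumed to point up and x across the face; and `half_width` and `height` are hypothetical names for half the lip width and the distance from the upper lip to the base of the nose.

```python
def displace_upper_lip(vertices, p1, pa, half_width, height):
    """Raise vertices near p1 by up to pa, with a linear falloff to zero at the range edge."""
    moved = []
    for x, y, z in vertices:
        dx = abs(x - p1[0]) / half_width   # 0 at p1, 1 at the side edge of the range
        dy = (y - p1[1]) / height          # 0 at the lip, 1 at the base of the nose
        if dx <= 1.0 and 0.0 <= dy <= 1.0:
            weight = (1.0 - dx) * (1.0 - dy)
        else:
            weight = 0.0                   # outside the predetermined range
        moved.append((x, y + pa * weight, z))
    return moved
```

A smoother curve, such as a cosine falloff, could be substituted for the linear weight without changing the structure.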
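[0049]
For the lower lip position p2 shown in FIG. 13, the correction takes into account that the whole lower jaw rotates, and p2 is corrected using a predetermined position p5 as the center of rotation. The vertex coordinates of the polygons are therefore corrected so that the downward correction gradually decreases from the lower lip position p2 toward the rotation center p5.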
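Rotating the jaw vertices about p5 produces exactly this behavior, because a point's displacement under rotation shrinks as it approaches the pivot. The sketch below assumes a side-view coordinate system with y up and z pointing forward, rotates about an axis through p5 parallel to x, and derives the angle from Pc with a small-angle approximation. The function names and the approximation are illustrative assumptions, not details from the text.

```python
import math


def rotate_jaw(vertices, p5, angle):
    """Rotate each vertex about pivot p5 in the (y, z) plane; positive angles lower points in front of p5."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    rotated = []
    for x, y, z in vertices:
        dy, dz = y - p5[1], z - p5[2]
        new_y = p5[1] + dy * cos_a - dz * sin_a
        new_z = p5[2] + dy * sin_a + dz * cos_a
        rotated.append((x, new_y, new_z))
    return rotated


def jaw_angle(pc, p2, p5):
    """Small-angle estimate of the rotation that lowers p2 by roughly pc."""
    return pc / (p2[2] - p5[2])
```

In use, `rotate_jaw(jaw_vertices, p5, jaw_angle(pc, p2, p5))` would be applied only to the vertices that belong to the lower jaw.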
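[0050]
FIGS. 17 to 20 show how the vertex coordinates of the polygons are corrected. FIG. 17 shows the three-dimensional object in the normal state from the front, and FIG. 18 shows it in the normal state from the right side. FIG. 19 shows the three-dimensional object uttering the vowel "a" from the front, and FIG. 20 shows it uttering the vowel "a" from the right side. As these figures show, correcting the positions p1 to p4 corrects the vertex coordinates of the polygons around the mouth, which then take the positions corresponding to the mouth shape for the vowel "a".
[0051]
Next, the perspective projection conversion unit 32 applies perspective projection conversion, with respect to a predetermined viewpoint position, to the polygons whose vertex coordinates have been corrected as needed, and generates pseudo three-dimensional image data (step 104).
The texture mapping unit 34 then performs texture mapping, which associates textures with the polygons (step 105), and shading (step 106), and stores the result in the frame memory 36. The pseudo three-dimensional image data stored in the frame memory 36 is then read out display line by display line and displayed on the display device 110 (step 107).
[0052]
FIGS. 21 to 28 show specific examples of pseudo three-dimensional images displayed on the display device 110. FIG. 21 shows the normal state, in which nothing is uttered, with the viewpoint in front of the head. FIG. 22 shows the normal state with the viewpoint at the right side of the head. FIG. 23 shows the vowel "a" being uttered with the viewpoint in front of the head, and FIG. 24 shows the same with the viewpoint at the right side. FIG. 25 shows the vowel "i" being uttered with the viewpoint in front of the head, and FIG. 26 shows the same with the viewpoint at the right side. FIG. 27 shows the vowel "u" being uttered with the viewpoint in front of the head, and FIG. 28 shows the same with the viewpoint at the right side. As these figures show, the shape of the mouth and its surroundings can be reproduced according to the content of the input speech, and an image of a three-dimensional object with a natural expression can be generated.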
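[0053]
As described above, the image processing apparatus 100 of this embodiment detects the sound pressure level of each frequency component of the input speech and reflects it in the mouth shape, and can thereby form a natural mouth shape corresponding to the input speech. Moreover, the same input speech always produces the same mouth shape, so high reproducibility is achieved.
[0054]
In addition, the mouth shape can be set by frequency analysis of the input speech alone, so the processing load is lighter than in the conventional method, which had to extract phonemes by speech recognition. Furthermore, no dictionary for speech recognition is needed, and mouth shape patterns for each phoneme do not have to be registered, so the storage capacity of memory and the like can be reduced.
[0055]
In particular, as equations (1) to (5) show, defining in advance the relational expressions between the mouth shape parameters Pa to Pe and the sound pressure levels L1, L2, and L3 allows the mouth shape to be set simply by evaluating these expressions, which greatly reduces the processing load.
[0056]
The opening and closing of the eyelids is also reproduced in parallel with the setting of the mouth shape, so a realistic image of a three-dimensional object with a natural expression can be generated. In particular, because the eyelid movement is set so that it is not linked to the mouth movement, the unnatural expression that would result from linking them is avoided. In addition, because the eyelid timing is set according to the sound pressure level of a specific frequency component of the input speech, no other computation is needed, such as that required to move the eyelids at random timings, and the processing can be simplified.
[0057]
The present invention is not limited to the embodiment described above, and various modifications are possible within the scope of the gist of the invention. The embodiment describes simply displaying a pseudo three-dimensional image of the three-dimensional object, but many specific applications are conceivable. For example, the invention can be used to add motion when creating CG (computer graphics) facial animation. It can also be applied when the mouth movement of a game character is displayed from speech without holding facial motion data for that character. It can likewise be applied to a videophone that displays a person's expression by CG, to move the mouth of the on-screen person based on the speech of the other party.
[0058]
In the embodiment described above, the values of the mouth shape parameters Pa to Pe are calculated with equations (1) to (5), but expressions other than equations (1) to (5) may be adopted for the parameters. For example, in the embodiment the expressions were derived by observing the mouth shape while the three vowels "a", "i", and "u" were uttered, but they may also be derived by observing the mouth shape while the other vowels "e" and "o", or consonants, are uttered. Also, the mouth shape parameters were calculated from the sound pressure levels L1, L2, and L3 of the three frequency bands A, B, and C shown in FIG. 4, but they may be calculated from the sound pressure levels of two, or of four or more, frequency bands.
[0059]
The expressions in equations (1) to (5) were determined by comparing the mouth shape observed while the three vowels were actually uttered with the corresponding frequency analysis results, but the relationship between the mouth shape parameters Pa to Pe and the sound pressure levels L1 to L3 may instead be obtained by solving simultaneous equations. For example, if each parameter is assumed to be a linear combination of the three sound pressure levels L1 to L3, each parameter has three unknown coefficients, so the values of 15 coefficients in total need to be determined.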
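For one parameter, measurements for three vowels give three linear equations in three unknown coefficients, which can be solved directly; repeating this for the five parameters gives all 15 coefficients. The sketch below uses Cramer's rule in plain Python. The function names are hypothetical, and any numbers supplied to it are the caller's own measurements or illustrative values, since the text gives none.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def solve_coefficients(levels, targets):
    """Solve targets[v] = c1*L1 + c2*L2 + c3*L3 for (c1, c2, c3).

    levels  : three rows, one per vowel, each (L1, L2, L3)
    targets : desired parameter value for each of the three vowels
    """
    d = det3(levels)
    if abs(d) < 1e-12:
        raise ValueError("the three measurements are not linearly independent")
    coefficients = []
    for col in range(3):
        m = [list(row) for row in levels]
        for r in range(3):
            m[r][col] = targets[r]  # Cramer's rule: replace one column with the targets
        coefficients.append(det3(m) / d)
    return coefficients
```

With more than three measured sounds the system becomes overdetermined, and a least-squares fit would be the natural replacement for the exact solve.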
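[0060]
In the embodiment described above, five mouth shape parameters are used to determine the mouth shape, but a number other than five may be used. For example, more than five mouth shape parameters may be used to determine the detailed shape around the mouth.
[0061]
The functions of the components of the image processing apparatus 100 can be implemented in hardware, but so that part of the specification can be changed easily, they may instead be implemented by a CPU reading and executing a program stored in an information storage medium such as a memory. In that case, all components other than the A/D conversion unit 10 and the frame memory 36 are replaced by a computer comprising a CPU and memory. The information storage medium that provides the program may be a semiconductor memory such as a ROM or RAM, an optical disc such as a CD or DVD, a hard disk device, or the like.
[0062]
In the embodiment described above, five parameters Pa, Pb, Pc, Pd, and Pe are defined from the frequency analysis result of the input speech, and the mouth shape is determined by directly calculating from them the amount of upward change of the upper lip, the amount of forward change of the upper lip, the amount of downward change of the lower lip, the amount of forward change of the lower lip, and the amount of lateral change of the mouth corners. However, a different method may be adopted for determining the mouth shape, that is, for reflecting the frequency analysis result of the input speech in the mouth shape. Specifically, the mouth shape corresponding to the input speech may be set by the following method.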
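[0063]
Because the mouth shape parameter Pa indicates the amount of upward change of the upper lip, a mouth shape opened wide vertically (for example, the mouth shape shown in FIGS. 6 and 7) is associated with a large value of this parameter. For example, if Pa' denotes the normalized form of the parameter Pa, the mouth shape shown in FIGS. 6 and 7 is set when Pa' is 1.
[0064]
Similarly, because the mouth shape parameter Pb indicates the amount of forward change of the upper lip, a mouth shape pushed far forward (for example, the mouth shape shown in FIGS. 10 and 11) is associated with a large value of this parameter. For example, if Pb' denotes the normalized form of the parameter Pb, the mouth shape shown in FIGS. 10 and 11 is set when Pb' is 1.
[0065]
Because the mouth shape parameter Pe indicates the amount of lateral change of the mouth corners, a mouth shape opened wide horizontally (for example, the mouth shape shown in FIGS. 8 and 9) is associated with a large value of this parameter. For example, if Pe' denotes the normalized form of the parameter Pe, the mouth shape shown in FIGS. 8 and 9 is set when Pe' is 1.
[0066]
With a representative mouth shape associated with each of the three parameters Pa', Pb', and Pe' in this way, when the values of these three parameters are calculated for the input speech, the mouth shape is set by combining the images with each parameter value taken into account.
[0067]
For example, let Q be one vertex of an arbitrary polygon around the mouth. Define Xa as the vector that starts at the vertex Q in the closed-mouth state shown in FIGS. 12 and 13 and ends at the vertex Q in the state with the mouth opened wide vertically shown in FIGS. 6 and 7. Define Xb as the vector that starts at the vertex Q in the closed-mouth state and ends at the vertex Q in the state with the mouth pushed far forward shown in FIGS. 10 and 11. Similarly, define Xe as the vector that starts at the vertex Q in the closed-mouth state and ends at the vertex Q in the state with the mouth opened wide horizontally shown in FIGS. 8 and 9.
[0068]
Using the three vectors Xa, Xb, and Xe, the coordinates of the vertex Q after the mouth shape has been changed to match the input speech can be calculated with the expression Pa'·Xa + Pb'·Xb + Pe'·Xe. Performing this calculation for the vertices of all polygons needed to change the mouth shape sets the mouth shape corresponding to the input speech.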
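This alternative is a weighted blend of three representative shapes. The sketch below makes one assumption explicit: because Xa, Xb, and Xe are defined as vectors starting at the closed-mouth position of Q, the weighted sum is treated as a displacement and added to that closed-mouth position. The function name and argument names are hypothetical.

```python
def blend_vertex(q_closed, xa, xb, xe, pa_n, pb_n, pe_n):
    """Closed-mouth position of Q plus Pa'*Xa + Pb'*Xb + Pe'*Xe (weights are normalized parameters)."""
    return tuple(
        q + pa_n * a + pb_n * b + pe_n * e
        for q, a, b, e in zip(q_closed, xa, xb, xe)
    )
```

With all three weights at 0 the vertex stays at its closed-mouth position, and with a single weight at 1 it reaches the corresponding representative shape.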
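[0069]
[Effects of the Invention]
As described above, according to the present invention, detecting the sound pressure level of each frequency component of the input speech and reflecting it in the mouth shape yields a more natural expression. Moreover, the same input speech always produces the same mouth shape, so high reproducibility is achieved. In addition, the mouth shape can be set by frequency analysis of the input speech alone, so the processing load is reduced. Furthermore, no dictionary for speech recognition is needed, and mouth shape patterns for each phoneme do not have to be registered, so the storage capacity of memory and the like can be reduced.
[0070]
Because the eyelid opening and closing timing is tied to the sound pressure level of a specific frequency component of the input speech, a natural expression close to a person's actual expression can be generated.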
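[Brief Description of the Drawings]
[FIG. 1] A diagram showing the configuration of the image processing apparatus of one embodiment.
[FIG. 2] A diagram showing the detailed configuration of the texture mapping unit.
[FIG. 3] A diagram showing a frequency analysis result for the human voice.
[FIG. 4] A diagram showing a frequency analysis result for the human voice.
[FIG. 5] A diagram showing a frequency analysis result for the human voice.
[FIG. 6] A diagram showing the shape of the mouth and its surroundings when the vowel "a" is uttered.
[FIG. 7] A diagram showing the shape of the mouth and its surroundings when the vowel "a" is uttered.
[FIG. 8] A diagram showing the shape of the mouth and its surroundings when the vowel "i" is uttered.
[FIG. 9] A diagram showing the shape of the mouth and its surroundings when the vowel "i" is uttered.
[FIG. 10] A diagram showing the shape of the mouth and its surroundings when the vowel "u" is uttered.
[FIG. 11] A diagram showing the shape of the mouth and its surroundings when the vowel "u" is uttered.
[FIG. 12] A diagram showing the shape of the mouth and its surroundings in the normal state.
[FIG. 13] A diagram showing the shape of the mouth and its surroundings in the normal state.
[FIG. 14] A diagram showing the frequency band used to set the eyelid parameter.
[FIG. 15] A diagram showing the frequency band used to set the eyelid parameter.
[FIG. 16] A diagram showing the operating procedure by which the image processing apparatus of this embodiment generates a head image.
[FIG. 17] A diagram showing how the vertex coordinates of the polygons are corrected.
[FIG. 18] A diagram showing how the vertex coordinates of the polygons are corrected.
[FIG. 19] A diagram showing how the vertex coordinates of the polygons are corrected.
[FIG. 20] A diagram showing how the vertex coordinates of the polygons are corrected.
[FIG. 21] A diagram showing a specific example of a displayed pseudo three-dimensional image.
[FIG. 22] A diagram showing a specific example of a displayed pseudo three-dimensional image.
[FIG. 23] A diagram showing a specific example of a displayed pseudo three-dimensional image.
[FIG. 24] A diagram showing a specific example of a displayed pseudo three-dimensional image.
[FIG. 25] A diagram showing a specific example of a displayed pseudo three-dimensional image.
[FIG. 26] A diagram showing a specific example of a displayed pseudo three-dimensional image.
[FIG. 27] A diagram showing a specific example of a displayed pseudo three-dimensional image.
[FIG. 28] A diagram showing a specific example of a displayed pseudo three-dimensional image.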
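[Explanation of Reference Numerals]
10 Analog-to-digital (A/D) conversion unit
20 Object control unit
22 Frequency analysis unit
24 Mouth shape parameter conversion unit
26 Eyelid parameter conversion unit
28 Coordinate correction unit
30 Image generation unit
32 Perspective projection conversion unit
34 Texture mapping unit
36 Frame memory
100 Image processing apparatus
110 Display device
[0001]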
[Technical Field of the Invention]
The present invention relates to an image processing apparatus, an image processing method, and a computer-readable information storage medium that generate an image of an object whose mouth movement or the like changes in response to speech.
[0002]
[Prior art]
In fields such as animation and the dubbing of live-action video, mouth movements must be drawn to match the characters' speech in order to express their facial expressions realistically. In the case of animation, for example, the creator of the original drawings may determine the shape of the mouth and its surroundings by hand. While listening to the character's speech in a scene, or reading the dialogue prepared for the scene, the creator imagines the movement of the character's mouth and draws the mouth shape, so a certain amount of experience and intuition is required to reproduce a natural mouth shape.
[0003]
In addition, when the speech of a live-action video is dubbed, for example from English into Japanese, the mouth shape image synthesizing device disclosed in Japanese Patent No. 2795084 is known as a device that deforms the area around the mouth of the person in the video to match the Japanese speech. In this device, the content of the dubbed speech is analyzed to extract phoneme and duration information, and the shape of the mouth and its surroundings is determined based on this information.
[0004]
[Problems to be solved by the invention]
The manual method of determining the shape of the character's mouth and its surroundings depends on the skill level and ability of each worker, such as experience and intuition. Work efficiency is therefore poor, and because different workers produce different results, reproducibility is low and it is difficult to generate a natural expression.
[0005]
In addition, the method of determining the shape of the mouth and its surroundings from phoneme and duration information requires extracting that information from the speech content as preprocessing, and also requires registering a different mouth shape pattern for each phoneme in advance. The processing load is therefore heavy and the data capacity needed for the processing is large. For example, the phonemes contained in speech can be extracted by speech recognition, but this requires registering a speech pattern for each phoneme in advance, comparing the pattern obtained by analyzing the actual speech with each registered pattern, and extracting the phoneme corresponding to the registered pattern with the highest similarity. A registered pattern is therefore required for each phoneme to be extracted, and searching for the registered pattern with the highest similarity imposes a heavy processing load.
[0006]
In addition, when an image of the entire head is generated, obtaining a natural expression requires not only the mouth shape described above but also opening and closing the eyelids at appropriate timings that do not look unnatural. A method that has a light processing load and produces a natural expression is therefore desired.
[0007]
The present invention has been created in view of these points, and its purpose is to provide an image processing apparatus, an image processing method, and a computer-readable information storage medium that obtain a natural expression, have a light processing load, and reduce the required storage capacity.
[0008]
[Means for Solving the Problems]
In order to solve the above-described problems, the image processing apparatus of the present invention includes frequency analysis means, mouth shape setting means, and coordinate correction means. The frequency analysis means performs frequency analysis on the input speech and detects a sound pressure level for each of a plurality of frequency components. The mouth shape setting means sets the mouth shape based on the sound pressure level of each of the plurality of frequency components detected by the frequency analysis means. The coordinate correction means corrects the coordinates of a plurality of polygons constituting the three-dimensional object of the head so that the mouth shape set by the mouth shape setting means is obtained. Further, the mouth shape setting means sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate the values of the upper lip position, the lower lip position, and the mouth corner position as mouth shape parameters. Since uttered speech and mouth shape are closely related, a more natural expression can be obtained by detecting the sound pressure level of each frequency component of the input speech and reflecting it in the mouth shape. Moreover, since the same mouth shape is always obtained for the same input speech, high reproducibility can be realized. In addition, since the mouth shape can be set only by performing frequency analysis on the input speech, the processing load is reduced compared with the conventional method in which phonemes have to be extracted by speech recognition processing. Furthermore, since a dictionary for the speech recognition process is unnecessary and it is not necessary to register a mouth shape pattern for each phoneme, the storage capacity of a memory or the like can be reduced.
[0009]
In addition, it is desirable that texture mapping means give texture information, in which a pattern is set in advance, to each polygon constituting the three-dimensional object. Since the mouth shape corresponding to the input speech has been set and the vertex coordinates of each polygon have been corrected, performing texture mapping on these polygons easily yields an image of the three-dimensional object in which the shape of the mouth and its surroundings has changed.
[0010]
Further, the mouth shape setting means described above sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate the mouth shape parameter values needed to specify the shape of the mouth and its surroundings. By defining a relational expression between the mouth shape parameters and each sound pressure level in advance, the mouth shape can be set simply by calculating the mouth shape parameter values with this expression, and the processing load can be greatly reduced.
[0011]
In particular, the mouth shape parameters described above include parameters corresponding to the position of the upper lip, the position of the lower lip, and the position of the mouth corners. By setting each of these positions, the mouth shape can be specified.
In addition, it is desirable to set an upper limit value or a lower limit value for each of the upper lip position, the lower lip position, and the mouth corner position, and to set the parameter value to that limit when the result calculated from the sound pressure level values falls outside it. Considering the actual movement of the mouth and the structure around it, the movement of the upper lip, the lower lip, and the mouth corners is limited. Moving the upper lip, the lower lip, and the mouth corners only within a predetermined range therefore prevents the mouth shape from becoming unnatural. For example, when speech with an extremely high sound pressure level in each frequency component is input, a mouth shape that could not occur in reality is prevented.
[0012]
Further, it is desirable that eyelid parameter setting means set an eyelid parameter, indicating the timing for moving the eyelids, based on the sound pressure level detected by the frequency analysis means, and that the coordinate correction means described above correct the coordinates of the polygons related to the opening and closing of the eyelids based on the eyelid parameter, in addition to correcting the coordinates of the polygons related to the mouth shape. On an actual human head, the eyelids close and open along with the movement of the mouth when some speech is uttered, so taking the opening and closing of the eyelids into account when generating a head image gives a more natural expression.
[0013]
Alternatively, the correction of the polygon coordinates in accordance with the eyelid opening and closing timing given by the eyelid parameter may be performed separately from the processing related to the mouth shape. In this case, the image processing apparatus of the present invention includes specific frequency component analysis means, eyelid parameter setting means, and coordinate correction means. The specific frequency component analysis means detects the sound pressure level of a specific frequency component contained in the input speech. The eyelid parameter setting means sets an eyelid parameter, indicating the timing for moving the eyelids, based on the sound pressure level of the specific frequency component detected by the specific frequency component analysis means. The coordinate correction means corrects the coordinates of a plurality of polygons constituting the three-dimensional object of the head so that the eyelids open and close based on the eyelid parameter set by the eyelid parameter setting means. Since the eyelid opening and closing timing is set in conjunction with the sound pressure level of the specific frequency component contained in the input speech, a natural expression close to a person's actual expression can be generated.
[0014]
In particular, the frequency component whose sound pressure level is used for setting the eyelid parameter is desirably one that is not linked to the movement of the mouth. This prevents the movement of the eyelids and the movement of the mouth from always being linked, which would look unnatural. In addition, since the eyelid parameter is set based on the sound pressure level of a specific frequency component and this determines the timing for moving the eyelids, other calculations, such as those needed to move the eyelids at random timings, are unnecessary, and the processing can be simplified.
[0015]
Further, the image processing method of the present invention is an image processing method in an image processing apparatus comprising frequency analysis means, mouth shape setting means, and coordinate correction means. The method has a first step in which the frequency analysis means performs frequency analysis on the input speech to detect a sound pressure level for each of a plurality of frequency components; a second step in which the mouth shape setting means sets a mouth shape based on the sound pressure level of each frequency component detected by the frequency analysis means in the first step; and a third step in which the coordinate correction means corrects the coordinates of a plurality of polygons constituting the three-dimensional object of the head so that the mouth shape set by the mouth shape setting means in the second step is obtained. The mouth shape setting means sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate the values of the upper lip position, the lower lip position, and the mouth corner position as mouth shape parameters. The image processing apparatus may further include eyelid parameter setting means, and the method may have a fourth step in which the eyelid parameter setting means sets an eyelid parameter, indicating the timing for moving the eyelids, based on the sound pressure level detected by the frequency analysis means in the first step. In the third step, the coordinate correction means may then correct the coordinates of the polygons related to the mouth shape and also correct the coordinates of the polygons related to the opening and closing of the eyelids based on the eyelid parameter set in the fourth step.
[0016]
The information storage medium of the present invention stores a program for causing a computer to function as frequency analysis means that performs frequency analysis on the input speech and detects a sound pressure level for each of a plurality of frequency components; mouth shape setting means that sets the mouth shape based on the sound pressure level of each of the plurality of frequency components detected by the frequency analysis means; and coordinate correction means that corrects the coordinates of a plurality of polygons constituting the three-dimensional object of the head so that the mouth shape set by the mouth shape setting means is obtained. The mouth shape setting means sets the mouth shape by using the sound pressure level values of the plurality of frequency components to calculate the values of the upper lip position, the lower lip position, and the mouth corner position as mouth shape parameters.
[0017]
By carrying out the image processing method of the present invention, or by executing the program stored in the information storage medium of the present invention, a natural mouth shape reflecting the content of the input speech can be formed. In particular, since the same mouth shape is always obtained for the same input speech, an image with a natural expression can be generated with high reproducibility. In addition, since the mouth shape can be set only by performing frequency analysis on the input speech, the processing load is reduced compared with the conventional method in which phonemes have to be extracted by speech recognition processing. Furthermore, since a dictionary for the speech recognition process is unnecessary and it is not necessary to register a mouth shape pattern for each phoneme, the storage capacity of a memory or the like can be reduced.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment to which the present invention is applied will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing the configuration of an image processing apparatus according to an embodiment to which the present invention is applied. The image processing apparatus 100 illustrated in FIG. 1 includes an analog-to-digital (A/D) conversion unit 10, an object control unit 20, and an image generation unit 30.
[0019]
The A/D conversion unit 10 receives an analog audio signal, samples the audio signal at a predetermined sampling frequency, and outputs digital audio data at predetermined time intervals.
The object control unit 20 performs processing for calculating the shape of a three-dimensional object that simulates the human head. For this purpose, the object control unit 20 includes a frequency analysis unit 22, a mouth shape parameter conversion unit 24, an eyelid parameter conversion unit 26, and a coordinate correction unit 28.
[0020]
The frequency analysis unit 22 performs frequency analysis of input voice based on voice data input at predetermined time intervals. For example, the frequency analysis unit 22 divides the audible frequency band into 64, and detects the sound pressure level of the input sound for each divided region.
The mouth shape parameter conversion unit 24 sets the value of the mouth shape parameter necessary for determining the shape of the mouth and its surroundings included in the head based on the frequency analysis result by the frequency analysis unit 22. For example, in the present embodiment, the following five mouth shape parameters Pa to Pe are used.
[0021]
1) Parameter Pa indicating the amount of change in the upward direction of the upper lip
2) Parameter Pb indicating the amount of change in the forward direction of the upper lip
3) Parameter Pc indicating the amount of change in the downward direction of the lower lip
4) Parameter Pd indicating the amount of change in the forward direction of the lower lip
5) Parameter Pe indicating the amount of change in the lateral direction of the mouth corner
The eyelid parameter conversion unit 26 sets the value of the eyelid parameter indicating the movement of the eyelid included in the head, that is, the timing at which the eyelid is closed, based on the frequency analysis result by the frequency analysis unit 22. Since the facial expression becomes unnatural when the movement of the eyelid is linked to the movement of the mouth, in this embodiment the value of the eyelid parameter is set based on the sound pressure level of a frequency component that does not appear to be directly related to the movement of the mouth, and the timing at which the eyelid is closed is determined accordingly.
[0022]
In order to reflect the mouth shape determined by the mouth shape parameter conversion unit 24 and the eyelid movement determined by the eyelid parameter conversion unit 26, the coordinate correction unit 28 changes, as necessary, the vertex coordinates of the polygons constituting the three-dimensional object simulating the head. For example, the state in which no voice is input and the eyelid is open is taken as the “normal” state. For a predetermined range covering the mouth and its surroundings, the amount of change in the vertex coordinates of each polygon relative to the normal state is calculated based on the values of the five mouth shape parameters Pa to Pe described above. Further, when the eyelid parameter conversion unit 26 indicates the timing at which the eyelid moves, the amount of change in the vertex coordinates of each polygon necessary for reproducing the series of eyelid motions, from the normal state until the eyelid is closed and reopened, is calculated.
[0023]
In addition, the image generation unit 30 performs processing necessary for actually displaying the image of the three-dimensional object on which the vertex coordinates of each polygon are calculated by the object control unit 20 on the display device 110. For this purpose, the image generation unit 30 includes a perspective projection conversion unit 32, a texture mapping unit 34, and a frame memory 36.
[0024]
The perspective projection conversion unit 32 performs perspective projection conversion based on a predetermined viewpoint position set in advance. Thereby, a two-dimensional image (pseudo three-dimensional image) obtained by viewing a three-dimensional object arranged in a virtual three-dimensional space from a predetermined viewpoint position is obtained. The texture mapping unit 34 performs a texture mapping process and a shading process for associating a texture image with each polygon included in the two-dimensional image obtained by the perspective projection conversion unit 32. A specific example of the configuration of the texture mapping unit 34 will be described later. After the texture mapping process and the shading process are performed, the two-dimensional image is stored in the frame memory 36, read out in a predetermined order, and displayed on the screen of the display device 110.
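The perspective projection conversion described above can be illustrated with a minimal sketch. It assumes a viewpoint that looks along the +z axis without rotation and a screen plane at a fixed distance in front of it; the function name `perspective_project` and the parameter `screen_distance` are hypothetical and do not appear in the embodiment.

```python
def perspective_project(vertex, viewpoint, screen_distance):
    """Project a 3D vertex onto a 2D screen plane in front of the viewpoint.

    Assumption: the viewpoint looks along +z with no rotation, and the
    screen plane lies screen_distance away from the viewpoint.
    """
    # Express the vertex relative to the viewpoint position.
    x, y, z = (v - e for v, e in zip(vertex, viewpoint))
    if z <= 0:
        raise ValueError("vertex must lie in front of the viewpoint")
    # Similar triangles: points farther away are drawn closer to the center.
    scale = screen_distance / z
    return (x * scale, y * scale)


# Example: a vertex of the head object seen from a viewpoint at the origin.
screen_point = perspective_project((1.0, 2.0, 4.0), (0.0, 0.0, 0.0), 2.0)
```

Applying this function to every corrected polygon vertex would yield the two-dimensional (pseudo three-dimensional) image onto which textures are subsequently mapped.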
[0025]
FIG. 2 is a diagram illustrating a detailed configuration of the texture mapping unit 34. As shown in FIG. 2, the texture mapping unit 34 includes a mapping coordinate calculation unit 50, a shape memory 52, a texture adding unit 54, a texture memory 56, and a shading correction unit 58.
[0026]
The mapping coordinate calculation unit 50 calculates the correspondence between each polygon constituting the three-dimensional object and its texture. The shape memory 52 stores information regarding each polygon. The texture adding unit 54 assigns a texture to each polygon. The texture memory 56 stores information regarding each texture. The shading correction unit 58 corrects the brightness of the texture image assigned to each polygon according to the position of the light source and the environmental conditions surrounding the three-dimensional object. For example, when a shadow is produced by an uneven feature such as the nose or lips, that shadow is reproduced by this shading correction.
[0027]
The frequency analysis unit 22 corresponds to the frequency analysis means and the specific frequency component analysis means, the mouth shape parameter conversion unit 24 to the mouth shape setting means, the eyelid parameter conversion unit 26 to the eyelid parameter setting means, the coordinate correction unit 28 to the coordinate correction means, and the texture mapping unit 34 to the texture mapping means.
[0028]
The image processing apparatus 100 of the present embodiment has such a configuration, and the operation thereof will be described next.
FIGS. 3 to 5 are diagrams showing frequency analysis results of human voice. FIG. 3 shows the frequency analysis result when the vowel “a” is uttered, FIG. 4 shows the result when the vowel “i” is uttered, and FIG. 5 shows the result when the vowel “u” is uttered. In these drawings, the horizontal axis corresponds to frequency, and the audio level is shown for each frequency band obtained by dividing the audible band into 64.
[0029]
As shown in FIGS. 3 to 5, the frequency distributions obtained when uttering the three different vowels “a”, “i”, and “u” contain three peaks that change in a characteristic way. The frequency bands in which these three peaks exist are defined as regions A, B, and C in order from the low frequency side.
[0030]
For example, the frequency distribution of the vowel “a” shown in FIG. 3 shows that the sound levels corresponding to regions A and B are large and the sound level corresponding to region C is small. The frequency distribution of the vowel “i” shown in FIG. 4 shows that the sound levels corresponding to regions A and C are large and the sound level corresponding to region B is small. Furthermore, the frequency distribution of the vowel “u” shown in FIG. 5 shows that the sound level corresponding to region A is very large and the sound levels corresponding to regions B and C are small.
[0031]
As described above, the sound levels in the three regions A to C show different tendencies for each of the three vowels “a”, “i”, and “u”. Since the uttered sound and the mouth shape are closely related, if these differences in sound level tendency can be reflected in the shape of the mouth and its surroundings, that shape can be reproduced according to the frequency distribution of the input speech.
[0032]
FIGS. 6 and 7 are diagrams showing the shape of the mouth and its surroundings when the vowel “a” is uttered. FIG. 6 shows a person's face viewed from the front, and FIG. 7 shows the person's face viewed from the right side.
Similarly, FIGS. 8 and 9 are diagrams showing the shape of the mouth and its surroundings when the vowel “i” is uttered. FIG. 8 shows a person's face viewed from the front, and FIG. 9 shows the person's face viewed from the right side.
[0033]
FIGS. 10 and 11 are diagrams showing the shape of the mouth and its surroundings when the vowel “u” is uttered. FIG. 10 shows a person's face viewed from the front, and FIG. 11 shows the person's face viewed from the right side.
FIGS. 12 and 13 are diagrams showing the shape of the mouth and its surroundings in the normal state, in which nothing is uttered. FIG. 12 shows a person's face viewed from the front, and FIG. 13 shows the person's face viewed from the right side.
[0034]
As shown in FIGS. 6 to 11, observing the shape of the mouth and its surroundings for each of the three vowels “a”, “i”, and “u” shows that, when the uttered vowel differs, 1) the position of the upper lip, 2) the position of the lower lip, and 3) the lateral position of the mouth corner change. Moreover, according to the frequency analysis results shown in FIGS. 3 to 5, the sound pressure levels of the three frequency bands A, B, and C show a different tendency for each vowel. If this tendency is used to define formulas for calculating the position of the upper lip, the position of the lower lip, and the lateral position of the mouth corner, these positions can be calculated from the sound pressure level of each frequency band for an arbitrary sound, and the mouth shape can thereby be determined.
[0035]
Specifically, in this embodiment, when the sound pressure levels of the three frequency bands shown in FIGS. 3 to 5 are denoted L1, L2, and L3 in order from the low frequency side, the mouth shape parameter conversion unit 24 calculates the values of the five parameters Pa to Pe described above using the following relational expressions. The sound pressure levels L1, L2, and L3 can be obtained by accumulating the sound pressure level values of those of the 64 divided regions that are included in each frequency band, but an average value may be used instead of the accumulated value.
[0036]
1) Parameter Pa indicating the amount of change in the upward direction of the upper lip:
The parameter Pa, which indicates the upward change amount of the upper lip position p1 shown in the drawings, can be calculated using the following relational expression.
Pa = A × L2 (1)
Here, A is an appropriate coefficient, for example, A = 0.5.
[0037]
2) Parameter Pb indicating the amount of change in the forward direction of the upper lip:
The parameter Pb, which indicates the forward change amount of the upper lip position p1 shown in the drawings, can be calculated using the following relational expression.
Pb = B × L1 − C × L2 − D × L3 (2)
Here, B, C, and D are appropriate coefficients, for example, B = 1.0, C = 1.0, and D = 1.0.
[0038]
3) Parameter Pc indicating the amount of change in the downward direction of the lower lip:
The parameter Pc, which indicates the downward change amount of the lower lip position p2 shown in the drawings, can be calculated using the following relational expression.
Pc = E × L2 (3)
Here, E is an appropriate coefficient, for example, E = 3.0.
[0039]
4) Parameter Pd indicating the amount of change in the forward direction of the lower lip:
The parameter Pd, which indicates the forward change amount of the lower lip position p3 shown in the drawings, can be calculated using the following relational expression.
Pd = F × L1 − G × L2 − H × L3 (4)
Here, F, G, and H are appropriate coefficients, for example, F = 1.0, G = 1.0, and H = 1.0.
[0040]
5) Parameter Pe indicating the amount of change in the lateral direction of the mouth corner:
The parameter Pe, which indicates the amount of change in the lateral direction of the mouth corner position p4 shown in the drawings, can be calculated using the following relational expression.
Pe = I × L3 − J × L1 (5)
Here, I and J are appropriate coefficients, for example, I = 3.0 and J = 2.0.
[0041]
Note that each of the positions p1 to p4 described above is set for convenience, and the vertical position and the like may be shifted slightly around the mouth. In addition, although the actual movement of the mouth stays within a physically plausible range, values calculated using equations (1) to (5) can fall outside this range. For example, when the sound pressure level of the input voice is very high, the calculation may yield a mouth shape parameter value corresponding to a mouth size that cannot actually occur. Therefore, it is desirable to set an upper limit value for each of the mouth shape parameters Pa to Pe so that the mouth shape does not become unnatural. When a value exceeding the upper limit value is obtained as a calculation result, the value of the mouth shape parameter is forcibly set to the upper limit value. Moreover, since the value of each mouth shape parameter is defined relative to the normal state with the mouth closed, it is not preferable for any value to be negative. Therefore, it is desirable to provide a lower limit value so that none of the mouth shape parameters Pa to Pe becomes negative. If a negative value is obtained as a calculation result, the value of the mouth shape parameter is forcibly set to the lower limit “0”.
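Equations (1) to (5) and the limiting of each parameter can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the coefficient values follow the examples given in the text (with F, G, and H assumed to be 1.0), while the upper limit values and the names `COEFF`, `UPPER_LIMITS`, `clamp`, and `mouth_shape_parameters` are hypothetical.

```python
# Example coefficients A to J from the embodiment
# (F, G, and H are assumed to be 1.0 here).
COEFF = {"A": 0.5, "B": 1.0, "C": 1.0, "D": 1.0, "E": 3.0,
         "F": 1.0, "G": 1.0, "H": 1.0, "I": 3.0, "J": 2.0}

# Hypothetical upper limits; the text does not give concrete values.
UPPER_LIMITS = {"Pa": 1.0, "Pb": 1.0, "Pc": 1.0, "Pd": 1.0, "Pe": 1.0}


def clamp(value, upper):
    """Limit value to the range [0, upper]."""
    return max(0.0, min(value, upper))


def mouth_shape_parameters(l1, l2, l3):
    """Compute Pa to Pe from the band levels L1, L2, L3 and clamp them."""
    c = COEFF
    raw = {
        "Pa": c["A"] * l2,                                  # equation (1)
        "Pb": c["B"] * l1 - c["C"] * l2 - c["D"] * l3,      # equation (2)
        "Pc": c["E"] * l2,                                  # equation (3)
        "Pd": c["F"] * l1 - c["G"] * l2 - c["H"] * l3,      # equation (4)
        "Pe": c["I"] * l3 - c["J"] * l1,                    # equation (5)
    }
    return {name: clamp(value, UPPER_LIMITS[name])
            for name, value in raw.items()}


# Example: a sound whose energy is concentrated in the lowest band.
params = mouth_shape_parameters(0.2, 0.1, 0.0)
```

With these inputs the raw value of Pe is negative, so it is replaced by the lower limit 0, which is the behavior the paragraph above calls for.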
[0042]
In this embodiment, in addition to changing the mouth shape according to the content of the input voice, the eyelid is opened and closed in a manner that is not interlocked with the change of the mouth shape, and the eyelid parameter indicating the opening and closing timing is set based on the sound pressure level of a predetermined frequency band.
[0043]
FIGS. 14 and 15 are diagrams showing the frequency band used for setting the eyelid parameter. As shown in FIG. 14, the eyelid parameter conversion unit 26 sets an eyelid parameter indicating the timing for closing the eyelid when the sound pressure level in the frequency band D exceeds a predetermined threshold. For example, the value of the eyelid parameter M is set to 1 at this time. Further, as shown in FIG. 15, when the sound pressure level in the frequency band D is smaller than the predetermined threshold, the eyelid parameter conversion unit 26 sets an eyelid parameter indicating that the eyelid is in an open state. For example, the value of the eyelid parameter M is set to 0 at this time.
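The threshold comparison described above amounts to a single test, sketched below. The function name `eyelid_parameter` is hypothetical, and treating a level exactly equal to the threshold as the open state is an assumption, since the text only covers the cases above and below the threshold.

```python
def eyelid_parameter(level_d, threshold):
    """Return the eyelid parameter M for the level of frequency band D.

    M = 1 indicates the timing for closing the eyelid; M = 0 indicates
    that the eyelid stays open. A level equal to the threshold is
    treated as open (an assumption not stated in the text).
    """
    return 1 if level_d > threshold else 0


# Example with a hypothetical threshold of 0.5.
m = eyelid_parameter(0.8, 0.5)
```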
[0044]
FIG. 16 is a diagram illustrating an operation procedure for generating a head image in the image processing apparatus 100 according to the present embodiment. For example, an operation procedure for generating a head image for one screen is shown.
When an analog audio signal is input, the A/D conversion unit 10 converts it to digital audio data at a predetermined sampling frequency and inputs the digital audio data to the frequency analysis unit 22. The frequency analysis unit 22 performs a predetermined frequency analysis process by performing an FFT (Fast Fourier Transform) operation on the input audio data (step 100). Specifically, the frequency analysis unit 22 obtains the sound pressure level for each of the 64 divided regions of the audible frequency band, then accumulates the sound pressure levels of the divided regions within each of the three frequency bands A, B, and C described above to obtain the three sound pressure levels L1, L2, and L3 used for calculating the mouth shape parameters, and inputs them to the mouth shape parameter conversion unit 24. Further, the frequency analysis unit 22 extracts only the sound pressure level of the specific frequency band D shown in FIGS. 14 and 15 from the sound pressure levels of the 64 divided regions and inputs it to the eyelid parameter conversion unit 26.
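Step 100 can be sketched as follows. This is a minimal illustration under several assumptions: a naive discrete Fourier transform is used in place of an FFT for brevity, the range from 0 Hz to half the sampling frequency stands in for the audible band, spectral magnitude stands in for a calibrated sound pressure level, and the region indices chosen for bands A, B, and C are hypothetical because the text does not give the band boundaries.

```python
import cmath
import math

NUM_REGIONS = 64

# Hypothetical (first, last) divided-region indices for bands A, B, C.
BAND_A = (1, 3)
BAND_B = (4, 8)
BAND_C = (12, 20)


def divided_region_levels(samples):
    """Return one level per divided region for a block of audio samples."""
    n = len(samples)
    half = n // 2  # bins from 0 Hz up to (not including) Nyquist
    if half < NUM_REGIONS:
        raise ValueError("need at least 2 * NUM_REGIONS samples")
    bins_per_region = half // NUM_REGIONS  # leftover bins are ignored
    # Naive DFT magnitude for each bin (an FFT would give the same values).
    magnitudes = [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples))) / n
        for k in range(half)
    ]
    return [sum(magnitudes[r * bins_per_region:(r + 1) * bins_per_region])
            for r in range(NUM_REGIONS)]


def band_level(levels, band):
    """Accumulate the levels of the divided regions in an inclusive range."""
    first, last = band
    return sum(levels[first:last + 1])


# Example: a pure tone falling in DFT bin 10 of a 256-sample block.
block = [math.sin(2 * math.pi * 10 * t / 256) for t in range(256)]
levels = divided_region_levels(block)
l1, l2, l3 = (band_level(levels, b) for b in (BAND_A, BAND_B, BAND_C))
```

In this example the tone lies in divided region 5, so almost all of the accumulated level ends up in the hypothetical band B. Replacing `sum` with a mean in `band_level` would give the averaged variant mentioned in the embodiment.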
[0045]
Next, the mouth shape parameter conversion unit 24 sets the values of the five parameters Pa, Pb, Pc, Pd, and Pe used to determine the shape of the mouth and its surroundings, based on the three sound pressure levels L1, L2, and L3 input from the frequency analysis unit 22 (step 101). The specific calculation method is as described above, and each parameter value is obtained by a simple calculation using equations (1) to (5).
[0046]
In parallel with, before, or after the processing by the mouth shape parameter conversion unit 24, the eyelid parameter conversion unit 26 sets the value of the eyelid parameter based on the sound pressure level of the frequency band D input from the frequency analysis unit 22 (step 102). As described above, the value of the eyelid parameter can be set by the simple process of comparing the sound pressure level of the frequency band D with the predetermined threshold value.
[0047]
Next, the coordinate correction unit 28 corrects the vertex coordinates of each polygon constituting the three-dimensional object simulating the head, based on the values of the mouth shape parameters Pa to Pe set by the mouth shape parameter conversion unit 24 and the value of the eyelid parameter M set by the eyelid parameter conversion unit 26 (step 103).
[0048]
In practice, the mouth shape parameters Pa to Pe indicate the change amounts of the specific positions p1 to p4, but the vertex coordinates are corrected for the polygons included in a predetermined range centered on each of these specific positions. Taking the upper lip position p1 shown in FIG. 12 as an example, a predetermined range including the upper lip position p1 is extracted, extending vertically from the upper lip to the base of the nose and horizontally over a width comparable to that of the lips, and the vertex coordinates of the polygons included in this range are corrected based on the parameter Pa calculated by equation (1). The vertex coordinates of the polygons in the predetermined range are not changed uniformly; instead, the degree of correction is set so that the amount of correction decreases with increasing distance from the upper lip position p1.
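The distance-dependent correction around the upper lip position p1 can be sketched as follows. This is a minimal illustration under stated assumptions: a linear falloff within a spherical radius around p1 is used, and +y is taken as the upward direction. The embodiment specifies neither the falloff curve nor the axes, and it describes the range in terms of the lips and the base of the nose rather than a radius. The names `correct_upper_lip` and `radius` are hypothetical.

```python
import math


def correct_upper_lip(vertices, p1, pa, radius):
    """Move vertices near p1 upward by pa, scaled by a distance falloff.

    Assumptions: +y is upward, and the weight falls linearly from 1 at
    p1 to 0 at the given radius; vertices outside the radius are unchanged.
    """
    corrected = []
    for x, y, z in vertices:
        distance = math.dist((x, y, z), p1)
        weight = max(0.0, 1.0 - distance / radius)
        corrected.append((x, y + pa * weight, z))
    return corrected


# Example: three vertices at increasing distance from p1.
moved = correct_upper_lip(
    [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 2.0, 0.0)],
    p1=(0.0, 0.0, 0.0), pa=0.4, radius=1.0)
```

The vertex at p1 moves by the full amount Pa, the vertex halfway to the edge of the range moves by half of it, and the vertex outside the range is left in place.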
[0049]
Further, for the lower lip position p2 shown in FIG. 13, a process of correcting the lower lip position p2 with the predetermined position p5 as the rotation center is performed in consideration of the rotation of the entire lower jaw. Accordingly, the vertex coordinates of the polygon are corrected so that the amount of correction in the downward direction gradually decreases from the lower lip position p2 toward the rotation center position p5.
[0050]
FIGS. 17 to 20 are diagrams showing how the vertex coordinates of the polygons are corrected. FIG. 17 shows the three-dimensional object in the normal state viewed from the front, and FIG. 18 shows the three-dimensional object in the normal state viewed from the right side. FIG. 19 shows the three-dimensional object in the state where the vowel “a” is uttered viewed from the front, and FIG. 20 shows the three-dimensional object in the same state viewed from the right side. As shown in these drawings, correcting the positions p1 to p4 corrects the vertex coordinates of the polygons arranged around the mouth, so that the vertices of each polygon take the positions corresponding to the mouth shape in which the vowel “a” is uttered.
[0051]
Next, the perspective projection conversion unit 32 performs perspective projection conversion, based on a predetermined viewpoint position, on each polygon whose vertex coordinates have been corrected as necessary, and generates pseudo three-dimensional image data (step 104).
Next, the texture mapping unit 34 performs texture mapping processing for associating the textures (step 105) and shading processing (step 106), and stores the result in the frame memory 36. Thereafter, the pseudo three-dimensional image data stored in the frame memory 36 is read for each display line and displayed on the display device 110 (step 107).
[0052]
FIGS. 21 to 28 are diagrams illustrating specific examples of the pseudo three-dimensional image displayed on the display device 110. FIG. 21 shows an image in the normal state, in which nothing is uttered, with the viewpoint position set in front of the head. FIG. 22 shows an image in the normal state with the viewpoint position set on the right side of the head. FIG. 23 shows an image in the state where the vowel “a” is uttered, with the viewpoint position set in front of the head. FIG. 24 shows an image in the state where the vowel “a” is uttered, with the viewpoint position set on the right side of the head. FIG. 25 shows an image in the state where the vowel “i” is uttered, with the viewpoint position set in front of the head. FIG. 26 shows an image in the state where the vowel “i” is uttered, with the viewpoint position set on the right side of the head. FIG. 27 shows an image in the state where the vowel “u” is uttered, with the viewpoint position set in front of the head. FIG. 28 shows an image in the state where the vowel “u” is uttered, with the viewpoint position set on the right side of the head. As shown in these drawings, the shape of the mouth and its surroundings can be reproduced according to the content of the input voice, and an image of a three-dimensional object having a natural expression can be generated.
[0053]
As described above, the image processing apparatus 100 according to the present embodiment can form a natural mouth shape corresponding to the input sound by detecting the sound pressure level for each frequency component of the input sound and reflecting it in the mouth shape. Moreover, since the same mouth shape is always obtained for the same input voice, high reproducibility can be realized.
[0054]
In addition, since the mouth shape can be set only by performing frequency analysis on the input speech, the processing load is reduced as compared with the conventional method in which phonemes have to be extracted by speech recognition processing. Furthermore, since a dictionary necessary for the speech recognition process is unnecessary and it is not necessary to register a mouth shape pattern corresponding to each phoneme, the storage capacity of a memory or the like can be reduced.
[0055]
In particular, as shown in equations (1) to (5) described above, by defining in advance the relational expressions between the mouth shape parameters Pa to Pe and the sound pressure levels L1, L2, and L3, the mouth shape can be set simply by calculating the value of each mouth shape parameter using these expressions, and the processing load can be greatly reduced.
[0056]
In addition, the opening and closing of the eyelid is reproduced in parallel with the setting of the mouth shape, so a realistic three-dimensional object image having a natural expression can be generated. In particular, since the movement of the eyelid is set so as not to be interlocked with the movement of the mouth, the unnatural facial expression that occurs when the two are interlocked can be avoided. In addition, since the opening and closing timing of the eyelid is set according to the sound pressure level of a specific frequency component included in the input voice, other calculations, such as those needed to move the eyelid at random timing, are unnecessary, and the processing can be simplified.
[0057]
The present invention is not limited to the above embodiment, and various modifications are possible within the scope of the gist of the present invention. In the above-described embodiment, the case where the pseudo three-dimensional image corresponding to the three-dimensional object is simply displayed has been described. However, various specific applications are conceivable. For example, the present invention can be used for adding motion when creating a CG (computer graphics) animation of a face. Further, the present invention can be applied to a case where mouth movement is generated from voice, without facial motion data, for a character appearing in a game or the like. Alternatively, the present invention can be applied to moving a person's mouth on the screen based on the voice of the other party in a videophone that displays a person's facial expression using CG.
[0058]
In the above-described embodiment, the values of the mouth shape parameters Pa to Pe are calculated using equations (1) to (5), but expressions other than equations (1) to (5) may be adopted for the respective mouth shape parameters. For example, in the above-described embodiment, the mouth shapes observed when the three vowels “a”, “i”, and “u” are uttered were used to obtain the formulas for calculating the mouth shape parameters, but the mouth shapes when uttering “e”, “o”, or consonants may also be observed and used to obtain formulas for calculating the mouth shape parameters. Further, although the mouth shape parameters are calculated using the sound pressure levels L1, L2, and L3 of the three frequency bands A, B, and C shown in FIG. 4, the mouth shape parameters may be calculated using the sound pressure levels of two or more frequency bands.
[0059]
In addition, the calculation formulas shown in the formulas (1) to (5) use a method of determining the contents while comparing the mouth shape when the three vowels are actually spoken with the frequency analysis result at this time. However, the relationship between the mouth shape parameters Pa to Pe and the sound pressure levels L1 to L3 may be obtained by solving simultaneous equations. For example, if each parameter is calculated by linear combination of three sound pressure levels L1 to L3, three unknown coefficients may be determined for each parameter, and the value of 15 coefficients in total may be determined.
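Determining the coefficients by solving simultaneous equations can be sketched as follows for one parameter. This is a minimal illustration, assuming that one measurement of (L1, L2, L3) and one desired parameter value are available for each of three vowels, which gives three equations in three unknown coefficients; repeating it for all five parameters yields the 15 coefficients. All numerical values, and the names `det3` and `solve_coefficients`, are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as three rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_coefficients(levels, targets):
    """Solve c1*L1 + c2*L2 + c3*L3 = target for three observations.

    levels:  three rows (L1, L2, L3), one per vowel.
    targets: the desired parameter value for each vowel.
    Uses Cramer's rule, which is adequate for a 3 x 3 system.
    """
    d = det3(levels)
    if abs(d) < 1e-12:
        raise ValueError("observations are not linearly independent")
    coefficients = []
    for col in range(3):
        # Replace one column of the level matrix with the target values.
        replaced = [[targets[r] if c == col else levels[r][c]
                     for c in range(3)] for r in range(3)]
        coefficients.append(det3(replaced) / d)
    return tuple(coefficients)


# Hypothetical band levels for three vowels and desired values of one parameter.
levels = [(1.0, 1.0, 0.1), (1.0, 0.1, 1.0), (2.0, 0.1, 0.1)]
targets = (1.0, 0.2, 0.1)
c1, c2, c3 = solve_coefficients(levels, targets)
```

If more than three utterances were observed, the system would be overdetermined and a least-squares fit could be used instead; that variant is not described in the text.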
[0060]
In the above-described embodiment, five mouth shape parameters are used to determine the mouth shape. However, a number of mouth shape parameters other than five may be used. For example, in order to determine the detailed shape around the mouth, a case where more than five mouth shape parameters are used can be considered.
[0061]
In addition, although the image processing apparatus 100 according to the above-described embodiment can realize the functions corresponding to the respective components by hardware, in order to make it easy to change part of the specification, these functions may be realized by having a CPU read and execute a program stored in an information storage medium such as a memory. In this case, each component except the A/D conversion unit 10 and the frame memory 36 may be replaced with a computer configuration including a CPU and a memory. As an information storage medium for providing the program, in addition to a semiconductor memory such as a ROM or a RAM, an optical disk such as a CD or a DVD, a hard disk device, and the like can be used.
[0062]
Further, in the above-described embodiment, five types of parameters Pa, Pb, Pc, Pd, and Pe are defined based on the frequency analysis result for the input voice, and the mouth shape is determined by using these parameters to calculate directly the amount of change in the upward direction of the upper lip, the forward direction of the upper lip, the downward direction of the lower lip, the forward direction of the lower lip, and the lateral direction of the mouth corner. However, a method other than the one described above may be adopted for reflecting the frequency analysis result of the input voice in the mouth shape. Specifically, the mouth shape corresponding to the input voice may be set using the following method.
[0063]
Since the mouth shape parameter Pa is a parameter indicating the amount of change in the upward direction of the upper lip, a mouth shape with the mouth opened wide vertically (for example, the mouth shape shown in FIGS. 6 and 7) is associated with the case where the value of this parameter is large. For example, a parameter Pa′ obtained by normalizing the mouth shape parameter Pa is considered, and the mouth shape shown in FIGS. 6 and 7 is set when its value is 1.
[0064]
Similarly, since the mouth shape parameter Pb is a parameter indicating the amount of change in the forward direction of the upper lip, a mouth shape with the mouth protruding forward (for example, the mouth shape shown in FIGS. 10 and 11) is associated with the case where the value of this parameter is large. For example, a parameter Pb′ obtained by normalizing the mouth shape parameter Pb is considered, and the mouth shape shown in FIGS. 10 and 11 is set when its value is 1.
[0065]
Further, since the mouth shape parameter Pe is a parameter indicating the amount of change in the lateral direction of the mouth corner, a mouth shape with the mouth opened wide sideways (for example, the mouth shape shown in FIGS. 8 and 9) is associated with the case where the value of this parameter is large. For example, a parameter Pe′ obtained by normalizing the mouth shape parameter Pe is considered, and the mouth shape shown in FIGS. 8 and 9 is set when its value is 1.
[0066]
In this way, once a representative mouth shape has been associated with each of the three parameters Pa′, Pb′, and Pe′, the values of these three parameters are calculated for the actual input voice, and the mouth shape is set by synthesizing the images in accordance with the value of each parameter.
[0067]
For example, let Q be the vertex of an arbitrary polygon around the mouth. A vector whose start point is the vertex Q in the state where the mouth is closed, shown in FIGS. 12 and 13, and whose end point is the vertex Q in the state where the mouth is opened as shown in FIGS. 6 and 7 is defined as Xa. Further, a vector whose start point is the vertex Q in the state where the mouth is closed and whose end point is the vertex Q in the state where the mouth is shaped as shown in FIGS. 10 and 11 is defined as Xb. Similarly, a vector whose start point is the vertex Q in the state where the mouth is closed and whose end point is the vertex Q in the state where the mouth is shaped as shown in FIGS. 8 and 9 is defined as Xe.
[0068]
Using the above-described three vectors Xa, Xb, and Xe, the coordinates of the vertex Q after changing the mouth shape corresponding to the input speech can be calculated using the formula Pa′ · Xa + Pb′ · Xb + Pe′ · Xe. By performing such a calculation for all the vertices of the polygons necessary for changing the mouth shape, the mouth shape corresponding to the input voice can be set.
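The vector combination above can be sketched as follows. This is a minimal illustration, assuming that the weighted sum Pa′ · Xa + Pb′ · Xb + Pe′ · Xe is a displacement that is added to the closed-mouth position of the vertex Q, since the vectors are defined with that position as their start point. The function name `blend_vertex` and all numerical values are hypothetical.

```python
def blend_vertex(q_closed, xa, xb, xe, pa_n, pb_n, pe_n):
    """Return the position of vertex Q for normalized Pa', Pb', Pe'.

    q_closed:   coordinates of Q with the mouth closed.
    xa, xb, xe: displacement vectors from the closed-mouth position to the
                three representative mouth shapes.
    Assumption: the weighted sum of the vectors is added to q_closed.
    """
    return tuple(q + pa_n * a + pb_n * b + pe_n * e
                 for q, a, b, e in zip(q_closed, xa, xb, xe))


# Example: a vertex halfway toward the first two representative shapes.
q = blend_vertex((0.0, 0.0, 0.0),
                 xa=(0.0, 1.0, 0.0), xb=(0.0, 0.0, 1.0), xe=(1.0, 0.0, 0.0),
                 pa_n=0.5, pb_n=0.5, pe_n=0.0)
```

When one normalized parameter is 1 and the others are 0, the vertex lands exactly on the corresponding representative shape, which matches the association between parameter values and mouth shapes described above.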
[0069]
[Effects of the Invention]
As described above, according to the present invention, a more natural expression can be obtained by detecting the sound pressure level for each frequency component of the input voice and reflecting it in the mouth shape. Moreover, since the same mouth shape is always obtained for the same input voice, high reproducibility can be realized. Moreover, since the mouth shape can be set only by performing frequency analysis on the input voice, the processing load can be reduced. Furthermore, since a dictionary necessary for the speech recognition process is unnecessary and it is not necessary to register a mouth shape pattern corresponding to each phoneme, the storage capacity of a memory or the like can be reduced.
[0070]
In addition, since the opening and closing timing of the eyelids is set in conjunction with the sound pressure level of a specific frequency component included in the input voice, a natural facial expression close to that of an actual person can be generated.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration of an image processing apparatus according to an embodiment.
FIG. 2 is a diagram illustrating a detailed configuration of a texture mapping unit.
FIG. 3 is a diagram illustrating a frequency analysis result of a human voice.
FIG. 4 is a diagram illustrating a frequency analysis result of a human voice.
FIG. 5 is a diagram illustrating a frequency analysis result of a human voice.
FIG. 6 is a diagram showing the shape of the mouth and its surroundings when the vowel “a” is uttered.
FIG. 7 is a diagram showing the shape of the mouth and its surroundings when the vowel “a” is uttered.
FIG. 8 is a diagram showing the shape of the mouth and its surroundings when the vowel “i” is uttered.
FIG. 9 is a diagram showing the shape of the mouth and its surroundings when the vowel “i” is uttered.
FIG. 10 is a diagram showing the shape of the mouth and its surroundings when the vowel “u” is uttered.
FIG. 11 is a diagram showing the shape of the mouth and its surroundings when the vowel “u” is uttered.
FIG. 12 is a diagram showing the shape of the mouth and its surroundings in a normal state.
FIG. 13 is a diagram showing the shape of the mouth and its surroundings in the normal state.
FIG. 14 is a diagram showing a frequency band used for setting an eyelid parameter.
FIG. 15 is a diagram illustrating a frequency band used for setting an eyelid parameter.
FIG. 16 is a diagram illustrating an operation procedure for generating a head image in the image processing apparatus according to the present embodiment.
FIG. 17 is a diagram showing how the vertex coordinates of a polygon are corrected.
FIG. 18 is a diagram showing how the vertex coordinates of a polygon are corrected.
FIG. 19 is a diagram showing how the vertex coordinates of a polygon are corrected.
FIG. 20 is a diagram illustrating how the vertex coordinates of a polygon are corrected.
FIG. 21 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
FIG. 22 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
FIG. 23 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
FIG. 24 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
FIG. 25 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
FIG. 26 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
FIG. 27 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
FIG. 28 is a diagram illustrating a specific example of displaying a pseudo three-dimensional image.
[Explanation of symbols]
10 Analog-digital (A / D) converter
20 Object control unit
22 Frequency analyzer
24 Mouth shape parameter converter
26 Eyelid parameter converter
28 Coordinate correction part
30 Image generator
32 Perspective projection converter
34 Texture mapping unit
36 Frame memory
100 Image processing apparatus
110 Display device

Claims

A frequency analysis means for performing a frequency analysis on the input sound and detecting a sound pressure level for each of a plurality of frequency components;
A mouth shape setting means for setting a mouth shape based on a sound pressure level for each of the plurality of frequency components detected by the frequency analysis means;
Coordinate correcting means for correcting the coordinates of a plurality of polygons constituting the three-dimensional object of the head so as to form the mouth shape set by the mouth shape setting means;
An image processing apparatus characterized in that the mouth shape setting means sets the mouth shape by calculating, as mouth shape parameters, the respective values of the upper lip position, the lower lip position, and the mouth corner position using the sound pressure level values for each of the plurality of frequency components.

In claim 1,
An image processing apparatus, further comprising a texture mapping unit that gives texture information in which a pattern is set in advance to each polygon constituting the three-dimensional object.

In claim 1 or 2,
An image processing apparatus characterized in that an upper limit value is set in the mouth shape parameters for each of the position of the upper lip, the position of the lower lip, and the position of the mouth corner, and that, when the result calculated using the value of the sound pressure level exceeds the upper limit value, the parameter value is set to the upper limit value.

In claim 1 or 2,
An image processing apparatus characterized in that a lower limit value is set in the mouth shape parameters for each of the position of the upper lip, the position of the lower lip, and the position of the mouth corner, and that, when the result calculated using the value of the sound pressure level falls below the lower limit value, the parameter value is set to the lower limit value.

In any one of Claims 1-4,
Further comprising eyelid parameter setting means for setting, based on the sound pressure level detected by the frequency analysis means, an eyelid parameter indicating the timing for moving the eyelid,
An image processing apparatus, wherein the coordinate correcting means corrects the coordinates of the polygons related to the mouth shape and also corrects the coordinates of the polygons related to opening and closing of the eyelid based on the eyelid parameter.

Specific frequency component analysis means for detecting the sound pressure level of the specific frequency component included in the input speech;
Eyelid parameter setting means for setting an eyelid parameter indicating a timing for closing the eyelid when the sound pressure level of the specific frequency component detected by the specific frequency component analysis means exceeds a predetermined threshold;
Coordinate correcting means for correcting the coordinates of a plurality of polygons constituting the three-dimensional object of the head so as to open and close the eyelid based on the eyelid parameter set by the eyelid parameter setting means;
An image processing apparatus comprising:

An image processing method in an image processing apparatus comprising frequency analysis means, mouth shape setting means, and coordinate correcting means, the method comprising:
A first step of performing a frequency analysis on the input speech by the frequency analysis means to detect a sound pressure level for each of a plurality of frequency components;
A second step of setting a mouth shape by the mouth shape setting means based on a sound pressure level for each frequency component detected by the frequency analysis means in the first step;
A third step of correcting, by the coordinate correcting means, the coordinates of a plurality of polygons constituting the three-dimensional object of the head so as to form the mouth shape set by the mouth shape setting means in the second step;
An image processing method characterized in that the mouth shape setting means sets the mouth shape by calculating, as mouth shape parameters, the respective values of the upper lip position, the lower lip position, and the mouth corner position using the sound pressure level values for each of the plurality of frequency components.

In claim 7,
The image processing apparatus further includes eyelid parameter setting means,
The method further includes a fourth step of setting, by the eyelid parameter setting means, an eyelid parameter indicating the timing for moving the eyelid, based on the sound pressure level detected by the frequency analysis means in the first step,
An image processing method characterized in that, in the third step, the coordinate correcting means corrects the coordinates of the polygons related to the mouth shape and also corrects the coordinates of the polygons related to opening and closing of the eyelid based on the eyelid parameter set in the fourth step.

A computer-readable information storage medium storing a program for causing a computer to function as:
Frequency analysis means for performing a frequency analysis on the input sound and detecting a sound pressure level for each of a plurality of frequency components;
Mouth shape setting means for setting a mouth shape based on the sound pressure level for each of the plurality of frequency components detected by the frequency analysis means; and
Coordinate correcting means for correcting the coordinates of a plurality of polygons constituting the three-dimensional object of the head so as to form the mouth shape set by the mouth shape setting means,
Wherein the mouth shape setting means sets the mouth shape by calculating, as mouth shape parameters, the respective values of the upper lip position, the lower lip position, and the mouth corner position using the sound pressure level values for each of the plurality of frequency components.