JP4543263B2

JP4543263B2 - Animation data creation device and animation data creation program

Info

Publication number: JP4543263B2
Application number: JP2006230543A
Authority: JP
Inventors: 達夫四倉; 真一川本; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2006-08-28
Filing date: 2006-08-28
Publication date: 2010-09-15
Anticipated expiration: 2026-08-28
Also published as: JP2008052628A

Description

The present invention relates to animation data creation technology, and more particularly to an animation data creation apparatus and program that use a face model prepared in advance to create, from speech, a face-image animation synchronized with that speech.

Computer graphics (CG) is increasingly used in producing animated works, so that animation which demanded a high level of skill from the creator under conventional cel animation can now be realized by simple operations. One CG technique produces animation with a three-dimensional model. In this technique, the shape, position, orientation, and so on of an object are defined by polygons in a virtual space for each frame of the animation. Images of the object are synthesized from that definition, and the animation is composed from those images. Once the shape of an object has been defined, images of that shape from any viewpoint can be synthesized any number of times.

By deforming the object frame by frame before rendering it, changes in a character's facial expression can also be represented. If speech is prepared separately as the character's voice, and the character's mouth shape and facial expression are changed to match that speech, an animation can be produced in which the character appears to be speaking. In this specification, changing a character's mouth shape and facial expression to match speech is called "lip sync", and an animation in which lip sync is achieved is called a "lip-sync animation".

To realize lip sync, the character's voice must be synchronized with the character's facial expression as rendered in each frame. The methods widely used for this fall into two categories. One is to record the voice afterwards to match video produced in advance (after-recording, known in Japanese as "afureko"). The other is to record the voice first and produce the video afterwards to match it (pre-recording, hereinafter "pre-reco"). In after-recording, the animator produces each frame while anticipating how the character's expression will change during speech, and composes the animation from those frames. The speaker (or voice actor) voicing the character then delivers the lines, adjusting the timing while watching the character's expression in the animation. In pre-recording, by contrast, the speaker delivers the lines freely, and the animator produces each frame while adjusting the expression to match the recorded voice.

As a technique for generating lip-sync animation with CG, Non-Patent Document 1 (cited below) discloses a statistical probability model creation apparatus. The apparatus creates a statistical probability model for lip-sync animation from a data set consisting of recorded speech data, obtained by recording speech during utterance, and motion capture data for a plurality of feature points on the speaker's face, captured at the same time as the speech. The model gives the probability of the position of each feature point for an input phoneme label sequence or viseme label sequence.

In this specification, a "viseme" (visual element) is the visual counterpart of a phoneme: a basic shape of the face (mainly the mouth). There are a plurality of visemes, which are distinguished by identifying names. In this specification, the name of a viseme is called a viseme label.

Using this statistical probability model, one can estimate the sequence of feature-point position data that has the highest likelihood for the phoneme label sequence or viseme label sequence obtained from input speech. The estimated sequence gives the trajectories of the face model's feature points synchronized with the input speech, that is, a wireframe model for each frame of the face animation. Rendering the wireframe model of each frame yields the animation images.

According to Non-Patent Document 1, adding not only the position data of the facial feature points but also their velocity and acceleration to the parameters used for training the statistical probability model yields face animation that moves more naturally than when position data alone is used.
T. Yotsukura et al., "Lip-sync Animation from HMM Using Dynamic Features", ACM SIGGRAPH 2006 Proceedings CD, 30 July 2006, Boston, Massachusetts

The method of Non-Patent Document 1 is effective for creating face animation with smoother, more natural motion than is possible using only static data, namely position data. However, because dynamic data of the feature points are also used for training, the number of parameters at training time is three times that of the static-only case. Model training therefore takes a long time. In particular, increasing the number of feature points in order to create more precise animation increases the training time correspondingly.
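The factor of three can be seen from how dynamic features are commonly stacked in HMM-based modeling. The following is only a sketch under assumed notation: the symbols c_t (static feature-point coordinates at frame t), D (their dimension), and o_t (the stacked observation) are introduced here for illustration and do not appear in the specification.

```latex
\mathbf{o}_t =
\begin{bmatrix}
\mathbf{c}_t \\ \Delta \mathbf{c}_t \\ \Delta^2 \mathbf{c}_t
\end{bmatrix},
\qquad
\mathbf{c}_t \in \mathbb{R}^{D}
\;\Longrightarrow\;
\mathbf{o}_t \in \mathbb{R}^{3D}
```

Here the first block is position, the second velocity, and the third acceleration, so each added feature point contributes three times as many modeled values as in the static-only case.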

Furthermore, because the position data of the facial feature points are estimated by statistical processing with an HMM (hidden Markov model), the estimated positions deviate, albeit slightly, from the actual movement of the facial feature points.

There is therefore a need for a technique that shortens the preparation required for animation creation and can create lip-sync animation whose motion is natural and closely reflects actual facial movement.

Accordingly, an object of the present invention is to provide an animation data creation apparatus, and a program therefor, that require only a short preparation time for animation creation and can create lip-sync animation with natural motion that closely reflects actual facial movement.

An animation data creation program according to a first embodiment of the present invention is a program for creating, on a computer having first storage means storing a viseme corpus, animation data of a face, including a mouth that moves in correspondence with input speech data, on the basis of that speech data. The viseme corpus contains a plurality of viseme units created from video of a face during utterance, recorded together with speech. Each viseme unit contains a viseme label, motion data representing the facial movement corresponding to the viseme unit, and prosodic information that is obtained from the speech corresponding to the viseme unit and includes the duration of the phoneme corresponding to the viseme unit. The program causes the computer to function as first conversion means for converting the speech data into a phoneme data sequence identifying the phonemes represented by the speech data. The phoneme data sequence contains phoneme data, each consisting of a phoneme label and prosodic information including the duration of that phoneme's portion of the speech data. The program further causes the computer to function as second conversion means for outputting a viseme data sequence by converting each phoneme label contained in the phoneme data of the phoneme data sequence output by the first conversion means into the corresponding viseme label. The viseme data sequence output by the second conversion means contains viseme data, each consisting of a viseme label and prosodic information that is obtained from the portion of the speech data corresponding to that viseme datum and includes at least the duration of the corresponding phoneme. The program further causes the computer to function as first selection means and connection means. For each viseme datum in the viseme data sequence, the first selection means selects from the viseme corpus the viseme unit that has the same viseme label as the viseme datum and that is evaluated as having the speech most similar to the speech of the viseme datum, by an evaluation function that evaluates speech similarity from the prosodic information of the viseme datum and the prosodic information of each viseme unit in the corpus. The connection means creates mouth animation data corresponding to the input speech data by connecting, on the time axis and in the order of the viseme data sequence, the motion data contained in the viseme units selected by the first selection means.

The viseme corpus is stored in the first storage means in advance. It contains a plurality of viseme units created from video of a face during utterance, recorded together with speech. The motion data contained in each viseme unit reflect the actual facial movement during utterance. The first conversion means converts the input speech data into a phoneme data sequence. The second conversion means converts the phoneme labels in the phoneme data sequence into the corresponding viseme labels and outputs the result as a viseme data sequence. The first selection means uses the evaluation function to select from the viseme corpus, for each viseme datum, the viseme unit evaluated as having the speech most similar to that of the viseme datum. The connection means connects the motion data of the viseme units thus selected on the time axis to create the animation data.
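As a rough illustration of the select-and-connect flow described above, the following Python sketch picks, for each viseme (visual element) datum, the corpus unit with the same label and the closest duration, then joins the motion data in sequence order. The names `VisemeUnit`, `VisemeDatum`, `select_unit`, and `create_animation`, the use of the absolute duration difference as the evaluation function, and the toy data are all assumptions made here for illustration; the specification does not give these details.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VisemeUnit:
    """One corpus entry (hypothetical layout)."""
    label: str                  # viseme label
    duration: float             # duration of the corresponding phoneme, seconds
    motion: List[List[float]]   # one row of feature-point coordinates per frame


@dataclass
class VisemeDatum:
    """One item of the viseme data sequence derived from input speech."""
    label: str
    duration: float


def select_unit(datum: VisemeDatum, corpus: Sequence[VisemeUnit]) -> VisemeUnit:
    """Pick the same-label unit whose duration is closest to the datum's.

    The absolute duration difference stands in for the evaluation
    function; this choice is an assumption of the sketch.
    """
    candidates = [u for u in corpus if u.label == datum.label]
    if not candidates:
        raise LookupError(f"no unit with label {datum.label!r}")
    return min(candidates, key=lambda u: abs(u.duration - datum.duration))


def create_animation(data: Sequence[VisemeDatum],
                     corpus: Sequence[VisemeUnit]) -> List[List[float]]:
    """Connect the motion data of the selected units in sequence order."""
    frames: List[List[float]] = []
    for datum in data:
        frames.extend(select_unit(datum, corpus).motion)
    return frames


# Toy corpus with one coordinate per frame (illustrative values only).
corpus = [
    VisemeUnit("a", 0.10, [[0.0], [1.0]]),
    VisemeUnit("a", 0.20, [[0.0], [1.0], [2.0]]),
    VisemeUnit("m", 0.08, [[5.0]]),
]
data = [VisemeDatum("m", 0.07), VisemeDatum("a", 0.18)]
print(create_animation(data, corpus))
```

Because the frames are copied from recorded units rather than generated by a trained model, no training step appears anywhere in the sketch; preparing the corpus is the only up-front work.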

The motion data used in creating the animation data are obtained from actual facial movement. When they are connected, therefore, the motion of the resulting face animation is natural and closely reflects actual facial movement, at least within the portion corresponding to each viseme datum. Creating the viseme corpus does not require a training process using a large amount of data, as the method of Non-Patent Document 1 does. It is therefore possible to provide an animation data creation program that requires only a short preparation time for animation creation and can create lip-sync animation with natural motion that closely reflects actual facial movement.

Preferably, the prosodic information of the speech contained in each viseme unit of the viseme corpus includes, in addition to the duration of the speech corresponding to the viseme unit, the average power of the speech over that duration. The first conversion means includes means for converting the speech data into a phoneme data sequence whose phoneme data each consist of a phoneme label and the duration and average power of that phoneme's portion of the speech data. The first selection means includes evaluation means and second selection means. For each viseme datum in the viseme data sequence, and for each viseme unit in the viseme corpus that has the same viseme label as that viseme datum, the evaluation means evaluates the value of an evaluation function that evaluates speech similarity from the duration and average power of the viseme datum and the duration and average power of the viseme unit. The second selection means selects from the viseme corpus the viseme unit that the evaluation means has evaluated as having the speech most similar to that of the viseme datum.

The evaluation used in selecting viseme units takes into account not only the duration but also the average power of the speech. The movement of each part of the face is affected by the loudness of the voice during utterance. Evaluating candidate viseme units with the average power as well therefore makes it possible to select viseme units without large discontinuities in the movement of each part of the face.
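One way to picture an evaluation function over duration and average power is a weighted sum of absolute differences, where a smaller value means more similar speech. This is only a sketch: the function name `prosody_cost`, the weights, the units (seconds and decibels), and the sample numbers are assumptions made here, since this passage does not state the actual form of the function.

```python
from typing import Tuple

Prosody = Tuple[float, float]  # (duration in seconds, average power in dB); assumed units


def prosody_cost(target: Prosody, unit: Prosody,
                 w_duration: float = 1.0, w_power: float = 1.0) -> float:
    """Weighted sum of absolute differences; smaller means more similar.

    The weights are hypothetical tuning parameters, not values from
    the specification.
    """
    d_duration = abs(target[0] - unit[0])
    d_power = abs(target[1] - unit[1])
    return w_duration * d_duration + w_power * d_power


target = (0.12, 60.0)
candidates = [(0.12, 63.0), (0.14, 60.5)]

# With equal weights the power difference dominates, so the second unit wins.
print(min(candidates, key=lambda u: prosody_cost(target, u)))

# Weighting duration heavily flips the choice to the first unit.
print(min(candidates, key=lambda u: prosody_cost(target, u, w_duration=200.0)))
```

Since duration and power live on different scales, some weighting or normalization is needed in practice; how to set it is a design choice left open here.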

As a result, it is possible to provide an animation data creation program that requires only a short preparation time for animation creation and can create lip-sync animation with natural, smooth motion that closely reflects actual facial movement.

More preferably, the computer further includes second storage means for storing a phoneme-viseme conversion table that stores the correspondence between phoneme labels and viseme labels. The second conversion means includes means for outputting the viseme data sequence by converting each phoneme label contained in the phoneme data of the phoneme data sequence output by the first conversion means into the corresponding viseme label with reference to the phoneme-viseme conversion table.

A phoneme data sequence is first obtained using the established technique of converting speech data into a phoneme data sequence, and the phoneme labels are then converted into the corresponding viseme labels. The system can therefore be built efficiently from existing technology.

More preferably still, the number of viseme labels obtained through conversion by the second conversion means is smaller than the number of phoneme labels output by the first conversion means.

Fewer visemes are needed than speech sounds. Making the number of viseme labels smaller than the number of phoneme labels in this way therefore stabilizes the processing.
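The table lookup can be sketched as a plain dictionary in which several phoneme labels share one viseme label, which is one reading of "fewer viseme labels than phoneme labels". The table contents below are invented for illustration (bilabial consonants grouped into one mouth shape); the specification's actual table is not reproduced here, and the names `PHONEME_TO_VISEME` and `to_viseme_sequence` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical phoneme-to-viseme table: many phonemes map to one viseme.
PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "M", "b": "M", "m": "M",   # lips closed
    "a": "A", "i": "I", "u": "U", "e": "E", "o": "O",
    "sil": "SIL",                   # silence
}


def to_viseme_sequence(
    phoneme_data: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Replace each phoneme label by its viseme label, keeping the duration."""
    return [(PHONEME_TO_VISEME[label], duration)
            for label, duration in phoneme_data]


phonemes = [("sil", 0.20), ("b", 0.06), ("a", 0.15), ("m", 0.08), ("i", 0.12)]
print(to_viseme_sequence(phonemes))
```

The sequence length is unchanged by the conversion; only the label inventory shrinks, here from nine phoneme labels to seven viseme labels.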

Each viseme unit of the viseme corpus may further contain the viseme labels of a first number of viseme units that preceded it, and the viseme labels of a second number of viseme units that followed it, when the plurality of viseme units were created from the video of the face during utterance with speech. The viseme labels of the preceding first number of viseme units, the viseme label of the viseme unit itself, and the viseme labels of the following second number of viseme units constitute a viseme label set. The second conversion means includes: means for creating a viseme label set for each phoneme datum in the phoneme data sequence output by the first conversion means, by converting the phoneme label of that phoneme datum, the phoneme labels of the first number of preceding phoneme data, and the phoneme labels of the second number of following phoneme data into the corresponding viseme labels and combining them in the order of the phoneme data; and means for creating and outputting viseme label data by replacing, for each phoneme datum in that phoneme data sequence, the phoneme label of the phoneme datum with the viseme label set obtained by the means for creating a viseme label set. The first selection means includes second selection means for selecting from the viseme corpus, for each viseme datum in the viseme data sequence, the viseme unit that has the same viseme label set as the viseme datum being processed and that is evaluated as having the speech most similar to that of the viseme datum, by an evaluation function that evaluates speech similarity from the prosodic information of the viseme datum being processed and the prosodic information of each viseme unit in the corpus.

Replacing each viseme label with a viseme label set in this way allows viseme units to be selected with the face shapes before and after the corresponding utterance taken into account. It is therefore possible to provide an animation data creation program that can create lip-sync animation with natural, smooth motion that reflects actual facial movement, including the face shapes that precede and follow it.

The first number may be 1, and the second number may also be 1.
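Building the viseme label sets amounts to sliding a window over the viseme label sequence. The sketch below uses the case where both numbers are 1, giving sets of three labels. The function name `to_label_sets` and the padding of the sequence ends with a `"SIL"` label are assumptions of this sketch; the text does not say how the first and last items are handled.

```python
from typing import List, Sequence, Tuple


def to_label_sets(labels: Sequence[str], before: int = 1, after: int = 1,
                  pad: str = "SIL") -> List[Tuple[str, ...]]:
    """Return, for each label, the tuple (preceding..., itself, following...).

    `before` and `after` play the roles of the first and second numbers.
    Padding both ends with `pad` is an assumption made for illustration.
    """
    padded = [pad] * before + list(labels) + [pad] * after
    width = before + 1 + after
    return [tuple(padded[i:i + width]) for i in range(len(labels))]


print(to_label_sets(["M", "A", "I"]))
# [('SIL', 'M', 'A'), ('M', 'A', 'I'), ('A', 'I', 'SIL')]
```

Each output tuple keeps the current label at index `before`, so a later matching step can tell the center label apart from its context.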

Preferably, the second selection means includes: determination means for determining, for each viseme datum in the viseme data sequence, whether the viseme corpus contains a viseme unit having the same viseme label set as the viseme datum being processed; first calculation means, responsive to a determination by the determination means that such viseme units exist, for calculating, for each of those viseme units, the value of an evaluation function that evaluates speech similarity from the prosodic information of the viseme datum and the prosodic information of the viseme unit; means, responsive to a determination by the determination means that no such viseme unit exists, for selecting from the viseme corpus viseme units whose viseme label sets match, in position and content, a partial viseme label set consisting of a part of the viseme label set of the viseme datum being processed that includes the viseme label of that viseme datum itself; second calculation means for calculating, for each viseme unit selected by the means for selecting, the value of the evaluation function against the prosodic information of the viseme datum being processed; and means for selecting the viseme unit for which the value of the evaluation function calculated by the first calculation means or the second calculation means is smallest.

When trying to select from the viseme corpus a viseme unit whose viseme label set, including the preceding and following viseme labels, matches exactly, there may be no viseme unit that satisfies the condition, particularly when the variety of visemes in the corpus is not sufficiently large. In that case, selecting from the corpus viseme units whose label set matches in only the first half or only the second half ensures that a suitable viseme unit can still be selected.
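The exact-match-then-partial-match selection can be sketched as follows for label sets of three. The class `Unit`, the function `select_with_fallback`, the simple unweighted cost, and the sample corpus are assumptions made for illustration. What happens when even the partial match finds nothing is not described in this passage, so the sketch simply raises an error in that case.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (previous, current, next) viseme labels


@dataclass
class Unit:
    labels: Triple
    duration: float  # seconds (assumed unit)
    power: float     # average power in dB (assumed unit)


def cost(unit: Unit, duration: float, power: float) -> float:
    """Stand-in evaluation function; smaller means more similar."""
    return abs(unit.duration - duration) + abs(unit.power - power)


def select_with_fallback(labels: Triple, duration: float, power: float,
                         corpus: List[Unit]) -> Unit:
    """Prefer units whose full label set matches; otherwise accept units
    matching the first half (previous, current) or the second half
    (current, next) in both position and content."""
    candidates = [u for u in corpus if u.labels == labels]
    if not candidates:
        candidates = [u for u in corpus
                      if u.labels[:2] == labels[:2] or u.labels[1:] == labels[1:]]
    if not candidates:
        raise LookupError(f"no unit matches {labels} even partially")
    return min(candidates, key=lambda u: cost(u, duration, power))


corpus = [
    Unit(("SIL", "M", "A"), 0.10, 60.0),
    Unit(("I", "M", "A"), 0.20, 55.0),
    Unit(("SIL", "M", "O"), 0.10, 50.0),
]
# No unit has the set ("U", "M", "A"); two units share the second half ("M", "A").
print(select_with_fallback(("U", "M", "A"), 0.19, 55.0, corpus))
```

Both partial matches always include the current label, so the fallback never returns a unit recorded for a different viseme.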

More preferably still, the connection means includes weighted addition means for connecting the motion data of the viseme units on the time axis. For the motion data of two viseme units that are consecutive on the time axis, among the motion data contained in the viseme units selected by the selection means, the weighted addition means adds the motion data of the final portion of the preceding viseme unit and the motion data of the initial portion of the following viseme unit, each weighted according to time.

The viseme units selected from the viseme corpus were normally not recorded consecutively with one another, so some discontinuity in facial movement can arise between them. Weighted addition of the motion data of two consecutive viseme units at their boundary allows the two to be connected smoothly. As a result, it is possible to provide an animation data creation program that requires only a short preparation time for animation creation and can create lip-sync animation with natural, smooth motion that closely reflects actual facial movement.
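The time-weighted addition at a unit boundary can be pictured as a cross-fade over a few overlapping frames. The sketch below assumes linear weights and an overlap measured in frames; the function name `crossfade`, the weight schedule, and the fact that the overlapped frames are merged (so the joined sequence is shorter by `overlap` frames) are all assumptions, not details from the specification.

```python
from typing import List

Frame = List[float]  # feature-point coordinates for one frame


def crossfade(prev_motion: List[Frame], next_motion: List[Frame],
              overlap: int) -> List[Frame]:
    """Join two motion sequences, blending the last `overlap` frames of the
    preceding unit with the first `overlap` frames of the following unit.

    The weight of the following unit rises linearly with time
    (an assumed weighting).
    """
    if overlap <= 0 or overlap > min(len(prev_motion), len(next_motion)):
        raise ValueError("overlap must be between 1 and the shorter unit length")
    start = len(prev_motion) - overlap
    blended: List[Frame] = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of the following unit
        a, b = prev_motion[start + k], next_motion[k]
        blended.append([(1.0 - w) * x + w * y for x, y in zip(a, b)])
    return prev_motion[:start] + blended + next_motion[overlap:]


prev_unit = [[0.0], [0.0], [0.0]]
next_unit = [[4.0], [4.0], [4.0]]
print(crossfade(prev_unit, next_unit, overlap=1))
# [[0.0], [0.0], [2.0], [4.0], [4.0]]
```

A longer overlap gives a gentler transition at the cost of altering more of the recorded motion, so the overlap length is a trade-off between smoothness and fidelity.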

A recording medium according to a second aspect of the present invention is a computer-readable recording medium on which any of the animation data creation programs described above is recorded.

An animation data creation apparatus according to a third aspect of the present invention creates animation data representing facial movement corresponding to input speech data, using a viseme corpus containing a plurality of tri-viseme units. Each tri-viseme unit contains a tri-viseme label, the duration of the viseme corresponding to the tri-viseme unit, the average power of the speech uttered when the viseme was recorded, and motion data of the feature points of the speaker's face when the viseme was recorded. The apparatus includes: phoneme conversion means for creating, by speech analysis of the input speech data, a phoneme data sequence consisting of phoneme labels, phoneme durations, and the average power during utterance of each phoneme; means for storing a table representing the correspondence between phoneme labels and viseme labels; first conversion means for creating a viseme data sequence by converting the phoneme labels in the phoneme data sequence into the corresponding viseme labels with reference to the table; second conversion means for converting, for each viseme datum in the viseme data sequence output by the first conversion means, its viseme label into a tri-viseme label that combines it with the viseme labels of the preceding and following viseme data, and outputting a tri-viseme data sequence; selection means for selecting from the viseme corpus, for each tri-viseme datum in the tri-viseme data sequence output by the second conversion means, a tri-viseme unit whose tri-viseme label matches that of the tri-viseme datum and which is evaluated as having a duration and average power similar to those of the tri-viseme datum, by an evaluation function that evaluates the similarity between the duration and power of the tri-viseme unit and the duration and average power of the tri-viseme datum; and connection means for creating face animation data by connecting, on the time axis and according to the time series of the tri-viseme data, the facial motion data contained in the tri-viseme units selected by the selection means.

A viseme corpus is created in advance. The motion data contained in the viseme units of the viseme corpus reflects actual facial motion. The first conversion means converts the input speech data into a phoneme data sequence. The second conversion means converts the phoneme labels in the phoneme data sequence into the corresponding viseme labels and outputs the result as a viseme data sequence. The selection means selects from the viseme corpus the viseme unit that the evaluation function rates as having speech most similar to the speech associated with the viseme data. The concatenation means concatenates the motion data of the viseme units selected in this way on the time axis, following the time series of the viseme data, to create the animation data.

The motion data used to create the animation data is obtained from actual facial motion. Therefore, when the data is concatenated, the facial animation in the portion corresponding to each item of viseme data moves naturally and closely reflects actual facial motion. Creating the viseme corpus does not require a learning process using a large amount of data such as that described in Non-Patent Document 1. It is therefore possible to provide an animation data creation program for which preparation takes little time and which can create lip-sync animation with natural motion that closely reflects actual facial motion.

A lip-sync animation creation device according to an embodiment of the present invention is described below. As described later, this lip-sync animation creation device is implemented by computer hardware, a program executed by that hardware, and data such as an acoustic model stored in a storage device of the computer.

First, the terms used in the following description are explained.

"Viseme" renders the Japanese term 視覚素 (literally "visual element"), which is itself a translation of the English word "viseme". The term 口形素 ("mouth-shape element") is also used. Like a phoneme in speech, a viseme is a set of information representing a basic shape of the face (the mouth in particular) that occurs in facial motion.

A "viseme label" is a name assigned to each viseme in order to identify it. It is used in the same way as a "phoneme label" is used for phonemes.

A "viseme corpus" is a database in which the facial motion of a speaker during speech is recorded with a motion capture device and stored divided into visemes. In the present embodiment, the viseme corpus contains a plurality of viseme units. Each viseme unit includes time-series data of the position vectors of the facial feature points, a viseme name, time information identifying the portion of that time-series data corresponding to the viseme, and the power of the speech in the portion corresponding to the viseme. In the present embodiment, the originally recorded speech data is also attached to the viseme corpus. The result is called a "speech-viseme corpus".

"Viseme data" is data obtained from the input speech that serves as the criterion for selecting visemes from the viseme corpus. In the present embodiment, viseme data includes the viseme label of the viseme to be selected, its duration, and the average power of the input speech corresponding to the viseme. The duration of a viseme is likewise obtained from the duration of the phoneme of the input speech corresponding to that viseme.

A "tri-viseme label" (三つ組視覚素ラベル, literally "triple visual element label") is the combination, in their order on the time axis, of the viseme label of a given viseme, the viseme label of the viseme immediately before it, and the viseme label of the viseme immediately after it. In the present embodiment, each viseme unit in the viseme corpus carries such a tri-viseme label. These units are called tri-viseme units in this specification.

[Configuration]
The functional configuration of the lip-sync animation creation device implemented by a program according to an embodiment of the present invention is described below. FIG. 1 is a block diagram of the lip-sync animation creation device 40. Referring to FIG. 1, the lip-sync animation creation device 40 includes: a recording system 60 for recording the motion of the facial feature points of a speaker 50 uttering predetermined text, together with the speech, and creating a speech-viseme corpus; a speech-viseme corpus storage unit 62 for storing the speech-viseme corpus created by the recording system 60; an animation data synthesis device 44 for synthesizing as animation data, from input speech data 42, a sequence of motion vectors of the facial feature points that moves in synchronization with the speech data 42; and an animation data storage unit 46 for storing the animation data synthesized by the animation data synthesis device 44.

The speech-viseme corpus stored in the speech-viseme corpus storage unit 62 includes a tri-viseme unit sequence obtained from video of the speaker 50 during speech.

The lip-sync animation creation device 40 further includes: an animation creation device 48 which, when the actual animation is created, reads the animation data stored in the animation data storage unit 46, applies it to a wireframe face model of the animation character prepared in advance to create time-series data of the face model moving in synchronization with the input speech data 42, and then applies a face texture to the face model and renders it, thereby creating an animation of the character's face to be displayed at a predetermined number of frames per second; and an animation storage unit 98 for storing the animation created by the animation creation device 48 together with the speech data 42.

The lip-sync animation creation device 40 further includes: an animation reading unit 100 which, when the animation is displayed, reads the animation stored in the animation storage unit 98 and writes it to a frame memory (not shown) at a predetermined frame rate; and a display unit 52 for playing back and displaying, together with its speech, the animation written to the frame memory by the animation reading unit 100.

FIG. 2 shows the configuration of the recording system 60. Referring to FIG. 2, the recording system 60 includes: a video and audio recording system 112 for recording the speech uttered by the speaker 50 and moving images of the speaker 50 during the utterance; a motion capture (hereinafter "MoCap") system 114 for measuring the position and trajectory of each part of the face of the speaker 50 during the utterance; and a data set creation device 122 for creating, from the audio/video data 116 recorded by the video and audio recording system 112 and the data 118 measured by the MoCap system 114 (hereinafter "MoCap data"), a data set 120 consisting of sequences of speech data, three-dimensional motion vectors of each part of the speaker's face during the utterance, viseme labels, viseme durations, the average power of the speech when each viseme was uttered, and the like, and storing it in the speech-viseme corpus storage unit 62 as the speech-viseme corpus. As described later, the three-dimensional data of the feature points of the speaker's face is processed into motion vectors from which the motion of the head has been removed. In this specification this processing is called normalization, and the sequence of three-dimensional motion vectors of the facial feature points after normalization is called the face parameters.

The video and audio recording system 112 includes: microphones 130A and 130B for receiving the speech uttered by the speaker 50 and converting it into audio signals; and a camcorder 132 for capturing moving images of the speaker 50 and recording the resulting video signal simultaneously with the audio signals from the microphones 130A and 130B to generate the audio/video data 116.

The camcorder 132 has a function of supplying a time code 134 to the MoCap system 114. The camcorder 132 also has a function of converting the audio and video signals into data in a predetermined format, attaching the same time code as the time code 134, and recording the data on a recording medium (not shown).

The MoCap system 114 according to the present embodiment includes an optical system that measures the position of a measurement target using light reflected from highly retroreflective optical markers (hereinafter simply "markers"). The MoCap system 114 includes: a plurality of infrared cameras 136A, ..., 136F for capturing, frame by frame at predetermined time intervals, images of infrared light reflected from markers attached to a large number of predetermined locations on the head of the speaker 50; and a data processing device 138 for measuring the position of each marker in each frame from the video signals of the infrared cameras 136A, ..., 136F, attaching the time code 134 from the camcorder 132, and outputting the result.

FIG. 3 schematically shows an example of the positions of the markers attached to the head 110 of the speaker 50. Referring to FIG. 3, markers are attached at a large number of locations 160 on the face, neck, and ears of the speaker 50, close to the head 110. Each marker is hemispherical or spherical, and its surface is treated to retroreflect light. The markers are a few millimeters in diameter. Building a rich speech-viseme corpus 62 requires measurements over multiple days or on multiple speakers 50. For this reason, the order in which the markers are attached is fixed in advance, and each attachment position is defined in advance either as a characteristic position of a facial organ or as a position determined relative to markers already attached. The attachment positions defined in this way are called "feature points" in this specification.

Because of the physical structure of the face, there are locations on the surface of the face of the speaker 50 that move with the head itself but are almost unaffected by changes in the facial expression of the speaker 50. Examples are the temples 160A and 160B and the tip of the nose 160C. In the present embodiment, such locations are designated in advance as feature points. These feature points are hereinafter called fixed points. Motion capture measures the three-dimensional positions of the facial feature points, but changes in those positions also include changes due to movement of the head 110 of the speaker 50 itself. To obtain the motion of the face, the motion of the head must be subtracted from the position data of each feature point. This processing is called normalization and is described in detail later. The fixed points are used in the normalization. It is desirable to define four or more fixed points for the normalization.

Referring again to FIG. 2, the data processing device 138 collects the position measurement data of the markers (hereinafter "marker data") frame by frame to generate the MoCap data 118 and outputs it to the data set creation device 122. A commercially available optical MoCap system can be used as the MoCap system 114. The functions and operation of the infrared cameras and the data processing device in commercially available optical MoCap systems are well known, so a detailed description of them is not repeated here.

The data set creation device 122 includes: an audio/video storage unit 140 for capturing and storing the audio/video data 116; a tri-viseme data sequence creation unit 144 for reading the audio/video data 116 stored in the audio/video storage unit 140 and creating and outputting a tri-viseme data sequence 124; a MoCap data storage unit 142 for capturing and storing the MoCap data 118; a normalization unit 146 for reading the MoCap data stored in the MoCap data storage unit 142, normalizing the MoCap data 152, and converting it into a sequence 126 of face parameters for each facial feature point; and a combining unit 148 for generating the data set 120 by synchronizing and combining, using their time stamps, the tri-viseme data sequence 124 from the tri-viseme data sequence creation unit 144 and the face parameter sequence 126 from the normalization unit 146, and storing it in the speech-viseme corpus storage unit 62 as the speech-viseme corpus.

The normalization unit 146 has a function of generating the face parameters of each frame of the MoCap data 152 by transforming the marker data of that frame so that the positional change of the fixed points described above becomes zero. In the present embodiment, an affine transformation is used for this transformation.

Let the marker data in the MoCap data 152 of the frame at time t = 0 be written in homogeneous coordinates as P = <Px, Py, Pz, 1>, and the marker data at time t ≠ 0 as P' = <P'x, P'y, P'z, 1>. The relationship between the marker data P and the marker data P' is then expressed by the following equation (1) using an affine matrix M.
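Equation (1) itself is not reproduced in this text. As a hedged reconstruction only, assuming that M maps the marker data of the current frame back to the head pose at t = 0 (the direction in which M is applied in the normalization described next) and that P and P' are treated as column vectors, one form consistent with the definitions above would be:

```latex
\begin{equation}
P = M P', \qquad
M =
\begin{pmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
m_{31} & m_{32} & m_{33} & m_{34} \\
0      & 0      & 0      & 1
\end{pmatrix}
\tag{1}
\end{equation}
```

The element names m_ij and the direction of the mapping are assumptions. The opposite convention, P' = M P with the inverse applied during normalization, would describe the same relationship.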

If the position data of the fixed points has the same values in every frame of the face parameter sequence 126, the positional change of the fixed points is zero, and the positions of the other feature points are normalized relative to the positions of the fixed points. In the present embodiment, therefore, the affine matrix M for each frame is calculated from the marker data of the fixed points in the frame at t = 0 and the marker data of the same fixed points in the frame being processed. Each item of marker data is then affine-transformed using this affine matrix M. The transformed marker data represents the positions of the facial feature points as they would be if the utterance had been made with the head kept in its position at t = 0.
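The normalization step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes exactly four non-coplanar fixed points, so that the affine matrix is determined exactly by a 4x4 linear system, and it assumes M maps the current frame back to the pose at t = 0. With more than four fixed points a least-squares fit would be needed instead. The function names are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]  # (x, y, z)


def _invert4(a: List[List[float]]) -> List[List[float]]:
    """Invert a 4x4 matrix by Gauss-Jordan elimination with partial pivoting."""
    n = 4
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("fixed points are coplanar; M is not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def _matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Plain matrix product of two small matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def affine_from_fixed_points(ref: List[Point], cur: List[Point]) -> List[List[float]]:
    """Return the 4x4 matrix M with M @ [cur_i, 1] == [ref_i, 1] for four fixed points.

    ref: fixed-point positions in the frame at t = 0.
    cur: the same fixed points in the frame being processed.
    """
    # Each column holds the homogeneous coordinates of one fixed point.
    ref_cols = [[float(p[r]) for p in ref] for r in range(3)] + [[1.0] * 4]
    cur_cols = [[float(p[r]) for p in cur] for r in range(3)] + [[1.0] * 4]
    return _matmul(ref_cols, _invert4(cur_cols))


def normalize_frame(m: List[List[float]], markers: List[Point]) -> List[List[float]]:
    """Apply M to every marker of a frame, removing the head motion."""
    out = []
    for x, y, z in markers:
        h = (x, y, z, 1.0)
        out.append([sum(m[r][c] * h[c] for c in range(4)) for r in range(3)])
    return out
```

Applied to the fixed points themselves, `normalize_frame` returns their t = 0 positions, which is the "zero positional change" condition stated above.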

In the present embodiment, furthermore, the marker data of each feature point obtained from an image of the speaker's expressionless face is subtracted from the normalized marker data of that feature point, so that the position of the feature point in each frame is expressed as a motion vector. As a result, the animation of a face model can be created by the following processing.

Referring to FIG. 4, assume that a face model 170 of an animation character has been prepared in advance. To create a face image animation 172 consisting of three consecutive frames 180, 182, and 184 for this face model 170, the motion vectors V_180, V_182, and V_184 obtained by the processing described above are each added to the marker data of the feature points of the face model 170. This yields the position of each feature point of the face model in each of the three frames 180, 182, and 184. In practice, the face model is given as a wireframe and the positions of the feature points do not necessarily coincide with the positions of the wireframe nodes, so the feature points must be mapped to the nodes of the face model 170 beforehand. The deformation of the face model is described in detail later.
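The two steps just described, expressing each normalized frame as motion vectors relative to the expressionless face and then adding those vectors to the feature points of a character model, can be sketched as below. The function names are hypothetical, and the sketch assumes the feature points of the speaker and of the face model are listed in the same order. The mapping of feature points to wireframe nodes is outside its scope.

```python
from typing import List

Vec3 = List[float]


def motion_vectors(normalized: List[Vec3], neutral: List[Vec3]) -> List[Vec3]:
    """Per-feature-point displacement of a normalized frame from the neutral face."""
    return [[a - b for a, b in zip(p, q)] for p, q in zip(normalized, neutral)]


def apply_motion(model_points: List[Vec3], vectors: List[Vec3]) -> List[Vec3]:
    """Feature-point positions of the face model for one animation frame."""
    return [[a + b for a, b in zip(p, v)] for p, v in zip(model_points, vectors)]


if __name__ == "__main__":
    neutral = [[0.5, 2.0, 2.5]]    # speaker, expressionless face
    frame = [[1.0, 2.0, 3.0]]      # speaker, one normalized frame
    model = [[10.0, 20.0, 30.0]]   # character model, same feature point
    v = motion_vectors(frame, neutral)
    print(apply_motion(model, v))
```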

The tri-viseme data sequence creation unit 144 shown in FIG. 2 is described in detail with reference to FIG. 5. Referring to FIG. 5, the tri-viseme data sequence creation unit 144 includes: a framing unit 200 for reading the audio/video data 116 from the audio/video storage unit 140 and dividing the speech into frames of a predetermined frame length and frame interval for acoustic processing; a feature extraction unit 201 for extracting, from the speech data of each frame output by the framing unit 200, the features 230 used in the Viterbi alignment described later; an acoustic model storage unit 202 for storing a statistical acoustic model trained on the speech of the speaker 50 (see FIG. 1); an utterance text storage unit 204 for storing the text uttered when the utterance data was recorded by the recording system 60; a Viterbi alignment unit 206 for outputting, from the feature sequence output by the feature extraction unit 201, by Viterbi alignment using the acoustic model stored in the acoustic model storage unit 202 and the utterance text stored in the utterance text storage unit 204, the maximum-likelihood sequence of phoneme data (phoneme data sequence 232), each item of which consists of the label of a phoneme corresponding to the utterance text and its duration; and a phoneme data sequence storage unit 208 for storing the phoneme data sequence 232 output by the Viterbi alignment unit 206. In the present embodiment, an acoustic model consisting of acoustic HMMs is used.
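As an illustration of the framing step only, the sketch below cuts a sample sequence into overlapping frames of a fixed length and interval. The function name and the choice to drop an incomplete trailing frame are assumptions; the frame length and interval used in the embodiment are not stated in this passage.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, frame_shift: int) -> List[List[float]]:
    """Split samples into frames of frame_len samples, one frame every frame_shift samples.

    A trailing segment shorter than frame_len is dropped (assumption).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    return [list(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, frame_shift)]
```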

The tri-viseme data sequence creation unit 144 further includes: a phoneme-viseme conversion table storage unit 210 for storing a phoneme-viseme conversion table indicating the correspondence between phoneme labels and viseme labels; a phoneme-viseme conversion unit 212 for reading the phoneme data sequence stored in the phoneme data sequence storage unit 208, converting the phoneme label of each item of phoneme data into the corresponding viseme label with reference to the phoneme-viseme conversion table stored in the phoneme-viseme conversion table storage unit 210 to form viseme data consisting of a viseme label and its duration, and outputting a viseme data sequence 234; a viseme data sequence storage unit 214 for storing the viseme data sequence 234 output by the phoneme-viseme conversion unit 212; a viseme-to-tri-viseme conversion unit 216 for reading the viseme data sequence stored in the viseme data sequence storage unit 214, converting the viseme label of each item of viseme data into a tri-viseme label that combines, in this order, the viseme label of the immediately preceding viseme data, the viseme label of the viseme data being processed, and the viseme label of the immediately following viseme data, and outputting the result as a tri-viseme data sequence 236; and a tri-viseme data sequence storage unit 218 for storing the tri-viseme data sequence 236 output by the viseme-to-tri-viseme conversion unit 216. When converting phoneme labels into viseme labels yields consecutive identical viseme labels, the phoneme-viseme conversion unit 212 merges them into a single item of viseme data and sums their durations. The phoneme-viseme conversion unit 212 also calculates the average power of the speech corresponding to each item of viseme data and attaches it to the viseme data as prosodic information.
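The conversion and merging behavior of the phoneme-viseme conversion unit can be sketched as follows. The table contents are hypothetical placeholders (the actual table is the one in FIG. 8), the `Segment` type is invented for illustration, and two things are assumptions: that a mean power is already available per phoneme, and that merged items use a duration-weighted mean of those powers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    label: str          # phoneme label on input, viseme label on output
    duration_ms: float
    mean_power: float


# Hypothetical table for illustration only.
PHONEME_TO_VISEME: Dict[str, str] = {
    "sil": "sil", "a": "A", "i": "I", "m": "M", "b": "M", "p": "M",
}


def phonemes_to_visemes(phonemes: List[Segment], table: Dict[str, str]) -> List[Segment]:
    """Map phoneme labels to viseme labels and merge consecutive identical visemes."""
    visemes: List[Segment] = []
    for ph in phonemes:
        label = table[ph.label]
        if visemes and visemes[-1].label == label:
            prev = visemes[-1]
            total = prev.duration_ms + ph.duration_ms
            if total > 0:
                # Duration-weighted mean power of the merged span (assumption).
                prev.mean_power = (prev.mean_power * prev.duration_ms
                                   + ph.mean_power * ph.duration_ms) / total
            prev.duration_ms = total  # durations are summed, as described above
        else:
            visemes.append(Segment(label, ph.duration_ms, ph.mean_power))
    return visemes
```

For example, the phonemes "m" and "b" both map to the hypothetical viseme "M" here, so an "m" followed directly by a "b" becomes one viseme item whose duration is the sum of the two.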

FIG. 6 outlines the processing performed by the Viterbi alignment unit 206. In the present embodiment, the feature extraction unit 201 calculates MFCCs (mel-frequency cepstral coefficients) from each frame of the speech as the features 230 and supplies them to the Viterbi alignment unit 206. Using the many phoneme HMMs stored in the acoustic model storage unit 202 and the utterance text stored in the utterance text storage unit 204, the Viterbi alignment unit 206 divides the speech into the phoneme sequence corresponding to the utterance text, choosing the segmentation with the highest likelihood, and outputs the phoneme data sequence 232 consisting of the label and duration of each phoneme.
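The idea of choosing the maximum-likelihood segmentation for a known phoneme sequence can be illustrated with a toy dynamic program. This is a deliberately simplified sketch and not the HMM-based alignment of the embodiment: it assumes one state per phoneme, no transition probabilities, and a precomputed table `loglik[t][s]` giving the log-likelihood of frame t under the s-th phoneme of the utterance text. The function name is hypothetical.

```python
from typing import List, Tuple


def forced_align(loglik: List[List[float]], labels: List[str]) -> List[Tuple[str, int]]:
    """Return (label, frame_count) per phoneme for the best left-to-right segmentation."""
    n_frames, n_ph = len(loglik), len(labels)
    if n_ph == 0 or n_frames < n_ph:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_ph for _ in range(n_frames)]
    # back[t][s] is 1 if the best path entered phoneme s at frame t, else 0.
    back = [[0] * n_ph for _ in range(n_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for s in range(n_ph):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + loglik[t][s]
                back[t][s] = 1
            else:
                score[t][s] = stay + loglik[t][s]
    # Trace back from the last phoneme at the last frame, counting frames.
    counts = [0] * n_ph
    s = n_ph - 1
    for t in range(n_frames - 1, -1, -1):
        counts[s] += 1
        if t > 0 and back[t][s] == 1:
            s -= 1
    return list(zip(labels, counts))
```

Multiplying each frame count by the frame interval gives the phoneme duration that the phoneme data sequence carries.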

FIG. 7 shows an example of the tri-viseme data sequence 236 output from the viseme-to-tri-viseme conversion unit 216 and stored in the tri-viseme data sequence storage unit 218. As shown in FIG. 7, each item of tri-viseme data includes a tri-viseme label, a duration in milliseconds, and the average power of the speech over that entire duration. In FIG. 7, the symbol in the center of each tri-viseme label is the viseme label of the viseme data itself. The label to its left, separated by the symbol "-", is the viseme label of the immediately preceding viseme data, and the label to its right, separated by the symbol "+", is the viseme label of the immediately following viseme data. In the figure, "sil" denotes the viseme label corresponding to silence, "sp" denotes the viseme label corresponding to a short pause, and A, R, Y, and so on denote viseme labels corresponding to particular phonemes.
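The label format described here ("previous-current+next") can be produced with a few lines of code. The sketch below is illustrative: the function name is hypothetical, and padding the first and last items with a boundary label is an assumption, since the text does not say how the ends of the sequence are handled.

```python
from typing import List


def to_triviseme_labels(visemes: List[str], boundary: str = "sil") -> List[str]:
    """Build 'prev-cur+next' labels; the sequence ends are padded with `boundary`."""
    labels = []
    for i, cur in enumerate(visemes):
        prev = visemes[i - 1] if i > 0 else boundary
        nxt = visemes[i + 1] if i + 1 < len(visemes) else boundary
        labels.append(f"{prev}-{cur}+{nxt}")
    return labels


if __name__ == "__main__":
    print(to_triviseme_labels(["sil", "A", "R", "sil"]))
```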

FIG. 8 shows an example of the phoneme-viseme conversion table stored in the phoneme-viseme conversion table storage unit 210 shown in FIG. 5. Various other configurations of the phoneme-viseme conversion table are possible. Basically, the phoneme-viseme conversion table associates each phoneme label with the viseme label indicating the shape of the mouth when that phoneme is pronounced. As shown in FIG. 8, in the present embodiment one or more phoneme labels are associated with each viseme label. This is because the mouth shapes for different sounds can be very similar, and in such cases creating the animation with the same mouth shape for the different sounds does not look unnatural.

In FIG. 8, "silB" denotes the silence immediately before an utterance and "silE" denotes the silence immediately after an utterance.

From the above, the speech-viseme corpus stored in the speech-viseme corpus storage unit 62 has the structure shown in FIG. 9. Referring to FIG. 9, the speech-viseme corpus includes speech waveform data 240, a motion vector sequence 242, and a tri-viseme unit sequence 244. Time codes are attached to both the speech waveform data 240 and the motion vector sequence 242. The speech waveform data 240 is not used in the present embodiment.

Each unit in the tri-viseme unit sequence 244 includes a unit ID (identification number) for identifying the unit, the tri-viseme label of the unit, the duration of the viseme of the unit, the average power of the speech corresponding to the viseme, and a time indicating the start position, within the motion vector sequence 242, of the motion vectors corresponding to the viseme. In the present embodiment a tri-viseme unit does not itself contain a motion vector sequence, but by looking up the motion vector sequence 242 with the start position and the duration, the location within the motion vector sequence 242 of the motion vectors belonging to the unit can be determined.
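A unit record of this kind, together with the lookup of its motion vectors by start position and duration, might be modeled as below. The field names, the `frame_period_ms` parameter, and the assumption that the motion vector sequence is uniformly sampled starting at time 0 are all illustrative choices, not details given in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


@dataclass(frozen=True)
class TriVisemeUnit:
    unit_id: int
    label: str          # tri-viseme label, e.g. "sil-A+R"
    duration_ms: float
    mean_power: float
    start_ms: float     # start position within the motion vector sequence


def unit_motion(unit: TriVisemeUnit, motion: Sequence[Frame], frame_period_ms: float) -> List[Frame]:
    """Slice the shared motion vector sequence for one unit.

    Assumes motion[i] is the frame captured at time i * frame_period_ms.
    """
    first = int(round(unit.start_ms / frame_period_ms))
    count = max(1, int(round(unit.duration_ms / frame_period_ms)))
    return list(motion[first:first + count])
```

Keeping only a start time and a duration in each unit avoids copying motion data into every unit, at the cost of one slice per lookup.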

Referring again to FIG. 1, the animation data synthesis device 44 includes: a tri-viseme data sequence creation unit 80, which implements the same function as the tri-viseme data sequence creation unit 144 shown in FIG. 2, for creating a tri-viseme data sequence from the input speech data 42; a tri-viseme unit selection unit 82 for selecting from the speech-viseme corpus storage unit 62, for each item of viseme data in the tri-viseme data sequence created by the tri-viseme data sequence creation unit 80, the tri-viseme unit evaluated as best suited to creating an animation synchronized with the input speech data 42; and a tri-viseme unit concatenation unit 84 for creating animation data by concatenating along the time axis the three-dimensional motion vectors of the facial feature points contained in the tri-viseme units selected by the tri-viseme unit selection unit 82.
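Unit selection followed by concatenation can be sketched as follows. The evaluation function used in the embodiment is not given in this passage, so the weighted sum of absolute differences in duration and mean power below is purely an assumed stand-in, as are the type names, the weights, and the choice to let each unit carry its own motion frames. The sketch neither stretches units to the target durations nor smooths the joins.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Target:
    label: str          # tri-viseme label derived from the input speech
    duration_ms: float
    mean_power: float


@dataclass(frozen=True)
class Unit:
    label: str
    duration_ms: float
    mean_power: float
    motion: List[List[float]] = field(default_factory=list)  # motion vectors per frame


def cost(target: Target, unit: Unit, w_dur: float = 1.0, w_pow: float = 1.0) -> float:
    """Assumed dissimilarity: smaller means duration and power are more alike."""
    return (w_dur * abs(unit.duration_ms - target.duration_ms)
            + w_pow * abs(unit.mean_power - target.mean_power))


def select_unit(target: Target, corpus: List[Unit]) -> Unit:
    """Pick the lowest-cost unit among those whose tri-viseme label matches."""
    candidates = [u for u in corpus if u.label == target.label]
    if not candidates:
        raise LookupError(f"no unit with label {target.label!r}")
    return min(candidates, key=lambda u: cost(target, u))


def synthesize(targets: List[Target], corpus: List[Unit]) -> List[List[float]]:
    """Concatenate the motion of the selected units in target order."""
    frames: List[List[float]] = []
    for t in targets:
        frames.extend(select_unit(t, corpus).motion)
    return frames
```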

図１０に、三つ組視覚素データ列作成部８０の構成の詳細を示す。三つ組視覚素データ列作成部８０は、図５に示す三つ組視覚素データ列作成部１４４と基本的に同じ構成である。 FIG. 10 shows details of the configuration of the triple visual element data string creation unit 80. The triple visual element data string creation unit 80 has basically the same configuration as the triple visual element data string creation unit 144 shown in FIG.

図１０を参照して、三つ組視覚素データ列作成部８０は、入力される音声データ４２を所定フレーム長及び所定フレーム間隔のフレームによってフレーム化するためのフレーム化処理部２８０と、フレーム化処理部２８０により出力される音声データの各フレームから、ＭＦＣＣを特徴量として抽出し、特徴量からなる系列を出力するための特徴量抽出部２８２と、音声データ４２の発話者の音声により学習を行なった音響モデルを記憶するための音響モデル記憶部２８４と、入力される音声データ４２の発話テキストを記憶するための発話テキスト記憶部２８６と、特徴量抽出部２８２により抽出された特徴量の系列に対し、音響モデル記憶部２８４に記憶された音響モデルと、発話テキスト記憶部２８６に記憶された発話テキストとを用いたビタビアライメントを行ない、発話テキストにしたがった音素の音素ラベル及びその継続長を含む音素データの系列（音素データ列）を出力するためのビタビアライメント部２８８とを含む。 Referring to FIG. 10, the triple visual element data string creation unit 80 includes: a framing processing unit 280 for dividing the input audio data 42 into frames of a predetermined frame length at a predetermined frame interval; a feature amount extraction unit 282 for extracting MFCCs as feature amounts from each frame of audio data output by the framing processing unit 280 and outputting a sequence of feature amounts; an acoustic model storage unit 284 for storing an acoustic model trained on the voice of the speaker of the audio data 42; an utterance text storage unit 286 for storing the utterance text of the input audio data 42; and a Viterbi alignment unit 288 for performing Viterbi alignment on the sequence of feature amounts extracted by the feature amount extraction unit 282, using the acoustic model stored in the acoustic model storage unit 284 and the utterance text stored in the utterance text storage unit 286, and outputting a sequence of phoneme data (a phoneme data string), each item of which includes the phoneme label and the duration of a phoneme in the utterance text.

三つ組視覚素データ列作成部８０はさらに、図５に示す音素−視覚素変換テーブル記憶部２１０に記憶されたものと同一の音素−視覚素変換テーブルを記憶するための音素−視覚素変換テーブル記憶部２９０と、ビタビアライメント部２８８により出力される音素データ列に含まれる音素データの各々の音素ラベルを、音素−視覚素変換テーブル記憶部２９０に記憶された音素−視覚素変換テーブルを参照して対応する視覚素ラベルに変換し、視覚素データ列を出力するための音素−視覚素変換部２９２と、音素−視覚素変換部２９２により出力される視覚素データ列を記憶するための視覚素データ列記憶部２９３と、視覚素データ列記憶部２９３に記憶された視覚素データ列を読出し、視覚素データの各々に対し、その視覚素ラベルをその前後の視覚素データの視覚素ラベルと順番に結合して得られる三つ組視覚素ラベルに置換することによって、三つ組視覚素データ列を出力するための視覚素−三つ組視覚素変換部２９４と、視覚素−三つ組視覚素変換部２９４により出力される三つ組視覚素データ列を記憶するための三つ組視覚素データ列記憶部２９５とを含む。図１に示す三つ組視覚素ユニット選択部８２は、三つ組視覚素データ列記憶部２９５から三つ組視覚素データ列２９６を読出すことになる。 The triple visual element data string creation unit 80 further includes: a phoneme-visual element conversion table storage unit 290 for storing the same phoneme-visual element conversion table as that stored in the phoneme-visual element conversion table storage unit 210 shown in FIG. 5; a phoneme-visual element conversion unit 292 for converting the phoneme label of each item of phoneme data in the phoneme data string output by the Viterbi alignment unit 288 into the corresponding visual element label by referring to the conversion table stored in the storage unit 290, and outputting a visual element data string; a visual element data string storage unit 293 for storing the visual element data string output by the phoneme-visual element conversion unit 292; a visual element-triple visual element conversion unit 294 for reading the visual element data string stored in the storage unit 293 and, for each item of visual element data, replacing its visual element label with the triple visual element label obtained by joining it in order with the visual element labels of the preceding and following visual element data, thereby outputting a triple visual element data string; and a triple visual element data string storage unit 295 for storing the triple visual element data string output by the conversion unit 294. The triple visual element unit selection unit 82 shown in FIG. 1 reads the triple visual element data string 296 from the triple visual element data string storage unit 295.
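The two conversions performed by units 292 and 294 can be sketched as a table lookup followed by a sliding window over neighboring labels. The sketch below is illustrative only: the table contents, the hyphen-joined label format, and the `boundary` label used to pad the first and last items are assumptions, since the specification does not state how the ends of the string are handled.

```python
from typing import Dict, List, Tuple


def to_triple_visemes(phonemes: List[Tuple[str, float]],
                      table: Dict[str, str],
                      boundary: str = "sil") -> List[Tuple[str, float]]:
    """Map (phoneme label, duration) pairs to (triple viseme label, duration) pairs."""
    # Phoneme -> visual element conversion (role of unit 292).
    visemes = [(table[p], d) for p, d in phonemes]
    # Pad both ends so the first and last items also have two neighbors (assumption).
    labels = [boundary] + [v for v, _ in visemes] + [boundary]
    # Visual element -> triple visual element conversion (role of unit 294).
    return [(f"{labels[i]}-{labels[i + 1]}-{labels[i + 2]}", d)
            for i, (_, d) in enumerate(visemes)]
```

Each output item keeps the duration of its center visual element, so the timing obtained from the alignment is carried through unchanged.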

図１１に、図１に示す三つ組視覚素データ列作成部８０から三つ組視覚素ユニット選択部８２に渡される三つ組視覚素データ列２９６の構成を示す。図１１を参照して、この三つ組視覚素データ列２９６に含まれる三つ組視覚素データの各々は、入力される音声データ４２中の視覚素データの順序を示すシーケンス番号と、三つ組視覚素ラベルと、視覚素の継続長と、この視覚素データに対応する音声の平均パワーとを含む。 FIG. 11 shows the configuration of the triple visual element data string 296 passed from the triple visual element data string creation unit 80 shown in FIG. 1 to the triple visual element unit selection unit 82. Referring to FIG. 11, each of the triple visual element data included in this triple visual element data sequence 296 includes a sequence number indicating the order of visual elementary data in the input audio data 42, a triple visual element label, It includes the duration of the visual element and the average power of the audio corresponding to this visual element data.

図１２に、三つ組視覚素ユニット選択部８２より出力される三つ組視覚素データ列３００の構成を示す。図１２を参照して、この三つ組視覚素データ列３００に含まれる三つ組視覚素データは、図１１に示す三つ組視覚素データ列と同様の構成を持つが、アニメーションデータを生成するために最適であると三つ組視覚素ユニット選択部８２により評価され、音声−視覚素コーパス記憶部６２から選択された視覚素ユニットを識別するための選択ユニットＩＤをさらに含んでいる。この選択ユニットＩＤは、図９に示す三つ組視覚素ユニット列２４４の左端の「ユニットＩＤ」に相当する。このユニットＩＤがあれば、音声−視覚素コーパス記憶部６２を参照して、三つ組視覚素ユニット列２４４の中の対応する三つ組視覚素ユニットの「開始時間」及び「継続長」のデータを用いて動きベクトル列２４２からこのユニットに属する動きベクトル系列を抽出できる。 FIG. 12 shows a configuration of a triple visual element data string 300 output from the triple visual element unit selection unit 82. Referring to FIG. 12, the triple visual element data included in the triple visual element data string 300 has the same configuration as the triple visual element data string shown in FIG. 11, but is optimal for generating animation data. And a selection unit ID for identifying the visual element unit selected from the speech-visual element corpus storage unit 62. This selection unit ID corresponds to the “unit ID” at the left end of the triple visual element unit row 244 shown in FIG. If this unit ID exists, the audio-visual element corpus storage unit 62 is referred to, and data of “start time” and “continuation length” of the corresponding triplet visual element unit in the triplet visual element unit column 244 is used. A motion vector sequence belonging to this unit can be extracted from the motion vector sequence 242.

図１３に、コンピュータを三つ組視覚素ユニット選択部８２として機能させるためのコンピュータプログラムの制御構造をフローチャート形式で示す。図１３を参照して、この機能ブロックでは、ステップ３１０において、三つ組視覚素データ列作成部８０により出力される三つ組視覚素データ列のうち、読出ポインタ位置にある三つ組視覚素データを読む。ステップ３１２では、読出ポインタ位置が、読込むべき三つ組視覚素データ列の終了位置に達したか否かを判定する。達していれば処理を終了する。終了位置に達していなければステップ３１４に進む。 FIG. 13 is a flowchart showing the control structure of a computer program for causing a computer to function as the triple visual element unit selection unit 82. Referring to FIG. 13, in this functional block, in step 310, the triple visual element data at the read pointer position is read from the triple visual element data string output by the triple visual element data string creation unit 80. In step 312, it is determined whether or not the read pointer position has reached the end position of the triple visual element data string to be read. If it has reached, the process is terminated. If the end position has not been reached, the process proceeds to step 314.

ステップ３１４では、ステップ３１０で読んだ三つ組視覚素データに含まれる三つ組視覚素ラベルと一致する三つ組視覚素ラベルを持つ視覚素ユニットが音声−視覚素コーパス記憶部６２に記憶された音声−視覚素コーパス内に存在しているか否かを判定する。そのような視覚素ユニットが音声−視覚素コーパス内に存在していればステップ３１８に進み、なければステップ３１６に進む。 In step 314, the speech-visual element corpus in which the visual element unit having the triple visual element label that matches the triple visual element label included in the triple visual element data read in step 310 is stored in the audio-visual element corpus storage unit 62. It is determined whether it exists in the inside. If such a visual element unit is present in the audio-visual corpus, the process proceeds to step 318, and if not, the process proceeds to step 316.

ステップ３１８では、ステップ３１４で見つけられた三つ組視覚素ユニットを全て音声−視覚素コーパスから読み出す。この後ステップ３２０に進む。 In step 318, all triple visual element units found in step 314 are read from the speech-visual element corpus. Thereafter, the process proceeds to step 320.

一方、ステップ３１６では、ステップ３１０で読んだ三つ組視覚素データに含まれる三つ組視覚素ラベルのうち、前半の二つ組視覚素ラベル、又は後半の二つ組視覚素ラベルと一致するような二つ組視覚素ラベルを三つ組視覚素ラベルの前半又は後半に持つ三つ組視覚素ユニットを音声−視覚素コーパスから全て読み出す。この後、ステップ３２０に進む。 On the other hand, in step 316, all triple visual element units whose triple visual element label has, in its first half or second half, a pair of visual element labels matching the first-half pair or the second-half pair of the triple visual element label contained in the triple visual element data read in step 310 are read from the speech-visual element corpus. Thereafter, the process proceeds to step 320.

ステップ３２０では、ステップ３１６又はステップ３１８で読み出された三つ組視覚素ユニットの全てについて、以下の式によりコストＣを計算する。 In step 320, the cost C is calculated for all of the triple visual element units read in step 316 or 318 by the following formula.

$$C = w_{TD}\,\lvert TD - UD \rvert + w_{TP}\,\lvert TP - UP \rvert$$

ただし、ＴＤ及びＴＰは、ステップ３１０で読んだ三つ組視覚素データに含まれる視覚素継続時間及び平均パワーであり、ＵＤ及びＵＰは、コスト計算の対象となっている三つ組視覚素ユニットに含まれる視覚素の継続時間及び平均パワーであり、ｗ_ＴＤ及びｗ_ＴＰはそれぞれ継続時間の差及び平均パワーの差に対して割当てられる重みである。重みｗ_ＴＤ及びｗ_ＴＰは、話者の相違などの条件によって異なるため、主観的テストによって決定する必要があるが、例えばｗ_ＴＤ＝ｗ_ＴＰとしてもよい。また、この状態からｗ_ＴＤ及びｗ_ＴＰの値を少しずつ変えることにより、これらの値の好ましい組合せを徐々に求めるようにしてもよい。 Here, TD and TP are the visual element duration and average power contained in the triple visual element data read in step 310; UD and UP are the duration and average power of the visual element contained in the triple visual element unit whose cost is being calculated; and w_TD and w_TP are the weights assigned to the duration difference and the average power difference, respectively. Because the appropriate weights w_TD and w_TP vary with conditions such as differences between speakers, they need to be determined by subjective testing; for example, w_TD = w_TP may be used. A preferable combination may then be found gradually by changing the values of w_TD and w_TP little by little from that state.

ここで算出するコストＣは、処理中の三つ組視覚素データに対する顔の特徴点の動きベクトルを与える三つ組視覚素ユニットとして最適なものを、視覚素に対応する音声の韻律的特徴を用いて評価するための評価関数である。上の式から分かるように、本実施の形態では、コスト関数として、視覚素の継続長（これは視覚素に対応する音声の継続長に等しい。）の差の絶対値と、視覚素に対応する音声の平均パワーの差の絶対値の線形和を用いる。この他にもコスト関数としては種々のものが考えられる。三つ組視覚素データ及び三つ組視覚素ユニットの構成を定める際には、コスト関数としてどのような情報を用いるかを検討し、必要なデータを保存するようにしなければならない。 The cost C calculated here is evaluated using the prosodic feature of the speech corresponding to the visual element, which is optimal as the triple visual element unit that gives the motion vector of the facial feature point with respect to the triple visual element data being processed. Is an evaluation function. As can be seen from the above formula, in the present embodiment, as a cost function, the absolute value of the difference between the durations of the visual elements (which is equal to the duration of the speech corresponding to the visual elements) and the visual elements are supported. The linear sum of the absolute values of the difference in the average power of the sound to be used is used. Various other cost functions are conceivable. When determining the structure of triple visual element data and triple visual element units, it is necessary to consider what information is used as a cost function and to store necessary data.

ステップ３２２では、ステップ３２０で計算されたコストの最小値を求め、最小値を与えた三つ組視覚素ユニットを選択する。この三つ組視覚素ユニットが、処理中の三つ組視覚素データに対する最適な動きベクトル列を与えるものとして選択される。この後、制御はステップ３１０に戻り、次の三つ組視覚素データに対する処理を実行する。 In step 322, the minimum value of the cost calculated in step 320 is obtained, and the triple visual element unit giving the minimum value is selected. This triplet visual element unit is selected as giving the optimal motion vector sequence for the triplet visual element data being processed. Thereafter, the control returns to step 310 to execute processing for the next triplet visual element data.
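Steps 314 through 322 can be sketched as a candidate search followed by a minimum-cost choice. The sketch is hedged: the `Entry` class and function names are hypothetical, the cost is the linear sum of absolute differences described in the text, and the fallback compares pairs position by position (first half with first half, second half with second half), which is one possible reading of step 316. Returning `None` when no candidate exists is also an assumption, since the text does not cover that case.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Label = Tuple[str, str, str]


@dataclass
class Entry:
    """Target data item or corpus unit (hypothetical field names)."""
    label: Label
    duration: float
    mean_power: float
    unit_id: Optional[int] = None


def cost(target: Entry, unit: Entry, w_td: float = 1.0, w_tp: float = 1.0) -> float:
    # Linear sum of the absolute duration and average power differences.
    return (w_td * abs(target.duration - unit.duration)
            + w_tp * abs(target.mean_power - unit.mean_power))


def candidates(label: Label, corpus: List[Entry]) -> List[Entry]:
    exact = [u for u in corpus if u.label == label]           # steps 314 and 318
    if exact:
        return exact
    return [u for u in corpus                                  # step 316 fallback
            if u.label[:2] == label[:2] or u.label[1:] == label[1:]]


def select_unit(target: Entry, corpus: List[Entry],
                w_td: float = 1.0, w_tp: float = 1.0) -> Optional[Entry]:
    pool = candidates(target.label, corpus)
    if not pool:
        return None                                            # not covered by the text
    return min(pool, key=lambda u: cost(target, u, w_td, w_tp))  # steps 320 and 322
```

Equal weights are used as defaults here only because the text mentions w_TD = w_TP as a possible starting point.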

図１３に示すフローチャートに対応する制御構造を有するコンピュータプログラムにより、図１に示す三つ組視覚素ユニット選択部８２を実現することができる。 The triple visual element unit selection unit 82 shown in FIG. 1 can be realized by a computer program having a control structure corresponding to the flowchart shown in FIG.

次に、図１に示す三つ組視覚素ユニット連結部８４の機能について説明する。以上述べたように、図１に示す三つ組視覚素ユニット選択部８２により、入力される音声データ４２に対応する三つ組視覚素ユニット列が選択される。これら三つ組視覚素ユニット列をそのまま時間軸上で連結すると、ユニットとユニットとの間で各特徴点の位置のずれが生じたり、ユニットとユニットとの間で時間軸上でのギャップ又は重複が生じたりするために、画像が不自然なものになってしまう。三つ組視覚素ユニット連結部８４は、そのような特徴点の位置のずれを解消させながら三つ組視覚素ユニットを時間軸上で連結する機能を持つ。 Next, the function of the triple visual element unit connecting portion 84 shown in FIG. 1 will be described. As described above, the triple visual element unit selection unit 82 shown in FIG. 1 selects the triple visual element unit sequence corresponding to the input audio data 42. If these triplet visual element unit sequences are connected as they are on the time axis, the position of each feature point may shift between units, or a gap or overlap between units may occur on the time axis. The image becomes unnatural. The triple visual element unit connecting unit 84 has a function of connecting the triple visual element units on the time axis while eliminating such positional shift of the feature points.

図１７に、三つ組視覚素ユニット連結部８４による動きベクトルの連結方法を示す。図１７（Ａ）を参照して、ある三つ組視覚素ユニットにおけるある特徴点Ｍ１の軌跡４３０と、後続する三つ組視覚素ユニットにおける対応する特徴点Ｍ１’の軌跡４３２とを、その両端で時間Ｔだけ重複させる。そして、この時間におけるこの特徴点の軌跡を、以下の式に従い平滑化して算出し、なめらかな軌跡４４０を生成する。 FIG. 17 shows a motion vector connection method by the triple visual element unit connection unit 84. Referring to FIG. 17 (A), a trajectory 430 of a certain feature point M1 in a certain triplet visual element unit and a trajectory 432 of a corresponding feature point M1 ′ in a subsequent triplet visual element unit are obtained only at time T at both ends thereof. Duplicate. Then, the trajectory of the feature point at this time is calculated by smoothing according to the following formula, and a smooth trajectory 440 is generated.

ただし、Ｍは平滑化後の特徴点の動きベクトル、Ｍ_１及びＭ_１’はそれぞれ平滑化前の、先行及び後続する三つ組視覚素ユニットの特徴点の動きベクトル、ｔは重複区間Ｔの先頭からの経過時間を示す。三つ組視覚素ユニット連結部８４は、これ以外の区間では、その三つ組視覚素ユニットの動きベクトル列をそのまま出力する。 Here, M is the motion vector of the feature point after smoothing, M_1 and M_1' are the motion vectors of the feature point in the preceding and succeeding triple visual element units, respectively, before smoothing, and t is the time elapsed from the start of the overlap section T. In all other sections, the triple visual element unit connecting unit 84 outputs the motion vector sequence of each triple visual element unit as it is.

なお、このような連結を行なうと、各三つ組視覚素ユニットの継続長は実質的にＴだけ短縮されることになるので、それを防ぐため、各三つ組視覚素ユニットの一端（例えば後端）をＴだけ延長する。図９に示すような音声−視覚素コーパス記憶部６２の構造を採用することにより、そのような三つ組視覚素ユニットの延長は簡単に行なえる。 If such a connection is made, the duration of each triplet visual element unit is substantially shortened by T. To prevent this, one end (for example, the rear end) of each triplet visual element unit is connected. Extend by T. By adopting the structure of the audio-visual element corpus storage unit 62 as shown in FIG. 9, such a triple visual element unit can be easily extended.
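The connection procedure can be illustrated with a simple cross-fade over the overlap section. The smoothing formula is not reproduced in this text, so the linear weighting below is purely an assumption chosen for illustration, not the formula of the specification. Trajectories are represented as lists of one coordinate per frame, and `overlap` is the length of the overlap section T in frames; both choices are also assumptions.

```python
from typing import List


def crossfade_join(prev: List[float], nxt: List[float], overlap: int) -> List[float]:
    """Join two trajectories, blending the last `overlap` frames of `prev`
    with the first `overlap` frames of `nxt` (assumed linear weights)."""
    if overlap <= 0:
        return list(prev) + list(nxt)
    if overlap > min(len(prev), len(nxt)):
        raise ValueError("overlap longer than one of the trajectories")
    head = list(prev[:-overlap])
    tail = list(nxt[overlap:])
    blended = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # weight of the succeeding unit grows with time
        blended.append((1.0 - w) * prev[len(prev) - overlap + k] + w * nxt[k])
    return head + blended + tail
```

The joined trajectory is `overlap` frames shorter than the two inputs combined, which is why the text extends one end of each unit by T before connecting.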

三つ組視覚素ユニット連結部８４は、このような連結を、三つ組視覚素ユニット選択部８２から出力される三つ組視覚素ユニット列内の連結部の全てについて、全ての特徴点に対して行なう。その結果、所定の周期ごとに、顔の特徴点の全てについての動きベクトルを持ったデータ系列が得られる。このデータを本明細書ではアニメーションデータと呼ぶ。アニメーションデータはアニメーションデータ記憶部４６により格納される。 The triple visual element unit connecting unit 84 performs such connection with respect to all feature points for all of the connecting parts in the triple visual element unit row output from the triple visual element unit selecting unit 82. As a result, a data series having motion vectors for all the facial feature points is obtained for each predetermined period. This data is called animation data in this specification. The animation data is stored in the animation data storage unit 46.

アニメーション作成装置４８は、アニメーションデータ記憶部４６に記憶されたアニメーションデータと、予め準備された、アニメーションキャラクタの顔モデルとからアニメーションを作成する機能を持つ。 The animation creation device 48 has a function of creating an animation from animation data stored in the animation data storage unit 46 and a face model of the animation character prepared in advance.

図１を参照して、アニメーション作成装置４８は、アニメーションキャラクタの顔モデルを記憶するための顔モデル記憶部９０を含む。本実施の形態では、顔モデル記憶部９０に記憶された顔モデルは、多数の多角形（ポリゴン）によって、静止状態における所定の顔の形状を表現した形状モデルを利用する。この顔モデルに基づき、アニメーションデータ記憶部４６に格納されたアニメーションデータを利用してアニメーションを作成するためには、この顔モデルと、アニメーションデータに対応する特徴点の位置との対応付け（マッピング）を予め行なっておく必要がある。本実施の形態では、顔モデルに手作業でこの特徴点の位置をマッピングするものとし、顔モデル上の特徴点の位置を「仮想マーカ」と呼ぶマーカにより示すものとする。 Referring to FIG. 1, the animation creation device 48 includes a face model storage unit 90 for storing a face model of an animation character. In the present embodiment, the face model stored in the face model storage unit 90 uses a shape model that expresses a predetermined face shape in a stationary state by a large number of polygons (polygons). In order to create an animation using the animation data stored in the animation data storage unit 46 based on the face model, the face model is associated (mapped) with the position of the feature point corresponding to the animation data. Must be performed in advance. In the present embodiment, the position of the feature point is manually mapped to the face model, and the position of the feature point on the face model is indicated by a marker called “virtual marker”.

図１４に、顔モデル３３０及び仮想マーカの一例を示す。図１４を参照して、顔モデル３３０を構成するポリゴンの辺（図１４の三角形の辺を構成する黒い線）をエッジ、エッジ同士の交点を顔モデル３３０におけるノードと呼ぶ。図１４には、仮想ノードのマッピング例を、記号○と＋マークとを組合せた記号３３２として示してある。 FIG. 14 shows an example of the face model 330 and the virtual markers. Referring to FIG. 14, the sides of the polygons constituting the face model 330 (the black lines forming the sides of the triangles in FIG. 14) are called edges, and the points where edges meet are called nodes of the face model 330. In FIG. 14, an example of virtual marker mapping is shown by symbols 332, each combining a circle and a plus sign.

顔には、目・口・鼻の穴のように顔面を構成しない切れ目がある。一般に、これらの切れ目は、顔モデル３３０の一部としてはモデリングされない。すなわち、切れ目にはポリゴンを定義しないか、切れ目は、顔モデル３３０とは別のオブジェクトとして定義される。したがって、切れ目と顔面との間は境界エッジで仕切られる。境界エッジとは、二つのポリゴンによって共有されていないようなエッジのことを言う。 The face has cuts that do not constitute the face, such as the eyes, mouth, and nostrils. In general, these breaks are not modeled as part of the face model 330. That is, polygons are not defined for the cuts, or the cuts are defined as objects different from the face model 330. Therefore, the cut and the face are partitioned by the boundary edge. A boundary edge is an edge that is not shared by two polygons.
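The definition of a boundary edge given above (an edge that is not shared by two polygons) translates directly into a counting pass over the mesh. The following is a minimal sketch that assumes the face model is given as a list of triangles, each a tuple of three node indices; that representation is an assumption made for illustration.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set, Tuple


def boundary_edges(triangles: Iterable[Tuple[int, int, int]]) -> Set[FrozenSet[int]]:
    """Return the edges that belong to exactly one triangle."""
    counts: Counter = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[frozenset((u, v))] += 1  # undirected edge
    return {edge for edge, n in counts.items() if n == 1}
```

For two triangles that share one side, the shared side is counted twice and is therefore excluded, while the four outer sides are reported as boundary edges.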

再び図１を参照して、アニメーション作成装置４８はさらに、アニメーションデータ記憶部４６に格納されたアニメーションデータのうち、アニメーションの各フレームに相当する時刻のデータを読出し、顔モデル記憶部９０に格納された顔モデルを、読出されたアニメーションデータ内の動きベクトルにしたがって変形させて出力するための顔モデル変形部９２と、顔モデルに対するレンダリングによりキャラクタのアニメーションを作成するための、顔のテクスチャデータ、照明位置、カメラ位置などの設定を記憶するためのレンダリングデータ記憶部９４と、顔モデル変形部９２により出力される各フレームの顔モデルに対し、レンダリングデータ記憶部９４に記憶されたレンダリングのためのデータを用いてレンダリングを行ない、アニメーションのフレームごとに出力しアニメーション記憶部９８に記憶させるためのレンダリング部９６とを含む。 Referring again to FIG. 1, the animation creation device 48 further includes: a face model deforming unit 92 for reading, from the animation data stored in the animation data storage unit 46, the data at the time corresponding to each frame of the animation, deforming the face model stored in the face model storage unit 90 in accordance with the motion vectors in the read animation data, and outputting the deformed face model; a rendering data storage unit 94 for storing settings used to create the character animation by rendering the face model, such as face texture data, lighting position, and camera position; and a rendering unit 96 for rendering the face model of each frame output by the face model deforming unit 92 using the rendering data stored in the rendering data storage unit 94, and outputting the result frame by frame for storage in the animation storage unit 98.

顔モデル３３０により表現される顔の形状は、アニメーションのキャラクタの顔の基本形状を示すものであり、ユーザにより創作される任意のものでよい。ただし、前述したとおり、動きベクトルを用いて顔モデル３３０に表情を付与するには、顔モデル３３０により表現される形状のどの部分が特徴点に対応しているかを定義する必要がある。仮想マーカ３３２によってそうした対応が示される。図１に示す顔モデル記憶部９０には、顔モデルの各ノードの３次元位置データだけでなく、各仮想マーカの位置と、それら仮想マーカと特徴点との対応関係も記憶されている。 The shape of the face represented by the face model 330 indicates the basic shape of the face of the animation character, and may be any created by the user. However, as described above, in order to give a facial expression to the face model 330 using the motion vector, it is necessary to define which part of the shape expressed by the face model 330 corresponds to the feature point. A virtual marker 332 indicates such a correspondence. The face model storage unit 90 shown in FIG. 1 stores not only the three-dimensional position data of each node of the face model, but also the positions of the virtual markers and the correspondence between the virtual markers and feature points.

顔モデル変形部９２は、以下のようにして顔モデル記憶部９０に記憶された顔モデルの変形を行なう。基本的には顔モデル変形部９２は、読み出したアニメーションデータごとに、顔モデルを構成する全てのノードに対して以下の処理（マーカラベリング処理と呼ぶ。）を行なう。すなわち、顔モデル変形部９２は、顔モデルのノードの各々に対し、そのノードからの距離が最も近い仮想マーカを、仮想マーカの座標に基づき選択する。顔モデル変形部９２は、選択された仮想マーカが、処理中のノードに対応付ける仮想マーカとして適切か否かを判定する。適切であれば選択マーカをこのノードに対応するマーカとして採用し、不適切であれば棄却する。このような処理を繰返し、顔モデルの一つのノードに対し所定数ｎ（例えばｎ＝３）の仮想マーカを採用する。本明細書では、あるノードに対し採用された仮想マーカを、当該ノードの「対応マーカ」と呼ぶ。 The face model deforming unit 92 deforms the face model stored in the face model storage unit 90 as follows. Basically, the face model deforming unit 92 performs the following process (referred to as marker labeling process) on all the nodes constituting the face model for each read animation data. That is, the face model deforming unit 92 selects, for each face model node, a virtual marker having the closest distance from the node based on the coordinates of the virtual marker. The face model deformation unit 92 determines whether or not the selected virtual marker is appropriate as a virtual marker associated with the node being processed. If it is appropriate, the selected marker is adopted as a marker corresponding to this node, and if it is inappropriate, it is rejected. Such processing is repeated, and a predetermined number n (for example, n = 3) of virtual markers is adopted for one node of the face model. In this specification, a virtual marker adopted for a certain node is called a “corresponding marker” of the node.

なお、本実施の形態では、選択マーカの対応マーカとしての適／不適を判断する際の基準に、顔モデルの境界エッジを利用する。 In the present embodiment, the boundary edge of the face model is used as a reference when determining whether the selected marker is appropriate as a corresponding marker.

このマーカラベリング処理により、顔モデルの各ノードに対応マーカが関係付けられると、アニメーションデータから得られる、対応マーカに対応する特徴点の動きベクトルの値の内挿により、そのノードの三次元位置座標が計算される。この計算方法については後述する。 When a corresponding marker is related to each node of the face model by this marker labeling process, the three-dimensional position coordinates of that node are obtained by interpolation of the motion vector value of the feature point corresponding to the corresponding marker obtained from the animation data. Is calculated. This calculation method will be described later.

図１５に、顔モデル変形部９２により実行されるマーカラベリング処理のプログラムの制御構造をフローチャートで示す。図１５を参照して、マーカラベリング処理では、ステップ３４０Ａとステップ３４０Ｂとで囲まれた、ステップ３４２からステップ３５４までの処理を、顔モデル３３０の各ノードに対して実行する。 FIG. 15 is a flowchart showing a control structure of a marker labeling process program executed by the face model deforming unit 92. Referring to FIG. 15, in the marker labeling process, the process from step 342 to step 354 surrounded by step 340 A and step 340 B is executed for each node of face model 330.

ステップ３４２では、処理対象のノードから仮想マーカまでの距離をそれぞれ算出する。さらに仮想マーカをこの距離の昇順でソートしたものをリストにする。 In step 342, the distance from the node being processed to each virtual marker is calculated, and the virtual markers are then sorted in ascending order of this distance to form a list.

ステップ３４４では、以下の繰返しを制御するための変数ｉ及び対応マーカとして採用したマーカの数を表す変数ｊに０を代入する。ステップ３４６では、変数ｉに１を加算する。 In step 344, 0 is substituted into a variable i for controlling the following repetition and a variable j representing the number of markers employed as corresponding markers. In step 346, 1 is added to the variable i.

ステップ３４７では、変数ｉの値が仮想マーカの数Ｍmaxを超えているか否かを判定する。変数ｉの値が仮想マーカの数Ｍmaxを超えていればエラーとし、処理を終了する。これは、全ての仮想マーカを調べても、対応マーカとして採用されたものが３つに満たなかった場合に生ずる。普通このようなことはないが、念のためにこのようなエラー処理を設けておく。変数ｉの値が仮想マーカの数Ｍmax以下であれば制御はステップ３４８に進む。 In step 347, it is determined whether or not the value of the variable i exceeds the number Mmax of virtual markers. If the value of the variable i exceeds the number Mmax of virtual markers, an error occurs and the process is terminated. This occurs when all the virtual markers are examined, but less than three are used as corresponding markers. Normally, this is not the case, but such error handling is provided just in case. If the value of variable i is less than or equal to the number of virtual markers Mmax, control proceeds to step 348.

ステップ３４８では、リストの先頭から変数ｉで示される位置に存在する仮想マーカ（以下これを「マーカ（ｉ）」と呼ぶ。）と処理対象のノードとを結ぶ線分が、顔モデル３３０におけるいずれの境界エッジも横切らない、という制約条件を充足しているか否かを判定する。当該線分が境界エッジのいずれかを横切るものであれば、ステップ３４４に戻る。さもなければステップ３５０に進む。 In step 348, the line segment connecting the virtual marker (hereinafter referred to as “marker (i)”) existing at the position indicated by the variable i from the head of the list and the node to be processed is It is determined whether or not the boundary condition that the boundary edge is not crossed is satisfied. If the line segment crosses one of the boundary edges, the process returns to step 344. Otherwise, go to step 350.

ステップ３５０では、この時点でのマーカ（ｉ）を処理対象のノードの対応マーカの一つに指定する。そしてマーカ（ｉ）を示す情報を、処理対象のノードのマーカ・ノード対応情報として保存する。この後制御はステップ３５２に進む。ステップ３５２では、変数ｊに１を加算する。ステップ３５４では、変数ｊの値が３となっているか否かを判定する。変数ｊの値が３であればステップ３４０Ｂに進む。さもなければステップ３４４に進む。 In step 350, the marker (i) at this point is designated as one of the corresponding markers of the processing target node. Information indicating the marker (i) is stored as marker / node correspondence information of the processing target node. Thereafter, the control proceeds to step 352. In step 352, 1 is added to the variable j. In step 354, it is determined whether or not the value of the variable j is 3. If the value of the variable j is 3, the process proceeds to step 340B. Otherwise, go to step 344.

上記したように、処理対象のノードと仮想マーカとを結ぶ線分が顔モデルの境界エッジを横切るものは、処理対象のノードに対応する仮想マーカから除外される。これは以下の理由による。例えば上唇と下唇とのように、間に境界エッジが存在する場合を考える。この場合、実際の顔では、上唇に位置するノードと、下唇に位置するノードとに相当する位置は互いに異なる動きをする。したがって、例えば上唇のノードの移動量を算出する際に、下唇に存在するマーカの移動量を用いることは適当ではない。線分がある境界エッジを横切っているか否かは、例えば、その境界エッジが顔モデルを構成するポリゴンのうち二つによって共有されているか、一つのみに属しているかによって判定する。 As described above, a virtual marker is excluded from the corresponding markers of the node being processed if the line segment connecting the node and the marker crosses a boundary edge of the face model. The reason is as follows. Consider a case where a boundary edge lies between two regions, such as between the upper lip and the lower lip. On an actual face, the positions corresponding to a node on the upper lip and a node on the lower lip move differently from each other. It is therefore inappropriate, for example, to use the movement of a marker on the lower lip when calculating the movement of a node on the upper lip. Whether an edge crossed by the line segment is a boundary edge is determined, for example, by whether that edge is shared by two of the polygons constituting the face model or belongs to only one.

図１６に、顔モデル３３０における唇周辺のポリゴンと仮想マーカとを示す。以下、図１６を参照して、あるノードの対応マーカを特定する方法について具体例を用いて説明する。図１６を参照して、顔モデル３３０（図１４参照）の唇周辺には、多数の三角形ポリゴンが存在する。各ポリゴンは、三つのエッジに囲まれている。上唇と下唇の間には境界エッジ４００が存在する。境界エッジ４００は、顔モデル３３０と切れ目との境界、又は顔モデル３３０の外縁にあたる。そのため、境界エッジ以外のエッジは二つのポリゴンに共有されるが、境界エッジ４００に該当するエッジは共有されない。 FIG. 16 shows polygons around the lips and virtual markers in the face model 330. Hereinafter, a method for specifying the corresponding marker of a certain node will be described with reference to FIG. 16 using a specific example. Referring to FIG. 16, there are a large number of triangular polygons around the lips of face model 330 (see FIG. 14). Each polygon is surrounded by three edges. A boundary edge 400 exists between the upper lip and the lower lip. The boundary edge 400 corresponds to the boundary between the face model 330 and the cut or the outer edge of the face model 330. Therefore, edges other than the boundary edge are shared by the two polygons, but the edge corresponding to the boundary edge 400 is not shared.

既に説明したように、顔モデル変形部９２は、顔モデル３３０を構成するノードの中から処理対象のノードを一つ選択する。図１６において、ノード４１０が処理対象のノードとして選択されたものとする。ノード４１０の近隣には、仮想マーカ４１２Ａ，…，４１２Ｅが存在するものとする。顔モデル変形部９２は、ノード４１０の座標と、仮想マーカ４１２Ａ，…，４１２Ｅの座標とをもとに、ノード４１０と仮想マーカとの間の距離をそれぞれ算出する。そして、仮想マーカの中から、ノード４１０に最も近い位置にある仮想マーカ４１２Ａを選択する。 As already described, the face model deforming unit 92 selects one node to be processed from the nodes constituting the face model 330. In FIG. 16, assume that the node 410 has been selected as the node to be processed and that virtual markers 412A, ..., 412E exist in the vicinity of the node 410. The face model deforming unit 92 calculates the distance between the node 410 and each virtual marker based on the coordinates of the node 410 and the coordinates of the virtual markers 412A, ..., 412E. It then selects, from among the virtual markers, the virtual marker 412A located closest to the node 410.

続いて、顔モデル変形部９２は、ノード４１０と仮想マーカ４１２Ａ，…，４１２Ｅとを結ぶ線分４１４Ａ，…，４１４Ｅが境界エッジ４００を横切るか否かを検査する。すなわち、まずノード４１０と仮想マーカ４１２Ａとを結ぶ線分４１４Ａが境界エッジ４００を横切るか否かを検査する。図１６に示す例では、この線分４１４Ａは、境界エッジ４００を横切らない。そのため顔モデル変形部９２は、仮想マーカ４１２Ａをノード４１０の対応マーカの一つとする。そして、仮想マーカの中から、仮想マーカ４１２Ａの次にノード４１０に近い位置にある仮想マーカ４１２Ｂを選択し検査を行なう。ノード４１０と仮想マーカ４１２Ｂとを結ぶ線分４１４Ｂは、境界エッジ４００を横切っている。そのため、仮想マーカ４１２Ｂはノード４１０の対応マーカからは除外される。 Next, the face model deforming unit 92 checks whether the line segments 414A, ..., 414E connecting the node 410 to the virtual markers 412A, ..., 412E cross the boundary edge 400. First, it checks whether the line segment 414A connecting the node 410 and the virtual marker 412A crosses the boundary edge 400. In the example shown in FIG. 16, the line segment 414A does not cross the boundary edge 400, so the face model deforming unit 92 makes the virtual marker 412A one of the corresponding markers of the node 410. It then selects the virtual marker 412B, which is the next closest to the node 410 after the virtual marker 412A, and performs the same check. The line segment 414B connecting the node 410 and the virtual marker 412B crosses the boundary edge 400, so the virtual marker 412B is excluded from the corresponding markers of the node 410.

顔モデル変形部９２は、以上のような動作を所定数（３個）の対応マーカが選択されるまで繰返し、ノード４１０の対応マーカ（図１６に示す例では仮想マーカ４１２Ａ、４１２Ｄ、及び４１２Ｅ）を選択する。 The face model deforming unit 92 repeats the above operation until a predetermined number (three) of corresponding markers are selected, and the corresponding markers of the node 410 (virtual markers 412A, 412D, and 412E in the example shown in FIG. 16). Select.
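The marker labeling procedure walked through above can be sketched as a nearest-first scan that skips markers separated from the node by a boundary edge. This is a simplified illustration under stated assumptions: coordinates are two-dimensional, boundary edges are given as pairs of end points, and a strict segment intersection test stands in for the crossing check, whose exact geometry is not given in the text. All function and parameter names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def _orient(p: Point, q: Point, r: Point) -> float:
    """Signed area test: positive if r lies to the left of the line p -> q."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True only for a proper crossing; touching at an end point does not count."""
    return (_orient(q1, q2, p1) * _orient(q1, q2, p2) < 0
            and _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0)


def corresponding_markers(node: Point,
                          markers: Dict[str, Point],
                          boundary: Sequence[Tuple[Point, Point]],
                          n: int = 3) -> List[str]:
    """Pick the n nearest markers whose connecting segment crosses no boundary edge."""
    ranked = sorted(markers, key=lambda name: math.dist(node, markers[name]))
    chosen: List[str] = []
    for name in ranked:
        if any(segments_cross(node, markers[name], a, b) for a, b in boundary):
            continue  # marker lies across a cut such as the mouth opening
        chosen.append(name)
        if len(chosen) == n:
            return chosen
    raise ValueError("fewer than n usable markers for this node")
```

With a node on the upper lip and a horizontal boundary edge for the mouth opening, markers below the mouth are skipped even when they are nearer than markers on the cheek, which mirrors the example of FIG. 16.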

再び図１４を参照して、顔モデル変形部９２は、顔モデル記憶部９０に記憶された、特徴点と仮想マーカとの対応関係に基づき、あるフレームの三つ組視覚素ユニットにおける各特徴点の動きベクトルをそれぞれ対応の仮想マーカ３３２に付与する。さらに顔モデル変形部９２は、顔モデル３３０の各ノードに、対応する仮想マーカ３３２の動きベクトルにより示される変化量から所定の内挿式により算出される変化量ベクトルｖを割当てることにより、顔モデル３３０の変形を行なう。顔モデル変形部９２は、変形後の顔モデル３３０を、そのフレームにおける形状モデルとして出力する。 Referring to FIG. 14 again, the face model deforming unit 92 moves the feature points in the triple visual element unit of a certain frame based on the correspondence relationship between the feature points and the virtual markers stored in the face model storage unit 90. Each vector is assigned to the corresponding virtual marker 332. Further, the face model deforming unit 92 assigns to each node of the face model 330 a change amount vector v calculated by a predetermined interpolation formula from the change amount indicated by the motion vector of the corresponding virtual marker 332, so that the face model 330 is deformed. The face model deformation unit 92 outputs the deformed face model 330 as a shape model in the frame.

基本となる顔モデル３３０のうちの、あるノードの座標ベクトルをＮ、基本となる顔モデル３３０において、当該ノードと対応関係にあるｉ番目の仮想マーカの座標をＭi（１≦ｉ≦３）、変形後の顔モデルにおける対応するマーカの座標をＭ'iとすると、顔モデル変形部９２は、このノードの座標の変化量ベクトルｖを次の内挿式によって算出する。なお、Ｍ'i−Ｍiが特徴点の動きベクトルに相当する。 Of the basic face model 330, the coordinate vector of a certain node is N, and in the basic face model 330, the coordinate of the i-th virtual marker corresponding to the node is Mi (1 ≦ i ≦ 3), When the coordinate of the corresponding marker in the deformed face model is M′i, the face model deforming unit 92 calculates a change vector v of the coordinate of this node by the following interpolation formula. Note that M′i−Mi corresponds to a motion vector of feature points.
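The interpolation formula itself is not reproduced in this text. As an illustration only, the sketch below assumes inverse-distance weighting, in which each corresponding marker's motion vector M'i − Mi contributes in proportion to the reciprocal of its distance from the node N in the basic face model. The weighting scheme, the function name, and the small `eps` term that avoids division by zero are all assumptions, not the formula of the specification.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, ...]


def node_displacement(node: Vec,
                      markers_before: Sequence[Vec],
                      markers_after: Sequence[Vec],
                      eps: float = 1e-9) -> Vec:
    """Interpolate the node's change vector v from its corresponding markers
    using inverse-distance weights (assumed scheme)."""
    weights = [1.0 / (math.dist(node, m) + eps) for m in markers_before]
    total = sum(weights)
    return tuple(
        sum(w * (after[k] - before[k])
            for w, before, after in zip(weights, markers_before, markers_after)) / total
        for k in range(len(node))
    )
```

Under this assumption, a node whose three corresponding markers all move by the same vector moves by that vector as well, and a marker closer to the node influences it more strongly than a distant one.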

レンダリング部９６は、ポリゴンにより表された形状モデルに対するレンダリングを行なうことができるものであればよく、市販のレンダリングエンジンを用いることもできる。アニメーションのフレームレートにしたがい、１フレームごとの顔モデルを上記式にしたがって生成し、レンダリングを行なうことにより、このレンダリングによりえられた画像のシーケンスとしてアニメーションが得られる。アニメーションはアニメーション記憶部９８に記憶される。 The rendering unit 96 may be anything capable of rendering a shape model represented by polygons, and a commercially available rendering engine can also be used. By generating the face model for each frame according to the above formula at the frame rate of the animation and rendering it, the animation is obtained as the sequence of images produced by the rendering. The animation is stored in the animation storage unit 98.

The animation reading unit 100 reads the animation from the animation storage unit 98 frame by frame, converts each frame into an image, and writes the image into the frame memory of the display unit 52 at every frame interval.

[Operation]
The face animation creation system 40 according to the present embodiment operates as follows. The operation of the lip sync animation creating apparatus 40 can be broadly divided into four phases. The first phase creates the speech-visual element corpus storage unit 62. The second phase creates the animation data storage unit 46 from the input audio data 42 using the speech-visual element corpus storage unit 62. The third phase uses a face model to create an animation from the animation data storage unit 46 and stores it in the animation storage unit 98. The last phase displays the animation stored in the animation storage unit 98 on the display unit 52. The operation of the lip sync animation creating apparatus 40 in each phase is described below.

<First Phase: Creation of Speech-Visual Element Corpus Storage Unit 62>
The following describes how the recording system 60 performs recording and generates the speech-visual element corpus storage unit 62. Referring to FIGS. 2 and 3, a marker is attached in advance to each feature point 160 on the head 110 of the speaker 50. The speaker then speaks in that state. To make the speech-visual element corpus rich, or to ensure that all phonemes are included in a balanced manner, the content of the utterance is determined in advance and the speaker 50 is asked to utter that content. The content of the utterance is stored in the utterance text storage unit 204 shown in FIG. 5.

When recording starts and the speaker 50 speaks, the video/audio recording system 112 records the voice and the moving image of the face during the utterance and generates audio/video data 116. The audio/video data 116 is stored in the audio/video storage unit 140. At this time, the camcorder 132 supplies the time code 134 to the MoCap system 114 and attaches the same time code as the time code 134 to the audio/video data 116.

At the same time, the positions of the feature points 160 during the utterance are measured as three-dimensional data by the MoCap system 114 as follows. Each marker moves so as to follow the movement of its feature point. Each of the infrared cameras 136A, ..., 136F captures the infrared light reflected by the markers at a predetermined frame rate (for example, 120 frames per second) and outputs the video signal to the data processing device 138. The data processing device 138 attaches the time code 134 to each frame of these video signals and calculates the three-dimensional coordinates of each marker for each frame from the video signals. The data processing device 138 collects the three-dimensional coordinates of the markers frame by frame and accumulates them as MoCap data 118.

The audio/video data 116 and the MoCap data 118 recorded by the above recording process are supplied to the data set creation device 122. The data set creation device 122 accumulates the audio/video data 116 in the audio/video storage unit 140 and the MoCap data 118 in the MoCap data storage unit 142.

The normalization processing unit 146 reads the MoCap data of the frame at t = 0 from the MoCap data storage unit 142. The MoCap data of the fixed points at this time serves as the reference for the subsequent normalization. The normalization processing unit 146 then normalizes the coordinates of each feature point in each frame, using the three-dimensional coordinates of the feature points designated as fixed points, as follows.

That is, for each frame of the MoCap data 152, the normalization processing unit 146 obtains the affine matrix of equation (1) above from the three-dimensional coordinates of the fixed points in that frame and the three-dimensional coordinates of the fixed points in the frame at t = 0, and applies the affine transformation given by this matrix to the three-dimensional coordinates of each feature point. After this transformation, the three-dimensional coordinates of each feature point represent the position the feature point would have if the utterance were made with the head fixed at its position at t = 0. In other words, the three-dimensional coordinates of each feature point are normalized. Subtracting the coordinates of each feature point at t = 0 from these normalized coordinates yields the motion vector of that feature point at that time. As a result, the face parameter series 126 is obtained from the MoCap data 152. The face parameter series 126 is supplied to the combining unit 148.
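
The normalization described above can be sketched as follows. Equation (1) is not reproduced in this passage, so the sketch assumes that the affine matrix is obtained by a least-squares fit from the fixed points of the current frame to the fixed points at t = 0; the function names fit_affine, apply_affine, and motion_vector are hypothetical and are used for illustration only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def fit_affine(src, dst):
    """Least-squares 3x4 affine map taking src points to dst points.

    Assumption: at least four non-coplanar fixed points are available.
    """
    h = [list(p) + [1.0] for p in src]  # homogeneous coordinates
    hth = [[sum(r[i] * r[j] for r in h) for j in range(4)] for i in range(4)]
    rows = []
    for k in range(3):  # one normal-equation solve per output coordinate
        hty = [sum(r[i] * d[k] for r, d in zip(h, dst)) for i in range(4)]
        rows.append(solve(hth, hty))
    return rows


def apply_affine(a, p):
    """Apply the 3x4 affine map a to the 3D point p."""
    hp = list(p) + [1.0]
    return tuple(sum(a[k][i] * hp[i] for i in range(4)) for k in range(3))


def motion_vector(a, observed, rest):
    """Normalized position of a feature point minus its position at t = 0."""
    normalized = apply_affine(a, observed)
    return tuple(n - r for n, r in zip(normalized, rest))
```

For example, if the head is merely translated between t = 0 and the current frame, fit_affine recovers the inverse translation from the fixed points, and motion_vector then returns only the displacement caused by the facial movement itself.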

Referring to FIG. 5, the framing processing unit 200 of the triple visual element data string creation unit 144 divides the audio data of the audio/video data 116 stored in the audio/video storage unit 140 into frames of a predetermined frame length at a predetermined frame interval, and supplies the frames to the feature extraction unit 201.
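
Framing with a fixed frame length and a fixed frame interval can be sketched as follows. The function name frame_signal is hypothetical, the lengths are expressed in samples, and discarding a trailing partial frame is an assumption; the text does not state how the end of the signal is handled.

```python
def frame_signal(samples, frame_length, frame_shift):
    """Split samples into overlapping frames.

    frame_length: number of samples per frame.
    frame_shift: number of samples between the starts of consecutive frames.
    A trailing partial frame is dropped (assumption).
    """
    last_start = len(samples) - frame_length
    return [samples[s:s + frame_length]
            for s in range(0, last_start + 1, frame_shift)]
```

Because the second phase frames the input audio with the same frame length and interval, the same function could be reused there with identical arguments.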

The feature extraction unit 201 calculates, from each frame supplied by the framing processing unit 200, the acoustic feature quantity (MFCC) used in the processing of the Viterbi alignment unit 206, and supplies it to the Viterbi alignment unit 206 as the feature quantity 230. The audio data of each frame is also supplied to the Viterbi alignment unit 206 at this time.

The Viterbi alignment unit 206 performs Viterbi alignment on the sequence of feature quantities 230 using the acoustic model stored in the acoustic model storage unit 202 and the utterance text stored in the utterance text storage unit 204, and stores the resulting string of phoneme labels, together with the duration of each phoneme, in the phoneme data string storage unit 208 as the phoneme data string 232. The audio data of each frame is also attached to the phoneme data string 232 at this time.

The phoneme-visual element conversion unit 212 sequentially reads the phoneme data from the phoneme data string storage unit 208 and converts the phoneme label contained in each item of phoneme data into a visual element label by referring to the phoneme-visual element conversion table stored in the phoneme-visual element conversion table storage unit 210. Phoneme data in which the phoneme label has been replaced by a visual element label constitutes visual element data. When the same visual element label occurs consecutively, the phoneme-visual element conversion unit 212 merges those items into a single item of visual element data and sums their durations. It further calculates the average power of the speech and attaches it to each item of visual element data. The phoneme-visual element conversion unit 212 stores the visual element data string 234 thus obtained, together with the framed audio data, in the visual element data string storage unit 214.
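
The label conversion and the merging of consecutive identical labels can be sketched as follows. The Segment type, the function to_visual_elements, and the contents of the example table are hypothetical; the actual phoneme-visual element conversion table is not given in this passage, and the calculation of the average power is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str        # phoneme label or visual element label
    duration_ms: int  # duration in milliseconds


# Hypothetical many-to-one table; the real table is stored in unit 210.
EXAMPLE_TABLE = {"p": "P", "b": "P", "m": "P", "a": "A", "i": "I"}


def to_visual_elements(phonemes, table):
    """Map phoneme labels to visual element labels and merge consecutive repeats."""
    merged = []
    for seg in phonemes:
        label = table[seg.label]
        if merged and merged[-1].label == label:
            # Same visual element as the previous item: sum the durations.
            merged[-1].duration_ms += seg.duration_ms
        else:
            merged.append(Segment(label, seg.duration_ms))
    return merged
```

With this table, the phoneme sequence p (50 ms), m (70 ms), a (100 ms) becomes the visual element sequence P (120 ms), A (100 ms).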

The visual element-triple visual element conversion unit 216 sequentially reads the visual element data from the visual element data string storage unit 214 and processes it as follows. The unit converts the visual element label of each item of visual element data into a triple visual element label formed by concatenating, in this order, the visual element label of the immediately preceding item, the visual element label of the item itself, and the visual element label of the immediately following item. Visual element data in which the visual element label has been replaced by a triple visual element label in this way becomes triple visual element data. The visual element-triple visual element conversion unit 216 stores the triple visual element data string 236 thus obtained, together with the audio data, in the triple visual element data string storage unit 218.
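
The construction of triple visual element labels can be sketched as follows. The function name to_triples is hypothetical, and padding both ends of the sequence with a boundary label such as "sil" is an assumption; the text does not say how the first and last items are treated.

```python
def to_triples(labels, boundary="sil"):
    """Return one (previous, current, next) label triple per input label.

    The sequence is padded with a boundary label at both ends (assumption).
    """
    padded = [boundary] + list(labels) + [boundary]
    return [tuple(padded[i:i + 3]) for i in range(len(labels))]
```

For the label sequence P, A, I this yields the triples (sil, P, A), (P, A, I), and (A, I, sil).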

The combining unit 148 synchronizes the triple visual element data string stored in the triple visual element data string storage unit 218 with the face parameter series 126 supplied from the normalization processing unit 146, using the time information attached to each, combines them, generates the speech-visual element corpus together with the audio data 42, and stores it in the speech-visual element corpus storage unit 62.

<Second Phase: Synthesis of Animation Data Storage Unit 46>
The second phase is performed by the animation data synthesizing apparatus 44. Audio data 42 representing the voice of the character is prepared and supplied to the triple visual element data string creation unit 80. The audio data 42 is obtained by recording, in advance, speech uttered by the speaker (or voice actor) in charge of the character's voice. The utterance text of the audio data 42 is stored in the utterance text storage unit 286 shown in FIG. 10.

Referring to FIG. 10, the framing processing unit 280 divides the input audio data 42 into frames with the same frame length and frame interval as the framing processing unit 200 shown in FIG. 5, and supplies the frames to the feature quantity extraction unit 282.

The feature quantity extraction unit 282 extracts a predetermined acoustic feature quantity (MFCC) from each frame of the speech by the same processing as the feature extraction unit 201 shown in FIG. 5, and supplies it to the Viterbi alignment unit 288.

The Viterbi alignment unit 288 performs Viterbi alignment on the output of the feature quantity extraction unit 282 using the acoustic model storage unit 284 and the utterance text storage unit 286, and supplies the phoneme-visual element conversion unit 292 with a phoneme data string consisting of phoneme data, each item of which contains a phoneme label and the duration of the phoneme.

For each item of phoneme data in this phoneme data string, the phoneme-visual element conversion unit 292 converts the phoneme label into a visual element label by referring to the phoneme-visual element conversion table stored in the phoneme-visual element conversion table storage unit 290. Phoneme data in which the phoneme label has been replaced by a visual element label becomes visual element data, which is output from the phoneme-visual element conversion unit 292 as a visual element data string and stored in the visual element data string storage unit 293. Each item of visual element data contains a visual element label and the duration of the original phoneme. When the same visual element label occurs consecutively, the items are merged and their durations are summed. The average power of the corresponding speech is also calculated for each item of visual element data.

The visual element-triple visual element conversion unit 294 reads the items of visual element data stored in the visual element data string storage unit 293 in order, replaces the visual element label of each item with the triple visual element label obtained by concatenating it with the visual element labels of the immediately preceding and immediately following items, and stores the result in the triple visual element data string storage unit 295 as a triple visual element data string.

Referring to FIG. 1, the triple visual element unit selection unit 82 reads the triple visual element data string 296 from the triple visual element data string storage unit 295 shown in FIG. 10 and processes it as follows. For each item of triple visual element data in the triple visual element data string 296, the triple visual element unit selection unit 82 searches the speech-visual element corpus storage unit 62 for triple visual element units having the same triple visual element label (step 314 in FIG. 13). If such units exist, the cost is calculated against all of them by equation (2) above, using the visual element duration and average power contained in the triple visual element data (step 318 in FIG. 13). If no such unit exists, the unit reads from the speech-visual element corpus storage unit 62 the triple visual element units whose triple visual element label matches in either the first two visual element labels or the last two visual element labels, and calculates the cost against all of them by equation (2) above, using the visual element duration and average power contained in the triple visual element data (step 316 in FIG. 13).

The triple visual element unit that gives the minimum of the costs calculated in this way is then selected as the optimal triple visual element unit for the triple visual element data being processed (step 322 in FIG. 13).
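
The selection procedure, including the fallback to a partial label match, can be sketched as follows. Equation (2) is not reproduced in this passage, so the cost used here, a weighted sum of the absolute differences in duration and average power, is only an assumed stand-in. The names Unit, cost, and select_unit are hypothetical, as is raising LookupError when no candidate is found; the text does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: tuple     # (previous, current, next) visual element labels
    duration: float  # duration of the visual element
    power: float     # average power of the speech over that duration


def cost(target, unit, w_dur=1.0, w_pow=1.0):
    """Assumed stand-in for equation (2): weighted absolute differences."""
    return (w_dur * abs(target.duration - unit.duration)
            + w_pow * abs(target.power - unit.power))


def select_unit(target, corpus):
    """Pick the minimum-cost corpus unit for one item of target data."""
    # Exact triple label match first.
    candidates = [u for u in corpus if u.label == target.label]
    if not candidates:
        # Fallback: units sharing the first two or the last two labels.
        candidates = [u for u in corpus
                      if u.label[:2] == target.label[:2]
                      or u.label[1:] == target.label[1:]]
    if not candidates:
        raise LookupError("no candidate unit for %r" % (target.label,))
    return min(candidates, key=lambda u: cost(target, u))
```

The weights w_dur and w_pow are illustrative parameters only; they would need to be tuned so that duration and power contribute on comparable scales.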

The triple visual element unit selection unit 82 supplies the triple visual element unit sequence made up of the triple visual element units obtained in this way to the triple visual element unit connecting unit 84.

For each triple visual element unit in the triple visual element unit sequence supplied from the triple visual element unit selection unit 82, the triple visual element unit connecting unit 84 connects its motion vector sequence, on the time axis, to the motion vector sequence of the preceding triple visual element unit and to that of the following triple visual element unit. As described with reference to FIG. 17, the tail of the motion vector sequence of each unit is extended by a time T, and the smoothing of equation (3) above is performed for each feature point between this extension and the leading portion of length T of the motion vector sequence of the following unit.
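
The connection with smoothing over the overlap of length T can be sketched as follows. Equation (3) is not reproduced in this passage, so a linear cross-fade is assumed in its place. The function names crossfade and join are hypothetical, and each frame is simplified to the three-dimensional motion vector of a single feature point.

```python
def crossfade(extension, head):
    """Blend the extended tail of one unit into the head of the next.

    extension, head: equal-length lists of 3D motion vectors covering the
    same time span T. A linear weight is assumed in place of equation (3).
    """
    n = len(head)
    if len(extension) != n:
        raise ValueError("extension and head must cover the same span")
    blended = []
    for k, (e, h) in enumerate(zip(extension, head)):
        w = (k + 1) / (n + 1)  # weight of the following unit grows over T
        blended.append(tuple((1 - w) * a + w * b for a, b in zip(e, h)))
    return blended


def join(prev, extension, nxt):
    """Concatenate prev and nxt, smoothing over the first len(extension) frames of nxt."""
    n = len(extension)
    return prev + crossfade(extension, nxt[:n]) + nxt[n:]
```

In a full implementation the same blend would be applied independently to every feature point, as the text requires.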

The animation data is created by the above processing. The created animation data is stored in the animation data storage unit 46.

<Third Phase: Creation of Animation Using Model>
The animation is created by the animation creation device 48 shown in FIG. 1. Referring to FIG. 1, the face model deforming unit 92 reads the face model stored in the face model storage unit 90. It is assumed that, for this face model, the feature points used when the speech-visual element corpus storage unit 62 was created have already been associated with virtual markers, and that the virtual markers corresponding to each node of the face model have also already been determined.

From the animation data stored in the animation data storage unit 46, the face model deforming unit 92 reads, in order, the animation data of the frame whose time is closest to each time determined by the frame rate of the animation, and performs the following processing for each frame.
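
Choosing, for each animation frame time, the stored frame with the closest time can be sketched as follows. The function name nearest_frame_indices is hypothetical, and the sketch assumes that the timestamps are in seconds, sorted in ascending order, and start at or near zero.

```python
def nearest_frame_indices(timestamps, fps):
    """For each animation time k / fps, return the index of the nearest stored frame.

    timestamps: ascending frame times of the animation data, in seconds.
    fps: frame rate of the animation to be created.
    """
    indices = []
    i = 0
    k = 0
    while timestamps and k / fps <= timestamps[-1]:
        t = k / fps
        # Advance while the next stored frame is at least as close to t.
        while (i + 1 < len(timestamps)
               and abs(timestamps[i + 1] - t) <= abs(timestamps[i] - t)):
            i += 1
        indices.append(i)
        k += 1
    return indices
```

For data captured at 120 frames per second and an animation at 30 frames per second, this amounts to taking roughly every fourth stored frame.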

The face model deforming unit 92 assigns to each virtual marker of the face model the three-dimensional motion vector of the corresponding feature point contained in the animation data that was read. The face model deforming unit 92 then calculates, according to equation (4), the change amount vector v that gives the three-dimensional position coordinates of each node of the face model from the face model storage unit 90. Calculating the change amount vector for all nodes completes the face model for that frame. This face model is supplied to the rendering unit 96.
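
The per-node interpolation can be sketched as follows. Equation (4) is not reproduced in this passage, so this sketch substitutes inverse-distance weighting of the three marker motion vectors, which is an assumption and not necessarily the formula of the embodiment. The function name node_displacement and the parameter eps are hypothetical.

```python
import math


def node_displacement(node, rest_markers, moved_markers, eps=1e-12):
    """Interpolate the change amount vector v of one node from its markers.

    node: coordinates N of the node in the basic face model.
    rest_markers: coordinates M_i of the corresponding virtual markers.
    moved_markers: coordinates M'_i of the same markers after deformation.
    Inverse-distance weighting is assumed in place of equation (4).
    """
    motions = [tuple(b - a for a, b in zip(r, m))
               for r, m in zip(rest_markers, moved_markers)]
    dists = [math.dist(node, r) for r in rest_markers]
    for d, mv in zip(dists, motions):
        if d < eps:
            return mv  # the node coincides with a marker
    weights = [1.0 / d for d in dists]
    total = sum(weights)
    return tuple(sum(w * mv[k] for w, mv in zip(weights, motions)) / total
                 for k in range(3))
```

The deformed position of the node is then N + v, and repeating this for every node gives the face model for the frame.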

The rendering unit 96 renders the face model supplied from the face model deforming unit 92 according to the rendering data and settings stored in the rendering data storage unit 94, creates an image corresponding to one frame of the animation, and stores it in the animation storage unit 98.

The above operations of the face model deforming unit 92 and the rendering unit 96 are repeated, and the animation creation device 48 ends the processing when the end of the animation data stored in the animation data storage unit 46 is reached.

As a result of the above processing, the animation storage unit 98 holds a series of face images at a predetermined frame rate that represent the face animation corresponding to the input audio data 42.

<Fourth Phase: Display of Animation>
The animation is displayed by the animation reading unit 100 and the display unit 52. In the display processing, the animation reading unit 100 reads the images stored in the animation storage unit 98 sequentially from the beginning and writes them, at a prescribed time interval, into a frame memory (not shown) in the display unit 52. The display unit 52 reads the images written in this frame memory at predetermined time intervals and displays them on the screen. As a result, the screen of the display unit 52 shows an animation of the face defined by the face model of the animation character stored in the face model storage unit 90, changing in accordance with the input audio data 42.

As described above, according to the lip sync animation creating apparatus 40 of the present embodiment, a large number of feature points on the speaker's face are associated in advance with the nodes of the face model 170. In addition, the phoneme labels obtained from the speech during the utterance are converted into the corresponding visual element labels and further into triple visual element labels, and each label is combined with its duration, the average power of the speech over that duration, and the sequence of three-dimensional motion vectors of the facial feature points obtained by motion capture during the utterance. The result is stored in the speech-visual element corpus storage unit 62 as triple visual element units, whereby the speech-visual element corpus is created.

When the audio data 42 is supplied, a phoneme data string consisting of the phoneme labels obtained from the audio data 42, the phoneme durations, and the average power is created, and each phoneme label is converted into a triple visual element label by the same method as in the creation of the speech-visual element corpus, yielding a triple visual element data string. Based on the duration of each triple visual element label and the average power of its speech, the triple visual element unit with the minimum cost under a predetermined cost function is selected from the speech-visual element corpus storage unit 62, and the triple visual element unit connecting unit 84 connects the motion vector sequences of the selected units. In this connection, of the motion vector sequences of two adjacent triple visual element units, that of the preceding unit is extended, and smoothing is performed over this extension against the motion vectors of the following unit.

Through this processing, the motion vectors representing the trajectory of each node of the face model 170 are calculated from the motion vectors of the facial feature points during actual utterances, which are stored in the speech-visual element corpus storage unit 62. The temporal change of the face model as a set of nodes is therefore obtained as animation data representing natural movement close to the actual movement. By calculating the motion vector of each node of the face model 170 from the motion vectors of the feature points constituting the animation data and from the correspondence between those feature points and the nodes of the face model 170, a face model can be created for each frame as a set of nodes, and a series of deformed face models is obtained. An animation can be created by rendering these face models.

The face parameter series 126 stored in the animation data storage unit 46 expresses the nonlinear trajectory of each facial feature point when the speech represented by the audio data 42 is uttered, based on measurement data actually obtained by motion capture. A natural animation that faithfully reproduces the nonlinear changes of facial expression during speech can therefore be created.

The lip sync animation creating apparatus 40 creates animation on a model basis. Once the speech-visual element corpus storage unit 62 has been created, the user only needs to prepare the audio data 42 corresponding to the character's voice, a face model defining the shape of the character's face at rest, an acoustic model for performing Viterbi alignment on the audio data 42, and the utterance text corresponding to the audio data 42, and to designate on the face model the virtual markers corresponding to the feature points. This is sufficient to create a natural lip sync animation in which the facial expression changes in accordance with the character's voice. The design of the character's face is not restricted, and the face shape represented by the face model 44 may be arbitrary. A lip sync animation can therefore be created without narrowing the range of animation that the user can produce.

In the embodiment described above, the lip sync animation creating apparatus 40 includes all of the recording system 60, the animation data synthesizing apparatus 44, the animation creation device 48, and the animation reading unit 100. However, the present invention is not limited to such an embodiment. Each of these may be realized by a separate apparatus or program. They need not be realized on physically the same computer, and the units constituting, for example, the animation data synthesizing apparatus 44 need not be realized on a single computer. They may be realized by programs running on separate computers, with data transferred between them via a network or a removable recording medium.

[Realization and Operation by Computer]
Each functional unit of the lip sync animation creating apparatus 40 of the present embodiment, except for some special devices included in the video/audio recording system 112 and the MoCap system 114 of the recording system 60 (see FIGS. 1 and 2), is realized by computer hardware, a program executed by that computer hardware, and data stored in that computer hardware. FIG. 18 shows the external appearance of this computer system 450, and FIG. 19 shows the internal configuration of the computer system 450.

Referring to FIG. 18, the computer system 450 includes a computer 460 having a DVD (Digital Versatile Disc) drive 472 and a memory port 470 that accepts removable memory, a keyboard 466, a mouse 468, a monitor 462, a microphone 490, and a pair of speakers 458. The microphone 490 is used when the audio data 42 (see FIG. 1) is recorded on the computer system 450. The speakers 458 are used to reproduce the speech when the animation is displayed.

Referring to FIG. 19, the computer 460 includes, in addition to the memory port 472 and the DVD drive 470, a hard disk 474; a CPU (central processing unit) 476; a bus 486 connected to the CPU 476, the hard disk 474, the memory port 472, and the DVD drive 470; a read-only memory (ROM) 478 that stores a boot-up program and the like; a random access memory (RAM) 480 that is connected to the bus 486 and stores program instructions, system programs, work data, and the like; and a sound board 488 that is connected to the bus 486 and that digitizes the audio signal from the microphone 490 and converts the digital audio signal output from the CPU 476 into an analog signal to drive the speakers 458. The computer system 450 may further include a printer.

The computer 460 further includes a network interface (I/F) 496 that provides a connection to a local area network (LAN) 452.

A computer program for causing the computer system 450 to realize each functional unit of the lip sync animation creating apparatus 40 is stored on a DVD 482 or in a memory 484 inserted into the DVD drive 470 or the memory port 472, and is then transferred to the hard disk 474. Alternatively, the program may be transmitted to the computer 460 through a network (not shown) and stored on the hard disk 474. The program is loaded into the RAM 480 when it is executed. The program may also be loaded into the RAM 480 directly from the DVD 482, from the memory 484, or via the network.

This program includes a plurality of instructions for causing the computer 460 to realize each functional unit of the lip sync animation creating apparatus 40 of this embodiment. Some of the basic functions needed to realize these units are provided by modules of various toolkits installed on the computer 460, by the operating system (OS) running on the computer 460, or by third-party programs. The program therefore does not necessarily have to include every function needed to realize the system and method of this embodiment. It only needs to include those instructions that carry out the processing of each functional unit of the lip sync animation creating apparatus 40 described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 450 is well known and is not repeated here.

[Various modifications]
In the above-described embodiment, triple visual element labels are used as the visual element labels of the visual element corpus and of the visual element data. This makes it possible to select an appropriate visual element unit in a way that reflects the facial movement in the utterances before and after the utterance corresponding to a given visual element. However, the present invention is not limited to such an embodiment. For example, a single visual element label may be used on its own. In that case, the discontinuities between the resulting visual element units may become larger, but the units can still be connected smoothly by the weighted addition processing described above.

In the above-described embodiment, the triple visual element label consists of the label of a given visual element combined with one visual element label immediately before it and one immediately after it. However, the present invention is not limited to such an embodiment. For example, a paired visual element label may be used, consisting of the label of the visual element before or after a given visual element and the label of the target visual element. A set of four or more visual element labels may also be adopted. When a set of four or more labels is adopted, it becomes more likely that no visual element unit with the appropriate label set will be found in the visual element corpus. In that case, the visual element corpus may be enlarged, or, as in the above-described embodiment, an alternative visual element unit may be searched for using a set of fewer visual element labels.
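The label-set construction and the fallback to fewer labels can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the embodiment: the `sil` padding label, the `(label_set, unit_id)` corpus layout, and in particular the order in which partial matches are tried (preceding label plus center, then center plus following label, then center only) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

SILENCE = "sil"  # hypothetical padding label for the start and end of an utterance

LabelSet = Tuple[str, ...]
CorpusEntry = Tuple[LabelSet, str]  # (label set, hypothetical unit identifier)


def make_label_sets(labels: Sequence[str], before: int = 1, after: int = 1) -> List[LabelSet]:
    """Combine each label with `before` preceding and `after` following labels."""
    padded = [SILENCE] * before + list(labels) + [SILENCE] * after
    width = before + 1 + after
    return [tuple(padded[i:i + width]) for i in range(len(labels))]


def find_candidates(corpus: Sequence[CorpusEntry], target: LabelSet) -> List[CorpusEntry]:
    """Return units matching a triple label set, backing off to partial matches.

    The back-off order below is an assumption, not the patent's stated order.
    """
    prev, cur, nxt = target
    patterns = [
        lambda t: t == target,                   # full triple match
        lambda t: t[0] == prev and t[1] == cur,  # preceding label + center
        lambda t: t[1] == cur and t[2] == nxt,   # center + following label
        lambda t: t[1] == cur,                   # center label only
    ]
    for matches in patterns:
        found = [entry for entry in corpus if matches(entry[0])]
        if found:
            return found
    return []
```

Calling `make_label_sets(["a", "m"])` yields one triple per input label, and `find_candidates` returns the first non-empty group of matching corpus entries, so a later selection step always works on the most specific match available.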

In the above-described embodiment, when visual element units are connected, only the motion vectors of the preceding visual element unit are extended, by a time T. However, the present invention is not limited to such an embodiment. For example, the motion vectors of every visual element unit may be extended by T/2 both before and after the unit. The extension time itself may be set to whatever value subjective tests indicate is appropriate.
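As a rough illustration of connecting two units over an overlap region by time-dependent weighted addition, the following Python sketch blends the last frames of the preceding unit with the first frames of the following unit. The linear weighting, the frame-count `overlap` parameter (standing in for the time T), and the list-of-lists frame layout are assumptions made for this sketch; the embodiment's actual weights are not reproduced here.

```python
from typing import List

Frame = List[float]  # one motion vector (hypothetical flat layout) per frame


def crossfade(prev_unit: List[Frame], next_unit: List[Frame], overlap: int) -> List[Frame]:
    """Join two motion sequences, blending `overlap` frames with linear weights."""
    overlap = max(0, min(overlap, len(prev_unit), len(next_unit)))
    if overlap == 0:
        return prev_unit + next_unit
    head = prev_unit[:len(prev_unit) - overlap]
    blended = []
    for k in range(overlap):
        # Weight of the following unit rises linearly across the overlap (assumed).
        w = (k + 1) / (overlap + 1)
        a = prev_unit[len(prev_unit) - overlap + k]
        b = next_unit[k]
        blended.append([(1.0 - w) * x + w * y for x, y in zip(a, b)])
    return head + blended + next_unit[overlap:]
```

With this layout, the joined sequence is shorter than the two inputs combined by `overlap` frames, because the overlapping frames are merged rather than played back to back.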

In the above-described embodiment, the utterance duration of the phoneme corresponding to a visual element and the average power of the speech over that duration are used to determine which visual element unit should be selected. However, the present invention is not limited to such an embodiment. Other prosodic features, such as pitch (fundamental frequency), may be used, as may any combination of these features. Because animation is by nature a face image that changes over time, it is desirable to include the utterance duration in the evaluation. Alternatively, a visual element unit may be selected using prosodic features other than the utterance duration, and its duration may then be adjusted to match the duration of the visual element data.

In the above-described embodiment, equation (2) is used to search for triple visual element units with prosodically similar features, and such units are treated as suitable for animation synthesis. However, the usable equations are not limited to equation (2). The sum of squared differences of the prosodic features may be used in place of equation (2), and any other equation may be used as long as it accurately expresses the difference in prosodic features between a visual element unit and the visual element data.
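The sum-of-squared-differences alternative mentioned above can be sketched in Python as follows. This is not equation (2), which is not reproduced here. The `Unit` data class, its field names, and the optional weights `w_dur` and `w_pow` (added because duration and power are on different scales) are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Unit:
    labels: Tuple[str, ...]  # set of visual element labels (e.g. a triple)
    duration: float          # utterance duration; units are an assumption
    power: float             # average power over the duration; units are an assumption


def prosodic_cost(target: Unit, unit: Unit, w_dur: float = 1.0, w_pow: float = 1.0) -> float:
    """Weighted sum of squared differences of duration and average power."""
    return (w_dur * (target.duration - unit.duration) ** 2
            + w_pow * (target.power - unit.power) ** 2)


def select_unit(target: Unit, corpus: Sequence[Unit]) -> Optional[Unit]:
    """Pick the corpus unit with matching labels and the smallest prosodic cost."""
    same_label = [u for u in corpus if u.labels == target.labels]
    if not same_label:
        return None  # the caller may fall back to a smaller label set
    return min(same_label, key=lambda u: prosodic_cost(target, u))
```

A smaller cost means the unit's prosody is closer to that of the input speech, so the unit with the minimum value is the one selected.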

The embodiment disclosed herein is merely an example, and the present invention is not limited to the above-described embodiment. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram of the lip sync animation creating apparatus 40 according to the embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the recording system 60.
FIG. 3 is a diagram showing an example of the arrangement of markers attached to the head 110.
FIG. 4 is a schematic diagram showing the procedure for creating an animation 172 of a face image from the face model 170 of an animation character and the motion vectors for each frame.
FIG. 5 is a block diagram of the triple visual element data string creation unit 144.
FIG. 6 is a schematic diagram showing an outline of Viterbi alignment.
FIG. 7 is a diagram showing the structure of the triple visual element data string 234.
FIG. 8 is a diagram showing the structure of the phoneme-visual element conversion table.
FIG. 9 is a diagram showing the structure of the audio-visual element corpus storage unit 62.
FIG. 10 is a diagram showing a more detailed configuration of the triple visual element data string creation unit 80.
FIG. 11 is a diagram showing the structure of the triple visual element data string 296.
FIG. 12 is a diagram showing the structure of the triple visual element data string 300.
FIG. 13 is a flowchart showing the control structure of the computer program that implements the triple visual element unit selection unit 82.
FIG. 14 is a schematic diagram showing a face model.
FIG. 15 is a flowchart showing the control structure of the computer program that implements the marker labeling process executed by the face model deformation unit 92.
FIG. 16 is a diagram for explaining the association, performed by the face model deformation unit 92, between nodes around the lips in the face model and virtual markers.
FIG. 17 is a diagram for explaining how the triple visual element unit connection unit 84 connects motion vectors.
FIG. 18 is a diagram showing an example of the external appearance of a computer system that implements the main functions of the lip sync animation creating apparatus 40 according to an embodiment of the present invention.
FIG. 19 is a block diagram of the computer system shown in FIG. 18.

Explanation of symbols

40 Lip sync animation creating apparatus
44 Animation data synthesizer
48 Animation creating device
60 Recording system
62 Audio-visual element corpus storage unit
80 Triple visual element data string creation unit
82 Triple visual element unit selection unit
84 Triple visual element unit connection unit
92 Face model deformation unit
96 Rendering unit
100 Animation reading unit
112 Video/audio recording system
114 MoCap system
116 Audio/video data
122 Data set creation device
138 Data processing device
144 Triple visual element data string creation unit
146 Normalization processing unit
148 Combining unit
200, 280 Framing processing unit
201, 282 Feature extraction unit
202, 284 Acoustic model storage unit
204, 286 Utterance text storage unit
206, 288 Viterbi alignment unit
210, 290 Phoneme-visual element conversion table storage unit
212, 292 Phoneme-visual element conversion unit
214, 293 Visual element data string storage unit
216, 294 Visual element-triple visual element conversion unit
218, 295 Triple visual element data string storage unit
234, 296 Triple visual element data string
240 Speech waveform data
242 Motion vector string
244 Triple visual element unit string
300 Triple visual element data string

Claims

An animation data creation program for causing a computer comprising first storage means for storing a visual element corpus to create, based on input audio data, animation data of a face including a mouth that moves in accordance with the audio data, wherein:
The visual element corpus includes a plurality of visual element units created from a face image during speech with sound,
Each visual element unit includes a visual element label, motion data indicating the movement of the face corresponding to the visual element unit, and prosodic information, obtained from the speech corresponding to the visual element unit, that includes the duration of the phoneme corresponding to the visual element unit,
The program causes the computer to function as first conversion means for converting the voice data into a phoneme data string that identifies a phoneme represented by the voice data,
The phoneme data string includes phoneme data including a phoneme label and prosodic information including a duration of the phoneme portion in the speech data;
The program further outputs a visual element data string by converting each of the phoneme labels included in the phoneme data in the phoneme data string output by the first conversion means into a corresponding visual element label. Causing the computer to function as the second conversion means,
The visual element data string output from the second conversion means includes visual element data each consisting of a visual element label and prosodic information that includes at least the duration of the phoneme corresponding to the visual element data, obtained from the portion of the audio data corresponding to that visual element data,
The program further causes the computer to function as:
First selection means for selecting from the visual element corpus, for each of the visual element data included in the visual element data string, the visual element unit that, among the visual element units in the visual element corpus having the same visual element label as the visual element label included in the visual element data, is evaluated as having speech most similar to the speech of the visual element data by an evaluation function that evaluates the similarity of speech based on the prosodic information included in the visual element data and the prosodic information included in each visual element unit in the visual element corpus; and
Connecting means for creating animation data of the mouth corresponding to the input audio data by connecting, on the time axis and in the order of the visual element data string, the motion data included in the visual element units selected by the first selection means.

The prosodic information of speech included in each of the visual element units in the visual element corpus includes, in addition to the duration of the speech corresponding to the visual element unit, the average power of the speech over that duration,
The first conversion means includes means for converting the voice data into a phoneme data string, the phoneme data string including phoneme data each consisting of a phoneme label and the duration and average power of the corresponding phoneme portion in the voice data,
The first selection means includes
For each visual element data included in the visual element data string, among visual element units in the visual element corpus, each visual element unit having the same visual element label as the visual element label included in the visual element data. An evaluation means for evaluating a value of an evaluation function for evaluating the similarity of speech based on the duration and average power included in the visual element data and the duration and average power of the visual element unit;
The second evaluation means for selecting from the visual element corpus the visual element unit evaluated by the evaluation means as having the sound most similar to the sound included in the visual element data. The animation data creation program described.

The computer further includes second storage means for storing a phoneme-visual element conversion table storing a correspondence relationship between a phoneme label and a visual element label,
The second conversion means refers to each of the phoneme labels included in the phoneme data of the phoneme data string output from the first conversion means by referring to the phoneme-visual element conversion table, thereby corresponding visual element labels. The animation data creation program according to claim 1, further comprising means for converting into a visual element data string.

Each visual element unit of the visual element corpus includes a first number of visual element units preceding each visual element when the plurality of visual element units are created from the face image at the time of speech with speech. A visual element label, and a visual element label of a second number of visual element units following each visual element, the visual element label of the preceding first number of visual element units, and each visual element A visual element label of the unit and a visual element label of the second number of visual element units subsequent to each other constitute a set of visual element labels;
The second conversion means includes
For each of the phoneme data in the phoneme data string output by the first conversion means, a phoneme label included in the phoneme data, and a phoneme label included in the preceding first phoneme data; Means for converting each of the phoneme labels included in the second number of phoneme data thereafter to a corresponding visual element label and combining them in the order of the phoneme data to create a set of visual element labels;
For each of the phoneme data in the phoneme data string output by the first conversion means, the phoneme label included in the phoneme data in the phoneme data string output by the first conversion means is the visual element label. Means for generating and outputting the visual element label data by replacing with a set of visual element labels obtained by the means for generating
The first selection means includes second selection means for selecting from the visual element corpus, for each of the visual element data included in the visual element data string, a visual element unit that has the same set of visual element labels as the set of visual element labels included in the visual element data to be processed and that is evaluated as having speech most similar to the speech of the visual element data by an evaluation function that evaluates the similarity of speech using the prosodic information included in the visual element data to be processed and the prosodic information included in each visual element unit in the visual element corpus. The animation data creation program according to claim 1.

The second selection means includes
For each of the visual element data included in the visual element data string, a visual element unit having the same set of visual element labels as the set of visual element labels included in the visual element data to be processed is included in the visual element corpus. A determination means for determining whether or not it exists;
In response to the determination by the determination means that a visual element unit having the same set of visual element labels as the set of visual element labels included in the visual element data to be processed exists in the visual element corpus. For each of these visual element units, a value of an evaluation function for evaluating the similarity of speech is calculated from the prosodic information included in the visual element data and the prosodic information included in each visual element unit included in the visual element corpus. First calculating means for performing,
Selecting means for, in response to the determination by the determination means that no visual element unit having the same set of visual element labels as the set of visual element labels included in the visual element data to be processed exists in the visual element corpus, selecting from the visual element corpus visual element units having a set of visual element labels whose positions and contents coincide with a partial set of visual element labels, the partial set being a part of the set of visual element labels of the visual element data to be processed that includes the visual element label of that visual element data;
Second calculating means for calculating a value of the evaluation function with respect to each of the visual element units selected by the selecting means with respect to the prosodic information included in the visual element data to be processed;
The animation data creation program according to claim 4, further comprising means for selecting a visual element unit having the smallest value of the evaluation function calculated by the first calculation means or the second calculation means.

The connecting means includes weighted addition means for connecting, on the time axis, the motion data of two visual element units that are consecutive on the time axis among the motion data included in the visual element units selected by the selection means, by adding the motion data of the last part of the preceding visual element unit and the motion data of the first part of the subsequent visual element unit with weights that depend on time. The animation data creation program according to claim 1.

An animation data creation device for creating animation data indicating a movement of a face corresponding to input audio data using a visual element corpus including a plurality of triple visual element units,
Each of the triple visual element units includes a triple visual element label, the duration of the visual element corresponding to the triple visual element unit, the average power of the speech uttered when the visual element was recorded, and motion data of feature points of the speaker's face when the visual element was recorded,
Phoneme conversion means for creating a phoneme data string comprising a phoneme label, a phoneme duration, and an average power at the time of speech of the phoneme by performing speech analysis on the input speech data;
Means for storing a table indicating the correspondence between phoneme labels and visual element labels;
A first conversion means for creating a visual element data string by converting a phoneme label included in the phoneme data string into a corresponding visual element label with reference to the table;
For each of the visual element data in the visual element data string output by the first conversion means, the visual element label is converted into a triple visual element label combined with the visual element labels of the preceding and subsequent visual element data, and a triple visual element is obtained. Second conversion means for outputting a data string;
Selection means for selecting from the visual element corpus, for each of the triple visual element data included in the triple visual element data string output from the second conversion means, a triple visual element unit that has a triple visual element label matching the triple visual element label of the triple visual element data and that is evaluated, by an evaluation function for evaluating the similarity between the duration and average power of the triple visual element unit and the duration and average power of the triple visual element data, as having a duration and average power similar to those of the triple visual element data; and
Connecting means for creating face animation data by connecting, on the time axis and according to the time series of the triple visual element data, the face motion data included in the triple visual element units selected by the selection means.