JP4617500B2

JP4617500B2 - Lip sync animation creation device, computer program, and face model creation device

Info

Publication number: JP4617500B2
Application number: JP2007180505A
Authority: JP
Inventors: 真一川本; 達夫四倉; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2006-07-24
Filing date: 2007-07-10
Publication date: 2011-01-26
Anticipated expiration: 2027-07-10
Also published as: JP2008140364A

Description

The present invention relates to an animation creation apparatus that creates animation from speech, and more particularly to an apparatus that automatically generates animation, such as a face image whose mouth and other features change shape in step with uttered speech.

With advances in computer technology, work that used to be done largely by hand is increasingly being taken over by computers. Animation production is a typical example.

Animation used to be produced as follows. The animation director decides on the characters that will appear and creates storyboards, which are rough original drawings of the main scenes. Working from these storyboards, artists called animators draw the picture for each frame of the animation. Finishing staff then turn those pictures into cels. Photographing the cels onto film in order and playing the film back at a predetermined frame rate yields the picture portion of the animation.

While this picture is played back, voice actors add their lines based on the script of the animation. This is so-called "post-recording".

The most labor-intensive part of this process is producing the cels. When the original drawings are created with CG (computer graphics), however, processing them into cels is a comparatively simple task, and there is no need to photograph them one by one. This part of the process has therefore been computerized to a considerable degree, together with the move to CG original drawings.

Of the remaining work, post-recording is comparatively difficult. The voice actor must deliver the lines in time with the movement of the animation and in a voice suited to the situation, so post-recording takes a fair amount of time and requires practice.

This led to the reverse of post-recording: a method in which the speech is recorded first and the animation is created to match it. This is called "Presco" or "Pre-Reco", that is, prescoring or pre-recording (hereinafter "Presco etc."). The technique was originally used in the United States and elsewhere when animation was produced by hand. When animation is created with this technique, the procedure is as follows.

First, the characters that will appear in the animation are decided, and storyboards are created as before. The voice actors speak their lines based on the storyboards and the script, and the speech is recorded. The animation is then created to match this recorded speech.

When animation creation by a technique such as Presco is implemented on a computer, the problem is how to create the animation from the speech automatically. In particular, it is difficult to generate natural-looking mouth movements for an animated character, such as a person, that match a voice actor's pre-recorded speech, and a method that does this automatically is needed.

One method proposed for this purpose is described in Patent Document 1. In that method, a plurality of basic mouth-shape patterns are prepared in advance, and the mouth shape corresponding to arbitrary speech is obtained as a weighted sum of these basic patterns. To this end, a conversion function that converts predetermined feature values of the voice actor's speech into weighting parameters for the basic patterns is obtained in advance by multiple regression analysis. The predetermined feature values of the speech recorded from the script are converted into weighting parameters with this conversion function, and the weighted sum of the basic mouth-shape patterns is calculated with those parameters to create the mouth shape and face image corresponding to the speech. Performing this processing at the time corresponding to each frame of the animation creates the animation frame sequence.

FIG. 1 outlines an animation creation process 30 on which such a conventional animation creation apparatus is premised. Referring to FIG. 1, in animation creation process 30, when a speaker 40 utters lines based on a script 44, a speech recognition device performs phoneme segmentation on the resulting speech signal 42, that is, it generates from the utterance the phoneme sequence that constitutes the utterance.

For the principal phonemes, face images 60 to 68, each including the shape of the mouth when that phoneme is pronounced, are prepared in advance. These face images are assigned to the phonemes 50 to 58 obtained as a result of speech recognition, and the result is animated.

Assigning one utterance image to each phoneme does not by itself produce a smooth picture, so intermediate images are created as weighted sums of the principal images, as also described in Patent Document 1. For example, a total of six face images are prepared as the principal face images: five face images for the five phonemes /a/, /i/, /u/, /e/, and /o/, and one face image for the phoneme /N/. As described later, the face image for /N/ is the base for the other face images and is also called the "expressionless face image" in this specification. Each of the five phonemes /a/ to /o/ is assigned to its own face image, and each of the remaining phonemes is assigned to one of the six face images. This assignment is hereinafter called the mapping from phonemes to face images.
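The mapping from phonemes to face images can be pictured as a lookup table. The following is a minimal sketch, not the table used in the patent: the five vowels and /N/ come from the text, while the consonant assignments, the name `PHONEME_TO_VISEME`, and the fallback to /N/ are hypothetical placeholders chosen only for illustration.

```python
# Face images (visual elements) named in the text: five vowels plus /N/.
VISEMES = ("a", "i", "u", "e", "o", "N")

# The vowel and /N/ entries follow the text. The consonant entries are
# ASSUMED examples; the passage does not give the actual assignments.
PHONEME_TO_VISEME = {
    "a": "a", "i": "i", "u": "u", "e": "e", "o": "o", "N": "N",
    "m": "N", "p": "N", "b": "N",  # assumed: closed-lip consonants
    "w": "u",                      # assumed
    "y": "i",                      # assumed
}


def map_phonemes(phonemes, default="N"):
    """Return the face-image label for each phoneme.

    Phonemes missing from the table fall back to `default` (an assumption).
    """
    return [PHONEME_TO_VISEME.get(p, default) for p in phonemes]
```

For instance, `map_phonemes(["a", "m", "o"])` yields the labels of the face images to be placed at the corresponding key frames.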

FIG. 2 shows examples of the face images used. The face images include an expressionless face image 80, which is the base for all the other face images, and the face images 60 to 68 for /a/ to /o/ described above. Face images 60 to 68 are generated by pasting a prepared facial texture onto a wireframe image. The wireframe images of face images 60 to 68 and 80 are each defined by the three-dimensional coordinates of the vertices that make up the wireframe. The vertex coordinates of the base expressionless face image 80 are defined in advance, whereas the vertex coordinates of face images 60 to 68 are defined as coordinates relative to expressionless face image 80. A face model consisting of the set of vertex coordinates of one of face images 60 to 68 and 80 is hereinafter called a "visual element" (a viseme).

When an animation is created from face images prepared in this way, the following manual procedure has conventionally been used. While listening to the speech, an operator assigns the /a/ face image to the moment at which an /a/ sound is uttered, the /o/ face image to the moment at which an /o/ sound is uttered, and so on, by hand, for every frame that appears to need such an assignment. A frame to which the face image for a particular uttered sound has been assigned in this way is called a "key frame".

Next, based on the key frames assigned in this way, the face image at any time between key frames is synthesized by blending the face images assigned to the two key frames on either side of that time.

FIG. 3 shows an example of key frame assignment. In this example, face image 60, representing /a/, is assigned to two key frames, as indicated by vertical bars 100 and 102. Similarly, face images 62, 64, 66, and 68 are each assigned to one frame, as indicated by vertical bars 110, 120, 130, and 140, respectively.

The face image at each of these frames (key frames) is created so as to match the designated face image. At every other frame, the face image is created by blending the face images of the two key frames on either side of that frame. The "weighted sum" of Patent Document 1 is the corresponding concept. Graphs 104, 112, 122, 132, and 142 in FIG. 3 show the blend rates of face images 60 to 68, respectively. In a section where a face image's blend rate is 0, that face image is not used in creating the animation. In a section where its blend rate is not 0, the face image multiplied by its blend rate is added to the other face images multiplied by their blend rates to create the face image.
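The blend-rate graphs described above can be approximated by ramping each face image's blend rate between neighboring key frames. The sketch below assumes linear ramps, which the text does not state, and a hypothetical key frame representation `(time, label, blend_rate)` with strictly increasing times.

```python
def blend_at(keyframes, t):
    """Return {face-image label: blend rate} at time t.

    keyframes: list of (time, label, blend_rate), sorted by strictly
    increasing time. Linear interpolation is an ASSUMPTION of this sketch.
    """
    first, last = keyframes[0], keyframes[-1]
    if t <= first[0]:
        return {first[1]: first[2]}
    if t >= last[0]:
        return {last[1]: last[2]}
    for (t0, v0, r0), (t1, v1, r1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)  # 0 at the earlier key frame, 1 at the later
            weights = {}
            # The earlier face image fades out while the later one fades in.
            weights[v0] = weights.get(v0, 0.0) + (1.0 - w) * r0
            weights[v1] = weights.get(v1, 0.0) + w * r1
            return weights
```

With key frames for /a/ at time 0 and /i/ at time 1, both at a blend rate of 100%, evaluating at t = 0.35 gives roughly 65% /a/ and 35% /i/.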

The blend rate expresses an intermediate face image as the proportion of the feature-point displacement from face image /N/ to a specific face image, with face image /N/ taken as 0% and the specific face image as 100%. Accordingly, when face images /A/, /I/, /U/, /E/, and /O/ are assigned to phonemes unchanged, the blend rate of each is 100%. Face image /A/ at a blend rate of 50% is a face image in which each feature point has moved from its position in face image /N/ by 50% of the displacement it has in face image /A/. If the displacement of each feature point is expressed as a vector starting at its position in face image /N/, a face image at blend rate B% is one in which each feature-point vector has the same direction as in the face image at blend rate 100% and a length scaled down to B% of it.
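The definition of the blend rate can be restated in code. The following minimal sketch assumes that each face model stores per-vertex offsets relative to the expressionless face /N/, as the text describes, and that blend rates are given as fractions (0.5 for 50%). The function name and data layout are illustrative, not taken from the patent.

```python
def blend_face(neutral, offsets, blend):
    """Compute blended vertex positions.

    neutral: list of (x, y, z) vertex positions of the expressionless face /N/.
    offsets: dict label -> list of (dx, dy, dz), displacements relative to /N/.
    blend:   dict label -> blend rate as a fraction (1.0 means 100%).
    """
    result = []
    for i, (x, y, z) in enumerate(neutral):
        for label, rate in blend.items():
            dx, dy, dz = offsets[label][i]
            # Same direction as the 100% displacement, length scaled by the rate.
            x += rate * dx
            y += rate * dy
            z += rate * dz
        result.append((x, y, z))
    return result
```

Passing a single label with rate 0.5 reproduces the "50% /A/" face of the text, and passing two labels reproduces the weighted sum of two face images.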

FIG. 4 shows examples of face images created by blending in this way. FIG. 4(A) shows the face image when the blend rate of the /a/ face image is 100%, and FIG. 4(D) shows the face image when the blend rate of the /i/ face image is 100%. FIG. 4(B) shows the face image at blend rates of 65% for /a/ and 35% for /i/, and FIG. 4(C) shows the face image at blend rates of 35% for /a/ and 65% for /i/.

As can be seen from FIGS. 4(A) to 4(D), blending two face images on the model at varying blend rates creates new face images that are intermediate between the two.

Patent Document 1: Japanese Patent Laid-Open No. 7-44727
Linde Y., Buzo A., Gray R., "An algorithm for vector quantizer design," IEEE Transactions on Communications, COM-28 (1980), 84-95.

When face-image animation is to be created automatically with the conventional technique described above, the problems are where to set the key frames and how to set their blend rates. Conventionally, both have been done by hand, and the resulting animation is of quite high quality. However, no technique has been known that can set the key frames and their blend rates automatically and still create animation as smooth as that obtained by hand.

Setting the key frames and setting the blend rates are the most important and most skill-intensive tasks in creating animation by blending as described above, and a technique for automating them is desired.

Moreover, unlike live-action film, animation is not simply a matter of obtaining smooth motion. Conventional hand-drawn animation, for example, has few frames per unit time, so its motion is jerky, yet some people regard this weakness as part of the appeal of animation. It is desirable that lip sync animation, too, be able to reproduce the motion of hand-drawn animation when required.

Furthermore, with the globalization of culture, Japanese-language animation is increasingly being created abroad, and in the future animation created in Japanese may also be adapted for broadcast in other countries. Conventionally this has been done by so-called dubbing, as with films, but with dubbing the mouth movements inevitably fail to match the speech. Lip sync animation handles this problem well, because the speech is recorded first and the animation is then created to match it. In that case, however, the resources needed to create the animation must be prepared for the sounds used in each language, and it is desirable to keep such preparation to a minimum.

Accordingly, an object of the present invention is to provide a lip sync animation creation apparatus that, when creating face-image animation from speech data of human utterances, can automatically set key frames and their blend rates so that smooth and natural animation is obtained.

Another object of the present invention is to provide a lip sync animation creation apparatus that, when creating face-image animation from speech data of human utterances, can automatically set key frames and their blend rates so that either smooth and natural animation or animation with jerky motion is obtained, as required.

Still another object of the present invention is to provide a lip sync animation creation apparatus that, when creating face-image animation matched to the speech of each language from speech data of human utterances in multiple languages, can automatically set key frames and their blend rates so that smooth and natural animation is obtained while keeping the amount of work as small as possible.

A lip sync animation creation apparatus according to a first aspect of the present invention creates lip sync animation from input speech data, for which a transcription is available, using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visual elements prepared in advance, and face models of a plurality of face images prepared in advance and corresponding to the visual elements. The apparatus includes visual element sequence creation means for using the statistical acoustic model, the mapping definition, and the transcription to determine the phonemes contained in the speech data and the corresponding visual elements, and for creating a visual element sequence with durations, to which default blend rates are assigned. A key frame is defined at a predetermined position within the duration of each visual element of the visual element sequence, and the key frames so defined constitute a key frame sequence. The apparatus further includes key frame deletion means for deleting a predetermined proportion of the key frames in the key frame sequence, in descending order of the speed of change, between the key frame and an adjacent key frame, of the face model corresponding to the visual element, and blend processing means for creating face-image animation by blending between key frames based on the key frame sequence from which some key frames have been deleted by the key frame deletion means.

The visual element sequence creation means creates a visual element sequence from the speech data using the statistical acoustic model, the mapping definition, and the transcription. Each visual element in the sequence has a duration. A key frame is defined at a predetermined position within the duration of each visual element, and the key frames so defined constitute a key frame sequence. Animation could be created by blending from these key frames as they stand, but its motion would be unnatural. The key frame deletion means therefore deletes a predetermined proportion of the key frames in the key frame sequence, in descending order of the speed of change of the face model between the key frame and an adjacent key frame. Because key frames are deleted where the motion would be fast, the motion of the resulting animation is natural even when the default blend rates are used. As a result, a lip sync animation creation apparatus can be provided that automatically sets key frames and their blend rates so that smooth and natural animation is obtained. The proportion may be adjustable.
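The key frame deletion step can be sketched as follows. Several details are assumptions of this sketch rather than statements of the text: the speed of change is taken as the largest vertex displacement between the face models of a key frame and the preceding key frame, divided by the time between them; speeds are recomputed after each deletion; and the first and last key frames are never deleted. The names `model_distance` and `prune_keyframes` are illustrative.

```python
import math


def model_distance(a, b):
    """Largest displacement of any vertex between two face models."""
    return max(math.dist(p, q) for p, q in zip(a, b))


def prune_keyframes(keyframes, models, fraction):
    """Delete `fraction` of the key frames, fastest-changing first.

    keyframes: list of (time, label), strictly increasing in time.
    models:    dict label -> list of (x, y, z) vertex positions.
    """
    kept = list(keyframes)
    n_delete = int(len(kept) * fraction)
    for _ in range(n_delete):
        if len(kept) <= 2:
            break  # keep at least the two end key frames (assumption)
        speeds = []
        for i in range(1, len(kept) - 1):
            (t0, v0), (t1, v1) = kept[i - 1], kept[i]
            speed = model_distance(models[v0], models[v1]) / (t1 - t0)
            speeds.append((speed, i))
        _, fastest = max(speeds)
        del kept[fastest]
    return kept
```

Setting `fraction` to 0 keeps every key frame, which corresponds to the unnatural, unpruned animation mentioned in the text; raising it smooths the motion further.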

Preferably, the key frame deletion means includes means for deleting a predetermined proportion of the key frames in the key frame sequence, in descending order of the speed of change between each feature point of the face model corresponding to the visual element of the key frame and the corresponding feature point of the face model corresponding to the visual element of the adjacent key frame.

Calculating the speed of change between adjacent key frames for each feature point of the face model increases the amount of computation but reduces the error in the result, so natural animation can be created.

More preferably, the lip sync animation creation apparatus further includes motion vector calculation means for calculating, for every combination of two face models selected from the plurality of face models, the motion vectors between the feature points of the two face models, means for creating clustered face models by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculation means and calculating a representative vector for each cluster, and clustered face model storage means for storing the clustered face models. The key frame deletion means includes movement amount calculation means for reading from the clustered face model storage means, for each key frame in the key frame sequence, the clustered face model corresponding to the combination of the visual element of that key frame and the visual element of the adjacent key frame, and for calculating the speed of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster, and means for deleting a predetermined proportion of the key frames from the key frame sequence in descending order of the speed of change calculated by the movement amount calculation means.

Motion vectors are obtained in advance for every combination of face models, and the feature points are classified into clusters by a predetermined clustering of those motion vectors, for example vector quantization clustering. The means for creating clustered face models calculates a representative vector for each cluster. For each key frame in the key frame sequence, the movement amount calculation means reads from the clustered face model storage means the clustered face model corresponding to the combination of the visual element of that key frame and the visual element of the adjacent key frame, and calculates the speed of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster. A predetermined proportion of the key frames are deleted from the key frame sequence in descending order of the calculated speed of change. Because the feature points belonging to one cluster are represented by a single representative point and the speed of change is calculated for that point, rather than for every feature point, the time required for the computation is reduced.
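The clustering idea can be sketched as follows. The text names vector quantization clustering as one example; this sketch substitutes plain k-means (Lloyd) iterations with a simple deterministic initialization as a stand-in, which is an assumption and not the patent's procedure. The function names and the use of the largest representative vector as the speed measure are also illustrative choices.

```python
import math


def motion_vectors(model_a, model_b):
    """Per-feature-point displacement from model_a to model_b."""
    return [tuple(q - p for p, q in zip(a, b)) for a, b in zip(model_a, model_b)]


def cluster_vectors(vectors, k, iterations=10):
    """Group motion vectors into at most k clusters (k-means stand-in).

    Returns (labels, representatives); each representative is the mean of
    the vectors assigned to its cluster.
    """
    step = max(1, len(vectors) // k)
    centers = [vectors[i] for i in range(0, len(vectors), step)][:k]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
            for v in vectors
        ]
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


def cluster_speed(representatives, dt):
    """Speed of change from representative vectors only (assumed measure)."""
    return max(math.hypot(*r) for r in representatives) / dt
```

Once the representatives for a pair of face models are stored, `cluster_speed` touches only k vectors per key frame pair instead of every feature point, which is the saving the text describes.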

Still more preferably, the lip sync animation creation apparatus further includes utterance power calculation means for receiving the key frame sequence from which some key frames have been deleted by the key frame deletion means and calculating, from the speech data, the utterance power of the phoneme corresponding to the visual element of each key frame in that key frame sequence, and utterance-power-based blend rate adjustment means for adjusting the blend rate of each key frame in the key frame sequence by a predetermined function under which the blend rate becomes smaller as the average utterance power, calculated by the utterance power calculation means over the duration of the visual element containing that key frame, becomes smaller. The blend processing means creates face-image animation by blending between key frames based on the key frame sequence whose blend rates have been adjusted by the utterance-power-based blend rate adjustment means.

Where the utterance power is low, the blend rate is reduced. In general, people do not open their mouths very distinctly when the utterance power is low. Adjusting the blend rate in this way therefore yields face-image animation whose motion is close to that of the speaker's mouth during the actual utterance. As a result, a lip sync animation creation apparatus can be provided that automatically sets key frames and their blend rates so that smooth and natural animation is obtained.
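The power-based adjustment can be illustrated with a simple monotone function. The text only requires that lower average power give a lower blend rate and does not specify the function, so the mean-square power measure, the linear scaling, the cap at 1, and the `reference_power` parameter below are all assumptions of this sketch.

```python
def mean_power(samples, rate, start, end):
    """Mean-square power of the samples between start and end (seconds)."""
    segment = samples[int(start * rate):int(end * rate)]
    if not segment:
        return 0.0
    return sum(s * s for s in segment) / len(segment)


def adjust_blend_by_power(blend, power, reference_power):
    """Scale the blend rate down when the power is below reference_power.

    Linear scaling capped at 1 is an ASSUMED choice of the
    'predetermined function' mentioned in the text.
    """
    scale = min(1.0, power / reference_power)
    return blend * scale
```

A key frame inside a quietly spoken visual element thus ends up with a smaller mouth opening than the same visual element spoken loudly.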

The lip sync animation creation apparatus may further include speed-of-change calculation means for receiving the key frame sequence from which some key frames have been deleted by the key frame deletion means and calculating the speed of change between the vertices of the face model corresponding to the visual element of each key frame and the vertices of the face model corresponding to the visual element of the adjacent key frame, and vertex-speed-based blend rate adjustment means for updating, with a predetermined function, the blend rate of each key frame in that key frame sequence whose speed of change calculated by the speed-of-change calculation means exceeds a predetermined threshold, so that the blend rate takes a smaller value. The blend processing means creates face-image animation by blending between key frames based on the key frame sequence whose blend rates have been adjusted by the vertex-speed-based blend rate adjustment means.
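The vertex-speed-based adjustment can be sketched in the same spirit. The text states only that key frames whose speed of change exceeds a threshold receive a smaller blend rate through a predetermined function. The inverse-proportional scaling used here is an assumed example of such a function, and the function name is illustrative.

```python
def adjust_blend_by_speed(blend, speed, threshold):
    """Reduce the blend rate of a key frame whose vertices move too fast.

    Key frames at or below the threshold are left unchanged. Above it, the
    blend rate is scaled by threshold / speed (an ASSUMED function), which
    brings the resulting speed of change back down to the threshold.
    """
    if speed <= threshold:
        return blend
    return blend * threshold / speed
```

Because the displacement of each vertex is proportional to the blend rate, scaling the blend rate by `threshold / speed` limits how fast the mouth has to move to reach that key frame.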

Preferably, the lip sync animation creation apparatus further includes motion vector calculation means for calculating, for every combination of two face models selected from the plurality of face models, the motion vectors between the feature points of the two face models, means for creating clustered face models by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculation means and calculating a representative vector for each cluster, and clustered face model storage means for storing the clustered face models. The apparatus further includes speed-of-change calculation means for receiving the key frame sequence from which some key frames have been deleted by the key frame deletion means, reading from the clustered face model storage means, for each key frame, the clustered face model corresponding to the combination of the visual element of that key frame and the visual element of the adjacent key frame, and calculating the speed of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster, and vertex-speed-based blend rate adjustment means for updating, with a predetermined function, the blend rate of each key frame in that key frame sequence whose speed of change calculated by the speed-of-change calculation means exceeds a predetermined threshold, so that the blend rate takes a smaller value. The blend processing means creates face-image animation by blending between key frames based on the key frame sequence whose blend rates have been adjusted by the vertex-speed-based blend rate adjustment means.

A lip-sync animation creating apparatus according to a second aspect of the present invention creates a lip-sync animation from input utterance data using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visemes prepared in advance, and face models of a plurality of face images prepared in advance, where a transcription of the utterance data is available. The apparatus includes viseme sequence creation means for determining the phonemes contained in the utterance data and their corresponding visemes using the statistical acoustic model, the mapping definition, and the transcription, and for creating a viseme sequence with durations in which each viseme is given a default blend rate. A key frame is defined at a predetermined position within the duration of each viseme of the viseme sequence, and these key frames define a key frame sequence. The apparatus further includes: speech power calculation means for calculating, from the utterance data, the speech power of the phoneme corresponding to the viseme of each key frame in the key frame sequence; speech-power-based blend rate adjustment means for adjusting the blend rate of each key frame in the key frame sequence by a predetermined function such that the smaller the average speech power calculated by the speech power calculation means over the duration of the viseme containing that key frame, the smaller the blend rate; and blend processing means for creating a face image animation by blending between key frames based on the viseme sequence whose blend rates have been adjusted by the blend rate adjustment means.
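The power-based adjustment in this aspect can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only requires "a predetermined function" that lowers the blend rate as average speech power falls, so the mean-square power measure, the clamped linear ramp, and the names `mean_power`, `power_to_blend`, `adjust_by_power`, `p_min`, and `p_max` are all assumptions introduced here.

```python
def mean_power(samples, start, end):
    """Average power (mean of squared amplitude) of samples[start:end]."""
    segment = samples[start:end]
    if not segment:
        return 0.0
    return sum(s * s for s in segment) / len(segment)


def power_to_blend(power, p_min, p_max, default_blend=1.0):
    """Monotonic map: lower average power gives a lower blend rate.

    A linear ramp between p_min and p_max, clamped to [0, default_blend].
    The linear shape is an assumption; any non-decreasing function would
    match the description.
    """
    if p_max <= p_min:
        raise ValueError("p_max must exceed p_min")
    ratio = (power - p_min) / (p_max - p_min)
    ratio = max(0.0, min(1.0, ratio))
    return default_blend * ratio


def adjust_by_power(keyframes, samples, p_min, p_max):
    """Return copies of the key frames with power-scaled blend rates.

    Each key frame is a dict with 'viseme', 'start', 'end' (sample
    indices of the viseme's duration) and 'blend'.
    """
    adjusted = []
    for kf in keyframes:
        power = mean_power(samples, kf["start"], kf["end"])
        new_kf = dict(kf)
        new_kf["blend"] = power_to_blend(power, p_min, p_max, kf["blend"])
        adjusted.append(new_kf)
    return adjusted
```

With this mapping a quietly spoken viseme (visual element) opens the mouth less than a loudly spoken one, which is the behavior the aspect describes.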

Preferably, the lip-sync animation creating apparatus further includes: rate-of-change calculation means for receiving the key frame sequence whose blend rates have been adjusted by the speech-power-based blend rate adjustment means, and calculating the rate of change between the vertices constituting the face model corresponding to the viseme of each key frame in that sequence and the vertices constituting the face model corresponding to the viseme of the adjacent key frame; and vertex-speed-based blend rate adjustment means for updating the blend rate, using a predetermined function that makes it smaller, for each key frame in that sequence whose rate of change calculated by the rate-of-change calculation means exceeds a predetermined threshold. The blend processing means creates the face image animation by blending between key frames based on the key frame sequence whose blend rates have been adjusted by the vertex-speed-based blend rate adjustment means.

More preferably, the lip-sync animation creating apparatus further includes: motion vector calculation means for calculating, for every combination of two face models selected from the plurality of face models, motion vectors between the feature points constituting the face models; means for creating clustered face models by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculation means, and calculating a representative vector for each cluster; and clustered face model storage means for storing the clustered face models. The apparatus further includes: rate-of-change calculation means for receiving the key frame sequence whose blend rates have been adjusted by the speech-power-based blend rate adjustment means, reading from the clustered face model storage means, for each key frame, the combination of clustered face models corresponding to the combination of that key frame's viseme and the adjacent key frame's viseme, and calculating the rate of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster; and vertex-speed-based blend rate adjustment means for updating the blend rate, using a predetermined function that makes it smaller, for each key frame in the key frame sequence whose rate of change calculated by the rate-of-change calculation means exceeds a predetermined threshold. The blend processing means creates the face image animation by blending between key frames based on the key frame sequence whose blend rates have been adjusted by the vertex-speed-based blend rate adjustment means.

A lip-sync animation creating apparatus according to a third aspect of the present invention creates a lip-sync animation from input utterance data using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visemes prepared in advance, and face models of a plurality of face images prepared in advance, where a transcription of the utterance data is available. The apparatus includes viseme sequence creation means for determining the phonemes contained in the utterance data and their corresponding visemes using the statistical acoustic model, the mapping definition, and the transcription, and for creating a viseme sequence with durations in which each viseme is given a default blend rate. A key frame is defined within the duration of each viseme in the viseme sequence, and these key frames define a key frame sequence. The apparatus further includes: rate-of-change calculation means for calculating the rate of change between the vertices constituting the face model corresponding to the viseme of each key frame in the key frame sequence and the vertices constituting the face model corresponding to the viseme of the adjacent key frame; vertex-speed-based blend rate adjustment means for updating the blend rate, using a predetermined function that makes it smaller, for each key frame in the key frame sequence whose rate of change calculated by the rate-of-change calculation means exceeds a predetermined threshold; and blend processing means for creating a face image animation by blending between key frames based on the key frame sequence whose blend rates have been adjusted by the vertex-speed-based blend rate adjustment means.
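The vertex-speed adjustment of this aspect can be illustrated with a short sketch. It assumes that the "rate of change" is the largest per-vertex displacement divided by the time between key frames, and that the "predetermined function" scales the blend rate by `threshold / speed`; neither choice is stated in the text, and the names `max_vertex_speed` and `damp_blend` are hypothetical.

```python
import math


def max_vertex_speed(model_a, model_b, dt):
    """Largest per-vertex displacement speed between two face models.

    model_a and model_b are equal-length lists of (x, y, z) vertices;
    dt is the time between the two key frames in seconds.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return max(math.dist(a, b) for a, b in zip(model_a, model_b)) / dt


def damp_blend(blend, speed, threshold):
    """Lower the blend rate when the speed exceeds the threshold.

    Scaling by threshold / speed is an assumed choice of function; it
    leaves slow transitions untouched and shrinks fast ones more strongly
    the faster they are.
    """
    if speed <= threshold:
        return blend
    return blend * threshold / speed
```

A transition whose fastest vertex moves at twice the threshold speed would have its blend rate halved under this particular function, which keeps the mouth from snapping between distant shapes.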

A lip-sync animation creating apparatus according to a fourth aspect of the present invention creates a lip-sync animation from input utterance data using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visemes prepared in advance, and face models of a plurality of face images prepared in advance, where a transcription of the utterance data is available. The apparatus includes: motion vector calculation means for calculating, for every combination of two face models selected from the plurality of face models, motion vectors between the feature points constituting the face models; means for creating clustered face models by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculation means, and calculating a representative vector for each cluster; clustered face model storage means for storing the clustered face models; and key frame sequence creation means for determining the phonemes contained in the utterance data and their corresponding visemes using the statistical acoustic model, the mapping definition, and the transcription, and for creating a key frame sequence with durations to which a default blend rate is given. A key frame is defined within the duration of each viseme in the viseme sequence, and these key frames define the key frame sequence. The apparatus further includes: rate-of-change calculation means for receiving the key frame sequence, reading from the clustered face model storage means, for each key frame, the combination of clustered face models corresponding to the combination of that key frame's viseme and the adjacent key frame's viseme, and calculating the rate of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster; vertex-speed-based blend rate adjustment means for updating the blend rate, using a predetermined function that makes it smaller, for each key frame in the key frame sequence whose rate of change calculated by the rate-of-change calculation means exceeds a predetermined threshold; and blend processing means for creating a face image animation by blending between key frames based on the key frame sequence whose blend rates have been adjusted by the vertex-speed-based blend rate adjustment means.
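The clustering step can be sketched as below. The text leaves the clustering method open ("a predetermined clustering method"), so a simple greedy scheme stands in for it here: a motion vector joins the first cluster whose running mean lies within `radius`, and the cluster mean serves as the representative vector. The function names and the `radius` parameter are assumptions made for illustration.

```python
import math


def motion_vectors(model_a, model_b):
    """Per-feature-point displacement from model_a to model_b."""
    return [tuple(q - p for p, q in zip(a, b))
            for a, b in zip(model_a, model_b)]


def cluster_vectors(vectors, radius):
    """Greedy clustering of motion vectors.

    Returns (labels, representatives): labels[i] is the cluster index of
    feature point i, and representatives[k] is the mean motion vector of
    cluster k.
    """
    counts = []   # number of members per cluster
    means = []    # running mean vector per cluster
    labels = []
    for v in vectors:
        for k, m in enumerate(means):
            if math.dist(v, m) <= radius:
                counts[k] += 1
                n = counts[k]
                means[k] = tuple(mc + (vc - mc) / n for mc, vc in zip(m, v))
                labels.append(k)
                break
        else:
            # No existing cluster is close enough: start a new one.
            counts.append(1)
            means.append(tuple(float(c) for c in v))
            labels.append(len(means) - 1)
    return labels, means
```

Computing the rate of change once per cluster from its representative vector, rather than once per feature point, is what reduces the work in this aspect; feature points that move together, such as those along the lower lip, would typically share a cluster.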

Preferably, the lip-sync animation creating apparatus further includes utterance end correction means for correcting the utterance end position. Among the key frames in the key frame sequence output by the viseme sequence creation means, it takes the key frame immediately preceding a key frame to which the viseme corresponding to the blank (silent) phoneme is assigned, and moves the end of that key frame's duration to a position that is at or after the maximum point of the speech power sequence of the utterance data within that key frame and still within that key frame's duration. The key frame deletion means receives as input the key frame sequence whose utterance ends have been corrected by the utterance end correction means.

The end position is corrected for the key frame immediately preceding a key frame to which the viseme corresponding to the blank phoneme is assigned. The corrected end is placed at or after the maximum point of the speech power sequence within that key frame. Moving the end earlier than its original position in this way makes the mouth shape at the end of the utterance switch sooner to the viseme for the blank phoneme, so the speech animation looks natural.

More preferably, the utterance end correction means includes: means for detecting a first time that gives the maximum speech power in the key frame immediately preceding a key frame to which the viseme corresponding to the blank phoneme is assigned, among the key frames in the key frame sequence output by the viseme sequence creation means; means for detecting a second time, at or after the first time and at or before the end time of the key frame being processed, at which the speech power has fallen from its maximum by a predetermined ratio; and means for correcting the key frame so as to move the end position of the key frame being processed to the second time.

The end position of the key frame is moved to the second time, which lies at or after the first time giving the maximum speech power and at which the speech power has fallen from the maximum by a predetermined ratio. Because the new end position is determined by the rate of decay from the maximum, independently of the absolute magnitude of the speech power in each key frame, the mouth closes with stable timing at the end of the utterance regardless of variations in speech power.
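The two-step search for the corrected end position can be sketched as follows. The sketch assumes a per-frame speech power sequence and treats the "predetermined ratio" as a fraction `drop_ratio` of the peak; the function name `corrected_end` and the fallback of keeping the original end when no such point exists are assumptions made for illustration.

```python
def corrected_end(power, start, end, drop_ratio):
    """Return the frame index to use as the new end of a key frame.

    power      -- per-frame speech power sequence
    start, end -- the key frame spans frames start .. end - 1
    drop_ratio -- fraction of the peak by which power must fall (0..1)
    """
    if not 0 <= start < end <= len(power):
        raise ValueError("invalid key frame span")
    segment = power[start:end]
    peak = max(segment)
    t1 = start + segment.index(peak)   # first time: power maximum
    target = peak * (1.0 - drop_ratio)
    for t2 in range(t1, end):          # second time: at or after t1
        if power[t2] <= target:
            return t2
    return end                         # no such point: keep the original end
```

Because `target` is defined relative to the peak of the key frame itself, a loud and a quiet utterance with the same decay shape produce the same corrected end.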

Still more preferably, the key frame creation means selects, when creating the key frame sequence, any of the frames at a first frame rate as key frames. The lip-sync animation creating apparatus further includes frame rate conversion means, connected to receive an input designating a second frame rate lower than the first frame rate and the key frame sequence output by the key frame deletion means, for converting the key frame sequence output by the key frame deletion means into a key frame sequence at the second frame rate. The frame rate conversion means assigns to each key frame of the second-frame-rate sequence one of the visemes assigned to those key frames, in the sequence output by the key frame deletion means, whose start falls within the duration of that key frame. The blend processing means includes means for creating a face image animation by blending between key frames based on the key frame sequence whose frame rate has been converted by the frame rate conversion means.

The key frame creation means creates the key frame sequence using arbitrary frames among the frames at the first frame rate. When a second frame rate lower than the first is designated, the frame rate conversion means converts the first-frame-rate key frame sequence into a second-frame-rate key frame sequence. In doing so, several key frames of the first-frame-rate sequence may correspond to a single key frame of the second-frame-rate sequence. In that case, the frame rate conversion means assigns to the converted key frame one of the visemes of those first-frame-rate key frames whose start falls within the duration of the second-frame-rate key frame. Because each key frame of the second-frame-rate sequence is always assigned the viseme of a key frame whose start lies within its duration, the mouth shape begins to change according to the viseme before the actual sound is uttered. This order matches the order observed in real human speech, so a face image animation that speaks naturally is obtained.

The frame rate conversion means may assign visemes so that the viseme assigned to each key frame of the second-frame-rate key frame sequence differs from the viseme assigned to the immediately preceding key frame.

When consecutive key frames are assigned the same viseme, the same mouth shape persists for a long time, which looks unnatural for a face that is speaking. Assigning each key frame a viseme different from that of the immediately preceding key frame avoids this and yields a more natural face image animation.
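The frame rate conversion and the "differs from the preceding key frame" rule can be sketched together. The sketch assumes that the first frame rate is an integer multiple of the second (`step` first-rate frames per second-rate frame) and that a slot whose only candidates repeat the preceding viseme (visual element) simply emits no key frame; both are assumptions, as is the name `convert_frame_rate`.

```python
def convert_frame_rate(keyframes, step):
    """Convert key frames to a lower frame rate.

    keyframes -- list of (frame_index, viseme) at the first frame rate,
                 sorted by frame_index
    step      -- first-rate frames per second-rate frame (assumed integer)

    Returns a list of (slot_index, viseme) at the second frame rate.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # Group source key frames by the low-rate slot containing their start.
    slots = {}
    for frame, viseme in keyframes:
        slots.setdefault(frame // step, []).append(viseme)

    result = []
    previous = None
    for slot in sorted(slots):
        # Pick a candidate that differs from the preceding key frame.
        choice = next((v for v in slots[slot] if v != previous), None)
        if choice is None:
            continue  # only repeats of the previous viseme: emit nothing
        result.append((slot, choice))
        previous = choice
    return result
```

Since every candidate for a slot starts inside that slot, the chosen mouth shape is shown from the slot's first frame, at or slightly before the moment the corresponding sound begins.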

More preferably, the blend processing means has a function of generating an image for each frame at a third frame rate higher than the second frame rate when creating an animation from the second-frame-rate key frame sequence, and a function of generating the images of frames between adjacent key frames by interpolation between those key frames. The lip-sync animation creating apparatus further includes key frame copy means for copying, for each key frame in the second-frame-rate key frame sequence output by the frame rate conversion means, the same key frame to a frame position between that key frame and the key frame immediately following it.

Still more preferably, the key frame copy means includes means for copying, for each key frame in the second-frame-rate key frame sequence output by the frame rate conversion means, the same key frame to the frame position immediately before the key frame that immediately follows it.

When the blend processing means generates frames at the third frame rate between two adjacent second-frame-rate key frames, and creates the images of those frames by interpolating between the two key frames, smoothly changing third-frame-rate frames are inserted between the two key frames. Such interpolation makes the image change smoothly, but it cannot produce the "limited animation" look sometimes wanted in animation (footage that changes in a jerky, step-like manner). In that case, the earlier of the two adjacent key frames is copied unchanged to the frame position immediately before the later one. As a result, the image stays fixed from the earlier key frame to the copied frame even when the blend processing means interpolates, and changes only at the next key frame immediately afterward. An animation with a limited-animation look can therefore be created even when the frame sequence is generated at a third frame rate higher than the second frame rate and a blend processing means that interpolates the frame images between adjacent key frames is used as is.
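The key frame copy can be sketched in a few lines. The sketch assumes key frames are given as `(frame_index, viseme, blend)` tuples indexed at the output (third) frame rate; the name `add_hold_keyframes` is hypothetical.

```python
def add_hold_keyframes(keyframes):
    """Insert a copy of each key frame just before the next key frame.

    keyframes -- list of (frame_index, viseme, blend) tuples at the
                 output frame rate, sorted by frame_index
    """
    result = []
    for current, following in zip(keyframes, keyframes[1:]):
        result.append(current)
        hold_frame = following[0] - 1
        if hold_frame > current[0]:
            # Same pose repeated, so interpolation between the key frame
            # and its copy produces an unchanging image.
            result.append((hold_frame,) + tuple(current[1:]))
    if keyframes:
        result.append(keyframes[-1])
    return result
```

An interpolating renderer fed this sequence holds each pose until one frame before the next key frame and then jumps, which gives the step-like motion without changing the renderer itself.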

Still more preferably, the lip-sync animation creating apparatus further includes face model storage means for storing the face models of the plurality of face images.

The face models of the plurality of face images can be stored in the face model storage means. Even when animations are created repeatedly, the same face models can be reused any number of times without receiving them from outside each time.

Still more preferably, the phonemes prepared in advance include predetermined standard phonemes and general phonemes other than the standard phonemes, and the face models of the plurality of face images include standard viseme models, which are face models corresponding to the standard phonemes, and general viseme models, which are face models corresponding to the general phonemes. The lip-sync animation creating apparatus further includes general viseme generation means for generating the general viseme models using the standard viseme models and capture data, the capture data consisting of measured three-dimensional positions of the feature points of a speaker's face image while the speaker utters each phoneme, classified in advance by the corresponding phoneme.

If only the standard viseme models are created by hand in advance and capture data of an actual speaker's face during speech is prepared, the apparatus automatically generates the general viseme models, that is, those other than the standard viseme models, by means of the general viseme generation means. This reduces the manual work needed to create face models and yields a smoother, more natural face image animation in which mouth movement matches the speech.

Still more preferably, the general viseme generation means includes: coefficient calculation means for calculating, so as to minimize a predetermined approximation error, as many coefficients as there are standard phonemes, the coefficients being used to approximate the capture data corresponding to a general phoneme by a linear sum of the capture data corresponding to the standard phonemes; and linear sum calculation means for calculating each general viseme model as a linear sum of the standard viseme models, using the coefficients calculated by the coefficient calculation means for the general phoneme corresponding to that general viseme model, and storing it in the face model storage means together with the standard viseme models, in association with the corresponding general phoneme.

The apparatus generates each general viseme model as the linear sum of standard viseme models that minimizes the approximation error. Because face images for each phoneme can be generated using the general viseme models as well as the standard viseme models, a smooth and natural face image animation is obtained.
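The coefficient fit and the linear sum can be sketched as below. The text speaks only of "a predetermined approximation error"; this sketch assumes squared error, solved through the normal equations with a small Gauss-Jordan routine so that it needs no external library. The names `solve`, `fit_coefficients`, and `linear_sum` are hypothetical, and each capture or model is assumed to be a flat list of feature-point coordinates.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_coefficients(standard_captures, target_capture):
    """Least-squares c with sum_k c[k] * standard_captures[k] ~ target."""
    k = len(standard_captures)
    gram = [[sum(x * y for x, y in zip(standard_captures[i],
                                       standard_captures[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(x * y for x, y in zip(standard_captures[i], target_capture))
           for i in range(k)]
    return solve(gram, rhs)


def linear_sum(models, coeffs):
    """Apply fitted coefficients to the standard viseme models."""
    return [sum(c * m[d] for c, m in zip(coeffs, models))
            for d in range(len(models[0]))]
```

The coefficients are fitted on the capture data but applied to the hand-made standard models, so the generated general model inherits the character's own geometry rather than the captured speaker's.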

A computer program according to a fifth aspect of the present invention, when executed by a computer, causes the computer to function as any of the lip-sync animation creating apparatuses described above.

A face model generation apparatus according to a sixth aspect of the present invention generates face models of face images corresponding to visemes, using a mapping definition, prepared in advance, between phonemes and visemes. The phonemes prepared in advance include predetermined standard phonemes and general phonemes other than the standard phonemes, and the face models of the plurality of face images include standard viseme models, which are face models corresponding to the standard phonemes, and general viseme models, which are face models corresponding to the general phonemes. The face model generation apparatus includes: face model storage means for storing the face models of the plurality of face images corresponding to the visemes; and general viseme generation means for generating the general viseme models using the standard viseme models and capture data, the capture data consisting of measured three-dimensional positions of the feature points of a speaker's face image while the speaker utters each phoneme, classified in advance by the corresponding phoneme.

Preferably, the general viseme generation means includes: coefficient calculation means for calculating, so as to minimize a predetermined approximation error, as many coefficients as there are standard phonemes, the coefficients being used to approximate the capture data corresponding to a general phoneme by a linear sum of the capture data corresponding to the standard phonemes; and linear sum calculation means for calculating each general viseme model as a linear sum of the standard viseme models, using the coefficients calculated by the coefficient calculation means for the general phoneme corresponding to that general viseme model, and storing it in the face model storage means together with the standard viseme models, in association with the corresponding general phoneme.

The present invention is described below based on embodiments. The following description uses six basic face images, but the number of face images is not limited to six; it may be fewer or more.

[First Embodiment]
<Configuration>

FIG. 5 is a schematic block diagram of a lip-sync animation creating apparatus 200 according to the first embodiment of the present invention, as one example of the animation creating apparatus according to the present invention. Referring to FIG. 5, lip-sync animation creating apparatus 200 receives as input the speech data of an utterance stored in utterance storage unit 152 and the transcription (written-out text) of that utterance stored in transcription storage unit 154, and creates a face image animation 260 using 3D character models, stored in character model storage unit 156, that correspond to six basic face images, /a/ through /o/ and /N/.

FIG. 7 shows examples of the face images stored in character model storage unit 156. FIGS. 7(A) to 7(F) are the face images corresponding to the phonemes /a/, /i/, /u/, /n/, /e/, and /o/, respectively. In this specification, these images are denoted face images /A/, /I/, /U/, /N/, /E/, and /O/, respectively.

In the present embodiment, each of the face images /A/, /I/, /U/, /E/, and /O/ is defined relative to the face image /N/: for each feature point, three-dimensional vector information indicates how far that point has moved, in the three-dimensional space in which the face images are defined, from the corresponding feature point of the face image /N/. An intermediate face image can therefore be defined, for example, between the face image /A/ and the face image /N/. In the present embodiment, the concept of the "blend rate" described above is used to define an intermediate face image between a specific face image and the face image /N/.
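As an illustration of how a face defined by offsets from /N/ can be blended, the following minimal sketch assumes that the blend rate scales each vertex offset linearly (0 gives /N/, 1.0, i.e. 100%, gives the full target face). The function name `blend_face` and the data layout are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def blend_face(neutral: List[Vec3], displacement: List[Vec3],
               blend_rate: float) -> List[Vec3]:
    """Return vertex positions for a face between /N/ and a target face.

    neutral      -- vertex positions of the neutral face /N/
    displacement -- per-vertex 3D offsets of the target face relative to /N/
    blend_rate   -- 0.0 gives /N/, 1.0 (100%) gives the full target face
                    (linear scaling is an assumption of this sketch)
    """
    return [
        (nx + blend_rate * dx, ny + blend_rate * dy, nz + blend_rate * dz)
        for (nx, ny, nz), (dx, dy, dz) in zip(neutral, displacement)
    ]


# Two-vertex toy face: the /A/ offsets move the vertices apart along Z.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
offsets_a = [(0.0, 0.0, -2.0), (0.0, 0.0, 2.0)]
half_open = blend_face(neutral, offsets_a, 0.5)
```

Storing offsets rather than absolute positions means that the neutral face is recovered simply by setting the blend rate to zero.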

Blending between two face images is as described above.

The lip sync animation creating apparatus 200 includes: an acoustic model storage unit 170 for storing an acoustic model created in advance from the speaker's voice; a phoneme-visual element mapping table storage unit 176 for storing a mapping table, prepared in advance, between phonemes and visual elements (visemes); a visual element sequence creation unit 230; and a visual element sequence storage unit 180 for storing the visual element sequence output by the visual element sequence creation unit 230. The visual element sequence creation unit 230 uses the acoustic model and the phoneme-visual element mapping table to perform phoneme segmentation of the utterance data based on the transcription stored in the transcription storage unit 154, thereby creating a phoneme sequence, and then converts each phoneme in that sequence into the corresponding visual element by means of the mapping table, thereby creating a visual element sequence with durations. The duration of a visual element starts at the beginning of the duration of the corresponding phoneme. Therefore, in the visual element sequence stored in the visual element sequence storage unit 180, the first frame of each visual element is a key frame, and the key frames in the visual element sequence constitute a key frame sequence. The visual element sequence creation unit 230 attaches to each visual element the phoneme it replaced and a default blend rate (for example, 100%).

The lip sync animation creating apparatus 200 further includes: a clustering processing unit 232 that performs VQ (vector quantization) clustering on the vertices constituting each face image of the 3D character model stored in the character model storage unit 156, using the motion vectors between any two face images, and outputs motion vector data, in which the motion of each vertex between the two face images is represented by the representative vector of the cluster to which the vertex belongs, together with the face image models after clustering; a clustered face model storage unit 234 for storing the clustered face image models and motion vector data output by the clustering processing unit 232 for every combination of face image models; and a key frame deletion unit 236 that uses either the face image models stored in the character model storage unit 156 or the clustered face models and motion vector data stored in the clustered face model storage unit 234 to detect key frames in which the vertices move fast, and deletes a predetermined proportion of such key frames. In the present embodiment, when a key frame is deleted, its duration is merged into the duration of the immediately preceding key frame.

The lip sync animation creating apparatus 200 further includes: a deletion rate input unit 201 for specifying the proportion of key frames, out of the total number of key frames, to be deleted by the key frame deletion unit 236; and a cluster processing designation unit 202 for specifying whether the speed calculation performed by the key frame deletion unit 236 uses the models stored in the character model storage unit 156 as they are or the models based on the clustered motion vectors stored in the clustered face model storage unit 234. Details of the key frame deletion unit 236 will be described later.

The lip sync animation creating apparatus 200 further includes: a speech power calculation unit 238 for calculating the speech power in each frame from the utterance data stored in the utterance storage unit 152; a speech power storage unit 240 for storing the speech power calculated by the speech power calculation unit 238; and a speech-power-based blend rate adjustment unit 244 for adjusting, as described later, the blend rates of the key frames in the visual element sequence output by the key frame deletion unit 236, based on the per-frame speech power stored in the speech power storage unit 240.

The lip sync animation creating apparatus 200 further includes: an attenuation rate input unit 206 with which the user inputs a parameter α (hereinafter, "attenuation rate α") used by the speech-power-based blend rate adjustment unit 244 when attenuating the blend rate of a key frame; a speech power use instruction input unit 204 with which the user specifies whether the blend rate adjustment unit 244 is to adjust the blend rates; and a pair of selection units 242 and 246 that pass the output of the key frame deletion unit 236 to the blend rate adjustment unit 244 when use of speech power is specified via the speech power use instruction input unit 204, and otherwise pass it to the subsequent processing units, bypassing the blend rate adjustment unit 244.

The lip sync animation creating apparatus 200 further includes: a vertex-speed-based blend rate adjustment unit 250 that, according to the value specified via the cluster processing designation unit 202, uses either the face image model data stored in the character model storage unit 156 or the motion vectors stored in the clustered face model storage unit 234 to calculate the speed of vertex motion at each key frame, and reduces the blend rate of key frames whose motion speed exceeds a predetermined reference; an attenuation rate input unit 210 with which the user inputs an attenuation rate β used when the blend rate adjustment unit 250 adjusts the blend rates; a vertex speed use instruction input unit 208 with which the user specifies whether the blend rate adjustment unit 250 is to adjust the blend rates; and a pair of selection units 248 and 252 that, according to the instruction entered via the vertex speed use instruction input unit 208, either pass the output of the selection unit 246 to the blend rate adjustment unit 250 or pass it to the subsequent processing unit, bypassing the blend rate adjustment unit 250.

The lip sync animation creating apparatus 200 further includes: a visual element sequence storage unit 254 for storing the visual element sequence with durations output by the selection unit 252, whose blend rates have been adjusted; and a blend processing unit 256 that creates the face image animation 260 by performing blend processing with the face image models stored in the character model storage unit 156, based on the visual element sequence with durations stored in the visual element sequence storage unit 254.

FIG. 6 shows the detailed configuration of the visual element sequence creation unit 230 of FIG. 5. Referring to FIG. 6, the visual element sequence creation unit 230 includes: a phoneme segmentation unit 172 that uses the acoustic model stored in the acoustic model storage unit 170 to perform phoneme segmentation of the utterance data stored in the utterance storage unit 152, based on the transcription stored in the transcription storage unit 154, and outputs a phoneme sequence together with information indicating the duration of each phoneme; and a phoneme sequence storage unit 174 for storing the phoneme sequence with durations output by the phoneme segmentation unit 172.

The visual element sequence creation unit 230 further includes: the phoneme-visual element mapping table storage unit 176 for storing the mapping table between phonemes and visual elements; and a phoneme-visual element conversion processing unit 178 that outputs a visual element sequence with durations by converting each phoneme in the phoneme sequence stored in the phoneme sequence storage unit 174 into the corresponding visual element, with reference to the mapping table stored in the phoneme-visual element mapping table storage unit 176. As described above, each visual element in the visual element sequence with durations output by the phoneme-visual element conversion processing unit 178 is accompanied by the corresponding phoneme and a default blend rate.

The phoneme segmentation unit 172 may be of any kind, as long as it can perform phoneme segmentation on the utterance data stored in the utterance storage unit 152 and output a phoneme string together with time data from which the duration of each phoneme can be determined. Because the content of the utterance is known in advance from the transcription stored in the transcription storage unit 154, the phoneme segmentation unit 172 can convert the speech data into a phoneme string with high accuracy.

Table 1 shows part of an example of the mapping table stored in the mapping table storage unit 176.

Referring to Table 1, in the present embodiment the mapping table associates the phoneme /a/ with the visual element /A/, the phoneme /i/ with the visual element /I/, the phoneme /u/ with the visual element /U/, the phoneme /e/ with the visual element /E/, and the phoneme /o/ with the visual element /O/. In the mapping table, a visual element prepared in advance for a particular phoneme, such as the face images /A/, /I/, /U/, /E/, and /O/ shown in FIG. 3, must always be associated with that phoneme; otherwise, the resulting face animation will not match the content of the utterance. Phonemes for which the lips are closed, such as /N/, /p/, /b/, and /m/, are associated with the expressionless face image /N/. The phonemes /h/, /j/, /q/, and /r/ are ignored and are not converted into visual elements. Phonemes other than those listed in Table 1 are assigned a blend rate equal to 80% of the blend rate of the immediately preceding phoneme.
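The mapping rules above can be sketched as a small conversion routine. This is an illustration under stated assumptions, not the table actually stored in unit 176: Table 1 is only partly described in the text, so the sets below contain just the phonemes named there. The sketch also assumes that a phoneme not listed reuses the preceding visual element (the text specifies only its blend rate), that /N/ is used when there is no preceding element, and that ignored phonemes are simply dropped without redistributing their duration. All identifiers are hypothetical.

```python
VOWEL_TO_VISEME = {"a": "A", "i": "I", "u": "U", "e": "E", "o": "O"}
CLOSED_LIP_PHONEMES = {"N", "p", "b", "m"}   # mapped to the neutral face /N/
IGNORED_PHONEMES = {"h", "j", "q", "r"}      # not converted at all
DEFAULT_BLEND = 1.0                          # default blend rate (100%)
FALLBACK_FACTOR = 0.8                        # 80% of the preceding blend rate


def phonemes_to_visemes(phonemes):
    """Convert [(phoneme, duration), ...] into a list of viseme records."""
    out = []
    for phoneme, duration in phonemes:
        if phoneme in IGNORED_PHONEMES:
            continue  # assumption: the duration is not redistributed here
        if phoneme in VOWEL_TO_VISEME:
            viseme, blend = VOWEL_TO_VISEME[phoneme], DEFAULT_BLEND
        elif phoneme in CLOSED_LIP_PHONEMES:
            viseme, blend = "N", DEFAULT_BLEND
        elif out:
            # Unlisted phoneme: assumed to keep the preceding viseme,
            # with 80% of its blend rate.
            viseme = out[-1]["viseme"]
            blend = FALLBACK_FACTOR * out[-1]["blend"]
        else:
            viseme, blend = "N", DEFAULT_BLEND  # no predecessor: assume /N/
        out.append({"viseme": viseme, "phoneme": phoneme,
                    "duration": duration, "blend": blend})
    return out
```

Each record keeps the original phoneme alongside the visual element, matching the statement that the conversion unit attaches the replaced phoneme and a default blend rate to every visual element.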

The processing performed by the clustering processing unit 232 will be described with reference to FIGS. 8 to 10. In brief, the processing is as follows.

The following processing is performed for every pair of face models stored in the character model storage unit 156.

First, the coordinate vector of every vertex of one face image is subtracted from the coordinate vector of the corresponding vertex of the other face image. This subtraction yields the motion vector of each vertex for the change from the first face image to the second. FIG. 8 shows, as an example, a face image 280 consisting of the vertices of the visual element /N/ as the first face image, a face image 282 consisting of the vertices of the visual element /O/ as the second, and an image 284 consisting of the set of motion vectors from the visual element /N/ to the visual element /O/. In FIG. 8, the horizontal axis is the X axis and the vertical axis is the Z axis; the Y axis is not shown.

The clustering processing unit 232 clusters the set of motion vectors obtained in this way, roughly according to the following algorithm.

(1) Determine the number of clusters N.

(2) Arbitrarily select N vectors from the set of motion vectors and use them as the initial codebook.

(3) Classify all the vectors in the set of motion vectors into N clusters based on their Euclidean distances to the vectors of the current codebook (initially, the initial codebook). Each motion vector is assigned to the cluster represented by the codebook vector nearest to it in Euclidean distance.

(4) Create a new codebook of N vectors by calculating the mean of the vectors belonging to each cluster.

(5) Repeat steps (3) and (4) until the codebook no longer changes, or until the change between successive codebooks falls below a threshold.

In the present embodiment, the representative vertex of each cluster is the vertex closest to the centroid (center of gravity) obtained for that cluster.
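Steps (1) to (5) and the choice of representative vertices can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation. The text says only that the initial codebook is selected "arbitrarily", so this sketch picks evenly spaced vectors to stay deterministic. It also assumes that the "centroid" is the mean motion vector of the cluster, which the text does not state explicitly. All function names are hypothetical.

```python
import math


def motion_vectors(face_from, face_to):
    """Per-vertex motion vectors: target position minus source position."""
    return [tuple(t - f for f, t in zip(p, q))
            for p, q in zip(face_from, face_to)]


def vq_cluster(vectors, n_clusters, tol=1e-9, max_iter=100):
    """Cluster motion vectors; returns (labels, codebook)."""
    # Step (2): pick N initial codebook vectors (evenly spaced, by assumption).
    step = len(vectors) // n_clusters
    codebook = [vectors[i * step] for i in range(n_clusters)]
    labels = []
    for _ in range(max_iter):
        # Step (3): assign each vector to the nearest codebook vector.
        labels = [min(range(n_clusters),
                      key=lambda c: math.dist(v, codebook[c]))
                  for v in vectors]
        # Step (4): the new codebook is the mean of each cluster's members.
        new_codebook = []
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_codebook.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_codebook.append(codebook[c])  # keep an empty cluster's code
        shift = max(math.dist(a, b) for a, b in zip(codebook, new_codebook))
        codebook = new_codebook
        if shift <= tol:  # Step (5): stop once the codebook has settled.
            break
    return labels, codebook


def representative_vertices(vectors, labels, codebook):
    """For each non-empty cluster, the index of the vertex nearest its centroid."""
    reps = {}
    for c, centroid in enumerate(codebook):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members,
                          key=lambda i: math.dist(vectors[i], centroid))
    return reps
```

Because the representative is an actual vertex rather than the centroid itself, later distance calculations can look it up directly in any face model.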

As a result of this clustering, every vertex belongs to one of a plurality of clusters for each combination of images. FIG. 9 shows an example in which such a clustering result is mapped onto a face image. Referring to FIG. 9, clustering the motion vectors between an image 300 and another image (not shown) classifies the vertices of the face model constituting the image 300 into clusters 310, 312, 314, 316, 318, 320, 322, and 324, as shown in an image 302. In this example, there are 8 clusters and 1483 vertices.

As can be seen from FIG. 9, the vertices near the mouth are clearly clustered according to their positions, whereas the motions of the vertices in the other regions differ little.

FIG. 10 shows a result 340 of mapping onto the face image the clusters obtained by the same processing with 128 clusters and 1483 vertices. It can be seen that when the number of clusters is increased in this way, the vertices outside the mouth region are also separated into clusters.

The vertices are clustered for the following reason. For example, if the key frame deletion unit 236 and the blend rate adjustment unit 250 shown in FIG. 5 calculated the movement amount or speed of every vertex, the calculation would have to be repeated as many times as there are vertices, and the processing would take a long time. When the vertices are clustered, by contrast, the movement amount or speed of each vertex can be approximated by that of the representative vertex of the cluster to which the vertex belongs. The effective amount of calculation is thus reduced to the number of clusters, and the calculation time can be shortened considerably.

For example, if only the image around the mouth needs to be processed in a short time, the number of clusters can be made small; if the image of the entire head, not just the mouth, must be obtained with a certain degree of precision even at the cost of a somewhat longer calculation time, the number of clusters can be made large. Furthermore, if there is no limit on the calculation time, the clustering can be omitted and the movement amount or speed calculated individually for every vertex.

FIG. 11 is a flowchart showing the control structure of a computer program that implements the functions of the key frame deletion unit 236. Referring to FIG. 11, in step 360, the deletion rate is read from a predetermined storage area. This deletion rate has been entered in advance by the user via the deletion rate input unit 201 shown in FIG. 5 and stored in the predetermined storage area.

In step 362, the number K of key frames to be deleted is calculated from this deletion rate. Letting a be the number of key frames in the visual element sequence stored in the visual element sequence storage unit 180 and γ% be the deletion rate, the present embodiment obtains K as a × γ / 100. Whether the result is rounded to the nearest integer, rounded up, or rounded down is a design matter.

In step 364, 0 is assigned to an iteration variable i for the following loop. In step 366, 1 is added to the variable i, and in step 368 it is determined whether the value of i has become larger than the number K of key frames to be deleted. If the result is YES, control proceeds to step 382; otherwise, it proceeds to step 370.

In step 370, it is determined whether the following calculation is to use the clustered face image models stored in the clustered face model storage unit 234 or the original face image models stored in the character model storage unit 156. This determination is based on information entered in advance by the user via the cluster processing designation unit 202 and stored in a predetermined storage area. If the clustered models are to be used, control proceeds to step 376; if not, it proceeds to step 372.

In step 372, for every pair of adjacent key frames in the visual element sequence, the distance D between the key frames is calculated from all the vertices by the following equation.

Here, D(k) represents the sum, over all vertices, of the Euclidean distances between the k-th key frame and the (k+1)-th key frame. This distance D(k) is hereinafter referred to as the inter-key-frame distance between the k-th and (k+1)-th key frames.

Subsequently, in step 374, the key frame to be deleted is determined by the following equation, based on the inter-key-frame distances calculated in step 372.

Here, Dur_k denotes the duration of the k-th key frame.

In short, through steps 372 and 374, the key frame for which the sum of the moving speed of all the vertices from the preceding key frame and the moving speed of all the vertices to the following key frame is largest is selected as the key frame to be deleted. In step 380, this key frame is deleted, and control returns to step 366.
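The loop of steps 362 to 380 for the non-clustered case can be sketched as below. The equations themselves are not reproduced in this text, so the sketch reconstructs them from the verbal description and should be read as an assumption: D(k) is the sum of per-vertex Euclidean distances between key frames k and k+1, the speed from key frame k to k+1 is taken as D(k) / Dur_k, and the frame deleted is the one maximizing D(k-1)/Dur_(k-1) + D(k)/Dur_k. The deletion rate is read as a percentage with truncation, and only interior key frames are treated as candidates. Neither point is fixed by the text. All names are hypothetical.

```python
import math


def keyframe_distance(face_a, face_b):
    """D(k): sum over all vertices of the Euclidean distance between two faces."""
    return sum(math.dist(p, q) for p, q in zip(face_a, face_b))


def delete_fast_keyframes(faces, durations, deletion_rate_percent):
    """Delete the fastest-moving key frames; returns (kept indices, durations).

    faces     -- one list of vertex positions per key frame
    durations -- duration of each key frame (Dur_k)
    """
    faces, durations = list(faces), list(durations)
    ids = list(range(len(faces)))
    # Step 362: K = a * gamma / 100 (truncated here; rounding is a design choice).
    k_delete = int(len(faces) * deletion_rate_percent / 100)
    for _ in range(k_delete):
        if len(faces) < 3:
            break  # no interior key frame left to delete
        # Speed of the transition from key frame k to key frame k + 1.
        speed = [keyframe_distance(faces[k], faces[k + 1]) / durations[k]
                 for k in range(len(faces) - 1)]
        # Step 374: interior frame with the largest incoming + outgoing speed.
        worst = max(range(1, len(faces) - 1),
                    key=lambda k: speed[k - 1] + speed[k])
        # Step 380: merge its duration into the preceding key frame, then delete.
        durations[worst - 1] += durations[worst]
        del faces[worst], durations[worst], ids[worst]
    return ids, durations
```

Recomputing the speeds after every deletion mirrors the flowchart, in which control returns to step 366 and the distances are evaluated again on the shortened sequence.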

If, on the other hand, it is determined in step 370 that the clustered models are to be used, then in step 376, for every pair of adjacent key frames in the visual element sequence, the distance D' between the key frames is calculated from the representative vertex of each cluster by the following equation.

Here, m_r denotes the number of vertices belonging to the cluster represented by the representative vertex r.

In step 378, the key frame to be deleted is determined by the following equation, based on the inter-key-frame distances D' calculated in step 376.

In short, through steps 376 and 378, the moving speeds of all the vertices between key frames are approximated by the moving speeds of the representative vertices, and the key frame for which the resulting sum of the vertex moving speeds from the preceding key frame and to the following key frame is largest is selected as the key frame to be deleted. In step 380, this key frame is deleted, and control returns to step 366.

The processing in step 372 must be performed for all the vertices constituting the face image model, whereas the processing in step 376 need only be performed for the representative vertex of each cluster. The processing in step 376 therefore takes far less time than that in step 372. However, the distance D' obtained in step 376 is only an approximation of the distance D obtained in step 372 and contains an error, so the two methods may in some cases delete different key frames.
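The clustered distance D' can be sketched as a weighted sum over the representative vertices. Since the equation is not reproduced in this text, the form below (each representative's displacement weighted by its cluster size m_r) is inferred from the definition of m_r and is an assumption. The argument names are hypothetical.

```python
import math


def clustered_keyframe_distance(face_a, face_b, representatives, cluster_sizes):
    """D'(k): approximate D(k) using one representative vertex per cluster.

    representatives -- {cluster id: index of the representative vertex r}
    cluster_sizes   -- {cluster id: number of vertices m_r in that cluster}
    """
    return sum(
        cluster_sizes[c] * math.dist(face_a[r], face_b[r])
        for c, r in representatives.items()
    )


# Toy example: four vertices in two clusters of two vertices each.
face_a = [(0.0, 0.0, 0.0)] * 4
face_b = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 2.0, 0.0)]
d_approx = clustered_keyframe_distance(face_a, face_b, {0: 0, 1: 2}, {0: 2, 1: 2})
```

In this toy case every vertex moves exactly like its representative, so the approximation equals the exact sum; with real face models the two generally differ, which is the error the text refers to.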

If it is determined in step 368 that the value of the variable i is larger than the number K of key frames to be deleted, then in step 382 the visual element sequence from which K key frames have been deleted is output, and the processing ends.

FIG. 12 illustrates the concept of key frame deletion performed by the key frame deletion unit 236. Referring to FIG. 12(A), assume that the visual element sequence contains four key frames 400, 402, 404, and 406. The distance D or D' described above is calculated for every pair of adjacent key frames among them. The key frame for which the sum of the vertex moving speeds from the preceding key frame and to the following key frame is largest is then deleted. In the example shown in FIG. 12(A), key frame 402 is assumed to be that key frame. As shown in FIG. 12(B), key frame 402 is then deleted from the visual element sequence, and the processing described above is performed on the new visual element sequence, which contains three visual elements.

The processing performed by the speech-power-based blend rate adjustment unit 244 shown in FIG. 5 will be described with reference to FIG. 13. The blend rate adjustment unit 244 calculates the speech power over the duration of the phoneme corresponding to each key frame, from the utterance data stored in the utterance storage unit 152 and the durations of the phoneme sequence contained in the visual element sequence stored in the visual element sequence storage unit 180. The speech power of a phoneme is obtained as the sum of squares of the amplitude of the speech signal at the center of the duration of that phoneme.

For example, as shown in FIG. 13, suppose that the waveform of the actual speech signal is as shown by a graph 420, and that the speech signal contains a phoneme sequence consisting of the phonemes /a/, /i/, /o/, /e/, and /u/. For the phoneme /a/, the average speech power is calculated over the period from the beginning of its duration until the key frame changes to the next one, /i/. The same applies to the other phonemes /i/, /o/, /e/, and /u/: the average speech power over the period from the beginning of each duration until the change to the next key frame is calculated over the whole of that duration, as indicated by line segments 430, 432, 434, 436, and 438. The blend rate adjustment unit 244 adjusts the blend rate of the visual element corresponding to each phoneme based on the average speech power thus calculated.
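A minimal sketch of the per-key-frame averaging follows. It assumes that "speech power" means the mean squared amplitude of the samples in the interval from the start of one key frame to the start of the next. The text does not give the exact formula, and elsewhere it mentions a sum of squares taken at the center of the phoneme, so this is one possible reading. The function names and the boundary representation (sample indices) are hypothetical.

```python
def average_power(samples, start, end):
    """Mean squared amplitude of samples[start:end] (0.0 for an empty span)."""
    segment = samples[start:end]
    if not segment:
        return 0.0
    return sum(s * s for s in segment) / len(segment)


def keyframe_powers(samples, boundaries):
    """Average power per key frame.

    boundaries -- sample index at which each key frame starts,
                  followed by the end index of the last key frame
    """
    return [average_power(samples, b0, b1)
            for b0, b1 in zip(boundaries, boundaries[1:])]


# Two key frames of four samples each: a loud one, then a quiet one.
samples = [1.0, -1.0, 1.0, -1.0, 0.5, -0.5, 0.5, -0.5]
powers = keyframe_powers(samples, [0, 4, 8])
```

If key frames have been deleted beforehand, the boundaries already reflect the merged durations, so each surviving key frame is averaged over its extended interval.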

FIG. 14 shows, in flowchart form, the control structure of a computer program that implements the processing performed by the speech-power-based blend rate adjustment unit 244.

Referring to FIG. 14, in step 450, the attenuation rate α is read from a predetermined storage area. This attenuation rate α has been entered by the user via the attenuation rate input unit 206 shown in FIG. 5 and stored in the predetermined storage area.

In step 452, the average speech power over the duration is calculated for every phoneme in the phoneme sequence. Hereinafter, the average speech power of the phoneme of the N-th key frame over its entire duration is written SP(N).

In step 454, the largest value MAX(SP) and the smallest value MIN(SP) among all the average speech powers calculated in step 452 are determined.

In step 456, the blend rate of every key frame except the one giving the maximum average speech power is updated according to the following equation. Hereinafter, the blend rate of the N-th key frame is written BR(N).

平均発話パワーの最大値を与えるキーフレームを除く全てのキーフレームに対してこの式によるブレンド率の調整を行なうと、発話パワーによるブレンド率調整部２４４による処理は終了する。なお、減衰率αは、最小値を与えるキーフレームのブレンド率をどの程度減衰させるかを表していることが上の式から分かる。

When the blend rate is adjusted for all key frames except the key frame that gives the maximum value of the average utterance power, the processing by the blend rate adjustment unit 244 based on the utterance power ends. It can be seen from the above formula that the attenuation rate α represents how much the blend rate of the key frame that gives the minimum value is attenuated.
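The equation used in step 456 did not survive in this text, so it cannot be reproduced here. The sketch below uses one linear form that satisfies the three properties the description states: the key frame with MAX(SP) is unchanged, the key frame with MIN(SP) is attenuated by α, and a lower SP(N) gives a lower blend rate. The formula and the name `adjust_blend_by_power` are assumptions chosen for illustration, not the patent's actual equation.

```python
from typing import List


def adjust_blend_by_power(
    blend: List[float], power: List[float], alpha: float
) -> List[float]:
    """Scale BR(N) by average speech power SP(N).

    Assumed form (not taken from the patent):
        BR(N) <- BR(N) * (1 - alpha * (MAX(SP) - SP(N)) / (MAX(SP) - MIN(SP)))
    """
    sp_max, sp_min = max(power), min(power)
    if sp_max == sp_min:
        return list(blend)  # all powers equal: nothing to attenuate
    adjusted = []
    for br, sp in zip(blend, power):
        # shortfall is 0 at MAX(SP) and 1 at MIN(SP)
        shortfall = (sp_max - sp) / (sp_max - sp_min)
        adjusted.append(br * (1.0 - alpha * shortfall))
    return adjusted
```

Under this assumed form the loudest key frame needs no special case, because its scaling factor is exactly 1.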

この処理による結果の一例を次のテーブルにより示す。調整前のブレンド率及び平均発話パワーを全てのキーフレームの音素に対して示したのがテーブル２であり、発話パワーによるブレンド率調整部２４４による調整後のブレンド率を示したのがテーブル３である。 An example of the result of this processing is shown in the following tables. Table 2 shows the blend rate before adjustment and the average speech power for the phonemes of all key frames, and Table 3 shows the blend rate after adjustment by the blend rate adjustment unit 244 based on speech power.

ブレンド率に対しこのような調整を行なうことにより、平均発話パワーが最大となるキーフレームのブレンド率は変化しないが、平均の発話パワーが小さくなればなる程、ブレンド率が小さくなる。その結果、話し声が小さい場合には口の動きも小さくなるアニメーションが作成でき、アニメーションの動きがより自然に近くなる。

By making such an adjustment to the blend rate, the blend rate of the key frame that maximizes the average speech power does not change, but the blend rate decreases as the average speech power decreases. As a result, when the speaking voice is low, an animation can be created in which the movement of the mouth becomes small, and the movement of the animation becomes more natural.

図１５に、図５のブレンド率調整部２５０が行なう処理をコンピュータプログラムで実現する際の、プログラムの制御構造をフローチャート形式で示す。 FIG. 15 is a flowchart showing a program control structure when the processing performed by the blend rate adjustment unit 250 of FIG. 5 is realized by a computer program.

図１５を参照して、ステップ４７０において、減衰率βを所定の記憶領域から読出す。減衰率βは、図５に示す減衰率入力部２１０を用いてユーザにより入力され、所定の記憶領域に記憶されていたものである。減衰率βの意味は以下から明らかとなるが、本実施の形態では、キーフレームの間で頂点の動きに基づいてブレンド率を調整しないキーフレーム（以下「不変フレーム」と呼ぶ。）の割合を示す値が用いられる。 Referring to FIG. 15, in step 470, the attenuation rate β is read from a predetermined storage area. The attenuation rate β has been input by the user using the attenuation rate input unit 210 shown in FIG. 5 and stored in the predetermined storage area. The meaning of the attenuation rate β will become clear from the following; in this embodiment, it is a value indicating the proportion of key frames whose blend rate is not adjusted based on the movement of the vertices between key frames (hereinafter referred to as "invariant frames").

ステップ４７２では、ステップ４７０で読出された減衰率βを、全体のキーフレーム数に乗算することにより、不変フレームの数Ｌを算出する。不変フレームの数Ｌについて、切り上げにより求めるか、四捨五入により求めるか、切り捨てにより求めるかは設計事項である。 In step 472, the number of invariant frames L is calculated by multiplying the total number of key frames by the attenuation rate β read in step 470. It is a design matter whether the number L of invariant frames is obtained by rounding up, rounding off, or rounding down.

ステップ４７４では、クラスタリング後のモデルを使用するか否かを判定する。この判定は、クラスタ処理指定部２０２を用いてユーザにより入力され、所定の記憶領域に格納されていた値を用いて行なわれる。クラスタリング後のモデルを使用する場合はステップ４８０に進み、使用しない場合にはステップ４７６に進む。 In step 474, it is determined whether to use the model after clustering. This determination is performed by using a value input by the user using the cluster process designating unit 202 and stored in a predetermined storage area. If the model after clustering is used, the process proceeds to step 480, and if not used, the process proceeds to step 476.

ステップ４７６では、全てのキーフレームに対し、その前後のキーフレームとの間での、全頂点の平均速度を算出する。この算出方法は図１１のステップ３７２及び３７４で行なうのと同様である。 In step 476, the average speed of all the vertices between all key frames and the preceding and following key frames is calculated. This calculation method is the same as that performed in steps 372 and 374 of FIG.

ステップ４７８では、全キーフレームを、ステップ４７６で算出された平均速度の降順にソートする。 In step 478, all key frames are sorted in descending order of the average speed calculated in step 476.

ステップ４８４では、このようにソートされたキーフレームのデータのうち、下位からＬ個のキーフレームの中の、平均速度の最大値＜ＶＳ＞を決定する。 In step 484, the maximum value <VS> of the average speed in the L key frames from the lower order among the data of the key frames sorted in this way is determined.

ステップ４８６では、ステップ４８４で決定された値＜ＶＳ＞より大きな平均速度を持つキーフレームにおいて、ブレンド率ＢＲ（Ｎ）を以下の式にしたがって調整する。 In step 486, the blend rate BR (N) is adjusted according to the following equation in a key frame having an average speed larger than the value <VS> determined in step 484.

ただしＶＳ（Ｎ）はＮ番目のキーフレームの平均速度である。ステップ４８６の後、処理を終了する。

Where VS (N) is the average speed of the Nth key frame. After step 486, the process ends.
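The equation referred to in step 486 is missing from this text. The worked example given later in this description (BR(N) = 90 × 100/200 = 45 and BR(N) = 60 × 100/150 = 40 with ＜VS＞ = 100) is consistent with the following form, which is offered here as a reconstruction from that example rather than as the original equation:

```latex
\mathrm{BR}(N) \leftarrow \mathrm{BR}(N) \times \frac{\langle VS \rangle}{VS(N)}
\qquad \text{for key frames with } VS(N) > \langle VS \rangle
```

Because the update applies only when VS(N) exceeds ＜VS＞, the factor is always less than 1, so the blend rate can only decrease.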

一方、クラスタリング後のモデルを使用する場合、ステップ４８０において、全てのキーフレームに対し、その前後のキーフレームとの間での頂点の平均速度を、各クラスタの代表頂点を用いて算出する。ここでの処理は、図１１のステップ３７６及び３７８で行なったのと同様の考え方により行なう。 On the other hand, when the clustered model is used, in step 480, the average speed of the vertices between all the key frames and the preceding and following key frames is calculated using the representative vertices of each cluster. This processing is performed based on the same concept as that performed in steps 376 and 378 in FIG.

ステップ４８２では、全キーフレームをステップ４８０で算出された平均速度の降順でソートする。以下、ステップ４８４の処理に進む。 In step 482, all key frames are sorted in descending order of the average speed calculated in step 480. Thereafter, the process proceeds to step 484.

ここでの処理は、要するに、各頂点の動く速度が速いキーフレームについては、他のキーフレームの速さを基準として、口の動きが小さくなるようにブレンド率を調整する、というものである。頂点の動きがキーフレーム間であまりに速い場合、キーフレームでの口の形を元のままに維持すると、口の動きが不自然に見える。そこで、そうした場合にはブレンド率を小さく調整することにより、口の動きが小さくなるようにする。 In short, the processing here is to adjust the blend rate so that the movement of the mouth becomes small with respect to the speed of the other key frames for the key frame where the moving speed of each vertex is fast. If the movement of the vertices is too fast between key frames, keeping the mouth shape at the key frames intact, the mouth movements look unnatural. Therefore, in such a case, the movement of the mouth is reduced by adjusting the blend rate to be small.

次の表に、ブレンド率調整部２５０によるブレンド率の調整前後におけるブレンド率の変化の例を示す。テーブル４は平均速度の調整後でキーフレームのソート前、テーブル５はソート後でかつブレンド率の調整前を示す。 The following table shows an example of a change in the blend rate before and after the blend rate adjustment unit 250 adjusts the blend rate. Table 4 shows after adjusting the average speed before sorting key frames, and Table 5 shows after sorting and before adjusting blend ratio.

ここで、減衰率β＝６０％とすると、不変フレーム数Ｌは５×０．６＝３となる。したがって表×における下３行についてはブレンド率の調整は行なわず、上２行のみのブレンド率の調整を行なう。ステップ４８４で決定する平均速度の最大値＜ＶＳ＞は、音素／ａ／の平均速度「１００」となる。

Here, if the attenuation rate β = 60%, the invariant frame number L is 5 × 0.6 = 3. Therefore, the blend rate is not adjusted for the lower three rows in the table x, and the blend rate is adjusted only for the upper two rows. The maximum value <VS> of the average speed determined in step 484 is the average speed “100” of phonemes / a /.

＜ＶＳ＞＝１００を用いてステップ４８６の処理を行なうと、上位の二つの音素／ｉ／及び／ｏ／のブレンド率がそれぞれ以下のように訂正される。すなわち、音素／ｉ／についてはＢＲ（Ｎ）＝９０×１００／２００＝４５となり、音素／ｏ／についてはＢＲ（Ｎ）＝６０×１００／１５０＝４０となる。その結果、ブレンド率調整部２５０によるブレンド率調整後の各キーフレームのブレンド率は以下のようになる。 When the processing of step 486 is performed using <VS> = 100, the blend ratios of the upper two phonemes / i / and / o / are corrected as follows. That is, BR (N) = 90 × 100/200 = 45 for phonemes / i / and BR (N) = 60 × 100/150 = 40 for phonemes / o /. As a result, the blend rate of each key frame after the blend rate adjustment by the blend rate adjustment unit 250 is as follows.

すなわち、不変フレームの中の最大の平均速度より大きな平均速度を持つキーフレームのブレンド率が当初より小さな値に調整される。しかも、そのキーフレームの平均速度が大きいほど、ブレンド率は小さくなるため、キーフレームの頂点の移動速度が速いほど、そのキーフレームにおける口の位置の変化が小さくなり、一連のアニメーションはより滑らかで自然なものとなる。

That is, the blend rate of every key frame whose average speed exceeds the maximum average speed among the invariant frames is adjusted to a value smaller than its initial value. Moreover, the higher the average speed of a key frame, the smaller its blend rate becomes, so the faster the vertices of a key frame move, the smaller the change in mouth position at that key frame, and the resulting animation is smoother and more natural.
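Steps 472 to 486 can be sketched in Python as follows. The dictionary layout with `phoneme`, `blend`, and `speed` keys and the name `adjust_blend_by_speed` are hypothetical. The use of `round` for L is one of the rounding options the description leaves as a design matter, and the scaling factor ＜VS＞/VS(N) is inferred from the worked example above, not quoted from the missing equation.

```python
from typing import Dict, List


def adjust_blend_by_speed(frames: List[Dict], beta: float) -> List[Dict]:
    """Attenuate the blend rate of key frames whose vertices move too fast.

    frames: dicts with 'phoneme', 'blend' and 'speed' keys (hypothetical layout).
    beta:   proportion of key frames left untouched (invariant frames).
    """
    # Step 472: number of invariant frames L (rounding mode is a design choice).
    invariant_count = min(len(frames), round(len(frames) * beta))
    if invariant_count <= 0:
        return [dict(f) for f in frames]  # no reference speed: leave unchanged
    # Steps 478/482: sort in descending order of average speed.
    by_speed = sorted(frames, key=lambda f: f["speed"], reverse=True)
    # Step 484: <VS> is the fastest of the L slowest key frames.
    vs_ref = by_speed[-invariant_count]["speed"]
    # Step 486: scale down every key frame faster than <VS>.
    result = []
    for f in frames:
        g = dict(f)
        if g["speed"] > vs_ref:
            g["blend"] = g["blend"] * vs_ref / g["speed"]
        result.append(g)
    return result


# /a/ speed, /i/ and /o/ values follow the worked example;
# the remaining numbers are made up for illustration.
example = [
    {"phoneme": "a", "blend": 80.0, "speed": 100.0},
    {"phoneme": "i", "blend": 90.0, "speed": 200.0},
    {"phoneme": "o", "blend": 60.0, "speed": 150.0},
    {"phoneme": "e", "blend": 70.0, "speed": 80.0},
    {"phoneme": "u", "blend": 50.0, "speed": 50.0},
]
adjusted = adjust_blend_by_speed(example, 0.6)
```

The result keeps the key frames in their original time order; sorting is used only to find ＜VS＞. With β = 0.6 the sketch gives L = 3 and ＜VS＞ = 100, and the blend rates of /i/ and /o/ become 45 and 40, matching the worked example.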

＜動作＞ <Operation>
以上構成を説明したリップシンクアニメーション作成装置２００は以下のように動作する。図５を参照して、最初に発話記憶部１５２に、所定の発話者の発話を記録した発話データが準備され、その発話の書き起こしデータであるトランスクリプションがトランスクリプション記憶部１５４に準備される。また、前述した６つの視覚素に対応した６つの顔画像のキャラクタモデルがワイアフレーム画像としてキャラクタモデル記憶部１５６に準備される。 The lip sync animation creating apparatus 200, whose configuration is described above, operates as follows. Referring to FIG. 5, first, utterance data recording an utterance of a predetermined speaker is prepared in utterance storage unit 152, and a transcription, which is the transcribed text of that utterance, is prepared in transcription storage unit 154. In addition, character models of six face images corresponding to the six visual elements described above are prepared as wire frame images in the character model storage unit 156.

顔画像のアニメーション２６０の作成のためには、種々の準備作業が必要である。以下それらの準備作業を順番に述べる。 In order to create the animation 260 of the face image, various preparation operations are necessary. These preparatory tasks are described in turn below.

−視覚素シーケンスの作成− -Creation of the visual element sequence-
まず、視覚素シーケンス作成部２３０が音響モデル記憶部１７０に記憶された音響モデル、及び音素−視覚素マッピングテーブル記憶部１７６に記憶された音素−視覚素マッピングテーブル記憶部１７６を用い、以下のようにして視覚素シーケンスを作成し視覚素シーケンス記憶部１８０に記憶させる。 First, the visual element sequence creation unit 230 uses the acoustic model stored in the acoustic model storage unit 170 and the phoneme-visual element mapping table stored in the phoneme-visual element mapping table storage unit 176 to create a visual element sequence as follows, and stores it in the visual element sequence storage unit 180.

図６を参照して、視覚素シーケンス作成部２３０の音素セグメンテーション部１７２が、発話記憶部１５２中の発話データを読み、トランスクリプション記憶部１５４と音響モデル記憶部１７０とを用いて発話データに対する音素セグメンテーションを行なう。この処理の結果、音素セグメンテーション部１７２からは音素シーケンスが、各音素の継続長を表すデータとともに出力される。この継続長付き音素シーケンスは音素シーケンス記憶部１７４に記憶される。 Referring to FIG. 6, phoneme segmentation unit 172 of visual element sequence creation unit 230 reads the utterance data in utterance storage unit 152 and performs phoneme segmentation on the utterance data using transcription storage unit 154 and acoustic model storage unit 170. As a result of this processing, phoneme segmentation unit 172 outputs a phoneme sequence together with data representing the duration of each phoneme. This phoneme sequence with durations is stored in the phoneme sequence storage unit 174.

音素−視覚素変換処理部１７８が、音素シーケンス記憶部１７４から音素シーケンスを読出し、音素−視覚素マッピングテーブル記憶部１７６に記憶された音素−視覚素マッピングテーブルを用いて、音素シーケンス中の音素を対応する視覚素に置き換えることにより、継続長付き視覚素シーケンスを生成する。ただしここでは、置換前の音素も各視覚素に付してあるものとする。この継続長付き視覚素シーケンスは視覚素シーケンス記憶部１８０に記憶される。 The phoneme-visual element conversion processing unit 178 reads the phoneme sequence from the phoneme sequence storage unit 174 and, using the phoneme-visual element mapping table stored in the phoneme-visual element mapping table storage unit 176, replaces each phoneme in the phoneme sequence with the corresponding visual element, thereby generating a visual element sequence with durations. Here, the phoneme before replacement is also attached to each visual element. This visual element sequence with durations is stored in the visual element sequence storage unit 180.
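The phoneme-to-visual-element conversion described above can be sketched in Python as follows. The type names `Phoneme` and `Viseme`, the function `phonemes_to_visemes`, and the contents of the mapping table are hypothetical; the actual table held in storage unit 176 is not given in this section.

```python
from typing import Dict, List, NamedTuple


class Phoneme(NamedTuple):
    label: str
    duration: float  # seconds


class Viseme(NamedTuple):
    label: str
    duration: float  # carried over from the phoneme
    phoneme: str     # the phoneme before replacement is kept with the visual element


def phonemes_to_visemes(seq: List[Phoneme], table: Dict[str, str]) -> List[Viseme]:
    """Replace each phoneme by its visual element, keeping duration and source phoneme."""
    return [Viseme(table[p.label], p.duration, p.label) for p in seq]


# Hypothetical mapping table: several phonemes may share one visual element.
TABLE = {"a": "A", "i": "I", "m": "closed", "b": "closed"}
visemes = phonemes_to_visemes(
    [Phoneme("m", 0.08), Phoneme("a", 0.15), Phoneme("i", 0.12)], TABLE
)
```

Keeping the original phoneme on each element is what later lets the speech power be computed per phoneme from the visual element sequence.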

−顔画像の頂点のクラスタリング− -Clustering of face image vertices-
クラスタリング処理部２３２は、キャラクタモデル記憶部１５６に格納された６つの顔画像に対し、二つの顔画像の全ての組合せに対し、以下の処理を実行する。 The clustering processing unit 232 executes the following processing for every combination of two face images among the six face images stored in the character model storage unit 156.

まず、一方の顔画像から他方の顔画像に変化する際の頂点の動きベクトルを算出する。この動きベクトルの集合に対し、前述したとおりのＶＱクラスタリングを行なうことにより、一方の顔画像を所定個数のクラスタに分類する。逆方向の動きについては、動きベクトルの向きが逆になるだけであるから、クラスタリングは正逆で同じになる。 First, the motion vector of the vertex when changing from one face image to the other face image is calculated. One face image is classified into a predetermined number of clusters by performing VQ clustering on the set of motion vectors as described above. As for the movement in the reverse direction, since the direction of the motion vector is only reversed, the clustering is the same in the reverse direction.

このようにしてクラスタリングを行なった結果、二つの顔画像の全ての組合せに対し、クラスタリング後の顔モデルと、各クラスタの代表頂点とが算出される。この顔モデルが、各クラスタの代表頂点とともにクラスタ化顔モデル記憶部２３４に記憶される。 As a result of clustering as described above, the face model after clustering and the representative vertex of each cluster are calculated for all combinations of two face images. This face model is stored in the clustered face model storage unit 234 together with the representative vertex of each cluster.

−発話パワーの算出− -Calculation of speech power-
発話パワー算出部２３８は、視覚素シーケンス記憶部１８０に記憶された各視覚素に付された音素の情報に基づき、発話記憶部１５２中の各音素の平均発話パワーを算出し、発話パワーとして発話パワー記憶部２４０に記憶させる。 The speech power calculation unit 238 calculates the average speech power of each phoneme in the utterance storage unit 152, based on the phoneme information attached to each visual element stored in the visual element sequence storage unit 180, and stores it as the speech power in the speech power storage unit 240.

−アニメーションの作成− -Creation of animation-
アニメーションの作成においては、様々な選択肢がある。第１の選択肢は、キーフレームの削除率γである。キーフレームの削除は常に行なわれるので、この指定は必須である。ただし、指定がない場合には所定のデフォルトの値を使用するようにしてもよい。第２の選択肢は、キーフレーム削除部２３６での処理及びブレンド率調整部２５０での処理において、クラスタリングの結果を使用するか否かの指定である。第３の選択肢は、発話パワーによるブレンド率調整部２４４の処理を行なうか否かである。さらに、発話パワーによるブレンド率調整部２４４の処理を実行する場合には減衰率αを指定する必要がある。第４の選択肢は、ブレンド率調整部２５０の処理を行なうか否かである。ブレンド率調整部２５０の処理を行なう場合にはさらに、減衰率βを指定する必要がある。 There are various options in creating an animation. The first option is the key frame deletion rate γ. Since key frame deletion is always performed, this setting is mandatory; however, a predetermined default value may be used when it is not specified. The second option specifies whether or not to use the result of clustering in the processing by the key frame deletion unit 236 and the processing by the blend rate adjustment unit 250. The third option is whether or not to perform the processing of the blend rate adjustment unit 244 based on speech power. When that processing is performed, the attenuation rate α must also be specified. The fourth option is whether or not to perform the processing of the blend rate adjustment unit 250. When that processing is performed, the attenuation rate β must also be specified.

発話パワーによるブレンド率調整部２４４による処理を行なうことが指定された場合には、選択部２４２及び２４６は、キーフレーム削除部２３６の出力を発話パワーによるブレンド率調整部２４４に与え、さらに発話パワーによるブレンド率調整部２４４の出力を選択部２４８に与えるように、接続を切替える。それ以外の場合には、選択部２４２及び２４６は、キーフレーム削除部２３６の出力を直接に選択部２４８に与えるように接続を切替える。 When it is designated that the processing by the blend rate adjustment unit 244 based on speech power is to be performed, the selection units 242 and 246 give the output of the key frame deletion unit 236 to the blend rate adjustment unit 244 based on speech power. The connection is switched so that the output of the blend ratio adjustment unit 244 is supplied to the selection unit 248. In other cases, the selection units 242 and 246 switch the connection so that the output of the key frame deletion unit 236 is directly given to the selection unit 248.

一方、ブレンド率調整部２５０による処理を行なうことが指定された場合には、選択部２４８及び２５２は、選択部２４６の出力をブレンド率調整部２５０に与え、ブレンド率調整部２５０の出力を視覚素シーケンス記憶部２５４に与えるように接続を切替える。それ以外の場合には、選択部２４８及び２５２は、選択部２４６の出力を直接に視覚素シーケンス記憶部２５４に与えるように接続を切替える。 On the other hand, when processing by the blend rate adjustment unit 250 is designated, the selection units 248 and 252 switch their connections so that the output of the selection unit 246 is given to the blend rate adjustment unit 250 and the output of the blend rate adjustment unit 250 is given to the visual element sequence storage unit 254. Otherwise, the selection units 248 and 252 switch their connections so that the output of the selection unit 246 is given directly to the visual element sequence storage unit 254.

以下、一般性を失わずに、発話パワーによるブレンド率調整部２４４による処理及びブレンド率調整部２５０による処理がともに選択されることを前提とし、クラスタリング後のモデルを使用しない場合と使用する場合とについて、それぞれキーフレーム削除部２３６、発話パワーによるブレンド率調整部２４４、及びブレンド率調整部２５０の動作を説明する。 Hereinafter, on the assumption that the processing by the blend rate adjustment unit 244 and the processing by the blend rate adjustment unit 250 by speech power are both selected without losing generality, the case where the model after clustering is not used and the case where the model is used are used. The operations of the key frame deleting unit 236, the speech rate blend rate adjusting unit 244, and the blend rate adjusting unit 250 will be described.

（１）クラスタリング後のモデルを使用しない場合 (1) When the model after clustering is not used
−キーフレーム削除部２３６の動作− -Operation of the key frame deletion unit 236-
キーフレーム削除部２３６は、削除率入力部２０１により入力された削除率γを読出し（図１１、ステップ３６０）、視覚素シーケンス記憶部１８０に記憶された視覚素シーケンス中の視覚素の数に削除率γを乗ずることにより、削除フレーム数Ｋを算出する（ステップ３６２）。 The key frame deletion unit 236 reads the deletion rate γ input through the deletion rate input unit 201 (FIG. 11, step 360), and calculates the number K of frames to delete by multiplying the number of visual elements in the visual element sequence stored in the visual element sequence storage unit 180 by the deletion rate γ (step 362).

キーフレーム削除部２３６はさらに、ステップ３６８で削除フレーム数Ｋだけのキーフレームを削除したか否かを判定する。通常は最初の判定では削除フレーム数Ｋだけのキーフレームの削除は行なわれていない。したがってステップ３７０に進む。ステップ３７０では、クラスタリング後のモデルを使用することが指定されていないので、ステップ３７２に進む。 Further, the key frame deletion unit 236 determines whether or not the key frames corresponding to the number K of deleted frames have been deleted in step 368. Normally, in the first determination, key frames corresponding to the number K of deleted frames are not deleted. Accordingly, the process proceeds to step 370. In step 370, since it is not specified that the model after clustering is used, the process proceeds to step 372.

ステップ３７２では、視覚素シーケンス内の隣り合う全てのキーフレーム間で、全ての頂点を用いてキーフレーム間の距離Ｄを算出し、ステップ３７４でこの距離に基づいて各点の移動速度の合計が最も早いキーフレームを削除ターゲットに定める。そしてステップ３８０でこのキーフレームを削除する。この後ステップ３６６に戻る。 In step 372, the distance D between key frames is calculated, using all vertices, for every pair of adjacent key frames in the visual element sequence. In step 374, based on this distance, the key frame at which the total moving speed of the points is highest is selected as the deletion target. In step 380, this key frame is deleted. Thereafter, the process returns to step 366.

以後、削除したキーフレームの数が削除フレーム数Ｋより大きくなると処理を終了する。 Thereafter, when the number of deleted key frames is larger than the number K of deleted frames, the process is terminated.
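The deletion loop described above can be sketched in Python as follows. Several details are assumptions, because this section does not spell them out: the distance D is taken as the sum of per-vertex Euclidean displacements, a key frame's speed is taken as the sum of the speeds of the transitions into and out of it, the first and last key frames are never deleted, and K is obtained with `round`. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]


def frame_distance(a: Sequence[Vertex], b: Sequence[Vertex]) -> float:
    """Distance D: sum of per-vertex Euclidean displacements (assumed definition)."""
    return sum(math.dist(p, q) for p, q in zip(a, b))


def delete_fast_keyframes(
    times: List[float], shapes: List[Sequence[Vertex]], gamma: float
) -> List[int]:
    """Return the indices of the key frames that remain after deleting K of them.

    times must be strictly increasing; shapes[i] holds the vertices of key frame i.
    """
    keep = list(range(len(times)))
    k = round(len(times) * gamma)  # number K of key frames to delete
    for _ in range(k):
        if len(keep) <= 2:
            break  # only the end points are left (assumed never deleted)
        best_pos, best_speed = -1, -1.0
        for pos in range(1, len(keep) - 1):
            prev_i, cur_i, next_i = keep[pos - 1], keep[pos], keep[pos + 1]
            speed_in = frame_distance(shapes[prev_i], shapes[cur_i]) / (
                times[cur_i] - times[prev_i]
            )
            speed_out = frame_distance(shapes[cur_i], shapes[next_i]) / (
                times[next_i] - times[cur_i]
            )
            if speed_in + speed_out > best_speed:
                best_pos, best_speed = pos, speed_in + speed_out
        del keep[best_pos]  # distances are recomputed on the next pass
    return keep
```

Recomputing the speeds after each deletion mirrors the flowchart's return to the loop head, since removing one key frame changes the neighbors of the frames around it.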

キーフレーム削除部２３６によりこのようにしてＫ個のキーフレームが削除された視覚素シーケンスは選択部２４２を介して発話パワーによるブレンド率調整部２４４に与えられる。 The visual element sequence from which K key frames have been deleted in this way by the key frame deletion unit 236 is provided to the blend rate adjustment unit 244 based on speech power via the selection unit 242.

−発話パワーによるブレンド率調整部２４４の動作− -Operation of the blend rate adjustment unit 244 based on speech power-
発話パワーによるブレンド率調整部２４４は、最初に減衰率αを読出す（図１４のステップ４５０）。ステップ４５２で、キーフレーム削除部２３６の出力する視覚素シーケンス中の音素に関する情報に基づいて、発話記憶部１５２に記憶された発話データから、各音素の継続期間にわたる平均発話パワーを算出する。 First, the blend rate adjustment unit 244 based on speech power reads the attenuation rate α (step 450 in FIG. 14). In step 452, based on the information about the phonemes in the visual element sequence output from the key frame deletion unit 236, the average speech power over the duration of each phoneme is calculated from the utterance data stored in the utterance storage unit 152.

ステップ４５４では、こうして算出された平均発話パワーのうち、最大パワーＭＡＸ（ＳＰ）と最小パワーＭＩＮ（ＳＰ）とを算出し、ステップ４５６において、減衰率αを用いた式により、各キーフレームについてブレンド率ＢＲ（Ｎ）を調整する。全てのキーフレームについてブレンド率を調整された視覚素シーケンスは、選択部２４６及び選択部２４８を介してブレンド率調整部２５０に与えられる。 In step 454, the maximum power MAX (SP) and the minimum power MIN (SP) are calculated from the average utterance power thus calculated, and in step 456, blending is performed for each key frame by an expression using the attenuation factor α. The rate BR (N) is adjusted. The visual element sequence in which the blend rate is adjusted for all the key frames is provided to the blend rate adjusting unit 250 via the selection unit 246 and the selection unit 248.

−頂点速度によるブレンド率調整部２５０の動作− -Operation of the blend rate adjustment unit 250 based on vertex speed-
頂点速度によるブレンド率調整部２５０は、最初に減衰率βを読出し（図１５、ステップ４７０）、選択部２４８から与えられた視覚素シーケンス中に含まれるキーフレームにこの減衰率βを乗算して不変フレーム数Ｌを算出する（ステップ４７２）。続くステップ４７４では、ステップ４７６が選択される。 The blend rate adjustment unit 250 based on vertex speed first reads the attenuation rate β (FIG. 15, step 470), and multiplies the number of key frames contained in the visual element sequence given from the selection unit 248 by this attenuation rate β to calculate the invariant frame number L (step 472). In the following step 474, step 476 is selected.

ステップ４７６では、選択部２４８から与えられた視覚素シーケンス中の全てのキーフレームに対し、その前後のキーフレームとの間での、全頂点の平均速度を算出する。ステップ４７８では、このようにして算出された平均速度をソートキーに、平均速度の降順にキーフレームをソートする。 In step 476, for all key frames in the visual element sequence given from the selection unit 248, the average speed of all vertices between the key frames before and after the key frame is calculated. In step 478, the key frames are sorted in descending order of the average speed using the average speed calculated in this way as a sort key.

ステップ４８４では、ステップ４７８でソートされたキーフレームの下位からＬ個のキーフレームのうちの平均速度の最大値を＜ＶＳ＞の値に設定する。ステップ４８６で、ステップ４８４において設定された速度＜ＶＳ＞の値を用い、前述した式によって、不変フレーム以外のキーフレームの各々について、そのブレンド率を調整する。不変フレーム以外の全てのキーフレームについてブレンド率の調整が終了すると、ブレンド率の調整が完了した視覚素シーケンスを図５に示す視覚素シーケンス記憶部２５４に出力する。 In step 484, the maximum value of the average speed among the L key frames from the lower order of the key frames sorted in step 478 is set to the value of <VS>. In step 486, the blend rate is adjusted for each of the key frames other than the invariant frame by using the value of the speed <VS> set in step 484 and the above-described equation. When the adjustment of the blend rate is completed for all the key frames other than the invariant frames, the visual element sequence for which the adjustment of the blend ratio has been completed is output to the visual element sequence storage unit 254 shown in FIG.

ブレンド処理部２５６は、視覚素シーケンス記憶部２５４に記憶された視覚素シーケンスを読出し、各キーフレームに対応する時刻にはそのキーフレームで指定された視覚素を用い、キーフレーム間のフレームの時刻では、そのフレームの両隣のキーフレームの間で、キーフレームに付されたブレンド率を用いた内挿によって中間の画像を作成する。このようにして、一定時間間隔のフレームの各々で、キーフレームの画像とそのブレンド率とを用いた内挿によって画像を作成することにより、アニメーションが作成される。 The blend processing unit 256 reads the visual element sequence stored in the visual element sequence storage unit 254. At the time corresponding to each key frame, it uses the visual element specified by that key frame; at the time of a frame between key frames, it creates an intermediate image by interpolating between the two neighboring key frames using the blend rates assigned to them. In this way, an animation is created by generating, for each frame at a fixed time interval, an image by interpolation using the key frame images and their blend rates.
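The description does not specify the interpolation scheme used by the blend processing unit 256. As a minimal sketch, assuming plain linear interpolation in time, the contribution of the two neighboring key frames to an in-between frame could be computed as below. The function name `inbetween_weights` and the linear scheme are assumptions.

```python
from typing import Tuple


def inbetween_weights(
    t: float, t_prev: float, br_prev: float, t_next: float, br_next: float
) -> Tuple[float, float]:
    """Weights of the previous and next key frame shapes at time t.

    Assumes linear interpolation: the previous key frame fades out from its
    blend rate to 0 while the next one fades in from 0 to its blend rate.
    """
    u = (t - t_prev) / (t_next - t_prev)  # 0 at the previous key frame, 1 at the next
    return br_prev * (1.0 - u), br_next * u
```

For example, halfway between a key frame with blend rate 80 and one with blend rate 40, this gives weights 40 and 20 for the two visual elements.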

（２）クラスタリング後のモデルを使用する場合 (2) When the model after clustering is used
クラスタリング後のモデルを使用する場合には、リップシンクアニメーション作成装置２００の各部は以下のように動作する。 When the model after clustering is used, each unit of the lip sync animation creating apparatus 200 operates as follows.

−キーフレーム削除部２３６の動作− -Operation of the key frame deletion unit 236-
図１１を参照して、キーフレーム削除部２３６は、ステップ３６０〜３６８までの処理についてはクラスタリング後のモデルを使用しない場合と同様に動作する。しかし、ステップ３７０の判定ではステップ３７６を選択する。ステップ３７６では、隣り合う全てのキーフレームの間で、代表頂点を用いて距離Ｄ’を算出する。代表頂点を用いた距離Ｄ’の算出については前述したとおりであるが、代表頂点の移動距離に、その代表頂点により代表されるクラスタ内の頂点の数を乗算し、その値を全てのクラスタにわたり合計することにより距離Ｄ’が得られる。 Referring to FIG. 11, key frame deletion unit 236 operates in the same manner as when the model after clustering is not used for the processing of steps 360 to 368. In the determination of step 370, however, step 376 is selected. In step 376, the distance D' is calculated using the representative vertices between all adjacent key frames. As described above, the distance D' is obtained by multiplying the movement distance of each representative vertex by the number of vertices in the cluster it represents, and summing these values over all clusters.

ステップ３７８では、こうして算出された距離Ｄ’を用い、頂点の動きが最も早いキーフレームを削除対象のキーフレームに決定する。ステップ３８０以下の処理は、クラスタリング後のモデルを使用しない場合と同様である。 In step 378, using the distance D' thus calculated, the key frame with the fastest vertex movement is determined as the key frame to be deleted. The processing from step 380 onward is the same as when the model after clustering is not used.
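The distance D' of step 376 can be sketched in Python as follows. The function name `clustered_distance` and the argument layout are hypothetical; the computation itself follows the description: each representative vertex's movement distance is weighted by the size of its cluster and the results are summed.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]


def clustered_distance(
    rep_before: Sequence[Vertex],
    rep_after: Sequence[Vertex],
    cluster_sizes: Sequence[int],
) -> float:
    """Distance D': representative-vertex displacement times cluster size, summed."""
    return sum(
        math.dist(p, q) * n
        for p, q, n in zip(rep_before, rep_after, cluster_sizes)
    )
```

Compared with summing over every vertex, this needs one distance computation per cluster, which is the saving that clustering is meant to provide.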

−発話パワーによるブレンド率調整部２４４の動作− -Operation of the blend rate adjustment unit 244 based on speech power-
発話パワーによるブレンド率調整部２４４は、クラスタリング後のモデルを使用しない場合と全く同様である。したがってここではその詳細は繰返さない。 The operation of the blend rate adjustment unit 244 based on speech power is exactly the same as when the model after clustering is not used. Therefore, its details are not repeated here.

−ブレンド率調整部２５０の動作− -Operation of the blend rate adjustment unit 250-
この場合、ブレンド率調整部２５０は以下のように動作する。図１５を参照して、ステップ４７０及び４７２の処理はクラスタリング後のモデルを使用しない場合と同様である。ただし、ステップ４７４の判定ではステップ４８０が選択される。 In this case, the blend rate adjustment unit 250 operates as follows. Referring to FIG. 15, the processing in steps 470 and 472 is the same as when the model after clustering is not used. In the determination of step 474, however, step 480 is selected.

ステップ４８０では、全キーフレームに対し、その前後のキーフレームとの間の頂点の平均速度を、各頂点が属するクラスタの代表頂点の動きベクトルを用いて算出する。ここでの算出方法はキーフレーム削除部２３６での算出方法と同様である。そしてステップ４８２において、このようにして算出された平均速度をソートキーに、全てのキーフレームを降順にソートする。この後は、ステップ４８４及び４８６をクラスタリング後のモデルを使用しない場合と同様に実行する。 In step 480, for all key frames, the average speed of vertices between the previous and next key frames is calculated using the motion vector of the representative vertex of the cluster to which each vertex belongs. The calculation method here is the same as the calculation method in the key frame deletion unit 236. In step 482, all key frames are sorted in descending order using the average speed thus calculated as a sort key. After this, steps 484 and 486 are executed in the same manner as when the model after clustering is not used.

図１６に、キーフレーム削除部２３６によるキーフレーム削除の結果の一例を示す。図１６（Ａ）はキーフレーム削除部２３６によるキーフレームの削除なし（視覚素シーケンス作成部２３０による出力のまま。ただしブレンド率については発話パワーによって初期値を付与してある。）を示し、図１６（Ｂ）及び図１６（Ｃ）はそれぞれ削除率γ＝２０％及び３０％に設定したときの結果を示す。図１６（Ｄ）は従来の方法にしたがい、人間のアニメータが音声を聞きながら手作業によってキーフレームを設定した結果を示す。自動的な処理で図１６（Ｄ）に近い結果が得られると好ましい。 FIG. 16 shows an example of the result of key frame deletion by the key frame deletion unit 236. FIG. 16A shows that the key frame deletion unit 236 does not delete the key frame (the output from the visual element sequence generation unit 230 remains as it is, but the blend rate is given an initial value depending on the speech power). 16 (B) and FIG. 16 (C) show the results when the deletion rate γ is set to 20% and 30%, respectively. FIG. 16D shows the result of manually setting a key frame by a human animator while listening to the sound according to a conventional method. It is preferable that results close to FIG. 16D are obtained by automatic processing.

図１６（Ａ）と図１６（Ｂ）とを比較すると、キーフレーム５００及び５０２が削除されていることが分かる。この結果、図１６（Ｂ）と図１６（Ｄ）とはかなり近い結果となっている。さらに図１６（Ｂ）と図１６（Ｃ）とを比較すると、キーフレーム５１０が削除されている。この結果を図１６（Ｄ）と比較すると、両者が非常に類似していることが分かる。特に図１６（Ｃ）の結果から合成したアニメーションと、図１６（Ｄ）の手作業による結果から合成したアニメーションとは、前半部分において非常によく一致しており、主観的な評価ではほとんど差がなかった。 Comparing FIG. 16A with FIG. 16B shows that key frames 500 and 502 have been deleted. As a result, FIG. 16B is fairly close to FIG. 16D. Comparing FIG. 16B with FIG. 16C further shows that key frame 510 has been deleted. Comparing this result with FIG. 16D shows that the two are very similar. In particular, the animation synthesized from the result of FIG. 16C and the animation synthesized from the manual result of FIG. 16D match very well in the first half, and there was almost no difference between them in subjective evaluation.

図１７の上段（Ａ）（Ｂ）は、従来の方法によって得られた顔画像の口付近のアニメーション結果（Ａ）と、上記実施の形態によって得られたアニメーション結果（Ｂ）とを対比して示す。図１７の下段（Ｃ）（Ｄ）は、対応する各キーフレームのブレンド率を示す。従来の方法によるブレンド率を図１７（Ｄ）に、本発明の実施の形態によるブレンド率を図１７（Ｃ）に、それぞれ示す。図１７（Ｃ）における枠５３０、図１７（Ｄ）における枠５３２に相当する部分の顔アニメーションが図１７（Ｂ）及び（Ａ）に該当する。 The upper part (A) and (B) of FIG. 17 shows, side by side for comparison, the animation result near the mouth of the face image obtained by the conventional method (A) and the animation result obtained by the above embodiment (B). The lower part (C) and (D) of FIG. 17 shows the blend rates of the corresponding key frames. FIG. 17D shows the blend rate according to the conventional method, and FIG. 17C shows the blend rate according to the embodiment of the present invention. The face animations of the portions corresponding to frame 530 in FIG. 17C and frame 532 in FIG. 17D are those shown in FIGS. 17B and 17A, respectively.

図１７（Ｃ）及び（Ｄ）を参照して、従来の方法によるブレンド率のグラフ５２２と、本実施の形態によるブレンド率のグラフ５２０とを比較すると、本実施の形態では全体にブレンド率が低くなり、その結果口画像の動きが滑らかになっていることが分かる。 Referring to FIGS. 17C and 17D, a blend rate graph 522 according to the conventional method is compared with a blend rate graph 520 according to the present embodiment. As a result, the mouth image moves smoothly.

以上のように本実施の形態に係る視覚素シーケンス作成部２３０によれば、発話音声及びそのトランスクリプションと、視覚素に相当する基本的な顔画像のモデルとから、自動的に音声に対応して滑らかに変化する顔画像を作成することができる。発話パワーが小さい部分、又は隣接するキーフレームとの間のモデルの各頂点の動きが速すぎるキーフレームなどにおいては、ブレンド率は低くなるように調整される。その結果、得られる顔画像のアニメーションはいわゆる「うるさい」アニメーションではなく、滑らかで、手作業によってキーフレーム及びそのブレンド率を調整した場合に近いアニメーションを作成することができる。 As described above, according to the visual element sequence creation unit 230 according to the present embodiment, the voice is automatically supported from the speech voice and its transcription, and the basic face image model corresponding to the visual element. Thus, a face image that changes smoothly can be created. In a portion where the speech power is low or a key frame in which each vertex of the model between the adjacent key frames moves too fast, the blend rate is adjusted to be low. As a result, the animation of the obtained face image is not a so-called “noisy” animation, but an animation that is smooth and close to the case where the key frame and its blend ratio are adjusted manually can be created.

［コンピュータによる実現］ [Realization by computer]
上述の実施の形態は、コンピュータシステム及びコンピュータシステム上で実行されるプログラムによって実現され得る。図１８はこの実施の形態で用いられるコンピュータシステム５５０の外観を示し、図１９はコンピュータシステム５５０のブロック図である。ここで示すコンピュータシステム５５０は単なる例であって、他の構成も利用可能である。 The above-described embodiment can be realized by a computer system and a program executed on the computer system. FIG. 18 shows the appearance of a computer system 550 used in this embodiment, and FIG. 19 is a block diagram of the computer system 550. The computer system 550 shown here is merely an example, and other configurations can be used.

図１８を参照して、コンピュータシステム５５０はコンピュータ５６０と、全てコンピュータ５６０に接続された、モニタ５６２と、キーボード５６６と、マウス５６８と、スピーカ５５８と、マイクロフォン５９０と、を含む。さらに、コンピュータ５６０はＤＶＤ−ＲＯＭ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｋＲｅａｄ−Ｏｎｌｙ−Ｍｅｍｏｒｙ：ディジタル多用途ディスク読出専用メモリ）ドライブ５７０と、半導体メモリドライブ５７２とを含む。 Referring to FIG. 18, computer system 550 includes a computer 560, a monitor 562, a keyboard 566, a mouse 568, a speaker 558, and a microphone 590, all connected to computer 560. Further, the computer 560 includes a DVD-ROM (Digital Versatile Disk Read-Only-Memory) drive 570 and a semiconductor memory drive 572.

図１９を参照して、コンピュータ５６０はさらに、ＤＶＤ−ＲＯＭドライブ５７０と半導体メモリドライブ５７２とに接続されたバス５８６と、全てバス５８６に接続された、ＣＰＵ５７６と、コンピュータ５６０のブートアッププログラムを記憶するＲＯＭ５７８と、ＣＰＵ５７６によって使用される作業領域を提供するとともにＣＰＵ５７６によって実行されるプログラムのための記憶領域となるＲＡＭ５８０と、音声データ、音響モデル、言語モデル、レキシコン、及びマッピングテーブルを記憶するためのハードディスクドライブ５７４と、ネットワーク５５２への接続を提供するネットワークインターフェイス５９６とを含む。 Referring to FIG. 19, computer 560 further stores a bus 586 connected to DVD-ROM drive 570 and semiconductor memory drive 572, a CPU 576 connected to bus 586, and a boot-up program for computer 560. ROM 578, RAM 580 that provides a work area used by CPU 576 and serves as a storage area for programs executed by CPU 576, and stores voice data, acoustic models, language models, lexicons, and mapping tables It includes a hard disk drive 574 and a network interface 596 that provides a connection to the network 552.

図５に示す発話記憶部１５２、トランスクリプション記憶部１５４、キャラクタモデル記憶部１５６、音響モデル記憶部１７０、音素−視覚素マッピングテーブル記憶部１７６、視覚素シーケンス記憶部１８０、クラスタ化顔モデル記憶部２３４、発話パワー記憶部２４０、視覚素シーケンス記憶部２５４などは、いずれも図１９に示すハードディスクドライブ５７４又はＲＡＭ５８０により実現される。また、削除率入力部２０１、クラスタ処理指定部２０２、発話パワー使用指示入力部２０４、減衰率入力部２０６、使用指示入力部２０８及び減衰率入力部２１０等は、いずれも図１８及び図１９に示すモニタ５６２並びにキーボード５６６及びマウス５６８を用いるグラフィカルユーザインタフェースを実現するプログラムによって実現される。そのような入力のプログラムの構成は周知であるので、ここではその詳細については説明しない。 Speech storage unit 152, transcription storage unit 154, character model storage unit 156, acoustic model storage unit 170, phoneme-visual element mapping table storage unit 176, visual element sequence storage unit 180, clustered face model storage shown in FIG. The unit 234, the speech power storage unit 240, the visual element sequence storage unit 254, and the like are all realized by the hard disk drive 574 or the RAM 580 shown in FIG. Further, the deletion rate input unit 201, the cluster processing specification unit 202, the speech power use instruction input unit 204, the attenuation rate input unit 206, the use instruction input unit 208, the attenuation rate input unit 210, and the like are all shown in FIGS. The monitor 562 shown, and a program that implements a graphical user interface using a keyboard 566 and a mouse 568 are implemented. Since the configuration of such an input program is well known, details thereof will not be described here.

Playback of face image animation 260 is realized by an animation playback program (not shown). The animation playback program need only provide the function of displaying a frame sequence one frame after another at a fixed frame interval according to a predetermined timetable.

The software that realizes the system of the embodiment described above is distributed as object code recorded on a medium such as DVD-ROM 582 or semiconductor memory 584, is provided to computer 560 through a reading device such as DVD-ROM drive 570 or semiconductor memory drive 572, and is stored in hard disk drive 574. When CPU 576 executes the program, the program is read from hard disk drive 574 and stored in RAM 580. An instruction is fetched from the address designated by a program counter (not shown) and executed. CPU 576 reads the data to be processed from hard disk drive 574 and stores the processing results in hard disk drive 574 as well. Speaker 558 and microphone 590 are not directly related to the present invention, but speaker 558 is needed to produce sound when the created animation is played back, and microphone 590 is needed when computer system 550 is used to record speech data.

Since the general operation of computer system 550 is well known, a detailed description is omitted.

As for how the software is distributed, it need not be fixed on a storage medium. For example, the software may be distributed from another computer connected to the network. Alternatively, part of the software may be stored in hard disk drive 574, with the remainder fetched into hard disk drive 574 over the network and combined with it at execution time.

Typically, a modern computer uses general functions provided by its operating system (OS) and achieves a desired purpose by invoking them in a controlled manner. Accordingly, even a program that does not itself contain general functions that can be provided by the OS or a third party, and that specifies only the combination and order in which those general functions are executed, clearly falls within the scope of the present invention as long as the program as a whole has a control structure that achieves the desired purpose.

[Second Embodiment]
<Outline>
The first embodiment described above makes it possible to create a smooth face image animation from speech. A commercial animation, however, may be subject to various constraints beyond smoothness of the image. For example, animation is ordinarily created at the same frame rate as television (30 fps (frames per second)) or film (24 fps). In commercial animation, however, there may be a requirement to create the animation at a lower (slower) frame rate, such as 12 fps or 8 fps. In such cases, the following problems arise.

In the apparatus according to the first embodiment, the frame rate for animation creation is set high, so smooth video is obtained. When an animation is deliberately created at a low frame rate, however, the duration of a single key frame often contains multiple phonemes. A period that would properly contain multiple visemes then contains only one mouth image, which raises the question of which viseme to assign to that mouth image. It is reasonable to assign to the key frame one of the visemes contained in its duration. Doing so, however, can result in the same viseme being assigned to consecutive key frames. In general, even an animation created at a slow frame rate such as 8 fps is ultimately rendered as video at the same frame rate as television or film, so if consecutive key frames receive the same viseme, that viseme persists for a considerably long period and the animation may look unnatural.

A related problem is the following. Animation creation programs in current use normally provide a function whereby, once shapes are assigned to one key frame and to the next, the images of the frames in between are generated by automatically interpolating between the images of those two key frames. The change of the image between key frames then becomes smooth without any effort, but for an animation designed around a slow frame rate this produces motion different from what was intended. At a slow frame rate, the resulting animation moves in a jerky manner. This quality is called a "limited feel" and is regarded as one technique of animation production. In an animation intended to have such a limited feel, the automatic interpolation function has the adverse effect that the intended limited feel cannot be achieved.

Furthermore, in human speech the mouth is often left open at the end of an utterance, but in animation ending an utterance in that way can look unnatural. One conceivable remedy is to correct the animation so that the mouth always closes at the end of an utterance. The question then is how to make the correction so that the result looks natural.

The lip sync animation creation apparatus according to the second embodiment, described below, is intended to solve these problems.

- Utterance End Correction -
First, a method of correcting the animation so that the mouth closes at the end of an utterance (hereinafter, "utterance end correction") is described. Referring to FIG. 20, assume that key frame sequence 610 obtained from a speaker's voice contains four consecutive key frames 620, 622, 624, and 626. Of these, key frame 626 represents the silent period after the utterance.

In the present embodiment, the end position of key frame 624, which corresponds to the end of the utterance, is adjusted as follows.

Referring to FIG. 20, consider speech power sequence 630 of the speaker's speech signal from which key frame sequence 610 was created. In the present embodiment, speech power sequence 630 is traced backward along the time axis from the end position of key frame 624 (the start position of key frame 626) to find point 640 at which the speech power is greatest within the period corresponding to key frame 624. Next, a speech power attenuated by a predetermined attenuation amount 642 (δ (dB)) from the speech power at point 640 is calculated, and the sequence is again traced backward along the time axis from the end of key frame 624 to find point 644 at which the speech power equals that attenuated value. The position on the time axis corresponding to point 644 becomes the new end position of key frame 624.

As a result, as shown in FIG. 20, the start of key frame 626 moves earlier, to the position of point 644, yielding new key frame 652, whose duration is longer by the amount by which key frame 624 was shortened. When an animation is created from key frame sequence 650 obtained in this way, the mouth closes earlier at the end of the utterance, and the animation looks natural.
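The backward search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `corrected_end_index`, the representation of the speech power sequence as a list of per-sample values in dB, and the use of "greater than or equal to" in place of exact equality (sampled power rarely hits the attenuated value exactly) are all hypothetical choices.

```python
from typing import Sequence


def corrected_end_index(power_db: Sequence[float], start: int, end: int,
                        delta_db: float) -> int:
    """Return the corrected end sample of a key frame spanning power_db[start:end].

    Assumptions (not from the source): power is sampled in dB, delta_db >= 0,
    and "equals the attenuated power" is relaxed to "first sample, walking
    backward from the end, whose power is at least peak - delta_db".
    """
    if not 0 <= start < end <= len(power_db):
        raise ValueError("invalid key frame span")
    if delta_db < 0:
        raise ValueError("delta_db must be non-negative")

    # Peak power inside the key frame (the analogue of point 640).
    threshold = max(power_db[start:end]) - delta_db

    # Walk backward from the original end (the analogue of point 644).
    for idx in range(end - 1, start - 1, -1):
        if power_db[idx] >= threshold:
            return idx
    return start  # unreachable when delta_db >= 0; kept for completeness


# Hypothetical power values for the last key frame of an utterance.
power = [60.0, 62.0, 70.0, 68.0, 61.0, 55.0, 50.0, 45.0]
new_end = corrected_end_index(power, start=0, end=len(power), delta_db=10.0)
print(new_end)
```

With these made-up values the peak is 70 dB, the threshold is 60 dB, and the trailing low-power samples are handed over to the following silent key frame.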

- Frame Rate Conversion and Viseme Assignment -
Next, the method used in the present embodiment to decide which viseme to assign to each key frame at a low frame rate is described. Referring to FIG. 21, assume that key frame sequence 670 contains six key frames 680, 682, 684, 686, 688, and 690. When the frame rate is as slow as about 8 fps, key frame times are locked to frame positions. That is, the key frames coincide with the frame positions of the video at the given frame rate, as shown in FIG. 21.

Now consider generating key frame sequence 670, shown in the upper part of FIG. 21, from key frame sequence 672 obtained by the lip sync animation creation apparatus according to the first embodiment. Assume that key frame sequence 672 contains key frames 700, 702, 704, 706, 708, 710, 712, 714, and 716.

In this case, each key frame of key frame sequence 670 lasts longer than each key frame of key frame sequence 672, so the duration of a single key frame of sequence 670 corresponds to the visemes of multiple key frames of sequence 672. For example, key frame 682 could be assigned the viseme of any of the three temporally adjacent key frames 702, 704, and 706. Similarly, key frame 688 could be assigned the viseme of key frame 714 or 716. When multiple visemes are candidates for a single key frame in this way, the question is which one to select.

In actual speech, the mouth is observed to begin moving before the sound is produced. It is therefore natural to move the mouth into the shape for a sound ahead of that sound. Following this idea, when assigning a viseme to each key frame of key frame sequence 670 shown in FIG. 21, the present embodiment assigns the viseme of a key frame of key frame sequence 672 whose start lies within the duration of the key frame in question.

For example, referring to FIG. 22, consider key frame 682, indicated by ellipse 730. As noted above, the three key frames 702, 704, and 706 of key frame sequence 672 may correspond to key frame 682. Of these, key frame 702 is excluded as a candidate because its start does not lie within the duration of key frame 682. Key frames 704 and 706 satisfy the condition of starting within the duration of key frame 682. When two or more visemes fall within key frame 682 in this way, it is natural to assign the one that occurs first. In the present embodiment, therefore, viseme N (/m/) of key frame 704 is assigned to key frame 682, as indicated by arrow 734. As indicated by dotted arrows 732 and 736, the visemes of key frames 702 and 706 are not assigned to key frame 682.

This rule, however, can cause a problem in the resulting video. For example, as indicated by ellipse 740 in FIG. 22, key frames 714 and 716 both start within the duration of key frame 688, and both satisfy the condition for assignment to key frame 688. However, if viseme A (/a/) has been assigned to the immediately preceding key frame 686, as shown in FIG. 22, then assigning viseme A (/a/) of key frame 714 to key frame 688 would give the two key frames 686 and 688 exactly the same viseme. As noted above, the same viseme would then persist for a considerably long time, making the animation unnatural.

In such a case, therefore, viseme I (/i/) of the second key frame 716, rather than that of key frame 714, is assigned to key frame 688.

In this way, key frame sequence 670 at a considerably lower frame rate can be created from key frame sequence 672, which was originally created on the assumption of a high frame rate, and the face images of the animation obtained from it show little unnaturalness.

As described above, the visemes indicated by solid arrows 750, 734, 752, 754, and 744 in FIG. 22 are assigned to the key frames of key frame sequence 670. Key frame 690, shown at the tail of key frame sequence 670 in FIG. 22, is assigned the viseme of the next key frame (not shown) of key frame sequence 672, as indicated by arrow 756.

- Shape Stabilization -
Returning to the limited feel mentioned earlier, suppose that mouth shapes of different visemes are assigned to key frames 686 and 688, as shown in FIG. 22. Commonly used animation creation programs generally generate the images of the frames between these two key frames by interpolating between the key frames. As a result, the intended limited feel is lost. This problem is described with reference to FIG. 23(A).

Referring to FIG. 23(A), let the time corresponding to key frame 686 be time t and the time corresponding to key frame 688 be time t+6. That is, five frames lie between these two key frames. At time t, in key frame 790, the blend ratio of viseme /a/ is 100%, as indicated by circle 770, and the blend ratio of viseme /i/ is 0%, as indicated by circle 774. At time t+6, conversely, the blend ratio of viseme /i/ is 100%, as indicated by circle 776, and the blend ratio of viseme /a/ is 0%, as indicated by circle 772. The blend ratios in between are calculated as shown by blend ratio curves 780 and 782. In each frame between time t and time t+6, a face image is created by blending the face images of the two visemes according to these blend ratios. Such blending makes the image change smoothly, but the limited feel is lost and the requirement to create the animation at a low frame rate can no longer be met.

In the present embodiment, therefore, as shown in FIG. 23(B), key frame 790 is copied as key frame 792 at time t+5, which corresponds to the frame immediately before time t+6, with the blend ratios of visemes /a/ and /i/ at time t left unchanged. As a result, even when the animation creation program performs automatic blending, the blend ratio of viseme /a/ stays at 100% and that of viseme /i/ stays at 0% from time t to t+5, as shown by straight lines 800 and 802. The face image changes only between times t+5 and t+6, so the limited feel described above is achieved.
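The effect of the copied key frame on automatic interpolation can be illustrated with a short Python sketch. It assumes that the animation program interpolates blend ratios linearly between key frames; the source only shows blend ratio curves 780 and 782 and does not state the interpolation law, so linear interpolation, the function name `weight_at`, and the (frame, weight) tuple layout are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

Key = Tuple[int, float]  # (frame number, blend ratio of one viseme, 0.0 to 1.0)


def weight_at(keys: List[Key], frame: int) -> float:
    """Linearly interpolate the blend ratio at `frame` from sorted key frames."""
    frames = [f for f, _ in keys]
    if frame <= frames[0]:
        return keys[0][1]
    if frame >= frames[-1]:
        return keys[-1][1]
    hi = bisect_right(frames, frame)          # first key strictly after `frame`
    (f0, w0), (f1, w1) = keys[hi - 1], keys[hi]
    return w0 + (w1 - w0) * (frame - f0) / (f1 - f0)


# Blend ratio of viseme /a/ with t = 0: keys at t and t+6 only (FIG. 23(A) case).
plain = [(0, 1.0), (6, 0.0)]
# Same, with the key at t copied to t+5 (FIG. 23(B) case).
held = [(0, 1.0), (5, 1.0), (6, 0.0)]

print([round(weight_at(plain, f), 2) for f in range(7)])
print([weight_at(held, f) for f in range(7)])
```

Without the copied key, the ratio of /a/ falls steadily across the in-between frames; with it, the ratio stays at 1.0 through frame t+5 and drops only in the last frame interval, which is the step-like change that preserves the limited feel.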

<Configuration>
FIG. 24 is a block diagram of lip sync animation creation apparatus 810 according to the second embodiment. Its configuration is largely the same as that of lip sync animation creation apparatus 200 according to the first embodiment shown in FIG. 5, and differs from it in the following respects. Between key frame deletion unit 236 and selection unit 242 shown in FIG. 5, apparatus 810 further includes utterance end correction unit 822 for performing the utterance end correction described above, and selection units 820 and 824 for selecting whether to use the function of utterance end correction unit 822. Apparatus 810 also includes: frame rate conversion unit 840, connected to receive the output of duration-annotated viseme sequence storage unit 254, for converting the frame rate of the duration-annotated viseme sequence to the frame rate designated by frame rate input 832; shape stabilization processing unit 842 for executing, on the viseme sequence output by frame rate conversion unit 840, shape stabilization processing that prevents the limited feel from being lost through blending by the animation creation program; duration-annotated viseme sequence storage unit 846 for storing the frame-rate-converted, duration-annotated viseme sequence output by shape stabilization processing unit 842; and selection unit 848, which has first and second inputs connected to the outputs of duration-annotated viseme sequence storage units 254 and 846, respectively, and which selects either the output of storage unit 254 or the output of storage unit 846 and supplies it to blend processing unit 256 in accordance with use instruction input 830, which designates whether frame rate conversion is to be used.

Selection units 820 and 824 shown in FIG. 24 select, in accordance with use instruction input 826, which designates whether utterance end correction is to be performed, between supplying the output of key frame deletion unit 236 to selection unit 242 via utterance end correction unit 822 and supplying it to selection unit 242 directly, bypassing utterance end correction unit 822. Utterance end correction unit 822 also receives input 828 of the attenuation δ (dB) described with reference to FIG. 20. Use instruction input 826 and use instruction input 830 may share the same instruction.

As already mentioned, utterance end correction unit 822, frame rate conversion unit 840, and shape stabilization processing unit 842 of lip sync animation creation apparatus 810 can be realized by computer hardware and computer programs executed on that hardware. The control structures of these programs are described below.

FIG. 25 is a flowchart showing the control structure of a computer program for realizing utterance end correction unit 822.

Referring to FIG. 25, this program includes: step 870 of searching the key frame sequence output from key frame deletion unit 236 for an unprocessed utterance end; decision step 872 of determining whether an unprocessed utterance end was found, terminating the process if none was found and otherwise passing control to the next step; and step 874, executed when decision step 872 determines that there is an unprocessed utterance end, of finding the maximum value Pmax of the speech power within the viseme duration of the key frame immediately preceding that utterance end.

The search for an unprocessed utterance end in step 870 is performed by looking for a key frame to which a non-blank viseme is assigned and which immediately precedes a key frame to which the blank viseme is assigned. The processing for finding the maximum value Pmax in step 874 is as described with reference to FIG. 20. The point giving the maximum value Pmax corresponds to point 640 in FIG. 20.

The program further includes: step 876, following step 874, of tracing back from the end of the viseme duration being processed to find the first time t at which the speech power equals Pmax - δ (dB); step 878 of determining whether a point satisfying this condition exists, branching to step 870 if none exists and proceeding to the next step otherwise; and step 880, executed in response to the determination in step 878 that such a point exists, of changing the end of that viseme duration to the time t found in step 876 and likewise changing the start of the immediately following key frame to time t. After step 880, control returns to step 870. The point at time t found in step 876 corresponds to point 644 described with reference to FIG. 20.

FIG. 26 is a flowchart showing the control structure of a computer program for realizing frame rate conversion unit 840 shown in FIG. 24. Referring to FIG. 26, this program includes: step 900 of setting to 0 a variable i that holds the number of the key frame being processed in the iteration that follows; step 902 of adding 1 to variable i; and step 904 of determining whether, as a result of the addition in step 902, variable i exceeds the total number of key frames, terminating the process if it does and otherwise passing control to the next step.

The program further includes: step 906, executed in response to the determination in step 904 that variable i does not exceed the number of key frames, of searching for visemes whose start lies within the duration of the i-th key frame (hereinafter written "key frame (i)"); step 908 of determining whether the number N of visemes found in step 906 is 0 and branching according to the result; step 910, executed in response to the determination in step 908 that N = 0, of discarding key frame (i) and returning control to step 902; step 912, executed in response to the determination in step 908 that N is not 0, of determining whether N is 1 and branching according to the result; step 914, executed in response to the determination in step 912 that N is 1, of assigning to key frame (i) the viseme found in step 906 (written "viseme (1)", since it is the only one in this case) and returning control to step 902; and step 916, executed in response to the determination in step 912 that N is not 1, of setting to 0 a variable j that, in the processing that follows, holds the position, counted from the first, of a viseme whose start lies within the duration of key frame (i).

The program further includes: step 918, following step 916, of adding 1 to variable j; step 920 of determining whether, as a result of the addition in step 918, the value of variable j exceeds the number N of visemes in key frame (i), and branching according to the result; step 922, executed in response to the determination in step 920 that j exceeds N, of assigning to key frame (i) the first viseme that starts within key frame (i) (viseme (1)) and returning control to step 902; and step 924, executed in response to the determination in step 920 that j does not exceed N, of determining whether the j-th viseme in key frame (i) (written "viseme (j)") is identical to the viseme of the preceding key frame (key frame (i-1)) and branching according to the result.

If step 924 determines that viseme (j) matches the viseme of key frame (i-1), control returns to step 918; otherwise control proceeds to the next step.

The program further includes step 926, executed in response to the determination in step 924 that viseme (j) differs from the viseme of key frame (i-1), of assigning viseme (j) to key frame (i) and returning control to step 902.
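The control flow of steps 900 to 926 can be condensed into the following Python sketch, given only as an illustration. The function name `assign_visemes`, the data layout (target key frames as (start, end) times, source visemes (visual elements) as (start time, label) pairs), the half-open interval test, and the choice to compare against the most recently kept key frame when an intermediate key frame has been discarded are assumptions; the flowchart itself refers simply to key frame (i-1). The timings and the label "O" in the example are invented.

```python
from typing import List, Optional, Tuple

KeyFrame = Tuple[float, float]   # (start time, end time) of a low-rate key frame
Viseme = Tuple[float, str]       # (start time, viseme label) from the high-rate sequence


def assign_visemes(keyframes: List[KeyFrame],
                   visemes: List[Viseme]) -> List[Tuple[float, str]]:
    """Assign one viseme to each low-rate key frame, dropping empty key frames."""
    result: List[Tuple[float, str]] = []
    previous: Optional[str] = None
    for kf_start, kf_end in keyframes:
        # Step 906: visemes whose start lies within this key frame's duration.
        candidates = [label for start, label in visemes
                      if kf_start <= start < kf_end]
        if not candidates:
            continue                      # step 910: discard the key frame
        chosen = candidates[0]            # step 914 (N == 1) and step 922 fallback
        if len(candidates) > 1:
            # Steps 918 to 926: first candidate that differs from the previous
            # key frame's viseme; fall back to the first candidate otherwise.
            chosen = next((c for c in candidates if c != previous), candidates[0])
        result.append((kf_start, chosen))
        previous = chosen
    return result


# Hypothetical 8 fps key frames (125 ms each) and high-rate viseme starts.
keyframes = [(0.0, 0.125), (0.125, 0.25), (0.25, 0.375), (0.375, 0.5)]
visemes = [(0.01, "N"), (0.09, "O"), (0.13, "A"), (0.26, "A"), (0.33, "I")]
print(assign_visemes(keyframes, visemes))
```

In this example the third key frame has candidates A and I; because the preceding key frame already received A, the sketch selects I, mirroring the treatment of key frame 688 in FIG. 22. The fourth key frame contains no viseme start and is dropped.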

FIG. 27 shows, in flowchart form, the control structure of a program that implements shape stabilization unit 842 shown in FIG. 24. Referring to FIG. 27, the program includes: step 950 of setting to 1 a variable i that represents the number of the key frame to be processed; step 952 of adding 1 to i; step 954 of determining whether, as a result of the addition in step 952, i has exceeded the number of key frames to be processed, and ending the process if it has; and step 956, executed when step 954 determines that i does not exceed the number of key frames, of copying key frame (i-1) into the frame immediately preceding key frame (i) as a new key frame and then returning control to step 952.
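
The loop of steps 950 to 956 can be sketched as below. The tuple representation of a key frame, the name `stabilize_shapes`, and the rule of skipping the copy when two key frames are already adjacent are assumptions made for illustration; the source only states that key frame (i-1) is copied to the frame immediately before key frame (i).

```python
from typing import List, Tuple

KeyFrame = Tuple[int, str]  # (frame number, viseme label), sorted by frame number


def stabilize_shapes(key_frames: List[KeyFrame]) -> List[KeyFrame]:
    """Copy key frame (i-1) onto the frame just before key frame (i)."""
    if not key_frames:
        return []
    result = [key_frames[0]]
    for i in range(1, len(key_frames)):  # steps 950-954: i = 2..K in 1-based terms
        prev_frame, prev_viseme = key_frames[i - 1]
        frame, _ = key_frames[i]
        hold_frame = frame - 1
        if hold_frame > prev_frame:  # assumption: nothing to copy if already adjacent
            result.append((hold_frame, prev_viseme))  # step 956
        result.append(key_frames[i])
    return result
```

With key frames at frames 0, 4, and 5, the call inserts a copy of the first pose at frame 3, so the mouth holds its shape until just before the next key frame.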

<Operation>
Lip sync animation creating apparatus 810 shown in FIG. 24 operates as follows. In the following description, use instruction inputs 826 and 830 are assumed to give the same value to apparatus 810. When use instruction inputs 826 and 830 specify that the processing by utterance-end correction unit 822, frame rate conversion unit 840, and shape stabilization unit 842 is not to be used, selection units 820 and 824 pass the output of key frame deletion unit 236 directly to the input of selection unit 242. Selection unit 848 passes the output of duration-annotated viseme sequence storage unit 254 to blend processing unit 256. In this case the configuration of apparatus 810 is therefore effectively identical to that of lip sync animation creating apparatus 200 shown in FIG. 5, and it operates in the same way.

When use instruction inputs 826 and 830 specify that utterance-end correction unit 822, frame rate conversion unit 840, and shape stabilization unit 842 are to be used, selection unit 820 passes the output of key frame deletion unit 236 to utterance-end correction unit 822. The output of utterance-end correction unit 822 is applied to the input of selection unit 242 through selection unit 824.

Meanwhile, selection unit 848 selects the output of duration-annotated viseme sequence storage unit 846, rather than that of duration-annotated viseme sequence storage unit 254, and passes it to blend processing unit 256. In response to frame rate input 832, frame rate conversion unit 840 reads in order the viseme sequence stored in storage unit 254, converts the frame rate by the method shown in FIGS. 21 and 22, assigns a viseme to each frame, and passes the converted viseme sequence to shape stabilization unit 842. Shape stabilization unit 842 copies each key frame in the viseme sequence output by frame rate conversion unit 840 into the frame immediately preceding the next key frame, as shown in FIG. 23. After doing this for all key frames, it outputs the result to storage unit 846.

As already described, selection unit 848 selects the output of duration-annotated viseme sequence storage unit 846 and passes it to blend processing unit 256. Blend processing unit 256 reads the key frame sequence stored in storage unit 846 and creates and outputs an animation by interpolating, over the frames between each pair of adjacent key frames, the blend rates specified for those key frames. The frame rate of animation 260 created in this way is the same as that of television or film. However, because frame rate conversion unit 840 has removed key frames, and shape stabilization unit 842 has applied shape stabilization so as to prevent interpolation of the animation between adjacent key frames, the result has substantially the same limited-animation feel as an animation at the low frame rate specified by frame rate input 832.
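
The effect of the held key frames on interpolation can be illustrated with a small sketch. The function `blend_at`, the triple representation of a key frame, and the use of plain linear interpolation are assumptions for illustration; key frames are assumed to have strictly increasing frame numbers.

```python
from typing import Dict, List, Tuple

KeyFrame = Tuple[int, str, float]  # (frame number, viseme label, blend rate)


def blend_at(key_frames: List[KeyFrame], frame: int) -> Dict[str, float]:
    """Linearly interpolate viseme blend rates at the given frame."""
    if frame <= key_frames[0][0]:
        _, viseme, rate = key_frames[0]
        return {viseme: rate}
    for (f0, v0, r0), (f1, v1, r1) in zip(key_frames, key_frames[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)  # 0 at the left key frame, 1 at the right
            weights: Dict[str, float] = {}
            weights[v0] = weights.get(v0, 0.0) + (1.0 - t) * r0
            weights[v1] = weights.get(v1, 0.0) + t * r1
            return weights
    _, viseme, rate = key_frames[-1]
    return {viseme: rate}


smooth = [(0, "A", 1.0), (4, "I", 1.0)]                # no shape stabilization
held = [(0, "A", 1.0), (3, "A", 1.0), (4, "I", 1.0)]   # copy of "A" at frame 3
```

At frame 2, `blend_at(smooth, 2)` is an even mix of "A" and "I", whereas `blend_at(held, 2)` is still entirely "A": the pose changes only in the last frame before the next key frame, which is what gives the stepped look of a low frame rate.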

[Third Embodiment]
<Outline>
The first and second embodiments described above make it possible to create a face-image animation based on the visemes /A/, /I/, /U/, /E/, /O/, and /N/ (hereinafter "standard visemes"; the phonemes corresponding to them are called "standard phonemes"). Japanese, however, has more than ten visemes including the standard ones (/K/, /S/, /T/, and so on), so the standard visemes alone may not be enough to create smooth Japanese animation. Moreover, in the above embodiments the face images for the standard visemes are prepared in advance; if other visemes are also used to create Japanese animation, the number of face images that must be prepared increases. The face models for these images are created by manually editing the standard face model that serves as the reference for animation creation, so preparing face images for many visemes is difficult. Creating animation for other languages such as English or Chinese requires face images for still other visemes, and is therefore even more difficult.

The lip sync animation creating apparatus according to the third embodiment, described below, is intended for creating Japanese lip sync animation using a viseme group that includes the standard visemes and visemes other than the standard visemes (hereinafter "general visemes"), and for extending this to multiple languages.

<Configuration>
FIG. 28 is a block diagram of lip sync animation creating apparatus 1000 according to the third embodiment. The configuration of apparatus 1000 shown in FIG. 28 is largely the same as that of lip sync animation creating apparatus 810 according to the second embodiment shown in FIG. 24, but differs in that it creates Japanese face-image animation 260 using general visemes as well as the standard visemes.

Specifically, lip sync animation creating apparatus 1000 differs from apparatus 810 shown in FIG. 24 in two respects. First, in place of phoneme-viseme mapping table storage unit 176 shown in FIG. 24, it includes phoneme-viseme mapping table storage unit 1002, which has a similar structure but stores a phoneme-viseme mapping table that associates each Japanese phoneme with one viseme chosen from a viseme group containing both the standard visemes and other visemes. Second, in place of 3D character model storage unit 156 shown in FIG. 24, which stores face models corresponding to the standard visemes (hereinafter "standard viseme models"), it includes 3D character model storage unit 1004, which stores a 3D character model made up of face models, defined relative to the standard face model, not only for the standard visemes but also for the other Japanese visemes (hereinafter "general viseme models").

Lip sync animation creating apparatus 1000 further differs from apparatus 810 shown in FIG. 24 in that it includes: capture data storage unit 1006, which stores three-dimensional data of facial feature points captured while a speaker is uttering Japanese sentences (hereinafter "capture data") in association with the phoneme being uttered at the time; standard viseme model storage unit 1008, which stores the standard viseme models; coefficient calculation unit 1010, which uses the capture data stored in unit 1006 and the standard viseme models stored in unit 1008 to calculate coefficients for approximating each item of capture data corresponding to a phoneme other than the standard phonemes (/k/, /s/, /t/, and so on) by a linear sum of the capture data corresponding to the standard phonemes; and character model synthesis unit 1012, which uses the coefficients calculated by unit 1010 to express the general viseme models as linear sums of the standard viseme models stored in unit 1008, creates a 3D character model from the standard and general viseme models, and stores it in character model storage unit 1004.

How many general visemes to use, which ones to select as general visemes, and which of the standard and general visemes to associate with each Japanese phoneme are matters of design. The standard phonemes, however, must always be associated with the standard visemes.

With reference to FIG. 29, the following describes the process of obtaining the coefficients for approximating a general viseme model by a linear sum of the standard viseme models, using the capture data stored in capture data storage unit 1006 of FIG. 28 and the standard viseme models stored in standard viseme model storage unit 1008.

Referring to FIG. 29, capture data storage unit 1006 is assumed to store capture data /~A/, /~I/, /~U/, /~E/, /~O/, /~N/, /~K/, /~S/, /~T/, /~H/, /~B/, and so on, which are the capture data of the speaker's face while uttering the Japanese phonemes /a/, /i/, /u/, /e/, /o/, /n/, /k/, /s/, /t/, /h/, /b/, and so on, respectively. (In the formulas the symbol "~" is placed above the letter.) /~N/ is the capture data of the facial feature points of the speaker while uttering the phoneme /n/. Every item of capture data other than /~N/ is expressed relative to /~N/, as three-dimensional vectors indicating how far each feature point of the face has moved, in the three-dimensional space in which the face image is defined, from the corresponding feature point of /~N/.
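
As a minimal sketch of this representation, the displacement form can be computed by subtracting the reference capture point by point. The function name `displacements` and the tuple layout are hypothetical, and the sketch assumes both captures list the same feature points in the same order.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) of one facial feature point


def displacements(capture: List[Point], neutral: List[Point]) -> List[Point]:
    """Express a capture as per-point movement from the reference capture /~N/."""
    if len(capture) != len(neutral):
        raise ValueError("captures must contain the same feature points")
    return [
        (x - nx, y - ny, z - nz)
        for (x, y, z), (nx, ny, nz) in zip(capture, neutral)
    ]
```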

Referring to FIG. 29, standard viseme model storage unit 1008 stores the standard viseme models /A/, /I/, /U/, /E/, and /O/, each as a set of displacement vectors of the feature points from the reference viseme model /N/. All of these viseme models are created for the character used in the animation.

Coefficient calculation unit 1010 functions as follows. As an example, the following describes how to obtain, from the capture data stored in capture data storage unit 1006, the general viseme model /K/ that is used for animation creation and is associated with the phoneme /k/.

The general viseme model /~K/ is formulated as follows.

\[ /\tilde{K}/ = \tilde{\alpha}_{KA}\,/\tilde{A}/ + \tilde{\alpha}_{KI}\,/\tilde{I}/ + \tilde{\alpha}_{KU}\,/\tilde{U}/ + \tilde{\alpha}_{KE}\,/\tilde{E}/ + \tilde{\alpha}_{KO}\,/\tilde{O}/ + \varepsilon_{K} \]

Here ~α_KA, ~α_KI, ~α_KU, ~α_KE, and ~α_KO (in the formula the symbol "~" is placed above the letter) are real-valued variables, and ε_K is an error variable. This equation can be written for every vector that represents the position of a feature point constituting /~K/. That is, if the capture data consists of M feature points, M such vector linear-sum equations are obtained.

The values of ~α_KA, ~α_KI, ~α_KU, ~α_KE, and ~α_KO are calculated so as to minimize the sum of squares of ε_K computed over all M of these vector equations. The calculated values of ~α_KA, ~α_KI, ~α_KU, ~α_KE, and ~α_KO are denoted α_KA, α_KI, α_KU, α_KE, and α_KO, respectively. Calculating these coefficients is the processing performed by coefficient calculation unit 1010.
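
The least-squares fit described here can be sketched in plain Python. The source specifies only the criterion (minimum sum of squared error), so solving it through the normal equations with Gaussian elimination is a choice made for this illustration; the names `fit_coefficients`, `solve_linear`, and `dot`, and the flattened-vector layout, are hypothetical.

```python
from typing import Dict, List

Vector = List[float]  # flattened displacements [x1, y1, z1, ..., xM, yM, zM]


def dot(u: Vector, v: Vector) -> float:
    return sum(p * q for p, q in zip(u, v))


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("standard captures are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_coefficients(standard: Dict[str, Vector], target: Vector) -> Dict[str, float]:
    """Weights that minimize the squared error between target and the weighted sum."""
    labels = list(standard)
    gram = [[dot(standard[p], standard[q]) for q in labels] for p in labels]
    rhs = [dot(standard[p], target) for p in labels]
    return dict(zip(labels, solve_linear(gram, rhs)))
```

Here `standard` would hold the capture data for the standard phonemes (/~A/ to /~O/) and `target` the capture data for the phoneme being fitted, such as /~K/; stacking all M feature points into one vector makes the sum of squares run over every equation at once.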

Character model synthesis unit 1012 uses the coefficients α_KA, α_KI, α_KU, α_KE, and α_KO calculated by coefficient calculation unit 1010 to calculate, as a linear sum of the standard viseme models, the value of the three-dimensional vector representing the position of each feature point constituting the general viseme model, and stores the result in character model storage unit 1004.

The function of character model synthesis unit 1012 is described below, taking as an example the calculation of the general viseme model /K/ that is used for animation creation and is associated with the phoneme /k/. Character model synthesis unit 1012 calculates the general viseme model /K/ according to the following equation.

\[ /K/ = \alpha_{KA}\,/A/ + \alpha_{KI}\,/I/ + \alpha_{KU}\,/U/ + \alpha_{KE}\,/E/ + \alpha_{KO}\,/O/ \]

This equation means that every three-dimensional vector representing the position of a feature point of general viseme model /K/ is expressed as a linear sum of the three-dimensional vectors representing the positions of the feature points of standard viseme models /A/, /I/, /U/, /E/, and /O/.

In the same way, character model synthesis unit 1012 obtains the general viseme models /S/, /T/, /H/, /B/, and so on as linear sums of the standard viseme models /A/, /I/, /U/, /E/, and /O/.

The general viseme models obtained in this way are stored in character model storage unit 1004 together with the standard viseme models.
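
The synthesis step amounts to a weighted sum of the character's standard viseme models. The sketch below assumes each model is a flattened list of feature-point displacements from /N/; the function name `synthesize_model` and the numeric values in the usage note are invented for illustration.

```python
from typing import Dict, List

Vector = List[float]  # flattened displacements of all feature points from /N/


def synthesize_model(standard_models: Dict[str, Vector],
                     coefficients: Dict[str, float]) -> Vector:
    """Build a general viseme model as a linear sum of standard viseme models."""
    length = len(next(iter(standard_models.values())))
    model = [0.0] * length
    for label, weight in coefficients.items():
        for index, value in enumerate(standard_models[label]):
            model[index] += weight * value
    return model
```

For instance, with standard models `{"A": [1.0, 0.0, 0.0], "I": [0.0, 2.0, 0.0]}` and coefficients `{"A": 0.5, "I": 0.25}`, the synthesized model is `[0.5, 0.5, 0.0]`. The coefficients come from the speaker's capture data, while the models they weight belong to the animation character, which is how motion measured on a person is transferred to the character.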

Table 7 shows an example of the mapping table stored in phoneme-viseme mapping table storage unit 1002.

Referring to Table 7, in this embodiment the rows from the phoneme /a/ in the first row to /o/ in the fifth row are the same as in Table 1 used in the first embodiment. Unlike Table 1, however, the phoneme /n/ is associated only with the viseme /N/. In the seventh row, the phoneme /k/ is associated with the general viseme /K/. The phonemes in the eighth and subsequent rows, /s/ and so on, are treated in the same way as /k/ in the seventh row. With such a mapping table, the viseme corresponding to a given phoneme can be looked up, and the viseme model whose viseme label matches that viseme can be read from character model storage unit 1004.
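
A mapping table of this kind can be held as a simple dictionary. The entries below are illustrative only: Table 7 itself is not reproduced here, so the vowel rows are assumed to map each vowel to the viseme of the same letter, and /s/ is assumed to map to /S/ by analogy with /k/. The names `PHONEME_TO_VISEME` and `viseme_sequence` are hypothetical.

```python
from typing import Dict, List

# Hypothetical excerpt of a phoneme-to-viseme mapping table (not the actual Table 7).
PHONEME_TO_VISEME: Dict[str, str] = {
    "a": "A", "i": "I", "u": "U", "e": "E", "o": "O",  # standard phonemes
    "n": "N",                                          # /n/ maps only to /N/
    "k": "K", "s": "S",                                # general visemes
}


def viseme_sequence(phonemes: List[str]) -> List[str]:
    """Look up the viseme label for each phoneme; unknown phonemes raise KeyError."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]
```

Each returned label then serves as the key for reading the matching viseme model from the character model store.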

<Operation>
Lip sync animation creating apparatus 1000, whose configuration is described above, operates as follows. The operation of apparatus 1000 shown in FIG. 28 is largely the same as that of lip sync animation creating apparatus 810 shown in FIG. 24; only the Japanese 3D character model used is different. Accordingly, only the operation added in this embodiment, namely the creation of a 3D character model that includes general viseme models, is described in detail below. The other operations are only outlined, and their detailed description is not repeated.

In lip sync animation creating apparatus 1000 according to this embodiment, creating face-image animation 260 requires two preparatory tasks: creating the phoneme-viseme mapping table, and creating a 3D character model that includes general viseme models. These tasks are described below.

-Creation of phoneme-viseme mapping table 1002-
Japanese phonemes and visemes are associated by hand, and a phoneme-viseme mapping table in machine-readable form is created and stored in phoneme-viseme mapping table storage unit 1002. Unlike in the second embodiment, phonemes other than the standard phonemes need not be associated with standard visemes. An arbitrary phoneme may be associated with a viseme other than the standard visemes (a general viseme). Table 7 above is an example of a phoneme-viseme mapping table created in this way.

-Creation of Japanese 3D character model storage unit 1004-
Coefficient calculation unit 1010 and character model synthesis unit 1012 create a 3D character model that contains general viseme models as well as the standard viseme models, as follows. The general viseme models to be created are those for all the visemes associated with phonemes in the phoneme-viseme mapping table described above.

Referring to FIG. 29, coefficient calculation unit 1010 selects any phoneme-viseme pair registered in the phoneme-viseme mapping table and reads, from the capture data stored in capture data storage unit 1006, the capture data labeled with the phoneme of the selected pair (for convenience, the "synthesis target capture data"). Coefficient calculation unit 1010 also reads from unit 1006 all the capture data corresponding to the standard phonemes. Then, as already described, it calculates the coefficients for approximating the synthesis target capture data by a linear sum of the capture data corresponding to the standard phonemes. It attaches to this coefficient group the label of the viseme associated with the phoneme of the synthesis target capture data, and passes it to character model synthesis unit 1012.

Coefficient calculation unit 1010 repeats the same processing for every phoneme-viseme mapping stored in phoneme-viseme mapping table storage unit 1002 that contains a general viseme.

Character model synthesis unit 1012 performs the following processing based on the coefficient group and viseme label supplied by coefficient calculation unit 1010. It expresses the general viseme model corresponding to the supplied viseme label as a linear sum of the standard viseme models stored in standard viseme model storage unit 1008, using the coefficients supplied by coefficient calculation unit 1010 as the weights. As a result, the general viseme model corresponding to the supplied viseme label is expressed as a linear sum of the standard viseme models.

Character model synthesis unit 1012 repeats the above processing for every pair of coefficient group and viseme label supplied by coefficient calculation unit 1010, and stores the results in character model storage unit 1004. Each general viseme model stored in character model storage unit 1004 carries the corresponding viseme label.

Character model synthesis unit 1012 also stores the standard viseme models held in standard viseme model storage unit 1008 in character model storage unit 1004, each with its corresponding viseme label.

The above processing completes the 3D character model for Japanese.

Once the 3D character model is complete, the subsequent operation of lip sync animation creating apparatus 1000 is no different from that of lip sync animation creating apparatus 810 according to the second embodiment. However, the face images used for the key frames of the animation can be obtained from general viseme models as well as from standard viseme models. The resulting lip sync animation is therefore smoother than that obtained in the second embodiment.

[Extension to multiple languages]
The description of the third embodiment above assumed that lip sync animation creating apparatus 1000 is an apparatus for creating Japanese animation. In fact, however, the method of creating the Japanese 3D character model in the third embodiment can be extended, using the Japanese standard phonemes and standard viseme models, to the creation of animation in languages other than Japanese, such as English or Chinese. As long as such a 3D character model is used, the basic configuration of the part of apparatus 1000 that creates the lip sync animation can be used as is.

As an example, consider creating English animation. Because the language is English, the following changes to lip sync animation creating apparatus 1000 shown in FIG. 28 are needed. Assuming a different speaker, the acoustic model stored in acoustic model storage unit 170 must be replaced with one for the English speaker. Naturally, the contents of utterance storage unit 152 and transcription storage unit 154 used for animation creation also change. The table in phoneme-viseme mapping table storage unit 1002 must likewise be newly created, based on the English phonemes and the visemes observed when those phonemes are pronounced. Since a different speaker is assumed, the capture data stored in capture data storage unit 1006 must also be recorded from an English speaker.

In this case, the 3D character model stored in character model storage unit 1004 is created as follows. FIG. 30 illustrates a method of preparing a 3D character model for creating English animation.

Referring to FIG. 30, in this case capture data representing the positions of the facial feature points of the speaker during English utterances is prepared in capture data storage unit 1006 shown in FIG. 29. After global coordinate variation caused by head movement has been removed by correction, this capture data is expressed, relative to the capture data during silence, by how far each feature point has moved from its position during silence. The capture data is assumed to include capture data for the utterance of phonemes that correspond to the Japanese standard phonemes.

Coefficient calculation unit 1010 refers to the English phoneme-viseme mapping stored in phoneme-viseme mapping table storage unit 1002 and, for each phoneme-viseme combination appearing there, determines a coefficient group by the least-squares criterion so that the capture data labeled with that phoneme is approximated by a linear sum of the capture data for the utterance of the phonemes corresponding to the Japanese standard phonemes. For every phoneme appearing in the phoneme-viseme mapping table, a general viseme model is created as a linear sum using this coefficient group and stored in character model storage unit 1004 together with the standard viseme models, with the corresponding viseme label attached.

Once the English phoneme-viseme mapping table, the English 3D character model, acoustic model storage unit 170 for the English speaker, English utterance storage unit 152, and its transcription storage unit 154 have been prepared as described above, English lip sync animation can be created in exactly the same way as Japanese lip sync animation is created in the third embodiment. The general visemes stored in character model storage unit 1004 are all expressed as linear sums of the Japanese standard visemes, but because those linear sums are obtained from English capture data, they can reproduce the face during English utterance well.

The above description concerned the creation of an English lip sync animation using the Japanese standard face model. As is clear from that description, however, the lip sync animation creating apparatus 1000 according to the third embodiment is not limited to that combination of languages. For any combination of languages, provided that capture data representing the three-dimensional positions of the speaker's face image during utterance is available, the lip sync animation creating apparatus 1000 can be applied in exactly the same way to create a lip sync animation.

The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing an outline of the animation creation process 30 performed by the animation creating apparatus according to the first embodiment of the present invention.
FIG. 2 is a diagram showing the face images corresponding to the visual elements used in the first embodiment of the present invention.
FIG. 3 is a diagram for explaining the concept of the blend rate.
FIG. 4 is a diagram showing an example of how a face image changes with blending.
FIG. 5 is a block diagram showing the schematic functional configuration of the lip sync animation creating apparatus 200 according to the first embodiment of the present invention.
FIG. 6 is a more detailed block diagram of the visual element sequence creation unit 230 of FIG. 5.
FIG. 7 is a diagram showing the images around the mouth among the visual elements corresponding to each phoneme.
FIG. 8 is a diagram for explaining the motion vectors between two visual elements.
FIG. 9 is a diagram showing an example of a face image after clustering.
FIG. 10 is a diagram showing another example of a face image after clustering.
FIG. 11 is a flowchart showing the control structure of a computer program that implements the key frame deletion unit 236 of FIG. 5.
FIG. 12 is a diagram for explaining the deletion of key frames.
FIG. 13 is a diagram for explaining the method of calculating the average utterance power.
FIG. 14 is a flowchart showing the control structure of a computer program that implements the blend rate adjustment unit 244 by utterance power of FIG. 5.
FIG. 15 is a flowchart showing the control structure of a computer program that implements the blend rate adjustment unit 250 by vertex speed of FIG. 5.
FIG. 16 is a diagram comparing the key frames generated under various conditions in the embodiment of the present invention with key frames specified manually.
FIG. 17 is a diagram comparing the animation obtained by one embodiment of the present invention with that obtained by a conventional method.
FIG. 18 is a diagram showing the hardware appearance of the computer system 550.
FIG. 19 is a block diagram of the computer system 550.
FIG. 20 is a schematic diagram for explaining the outline of the utterance end correction in the second embodiment of the present invention.
FIG. 21 is a schematic diagram showing the concept of frame rate conversion in the second embodiment of the present invention.
FIG. 22 is a schematic diagram for explaining the method of determining the visual element assigned to each key frame after frame rate conversion in the second embodiment.
FIG. 23 is a schematic diagram for explaining the shape stabilization processing in the second embodiment.
FIG. 24 is a schematic block diagram of the lip sync animation creating apparatus 810 according to the second embodiment.
FIG. 25 is a flowchart of a computer program for implementing the utterance end correction unit 822 shown in FIG. 24.
FIG. 26 is a flowchart of a computer program for implementing the frame rate conversion unit 840 shown in FIG. 24.
FIG. 27 is a flowchart of a computer program for implementing the shape stabilization processing unit 842 shown in FIG. 24.
FIG. 28 is a schematic block diagram of the lip sync animation creating apparatus 1000 according to the third embodiment.
FIG. 29 is a more detailed diagram of the preparation of the 3D character model stored in the character model storage unit 1004 of FIG. 28.
FIG. 30 is a detailed diagram of the creation of an English animation.

Explanation of symbols

40 Speaker
42 Audio signal
44 Script
50-58 Phonemes
60-68, 80 Face images
152 Utterance storage unit
154 Transcription storage unit
156, 1004 Character model storage unit
170 Acoustic model storage unit
172 Phoneme segmentation unit
174 Phoneme sequence storage unit
176, 1002 Phoneme-visual element mapping table storage unit
178 Phoneme-visual element conversion processing unit
180, 254 Visual element sequence storage unit
182 Animation creation unit
200, 810, 1000 Lip sync animation creating apparatus
202 Cluster processing designation unit
204 Utterance power use instruction input unit
230 Visual element sequence creation unit
232 Clustering processing unit
234 Clustered face model storage unit
236 Key frame deletion unit
238 Utterance power calculation unit
240 Utterance power storage unit
244 Blend rate adjustment unit by utterance power
250 Blend rate adjustment unit by vertex speed
256 Blend processing unit
260 Face image animation
610, 650, 670, 672 Key frame sequences
620, 622, 624, 626, 680, 682, 684, 686, 688, 690, 700, 702, 704, 706, 708, 710, 712, 714, 716, 790, 792 Key frames
822 Utterance end correction unit
840 Frame rate conversion unit
842 Shape stabilization processing unit
1006 Capture data storage unit
1008 Standard visual element model storage unit
1010 Coefficient calculation unit
1012 Character model synthesis unit

Claims

1. A lip sync animation creation device for creating a lip sync animation from input utterance data using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visual elements prepared in advance, and face models of a plurality of face images prepared in advance in correspondence with the visual elements, the device comprising:
visual element sequence creation means for obtaining, using the statistical acoustic model, the mapping definition, and a transcription of the utterance data, the phonemes contained in the utterance data and the corresponding visual elements, and creating a visual element sequence with durations to which a default blend rate is assigned, wherein a key frame is defined at a predetermined position within the duration of each visual element of the visual element sequence, and a key frame sequence is defined by the key frames so defined;
key frame deletion means for deleting a predetermined percentage of the key frames in the key frame sequence, in descending order of the speed of change, between adjacent key frames, of the face models corresponding to their visual elements; and
blend processing means for creating an animation of a face image by blending between key frames on the basis of the key frame sequence from which some key frames have been deleted by the key frame deletion means.
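The key frame deletion recited above can be sketched as follows. This is only an illustration under stated assumptions: a face model is a list of 3D points, a key frame is a `(time, visual element)` pair, the speed of change is taken as the mean point displacement between the face models of a key frame and its predecessor divided by the time between them, and all speeds are computed once before deletion. None of these choices is fixed by the claim, and every name in the sketch is hypothetical.

```python
import math


def mean_displacement(model_a, model_b):
    """Mean Euclidean distance between corresponding points of two face models."""
    return sum(math.dist(p, q) for p, q in zip(model_a, model_b)) / len(model_a)


def delete_fast_key_frames(key_frames, face_models, fraction):
    """Drop the given fraction of key frames, fastest transitions first.

    key_frames: list of (time_in_seconds, visual_element), sorted by time.
    face_models: dict mapping a visual element label to a list of 3D points.
    """
    speeds = []
    for i in range(1, len(key_frames)):
        (t0, v0), (t1, v1) = key_frames[i - 1], key_frames[i]
        speed = mean_displacement(face_models[v0], face_models[v1]) / (t1 - t0)
        speeds.append((speed, i))
    n_delete = int(len(key_frames) * fraction)
    doomed = {i for _, i in sorted(speeds, reverse=True)[:n_delete]}
    return [kf for i, kf in enumerate(key_frames) if i not in doomed]


# Hypothetical one-point face models and a short key frame sequence.
face_models = {
    "a": [(0.0, 0.0, 0.0)],
    "i": [(0.0, 1.0, 0.0)],
    "u": [(0.0, 0.2, 0.0)],
}
key_frames = [(0.0, "a"), (0.1, "i"), (0.5, "u"), (0.6, "a")]
thinned = delete_fast_key_frames(key_frames, face_models, fraction=0.25)
```

In this toy sequence the jump from "a" to "i" within 0.1 seconds is by far the fastest transition, so that key frame is the one removed.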

2. The lip sync animation creation device according to claim 1, wherein the key frame deletion means includes means for deleting a predetermined percentage of the key frames in the key frame sequence, in descending order of the speed of change between the feature points constituting the face model corresponding to the visual element of a key frame and the corresponding feature points constituting the face model corresponding to the visual element of the adjacent key frame.

3. The lip sync animation creation device according to claim 1, further comprising:
motion vector calculating means for calculating, for every combination of two face models selected from the plurality of face models, motion vectors between the feature points constituting those face models;
means for creating a clustered face model by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculating means, and calculating a representative vector for each cluster; and
clustered face model storage means for storing the clustered face model,
wherein the key frame deletion means includes:
movement amount calculating means for reading from the clustered face model storage means, for each key frame in the key frame sequence, the clustered face model corresponding to the combination of the visual element of that key frame and the visual element of the adjacent key frame, and calculating the speed of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster; and
means for deleting a predetermined percentage of key frames from the key frame sequence in descending order of the speed of change calculated by the movement amount calculating means.

4. The lip sync animation creation device according to claim 1, further comprising:
utterance power calculation means for receiving the key frame sequence from which some key frames have been deleted by the key frame deletion means, and calculating from the utterance data the utterance power of the phoneme corresponding to the visual element of each key frame in the key frame sequence; and
blend rate adjusting means by utterance power for adjusting, for each key frame in the key frame sequence, the blend rate with a predetermined function such that the smaller the average utterance power calculated by the utterance power calculation means over the duration of the visual element containing that key frame, the smaller the blend rate,
wherein the blend processing means creates the animation of the face image by blending between key frames on the basis of the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by utterance power.
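The adjustment by utterance power can be illustrated with a small sketch. The claim only requires a predetermined function under which a smaller average utterance power gives a smaller blend rate; the linear scaling against a reference power used here, the per-frame power list, and all names are assumptions made for illustration.

```python
def average_power(power, start, end):
    """Average utterance power over the frames [start, end) of one visual element."""
    window = power[start:end]
    if not window:
        raise ValueError("duration must cover at least one frame")
    return sum(window) / len(window)


def adjust_blend_rate_by_power(blend_rate, avg_power, reference_power):
    """Scale the blend rate down in proportion to the average utterance power.

    The function is monotone: a smaller avg_power never yields a larger rate.
    Powers at or above reference_power leave the blend rate unchanged.
    """
    if reference_power <= 0:
        raise ValueError("reference_power must be positive")
    return blend_rate * min(1.0, avg_power / reference_power)


# Hypothetical per-frame utterance power and two visual element durations.
power = [0.1, 0.2, 0.3, 0.8, 1.0, 0.9]
durations = [(0, 3), (3, 6)]
averages = [average_power(power, s, e) for s, e in durations]
reference = max(averages)
rates = [adjust_blend_rate_by_power(1.0, a, reference) for a in averages]
```

With this choice a quietly spoken visual element opens the mouth less than a loud one, which is the behavior the claim describes.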

5. The lip sync animation creation device according to claim 1, further comprising:
change speed calculating means for receiving the key frame sequence from which some key frames have been deleted by the key frame deletion means, and calculating, for each key frame, the speed of change between the vertices constituting the face model corresponding to the visual element of that key frame and the vertices constituting the face model corresponding to the visual element of the adjacent key frame; and
blend rate adjusting means by vertex speed for updating, with a predetermined function, the blend rate of each key frame in that key frame sequence whose speed of change calculated by the change speed calculating means exceeds a predetermined threshold, such that the blend rate takes a smaller value,
wherein the blend processing means creates the animation of the face image by blending between key frames on the basis of the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by vertex speed.
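The adjustment by vertex speed can be sketched in the same spirit. The claim leaves the predetermined function open; the sketch assumes, purely for illustration, that a blend rate is multiplied by `threshold / speed` whenever the speed of change exceeds the threshold, so that the effective speed is capped at the threshold. The names and the sample values are hypothetical.

```python
def adjust_blend_rate_by_speed(blend_rate, speed, threshold):
    """Reduce the blend rate only for key frames that change faster than threshold."""
    if speed <= threshold:
        return blend_rate
    # Assumed function: shrink the rate so the effective speed equals the threshold.
    return blend_rate * threshold / speed


# Hypothetical speeds of change for three key frames, all with default rate 1.0.
speeds = [2.0, 10.0, 4.0]
adjusted = [adjust_blend_rate_by_speed(1.0, s, threshold=5.0) for s in speeds]
```

Only the second key frame exceeds the threshold, so only its blend rate is lowered; the others keep the default value.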

6. The lip sync animation creation device according to claim 1, further comprising:
motion vector calculating means for calculating, for every combination of two face models selected from the plurality of face models, motion vectors between the feature points constituting those face models;
means for creating a clustered face model by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculating means, and calculating a representative vector for each cluster;
clustered face model storage means for storing the clustered face model;
change speed calculating means for receiving the key frame sequence from which some key frames have been deleted by the key frame deletion means, reading from the clustered face model storage means, for each key frame, the clustered face model corresponding to the combination of the visual element of that key frame and the visual element of the adjacent key frame, and calculating the speed of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster; and
blend rate adjusting means by vertex speed for updating, with a predetermined function, the blend rate of each key frame in that key frame sequence whose speed of change calculated by the change speed calculating means exceeds a predetermined threshold, such that the blend rate takes a smaller value,
wherein the blend processing means creates the animation of the face image by blending between key frames on the basis of the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by vertex speed.

7. A lip sync animation creation device for creating a lip sync animation from input utterance data using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visual elements prepared in advance, and face models of a plurality of face images prepared in advance, a transcription of the utterance data being available, the device comprising:
visual element sequence creation means for obtaining, using the statistical acoustic model, the mapping definition, and the transcription, the phoneme sequence contained in the utterance data and the corresponding visual elements, and creating a visual element sequence with durations to which a default blend rate is assigned,
wherein a key frame is defined at a predetermined position within the duration of each visual element of the visual element sequence, and a key frame sequence is defined by the key frames so defined;
utterance power calculation means for calculating from the utterance data the utterance power of the phoneme corresponding to the visual element of each key frame in the key frame sequence;
blend rate adjusting means by utterance power for adjusting, for each key frame in the key frame sequence, the blend rate with a predetermined function such that the smaller the average utterance power calculated by the utterance power calculation means over the duration of the visual element containing that key frame, the smaller the blend rate; and
blend processing means for creating an animation of a face image by blending between key frames on the basis of the visual element sequence whose blend rates have been adjusted by the blend rate adjusting means.

8. The lip sync animation creation device according to claim 7, further comprising:
change speed calculating means for receiving the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by utterance power, and calculating, for each key frame included in that key frame sequence, the speed of change between the vertices constituting the face model corresponding to the visual element of that key frame and the vertices constituting the face model corresponding to the visual element of the adjacent key frame; and
blend rate adjusting means by vertex speed for updating, with a predetermined function, the blend rate of each key frame in that key frame sequence whose speed of change calculated by the change speed calculating means exceeds a predetermined threshold, such that the blend rate takes a smaller value,
wherein the blend processing means creates the animation of the face image by blending between key frames on the basis of the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by vertex speed.

9. The lip sync animation creation device according to claim 7, further comprising:
motion vector calculating means for calculating, for every combination of two face models selected from the plurality of face models, motion vectors between the feature points constituting those face models;
means for creating a clustered face model by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculating means, and calculating a representative vector for each cluster;
clustered face model storage means for storing the clustered face model;
change speed calculating means for receiving the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by utterance power, reading from the clustered face model storage means, for each key frame, the clustered face model corresponding to the combination of the visual element of that key frame and the visual element of the adjacent key frame, and calculating the speed of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster; and
blend rate adjusting means by vertex speed for updating, with a predetermined function, the blend rate of each key frame in that key frame sequence whose speed of change calculated by the change speed calculating means exceeds a predetermined threshold, such that the blend rate takes a smaller value,
wherein the blend processing means creates the animation of the face image by blending between key frames on the basis of the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by vertex speed.

10. A lip sync animation creation device for creating a lip sync animation from input utterance data using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visual elements prepared in advance, and face models of a plurality of face images prepared in advance, a transcription of the utterance data being available, the device comprising:
visual element sequence creation means for obtaining, using the statistical acoustic model, the mapping definition, and the transcription, the phoneme sequence contained in the utterance data and the corresponding visual elements, and creating a visual element sequence with durations to which a default blend rate is assigned,
wherein a key frame is defined within the duration of each visual element in the visual element sequence, and a key frame sequence is defined by these key frames;
change speed calculating means for calculating, for each key frame included in the key frame sequence, the speed of change between the vertices constituting the face model corresponding to the visual element of that key frame and the vertices constituting the face model corresponding to the visual element of the adjacent key frame;
blend rate adjusting means by vertex speed for updating, with a predetermined function, the blend rate of each key frame in the key frame sequence whose speed of change calculated by the change speed calculating means exceeds a predetermined threshold, such that the blend rate takes a smaller value; and
blend processing means for creating an animation of a face image by blending between key frames on the basis of the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by vertex speed.

11. A lip sync animation creation device for creating a lip sync animation from input utterance data using a statistical acoustic model prepared in advance, a mapping definition between phonemes and visual elements prepared in advance, and face models of a plurality of face images prepared in advance, a transcription of the utterance data being available, the device comprising:
motion vector calculating means for calculating, for every combination of two face models selected from the plurality of face models, motion vectors between the feature points constituting those face models;
means for creating a clustered face model by clustering the feature points of the two face models with a predetermined clustering method applied to the motion vectors calculated by the motion vector calculating means, and calculating a representative vector for each cluster;
clustered face model storage means for storing the clustered face model;
key frame sequence creating means for obtaining, using the statistical acoustic model, the mapping definition, and the transcription, the phoneme sequence contained in the utterance data and the corresponding visual elements, and creating a key frame sequence with durations to which a default blend rate is assigned,
wherein a key frame is defined within the duration of each visual element in the visual element sequence, and the key frame sequence is defined by these key frames;
change speed calculating means for receiving the key frame sequence, reading from the clustered face model storage means, for each key frame, the clustered face model corresponding to the combination of the visual element of that key frame and the visual element of the adjacent key frame, and calculating the speed of change between key frames of the feature points belonging to each cluster using the representative vector of that cluster;
blend rate adjusting means by vertex speed for updating, with a predetermined function, the blend rate of each key frame in the key frame sequence whose speed of change calculated by the change speed calculating means exceeds a predetermined threshold, such that the blend rate takes a smaller value; and
blend processing means for creating an animation of a face image by blending between key frames on the basis of the key frame sequence whose blend rates have been adjusted by the blend rate adjusting means by vertex speed.

12. The lip sync animation creation device according to claim 1, further comprising utterance end correction means for correcting the utterance end position, for the key frame immediately preceding a key frame to which the visual element corresponding to the blank phoneme is assigned among the key frames included in the key frame sequence output by the visual element sequence creation means, by moving the end position of the duration of that preceding key frame to a position that is after the maximum point of the utterance power sequence of the utterance data within that key frame and that lies within the duration of that key frame,
wherein the key frame deletion means receives as input the key frame sequence whose utterance end has been corrected by the utterance end correction means.

13. The lip sync animation creation device according to claim 12, wherein the utterance end correction means includes:
means for detecting a first time at which the utterance power takes its maximum value within the key frame to be processed, namely the key frame immediately preceding a key frame to which the visual element corresponding to the blank phoneme is assigned, among the key frames included in the key frame sequence output by the visual element sequence creation means;
means for detecting a second time, after the first time and before the end time of the key frame to be processed, at which the utterance power has decreased from its maximum value by a predetermined rate; and
means for correcting the key frame to be processed so that its end position moves to the second time.
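The two-step detection in the utterance end correction can be sketched as follows. The sketch assumes a per-frame utterance power list and interprets "decreased by a predetermined rate" as the power having fallen to a given ratio of its maximum; that interpretation, the default ratio, the fallback of keeping the original end when no such frame exists, and the names are assumptions made for illustration.

```python
def corrected_end(power, start, end, ratio=0.5):
    """Return the corrected end frame for the key frame whose duration is [start, end).

    power: per-frame utterance power of the whole utterance.
    ratio: assumed fraction of the peak power that marks the end of the utterance.
    """
    segment = power[start:end]
    peak = max(segment)
    first_time = start + segment.index(peak)      # frame of maximum power
    for t in range(first_time + 1, end):
        if power[t] <= peak * ratio:              # power has decayed enough
            return t                              # second time: new end position
    return end                                    # no decay found: keep the end


# Hypothetical power envelope of the last voiced key frame before a blank phoneme.
power = [0.2, 0.6, 1.0, 0.7, 0.4, 0.1, 0.05, 0.0]
new_end = corrected_end(power, start=0, end=8)
```

Moving the end earlier in this way lets the mouth begin to close when the voice actually fades rather than at the label boundary produced by segmentation.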

14. The lip sync animation creation device according to claim 13, wherein the key frame creation means, when creating the key frame sequence, selects one of the frames at a first frame rate as each key frame,
the lip sync animation creation device further comprises frame rate conversion means, connected to receive an input designating a second frame rate lower than the first frame rate and the key frame sequence output by the key frame deletion means, for converting the key frame sequence output by the key frame deletion means into a key frame sequence at the second frame rate,
the frame rate conversion means assigns, to each key frame of the key frame sequence at the second frame rate, one of the visual elements assigned to those key frames in the key frame sequence output by the key frame deletion means whose start lies within the duration of that key frame, and
the blend processing means includes means for creating the animation of the face image by blending between key frames on the basis of the key frame sequence whose frame rate has been converted by the frame rate conversion means.

15. The lip sync animation creation device according to claim 14, wherein the frame rate conversion means assigns visual elements such that the visual element assigned to each key frame of the key frame sequence at the second frame rate differs from the visual element assigned to the immediately preceding key frame.
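The frame rate conversion and the rule that consecutive key frames receive different visual elements can be illustrated together. The sketch assumes key frames are given as frame indices at the first frame rate, that the first frame rate is an integer multiple of the second, that the first suitable candidate is chosen, and that a slot is skipped when every candidate repeats the previous visual element. These are illustrative choices, not details taken from the claims.

```python
def convert_frame_rate(key_frames, first_fps, second_fps):
    """Re-sample a key frame sequence onto a coarser frame grid.

    key_frames: list of (frame_index_at_first_fps, visual_element), sorted.
    Returns a list of (frame_index_at_second_fps, visual_element).
    """
    if first_fps % second_fps != 0:
        raise ValueError("sketch assumes first_fps is a multiple of second_fps")
    step = first_fps // second_fps
    # Group the original key frames by the coarse frame that contains their start.
    slots = {}
    for frame, viseme in key_frames:
        slots.setdefault(frame // step, []).append(viseme)
    converted = []
    previous = None
    for slot in sorted(slots):
        # Pick the first candidate that differs from the previous key frame.
        choice = next((v for v in slots[slot] if v != previous), None)
        if choice is None:
            continue  # every candidate repeats the previous visual element
        converted.append((slot, choice))
        previous = choice
    return converted


# Hypothetical key frames at 60 frames per second, converted to 15.
key_frames = [(0, "a"), (4, "a"), (6, "u"), (9, "u"), (13, "i")]
converted = convert_frame_rate(key_frames, first_fps=60, second_fps=15)
```

In the example the second coarse frame contains both "a" and "u"; because the preceding key frame already shows "a", the sketch selects "u".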

16. The lip sync animation creation device according to claim 14, wherein the blend processing means has a function of creating, when creating an animation from the key frame sequence at the second frame rate, an image for each frame at a third frame rate higher than the second frame rate, and a function of generating the images of the frames between adjacent key frames by interpolation between those key frames, and
the lip sync animation creation device further comprises key frame copy means for copying, for each key frame in the key frame sequence at the second frame rate output from the frame rate conversion means, the same key frame as that key frame to a frame position between that key frame and the key frame immediately following it.

17. The lip sync animation creation device according to claim 16, wherein the key frame copy means includes means for copying, for each key frame in the key frame sequence at the second frame rate output from the frame rate conversion means, the same key frame as that key frame to the frame position immediately before the key frame that immediately follows it.
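The key frame copying recited above can be sketched briefly. The sketch assumes key frames are `(frame index, visual element)` pairs expressed on the output frame grid and that a copy is added only when there is at least one free frame before the next key frame; the function name and data layout are hypothetical.

```python
def copy_key_frames(key_frames):
    """Insert a copy of each key frame just before the key frame that follows it.

    key_frames: list of (frame_index, visual_element), sorted by frame index.
    """
    result = []
    for i, (frame, viseme) in enumerate(key_frames):
        result.append((frame, viseme))
        if i + 1 < len(key_frames):
            hold = key_frames[i + 1][0] - 1       # frame just before the next key
            if hold > frame:
                result.append((hold, viseme))     # hold the same shape until then
    return result


# Hypothetical key frames on the output frame grid.
stabilized = copy_key_frames([(0, "a"), (4, "u"), (12, "i")])
```

Because interpolation between two identical key frames leaves the shape unchanged, each mouth shape is held steady and the transition to the next shape is confined to the last frame interval.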

18. The lip sync animation creation device according to claim 1, further comprising face model storage means for storing the face models of the plurality of face images.

19. The lip sync animation creation device according to claim 18, wherein the phonemes prepared in advance include predetermined standard phonemes and general phonemes other than the standard phonemes,
the face models of the plurality of face images include standard visual element models, which are face models corresponding to the standard phonemes, and general visual element models, which are face models corresponding to the general phonemes, and
the lip sync animation creation device further comprises general visual element generation means for generating the general visual element models using the standard visual element models and capture data, the capture data being classified in advance by the phonemes prepared in advance and consisting of measured three-dimensional positions of feature points of a speaker's face image during utterance of the corresponding phoneme.

20. The lip sync animation creation device according to claim 19, wherein the general visual element generation means includes:
coefficient calculation means for calculating coefficients, equal in number to the standard phonemes, for approximating the capture data corresponding to a general phoneme by a linear sum of the capture data corresponding to the standard phonemes, such that the approximation error is minimized; and
linear sum calculation means for calculating each general visual element model as a linear sum of the standard visual element models using the coefficients calculated by the coefficient calculation means for the general phoneme corresponding to that general visual element model, and storing it in the face model storage means in association with the corresponding general phoneme.

21. A computer program that, when executed by a computer, causes the computer to function as the lip sync animation creation device according to any one of claims 1 to 20.