JP2007299300A - Animation creating device

Info

Publication number: JP2007299300A
Application number: JP2006128110A
Authority: JP
Inventors: Tatsuo Shikura; 達夫四倉; Shinichi Kawamoto; 真一川本; Satoru Nakamura; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2006-05-02
Filing date: 2006-05-02
Publication date: 2007-11-15
Anticipated expiration: 2026-05-02
Also published as: JP4631077B2

Abstract

PROBLEM TO BE SOLVED: To provide an animation creating device capable of creating a moving image in which a portion of a shape is properly changed in accordance with voice without depending on a speaker.

SOLUTION: An animation creation system 80 includes a key frame data creation unit 96 for receiving a voice signal and creating key frame data representing a key frame image comprising face images at a prescribed key frame time during the continuous time of each phoneme in a phonemic sequence representing the voice signal, and an animation reproduction unit 98 for generating an animation composed of a series of images that changes in synchronization with the voice signal on the basis of the key frame data created by the key frame data creation unit 96.

COPYRIGHT: (C)2008, JPO & INPIT

Description

The present invention relates to an animation creating apparatus that creates an animation from speech, and more particularly to an apparatus that automatically generates an animation, such as a face image in which the shape of the mouth or the like changes in accordance with uttered speech.

With the development of computer technology, work that was formerly done mostly by hand is increasingly being replaced by computer-based work. A typical example is the creation of animation.

Formerly, animation was generally created as follows. The animation director decides on the characters and creates rough drawings of the main scenes, called storyboards. Based on these storyboards, workers called animators draw the picture for each frame of the animation. Finishing staff then turn these pictures into cels. Photographing the cels onto film in order and playing the film back at a predetermined frame rate yields the picture portion of the animation.

While this animation footage is played back, voice actors add the dialogue based on the animation script. This is so-called "post-recording".

The most labor-intensive part of this work is the creation of the cels. However, when the original drawings are created with CG (computer graphics), processing them into cels is a relatively simple task, and there is no need to photograph them one by one. This part has therefore been largely computerized, together with the shift of the original drawings to CG.

Among the remaining tasks, a relatively difficult one is post-recording. Because the dialogue must be spoken in time with the movement of the animation and in a voice suited to the situation, post-recording takes a fair amount of time and requires skill.

Therefore, a method that reverses post-recording was devised: the speech is recorded first, and the animation is created to match it. This is called "presco" (prescoring) or "pre-reco" (pre-recording), hereinafter referred to collectively as "prescoring". This technique was originally used in the United States and elsewhere for creating animation by hand. Creating an animation with this technique involves the following procedure.

First, the characters that appear in the animation are decided, and storyboards are created as before. Voice actors speak their lines based on the storyboards and the script, and the speech is recorded. The animation is then created to match this speech.

When animation creation by prescoring is to be implemented on a computer, the problem is how to create the animation automatically from the speech. In particular, it is difficult to generate natural mouth movements for an animated character, such as a person, that match the pre-recorded speech of a voice actor, and a method for doing this automatically is needed.

One method proposed for this purpose is described in Patent Document 1. In that method, a plurality of basic mouth-shape patterns are prepared in advance, and the mouth shape corresponding to an arbitrary speech sound is obtained as a weighted sum of these basic patterns. To this end, a conversion function that converts predetermined features of the voice actor's speech into weighting parameters for the basic patterns is obtained in advance by multiple regression analysis. The predetermined features of the voice actor's speech recorded according to the script are converted into weighting parameters by this conversion function, and the weighted sum of the basic mouth-shape patterns is calculated with those parameters, thereby creating the mouth shape and face image corresponding to the voice actor's speech. Performing this processing at the time corresponding to each frame of the animation produces the frame sequence of the animation.
Patent Document 1: Japanese Patent Laid-Open No. 7-44727

Today, the volume of communication involving moving images, such as teleconferencing and videophone calls, is increasing, so how to reduce the amount of moving-image data has become a problem. One approach is to transmit only the speech and to synthesize the face image from that speech on the receiving side. For such a technique to come into general use, a technique is needed that appropriately generates mouth images even for the speech of an unspecified large number of people.

As a service that uses the animation creation described above, one can also envisage, for example, a service that creates an animation using the face image of a specific character to match the speech of an unspecified large number of speakers. Such a service must appropriately generate the mouth movements of the face image from the speech of unspecified speakers.

However, the technique disclosed in Patent Document 1 requires the conversion function to be obtained in advance. It is therefore effective for a specific speaker but cannot be applied to an unspecified large number of speakers, because the acoustic features obtained from speech vary from speaker to speaker even when the same phoneme is uttered.

SUMMARY OF THE INVENTION

Accordingly, an object of the present invention is to provide an animation creating apparatus that can generate a moving image in which the shape of a part changes appropriately in accordance with speech, independently of the speaker.

An animation creating apparatus according to a first aspect of the present invention includes: key frame data creation means for receiving a speech signal and creating key frame data representing key frame images, each consisting of the image at a predetermined key frame time within the duration of each phoneme in the phoneme string represented by the speech signal; and animation generation means for generating, based on the key frame data created by the key frame data creation means, an animation consisting of a series of images that change in synchronization with the speech signal.

The key frame data creation means receives a speech signal and creates key frame data representing key frame images, each consisting of the image at a predetermined key frame time within the duration of each phoneme in the phoneme string represented by the speech signal. The animation generation means generates, based on that key frame data, an animation consisting of a series of images that change in synchronization with the speech signal. Key frame data is thus created from the speech signal, and the animation is generated from the key frame data. The key frame data is determined independently of the speaker of the speech signal. It is therefore possible to provide an animation creating apparatus that can appropriately generate a moving image from speech, independently of the speaker.

Preferably, the predetermined key frame time is the start time of the duration of each phoneme in the phoneme string.

The characteristic mouth shape for a phoneme appears most clearly at the beginning of its pronunciation. Setting the predetermined key frame time to the start time of the phoneme's duration therefore yields a moving image that appropriately reflects the changes in the speech.
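As a rough illustration of this preferred timing, the start time of each phoneme can be accumulated from the per-phoneme durations. The following Python sketch assumes durations given in milliseconds; the function name `key_frame_times` and the sample values are hypothetical and are not taken from the specification.

```python
from itertools import accumulate

def key_frame_times(durations_ms):
    """Return the start time (ms) of each phoneme from its duration (ms)."""
    # The start of phoneme k is the sum of the durations of phonemes 0..k-1.
    return list(accumulate(durations_ms, initial=0))[:-1]

# Hypothetical phoneme string and durations, for illustration only.
phonemes = ["k", "o", "N", "n", "i"]
durations = [60, 110, 90, 50, 120]
for phoneme, start in zip(phonemes, key_frame_times(durations)):
    print(f"{phoneme}: key frame at {start} ms")
```

Each key frame then sits at the onset of its phoneme, where the mouth shape is said to be most characteristic.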

More preferably, the animation creating apparatus further includes text selection means for allowing a user to select one of a plurality of predetermined texts, and means for recording the user's speech in response to selection of a text by the text selection means, converting it into a speech signal, and supplying it to the key frame data creation means together with the selected text. The means for creating key frame data includes: mapping data storage means for storing mapping data that maps each phoneme to one of a predetermined plurality of images including a predetermined reference image; phoneme segmentation means for receiving the speech signal and the text, performing phoneme segmentation of the speech signal based on the text, and outputting phoneme string data that includes the resulting phoneme string and time information indicating the duration of each phoneme; and key frame data creation means for creating and outputting key frame data by attaching, to each phoneme included in the phoneme string data output from the phoneme segmentation means, an identifier specifying the image to which the phoneme is mapped and a blend ratio determined in accordance with a predetermined feature of the phoneme, with reference to the time information of the phoneme and the mapping data.

Phoneme segmentation is performed based on the text selected by the text selection means. Because the phonemes that make up the speech signal are known in advance, phoneme segmentation can be performed correctly.

Still more preferably, the key frame data creation means includes: mapping processing means for attaching, to each phoneme included in the phoneme string data output from the phoneme segmentation means, the identifier of the image obtained by referring to the mapping data and a blend ratio consisting of a predetermined constant, and outputting image-mapped phoneme string data; and first blend ratio adjusting means for adjusting the blend ratio of each phoneme in the image-mapped phoneme string data output by the mapping processing means as a monotonically increasing function of the duration of that phoneme.

The duration of a phoneme reflects the degree of change in the shape of the mouth or the like when the phoneme is pronounced. Adjusting the blend ratio as a monotonically increasing function of the phoneme duration therefore yields an animation that appropriately reflects the actual change in the shape of the mouth or the like.

The key frame data creation means may further include second blend ratio adjusting means for adjusting the blend ratio of each phoneme in the blend-ratio-adjusted phoneme string data output by the first blend ratio adjusting means as a monotonically increasing function of the magnitude of the power within the duration of that phoneme.

The power during a phoneme's duration reflects how strongly the phoneme is pronounced, and hence the degree of change in the shape of the mouth or the like at that time. Adjusting the blend ratio as a monotonically increasing function of the power during the phoneme's duration therefore yields an animation that appropriately reflects the actual change in the shape of the mouth or the like.
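The two adjustment stages can be sketched as follows. The text here requires only that each adjustment be a monotonically increasing function; the clamped linear ramp, the mean-square power measure, and the constants `full_ms` and `full_power` are assumptions made for illustration, and all function names are hypothetical.

```python
def ramp(x, full):
    """Non-decreasing ramp: 0 for x <= 0, rising linearly to 1 at x >= full."""
    return min(max(x / full, 0.0), 1.0)

def mean_power(samples):
    """Mean squared amplitude of the samples within one phoneme."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0

def adjust_blend(blend, duration_ms, samples, full_ms=200.0, full_power=0.25):
    """Adjust a blend ratio (0-100 %) by phoneme duration, then by power."""
    # First stage: a longer phoneme keeps more of its blend ratio.
    blend = blend * ramp(duration_ms, full_ms)
    # Second stage: a more strongly uttered phoneme keeps more of its blend ratio.
    return blend * ramp(mean_power(samples), full_power)

# A 100 ms phoneme at full power keeps half of a 100 % blend ratio.
print(adjust_blend(100.0, 100.0, [0.5, -0.5]))
```

Chaining the stages means a phoneme keeps its full blend ratio only when it is both long enough and loud enough.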

Preferably, the animation generation means includes time determination means for determining a generation time at which an image of the animation is to be generated, in relation to the recording time of the speech, and interpolation means for calculating the image of the frame at the generation time determined by the time determination means by interpolation between the images of a plurality of key frames that bracket the generation time.

The interpolation means generates the image of the frame at a given generation time by interpolation between the images of a plurality of key frames that bracket that time. The shape of the mouth or the like at a given time is a shape partway through the transition from the preceding phoneme to the next phoneme. Calculating the shape at the generation time by interpolation in this way produces an animation of appropriate images that follows the phoneme transitions.

More preferably, the interpolation means includes means for calculating the image of the frame at the generation time determined by the time determination means by interpolation between the images of the two adjacent key frames that bracket the generation time.

Interpolation is performed between the two adjacent key frames that bracket the generation time to generate the image of the frame at the generation time. Appropriate interpolation is achieved with little computation, and a smoothly changing animation is obtained.

Still more preferably, the means for calculating includes: first interpolation means for interpolating a first blend ratio at the generation time from the blend ratio at the first key frame, using a first interpolation function that is 100% at the first key frame and 0% at the second key frame, where the first and second key frames are the adjacent key frames that bracket the generation time; second interpolation means for interpolating a second blend ratio at the generation time from the blend ratio at the second key frame, using a second interpolation function that is 0% at the first key frame and 100% at the second key frame; and means for calculating the image data at the generation time as a weighted sum, using the first and second blend ratios, of the data of the image mapped to the first key frame and the data of the image mapped to the second key frame.

The blend ratio for the first key frame and the blend ratio for the second key frame are calculated separately with the first and second interpolation functions, and are then used to calculate a weighted sum of the data of the image mapped to the first key frame and the data of the image mapped to the second key frame. By combining simple calculations, a smooth animation of the images between the two key frames can be calculated.
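A minimal sketch of this two-key-frame interpolation is given below. It assumes linear interpolation functions (the text fixes only their values at the two key frames) and assumes that image data is a flat list of feature-point displacements from the reference image; `interpolate_frame` and its parameter names are hypothetical.

```python
def interpolate_frame(t, t1, t2, blend1, blend2, image1, image2):
    """Image data at time t between key frames at t1 and t2 (t1 < t2).

    blend1 and blend2 are the key frame blend ratios in percent;
    image1 and image2 are equal-length lists of displacement values.
    """
    if t2 <= t1:
        raise ValueError("key frame times must satisfy t1 < t2")
    u = (t - t1) / (t2 - t1)          # 0 at the first key frame, 1 at the second
    w1 = blend1 * (1.0 - u) / 100.0   # first function: 100 % at t1, 0 % at t2
    w2 = blend2 * u / 100.0           # second function: 0 % at t1, 100 % at t2
    # Weighted sum of the two mapped images.
    return [w1 * a + w2 * b for a, b in zip(image1, image2)]

# Halfway between two key frames, each image contributes half of its displacement.
print(interpolate_frame(50, 0, 100, 100, 100, [2.0, 0.0], [0.0, 4.0]))
```

Because each weight is a product of a key frame blend ratio and a simple ramp, the per-frame cost is one multiply-add per displacement value.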

The time determination means may include means for determining the time at which the image of one frame has been obtained by the interpolation means as the generation time for generating the image of the next frame.

When the interpolation means finishes generating an image, that time is determined as the next generation time. Once the generation time is determined, the interpolation means calculates the image of the frame at that generation time by interpolation between the images of the key frames that bracket it. The interpolation means is thus always at work generating images without idling, and is used effectively.
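This back-to-back scheduling can be sketched as a loop in which the completion time of one frame becomes the generation time of the next. The sketch assumes that a clock measuring elapsed playback time is available; `render_loop`, `render_frame`, and the injectable `clock` parameter are hypothetical names introduced for illustration.

```python
import time

def render_loop(duration_s, render_frame, clock=time.monotonic):
    """Render frames back to back and return the generation times used."""
    start = clock()
    frame_times = []
    t = 0.0
    while t < duration_s:
        render_frame(t)          # interpolate and draw the frame for time t
        frame_times.append(t)
        t = clock() - start      # completion time becomes the next generation time
    return frame_times

# Demonstration with fake clock readings instead of real elapsed time.
ticks = iter([0.0, 0.3, 0.7, 1.2])
print(render_loop(1.0, lambda t: None, clock=lambda: next(ticks)))
```

With this scheme the frame rate is not fixed in advance; it follows however long each frame takes to generate.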

The image may be a face image that reflects changes in the shape of the mouth during speech.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to function as each of the means constituting any of the animation creating apparatuses described above.

The present invention will now be described based on embodiments. The following description uses six basic face images, but the number of face images is not limited to six; it may be fewer or more.

[First Embodiment]
<Configuration>
FIG. 1 shows an outline of an animation creation process 30 performed by the animation creating apparatus according to the first embodiment of the present invention. Referring to FIG. 1, in the animation creation process 30, when a speaker 40 utters lines based on a script 44, a speech recognition device performs phoneme segmentation on the resulting speech signal 42 (that is, it generates from the utterance the phoneme string that constitutes it).

Face images 60 to 68, each including the mouth shape for pronouncing a main phoneme, are prepared in advance, and these face images are assigned to the phonemes 50 to 58 obtained by speech recognition to produce the animation.

Because simply assigning one speech image to each individual phoneme does not yield a smooth image sequence, this embodiment, as described later, prepares a total of six face images as the main face images: five face images for the five phonemes "a" (/a/), "i" (/i/), "u" (/u/), "e" (/e/), and "o" (/o/), and an expressionless face image. The five phonemes /a/ to /o/ are each assigned to their corresponding face image, and each of the remaining phonemes is assigned to one of the six face images. This is hereinafter called mapping from phonemes to face images.
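The mapping from phonemes to face images amounts to a lookup table. In the sketch below, only the five vowel entries follow from the text; the consonant and pause entries, the identifier `PHI` for the expressionless face, and the function name `map_phonemes` are hypothetical placeholders.

```python
# Identifiers of the six prepared face images; "PHI" stands for the expressionless face.
FACE_IDS = ("A", "I", "U", "E", "O", "PHI")

# The five vowels map to their own face images, as stated in the text.
# The consonant and pause entries are hypothetical placeholders.
PHONEME_TO_FACE = {
    "a": "A", "i": "I", "u": "U", "e": "E", "o": "O",
    "m": "PHI", "p": "PHI", "b": "PHI", "N": "PHI", "pau": "PHI",
}

def map_phonemes(phonemes):
    """Attach a face image identifier to each phoneme in a phoneme string."""
    return [(ph, PHONEME_TO_FACE[ph]) for ph in phonemes]

print(map_phonemes(["a", "m", "e"]))
```

A complete table would need one entry for every phoneme the segmentation step can output.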

Furthermore, if the face images mapped in this way were assigned to each phoneme and simply joined together to create the animation, the image would move excessively, resulting in a so-called "busy" animation. In this embodiment, therefore, the "strength" of each image is adjusted according to the duration of the phoneme and its power, and the adjusted images are used to generate, by interpolation, the face images in the transitions between phonemes. In addition, for a phoneme whose duration or power is smaller than a predetermined threshold, no image corresponding to that phoneme is inserted; the phoneme is instead merged into the image of the immediately preceding phoneme. In this way, a smoothly changing, natural animation can be generated to match the speech.
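The merging of short or weak phonemes can be sketched as a single pass over the segmented phonemes. The threshold values, the tuple layout, and the choice to extend the preceding phoneme's duration are assumptions made for illustration; `merge_weak_phonemes` is a hypothetical name.

```python
def merge_weak_phonemes(segments, min_ms=40, min_power=0.01):
    """Merge phonemes below either threshold into the preceding phoneme.

    segments is a list of (phoneme, duration_ms, power) tuples.
    """
    merged = []
    for phoneme, duration, power in segments:
        weak = duration < min_ms or power < min_power
        if weak and merged:
            # No key frame of its own: the previous image simply lasts longer.
            prev_phoneme, prev_duration, prev_power = merged[-1]
            merged[-1] = (prev_phoneme, prev_duration + duration, prev_power)
        else:
            # A weak phoneme with no predecessor is kept as it is.
            merged.append((phoneme, duration, power))
    return merged

segments = [("o", 120, 0.30), ("t", 20, 0.10), ("a", 100, 0.20), ("i", 90, 0.005)]
print(merge_weak_phonemes(segments))
```

In this example /t/ is too short and /i/ is too weak, so only /o/ and /a/ remain as key frames.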

FIG. 2 shows the schematic functional configuration of an animation generation system 80 according to this embodiment. The animation generation system 80 prepares a plurality of transcription texts in advance, has the speaker select and utter one of them, and generates, by interpolation from six face images prepared in advance, an animation of a face image that changes in accordance with the uttered speech.

Referring to FIG. 2, the animation generation system 80 includes: a text selection interface 90 that the speaker uses to select a transcription text; a microphone 92 that converts the speaker's speech into a speech signal; an input instruction unit 94 that stores a plurality of texts in advance, has the speaker select one of them through the text selection interface 90, and then records and digitizes the speech signal output by the microphone 92 to create a speech data file; and a key frame data creation unit 96 that performs phoneme segmentation on the speech data file supplied from the input instruction unit 94, using the corresponding transcription text supplied from the input instruction unit 94, and creates, based on the result and the speech data file from the input instruction unit 94, key frame data that defines the key frames of the animation.

The animation generation system 80 further includes: an animation reproduction unit 98 that uses the speech data file output by the input instruction unit 94 and the key frame data output by the key frame data creation unit 96 to create an animation of a face image whose mouth shape changes in synchronization with the speech in the speech data file, and outputs it together with the speech; and a monitor 102 for displaying the animation and a loudspeaker 100 for reproducing the speech, both connected to the animation reproduction unit 98.

The input instruction unit 94 includes: a text storage unit 110 for storing a plurality of transcription texts in advance; a text selection unit 112 that presents the texts stored in the text storage unit 110 to the speaker 40 through the text selection interface 90, has the speaker select one, supplies that text to the key frame data creation unit 96, and instructs the speaker, through the text selection interface 90, to utter the text; and a speech recording unit 114 that divides the speech signal output from the microphone 92 for the text uttered by the speaker in response to the instruction from the text selection unit 112 into frames with a predetermined frame length and frame shift length, stores the result as speech data, and supplies it to the key frame data creation unit 96 and the animation reproduction unit 98.
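Framing with a frame length and a frame shift length means cutting the sampled signal into fixed-length windows whose start positions advance by the shift, so that consecutive frames overlap when the shift is shorter than the length. The sketch below assumes lengths expressed in samples and drops any incomplete final frame; `frame_signal` is a hypothetical name.

```python
def frame_signal(samples, frame_len, frame_shift):
    """Split samples into frames of frame_len, starting every frame_shift samples."""
    last_start = len(samples) - frame_len
    # Only complete frames are kept; a trailing partial frame is dropped.
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_shift)]

# Ten samples, frame length 4, frame shift 2: four overlapping frames.
print(frame_signal(list(range(10)), 4, 2))
```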

The key frame data creation unit 96 includes: a speech recognition device 120 that performs phoneme segmentation on the speech data from the speech recording unit 114 based on the text supplied from the text selection unit 112, and outputs phoneme string data including the phoneme string and the duration of each phoneme; a mapping table storage unit 130 that stores a mapping table mapping every phoneme of Japanese to the identifier of one of the six face images described above; and a key frame data creation unit 136 that, based on the phoneme string file output from the speech recognition device 120, the text supplied from the text selection unit 112, and the speech data supplied from the speech recording unit 114, generates parameters for creating the face images of the frames at the main points of the animation from the six face images, and outputs them as key frame data.

The speech recognition device 120 may be any device that can perform phoneme segmentation and output a phoneme string together with time data from which the duration of each phoneme can be determined. Because the content of the utterance is known in advance, the speech recognition device 120 can reliably convert the speech data into a phoneme string.

FIG. 5 shows an example of the structure of the phoneme string file 160 output by the speech recognition device 120. Referring to FIG. 5, the phoneme string file 160 contains a plurality of pairs, each consisting of a phoneme obtained by speech recognition and time information from which the duration of that phoneme can be determined. In FIG. 5, durations are shown in milliseconds.

The animation reproduction unit 98 includes: a face data file storage unit 132 that stores a face data file holding, as wire-frame models, six face images, namely the face images corresponding to the five phonemes (/a/, /i/, /u/, /e/, /o/) and the expressionless face image; an interpolation function storage unit 134 for storing the interpolation functions used to create the face image of the frame at a given point in the animation from the six face images stored in the face data file storage unit 132, using the parameters created by the key frame data creation unit 136 for creating the face images at the key frames; and an animation generation unit 138 that generates, by interpolation, the face image of the frame at a given point in the animation, using the key frame data supplied from the key frame data creation unit 136, the face data file stored in the face data file storage unit 132, and the interpolation functions stored in the interpolation function storage unit 134.

FIG. 3 shows examples of the face images stored in the face data file storage unit 132. FIGS. 3(A) to 3(E) are the face images corresponding to the phonemes /a/, /i/, /u/, /e/, and /o/, respectively, and FIG. 3(F) is the expressionless face image. In this specification, these images are denoted as face images /A/, /I/, /U/, /E/, /O/, and /φ/, respectively.

In this embodiment, each of the face images /A/, /I/, /U/, /E/, and /O/ is defined relative to the face image /φ/: it is defined by three-dimensional vector information indicating how far each feature point has moved, in the three-dimensional space in which the face images are defined, from the corresponding feature point of the face image /φ/. An intermediate face image can therefore be defined between, for example, the face image /A/ and the face image /φ/. To define such an intermediate face image between a particular face image and the face image /φ/, this embodiment introduces the concept of a "blend rate". Taking the particular face image as 100% and the face image /φ/ as 0%, the blend rate expresses an intermediate face image as the proportion of the feature-point movement from the face image /φ/ to the particular face image. Accordingly, when the face images /A/, /I/, /U/, /E/, and /O/ are assigned to phonemes as they are, their blend rate is 100% in every case. A face image /A/ with a blend rate of 50% is a face image in which the movement of each feature point from the face image /φ/ is 50% of the movement of the corresponding feature point in the face image /A/. If the movement of each feature point is expressed as a vector starting at its position in the face image /φ/, a face image with a blend rate of B% is one in which the vector for each feature point has the same direction as the corresponding vector of the 100% face image and a length scaled down to B% of it.
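The blend rate described above amounts to scaling each feature point's displacement vector. The following is a minimal Python sketch of that idea; the function name `apply_blend_rate`, the tuple representation of a vector, and the sample displacement values are illustrative assumptions and do not come from the specification.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def apply_blend_rate(displacements: List[Vec3], blend_percent: float) -> List[Vec3]:
    """Scale each displacement (measured from face image /φ/) to blend_percent."""
    scale = blend_percent / 100.0
    # Direction is unchanged; only the length of each vector is reduced.
    return [(scale * dx, scale * dy, scale * dz) for dx, dy, dz in displacements]


# Hypothetical displacements of two feature points of face image /A/.
face_a = [(0.0, -2.0, 0.5), (1.0, 0.0, 0.0)]
half_a = apply_blend_rate(face_a, 50.0)   # face image /A/ at a 50% blend rate
neutral = apply_blend_rate(face_a, 0.0)   # coincides with face image /φ/
```

A blend rate of 0% yields zero displacement for every feature point, which is the face image /φ/ itself.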

Interpolation between two face images will be described later.

FIG. 4 shows part of an example of the mapping table stored in the mapping table storage unit 130. Referring to FIG. 4, in this embodiment the mapping table associates the phoneme /a/ with the face image /A/, the phoneme /b/ with the face image /φ/, the phoneme /d/ with the face image /U/, and the phoneme /e/ with the face image /E/. In the mapping table, a face image prepared in advance for a particular phoneme, such as the face images /A/, /I/, /U/, /E/, and /O/ shown in FIG. 3, must always be associated with that phoneme; otherwise the resulting facial animation will not match the content of the utterance. Phonemes for which the lips close, such as /b/ and /m/, are associated with the expressionless face image /φ/. Each of the remaining phonemes is assigned, as appropriate, to whichever of the six face images is considered to have the closest mouth shape.
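As an illustration, a mapping table of this kind could be held as a simple dictionary. The sketch below uses only the associations named in the text (/a/, /i/, /u/, /e/, /o/, /b/, /m/, /d/); the names `MAPPING_TABLE` and `map_phoneme`, and the fallback to /φ/ for a phoneme missing from the table, are assumptions made for this example.

```python
NEUTRAL = "φ"

# Face images prepared for a specific phoneme must map to that phoneme.
PREPARED = {"a": "A", "i": "I", "u": "U", "e": "E", "o": "O"}

MAPPING_TABLE = {
    **PREPARED,
    "b": NEUTRAL,  # lips closed
    "m": NEUTRAL,  # lips closed
    "d": "U",      # closest mouth shape according to FIG. 4
}


def map_phoneme(phoneme: str) -> str:
    """Return the face image identifier for a phoneme.

    Falling back to the expressionless face for unknown phonemes is an
    assumption of this sketch, not something the specification states.
    """
    return MAPPING_TABLE.get(phoneme, NEUTRAL)


faces = [map_phoneme(p) for p in ["a", "b", "d", "e"]]
```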

Referring again to FIG. 2, the animation reproduction unit 98 further includes: an audio file storage unit 140 for storing an audio file containing the audio data output by the audio recording unit 114; an output unit 142 for supplying the face images given sequentially from the animation generation unit 138 and the audio from the audio file stored in the audio file storage unit 140, synchronized with each other, to the monitor 102 and the speaker 100, respectively; and a sequence control unit 144 for operating, in a predetermined sequence, the text selection unit 112 and the audio recording unit 114 of the input instruction unit 94, the key frame data creation unit 136 and the speech recognition device 120 of the key frame data creation unit 96, and the animation generation unit 138 and the output unit 142 of the animation reproduction unit 98, and for controlling them so that they cooperate to realize the animation generation system.

FIG. 6 shows the configuration of the key frame data creation unit 136 of FIG. 2 in detail. Referring to FIG. 6, the key frame data creation unit 136 includes: a mapping processing unit 180 that maps a face image to each phoneme in the phoneme string data from the speech recognition device 120 by referring to the mapping table storage unit 130, and outputs each phoneme with the identifier of the mapped face image and a blend rate of "100%" attached; a duration-based blend rate adjustment unit 182 that adjusts, based on the duration of each phoneme, the blend rate of each phoneme in the phoneme string output by the mapping processing unit 180, in which each phoneme carries its duration, the identifier of the corresponding face image, and its blend rate; and a power-based blend rate adjustment unit 184 that adjusts, based on the magnitude of the power during each phoneme's duration, the blend rates of the phoneme string output by the duration-based blend rate adjustment unit 182, in which each phoneme carries its duration, the corresponding face image, and the duration-adjusted blend rate. The output of the power-based blend rate adjustment unit 184 is a phoneme string in which each phoneme carries its duration, the corresponding face image, and the blend rate adjusted by duration and power. This phoneme string is the key frame data. In this embodiment, a key frame is the frame created at the start time of each phoneme's duration.

FIG. 7 shows a more detailed block diagram of the animation generation unit 138. Referring to FIG. 7, the animation generation unit 138 includes: an interpolation processing unit 204 that, given the face images at two key frames, each assigned a predetermined blend rate, the times corresponding to those two key frames, and a time between the two key frames at which a frame of the animation is to be generated (called the "generation time" here for convenience, and expressed as a time relative to the times of the two key frames), generates the face image at the generation time from the face images of the two key frames by interpolation using the interpolation function stored in the interpolation function storage unit 134, and outputs it to the output unit 142; an animation generation control unit 200 that, once a generation time is determined, identifies the two key frames on either side of it, supplies the interpolation processing unit 204 with the face image data and blend rates at those key frames and with information indicating the relative position of the generation time between the times of the two key frames so that the face image at the generation time is created, and, when the interpolation processing unit 204 finishes generating the face image, repeats the process of creating the next face image with the time at that moment as the next generation time; and a timer 202 that the animation generation control unit 200 refers to in order to determine the time. The interpolation processing and the animation generation processing will be described later.

With reference to FIGS. 8 to 14, the process by which the key frame data creation unit 136 and the animation generation unit 138 of this embodiment create a facial animation is described in more detail.

Suppose, for example, that a phoneme string file 160 such as the one shown in FIG. 5 is given. The output of the mapping processing unit 180 shown in FIG. 6 is then as illustrated in FIG. 8. Referring to FIG. 8, the utterance periods of the phonemes /a/, /i/, /u/, /e/, and /o/ are laid out on the time axis with durations of 300, 300, 100, 30, and 250 (all in milliseconds), respectively. The start time of each period is a key frame. The blend rate at every key frame is 100%.
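The mapping step for this example can be sketched in Python as follows. The `KeyFrame` record, its field names, and the function `map_to_key_frames` are hypothetical; only the phonemes, durations, and the initial 100% blend rate are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class KeyFrame:
    phoneme: str
    face_id: str          # identifier of the mapped face image
    start_ms: int         # key frame time: start of the phoneme's period
    duration_ms: int
    blend_percent: float  # 100% until the later adjustment stages run


def map_to_key_frames(
    phonemes: List[Tuple[str, int]], table: Dict[str, str]
) -> List[KeyFrame]:
    """Attach a face image and a 100% blend rate to each (phoneme, duration)."""
    key_frames = []
    start = 0
    for phoneme, duration in phonemes:
        key_frames.append(KeyFrame(phoneme, table[phoneme], start, duration, 100.0))
        start += duration
    return key_frames


table = {"a": "A", "i": "I", "u": "U", "e": "E", "o": "O"}
phoneme_file = [("a", 300), ("i", 300), ("u", 100), ("e", 30), ("o", 250)]
key_frames = map_to_key_frames(phoneme_file, table)
```

With these durations the key frames fall at 0, 300, 600, 700, and 730 ms.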

FIG. 9 shows an example of an intermediate result while the phoneme string of FIG. 8 is being processed by the duration-based blend rate adjustment unit 182. Referring to FIG. 9, any phoneme whose duration is shorter than a predetermined threshold is first merged into the period of the immediately preceding phoneme. In the example of FIG. 8, the phoneme /e/ has a duration of 30 milliseconds; assuming a threshold of 50 milliseconds, the phoneme /e/ is deleted and its duration is merged into that of the immediately preceding phoneme /u/. As shown in FIG. 9, the phoneme string therefore becomes /a/, /i/, /u/, /o/, with durations of 300, 300, 130, and 250 (milliseconds), respectively. Because one phoneme is deleted, the number of key frames also decreases from five to four. In the following description, the times corresponding to these key frames are denoted T_0, T_1, T_2, and T_3, and the time of the key frame immediately after the last phoneme /o/ is denoted T_4. Hereinafter, the key frame defined by the start time T_k of the k-th phoneme is generally called "key frame T_k".
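The merging of short phonemes can be sketched as below, using the durations and the 50 ms threshold from the example. The function names are illustrative, and keeping a short phoneme that has no predecessor is an assumption of this sketch, since the text does not cover that case.

```python
from itertools import accumulate
from typing import List, Tuple


def merge_short_phonemes(
    phonemes: List[Tuple[str, int]], threshold_ms: int
) -> List[Tuple[str, int]]:
    """Delete phonemes shorter than the threshold, extending the previous one."""
    merged: List[Tuple[str, int]] = []
    for phoneme, duration in phonemes:
        if duration < threshold_ms and merged:
            prev_phoneme, prev_duration = merged[-1]
            merged[-1] = (prev_phoneme, prev_duration + duration)
        else:
            # Assumption: a short phoneme at the very start is kept as is.
            merged.append((phoneme, duration))
    return merged


def key_frame_times(phonemes: List[Tuple[str, int]]) -> List[int]:
    """Return T_0 .. T_n, where T_n is the end of the last phoneme."""
    return [0] + list(accumulate(duration for _, duration in phonemes))


merged = merge_short_phonemes(
    [("a", 300), ("i", 300), ("u", 100), ("e", 30), ("o", 250)], threshold_ms=50
)
times = key_frame_times(merged)
```

For the example data this gives the key frame times T_0 to T_4 as 0, 300, 600, 730, and 980 ms.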

The duration-based blend rate adjustment unit 182 further adjusts the blend rate assigned to each key frame according to its duration. Specifically, the duration-based blend rate adjustment unit 182 finds the maximum duration L_MAX in the phoneme string and adjusts the blend rate of each phoneme according to the following equation (1).

Here, B(n) denotes the blend rate of the n-th phoneme and L(n) denotes the duration of the n-th phoneme. C_1 is a predetermined constant, chosen, for example, to be of about the same magnitude as the threshold used when deleting phonemes of short duration. FIG. 10 schematically shows the arrangement on the time axis of the phoneme string whose blend rates have been adjusted by duration in this way, together with those blend rates. In FIG. 10, for example, the adjusted blend rates of the phonemes /a/, /i/, /u/, and /o/ are a, a, b, and c (%), respectively.

Like the duration-based blend rate adjustment unit 182, the power-based blend rate adjustment unit 184 adjusts the blend rate of each phoneme according to the speech power during that phoneme's duration. Specifically, any phoneme whose power does not exceed a predetermined threshold is first deleted, and its duration is merged into that of the immediately preceding phoneme. The start of each period obtained in this way is a key frame, and a blend rate is assigned to each key frame. The power-based blend rate adjustment unit 184 adjusts this blend rate (the blend rate of the n-th phoneme is again denoted B(n)) according to the following equation (2).

Here, P_MAX is the overall maximum power, P(n) is the power during the duration of the n-th phoneme, and C_2 is a predetermined threshold. This threshold is likewise chosen to be of about the same magnitude as the threshold used when deleting phonemes as described above.

FIG. 11 schematically shows the phoneme string finally obtained in this way, the durations, and the adjusted blend rate at each key frame. Referring to FIG. 11, the phonemes at key frames T_0, T_1, T_2, and T_3 are /a/, /i/, /u/, and /o/, respectively; the blend rates are a', a'', b', and c', respectively (where a' ≤ a, a'' ≤ a, b' ≤ b, and c' ≤ c); and the durations are 300, 300, 130, and 250 (milliseconds), respectively.

The key frame data is created in this way.

Next, the interpolation processing performed by the interpolation processing unit 204 shown in FIG. 7 is described. Various interpolation functions could be stored in the interpolation function storage unit 134; this embodiment emphasizes ease and speed of computation and uses one that performs linear interpolation. The concept of linear interpolation is described with reference to FIG. 12.

Referring to FIG. 12, consider a graph with the time axis as the horizontal axis and the blend rate at the key frame times T_0, T_1, T_2, T_3, ... as the vertical axis. As represented by line segments 220, 222, 224, and 226, the interpolation function of this embodiment linearly interpolates the blend rate at each time along the line segment connecting the blend rate at a key frame to the point of blend rate "0" at the time of the adjacent key frame. In other words, it is a function that performs linear interpolation such that the rate is 100% at one end and 0% at the other.

For example, suppose that a time t between time T_0 and time T_1 is the generation time. The phonemes at key frames T_0 and T_1 are /a/ and /i/, respectively. The blend rate at each key frame has been calculated by the power-based blend rate adjustment unit 184. The blend rate α of key frame T_0 at time t is linearly interpolated along line segment 220, which connects the blend rate a' at time T_0 to the point of blend rate "0" at time T_1. Similarly, the blend rate β of key frame T_1 at time t is calculated along line segment 222, which connects the point of blend rate "0" at time T_0 to the point of blend rate a'' at time T_1.

Let B(T_0) be the blend rate of key frame T_0 at time T_0, let B(T_1) be the blend rate of key frame T_1 at time T_1, and let α and β be the blend rates of key frames T_0 and T_1 at time t obtained by interpolation. Then α and β are given by the following equation (3).
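The equation itself is not reproduced in this text. A form consistent with line segments 220 and 222 as described above, and with the later statement that α = B(T_0) and β = 0 when t = T_0, would be the following; it is a reconstruction from the surrounding description, not the equation as printed in the publication.

```latex
\alpha = B(T_0)\,\frac{T_1 - t}{T_1 - T_0},
\qquad
\beta = B(T_1)\,\frac{t - T_0}{T_1 - T_0},
\qquad
T_0 \le t \le T_1 .
```

At t = T_1 this form gives α = 0 and β = B(T_1), matching the other ends of the two line segments.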

In this embodiment, the two blend rates calculated in this way (for example, α and β) are used to create the face image at time t as shown in FIG. 13.

Now let X(T_0) be the three-dimensional vector whose elements are the movement of each feature point of the face image at key frame T_0 relative to the corresponding feature point of the face image /φ/, and likewise let X(T_1) be the three-dimensional vector at key frame T_1. Then the three-dimensional vector X(t), whose elements are the movement of each feature point of the face image for T_0 ≤ t ≤ T_1 relative to the corresponding feature point of the face image /φ/, is calculated as the weighted vector sum given by the following equation (4).

The interpolation processing unit 204 performs this calculation for each feature point of the face image. As described later, this kind of computation is something a graphics processor unit (GPU) is good at. It is therefore desirable that the animation generation system 80 include a GPU.
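The per-feature-point calculation can be sketched in Python as a weighted sum of two displacement vectors. Equations (3) and (4) are not reproduced in this text, so the sketch rests on two stated assumptions: the blend rates fall linearly to zero at the adjacent key frame, as the description of line segments 220 and 222 suggests, and α and β are percentages applied to the 100% displacement vectors of the two face images. The function names and sample values are hypothetical.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def blend_rates(t: float, t0: float, t1: float, b0: float, b1: float) -> Tuple[float, float]:
    """Linearly interpolate the blend rates of key frames T_0 and T_1 at time t."""
    w = (t - t0) / (t1 - t0)  # relative position of t between the key frames
    return b0 * (1.0 - w), b1 * w


def interpolate_face(x0: List[Vec3], x1: List[Vec3], alpha: float, beta: float) -> List[Vec3]:
    """Weighted sum of two sets of feature-point displacements (rates in percent)."""
    a, b = alpha / 100.0, beta / 100.0
    return [
        tuple(a * p + b * q for p, q in zip(v0, v1))
        for v0, v1 in zip(x0, x1)
    ]


# Hypothetical one-point faces and blend rates a' = 80%, a'' = 60%.
face_a = [(1.0, 0.0, 0.0)]
face_i = [(0.0, 2.0, 0.0)]
alpha, beta = blend_rates(150.0, 0.0, 300.0, 80.0, 60.0)
mid_face = interpolate_face(face_a, face_i, alpha, beta)
```

Because every feature point is processed independently with the same two weights, the loop maps naturally onto data-parallel hardware such as a GPU.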

Animation generation control by the animation generation control unit 200 is described next. Referring to FIG. 14, suppose that the animation generation control unit 200 starts creating the animation from a time t equal to the time T_0 of the first key frame. That is, at time t_0 (= T_0), the animation generation control unit 200 gives the interpolation processing unit 204 an instruction 240 to generate a face image. In other words, the generation time is t = t_0.

In this case, an integer k satisfying T_{k-1} ≤ t ≤ T_k is first sought. Here t = t_0 = T_0, so k = 1. The interpolation processing unit 204 multiplies the three-dimensional vector X(T_0) of each feature point of the face image /A/ of the phoneme /a/ at time T_0 by the blend rate a' at that time (see FIG. 11). It likewise multiplies the three-dimensional vector X(T_1) of each feature point of the face image /I/ of the phoneme /i/ at time T_1 by the blend rate a'' at that time (see FIG. 11). The interpolation processing unit 204 then calculates α and β using equation (3). Since t = T_0 here, α = B(T_0) and β = 0. Using these results, it generates and outputs the face image 242 at time t according to equation (4).

Suppose this generation process takes time s_1. Having generated and output the face image 242, the interpolation processing unit 204 gives the animation generation control unit 200 a completion notification 244 indicating that the process has finished. The time t_2 at this moment becomes the new generation time t.

In response to receiving the completion notification 244 at the new generation time t = t_2, the animation generation control unit 200 identifies the two key frames on either side of this generation time t (in the example of FIG. 14, the key frames at times T_0 and T_1), supplies the interpolation processing unit 204 with the face images /A/ and /I/ at these two key frames, the two times T_0 and T_1, and the relative time of t = t_2 within the interval T_0 to T_1, and gives it an instruction 246 to generate a face image.

In response to this instruction, the interpolation processing unit 204 takes time s_2 to generate and output the face image 248 at generation time t = t_2. The interpolation processing unit 204 then gives the animation generation control unit 200 a completion notification 250. The time t_3 at this moment becomes the new generation time t.

The animation generation control unit 200 then performs, for this new generation time t = t_3, the same processing as it performed for the previous generation time t_2, and gives the interpolation processing unit 204 a face image generation instruction 252. Likewise, the face image 254 is output after time s_3, a completion notification 256 is given to the animation generation control unit 200 at time t_3, and in response the animation generation control unit 200 gives the interpolation processing unit 204 an instruction 258 to generate the face image at time t_3. The same applies thereafter.

That is, in this embodiment the animation generation control unit 200 controls the timing of animation generation so that the interpolation processing unit 204 always operates at its full capacity.
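The control scheme described above, in which each frame's completion time immediately becomes the next generation time, can be sketched as a simple loop. In this sketch the timer is passed in as a `clock` callable and the interpolation step as a `render` callable; both names, the simulated timer readings, and the use of `bisect` to find k are assumptions made for illustration.

```python
import bisect
from typing import Callable, List, Tuple


def run_generation_loop(
    key_times: List[float],
    clock: Callable[[], float],
    render: Callable[[int, float], None],
) -> List[Tuple[int, float]]:
    """Generate frames back to back until the clock reaches the last key time."""
    generated = []
    t = key_times[0]  # the first generation time equals T_0
    while t < key_times[-1]:
        # bisect_right returns k such that key_times[k-1] <= t < key_times[k].
        k = bisect.bisect_right(key_times, t)
        relative = (t - key_times[k - 1]) / (key_times[k] - key_times[k - 1])
        render(k, relative)  # stands in for the interpolation processing
        generated.append((k, t))
        t = clock()  # the completion time becomes the next generation time
    return generated


# Simulated timer readings (ms since the start) taken as each frame completes.
readings = iter([40.0, 310.0, 650.0, 990.0])
frames = run_generation_loop(
    [0.0, 300.0, 600.0, 730.0, 980.0],
    clock=lambda: next(readings),
    render=lambda k, relative: None,
)
```

With a real timer, `clock` could return the elapsed time since playback began, for example from `time.monotonic()`; a faster renderer then simply yields more frames per second without any change to the loop.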

<Operation>
Referring to FIGS. 2 to 14, the animation generation system 80 described above operates as follows. The operations of the units described below are carried out in a predetermined sequence under the control of the sequence control unit 144 shown in FIG. 2, but to keep the explanation easy to follow, the control by the sequence control unit 144 is not mentioned below.

Face images of the animation character are prepared in advance for the six types of phonemes described above and stored in the face data file storage unit 132. A mapping table that maps a face image to each phoneme is also prepared in advance and stored in the mapping table storage unit 130. A program that implements the interpolation function is likewise prepared in advance and stored in the interpolation function storage unit 134. In addition, several transcription texts for the user to utter are prepared in advance and stored in the text storage unit 110.

The text selection unit 112 reads all of the transcription texts stored in the text storage unit 110, displays them on the text selection interface 90, and displays a message prompting the user to select one of them.

When the user selects one of the texts, the text selection unit 112 gives that text to the key frame data creation unit 136 and displays on the text selection interface 90 a message prompting the user to utter the text. At the same time, the text selection unit 112 activates the audio recording unit 114 and starts recording the audio signal from the microphone 92.

The audio recording unit 114 creates audio data by dividing the input audio into frames with a predetermined frame length and a predetermined shift length, and stores it on the hard disk as an audio data file.

When recording of the audio signal ends, the text selection unit 112 and the audio recording unit 114 give the transcription text and the audio data file, respectively, to the key frame data creation unit 136 and the speech recognition device 120.

The key frame data creation unit 136 and the speech recognition device 120 operate on this data as follows. First, the speech recognition device 120 performs phoneme segmentation on the audio data file given from the audio recording unit 114, referring to the transcription data, and creates a phoneme string file 160 such as that shown in FIG. 5 (including time information from which durations can be determined). The speech recognition device 120 gives the data of this phoneme string file 160 to the mapping processing unit 180 of the key frame data creation unit 136.

Referring to FIG. 6, the mapping processing unit 180 of the key frame data creation unit 136 assigns a face image identifier to each phoneme in the phoneme string file 160 by referring to the mapping table storage unit 130, and gives the result to the duration-based blend rate adjustment unit 182.

The duration-based blend rate adjustment unit 182 compares the duration of each phoneme with the threshold, deletes any phoneme whose duration is less than the threshold, and merges its duration into that of the immediately preceding phoneme. The duration-based blend rate adjustment unit 182 further adjusts the blend rate of each phoneme according to equation (1), based on the maximum phoneme duration and the duration of that phoneme. It then gives the resulting phoneme string, with durations, face image identifiers, and blend rates attached, to the power-based blend rate adjustment unit 184.

If any phoneme in the phoneme string given from the duration-based blend rate adjustment unit 182 has a power value during its period that is less than a predetermined threshold, the power-based blend rate adjustment unit 184 deletes that phoneme and merges its duration into that of the immediately preceding phoneme.

The power-based blend rate adjustment unit 184 further adjusts the blend rate of each phoneme according to equation (2), based on the maximum power and the power of each phoneme. It gives the key frame data, consisting of the phoneme string whose blend rates have been adjusted in this way, to the animation generation control unit 200 shown in FIG. 7.

Referring to FIG. 7, the animation generation control unit 200 first takes the time of the first key frame in the given key frame data (the start time of the first phoneme) as the generation time t, and searches for an integer k such that T_{k-1} ≤ t < T_k. In this case t = T_0 = T_{k-1}, so k = 1. The animation generation control unit 200 reads the face image data corresponding to key frames T_0 and T_1 from the face data file storage unit 132, and gives the times T_0 and T_1, the generation time t, and the face image data corresponding to key frames T_0 and T_1 to the interpolation processing unit 204.

Based on the given times T_0 and T_1, the blend rates of key frames T_0 and T_1, and the value of the generation time t, the interpolation processing unit 204 calculates the blend rates α and β at the generation time t, relative to the blend rates of key frames T_0 and T_1, using the interpolation function (equation (3)) stored in the interpolation function storage unit 134. The interpolation processing unit 204 then uses the calculated blend rates α and β and the feature point vectors X(T_0) and X(T_1) of the face image data at key frames T_0 and T_1 to calculate each feature point vector x(t) of the face image at the generation time t according to equation (4) above, and gives the result to the output unit 142. The output unit 142 displays this image on the display 102. In synchronization with the display of this face image, the output unit 142 starts playing back the audio file stored in the audio file storage unit 140.

アニメーション生成部１３８は、生成時刻ｔにおける顔画像の算出が終了すると、処理の終了を示す信号をアニメーション生成制御部２００に与える。アニメーション生成制御部２００は、この信号を受信すると、そのときの時刻をタイマ２０２を参照して定める。アニメーション生成制御部２００は、この時刻を新たな生成時刻ｔに設定し、Ｔ_ｋ−１≦ｔ＜Ｔ_ｋとなる整数ｋを定める。そして、時刻Ｔ_ｋ−１、Ｔ_ｋ、生成時刻ｔ、キーフレームＴ_ｋ−１及びＴ_ｋのデータを補間処理部２０４に与え、時刻ｔにおける顔画像の生成を実行させる。 When the calculation of the face image at generation time t is completed, animation generation unit 138 gives a signal indicating the end of the process to animation generation control unit 200. On receiving this signal, animation generation control unit 200 determines the current time by referring to timer 202. It sets this time as the new generation time t and determines the integer k that satisfies T_{k-1} ≤ t < T_k. It then gives times T_{k-1} and T_k, generation time t, and the data of key frames T_{k-1} and T_k to interpolation processing unit 204, and causes it to generate the face image at time t.

補間処理部２０４は、前のサイクルと同様にして、与えられた時刻Ｔ_ｋ−１、Ｔ_ｋと、それらのキーフレームＴ_ｋ−１、Ｔ_ｋのブレンド率と、生成時刻ｔの値とに基づき、キーフレームＴ_ｋ−１及びＴ_ｋのブレンド率に対する、生成時刻ｔにおけるブレンド率α、βを補間関数記憶部１３４に記憶された補間関数（式（３））を用いてそれぞれ算出する。補間処理部２０４はさらに、算出されたブレンド率α、βと、キーフレームＴ_ｋ−１及びＴ_ｋにおける顔画像データの各特徴点ベクトルＸ（Ｔ_ｋ−１）、Ｘ（Ｔ_ｋ）とを用い、式（４）を用いて生成時刻ｔにおける顔画像の各特徴点ベクトルｘ（ｔ）を算出し、出力部１４２に与える。出力部１４２はこの画像をディスプレイ１０２上に表示する。音声ファイル記憶部１４０の再生は、画像の出力と同期して継続される。 As in the previous cycle, interpolation processing unit 204 calculates blend rates α and β at generation time t, which correspond to the blend rates of key frames T_{k-1} and T_k respectively, from the given times T_{k-1} and T_k, the blend rates of key frames T_{k-1} and T_k, and the value of generation time t, using the interpolation function (equation (3)) stored in interpolation function storage unit 134. Interpolation processing unit 204 then uses the calculated blend rates α and β and the feature point vectors X(T_{k-1}) and X(T_k) of the face image data at key frames T_{k-1} and T_k to calculate each feature point vector x(t) of the face image at generation time t by equation (4), and gives the result to output unit 142. Output unit 142 displays this image on display 102. Playback of the audio file in audio file storage unit 140 continues in synchronization with the image output.
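The interpolation cycle above can be sketched in a few lines. Equations (3) and (4) are defined earlier in the description and are not reproduced in this passage, so the following is only an illustration under two assumptions: that equation (3) is a linear ramp between the two key frame times, and that equation (4) is a plain weighted sum of the two key frame feature point vectors, taken here as displacements from the expressionless face. The function names `blend_weights` and `interpolate_points` are hypothetical.

```python
def blend_weights(t, t_prev, t_next, b_prev, b_next):
    """Assumed form of equation (3): linear ramp between two key frames.

    b_prev and b_next are the blend rates stored with key frames
    T_{k-1} and T_k; the return value is (alpha, beta) at time t.
    """
    u = (t - t_prev) / (t_next - t_prev)  # 0 at T_{k-1}, 1 at T_k
    alpha = b_prev * (1.0 - u)
    beta = b_next * u
    return alpha, beta


def interpolate_points(x_prev, x_next, alpha, beta):
    """Assumed form of equation (4): weighted sum per feature point.

    x_prev and x_next are lists of (x, y, z) displacement vectors.
    """
    return [
        tuple(alpha * p + beta * q for p, q in zip(a, b))
        for a, b in zip(x_prev, x_next)
    ]


# Halfway between two key frames with blend rates 100% and 50%.
alpha, beta = blend_weights(1.0, 0.0, 2.0, 1.0, 0.5)
frame = interpolate_points([(2.0, 0.0, 0.0)], [(0.0, 4.0, 0.0)], alpha, beta)
```

Because every feature point is processed independently, this kind of calculation parallelizes well, which fits the use of a dedicated processor for the face image arithmetic.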

補間処理部２０４は、生成時刻ｔにおける顔画像の算出が終了すると、それを示す信号をアニメーション生成制御部２００に与える。アニメーション生成制御部２００は、この信号を受信すると、そのときの時刻をタイマ２０２を参照して求める。そしてその時刻を新たな生成時刻ｔに定める。 When the calculation of the face image at the generation time t is completed, the interpolation processing unit 204 gives a signal indicating the calculation to the animation generation control unit 200. When receiving this signal, the animation generation control unit 200 obtains the time at that time with reference to the timer 202. Then, the time is set as a new generation time t.

以下、同様の処理を繰返す。そして、生成時刻ｔが音声の収録時間を上回ると、アニメーション生成システム８０はアニメーション生成の処理を終了し、その後の状態は最初の書起しテキスト選択時の表示時の状態に戻る。 Thereafter, the same processing is repeated. When generation time t exceeds the recording time of the voice, animation generation system 80 ends the animation generation process, and the system returns to its initial state, in which the transcription texts are displayed for selection.

このようにして、ある発話テキストをユーザが選択して読み上げると、その音声データに基づき、顔データファイル記憶部１３２に記憶された顔画像データを用い、口の形状が音声データに同期して変形する顔画像のアニメーションが得られる。最初に収録された音声も顔画像に同期して再生されるため、アニメーションのキャラクタが発話しているように見える。その結果、ユーザの声でキャラクタが発話するアニメーションを得ることができる。 When the user selects an utterance text and reads it aloud in this way, an animation of a face image whose mouth shape deforms in synchronization with the voice data is obtained from that voice data, using the face image data stored in face data file storage unit 132. The originally recorded voice is also played back in synchronization with the face images, so the animated character appears to be speaking. As a result, an animation in which the character speaks with the user's own voice is obtained.

本実施の形態では、顔画像は、限定された音素に対応するものしか準備されていないが、マッピングテーブルを用いて各音素に対し、適切な顔画像をマッピングすることにより、十分に自然なアニメーションを得ることができる。音素の継続時間長が極端に短かったり、パワーが極端に小さかったりした場合、その音素については、画像の生成を省略している。通常は、このような音素を発音する際の実際の顔の動きも非常に小さい。そのため、この省略により、得られる顔画像のアニメーションは自然な動きに近く感じられる効果がある。さらに、ブレンド率という概念を用いて、各音素の発話の強さに応じて顔の変形量（各特徴点の、基準画像（無表情という画像）の各特徴点位置からの３次元的な移動量）を調整している。そのため、音素の発話の強さに応じて自然な動きの顔画像のアニメーションを得ることができる。また、隣り合うキーフレームの間の顔画像は、隣り合うキーフレームの顔画像を、キーフレームにおけるブレンド率と、キーフレームの時間と、画像の生成時間とに応じた加重和により内挿して得ている。従って、音素から音素への変化の際の口の形状変化が滑らかなものとなり、得られた顔画像のアニメーションも自然なものに感じられる。 In the present embodiment, face images are prepared only for a limited set of phonemes, but a sufficiently natural animation can be obtained by using a mapping table to map an appropriate face image to each phoneme. When the duration of a phoneme is extremely short or its power is extremely small, image generation is omitted for that phoneme. Normally, the actual movement of the face when such a phoneme is pronounced is also very small. This omission therefore has the effect of making the resulting face image animation feel close to natural movement. Furthermore, using the concept of a blend rate, the amount of facial deformation (the three-dimensional displacement of each feature point from its position in the reference image, that is, the expressionless image) is adjusted according to the utterance strength of each phoneme. An animation of a face image that moves naturally in accordance with the utterance strength of the phonemes can therefore be obtained. In addition, the face image between adjacent key frames is obtained by interpolating the face images of the adjacent key frames with a weighted sum that depends on the blend rates at the key frames, the key frame times, and the image generation time. The mouth shape therefore changes smoothly from phoneme to phoneme, and the resulting face image animation also feels natural.

［第２の実施の形態］ [Second Embodiment]
上記した第１の実施の形態に係るアニメーション生成システム８０は、十分な性能のコンピュータがあれば、そのコンピュータ一台でも実現可能である。しかし、ある程度短い時間で作業を完了させるためには、複数のコンピュータを用いることが実際的である。 Animation generation system 80 according to the first embodiment described above can be realized with a single computer, provided the computer has sufficient performance. However, to complete the work in a reasonably short time, it is practical to use a plurality of computers.

図１５に、本発明の第２の実施の形態に係るアニメーション生成システム２８０の概略構成を示す。図１５を参照して、このアニメーション生成システム２８０は、不特定のユーザによる音声入力を受け、アニメーション生成システム２８０でのアニメーションの生成を開始させる処理を行なう音声入力用のコンピュータ２９２と、音声入力用のコンピュータ２９２によって入力された音声に対する音素セグメンテーションを行なってキーフレームデータを作成するための音声認識サーバ２９４と、音声入力用のコンピュータ２９２による音声入力を受け、音声認識サーバ２９４が出力するキーフレームデータを利用して、入力された音声と同期して口の形状が変化する、所定のキャラクタの顔画像のアニメーションを作成し表示するためのアニメーション表示用コンピュータ２９６とを含む。音声入力用のコンピュータ２９２、音声認識サーバ２９４、及びアニメーション表示用コンピュータ２９６はいずれもネットワーク２９０を介して互いに所定のプロトコルで通信可能となっている。 FIG. 15 shows a schematic configuration of animation generation system 280 according to the second embodiment of the present invention. Referring to FIG. 15, animation generation system 280 includes: a voice input computer 292 that receives voice input from an unspecified user and performs processing to start animation generation in animation generation system 280; a voice recognition server 294 that performs phoneme segmentation on the voice input through voice input computer 292 and creates key frame data; and an animation display computer 296 that receives the voice input from voice input computer 292 and, using the key frame data output by voice recognition server 294, creates and displays an animation of the face image of a predetermined character whose mouth shape changes in synchronization with the input voice. Voice input computer 292, voice recognition server 294, and animation display computer 296 can all communicate with one another over network 290 using a predetermined protocol.

図２と比較すると、音声入力用のコンピュータ２９２が図２の入力指示ユニット９４に、音声認識サーバ２９４が図２の音声認識装置１２０に、アニメーション表示用コンピュータ２９６が図２のアニメーション再生ユニット９８に、それぞれ相当する。音声入力用のコンピュータ２９２、音声認識サーバ２９４、及びアニメーション表示用コンピュータ２９６の機能構成は、それぞれ図２の入力指示ユニット９４、音声認識装置１２０、及びアニメーション再生ユニット９８の構成と同様であるので、ここではその詳細は繰返さない。 Compared with FIG. 2, voice input computer 292 corresponds to input instruction unit 94 of FIG. 2, voice recognition server 294 to voice recognition device 120 of FIG. 2, and animation display computer 296 to animation playback unit 98 of FIG. 2. The functional configurations of voice input computer 292, voice recognition server 294, and animation display computer 296 are the same as those of input instruction unit 94, voice recognition device 120, and animation playback unit 98 of FIG. 2, respectively, so their details are not repeated here.

音声入力用のコンピュータ２９２は、タッチパネル３００と、マイク３０２とを有する。アニメーション表示用コンピュータ２９６は、スピーカ３１２を有するモニタ３１０と、モニタ３１０の下に配置されたコンピュータ筐体３１４とを含む。 The computer 292 for voice input has a touch panel 300 and a microphone 302. The animation display computer 296 includes a monitor 310 having a speaker 312 and a computer housing 314 disposed under the monitor 310.

図１６に、アニメーション表示用コンピュータ２９６のハードウェア構成を示す。図１６を参照して、アニメーション表示用コンピュータ２９６は、図１５に示すモニタ３１０及びスピーカ３１２に加え、いずれもコンピュータ本体３１４内に配置された、ＣＰＵ（中央演算処理装置）３５０と、読出専用メモリ（ＲＯＭ）３５２と、随時読出書込可能メモリ（ＲＡＭ）３５４と、ハードディスクドライブ３５６と、ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｋ）３３０を装着可能なＤＶＤドライブ３５８と、顔画像の演算処理を実行するためのＧＰＵ３６０とを含む。これらはいずれもバス３６２によりＣＰＵ３５０に接続されている。 FIG. 16 shows a hardware configuration of the animation display computer 296. Referring to FIG. 16, in addition to the monitor 310 and speaker 312 shown in FIG. 15, an animation display computer 296 includes a CPU (Central Processing Unit) 350 and a read-only memory, both of which are arranged in the computer main body 314. (ROM) 352, read / writeable memory (RAM) 354, hard disk drive 356, DVD drive 358 in which a DVD (Digital Versatile Disk) 330 can be mounted, and GPU 360 for executing facial image calculation processing Including. These are all connected to the CPU 350 by a bus 362.

アニメーション表示用コンピュータ２９６はさらに、いずれもコンピュータ本体３１４内に配置され、バス３６２に接続された、ネットワークインターフェイス（Ｉ／Ｆ）３６８、フラッシュメモリからなる持運び可能なメモリ３３２を装着可能なメモリポート３６６、及びスピーカ３１２が接続されるサウンドボード３６４を含む。 Animation display computer 296 further includes a network interface (I/F) 368, a memory port 366 into which a portable memory 332 consisting of flash memory can be inserted, and a sound board 364 to which speaker 312 is connected, all of which are arranged in computer main body 314 and connected to bus 362.

なお、アニメーション表示用コンピュータ２９６においては、アニメーション作成処理においてキーボードを使用する必要がないため、キーボードを備えていない。もちろん、アニメーション表示用コンピュータ２９６を通常のコンピュータとして使用する際には、コンピュータ本体３１４にキーボード及びマウス等の入力装置を接続することが可能である。 Note that the animation display computer 296 does not include a keyboard because it is not necessary to use a keyboard in the animation creation process. Of course, when the animation display computer 296 is used as a normal computer, an input device such as a keyboard and a mouse can be connected to the computer main body 314.

図には示していないが、音声入力用のコンピュータ２９２及び音声認識サーバ２９４のハードウェア構成もアニメーション表示用コンピュータ２９６とほぼ同様である。相違点といえば、音声入力用のコンピュータ２９２において、モニタ３１０と入力装置とが一体となってタッチパネル３００を構成していること、音声入力用のコンピュータ２９２がさらにマイクロフォン３０２を備えていること、音声入力用のコンピュータ２９２及び音声認識サーバ２９４ではＧＰＵ３６０が不要であること等である。 Although not shown in the drawing, the hardware configuration of the voice input computer 292 and the voice recognition server 294 is substantially the same as that of the animation display computer 296. Speaking of differences, in the computer 292 for voice input, the monitor 310 and the input device are integrated to form the touch panel 300, the computer 292 for voice input further includes the microphone 302, and the voice. For example, the GPU 360 is not required in the input computer 292 and the speech recognition server 294.

図１７に、音声入力用のコンピュータ２９２で実行されることにより、音声入力用のコンピュータ２９２を図２に示す入力指示ユニット９４として動作させるためのコンピュータプログラムの制御構造をフローチャート形式で示す。 FIG. 17 is a flowchart showing a control structure of a computer program for causing the voice input computer 292 to operate as the input instruction unit 94 shown in FIG. 2 by being executed by the voice input computer 292.

図１７を参照して、音声入力用のコンピュータ２９２の電源が投入され、このプログラムが起動されると、ステップ４００で初期化処理が実行される。この処理では、音声入力用のコンピュータ２９２内で処理に必要な資源の確保及び初期化、通信機能の確認、発話テキストファイルからの発話テキストの読込み等が行なわれる。 Referring to FIG. 17, when the computer 292 for voice input is turned on and this program is started, an initialization process is executed in step 400. In this processing, resources necessary for the processing are secured and initialized in the computer 292 for voice input, the communication function is confirmed, and the utterance text is read from the utterance text file.

初期化処理が終了すると、ステップ４０２において、音声入力用のコンピュータ２９２の準備が完了したことをアニメーション表示用コンピュータ２９６に通知する。続いてステップ４０４において、アニメーション表示用コンピュータ２９６より、音声入力用のコンピュータ２９２、音声認識サーバ２９４及びアニメーション表示用コンピュータ２９６がともに準備完了状態となったことを示す準備完了通知を受信したか否かを判定する。準備完了通知を受けたらステップ４０６に進む。準備完了通知を受取るまで、ステップ４０４の判定処理を繰返す。 When the initialization process is finished, in step 402, the animation display computer 296 is notified that the preparation of the computer 292 for voice input is completed. Subsequently, at step 404, whether or not a preparation completion notification indicating that the voice input computer 292, the voice recognition server 294, and the animation display computer 296 are all ready is received from the animation display computer 296. Determine. When the preparation completion notification is received, the process proceeds to step 406. The determination process in step 404 is repeated until a preparation completion notification is received.

このようにアニメーション表示用コンピュータ２９６からの準備完了通知を待つのは、同時期に音声入力用のコンピュータ２９２、音声認識サーバ２９４及びアニメーション表示用コンピュータ２９６が起動されたとして、全てにおいて準備が完了しないと、アニメーション生成システム２８０全体として機能することができないためである。 The reason for waiting for the ready notification from animation display computer 296 in this way is that, even if voice input computer 292, voice recognition server 294, and animation display computer 296 are started at around the same time, animation generation system 280 cannot function as a whole until all of them have completed their preparation.

続いてステップ４０６において、いくつかの発話テキストをタッチパネル３００に表示し、「テキストを一つ選択してください」という、入力待ちメッセージを表示する。そしてステップ４０８で入力待ちの状態となる。入力があると、すなわちテキストがユーザにより選択されるとステップ４１０に進む。 Subsequently, in step 406, several utterance texts are displayed on the touch panel 300, and an input waiting message “Please select one text” is displayed. Then, in step 408, the state waits for input. If there is an input, i.e. the text is selected by the user, the process proceeds to step 410.

ステップ４１０では、選択されたテキストを発話するようにユーザに促すメッセージを表示し、録音を開始する。録音が終了するとステップ４１２に進む。 In step 410, a message prompting the user to speak the selected text is displayed and recording begins. When recording ends, the process proceeds to step 412.

ステップ４１２では、アニメーション表示用コンピュータ２９６に対して録音完了を通知する。続くステップ４１４において、アニメーション表示用コンピュータ２９６から処理開始通知を受信したか否かを判定する。処理開始通知とは、音声認識サーバ２９４における音素セグメンテーション処理と、アニメーション表示用コンピュータ２９６におけるアニメーション生成処理との実行を開始したことを示すメッセージである。 In step 412, the recording completion is notified to the animation display computer 296. In the following step 414, it is determined whether or not a processing start notification has been received from the animation display computer 296. The process start notification is a message indicating that execution of the phoneme segmentation process in the speech recognition server 294 and the animation generation process in the animation display computer 296 is started.

ステップ４１６では、処理中を示す表示をタッチパネル３００上に表示する。ステップ４１８で、ステップ４１０において録音した音声データと、対応するテキストデータ（書起しデータ）とをアニメーション表示用コンピュータ２９６に送信する。そして、ステップ４２０で、アニメーション表示用コンピュータ２９６から処理完了通知を受信するまで待機する。処理完了通知とは、ステップ４１８でアニメーション表示用コンピュータ２９６に対し送信した音声データに対して、音素セグメンテーション処理とその後のキーフレームデータ作成処理までが完了したことを示す通知である。 In step 416, a display indicating that processing is in progress is displayed on touch panel 300. In step 418, the voice data recorded in step 410 and the corresponding text data (transcription data) are transmitted to the animation display computer 296. In step 420, the process waits until a processing completion notification is received from the animation display computer 296. The process completion notification is a notification indicating that the phoneme segmentation process and the subsequent key frame data creation process have been completed for the audio data transmitted to the animation display computer 296 in step 418.

処理完了通知を受信すると、ステップ４２２において、アニメーション表示用コンピュータ２９６に対し、アニメーションの出力命令を送信する。後述するように、アニメーション表示用コンピュータ２９６は、この出力命令に対してアニメーションの生成処理及び出力処理を開始する。ステップ４２４ではアニメーション表示用コンピュータ２９６から出力中通知を受信するまで待機し、出力中表示を受けるとステップ４２６に進む。ステップ４２６では、タッチパネル３００上に、アニメーションをアニメーション表示用コンピュータ２９６のモニタ３１０上に出力中であることを示すメッセージを表示する。そしてステップ４２８で、アニメーション表示用コンピュータ２９６からアニメーションの出力処理が完了したことを示す出力完了通知を待つ。出力完了通知を受信すると、ステップ４１０で録音した音声に対するアニメーションの生成及び表示が全て完了したということである。従って制御はステップ４０６に戻り、次のユーザ入力を待つ。 When the processing completion notification is received, an animation output command is transmitted to animation display computer 296 in step 422. As described later, animation display computer 296 starts the animation generation and output processing in response to this output command. In step 424, the program waits until an output-in-progress notification is received from animation display computer 296, and proceeds to step 426 when it is received. In step 426, a message indicating that the animation is being output on monitor 310 of animation display computer 296 is displayed on touch panel 300. In step 428, the program waits for an output completion notification from animation display computer 296 indicating that the animation output processing has been completed. Receipt of the output completion notification means that generation and display of the animation for the voice recorded in step 410 have all been completed. Control therefore returns to step 406 and waits for the next user input.
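The exchange between the voice input computer and the animation display computer in steps 412 to 428 is a strict alternation of sends and waits. The sketch below illustrates that ordering for one utterance. It is not the patent's protocol: the message names (`RECORDING_DONE`, `PROCESS_STARTED` and so on) and the `send`/`recv` callables are hypothetical stand-ins for whatever transport and message format an implementation uses.

```python
def run_session(send, recv, audio, text):
    """Run steps 412-428 for one recorded utterance.

    send(msg) transmits a message to the animation display computer;
    recv() blocks until the next notification arrives from it.
    """
    def expect(name):
        msg = recv()
        if msg != name:
            raise RuntimeError(f"expected {name!r}, got {msg!r}")

    send("RECORDING_DONE")        # step 412: report that recording finished
    expect("PROCESS_STARTED")     # step 414: processing start notification
    send(("DATA", audio, text))   # step 418: voice data and transcription
    expect("PROCESS_DONE")        # step 420: key frame data has been created
    send("OUTPUT_COMMAND")        # step 422: ask for the animation output
    expect("OUTPUTTING")          # step 424: output-in-progress notification
    expect("OUTPUT_DONE")         # step 428: output completion notification


# Exercise the sequence against a scripted peer.
inbox = iter(["PROCESS_STARTED", "PROCESS_DONE", "OUTPUTTING", "OUTPUT_DONE"])
sent = []
run_session(sent.append, lambda: next(inbox), b"pcm-bytes", "hello")
```

Raising on an unexpected message is a design choice of this sketch; the flowchart itself only describes waiting until the expected notification arrives.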

音声入力用のコンピュータ２９２は、上記した処理を繰返す。 The computer 292 for voice input repeats the above processing.

図１８は、音声認識サーバ２９４が実行する処理のフローチャートである。このプログラムが起動されると、ステップ４４０において初期化処理が実行される。初期化処理が完了すると、ステップ４４２においてアニメーション表示用コンピュータ２９６に対し音声認識サーバ２９４の準備が完了したことを示す通知を送信する。 FIG. 18 is a flowchart of processing executed by the voice recognition server 294. When this program is started, initialization processing is executed in step 440. When the initialization process is completed, in step 442, a notification indicating that the preparation of the voice recognition server 294 is completed is transmitted to the animation display computer 296.

ステップ４４４では、アニメーション表示用コンピュータ２９６から音素セグメンテーションの依頼を受信したか否かを判定する。音素セグメンテーションとは音声認識処理と同様の処理であって、入力された音声を、音響モデルを用いて音素に分割する処理のことをいう。依頼を受信すると、ステップ４４８に進む。 In step 444, it is determined whether or not a phoneme segmentation request has been received from the animation display computer 296. Phoneme segmentation is processing similar to speech recognition processing, and refers to processing that divides input speech into phonemes using an acoustic model. When the request is received, the process proceeds to step 448.

ステップ４４８で、アニメーション表示用コンピュータ２９６に対し、音声認識サーバ２９４が音素セグメンテーション処理を開始したことを通知する。 In step 448, animation display computer 296 is notified that voice recognition server 294 has started the phoneme segmentation process.

続くステップ４５０において、依頼に従い、音素セグメンテーションを行なうべき音声データと書起しデータとをアニメーション表示用コンピュータ２９６より取得する。この取得が完了したら、ステップ４５２において対象データの受信が完了したことを示す通知をアニメーション表示用コンピュータ２９６に送信する。ステップ４５４では、受信した音声データに対し、図示しない音響モデルと、受信した書起しデータとを用いた音素セグメンテーション処理を実行する。この処理では、書起しデータが存在するので、正確な音素セグメンテーションをすることが可能である。 In the subsequent step 450, in accordance with the request, voice data and transcription data to be subjected to phoneme segmentation are acquired from the animation display computer 296. When the acquisition is completed, a notification indicating that the reception of the target data is completed is transmitted to the animation display computer 296 in step 452. In step 454, phoneme segmentation processing using an acoustic model (not shown) and the received transcription data is executed on the received voice data. In this process, since transcription data exists, it is possible to perform accurate phoneme segmentation.

音素セグメンテーションが終了し、音素列ファイルの生成が完了したら、ステップ４５６において、音素列ファイルの生成が完了したことをアニメーション表示用コンピュータ２９６に通知する。 When the phoneme segmentation is completed and the generation of the phoneme string file is completed, in step 456, the generation of the phoneme string file is notified to the animation display computer 296.

さらに、この音素列ファイルに基づき、ステップ４５８において、キーフレームデータの生成処理を実行する。キーフレームデータの生成処理の詳細については図１９を参照して後述する。キーフレームデータの生成処理が完了すると、ステップ４６０においてキーフレームデータ生成が完了したことをアニメーション表示用コンピュータ２９６に対して通知する。さらに、ステップ４６２において、音素列ファイルと、キーフレームデータとをアニメーション表示用コンピュータ２９６に対して送信する。ステップ４６４では、アニメーション表示用コンピュータ２９６に対して音声認識サーバ２９４における処理が全て完了したことを通知し、ステップ４４４に戻る。 Furthermore, based on this phoneme string file, in step 458, key frame data generation processing is executed. Details of the key frame data generation process will be described later with reference to FIG. When the key frame data generation process is completed, the animation display computer 296 is notified in step 460 that the key frame data generation has been completed. In step 462, the phoneme string file and the key frame data are transmitted to the animation display computer 296. In step 464, the animation display computer 296 is notified that all processing in the voice recognition server 294 has been completed, and the process returns to step 444.

図１９は、図１８のステップ４５８で実行されるキーフレームデータの作成処理のフローチャートである。図１９を参照して、ステップ４８０において、与えられた音素列の中で、継続時間長が所定のしきい値より小さい音素、又はパワーが所定のしきい値より小さい音素があるか否かを判定する。もしあれば、ステップ４８２において、その音素を削除し、その音素の継続時間長を直前の音素の継続時間長に統合する処理を行ない、ステップ４８０に戻る。上記したような音素が存在しなくなると、ステップ４８４に進む。 FIG. 19 is a flowchart of the key frame data creation process executed in step 458 of FIG. Referring to FIG. 19, in step 480, it is determined whether or not there is a phoneme having a duration less than a predetermined threshold or a phoneme having a power less than a predetermined threshold in a given phoneme string. judge. If there is, in step 482, the phoneme is deleted, the duration of the phoneme is integrated with the duration of the previous phoneme, and the process returns to step 480. When there is no phoneme as described above, the process proceeds to step 484.
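Steps 480 and 482 can be illustrated with a single pass over the phoneme string. In this sketch the `Phoneme` record, its field names, and the threshold parameters are assumptions made for illustration. The text does not say what happens when the very first phoneme falls below a threshold and has no predecessor to absorb its duration, so this sketch simply keeps it.

```python
from dataclasses import dataclass, replace


@dataclass
class Phoneme:
    label: str
    start: float     # start time in seconds
    duration: float  # duration in seconds
    power: float


def prune_phonemes(phonemes, min_duration, min_power):
    """Drop short or weak phonemes, merging their duration into the previous one."""
    kept = []
    for ph in phonemes:
        weak = ph.duration < min_duration or ph.power < min_power
        if weak and kept:
            # Step 482: delete the phoneme and extend its predecessor.
            kept[-1].duration += ph.duration
        else:
            # Copy so the caller's list is left untouched.
            kept.append(replace(ph))
    return kept


phonemes = [
    Phoneme("a", 0.00, 0.10, 1.0),
    Phoneme("b", 0.10, 0.01, 1.0),    # too short
    Phoneme("c", 0.11, 0.10, 0.001),  # too weak
    Phoneme("d", 0.21, 0.10, 1.0),
]
pruned = prune_phonemes(phonemes, min_duration=0.03, min_power=0.01)
```

Merging only ever lengthens a phoneme that was already kept, so one pass is enough and no retained phoneme needs to be checked again.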

ステップ４８４では、与えられた音素列を構成する全ての音素に対して、ブレンド率の初期値として１００％を設定する。続くステップ４８６では、図２に示すマッピングテーブル記憶部１３０に記憶されたマッピングテーブルを用い、音素列中の各音素に対し、図３に示す顔画像／Ａ／〜／φ／の中のいずれかを割当て、その顔画像の識別子を音素に付す。こうして割当てられた顔画像が、その音素の開始時点をフレーム時刻とするキーフレームとなる。 In step 484, 100% is set as the initial value of the blend ratio for all phonemes constituting the given phoneme string. In the following step 486, any one of the face images / A / ˜ / φ / shown in FIG. 3 is used for each phoneme in the phoneme string using the mapping table stored in the mapping table storage unit 130 shown in FIG. And the identifier of the face image is attached to the phoneme. The face image assigned in this way becomes a key frame whose frame time is the start time of the phoneme.
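Steps 484 and 486 amount to a table lookup per phoneme. The sketch below shows the idea. The contents of `MAPPING_TABLE` are invented for illustration, since the actual table and the full set of face image identifiers in FIG. 3 are not given in this passage; only the identifiers /A/ and /φ/ are taken from the text, and using /φ/ as the fallback is also an assumption.

```python
# Hypothetical phoneme-to-face-image table; a real table is a design choice.
MAPPING_TABLE = {"a": "A", "i": "I", "u": "U", "m": "N", "sil": "φ"}


def assign_key_frames(phonemes, table=MAPPING_TABLE, default="φ"):
    """Build one key frame per phoneme.

    phonemes is a list of (label, start_time) pairs. Each key frame takes
    the phoneme start time as its frame time (step 486) and starts with a
    blend rate of 100% (step 484).
    """
    return [
        {
            "time": start,
            "phoneme": label,
            "face": table.get(label, default),
            "blend": 1.0,
        }
        for label, start in phonemes
    ]


key_frames = assign_key_frames([("sil", 0.0), ("a", 0.2), ("m", 0.35)])
```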

続いてステップ４８８において、与えられた全ての音素列を調べ、音素の最大継続時間長と最大パワーとを探索する。探索された最大継続時間長をＬ_ＭＡＸ、最大パワーをＰ_ＭＡＸとする。 Subsequently, in step 488, all the given phoneme strings are examined, and the maximum duration time and the maximum power of the phonemes are searched. The searched maximum duration length is L _MAX , and the maximum power is P _MAX .

ステップ４９０では、各音素のブレンド率を、前述した式（１）により更新する。なお、式（１）でＢ（ｎ）はｎ番目の音素のブレンド率を表す。同様に、ステップ４９２では、各音素のブレンド率を、前述した式（２）により更新する。 In step 490, the blend ratio of each phoneme is updated by the above-described equation (1). In Equation (1), B (n) represents the blend rate of the nth phoneme. Similarly, in step 492, the blend ratio of each phoneme is updated by the above-described equation (2).
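Steps 488 to 492 can be sketched as two successive scalings of the blend rate. Equations (1) and (2) appear earlier in the description and are not reproduced here, so the proportional forms used below, B(n) multiplied by L(n)/L_MAX and then by P(n)/P_MAX, are assumptions chosen only because they lower the blend rate for shorter and weaker phonemes. The function name is hypothetical.

```python
def update_blend_rates(durations, powers):
    """Scale an initial 100% blend rate by relative duration and power.

    Assumed stand-ins for equations (1) and (2):
        B(n) <- B(n) * L(n) / L_MAX
        B(n) <- B(n) * P(n) / P_MAX
    """
    l_max = max(durations)  # step 488: maximum duration L_MAX
    p_max = max(powers)     # step 488: maximum power P_MAX
    blend = [1.0] * len(durations)                          # step 484
    blend = [b * d / l_max for b, d in zip(blend, durations)]  # step 490
    blend = [b * p / p_max for b, p in zip(blend, powers)]     # step 492
    return blend


rates = update_blend_rates([0.2, 0.1, 0.2], [4.0, 4.0, 1.0])
```

With these inputs the longest and loudest phoneme keeps its full blend rate, while the other two are reduced in proportion to their duration or power.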

ステップ４９４では、上記したように算出されたブレンド率と、対応の顔画像の識別子と、時間情報とが付された音素列を、キーフレームデータとしてファイルに出力し、キーフレームデータの作成処理を終了する。 In step 494, the phoneme sequence with the blend ratio calculated as described above, the identifier of the corresponding face image, and the time information is output to a file as key frame data, and key frame data creation processing is performed. finish.

図２０は、アニメーション表示用コンピュータ２９６により実行されるアニメーション生成制御処理を実現するコンピュータプログラムの制御構造を示すフローチャートである。図２０を参照して、アニメーション生成制御処理が起動されると、ステップ５００において初期化処理を行ない、ステップ５０２において音声入力用のコンピュータ２９２及び音声認識サーバ２９４からの準備完了通知を待つ。 FIG. 20 is a flowchart showing a control structure of a computer program that realizes animation generation control processing executed by the animation display computer 296. Referring to FIG. 20, when the animation generation control process is activated, an initialization process is performed in step 500, and a preparation completion notification from voice input computer 292 and voice recognition server 294 is waited in step 502.

音声入力用のコンピュータ２９２及び音声認識サーバ２９４から準備完了通知を受信すると、ステップ５０４において音声入力用のコンピュータ２９２に対しアニメーション生成システム２８０の全体が準備完了していることを示す準備完了通知を送信する。続いてステップ５０６で、音声入力用のコンピュータ２９２から録音完了通知を受信するまで待機する。 When the preparation completion notification is received from the voice input computer 292 and the voice recognition server 294, in step 504, a preparation completion notification indicating that the entire animation generation system 280 is ready is transmitted to the voice input computer 292. To do. In step 506, the process waits until a recording completion notification is received from the computer 292 for voice input.
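The wait in step 502 is a simple gate: the system-wide ready notification of step 504 goes out only after both peers have reported in, in whatever order. A minimal sketch follows; the sender names and the `recv` callable are hypothetical.

```python
def wait_until_all_ready(recv, required=("voice_input", "recognition_server")):
    """Block until every required peer has sent its ready notification.

    recv() returns the name of the peer whose ready notification arrived.
    Notifications from other senders and duplicates are ignored.
    """
    pending = set(required)
    while pending:
        pending.discard(recv())
    return True


# The order of arrival does not matter.
arrivals = iter(["recognition_server", "voice_input"])
all_ready = wait_until_all_ready(lambda: next(arrivals))
```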

録音完了通知を受信すると、ステップ５０８において、音声入力用のコンピュータ２９２に対し音声認識サーバ２９４及びアニメーション表示用コンピュータ２９６がアニメーションを作成するための一連の処理を実行開始することを示す処理開始通知を送信する。続いてステップ５１０で、音声入力用のコンピュータ２９２から書起しテキストデータ及び音声データを受信するまで待機し、これらデータを受信するとステップ５１２に進む。 When the recording completion notification is received, in step 508, a process start notification indicating that the voice recognition server 294 and the animation display computer 296 start executing a series of processes for creating an animation to the computer 292 for voice input. Send. Subsequently, in step 510, the process waits until the text data and voice data are received from the voice input computer 292. When these data are received, the process proceeds to step 512.

ステップ５１２では、音声認識サーバ２９４に対し、ステップ５１０で受信した書起しテキストデータ及び音声データを送信し、音素セグメンテーションを依頼する。そしてステップ５１４では、音素セグメンテーションの結果得られるキーフレームデータを音声認識サーバ２９４から受信するまで待機する。キーフレームデータを受信すると、ステップ５２０以下のアニメーション生成のための処理を実行する。 In step 512, the transcription text data and the speech data received in step 510 are transmitted to the speech recognition server 294, and a phoneme segmentation is requested. In step 514, the process waits until key frame data obtained as a result of phoneme segmentation is received from the speech recognition server 294. When the key frame data is received, the processing for animation generation in step 520 and subsequent steps is executed.

ステップ５２０において、本実施の形態では、顔画像のアニメーションの先頭フレームの時刻（生成時刻）ｔとして、音素列の最初の音素の時刻Ｔ_０を選択する。 In step 520, in this embodiment, as the time (generation time) t of the first frame of the animation of the facial image, selects the time T ₀ of the first phoneme of the phoneme sequence.

続いてステップ５２２において、直前のステップで決定されたフレームの生成時刻ｔに対し、Ｔ_ｋ−１≦ｔ＜Ｔ_ｋとなるｋを決定する。ただしＴ_ｋは音素列中のｋ番目の音素の期間の開始時刻を指す。例えばｔ＝Ｔ_０であればＴ_０≦ｔ＜Ｔ_１であるから、ｋ＝１となる。 Subsequently, in step 522, the k that satisfies T_{k-1} ≤ t < T_k is determined for the frame generation time t determined in the immediately preceding step. Here T_k denotes the start time of the period of the k-th phoneme in the phoneme string. For example, if t = T_0, then T_0 ≤ t < T_1, so k = 1.
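Because the key frame times are sorted, the search in step 522 can be done with a binary search. The sketch below uses Python's standard `bisect` module; the function name `find_segment` and the choice to return `None` outside the covered range are assumptions of this example.

```python
from bisect import bisect_right


def find_segment(key_times, t):
    """Return k such that key_times[k-1] <= t < key_times[k].

    key_times is the sorted list [T_0, T_1, ...] of key frame times.
    Returns None when t lies outside [T_0, T_last).
    """
    if not key_times or t < key_times[0] or t >= key_times[-1]:
        return None
    # bisect_right gives the first index whose time is strictly greater than t.
    return bisect_right(key_times, t)


key_times = [0.0, 0.12, 0.30, 0.41]
k_first = find_segment(key_times, 0.0)   # t = T_0 falls in the first segment
k_mid = find_segment(key_times, 0.2)
```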

続いてステップ５２４において、時刻Ｔ_ｋ及びＴ_ｋ−１と、キーフレームＴ_ｋ−１、Ｔ_ｋのブレンド率と、生成時刻ｔと、時刻Ｔ_ｋ及びＴ_ｋ−１での音素に対応する顔画像データとをＧＰＵ３６０に渡し、生成時刻ｔにおける顔画像を補間により生成することを依頼する。 Subsequently, in step 524, times T_k and T_{k-1}, the blend rates of key frames T_{k-1} and T_k, generation time t, and the face image data corresponding to the phonemes at times T_k and T_{k-1} are passed to GPU 360, which is requested to generate the face image at generation time t by interpolation.

これに応答し、ＧＰＵ３６０が実行するプログラムは、生成時刻ｔにおける、キーフレームＴ_ｋ−１のブレンド率から補間演算されるブレンド率α、及びキーフレームＴ_ｋのブレンド率から補間演算されるブレンド率βをそれぞれ前述した補間式（３）により算出し、さらに生成時刻ｔにおける顔画像の各特徴点ベクトルＸ（ｔ）を、キーフレームＴ_ｋ−１及びＴ_ｋにおける顔画像の各特徴点ベクトルＸ（Ｔ_ｋ−１）及びＸ（Ｔ_ｋ）と、α、βとを用い、前述の式（４）によるベクトル加重和によって算出する。ＧＰＵ３６０は、この計算が顔画像の全ての特徴点に対し終了すると、生成された時刻ｔにおける顔画像を出力し、さらに処理終了通知をＣＰＵ３５０に対して送信する。 In response, the program executed by GPU 360 calculates, by interpolation equation (3) described above, blend rate α at generation time t, interpolated from the blend rate of key frame T_{k-1}, and blend rate β, interpolated from the blend rate of key frame T_k. It then calculates each feature point vector X(t) of the face image at generation time t as a weighted vector sum according to equation (4) described above, using the feature point vectors X(T_{k-1}) and X(T_k) of the face images at key frames T_{k-1} and T_k together with α and β. When this calculation has been completed for all feature points of the face image, GPU 360 outputs the generated face image at time t and sends a processing end notification to CPU 350.

図２０を参照して、アニメーション生成制御処理のプログラムは、ステップ５２６でＧＰＵ３６０からの終了通知を受信するまで待ち状態となる。終了通知を受信するとステップ５２８に進む。 Referring to FIG. 20, the animation generation control processing program is in a waiting state until an end notification is received from GPU 360 in step 526. When the end notification is received, the process proceeds to step 528.

ステップ５２８では、タイマ２０２の時刻を読む。この時刻を新たな生成時刻ｔとする。続いてステップ５３０では、生成時刻ｔが、録音の最終時刻よりも後か否かを判定する。生成時刻ｔが録音時刻より後であれば、処理を終了する。さもなければ、この新たな生成時刻ｔにおける顔画像データを生成すべく、ステップ５２２に戻る。 In step 528, the time of the timer 202 is read. This time is set as a new generation time t. Subsequently, in step 530, it is determined whether or not the generation time t is later than the final recording time. If the generation time t is after the recording time, the process is terminated. Otherwise, the process returns to step 522 to generate face image data at this new generation time t.

以下、ステップ５２２〜ステップ５３０の処理を、生成時刻ｔが録音時間より大きくなるまで繰返す。生成時刻ｔが録音時間より大きくなると、ステップ５３２に進む。 Thereafter, the processing from step 522 to step 530 is repeated until the generation time t becomes larger than the recording time. When the generation time t becomes larger than the recording time, the process proceeds to step 532.
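A notable property of the loop in steps 522 to 530 is that the next generation time is read from the timer after each frame finishes, so the number of frames adapts to how long each frame takes to compute and the animation stays aligned with the audio. The sketch below illustrates this. The `render` callback and the injectable `clock` parameter are assumptions added so the loop can be exercised without real hardware.

```python
import time


def run_animation(first_time, end_time, render, clock=time.monotonic):
    """Generate frames until the generation time passes the end of the recording.

    render(t) produces and displays the face image for generation time t.
    Returns the number of frames that were generated.
    """
    start = clock()
    t = first_time              # step 520: begin at the first key frame time
    frames = 0
    while t <= end_time:        # step 530: stop once t exceeds the recording
        render(t)               # steps 522-526: find k, interpolate, output
        frames += 1
        t = first_time + (clock() - start)  # step 528: read the timer
    return frames


# Drive the loop with a fake clock that advances one unit per reading.
ticks = iter(range(100))
rendered = []
count = run_animation(0, 3, rendered.append, clock=lambda: next(ticks))
```

A slow renderer simply yields fewer, more widely spaced frames over the same recording time; no frame is queued or replayed late.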

ステップ５３２では、音声入力用のコンピュータ２９２に対し、音声認識サーバ２９４及びアニメーション表示用コンピュータ２９６における処理が完了したことを示す通知を送信する。音声入力用のコンピュータ２９２はこの通知を図１７のステップ４２８で受信し、これに応答してステップ４０６に戻り、上記した一連の処理が音声入力から繰返される。 In step 532, a notification indicating that the processing in the voice recognition server 294 and the animation display computer 296 is completed is transmitted to the voice input computer 292. The voice input computer 292 receives this notification at step 428 in FIG. 17, and in response to this, returns to step 406, and the above-described series of processing is repeated from the voice input.

図１７にフローチャートで示す制御構造を有するプログラムを音声入力用のコンピュータ２９２で、図１８及び図１９にフローチャートで示す制御構造を有するプログラムを音声認識サーバ２９４で、図２０にフローチャートで示す制御構造を有するプログラムをアニメーション表示用コンピュータ２９６で、それぞれ実行することにより、第１の実施の形態に係るアニメーション生成システム８０と同様の機能を持つアニメーション生成システム２８０を実現することができる。 By executing the program having the control structure shown in the flowchart of FIG. 17 on voice input computer 292, the programs having the control structures shown in the flowcharts of FIGS. 18 and 19 on voice recognition server 294, and the program having the control structure shown in the flowchart of FIG. 20 on animation display computer 296, animation generation system 280 having the same functions as animation generation system 80 according to the first embodiment can be realized.

なお、第１の実施の形態に係るアニメーション生成システム８０をコンピュータで実現する際にも、上記した図１７〜図２０に示す制御構造を有するコンピュータプログラムと同様のプログラムを利用することができる。 Note that, when the animation generation system 80 according to the first embodiment is realized by a computer, a program similar to the computer program having the control structure shown in FIGS. 17 to 20 can be used.

＜動作＞ <Operation>
第２の実施の形態に係るアニメーション生成システム２８０の動作は、第１の実施の形態に係るアニメーション生成システム８０と同様である。従って、ここではその詳細については繰返さない。 The operation of animation generation system 280 according to the second embodiment is the same as that of animation generation system 80 according to the first embodiment. Therefore, details thereof will not be repeated here.

本実施の形態では、各コンピュータに処理を分散させている。そのため、各コンピュータの性能はそれほど高くなくてもよい。また、音声認識サーバ２９４としては高性能なものを準備しておき、複数の音声入力用のコンピュータ２９２とアニメーション表示用コンピュータ２９６との組からのキーフレームデータ作成要求を単一の音声認識サーバ２９４で処理することも可能である。 In this embodiment, processing is distributed to each computer. Therefore, the performance of each computer may not be so high. A high-performance voice recognition server 294 is prepared, and a key frame data creation request from a plurality of voice input computers 292 and animation display computers 296 is sent to a single voice recognition server 294. Can also be processed.

Further, in this embodiment the voice input computer 292 and the animation display computer 296 are separate computers, but the two may be combined and realized by a single computer.

Which phonemes face images are prepared for, and how many face images are prepared, are design matters decided when the animation is produced. Which face image each phoneme is mapped to is likewise a design matter. The phoneme set also differs depending on the target language, so the mapping naturally differs as well.

In the above-described embodiment, phonemes are mapped to face images so that exactly one face image always corresponds to a given phoneme, but this need not be the case. That is, even for the same phoneme, a different face image may be assigned depending on the phonemes that precede and follow it.
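A context-dependent mapping of this kind can be sketched as follows. This is only an illustration under stated assumptions: the phoneme labels, the image identifiers, and the lookup order (most specific context first, then the plain per-phoneme mapping) are hypothetical and are not taken from the embodiment.

```python
# Hypothetical phoneme labels and image identifiers, for illustration only.
DEFAULT_MAP = {"a": "face_a", "i": "face_i", "m": "face_closed"}

# Keys are (previous, current, next); None means "any phoneme" in that slot.
CONTEXT_MAP = {("m", "a", None): "face_a_after_m"}


def select_face(prev, cur, nxt):
    """Return the image identifier for `cur`, preferring a context-specific entry."""
    for key in ((prev, cur, nxt), (prev, cur, None), (None, cur, nxt)):
        if key in CONTEXT_MAP:
            return CONTEXT_MAP[key]
    # No context-specific entry: fall back to one image per phoneme.
    return DEFAULT_MAP[cur]
```

With an empty `CONTEXT_MAP`, the lookup reduces to the one-image-per-phoneme mapping of the embodiment.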

In the above-described embodiment, equations (1) and (2) are used to calculate the blend rate. However, the present invention is not limited to the use of equations (1) and (2). Any function may be used to calculate the blend rate, provided that the blend rate becomes lower as the duration becomes shorter or the power becomes lower, that is, provided that the function is monotonic with respect to duration and power. The blend rate may also be determined taking into account audio feature amounts other than duration and power.
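As a minimal sketch of one admissible function, the following computes a blend rate that is non-decreasing in both duration and power. It does not reproduce equations (1) and (2); the clamped-product form, the function name, and the reference values `full_duration` and `full_power` are assumptions chosen only for illustration.

```python
def blend_rate(duration, power, full_duration=0.1, full_power=1.0):
    """Hypothetical blend rate in [0, 1], non-decreasing in duration and power.

    duration      -- phoneme duration in seconds
    power         -- phoneme power in arbitrary units
    full_duration -- assumed duration at or above which duration no longer limits the rate
    full_power    -- assumed power at or above which power no longer limits the rate
    """
    d = min(max(duration / full_duration, 0.0), 1.0)
    p = min(max(power / full_power, 0.0), 1.0)
    return d * p
```

Any other function with the same monotonic behavior, for example one that also takes further audio features as arguments, would fit the description equally well.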

In the above-described embodiment, an interpolation function corresponding to a linear expression, as shown in FIG. 12, is used. However, the present invention is not limited to such an embodiment. A polynomial of second or higher order in time, or a nonlinear function, may be used as the interpolation function. Any interpolation function may be used, provided that the blend rate is highest at the time corresponding to the key frame and becomes lower with increasing distance from that time. A plurality of interpolation functions may also be prepared so that the user can switch among them.
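The following sketch shows how switchable interpolation functions of this kind might look. The three candidates (linear, cubic polynomial, raised cosine) and all names are assumptions for illustration; each equals 1 at its own key frame and falls to 0 at the adjacent key frame, as the description requires.

```python
import math

# Each function maps u in [0, 1] (0 = own key frame, 1 = adjacent key frame)
# to a weight that is 1 at u = 0 and decreases to 0 at u = 1.
INTERPOLATORS = {
    "linear": lambda u: 1.0 - u,
    "cubic": lambda u: 1.0 - (3.0 * u * u - 2.0 * u ** 3),
    "cosine": lambda u: 0.5 * (1.0 + math.cos(math.pi * u)),
}


def interpolate_blend(t, t0, rate0, t1, rate1, kind="linear"):
    """Blend rates at time t for two adjacent key frames at times t0 < t1.

    rate0 and rate1 are the blend rates stored for the two key frames.
    Returns (first_rate, second_rate) at time t.
    """
    u = (t - t0) / (t1 - t0)
    f = INTERPOLATORS[kind]
    return f(u) * rate0, f(1.0 - u) * rate1
```

The frame image at time `t` would then be a weighted sum of the two key frame images using the returned pair as weights. Letting the user switch functions amounts to changing the `kind` argument.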

In the above embodiment, the key frame is positioned at the start of the duration of each phoneme, but the present invention is not limited to such an embodiment. The key frame may be positioned partway through the duration of each phoneme. The key frame position may also be made freely changeable by the user.

In the above-described embodiment, when the duration or the power of a phoneme in the phoneme string is smaller than a threshold, that phoneme is deleted and its duration is merged into the duration of the immediately preceding phoneme. This makes the change in mouth shape smooth and natural.

However, the present invention is not limited to such an embodiment. For example, only the duration of a phoneme may be considered, or only its power. Alternatively, a phoneme may be deleted only when both its duration and its power are smaller than their respective thresholds. It may also be made possible to switch among these rules.
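The deletion rules described above can be sketched as follows. The data layout, the threshold values, and the mode names are assumptions for illustration. The sketch also assumes that a phoneme at the very start of the string is always kept, because it has no preceding phoneme to merge into; the description does not address that case.

```python
from dataclasses import dataclass, replace


@dataclass
class Phoneme:
    label: str
    duration: float  # seconds
    power: float     # arbitrary units


def should_drop(p, mode, min_duration, min_power):
    """Decide whether phoneme p is thinned out under the selected rule."""
    short = p.duration < min_duration
    weak = p.power < min_power
    if mode == "either":    # rule of the embodiment: duration or power below threshold
        return short or weak
    if mode == "duration":  # duration only
        return short
    if mode == "power":     # power only
        return weak
    if mode == "both":      # both below their thresholds
        return short and weak
    raise ValueError(f"unknown mode: {mode}")


def thin_phonemes(phonemes, mode="either", min_duration=0.03, min_power=0.1):
    """Delete sub-threshold phonemes, merging each duration into the preceding phoneme."""
    kept = []
    for p in phonemes:
        if kept and should_drop(p, mode, min_duration, min_power):
            prev = kept[-1]
            kept[-1] = replace(prev, duration=prev.duration + p.duration)
        else:
            kept.append(p)
    return kept
```

Switching among the rules then amounts to passing a different `mode`, and the total duration of the phoneme string is preserved in every mode.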

Furthermore, in the above-described embodiment, the voice finally played back together with the animation is the user's voice exactly as first recorded. However, the present invention is not limited to such an embodiment. Because the mouth shape is determined mainly by its relationship to the phonemes, the user's voice may be processed in some way as long as the positions of the phonemes are not changed significantly. Even in that case, the finally reproduced voice often retains the characteristics of the user's utterance, and a wider variety of animations can be created.

In the above-described embodiment, a key data file is generated after the user's utterance of the transcription text has been recorded, and once the key data file has been generated, generation of the next face image is started each time the GPU 360 finishes creating a face image. Face images are therefore not generated at a fixed cycle. This allows the GPU 360 to operate at its full performance. However, the present invention is not limited to creating face images in this way.

For example, after the distribution of blend rates has been obtained by interpolation between the key frames as shown in FIG. 12, a series of times at which face images are to be generated may be determined at a fixed frame interval, a face image may be generated for each of those times, and the images may be played back as an animation once all of them have been generated. In this case, a shorter frame interval makes processing take longer, while a longer frame interval may make the motion of the animation awkward.
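This fixed-interval alternative can be sketched as follows: first enumerate the frame times, then find, for each time, the pair of adjacent key frames to interpolate between. The function names and the small rounding tolerance are assumptions for illustration, and `key_times` is assumed to be sorted and to contain at least two entries.

```python
import math
from bisect import bisect_right


def frame_times(total_seconds, frame_interval):
    """Times 0, dt, 2*dt, ... up to total_seconds (small tolerance for float rounding)."""
    count = math.floor(total_seconds / frame_interval + 1e-9) + 1
    return [i * frame_interval for i in range(count)]


def bracketing_keyframes(key_times, t):
    """Indices (i, i + 1) of the adjacent key frames whose times enclose t."""
    i = bisect_right(key_times, t) - 1
    # Clamp so that times before the first or after the last key frame
    # still yield a valid adjacent pair.
    i = max(0, min(i, len(key_times) - 2))
    return i, i + 1
```

The number of frames grows in inverse proportion to `frame_interval`, which is the source of the trade-off noted above between processing time and smoothness of motion.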

The above-described embodiment creates and plays back an animation of a face image in accordance with the user's voice. Because the transcription text of the voice is fixed in advance, phoneme segmentation can be performed with high accuracy even for unspecified users, and a smooth animation can be created.

In the above embodiment, when voice is input, the animation created from it is played back only once, and the system then waits for the next voice input. However, the present invention is not limited to such an embodiment. Once voice has been input and key frame data has been created, the animation can be played back any number of times based on that key frame data. In particular, during such playback, various animations can be generated from the same voice by changing the face images used, changing the interpolation function, or changing the thresholds used for thinning out phonemes. The invention can therefore be used as a tool for creating animation by the so-called prescoring (pre-recording) method.

Furthermore, the above-described embodiment generates an animation of a face image based on voice. However, the present invention is not limited to such an embodiment. It is applicable to anything whose shape changes with speech, provided that its shapes can be mapped to phonemes. For example, an animation of the vocal tract shape, or of the articulatory mechanism, could be created in accordance with the voice.

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing an outline of an animation creation process 30 performed by the animation creating apparatus according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the schematic functional configuration of the animation generation system 80 according to the first embodiment.
FIG. 3 is a diagram showing examples of face images stored in the face data file storage unit 132.
FIG. 4 is a diagram showing part of an example of the mapping table stored in the mapping table storage unit 130.
FIG. 5 is a diagram showing a configuration example of the phoneme string file 160 output by the speech recognition device 120.
FIG. 6 is a block diagram showing details of the configuration of the key frame data creation unit 136 in FIG. 2.
FIG. 7 is a more detailed block diagram of the animation generation unit 138.
FIG. 8 is a diagram showing the relationship between a phoneme string and blend rates.
FIG. 9 is a diagram showing the relationship between a phoneme string and blend rates.
FIG. 10 is a diagram showing the relationship between a phoneme string and blend rates.
FIG. 11 is a diagram showing the relationship between a phoneme string and blend rates.
FIG. 12 is a diagram showing an outline of the interpolation of blend rates.
FIG. 13 is a diagram for explaining the vector weighted sum of face images at a key frame.
FIG. 14 is a diagram for explaining the animation generation control processing performed by the animation generation control unit 200.
FIG. 15 is a block diagram showing the schematic configuration of the animation generation system 280 according to the second embodiment of the present invention.
FIG. 16 is a block diagram showing the hardware configuration of the animation display computer 296.
FIG. 17 is a flowchart showing the control structure of a computer program that causes the voice input computer 292 to operate as the input instruction unit 94 shown in FIG. 2.
FIG. 18 is a flowchart showing the control structure of a computer program executed by the voice recognition server 294.
FIG. 19 is a flowchart showing the control structure of a computer program that realizes the key frame data creation processing.
FIG. 20 is a flowchart showing the control structure of a computer program that realizes the animation generation control processing.

Explanation of symbols

40 speaker (talker), 42 voice signal, 44 script, 60 to 68 face images, 80, 280 animation generation system, 90 text selection interface, 92 microphone, 94 input instruction unit, 96 key frame data creation unit, 98 animation playback unit, 100 speaker (loudspeaker), 102, 310 monitor, 110 text storage unit, 112 text selection unit, 114 voice recording unit, 120 speech recognition device, 130 mapping table storage unit, 132 face data file storage unit, 134 interpolation function storage unit, 136 key frame data creation unit, 138 animation generation unit, 140 audio file storage unit, 142 output unit, 160 phoneme string file, 180 mapping processing unit, 182 duration-based blend rate adjustment unit, 184 power-based blend rate adjustment unit, 200 animation generation control unit, 202 timer, 204 interpolation processing unit, 290 network, 292 voice input computer, 294 voice recognition server, 296 animation display computer, 300 touch panel, 302 microphone

Claims

An animation creating apparatus comprising:
means for receiving a voice signal and creating key frame data representing key frame images, each constituted by an image at a predetermined key frame time within the duration of a phoneme in the phoneme string represented by the voice signal; and
animation generating means for generating, based on the key frame data created by the key frame data creating means, an animation composed of a series of images that change in synchronization with the voice signal.

The animation creating apparatus according to claim 1, wherein the predetermined key frame time is a start time of a duration time of each phoneme in the phoneme string.

The animation creating apparatus, further comprising:
text selection means for allowing a user to select one of a plurality of predetermined types of text; and
means for recording the user's voice in response to selection of a text by the text selection means, converting the voice into a voice signal, and supplying the voice signal, together with the selected text, to the key frame data creation means,
wherein the means for creating the key frame data includes:
mapping data storage means for storing mapping data that maps each phoneme to one of a plurality of predetermined images including a predetermined reference image;
phoneme segmentation means for receiving the voice signal and the text, performing phoneme segmentation on the voice signal based on the text, and outputting phoneme string data including the resulting phoneme string and time information indicating the duration of each phoneme; and
key frame data creating means for creating and outputting key frame data by attaching, to each phoneme included in the phoneme string data output from the phoneme segmentation means, the time information of the phoneme, an identifier identifying the image to which the phoneme is mapped according to the mapping data, and a blend rate determined in accordance with a predetermined feature amount of the phoneme.

The animation creating apparatus according to claim 3, wherein the key frame data creation means includes:
mapping processing means for attaching, to each phoneme included in the phoneme string data output from the phoneme segmentation means, an image identifier obtained by referring to the mapping data and a blend rate set to a predetermined constant, and outputting image-mapped phoneme string data; and
first blend rate adjusting means for adjusting the blend rate of each phoneme in the image-mapped phoneme string data output by the mapping processing means as a monotonically increasing function of the duration of the phoneme.

The animation creating apparatus according to claim 4, wherein the key frame data creation means further includes second blend rate adjusting means for adjusting the blend rate of each phoneme in the blend-rate-adjusted phoneme string data output from the first blend rate adjusting means as a monotonically increasing function of the power level within the duration of the phoneme.

The animation creating apparatus according to any one of the above, wherein the animation generating means includes:
time determination means for determining, in relation to the recording time of the voice, a generation time at which an animation image is to be generated; and
interpolation means for calculating the image of the frame at the generation time determined by the time determination means by interpolation between a plurality of key frame images on either side of the generation time.

The animation creating apparatus according to claim 6, wherein the interpolation means includes means for calculating the image of the frame at the generation time determined by the time determination means by interpolation between the two key frame images adjacent to each other across the generation time.

The animation creating apparatus according to claim 7, wherein the means for calculating includes:
first interpolation means for interpolating a first blend rate at the generation time from the blend rate of the first key frame, of the first and second key frames adjacent to each other across the generation time, using a first interpolation function that is 100% at the first key frame and 0% at the second key frame;
second interpolation means for interpolating a second blend rate at the generation time from the blend rate of the second key frame, using a second interpolation function that is 0% at the first key frame and 100% at the second key frame; and
means for calculating image data at the generation time as a weighted sum, using the first blend rate and the second blend rate, of the image data mapped to the first key frame and the image data mapped to the second key frame.

A computer program that, when executed by a computer, causes the computer to function as each means constituting the animation creation device according to any one of claims 1 to 8.