JP2007034788A - Head motion learning device and head motion composition device for head motion automatic generation, and computer program

Info

Publication number: JP2007034788A
Application number: JP2005218476A
Authority: JP
Inventors: Shinichi Kawamoto (川本真一); Tatsuo Yotsukura (四倉達夫); Satoshi Nakamura (中村哲)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-07-28
Filing date: 2005-07-28
Publication date: 2007-02-08
Anticipated expiration: 2025-07-28
Also published as: JP4599606B2

Abstract

PROBLEM TO BE SOLVED: To provide a head motion learning device and a head motion composition device for automatic head motion generation that can reflect the sensibility of a user.

SOLUTION: This head motion learning device includes: an emotion intensity input part 64 for receiving, from the user, input of an emotion intensity parameter indicating the intensity of a prescribed emotion of an utterance subject during utterance; and a learning part 78 for learning the relation between the prescribed acoustic features together with the emotion intensity parameter and the motion of the head of the utterance subject, from the prescribed acoustic features extracted as a time series from the voice of the utterance subject during each utterance, the emotion intensity parameter input for that utterance through the emotion intensity input part 64, and information showing the motion of the head of the utterance subject during the utterance.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an apparatus that automatically generates the head motion of a synthesized face image in accordance with voice, and in particular to a head motion learning device and a head motion synthesis device that allow a user to freely customize the intensity of the emotion to be expressed by the generated head motion, and to computer programs therefor.

In recent years, the production of animation using characters created with computer graphics (CG) has become popular. Demand for technology supporting such animation production is growing year by year, and further advances in that technology are widely expected.

Animation production requires images that are synchronized with the audio. Usually, the animation images are created first and the voices are dubbed afterwards to match the images. However, creating the images first requires imagining how the story will unfold, and dubbing voices to match existing images requires skill on the part of the voice performer. Alternatively, there is a method in which a character's voice is recorded first according to a script and the character's animation images are generated automatically from that voice. Because the images are generated automatically from the voice, the voice and images are naturally synchronized, and high-quality animation can be produced. In addition, because there is no need to create images while anticipating the character's voice as the story progresses, animation can also be produced efficiently.

Among the image motions that matter in animation, character motion is an important one. Character motion includes facial expressions, gestures, head motion, and the like. These are important elements that help viewers understand the emotion the character is meant to express. For image motion to serve this purpose, it must give sufficient help in understanding the emotion. It is therefore desirable that viewers of the character's motion can naturally understand the emotion the images are meant to express.

As noted above, character motion includes facial expressions, gestures, head motion, and the like. Of these, facial expressions and gestures show relatively large individual differences in the motion corresponding to a given emotion, and also depend heavily on the intention behind the utterance. It is therefore difficult to generate automatically, from a given voice, facial expressions or gestures that are natural for that voice and from which anyone can easily infer the emotion of each individual character. In other words, a given facial expression or gesture cannot necessarily be applied to every character. Head motion, by contrast, can be applied to other characters with less sense of incongruity than other kinds of motion. When automatically generating character motion from voice, it is therefore desirable to generate head motion so that viewers can properly understand the character's emotion.

One way to generate natural head motion automatically from voice is to use parameters learned in advance. For this learning, voice and motion are recorded simultaneously, the data are accumulated, and the relationship between voices and the motions that appropriately accompany them is learned. One method that uses a relationship learned in this way to generate head motion from voice employs a neural network, as shown in Non-Patent Document 1. A neural network is a nonlinear technique and can generate motion that closely resembles actual human head motion.

FIG. 1 shows the prior-art automatic head motion generation system disclosed in Non-Patent Document 1. Referring to FIG. 1, this system includes: a camera 36 for capturing the head motion 30 of a speaker and collecting head motion data for learning; a microphone 34 for recording the speaker's speech 32; a voice storage unit 38 for storing the learning voice recorded by the microphone 34; and a head motion storage unit 40 for storing the head motion data collected by the camera 36. Time information is recorded in both the recorded voice and the head motion data. This time information is shared between the audio recording equipment and the video recording equipment, so the voice and the head images can be put into correspondence.

The system further includes: a voice-head motion synchronization unit 42 for calculating acoustic features from the voice and creating, from the feature data and the head motion data that share the same times, training data for the neural network (referred to as "voice-head motion data"); a voice-head motion data storage unit 44 for storing the voice-head motion data created by the voice-head motion synchronization unit 42; a learning unit 46 for training a predetermined neural network, using the stored voice-head motion data, on the relationship between voice and the head motion synchronized with it; and a neural network parameter storage unit 48 for storing the neural network parameters obtained by that training.

The system further includes: a voice storage unit 52 for storing the voice, recorded in advance according to a script, that is used as the character's voice 54; and a head motion generation unit 50 for generating the character's head motion 56 from the voice stored in the voice storage unit 52, using the neural network parameters stored in the neural network parameter storage unit 48.

In this system, the speaker's speech 32 is recorded by the microphone 34 and stored in the voice storage unit 38. The speaker's head motion data recorded by the camera 36 is stored in the head motion storage unit 40. Both contain common time information.

The voice-head motion synchronization unit 42 generates voice-head motion data for training the predetermined neural network, using those portions of the voice stored in the voice storage unit 38 and of the head motion data stored in the head motion storage unit 40 that share time information. The generated voice-head motion data is stored in the voice-head motion data storage unit 44. Using this data, the learning unit 46 trains the neural network on the relationship between voice and the corresponding head motion. The parameters obtained by this training, which define the behavior of the neural network (the neural network parameters), are stored in the neural network parameter storage unit 48.

Using these neural network parameters, the head motion generation unit 50 automatically generates head motion from voice. Specifically, the character's voice is first recorded according to the script and stored in the voice storage unit 52. The neural network is configured in advance with the parameters stored in the neural network parameter storage unit 48. From the voice stored in the voice storage unit 52, acoustic features of the same kind as those calculated by the voice-head motion synchronization unit 42 during training are calculated. When these acoustic features are given to the neural network as input, the network outputs the head movement (the coordinates of head feature points) corresponding to the input voice. Calculating this value for each frame automatically generates the character's head motion 56.

In this way, the character's voice is first recorded according to the script, and head motion synchronized with that voice is then generated. Because this head motion is generated by a neural network on the basis of learning, it is natural. Viewers can therefore more easily understand the character's emotion expressed by the head motion, and the user can produce animation efficiently.

Non-Patent Document 1: Shinichi Kawamoto, Yoshinori Matsushita, Mitsuru Nakai, Hiroshi Shimodaira, Shigeki Sagayama, "Automatic Generation of Head Behavior during Speech for Anthropomorphic Spoken Dialogue Agents", Journal of the Acoustical Society of Japan, Autumn 2002.

When head motion is generated fully automatically from acoustic features calculated from voice, as in the technique disclosed in Non-Patent Document 1, there is no problem in terms of the efficiency of image generation or the ability to generate natural head motion synchronized with the voice. However, fully automatic generation from acoustic features leaves no room to reflect the sensibility of the user, that is, the animation creator. The character's head motion ends up constrained by the acoustic features of the voice collected for training and of the voice recorded according to the script, which makes it difficult to express the individuality of a wide variety of characters through head motion.

Accordingly, an object of the present invention is to provide a head motion learning device and a head motion synthesis device for automatic head motion generation that can reflect the user's sensibility.

The head motion learning device according to the first aspect of the present invention includes: emotion intensity input means for receiving, from a user, input of an emotion intensity parameter indicating the intensity of a predetermined emotion of an utterance subject at the time of utterance; and learning means for learning the relationship between the predetermined acoustic features together with the emotion intensity parameter and the head movement of the utterance subject, from the predetermined acoustic features extracted as a time series from the voice of the utterance subject for each utterance, the emotion intensity parameter input for that utterance via the emotion intensity input means, and information indicating the head movement of the utterance subject during that utterance.

With this head motion learning device, the user can input an arbitrary emotion intensity parameter when the above relationship is learned. The device can thus learn the relationship among the emotion intensity parameter the user considers appropriate, the acoustic features, and the head motion. After learning, appropriate head motion can be generated for an emotion intensity parameter of any value on the basis of that relationship. A head motion learning device suited to reflecting the user's sensibility in the learning result can therefore be provided.

Preferably, the learning means includes: acoustic feature extraction means for receiving the voice of the utterance subject and extracting acoustic features of the voice at predetermined intervals from the start of the utterance; time information attaching means for attaching, to the acoustic features extracted by the acoustic feature extraction means, information indicating the time from the start of the utterance; head position information acquisition means for acquiring information indicating the position or orientation of the head of the utterance subject during the utterance in association with time; synchronization means for generating training data by synchronizing, for a given utterance, the acoustic features extracted by the acoustic feature extraction means and given time information by the time information attaching means, the emotion intensity parameter input for that utterance by the emotion intensity input means, and the position or orientation of the head acquired by the head position information acquisition means; and means for learning, using the training data generated by the synchronization means, the relationship between the time series of acoustic features together with the emotion intensity parameter and the change in the position or orientation of the head of the utterance subject.
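The per-interval extraction and time stamping described in this paragraph can be sketched as follows. The text does not name the acoustic features at this point, so frame-wise RMS energy is used purely as a stand-in; the function name `frame_energy`, the `hop_ms` parameter, and the millisecond time stamps are illustrative assumptions, not part of the disclosure.

```python
import math
from typing import List, Sequence, Tuple


def frame_energy(samples: Sequence[float], sample_rate: int,
                 hop_ms: int = 10) -> List[Tuple[int, float]]:
    """Return (time_from_start_ms, rms_energy) for each fixed-length frame.

    RMS energy is only a placeholder feature (assumption); any per-frame
    acoustic feature could be substituted.
    """
    hop = sample_rate * hop_ms // 1000  # samples per frame
    if hop <= 0:
        raise ValueError("hop_ms is too small for this sample rate")
    frames = []
    # Step through the signal in non-overlapping frames of `hop` samples.
    for start in range(0, len(samples) - hop + 1, hop):
        window = samples[start:start + hop]
        rms = math.sqrt(sum(x * x for x in window) / hop)
        # Attach the time elapsed since the start of the utterance.
        frames.append((start * 1000 // sample_rate, rms))
    return frames
```

Each returned pair carries its own time stamp, which is what later allows the features to be matched against head data recorded on the same clock.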

With this head motion learning device, time information is attached to the acoustic features extracted from the voice, and these are synchronized with the head position or orientation associated with the time from the start of the utterance and with the emotion intensity input via the emotion intensity input unit to generate training data. From that data, the device learns the relationship between the time series of acoustic features together with the emotion intensity parameter and the change in the position or orientation of the head of the utterance subject. A head motion learning device suited to learning head motion synchronized with voice can therefore be provided.
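A minimal sketch of the synchronization step, assuming that both streams carry integer millisecond time stamps measured from the start of the utterance and that one emotion intensity value applies to the whole utterance. `TrainingSample` and `synchronize` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class TrainingSample:
    time_ms: int                   # time from the start of the utterance
    features: Tuple[float, ...]    # acoustic features for this frame
    emotion: float                 # emotion intensity given for the utterance
    head_pose: Tuple[float, ...]   # head position or orientation at time_ms


def synchronize(feature_frames: Iterable[Tuple[int, Sequence[float]]],
                head_frames: Iterable[Tuple[int, Sequence[float]]],
                emotion: float) -> List[TrainingSample]:
    """Pair acoustic and head frames that share the same time stamp."""
    pose_by_time = dict(head_frames)
    samples = []
    for time_ms, feats in feature_frames:
        pose = pose_by_time.get(time_ms)
        if pose is None:
            continue  # no head data at this time stamp, so skip the frame
        samples.append(
            TrainingSample(time_ms, tuple(feats), emotion, tuple(pose)))
    return samples
```

Frames without a matching partner are dropped here; interpolating the head data instead would be an equally plausible choice that the text does not settle.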

More preferably, the means for learning includes function approximation learning means for learning, by a predetermined nonlinear function approximation and using the training data generated by the synchronization means, the relationship between the time series of predetermined acoustic features tagged with the time from the start of an utterance together with the emotion intensity parameter and the position or orientation of the head of the utterance subject.

With this head motion learning device, the relationship between the time series of acoustic features together with the emotion intensity parameter and the position or orientation of the head of the utterance subject is learned by nonlinear function approximation. Nonlinear changes in head position or orientation that cannot be represented by linear function approximation can therefore be learned. As a result, a head motion learning device suited to learning head motion closer to natural head motion can be provided.

More preferably, the function approximation learning means includes a neural network for learning, with the training data generated by the synchronization means, the relationship between the acoustic features together with the emotion intensity parameter and the position or orientation of the head of the utterance subject.

With this head motion learning device, the relationship between the time series of acoustic features together with the emotion intensity parameter and the position or orientation of the head of the utterance subject is learned by nonlinear function approximation using a neural network. Nonlinear changes in head position or orientation that cannot be learned by linear function approximation without a neural network can therefore be learned. As a result, a head motion learning device suited to learning head motion closer to natural head motion can be provided.
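To make the idea of nonlinear function approximation by a neural network concrete, the following is a small sketch of a one-hidden-layer network trained by stochastic gradient descent. The network size, the tanh activation, the learning rate, and the toy data (a head pitch equal to feature times emotion intensity) are all assumptions chosen for illustration; the text does not specify the network used.

```python
import math
import random


class TinyMLP:
    """One hidden tanh layer, linear output, trained with plain SGD."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr):
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]  # gradient of 0.5*err^2
        # Back-propagate to the hidden layer before touching w2.
        dh = [sum(err[k] * self.w2[k][j] for k in range(len(err)))
              * (1.0 - h[j] ** 2) for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, d in enumerate(dh):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d * xi
            self.b1[j] -= lr * d
        return 0.5 * sum(e * e for e in err)


def train(model, samples, epochs=300, lr=0.1):
    """Return the mean loss of the first and of the last epoch."""
    history = []
    for _ in range(epochs):
        total = sum(model.train_step(x, t, lr) for x, t in samples)
        history.append(total / len(samples))
    return history[0], history[-1]


# Toy data (assumption): input = [acoustic feature, emotion intensity],
# target = [head pitch], with pitch = feature * intensity.
TOY = [([f, e], [f * e]) for f in (0.0, 0.5, 1.0) for e in (0.0, 0.5, 1.0)]
```

The product of feature and intensity is deliberately not a linear function of the inputs, which is the kind of relationship a purely linear model could not represent.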

The head motion synthesis device according to the second aspect of the present invention includes: head position estimation means for estimating, when given a time series of predetermined acoustic features extracted from the voice of an utterance subject during an utterance and an emotion intensity parameter specified for the utterance, the head movement of the utterance subject during that utterance; acoustic feature extraction means for extracting a time series of the predetermined acoustic features from voice; and head motion generation means for giving the time series of predetermined acoustic features extracted by the acoustic feature extraction means and the specified emotion intensity parameter to the head position estimation means, thereby obtaining information on the head motion of the utterance subject, synchronized with the voice, as a series of outputs from the head position estimation means.

With this head motion synthesis device, when voice is input, the head motion of the image is synthesized automatically from the information contained in that voice. A head motion synthesis device suited to efficiently synthesizing natural head motion synchronized with voice can therefore be provided.

Preferably, the head motion synthesis device further includes means for storing the voice in advance and supplying it to the acoustic feature extraction means.

With this head motion synthesis device, voice can be recorded in advance and stored. A head motion synthesis device suited to synthesizing the head motion of an image from previously recorded voice can therefore be provided.

More preferably, the head motion synthesis device further includes means for giving the head motion generation means an emotion intensity parameter corresponding to the emotion intensity input by the user.

With this head motion synthesis device, the user can input any emotion intensity. A head motion synthesis device that appropriately reflects the user's sensibility in the synthesized head motion can therefore be provided.

More preferably, the head position estimation means includes machine learning means that has learned in advance, from a time series of predetermined acoustic features extracted from the voice of an utterance subject during an utterance, an emotion intensity parameter specified for a predetermined emotion of the utterance subject during that utterance, and information indicating the head movement of the utterance subject during that utterance, the relationship between the time series of predetermined acoustic features together with the emotion intensity parameter and the position or orientation of the head of the utterance subject.

With this head motion synthesis device, the machine learning means learns in advance the relationship among the time series of acoustic features, the emotion intensity parameter, and the information indicating the head movement of the utterance subject. The user can therefore input acoustic features and an emotion intensity parameter to the machine learning means and obtain the head motion corresponding to that input.

More preferably, the machine learning means includes function approximation learning means that has learned in advance, by nonlinear function approximation, from a time series of predetermined acoustic features extracted from the voice of an utterance subject during an utterance, an emotion intensity parameter specified for a predetermined emotion of the utterance subject during that utterance, and information indicating the head movement of the utterance subject during that utterance, the relationship between the time series of predetermined acoustic features together with the emotion intensity parameter and the position or orientation of the head of the utterance subject.

With this head motion synthesis device, head motion can be synthesized by nonlinear function approximation. A head motion synthesis device suited to synthesizing nonlinear head motion that cannot be represented by linear function approximation can therefore be provided.

More preferably, the function approximation learning means includes a neural network that has learned in advance, from a time series of predetermined acoustic features extracted from the voice of an utterance subject during an utterance, an emotion intensity parameter specified for a predetermined emotion of the utterance subject during that utterance, and information indicating the head movement of the utterance subject during that utterance, the relationship between the time series of predetermined acoustic features together with the emotion intensity parameter and the position or orientation of the head of the utterance subject.

With this head motion synthesis device, head motion can be synthesized by nonlinear function approximation using a neural network. A head motion synthesis device suited to synthesizing nonlinear head motion that cannot be represented by linear function approximation without a neural network can therefore be provided.

The computer program according to the third aspect of the present invention, when executed by a computer, causes the computer to operate as any of the devices described above. The same effects as those of the head motion learning device or the head motion synthesis device described above can therefore be obtained.

An embodiment of the invention is described below with reference to the drawings. This embodiment relates to an apparatus that generates head motion in a way that reflects not only the acoustic features of the voice but also instructions from the user.

<Configuration>
FIG. 2 is a block diagram of an automatic head motion generation system according to an embodiment of the present invention. Referring to FIG. 2, this system includes: a camera 68 for capturing the head motion 60 of a speaker while the speaker utters the training data and collecting head motion data; a microphone 66 for recording the speaker's speech 62 for the training data; an emotion intensity input unit 64 for receiving input of an emotion intensity specified by the user to indicate explicitly how strong the emotion is; a voice storage unit 70 for storing the voice recorded by the microphone 66 and the emotion intensity parameter input via the emotion intensity input unit 64; and a head motion storage unit 72 for storing the head motion data collected by the camera 68. The voice data stored in the voice storage unit 70 and the head motion data stored in the head motion storage unit 72 are both divided into frames and contain common time information, so the two can be synchronized.

When making utterances for training, the speaker speaks with emotion in a manner corresponding to the emotion intensity input at the emotion intensity input unit 64. A parameter indicating the emotion intensity specified for each utterance is input from the emotion intensity input unit 64. In this embodiment, three levels of emotion intensity are used: "no emotion (normal)", "with emotion", and "with very strong emotion". These are specified by the values 0, 0.5, and 1, respectively, and stored in the voice storage unit 70.
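The three intensity levels and their numeric values can be represented as follows. Only the values 0, 0.5, and 1 and their meanings come from the text; the enum `EmotionIntensity`, its member names, and the helper `attach_emotion` are hypothetical names used for this sketch.

```python
from enum import Enum
from typing import List, Sequence


class EmotionIntensity(Enum):
    NONE = 0.0          # "no emotion (normal)"
    PRESENT = 0.5       # "with emotion"
    VERY_STRONG = 1.0   # "with very strong emotion"


def attach_emotion(features: Sequence[float],
                   level: EmotionIntensity) -> List[float]:
    """Append the intensity value to one frame's acoustic features.

    Placing the intensity last in the input vector is an assumption;
    the text only says both are used as inputs for learning.
    """
    return list(features) + [level.value]
```

For example, `attach_emotion([0.2, 0.4], EmotionIntensity.PRESENT)` yields a three-element input vector whose last entry is 0.5.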

The system further includes: a voice-head motion synchronization unit 74 that calculates predetermined acoustic features from the speech stored in the voice storage unit 70 and generates data for neural network learning (voice-head motion data) by combining them with the head motion data stored in the head motion storage unit 72 and the emotion intensity specified for the utterance and stored in the voice storage unit 70; a voice-head motion data storage unit 76 for storing that learning data; a learning unit 78 that uses the stored voice-head motion data to train a neural network on the relationship among the speech, the corresponding emotion intensity, and the corresponding head motion; and a neural network parameter storage unit 80 for storing the neural network parameters obtained by the learning.

The system further includes a voice storage unit 86 for storing speech recorded in advance, based on a script, by the voice actor assigned to a character in the animation, and an emotion intensity setting unit 84 with which the user enters the emotion intensity parameter for the character's utterance, based on the script and the user's own judgment. As in learning, the emotion intensity parameter set with the emotion intensity setting unit 84 is assumed to be 0, 0.5, or 1.

The system further includes a head motion generation unit 82 that uses the neural network parameters stored in the neural network parameter storage unit 80 to generate the character's head motion 90 from the acoustic features obtained from the speech stored in the voice storage unit 86 and the emotion intensity parameter set with the emotion intensity setting unit 84.

Here, the emotion intensity is a factor for controlling head motion according to the strength of an emotion. This embodiment deals with "anger emotion intensity". For anger, one can consider a range of intensities from the normal state (not angry at all) through anger to rage. As described above, the normal state is represented by an emotion intensity value (EF value) of 0, anger by an EF value of 0.5, and rage by an EF value of 1.0. By choosing the EF value freely, the user can express a wide variety of emotional strengths for the character to suit the situation. For example, an EF value of 0.2 expresses "mild anger". EF values of 1.0 or greater can also be selected, so an EF value such as 1.5 can express "impossibly extreme rage".

The emotion intensity is not limited to anger. For example, a joy emotion intensity ranging from the normal state through joy to delight can also be set.

FIG. 3 is a detailed block diagram of the head motion generation unit 82. Referring to FIG. 3, the head motion generation unit 82 includes: a speech recognition unit 110 that decomposes the input speech signal into phonemes and outputs the acoustic features and phoneme sequence corresponding to the input speech; a speech information storage unit 112 for storing the acoustic features and phoneme sequence output by the speech recognition unit 110; an automatic head motion generation unit 114 that generates head motion parameters from a neural network configured in advance with the parameters stored in a neural network parameter storage unit 116, the acoustic features stored in the speech information storage unit 112, and the emotion intensity set by the user; a head motion parameter storage unit 118 for storing the head motion parameters generated by the automatic head motion generation unit 114; and a head motion synthesis unit 120 that synthesizes the head motion of the image using the head motion parameters stored in the head motion parameter storage unit 118.

FIG. 4 is a block diagram of the automatic head motion generation unit 114 shown in FIG. 3. Referring to FIG. 4, the automatic head motion generation unit 114 includes: a feature extraction unit 176 that extracts, from the acoustic features of each frame stored in the speech information storage unit 112, the fundamental frequency (F0) 170 corresponding to voice pitch, the power 172 corresponding to voice loudness, and the utterance time information 174, and that stores the acoustic features of the most recent 11 frames, including the frame just calculated; and a neural network 180 that receives as input the acoustic features of those 11 frames held in the feature extraction unit 176 and the emotion intensity 178 set by the user via the emotion intensity setting unit 84, and outputs head rotation angles Rx 182, Ry 184, and Rz 186, which describe the character's head motion. The head rotation angles and the utterance time information are described in detail later. The neural network 180 is configured in advance with the neural network parameters stored in the neural network parameter storage unit 80. The neural network 180 can therefore output head motion information that follows the input acoustic features and emotion intensity parameter, in accordance with the learning performed by the learning unit 78 shown in FIG. 2.
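
The input to the neural network 180 can be pictured as a flat vector: 11 frames of three features each, followed by the emotion intensity. The following Python sketch illustrates one possible layout. The class name `FeatureWindow`, the ordering of values within the vector, and the zero padding used before 11 frames are available are assumptions made for illustration; the description does not specify them.

```python
from collections import deque

WINDOW = 11    # most recent frames kept by the feature extraction unit
FEATURES = 3   # F0, power, utterance time information


class FeatureWindow:
    """Holds the latest 11 frames of (F0, power, utterance time)."""

    def __init__(self):
        # A bounded deque drops the oldest frame automatically.
        self.frames = deque(maxlen=WINDOW)

    def push(self, f0, power, utt_time):
        self.frames.append((float(f0), float(power), float(utt_time)))

    def input_vector(self, ef):
        """Return 11 * 3 feature values followed by the emotion intensity."""
        missing = WINDOW - len(self.frames)
        # Assumed behavior: pad with zeros until 11 frames have been seen.
        vector = [0.0] * (missing * FEATURES)
        for frame in self.frames:
            vector.extend(frame)
        vector.append(float(ef))
        return vector
```

With this layout the network input has 34 values per frame, and the EF value can be any number the user chooses, including values above 1.0.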

FIG. 5 shows details of the processing related to neural network learning performed by the learning unit 78 shown in FIG. 2. Referring to FIG. 5, in process 134, learning speech is first recorded based on a learning text 132. The recorded speech is stored in a storage device 136.

Process 142 divides the speech stored in the storage device 136 into phonemes. The phonemes are sent to process 152. They are used not for creating head motion data but for synthesizing the character's mouth movements.

Process 138 divides the speech stored in the storage device 136 into frames of a predetermined frame length at a predetermined shift length. The framed speech is sent to process 140 and process 144.
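
Framing with a fixed frame length and shift length can be sketched as follows. This is a minimal illustration, not the implementation used in the embodiment: the function name `frame_signal` is hypothetical, the description gives no concrete frame or shift length, and dropping a trailing partial frame is an assumption.

```python
def frame_signal(samples, frame_len, shift):
    """Split a sample sequence into (start_index, frame) pairs.

    Consecutive frames start `shift` samples apart and each holds
    `frame_len` samples. A trailing partial frame is dropped (assumption).
    """
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame_len and shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append((start, samples[start:start + frame_len]))
        start += shift
    return frames
```

Keeping the start index with each frame gives every frame a time stamp, which is what later allows the speech frames to be aligned with the head motion data.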

Process 144 detects utterance sections from the frames obtained in process 138. Various methods can be used for this detection. Because the learning text 132 is normally recorded in a studio, utterance sections are easy to distinguish from silent sections. Information identifying the detected utterance sections is sent to process 152.
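
The description leaves the detection method open. As one simple possibility for clean studio recordings, the sketch below marks frames whose power reaches a threshold as speech and groups consecutive speech frames into sections. The function name, the threshold-based rule, and the threshold value are all assumptions for illustration.

```python
def detect_utterance_sections(frame_powers, threshold):
    """Return (first_frame, last_frame) index pairs of detected speech.

    A frame counts as speech when its power is at least `threshold`
    (an assumed rule; the source does not fix a method).
    """
    sections = []
    start = None
    for i, power in enumerate(frame_powers):
        if power >= threshold and start is None:
            start = i                        # speech begins
        elif power < threshold and start is not None:
            sections.append((start, i - 1))  # speech ended on the previous frame
            start = None
    if start is not None:                    # speech runs to the last frame
        sections.append((start, len(frame_powers) - 1))
    return sections
```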

Process 140 calculates acoustic features, including the fundamental frequency and power, from each frame of the framed speech. The calculated acoustic features are sent to process 152.

Next, in process 130, the user sets an arbitrary emotion intensity. The emotion intensity parameter thus set is sent to process 152.

Meanwhile, process 148 collects the speaker's head motion data while the learning text 132 is uttered, by motion capture using a camera 146. The head motion data obtained by motion capture is stored in a storage device 150 and then passed to process 152.

Process 152 refers to the emotion intensity, acoustic features, utterance sections, phonemes, and head motion data, synchronizes the speech with the head motion, and creates the learning data. That is, it creates voice-head motion data for learning from the acoustic features of each frame, the specified emotion intensity parameter, and the head motion data corresponding to that frame. This learning data is passed to a storage device 154.
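
The pairing of speech frames with head motion data can be sketched as below. The record layout, the function names, and the nearest-time matching rule are assumptions for illustration; the description only states that both data streams carry common time information and are combined frame by frame.

```python
from bisect import bisect_left


def nearest_index(times, t):
    """Index of the value in the sorted list `times` closest to `t`."""
    i = bisect_left(times, t)
    if i == 0:
        return 0
    if i == len(times):
        return len(times) - 1
    return i if times[i] - t < t - times[i - 1] else i - 1


def build_learning_data(audio_frames, head_frames, ef):
    """Pair each speech frame with the head pose closest in time.

    audio_frames: list of (time, f0, power, utt_time)
    head_frames:  non-empty list of (time, rx, ry, rz), sorted by time
    ef:           emotion intensity specified for the whole utterance
    """
    head_times = [h[0] for h in head_frames]
    data = []
    for t, f0, power, utt_time in audio_frames:
        _, rx, ry, rz = head_frames[nearest_index(head_times, t)]
        data.append({"input": (f0, power, utt_time, ef),
                     "target": (rx, ry, rz)})
    return data
```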

Training the neural network on the voice-head motion data stored in the storage device 154 yields the neural network parameters used when head motion is automatically generated from speech.

FIG. 6 shows details of the utterance time information. The utterance time information is included as one of the acoustic features in order to represent head motions that are commonly observed during speech and appear to be related to elapsed utterance time, such as raising the head at the start of an utterance and lowering it at the end.

Referring to FIG. 6, the vertical axis shows the utterance time information and the horizontal axis shows the time from the start to the end of the utterance. A value of 0 indicates the start of an utterance or a state in which no utterance is being made. A value of 1 indicates the end of the utterance.
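
A per-frame utterance time value with these end points can be sketched as follows. The text fixes only the values at the start (0), at the end (1), and outside speech (0); the linear rise in between is an assumption, as are the function name and the handling of one-frame sections.

```python
def utterance_time_info(num_frames, sections):
    """Per-frame utterance time information in [0, 1].

    sections: list of (first_frame, last_frame) index pairs, inclusive.
    Frames outside every section get 0. Inside a section the value rises
    from 0 at the first frame to 1 at the last frame (assumed linear).
    """
    info = [0.0] * num_frames
    for first, last in sections:
        span = last - first
        for i in range(first, last + 1):
            # A one-frame section is both start and end; 0 is used here (assumption).
            info[i] = (i - first) / span if span > 0 else 0.0
    return info
```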

FIG. 7 shows details of the head rotation angles. Referring to FIG. 7, as shown in the face image, the head rotation angles Rx 160, Ry 162, and Rz 164 are the rotation angles about the three axes (x, y, and z) of a predetermined three-dimensional coordinate system. Rx is the angle for vertical head movement, such as nodding or looking up. Ry is the angle for tilting the head to the left or right. Rz is the angle for turning the face to the left or right. Combining the rotation angles about these three axes produces a three-dimensional rotation that can represent head motion.
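
Combining the three angles into a single three-dimensional rotation can be sketched with rotation matrices. The composition order (x first, then y, then z), the use of degrees, and a right-handed coordinate system are assumptions for illustration; the description does not state them.

```python
import math


def _matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def head_rotation(rx_deg, ry_deg, rz_deg):
    """3x3 rotation matrix for head angles Rx, Ry, Rz in degrees.

    Assumed order: rotate about x, then y, then z (matrix Rz * Ry * Rx).
    """
    rx, ry, rz = (math.radians(a) for a in (rx_deg, ry_deg, rz_deg))
    mx = [[1, 0, 0],
          [0, math.cos(rx), -math.sin(rx)],
          [0, math.sin(rx), math.cos(rx)]]
    my = [[math.cos(ry), 0, math.sin(ry)],
          [0, 1, 0],
          [-math.sin(ry), 0, math.cos(ry)]]
    mz = [[math.cos(rz), -math.sin(rz), 0],
          [math.sin(rz), math.cos(rz), 0],
          [0, 0, 1]]
    return _matmul(mz, _matmul(my, mx))


def rotate_point(matrix, point):
    """Apply a 3x3 rotation matrix to a 3D point."""
    return tuple(sum(matrix[i][k] * point[k] for k in range(3)) for i in range(3))
```

A renderer would apply the resulting matrix to the vertices of the head model for each frame.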

Although this embodiment describes head orientation (rotation) given by head angles as an example of head motion, it is also possible to synthesize head motion based on head position, that is, translation of the head, and further to synthesize head motion that combines translation and rotation.

<Operation>
The automatic head motion generation system according to this embodiment operates as follows. Its operation has three phases. The first is the learning phase, the second is the phase of recording speech based on a script, and the third is the phase of creating the character's head motion data from the recorded speech. The animation production system as a whole also includes other processing, such as designing characters and creating facial expressions for the animation, but these are unrelated to the present invention and are not described here.

Referring to FIG. 2, during learning, the speaker's speech 62 is recorded by the microphone 66, which is synchronized with the camera 68, and stored in the voice storage unit 70. Meanwhile, the speaker's head motion data recorded by the camera 68 is stored in the head motion storage unit 72. Both sets of data are given common time information. The speaker is instructed to speak in accordance with one of the predetermined emotion intensities (in this embodiment, three levels of anger: "normal", "anger", and "rage"). The user then manually enters the emotion intensity parameter indicating that intensity (0, 0.5, or 1) via the emotion intensity input unit 64. This emotion intensity parameter is stored in the voice storage unit 70 together with the corresponding speech.

The speech and emotion intensity stored in the voice storage unit 70 and the head motion data stored in the head motion storage unit 72 are both divided into frames, and the voice-head motion synchronization unit 74 combines the speech data, emotion intensity parameter, and head motion parameters belonging to frames at the same time into a single data item. The voice-head motion data for learning is created in this way and stored in the voice-head motion data storage unit 76.

Next, using the voice-head motion data stored in the voice-head motion data storage unit 76, the learning unit 78 trains the neural network on the relationship between the speech and emotion intensity and the corresponding head motion. The neural network parameters obtained by this learning are stored in the neural network parameter storage unit 80.
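
The description does not give the network architecture or the training algorithm. As a hedged illustration of what this learning step might look like, the sketch below trains a small network with one tanh hidden layer by stochastic gradient descent on squared error. The class name `TinyMLP`, the hidden layer size, the activation, the learning rate, and the weight initialization are all assumptions.

```python
import math
import random


class TinyMLP:
    """One-hidden-layer network: tanh hidden units, linear outputs."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def _forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hj for w, hj in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def predict(self, x):
        return self._forward(x)[1]

    def train_step(self, x, target, lr=0.05):
        """One gradient step on 0.5 * squared error; returns the loss before the step."""
        h, y = self._forward(x)
        err = [yk - tk for yk, tk in zip(y, target)]
        # Back-propagate to the hidden layer before the output weights change.
        dh = [(1.0 - hj * hj) * sum(err[k] * self.w2[k][j] for k in range(len(err)))
              for j, hj in enumerate(h)]
        for k, ek in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * ek * hj
            self.b2[k] -= lr * ek
        for j, dj in enumerate(dh):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dj * xi
            self.b1[j] -= lr * dj
        return 0.5 * sum(e * e for e in err)
```

For the system described here, the input would hold the acoustic features of 11 frames plus the emotion intensity and the output would hold Rx, Ry, and Rz. The weights and biases after training correspond to the neural network parameters that are stored for later generation.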

When the script is recorded, a voice actor speaks the lines of a given character while reading the script. This speech is recorded and stored in the voice storage unit 86 shown in FIG. 2.

When head motion is created, the head motion generation unit 82 automatically generates head motion from speech using the neural network parameters obtained during learning. Specifically, the neural network is first configured with the parameters stored in the neural network parameter storage unit 80. The user then sets an emotion intensity parameter of 0, 0.5, or 1 for each utterance using the emotion intensity setting unit 84. The head motion generation unit 82 extracts the acoustic features (F0, power, and utterance time information) from the speech read from the voice storage unit 86, builds the input to the neural network from those features and the emotion intensity parameter set by the user for the utterance, and supplies it to the network. In response to this input, the neural network automatically generates the character's head motion 90.

The operation of the head motion generation unit 82 is described in detail with reference to FIG. 3. First, the speech data stored in the voice storage unit 86 is supplied to the speech recognition unit 110, which decomposes it into phonemes and outputs a phoneme sequence annotated with acoustic features. The acoustic features and phoneme sequence are stored in the speech information storage unit 112.

Of the speech information stored in the speech information storage unit 112, the acoustic features are output to the automatic head motion generation unit 114 to generate head motion. The phoneme sequence is used to synthesize the character's mouth movements.

Meanwhile, the emotion intensity parameter set by the user with the emotion intensity setting unit 84 is also supplied to the automatic head motion generation unit 114.

The automatic head motion generation unit 114 creates the input data for the neural network from the acoustic features supplied from the speech information storage unit 112 and the emotion intensity set with the emotion intensity setting unit 84. The most recent 11 frames of this input data are retained. The input data for these 11 frames is supplied as input to the neural network, which in response outputs the corresponding head motion parameters. The specific conversion performed by the automatic head motion generation unit 114 is described later.

The head motion parameters output from the automatic head motion generation unit 114 are stored in the head motion parameter storage unit 118 shown in FIG. 3. The head motion synthesis unit 120 synthesizes the head motion of the image using the head motion parameters stored in the head motion parameter storage unit 118.

The automatic head motion generation unit 114 operates as follows.

Referring to FIG. 4, the feature extraction unit 176 first extracts the voice pitch 170, voice loudness 172, and utterance time information 174 from the information on the acoustic features of each frame output from the speech information storage unit 112. This information is held in the feature extraction unit 176 for the most recent 11 frames. The features for these 11 frames and the emotion intensity set arbitrarily by the user with the emotion intensity setting unit 84 are supplied to the neural network 180. In response, the neural network 180 outputs the head motion parameters (head rotation angles Rx 182, Ry 184, and Rz 186) used to synthesize head motion. The neural network 180 is configured in advance with the neural network parameters stored in the neural network parameter storage unit 80. In accordance with the learning result, these parameters define the relationship between the input (speech) and the output (head motion parameters) of the neural network 180. The output head motion parameters are stored in the head motion parameter storage unit 118.

<Specific example of head motion generation>
FIG. 8 shows a specific example of how head motion changes when various emotion intensities are input.

Referring to FIG. 8, the vertical axis shows the value of Rx, the head rotation angle representing vertical head movement, and the horizontal axis shows the time from the start of the utterance (utterance time information) in units of 1/1000 second. The English sentence written in the graph is the utterance used in this example. This example uses anger as the emotion intensity, varied from 0.0 (normal state) to 1.0 (rage) in steps of 0.2.

As FIG. 8 shows, changing the emotion intensity changes the vertical head movement shown by the waveforms. As the emotion intensity increases, that is, as it approaches "rage", the vertical movement of the head between 0.75 s and 1.3 s becomes larger. In other words, the head in the image moves more vigorously.

In FIG. 8, the value of Rx, particularly around 1 second, changes nonlinearly with the emotion intensity value. This is characteristic of nonlinear transformations such as a neural network, and a speaker's actual head motion exhibits this kind of nonlinearity. If head motion is instead generated with a linear method, this nonlinearity in vertical movement is lost and only monotonous results are obtained, in which the magnitude of the motion changes merely linearly with emotion intensity.
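
The difference between a linear and a nonlinear mapping can be illustrated with two toy models. Both functions below are invented for this illustration and are not taken from the embodiment. They are evaluated over the same EF sweep as FIG. 8 (0.0 to 1.0 in steps of 0.2), and the change in output per EF step is compared.

```python
import math


def linear_model(feature, ef):
    """Toy linear mapping (hypothetical coefficients)."""
    return 2.0 * feature + 10.0 * ef


def nonlinear_model(feature, ef):
    """Toy nonlinear mapping with a tanh unit (hypothetical coefficients)."""
    return 10.0 * math.tanh(3.0 * ef * feature)


def ef_sweep(model, feature, steps=5):
    """Model output for EF = 0, 1/steps, ..., 1 at a fixed acoustic feature."""
    return [model(feature, k / steps) for k in range(steps + 1)]


def increments(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]
```

For the linear model every EF step changes the output by the same amount, so the motion only scales. For the tanh model the change per step varies with EF, which is the kind of behavior attributed to the neural network here.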

<Experiment for performance evaluation>
To evaluate the effectiveness of the head motion generation device according to this embodiment, FIG. 9 compares the actual head rotation angle Rx, taken from part of the data used for learning, with the head rotation angle Rx generated by the system according to this embodiment. In FIG. 9, the vertical axis shows the head rotation angle Rx and the horizontal axis shows the time from the start of the utterance in units of 1/1000 second. The emotion intensity used in this evaluation experiment is anger.

As FIG. 9 shows, in the normal state (top panel of FIG. 9: emotion intensity EF = 0.0), anger (middle panel: EF = 0.5), and rage (bottom panel: EF = 1.0) alike, the waveform of the rotation angle Rx of the actual head motion and that of the head motion generated by the system according to this embodiment are very similar.
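
The comparison in FIG. 9 is visual, and no numerical measure is reported. If one wanted to quantify the similarity of two Rx waveforms, root-mean-square error and the Pearson correlation coefficient are two common choices. The sketch below is offered only as an assumption about how such a check could be done; it is not part of the reported experiment.

```python
import math


def rmse(actual, generated):
    """Root-mean-square error between two equal-length angle sequences."""
    if len(actual) != len(generated) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((a - g) ** 2 for a, g in zip(actual, generated)) / len(actual))


def pearson(actual, generated):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    if len(actual) != len(generated) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_a = sum(actual) / len(actual)
    mean_g = sum(generated) / len(generated)
    cov = sum((a - mean_a) * (g - mean_g) for a, g in zip(actual, generated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_g = sum((g - mean_g) ** 2 for g in generated)
    if var_a == 0.0 or var_g == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_g)
```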

This similarity of waveforms shows that the head rotation angles generated from the acoustic features of the learning speech prepared in advance and the emotion intensity set by the user resemble, at every emotion intensity, the head rotation angles of the actual head motion used for learning. From this result, natural head motion synchronized with the speech can be expected even when the user sets the emotion intensity arbitrarily. Thus, when head motion is generated from speech using a neural network, having the user input an arbitrary emotion intensity makes it possible to generate natural head motion that reflects the user's sensibility.

As described above, using an emotion intensity that the user can set arbitrarily when automatically generating head motion from speech makes it possible to generate natural head motion from speech according to that emotion intensity. Generating head motion in this way allows head motion that reflects the user's sensibility to be produced efficiently.

The embodiment disclosed here is merely illustrative, and the present invention is not limited to it. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a diagram showing a prior-art automatic head motion generation system.
FIG. 2 is a diagram showing the automatic head motion generation system according to the present invention.
FIG. 3 is a diagram showing details of the head motion generation unit.
FIG. 4 is a diagram showing details of the automatic head motion generation unit.
FIG. 5 is a diagram showing details of the processing for the neural network learning method.
FIG. 6 is a diagram showing details of the utterance time information.
FIG. 7 is a diagram showing details of the head rotation angles.
FIG. 8 is a diagram showing a specific example of head motion that changes as various emotion intensities are input.
FIG. 9 is a diagram showing the results of the performance evaluation of the present invention.

Explanation of symbols

64 emotion intensity input unit
74 voice-head motion synchronization unit
78 learning unit
84 emotion intensity setting unit
86 voice storage unit
114 automatic head motion generation unit
148 motion capture
176 feature extraction unit
180 neural network

Claims

A head motion learning device that learns, by machine learning, the relationship between an utterance and the head motion of the utterance subject accompanying the utterance, comprising:
emotion intensity input means for receiving input of an emotion intensity parameter indicating the intensity of a predetermined emotion of the utterance subject at the time of utterance; and
learning means for learning the relationship between a predetermined acoustic feature and the emotion intensity parameter, and the head motion of the utterance subject, from the predetermined acoustic feature extracted as a time series from the speech of the utterance subject at each utterance, the emotion intensity parameter input via the emotion intensity input means for that utterance, and information indicating the head motion of the utterance subject during the utterance.

The head motion learning device according to claim 1, wherein the learning means includes:
acoustic feature extraction means for receiving the speech of the utterance subject at the time of utterance and extracting the acoustic feature of the speech at predetermined time intervals from the start of the utterance;
time information attaching means for attaching, to the acoustic feature extracted by the acoustic feature extraction means, information indicating the time from the start of the utterance;
head position information acquisition means for acquiring information indicating the position or orientation of the head of the utterance subject during the utterance, in association with time;
synchronization means for generating learning data by synchronizing, for a given utterance, the acoustic feature extracted by the acoustic feature extraction means and the time information from the start of the utterance attached by the time information attaching means, the emotion intensity parameter input via the emotion intensity input means for that utterance, and the position or orientation of the head of the utterance subject acquired by the head position information acquisition means; and
means for learning, using the learning data generated by the synchronization means, the relationship between the time series of the acoustic feature and the emotion intensity parameter, and the change in the position or orientation of the head of the utterance subject.

The head motion learning device according to claim 2, wherein the means for learning includes function approximation learning means for learning, by a predetermined nonlinear function approximation using the learning data generated by the synchronization means, the relationship between the time series of the predetermined acoustic feature to which the time from the start of the given utterance is attached and the emotion intensity parameter, and the position or orientation of the head of the utterance subject.

The head motion learning device according to claim 3, wherein the function approximation learning means includes a neural network that learns, using the learning data generated by the synchronization means, the relationship between the acoustic feature quantity and the emotion intensity parameter, on the one hand, and the position or orientation of the head of the utterance subject, on the other.
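As a rough illustration of nonlinear function approximation by a neural network, the following Python sketch trains a one-hidden-layer network that maps an acoustic feature vector plus an emotion intensity parameter to a head pose vector. The class name TinyMLP, the helper make_input, the tanh hidden layer, the squared-error loss, and the learning rate are all assumptions chosen for brevity; the claims do not specify the network architecture or training procedure.

```python
import math
import random

def make_input(features, emotion_intensity):
    """Concatenate the acoustic features with the emotion intensity parameter."""
    return list(features) + [emotion_intensity]

class TinyMLP:
    """One-hidden-layer network: input -> tanh hidden layer -> linear output."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.05):
        """One stochastic gradient step on 0.5 * squared error; returns the loss."""
        h, y = self.forward(x)
        err = [yi - ti for yi, ti in zip(y, target)]
        # Backpropagate to the hidden layer before the output weights change.
        grad_h = [sum(err[k] * self.w2[k][j] for k in range(len(err))) * (1 - h[j] ** 2)
                  for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, g in enumerate(grad_h):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g
        return 0.5 * sum(e * e for e in err)
```

In use, each training sample would pair make_input(features, emotion_intensity) with the measured head position or orientation at the same time, and train_step would be called repeatedly over the learning data.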

A head motion synthesis device for synthesizing, from the voice uttered during an utterance, the motion of the head of an image of an utterance subject accompanying the utterance, the device comprising:
head position estimation means for estimating, when given a time series of a predetermined acoustic feature quantity extracted from the voice of the utterance subject during an utterance and an emotion intensity parameter designated for that utterance, the head motion of the utterance subject at the time the voice is uttered;
acoustic feature extraction means for extracting the time series of the predetermined acoustic feature quantity from a voice; and
head motion generation means for obtaining information on the head motion of the utterance subject synchronized with the voice, as a series of outputs from the head position estimation means, by providing the head position estimation means with the time series of the predetermined acoustic feature quantity extracted by the acoustic feature extraction means and the designated emotion intensity parameter.
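The data flow of the synthesis device can be sketched in Python as follows. This is an illustration under stated assumptions, not the specification's implementation: frame RMS energy stands in for the unspecified "predetermined acoustic feature quantity", toy_estimator is a hypothetical placeholder for a trained head position estimation means, and all function names are invented here.

```python
import math

def extract_features(samples, frame_len):
    """Stand-in acoustic feature: RMS energy of each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        feats.append([math.sqrt(sum(s * s for s in frame) / frame_len)])
    return feats

def generate_head_motion(samples, frame_len, sample_rate, emotion_intensity, estimator):
    """Feed each frame plus the emotion intensity to the estimator.

    Returns a list of (time in seconds, head pose) pairs synchronized with the voice.
    """
    motion = []
    for i, feat in enumerate(extract_features(samples, frame_len)):
        t = i * frame_len / sample_rate
        pose = estimator(feat + [emotion_intensity])
        motion.append((t, pose))
    return motion

def toy_estimator(x):
    """Hypothetical estimator: nod angle grows with energy and emotion intensity."""
    energy, emotion = x
    return [emotion * energy]

# Example: two frames of four samples each at 8 samples per second.
motion = generate_head_motion([1.0] * 4 + [0.0] * 4, 4, 8, 0.5, toy_estimator)
```

Because the emotion intensity parameter is an explicit input to the estimator, the same voice can yield different head motion simply by changing that one value.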

The head motion synthesis device according to claim 5, further comprising means for storing a voice in advance and supplying the voice to the acoustic feature extraction means.

The head motion synthesis device according to claim 5 or 6, further comprising means for providing the head motion generation means with an emotion intensity parameter corresponding to an emotion intensity input by a user.

The head motion synthesis device according to any one of claims 5 to 7, wherein the head position estimation means includes machine learning means that has learned in advance, from a time series of the predetermined acoustic feature quantity extracted from the voice of the utterance subject during an utterance, an emotion intensity parameter designated for a predetermined emotion of the utterance subject during the utterance, and information indicating the motion of the head of the utterance subject during the utterance, the relationship between the time series of the predetermined acoustic feature quantity and the emotion intensity parameter, on the one hand, and the position or orientation of the head of the utterance subject, on the other.

The head motion synthesis device according to claim 8, wherein the machine learning means includes function approximation learning means that has learned in advance, by nonlinear function approximation, from a time series of the predetermined acoustic feature quantity extracted from the voice of the utterance subject during an utterance, an emotion intensity parameter designated for the predetermined emotion of the utterance subject during the utterance, and information indicating the motion of the head of the utterance subject during the utterance, the relationship between the time series of the predetermined acoustic feature quantity and the emotion intensity parameter, on the one hand, and the position or orientation of the head of the utterance subject, on the other.

The head motion synthesis device according to claim 9, wherein the function approximation learning means includes a neural network that has been trained in advance, from a time series of the predetermined acoustic feature quantity extracted from the voice of the utterance subject during an utterance, an emotion intensity parameter designated for the predetermined emotion of the utterance subject during the utterance, and information indicating the motion of the head of the utterance subject during the utterance, on the relationship between the time series of the predetermined acoustic feature quantity and the emotion intensity parameter, on the one hand, and the position or orientation of the head of the utterance subject, on the other.

A computer program that, when executed by a computer, causes the computer to operate as the device according to any one of claims 1 to 10.