KR100813034B1

KR100813034B1 - Method for formulating character

Info

Publication number: KR100813034B1
Application number: KR1020060124010A
Authority: KR
Inventors: 김예진; 채원석; 이범렬; 양광호
Original assignee: 한국전자통신연구원
Priority date: 2006-12-07
Filing date: 2006-12-07
Publication date: 2008-03-14

Abstract

A method for forming an animated character is provided that yields more natural and smoother lip-sync animation while reducing the labor required to produce facial animation. Character data is received (S100). The face model of the character is divided into a region used mainly for pronunciation and a region used mainly for facial expression, and the pronunciation region is animated to form the character's lip-sync animation (S110). The expression region is then animated to form a facial expression animation on the character for which the lip-sync animation has been formed (S120, S130). The character data includes voice information and facial expression information for the character.

Description

Method for formulating character

FIGS. 1 and 2 are flowcharts of an embodiment of the character forming method according to the present invention;

FIG. 3 is an overall flowchart of a TTVS system;

FIG. 4 shows the phonemes of Korean vowels and consonants represented by 11 viseme models;

FIG. 5 shows the phonemes of English vowels and consonants represented by 13 viseme models;

FIG. 6 shows an example of Catmull-Rom spline interpolation;

FIG. 7 shows the weight function used to modify a viseme model;

FIG. 8 shows the blending of a lip-sync animation with an emotional expression;

FIG. 9 shows expression models representing emotions and the interface for blending between emotional expressions;

FIG. 10 shows the quantification of an emotional expression input;

FIG. 11 shows the primary, secondary, and final pronunciation importance and the expression importance; and

FIG. 12 shows the combination of pronunciation and expression when the pronunciation importance is high.

The present invention relates to a character forming method, and more particularly to lip-sync animation and facial expression animation of a character.

Various approaches to realistic facial animation have been proposed, depending on whether the target face is a 2D image or a 3D mesh model.

For lip-sync animation of 2D face images, a method has been proposed that analyzes previously recorded video, cuts it into short video sequences, one for each combination of three consecutive phonemes, and reconnects the sequences to fit a new voice track.

Because this method is sample-based, it can produce results very close to reality, but it must store a large amount of data and therefore requires a very large library.

Another proposed method for lip-sync animation uses 2D face images as examples: a person subjectively selects the face image that serves as the example model for each phoneme, and the intermediate mouth shapes needed for animation are generated by image morphing.

However, because this method builds the animation from only as many mouth-shape images as there are visemes, it has been criticized for being unable to properly express the co-articulation effect, in which the mouth shape for the same sound varies with the preceding and following sounds.

For lip-sync animation of 3D mesh models, the representative method takes the content to be spoken as script text, generates a time sequence of phonemes, and interprets that information to produce a lip-sync animation synchronized with the voice.

This method can generate a lip-sync animation synchronized with the voice, but no method has been proposed for naturally adding emotion to the lip-sync animation to make the facial expression more realistic. In particular, most existing lip-sync animation methods do not take into account the co-articulation effect, in which the mouth shape for the same sound varies with the preceding and following sounds, and they do not offer an effective way to show a particular emotional expression on the face model without significantly disturbing the shape around the mouth for the sound currently being pronounced.

For the lip-sync animation of a virtual character to look more natural, a method is needed for combining the lip-sync animation generated by a TTVS system with a facial expression animation that conveys emotion.

The present invention has been proposed to solve the above problems, and provides a method of generating Korean and English lip-sync animation and naturally reflecting emotional facial expressions in the production of 3D facial animation. To minimize the user's manual work and to produce realistic results in real time, lip-sync animation is generated automatically from the given voice and phoneme information: the phoneme information is obtained from an existing TTS (Text-To-Speech) system or from a voice file, and a user interface is provided for the expression parameters, which change continuously over time. In particular, an importance-based approach is proposed for combining the lip-sync animation with emotional facial expressions.

The present invention is intended to solve the problems described above. An object of the present invention is to provide a character forming method that can generate, in real time, a lip-sync animation showing the character's emotions, and that expresses emotions that change continuously over time.

Another object of the present invention is to form a more natural and smoother lip-sync animation.

Still another object of the present invention is to provide a lip-sync animation of a character that shows realistic emotional expressions.

Still another object of the present invention is to minimize the manual work a user spends producing facial animation, and to maximize the accuracy of synchronization with the voice, the naturalness of the animation, and real-time performance.

To achieve the above objects, the present invention provides a character forming method comprising: receiving character data; forming a lip-sync animation of the character according to the data; and forming a facial expression animation on the character for which the lip-sync animation has been formed.

According to another embodiment of the present invention, there is provided a character forming apparatus comprising: an input unit for inputting character data; a lip-sync animation forming unit for forming a lip-sync animation of the character; and a facial expression animation forming unit for forming a facial expression animation on the character for which the lip-sync animation has been formed.

Preferred embodiments of the present invention that can achieve the above objects are described below with reference to the accompanying drawings.

Components identical to those of the prior art are given the same names and reference numerals for convenience, and their detailed description is omitted.

In the present invention, when a user or the like inputs character information, an animation with facial expressions is formed according to the input information.

FIGS. 1 and 2 are flowcharts of an embodiment of the character forming method according to the present invention. The embodiment is described below with reference to FIGS. 1 and 2.

As shown in FIG. 1, voice and facial expression data for the character are input (S100), and then the lip-sync animation (S110) and the facial expression animation (S120) of the character are formed in turn. The lip-sync animation and the facial expression animation are then preferably synchronized (S130).

In FIG. 2, for character information input as script text, a lip-sync animation is generated by analyzing either the voice synthesized by a TTS system or the like, or an external voice file; the emotional expression is then animated and combined with the character on which the lip-sync animation has been formed. Here, the character's face model is divided into a region used mainly for pronunciation and a region used mainly for expression: the lip-sync animation animates the pronunciation region, and the facial expression animation animates the expression region.

This process is described in detail below.

The goal of lip-sync animation is to generate natural, realistic mouth shapes for a given voice track. The input voice is either a recording of a real person speaking or speech synthesized by a TTS system. Generating a lip-sync animation requires the phoneme sequence of the content to be spoken and the duration of each phoneme. When a real human voice is used, a phonetic alignment system may be needed to analyze the recorded speech signal and generate the phoneme information. A TTS system has the drawback that its speech sounds less natural than a real human voice, but the phoneme information can be obtained directly from it, so a more accurate lip-sync animation can be produced.

The present invention proposes a method of processing the phoneme information needed to generate a lip-sync animation, on the assumption that the phoneme sequence and duration information for the input voice are given. A system that takes script text from the user as input and generates a lip-sync animation is called a TTVS (text-to-visual speech) system. As shown in FIG. 3, the TTVS system performs natural language processing on the given script text to generate various phoneme information, from which a speech processor synthesizes the voice and an image processor synthesizes the video. The generated voice and video are preferably synchronized. The TTS system not only generates the synthesized voice but also provides the phoneme sequence and the duration of each phoneme over time, and this phoneme information makes it possible to synchronize the synthesized voice with the video.

A phoneme is the smallest distinguishable unit of pronunciation. Speaking amounts to pronouncing a continuous series of phonemes in order. To carry out this process on a computer, lip-sync animation lines up the mouth shapes corresponding to the phonemes in the order they are pronounced and connects them smoothly to produce continuous motion. The present invention proposes classifying phonemes into groups pronounced with the same mouth shape and producing a corresponding 3D face example model for each group, in order to generate a natural, realistic lip-sync animation synchronized with the given voice.

First, we describe how the phoneme example models are classified and constructed for Korean and English. Lip-sync for a character in another language can be classified and constructed along the same lines as for Korean and English.

To produce a lip-sync animation, the present invention uses key-framing: for a given phoneme sequence, the mouth shape for each phoneme is treated as a key frame, and the frames in between are filled in smoothly. This approach rests on the assumption that the appearance of a speaking person can be fully reproduced with a finite number of viseme models. Here, a "viseme" generally refers to a mouth shape that can be visually distinguished from others when a person pronounces a sound. A viseme model is therefore a face model of a virtual character, in the form of a 3D mesh, whose mouth shape is visibly distinct from the others during pronunciation.

Under the assumption that every speaking appearance can be generated from a finite number of visemes, one applicable approach is to define a one-to-one correspondence between phonemes and visemes and to create a viseme model for each phoneme. In this scheme, once the TTS system has generated the phoneme sequence and phoneme duration information for the input text, the viseme models are placed at the corresponding points in time and connected smoothly to generate the lip-sync animation.

First, because the mouth shape of a monophthong plays an important role in its pronunciation, each monophthong is given its own mouth shape. In general, however, several phonemes can be mapped to a single viseme. For example, Korean 'ㅁ' and 'ㅂ', or English 'o' and 'w', produce different sounds with similar mouth shapes. Consonants thus tend to produce several different sounds with similar mouth shapes, and a diphthong is a combination involving a semivowel, which cannot be used on its own, or other monophthongs. Therefore, consonants are classified into several groups that share a mouth shape, each mapped to one viseme, and a diphthong is animated by splitting it into the visemes of two monophthongs.

Table 1 groups Korean phonemes with similar mouth shapes into their corresponding visemes, separately for vowels and consonants, and Table 2 does the same for English vowels and consonants.

| Phoneme class | Subclass | Phonemes | Viseme |
|---|---|---|---|
| Vowels | Monophthongs | ㅏ, ㅐ, ㅓ, ㅔ, ㅗ, ㅜ, ㅡ, ㅣ, ㅚ, ㅟ | ㅏ, ㅐ, ㅓ, ㅔ, ㅗ, ㅜ, ㅡ, ㅣ |
| | Diphthongs | ㅑ, ㅒ, ㅕ, ㅖ, ㅛ, ㅠ, ㅢ, ㅙ, ㅞ, ㅘ, ㅝ | |
| Consonants | Velar | ㄱ, ㄲ, ㅋ, ㅇ | ㄱ |
| | Glottal | ㅎ | |
| | Alveolar | ㄷ, ㄸ, ㅌ, ㅅ, ㅆ, ㄴ, ㄹ | ㄷ |
| | Palatal | ㅈ, ㅉ, ㅊ | |
| | Bilabial | ㅂ, ㅃ, ㅍ, ㅁ | ㅁ |

FIG. 4 shows Korean, which consists of 40 phonemes in total, divided into 11 vowel and consonant viseme models, and FIG. 5 shows English, which consists of 41 phonemes in total, divided into 13 vowel and consonant viseme models.
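The grouping of Korean phonemes into visemes can be sketched as a simple lookup table. The sketch below is only illustrative: the names `VISEME_GROUPS`, `MONOPHTHONG_VISEMES`, `build_viseme_lookup`, and `to_visemes` are hypothetical, and it covers only the rows of Table 1 for which a viseme is listed. Phonemes whose viseme is not given there (diphthongs, ㅚ, ㅟ, ㅎ, and the palatals) are left unmapped rather than guessed.

```python
# Consonant groups copied from the rows of Table 1 that list a viseme.
# Rows without a listed viseme are intentionally omitted (assumption:
# their mapping is defined elsewhere in the original figures).
VISEME_GROUPS = {
    "ㄱ": ["ㄱ", "ㄲ", "ㅋ", "ㅇ"],                    # velar
    "ㄷ": ["ㄷ", "ㄸ", "ㅌ", "ㅅ", "ㅆ", "ㄴ", "ㄹ"],  # alveolar
    "ㅁ": ["ㅂ", "ㅃ", "ㅍ", "ㅁ"],                    # bilabial
}

# Monophthongs that have their own viseme in Table 1.
MONOPHTHONG_VISEMES = ["ㅏ", "ㅐ", "ㅓ", "ㅔ", "ㅗ", "ㅜ", "ㅡ", "ㅣ"]


def build_viseme_lookup():
    """Build a phoneme -> viseme dictionary from the groups above."""
    lookup = {v: v for v in MONOPHTHONG_VISEMES}
    for viseme, phonemes in VISEME_GROUPS.items():
        for phoneme in phonemes:
            lookup[phoneme] = viseme
    return lookup


def to_visemes(phonemes, lookup):
    """Map a phoneme sequence to visemes; unmapped phonemes become None."""
    return [lookup.get(phoneme) for phoneme in phonemes]
```

With this table the lookup yields 8 vowel visemes plus 3 consonant visemes, which agrees with the 11 viseme models mentioned for FIG. 4.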

Next, we describe how a lip-sync animation is generated from the face example models corresponding to each phoneme example model.

When the TTS system generates the phoneme sequence and phoneme duration information for the input text, the visemes are arranged according to this sequence and these durations, and are connected by key-frame animation.

The present invention uses Catmull-Rom spline interpolation to generate a smooth lip-sync animation. Given a sequence of points in a plane or in space, a Catmull-Rom spline joins consecutive points with cubic curves so that the result is C1-continuous; the derivative at each point equals the slope of the straight line connecting its two neighboring points.

FIG. 6 shows an example of Catmull-Rom spline interpolation. The set of vertices $V^p$ of viseme model $p$ is defined as follows.

$$V^p = \{V_1^p, V_2^p, \ldots, V_n^p\}$$

$$V_i^p = \{x_i^p, y_i^p, z_i^p\}$$

The phoneme sequence information $P$ and the phoneme duration information $L$ obtained from the TTS system are as follows.

$$P = \{p_1, p_2, \ldots, p_m\}$$

$$L = \{\ell_1, \ell_2, \ldots, \ell_m\}$$

The time at which each viseme model must be placed is then calculated from the phoneme duration information $L$. Assuming that each viseme model is placed at the midpoint of the interval in which its phoneme is pronounced, the position $t_j$ of viseme model $p_j$ on the time axis is

$$t_j = \sum_{k=1}^{j-1} \ell_k + \frac{\ell_j}{2}$$

The resulting pronunciation time information $T$ of the phonemes can be written as follows.

$$T = \{t_1, t_2, \ldots, t_m\}$$
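As a small illustration of the midpoint assumption, the key-frame times can be computed from the phoneme durations with a running sum. The function name `viseme_key_times` is hypothetical, and durations are assumed to be given in seconds starting from time 0.

```python
def viseme_key_times(durations):
    """Return the time of each viseme key frame.

    Each viseme is placed at the midpoint of its phoneme's interval,
    so t_j = (sum of earlier durations) + (own duration) / 2.
    """
    times = []
    elapsed = 0.0  # start time of the current phoneme
    for duration in durations:
        times.append(elapsed + duration / 2.0)
        elapsed += duration
    return times
```

For example, durations of 0.5, 0.25, and 1.0 seconds give key-frame times of 0.25, 0.625, and 1.25 seconds.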

Using Catmull-Rom spline interpolation, the x-coordinate of the i-th vertex of the face model at any time $t$ with $t_j \le t < t_{j+1}$ can be expressed as follows.

$$x_i(t) = a t^3 + b t^2 + c t + d$$

Here, $a$, $b$, $c$, and $d$ are obtained from the following four conditions.

$$x_i(t_j) = x_i^{p_j}, \qquad x_i(t_{j+1}) = x_i^{p_{j+1}}$$

$$x_i'(t_j) = \frac{x_i^{p_{j+1}} - x_i^{p_{j-1}}}{t_{j+1} - t_{j-1}}, \qquad x_i'(t_{j+1}) = \frac{x_i^{p_{j+2}} - x_i^{p_j}}{t_{j+2} - t_j}$$

Here, $x_i^{p_j}$ is the x-coordinate of the i-th vertex of viseme model $p_j$ in the phoneme sequence information $P$. At time $t_j$ the face of the virtual character is viseme model $p_j$, so the x-coordinate $x_i(t_j)$ of the i-th vertex of the face model equals $x_i^{p_j}$. Likewise, at time $t_{j+1}$ the x-coordinate $x_i(t_{j+1})$ equals $x_i^{p_{j+1}}$, the x-coordinate of the i-th vertex of viseme model $p_{j+1}$.

Because Catmull-Rom spline interpolation is used, the instantaneous rate of change of $x_i$ with respect to time at $t_j$, $x_i'(t_j)$, equals the average rate of change over time from $x_i^{p_{j-1}}$ to $x_i^{p_{j+1}}$. Likewise, the instantaneous rate of change at $t_{j+1}$, $x_i'(t_{j+1})$, equals the average rate of change over time from $x_i^{p_j}$ to $x_i^{p_{j+2}}$. These four conditions determine the four unknowns $a$, $b$, $c$, and $d$. The same method is applied to the y- and z-coordinates.
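The four conditions (two endpoint values and two endpoint slopes) define a cubic Hermite segment, so one way to evaluate the curve without solving for a, b, c, and d explicitly is to use the Hermite basis functions. The following is a minimal sketch for one coordinate of one vertex. The function name `catmull_rom` is hypothetical, and the one-sided slope used at the first and last key frame is an assumption, since the text does not say how the ends of the sequence are handled.

```python
def catmull_rom(times, values, t):
    """Evaluate a non-uniform Catmull-Rom curve through (times[k], values[k]).

    Requires at least two key frames with strictly increasing times.
    """
    n = len(times)
    # Find the segment j with times[j] <= t < times[j + 1], clamped to range.
    j = 0
    while j < n - 2 and t >= times[j + 1]:
        j += 1

    def slope(k):
        # Average rate of change between the two neighbours of key frame k.
        # At the ends a one-sided difference is used (assumption).
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        return (values[hi] - values[lo]) / (times[hi] - times[lo])

    h = times[j + 1] - times[j]
    s = (t - times[j]) / h          # normalised position within the segment
    m0, m1 = slope(j), slope(j + 1)

    # Cubic Hermite basis functions.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * values[j] + h10 * h * m0 + h01 * values[j + 1] + h11 * h * m1
```

Calling the function once per coordinate and per vertex, with `times` set to the key-frame times T and `values` set to that coordinate across the viseme sequence, gives the interpolated face at time t.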

If the viseme models produced by an artist are used without any processing, the mouth motion in the resulting animation is not smooth. Speech involves frequent movements at very short intervals, and an animation that passes through every viseme model in such short times looks jerky. Moreover, a real person pronounces even the same phoneme with a slightly different mouth shape depending on the type and duration of the preceding and following phonemes. This must also be taken into account when generating the lip-sync animation of a virtual character.

The present invention proposes a method of modifying the mouth shape of the phoneme currently being pronounced by taking into account the mouth shapes of the preceding and following phonemes, according to the duration of each phoneme.

The position of the i-th vertex $V_i^{p_j}$ of the j-th viseme model $p_j$ in the phoneme sequence information $P$ obtained from the voice file is modified according to a value called weight. As weight varies between 0 and 1, the modified position of $V_i^{p_j}$ moves along the line segment from the i-th vertex of viseme model $p_{j-1}$ to the original vertex itself. The purpose is to make weight smaller for phonemes with shorter durations, so that the animation does not reach the position of the corresponding viseme model before moving on toward the viseme model of the next phoneme. The weight could be assumed to vary linearly with the phoneme duration, but for smoother interpolation it is computed by a function defined as a cubic polynomial of the duration.
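The description of the modified vertex as a point on the segment between the previous viseme's vertex and the current one suggests a linear blend controlled by weight. The sketch below illustrates that reading. The exact cubic weight function is not reproduced in this text, so `smoothstep_weight` uses the standard smoothstep polynomial $3s^2 - 2s^3$ purely as a stand-in, and the parameter `full_duration` (the duration at which a viseme is fully reached) is a hypothetical name and value.

```python
def smoothstep_weight(duration, full_duration):
    """Stand-in cubic weight (assumption, not the original function).

    Returns 0 for a zero-length phoneme and 1 once the phoneme lasts at
    least `full_duration`, with an ease-in/ease-out cubic in between.
    """
    s = min(max(duration / full_duration, 0.0), 1.0)
    return 3 * s**2 - 2 * s**3


def modify_vertex(prev_vertex, vertex, weight):
    """Move `vertex` toward `prev_vertex`: weight 1 keeps it, weight 0
    collapses it onto the previous viseme's vertex."""
    return tuple(p + weight * (v - p) for p, v in zip(prev_vertex, vertex))
```

A short phoneme then produces a key frame that only partly reaches its viseme, for example `modify_vertex((0, 0, 0), (2, 4, 6), smoothstep_weight(1.0, 2.0))` lands halfway along the segment.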

A person moves the mouth and the area around it by tensing muscles. Observing the speed of this movement shows that, in the transition from the mouth shape of the current phoneme to that of the next, the speed first increases and then decreases. The area around the mouth therefore changes slowly at first, then faster, and then slowly again. As a result, a phoneme that must be pronounced in a very short time does not reach its corresponding viseme model, whereas one with a sufficiently long duration does. To express this behavior, weight is approximated so that it varies according to the cubic function described above. FIG. 7 shows how weight changes with phoneme duration. Modifying the viseme models according to phoneme duration in this way and producing the lip-sync animation with Catmull-Rom interpolation gives natural, smooth results.

Next, the blending of the lip-sync animation with the emotion animation is described.

When people talk, they show their current emotions on their faces as they speak. For the lip-sync animation of a virtual character to look more natural, a method is needed for combining the lip-sync animation generated by the TTVS system with a facial expression animation that conveys emotion. For example, as in FIG. 8, it should be possible to combine a face pronouncing Korean '아' or English 'a' with an angry expression to produce an angry face pronouncing Korean '아' or English 'a'. We propose a method of combining expressions with the lip-sync animation that exploits the fact that the importance of pronunciation and of expression differs from one part of the face to another.

First, we define the expression example models to be blended with the lip-sync animation and propose an emotion input interface designed to make it convenient for the user to create a variety of expressions. We then propose the importance-based approach needed to blend emotion into the lip-sync animation.

This section describes how to create the expression example models that represent emotions, and an emotion input interface that makes it easy to apply them to the lip-sync animation. First, expression models are prepared in advance for representative emotions to be blended with the lip-sync animation, such as joy, sadness, surprise, fear, and anger, and a variety of expressions are generated by blending among them. Joy, sadness, and so on are merely examples of how emotions may be classified. That is, the facial expression animation is formed by dividing the character's face into predetermined regions and animating each region independently according to the input character data. Because the lips of the character's face mainly convey emotion-related expression, the facial expression animation can be formed using only the lip-related portion of the input character data.

FIG. 9 shows expression example models for these representative emotions. At any given moment, the expression depends on the degree of each emotional state, such as joy, sadness, surprise, fear, and anger. By quantifying the emotional state, the current expression can be represented as a departure from the neutral face according to the degree of each emotion. For example, taking the emotion value of each expression model to be 1, a face that is 0.5 happy and 0.5 sad is generated by adding to the neutral face model half of the per-vertex displacement between the joy model and the neutral model, and half of the per-vertex displacement between the sadness model and the neutral model. The expression can further be adjusted as the emotion changes by scaling the per-vertex displacement between the generated expression model and the neutral model.

For arbitrary expressions e1 and e2, where the ratio of the emotion degrees of the two expressions is α:(1 - α) and the intensity of the emotion measured from the expressionless face is β, the position of the i-th vertex of the new expression model is calculated as follows.

$$V_i = V^N_i + \beta\left[\alpha\,(V^{e1}_i - V^N_i) + (1-\alpha)\,(V^{e2}_i - V^N_i)\right]$$

Here V^N_i, V^e1_i, and V^e2_i are the positions of the i-th vertex of the expression models corresponding to the expressionless face, expression e1, and expression e2, respectively.
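The per-vertex blend can be illustrated with a minimal Python sketch. It assumes one reading of the text: the new vertex is the expressionless vertex plus β times the α-weighted mix of the two displacement vectors, which reproduces the 0.5 joy / 0.5 sadness example when α = 0.5 and β = 1. The function name `blend_expression` and the one-vertex models are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def blend_expression(neutral: List[Vec3], expr1: List[Vec3], expr2: List[Vec3],
                     alpha: float, beta: float) -> List[Vec3]:
    """Blend two expression models relative to the expressionless model.

    Assumed rule (a reading of the text, not a confirmed formula):
    V_i = N_i + beta * (alpha * (E1_i - N_i) + (1 - alpha) * (E2_i - N_i))
    """
    if not (len(neutral) == len(expr1) == len(expr2)):
        raise ValueError("all models must have the same number of vertices")
    blended = []
    for n, a, b in zip(neutral, expr1, expr2):
        # Apply the rule to each coordinate of the vertex.
        vertex = tuple(
            n_c + beta * (alpha * (a_c - n_c) + (1.0 - alpha) * (b_c - n_c))
            for n_c, a_c, b_c in zip(n, a, b)
        )
        blended.append(vertex)
    return blended


# Hypothetical one-vertex models, for illustration only.
neutral = [(0.0, 0.0, 0.0)]
joy = [(2.0, 0.0, 0.0)]
sad = [(0.0, 4.0, 0.0)]
half_half = blend_expression(neutral, joy, sad, alpha=0.5, beta=1.0)
```

With these inputs the blended vertex sits at half of each displacement, and setting β to 0 returns the expressionless model unchanged.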

Next, a user interface for easily blending expressions is described. The input required from the user is, for each point in time, the expressions e1 and e2 to be blended, the emotion degree α, and the emotion intensity β measured from the expressionless face. Because it is difficult for the user to supply this information for every frame, it is obtained only for a few key frames and the frames in between are filled in by interpolation. To generate smooth animation, cubic spline interpolation is used so that C2 continuity is satisfied.
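Filling in the frames between key frames can be sketched with a natural cubic spline, which is one C2-continuous cubic spline. The text only requires C2 continuity, so the natural end conditions (zero second derivative at the first and last key frame) are an assumption, as are the function name `natural_cubic_spline` and the sample key-frame values. The sketch interpolates one scalar channel; a two-dimensional interface point would be handled by interpolating each coordinate separately.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def natural_cubic_spline(ts: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Return a function evaluating the natural cubic spline through (ts[k], ys[k])."""
    ts = list(ts)
    n = len(ts)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two key frames with one value each")
    h = [ts[k + 1] - ts[k] for k in range(n - 1)]
    if any(step <= 0 for step in h):
        raise ValueError("key-frame times must be strictly increasing")
    # m[k] is the second derivative at key frame k; the ends stay at zero (assumption).
    m = [0.0] * n
    size = n - 2
    if size > 0:
        diag = [2.0 * (h[k] + h[k + 1]) for k in range(size)]
        rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1] - (ys[k + 1] - ys[k]) / h[k])
               for k in range(size)]
        # Thomas algorithm: forward elimination on the tridiagonal system.
        for k in range(1, size):
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        # Back substitution for the interior second derivatives.
        m[size] = rhs[size - 1] / diag[size - 1]
        for k in range(size - 2, -1, -1):
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(t: float) -> float:
        # Locate the segment containing t, clamping to the first or last segment.
        k = min(max(bisect_right(ts, t) - 1, 0), n - 2)
        a, b, hk = ts[k + 1] - t, t - ts[k], h[k]
        return ((m[k] * a ** 3 + m[k + 1] * b ** 3) / (6.0 * hk)
                + (ys[k] / hk - m[k] * hk / 6.0) * a
                + (ys[k + 1] / hk - m[k + 1] * hk / 6.0) * b)

    return evaluate


# Hypothetical key frames: emotion intensity beta at frames 0, 10, 20 and 30.
beta_curve = natural_cubic_spline([0, 10, 20, 30], [0.0, 1.0, 0.5, 0.0])
in_between = [beta_curve(frame) for frame in range(31)]
```

A cubic spline can overshoot the key-frame values, so a caller that needs β to stay within a fixed range might clamp the interpolated result.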

FIG. 9 shows the interface for generating expression animation. A circle of radius 1, centered on the white dot that represents the expressionless face, is divided into five equal parts, and the five expressions of joy, sadness, surprise, fear, and anger are placed on it. More generally, for n expressions the circle is divided into n equal parts so that every expression lies on the circumference. A point inside the circle represents the facial expression to be newly generated, and all the information needed to generate the new expression model is obtained from this point.

As shown in FIG. 10, the expression models e1 and e2 are determined by which of the five sectors contains the given point, and the ratio of the central angles of the two sectors into which that sector is divided by the straight line connecting the center and the given point determines α:(1 - α). The distance between the given point and the center of the circle determines β. When the user marks the expression models for several key frames through this interface, a curve is drawn through them by cubic spline interpolation, and the positions of the points on the curve are used to generate an expression animation with C2 continuity.
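The mapping from an interface point to the blending inputs can be sketched as follows. Several details are assumptions not fixed by the text: the first expression is placed at angle 0 and the rest follow counterclockwise, α is oriented so that a point lying on the e1 direction gives α = 1, and β is clamped to the unit circle. The function name `interface_point_to_params` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def interface_point_to_params(x: float, y: float,
                              expressions: Sequence[str]) -> Tuple[str, str, float, float]:
    """Map a point inside the unit circle to (e1, e2, alpha, beta)."""
    n = len(expressions)
    if n < 2:
        raise ValueError("at least two expressions are needed")
    sector = 2.0 * math.pi / n
    # The distance from the center gives the emotion intensity.
    beta = min(math.hypot(x, y), 1.0)
    # Angle of the point in [0, 2*pi); the containing sector selects e1 and e2.
    angle = math.atan2(y, x) % (2.0 * math.pi)
    index = int(angle // sector)
    offset = angle - index * sector
    # Assumed orientation: the closer the point is to e1, the larger alpha.
    alpha = min(max(1.0 - offset / sector, 0.0), 1.0)
    e1 = expressions[index % n]
    e2 = expressions[(index + 1) % n]
    return e1, e2, alpha, beta


emotions = ["joy", "sadness", "surprise", "fear", "anger"]
# A point halfway between the "sadness" and "surprise" directions, at radius 0.8.
theta = 3.0 * math.pi / 5.0
params = interface_point_to_params(0.8 * math.cos(theta), 0.8 * math.sin(theta), emotions)
```

Here `params` selects sadness and surprise with α close to 0.5 and β close to 0.8, up to floating-point rounding.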

Next, the importance-based approach required to blend emotion into the lip-sync animation is described.

When a person is not speaking, emotion is expressed through the whole face, including the forehead, eyes, eyebrows, cheeks, and mouth. While speaking, however, the lower part of the face (the mouth and its surroundings) is used mainly for pronunciation, and the upper part of the face, such as the forehead, eyes, and eyebrows, is used to show the emotional expression. Therefore, to generate a face that speaks while showing an expression from a face that speaks with no expression and a face that shows the expression for some emotion, each part of the face must be combined differently according to its importance for pronunciation and for expression. This section proposes a method that analyzes the pronunciation models and expression models to define, for each vertex of the face model, its importance for pronunciation and for expression, and a method that generates a new face combining pronunciation and expression based on this importance.

For example, parts such as the lower lip and the jaw, which move more for pronunciation than other parts of the face, have high importance for pronunciation. A part such as the upper lip moves less for pronunciation than the lower lip or the jaw, but it also has high importance for pronunciation. Accordingly, the more a vertex's position changes for pronunciation relative to the non-speaking face, the higher its importance for pronunciation. In addition, vertices located close to a part with large movement have high importance for pronunciation even if their own movement during pronunciation is not large. The same logic is applied to the expressions for emotions such as joy, sadness, surprise, fear, and anger to determine the importance for expression. However, the mouth and its surroundings move a great deal not only during pronunciation but also when expressing emotion. Because a part that moves a lot for pronunciation does not necessarily move little for expression, the movements for pronunciation and for expression must be observed together to determine the importance.

When a person is not speaking, emotion is expressed across the whole face, but while speaking, the parts that move for pronunciation, such as the mouth and its surroundings, are governed by the pronunciation movement rather than the expression movement. Since combining emotion with a lip-sync animation means generating a facial animation that shows an expression while speaking, the importance of each vertex of the face model for pronunciation and for expression is determined by first considering the magnitude of the pronunciation movement and then adjusting for the magnitude of the expression movement. For a part with large pronunciation movement, the pronunciation importance is raised further when the expression movement is small and reduced when the expression movement is large. For a part with small pronunciation movement, the pronunciation importance is lowered further when the expression movement is large and increased when the expression movement is small. On this basis, the final importance for pronunciation and for expression is calculated in three steps. In the first step, the importance is calculated from the magnitude of the movement. In the second step, the importance of the vertices found to have high importance is propagated to the vertices around them. In the final step, the final importance is determined by considering the pronunciation importance and expression importance obtained above.

For each vertex Vi of the face model, the pronunciation importance and expression importance calculated in the first step are called the primary pronunciation importance p1(Vi) and the primary expression importance e1(Vi), respectively. Those calculated in the second step are called the secondary pronunciation importance p2(Vi) and the secondary expression importance e2(Vi), and those calculated in the last step are called the final pronunciation importance p3(Vi) and the final expression importance e3(Vi). The primary pronunciation importance and primary expression importance of each vertex Vi are as follows.

The primary pronunciation importance of the i-th vertex of the face model is determined by the largest of the distances from the i-th vertex of the non-speaking face model to the i-th vertex of each viseme model. This value is divided by the largest such value over all vertices so that the importance lies between 0 and 1. The primary expression importance is calculated in the same way.
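The first step can be sketched in Python directly from this description: take, for each vertex, the largest distance to the same vertex in any target model, then normalize by the largest such value over all vertices. The function name `primary_importance` and the tiny two-vertex models are hypothetical, and returning all zeros when nothing moves is an added assumption.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def primary_importance(rest: Sequence[Vec3], targets: Sequence[Sequence[Vec3]]) -> List[float]:
    """Normalized largest displacement of each vertex over all target models."""
    if not targets:
        raise ValueError("at least one target model is needed")
    # Largest distance from the rest vertex to the same vertex in any target model.
    largest_move = [
        max(math.dist(model[i], rest[i]) for model in targets)
        for i in range(len(rest))
    ]
    overall = max(largest_move)
    if overall == 0.0:
        # Assumption: if no vertex moves at all, every importance is zero.
        return [0.0] * len(rest)
    # Divide by the overall maximum so that every value lies between 0 and 1.
    return [move / overall for move in largest_move]


# Hypothetical data: a two-vertex rest face and two viseme models.
rest_face = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
visemes = [
    [(0.0, 2.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 1.0, 0.0)],
]
p1 = primary_importance(rest_face, visemes)
```

Passing the expression example models instead of the viseme models would give the primary expression importance in the same way.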

The secondary pronunciation importance and secondary expression importance of each vertex Vi are as follows.

Here, Lp and Le are the threshold values that determine which points count as adjacent for pronunciation and for expression, respectively, and S1 is the threshold value for the primary pronunciation importance. Lp, Le, and S1 are specified by the user. In effect, each vertex is assigned as its secondary pronunciation importance the largest importance among the nearby vertices whose primary pronunciation importance exceeds the threshold S1; if there is no such vertex, it keeps its primary pronunciation importance. The secondary expression importance of each vertex is calculated in the same way. Taking the secondary pronunciation importance and secondary expression importance into account, the final importance for pronunciation and for expression is defined as follows.

Here, S2 is the threshold value for the secondary pronunciation importance and is specified by the user. The definition is chosen so that, for a vertex with high secondary pronunciation importance, the final pronunciation importance is raised further the lower its secondary expression importance is, and, for a vertex with low secondary pronunciation importance, the final expression importance is raised further the higher its secondary expression importance is.
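The propagation performed in the second step (before the final combination with S2) can be sketched as below. The sketch assumes that adjacency is measured as Euclidean distance between vertex positions in the non-speaking face, that "nearby" means strictly closer than the adjacency threshold, and that "higher than S1" is a strict comparison; the function name `secondary_importance` and the sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def secondary_importance(positions: Sequence[Vec3], primary: Sequence[float],
                         radius: float, threshold: float) -> List[float]:
    """Propagate high primary importance to nearby vertices.

    radius plays the role of Lp (or Le) and threshold the role of S1.
    """
    result = []
    for i, v in enumerate(positions):
        # Importance values of nearby vertices that exceed the threshold.
        candidates = [
            primary[j]
            for j, w in enumerate(positions)
            if math.dist(v, w) < radius and primary[j] > threshold
        ]
        # Take the largest nearby value, or keep the primary value if none qualifies.
        result.append(max(candidates) if candidates else primary[i])
    return result


# Hypothetical data: three vertices on a line with their primary importance.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
p1_values = [1.0, 0.2, 0.1]
p2_values = secondary_importance(points, p1_values, radius=2.0, threshold=0.5)
```

The second vertex inherits the importance of its strongly moving neighbor, while the distant third vertex keeps its own value.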

FIG. 11 shows the primary, secondary, and final pronunciation importance and expression importance. Brighter regions indicate higher importance. To combine the lip-sync animation with the emotional expression, the final position of each vertex of the face model is determined by considering the final expression importance and the final pronunciation importance. A vertex with high pronunciation importance should preserve its pronunciation movement as much as possible while still responding to the expression movement. Likewise, a vertex with low pronunciation importance has high expression importance, so it should preserve its expression movement as much as possible while still responding to the pronunciation movement.

The position of the i-th vertex Vi of the face that combines pronunciation and expression is determined by the position of the i-th vertex of the face obtained from the lip-sync animation, the position of the i-th vertex of the face obtained from the expression animation, and the final pronunciation importance p3(Vi) of the i-th vertex. Taking the i-th vertex of the expressionless model as the reference point, the vector giving the position of the lip-sync vertex (the pronunciation vector) and the vector giving the position of the expression-animation vertex (the expression vector) determine a plane in space. The final position of the i-th vertex must therefore lie on that plane.

In addition, when the pronunciation importance is high, the position of the lip-sync vertex is maintained in proportion to the importance so that the mouth shape for the pronunciation appears, while the expression is also reflected by taking into account the magnitude and direction of the expression vector (the vector from the expressionless vertex to the expression-animation vertex), as shown in FIG. 12. Let proj_p⊥E and proj_p∥E denote the components of the expression vector that are perpendicular and parallel, respectively, to the pronunciation vector (the vector from the expressionless vertex to the lip-sync vertex). If the parallel component proj_p∥E were allowed to affect the pronunciation vector, the amount by which the vertex moves in its pronunciation direction would change and the mouth shape for the pronunciation would be broken. Therefore, the final position is determined by considering only the pronunciation vector and the perpendicular component proj_p⊥E of the expression vector. When the pronunciation importance is low and the expression importance is high, the roles are reversed to determine the final position. In summary, the final position of each vertex Vi of the face model is as follows.

As a result, a vertex with high pronunciation importance moves, by an amount that depends on its importance, in the direction of the component of the expression movement perpendicular to the pronunciation movement, so that the pronunciation movement is reflected as fully as possible while the expression movement is also retained. A corresponding effect is obtained when the pronunciation importance is low.
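The decomposition of the expression movement into components parallel and perpendicular to the pronunciation movement can be sketched as follows. Only the decomposition itself follows from the description; the rule in `combine_speech_dominant`, which keeps the lip-sync position and adds a caller-chosen fraction `weight` of the perpendicular component, is a hypothetical stand-in for the final-position formula, which is not reproduced in this text. All function names are illustrative.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def split_components(e: Vec3, p: Vec3) -> Tuple[Vec3, Vec3]:
    """Return the components of e parallel and perpendicular to p."""
    p_squared = sum(c * c for c in p)
    if p_squared == 0.0:
        # With no pronunciation movement, all of e counts as perpendicular.
        return (0.0, 0.0, 0.0), e
    scale = sum(a * b for a, b in zip(e, p)) / p_squared
    parallel = tuple(scale * c for c in p)
    perpendicular = tuple(a - b for a, b in zip(e, parallel))
    return parallel, perpendicular


def combine_speech_dominant(neutral: Vec3, lip: Vec3, expr: Vec3, weight: float) -> Vec3:
    """Hypothetical rule for a vertex with high pronunciation importance."""
    p = tuple(a - n for a, n in zip(lip, neutral))   # pronunciation vector
    e = tuple(a - n for a, n in zip(expr, neutral))  # expression vector
    _, perpendicular = split_components(e, p)
    # Keep the lip-sync position and add part of the perpendicular expression movement.
    return tuple(a + weight * q for a, q in zip(lip, perpendicular))


# Hypothetical vertex: the jaw drops for speech while the expression pulls it sideways and down.
final = combine_speech_dominant(
    neutral=(0.0, 0.0, 0.0), lip=(0.0, -2.0, 0.0), expr=(1.0, -1.0, 0.0), weight=0.5
)
```

In this example the vertex keeps its full downward pronunciation movement and gains half of the sideways expression movement, and the result stays in the plane spanned by the two vectors.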

To form the character as described above, it is preferable to provide an input unit 100, a lip-sync animation forming unit 110, and an expression animation forming unit 120, as shown in FIG. 13. A synchronization unit 130 that synchronizes the lip-sync animation with the expression animation is also needed; the function of each unit is as described above. The character data may be a live voice recording, or it may be voice data generated by a TTS system.

The present invention is not limited to the embodiments described above. As can be seen from the appended claims, it may be modified by a person of ordinary skill in the art to which the invention pertains, and such modifications fall within the scope of the present invention.

The effects of the character forming method and apparatus according to the present invention described above are as follows.

First, by obtaining the input phoneme information from an existing TTS system or a voice file, a lip-sync animation that shows the character's emotion can be generated in real time, and emotion that changes continuously over time can be shown through the user emotion input interface that generates the expression parameters.

Second, the viseme models produced by an artist are arranged according to the corresponding phoneme information, and the relative positions of the viseme models are adjusted according to the pronunciation length through Catmull-Rom spline interpolation, so that a more natural and smooth lip-sync animation can be formed.

Third, expression models for various emotions are blended into the lip-sync animation by the importance-based approach, so that a lip-sync animation showing realistic emotional expressions can be produced.

Fourth, because the animation production process according to the present invention is automated, the manual work required of the user to produce facial animation is minimized, and the accuracy of synchronization with the voice, the naturalness of the animation, and its real-time performance are maximized.

Claims

A character forming method comprising:

receiving character data;

dividing a face model of a character into a portion mainly used for pronunciation and a portion mainly used for expressing a facial expression, and animating the portion mainly used for pronunciation to form a lip-sync animation of the character; and

animating the portion mainly used for expressing the facial expression to form a facial expression animation on the character on which the lip-sync animation has been formed.

The method of claim 1,

Wherein said character data comprises voice information and facial expression information of said character.

The method of claim 1,

wherein the character data is input in the form of script text.

The method of claim 3,

wherein the character data in the form of script text is voiced by a live voice recording or synthesized by a TTS system.

The method of claim 1,

further comprising synchronizing the lip-sync animation formed on the character with the facial expression animation.

The method of claim 1,

wherein the forming of the lip-sync animation is performed by setting a phoneme sequence and phoneme lengths for the character according to the character data.

The method of claim 6,

wherein the mouth shape of the character for each phoneme in the set phoneme sequence is treated as a key frame, and each key frame is processed by a keyframing animation technique.

The method of claim 2,

wherein the voice information is divided into groups each having a similar mouth shape, and

when voice information belonging to the same group is input, the lip-sync animation of the character is formed identically.

The method of claim 8,

wherein the voice information is in Korean, and

in the forming of the lip-sync animation, Korean, which consists of a total of 40 phonemes, is classified into 11 vowel and consonant viseme models.

The method of claim 8,

wherein the voice information is in English, and

in the forming of the lip-sync animation, English, which consists of a total of 41 phonemes, is classified into 13 vowel and consonant viseme models.

The method of claim 1, wherein the forming of the lip-sync animation

uses Catmull-Rom spline interpolation.

(deleted)

The method of claim 1, wherein the forming of the facial expression animation

provides example expression models each representing an emotion and an expressionless model, and blends the example expression models with the expressionless model to show a change in emotion as the facial expression of the character.

The method of claim 1,

wherein the forming of the lip-sync animation and the facial expression animation is performed by an importance-based approach.

The method of claim 15, wherein the importance-based approach comprises

forming the lip-sync animation or the facial expression animation at each point of the character's face according to the importance of pronunciation and of expression at that point.

The method of claim 1, wherein, in the forming of the facial expression animation,

the character's face is divided into the forehead, eyes, eyebrows, nose, and cheeks, and the forehead, eyes, eyebrows, nose, and cheeks are each animated according to the input character data.

The method of claim 17,

wherein the lips of the character's face are animated according to the lip-related data among the input character data.

The method of claim 1,

wherein the character data classifies the emotion of the character as any one of joy, surprise, sadness, fear, and anger, and

the facial expression animation of the character is animated according to the classified emotion.

A character forming apparatus comprising:

an input unit for inputting character data;

a lip-sync animation forming unit that divides a face model of a character into a portion mainly used for pronunciation and a portion mainly used for expressing a facial expression, and animates the portion mainly used for pronunciation to form a lip-sync animation; and

a facial expression animation forming unit that animates the portion mainly used for expressing the facial expression to form a facial expression animation on the character on which the lip-sync animation has been formed.

The apparatus of claim 20,

further comprising a synchronization unit for synchronizing the lip-sync animation and the facial expression animation formed on the character.

The apparatus of claim 20,

wherein the input unit is a TTS system and the character data is voice data.