KR20100115003A

KR20100115003A - Method for generating talking heads from text and system thereof

Info

Publication number: KR20100115003A
Application number: KR1020090033488A
Authority: KR
Inventors: 박순영; 최경호; 이신성; 신용민; 이훈; 김상완
Original assignee: 목포대학교산학협력단
Priority date: 2009-04-17
Filing date: 2009-04-17
Publication date: 2010-10-27
Also published as: KR101039668B1

Abstract

PURPOSE: A text-data-based facial animation output method and a system thereof are provided that synchronize speech output with the mouth shape of a facial animation, improving information delivery by outputting a facial animation image together with the speech. CONSTITUTION: The text data may be text data in a PowerPoint program. A text-to-speech engine (100) receives the text data from the PowerPoint program. A FAP converter (600) converts response message data into MPEG-4 parameter values. A sync module (200) transmits an audio file to the audio output device.

Description

Method for outputting facial animation based on text data and system thereof {Method for generating talking heads from text and system thereof}

The present invention relates to a text-data-based facial animation output method and a system thereof. More specifically, it relates to a method and system that render and display the facial animation corresponding to each phoneme of the text data before that phoneme is output as speech, so that the speech output of the phonemes and the mouth shape of the facial animation are synchronized with each other.

Text-to-speech (TTS) is a technology that converts text into speech and outputs it. It is one of the speech information technologies used for voice control, for providing information by voice, and for similar purposes.

TTS was initially used mainly in ARS (Audio Response System) applications, which deliver information over communication channels such as the telephone. With the development of information and communication technology, including the Internet, it is now applied in a variety of multimedia fields.

In general, TTS simply converts text into speech and outputs it, so its effectiveness in conveying information is limited.

To address this problem, recent efforts combine TTS with image processing technology so that a facial animation pronouncing the speech is output together with the speech itself, improving the efficiency of information delivery.

The facial animation image shows the mouth shapes with which a person pronounces each phoneme when reading a text, and is also called a talking head image.

Conventionally, it has been very difficult to synchronize the facial animation with the speech output by the TTS so that the animation appears to actually read the text.

The present inventors sought to synchronize the speech output of text converted by a TTS engine with the mouth shape of a facial animation, in order to reduce the awkwardness of mismatched sound and image and to improve information delivery. As a result of this research, they developed a technical configuration that synchronizes the speech output of the phonemes of the text with the mouth shape of the facial animation, and thereby completed the present invention.

Accordingly, an object of the present invention is to provide a text-data-based facial animation output method and system that improve the efficiency of information delivery by outputting text not only as speech but also together with a facial animation image.

Another object of the present invention is to provide a text-data-based facial animation output method and system that synchronize the speech output of the text with the mouth shape of the facial animation, so that the facial animation appears to read the text naturally.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, the present invention provides a text-data-based facial animation output method comprising: a first step in which a TTS (text-to-speech) engine receives text data, generates an audio file for outputting the phonemes contained in the text data as speech and response message data containing phoneme information and duration information for each phoneme, and stores them in a storage medium; a second step in which a sync module transmits the audio file to an audio output device and transmits the response message data to a rendering module, such that the response message data containing the phoneme information of any given phoneme is transmitted to the rendering module before that phoneme in the audio file is output as speech; and a third step in which the rendering module renders a facial animation according to the phoneme information of the response message data and outputs it to a display device, and the audio output device outputs the audio file as speech. As a result, any given phoneme in the audio file is output as speech while the facial animation corresponding to the phoneme information of that phoneme is being rendered.

In a preferred embodiment, each item of response message data of the first step is converted into standardized parameter values by a FAP converter (facial animation parameter converter) and then stored.

In a preferred embodiment, each item of response message data is converted into MPEG-4 parameter values and then stored.

In a preferred embodiment, the response message data is stored in a cache area of the storage medium.

In a preferred embodiment, the response message data containing the phoneme information of a given phoneme in the second step is transmitted to the rendering module 0.6 to 0.2 seconds before that phoneme is output as speech.

In a preferred embodiment, the text data is text data in a PowerPoint program.

The present invention also provides a text-data-based facial animation output system comprising: a TTS engine that receives text data, generates an audio file for outputting the phonemes contained in the text data as speech and response message data containing phoneme information and duration information for each phoneme, and stores them in a storage medium;

a sync module that transmits the audio file and the response message data stored in the storage medium to an audio output device and to the rendering module below, respectively, such that the response message data containing the phoneme information of any given phoneme is transmitted to the rendering module before that phoneme in the audio file is output as speech; and a rendering module that receives the response message data, renders a facial animation according to each item of response message data, and outputs it to a display device. As a result, any given phoneme in the audio file is output as speech while the facial animation corresponding to the phoneme information of that phoneme is being rendered.

In a preferred embodiment, the system further comprises a FAP converter that converts the response message data generated by the TTS engine into standardized parameter values and stores them in the storage medium.

In a preferred embodiment, the FAP converter converts the response message data into MPEG-4 parameter values and stores them in the storage medium.

In a preferred embodiment, the FAP converter stores the response message data in a cache area of the storage medium.

In a preferred embodiment, the sync module transmits the response message data containing the phoneme information of a given phoneme to the rendering module 0.6 to 0.2 seconds before that phoneme is output as speech.

In a preferred embodiment, the TTS engine receives the text data from a PowerPoint program.

The present invention has the following advantageous effects.

First, the text-data-based facial animation output method and system of the present invention output text not only as speech but also together with a facial animation image, improving the efficiency of information delivery.

In addition, the method and system can receive text data from a PowerPoint program and present the corresponding speech and facial animation together with the slide show, improving the quality of a presentation.

In addition, because the sync module transmits the response message data for a given phoneme to the rendering module before that phoneme in the audio file is output as speech, the mouth of the facial animation has taken its shape by the time the phoneme is pronounced, so that the speech output and the mouth shape of the facial animation are synchronized.

The terms used in the present invention have been selected, as far as possible, from general terms in wide current use. In certain cases, however, terms have been chosen arbitrarily by the applicant; in such cases their meaning should be understood from the meaning described or used in the detailed description of the invention, not simply from the name of the term.

Hereinafter, the technical configuration of the present invention is described in detail with reference to the accompanying drawings and preferred embodiments.

However, the present invention is not limited to the embodiments described herein and may be embodied in other forms. Like reference numerals designate like elements throughout the specification.

FIG. 1 is a block diagram of a text-data-based facial animation output system according to an embodiment of the present invention, and FIG. 2 is a diagram illustrating the point in time at which rendering of the facial animation starts in that system.

Referring to FIG. 1, the text-data-based facial animation output system according to an embodiment of the present invention comprises a TTS engine (100), a sync module (200), and a rendering module (300).

The TTS (text-to-speech) engine (100) receives text data (10) and generates an audio file (410) for outputting the phonemes contained in the text data (10) as speech, together with response message data (420) containing phoneme information and duration information for each phoneme.

The TTS engine (100) stores the generated audio file (410) and response message data (420) in a storage medium (400).

The TTS engine (100) first receives the text data (10), generates the audio file (410), and stores it in the storage medium (400). It then receives the text data (10) again, generates the response message data (420), and stores it in the storage medium (400).

However, the TTS engine (100) may instead generate the response message data (420) first, or may generate the audio file (410) and the response message data (420) at the same time.

The TTS engine (100) may also receive the text data (10) only once and generate the audio file (410) and the response message data (420) separately or simultaneously.
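The relationship between the response message data and the audio timeline can be illustrated with a minimal sketch. The `ResponseMessage` layout and the `onset_times` helper are hypothetical names introduced here, and the sketch assumes the phonemes are spoken back to back, so that each phoneme starts when the durations of all earlier phonemes have elapsed. The patent does not specify the actual message format.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass(frozen=True)
class ResponseMessage:
    """One response message: a phoneme and its duration (hypothetical layout)."""
    phoneme: str
    duration: float  # seconds


def onset_times(messages):
    """Return the time at which each phoneme starts in the audio file.

    Assumption: phonemes follow one another without gaps, so each onset
    is the sum of all earlier durations.
    """
    durations = [m.duration for m in messages]
    # accumulate(..., initial=0.0) yields 0, d0, d0+d1, ...; drop the final total.
    return list(accumulate(durations, initial=0.0))[:-1]


messages = [
    ResponseMessage("ㅇ", 0.25),
    ResponseMessage("ㅏ", 0.5),
    ResponseMessage("ㄴ", 0.25),
]
print(onset_times(messages))  # [0.0, 0.25, 0.75]
```

Onset times derived this way are what a synchronization component would need in order to decide when each message has to reach the renderer.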

The text data (10) may be text data in Microsoft's PowerPoint program, and the TTS engine (100) may receive the text data (10) and generate the audio file (410) and the response message data (420) while a slide show of the PowerPoint program is running.

That is, the text-data-based facial animation output system according to an embodiment of the present invention can work together with the PowerPoint program to present the text data (10) to the audience as speech and as a facial animation (21) during a presentation.

While the slide show of the PowerPoint program is running, the speech output status of the text data (10) or the output status of the facial animation (21) may also be displayed in the slide notes pane, which only the presenter can see.

However, the text data (10) may also be text data in Microsoft's Word program or in Hancom's Hangul word processor, and may be any text data that the TTS engine (100) can read.

When the text data (10) is a long text containing multiple sentences, it may be divided and input as text data of one unit sentence at a time.
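Dividing a long text into unit sentences can be sketched as follows. The patent does not say how sentence boundaries are found; the rule used here (split after `.`, `!` or `?` followed by whitespace) and the function name `split_into_unit_sentences` are assumptions chosen only for illustration.

```python
import re


def split_into_unit_sentences(text):
    """Split text into sentences.

    Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
    Real text (abbreviations, quotations) would need a more careful rule.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_into_unit_sentences("Hello there. How are you? Fine!"))
# ['Hello there.', 'How are you?', 'Fine!']
```

Each resulting sentence could then be passed to the TTS engine as one unit of text data.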

The audio file (410) generated by the TTS engine (100) may be in WMA, MP3, OGG, APE, or MP4 format. In one embodiment of the present invention, the TTS engine (100) generates a WMA audio file from the text data (10) and stores it in the storage medium (400).

The audio file (410) and the response message data (420) are stored in a cache area of the storage medium (400).

The cache area may be temporary space in auxiliary storage such as a computer's hard disk or in main memory such as RAM. In this embodiment the audio file (410) is stored on the hard disk and the response message data (420) is stored in RAM.

The response message data (420) output from the TTS engine (100) passes through a redundant-phoneme removal module (500), which removes the response message data for phonemes that, owing to the influence of the preceding and following phonemes, need not be expressed as a mouth shape (22) of the facial animation (21). The remaining data is then stored in the storage medium (400).

For example, response message data is removed if its phoneme information is one of the consonants ㄱ[g], ㅋ[k], ㅇ[-ŋ], ㅎ[h], or ㄹ[l]. Response message data whose phoneme information is one of the consonants ㄴ[n], ㄷ[d], ㅌ[t], ㅅ[s], ㅆ[s*], ㅈ, or ㅊ is also removed when the preceding and following phonemes are among the vowels ㅗ[o], ㅛ[jo], ㅜ[u], ㅠ[ju], ㅡ, ㅣ[i], ㅢ, ㅘ[wa], ㅙ[wæ], ㅚ[we], ㅝ, ㅞ[we], and ㅟ[wi].

Response message data whose phoneme information is one of the vowels ㅏ[a], ㅑ[ja], ㅓ, ㅕ, ㅐ[æ], ㅒ[jæ], ㅘ[wa], ㅙ[wæ], ㅚ[we], ㅝ, or ㅞ[we] and which comes before one of the consonants ㄴ[n], ㄷ[d], ㅌ[t], ㅅ[s], ㅆ[s*], ㅈ, or ㅊ is not removed, regardless of whether that consonant is the initial or the final consonant of its syllable.

Response message data for the diphthongs ㅑ[ja], ㅕ, ㅒ[jæ], ㅖ[je], ㅘ[wa], ㅙ[wæ], ㅚ[we], ㅝ, ㅞ[we], and ㅟ[wi] is converted so that, within the phoneme duration, each diphthong is pronounced as two mouth shapes, namely ㅣ[i] + ㅏ[a], ㅣ[i] + ㅓ, ㅣ[i] + ㅐ[æ], ㅣ[i] + ㅔ[e], ㅗ[o] + ㅏ[a], ㅗ[o] + ㅐ[æ], ㅗ[o] + ㅔ[e], ㅜ[u] + ㅓ, ㅜ[u] + ㅔ[e], and ㅜ[u] + ㅣ[i], respectively, and is then stored in the storage medium (400).
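The removal rules described above can be sketched as a simple filter over a phoneme sequence. This is only an illustration under stated assumptions: phonemes are represented as single Hangul jamo characters, the function name `remove_redundant_phonemes` is hypothetical, and the context rule is interpreted as applying when either the preceding or the following phoneme is one of the listed vowels (the text does not say whether one neighbor or both are required). How the durations of removed phonemes are redistributed is not addressed by the source and is left out.

```python
# Consonants that are always removed.
ALWAYS_REMOVED = set("ㄱㅋㅇㅎㄹ")
# Consonants removed only next to the vowels listed below.
CONTEXT_REMOVED = set("ㄴㄷㅌㅅㅆㅈㅊ")
CONTEXT_VOWELS = set("ㅗㅛㅜㅠㅡㅣㅢㅘㅙㅚㅝㅞㅟ")


def remove_redundant_phonemes(phonemes):
    """Drop phonemes that need no mouth shape of their own.

    Assumption: a context-dependent consonant is dropped when the phoneme
    before it OR the phoneme after it is one of CONTEXT_VOWELS.
    Vowels are never dropped.
    """
    kept = []
    for i, phoneme in enumerate(phonemes):
        previous = phonemes[i - 1] if i > 0 else None
        following = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if phoneme in ALWAYS_REMOVED:
            continue
        if phoneme in CONTEXT_REMOVED and (
            previous in CONTEXT_VOWELS or following in CONTEXT_VOWELS
        ):
            continue
        kept.append(phoneme)
    return kept


# "한국" decomposed into jamo: ㅎ ㅏ ㄴ ㄱ ㅜ ㄱ
print(remove_redundant_phonemes(list("ㅎㅏㄴㄱㅜㄱ")))  # ['ㅏ', 'ㄴ', 'ㅜ']
```

In the example, ㅎ and both ㄱ are removed unconditionally, while ㄴ is kept because neither of its neighbors is one of the listed vowels.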

Accordingly, the unnaturalness of the facial animation (21) that can arise when every Korean phoneme is expressed as a mouth shape is eliminated.
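The diphthong conversion described above can likewise be sketched in a few lines. The table of component vowels is taken from the description; the representation of a message as a `(phoneme, duration)` tuple, the function name `split_diphthongs`, and the equal split of the duration between the two mouth shapes are assumptions, since the source only states that both shapes occur within the phoneme duration.

```python
# Diphthong -> the two simple vowels it is rendered as (from the description).
DIPHTHONG_PARTS = {
    "ㅑ": ("ㅣ", "ㅏ"),
    "ㅕ": ("ㅣ", "ㅓ"),
    "ㅒ": ("ㅣ", "ㅐ"),
    "ㅖ": ("ㅣ", "ㅔ"),
    "ㅘ": ("ㅗ", "ㅏ"),
    "ㅙ": ("ㅗ", "ㅐ"),
    "ㅚ": ("ㅗ", "ㅔ"),
    "ㅝ": ("ㅜ", "ㅓ"),
    "ㅞ": ("ㅜ", "ㅔ"),
    "ㅟ": ("ㅜ", "ㅣ"),
}


def split_diphthongs(messages):
    """Replace each diphthong message by two simple-vowel messages.

    messages is a list of (phoneme, duration_in_seconds) tuples.
    Assumption: the two mouth shapes share the original duration equally.
    """
    result = []
    for phoneme, duration in messages:
        parts = DIPHTHONG_PARTS.get(phoneme)
        if parts is None:
            result.append((phoneme, duration))
        else:
            half = duration / 2
            result.extend((part, half) for part in parts)
    return result


print(split_diphthongs([("ㅘ", 0.5), ("ㅏ", 0.25)]))
# [('ㅗ', 0.25), ('ㅏ', 0.25), ('ㅏ', 0.25)]
```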

The response message data that remains after the redundant-phoneme removal module (500) has removed the data for unnecessary phonemes is converted into standardized parameter values by a FAP converter (600, facial animation parameter converter) and stored in the storage medium (400). In one embodiment of the present invention, the data is converted into MPEG-4 parameter values and stored.
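As a rough illustration of what such a conversion could look like, the sketch below maps a phoneme and its duration to a viseme number plus a duration in milliseconds. The patent does not state which MPEG-4 parameters are produced or how Korean phonemes are assigned to them. The viseme numbers used here follow the MPEG-4 viseme table as it is commonly cited (1 for p/b/m, 4 for t/d, 7 for s/z, 8 for n/l, 10 to 14 for vowels) and should be checked against the standard; the pairing with Korean phonemes, the `VisemeFrame` type, and `to_viseme_frame` are purely hypothetical.

```python
from dataclasses import dataclass

# Hypothetical Korean-phoneme -> MPEG-4 viseme mapping (illustration only;
# not taken from the patent, and to be verified against the standard).
PHONEME_TO_VISEME = {
    "ㅁ": 1, "ㅂ": 1, "ㅍ": 1,
    "ㄷ": 4, "ㅌ": 4,
    "ㅅ": 7, "ㅆ": 7,
    "ㄴ": 8,
    "ㅏ": 10, "ㅔ": 11, "ㅣ": 12, "ㅗ": 13, "ㅜ": 14,
}
NEUTRAL_VISEME = 0  # used when a phoneme has no entry in the table


@dataclass(frozen=True)
class VisemeFrame:
    """Standardized form of one response message (hypothetical layout)."""
    viseme: int
    duration_ms: int


def to_viseme_frame(phoneme, duration_s):
    """Convert a phoneme and its duration in seconds to a VisemeFrame."""
    viseme = PHONEME_TO_VISEME.get(phoneme, NEUTRAL_VISEME)
    return VisemeFrame(viseme=viseme, duration_ms=round(duration_s * 1000))


print(to_viseme_frame("ㅏ", 0.25))  # VisemeFrame(viseme=10, duration_ms=250)
```

Converting to a standardized parameter set of this kind would let any renderer that understands those parameters consume the data, independent of the TTS engine that produced it.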

The sync module (200) transmits the audio file (410) and the response message data (420) stored in the storage medium (400) to the audio output device (30) and to the rendering module (300) described below, respectively.

Before a given phoneme contained in the audio file (410) is output by the audio output device (30), the sync module (200) transmits the response message data (420) for that phoneme to the rendering module (300), so that the facial animation (21) is rendered and output to the display device (20).

This is because, when a person pronounces a phoneme, the mouth has already taken its shape.

For example, referring to FIG. 2, when the phoneme '아' (31) is output as speech at 0 seconds, the sync module (200) transmits the response message data for the phoneme '아' (31) to the rendering module (300) at -0.4 seconds, so that at the moment the phoneme '아' (31) is output as speech, the mouth shape (22) of the facial animation (21) is the open mouth shape (22a).

That is, a given phoneme is output as speech while the facial animation (21) for that phoneme is being rendered. Alternatively, the phoneme may be output as speech after its rendering has finished.

In one embodiment of the present invention, the sync module (200) transmits the response message data (420) corresponding to each phoneme of the audio file (410) to the rendering module (300) between 0.6 and 0.2 seconds before that phoneme is output as speech, so that at the moment the phoneme is output as speech the facial animation (21) has the mouth shape (22a) for pronouncing it.
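The timing rule can be sketched as a small scheduling helper: given the time at which each phoneme is spoken, it returns the time at which the corresponding response message should be sent to the renderer. The 0.2 to 0.6 second window comes from the description; the function name `dispatch_times`, the default lead of 0.4 seconds (the value used in the FIG. 2 example), and the use of one fixed lead for every phoneme are assumptions.

```python
LEAD_MIN = 0.2  # seconds, lower bound of the window given in the description
LEAD_MAX = 0.6  # seconds, upper bound of the window given in the description


def dispatch_times(onsets, lead=0.4):
    """Return when each response message should be sent to the renderer.

    onsets: times (seconds) at which each phoneme is output as speech.
    lead:   how long before the speech onset the message is sent.
    Times are relative to the start of audio playback, so a negative value
    means the message is sent before playback begins.
    """
    if not LEAD_MIN <= lead <= LEAD_MAX:
        raise ValueError(f"lead must be between {LEAD_MIN} and {LEAD_MAX} seconds")
    return [onset - lead for onset in onsets]


# FIG. 2 example: a phoneme spoken at 0 s is dispatched at -0.4 s.
print(dispatch_times([0.0], lead=0.4))  # [-0.4]
```

Because the first dispatch time is negative, a practical implementation would presumably delay the start of audio playback by at least the chosen lead.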

The rendering module (300) receives the response message data (420) from the sync module (200), renders the facial animation (21) according to the phoneme information and duration information, and outputs it to the display device (20).

This solves the problem of conventional streaming-type facial animation output methods, in which speech is output at the moment rendering of the facial animation (21) starts, so that the facial animation (21) and the speech are not synchronized.

FIG. 3 is a flowchart of a text-data-based facial animation output method according to another embodiment of the present invention.

Hereinafter, the description of components substantially identical to those of FIGS. 1 and 2 is omitted, and the same reference numerals are used.

Referring to FIG. 3, in the text-data-based facial animation output method according to another embodiment of the present invention, the TTS engine (100) first receives text data (10) (S1000), generates an audio file (410) for outputting the phonemes contained in the text data (10) as speech, and stores it in the storage medium (400) (S2000).

Next, the TTS engine (100) receives the text data (10) again (S3000), generates response message data (420) containing phoneme information and duration information for each phoneme contained in the text data (10), and stores it in the storage medium (400) (S4000).

However, the TTS engine (100) may instead generate and store the response message data (420) first.

The TTS engine (100) may also receive the text data (10) only once and generate the audio file (410) and the response message data (420) separately or simultaneously.

The response message data (420) passes through the redundant-phoneme removal module (500), which removes the response message data for phonemes that, owing to the influence of the preceding and following phonemes, need not be expressed as a mouth shape (22) of the facial animation (21). The data is then converted into standardized MPEG-4 parameter values by the FAP converter (600) and stored in the cache area, a temporary storage area of the storage medium (400).

Next, the sync module (200) transmits the audio file (410) to the audio output device (30) and transmits the response message data (420) to the rendering module (300) (S5000).

At this time, the sync module (200) transmits the response message data (420) for a given phoneme to the rendering module (300) before that phoneme in the audio file (410) is output by the audio output device (30). Because a person's mouth has already taken its shape when a phoneme is pronounced, rendering of the facial animation (21) is started before the phoneme is output as speech so that the mouth shape is in place in time.

In one embodiment of the present invention, the sync module (200) transmits the response message data (420) corresponding to each phoneme of the audio file (410) to the rendering module (300) between 0.6 and 0.2 seconds before that phoneme is output as speech, so that at the moment a given phoneme is output as speech the facial animation (21) has the mouth shape (22) for that phoneme.

Next, the audio file (410) is output by the audio output device (30), and the rendering module (300) renders the facial animation (21) and outputs it to the display device (20) (S6000).

Accordingly, at the moment a given phoneme is output as speech the facial animation (21) has the mouth shape (22) for that phoneme, which solves the conventional problem of the facial animation (21) and the speech not being synchronized.

As described above, the present invention has been illustrated and described with reference to preferred embodiments, but it is not limited to those embodiments. Various changes and modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the spirit of the invention.

FIG. 1 is a block diagram of a text-data-based facial animation output system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the point in time at which rendering of the facial animation starts in the text-data-based facial animation output system according to an embodiment of the present invention.

In the drawings according to the present invention, the same reference numerals are used for components having substantially the same configuration and function.

<Description of reference numerals for the main parts of the drawings>

100: TTS engine    200: sync module

300: rendering module    400: storage medium

410: audio file    420: response message data

500: redundant-phoneme removal module    600: FAP converter

Claims

A text-data-based facial animation output method comprising: a first step in which a TTS (text-to-speech) engine receives text data, generates an audio file for outputting the phonemes contained in the text data as speech and response message data containing phoneme information and duration information for each phoneme, and stores them in a storage medium;

a second step in which a sync module transmits the audio file to an audio output device and transmits the response message data to a rendering module, wherein the response message data containing the phoneme information of any given phoneme is transmitted to the rendering module before that phoneme in the audio file is output as speech; and

a third step in which the rendering module renders a facial animation according to the phoneme information of the response message data and outputs it to a display device, and the audio output device outputs the audio file as speech,

wherein any given phoneme in the audio file is output as speech while the facial animation according to the phoneme information of that phoneme is being rendered.

The method of claim 1,

wherein each item of response message data of the first step is converted into standardized parameter values by a FAP converter (facial animation parameter converter) and then stored.

The method of claim 2,

wherein each item of response message data is converted into MPEG-4 parameter values and then stored.

The method of claim 3,

wherein the response message data is stored in a cache area of the storage medium.

The method according to any one of claims 1 to 4,

wherein the response message data containing the phoneme information of the given phoneme in the second step is transmitted to the rendering module 0.6 to 0.2 seconds before that phoneme is output as speech.

The method of claim 5,

wherein the text data is text data in a PowerPoint program.

A text-data-based facial animation output system comprising: a TTS engine that receives text data, generates an audio file for outputting the phonemes contained in the text data as speech and response message data containing phoneme information and duration information for each phoneme, and stores them in a storage medium;

a sync module that transmits the audio file and the response message data stored in the storage medium to an audio output device and to the rendering module below, respectively, wherein the response message data containing the phoneme information of any given phoneme is transmitted to the rendering module before that phoneme in the audio file is output as speech; and

a rendering module that receives the response message data, renders a facial animation according to each item of response message data, and outputs it to a display device,

wherein any given phoneme in the audio file is output as speech while the facial animation according to the phoneme information of that phoneme is being rendered.

The system of claim 7,

further comprising a FAP converter that converts the response message data generated by the TTS engine into standardized parameter values and stores them in the storage medium.

The system of claim 8,

wherein the FAP converter converts the response message data into MPEG-4 parameter values and stores them in the storage medium.

The system of claim 9,

wherein the FAP converter stores the response message data in a cache area of the storage medium.

The system according to any one of claims 7 to 10,

wherein the sync module transmits the response message data containing the phoneme information of the given phoneme to the rendering module 0.6 to 0.2 seconds before that phoneme is output as speech.

The system of claim 11,

wherein the TTS engine receives the text data from a PowerPoint program.