KR100300962B1

KR100300962B1 - Lip sync method and apparatus for speech synthesis

Info

Publication number: KR100300962B1
Application number: KR1019980029941A
Authority: KR
Inventors: 박찬민
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1998-07-24
Filing date: 1998-07-24
Publication date: 2001-10-27
Also published as: KR20000009490A

Abstract

The present invention discloses a lip-sync method and apparatus for speech synthesis. The method receives still image data representing a human face, shape model data, and arbitrary text, and lip-syncs the speech synthesized from the text to the face. It comprises the steps of: (a) obtaining from the speech synthesis result, syllable by syllable, at least codes for the phonemes constituting each syllable and duration information for each syllable; (b) allocating a duration to each phoneme within the syllable duration; (c) generating, within each phoneme duration, at least one video key frame representing the mouth shape of that phoneme by referring to a predetermined reference table that defines mouth shape information per phoneme; (d) creating a predetermined number of consecutive intermediate frames by interpolation between adjacent key frames, thereby generating a video whose mouth shape is synchronized with the speech synthesis result, and compositing it onto the still image data; and (e) outputting the composited video in synchronization with the speech synthesis result.

Description

Lip sync method and apparatus for speech synthesis

The present invention relates to lip synchronization (lip sync), and more particularly to a lip-sync method and apparatus for speech synthesis.

Research on face synthesis, which began in the mid-1970s and has been actively pursued since the late 1980s, is now recognized as important enough to be included as one area of MPEG-4 standardization. In addition, much recent research has addressed how to realize a talking face through lip sync between a face synthesized with computer graphics and a human voice.

However, many conventional lip-sync techniques rely mainly on English pronunciation and therefore differ considerably from lip sync for Korean, whose pronunciation structure is different. Research on Korean lip sync remains scarce. Moreover, existing implementations have run on high-performance workstation-class computers with almost no constraint on execution speed, so they are not easy to apply on ordinary computers.

The present invention implements a virtual agent that applies lip sync with TTS (text-to-speech) to a human face synthesized with computer graphics, so that news can be broadcast in Korean on the Internet or in cyber virtual space.

An object of the present invention is to provide a lip-sync method for speech synthesis that is suited to the Korean pronunciation structure, thereby making it possible to implement a virtual agent in which lip sync with the speech synthesis result is applied to a human face synthesized with computer graphics.

Another object of the present invention is to provide a lip-sync apparatus for speech synthesis that performs the lip-sync method.

FIG. 1 is a flowchart illustrating the lip-sync method for speech synthesis according to the present invention.

FIG. 2 is a block diagram of the lip-sync apparatus for speech synthesis according to the present invention.

FIG. 3 is a diagram illustrating the phoneme duration allocation and synchronization method.

To achieve the above object, the lip-sync method for speech synthesis according to the present invention, which receives still image data representing a human face, shape model data, and arbitrary text and lip-syncs the speech synthesis result from the text to the face, comprises the following steps:

(a) obtaining from the speech synthesis result, syllable by syllable, at least codes for the phonemes constituting each syllable and duration information for each syllable; (b) allocating a phoneme duration to each of the phonemes within the syllable duration; (c) generating, within each phoneme duration, at least one video key frame representing the mouth shape of each phoneme by referring to a predetermined reference table that defines mouth shape information per phoneme; (d) creating a predetermined number of consecutive intermediate frames by interpolation between adjacent key frames to generate a video whose mouth shape is synchronized with the speech synthesis result, and compositing it onto the still image data; and (e) outputting the composited video in synchronization with the speech synthesis result.

To achieve the other object, the lip-sync apparatus for speech synthesis according to the present invention, which receives still image data representing a human face, shape model data, and arbitrary text and lip-syncs the speech synthesis result from the text to the face, comprises the following units:

an input unit that obtains from the synthesized speech, syllable by syllable, at least codes for the phonemes constituting each syllable and duration information for each syllable; a syllable analyzer that allocates a phoneme duration to each of the phonemes within the syllable duration; a key frame generator that generates, within each phoneme duration, at least one key frame representing the mouth shape of each phoneme by referring to a predetermined reference table that defines mouth shape information per phoneme; a video synthesizer that creates a predetermined number of consecutive intermediate frames by interpolation between adjacent key frames to generate a video whose mouth shape is synchronized with the speech synthesis result, and composites it onto the still image data; and a screen output unit that outputs the composited video in synchronization with the speech synthesis result.

Hereinafter, the lip-sync method for speech synthesis according to the present invention, and the configuration and operation of the apparatus, are described with reference to the accompanying drawings.

For lip sync between a face image synthesized with computer graphics and speech, still image data representing a human face (hereinafter simply "face data") and arbitrary text, that is, the text to be synthesized into speech, are first input (step 100). The face data serves as the base image for realizing the speaker's face, and the present invention realizes lip sync with the speech by compositing a video containing the mouth shapes onto this base image. The text input in step 100 then undergoes speech synthesis.

From the speech synthesis result for the text, codes for the phonemes constituting each syllable and duration information for each syllable are obtained syllable by syllable (step 102). Specifically, the codes are codes for the initial consonant, the medial vowel, the final consonant, and the blank, and the syllable duration is the time elapsed from the moment a syllable begins until it ends. In addition, the start time of the speech is obtained from the speech synthesis result and is used later to output the mouth shape of the face in synchronization with the speech.
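The initial, medial, and final components referred to above can be illustrated with standard Unicode arithmetic for precomposed Hangul syllables. This is only a sketch: the patent obtains the codes from the TTS engine and does not specify their format, so the function name `decompose` and the use of Unicode code points are assumptions made here for illustration.

```python
# Illustrative Unicode-based decomposition of a Hangul syllable.
# Assumption: the patent's TTS engine supplies its own phoneme codes;
# this sketch only shows what "initial / medial / final" refer to.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 medial vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # no final + 27 finals

HANGUL_BASE = 0xAC00            # code point of the first syllable, "가"
SYLLABLE_COUNT = 19 * 21 * 28   # number of precomposed syllables


def decompose(syllable: str) -> tuple[str, str, str]:
    """Split one precomposed Hangul syllable into (initial, medial, final)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    initial, rest = divmod(index, 21 * 28)
    medial, final = divmod(rest, 28)
    return CHOSEONG[initial], JUNGSEONG[medial], JONGSEONG[final]
```

An empty string in the third position means the syllable has no final consonant, which selects the initial-plus-medial allocation case described later.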

After step 102, a phoneme duration is allocated to each of the phonemes within the syllable duration (step 104). That is, the duration assigned to a syllable is divided among the phonemes constituting that syllable at predetermined ratios. The phoneme durations are allocated at least such that: a syllable is divided into an initial consonant, a medial vowel, and a final consonant, with durations in the order medial > initial > final; a compound vowel, if present, is divided into two consecutive simple vowels, a first vowel and a second vowel, with durations in the order second vowel > first vowel; and consonants are divided into labial sounds, for which the lips close, and non-labial sounds, with durations in the order labial > non-labial (described in detail later).

After step 104, a key frame is generated for each phoneme within its phoneme duration by referring to a viseme (visual phoneme) table that defines mouth shape information per phoneme (step 106). In computer animation a key frame is the most central scene; in the present invention it means the image with the most representative mouth shape for a phoneme. The viseme table defines, for each phoneme in a minimally classified Korean phoneme pattern, predetermined mouth shape information including at least an action unit intensity that indicates the degree of mouth opening (described in detail later).

After step 106, intermediate frames are created within each phoneme duration so that adjacent key frames connect smoothly, that is, so that the mouth moves naturally. A predetermined number of consecutive intermediate frames is created by interpolation between adjacent key frames, generating a video whose mouth shape is synchronized with the synthesized speech, and this video is composited onto the face data input in step 100 (step 108). Finally, lip sync is realized by outputting the composited video in synchronization with the speech (step 110).

To perform the lip-sync method according to the present invention as described above, the following must be defined: 1. a viseme table of mouth shapes for Korean pronunciation; 2. a method of allocating durations to the phonemes constituting a syllable; 3. a method of synchronizing each phoneme with its allocated duration. Each of these is described in detail below.

1. Definition of the Korean viseme table

This section describes how the Korean viseme table of mouth shapes for Korean pronunciation is defined.

In the 1970s the psychologist Ekman defined the Facial Action Coding System (FACS), a classification of facial expressions. On this basis, mouth shapes can be expressed on a human face synthesized with computer graphics by animating each action unit (AU), the basic unit of FACS. A viseme table defines a mouth shape for each phoneme. For English, 45 phonemes are divided into 18 mouth shapes; in the present invention, Korean phonemes are preferably classified into a total of 12 patterns, and an action unit (AU) intensity is defined for each (Table 1).

The method of classifying Korean phonemes into the 12 viseme patterns is described below.

1-1. Vowels

Vowels are classified into a total of nine viseme patterns. Vowels fall into two broad groups: those expressed by a single mouth shape per vowel, and those expressed by two or more mouth shapes in succession.

1-1-1. Basic vowels

Each of these vowels is expressed by a single mouth shape. There are nine: ㅏ, ㅓ, ㅗ, ㅜ, ㅡ, ㅣ, ㅐ, ㅔ, ㅚ

1-1-2. Compound vowels

Each of these vowels is expressed as a sequence of two or more of the basic vowels above. They are listed below; because they are expressed with the basic vowels, they are not classified as separate patterns. ㅑ(ㅣ+ㅏ), ㅕ(ㅣ+ㅓ), ㅛ(ㅣ+ㅗ), ㅠ(ㅣ+ㅜ), ㅘ(ㅗ+ㅏ), ㅙ(ㅗ+ㅐ), ㅝ(ㅜ+ㅓ), ㅞ(ㅜ+ㅔ), ㅟ(ㅜ+ㅣ), ㅢ(ㅡ+ㅣ)
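The compound-vowel list above maps directly onto a lookup table. The following is a minimal sketch; the names `BASIC_VOWELS`, `COMPOUND_VOWELS`, and `expand_vowel` are hypothetical, and vowels that the text does not list (for example ㅒ and ㅖ) are deliberately left unhandled rather than guessed.

```python
# Basic vowels: one mouth shape each (nine viseme patterns).
BASIC_VOWELS = ("ㅏ", "ㅓ", "ㅗ", "ㅜ", "ㅡ", "ㅣ", "ㅐ", "ㅔ", "ㅚ")

# Compound vowels exactly as listed in the text: (first vowel, second vowel).
COMPOUND_VOWELS = {
    "ㅑ": ("ㅣ", "ㅏ"), "ㅕ": ("ㅣ", "ㅓ"), "ㅛ": ("ㅣ", "ㅗ"), "ㅠ": ("ㅣ", "ㅜ"),
    "ㅘ": ("ㅗ", "ㅏ"), "ㅙ": ("ㅗ", "ㅐ"), "ㅝ": ("ㅜ", "ㅓ"), "ㅞ": ("ㅜ", "ㅔ"),
    "ㅟ": ("ㅜ", "ㅣ"), "ㅢ": ("ㅡ", "ㅣ"),
}


def expand_vowel(vowel: str) -> tuple[str, ...]:
    """Return the sequence of basic vowels whose mouth shapes express `vowel`."""
    if vowel in BASIC_VOWELS:
        return (vowel,)
    if vowel in COMPOUND_VOWELS:
        return COMPOUND_VOWELS[vowel]
    # Not listed in the source text; this sketch does not invent a mapping.
    raise KeyError(f"vowel not covered by the listed patterns: {vowel!r}")
```

A result of length two indicates a compound vowel, whose two parts later receive separate durations.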

1-2. Consonants

Consonants are classified into labial sounds, for which the lips close, and non-labial sounds, for which they do not. This is because in Korean, apart from the labial sounds, the other consonants have little effect on the shape of the mouth.

1-2-1. Labial sounds

There are four such consonants, and together they form one viseme pattern: ㅁ, ㅂ, ㅃ, ㅍ

1-2-2. Non-labial sounds

All consonants other than the labial sounds; together they form one viseme pattern.

1-3. Blank

A blank corresponds to a space or a comma in the given sentence. In general a blank keeps the mouth shape of the preceding phoneme, but during a blank with a very long duration the mouth may close, or other head motion or a change of facial expression may be performed. The blank is also classified as one viseme pattern.

[Table 1]

Table 1 shows a preferred Korean viseme table according to the present invention. For the 12 patterns it defines mouth shape information comprising the corresponding action unit (AU) combination according to FACS and an action unit intensity indicating the degree of mouth opening. Because the action unit is the basic unit of FACS, one pattern consists of a plurality of action units, as shown in Table 1.
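The 12-pattern classification described in this section (nine basic vowels, labial consonants, non-labial consonants, and the blank) can be sketched as a small classifier. The pattern labels and the function name `viseme_pattern` are hypothetical. The AU combinations and intensities of Table 1 are not reproduced in this text, so the sketch covers only the classification and assigns no AU values.

```python
BASIC_VOWELS = set("ㅏㅓㅗㅜㅡㅣㅐㅔㅚ")               # nine vowel patterns
LABIALS = set("ㅁㅂㅃㅍ")                             # lips close: one pattern
CONSONANTS = set("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")  # single consonant jamo
BLANK = " "                                          # assumed blank marker

# Nine vowel patterns plus labial, non-labial, and blank.
PATTERNS = [f"vowel_{v}" for v in "ㅏㅓㅗㅜㅡㅣㅐㅔㅚ"] + ["labial", "non_labial", "blank"]


def viseme_pattern(phoneme: str) -> str:
    """Map one basic vowel, single consonant, or blank to its viseme pattern."""
    if phoneme == BLANK:
        return "blank"
    if phoneme in BASIC_VOWELS:
        return f"vowel_{phoneme}"
    if phoneme in LABIALS:
        return "labial"
    if phoneme in CONSONANTS:
        return "non_labial"
    # Compound vowels must be expanded into basic vowels first; other
    # symbols (such as final consonant clusters) are outside this sketch.
    raise ValueError(f"no viseme pattern for {phoneme!r}")
```

A full implementation would attach the AU combination and intensity from Table 1 to each of the 12 labels.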

2. Method of allocating phoneme durations

A Korean syllable consists either of an initial consonant and a medial vowel only, or of an initial consonant, a medial vowel, and a final consonant. The medial vowel is either a simple vowel or a compound vowel, and consonants are either labial or non-labial. Accordingly, a syllable is preferably assigned durations according to one of four cases.

The shape of a speaker's mouth is determined mainly by the vowel, so the vowel receives the largest share of the syllable duration. A compound vowel can be treated as two consecutive simple vowels; because the second vowel mainly dominates the mouth shape, the first vowel is given a shorter duration and the second a longer one. Among consonants, the final consonant is given the shortest duration and the initial consonant a somewhat short one. The labial consonants ㅁ, ㅂ, ㅃ, and ㅍ, for which the lips close, are given slightly longer durations to ensure that the closing of the mouth is visible. The cases are summarized as follows.

2-1. Syllable consisting of an initial consonant and a medial vowel

2-1-1. Non-labial consonant and simple vowel

Initial consonant 1/3, medial vowel 2/3 of the total duration.

2-1-2. Non-labial consonant and compound vowel

Initial consonant 1/3, first vowel 1/3, second vowel 1/3 of the total duration.

2-1-3. Labial consonant and simple vowel

Initial consonant 1/2, medial vowel 1/2 of the total duration.

2-1-4. Labial consonant and compound vowel

Initial consonant 1/2, first vowel 1/6, second vowel 3/6 of the total duration.

2-2. Syllable consisting of an initial consonant, a medial vowel, and a final consonant

The final consonant is allocated 1/5 of the total duration, and the initial consonant and medial vowel are allocated as follows.

2-2-1. Non-labial consonant and simple vowel

Initial consonant 1/5, medial vowel 3/5 of the total duration.

2-2-2. Non-labial consonant and compound vowel

Initial consonant 1/5, first vowel 1/5, second vowel 2/5 of the total duration.

2-2-3. Labial consonant and simple vowel

Initial consonant 1.5/5, medial vowel 2.5/5 of the total duration.

2-2-4. Labial consonant and compound vowel

Initial consonant 1.5/5, first vowel 1/5, second vowel 1.5/5 of the total duration.
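The eight allocation cases of section 2 can be collected into one table, as in the sketch below. The names `WEIGHTS` and `allocate` are hypothetical. The fractions are transcribed as printed; because the printed values for the labial, compound-vowel, no-final case add up to 7/6, the sketch normalizes each row by its own sum instead of assuming a corrected value. That normalization is an assumption of this example, not something the text states.

```python
from fractions import Fraction as F

# Key: (has_final, labial, compound_vowel) -> ordered (role, share) pairs,
# transcribed from the text. Shares are normalized when applied.
WEIGHTS = {
    (False, False, False): [("initial", F(1, 3)), ("medial", F(2, 3))],
    (False, False, True): [("initial", F(1, 3)), ("vowel_1", F(1, 3)), ("vowel_2", F(1, 3))],
    (False, True, False): [("initial", F(1, 2)), ("medial", F(1, 2))],
    (False, True, True): [("initial", F(1, 2)), ("vowel_1", F(1, 6)), ("vowel_2", F(3, 6))],
    (True, False, False): [("initial", F(1, 5)), ("medial", F(3, 5)), ("final", F(1, 5))],
    (True, False, True): [("initial", F(1, 5)), ("vowel_1", F(1, 5)),
                          ("vowel_2", F(2, 5)), ("final", F(1, 5))],
    (True, True, False): [("initial", F(3, 10)), ("medial", F(1, 2)), ("final", F(1, 5))],
    (True, True, True): [("initial", F(3, 10)), ("vowel_1", F(1, 5)),
                         ("vowel_2", F(3, 10)), ("final", F(1, 5))],
}


def allocate(duration_ms, has_final, labial, compound):
    """Split a syllable duration into per-phoneme durations (milliseconds)."""
    shares = WEIGHTS[(has_final, labial, compound)]
    total = sum(share for _, share in shares)  # 1 for every row but one
    return [(role, float(F(duration_ms) * share / total)) for role, share in shares]
```

For example, `allocate(500, has_final=True, labial=True, compound=True)` splits a 500 ms syllable into 150, 100, 150, and 100 ms for the initial consonant, first vowel, second vowel, and final consonant.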

3. Method of synchronizing each phoneme with its allocated duration

As described above with reference to FIG. 1, synchronization between the mouth shape of the synthesized face and the speech is basically achieved by generating, for each phoneme and within its duration, intermediate frames by interpolation between key frames based on the viseme patterns defined above, thereby producing a video. In practice, however, exact synchronization is difficult because of differences in video frame generation speed between computer models, rounding errors when computing the number of frames, variation in per-frame generation time, and computer timer errors.

Therefore, the present invention applies the following considerations to the synchronization method.

- Every phoneme, starting at the same time as the corresponding synthesized speech, must produce at least one frame (namely its key frame). Otherwise the face does not appear to be speaking.

- Each time a frame is generated, the frame generation speed must be calculated and continuously reflected in the number of frames generated for the duration of the next phoneme.

- When interpolating between phonemes, that is, from a phoneme's key frame toward the next key frame, the mouth shape should change slowly at the beginning of the duration and faster toward its end. Considering that light travels faster than sound, the key frame for a mouth shape must be guaranteed to appear at the start of the phoneme's pronunciation for the result to look as natural as a person speaking. If the key frame appears late, viewers clearly notice the mouth lagging behind the speech, which looks very awkward.
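The first two considerations above (at least one frame per phoneme, and a frame count that follows the measured generation speed) can be sketched as follows. The class `FrameRateTracker`, the function `frames_for_phoneme`, the default of 30 frames per second, and the choice of a running average over all frames are all assumptions made for illustration; the text does not specify how the speed is measured or averaged.

```python
class FrameRateTracker:
    """Running estimate of how many frames per second can be generated."""

    def __init__(self, initial_fps: float = 30.0):
        self.fps = initial_fps   # assumed starting estimate
        self._frames = 0
        self._elapsed = 0.0

    def record(self, frame_seconds: float) -> None:
        """Update the estimate after a frame took `frame_seconds` to generate."""
        self._frames += 1
        self._elapsed += frame_seconds
        if self._elapsed > 0:
            self.fps = self._frames / self._elapsed


def frames_for_phoneme(duration_s: float, fps: float) -> int:
    """Number of frames to generate within one phoneme duration.

    Never less than 1, so that every phoneme shows at least its key frame.
    """
    return max(1, round(duration_s * fps))
```

In use, `record` would be called after each rendered frame, and the updated `fps` passed to `frames_for_phoneme` when planning the next phoneme.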

The intermediate frames may be interpolated using Equation 1 below.

[Equation 1]

In Equation 1, f_i and f_{i+1} denote the action unit intensities of the adjacent key frames, f_j denotes the action unit intensity of an intermediate frame between f_i and f_{i+1}, k denotes the intermediate frame index, and N denotes the total number of frames that can be generated within the duration of one phoneme. Once the action unit intensity of an intermediate frame is obtained from Equation 1, the intermediate frame is easily obtained by changing the mouth shape of the previous key frame by that intensity.
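Equation 1 itself is not reproduced in this text, so its exact form cannot be shown here. The sketch below uses an assumed power-law easing, f_j = f_i + (f_{i+1} - f_i) * (k / N) ** rho, chosen only because it matches the stated behaviour (slow change at first, faster toward the end) when rho > 1. The formula, the role given to `rho`, and the function name are assumptions, not the patent's equation.

```python
def intermediate_intensities(f_i: float, f_next: float, n_frames: int,
                             rho: float = 2.0) -> list[float]:
    """AU intensities for frames k = 0 .. n_frames - 1 of one phoneme.

    Frame 0 is the key frame itself (intensity f_i), so the key frame
    appears at the very start of the phoneme. The power-law easing is an
    assumed stand-in for Equation 1.
    """
    if n_frames < 1:
        raise ValueError("each phoneme needs at least one frame")
    return [f_i + (f_next - f_i) * (k / n_frames) ** rho for k in range(n_frames)]
```

With `rho=2.0`, going from intensity 0.0 to 1.0 over four frames yields steps of 0.0625, 0.1875, and 0.3125, so the mouth shape changes slowly at first and faster toward the next key frame.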

FIG. 2 is a block diagram illustrating the lip-sync apparatus for speech synthesis according to the present invention.

The apparatus consists broadly of a face synthesis unit (200) and a speech synthesis unit (TTS) (260). The face synthesis unit (200) comprises a face data input unit (210), a face data loader (220), a video synthesis unit (230), a text input unit (250), and a synchronization controller (270). The video synthesis unit (230), which actually realizes the lip sync, comprises a syllable analyzer (232), a key frame generator (234), a face video synthesizer (236), a screen output unit (238), and a viseme table (240).

Referring to FIG. 2, the face data input unit (210) inputs the vertices and other data needed to represent a given person's face with computer graphics. The face data loader (220) performs face synthesis initialization, in which the face data from the face data input unit (210) is actually rendered on screen using a three-dimensional coordinate file, a connectivity file, and the like. Meanwhile, the text input unit (250) inputs the text to be spoken by the synthesized face, and the speech synthesis unit (TTS: text-to-speech) (260) synthesizes that text into speech.

The video synthesis unit (230) receives the face data from the face data loader (220) and obtains from the speech synthesis unit (260), through the synchronization controller (270) and syllable by syllable, at least the codes for the phonemes constituting each syllable (initial consonant, medial vowel, final consonant, and blank) and the duration information for each syllable. The synchronization controller (270) passes this data from the speech synthesis unit (260) to the video synthesis unit (230) one syllable at a time, at a predetermined timing.

Specifically, in the video synthesis unit (230), the syllable analyzer (232) obtains from the speech synthesis result, through the synchronization controller (270) and syllable by syllable, at least the codes for the phonemes constituting each syllable and the duration information for each syllable. The syllable analyzer (232) allocates a phoneme duration to each of the phonemes within the syllable duration. The method of allocating the phoneme durations is as described above.

The key frame generator (234) refers to the viseme table (240), which predefines mouth shape information for Korean pronunciation, and generates, within each phoneme duration, at least one key frame representing the mouth shape of each phoneme. A preferred example of the viseme table (240) is shown in Table 1.

The face video synthesizer (236) takes the key frames generated by the key frame generator (234) and creates a predetermined number of consecutive intermediate frames by interpolation between adjacent key frames. The key frames and intermediate frames together form a video whose mouth shape is synchronized with the speech synthesis result. The face video synthesizer (236) composites the generated video onto the face data, and the screen output unit (238) realizes lip sync by outputting the composited video in synchronization with the speech synthesis result from the speech synthesis unit (260).

FIG. 3 illustrates the phoneme duration allocation and synchronization method; the horizontal axis represents the duration of one syllable and the vertical axis represents the change in action unit (AU) intensity.

Within the duration of one syllable, the phonemes constituting the syllable, that is, the initial consonant, the medial vowel, and the final consonant, are each allocated a predetermined duration. When each phoneme is synchronized with its allocated duration, the action unit intensity of the phoneme changes within that phoneme duration. As shown in FIG. 3, within each phoneme duration the action unit intensity of the phoneme is large at the beginning and gradually decreases afterwards; this behaviour is determined by ρ in Equation 1.

As described above, the lip-sync method and apparatus for speech synthesis according to the present invention provide lip sync suited to the Korean pronunciation structure, making it possible to implement a virtual character agent that speaks like a person. Once such an agent is implemented, a human interface in which interacting with a computer feels like dealing with another person can ultimately be realized. The invention can be applied in many fields, including cyber news desks, unmanned guidance systems, virtual actors, and video telecommunication.

Claims

1. A lip-sync method for speech synthesis, which receives still image data representing a human face, shape model data, and arbitrary text, and lip-syncs a speech synthesis result from the text to the face, the method comprising:

(a) obtaining, from the speech synthesis result and syllable by syllable, at least codes for the phonemes constituting each syllable and duration information for each syllable;

(b) allocating a phoneme duration to each of the phonemes within the syllable duration;

(c) generating, within each phoneme duration, at least one video key frame representing a mouth shape for each of the phonemes by referring to a predetermined reference table that defines mouth shape information for each phoneme;

(d) creating a predetermined number of consecutive intermediate frames by interpolation between adjacent key frames to generate a video having a mouth shape synchronized with the speech synthesis result, and compositing the video onto the still image data; and

(e) outputting the composited video in synchronization with the speech synthesis result.

2. The method of claim 1, wherein in step (b) the phoneme durations are allocated at least such that: a syllable is divided into an initial consonant, a medial vowel, and a final consonant, with durations in the order medial > initial > final; a compound vowel, if present, is divided into two consecutive simple vowels, a first vowel and a second vowel, with durations in the order second vowel > first vowel; and consonants are divided into labial sounds, for which the lips close, and non-labial sounds, with durations in the order labial > non-labial.

3. The method of claim 1, wherein the predetermined reference table defines, for each phoneme included in a Korean phoneme pattern classified into a predetermined number of basic vowels used to represent one or more vowels, a blank, and consonants divided into labial sounds, for which the lips close, and non-labial sounds, predetermined mouth shape information including at least an action unit intensity indicating the degree of mouth opening.

4. The lip sync method for Korean speech synthesis according to claim 3, wherein in step (d) the intermediate frames are interpolated using the following equation, where f_i and f_{i+1} are the action unit strengths of adjacent key frames, f_j is the action unit strength of an intermediate frame between f_i and f_{i+1}, k is the intermediate frame index, and N is the total number of frames that can be generated within the duration of one phoneme.

[Equation 1]
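Equation 1 itself is not reproduced in this text. One form consistent with the symbol definitions is linear interpolation, f_j = f_i + (f_{i+1} - f_i) * k / N; this is an assumption, not the patent's confirmed formula. A minimal sketch under that assumption, with a hypothetical function name:

```python
def interpolate_strengths(f_i, f_next, n_frames):
    """Return action unit strengths for intermediate frames k = 1 .. N-1.

    Assumes linear interpolation, f_j = f_i + (f_next - f_i) * k / N, which is
    a guess at Equation 1 rather than the patent's confirmed formula.
    """
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    return [f_i + (f_next - f_i) * k / n_frames for k in range(1, n_frames)]
```

With N = 4 and key-frame strengths 0.0 and 1.0, this yields three intermediate strengths, 0.25, 0.5, and 0.75, which would drive the mouth opening of the frames between the two key frames.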

5. The lip sync method for Korean speech synthesis according to claim 1 or 4, wherein, when intermediate frames are generated in step (d), at least a frame generation rate is calculated each time a frame is generated and is reflected in the number of frames to be generated within the duration of the next phoneme.
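The feedback described here can be sketched as a small helper that measures how fast frames are actually being produced and sizes N for the next phoneme accordingly. The class name `FrameBudget`, the running-average rule, and the initial rate are assumptions; the claim only states that the measured rate is reflected in the next phoneme's frame count.

```python
class FrameBudget:
    """Track the measured frame generation rate and size the next phoneme's N."""

    def __init__(self, initial_fps=15.0):
        # Assumed starting rate, used until the first frame has been timed.
        self.fps = initial_fps
        self._frames = 0
        self._elapsed = 0.0

    def record_frame(self, seconds):
        """Update the rate after one frame took `seconds` to generate."""
        self._frames += 1
        self._elapsed += seconds
        if self._elapsed > 0:
            # Running average over all frames generated so far.
            self.fps = self._frames / self._elapsed

    def frames_for(self, phoneme_seconds):
        """Number of frames that fit within the next phoneme's duration."""
        return max(1, int(phoneme_seconds * self.fps))
```

In use, the renderer would call `record_frame` after each frame and `frames_for` before starting the next phoneme, so that a slow machine draws fewer intermediate frames instead of falling behind the audio.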

6. A lip sync apparatus for speech synthesis, which receives still image data representing a human face, shape model data, and arbitrary text and lip-syncs a speech synthesis result obtained from the text onto the face, the apparatus comprising:

an input unit that obtains, from the speech synthesis result and in units of one syllable, at least codes for the phonemes constituting each syllable and per-syllable duration information;

a syllable analysis unit that allocates a phoneme duration to each of the phonemes within each syllable duration;

a key frame generation unit that generates, within each phoneme duration, at least one key frame representing a mouth shape for each of the phonemes by referring to a predetermined reference table that defines mouth shape information for each phoneme;

a moving picture synthesis unit that generates a moving picture having a mouth shape synchronized with the speech synthesis result by creating a predetermined number of consecutive intermediate frames through interpolation between adjacent key frames, and composites the moving picture onto the still image data; and

a screen output unit that outputs the composited moving picture in synchronization with the speech synthesis result.

7. The lip sync apparatus for Korean speech synthesis according to claim 6, wherein the syllable analysis unit allocates the duration of each phoneme by dividing the at least one syllable into an initial consonant, a medial vowel, and a final consonant in the ratio medial > initial > final; dividing a diphthong into two consecutive simple vowels in the ratio rear vowel > front vowel; and dividing consonants into lip sounds and non-lip sounds in the ratio lip sound > non-lip sound.

8. The lip sync apparatus for Korean speech synthesis according to claim 6, wherein the predetermined reference table defines, for each phoneme included in Korean phoneme patterns comprising a predetermined number of basic vowels each representing one or more Korean vowel phonemes, a blank, and consonants divided into lip sounds, for which the lips are closed, and non-lip sounds, predetermined mouth shape information including an action unit strength that indicates at least the degree of mouth opening.

9. The lip sync apparatus for Korean speech synthesis according to claim 8, wherein the moving picture synthesis unit interpolates the intermediate frames using the following equation, where f_i and f_{i+1} are the action unit strengths of adjacent key frames, f_j is the action unit strength of an intermediate frame between f_i and f_{i+1}, k is the intermediate frame index, and N is the total number of frames that can be generated within the duration of one phoneme.

[Equation 1]

10. The lip sync apparatus according to claim 6 or 9, wherein the moving picture synthesis unit calculates at least a frame generation rate each time an intermediate frame is generated and reflects it in the number of frames to be generated within the duration of the next phoneme.