KR20210094422A

KR20210094422A - Method and apparatus for generating speech animation

Info

Publication number: KR20210094422A
Application number: KR1020200008187A
Authority: KR
Inventors: 노준용; 장민정; 정선진
Original assignee: 한국과학기술원
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2021-07-29

Abstract

Disclosed are a method and an apparatus for generating a speech animation in which an avatar utters an input script, taking into account the characteristics of the language and its phonological phenomena. An embodiment includes the steps of: receiving a script and a voice signal corresponding thereto; generating a grapheme sequence and timing information; generating a phoneme sequence by applying a coarticulation rule determined according to adjacent final and initial consonants included in the grapheme sequence; and generating a speech animation by changing the mouth shape and the tongue shape of the avatar based on the phoneme sequence and the timing information.

Description

METHOD AND APPARATUS FOR GENERATING SPEECH ANIMATION

The embodiments below relate to a method and apparatus for generating a speech animation.

Speech animation technology generates animation by deforming the shape of a character's mouth according to audio. As speech animation is used across the cultural industries, including film, animation, games, and education, there is demand for technology that makes a character appear to actually speak in response to audio. This calls for techniques that generate animation in which the character's mouth moves in sync with the audio, and techniques that generate animation in which the character's mouth moves similarly to the speech organs of a real person speaking.

Embodiments may provide a technique for generating an animation including an avatar uttering Korean, based on a script and a voice signal corresponding thereto.

In addition, embodiments may provide a technique for deforming the mouth shape and tongue shape of an avatar uttering Korean by reflecting phonological phenomena in which the pronunciation of a grapheme within a phonological phrase changes according to the adjacent graphemes.

A method of generating a speech animation according to one aspect includes: receiving a script and a voice signal corresponding to the script; generating, based on the script, a grapheme sequence including at least one consonant and at least one vowel; generating, based on the voice signal, timing information including the timing of each of the at least one consonant and the at least one vowel included in the grapheme sequence; extracting a final consonant and an initial consonant that are adjacent to each other from the grapheme sequence; determining, based on the final consonant and the initial consonant, at least one coarticulation rule and an order of the at least one coarticulation rule; generating a phoneme sequence by updating the grapheme sequence through applying the at least one coarticulation rule to the final consonant and the initial consonant in the determined order; and generating a speech animation by deforming a mouth shape and a tongue shape of an avatar based on the timing information and the phoneme sequence.

The determining of the at least one coarticulation rule and the order of the at least one coarticulation rule may include: when the initial consonant does not correspond to a predetermined consonant, selecting at least one coarticulation rule corresponding to the final consonant; and when the initial consonant corresponds to the predetermined consonant, selecting at least one coarticulation rule corresponding to the initial consonant.

The extracting of the adjacent final consonant and initial consonant from the grapheme sequence may include: sequentially reading the graphemes included in the corresponding phonological phrase while extracting whether each grapheme is a consonant and the position of that consonant; and when the position of the extracted consonant is the final position, extracting the consonant that follows the extracted consonant.
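
The extraction step above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the phonological phrase is already available as a list of `(grapheme, position)` pairs, and the function name `adjacent_final_initial_pairs` and the position labels `"initial"`, `"medial"`, and `"final"` are hypothetical.

```python
def adjacent_final_initial_pairs(phrase):
    """Return (index, final, initial) for every final consonant that is
    immediately followed by an initial consonant in the same phrase.

    `phrase` is a list of (grapheme, position) tuples; the position labels
    are an assumption of this sketch.
    """
    pairs = []
    for i, (grapheme, position) in enumerate(phrase):
        if position != "final":
            continue  # vowels and initial consonants are skipped
        has_next = i + 1 < len(phrase)
        if has_next and phrase[i + 1][1] == "initial":
            pairs.append((i, grapheme, phrase[i + 1][0]))
    return pairs


# The phrase '국물' written out grapheme by grapheme.
gungmul = [
    ("ㄱ", "initial"), ("ㅜ", "medial"), ("ㄱ", "final"),
    ("ㅁ", "initial"), ("ㅜ", "medial"), ("ㄹ", "final"),
]
pairs = adjacent_final_initial_pairs(gungmul)
```

The last final consonant of a phrase has no following initial consonant, so it produces no pair, which matches the idea that rules are applied only inside one phonological phrase.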

The generating of the grapheme sequence may include: separating the script into at least one semantic unit based on a dictionary in which semantic units of the language corresponding to the script are listed; and, for each of the at least one semantic unit, arranging in order the at least one grapheme included in the semantic unit.

The generating of the speech animation may include: mapping each of the phonemes included in the phoneme sequence to a corresponding mouth shape and tongue shape; and generating the speech animation by deforming the mouth shape and the tongue shape of the avatar, based on the timing information, into the mouth shape and tongue shape mapped to each phoneme.

The generating of the timing information may include: marking a specific section of the voice signal as a pause section according to a predetermined criterion; and generating timing information corresponding to the pause section.

The generating of the speech animation may include: mapping the pause section to a corresponding mouth shape; and generating the speech animation by deforming the mouth shape of the avatar into the mouth shape mapped to the pause section, based on the timing information corresponding to the pause section.

The specific section according to the predetermined criterion may correspond to a section in which a signal at or below a predetermined voice signal strength persists for a predetermined time.
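
The pause criterion above (signal strength at or below a threshold, sustained for a minimum time) can be sketched as follows. This is only an illustration under stated assumptions: the signal is assumed to be already reduced to one strength value per frame, and the names `find_pauses`, `max_level`, and `min_frames` are hypothetical, as are the threshold values in the example.

```python
def find_pauses(frame_levels, max_level, min_frames):
    """Return (start_frame, end_frame) intervals, end exclusive, in which the
    level stays at or below `max_level` for at least `min_frames` frames."""
    pauses = []
    start = None
    # A sentinel louder than any threshold closes a run that reaches the end.
    for i, level in enumerate(list(frame_levels) + [float("inf")]):
        if level <= max_level:
            if start is None:
                start = i  # a quiet run begins here
        else:
            if start is not None and i - start >= min_frames:
                pauses.append((start, i))
            start = None
    return pauses


levels = [0.5, 0.01, 0.02, 0.01, 0.6, 0.01, 0.7, 0.0, 0.0]
pauses = find_pauses(levels, max_level=0.05, min_frames=2)
```

Working in frame counts rather than seconds keeps the duration comparison exact; multiplying the frame indices by the frame length gives the timing information for each pause section.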

The generating of the grapheme sequence may further include generating the grapheme sequence by mapping at least one grapheme included in the script to a predetermined code.

The coarticulation rule may correspond to a rule that changes the final consonant and the initial consonant located immediately after the final consonant to match how they are pronounced.
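
The rule selection and application described above can be sketched as follows. Everything concrete here is an assumption made for illustration: the set of predetermined consonants (taken to be only the placeholder 'ㅇ'), the two toy rules, and the names `select_rules` and `apply_rules` do not come from the specification. The two rules are well-known Korean pronunciation changes: nasalization of final 'ㄱ' before a nasal (국물 is pronounced 궁물) and liaison of a final consonant onto a following placeholder 'ㅇ' (국이 is pronounced 구기).

```python
PREDETERMINED_INITIALS = {"ㅇ"}  # assumption: not specified in the text


def nasalize(final, initial):
    # Final 'ㄱ' before a nasal initial is pronounced 'ㅇ'.
    if final == "ㄱ" and initial in ("ㄴ", "ㅁ"):
        return "ㅇ", initial
    return final, initial


def liaison(final, initial):
    # A final consonant before placeholder 'ㅇ' is pronounced as the next
    # initial; a final 'ㅇ' itself stays in place.
    if initial == "ㅇ" and final != "ㅇ":
        return "", final
    return final, initial


# Ordered rule lists; the list order is the order of application.
RULES_BY_FINAL = {"ㄱ": [nasalize]}
RULES_BY_INITIAL = {"ㅇ": [liaison]}


def select_rules(final, initial):
    """Pick rules by the initial consonant if it is a predetermined
    consonant, otherwise by the final consonant."""
    if initial in PREDETERMINED_INITIALS:
        return RULES_BY_INITIAL.get(initial, [])
    return RULES_BY_FINAL.get(final, [])


def apply_rules(final, initial):
    for rule in select_rules(final, initial):
        final, initial = rule(final, initial)
    return final, initial
```

For example, `apply_rules("ㄱ", "ㅁ")` yields the pair for 궁물, while a pair with no matching rule is returned unchanged. A full rule set would need many more entries and would have to handle double final consonants, which this sketch ignores.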

An apparatus for generating a speech animation according to one aspect includes: at least one processor configured to receive a script and a voice signal corresponding to the script and generate a grapheme sequence and timing information including the timing of each of at least one consonant and at least one vowel included in the grapheme sequence, extract a final consonant and an initial consonant that are adjacent to each other from the grapheme sequence and determine at least one coarticulation rule based on the final consonant and the initial consonant and an order of the at least one coarticulation rule, generate a phoneme sequence by updating the grapheme sequence through applying the at least one coarticulation rule to the final consonant and the initial consonant in the determined order, and generate a speech animation by deforming a mouth shape and a tongue shape of an avatar based on the timing information and the phoneme sequence; and a memory configured to store the grapheme sequence, the phoneme sequence, the timing information, and mapping information of the tongue shape and mouth shape corresponding to at least one phoneme.

To determine the at least one coarticulation rule and the order of the at least one coarticulation rule, the processor may select at least one coarticulation rule corresponding to the final consonant when the initial consonant does not correspond to a predetermined consonant, and may select at least one coarticulation rule corresponding to the initial consonant when the initial consonant corresponds to the predetermined consonant.

To extract the adjacent final consonant and initial consonant from the grapheme sequence, the processor may sequentially read the graphemes included in the corresponding phonological phrase while extracting whether each grapheme is a consonant and the position of that consonant, and, when the position of the extracted consonant is the final position, may extract the consonant that follows the extracted consonant.

To generate the grapheme sequence, the processor may separate the script into at least one semantic unit based on a dictionary in which semantic units of the language corresponding to the script are listed, and, for each of the at least one semantic unit, may arrange in order the at least one grapheme included in the semantic unit.

To generate the speech animation, the processor may map each of the phonemes included in the phoneme sequence to a corresponding mouth shape and tongue shape, and may generate the speech animation by deforming the mouth shape and the tongue shape of the avatar, based on the timing information, into the mouth shape and tongue shape mapped to each phoneme.

The script may include a script containing at least one syllable written in Hangul, and the voice signal may include a voice signal in which the script is uttered in Korean.

To generate the timing information, the processor may mark a specific section of the voice signal as a pause section according to a predetermined criterion and generate timing information corresponding to the pause section. To generate the speech animation, the processor may map the pause section to a corresponding mouth shape and generate the speech animation by deforming the mouth shape of the avatar into the mouth shape mapped to the pause section, based on the timing information corresponding to the pause section.

To generate the grapheme sequence, the processor may map at least one grapheme included in the script to a predetermined code.

FIG. 1 is a diagram illustrating the overall operation flow of a speech animation generation system according to an embodiment.
FIG. 2 is a diagram illustrating an example of a customized dictionary input to an acoustic model according to an embodiment.
FIG. 3 is a diagram illustrating an example of a result of a forced alignment operation according to an embodiment.
FIG. 4 is a diagram illustrating an example in which coarticulation rules are determined differently depending on the grapheme according to an embodiment.
FIGS. 5 to 7 are diagrams illustrating tongue shapes and mouth shapes corresponding to the phonemes of Korean according to an embodiment.
FIG. 8 is a diagram illustrating a mouth shape corresponding to a pause according to an embodiment.
FIG. 9 is a diagram illustrating a method of generating a speech animation according to an embodiment.

The specific structural or functional descriptions disclosed in this specification are given only for the purpose of describing embodiments according to the technical concept. The embodiments may be implemented in various other forms and are not limited to the embodiments described herein.

Terms such as first and second may be used to describe various components, but these terms should be understood only as distinguishing one component from another. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present. Expressions describing the relationship between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify that the stated features, numbers, steps, operations, components, parts, or combinations thereof are present, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

FIG. 1 is a diagram illustrating the overall operation flow of a speech animation generation system according to an embodiment.

Referring to FIG. 1, a speech animation generation system according to an embodiment may consist of an input stage and a speech animation generation stage. The input stage may correspond to inputting speech audio 101 and script 102 to forced alignment model 110, separating script 102 into morphemes and graphemes, and extracting timing information corresponding to each morpheme and grapheme based on speech audio 101. The animation generation stage may correspond to inputting the timing information extracted in the input stage to coarticulation model 120, extracting phoneme information for each grapheme, and generating a speech animation by mapping each phoneme to a mouth shape and a tongue shape. The embodiments described below may correspond to a system that generates Korean speech animation based on Korean speech audio and a Hangul script.

A script according to an embodiment may consist of characters, each composed of at least one consonant and at least one vowel. One character of the script may correspond to one syllable. Depending on the syllable, a syllable may be divided by sound position into an initial and a medial, or into an initial, a medial, and a final. A corresponding grapheme may exist for each initial, medial, and final.

A grapheme according to an embodiment is a unit of the characters used to write a particular language and may correspond to a consonant or a vowel. For example, a Hangul grapheme may be a single consonant such as 'ㄱ' or 'ㄴ', or a single vowel such as 'ㅏ' or 'ㅓ'. A phoneme according to an embodiment means the pronunciation corresponding to a grapheme when the phonological phrase containing the graphemes is actually pronounced, and a phoneme may be written with a symbol representing the pronunciation of the grapheme. For example, the graphemes making up the script '국물' are 'ㄱ', 'ㅜ', 'ㄱ', 'ㅁ', 'ㅜ', 'ㄹ'. Because '국물' is pronounced '궁물', the phonemes of '국물' can be written as 'ㄱ', 'ㅜ', 'ㅇ', 'ㅁ', 'ㅜ', 'ㄹ'.
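
As an illustration of how a script such as '국물' can be split into its graphemes, the sketch below uses the standard Unicode arithmetic for precomposed Hangul syllables (code points U+AC00 to U+D7A3). The specification does not say how decomposition is performed, so this is one possible approach; the function name `decompose` and the position labels are hypothetical.

```python
# Jamo tables in the order Unicode uses for precomposed Hangul syllables.
INITIALS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
MEDIALS = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(text):
    """Split Hangul syllables into (grapheme, position) pairs."""
    graphemes = []
    for ch in text:
        index = ord(ch) - 0xAC00
        if not 0 <= index < 19 * 21 * 28:
            continue  # skip anything that is not a precomposed syllable
        initial, rest = divmod(index, 21 * 28)
        medial, final = divmod(rest, 28)
        graphemes.append((INITIALS[initial], "initial"))
        graphemes.append((MEDIALS[medial], "medial"))
        if final:  # index 0 means the syllable has no final consonant
            graphemes.append((FINALS[final], "final"))
    return graphemes


letters = [g for g, _ in decompose("국물")]
```

Keeping the position label next to each grapheme makes it straightforward to tell a final 'ㄱ' from an initial 'ㄱ' in later steps.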

The pronunciation of a consonant or vowel may be determined by the mouth shape and/or the tongue shape. For example, in Korean, consonants other than the bilabials (/ㅂ/, /ㅃ/, /ㅍ/, /ㅁ/) are affected by the position and shape of the tongue, while vowels are affected mainly by the mouth shape. In terms of syllable structure, only a vowel can form the nucleus of a Korean syllable, so the vowel determines the degree of mouth opening, whether the lips protrude, the position of the tongue, and so on. A consonant may occur at most once in the initial position and at most once in the final position, and some syllables have no final. Korean vowels are divided into monophthongs and diphthongs. A monophthong is a vowel produced with a single articulatory gesture from beginning to end, and a diphthong is a vowel produced with two articulatory gestures by combining a semivowel with a monophthong. A diphthong can be pronounced by changing sequentially from the mouth shape and tongue shape corresponding to the semivowel to the mouth shape and tongue shape corresponding to the monophthong. Korean has two semivowels, /j/ and /w/, and the diphthongs formed with them can be called /j/-series and /w/-series diphthongs, respectively. For example, /ㅢ/ can be regarded as a falling diphthong and classified as a /j/-series diphthong. In addition, vowels can be classified as unrounded or rounded according to whether the lips protrude, as front or back according to which part of the palate the tongue is close to, and as high, mid, or low according to the height of the tongue.
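
The sequential change from a semivowel shape to a monophthong shape can be sketched as follows. The table lists rising diphthongs whose decomposition is standard Korean phonology; 'ㅢ', 'ㅚ', and 'ㅟ' are deliberately left out because their analysis varies. The function name `vowel_targets` and the `glide_ratio` parameter (the share of the vowel's duration given to the semivowel) are assumptions of this sketch, not values from the specification.

```python
# Semivowel + monophthong decomposition of rising diphthong letters.
DIPHTHONGS = {
    "ㅑ": ("j", "ㅏ"), "ㅕ": ("j", "ㅓ"), "ㅛ": ("j", "ㅗ"), "ㅠ": ("j", "ㅜ"),
    "ㅒ": ("j", "ㅐ"), "ㅖ": ("j", "ㅔ"),
    "ㅘ": ("w", "ㅏ"), "ㅝ": ("w", "ㅓ"), "ㅙ": ("w", "ㅐ"), "ㅞ": ("w", "ㅔ"),
}


def vowel_targets(vowel, start, end, glide_ratio=0.3):
    """Return (shape_key, start, end) targets for one vowel interval.

    A monophthong keeps a single target; a diphthong is split into a
    semivowel target followed by a monophthong target.
    """
    if vowel not in DIPHTHONGS:
        return [(vowel, start, end)]
    glide, monophthong = DIPHTHONGS[vowel]
    split = start + (end - start) * glide_ratio
    return [(glide, start, split), (monophthong, split, end)]


targets = vowel_targets("ㅘ", 0.0, 1.0, glide_ratio=0.5)
```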

The input stage according to an embodiment may correspond to preprocessing the input data for speech animation generation. Forced alignment model 110 according to an embodiment may correspond to a model that receives speech audio 101 and script 102, trains an acoustic model, and performs a forced alignment operation using the trained acoustic model. The forced alignment model according to an embodiment may include a model such as the Montreal Forced Aligner. The forced alignment operation according to an embodiment may correspond to matching the timing of the audio and the script by extracting alignment information that includes information on the graphemes making up script 102 and information on which grapheme a given frame of speech audio 101 corresponds to.

Forced alignment model 110 according to an embodiment may train an acoustic model for a language based on a speech recognition tool (for example, Kaldi), taking into account phonological context, phoneme-specific characteristics, speaker characteristics, and the like. To train the acoustic model according to an embodiment, the training data input to forced alignment model 110 may include a large corpus and a dictionary. The dictionary according to an embodiment may include a customized dictionary (A) listing the words used to perform the forced alignment operation.

Inputting the script to forced alignment model 110 according to an embodiment may include tokenizing the script based on the words listed in the dictionary. The tokenization according to an embodiment may include word tokenization, which separates the script at spaces, and morpheme tokenization, which separates each space-delimited word into a preceding sentence constituent, an ending, and a particle. According to an embodiment, in addition to the customized dictionary (A) for forced alignment, a customized dictionary (B) for performing morpheme tokenization may additionally be input. The customized dictionary (B) for morpheme tokenization according to an embodiment may correspond to a dictionary built so that the graphemes contained in the input script are preserved without loss.

In the dictionary (B) according to an embodiment, verbs and adjectives may be listed in stem form. In addition, when the form of a stem changes with conjugation, the modified stems may also be listed in the dictionary. For example, for the adjective '어렵다', the stem '어렵' and the stem variants '어려', '어려우', '어려워', and so on may be listed in the dictionary (B).
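
One simple way to use such a stem list is longest-prefix matching, sketched below. The specification does not describe the matching strategy, so this is an assumption, as are the function name `tokenize` and the tiny stem set, which contains only the forms given in the example above. A real tokenizer would also separate endings from particles.

```python
STEMS = {"어렵", "어려", "어려우", "어려워"}  # forms from the example above


def tokenize(script, stems):
    """Split a script at spaces, then split each word into the longest
    listed stem and the remainder. No grapheme is dropped or altered."""
    tokens = []
    for word in script.split():
        candidates = [s for s in stems if word.startswith(s)]
        stem = max(candidates, key=len, default=None)
        if stem is None or stem == word:
            tokens.append(word)  # unknown word, or the word is the stem
        else:
            tokens.extend([stem, word[len(stem):]])
    return tokens


tokens = tokenize("어려워서 어렵다", STEMS)
```

Because every token is a literal slice of the input, joining the tokens of a word reproduces the word exactly, which reflects the requirement that the graphemes of the script be preserved without loss.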

According to an embodiment, the list of words in the customized dictionary (B) for morpheme tokenization may be processed for input to the forced alignment model. In the customized dictionary (A) according to an embodiment, the words listed in the customized dictionary (B) may be listed after being converted into alphabetic codes. Converting the words into alphabetic codes may correspond to separating each word into consonants and vowels, mapping each grapheme of the word (its consonants and vowels) to the corresponding alphabetic code, and arranging the codes in the order of the graphemes. For example, processing the word '가' for listing in the customized dictionary (A) may correspond to separating '가' into the graphemes 'ㄱ' and 'ㅏ', mapping 'ㄱ' to the code 'K0' and 'ㅏ' to the code 'AA', and arranging them in grapheme order as 'K0 AA'. See FIG. 2 for the form of the customized dictionary (A) according to an embodiment. In FIG. 2, the left column shows the words and the right column shows each word spelled out in grapheme units.
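
The conversion of a word into a dictionary (A) entry can be sketched as follows. Only the two codes stated in the text ('ㄱ' to 'K0', 'ㅏ' to 'AA') are included; the full code table is not reproduced here, so any other grapheme raises an error rather than receiving a guessed code. The function name `dictionary_entry` and the tab-separated line layout are assumptions of this sketch.

```python
# Grapheme-to-code table; only the two codes given in the text are known.
GRAPHEME_CODES = {"ㄱ": "K0", "ㅏ": "AA"}


def dictionary_entry(word, graphemes):
    """Build one 'word<TAB>code code ...' line from a word and its
    graphemes listed in order."""
    codes = [GRAPHEME_CODES[g] for g in graphemes]  # KeyError if unknown
    return word + "\t" + " ".join(codes)


entry = dictionary_entry("가", ["ㄱ", "ㅏ"])
```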

The corpus according to an embodiment may consist of speech audio and scripts obtained from a plurality of speakers. The scripts making up the corpus according to an embodiment may be processed by morpheme tokenization using the customized dictionary (B) and by mapping the tokenized morphemes to alphabetic codes. The corpus containing the processed scripts may be used, together with the customized dictionary (A), as training data for the acoustic model.

Forced alignment model 110 according to an embodiment may process the received speech audio 101 and script 102 and perform an alignment operation that matches the timing of the audio and the script. See FIG. 3 for a result of performing the forced alignment operation according to an embodiment.

Referring to FIG. 3, the result of the forced alignment operation may include a speech waveform 310 and a spectrogram 320 corresponding to the audio, and the spectrogram 320 may include information on prosodic elements such as pitch and intensity. The result may also include timing information 330 in units of morpheme tokens delimited by vertical lines, and timing information 340 in units of graphemes delimited by vertical lines. For example, in FIG. 3, one morpheme unit (p0yeoll) may correspond to a morpheme obtained by tokenizing the script, and the grapheme units corresponding to that morpheme unit (P0, YEO, LL) may correspond to the graphemes making up the morpheme. In addition, the result of the forced alignment operation may include timing information on pauses.

The animation generation stage according to an embodiment may correspond to applying the result of the forced alignment operation performed in the input stage to coarticulation model 120 to extract phoneme information, and generating a speech animation by mapping each phoneme, based on the timing information, to its corresponding mouth shape and tongue shape. Coarticulation model 120 according to an embodiment is a model defined on the basis of coarticulation rules and may include a Grapheme-to-Phoneme (G2P) model. A coarticulation rule according to an embodiment is a rule on how a particular character is pronounced, and may correspond to a rule that changes a particular grapheme into a phoneme.

The coarticulation rules according to an embodiment may include a plurality of rules; at least one coarticulation rule may be selected, and the order of application determined, based on the environment of the phonological phenomenon. Generating a speech animation by mapping each phoneme to its corresponding mouth shape and tongue shape may correspond to generating an animation of a model (for example, an avatar) uttering the script 101, by changing the mouth shape and tongue shape of the model to those corresponding to each phoneme according to the timing information of that phoneme. Hereinafter, the mouth shape and tongue shape corresponding to each phoneme may be referred to as a 'viseme' or a 'viseme model'.

According to an embodiment, the coarticulation rules may be applied in units of phonological phrases, each including a plurality of graphemes. A phonological phrase according to an embodiment includes at least one syllable and may correspond to the minimum unit that determines the intonation of a sentence, the unit to which phonological rules are applied, and the unit in which phonological phenomena occur. A phonological phenomenon according to an embodiment may be defined as a phenomenon in which pronunciation changes because of adjacent graphemes. That is, a phonological phenomenon according to an embodiment occurs within a phonological phrase: even if an initial consonant immediately follows a final consonant, the phonological phenomenon may not occur when the final consonant and the initial consonant do not belong to the same phonological phrase. When no phonological phenomenon occurs because the adjacent final and initial consonants belong to different phonological phrases, the coarticulation rules may not be applied. In other words, the coarticulation model 120 according to an embodiment may be applied within each phonological phrase. The phonological phrases in a grapheme sequence according to an embodiment may be determined from the output of the forced speech alignment operation based on the timing information, in particular the timing information about pauses.
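The phrase segmentation described above can be illustrated with a minimal sketch. It assumes that the aligner output is a list of `(symbol, start, end)` tuples and that pauses appear as a marker symbol; the `<pause>` marker and the function name `split_into_phrases` are hypothetical and do not come from the embodiment.

```python
from typing import List, Tuple

# One aligned unit: (symbol, start time in seconds, end time in seconds).
Timed = Tuple[str, float, float]

# Hypothetical marker for a pause interval in the alignment output.
PAUSE = "<pause>"


def split_into_phrases(aligned: List[Timed]) -> List[List[Timed]]:
    """Cut the aligned grapheme sequence into phonological phrases at pauses."""
    phrases: List[List[Timed]] = []
    current: List[Timed] = []
    for item in aligned:
        if item[0] == PAUSE:
            # A pause closes the current phrase; empty phrases are skipped.
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(item)
    if current:
        phrases.append(current)
    return phrases


aligned = [
    ("P0", 0.0, 0.1), ("YEO", 0.1, 0.25), ("LL", 0.25, 0.4),
    (PAUSE, 0.4, 1.5),
    ("K0", 1.5, 1.6), ("AA", 1.6, 1.8),
]
phrases = split_into_phrases(aligned)
```

Coarticulation rules would then be applied separately inside each returned phrase, so a final consonant before a pause never interacts with the initial consonant after it.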

When the script is read from left to right, a phonological phenomenon may occur between a final consonant and the initial consonant that follows it, among adjacent graphemes belonging to the same phonological phrase. In other words, of two adjacent characters, the final consonant of the left character may change its pronunciation because of the initial consonant of the right character; conversely, the initial consonant of the right character may change its pronunciation because of the final consonant of the left character; and both may change under each other's influence. For example, in '국물', the final consonant 'ㄱ' of '국' may be pronounced as 'ㅇ' because of the initial consonant 'ㅁ' of '물'. That is, the coarticulation rules according to an embodiment may be applied to a final consonant and the initial consonant that follows it within a phonological phrase in which a phonological phenomenon occurs.

When a speech animation according to an embodiment is generated, phonemes are not mapped one-to-one to visemes; instead, phonemes whose mouth shape and tongue shape show no visibly meaningful difference when pronounced may be grouped into one class and mapped to the same viseme. In other words, when a phonological phenomenon changes a consonant into another consonant with the same tongue shape and mouth shape, the coarticulation rule may not be applied. For example, when the consonant 'ㅂ' is pronounced as 'ㅍ' because of a phonological phenomenon, the consonant changes in actual pronunciation, but the mouth shape and tongue shape corresponding to 'ㅍ' are the same as those corresponding to 'ㅂ', so the coarticulation rule may not be applied.
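As a rough illustration of this check, the sketch below skips a rule whenever the consonant before and after the change falls in the same viseme class. Only the fact that 'ㅂ' and 'ㅍ' share a viseme comes from the description; the class names, the entries for 'ㄷ' and 'ㅈ', and the function name are assumptions made for the example (the actual classes are those of FIG. 5).

```python
# Hypothetical viseme classes. Only the 'ㅂ'/'ㅍ' grouping is stated in the
# description; the other entries and all class names are illustrative.
VISEME_CLASS = {
    "ㅂ": "class_lips_closed",
    "ㅍ": "class_lips_closed",
    "ㄷ": "class_tongue_tip",
    "ㅈ": "class_tongue_blade",
}


def rule_changes_viseme(before: str, after: str) -> bool:
    """Return True only if the pronunciation change is visible on the avatar."""
    return VISEME_CLASS[before] != VISEME_CLASS[after]


# 'ㅂ' -> 'ㅍ' keeps the same mouth and tongue shape, so the rule can be skipped.
skip_aspiration = not rule_changes_viseme("ㅂ", "ㅍ")
```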

The coarticulation rules according to an embodiment may include at least one rule. For example, the coarticulation rules may include phonological rules from the Standard Pronunciation Rules of Korean. More specifically, they may include the nine phonological rules that, depending on the embodiment, must be applied without omission (Article 12, pronunciation of the final consonant 'ㅎ'; Articles 8 and 9, syllable-final neutralization; Articles 10 and 11, consonant cluster simplification; Article 19, 'ㄹ' nasalization; Article 20, exception to 'ㄹ' nasalization; Article 20, liquidization; Article 17, palatalization; Article 29, 'ㄴ' insertion; Article 12, aspiration, excluding the part that overlaps with the pronunciation of the final consonant 'ㅎ') and three liaison rules (Articles 13, 14, and 15). That is, depending on the embodiment, not all phonological rules of the Standard Pronunciation Rules are included, and some rules may be excluded.

FIG. 4 is a diagram illustrating an example in which the coarticulation rules are determined differently depending on each grapheme.

Referring to FIG. 4, in the coarticulation rules according to an embodiment, rules are selected according to the identity of the final consonant; when a plurality of rules are selected, the order in which they are applied may be determined; and whether each selected rule is actually applied may be determined according to the initial consonant that follows the final consonant. In this case, phonological rules or liaison rules are selected according to the final consonant, and the order of application may follow the order indicated by the arrows. According to an embodiment, whether each of the rules selected according to the final consonant is applied may be decided according to the following initial consonant. In other words, application of a selected rule may be skipped depending on the following initial consonant. For example, when the final consonant is 'ㄺ', whether the aspiration rule is applied is determined according to the following initial consonant, after which the consonant cluster simplification rule may be applied.

However, when the initial consonant following the final consonant is one of certain specific consonants, the coarticulation rules may be selected, and their order determined, based on the initial consonant rather than the final consonant.

Taking the case in which the coarticulation rules are applied as shown in FIG. 4, when the initial consonant involved in the phonological phenomenon is 'ㅇ' or 'ㅎ', rules such as the liaison rule are selected according to the initial consonant, and the order of application may also be determined according to the initial consonant. In this case, in the phonological phrase '닭이', the initial consonant involved in the phonological phenomenon is 'ㅇ', so the liaison rule is applied according to the initial consonant and the phrase may be changed to '달기'. In the phonological phrase '닭장', on the other hand, the aspiration, consonant cluster simplification, and tensification rules are selected, and their order determined, according to the final consonant 'ㄺ'; whether the aspiration rule is applied is determined according to the initial consonant 'ㅈ'; and based on that decision, either the aspiration rule is applied followed by the consonant cluster simplification rule, or the aspiration rule is skipped and only the consonant cluster simplification rule is applied. As a result, '닭장' may be changed to '닥장' by applying the coarticulation rules determined according to the final consonant 'ㄺ' and the initial consonant 'ㅈ'.

In Korean, vowels may be divided into monophthongs and diphthongs. A monophthong is a vowel produced with a single articulatory gesture from beginning to end, whereas a diphthong is a vowel produced with two articulatory gestures by combining a semivowel and a monophthong. For example, the vowel 'ㅑ' may correspond to a diphthong produced by the articulatory gestures of the semivowel /j/ and 'ㅏ'. Generating the speech animation according to an embodiment may include, for a diphthong, sequentially blending the tongue shape and mouth shape corresponding to the semivowel with the tongue shape and mouth shape corresponding to the monophthong. Unlike a monophthong, a semivowel is too short to form a syllable on its own, so its duration may be set shorter than that of the monophthong in the blending process. According to an embodiment, the duration of the semivowel may be set to half the duration of the monophthong.
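The timing split can be worked through as a small sketch. It assumes, beyond what the description states, that the semivowel and the monophthong together fill exactly the interval aligned to the diphthong; with semivowel duration s, monophthong duration m, and total duration T, the conditions s = m / 2 and s + m = T give s = T / 3. The function name is hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]


def split_diphthong(start: float, end: float) -> Tuple[Interval, Interval]:
    """Split a diphthong interval into (semivowel, monophthong) intervals.

    Assumes the two parts fill the whole interval and the semivowel lasts
    half as long as the monophthong, i.e. one third of the total.
    """
    total = end - start
    boundary = start + total / 3.0
    return (start, boundary), (boundary, end)


# A diphthong such as 'ㅑ' aligned to 0.0 s .. 0.3 s.
semivowel, monophthong = split_diphthong(0.0, 0.3)
```

The semivowel viseme would be blended in over the first interval and the monophthong viseme over the second.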

The coarticulation model 120 according to an embodiment may generate a speech animation by changing the mouth shape and tongue shape of the model to those corresponding to each phoneme, based on the sequence of phonemes generated by applying the coarticulation rules and the timing information corresponding to each phoneme. The coarticulation model 120 according to an embodiment may include a viseme model.

A phoneme according to an embodiment may be mapped to a specific mouth shape and a specific tongue shape based on its pronunciation characteristics. According to an embodiment, a plurality of phonemes may be mapped to one mouth shape, and a plurality of phonemes may be mapped to one tongue shape. In other words, when different phonemes are uttered and it is judged that there is no visually meaningful difference, chiefly in the position of the tongue tip and the shape of the tongue within the visible region, those phonemes may be classified into the same class.

Taking Korean as an example, referring to FIG. 5, the phonemes corresponding to consonants may be classified into seven classes according to the mouth shape and tongue shape with which they are pronounced; referring to FIGS. 6 and 7, the phonemes corresponding to vowels may likewise be classified into five classes by tongue shape and ten classes by mouth shape. Referring to FIG. 8, a mouth shape may also be mapped to a pause.

FIG. 9 is a diagram for explaining a method of generating a speech animation according to an embodiment.

Referring to FIG. 9, a method of generating a speech animation according to an embodiment includes: receiving a voice signal and a script (910); generating a grapheme sequence based on the script (920); generating timing information based on the voice signal (930); extracting a final consonant and an initial consonant that are adjacent to each other from the grapheme sequence (940); determining coarticulation rules and their order based on the final consonant and the initial consonant (950); generating a phoneme sequence by applying the coarticulation rules to the final consonant and the initial consonant in the determined order (960); and generating a speech animation by deforming the mouth shape and tongue shape of an avatar based on the timing information and the phoneme sequence (970).

The step 910 of receiving a voice signal and a script according to an embodiment may correspond to receiving a voice signal and a script corresponding to the voice signal. The voice signal according to an embodiment may include a voice signal in which the script is read aloud. For example, when the script according to an embodiment includes at least one syllable written in Hangul, the voice signal may correspond to a voice signal in which the script is uttered in Korean. The voice signal according to an embodiment may correspond to the speech audio 101 that is the input data of the acoustic model described above.

The step 920 of generating a grapheme sequence based on the script according to an embodiment may correspond to generating, based on the script, a grapheme sequence including at least one consonant and at least one vowel. The grapheme sequence according to an embodiment is obtained by separating the syllables of the script into individual consonants and vowels and listing them in order, and may correspond to the information on the graphemes constituting the script described above. The grapheme sequence according to an embodiment may be expressed by mapping the graphemes in the script to specific symbols. For example, the grapheme sequence generated from the script '가' may be 'K0 AA', in which 'K0', the symbol mapped to the grapheme 'ㄱ', and 'AA', the symbol mapped to the grapheme 'ㅏ', are listed in order. Hereinafter, a grapheme (a consonant or a vowel) may refer to the symbol mapped to that grapheme.
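The decomposition of syllables into graphemes can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). The function names are hypothetical, and the symbol table below contains only the two mappings given in the description ('ㄱ' to 'K0', 'ㅏ' to 'AA'); any other grapheme is passed through unchanged because the full table is not given here.

```python
from typing import List, Tuple

# Jamo tables in Unicode order for precomposed Hangul syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ",
          "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ",
          "ㅋ", "ㅌ", "ㅍ", "ㅎ"]

# Only these two symbol mappings appear in the description.
SYMBOLS = {"ㄱ": "K0", "ㅏ": "AA"}


def decompose(syllable: str) -> Tuple[str, str, str]:
    """Return (initial, medial, final) for one Hangul syllable; final may be ''."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    return (INITIALS[index // 588],
            MEDIALS[(index % 588) // 28],
            FINALS[index % 28])


def to_grapheme_sequence(text: str) -> List[str]:
    """List the graphemes of every syllable in reading order."""
    return [jamo for ch in text for jamo in decompose(ch) if jamo]


def to_symbols(text: str) -> str:
    """Map graphemes to symbols, falling back to the grapheme itself."""
    return " ".join(SYMBOLS.get(j, j) for j in to_grapheme_sequence(text))


print(to_symbols("가"))
print(to_grapheme_sequence("국물"))
```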

Step 920 according to an embodiment may include separating the script into at least one semantic unit based on a dictionary in which the semantic units of the language of the script are registered, and listing in order, for each semantic unit, the at least one grapheme included in that semantic unit. The dictionary according to an embodiment may include the customized dictionary (A) and the customized dictionary (B) described above. A semantic unit according to an embodiment may include a morpheme. When the semantic unit is a morpheme, separating the script into at least one semantic unit may correspond to the step of tokenizing the script described above.

The step 930 of generating timing information based on the voice signal according to an embodiment may correspond to generating, based on the voice signal, timing information including the timing of each of the at least one consonant and the at least one vowel included in the grapheme sequence. The timing information according to an embodiment may correspond to the timing information included in the result of the forced speech alignment operation described above. That is, the timing information according to an embodiment may include information indicating the utterance time of each grapheme, obtained by aligning the frame(s) of the voice signal to the graphemes in the script. The timing information according to an embodiment may include timing information in units of graphemes and timing information in units of morphemes, each morpheme unit including a plurality of grapheme units. A morpheme unit according to an embodiment may correspond to a unit into which the script is separated by the morpheme tokenization step described above.

The step 930 of generating timing information according to an embodiment may include marking a specific section of the voice signal as a pause section according to a predetermined criterion, and generating timing information corresponding to the pause section. That is, the timing information according to an embodiment may include timing information about pauses. A pause or pause section according to an embodiment may correspond to a section of the voice signal in which the sound of uttering the script is briefly interrupted, for example by breathing. For example, when the sentence "책을 읽는다" ("reads a book") is uttered, "책을" may be uttered in one breath and "읽는다" in the next, so the corresponding voice signal may contain a pause section between "책을" and "읽는다" in which the signal is briefly interrupted. That is, the voice signal according to an embodiment may contain a section in which the signal is weak or briefly interrupted, and such a section may be determined to be a pause section. According to an embodiment, the phonological phrases to which the coarticulation rules are applied may be determined based on the pause sections.

According to an embodiment, the predetermined criterion for determining a pause section may be based on the intensity of the voice signal and on how long a signal of that intensity persists. That is, a specific section satisfying the predetermined criterion may correspond to a section in which a signal at or below a predetermined intensity persists for a predetermined time. For example, under a criterion that treats any section in which a signal of 10 dB or less persists for 1 second as a pause section, such a section of the voice signal is marked as a pause section, and the timing information may include timing information corresponding to the pause section together with the timing information corresponding to the graphemes.
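A minimal sketch of such a criterion is shown below. It assumes the voice signal has already been reduced to one intensity value in dB per fixed-length frame; the function name, the frame representation, and the parameter names are hypothetical, while the default values (10 dB, 1 second) are the example values from the paragraph above.

```python
from typing import List, Sequence, Tuple


def find_pauses(intensity_db: Sequence[float],
                hop_sec: float,
                max_db: float = 10.0,
                min_dur: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs at or below max_db lasting >= min_dur."""
    pauses: List[Tuple[float, float]] = []
    run_start = None
    # A trailing sentinel frame above the threshold closes a run at the end.
    for i, level in enumerate(list(intensity_db) + [float("inf")]):
        if level <= max_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * hop_sec >= min_dur:
                pauses.append((run_start * hop_sec, i * hop_sec))
            run_start = None
    return pauses


# Frames of 0.5 s each: one quiet run of 1.5 s and one of only 0.5 s.
levels = [60.0, 5.0, 5.0, 5.0, 60.0, 5.0, 60.0]
pauses = find_pauses(levels, hop_sec=0.5)
```

Only the first quiet run is long enough to count as a pause section; the second is shorter than the minimum duration and is ignored.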

Steps 910 to 930 according to an embodiment may correspond to the input step described above.

The step 940 of extracting a final consonant and an initial consonant that are adjacent to each other from the grapheme sequence according to an embodiment may correspond to extracting the final consonant of a given syllable and the initial consonant of the syllable immediately following it. In other words, the adjacent final and initial consonants may correspond to a final consonant and the initial consonant located right after it in the grapheme sequence, in which the graphemes are listed in chronological order.

Step 940 according to an embodiment may include, while sequentially reading the graphemes in a phonological phrase, determining whether each grapheme is a consonant and extracting the position of that consonant, and, when the position of the extracted consonant is a final position, extracting the consonant that follows it. As described above, a phonological phrase according to an embodiment includes at least one syllable and may correspond to the unit in which phonological phenomena occur. In other words, the final consonant and the initial consonant extracted in step 940 may correspond to adjacent consonants within a single phonological phrase. According to an embodiment, sequentially reading the graphemes in a phonological phrase may correspond to reading them from left to right.
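The extraction can be sketched as a left-to-right scan over one phonological phrase. The sketch assumes the phrase is already available as a list of `(initial, medial, final)` tuples, with an empty string for a missing final; this representation and the function name are assumptions made for illustration.

```python
from typing import List, Tuple

# One syllable as (initial, medial, final); final is "" when absent.
Syllable = Tuple[str, str, str]


def adjacent_pairs(phrase: List[Syllable]) -> List[Tuple[int, str, str]]:
    """Return (syllable index, final consonant, following initial consonant).

    Only pairs inside the given phrase are produced, so the final consonant
    of the last syllable is never paired with anything.
    """
    pairs = []
    for i in range(len(phrase) - 1):
        final = phrase[i][2]
        if final:
            pairs.append((i, final, phrase[i + 1][0]))
    return pairs


# '국물' as one phonological phrase.
pairs = adjacent_pairs([("ㄱ", "ㅜ", "ㄱ"), ("ㅁ", "ㅜ", "ㄹ")])
```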

The step 950 of determining coarticulation rules and their order based on the final consonant and the initial consonant according to an embodiment may correspond to determining, based on the final consonant and the initial consonant, at least one coarticulation rule and the order of the at least one coarticulation rule. According to an embodiment, selecting the coarticulation rules corresponding to a specific consonant may mean determining, among the rules included in the coarticulation rules, at least one rule corresponding to that consonant, and determining the order of the rule(s) so determined.

Step 950 according to an embodiment may include selecting at least one coarticulation rule corresponding to the final consonant when the initial consonant is not one of the predetermined consonants, and selecting at least one coarticulation rule corresponding to the initial consonant when the initial consonant is one of the predetermined consonants. For example, when the predetermined consonants are 'ㅇ' and 'ㅎ', in the phonological phrase '닭장' the coarticulation rules may be selected according to the final consonant 'ㄺ', whereas in the phonological phrase '닭이' the coarticulation rules may be selected according to the initial consonant 'ㅇ' rather than the final consonant.
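The selection logic can be sketched as a lookup keyed on either the initial or the final consonant. The rule tables below are illustrative only: the entries for 'ㅇ' and 'ㄺ' follow the '닭이' and '닭장' examples in this description, the entry for 'ㅎ' is a placeholder assumption, and the rule names and function name are hypothetical. The actual tables are those of FIG. 4.

```python
from typing import Dict, List

# Initial consonants that take priority over the final consonant.
PRIORITY_INITIALS = {"ㅇ", "ㅎ"}

# Illustrative rule tables; each list is in order of application.
RULES_BY_INITIAL: Dict[str, List[str]] = {
    "ㅇ": ["liaison"],
    "ㅎ": ["aspiration"],  # placeholder assumption, not taken from FIG. 4
}
RULES_BY_FINAL: Dict[str, List[str]] = {
    "ㄺ": ["aspiration", "cluster_simplification", "tensification"],
}


def select_rules(final: str, initial: str) -> List[str]:
    """Select the ordered rule list for one final/initial consonant pair."""
    if initial in PRIORITY_INITIALS:
        return RULES_BY_INITIAL.get(initial, [])
    return RULES_BY_FINAL.get(final, [])


rules_dalgi = select_rules("ㄺ", "ㅇ")     # '닭이'
rules_dakjang = select_rules("ㄺ", "ㅈ")   # '닭장'
```

Each selected rule would then check the initial consonant again to decide whether it actually applies, as described for the aspiration rule.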

The step 960 of generating a phoneme sequence by applying the coarticulation rules to the final consonant and the initial consonant in order according to an embodiment may correspond to generating a phoneme sequence by updating the grapheme sequence through application of the at least one coarticulation rule, in the determined order, to the final consonant and the initial consonant. The phoneme sequence according to an embodiment may correspond to a sequence listing the phonemes obtained by applying the coarticulation rules, phonological phrase by phonological phrase, to the graphemes in the grapheme sequence. In other words, the phoneme sequence lists the phonemes, that is, the pronunciation symbols reflecting how the graphemes are actually pronounced, and may correspond to the phoneme information described above. For example, when the script includes '국물', the grapheme sequence generated from it corresponds to 'ㄱ', 'ㅜ', 'ㄱ', 'ㅁ', 'ㅜ', 'ㄹ', and when '국물' is determined to be one phonological phrase, the phoneme sequence generated by applying the coarticulation rules may correspond to 'ㄱ', 'ㅜ', 'ㅇ', 'ㅁ', 'ㅜ', 'ㄹ'.
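The '국물' example can be reproduced with a small sketch of a single rule. Only the change of 'ㄱ' to 'ㅇ' before 'ㅁ' is taken from the description; the entries for 'ㄷ' and 'ㅂ' and the trigger 'ㄴ' are added from the general pattern of Korean obstruent nasalization and are marked as such. The function name and the syllable-tuple representation are hypothetical, and a full implementation would chain several rules in the order determined in step 950.

```python
from typing import List, Tuple

Syllable = Tuple[str, str, str]  # (initial, medial, final); final may be ""

# 'ㄱ' -> 'ㅇ' is the example in the description; 'ㄷ' -> 'ㄴ' and 'ㅂ' -> 'ㅁ'
# are the parallel cases of Korean obstruent nasalization (an addition here).
NASALIZED = {"ㄱ": "ㅇ", "ㄷ": "ㄴ", "ㅂ": "ㅁ"}
NASAL_INITIALS = ("ㄴ", "ㅁ")


def apply_nasalization(phrase: List[Syllable]) -> List[str]:
    """Apply nasalization inside one phrase and return the flat phoneme list."""
    out = [list(s) for s in phrase]
    for i in range(len(out) - 1):
        final, initial = out[i][2], out[i + 1][0]
        if final in NASALIZED and initial in NASAL_INITIALS:
            out[i][2] = NASALIZED[final]
    # Flatten, dropping empty final slots.
    return [jamo for syl in out for jamo in syl if jamo]


phonemes = apply_nasalization([("ㄱ", "ㅜ", "ㄱ"), ("ㅁ", "ㅜ", "ㄹ")])
```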

The step 970 of generating a speech animation by deforming the mouth shape and tongue shape of the avatar based on the timing information and the phoneme sequence according to an embodiment may include mapping each phoneme in the phoneme sequence generated in step 960 to its corresponding mouth shape and tongue shape, and generating the speech animation by deforming the mouth shape and tongue shape of the avatar into the mouth shape and tongue shape mapped to each phoneme according to the timing information. The avatar according to an embodiment may correspond to an object that includes the shape of a mouth and the shape of a tongue.

When the timing information according to an embodiment includes timing information corresponding to a pause section, the step 970 of generating the speech animation may include mapping the pause section to a corresponding mouth shape, and generating the speech animation by deforming the mouth shape of the avatar into the mouth shape mapped to the pause section based on the timing information corresponding to the pause section. That is, when the timing information includes timing information for a pause section, in which the utterance briefly stops, the speech animation may be generated by deforming the mouth shape of the avatar at that timing into the mouth shape mapped to the pause section.
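A minimal sketch of this mapping step is shown below. It turns a timed phoneme sequence into a list of `(viseme, start, end)` entries that an animation system could use as targets for the mouth and tongue shapes. The viseme names, the `<pause>` marker, and the function name are all hypothetical; the actual visemes are those of FIGS. 5 to 8, and the blending between consecutive shapes is not shown.

```python
from typing import Dict, List, Tuple

Timed = Tuple[str, float, float]  # (symbol, start seconds, end seconds)

# Hypothetical marker for a pause section in the timing information.
PAUSE = "<pause>"


def build_viseme_track(timed: List[Timed],
                       viseme_of: Dict[str, str],
                       pause_viseme: str = "rest") -> List[Timed]:
    """Replace each phoneme (or pause) by its viseme, keeping the timing."""
    track = []
    for symbol, start, end in timed:
        shape = pause_viseme if symbol == PAUSE else viseme_of[symbol]
        track.append((shape, start, end))
    return track


# Illustrative viseme table with made-up class names.
viseme_of = {"ㅂ": "lips_closed", "ㅏ": "jaw_open"}
track = build_viseme_track(
    [("ㅂ", 0.0, 0.1), ("ㅏ", 0.1, 0.3), (PAUSE, 0.3, 1.4)],
    viseme_of,
)
```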

Steps 940 to 970 according to an embodiment may correspond to the speech animation generation step described above.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Claims

A method of generating a speech animation, the method comprising:
receiving a script and a voice signal corresponding to the script;
generating a grapheme sequence including at least one consonant and at least one vowel based on the script;
generating timing information including timings of each of the at least one consonant and the at least one vowel included in the grapheme sequence, based on the voice signal;
extracting a final consonant and a leading consonant adjacent to each other from the grapheme sequence;
determining at least one co-articulation rule and an order of the at least one co-articulation rule based on the final consonant and the leading consonant;
generating a phoneme sequence by updating the phoneme sequence by applying the at least one co-articulation rule to the final consonant and the leading consonant according to the order; and
generating a speech animation by deforming a mouth shape of an avatar and a tongue shape of the avatar based on the timing information and the phoneme sequence.

The method of claim 1,
wherein determining the at least one co-articulation rule and the order of the at least one co-articulation rule comprises:
selecting at least one co-articulation rule corresponding to the final consonant when the leading consonant does not correspond to a predetermined consonant; and
selecting at least one co-articulation rule corresponding to the leading consonant when the leading consonant corresponds to the predetermined consonant.

The method of claim 1,
wherein extracting the final consonant and the leading consonant adjacent to each other from the grapheme sequence comprises:
while sequentially reading the graphemes included in the corresponding phoneme phrase, extracting whether each grapheme is a consonant and the position of that consonant; and
when the position of the extracted consonant is the final-consonant position, extracting the consonant that follows the extracted consonant.

The method of claim 1,
wherein generating the grapheme sequence comprises:
separating the script into at least one semantic unit based on a dictionary that lists semantic units of a language corresponding to the script; and
listing, in order, at least one grapheme included in the semantic unit for each of the at least one semantic unit.

The method of claim 1,
wherein generating the speech animation comprises:
mapping each of the phonemes included in the phoneme sequence to a corresponding mouth shape and tongue shape; and
generating the speech animation by deforming the mouth shape of the avatar and the tongue shape of the avatar into the mouth shape and the tongue shape mapped to each phoneme based on the timing information.

The method of claim 1,
wherein the script comprises a script including at least one syllable written in Korean, and
the voice signal comprises a voice signal in which the script is uttered in Korean.

The method of claim 1,
wherein generating the timing information comprises:
marking a specific section of the voice signal as a pause section according to a predetermined criterion; and
generating timing information corresponding to the pause section.

The method of claim 7,
wherein generating the speech animation comprises:
mapping the pause section to a corresponding mouth shape; and
generating the speech animation by deforming the mouth shape of the avatar into the mouth shape mapped to the pause section based on the timing information corresponding to the pause section.

The method of claim 7,
wherein the specific section according to the predetermined criterion is
a section in which a signal at or below a predetermined voice signal strength lasts for a predetermined time.

The method of claim 1,
wherein generating the grapheme sequence further comprises:
generating the grapheme sequence by mapping at least one grapheme included in the script to a predetermined code.

The method of claim 1,
wherein the co-articulation rule is
a rule for changing the final consonant and the leading consonant located next to the final consonant according to pronunciation.

A computer program stored on a medium and, in combination with hardware, executing the method of any one of claims 1 to 11.

An apparatus for generating a speech animation, the apparatus comprising:
a processor configured to receive a script and a voice signal corresponding to the script, generate a grapheme sequence and timing information including timings of each of at least one consonant and at least one vowel included in the grapheme sequence, extract a final consonant and a leading consonant adjacent to each other from the grapheme sequence, determine at least one co-articulation rule and an order of the at least one co-articulation rule based on the final consonant and the leading consonant, generate a phoneme sequence by updating the phoneme sequence by applying the at least one co-articulation rule to the final consonant and the leading consonant according to the order, and generate a speech animation by deforming a mouth shape of an avatar and a tongue shape of the avatar based on the timing information and the phoneme sequence; and
a memory configured to store the grapheme sequence, the phoneme sequence, the timing information, and mapping information of a tongue shape and a mouth shape corresponding to at least one phoneme.

The apparatus of claim 13,
wherein the processor is configured to,
in order to determine the at least one co-articulation rule and the order of the at least one co-articulation rule,
select at least one co-articulation rule corresponding to the final consonant when the leading consonant does not correspond to a predetermined consonant, and select at least one co-articulation rule corresponding to the leading consonant when the leading consonant corresponds to the predetermined consonant.

The apparatus of claim 13,
wherein the processor is configured to,
in order to extract the final consonant and the leading consonant adjacent to each other from the grapheme sequence,
extract, while sequentially reading the graphemes included in the corresponding phoneme phrase, whether each grapheme is a consonant and the position of that consonant, and extract the consonant that follows the extracted consonant when the position of the extracted consonant is the final-consonant position.

The apparatus of claim 13,
wherein the processor is configured to,
in order to generate the grapheme sequence,
separate the script into at least one semantic unit based on a dictionary that lists semantic units of a language corresponding to the script, and list, in order, at least one grapheme included in the semantic unit for each of the at least one semantic unit.

The apparatus of claim 13,
wherein the processor is configured to,
in order to generate the speech animation,
map each of the phonemes included in the phoneme sequence to a corresponding mouth shape and tongue shape, and generate the speech animation by deforming the mouth shape of the avatar and the tongue shape of the avatar into the mouth shape and the tongue shape mapped to each phoneme based on the timing information.

The apparatus of claim 13,
wherein the script comprises a script including at least one syllable written in Korean, and
the voice signal comprises a voice signal in which the script is uttered in Korean.

The apparatus of claim 13,
wherein the processor is configured to,
in order to generate the timing information,
mark a specific section of the voice signal as a pause section according to a predetermined criterion and generate timing information corresponding to the pause section, and,
in order to generate the speech animation,
map the pause section to a corresponding mouth shape and generate the speech animation by deforming the mouth shape of the avatar into the mouth shape mapped to the pause section based on the timing information corresponding to the pause section.

The apparatus of claim 13,
wherein the processor is configured to,
in order to generate the grapheme sequence,
generate the grapheme sequence by mapping at least one grapheme included in the script to a predetermined code.