KR102287499B1

KR102287499B1 - Method and apparatus for synthesizing speech reflecting phonemic rhythm

Info

Publication number: KR102287499B1
Application number: KR1020200118088A
Authority: KR
Inventors: 김수화; 이동훈; 오승환; 트란딘썬
Original assignee: 주식회사 에이아이더뉴트리진
Priority date: 2020-09-15
Filing date: 2020-09-15
Publication date: 2021-08-09

Abstract

Provided is a device for synthesizing a speech from a text. A speech synthesis device in accordance with one embodiment of the present invention comprises a preprocessor for separating an input text into a plurality of phonemes; a first neural network model for generating first phoneme length information from the plurality of phonemes; and a second neural network model for generating output audio corresponding to the input text from the plurality of phonemes. The second neural network model includes a positional encoding module for encoding position information of each of the phonemes with respect to a phoneme embedding vector representing each of the plurality of phonemes based on the first phoneme length information; and an encoder for generating a context vector from a phoneme embedding vector in which the position information is encoded.

Description

Speech synthesis method and apparatus reflecting phoneme unit prosody

The present invention relates to a method and apparatus for synthesizing speech from text. More specifically, it relates to an apparatus that synthesizes speech while naturally reflecting phoneme-level prosody and the pauses between phonemes in Korean sentences written in Hangul, a phonemic script; to a method performed by that apparatus; and to a method of building a neural-network-based speech synthesis model that can synthesize speech naturally reflecting phoneme-level prosody and pause times.

Speech synthesis is a technology for synthesizing, from input text, sound that resembles human speech, and is commonly referred to as text-to-speech (TTS). In recent years, as smartphones, AI speakers, audiobooks, car navigation systems, and other personal portable devices have been actively developed and widely adopted, demand for speech synthesis technology for voice output has grown rapidly.

As demand for speech synthesis grows, the requirements are also becoming more fine-grained. Recently there is increasing demand for synthesis that goes beyond merely producing speech corresponding to a given text and instead sounds so close to human pronunciation that it does not feel unnatural. In other words, high-level synthesis that produces speech nearly indistinguishable from a human voice is required.

Meanwhile, as machine learning based on artificial neural networks has come into use in speech synthesis, the level and quality of synthesized speech have improved greatly. However, research and development on neural speech synthesis has concentrated mainly on English and Chinese, and its results are difficult to apply directly to Korean, which has characteristics entirely different from those of English or Chinese.

Korean is written in Hangul, its own script, which is one of the most distinctive and sophisticated of the world's writing systems. Hangul is a phonographic script: each letter represents a sound rather than carrying a meaning. Phonographic scripts are divided into syllabaries, in which each character corresponds to one syllable, and phonemic scripts, in which each letter represents one consonant or vowel. Hangul is a phonemic script and, more specifically, a featural script with the distinctive property of two stages of combination: phonological features are combined to form phonemes, and phonemes are combined to form syllable symbols.

Because of these distinctive properties of Hangul, when neural speech synthesis models designed for other languages are trained on Korean data and used, problems arise: some text is omitted or repeated, and the pauses between words or phrases are not as natural as in actual human speech. These problems are known to occur more frequently as the text given to the model as input becomes longer. Consequently, neural speech synthesis models designed for other languages are limited in their ability to synthesize Korean naturally.

Therefore, a technique is needed for synthesizing Korean sentences written in Hangul into natural Korean speech.

Korean Registered Patent Publication No. 10-1704926 (registered February 2, 2017)

A technical problem to be solved by some embodiments of the present invention is to provide an apparatus and method for naturally synthesizing speech in a language written in a phonemic script.

Another technical problem to be solved by some embodiments of the present invention is to provide an apparatus and method that, when synthesizing speech for a given text, naturally reflect phoneme-level prosody.

Another technical problem to be solved by some embodiments of the present invention is to provide an apparatus and method that, when synthesizing speech for a given text, naturally reflect the pauses between syllables, words, phrases, and the like.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above technical problem, a speech synthesis method according to an embodiment of the present invention may include: inputting input text into a first neural network model to obtain first phoneme length information for a phoneme included in the input text; and inputting the input text into a second neural network model to obtain, from the second neural network model and based on the first phoneme length information, output audio corresponding to the input text.

In an embodiment, obtaining the output audio corresponding to the input text from the second neural network model may include obtaining, based on the first phoneme length information obtained from the first neural network model, a phoneme embedding vector representing the phoneme included in the input text and the position of that phoneme.
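The passage above says only that the phoneme embedding vector carries position information derived from the first phoneme length information; this section does not state the encoding itself. As a rough sketch under explicit assumptions, the code below takes each phoneme's position to be its start offset (the running sum of the preceding phoneme lengths) and adds a standard Transformer-style sinusoidal encoding of that offset to the embedding. The function names and the choice of sinusoidal encoding are illustrative and are not taken from the source.

```python
import math

def sinusoidal(position, dim):
    """Standard sinusoidal encoding of one scalar position (assumed form)."""
    vec = []
    for k in range(dim):
        # Even and odd dimensions of a pair share one frequency.
        angle = position / (10000 ** ((k - k % 2) / dim))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec

def duration_positions(durations):
    """Start offset of each phoneme: the running sum of preceding lengths."""
    positions, elapsed = [], 0
    for d in durations:
        positions.append(elapsed)
        elapsed += d
    return positions

def add_positions(embeddings, durations):
    """Add a length-aware positional encoding to each phoneme embedding."""
    positions = duration_positions(durations)
    return [
        [e + p for e, p in zip(emb, sinusoidal(pos, len(emb)))]
        for emb, pos in zip(embeddings, positions)
    ]

# Hypothetical phoneme lengths (in frames) and zero embeddings of size 4.
lengths = [2, 1, 2]
encoded = add_positions([[0.0] * 4 for _ in lengths], lengths)
```

With this choice, two phonemes that are adjacent in the text but separated by a long first phoneme receive positions that are far apart, which is one way length information could influence the encoder input.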

In an embodiment, the second neural network model includes an encoder and a decoder, and obtaining the output audio corresponding to the input text from the second neural network model may further include: inputting the phoneme embedding vector into the encoder; obtaining a context vector from the encoder; obtaining second phoneme length information for the phoneme from the encoder; and obtaining the output audio from the decoder based on the context vector and the second phoneme length information.

In an embodiment, the context vector may represent attention information about the phoneme, and the second phoneme length information may indicate how long the utterance of the phoneme should last in the output audio.

In an embodiment, obtaining the output audio from the decoder based on the context vector and the second phoneme length information may include encoding position information into an audio embedding vector input to the decoder, based on the second phoneme length information.

In an embodiment, the first neural network model includes an encoder and a decoder, and obtaining the first phoneme length information for the phoneme included in the input text may include calculating the first phoneme length information based on an attention matrix generated by the decoder of the first neural network model.

In an embodiment, the first phoneme length information is calculated using Equation 1 below:

[Equation 1]

In Equation 1, d_i may denote the length of the i-th phoneme, S the length of the output audio sequence, and a_{s,t} the attention matrix.
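Equation 1 itself is not reproduced in this text. One form consistent with the symbol definitions, assumed here rather than taken from the source, counts the output frames whose attention peaks on phoneme i, that is, d_i = sum over s = 1..S of [argmax_t a_{s,t} = i]. A minimal sketch of that assumed form, with a made-up attention matrix:

```python
def phoneme_durations(attention, num_phonemes):
    """Count, per phoneme, the output frames whose attention peaks on it.

    attention[s][t] is the weight that output frame s places on phoneme t
    (S rows, T columns). Ties go to the lowest phoneme index.
    """
    durations = [0] * num_phonemes
    for row in attention:
        best = max(range(num_phonemes), key=lambda t: row[t])
        durations[best] += 1
    return durations

# Toy attention matrix: S = 5 frames, T = 3 phonemes (values are made up).
attention = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.0, 0.2, 0.8],
]
durations = phoneme_durations(attention, 3)
print(durations)  # [2, 1, 2]
```

Under this form the lengths are expressed in output frames and always sum to S, since every frame is assigned to exactly one phoneme.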

To solve the above technical problem, a phoneme processing apparatus according to another embodiment of the present invention includes a preprocessor that separates input text into a plurality of phonemes, and a phoneme length extraction unit that includes a neural network model and a phoneme length calculation module. The neural network model includes an encoder that receives the plurality of phonemes separated from the input text and a decoder that synthesizes output audio corresponding to the input text. The phoneme length calculation module may calculate the lengths of the plurality of phonemes included in the input text based on an attention matrix generated by the decoder.

In an embodiment, the decoder includes a multi-head attention layer that generates a plurality of attention matrices, each representing attention between the plurality of phonemes and the frames of the output audio, and the phoneme length calculation module may calculate the lengths of the plurality of phonemes using at least some of the plurality of attention matrices.

In an embodiment, the phoneme length calculation module calculates the lengths of the plurality of phonemes using, among the plurality of attention matrices, the attention matrix with the highest attention focus rate as calculated using Equation 2 below:

[Equation 2]

In Equation 2, F may denote the attention focus rate, S the length of the output audio sequence, T the length of the sequence of phonemes included in the input text, and a_{s,t} the attention matrix.
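Equation 2 is not reproduced in this text. One form consistent with the symbols F, S, T, and a_{s,t}, assumed here rather than taken from the source, averages over output frames the largest attention weight in each frame: F = (1/S) * sum over s = 1..S of max over 1 <= t <= T of a_{s,t}. The sketch below applies that assumed form to pick the most focused of several attention matrices; the function names and sample values are illustrative.

```python
def focus_rate(attention):
    """Mean over output frames of the largest attention weight in each row."""
    return sum(max(row) for row in attention) / len(attention)

def most_focused_head(heads):
    """Return (index, rate) of the attention matrix with the highest rate."""
    rates = [focus_rate(a) for a in heads]
    best = max(range(len(heads)), key=lambda h: rates[h])
    return best, rates[best]

# Two made-up heads over S = 3 frames and T = 2 phonemes.
diffuse = [[0.5, 0.5], [0.6, 0.4], [0.5, 0.5]]
sharp = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
best, rate = most_focused_head([diffuse, sharp])
```

A rate near 1 means each frame attends almost entirely to a single phoneme, which is the situation in which reading phoneme lengths off the matrix is most reliable.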

In an embodiment, the neural network model may have a Transformer structure and may have been trained using training text and ground-truth speech audio corresponding to the training text.

In an embodiment, the neural network model may have been trained using training text and ground-truth speech audio corresponding to the training text.

To solve the above technical problem, a speech synthesis apparatus according to yet another embodiment of the present invention includes a preprocessor that separates input text into a plurality of phonemes, and a speech synthesis unit that inputs the input text into a neural-network-based speech synthesis model to synthesize output audio corresponding to the input text. The speech synthesis model may include: a first embedding module that generates, based on first phoneme length information, a phoneme embedding vector representing a phoneme included in the input text and the position of that phoneme; an encoder that generates a context vector from the phoneme embedding vector; and a decoder that generates the output audio based on the context vector.

In an embodiment, the speech synthesis model may further include a phoneme length prediction module that generates second phoneme length information from the context vector, and a second embedding module that encodes position information into an audio embedding vector input to the decoder, based on the second phoneme length information.

In an embodiment, the encoder includes an encoder block, and the encoder block may include a multi-head attention layer that calculates an attention matrix based on the phoneme embedding vectors of the plurality of phonemes, and a feed-forward layer that calculates a context vector from the attention matrix.

In an embodiment, the feed-forward layer may include a convolutional neural network.

In an embodiment, the encoder includes a plurality of encoder blocks, and the plurality of encoder blocks may be connected through residual connections.

In an embodiment, the decoder includes a decoder block, and the decoder block may include: a first multi-head attention layer that calculates self-attention information among a plurality of frames constituting the output audio; a second multi-head attention layer that calculates attention information between the plurality of frames and the plurality of phonemes; and a feed-forward layer.

In an embodiment, the decoder includes a plurality of decoder blocks, and the plurality of decoder blocks may be connected through residual connections.

In an embodiment, the neural network model may have been trained using training text, the length of each of a plurality of phonemes included in the training text, and ground-truth speech audio corresponding to the training text.

To solve the above technical problem, a speech synthesis apparatus according to yet another embodiment of the present invention includes a preprocessor that separates input text into a plurality of phonemes, a first neural network model that generates first phoneme length information from the plurality of phonemes, and a second neural network model that generates output audio corresponding to the input text from the plurality of phonemes. The second neural network model may include a positional encoding module that, based on the first phoneme length information, encodes position information of each phoneme into the phoneme embedding vector representing each of the plurality of phonemes, and an encoder that generates a context vector from the phoneme embedding vector in which the position information is encoded.

In an embodiment, the second neural network model may further include a phoneme length prediction module that predicts second phoneme length information from the context vector, and a decoder that generates the output audio using an audio embedding vector in which position information is encoded based on the second phoneme length information.

FIG. 1 is a diagram for explaining the input and output of a speech synthesis apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram for explaining the configuration and operation of the speech synthesis apparatus shown in FIG. 1.
FIG. 3 is a block diagram for explaining the configuration and operation of a phoneme processing apparatus according to another embodiment of the present invention.
FIG. 4 is a block diagram for explaining the configuration and operation of a speech synthesis apparatus according to another embodiment of the present invention.
FIG. 5 is a diagram for explaining a process of separating Korean text into phonemes.
FIG. 6 is a block diagram for explaining in more detail the phoneme length extraction unit of the speech synthesis apparatus described with reference to FIGS. 2 and 3.
FIG. 7 is a block diagram for explaining in more detail the speech synthesis unit of the speech synthesis apparatus described with reference to FIGS. 2 and 4.
FIG. 8 is a diagram for explaining that the encoder shown in FIGS. 6 and 7 may be composed of a plurality of encoder blocks.
FIG. 9 is a diagram for explaining the configuration and operation of the encoder block shown in FIG. 8.
FIG. 10 is a diagram for explaining a multi-head attention layer that the encoder shown in FIGS. 6 and 7 may include.
FIG. 11 is a diagram for explaining the configuration and operation of the decoder shown in FIG. 6.
FIG. 12 is a graph showing, in two dimensions, the weights of exemplary attention matrices obtainable from the multi-head attention layer shown in FIG. 11.
FIG. 13 is a diagram for explaining the configuration and operation of the decoder shown in FIG. 7.
FIG. 14 is an exemplary flowchart showing a method of synthesizing speech according to yet another embodiment of the present invention.
FIG. 15 is a diagram for explaining in more detail, among the steps of the speech synthesis method described with reference to FIG. 14, the step of obtaining output audio from the second neural network model.
FIG. 16 is a diagram for explaining an exemplary computing device capable of implementing a speech synthesis apparatus or a phoneme processing apparatus according to some embodiments of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the technical idea of the present invention is not limited to the following embodiments and may be implemented in various different forms. The following embodiments are provided only to make the technical idea of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which it belongs; the technical idea of the present invention is defined only by the scope of the claims.

In assigning reference numerals to the components in each drawing, note that the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive sense unless expressly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the wording specifically states otherwise.

In describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, the component may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Hereinafter, some embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram for explaining the input and output of a speech synthesis apparatus 10 according to an embodiment of the present invention.

As shown in FIG. 1, the speech synthesis apparatus 10 is a computing device that receives text and synthesizes and outputs corresponding speech 3. The computing device may be a notebook, a desktop, a laptop, or the like, but is not limited to these and may include any kind of device with computing capability. See FIG. 16 for an example of the computing device.

Although FIG. 1 shows, as an example, the speech synthesis apparatus 10 implemented as a single computing device, a first function of the speech synthesis apparatus 10 may be implemented in a first computing device and a second function in a second computing device.

According to various embodiments of the present invention, the speech synthesis apparatus 10 may build a speech synthesis model and use it to generate, from a given text, an audio sequence containing speech corresponding to that text.

FIG. 2 is an exemplary block diagram of the speech synthesis apparatus 10 shown in FIG. 1. Specifically, FIG. 2 shows, according to an embodiment of the present invention, a speech synthesis apparatus 10 that includes both the phoneme length extraction unit 23 and the speech synthesis unit 25 described later.

In another embodiment of the present invention, the phoneme length extraction unit 23 and the speech synthesis unit 25 may be implemented as separate devices. FIG. 3 shows a phoneme processing apparatus 11 that includes only the phoneme length extraction unit 23 of the two, and FIG. 4 shows a speech synthesis apparatus 12 that includes only the speech synthesis unit 25 of the two.

In the following, the speech synthesis apparatus 10 according to an embodiment of the present invention is described with reference to FIG. 2, and the case in which the phoneme processing apparatus 11 and the speech synthesis apparatus 12 are implemented as separate devices according to another embodiment is described with reference to FIGS. 3 and 4, focusing on the differences.

As shown in FIG. 2, the speech synthesis apparatus 10 may include a preprocessor 21, a phoneme length extraction unit 23, a speech synthesis unit 25, and a vocoder 27. FIG. 2 shows only the components related to the embodiment of the present invention; those skilled in the art will understand that other general-purpose components may be included in addition to those shown. Note also that the components of the speech synthesis apparatus 10 shown in FIG. 2 are functionally distinguished elements, and that a plurality of components may be implemented in integrated form in an actual physical environment. Each component is described below.

Before the text input to the speech synthesis apparatus 10 is converted into speech in an audio format, the preprocessor 21 preprocesses the text into data in a form suitable for processing by the artificial-neural-network-based models described later.

Specifically, the preprocessing may include dividing the input text into units such as sentences, space-delimited word phrases (eojeol), words, and characters; converting numerals, special characters, and the like into letters; and separating the input text into phonemes. The specific preprocessing method may be selected from among various methods depending on the embodiment. An example of the preprocessing is shown in FIG. 5.

FIG. 5 shows an example in which the input text ("지금 영국에선 정체 모를 ...") is converted into a phoneme-level text sequence ("c0;ii;k0;xx;mf; ;yv;ng;k0;uu;kf;ee;s0;vv;nf; ;c0;vv;ng;ch;ee; ;mm;oo;rr;xx;ll; ..."). Although not shown in FIG. 5, when the input text contains numbers, the preprocessor 21 may first convert every number in the text into letters (for example, "349명" into "삼백사십구명") to produce a text sequence consisting only of letters, and then convert that sequence into a phoneme-level text sequence. The preprocessor 21 may use any of the various methods known to those skilled in the art to convert the input text into a phoneme-level text sequence, and the present invention is not limited to any one of them.
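The phoneme-separation step can be illustrated with a minimal sketch. It assumes a purely orthographic decomposition of precomposed Hangul syllables into jamo using the standard Unicode arithmetic; the function name `decompose` is hypothetical, and the sketch does not reproduce the phoneme symbols of FIG. 5 (such as "c0" or "ii") or apply any pronunciation rules.

```python
# Illustrative only: orthographic jamo decomposition via Unicode arithmetic.
# This is NOT the patent's phoneme inventory or its grapheme-to-phoneme rules.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 medial vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

def decompose(text):
    """Split each precomposed Hangul syllable into its jamo; pass other characters through."""
    jamo = []
    for ch in text:
        code = ord(ch) - 0xAC00            # offset from the first syllable, U+AC00
        if 0 <= code < 11172:              # 19 * 21 * 28 precomposed syllables
            initial, rest = divmod(code, 21 * 28)
            medial, final = divmod(rest, 28)
            jamo.append(CHOSEONG[initial])
            jamo.append(JUNGSEONG[medial])
            if final:                      # index 0 means "no final consonant"
                jamo.append(JONGSEONG[final])
        else:
            jamo.append(ch)
    return jamo

jigeum = decompose("지금")
```

A real preprocessor would additionally normalize numbers and symbols and map the jamo to context-dependent phoneme labels, which this sketch leaves out.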

The description continues with reference to FIG. 2.

The phoneme length extraction unit 23 receives the text separated into phonemes and calculates and extracts the length of each phoneme. The length of a phoneme may indicate how long the utterance of that phoneme should last when the input text is spoken; in other words, it may indicate how long the utterance of that phoneme should last in the output audio speech synthesized from the input text. For the input text "지금", the phoneme lengths may be information that expresses the utterance duration of each phoneme as a number, such as {"ㅈ:23ms", "ㅣ:38ms", "ㄱ:53ms", "ㅡ:40ms", "ㅁ:65ms"} or {"c0:23ms", "ii:38ms", "k0:53ms", "xx:40ms", "mf:65ms"}, but is not limited to this form.
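As a small worked example of this kind of length information, the sketch below takes the per-phoneme durations quoted above for "지금" and derives the start and end time of each phoneme. The list-of-tuples layout and the name `to_intervals` are assumptions made for illustration; the description does not prescribe a data format.

```python
# Durations for "지금" as quoted in the description (milliseconds).
durations_ms = [("c0", 23), ("ii", 38), ("k0", 53), ("xx", 40), ("mf", 65)]

def to_intervals(durations):
    """Turn (phoneme, length) pairs into (phoneme, start, end) intervals."""
    intervals, start = [], 0
    for phoneme, length in durations:
        intervals.append((phoneme, start, start + length))
        start += length
    return intervals

intervals = to_intervals(durations_ms)
total_ms = intervals[-1][2]   # end time of the last phoneme
```

With these numbers the whole word spans 219 ms, and "k0" occupies the interval from 61 ms to 114 ms.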

As described later, the phoneme length extraction unit 23 may include at least part of a first speech synthesis model implemented on the basis of an artificial neural network. Using some of the attention matrices obtained from the pre-trained first speech synthesis model, the phoneme length extraction unit 23 may generate and provide information on the length of each phoneme in the input text. Note, however, that the first speech synthesis model is not used to generate the output speech audio 3 that the speech synthesis apparatus 10 outputs; this point is described later.

In the embodiment shown in FIG. 2, the phoneme length extraction unit 23 may provide the phoneme length information for the input text to the speech synthesis unit 25, described later, in real time.

In another embodiment, shown in FIG. 3, the phoneme length extraction unit 23 may organize the phoneme length information extracted in advance for the phonemes in the training text 1 into a data structure and store it in the phoneme length information storage device 5.

The speech synthesis unit 25 receives the text separated into phonemes and generates data representing the output audio speech corresponding to that text. The speech synthesis unit 25 may determine how long each phoneme is uttered in the output audio speech based at least in part on the phoneme length information generated by the phoneme length extraction unit 23.

The speech synthesis unit 25 may include a second speech synthesis model based on an artificial neural network. Using this pre-trained second speech synthesis model, the speech synthesis unit 25 may generate data representing the output audio speech corresponding to the input text. Note that the second speech synthesis model in the speech synthesis unit 25 is distinct from the first speech synthesis model described above in connection with the phoneme length extraction unit 23. The output speech audio 3 that the speech synthesis apparatus 10 outputs is generated using the second speech synthesis model, not the first.

In the embodiment shown in FIG. 2, each time speech synthesis is performed on an input text, the speech synthesis unit 25 may be configured to receive information on the lengths of the phonemes in that input text from the phoneme length extraction unit 23 in real time and to use it, together with the input text, as an input to the second speech synthesis model.

In another embodiment, shown in FIG. 4, the second speech synthesis model of the speech synthesis unit 25 may have been trained in advance using phoneme length information 5 generated beforehand for the phonemes in the training text. In other words, before speech synthesis is performed on a particular text, the second speech synthesis model may already have been trained to utter the various phonemes with appropriate lengths. In this case, once speech synthesis for a particular text begins, only the text to be synthesized may be supplied as the input to the second speech synthesis model.

The speech synthesis unit 25 may generate and output data in the form of a spectrogram or a mel-scale spectrogram (hereinafter, "mel spectrogram"). The format of the data that the speech synthesis unit 25 generates and outputs is not limited to a spectrogram or mel spectrogram, however; the speech synthesis unit 25 may also be implemented to generate data in another format that can be converted into audio data playable by a sound device.
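For readers unfamiliar with the mel scale, the following sketch shows one widely used conversion between frequency in hertz and mel (the HTK-style formula). The description does not state which mel formula, number of bands, or frame parameters the embodiments use, so the formula choice here is an assumption for illustration only.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel conversion (one common convention; assumed, not from the patent)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

mel_at_1khz = hz_to_mel(1000.0)   # close to 1000 mel under this convention
```

The scale is roughly linear below 1 kHz and logarithmic above it, which is why a mel spectrogram allocates more resolution to the lower frequencies that matter most for speech.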

The data generated by the speech synthesis unit 25 (for example, data in mel spectrogram format) may be provided to the vocoder unit 27, described next.

The vocoder unit 27 converts the data generated by the speech synthesis unit 25, in mel spectrogram or a similar format, into data in a digital audio format and outputs it. In some embodiments, the mel spectrogram data 41 may be data expressed as a vector with a predetermined number of dimensions per frame. The vocoder unit 27 may be a component that synthesizes audio or speech from mel spectrogram data. The vocoder unit 27 may be implemented with any technique or method, provided that it can perform this conversion. For instance, the vocoder unit 27 may be implemented using a vocoder module widely known in the art, such as a vocoder based on the Griffin-Lim algorithm or a neural-network-based vocoder such as WaveRNN, WaveNet, WaveGlow, or MelGAN.

The audio data 3 output by the vocoder unit 27 is digital audio containing speech synthesized to resemble a person uttering the input text 1. The audio data 3 may be played back by a sound device.

The speech synthesis apparatus according to some embodiments of the present invention has been described so far with reference to FIGS. 2 to 5. The detailed configurations of the phoneme length extraction unit 23 and the speech synthesis unit 25 that make up the speech synthesis apparatus are described below with reference to FIGS. 6 to 13.

First, FIG. 6 is a block diagram illustrating the overall structure of the first speech synthesis model used by the phoneme length extraction unit 23 described with reference to FIGS. 2 and 3.

The first speech synthesis model shown in FIG. 6 is a model that synthesizes speech from a given text. It may have been trained in advance by teacher forcing, using a training data set consisting of training texts and the actual speech corresponding to those texts. That is, the weights of the encoder-decoder first speech synthesis model shown in FIG. 6 may have been adjusted in advance so that the model synthesizes natural speech corresponding to a given text.

In embodiments of the present invention, the phoneme length extraction unit 23 may include a phoneme length calculation module 65 that calculates phoneme lengths based on the attention matrix generated by the decoder 73 of the encoder-decoder first speech synthesis model (61, 63, 71, 73) shown in FIG. 6. Specifically, the phoneme length extraction unit 23 may obtain, from the first speech synthesis model trained to synthesize natural speech, an attention matrix representing the attention relationship between phonemes and the mel spectrogram, and use it to calculate phoneme lengths.

Referring to FIG. 6, the first speech synthesis model may include a phoneme embedding module 61, an encoder 63, an audio embedding module 71, and a decoder 73.

The first speech synthesis model may be implemented as an artificial neural network of the transformer type.

A transformer model processes time-series data sequences through a plurality of encoders and a plurality of decoders that have multi-attention layers composed of linear neural networks, without using the recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks, that had been widely used for sequences with time-series characteristics. The first speech synthesis model of the present invention is not necessarily limited to a transformer implementation, however, and may be implemented using various neural network models known to those skilled in the art. The following description assumes that the artificial neural network of the first speech synthesis model is implemented as a transformer model.

The phoneme embedding module 61 receives the phonemes (for example, "ㅁ", "ㅗ", "ㅈ", "ㅏ") that the preprocessor 21 has separated from the input text (for example, "모자") and generates a phoneme embedding vector 62 representing each phoneme. The conversion of each phoneme into an embedding vector 62 may be performed by techniques known in the art.

The phoneme embedding vector 62 may contain not only information identifying the phoneme it represents but also information on the position of that phoneme within the input sequence. Unlike an RNN model, which receives and processes a text sequence one element at a time, a transformer model receives the whole input sequence at once, so the positions of the phonemes in the input text need to be encoded into the phoneme embedding vectors 62. This process is commonly called positional encoding. Positional encoding adds position information to each phoneme embedding vector 62 using, for example, sine and cosine functions. Because it will be readily understood by those working in the art, a detailed description of positional encoding is omitted.
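Although the description omits the details, the sine/cosine positional encoding it refers to can be sketched as follows. The sketch uses the commonly published transformer formulation; the base constant 10000 and the names `positional_encoding` and `add_positions` are assumptions, since the description gives no specific formula.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    encoding = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the same wavelength.
        angle = position / (10000 ** ((2 * (k // 2)) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Add the encoding for index 0, 1, 2, ... to each embedding vector."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

encoded = add_positions([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]])
```

Because the encoding is added to the embedding rather than concatenated, the vector dimension stays unchanged and the encoder can still attend over the whole sequence at once.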

The encoder 63 generates a self-attention matrix from the phoneme embedding vectors 62 representing the phonemes of the input text, feeds the self-attention matrix forward, and provides a context vector to the decoder 73. The self-attention matrix may be a matrix containing the information needed to derive the attention that each phoneme represented by the phoneme embedding vectors input to the encoder has with respect to the other phonemes.

The audio embedding module 71 generates audio embedding vectors 72 representing the mel spectrogram frames that make up the audio. Like the phoneme embedding module 61, the audio embedding module 71 positionally encodes the position of each mel spectrogram frame within the output audio into the audio embedding vector 72.

The decoder 73 first receives the embedding vector for the <sos> token, which marks the start of a sentence, and outputs the first mel spectrogram frame 78a. The output frame 78a serves as the input value 79a for generating the next mel spectrogram frame: it is converted into a vector by the embedding module 71 and then fed to the decoder 73. The mel spectrogram frame 78b that the decoder 73 generates from the input value 79a is in turn used as the input value 79b for generating the following frame.

The decoder 73 repeats this process, sequentially generating and outputting mel spectrogram frames based on the audio embedding vectors 72 described above and the context vector provided by the encoder 63.
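The frame-by-frame loop described above can be sketched as follows. Everything here is a stand-in: `toy_decoder_step` is a hypothetical placeholder for the trained decoder 73, frames are short lists of floats rather than real mel spectrogram frames, and the fixed `max_frames` stopping rule is an assumption, because this passage does not say how generation ends.

```python
def generate_frames(decoder_step, context, sos_frame, max_frames):
    """Autoregressive generation: each output frame becomes the next input."""
    inputs = [sos_frame]          # starts with the <sos> embedding
    outputs = []
    for _ in range(max_frames):
        frame = decoder_step(inputs, context)
        outputs.append(frame)
        inputs.append(frame)      # feed the new frame back in
    return outputs

def toy_decoder_step(inputs, context):
    # Hypothetical placeholder: adds the context to the most recent input frame.
    return [x + c for x, c in zip(inputs[-1], context)]

frames = generate_frames(toy_decoder_step, [1.0, 2.0], [0.0, 0.0], 3)
```

During teacher-forced training the fed-back frame would be replaced by the corresponding frame of the recorded speech, which is what distinguishes training from the inference loop sketched here.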

As described above, the first speech synthesis model shown in FIG. 6 is trained in advance using a training data set consisting of training texts and the actual speech corresponding to them. In that process, the weights of the encoder and decoder of the first speech synthesis model may be adjusted so that the model synthesizes natural speech corresponding to a given text.

As described above, the phoneme length extraction unit 23 may include the phoneme length calculation module 65, which calculates phoneme lengths based on the attention matrix generated by the decoder 73 of the first speech synthesis model (61, 63, 71, 73).

In addition to sequentially generating the mel spectrogram frames 78a, 78b, 78c, and 78d by reflecting the information of the attention matrix, which represents the attention relationship between the phoneme embedding vectors 62 and the audio embedding vectors 72, the decoder 73 provides that attention matrix to the phoneme length calculation module 65. As described above, the weights of the encoder 63 and decoder 73 of the first speech synthesis model have been adjusted during training so that the speech output by the first speech synthesis model sounds as natural as human speech.

The phoneme length calculation module 65 calculates the length of each phoneme in the input text based on the attention matrix provided by the decoder 73. The length calculated for each phoneme may be the per-phoneme utterance duration that makes the speech synthesized and output by the first speech synthesis model sound similar to human speech. The phoneme lengths calculated by the phoneme length calculation module 65 may be stored in the phoneme length information storage device 5 and/or provided to the speech synthesis unit 25. In the following description, the information provided by the phoneme length calculation module 65 of the phoneme length extraction unit is referred to as first phoneme length information. Note that it is distinct from the second phoneme length information provided by the phoneme length prediction module 84 of the speech synthesis unit 25, described later.

The process by which the phoneme length calculation module 65 calculates the lengths of the phonemes in the input text based on the attention matrix provided by the decoder 73 is described later with reference to FIGS. 11 and 12.
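To give a rough intuition for how durations can be read off an attention matrix, the sketch below uses a common heuristic: assign each mel spectrogram frame to the phoneme with the highest attention weight and count the frames per phoneme. This is an assumed, generic technique and not the procedure of FIGS. 11 and 12, which is not reproduced here; the matrix values and the name `durations_from_attention` are invented for illustration.

```python
def durations_from_attention(attention, num_phonemes):
    """Count, per phoneme, the frames whose strongest attention falls on it.

    attention: one row per mel frame, one column per phoneme.
    """
    counts = [0] * num_phonemes
    for row in attention:
        best = max(range(num_phonemes), key=lambda j: row[j])
        counts[best] += 1
    return counts

# Made-up 5-frame x 3-phoneme attention matrix (rows sum to 1).
attention = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
frame_counts = durations_from_attention(attention, 3)
```

Multiplying each frame count by the frame shift of the mel spectrogram would convert the counts into durations in milliseconds.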

The phoneme length extraction unit 23 and the first speech synthesis model it uses have been described so far with reference to FIG. 6. The phoneme length extraction unit 23 uses the attention information between phonemes and audio, obtained as a result of training the encoder-decoder first speech synthesis model, to provide information on the lengths (utterance durations) of the phonemes in the texts of the training data set. The first phoneme length information is used by the second speech synthesis model in the speech synthesis unit 25 to determine how long each phoneme lasts in the output audio speech to be synthesized. Note that the first speech synthesis model is used only for phoneme length extraction and is not used by the speech synthesis apparatus 10 to generate the output audio.

The detailed configuration of the speech synthesis unit 25 is described below with reference to FIG. 7. As described above, the speech synthesis unit 25 may use the pre-trained second speech synthesis model to generate data representing the output audio speech corresponding to the input text.

The second speech synthesis model is distinct from the first speech synthesis model described above in connection with the phoneme length extraction unit 23. The output speech audio 3 that the speech synthesis apparatus 10 outputs is generated using the second speech synthesis model, not the first.

The second speech synthesis model is a model that synthesizes speech from a given text. It may have been trained in advance by teacher forcing, using a training data set consisting of training texts and the actual speech corresponding to those texts. That is, the weights of the encoder-decoder second speech synthesis model may have been adjusted in advance so that the model synthesizes natural speech corresponding to a given text.

As described above with reference to FIG. 4, in some embodiments of the present invention the second speech synthesis model may have been trained in advance using the first phoneme length information 5 that the phoneme length extraction unit 23 generated for the phonemes in the training text. In other words, before speech synthesis for a particular text is requested, the second speech synthesis model may already have been trained, based on the first phoneme length information, to utter the various phonemes with appropriate lengths.

In some other embodiments, each time speech synthesis for a particular text is requested, the speech synthesis unit 25 may be configured to receive information on the lengths of the phonemes in that text from the phoneme length extraction unit 23 and to use it, together with the input text, as an input to the second speech synthesis model.

Referring to FIG. 7, the second speech synthesis model may receive the input text to be synthesized into speech and the first phoneme length information 5, and output the mel spectrogram frames (88a, 88b, 88c, 88d, and so on) that make up the output audio speech corresponding to the input text.

The second speech synthesis model may include a phoneme embedding module 81, an encoder 83, a phoneme length prediction module 84, an audio embedding module 85, and a decoder 87.

The second speech synthesis model may be implemented as an artificial neural network of the transformer type, but is not limited to this. The following description assumes that the artificial neural network of the second speech synthesis model is implemented as a transformer model.

First, the phoneme embedding module 81 receives the phonemes separated from the input text and generates a phoneme embedding vector 82 representing each phoneme.

Information on the position of each phoneme within the input sequence is added to the phoneme embedding vector 82. In other words, the positions of the phonemes in the input text are positionally encoded into the phoneme embedding vectors 82.

Note that the first phoneme length information 5 provided by the phoneme length extraction unit 23 may be reflected in the positional encoding of the phoneme embedding vectors 82. In other words, in this embodiment, when the second speech synthesis model of the speech synthesis unit 25 embeds the input text, it may reflect the first phoneme length information 5 calculated and provided by the encoder 63 of the first speech synthesis model and the phoneme length calculation module 65 of the phoneme length extraction unit 23.
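One possible reading of "reflecting phoneme length in the positional encoding" is sketched below: instead of using the integer index 0, 1, 2, ... as the position of each phoneme, use the elapsed time at which the phoneme starts, derived from the phoneme lengths. This interpretation, the millisecond unit, and the names `duration_positions` and `sinusoid` are all assumptions; this passage does not specify how the length information enters the encoding.

```python
import math

def duration_positions(durations):
    """Position of each phoneme = sum of the lengths of all preceding phonemes."""
    positions, elapsed = [], 0.0
    for length in durations:
        positions.append(elapsed)
        elapsed += length
    return positions

def sinusoid(position, d_model):
    """Standard sine/cosine encoding evaluated at an arbitrary real-valued position."""
    return [
        math.sin(position / 10000 ** ((2 * (k // 2)) / d_model)) if k % 2 == 0
        else math.cos(position / 10000 ** ((2 * (k // 2)) / d_model))
        for k in range(d_model)
    ]

# First phoneme lengths for "지금" from the description, in milliseconds.
positions = duration_positions([23, 38, 53, 40, 65])
encodings = [sinusoid(p, 4) for p in positions]
```

Under this reading, two phonemes that are adjacent in the text but separated by a long vowel end up further apart in position space than two phonemes separated by a short one.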

The encoder 83 generates an attention matrix from the phoneme embedding vectors 82 representing the phonemes of the input text, feeds the attention matrix forward, and provides a context vector to the decoder 87. The encoder 83 also provides the context vector to the phoneme length prediction module 84.

The phoneme length prediction module 84 generates second phoneme length information 7 from the context vector provided by the encoder 83. The second phoneme length information indicates how long each phoneme in the input text should be uttered in the output audio speech 3 output from the decoder 87. The second phoneme length information 7 is provided to the audio embedding module 85, described next, and is used to generate the audio embedding vectors 86.

The audio embedding module 85 generates audio embedding vectors 86 representing the mel spectrogram frames that make up the output audio speech 3. Like the phoneme embedding module 81, the audio embedding module 85 positionally encodes the position of each mel spectrogram frame within the output audio into the audio embedding vector 86. Note that, in doing so, the audio embedding module 85 reflects the second phoneme length information 7 provided by the phoneme length prediction module 84 in the positional encoding of the audio embedding vectors 86.

The decoder 87 receives the embedding vector for the <sos> token and outputs the first mel spectrogram frame 88a. The output frame 88a serves as the input value 89a for generating the next mel spectrogram frame: it is converted into a vector by the embedding module 85 and then fed to the decoder 87. The mel spectrogram frame 88b that the decoder 87 generates from the input value 89a is in turn used as the input value 89b for generating the following frame.

The decoder 87 repeats this process, sequentially generating and outputting mel spectrogram frames based on the audio embedding vectors 86 described above and the context vector provided by the encoder 83.

Once all the phonemes in the input text 1 have been fed to the encoder 83 of the second speech synthesis model and the decoder 87 has finished generating the mel spectrogram frames (88a, 88b, 88c, 88d, ...), the generated frames are concatenated into a single mel spectrogram, which may only then be provided to the vocoder unit 27.

The second speech synthesis model of the speech synthesis unit 25 has been described so far with reference to FIG. 7. The detailed configuration and operation of the encoders 63 and 83 and the decoders 73 and 87 of the first and second speech synthesis models are described below with reference to FIGS. 8 to 13.

FIG. 8 shows that the encoder 63 of the first speech synthesis model and the encoder 83 of the second speech synthesis model, described with reference to FIGS. 6 and 7, may each consist of a plurality of encoder blocks 93a to 93n.

전술한 바와 같이, 제1 음성 합성 모델 및 제2 음성 합성 모델은 트랜스포머 모델로 구현될 수 있으며, 복수의 인코더들 및 복수의 디코더들을 구비하도록 구성될 수 있다. 도 8에 도시된 복수의 인코더 블록들(93a 내지 93n)은 각각 하나의 인코더에 대응될 수 있다. 하나의 인코더 블록의 출력 값은 다음 인코더 블록의 입력 값으로 제공되며, 인코더 블록들(93a 내지 93n) 사이는 잔차 연결(residual connection) 방식으로 연결되어, 여러 개의 인코더 블록들을 거치는 과정에서 야기될 수 있는 정보의 손실이 최소화될 수 있다.As described above, the first speech synthesis model and the second speech synthesis model may be implemented as a transformer model, and may be configured to include a plurality of encoders and a plurality of decoders. Each of the plurality of encoder blocks 93a to 93n illustrated in FIG. 8 may correspond to one encoder. The output value of one encoder block is provided as an input value of the next encoder block, and the encoder blocks 93a to 93n are connected in a residual connection method, which can be caused by going through several encoder blocks. Loss of information can be minimized.
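The chaining of encoder blocks with residual connections can be sketched as follows. The additive form of the residual connection (block input added to block output) is an assumption taken from common transformer practice; the text does not specify how the connection between blocks is formed, and `run_encoder_stack` is a hypothetical name.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def run_encoder_stack(x: Vector, blocks: Sequence[Block]) -> Vector:
    """Pass x through each block in order, adding the block input back to its output."""
    for block in blocks:
        y = block(x)
        # Residual connection (assumed additive): the input bypasses the block.
        x = [xi + yi for xi, yi in zip(x, y)]
    return x


# Toy blocks that double their input, standing in for encoder blocks 93a and 93b.
toy_blocks = [lambda v: [2.0 * a for a in v]] * 2
stacked = run_encoder_stack([1.0, 2.0], toy_blocks)
```

Because each block's input is carried forward unchanged alongside its output, information from the earliest layers remains available at the end of the stack.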

FIG. 9 is a diagram for explaining the configuration and operation of the encoder blocks 93a to 93n shown in FIG. 8.

Referring to FIG. 9, each encoder block 93 may include a multi-head attention layer 101 and a feed-forward layer 103.

The multi-head attention layer 101 receives vectors such as the phoneme embedding vectors 62 and 82, and generates and provides a multi-head attention matrix 105. In generating the self-attention information between phonemes that is computed in the encoders 63 and 83, the multi-head attention layer 101 distributes the attention computation across a plurality of attention heads (for example, four heads) and processes them in parallel. Because attention is thereby computed from two or more perspectives, the quality and accuracy of the attention information improve, and the training speed may also improve.

FIG. 10 is a diagram illustrating an example of the detailed structure of the multi-head attention layer 101 shown in FIG. 9. The multi-head attention layer 101 uses a different weight matrix for each attention head to compute, by scaled dot-product attention, as many attention matrices as there are attention heads, and then concatenates them. The Q value vector, from among the K, V, and Q input values of the multi-head attention layer 101, is then concatenated to this concatenated matrix, and the multi-head attention matrix is obtained by passing the result through a linear layer. By forming a residual connection that joins the input value (the Q value) of the multi-head attention layer back to its output value in this way, information loss inside the multi-head attention can be minimized.
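The structure described for FIG. 10 can be sketched in plain Python as below. This is an illustrative reading of the text, not the patented code: the per-head projection matrices `(Wq, Wk, Wv)`, the output matrix `w_out`, and all function names are hypothetical, and the softmax scaling by the square root of the key dimension follows the standard definition of scaled dot-product attention.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])          # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(
    q: Matrix, k: Matrix, v: Matrix,
    head_weights: Sequence[Tuple[Matrix, Matrix, Matrix]],
    w_out: Matrix,
) -> Matrix:
    # One attention result per head, each with its own weight matrices.
    heads = [
        scaled_dot_product_attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
        for wq, wk, wv in head_weights
    ]
    # Concatenate the head outputs, then concatenate Q itself (the residual
    # connection described in the text), and apply the final linear layer.
    concat = [sum((h[i] for h in heads), []) + list(q[i]) for i in range(len(q))]
    return matmul(concat, w_out)
```

Note that the residual connection here is a concatenation of Q before the linear layer, as the text describes, rather than the element-wise addition used in the original transformer architecture.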

The description continues with reference to FIG. 9 again.

The multi-head attention matrix 105 generated by the multi-head attention layer 101 may be converted into a context vector 107 through the feed-forward layer 103. The feed-forward layer 103 may also be understood as a position-wise feed-forward neural network layer.

In one embodiment, the feed-forward layer 103 may be implemented as a convolutional neural network. Implementing the feed-forward layer 103 as a convolutional neural network rather than a linear neural network can be expected to speed up training of the speech synthesis model and improve the accuracy of Korean pronunciation.
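A convolutional feed-forward layer of this kind can be sketched as two 1D convolutions along the phoneme sequence. The kernel layout, the zero padding, the ReLU activation, and the two-convolution structure are all assumptions for illustration; the text states only that the feed-forward layer is a convolutional rather than a linear network.

```python
from typing import List

Matrix = List[List[float]]  # [sequence position][channel]


def conv1d_same(x: Matrix, kernel: List[Matrix], bias: List[float]) -> Matrix:
    """1D convolution over the sequence axis with zero padding.

    kernel[k][c_in][c_out] has odd width, so the output keeps the input length.
    """
    width, half = len(kernel), len(kernel) // 2
    c_in, c_out = len(x[0]), len(bias)
    out = []
    for pos in range(len(x)):
        row = list(bias)
        for k in range(width):
            src = pos + k - half
            if 0 <= src < len(x):            # positions outside the sequence are zero
                for i in range(c_in):
                    for o in range(c_out):
                        row[o] += x[src][i] * kernel[k][i][o]
        out.append(row)
    return out


def relu(x: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in x]


def conv_feed_forward(x: Matrix, k1, b1, k2, b2) -> Matrix:
    # Unlike a linear layer, each output position also sees its neighbours.
    return conv1d_same(relu(conv1d_same(x, k1, b1)), k2, b2)
```

With a kernel wider than one position, each phoneme's representation is mixed with those of adjacent phonemes, which is one plausible reason such a layer could help with pronunciation that depends on neighbouring sounds.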

Similar to the connections between the encoder blocks 93a to 93n described with reference to FIG. 8, in one embodiment the multi-head attention layer 101 and the feed-forward layer 103 are linked by a residual connection, so that the loss of information that may occur while passing through several layers can be minimized.

FIG. 11 is a diagram for explaining the configuration and operation of the decoder 73 of the first speech synthesis model shown in FIG. 6.

Referring to FIG. 11, the decoder 73 may include a masked multi-head attention layer 111, a multi-head attention layer 113, and a feed-forward layer 115.

Although FIG. 11 shows the decoder 73 as including a single block of layers 111, 113, and 115, note that, like the encoders 63 and 83, the decoder 73 may include a plurality of decoder blocks, each having the layers 111, 113, and 115.

The masked multi-head attention layer 111 receives vectors such as the audio embedding vector 72, computes, for example, the self-attention between mel-spectrogram frames (audio frames), and passes the result to the multi-head attention layer 113 as its Q value.
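The text does not spell out what the mask in layer 111 does. In standard transformer decoders it is a causal mask that prevents each frame from attending to later frames, and the sketch below assumes that meaning; `causal_attention_weights` is a hypothetical name.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Softmax over each row, restricted so row i sees only columns 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]               # frames generated so far
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future frames receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


w = causal_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
```

Such a mask keeps training consistent with generation, where frames that have not yet been produced cannot be consulted.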

The multi-head attention layer 113 receives the context vector 107 provided by the encoder 63 as its K and V values, and receives the value computed by the masked multi-head attention layer 111 as its Q value. Using these values, the multi-head attention layer 113 computes attention information between the phonemes processed by the encoder 63 and the mel-spectrogram frames processed by the decoder 73.

In the masked multi-head attention layer 111 and the multi-head attention layer 113 as well, distributing the attention computation across a plurality of attention heads (for example, six heads) and processing them in parallel can be expected to improve the quality and accuracy of the attention information and the training speed.

The attention information computed by the multi-head attention layer 113 is passed to the feed-forward layer 115, and the feed-forward layer 115 may output a vector 119 for generating a mel-spectrogram frame.

In one embodiment, similar to the connections within an encoder block and between the encoder blocks 93a to 93n, the masked multi-head attention layer 111, the multi-head attention layer 113, and the feed-forward layer 115 are linked by residual connections, so that the loss of information that may occur while passing through several layers can be minimized.

Meanwhile, the multi-head attention matrix 105 generated in the decoder 73 of the first speech synthesis model may be provided to the phoneme length calculation module 65, and the phoneme length calculation module 65 may calculate the length of each phoneme included in the input text based on that multi-head attention matrix.

Specifically, the multi-head attention layer 113 generates at least as many attention matrices as there are heads, each representing the attention weights between the phonemes processed by the encoder 63 and the mel-spectrogram frames processed by the decoder 73.

The phoneme length calculation module 65 according to some embodiments of the present invention may select only some of the plurality of attention matrices generated in the multi-head attention layer 113 and use them for the phoneme length calculation.

Because the attention matrices generated in the multi-head attention layer 113 represent the attention relationship between the phonemes input to the encoder 63 and the mel-spectrogram frames output from the decoder 73, an attention matrix may, much like a diagonal matrix, have high values along the diagonal and low values elsewhere.

Plots 161, 162, 163, and 164 of FIG. 12 display, in a two-dimensional space, the weights of exemplary attention matrices generated in the multi-head attention layer 113. In the plots 161, 162, 163, and 164, the horizontal axis represents the sequence of phonemes and the vertical axis represents the sequence of mel-spectrogram frames. In FIG. 12, plots 161 and 164 show attention matrices in which the phoneme sequence and the mel-spectrogram frame sequence correspond strongly and in order. In other words, attention matrices that look like plot 161 or plot 164 have high values along the diagonal and low values elsewhere. Plots 162 and 163, by contrast, show attention matrices in which the two sequences do not correspond in order.

The phoneme length calculation module 65 according to some embodiments of the present invention may use, from among the attention matrices generated in the multi-head attention layer 113, only those in which the phoneme sequence and the mel-spectrogram frame sequence correspond strongly and in order. Such matrices can be identified, for example, by calculating the F value of Equation 1 below.

F: attention focus ratio

S: mel-spectrogram sequence length

T: phoneme sequence length

a_s,t: attention matrix

In some embodiments, the phoneme length calculation module 65 may select only the single attention matrix with the largest F value and use it for the phoneme length calculation. In some other embodiments, the phoneme length calculation module 65 may select the attention matrices whose F value exceeds a predefined threshold and use the average of those matrices for the phoneme length calculation. By using only attention matrices in which the phoneme sequence and the mel-spectrogram frame sequence correspond strongly and in order, a more accurate phoneme length can be calculated.
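Equation 1 itself is not reproduced in this text. The sketch below assumes the focus-rate form that matches the listed symbols, F = (1/S) Σ_s max_t a_s,t, which is the definition used for attention-head selection in FastSpeech; the function names and the row/column layout (rows are mel-spectrogram frames, columns are phonemes) are assumptions for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows: mel-spectrogram frames s, columns: phonemes t


def focus_rate(attn: Matrix) -> float:
    """Assumed Equation 1: mean over frames of the largest weight in each row."""
    return sum(max(row) for row in attn) / len(attn)


def select_best(attns: Sequence[Matrix]) -> Matrix:
    """First variant in the text: keep only the matrix with the largest F."""
    return max(attns, key=focus_rate)


def average_above(attns: Sequence[Matrix], threshold: float) -> Matrix:
    """Second variant: element-wise average of the matrices whose F exceeds the threshold."""
    chosen = [a for a in attns if focus_rate(a) > threshold]
    if not chosen:
        raise ValueError("no attention matrix exceeds the threshold")
    rows, cols = len(chosen[0]), len(chosen[0][0])
    return [
        [sum(a[s][t] for a in chosen) / len(chosen) for t in range(cols)]
        for s in range(rows)
    ]


sharp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # frames map cleanly onto phonemes
diffuse = [[0.5, 0.5]] * 3                      # no clear alignment
```

Under this definition a matrix like `sharp` scores 1.0 and one like `diffuse` scores 0.5, so either selection rule keeps the sharply aligned matrix.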

The phoneme length calculation module 65 according to some embodiments of the present invention may calculate the length of each phoneme by calculating the d_i value of Equation 2 for each phoneme.

d_i: phoneme length

S: mel-spectrogram sequence length

a_s,t: attention matrix
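Equation 2 is likewise not reproduced in this text. A form consistent with the listed symbols is d_i = Σ_s [argmax_t a_s,t = i], that is, the number of mel-spectrogram frames whose strongest attention falls on phoneme i, as in FastSpeech duration extraction. The sketch below assumes that form; `phoneme_durations` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]  # rows: mel-spectrogram frames s, columns: phonemes t


def phoneme_durations(attn: Matrix) -> List[int]:
    """Assumed Equation 2: count, per phoneme, the frames that attend to it most."""
    durations = [0] * len(attn[0])
    for row in attn:
        # argmax over phonemes for this frame (first index wins on ties)
        durations[row.index(max(row))] += 1
    return durations


# Three frames, two phonemes: the first phoneme spans two frames.
example_durations = phoneme_durations([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])
```

Every frame is assigned to exactly one phoneme, so the durations always sum to the mel-spectrogram sequence length S.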

FIG. 13 is a block diagram showing the configuration of the decoder 87 of the second speech synthesis model shown in FIG. 7. As shown in FIG. 13, the decoder 87 of the second speech synthesis model has a configuration and function similar to those of the decoder 73 of the first speech synthesis model described with reference to FIG. 11, so the description given for the decoder 73 with reference to FIG. 11 also applies to the decoder 87. A repeated description of the function and operation of the decoder 87 is therefore omitted.

Note, however, one difference: in the first speech synthesis model, the attention matrix generated by the multi-head attention 123 of the decoder 73 is provided to the phoneme length calculation module 65, whereas this is not the case for the decoder 87 of the second speech synthesis model.

The speech synthesis apparatus 10 according to an embodiment of the present invention has been described in detail so far with reference to FIGS. 1 to 13.

In addition, by separating the input text into phoneme units and processing it, the speech synthesis apparatus 10 according to some embodiments of the present invention solves the problem that conventional models for synthesizing Korean, which has characteristics quite different from those of English or Chinese, cannot express the pause time between sentence components (sentences, clauses, phrases, word segments, syllables, etc.). It can also synthesize natural speech in which a breath sound is uttered after the punctuation mark (".") between sentences, as when a person utters long sentences.

In addition, the speech synthesis apparatus 10 according to some embodiments of the present invention first trains a first speech synthesis model for extracting phoneme lengths, obtains first phoneme length information through the trained model, and trains and uses a second speech synthesis model that synthesizes the actual speech using the first phoneme length information. By using first phoneme length information obtained through a separate neural network model in the speech synthesis model in this way, even the same phoneme can be voiced for a different duration depending on its position in the sentence or the word in which it is used. That is, the speech synthesis apparatus 10 can synthesize speech as natural as actual human utterance.

In addition, by including a speech synthesis model implemented as a transformer model with multi-head attention layers instead of a recurrent neural network model such as an LSTM, the speech synthesis apparatus 10 according to some embodiments of the present invention can improve training speed, improve the quality of speech synthesis, and reduce the computing resources required for speech synthesis. Furthermore, the speech synthesis apparatus 10 can overcome limitations of existing speech synthesis models. For example, even when the text given as input to the speech synthesis model is long, the speech synthesis apparatus 10 does not cause problems such as part of the text being omitted from the synthesized result or some words being repeated and generated incorrectly.

In addition, the speech synthesis apparatus 10 according to some embodiments of the present invention can improve the accuracy of Korean pronunciation by implementing, as a convolutional neural network, the neural network that feeds forward the output of the multi-head attention layer of the transformer model.

Hereinafter, a speech synthesis method according to another embodiment of the present invention will be described with reference to FIGS. 14 and 15.

FIG. 14 is an exemplary flowchart illustrating a speech synthesis method according to an embodiment of the present invention. This is, however, only a preferred embodiment for achieving the object of the present invention, and some steps may be added or deleted as necessary.

Each step of the speech synthesis method shown in FIG. 14 may be performed by a computing device such as the speech synthesis apparatus 10. In other words, each step of the method may be implemented as one or more instructions executed by a processor of a computing device. All steps of the method may be executed by a single physical computing device, or first steps of the method may be performed by a first computing device and second steps by a second computing device. The description below assumes that each step is performed by the speech synthesis apparatus 10. For convenience of description, however, the subject performing each step may be omitted.

The descriptions of FIGS. 1 to 13 may be referred to in understanding each operation of the speech synthesis method according to this embodiment. Likewise, the technical ideas reflected in each operation of the method may also be reflected in the configuration and operation of the speech synthesis apparatus 10 described with reference to FIGS. 1 to 13.

Referring to FIG. 14, the speech synthesis method according to this embodiment may include: inputting an input text into a first neural network model (S100); obtaining, from the first neural network model, first phoneme length information of the phonemes included in the input text (S200); inputting the input text and the first phoneme length information into a second neural network model (S300); and obtaining output audio corresponding to the input text from the second neural network model (S400).

The first neural network model of steps S100 and S200 may be the first speech synthesis model described with reference to FIG. 6, and obtaining the first phoneme length information from the first neural network model in step S200 may correspond to obtaining the first phoneme length information from the first speech synthesis model.

The second neural network model of steps S300 and S400 may be the second speech synthesis model described with reference to FIG. 7.

Step S300 may include separating the input text into phoneme units by the preprocessor 21.
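The flow of steps S100 to S400 can be summarized as a two-stage pipeline. The sketch below only shows the order in which data passes between the stages; `synthesize` and its three callable parameters are hypothetical placeholders, not interfaces defined by the patent, and the toy stand-ins do no real synthesis.

```python
from typing import Callable, List, Sequence, TypeVar

Audio = TypeVar("Audio")


def synthesize(
    text: str,
    to_phonemes: Callable[[str], List[str]],
    first_model: Callable[[Sequence[str]], List[int]],
    second_model: Callable[[Sequence[str], Sequence[int]], Audio],
) -> Audio:
    phonemes = to_phonemes(text)                  # preprocessing into phoneme units
    first_lengths = first_model(phonemes)         # S100, S200: first phoneme length information
    return second_model(phonemes, first_lengths)  # S300, S400: output audio


# Toy stand-ins: one "phoneme" per character, every phoneme one frame long.
result = synthesize(
    "ab",
    to_phonemes=list,
    first_model=lambda p: [1] * len(p),
    second_model=lambda p, lengths: list(zip(p, lengths)),
)
```

The point of the arrangement is that the second model receives the phoneme lengths as an explicit input, so the first model can be trained and run separately.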

Step S400 will be described in more detail with reference to FIG. 15.

First, in step S410, an embedding vector representing each phoneme included in the input text may be obtained, for example by the phoneme embedding module 81. Also in step S410, information on phoneme length may be added to the embedding vector by positionally encoding the embedding vector based on the first phoneme length information.

In step S420, the embedding vector may be input to, for example, the encoder 83.

In step S430, a context vector may be obtained from the encoder. The context vector may be obtained by passing the multi-head attention matrix produced by the multi-head attention layer included in the encoder through a feed-forward neural network layer. The context vector may represent self-attention information between phonemes and/or attention information between phonemes and audio.

In step S440, second phoneme length information may be obtained from the encoder. The second phoneme length information indicates the duration for which each phoneme included in the input text should be voiced in the output audio speech 3. The second phoneme length information may be calculated, for example by the phoneme length prediction module 84, based on the context vector obtained from the encoder.

In step S450, position information may be encoded into the audio embedding vector using the second phoneme length information. Step S450 may be performed, for example, by the audio embedding module 85 and/or a positional encoding module.

In step S460, output audio may be obtained by inputting the audio embedding vector in which the position information is encoded to the decoder. More specifically, the decoder receives the embedding vector for the <sos> token and outputs the first mel-spectrogram frame. The output frame serves as the input value for generating the next mel-spectrogram frame: it is converted into a vector and then fed to the decoder. The mel-spectrogram frame that the decoder 87 generates from that input value is in turn used as the input value from which the decoder generates the following frame. By repeating these steps sequentially, data for synthesizing the output audio speech corresponding to the input text may be obtained.

A speech synthesis apparatus, a speech synthesis method, and their fields of application according to some embodiments of the present invention have been described so far with reference to FIGS. 1 to 15. Hereinafter, an exemplary computing device 1500 capable of implementing the speech synthesis apparatus 10 according to some embodiments of the present invention will be described.

FIG. 16 is a hardware configuration diagram showing an exemplary computing device 1500 capable of implementing a speech synthesis apparatus according to some embodiments of the present invention.

As shown in FIG. 16, the computing device 1500 may include one or more processors 1510, a bus 1550, a communication interface 1570, a memory 1530 into which a computer program 1591 executed by the processor 1510 is loaded, and a storage 1590 that stores the computer program 1591. FIG. 16, however, shows only the components related to the embodiments of the present invention. A person of ordinary skill in the art to which the present invention pertains will therefore recognize that other general-purpose components may be included in addition to those shown in FIG. 16.

The processor 1510 controls the overall operation of each component of the computing device 1500. The processor 1510 may include a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphic Processing Unit), or any type of processor well known in the technical field of the present invention. The processor 1510 may also perform operations for at least one application or program for executing the methods according to embodiments of the present invention. The computing device 1500 may have one or more processors.

The memory 1530 stores various data, commands, and/or information. The memory 1530 may load one or more programs 1591 from the storage 1590 in order to execute the methods according to embodiments of the present invention. The memory 1530 may be implemented as a volatile memory such as RAM, but the technical scope of the present invention is not limited thereto.

The bus 1550 provides a communication function between the components of the computing device 1500. The bus 1550 may be implemented as various types of bus, such as an address bus, a data bus, and a control bus.

The communication interface 1570 supports wired and wireless Internet communication of the computing device 1500. The communication interface 1570 may also support various communication methods other than Internet communication. To this end, the communication interface 1570 may include a communication module well known in the technical field of the present invention.

According to some embodiments, the communication interface 1570 may be omitted.

The storage 1590 may non-temporarily store the one or more programs 1591 and various data.

The storage 1590 may include a non-volatile memory such as a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), or a flash memory, a hard disk, a removable disk, or any type of computer-readable recording medium well known in the art to which the present invention pertains.

The computer program 1591 may include one or more instructions that, when loaded into the memory 1530, cause the processor 1510 to perform the methods and operations according to various embodiments of the present invention. That is, the processor 1510 may perform the methods and operations according to various embodiments of the present invention by executing the one or more instructions.

In this case, the speech synthesis apparatus 10 according to some embodiments of the present invention may be implemented through the computing device 1500.

지금까지 도 1 내지 도 16을 참조하여 본 발명의 다양한 실시예들 및 그 실시예들에 따른 효과들을 언급하였다. 본 발명의 기술적 사상에 따른 효과들은 이상에서 언급한 효과들로 제한되지 않으며, 언급되지 않은 또 다른 효과들은 아래의 기재로부터 통상의 기술자에게 명확하게 이해될 수 있을 것이다.So far, various embodiments of the present invention and effects according to the embodiments have been described with reference to FIGS. 1 to 16 . Effects according to the technical spirit of the present invention are not limited to the above-mentioned effects, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

The technical idea of the present invention described so far with reference to FIGS. 1 to 16 may be implemented as computer-readable code on a computer-readable medium. The computer-readable recording medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disc, USB storage device, removable hard disk) or a fixed recording medium (ROM, RAM, hard disk built into a computer). The computer program recorded on the computer-readable recording medium may be transmitted to another computing device through a network such as the Internet and installed on that computing device, and may thereby be used on that computing device.

Although all the components constituting the embodiments of the present invention have been described above as being combined into one or as operating in combination, the technical idea of the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated.

Although operations are shown in a particular order in the drawings, this should not be understood to mean that the operations must be performed in the particular order shown or in sequential order, or that all of the illustrated operations must be performed, in order to obtain a desired result. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of the various components in the embodiments described above should not be understood as requiring such separation, and it should be understood that the described program components and systems may generally be integrated together into a single software product or packaged into multiple software products.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent to the claims should be interpreted as falling within the scope of rights of the technical idea defined by the present invention.

Claims

A speech synthesis method in which a computing device synthesizes speech from text, the method comprising:
inputting an input text into a first neural network model to obtain first phoneme length information of a phoneme included in the input text;
inputting the input text and the first phoneme length information to an encoder of a second neural network model to obtain second phoneme length information of the phoneme included in the input text; and
obtaining output audio corresponding to the input text from a decoder of the second neural network model based on the second phoneme length information,
wherein each of the first neural network model and the second neural network model includes an encoder and a decoder,
wherein the first phoneme length information is determined based at least in part on information obtained from the decoder of the first neural network model,
wherein the first phoneme length information is used to encode position information into a phoneme embedding vector input to the encoder of the second neural network model, and
wherein the second phoneme length information is used to encode position information into an audio embedding vector input to the decoder of the second neural network model.
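
The claim above states that phoneme length information is used to encode position information into an embedding vector, but does not give a formula. The following is a minimal sketch of one way this could work, assuming (this is not stated in the claim) that each phoneme's position is its start frame, taken as the cumulative sum of the preceding phoneme lengths, and that a standard sinusoidal positional encoding is added to the embedding. The function names `start_positions`, `sinusoidal_encoding`, and `encode_positions` are hypothetical.

```python
import math


def start_positions(durations):
    """Start frame of each phoneme (assumption: cumulative sum of lengths)."""
    positions, elapsed = [], 0
    for d in durations:
        positions.append(elapsed)
        elapsed += d
    return positions


def sinusoidal_encoding(position, dim):
    """Standard sinusoidal positional encoding for one position."""
    enc = []
    for k in range(dim):
        angle = position / (10000 ** (2 * (k // 2) / dim))
        # Even dimensions use sine, odd dimensions use cosine.
        enc.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return enc


def encode_positions(embeddings, durations):
    """Add a duration-derived positional encoding to each phoneme embedding."""
    dim = len(embeddings[0])
    encoded = []
    for vec, pos in zip(embeddings, start_positions(durations)):
        pe = sinusoidal_encoding(pos, dim)
        encoded.append([v + p for v, p in zip(vec, pe)])
    return encoded


# Three phonemes with 4-dimensional (all-zero) embeddings, for illustration only.
embeddings = [[0.0] * 4 for _ in range(3)]
durations = [2, 1, 1]  # hypothetical first phoneme length information, in frames
encoded = encode_positions(embeddings, durations)
```

With this choice, a long phoneme pushes every later phoneme further along the position axis, so the encoder sees timing as well as order. Other mappings from phoneme length to position are equally compatible with the claim wording.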

The method of claim 1,
wherein the first neural network model and the second neural network model are different speech synthesis models.

The method of claim 1,
wherein obtaining the second phoneme length information comprises:
obtaining a phoneme embedding vector from the input text;
inputting the phoneme embedding vector to the encoder of the second neural network model;
obtaining a context vector from the encoder of the second neural network model; and
obtaining the second phoneme length information from the encoder of the second neural network model, and
wherein obtaining the output audio comprises:
obtaining the output audio from the decoder of the second neural network model based on the context vector and the second phoneme length information.

The method of claim 3,
wherein the context vector represents attention information about the phoneme, and
wherein the second phoneme length information indicates the time for which utterance of the phoneme is to continue in the output audio.

The method of claim 4,
wherein obtaining the output audio from the decoder based on the context vector and the second phoneme length information comprises:
encoding position information into an audio embedding vector input to the decoder, based on the second phoneme length information.

The method of claim 1,
wherein obtaining the first phoneme length information comprises:
calculating the first phoneme length information based on an attention matrix generated by the decoder of the first neural network model.

The method of claim 6,
wherein the first phoneme length information is calculated using Equation 1 below:
[Equation 1]

where, in Equation 1, d_i is the length of the i-th phoneme, S is the length of the sequence of the output audio, and a_{s,t} is the attention matrix.
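
The image for Equation 1 is not reproduced in this text. The sketch below shows one formulation consistent with the symbol definitions given in the claim; it is an assumption modeled on the duration-extraction step commonly used in attention-based speech synthesis, not a reconstruction of the patent's equation. Each of the S output frames is assigned to the phoneme it attends to most strongly, and d_i is the number of frames assigned to phoneme i. The function name `phoneme_durations` is hypothetical.

```python
def phoneme_durations(attention, num_phonemes):
    """Count, for each phoneme, the output frames that attend to it most.

    attention[s][t] is the attention weight of output frame s on phoneme t.
    """
    durations = [0] * num_phonemes
    for row in attention:
        # Index of the phoneme with the largest weight for this frame.
        best = max(range(num_phonemes), key=lambda t: row[t])
        durations[best] += 1
    return durations


# Toy attention matrix: S = 4 output frames, T = 3 phonemes.
attention = [
    [0.9, 0.1, 0.0],
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]
durations = phoneme_durations(attention, 3)
print(durations)  # [2, 1, 1]
```

Under this assumption the phoneme lengths always sum to S, because every output frame is counted exactly once.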

A speech synthesis device for synthesizing speech from text, the device comprising:
a preprocessor for separating an input text into a plurality of phonemes;
a first neural network model for generating first phoneme length information from the plurality of phonemes; and
a second neural network model for generating output audio corresponding to the input text from the plurality of phonemes,
wherein the first neural network model includes a first encoder, a first decoder, and a phoneme length calculation module,
wherein the first encoder receives the plurality of phonemes,
wherein the phoneme length calculation module is configured to generate the first phoneme length information for the plurality of phonemes based at least in part on information obtained from the first decoder,
wherein the second neural network model includes a second encoder, a phoneme length prediction module, and a second decoder,
wherein the second encoder generates a context vector from a phoneme embedding vector for the plurality of phonemes,
wherein the phoneme length prediction module generates second phoneme length information from the context vector,
wherein the second decoder generates the output audio based on the context vector,
wherein the first phoneme length information is used to encode position information into the phoneme embedding vector input to the second encoder of the second neural network model, and
wherein the second phoneme length information is used to encode position information into an audio embedding vector input to the second decoder of the second neural network model.

The speech synthesis device of claim 8,
wherein the first decoder includes a multi-head attention layer that generates a plurality of first attention matrices, and
wherein the phoneme length calculation module calculates the first phoneme length information for the plurality of phonemes by using at least some of the plurality of first attention matrices.

The speech synthesis device of claim 9,
wherein the phoneme length calculation module calculates the lengths of the plurality of phonemes by using, among the plurality of first attention matrices, the attention matrix having the highest attention focus ratio calculated using Equation 2 below:
[Equation 2]

where, in Equation 2, F is the attention focus ratio, S is the length of the sequence of the output audio, T is the length of the sequence of the plurality of phonemes included in the input text, and a_{s,t} is an attention matrix.
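
The image for Equation 2 is not reproduced in this text. The sketch below assumes a common definition of the focus ratio that fits the listed symbols: F is the average, over the S output frames, of the largest attention weight among the T phonemes. This definition is an assumption, not a reconstruction of the patent's equation. The function names `focus_ratio` and `select_most_focused` are hypothetical.

```python
def focus_ratio(attention):
    """Mean over output frames of the maximum attention weight per frame."""
    return sum(max(row) for row in attention) / len(attention)


def select_most_focused(heads):
    """Index of the attention matrix with the highest focus ratio."""
    return max(range(len(heads)), key=lambda h: focus_ratio(heads[h]))


# Two toy attention matrices (S = 2 frames, T = 2 phonemes) from different heads.
diffuse_head = [[0.5, 0.5], [0.6, 0.4]]
sharp_head = [[0.9, 0.1], [0.2, 0.8]]

heads = [diffuse_head, sharp_head]
best = select_most_focused(heads)  # selects the sharper, more diagonal head
```

A head whose weights concentrate on one phoneme per frame scores close to 1, which is why a high focus ratio is a reasonable criterion for choosing the matrix from which phoneme lengths are derived.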

The speech synthesis device of claim 8,
wherein the first neural network model has a transformer structure and is trained using a training text and ground-truth speech audio corresponding to the training text.

The speech synthesis device of claim 8,
wherein the second neural network model further comprises:
a phoneme embedding module that generates, based on the first phoneme length information, the phoneme embedding vector representing a phoneme included in the input text and the position of that phoneme.

The speech synthesis device of claim 12,
wherein the second neural network model further comprises:
an audio embedding module that encodes position information into an audio embedding vector input to the second decoder, based on the second phoneme length information.

The speech synthesis device of claim 13,
wherein the second phoneme length information indicates the time for which utterance of the phoneme is to continue in the output audio.

The speech synthesis device of claim 12,
wherein the second encoder comprises an encoder block, and
wherein the encoder block comprises:
a multi-head attention layer for calculating a second attention matrix based on the phoneme embedding vectors of the plurality of phonemes; and
a feed forward layer for calculating the context vector from the second attention matrix.

The speech synthesis device of claim 15,
wherein the feed forward layer comprises a convolutional neural network.

The speech synthesis device of claim 15,
wherein the encoder comprises a plurality of encoder blocks, and
wherein the plurality of encoder blocks are connected through residual connections.

The speech synthesis device of claim 12,
wherein the second decoder comprises a decoder block, and
wherein the decoder block comprises:
a first multi-head attention layer for calculating self-attention information between a plurality of frames constituting the output audio;
a second multi-head attention layer for calculating attention information between the plurality of frames and the plurality of phonemes; and
a feed forward layer.

The speech synthesis device of claim 12,
wherein the second neural network model is trained using a training text, the length of each of a plurality of phonemes included in the training text, and ground-truth speech audio corresponding to the training text.

(Deleted)