KR102639322B1

KR102639322B1 - Voice synthesis system and method capable of duplicating tone and prosody styles in real time

Info

Publication number: KR102639322B1
Application number: KR1020220079679A
Authority: KR
Inventors: 이근배; 전예진
Original assignee: 포항공과대학교 산학협력단
Priority date: 2021-11-22
Filing date: 2022-06-29
Publication date: 2024-02-21
Also published as: KR20230075340A

Abstract

Disclosed are a speech synthesis system and method capable of replicating timbre and prosody styles in real time, which use deep-learning-based end-to-end speech synthesis with global style tokens to synthesize speech quickly and accurately in a variety of prosody and speaker-timbre styles. The speech synthesis method includes: extracting a fundamental frequency from input reference audio and passing it to the input of a reference encoder; encoding the fundamental frequency with the reference encoder to generate a prosody embedding; generating a first style embedding from the prosody embedding; converting the reference audio into a reference mel spectrogram by Fourier transform; encoding the reference mel spectrogram to generate a speaker embedding; generating a second style embedding from the speaker embedding; inputting an integrated style embedding, which is the sum of the first style embedding and the second style embedding, into the attention of a text-to-speech (TTS) model together with the output of the encoder of the TTS model; and generating, with the TTS model, synthesized audio for the input text in which timbre and prosody are combined.

Description

{VOICE SYNTHESIS SYSTEM AND METHOD CAPABLE OF DUPLICATING TONE AND PROSODY STYLES IN REAL TIME}

The present invention relates to a speech synthesis system that synthesizes speech quickly and accurately in a variety of prosody and speaker-timbre styles, and more particularly to a speech synthesis system and method capable of replicating timbre and prosody styles in real time using deep-learning-based end-to-end speech synthesis with global style tokens (GST).

Text-to-speech (TTS) synthesis is a technology whose ultimate goal is to read input text aloud in an accurate and natural voice, and it is applicable to any field in which a person would otherwise speak. Representative applications include navigation, audiobooks, and artificial intelligence assistants.

In addition, rapidly advancing deep learning technology has greatly improved the quality of synthesized speech, and the introduction of end-to-end speech synthesis models has made it possible to train a model relatively simply, without specialized knowledge or additional work.

Despite these advances, current deep-learning-based end-to-end speech synthesis systems cannot faithfully reproduce the prosodic information commonly heard in everyday conversation, such as emotion, intonation, and tone of voice, or the timbre of the speaker, so further research is required.

The present invention was devised to meet the above needs of the prior art, and an object of the present invention is to provide a speech synthesis system and method capable of replicating timbre and prosody styles in real time, which can replicate various prosodic elements and the timbre of a speaker in real time.

Another object of the present invention is to provide a speech synthesis system and method that can replicate and synthesize timbre and prosody styles in real time using a global style token (GST)-based module.

Yet another object of the present invention is to provide a speech synthesis system and method that allow emotion control in addition to real-time replication of timbre and prosody styles.

A speech synthesis method according to one aspect of the present invention for achieving the above technical object is a speech synthesis method capable of replicating timbre and prosody styles in real time, the method including: extracting a fundamental frequency from input reference audio (a reference waveform) and passing it to the input of a reference encoder; encoding, by the reference encoder, the fundamental frequency to generate a prosody embedding; generating a first style embedding from the prosody embedding; converting the reference audio into a reference mel spectrogram by Fourier transform; encoding the reference mel spectrogram to generate a speaker embedding; generating a second style embedding from the speaker embedding; inputting an integrated style embedding, which is the sum of the first style embedding and the second style embedding, into the attention of a text-to-speech (TTS) model; and generating, by the TTS model, audio for the input text in which the tone and prosody conveyed by the integrated style embedding are synthesized.

In one embodiment, the step of extracting the fundamental frequency may include a process of cutting the reference audio into sliding windows of a fixed length and calculating a pitch contour through frequency decomposition of the reference audio using a normalized cross-correlation function. Here, pitch may correspond to how high or low the sound of the reference audio is.
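The windowed pitch extraction described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the function names, the window and hop sizes (given in samples), the 60 to 400 Hz search range, the voicing threshold, and the small margin used to favor the shortest period are all assumptions introduced for the example.

```python
import math

def nccf_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 (Hz) of one frame from the normalized cross-correlation.

    The frame must be longer than sample_rate / f_min samples.
    Returns (f0, score); f0 is 0.0 when no lag correlates positively.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    n = len(frame) - lag_max              # samples compared at every lag
    head = frame[:n]
    e0 = sum(x * x for x in head)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        shifted = frame[lag:lag + n]
        num = sum(a * b for a, b in zip(head, shifted))
        den = math.sqrt(e0 * sum(b * b for b in shifted)) or 1.0
        score = num / den
        # Accept a longer lag only on a clear improvement (assumed margin),
        # which favors the shortest period and avoids octave errors.
        if score > best_score + 1e-3:
            best_lag, best_score = lag, score
    f0 = sample_rate / best_lag if best_lag else 0.0
    return f0, best_score

def pitch_contour(signal, sample_rate, win=320, hop=80, voicing_threshold=0.5):
    """Slide a fixed-length window over the signal and collect one F0 per frame."""
    contour = []
    for start in range(0, len(signal) - win + 1, hop):
        f0, score = nccf_pitch(signal[start:start + win], sample_rate)
        # Frames with a weak correlation peak are treated as unvoiced (0 Hz).
        contour.append(f0 if score >= voicing_threshold else 0.0)
    return contour
```

Calling `pitch_contour(samples, 8000)` on a list of float samples returns one F0 value per 80-sample hop, with 0.0 marking frames judged unvoiced.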

In one embodiment, the step of generating the first style embedding may be configured to generate the first style embedding for the prosody of the reference audio using a three-layer hierarchical GST consisting of a first global style token (GST) layer through a third GST layer.

In one embodiment, each of the first to third GST layers may include a multi-head attention that passes the input embedding to each of the plurality of tokens of that GST layer, and a plurality of tokens connected to the multi-head attention, and may be configured to measure the similarity between the input embedding and each token and to generate a style embedding as a weighted sum of the tokens.
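The similarity-then-weighted-sum behavior of a single GST layer can be illustrated with the sketch below. It is a simplified assumption-laden example: scaled dot-product similarity is assumed as the similarity measure, the learned query/key/value projections of a real multi-head attention are omitted, and the name `gst_layer` and the fixed token values are hypothetical.

```python
import math

def gst_layer(query, tokens, num_heads=2):
    """Return (style_embedding, attention_weights_per_head).

    Each head looks at its own slice of the embedding, scores every token
    against the query slice, and emits the softmax-weighted sum of tokens.
    """
    dim = len(query)
    head_dim = dim // num_heads
    style, weights_per_head = [], []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        # Similarity between the input embedding and each token (scaled dot product).
        scores = [sum(a * b for a, b in zip(q, t[lo:hi])) / math.sqrt(head_dim)
                  for t in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
        total = sum(exps)
        w = [e / total for e in exps]
        # Weighted sum of the token slices forms this head's part of the style embedding.
        style.extend(sum(wi * t[lo + j] for wi, t in zip(w, tokens))
                     for j in range(head_dim))
        weights_per_head.append(w)
    return style, weights_per_head
```

In a trained system the tokens would be learned parameters; here they are plain lists so that the attention weights can be inspected directly.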

In one embodiment, the first to third GST layers may have a residual connection that links the plurality of tokens of the current GST layer to the plurality of tokens of the previous GST layer.

In one embodiment, the step of generating the second style embedding may be configured to generate the second style embedding for the timbre of the reference audio using a fourth GST layer.

In one embodiment, the step of generating the audio may include: extracting, by the encoder of the TTS model, feature information from the characters of the text input to the encoder; mapping, by the attention, the feature information at every time step to attention alignment information to be used by the decoder of the TTS model, and passing the mapped attention alignment information and the integrated style embedding to the decoder; and generating, by the decoder, a mel spectrogram for the current time step in which the integrated style embedding is incorporated, using the attention information from the attention and the mel spectrogram of the previous time step.

In one embodiment, the step of generating the audio may further include generating audio from the mel spectrogram of the current time step using MelGAN (Mel generative adversarial networks).

In one embodiment, the speech synthesis method may further include: comparing a target mel spectrogram corresponding to the reference mel spectrogram with a predicted mel spectrogram generated by the TTS model for the input text and the reference audio, to obtain the difference between the target mel spectrogram and the predicted mel spectrogram or a loss corresponding to that difference; and training the TTS model until the difference or loss falls to or below a preset level or reference value.
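The stopping criterion described above (train until the loss falls to or below a reference value) can be sketched with a toy model. Everything concrete here is an assumption made for illustration: the text does not name the loss, so mean squared error is used; the "model" is a single gain parameter applied to a fixed feature matrix; and the names `mse` and `train`, the learning rate, and the threshold are hypothetical.

```python
def mse(target, predicted):
    """Mean squared error between two mel spectrograms (lists of frames)."""
    n = sum(len(row) for row in target)
    return sum((t - p) ** 2
               for t_row, p_row in zip(target, predicted)
               for t, p in zip(t_row, p_row)) / n

def train(features, target, threshold=1e-4, lr=0.1, max_steps=1000):
    """Fit a single gain so that gain * features approximates target.

    Training stops as soon as the loss is at or below `threshold`.
    Returns (gain, final_loss, steps_taken).
    """
    n = sum(len(row) for row in features)
    gain, loss, step = 0.0, float("inf"), 0
    for step in range(max_steps):
        predicted = [[gain * x for x in row] for row in features]
        loss = mse(target, predicted)
        if loss <= threshold:            # preset reference value reached
            break
        # Analytic gradient of the MSE with respect to the gain.
        grad = sum(2 * (p - t) * x
                   for f_row, t_row, p_row in zip(features, target, predicted)
                   for x, t, p in zip(f_row, t_row, p_row)) / n
        gain -= lr * grad
    return gain, loss, step
```

A real TTS model would replace the single gain with the network parameters and the analytic gradient with backpropagation; only the compare-then-stop structure carries over.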

A speech synthesis system according to another aspect of the present invention for achieving the above technical object is a speech synthesis system capable of replicating timbre and prosody styles in real time, the system including: an extraction unit that extracts a fundamental frequency from input reference audio (a reference waveform); a first reference encoder that receives the fundamental frequency and encodes it to generate a prosody embedding; hierarchical global style token layers that generate, from the prosody embedding, a first style embedding for the prosody of the reference audio; a conversion unit that converts the reference audio into a reference mel spectrogram by Fourier transform; a second reference encoder that encodes the reference mel spectrogram to generate a speaker embedding; a single global style token layer that generates, from the speaker embedding, a second style embedding for the timbre of the reference audio; and a text-to-speech (TTS) model that receives, at its attention, an integrated style embedding that is the sum of the first style embedding and the second style embedding, and generates audio for the input text in which the tone and prosody conveyed by the integrated style embedding are synthesized.

In one embodiment, the extraction unit may be configured to cut the reference audio into sliding windows of a preset fixed length, identify valid frames using a normalized cross-correlation function, and calculate a pitch contour through frequency decomposition of the reference audio within the valid frames. Here, pitch may correspond to how high or low the sound of the reference audio is.

In one embodiment, the extraction unit may estimate the fundamental frequency corresponding to the pitch contour using the YIN method. Here, YIN is a name coined from the yin and yang of Eastern philosophy, alluding to the interplay between autocorrelation and cancellation.
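For readers unfamiliar with YIN, a simplified single-frame sketch follows. It covers only the difference function, the cumulative mean normalized difference, and absolute thresholding; the parabolic interpolation and best-local-estimate steps of the full method are omitted. The function name `yin_f0`, the threshold of 0.1, and the argument layout are assumptions for this example.

```python
def yin_f0(frame, sample_rate, tau_max, threshold=0.1):
    """Estimate F0 (Hz) of one frame with a simplified YIN; 0.0 if unvoiced.

    The frame must contain more than tau_max samples.
    """
    w = len(frame) - tau_max                     # integration window length
    # Step 1: difference function d(tau) = sum_j (x[j] - x[j + tau])^2.
    diff = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff[tau] = sum((frame[j] - frame[j + tau]) ** 2 for j in range(w))
    # Step 2: cumulative mean normalized difference d'(tau).
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / running if running > 0 else 1.0
    # Step 3: first lag under the threshold, refined to the nearby local minimum.
    for tau in range(2, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sample_rate / tau
    return 0.0
```

The normalization in step 2 is what keeps very short lags from being selected, since the raw difference function is always near zero for lags close to zero.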

In one embodiment, the hierarchical global style token layers may consist of three hierarchical GST layers, namely a first global style token (GST) layer through a third GST layer.

In one embodiment, each GST layer of the hierarchical global style token layers may include a multi-head attention and a plurality of tokens connected to the multi-head attention. Here, the first GST layer measures the similarity between the prosody embedding and each token and generates a style embedding as a weighted sum of the tokens, and the style embedding generated by the first GST layer passes through the second GST layer and the third GST layer in sequence and is output as the first style embedding.

In one embodiment, the hierarchical global style token layers may have residual connections in a first pair, formed by the first GST layer and the second GST layer, and in a second pair, formed by the second GST layer and the third GST layer, such that the tokens of the current layer are connected to the tokens of the previous layer.

In one embodiment, the speech synthesis system may further include a learning management unit that compares a target mel spectrogram corresponding to the reference mel spectrogram with a predicted mel spectrogram generated by the TTS model for the input text and the reference audio, to obtain the difference between the target mel spectrogram and the predicted mel spectrogram or a loss corresponding to that difference. The learning management unit may train the TTS model until the difference or loss falls to or below a preset level or reference value.

In one embodiment, the TTS model may include a spectrogram predictor and a vocoder. Here, the spectrogram predictor may generate a mel spectrogram based on the input text and the integrated style embedding, and the vocoder may generate waveform samples corresponding to a synthesized waveform from the mel spectrogram.

In one embodiment, the spectrogram predictor may include an encoder, an attention, and a decoder. The encoder may extract feature information from the characters of the input text. The attention may map the feature information at every time step to attention alignment information to be used by the decoder, and pass the mapped attention alignment information and the integrated style embedding to the decoder. The decoder may then generate a mel spectrogram for the current time step in which the integrated style embedding is incorporated, using the attention information from the attention and the mel spectrogram of the previous time step.

In one embodiment, the spectrogram predictor may further include a preprocessing unit located at the input of the encoder. The preprocessing unit may be configured to split the input text into syllables and to represent the resulting syllables as integers through one-hot encoding.
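A possible shape for this preprocessing step is sketched below. It assumes that each precomposed Hangul syllable is one Unicode character after NFC normalization, so that splitting into syllables reduces to iterating over characters. The helper names, the choice to reserve index 0 for padding and unknown symbols, and the vocabulary construction are hypothetical.

```python
import unicodedata

def split_syllables(text):
    """Split text into syllable-level symbols (one per NFC character)."""
    return list(unicodedata.normalize("NFC", text))

def build_vocab(texts):
    """Map every symbol seen in `texts` to an integer id starting at 1."""
    symbols = sorted({s for text in texts for s in split_syllables(text)})
    return {symbol: index for index, symbol in enumerate(symbols, start=1)}

def encode(text, vocab):
    """Convert text to a sequence of integer ids; unseen symbols map to 0."""
    return [vocab.get(s, 0) for s in split_syllables(text)]

def one_hot(index, size):
    """One-hot vector equivalent of a single integer id."""
    return [1 if i == index else 0 for i in range(size)]
```

The integer sequence returned by `encode` is what an embedding lookup would consume; `one_hot` is shown only to make the equivalence between an id and its one-hot vector explicit.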

In one embodiment, the encoder may include a character embedding unit, three convolution layers (3 Conv layers) connected to the character embedding unit, and a bidirectional LSTM (bidirectional long short-term memory) connected to the three convolution layers. The character embedding unit may convert the integer sequence received from the preprocessing unit into matrix form. The convolution layers may condense the matrix-form information. The bidirectional LSTM may then convert the condensed matrix-form information into encoder feature information. Here, the encoder feature information may include a context vector compressed to a single fixed size.

In one embodiment, the attention may be configured to add the attention alignment information to the encoder feature information according to the time step of the decoder, so that the speech output through the vocoder of the TTS model is pronounced in the sequential order of the input text.

In one embodiment, the vocoder may be MelGAN (Mel generative adversarial networks).

In one embodiment, the speech synthesis system may further include: a third reference encoder that encodes the reference audio to generate an emotion embedding; and a further global style token layer that generates, from the emotion embedding, a third style embedding for the emotion information contained in the reference audio. The third style embedding may be included in the integrated style embedding together with the first style embedding and the second style embedding, and input to the attention of the TTS model.

In one embodiment, the third reference encoder may consist of three convolutional neural networks (3 CNN) that receive a mel spectrogram converted from the reference audio by STFT (short-time Fourier transform) (hereinafter the 'reference mel spectrogram'), two residual blocks connected to the three convolutional neural networks, and a four-layer modified AlexNet connected to the two residual blocks.

In one embodiment, the TTS model combined with the third reference encoder and the further global style token layer may be configured to perform style transfer from the reference audio and to adjust the strength of the style transfer, and may be trained using only a reconstruction loss.

A speech synthesis system according to yet another aspect of the present invention for achieving the above technical object may include a computer program stored on a computer-readable recording medium for implementing the speech synthesis method capable of replicating timbre and prosody styles in real time according to any one of the above embodiments.

A speech synthesis system according to yet another aspect of the present invention for achieving the above technical object may include a computer-readable recording medium on which is recorded a program for implementing the speech synthesis method capable of replicating timbre and prosody styles in real time according to any one of the above embodiments.

With the configuration of the speech synthesis system and method described above, a global style token (GST)-based module can be used to replicate the timbre or prosody style of the reference audio and reflect it effectively in the speech synthesis result for the input text.

In addition, according to the present invention, a synthesized speech waveform that reflects various styles of timbre, prosody, and emotion can be generated effectively for the input text on the basis of global style tokens.

In addition, according to the present invention, it is possible to provide a speech synthesis system and method that can replicate various prosodic elements of the reference audio, the timbre of the speaker, or the emotion of the speaker, and add them in real time to the speech synthesis output for the input text.

FIG. 1 is a schematic block diagram of the overall configuration of a speech synthesis system capable of replicating timbre and prosody styles in real time (hereinafter simply the 'speech synthesis system') according to an embodiment of the present invention.
FIG. 2 is a block diagram of a speech synthesis system of a comparative example.
FIG. 3 is a configuration diagram of a prosody module that can be employed in the speech synthesis system of FIG. 1.
FIG. 4 is a detailed configuration diagram of the overall structure that can be employed in the speech synthesis system of FIG. 1.
FIG. 5 is an overall configuration diagram of a speech synthesis system according to another embodiment of the present invention.
FIG. 6 is a block diagram illustrating an emotion module that can be employed in the speech synthesis system of FIG. 5.
FIG. 7 is a detailed configuration diagram of a reference encoder that can be employed in the emotion module of FIG. 6.
FIG. 8 is a schematic configuration diagram of a speech synthesis system according to yet another embodiment of the present invention.
FIG. 9 is a block diagram illustrating the main operations applicable to the speech synthesis system of FIG. 8.

The present invention may be modified in various ways and may have various embodiments, and specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the drawings, similar reference numerals are used for similar components.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly a second component may be named a first component. The term 'and/or' includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

In the embodiments of the present application, 'at least one of A and B' may mean 'at least one of A or B' or 'at least one of combinations of one or more of A and B'. Likewise, in the embodiments of the present application, 'one or more of A and B' may mean 'one or more of A or B' or 'one or more of combinations of one or more of A and B'.

When a component is said to be 'connected' or 'coupled' to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present. In contrast, when a component is said to be 'directly connected' or 'directly coupled' to another component, it should be understood that no intervening components are present.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as 'include' or 'have' are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a schematic block diagram of the overall configuration of a speech synthesis system capable of replicating timbre and prosody styles in real time (hereinafter simply the 'speech synthesis system') according to an embodiment of the present invention.

Referring to FIG. 1, the speech synthesis system includes a prosody module 100, a timbre module 200, a spectrum predictor 600, and a vocoder 700, and can output a raw waveform in which a specific prosody and timbre are applied to the input text by applying an integrated style embedding that is the sum of the first style embedding of the prosody module 100 and the second style embedding of the timbre module 200.
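How the two style embeddings might be merged and attached to the text encoder output can be sketched as follows. The element-wise sum follows the description of the integrated style embedding; appending the style vector to every encoder frame is an assumption borrowed from common GST-style conditioning, since the text here states only that the integrated embedding is supplied to the attention. The function names are hypothetical.

```python
def integrate_styles(*styles):
    """Element-wise sum of style embeddings of equal length."""
    return [sum(values) for values in zip(*styles)]

def condition_encoder_outputs(encoder_outputs, style):
    """Append the same style vector to every encoder frame (assumed conditioning)."""
    return [list(frame) + list(style) for frame in encoder_outputs]

# Usage: a 2-dim prosody style plus a 2-dim timbre style, two encoder frames.
prosody_style = [1.0, 2.0]
timbre_style = [0.5, -1.0]
integrated = integrate_styles(prosody_style, timbre_style)
conditioned = condition_encoder_outputs([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], integrated)
```

Because `integrate_styles` accepts any number of vectors, a further style embedding (for example, one for emotion) could be summed in without changing the conditioning step.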

The prosody module 100 includes an extraction unit 110, a first reference encoder (RE) 120, a first global style token layer (GST1) 130, a second GST layer (GST2) 140, and a third GST layer (GST3) 150.

The extraction unit 110 extracts the fundamental frequency (f0) from the reference audio, which is the input reference waveform. The extraction unit 110 may be referred to as a feature extractor (FE) that extracts prosody-related features from the reference audio. Here, prosody may refer to the pitch of the reference audio at each reference time point, or to changes in pitch and stress.

The first reference encoder 120 encodes the fundamental frequency of the reference audio received from the extraction unit 110 to generate a prosody embedding. The first reference encoder 120 may be configured to filter sliding-window units of a fixed length through three convolutional neural networks with batch normalization (batchNorm), to compute weights through two residual blocks using previously learned information, and then to optimize the weights with an AlexNet modified to have four layers. Here, the prosody embedding is a parameter that describes the prosody of the reference audio, and the prosody embedding generated after training is complete may be the optimal parameter that best describes the reference audio.

The first GST layer (GST1) 130, the second GST layer (GST2) 140, and the third GST layer (GST3) 150 form a hierarchical GST structure. That is, the first to third GST layers 130, 140, 150 are arranged so that an embedding is processed sequentially in the stated order. Each GST layer includes a multi-head attention and a plurality of tokens connected to the multi-head attention. These hierarchical GST layers (hereinafter simply 'hierarchical GST') serve to process the prosody embedding input from the first reference encoder 120 appropriately and effectively.

In the hierarchical GST, the first GST layer 130 measures the similarity between the prosody embedding and each of its tokens and generates a style embedding as a weighted sum of the tokens. That is, the first GST layer 130 can re-express the prosody embedding as a weighted sum of k global style tokens (GSTs), where k is any natural number of 5 or more. In this process, the k GSTs are trained so that their combinations can express various prosody styles. In particular, in this embodiment, the style embedding learned in the first GST layer 130 is learned again in the second GST layer 140 and the third GST layer 150, so that the various expressions of the prosody embedding are learned as effectively as possible in view of processing time and system complexity. The learning result of the hierarchical GST is output as the first style embedding.
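The similarity measurement and weighted sum performed by a single GST layer can be sketched as follows. This is an illustrative simplification, not the patented implementation: it uses one attention head instead of multi-head attention, randomly initialised matrices in place of trained parameters, and a tanh applied to the tokens, all of which are assumptions. The class name `StyleTokenLayer` and the dimension 16 are hypothetical.

```python
import numpy as np

class StyleTokenLayer:
    """One GST layer: attention of an input embedding over k tokens (single head)."""

    def __init__(self, num_tokens, dim, rng):
        # Random stand-ins for parameters that would be learned during training.
        self.tokens = rng.standard_normal((num_tokens, dim))
        self.w_query = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.w_key = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, embedding):
        values = np.tanh(self.tokens)                 # (k, dim), assumed activation
        query = embedding @ self.w_query              # (dim,)
        keys = values @ self.w_key                    # (k, dim)
        scores = keys @ query / np.sqrt(query.size)   # similarity to each token
        scores = scores - scores.max()                # numerical stability
        weights = np.exp(scores) / np.exp(scores).sum()
        return weights @ values, weights              # weighted sum of tokens

rng = np.random.default_rng(0)
layer = StyleTokenLayer(num_tokens=5, dim=16, rng=rng)   # k = 5 tokens
prosody_embedding = rng.standard_normal(16)
style_embedding, token_weights = layer(prosody_embedding)
```

Because the weights are a softmax output, they are non-negative and sum to 1, so the style embedding is a convex combination of the (activated) tokens.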

Moreover, in this embodiment, for more effective learning of the prosody embedding, the hierarchical GST may be configured to have a residual connection for each of a first pair, consisting of the first GST layer 130 and the second GST layer 140, and a second pair, consisting of the second GST layer 140 and the third GST layer 150.

When the three GST layers 130, 140, 150 are used to learn various expressions of the prosody embedding, these residual connections allow the current GST layer, placed relatively later in the data processing flow, to use the learning result of the previous GST layer, placed relatively earlier, and can thereby contribute to maximizing the learning effect for the prosody embedding.

Although this embodiment describes a hierarchical GST of three GST layers as the most preferable form, the present invention is not limited to that configuration; an effect similar to that of this embodiment can of course be obtained when the hierarchical GST is composed of two, four, or five GST layers. Configurations with more than five GST layers may be restricted in view of processing time, system complexity, and the like.

The timbre module 200 includes a second reference encoder (RE) 220 and a fourth GST layer (GST4) 240, and may be configured so that the speaker embedding produced by the second reference encoder 220 is learned in the fourth GST layer 240 and a second style embedding is output.

As the input to the second reference encoder 220, a spectrogram may be used that is produced by cutting the reference audio into sliding-window units of a specific length and applying a Fourier transform. This process of producing the spectrogram may be referred to simply as the short-time Fourier transform (STFT). The spectrogram may be converted to a Mel spectrogram through a log transform using a Mel filter bank.
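The windowing, Fourier transform, Mel filter bank, and log transform described above can be sketched in a few lines. All numeric settings below (1024-sample Hann window, hop of 256, 80 Mel bands, 16 kHz audio, the common 2595·log10(1 + f/700) Mel formula, and the small constant added before the log) are assumptions chosen for illustration; the specification does not give them.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, center, hi = hz_points[i:i + 3]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(audio, sample_rate, n_fft=1024, hop=256, n_mels=80):
    """STFT power spectrogram -> Mel filter bank -> log, shape (frames, n_mels)."""
    window = np.hanning(n_fft)
    starts = range(0, len(audio) - n_fft + 1, hop)
    frames = np.stack([audio[s:s + n_fft] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filter_bank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-6)   # small offset avoids log(0)

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)      # synthetic stand-in for reference audio
spec = log_mel_spectrogram(audio, sr)
```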

The fourth GST layer 240 is used to learn the speaker embedding effectively. The fourth GST layer 240 measures the similarity between the speaker embedding and each of its tokens and generates a style embedding (the second style embedding) as a weighted sum of the tokens. That is, the fourth GST layer 240 can re-express the speaker embedding as a weighted sum of k global style tokens (GSTs), where k is any natural number of 5 or more, and in this process the k GSTs are trained so that their combinations can express various timbre styles.

The spectrum predictor 600 and the vocoder 700 form a text-to-speech (TTS) model 500. In this embodiment, the TTS model 500 receives the integrated style embedding, which combines the first style embedding and the second style embedding, and is configured to reflect the timbre and prosody corresponding to the integrated style embedding when generating the speech synthesis result for the input text.

Here, the spectrum predictor 600 generates a Mel spectrogram for the input text that reflects the integrated style embedding. The vocoder 700 then outputs a raw waveform from the Mel spectrogram. The raw waveform is a synthesized waveform or waveform samples and may have a specific format such as WAV (Waveform Audio File Format), but is not limited thereto.

The spectrum predictor 600 may include an encoder to which text is input, an attention connected to the encoder, and a decoder interconnected with the attention.

The vocoder 700 is a module that generates a waveform signal or speech signal from a spectrogram or Mel spectrogram. A WaveNet vocoder, MelGAN (Mel generative adversarial network), or the like may be used as the vocoder 700. Using MelGAN makes it possible to generate a speech signal that expresses the timbre or prosody of the reference audio better than other vocoders such as the WaveNet vocoder.

FIG. 2 is a block diagram of a speech synthesis system of a comparative example.

Referring to FIG. 2, the speech synthesis system of the comparative example consists of a spectrum predictor 10, a reference encoder 20, and a WaveNet vocoder 90. The spectrum predictor 10 consists of an encoder 30, an attention 50, and a decoder 70.

In contrast to the speech synthesis system of this embodiment described above (see FIG. 1), the speech synthesis system of the comparative example is configured such that, when the reference embedding generated from the reference audio by the reference encoder 20 and the feature information of the encoder 30 are combined and input to the attention 50, the attention 50 and the decoder 70 jointly generate a Mel spectrogram, and the WaveNet vocoder 90 generates, from the Mel spectrogram, the utterance (speech) that is the speech synthesis result.

Compared with the speech synthesis system of the comparative example described above, the speech synthesis system of this embodiment, in order to express the prosody of the reference audio effectively, uses the fundamental frequency of the reference audio as the input to the first reference encoder and inputs the prosody embedding generated by the first reference encoder to the plurality of GST layers of the hierarchical GST (in particular, the first to third GST layers) to learn the prosody.

In addition, in order to express the timbre of the reference audio effectively, the speech synthesis system of this embodiment uses the spectrogram of the reference audio as the input to the second reference encoder and inputs the speaker embedding generated by the second reference encoder to a single GST layer (the fourth GST layer) to learn the timbre.

In addition, the speech synthesis system of this embodiment is configured such that the first style embedding obtained from prosody learning and the second style embedding obtained from timbre learning are combined, and the result is either combined with the encoder output of the spectrum predictor or input to the attention of the spectrum predictor together with the encoder output.

Moreover, as described later, in order to express the emotion of the reference audio effectively, the speech synthesis system of this embodiment uses the spectrogram of the reference audio as the input to a third reference encoder and inputs the emotion embedding generated by the third reference encoder to a single GST layer (a fifth GST layer) to learn that emotion. Here, the third style embedding output from the fifth GST layer may be combined with the first and second style embeddings and passed to the TTS model that includes the spectrum predictor.

In addition, by using MelGAN as the vocoder, the speech synthesis system of this embodiment is configured to generate more effectively a speech synthesis result in which the timbre and prosody styles, or the timbre, prosody, and emotion styles, are replicated.

FIG. 3 is a configuration diagram of a prosody module that can be employed in the speech synthesis system of FIG. 1.

Referring to FIG. 3, the prosody module 100 includes an extraction unit (FE) 110, a reference encoder 120, and a first GST layer (GST1, or GST layer 1) 130 through an N-th GST layer (GST layer N) 160. Here, the reference encoder 120 corresponds to the first reference encoder, and N may be any natural number from 2 to 5.

The prosody module 100 is configured to extract the fundamental frequency (f0) from the input reference waveform, or reference audio, by means of the extraction unit 110; to generate a prosody embedding from the fundamental frequency by means of the first reference encoder 120; and to learn the prosody embedding by means of the hierarchical GST of the first to N-th GST layers 130, 160 and output the first style embedding.

The first GST layer 130 includes a multi-head attention 132 and a plurality of tokens (token 1 to token k) 134 connected to the multi-head attention 132. Here, k may be any natural number of 5 or more. The first GST layer 130 measures the similarity between the prosody embedding and each token (token 1 to token k) and generates a style embedding as a weighted sum of these tokens. The style embedding generated in the first GST layer 130 may be passed as the input to the second GST layer.

Similarly, the N-th GST layer 160 includes a multi-head attention 162 and a plurality of tokens (token 1 to token k) 164 connected to the multi-head attention 162. The N-th GST layer 160 measures the similarity between the style embedding received from the other GST layer located on its input side and each token (token 1 to token k), and generates a style embedding as a weighted sum of these tokens. This style embedding becomes the first style embedding that is finally output from the prosody module 100.

When N is 2, the second GST layer may measure the similarity between the style embedding received from the first GST layer 130 and each of its own plurality of tokens, and generate a style embedding as a weighted sum of these tokens. Here, the plurality of tokens 134 of the first GST layer 130 may be connected to the plurality of tokens of the second GST layer by a residual connection 170, whereby the second GST layer can perform learning in parallel on the basis of the weights of the plurality of tokens 134 of the first GST layer 130.

When N is 3, in addition to the case of N being 2 above, the third GST layer may measure the similarity between the style embedding received from the second GST layer and each of its own plurality of tokens, and generate a style embedding as a weighted sum of these tokens. Here, the plurality of tokens of the second GST layer may be connected to the plurality of tokens of the third GST layer by another residual connection 170, whereby the third GST layer can perform learning effectively on the basis of the weights of the tokens of the second GST layer.
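The stacking of GST layers with residual connections can be sketched as below. The specification describes the residual connection 170 as linking the tokens of adjacent layers but does not give its exact form; this sketch assumes the simplest reading, in which each later layer adds its token-weighted output to the style embedding it received. The function names, the dot-product attention without learned projections, and the sizes (three banks of five 8-dimensional tokens) are hypothetical.

```python
import numpy as np

def token_attention(embedding, tokens):
    """Weighted sum of tokens, weighted by softmax similarity to the embedding."""
    scores = tokens @ embedding / np.sqrt(embedding.size)
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ tokens

def hierarchical_gst(prosody_embedding, token_banks):
    """Pass the embedding through N token banks in order (N = len(token_banks))."""
    style = token_attention(prosody_embedding, token_banks[0])
    for tokens in token_banks[1:]:
        # Assumed residual form: later layer output plus the earlier layer's result.
        style = style + token_attention(style, tokens)
    return style

rng = np.random.default_rng(0)
token_banks = [rng.standard_normal((5, 8)) for _ in range(3)]   # N = 3, k = 5
prosody_embedding = rng.standard_normal(8)
first_style_embedding = hierarchical_gst(prosody_embedding, token_banks)
```

Changing the length of `token_banks` to 2, 4, or 5 gives the alternative layer counts mentioned for N.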

FIG. 4 is a detailed configuration diagram of an overall structure that can be employed in the speech synthesis system of FIG. 1.

Referring to FIG. 4, the speech synthesis system includes a prosody module 100, a timbre module 200, a spectrum predictor 600, and MelGAN 700a as the vocoder. By applying an integrated style embedding that combines the first style embedding of the prosody module 100 with the second style embedding of the timbre module 200, the system can output waveform samples in which a specific prosody and timbre are added to the input text.

The prosody module 100 includes an extraction unit (FE) 110, a first reference encoder 120, a first style token layer (style token layer 1) 130, a second style token layer (style token layer 2) 140, and a third style token layer (style token layer 3) 150. Each style token layer corresponds to a global style token (GST) layer.

In the prosody module 100, the extraction unit 110 extracts the fundamental frequency (f0) from the reference audio, which is the reference waveform. The reference audio may be audio in a predetermined dataset. During training, the reference audio may be the ground truth (GT), that is, the original, target, or true value; during inference, it may be audio containing the desired style. The fundamental frequency is used as the input to the first reference encoder 120.

The first reference encoder 120 generates a prosody embedding from the fundamental frequency. The first reference encoder 120 may be composed of six 2D convolutional neural networks (CNNs) and one gated recurrent unit (GRU). The prosody embedding is the result of passing the fundamental frequency through the first reference encoder; it is an embedding of fixed length and is used as the input to the hierarchical GST.

The hierarchical GST, composed of the three style token layers 130, 140, 150, can learn the prosody embedding to generate the first style embedding. Each style token layer 130, 140, 150 includes a multi-head attention. Once the similarity between the prosody embedding output from the first reference encoder 120 and each style token has been measured through the multi-head attention, the hierarchical GST generates a style embedding as a weighted sum of these style tokens.

Unlike the style embedding of the timbre module 200, the style embedding of the prosody module 100 may be configured such that, when the first-generated style embedding passes through two additional style token layers, a residual connection is additionally used between the first and second style token layers and between the second and third style token layers, so as to maximize the learning effect for the prosody features of the reference audio.

The timbre module 200 includes an STFT unit 210, a second reference encoder 220, and a fourth style token layer (style token layer 4) 240. The STFT unit 210 may be omitted when a spectrogram or Mel spectrogram obtained by separately processing the reference audio is prepared as the input data of the second reference encoder 220. The fourth style token layer 240 corresponds to the fourth GST layer.

In the timbre module 200, the STFT unit 210 provides a Mel spectrogram, obtained by Fourier-transforming the reference audio, as the input to the second reference encoder 220. The second reference encoder 220 receives the Mel spectrogram rather than the fundamental frequency and generates a speaker embedding. The speaker embedding that has passed through the fourth GST layer 240 is output as the second style embedding.

The fourth style token layer 240 can learn the speaker embedding to generate the second style embedding. The fourth style token layer 240 includes a multi-head attention 242 and a plurality of tokens 244. The plurality of tokens 244 may include a first token (token 1) through a fifth token (token 5), and each token may be referred to as a style token.

Once the similarity between the speaker embedding output from the second reference encoder 220 and each style token has been measured through the multi-head attention 242, the fourth style token layer 240 can generate a style embedding (the second style embedding) as a weighted sum of these style tokens 244.

The style embedding that has passed through a total of three style token layers in the prosody module 100 (the first style embedding) and the style embedding of the timbre module 200 (the second style embedding) are concatenated and then combined with the output of the encoder of the spectrum predictor 600 of the speech synthesis system.
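One way the concatenated style embeddings might be combined with the encoder output is sketched below. The specification says only that the two embeddings are concatenated and combined with the encoder output; repeating the integrated embedding at every encoder time step and concatenating along the feature axis is an assumption made here for illustration, as are the function name and the 16-dimensional style embeddings. The 512-dimensional encoder features follow the encoder description in this document.

```python
import numpy as np

def condition_encoder_output(encoder_out, prosody_style, timbre_style):
    """Attach the integrated style embedding to every encoder time step.

    encoder_out:   (T, enc_dim) text encoder features
    prosody_style: (d1,) first style embedding
    timbre_style:  (d2,) second style embedding
    returns:       (T, enc_dim + d1 + d2), the assumed input to the attention
    """
    integrated = np.concatenate([prosody_style, timbre_style])
    tiled = np.tile(integrated, (encoder_out.shape[0], 1))   # repeat along time
    return np.concatenate([encoder_out, tiled], axis=1)

rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((7, 512))    # 7 input symbols, 512-dim features
prosody_style = rng.standard_normal(16)
timbre_style = rng.standard_normal(16)
attention_memory = condition_encoder_output(encoder_out, prosody_style, timbre_style)
```

An additive combination, with the style embeddings projected to the encoder dimension, would be an equally plausible reading of "combined"; the concatenation above simply keeps the text and style features separable.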

In a narrow sense, the speech synthesis system of this embodiment may consist of the spectrum predictor 600 and MelGAN 700a as the vocoder; in a broad sense, it may further include the prosody module 100 and the timbre module 200. Here, the spectrum predictor 600 includes an encoder 610, an attention 620, and a decoder 630.

That is, the spectrum predictor 600 includes the encoder 610, which receives the input text, the attention 620, which is connected to the output of the encoder 610, and the decoder 630, which is interconnected with the attention 620. This spectrum predictor 600 may be implemented with Tacotron2.

In this case, the speech synthesis system is a seq2seq speech synthesis system that synthesizes text into speech in two stages. In the first stage, a Mel spectrogram is generated from the input text; in the second stage, speech is synthesized from the Mel spectrogram using a vocoder.

The spectrum predictor 600 of this embodiment differs from a standard Tacotron2 in that the integrated style embedding, which combines the first and second style embeddings, is combined with the output of the encoder 610 and input to the attention 620.

More specifically, in the spectrum predictor 600, the encoder 610 encodes the input text into 512-dimensional features (hereinafter 'feature information'). The encoder 610 may include a character embedding generation unit 613 to which the text is input, three convolution layers 615 connected to the character embedding generation unit 613, and a bidirectional long short-term memory (LSTM) 617 connected to the three convolution layers 615. The three convolution layers correspond to three convolutional neural network layers and may be referred to as 'encoder convolution layers'.

In addition, the spectrum predictor 600 may further include, at the input of the encoder 610, a preprocessing unit 611 that preprocesses the input text. The preprocessing unit 611 may be configured to split the input text into syllables and to represent the separated syllables as integers through one-hot encoding. The preprocessing unit 611 may be omitted when the input text is processed separately by an independent preprocessing means or by an independent component performing a corresponding function. Alternatively, the preprocessing unit 611 may be configured so that it is not part of the spectrum predictor 600 but is part of the TTS model.
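The syllable splitting and integer/one-hot representation can be illustrated with a short sketch. It assumes that each character of the input string is treated as one symbol (for Hangul text, one precomposed character is one syllable block) and that the vocabulary is built from the text itself; the function names and the sample sentence are hypothetical.

```python
def build_vocab(texts):
    """Map every distinct symbol (here: character, e.g. a Hangul syllable) to an integer id."""
    symbols = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(symbols)}

def encode(text, vocab):
    """Split the text into symbols and return their integer ids."""
    return [vocab[ch] for ch in text]

def one_hot(ids, vocab_size):
    """Expand integer ids into one-hot rows."""
    return [[1 if j == i else 0 for j in range(vocab_size)] for i in ids]

text = "안녕하세요"                      # five syllable blocks
vocab = build_vocab([text])
ids = encode(text, vocab)               # integer sequence passed on to the encoder
rows = one_hot(ids, len(vocab))
```

In practice the integer sequence is usually fed straight into an embedding lookup, which is mathematically the same as multiplying the one-hot rows by an embedding matrix.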

In the encoder 610, the character embedding generation unit 613 converts the integer sequence received from the preprocessing unit 611 into matrix form. The encoder convolution layers 615 condense the matrix-form information. The bidirectional LSTM 617 then converts the condensed matrix-form information into encoder feature information. Here, the encoder feature information may include a context vector compressed to a single fixed size.

The attention 620 adds attention alignment information (AAI) to the encoder feature information according to the time steps of the decoder 630, so that the speech output through MelGAN 700a is pronounced in the sequential order of the input text. In other words, the attention 620 may refer to the process of extracting from the encoder 610, and aligning or mapping, the information that the decoder 630 will use at each time point. More specifically, the attention 620 may be configured to extract from the encoder 610 and allocate the information on which the decoder 630 should focus at each given time point, with the information used at each time point drawing on the attention alignment information of the previous time point. With this configuration, the alignment graph of the attention 620 can continue upward to the right while the TTS model is being trained. The attention 620 may be implemented using location-sensitive attention.

As an example, in location-sensitive attention the attention score is calculated by applying the hyperbolic tangent (tanh) to a weighted sum of the attention of the previous time point and the encoder information, and then multiplying by a weight. Here, the attention is the attention score normalized to the range 0 to 1 by applying softmax, and corresponds to information that determines how much of the encoder information to allocate. The result of attention extraction is referred to as the context and corresponds to the sum of the products of the attention and the encoder information up to that time point.
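The score, softmax, and context computation can be sketched as follows. This follows the commonly published form of location-sensitive attention, e_j = v · tanh(W s + V h_j + U f_j + b), where f_j are features obtained by convolving the previous alignment with a set of filters; that specific form, the inclusion of the decoder state s as a query, and all parameter names and sizes are assumptions, since the specification describes the computation only in words. Parameters are random stand-ins for trained values.

```python
import numpy as np

def location_sensitive_attention(query, memory, prev_align, params):
    """One attention step.

    query:      (dec_dim,) decoder state at the current time step
    memory:     (T, enc_dim) encoder information
    prev_align: (T,) attention weights of the previous time step
    """
    # Location features: previous alignment convolved with each filter -> (T, n_filters)
    loc = np.stack([np.convolve(prev_align, k, mode="same") for k in params["filters"]], axis=1)
    hidden = np.tanh(query @ params["W"] + memory @ params["V"] + loc @ params["U"] + params["b"])
    scores = hidden @ params["v"]                      # (T,) attention scores
    scores = scores - scores.max()
    align = np.exp(scores) / np.exp(scores).sum()      # softmax -> values in [0, 1]
    context = align @ memory                           # weighted sum of encoder information
    return context, align

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim, n_filters = 6, 8, 4, 5, 2
params = {
    "W": rng.standard_normal((dec_dim, att_dim)),
    "V": rng.standard_normal((enc_dim, att_dim)),
    "U": rng.standard_normal((n_filters, att_dim)),
    "b": np.zeros(att_dim),
    "v": rng.standard_normal(att_dim),
    "filters": rng.standard_normal((n_filters, 3)),
}
memory = rng.standard_normal((T, enc_dim))
prev_align = np.full(T, 1.0 / T)                       # uniform alignment at the first step
context, align = location_sensitive_attention(rng.standard_normal(dec_dim), memory, prev_align, params)
```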

The decoder 630 generates the Mel spectrogram of the next time step using the alignment feature, which is the attention information obtained through the attention 620, and the Mel spectrogram generated at the previous time step. The generated Mel spectrogram carries the timbre and prosody information of the integrated style embedding. The previous time step may correspond to the previous time point, and the next time step to the next or current time point.

The decoder 630 includes a pre-net 631, two decoder LSTMs 632, a first fully connected layer 633, a second fully connected layer 634, and a post-net 635. The pre-net 631 may be a two-layer pre-net, each decoder LSTM 632 may be a unidirectional long short-term memory (LSTM), each fully connected layer may be a linear projection, and the post-net 635 may be a post-net consisting of five convolution layers.

In the decoder 630, the pre-net 631 extracts information from the Mel vector of the previous time point. The decoder LSTMs 632 extract the information for the current time point using the results of the attention 620 and the pre-net 631. The first fully connected layer 633 generates the Mel spectrogram from the information for the current time point. The second fully connected layer 634 is combined with a sigmoid; based on the information coming from the decoder LSTMs 632, it calculates the termination probability at the current time point and outputs a stop token. The termination probability is calculated in the range 0 to 1. The post-net 635 improves the quality of the Mel spectrogram through five one-dimensional (1D) convolution layers with batch normalization. The post-net 635 may be applied after the decoder 630 has generated the entire Mel spectrogram, and it can smooth and correct the generated Mel spectrogram using a residual connection.
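The stop-token logic can be illustrated with a short autoregressive loop. This is a minimal sketch under stated assumptions: `step_fn` is a hypothetical stand-in for the pre-net, decoder LSTMs, and both fully connected layers, and the 0.5 threshold and `max_steps` limit are illustrative values that the text does not specify.

```python
import math

def sigmoid(x):
    """Map a raw stop logit to a termination probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode(step_fn, first_frame, max_steps=1000, threshold=0.5):
    """Generate Mel frames one at a time until the stop token fires.

    step_fn(prev_frame) -> (next_frame, stop_logit) is a hypothetical
    callable standing in for the decoder network.
    """
    frames = []
    prev = first_frame
    for _ in range(max_steps):
        frame, stop_logit = step_fn(prev)
        frames.append(frame)
        # Stop once the termination probability reaches the threshold.
        if sigmoid(stop_logit) >= threshold:
            break
        prev = frame  # feed the generated frame back in
    return frames
```

In a full system the collected frames would then pass through the post-net before being handed to the vocoder.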

In the spectrum predictor 600 described above, the total loss may be calculated as the sum of the mean squared error (MSE) and the binary cross entropy (BCE). The MSE represents the difference between the original Mel spectrogram and the estimated Mel spectrogram, and the BCE represents the difference between the actual termination probability and the estimated termination probability. The original Mel spectrogram may correspond to the target Mel spectrogram.
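The total loss can be written out as a small worked example. This sketch assumes an unweighted sum and per-element averaging, neither of which is stated in the text; the function names and the `eps` clamp are introduced only for illustration.

```python
import math

def mse(target, predicted):
    """Mean squared error over two equal-length lists of floats."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)

def bce(target, predicted, eps=1e-7):
    """Binary cross entropy; predictions are clamped away from 0 and 1."""
    total = 0.0
    for t, p in zip(target, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(target)

def total_loss(target_mel, predicted_mel, target_stop, predicted_stop):
    """MSE on the Mel frames plus BCE on the per-frame stop probabilities."""
    flat_target = [x for frame in target_mel for x in frame]
    flat_predicted = [x for frame in predicted_mel for x in frame]
    return mse(flat_target, flat_predicted) + bce(target_stop, predicted_stop)
```

Here `target_stop` holds 0 for every frame except the last, which holds 1, matching the role of the stop token.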

MelGAN 700a receives a Mel spectrogram and generates a waveform signal or audio signal. MelGAN 700a is a GAN-based vocoder consisting of a generator and a discriminator, and the model is built from multiple one-dimensional convolution (conv1d) layers.

The generator of MelGAN 700a is built from several conv1d layers and residual stacks. Each residual stack contains further conv1d layers and is configured to pass through three residual connections. The generator takes a batch of Mel spectrograms as input, so the input tensor may have the shape [batch size, 80, frame length]. Here, 80 is the number of Mel spectrogram dimensions, and the frame length is a user-defined variable that sets how long a segment is fed in. The output is an audio signal of shape [batch size, 1, frame length × hop size].
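The shape relationship between the generator input and output can be checked with a few lines of arithmetic. The upsampling factors below (8, 8, 2, 2, giving a hop size of 256) are an assumption borrowed from the publicly described MelGAN design; the text does not state the hop size used here.

```python
def generator_output_shape(input_shape, upsample_factors=(8, 8, 2, 2)):
    """Return the audio tensor shape for a [batch, 80, frames] Mel input.

    upsample_factors is an assumed setting; the product of the factors
    is the hop size, i.e. audio samples produced per Mel frame.
    """
    batch_size, n_mels, n_frames = input_shape
    if n_mels != 80:
        raise ValueError("expected 80 Mel channels")
    hop_size = 1
    for factor in upsample_factors:
        hop_size *= factor
    return (batch_size, 1, n_frames * hop_size)
```

For instance, a batch of 4 segments of 100 frames each would yield an output of shape (4, 1, 25600) under these assumed factors.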

The discriminator of MelGAN 700a has a multi-scale structure with three scales. Each scale produces seven outputs in total: six feature maps and one final output. Each discriminator block may consist of seven conv1d layers.

The last output of MelGAN 700a is essentially the discriminator output, and the remaining outputs may be used for the loss.
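One way the intermediate outputs can serve as a loss is feature matching, as in the publicly described MelGAN training recipe. The sketch below assumes that interpretation, which the text does not spell out, and represents each feature map as a flat list of floats for simplicity; the function names are hypothetical.

```python
def split_outputs(scale_outputs):
    """Split one scale's 7 outputs into 6 feature maps and the final score."""
    *feature_maps, score = scale_outputs
    return feature_maps, score

def feature_matching_loss(real_scales, fake_scales):
    """Mean absolute difference between real and generated feature maps,
    averaged over every feature map of every discriminator scale."""
    total, count = 0.0, 0
    for real_outputs, fake_outputs in zip(real_scales, fake_scales):
        real_maps, _ = split_outputs(real_outputs)
        fake_maps, _ = split_outputs(fake_outputs)
        for real_map, fake_map in zip(real_maps, fake_maps):
            diff = sum(abs(r - f) for r, f in zip(real_map, fake_map))
            total += diff / len(real_map)
            count += 1
    return total / count
```

With three scales and six feature maps per scale, the average runs over 18 feature maps, while the three final scores feed the adversarial loss.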

FIG. 5 is an overall configuration diagram of a speech synthesis system according to another embodiment of the present invention. FIG. 6 is a block diagram illustrating an emotion module that can be employed in the speech synthesis system of FIG. 5. FIG. 7 is a detailed configuration diagram of a reference encoder that can be employed in the emotion module of FIG. 6.

Referring to FIG. 5, the speech synthesis system includes a prosody module 100, a timbre module 200, an emotion module 300, a spectrum predictor 600, and MelGAN 700a. By applying an integrated style embedding that combines the first style embedding of the prosody module 100, the second style embedding of the timbre module 200, and the third style embedding of the emotion module 300, the system can output waveform samples in which a specific prosody, timbre, and emotion are added to the input text. The spectrum predictor 600 and MelGAN 700a constitute a text-to-speech (TTS) model 500.

The prosody module 100, timbre module 200, spectrum predictor 600, and MelGAN 700a are substantially the same as the corresponding components of the speech synthesis system described above with reference to FIG. 4, so their detailed description is omitted.

The emotion module 300 includes an STFT unit 310, a third reference encoder 320, and a fifth style token layer (style token 5) 350. The STFT unit 310 may be omitted when a spectrogram or Mel spectrogram obtained by separately processing the reference audio is prepared as the input data of the third reference encoder 320. The fifth style token layer 350 corresponds to the fifth GST layer.

In the emotion module 300, the STFT unit 310 may provide a spectrogram or Mel spectrogram, obtained by Fourier-transforming the reference audio, as the input of the third reference encoder 320. The third reference encoder 320 receives the spectrogram or Mel spectrogram, rather than the fundamental frequency, and generates an emotion embedding.

The emotion embedding that passes through the fifth GST layer 350 is output as the third style embedding. That is, the fifth style token layer 350 can learn the emotion embedding and generate the third style embedding. The fifth style token layer 350 includes a multi-head attention 352 and a plurality of tokens 354. The plurality of tokens 354 may include a first token (token 1) through a k-th token (token k), and each token may be referred to as a style token. For effective learning of the emotion embedding, k is preferably 7. That is, the fifth GST layer 350 of this embodiment may have seven style tokens corresponding to seven categories of emotional style.

The seven emotions included in the emotional style are anger, disappointment, fear, surprise, sadness, neutral, and happiness; they are summarized together with their mean opinion score (MOS) values in Table 1 below.

As shown in FIG. 6, when the multi-head attention 352 measures the similarity between the emotion embedding output from the third reference encoder 320 and each of the seven style tokens (see 360), the fifth style token layer 350 described above can generate a style embedding (the third style embedding) as a weighted sum of these style tokens 354.
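The similarity measurement and weighted sum can be sketched as follows. This is a simplified illustration that uses a single scaled dot-product attention head, whereas the text describes multi-head attention; the learned projections are omitted and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize similarity scores to weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def style_token_layer(embedding, tokens):
    """Return (weights, style_embedding) for one reference embedding.

    embedding: list of D floats (for example, the emotion embedding)
    tokens:    list of K style tokens, each a list of D floats
    """
    dim = len(embedding)
    # Similarity between the embedding and each token (scaled dot product).
    scores = [
        sum(e * t for e, t in zip(embedding, token)) / math.sqrt(dim)
        for token in tokens
    ]
    weights = softmax(scores)
    # Style embedding: weighted sum of the style tokens.
    style = [
        sum(w * token[d] for w, token in zip(weights, tokens))
        for d in range(dim)
    ]
    return weights, style
```

With seven tokens, the returned weights indicate how strongly each of the seven emotion styles contributes to the third style embedding.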

The third style embedding generated by the emotion module 300 is concatenated with the first style embedding, which has passed through a total of three style token layers in the prosody module 100, and with the second style embedding of the timbre module 200. The result may be combined with the output of the encoder 610 of the spectrum predictor 600 of the speech synthesis system and input to the attention 620 of the spectrum predictor 600.
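The merging of the style embeddings can be illustrated with plain lists. The text does not say how the merged embedding is combined with the encoder output; the sketch below assumes it is appended to every encoder time step, which is one common choice, and both function names are hypothetical.

```python
def integrate_styles(prosody, timbre, emotion=None):
    """Concatenate the first, second, and optional third style embeddings."""
    merged = list(prosody) + list(timbre)
    if emotion is not None:
        merged += list(emotion)
    return merged

def condition_encoder_output(encoder_output, style):
    """Append the integrated style embedding to each encoder time step
    (an assumed way of combining; the text leaves this open)."""
    return [list(frame) + list(style) for frame in encoder_output]
```

The conditioned sequence is what the attention would then read from, so every decoder step sees both the text features and the style.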

Meanwhile, describing the third reference encoder 320 in more detail with reference to FIG. 7: to convert an input reference spectrogram and output an emotion embedding, the third reference encoder 320 may consist of three convolutional neural networks (3 CNN) 322, two residual blocks 324 connected to the three convolutional neural networks 322, and a modified AlexNet 326 connected to the two residual blocks 324. The reference spectrogram is the spectrogram obtained by Fourier-transforming the input reference audio. The reference spectrogram may be replaced with a reference Mel spectrogram.

The TTS model coupled to the emotion module that includes the third reference encoder 320 is configured to perform style transfer and to adjust the intensity of the style transfer, so that the emotional expression of the reference audio is applied to speech synthesis, and it can be trained using reconstruction loss alone.

FIG. 8 is a schematic configuration diagram of a speech synthesis system according to yet another embodiment of the present invention, and FIG. 9 is a block diagram illustrating the main operations applicable to the speech synthesis system of FIG. 8.

Referring to FIG. 8, the speech synthesis system 1000 may include a processor 1100 and a memory 1200. The speech synthesis system 1000 may further include a transceiver 1300, a storage device 1400, or an input interface device 1500 and an output interface device 1600.

In the speech synthesis system 1000, the processor 1100 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed.

The memory 1200 or the storage device 1400 may store at least one instruction executed by the processor 1100. The at least one instruction may include a first instruction for implementing or operating the prosody module, a second instruction for implementing or operating the timbre module, a third instruction for implementing or operating the emotion module, and a fifth instruction for implementing or operating the TTS model.

In addition, the at least one instruction may include: an instruction to extract the fundamental frequency from input reference audio; an instruction to pass the fundamental frequency to the input of a reference encoder; an instruction to encode the fundamental frequency to generate a prosody embedding; an instruction to generate a first style embedding from the prosody embedding; an instruction to convert the reference audio into a reference Mel spectrogram by Fourier transform; an instruction to encode the reference Mel spectrogram to generate a speaker embedding; an instruction to generate a second style embedding from the speaker embedding; an instruction to combine the integrated style embedding, which merges the first style embedding and the second style embedding, with the output of the encoder of the text-to-speech (TTS) model and input it to the attention; an instruction to produce a Mel spectrogram through the attention and decoder of the TTS model; and an instruction to generate, through the vocoder of the TTS model, synthesized speech audio for the input text in which the timbre and prosody of the integrated style embedding are combined.

Each of the memory 1200 and the storage device 1400 described above may consist of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory 1200 or the storage device 1400 may consist of at least one of read-only memory (ROM) and random access memory (RAM).

The transceiver 1300 may include at least one sub-communication system that supports communication with an external device through a wireless network, a wired network, a satellite network, or a combination of these.

The input interface device 1500 may include at least one input means selected from a keyboard, a microphone, a touchpad, a touchscreen, and the like, and an input signal processor that maps a signal entered through the at least one input means to a pre-stored instruction or otherwise processes it.

The output interface device 1600 may include an output signal processor that maps or processes a signal output under the control of the processor 1100 into a pre-stored signal form or level, and at least one output means that outputs a signal or information in the form of vibration, light, or the like according to the signal from the output signal processor. The output signal processor may generate the network topology of a wiring system or a voltage state estimation result in the form of an image, speech, or a combination of these. The at least one output means may include a speaker, a display device, a printer, an optical output device, a vibration output device, and the like.

The speech synthesis system 1000 described above may be integrated into or mounted on, for example, a desktop computer, a laptop computer, a notebook, a smartphone, a tablet PC, a mobile phone, a smart watch, smart glasses, an e-book reader, a portable multimedia player (PMP), a portable game console, a navigation device, a digital camera, a digital multimedia broadcasting (DMB) player, a digital audio recorder, a digital audio player, a digital video recorder, a digital video player, a personal digital assistant (PDA), a network server, a web server, a mail server, or a service server for a specific service.

Meanwhile, as shown in FIG. 9, the speech synthesis system 1000 may further include a first module 1700 for learning management and a second module 1800 for a style-replicating TTS service.

The first module 1700 may be configured to use, as ground truth (GT) reference audio, a first audio waveform having the desired timbre and prosody or a second audio waveform having the desired timbre, prosody, and emotion, in order to train the hierarchical GST of the prosody module and the fourth GST layer of the timbre module, or to additionally train the fifth GST layer of the emotion module.

In addition, the speech synthesis system 1000 may be configured to provide, through the second module 1800, a speech synthesis (TTS) service that replicates the timbre and prosody style of the reference audio, or one that replicates the timbre, prosody, and emotion style of the reference audio.

The first module 1700 and/or the second module 1800 may be implemented as program instructions or software modules, stored in the memory 1200 or the storage device 1400, and executed by the processor 1100. Depending on the implementation, at least part of the first module 1700 and/or the second module 1800 may include a hardware configuration for at least part of the corresponding function.

As described above, the speech synthesis system 1000 includes a prosody module and a timbre module together with the TTS model, and optionally an emotion module as well. Because each of these modules is based on global style tokens, the system has the advantage that, given only reference audio containing the desired style of timbre, prosody, emotion, or a combination of these, it can synthesize speech in the style of that reference audio.

The operations of the method according to an embodiment of the present invention can be implemented as a computer-readable program or code on a computer-readable recording medium. Computer-readable recording media include all types of recording devices that store data readable by a computer system. A computer-readable recording medium may also be distributed across network-connected computer systems, so that the computer-readable program or code is stored and executed in a distributed manner.

In addition, computer-readable recording media may include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Program instructions may include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like.

Although some aspects of the present invention have been described in the context of an apparatus, they also represent a description of the corresponding method, where a block or apparatus corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method may also be represented as a corresponding block, item, or feature of a corresponding apparatus. Some or all of the method steps may be performed by (or using) a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such a device.

In embodiments, a programmable logic device, for example a field-programmable gate array, may be used to perform some or all of the functions of the methods described herein. In embodiments, the field-programmable gate array may operate together with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by some hardware device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

Claims

A speech synthesis method capable of replicating timbre and prosody styles in real time, comprising:
Extracting the fundamental frequency from the input reference audio and transmitting it to the input of the reference encoder;
generating a prosodic embedding by encoding the fundamental frequency by the reference encoder;
generating a first style embedding from the prosodic embedding;
converting the reference audio into a reference Mel spectrogram by Fourier transform;
Generating a speaker embedding by encoding the reference Mel spectrogram;
generating a second style embedding from the speaker embedding;
Inputting an integrated style embedding that combines the first style embedding and the second style embedding into attention of a text to speech (TTS) model; and
Generating audio in which tones and prosody are synthesized by the integrated style embedding for the input text using the TTS model,
wherein the step of generating the first style embedding generates the first style embedding for the prosody of the reference audio using a three-layer hierarchical GST consisting of a first global style token (GST) layer to a third global style token layer.

In claim 1,
In the step of extracting the fundamental frequency, the reference audio is cut into sliding windows of a certain length, and a pitch contour is calculated through frequency decomposition of the reference audio using a normalized cross-correlation function, where the pitch corresponds to how high or low the sound is.

(Deleted)

In claim 1,
Each of the first to third GST layers has a multi-head attention that passes the input embedding to a plurality of tokens of that GST layer, and a plurality of tokens connected to the multi-head attention; each GST layer measures the similarity between the input embedding and each token and generates a style embedding as a weighted sum of the tokens, and
The first to third GST layers include a residual connection connecting the plurality of tokens of a current GST layer to the plurality of tokens of a previous GST layer.

In claim 1,
In the step of generating the second style embedding, the second style embedding for the tone of the reference audio is generated using a fourth GST layer.

In claim 1,
The step of generating the audio is,
Extracting feature information from text input to the encoder by the encoder of the TTS model;
Using the attention, mapping the feature information into attention alignment information to be used by a decoder of the TTS model at every time and transmitting the mapped attention alignment information and the integrated style embedding to the decoder; and
Generating, by the decoder, a current Mel spectrogram in which the integrated style embedding is reflected, using the attention information of the attention and a previous Mel spectrogram.

In claim 6,
The step of generating the audio further includes generating audio from the current Mel spectrogram by MelGAN (Mel generative adversarial networks).

In claim 1,
The speech synthesis method further comprises comparing a target Mel spectrogram corresponding to the reference Mel spectrogram with a predicted Mel spectrogram that the TTS model generates for the input text and the reference audio, and calculating the difference between the target Mel spectrogram and the predicted Mel spectrogram or a loss corresponding to the difference; and
Training the TTS model until the difference or loss falls below a preset level or reference value.

A speech synthesis system capable of replicating timbre and prosody styles in real time, comprising:
An extraction unit that extracts the fundamental frequency from the input reference audio;
A first reference encoder that receives the fundamental frequency and encodes the fundamental frequency to generate a prosody embedding;
hierarchical global style token layers that generate a first style embedding for a prosody of the reference audio from the prosody embeddings;
a conversion unit that converts the reference audio into a reference Mel spectrogram by Fourier transform;
a second reference encoder that encodes the reference Mel spectrogram to generate speaker embedding;
a single global style token layer that generates a second style embedding for the timbre of the reference audio from the speaker embedding; and
a text-to-speech (TTS) model that receives, at its attention, an integrated style embedding that combines the first style embedding and the second style embedding, and generates audio in which the timbre and prosody of the integrated style embedding are synthesized for the input text,
wherein the hierarchical global style token layers are a three-layer hierarchical GST consisting of a first global style token (GST) layer to a third global style token layer.

In claim 9,
The extraction unit cuts the reference audio into sliding windows of a certain length, distinguishes valid frames using a normalized cross-correlation function, and calculates a pitch contour through frequency decomposition of the reference audio within the valid frames, where the pitch corresponds to how high or low the reference audio sounds.

(Deleted)

In claim 9,
Each GST layer of the hierarchical global style token layers has a multi-head attention and a plurality of tokens connected to the multi-head attention,
The first GST layer measures the similarity between the prosody embedding and each token and generates a style embedding as a weighted sum of the plurality of tokens,
The style embedding generated in the first GST layer sequentially passes through the second GST layer and the third GST layer and is output as a first style embedding.

In claim 9,
The hierarchical global style token layers have residual connections, in a first pair consisting of the first GST layer and the second GST layer and in a second pair consisting of the second GST layer and the third GST layer, that connect the tokens of the current layer to the tokens of the previous layer.

In claim 9,
The target Mel spectrogram corresponding to the reference Mel spectrogram and the predicted Mel spectrogram for the input text and the reference audio generated from the TTS model are compared to produce the target Mel spectrogram and the predicted Mel spectrogram. It further includes a learning management unit that calculates the difference or a loss corresponding to the difference,
The learning management unit trains the TTS model until the difference or loss is below a preset level or reference value.
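
The stopping rule in this claim can be illustrated with a toy sketch. The claim does not name a loss function, so mean absolute error is an assumption here, and the update rule is a stand-in that simply moves the prediction toward the target; a real system would update TTS model parameters with an optimizer. The names `mel_loss` and `train_until_threshold` are hypothetical.

```python
import numpy as np

def mel_loss(predicted, target):
    """Mean absolute difference between two mel spectrograms (assumed loss)."""
    return float(np.mean(np.abs(predicted - target)))

def train_until_threshold(target, threshold=0.01, step_size=0.5, max_steps=1000):
    """Repeat updates until the loss falls below the preset reference value."""
    predicted = np.zeros_like(target)      # stand-in for the model's output
    loss = mel_loss(predicted, target)
    steps = 0
    while loss >= threshold and steps < max_steps:
        # Stand-in for one optimizer step on the TTS model.
        predicted = predicted + step_size * (target - predicted)
        loss = mel_loss(predicted, target)
        steps += 1
    return steps, loss

# Toy target: 80 mel bins by 5 frames.
steps, final_loss = train_until_threshold(np.ones((80, 5)))
```

The `max_steps` guard is an addition for safety in the sketch; the claim itself only specifies training until the difference or loss is below the preset level.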

In claim 9,
The TTS model includes a spectrogram predictor and a vocoder,
The spectrogram predictor generates a mel spectrogram based on the input text and the integrated style embedding,
A voice synthesis system wherein the vocoder generates a waveform sample corresponding to a synthesized waveform from the Mel spectrogram.

In claim 15,
The spectrogram predictor includes an encoder, attention, and decoder,
The encoder extracts feature information from the input text,
The attention maps the feature information to attention alignment information to be used by the decoder at every time step and transmits the mapped attention alignment information and the integrated style embedding to the decoder,
The decoder generates a current Mel spectrogram in which the integrated style embedding is synthesized, using the attention alignment information from the attention and a previous Mel spectrogram.

In claim 16,
The spectrogram predictor further includes a preprocessor located at the input end of the encoder, wherein the preprocessor separates the input text into syllable units and expresses the separated syllables as integers through one-hot encoding.
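
A minimal sketch of such a preprocessor is shown below. It assumes Hangul input in which every character is one precomposed syllable block, and it builds the symbol table from the input itself; a real system would use a fixed symbol inventory and handle spaces, digits, and punctuation. The function names are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct syllable (character) seen."""
    symbols = sorted({ch for text in texts for ch in text})
    return {ch: idx for idx, ch in enumerate(symbols)}

def to_ids(text, vocab):
    """Split the text into syllable units and map each one to its integer id."""
    return [vocab[ch] for ch in text]

def to_one_hot(ids, vocab_size):
    """Expand each integer id into a one-hot row vector."""
    return [[1 if col == idx else 0 for col in range(vocab_size)] for idx in ids]

text = "안녕하세요"
vocab = build_vocab([text])
ids = to_ids(text, vocab)                 # integer sequence passed on to the encoder
one_hot = to_one_hot(ids, len(vocab))     # equivalent one-hot representation
```

The integer sequence and the one-hot matrix carry the same information: each row of `one_hot` has a single 1 at the position given by the corresponding id.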

In claim 17,
The encoder includes a character embedding generation unit, three convolutional layers connected to the character embedding generation unit, and a bidirectional long short-term memory (LSTM) connected to the three convolutional layers,
The character embedding generation unit converts the integer sequence received from the preprocessor into a matrix form,
The three convolutional layers condense information in the form of a matrix,
The bidirectional LSTM converts information in the form of a reduced matrix into encoder feature information, where the encoder feature information includes a context vector compressed into one fixed size.

In claim 16,
The attention adds the attention alignment information to the encoder feature information at each time step of the decoder so that the speech output through the vocoder of the TTS model is pronounced in the sequential order of the input text.

In claim 15,
A voice synthesis system wherein the vocoder is MelGAN (Mel generative adversarial network).