KR20230075340A

KR20230075340A - Voice synthesis system and method capable of duplicating tone and prosody styles in real time

Info

Publication number: KR20230075340A
Application number: KR1020220079679A
Authority: KR
Inventors: 이근배; 전예진
Original assignee: 포항공과대학교 산학협력단
Priority date: 2021-11-22
Filing date: 2022-06-29
Publication date: 2023-05-31
Also published as: KR102639322B1

Abstract

Disclosed are a voice synthesis system and method capable of duplicating tone and prosody styles in real time, which can rapidly and accurately synthesize voices according to various prosody and tone styles of a speaker by using deep learning-based end-to-end voice synthesis technology with global style tokens. The voice synthesis method comprises the steps of: extracting a fundamental frequency from an input reference audio and passing it as an input to a reference encoder; encoding the fundamental frequency by the reference encoder to generate a prosody embedding; generating a first style embedding from the prosody embedding; converting the reference audio into a reference mel spectrogram by Fourier transform; encoding the reference mel spectrogram to generate a speaker embedding; generating a second style embedding from the speaker embedding; inputting, into the attention of a text-to-speech (TTS) model, an integrated style embedding that sums the first style embedding and the second style embedding, along with an output of an encoder of the TTS model; and generating, by the TTS model, synthesized audio for an input text in which the tone and prosody are combined.

Description

Voice synthesis system and method capable of replicating timbre and prosody styles in real time

The present invention relates to a voice synthesis system that quickly and accurately synthesizes speech in a variety of prosody and speaker timbre styles, and more particularly to a voice synthesis system and method capable of replicating timbre and prosody styles in real time using deep learning-based end-to-end voice synthesis technology with global style tokens (GST).

The ultimate goal of text-to-speech (TTS) synthesis is to read input text aloud in an accurate and natural voice, and the technology is applicable to any field in which a person would otherwise speak. Typical applications include navigation, audiobooks, and artificial intelligence assistants.

In addition, the quality of synthesized speech has improved greatly with rapidly advancing deep learning technology, and the introduction of end-to-end voice synthesis models has made model training relatively simple, without requiring specialized knowledge or additional work.

Despite these advances, current deep learning-based end-to-end voice synthesis systems cannot faithfully reproduce the prosodic information commonly heard in everyday conversation, such as emotion, intonation, and tone of voice, or the timbre of the speaker, so further research is required.

The present invention was devised to meet the above needs of the prior art, and an object of the present invention is to provide a voice synthesis system and method capable of replicating timbre and prosody styles in real time, which can replicate various prosodic elements and the timbre of a speaker in real time.

Another object of the present invention is to provide a voice synthesis system and method that can replicate and synthesize timbre and prosody styles in real time using a global style token (GST) based module.

Yet another object of the present invention is to provide a voice synthesis system and method capable of emotion control together with real-time timbre and prosody style replication.

A voice synthesis method according to one aspect of the present invention for achieving the above technical object is a voice synthesis method capable of replicating timbre and prosody styles in real time, comprising the steps of: extracting a fundamental frequency from an input reference audio (reference waveform) and passing it as an input to a reference encoder; encoding, by the reference encoder, the fundamental frequency to generate a prosody embedding; generating a first style embedding from the prosody embedding; converting the reference audio into a reference mel spectrogram by Fourier transform; encoding the reference mel spectrogram to generate a speaker embedding; generating a second style embedding from the speaker embedding; inputting an integrated style embedding, obtained by summing the first style embedding and the second style embedding, into the attention of a text-to-speech (TTS) model; and generating, by the TTS model, audio for an input text in which the tone and prosody given by the integrated style embedding are synthesized.
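The step of summing the two style embeddings and supplying the result together with the encoder output can be illustrated with a minimal sketch. The function name `condition_encoder_output`, the array shapes, and the choice to append the summed embedding to every encoder time step by concatenation are assumptions made for illustration; the text only states that the two embeddings are summed and input to the attention along with the encoder output.

```python
import numpy as np

def condition_encoder_output(encoder_out, prosody_style, timbre_style):
    """Sum the two style embeddings and append the result to every encoder time step.

    encoder_out: (T, d_enc); prosody_style, timbre_style: (d_style,).
    Concatenation along the feature axis is an assumed way of combining them.
    """
    unified = prosody_style + timbre_style                 # integrated style embedding
    tiled = np.tile(unified, (encoder_out.shape[0], 1))    # repeat for each time step: (T, d_style)
    return np.concatenate([encoder_out, tiled], axis=1)    # (T, d_enc + d_style)
```

Because the two embeddings are summed rather than concatenated with each other, they must share the same dimensionality.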

In one embodiment, the step of extracting the fundamental frequency may include a process of cutting the reference audio into sliding-window units of a fixed length and computing a pitch contour through frequency decomposition of the reference audio using a normalized cross-correlation function. Here, pitch corresponds to how high or low the sound of the reference audio is.
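As a rough illustration of a sliding-window pitch contour based on a normalized cross-correlation function, the sketch below scores candidate periods in the time domain and keeps the best one per window. The function name `nccf_pitch_contour`, the window and hop lengths, the search range, and the voicing threshold are all assumptions, and the time-domain search stands in for whatever frequency decomposition the actual system performs.

```python
import numpy as np

def nccf_pitch_contour(signal, sample_rate, frame_len=0.04, hop_len=0.01,
                       f_min=60.0, f_max=400.0, voicing_threshold=0.5):
    """Estimate one F0 value per sliding window using normalized cross-correlation.

    Returns an array of F0 values in Hz; 0.0 marks frames judged unvoiced.
    All default parameter values are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    n = int(frame_len * sample_rate)       # samples per analysis window
    hop = int(hop_len * sample_rate)       # samples between window starts
    lag_min = int(sample_rate / f_max)     # shortest period considered
    lag_max = int(sample_rate / f_min)     # longest period considered
    contour = []
    for start in range(0, len(signal) - n - lag_max + 1, hop):
        frame = signal[start:start + n]
        best_lag, best_score = 0, 0.0
        for lag in range(lag_min, lag_max + 1):
            shifted = signal[start + lag:start + lag + n]
            denom = np.sqrt(np.dot(frame, frame) * np.dot(shifted, shifted))
            score = np.dot(frame, shifted) / denom if denom > 0 else 0.0
            if score > best_score:         # keep the lag with the highest correlation
                best_lag, best_score = lag, score
        voiced = best_score >= voicing_threshold
        contour.append(sample_rate / best_lag if voiced else 0.0)
    return np.array(contour)
```

The resolution of the estimate is limited to whole-sample lags; a practical extractor would usually interpolate around the best lag.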

In one embodiment, the step of generating the first style embedding may be configured to generate the first style embedding for the prosody of the reference audio using a three-layer hierarchical GST consisting of a first global style token (GST) layer through a third global style token layer.

In one embodiment, each of the first to third GST layers may include a multi-head attention that passes an input embedding to each of the plurality of tokens of that GST layer, and a plurality of tokens connected to the multi-head attention, and may be configured to measure the similarity between the input embedding and each token and to generate a style embedding as a weighted sum of the tokens.
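A single GST layer of this kind can be sketched as attention over a bank of tokens: the input embedding acts as the query, similarity scores are normalized into weights, and the output is the weighted sum of the tokens. In the sketch below the function names, the scaled dot-product similarity, the number of heads, and the omission of learned projection matrices are assumptions made to keep the example short.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gst_layer(embedding, tokens, num_heads=4):
    """Return (style_embedding, weights) for one reference embedding.

    embedding: shape (d,), the input embedding (query).
    tokens:    shape (k, d), the bank of style tokens (keys and values).
    Learned projection matrices are omitted for brevity (an assumption).
    """
    embedding = np.asarray(embedding, dtype=float)
    tokens = np.asarray(tokens, dtype=float)
    k, d = tokens.shape
    assert d % num_heads == 0, "feature size must divide evenly into heads"
    head_dim = d // num_heads
    q = embedding.reshape(num_heads, head_dim)       # (h, d/h)
    kv = tokens.reshape(k, num_heads, head_dim)      # (k, h, d/h)
    outputs, weights = [], []
    for h in range(num_heads):
        scores = kv[:, h, :] @ q[h] / np.sqrt(head_dim)  # similarity to each token
        w = softmax(scores)                              # weights sum to 1
        outputs.append(w @ kv[:, h, :])                  # weighted sum of tokens
        weights.append(w)
    return np.concatenate(outputs), np.stack(weights)
```

The returned weights show, per head, how strongly each token contributes to the style embedding.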

In one embodiment, the first to third GST layers may include a residual connection that connects the plurality of tokens of the current GST layer with the plurality of tokens of the previous GST layer.

In one embodiment, the step of generating the second style embedding may be configured to generate the second style embedding for the timbre of the reference audio using a fourth GST layer.

In one embodiment, the step of generating the audio may include: extracting, by an encoder of the TTS model, feature information from the characters of the text input to the encoder; mapping, by the attention, the feature information at every time step to attention alignment information to be used by a decoder of the TTS model, and passing the mapped attention alignment information and the integrated style embedding to the decoder; and generating, by the decoder, a mel spectrogram for the current time step into which the integrated style embedding is synthesized, using the attention information of the attention and the mel spectrogram of the previous time step.

In one embodiment, the step of generating the audio may further include generating audio from the mel spectrogram of the current time step by MelGAN (Mel generative adversarial networks).

In one embodiment, the voice synthesis method may further include: comparing a target mel spectrogram corresponding to the reference mel spectrogram with a predicted mel spectrogram generated by the TTS model for the input text and the reference audio, to obtain the difference between the target mel spectrogram and the predicted mel spectrogram or a loss corresponding to that difference; and training the TTS model until the difference or loss falls to or below a preset level or reference value.
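The stopping rule described here can be sketched as a loss function plus a loop that runs training steps until the loss reaches a reference value. The mean squared error, the names `mel_reconstruction_loss` and `train_until`, and the `max_steps` safety limit are assumptions for illustration; the text does not specify which loss is used.

```python
import numpy as np

def mel_reconstruction_loss(target_mel, predicted_mel):
    """Mean squared error between two mel spectrograms of equal shape."""
    target_mel = np.asarray(target_mel, dtype=float)
    predicted_mel = np.asarray(predicted_mel, dtype=float)
    if target_mel.shape != predicted_mel.shape:
        raise ValueError("spectrogram shapes must match")
    return float(np.mean((target_mel - predicted_mel) ** 2))

def train_until(step_fn, threshold, max_steps=1000):
    """Call step_fn() (one training step returning the loss) until loss <= threshold.

    Returns (number_of_steps_run, last_loss).
    """
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = step_fn()
        if loss <= threshold:
            return step, loss
    return max_steps, loss
```

In practice `step_fn` would run a forward pass, compute the loss against the target mel spectrogram, and update the model parameters.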

A voice synthesis system according to another aspect of the present invention for achieving the above technical object is a voice synthesis system capable of replicating timbre and prosody styles in real time, comprising: an extractor that extracts a fundamental frequency from an input reference audio (reference waveform); a first reference encoder that receives the fundamental frequency and encodes it to generate a prosody embedding; hierarchical global style token layers that generate, from the prosody embedding, a first style embedding for the prosody of the reference audio; a converter that converts the reference audio into a reference mel spectrogram by Fourier transform; a second reference encoder that encodes the reference mel spectrogram to generate a speaker embedding; a single global style token layer that generates, from the speaker embedding, a second style embedding for the timbre of the reference audio; and a text-to-speech (TTS) model that receives, at its attention, an integrated style embedding obtained by summing the first style embedding and the second style embedding, and generates audio for an input text in which the tone and prosody given by the integrated style embedding are synthesized.

In one embodiment, the extractor may be configured to cut the reference audio into sliding-window units of a preset fixed length, identify valid frames using a normalized cross-correlation function, and compute a pitch contour through frequency decomposition of the reference audio within the valid frames. Here, pitch corresponds to how high or low the sound of the reference audio is.

In one embodiment, the extractor may estimate the fundamental frequency corresponding to the pitch contour using the YIN method. The name YIN derives from the yin and yang of Eastern philosophy, alluding to the interplay between autocorrelation and cancellation.
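The core of a YIN-style estimator can be sketched in three steps: a difference function over candidate lags, its cumulative mean normalization, and an absolute threshold followed by a walk to the nearest local minimum. The sketch below is a simplified illustration, not the extractor of the embodiment; the function name `yin_f0`, the search range, and the threshold value are assumptions, and refinements such as parabolic interpolation are left out.

```python
import numpy as np

def yin_f0(frame, sample_rate, f_min=60.0, f_max=400.0, threshold=0.1):
    """Estimate F0 (Hz) of one frame with the core YIN steps; returns 0.0 if unvoiced."""
    frame = np.asarray(frame, dtype=float)
    tau_min = int(sample_rate / f_max)
    tau_max = int(sample_rate / f_min)
    w = len(frame) - tau_max                   # integration window length
    if w <= 0:
        raise ValueError("frame too short for the requested f_min")
    # Step 1: difference function d(tau) = sum_j (x[j] - x[j + tau])^2
    d = np.array([np.sum((frame[:w] - frame[tau:tau + w]) ** 2)
                  for tau in range(tau_max + 1)])
    # Step 2: cumulative mean normalized difference, with d'(0) = 1
    cmnd = np.ones_like(d)
    running = np.cumsum(d[1:])
    taus = np.arange(1, tau_max + 1)
    safe = np.where(running > 0, running, 1.0)  # avoid division by zero on silence
    cmnd[1:] = np.where(running > 0, d[1:] * taus / safe, 1.0)
    # Step 3: first lag under the threshold, then walk down to the local minimum
    for tau in range(tau_min, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sample_rate / tau
    return 0.0
```

Applying `yin_f0` to successive sliding windows of the reference audio yields one F0 value per window, that is, a pitch contour.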

In one embodiment, the hierarchical global style token layers may consist of three hierarchical GST layers, namely a first global style token (GST) layer through a third global style token layer.

In one embodiment, each GST layer of the hierarchical global style token layers may include a multi-head attention and a plurality of tokens connected to the multi-head attention. Here, the first GST layer measures the similarity between the prosody embedding and each token and generates a style embedding as a weighted sum of the tokens, and the style embedding generated by the first GST layer passes sequentially through the second GST layer and the third GST layer to become the first style embedding.

In one embodiment, in the hierarchical global style token layers, a first pair consisting of the first GST layer and the second GST layer and a second pair consisting of the second GST layer and the third GST layer may each include a residual connection through which the tokens of the current layer are connected to the tokens of the previous layer.
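The sequential pass through three GST layers with token-level residual connections might be sketched as follows. This is only one possible reading: the sketch assumes single-head attention for brevity and interprets the residual connection as adding the previous layer's tokens to the current layer's tokens before attention. The names `attend` and `hierarchical_gst` are hypothetical.

```python
import numpy as np

def attend(query, tokens):
    """Single-head attention: softmax similarity weights, then weighted sum of tokens."""
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ tokens

def hierarchical_gst(prosody_embedding, token_banks):
    """Pass an embedding through GST layers in order, with token residuals.

    token_banks: list of (k, d) arrays, one per layer (three in the text).
    Residual: layer i attends over its own tokens plus the tokens used by
    layer i-1, which is an assumed interpretation of the residual connection.
    """
    style = np.asarray(prosody_embedding, dtype=float)
    previous = None
    for bank in token_banks:
        effective = bank if previous is None else bank + previous
        style = attend(style, effective)   # output of this layer feeds the next
        previous = effective
    return style
```

With this arrangement every layer must use the same number of tokens and the same token dimensionality, since the token banks are added elementwise.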

In one embodiment, the voice synthesis system may further include a training manager that compares a target mel spectrogram corresponding to the reference mel spectrogram with a predicted mel spectrogram generated by the TTS model for the input text and the reference audio, to obtain the difference between the target mel spectrogram and the predicted mel spectrogram or a loss corresponding to that difference. The training manager may train the TTS model until the difference or loss falls to or below a preset level or reference value.

In one embodiment, the TTS model may include a spectrogram predictor and a vocoder. Here, the spectrogram predictor generates a mel spectrogram based on the input text and the integrated style embedding, and the vocoder generates waveform samples corresponding to the synthesized waveform from the mel spectrogram.

In one embodiment, the spectrogram predictor may include an encoder, an attention, and a decoder. The encoder may extract feature information from the characters of the input text. The attention may map the feature information at every time step to attention alignment information to be used by the decoder, and pass the mapped attention alignment information and the integrated style embedding to the decoder. The decoder may then generate a mel spectrogram for the current time step into which the integrated style embedding is synthesized, using the attention information of the attention and the mel spectrogram of the previous time step.

In one embodiment, the spectrogram predictor may further include a preprocessor located at the input of the encoder. The preprocessor may be configured to split the input text into syllables and to represent the separated syllables as integers through one-hot encoding.
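A minimal sketch of such preprocessing is given below. It relies on the fact that each precomposed Hangul syllable block is a single Unicode character, so iterating over a string yields syllables directly. The function names `build_vocab` and `encode`, the skipping of whitespace, and the decision to return both integer IDs and one-hot vectors are assumptions for illustration.

```python
def build_vocab(texts):
    """Map every distinct non-space character (one Hangul syllable block each) to an ID."""
    symbols = sorted({ch for text in texts for ch in text if not ch.isspace()})
    return {ch: i for i, ch in enumerate(symbols)}

def encode(text, vocab):
    """Return (ids, one_hot) for the syllables of `text`.

    Raises KeyError for a syllable that is not in `vocab`.
    """
    ids = [vocab[ch] for ch in text if not ch.isspace()]
    one_hot = [[1 if j == i else 0 for j in range(len(vocab))] for i in ids]
    return ids, one_hot
```

For example, with a vocabulary built from "안녕" and "녕하", the text "안녕 하" is encoded as three syllable IDs, and each row of the one-hot matrix has a single 1 at the position of its ID.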

In one embodiment, the encoder may include a character embedding unit, three convolution layers (3 Conv layers) connected to the character embedding unit, and a bidirectional LSTM (bidirectional long short-term memory) connected to the three convolution layers. The character embedding unit may convert the integer sequence received from the preprocessor into matrix form. The convolution layers may condense the information in matrix form. The bidirectional LSTM may then convert the condensed matrix-form information into encoder feature information. Here, the encoder feature information may include a context vector compressed to a single fixed size.

In one embodiment, the attention may be configured to add the attention alignment information to the encoder feature information according to the time step of the decoder, so that the speech output through the vocoder of the TTS model is pronounced in the sequential order of the input text.

In one embodiment, the vocoder may be MelGAN (Mel generative adversarial networks).

In one embodiment, the voice synthesis system may further include: a third reference encoder that encodes the reference audio to generate an emotion embedding; and a further global style token layer that generates, from the emotion embedding, a third style embedding for the emotion information contained in the reference audio. The third style embedding may be included in the integrated style embedding together with the first style embedding and the second style embedding and input to the attention of the TTS model.

In one embodiment, the third reference encoder may consist of three convolutional neural networks (3 CNNs) that receive a mel spectrogram converted from the reference audio by STFT (hereinafter 'reference mel spectrogram'), two residual blocks connected to the three convolutional neural networks, and a four-layer modified AlexNet connected to the two residual blocks.

In one embodiment, the TTS model combined with the third reference encoder and the further global style token layer may be configured to perform style transfer from the reference audio and to adjust the strength of the style transfer, and may be trained using only a reconstruction loss.

A voice synthesis system according to yet another aspect of the present invention for achieving the above technical object may include a computer program stored in a computer-readable recording medium for implementing the voice synthesis method capable of replicating timbre and prosody styles in real time according to any one of the above embodiments.

A voice synthesis system according to yet another aspect of the present invention for achieving the above technical object may include a computer-readable recording medium on which is recorded a program for implementing the voice synthesis method capable of replicating timbre and prosody styles in real time according to any one of the above embodiments.

According to the configuration of the voice synthesis system and method described above, a global style token (GST) based module can be used to replicate the timbre or prosody style of the reference audio and reflect it effectively in the voice synthesis result for the input text.

In addition, according to the present invention, synthesized speech waveforms that reflect various styles of timbre, prosody, and emotion can be generated effectively for input text on the basis of global style tokens.

In addition, according to the present invention, it is possible to provide a voice synthesis system and method that can replicate, in real time, the various prosodic elements of the reference audio, the timbre of the speaker, or the emotion of the speaker, and add them in real time to the voice synthesis output for the input text.

FIG. 1 is a schematic block diagram of the overall configuration of a voice synthesis system capable of replicating timbre and prosody styles in real time (hereinafter simply 'voice synthesis system') according to an embodiment of the present invention.
FIG. 2 is a block diagram of a voice synthesis system of a comparative example.
FIG. 3 is a configuration diagram of a prosody module that can be employed in the voice synthesis system of FIG. 1.
FIG. 4 is a detailed configuration diagram of the overall structure that can be employed in the voice synthesis system of FIG. 1.
FIG. 5 is an overall configuration diagram of a voice synthesis system according to another embodiment of the present invention.
FIG. 6 is a block diagram for explaining an emotion module that can be employed in the voice synthesis system of FIG. 5.
FIG. 7 is a detailed configuration diagram of a reference encoder that can be employed in the emotion module of FIG. 6.
FIG. 8 is a schematic configuration diagram of a voice synthesis system according to yet another embodiment of the present invention.
FIG. 9 is a block diagram for explaining the main operations applicable to the voice synthesis system of FIG. 8.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. Like reference numerals are used for like elements throughout the description of the drawings.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly, a second component may be termed a first component. The term 'and/or' includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

In the embodiments of the present application, 'at least one of A and B' may mean 'at least one of A or B' or 'at least one of combinations of one or more of A and B'. Likewise, in the embodiments of the present application, 'one or more of A and B' may mean 'one or more of A or B' or 'one or more of combinations of one or more of A and B'.

When a component is referred to as being 'connected' or 'coupled' to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. On the other hand, when a component is referred to as being 'directly connected' or 'directly coupled' to another component, it should be understood that no other component exists in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as 'include' or 'have' are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 1 is a schematic block diagram of the overall configuration of a voice synthesis system capable of replicating timbre and prosody styles in real time (hereinafter simply 'voice synthesis system') according to an embodiment of the present invention.

Referring to FIG. 1, the voice synthesis system includes a prosody module 100, a timbre module 200, a spectrum predictor 600, and a vocoder 700. By applying an integrated style embedding that sums the first style embedding of the prosody module 100 and the second style embedding of the timbre module 200, the system can output a raw waveform in which a specific prosody and timbre are added to the input text.

The prosody module 100 includes an extractor 110, a first reference encoder (RE) 120, a first global style token layer (GST1) 130, a second GST layer (GST2) 140, and a third GST layer (GST3) 150.

The extractor 110 extracts a fundamental frequency (f0) from the reference audio, which is the input reference waveform. The extractor 110 may also be referred to as a feature extractor (FE) that extracts prosody-related features from the reference audio. Here, prosody may refer to the pitch of the reference audio at each reference time point, or to changes in pitch and stress.

The first reference encoder 120 encodes the fundamental frequency of the reference audio received from the extractor 110 to generate a prosody embedding. The first reference encoder 120 may filter the sliding-window units of a fixed length through three convolutional neural networks with batch normalization (batchNorm), compute weights through two residual blocks using previously learned information, and then optimize the weights with a modified AlexNet of four layers. Here, the prosody embedding is a parameter that describes the prosody of the reference audio, and the prosody embedding generated after training is complete may be the optimal parameter that best describes the reference audio.

The first GST layer (GST1, 130), the second GST layer (GST2, 140), and the third GST layer (GST3, 150) form a hierarchical GST structure. That is, the first to third GST layers 130, 140, and 150 are arranged so that the embedding is processed sequentially in the stated order. Each GST layer includes a multi-head attention and a plurality of tokens connected to the multi-head attention. These hierarchical GST layers (hereinafter simply 'hierarchical GST') are intended to process the prosody embedding input from the first reference encoder 120 appropriately and effectively.

In the hierarchical GST, the first GST layer 130 measures the similarity between the prosody embedding and each of its tokens and generates a style embedding as a weighted sum of the tokens. That is, the first GST layer 130 may represent the prosody embedding as a weighted sum of k global style tokens (GSTs), where k is any natural number of 5 or more. In this process, the k GSTs are trained so that their combinations can express various prosody styles. In particular, in this embodiment, the style embedding learned in the first GST layer 130 is learned again in the second GST layer 140 and the third GST layer 150, so that the various expressions of the prosody embedding are learned as effectively as possible given the processing time and system complexity. The learning result of the hierarchical GST is output as the first style embedding.
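The similarity-then-weighted-sum step of a single GST layer can be sketched as follows. This is a simplified illustration, assuming one attention head, scaled dot-product similarity, and no learned projections; the actual layers use multi-head attention, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gst_layer(embedding, tokens):
    """Return (weights, style_embedding) for one bank of style tokens.

    Assumption: single-head scaled dot-product attention stands in for
    the multi-head attention described in the text.
    """
    dim = len(embedding)
    # Similarity between the input embedding and each token.
    scores = [
        sum(e * t for e, t in zip(embedding, token)) / math.sqrt(dim)
        for token in tokens
    ]
    weights = softmax(scores)
    # Style embedding = weighted sum of the tokens.
    style = [
        sum(w * token[d] for w, token in zip(weights, tokens))
        for d in range(dim)
    ]
    return weights, style
```

Because the weights sum to 1, the style embedding always lies inside the span of the k tokens, which is why training the tokens is what gives the layer its expressive range.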

Moreover, in this embodiment, for more effective learning of the prosody embedding, the hierarchical GST may be configured to have a residual connection for each of a first pair, consisting of the first GST layer 130 and the second GST layer 140, and a second pair, consisting of the second GST layer 140 and the third GST layer 150.

When the three GST layers 130, 140, and 150 are used to learn various expressions of the prosody embedding, these residual connections allow the current GST layer, placed relatively later in the data processing flow, to use the learning result of the previous GST layer, placed relatively earlier, and can thus help maximize the learning effect for the prosody embedding.

Although this embodiment describes a hierarchical GST composed of three GST layers as the most preferred form, the present invention is not limited to that configuration; a hierarchical GST composed of two, four, or five GST layers can of course achieve effects similar to those of this embodiment. Configurations with more than five GST layers may be restricted in consideration of processing time, system complexity, and the like.

The timbre module 200 includes a second reference encoder (RE, 220) and a fourth GST layer (GST4, 240), and may be configured so that the speaker embedding produced by the second reference encoder 220 is learned in the fourth GST layer 240 and a second style embedding is output.

A spectrogram produced by cutting the reference audio into sliding-window units of a specific length and applying a Fourier transform may be used as the input of the second reference encoder 220. The process of producing such a spectrogram may be referred to simply as the short-time Fourier transform (STFT). The spectrogram may be converted into a mel spectrogram through a log transform using a mel filter bank.
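The STFT and mel conversion can be sketched in pure Python as below. The window type (Hann), the FFT size, the hop, the number of mel bands, and the commonly used mel formula 2595·log10(1 + f/700) are all assumptions chosen for illustration; the source gives no parameter values. A naive DFT is used for clarity rather than speed.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def stft_power(signal, n_fft=64, hop=32):
    """Power spectrogram: n_fft // 2 + 1 bins per Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft))) ** 2
            for k in range(n_fft // 2 + 1)
        ])
    return frames

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    # Filter edges expressed as fractional DFT bin positions.
    edges = [mel_to_hz(top * i / (n_mels + 1)) * n_fft / sample_rate
             for i in range(n_mels + 2)]
    bank = []
    for m in range(1, n_mels + 1):
        left, mid, right = edges[m - 1], edges[m], edges[m + 1]
        bank.append([
            max(0.0, min((k - left) / (mid - left), (right - k) / (right - mid)))
            for k in range(n_fft // 2 + 1)
        ])
    return bank

def log_mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Frames of log mel-band energies (a small floor avoids log(0))."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    return [
        [math.log(sum(w * p for w, p in zip(row, frame)) + 1e-10) for row in bank]
        for frame in stft_power(signal, n_fft, hop)
    ]
```

In practice a library routine would replace the naive DFT; the point here is only the order of operations: window, transform, mel filter bank, log.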

The fourth GST layer 240 is used to learn the speaker embedding effectively. The fourth GST layer 240 measures the similarity between the speaker embedding and each of its tokens and generates a style embedding (the second style embedding) as a weighted sum of the tokens. That is, the fourth GST layer 240 may represent the speaker embedding as a weighted sum of k global style tokens (GSTs), where k is any natural number of 5 or more, and in this process the k GSTs are trained so that their combinations can express various timbre styles.

The spectrum predictor 600 and the vocoder 700 form a text-to-speech (TTS) model 500. In this embodiment, the TTS model 500 receives an integrated style embedding, which combines the first style embedding and the second style embedding, and is configured to reflect the timbre and prosody corresponding to the integrated style embedding when it generates the speech synthesis result for the input text.

Here, the spectrum predictor 600 generates a mel spectrogram for the input text that reflects the integrated style embedding. The vocoder 700 then outputs a raw waveform from the mel spectrogram. The raw waveform is a synthesized waveform, or waveform samples, and may have a specific format such as WAV (waveform audio format), but is not limited thereto.

The spectrum predictor 600 may include an encoder that receives text, an attention connected to the encoder, and a decoder interconnected with the attention.

The vocoder 700 is a module that generates a waveform signal or speech signal from a spectrogram or a mel spectrogram. The vocoder 700 may be a WaveNet vocoder, MelGAN (Mel generative adversarial network), or the like. Using MelGAN makes it possible to generate a speech signal that expresses the timbre or prosody of the reference audio better than other vocoders such as the WaveNet vocoder.

FIG. 2 is a block diagram of a speech synthesis system of a comparative example.

Referring to FIG. 2, the speech synthesis system of the comparative example consists of a spectrum predictor 10, a reference encoder 20, and a WaveNet vocoder 90. The spectrum predictor 10 consists of an encoder 30, an attention 50, and a decoder 70.

In contrast to the speech synthesis system of this embodiment described above (see FIG. 1), the speech synthesis system of the comparative example is configured so that, when the reference embedding generated by the reference encoder 20 from the reference audio is summed with the feature information of the encoder 30 and input to the attention 50, the attention 50 and the decoder 70 jointly generate a mel spectrogram, and the WaveNet vocoder 90 generates from the mel spectrogram the utterance (speech) that is the speech synthesis result.

Compared with the speech synthesis system of the comparative example, the speech synthesis system of this embodiment, in order to express the prosody of the reference audio effectively, uses the fundamental frequency of the reference audio as the input of the first reference encoder and learns the prosody by feeding the prosody embedding generated by the first reference encoder into the plurality of GST layers of the hierarchical GST (in particular, the first to third GST layers).

In addition, in order to express the timbre of the reference audio effectively, the speech synthesis system of this embodiment uses the spectrogram of the reference audio as the input of the second reference encoder and learns the timbre by feeding the speaker embedding generated by the second reference encoder into a single GST layer (the fourth GST layer).

The speech synthesis system of this embodiment is also configured so that the first style embedding obtained from prosody learning and the second style embedding obtained from timbre learning are summed and then either added to the encoder output of the spectrum predictor or input to the attention of the spectrum predictor together with the encoder output.

Furthermore, as described later, in order to express the emotion of the reference audio effectively, the speech synthesis system of this embodiment uses the spectrogram of the reference audio as the input of a third reference encoder and learns the emotion by feeding the emotion embedding generated by the third reference encoder into a single GST layer (a fifth GST layer). Here, the third style embedding output from the fifth GST layer may be combined with the first and second style embeddings and passed to the TTS model that includes the spectrum predictor.

In addition, by using MelGAN as the vocoder, the speech synthesis system of this embodiment is configured to generate, more effectively, speech synthesis results in which the timbre and prosody styles, or the timbre, prosody, and emotion styles, are replicated.

FIG. 3 is a configuration diagram of a prosody module that can be employed in the speech synthesis system of FIG. 1.

Referring to FIG. 3, the prosody module 100 includes an extractor (FE, 110), a reference encoder 120, and a first GST layer (GST1 or GST layer 1, 130) through an N-th GST layer (GST layer N, 160). Here, the reference encoder 120 corresponds to the first reference encoder, and N may be any natural number from 2 to 5.

The prosody module 100 is configured to extract the fundamental frequency (f0) from the input reference waveform, or reference audio, using the extractor 110; to generate a prosody embedding from the fundamental frequency using the first reference encoder 120; and to learn the prosody embedding with the hierarchical GST of the first to N-th GST layers 130 to 160 and output the first style embedding.

The first GST layer 130 includes a multi-head attention 132 and a plurality of tokens (token 1 to token k) 134 connected to the multi-head attention 132. Here, k may be any natural number of 5 or more. The first GST layer 130 measures the similarity between the prosody embedding and each token (token 1 to token k) and generates a style embedding as a weighted sum of these tokens. The style embedding generated in the first GST layer 130 may be passed as the input of the second GST layer.

Similarly, the N-th GST layer 160 includes a multi-head attention 162 and a plurality of tokens (token 1 to token k) 164 connected to the multi-head attention 162. The N-th GST layer 160 measures the similarity between the style embedding received from the GST layer on its input side and each token (token 1 to token k) and generates a style embedding as a weighted sum of these tokens. This style embedding becomes the first style embedding that the prosody module 100 finally outputs.

When N is 2, the second GST layer may measure the similarity between the style embedding received from the first GST layer 130 and each of its own plurality of tokens and generate a style embedding as a weighted sum of these tokens. In this case, the plurality of tokens 134 of the first GST layer 130 may be connected to the plurality of tokens of the second GST layer by a residual connection 170, so that the second GST layer can perform learning in parallel based on the weights of the plurality of tokens 134 of the first GST layer 130.

When N is 3, in addition to the N = 2 case above, the third GST layer may measure the similarity between the style embedding received from the second GST layer and each of its own plurality of tokens and generate a style embedding as a weighted sum of these tokens. In this case, the plurality of tokens of the second GST layer may be connected to the plurality of tokens of the third GST layer by another residual connection 170, so that the third GST layer can perform learning effectively based on the weights of the tokens of the second GST layer.
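The stacking of N GST layers with residual connections between consecutive layers can be sketched as follows. The text does not spell out the exact arithmetic of the residual connection, so this sketch assumes the usual reading: each later layer adds its input (the previous layer's style embedding) to its own output. Single-head dot-product attention and all names are likewise illustrative assumptions.

```python
import math

def attend(query, tokens):
    """Weighted sum of tokens, weighted by softmax of scaled dot products."""
    dim = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
              for tok in tokens]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum((e / total) * tok[d] for e, tok in zip(exps, tokens))
            for d in range(dim)]

def hierarchical_gst(prosody_embedding, token_banks, use_residual=True):
    """Pass an embedding through N token banks in order.

    Assumption: a residual connection links layer n to layer n + 1, so
    N layers carry N - 1 skip connections (two for the three-layer case).
    """
    current = list(prosody_embedding)
    for index, tokens in enumerate(token_banks):
        output = attend(current, tokens)
        if use_residual and index > 0:
            # Skip connection: reuse what the previous layer produced.
            output = [o + c for o, c in zip(output, current)]
        current = output
    return current  # the first style embedding
```

With three banks this reproduces the preferred configuration; passing two, four, or five banks covers the other values of N mentioned above.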

FIG. 4 is a detailed configuration diagram of an overall structure that can be employed in the speech synthesis system of FIG. 1.

Referring to FIG. 4, the speech synthesis system includes a prosody module 100, a timbre module 200, a spectrum predictor 600, and MelGAN 700a as the vocoder, and can output waveform samples in which a specific prosody and timbre are added to the input text by applying an integrated style embedding that combines the first style embedding of the prosody module 100 and the second style embedding of the timbre module 200.

The prosody module 100 includes an extractor (FE, 110), a first reference encoder 120, a first style token layer (style token layer 1, 130), a second style token layer (style token layer 2, 140), and a third style token layer (style token layer 3, 150). Each style token layer corresponds to a global style token (GST) layer.

In the prosody module 100, the extractor 110 extracts the fundamental frequency (f0) from the reference audio, which is the reference waveform. The reference audio may be audio from a predesignated dataset. During training, the reference audio may be the ground truth (GT), that is, the original, target, or true value; during inference, it may be audio that contains the desired style. The fundamental frequency is used as the input of the first reference encoder 120.

The first reference encoder 120 generates a prosody embedding from the fundamental frequency. The first reference encoder 120 may consist of six 2D convolutional neural networks (CNNs) and one gated recurrent unit (GRU). The prosody embedding is the result of passing the fundamental frequency through the first reference encoder; it is an embedding of fixed length and is used as the input of the hierarchical GST.

The hierarchical GST, composed of the three style token layers 130, 140, and 150, may learn the prosody embedding and generate the first style embedding. Each style token layer 130, 140, 150 includes a multi-head attention. Once the similarity between the prosody embedding output from the first reference encoder 120 and each style token has been measured through the multi-head attention, the hierarchical GST generates a style embedding as a weighted sum of the style tokens.

Unlike the style embedding of the timbre module 200, the style embedding of the prosody module 100 may be configured so that, when the first generated style embedding additionally passes through two more style token layers, a residual connection is used between the first and second style token layers and between the second and third style token layers, in order to maximize the learning effect for the prosodic features of the reference audio.

The timbre module 200 includes an STFT unit 210, a second reference encoder 220, and a fourth style token layer (style token layer 4, 240). The STFT unit 210 may be omitted when a spectrogram or mel spectrogram, prepared separately from the reference audio, is supplied as the input data of the second reference encoder 220. The fourth style token layer 240 corresponds to the fourth GST layer.

In the timbre module 200, the STFT unit 210 provides the mel spectrogram, obtained by applying a Fourier transform to the reference audio, as the input of the second reference encoder 220. The second reference encoder 220 receives the mel spectrogram, rather than the fundamental frequency, and generates a speaker embedding. The speaker embedding that has passed through the fourth GST layer 240 is output as the second style embedding.

The fourth style token layer 240 may learn the speaker embedding and generate the second style embedding. The fourth style token layer 240 includes a multi-head attention 242 and a plurality of tokens 244. The plurality of tokens 244 may include a first token (token 1) through a fifth token (token 5), and each token may be referred to as a style token.

Once the similarity between the speaker embedding output from the second reference encoder 220 and each style token has been measured through the multi-head attention 242, the fourth style token layer 240 may generate a style embedding (the second style embedding) as a weighted sum of the style tokens 244.

The style embedding that has passed through all three style token layers in the prosody module 100 (the first style embedding) and the style embedding of the timbre module 200 (the second style embedding) are combined (concatenated), and the result is combined with the output of the encoder of the spectrum predictor 600 of the speech synthesis system.
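A minimal sketch of the combination step, using plain Python lists as embeddings. The description uses both "sum" and "concatenate" for this step; the sketch follows the concatenation reading. Appending the integrated embedding to every encoder time step is an assumption about how it is "combined with the encoder output", and the function names are hypothetical.

```python
def combine_styles(first_style, second_style):
    """Concatenate the prosody and timbre style embeddings."""
    return list(first_style) + list(second_style)

def condition_encoder(encoder_outputs, integrated_style):
    """Attach the same integrated style embedding to each encoder time step.

    Assumption: the style vector is broadcast over time, so the attention
    sees text features and style features side by side at every position.
    """
    return [list(step) + list(integrated_style) for step in encoder_outputs]

# Three encoder time steps with 4-dim features, two 2-dim style embeddings.
encoder_outputs = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
style = combine_styles([0.3, -0.1], [0.7, 0.2])
conditioned = condition_encoder(encoder_outputs, style)
```

Under the summation reading, `combine_styles` would instead add the two vectors element by element, which requires them to have equal length.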

In a narrow sense, the speech synthesis system of this embodiment may consist of the spectrum predictor 600 and MelGAN 700a as the vocoder; in a broad sense, it may further include the prosody module 100 and the timbre module 200. Here, the spectrum predictor 600 includes an encoder 610, an attention 620, and a decoder 630.

That is, the spectrum predictor 600 includes an encoder 610 that receives the input text, an attention 620 connected to the output of the encoder 610, and a decoder 630 interconnected with the attention 620. The spectrum predictor 600 may be implemented with Tacotron2.

In this case, the speech synthesis system is a seq2seq speech synthesis system that synthesizes text into speech in two stages. In the first stage, a mel spectrogram is generated from the input text; in the second stage, speech is synthesized from the mel spectrogram using a vocoder.

Compared with a standard Tacotron2, the spectrum predictor 600 of this embodiment differs in that the integrated style embedding, which combines the first and second style embeddings, is combined with the output of the encoder 610 and input to the attention 620.

More specifically, in the spectrum predictor 600, the encoder 610 encodes the input text into 512-dimensional features (hereinafter 'feature information'). The encoder 610 may include a character embedding generation unit 613 that receives the text, three convolution layers (3 conv layers, 615) connected to the character embedding generation unit 613, and a bidirectional LSTM (long short-term memory, 617) connected to the three convolution layers 615. The three convolution layers are three convolutional neural network layers and may be referred to as 'encoder convolution layers'.

The spectrum predictor 600 may further include, at the input of the encoder 610, a preprocessor 611 that preprocesses the input text. The preprocessor 611 may be configured to split the input text into syllables and to represent each syllable as an integer through one-hot encoding. The preprocessor 611 may be omitted when the input text is processed separately by an independent preprocessing means or by an independent component that performs an equivalent function. Alternatively, the preprocessor 611 may be configured so that it is not part of the spectrum predictor 600 but is included in the TTS model.
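The preprocessing step can be sketched as below, assuming that each Hangul character is treated as one syllable and that index 0 is reserved for unknown symbols. The vocabulary construction and the function names are illustrative assumptions, not the patent's specification.

```python
def build_vocab(texts):
    """Map every distinct syllable (character) to a positive integer id."""
    symbols = sorted({ch for text in texts for ch in text})
    return {symbol: index for index, symbol in enumerate(symbols, start=1)}

def text_to_ids(text, vocab):
    """Split text into syllables and look up each id (0 = unknown)."""
    return [vocab.get(ch, 0) for ch in text]

def one_hot(index, size):
    """Expand an integer id into a one-hot vector of the given size."""
    vector = [0] * size
    vector[index] = 1
    return vector

vocab = build_vocab(["안녕하세요"])
ids = text_to_ids("안녕하세요", vocab)
vectors = [one_hot(i, len(vocab) + 1) for i in ids]
```

The integer sequence `ids` is what the character embedding generation unit would consume; the one-hot vectors are the equivalent sparse view of the same information.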

In the encoder 610, the character embedding generation unit 613 converts the integer sequence received from the preprocessor 611 into matrix form. The encoder convolution layers 615 condense the matrix-form information. The bidirectional LSTM 617 then converts the condensed matrix-form information into encoder feature information. Here, the encoder feature information may include a context vector compressed to a single fixed size.

The attention 620 adds attention alignment information (AAI) to the encoder feature information at each time step of the decoder 630 so that the speech output through MelGAN 700a is pronounced in the sequential order of the input text. That is, the attention 620 may refer to the process of extracting from the encoder 610 the information to be used by the decoder 630 at each time point and aligning or mapping it. More specifically, the attention 620 may be configured to extract from the encoder 610, and allocate, the information on which the decoder 630 should focus at each time point, with the information used at each time point drawing on the attention alignment information of the previous time point. With this configuration, the alignment plot of the attention 620 can extend continuously upward and to the right while the TTS model is being trained. The attention 620 may be implemented using location-sensitive attention.

As an example, location-sensitive attention computes the attention score by applying the hyperbolic tangent (tanh) to a weighted sum of the previous time point's attention and the encoder information and then multiplying by a weight. Here, the attention is the attention score normalized to the range 0 to 1 by softmax, and it corresponds to the information that determines how much of the encoder information to allocate. The result of the attention is called the context and corresponds to the sum of the products of the attention and the encoder information up to that time point.
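One decoding step of this score, softmax, and context computation can be sketched as follows. This is a simplified stand-in: full location-sensitive attention derives location features by convolving the previous alignment with learned filters, whereas the sketch feeds in only the previous attention weight at the same position. The parameter names (`W_h`, `W_s`, `w_loc`, `b`, `v`) are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(encoder_states, decoder_state, prev_alignment, params):
    """Return (alignment, context) for one decoder time step.

    score_j = v . tanh(W_h h_j + W_s s + w_loc * a_prev_j + b)
    Assumption: a_prev_j (a scalar) replaces the convolved location features.
    """
    scores = []
    for h_j, a_prev in zip(encoder_states, prev_alignment):
        hidden = []
        for row_h, row_s, w_loc, bias in zip(
                params["W_h"], params["W_s"], params["w_loc"], params["b"]):
            pre = (sum(w * x for w, x in zip(row_h, h_j))
                   + sum(w * x for w, x in zip(row_s, decoder_state))
                   + w_loc * a_prev + bias)
            hidden.append(math.tanh(pre))
        # Multiply the tanh output by a weight vector to get the score.
        scores.append(sum(v * t for v, t in zip(params["v"], hidden)))
    alignment = softmax(scores)  # each weight lies in 0..1, all sum to 1
    dim = len(encoder_states[0])
    # Context = sum over positions of attention weight times encoder state.
    context = [sum(a * h[d] for a, h in zip(alignment, encoder_states))
               for d in range(dim)]
    return alignment, context
```

Feeding each step's `alignment` back in as `prev_alignment` for the next step is what lets the attention move forward through the text rather than jump around.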

Using the alignment feature, which is the attention information obtained through the attention 620, and the mel spectrogram generated at the previous time step, the decoder 630 generates the mel spectrogram of the next time step. The generated mel spectrogram contains the timbre and prosody information from the integrated style embedding. The previous time step may correspond to the previous time point, and the next time step to the next or current time point.

The decoder 630 includes a pre-net 631, two decoder LSTMs 632, a first fully connected layer 633, a second fully connected layer 634, and a post-net 635. The pre-net 631 may be a two-layer pre-net, each decoder LSTM 632 may be a unidirectional LSTM (long short-term memory), each fully connected layer may be a linear projection, and the post-net 635 may be a post-net composed of five convolution layers.

In the decoder 630, the pre-net 631 extracts information from the mel vector of the previous time point. The decoder LSTMs 632 extract the information of the current time point using the outputs of the attention 620 and the pre-net 631. The first fully connected layer 633 generates the mel spectrogram from the information of the current time point. The second fully connected layer 634 is combined with a sigmoid, computes the stop probability of the current time point from the information output by the decoder LSTMs 632, and outputs a stop token. The stop probability may be computed in the range 0 to 1. The post-net 635 improves the quality of the mel spectrogram through five one-dimensional (1D) convolution layers and batch normalization. The post-net 635 may be applied after the decoder 630 has generated the entire mel spectrogram, and may smooth and refine the generated mel spectrogram using a residual connection.
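The frame-by-frame generation with a stop token can be illustrated with a short loop. This is a sketch under stated assumptions: `decode`, `step_fn`, the threshold of 0.5, and the `max_steps` cap are hypothetical and are not specified in this description; `step_fn` stands in for the pre-net, decoder LSTMs, and the two fully connected layers.

```python
def decode(step_fn, first_frame, stop_threshold=0.5, max_steps=1000):
    """Autoregressively generate mel frames until the stop token fires.

    step_fn(prev_frame) must return (next_frame, stop_probability), where
    stop_probability is in [0, 1]. The threshold and step cap are assumptions.
    """
    frames = []
    prev = first_frame
    for _ in range(max_steps):
        frame, stop_prob = step_fn(prev)
        frames.append(frame)
        if stop_prob >= stop_threshold:
            break  # stop token: the utterance is complete
        prev = frame  # the generated frame feeds the next time step
    return frames
```

A post-net pass would then run once over the full list of frames rather than inside the loop.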

In the spectrum predictor 600 described above, the total loss may be computed as the sum of the mean squared error (MSE) and the binary cross entropy (BCE). The MSE represents the difference between the original mel spectrogram and the estimated mel spectrogram, and the BCE represents the difference between the actual stop probability and the estimated stop probability. The original mel spectrogram may correspond to the target mel spectrogram.
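As a minimal sketch of this loss, assuming equal weighting of the two terms and mean reduction (neither is stated in this description), the total can be computed as follows. The function names and the clipping constant `eps` are illustrative.

```python
import math


def mse(target_mel, predicted_mel):
    """Mean squared error over all mel bins of all frames."""
    flat_t = [x for frame in target_mel for x in frame]
    flat_p = [x for frame in predicted_mel for x in frame]
    return sum((t - p) ** 2 for t, p in zip(flat_t, flat_p)) / len(flat_t)


def bce(target_stop, predicted_stop, eps=1e-7):
    """Binary cross entropy between actual and estimated stop probabilities."""
    total = 0.0
    for y, p in zip(target_stop, predicted_stop):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(target_stop)


def total_loss(target_mel, predicted_mel, target_stop, predicted_stop):
    """Total loss as the sum of the MSE and BCE terms."""
    return mse(target_mel, predicted_mel) + bce(target_stop, predicted_stop)
```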

The MelGAN 700a receives a mel spectrogram and generates a waveform signal or audio signal. The MelGAN 700a is a GAN-based vocoder composed of a generator and a discriminator, and is a model built from several layers of one-dimensional convolution (conv1d).

The generator of the MelGAN 700a is built from several conv1d layers and residual stacks. Each residual stack contains further conv1d layers and is configured to pass through a residual connection three times. The generator takes a batch of mel spectrograms as input, so the input tensor may have the shape [batch size, 80, frame length]. Here, 80 is the number of mel channels, and the frame length is a user-defined variable that determines how long a segment is fed in. The output is an audio signal with the shape [batch size, 1, frame length × hop size].
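The shape relationship between the generator input and output can be checked with a small helper. This is only an illustration: `generator_output_shape` is a hypothetical name, and the hop size of 256 in the usage note is an assumed value, since the description does not state one.

```python
def generator_output_shape(input_shape, hop_size, n_mels=80):
    """Map [batch, n_mels, frames] to the waveform shape [batch, 1, frames * hop]."""
    batch_size, channels, n_frames = input_shape
    if channels != n_mels:
        raise ValueError(f"expected {n_mels} mel channels, got {channels}")
    # each mel frame is upsampled to hop_size audio samples
    return (batch_size, 1, n_frames * hop_size)
```

For instance, with an assumed hop size of 256, a batch of 4 spectrograms of 100 frames each would yield 4 single-channel waveforms of 25,600 samples.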

The discriminator of the MelGAN 700a has a multi-scale structure with three scales. Each scale produces seven outputs in total: six feature maps and one output. A discriminator block may consist of seven conv1d layers in total.

The last output of the MelGAN 700a is essentially the discriminator output, and the remaining outputs may be used for computing the loss.

FIG. 5 is an overall configuration diagram of a speech synthesis system according to another embodiment of the present invention. FIG. 6 is a block diagram illustrating an emotion module that can be employed in the speech synthesis system of FIG. 5. FIG. 7 is a detailed configuration diagram of a reference encoder that can be employed in the emotion module of FIG. 6.

Referring to FIG. 5, the speech synthesis system includes a prosody module 100, a timbre module 200, an emotion module 300, a spectrum predictor 600, and a MelGAN 700a. By applying an integrated style embedding that combines the first style embedding of the prosody module 100, the second style embedding of the timbre module 200, and the third style embedding of the emotion module 300, the system can output waveform samples in which a specific prosody, timbre, and emotion are added to the input text. The spectrum predictor 600 and the MelGAN 700a constitute a text-to-speech (TTS) model 500.

The prosody module 100, the timbre module 200, the spectrum predictor 600, and the MelGAN 700a are substantially the same as the corresponding components of the speech synthesis system described above with reference to FIG. 4, so their detailed description is omitted.

The emotion module 300 includes an STFT unit 310, a third reference encoder 320, and a fifth style token layer (style token 5, 350). The STFT unit 310 may be omitted when a spectrogram or mel spectrogram separately prepared from the reference audio is supplied as the input data of the third reference encoder 320. The fifth style token layer 350 corresponds to the fifth GST layer.

In the emotion module 300, the STFT unit 310 may provide a spectrogram or mel spectrogram, obtained by Fourier transform of the reference audio, as the input of the third reference encoder 320. The third reference encoder 320 receives the spectrogram or mel spectrogram, rather than the fundamental frequency, and generates an emotion embedding.

The emotion embedding that has passed through the fifth GST layer 350 is output as the third style embedding. That is, the fifth style token layer 350 may learn the emotion embedding and generate the third style embedding. The fifth style token layer 350 includes a multi-head attention 352 and a plurality of tokens 354. The plurality of tokens 354 may include a first token (token 1) to a k-th token (token k), and each token may be referred to as a style token. For effective learning of the emotion embedding, k is preferably 7. That is, the fifth GST layer 350 of this embodiment may include seven style tokens corresponding to seven classified emotion styles.

The seven emotions included in the emotion styles are anger, disappointment, fear, surprise, sadness, neutral, and happiness, and they are summarized together with their mean opinion score (MOS) values in Table 1 below.

As shown in FIG. 6, when the multi-head attention 352 measures the similarity between the emotion embedding output from the third reference encoder 320 and each of the seven style tokens (see 360), the fifth style token layer 350 described above may generate a style embedding (the third style embedding) as the weighted sum of the style tokens 354.
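The similarity measurement and weighted sum can be sketched with a single attention head. This is a simplification for illustration only: the layer described above uses multi-head attention with learned projections, whereas the hypothetical `style_embedding` below uses one scaled dot-product head with no learned parameters.

```python
import math


def style_embedding(reference_embedding, tokens):
    """Weight each style token by its similarity to the reference embedding.

    reference_embedding: D floats (e.g. the emotion embedding)
    tokens:              k style tokens of D floats each (k = 7 in this embodiment)
    Returns (weights, embedding), where embedding is the weighted token sum.
    """
    scale = math.sqrt(len(reference_embedding))
    # similarity between the embedding and each token (scaled dot product)
    scores = [
        sum(q * t for q, t in zip(reference_embedding, token)) / scale
        for token in tokens
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    embedding = [
        sum(w * token[d] for w, token in zip(weights, tokens))
        for d in range(dim)
    ]
    return weights, embedding
```

In this sketch the weights can be read as how strongly each of the seven style tokens contributes to the resulting style embedding.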

The third style embedding generated by the emotion module 300 is concatenated with the first style embedding, which has passed through a total of three style token layers in the prosody module 100, and with the second style embedding of the timbre module 200. The result is combined with the output of the encoder 610 of the spectrum predictor 600 of the speech synthesis system and may be input to the attention 620 of the spectrum predictor 600.
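One possible reading of this combination is sketched below: the three style embeddings are concatenated into a single vector, which is then appended to every time step of the encoder output. The function name `combine_with_encoder` and the choice to append along the feature axis are assumptions; the description does not fix the exact operation.

```python
def combine_with_encoder(encoder_outputs, style_embeddings):
    """Append the concatenated style embeddings to each encoder time step.

    encoder_outputs:  T vectors of D floats
    style_embeddings: list of style vectors (e.g. prosody, timbre, emotion)
    Returns T vectors of D + sum(len(s)) floats for the attention input.
    """
    # integrated style embedding: concatenation of all style vectors
    integrated = [x for emb in style_embeddings for x in emb]
    # the same integrated vector is attached to every encoder time step
    return [list(frame) + integrated for frame in encoder_outputs]
```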

Describing the third reference encoder 320 in more detail with reference to FIG. 7, in order to convert an input reference spectrogram into an emotion embedding, the third reference encoder 320 may consist of three convolutional neural networks (3 CNN) 322, two residual blocks 324 connected to the three convolutional neural networks 322, and a modified AlexNet 326 connected to the two residual blocks 324. The reference spectrogram refers to a spectrogram obtained by Fourier transform of the input reference audio. The reference spectrogram may be replaced with a reference mel spectrogram.

The TTS model coupled to the emotion module including the third reference encoder 320 is configured to perform style transfer and to adjust the strength of the style transfer so that the emotional expression of the reference audio is applied to speech synthesis, and it can be trained with the reconstruction loss alone.

FIG. 8 is a schematic configuration diagram of a speech synthesis system according to yet another embodiment of the present invention, and FIG. 9 is a block diagram illustrating main operations applicable to the speech synthesis system of FIG. 8.

Referring to FIG. 8, the speech synthesis system 1000 may include a processor 1100 and a memory 1200. The speech synthesis system 1000 may further include a transceiver 1300, a storage device 1400, or an input interface device 1500 and an output interface device 1600.

In the speech synthesis system 1000, the processor 1100 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed.

The memory 1200 or the storage device 1400 may store at least one command executed by the processor 1100. The at least one command may include a first command for implementing or operating the prosody module, a second command for implementing or operating the timbre module, a third command for implementing or operating the emotion module, and a fifth command for implementing or operating the TTS model.

The at least one command may also include: a command to extract a fundamental frequency from input reference audio; a command to pass the fundamental frequency to the input of a reference encoder; a command to encode the fundamental frequency to generate a prosody embedding; a command to generate a first style embedding from the prosody embedding; a command to convert the reference audio into a reference mel spectrogram by Fourier transform; a command to encode the reference mel spectrogram to generate a speaker embedding; a command to generate a second style embedding from the speaker embedding; a command to combine the integrated style embedding, which is the combination of the first style embedding and the second style embedding, with the output of the encoder of the text-to-speech (TTS) model and input the result to the attention; a command to produce a mel spectrogram through the attention and decoder of the TTS model; and a command to generate, through the vocoder of the TTS model, synthesized audio for the input text in which timbre and prosody are combined by the integrated style embedding.

Each of the memory 1200 and the storage device 1400 described above may consist of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory 1200 or the storage device 1400 may consist of at least one of read-only memory (ROM) and random access memory (RAM).

The transceiver 1300 may include at least one communication subsystem that supports communication with external devices over a wireless network, a wired network, a satellite network, or a combination thereof.

The input interface device 1500 may include at least one input means selected from a keyboard, a microphone, a touchpad, a touchscreen, and the like, and an input signal processing unit that maps a signal received through the at least one input means to a pre-stored command or otherwise processes it.

The output interface device 1600 may include an output signal processing unit that maps a signal output under the control of the processor 1100 to a pre-stored signal form or level or otherwise processes it, and at least one output means that outputs a signal or information in the form of vibration, light, or the like according to the signal of the output signal processing unit. The output signal processing unit may generate a network topology or voltage state estimation result of a wiring system in the form of an image, audio, or a combination thereof. The at least one output means may include a speaker, a display device, a printer, an optical output device, a vibration output device, and the like.

The speech synthesis system 1000 described above may be integrated into or mounted on, for example, a desktop computer, laptop computer, notebook, smartphone, tablet PC, mobile phone, smart watch, smart glasses, e-book reader, portable multimedia player (PMP), portable game console, navigation device, digital camera, digital multimedia broadcasting (DMB) player, digital audio recorder, digital audio player, digital video recorder, digital video player, personal digital assistant (PDA), network server, web server, mail server, or a service server for a specific service.

As shown in FIG. 9, the speech synthesis system 1000 may further include a first module 1700 for training management and a second module 1800 for a style-replicating TTS service.

The first module 1700 may be configured to use a first audio waveform having the desired timbre and prosody, or a second audio waveform having the desired timbre, prosody, and emotion, as ground truth (GT) reference audio in order to train the hierarchical GST of the prosody module and the fourth GST layer of the timbre module, or to additionally train the fifth GST layer of the emotion module.

The speech synthesis system 1000 may also be configured to provide, through the second module 1800, a text-to-speech (TTS) service that replicates the timbre and prosody styles of the reference audio, or a speech synthesis service that replicates the timbre, prosody, and emotion styles of the reference audio.

The first module 1700 and/or the second module 1800 may be implemented as program commands or software modules, stored in the memory 1200 or the storage device 1400, and executed by the processor 1100. Depending on the implementation, at least part of the first module 1700 and/or the second module 1800 may include a hardware configuration for at least part of the corresponding function.

As described above, the speech synthesis system 1000 includes a prosody module and a timbre module together with the TTS model, and optionally further includes an emotion module. Because each of these modules is based on global style tokens, the system has the advantage that, given only a reference audio containing the desired timbre, prosody, emotion, or a combination of these styles, it can synthesize speech in the style of that reference audio.

The operations of the method according to embodiments of the present invention can be implemented as a computer-readable program or code on a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices that store data readable by a computer system. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable program or code is stored and executed in a distributed manner.

The computer-readable recording medium may also include hardware devices specially configured to store and execute program commands, such as ROM, RAM, and flash memory. The program commands may include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although some aspects of the present invention have been described in the context of an apparatus, they also represent a description of the corresponding method, where a block or apparatus corresponds to a method step or a feature of a method step. Similarly, aspects described in the context of a method may also be represented as a corresponding block, item, or feature of a corresponding apparatus. Some or all of the method steps may be performed by (or using) a hardware device such as a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, one or more of the most important method steps may be performed by such a device.

In embodiments, a programmable logic device, for example a field programmable gate array, may be used to perform some or all of the functions of the methods described herein. In embodiments, the field programmable gate array may operate together with a microprocessor to perform one of the methods described herein. In general, the methods are preferably performed by some hardware device.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

Claims

A speech synthesis method capable of replicating timbre and prosody styles in real time, the method comprising:
extracting a fundamental frequency from input reference audio and transferring the fundamental frequency to an input of a reference encoder;
generating, by the reference encoder, a prosody embedding by encoding the fundamental frequency;
generating a first style embedding from the prosody embedding;
converting the reference audio into a reference mel spectrogram by Fourier transform;
encoding the reference mel spectrogram to generate a speaker embedding;
generating a second style embedding from the speaker embedding;
inputting an integrated style embedding, obtained by combining the first style embedding and the second style embedding, into an attention of a text-to-speech (TTS) model; and
generating, by the TTS model, audio for input text in which timbre and prosody are synthesized by the integrated style embedding.

The method of claim 1,
wherein, in the extracting of the fundamental frequency, the reference audio is cut in units of a sliding window of a certain interval and a pitch contour is calculated through frequency decomposition of the reference audio using a normalized cross-correlation function, the pitch corresponding to the highness or lowness of a sound.

The method of claim 1,
wherein the generating of the first style embedding generates the first style embedding for the prosody of the reference audio using a three-layer hierarchical GST consisting of a first global style token (GST) layer to a third GST layer.

The method of claim 3,
wherein each of the first to third GST layers includes a multi-head attention that passes an input embedding to a plurality of tokens of that GST layer and the plurality of tokens connected to the multi-head attention, measures a similarity between the input embedding and each token, and generates a style embedding as a weighted sum of the tokens, and
wherein the first to third GST layers include a residual connection that connects the plurality of tokens of a current GST layer with the plurality of tokens of a previous GST layer.

The method of claim 3,
wherein, in the generating of the second style embedding, the second style embedding for the timbre of the reference audio is generated using a fourth GST layer.

The method of claim 1,
wherein the generating of the audio comprises:
extracting, by an encoder of the TTS model, feature information from the text input to the encoder;
mapping, by the attention, the feature information to attention alignment information to be used by a decoder of the TTS model at each time point, and transferring the mapped attention alignment information and the integrated style embedding to the decoder; and
generating, by the decoder, a mel spectrogram of a current time point synthesized with the integrated style embedding, using the attention information of the attention and a mel spectrogram of a previous time point.

The method of claim 6,
wherein the generating of the audio further comprises generating the audio from the mel spectrogram of the current time point by a Mel generative adversarial network (MelGAN).

The method of claim 1, further comprising:
comparing a target mel spectrogram corresponding to the reference mel spectrogram with a predicted mel spectrogram generated by the TTS model for the input text and the reference audio, and obtaining a difference between the target mel spectrogram and the predicted mel spectrogram or a loss corresponding to the difference; and
training the TTS model until the difference or the loss is less than or equal to a preset level or reference value.

A speech synthesis system capable of replicating timbre and prosody styles in real time, the system comprising:
an extraction unit for extracting a fundamental frequency from input reference audio;
a first reference encoder that receives the fundamental frequency and encodes the fundamental frequency to generate a prosody embedding;
hierarchical global style token layers for generating a first style embedding for a prosody of the reference audio from the prosody embedding;
a transform unit for transforming the reference audio into a reference mel spectrogram by Fourier transform;
a second reference encoder for encoding the reference mel spectrogram to generate a speaker embedding;
a single global style token layer for generating a second style embedding for a timbre of the reference audio from the speaker embedding; and
a text-to-speech (TTS) model that receives, at an attention, an integrated style embedding obtained by combining the first style embedding and the second style embedding, and generates audio for input text in which timbre and prosody are synthesized by the integrated style embedding.

The system of claim 9,
wherein the extraction unit cuts the reference audio in units of a sliding window of a certain interval, separates valid frames using a normalized cross-correlation function, and calculates a pitch contour through frequency decomposition of the reference audio within the valid frames, the pitch corresponding to the highness or lowness of the sound of the reference audio.

The system of claim 9,
wherein the hierarchical global style token layers form a three-layer hierarchical GST consisting of a first global style token (GST) layer to a third GST layer.

The system of claim 11,
wherein each GST layer of the hierarchical global style token layers includes a multi-head attention and a plurality of tokens connected to the multi-head attention,
wherein the first GST layer measures a similarity between the prosody embedding and each token and generates a style embedding as a weighted sum of the plurality of tokens, and
wherein the style embedding generated in the first GST layer passes sequentially through the second GST layer and the third GST layer and is output as the first style embedding.

The system of claim 11,
wherein, in the hierarchical global style token layers, a first pair consisting of the first GST layer and the second GST layer and a second pair consisting of the second GST layer and the third GST layer each have a residual connection that connects the tokens of the current layer with the tokens of the previous layer.

The system of claim 9,
further comprising a training management unit that compares a target mel spectrogram corresponding to the reference mel spectrogram with a predicted mel spectrogram generated by the TTS model for the input text and the reference audio, and obtains a difference between the target mel spectrogram and the predicted mel spectrogram or a loss corresponding to the difference,
wherein the training management unit trains the TTS model until the difference or the loss is less than or equal to a preset level or reference value.

The system of claim 9,
wherein the TTS model includes a spectrogram predictor and a vocoder,
wherein the spectrogram predictor generates a mel spectrogram based on the input text and the integrated style embedding, and
wherein the vocoder generates waveform samples corresponding to a synthesized waveform from the mel spectrogram.

The system of claim 15,
wherein the spectrogram predictor includes an encoder, an attention, and a decoder,
wherein the encoder extracts feature information from the input text,
wherein the attention maps the feature information to attention alignment information to be used by the decoder at each time point and transfers the mapped attention alignment information and the integrated style embedding to the decoder, and
wherein the decoder generates a mel spectrogram of a current time point synthesized with the integrated style embedding, using the attention information of the attention and a mel spectrogram of a previous time point.

The system of claim 16,
wherein the spectrogram predictor further includes a preprocessing unit located at an input of the encoder, and the preprocessing unit splits the input text into syllable units and represents the split syllables as integers through one-hot encoding.

The system of claim 17,
wherein the encoder includes a character embedding generation unit, three convolution layers connected to the character embedding generation unit, and a bidirectional long short-term memory (LSTM) connected to the three convolution layers,
wherein the character embedding generation unit converts the integer sequence received from the preprocessing unit into matrix form,
wherein the three convolution layers condense the information in matrix form, and
wherein the bidirectional LSTM converts the condensed matrix-form information into encoder feature information, the encoder feature information including a context vector compressed to a single fixed size.

The system of claim 16,
wherein the attention adds the attention alignment information to the encoder feature information according to the time step of the decoder, so that the pronunciation of the speech output through the vocoder of the TTS model proceeds in the sequential order of the input text.

The system of claim 15,
wherein the vocoder is a Mel generative adversarial network (MelGAN).