KR102624194B1

KR102624194B1 - Non autoregressive speech synthesis system and method using speech component separation

Info

Publication number: KR102624194B1
Application number: KR1020220064616A
Authority: KR
Inventors: 장준혁; 이모아
Original assignee: 한양대학교 산학협력단
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2024-01-11
Also published as: KR20230164902A

Abstract

A non-autoregressive speech synthesis system and method using speech component separation are disclosed. A speech synthesis method performed by a speech synthesis system according to an embodiment includes: receiving a speech utterance as input to a non-autoregressive speech synthesis model composed of a first network for separating speech components from speech and a second network for estimating speech components from text; and estimating a final speech utterance through speech synthesis of the utterance using the non-autoregressive speech synthesis model. The non-autoregressive speech synthesis model may be configured such that the first network separates speech components from the utterance through unsupervised learning, the second network generates the separated speech components from text, and the final speech utterance is estimated through speech synthesis in a non-autoregressive manner.

Description

Non-autoregressive speech synthesis system and method using speech component separation {NON AUTOREGRESSIVE SPEECH SYNTHESIS SYSTEM AND METHOD USING SPEECH COMPONENT SEPARATION}

The following description relates to speech synthesis technology.

Text-to-speech (TTS), the mapping process that converts a character sequence into speech (or acoustic features), is an essential technology for designing human-computer interaction systems. Neural TTS models built on autoregressive end-to-end frameworks have recently achieved considerable success in the quality and naturalness of synthesized speech. Although the autoregressive approach has contributed to the rapid development of neural TTS models, its training and inference are slow, and a single mispredicted speech frame can affect the inference of all subsequent frames.

In recent years, non-autoregressive TTS models have focused on using parallel architectures to solve the problems of autoregressive TTS models while achieving comparable performance.

However, conventional techniques use estimated values of prosody, a component of speech, as targets for the speech synthesis model, so errors caused by estimation error are unavoidable. In addition, although timbre conversion is possible using speech component separation based on unsupervised learning, no speech synthesis system using this technique has been built.

A non-autoregressive speech synthesis method and system can be provided that separate speech components from a speech utterance through unsupervised learning without explicit prosody labels, and that estimate the final speech utterance by generating the separated speech components from text.

A speech synthesis method performed by a speech synthesis system includes: receiving a speech utterance as input to a non-autoregressive speech synthesis model composed of a first network for separating speech components and a second network for estimating speech components from text; and estimating a final speech utterance through speech synthesis of the utterance using the non-autoregressive speech synthesis model. The non-autoregressive speech synthesis model may be configured such that the first network separates speech components from the utterance through unsupervised learning, the second network generates the separated speech components from text, and the final speech utterance is estimated through speech synthesis in a non-autoregressive manner.

The non-autoregressive speech synthesis model may be configured such that the first network separates, from the utterance and through unsupervised learning, prosody components including speaker, rhythm, and pitch from a content component; the second network estimates the separated content component from the text; and a decoder generates a plurality of spectrograms based on the parallel structure of the first network and the second network.

The first network may include a first content encoder, a rhythm encoder, and a pitch encoder for separating the speech components.

In the first network, latent vectors representing a content code, a rhythm code, and a pitch code are extracted by the first content encoder, the rhythm encoder, and the pitch encoder included in the first network, and the extracted latent vectors may be used as input data to the decoder to estimate the spectrogram of the final speech utterance.

The second network may include a second content encoder and the first content encoder trained in the first network.

In the second network, the text is converted by the second content encoder into linguistic features containing content information; for the converted linguistic features, a scalar value corresponding to the duration of each text token is estimated by a duration predictor and a length regulator; and the length of the linguistic feature corresponding to each text token may be adjusted based on the estimated scalar value.

In the second network, content information may be extracted through random resampling as the converted linguistic features are input to the first content encoder trained in the first network.

The content encoder trained in the first network extracts a content code, and the spectrogram of the final speech utterance may be estimated from the content code extracted through the second network.

The decoder may be trained to reconstruct the final utterance spectrogram.

The duration predictor may be trained without explicit labels through a length-level loss on the duration of each text token.

The length-level loss may be designed so that the sum of the scalar values output by the duration predictor equals the length of the final utterance spectrogram.

A computer program stored in a non-transitory computer-readable recording medium may be included to cause the speech synthesis system to execute the speech synthesis method.

A speech synthesis system includes: a speech input unit that receives a speech utterance as input to a non-autoregressive speech synthesis model composed of a first network for separating speech components and a second network for estimating speech components from text; and a speech estimation unit that estimates a final speech utterance through speech synthesis of the utterance using the non-autoregressive speech synthesis model. The non-autoregressive speech synthesis model may be configured such that the first network separates speech components from the utterance through unsupervised learning, the second network generates the separated speech components from text, and the final speech utterance is estimated through speech synthesis in a non-autoregressive manner.

With non-autoregressive speech synthesis based on unsupervised speech component separation, content and prosody elements are separated from a speech utterance without any explicit labels. Using the separated content and prosody elements simplifies the training pipeline and achieves high expressiveness and controllability while maintaining high-quality speech synthesis.

Figure 1 is a diagram illustrating the structure of a non-autoregressive speech synthesis model according to an embodiment.
Figure 2 is a diagram illustrating the detailed structure of a non-autoregressive speech synthesis model according to an embodiment.
Figure 3 is a block diagram illustrating the configuration of a speech synthesis system according to an embodiment.
Figure 4 is a flowchart illustrating a speech synthesis method in a speech synthesis system according to an embodiment.
Figure 5 is a diagram showing mel spectrograms generated using content information extracted from speech and from content, respectively, according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

Figure 1 is a diagram illustrating the structure of a non-autoregressive speech synthesis model according to an embodiment.

For a speech synthesis model, directly estimating speech from text in a parallel manner is difficult because diverse information is entangled in speech. The embodiments describe a system and method that apply, to non-autoregressive speech synthesis, a technique for separating an utterance into content, which carries the text information, and prosody information such as intonation, speaker information, and rhythm. The non-autoregressive speech synthesis model 100 according to an embodiment is composed of a network for decomposing speech components and a network for generating the decomposed speech components from text. Because it synthesizes speech from speech components, it can be trained more effectively than a model that generates speech directly from text.

The speech synthesis system may construct a non-autoregressive speech synthesis model 100 that includes a first network (SCD; Speech Component Decomposition network) 110 for separating speech components and a second network (SCG; Speech Component Generation network) 120 for estimating speech components from text.

The first network 110 is described first. Speech can typically be separated into four components: content, speaker (e.g., timbre), rhythm, and pitch. Other information (e.g., energy, background noise) is generally not a decisive factor in the quality of speech synthesis, so the description focuses on these four speech components.

The first network 110 may separate the speech component(s) from the utterance. Using an information bottleneck, the first network 110 can separate the speech components from the utterance without explicit prosody labels. The first network 110 may be composed of a first content encoder 132, a rhythm encoder 131, and a pitch encoder 134 for separating the respective speech components. Because speaker information has an explicit label, no separate speaker encoder is configured. Through a carefully designed information bottleneck, each encoder can extract a latent vector for its speech component from which the other components have been removed. The extracted latent vectors represent the content code, rhythm code, and pitch code, which together with the speaker code may be used as input data to the decoder 140 for estimating the utterance.

Equation 1:

$$Z_c = E_c(A(S)), \quad Z_r = E_r(S), \quad Z_f = E_f(A(P))$$

Equation 1 expresses how the content code, rhythm code, and pitch code are extracted by the first content encoder 132, the rhythm encoder 131, and the pitch encoder 134. Here, $Z_c$, $Z_r$, and $Z_f$ denote the content code, rhythm code, and pitch code, and $E_c$, $E_r$, and $E_f$ denote the first content encoder 132, the rhythm encoder 131, and the pitch encoder 134. $A$ is the random resampling operation, $S$ is the utterance spectrogram, and $P$ is the normalized pitch contour. The decoder 140 can estimate the utterance spectrogram by taking as input the decomposed codes extracted by the encoders and the speaker ID. In other words, the rhythm encoder 131 receives the utterance spectrogram, while the first content encoder 132 and the pitch encoder 134 receive the randomly resampled utterance spectrogram and pitch, respectively.

The utterance spectrogram S may be input to the content encoder 132 and the rhythm encoder 131, and the normalized pitch contour extracted from the utterance may be input to the pitch encoder 134. The random resampling process A may be applied to the inputs of the content encoder 132 and the pitch encoder 134 to intentionally corrupt the rhythm information.
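The random resampling step can be illustrated with a minimal sketch. The function name `random_resample`, the segment-length bounds, the stretch-rate bounds, and the nearest-neighbour interpolation are all assumptions chosen for illustration; the text above only states that the inputs are randomly resampled to corrupt rhythm.

```python
import random


def random_resample(frames, min_seg=19, max_seg=32,
                    min_rate=0.5, max_rate=1.5, rng=None):
    """Randomly stretch or compress consecutive segments of a frame sequence.

    frames: list of per-frame features (vectors or scalars).
    The segment and rate bounds are illustrative assumptions, not values
    taken from the description.
    """
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(frames):
        seg_len = rng.randint(min_seg, max_seg)   # random segment length
        segment = frames[i:i + seg_len]
        rate = rng.uniform(min_rate, max_rate)    # random stretch factor
        new_len = max(1, round(len(segment) * rate))
        # Nearest-neighbour resampling keeps the sketch dependency-free.
        for k in range(new_len):
            src = int(k * len(segment) / new_len)
            out.append(segment[src])
        i += seg_len
    return out
```

Because each segment is stretched by a different factor, the timing of the output no longer matches the original utterance, while the order of the frames is preserved.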

Speech decomposition can perform various voice conversions depending on a reference utterance. The decoder 140 for speech decomposition can be expressed as follows.

Equation 2:

$$\hat{S} = D(Z_c, Z_r, Z_f, U)$$

Here, $\hat{S}$ is the predicted utterance spectrogram, $D$ is the decoder, $U$ is the speaker ID, and $Z_c$, $Z_r$, and $Z_f$ are the content, rhythm, and pitch codes. Such a model, however, is designed for voice conversion tasks. Equation 2 represents the process in which the decoder (D) 140 takes the speech components as input and estimates the spectrogram of the final speech utterance.

The second network 120 is described next. In the text-speech pairs used to train a non-autoregressive speech synthesis model, the text input contains only content information, so directly estimating speech in which diverse prosody information is entangled is of limited effectiveness. In addition, most prosody has no explicit labels, which makes the task harder. The speech synthesis system according to an embodiment can estimate the final speech utterance by having the second network 120 estimate, from text, the speech components decomposed through unsupervised learning. The second network 120 may estimate speech components from text and may include a second content encoder 135 and the content encoder trained in the first network 110.

The second content encoder 135 converts the text into linguistic features containing content information. For the converted linguistic features, a duration predictor and a length regulator estimate a scalar value corresponding to the duration of each text token and use the estimated scalar value to adjust the length of the linguistic feature corresponding to each text token. Through this process, the length of the converted linguistic features becomes equal to the length of the spectrogram of the utterance to be estimated, and rhythm information is added to the content information. The converted linguistic features are input to the first content encoder 136 trained in the first network 110, which performs random resampling and extracts the content information. Here, the first content encoder 136 trained in the first network 110 is copied to the second network 120 and its parameters are frozen. Although the inputs to the content encoders 135 and 136 in the first network 110 and the second network 120 contain different speech components, the purpose of the encoder is to separate content information from residual information, so the same result, the content code, is extracted and can be used to estimate the final speech utterance.
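The length adjustment described above can be sketched as follows. This is a minimal illustration, assuming that each token's feature vector is simply repeated for a rounded, non-negative number of frames; the function name `length_regulate` and the rounding rule are assumptions, not details given in the description.

```python
def length_regulate(token_features, durations):
    """Repeat each token's feature vector according to its predicted duration.

    token_features: list of per-token feature vectors.
    durations: list of predicted scalar durations (in frames), one per token.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    if len(token_features) != len(durations):
        raise ValueError("one duration is required per token")
    expanded = []
    for feat, dur in zip(token_features, durations):
        n_frames = max(0, round(dur))        # negative predictions map to 0
        expanded.extend([feat] * n_frames)   # frame-level copies of the token
    return expanded


# Two tokens lasting 2 and 1 frames expand to three frame-level features.
frames = length_regulate([["a"], ["b"]], [2, 1])
```

After this step the feature sequence has one entry per spectrogram frame, which is why the text says that rhythm information has been added to the content information.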

The objective function of the non-autoregressive speech synthesis model 100, which includes the first network 110 and the second network 120 for estimating speech components from text, is described next. The non-autoregressive speech synthesis model 100 estimates the spectrogram of the utterance from the content code extracted through the second network 120, as expressed in Equation 3 below.

Equation 3:

During training, the decoder is trained to reconstruct the final (target) utterance spectrogram, and the reconstruction loss can be expressed as Equation 4.

Equation 4:

In the non-autoregressive speech synthesis model 100, the duration predictor can be trained without explicit labels through a length-level loss on the duration of each text token. The length-level loss can be expressed as Equation 5. It drives the sum of the scalar values output by the duration predictor toward the length of the final utterance spectrogram, so that each scalar value comes to represent a duration.

Equation 5:

Here, $L$ is the length of the final (target) utterance spectrogram, $N$ is the number of text tokens, and $d_n$ is the duration of the $n$-th token.
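A minimal sketch of a length-level loss is given below. It assumes an absolute-difference form between the target length $L$ and the summed durations $\sum_{n=1}^{N} d_n$; the exact form of Equation 5 is not given in the text, so this is only one plausible instance, and the function name `length_level_loss` is hypothetical.

```python
def length_level_loss(durations, target_length):
    """Gap between the summed token durations and the target frame count.

    durations: predicted scalar durations d_1..d_N, one per text token.
    target_length: number of frames L in the target spectrogram.
    The absolute-difference form is an assumption of this sketch.
    """
    return abs(target_length - sum(durations))
```

The loss is zero exactly when the predicted durations add up to the spectrogram length, which is the only supervision the duration predictor receives; no per-token duration labels are involved.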

The final loss can therefore be expressed as Equation 6.

Equation 6:

Figure 2 is a diagram illustrating the detailed structure of a non-autoregressive speech synthesis model according to an embodiment.

In the non-autoregressive speech synthesis model 100, shaded blocks denote parameter-frozen blocks. For the pitch encoder, rhythm encoder, and content encoder, the number of convolutional layers n₁ is 3, 1, and 1, with dimensions 256, 128, and 512, respectively. The group normalization dimensions are 16, 8, and 32. The number of BLSTM (bidirectional LSTM) layers n₂ is 1, 1, and 2, with dimensions 32, 1, and 8, respectively. The downsampling factor is 8 for all encoders.
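The hyperparameters listed above can be collected in a small configuration sketch. It assumes that each list of values maps to the pitch, rhythm, and content encoders in that order; the class name `EncoderConfig`, the field names, and the ceiling division in `code_length` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    conv_layers: int      # n1, number of convolutional layers
    conv_dim: int         # convolution channel dimension
    group_norm_dim: int   # group normalization dimension
    blstm_layers: int     # n2, number of BLSTM layers
    blstm_dim: int        # BLSTM dimension
    downsample: int = 8   # downsampling factor shared by all encoders


# Values in the order pitch, rhythm, content, as read from the paragraph above.
ENCODERS = {
    "pitch": EncoderConfig(3, 256, 16, 1, 32),
    "rhythm": EncoderConfig(1, 128, 8, 1, 1),
    "content": EncoderConfig(1, 512, 32, 2, 8),
}


def code_length(num_frames, cfg):
    """Time steps left after downsampling (ceiling division is assumed)."""
    return -(-num_frames // cfg.downsample)
```

Under these assumptions a 128-frame spectrogram yields a code of 16 time steps from each downsampling encoder, which shows how narrow the information bottleneck is along the time axis.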

The non-autoregressive speech synthesis model 100 may be composed of four learnable encoders and one decoder. All learnable encoders used in the first network 110 and the second network 120 share a similar structure consisting of convolutional layers followed by group normalization layers and BLSTM layers. Except in the rhythm encoder, a random resampling operation is added at the input stage to intentionally corrupt the rhythm information. All outputs of the convolutional layers are fed into the BLSTM layer stack, which reduces the feature dimension. Then, except for the second content encoder, the three encoders apply an additional downsampling operation to produce the final speech component codes with a reduced time dimension. The output of the second content encoder is instead fed directly to the duration predictor, without downsampling, in order to estimate the duration corresponding to each phoneme token. The duration predictor consists of two one-dimensional convolutional layers followed by layer normalization and one linear layer. The output representation of the second content encoder is upsampled by the length regulator based on the predicted durations, and the first content encoder copied from the first network 110 then generates the content code. The decoder 140 first receives the speaker identity label and the speech component codes and restores the original sampling rate. All inputs are then concatenated and passed through a stack of BLSTM layers to produce the final output, the final utterance spectrogram.

Figure 3 is a block diagram illustrating the configuration of a speech synthesis system according to an embodiment, and Figure 4 is a flowchart illustrating a speech synthesis method in the speech synthesis system according to an embodiment.

The processor of the speech synthesis system 300 may include a speech input unit 310 and a speech estimation unit 320. These components of the processor may be representations of different functions performed by the processor according to control instructions provided by program code stored in the speech synthesis system. The processor and its components may control the speech synthesis system to perform steps 410 to 420 of the speech synthesis method of Figure 4. The processor and its components may be implemented to execute instructions according to the code of an operating system and the code of at least one program contained in memory.

The processor may load into memory the program code stored in a program file for the speech synthesis method. For example, when the program is executed in the speech synthesis system, the processor may control the speech synthesis system to load the program code from the program file into memory under the control of the operating system. The speech input unit 310 and the speech estimation unit 320 may each be a different functional representation of the processor that executes the corresponding portion of the program code loaded into memory in order to perform steps 410 to 420.

In step 410, the speech input unit 310 may receive a speech utterance as input to a non-autoregressive speech synthesis model composed of a first network for separating speech components and a second network for estimating speech components from text. For example, the speech input unit 310 may receive an entire utterance or a preset segment of the entire utterance. The utterance input to the speech input unit 310 may be in Korean or in any of various languages other than Korean.

In step 420, the speech estimation unit 320 may estimate the final speech utterance through speech synthesis of the utterance using the constructed non-autoregressive speech synthesis model. The speech estimation unit 320 may obtain the final utterance spectrogram for the final speech utterance.

Figure 5 is a diagram showing mel spectrograms generated using content information extracted from speech and from content, respectively, according to an embodiment.

In an embodiment, experiments may be performed using the VCTK dataset, which consists of approximately 44 hours of speech collected from 109 speakers. For the experiments, the dataset is split into a 40-hour training set and a 4-hour evaluation set. The training and evaluation sets may be divided into two types, seen and unseen. Seen data contains different utterances from the same speakers, whereas unseen data may be included only in the evaluation set.

In all experiments, the generated utterance spectrograms may be converted to waveforms using a vocoder. The model may be evaluated in two respects: synthesis quality and speed. Synthesis quality may be evaluated with a subjective listening test involving 12 participants and reported as a mean opinion score (MOS) from 1 to 5 with a 95% confidence interval. Two speaker types, seen and unseen during training, may be set up for evaluation.
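As an illustration of how a MOS with a 95% confidence interval can be computed from listener ratings, the sketch below uses a normal approximation (z = 1.96). The function name `mos_with_ci` and the choice of a normal rather than a t-based interval are assumptions; the description does not state which procedure was used.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval).

    ratings: listener scores on the 1-5 scale.
    z: critical value; 1.96 gives a 95% interval under a normal
       approximation, which is an assumption of this sketch.
    """
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0          # no spread can be estimated from one rating
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem
```

The result is usually reported as mean ± half-width. With a small panel, a t-distribution critical value would give a somewhat wider interval than the normal approximation used here.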

Table 1:

Table 1 shows the quality of the synthesized speech, reporting the MOS measured to evaluate the sound quality of the non-autoregressive speech synthesis model proposed in the embodiment and of the baseline models. To assess the naturalness of the proposed model, it may be compared with the ground-truth mel spectrogram (GT mel) and with the synthesized speech of FastSpeech2, a parallel TTS model. In addition, to check whether the content component generated in the second network closely estimates the content component separated in the first network, the result may be compared with the synthesized speech of the SpeechSplit model. The MOS results indicate that the proposed model achieves slightly better synthesis quality than FastSpeech2. The synthesis quality is also similar to that of SpeechSplit, a conventional non-autoregressive speech synthesis model, which indicates that the content information decomposed from speech was successfully estimated from text by the second network. In other words, the non-autoregressive speech synthesis model proposed in the embodiment outperforms FastSpeech2, which is itself a non-autoregressive speech synthesis model.

Table 2 shows the improvement in synthesis speed relative to Tacotron2, a conventional autoregressive model.

Table 2:

To measure the improvement in inference efficiency, the time required to generate a mel-spectrogram of speech may be compared between the proposed non-autoregressive speech synthesis model and Tacotron2, an autoregressive TTS model. As shown in Table 2, the non-autoregressive speech synthesis model proposed in the embodiment performs slightly better than Tacotron2 in inference speed.
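A comparison of this kind can be sketched as a small timing harness. The two synthesis functions below are hypothetical stand-ins (they are not Tacotron2 or the proposed model); they only mimic frame-by-frame generation versus all-at-once generation so that the harness runs on its own. The 5-frames-per-character ratio is likewise an arbitrary assumption.

```python
import time


def mean_latency(synthesize, text, repeats=100):
    """Average wall-clock seconds per call of synthesize(text)."""
    synthesize(text)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        synthesize(text)
    return (time.perf_counter() - start) / repeats


def dummy_autoregressive(text):
    # Hypothetical stand-in: each frame depends on previously generated frames.
    frames = []
    for _ in range(len(text) * 5):
        frames.append(sum(frames[-3:]) + 1.0)
    return frames


def dummy_parallel(text):
    # Hypothetical stand-in: all frames are produced in a single step.
    return [1.0] * (len(text) * 5)


baseline = mean_latency(dummy_autoregressive, "hello world")
proposed = mean_latency(dummy_parallel, "hello world")
print(f"baseline: {baseline:.2e} s, proposed: {proposed:.2e} s")
```

With real models, `synthesize` would wrap the text-to-mel-spectrogram forward pass only, so that vocoder time is excluded from both measurements.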

Figure 5 shows spectrograms of the final spoken voice estimated from the speech component codes extracted by the first network and by the second network. The two mel-spectrograms are similar, which indicates that the speech components of the first network and the second network can be substituted for each other.

The apparatus described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but those of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A speech synthesis method performed by a speech synthesis system, the method comprising:
receiving a spoken voice as input to a non-autoregressive speech synthesis model consisting of a first network for separating speech components and a second network for estimating speech components from text; and
estimating a final spoken voice through speech synthesis of the spoken voice using the non-autoregressive speech synthesis model,
wherein the non-autoregressive speech synthesis model is configured such that
the first network separates speech components from the spoken voice through unsupervised learning, the second network generates the separated speech components from text, and the final spoken voice is estimated through speech synthesis in a non-autoregressive manner, and
such that the first network separates, from the spoken voice through unsupervised learning, components for prosody, including speaker, rhythm, and pitch, and for content, the second network estimates the component for the separated content from the text, and a decoder generates a plurality of spectrograms based on the parallel structure of the first network and the second network.

delete

The speech synthesis method of claim 1,
wherein the first network
comprises a first content encoder, a rhythm encoder, and a pitch encoder for separating the speech components.

The speech synthesis method of claim 3,
wherein, in the first network, latent vectors representing a content code, a speaker code, a rhythm code, and a pitch code are extracted through the first content encoder, the rhythm encoder, and the pitch encoder included in the first network, and the extracted latent vectors are used as input data of a decoder to estimate the spectrogram of the final spoken voice.

The speech synthesis method of claim 1,
wherein the second network
comprises a second content encoder and the first content encoder trained in the first network.

The speech synthesis method of claim 5,
wherein, in the second network, text is converted into linguistic features including content information through the second content encoder, a scalar value corresponding to the duration of each text token is estimated for the converted linguistic features through a duration predictor and a length regulator, and the length of the linguistic feature corresponding to each text token is adjusted based on the estimated scalar value.
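The length adjustment recited above can be illustrated with a minimal sketch of a length regulator. It assumes the common approach of repeating each token's feature vector a number of times equal to its rounded predicted duration; the claim itself does not specify the adjustment rule, and the function name and the sample values are hypothetical.

```python
def length_regulate(features, durations):
    """Expand token-level features to frame level.

    Each feature vector is repeated round(duration) times (an assumed rule;
    negative results are clamped to zero repeats).
    """
    if len(features) != len(durations):
        raise ValueError("one duration is required per text token")
    expanded = []
    for vec, dur in zip(features, durations):
        repeats = max(0, round(dur))
        expanded.extend(list(vec) for _ in range(repeats))
    return expanded


# Hypothetical linguistic features for three text tokens (2 dimensions each)
# and the scalar durations predicted for them, in frames.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2.0, 1.2, 3.0]
frames = length_regulate(features, durations)
print(len(frames))  # 2 + 1 + 3 = 6 frames
```

Expanding all tokens in one pass, rather than emitting frames one at a time, is what allows the frame-level sequence to be processed in parallel downstream.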

A speech synthesis method performed by a speech synthesis system, the method comprising:
receiving a spoken voice as input to a non-autoregressive speech synthesis model consisting of a first network for separating speech components and a second network for estimating speech components from text; and
estimating a final spoken voice through speech synthesis of the spoken voice using the non-autoregressive speech synthesis model,
wherein the non-autoregressive speech synthesis model is configured such that
the first network separates speech components from the spoken voice through unsupervised learning, the second network generates the separated speech components from text, and the final spoken voice is estimated through speech synthesis in a non-autoregressive manner, and
wherein the second network
comprises a second content encoder and a first content encoder trained in the first network; in the second network, text is converted into linguistic features including content information through the second content encoder; a scalar value corresponding to the duration of each text token is estimated for the converted linguistic features through a duration predictor and a length regulator; the length of the linguistic feature corresponding to each text token is adjusted based on the estimated scalar value; and, as the converted linguistic features in the second network are input to the first content encoder trained in the first network, content information is extracted through random resampling.

The speech synthesis method of claim 7,
wherein the content encoder trained in the first network extracts a content code, and
the spectrogram of the spoken voice is estimated from the extracted content code through the second network.

The speech synthesis method of claim 1,
wherein the decoder
is trained to reconstruct the spectrogram of the final spoken voice.

The speech synthesis method of claim 6 or 7,
wherein the duration predictor
is trained without explicit labels through a length-level loss on the duration of each text token.

The speech synthesis method of claim 10,
wherein the length-level loss
is derived such that the sum of the scalar values output from the duration predictor equals the length of the final speech spectrogram.
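A length-level loss of this kind can be sketched as follows. The claim states only that the sum of the predicted durations is driven toward the spectrogram length; the absolute-difference form used here is an assumption, as are the function name and the sample values.

```python
def length_level_loss(predicted_durations, num_frames):
    """Absolute gap between the summed durations and the target frame count.

    Only the total length is supervised, so no per-token duration labels
    are needed. The L1 form is an assumed choice, not taken from the source.
    """
    return abs(sum(predicted_durations) - num_frames)


# Hypothetical durations for three text tokens and a 6-frame target spectrogram.
loss = length_level_loss([2.0, 1.5, 3.0], 6)
print(loss)  # 0.5, since the durations sum to 6.5 frames
```

In a training framework the same expression would be written with differentiable tensor operations so that the gradient reaches the duration predictor.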

A computer program stored in a non-transitory computer-readable recording medium for executing, on a speech synthesis system, the speech synthesis method of any one of claims 1 and 3 to 9.

A speech synthesis system comprising:
a voice input unit that receives a spoken voice as input to a non-autoregressive speech synthesis model consisting of a first network for separating speech components from speech and a second network for estimating speech components from text; and
a voice estimation unit that estimates a final spoken voice through speech synthesis of the spoken voice using the non-autoregressive speech synthesis model,
wherein the non-autoregressive speech synthesis model is configured such that
the first network separates speech components from the spoken voice through unsupervised learning, the second network generates the separated speech components from text, and the final spoken voice is estimated through speech synthesis in a non-autoregressive manner, and
such that the first network separates, from the spoken voice through unsupervised learning, components for prosody, including speaker, rhythm, and pitch, and for content, the second network estimates the component for the separated content from the text, and a decoder generates a plurality of spectrograms based on the parallel structure of the first network and the second network.