KR102272554B1

KR102272554B1 - Method and system of text to multiple speech

Info

Publication number: KR102272554B1
Application number: KR1020200087643A
Authority: KR
Inventors: 이수영; 이영근
Original assignee: 한국과학기술원
Priority date: 2018-05-29
Filing date: 2020-07-15
Publication date: 2021-07-05
Also published as: KR20200088263A

Abstract

The present invention relates to a text-to-multiple-speech conversion method and system. More specifically, the method comprises generating character embedding vectors, character by character, for a sentence to be synthesized as speech, generating a speech spectrum from the embedding vectors, and outputting the speech spectrum as speech, wherein the step of generating the speech spectrum receives multi-voice data, converts it into a voice embedding vector, and uses the voice embedding vector for merging the speech spectrum.

Description

Text-to-speech conversion method and system {Method and system of text to multiple speech}

The present invention relates to a text-to-multiple-speech conversion method and system.

Speech is the most natural means of human communication and of conveying information; as the means by which language is realized, it is meaningful sound produced by humans.

Attempts to realize speech-based communication between humans and machines have advanced steadily over time. In recent years, the field of speech information technology (SIT), which aims to process speech information effectively, has made remarkable progress and is increasingly being applied in everyday life.

Speech information technology can be broadly divided into categories such as speech recognition, speech synthesis, speaker identification and verification, and speech coding.

Speech recognition is a technology that recognizes spoken speech and converts it into a character string. Speech synthesis is a technology that converts a character string back into speech using data or parameters obtained from speech analysis. Speaker identification and verification is a technology that estimates or authenticates the speaker from spoken speech, and speech coding is a technology that efficiently compresses and encodes a speech signal.

Looking at the development of speech synthesis technology, most early speech synthesizers adopted a structure that mimicked the human vocal organs using mechanical devices or electronic circuits. For example, in the 18th century, Wolfgang von Kempelen devised a bellows-driven speech synthesis machine with a rubber mouth and nostrils that could mimic changes in the vocal tract. Speech synthesis later developed into a technology based on electrical analysis methods, and in the 1930s Dudley introduced an early form of the vocoder.

Today, thanks to the rapid development of computers, computer-based speech synthesis has become the mainstream, and various approaches have been developed, including system-model approaches (such as articulatory synthesis) and signal-model approaches (such as rule-based formant synthesis or unit concatenation synthesis).

Speech synthesis technology can be broadly divided into two types according to its application: limited-vocabulary synthesis, or automatic response systems (ARS; Automatic Response System), which synthesize only sentences with a limited vocabulary and syntactic structure, and unlimited-vocabulary synthesis, or text-to-speech (TTS; Text-to-Speech) systems, which receive arbitrary sentences and synthesize speech.

Of these, a text-to-speech (TTS) system generates speech for an arbitrary sentence using small synthesis units of speech together with language processing. Language processing maps the input sentence to a combination of suitable synthesis units, and suitable intonation and duration are extracted from the sentence to determine the prosody of the synthesized sound. Because speech is synthesized by combining basic units of language such as phonemes and syllables, there is no limit on the vocabulary that can be synthesized, and the approach is mainly applied to TTS (Text-to-Speech) devices and CTS (Context-to-Speech) devices.

Conventional speech synthesis technology is inefficient because a large amount of data must be collected to generate a new person's voice, and time is needed to train a model on that data.

An object of the present invention is to provide a text-to-multiple-speech conversion method and system that, when implementing a plurality of voices in an existing text-to-speech conversion method and system, reduces the time required to implement the voices and shortens the preliminary data collection stage needed to synthesize the plurality of voices.

To achieve the above object, a text-to-multiple-speech conversion method according to an embodiment of the present invention includes generating character embedding vectors, character by character, for a sentence to be synthesized as speech, generating a speech spectrum from the embedding vectors, and outputting the speech spectrum as speech, wherein the step of generating the speech spectrum may receive multi-voice data, convert it into a voice embedding vector, and use the voice embedding vector for merging the speech spectrum.

According to an embodiment of the present invention, the multi-voice data may be a speaker-id in the form of a one-hot vector.

According to an embodiment of the present invention, the multi-voice data may be data in the form of a log-mel-spectrogram.

According to an embodiment of the present invention, the log-mel-spectrogram may be converted into a voice embedding vector through an embedder network.

According to an embodiment of the present invention, the voice embedding vector may be input to the decoder RNN and the attention RNN of the spectrum merging unit.

To achieve the above object, a text-to-multiple-speech conversion system according to an embodiment of the present invention includes a text-to-multiple-speech conversion module that outputs, from text, speech of that text. The text-to-speech conversion module includes an encoder that generates character embedding vectors, character by character, for a sentence to be synthesized as speech, a decoder that generates a speech spectrum from the embedding vectors, and a speech output unit that outputs the speech spectrum as speech, wherein the decoder may receive multi-voice data, convert it into a voice embedding vector, and use the voice embedding vector for merging the speech spectrum.

According to an embodiment of the present invention, the voice embedding vector may be input to the decoder RNN and the attention RNN of the spectrum merging unit.

The text-to-speech conversion method of the present invention can implement a new voice with only a small amount of data and can perform text-to-speech conversion without separately training a model. In addition, by receiving IDs corresponding to a plurality of voices together with the text information, it can shorten the time required for speech conversion of multi-party conversations and the like.

Furthermore, because speech from multiple speakers is synthesized from only a small amount of data by representing the data for the plurality of voices as log-mel-spectrograms, the model training time can be reduced, and the time and cost spent on preparing data in advance can be decreased.

FIG. 1 is a schematic diagram illustrating a general text-to-speech conversion method.
FIG. 2 is a schematic diagram illustrating a gated recurrent unit (GRU) according to an embodiment of the present invention.
FIG. 3 is a flowchart of a text-to-multiple-speech conversion method according to an embodiment of the present invention.
FIG. 4 is a block diagram of an RNN encoder-decoder network according to an embodiment of the present invention.
FIG. 5 is a block diagram of an RNN encoder-decoder network with a content-based attention mechanism according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of Tacotron.
FIG. 7 is a graph showing an ideal attention alignment and the problems of an actual attention alignment.
FIG. 8 is a schematic diagram of a CBHG module.
FIG. 9 is a graph illustrating parameters of Tacotron according to an embodiment of the present invention.
FIG. 10 is a schematic diagram of a text-to-multiple-speech conversion Tacotron according to an embodiment of the present invention.
FIG. 11 is a schematic diagram of a voice embedding prediction network according to an embodiment of the present invention.
FIG. 12 is a table showing voice profiles in a test set according to an embodiment of the present invention.
FIG. 13 is a parameter table of a multi-voice Tacotron according to an embodiment of the present invention.
FIG. 14 is an example of a generated speech spectrogram according to an embodiment of the present invention.
FIG. 15 is a graph illustrating voice embedding results according to theoretical elements according to an embodiment of the present invention.
FIG. 16 is a graph illustrating spectrograms according to changes in the theoretical elements according to an embodiment of the present invention.
FIG. 17 is a graph illustrating a spectrogram produced by mixed voices according to an embodiment of the present invention.

A text-to-speech conversion method according to an embodiment of the present invention receives text input and synthesizes and outputs spoken speech of its content.

That is, when the text of the content to be uttered is input, the method outputs a speech signal that reads the text aloud.

Hereinafter, the text-to-speech conversion method according to an embodiment of the present invention is described in more detail, step by step, with reference to the drawings.

The text-to-speech conversion method according to an embodiment of the present invention includes generating, from text, a speech spectrum corresponding to each sub-band.

Because the method divides the text by band and generates a separate speech spectrum for each band-divided signal, it can significantly reduce the amount of computation compared with the conventional approach of generating a single speech spectrum for the input text without distinguishing bands, and it has the advantage of solving a difficult problem in a divide-and-conquer manner.

According to an embodiment of the present invention, in the step of generating, from text, a speech spectrum corresponding to each sub-band, the speech spectra for the respective sub-bands may be generated simultaneously in parallel.

In the step of generating the speech spectrum, the signal for each band may be computed using a different method or system, and the methods or systems for computing the per-band signals may be independent of one another. Accordingly, a plurality of text-to-speech systems may be used in this step to generate the speech spectrum corresponding to each per-band signal, and the Tacotron or WaveNet algorithm may be used as the text-to-speech system.

Here, Tacotron generates a speech spectrum from text as follows.

Tacotron is a 'sequence-to-sequence' model that uses a recurrent neural network (RNN) encoder-decoder. It can be divided into an encoder part, which extracts the necessary information from the text, and a decoder part, which synthesizes speech from the encoded text.

In the encoder part, the input to the encoder network is a character embedding, obtained by decomposing the sentence into characters and converting them into vectors, and the network outputs text encoding vectors after passing this input through a neural network.

As the neural network, a CBHG module may be used, that is, a neural network in which a convolutional neural network, a highway network, and a bi-directional recurrent neural network are stacked in that order.

In the decoder part, the input to the decoder network at time step t is the combination of a weighted sum of the text encoding vectors and the last decoder output of the previous time step t-1. The decoder output is a mel-scale spectrogram, and r vectors are produced at each time step. Only the last of the r vectors is used as the decoder input for the next time step. The mel-scale spectrogram vectors, generated r at a time, are concatenated along the decoder time-step direction to form the mel-scale spectrogram of the entire synthesized speech, and this spectrogram is converted into a linear-scale spectrogram by an additional neural network. The linear-scale spectrogram is then converted into a waveform by the 'Griffin-Lim reconstruction' algorithm, and writing the waveform to a '~.wav' file produces an audio file.
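The decoder loop described above, in which r frames are emitted per time step and only the last frame is fed back, can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: `toy_decoder_step` is a hypothetical stand-in for the real decoder RNN, the all-zero initial frame is an assumption, and the context vectors are given the same length as a mel frame purely to keep the toy arithmetic simple.

```python
def toy_decoder_step(context, prev_frame, r):
    """Hypothetical stand-in for the decoder network.

    Takes the attention context (weighted sum of text encoding vectors) and
    the last frame of the previous step, and returns r mel frames.
    """
    frames = []
    for k in range(r):
        # Any deterministic function works for this illustration.
        frames.append([0.5 * c + 0.5 * p + 0.01 * k
                       for c, p in zip(context, prev_frame)])
    return frames


def synthesize_mel(contexts, r, n_mels=80):
    """Run one decoder step per context vector and stack the frames in time."""
    prev_frame = [0.0] * n_mels      # all-zero initial frame (assumption)
    mel = []                         # list of frames, each of length n_mels
    for context in contexts:
        frames = toy_decoder_step(context, prev_frame, r)
        mel.extend(frames)           # concatenate along the time axis
        prev_frame = frames[-1]      # only the last of the r frames is fed back
    return mel


contexts = [[0.1] * 80 for _ in range(4)]   # 4 decoder time steps
mel = synthesize_mel(contexts, r=3)
print(len(mel), len(mel[0]))                # 12 frames of 80 values each
```

With 4 decoder steps and r = 3, the stacked spectrogram has 12 frames, which shows why emitting r frames per step reduces the number of recurrent steps by a factor of r.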

The step of generating, from text, a speech spectrum corresponding to each sub-band according to an embodiment of the present invention may be implemented using Tacotron.

That is, the output of the Tacotron decoder is r spectrogram vectors. Viewed as a matrix, the output has the shape s × r, where s is the size of each spectrogram vector; for a mel-scale spectrogram, an 80-dimensional vector may be used.

The speech spectrum is converted into a linear spectrum, the linear spectrum is converted into a waveform, and the result is finally output as speech.

Meanwhile, the step of generating the speech spectrum may include generating the speech spectrum under at least one utterance condition among the speaker's timbre, age, gender, and emotion.

When such an utterance condition is added in this step, a speech signal reflecting the condition may be generated.

The step of outputting the merged spectrum as speech may include converting the speech spectrum into a linear spectrum and converting the linear spectrum into a waveform.

In addition, the present invention provides a text-to-speech conversion system including a text-to-speech conversion module that generates a signal for each sub-band from text and outputs speech, wherein the text-to-speech conversion module includes:

a text-to-speech conversion unit that generates, from text, a speech spectrum corresponding to each sub-band;

a spectrum merging unit that merges the speech spectra corresponding to the respective bands generated by the text-to-speech conversion unit; and

a speech output unit that outputs the spectrum merged by the spectrum merging unit as speech.

A text-to-speech conversion system according to an embodiment of the present invention receives text input and synthesizes and outputs spoken speech of its content.

That is, when the text of the content to be uttered is input, the system outputs a speech signal that reads the text aloud.

The text-to-speech conversion system according to an embodiment of the present invention synthesizes speech using an artificial neural network; it converts text into speech by generating the speech signal to be synthesized in one or more separate bands and then combining them.

The text-to-speech conversion system according to an embodiment of the present invention includes a text-to-speech conversion module that outputs speech by computing on the per-band signals generated by the band division module.

According to an embodiment of the present invention, the sub-bands in the text-to-speech conversion module may include sub-bands whose length is shorter than that of the full band.

According to an embodiment of the present invention, the speech waveform length of a sub-band may be one half or less of the speech waveform length of the full band.

According to an embodiment of the present invention, when the original speech waveform has a length of 100, applying a wavelet transform with four sub-bands may divide it into four sub-band signals with lengths of {50, 25, 12.5, 12.5}.
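The lengths {50, 25, 12.5, 12.5} are what one would expect from a dyadic (octave-band) decomposition, in which each level keeps one half-length band and halves the remaining branch again. The text does not name the wavelet or the decomposition scheme, so the sketch below is an assumption that merely reproduces the arithmetic; `subband_lengths` is a hypothetical helper, and the fractional lengths are idealized (a real transform would round them and add filter-dependent padding).

```python
def subband_lengths(total_length, num_bands):
    """Idealized signal lengths for a dyadic decomposition into num_bands bands."""
    lengths = []
    remaining = float(total_length)
    for _ in range(num_bands - 1):
        remaining /= 2              # each level halves the remaining branch
        lengths.append(remaining)   # one band of this length is kept at this level
    lengths.append(remaining)       # the final branch has the same length as the last band
    return lengths


print(subband_lengths(100, 4))      # [50.0, 25.0, 12.5, 12.5]
```

The lengths always sum to the original length, and every sub-band is at most half as long as the full-band waveform, consistent with the one-half bound stated above.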

Here, the text-to-speech conversion module includes a text-to-speech conversion unit that generates, from text, a speech spectrum corresponding to each sub-band.

According to an embodiment of the present invention, in generating, from text, a speech spectrum corresponding to each sub-band, the speech spectra for the respective sub-bands may be generated simultaneously in parallel.

The text-to-speech conversion unit generates the corresponding speech spectrum for each band.

Here, the speech spectrum for each band may be generated using a different method or system, and these methods or systems may be independent of one another. Accordingly, the text-to-speech conversion unit may use a plurality of text-to-speech systems to generate the speech spectrum corresponding to each per-band signal, and the Tacotron or WaveNet algorithm may be used as the text-to-speech system.

In addition, the text-to-speech conversion module may include a spectrum merging unit that merges the speech spectra corresponding to the respective bands generated by the text-to-speech conversion unit, and may further include a speech output unit that outputs the spectrum merged by the spectrum merging unit as speech.

Here, the speech output unit may convert the merged speech spectrum into a linear spectrum and convert the linear spectrum into a waveform to output the final speech.

FIG. 1 is a schematic diagram illustrating a voice user interface.

Referring to FIG. 1, the configuration of an existing voice user interface can be seen. Speech is one of the most basic and efficient human communication tools. In everyday life, people exchange information through speech, for example in conversations and presentations. Because speech-based communication is intuitive and convenient for users, some devices use a voice user interface that interacts with the user through speech. A simple way to implement a spoken response in a voice user interface is audio recording. However, a recording can only say what was recorded. Because the device cannot respond to situations that were not prepared for, it becomes less flexible to use. For example, artificial intelligence (AI) agents such as Apple Siri and Amazon Alexa must be able to speak a wide variety of sentences, since a user's query can be arbitrary. Recording every possible response for such applications would burden a company in both time and cost. Many researchers have therefore tried to build natural and fast speech synthesis models. In this sense, speech synthesis, also called text-to-speech (TTS), is an important topic because it can generate arbitrary speech.

Speech synthesis technology can provide more value to AI agents. Recently, with advances in machine learning, demand for conversational agents and AI assistants has been growing. The framework of a conversational model is formulated as a series of modules, as depicted in FIG. 1. For such human-like agents, not only the back-end algorithm of the dialogue system but also the front-end response of the system is important. People form their first impression of an agent from its appearance and voice. An agent can benefit from improved naturalness of the speech synthesis model. In this respect, if each AI agent could speak arbitrary sentences in its own voice, the agents would appear more natural and friendly to users.

FIG. 2 is a schematic diagram illustrating a gated recurrent unit (GRU) according to an embodiment of the present invention.

Referring to FIG. 2, mapping one sequence to another sequence (not necessarily of the same type) is a widespread problem. For example, speech recognition must map a sequence of waveform samples to a sequence of characters, and speech synthesis must perform the inverse mapping. Machine translation is likewise a mapping from text to text in another language. To solve a sequence-to-sequence problem, one sequence must be generated from another, and the model must know how to align the elements of the two sequences. Long-term dependencies must be taken into account so that contextual information is included in the mapping. One of the best choices for implementing this kind of network is a recurrent neural network (RNN). An RNN receives sequential input and can accumulate contextual information in its hidden state. In particular, LSTM and GRU have been shown to retain relevant context information efficiently using their own gating mechanisms (GRU, GRUEMP, LSTM). Specifically, a GRU with reset gate r and update gate z can be described by Equation 1 below.

[Equation 1]

Here, x, h, t, and j are the input, the hidden state, the time-step index, and the hidden-dimension index, respectively. Applying the sigmoid function σ as the nonlinearity for the gating variables forces them to take values in the interval (0, 1). FIG. 2 shows the architecture of the GRU. In an embodiment of the present invention, the GRU may be used as the basic unit of the RNN.
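Equation 1 itself is not reproduced in this text. As a rough guide, the sketch below implements one step of the widely used GRU formulation with reset gate r and update gate z, which matches the symbols named above; it should not be read as the exact equation of the patent. Biases are omitted, the weight names (`W_r`, `U_r`, `W_z`, `U_z`, `W`, `U`) are assumptions, and the opposite convention for the update gate (swapping z and 1 - z) also appears in the literature.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU time step (biases omitted).

    r_j  = sigmoid([W_r x]_j + [U_r h_prev]_j)       reset gate
    z_j  = sigmoid([W_z x]_j + [U_z h_prev]_j)       update gate
    c_j  = tanh([W x]_j + [U (r * h_prev)]_j)        candidate state
    h_j  = z_j * h_prev_j + (1 - z_j) * c_j          new hidden state
    """
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]
    gated = [rj * hj for rj, hj in zip(r, h_prev)]   # reset gate scales the old state
    cand = [math.tanh(a + b) for a, b in zip(matvec(W, x), matvec(U, gated))]
    return [zj * hj + (1.0 - zj) * cj for zj, hj, cj in zip(z, h_prev, cand)]


# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is simply half of the previous state.
zeros_in = [[0.0], [0.0]]                 # 2 hidden units, 1 input feature
zeros_hh = [[0.0, 0.0], [0.0, 0.0]]
h = gru_step([0.3], [1.0, -2.0], zeros_in, zeros_hh, zeros_in, zeros_hh, zeros_in, zeros_hh)
print(h)                                  # [0.5, -1.0]
```

The zero-weight case makes the role of the sigmoid visible: each gate lies strictly between 0 and 1, so the new state is always a blend of the previous state and the candidate state.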

FIG. 3 is a flowchart of a text-to-multiple-speech conversion method according to an embodiment of the present invention.

Referring to FIG. 3, the text-to-multiple-speech conversion method according to an embodiment of the present invention may include dividing the entire frequency band in which speech is to be generated into a plurality of sub-bands (S110), generating, from text, a speech spectrum corresponding to each sub-band (S120), merging the speech spectra corresponding to the respective bands (S130), and outputting the merged spectrum as speech (S140).

In step S130, multi-voice data may be received and converted into a voice embedding vector, which is used for merging the speech spectra.

The multi-voice data may be a speaker-id in the form of a one-hot vector.

The multi-voice data may be data in the form of a log-mel-spectrogram.

The log-mel-spectrogram may be converted into a voice embedding vector through an embedder network.

The voice embedding vector may be input to the decoder RNN and the attention RNN of the spectrum merging unit.

FIG. 4 is a block diagram of an RNN encoder-decoder network according to an embodiment of the present invention.

Referring to FIG. 4, a sequence-to-sequence network may be formed using the ability of RNNs to model long-term dependencies. In FIG. 2, the last hidden state h_T of the encoder RNN may contain summary information about the input sequence. The decoder RNN may then take this context vector and generate an output. The generated output may then be used as input at the next time step, together with the context vector h_T. A disadvantage of this architecture is that the context vector h_T is fixed for every time step of the decoder RNN. This is not a problem as long as the context vector preserves all the information in the input sequence. In many cases, however, when the sequence is longer than the fixed capacity of the context vector allows, the context vector cannot contain all the information in the input sequence. This problem can be solved by an attention mechanism.

A conventional attention mechanism called the soft window multiplies parts of the input sequence by attention weights. The attention weights can be computed using a mixture of K Gaussian functions, where the fixed number K is the number of mixture components. Because a Gaussian form is assumed in computing the attention weights, the shape of the attention weights cannot be arbitrarily flexible.
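To make the Gaussian-mixture window concrete, the following is a minimal sketch of how such weights might be computed over input positions. The parameter names alpha (component importance), beta (inverse width), and kappa (window location) are assumptions borrowed from common descriptions of soft-window attention and do not appear in this document.

```python
import numpy as np

def soft_window_weights(alpha, beta, kappa, seq_len):
    """Attention weight for each input position u = 0..seq_len-1.

    alpha, beta, kappa: arrays of shape (K,), one entry per Gaussian component
    (hypothetical names; K is the fixed number of mixture components).
    """
    u = np.arange(seq_len)[None, :]                      # (1, seq_len)
    components = alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)
    return components.sum(axis=0)                        # (seq_len,)

alpha = np.array([1.0, 0.5])
beta = np.array([0.5, 0.1])
kappa = np.array([2.0, 6.0])
w = soft_window_weights(alpha, beta, kappa, seq_len=10)
```

Whatever the parameter values, the weight profile is always a sum of K bell curves, which is the lack of shape flexibility noted above.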

FIG. 5 is a block diagram of an RNN encoder-decoder network with a content-based attention mechanism according to an embodiment of the present invention.

Referring to FIG. 5, a content-based attention mechanism can be seen. The context vector of the network may be computed as a weighted sum of the hidden states of the encoder. Because the context vector can change flexibly at each decoder time step, the hidden state of the encoder does not need to remember all previous information. The attention weight vector may be computed by an alignment model that is itself modeled by a neural network. An RNN encoder-decoder network with a content-based attention mechanism that takes an input sequence of length T_enc may be defined as in Equation 2 below:

[Equation 2]

Here, c is the context vector, and d and h are the hidden states of the decoder and the encoder, respectively. Because a bidirectional RNN is used for the encoder, the encoder hidden state h_i is defined by concatenating the forward and backward hidden states. The matrices W_h and W_d and the bias vector b are learnable parameters. The softmax function normalizes its input vector so that the elements of the resulting vector sum to 1. The context vector therefore has the same size as an encoder hidden state. Finally, the context vector c and the decoder hidden state of the previous time step, d_{t-1}, are concatenated to form the input of the decoder RNN.
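The equation image is missing from this text, so the following is only a sketch of a content-based attention step in the commonly used additive form, written to match the symbols named above (h, d, W_h, W_d, b, c). The projection vector v and the tanh scoring function are assumptions, not taken from this document.

```python
import numpy as np

def softmax(e):
    e = e - e.max()            # subtract the maximum for numerical stability
    w = np.exp(e)
    return w / w.sum()

def content_attention(H, d_prev, W_h, W_d, b, v):
    """H: encoder hidden states (T_enc, n_h); d_prev: decoder state d_{t-1} (n_d,).

    v is a hypothetical projection vector that turns each hidden-state score
    into a scalar; it is not named in the text.
    """
    scores = np.tanh(H @ W_h.T + d_prev @ W_d.T + b) @ v   # (T_enc,)
    alpha = softmax(scores)                                # attention weights, sum to 1
    context = alpha @ H                                    # weighted sum of encoder states
    return alpha, context

rng = np.random.default_rng(0)
T_enc, n_h, n_d, n_a = 6, 8, 5, 4
H = rng.normal(size=(T_enc, n_h))
d_prev = rng.normal(size=n_d)
alpha, c = content_attention(
    H, d_prev,
    rng.normal(size=(n_a, n_h)), rng.normal(size=(n_a, n_d)),
    np.zeros(n_a), rng.normal(size=n_a),
)
```

Since the weights sum to 1, the context vector is a convex combination of the encoder hidden states and has the same dimension as each of them.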

A typical TTS system may consist of two main parts: a text encoding part and a speech generation part. Using prior knowledge of the target language, the system can define useful linguistic features and extract them from the input text. This process is called the text encoding part, and many natural language processing techniques can be used in it. For example, a grapheme-to-phoneme model may be applied to the input text to obtain a phoneme sequence, and a part-of-speech tagger may be applied to obtain part-of-speech information. In this way, the text encoding part takes text input and outputs various linguistic features. The speech generation part that follows takes the linguistic features and generates the speech waveform. Two common approaches to the speech generation part are concatenative TTS and parametric TTS. Concatenative TTS generates speech by concatenating short units of speech. The short units are at the scale of phonemes or sub-phonemes, and their order may be determined by a state transition network. This process may be similar to HMM-based speech recognition. Parametric TTS, on the other hand, may use a generative model and a vocoder. The generative model learns parameters for modeling the joint probability of the vocoder parameters o, the linguistic features l, and the model parameters. In the training stage, parametric TTS may learn the model parameters as described in Equation 3 below.

[Equation 3]

Vocoder parameters may consist of speech-related features such as mel-frequency cepstral coefficients (MFCC), fundamental frequency (F₀), and aperiodicity. A pre-trained feature extractor may be used to obtain the vocoder parameters. The linguistic features are available from the output of the text encoding part. In the generation stage, the vocoder parameters can be obtained from the linguistic features and the model parameters, as in Equation 4 below.

[Equation 4]

The obtained vocoder parameters are then processed by a vocoder such as Vocaine or WORLD, which finally yields the speech waveform.
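The images for Equations 3 and 4 are missing from this text. As a hedged illustration only, parametric TTS is conventionally described by a maximum-likelihood training step followed by a most-probable-output generation step. In the sketch below, the symbol \theta for the model parameters is a stand-in (the original symbol is not recoverable), and the conditional form is an assumption based on the usual formulation rather than on this document.

```latex
% Training stage (cf. Equation 3): fit model parameters to paired data (o, l).
\hat{\theta} = \arg\max_{\theta} \; p(o \mid l, \theta)

% Generation stage (cf. Equation 4): predict vocoder parameters for new linguistic features l.
\hat{o} = \arg\max_{o} \; p(o \mid l, \hat{\theta})
```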


FIG. 6 is a schematic diagram of the Tacotron.

Referring to FIG. 6, the Tacotron is an end-to-end TTS model composed of three modules: an encoder, a decoder, and a post-processor. Although the Tacotron has a more complex structure than a vanilla sequence-to-sequence model, it basically follows the sequence-to-sequence framework, using an attention mechanism to convert a character sequence into the corresponding waveform. More specifically, the encoder may take a character sequence as input and generate a text encoding sequence of the same length as the character sequence. The decoder may generate a mel-scale spectrogram in an autoregressive manner: the decoder output from the previous time step is used as the decoder input at the current time step. In the decoder module, the attention RNN may predict the attention alignment as in an ordinary sequence-to-sequence model. Combining the attention alignment with the text encoding gives the context vector, and the decoder RNN may take the context vector and the output of the attention RNN as input. The decoder RNN predicts the mel-scale spectrogram, and the post-processor module may then generate a linear-scale spectrogram from the mel-scale spectrogram. Because the linear-scale spectrogram itself contains no phase information, it cannot be converted directly back into a waveform. The Griffin-Lim reconstruction algorithm can estimate the phase of the linear-scale spectrogram by iteratively applying the inverse and forward short-time Fourier transforms. The Tacotron may use Griffin-Lim reconstruction at the last stage of the post-processor module to obtain a waveform from the linear-scale spectrogram.

An important network architecture used in the Tacotron may be the CBHG module. CBHG may comprise a convolution bank, a highway network, and a bidirectional gated recurrent unit (GRU). The GRU is a type of RNN. Conceptually, the convolutional layers of CBHG capture local patterns, while the bidirectional GRU can capture long-term dependencies in a given sequence. To encode the features of a sequence, CBHG is used in the encoder and in the post-processor of the Tacotron.

For training, the model has two targets in its objective function: (1) the mel-scale spectrogram target Y_mel and (2) the linear-scale spectrogram target Y_linear. The L1 distance between the predicted mel-scale spectrogram and Y_mel and the L1 distance between the predicted linear-scale spectrogram and Y_linear may be added to compute the objective function, Equation 5, as follows.

[Equation 5]

Here, the two quantities compared in each term are the output of the Tacotron and the corresponding ground-truth spectrogram.
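As a rough illustration of the two-term objective described above, the sketch below sums the L1 distances for the mel-scale and linear-scale spectrograms. The function and argument names are hypothetical, and whether the original objective sums or averages over frames and frequency bins is not recoverable from this text; a plain sum is assumed.

```python
import numpy as np

def tacotron_l1_loss(pred_mel, target_mel, pred_linear, target_linear):
    """Sum of the L1 distances for the mel-scale and linear-scale spectrograms.

    All arguments are arrays of shape (frames, bins); names are hypothetical.
    """
    mel_term = np.abs(pred_mel - target_mel).sum()
    linear_term = np.abs(pred_linear - target_linear).sum()
    return mel_term + linear_term

loss = tacotron_l1_loss(
    np.ones((2, 3)), np.zeros((2, 3)),        # mel term: 6 entries, each off by 1
    np.zeros((2, 4)), np.full((2, 4), 0.5),   # linear term: 8 entries, each off by 0.5
)
```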

FIG. 7 shows graphs of an ideal attention alignment and of the problem observed in actual attention alignments.

Referring to FIG. 7, when long utterances were generated, the attention alignment showed irregularities in the middle part of the utterance. In addition, comparing attention alignments with the corresponding generated speech revealed a correlation between the sharpness of the attention alignment and the quality of the generated speech. Intuitively, because the Tacotron decoder generates the spectrogram from the attended context vector, a blurred attention should lead to an unclearly predicted spectrogram. In an embodiment of the present invention, the flow of information in the Tacotron was examined in order to improve attention alignment prediction. In FIG. 6, the attention module of the Tacotron may receive input from two sources: the current hidden state of the attention RNN and the text encoding from the encoder. Based on these two sources, the following improvements may be made to the prediction of the attention alignment.

Because pronouncing a single character generally requires more than one spectrogram frame, the attention part of the existing Tacotron model often attends to a similar part of the text input over several decoder time steps. Even when the model must change the attention weights, the next part to attend to is likely to be adjacent to the currently attended part of the text. Therefore, when determining the attention weights, the model can benefit from information about the previously attended text c_{t-1}, which is a weighted sum of the text encodings. However, the existing Tacotron does not utilize the information in c_{t-1}. The attention RNN can only use the spectrogram of the previous time step, which contains little information about c_{t-1}. Based on this, c_{t-1} is concatenated to the input x_t of the attention RNN. The attention RNN may thus take one additional input, c_{t-1}, as in Equation 6 below.

[Equation 6]

Here, x_t and h_t are the input and the hidden state of the attention RNN at time step t, respectively.
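The context insertion described above amounts to concatenating the previous context vector c_{t-1} to the attention RNN input x_t. The following is a minimal sketch under stated assumptions: a plain tanh recurrent cell stands in for the GRU, and the weight names W and U are hypothetical.

```python
import numpy as np

def attention_rnn_step(x_t, c_prev, h_prev, W, U):
    """One attention-RNN step with context insertion.

    x_t: input at step t; c_prev: previous context vector c_{t-1};
    h_prev: previous hidden state. W and U are hypothetical weights of a
    simple tanh cell used here in place of a GRU.
    """
    rnn_in = np.concatenate([x_t, c_prev])   # the extra input is simply appended
    return np.tanh(W @ rnn_in + U @ h_prev)

rng = np.random.default_rng(0)
n_x, n_c, n_h = 5, 8, 6
W = rng.normal(size=(n_h, n_x + n_c))        # input width grows by the context size
U = rng.normal(size=(n_h, n_h))
h_t = attention_rnn_step(rng.normal(size=n_x), rng.normal(size=n_c), np.zeros(n_h), W, U)
```

The only structural change relative to the baseline is that the input weight matrix widens from n_x to n_x + n_c columns.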

FIG. 8 is a schematic diagram of the CBHG module.

Referring to FIG. 8, the second idea for improving attention prediction is to change the way the text input is encoded, since the encoded text is also a source of information for determining the attention alignment. The encoding may be generated by the CBHG module, as shown in FIG. 8. CBHG may include a convolutional filter bank, each filter of which explicitly extracts local and contextual information. The final stage of CBHG may include a bidirectional RNN that can capture long-term dependencies in the text input. As the RNN reads its input sequentially, long-term dependencies accumulate in its hidden state. The problem is that the size of the hidden state is fixed. If the sequence is long enough, the hidden state cannot contain all the information in the sequence. Moreover, the bidirectional RNN of CBHG must hold both the text input information of the current time step and the long-term dependency information, which places an additional burden on the hidden state. Referring to FIG. 7(a), when the input is long in the original Tacotron, the attention alignment may contain gaps or blurred parts in the middle of the sequence. This irregularity is attributed to the insufficient capacity of the hidden state of the bidirectional RNN in CBHG. Therefore, a residual connection is added that connects the input and output of the bidirectional RNN. The output of CBHG is changed to have an additional term x_t, as in Equation 7 below.

[Equation 7]

Here, x_t, h_t, and y_t are the input, hidden state, and output of the bidirectional RNN, respectively. Because the residual connection now carries the information of the current time step, the hidden state of the bidirectional RNN no longer needs to contain it. This connection makes the hidden state less congested and helps it encode the text information.
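The residual connection can be sketched as adding the input x_t back onto the bidirectional RNN's hidden state at every time step. The exact form of Equation 7 is not recoverable from this text, so the sketch assumes y_t = h_t + x_t, with simple tanh cells in place of GRUs and hypothetical weight names; the input width must equal the concatenated forward and backward state width for the addition to be defined.

```python
import numpy as np

def bidirectional_rnn(X, Wf, Uf, Wb, Ub):
    """Run simple tanh RNNs forward and backward over X (T, n_in); returns (T, 2n)."""
    T, n = X.shape[0], Uf.shape[0]
    fwd, bwd = np.zeros((T, n)), np.zeros((T, n))
    h = np.zeros(n)
    for t in range(T):                      # left-to-right pass
        h = np.tanh(Wf @ X[t] + Uf @ h)
        fwd[t] = h
    h = np.zeros(n)
    for t in reversed(range(T)):            # right-to-left pass
        h = np.tanh(Wb @ X[t] + Ub @ h)
        bwd[t] = h
    return np.concatenate([fwd, bwd], axis=1)

def residual_birnn_output(X, Wf, Uf, Wb, Ub):
    # Assumed form of the residual connection: y_t = h_t + x_t.
    return bidirectional_rnn(X, Wf, Uf, Wb, Ub) + X

rng = np.random.default_rng(0)
T, n = 7, 4
X = rng.normal(size=(T, 2 * n))             # input width matches the 2n-wide hidden state
Wf, Wb = rng.normal(size=(n, 2 * n)), rng.normal(size=(n, 2 * n))
Uf, Ub = rng.normal(size=(n, n)), rng.normal(size=(n, n))
Y = residual_birnn_output(X, Wf, Uf, Wb, Ub)
```

With all RNN weights set to zero the hidden state contributes nothing and the output equals the input, which shows that the current-step information travels through the skip path rather than through the hidden state.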

FIG. 9 shows graphs of attention alignment results according to respective embodiments of the present invention.

Referring to FIG. 9, four models were compared to determine which methods are effective in solving the problem. The four models are the baseline, (baseline + CI), (baseline + RC), and (baseline + CI + RC), where CI denotes context insertion and RC denotes the residual connection. The images in FIG. 9 are the attention alignments obtained while the four models generated speech for a text. Intuitively, text and speech should by nature have a monotonic alignment. An ideal attention alignment should therefore be continuous and monotonically decreasing (or increasing, depending on the direction of the y-axis of the attention alignment plot).

Context insertion and the residual connection were expected to make it easier for the network to predict the attention alignment, yielding a sharp and clear alignment. Sentences for which the baseline model had difficulty synthesizing speech were selected. In FIG. 9(a), blurred parts and discontinuities can be seen in the attention alignment of the baseline model. In contrast, the attention alignment of the proposed model (baseline + CI + RC) shows a monotonic and continuous line in FIG. 9(b). When the generated results were examined, the attention alignment of the proposed model was sharp and clean, and the quality of the speech generated by the proposed model was better than that of the baseline model. However, applying only one of the proposed methods did not produce a large improvement (see FIGS. 9(c) and 9(d)). This result of applying each approach individually may indicate that the two information flows, from the attention RNN and from the encoder output, are equally important. The model may fail to generate speech when only one of the information sources is improved.

Performance is expected to improve further if a monotonic attention mechanism and semi-teacher-forcing training are applied. By exploiting the naturally monotonic alignment between text and speech, the model can learn the attention alignment more easily than before and can devote more capacity to predicting the spectrogram. Semi-teacher-forcing training can reduce the mismatch between the input distributions of the decoder module in the training stage and the test stage. This mismatch, called exposure bias, arises because an autoregressive network takes ground-truth inputs in the training stage but takes its own outputs from the previous time step in the test stage. Scheduled sampling addresses this problem by randomly replacing ground-truth inputs with generated inputs, and semi-teacher-forcing training likewise exposes the model to generated samples during the training stage. These techniques could be applied, but they were not, because the proposed method is orthogonal to them and can be used alongside them.
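The random replacement of ground-truth inputs by generated inputs can be sketched as follows. This is an illustrative simplification: in real training the generated frames are produced step by step by the decoder, whereas here they are passed in as an array. The function name and the probability parameter p_generated are hypothetical.

```python
import numpy as np

def choose_decoder_inputs(ground_truth, generated, p_generated, rng):
    """Pick, per time step, either the ground-truth or the generated frame.

    ground_truth, generated: arrays of shape (T, n_bins).
    p_generated: probability of feeding the model its own output (hypothetical name).
    """
    use_generated = rng.random(len(ground_truth)) < p_generated   # (T,) boolean mask
    mixed = np.where(use_generated[:, None], generated, ground_truth)
    return mixed, use_generated

rng = np.random.default_rng(0)
gt = np.zeros((6, 3))      # stand-in ground-truth frames
gen = np.ones((6, 3))      # stand-in frames produced by the model
mixed, mask = choose_decoder_inputs(gt, gen, p_generated=0.5, rng=rng)
```

Setting p_generated to 0 recovers ordinary teacher forcing, and raising it over the course of training exposes the decoder to progressively more of its own outputs.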

FIG. 10 is a schematic diagram of a text-to-multiple-speech conversion Tacotron according to an embodiment of the present invention.

Referring to FIG. 10, the present invention additionally receives a speaker-id as input, besides the characters, to generate speech. A speaker-id is usually specified as an integer 0, 1, 2, …, and is fed to the actual model in the form of a one-hot vector. This one-hot vector passes through a lookup table (conceptually the same as a single linear layer) and becomes a dense voice embedding vector, which is concatenated to the decoder RNN input and the attention RNN input of the Tacotron. The output voice therefore varies depending on which speaker is given as input (see FIG. 10(a)). For the multi-voice Tacotron, it was confirmed experimentally that speech synthesis for multiple speakers is possible when about 30 minutes of data is collected per speaker and the data for all speakers combined exceeds 20 hours.
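The path from speaker-id to RNN input can be sketched in a few lines. The table values, dimensions, and variable names below are hypothetical; the sketch only shows that multiplying a one-hot vector by a weight matrix is the same as looking up one row, and that the resulting embedding is appended to the RNN input.

```python
import numpy as np

def one_hot(speaker_id, n_speakers):
    v = np.zeros(n_speakers)
    v[speaker_id] = 1.0
    return v

rng = np.random.default_rng(0)
n_speakers, emb_dim, frame_dim = 4, 3, 5
table = rng.normal(size=(n_speakers, emb_dim))   # lookup table = weights of one linear layer

speaker_id = 2
embedding = one_hot(speaker_id, n_speakers) @ table   # dense voice embedding vector

frame = rng.normal(size=frame_dim)                    # stand-in for the usual RNN input
decoder_rnn_input = np.concatenate([frame, embedding])
attention_rnn_input = np.concatenate([frame, embedding])
```

In practice the row would be indexed directly (table[speaker_id]) rather than multiplied out, which gives the same vector at lower cost.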

The speaker embedding is the only variable that changes the voice of the generated speech. In other words, a new voice can be obtained by adjusting the speaker embedding appropriately. However, it is not known what characteristic each dimension of the speaker embedding vector represents. The nonlinearity of the speaker embedding space prevents the voice from being changed intuitively by tuning the values of the speaker embedding vector. If each dimension of the speaker embedding vector could be expressed as an orthogonal feature, creating a new voice would be much easier and more intuitive. Adding a sparsity constraint can impart such orthogonality when generating the voices of multiple speakers. For this reason, a sparsity constraint was added to the training objective function. Each dimension e_d of the speaker embedding is assumed to follow a Bernoulli distribution and therefore takes a value in the closed interval [0,1]. If the random variable in the speaker embedding has a low probability of taking the value 1, the speaker embedding becomes sparse by the definition of sparsity. This property can be obtained by minimizing the Kullback-Leibler divergence between the sample mean of each speaker embedding dimension and the target sparsity parameter. The sparsity parameter generally takes a value close to zero (e.g., 0.05). Adding the Kullback-Leibler divergence term to the original objective function yields the modified objective function of Equation 8 below.

[Equation 8]

Here, D, m, and KL() denote the dimension of the voice embedding, the mini-batch size, and the Kullback-Leibler divergence function, respectively; the remaining symbol denotes the regularization coefficient.
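The sparsity term can be sketched as a Bernoulli Kullback-Leibler divergence between a target sparsity value and the mini-batch mean of each embedding dimension, summed over the D dimensions. The names rho (target sparsity) and lam (regularization coefficient) are hypothetical, and the direction of the divergence, KL(rho || sample mean), follows the usual sparse-autoencoder convention rather than anything recoverable from this text.

```python
import numpy as np

def sparsity_penalty(E, rho=0.05, eps=1e-8):
    """KL-based sparsity penalty for a mini-batch of embeddings.

    E: array of shape (m, D) with entries in [0, 1] (m = mini-batch size,
    D = embedding dimension). rho is the target sparsity (hypothetical name).
    """
    rho_hat = np.clip(E.mean(axis=0), eps, 1.0 - eps)   # sample mean per dimension
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return kl.sum()                                     # sum over the D dimensions

def total_loss(base_loss, E, lam=0.1, rho=0.05):
    # lam is a hypothetical regularization coefficient weighting the penalty.
    return base_loss + lam * sparsity_penalty(E, rho)

sparse_batch = np.full((8, 5), 0.05)   # mean activation already equals the target
dense_batch = np.full((8, 5), 0.5)     # mean activation far from the target
p_sparse = sparsity_penalty(sparse_batch)
p_dense = sparsity_penalty(dense_batch)
```

The penalty vanishes when the mean activation of every dimension matches the target and grows as the embeddings become denser.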

Using a two-dimensional visualization of points obtained by principal component analysis (PCA), it was analyzed how the learned voice embeddings form a manifold in the vector space. It was also compared how the sparsity constraint affects the learned speaker embedding vectors.
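A two-dimensional PCA projection of this kind can be sketched with a singular value decomposition. The embedding matrix below is random stand-in data, not the learned embeddings, and the function name is hypothetical.

```python
import numpy as np

def pca_2d(E):
    """Project embeddings E (n_speakers, D) onto their first two principal components."""
    centered = E - E.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T          # (n_speakers, 2) points for a scatter plot

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 16))           # stand-in for learned speaker embeddings
P = pca_2d(E)
```

Each row of P gives the plot coordinates of one speaker, with the first axis capturing at least as much variance as the second.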

Referring to FIG. 10(b), the multi-voice Tacotron can synthesize speech for multiple speakers and requires about 30 minutes of speech per speaker, which is an advance over the conventional Tacotron; however, obtaining 30 minutes of clean speech is itself laborious. Moreover, once a multi-voice Tacotron has been trained, generating the voice of a new speaker that it has not seen requires retraining the model on the new voice, which takes several hours. The model according to an embodiment of the present invention has the advantage that it needs only a very small amount of data (6 seconds in the experiments) to generate a new voice, and can generate the new voice immediately without additional training.

This model is the voice imitative TTS, and its structure is quite similar to that of the multi-voice Tacotron. The one difference is that, whereas the multi-voice Tacotron receives (text, speaker_id) as input, the voice imitative TTS receives a pair (text, log-mel-spectrogram of the voice of the speaker to be generated). The input log-mel-spectrogram may be converted into a voice embedding vector through the embedder network. The resulting voice embedding vector is fed to the decoder RNN and the attention RNN in the same way as in the multi-voice TTS.

Current multi-speaker TTS models that take a one-hot speaker ID vector as input cannot easily be extended to new voices that are absent from the training data. Because the model has learned embeddings only for the voices represented by one-hot vectors, there is no way to obtain the embedding of a new voice immediately. Generating a new voice requires retraining the entire TTS model or fine-tuning the embedding layer of the TTS model, which is a time-consuming process even on a GPU-equipped computer. The speaker embedding could be tuned by hand to obtain a desired voice, but finding the right combination of values for the speaker embedding vector would be difficult. Such an approach is not only inaccurate but also labor-intensive. To solve this problem more efficiently, a new TTS architecture may be included that can generate the voice of a new speaker immediately, without further training the TTS model or manually searching for a speaker embedding vector.

FIG. 11 is a schematic diagram of a voice embedding prediction network according to an embodiment of the present invention.

Referring to FIG. 11, a sub-network that predicts the speaker embedding vector may be added. FIG. 11 shows this sub-network, the speaker embedder, which comprises convolutional layers followed by fully connected layers. The network takes a log-mel-spectrogram as input and predicts a fixed-dimensional speaker embedding vector. Because there is no restriction on which spectrogram is used, an arbitrary spectrogram can be fed into the network, which allows the network to adapt immediately and generate the voice of a new speaker. The input spectrogram may vary in length, but the max-over-time pooling layer located at the end of the convolutional layers reduces the input to a fixed-dimensional vector of length 1 along the time axis. The speaker embedder can replace the one-hot vector input of the multi-voice Tacotron, as described for FIG. 10(b). The voice imitation model can be trained using the same objective function, Equation 5.
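The key property of the embedder, that spectrograms of any length map to an embedding of fixed size, comes from the max-over-time pooling step. The sketch below uses a single convolutional layer and a single fully connected layer with random weights; the layer counts, sizes, and names are assumptions for illustration, not the configuration of FIG. 11.

```python
import numpy as np

def conv1d_relu(S, K):
    """1-D convolution over time followed by ReLU.

    S: log-mel-spectrogram (T, n_mels); K: kernel (width, n_mels, n_channels).
    """
    width = K.shape[0]
    T_out = S.shape[0] - width + 1
    out = np.stack([
        np.tensordot(S[t:t + width], K, axes=([0, 1], [0, 1]))   # (n_channels,)
        for t in range(T_out)
    ])
    return np.maximum(out, 0.0)

def speaker_embedder(S, K, W_fc, b_fc):
    feats = conv1d_relu(S, K)        # (T_out, n_channels), length depends on the input
    pooled = feats.max(axis=0)       # max over time: (n_channels,), length-independent
    return W_fc @ pooled + b_fc      # fixed-dimensional speaker embedding

rng = np.random.default_rng(0)
n_mels, n_ch, emb_dim = 8, 6, 4
K = rng.normal(size=(3, n_mels, n_ch))
W_fc, b_fc = rng.normal(size=(emb_dim, n_ch)), np.zeros(emb_dim)

e_short = speaker_embedder(rng.normal(size=(20, n_mels)), K, W_fc, b_fc)
e_long = speaker_embedder(rng.normal(size=(57, n_mels)), K, W_fc, b_fc)
```

Both calls return a vector of the same size even though the two inputs differ in length.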

FIG. 12 is a table showing the voice profiles in the test set according to an embodiment of the present invention, and FIG. 13 is a table showing the parameters of the multi-voice Tacotron according to an embodiment of the present invention.

Text-to-multiple-speech conversion can be performed according to the parameters of FIGS. 12 and 13, and the results can be seen in the drawings described below.

FIG. 14 is an example of generated voice spectrograms according to an embodiment of the present invention.

Referring to FIG. 14, in an embodiment of the present invention the multi-voice Tacotron was trained for 140 epochs, where one epoch equals 1,289 iterations, both with and without a sparsity constraint on the voice embedding vector. The voice samples generated by the two models were of similar quality. The gender and accent of each speaker can be clearly distinguished from the samples. FIG. 14 shows sample spectrograms for three different voices. The text corresponding to the spectrograms is the same for every voice and does not appear in the training data. The pitch and speed of each sample can be seen to differ from those of the other samples. FIGS. 14(a), (b), and (c) are the results for (female, high pitch), (female, low pitch), and (male, low pitch), respectively.

FIG. 15 is a graph showing voice embedding results by principal component according to an embodiment of the present invention.

Referring to FIG. 15, PCA was used to examine how the speaker embeddings were learned. Two dimensions were selected by eigenvalue in order to see which factors account for the most variance in the data. FIG. 15 is color-coded to highlight linear separability by gender or accent. In FIGS. 15(a) and (c), where the model was trained without the sparsity constraint, all of the embeddings appear entangled. In contrast, FIGS. 15(b) and (d), where the model was trained with the sparsity constraint, show a clear separation by gender. If a line is drawn through the points (0,1) and (3,0) in FIG. 15(d), it can be seen that the line separates English accents from American accents. This result does not necessarily mean that the sparsity constraint yields a better representation of voice characteristics. Because the two dimensions giving the largest variance were selected, the model without the sparsity constraint may have found other factors that distinguish individual voices. However, if the voice embeddings are distributed over a manifold that can be understood intuitively, it can be much easier for a person to adjust an embedding to produce a desired voice. Voice characteristics can then be controlled by directly manipulating the voice embedding values. In particular, in the model trained with the sparsity constraint, the principal components can represent gender, accent, speed, and amplitude, respectively.

FIG. 16 is a graph showing spectrograms according to changes in the principal components according to an embodiment of the present invention.

Referring to FIG. 16, to demonstrate the voice embedding tuning approach, voices were generated by changing the value of one principal component at a time while keeping the other components fixed. Once a principal component is selected, the maximum and minimum values of that component are examined. Five values are then chosen between the maximum and the minimum, which yields five voice embeddings that differ only in the selected component and share the same values in all other components. Because the resulting voice embeddings lie in the space transformed by PCA, the inverse transform was applied to them so that they are compatible with the multi-voice Tacotron. Finally, the five voice embeddings were used to generate voice samples.
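The tuning procedure above can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: the random embeddings stand in for learned voice embeddings, the five values are assumed to be evenly spaced and to include the minimum and maximum, and the components that are held fixed are assumed to stay at the mean of the data in PCA space. The function names are hypothetical.

```python
import numpy as np

def pca_fit(embeddings):
    """Return the data mean and the principal directions (rows of vt),
    sorted by decreasing variance."""
    mean = embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt

def sweep_component(embeddings, component, n_steps=5):
    """Vary one principal component between its observed minimum and
    maximum while the other components stay fixed, then map the
    result back to the original embedding space."""
    mean, vt = pca_fit(embeddings)
    scores = (embeddings - mean) @ vt.T          # PCA-space coordinates
    base = scores.mean(axis=0)                   # assumption: others fixed at the mean
    values = np.linspace(scores[:, component].min(),
                         scores[:, component].max(), n_steps)
    swept = np.tile(base, (n_steps, 1))
    swept[:, component] = values
    return swept @ vt + mean                     # inverse PCA transform

# Placeholder data: 30 voices with 4-dimensional embeddings.
rng = np.random.default_rng(1)
voice_embeddings = rng.normal(size=(30, 4))
tuned = sweep_component(voice_embeddings, component=0)
print(tuned.shape)
```

Each row of `tuned` is an embedding in the original space and could be passed to the synthesis model in place of a looked-up speaker embedding.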

The first graph of FIG. 16 shows that the pitch rises as the value of the first principal component increases. As expected from FIG. 15(c), listening to the sample voices confirms that the voice changes from male to female as the value increases.

For the second principal component, FIG. 15(d) suggests that the accent of the speaker will change. By generating the sentence "I need some water from fast river." and listening to it, changes of accent can be identified in the words "water", "fast", and "river". Unfortunately, there is no effective way to visualize this change, so the figure for this experiment is omitted.

No intuition can be gained from the PCA plots of the third and fourth principal components, but trends can be observed by changing the value of each component. In the second graph of FIG. 16, the width of the voiced segments in each spectrogram decreases as the value of the component moves toward the median. When listening to the samples, the speech is perceived as faster when this width decreases. However, the trend is not as clear as for the first and second principal components.

Finally, the fourth principal component appears to represent the amplitude of the voice. A spectrogram shows a stronger pattern when the sound is loud. In the third graph of FIG. 16, the area of the brightest color is smallest in the rightmost spectrogram, and the spectrograms on the right are generally darker than those on the left. The voices on the right can also be perceived as louder than those on the left.

Because the voice embedding had only four dimensions, only four features affected by the voice embedding vector could be observed. It can be assumed that more meaningful features would be found by increasing the dimension of the voice embedding vector. The analysis of the third and fourth principal components showed that speed and loudness may be among the main factors of variability in speech. To make the speaker embedding tuning approach show distinct differences for these last two principal components, data augmentation that varies the speed and volume of the existing data can be used. Because the data set was not collected with the aim of distinguishing speed and volume, most of the data has average speed and average volume, and may therefore be insufficient for separating speed and loudness clearly. Audio files can be manipulated in several ways, including with the program Sound eXchange (SoX). Although the manipulated sound may have characteristics that differ from the original data, such tools can be used to increase the number of samples that differ in speed and loudness.
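As a rough illustration of the augmentation idea, the sketch below changes the volume and the speed of a waveform with NumPy. It is an assumption-laden stand-in and does not reproduce how SoX processes audio: the speed change here is plain linear-interpolation resampling, which also shifts the pitch, and the sample rate, test tone, and function names are hypothetical.

```python
import numpy as np

def change_volume(wave, gain):
    """Scale the amplitude by `gain` and clip to the range [-1, 1]."""
    return np.clip(wave * gain, -1.0, 1.0)

def change_speed(wave, factor):
    """Play the waveform `factor` times faster by resampling it.
    Note: this naive resampling changes the pitch as well as the speed."""
    n_out = int(round(len(wave) / factor))
    src_pos = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(src_pos, np.arange(len(wave)), wave)

# Hypothetical input: one second of a 220 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = 0.8 * np.sin(2 * np.pi * 220 * t)

quiet = change_volume(wave, 0.5)   # half the amplitude
fast = change_speed(wave, 2.0)     # half the duration
slow = change_speed(wave, 0.5)     # twice the duration
print(len(wave), len(fast), len(slow))
```

Augmented copies produced this way would add samples at non-average speed and volume, at the cost of the side effects noted above.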

FIG. 17 is a graph showing spectrograms of mixed voices according to an embodiment of the present invention.

Referring to FIG. 17, when the same analysis was carried out on the multi-voice Tacotron trained without the sparsity constraint, no trend could be recognized in the samples, unlike in the preceding analysis. Although no intuition can be gained from this analysis or from the two-dimensional visualization of the voice embeddings, it is worth noting that an interpolated voice embedding can be used as input to the model to generate speech. Two voice embedding vectors were averaged, and the result was used to generate speech. As shown in FIG. 17, the generated sample has a pitch level between those of the two original voices. Further investigation is required to determine which characteristic each component represents.
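Averaging two voice embeddings is a special case of linear interpolation, which can be sketched as below. The embedding values are made-up placeholders, and `interpolate_embeddings` is a hypothetical helper name. A weight of 0.5 reproduces the plain average described above.

```python
import numpy as np

def interpolate_embeddings(emb_a, emb_b, weight=0.5):
    """Linear interpolation between two voice embeddings.
    weight=0 returns emb_a, weight=1 returns emb_b, 0.5 is the average."""
    return (1.0 - weight) * emb_a + weight * emb_b

# Placeholder 4-dimensional embeddings for two voices.
voice_a = np.array([0.2, -1.0, 0.5, 0.0])
voice_b = np.array([1.0, 0.4, -0.5, 0.8])

mixed = interpolate_embeddings(voice_a, voice_b)
print(mixed)
```

The mixed vector has the same dimension as the inputs, so it can be fed to the model wherever a single voice embedding is expected.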

It is speculated that the sparsity constraint raises the complexity of the problem, which may lead the constrained model to converge faster than the model without the sparsity constraint. The complexity of the model may be too high relative to the complexity of the problem, in which case the model is prone to overfitting. Here the sparsity constraint acts as a regularization term, so that the complexity of the problem and that of the model can be matched. Under this scenario, a benefit may be obtained by decreasing the regularization coefficient ρ after the constrained model has converged to some extent. This would not only improve convergence in the early stage but also help to find the exact optimal parameters in the later stage. It is also possible that the model with the sparsity constraint was not trained sufficiently.
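One way to decrease ρ after an initial phase is a hold-then-decay schedule, sketched below. Only the 140-epoch training length comes from the description. The initial coefficient, the epoch at which decay starts, the decay rate, and the function name `rho_schedule` are all hypothetical, and the form of the sparsity penalty itself is not reproduced here.

```python
def rho_schedule(epoch, rho_init=1e-3, hold_epochs=70, decay=0.95):
    """Keep the regularization coefficient constant for `hold_epochs`,
    then shrink it geometrically in every later epoch."""
    if epoch < hold_epochs:
        return rho_init
    return rho_init * decay ** (epoch - hold_epochs)

# Coefficient at a few points of a 140-epoch run.
for epoch in (0, 69, 70, 100, 139):
    print(epoch, rho_schedule(epoch))

# In a training loop the coefficient would weight the sparsity term, e.g.
# total_loss = reconstruction_loss + rho_schedule(epoch) * sparsity_penalty
```

In practice the switch point would more likely be chosen from the training curve than fixed in advance as in this sketch.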

It should be understood that the present invention may be implemented in various forms of hardware, software, firmware, special-purpose processors, or combinations thereof. Preferably, the present invention is implemented as a combination of hardware and software. Furthermore, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPUs), random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may be part of the microinstruction code or part of the application program (or a combination thereof) executed via the operating system. In addition, various other peripheral devices, such as an additional data storage device and a printing device, may be connected to the computer platform.

It should be further understood that, because some of the system components and method steps depicted in the accompanying drawings are preferably implemented in software, the actual connections between the system components (or method steps) may differ depending on the manner in which the present invention is programmed. Given the teachings disclosed herein, a person of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

Claims

1. A text-to-multiple-speech conversion method comprising:
generating, by an encoder, character embedding vectors, character by character, for a sentence for which speech is to be generated;
generating, by a decoder, a voice spectrum from the embedding vectors; and
outputting, by a voice output unit, the voice spectrum as speech,
wherein, in generating the voice spectrum, the decoder receives multi-voice data, converts it into a voice embedding vector, and uses the voice embedding vector for voice spectrum merging,
wherein the multi-voice data is a speaker ID in the form of a one-hot vector and is converted into the voice embedding vector through a lookup table,
wherein the voice embedding vector is input to the decoder RNN and the attention RNN, and
wherein the decoder adjusts the speaker embedding vector by setting the space of the speaker embedding vector to be non-linear.

2. The method of claim 1,
wherein the multi-voice data is data in log-mel-spectrogram format.

3. The method of claim 2,
wherein the log-mel-spectrogram is converted into a voice embedding vector through an embedder network.

4. A text-to-multiple-speech conversion system comprising a text-to-multiple-speech conversion module that outputs speech for a text from the text,
wherein the text-to-multiple-speech conversion module comprises:
an encoder that generates character embedding vectors, character by character, for a sentence for which speech is to be generated;
a decoder that generates a voice spectrum from the embedding vectors; and
a voice output unit that outputs the voice spectrum as speech,
wherein the decoder receives multi-voice data, converts it into a voice embedding vector, and uses the voice embedding vector for voice spectrum merging,
wherein the multi-voice data is a speaker ID in the form of a one-hot vector and is converted into the voice embedding vector through a lookup table,
wherein the voice embedding vector is input to the decoder RNN and the attention RNN, and
wherein the decoder adjusts the speaker embedding vector by setting the space of the speaker embedding vector to be non-linear.

5. The system of claim 4,
wherein the multi-voice data is data in log-mel-spectrogram format.

6. The system of claim 5,
wherein the log-mel-spectrogram is converted into a voice embedding vector through an embedder network.