KR102505927B1

KR102505927B1 - Deep learning-based emotional text-to-speech apparatus and method using generative model-based data augmentation

Info

Publication number: KR102505927B1
Application number: KR1020180124925A
Authority: KR
Inventors: 장인선; 안충현; 서정일; 양승준; 최지훈; 강홍구; 강현주; 권오성
Original assignee: 한국전자통신연구원; 연세대학교 산학협력단
Priority date: 2018-10-19
Filing date: 2018-10-19
Publication date: 2023-03-03
Also published as: KR20200044337A

Abstract

The present invention relates to a method and apparatus for performing speech synthesis. More specifically, similar augmented data are generated and used to train a speech synthesis model, and generating the similar augmented data includes inputting an emotion control vector into a similar-data generation model (generative model).

Description

Deep learning-based emotional text-to-speech apparatus and method using generative model-based data augmentation

The present invention relates to a speech synthesis system and, more specifically, to an effective training method for synthesizing high-quality emotional speech with an emotional speech synthesis system.

A speech synthesis system is a system that converts input text into a speech signal that sounds as natural as speech uttered by a person, and outputs that signal.

Speech synthesis based on a statistical parametric model collects speech parameters according to context information, models them statistically, and then uses the model to generate the speech parameters corresponding to text input to the system, from which the speech signal is synthesized. This approach is used in many real-world applications because it provides good-quality synthesized speech even with a database of only a few hours, and because the speech parameters are easy to control, which makes it easy to adjust properties such as the timbre of the synthesized speech. Deep learning models, which can model nonlinear and complex relationships between input and output data, are widely used as the statistical model.

A system based on a statistical parametric model generally needs at least about three hours of speech data to achieve good quality. For the speech database used in an emotional speech synthesis system, however, human emotions are so varied that building a database for even a single speaker takes several hours more time, and correspondingly more cost, than building a read-speech database. Moreover, even when the same emotion is expressed, the manner of expression varies greatly from speaker to speaker. Simply pooling data from several speakers to train a speech synthesizer therefore fails to capture the speaker-specific differences in emotional expression, and lively emotional speech cannot be synthesized.

Accordingly, for emotional speech databases, a method of obtaining high-quality synthesized speech is needed that takes into account both the very high cost of building the database, which stems from the wide variety of emotions, and the large speaker-to-speaker variation in how the same emotion is expressed.

An object of the present invention is to improve the quality of synthesized speech in an emotional speech synthesis system.

An object of the present invention is to provide high-quality synthesized speech by effectively training a deep learning-based emotional speech synthesis system.

An object of the present invention is to generate augmented data in order to effectively train a deep learning-based emotional speech synthesis system.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

According to an embodiment of the present invention, a method and apparatus by which a speech synthesis apparatus performs speech synthesis may be provided. The speech synthesis apparatus may include a parameter extraction unit, a similar data generation unit, a speech synthesis model training unit, and a speech synthesis unit.

The parameter extraction unit may receive speech data from a database and extract parameters from the speech data.

The similar data generation unit may generate similar augmented data based on the speech data.

The speech synthesis model training unit may train a speech synthesis model based on the speech data and the similar augmented data.

The speech synthesis unit may receive text, synthesize speech using the speech synthesis model, and output the synthesized speech.

When the similar augmented data are generated, an emotion control vector may be generated and input into a similar-data generation model (generative model) to generate at least one item of similar augmented data.

In addition, the following embodiments may be applied in common to the method and the apparatus by which the speech synthesis apparatus performs speech synthesis.

According to an embodiment of the present invention, the emotion control vector may adjust the manner or the intensity of emotional expression by controlling a random variable used as the input of the similar-data generation model.

According to an embodiment of the present invention, the at least one item of generated data may have a probability distribution similar to that of the speech data.

According to an embodiment of the present invention, the similar-data generation model may be trained on the parameters of the speech data.

According to an embodiment of the present invention, the similar-data generation model may statistically model the probability distribution of the parameters extracted from the speech data.

According to an embodiment of the present invention, a variational autoencoder (VAE) may be used to train the similar-data generation model.

According to an embodiment of the present invention, an emotion control model may be trained on the parameters of the speech data, and the trained emotion control model may receive a random variable as input and generate the emotion control vector.

According to an embodiment of the present invention, when the speech synthesis model is trained, it may be trained based on the parameters extracted from the speech data and the parameters extracted from the similar augmented data.

According to an embodiment of the present invention, when the speech synthesis model is trained, information on the mapping relationship between linguistic and speech parameters may be stored.

According to an embodiment of the present invention, when text is input, the speech parameters corresponding to the text may be estimated based on the information on the mapping relationship between linguistic and speech parameters stored in the speech synthesis model, and a speech waveform may be synthesized and output.

According to the present invention, the quality of synthesized speech in an emotional speech synthesis system can be improved.

According to the present invention, a deep learning-based emotional speech synthesis system can be trained effectively to provide high-quality synthesized speech.

According to the present invention, augmented data can be generated in order to effectively train a deep learning-based emotional speech synthesis system.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a diagram showing the configuration of a speech synthesis apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a speech synthesis method according to an embodiment of the present invention.
FIG. 3 shows the configuration and a detailed flowchart of a speech synthesis apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method for generating augmented data according to an embodiment of the present invention.
FIG. 5 is a diagram of the training and vector generation process of the control vector unit.
FIG. 6 is a flowchart of the operation of the similar data generation unit.
FIG. 7 is an overall schematic diagram of the deep learning-based speech synthesizer.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein.

In describing the embodiments, a detailed description of a known configuration or function is omitted where it is judged that it could obscure the gist of the present invention. Parts of the drawings that are unrelated to the description are omitted, and similar parts are given similar reference numerals.

In the present invention, when a component is said to be "connected", "coupled", or "linked" to another component, this covers not only a direct connection but also an indirect connection in which yet another component lies in between. Likewise, when a component is said to "include" or "have" another component, this means that, unless specifically stated otherwise, further components are not excluded and may also be included.

In the present invention, terms such as "first" and "second" are used only to distinguish one component from another and, unless specifically stated, do not limit the order or importance of the components. Accordingly, within the scope of the present invention, a first component in one embodiment may be called a second component in another embodiment, and likewise a second component in one embodiment may be called a first component in another embodiment.

In the present invention, components are distinguished from one another in order to explain their respective features clearly; this does not mean that the components are necessarily separate. A plurality of components may be integrated into a single hardware or software unit, and a single component may be distributed across a plurality of hardware or software units. Such integrated or distributed embodiments are therefore included in the scope of the present invention even if not mentioned separately.

In the present invention, the components described in the various embodiments are not necessarily essential, and some may be optional. An embodiment consisting of a subset of the components described in one embodiment is therefore also included in the scope of the present invention, as is an embodiment that includes other components in addition to those described in the various embodiments.

In a deep learning-based emotional speech synthesis system, a sufficient amount of speech data is needed to obtain high-quality synthesized speech that expresses emotion vividly. In practice, however, building a large emotional speech database is difficult because it consumes a great deal of money and time.

For linguistic information, one item of pronunciation information corresponds to one character, so the relationship is one-to-one. Even when many different people speak, the pronunciation information for the same character therefore shows clear common characteristics. For emotional speech, however, the intensity and manner of expression vary greatly from speaker to speaker even for the same emotion, so the relationship is closer to one-to-many.

Therefore, the present invention uses a deep learning-based statistical generative modeling technique to generate parameters with statistical characteristics similar to those extracted from an actual emotional speech database, while controlling the intensity of the emotion or the manner of its expression, so that varied emotional expression and intensity control become possible. The generated data are used, together with the original emotional speech database, as training data for a deep learning-based speech synthesis system, with the aim of obtaining results comparable to those of a system trained on a large database and thereby providing an emotional speech synthesis system capable of high-quality synthesized speech and varied emotional expression.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of a speech synthesis apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the speech synthesis system of the present invention includes a speech parameter extraction unit 110, a control vector generation unit 120, a similar data generation unit 130, a deep learning-based speech synthesis model training unit 140, a speech parameter generation unit 150, and a speech synthesis unit 160.

The present invention uses a deep learning-based generative model to generate, as needed, parameters whose probability distribution is similar to that of parameters extracted from actual speech signals, controlling a random variable according to the emotional expression or emotional intensity. The generated parameters are then used to effectively train a deep learning-based emotional speech synthesis system so as to provide high-quality synthesized speech.

In the effective training method for a deep learning-based emotional speech synthesis system according to the present invention, which uses a generative model-based training data augmentation technique, a model that generates the probability distribution is trained using the data for emotional speech synthesis, and the resulting model is used to generate similar data that follow the probability distribution of the actual data. A control vector trained in advance (in one embodiment, controlling the manner of emotional expression, the emotional intensity, and so on) can be used to control the emotional expression characteristics to suit the purpose. The similar data generated with the control vector are then used, together with the parameters extracted from the original database, to train the deep learning-based speech synthesizer.

As a result, richer data can be used when training the speech synthesis model, so the model can be tuned more finely during training. The accuracy of the data can be improved by generating parameters in which dynamic information is preserved, and, by controlling varied manners and intensities of emotional expression, the system can be used in a variety of services.

To overcome the problems described above, the present technology uses a generative model that statistically models the probability distribution of data. A generative model can produce data similar to its training data; here it is used to generate data similar to the emotional speech parameters used in speech synthesis, with the aim of obtaining the same effect as enlarging the speech database.

By controlling the random variable used as input to the generative model, the manner or intensity of emotional expression can be adjusted. This allows fine control over emotion and the synthesis of emotional speech of a quality comparable to that obtained by training on a large database.

FIG. 2 is a flowchart of a speech synthesis method according to an embodiment of the present invention.

To perform speech synthesis, augmented data are generated (S210), a speech synthesizer is trained using the generated augmented data (S220), and synthesized speech is output (S230).

Accordingly, the apparatus can be broadly divided into the components that augment the data of the speech database used in the emotional speech synthesis system (110, 120, 130), the component that trains the speech synthesis model using the augmented data (140), and the components that output synthesized speech (150, 160).

FIG. 3 shows the configuration and a detailed flowchart of a speech synthesis apparatus according to an embodiment of the present invention.

The speech parameter extraction unit 110 extracts linguistic and speech parameters from the emotional speech synthesis database.

The control vector generation unit 120 and the similar data generation unit 130 can increase the amount of data in the speech database by generating similar data. The detailed operation of the control vector generation unit 120 is described with reference to FIG. 5 below, and that of the similar data generation unit 130 with reference to FIG. 6 below.

Recently, generative models, which model the probability distribution of data with deep learning techniques and can thereby sample data similar to the actual data, have attracted attention. Conventional deep learning-based training has mainly been applied, from a supervised learning perspective, to learning a function that represents the mapping between input data and their label information. A generative model, by contrast, aims from an unsupervised learning perspective to learn the structure inherent in the data using only the given data, without labels. With a generative model, new samples (similar data) having the same probability distribution as the given data (original data) can be generated, and the generated data can be used for the statistical training of a speech synthesis system just like an actually recorded database. A generative model can also model the latent variables of the data, which makes it possible to generate data in which meaningful characteristics are controlled.

The deep learning-based speech synthesis model training unit 140 trains a deep learning-based speech synthesis model using the linguistic and speech parameters extracted from the original database together with the parameters obtained from the similar data generation unit, and stores information on the mapping relationship between linguistic and speech parameters.
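The pooling of original and generated parameters described above can be sketched as follows. This is only an illustration under assumptions, not the implementation disclosed in the patent: the pair layout `(linguistic_features, speech_parameters)`, the function name `build_training_set`, and the `source` tag are all introduced here for the example.

```python
import random

def build_training_set(original_pairs, augmented_pairs, seed=0):
    """Merge original and generated (linguistic, speech-parameter) pairs.

    Each pair is tagged with its source so the two kinds of data can still
    be told apart after shuffling (the tag is an illustrative assumption).
    """
    tagged = [(ling, params, "original") for ling, params in original_pairs]
    tagged += [(ling, params, "augmented") for ling, params in augmented_pairs]
    random.Random(seed).shuffle(tagged)  # seeded shuffle, so runs are repeatable
    return tagged

# Toy data: two pairs from the recorded database, one generated pair.
original = [([1, 0], [0.2, 0.5]), ([0, 1], [0.7, 0.1])]
augmented = [([1, 0], [0.25, 0.45])]
training_set = build_training_set(original, augmented)
print(len(training_set))
```

Keeping the source tag makes it possible to weight or filter the generated samples later, which is a design choice of this sketch and not something the source specifies.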

The speech parameter generation unit 150 uses the mapping information stored in the speech synthesis model to estimate the speech parameters corresponding to the text given as input.

The speech synthesis unit 160 synthesizes a speech waveform from the speech parameters estimated by the speech parameter generation unit and outputs it.
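The two inference steps above (text to speech parameters, then parameters to waveform) can be illustrated with a deliberately small stand-in. In the sketch below the stored mapping is a hypothetical lookup table and the waveform generator emits plain sinusoids; a real system would use the trained neural model and a vocoder. The names `MAPPING`, `estimate_parameters`, and `synthesize_waveform`, and the choice of pitch and duration as the parameters, are assumptions made for this example.

```python
import math

# Hypothetical stored mapping: one (pitch in Hz, duration in s) pair per symbol.
MAPPING = {"a": (220.0, 0.010), "b": (330.0, 0.005)}

def estimate_parameters(text, mapping):
    """Role of the speech parameter generation unit (150): text -> parameters."""
    return [mapping[ch] for ch in text if ch in mapping]

def synthesize_waveform(parameters, sample_rate=16000):
    """Role of the speech synthesis unit (160): parameters -> waveform samples."""
    samples = []
    for pitch_hz, duration_s in parameters:
        n = int(round(duration_s * sample_rate))  # samples for this segment
        samples.extend(
            math.sin(2 * math.pi * pitch_hz * i / sample_rate) for i in range(n)
        )
    return samples

params = estimate_parameters("ab", MAPPING)
waveform = synthesize_waveform(params)
print(len(waveform))
```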

FIG. 4 is a flowchart of a method for generating augmented data according to an embodiment of the present invention.

To carry out the effective training method for the deep learning-based emotional speech synthesis apparatus using the generative model-based training data augmentation technique, the parameter extraction unit 110 first receives speech data from the database and extracts parameters from the speech data (S410).

The similar data generation unit 130 then generates similar augmented data based on the speech data.

More specifically, the similar data generation unit 130 first trains a statistical model for similar data generation using the emotional speech parameters, producing the similar-data model (S420).

The control vector generation unit 120 trains a latent variable model for emotional expression. That is, it trains a statistical model for generating control variables that represent emotion, producing the control variable model (S430).

The control vector generation unit 120 then uses the emotion latent variable model to generate an emotion control probability vector (S440). That is, the control vector generation unit 120 can use the control variable model it has produced to estimate control variables suited to the purpose.

The similar data generation unit 130 then uses the emotion control probability vector generated by the control vector generation unit 120 to generate augmented data with controlled emotional characteristics (S450). More specifically, with the estimated control variables as input, the similar-data generation model can generate the similar data used to train the speech synthesis system.
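Steps S410 to S450 can be walked through with a toy example. The sketch below swaps the deep generative model for an independent Gaussian per parameter dimension, purely so that the flow fits in a few lines; it is not the model described in the patent. All function names, the additive use of the control vector, and the fixed control vector standing in for S430 and S440 are assumptions.

```python
import random
import statistics

def extract_parameters(database):
    # S410: stand-in for parameter extraction; the toy "database" already
    # holds one feature vector per utterance.
    return [list(vec) for vec in database]

def train_generative_model(params):
    # S420: fit an independent Gaussian per dimension (a simple stand-in
    # for the generative model; the source names a VAE as one option).
    dims = list(zip(*params))
    return [(statistics.mean(d), statistics.pstdev(d)) for d in dims]

def generate_augmented(model, control_vector, count, rng):
    # S450: sample around the learned mean, shifted by the control vector.
    return [
        [rng.gauss(mu + c, sigma) for (mu, sigma), c in zip(model, control_vector)]
        for _ in range(count)
    ]

database = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0]]
params = extract_parameters(database)
model = train_generative_model(params)
control_vector = [0.1, 0.0]  # S430-S440 would estimate this; fixed here
augmented = generate_augmented(model, control_vector, count=5, rng=random.Random(0))
print(len(augmented))
```

The generated vectors have the same dimensionality as the extracted ones, so they can be pooled with the original parameters for training.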

The speech synthesis model training unit 140 then trains the deep learning-based speech synthesizer based on the speech data and the similar augmented data, and the speech synthesis unit 160 receives text, synthesizes speech using the speech synthesis model, and outputs it.

FIG. 5 is a diagram of the training and vector generation process of the control vector unit. More specifically, it is a block diagram illustrating in more detail the training and vector generation process of the control vector unit 120, which enables varied emotional expression and adjustment of emotional intensity.

The control vector generation unit 120 first extracts speech parameters from each emotional speech database. It then uses the extracted speech parameters to model latent variables that can express the characteristics of emotion in emotional speech. In this way, the control vector generation unit 120 can generate latent variables that express the conditions of the data to be generated by the similar data generation unit 130.

More specifically, the control vector generation unit 120 extracts speech parameters from each emotional speech database, trains an emotion control model, and thereby produces the emotion control model. It then receives a random variable as input, estimates the emotion control vector, and generates the emotion control probability vector (emotion control vector).

Here, the random variable is a user-specified value and may serve as the input for estimating the emotion control vector.
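One simple way to picture how a user-specified value could yield an emotion control vector is shown below. This is a hypothetical stand-in for the emotion control model, not the latent variable model of the patent: it assumes a neutral reference set exists, takes the mean offset of the emotional parameters from the neutral ones as a "direction", and scales that direction by the user-specified intensity. The function names and the two-dimensional toy parameters are assumptions.

```python
def train_emotion_control_model(neutral_params, emotion_params):
    """Return the mean offset of emotional parameters from neutral ones."""
    def mean_vector(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    neutral_mean = mean_vector(neutral_params)
    emotion_mean = mean_vector(emotion_params)
    return [e - n for e, n in zip(emotion_mean, neutral_mean)]

def make_control_vector(direction, intensity):
    """Scale the learned direction by a user-specified intensity value."""
    return [intensity * d for d in direction]

# Toy parameters: (mean pitch in Hz, relative energy) per utterance.
neutral = [[100.0, 1.0], [110.0, 1.0]]
happy = [[140.0, 1.5], [150.0, 1.5]]
direction = train_emotion_control_model(neutral, happy)
mild = make_control_vector(direction, 0.5)
strong = make_control_vector(direction, 1.0)
print(mild, strong)
```

Passing `mild` or `strong` to a generator would then shift the generated parameters by a smaller or larger amount, which mirrors the intensity control described above.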

도 6은 유사 데이터 생성부의 동작에 대한 흐름도이다. 보다 상세하게는 도 6은 실제 음성 합성 데이터베이스로부터 추출한 파라미터들과 유사한 통계적 특징을 가지는 파라미터들을 생성하여 음성 합성 시스템을 훈련할 때 대용량의 음성 데이터베이스를 사용하는 것과 같은 효과를 기대할 수 있는 유사 데이터 생성부(130)를 상세하게 서술하는 블록 다이어그램이다.6 is a flowchart of an operation of a similar data generating unit. In more detail, FIG. 6 is a similar data generation unit that can expect the same effect as using a large-capacity speech database when training a speech synthesis system by generating parameters having similar statistical characteristics to parameters extracted from an actual speech synthesis database. A block diagram detailing 130.

도 6의 유사 데이터 생성부(130)는 각 감정 음성 데이터베이스로부터 음성 합성에 필요한 파라미터를 추출한다. 그 후 유사 데이터 생성부(130)는 해당 음성 파라미터의 확률 분포를 통계적으로 모델링하는 생성 모델을 훈련한다. 이 훈련 과정을 통해 해당 모델은 음성 파라미터에 내재되어 있는 각종 특징들을 압축적으로 표현할 수 있다. 이 모델을 이용하여 실제 음성 데이터로부터 추출한 파라미터와 유사한 특징을 보이지만, 실제로 음성 신호로부터 추출하지 않은 유사 데이터를 생성할 수 있다. The similarity data generator 130 of FIG. 6 extracts parameters required for voice synthesis from each emotional voice database. Then, the similar data generator 130 trains a generation model that statistically models the probability distribution of the corresponding speech parameter. Through this training process, the model can compressively express various features inherent in voice parameters. Using this model, it is possible to generate similar data that exhibits characteristics similar to parameters extracted from actual voice data, but is not actually extracted from voice signals.

In one embodiment, the similar data generator 130 may train the generative model and generate the similar data using a VAE (Variational Auto Encoder) or the like. The VAE is one type of generative model structure. Its purpose is to obtain the probability distribution of the original data, and it is optimized on the basis of maximum likelihood.
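For reference, the maximum likelihood criterion is usually handled in a VAE by maximizing a lower bound on the log-likelihood rather than the likelihood itself. The standard form of this bound is given below. The symbols are conventional notation and not notation from this disclosure: x is a speech parameter vector, z is the latent variable, q_phi is the encoder, p_theta is the decoder, and p(z) is the prior over the latent variable.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
\;-\; D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of the speech parameters from the latent variable, and the second term keeps the encoder's distribution close to the prior so that sampling from the prior yields plausible data.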

When earlier generative models were trained, the latent variable took a non-differentiable form, so the backpropagation technique could not be used. Sampling techniques were widely used instead, but their heavy computational cost required long training times. The variational autoencoder converts the latent variable into a differentiable form, which makes backpropagation applicable. The VAE compresses the given data into a latent variable and, when the latent variable is used as input, can generate data of the same form as the given data. Because this structure resembles the encoder-decoder structure of an autoencoder, it is called a variational autoencoder.
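The conversion to a differentiable form is commonly done with the reparameterization trick, sketched below under the usual assumption of a diagonal Gaussian encoder; the disclosure does not state which formulation it uses. The latent sample is written as a deterministic function of the encoder outputs (mu, log_var) plus external noise eps, so gradients can flow through mu and log_var. The function names are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z is differentiable with respect
    to mu and log_var (dz/dmu = 1, dz/dsigma = eps).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

rng = random.Random(0)
z, eps = reparameterize([1.0, -1.0], [0.0, 0.0], rng)
```

In an actual VAE, mu and log_var would be produced by the encoder network and an automatic differentiation framework would compute the gradients; plain Python is used here only to make the arithmetic explicit.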

FIG. 7 is an overall schematic diagram of a deep learning-based speech synthesizer.

As described above, so that richer data can be used to train the emotional speech synthesizer, the present invention uses a deep learning-based generative model to generate parameters having a probability distribution similar to that of real data, and uses a control variable to modulate the emotional expression information to suit the purpose. This makes it possible to adjust the intensity or manner of emotional expression in the synthesized speech.
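How a control variable might modulate the intensity of emotional expression in generated parameters can be illustrated with a toy sketch. Everything here is hypothetical: the per-emotion offsets, the blend between a neutral and a target emotion, and the additive generator are stand-ins chosen for illustration, not the conditioning mechanism of the disclosed generative model.

```python
import random

# Hypothetical per-emotion shifts of two parameters (for example pitch and energy).
ORDER = ["neutral", "happy", "sad"]
EMOTION_OFFSETS = {
    "neutral": [0.0, 0.0],
    "happy": [0.3, 0.2],
    "sad": [-0.2, -0.3],
}

def control_vector(emotion, intensity):
    """Blend neutral with the target emotion; intensity must lie in [0, 1]."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be between 0 and 1")
    vec = [0.0] * len(ORDER)
    vec[ORDER.index("neutral")] += 1.0 - intensity
    vec[ORDER.index(emotion)] += intensity
    return vec

def generate_frame(base, control, rng, noise_scale=0.05):
    """Generate one parameter frame: base value, control-weighted shift, and noise."""
    frame = []
    for dim, value in enumerate(base):
        shift = sum(c * EMOTION_OFFSETS[name][dim] for c, name in zip(control, ORDER))
        frame.append(value + shift + rng.gauss(0.0, noise_scale))
    return frame

rng = random.Random(0)
mild = generate_frame([5.0, 1.0], control_vector("happy", 0.25), rng)
strong = generate_frame([5.0, 1.0], control_vector("happy", 1.0), rng)
```

Raising the intensity moves the generated parameters further from the neutral values toward the target emotion, which mirrors the adjustable expression strength described above.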

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments presented here and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the present invention belongs is fully informed of the scope of the invention. The present invention is defined only by the scope of the claims.

110: parameter extraction unit
120: control vector generator
130: similar data generator
140: speech synthesis model training unit
150: speech parameter generation unit
160: speech synthesis unit

Claims

A method for performing speech synthesis by a speech synthesizer, the method comprising:
receiving speech data from a database and extracting parameters from the speech data;
generating similar augmented data based on the speech data;
training a speech synthesis model based on the speech data and the similar augmented data; and
receiving text and synthesizing and outputting speech using the speech synthesis model,
wherein generating the similar augmented data comprises
generating an emotion control vector, and
generating at least one piece of the similar augmented data
by inputting the emotion control vector into a similar data generation model (generative model), and
wherein the similar data generation model
is trained with the parameters of the speech data as input.

The method of claim 1,
wherein the emotion control vector
adjusts the manner or the intensity of emotional expression by controlling a random variable used as an input of the similar data generation model.

The method of claim 1,
wherein the at least one piece of generated similar augmented data
has a probability distribution similar to that of the speech data.

(Deleted)

The method of claim 1,
wherein the similar data generation model
statistically models a probability distribution of the parameters extracted from the speech data.

The method of claim 1,
wherein the similar data generation model is trained
using a Variational Auto Encoder (VAE).

The method of claim 1,
wherein an emotion control model is trained by receiving the parameters of the speech data, and
the trained emotion control model generates the emotion control vector by receiving a random variable.

The method of claim 1,
wherein training the speech synthesis model comprises
training the speech synthesis model based on the parameters extracted from the speech data and parameters extracted from the similar augmented data.

The method of claim 1,
wherein training the speech synthesis model comprises
storing information about a mapping relationship between language and speech parameters.

A speech synthesizer comprising:
a parameter extraction unit that receives speech data from a database and extracts parameters from the speech data;
a similar data generator that generates similar augmented data based on the speech data;
a speech synthesis model training unit that trains a speech synthesis model based on the speech data and the similar augmented data; and
a speech synthesis unit that receives text and synthesizes and outputs speech using the speech synthesis model,
wherein, when generating the similar augmented data, the similar data generator
generates an emotion control vector, and
generates at least one piece of the similar augmented data
by inputting the emotion control vector into a similar data generation model (generative model), and
wherein the similar data generation model
is trained with the parameters of the speech data as input.

(Deleted)