KR20220070979A

KR20220070979A - Style speech synthesis apparatus and speech synthesis method using style encoding network

Info

Publication number: KR20220070979A
Application number: KR1020200158107A
Authority: KR
Inventors: 김남수; 천성준; 최병진; 김민찬; 김형주; 손병찬
Original assignee: 서울대학교산학협력단
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2022-05-31
Also published as: KR102473685B1

Abstract

The present invention relates to a style speech synthesis apparatus using an utterance style encoding network, the apparatus comprising a style extractor, an end-to-end speech synthesizer, and a vocoder. Because a single reference utterance is enough to synthesize other speech in a similar utterance style, the invention is useful for personalized speech synthesis.

Description

Style speech synthesis apparatus and speech synthesis method using an utterance style encoding network

The present invention relates to a style speech synthesis apparatus and a speech synthesis method, and more particularly to a style speech synthesis apparatus and a speech synthesis method that use an utterance style encoding network.

Speech synthesis is a core technology applied in fields such as AI speakers, audiobooks, and smart homes. According to Markets and Markets, the global speech synthesis market was worth about USD 1.3 billion in 2017 and is expected to grow 15.2% annually to reach USD 3.03 billion in 2022. Today, speech synthesis systems in which a single speaker speaks in a monotonous tone perform well enough to be indistinguishable from real speech, and they are actively used in commercial services in the fields mentioned above. Given this trend, demand for speech synthesizers in which multiple speakers speak in a variety of styles is expected to grow substantially.

Deep learning-based speech synthesis systems, which have recently driven large gains in synthesis quality, require a large amount of paired speech and text data, and collecting such data takes considerable time and money. An approach that collects new data and retrains for every new speaker or utterance style therefore has its limits. A technology is needed that can generate speech in a new style without collecting a large amount of additional data.

If speech can be synthesized in a new style without additional training in this way, the technology could not only replace existing speech synthesis services but also expand the existing market by enabling user-customized speech synthesis.

Meanwhile, prior art related to the present invention includes Korean Registered Patent No. 10-2159988 (title: Method and system for generating a voice montage; registered September 21, 2020) and Korean Registered Patent No. 10-2055886 (title: Method and apparatus for extracting speaker voice features, and recording medium therefor; registered December 9, 2019).

The present invention has been proposed to solve the above problems of previously proposed methods. One object is to provide a style speech synthesis apparatus and a speech synthesis method using an utterance style encoding network that can utter other speech in a similar utterance style from only a single reference utterance, and that are therefore useful for personalized speech synthesis.

Another object of the present invention is to provide a style speech synthesis apparatus and a speech synthesis method using an utterance style encoding network in which the style extractor, which extracts the speaker's style from speech, is trained without supervision. Style can thus be extracted and learned from speech without defining styles or classifying training data by style, which saves the time and cost of classifying speech data, makes large amounts of speech data easy to use, and allows a high-quality speech synthesis model to be trained at low cost.

Yet another object of the present invention is to provide a style speech synthesis apparatus and a speech synthesis method using an utterance style encoding network that can generate synthesized speech of a similar style from a single spoken sentence. Existing adaptive techniques need at least several minutes to several hours of speech to reflect a style, whereas the present invention needs far less, so anyone can generate synthesized speech in a given style by recording just one sentence, without building a large database.

To achieve the above objects, a style speech synthesis apparatus using an utterance style encoding network according to a feature of the present invention is

a speech synthesis apparatus comprising:

a style extractor, based on an artificial neural network, that receives a reference speech as input and outputs a variable-length style vector sequence;

an end-to-end speech synthesizer that receives the variable-length style vector sequence output by the style extractor as input and outputs a mel-spectrogram sequence corresponding to a text input; and

a vocoder that converts the mel-spectrogram sequence output by the end-to-end speech synthesizer into a speech waveform and outputs it,

wherein the style extractor and the end-to-end speech synthesizer

are trained through joint training.

Preferably, the variable-length style vector sequence

may vary in length according to the length of the reference speech received as input and, as a latent variable for the reference speech, may contain style information of the reference speech.

Preferably,

the apparatus may further comprise a database that stores text-speech pairs reflecting style elements as training data.

More preferably,

the end-to-end speech synthesizer may be trained on the training data of text-speech pairs with a text as input and the mel-spectrogram of the speech paired with that text as the target output, and

the style extractor may be trained through unsupervised learning with the mel-spectrogram of the target output as input.

Even more preferably,

a synthesis target text is synthesized into speech in a synthesis target style using the style extractor and the end-to-end speech synthesizer trained through the joint training, wherein

the style extractor receives, as the reference speech, a speech that reflects the synthesis target style and whose content differs from the synthesis target text, and outputs a variable-length style vector sequence, and

the end-to-end speech synthesizer may receive the variable-length style vector sequence output by the style extractor as input and output a mel-spectrogram sequence corresponding to the synthesis target text.

Preferably, the style extractor

may be a style encoder including a one-dimensional convolutional neural network (CNN) and a gated recurrent unit (GRU).

Preferably, the end-to-end speech synthesizer

may be any one selected from the group of autoregressive models including Tacotron 2 and Transformer-TTS.

In addition, to achieve the above objects, a style speech synthesis method using an utterance style encoding network according to a feature of the present invention is

a speech synthesis method in which each step is performed by a computer, the method comprising:

(1) using training data of text-speech pairs reflecting style elements, training through joint training a style extractor, based on an artificial neural network, that receives a reference speech as input and outputs a variable-length style vector sequence, and an end-to-end speech synthesizer that receives the variable-length style vector sequence output by the style extractor as input and outputs a mel-spectrogram sequence corresponding to a text input; and

(2) synthesizing a synthesis target text into speech in a synthesis target style using the style extractor and the end-to-end speech synthesizer trained through the joint training.

More preferably, step (1) may include:

(1-1) training the end-to-end speech synthesizer on the training data of text-speech pairs, with a text as input and the mel-spectrogram of the speech paired with that text as the target output; and

(1-2) training the style extractor through unsupervised learning with the mel-spectrogram of the target output as input,

whereby the style extractor and the end-to-end speech synthesizer are trained through joint training.

Even more preferably, step (2) may include:

(2-1) receiving, by the style extractor, a speech that reflects the synthesis target style and whose content differs from the synthesis target text as the reference speech, and outputting a variable-length style vector sequence;

(2-2) receiving, by the end-to-end speech synthesizer, the variable-length style vector sequence output by the style extractor as input, and outputting a mel-spectrogram sequence corresponding to the synthesis target text; and

(2-3) converting, by the vocoder, the mel-spectrogram sequence output by the end-to-end speech synthesizer into a speech waveform and outputting it.

According to the style speech synthesis apparatus and speech synthesis method using an utterance style encoding network proposed in the present invention, other speech can be uttered in a similar utterance style from only a single reference utterance, so the invention is useful for personalized speech synthesis.

In addition, according to the style speech synthesis apparatus and speech synthesis method using an utterance style encoding network proposed in the present invention, the style extractor that extracts the speaker's style from speech is trained without supervision. Style can therefore be extracted and learned from speech without defining styles or classifying training data by style, which saves the time and cost of classifying speech data, makes large amounts of speech data easy to use, and allows a high-quality speech synthesis model to be trained at low cost.

Furthermore, according to the style speech synthesis apparatus and speech synthesis method using an utterance style encoding network proposed in the present invention, synthesized speech of a similar style can be generated from a single spoken sentence. Compared with existing adaptive techniques, which reflect a style on the basis of at least several minutes to several hours of speech, the style can be reflected with a very small amount of speech, so anyone can generate synthesized speech in a given style by recording just one sentence, without building a large database.

FIG. 1 is a diagram showing the configuration of a style speech synthesis apparatus using an utterance style encoding network according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining style speech synthesis.
FIG. 3 is a diagram showing the configuration of an end-to-end speech synthesis apparatus.
FIG. 4 is a diagram showing the style vector output for an input reference speech in a style end-to-end speech synthesis apparatus.
FIG. 5 is a diagram showing the style vectors output for an input reference speech by the style extractor of a style speech synthesis apparatus using an utterance style encoding network according to an embodiment of the present invention.
FIG. 6 is a diagram showing the detailed configuration of the style extractor in a style speech synthesis apparatus using an utterance style encoding network according to an embodiment of the present invention.
FIG. 7 is a diagram showing the detailed configuration of the end-to-end speech synthesizer in a style speech synthesis apparatus using an utterance style encoding network according to an embodiment of the present invention.
FIG. 8 is a diagram showing the flow of a style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention.
FIG. 9 is a diagram showing the detailed flow of step S100 in a style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention.
FIG. 10 is a diagram showing the detailed flow of step S200 in a style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can readily practice the invention. In describing the preferred embodiments in detail, however, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention. The same reference numerals are used throughout the drawings for parts having similar functions and operations.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Also, to 'include' a component means that other components may further be included, not excluded, unless specifically stated otherwise.

FIG. 1 is a diagram showing the configuration of a style speech synthesis apparatus 10 using an utterance style encoding network according to an embodiment of the present invention. As shown in FIG. 1, the style speech synthesis apparatus 10 may comprise: a style extractor 100, based on an artificial neural network, that receives a reference speech as input and outputs a variable-length style vector sequence; an end-to-end speech synthesizer 200 that receives the variable-length style vector sequence output by the style extractor 100 as input and outputs a mel-spectrogram sequence corresponding to a text input; and a vocoder 300 that converts the mel-spectrogram sequence output by the end-to-end speech synthesizer 200 into a speech waveform and outputs it. The apparatus may further comprise a database 400 that stores text-speech pairs reflecting style elements as training data. Here, the style extractor 100 and the end-to-end speech synthesizer 200 may be trained through joint training.
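The data flow among the three components of FIG. 1 can be sketched as a simple function composition. This is only an illustrative sketch: the type aliases, the function name `synthesize_speech`, and the three toy stand-in functions are assumptions made for this example and do not appear in the specification.

```python
from typing import Callable, List

Mel = List[List[float]]            # frames x mel bins
StyleSequence = List[List[float]]  # variable number of style vectors
Waveform = List[float]


def synthesize_speech(
    text: str,
    reference_mel: Mel,
    style_extractor: Callable[[Mel], StyleSequence],
    synthesizer: Callable[[str, StyleSequence], Mel],
    vocoder: Callable[[Mel], Waveform],
) -> Waveform:
    """Chain the three components in the order shown in FIG. 1."""
    style_sequence = style_extractor(reference_mel)  # style extractor 100
    mel = synthesizer(text, style_sequence)          # end-to-end synthesizer 200
    return vocoder(mel)                              # vocoder 300


# Toy stand-ins (hypothetical) so the sketch runs without trained models.
def toy_style_extractor(mel: Mel) -> StyleSequence:
    return mel[::2]  # keep every second frame as a "style vector"


def toy_synthesizer(text: str, style_sequence: StyleSequence) -> Mel:
    # One 2-bin frame per character; the value records how many style vectors were seen.
    return [[float(len(style_sequence))] * 2 for _ in text]


def toy_vocoder(mel: Mel) -> Waveform:
    return [value for frame in mel for value in frame]  # flatten frames to samples


waveform = synthesize_speech(
    "hi", [[0.0, 0.0]] * 6, toy_style_extractor, toy_synthesizer, toy_vocoder
)
```

Passing the components in as callables keeps the sketch independent of any particular model; a trained style encoder, a Tacotron 2 style synthesizer, and a neural vocoder would slot into the same three positions.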

The present invention relates to a style speech synthesis apparatus 10 using an utterance style encoding network, and the apparatus 10 according to a feature of the present invention may be implemented by a computer. For example, the style speech synthesis apparatus 10 may be stored and implemented on a personal computer, a notebook computer, a server computer, a PDA, a smartphone, a tablet PC, or the like.

Hereinafter, before describing the style speech synthesis apparatus 10 and the speech synthesis method using an utterance style encoding network according to an embodiment of the present invention, style speech synthesis and end-to-end speech synthesis apparatuses are described first.

FIG. 2 is a diagram for explaining style speech synthesis. As shown in FIG. 2, style speech synthesis by the style speech synthesis apparatus 10 and speech synthesis method using an utterance style encoding network according to an embodiment of the present invention uses style information together with the given text as input to generate synthesized speech with richer expressiveness. Here, utterance style broadly refers to the phonetic information in a given speech other than the linguistic information given by its phonemes, such as speaker, emotion, prosody, and channel.

Conventional work has focused on techniques that feed features of some style elements into the system together with the text as style information, such as the fundamental frequency (F0) representing pitch information, a speaker code representing the acoustic characteristics of the speaker, and an emotion code representing the acoustic characteristics of an emotion, in order to generate synthesized speech reflecting prosody, speaker, emotion, and similar information. Such systems can be trained by supervised learning.

Recently, starting with Tacotron, research on high-performance deep learning-based end-to-end speech synthesis apparatuses has been intensive. A deep learning-based end-to-end speech synthesis apparatus overcomes the shortcomings of concatenative synthesis, which generates speech by joining speech units, and of statistical parametric synthesis, and is characterized by very natural prosody and excellent acoustic quality. FIG. 3 is a diagram showing the configuration of an end-to-end speech synthesis apparatus.

With the emergence of high-performance deep learning-based speech synthesis systems, research on more expressive and controllable style end-to-end speech synthesis apparatuses is also being actively pursued.

A conventional style end-to-end speech synthesis apparatus uses the global style token technique. The global style token technique uses an attention mechanism to extract the speech given as a reference into a single style vector of fixed dimension, formed as a linear combination of several style tokens. The extracted style vector is concatenated with the encoder output of an end-to-end speech synthesis apparatus such as the original Tacotron and fed to the decoder, which generates synthesized speech reflecting the style of the reference. A style encoder using such an attention module is trained in an unsupervised manner: during training the target speech is given as the reference speech and no style is specified.
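The fixed-dimension behavior of the global style token technique can be illustrated with a short sketch. It assumes a single attention head, plain dot-product scoring, and a time average as a stand-in for the reference encoder; the token values and all function names are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(frames):
    """Hypothetical stand-in for a reference encoder: average the frames over time."""
    count = len(frames)
    return [sum(frame[d] for frame in frames) / count for d in range(len(frames[0]))]


def global_style_vector(reference_embedding, style_tokens):
    """Attend from one reference embedding to the style tokens and return
    their weighted sum: a single vector whose size is the token dimension."""
    scale = math.sqrt(len(reference_embedding))
    scores = [
        sum(q * k for q, k in zip(reference_embedding, token)) / scale
        for token in style_tokens
    ]
    weights = softmax(scores)
    return [
        sum(w * token[d] for w, token in zip(weights, style_tokens))
        for d in range(len(style_tokens[0]))
    ]


# Four illustrative 3-dimensional style tokens (in a real system these are learned).
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
short_reference = [[0.2, 0.1, 0.0]] * 5    # 5 frames
long_reference = [[0.0, 0.3, 0.1]] * 50    # 50 frames

style_short = global_style_vector(summarize(short_reference), tokens)
style_long = global_style_vector(summarize(long_reference), tokens)
```

Both `style_short` and `style_long` have three elements even though the references differ tenfold in length, which is the property the following paragraphs identify as a limitation.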

The global style token described above has the advantage that the components of style can be learned without supervision. However, because it extracts the reference speech, which is sequence information that changes over time, into only a single style vector of fixed dimension, it has the disadvantage of considering only the global features of the reference speech.

FIG. 4 is a diagram showing the style vector output for an input reference speech in a style end-to-end speech synthesis apparatus, and FIG. 5 is a diagram showing the style vectors output for an input reference speech by the style extractor 100 of the style speech synthesis apparatus 10 using an utterance style encoding network according to an embodiment of the present invention.

As shown in FIG. 4, the conventional global style token method extracts reference speech of various lengths into a style vector of fixed dimension, and therefore cannot reflect local features. In contrast, as shown in FIG. 5, the style extractor 100 of the style speech synthesis apparatus 10 using an utterance style encoding network according to an embodiment of the present invention extracts the reference speech into a variable-length style vector sequence and can thereby reflect local features of the style.

More specifically, the variable-length style vector sequence varies in length according to the length of the reference speech received as input and, as a latent variable for the reference speech, may contain style information of the reference speech. This reflects both the global and the local features of the reference speech, so that speech carrying the style of the reference speech can be synthesized efficiently.

Hereinafter, each component of the style speech synthesis apparatus 10 using an utterance style encoding network according to an embodiment of the present invention is described in detail with reference to FIGS. 6 and 7.

FIG. 6 is a diagram showing the detailed configuration of the style extractor 100 in the style speech synthesis apparatus 10 using an utterance style encoding network according to an embodiment of the present invention. As shown in FIG. 6, the style extractor 100 is an artificial neural network-based style extraction network that receives a reference speech as input, outputs a variable-length style vector sequence, and can pass the style information to the decoder of the end-to-end speech synthesizer 200 as input.

More specifically, the style extractor 100 may be a style encoder including a one-dimensional convolutional neural network (CNN) and a gated recurrent unit (GRU). That is, as shown in FIG. 6, the style extraction network may be, for example, a style encoder consisting of a stack of 1D convolutional networks and a stack of GRUs, and it is given the mel-spectrogram sequence of the reference speech as input. The style extractor 100 extracts a variable-length style vector sequence whose length changes with the length of the reference; this sequence is a latent variable for the reference speech and carries its style information.
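A minimal sketch of such a style encoder follows, written in plain Python so that it runs without a deep learning framework. It assumes one strided 1D convolution layer with ReLU and one GRU layer with biases omitted and randomly initialized weights; the layer counts, sizes, stride, and the name `make_style_extractor` are illustrative assumptions, since the specification does not give them.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def conv1d(frames, kernels, stride):
    """1D convolution over time with ReLU; each kernel is a (width x n_mels) grid."""
    width = len(kernels[0])
    out = []
    for start in range(0, len(frames) - width + 1, stride):
        window = frames[start:start + width]
        out.append([
            max(0.0, sum(w * x
                         for kernel_row, frame in zip(kernel, window)
                         for w, x in zip(kernel_row, frame)))
            for kernel in kernels
        ])
    return out


def gru(sequence, params, hidden_dim):
    """Single-layer GRU (biases omitted); returns the hidden state at every step."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = [0.0] * hidden_dim
    states = []
    for x in sequence:
        z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h))]  # update gate
        r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h))]  # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(Wh, x), matvec(Uh, reset_h))]
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        states.append(h)
    return states


def make_style_extractor(n_mels=8, channels=6, style_dim=4, width=3, stride=2, seed=0):
    """Build an untrained CNN + GRU style encoder with random weights."""
    rng = random.Random(seed)
    kernels = [rand_matrix(rng, width, n_mels) for _ in range(channels)]
    params = []
    for _ in range(3):  # update gate, reset gate, candidate state
        params.append(rand_matrix(rng, style_dim, channels))   # input weights
        params.append(rand_matrix(rng, style_dim, style_dim))  # recurrent weights

    def extract(mel):
        return gru(conv1d(mel, kernels, stride), params, style_dim)

    return extract


rng = random.Random(1)
extract = make_style_extractor()
short_mel = [[rng.random() for _ in range(8)] for _ in range(20)]  # 20 frames
long_mel = [[rng.random() for _ in range(8)] for _ in range(50)]   # 50 frames
print(len(extract(short_mel)), len(extract(long_mel)))  # prints: 9 24
```

With a kernel width of 3 and a stride of 2, a reference of T frames yields floor((T - 3) / 2) + 1 style vectors, so the output length tracks the reference length instead of collapsing to a single vector.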

FIG. 7 is a diagram showing the detailed configuration of the end-to-end speech synthesizer 200 in the style speech synthesis apparatus 10 using an utterance style encoding network according to an embodiment of the present invention. As shown in FIG. 7, the end-to-end speech synthesizer 200 may receive the variable-length style vector sequence output by the style extractor 100 as input and output a mel-spectrogram sequence corresponding to a text input.

More specifically, the end-to-end speech synthesizer 200 may be any one selected from the group of autoregressive models including Tacotron 2 and Transformer-TTS (Transformer text-to-speech). Tacotron 2 and Transformer-TTS are both autoregressive models: to generate the output at time t, the decoder takes as input the text information, the style information, and the outputs generated up to time t-1. The end-to-end speech synthesizer 200 can thus output a mel-spectrogram sequence corresponding to the text input.

As shown in FIG. 7, the variable-length style vector sequence output by the style extractor 100 may be used as an additional decoder input of the end-to-end speech synthesizer 200. One example of how to condition the end-to-end speech synthesizer 200 on the variable-length style vector sequence is attention. The intermediate decoder output computed from the text information and the outputs generated up to t-1 is used as the query, and the style information is used as the keys and values; the attention result gives a decoder output that reflects the style.
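The attention-based conditioning described above can be sketched as follows. The sketch assumes scaled dot-product attention with no learned projections, equal dimensions for text states, style vectors, and output frames, and a trivial one-state-per-frame text alignment; these simplifications and the names `style_attention` and `decode` are assumptions for illustration, not details from the specification.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_attention(query, style_sequence):
    """Scaled dot-product attention: the style vectors act as both keys and values."""
    scale = math.sqrt(len(query))
    weights = softmax([
        sum(q * k for q, k in zip(query, style)) / scale for style in style_sequence
    ])
    context = [
        sum(w * style[d] for w, style in zip(weights, style_sequence))
        for d in range(len(query))
    ]
    return context, weights


def decode(text_states, style_sequence, n_frames):
    """Toy autoregressive loop: frame t depends on text, style, and frame t-1."""
    previous = [0.0] * len(style_sequence[0])
    frames = []
    for t in range(n_frames):
        text_t = text_states[min(t, len(text_states) - 1)]   # hypothetical alignment
        query = [a + b for a, b in zip(text_t, previous)]     # intermediate decoder output
        context, _ = style_attention(query, style_sequence)   # style-aware context
        previous = [math.tanh(q + c) for q, c in zip(query, context)]
        frames.append(previous)
    return frames


text_states = [[0.1, 0.2], [0.3, -0.1]]
style_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 style vectors
style_b = [[-1.0, 0.0]] * 7                       # 7 style vectors

frames_a = decode(text_states, style_a, n_frames=4)
frames_b = decode(text_states, style_b, n_frames=4)
```

Because attention reduces any number of style vectors to one context vector per decoder step, the same decoder accepts style sequences of length 3 and length 7 without modification, while each step can still weight different parts of the reference differently.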

The vocoder 300 may convert the mel-spectrogram sequence output by the end-to-end speech synthesizer 200 into a speech waveform and output it. Depending on the embodiment, the vocoder 300 may be implemented with a signal-processing-based vocoding algorithm such as Griffin-Lim, or with a more recently proposed neural vocoder such as WaveNet or WaveGlow.

Meanwhile, the style extractor 100 and the end-to-end speech synthesizer 200 are learned through joint training. For joint training, the end-to-end speech synthesizer 200 is trained on text-speech pair training data, with the text as the input and the mel-spectrogram of the speech paired with that text as the target output, and the style extractor 100 may be trained through unsupervised learning with the mel-spectrogram of the target output as its input.

More specifically, to train the style speech synthesis apparatus 10, a database 400 is prepared that consists of a large volume of text-speech pairs reflecting a variety of style elements, such as multiple speakers, multiple emotions, and pronounced changes in tone. To train the end-to-end speech synthesizer 200, text is used as the input, and the mel-spectrogram of the speech paired with that text is the target. To train the style extractor 100 jointly with the end-to-end speech synthesizer 200, the mel-spectrogram of the target speech is used as the input of the style extraction network. Because no ground-truth style label is given, the style extraction network is trained through unsupervised learning.
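The data flow of this joint training can be sketched with a deliberately tiny toy. The scalar "models", the mean-squared-error loss, and the finite-difference gradient (standing in for backpropagation) are all assumptions made for illustration; the only point carried over from the description is that one reconstruction loss on the target mel-spectrogram updates both components, with no style label involved.

```python
from typing import List, Sequence


def style_extractor(mel: List[float], w_style: float) -> float:
    """Toy extractor: one scalar style value from the target mel-spectrogram."""
    return w_style * sum(mel) / len(mel)


def synthesizer(text: List[float], style: float, w_text: float) -> List[float]:
    """Toy synthesizer: one predicted value per text feature, shifted by style."""
    return [w_text * x + style for x in text]


def joint_loss(params: Sequence[float], text: List[float], mel: List[float]) -> float:
    w_style, w_text = params
    style = style_extractor(mel, w_style)         # target mel feeds the extractor
    predicted = synthesizer(text, style, w_text)  # text feeds the synthesizer
    return sum((p - m) ** 2 for p, m in zip(predicted, mel)) / len(mel)


def train(text, mel, steps=2000, lr=0.01, eps=1e-6):
    """Gradient descent on both parameters from the single joint loss."""
    params = [0.0, 0.0]
    for _ in range(steps):
        base = joint_loss(params, text, mel)
        grads = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grads.append((joint_loss(bumped, text, mel) - base) / eps)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params


text = [1.0, 2.0, 3.0]  # hypothetical text features
mel = [2.5, 4.5, 6.5]   # hypothetical paired target values
initial_loss = joint_loss([0.0, 0.0], text, mel)
final_loss = joint_loss(train(text, mel), text, mel)
```

In a real system both components would be neural networks trained by backpropagation, but the structure is the same: the extractor never sees a style label, only the target it helps reconstruct.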

Using the style extractor 100 and the end-to-end speech synthesizer 200 learned through joint training, the synthesis target text can be synthesized into speech in the synthesis target style. More specifically, the style extractor 100 receives, as the reference speech, speech that reflects the synthesis target style and whose content differs from the synthesis target text, and outputs a variable-length style vector sequence; the end-to-end speech synthesizer 200 receives the variable-length style vector sequence output by the style extractor 100 as an input and may output a mel-spectrogram sequence corresponding to the synthesis target text.

That is, when the style speech synthesis apparatus 10 generates synthesized speech, speech reflecting the synthesis target style to be expressed is used as the reference and fed to the style extractor 100; the text of that reference speech is not the same as the text fed to the end-to-end speech synthesizer 200. The style extractor 100 extracts from the reference speech the style information, excluding the text information, and reflects it in the decoder of the end-to-end speech synthesizer 200. The mel-spectrogram sequence generated by the end-to-end speech synthesizer 200 passes through the vocoder 300 and may be generated as a speech waveform.

Meanwhile, the present invention also relates to a style speech synthesis method using an utterance style encoding network. The method according to a feature of the present invention may be implemented as software recorded on hardware that includes a memory and a processor. That is, each step of the method may be performed by the computer-implemented style speech synthesis apparatus 10. For example, the method may be stored and implemented on a personal computer, notebook computer, server computer, PDA, smartphone, tablet PC, or the like. Hereinafter, for convenience of description, the subject performing each step may be omitted.

FIG. 8 is a diagram illustrating the flow of the style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention. As shown in FIG. 8, the method is a speech synthesis method in which each step is performed by a computer, and may be implemented to include a step (S100) of learning the style extractor 100 and the end-to-end speech synthesizer 200 through joint training, and a step (S200) of synthesizing the synthesis target text into speech in the synthesis target style using the style extractor 100 and the end-to-end speech synthesizer 200 learned through joint training.

Each step is described below. Because the details have already been sufficiently described above in connection with the style speech synthesis apparatus 10 using an utterance style encoding network according to an embodiment of the present invention, some of the detailed description may be omitted.

In step S100, text-speech pair training data in which style elements are reflected is used to learn two components through joint training: the style extractor 100, which, based on an artificial neural network, receives a reference speech as an input and outputs a variable-length style vector sequence; and the end-to-end speech synthesizer 200, which receives the variable-length style vector sequence output by the style extractor 100 as an input and outputs a mel-spectrogram sequence corresponding to the text input. That is, step S100 is the learning step in which the style extractor 100 and the end-to-end speech synthesizer 200 are trained.

Here, the variable-length style vector sequence varies in length according to the length of the reference speech received as an input and, as a latent variable for the reference speech, may include the style information of the reference speech.
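The length behaviour can be made concrete with a small sketch. The block-averaging below is only a hypothetical stand-in for the neural style extractor (the name `toy_style_extractor` and the stride of 4 are assumptions); it shows nothing more than that a longer reference yields a longer style sequence.

```python
from typing import List


def toy_style_extractor(mel_frames: List[List[float]], stride: int = 4) -> List[List[float]]:
    """Average each block of `stride` frames into one style vector."""
    style_sequence = []
    for start in range(0, len(mel_frames), stride):
        block = mel_frames[start:start + stride]
        dim = len(block[0])
        style_sequence.append([sum(f[d] for f in block) / len(block) for d in range(dim)])
    return style_sequence


short_reference = [[float(t), 1.0] for t in range(8)]   # 8 frames
long_reference = [[float(t), 1.0] for t in range(20)]   # 20 frames

short_style = toy_style_extractor(short_reference)  # 2 style vectors
long_style = toy_style_extractor(long_reference)    # 5 style vectors
```

A fixed-size style embedding would compress both references to the same shape; a sequence keeps room for style that changes over the course of the utterance.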

Hereinafter, step S100 of the style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention is described in detail with reference to FIG. 9.

FIG. 9 is a diagram illustrating the detailed flow of step S100 in the style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention. As shown in FIG. 9, step S100 may be implemented to include a step (S110) of training the end-to-end speech synthesizer 200 on the text-speech pair training data, with the text as the input and the mel-spectrogram of the speech paired with the input text as the target output, and a step (S120) of training the style extractor 100 through unsupervised learning with the mel-spectrogram of the target output as the input.

In step S110, the end-to-end speech synthesizer 200 may be trained on the text-speech pair training data, with the text as the input and the mel-spectrogram of the speech paired with the input text as the target output.

In step S120, the style extractor 100 may be trained through unsupervised learning with the mel-spectrogram of the target output as the input.

In this way, through steps S110 and S120, the style extractor 100 and the end-to-end speech synthesizer 200 may be learned through joint training.

In step S200, the synthesis target text may be synthesized into speech in the synthesis target style using the style extractor 100 and the end-to-end speech synthesizer 200 learned through joint training. That is, step S200 is the step of synthesizing speech that reflects the style.

Hereinafter, step S200 of the style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention is described in detail with reference to FIG. 10.

FIG. 10 is a diagram illustrating the detailed flow of step S200 in the style speech synthesis method using an utterance style encoding network according to an embodiment of the present invention. As shown in FIG. 10, step S200 may be implemented to include a step (S210) in which the style extractor 100 receives, as the reference speech, speech that reflects the synthesis target style and differs from the synthesis target text, and outputs a variable-length style vector sequence; a step (S220) in which the end-to-end speech synthesizer 200 receives the variable-length style vector sequence output by the style extractor 100 as an input and outputs a mel-spectrogram sequence corresponding to the synthesis target text; and a step (S230) in which the vocoder 300 converts the mel-spectrogram sequence output by the end-to-end speech synthesizer 200 into a speech waveform and outputs it.

In step S210, the style extractor 100 may receive, as the reference speech, speech that reflects the synthesis target style and differs from the synthesis target text, and output a variable-length style vector sequence.

In step S220, the end-to-end speech synthesizer 200 may receive the variable-length style vector sequence output by the style extractor 100 as an input and output a mel-spectrogram sequence corresponding to the synthesis target text.

In step S230, the vocoder 300 may convert the mel-spectrogram sequence output by the end-to-end speech synthesizer 200 into a speech waveform and output it.
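Steps S210 to S230 form a simple three-stage pipeline, which the sketch below wires together. The function `synthesize_with_style` and the three stub components are hypothetical placeholders for trained models; only the order of the stages and what each one receives follow the description.

```python
from typing import Callable, List, Sequence


def synthesize_with_style(
    reference_mel: Sequence[Sequence[float]],
    target_text: str,
    style_extractor: Callable,
    synthesizer: Callable,
    vocoder: Callable,
) -> List[float]:
    style_sequence = style_extractor(reference_mel)          # S210
    mel_sequence = synthesizer(target_text, style_sequence)  # S220
    return vocoder(mel_sequence)                             # S230


def stub_extractor(reference_mel):
    """Stub: one style value per reference frame (not a trained network)."""
    return [[sum(frame) / len(frame)] for frame in reference_mel]


def stub_synthesizer(text, style_sequence):
    """Stub: one frame per character, carrying the average style level."""
    style_level = sum(v[0] for v in style_sequence) / len(style_sequence)
    return [[float(ord(ch) % 5), style_level] for ch in text]


def stub_vocoder(mel_sequence, hop=2):
    """Stub: emit `hop` samples per frame (not Griffin-Lim or a neural vocoder)."""
    return [frame[0] + frame[1] for frame in mel_sequence for _ in range(hop)]


# The reference carries the style; the text to be spoken is supplied separately.
waveform = synthesize_with_style(
    [[0.0, 2.0], [4.0, 6.0]], "hi", stub_extractor, stub_synthesizer, stub_vocoder
)
```

Because the reference speech only enters through the style extractor, its spoken content never has to match the text being synthesized.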

As described above, according to the style speech synthesis apparatus 10 and speech synthesis method using an utterance style encoding network proposed in the present invention, other utterances can be spoken in a similar utterance style from only a single reference speech, so the invention can be useful for personalized speech synthesis.

In addition, because the style extractor 100, which extracts the speaker's style from speech, is trained through unsupervised learning, a style can be extracted from speech and learned without defining styles or classifying the training data by style. This saves the time and cost of classifying speech data, makes it easy to use large amounts of speech data, and allows a high-quality speech synthesis model to be trained at low cost.

Furthermore, synthesized speech of a similar style can be generated from the speech of just a single sentence. Compared with existing adaptation techniques, which reflected a style based on at least several minutes to several hours of speech, the style can be reflected with a very small amount of speech, so anyone can generate synthesized speech in that style by recording a single sentence, without building a large database.

In addition, the style speech synthesis apparatus 10 and speech synthesis method using an utterance style encoding network proposed in the present invention can not only replace existing speech synthesis services but also enable user-customized speech synthesis, and so have the potential to expand the existing market. Representative applications include media production that requires speech in various styles from multiple speakers, AI assistants capable of expressing emotion, and audiobooks read in different tones depending on the content of the book.

Meanwhile, the proposed apparatus and method are a general technique that can be applied to existing products and services that use general speech synthesis technology. Their distinguishing feature is that a desired utterance style (speaker characteristics, emotion, and so on) can be selected and reflected during speech synthesis, which allows existing products and services to provide synthesized speech that is more natural and more varied in style.

In addition, the proposed apparatus and method extract the speaker characteristics, emotion, and utterance style of the speech given as a reference and newly synthesize speech that reflects them. A single speech synthesis model can therefore synthesize speech that imitates the utterance style of an arbitrary speaker, which gives the invention high technical significance and business potential.

Beyond speech synthesis, the style speech synthesis apparatus 10 and speech synthesis method using an utterance style encoding network proposed in the present invention may also be used in other domains that learn data expressed as sequences, such as machine language learning and image synthesis. Commercially, they may be used directly in personalized speech synthesis, smart agents, entertainment, and the like.

Meanwhile, the present invention may include a computer-readable medium containing program instructions for performing operations implemented on various communication terminals. For example, the computer-readable medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM and DVD; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Such a computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured to implement the present invention, or may be known and available to those skilled in computer software. For example, they may include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

As such, the style speech synthesis apparatus 10 and speech synthesis method using an utterance style encoding network proposed in the present invention are implemented by a computer. They can be produced as a program applied to a machine and mass-produced through program distribution, and because they can replace the speech synthesis technology applied to existing products and services, they should also be easy to apply industrially.

The present invention described above may be variously modified or applied by a person of ordinary skill in the art to which the invention pertains, and the scope of the technical idea of the invention should be determined by the claims below.

10: Style speech synthesis apparatus according to the present invention
100: Style extractor
200: End-to-end speech synthesizer
300: Vocoder
400: Database
S100: Step of learning the style extractor and the end-to-end speech synthesizer through joint training
S110: Step of training the end-to-end speech synthesizer on the text-speech pair training data, with the text as the input and the mel-spectrogram of the speech paired with the input text as the target output
S120: Step of training the style extractor through unsupervised learning with the mel-spectrogram of the target output as the input
S200: Step of synthesizing the synthesis target text into speech in the synthesis target style using the style extractor and the end-to-end speech synthesizer learned through joint training
S210: Step in which the style extractor receives, as the reference speech, speech that reflects the synthesis target style and differs from the synthesis target text, and outputs a variable-length style vector sequence
S220: Step in which the end-to-end speech synthesizer receives the variable-length style vector sequence output by the style extractor as an input and outputs a mel-spectrogram sequence corresponding to the synthesis target text
S230: Step in which the vocoder converts the mel-spectrogram sequence output by the end-to-end speech synthesizer into a speech waveform and outputs it

Claims

1. A style speech synthesis apparatus (10) using an utterance style encoding network, the apparatus being a speech synthesis apparatus comprising:
a style extractor (100) that, based on an artificial neural network, receives a reference speech as an input and outputs a variable-length style vector sequence;
an end-to-end speech synthesizer (200) that receives the variable-length style vector sequence output by the style extractor (100) as an input and outputs a mel-spectrogram sequence corresponding to a text input; and
a vocoder (300) that converts the mel-spectrogram sequence output by the end-to-end speech synthesizer (200) into a speech waveform and outputs it,
wherein the style extractor (100) and the end-to-end speech synthesizer (200) are learned through joint training.

2. The style speech synthesis apparatus (10) using an utterance style encoding network of claim 1, wherein the variable-length style vector sequence varies in length according to the length of the reference speech received as an input and, as a latent variable for the reference speech, includes style information of the reference speech.

3. The style speech synthesis apparatus (10) using an utterance style encoding network of claim 1, further comprising a database (400) that stores, as training data, text-speech pairs in which style elements are reflected.

4. The style speech synthesis apparatus (10) using an utterance style encoding network of claim 3, wherein
the end-to-end speech synthesizer (200) is trained on the text-speech pair training data, with a text as the input and the mel-spectrogram of the speech paired with the input text as the target output, and
the style extractor (100) is trained through unsupervised learning with the mel-spectrogram of the target output as its input.

5. The style speech synthesis apparatus (10) using an utterance style encoding network of claim 4, wherein
a synthesis target text is synthesized into speech in a synthesis target style using the style extractor (100) and the end-to-end speech synthesizer (200) learned through the joint training,
the style extractor (100) receives, as a reference speech, speech that reflects the synthesis target style and differs from the synthesis target text, and outputs a variable-length style vector sequence, and
the end-to-end speech synthesizer (200) receives the variable-length style vector sequence output by the style extractor (100) as an input and outputs a mel-spectrogram sequence corresponding to the synthesis target text.

6. The style speech synthesis apparatus (10) using an utterance style encoding network of claim 1, wherein the style extractor (100) is a style encoder including a one-dimensional convolutional neural network (CNN) and a gated recurrent unit (GRU).

7. The style speech synthesis apparatus (10) using an utterance style encoding network of claim 1, wherein the end-to-end speech synthesizer (200) is any one selected from the group of autoregressive models including Tacotron 2 and Transformer-TTS.

8. A style speech synthesis method using an utterance style encoding network, the method being a speech synthesis method in which each step is performed by a computer, comprising:
(1) using text-speech pair training data in which style elements are reflected, learning through joint training a style extractor (100) that, based on an artificial neural network, receives a reference speech as an input and outputs a variable-length style vector sequence, and an end-to-end speech synthesizer (200) that receives the variable-length style vector sequence output by the style extractor (100) as an input and outputs a mel-spectrogram sequence corresponding to a text input; and
(2) synthesizing a synthesis target text into speech in a synthesis target style using the style extractor (100) and the end-to-end speech synthesizer (200) learned through the joint training.

9. The style speech synthesis method using an utterance style encoding network of claim 8, wherein the variable-length style vector sequence varies in length according to the length of the reference speech received as an input and, as a latent variable for the reference speech, includes style information of the reference speech.

10. The style speech synthesis method using an utterance style encoding network of claim 8, wherein step (1) comprises:
(1-1) training the end-to-end speech synthesizer (200) on the text-speech pair training data, with a text as the input and the mel-spectrogram of the speech paired with the input text as the target output; and
(1-2) training the style extractor (100) through unsupervised learning with the mel-spectrogram of the target output as the input,
whereby the style extractor (100) and the end-to-end speech synthesizer (200) are learned through joint training.

11. The style speech synthesis method using an utterance style encoding network of claim 10, wherein step (2) comprises:
(2-1) receiving, by the style extractor (100), as a reference speech, speech that reflects the synthesis target style and differs from the synthesis target text, and outputting a variable-length style vector sequence;
(2-2) receiving, by the end-to-end speech synthesizer (200), the variable-length style vector sequence output by the style extractor (100) as an input, and outputting a mel-spectrogram sequence corresponding to the synthesis target text; and
(2-3) converting, by the vocoder (300), the mel-spectrogram sequence output by the end-to-end speech synthesizer (200) into a speech waveform and outputting it.