KR20220023381A

KR20220023381A - Method and apparatus for Speech Synthesis Containing Emotional Rhymes with Scarce Speech Data of a Single Speaker

Info

Publication number: KR20220023381A
Application number: KR1020200105049A
Authority: KR
Inventors: 이수영; 조성재; 김태호; 박세직
Original assignee: 한국과학기술원
Priority date: 2020-08-21
Filing date: 2020-08-21
Publication date: 2022-03-02
Also published as: KR102426020B1

Abstract

Disclosed are a method and an apparatus for synthesizing speech containing emotional prosody from a small amount of speech data of a single speaker. According to one embodiment, a speech synthesis method using a speech synthesis apparatus implemented as an electronic device comprises: training a speaker encoder, which extracts features of a speaker's voice, on a large amount of multi-speaker data; training at least one of a text encoder, a prosody encoder, and a residual encoder on at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data; finding a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder; and outputting emotional speech in the voice of the speaker of neutral speaker data by using the voice representation of that speaker as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized. The invention can thereby synthesize the voice of a desired person in a short time.

Description

Method and apparatus for Speech Synthesis Containing Emotional Rhymes with Scarce Speech Data of a Single Speaker

The following embodiments relate to a speech synthesis method and, more particularly, to a method and apparatus for synthesizing speech containing emotional prosody from a small amount of speech data of a single speaker.

Speech synthesis is a core technology for implementing conversational user interfaces through artificial intelligence; it uses a computer or machine to produce sound like that uttered by a human. Conventional speech synthesis progressed from a first generation, which built waveforms by combining fixed-length units (words, syllables, and phonemes), through a second generation, which concatenated variable-length synthesis units drawn from a corpus, to a third generation. The third-generation models applied the hidden Markov model (HMM) approach, mainly used for acoustic modeling in speech recognition, to speech synthesis and achieved high-quality synthesis with a database of moderate size.

Text-to-speech (TTS) synthesis refers to converting a character string into speech. Synthesis that reflects a variety of voice variations is now possible, including the voices of different people and emotional speech that carries emotion. However, enabling such synthesis requires recording a person's voice for each emotion, which is difficult. Research is also under way on emotionally transforming the voice of a speaker for whom only neutral speech was recorded, by using it together with the voice of a speaker for whom emotional speech was recorded. At present, however, the resulting emotion is expressed only very weakly.

In particular, when an individual user wants to build a speech synthesizer that synthesizes speech in their own voice, utterance data for a large number of sentences must be secured to raise the synthesizer's performance, yet an individual user is inherently limited in how much such data they can obtain.

Shen, Jonathan, et al. "Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
Skerry-Ryan, R. J., et al. "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron." International Conference on Machine Learning. 2018.
Wang, Yuxuan, et al. "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis." International Conference on Machine Learning. 2018.
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
Arik, Sercan, et al. "Neural voice cloning with a few samples." Advances in Neural Information Processing Systems. 2018.
Jia, Ye, et al. "Transfer learning from speaker verification to multispeaker text-to-speech synthesis." Advances in Neural Information Processing Systems. 2018.

The embodiments describe a method and apparatus for synthesizing speech containing emotional prosody from a small amount of speech data of a single speaker and, more specifically, provide a technique that delivers emotional speech by synthesizing a representation of emotional prosody using a small amount of speech data of a single speaker.

The embodiments aim to provide a speech synthesis method and apparatus that synthesize and output emotional speech in which the voice of a single speaker, provided by a small amount of neutral speaker data, carries emotional prosody obtained from a large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data.

A speech synthesis method according to an embodiment, using a speech synthesis apparatus implemented as an electronic device, may comprise: training a speaker encoder, which extracts features of a speaker's voice, on a large amount of multi-speaker data; training at least one of a text encoder, a prosody encoder, and a residual encoder on at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data; finding a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder; and outputting emotional speech of the speaker of neutral speaker data by using the voice representation of that speaker as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized.
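The four steps of the method can be laid out as a small data structure to make explicit which modules and which datasets each step involves. This is only an illustrative sketch: the stage names, dataset labels, and the `Stage` class are hypothetical, and the second stage shows just one of the configurations allowed by the "at least one of" wording (all three encoders trained on all three datasets).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One step of the method: which modules it touches and which data it reads."""
    name: str
    modules: Tuple[str, ...]
    datasets: Tuple[str, ...]


ENCODERS = ("text_encoder", "prosody_encoder", "residual_encoder")

# Stage and dataset names are illustrative labels (assumptions), not identifiers
# taken from the patent text.
PIPELINE: List[Stage] = [
    Stage("train_speaker_encoder", ("speaker_encoder",), ("multi_speaker",)),
    Stage("train_encoders", ENCODERS, ("multi_speaker", "expressive", "emotional")),
    Stage("find_emotional_prosody", ENCODERS, ("emotional",)),
    Stage("synthesize_emotional_speech", ENCODERS + ("speaker_encoder",), ("neutral_speaker",)),
]


def datasets_required(pipeline: List[Stage]) -> List[str]:
    """Return every dataset the pipeline reads, in order of first use."""
    seen: List[str] = []
    for stage in pipeline:
        for dataset in stage.datasets:
            if dataset not in seen:
                seen.append(dataset)
    return seen


def stages_using(pipeline: List[Stage], dataset: str) -> List[str]:
    """Return the names of the stages that read the given dataset."""
    return [stage.name for stage in pipeline if dataset in stage.datasets]
```

Laid out this way, the small neutral speaker dataset appears only in the final stage, which mirrors the point of the method: everything expensive is learned from the large datasets beforehand.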

The training on at least one of the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data may train each of the text encoder, the prosody encoder, and the residual encoder using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

The finding of the representation of the emotional prosody of the emotional speech data may feed samples of the emotional speech data as input to the trained text encoder, prosody encoder, and residual encoder and find a statistical representation of how the emotional prosody is represented by the prosody encoder.
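The text does not say which statistic is used for the "statistical representation". One simple reading, assumed here purely for illustration, is the per-emotion mean of the prosody encoder's outputs over the emotional speech samples. In the sketch below, `emotion_prosody_centroids` and `toy_prosody_encoder` are hypothetical names, and the toy encoder merely stands in for a trained prosody encoder.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Vector = List[float]


def emotion_prosody_centroids(
    samples: Iterable[Tuple[str, Sequence[float]]],
    prosody_encoder: Callable[[Sequence[float]], Vector],
) -> Dict[str, Vector]:
    """Average the prosody embeddings of all samples that share an emotion label.

    Assumption: the "statistical representation" is taken to be a mean vector.
    """
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = defaultdict(int)
    for emotion, audio_features in samples:
        embedding = prosody_encoder(audio_features)
        if emotion not in sums:
            sums[emotion] = [0.0] * len(embedding)
        sums[emotion] = [s + e for s, e in zip(sums[emotion], embedding)]
        counts[emotion] += 1
    return {emo: [s / counts[emo] for s in vec] for emo, vec in sums.items()}


def toy_prosody_encoder(features: Sequence[float]) -> Vector:
    """Toy stand-in for a trained prosody encoder: mean and range of the input."""
    return [sum(features) / len(features), max(features) - min(features)]


toy_samples = [
    ("happy", [1.0, 3.0]),
    ("happy", [2.0, 6.0]),
    ("sad", [0.0, 1.0]),
]
centroids = emotion_prosody_centroids(toy_samples, toy_prosody_encoder)
```

The resulting table maps each emotion label to one vector, which is the form needed later when a desired emotion is selected for synthesis.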

The outputting of the emotional speech of the speaker of the neutral speaker data may comprise: adapting part of the trained text encoder, prosody encoder, and residual encoder using the neutral speaker data to find the voice representation of the speaker of the neutral speaker data; and outputting the emotional speech of the speaker of the neutral speaker data by supplying that voice representation as the output value of the speaker encoder and by supplying a representation of the emotional prosody to be synthesized, selected from the representations found in the step of finding the representation of the emotional prosody of the emotional speech data.
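How the speaker's voice representation and the selected prosody representation are combined is not specified in this passage. A common choice in Tacotron-style systems is to concatenate both vectors onto every text-encoder state before decoding; the sketch below assumes that choice. The function name `condition_text_states` and the `prosody_table` layout are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def condition_text_states(
    text_states: Sequence[Sequence[float]],
    speaker_embedding: Sequence[float],
    prosody_table: Dict[str, Vector],
    emotion: str,
) -> List[Vector]:
    """Attach the speaker vector and the selected emotion's prosody vector
    to each text-encoder state (assumed combination: concatenation)."""
    if emotion not in prosody_table:
        raise KeyError(f"no prosody representation stored for emotion {emotion!r}")
    prosody_embedding = prosody_table[emotion]
    return [
        list(state) + list(speaker_embedding) + list(prosody_embedding)
        for state in text_states
    ]
```

Under this assumption, changing the target emotion only swaps one vector looked up from the table, while the speaker vector obtained from the neutral speaker data stays fixed.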

The outputting of the emotional speech of the speaker of the neutral speaker data may, once the representation of the emotional prosody has been found in the text encoder, prosody encoder, and residual encoder trained on the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data, use the small amount of neutral speaker data to synthesize and output emotional speech that carries the emotional prosody in the voice of the single speaker.

The outputting of the emotional speech of the speaker of the neutral speaker data may synthesize and output emotional speech in which the voice of the single speaker, provided by the small amount of neutral speaker data, carries the emotional prosody obtained from the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

A speech synthesis apparatus according to another embodiment may comprise: a learning unit that trains a speaker encoder, which extracts features of a speaker's voice, on a large amount of multi-speaker data, and trains at least one of a text encoder, a prosody encoder, and a residual encoder on at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data; an emotional prosody representation unit that finds a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder; and an emotional speech synthesis unit that outputs emotional speech of the speaker of neutral speaker data by using the voice representation of that speaker as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized.

The learning unit may train each of the text encoder, the prosody encoder, and the residual encoder using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

The emotional prosody representation unit may feed samples of the emotional speech data as input to the trained text encoder, prosody encoder, and residual encoder and find a statistical representation of how the emotional prosody is represented by the prosody encoder.

The emotional speech synthesis unit may adapt part of the trained text encoder, prosody encoder, and residual encoder using the neutral speaker data to find the voice representation of the speaker of the neutral speaker data, supply that voice representation as the output value of the speaker encoder, supply a selected representation of the emotional prosody to be synthesized, and output the emotional speech of the speaker of the neutral speaker data.

The emotional speech synthesis unit may, once the representation of the emotional prosody has been found in the text encoder, prosody encoder, and residual encoder trained on the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data, use the small amount of neutral speaker data to synthesize and output emotional speech that carries the emotional prosody in the voice of the single speaker.

The emotional speech synthesis unit may synthesize and output emotional speech in which the voice of the single speaker, provided by the small amount of neutral speaker data, carries the emotional prosody obtained from the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

According to the embodiments, it is possible to provide a speech synthesis method and apparatus that synthesize and output emotional speech in which the voice of a single speaker, provided by a small amount of neutral speaker data, carries emotional prosody obtained from a large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data.

According to the embodiments, by providing emotional speech that carries a representation of emotional prosody using a small amount of speech data of a single speaker, such as the voice of an ordinary person, it is possible to provide a speech synthesis method and apparatus that not only synthesize the voice of a desired person in a short time but also add a representation of emotional prosody to that voice in a short time to deliver emotional speech.

FIG. 1 is a diagram illustrating an electronic device according to embodiments.
FIG. 2 is a block diagram illustrating an apparatus for synthesizing speech containing emotional prosody from a small amount of speech data of a single speaker, according to an embodiment.
FIG. 3 is a flowchart illustrating a method of synthesizing speech containing emotional prosody from a small amount of speech data of a single speaker, according to an embodiment.
FIG. 4 is a diagram for explaining the operation of the speech synthesis apparatus and the datasets according to an embodiment.
FIG. 5 is a diagram for explaining the speech synthesis method and the use of the datasets according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to a person of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

Current speech synthesis technology can synthesize speech that reflects a variety of voice variations, including the voices and emotions of different people, but it requires recording a person's voice for each emotion, which is difficult.

Synthesizing speech containing emotional prosody in a speaker's voice requires a large amount of that speaker's speech containing emotional prosody, and it is difficult to synthesize the speaker's voice together with emotional prosody from only a small amount of speech. In addition, for a model to learn a new speaker's voice, the model must have been trained in advance on the voices of various speakers. The model must also be able to synthesize the new speaker's voice without degrading the quality of the voices it has already learned.

The following embodiments relate to a method of synthesizing speech containing emotional prosody from a small amount of speech data of a single speaker, and provide a technique that delivers emotional speech by synthesizing a representation of emotional prosody using a small amount of speech data of a single speaker. The embodiments can synthesize and output emotional speech in which the voice of a single speaker, provided by a small amount of neutral speaker data, carries emotional prosody obtained from a large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data.

FIG. 1 is a diagram illustrating an electronic device according to embodiments.

Referring to FIG. 1, the electronic device 100 according to embodiments may include at least one of an input module 110, an output module 120, a memory 130, and a processor 140.

The input module 110 may receive, from outside the electronic device 100, commands or data to be used by components of the electronic device 100. The input module 110 may include at least one of an input device configured to let a user enter commands or data directly into the electronic device 100 and a communication device configured to receive commands or data by communicating with an external electronic device over a wired or wireless connection. For example, the input device may include at least one of a microphone, a mouse, a keyboard, and a camera. For example, the communication device may include at least one of a wired communication device and a wireless communication device, and the wireless communication device may include at least one of a short-range communication device and a long-range communication device.

The output module 120 may provide information to the outside of the electronic device 100. The output module 120 may include at least one of an audio output device configured to output information audibly, a display device configured to output information visually, and a communication device configured to transmit information by communicating with an external electronic device over a wired or wireless connection. For example, the communication device may include at least one of a wired communication device and a wireless communication device, and the wireless communication device may include at least one of a short-range communication device and a long-range communication device.

The memory 130 may store data used by components of the electronic device 100. The data may include input data or output data for a program or for instructions related to it. For example, the memory 130 may include at least one of a volatile memory and a non-volatile memory.

The processor 140 may execute a program in the memory 130 to control components of the electronic device 100 and may perform data processing or computation. The processor 140 may include a speech synthesis unit and a speech recognition unit and, depending on the embodiment, may further include a learning unit and a fine-tuning unit. Through these units, the processor 140 may perform iterative learning of speech emotion recognition and synthesis.

FIG. 2 is a block diagram illustrating an apparatus for synthesizing speech containing emotional prosody from a small amount of speech data of a single speaker, according to an embodiment.

Referring to FIG. 2, the speech synthesis apparatus 200 according to an embodiment, which synthesizes speech containing emotional prosody from a small amount of speech data of a single speaker, may include a learning unit 210, an emotional prosody representation unit 220, and an emotional speech synthesis unit 230. The speech synthesis apparatus 200 may be included in the processor 140 of FIG. 1.

The learning unit 210 may train a speaker encoder, which extracts features of a speaker's voice, on a large amount of multi-speaker data, and may train at least one of a text encoder, a prosody encoder, and a residual encoder on at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data.

The emotional prosody representation unit 220 may find a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder.

The emotional speech synthesis unit 230 may output emotional speech of the speaker of the neutral speaker data by using the voice representation of that speaker as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized.

The approach adopted here uses text-to-speech (TTS) synthesis. TTS converts text or speech information into a speech waveform, and seq2seq-based TTS has been studied extensively. TTS is closely related to voice conversion (VC): the two differ only in their input domain, and the decoder plays the same role in both, converting speech information into acoustic features. The embedding space of TTS is highly correlated with speech information, and VC is expected to learn an embedding space close to that of TTS through multitask learning. In this embodiment, TTS is used to supply speech information to VC in order to improve performance.

According to the embodiments, this task can be extended to emotional voice conversion. Given a style reference utterance, the style encoder extracts only the emotional information and discards the linguistic content. Because the style encoder is designed to extract emotion regardless of linguistic content, it can handle multiple input style domains. When the extracted emotion is injected into the decoder, a variety of emotions can be generated. The proposed model can therefore handle many-to-many emotional speech synthesis.
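One consequence of extracting emotion "regardless of linguistic content" is that a style encoder must map a reference utterance of any length to a fixed-size vector. Real style encoders use learned layers; the toy below uses plain mean pooling over time, an assumption made only to show that shape contract. The name `pool_style_embedding` is hypothetical.

```python
from typing import List, Sequence


def pool_style_embedding(mel_frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool spectrogram frames over time into one fixed-length vector.

    Toy stand-in (assumption) for a learned style encoder: the output size
    depends only on the number of mel bins, never on the utterance length.
    """
    num_frames = len(mel_frames)
    if num_frames == 0:
        raise ValueError("reference utterance has no frames")
    num_bins = len(mel_frames[0])
    return [
        sum(frame[i] for frame in mel_frames) / num_frames
        for i in range(num_bins)
    ]


short_reference = [[1.0, 2.0], [3.0, 4.0]]
long_reference = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
short_style = pool_style_embedding(short_reference)
long_style = pool_style_embedding(long_reference)
```

Both references yield a two-element vector even though one has twice as many frames, so the decoder can be conditioned the same way for any reference utterance.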

Voice conversion (VC) changes a person's voice into a different style while preserving the linguistic content. VC may be based on a sequence-to-sequence (seq2seq) model, and speech synthesis may be performed using multitask learning with a text-to-speech (TTS) module. The embedding of a seq2seq-based TTS module carries rich information about the text. The role of the TTS decoder is to convert the embedding space into speech, as in VC. In the proposed model, the entire network is trained to minimize the losses of both the VC and TTS modules. Through multitask learning, VC is expected to capture more linguistic information and to maintain training stability.
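The joint objective can be sketched as a sum of the two task losses. The text only says that the network minimizes the losses of the VC and TTS modules; the L1 distance on spectrogram frames and the `tts_weight` parameter below are assumptions chosen for illustration, and both function names are hypothetical.

```python
from typing import Sequence

Frames = Sequence[Sequence[float]]


def l1_loss(predicted: Frames, target: Frames) -> float:
    """Mean absolute error between two spectrograms of identical shape."""
    if len(predicted) != len(target):
        raise ValueError("frame counts differ")
    total = 0.0
    count = 0
    for p_frame, t_frame in zip(predicted, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("frame sizes differ")
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("empty spectrogram")
    return total / count


def multitask_loss(
    vc_predicted: Frames,
    tts_predicted: Frames,
    target: Frames,
    tts_weight: float = 1.0,
) -> float:
    """Joint objective: VC loss plus a weighted TTS loss on the same target."""
    return l1_loss(vc_predicted, target) + tts_weight * l1_loss(tts_predicted, target)
```

Because both branches share the decoder, gradients from the TTS term also shape the representation that the VC branch uses, which is the mechanism behind the expectation that VC captures more linguistic information.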

For example, the speech synthesis method in the speech synthesis unit 310 may comprise: performing voice conversion (VC) when the input pair is the log mel spectrogram of an utterance that carries the linguistic content and the log mel spectrogram of a style reference utterance; performing text-to-speech (TTS) synthesis when the input pair is one-hot encoded text and the log mel spectrogram of a style reference utterance; mapping both the log mel spectrogram that carries the linguistic content and the one-hot encoded text into the same space and then decoding them into a mel spectrogram; and obtaining a linear spectrum from the decoded mel spectrogram through a preprocessing unit. According to the embodiments, given a style reference utterance, the style encoder extracts only the emotional information and discards the linguistic content; because it is designed to extract emotion regardless of linguistic content, it handles multiple input style domains, and injecting the extracted emotion into the decoder generates a variety of emotions, so that many-to-many emotional voice conversion is handled.
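The choice between the VC path and the TTS path depends only on the type of the content input. The sketch below illustrates that routing and the one-hot text encoding; it does not implement the shared embedding space or the decoder. The function names and the use of a Python `str` to represent text input are assumptions made for illustration.

```python
from typing import List, Sequence, Union

MelFrames = Sequence[Sequence[float]]


def one_hot_encode(text: str, alphabet: str) -> List[List[int]]:
    """Encode each character as a one-hot row over the given alphabet."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded: List[List[int]] = []
    for ch in text:
        if ch not in index:
            raise ValueError(f"character {ch!r} is not in the alphabet")
        row = [0] * len(alphabet)
        row[index[ch]] = 1
        encoded.append(row)
    return encoded


def select_task(content_input: Union[str, MelFrames]) -> str:
    """Pick the branch: text content means TTS, spectrogram content means VC."""
    return "tts" if isinstance(content_input, str) else "vc"
```

In both branches the second element of the input pair is the same kind of object, the log mel spectrogram of the style reference, so only the content encoder differs between the two tasks.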

According to embodiments, the voice of an ordinary person can be synthesized easily. For example, if a conversational agent speaks in the voice of a close acquaintance rather than that of a voice actor, the conversation feels familiar and the psychological distance from the agent may be reduced.

In addition, according to embodiments, emotional prosody can be expressed in the synthesized voice. For example, if the conversational agent's voice carries an emotion appropriate to the situation, the conversation feels natural and the psychological distance from the agent may be reduced.

In addition, because it is difficult for an ordinary person to record speech for a long time, that person's voice must be synthesized from a short recording. According to embodiments, a desired person's voice can be synthesized within a short time.

In addition, because it is difficult for ordinary people to express rich emotion through prosody, the emotional prosody expressed by actors needs to be transferred to the voice of an ordinary person for synthesis. According to embodiments, an emotional voice can be provided by synthesizing the representation of the emotional prosody into the voice within a short time.

In addition, synthesizing speech with emotional prosody in one speaker's voice generally requires a large amount of that speaker's emotional speech. According to embodiments, an emotional voice can be synthesized and output in which the voice of one speaker, provided from a small amount of neutral speaker data, carries the emotional prosody obtained from a large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data.

Hereinafter, a speech synthesis device that produces speech with emotional prosody from a small amount of speech data of a single speaker according to an embodiment is described in more detail.

FIG. 3 is a flowchart illustrating a speech synthesis method that produces speech with emotional prosody from a small amount of speech data of a single speaker according to an embodiment.

Referring to FIG. 3, a speech synthesis method using a speech synthesis device implemented as an electronic device according to an embodiment may include: training a speaker encoder, which extracts features of a speaker's voice, using a large amount of multi-speaker data (S110); training at least one of a text encoder, a prosody encoder, and a residual encoder using at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data (S120); finding a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder (S130); and outputting an emotional voice of the speaker of neutral speaker data by using the voice representation of that speaker as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized (S140).
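The four steps above can be written down as a small plan that records which components and which datasets each step touches. The sketch below is illustrative only: the `Step` record, the `PIPELINE` tuple, the dataset keys, and `steps_using` are hypothetical names introduced here, and the plan lists the broadest case (all three encoders and all three datasets in S120).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    label: str                   # step label from FIG. 3
    action: str                  # what the step does (hypothetical keyword)
    components: Tuple[str, ...]  # encoders involved in the step
    datasets: Tuple[str, ...]    # datasets consumed by the step


ENCODERS = ("text_encoder", "prosody_encoder", "residual_encoder")

PIPELINE = (
    Step("S110", "train", ("speaker_encoder",), ("multi_speaker",)),
    Step("S120", "train", ENCODERS, ("multi_speaker", "expressive", "emotional")),
    Step("S130", "find_emotional_prosody", ENCODERS, ("emotional",)),
    Step("S140", "synthesize", ("speaker_encoder",), ("neutral_speaker",)),
)


def steps_using(dataset: str) -> List[str]:
    """Return the labels of the steps that consume the given dataset, in order."""
    return [step.label for step in PIPELINE if dataset in step.datasets]
```

For instance, `steps_using("emotional")` lists the training step S120 and the representation-finding step S130, which reflects that the emotional speech data serves both purposes.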

Hereinafter, each step of the speech synthesis method that produces speech with emotional prosody from a small amount of speech data of a single speaker according to an embodiment is described by way of example.

The speech synthesis method according to an embodiment is described in more detail below using, as an example, the speech synthesis device according to an embodiment described with reference to FIG. 2. As described above, the speech synthesis device 200, which produces speech with emotional prosody from a small amount of speech data of a single speaker, may include a learning unit 210, an emotional prosody representation unit 220, and an emotional speech synthesis unit 230. Here, speech may be synthesized using Tacotron 2 of Non-Patent Document 1, which is based on the Seq2Seq model of Non-Patent Document 4 and the attention mechanism of Non-Patent Document 5.

In step S110, the learning unit 210 may train a speaker encoder, which extracts features of a speaker's voice, using a large amount of multi-speaker data.

In step S120, the learning unit 210 may train at least one of a text encoder, a prosody encoder, and a residual encoder using at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data. In particular, the learning unit 210 may train each of the text encoder, the prosody encoder, and the residual encoder using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

In step S130, the emotional prosody representation unit 220 may find a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder. For example, the emotional prosody representation unit 220 may feed samples of the emotional speech data into the trained text encoder, prosody encoder, and residual encoder and find a statistical representation of how the emotional prosody is represented by the prosody encoder.
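One simple way to picture such a statistical representation is a per-emotion average of prosody-encoder outputs. The sketch below assumes that the statistic is the mean, which the original does not specify, and uses plain lists of floats as stand-ins for prosody embeddings; the function name `emotion_prosody_statistics` is hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Sequence, Tuple


def emotion_prosody_statistics(
    samples: Sequence[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Average the prosody embeddings of all samples sharing an emotion label.

    Each sample is (emotion_label, prosody_embedding), where the embedding
    stands in for the prosody encoder's output on one emotional utterance.
    """
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for emotion, embedding in samples:
        grouped[emotion].append(embedding)
    # zip(*embeddings) iterates over dimensions, so fmean averages each dimension.
    return {
        emotion: [fmean(dimension) for dimension in zip(*embeddings)]
        for emotion, embeddings in grouped.items()
    }
```

Other statistics, such as a per-dimension variance or a cluster centroid, would fit the same interface.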

Here, the speaker's voice identity component and the emotional prosody component may be extracted using the methods of Non-Patent Documents 2 and 3.

In step S140, the emotional speech synthesis unit 230 may output an emotional voice of the speaker of the neutral speaker data by using the voice representation of that speaker as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized.

More specifically, the emotional speech synthesis unit 230 may adapt a part of the trained text encoder, prosody encoder, and residual encoder using the neutral speaker data to find the voice representation of the speaker of the neutral speaker data. Here, the emotional speech synthesis unit 230 may adapt the speech synthesis device to a small amount of speech data of a single speaker using the methods of Non-Patent Documents 6 and 7. That is, using those methods, the emotional speech synthesis unit 230 may adapt a part of the trained text encoder, prosody encoder, and residual encoder with the neutral speaker data.
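Adapting only a part of a trained model can be pictured as updating a chosen subset of parameter groups while leaving the rest frozen. The sketch below is a generic illustration and does not reproduce the methods of Non-Patent Documents 6 and 7: the plain gradient-descent update, the parameter-group names, and the function `adapt_subset` are all assumptions, and the original does not say which parts are adapted.

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]


def adapt_subset(
    params: Params, grads: Params, trainable: Iterable[str], lr: float = 0.1
) -> Params:
    """Apply one gradient-descent step to the trainable groups only.

    Groups not named in `trainable` are copied unchanged, which models
    freezing them during adaptation on the small neutral speaker dataset.
    """
    trainable_names = set(trainable)
    updated: Params = {}
    for name, values in params.items():
        if name in trainable_names:
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: keep the trained weights
    return updated
```

Freezing most of the model is a common way to limit overfitting when only a few utterances of the target speaker are available.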

The emotional speech synthesis unit 230 may then supply the voice representation of the speaker of the neutral speaker data as the output value of the speaker encoder, supply the representation of the emotional prosody to be synthesized, selected from those found in the step of finding the representation of the emotional prosody of the emotional speech data, and output an emotional voice of the speaker of the neutral speaker data. Here, the emotional speech synthesis unit 230 may transfer prosody to the voice (prosody transfer) using the methods of Non-Patent Documents 2 and 3.
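As a rough picture of this step, the speaker representation and the selected emotional prosody representation can be attached to every text-embedding frame before decoding. Concatenation is an assumption made here for illustration; the original does not state how the representations are combined, and the names `build_decoder_inputs` and `emotion_table` are hypothetical.

```python
from typing import Dict, List, Sequence


def build_decoder_inputs(
    text_embeddings: Sequence[Sequence[float]],
    speaker_embedding: Sequence[float],
    emotion_table: Dict[str, Sequence[float]],
    emotion: str,
) -> List[List[float]]:
    """Concatenate the speaker and selected emotion vectors onto each text frame.

    `speaker_embedding` stands in for the speaker-encoder output of the neutral
    speaker; `emotion_table` holds the emotional prosody representations found
    earlier, keyed by emotion label.
    """
    if emotion not in emotion_table:
        raise KeyError(f"no prosody representation for emotion {emotion!r}")
    prosody = list(emotion_table[emotion])
    speaker = list(speaker_embedding)
    return [list(frame) + speaker + prosody for frame in text_embeddings]
```

Swapping the `emotion` key changes the prosody while the speaker vector, and therefore the voice identity, stays fixed.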

Because the representation of the emotional prosody is found with the text encoder, prosody encoder, and residual encoder trained using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data, the emotional speech synthesis unit 230 can use a small amount of neutral speaker data to synthesize and output an emotional voice that carries emotional prosody in the voice of one speaker. That is, the emotional speech synthesis unit 230 can synthesize and output an emotional voice in which the voice of one speaker, provided from the small amount of neutral speaker data, carries the emotional prosody obtained from the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

According to embodiments, by providing an emotional voice with emotional prosody from a small amount of speech data of a single speaker, such as an ordinary person, it is possible not only to synthesize a desired person's voice within a short time but also to provide an emotional voice by synthesizing the representation of the emotional prosody into that voice within a short time.

FIG. 4 is a diagram for explaining the operation of a speech synthesis device and datasets according to an embodiment.

FIG. 4 shows the operation of the speech synthesis device 400 and the datasets according to an embodiment, and schematically illustrates the inputs and outputs of each component of the speech synthesis device 400.

The speech synthesis device 400 may learn from a large amount of multi-speaker data (DB1, 401) to extract the features of a speaker's voice, that is, the speaker's voice identity component. More specifically, the speech synthesis device 400 may train the speaker encoder 430, which extracts features of a speaker's voice, using the large amount of multi-speaker data (DB1, 401).

The speech synthesis device 400 may also extract prosodic speech features using a large dataset with expressive prosody. More specifically, the speech synthesis device 400 may train the text encoder 410, the prosody encoder 420, and the residual encoder 440 using the large amount of multi-speaker data (DB1, 401), the large amount of expressive speech data (DB2, 402), and the emotional speech data (DB3, 403).

Here, text 411 and speech 412 information may be provided from the large amount of multi-speaker data (DB1, 401), the large amount of expressive speech data (DB2, 402), the emotional speech data (DB3, 403), and the small amount of neutral speaker data (DB4, 404). The text encoder 410 may use the text 411 information, while the speaker encoder 430, the prosody encoder 420, and the residual encoder 440 may use the speech 412 information.
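The routing of the two kinds of information to the four encoders can be summarized in a small lookup. This is a minimal sketch: the dictionary `ENCODER_INPUT`, the sample keys `"text"` and `"speech"`, and the function `route_inputs` are hypothetical names used only for illustration.

```python
from typing import Any, Dict

# Which modality each encoder consumes, following the description of FIG. 4.
ENCODER_INPUT = {
    "text_encoder": "text",
    "speaker_encoder": "speech",
    "prosody_encoder": "speech",
    "residual_encoder": "speech",
}


def route_inputs(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Map one dataset sample with 'text' and 'speech' fields to encoder inputs."""
    return {encoder: sample[modality] for encoder, modality in ENCODER_INPUT.items()}
```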

The speech synthesis device 400 may find the representation of the emotional prosody of the emotional speech data through the trained text encoder 410, prosody encoder 420, and residual encoder 440. Then, using the voice representation of the speaker of the small amount of neutral speaker data (DB4, 404) as the output value of the speaker encoder, and selecting and synthesizing the representation of the emotional prosody to be synthesized, the device may output an emotional voice 413 of the speaker of the small amount of neutral speaker data (DB4, 404).

Here, text-to-speech synthesis (TTS) refers to a module that receives a character string as input and outputs a speech signal; the string may be input in various units, such as phonemes or syllables. The input is converted into one-hot vectors and then mapped to character embeddings. The mapped embeddings are converted into text embeddings by the text encoder 410, and the text embeddings may be converted into a Mel spectrogram by a CNN or RNN decoder. At each decoding time step, the decoder uses attention to determine which part of the text embeddings to focus on. The parameters of the modules constituting the TTS may be updated using the L2 distance loss between the predicted Mel spectrogram and the ground-truth Mel spectrogram. The Mel spectrogram is converted into a speech signal by a vocoder, which may be either trainable or rule-based.
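Three of the building blocks named above (one-hot input mapped to an embedding, attention over text embeddings, and an L2 loss on Mel spectrograms) can be sketched with plain Python lists. The sketch makes simplifying assumptions that are not in the original: attention is plain dot-product attention with a softmax, and the L2 loss is averaged over all frames and Mel bins. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of the given size with a 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def embed(one_hot_vector: Sequence[float], table: Sequence[Sequence[float]]) -> List[float]:
    """Multiply a one-hot row vector by the embedding table (a row lookup)."""
    dims = len(table[0])
    return [
        sum(one_hot_vector[i] * table[i][d] for i in range(len(table)))
        for d in range(dims)
    ]


def attention_weights(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Softmax over dot-product scores: how much each text position is attended to."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def l2_loss(predicted: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between predicted and ground-truth Mel frames."""
    total, count = 0.0, 0
    for predicted_frame, target_frame in zip(predicted, target):
        for p, t in zip(predicted_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In a trained system the embedding table, the query, and the keys would be learned tensors, and the loss would be minimized by backpropagation through all TTS modules.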

FIG. 5 is a diagram for explaining a speech synthesis method and its use of datasets according to an embodiment.

FIG. 5 shows a speech synthesis method and its use of the datasets according to an embodiment; the method may be performed by the speech synthesis device described above. First, the speech synthesis device may train the speaker encoder, which captures speaker characteristics, using DB1 401, that is, the large amount of multi-speaker data (510).

In addition, the speech synthesis device may be trained using three datasets, DB1 401, DB2 402, and DB3 403 (520).

Then, using the speech synthesis model trained in the second procedure 520, samples from DB3 403 may be fed as input to find a statistical representation of how the emotional prosody is represented by the prosody encoder (530).

In addition, a part of the trained speech synthesis model may be adapted using DB4 404 to find the voice representation of the DB4 404 speaker (540). The voice representation of the DB4 404 speaker is supplied as the output value of the speaker encoder, and the emotional representation to be synthesized is selected from those found in the third procedure 530 and supplied as well, so that an emotional voice of the DB4 404 speaker can be output (550).

The device described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A speech synthesis method using a speech synthesis device implemented as an electronic device, the method comprising:
training a speaker encoder, which extracts features of a speaker's voice, using a large amount of multi-speaker data;
training at least one of a text encoder, a prosody encoder, and a residual encoder using at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data;
finding a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder; and
outputting an emotional voice of the speaker of neutral speaker data by using the voice representation of the speaker of the neutral speaker data as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized.

The speech synthesis method of claim 1,
wherein the training using at least one of the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data comprises
training each of the text encoder, the prosody encoder, and the residual encoder using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

The speech synthesis method of claim 1,
wherein the finding of the representation of the emotional prosody of the emotional speech data comprises
finding a statistical representation of how the emotional prosody is represented by the prosody encoder, by feeding samples of the emotional speech data as input to the trained text encoder, prosody encoder, and residual encoder.

The speech synthesis method of claim 3,
wherein the outputting of the emotional voice of the speaker of the neutral speaker data comprises:
adapting a part of the trained text encoder, prosody encoder, and residual encoder using the neutral speaker data to find the voice representation of the speaker of the neutral speaker data; and
outputting the emotional voice of the speaker of the neutral speaker data by supplying the voice representation of the speaker of the neutral speaker data as the output value of the speaker encoder and by supplying the representation of the emotional prosody to be synthesized, selected from those found in the finding of the representation of the emotional prosody of the emotional speech data.

The speech synthesis method of claim 1,
wherein the outputting of the emotional voice of the speaker of the neutral speaker data comprises
synthesizing and outputting, using a small amount of the neutral speaker data, an emotional voice that carries emotional prosody in the voice of one speaker, the representation of the emotional prosody being found with the text encoder, the prosody encoder, and the residual encoder trained using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

The speech synthesis method of claim 1,
wherein the outputting of the emotional voice of the speaker of the neutral speaker data comprises
synthesizing and outputting an emotional voice in which the voice of one speaker, provided from a small amount of the neutral speaker data, carries the emotional prosody obtained from the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

A speech synthesis device comprising:
a learning unit that trains a speaker encoder, which extracts features of a speaker's voice, using a large amount of multi-speaker data, and trains at least one of a text encoder, a prosody encoder, and a residual encoder using at least one of the large amount of multi-speaker data, a large amount of expressive speech data, and emotional speech data;
an emotional prosody representation unit that finds a representation of the emotional prosody of the emotional speech data through at least one of the trained text encoder, prosody encoder, and residual encoder; and
an emotional speech synthesis unit that outputs an emotional voice of the speaker of neutral speaker data by using the voice representation of the speaker of the neutral speaker data as the output value of the speaker encoder and by selecting and synthesizing the representation of the emotional prosody to be synthesized.

The speech synthesis device of claim 7,
wherein the learning unit
trains each of the text encoder, the prosody encoder, and the residual encoder using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

The speech synthesis device of claim 7,
wherein the emotional prosody representation unit
finds a statistical representation of how the emotional prosody is represented by the prosody encoder, by feeding samples of the emotional speech data as input to the trained text encoder, prosody encoder, and residual encoder.

The speech synthesis device of claim 9,
wherein the emotional speech synthesis unit
adapts a part of the trained text encoder, prosody encoder, and residual encoder using the neutral speaker data to find the voice representation of the speaker of the neutral speaker data, supplies the voice representation of the speaker of the neutral speaker data as the output value of the speaker encoder, selects and supplies the representation of the emotional prosody to be synthesized, and outputs the emotional voice of the speaker of the neutral speaker data.

The speech synthesis device of claim 7,
wherein the emotional speech synthesis unit
synthesizes and outputs, using a small amount of the neutral speaker data, an emotional voice that carries emotional prosody in the voice of one speaker, the representation of the emotional prosody being found with the text encoder, the prosody encoder, and the residual encoder trained using the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.

The speech synthesis device of claim 7,
wherein the emotional speech synthesis unit
synthesizes and outputs an emotional voice in which the voice of one speaker, provided from a small amount of the neutral speaker data, carries the emotional prosody obtained from the large amount of multi-speaker data, the large amount of expressive speech data, and the emotional speech data.