KR20210059586A

KR20210059586A - Method and Apparatus for Emotional Voice Conversion using Multitask Learning with Text-to-Speech

Info

Publication number: KR20210059586A
Application number: KR1020200027051A
Authority: KR
Inventors: 이수영; 김태호; 조성재; 최신국; 박세직
Original assignee: 한국과학기술원
Priority date: 2019-11-15
Filing date: 2020-03-04
Publication date: 2021-05-25
Also published as: KR102298901B1

Abstract

Disclosed are an emotional voice conversion method using multitask learning to preserve learning stability by the multitask learning and text-to-speech (TTS) conversion and an emotional voice conversion apparatus thereof. According to the present invention, the emotional voice conversion method comprises the following steps: performing voice conversion (VC) when a pair of input voices is a log Mel spectrogram of a language transferring language content and a log Mel spectrogram of a style reference voice; performing TTS conversion when the pair of input voices are a log Mel spectrogram of a one-hot representative text and a style reference voice; mapping both the log Mel spectrogram and the one-hot representative text of the language transferring the language content to the same space and being decoded into the Mel spectrogram; and acquiring a linear spectrum from the decoded Mel spectrogram through a preprocessor.

Description

Method and Apparatus for Emotional Voice Conversion using Multitask Learning with Text-to-Speech}

본 발명은 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 방법 및 장치에 관한 것이다.The present invention relates to an emotional speech conversion method and apparatus using multitask learning with text-to-speech conversion.

음성은 전달해야 할 것 - 언어 내용 - 그리고 전달 방법 - 스타일의 구성으로 간주될 수 있다. 음성 변환(Voice Conversion; VC)은 언어 내용을 유지하면서 음성 스타일을 변경하는 작업이다. 변환 과정에서 언어 내용이 손실되거나 스타일 정보가 변경되지 않을 수 있어 어려운 과제다.Voice can be regarded as the composition of what should be conveyed-the content of the language-and the method of delivery-a style. Voice Conversion (VC) is an operation of changing a voice style while maintaining language content. This is a difficult task as language content may be lost or style information may not be changed during the conversion process.

종래기술에서 음성 변환은 프레임 기반 접근방식을 사용하여 수행되었다. 출처와 대상 음성을 고려할 때 두 음성 사이의 정렬을 얻은 다음, 출처 음성의 음향 특성이 대상 음성으로 변환된다. 가우시안 혼합 모델(Gaussian Mixture Model; GMM), 심층 신경망(Deep Neural Network; DNN), 반복 신경망(Recurrent Neural Network; RNN)과 같이 모델 음향 기능에 다양한 방법이 적용되었다. 최근의 종래기술에서 주목하는 시퀀스 대 시퀀스(sequence-to-sequence; seq2seq) 모델의 성공 역시 음성 변환에 적용된다. seq2seq 음성 변환모델을 교육하는 동안 틀린 발음, 학습 불안정성과 같은 문제가 관찰되었다. 이러한 종래기술에서는 디코더 출력의 각 단계에 텍스트 감독을 추가하여 음성 변환의 품질을 개선한다. 그러나 이 접근법은 인간 또는 동적 시간 래핑(Dynamic Time Wrapping; DTW)에 의한 명확한 정렬이 필요하기 때문에 한계가 있다. In the prior art, speech conversion has been performed using a frame-based approach. When considering the source and target voices, an alignment between the two voices is obtained, and then the acoustic properties of the source voice are transformed into the target voice. Various methods have been applied to model acoustic functions such as Gaussian Mixture Model (GMM), Deep Neural Network (DNN), and Recurrent Neural Network (RNN). The success of a sequence-to-sequence (seq2seq) model, which has been noted in the recent prior art, is also applied to speech conversion. During the training of the seq2seq speech conversion model, problems such as incorrect pronunciation and learning instability were observed. In this prior art, text supervision is added to each stage of the decoder output to improve the quality of speech conversion. However, this approach is limited because it requires clear alignment by human or Dynamic Time Wrapping (DTW).

전통적으로 프레임 기반 모델은 VC를 해결하는 데 사용되어 왔다. 출처와 대상 음성을 고려할 때, 이들의 정렬은 DTW에 의해 발견된다. 그런 다음 정렬된 프레임의 음향 특성 간 변환을 모델링한다. 최근 seq2seq 모델이 VC에서 제안되었다. 이러한 모델에서 모델은 명확한 시간적 정렬 없이 주의 메커니즘에 의한 정렬과 프레임 변환을 공동으로 학습한다. 그러나 변환된 음성이 언어 정보를 잃을 수 있기 때문에 VC의 성능은 충분하지 않다. 예를 들어 같은 단어가 반복적으로 생성되거나, 일부 단어가 삭제되거나, 잘못 발음되는 경우가 있다. 이러한 음성현상을 방지하기 위해 텍스트 감독을 VC에 추가하지만, 명시적인 정렬이 필요하다. 이전 작업과 달리, 언어 정보를 VC로 안내하기 위해 TTS를 사용했다. 이 접근법에는 명시적인 정렬이 필요하지 않다. Traditionally, frame-based models have been used to solve VC. Given the source and target voices, their alignment is found by the DTW. Then, the transformation between the acoustic properties of the aligned frames is modeled. Recently, the seq2seq model was proposed in VC. In this model, the model jointly learns the alignment and frame transformation by attention mechanisms without explicit temporal alignment. However, the performance of VC is not sufficient because the converted voice may lose language information. For example, the same word may be repeatedly generated, some words may be deleted, or mispronounced. Text supervision is added to the VC to prevent this speech phenomenon, but explicit alignment is required. Unlike previous work, TTS was used to guide language information to the VC. No explicit alignment is required for this approach.

한편, 감성적 음성 변환은 주로 프레임 기반 변환 또는 규칙 기반 접근을 수행하였다. DTW는 정확한 정렬과 규칙 기반 접근방식이 음성 변환 모델링에 제한을 갖도록 보장하지 않기 때문에 이러한 접근방식은 한계가 있다. 명시적 정렬에 의존하지 않는 모델을 사용하여 개선할 수 있다.On the other hand, emotional voice conversion mainly performed frame-based conversion or rule-based approach. This approach has limitations because DTW does not guarantee that exact sorting and rule-based approaches have limitations in speech conversion modeling. This can be improved by using models that do not rely on explicit alignment.

본 발명이 이루고자 하는 기술적 과제는 텍스트 음성 변환(TTS)을 이용한 멀티태스크 학습을 이용하는 음성 변환 방법 및 장치를 제공하는데 있다. 제안하는 감정적 음성 변환 방법 및 장치를 통해 전체 네트워크는 VC와 TTS의 손실을 최소화하도록 학습되고, VC는 더 많은 언어 정보를 포착하고 멀티태스크 학습에 의해 학습 안정성을 보존하고자 한다. An object of the present invention is to provide a speech conversion method and apparatus using multitask learning using text-to-speech (TTS). Through the proposed emotional voice conversion method and apparatus, the entire network is learned to minimize the loss of VC and TTS, and the VC captures more language information and tries to preserve learning stability by multitask learning.

일 측면에 있어서, 본 발명에서 제안하는 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 방법은 입력 음성의 페어가 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(Mel spectrogram) 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)일 경우, 음성 변환(Voice Conversion; VC)을 수행하는 단계, 입력 음성의 페어가 원-핫(one-hot) 대표 텍스트 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)일 경우, 텍스트 음성 변환(Text-to-Speech; TTS)을 수행하는 단계, 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램 및 원-핫(one-hot) 대표 텍스트 모두 동일한 공간에 매핑된 후 멜 스펙트로그램으로 디코딩되는 단계 및 디코딩된 멜 스펙트로그램으로부터 전처리부를 통해 선형 스펙트럼을 획득하는 단계를 포함한다. In one aspect, the emotional voice conversion method using multitask learning along with text-to-speech conversion proposed in the present invention is a log mel spectrogram of a language in which a pair of input voices convey language content and a style reference voice. In the case of a log mel spectrogram, performing a voice conversion (VC), a pair of input voices is a one-hot representative text and a log mel spectrogram of the style reference voice. spectrogram), the step of performing text-to-speech (TTS), the log mel spectrogram of the language delivering the language content, and the one-hot representative text are all mapped to the same space. It includes a step of decoding a post-mel spectrogram and obtaining a linear spectrum from the decoded mel spectrogram through a preprocessor.

입력 음성의 페어가 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램 및 스타일 참조 음성의 로그 멜 스펙트로그램일 경우, 음성 변환을 수행하는 단계는 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램을 컨텐츠 인코더를 통해 임베딩하고, 스타일 참조 음성의 로그 멜 스펙트로그램을 스타일 인코더를 통해 임베딩한다. If the pair of input voices is the log mel spectrogram of the language that conveys the language content and the log mel spectrogram of the style reference voice, the step of performing the speech conversion is the log mel spectrogram of the language conveying the language content, using the content encoder. Through the embedding, the log mel spectrogram of the style reference voice is embedded through the style encoder.

입력 음성의 페어가 원-핫 대표 텍스트 및 스타일 참조 음성의 로그 멜 스펙트로그램일 경우, 텍스트 음성 변환을 수행하는 단계는 원-핫 대표 텍스트를 텍스트 인코더를 통해 임베딩하고, 스타일 참조 음성의 로그 멜 스펙트로그램을 스타일 인코더를 통해 임베딩한다 If the pair of input voices is the log mel spectrogram of the one-hot representative text and the style reference voice, the step of performing text-to-speech is to embed the one-hot representative text through the text encoder, and to the log mel spectrogram of the style reference voice. Embed gram via style encoder

언어 내용을 전달하는 언어의 로그 멜 스펙트로그램 및 원-핫(one-hot) 대표 텍스트 모두 동일한 공간에 매핑된 후 멜 스펙트로그램으로 디코딩되는 단계는 각 디코딩 단계에서 스타일 참조 음성의 로그 멜 스펙트로그램에서 추출한 스타일 벡터를 어텐션(attention) RNN과 디코더 RNN에 연결한다. Both the log mel spectrogram of the language conveying the language content and the one-hot representative text are mapped to the same space and then decoded into the mel spectrogram is performed in the log mel spectrogram of the style reference voice in each decoding step. The extracted style vector is connected to the attention RNN and the decoder RNN.

언어 내용을 전달하는 언어의 로그 멜 스펙트로그램 및 원-핫(one-hot) 대표 텍스트 모두 동일한 공간에 매핑된 후 멜 스펙트로그램으로 디코딩되는 단계는 텍스트 인코더, 어텐션 RNN, 디코더 RNN 및 전처리부를 통해 텍스트 음성 변환을 수행하는 경우, 어텐션 RNN의 모든 반복에 컨텍스트 벡터가 활용된다. Both the log mel spectrogram and the one-hot representative text of the language delivering the language content are mapped to the same space and then decoded into the mel spectrogram. When performing speech conversion, a context vector is used for all repetitions of the attention RNN.

스타일 참조 음성을 고려할 때 스타일 인코더는 감정 정보만을 추출하고 언어적 내용을 제거하며, 언어적 내용에 관계없이 감정을 추출하도록 설계되어 복수의 입력 스타일 도메인을 처리하고, 추출된 감정이 디코더에 주입되면 다양한 감정을 생성함으로써 다대 다의 감정적 음성 변환을 처리한다. When considering style reference speech, the style encoder is designed to extract only emotion information and remove linguistic content, and to extract emotion regardless of linguistic content to process multiple input style domains, and when the extracted emotion is injected into the decoder, It processes many-to-many emotional voice transformations by generating various emotions.

또 다른 일 측면에 있어서, 본 발명에서 제안하는 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 장치는 입력 음성의 페어가 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(Mel spectrogram) 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)인지 또는 원-핫(one-hot) 대표 텍스트 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)인지에 따라 스타일 인코더, 컨텐츠 인코더 및 텍스트 인코더를 통해 음성 변환 또는 텍스트 음성 변환을 수행하는 변환부, 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램 및 원-핫(one-hot) 대표 텍스트 모두 동일한 공간에 매핑된 후 멜 스펙트로그램으로 디코딩되는 신경망 네트워크 및 디코딩된 멜 스펙트로그램으로부터 선형 스펙트럼을 획득하는 전처리부를 포함한다.In another aspect, the emotional voice conversion apparatus using multitask learning along with text-to-speech conversion proposed in the present invention refers to a log mel spectrogram and a style of a language in which a pair of input voices convey language content. Voice conversion via Style Encoder, Content Encoder and Text Encoder depending on whether it is a log mel spectrogram of speech or a one-hot representative text and style reference log mel spectrogram of speech Alternatively, the conversion unit performing text-to-speech conversion, the log mel spectrogram of the language delivering the language content, and the one-hot representative text are all mapped to the same space, and then a neural network network that is decoded into the mel spectrogram, and the decoded It includes a preprocessor for obtaining a linear spectrum from the mel spectrogram.

본 발명의 실시예들에 따르면 텍스트 음성 변환을 이용한 멀티태스크 학습을 이용하는 감정적 음성 변환 방법 및 장치를 통해 전체 네트워크는 VC와 TTS의 손실을 최소화하도록 학습하고, VC는 더 많은 언어 정보를 포착하고 멀티태스크 학습에 의해 학습 안정성을 보존할 수 있다.According to embodiments of the present invention, through the emotional voice conversion method and apparatus using multitask learning using text-to-speech, the entire network learns to minimize the loss of VC and TTS, and the VC captures more language information and multi-tasks. Learning stability can be preserved by task learning.

도 1은 본 발명의 일 실시예에 따른 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 방법을 설명하기 위한 흐름도이다.
도 2는 본 발명의 일 실시예에 따른 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 장치의 구성을 나타내는 도면이다.
도 3은 본 발명의 일 실시예에 따른 스타일 인코더의 혼잡 행렬을 나타내는 도면이다.
도 4는 본 발명의 일 실시예에 따른 음성 변환 결과의 예시를 나타내는 도면이다. 1 is a flowchart illustrating an emotional voice conversion method using multitask learning along with text-to-speech conversion according to an embodiment of the present invention.
2 is a diagram illustrating a configuration of an emotional speech conversion apparatus using multitask learning along with text-to-speech conversion according to an embodiment of the present invention.
3 is a diagram illustrating a congestion matrix of a style encoder according to an embodiment of the present invention.
4 is a diagram illustrating an example of a voice conversion result according to an embodiment of the present invention.

본 발명에서는 텍스트 음성 변환(Text-to-Speech; TTS)을 사용하는 접근법을 채택하고 있다. TTS는 텍스트 또는 음성 정보를 음성 파형으로 변환하는 작업이며, 풍부한 seq2seq 기반 연구가 활발하게 진행되었다. TTS는 음성 변환(Voice Conversion; VC)과 매우 관련이 깊은 작업이다. VC와 TTS는 입력 도메인만 다를 뿐, 음성 정보를 음향 형상으로 변환하는 디코더의 역할은 매우 동일하다. TTS의 임베딩 공간은 음성정보와 높은 상관관계가 있으며, VC는 멀티태스크 학습을 통해 TTS와 가까운 임베딩 공간을 학습할 것으로 기대된다. 본 발명에서는 성능 향상을 위해 TTS를 VC에 음성 정보를 제공하는 데 활용한다. In the present invention, an approach using text-to-speech (TTS) is adopted. TTS is an operation that converts text or voice information into voice waveforms, and abundant seq2seq-based research has been actively conducted. TTS is a task very closely related to Voice Conversion (VC). VC and TTS differ only in the input domain, and the role of the decoder for converting voice information into an acoustic shape is very the same. TTS embedding space is highly correlated with voice information, and VC is expected to learn embedding space close to TTS through multitask learning. In the present invention, the TTS is used to provide voice information to the VC to improve performance.

본 발명의 실시예에 따르면, 이러한 작업을 감정적 음성 변환으로 확장한다. 스타일 참조 음성을 고려할 때 스타일 인코더는 감정 정보만을 추출하고 언어적 내용을 제거한다. 스타일 인코더는 언어적 내용에 관계없이 감정을 추출하도록 설계되어 있어 복수의 입력 스타일 도메인을 처리할 수 있다. 또한 추출된 감정이 디코더에 주입되면 다양한 감정을 생성할 수 있다. 따라서 제안된 모델은 다대 다의 감정적 음성 변환을 처리할 수 있다. According to an embodiment of the present invention, this task is extended to emotional speech conversion. When considering the style reference voice, the style encoder extracts only the emotion information and removes the linguistic content. Style encoders are designed to extract emotions regardless of linguistic content, so they can process multiple input style domains. In addition, when the extracted emotion is injected into the decoder, various emotions can be generated. Therefore, the proposed model can handle many-to-many emotional speech conversion.

본 발명의 실시예에 따른 모델의 기여는 다음과 같다: TTS를 이용한 멀티태스킹 학습은 VC의 성능을 향상시킬 수 있다. 다수의 감정적 음성 변환은 처음에 seq2seq 모델에 의해 수행되었다. 스타일 참조 음성은 음성 변환의 대상 영역을 결정할 수 있다. The contribution of the model according to the embodiment of the present invention is as follows: Multitasking learning using TTS can improve the performance of VC. Many emotional speech transformations were initially performed by the seq2seq model. The style reference voice may determine a target area for voice conversion.

다대 다의 VC는 출처 도메인의 개수와 대상 도메인의 개수를 가리킨다. 종래기술에서 Cycle-GAN은 데이터세트가 아닌 아닌 스피커(speaker)를 대상 스피커로 변환했고, 그 반대로도 변환했다. i-벡터 기반 VC 시스템은 학습 세트에 없는 스피커의 언어적 특징을 생성하기 위해 제안된다. 다른 다대 다의 VC 방식과 비교했을 때, 제안 모델은 스피커의 정서적 지식을 전달할 수 있고 다른 감정들 간의 변환을 가능하게 한다.Many-to-many VC refers to the number of source domains and the number of target domains. In the prior art, Cycle-GAN converted a speaker, not a dataset, into a target speaker, and vice versa. An i-vector based VC system is proposed to generate the linguistic features of speakers that are not in the learning set. Compared to other many-to-many VC schemes, the proposed model can convey the speaker's emotional knowledge and enables conversion between different emotions.

TTS는 음성 영역에서 가장 활발한 연구 분야 중 하나이다. 다양한 종류의 Seq2seq 모델이 제안되었으며 표현적 합성도 연구되었다. 스타일 인코더에서 추출한 스타일 벡터는 입력된 감정이 네트워크에 주입되어 감정적 음성을 생성한다. 본 발명에서, 감정 라벨은 학습 중에 활용되지 않으며, 어떠한 감정 라벨도 네트워크의 입력으로 명시적으로 받아들여지지 않는다. 이하, 본 발명의 실시 예를 첨부된 도면을 참조하여 상세하게 설명한다. TTS is one of the most active research fields in the voice field. Various types of Seq2seq models have been proposed and expressive synthesis has also been studied. The style vector extracted from the style encoder generates an emotional voice by injecting the input emotion into the network. In the present invention, the emotion label is not utilized during learning, and no emotion label is explicitly accepted as an input to the network. Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

도 1은 본 발명의 일 실시예에 따른 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 방법을 설명하기 위한 흐름도이다. 1 is a flowchart illustrating an emotional voice conversion method using multitask learning along with text-to-speech conversion according to an embodiment of the present invention.

음성 변환(VC)은 언어적 내용을 보존하면서 사람의 목소리를 다른 스타일로 바꾸는 작업이다. VC에 관한 종래의 최첨단 기술은 언어 정보를 호도할 수 있는 시퀀스 대 시퀀스(seq2seq) 모델을 기반으로 한다. 텍스트 감시를 사용하여 이를 극복하려는 시도가 있었고, seq2seq 모델을 사용할 경우의 이점을 잃게 되는 명확한 정렬이 요구된다. 본 발명에서는 텍스트 음성 변환(TTS)를 이용한 멀티태스크 학습을 이용하는 음성 변환을 제시한다. seq2seq 기반 TTS의 임베딩(embedding)은 텍스트에 대한 풍부한 정보를 가지고 있다. TTS 디코더의 역할은 내장 공간을 VC와 같은 음성으로 변환하는 것이다. 제안된 모델에서, 전체 네트워크는 VC와 TTS의 손실을 최소화하도록 학습된다. VC는 더 많은 언어 정보를 포착하고 멀티태스크 학습에 의해 학습 안정성을 보존할 것으로 기대된다. VC의 실험은 한국의 남성 정서적 텍스트 음성 데이터세트에서 수행되었으며, 멀티태스크 학습은 VC에서 언어적 내용을 유지하는 데 도움이 된다는 것을 보여준다.Voice conversion (VC) is the work of changing a person's voice into a different style while preserving the linguistic content. The conventional state-of-the-art technology for VC is based on a sequence-to-sequence (seq2seq) model capable of misleading language information. Attempts have been made to overcome this using text watchdog, and a clear alignment is required, which loses the benefits of using the seq2seq model. In the present invention, speech conversion using multitask learning using text-to-speech (TTS) is proposed. The seq2seq-based TTS embedding has a wealth of information about text. The role of the TTS decoder is to convert the built-in space into voice such as VC. In the proposed model, the entire network is trained to minimize the loss of VC and TTS. VC is expected to capture more linguistic information and preserve learning stability by multitasking learning. VC's experiments were conducted on a Korean male emotional text-to-speech dataset, and show that multitask learning helps to maintain linguistic content in VC.

제안하는 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 방법은 입력 음성의 페어가 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(Mel spectrogram) 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)일 경우, 음성 변환(Voice Conversion; VC)을 수행하는 단계(110), 입력 음성의 페어가 원-핫(one-hot) 대표 텍스트 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)일 경우, 텍스트 음성 변환(Text-to-Speech; TTS)을 수행하는 단계(120), 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램 및 원-핫(one-hot) 대표 텍스트 모두 동일한 공간에 매핑된 후 멜 스펙트로그램으로 디코딩되는 단계(130) 및 디코딩된 멜 스펙트로그램으로부터 전처리부를 통해 선형 스펙트럼을 획득하는 단계(140)를 포함한다. The proposed text-to-speech and emotional speech conversion method using multitask learning is the log mel spectrogram of the language in which the pair of input speech conveys the language content and the log mel spectrogram of the style reference speech. In the case of, the step 110 of performing voice conversion (VC), when the pair of input voices is a log mel spectrogram of one-hot representative text and style reference voice, Step 120 of performing text-to-speech (TTS), the log mel spectrogram of the language delivering the language content, and the one-hot representative text are all mapped to the same space and then the mel A step 130 of decoding into a spectrogram and a step 140 of acquiring a linear spectrum from the decoded mel spectrogram through a preprocessor.

단계(110)에서, 입력 음성의 페어가 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(Mel spectrogram) 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)일 경우, 음성 변환(Voice Conversion; VC)을 수행한다. 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램을 컨텐츠 인코더를 통해 임베딩하고, 스타일 참조 음성의 로그 멜 스펙트로그램을 스타일 인코더를 통해 임베딩한다. In step 110, when the pair of input voices is a log Mel spectrogram of a language delivering language content and a log Mel spectrogram of a style reference voice, Voice Conversion (VC) Perform. The log mel spectrogram of the language delivering the language content is embedded through the content encoder, and the log mel spectrogram of the style reference voice is embedded through the style encoder.

단계(120)에서, 입력 음성의 페어가 원-핫(one-hot) 대표 텍스트 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)일 경우, 텍스트 음성 변환(Text-to-Speech; TTS)을 수행한다. 원-핫 대표 텍스트를 텍스트 인코더를 통해 임베딩하고, 스타일 참조 음성의 로그 멜 스펙트로그램을 스타일 인코더를 통해 임베딩한다. In step 120, when the pair of input voices is a log mel spectrogram of one-hot representative text and style reference voice, text-to-speech (TTS) is performed. Carry out. One-hot representative text is embedded through the text encoder, and the log mel spectrogram of the style reference voice is embedded through the style encoder.

단계(130)에서, 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램 및 원-핫(one-hot) 대표 텍스트 모두 동일한 공간에 매핑된 후 멜 스펙트로그램으로 디코딩된다. 각 디코딩 단계에서 스타일 참조 음성의 로그 멜 스펙트로그램에서 추출한 스타일 벡터를 어텐션(attention) RNN과 디코더 RNN에 연결한다. 이때, 텍스트 인코더, 어텐션 RNN, 디코더 RNN 및 전처리부를 통해 텍스트 음성 변환을 수행하는 경우, 어텐션 RNN의 모든 반복에 컨텍스트 벡터가 활용된다. In step 130, both the log mel spectrogram and the one-hot representative text of the language delivering the language content are mapped to the same space and then decoded into the mel spectrogram. In each decoding step, the style vector extracted from the log mel spectrogram of the style reference voice is connected to the attention RNN and the decoder RNN. In this case, when text-to-speech is performed through the text encoder, the attention RNN, the decoder RNN, and the preprocessor, the context vector is used for all repetitions of the attention RNN.

단계(140)에서, 디코딩된 멜 스펙트로그램으로부터 전처리부를 통해 선형 스펙트럼을 획득한다. In step 140, a linear spectrum is obtained from the decoded mel spectrogram through a preprocessor.

본 발명의 실시예에 따르면, 스타일 참조 음성을 고려할 때 스타일 인코더는 감정 정보만을 추출하고 언어적 내용을 제거하며, 언어적 내용에 관계없이 감정을 추출하도록 설계되어 복수의 입력 스타일 도메인을 처리하고, 추출된 감정이 디코더에 주입되면 다양한 감정을 생성함으로써 다대 다의 감정적 음성 변환을 처리한다. According to an embodiment of the present invention, when considering the style reference voice, the style encoder is designed to extract only emotion information and remove linguistic content, and to extract emotion regardless of linguistic content to process a plurality of input style domains, When the extracted emotions are injected into the decoder, many-to-many emotional voice conversion is processed by generating various emotions.

더욱 상세하게는, 제안된 감정적 음성 변환 방법은 단일 모델에서 VC와 TTS를 수행할 수 있다. 네트워크는 입력 페어(pair)가 (x_c, x_s)일 때 VC로, 또는 입력 페어가 (y_t, x_s)일 때 TTS로 플레이된다. 여기서 x_c, y_t, x_s는 언어 내용을 전달하는 언어의 로그 멜 스펙토그램(Mel spectrogram), 원-핫(one-hot) 대표 텍스트, 스타일 참조 음성의 로그 멜 스펙토그램이다. x_c와 y_t 모두 동일한 공간 h_l에 매핑된 후, 멜 스펙토그램 m으로 디코딩된다. 각 디코딩 단계에서 x_s에서 추출한 스타일 벡터 h_s를 어텐션 RNN과 디코더 RNN에 연결한다. 선형 스펙트럼 l는 전처리부에 의해 획득된다. 자세한 네트워크 아키텍처는 도 2 및 아래의 등식을 참조하여 설명한다. In more detail, the proposed emotional voice conversion method can perform VC and TTS in a single model. The network is played in _{VC when the input pair is (x c} , x _s ) or in TTS when the input pair is (y _t , x _{s ).} Here, x _c , y _t , and x _s are the log mel spectrogram of the language conveying the language content, the one-hot representative text, and the log mel spectrogram of the style reference voice. Both x _c and y _t are mapped to the same space h _l and then decoded into a mel spectogram m. In each decoding step, the style vector h _s _{extracted from x s} is connected to the attention RNN and the decoder RNN. The linear spectrum l is obtained by the preprocessor. A detailed network architecture will be described with reference to FIG. 2 and the following equation.

도 2는 본 발명의 일 실시예에 따른 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 장치의 구성을 나타내는 도면이다. 2 is a diagram illustrating a configuration of an emotional speech conversion apparatus using multitask learning along with text-to-speech conversion according to an embodiment of the present invention.

제안하는 텍스트 음성 변환과 함께 멀티태스크 학습을 사용한 감정적 음성 변환 장치는 변환부(210), 신경망 네트워크(220), 전처리부(230)를 포함한다. An emotional speech conversion apparatus using multitask learning along with the proposed text-to-speech conversion includes a conversion unit 210, a neural network 220, and a preprocessor 230.

변환부(210)는 입력 음성의 페어가 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(Mel spectrogram)(x_c) 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)인지 또는 원-핫(one-hot) 대표 텍스트(y_t) 및 스타일 참조 음성의 로그 멜 스펙트로그램(Mel spectrogram)(x_s)인지에 따라 스타일 인코더(211), 컨텐츠 인코더(212) 및 텍스트 인코더(213)를 통해 음성 변환 또는 텍스트 음성 변환을 수행한다. _{The conversion unit 210 is a log mel spectrogram (Mel spectrogram) (x c} ) of a language that delivers the language content and a log mel spectrogram (Mel spectrogram) of the style reference voice or one-hot (one). -hot) Voice conversion through the style encoder 211, the content encoder 212 and the text encoder 213 depending on whether the representative text (y _t ) and the log mel spectrogram (x _{s) of the style reference voice are} Or perform text-to-speech.

변환부(210)는 입력 음성의 페어가 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(x_c) 및 스타일 참조 음성의 로그 멜 스펙트로그램(x_s)일 경우, 음성 변환을 수행한다. 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(x_c)을 컨텐츠 인코더(212)를 통해 임베딩하고, 스타일 참조 음성의 로그 멜 스펙트로그램(x_s)을 스타일 인코더(211)를 통해 임베딩한다. The conversion unit 210 performs speech conversion when the pair of input voices is a log mel spectrogram (x _c ) of a language delivering language content and a log mel spectrogram (x _s ) of a style reference voice. A log mel spectrogram (x _c ) of a language delivering language content is embedded through the content encoder 212, and a log mel spectrogram (x _s ) of a style reference voice is embedded through the style encoder (211).

변환부(210)는 입력 음성의 페어가 원-핫 대표 텍스트(y_t) 및 스타일 참조 음성의 로그 멜 스펙트로그램(x_s)일 경우, 텍스트 음성 변환을 수행한다. 원-핫 대표 텍스트(y_t)를 텍스트 인코더(213)를 통해 임베딩하고, 스타일 참조 음성의 로그 멜 스펙트로그램(x_s)을 스타일 인코더(211)를 통해 임베딩한다. The conversion unit 210 performs text-to-speech when the pair of input voices is a one-hot representative text (y _t ) and a log mel spectrogram (x _s ) of a style reference voice. One-hot representative text (y _t ) is embedded through the text encoder 213, and the log mel spectrogram (x _s ) of the style reference voice is embedded through the style encoder 211.

신경망 네트워크(220)는 언어 내용을 전달하는 언어의 로그 멜 스펙트로그램(x_c) 및 원-핫(one-hot) 대표 텍스트(y_t) 모두 동일한 공간에 매핑된 후 멜 스펙트로그램(m)으로 디코딩된다. 신경망 네트워크(220)는 각 디코딩 단계에서 스타일 참조 음성의 로그 멜 스펙트로그램에서 추출한 스타일 벡터를 어텐션(attention) RNN(221)과 디코더 RNN(222)에 연결한다. The neural network network 220 maps both the log mel spectrogram (x _c ) and the one-hot representative text (y _t ) of the language delivering the language content to the same space and then converts it into a mel spectrogram (m). Is decoded. The neural network 220 connects the style vector extracted from the log mel spectrogram of the style reference voice to the attention RNN 221 and the decoder RNN 222 in each decoding step.

전처리부(230)는 디코딩된 멜 스펙트로그램(m)으로부터 선형 스펙트럼(l)을 획득한다. 선형 스펙트럼(l)은 보코더(Vocoder)(240)를 거쳐 변환된 음성(x_o)을 출력한다. The preprocessor 230 obtains a linear spectrum l from the decoded mel spectrogram m. The linear spectrum (l) outputs _{the converted voice (x o) through a vocoder (240).}

더욱 상세하게는, 제안된 감정적 음성 변환 장치 단일 모델에서 VC와 TTS를 수행할 수 있다. 네트워크는 입력 페어(pair)가 (x_c, x_s)일 때 VC로, 또는 입력 페어가 (y_t, x_s)일 때 TTS로 플레이된다. 여기서 x_c, y_t, x_s는 언어 내용을 전달하는 언어의 로그 멜 스펙토그램(Mel spectrogram), 원-핫(one-hot) 대표 텍스트, 스타일 참조 음성의 로그 멜 스펙토그램이다. x_c와 y_t 모두 동일한 공간 h_l에 매핑된 후, 멜 스펙토그램 m으로 디코딩된다. 각 디코딩 단계에서 x_s에서 추출한 스타일 벡터 h_s를 어텐션 RNN과 디코더 RNN에 연결한다. 선형 스펙트럼 l는 전처리부에 의해 획득된다. In more detail, VC and TTS can be performed in a single model of the proposed emotional speech conversion device. The network is played in _{VC when the input pair is (x c} , x _s ) or in TTS when the input pair is (y _t , x _{s ).} Here, x _c , y _t , and x _s are the log mel spectrogram of the language conveying the language content, the one-hot representative text, and the log mel spectrogram of the style reference voice. Both x _c and y _t are mapped to the same space h _l and then decoded into a mel spectogram m. In each decoding step, the style vector h _s _{extracted from x s} is connected to the attention RNN and the decoder RNN. The linear spectrum l is obtained by the preprocessor.

여기서 h_t, h_c, h_s는 각각 텍스트 인코더(213), 컨텐츠 인코더(212), 스타일 인코더(211)의 임베딩이다. h_att ^(t) 및 h_dec ^(t)는 시간 단계 t에서 주의 RNN과 디코더 RNN의 숨겨진 표현이다. m, l, c^(t) 및 o^(t)는 시간 단계 t에서의 로그 멜 스펙트로그램, 대상의 로그 선형 스펙트로그램, 주의 메커니즘에 의해 달성된 컨텍스트 벡터, 시간 단계 t에서의 주의 RNN 출력이다. XOR은 익스클러시브(exclusive) OR 연산자를 의미한다.Here, h _t , h _c , and h _s are embeddings of the text encoder 213, the content encoder 212, and the style encoder 211, respectively. h _att ^(t) and h _dec ^(t) are hidden representations of the attention RNN and decoder RNN at time step t. m, l, c ^(t) and o ^(t) are the log mel spectrogram at time step t, the log linear spectrogram of the object, the context vector achieved by the attention mechanism, and the attention RNN output at time step t. XOR stands for exclusive OR operator.

텍스트 인코더, 디코더, 어텐션 RNN 및 전처리부를 포함한 TTS 부분의 경우, 전체적인 아키텍처는 타코트론(Tacotron)을 기반으로 하며, 어텐션 RNN의 모든 반복에 컨텍스트 벡터 c^(t)가 활용되고, 잔여 연결은 Convolution Bank + Highway + bi-GRU(CBHG) 연결에 추가된다.In the case of the TTS part including the text encoder, decoder, attention RNN and preprocessor, the overall architecture is based on Tacotron, and the context vector c ^(t) is used for all iterations of the attention RNN, and the residual connection is Convolution Bank + Highway + bi-GRU (CBHG) added to the connection.

텍스트 인코더는 캐릭터 임베딩 계층, 프리넷(prenet), CBHG로 구성되며, 여기서 프리넷은 FC ReLU-Dropout 계층 2개로 구성된다. LSTM 스택은 컨텐츠 인코더에 사용되며, 시간적 감소가 로컬 시간 정보를 잃을 수 있기 때문에 시간적 해상도의 감소는 없었다. 이는 h_c의 길이가 x_c와 같다는 의미이다. 스타일 인코더의 경우 x_s는 LSTM의 마지막 단계에 이어 완전히 연결된 계층을 임베딩하여 고정 치수가 있는 h_s에 매핑되어 치수를 감소시킨다. 주의 모듈은 주의 RNN과 주의 메커니즘으로 구성된다. 주의 RNN은 h_s, c^(t-1), m^(t-1)을 입력으로 취한다. 그리고 나서 그것의 출력 o^(t)와 h_l은 컨텍스트 벡터 c^(t)를 생성하기 위한 주의 메커니즘에 사용된다.The text encoder consists of a character embedding layer, a prenet, and a CBHG, where the freenet consists of two FC ReLU-Dropout layers. The LSTM stack is used in the content encoder, and there is no reduction in temporal resolution because the temporal reduction can lose local time information. This means that _{the length of h c} is equal to x _c. In the case of a style encoder, x _s _{is mapped to h s} with a fixed dimension by embedding a fully connected layer following the last step of the LSTM to reduce the dimension. The attention module consists of the attention RNN and attention mechanism. Note The RNN takes h _s , c ^(t-1) and m ^(t-1) as inputs. Then its output o ^(t) and h _l are used to note the mechanism for generating the context vector c ^(t).

학습 손실은 Im-mgtI+II-IgtI이며, 여기서 mgt와 lgt는 그라운드 트루(ground truth) 로그 멜 스펙트로그램 및 로그 선형 스펙트로그램이다.The learning loss is Im-mgtI+II-IgtI, where mgt and lgt are the ground truth log mel spectrogram and log linear spectrogram.

본 발명의 실시예에 따르면, mKETTS(Korean Emotional Text-to-speech) 데이터 세트를 구축했다. 예를 들어, 실험을 위해 측정된 30세의 한 남자는 7개의 다른 감정(중립, 행복, 슬픔, 분노, 두려움, 놀라움, 혐오감)으로 3,000개의 문장을 발음했기 때문에, 전체 발언 수는 2만 1천 개에 이른다. 모든 문장의 텍스트는 감정 전반에 걸쳐 동일하다. 모든 녹음은 소리 없이 조용한 스튜디오에서 진행되어 44.1 kHz의 샘플링 속도로 기록되어 있다. 정적을 다듬은 후의 전체 시간은 29.2시간이다.According to an embodiment of the present invention, a Korean Emotional Text-to-speech (mKETTS) data set was constructed. For example, a 30-year-old man, measured for the experiment, pronounced 3,000 sentences with 7 different emotions (neutral, happy, sad, anger, fear, surprise, disgust), so the total number of utterances is 20,000. It reaches a thousand. The text of every sentence is the same throughout the emotion. All recordings are done in a silent studio and recorded at a sampling rate of 44.1 kHz. The total time after finishing the static is 29.2 hours.

사전 처리의 경우, 각 파형의 첫 번째와 끝의 정적을 음성 활동 감지(VAD) 알고리즘을 사용하여 다듬고 16kHz로 다시 샘플링한다. 그런 다음 로그 멜 스펙트로그램은 윈도우 크기 50ms, 시프트 12.5ms, nfft 2048, 80 멜 빈스(Mel bins) 및 해닝(Hanning) 윈도우로 추출한다. 그 크기는 [0, 1]로 표준화되었다. 텍스트 인코더는 문자 기반 표현을 사용하기 때문에, 한국 문자는 온셋(onset), 뉴클리스(nucleus), 코다(coda)로 분해된 후 원-핫(one-hot) 표현으로 변환된다.For pre-processing, the statics at the first and end of each waveform are trimmed using a voice activity detection (VAD) algorithm and resampled at 16 kHz. Then, the log mel spectrogram is extracted with a window size of 50 ms, shift 12.5 ms, nfft 2048, 80 Mel bins and Hanning windows. Its size was standardized to [0, 1]. Since the text encoder uses a character-based representation, Korean characters are decomposed into onset, nucleus, and coda, and then converted into a one-hot representation.

자세한 파라미터 설정은 다음과 같다. 본 발명의 실시예에 따르면, 256자 임베딩, 32차원 hs를 사용했다. 초기 학습율은 아담 최적기의 경우 1e-3이었다. 경사 클립핑(gradient clipping)은 1로 사용되었으며, m은 감소 계수 5로 생성되었다. 배치 사이즈는 32였고, Bahdanau attention이 사용되었다. 컨텐츠 인코더는 양방향 LSTM와 스타일 인코더에 두 층을 사용하였으며, 컨텐츠 인코더는 단방향 LSTM의 두 층으로 구성되어 있다. 마지막 단계의 출력은 컨텐츠 인코더에만 사용된다. 컨텐츠 인코더와 스타일 인코더의 파라미터는 공유되지 않는다. 학습 단계에서는 예측 산출물의 누적 손실을 방지하기 위해 티처 포싱(teacher forcing)을 사용하였다. 미니 배치를 만들 때 각 샘플은 샘플의 가장 긴 길이로 제로-페이딩(zero-padding) 된다. 그 후, 제로-페이딩된 영역의 손실도 추론 안정성을 위해 역전사된다. 모든 반복에 대해, 네트워크의 작업은 무작위로 VC 또는 TTS로 결정된다. 각 샘플에 대해 출처와 대상 감정은 다르게 선택된다.Detailed parameter settings are as follows. According to an embodiment of the present invention, 256-character embedding and 32-dimensional hs were used. The initial learning rate was 1e-3 for Adam's optimal period. Gradient clipping was used as 1, and m was created with a reduction factor of 5. The batch size was 32, and Bahdanau attention was used. The content encoder uses two layers, the bidirectional LSTM and the style encoder, and the content encoder consists of two layers, the unidirectional LSTM. The output of the last step is only used by the content encoder. The parameters of the content encoder and style encoder are not shared. In the learning stage, teacher forcing was used to prevent the cumulative loss of the predicted products. When creating a mini-batch, each sample is zero-padding to the longest length of the sample. Thereafter, the loss of the zero-faded region is also reverse transcribed for inference stability. For every iteration, the network's work is randomly determined by VC or TTS. For each sample, the source and target emotions are chosen differently.

제안된 모델의 언어적 일관성을 검증하기 위해 세 가지 다른 모델을 교육하고 평가하였다. VCTS는 앞서 설명된 바와 같이 TTS와 VC를 결합한 모델이다. VC는 TTS 경로가 없는 음성 변환기를 말한다. TTS는 VC 경로를 포함하지 않는 TTS 모델을 말한다. VCTTS-V는 VCTTS를 이용한 음성 변환 추론 모델이며, VCTTS-T는 VCTTS의 TTS 경로가 활성화되었을 때의 모델이다.Three different models were trained and evaluated to verify the linguistic consistency of the proposed model. VCTS is a model that combines TTS and VC as described above. VC refers to a voice converter without a TTS path. TTS refers to a TTS model that does not include a VC path. VCTTS-V is a voice conversion inference model using VCTTS, and VCTTS-T is a model when the TTS path of VCTTS is activated.

단어 오류율(WER)은 제안된 모델이 어떻게 변환된 언어의 일관성을 향상시키는지를 측정하기 위해 계산되었다. 사실, 형태소는 한국어의 인식 단위로 간주되기 때문에 단어 대신에 사용되었다. 구글 클라우드 음성-텍스트 API는 변환된 음성들을 기록했고, 녹취록은 KoNLPy의 Komoran 형태학 분석기에 의해 일련의 형태소로 나뉘었다. 이후, 실제 기록과 자동 인식된 기록의 두 가지 형태소 순서 사이의 평균 WER을 계산하여 표 1에 표시하였다. 그 결과 VCTTS-V가 VC보다 우수하고, TTS의 WER가 VCTTS-T보다 우수했다. The word error rate (WER) was calculated to measure how the proposed model improves the consistency of the transformed language. In fact, morphemes were used instead of words because they were regarded as the unit of recognition in Korean. The Google Cloud Voice-to-Text API recorded the converted voices, and the transcript was divided into a series of morphemes by KoNLPy's Komoran Morphological Analyzer. Thereafter, the average WER between the two morpheme sequences of the actual record and the automatically recognized record was calculated and displayed in Table 1. As a result, VCTTS-V was superior to VC, and WER of TTS was superior to VCTTS-T.

<표 1> <Table 1>

학습이 끝난 뒤 8명의 원어민이 주관적인 평가에 참여했다. 20개의 문장이 VCTTS-V 모델에 의해 생성되었다. x_c의 감정은 중립으로 설정되었고 대상 감정은 행복으로 설정되었고 x_s의 문장은 고정되었다. 어떤 모델이 어떤 음성을 생성하는지 피실험자들은 결코 알지 못한다는 것을 맹목적으로 시험한다. 주제들은 그것의 명료성을 1부터 5까지 평가하도록 요청 받았다. 또한, 두 모델 간의 선호도 ABX 테스트가 실시되었다. 어떤 모델이 어떤 샘플을 생성하는지 정보가 없는 두 개의 음성이 주어졌을 때, 주제들은 더 명확한 음성을 선택하도록 요청 받았다. 실험 대상자들은 두 샘플이 유사하다고 인식되면 어떤 것도 선택할 수 없다. 결과는 표 2와 같다. 그것은 VCTTV-V가 높은 MOS와 ABX 선호 점수를 가지고 있다는 것을 보여주는데, 이것은 VC와 TTS의 멀티태스크 학습이 언어 정보를 유지하는 데 도움이 된다는 것을 의미한다. After learning, 8 native speakers participated in the subjective evaluation. Twenty sentences were generated by the VCTTS-V model. The emotion of x _c was set to neutral, the emotion of the object was set to happiness, and the sentence of _{x s was fixed.} We blindly test that subjects never know which model produces which voice. Subjects were asked to rate their clarity on a scale of 1 to 5. In addition, the ABX test of preference between the two models was conducted. Given two voices with no information on which model produced which sample, subjects were asked to choose a clearer voice. Subjects cannot choose any of the two samples if they are perceived to be similar. The results are shown in Table 2. It shows that VCTTV-V has a high MOS and ABX preference score, which means that the multitask learning of VC and TTS helps to maintain language information.

<표 2><Table 2>

도 3은 본 발명의 일 실시예에 따른 스타일 인코더의 혼잡 행렬을 나타내는 도면이다. 3 is a diagram illustrating a congestion matrix of a style encoder according to an embodiment of the present invention.

감정적 음성 변환을 조사하기 위해 위에서 언급한 VCTTS 모델을 추론에 사용한다. 학습이 끝난 후, 무작위로 각 감정당 20개의 샘플을 선택했고, 그 샘플들은 모델에 공급된다. 그러면 샘플당 h_s를 얻을 수 있으며, 각 표본 간의 코사인 유사성을 측정하여 도 3에 나타내었다. 모든 감정 페어 간의 코사인 유사성의 평균값도 표시된다.To investigate emotional speech transformation, the VCTTS model mentioned above is used for inference. After training, 20 samples were randomly selected for each emotion, and the samples were fed to the model. Then, h _s can be obtained per sample, and the cosine similarity between each sample is measured and shown in FIG. 3. The average value of cosine similarity between all emotion pairs is also displayed.

도 3에서, 같은 감정의 표본은 코사인 유사성이 매우 높은 반면, 사선 이외의 감정 페어는 유사성이 낮다는 것을 보여준다. 그것은 스타일 인코더가 언어적 내용에 상관없이 감정 스타일을 추출할 수 있다는 것을 암시한다. 대각선 감정 페어를 제외하고도 감정 페어(역겨움-분노)는 비교적 높은 유사성을 나타내는데, 이는 이 두 감정의 내재가 다른 감정보다 더 가깝다는 것을 의미한다. (슬픔-두려움) 페어에서도 같은 음성이 관찰된다. In FIG. 3, it is shown that the sample of the same emotion has very high cosine similarity, while the emotion pair other than the diagonal line has low similarity. It implies that the style encoder can extract the emotional style regardless of the linguistic content. Except for the diagonal emotion pair, the emotion pair (disgusting-anger) exhibits a relatively high similarity, which means that the intrinsicity of these two emotions is closer than the other emotions. (Sad-Fear) The same voice is also observed in the pair.

도 4는 본 발명의 일 실시예에 따른 음성 변환 결과의 예시를 나타내는 도면이다.4 is a diagram illustrating an example of a voice conversion result according to an embodiment of the present invention.

감성적 음성 변환은 언어적 내용을 유지하면서 감정을 반영해야 한다. 도 4에서는 음성 변환의 예시를 나타낸다. 중립적인 음성이 주어질 때, 주어진 x_s는 6가지 다른 감정으로 변모한다. x_s의 내용은 이 실험 동안 고정되어 있다. 도 4의 첫째 행은 입력 음성의 로그 멜 스펙트로그램이며, 두 번째 행에서 일곱 번째 행까지 변환 음성의 로그 멜 스펙트로그램이다. 스펙트로그램의 전체적인 형태는 입력과 유사한 반면, 시간 이동, 주파수 이동, 일시 정지 시간 등의 일부 변경이 이루어졌다는 것을 알 수 있다. 단일 모델 내에서, 그것은 다수의 언어 도메인을 생성할 수 있다.Emotional voice conversion should reflect emotions while maintaining the linguistic content. 4 shows an example of voice conversion. Given a neutral voice, the given x _s transforms into 6 different emotions. The content of x _s is fixed during this experiment. The first row of FIG. 4 is a log mel spectrogram of an input voice, and a log mel spectrogram of the converted voice from the second row to the seventh row. While the overall shape of the spectrogram is similar to the input, it can be seen that some changes such as time shift, frequency shift, and pause time have been made. Within a single model, it can create multiple language domains.

이상에서 설명된 장치는 하드웨어 구성요소, 소프트웨어 구성요소, 및/또는 하드웨어 구성요소 및 소프트웨어 구성요소의 조합으로 구현될 수 있다. 예를 들어, 실시예들에서 설명된 장치 및 구성요소는, 예를 들어, 프로세서, 콘트롤러, ALU(arithmetic logic unit), 디지털 신호 프로세서(digital signal processor), 마이크로컴퓨터, FPA(field programmable array), PLU(programmable logic unit), 마이크로프로세서, 또는 명령(instruction)을 실행하고 응답할 수 있는 다른 어떠한 장치와 같이, 하나 이상의 범용 컴퓨터 또는 특수 목적 컴퓨터를 이용하여 구현될 수 있다. 처리 장치는 운영 체제(OS) 및 상기 운영 체제 상에서 수행되는 하나 이상의 소프트웨어 애플리케이션을 수행할 수 있다.　 또한, 처리 장치는 소프트웨어의 실행에 응답하여, 데이터를 접근, 저장, 조작, 처리 및 생성할 수도 있다.　 이해의 편의를 위하여, 처리 장치는 하나가 사용되는 것으로 설명된 경우도 있지만, 해당 기술분야에서 통상의 지식을 가진 자는, 처리 장치가 복수 개의 처리 요소(processing element) 및/또는 복수 유형의 처리 요소를 포함할 수 있음을 알 수 있다.　 예를 들어, 처리 장치는 복수 개의 프로세서 또는 하나의 프로세서 및 하나의 콘트롤러를 포함할 수 있다.　 또한, 병렬 프로세서(parallel processor)와 같은, 다른 처리 구성(processing configuration)도 가능하다.The apparatus described above may be implemented as a hardware component, a software component, and/or a combination of a hardware component and a software component. For example, the devices and components described in the embodiments include, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), It may be implemented using one or more general purpose or special purpose computers, such as a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications executed on the operating system. Further, the processing device may access, store, manipulate, process, and generate data in response to the execution of software. For the convenience of understanding, although it is sometimes described that one processing device is used, one of ordinary skill in the art, the processing device is a plurality of processing elements and/or a plurality of types of processing elements. It can be seen that it may include. For example, the processing device may include a plurality of processors or one processor and one controller. In addition, other processing configurations are possible, such as a parallel processor.

소프트웨어는 컴퓨터 프로그램(computer program), 코드(code), 명령(instruction), 또는 이들 중 하나 이상의 조합을 포함할 수 있으며, 원하는 대로 동작하도록 처리 장치를 구성하거나 독립적으로 또는 결합적으로(collectively) 처리 장치를 명령할 수 있다.　 소프트웨어 및/또는 데이터는, 처리 장치에 의하여 해석되거나 처리 장치에 명령 또는 데이터를 제공하기 위하여, 어떤 유형의 기계, 구성요소(component), 물리적 장치, 가상 장치(virtual equipment), 컴퓨터 저장 매체 또는 장치에 구체화(embody)될 수 있다.　 소프트웨어는 네트워크로 연결된 컴퓨터 시스템 상에 분산되어서, 분산된 방법으로 저장되거나 실행될 수도 있다. 소프트웨어 및 데이터는 하나 이상의 컴퓨터 판독 가능 기록 매체에 저장될 수 있다.The software may include a computer program, code, instructions, or a combination of one or more of these, configuring the processing unit to behave as desired or processed independently or collectively. You can command the device. Software and/or data may be interpreted by a processing device or, to provide instructions or data to a processing device, of any type of machine, component, physical device, virtual equipment, computer storage medium or device. Can be embodyed. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

실시예에 따른 방법은 다양한 컴퓨터 수단을 통하여 수행될 수 있는 프로그램 명령 형태로 구현되어 컴퓨터 판독 가능 매체에 기록될 수 있다.　 상기 컴퓨터 판독 가능 매체는 프로그램 명령, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있다.　 상기 매체에 기록되는 프로그램 명령은 실시예를 위하여 특별히 설계되고 구성된 것들이거나 컴퓨터 소프트웨어 당업자에게 공지되어 사용 가능한 것일 수도 있다.　 컴퓨터 판독 가능 기록 매체의 예에는 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체(magnetic media), CD-ROM, DVD와 같은 광기록 매체(optical media), 플롭티컬 디스크(floptical disk)와 같은 자기-광 매체(magneto-optical media), 및 롬(ROM), 램(RAM), 플래시 메모리 등과 같은 프로그램 명령을 저장하고 수행하도록 특별히 구성된 하드웨어 장치가 포함된다.　 프로그램 명령의 예에는 컴파일러에 의해 만들어지는 것과 같은 기계어 코드뿐만 아니라 인터프리터 등을 사용해서 컴퓨터에 의해서 실행될 수 있는 고급 언어 코드를 포함한다.　 The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and usable to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical media such as CD-ROMs and DVDs, and magnetic media such as floptical disks. -A hardware device specially configured to store and execute program instructions such as magneto-optical media, and ROM, RAM, flash memory, and the like. Examples of program instructions include not only machine language codes such as those produced by a compiler, but also high-level language codes that can be executed by a computer using an interpreter or the like.

이상과 같이 실시예들이 비록 한정된 실시예와 도면에 의해 설명되었으나, 해당 기술분야에서 통상의 지식을 가진 자라면 상기의 기재로부터 다양한 수정 및 변형이 가능하다.　 예를 들어, 설명된 기술들이 설명된 방법과 다른 순서로 수행되거나, 및/또는 설명된 시스템, 구조, 장치, 회로 등의 구성요소들이 설명된 방법과 다른 형태로 결합 또는 조합되거나, 다른 구성요소 또는 균등물에 의하여 대치되거나 치환되더라도 적절한 결과가 달성될 수 있다.Although the embodiments have been described by the limited embodiments and drawings as described above, various modifications and variations can be made from the above description to those of ordinary skill in the art. For example, the described techniques are performed in a different order from the described method, and/or components such as systems, structures, devices, circuits, etc. described are combined or combined in a form different from the described method, or other components Alternatively, even if substituted or substituted by an equivalent, an appropriate result can be achieved.

그러므로, 다른 구현들, 다른 실시예들 및 특허청구범위와 균등한 것들도 후술하는 특허청구범위의 범위에 속한다.Therefore, other implementations, other embodiments, and those equivalent to the claims also fall within the scope of the claims to be described later.

Claims

If the pair of input voices is a log mel spectrogram of a language delivering language content and a log mel spectrogram of a style reference voice, performing voice conversion (VC);
When the pair of input voices is a log mel spectrogram of a one-hot representative text and a style reference voice, performing text-to-speech (TTS);
Mapping both a log mel spectrogram and a one-hot representative text of a language delivering language content to the same space and then decoding them into a mel spectrogram; And
Acquiring a linear spectrum from the decoded mel spectrogram through a preprocessor
Emotional voice conversion method comprising a.

The method of claim 1,
When the pair of input voices is a log mel spectrogram of a language delivering language content and a log mel spectrogram of a style reference voice, the step of performing speech conversion includes:
The log mel spectrogram of the language delivering the language content is embedded through the content encoder, and the log mel spectrogram of the style reference voice is embedded through the style encoder.
Emotional speech conversion method.

The method of claim 1,
When the pair of input voices is a log mel spectrogram of one-hot representative text and style reference voice, the step of performing text-to-speech conversion includes:
One-hot representative text is embedded through the text encoder, and the log mel spectrogram of the style reference voice is embedded through the style encoder.
Emotional speech conversion method.

The method of claim 1,
Both the log mel spectrogram and the one-hot representative text of the language conveying the language content are mapped to the same space and then decoded into the mel spectrogram.
In each decoding step, the style vector extracted from the log mel spectrogram of the style reference voice is connected to the attention RNN and the decoder RNN.
Emotional speech conversion method.

The method of claim 2,
Both the log mel spectrogram and the one-hot representative text of the language conveying the language content are mapped to the same space and then decoded into the mel spectrogram.
When text-to-speech is performed through the text encoder, the attention RNN, the decoder RNN, and the preprocessor, the context vector is used for all repetitions of the attention RNN.
Emotional speech conversion method.

The method of claim 1,
When considering style reference speech, the style encoder is designed to extract only emotion information and remove linguistic content, and to extract emotion regardless of linguistic content to process multiple input style domains, and when the extracted emotion is injected into the decoder, To process many-to-many emotional speech transformations by creating a variety of emotions.
Emotional speech conversion method.

Whether the pair of input voices is the log Mel spectrogram of the language conveying the language content and the log Mel spectrogram of the style reference voice, or a one-hot representative of the text and style reference voice. A conversion unit that performs speech conversion or text-to-speech conversion through a style encoder, a content encoder, and a text encoder according to whether it is a log mel spectrogram;
A neural network network in which both log mel spectrograms and one-hot representative texts of a language delivering language content are mapped to the same space and then decoded into mel spectrograms; And
Preprocessor to obtain a linear spectrum from the decoded mel spectrogram
Emotional speech conversion device comprising a.

The method of claim 7,
The conversion unit,
If the pair of input voices is the log mel spectrogram of the language that conveys the language content and the log mel spectrogram of the style reference voice, voice conversion is performed, and
The log mel spectrogram of the language delivering the language content is embedded through the content encoder, and the log mel spectrogram of the style reference voice is embedded through the style encoder.
Emotional speech conversion device.

The method of claim 7,
The conversion unit,
If the pair of input speech is a log mel spectrogram of one-hot representative text and style reference speech, text-to-speech is performed, and
One-hot representative text is embedded through the text encoder, and the log mel spectrogram of the style reference voice is embedded through the style encoder.
Emotional speech conversion device.

The method of claim 7,
Neural network network,
In each decoding step, the style vector extracted from the log mel spectrogram of the style reference voice is connected to the attention RNN and the decoder RNN.
Emotional speech conversion device.

The method of claim 8,
Neural network network,
When text-to-speech is performed through the text encoder, the attention RNN, the decoder RNN, and the preprocessor, the context vector is used for all repetitions of the attention RNN.
Emotional speech conversion device.

The method of claim 7,
When considering style reference speech, the style encoder is designed to extract only emotion information and remove linguistic content, and to extract emotion regardless of linguistic content to process multiple input style domains, and when the extracted emotion is injected into the decoder, To process many-to-many emotional speech transformations by creating a variety of emotions.
Emotional speech conversion device.