KR20200084443A - System and method for voice conversion - Google Patents

System and method for voice conversion

Info

Publication number
KR20200084443A
Authority
KR
South Korea
Prior art keywords
model
voice
gan
module
speech
Prior art date
Application number
KR1020180169788A
Other languages
Korean (ko)
Inventor
김경섭
강천성
김동하
임진수
Original Assignee
충남대학교산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 충남대학교산학협력단 filed Critical 충남대학교산학협력단
Priority to KR1020180169788A priority Critical patent/KR20200084443A/en
Publication of KR20200084443A publication Critical patent/KR20200084443A/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G10L 2021/0135 Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

According to the present invention, a voice conversion system includes an SR model training module, a GAN model training module, and a voice conversion module. The SR model training module trains the SR model on MFCC features extracted from an input sound source. The GAN model training module generates PPGs by inputting a target voice into the SR model and then trains the GAN model. The voice conversion module generates PPGs by inputting the MFCC of the input voice into the SR model, converts the PPGs into a spectrogram of the target voice through a generator trained with the GAN structure, and then reconstructs the spectrogram into speech.

Description

Voice conversion system and method {SYSTEM AND METHOD FOR VOICE CONVERSION}

The present invention relates to a voice conversion system and method, and more particularly, to a voice conversion system and method for converting the voice of one speaker to match the voice characteristics of another speaker.

Voice conversion (voice modulation) refers to converting the voice of one speaker to match the voice characteristics of another speaker.

Existing voice conversion techniques include those described in Non-Patent Documents 1 and 2 (voice conversion based on Gaussian Mixture Models (GMM) using paired parallel data in which different speakers utter the same sentences), the technique described in Non-Patent Document 3 (voice conversion based on Bidirectional Long Short-Term Memory networks), and, as described in Non-Patent Document 4, techniques that convert voices in stages by generating PPGs (Phonetic PosteriorGrams) as an intermediate representation without paired data.

Most voice conversion studies generate a spectrogram for the converted speech and train on the Mean Squared Error (MSE) between the generated spectrogram and the actual spectrogram. However, training with MSE has a strong tendency to average the generated spectrogram toward the ground truth, so the resolution (sound quality) of the generated result is degraded.
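
For reference, the frame-wise MSE objective referred to here can be written as follows; the notation is illustrative and not taken from the original, with S the target spectrogram, Ŝ the generated spectrogram, T the number of frames and F the number of frequency bins:

```latex
\mathcal{L}_{\mathrm{MSE}} \;=\; \frac{1}{T F}\sum_{t=1}^{T}\sum_{f=1}^{F}\left( S_{t,f} - \hat{S}_{t,f} \right)^{2}
```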

[Non-Patent Document 1] Stylianou, Yannis, Olivier Cappe, and Eric Moulines, "Continuous probabilistic transform for voice conversion.", IEEE Transactions on Speech and Audio Processing 6.2, 131-142, 1998.
[Non-Patent Document 2] Toda, Tomoki, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory.", IEEE Transactions on Audio, Speech, and Language Processing 15.8, 2222-2235, 2007.
[Non-Patent Document 3] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional Long Short-Term Memory based Recurrent Neural Networks", in Proc. ICASSP, 2015.
[Non-Patent Document 4] Sun, Lifa, et al., "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training.", 2016 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2016.

The problem to be solved by the present invention is to provide a voice conversion system and method with improved sound quality after conversion.

The voice conversion system according to the present invention comprises an SR model training module, a GAN model training module, and a voice conversion module, wherein the SR model training module extracts MFCC features from an input sound source and trains the SR model; the GAN model training module inputs a target voice into the SR model to generate PPGs and then trains the GAN model; and the voice conversion module inputs the MFCC of the input voice into the SR model to generate PPGs, converts them into a spectrogram of the target voice through a generator trained with the GAN structure, and then reconstructs the spectrogram into speech.

The voice conversion method according to the present invention is a voice conversion method using a voice conversion system comprising an SR model training module, a GAN model training module, and a voice conversion module, and includes: a first step in which the SR model training module extracts MFCC features from an input sound source and trains the SR model; a second step in which the GAN model training module inputs a target voice into the SR model to generate PPGs and then trains the GAN model; and a third step in which the voice conversion module inputs the MFCC of the input voice into the SR model to generate PPGs, converts them into a spectrogram of the target voice through a generator trained with the GAN structure, and then reconstructs the spectrogram into speech.

With the voice conversion system and method according to the present invention, sound quality after conversion is improved.

Fig. 1 is a flow chart of the voice conversion process of the present invention.
Fig. 2 shows the structure of the SR model of the present invention.
Fig. 3 shows the structure of the GAN model of the present invention.

The present invention may be variously modified and may have various embodiments, and specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the present invention to the specific embodiments; the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the present invention, detailed descriptions of related known technologies are omitted where they could obscure the subject matter of the present invention.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise.

In most voice conversion studies, a spectrogram is generated for the converted speech and training is performed based on the MSE between the generated and actual spectrograms. However, training with MSE tends to average the generated spectrogram toward the ground truth, so the resolution of the result is degraded. The present invention addresses this problem by adding a GAN structure [see Goodfellow, Ian, et al., "Generative adversarial nets.", Advances in Neural Information Processing Systems, 2014], which performs well among generative models. In addition, the pronunciation of the input voice is recognized through PPGs, and the speech synthesis module of Tacotron [see Wang, Yuxuan, et al., "Tacotron: A fully end-to-end text-to-speech synthesis model.", arXiv preprint, 2017], a representative model in the TTS field, is used to achieve voice conversion between non-parallel data with improved performance.

The voice conversion system and method according to the present invention set up a GAN training model based on an existing voice conversion model that uses PPGs. The conversion flow of the method is shown in Fig. 1, which is a flow chart of the voice conversion process of the present invention.

The flow chart of Fig. 1 proceeds in three stages: SR model training (Speech Recognition Model Training), GAN model training (GAN Model Training), and voice conversion.

Here, calling the module that performs SR model training the SR model training module, the module that performs GAN model training the GAN model training module, and the module that performs voice conversion the voice conversion module, the voice conversion system of the present invention includes an SR model training module, a GAN model training module, and a voice conversion module.

Parameter Extraction refers to the step of extracting Mel-frequency cepstral coefficients (MFCC) from the input voice data.
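
The following is a minimal sketch of this step using librosa; the sample rate, frame sizes, and number of coefficients are illustrative assumptions, since the patent does not state them:

```python
# Minimal sketch of the Parameter Extraction step (MFCC features) with librosa.
# Sample rate, n_mfcc, n_fft and hop_length are assumed values.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=40):
    """Load a waveform and return its MFCC features, one vector per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=1024, hop_length=256)
    return mfcc.T  # shape (frames, n_mfcc)
```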

SR model training refers to the step of training a model that takes MFCCs as input and generates PPGs.

Griffin-Lim Reconstruction refers to the step of generating a waveform based on the Griffin-Lim algorithm.
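
A minimal sketch of this reconstruction step, again using librosa; the number of iterations and STFT parameters are assumptions:

```python
# Minimal sketch of Griffin-Lim reconstruction: a linear-scale magnitude
# spectrogram is turned back into a waveform. n_iter, n_fft and hop_length
# are assumed values.
import librosa
import soundfile as sf

def spectrogram_to_wave(mag_spec, n_fft=1024, hop_length=256):
    """mag_spec: magnitude spectrogram of shape (1 + n_fft // 2, frames)."""
    return librosa.griffinlim(mag_spec, n_iter=60,
                              hop_length=hop_length, win_length=n_fft)

# Example: sf.write("converted.wav", spectrogram_to_wave(predicted_spec), 16000)
```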

SR model training is described as follows.

Fig. 2 shows the structure of the SR model of the present invention.

As shown in Fig. 2, the SR model is built to output PPGs that represent the probability of each phoneme per unit of time. The SI-ASR model used in [Sun, Lifa, et al., "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training.", 2016 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2016] was implemented using a Prenet module (two fully connected layers with Dropout [Srivastava, Nitish, et al., "Dropout: a simple way to prevent neural networks from overfitting.", The Journal of Machine Learning Research 15.1, 1929-1958, 2014]), the CBHG module of Tacotron, and a fully connected layer. MFCC features extracted from the input sound source were used as input, and classification training over the phoneme classes was performed on the generated output through Softmax and argmax. During GAN model training, PPGs are generated by applying Softmax without argmax so that a vector over all classes is obtained, and the SR model is not trained further, so that it outputs fixed PPGs for the input voice while the generator is being trained.
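
A simplified PyTorch sketch of such an SR model is given below; the CBHG module is abbreviated to a bidirectional GRU and all layer sizes are assumptions, with only the MFCC input, the Prenet with dropout, Softmax/argmax classification, and frozen PPG generation taken from the description above:

```python
# Simplified sketch of the SR model: Prenet (two FC layers with dropout),
# a stand-in for Tacotron's CBHG module (here a bidirectional GRU), and a
# final FC layer over the phoneme classes. Sizes are assumed; the 61 phoneme
# classes come from the experiments described later in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    def __init__(self, in_dim, hidden=256, out_dim=128, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
        self.p = p

    def forward(self, x):
        x = F.dropout(F.relu(self.fc1(x)), self.p, self.training)
        return F.dropout(F.relu(self.fc2(x)), self.p, self.training)

class SRModel(nn.Module):
    def __init__(self, n_mfcc=40, n_phonemes=61):
        super().__init__()
        self.prenet = Prenet(n_mfcc)
        # Stand-in for the CBHG module (conv bank + highway + GRU).
        self.rnn = nn.GRU(128, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(256, n_phonemes)

    def forward(self, mfcc):                 # mfcc: (batch, frames, n_mfcc)
        h, _ = self.rnn(self.prenet(mfcc))
        return self.classifier(h)            # logits: (batch, frames, n_phonemes)

def ppgs_from_mfcc(sr_model, mfcc):
    """Softmax (without argmax) over the phoneme classes gives the PPGs used
    during GAN training; the SR model itself is kept frozen."""
    with torch.no_grad():
        return torch.softmax(sr_model(mfcc), dim=-1)
```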

Phoneme Classification is the step of classifying phonemes: for the voice arriving at each unit of time, posterior probabilities over the phoneme classes are computed to generate PPGs.

GAN model training is described as follows.

In the GAN model of the present invention, the generator for speech synthesis is constructed as shown in Fig. 3, based on the decoder of Tacotron, a well-known model for speech synthesis.

Fig. 3 shows the structure of the GAN model of the present invention.

The target voice is input to the SR model to generate PPGs; these are concatenated with input noise z, passed through the Prenet module, and used as input to the Attention RNN [Vinyals, Oriol, et al., "Grammar as a foreign language.", Advances in Neural Information Processing Systems, 2015]. In addition, the PPGs without the noise attached are fed in as the memory of the attention mechanism.

The output of the Attention RNN is decoded using two GRUs [Chung, Junyoung, et al., "Empirical evaluation of gated recurrent neural networks on sequence modeling.", arXiv preprint arXiv:1412.3555, 2014] with residual connections, and is passed through the Prenet and CBHG modules to generate a linear-scale spectrogram of the target voice.
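
A heavily simplified PyTorch sketch of this generator is given below; the attention mechanism is reduced to a plain GRU over the frame sequence, the CBHG post-net is reduced to a linear projection, and all layer sizes are assumptions:

```python
# Highly simplified sketch of the generator: PPGs concatenated with noise z,
# a Prenet, a GRU standing in for the Attention RNN, two decoder GRUs with
# residual connections, and a projection to the linear-scale spectrogram.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_phonemes=61, z_dim=16, hidden=256, n_spec_bins=513):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_phonemes + z_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, 128), nn.ReLU(), nn.Dropout(0.5))
        self.attention_rnn = nn.GRU(128, hidden, batch_first=True)
        self.decoder_rnn1 = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder_rnn2 = nn.GRU(hidden, hidden, batch_first=True)
        self.to_spec = nn.Linear(hidden, n_spec_bins)  # stand-in for CBHG post-net

    def forward(self, ppgs, z):
        # ppgs: (batch, frames, n_phonemes); z: (batch, frames, z_dim)
        x = self.prenet(torch.cat([ppgs, z], dim=-1))
        h, _ = self.attention_rnn(x)
        d1, _ = self.decoder_rnn1(h)
        d1 = d1 + h                      # residual connection
        d2, _ = self.decoder_rnn2(d1)
        d2 = d2 + d1                     # residual connection
        return self.to_spec(d2)          # predicted linear-scale spectrogram
```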

The discriminator is constructed from a Prenet module, a CBHG module, a GRU, and a Sigmoid, and determines whether the data given as input is real or fake. The discriminator model is configured to receive both fake data and real data as input. The image generated by the generator is paired with the generator's input PPGs to form fake data, while real data is fed into the SR model to generate PPGs, which are paired with it to form real data. These input data are fed to the discriminator alternately.
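
A correspondingly simplified sketch of the discriminator, which scores a (spectrogram, PPGs) pair as real or fake; the CBHG module is again abbreviated and all layer sizes are assumptions:

```python
# Simplified sketch of the discriminator: Prenet -> GRU (standing in for the
# CBHG module) -> Sigmoid. The input is a spectrogram frame sequence
# concatenated with the PPGs it was conditioned on.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, n_spec_bins=513, n_phonemes=61, hidden=256):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(n_spec_bins + n_phonemes, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, 128), nn.ReLU(), nn.Dropout(0.5))
        self.rnn = nn.GRU(128, hidden, batch_first=True)  # stand-in for CBHG + GRU
        self.out = nn.Linear(hidden, 1)

    def forward(self, spec, ppgs):
        x = self.prenet(torch.cat([spec, ppgs], dim=-1))
        _, h = self.rnn(x)                      # final hidden state
        return torch.sigmoid(self.out(h[-1]))  # probability that the pair is real
```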

The GAN (Generative Adversarial Networks) model is a network in which two models, a Generator (spectrogram generator) and a Discriminator (spectrogram discriminator), learn adversarially.

The Generator is a model that generates fake data. The goal of the Generator is to generate data so convincing that the Discriminator cannot tell it apart from real data.

It is the model that generates the desired data; in the proposed model it was built by adopting the structure of Tacotron. In the present model it generates a spectrogram. A spectrogram is a technique for visualizing the spectrum of a sound as a graph. It combines the characteristics of a waveform, in which changes in amplitude over time can be seen visually, and of a spectrum, in which changes in amplitude over frequency can be seen visually; differences in amplitude over the time and frequency axes are represented by intensity or display color.

Data generated by the Generator is mixed with real data and shown to the Discriminator, and the Discriminator is the model that determines whether the input data is real or fake.

In the GAN structure, the Generator and Discriminator learn adversarially: the Generator produces increasingly realistic data and the Discriminator becomes better at telling sophisticated fakes apart, so that sharper data is generated compared with existing generative models.

Tacotron is a text-to-speech model made by Google; the generator part of the proposed model was implemented based on it, taking the structure of the generator from that paper.

CBHG is a module used in Tacotron for extracting features from speech data; it consists of convolution layers, a Highway network of fully connected layers, and a GRU. In the Tacotron paper it is described as a network used to extract high-level features.
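
A simplified PyTorch sketch of a CBHG-style module along these lines is shown below; max-pooling and the projection convolutions of the original module are omitted, and all layer sizes are assumptions:

```python
# Simplified sketch of a CBHG-style module: a bank of 1-D convolutions with
# kernel sizes 1..K, a highway layer, and a bidirectional GRU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.t(x))
        return gate * F.relu(self.h(x)) + (1.0 - gate) * x

class SimpleCBHG(nn.Module):
    def __init__(self, in_dim=128, bank_channels=128, K=8, rnn_dim=128):
        super().__init__()
        self.bank = nn.ModuleList(
            [nn.Conv1d(in_dim, bank_channels, kernel_size=k, padding=k // 2)
             for k in range(1, K + 1)])
        self.proj = nn.Linear(bank_channels * K, in_dim)
        self.highway = Highway(in_dim)
        self.rnn = nn.GRU(in_dim, rnn_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                     # x: (batch, frames, in_dim)
        t = x.transpose(1, 2)                 # Conv1d expects (batch, channels, frames)
        feats = [conv(t)[:, :, : t.size(2)] for conv in self.bank]
        stacked = torch.cat(feats, dim=1).transpose(1, 2)
        h = self.highway(self.proj(stacked) + x)   # residual back to the input
        out, _ = self.rnn(h)
        return out                            # (batch, frames, 2 * rnn_dim)
```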

Attention is the part that determines which portion of the given data to focus on; adding an Attention RNN is said to help the generalization of the model. Generalization of a model means how well it performs on problems it has not been trained on. An RNN is a type of artificial neural network in which hidden nodes are connected by directed edges to form a directed cycle.

In the voice conversion process, the MFCC of the input voice is input to the SR model to generate PPGs, which are converted into a spectrogram of the target voice through the generator trained with the GAN structure. This spectrogram is then reconstructed into speech using the Griffin-Lim vocoder.
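
A sketch of this conversion path, tying together the illustrative helpers from the earlier sketches (extract_mfcc, ppgs_from_mfcc, Generator, spectrogram_to_wave); these names are assumptions of this description, not an actual released implementation:

```python
# End-to-end conversion sketch: MFCC -> SR model -> PPGs -> generator ->
# linear spectrogram -> Griffin-Lim waveform, reusing the helpers above.
import torch

def convert_voice(wav_path, sr_model, generator, z_dim=16):
    mfcc = torch.from_numpy(extract_mfcc(wav_path)).float().unsqueeze(0)
    ppgs = ppgs_from_mfcc(sr_model, mfcc)               # (1, frames, 61)
    z = torch.randn(ppgs.size(0), ppgs.size(1), z_dim)  # input noise
    with torch.no_grad():
        spec = generator(ppgs, z).squeeze(0)            # (frames, bins)
    return spectrogram_to_wave(spec.numpy().T)          # Griffin-Lim expects (bins, frames)
```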

The loss function of the proposed model is constructed by adding a reconstruction loss to the formulation used in the conditional GAN. The basic adversarial loss is given by equation (1) below; the discriminator is trained to maximize this expression and the generator to minimize it.

[Equation (1) — adversarial loss; reproduced only as an image ("Figure pat00001") in the original]

In order not only to fool the discriminator but also to generate spectrograms similar to the ground truth, the distance term used in pix2pix [see Isola, Phillip, et al., "Image-to-image translation with conditional adversarial networks.", arXiv preprint, 2017] was additionally used in the generator's loss function. The distance between the spectrogram predicted by the generator and the spectrogram of the actual input speech, given by equation (2) below, is added to the generator's GAN loss in equation (1).

[Equation (2) — reconstruction (distance) loss; reproduced only as an image ("Figure pat00002") in the original]

Finally, the loss functions of the model are given by equations (3) and (4) below.

[Equation (3) — final loss function; reproduced only as an image ("Figure pat00003") in the original]

[Equation (4) — final loss function; reproduced only as an image ("Figure pat00004") in the original]
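
Since the equations are reproduced only as images in the original publication, the following is a plausible reconstruction based on the surrounding description (a conditional-GAN adversarial loss plus a pix2pix-style reconstruction term), not the patent's exact notation; here x is the real spectrogram, p the PPGs, z the input noise, and λ a weighting factor:

```latex
% Adversarial loss, cf. Eq. (1): D is trained to maximise it, G to minimise it.
\mathcal{L}_{\mathrm{adv}}(G, D) =
  \mathbb{E}_{x}\!\left[\log D(x, p)\right] +
  \mathbb{E}_{z}\!\left[\log\!\left(1 - D(G(z, p), p)\right)\right]

% Reconstruction (distance) loss, cf. Eq. (2): distance between the predicted
% spectrogram and the ground-truth spectrogram, as in pix2pix.
\mathcal{L}_{\mathrm{rec}}(G) = \mathbb{E}\!\left[\lVert x - G(z, p)\rVert\right]

% Final objectives, cf. Eqs. (3) and (4).
\mathcal{L}_{D} = -\,\mathcal{L}_{\mathrm{adv}}(G, D), \qquad
\mathcal{L}_{G} = \mathcal{L}_{\mathrm{adv}}(G, D) + \lambda\,\mathcal{L}_{\mathrm{rec}}(G)
```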

To compare the performance of the model using the GAN, a comparison was made against a baseline model; the baseline was built on the Voice Conversion Toolkit of the open-source FestVox system [see Anumanchipalli, Gopala Krishna, Kishore Prahallad, and Alan W. Black, "Festvox: Tools for creation and analyses of large speech corpora.", Workshop on Very Large Scale Phonetics Research, UPenn, Philadelphia, 2011].

The baseline model is based on a GMM. For training, the number of Gaussian mixtures was set to 64, and training was performed only on the training data without additional resources.

The experimental results for the model of the present invention are as follows.

In the present invention, the TIMIT corpus [see Garofolo, John S., "TIMIT acoustic phonetic continuous speech corpus.", Linguistic Data Consortium, 1993] was used to train the speech recognition model, and 61 English phonemes were used as the PPG pronunciation classes. The phoneme classification accuracy of the speech recognition model was 53%.

In the training stage of the GAN-based speech synthesis model, the ARCTIC corpus [see Kominek, John, and Alan W. Black, "The CMU Arctic speech databases.", Fifth ISCA Workshop on Speech Synthesis, 2004] was used to train the model. For speech evaluation, part of the ARCTIC corpus was split off and used as training data and validation data.

To prevent divergence of the gradients with respect to the model weights during training, gradient clipping [see Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio, "On the difficulty of training recurrent neural networks.", International Conference on Machine Learning, 2013] was used, and to reduce the network's vulnerability to adversarial examples, one-sided label smoothing [see Salimans, Tim, et al., "Improved techniques for training gans.", Advances in Neural Information Processing Systems, 2016] with the positive label value set to 0.9 was used.

As for the number of training iterations, the baseline used the program's default settings, and the proposed model was trained until the generator loss converged and no longer changed. The Adam optimizer was used, the learning rate was set to 3e-4, and training ran for about 790 epochs.
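
A sketch of one training step reflecting the details above (Adam at learning rate 3e-4, gradient clipping, one-sided label smoothing with real targets set to 0.9, and an added reconstruction term); the clipping threshold and the weight lambda_rec are assumptions, and generator/discriminator refer to the illustrative sketches above:

```python
# One adversarial training step: discriminator update with one-sided label
# smoothing, then generator update with adversarial + reconstruction loss,
# both with gradient clipping and Adam (lr 3e-4).
import torch
import torch.nn.functional as F

g_opt = torch.optim.Adam(generator.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

def train_step(real_spec, ppgs, z, lambda_rec=10.0, clip=1.0):
    # --- discriminator update (real targets smoothed to 0.9) ---
    fake_spec = generator(ppgs, z).detach()
    d_real = discriminator(real_spec, ppgs)
    d_fake = discriminator(fake_spec, ppgs)
    d_loss = F.binary_cross_entropy(d_real, torch.full_like(d_real, 0.9)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); d_loss.backward()
    torch.nn.utils.clip_grad_norm_(discriminator.parameters(), clip)
    d_opt.step()

    # --- generator update: adversarial loss + reconstruction loss ---
    fake_spec = generator(ppgs, z)
    g_score = discriminator(fake_spec, ppgs)
    g_loss = F.binary_cross_entropy(g_score, torch.ones_like(g_score)) + \
             lambda_rec * F.l1_loss(fake_spec, real_spec)
    g_opt.zero_grad(); g_loss.backward()
    torch.nn.utils.clip_grad_norm_(generator.parameters(), clip)
    g_opt.step()
    return d_loss.item(), g_loss.item()
```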

As for hardware, an Intel i5 8400 2.8 GHz CPU and an NVIDIA GTX 1080 GPU were used; the SR model took 10 hours to train, the baseline 8 hours, and the GAN model 30 hours.

To compare the speech generated by each model, an MOS (Mean Opinion Score) test was used as the evaluation method. The MOS was obtained by rating the naturalness and clarity of pronunciation of the generated speech on a scale from 1 to 5. The test was conducted with 17 men and women with normal hearing, using the validation split of the ARCTIC corpus and conversions between speakers of different genders.

Table 1 below shows the MOS test results for male-to-female voice conversion.

            Sound quality    Pronunciation
Baseline    2.18             3.59
GAN         2.65             3.53

The MOS result for sound quality was 0.47 points higher for the proposed model than for the baseline model. The MOS result for pronunciation accuracy was 0.06 points lower for the proposed model. Thus, the proposed model showed a large improvement in sound quality over the baseline model and comparable performance in pronunciation.

Therefore, the voice conversion method according to the present invention includes the following three steps.

(1) Step 1: the SR model training module extracts MFCC features from the input sound source and trains the SR model.

(2) Step 2: the GAN model training module inputs the target voice into the SR model to generate PPGs and then trains the GAN model.

(3) Step 3: the voice conversion module inputs the MFCC of the input voice into the SR model to generate PPGs, converts them into a spectrogram of the target voice through the generator trained with the GAN structure, and then reconstructs this into speech.

In the present invention, voice conversion, i.e., converting the voice of one speaker to match the voice characteristics of another speaker, was performed using PPGs and a GAN structure; the experimental results showed improved MOS scores for the generated speech compared with the baseline model.

Claims (2)

1. A voice conversion system comprising an SR model training module, a GAN model training module, and a voice conversion module, wherein:
the SR model training module extracts MFCC features from an input sound source and trains the SR model;
the GAN model training module inputs a target voice into the SR model to generate PPGs and then trains the GAN model; and
the voice conversion module inputs the MFCC of the input voice into the SR model to generate PPGs, converts them into a spectrogram of the target voice through a generator trained with the GAN structure, and then reconstructs the spectrogram into speech.
2. A voice conversion method using a voice conversion system comprising an SR model training module, a GAN model training module, and a voice conversion module, the method comprising:
a first step in which the SR model training module extracts MFCC features from an input sound source and trains the SR model;
a second step in which the GAN model training module inputs a target voice into the SR model to generate PPGs and then trains the GAN model; and
a third step in which the voice conversion module inputs the MFCC of the input voice into the SR model to generate PPGs, converts them into a spectrogram of the target voice through a generator trained with the GAN structure, and then reconstructs the spectrogram into speech.
KR1020180169788A 2018-12-26 2018-12-26 System and method for voice conversion KR20200084443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020180169788A KR20200084443A (en) 2018-12-26 2018-12-26 System and method for voice conversion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020180169788A KR20200084443A (en) 2018-12-26 2018-12-26 System and method for voice conversion

Publications (1)

Publication Number Publication Date
KR20200084443A true KR20200084443A (en) 2020-07-13

Family

ID=71570779

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020180169788A KR20200084443A (en) 2018-12-26 2018-12-26 System and method for voice conversion

Country Status (1)

Country Link
KR (1) KR20200084443A (en)


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional Long Short-Term Memory based Recurrent Neural Networks", in Proc. ICASSP, 2015.
Stylianou, Yannis, Olivier Cappe, and Eric Moulines, "Continuous probabilistic transform for voice conversion.", IEEE Transactions on speech and audio processing 6.2, 131-142, 1998.
Sun, Lifa, et al. "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training.", 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016.
Toda, Tomoki, Alan W. Black, and Keiichi Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory.", IEEE Transactions on Audio, Speech, and Language Processing 15.8, 2222-2235, 2007.

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112216293A (en) * 2020-08-28 2021-01-12 北京捷通华声科技股份有限公司 Tone conversion method and device
CN111933187A (en) * 2020-09-21 2020-11-13 深圳追一科技有限公司 Emotion recognition model training method and device, computer equipment and storage medium
CN112786012A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN112786012B (en) * 2020-12-31 2024-05-31 科大讯飞股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN113012678A (en) * 2021-02-05 2021-06-22 江苏金陵科技集团有限公司 Method and device for synthesizing voice of specific speaker without marking
CN113012678B (en) * 2021-02-05 2024-01-19 江苏金陵科技集团有限公司 Label-free specific speaker voice synthesis method and device
CN113436609A (en) * 2021-07-06 2021-09-24 南京硅语智能科技有限公司 Voice conversion model and training method thereof, voice conversion method and system
CN113436609B (en) * 2021-07-06 2023-03-10 南京硅语智能科技有限公司 Voice conversion model, training method thereof, voice conversion method and system
CN113662524A (en) * 2021-08-23 2021-11-19 合肥工业大学 Method for removing motion artifacts of PPG (photoplethysmography) signals
CN113662524B (en) * 2021-08-23 2024-04-30 合肥工业大学 Method for removing PPG signal motion artifact
CN114299910A (en) * 2021-09-06 2022-04-08 腾讯科技(深圳)有限公司 Training method, using method, device, equipment and medium of speech synthesis model
CN114299910B (en) * 2021-09-06 2024-03-22 腾讯科技(深圳)有限公司 Training method, using method, device, equipment and medium of speech synthesis model

Similar Documents

Publication Publication Date Title
Fang et al. High-quality nonparallel voice conversion based on cycle-consistent adversarial network
Zhang et al. Sequence-to-sequence acoustic modeling for voice conversion
KR20200084443A (en) System and method for voice conversion
Deng et al. Recognizing emotions from whispered speech based on acoustic feature transfer learning
Kim et al. Real-time emotion detection system using speech: Multi-modal fusion of different timescale features
Van Segbroeck et al. Rapid language identification
Liu et al. Simple and effective unsupervised speech synthesis
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Sheng et al. Reducing over-smoothness in speech synthesis using Generative Adversarial Networks
CN114299917A (en) StyleGAN emotion voice conversion method based on fundamental frequency difference compensation
Shah et al. Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing
Patel et al. Novel adaptive generative adversarial network for voice conversion
Fernandez-Lopez et al. End-to-end lip-reading without large-scale data
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Zhang et al. Recognition-synthesis based non-parallel voice conversion with adversarial learning
Zhao et al. Research on voice cloning with a few samples
Abumallouh et al. Deep neural network combined posteriors for speakers' age and gender classification
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
Othmane et al. Enhancement of esophageal speech using voice conversion techniques
Mansouri et al. DNN-based laughter synthesis
Sathiarekha et al. A survey on the evolution of various voice conversion techniques
Chandra et al. Towards The Development Of Accent Conversion Model For (L1) Bengali Speaker Using Cycle Consistent Adversarial Network (Cyclegan)
CN116403562B (en) Speech synthesis method and system based on semantic information automatic prediction pause
Ngoc et al. Adapt-Tts: High-Quality Zero-Shot Multi-Speaker Text-to-Speech Adaptive-Based for Vietnamese
Baali et al. Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation

Legal Events

Date Code Title Description
E601 Decision to refuse application