KR102526338B1

KR102526338B1 - Apparatus and method for synthesizing voice frequency using amplitude scaling of voice for emotion transformation

Info

Publication number: KR102526338B1
Application number: KR1020220008340A
Authority: KR
Inventors: 정경용; 권혜정; 김민정; 백지원
Original assignee: 경기대학교 산학협력단
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2023-04-26

Abstract

Disclosed is a technology relating to an apparatus for synthesizing a voice frequency using amplitude scaling of a voice for emotion transformation. The apparatus extracts a spectrum signal (SP) and a fundamental frequency signal (f0) from a source voice signal using an input vocoder, generates a spectrum signal and a fundamental frequency signal with emotion using, for each signal, an encoder and a generator trained in parallel with a VAW-GAN neural network model, and outputs an emotion-transformed voice signal using an output vocoder. A continuous wavelet transform (CWT) is applied to the fundamental frequency signal (f0) so as to reflect prosodic information, and amplitude scaling that reflects the amplitude of the original emotion is applied to the spectrum signal with emotion so as to obtain a converted voice with high emotional similarity.

Description

Apparatus and method for synthesizing voice frequency using amplitude scaling of voice for emotion transformation

Disclosed is a data processing technology, in particular one relating to a voice frequency synthesis apparatus that converts the emotion of a source voice using a generator trained with a deep learning neural network.

The present invention is derived from research conducted as part of the Gyeonggi-do Regional Cooperation Research Center project of Gyeonggi Province. [GRRC Gyeonggi 2020-B03, Industrial Statistics and Data Mining Research]

With the development of artificial intelligence, deep learning models for voice conversion are being actively researched. Deep learning-based generative models such as the generative adversarial network (GAN) and the variational autoencoder (VAE) are frequently used in voice conversion and have greatly improved its quality.

CycleGAN, which modifies the loss function of the conventional GAN so that the style of the original (ground truth) changes while its characteristics are preserved, has shown good performance in voice conversion. However, it supports only one-to-one voice conversion and is therefore speaker-dependent. VAE-based voice conversion supports speaker-independent n-to-m conversion, but its sound quality is poor.

Meanwhile, text-to-speech (TTS) allows a machine to automatically generate the sound waves corresponding to speech. However, existing voice conversion models use spectrum mapping without considering prosody, so the converted speech carries no emotion. Recently, research has been conducted on emotional voice conversion, which adds a different emotion without loss of linguistic information, in order to synthesize speech closer to the human voice. In particular, because prosody carries the emotional information of an utterance as well as its linguistic information, it helps to analyze and express utterances accurately in voice conversion.

The paper by K. Zhou et al. (arXiv:2011.02314 [cs.SD]), published in November 2020, relates to a method of decomposing and reconstructing the emotional elements of speech using VAW-GAN (Variational Autoencoding Wasserstein Generative Adversarial Network), a neural network model combining a VAE and a GAN. It discloses a method of training an encoder for emotional voice conversion by separating emotion and prosody (F0) information from spectral features.

Korean Patent Application Publication No. 10-2021-0032235, published on March 24, 2021, relates to an apparatus for synthesizing emotional speech. It discloses an apparatus that builds a synthesis model for generating synthesized voice data and an emotion model for generating emotional voice data, and that inputs a user's requested sentence and emotion information into a user synthesis model trained on the basis of those two models, thereby generating user-synthesized emotional voice data reflecting an emotion appropriate to the requested sentence.

However, a voice conversion method that effectively reflects prosody information, supports speaker-independent voice conversion, and improves voice quality has not yet been specifically disclosed.

An object of the proposed invention is to provide a voice frequency synthesis apparatus capable of speaker-independent emotion conversion through n-to-m voice conversion rather than one-to-one voice conversion.

A further object of the proposed invention is to provide a voice frequency synthesis apparatus capable of accurate emotion-dependent conversion by extracting and learning the prosody information of a voice.

A further object of the proposed invention is to provide a voice frequency synthesis apparatus capable of increasing the emotional similarity of the synthesized voice by correcting its difference from the emotion spectrum of the target data.

According to one aspect of the proposed invention, a voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice includes an input vocoder, an emotion spectrum generation unit, an emotion fundamental frequency generation unit, an amplitude scaling unit, and an output vocoder.

The input vocoder receives a source voice signal and outputs a source spectrum signal and a source fundamental frequency signal. The emotion spectrum generation unit generates, from the source spectrum signal, an emotion spectrum signal carrying emotion information. The emotion fundamental frequency generation unit generates, from the source fundamental frequency signal, an emotion fundamental frequency signal containing prosody information. The amplitude scaling unit adjusts the amplitude of the emotion spectrum signal and outputs an amplitude-scaled emotion spectrum signal. The output vocoder receives the amplitude-scaled emotion spectrum signal and the emotion fundamental frequency signal and outputs an emotion-converted voice signal.

According to an additional aspect, the emotion spectrum generation unit includes a spectrum encoder trained with a generative adversarial network (GAN) to compute an emotion-independent latent vector from the source spectrum signal, and a spectrum generator trained with a GAN to generate the emotion spectrum signal using the emotion-independent latent vector.

According to an additional aspect, the spectrum generator includes a fundamental frequency input unit that receives the source fundamental frequency signal, and generates the emotion spectrum signal using the emotion-independent latent vector and the source fundamental frequency signal.

According to an additional aspect, the emotion fundamental frequency generation unit includes a fundamental frequency encoder trained with a generative adversarial network (GAN) to compute an emotion-independent feature vector from the source fundamental frequency signal, and a fundamental frequency generator trained with a GAN to generate the emotion fundamental frequency signal using the emotion-independent feature vector.

According to an additional aspect, the emotion fundamental frequency generation unit further includes a continuous wavelet transformer that applies a continuous wavelet transform (CWT) to the source fundamental frequency signal.

According to an additional aspect, the amplitude scaling unit includes an amplitude scaling input unit that receives the emotion spectrum signal and a random spectrum signal generated by inputting an arbitrary latent vector to the spectrum generator, and an amplitude adjustment unit that adjusts the amplitude of the emotion spectrum signal using the mean and standard deviation of the random spectrum signal.
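The amplitude adjustment described above can be illustrated with a minimal sketch. This passage states only that the mean and standard deviation of the random spectrum signal are used and gives no exact formula, so the moment-matching rule below (standardize the emotion spectrum, then rescale it to the statistics of the random spectrum) is an assumption made for illustration. The names `amplitude_scale`, `emotion_sp`, `random_sp`, and `eps` are hypothetical, and the arrays are synthetic stand-ins for log-spectral frames.

```python
import numpy as np

def amplitude_scale(emotion_sp, random_sp, eps=1e-8):
    """Rescale emotion_sp so that its global mean and standard deviation
    match those of random_sp (assumed moment-matching rule)."""
    mu_e, sigma_e = emotion_sp.mean(), emotion_sp.std()
    mu_r, sigma_r = random_sp.mean(), random_sp.std()
    # Standardize the generated spectrum, then apply the reference statistics.
    return (emotion_sp - mu_e) / (sigma_e + eps) * sigma_r + mu_r

# Synthetic stand-ins: 100 frames x 513 frequency bins.
rng = np.random.default_rng(0)
emotion_sp = rng.normal(loc=-3.0, scale=0.5, size=(100, 513))
random_sp = rng.normal(loc=-2.0, scale=1.5, size=(100, 513))
scaled_sp = amplitude_scale(emotion_sp, random_sp)
```

After scaling, `scaled_sp` keeps the frame-by-frame shape of `emotion_sp` while taking on the overall level and spread of `random_sp`. Per-bin or per-frame statistics would be an equally plausible variant.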

According to the proposed invention, using a VAW-GAN-based generator for voice frequency synthesis makes it possible to provide a voice frequency synthesis apparatus that performs n-to-m voice conversion and speaker-independent emotion conversion.

Furthermore, the proposed invention enables more effective emotion-dependent conversion by training on prosody information obtained from the continuous-wavelet-transformed fundamental frequency.

Furthermore, the proposed invention can increase the emotional similarity of the synthesized voice by using amplitude scaling to correct the difference between the regenerated emotion spectrum and the emotion spectrum obtained during training.

FIG. 1 is a graph comparing the spectrum (SP) and the fundamental frequency (f0) of voices uttering the same sentence with different emotions.
FIG. 2 is a block diagram showing the configuration of a VAW-GAN training apparatus for a voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice according to an embodiment.
FIG. 3 is a block diagram showing the configuration of a voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice according to an embodiment.
FIG. 4 is a spectrogram showing the result of converting a neutral emotion into an angry emotion and a disgusted emotion using the voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice according to an embodiment.
FIG. 5 is a flowchart showing a VAW-GAN training method for a voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice according to an embodiment.
FIG. 6 is a flowchart showing a voice frequency synthesis method for emotion conversion using amplitude scaling of a voice according to an embodiment.

The foregoing and additional aspects are embodied through the embodiments described with reference to the accompanying drawings. It is understood that the components of each embodiment may be combined in various ways within that embodiment or with components of other embodiments, unless otherwise stated or mutually contradictory. On the principle that an inventor may appropriately define the concepts of terms in order to describe his or her invention in the best way, the terms used in this specification and the claims must be interpreted with meanings and concepts consistent with the description or the proposed technical idea. In this specification, a module or unit may be implemented as a set of program instructions executable by a computer or processor, or as a set of electronic components or circuits capable of carrying out such instructions. The operation of each module or unit may be performed by one or more processors or devices.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a graph comparing the spectrum (SP) and the fundamental frequency (f0) of voices uttering the same sentence with different emotions.

People have a wide variety of voices, and a voice can be analyzed in terms of frequency and amplitude. Signals extracted from a voice can be visualized by computer as a waveform, a spectrum, or a spectrogram, so that a deep learning model can analyze emotion. A waveform is the shape of the voice vibration generated over time. A spectrum shows the frequencies and amplitudes of a voice. Using a spectrum, the waveform of a sound can be analyzed and the components that make up the sound can be expressed. A spectrogram is a graphical representation of the sound spectrum that combines the characteristics of the waveform and the spectrum, showing time, frequency, and amplitude on the X, Y, and Z axes, respectively.

Emotion analysis is used to analyze a person's tendencies, opinions, and attitudes in conversation or text. For voice-based emotion analysis, a voice signal can be extracted as a spectrogram and fed to a deep learning model. For emotion conversion of voice data, a vocoder can be used to extract two audio features: the spectrum (SP) and the fundamental frequency (f0). The spectrum (SP) is an extracted, smoothed spectrogram representing the harmonic spectral envelope. The fundamental frequency (f0) can be defined as the lowest frequency of a periodic waveform. The fundamental frequency (f0) is the most basic prosodic element appearing in voice data, and, together with vocabulary and words, f0 as a carrier of prosody can be used as an element that expresses emotion.
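To make the definition of the fundamental frequency concrete, the following is a minimal sketch that estimates f0 of a single frame from the first strong peak of its autocorrelation. It is only an illustration of the concept: the vocoders named in this document use their own, more robust f0 estimators, and the function name `estimate_f0`, the search range of 60 to 500 Hz, and the synthetic 200 Hz test tone are all assumptions.

```python
import numpy as np

def estimate_f0(frame, fs, f_min=60.0, f_max=500.0):
    """Estimate f0 (Hz) of one voiced frame via autocorrelation peak picking."""
    frame = frame - frame.mean()
    # One-sided autocorrelation: index k holds the value at lag k.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f_max)   # shortest period considered
    lag_max = int(fs / f_min)   # longest period considered
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag

# Synthetic periodic frame: a 200 Hz sine sampled at 16 kHz for 0.1 s.
fs = 16000
t = np.arange(1600) / fs
frame = np.sin(2 * np.pi * 200.0 * t)
f0 = estimate_f0(frame, fs)
```

Running such an estimator frame by frame yields an f0 contour, which is the kind of signal the f0 branch of the apparatus operates on.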

As shown in FIG. 1, the spectrum (SP) and the fundamental frequency (f0) of the same sentence exhibit different characteristics depending on the emotion. The spectrum (SP) is plotted with frequency on the x-axis and amplitude on the y-axis. Emotional voice data and neutral voice data differ in both the shape of the amplitude envelope and the fundamental frequency (f0). These voice features are preprocessed and used as training inputs, so that a voice can be synthesized with a converted emotion.

FIG. 2 is a block diagram showing the configuration of a VAW-GAN training apparatus for a voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice according to an embodiment.

According to an embodiment, the training apparatus for the voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice includes an input vocoder 220, a spectrum encoder 235, a spectrum generator 240, a spectrum discriminator 245, a fundamental frequency encoder 260, a fundamental frequency generator 265, and a fundamental frequency discriminator 275.

The training apparatus according to an embodiment separates the source spectrum signal (SP) and the source fundamental frequency signal (f0) and trains emotion conversion for each of them. The input vocoder 220 is used to obtain the source spectrum signal (SP) and the source fundamental frequency signal (f0) from the source voice signal 210. The source voice signal 210 data used for training may include, in addition to neutral speech, voices expressing various emotions such as happiness, calm, sadness, fear, anger (Angry), surprise, and disgust (Disgust). The input vocoder 220 may be any vocoder that receives the source voice signal 210 and outputs the source spectrum signal (SP) and the source fundamental frequency signal (f0), such as the WORLD vocoder or the STRAIGHT vocoder. The WORLD vocoder is preferred because it supports high-quality speech synthesis and fast computation when extracting the source spectrum signal and the source fundamental frequency signal from the source voice signal.

The training apparatus according to an embodiment configures a separate neural network model for learning each of the source spectrum signal (SP) and the source fundamental frequency signal (f0). The two networks are arranged in parallel. In each network, the encoder is trained separately so that its output does not contain the emotional elements of the input signal, and the generator, which serves as the decoder, is conditioned on an emotion ID indicating the emotion type of the source voice signal 210 and on the source fundamental frequency (f0), and may be trained to generate an output signal corresponding to the emotion ID. Various artificial neural network models based on the generative adversarial network (GAN) can be used for training. Among these, it is preferable to use the VAW-GAN (Variational Autoencoding Wasserstein Generative Adversarial Network) model, which combines a variational autoencoder (VAE) with a GAN. The VAW-GAN model is a combined VAE and GAN model in which the generator of the GAN serves as the decoder of the VAE.

For the source spectrum signal (SP), the spectrum encoder 235 and the spectrum generator 240 act as an encoder and decoder pair. The spectrum encoder 235 computes a latent vector, which is an emotion-independent feature vector, from the source spectrum signal (SP). The spectrum generator 240 generates an emotion spectrum signal using the emotion-independent latent vector. The spectrum generator 240 may additionally receive the source fundamental frequency signal (f0) and the emotion ID. The spectrum generator 240 generates, from the source spectrum signal (SP), an emotion spectrum signal conditioned on the source fundamental frequency signal (f0) and the emotion ID, and passes it to the spectrum discriminator 245. The spectrum discriminator 245 determines whether the generated emotion spectrum signal is real or fake. The spectrum encoder 235 is trained to produce emotion-independent latent vectors, the spectrum generator 240 is trained to reduce the loss between the source spectrum signal (SP) and the generated emotion spectrum signal, and the spectrum discriminator 245 is trained adversarially to maximize the loss between the source spectrum signal (SP) and the generated emotion spectrum signal.
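The roles of the encoder, generator, and discriminator can be summarized with the loss terms that VAE and Wasserstein-GAN training typically combine. The sketch below is a generic illustration under that assumption, not the loss definition of this document, which does not spell out its formulas in this passage. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, exp(log_var)) to N(0, I), per sample.
    In a VAE this regularizes the encoder's latent vector."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def critic_loss(score_real, score_fake):
    """Wasserstein critic objective. Minimizing it maximizes the gap between
    the scores of real and generated signals."""
    return np.mean(score_fake) - np.mean(score_real)

def generator_adversarial_loss(score_fake):
    """The generator tries to raise the critic's score on its outputs."""
    return -np.mean(score_fake)

# Toy values: a latent code that already matches the prior, and critic scores.
kl = kl_to_standard_normal(np.zeros((2, 8)), np.zeros((2, 8)))
d_loss = critic_loss(np.array([1.0, 1.0]), np.array([0.0, 0.0]))
g_loss = generator_adversarial_loss(np.array([0.0, 0.0]))
```

In a full training loop a reconstruction term between the source signal and the generated signal would be added to the generator side, matching the description that the generator reduces that loss while the discriminator is trained against it.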

For the source fundamental frequency signal (f0), the fundamental frequency encoder 260 and the fundamental frequency generator 265 act as an encoder and decoder pair. Using an artificial neural network model, the fundamental frequency encoder 260 may be trained to compute an emotion-independent latent vector from the source fundamental frequency signal (f0), the fundamental frequency generator 265 may be trained to generate, using the emotion-independent feature vector, an emotion fundamental frequency signal close to the source fundamental frequency signal (f0), and the fundamental frequency discriminator 275 may be trained to distinguish the generated emotion fundamental frequency signal from the source fundamental frequency signal (f0). The fundamental frequency generator 265 may additionally receive an emotion ID and generate, from the source fundamental frequency signal (f0), an emotion fundamental frequency signal conditioned on the emotion ID.

A first continuous wavelet transformer 255, which applies the continuous wavelet transform (CWT) to the source fundamental frequency signal (f0), may be placed in front of the fundamental frequency encoder 260 so that the wavelet-transformed signal is input to the fundamental frequency encoder 260. The continuous wavelet transform (CWT) is used to learn prosody information from the voice features of the source data. Equation 1 expresses the continuous wavelet transform (CWT).

[Equation 1]

In Equation 1, a denotes the scale, b denotes the time, and ψ_ab denotes the mother wavelet function.
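The body of Equation 1 is not reproduced in this text. For orientation only, the standard definition of the continuous wavelet transform, written with the symbols defined above (scale a, time b, mother wavelet ψ), is given below. Treat it as the textbook form rather than as the exact expression of the original equation. The normalization factor a^{-1/2} and the symbol W(a, b) for the coefficients are assumptions.

```latex
W(a, b) = \int_{-\infty}^{\infty} f_0(t)\, \psi_{a,b}(t)\, dt,
\qquad
\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t - b}{a}\right)
```

Here W(a, b) is the wavelet coefficient of the f0 contour at scale a and time b. Large a captures slow, utterance-level prosodic movement and small a captures fast, local movement.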

The continuous wavelet transform (CWT) performs multi-scale modeling of the source fundamental frequency signal (f0) and can decompose it into various time scales. The CWT can therefore represent both the short-term prosody of a phrase of a few words and the long-term prosody of a whole utterance. Prosody information can thus be learned efficiently by applying the CWT to the source fundamental frequency signal (f0).
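The multi-scale decomposition can be sketched in a few lines of NumPy. The example below is an illustration under stated assumptions, not the transform configuration of this document: it uses a Mexican hat mother wavelet, five dyadic scales measured in frames, and a synthetic normalized log-f0 contour. The names `mexican_hat`, `cwt_f0`, `scales`, and `contour` are hypothetical.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet with unit L2 norm."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_f0(contour, scales):
    """Return an array of shape (len(scales), len(contour)) of CWT coefficients."""
    n = len(contour)
    out = np.empty((len(scales), n))
    for i, a in enumerate(scales):
        half = int(np.ceil(5 * a))              # wavelet is negligible beyond 5a
        t = np.arange(-half, half + 1) / a
        kernel = mexican_hat(t) / np.sqrt(a)    # scaled wavelet, a^(-1/2) factor
        full = np.convolve(contour, kernel, mode="full")
        out[i] = full[half:half + n]            # keep the part aligned with the input
    return out

# Synthetic contour: slow utterance-level movement plus faster local movement.
frames = np.arange(400)
contour = np.sin(2 * np.pi * frames / 400) + 0.3 * np.sin(2 * np.pi * frames / 25)
scales = [1, 2, 4, 8, 16]
coeffs = cwt_f0(contour, scales)
```

Each row of `coeffs` isolates prosodic movement at one time scale. The rows, rather than the raw contour, would then be the input to the f0 encoder. Real f0 contours have unvoiced gaps that are usually interpolated before the transform, which this sketch leaves out.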

FIG. 3 is a block diagram showing the configuration of a voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice according to an embodiment.

According to one aspect of the proposed invention, the voice frequency synthesis apparatus for emotion conversion using amplitude scaling of a voice includes an input vocoder 320, an emotion spectrum generation unit 330, an emotion fundamental frequency generation unit 350, an amplitude scaling unit 380, and an output vocoder 390.

The input vocoder 320 receives the source voice signal 310 and outputs the source spectrum signal (SP) and the source fundamental frequency signal (f0). It is preferable to use the same input vocoder as the one used for training. The input vocoder 320 may be any vocoder that receives the source voice signal 310 and outputs the source spectrum signal (SP) and the source fundamental frequency signal (f0), such as the WORLD vocoder or the STRAIGHT vocoder. The WORLD vocoder is preferred because it supports high-quality speech synthesis and fast computation when extracting the source spectrum signal and the source fundamental frequency signal from the source voice signal.

The emotion spectrum generation unit 330 generates, from the source spectrum signal (SP), an emotion spectrum signal (eSP) carrying emotion information. The emotion fundamental frequency generation unit 350 generates, from the source fundamental frequency signal (f0), an emotion fundamental frequency signal (ef0) containing prosody information. The amplitude scaling unit 380 adjusts the amplitude of the emotion spectrum signal (eSP) and outputs an amplitude-scaled emotion spectrum signal (AeSP). The emotion spectrum generation unit 330 and the emotion fundamental frequency generation unit 350 are each built from an artificial neural network model consisting of an encoder and decoder (generator) pair. That is, the voice frequency synthesis apparatus according to an embodiment includes two artificial neural networks arranged in parallel. In each network, the encoder is trained to remove the emotional elements from the input signal, and the generator, which serves as the decoder, is conditioned on an emotion ID indicating the target emotion type and on the fundamental frequency (f0), and may be trained to generate an output signal corresponding to the emotion ID. Various artificial neural network models based on the generative adversarial network (GAN) can be used. Among these, it is preferable to use the VAW-GAN model, in which the generator of a GAN serves as the decoder of a variational autoencoder (VAE).

The output vocoder 390 receives the amplitude-scaled emotion spectrum signal (AeSP) and the emotion fundamental frequency signal (ef0) and outputs an emotion-converted voice signal 395. The output vocoder 390 may be any vocoder that receives a spectrum signal and a fundamental frequency signal and outputs a voice signal, such as the WORLD vocoder or the STRAIGHT vocoder. It is preferable that the output vocoder be the same vocoder as the input vocoder, in particular the same WORLD vocoder.
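The data flow through the five components can be sketched as a single function. The example below shows only the wiring described above. Every component is passed in as a callable, and the string-based stand-ins used in the demonstration are hypothetical placeholders that make the order of operations visible, not real vocoders or trained networks.

```python
def convert_emotion(wave, emotion_id, input_vocoder, sp_generation_unit,
                    f0_generation_unit, amplitude_scaling_unit, output_vocoder):
    """Wire the components in the order described in the text."""
    sp, f0 = input_vocoder(wave)                   # SP and f0 from the source voice
    e_sp = sp_generation_unit(sp, f0, emotion_id)  # eSP, conditioned on f0 and emotion ID
    e_f0 = f0_generation_unit(f0, emotion_id)      # ef0, conditioned on emotion ID
    ae_sp = amplitude_scaling_unit(e_sp)           # AeSP
    return output_vocoder(ae_sp, e_f0)             # emotion-converted voice

# Hypothetical stand-ins that return labels instead of signals.
result = convert_emotion(
    "wave", "angry",
    input_vocoder=lambda w: ("SP", "f0"),
    sp_generation_unit=lambda sp, f0, e: f"eSP[{sp},{f0},{e}]",
    f0_generation_unit=lambda f0, e: f"ef0[{f0},{e}]",
    amplitude_scaling_unit=lambda e_sp: f"A({e_sp})",
    output_vocoder=lambda ae_sp, e_f0: (ae_sp, e_f0),
)
```

The two generation units do not depend on each other's output, which is what allows the spectrum branch and the f0 branch to run in parallel.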

According to an additional aspect, the emotion spectrum generation unit 330 includes a spectrum encoder 335 and a spectrum generator 340. The spectrum encoder 335 is trained with an artificial neural network to compute a latent vector, which is an emotion-independent feature vector, from the source spectrum signal (SP). The spectrum generator 340 is trained with an artificial neural network to generate the emotion spectrum signal (eSP) using the emotion-independent latent vector. Various artificial neural network models based on the generative adversarial network (GAN) can be used for training. Among these, it is preferable to use the VAW-GAN model, in which the generator of a GAN serves as the decoder of a variational autoencoder (VAE). The spectrum encoder 335 and the spectrum generator 340 act as an encoder and decoder pair.

According to an additional aspect, the spectrum generator 340 includes a fundamental frequency input unit that receives the source fundamental frequency signal f0, and generates the emotion spectrum signal eSP using the emotion-independent latent vector and the source fundamental frequency signal f0. The spectrum generator 340 may further include an emotion selection unit that receives the ID of the target emotion. The spectrum generator 340 may then generate, from the source spectrum signal SP, an emotion spectrum signal eSP conditioned on the source fundamental frequency signal f0 and the target emotion ID.
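
As a non-limiting illustration, the conditioning described above can be pictured as concatenating the latent vector, the f0 value of a frame, and a one-hot code of the target emotion ID into a single generator input. The specification does not state how the conditioning is encoded, so the one-hot scheme, the emotion label list `EMOTIONS`, and the function names below are all assumptions made for this sketch.

```python
from typing import List, Sequence

# Hypothetical emotion inventory (assumption); labels loosely follow the
# emotion classes named in the experimental example.
EMOTIONS = ["neutral", "happy", "calm", "sad", "fearful", "angry", "surprised", "disgust"]


def one_hot(emotion: str, emotions: Sequence[str] = EMOTIONS) -> List[float]:
    """Encode the target emotion ID as a one-hot vector."""
    if emotion not in emotions:
        raise ValueError(f"unknown emotion ID: {emotion!r}")
    return [1.0 if e == emotion else 0.0 for e in emotions]


def generator_input(latent: Sequence[float], f0: float, emotion: str) -> List[float]:
    """Concatenate latent vector, per-frame f0, and one-hot emotion ID."""
    return list(latent) + [float(f0)] + one_hot(emotion)
```

For example, `generator_input([0.1, -0.2], 220.0, "angry")` yields a vector of length 2 + 1 + 8, in which only the entry for "angry" is set in the emotion part.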

According to an additional aspect, the emotion fundamental frequency generator 350 includes a fundamental frequency encoder 360 and a fundamental frequency generator 365. The fundamental frequency encoder 360 is trained as an artificial neural network to compute an emotion-independent feature vector from the source fundamental frequency signal f0. The fundamental frequency generator 365 is trained as an artificial neural network to generate the emotion fundamental frequency signal ef0 from the emotion-independent feature vector. Various artificial neural network models based on generative adversarial networks (GANs) may be used for training. Among these, a VAW-GAN model, which uses the generator of a GAN as the decoder of a variational autoencoder (VAE), is preferred. The fundamental frequency encoder 360 and the fundamental frequency generator 365 act as an encoder-decoder pair. The fundamental frequency generator 365 may further include an emotion selection unit that receives the ID of the target emotion, and may generate, from the source fundamental frequency signal f0, an emotion fundamental frequency signal ef0 conditioned on the emotion ID.

According to an additional aspect, the emotion fundamental frequency generator 350 further includes a first continuous wavelet transformer 355 that applies a continuous wavelet transform (CWT) to the source fundamental frequency signal f0. The first continuous wavelet transformer 355 is placed before the fundamental frequency encoder 360 and feeds the wavelet-transformed signal to the fundamental frequency encoder 360. The emotion fundamental frequency generator 350 may further include a second continuous wavelet transformer 370 after the generator. The second continuous wavelet transformer 370 may apply the inverse of the continuous wavelet transform performed by the first continuous wavelet transformer 355 to produce the emotion fundamental frequency signal ef0.
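
As a non-limiting illustration, the forward transform performed by the first continuous wavelet transformer can be sketched as a direct convolution of the f0 contour with scaled copies of a Mexican hat mother wavelet (the wavelet named in the experimental example). The function names, the unnormalized wavelet, the 1/sqrt(s) scaling, and the choice of scales are assumptions of this sketch; the inverse transform of the second transformer is not shown.

```python
import math
from typing import List, Sequence


def mexican_hat(t: float) -> float:
    """Mexican hat (Ricker) mother wavelet, without a normalization constant."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt_f0(signal: Sequence[float], scales: Sequence[float]) -> List[List[float]]:
    """Direct CWT: W(s, b) = (1/sqrt(s)) * sum_k x[k] * psi((k - b) / s).

    Returns one row of coefficients per scale, each as long as the input.
    The cost is O(N^2) per scale, which is acceptable for short f0 contours.
    """
    n = len(signal)
    rows = []
    for s in scales:
        norm = 1.0 / math.sqrt(s)
        row = []
        for b in range(n):
            acc = 0.0
            for k in range(n):
                acc += signal[k] * mexican_hat((k - b) / s)
            row.append(norm * acc)
        rows.append(row)
    return rows
```

Because the Mexican hat wavelet has zero mean, a flat f0 contour produces coefficients close to zero away from the edges, while local rises and falls in pitch appear at the scales that match their duration.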

According to an additional aspect, the amplitude scale unit 380 includes an amplitude scale input unit and an amplitude control unit. The amplitude scale input unit receives the emotion spectrum signal eSP and a random spectrum signal eSPz, which is generated by feeding an arbitrary random latent vector z to the spectrum generator. The amplitude control unit adjusts the amplitude of the emotion spectrum signal eSP using the mean and standard deviation of the random spectrum signal eSPz. Because the emotion spectrum signal eSP retains amplitude information from the source voice signal, its amplitude differs from that of speech in the target emotion. To correct this difference, a random spectrum signal eSPz may be obtained from the same spectrum generator 340 and used for standardization (standard scaling). Since the random latent vector z is independent of the input signal, the random spectrum signal eSPz represents the amplitude of the emotion selected by the emotion ID, regardless of the emotion of the source signal.

According to an additional aspect, the amplitude control unit computes the mean (μ) and standard deviation (σ) of the amplitude data of the random spectrum signal eSPz, and standardizes the amplitude data (X) of the emotion spectrum signal eSP according to Equation 2 below to obtain the amplitude data (X') of the amplitude-scaled emotion spectrum signal AeSP.

[Equation 2]

$$X' = \frac{X - \mu}{\sigma}$$

In Equation 2, X is the data of the emotion spectrum signal eSP, μ is the mean of the data of the random spectrum signal eSPz, and σ is the standard deviation of the data of the random spectrum signal eSPz.
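
As a non-limiting illustration, and assuming that Equation 2 is the usual standardization X' = (X - μ) / σ that the description of standard scaling suggests, the amplitude control can be sketched as follows. The function name, the flat list representation of the spectrum data, and the use of the population standard deviation are assumptions of this sketch.

```python
import statistics
from typing import List, Sequence


def amplitude_scale(esp: Sequence[float], espz: Sequence[float]) -> List[float]:
    """Standardize eSP amplitude data with the mean and std of eSPz.

    esp  : amplitude data X of the emotion spectrum signal (flattened)
    espz : amplitude data of the random spectrum signal (flattened)
    """
    mu = statistics.fmean(espz)
    sigma = statistics.pstdev(espz)  # population std; an assumption
    if sigma == 0.0:
        raise ValueError("random spectrum signal has zero variance")
    return [(x - mu) / sigma for x in esp]
```

Note that μ and σ come from eSPz rather than from eSP itself, so the result is expressed relative to the amplitude statistics of the target emotion.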

The amplitude scale unit 380 may further include a log scale conversion unit that converts the amplitude data of the random spectrum signal eSPz and of the emotion spectrum signal eSP to a logarithmic scale before they are passed to the amplitude control unit. Because human hearing responds in proportion to the logarithm of amplitude, it is preferable to convert the amplitude data to a log scale before passing it to the amplitude control unit.
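
As a non-limiting illustration, the log scale conversion can be sketched as an element-wise logarithm with a small floor that protects against zero amplitudes. The natural logarithm, the floor value, and the function name are assumptions of this sketch; the specification does not state the base or how zeros are handled.

```python
import math
from typing import List, Sequence


def to_log_amplitude(values: Sequence[float], floor: float = 1e-10) -> List[float]:
    """Convert linear amplitudes to natural-log amplitudes.

    Values below `floor` are clamped so that log(0) is never evaluated.
    """
    return [math.log(max(v, floor)) for v in values]
```

Under this sketch, both the eSPz data and the eSP data would pass through `to_log_amplitude` before the mean and standard deviation are computed.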

Dynamic frequency warping (DFW) can be used to adjust amplitude features in voice conversion. DFW warps a source spectrum signal so that it approximates a target spectrum signal. However, because DFW does not adequately reflect speaker characteristics, it requires a separate conversion process based on a Gaussian mixture model (GMM).

According to an embodiment of the present invention, the VAW-GAN-based amplitude conversion method performed in the voice frequency synthesis device interpolates between two spectra produced by a generator that has learned the latent distribution for each emotion ID, so no separate GMM-based conversion process is required. As a result, emotion can be converted to the target emotion ID without the amplitude conversion distorting prosodic or speaker characteristics.

The following describes an experimental example in which the emotion of a voice was converted using the voice frequency synthesis device for emotion conversion according to an embodiment of the present invention.

In an embodiment of the present invention, the emotion conversion device for voice data was implemented in a hardware environment consisting of an Intel Skylake Xeon, two NVIDIA Tesla V100 units (20RFLOPS), and 128 GB of RAM. On the software side, the model was designed and implemented on the Ubuntu operating system using the TensorFlow backend engine.

The training data for the voice frequency synthesis device for emotion conversion was based on RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), which consists of 7,356 voice data files (24.8 GB). RAVDESS contains voice data from 24 speakers (12 female, 12 male) speaking with a neutral North American accent. The emotion classes expressed in the speech are happiness, calm, sadness, fear, anger, surprise, and disgust.

WORLD vocoders were used as both the input vocoder and the output vocoder. The input WORLD vocoder extracted speech features from the audio-only data (16-bit, 48 kHz, .wav) and converted them into spectrograms usable as input data for the model. The Mexican hat function was used as the mother wavelet for the continuous wavelet transform.

Mel-cepstral distortion (MCD) and log-spectral distortion (LSD) were used as performance metrics for the emotion conversion model. MCD measures how much the converted voice differs from the original voice in terms of mel-cepstra; the lower the MCD, the closer the quality of the converted voice is to that of the original voice. LSD is the distance between the original speech spectrum and the converted speech spectrum; the lower the LSD, the closer the quality of the converted voice is to that of the original voice.
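
As a non-limiting illustration, the two metrics can be sketched with definitions that are common in the voice conversion literature. The specification does not give the formulas it used, so the per-frame definitions, the averaging over frames, the `eps` guard, and the function names below are assumptions; the inputs are assumed to be time-aligned, frame by frame.

```python
import math
from typing import Sequence


def mcd(ref: Sequence[Sequence[float]], conv: Sequence[Sequence[float]]) -> float:
    """Mean mel-cepstral distortion in dB over aligned frames.

    Each frame holds mel-cepstral coefficients; the caller is assumed to
    have dropped the 0th (energy) coefficient.
    """
    const = 10.0 / math.log(10.0)
    total = 0.0
    for r, c in zip(ref, conv):
        total += const * math.sqrt(2.0 * sum((a - b) ** 2 for a, b in zip(r, c)))
    return total / len(ref)


def lsd(ref_power: Sequence[Sequence[float]],
        conv_power: Sequence[Sequence[float]],
        eps: float = 1e-10) -> float:
    """Mean log-spectral distortion in dB over aligned frames of power spectra."""
    total = 0.0
    for r, c in zip(ref_power, conv_power):
        sq = [(10.0 * math.log10((a + eps) / (b + eps))) ** 2 for a, b in zip(r, c)]
        total += math.sqrt(sum(sq) / len(sq))
    return total / len(ref_power)
```

Both functions return 0 dB for identical inputs, which matches the reading that lower values indicate a converted voice closer to the reference.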

Table 1 compares the average MCD and LSD of existing voice conversion models with those of the model disclosed in an embodiment of the present invention.

| Case | Model | MCD | LSD |
|---|---|---|---|
| Comparative Example 1 | LG-Cycle-GAN | 5.251 dB | 6.897 dB |
| Comparative Example 2 | LG-VAW-GAN | 4.656 dB | 6.488 dB |
| Comparative Example 3 | CWT-VAW-GAN | 4.458 dB | 6.245 dB |
| Example 1 | CWT-AS-VAW-GAN | 4.451 dB | 6.239 dB |

In Table 1, Comparative Example 1 used Cycle-GAN as the artificial neural network model, while Comparative Example 2, Comparative Example 3, and Example 1 all used VAW-GAN. Comparative Examples 1 and 2 converted the fundamental frequency (f0) with a log-Gaussian-based linear transformation, whereas Comparative Example 3 and Example 1 converted it using the continuous wavelet transform (CWT). Example 1 additionally applied amplitude scaling to the generated emotion spectrum signal.

FIG. 4 shows spectrograms of the results of converting a neutral emotion into anger and disgust using the voice frequency synthesis device for emotion conversion using amplitude scaling of voice according to an embodiment.

FIG. 4(a) is the spectrogram of the source voice data, whose emotion is neutral. The left side of FIG. 4(b) shows spectrograms of voice data for the anger (Angry) and disgust (Disgust) emotions. As the spectrograms show, the spectrum, which carries prosody and loudness information, differs by emotion.

The right side of FIG. 4(b) shows the spectra obtained by converting the neutral source voice data into the anger and disgust emotions using the voice frequency synthesis device of the embodiment. The converted spectra are closer to the target emotion spectra (Angry, Disgust) than the spectrum of the source voice data is.

The following describes a voice frequency synthesis method for emotion conversion of voice, performed in the voice frequency synthesis device for emotion conversion according to an embodiment of the present invention.

FIG. 5 is a flowchart illustrating a VAW-GAN training method for a voice frequency synthesis device for emotion conversion using amplitude scaling of voice according to an embodiment.

According to an embodiment, in the training method for the voice frequency synthesis device for emotion conversion using amplitude scaling of voice, the input vocoder extracts the source spectrum signal SP and the source fundamental frequency signal f0 from the source voice signal 210 (S510), and training is carried out separately on the source spectrum signal SP and the source fundamental frequency signal f0. It is then checked whether all of the training data has been processed (S590); if data remains, the method returns to the input step and repeats, and if no data remains, training ends. In addition to neutral speech, the source voice signal data used for training may include speech expressing various emotions such as happiness, calm, sadness, fear, anger, surprise, and disgust. The source spectrum signal SP and the source fundamental frequency signal f0 are fed to separate neural networks arranged in parallel and used for training. Each encoder is trained separately so that its output contains no emotion component of the input signal, and each generator, which acts as a decoder, may be trained to produce an output signal corresponding to the emotion ID by being conditioned on the emotion ID, which indicates the emotion type, and on the source fundamental frequency f0. Various artificial neural network models based on generative adversarial networks (GANs) may be used for training. Among these, a VAW-GAN model, which uses the generator of a GAN as the decoder of a variational autoencoder (VAE), is preferred.

The method of training on the source spectrum signal SP includes a spectrum encoding step (S535) in which the spectrum encoder computes a latent vector, which is an emotion-independent feature vector, from the source spectrum signal SP; a step (S540) in which the spectrum generator generates an emotion spectrum signal from the emotion-independent latent vector; and a step (S545) in which the spectrum discriminator determines whether the generated emotion spectrum signal is real or fake. In step S540, the spectrum generator may additionally receive the source fundamental frequency signal f0 and the emotion ID, generate from the source spectrum signal SP an emotion spectrum signal conditioned on the source fundamental frequency signal f0 and the emotion ID, and pass it to the spectrum discriminator. Through training, the spectrum encoder learns to produce an emotion-independent latent vector, the spectrum generator learns to reduce the loss between the source spectrum signal SP and the generated emotion spectrum signal, and the spectrum discriminator learns adversarially to maximize that loss.

The method of training on the source fundamental frequency signal f0 includes a fundamental frequency encoding step (S560) in which the fundamental frequency encoder computes an emotion-independent latent vector from the source fundamental frequency signal f0; a step (S565) in which the fundamental frequency generator generates an emotion fundamental frequency signal from the emotion-independent latent vector; and a step (S575) in which the fundamental frequency discriminator determines whether the generated emotion fundamental frequency signal is real or fake. In step S565, the fundamental frequency generator may additionally receive the emotion ID, generate from the source fundamental frequency signal f0 an emotion fundamental frequency signal conditioned on the emotion ID, and pass it to the fundamental frequency discriminator. Through training, the fundamental frequency encoder learns to produce an emotion-independent latent vector, the fundamental frequency generator learns to reduce the loss between the source fundamental frequency signal f0 and the generated emotion fundamental frequency signal, and the fundamental frequency discriminator learns adversarially to maximize that loss.

The method of training on the source fundamental frequency signal f0 may further include, before the fundamental frequency encoding step (S560), a first continuous wavelet transform step (S555) of applying a continuous wavelet transform (CWT) to the source fundamental frequency signal f0. Applying the CWT to the source fundamental frequency signal f0 allows the prosodic information of speech to be learned efficiently.

FIG. 6 is a flowchart illustrating a voice frequency synthesis method for emotion conversion using amplitude scaling of voice according to an embodiment.

According to another aspect of the proposed invention, the voice frequency synthesis method for emotion conversion using amplitude scaling of voice includes: a voice signal separation step (S620) in which an input vocoder receives a source voice signal and outputs a source spectrum signal and a source fundamental frequency signal; an emotion spectrum generation step (S630) of generating, from the source spectrum signal, an emotion spectrum signal carrying emotion information; an emotion fundamental frequency generation step (S650) of generating, from the source fundamental frequency signal, an emotion fundamental frequency signal containing prosody information; an amplitude scaling step (S680) of adjusting the amplitude of the emotion spectrum signal and outputting an amplitude-scaled emotion spectrum signal; and a voice signal synthesis step (S690) in which an output vocoder receives the amplitude-scaled emotion spectrum signal and the emotion fundamental frequency signal and outputs an emotion-converted voice signal. The input vocoder and the output vocoder are preferably the same WORLD vocoder as the input vocoder used for training.
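
As a non-limiting illustration, the order of steps S620 to S690 can be sketched as a composition of five stages. The class name, the callable signatures, and the flat list representation of signals are hypothetical; in particular, the trained networks and the vocoders are represented only as placeholder callables supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Signal = List[float]


@dataclass
class EmotionConverter:
    """Hypothetical wiring of the method steps; every stage is injected."""
    analyze: Callable[[Signal], Tuple[Signal, Signal]]   # S620: voice -> (SP, f0)
    make_esp: Callable[[Signal, Signal, str], Signal]    # S630: (SP, f0, emotion ID) -> eSP
    make_ef0: Callable[[Signal, str], Signal]            # S650: (f0, emotion ID) -> ef0
    scale: Callable[[Signal, str], Signal]               # S680: (eSP, emotion ID) -> AeSP
    synthesize: Callable[[Signal, Signal], Signal]       # S690: (AeSP, ef0) -> voice

    def convert(self, voice: Signal, emotion: str) -> Signal:
        sp, f0 = self.analyze(voice)
        esp = self.make_esp(sp, f0, emotion)
        ef0 = self.make_ef0(f0, emotion)
        aesp = self.scale(esp, emotion)
        return self.synthesize(aesp, ef0)
```

The spectrum branch and the fundamental frequency branch depend only on the output of the separation step, so the two branches could run in parallel before they meet again at the synthesis step.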

The emotion spectrum generation step (S630) and the emotion fundamental frequency generation step (S650) generate the emotion spectrum signal eSP and the emotion fundamental frequency signal ef0, respectively, using encoder and generator (decoder) pairs that were each trained as one of two artificial neural network models arranged in parallel. Various artificial neural network models based on generative adversarial networks (GANs) may be used. Among these, a VAW-GAN model, which uses the generator of a GAN as the decoder of a variational autoencoder (VAE), is preferred.

According to an additional aspect, the emotion spectrum generation step (S630) includes a spectrum encoding step (S635) in which a spectrum encoder trained as an artificial neural network computes an emotion-independent latent vector from the source spectrum signal, and a spectrum generation step (S640) in which a spectrum generator trained as an artificial neural network generates the emotion spectrum signal from the emotion-independent latent vector. Various artificial neural network models based on generative adversarial networks (GANs) may be used for training. Among these, a VAW-GAN model, which uses the generator of a GAN as the decoder of a variational autoencoder (VAE), is preferred.

According to an additional aspect, the spectrum generation step (S640) includes a fundamental frequency input step in which the spectrum generator receives the source fundamental frequency signal, and a step of generating the emotion spectrum signal using the emotion-independent latent vector and the source fundamental frequency signal. The spectrum generation step (S640) may further include an emotion ID input step of receiving an emotion ID. In that case, the spectrum generation step (S640) uses the encoded source spectrum signal (the latent vector) to generate an emotion spectrum signal conditioned on the additionally received source fundamental frequency signal and emotion ID.

According to an additional aspect, the emotion fundamental frequency generation step (S650) includes a fundamental frequency encoding step (S660) in which a fundamental frequency encoder trained as an artificial neural network computes an emotion-independent feature vector from the source fundamental frequency signal, and a fundamental frequency generation step (S665) in which a fundamental frequency generator trained as an artificial neural network generates the emotion fundamental frequency signal from the emotion-independent feature vector. Various artificial neural network models based on generative adversarial networks (GANs) may be used for training. Among these, a VAW-GAN model, which uses the generator of a GAN as the decoder of a variational autoencoder (VAE), is preferred.

According to an additional aspect, the emotion fundamental frequency generation step (S650) further includes a first continuous wavelet transform step (S655) of applying a continuous wavelet transform (CWT) to the source fundamental frequency signal. The first continuous wavelet transform step (S655) is performed before the fundamental frequency encoding step (S660) and passes the wavelet-transformed fundamental frequency signal to the fundamental frequency encoding step (S660). The emotion fundamental frequency generation step (S650) may further include a second continuous wavelet transform step (S670) after the fundamental frequency generation step (S665). The second continuous wavelet transform step (S670) may apply the inverse of the continuous wavelet transform performed in the first continuous wavelet transform step (S655) to produce the emotion fundamental frequency signal ef0.

According to an additional aspect, the amplitude scaling step (S680) includes an amplitude scale input step of receiving the emotion spectrum signal eSP and a random spectrum signal eSPz generated by feeding an arbitrary latent vector to the spectrum generator, and an amplitude control step of adjusting the amplitude of the emotion spectrum signal eSP using the mean and standard deviation of the random spectrum signal eSPz.

According to an additional aspect, the amplitude control step computes the mean (μ) and standard deviation (σ) of the amplitude data of the random spectrum signal eSPz, and standardizes the amplitude data (X) of the emotion spectrum signal eSP according to Equation 2 above to obtain the amplitude data (X') of the amplitude-scaled emotion spectrum signal AeSP.

The amplitude scaling step (S680) may further include a log scale conversion step of converting the amplitude data of the random spectrum signal eSPz and of the emotion spectrum signal eSP to a logarithmic scale before they are passed to the amplitude control step. Because human hearing responds in proportion to the logarithm of amplitude, it is preferable to convert the amplitude data to a log scale before passing it to the amplitude control step.

An embodiment of the present invention discloses a device and method for synthesizing voice frequencies using VAW-GAN-based amplitude scaling for emotion conversion. The spectrum (SP) and the fundamental frequency (f0) are learned separately so that emotion is converted more effectively. In particular, applying the continuous wavelet transform (CWT) to the fundamental frequency (f0) enables speaker-independent emotion conversion that reflects prosodic characteristics. In addition, amplitude scaling is applied to the converted emotion spectrum to adjust its amplitude difference from the original emotion spectrum and thereby produce a similar emotion. Such a voice frequency synthesis device enables voice conversion with emotion and can be applied in various fields, such as artificial intelligence speakers.

Although the present invention has been described above through embodiments with reference to the accompanying drawings, it is not limited thereto and should be construed to cover the various modifications that a person skilled in the art could obviously derive from them. The claims are intended to cover such modifications.

210: source voice signal 220: input vocoder
235: spectrum encoder 240: spectrum generator
245: spectrum discriminator 255: first continuous wavelet transformer
260: fundamental frequency encoder 265: fundamental frequency generator
275: fundamental frequency discriminator
310: source voice signal 320: input vocoder
330: emotion spectrum generator
335: spectrum encoder 340: spectrum generator
350: emotion fundamental frequency generator 355: first continuous wavelet transformer
360: fundamental frequency encoder 365: fundamental frequency generator
370: second continuous wavelet transformer 380: amplitude scaler
390: output vocoder 395: emotion-converted voice signal

Claims

A voice frequency synthesis apparatus for emotion conversion using amplitude scaling of voice, comprising:
an input vocoder that receives a source voice signal and outputs a source spectrum signal and a source fundamental frequency signal;
an emotion spectrum generator that generates an emotion spectrum signal having emotion information from the source spectrum signal;
an emotion fundamental frequency generator that generates an emotion fundamental frequency signal including prosody information from the source fundamental frequency signal;
an amplitude scaler that outputs an amplitude-scaled emotion spectrum signal by adjusting the amplitude of the emotion spectrum signal; and
an output vocoder that receives the amplitude-scaled emotion spectrum signal and the emotion fundamental frequency signal and outputs an emotion-converted voice signal,
wherein the emotion spectrum generator comprises:
a spectrum encoder trained with a generative adversarial network (GAN) to calculate an emotion-independent latent vector from the source spectrum signal; and
a spectrum generator trained with a generative adversarial network (GAN) to generate the emotion spectrum signal using the emotion-independent latent vector,
and wherein the amplitude scaler comprises:
an amplitude scale input unit that receives the emotion spectrum signal and a random spectrum signal generated by inputting an arbitrary latent vector to the spectrum generator; and
an amplitude adjusting unit that adjusts the amplitude of the emotion spectrum signal using the mean and standard deviation of the random spectrum signal.

(deleted)

The apparatus according to claim 1, wherein the spectrum generator
comprises a fundamental frequency input unit that receives the source fundamental frequency signal, and
generates the emotion spectrum signal using the emotion-independent latent vector and the source fundamental frequency signal.

The apparatus according to claim 1, wherein the emotion fundamental frequency generator comprises:
a fundamental frequency encoder trained with a generative adversarial network (GAN) to calculate an emotion-independent feature vector from the source fundamental frequency signal; and
a fundamental frequency generator trained with a generative adversarial network (GAN) to generate the emotion fundamental frequency signal using the emotion-independent feature vector.

The apparatus according to claim 4, wherein the emotion fundamental frequency generator further comprises:
a continuous wavelet transformer that applies the continuous wavelet transform (CWT) to the source fundamental frequency signal.

(deleted)

A voice frequency synthesis method for emotion conversion using amplitude scaling of voice, comprising:
a voice signal separation step in which an input vocoder receives a source voice signal and outputs a source spectrum signal and a source fundamental frequency signal;
an emotion spectrum generation step of generating an emotion spectrum signal having emotion information from the source spectrum signal;
an emotion fundamental frequency generation step of generating an emotion fundamental frequency signal including prosody information from the source fundamental frequency signal;
an amplitude scaling step of outputting an amplitude-scaled emotion spectrum signal by adjusting the amplitude of the emotion spectrum signal; and
a voice signal synthesis step in which an output vocoder receives the amplitude-scaled emotion spectrum signal and the emotion fundamental frequency signal and outputs an emotion-converted voice signal,
wherein the emotion spectrum generation step comprises:
a spectrum encoding step in which a spectrum encoder trained with a generative adversarial network (GAN) calculates an emotion-independent latent vector from the source spectrum signal; and
a spectrum generation step in which a spectrum generator trained with a generative adversarial network (GAN) generates the emotion spectrum signal using the emotion-independent latent vector,
and wherein the amplitude scaling step comprises:
an amplitude scale input step of receiving the emotion spectrum signal and a random spectrum signal generated by inputting an arbitrary latent vector to the spectrum generator; and
an amplitude adjusting step of adjusting the amplitude of the emotion spectrum signal using the mean and standard deviation of the random spectrum signal.

(deleted)

The method according to claim 7, wherein the spectrum generation step comprises:
a fundamental frequency input step in which the spectrum generator receives the source fundamental frequency signal; and
a step of generating the emotion spectrum signal using the emotion-independent latent vector and the source fundamental frequency signal.

The method according to claim 7, wherein the emotion fundamental frequency generation step comprises:
a fundamental frequency encoding step in which a fundamental frequency encoder trained with a generative adversarial network (GAN) calculates an emotion-independent feature vector from the source fundamental frequency signal; and
a fundamental frequency generation step in which a fundamental frequency generator trained with a generative adversarial network (GAN) generates the emotion fundamental frequency signal using the emotion-independent feature vector.

The method according to claim 10, wherein the emotion fundamental frequency generation step further comprises:
a continuous wavelet transform step of applying the continuous wavelet transform (CWT) to the source fundamental frequency signal.

(deleted)