KR102277205B1

KR102277205B1 - Apparatus for converting audio and method thereof

Info

Publication number: KR102277205B1
Application number: KR1020200033038A
Authority: KR
Inventors: 모하메드 이메드 모하메드 카멜 엘기르; 박중배
Original assignee: 휴멜로 주식회사
Priority date: 2020-03-18
Filing date: 2020-03-18
Publication date: 2021-07-15

Abstract

A method for generating an output text from an input text using a text generation model is provided. A text generation method according to an embodiment of the present invention includes the steps of: obtaining a latent variable by inputting data indicating at least one word included in an input text into the text generation model; predicting a target cluster to which a first target word to be included in the output text belongs by using the latent variable; and predicting the first target word using the target cluster and the latent variable. An object of the present invention is to provide an audio conversion apparatus using a neural network-based audio conversion model and a method performed in the apparatus.

Description

Audio conversion apparatus and method {APPARATUS FOR CONVERTING AUDIO AND METHOD THEREOF}

The present invention relates to an audio conversion apparatus and method. More specifically, it relates to an apparatus and method that, in converting speech with a neural network-based audio conversion model, convert the speaker's emotion expressed in the speech while preserving certain features such as the speaker's voice, and to an apparatus and method for building that audio conversion model.

Machine learning techniques based on artificial neural networks are used in various technical fields, such as speech synthesis and voice conversion. Speech synthesis is, for example, a technique for synthesizing sound that resembles human speech from input text. Voice conversion is, for example, a technique for artificially changing part of the style of an utterance. Voice conversion can be used in various fields such as audiobooks, entertainment, foreign language education, and the dubbing of foreign films.

Most conventional voice conversion techniques change the speaker's voice while conveying the speaker's message contained in the speech audio, that is, the linguistic content, unchanged. A representative example is converting audio containing the voice of a particular male speaker into the voice of a particular female speaker.

Meanwhile, attempts are also being made to convert other features of speech, such as tempo, intonation, prosody, and the emotion expressed in the speech, while keeping the speaker's message and the speaker's voice intact. However, currently available voice conversion techniques have several limitations.

For example, currently available voice conversion techniques have not fully solved the problem that, while one feature of the target speech such as tempo, intonation, prosody, or emotion is being converted, other features such as the speaker's voice unintentionally change along with it. In other words, when speech expressed in a happy tone is converted into speech with a sad tone, the speaker's voice changes into one that sounds unlike the original, or the audio quality is degraded.

As another example, currently available voice conversion techniques do not support mutual conversion among three or more domains. In other words, a single voice conversion model can only perform bidirectional conversion, that is, converting speech with a first emotion (e.g., speech in a happy tone) into speech with a second emotion (e.g., speech in a sad tone), or converting speech with the second emotion into speech with the first emotion. No technique is available for converting among three or more different emotions using a single voice conversion model.

To convert speech between various domains while preserving the identity of the speaker's voice, one possible approach is to build a separate voice conversion model for each speaker, and/or a separate voice conversion model for each pair consisting of a source domain and a target domain. However, given the cost and time needed to secure training data and to build each conversion model, this approach is not a practical solution.

Accordingly, there is a need for a voice conversion method that, through a single speaker-independent voice conversion model, can convert only some features of the speech to be converted while leaving its remaining features unchanged.

Korean Registered Patent No. 10-2057926 (published 2019.12.20.)

A technical problem to be solved through some embodiments of the present invention is to provide an audio conversion apparatus using a neural network-based audio conversion model, and a method performed in the apparatus.

Another technical problem to be solved through some embodiments of the present invention is to provide an audio conversion apparatus that selectively converts only the target feature to be converted while preserving some of the inherent features of the speech to be converted, and a method performed in the apparatus.

Yet another technical problem to be solved through some embodiments of the present invention is to provide an apparatus that converts the speaker's emotion expressed in the speech to be converted from a first emotion to a second emotion while preserving the speaker's voice characteristics, and a method performed in the apparatus.

Yet another technical problem to be solved through some embodiments of the present invention is to provide a voice conversion apparatus that converts among three or more different emotions using a single voice conversion model, and a method performed in the apparatus.

To solve the above technical problems, a method of converting a feature of an input audio sequence using a neural network-based audio conversion model, according to an embodiment of the present invention, includes: inputting input audio data representing the input audio sequence into the audio conversion model; obtaining a first latent variable representing a sequence-level feature of the input audio sequence; obtaining a second latent variable representing features of each of a plurality of segments constituting the input audio sequence; adjusting the first latent variable; and obtaining output audio data representing an output audio sequence using the adjusted first latent variable and the second latent variable.
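The flow of the method above can be illustrated with a toy sketch. It is not the neural model of the invention: the encoder and decoder here are hypothetical stand-ins (a per-dimension mean as the sequence-level latent, per-segment residuals as the segment-level latent), chosen only so that the five steps are visible and the code runs without dependencies. All function names are assumptions.

```python
from statistics import fmean

def encode_sequence_latent(frames):
    """Toy 'first latent variable': one vector for the whole sequence (per-dimension mean)."""
    dims = len(frames[0])
    return [fmean(frame[d] for frame in frames) for d in range(dims)]

def encode_segment_latents(frames, z1):
    """Toy 'second latent variable': one vector per segment, computed from the input and z1."""
    return [[x - m for x, m in zip(frame, z1)] for frame in frames]

def adjust(z1, shift):
    """Adjust only the sequence-level latent; the segment-level latents are left untouched."""
    return [a + s for a, s in zip(z1, shift)]

def decode(z1_adjusted, z2):
    """Toy decoder: recombine the adjusted sequence-level latent with each segment latent."""
    return [[r + m for r, m in zip(segment, z1_adjusted)] for segment in z2]

def convert(frames, shift):
    z1 = encode_sequence_latent(frames)          # sequence-level feature
    z2 = encode_segment_latents(frames, z1)      # per-segment features
    return decode(adjust(z1, shift), z2)         # output audio data

frames = [[1.0, 2.0], [3.0, 4.0]]
print(convert(frames, [1.0, 0.0]))
```

With a zero shift the toy model reproduces its input, which mirrors the idea that only the adjusted latent changes the output.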

In an embodiment, the input audio sequence includes a speaker's voice; the first latent variable includes a third latent variable that at least partially represents a first feature of the input audio sequence and a fourth latent variable that at least partially represents a second feature of the input audio sequence; the second latent variable at least partially represents linguistic content included in the input audio sequence; and adjusting the first latent variable may include adjusting the third latent variable while leaving the fourth latent variable unchanged.

In an embodiment, the first feature may refer to the speaker's emotion expressed in the input audio sequence, and the second feature may refer to the speaker's voice characteristics.

In an embodiment, the third latent variable is a value expressed as a vector, and adjusting the third latent variable may include applying, to the third latent variable, an emotion conversion function that converts an embedding vector corresponding to a first emotion into an embedding vector corresponding to a second emotion.
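This passage does not specify the form of the emotion conversion function. As one possible illustration, assumed here and not taken from the patent, the function could be a translation by the difference between the two emotion embedding vectors, which by construction maps the first embedding onto the second. The name `make_emotion_transform` is hypothetical.

```python
def make_emotion_transform(src_embedding, tgt_embedding):
    """Return a function that shifts a latent vector by (target - source).

    Assumption: a simple translation. It maps the source emotion embedding
    exactly onto the target emotion embedding.
    """
    offset = [t - s for s, t in zip(src_embedding, tgt_embedding)]

    def transform(z3):
        return [z + o for z, o in zip(z3, offset)]

    return transform

happy = [1.0, 0.0]   # illustrative embedding for a first emotion
sad = [0.0, 1.0]     # illustrative embedding for a second emotion
happy_to_sad = make_emotion_transform(happy, sad)
print(happy_to_sad([1.5, 0.5]))
```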

In an embodiment, adjusting the third latent variable may further include orthogonalizing the embedding vector corresponding to the first emotion and the embedding vector corresponding to the second emotion.
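The orthogonalization procedure itself is not detailed at this point in the text. A common way to orthogonalize a set of vectors is the Gram-Schmidt process, sketched below as an assumption about how such a step could look. The sketch assumes the input vectors are linearly independent; otherwise a zero vector appears and the division fails.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(vectors):
    """Classical Gram-Schmidt: remove from each vector its projection onto the earlier ones."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b) / dot(b, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        basis.append(w)
    return basis

# Illustrative emotion embeddings (hypothetical values).
e_first, e_second = orthogonalize([[1.0, 1.0], [1.0, 0.0]])
print(e_first, e_second, dot(e_first, e_second))
```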

In an embodiment, the output audio sequence may be audio in which the speaker's emotion expressed in the input audio sequence has been converted, while the linguistic content of the input audio sequence and the characteristics of its speaker's voice are preserved.

In an embodiment, obtaining the first latent variable may include: obtaining, by an encoding unit of the audio conversion model, a predictive distribution of the third latent variable using the input audio data; and sampling the third latent variable from the predictive distribution of the third latent variable.

In an embodiment, the predictive distribution of the third latent variable may be a normal distribution.
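Sampling a latent variable from a predicted normal distribution can be sketched as follows. The parameterization by a mean and a log-variance per dimension, and the name `sample_latent`, are assumptions made for illustration; the text only states that the predictive distribution may be normal.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

rng = random.Random(0)                       # seeded for reproducibility
z3 = sample_latent([0.0, 1.0], [0.0, -2.0], rng)
print(z3)
```

Writing the sample as a deterministic function of the mean, the standard deviation, and independent noise is the usual way to keep such a sampling step differentiable with respect to the encoder outputs.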

In an embodiment, obtaining the first latent variable may further include: obtaining, by the encoding unit, a predictive distribution of the fourth latent variable using the input audio data; and sampling the fourth latent variable from the predictive distribution of the fourth latent variable.

In an embodiment, obtaining the second latent variable may include obtaining the second latent variable using the input audio data and the first latent variable.

In an embodiment, obtaining the output audio data using the adjusted first latent variable and the second latent variable may include: obtaining, by a decoding unit of the audio conversion model, a distribution for predicting the output audio data using a value obtained by concatenating the adjusted first latent variable and the second latent variable; and sampling the output audio data from that distribution.
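The concatenation step can be sketched as below. Because the first latent variable describes the whole sequence and the second describes each segment, this sketch assumes the sequence-level vector is repeated for every segment before concatenation; that broadcasting choice and the function name are assumptions, not details stated in this passage.

```python
def build_decoder_inputs(z1_adjusted, z2_segments):
    """Concatenate the (shared) adjusted sequence-level latent with each segment-level latent."""
    return [list(z1_adjusted) + list(segment) for segment in z2_segments]

decoder_inputs = build_decoder_inputs([9.0], [[1.0, 2.0], [3.0, 4.0]])
print(decoder_inputs)
```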

In an embodiment, the method may further include synthesizing the output audio sequence by inputting the output audio data into a vocoder.

In an embodiment, the method may further include: obtaining spectrogram data for the input audio sequence by applying a short-time Fourier transform (STFT) to the input audio sequence; and synthesizing the output audio sequence from the output audio data using a neural vocoder.
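As a minimal, dependency-free sketch of the preprocessing step, the function below computes a magnitude spectrogram with a short-time Fourier transform. The Hann window, the frame length, and the hop size are illustrative assumptions; a practical implementation would normally use an FFT library rather than this direct DFT.

```python
import cmath
import math

def stft_magnitude(signal, frame_len, hop):
    """Magnitude spectrogram: one list of (frame_len // 2 + 1) bins per frame."""
    # Periodic Hann window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        spectrogram.append(bins)
    return spectrogram

# A pure tone that falls exactly on bin 4 of a 16-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = stft_magnitude(tone, frame_len=16, hop=8)
print(len(spec), len(spec[0]))
```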

To solve the above technical problems, an audio conversion apparatus according to another embodiment of the present invention includes an audio conversion model, which includes an encoding unit, a latent variable conversion unit, and a decoding unit, and an audio synthesis unit that synthesizes an output audio sequence from output audio data. The encoding unit encodes, from input audio data obtained by preprocessing an input audio sequence, a first latent variable representing a sequence-level feature of the input audio sequence and provides the first latent variable to the latent variable conversion unit; it also encodes, from the input audio data and the first latent variable, a second latent variable representing features of each of a plurality of segments constituting the input audio sequence and provides the second latent variable to the decoding unit. The latent variable conversion unit adjusts the first latent variable and provides the adjusted first latent variable to the decoding unit. The decoding unit provides the output audio data using the adjusted first latent variable and the second latent variable.

In an embodiment, the encoding unit may encode a mean of the first latent variable from the input audio data, sample the first latent variable from a predictive distribution determined based on that mean, encode a mean of the second latent variable from the input audio data and the first latent variable, and sample the second latent variable from a predictive distribution determined based on the mean of the second latent variable.

In an embodiment, the input audio sequence includes a speaker's voice; the first latent variable includes a third latent variable that at least partially represents a first feature of the input audio sequence and a fourth latent variable that at least partially represents a second feature of the input audio sequence; the second latent variable at least partially represents linguistic content included in the input audio sequence; and the latent variable conversion unit may adjust the third latent variable while leaving the fourth latent variable unchanged.

In an embodiment, the first feature may refer to the speaker's emotion expressed in the input audio sequence, and the second feature may refer to the speaker's voice characteristics.

In an embodiment, the third latent variable is a value expressed as a vector, and the latent variable conversion unit may adjust the third latent variable by applying to it an emotion conversion function that converts an embedding vector corresponding to a first emotion into an embedding vector corresponding to a second emotion.

In an embodiment, the decoding unit may obtain a predictive distribution of the output audio data using a value obtained by concatenating the adjusted first latent variable and the second latent variable, and may sample the output audio data from that predictive distribution.

To solve the above technical problems, a method by which a computing device trains a neural network-based audio conversion model, according to another embodiment of the present invention, includes: obtaining a training data set that includes a plurality of audio sequences and labels indicating the speaker's emotion expressed in each of the plurality of audio sequences; and updating parameters of the audio conversion model using the training data set. An encoding unit of the audio conversion model encodes an input audio sequence to provide a first latent variable representing the speaker's emotion expressed in the input audio sequence and a second latent variable representing features of each of a plurality of segments constituting the input audio sequence, and a decoding unit of the audio conversion model decodes the first latent variable and the second latent variable to provide an output audio sequence. Updating the parameters of the audio conversion model includes updating the parameters so that, when a first input audio sequence is input into the audio conversion model, the probability that an audio sequence identical to the first input audio sequence is output increases. Updating the parameters of the audio conversion model also includes updating the parameters so that the distributions of the first latent variable and the second latent variable provided by the encoding unit approach a normal distribution.
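The two training criteria above (reproduce the input, and keep the latent distributions close to a normal distribution) resemble the objective of a variational autoencoder. The sketch below is written under that assumption and is not the loss stated in the patent: squared error stands in for the negative log-likelihood of the reconstruction, and the closed-form KL divergence to a standard normal stands in for the distribution-matching term. The names and the `kl_weight` parameter are hypothetical.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over independent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var))

def training_loss(x, reconstruction, latent_mean, latent_log_var, kl_weight=1.0):
    """Reconstruction error plus a weighted pull of the latent distribution toward N(0, 1)."""
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
    return reconstruction_error + kl_weight * kl_to_standard_normal(latent_mean, latent_log_var)

print(training_loss([1.0, 2.0], [0.5, 2.0], [1.0], [0.0]))
```

Minimizing the first term raises the likelihood of reproducing the input; minimizing the second moves the encoder's output distribution toward the normal prior.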

In an embodiment, updating the parameters of the audio conversion model may further include updating the parameters so that the distribution of the first latent variables obtained by encoding a plurality of audio sequences having labels indicating the same emotion approaches a normal distribution.

In an embodiment, updating the parameters of the audio conversion model may further include updating the parameters so that the first latent variables obtained by encoding a plurality of audio sequences having labels indicating different emotions move farther apart from one another.
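One way to express "latents of different emotions should be far apart" as a training penalty is a margin (hinge) term on the distance between per-emotion centroids. This particular form, the `margin` parameter, and the function names are assumptions for illustration; the text does not specify how the separation is measured.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def separation_penalty(latents_by_emotion, margin):
    """Sum of max(0, margin - distance) over all pairs of emotion centroids.

    The penalty is zero once every pair of centroids is at least `margin` apart.
    """
    centroids = [centroid(vectors) for vectors in latents_by_emotion.values()]
    penalty = 0.0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            penalty += max(0.0, margin - math.dist(centroids[i], centroids[j]))
    return penalty

latents = {
    "happy": [[0.0, 0.0], [0.0, 2.0]],   # illustrative first latent variables
    "sad": [[3.0, 1.0], [5.0, 1.0]],
}
print(separation_penalty(latents, margin=5.0))
```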

FIG. 1 is a diagram for explaining the input and output of an audio emotion conversion apparatus according to an embodiment of the present invention.
FIG. 2 is an exemplary block diagram of an audio emotion conversion apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the voice preprocessing unit of the audio emotion conversion apparatus described with reference to FIG. 2.
FIG. 4 is a diagram for explaining the speech synthesis unit of the audio emotion conversion apparatus described with reference to FIG. 2.
FIG. 5 is an exemplary block diagram of the voice conversion unit of the audio emotion conversion apparatus described with reference to FIG. 2.
FIG. 6 is a diagram for explaining the voice conversion model of the voice conversion unit described with reference to FIG. 5.
FIG. 7 is a diagram for explaining the orthogonalization of embedding vectors used in the emotion conversion process according to some embodiments of the present invention.
FIG. 8 is a diagram for explaining the encoding unit of the voice conversion model described with reference to FIG. 6.
FIG. 9 is a diagram for explaining the decoding unit of the voice conversion model described with reference to FIG. 6.
FIG. 10 is an exemplary flowchart of a series of steps for building a voice conversion model and converting speech with it, according to an embodiment of the present invention.
FIG. 11 is an exemplary flowchart of a method of converting speech according to an embodiment of the present invention.
FIG. 12 is a diagram for explaining an exemplary computing device capable of implementing an audio conversion apparatus according to some embodiments of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the technical idea of the present invention is not limited to the following embodiments and may be implemented in various different forms. The following embodiments are provided only to make the technical idea of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention belongs of its scope; the technical idea of the present invention is defined only by the scope of the claims.

In assigning reference numerals to the components in each drawing, note that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In addition, in describing the present invention, a detailed description of a related known configuration or function is omitted when it is judged that it could obscure the gist of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are clearly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the phrase specifically states otherwise.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of the present invention. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "linked" to another component, the component may be directly connected or linked to that other component, but it should be understood that yet another component may be "connected", "coupled", or "linked" between them.

As used in this specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Before describing this specification further, some terms used in it are clarified.

In this specification, a sequence means an ordered collection of data, or a collection of data that can be arranged linearly. A segment refers to a unit that makes up a sequence. For example, an audio file or audio data with its own length may be understood as an audio sequence, and a piece of an audio sequence obtained by dividing it into units of time of a predetermined length may be understood as an audio segment. In this specification, a frame constituting an audio sequence is used in the same sense as a segment constituting an audio sequence.
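The relationship between a sequence and its segments can be shown with a short sketch that cuts a list of audio samples into non-overlapping segments of a fixed length. Dropping a trailing partial segment is a choice made for this example only, and the function name is hypothetical.

```python
def split_into_segments(samples, segment_len):
    """Split a sequence into consecutive, non-overlapping segments of equal length."""
    return [samples[i:i + segment_len]
            for i in range(0, len(samples) - segment_len + 1, segment_len)]

audio_sequence = list(range(10))           # stand-in for audio samples
print(split_into_segments(audio_sequence, 4))
```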

In this specification, a linguistic feature or linguistic content of speech or an utterance means the message, contained in the speech, that the speaker intends to convey. For example, text or phonemes that can be identified from audio in which speech is recorded are linguistic features.

In this specification, an acoustic feature of speech means features such as the speaker's timbre, pitch, and intonation.

In this specification, an emotion feature of speech may be understood as a subset of the acoustic features. The emotion features of speech are the acoustic features that vary with the emotion expressed in the speaker's voice, such as happiness, joy, sadness, anger, or fear. The way emotion is expressed in speech may vary with the characteristics of the language and of the speaker. For example, the pitch, prosody, and speed of the sounds that make up the speech may differ depending on the emotion being expressed. In this specification, the emotion feature expressed in speech may be interpreted as encompassing these speech features.

In this specification, the speaker's voice characteristics expressed in speech may be understood as a subset of the acoustic features. They are the features that identify the speaker's own voice, and they encompass the timbre and prosodic features that depend on the structure of the speaker's vocal organs, the speaker's vocal habits, and the like.

In the present specification, the emotion feature and the speaker's voice feature expressed in a voice are distinct from each other, but are not entirely mutually exclusive. For example, a feature such as prosody may vary with the emotion to be expressed, and may also vary with the speaker's manner of speaking.

Hereinafter, some embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram for explaining the input and output of an audio emotion conversion apparatus 10 according to an embodiment of the present invention.

As shown in FIG. 1, the audio emotion conversion apparatus 10 is a computing device that obtains an audio sequence expressed with a first emotion and outputs an audio sequence expressed with a second emotion.

The computing device may be a notebook, a desktop, a laptop, or the like, but is not limited thereto and may include any type of device equipped with a computing function. For an example of the computing device, refer to FIG. 12.

Although FIG. 1 shows, as an example, the audio emotion conversion apparatus 10 implemented as a single computing device, a first function of the audio emotion conversion apparatus 10 may be implemented in a first computing device and a second function may be implemented in a second computing device.

According to various embodiments of the present invention, the audio emotion conversion apparatus 10 may build a voice conversion model and use it to generate an audio sequence 3 expressed with the second emotion from an audio sequence 1 expressed with the first emotion.

The voice conversion model may be, for example, a model based on an artificial neural network. For example, a model such as a deep neural network (DNN), a linear neural network, a recurrent neural network (RNN), a long short-term memory (LSTM) network, or a convolutional neural network (CNN), or a neural network model combining these, may be used as the voice conversion model, but the model is not limited thereto.

The voice conversion model may be trained by supervised learning using a training dataset composed of training audio sequence samples and emotion label data corresponding to each audio sequence sample.

For example, to build a voice conversion model that converts among the five emotions of happiness, joy, sadness, anger, and fear, a sufficient number of audio sequence samples may be produced in which professional voice actors or the like perform each of the five emotions, and the voice conversion model may be trained using these samples, each tagged with its emotion label. Alternatively, the voice conversion model may be trained using existing audio sequence samples, such as those from TV and radio programs, to which people have attached emotion labels.

In the training process, the voice conversion model may be trained by adjusting its parameters so that the audio sequence output from the model is as identical or similar as possible to the audio sequence sample input to the model.

During training of the voice conversion model, an emotion embedding matrix inside the model is built. As training proceeds, the emotion embedding matrix is updated so that embedding vectors generated from audio sequence samples with the same emotion label lie close to one another in the latent vector space, and embedding vectors generated from audio sequence samples with different emotion labels lie far from one another.

A method of training the voice conversion model according to various embodiments of the present invention will be described later.

Hereinafter, the functional configuration of the audio emotion conversion apparatus 10 according to an embodiment of the present invention will be described with reference to FIG. 2.

As shown in FIG. 2, the audio emotion conversion apparatus 10 may include a voice preprocessing unit 21, a voice conversion unit 23, a voice synthesis unit 25, and a storage unit 27. FIG. 2 shows only the components related to the embodiment of the present invention. Accordingly, those of ordinary skill in the art to which the present invention pertains will understand that other general-purpose components may be included in addition to those shown in FIG. 2. Note also that the components of the audio emotion conversion apparatus 10 shown in FIG. 2 are functionally distinguished elements, and that a plurality of components may be implemented in integrated form in an actual physical environment. Each component is described below.

For convenience of explanation, the voice preprocessing unit 21, the voice synthesis unit 25, and the storage unit 27 are described first with reference to FIGS. 2 to 4, and the voice conversion unit 23 is then described with reference to FIGS. 5 to 9.

Before the audio sequence 1 expressed with the first emotion, which is input to the audio emotion conversion apparatus 10, is converted, the voice preprocessing unit 21 preprocesses the audio sequence 1 into data in a form suitable for processing by the voice conversion model described later.

Referring to FIG. 3, the voice preprocessing unit 21 according to some embodiments may include an STFT transform unit 31 that performs a short-time Fourier transform (STFT). The STFT transform unit 31 may perform a preprocessing function of converting digital audio data (for example, audio in wav format) into spectrogram data. For example, the STFT transform unit 31 may perform STFT signal processing to convert the audio sequence 1 expressed with the first emotion, which is input to the audio emotion conversion apparatus 10, into spectrogram data 35, and may further convert the spectrogram data to the mel scale. A mel-scale spectrogram gives relatively greater weight to information in low frequency bands than to information in high frequency bands. Because it gives greater weight to the low frequency bands to which human hearing is more sensitive, a mel-scale spectrogram can better represent the features of speech as perceived by humans.
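The preprocessing described above can be sketched roughly as follows, using only NumPy. The sample rate, FFT size, hop length, log compression, and the particular mel formula are illustrative assumptions that the specification does not fix; only the 80 mel bands per frame follow the text.

```python
import numpy as np

def hz_to_mel(f):
    # A common mel-scale formula; the specification does not name one.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=80):
    # Triangular filters whose edges are equally spaced on the mel scale,
    # so low frequencies get finer resolution than high frequencies.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram(wave, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Return a log-mel spectrogram of shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # STFT power spectrum
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T  # project onto mel bands
    return np.log(mel + 1e-6)                          # log compression (assumed)
```

Each row of the result is one 80-dimensional frame of the kind the voice conversion unit 23 is described as receiving.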

The voice preprocessing unit 21 may segment the spectrogram data 35 output by the STFT transform unit 31 into frames (or segments) of a predetermined length and provide them to the voice conversion unit 23. In some embodiments, the spectrogram data 35 may be, for example, data expressed as an 80-dimensional vector per frame.
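As a small illustration of the segmentation step, the sketch below cuts a spectrogram into fixed-length segments. The segment length of 20 frames and the choice to drop a trailing partial segment are assumptions made only for this example.

```python
import numpy as np

def split_into_segments(spec, seg_len=20):
    """Split a (n_frames, n_bins) spectrogram into (n_segments, seg_len, n_bins).

    Trailing frames that do not fill a whole segment are dropped here;
    padding them instead would be an equally reasonable choice.
    """
    n_segments = spec.shape[0] // seg_len
    return spec[:n_segments * seg_len].reshape(n_segments, seg_len, spec.shape[1])
```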

The voice synthesis unit 25 is described with reference to FIG. 4. The voice synthesis unit 25 converts the spectrogram data 41 output from the voice conversion unit 23 back into digital audio data. In some embodiments, the spectrogram data 41 may be, for example, data expressed as an 80-dimensional vector per frame. The voice synthesis unit 25 may be a vocoder that synthesizes audio or speech from the spectrogram data 41. The voice synthesis unit 25 may be implemented in any manner as long as it can perform this conversion. For example, it may be implemented using a vocoder module widely known in the art, such as a speech synthesis module using the Griffin-Lim algorithm or a neural network-based vocoder module such as WaveNet or WaveGlow. To avoid obscuring the subject matter of the present invention, further description of the voice synthesis unit 25 is omitted.

Referring again to FIG. 2, the storage unit 27 is described. The storage unit 27 stores and manages various data. In particular, the storage unit 27 may store the training dataset used to train the voice conversion model described later. The training dataset may include training audio sequence samples and emotion label data corresponding to each audio sequence sample. The storage unit 27 may also store and manage various parameters and settings of the one or more neural networks that constitute the voice conversion model. In addition, the storage unit 27 may store and manage the emotion embedding matrix, composed of the emotion embedding vectors that the voice conversion model uses to convert emotions.

So far, the voice preprocessing unit 21, the voice synthesis unit 25, and the storage unit 27 of the audio emotion conversion apparatus 10 have been described with reference to FIGS. 2 to 4. Hereinafter, the voice conversion unit 23 of the audio emotion conversion apparatus 10 will be described in detail with reference to FIGS. 5 to 9.

Referring to FIG. 5, the voice conversion unit 23 according to an embodiment of the present invention may include a learning unit 51, a voice conversion model 53, and a conversion unit 55. For convenience of explanation, the learning unit 51 and the conversion unit 55 are described first, and the voice conversion model 53 is then described in detail with reference to FIGS. 6 to 9.

The learning unit 51 trains the voice conversion model 53 using the training dataset. That is, the learning unit 51 may build the voice conversion model 53 by using the training dataset to update the parameters of the voice conversion model 53 so that its voice conversion quality is optimized. The training dataset may be provided from the storage unit 27, but the technical scope of the present invention is not limited thereto. For convenience of explanation, the training process performed by the learning unit 51 is described with reference to FIG. 10 after the other components of the voice conversion unit 23 and the neural network structure of the voice conversion model 53 have been described.

The conversion unit 55 converts some features of a spectrogram representing an audio sequence using the voice conversion model 53 trained by the learning unit 51. More specifically, the conversion unit 55 inputs the spectrogram of an audio sequence expressed with the first emotion into the voice conversion model 53 and, as a result, receives a spectrogram for synthesizing an audio sequence expressed with the second emotion. The spectrograms that the conversion unit 55 inputs to or receives from the voice conversion model 53 may consist, for example, of spectrogram data segmented into frames of a predetermined length.

The voice conversion model 53 is a model that converts at least some of the features of an input audio sequence. In one embodiment, the voice conversion model 53 converts the features expressing a first emotion in an audio sequence received in the form of spectrogram data into features expressing a second emotion. As shown in FIG. 6, the voice conversion model 53 may include an encoding unit 61, a decoding unit 69, and a latent variable conversion unit 67. In some embodiments, the voice conversion model 53 may further include an emotion embedding matrix 63.

In some embodiments of the present invention, the voice conversion model 53 may be implemented based on the structure of a variational autoencoder (VAE), and the components of a variational autoencoder may be referred to in understanding the components of the voice conversion model 53 described below. However, the present invention is not limited to such embodiments.

Referring to FIG. 6, the encoding unit 61 inputs the spectrogram data 35 received by the voice conversion model 53 into an artificial neural network layer and computes latent variables representing features of the input audio sequence. As described above, the spectrogram data 35 may be, for example, data expressed as an 80-dimensional vector per frame 35a. In some embodiments, the latent variables may be 32-dimensional vectors. In other words, the encoding unit 61 may receive 80-dimensional vectors as input and output 32-dimensional vectors.

The latent variables include a latent variable representing sequence-level features of the input audio sequence and a latent variable representing features of each of the segments or frames that make up the input audio sequence (hereinafter, "segment-level features"). Here, the sequence-level features of the input audio sequence are features common to the entire sequence, such as the speaker's timbre, or features determined as a single judgment over the entire sequence, such as the overall change in pitch and prosody, intonation, and emotion. The segment-level features of the input audio sequence, in contrast, are linguistic features such as the text or phonemes appearing in each segment or frame of the audio sequence.

The encoding unit 61 provides the latent variables representing sequence-level features (for example, Z₂ and Z₃ in FIG. 6) and the latent variables representing segment-level features (for example, Z₁ in FIG. 6) to the decoding unit 69. The encoding unit 61 may also provide at least some of the latent variables representing sequence-level features (Z₂, Z₃), for example the latent variable representing emotion (Z₃), to the latent variable conversion unit 67.

The decoding unit 69 inputs the latent variables provided by the encoding unit 61 and the adjusted latent variable provided by the latent variable conversion unit 67 into an artificial neural network layer, and generates spectrogram data 41 for producing an output audio sequence. The latent variables input to the decoding unit 69 may be 32-dimensional vectors, and the spectrogram data 41 output by the decoding unit 69 may be, for example, data expressed as an 80-dimensional vector per frame 41a. The output audio sequence may be an audio sequence reflecting the audio features converted through the adjustment of the latent variable (Z₃) by the latent variable conversion unit 67.
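The data flow between the encoding unit 61 and the decoding unit 69 can be illustrated, purely in terms of array shapes, with the toy sketch below. Random linear maps stand in for the trained neural network layers, sequence-level variables are obtained by simple averaging over frames, and the decoder concatenates its inputs; all three are assumptions for illustration and are not taken from the specification.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MEL, N_LATENT = 80, 32

# Hypothetical stand-ins for trained layers: random linear maps.
W_z1 = rng.normal(size=(N_MEL, N_LATENT))
W_z2 = rng.normal(size=(N_MEL, N_LATENT))
W_z3 = rng.normal(size=(N_MEL, N_LATENT))
W_dec = rng.normal(size=(3 * N_LATENT, N_MEL))

def encode(spec):
    """spec: (n_frames, 80) -> Z1 per frame, Z2 and Z3 per sequence."""
    z1 = spec @ W_z1                 # (n_frames, 32): segment-level features
    z2 = (spec @ W_z2).mean(axis=0)  # (32,): sequence-level, other than emotion
    z3 = (spec @ W_z3).mean(axis=0)  # (32,): sequence-level, emotion
    return z1, z2, z3

def decode(z1, z2, z3):
    """Rebuild (n_frames, 80) frames from Z1 and the (possibly adjusted) Z2, Z3."""
    n = z1.shape[0]
    z = np.concatenate([z1, np.tile(z2, (n, 1)), np.tile(z3, (n, 1))], axis=1)
    return z @ W_dec
```

Passing an adjusted Z₃ to `decode` while leaving Z₁ and Z₂ unchanged mirrors the idea that only the emotion-related features of the output change.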

If the latent variable representing the emotion expressed in the audio sequence has been adjusted by the latent variable conversion unit 67, the output audio sequence may be an audio sequence that keeps the content of the input audio sequence unchanged but is expressed with an emotion (for example, anger) different from the emotion expressed in the input audio sequence (for example, joy). If the latent variable representing the speaker's voice feature (for example, timbre) has been adjusted by the latent variable conversion unit 67, the output audio sequence may be an audio sequence that keeps the content of the input audio sequence unchanged but in which the speaker's voice has been changed to a different voice.

The emotion embedding matrix 63 is a matrix composed of embedding vectors, each representing a different emotion. The emotion embedding matrix 63 contains one embedding vector per emotion. For example, if the audio emotion conversion apparatus 10 converts among the five emotions of happiness, joy, sadness, anger, and fear, the emotion embedding matrix 63 contains a total of five embedding vectors.

In the emotion conversion process performed by the voice conversion model 53, the emotion embedding matrix 63 provides the latent variable conversion unit 67 with the embedding vector representing the original emotion (or source emotion) and the embedding vector representing the target emotion (or destination emotion).

The emotion embedding matrix 63, composed of emotion embedding vectors, may be built during supervised training of the voice conversion model 53 using the training dataset. The embedding vector corresponding to a first emotion may be determined by averaging the latent variables representing emotion that are obtained by inputting training audio sequences tagged with the label of the first emotion into the voice conversion model 53. For example, the embedding vector corresponding to happiness may be obtained as a result of training the voice conversion model 53 while inputting into it a large number of audio sequence samples labeled as expressing happiness. Specifically, it may be the mean vector of the distribution formed by the latent variables (Z₃) that the encoding unit 61 computes when a large number of audio sequence samples expressing happiness are input to the trained voice conversion model 53.
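The per-emotion averaging described above can be sketched as follows. The function name, the integer label encoding, and the assumption that every emotion has at least one sample are choices made for this example only.

```python
import numpy as np

def build_emotion_embedding_matrix(z3_samples, labels, n_emotions):
    """Average the emotion latent variables Z3 per emotion label.

    z3_samples: (n_samples, dim) latent variables computed by the encoder.
    labels:     (n_samples,) integer emotion labels in [0, n_emotions).
    Returns a (n_emotions, dim) matrix with one mean vector per emotion.
    Assumes every emotion has at least one sample.
    """
    z3_samples = np.asarray(z3_samples, dtype=float)
    labels = np.asarray(labels)
    return np.stack([z3_samples[labels == k].mean(axis=0)
                     for k in range(n_emotions)])
```

With 32-dimensional latent variables and five emotions, the result would be a 5 x 32 matrix.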

In some embodiments in which the latent variable (Z₃) is a 32-dimensional vector, the emotion embedding vector is also a 32-dimensional vector.

In some embodiments, the embedding vectors corresponding to different emotions in the emotion embedding matrix 63 may be determined such that the distances between them in the vector space are maximized.

If the embedding vector corresponding to a first emotion and the embedding vector corresponding to a second emotion are determined to be similar (that is, if the distance between the vectors is small), the result of conversion to the first emotion by the voice conversion model 53 is not clearly distinguishable from the result of conversion to the second emotion. For example, there may be no clear difference between an output converted with the intention of expressing anger and an output converted with the intention of expressing joy. That is, the quality of emotion conversion is degraded.

In some embodiments of the present invention, when the emotion embedding matrix 63 is built during supervised training of the voice conversion model 53, the voice conversion model 53 is updated so that the distances between the embedding vectors corresponding to different emotions in the emotion embedding matrix 63 are maximized. This may be achieved by including, when the parameters of the voice conversion model 53 are updated during supervised training, a term that causes Equation 1 below to be maximized.

(Here, K is the number of different emotions, and μ₃⁽ⁱ⁾ and μ₃⁽ʲ⁾ are the emotion embedding vectors corresponding to different emotions.)

According to this embodiment, by building the emotion embedding matrix 63 so that the distances between the embedding vectors corresponding to different emotions are maximized, an output converted with the intention of expressing a first emotion and an output converted with the intention of expressing a second emotion differ more distinctly. That is, superior emotion conversion quality can be provided.
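A separation term of this kind can be sketched as below. The specific form used here, the mean squared Euclidean distance over all ordered pairs of distinct emotions, is an assumption chosen for illustration and should not be read as the exact expression of Equation 1.

```python
import numpy as np

def embedding_separation(mu):
    """Mean squared Euclidean distance over all pairs i != j of rows of mu.

    mu: (K, dim) matrix of emotion embedding vectors, K >= 2.
    A larger value means the emotion embeddings are further apart, so a
    training objective would reward (maximize) this quantity.
    """
    k = mu.shape[0]
    diff = mu[:, None, :] - mu[None, :, :]  # (K, K, dim) pairwise differences
    sq_dist = (diff ** 2).sum(axis=-1)      # (K, K), zeros on the diagonal
    return sq_dist.sum() / (k * (k - 1))
```

In a gradient-based training loop, the negative of such a term would typically be added to the loss so that minimizing the loss pushes the embeddings apart.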

Although the voice conversion model 53 has been described in this embodiment as including the emotion embedding matrix 63, the present invention is not limited to such an embodiment. For example, when the present invention is implemented to convert only the intonation of the speaker expressed in an audio sequence while keeping the other features of the audio sequence, an intonation embedding matrix may be provided instead of the emotion embedding matrix 63. In this case, the intonation embedding matrix may be built during training of the voice conversion model 53 using audio sequence samples tagged with intonation label information, and intonation embedding vectors may be provided from the intonation embedding matrix to the latent variable conversion unit 67 while the intonation of an audio sequence is converted from a first intonation to a second intonation.

Referring again to FIG. 6, the latent variable conversion unit 67 receives from the encoding unit 61 the latent variable representing emotion (Z₃), among the latent variables representing sequence-level features (Z₂, Z₃), and computes an adjusted latent variable for converting the emotion. To this end, the latent variable conversion unit 67 receives, from the emotion embedding matrix 63, the embedding vector representing the target emotion (or destination emotion) and the embedding vector representing the original emotion (or source emotion). The latent variable conversion unit 67 provides the adjusted latent variable to the decoding unit 69.

Specifically, the latent variable conversion unit 67 may compute the adjusted latent variable by adjusting the latent variable (Z₃) through a vector operation such as Equation 2, which is expressed in terms of the latent variable (Z₃), the embedding vector representing the target emotion, and the embedding vector representing the original emotion.

The vector operation that computes the adjusted latent variable from the latent variable (Z₃) may be understood as an emotion conversion function, in that it converts the emotion expressed in the audio sequence.
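One simple emotion conversion function consistent with the description, which adjusts Z₃ using the difference between emotion embedding vectors, is sketched below. The additive form is an assumption for illustration and is not a reproduction of Equation 2; the function and argument names are hypothetical.

```python
import numpy as np

def convert_emotion(z3, mu_source, mu_target):
    """Shift the emotion latent variable from the source to the target emotion.

    z3:        (dim,) emotion latent variable of the input audio sequence.
    mu_source: (dim,) embedding vector of the original (source) emotion.
    mu_target: (dim,) embedding vector of the target emotion.
    Assumed form: move z3 by the difference between the two embeddings.
    """
    return np.asarray(z3) + (np.asarray(mu_target) - np.asarray(mu_source))
```

Because the same function works for any pair of rows of the emotion embedding matrix, one model can serve every source and target combination.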

As described above, various embodiments of the present invention can provide a voice conversion model that converts among three or more emotions by adjusting the latent variable (Z₃) representing the emotion of the input audio sequence through a vector operation that uses the difference between the embedding vectors representing the respective emotions. That is, an apparatus that converts among three or more different emotions can be provided without building a separate voice conversion model for each pair of original emotion and target emotion.

In some embodiments of the present invention, the emotion conversion function expressed by the exemplary Equation 2 may use, instead of the embedding vector representing the target emotion and the embedding vector representing the original emotion, the vectors obtained by orthogonalizing these two embedding vectors with respect to each other. This allows different emotions that share similar properties to be distinguished more clearly.

For example, voices expressing anger and voices expressing happiness share properties such as high pitch and loudness, so the embedding vector corresponding to anger and the embedding vector corresponding to happiness may have similar components. In such a case, converting anger to happiness, or happiness to anger, may produce results with no clear difference between them. In addition, the result of converting a third emotion into anger and the result of converting that third emotion into happiness may not be clearly distinguishable from each other.

To solve this problem, an orthogonalization process may be performed that removes the similar components shared by embedding vectors corresponding to different emotions. Specifically, the embedding vectors may be orthogonalized through, for example, Gram-Schmidt processing. FIG. 7 is a diagram conceptually illustrating the result of orthogonalizing vectors v₁, v₂ into vectors u₁, u₂ through Gram-Schmidt processing. As shown in FIG. 7, when vectors v₁, v₂ having similar components are transformed into mutually orthogonal vectors u₁, u₂, the similar components of the vectors are removed and the difference between the vectors is maximized.

Accordingly, in some embodiments of the present invention, orthogonalizing the embedding vector representing the target emotion and the embedding vector representing the original emotion with respect to each other removes the similar components between the embedding vectors and makes the difference between them more pronounced. By calculating the emotion conversion function of Equation 2 with the orthogonalized emotion embedding vectors, the results of converting between different emotions that share similar properties can be distinguished more clearly.
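As an illustration of the orthogonalization step, the sketch below applies the classical Gram-Schmidt procedure to two vectors. The embedding values are made up, and the sketch leaves the vectors unnormalized; whether the embodiment normalizes them is not stated in the text, so that choice is an assumption.

```python
import numpy as np

def gram_schmidt_pair(v1, v2):
    """Return u1, u2 spanning the same plane as v1, v2 with u1 . u2 == 0."""
    u1 = v1
    # Remove from v2 its projection onto u1.
    u2 = v2 - (np.dot(v2, u1) / np.dot(u1, u1)) * u1
    return u1, u2

e_angry = np.array([0.9, 0.8, 0.1])   # hypothetical, made-up embedding
e_happy = np.array([0.8, 0.9, -0.2])  # shares large components with e_angry
u_angry, u_happy = gram_schmidt_pair(e_angry, e_happy)
```

Gram-Schmidt depends on the order of its inputs: the first vector is kept unchanged and only the second is modified, so swapping the arguments yields a different orthogonal pair.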

So far, the speech conversion model 53 has been described with reference to FIGS. 6 and 7. Hereinafter, the encoding unit 61 and the decoding unit 69 of the speech conversion model 53 will be described in more detail with reference to FIGS. 8 and 9.

FIG. 8 is a diagram illustrating the neural network structure of the encoding unit 61 according to some embodiments.

As described above, the encoding unit 61 receives the spectrogram data 35 and calculates or predicts the sequence-level latent variables (Z₂, Z₃) and the segment-level latent variable (Z₁). Specifically, the encoding unit 61 inputs the spectrogram data 35 to artificial neural network layers to determine a prediction distribution for each latent variable, and predicts the latent variables by sampling each of them from its determined prediction distribution.

To this end, the encoding unit 61 may include neural network layers 81 to 86 for determining the distribution of each of the latent variables (Z₁, Z₂, Z₃), and a latent variable sampling unit 87 for sampling each latent variable from its distribution.

The neural network layers 81 to 86 for calculating the latent variables (Z₁, Z₂, Z₃) may be composed of deep neural network (DNN) layers, linear neural network layers, recurrent neural network (RNN) layers, long short-term memory (LSTM) layers, convolutional neural network (CNN) layers, and the like. In some embodiments, as illustrated by way of example in FIG. 8, the neural network layers 81 to 86 may be composed of LSTM layers 81, 83, 85 and linear layers 82a, 82b, 84a, 84b, 86a, 86b connected to them, but the present invention is not limited to such embodiments.

For convenience of explanation, the components for predicting the latent variables Z₂ and Z₃ are described first, followed by the components for predicting the latent variable Z₁.

To predict the sequence-level latent variable Z₂ in the encoding unit 61, the spectrogram data 35 is input to the LSTM layer 83. As described above, the spectrogram data 35 may be data expressed as an 80-dimensional vector per frame 35a.

The state value of the final hidden layer of the LSTM layer 83 is passed to the linear layer 84a and the linear layer 84b. This state value may be a 256-dimensional vector.

In the linear layer 84a and the linear layer 84b, the mean (μ₂) and the variance (σ₂²) of the distribution for predicting the latent variable Z₂ are respectively obtained from the state value of the final hidden layer. The mean (μ₂) and the variance (σ₂²) may be 32-dimensional vectors, that is, vectors with the same dimensionality as the latent variable Z₂. As will be described later, the layers of the speech conversion model 53 may be updated during training so that the distribution for predicting the latent variable Z₂ resembles or follows a Gaussian normal distribution.

The latent variable sampling unit 87 may predict the latent variable Z₂ from the Gaussian normal distribution defined by the mean (μ₂) and the variance (σ₂²) provided by the linear layers 84a and 84b.
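The path from the final hidden state to a sampled Z₂ can be sketched as follows. This is only a shape-level illustration under stated assumptions: the weights are random stand-ins for trained parameters, the hidden state is a placeholder for the output of LSTM layer 83, and treating the second head's output as a log-variance (so that the variance stays positive) is an assumption not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, LATENT = 256, 32

# Stand-ins for trained parameters of linear layers 84a (mean) and 84b (variance).
W_mu, b_mu = rng.normal(scale=0.05, size=(LATENT, HIDDEN)), np.zeros(LATENT)
W_var, b_var = rng.normal(scale=0.05, size=(LATENT, HIDDEN)), np.zeros(LATENT)

def predict_z2(h_final):
    """Map the final LSTM hidden state to (mu, var) and draw one sample."""
    mu = W_mu @ h_final + b_mu
    # Assumption: treat the second head's output as log-variance so var > 0.
    var = np.exp(W_var @ h_final + b_var)
    z2 = rng.normal(loc=mu, scale=np.sqrt(var))
    return z2, mu, var

h_final = rng.normal(size=HIDDEN)  # placeholder for LSTM layer 83's final state
z2, mu2, var2 = predict_z2(h_final)
```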

To predict the sequence-level latent variable Z₃ in the encoding unit 61, the spectrogram data 35 is input to the LSTM layer 85. The state value of the final hidden layer of the LSTM layer 85 is passed to the linear layer 86a and the linear layer 86b. In the linear layer 86a and the linear layer 86b, the mean (μ₃) and the variance (σ₃²) of the distribution for predicting the latent variable Z₃ are respectively obtained from the state value of the final hidden layer. The layers of the speech conversion model 53 may be updated during training so that the distribution for predicting the latent variable Z₃ resembles or follows a Gaussian normal distribution. The latent variable sampling unit 87 may predict the latent variable Z₃ from the Gaussian normal distribution defined by the mean (μ₃) and the variance (σ₃²). In some embodiments, the state value of the final hidden layer of the LSTM layer 85 is a 256-dimensional vector, and the mean (μ₃) and the variance (σ₃²) may be 32-dimensional vectors, that is, vectors with the same dimensionality as the latent variable Z₃.

In some embodiments of the present invention, the configuration for predicting the segment-level latent variable Z₁ differs somewhat from the configuration for predicting the sequence-level latent variables Z₂ and Z₃. As shown in FIG. 8, to predict the segment-level latent variable Z₁, the sequence-level latent variables Z₂ and Z₃ sampled by the latent variable sampling unit 87 are input to the LSTM layer 81 together with the spectrogram data 35.

In some embodiments, the spectrogram data 35 is expressed as an 80-dimensional vector per frame 35a, and the sequence-level latent variables Z₂ and Z₃ are each expressed as a 32-dimensional vector. The LSTM layer 81 receives the concatenation of the spectrogram data 35 and the latent variables Z₂ and Z₃, that is, data expressed as a vector with a total of 144 dimensions (80 + 32 + 32) per frame.
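The 144-dimensional input can be checked with a short NumPy sketch. The frame count and the zero-valued arrays are placeholders, and repeating the sequence-level latents for every frame is an assumption about how one vector per sequence is combined with per-frame data.

```python
import numpy as np

num_frames = 20
spectrogram = np.zeros((num_frames, 80))  # placeholder mel frames
z2 = np.zeros(32)                         # sequence-level latent (one per sequence)
z3 = np.zeros(32)

# Assumption: the sequence-level latents are repeated for every frame.
seq_latents = np.tile(np.concatenate([z2, z3]), (num_frames, 1))
lstm81_input = np.concatenate([spectrogram, seq_latents], axis=1)
```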

The state value of the final hidden layer of the LSTM layer 81 is passed to the linear layer 82a and the linear layer 82b. In the linear layer 82a and the linear layer 82b, the mean (μ₁) and the variance (σ₁²) of the distribution for predicting the segment-level latent variable Z₁ are respectively obtained from the state value of the final hidden layer. The latent variable sampling unit 87 may predict the latent variable Z₁ from the Gaussian normal distribution defined by the mean (μ₁) and the variance (σ₁²). The state value of the final hidden layer of the LSTM layer 81 may be a 256-dimensional vector, and the mean (μ₁) and the variance (σ₁²) may be 32-dimensional vectors, that is, vectors with the same dimensionality as the latent variable Z₁.

In some embodiments, the latent variable sampling unit 87 samples the latent variable Z₁ from the normal distribution N(μ₁, σ₁²) using μ₁ and σ₁² obtained from the LSTM layer 81 and the linear layers 82a, 82b; samples the latent variable Z₂ from the normal distribution N(μ₂, σ₂²) using μ₂ and σ₂² obtained from the LSTM layer 83 and the linear layers 84a, 84b; and samples the latent variable Z₃ from the normal distribution N(μ₃, σ₃²) using μ₃ and σ₃² obtained from the LSTM layer 85 and the linear layers 86a, 86b. This can be expressed as Equation 3 below.

$$Z_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \quad i = 1, 2, 3$$

In some embodiments, to facilitate end-to-end training of the neural networks constituting the speech conversion model 53, a reparameterization technique may be applied to the way the latent variable sampling unit 87 samples the latent variables, as in Equation 4 below.
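A common form of the reparameterization technique writes the sample as Z = μ + σ ⊙ ε with ε drawn from a standard normal distribution, so that the randomness is separated from the parameters μ and σ² that training needs to update. The sketch below assumes that standard form; the function name and the numeric values are illustrative only.

```python
import numpy as np

def sample_reparameterized(mu, var, rng):
    """Draw z ~ N(mu, var) as a deterministic function of (mu, var) and noise."""
    eps = rng.standard_normal(mu.shape)  # noise independent of model parameters
    return mu + np.sqrt(var) * eps

rng = np.random.default_rng(0)
mu = np.full(32, 2.0)
var = np.full(32, 0.25)
samples = np.stack([sample_reparameterized(mu, var, rng) for _ in range(20000)])
```

Averaged over many draws, the samples recover the requested mean and variance, which confirms that the rewritten sampler describes the same distribution as sampling from N(μ, σ²) directly.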

So far, the neural network layer structure of the encoding unit 61 and the operation by which the encoding unit 61 predicts the latent variables (Z₁, Z₂, Z₃) have been described with reference to FIG. 8. Hereinafter, the decoding unit 69 will be described with reference to FIG. 9.

As described above, the decoding unit 69 inputs the latent variables to artificial neural network layers and provides the spectrogram data 41 for generating an output audio sequence.

To this end, the decoding unit 69 may include neural network layers 93, 95a, 95b for determining the distribution of the output spectrogram data 41, and a spectrogram sampling unit 97.

Similar to the neural network layers 81 to 86 described above, the neural network layers 93, 95a, 95b may be composed of layers of, for example, a deep neural network, a linear neural network, a recurrent neural network, an LSTM, or a convolutional neural network model. In some embodiments, the neural network layers 93, 95a, 95b may be composed of an LSTM layer 93 and linear layers 95a, 95b connected to it, but the present invention is not limited to such embodiments.

The decoding unit 69 receives the segment-level latent variable (Z₁) and the sequence-level latent variable (Z₂) from the encoding unit 61, and receives the adjusted sequence-level latent variable (Z₃) from the latent variable conversion unit 67. In some embodiments, the latent variables (Z₁, Z₂, Z₃) are each 32-dimensional vectors.

The LSTM layer 93 may receive as input data 91, which is the concatenation of the latent variables (Z₁, Z₂, Z₃) and is expressed as a vector with a total of 96 dimensions per frame.

The output sequence of the LSTM layer 93 is passed to the linear layers 95a and 95b. The output sequence may be data having 256 dimensions per frame.

In the linear layer 95a and the linear layer 95b, the mean (μ_x) and the variance (σ_x²) of the distribution for predicting the output spectrogram data 41 are respectively obtained from the output sequence. In some embodiments in which the spectrogram data 41 is expressed as an 80-dimensional vector per frame 41a, the mean (μ_x) and the variance (σ_x²) are also expressed as 80-dimensional vectors.

The spectrogram sampling unit 97 may predict the output spectrogram data 41 from the prediction distribution defined by the mean (μ_x) and the variance (σ_x²).
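The dimensions that flow through the decoding unit 69 can be traced with the sketch below. It is a shape walkthrough under several assumptions, not a working decoder: the LSTM layer 93 is replaced by a fixed random projection with a `tanh` (explicitly not an LSTM), all weights are random stand-ins, one Z₁ vector per frame is assumed, and the variance head is assumed to output a log-variance.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                                   # number of frames (arbitrary)
z1 = rng.normal(size=(T, 32))            # segment-level latent, one per frame
z2 = rng.normal(size=32)                 # sequence-level latent (voice)
z3_adj = rng.normal(size=32)             # adjusted sequence-level latent (emotion)

# Data 91: per-frame concatenation, 32 + 32 + 32 = 96 dimensions.
data91 = np.concatenate([z1, np.tile(z2, (T, 1)), np.tile(z3_adj, (T, 1))], axis=1)

# Stand-in for LSTM layer 93 (NOT an LSTM): any map from 96 to 256 per frame.
W_h = rng.normal(scale=0.1, size=(96, 256))
hidden = np.tanh(data91 @ W_h)

# Linear layers 95a / 95b: mean and variance of each 80-dimensional output frame.
W_mu = rng.normal(scale=0.1, size=(256, 80))
W_lv = rng.normal(scale=0.1, size=(256, 80))
mu_x = hidden @ W_mu
var_x = np.exp(hidden @ W_lv)            # assumption: head outputs log-variance

# Spectrogram sampling unit 97: draw each frame from N(mu_x, var_x).
spectrogram_out = rng.normal(loc=mu_x, scale=np.sqrt(var_x))
```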

The output spectrogram data 41 sampled by the spectrogram sampling unit 97 may be provided to the speech synthesis unit 25 and converted into digital audio data, as described above.

So far, the encoding unit 61 and the decoding unit 69 of the speech conversion model 53 have been described with reference to FIGS. 8 and 9. Hereinafter, a method of building a speech conversion model and converting speech with it according to another embodiment of the present invention will be described with reference to FIGS. 10 and 11.

FIG. 10 is an exemplary flowchart illustrating a series of processes for building a speech conversion model and converting speech with it, according to an embodiment of the present invention. However, this is only a preferred embodiment for achieving the object of the present invention, and some steps may of course be added or deleted as needed.

Each step of the speech conversion method shown in FIG. 10 may be performed by a computing device such as the audio emotion conversion apparatus 10. In other words, each step of the speech conversion method may be implemented as one or more instructions executed by a processor of a computing device. All steps of the speech conversion method may be executed by one physical computing device, or first steps of the method may be performed by a first computing device and second steps of the method by a second computing device. For example, the learning process (steps S100 and S200) and the conversion process (steps S300 and S400) in FIG. 10 may be performed by different computing devices. In the following, the description assumes that each step of the speech conversion method is performed by the audio emotion conversion apparatus 10. For convenience of description, however, the subject performing each step of the speech conversion method may be omitted.

As shown in FIG. 10, the speech conversion method according to this embodiment may include step S100 of obtaining a training dataset for the speech conversion model to learn from, step S200 of building the speech conversion model using the training dataset, step S300 of obtaining an audio sequence containing the speech to be converted, and step S400 of converting the speech using the speech conversion model.

First, in step S100, a training dataset for training the speech conversion model is obtained.

The training dataset may consist of training audio sequence samples and emotion label data corresponding to each audio sequence sample. In some embodiments, to build a speech conversion model that converts among, for example, the five emotions of happiness, joy, sadness, anger, and fear, a sufficient number of audio sequence samples in which professional voice actors or the like perform each of the five emotions may be obtained, together with emotion label data corresponding to each sample. In addition, existing audio sequence samples, such as those from TV and radio programs, together with emotion label data assigned to them by people, may be used as the training dataset.

In step S200, the speech conversion model may be built using the training dataset. Step S200 may be performed, for example, by the learning unit 51 described with reference to FIG. 5. In some embodiments, the speech conversion model may be the speech conversion model 53 described with reference to FIGS. 6 to 9. In step S200, the speech conversion model 53 may be trained by supervised learning, using the audio sequence samples included in the training dataset and the emotion label data corresponding to each sample.

In the learning process of the speech conversion model 53, the model may be trained by updating the parameters of its neural network layers so that the audio sequence output from the speech conversion model 53 is as close as possible to the audio sequence sample that was input. In other words, the speech conversion model 53 may be trained by updating its parameters so that, when a first input audio sequence is input to the model, the probability of outputting an audio sequence identical to the first input audio sequence increases.
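One way to make "increase the probability of outputting the input" concrete is a Gaussian negative log-likelihood over the predicted output distribution. The sketch below is an assumption about the form of that objective, chosen because the decoder described above predicts a mean and a variance per frame; the function name and the toy values are hypothetical.

```python
import numpy as np

def gaussian_nll(x, mu, var):
    """Negative log-likelihood of x under a diagonal Gaussian N(mu, var)."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

x = np.array([[0.2, -0.1], [0.4, 0.3]])   # toy "input spectrogram"
var = np.full_like(x, 0.5)
loss_exact = gaussian_nll(x, mu=x, var=var)        # perfect reconstruction
loss_off = gaussian_nll(x, mu=x + 1.0, var=var)    # mean off by 1 everywhere
```

Lowering this quantity is equivalent to raising the likelihood of the input under the predicted distribution, so a predicted mean that matches the input gives the smaller loss.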

In addition, the learning of the speech conversion model 53 may include updating the parameters of the model so that the prediction distributions of the latent variables (Z₁, Z₂, Z₃) predicted by the encoding unit 61 approach a Gaussian normal distribution. In some embodiments, this may be achieved by including, when the parameters of the speech conversion model 53 are updated, a term that minimizes the Kullback-Leibler divergence (KL divergence) between the prediction distributions of the latent variables (Z₁, Z₂, Z₃) and a normal distribution.
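For a diagonal Gaussian measured against a standard normal distribution, the KL divergence has a closed form. The sketch below assumes the reference distribution is the standard normal N(0, I), which the text does not state explicitly; the function name is illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

kl_zero = kl_to_standard_normal(np.zeros(32), np.ones(32))
kl_shifted = kl_to_standard_normal(np.full(32, 1.0), np.ones(32))
```

The term is zero when the predicted distribution already equals the standard normal and grows as the mean or variance drifts away from it.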

In addition, in some embodiments, the learning of the speech conversion model 53 may include updating the parameters of the model so that the latent variables (Z₃) that the encoding unit 61 predicts for audio sequences labelled with different emotions move farther apart from one another. In other words, when the emotion embedding matrix 63 is built during the learning process, the parameters of the speech conversion model 53 are adjusted so that embedding vectors corresponding to different emotions lie far apart from each other in the vector space. For details, reference may be made to the above description of the construction of the emotion embedding matrix 63.

To understand the learning process of the speech conversion model 53 according to the embodiments described above, reference may generally be made to the learning process of a variational autoencoder, but the present invention is not limited to such embodiments.

Referring again to FIG. 10, the process of converting speech using the speech conversion model 53 after its training is complete will now be described.

In step S300, the audio sequence to be converted is obtained. The audio sequence may be speech expressing a first emotion, and may be converted into speech expressing a second emotion through step S400.

In step S400, the audio sequence obtained in step S300 is converted using, for example, the speech conversion model 53 of the audio emotion conversion apparatus 10. Hereinafter, step S400 will be described in more detail with reference to FIG. 11.

FIG. 11 is an exemplary flowchart illustrating a method by which, for example, the audio emotion conversion apparatus 10 converts speech according to an embodiment of the present invention.

First, in step S410, the input audio sequence is preprocessed. The preprocessing may be performed, for example, by the speech preprocessing unit 21 of the audio emotion conversion apparatus 10. The input audio sequence may contain a speaker's voice. In some embodiments, the preprocessing may include applying a short-time Fourier transform (STFT) to the input audio sequence and converting it into data such as a mel-scale spectrogram. In some embodiments, the preprocessing may include segmenting the mel-scale spectrogram data into frames (or segments) of a predetermined length.
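The preprocessing chain can be sketched with NumPy alone. The sketch is illustrative and rests on assumptions: the sample rate, FFT size, hop length, segment length, Hann window, log compression, and the particular mel formula are all choices made here, and only the 80 mel bins per frame follow the description. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters with centres equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrogram(audio, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Short-time Fourier transform followed by a mel projection."""
    window = np.hanning(n_fft)
    starts = range(0, len(audio) - n_fft + 1, hop)
    frames = np.stack([audio[s:s + n_fft] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-6)

def split_segments(mel, seg_len=20):
    """Cut the frame sequence into fixed-length segments, dropping the remainder."""
    n = mel.shape[0] // seg_len
    return mel[: n * seg_len].reshape(n, seg_len, mel.shape[1])

sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 220.0 * t)        # one second of a 220 Hz tone
mel = mel_spectrogram(audio, sr=sr)
segments = split_segments(mel)
```

With these settings one second of audio yields 59 frames of 80 mel bins, which are then grouped into two segments of 20 frames each.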

In step S420, the preprocessed spectrogram data is encoded to obtain a first latent variable representing sequence-level features of the audio sequence. The encoding of the spectrogram data may be performed, for example, by the encoding unit 61 of the speech conversion model 53 in the audio emotion conversion apparatus 10. In some embodiments, the first latent variable representing sequence-level features of the audio sequence may include a third latent variable, which is a vector representing features indicative of the speaker's emotion expressed in the input audio sequence, and a fourth latent variable, which is a vector indicative of the speaker's voice features.

For the process of obtaining the first latent variable in step S420, reference may further be made to the above description of the neural network layers of the encoding unit 61 for obtaining the sequence-level latent variables (Z₂, Z₃).

In step S430, the preprocessed spectrogram data is encoded to obtain a second latent variable representing segment-level features of the audio sequence. Like the encoding for the first latent variable, the encoding for the second latent variable may be performed by the encoding unit 61. In some embodiments, the first latent variable may be used together with the preprocessed spectrogram data in the encoding for the second latent variable. In some embodiments, the second latent variable representing segment-level features of the audio sequence may be a vector representing linguistic features or linguistic content, such as the text or phonemes appearing in each segment or frame of the audio sequence.

For the process of obtaining the second latent variable in step S430, reference may further be made to the above description of the neural network layers of the encoding unit 61 for obtaining the segment-level latent variable (Z₁).

In step S440, the first latent variable is adjusted. In some embodiments, step S440 may adjust only the third latent variable, out of the third and fourth latent variables included in the first latent variable. In other words, in some embodiments, step S440 may adjust only the speaker's emotion expressed in the input audio sequence while leaving the speaker's voice features, such as timbre, unchanged.

In some embodiments, the third latent variable, which represents features indicating the speaker's emotion, may be adjusted by applying to it an emotion conversion function that shifts a vector by the difference between a second embedding vector representing the target (destination) emotion and a first embedding vector representing the original (source) emotion. In some embodiments, as described above in connection with the latent variable conversion unit 67, vectors obtained by orthogonalizing the first embedding vector and the second embedding vector with respect to each other may be used in the emotion conversion function in place of the first and second embedding vectors themselves.

For step S440, reference may also be made to the earlier description of the latent variable conversion unit 67.
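The adjustment in step S440 amounts to simple vector arithmetic, sketched below. The shift by the difference between the target and source embedding vectors follows the description above. The orthogonalization procedure is not spelled out in this passage, so the Gram-Schmidt step used here is an assumption, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(e_src, e_tgt):
    """Assumed Gram-Schmidt step: keep e_src and remove from e_tgt
    its component along e_src, so the two vectors become orthogonal."""
    scale = dot(e_tgt, e_src) / dot(e_src, e_src)
    e_tgt_orth = [t - scale * s for t, s in zip(e_tgt, e_src)]
    return e_src, e_tgt_orth


def convert_emotion(z_emotion, e_src, e_tgt, orthogonal=False):
    """Shift the emotion latent by (target embedding - source embedding)."""
    if orthogonal:
        e_src, e_tgt = orthogonalize(e_src, e_tgt)
    return [z + (t - s) for z, t, s in zip(z_emotion, e_tgt, e_src)]


# Only the emotion part (third latent variable) is shifted; the part
# describing other voice characteristics (fourth) is passed through as-is.
z3 = convert_emotion([1.0, 2.0], e_src=[1.0, 0.0], e_tgt=[1.0, 1.0])
```

Because the function touches only the emotion latent, the speaker's other voice characteristics remain unchanged by construction.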

In step S450, spectrogram data representing an output audio sequence is obtained using the first latent variable adjusted in step S440 and the second latent variable obtained in step S430. Step S450 may be performed, for example, by the decoding unit 69 of the speech conversion model 53 of the audio emotion conversion apparatus 10. In some embodiments, a distribution for predicting the output spectrogram data is obtained from the concatenation of the adjusted first latent variable and the second latent variable, and the output spectrogram data may be sampled from that distribution.

For step S450, reference may also be made to the earlier description of the decoding unit 69.
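The decoding in step S450 can be sketched as follows, under stated assumptions: the predicted distribution is taken to be a Gaussian with a fixed standard deviation, and a single linear layer stands in for the decoding unit. The weights, bias, and function name are hypothetical; only the concatenate, predict, and sample flow comes from the description.

```python
import random


def decode_frame(z1_adjusted, z2_segment, weights, bias, sigma, rng):
    """Predict and sample one output spectrogram frame.

    z1_adjusted: adjusted sequence-level (first) latent variable
    z2_segment:  segment-level (second) latent variable for this frame
    weights/bias: hypothetical linear layer standing in for the decoder
    sigma:       assumed fixed standard deviation of the output distribution
    """
    # Concatenate the two latent variables into one decoder input.
    h = list(z1_adjusted) + list(z2_segment)
    # Mean of the predicted distribution, one value per output bin.
    mean = [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]
    # Sample the output frame from the predicted distribution.
    sample = [rng.gauss(m, sigma) for m in mean]
    return mean, sample
```

Calling this once per segment, with the same adjusted first latent variable each time, yields a sequence of frames that can then be passed to a vocoder as in step S460.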

In step S460, the output audio sequence is synthesized from the spectrogram data representing it. Step S460 may be performed, for example, by the speech synthesis unit 25 of the audio emotion conversion apparatus 10, and reference may be made to the earlier description of the speech synthesis unit 25. In step S460, the spectrogram data may be input to, for example, a neural network-based vocoder module and synthesized into an audio sequence in the form of digital audio data.

The audio emotion conversion method and apparatus according to some embodiments of the present invention, and their fields of application, have been described above with reference to FIGS. 1 to 11. An exemplary computing device 1500 capable of implementing the audio emotion conversion apparatus 10 according to some embodiments of the present invention is described below.

FIG. 12 is a hardware configuration diagram illustrating an exemplary computing device 1500 capable of implementing the audio emotion conversion apparatus 10 according to some embodiments of the present invention.

As shown in FIG. 12, the computing device 1500 may include one or more processors 1510, a bus 1550, a communication interface 1570, a memory 1530 into which a computer program 1591 executed by the processor 1510 is loaded, and a storage 1590 that stores the computer program 1591. FIG. 12 shows only the components relevant to the embodiments of the present invention. A person of ordinary skill in the art to which the present invention pertains will therefore understand that other general-purpose components may be included in addition to those shown in FIG. 12.

The processor 1510 controls the overall operation of each component of the computing device 1500. The processor 1510 may include a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphic Processing Unit), or any other type of processor well known in the technical field of the present invention. The processor 1510 may also perform operations for at least one application or program for executing the methods according to the embodiments of the present invention. The computing device 1500 may include one or more processors.

The memory 1530 stores various data, instructions, and/or information. The memory 1530 may load one or more programs 1591 from the storage 1590 in order to execute the methods according to the embodiments of the present invention. The memory 1530 may be implemented as a volatile memory such as RAM, but the technical scope of the present invention is not limited thereto.

The bus 1550 provides communication between the components of the computing device 1500. The bus 1550 may be implemented as any of various types of bus, such as an address bus, a data bus, and a control bus.

The communication interface 1570 supports wired and wireless Internet communication of the computing device 1500. The communication interface 1570 may also support various communication methods other than Internet communication. To this end, the communication interface 1570 may include a communication module well known in the technical field of the present invention.

According to some embodiments, the communication interface 1570 may be omitted.

The storage 1590 may store the one or more programs 1591 and various data in a non-transitory manner. For example, when the audio emotion conversion apparatus 10 is implemented through the computing device 1500, the various data may include data managed by the storage unit 400.

The storage 1590 may include a non-volatile memory such as a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), or a flash memory, a hard disk, a removable disk, or any other type of computer-readable recording medium well known in the technical field to which the present invention pertains.

The computer program 1591 may include one or more instructions that, when loaded into the memory 1530, cause the processor 1510 to perform the methods and operations according to various embodiments of the present invention. That is, the processor 1510 may perform the methods and operations according to various embodiments of the present invention by executing the one or more instructions.

In this case, the audio emotion conversion apparatus 10 according to some embodiments of the present invention may be implemented through the computing device 1500.

Various embodiments of the present invention and the effects of those embodiments have been described above with reference to FIGS. 1 to 12. The effects of the technical idea of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art from the following description.

The technical idea of the present invention described above with reference to FIGS. 1 to 12 may be implemented as computer-readable code on a computer-readable medium. The computer-readable recording medium may be, for example, a removable recording medium (a CD, DVD, Blu-ray disc, USB storage device, or removable hard disk) or a fixed recording medium (a ROM, RAM, or built-in hard disk). The computer program recorded on the computer-readable recording medium may be transmitted to another computing device over a network such as the Internet and installed on that computing device, where it can then be used.

Although all the components of the embodiments of the present invention have been described above as being combined into one or as operating in combination, the technical idea of the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, the components may operate by being selectively combined into one or more units.

Although operations are shown in a specific order in the drawings, this should not be understood to mean that the operations must be performed in the specific order shown or in sequential order, or that all of the illustrated operations must be performed, to obtain the desired result. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of the various components in the embodiments described above should not be understood as being required, and it should be understood that the described program components and systems may generally be integrated into a single software product or packaged into multiple software products.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within the scope equivalent thereto should be construed as falling within the scope of rights of the technical idea defined by the present invention.

Claims

A method for a computing device to convert features of an input audio sequence using a neural network-based audio conversion model, the method comprising:
inputting input audio data representing the input audio sequence into the audio conversion model;
obtaining, by the audio conversion model, a first latent variable representing a sequence-level feature of the input audio sequence;
obtaining, by the audio conversion model, a second latent variable representing features of each of a plurality of segments constituting the input audio sequence;
adjusting the first latent variable; and
obtaining output audio data representing an output audio sequence by using the adjusted first latent variable and the second latent variable,
wherein the first latent variable includes a third latent variable that at least partially represents a speaker's emotion expressed in the input audio sequence, and
wherein the adjusting of the first latent variable includes applying, to the third latent variable, an emotion conversion function that converts an embedding vector corresponding to a first emotion into an embedding vector corresponding to a second emotion.

The method of claim 1,
wherein the first latent variable further includes a fourth latent variable that at least partially represents acoustic features, other than the speaker's emotion, expressed in the input audio sequence,
wherein the second latent variable at least partially represents linguistic content included in the input audio sequence, and
wherein the adjusting of the first latent variable includes adjusting the third latent variable while maintaining the fourth latent variable.

(deleted)

The method of claim 1,
wherein the applying of the emotion conversion function to the third latent variable further includes orthogonalizing the embedding vector corresponding to the first emotion and the embedding vector corresponding to the second emotion.

The method of claim 1,
wherein the output audio sequence is audio in which the speaker's emotion expressed in the input audio sequence is converted while the linguistic content of the input audio sequence and the characteristics of the speaker's voice in the input audio sequence are maintained.

The method of claim 2,
wherein the obtaining of the first latent variable includes:
obtaining, by an encoding unit of the audio conversion model, a prediction distribution of the third latent variable using the input audio data; and
sampling the third latent variable from the prediction distribution of the third latent variable.

The method of claim 7,
wherein the prediction distribution of the third latent variable has been trained to follow a normal distribution.

The method of claim 7,
wherein the obtaining of the first latent variable further includes:
obtaining, by the encoding unit, a prediction distribution of the fourth latent variable using the input audio data; and
sampling the fourth latent variable from the prediction distribution of the fourth latent variable.

The method of claim 1,
wherein the obtaining of the second latent variable includes obtaining the second latent variable using the input audio data and the first latent variable.

The method of claim 1,
wherein the obtaining of the output audio data using the adjusted first latent variable and the second latent variable includes:
obtaining, by a decoding unit of the audio conversion model, a distribution for predicting the output audio data by using a value obtained by concatenating the adjusted first latent variable and the second latent variable; and
sampling the output audio data from the distribution for predicting the output audio data.

The method of claim 11, further comprising:
inputting the output audio data to a vocoder to synthesize the output audio sequence.

An audio conversion apparatus, comprising:
an audio conversion model including an encoding unit, a latent variable conversion unit, and a decoding unit; and
an audio synthesis unit for synthesizing an output audio sequence from output audio data,
wherein the encoding unit:
encodes a first latent variable representing a sequence-level feature of an input audio sequence from input audio data obtained by preprocessing the input audio sequence, and provides the first latent variable to the latent variable conversion unit; and
encodes a second latent variable representing features of each of a plurality of segments constituting the input audio sequence from the input audio data and the first latent variable, and provides the second latent variable to the decoding unit,
wherein the latent variable conversion unit adjusts the first latent variable and provides the adjusted first latent variable to the decoding unit,
wherein the decoding unit provides the output audio data using the adjusted first latent variable and the second latent variable,
wherein the first latent variable includes a third latent variable that at least partially represents a speaker's emotion expressed in the input audio sequence, and
wherein the latent variable conversion unit adjusts the first latent variable by applying, to the third latent variable, an emotion conversion function that converts an embedding vector corresponding to a first emotion into an embedding vector corresponding to a second emotion.

The apparatus of claim 13,
wherein the encoding unit:
encodes a mean of the first latent variable from the input audio data, and samples the first latent variable from a prediction distribution determined based on the mean of the first latent variable; and
encodes a mean of the second latent variable from the input audio data and the first latent variable, and samples the second latent variable from a prediction distribution determined based on the mean of the second latent variable.

The apparatus of claim 13,
wherein the first latent variable further includes a fourth latent variable that at least partially represents acoustic features, other than the speaker's emotion, expressed in the input audio sequence,
wherein the second latent variable at least partially represents linguistic content included in the input audio sequence, and
wherein the latent variable conversion unit adjusts the third latent variable while maintaining the fourth latent variable as it is.

(deleted)

The apparatus of claim 13,
wherein the decoding unit obtains a prediction distribution of the output audio data by using a value obtained by concatenating the adjusted first latent variable and the second latent variable, and samples the output audio data from the prediction distribution of the output audio data.

A method for a computing device to train a neural network-based audio conversion model, the method comprising:
obtaining a training data set comprising a plurality of audio sequences and labels indicating a speaker's emotion appearing in each of the plurality of audio sequences; and
updating parameters of the audio conversion model using the training data set,
wherein the audio conversion model:
encodes an input audio sequence to provide a first latent variable representing a speaker's emotion expressed in the input audio sequence and a second latent variable representing features of each of a plurality of segments constituting the input audio sequence;
adjusts the first latent variable by applying, to the first latent variable, an emotion conversion function that converts an embedding vector corresponding to a first emotion into an embedding vector corresponding to a second emotion; and
decodes the adjusted first latent variable and the second latent variable to provide an output audio sequence in which the speaker's emotion is converted,
wherein the updating of the parameters of the audio conversion model includes updating the parameters so that, when a first input audio sequence is input to the audio conversion model, the probability that an audio sequence identical to the first input audio sequence is output increases, and
wherein the updating of the parameters of the audio conversion model includes updating the parameters so that the distributions of the first latent variable and the second latent variable approach a normal distribution.

The method of claim 19,
wherein the updating of the parameters of the audio conversion model further includes updating the parameters so that first latent variables obtained by encoding a plurality of audio sequences having labels indicating different emotions are far apart from one another.