KR20200063331A

KR20200063331A - Multiple speaker voice conversion using conditional cycle GAN

Info

Publication number: KR20200063331A
Application number: KR1020180144585A
Authority: KR
Inventors: 육동석; 유인철; 이효원
Original assignee: 고려대학교 산학협력단
Priority date: 2018-11-21
Filing date: 2018-11-21
Publication date: 2020-06-05

Abstract

The present invention relates to a method for practically performing training and voice conversion with non-parallel data from multiple speakers using a conditional cycleGAN. According to one embodiment of the present invention, disclosed is a multi-speaker voice conversion method using a conditional cycleGAN, in which a single conditional cycleGAN model is used instead of building separate voice conversion models between multiple pairs of speakers.

Description

Multiple speaker voice conversion using conditional cycle GAN

The present invention relates to a method for single-model voice conversion using non-parallel data and, more specifically, to a method for practically performing training and voice conversion with non-parallel data from multiple speakers using a conditional cycleGAN.

The data synthesis technology behind the conditional cycleGAN is as follows.

a) A GAN is a representative artificial neural network for data synthesis.

b) GANs are used to generate data such as images or speech.

c) A GAN consists of a generator network that produces data and a discriminator network that distinguishes synthesized data from real data.

d) A cycleGAN has a cyclic structure that synthesizes data and then synthesizes the original data back from the result, so it can convert to the target data while preserving the characteristics of the source data. Example: an image converter that turns a photo of a horse into a photo of a zebra.

Voice conversion with non-parallel data using a cycleGAN is as follows.

e) Voice conversion is a technique that converts only the speaker information while leaving the linguistic content of the speech intact.

f) Parallel data is data in which two or more speakers read the same script.

g) Obtaining parallel data is expensive.

h) Existing voice conversion models rely on parallel data, which limits their practical use.

i) With a cycleGAN, voice conversion is possible using non-parallel data.

In the environment of commercial products such as voice-recognition speakers, parallel data for voice conversion is difficult to obtain.

An existing cycleGAN-based voice conversion model is trained so that one model converts speech between one pair of speakers. Voice conversion among multiple speakers therefore requires training a large number of models.

To achieve the above objective, an embodiment of the present invention discloses a multi-speaker voice conversion method using a conditional cycleGAN, characterized in that only a single model is used, by means of the conditional cycleGAN, instead of building voice conversion models between multiple pairs of speakers.

According to an embodiment of the present invention, a user's voice data obtained in real-world conditions can be converted into the voice of a target speaker.

According to an embodiment of the present invention, a single model can convert speech among multiple speakers.

FIG. 1 shows the case of one-to-one conversion between two speakers in the multi-speaker voice conversion method using a conditional cycleGAN according to an embodiment of the present invention.

Hereinafter, a multi-speaker voice conversion method using a conditional cycleGAN according to an embodiment of the present invention is described with reference to the drawing.

As used herein, singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "consists of" or "comprises" should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be omitted, or additional components or steps may be included.

FIG. 1 shows the structure of the conditional cycleGAN. Y is the speaker's identification information, and X is the voice information. G is the generator network, and the figure also shows the discriminator network; the paired networks in the figure are a single model that uses the same parameters. The cost function in the figure computes the difference between the input data and the reconstructed output data.

Traditional voice conversion methods based on a cycleGAN convert speech from a single source speaker to a single target speaker.

The generator network of the cycleGAN converts the voice data of one speaker into the voice data of another speaker.

The converted data is then converted back into the first speaker's data, which gives the model its cyclic structure.

For this reason, a cycleGAN can convert data into the target data while preserving the characteristics of the original data.
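
The cyclic structure can be illustrated with a small sketch of a reconstruction cost. This is only an illustration under stated assumptions: the source does not give the exact form of the cost function, so mean absolute error is assumed here, and `to_target`, `to_source_exact`, and `to_source_lossy` are hypothetical stand-ins for the two conversion directions rather than trained networks.

```python
from typing import Callable, List

Features = List[float]


def cycle_consistency_loss(
    x: Features,
    forward: Callable[[Features], Features],
    backward: Callable[[Features], Features],
) -> float:
    """Mean absolute difference between x and backward(forward(x)).

    Assumption: mean absolute error; the source only says the cost measures
    the difference between the input and the reconstructed output.
    """
    reconstructed = backward(forward(x))
    return sum(abs(a - b) for a, b in zip(x, reconstructed)) / len(x)


# Hypothetical toy conversions standing in for the generator in each direction.
def to_target(x: Features) -> Features:
    return [2.0 * v + 1.0 for v in x]


def to_source_exact(y: Features) -> Features:
    # Exact inverse of to_target: the cycle reproduces the input.
    return [(v - 1.0) / 2.0 for v in y]


def to_source_lossy(y: Features) -> Features:
    # Imperfect inverse: the cycle does not reproduce the input.
    return [v / 2.0 for v in y]


x = [0.5, -1.0, 2.0]
print(cycle_consistency_loss(x, to_target, to_source_exact))  # 0.0
print(cycle_consistency_loss(x, to_target, to_source_lossy))  # 0.5
```

A low value of this cost means the round trip kept the content of the input, which is the property that lets training proceed without parallel data.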

With this structure, the generator network can only convert one particular speaker's data into one other particular speaker's data. The present invention extends the existing cycleGAN-based voice conversion so that a single GAN model converts the voices of multiple source speakers into the voices of multiple target speakers.

To implement this, speaker identity information is supplied as a conditioning input to the proposed conditional cycleGAN, both for training and for conversion.

In each layer of the conditional cycleGAN, the speaker identification vector is added to the layer's output vector, and the result is used as the input to the next layer.
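
A minimal sketch of this conditioning step follows. Two details are assumptions that the source does not state: the speaker identification vector is taken to be a one-hot vector, and "added to the output vector" is read as concatenation. The names `one_hot` and `condition` are hypothetical.

```python
from typing import List


def one_hot(speaker_index: int, num_speakers: int) -> List[float]:
    """Speaker identification vector (assumed one-hot encoding)."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError("speaker_index out of range")
    return [1.0 if i == speaker_index else 0.0 for i in range(num_speakers)]


def condition(layer_output: List[float], speaker_vec: List[float]) -> List[float]:
    """Append the speaker vector to a layer's output (assumed concatenation).

    The returned list is what the next layer would receive as input.
    """
    return layer_output + speaker_vec


hidden = [0.2, -0.7, 1.3]                    # toy output of some layer
next_input = condition(hidden, one_hot(1, 4))
print(next_input)  # [0.2, -0.7, 1.3, 0.0, 1.0, 0.0, 0.0]
```

Repeating this step at every layer keeps the speaker identity visible throughout the network instead of only at its input.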

A generator network generally consists of down-sampling layers and up-sampling layers.

For example, the identification vector of one speaker is fed into the down-sampling layers of the generator network, and the identification vector of the other speaker is fed into its up-sampling layers. A speaker vector is also fed to the discriminator network.
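
The routing of the two speaker vectors through the generator can be sketched with a toy model. Everything numeric here is hypothetical: `SPEAKER_OFFSET` stands in for learned weights, the layers are simple averaging and repetition rather than real network layers, and feeding the input speaker's vector to the down-sampling path is an assumption about which speaker goes where.

```python
from typing import List

# Hypothetical per-speaker offsets standing in for learned layer weights.
SPEAKER_OFFSET = [0.0, 10.0, -5.0]


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def down_layer(h: List[float], speaker_vec: List[float]) -> List[float]:
    """Toy down-sampling layer: remove the conditioned speaker's offset,
    then halve the length by averaging adjacent pairs."""
    shift = dot(speaker_vec, SPEAKER_OFFSET)
    centered = [v - shift for v in h]
    return [(centered[i] + centered[i + 1]) / 2.0
            for i in range(0, len(centered) - 1, 2)]


def up_layer(h: List[float], speaker_vec: List[float]) -> List[float]:
    """Toy up-sampling layer: repeat each value twice, then apply the
    conditioned speaker's offset."""
    shift = dot(speaker_vec, SPEAKER_OFFSET)
    return [v + shift for v in h for _ in range(2)]


def generator(x: List[float], down_vec: List[float],
              up_vec: List[float]) -> List[float]:
    # One vector conditions the down-sampling path, the other the up-sampling path.
    return up_layer(down_layer(x, down_vec), up_vec)


speaker_1 = [0.0, 1.0, 0.0]
speaker_2 = [0.0, 0.0, 1.0]
x = [11.0, 13.0, 9.0, 11.0]      # toy features attributed to speaker_1
print(generator(x, speaker_1, speaker_2))  # [-3.0, -3.0, -5.0, -5.0]
```

Calling the same `generator` with the two vectors swapped gives the opposite conversion direction, so no second set of parameters is needed.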

FIG. 1 shows the case of one-to-one conversion between two speakers.

Because the speaker identification vector is given to the generator network as a condition, the same model can be used for other speakers simply by changing the contents of that vector.

Using this approach, a generalized generative model can be built that receives the source speech and the source speaker identification vector and produces the target speaker's speech together with the target speaker identification vector.

In practice, when there are many speakers, the different conversion functions between them can be built as a conditional cycleGAN consisting of a single model.

Because the paired networks all share the same parameters, there is effectively only one conversion model, and it can perform all of these different conversions.
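
To give a sense of scale, the sketch below counts conversions and models for a given number of speakers. The counts are plain combinatorics and are not figures quoted from the source; the function names are hypothetical, and one conventional cycleGAN is assumed to cover both directions of one speaker pair.

```python
def directed_conversions(num_speakers: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return num_speakers * (num_speakers - 1)


def pairwise_cyclegan_models(num_speakers: int) -> int:
    """Models needed if each unordered speaker pair gets its own cycleGAN
    (assumption: one such model handles both directions of its pair)."""
    return num_speakers * (num_speakers - 1) // 2


CONDITIONAL_MODELS = 1  # the single conditional cycleGAN model

for n in (2, 4, 10):
    print(n, directed_conversions(n), pairwise_cyclegan_models(n), CONDITIONAL_MODELS)
# 2 2 1 1
# 4 12 6 1
# 10 90 45 1
```

The number of pairwise models grows quadratically with the number of speakers, while the conditional approach stays at one model.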

The voice conversion method according to an embodiment of the present invention can be used broadly wherever voice conversion applications apply. According to an embodiment of the present invention, many-to-many voice conversion is possible using data from multiple speakers, so a service can be provided with only one small model trained on non-parallel data rather than on parallel data, which is comparatively hard to obtain in a real service environment.

For example, voices input to a voice-recognition speaker are collected as voice data in order to improve the speaker's recognition rate. Various voice-operated services such as AI assistants work the same way, which raises concerns about privacy and the leakage of biometric information (voice). The proposed method can therefore be operated as a de-identification technique in which the service provider uses voice-converted data.

As a voice conversion application, it can be operated as an entertainment service that freely converts a voice into the voices of various celebrities or into amusing voices.

As described above, according to an embodiment of the present invention, a user's voice data obtained in real-world conditions can be converted into the voice of a target speaker.

The multi-speaker voice conversion method using a conditional cycleGAN described above is not limited to the configurations and methods of the embodiments described; all or part of each embodiment may be selectively combined so that various modifications can be made.

Claims

A multi-speaker voice conversion method using a conditional cycleGAN, characterized in that only a single model is used, by means of the conditional cycleGAN, without building voice conversion models between multiple pairs of speakers.