KR20210098935A

KR20210098935A - System and Method for Sound Translation Based on Generative Adversarial Networks

Info

Publication number: KR20210098935A
Application number: KR1020210103286A
Authority: KR
Inventors: 정원진; 김창현; 조남규
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2019-08-19
Filing date: 2021-08-05
Publication date: 2021-08-11
Also published as: KR20210021719A; KR102289218B1; KR102350048B1

Abstract

Disclosed are a system and method for sound translation. An objective of the present embodiment is to provide a sound translation system and method that applies training based on a self-attention module and a characteristic loss function when translating between mutually unpaired sounds, making it easier to mimic a target timbre while preserving elements such as pitch, beat, silence, and vibration, and thereby increasing the realism of the translated sound. The system comprises: a first generator; a second generator; a first discriminator; a second discriminator; and a training unit.

Description

System and Method for Sound Translation Based on Generative Adversarial Networks

The present invention relates to a sound translation system and method that use generative adversarial networks to implement translation between mutually unpaired sounds.

The content described below merely provides background information related to the present invention and does not constitute prior art.

Sound translation is useful when one wants to hear a particular song in a particular singer's voice but this cannot realistically be arranged. However, in most cases the translation targets are mutually unpaired, and translation must be attempted with little information about lyrics or pitch. The practical difficulty of obtaining time-synchronized data from two different singers also makes sound translation hard.

As a conventional method, Patent Document 1 or Non-Patent Document 2, for example, proposes a learning model capable of translating style or shape between heterogeneous image domains. However, even though translation between heterogeneous images is possible, collecting paired images takes considerable time and cost. For sound translation, where timbre change is important, attempts have been rare because, as noted above, obtaining time-synchronized data is practically impossible.

As another conventional method, Non-Patent Document 3, for example, represents unpaired voices in the frequency domain (as if they were images) and then attempts translation between heterogeneous voices using the learning model exemplified in Non-Patent Document 2. The change in prosody between the voices before and after translation was confirmed through a subjective evaluation method, the MOS (Mean Opinion Score) test. For pitched sounds, rich harmonic structures are clearly observed in addition to prosody, so harmonic structure is an important element of the timbre that characterizes a sound. Therefore, a sound translation method is needed that incorporates blocks and a cost function capable of reflecting more of the harmonic structure, so that translation mimicking a target timbre becomes easier.

Patent Document 1: US Patent No. US 10275473 B2 (Method for learning cross-domain relations based on generative adversarial networks, registered 2019.04.30)

Non-Patent Document 1: Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
Non-Patent Document 2: Zhu, J., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017.
Non-Patent Document 3: Kaneko, T. and Kameoka, H. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. In EUSIPCO, pp. 2100-2104, 2018.

The main objective of the present disclosure is to provide a sound translation system and method that, when translating between mutually unpaired sounds, applies training based on a self-attention module and a characteristic loss function, so that translation mimicking a target timbre is made easier while elements such as pitch, beat, silence, and vibration are preserved, thereby increasing the realism of the translated sound.

According to an embodiment of the present invention, there is provided a learning apparatus of a sound translation system for translation between sound data, comprising: a first generator for translating a first sound of a first domain into a first translated sound; a second generator for translating a second sound of a second domain into a second translated sound; a first discriminator for discriminating between the first sound of the first domain and the second translated sound produced by the second generator; a second discriminator for discriminating between the second sound of the second domain and the first translated sound produced by the first generator; and a training unit that trains the first generator and the second discriminator adversarially against each other and trains the second generator and the first discriminator adversarially against each other, wherein the training unit performs at least one of: a translation, executed by the first generator, of the second sound of the second domain into a first identity-mapping sound; and a translation, executed by the second generator, of the first sound of the first domain into a second identity-mapping sound.

According to another embodiment of the present invention, there is provided a sound translation system for translation between sound data, comprising: a first generator that translates a first sound of a first domain into a first translated sound by mimicking a target sound of a second domain; and a second generator that translates a second sound of the second domain into a second translated sound by mimicking a target sound of the first domain, wherein, after the first generator translates the second sound of the second domain into a first identity-mapping sound and the second generator translates the first sound of the first domain into a second identity-mapping sound, the first generator and the second generator are trained in advance based on: a distance metric between the first sound of the first domain and the second translated sound produced by the second generator; a distance metric between the second sound of the second domain and the first translated sound produced by the first generator; and at least one of a distance metric between the first identity-mapping sound and the second sound and a distance metric between the second identity-mapping sound and the first sound.

According to another embodiment of the present invention, there is provided a learning method of a sound translation system, performed by a computer device to carry out translation between sound data, the method comprising: translating a first sound of a first domain into a first translated sound using a first generator, and translating the first translated sound into a second reconstructed sound using a second generator; translating a second sound of a second domain into a second translated sound using the second generator, and translating the second translated sound into a first reconstructed sound using the first generator; discriminating, using a first discriminator, between the first sound of the first domain and the second translated sound produced by the second generator; discriminating, using a second discriminator, between the second sound of the second domain and the first translated sound produced by the first generator; and training the first generator and the second generator based on some or all of a distance metric between the first sound and the second translated sound, a distance metric between the second sound and the first translated sound, a distance metric between the first reconstructed sound and the second sound, and a distance metric between the second reconstructed sound and the first sound, wherein the training further comprises at least one of: translating the second sound of the second domain into a first identity-mapping sound using the first generator; and translating the first sound of the first domain into a second identity-mapping sound using the second generator.
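The identity-mapping step named in the learning method can be illustrated with a minimal Python sketch. It assumes a mean absolute difference as the distance metric and uses toy stand-in generators; the names `l1_distance` and `identity_losses` are hypothetical, and the text does not state which metric is actually used.

```python
import numpy as np

def l1_distance(a, b):
    """Mean absolute difference, used here as an assumed stand-in distance metric."""
    return float(np.mean(np.abs(a - b)))

def identity_losses(g_ab, g_ba, x_a, x_b):
    """Distances for the two identity-mapping paths.

    g_ab maps domain A to B, so feeding it x_b (already in B) should change
    little; likewise for g_ba and x_a.
    """
    first_identity = g_ab(x_b)    # first identity-mapping sound
    second_identity = g_ba(x_a)   # second identity-mapping sound
    return l1_distance(first_identity, x_b), l1_distance(second_identity, x_a)

# Toy generators (hypothetical): a perfect identity and a slightly biased map.
x_a = np.ones((4, 3))
x_b = np.zeros((4, 3))
loss_b, loss_a = identity_losses(lambda x: x, lambda x: x + 0.5, x_a, x_b)
```

A generator that leaves in-domain input untouched yields a zero identity distance, while any alteration of in-domain input is penalized in proportion to its size.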

According to another embodiment of the present invention, there is provided a sound translation method implemented by a computer device to carry out translation between sound data, the method comprising: translating a first sound of a first domain into a first translated sound using a first generator; and translating a second sound of a second domain into a second translated sound using a second generator, wherein, after the first generator translates the second sound of the second domain into a first identity-mapping sound and the second generator translates the first sound of the first domain into a second identity-mapping sound, the first generator and the second generator are trained in advance based on: a distance metric between the first sound of the first domain and the second translated sound produced by the second generator; a distance metric between the second sound of the second domain and the first translated sound produced by the first generator; and at least one of a distance metric between the first identity-mapping sound and the second sound and a distance metric between the second identity-mapping sound and the first sound.

According to another embodiment of the present invention, there is provided a computer program stored in a computer-readable recording medium to execute each step of the learning method of the sound translation system.

According to another embodiment of the present invention, there is provided a computer program stored in a computer-readable recording medium to execute each step of the sound translation method.

As described above, the present embodiment provides a sound translation system and method that, when translating between mutually unpaired sounds, applies training based on a self-attention module and a characteristic loss function, so that translation mimicking a target timbre is made easier while elements such as pitch, beat, silence, and vibration are preserved. This increases the realism of the translated sound. In addition, if the technical apparatus and method of the present embodiment are suitably adapted, the field of application can be extended to one-to-many or many-to-many translation between sounds.

FIG. 1 is a configuration diagram of a system for singing voice translation according to an embodiment of the present invention.
FIG. 2 is a structural diagram of a learning model for a singing voice translator according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the position of a self-attention module according to an embodiment of the present invention.
FIG. 4 is a diagram showing a generator of a singing voice translator according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the performance of a singing voice translator according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating a timbre change produced by a singing voice translator according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to exemplary drawings. In assigning reference numerals to the components in each drawing, note that the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the embodiments.

In describing the components of the embodiments, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to "include" or "have" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.

The detailed description set forth below in conjunction with the appended drawings is intended to describe exemplary embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced.

The present invention proposes a new model based on GANs (Generative Adversarial Networks) that can learn the relationship between domains to which mutually unpaired sounds belong.

Sound includes voice, singing voice, music, natural sound, and the like. In an embodiment of the present invention, the description centers on singing voice translation as one example of sound translation.

The singing voice translator according to an embodiment of the present invention can use the learned inter-domain relationship to translate a singing voice from one domain to another.

FIG. 1 is a configuration diagram of a system for singing voice translation according to an embodiment of the present invention.

The singing voice translation system 100 illustrated in FIG. 1 includes some or all of an input unit 110, a translation unit 120, and an output unit 130. The components included in the singing voice translation system according to the present embodiment are not necessarily limited to these. For example, the system may additionally include a training unit (not shown) for training the learning model, or may be implemented so as to operate in conjunction with an external training unit.

The input unit 110 acquires the data needed for singing voice translation and converts it into a form suitable for that translation.

For example, the input unit 110 according to the present embodiment receives a singing voice belonging to a source domain in the time domain and then transforms it into frequency-domain data. The transform may be an FFT (Fast Fourier Transform), a cepstrum transform, or the like, but is not necessarily limited to these.

In the present embodiment, the input unit 110 uses the principle of the spectrogram to represent the singing voice as two-dimensional data in the frequency domain. First, an M-point FFT is applied to the time-domain singing voice over successive segments that partially overlap (overlapping and sliding), yielding M-point FFT vectors in the frequency domain. Next, N of these M-dimensional frequency-domain vectors are combined as column vectors to generate an MxN matrix, which is the two-dimensional data.
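The overlapping, sliding FFT described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions: the Hann window, the choice to keep magnitudes only, and the parameter values (`fft_size`, `hop`, a 16 kHz sampling rate) are not given in the text and are chosen purely for the example.

```python
import numpy as np

def spectrogram(signal, fft_size=512, hop=128):
    """Return an M x N magnitude matrix: M FFT bins by N overlapping frames."""
    window = np.hanning(fft_size)                  # assumed window; not specified in the text
    n_frames = 1 + (len(signal) - fft_size) // hop
    columns = []
    for n in range(n_frames):
        frame = signal[n * hop : n * hop + fft_size] * window
        columns.append(np.abs(np.fft.fft(frame)))  # one M-point FFT vector
    return np.stack(columns, axis=1)               # column vectors side by side: shape (M, N)

# Usage: one second of a 440 Hz tone sampled at 16 kHz (illustrative values).
sr = 16000
t = np.arange(sr) / sr
S = spectrogram(np.sin(2 * np.pi * 440 * t))
```

With a hop smaller than the FFT size, consecutive frames share samples, which is the "overlapping and sliding" behavior the paragraph describes.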

In the present embodiment, the expression "mutually unpaired" is used for singing voices belonging to the two domains. It means that the two singing voices come from different singers and are not time-synchronized with each other.

The translation unit 120 performs translation between mutually unpaired singing voices.

The translation unit 120 according to the present embodiment takes as input the frequency-domain data produced by the input unit 110 and translates it into frequency-domain data that mimics the target timbre. For singing voice translation, the translation unit 120 uses a neural-network-based learning model trained in advance by the training unit. The structure of the learning model and its training process are described later.

The output unit 130 provides the translated singing voice to the user of the singing voice translation system in audible form.

The output unit 130 according to the present embodiment receives the translated frequency-domain data from the translation unit 120 and converts it, through a synthesis process, into a translated singing voice in the time domain. Finally, it provides the translated time-domain data to the user of the singing voice translation system in audible form.

FIG. 2 is a structural diagram of a learning model for a singing voice translator according to an embodiment of the present invention.

The learning model shown in FIG. 2 is based on GANs (Generative Adversarial Networks) with a reconstruction path. The GAN-based system used in the learning model is a structure in which two GANs are coupled; for details on GANs and the coupled GAN structure, refer to Patent Document 1, Non-Patent Document 1, or Non-Patent Document 2. The following description focuses on the concepts used in the learning model according to the present embodiment.

A GAN basically includes a generator and a discriminator. The generator performs a mapping between two domains. Depending on the elements that make up the domains, this mapping may be called a transformation, a translation, a pairing, and so on; for singing voices, the term translation is used.

In the implementation of the GAN according to the present embodiment, the generator uses a CNN (Convolutional Neural Network), and the discriminator uses only the decoder part among the components of the CNN.

The role of the generator is to mimic data belonging to the codomain by generating similar data that the discriminator cannot tell apart. The generator is trained so that the discriminator mistakes the generated data for real data and outputs true (probability 1). The role of the discriminator is to distinguish the similar data produced by the generator from target data belonging to the codomain. The discriminator is trained to output true (probability 1) for target data belonging to the codomain and false (probability 0) for similar data. Accordingly, in a basic GAN, a generative adversarial loss function is defined based on the roles of the generator and the discriminator, and both are trained through unsupervised learning based on that loss function.

The structure in which two GANs are coupled, shown as the dotted box in FIG. 2, includes two generators and two discriminators in order to perform a one-to-one mapping between two domains A and B effectively. Hereinafter, the two generators are called the first and second generators, and the two discriminators are called the first and second discriminators. In the present embodiment, domain A contains the singing voice of singer A, and domain B contains the singing voice of singer B. The first generator performs the translation from domain A to domain B, and the second generator performs the translation from domain B to domain A. The second discriminator distinguishes the output of the first generator from the targets of domain B, and the first discriminator distinguishes the output of the second generator from the targets of domain A.

The terms needed to express the loss functions of the present embodiment are defined as follows. First, the translation from domain A to domain B performed by the first generator is denoted G_AB, and conversely the translation from domain B to domain A performed by the second generator is denoted G_BA. The functions of the first and second discriminators are denoted D_A and D_B, respectively. The singing voices belonging to domains A and B are called the first and second singing voices and are denoted x_A and x_B. Regardless of the input, outputs of G_AB are labeled with the prefix "first", and likewise outputs of G_BA are labeled with the prefix "second". In the basic GAN structure, generator outputs are distinguished by the term "translated". Although FIG. 2 shows multiple instances of each of the first and second generators, this is only for convenience in explaining the learning model; in an actual implementation there is one first generator and one second generator. The training procedure of the learning model using one of each generator is described later with reference to FIG. 4.

Restating the functions of the generators and discriminators with this notation: the first generator translates the first singing voice belonging to domain A to generate a first translated singing voice, and the second generator translates the second singing voice belonging to domain B to generate a second translated singing voice. The second discriminator distinguishes the first translated singing voice from the second singing voice of domain B, and the first discriminator distinguishes the second translated singing voice from the first singing voice of domain A.
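The pairing of generators and discriminators described above can be made concrete with a small Python sketch. The generators and discriminators here are hypothetical toy functions (a scaling map and a sigmoid of mean energy), standing in for the trained networks only to show which component receives which signal.

```python
import numpy as np

def make_generator(weight):
    """Hypothetical generator: an elementwise scaling of a spectrogram."""
    return lambda x: weight * x

def make_discriminator(threshold):
    """Hypothetical discriminator: squashes mean energy to a probability in (0, 1)."""
    return lambda x: 1.0 / (1.0 + np.exp(-(np.mean(x) - threshold)))

g_ab, g_ba = make_generator(2.0), make_generator(0.5)         # first / second generator
d_a, d_b = make_discriminator(1.0), make_discriminator(2.0)   # first / second discriminator

x_a = np.full((8, 5), 1.0)   # first singing voice (domain A)
x_b = np.full((8, 5), 2.0)   # second singing voice (domain B)

x_ab = g_ab(x_a)             # first translated singing voice, judged by d_b
x_ba = g_ba(x_b)             # second translated singing voice, judged by d_a

scores = {
    "d_b(real B)": d_b(x_b), "d_b(translated A->B)": d_b(x_ab),
    "d_a(real A)": d_a(x_a), "d_a(translated B->A)": d_a(x_ba),
}
```

Note the crossed indices: the output of the first generator lands in domain B and is therefore scored by the second discriminator, and vice versa.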

Because the main role of a generator is to produce translated similar data that the discriminator cannot tell apart, its loss function is expressed in terms of the probability output by the discriminator for the similar data, with a negative sign applied so that the loss takes a positive value. The generation loss functions for the first and second generators are expressed by Equation 1.

Here, E is the expectation function.
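One form consistent with this description is sketched here. It assumes the usual log-probability formulation of a GAN generator loss; the logarithm and the symbol for the loss are assumptions rather than details stated in the text.

```latex
\mathcal{L}_{\mathrm{gen}}
  = -\,\mathbb{E}_{x_A}\!\left[\log D_B\big(G_{AB}(x_A)\big)\right]
    -\,\mathbb{E}_{x_B}\!\left[\log D_A\big(G_{BA}(x_B)\big)\right]
```

Under this assumption each term is non-negative, because a discriminator output is a probability and its logarithm is therefore non-positive.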

Since the main role of a discriminator is to distinguish the fake data produced by a generator from data belonging to the codomain (the target domain of the transformation), the discriminator loss is expressed as the sum of 'the probability output by the discriminator for the codomain data' and 'one minus the probability output by the discriminator for the fake data'. The discrimination loss functions for the first and second discriminators are given by Equation 2.
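A form consistent with this description is sketched here, again assuming log-probabilities and a leading negative sign so that the loss is minimized (both are assumptions, as is the symbol for the loss):

```latex
\mathcal{L}_{\mathrm{dis}}
  = -\,\mathbb{E}_{x_B}\!\left[\log D_B(x_B)\right]
    -\,\mathbb{E}_{x_A}\!\left[\log\!\big(1 - D_B(G_{AB}(x_A))\big)\right]
    -\,\mathbb{E}_{x_A}\!\left[\log D_A(x_A)\right]
    -\,\mathbb{E}_{x_B}\!\left[\log\!\big(1 - D_A(G_{BA}(x_B))\big)\right]
```

The first two terms belong to the second discriminator D_B and the last two to the first discriminator D_A.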

The expressions in Equations 1 and 2 are the loss functions of the structure in which two GANs are combined. The training unit uses the loss functions of Equations 1 and 2 to train the generators and discriminators.

As shown in FIG. 2, when a reconstruction path is included, an additional loss is used to train the generators. The reconstruction path is built by reusing the first and second generators, and it checks how well the source singing voice is reconstructed after two consecutive transformations. To reconstruct the singing voice, the second generator transforms the first modified singing voice to generate a second reconstructed singing voice, and the first generator transforms the second modified singing voice to generate a first reconstructed singing voice.

The loss function associated with the reconstruction path is defined as a distance metric between the source singing voice and the reconstructed singing voice; the reconstruction loss functions for the first and second generators are given by Equation 3.

Here, the distance metric may be of any type (L1, L2, cosine similarity, and so on). The loss function for the first and second generators that also accounts for the reconstruction path is given by Equation 4. The training unit uses the loss function of Equation 4 to train the first and second generators.
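As a hedged sketch, write the distance metric as d(·,·) and the generation loss of Equation 1 as L_gen. Both symbols are assumed here, as is the weight λ_rec, which the text does not specify. The reconstruction loss and the combined generator loss could then take the following form:

```latex
\mathcal{L}_{\mathrm{rec}}
  = \mathbb{E}_{x_A}\!\left[d\big(x_A,\; G_{BA}(G_{AB}(x_A))\big)\right]
  + \mathbb{E}_{x_B}\!\left[d\big(x_B,\; G_{AB}(G_{BA}(x_B))\big)\right]

\mathcal{L}_{G} = \mathcal{L}_{\mathrm{gen}} + \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}}
```

The first term compares the first singing voice with the second reconstructed singing voice, and the second term compares the second singing voice with the first reconstructed singing voice, matching the naming convention defined earlier.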

Equations 2 and 4 are the loss functions used in Patent Document 1 or Non-Patent Document 2, and they showed satisfactory performance when applied to image pairing. However, when the timbre, represented by harmonic structures, must be transformed while elements such as lyrics, pitch, breathing, and vibration are preserved, as in singing voice transformation, training based on the above equations alone falls short.

Therefore, in the present embodiment, a feature path and an identity-mapping path are added to further improve the transformation, and a self-attention module may be added as well. The description below focuses on these added parts.

First, as shown in FIG. 2, the identity-mapping path according to the present embodiment checks how well the source singing voice is identity-mapped, and it is built by reusing the first and second generators. Adding the identity-mapping path helps preserve elements of the source singing voice such as lyrics, pitch, breathing, and vibration. For identity mapping, the second generator transforms the first singing voice to generate a second identity-mapping singing voice, and the first generator transforms the second singing voice to generate a first identity-mapping singing voice.

The loss function associated with the identity-mapping path is defined as a distance metric between the source singing voice and the identity-mapping singing voice; the identity-mapping loss functions for the first and second generators are given by Equation 5.
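With the distance metric written as d(·,·) (an assumed symbol), a form of the identity-mapping loss consistent with this description would be:

```latex
\mathcal{L}_{\mathrm{id}}
  = \mathbb{E}_{x_B}\!\left[d\big(x_B,\; G_{AB}(x_B)\big)\right]
  + \mathbb{E}_{x_A}\!\left[d\big(x_A,\; G_{BA}(x_A)\big)\right]
```

Each generator is fed a singing voice that already belongs to its output domain, so a small value indicates that the generator leaves such input nearly unchanged.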

Next, as shown in FIG. 2, the feature path according to the present embodiment is built by reusing the first and second generators, and it checks how well the features of the source singing voice are reconstructed. Like the identity-mapping path, the feature path helps preserve elements of the source singing voice such as lyrics, pitch, breathing, and vibration. For feature extraction, the first generator extracts a first singing voice feature from the second singing voice and a first reconstructed singing voice feature from the second modified singing voice. The second generator extracts a second singing voice feature from the first singing voice and a second reconstructed singing voice feature from the first modified singing voice.

The loss function associated with the feature path is defined as a distance metric between the source singing voice feature and the reconstructed singing voice feature; the feature loss functions for the first and second generators are given by Equation 6.

Here, F is a mapping for extracting features. Any layer of the CNN that implements the generator may be used for feature extraction, but in the present embodiment the last layer of the encoder of the CNN is used. The last layer is used because its output was found to best represent the global characteristics contained in the singing voice.
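A form of the feature loss consistent with this description is sketched below. The distance symbol d(·,·) is assumed; the arguments of F follow the document's own notation F(x_B), F(G_BA(x_B)), F(x_A), and F(G_AB(x_A)).

```latex
\mathcal{L}_{\mathrm{feat}}
  = \mathbb{E}_{x_B}\!\left[d\big(F(x_B),\; F(G_{BA}(x_B))\big)\right]
  + \mathbb{E}_{x_A}\!\left[d\big(F(x_A),\; F(G_{AB}(x_A))\big)\right]
```

In the first term F is evaluated with the encoder of the first generator, and in the second term with the encoder of the second generator.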

Finally, combining Equations 4 through 6, the loss function for the first and second generators according to the present embodiment is given by Equation 7.

The training unit uses the loss function of Equation 7 to train the first and second generators.

In the frequency domain, the harmonic structure of a singing voice depends on partial and local peaks located at integer multiples of the fundamental frequency (f0). In the present embodiment, the self-attention module globally strengthens the harmonic structure within and between the partial peaks (inter or intra partial peaks) inherent in each of the first and second modified singing voices, the first and second reconstructed singing voices, and the first and second identity-mapping singing voices. As shown in FIG. 3, the self-attention module is placed after the generators and discriminators. Its input is therefore the two-dimensional data output by a generator or discriminator. As described above, this two-dimensional data represents information in the frequency domain.

The self-attention module may use multi-head attention, which is given by Equation 8.

Here, Q, K, and V are the query matrix, key matrix, and value matrix, respectively, and H is the output matrix of multi-head attention. Multi-head attention uses the query and key matrices to compute an attention weight for each row vector of the value matrix and then applies those weights to the row vectors. In self-attention, the same matrix is used for Q, K, and V. In multi-head attention, multiple heads operate in parallel.
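The mechanism described above can be sketched in NumPy as follows. This is a minimal illustration that assumes the widely used scaled dot-product form of multi-head attention with learned projection matrices; the projection matrices `Wq`, `Wk`, `Wv`, `Wo`, the scaling by the square root of the head size, and all function names are assumptions, not details taken from Equation 8.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X has shape (T, d_model); each W has shape (d_model, d_model).

    Returns the output matrix H (T, d_model) and the attention
    weights (num_heads, T, T).
    """
    T, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    # Self-attention: Q, K and V are all derived from the same input X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split the feature axis into heads: (num_heads, T, d_head).
    def split(M):
        return M.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Weights from queries and keys, one (T, T) matrix per head.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Apply the weights to the row vectors of V, head by head.
    heads = weights @ Vh                                  # (num_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo, weights
```

In this sketch the rows of `X` would be the frames of the two-dimensional frequency-domain data, so each output row is a weighted mixture of all frames. The heads are computed in one batched matrix product, which corresponds to the parallel processing mentioned above.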

In the present embodiment, since the input is two-dimensional data passed from a generator or discriminator, self-attention has the effect of emphasizing the peaks in the frequency domain that make up the two-dimensional data and the harmonic structure among those peaks. The output of the self-attention module becomes the final output of the generator or discriminator. In addition, in the present embodiment, the training unit updates the parameters used for multi-head attention based on the loss functions of Equations 2 and 7.

FIG. 4 is a diagram showing the generators of the singing voice translator according to an embodiment of the present invention.

The training procedure of the learning model for the singing voice translator according to the present embodiment is described with reference to FIG. 4. Training the first and second discriminators is simply a process in which the training unit updates the parameters of each discriminator in the direction that reduces the loss function of Equation 2, so only the training procedure for the first and second generators is described below.

Assume that the training unit has already updated the parameters of the first and second generators during the previous training epoch.

The first generator transforms the first singing voice (x_A) belonging to domain A to generate the first modified singing voice (G_AB(x_A)). The first generator also transforms the second singing voice (x_B) belonging to domain B to generate the first identity-mapping singing voice (G_AB(x_B)) and the first singing voice feature (F(x_B)).

Next, the second generator transforms the second singing voice (x_B) belonging to domain B to generate the second modified singing voice (G_BA(x_B)). The second generator also transforms the first singing voice (x_A) belonging to domain A to generate the second identity-mapping singing voice (G_BA(x_A)) and the second singing voice feature (F(x_A)).

Next, the first generator transforms the second modified singing voice (G_BA(x_B)) to generate the first reconstructed singing voice (G_AB(G_BA(x_B))) and the first reconstructed singing voice feature (F(G_BA(x_B))).

Next, the second generator transforms the first modified singing voice (G_AB(x_A)) to generate the second reconstructed singing voice (G_BA(G_AB(x_A))) and the second reconstructed singing voice feature (F(G_AB(x_A))).

Next, using the outputs of the first and second generators together with the first and second discriminators, the training unit computes the final loss function of Equation 7 on the basis of Equation 1 and Equations 3 through 6.

Finally, the training unit updates the parameters of the first and second generators in the direction that reduces the computed final loss function, which completes one epoch of generator training.
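The forward pass of this procedure can be sketched as follows, up to the computation of the final generator loss. Everything concrete here is a hypothetical stand-in: the generators are tiny linear encoder and decoder pairs rather than CNNs, the discriminators are logistic units, the distance metric is L1, the adversarial term uses log-probabilities, and the weights `w_rec`, `w_id`, and `w_feat` are illustrative values that the text does not specify. The parameter update itself would be left to an automatic differentiation framework and is not shown.

```python
import numpy as np

class ToyGenerator:
    """Hypothetical stand-in for a CNN generator (linear encoder + decoder)."""
    def __init__(self, dim, hidden, rng):
        self.enc = rng.normal(scale=0.1, size=(dim, hidden))
        self.dec = rng.normal(scale=0.1, size=(hidden, dim))

    def features(self, x):
        # Plays the role of F: the output of the last encoder layer.
        return np.tanh(x @ self.enc)

    def __call__(self, x):
        return self.features(x) @ self.dec

class ToyDiscriminator:
    """Hypothetical stand-in: returns the probability that the input is real."""
    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.1, size=dim)

    def __call__(self, x):
        return 1.0 / (1.0 + np.exp(-(x @ self.w)))

def l1(a, b):
    """L1 distance averaged over all elements (one allowed choice of metric)."""
    return float(np.mean(np.abs(a - b)))

def generator_total_loss(G_AB, G_BA, D_A, D_B, x_A, x_B,
                         w_rec=10.0, w_id=5.0, w_feat=1.0):
    eps = 1e-12
    # First generator: modified, identity-mapping and feature outputs.
    mod_1, id_1, feat_1 = G_AB(x_A), G_AB(x_B), G_AB.features(x_B)
    # Second generator: modified, identity-mapping and feature outputs.
    mod_2, id_2, feat_2 = G_BA(x_B), G_BA(x_A), G_BA.features(x_A)
    # Reconstruction path: each generator transforms the other's output.
    rec_1, feat_rec_1 = G_AB(mod_2), G_AB.features(mod_2)
    rec_2, feat_rec_2 = G_BA(mod_1), G_BA.features(mod_1)

    parts = {
        "gen": float(-np.mean(np.log(D_B(mod_1) + eps))
                     - np.mean(np.log(D_A(mod_2) + eps))),
        "rec": l1(x_A, rec_2) + l1(x_B, rec_1),
        "id": l1(x_B, id_1) + l1(x_A, id_2),
        "feat": l1(feat_1, feat_rec_1) + l1(feat_2, feat_rec_2),
    }
    total = (parts["gen"] + w_rec * parts["rec"]
             + w_id * parts["id"] + w_feat * parts["feat"])
    return total, parts
```

The variable names mirror the 'first' and 'second' prefixes of the text, so `rec_1` is the first reconstructed singing voice G_AB(G_BA(x_B)) and is compared with x_B.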

Although the learning procedure above is described as being executed sequentially, it is not limited to this. The steps may be reordered, or one or more steps may be executed in parallel, so the procedure is not limited to the chronological order described.

Next, the singing voice transformation process according to the present embodiment is described with reference to FIG. 4. As described above, transformation uses the trained first and second generators; both may be used, or only one when transformation in a single direction is desired. The first generator transforms the first singing voice (x_A) belonging to domain A to generate a singing voice (G_AB(x_A)) that mimics the target. Likewise, the second generator transforms the second singing voice (x_B) belonging to domain B to generate a singing voice (G_BA(x_B)) that mimics the target.

The performance evaluation results of the singing voice translator according to the present embodiment are described below. The database used for the evaluation consists of singing voices of well-known male and female singers. For convenience, the female singer's singing voice was assigned to domain A and the male singer's singing voice to domain B. About 150 minutes of singing voice was used, divided between training and evaluation.

In the evaluation setup, a CNN is used as the generator, and only the decoder part of the CNN is used as the discriminator. The Adam optimizer is used to train the learning model.

In the present embodiment, instead of a subjective method such as the mean opinion score (MOS) test, the similarity of the harmonic components of the same note (for example, F3 or B4) across the compared signals is considered, in order to measure the timbre change of the modified singing voice effectively. The metric is the cosine similarity computed in the frequency domain, measured note by note among the source singing voice, the target singing voice, and the modified singing voice.
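The idea of this measurement can be sketched with synthetic tones. In the sketch below the harmonic amplitudes, the 220 Hz note, the sample rate, and the Hann window are all hypothetical choices made for illustration; the text only specifies that cosine similarity is computed in the frequency domain for the same note.

```python
import numpy as np

def harmonic_tone(f0, amplitudes, sr=16000, duration=0.5):
    """Sum of sinusoids at integer multiples of f0 (a toy 'timbre')."""
    t = np.arange(int(sr * duration)) / sr
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(amplitudes))

def magnitude_spectrum(signal):
    """Magnitude of the real FFT of the Hann-windowed signal."""
    return np.abs(np.fft.rfft(signal * np.hanning(len(signal))))

def cosine_similarity(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Same note, different harmonic weights standing in for different timbres.
f0 = 220.0
source = magnitude_spectrum(harmonic_tone(f0, [1.0, 0.2, 0.1]))
target = magnitude_spectrum(harmonic_tone(f0, [0.3, 1.0, 0.6]))
modified = magnitude_spectrum(harmonic_tone(f0, [0.35, 0.9, 0.6]))

sim_source_target = cosine_similarity(source, target)
sim_modified_target = cosine_similarity(modified, target)
```

Because the three tones share the same pitch, any difference in similarity comes from the relative strength of the harmonics; here the modified tone is much closer to the target than the source is.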

The evaluation results according to the present embodiment are shown in FIG. 5. In FIG. 5, the vertical axis is the cosine similarity computed in the frequency domain, and the numbers on the horizontal axis are piano key numbers corresponding to the notes F3 through B4 (175 to 494 Hz). The similarity between the first modified singing voice produced by the proposed model and the second singing voice ([+SA+FEAT]A2B-B in FIG. 5) is greater than in all other cases. The other cases compared are the similarity between the first and second singing voices (A-B), and the similarity between the first modified singing voice and the second singing voice for a model with both the self-attention module and the feature loss function L_feat removed, a model with only L_feat removed, and a model with only the self-attention module removed (in order, [-SA-FEAT]A2B-B, [+SA-FEAT]A2B-B, and [-SA+FEAT]A2B-B).
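The relation between piano key numbers and the stated frequency range can be checked with a short sketch. It assumes the standard 88-key numbering in which A4 is key 49, and twelve-tone equal temperament with A4 tuned to 440 Hz; the text does not state which convention the figure uses, and the function name is hypothetical.

```python
def piano_key_to_hz(key_number, a4_key=49, a4_hz=440.0):
    """Equal-temperament frequency of a piano key (A4 = key 49 = 440 Hz assumed)."""
    return a4_hz * 2.0 ** ((key_number - a4_key) / 12.0)

f3_hz = piano_key_to_hz(33)  # F3 is 16 semitones below A4
b4_hz = piano_key_to_hz(51)  # B4 is 2 semitones above A4
```

Under these assumptions F3 and B4 correspond to keys 33 and 51, and their frequencies round to the 175 Hz and 494 Hz given in the text.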

Results illustrating the effect of the singing voice transformation according to the present embodiment are shown in FIG. 6. The spectrograms of the first modified singing voice and the second singing voice are displayed in the upper and lower panels, with frequency on the vertical axis and time on the horizontal axis. As intended in the present embodiment, the timbre change represented by the harmonic structure (solid box) can be observed, while elements such as vibration (thin solid box) and breathing (dashed box) are preserved.

As described above, as one example of sound transformation, the singing voice translator according to the present embodiment showed excellent performance in transforming singing voices.

Accordingly, the sound translator according to the present embodiment applies training based on a self-attention module and a feature loss function when transforming between mutually unpaired sounds, and thereby provides a sound transformation system and method that readily mimics the target timbre while preserving elements such as pitch, beat, silence, and vibration. This makes it possible to increase the realism of the transformed sound.

Although the present embodiment has been described mainly in terms of transformation between two sounds, the technical apparatus and method of the embodiment, suitably modified and applied, can extend its field of application to one-to-many or many-to-many transformation between sounds.

In addition, although the present embodiment deals with sound as the mutually unpaired data, the technical apparatus and method of the embodiment, suitably modified and applied, can extend the category of mutually unpaired data to include images or video. Preferably, the embodiment is applied to sound data, where it shows excellent performance as demonstrated here.

Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special-purpose or general-purpose processor) coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) contain instructions for the programmable processor and are stored on a "computer-readable medium".

A computer-readable medium refers to any computer program product, apparatus, and/or device (for example, a non-volatile or non-transitory recording medium such as a CD-ROM, ROM, memory card, hard disk, magneto-optical disk, or storage device) used to provide instructions and/or data to a programmable processor.

Various implementations of the systems and techniques described herein may be implemented by a programmable computer. Here, the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, another type of storage system, or a combination of these), and at least one communication interface. For example, the programmable computer may be a server, a network appliance, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a personal digital assistant (PDA), a cloud computing system, or a mobile device.

The above description merely illustrates the technical idea of the present embodiment, and those of ordinary skill in the art to which the embodiment pertains may make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present embodiment.

100: singing voice transformation system 110: input unit
120: transformation unit 130: output unit

Claims

A learning apparatus of a sound transformation system for translation between sound data, the learning apparatus comprising:
a first generator for transforming a first sound of a first domain into a first modified sound;
a second generator for transforming a second sound of a second domain into a second modified sound;
a first discriminator for discriminating between the first sound of the first domain and the second modified sound produced by the second generator;
a second discriminator for discriminating between the second sound of the second domain and the first modified sound produced by the first generator; and
a training unit configured to train the first generator and the second discriminator adversarially against each other, and to train the second generator and the first discriminator adversarially against each other,
wherein the training unit performs at least one of:
a transformation, executed by the first generator, from the second sound of the second domain into a first identity-mapping sound; and
a transformation, executed by the second generator, from the first sound of the first domain into a second identity-mapping sound.

The learning apparatus of claim 1,
wherein the training unit
causes the first generator to transform the second modified sound into a first reconstructed sound, and
causes the second generator to transform the first modified sound into a second reconstructed sound.

The learning apparatus of claim 2,
wherein the training unit
trains the first and second generators based on a distance metric between the first reconstructed sound and the second sound and a distance metric between the second reconstructed sound and the first sound.

The learning apparatus of claim 2,
wherein the training unit
trains at least one of the first generator and the second generator based on at least one of a distance metric between the first identity-mapping sound and the second sound and a distance metric between the second identity-mapping sound and the first sound.

The learning apparatus of claim 2,
wherein the training unit performs at least one of:
a function, executed by the first generator, of extracting a first acoustic feature from the second sound of the second domain and extracting a first reconstructed acoustic feature from the second modified sound; and
a function, executed by the second generator, of extracting a second acoustic feature from the first sound of the first domain and extracting a second reconstructed acoustic feature from the first modified sound.

The learning apparatus of claim 5,
wherein the training unit
trains at least one of the first and second generators based on at least one of a distance metric between the first acoustic feature and the first reconstructed acoustic feature and a distance metric between the second acoustic feature and the second reconstructed acoustic feature.

A sound transformation system for translation between sound data, the system comprising:
a first generator that transforms a first sound of a first domain into a first modified sound that mimics a target sound of a second domain; and
a second generator that transforms a second sound of the second domain into a second modified sound that mimics a target sound of the first domain,
wherein, after the first generator is used to transform the second sound of the second domain into a first identity-mapping sound and the second generator is used to transform the first sound of the first domain into a second identity-mapping sound,
the first generator and the second generator are trained in advance based on:
a distance metric between the first sound of the first domain and the second modified sound produced by the second generator;
a distance metric between the second sound of the second domain and the first modified sound produced by the first generator; and
at least one of a distance metric between the first identity-mapping sound and the second sound and a distance metric between the second identity-mapping sound and the first sound.

The sound transformation system of claim 7,
wherein a metric calculated in the frequency domain is used to check the similarity between the first modified sound and the second sound and the similarity between the second modified sound and the first sound.

A learning method of a sound transformation system, performed by a computer device to carry out translation between sound data, the learning method comprising:
transforming a first sound of a first domain into a first modified sound using a first generator, and transforming the first modified sound into a second reconstructed sound using a second generator;
transforming a second sound of a second domain into a second modified sound using the second generator, and transforming the second modified sound into a first reconstructed sound using the first generator;
distinguishing, using a first discriminator, the first sound of the first domain from the second modified sound produced by the second generator;
distinguishing, using a second discriminator, the second sound of the second domain from the first modified sound produced by the first generator; and
training the first generator and the second generator based on some or all of a distance metric between the first sound and the second modified sound, a distance metric between the second sound and the first modified sound, a distance metric between the first reconstructed sound and the second sound, and a distance metric between the second reconstructed sound and the first sound,
wherein the training further comprises at least one of:
transforming the second sound of the second domain into a first identity-mapping sound using the first generator; and
transforming the first sound of the first domain into a second identity-mapping sound using the second generator.

10. The method of claim 9,
wherein, in the training process, a parameter of at least one of the first generator and the second generator is updated based on at least one of a distance metric between the first identity-mapping sound and the second sound and a distance metric between the second identity-mapping sound and the first sound.
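
The parameter update recited here can be illustrated with a deliberately tiny model. The sketch assumes a generator with a single gain parameter `w`, a mean squared distance, and plain gradient descent; none of these choices comes from the claim, and `identity_loss` and `identity_grad` are hypothetical names.

```python
def identity_loss(w, sound):
    """Mean squared distance between the identity-mapping sound w*sound and sound."""
    return sum((w * v - v) ** 2 for v in sound) / len(sound)

def identity_grad(w, sound):
    """Analytic derivative of identity_loss with respect to the gain w."""
    return sum(2 * (w * v - v) * v for v in sound) / len(sound)

# The first generator is fed a sound that is already in its target domain.
second_sound = [2.0, -2.0, 1.0, 0.0]
w = 3.0    # initial generator parameter (assumed)
lr = 0.1   # learning rate (assumed)
history = []
for _ in range(50):
    history.append(identity_loss(w, second_sound))
    w -= lr * identity_grad(w, second_sound)

print(w, history[0], history[-1])
```

The gain is driven toward 1, meaning the generator learns to leave a sound unchanged when that sound already belongs to the target domain. In a full system this term would act as one regularizer among several loss terms rather than as the sole objective.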

11. The method of claim 10,
wherein the training process comprises at least one of:
a process of extracting a first acoustic feature from the second sound of the second domain using the first generator, and extracting a first reconstructed acoustic feature from the second modified sound; and
a process of extracting a second acoustic feature from the first sound of the first domain using the second generator, and extracting a second reconstructed acoustic feature from the first modified sound,
and wherein a parameter of at least one of the first generator and the second generator is updated based on at least one of a distance metric between the first acoustic feature and the first reconstructed acoustic feature and a distance metric between the second acoustic feature and the second reconstructed acoustic feature.
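
A feature-based distance of this kind might be computed as sketched below. The claim does not say which internal representation of the generator serves as the acoustic feature; the per-frame mean absolute amplitude, the `ToyGenerator` class, and its `encode` method are assumptions made purely for illustration.

```python
class ToyGenerator:
    """Stand-in generator: a gain for translation plus a simple feature extractor."""

    def __init__(self, gain):
        self.gain = gain

    def encode(self, sound, frame=2):
        """Hypothetical feature: mean absolute amplitude of each non-overlapping frame."""
        return [sum(abs(v) for v in sound[i:i + frame]) / frame
                for i in range(0, len(sound), frame)]

    def __call__(self, sound):
        return [self.gain * v for v in sound]

def feature_distance(f, g):
    """Assumed distance metric: mean absolute difference between feature vectors."""
    return sum(abs(a - b) for a, b in zip(f, g)) / len(f)

g1 = ToyGenerator(gain=2.0)   # first generator
g2 = ToyGenerator(gain=0.5)   # second generator
second_sound = [2.0, -2.0, 1.0, 0.0]

second_modified = g2(second_sound)                  # second modified sound
first_feature = g1.encode(second_sound)             # first acoustic feature
first_recon_feature = g1.encode(second_modified)    # first reconstructed acoustic feature
loss = feature_distance(first_feature, first_recon_feature)
print(first_feature, first_recon_feature, loss)
```

The resulting scalar would then be added to the generator objective so that the corresponding parameters are updated to reduce it.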

A sound modification method performed by a computer device to carry out translation between sound data, the method comprising:
a process of transforming a first sound of a first domain into a first modified sound using a first generator; and
a process of transforming a second sound of a second domain into a second modified sound using a second generator,
wherein the first generator and the second generator are pre-trained, after the second sound of the second domain is transformed into a first identity-mapping sound using the first generator and the first sound of the first domain is transformed into a second identity-mapping sound using the second generator, based on:
a distance metric between the first sound of the first domain and the second modified sound produced by the second generator;
a distance metric between the second sound of the second domain and the first modified sound produced by the first generator; and
at least one of a distance metric between the first identity-mapping sound and the second sound and a distance metric between the second identity-mapping sound and the first sound.

A computer program stored in a computer-readable recording medium to execute each step of the learning method of the sound modification system according to any one of claims 9 to 11.

A computer program stored in a computer-readable recording medium to execute each step of the sound modification method according to claim 12.