KR20210155520A

KR20210155520A - Method and Apparatus for Synthesizing/Modulating Singing Voice of Multiple Singers

Info

Publication number: KR20210155520A
Application number: KR1020200072840A
Authority: KR
Inventors: 김창현; 이교구; 이주헌; 최형석; 구정현; 김지원; 조한수
Original assignee: 에스케이텔레콤 주식회사; 서울대학교산학협력단
Priority date: 2020-06-16
Filing date: 2020-06-16
Publication date: 2021-12-23

Abstract

Disclosed are an apparatus and method for synthesizing/modulating singing voices of a plurality of singers. In the present embodiment, a user request specifying a singer and a song is obtained, a timbre and a singing style for the singer are generated using a deep neural network-based inference model, and a timbre-conditioned formant and a singing style-conditioned pitch skeleton are generated. The apparatus and method then generate a singing voice for the song from the pitch skeleton masked with the formant, using a deep neural network-based SR (super-resolution) transformation model. This enables natural singing voice synthesis/modulation for a plurality of singers.

Description

Method and Apparatus for Synthesizing/Modulating Singing Voice of Multiple Singers

The present invention relates to an apparatus and method for synthesizing/modulating singing voices of a plurality of singers. More specifically, it relates to a singing voice synthesis/modulation apparatus and method that uses a deep neural network to automatically generate, from singer characteristics, lyric information, and pitch information based on a user request, the singing voices of a plurality of singers or a singing voice that combines the characteristics of two singers.

The content described below merely provides background information related to the present invention and does not constitute prior art.

Singing voice synthesis (SVS) is a method of generating a natural singing voice from sheet music and lyrics. Compared with text-to-speech (TTS), which converts text into speech, SVS additionally requires control over the pitch of each syllable. As deep neural networks have expanded into new application areas with notable results, they are increasingly being applied to SVS as well.

One deep neural network-based SVS approach is the parametric method, in which a deep neural network predicts the parameters from which a singing voice can be synthesized (see Non-Patent Document 1). The parametric method demonstrated that SVS can be implemented with a deep neural network, but it has the drawback that the performance of the whole SVS system is determined by the performance of the vocoder that consumes the parameters. As alternatives, SVS methods that generate a linear spectrogram with an end-to-end deep neural network, or that implement the vocoder itself with a deep neural network, have been attempted.

Another requirement for SVS is singing voice synthesis for multiple singers. As shown in FIG. 8(a), SVS for a single singer generates a formant from the lyric information, that is, the text, generates a pitch skeleton (also called a pitch contour) from the pitch, and then combines the two in a deep neural network to synthesize the singing voice. For multiple singers, as shown in FIG. 8(b), there is a method that reflects the singer identity (hereinafter 'singer ID') in the singing voice in the form of a one-hot embedding (see Non-Patent Document 2). This direct method is simple to implement, but the deep neural network must be retrained every time a new singer is added.
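The retraining drawback of a one-hot singer ID can be illustrated with a minimal sketch. The function name `one_hot` and the roster sizes are hypothetical and chosen only for illustration; the cited method's actual implementation is not described here.

```python
def one_hot(singer_index, num_singers):
    """Return a one-hot vector identifying one singer in a fixed roster."""
    if not 0 <= singer_index < num_singers:
        raise ValueError("singer_index is outside the trained roster")
    return [1.0 if i == singer_index else 0.0 for i in range(num_singers)]


# A model trained on three singers expects a 3-dimensional identity input.
trained = one_hot(1, 3)

# Adding a fourth singer widens the identity vector, so the layers that
# consume it no longer match the trained weights and must be retrained.
extended = one_hot(3, 4)
```

An encoder that derives the singer characteristics from a short audio sample avoids this dependence on a fixed roster size.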

Accordingly, as shown in FIG. 8(c), there is a need for a singing voice synthesis and modulation method that generates singer ID characteristics (for example, timbre or singing style) from a singing query and then uses an end-to-end deep neural network to automatically generate, from the singer ID characteristics, lyric information, and pitch information, natural singing voices for a plurality of singers or a singing voice that combines the characteristics of two singers.

Non-Patent Document 1: Merlijn Blaauw and Jordi Bonada, “A neural parametric singing synthesizer modeling timbre and expression from natural songs,” Applied Sciences, vol. 7, no. 12, pp. 1313, 2017.
Non-Patent Document 2: Pritish Chandna, Merlijn Blaauw, Jordi Bonada, and Emilia Gomez, “Wgansing: A multi-voice singing voice synthesizer based on the wasserstein-gan,” arXiv preprint arXiv:1903.10729, 2019.
Non-Patent Document 3: Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4784-4788.
Non-Patent Document 4: Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, and Kyogu Lee, “Adversarially trained end-to-end korean singing voice synthesis system,” Proc. Interspeech 2019, pp. 2588-2592, 2019.

A main object of the present disclosure is to provide a singing voice synthesis/modulation apparatus and method that obtains a user request specifying a singer and a song, generates a timbre and a singing style for the singer using a deep neural network-based inference model, generates a timbre-conditioned formant and a singing style-conditioned pitch skeleton, and generates a singing voice for the song from the pitch skeleton masked with the formant using a deep neural network-based SR transformation model.

According to an embodiment of the present invention, there is provided a singing voice synthesis and modulation method executed by a computing device, the method comprising: obtaining a user request specifying a timbre of a first singer, a singing style of a second singer, and a song, obtaining a first singer spectrogram for the first singer, a second singer spectrogram for the second singer, and text of the lyrics of the song, and obtaining a pitch of the song from MIDI data for the song; a first process of generating, using a pre-trained deep neural network-based inference model, a first timbre for the first singer from the first singer spectrogram, and generating a first formant mask from the text and the first timbre; a second process of generating, using the inference model, a second singing style for the second singer from the second singer spectrogram, and generating a second pitch skeleton from the pitch and the second singing style; and generating a low-resolution third inferred spectrogram by masking the second pitch skeleton with the first formant mask.

According to another embodiment of the present invention, there is provided a singing voice synthesis and modulation apparatus comprising: an input unit that obtains a user request specifying a timbre of a first singer, a singing style of a second singer, and a song, obtains a first singer spectrogram for the first singer, a second singer spectrogram for the second singer, and text of the lyrics of the song, and obtains a pitch of the song from MIDI data for the song; a modulation unit that, using a pre-trained deep neural network-based inference model, generates a first timbre for the first singer from the first singer spectrogram and generates a first formant mask from the text and the first timbre, and that, using the inference model, generates a second singing style for the second singer from the second singer spectrogram and generates a second pitch skeleton from the pitch and the second singing style; and a third masking unit that generates a low-resolution third inferred spectrogram by masking the second pitch skeleton with the first formant mask.

According to another embodiment of the present invention, there is provided a computer program stored in a computer-readable recording medium for executing each step of the singing voice synthesis and modulation method.

As described above, the present embodiment provides a singing voice synthesis/modulation apparatus and method that obtains a user request specifying a singer and a song, generates a timbre and a singing style for the singer using a deep neural network-based inference model, generates a timbre-conditioned formant and a singing style-conditioned pitch skeleton, and generates a singing voice for the song from the pitch skeleton masked with the formant using a deep neural network-based SR transformation model. This has the effect of enabling natural singing voice synthesis/modulation for a plurality of singers.

In addition, because the timbre and singing style, the timbre-conditioned formant, and the singing style-conditioned pitch skeleton are generated by the deep neural network-based inference model, and the singing voice is generated from the masked pitch skeleton by the deep neural network-based SR transformation model, the present embodiment has the effect of enabling singing voice synthesis/modulation based on an end-to-end network.

In addition, the present embodiment provides a singing voice synthesis/modulation apparatus and method that obtains a user request specifying a timbre of a first singer, a singing style of a second singer, and a song, generates the timbre of the first singer and the singing style of the second singer using a deep neural network-based inference model, and independently generates a formant conditioned on the first singer's timbre and a pitch skeleton conditioned on the second singer's singing style. This has the effect of enabling singing voice synthesis/modulation in which timbre and singing style are controlled independently and can be crossed between singers.

FIG. 1 is a block diagram of a singing voice synthesis and modulation apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram of a singer ID encoder, a formant mask decoder, and a pitch skeleton decoder according to an embodiment of the present invention.
FIG. 3 is a flowchart of a singing voice synthesis and modulation method according to an embodiment of the present invention.
FIG. 4 is a block diagram of a singing voice synthesis and modulation apparatus according to another embodiment of the present invention.
FIG. 5 is a flowchart of a singing voice synthesis and modulation method according to another embodiment of the present invention.
FIG. 6 is a block diagram of a learning model according to an embodiment of the present invention.
FIG. 7 is a flowchart of a training method for the learning model according to an embodiment of the present invention.
FIG. 8 is a conceptual diagram of conventional singing voice synthesis and modulation methods.

Hereinafter, embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present embodiments, detailed descriptions of related well-known configurations or functions are omitted where they could obscure the gist of the embodiments.

In describing the components of the present embodiments, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Throughout the specification, when a part is said to 'include' or 'comprise' a component, this means that it may further include other components rather than excluding them, unless explicitly stated otherwise. In addition, terms such as '…unit' and 'module' used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

The detailed description set forth below in conjunction with the accompanying drawings is intended to describe exemplary embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced.

The present embodiment discloses an apparatus and method for synthesizing/modulating singing voices of a plurality of singers. More specifically, a user request specifying a singer and a song is obtained, a timbre and a singing style for the singer are generated using a deep neural network-based inference model, and a timbre-conditioned formant and a singing style-conditioned pitch skeleton are generated independently. The embodiment further proposes a singing voice synthesis/modulation apparatus and method that generates a singing voice for the song from the pitch skeleton masked with the formant, using a deep neural network-based SR transformation model.

A singing voice refers to a song rendered by a human voice.

A Mel spectrogram is a set of coefficients that capture the characteristics of an audio signal, obtained by filtering the signal in the frequency domain with a Mel filter bank. In the present embodiment, Mel spectrograms are used to represent both the singer's singing voice and the inferred singing voice. A high-resolution linear spectrogram contains hundreds to thousands of samples, whereas the filter bank that produces a Mel spectrogram consists of only a few tens of filters.
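How a few tens of Mel filters can cover the full frequency range may be sketched as follows. This is a minimal illustration that assumes the widely used HTK-style Mel formula; the patent does not state which Mel-scale variant it uses, and the function names and the 0 to 11,025 Hz range are assumptions for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to Mel (HTK-style formula, an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of filters spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# 80 filters spanning 0 Hz up to the Nyquist frequency of 22.05 kHz audio.
centers = mel_center_frequencies(80, 0.0, 11025.0)
```

The centers are dense at low frequencies and sparse at high frequencies, which is why far fewer Mel coefficients than linear frequency bins are needed.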

MIDI (Musical Instrument Digital Interface) is a common way of representing music played by instruments. The MIDI format is a set of rules for the sequence of commands that hardware or software, such as a synthesizer or sequencer, uses to play music. MIDI data represents pitch, note duration, and note intensity (which MIDI calls velocity).
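The three quantities carried by MIDI data can be illustrated with a small sketch. The `Note` record is a hypothetical container, not a type from any MIDI library, and the conversion assumes standard twelve-tone equal temperament with A4 (note number 69) at 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """Hypothetical container for one MIDI note event."""
    pitch: int      # MIDI note number, 0 to 127
    start: float    # onset time in seconds
    end: float      # release time in seconds
    velocity: int   # note intensity, 0 to 127

    @property
    def duration(self):
        return self.end - self.start


def midi_to_hz(note_number):
    """Fundamental frequency of a MIDI note, assuming A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((note_number - 69) / 12.0)


middle_c = Note(pitch=60, start=0.0, end=0.5, velocity=90)
frequency = midi_to_hz(middle_c.pitch)
```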

FIG. 1 is a block diagram of a singing voice synthesis and modulation apparatus according to an embodiment of the present invention.

In an embodiment of the present invention, the singing voice synthesis and modulation apparatus 100 obtains a user request specifying a target singer and a song, generates a timbre and a singing style for the singer using a deep neural network-based inference model, independently generates a timbre-conditioned formant and a singing style-conditioned pitch skeleton, and generates a singing voice for the song from the pitch skeleton masked with the formant using a deep neural network-based SR (super-resolution) transformation model. The apparatus 100 includes all or some of an input unit 102, a Mel modulation unit 104, an SR inference unit 106, and an output unit 108. The components of the apparatus 100 according to the present embodiment are not necessarily limited to these. For example, the apparatus 100 may additionally include a training unit (not shown) for training the inference model and the SR transformation model, or may be implemented so as to interwork with an external training unit.

FIG. 1 shows an exemplary configuration according to the present embodiment; various implementations with different components or different connections between components are possible, depending on the form of the input, the structure and operation of the inference model and the SR transformation model, and the form of the output.

The input unit 102 according to the present embodiment obtains a user request specifying a target singer and a song, obtains a singer Mel spectrogram for the singer and text of the lyrics of the song, and obtains the pitch from MIDI data for the song. Here, the user request specifies the target singer from among a plurality of singers and the song that the target singer is to sing from among a plurality of songs.

The MIDI data is MIDI data for an existing singing voice that performs the specified song. Accordingly, any MIDI data capable of representing the pitches that make up the song may be used.

The text is represented as vectors. In the case of Hangul, for example, the initial, medial, and final sounds (onset, nucleus, and coda) that make up a syllable are each represented as a vector. The pitch is also represented as a vector and is assumed to include the start and end of each note, that is, the note duration. In the Mel spectrogram, each coefficient produced by the Mel filter bank is represented as a vector.
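Splitting a Hangul syllable into onset, nucleus, and coda can be done with Unicode arithmetic alone, since precomposed syllables are laid out as 19 onsets by 21 nuclei by 28 codas (including "no coda") starting at U+AC00. The sketch below returns the three indices, which could then index an embedding table; the function name is hypothetical and the patent does not specify how its vectors are built.

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable
HANGUL_LAST = 0xD7A3   # last precomposed syllable
NUM_NUCLEI = 21
NUM_CODAS = 28         # index 0 means "no coda"


def decompose_syllable(syllable):
    """Return (onset, nucleus, coda) indices of one Hangul syllable."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable")
    offset = code - HANGUL_BASE
    onset, rest = divmod(offset, NUM_NUCLEI * NUM_CODAS)
    nucleus, coda = divmod(rest, NUM_CODAS)
    return onset, nucleus, coda


# Each syllable of a lyric becomes three symbol indices.
lyric_indices = [decompose_syllable(ch) for ch in "한가"]
```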

In the present embodiment, a low-resolution Mel spectrogram is used to represent the singing voice so that its characteristics are adequately captured while the complexity of the inference model is reduced, but the embodiment is not limited to this. Any form of low-resolution data capable of representing the characteristics of a singing voice may be used, such as frequency-domain or time-domain data generated by another signal processing method.

In the singing voice synthesis process, the singer Mel spectrogram is used as a seed that represents the characteristics of the target singer's voice. The singer Mel spectrogram may be generated from a short segment of the target singer's singing voice. For example, if 12 seconds of singing voice is sampled at 22.05 kHz, the window and hop sizes are each set to 1,024 samples, and an 80-dimensional Mel spectrogram is generated, a Mel spectrogram of approximately 256 frames is obtained. The singer Mel spectrogram obtained in this way may be applied repeatedly to the singing voice synthesis and modulation apparatus 100 while the singing voice for the specified song is being synthesized.
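The frame count in this example follows from simple arithmetic, sketched below. The function name is hypothetical, and the sketch assumes no padding, so that the count is the number of samples divided by the hop size, rounded down; framing conventions that pad the signal give a slightly different number.

```python
def mel_frame_count(duration_s, sample_rate, hop):
    """Number of hop-sized steps in the signal (no padding assumed)."""
    num_samples = round(duration_s * sample_rate)
    return num_samples // hop


# 12 s at 22.05 kHz is 264,600 samples; with a hop of 1,024 samples this
# yields a frame count in the neighborhood of the 256 quoted in the text.
frames = mel_frame_count(12.0, 22050, 1024)
seed_shape = (80, frames)  # (Mel dimensions, frames)
```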

The input unit 102 may obtain the singing voice of the specified singer from a storage device (not shown) and generate the singer Mel spectrogram from it. Alternatively, the singer Mel spectrogram may be obtained directly from the storage device. The input unit 102 may obtain the text of the lyrics of the specified song from the storage device. The input unit 102 may also obtain MIDI data for the song from the storage device and extract the pitch contained in the MIDI data. Accordingly, the storage device stores the singing voices or singer Mel spectrograms of a plurality of singers, as well as the text, MIDI data, and the like for a plurality of songs.

The Mel modulation unit 104 according to the present embodiment uses a deep neural network-based inference model to generate a low-resolution inferred Mel spectrogram from the singer Mel spectrogram, the text, and the pitch. More specifically, the inference model generates the singer ID (identity) characteristics of the target singer, namely the timbre and the singing style, from the singer Mel spectrogram; generates a formant mask from the text and the timbre; generates a pitch skeleton from the pitch and the singing style; and generates the low-resolution inferred Mel spectrogram from the pitch skeleton and the formant mask. The inference model of the Mel modulation unit 104 includes all or some of a singer ID encoder 121, a text encoder 122, a Mel spectrogram encoder 123, a pitch encoder 124, an attention unit 125, a formant mask decoder 131, a pitch skeleton decoder 132, and a masking unit 133.
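The last step of this data flow, combining the pitch skeleton with the formant mask, might look like the following sketch. It assumes that masking is an element-wise product between a mask with values in (0, 1) and a skeleton of the same shape; the text above does not define the operation, so this is an illustrative assumption, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_formant_mask(mask, skeleton):
    """Element-wise product of a formant mask and a pitch skeleton.

    Both inputs are frames x Mel-bins nested lists of equal shape.
    Element-wise multiplication is an assumed reading of "masking".
    """
    if len(mask) != len(skeleton) or any(
        len(m) != len(s) for m, s in zip(mask, skeleton)
    ):
        raise ValueError("mask and skeleton must have the same shape")
    return [[m * s for m, s in zip(mrow, srow)]
            for mrow, srow in zip(mask, skeleton)]


# Toy example: one frame, two Mel bins.
mask = [[sigmoid(0.0), sigmoid(0.0)]]   # [[0.5, 0.5]]
skeleton = [[2.0, 4.0]]
inferred = apply_formant_mask(mask, skeleton)
```

Under this reading, the skeleton carries where the harmonic energy lies and the mask attenuates or passes it according to pronunciation and timbre.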

The inference model according to the present embodiment is implemented as a deep learning-based deep neural network built on multiple convolution layers, but is not limited to this. For example, any deep neural network with a recurrent structure, such as an RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory), may be used. The inference model may be trained in advance using a learning model. The structure of the learning model and its training process are described later.

FIG. 2 is a block diagram of a singer ID encoder, a formant mask decoder, and a pitch skeleton decoder according to an embodiment of the present invention.

The singer ID encoder 121 extracts the singing style and timbre from the singer Mel spectrogram as global features of the target singer. As shown in FIG. 2, using two one-dimensional convolution layers (conv1d), two ReLUs (Rectified Linear Units), and an average time pooling layer, the singer ID encoder 121 obtains from the singer Mel spectrogram a time-invariant global feature from which variation over time has been removed. Through a dense layer and a tiling step, the time-invariant global feature may be converted into a singer ID embedding that represents the timbre and the singing style.
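The pooling and tiling steps can be illustrated in isolation. The sketch below omits the convolution, ReLU, and dense layers and shows only how averaging over time yields a single vector and how tiling repeats it for every output frame; the function names and toy values are assumptions for illustration.

```python
def average_time_pool(features):
    """Average a frames x channels feature map over the time axis."""
    num_frames = len(features)
    return [sum(column) / num_frames for column in zip(*features)]


def tile(vector, num_frames):
    """Repeat one global vector so that every frame receives a copy."""
    return [list(vector) for _ in range(num_frames)]


# Toy feature map: 2 frames, 2 channels.
features = [[1.0, 2.0], [3.0, 6.0]]
global_feature = average_time_pool(features)   # [2.0, 4.0]

# The same result is obtained if the frames arrive in a different order,
# which is the sense in which the pooled feature is time-invariant.
shuffled = average_time_pool(features[::-1])

embedding = tile(global_feature, 3)
```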

The text encoder 122 extracts text features from the text. The text encoder 122 is implemented with convolution layers, and the extracted text features represent the pronunciation of the lyrics.

The Mel spectrogram encoder 123 is implemented with convolution layers and extracts audio features auto-regressively, starting from an initial condition. Zero may be used as the initial condition, and the inferred Mel spectrogram of the previous time step is fed back as the input to the Mel spectrogram encoder 123.
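The feedback loop described here can be sketched generically. In the sketch, `step_fn` stands in for the whole encoder-decoder path and is a hypothetical placeholder; for brevity it sees only the most recent frame, whereas a convolutional encoder would normally look at a window of past frames.

```python
def autoregressive_decode(step_fn, num_frames, n_mels):
    """Generate frames one at a time, feeding each output back as input."""
    previous = [0.0] * n_mels      # zero initial condition
    frames = []
    for _ in range(num_frames):
        frame = step_fn(previous)  # predict the next frame from the last one
        frames.append(frame)
        previous = frame           # feedback for the next time step
    return frames


# Toy stand-in for the network: add 1 to every Mel bin.
def toy_step(prev):
    return [x + 1.0 for x in prev]


generated = autoregressive_decode(toy_step, num_frames=3, n_mels=2)
```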

The pitch encoder 124 extracts pitch features from the pitch and is implemented with convolution layers.

The attention unit 125 concatenates the result of attention between the text features and the audio features onto the audio features, producing synchronized audio features in which the text features and the audio features are aligned in time.
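One way to picture this step is scaled dot-product attention followed by concatenation, sketched below. The scoring function, the use of the text features as both keys and values, and all names are assumptions made for illustration; the text above does not specify how the attention is computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_and_concat(text_feats, audio_feats):
    """For each audio frame, attend over text features and concatenate.

    text_feats:  text positions x d   (used as keys and values)
    audio_feats: audio frames x d     (used as queries)
    Returns audio frames x 2d: [attended text context, audio feature].
    """
    d = len(text_feats[0])
    output = []
    for query in audio_feats:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in text_feats]
        weights = softmax(scores)
        context = [sum(w * key[c] for w, key in zip(weights, text_feats))
                   for c in range(d)]
        output.append(context + list(query))
    return output


text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
synced = attend_and_concat(text, audio)
```

Each output row pairs an audio frame with the text content it most closely matches, which is how the two sequences of different lengths are brought into alignment.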

The formant mask decoder 131 generates a formant mask by conditioning the text feature on the timbre, which is a global feature, and combining the two. As shown in FIG. 2, the formant mask decoder 131 includes a convolutional layer, a conditioning layer, a highway causal convolution (HWC) layer (see Non-Patent Document 3), and a convolutional layer combined with a sigmoid function as its activation function. In the conditioning layer of the formant mask decoder 131, the timbre component of the singer ID embedding, which is a global feature, is applied as the condition and combined, so that a formant mask, a feature related to pronunciation and timbre, can be generated from the text feature.

The pitch skeleton decoder 132 generates a pitch skeleton by conditioning the synchronized audio feature on the singing style and the pitch feature and combining them. As shown in FIG. 2, the pitch skeleton decoder 132 includes a convolutional layer, a conditioning layer, a highway non-causal convolution (HWNC) layer (see Non-Patent Document 3), and a convolutional layer combined with a sigmoid function as its activation function. In the conditioning layer of the pitch skeleton decoder 132, the singing style component of the singer ID embedding, which is a global feature, and the pitch feature, which is a local feature, are applied as conditions and combined, so that a pitch skeleton, a feature related to pitch and style, can be generated from the synchronized audio feature.

The conditioning performed in the conditioning layers of the formant mask decoder 131 and the pitch skeleton decoder 132 may be expressed by Equation 1.

Here, in the case of the formant mask decoder 131, z is the output of the conditioning layer, x is the text feature, the global condition is the timbre, and the local condition is a zero input. In the case of the pitch skeleton decoder 132, x is the synchronized audio feature, the global condition is the singing style, and the local condition is the pitch feature. The symbols σ (sigmoid) and ReLU denote activation functions, the symbol ⊙ denotes element-wise multiplication, the symbol * denotes convolution, and the three remaining symbols of Equation 1 are the weights of the convolution operations.

As described above, the singer ID embedding output by the singer ID encoder 121 can independently perform two functions: conditioning on the timbre when it is combined with the text feature, and conditioning on the singing style when it is combined with the audio feature.

The masking unit 133 generates the inferred Mel spectrogram by masking the pitch skeleton with the formant mask. Here, masking means multiplying the formant mask and the pitch skeleton element by element.
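Since masking is defined as an element-wise product, it can be shown in a few lines. The sketch below assumes the formant mask comes out of a sigmoid (as the final layer of the decoder suggests), so every mask value lies strictly between 0 and 1; the toy numbers and function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def apply_formant_mask(formant_mask, pitch_skeleton):
    """Element-wise product of two equally shaped frame-by-bin arrays."""
    if len(formant_mask) != len(pitch_skeleton) or any(
        len(m) != len(p) for m, p in zip(formant_mask, pitch_skeleton)
    ):
        raise ValueError("formant mask and pitch skeleton must have the same shape")
    return [[m * p for m, p in zip(m_row, p_row)]
            for m_row, p_row in zip(formant_mask, pitch_skeleton)]

# Toy values: 2 frames x 2 mel bins.
mask_logits = [[0.0, 4.0], [-4.0, 0.0]]
formant_mask = [[sigmoid(v) for v in row] for row in mask_logits]
pitch_skeleton = [[0.8, 0.8], [0.8, 0.8]]
inferred_mel = apply_formant_mask(formant_mask, pitch_skeleton)
```

With a mask in the open interval (0, 1), each bin of the pitch skeleton is attenuated by a different amount, which is how the pronunciation- and timbre-related shape is imposed on the pitch- and style-related content.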

The inferred Mel spectrogram generated by the inference model of the Mel modulator 104 is frequency-domain data representing the singing voice.

The SR inference unit 106 according to the present embodiment generates a high-resolution linear spectrogram by up-sampling the inferred Mel spectrogram using the SR transformation model. By applying the SR technique to the inferred Mel spectrogram, the SR inference unit 106 can generate singing voice data of improved quality in the frequency domain.

The SR transformation model according to the present embodiment is implemented as a deep learning-based deep neural network built on a plurality of convolutional layers. The SR transformation model may be trained in advance using a learning model. The structure of the learning model and its training process are described later.

The output unit 108 according to the present embodiment converts the linear spectrogram to generate a singing voice in audible form. The output unit 108 may generate the audible, time-domain singing voice from the frequency-domain linear spectrogram and provide it to the user.

FIG. 3 is a flowchart of a singing voice synthesis and modulation method according to an embodiment of the present invention.

The singing voice synthesis and modulation apparatus 100 according to an embodiment of the present invention obtains a user request designating a singer and a song, obtains a singer Mel spectrogram for the singer and text for the lyrics of the song, and obtains a pitch from MIDI data for the song (S300).

Here, the singing query, that is, the user request, designates a target singer from among a plurality of singers and, from among a plurality of songs, the song that the target singer is to sing. The MIDI data is MIDI data for an existing singing voice that represents the designated song.

The singing voice synthesis and modulation apparatus 100 may obtain a singing voice of the designated singer from a storage device and generate singer Mel spectrogram data from it. Alternatively, the singer Mel spectrogram data may be obtained directly from the storage device. The apparatus 100 may obtain text representing the lyrics of the designated song from the storage device. The apparatus 100 may also obtain MIDI data for the song from the storage device and extract the pitch contained in the MIDI data.

Using a pre-trained deep neural network-based inference model, the singing voice synthesis and modulation apparatus 100 generates a low-resolution inferred Mel spectrogram from the singer Mel spectrogram, the text, and the pitch (S302). More specifically, the inference model generates the singer ID (identity) features of the target singer, that is, the timbre and the singing style, from the singer Mel spectrogram; generates a formant mask from the text and the timbre; generates a pitch skeleton from the pitch and the singing style; and generates the low-resolution inferred Mel spectrogram from the pitch skeleton and the formant mask.

The inference model according to the present embodiment is implemented as a deep learning-based deep neural network built on a plurality of convolutional layers. The inference model may be trained in advance using a learning model. The structure of the learning model and its training process are described later.

Using a pre-trained deep neural network-based SR transformation model, the singing voice synthesis and modulation apparatus 100 generates a high-resolution linear spectrogram by up-sampling the inferred Mel spectrogram (S304).

The singing voice synthesis and modulation apparatus 100 converts the linear spectrogram to generate a singing voice in audible form (S306). The apparatus 100 may generate the audible, time-domain singing voice from the frequency-domain linear spectrogram and provide it to the user.

The process (S302) by which the inference model generates the inferred Mel spectrogram is described in detail below.

The inference model extracts the singing style and the timbre of the singer from the singer Mel spectrogram (S320). From the singer Mel spectrogram, the inference model generates the timbre and the singing style as a singer ID embedding, a time-invariant global feature from which variation over time has been removed.

The inference model extracts a text feature from the text (S322). The extracted text feature represents the pronunciation of the lyrics.

The inference model extracts an audio feature auto-regressively, starting from an initial condition (S324). Zero may be used as the initial condition, and the inferred Mel spectrogram of the previous time step is fed back and used to extract the audio feature.

The inference model extracts a pitch feature from the pitch (S326).

The inference model concatenates the result of attention between the text feature and the audio feature onto the audio feature, thereby generating a synchronized audio feature in which the text feature and the audio feature are synchronized with each other (S328).

The inference model generates, from the text feature, a formant mask conditioned on the timbre (S330). Because the timbre component of the singer ID embedding, which is a global feature, is applied as the condition and combined, a formant mask, a feature related to pronunciation and timbre, can be generated from the text feature.

The inference model generates, from the synchronized audio feature, a pitch skeleton conditioned on the singing style and the pitch feature (S332). Because the singing style component of the singer ID embedding, which is a global feature, and the pitch feature, which is a local feature, are applied as conditions and combined, a pitch skeleton, a feature related to pitch and style, can be generated from the synchronized audio feature.

The inference model generates the inferred Mel spectrogram by masking the pitch skeleton with the formant mask (S334). Here, masking means multiplying the formant mask and the pitch skeleton element by element.

As described above, the present embodiment provides a singing voice synthesis/modulation apparatus and method that use a deep neural network-based inference model to generate a timbre and a singing style for a singer, generate a formant conditioned on the timbre and a pitch skeleton conditioned on the singing style, and use a deep neural network-based SR transformation model to generate a singing voice for a song from the pitch skeleton masked with the formant. This has the effect of enabling singing voice synthesis/modulation based on an end-to-end network.

As described above, the singer ID embedding output by the singer ID encoder 121 independently performs two functions: conditioning on the timbre when it is combined with the text feature, and conditioning on the singing style when it is combined with the audio feature. Accordingly, by applying the inference model twice (or by using two inference models) to generate a singer ID embedding for each of two singers, and by relying on the independent conditioning functions of each singer ID embedding, a singing voice can be generated in which the timbre of one singer and the singing style of another singer are independently cross-applied.

FIG. 4 is a block diagram of a singing voice synthesis and modulation apparatus according to another embodiment of the present invention.

In another embodiment of the present invention, the singing voice synthesis and modulation apparatus 100 obtains a user request designating the timbre of a first singer, the singing style of a second singer, and a song; uses the deep neural network-based inference model to generate the timbre of the first singer and the singing style of the second singer; independently generates a formant conditioned on the timbre of the first singer and a pitch skeleton conditioned on the singing style of the second singer; and uses the deep neural network-based SR transformation model to generate a singing voice for the song from the pitch skeleton masked with the formant. The singing voice synthesis and modulation apparatus 100 includes all or some of an input unit 102, a Mel modulator 104, a masking unit 401, an SR inference unit 106, and an output unit 108. The components included in the apparatus 100 according to the present embodiment are not necessarily limited to these. For example, the apparatus 100 may additionally include a training unit (not shown) for training the inference model and the SR transformation model, or may be implemented so as to interwork with an external training unit.

The input unit 102 obtains a user request designating the timbre of the first singer, the singing style of the second singer, and a song; obtains a first singer Mel spectrogram for the first singer, a second singer Mel spectrogram for the second singer, and text for the lyrics of the song; and obtains a pitch from MIDI data for the song.

Here, the user request designates, from among a plurality of singers, a first singer for the timbre and a second singer for the singing style, and, from among a plurality of songs, the song that the target singer is to sing. The MIDI data is MIDI data for an existing singing voice that represents the designated song.

Using the deep neural network-based inference model, the Mel modulator 104 generates, from the first singer Mel spectrogram, the singer ID features of the first singer, that is, a first timbre and a first singing style; generates a first formant mask from the text and the first timbre; and generates a first pitch skeleton from the pitch and the first singing style. The Mel modulator 104 also generates a low-resolution first inferred Mel spectrogram by masking the first pitch skeleton with the first formant mask, and thereby operates the inference model auto-regressively to update the state values for the first singer. Here, the state values for the first singer refer to the parameters of the inference model together with the inputs on which computation proceeds, the outputs of intermediate layers, the final output, and the like.

In addition, using the deep neural network-based inference model, the Mel modulator 104 generates, from the second singer Mel spectrogram, the singer ID features of the second singer, that is, a second timbre and a second singing style; generates a second formant mask from the text and the second timbre; and generates a second pitch skeleton from the pitch and the second singing style. The Mel modulator 104 also generates a low-resolution second inferred Mel spectrogram by masking the second pitch skeleton with the second formant mask, and thereby operates the inference model auto-regressively to update the state values for the second singer. Here, the state values for the second singer refer to the parameters of the inference model together with the inputs on which computation proceeds, the outputs of intermediate layers, the final output, and the like.

The masking unit 401 generates a low-resolution third inferred Mel spectrogram by masking the second pitch skeleton with the first formant mask. Here, masking means multiplying the first formant mask and the second pitch skeleton element by element. The generated third inferred Mel spectrogram is therefore a Mel spectrogram for a singing voice that combines the timbre of the first singer with the singing style of the second singer.
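The cross-singer combination amounts to choosing which singer supplies each factor of the product. In the sketch below the per-singer decoder outputs are stored in plain dictionaries with toy values; the dictionary layout and function names are assumptions made for illustration.

```python
def elementwise_product(a, b):
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def cross_singer_mel(timbre_singer, style_singer):
    """Mask the style singer's pitch skeleton with the timbre singer's formant mask."""
    return elementwise_product(timbre_singer["formant_mask"],
                               style_singer["pitch_skeleton"])

# Toy decoder outputs for two singers (1 frame x 2 mel bins each).
first = {"formant_mask": [[0.5, 1.0]], "pitch_skeleton": [[0.2, 0.4]]}
second = {"formant_mask": [[0.25, 0.5]], "pitch_skeleton": [[0.8, 0.6]]}

third_mel = cross_singer_mel(first, second)      # first singer's timbre, second singer's style
same_singer_mel = cross_singer_mel(first, first) # degenerates to the single-singer case
```

Passing the same singer for both roles reproduces that singer's own inferred Mel spectrogram, which matches the single-singer behavior of the first embodiment.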

Although FIG. 4 illustrates a single Mel modulator used sequentially, in another embodiment of the present invention the singing voice synthesis and modulation apparatus 100 may use two Mel modulators to generate the formant mask of the first singer and the pitch skeleton of the second singer in parallel.

In FIG. 4, when the first singer and the second singer are the same, the singing voice synthesis and modulation apparatus 100 can generate, as in the configuration shown in FIG. 1, a Mel spectrogram for a singing voice that combines the timbre and the singing style of a single singer.

The SR inference unit 106 generates a high-resolution linear spectrogram by up-sampling the third inferred Mel spectrogram using the SR transformation model.

The SR transformation model according to the present embodiment is implemented as a deep learning-based deep neural network built on a plurality of convolutional layers. The SR transformation model may be trained in advance using a learning model. The structure of the learning model and its training process are described later.

The output unit 108 according to the present embodiment converts the linear spectrogram to generate a singing voice in audible form. The output unit 108 may generate the audible, time-domain singing voice from the frequency-domain linear spectrogram and provide it to the user.

FIG. 5 is a flowchart of a singing voice synthesis and modulation method according to another embodiment of the present invention.

The singing voice synthesis and modulation apparatus 100 obtains a user request designating the timbre of the first singer, the singing style of the second singer, and a song; obtains a first singer Mel spectrogram for the first singer, a second singer Mel spectrogram for the second singer, and text for the lyrics of the song; and obtains a pitch from MIDI data for the song (S500).

Using the pre-trained deep neural network-based inference model, the singing voice synthesis and modulation apparatus 100 generates, from the first singer Mel spectrogram, the singer ID features of the first singer, that is, a first timbre and a first singing style; generates a first formant mask from the text and the first timbre; and generates a first pitch skeleton from the pitch and the first singing style (S502). The apparatus 100 also generates a low-resolution first inferred Mel spectrogram by masking the first pitch skeleton with the first formant mask, and thereby operates the inference model auto-regressively to update the state values for the first singer. Here, the state values for the first singer refer to the parameters of the inference model together with the inputs on which computation proceeds, the outputs of intermediate layers, the final output, and the like.

Using the pre-trained deep neural network-based inference model, the singing voice synthesis and modulation apparatus 100 generates, from the second singer Mel spectrogram, the singer ID features of the second singer, that is, a second timbre and a second singing style; generates a second formant mask from the text and the second timbre; and generates a second pitch skeleton from the pitch and the second singing style (S504). The apparatus 100 also generates a low-resolution second inferred Mel spectrogram by masking the second pitch skeleton with the second formant mask, and thereby operates the inference model auto-regressively to update the state values for the second singer. Here, the state values for the second singer refer to the parameters of the inference model together with the inputs on which computation proceeds, the outputs of intermediate layers, the final output, and the like.

The singing voice synthesis and modulation apparatus 100 generates a low-resolution third inferred Mel spectrogram by masking the second pitch skeleton with the first formant mask (S506). Here, masking means multiplying the first formant mask and the second pitch skeleton element by element. The generated third inferred Mel spectrogram is therefore a Mel spectrogram for a singing voice that combines the timbre of the first singer with the singing style of the second singer.

When the first singer and the second singer are the same, the singing voice synthesis and modulation apparatus 100 can generate a Mel spectrogram for a singing voice that combines the timbre and the singing style of a single singer.

Using the pre-trained SR transformation model, the singing voice synthesis and modulation apparatus 100 generates a high-resolution linear spectrogram by up-sampling the inferred Mel spectrogram (S508).

The SR inference model according to the present embodiment is implemented as a deep learning-based deep neural network built on a plurality of convolutional layers. The SR inference model may be trained in advance using a learning model. The structure of the learning model and its training process are described later.

The singing voice synthesis and modulation apparatus 100 converts the linear spectrogram to generate a singing voice in audible form (S510). The apparatus 100 may generate the audible, time-domain singing voice from the frequency-domain linear spectrogram and provide it to the user.

As described above, the present embodiment provides a singing voice synthesis/modulation apparatus and method that obtain a user request designating the timbre of a first singer, the singing style of a second singer, and a song; use a deep neural network-based inference model to generate the timbre of the first singer and the singing style of the second singer; and independently generate a formant conditioned on the timbre of the first singer and a pitch skeleton conditioned on the singing style of the second singer. This has the effect of enabling singing voice synthesis/modulation in which timbre and singing style are independently cross-controlled.

As described above, the singing voice synthesis and modulation apparatus 100 according to the present embodiment may include a deep learning-based learning model and use it to carry out the training process for the inference model and the SR transformation model. Such a learning model may be a model that uses the deep neural network-based inference model to generate a timbre and a singing style for a target singer and to independently generate a formant conditioned on the timbre and a pitch skeleton conditioned on the singing style; uses the deep neural network-based SR transformation model to generate a singing voice for a song from the pitch skeleton masked with the formant; and uses a deep neural network-based discriminator trained to distinguish the pair of the inferred Mel spectrogram and the linear spectrogram from the pair of a ground truth (GT) Mel spectrogram and a GT linear spectrogram.

In the present embodiment, the combination of the Mel modulator 104 and the SR inference unit 106 of the singing voice synthesis and modulation apparatus 100 is used as a generator, and the inference model and the SR transformation model may be trained using a learning model 600 based on generative adversarial networks (GAN) that includes the generator and a discriminator. By adopting the GAN-based learning model 600, the present embodiment can train the inference model and the SR transformation model of the apparatus 100 to generate a more realistic singing voice for the target singer.
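The adversarial roles of the generator and the discriminator can be made concrete with a small sketch. The text does not state which loss is used; the standard binary cross-entropy GAN objective is assumed here only for illustration. `d_real` and `d_fake` are hypothetical discriminator outputs (probabilities of "real") for the ground-truth spectrogram pair and the generated spectrogram pair, respectively.

```python
import math

def binary_cross_entropy(prob, target):
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)   # clamp to keep log() finite
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """The discriminator is rewarded for scoring the GT pair as real and the generated pair as fake."""
    return binary_cross_entropy(d_real, 1.0) + binary_cross_entropy(d_fake, 0.0)

def generator_adversarial_loss(d_fake):
    """The generator is rewarded when its (Mel, linear) pair is scored as real."""
    return binary_cross_entropy(d_fake, 1.0)

confident = discriminator_loss(d_real=0.9, d_fake=0.1)
undecided = discriminator_loss(d_real=0.5, d_fake=0.5)
```

Under this assumed objective, the discriminator loss falls as it separates the two pairs more cleanly, while the generator loss falls as its output pair becomes harder to tell from the ground truth.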

FIG. 6 is a block diagram of a learning model according to an embodiment of the present invention.

In an embodiment according to the present invention, the inference model and the SR transformation model of the singing voice synthesis and modulation apparatus 100 are trained using the GAN-based learning model 600. The learning model 600 includes all or some of an input unit 102, a generator 602 that includes the Mel modulator 104 and the SR inference unit 106, and a discriminator 604. The components included in the learning model 600 according to the present embodiment are not necessarily limited to these. For example, the learning model 600 may additionally include a training unit (not shown) for training the inference model and the SR transformation model, or may be implemented so as to interwork with an external training unit.

For a plurality of singers and a plurality of songs used for learning, the input unit 102 obtains a singer Mel spectrogram for each singer, text for the lyrics of each song, and an audio Mel spectrogram for each song, and obtains a pitch from MIDI data for each song.

The MIDI data is MIDI data for a pre-existing singing voice that represents each of the plurality of songs. Accordingly, any MIDI data capable of expressing the pitches that make up a song may be used.
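As a rough illustration of how a frame-level pitch sequence might be derived from MIDI data, the sketch below expands note events into one MIDI note number per frame and converts note numbers to frequency with the standard equal-temperament relation. The `(onset_sec, offset_sec, note_number)` tuple format, the helper names, and the frame rate are assumptions made for illustration; the source does not specify how the MIDI data is parsed.

```python
# Hypothetical sketch: the note-event format and helper names are assumptions,
# not taken from the source document.

def midi_to_hz(note_number: int) -> float:
    """Standard equal-temperament conversion (A4 = MIDI note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note_number - 69) / 12.0)


def notes_to_frame_pitch(notes, num_frames, frames_per_second):
    """Expand (onset_sec, offset_sec, note_number) events into one value per frame.

    Frames not covered by any note stay at 0, which is treated here as silence.
    """
    pitch = [0] * num_frames
    for onset, offset, note in notes:
        start = int(round(onset * frames_per_second))
        end = min(num_frames, int(round(offset * frames_per_second)))
        for t in range(start, end):
            pitch[t] = note
    return pitch


# C4 followed by E4, half a second each, at an assumed 10 frames per second.
notes = [(0.0, 0.5, 60), (0.5, 1.0, 64)]
frame_pitch = notes_to_frame_pitch(notes, num_frames=10, frames_per_second=10)
```

Keeping the pitch as note numbers per frame gives a sequence aligned with the spectrogram frames; whether the pitch encoder consumes note numbers or frequencies is not stated in the source.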

In this embodiment, a low-resolution Mel spectrogram is used to represent a singer's singing voice so as to adequately express the characteristics of the singing voice while reducing the complexity of the inference model, but the embodiment is not necessarily limited to this. Accordingly, any form of low-resolution data capable of expressing the characteristics of a singing voice, such as frequency-domain or time-domain data generated using another signal processing method, may be used.

The singer Mel spectrogram for learning is used as a seed that represents the characteristics of the singer's voice. The singer Mel spectrogram may be generated from a short segment of singing voice. For example, if 12 seconds of singing voice is sampled at 22.05 kHz, the window and hop sizes are each set to 1,024 samples, and an 80-dimensional Mel spectrogram is generated, a Mel spectrogram of approximately 256 frames is obtained. The singer Mel spectrograms obtained in this way for the plurality of singers may be applied to the learning model 600 at random while the singing voice for a specified song is synthesized/modulated.
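The frame count in this example can be checked with a few lines of arithmetic. The sketch below assumes that only full, unpadded windows are counted; the source does not state a padding convention, and the function name is illustrative.

```python
# Illustrative arithmetic only: the no-padding framing convention is an assumption.

def frame_count(duration_sec: float, sample_rate: int, window: int, hop: int) -> int:
    """Number of full analysis windows that fit in the signal (no padding)."""
    num_samples = int(duration_sec * sample_rate)
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


# 12 s at 22.05 kHz with window = hop = 1,024 samples, as in the example above.
frames = frame_count(12.0, 22050, window=1024, hop=1024)
```

Under this convention the result is 258 frames, consistent with the "approximately 256 frames" stated in the text; a different padding convention shifts the count by a frame or so.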

The audio Mel spectrogram for learning may be obtained from a pre-existing singing voice. Accordingly, when the pre-existing singing voice is the singing voice of a singer used for learning, some segments of the pre-existing singing voice may be used to generate the singer Mel spectrogram for learning.

The input unit 102 may obtain the singing voice of each of the plurality of singers from a storage device (not shown) and generate the singer Mel spectrogram from it. Alternatively, the singer Mel spectrogram may be obtained directly from the storage device. The input unit 102 may obtain text representing the lyrics of each of the plurality of songs from the storage device. The input unit 102 may obtain a pre-existing singing voice for each of the plurality of songs from the storage device and generate the audio Mel spectrogram from it. Alternatively, the audio Mel spectrogram may be obtained directly from the storage device. In addition, the input unit 102 may obtain MIDI data for each of the plurality of songs from the storage device and extract the pitch contained in the MIDI data.

The Mel modulator 104 uses the inference model to generate a low-resolution inferred Mel spectrogram from the singer Mel spectrogram, the text, the audio Mel spectrogram, and the pitch. More specifically, the inference model generates the singer identity (ID) characteristics of the target singer, namely the timbre and the singing style, from the singer Mel spectrogram; generates a formant mask from the text and the timbre; generates a pitch skeleton from the audio Mel spectrogram, the pitch, and the singing style; and generates the inferred Mel spectrogram from the pitch skeleton and the formant mask. The inference model according to this embodiment is implemented as a deep-learning-based deep neural network built on a plurality of convolutional layers.

Meanwhile, the SR inference unit 106 uses the SR transformation model to up-sample the inferred Mel spectrogram, thereby generating a high-resolution linear spectrogram. The SR transformation model according to this embodiment is implemented as a deep-learning-based deep neural network built on a plurality of convolutional layers.

Accordingly, the generator 602 of the GAN-based learning model 600 produces the inferred Mel spectrogram as an intermediate output and the linear spectrogram as its final output.

Except for the Mel spectrogram encoder 123 included in the Mel modulator 104, the detailed structure and operation of the Mel modulator 104 and the SR inference unit 106 have already been described for the singing voice synthesis and modulation apparatus 100, so further description is omitted.

Unlike the singing voice synthesis and modulation apparatus 100, the learning model uses the audio Mel spectrogram of a pre-existing singing voice to train the inference model. Accordingly, the Mel spectrogram encoder 123 extracts audio characteristics from the audio Mel spectrogram. As in the apparatus 100, the inferred Mel spectrogram of the previous time step is fed back and concatenated with the audio Mel spectrogram, so that an auto-regressive operation is performed. The Mel modulator 104 therefore in effect modulates the audio Mel spectrogram using the singer Mel spectrogram.

The discriminator 604 according to this embodiment distinguishes the linear spectrogram, which is the final output of the generator 602, from the ground-truth (GT) linear spectrogram. Under the assumption that the probability distribution of the inferred Mel spectrogram is similar to the probability distribution p(M) of the GT inferred Mel spectrogram, the inference model and the SR transformation model can be trained jointly. Accordingly, the discriminator 604 can distinguish the pair consisting of the inferred Mel spectrogram and the linear spectrogram from the pair consisting of the GT inferred Mel spectrogram M and the GT linear spectrogram S. To distinguish between the pairs, the discriminator 604 may generate a fake output by conditioning the inferred Mel spectrogram and adding it to the linear spectrogram, and may generate a true output by conditioning the GT inferred Mel spectrogram and adding it to the GT linear spectrogram.

The discriminator 604 is implemented as a deep-learning-based deep neural network built on a plurality of convolutional layers.

When training the generator 602 and the discriminator 604, the training unit may use various forms of distance-metric-based losses in addition to the loss based on the GAN structure.

The training unit according to this embodiment uses an adversarial loss as shown in Equation 2 (see Non-Patent Document 4).

Here, Equation 2 defines the GAN loss of the generator 602 (G) and the GAN loss of the discriminator 604 (D). The function f is a scalar function, one example of which is the sigmoid function.

To train the inference model, the training unit uses the inference loss shown in Equation 3.

Here, the first term is a loss based on a distance metric between the inferred Mel spectrogram and the GT inferred Mel spectrogram, and a further term is the guided attention loss (see Non-Patent Document 3). The last term is a loss based on a distance metric between the time increment of the inferred Mel spectrogram and the time increment of the GT inferred Mel spectrogram.
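Because Equation 3 itself is not reproduced in this text, the following is only a minimal sketch of a loss with the three ingredients described: a distance between inferred and GT Mel spectrograms, a guided attention term, and a distance between their time increments. The choice of the L1 metric, the equal weighting of the terms, the commonly used diagonal-weighting form of guided attention loss with `g = 0.2`, and all function names are assumptions; whether they match Equation 3 or Non-Patent Document 3 is not confirmed by the source.

```python
import math

# Hypothetical sketch: metric choice, weighting, and names are assumptions.


def l1_distance(a, b):
    """Mean absolute difference between two equally shaped T x D matrices."""
    total = sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    count = sum(len(row) for row in a)
    return total / count


def time_increment(m):
    """Frame-to-frame difference along time: delta[t] = m[t + 1] - m[t]."""
    return [[nxt - cur for cur, nxt in zip(m[t], m[t + 1])] for t in range(len(m) - 1)]


def guided_attention_loss(attention, g=0.2):
    """Penalise attention weights that lie far from the diagonal."""
    n_rows, n_cols = len(attention), len(attention[0])
    total = 0.0
    for n in range(n_rows):
        for t in range(n_cols):
            w = 1.0 - math.exp(-((n / n_rows - t / n_cols) ** 2) / (2.0 * g * g))
            total += attention[n][t] * w
    return total / (n_rows * n_cols)


def inference_loss(mel_hat, mel_gt, attention):
    """Sum of the spectrogram term, the guided attention term, and the increment term."""
    return (
        l1_distance(mel_hat, mel_gt)
        + guided_attention_loss(attention)
        + l1_distance(time_increment(mel_hat), time_increment(mel_gt))
    )


mel_gt = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mel_shifted = [[v + 1.0 for v in row] for row in mel_gt]   # constant offset of 1
diagonal_attention = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

A constant offset between the two spectrograms is penalised only by the first term, since the frame-to-frame increments are identical; the increment term instead penalises differences in how the spectrogram changes over time.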

To train the SR transformation model, the training unit uses an SR loss based on a distance metric between the linear spectrogram and the GT linear spectrogram.

Here, any distance metric (or combination of metrics) capable of expressing the difference between the two items being compared, such as cross entropy, the L1 metric, or the L2 metric, may be used.

Combining the losses above, the total loss of the generator 602 and the total loss of the discriminator 604 of the GAN-based learning model can be expressed by Equations 4 and 5.

Equations 4 and 5 define the total loss of the generator 602 and the total loss of the discriminator 604, together with two hyperparameters related to the losses.

The training unit according to this embodiment trains the generator 602 and the discriminator 604 by updating their parameters in the direction that reduces the total loss.

The parameters of the generator 602 and the discriminator 604 may also be updated in the direction that reduces all or some of the loss terms included in the total loss.

Furthermore, the parameters of at least one of the generator 602 and the discriminator 604 may be updated in the direction that reduces all or some of the loss terms included in the total loss.

Training GAN-based deep learning models is known to be difficult. In particular, stable training can be hard to achieve in the early stage of learning. The training unit according to this embodiment can therefore increase the learning efficiency of the learning model 600 by changing the setting of each hyperparameter. In the early stage of training, the training unit may set the hyperparameters of some terms of the total loss expressed in Equations 4 and 5 to zero. For example, only the synthesis loss term may be active, while the adversarial loss and SR loss terms are inactive.

In the later stage, once the operation of the inference model within the generator 602 has stabilized, the training unit may set the hyperparameters that were zero to non-zero values, so that the parameters of the generator 602 and the discriminator 604 are updated using all of the loss terms.
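The two-stage use of the hyperparameters can be sketched as a simple weight schedule. In this sketch the two hyperparameters are assumed to weight the adversarial and SR terms of the generator's total loss, the total loss is assumed to be a weighted sum, and the switch is driven by a fixed step count `warmup_steps`; the source only says the switch happens once the inference model is stable, so all of these details and names are hypothetical.

```python
# Hypothetical sketch: the weighted-sum form, the step-based switch, and all names
# are assumptions for illustration.


def loss_weights(step, warmup_steps, adv_weight=1.0, sr_weight=1.0):
    """Return (adversarial, SR) weights: zero during warm-up, non-zero afterwards."""
    if step < warmup_steps:
        return 0.0, 0.0
    return adv_weight, sr_weight


def generator_total_loss(inference_loss, adv_loss, sr_loss, step, warmup_steps):
    """Weighted sum in which only the inference term is active during warm-up."""
    w_adv, w_sr = loss_weights(step, warmup_steps)
    return inference_loss + w_adv * adv_loss + w_sr * sr_loss


early = generator_total_loss(0.5, 2.0, 1.0, step=0, warmup_steps=100)
late = generator_total_loss(0.5, 2.0, 1.0, step=100, warmup_steps=100)
```

In practice the switch could equally be triggered by a validation criterion rather than a step count; the fixed threshold is used here only to keep the example small.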

FIG. 7 is a flowchart of a learning method for the learning model according to an embodiment of the present invention.

For a plurality of singers and a plurality of songs, the training unit of the learning model 600 obtains a singer Mel spectrogram for each singer, text for the lyrics of each song, and an audio Mel spectrogram for each song, and obtains a pitch from MIDI data for each song (S700). The MIDI data is MIDI data for a pre-existing singing voice that represents each of the plurality of songs.

The training unit may obtain the singing voice of each of the plurality of singers from the storage device and generate the singer Mel spectrogram from it. Alternatively, the singer Mel spectrogram may be obtained directly from the storage device. The training unit may obtain text representing the lyrics of each of the plurality of songs from the storage device. The training unit may obtain a pre-existing singing voice for each of the plurality of songs from the storage device and generate the audio Mel spectrogram from it. Alternatively, the audio Mel spectrogram may be obtained directly from the storage device. In addition, the training unit may obtain MIDI data for each of the plurality of songs from the storage device and extract the pitch contained in the MIDI data.

The training unit uses the deep neural network-based inference model to generate a low-resolution inferred Mel spectrogram from the singer Mel spectrogram, the text, the audio Mel spectrogram, and the pitch (S702). More specifically, the inference model generates the singer identity (ID) characteristics of the target singer, namely the timbre and the singing style, from the singer Mel spectrogram; generates a formant mask from the text and the timbre; generates a pitch skeleton from the audio Mel spectrogram, the pitch, and the singing style; and generates the inferred Mel spectrogram from the pitch skeleton and the formant mask.

The inference model according to this embodiment is implemented as a deep-learning-based deep neural network built on a plurality of convolutional layers.

The training unit uses the deep neural network-based SR transformation model to up-sample the inferred Mel spectrogram, thereby generating a high-resolution linear spectrogram (S704).

The SR transformation model according to this embodiment is implemented as a deep-learning-based deep neural network built on a plurality of convolutional layers.

The training unit uses the deep neural network-based discriminator to distinguish the pair consisting of the inferred Mel spectrogram and the linear spectrogram from the pair consisting of the GT inferred Mel spectrogram and the GT linear spectrogram (S706).

The discriminator 604 according to this embodiment is implemented as a deep-learning-based deep neural network built on a plurality of convolutional layers.

The training unit calculates the total loss using the outputs of the inference model, the SR transformation model, and the discriminator (S708).

Each loss term making up the total loss has already been described, so further detailed description is omitted.

The training unit updates the parameters of at least one of the inference model, the SR transformation model, and the discriminator in the direction that reduces all or some of the loss terms included in the total loss (S710).

The process by which the inference model generates the inferred Mel spectrogram (S702) is described in detail below.

The inference model extracts the singing style and the timbre of the singer from the singer Mel spectrogram (S720). From the singer Mel spectrogram, the inference model generates the timbre and the singing style as a singer ID embedding, which is a time-invariant global characteristic from which variation over time has been removed.
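One simple way to obtain a time-invariant vector from a sequence of frame-level features is to average over the time axis. The sketch below illustrates only that idea; the source does not state how the singer ID encoder removes variation over time, so the use of mean pooling and the function name are assumptions.

```python
# Hypothetical sketch: mean pooling over time is an assumed mechanism, not the
# encoder described in the source.


def global_average_pool(frames):
    """Average a T x D feature sequence over time into one D-dimensional vector."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


features = [[1.0, 4.0], [3.0, 0.0]]          # T = 2 frames, D = 2 channels
embedding = global_average_pool(features)     # a single vector regardless of T
```

The pooled vector does not depend on the order of the frames, which is the sense in which such an embedding is a global rather than a local characteristic.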

The inference model extracts text characteristics from the text (S722). The extracted text characteristics represent the pronunciation of the lyrics.

The inference model extracts audio characteristics from the audio Mel spectrogram (S724). The inferred Mel spectrogram of the previous time step is fed back and concatenated with the audio Mel spectrogram, so that an auto-regressive operation is performed. The inference model therefore in effect modulates the audio Mel spectrogram using the singer Mel spectrogram.

The inference model extracts pitch characteristics from the pitch (S726).

The inference model concatenates the attention result between the text characteristics and the audio characteristics to the audio characteristics, thereby generating synchronized audio characteristics in which the text characteristics and the audio characteristics are aligned in time (S728).
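The attention-then-concatenate step can be sketched as follows. Scaled dot-product attention with audio frames as queries and text features as keys and values is assumed here purely for illustration; the source does not specify the attention mechanism, and the function names are hypothetical.

```python
import math

# Hypothetical sketch: scaled dot-product attention is an assumed choice.


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def synchronized_audio_features(audio, text):
    """audio: T x D, text: N x D. Returns T x 2D.

    Each audio frame attends over the text features, and the resulting text
    context vector is concatenated to that audio frame.
    """
    out = []
    for query in audio:
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, key) / scale for key in text])
        context = [sum(w * key[d] for w, key in zip(weights, text)) for d in range(len(query))]
        out.append(list(query) + context)
    return out


audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3 frames
text = [[0.5, 0.25], [0.1, 0.9]]               # N = 2 text positions
synced = synchronized_audio_features(audio, text)
```

The output keeps the time resolution of the audio features (T rows) while carrying, for each frame, the text information the attention selected for it.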

The inference model generates a formant mask conditioned on the timbre from the text characteristics (S730). By conditioning on and combining the timbre, which is part of the global singer ID embedding, a formant mask, which is a characteristic related to pronunciation and timbre, can be generated from the text characteristics.
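The claims of this document describe the conditioning for the formant mask as summing convolved inputs, passing the sum through a sigmoid and through a ReLU, and multiplying the two results element-wise. The sketch below shows only that gating arithmetic: the convolutions are omitted, the inputs are assumed to be already projected to a common dimension, and the time-invariant timbre vector is assumed to be broadcast over the time axis. The function names are hypothetical.

```python
import math

# Hypothetical sketch: convolutions are omitted and broadcasting of the timbre
# vector over time is an assumption.


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def gated_condition(projected_text, projected_timbre):
    """projected_text: T x D, projected_timbre: D (broadcast over time).

    Returns sigmoid(sum) * relu(sum), computed element-wise.
    """
    out = []
    for text_row in projected_text:
        summed = [a + b for a, b in zip(text_row, projected_timbre)]
        out.append([sigmoid(s) * relu(s) for s in summed])
    return out


gated = gated_condition([[0.0, -1.0, 1.5]], [0.0, 0.0, 0.5])
```

Because the ReLU branch is zero for non-positive sums, the gate passes only positive activations, scaled by a sigmoid factor between 0.5 and 1.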

The inference model generates a pitch skeleton conditioned on the singing style and the pitch characteristics from the synchronized audio characteristics (S732). By conditioning on and combining the singing style, which is part of the global singer ID embedding, and the pitch characteristics, which are local, a pitch skeleton, which is a characteristic related to pitch and style, can be generated from the synchronized audio characteristics.

The inference model generates the inferred Mel spectrogram by masking the pitch skeleton with the formant mask (S734). Here, masking means multiplying the formant mask and the pitch skeleton element by element.
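The masking step is an element-wise product and can be sketched in a few lines. The matrices below are toy values, and the function name and the shape check are illustrative additions rather than part of the source.

```python
# Minimal sketch of element-wise masking; names and toy values are illustrative.


def apply_formant_mask(pitch_skeleton, formant_mask):
    """Multiply two equally shaped T x D matrices element by element."""
    if len(pitch_skeleton) != len(formant_mask) or any(
        len(p_row) != len(m_row) for p_row, m_row in zip(pitch_skeleton, formant_mask)
    ):
        raise ValueError("pitch skeleton and formant mask must have the same shape")
    return [
        [p * m for p, m in zip(p_row, m_row)]
        for p_row, m_row in zip(pitch_skeleton, formant_mask)
    ]


pitch_skeleton = [[2.0, 4.0], [1.0, 3.0]]
formant_mask = [[0.5, 0.0], [1.0, 2.0]]
inferred_mel = apply_formant_mask(pitch_skeleton, formant_mask)
```

A mask value of zero removes the corresponding bin of the pitch skeleton, while larger values emphasise it.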

The device (not shown) on which the singing voice synthesis and modulation apparatus 100 according to this embodiment is mounted may be a programmable computer and includes at least one communication interface capable of connecting to a server (not shown).

The training of the inference model and the SR transformation model described above may be performed on the device on which the singing voice synthesis and modulation apparatus 100 is mounted, using the computing power of that device.

Alternatively, the training of the inference model and the SR transformation model described above may be performed on the server. The training unit of the server may train deep learning models that have the same structure as the inference model and the SR transformation model, which are components of the singing voice synthesis and modulation apparatus 100 mounted on the device. Using the communication interface connected to the device, the server transfers the parameters of the trained deep learning models to the device, and the apparatus 100 uses the received parameters to set the parameters of the inference model and the SR transformation model. The parameters of the inference model and the SR transformation model may also be set when the device is shipped or when the apparatus 100 is mounted on the device.

As described above, this embodiment provides a singing voice synthesis/modulation apparatus and method that obtains a user request specifying a singer and a song, generates the timbre and singing style of the singer using a deep neural network-based inference model, generates a formant conditioned on the timbre and a pitch skeleton conditioned on the singing style, and generates a singing voice for the song from the pitch skeleton masked with the formant using a deep neural network-based SR transformation model. This enables natural singing voice synthesis/modulation for a plurality of singers.

Each flowchart according to this embodiment describes the processes as being executed sequentially, but the embodiment is not necessarily limited to this. In other words, the processes described in a flowchart may be reordered, or one or more processes may be executed in parallel, so the flowcharts are not limited to a chronological order.

Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation as one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special-purpose or general-purpose processor) coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) contain instructions for the programmable processor and are stored on a "computer-readable recording medium".

The computer-readable recording medium includes every kind of recording device in which data readable by a computer system is stored. Such a computer-readable recording medium may be a non-volatile or non-transitory medium such as a ROM, CD-ROM, magnetic tape, floppy disk, memory card, hard disk, magneto-optical disk, or storage device, and may further include a transitory medium such as a carrier wave (for example, transmission over the Internet) or a data transmission medium. The computer-readable recording medium may also be distributed over network-connected computer systems, so that computer-readable code is stored and executed in a distributed manner.

Various implementations of the systems and techniques described herein may be implemented by a programmable computer. Here, the computer includes a programmable processor, a data storage system (including volatile memory, non-volatile memory, another kind of storage system, or a combination of these), and at least one communication interface. For example, the programmable computer may be a server, a network device, a set-top box, an embedded device, a computer expansion module, a personal computer, a laptop, a personal digital assistant (PDA), a cloud computing system, or a mobile device.

The above description merely illustrates the technical idea of this embodiment, and a person of ordinary skill in the art to which this embodiment belongs will be able to make various modifications and variations without departing from its essential characteristics. The embodiments are therefore intended to explain, not to limit, the technical idea of this embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of this embodiment should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of rights of this embodiment.

100: singing voice synthesis and modulation apparatus
102: input unit 104: Mel modulator
106: SR inference unit 108: output unit
121: singer ID encoder 125: attention unit
131: formant mask decoder 132: pitch skeleton decoder
600: learning model
602: generator 604: discriminator

Claims

A method for singing voice synthesis and modulation performed by a computing device, the method comprising:
obtaining a user request specifying a timbre of a first singer, a singing style of a second singer, and a song, obtaining a first singer spectrogram for the first singer, a second singer spectrogram for the second singer, and text for lyrics of the song, and obtaining a pitch of the song from MIDI data for the song;
a first process of generating, using a pre-trained deep neural network-based inference model, a first timbre for the first singer from the first singer spectrogram, and generating a first formant mask from the text and the first timbre;
a second process of generating, using the inference model, a second singing style for the second singer from the second singer spectrogram, and generating a second pitch skeleton from the pitch and the second singing style; and
a process of generating a low-resolution third inferred spectrogram by masking the second pitch skeleton using the first formant mask.

The method of claim 1, further comprising:
generating a high-resolution linear spectrogram by up-sampling the third inferred spectrogram using a pre-trained deep neural network-based super-resolution (SR) transformation model; and
converting the linear spectrogram to generate a singing voice in audible form.

The method of claim 1, wherein each of the first process and the second process comprises:
extracting a singing style and a timbre for a singer from the singer spectrogram;
extracting text characteristics from the text;
auto-regressively extracting audio characteristics from initial conditions;
extracting a pitch characteristic from the pitch;
generating a synchronized audio characteristic, in which the text characteristics and the audio characteristics are synchronized, by concatenating an attention result between the text characteristics and the audio characteristics to the audio characteristics;
generating, from the text characteristics, a formant mask conditioned on the timbre;
generating, from the synchronized audio characteristic, a pitch skeleton conditioned on the singing style and the pitch characteristic; and
generating an inferred spectrogram by masking the pitch skeleton using the formant mask.
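The synchronization step, in which an attention result is concatenated to the audio characteristics, can be sketched as follows. The claim does not specify the attention type, so scaled dot-product attention is an assumption here, as are the function names `softmax` and `synchronize`, the toy feature values, and the requirement that text and audio features share the same dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def synchronize(audio_feats, text_feats):
    """Attend from each audio frame over all text frames, then concatenate.

    Assumption: scaled dot-product attention; audio frames act as queries,
    text frames act as both keys and values.
    """
    dim = len(text_feats[0])
    synced = []
    for query in audio_feats:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text_feats
        ]
        weights = softmax(scores)
        context = [
            sum(w * value[i] for w, value in zip(weights, text_feats))
            for i in range(dim)
        ]
        # Concatenate the attention result to the audio characteristic.
        synced.append(list(query) + context)
    return synced


# Hypothetical toy features: 3 audio frames and 2 text tokens, dimension 2.
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
synced = synchronize(audio_feats, text_feats)
```

Each output frame has twice the feature dimension of the input: the original audio characteristic followed by the text context aligned to it.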

The method of claim 3,
wherein generating the formant mask comprises a conditioning process of:
applying a convolution to each of the text characteristics and the timbre and summing the results to obtain a summation result; calculating a first result by applying a sigmoid activation function to the summation result and a second result by applying a rectified linear unit (ReLU) activation function to the summation result; and performing element-wise multiplication between the first result and the second result.
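The conditioning process recited above (convolve, sum, apply two activations, multiply element-wise) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the convolution is simplified to a kernel-size-1 (pointwise) convolution, both activations are assumed to act on the same summation result, the timbre is assumed to be a single vector shared by all frames, and the names `pointwise_conv` and `conditioned_gate` and all weights are hypothetical.

```python
import math


def pointwise_conv(frames, weights):
    """Kernel-size-1 convolution: one [out][in] matrix applied at every frame."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weights]
        for frame in frames
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def conditioned_gate(text_feats, timbre, w_text, w_timbre):
    """Return sigmoid(s) * relu(s), where s = conv(text) + conv(timbre)."""
    text_proj = pointwise_conv(text_feats, w_text)
    # Assumption: the timbre is one vector, broadcast over all frames.
    timbre_proj = pointwise_conv([timbre], w_timbre)[0]
    gated = []
    for frame in text_proj:
        summed = [t + c for t, c in zip(frame, timbre_proj)]
        gated.append([sigmoid(s) * relu(s) for s in summed])
    return gated


# Hypothetical toy inputs: 2 frames, feature dimension 2, identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, -1.0], [0.0, -2.0]]
timbre = [0.5, 0.5]
gated = conditioned_gate(text_feats, timbre, identity, identity)
```

With these toy values the summation results are [1.5, -0.5] and [0.5, -1.5], so the second channel of each frame is zeroed by the ReLU branch while the first channel is scaled by the sigmoid branch.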

The method of claim 3,
wherein generating the pitch skeleton comprises a conditioning process of:
applying a convolution to each of the text characteristic, the synchronized audio characteristic, and the timbre and summing the results to obtain a summation result; calculating a third result by applying a sigmoid activation function to the summation result and a fourth result by applying a ReLU activation function to the summation result; and performing element-wise multiplication between the third result and the fourth result.

A singing voice synthesis and modulation device comprising:
an input unit for obtaining a user request specifying the timbre of a first singer, the singing style of a second singer, and a song, obtaining a first singer spectrogram for the first singer and a second singer spectrogram for the second singer, obtaining text for the lyrics of the song, and obtaining a pitch of the song from MIDI data for the song;
a modulator for generating, using an inference model based on a pre-trained deep neural network, a first timbre for the first singer from the first singer spectrogram, generating a first formant mask from the text and the first timbre, generating a second singing style for the second singer from the second singer spectrogram using the inference model, and generating a second pitch skeleton from the pitch and the second singing style; and
a masking unit for generating a low-resolution third inferred spectrogram by masking the second pitch skeleton using the first formant mask.

The device of claim 6, further comprising:
an SR inference unit for generating a high-resolution linear spectrogram by up-sampling the third inferred spectrogram using a pre-trained deep neural network-based super-resolution (SR) transform model; and
an output unit for converting the linear spectrogram to generate a singing voice in audible form.

The device of claim 6, wherein the modulator comprises:
a singer ID encoder for extracting a singing style and a timbre for a singer from the singer spectrogram;
a text encoder for extracting text characteristics from the text;
a spectrogram encoder for auto-regressively extracting audio characteristics from initial conditions;
a pitch encoder for extracting a pitch characteristic from the pitch;
an attention unit for generating a synchronized audio characteristic, in which the text characteristics and the audio characteristics are synchronized, by concatenating an attention result between the text characteristics and the audio characteristics to the audio characteristics;
a formant mask decoder for generating, from the text characteristics, a formant mask conditioned on the timbre;
a pitch skeleton decoder for generating, from the synchronized audio characteristic, a pitch skeleton conditioned on the singing style and the pitch characteristic; and
a masking unit for generating an inferred spectrogram by masking the pitch skeleton using the formant mask.

A computer program stored in a computer-readable recording medium to execute each process of the singing voice synthesis and modulation method according to any one of claims 1 to 5.